Strategies for implementing operational metadata can improve the design and performance of any data warehouse or business intelligence environment.
Many data warehouses and analytical systems include metadata management as part of their architectural considerations, but they do not explore fully the strategic approaches to using operational metadata. Ignoring the opportunities that a focus on metadata management can give to data warehouse and analytical systems management and performance can lead to a reduction in design capabilities, and in the execution of the extraction / transformation / loading (ETL) processes. The omission of operational metadata increases costs for maintenance, data acquisition, and data management across the business intelligence activities, and hampers the data warehouse audit processes.
This article, part one, explores various techniques to implement operational metadata in the design of a data warehouse. Part two (insert link here) will describe strategies for using operational metadata by data warehouse administrators and for all data consumers. The techniques described in these articles originated in the book Building and Managing the Metadata Repository, written by David P. Marco (John H. Wiley and Sons, 2000)
It is common to define “metadata” as “data about data”. Many data management and other information technology professionals define “metadata” as “the context about the content”, implying that metadata gives meaning to raw facts. This broader definition enables a deeper understanding of the value of metadata, value that indicates that without metadata, data has little or no actual meaning.
Generally, metadata is classified into two types: business metadata, and technical metadata. In a data warehouse, metadata examples could include:
|Business terms and definitions for tables and columns
|Physical table and column names
|Subject area names
|Data mapping and transformation logic
|Query and report definitions
|Source systems information
|Foreign keys and indexes
|Data Steward information
|ETL process names
Many data warehouse architectures use a metadata repository to capture, store and provide access to metadata for business and technical users, to satisfy their requirements for contextual meaning to the data found in the data warehouse. Unfortunately, many products that are marketed as metadata repositories do not fulfill their promises to manage metadata properly across the spectrum of business and technical metadata, nor do they interface easily with the rest of the architecture needed for the data warehouse. As a result, many organizations suffer from a lack of effectively managed metadata for their decision support / business intelligence / analytics environments.
Organizations that use a metadata repository can add an additional layer of detail to their data warehouse and ETL processes by incorporating operational metadata into their architecture and design, providing increased capabilities in data acquisition and maintenance, along with opportunities for data consumer reconciliation and data warehouse audit functions.
Operational metadata offers a connection between the metadata repository and the data warehouse by adding physical database columns to the data warehouse tables, enabling easier use for business and technical consumers.
The main benefits of operational metadata can be summarized in these points:
- Operational meta data, unlike information stored in the meta data repository, is referenced at a row level of granularity in the data warehouse.
- Operational meta data provides a detailed row level explanation of actual information content in the data warehouse.
Incorporation of Operational Meta Data
A variety of formats can be used in data warehouse design to enhance the automation of its support and maintenance. Data warehouse requirements will vary, and the inclusion of specific metadata will be determined by the requirements, but certain operational metadata has been found to be useful for all data warehouse implementations.
- Current Flag Indicator
- Load Date
- Load Cycle Identifier
- Update Date
- Operational System(s) Identifier
- Active in Operational System Flag
- Confidence Level Indicator
Current Flag Indicator
The current flag indicator column is used to identify the latest version of a row in a table. It facilitates rapid identification of the latest version of a row, rather than performing date comparison calculations. This flag is especially useful for managing the processing of slowly changing dimensions (SCD), type 2, where history of a production record must be maintained. (Part 2 of this article will describe how to use this column in SCD 2 processing.) Link to that article here
Load date column is the most commonly used and best understood operational metadata field in data warehouse design. This attribute indicates the date and/or time when data instance (row) was loaded into the data warehouse, or in some business cases, when the data was extracted from the operational source system. This snapshot date is used to maintain the temporal integrity of the data in the warehouse since new information is added during each refresh cycle. The column can be referenced by warehouse administrators to identify candidate rows for archival or purge processing. Data consumers and data stewards can use this column to reconcile and audit information in the data warehouse against the operational source systems to confirm the validity of the data found in the data warehouse.
Load Cycle Identifier
The load cycle identifier is a column assigned during each load cycle to the data warehouse regardless of the refresh frequency. It is a sequential identifier, and it can be used to remove data from a particular load cycle run if data corruption or other data quality issues are discovered. Typically, the load cycle identifier is used in conjunction with a static reference table or a metadata repository table that describes other operational statistics about the load cycle.
Another extremely common operational metadata column is the update date. This column indicates when a row was last updated in the warehouse during a refresh cycle. This column, like load date, is used to maintain the temporal integrity of information in the data warehouse. It is used in almost all dimension table designs that implement SCD type 1, 2 or 3 processing methods to identify when the row was refreshed, and can serve as important metadata for data consumers and data stewards. The column, like load date, can be employed in administration activities such as archival / purge processing or reconciliation / audit by data consumers.
Operational System(s) Identifier
One of the most useful operational metadata attributes for the warehouse administrator and the data consumer is the operational system(s) identifier. This column is used to track the origination source or sources of an instance (row) in the data warehouse.
In cases where a row of data was integrated from more than one operational source system, a column value indicating the combination of these systems can be assigned. It can be used by data consumers who question the quality and/or validity of data in the warehouse to trace back information to the operational source system that furnished the information.
In certain cases, this column can be used by administrators to identify and remove corrupt data from a particular operational source system(s). Part 2 of this article will describe how to use this column in combination with other operational metadata fields to remove corrupt data loads, or loads where data may have quality issues. Link to that article here
Active Operational System Flag
This column is used to indicate if the production keys in a dimension table are still active in the originating operational system. The active operational system flag provides a variety of analytical alternatives to queries posed to the data warehouse. Data stewards and other analysts can use this metadata effectively to identify dormant data or data that should be constrained in reporting (e.g., out of bounds results, products no longer supported, previous customers, etc…).
Confidence Level Indicator
One of the more controversial operational metadata attributes is the confidence level indicator. This column is used to indicate how business rules or assumptions were applied during the ETL processes for a particular data instance (row). This field provides a way for the data consumer and data steward to indicate the credibility level of an instance (data row) based on the transformation processing action.
Often used to identify potential problems with data quality from operational source systems and to facilitate correcting these issues, each organization’s data warehouse team and data governance program will vary in the approach and implementation of this operational metadata attribute.
Too Much Operational Metadata?
Organizations should not incorporate operational metadata columns universally (to every table) and without a clear, justifiable need. In some cases, these attributes do not make a positive contribution to the architecture or design of the data warehouse and its ETL processes:
- For an aggregation table, use of a load cycle identifier and / or an operational system identifier would not be beneficial since the context of the operational metadata would be lost.
- For a fact table, the use of current flag indicator on a row would not be useful since the concept of slowly changing dimensions does not apply to a fact.
- In very simple cases, the identification of an operational system may provide little value to a data consumer or data steward due to a very limited number of sources with insignificant integration requirements.
Adding operational metadata allows for detailed, row level explanation of the facts found in the data warehouse. The inclusion of these types of attributes into the data warehouse (starting at the DW architecture) and ETL processes can reconcile data quality issues and enable enhanced performance capabilities. This reconciliation offers increased data integrity, and the performance improvement provides greater capability in executing data warehousing administration tasks. Data consumers and data stewards can become more confident about the data quality provided through the data warehouse when making decisions and performing analytical actions based on the data contained in the environment.
Part 2 of this article will illustrate strategies for use of these operational metadata fields for data warehouse administration activities, and for business analysis queries. Administrators and data consumers as well as data stewards will find these strategies for the use of operational metadata informative.