Knowing how to use operational metadata in a data warehouse, business intelligence or analytics environment can be extremely beneficial to a wide variety of users.
Many data warehouses and analytical systems include metadata management as part of their architectural considerations, but they do not explore fully the strategic approaches to using operational metadata. Ignoring the opportunities that a focus on metadata management can give to data warehouse and analytical systems management and performance can lead to a reduction in design capabilities, and in the execution of the extraction / transformation / loading (ETL) processes. The omission of operational metadata increases costs for maintenance, data acquisition and data management across the business intelligence activities, and hampers the data warehouse audit processes.
Part one, explored various techniques to implement operational metadata in the design of a data warehouse. This article, part two, will describe strategies for using operational metadata by data warehouse administrators and data consumers. The techniques described in these articles originated in the book Building and Managing the Metadata Repository, written by David P. Marco (John H. Wiley and Sons, 2000)
Operational Meta Data Column Assignment
Now that the data warehouse architect has realized the importance of operational metadata, the appropriate attributes have been incorporated into the design of the data warehouse data model and the ETL processes. Assigning the initial value to these columns can be accomplished in a variety of methods, including the following:
|Operational Metadata Attribute||Initial Value Assigned|
|Load Cycle Identifier||The value for this column is assigned through insertion of a new row in a metadata table or lookup table that is added to the data model to capture operational statistics.
The primary key on the table consists of a serial or sequence data type.
The value is incremented during each refresh cycle to the data warehouse regardless of the completion status.
|Current Flag Indicator||The value of this column is assigned through a comparison of data currently in the data warehouse versus data in a new load cycle.
The two data files are sorted and then consecutively read to determine the current record.
The current record has a “Y” value assigned to it while historical records are assigned an “N”.
|Load Date||The value of this column is determined at the beginning of the load cycle.
If a Load Cycle table is available in the meta data repository or has been added to the model, the value for this column can be extracted for this source.
In some cases the value assigned may be the extraction date from the operational system versus the actual date the information was loaded into the data warehouse.
|Update Date||The value of this column is determined, like load date, at the beginning of the load cycle but is applied only to historical rows.|
|Operational System Identifier||The value of this column is assigned based on which operational system provided the information to the data warehouse.
If two or more sources are integrated to construct a single row in a data warehouse table, the value assigned should represent the multiple operational systems.
|Active Operational System Flag||The value of this column is determined by comparing the natural or production keys of the operational system to data rows in the data warehouse.
A value of “Y” is assigned to a record if it currently exists in the operational system.
|Confidence Level Indicator||The value of this column is assigned during ETL processing based on conformance to the business rules associated with the data warehouse target table.
If business rule validation indicates that transformation occurred correctly with no unknown or missing information the confidence level can be set high.
If derived or estimated data is inserted during ETL processing, the confidence level may need to be set lower.
Operational Use Strategies
A variety of processing optimizations can be used in data acquisition design and maintenance activities, as well as with data consumer reconciliation and auditing of information performed against the data warehouse, through the incorporation of operational metadata into the architecture. Some examples of these tasks include:
- Extraction of current data
- Identification of rows for archive/purge processing
- Load cycle rollback
- Processing of slowly changing dimensions type 2 records
These activities take advantage of the incorporation of operational metadata columns and provide a starting set for metadata management through enhanced operational metadata.
Extraction of Current Data
Most data warehouse data models use the load date column to maintain an historical view for an instance of data (row) in a table. While referential integrity is maintained through use of the load date column, a high degree of processing overhead is incurred when a data consumer attempts to determine the most current row in a dimension table. By sequentially reading each row of data in the table, the RDBMS must determine which row has the most current load date.
The current flag indicator serves as an alternative to this brute force method of date determination. Through ETL processing and comparison of natural or production keys, the current flag column facilitates rapid identification concerning the currency of data rows from the operational systems. The last row added to a data warehouse table for a particular production key is given an assignment of “Y” for the current indicator flag, while historical records are set to “N” or other value (best practice does not permit the use of NULL). The data consumer can use this column in their queries as a constraint to retrieve only the most current information. For certain reporting requirements, RDBMS views can be established that are constructed as constrained on the current indicator column, value of “Y”, automatically thereby avoiding errors in queries by data consumers.
Load Cycle Rollback
Many data warehouses use monthly load cycle. During ETL processing of this load cycle, an error can be detected in the RDBMS or, data from an operational source system can be considered as corrupt or, some other data quality measurement issue could be discovered in the data. Before the incorporation of operational metadata columns, the data warehouse administrator would have only limited methods of isolating and removing this corrupt or questionable information. Using the Load Cycle Rollback capability, the administrator can address the removal of the erroneous data from the data warehouse.
The most common method used in Load Cycle Rollback is to remove completely the last refresh cycle to the data warehouse, by constraining on the value of the questionable load cycle and removing the affected rows in each table from the data warehouse.
This method can be enhanced, in certain circumstances, by removing rows from a particular load cycle for a specific operational system by constraining on the erroneous operational system indicator.
For either approach, the rollback process must accommodate the re-assignment of the Current Flag Indicator, once the erroneous / questionable rows have been removed from the data warehouse. These flags, that were set to “Y” before the rollback process started, will be set to “N” or to the other chosen value to indicate they are no longer the current record.
Archiving and Purging
At some point, the data warehouse will be functioning with periodic load cycles processing successfully. Data consumers and the database monitoring applications will notify the data warehouse administrator that dormant or inactive data exists in the data warehouse; this data has not been used for any purpose for some time.
To maintain performance, to improve access times, to reduce storage costs by eliminating unused data, and to improve query and analysis results through access to active / valid data, decisions to archive or purge data should occur.
Operational metadata can assist with the identification of data that are candidates for archiving / purging. The load date attribute can be interrogated to isolate and discover candidate rows, but this method uses additional cycles of the data warehouse’s RDBMS to analyze the load date.
An alternative would be to create a constraint on a set of load cycles applied to instances of data (rows), providing a more efficient means of detecting candidate rows for archiving or purging while avoiding the need to focus on a specific date column to identify a particular period.
Slowing Changing Dimensions (SCD) Type 2
Data warehouses or marts that use one of the forms of a star schema data model design can use the Current Flag Indicator to support SCD type 2 processing for dimension table loads.
The established method for processing of SCD type 2 records is to sort both sources of information, the operational system versus the data warehouse / data mart, based on the production or natural key of the dimension table. The current flag indicator can be used to constrain which records from the dimension table will be compared to the current load cycle data that is scheduled to be processed to the warehouse. Only dimension records containing a “Y” in the current indicator column are required to be compared to the load cycle data, since they contain the most updated information for the specific production key(s).
Designers make an additional comparison between the two data files based on these key column(s). Current load cycle records with production key values not found in the dimension table are loaded as new rows. These rows receive new surrogate keys for the dimension and have their current indicator flag set to “Y.” The current flag setting is due to the fact that no other occurrence of the production key(s) combination exists in the dimension table to date.
Production key(s) found in both the load cycle data and the dimension table are examined based on columns that the data consumers believe to be relevant for tracking changes to the data warehouse / data mart. The two sources are sorted by the key and relevant columns. Administrators make a consecutive comparison against these files. Current load cycle records that have relevant columns that do not match their corresponding dimension table rows are flagged as new rows. These rows are inserted with new surrogate keys and have the current flag indicator set to “Y” due to an update made to a relevant column since the last load cycle.
Previously loaded dimension rows that have a matching production key(s) when compared to the current load cycle file but have differences in relevant columns are assigned a current indicator of “N”. These rows receive this setting since new information on relevant columns of the dimension have been received in the most recent load cycle. Generally, data consumers review these instances and correct the anomalies for later inclusion into the warehouse / data mart.
This process of constraining on the current flag indicator and performing comparisons on production keys between the dimension table and load cycle is repeated during each update process to the data warehouse.
Unique Questions Answered by Operational Metadata
Many administrative and data quality measurement questions can be answered due to the incorporation of operational metadata columns into the data warehouse architecture. Some examples of the questions that the use of operational metadata can answer could include:
- What operational system or systems provided this particular row of information to the data warehouse (data quality and administrative)?
- When was this particular row of information last refreshed by the operational systems (data quality)?
- How many rows of information will be affected by an archive to near-line storage or a purge process (administrative)?
- What is the growth rate for a table per load cycle (administrative)?
Adding operational metadata allows for detailed, row level explanation of the facts found in the data warehouse. The inclusion of these types of attributes into the data warehouse (starting at the DW architecture) and ETL processes can reconcile data quality issues and enable enhanced performance capabilities. This reconciliation offers an increase in data integrity, and the performance improvement provides greater capability in data warehousing administration tasks. Data consumers can become more confident about the data quality provided through the data warehouse when making decisions and performing analytical actions from the data contained in the environment.
Part one explored various techniques to implement operational metadata in the design of a data warehouse.