Improved Change Discovery through Operational Meta Data
By Michael F. Jennings
March 31 2001
Hewitt Associates LLC
100 Half Day Road, MS: 1OP-4S
Lincolnshire, IL 60069
(847) 295-5000
Fax: (847) 771-7904
Email: mfjennin@hewitt.com or mjennings@igcom.net
Improved Change Discovery through Operational Meta Data
How are you identifying changes in information coming to your data warehouses? This is a common question all data warehouse developers have to answer when designing the data model and extraction, transformation, and loading (ETL) methodology for their business intelligence environment. The ability of your ETL processes to quickly identify and process changes is critical in today's demanding business intelligence environment where real time updates to the data warehouse are becoming a reality (and nightmare for the data warehouse developer). This ability to rapidly identify changes from your source system data feeds can be enhanced through the incorporate and use of operational meta data. This is accomplished by incorporating operational meta data directly into the data mart design and ETL processes. This technique extends the design and architecture of the data warehouse to provide increased processing optimizations for data acquisition.
The accepted approach of depicting changing information in a dimension table, when designing a data mart using a star schema modeling method, is through the slowly changing dimension (SCD) technique. SCD 2 is used to model a dimension table when changes to relevant columns need to be captured over time. This technique relies on the use of a production or natural key(s) in the dimension table not changing. New surrogate keys are assigned to the dimension table when changes to relevant columns are detected during a batch load cycle. Comparisons are made between the previously loaded production key(s) of the dimension table and the new load cycle data. Changes to relevant columns in the new load cycle data for matching production keys are loaded with new surrogate key assignments. If operational meta data columns are incorporated into the design of the table, newly assigned surrogate keys can be identified as current while records previously loaded for a matching production key can be flagged as historical (See figure 1).
Figure 1, SCD 2
SCD 3 is used to model a dimension table to capture both the current and previous changes made to a production key on a single row. This technique uses two sets of columns in the table, one for the current column values and one for previous or historical values (See figure 2).
Figure 2, SCD 3
Both of these methods allow for capture of change information in the data mart but how will you detect this new information in your ETL processing? In the simply examples in figure 1 & 2, there are six different columns that need to be analyzed by your ETL process for change detection. Information from your source system needs to be compared to information currently stored in this dimension by each of these columns to determine if a new record or update will be created. This type of comparative processing can be very time intensive especially when the dimension table contains significantly more relevant columns for comparison or columns in the dimension table are very large. This is where incorporation of a cyclic redundancy check (CRC) operational meta data column into to your data mart design and ETL processes can reduce this processing requirement.
Cyclic Redundancy Check as Operational Meta Data
Cyclic Redundancy Check or CRC is a data encoding method (Non-cryptographic) originally developed for detecting errors or corruption in data that has been transmitted over a data communications line. The sender encodes the data being transmitted into a code that is transmitted with the data. The receiver recalculates the CRC using the data received and compares that to the CRC code originally transmitted. If the CRC values are not the same, it indicates that errors occurred during transmission. The CRC is redundant in that it adds no information about the data. The mathematics behind CRCs is based on something called polynomial division, in which each bit in a chunk of data represents one coefficient in a large polynomial. The International Telecommunication Union (ITU), formally CCITT, has standard algorithms available on the Internet for calculating CRC's (also see ISO standard 3309). Other method of encoding or encrypting data could be substituted in place of CRC's such as Base64 encoding or MD5 & SHA hashing methods.
For ETL processing, the CRC is used to direct SCD processing for a dimension table where many columns need to be compared to determine if data from the source system(s) should processed as an update to the dimension table. This technique can also be used in cases where a column in the dimension table is very long and variable in length such as a URL in clickstream data (See figure 3).
During ETL processing for the dimension table, all relevant columns needed to determine change of content from the source system(s) are combined and encoded through use of a CRC algorithm. The encoded CRC value is stored in a column on the dimension table as operational meta data. During subsequent ETL processing cycles, new source system(s) records have their relevant data content values combined and encoded into CRC values during ETL processing. The source system CRC values are compared against CRC values already computed for the same production/natural key on the dimension table. If the production/natural key of an incoming source record are the same but the CRC values are different, the record is processed as a new SCD record on the dimension table. The advantage here is that CRC's are small, usually 16 or 32 bytes in length, and easier to compare during ETL processing versus the contents of numerous data columns or large variable length columns.
Figure 3, CRC example
The encoding algorithm for calculating CRC values is not a hundred percent reliable due to the possibility of collision. This occurs when two different data strings map to the exact same CRC value. The probability of having collision occur during ETL processing for the dimension table can be eliminated through the inclusion of the source system(s) production/natural keys in the comparison step of the ETL process. Additionally you can use more a complex CRC algorithm that generate a larger number of unique values (less than a billion for 32-bit versus over three billion for 64-bit CRC).
Summary
The benefits of operational meta data use in the your business intelligence environment can be extended to through the use of CRC's or other encoding/encrypting algorithms in your data warehouse and ETL process design. Use of this operational meta data column offers warehouse developers a more efficient means of processing records for inclusion into dimension tables using SCD 2 or 3 designs then comparison of multiple or large variable length columns.
Biography
Michael Jennings is an architect and group manager specializing in business intelligence, data warehousing, web based delivery strategies & architectures, and web traffic analysis at Hewitt Associates. He has more than eighteen years of information technology experience in the manufacturing, telecommunications, insurance, and human resources industries. Michael's articles on business intelligence topics have appeared in DM Review and Intelligent Enterprise magazines. He is a regular columnist for the Real World Decision Support newsletter. Michael speaks frequently on business intelligence issues at major data warehousing conferences and is an instructor of information technology strategies at the University of Chicago's Graham School. He is a contributing author to the recently released book "Building and Managing the Meta Data Repository" published by John Wiley & Sons.
Michael F. Jennings
Hewitt Associates LLC
100 Half Day Road, MS: 1OP-4S
Lincolnshire, IL 60069
(847) 295-5000
Fax: (847) 771-7904
Email: mfjennin@hewitt.com or mjennings@igcom.net