Skip to content.

Sections
Home » Resource Center » Real-World Decision Support (RWDS) Journal » October 2000 - Volume 1, Issue 09 » Strategies for Custom Data Warehouse ETL Processing

Strategies for Custom Data Warehouse ETL Processing

By Michael F. Jennings
October 18, 2000

Many data warehouse efforts struggle with determining the best methodology and implementation strategy for performing extraction, transformation and load (ETL) processing for their particular enterprise. Often budget constraints, deadlines, and lack of internal warehouse architecture experience or a combination of the three, often hinder these projects. These hurdles often lead to additional time and expenses being expended in order to manage these risk factors on the project.

In other cases, data warehouse projects expend a great deal of time and effort analyzing, investigating and evaluating the various ETL tools in the market place. Often these groups over look the internal capabilities already available and proven in their internal infrastructure environments.

This article provides a technical methodology and implementation strategy for performing custom ETL processing. A detailed example of a full life cycle ETL process, from source to warehouse, is systematically reviewed during the article. These techniques can be beneficial for decision support projects that are just getting started or for the implementation of a prototype by providing a quick, easy to learn, cost effective method for loading information into a warehouse. Existing infrastructures, even those who have purchased and implemented commercial off the shelf (COTS) ETL products, can benefit from these techniques by augmenting their existing procedures and processes with portions of this method.

Custom ETL Processing Fundamentals

For the purposes of this discussion, we are following the principle that flat file processing, outside of the RDBMS, is seen as the most efficient means of performing data transformations for the data warehouse. Flat file processing offers the advantage of not leaving us to tightly coupled to one particular RDBMS vendor in case this changes during a project’s life span. Sorting and transformation of flat files is typically faster then similar processing done within a RDBMS.

I was involved on one project where aggregate tables were needed in a schema to increase reporting performance. A fact table containing millions of rows was being reduced in the aggregate by a factor of twenty. Using RDBMS techniques only to produce the aggregate required more than twelve times more processing time than flat file processing of the same aggregate plus use of the database’s load utility (using a commercial sorting product).

We are further assuming that our warehouse operating system is using one of the more popular flavors of UNIX and a known RDBMS (preferably one with a warehouse focus). Three versions of the data warehouse database (development, test, production) should be created. Source extract files, temporary work files, DBMS load ready files, scripts and programs will be loaded into a project directory that is subdivided into development, test and production. Data files, scripts and programs that are shared between decision support projects will be loaded into a common directory area, at the same level as project (See figure 1).

Figure 1

Figure 1

In order to create a custom ETL environment you will need six basic infrastructure components available in your environment. The first component is a network environment that can connect your source data systems to your warehouse platform. Most corporate network architecture can move large amounts of data from platform to platform through shear bandwidth availability or through various compression utilities. The wide acceptance on TCP/IP communications has made utilities such as FTP available in a variety of platforms and operating systems.

The second component needed for the custom ETL environment is an RDBMS for the warehouse. Most of the large RDBMS vendors have made their products available on a variety of platforms and operating systems. Some of the vendors have addressed the needs of decision support environments by adding optimization packages for data warehouses such as high speed parallel processing.

I have used one of these RDBMS versions, specialized for decision support, and have seen insert speeds of 3.5 million records per minute with update speeds of 750 thousand per minute. That equates to a billion records being inserted into a warehouse table in less than five hours. Of course, this was on a twenty four node box within several gigabytes of RAM so compare your results appropriately. An additional plus is an RDBMS that supports remote access of database on other servers. This will help facilitate the data extraction process from source systems directly to the data warehouse.

The third component needed is a sort/merge utility to integrate data from the various source systems. Several operating systems come with their own sort utility such as UNIX that can be used for limited volume. Otherwise, several popular commercial sort utilities have been ported to a number of platforms and operating systems. If a sort package is used in house but not currently available on the warehouse operating system look into extending or migrating the current license. Some of these sort packages also come with a merge utility for combining records from different source files with similar keys. If your sort utility does not have this merge capability, you can search the Internet for the numerous script and programming examples of how to perform this function. I have seen several IT organizations address this need by developing their own in house utility for this type of activity that is shared between platforms.

The fourth component in our custom ETL environment is a method to perform calculations. This can be accomplished in SQL during loading of the target warehouse table or, depending on volume, could be accomplished through some script method (e.g., awk, Perl, etc.).

The fifth component needed is a means to schedule and run ETL batch cycles based on events or timelines. This can be accomplished through a commercially purchased program or a utility that is available through the operating system or DBMS (e.g., cron, daemon, stored procedure, trigger).

The last component we will need is a change management utility to manage updates and version control of programs and scripts. This can also be a commercially purchased program or a utility that is available free with the operating system of the warehouse (e.g., SCCS in UNIX).

Methodology for Implementing a Custom ETL Process Environment

Once the business requirements for the warehouse project have been defined, the source systems analyzed and a physical data model has been developed you are ready to begin designing your ETL processes. A systematic approach needs to be applied as to how you are going to design your ETL flows. In our custom ETL environment, we will use a staged approach for designing ETL process flows. Each process is directed toward loading of a particular target table or cluster of tables depending on the business needs.

The staged transformation is used to optimize data acquisition and transformation from source systems. This method provides increased flexibility and capabilities to data acquisition developers. The various stages provide a single processing method for initial and successive loads to the data warehouse. The five stages provide a modular and adjustable transformation process for the target table that can adapt easily to changes in the source systems or warehouse model design. The combined use of technical meta data tags, sometimes referred to as operational meta data, in the warehouse schema design and ETL transformation processes allows for improved capabilities in loading and maintenance designs. Not all stages will be used in every target table case depending on the business requirements and complexity of business rules needed in the transformation.

The first stage, source verification, performs the access and extraction of data from the source system. This stage builds a temporal view of the data at the time of extraction. The source extract built in this stage is included in backups of the entire batch cycle for reload purposes and for audit/reconciliation purposes during testing. If audit files are provided from the source system, these files are compared against the extract files to verify such items as row counts, byte counts, amount totals or hash sums. During this stage, both technical and business meta data can be captured and verified against the meta data repository (if available). Verification of business rules unique to the source system can also be applied during this stage using the repository to identify any exceptions or errors.

The second stage, source alteration, can perform a variety of transformations unique to the source depending on business requirements. These transformation options include integration of data from multiple source systems based on priority ranking or availability, integration of data from secondary sources, splitting of source system files into multiple work files for multiple target table loads (clusters), and application of logic and conversions unique to the source systems. Technical meta data tags such as source system(s) identifiers, production key active in source system and confidence level indicators are applied during this transformation stage.

The third stage, common interchange, applies business rules and/or transformation logic that is frequent across multiple target tables. Examples of transformation logic applied during this stage include referential integrity (e.g., population of fact table surrogate keys from dimension tables) and application of enterprise definitions and business rules from the meta data repository (e.g., conversion of country codes to ISO standard formats).

The fourth stage, target load determination, performs final formatting of data to produce load ready files for the target table, identifies and segregates rows to be inserted versus updated (if applicable), applies remaining technical meta data tagging and processes data into RDBMS. Data records processed from the source system up to this stage for the current batch cycle are compared against records loaded in previous cycles to determine if insertion or update is required. An example of this is a dimension table that uses slowly changing dimension (SCD) type 2 philosophy for updating. Technical meta data tags such as load date, update date, current flag, load cycle, hash values and/or cyclic redundancy checksums (CRCs). For updated records, the technical meta data tags are also updated to indicate the change in state of the data records. The load ready files built in this stage are included in backups of the entire batch cycle for reload purposes and for audit/reconciliation purposes during testing.

Depending on the DBMS used for the warehouse and the volume of records being processed, the high speed parallel load option may be substituted in this stage in place of the standard database load utility. Faster load performance can sometimes be achieved through dropping and re-insertion of all data rows versus updating due to optimizations made available through use of the DBMS’s high speed parallel load utility.

The fifth and final stage, aggregation, uses the load ready files from the previous fourth stage to build aggregation tables needed to improve query performance against the warehouse. This stage is typically applied only against fact table target load processes. Care needs to be taken to ensure that aggregated records end up with the correct surrogate keys from the dimension tables for the rollup levels required in reports.

This five stage method provides a modular and adaptable means to efficiently load a data warehouse (See figure 2). Transformation process designs can easily adapt to changes in source systems or the addition or removal of source systems without affecting the entire ETL work flow. Additionally, changes to the data warehouse schema will cause minimal impact to ETL processes through the insulation provided through the various transformation stages.

Each target table load ETL process follows the staged transformation method. Each stage is accomplished through sequential execution of steps using one of the appropriate infrastructure components outlined earlier. These steps and target table loads can be accomplished in parallel with some additional planning of the overall ETL batch cycle process. The steps for each stage are assembled and executed through a script or higher level language program available for the particular operating system. For this discussion, we will assume use of some form of shell script. The stage transformation scripts are collected together in a higher level target table load script. Finally, the target table load scripts are gathered together in the required execution order to form the processing needs of the batch cycle (e.g., daily, weekly, monthly, etc.). This component method of assembling the batch cycle loads to the warehouse promotes flexibility and reuse in the design.

ETL Process Example

Now we will use our infrastructure components and transformation method and apply them to a common ETL process. For our example, we will assume that our source system is a human resources (HR) enterprise resource planning system (ERP). The target table we wish to build an ETL process for is an organization dimension table that happens to utilize slowly changing dimensions (SCD) type 2 for maintaining history.

Slowly changing dimensions (SCD) type 2 is used to model a dimension table when changes to relevant columns need to be captured over time (See figure 2). This modeling technique relies on the use of production or natural key(s) in the dimension table and the source system not changing in order to detect changes to relevant columns. New surrogate keys are assigned to the dimension table when changes to these relevant columns are detected during a batch load cycle. Comparisons are made between the previously loaded production key(s) of the dimension table and the new load cycle data from the source system. Changes to relevant columns in the new load cycle data for matching production keys are loaded with new surrogate key assignments.

Figure 2

Figure 2

Since we have utilized technical meta data tag columns into the design of the dimension table and ETL process, newly assigned surrogate keys can be easily identified as current while records previously loaded for a matching production key can be flagged as historical. Data warehouses that make use of some variant of a star schema data model design can utilize this current flag indicator to aid in SCD type 2 processing for dimension table loads.

Figure 3

Figure 3

In the first stage of transformation, source verification, data is extracted into flat files from the Organization and Region tables in the HR ERP source system using the unload utility of the DBMS (See figure 3). Initial sorting and reformatting of data is applied to the extracted records from the Organizational table since it will be the primary data source for this target table load. The infrastructure components, DBMS unload and sort utility, are used in this stage to extract data from the source system into flat files and sort/reformat the primary source records.

Figure 4

Figure 4

In the second stage of transformation, source alteration, data from secondary sources, in this case the HR ERP Region table, is appended to the primary organizational extract file (See figure 4). The appended column, region name, is required for loading to the data warehouse target dimension table, organization. The infrastructure component, merge utility, is used in this stage to match the two files based on node id and then append the region.

Figure 5

Figure 5

In the third stage of transformation, common interchange, data records from the source system are compared against data mappings stored in the meta data repository (See figure 5). The region name values stored in the HR ERP system do not conform to the established enterprise definitions. The meta data repository contains a data mapping of the region names from the HR ERP system cross referenced to the data warehouse enterprise definitions. Using the merge infrastructure utility, the organization record region names are updated to reflect the enterprise versions.

Figure 6

Figure 6

In the fourth and final stage of this transformation example, target load determination, the current load of organization records are compared against those previously loaded in earlier batch cycles (See figure 6). In this particular case, processing is slightly more complex due to the use of SCD type in the design of the organization dimension table. The established method for processing of SCD type 2 records is to sort both sources of information, in this case processed HR ERP organizational versus the warehouse organization table records, based on the production or natural key (node id) of the dimension table. The current flag indicator can be used to constrain which records from the organization dimension table that will need to be compared to the current load cycle data being processed to the warehouse. Only organization dimension records containing a “Y” in the current indicator column are required to be compared to the HR ERP load cycle data since they contain the most updated information for the specific node id (production key).

A consecutive comparison is made between the two data files based on node id using the merge utility. Current load cycle records with node id key values not found in the organization dimension table are loaded as new rows (See figure 6, top left). These rows receive new surrogate keys for the organization dimension and have their current indicator flag set to “Y”. The current flag setting is due to the fact that no other occurrence of the node id combination exists in the dimension table to date.

Matching node ids found in both the load cycle data and the organization dimension table are further interrogated based on columns deemed relevant to track changes on by the business users, in this example region name and manager id (See figure 6, bottom left). The two sources are sorted by the node id and two relevant columns. The merge utility is used to perform a consecutive comparison against the two files.

Figure 7

Figure 7

Current load cycle records that have relevant columns that do not match their corresponding organization dimension table rows, new region names and/or manager ids, are flagged as new rows for insertion (See figure 7, top). These rows are inserted with new organization surrogate keys and have the current flag indicator set to “Y” due to an update made to a region name and/or manager id since the last load cycle.

Previously loaded organization dimension rows that have a matching node id when compared to the current HR ERP load cycle file but have differences in relevant columns are assigned a current indicator of “N” (See figure 7, bottom). These rows receive this setting due to changes that have been received on the region name and/or manager id of the dimension in the current load cycle.

This same process of constraining on the current flag indicator and performing comparison on production keys between the dimension table and load cycle is repeated during each update process to the data warehouse.

About the Author

Michael Jennings is an architect and group manager specializing in business intelligence, data warehousing, and clickstream analysis at Hewitt Associates. His articles on business intelligence topics have appeared in DM Review and Intelligent Enterprise magazines. He is a regular columnist for the Real World Decision Support newsletter. Michael speaks frequently on business intelligence issues at major data warehousing conferences and seminars on e-business technology strategies at the University of Chicago's Graham School. He is a contributing author to the book "Building and Managing the Meta Data Repository". He has more than eighteen years of information technology experience.

Michael F. Jennings
Group Manager, Business Intelligence & Data Warehousing
Hewitt Associates LLC
100 Half Day Road, MS: 1OP-4S
Lincolnshire, IL 60069
(847) 295-5000
Fax: (847) 771-7904
Email: mfjennin@hewitt.com