Enterprise Data Management: Integrating The Manual Steps
By Lance Carroll
Copyright© 2000 Metagenix, Inc.
Enterprise Data Management
Enterprise Data Management (EDM) is the strategy behind understanding and sharing data across the enterprise and making information available to decision makers, regardless of the source of the data. The technology has been through several generations in the last few decades. In the past, it was thought of as a data conversion. After several iterations, the technology is now known by the general term of data migration. Data migrations are present everywhere today as they are used to build EDM applications such as E-commerce, Enterprise Resource Planning, Customer Relationship Management (CRM), Business Intelligence (BI), Data Warehousing, and Data Marts.
Software tool vendors have marketed a variety of partial solutions to allow companies with large Information Technology budgets to perform their own data migrations. No data migration tool currently on the market, however, has been able to integrate all of the steps involved in building an EDM application.
There have been three distinct generations of EDM applications until this moment:
|
Handwritten Code |
The projects
were labor intensive, time consuming, and rarely repeatable. They were very difficult to complete successfully, and mistakes
in the content and structure of the source data could destroy
the project. |
|
Automated
Code Generation tools |
Although code
was easily replicated, the programmers were still involved heavily
at the code level and errors still persisted. |
|
ETL tools |
These tools automated
the process of extracting and transforming data for loading (ETL)
into EDM applications. But
performing a data migration was still an awful lot like programming,
albeit in a visual environment. The ETL tools recognized the concept of meta
data, which allowed a development team to produce consistent results.
However, developers still had to interview the customer
to understand the source systems, which led to mistakes causing
large sections of the application to be completely redone. |
The Importance of Meta Data
In varying degrees, the traditional approach to data migration takes place in the following steps:
1. Analyze the user's needs and build a target database specification: After interviewing the users, a grand scheme is devised for a database model that will answer all of the questions the user would like answered by the EDM application.
2. Analyze the data sources available: A set of data sources from legacy systems, operational systems, etc. is compiled and analyzed to determine their relevance to the target database. Documentation for the data sources may or may not be available, or it may be inaccurate. A series of “experiments” is run on the source data to detect the properties of the data.
3. Build a set of source data to target database mappings: A plan for transforming the various data sources to the target is devised. Typically, this step is performed with an ETL tool. Other companies rely upon hand-coded programs.
4. Stage the data: The source data is loaded into a staging area where it can be massaged, cleansed, and manipulated into the form needed for the EDM application.
5. Load the data: The data is moved from the staging area into the EDM application. This step includes formatting the data for reporting and the schemas necessary for OLAP cubes.
Why do most data migration projects using third generation technology either overrun or fail? The easy answer says they do not deliver the required features, they exceed their budget, or they are canceled before they are completed.
But what is the driving force behind these failures?
The main weakness in this manual approach is that it makes the assumption that the data required for the application is actually available from the data sources. Major corporations have spent millions of dollars on EDM projects, only to find out that the source data will not support the model they have built for their target database. Because the process is made up of a series of disjointed steps usually executed manually by independent teams of programmers, the discontinuity between the steps often leads to disaster.
“The problem is that most project teams do not discover fatal flaws in their source data until after they have invested time, money, and programming resources building target data models, source-to-target mappings, transformational rules and extract feeds. The teams then scramble to rewrite these programs and start again. But their next round of validation checks again reveals unexpected problems with the source data. Some project teams cycle through this process many times and wind up never delivering the data warehouse…. They fell victim to the code, load, and explode syndrome.” (i) The next generation of software should address these weaknesses.
Mr. W.H. Inmon, in his books on data warehousing, estimates that, on average, 80% of the effort of building a data warehouse goes into the tasks of extraction, transformation, and loading. (ii) The lack of software that can detect problems in the ETL process upstream is costing businesses a significant portion of their data warehousing budgets.
Unfortunately, the actual mechanics of specifying a set of source data to target mappings is a small part of the overall task of extraction, transformation, and loading. The real work is the intellectual exercise of determining:
- What exactly is in the source data?
- How it is organized?
- How can this data best be expressed in a target database schema?
- How can we map these sources and targets together?
Usually, there is little known about the source data, minimizing the possibilities for success in step 2, and killing the success of the subsequent steps.
Most data migration projects that overrun or fail entirely do so because of a lack of understanding of the meta data. Without the ability to automatically reverse engineer meta data, developers are left to investigate the source data by hand. Documentation for the meta data of legacy systems is usually incomplete at best or non-existent. The personnel needed for interpretation of the data often have left the company. Haphazard guesses are used instead of a complete analysis of content. This leads to a process where the migration of source data into the warehouse is debugged far downstream in the development cycle. Rather than catching errors at design time, problems in the meta data are reflected in production systems.
The fourth generation of EDM, data profiling and mapping software, solves these issues. The software uses sophisticated reverse engineering processes to allow the user to load raw data in and automatically receive a complete analysis of the source data. It reverse engineers meta data from raw data by analyzing the source data and providing recommendations for the target database, such as primary keys, foreign keys, normalization, dimensions, measures, and views of the data. With this information a model is built with the result being the optimal target database for the EDM application based on what currently exists. However, just having this analysis is not enough. This information must be used to drive subsequent steps of the data migration process.
The Perils of the Manual Process
As long as the data migration process consists of multiple, independent and manual steps, the failure rate and cost for EDM projects will remain high. This series of steps has to be repeated every time there is a material error in specifying the source data, which is arduous and expensive.
EDM projects have many pitfalls to overcome, which are costly and can lead to the failure of the project. The lack of software to detect problems with the ETL process, and to give an accurate specification of the actual data, is constantly causing businesses to spend a sizable portion of their project budgets needlessly, addressing problems that companies should be able to avoid, for example:
- Existing data is often invalid. Date fields might contain invalid dates in the new system, such as the difference between MMDDYY and YYYYMMDD schemes.
- Fields that are guaranteed to be a certain type are only that type in 98% of the cases, but the 2% exception is really valid and tells us something about the true meaning of the field.
- Two separate systems were merged together in the past, creating conflicting sets of domain values for coded values. For example, Company A used a rating of 1-10 on its prospects, where Company B ranked them A-F. The new Company’s documentation says that this field is numeric and has values 1-10.
- Assumed dependency relations are not correct. For instance, the relationship between invoices and parts is one to many according to the specification, but there are many invoices without corresponding entries in the parts file.
Data Profiling arrives
The solution is to know all the data anomalies before you are in the middle of a project. The solution is being able to tell your customer that their requirement is impossible to achieve given the current state of the data. Usually this type of information cannot be gathered until analysts are deep into the data, already behind schedule, with the scope of the project far outpacing their ability to deliver.
Data profiling, used by fourth generation software, involves a number of analyses to help answer these questions before the project is underway:
- Column Analysis – provides characteristics for each column in your source system by analyzing every data value. Such things discovered are data type, size, frequency and distribution values, range of values, nullity and uniqueness characteristics.
- Table Analysis – discovers relationships between columns within a table. These relationships are used in order to compute functional dependencies, which are derived from the actual source data, rather than from suspect documentation or guesses. By utilizing the results from table analysis, the software can infer items such as primary keys, normalization candidates, and derived columns. This analysis will not only show where dependencies exist 100% of the time, but it will also provide instances where they exist most of the time. This is usually an area where an analyst would look to find inconsistent data.
- Cross Table Analysis – provides information about the relationships between columns across tables. The common domains between fields across tables are compared, and an intersection factor is calculated. If the number of intersections between fields of two different tables reaches a certain threshold, a relationship between the tables may exist. If common domains exists between two columns but the user decides it is not a relationship, then this is a candidate for a redundancy.
- Normalization – involves computing a third normal form relational model for the target database. The information gained through the analysis phases is used to aid the user in making intelligent decisions in the construction of the target data model. The software presents the user with a proposed normalization and the user can accept the proposed normalization, reject the normalization, or modify the model, as they desire.
Where do you go with it?
The amount of meta data gathered through these data profiling is unprecedented for software. Previously, the level of detail provided had been available only after weeks or months of manual work by analysts. However, just because you have this meta data does not mean you are successful.
It is what you do with this information that matters.
One only has to examine the history of corporate repositories to realize that gathering meta data does not in and of itself create success. People have been populating corporate repositories for years, but their internal efforts were not eased because they did not use the power of the available meta data. Many repositories became “shelf-ware” because they were not kept up to date or the meta data was not used proactively. The level of meta data for tables and columns derived from data profiling is much more detailed than any repository has provided before. Because of this, companies have a rare opportunity to dramatically affect internal efforts quickly.
After reviewing the analyses, making informed decisions, and designing the target system, analysts can have a specification supplied to them that documents the source and target systems and the transformation rules. If we were to stop at this point and only allow the software to provide this specification, then we have still impacted a project’s timeline. However, an analyst must still exert substantial amounts of effort either in designing and creating programs or creating the mappings in an ETL tool. The solution is to allow these mappings to be imported into an ETL tool.
The key to helping companies leverage their investment in third generation software is to integrate them with the fourth generation software. This revolutionary technology therefore does not replace previous incarnations of ETL tools, but rather is a step forward in their evolution and extends their investment.
An EDM environment allows users to integrate multiple disparate systems by providing a complete understanding of the meta data, and should generate the ETL programs to merge the data.
Since the code is automatically generated, the speed of implementation is dramatically increased. Because the meta data is based upon the actual source data, accuracy is 100%, avoiding the code, load, and explode phenomenon. The outcome is an implementation that avoids the problems commonly encountered by EDM projects that use the traditional approach.
This is not to say that using an EDM environment is a silver bullet that eliminates any possible problems in the process. Analysts still have to make informed decisions and apply their talents to the problems. But by eliminating the vast array of pitfalls that traditional multi-step EDM projects encounter, companies can dramatically reduce the time and effort needed to complete the project. This type of environment can take the typical six-to-eight month project and deliver the same results in thirty to sixty days, at significantly less risk.
With the power of meta data, it should be only a matter of time that this next generation EDM environment defines methods for cleansing and scrubbing. Once this level of functionality is reached, we will truly have an end-to-end data migration tool.
(i) Wayne Eckerson, “Analyst Insight: The Source of the Problem,” DM Review, May 1999
(ii) Larry Greenfield, http://pwp.starnetinc.com/larryg/clean.html