Effective data integration is an important, precise, detailed, and rigorous analysis process that can offer business value to any organization
Effective data integration is much more than connecting a few databases, building bridges among applications, or consolidating disparate data into one database without addressing the high percentage of data inconsistency and redundancy found in every organization. Integrating data is not a rote task that can be relegated to a tool. It is an important, precise, detailed, and rigorous analysis process performed by business analysts and data management professionals. This process produces an inventory of data (either as logical data models or physical databases) where all data is integrated within a common business context and appropriately deployed for maximum use in supporting the business information needs of the organization. This task is not without challenges.
Know the Business Entities
Each organization has key business entities that are the heart of their business. The automobile manufacturer has dealers, the retailer has suppliers, the mortgage loan company has brokers, and almost every organization has customers of one sort or another (policyholders, depositors, borrowers, students, patients, citizens, non-profit contributors, convicts, etc.). To be successful and to compete effectively, organizations must understand these entities to the point of knowing their profiles including who they are, how they shop, what they buy, what services they use, what channels they prefer, their preferences for receiving marketing information, price sensitivity, service level requirements, and quality requirements. This is partly the responsibility of data management professionals – knowing the data and its meaning, its context, and the rules that govern the data.
The information about these entities must be integrated in a fashion that will allow the organization to take clever and most-effective actions. For example, a banking customer can choose to use an automated teller machine (ATM), the bank’s Web site, a voice response unit (VRU), a customer service representative (CSR), or the teller at the counter in the bank. The choice of customer channels will have a major impact on the cost to the bank. Speaking to a CSR will cost the bank several dollars while a Web inquiry will cost only pennies.
Mergers and Acquisitions
When organizations are merged and when businesses are acquired, the normal process is to integrate the two companies. The potential for cost savings related to the integration is usually one of the primary justifications for the merger. The operative word is “potential.” The reality may be quite different and it is up to the IT teams, including data management, to determine just how technically difficult it would be to for the two organizations in question to merge. The following are some of the difficulties involved in the process:
- Duplicate records – a customer may belong to the two companies that are merging.
- Duplicate keys – while the supplier numbers are unique within each company’s systems, there are overlapping numbers between the two companies.
- Different data types and different field lengths (metadata) are used by the two companies.
- Fields with the same names have different meanings.
- Fields with the same meaning have different names.
- Corresponding fields have different data values.
- Corresponding fields have different business rules.
- If the two companies use different database management systems (DBMS), there may be some technical or design incompatibilities.
Many companies only resolve items that are technical hurdles to physical data integration, but they rarely address the other items that only have a business impact, but not a technical impact. Sometimes, the reason given for their lack of rigorous data analysis is that expectations from the business people are difficult to manage. Managers and end-users think that the integration process should be easy and fast (unless they have been through it before).
Data redundancy is rampant in almost every organization. It is not uncommon to find data repeated ten, twenty, or more times within the organization, and nobody knows which file is the system of record and which copy of the data most accurately reflects the real world. While there are legitimate reasons for consciously maintaining controlled redundant data (e.g., performance, legal, and security reasons), un-controlled redundant data contributes significantly to the costs of developing and maintaining multiple redundant databases and applications. The costs include programmer effort, disk and machine costs, reconciling inconsistent results, needing to update (sometimes manually) multiple systems. Redundant data results in a loss of control, misunderstandings, and a continuing bad reputation for IT. The goal is to minimize data redundancy and have a single version of true and accurate data.
How did the data get there and why is it redundant and different? It might have started with the same raw data but it was filtered, transformed, and had different business rules applied. A redundant set of data may have been created if the application developer for a new system was unaware of the existing data. The redundant data may have been consciously created when a new system did not want the political burdens of having to rely on another data owner.
Webster’s definition for lineage is “the direct descent from an ancestor.” Therefore, data lineage is the process of tracking the descendent data elements from their origins to their current instantiations. Since most organizations have a plethora of redundant data, it is important to know where the data came from to determine which data source is most reliable. Documenting the data lineage for each data element in a data dictionary or a metadata repository would provide the origin and subsequent alterations and duplication of each data element. This data inventory is an invaluable data management resource. Not only does it provide guidance for using data elements correctly, but also it is also extremely useful for impact analysis. One only has to remember the immense effort with Y2K impact analysis because most companies did not track – and still do not track – their data lineage.
Establishing data lineage for each data element is a major effort given the fact that organizations have millions of data elements in thousands of files and tables. Being overwhelmed by the scope of the effort, most organizations do nothing. Rather than taking the all or nothing approach, a good place to start would be on a Data Warehouse project. For example, the team could capture data lineage for the subset of data elements used by the DW applications with the metadata in the ETL tool. It will track what data is sourced, how the data is transformed, and where it is stored for access. In addition to the ETL tool vendors, most of the BI and DW software vendors have some metadata capability to trace data lineage.
What Data to Integrate
Data integration is performed in two layers: logical and physical. Logical data integration is the process of building an enterprise logical data model – not all at once, but one project at a time as new systems are developed or old systems are converted. Physical data integration is the process of filtering redundant data, retiring redundant files and databases, and combining data elements for the same business entity into one physical database. This process also includes stopping uncontrolled data propagation where data originates from a single provider application and then is propagated to individual, subscriber applications.
Data Integration Prioritization
No one can boil the ocean; prioritize the physical data integration deliverables. An enterprise-wide data integration effort has to be carved up into small iterative projects, starting with the most critical data and working down to the less significant data. The data integration team must determine which data is most appropriate for integration. Some data may not be suitable for integration, such as private data (e.g., department-specific data), highly secured data, and data that is too risky to integrate. The team also needs to look at historical data and decide how much of it to include in their data integration process.
This is especially true for DW and BI projects. The following questions should be considered when determining which data to include in the integration process: How will the data be used? What are the political issues surrounding data sharing? What are the security issues? Are there regulatory and legal requirements that would or should prevent data sharing?
Risks of Data Integration
There are a number of risks associated with data integration, such as lack of management commitment, cost and effort, sustainability, external data, data selection and validation. These should be considered carefully before the integration process begins, so that the investment in data integration is not wasted.
Many people confuse data integration with data consolidation. Consolidating data simply means gathering data elements that describe the same business entity (e.g., customer or product) from multiple source databases and storing them in one database. Integrating data goes beyond that.
In addition to consolidating data, integration enforces data uniqueness by standardizing the business data, enforcing business rules, and eliminating (or reducing and controlling) data redundancy. Integration also means that data is formally named, unambiguously defined, appropriately architected, and its lineage is properly documented.
In a federated approach, data does not have to be consolidated or moved to a common, integrated database. A simplistic definition of data federation is leaving the data where it is, installing middleware like Enterprise application integration (EAI) or EII, and providing metadata to make people aware of the existence of the data, how current the data is, and how to get to it. This eliminates the need to convert, match, filter, and merge the data and avoids the struggles with data anomalies. However, this approach requires complete and current metadata as well as clean and consistent business data, not to mention good quality processes to promote and allow federation. The federated approach is not mutually exclusive with consolidation or integration. If the data is to be infrequently accessed, federation may make sense.
Data Integration Capability Maturity Model
The generic capability maturity model (CMM) can be adapted easily to a company’s data integration strategy, as summarized in Table 1.
|Limited data federation; often with redundant and inconsistent data
|Limited data consolidation; documenting redundancies and inconsistencies
|Data integration initiated; new “dis-integration” is discouraged
|Data integration widely adopted; “dis-integration” is penalized
|Enterprise-wide data integration and other data strategy principles practiced by all departments in the organization
Table 1: Data Integration CMM Levels
Data integration is a topic that is well-publicized, much talked about, and heavily exaggerated by many tool vendors. It is a complicated topic that requires considerable effort if implemented fully. It requires a strategy, a plan, a team, skills, sponsorship, and an ongoing commitment from the business people.