Performing and documenting data lineage can be the basis for a successful enterprise data management function
Though data management and data lineage may seem quite different, practical experience has shown that they are related. Using data lineage as a basis for establishing a data management function is a successful approach.
WHAT are data management and data lineage?
One of the key data management tasks is ensuring aligned and unambiguous definitions. Yet most data management terms themselves lack such definitions.
In many organizations, it is difficult to get a consistent answer to the question “What is your understanding of data management?” Analyzing the definitions of data management in published sources raises similar challenges. This shows that data management has a variety of meanings that strongly depend on the context.
One accepted definition of data management is “a business capability that safeguards the company’s data and information resources and optimizes data and information value chains (‘the chain’) to ensure effective functioning of business”. The word “capability” stresses the ability of data management to reach goals and deliver outcomes. The data and information value chain supports the business in creating business value, as shown in Figure 1. Key data management sub-capabilities, such as data governance, metadata management, data quality, data modeling, and information systems (data and application) architecture, are needed to design the chain.
An IT-related set of data management sub-capabilities enables the functioning of the chain.
Data lineage is one of the most abstract and misaligned concepts in data management. Metadata data lineage documents the flows of data across the organization. The concept of data lineage intersects with five other concepts: data chain, data value chain, integration architecture, data flow, and information value chain. Some of these terms are even considered synonyms of “data lineage”. There are different viewpoints on the constituent components of data lineage. Based on an analysis of the concept of data lineage and the requirements of several legislative documents, the following set of components has been specified:
- application landscape
- three levels of data models (conceptual, logical, and physical) with corresponding business rules/ETLs, linked vertically
- business processes and roles
- reports catalogue
- data quality checks and controls catalogues.
(For a more detailed explanation, please check out this article.)
The scheme of the metadata lineage components can be seen in Figure 2.
This scheme supports the implementation of the data management function and demonstrates its relation to data lineage.
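The five components listed above can be pictured as one metadata record. The following is a minimal sketch in Python, not a reference model: the class name `LineageMetadata`, its field names, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LineageMetadata:
    """Hypothetical container for the five data lineage components."""

    applications: List[str] = field(default_factory=list)  # application landscape
    data_models: Dict[str, List[str]] = field(default_factory=dict)  # elements per model level
    processes_and_roles: Dict[str, str] = field(default_factory=dict)  # process -> responsible role
    reports: List[str] = field(default_factory=list)  # reports catalogue
    quality_checks: List[str] = field(default_factory=list)  # data quality checks and controls

    def missing_components(self) -> List[str]:
        """Return the names of components that have not been documented yet."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative values only.
lineage = LineageMetadata(
    applications=["CRM", "Data warehouse"],
    data_models={
        "conceptual": ["Customer"],
        "logical": ["customer.full_name"],
        "physical": ["CRM.CUST.NAME"],
    },
    reports=["Customer overview"],
)
print(lineage.missing_components())
```

A record like this makes it easy to see at a glance which components still have to be documented before lineage can be assembled.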
HOW are the setup of the data management function and the documentation of data lineage linked?
The answer to this question lies in similarities between:
- the deliverables of data management sub-capabilities and key components of data lineage and metadata
- the logical steps of implementation of data management and documentation of data lineage.
Figure 3 demonstrates the process of data management implementation and data lineage documentation.
Step 1. Defining needs and requirements.
Data management has different business stakeholders with specific needs concerning data and information. In Step 1, a company focuses on specifying a feasible scope for the data management initiative. The deliverables include a list of business drivers, stakeholders, and their most urgent information needs. Information is delivered in the form of reports and dashboards. The report catalogue is one of the data lineage components. The scope of the data management initiative will limit the scope of the documentation of data lineage.
When the scope is clear, the corresponding data management tasks and responsibilities should be defined.
Step 2. Dividing tasks and responsibilities
The set of tasks and responsibilities belongs to the data management framework, which is a set of rules and roles. Rules include, but are not limited to, the data management strategy, policies, standards, processes, procedures, and plans. Roles should be linked to data management processes, tasks, and deliverables.
Data lineage is one of the deliverables of data management. Therefore, a company needs to specify and document its understanding of data lineage, its constituent components, and the way to document it (descriptive or automated). Accountabilities regarding data lineage documentation should be assigned to the relevant data management-related roles, such as data governance professionals and data stewards.
Step 3. Building the data management framework.
The implementation of data management is done in several steps.
Step 3.1. Specify data requirements
To meet the information requirements specified in Step 1, the corresponding data should be found, delivered, and processed. Frequently, the relationship between raw data and information is not (fully) known. Data lineage is a means to fill this gap. Usually, data lineage documentation starts with specifying the existing business processes and mapping them to the data sets.
Step 3.2. Document business processes
Business process documentation is not considered to be a part of any of the data management sub-capabilities. Still, this is a required component of data lineage. Most companies begin their data lineage documentation with the analysis of business processes. The performance of business processes is closely related to systems and applications used in these processes.
Step 3.3. Document system and application landscape
Data transformation usually takes place in systems and applications. Documentation of applications and data flows is a part of information systems architecture. At the same time, these flows are mandatory components of data lineage.
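One way to picture documented data flows between applications is as a directed graph that can be walked backwards from a report to its sources. The sketch below is only an illustration under stated assumptions: the application names in `FLOWS` and the helper `upstream_applications` are hypothetical and do not come from any particular lineage tool.

```python
from collections import defaultdict, deque

# Hypothetical data flows: (source application, target application).
FLOWS = [
    ("CRM", "Staging"),
    ("Billing", "Staging"),
    ("Staging", "Data warehouse"),
    ("Data warehouse", "Reporting tool"),
]


def upstream_applications(flows, target):
    """Return every application whose data eventually reaches `target`."""
    # Index the flows by destination so they can be followed backwards.
    sources_of = defaultdict(set)
    for source, destination in flows:
        sources_of[destination].add(source)

    # Breadth-first walk from the target towards the original sources.
    found, queue = set(), deque([target])
    while queue:
        current = queue.popleft()
        for source in sources_of[current]:
            if source not in found:
                found.add(source)
                queue.append(source)
    return found


print(sorted(upstream_applications(FLOWS, "Reporting tool")))
```

Even a flat list of flows like this is enough to answer the basic lineage question of where the data in a given application originally comes from.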
Step 3.4. Develop conceptual, logical, and physical data models and link them with each other.
Data flows can be documented at different levels of the data model: conceptual, logical, and physical. A company might document data flow/data lineage at any one of these levels or at a combination of them. These models and the links between them are deliverables of data modeling and data architecture. At the same time, they are mandatory components of data lineage. The links between different levels of data models are called either ‘vertical data lineage’ or ‘linkage’.
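The vertical links between model levels can be sketched as two simple mappings that are followed from a business concept down to physical columns. This is a minimal illustration, not a modeling standard: the element names and the helpers `physical_columns` and `unlinked_attributes` are assumptions made for the example.

```python
# Hypothetical vertical links: each mapping connects an element to the
# elements one level below it (conceptual -> logical -> physical).
CONCEPTUAL_TO_LOGICAL = {
    "Customer": ["customer.full_name", "customer.birth_date"],
}
LOGICAL_TO_PHYSICAL = {
    "customer.full_name": ["CRM.CUST.NAME"],
    "customer.birth_date": ["CRM.CUST.DOB"],
}


def physical_columns(concept):
    """Follow the vertical links from a business concept down to physical columns."""
    columns = []
    for attribute in CONCEPTUAL_TO_LOGICAL.get(concept, []):
        columns.extend(LOGICAL_TO_PHYSICAL.get(attribute, []))
    return columns


def unlinked_attributes():
    """List logical attributes that have no physical counterpart yet."""
    return [
        attribute
        for attributes in CONCEPTUAL_TO_LOGICAL.values()
        for attribute in attributes
        if not LOGICAL_TO_PHYSICAL.get(attribute)
    ]


print(physical_columns("Customer"))
print(unlinked_attributes())
```

The second helper shows why the links matter: a logical attribute without a physical counterpart is a gap in the vertical lineage.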
Step 3.5. Identify critical data elements
Defining critical data elements is a state-of-the-art task, and practical techniques exist for specifying them. Applying the concept of critical data elements supports the prioritization of data management initiatives, including building data quality checks and controls. The set of critical data elements is a deliverable of the data modeling sub-capability. Knowing data lineage is a mandatory prerequisite for specifying critical data elements.
The specification of data quality requirements and the corresponding checks and controls are deliverables of the data quality sub-capability. Data quality checks and controls are considered a component of data lineage.
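A catalogue of data quality checks can be as simple as a mapping from each critical data element to a named rule. The sketch below assumes two hypothetical elements (`customer_id`, `birth_year`), illustrative rule names, and an arbitrary validity range; none of these come from the text above.

```python
# Hypothetical catalogue: critical data element -> (check name, rule).
CHECKS = {
    "customer_id": ("completeness", lambda value: value not in (None, "")),
    # The 1900-2100 range is an assumption made for this example.
    "birth_year": ("validity", lambda value: isinstance(value, int) and 1900 <= value <= 2100),
}

# Illustrative records only.
RECORDS = [
    {"customer_id": "C-001", "birth_year": 1985},
    {"customer_id": "", "birth_year": 1990},
    {"customer_id": "C-003", "birth_year": 1785},
]


def run_checks(records, checks):
    """Return (row index, element, check name) for every failed check."""
    failures = []
    for index, record in enumerate(records):
        for element, (check_name, passes) in checks.items():
            if not passes(record.get(element)):
                failures.append((index, element, check_name))
    return failures


for failure in run_checks(RECORDS, CHECKS):
    print(failure)
```

Keeping the rules in a catalogue rather than scattered through code is what lets the checks be documented alongside the other lineage components.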
Step 3.6. Assemble data lineage
Only when all of the above-mentioned steps are done can data lineage be assembled. At this point, an organization can begin the implementation of the data management function.
Step 4. Intermediate assessment and gap analysis
This step compares the desired results specified in Step 1 with the results actually achieved. The organization can also perform a maturity assessment of data management.
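In its simplest form, such a comparison is a set difference between what was planned and what has been delivered. The following sketch assumes hypothetical deliverable names and a basic coverage ratio; it is not a maturity model.

```python
# Hypothetical scope from Step 1 and the deliverables documented so far.
DESIRED = {
    "report catalogue",
    "business processes",
    "application landscape",
    "data models",
    "data quality checks",
}
ACHIEVED = {"report catalogue", "application landscape", "data models"}


def gap_analysis(desired, achieved):
    """Return the open gaps and the share of the desired scope already covered."""
    gaps = sorted(desired - achieved)
    coverage = len(desired & achieved) / len(desired) if desired else 1.0
    return gaps, coverage


gaps, coverage = gap_analysis(DESIRED, ACHIEVED)
print(gaps)
print(f"{coverage:.0%}")
```

The list of gaps then feeds directly into the planning of further actions.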
Step 5. Planning further actions
As soon as the company has achieved the desired results, it might want to extend the scope of its data management initiative, including the scope of data lineage to be documented.
Data lineage is a foundational aspect of data management, and documenting the processes and results of data lineage can support the establishment of an enterprise data management function.