Data lineage is an essential component of business metadata management. Although it is often overlooked, its value can be seen in many areas.
There is a growing interest in data lineage for many reasons, across all areas of the enterprise data management community, especially as business metadata becomes more necessary to non-IT professionals.
Several groups of stakeholders within any company might be interested in data lineage. Formerly, only the Information Technology (IT) department understood the concept of data lineage and its value. As the explosion of data has affected every business area, business stakeholders have embraced the need for improved metadata management and a deeper approach to data lineage. Stakeholders in finance and risk have become the biggest data lineage enthusiasts.
This growing interest in data lineage has several causes:
- appearance of new legislation requirements
- business changes
- an increase in data quality initiatives
- supervisor and audit requirements.
Defining “data lineage”
Industry reference guides provide some definitions of data lineage.
However, the definition of data lineage is ambiguous and overlaps with other data-related terms, such as “data flow”, “integration architecture”, and “data and information (value) chain.”
These definitions can serve as a basis for understanding:
Data flow is “the transfer of data between systems, applications, or data sets. Data lineage is a description of the pathway from the data source to their current location and the alterations made to the data along the pathway.”
“[…] data […] has lineage (i.e., a pathway along which it moves from its point of origin to its point of usage, sometimes called the data chain).”
“Data flows are a type of data lineage / metadata documentation that depicts how data moves through business processes and systems. End-to-end data flows illustrate where the data originated, where it is stored and used, and how it is transformed as it moves inside and between diverse processes and systems.”
None of these definitions is precise, and all overlap to some extent.
“Data flow”, “data lineage” and “data chain” are terms that describe similar concepts of data movement and transformation. Therefore, these terms are often used interchangeably. For the purposes of this article, data lineage is a description of the path along which data flows from the point of its origin to the point of its use.
Still, the definitions say nothing about documenting data lineage. To understand the way to document this movement, it is important to know the components that constitute data lineage.
Data lineage components
The same guides clarify the components of data lineage.
“Data flows map and document relationships between data and:
- Application within a business process
- Data stores or databases in an environment
- Network segments (useful for security mapping)
- Business roles, depicting which roles have responsibility for creating, updating, using and deleting data
- Location where local differences occur.”
Therefore, the key components of data flow / lineage are IT system components (applications, databases, network segments) and business processes.
TOGAF 9.1 by The Open Group, the leading guide in enterprise architecture, stipulates, “The Data Flow view is concerned with storage, retrieval, processing, archiving, and security of data.”
The TOGAF 9.1 definition seems to have nothing in common with the definitions from the other reference guides. Rather, it refers to the concept of a data lifecycle.
Several legislative requirements drive interest in data lineage: the Basel Committee on Banking Supervision’s standard number 239, “Principles for effective risk data aggregation and risk reporting” (BCBS 239 or PERDARR), the EU General Data Protection Regulation (GDPR), IFRS 9, TRIM (Targeted Review of Internal Models) and others. Many specialists consider data lineage the ultimate remedy for meeting these requirements. At the same time, the term “data lineage” is never mentioned directly in these regulatory documents.
All conclusions about the necessity of data lineage are based on careful investigation of the legislative requirements and subsequent matching of these requirements to data management methods and techniques, of which data lineage forms a part.
A company often deals with different types of business changes, such as changes in information needs and requirements, changes in the application landscape, and organizational changes. As an example, consider a change in the database of a business application. Usually, data is transformed and processed through a chain of applications, as shown in Figure 1:
Figure 1: A chain of applications
For convenience, the chain consists of just a few applications, but in reality, especially in large companies, such chains consist of dozens of applications.
Suppose the database of one application in the chain is changed, for example that of “Company web-page” (the starting point of the chain, on the left side of Figure 1). Professionals will then need to estimate all required changes in the downstream applications, including the impact on the end reports and dashboards. Recorded data lineage eases this impact analysis.
Conversely, if changes affect information and reporting requirements (the end point of the chain in Figure 1), professionals will need to perform a root-cause analysis to assess which data is required to produce the new information, where that data should come from, and how it should be transformed. Such an analysis is much easier if the data lineage is already recorded.
Usually, however, knowledge about data processing is kept in the minds of professionals or, in the best-case scenario, on local computers in the form of Word or Excel documents.
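Both analyses amount to walking a recorded lineage in one direction or the other. The following is a minimal sketch, assuming the lineage is stored as a list of (source, target) application pairs; apart from “Company web-page”, all application names and the function names are hypothetical and do not reproduce Figure 1.

```python
from collections import deque

# Hypothetical chain of applications. Only "Company web-page" appears in the
# text; the other names are invented for illustration.
EDGES = [
    ("Company web-page", "CRM"),
    ("CRM", "Data warehouse"),
    ("Billing", "Data warehouse"),
    ("Data warehouse", "Finance report"),
    ("Data warehouse", "Risk dashboard"),
]


def reachable(edges, start, reverse=False):
    """Return every application reachable from `start`, in breadth-first order.

    With reverse=False the walk follows the data flow (downstream);
    with reverse=True it walks against the data flow (upstream).
    """
    graph = {}
    for source, target in edges:
        if reverse:
            source, target = target, source
        graph.setdefault(source, []).append(target)

    visited = {start}
    found = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                found.append(neighbour)
                queue.append(neighbour)
    return found


def impact_of_change(edges, changed_application):
    """Impact analysis: which downstream applications may need changes?"""
    return reachable(edges, changed_application)


def sources_for(edges, report):
    """Root-cause analysis: which upstream applications feed this report?"""
    return reachable(edges, report, reverse=True)


if __name__ == "__main__":
    print(impact_of_change(EDGES, "Company web-page"))
    print(sources_for(EDGES, "Finance report"))
```

The same edge list answers both questions, which is the practical argument for recording lineage once rather than reconstructing it for each change.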
Many organizations run a variety of initiatives around the quality of data. In large international companies, a major data quality program may require several years for development and implementation, and longer for the user community to judge it successful. Unfortunately, many business stakeholders and IT staff do not understand the essential part that accurate data lineage plays in resolving data quality issues. For example, data lineage plays a key role in the root-cause analysis performed while investigating data quality issues.
Supervisor and audit requirements
Supervisor and audit requirements affect the use of data lineage in every organization. Increasingly, supervisors require companies to provide granular data in support of the reported results, in addition to aggregated reports. Finance and risk functions in particular often face requirements to explain how critical metrics and figures in their reports have been derived. To do so, professionals must be able to trace back the full chain of data transformations and explain each step along the path. This requires knowledge of end-to-end data lineage.
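Tracing a reported figure back through its transformations can be sketched as a recursive walk over derivation records. The example below is illustrative only: the element names, the transformation descriptions and the `trace_back` function are hypothetical, and the sketch assumes the derivations contain no cycles.

```python
# Hypothetical derivation records:
# element -> (description of the transformation, list of input elements)
DERIVATIONS = {
    "report.total_exposure": ("sum of exposure per counterparty", ["dwh.exposure"]),
    "dwh.exposure": ("balance converted to EUR", ["ledger.balance", "ref.fx_rate"]),
}


def trace_back(element, derivations, depth=0):
    """Return indented lines explaining how `element` was derived.

    Elements without a derivation record are treated as original sources.
    Assumes the derivation records are acyclic.
    """
    indent = "  " * depth
    if element not in derivations:
        return [f"{indent}{element} (source)"]

    rule, inputs = derivations[element]
    lines = [f"{indent}{element} <- {rule}"]
    for parent in inputs:
        lines.extend(trace_back(parent, derivations, depth + 1))
    return lines


if __name__ == "__main__":
    for line in trace_back("report.total_exposure", DERIVATIONS):
        print(line)
```

Each output line names one element and the transformation that produced it, which is the kind of step-by-step explanation a supervisor or auditor may ask for.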
- There is no agreed list of components that constitute data lineage.
- These are the essential components of data lineage:
  - IT systems (applications, databases, network segments)
  - Data elements
  - Business processes, including different functional roles (data- and non-data related)
  - Data controls.
Figure 2: Key components of Data Lineage
There are several key points concerning data lineage:
- Data lineage is a representation of the path along which data flows from the point of its origin to the point of its usage.
- Data lineage is used to design and describe processes of data transformation and processing.
- Data lineage is recorded as a set of linked components: data elements, business processes, IT systems and applications, and data controls. These components can be presented at different levels of abstraction and detail. Usually, such a lineage is called a ‘horizontal’ data lineage.
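As a rough illustration of the last point, a horizontal lineage record can be modelled as a list of steps, each linking a source data element to a target data element together with the business process and data controls involved. The class names, field names and sample values below are assumptions made for this sketch, not a standard model. The `system_level` function shows how the same record can be presented at a coarser level of abstraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    system: str  # IT system or application that holds the element
    name: str    # name of the data element within that system


@dataclass(frozen=True)
class LineageStep:
    source: DataElement
    target: DataElement
    business_process: str
    controls: tuple = ()  # data controls applied on this step, if any


def system_level(steps):
    """Collapse element-level steps into distinct system-to-system links."""
    links = []
    for step in steps:
        link = (step.source.system, step.target.system)
        if link[0] != link[1] and link not in links:
            links.append(link)
    return links


# Hypothetical sample lineage, invented for illustration.
STEPS = [
    LineageStep(
        DataElement("CRM", "customer_id"),
        DataElement("Data warehouse", "customer_key"),
        "Customer onboarding",
        ("completeness check",),
    ),
    LineageStep(
        DataElement("CRM", "country"),
        DataElement("Data warehouse", "country_code"),
        "Customer onboarding",
    ),
    LineageStep(
        DataElement("Data warehouse", "customer_key"),
        DataElement("Finance report", "customer_count"),
        "Monthly reporting",
        ("reconciliation",),
    ),
]

if __name__ == "__main__":
    print(system_level(STEPS))
```

Keeping the detailed element-level steps as the stored form and deriving the system-level view from them means the two views cannot drift apart.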