How should an organization document its data lineage? There are several approaches from which to choose
When dealing with data lineage, one of the main questions is: how should it be documented?
There are a few crucial decisions to be made that would help find an answer to this question:
- What is the scope of data lineage in the company situation?
- What method of documentation to choose: descriptive or automated?
- What kind of lineage is required: process design data lineage or value data lineage?
The sections below explain each decision and what it means for an organization.
The scope of data lineage
To specify the meaning of data lineage ‘scope’ it is important to review the concepts of ‘horizontal and vertical data lineage.’ Horizontal data lineage represents the path along which data flows starting from its point of origin to the point of its usage. Horizontal data lineage can be documented on different data model levels such as conceptual, logical and physical. Links between the components of data lineage on these different levels are often called ‘vertical data lineage’.
Both concepts are illustrated in Figure 1:
Figure 1: An illustration of ‘horizontal’ and ‘vertical’ data lineage
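The two concepts can also be expressed as a small data structure. The following is a minimal sketch, not a reference model: the `Node` class, the `downstream` helper, and all entity and table names are hypothetical and chosen only for illustration. Horizontal lineage is modeled as edges between nodes on the same level, and vertical lineage as links between nodes on different levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    level: str  # "conceptual", "logical" or "physical"


# Horizontal lineage: directed edges between nodes on the SAME level,
# running from the point of origin towards the point of usage.
horizontal = [
    (Node("crm.customer", "physical"), Node("dwh.dim_customer", "physical")),
    (Node("dwh.dim_customer", "physical"), Node("rpt.customer_summary", "physical")),
]

# Vertical lineage: links between nodes on DIFFERENT levels that describe
# the same thing (here a logical entity and its physical table).
vertical = [
    (Node("Customer", "logical"), Node("crm.customer", "physical")),
]


def downstream(start: Node, edges: list[tuple[Node, Node]]) -> list[Node]:
    """Return every node reachable from `start` by following the edges."""
    reached, frontier = [], [start]
    while frontier:
        current = frontier.pop(0)
        for source, target in edges:
            if source == current and target not in reached:
                reached.append(target)
                frontier.append(target)
    return reached


path = downstream(Node("crm.customer", "physical"), horizontal)
print([node.name for node in path])
```

Keeping the two kinds of link in separate collections mirrors the distinction made in the text: a horizontal path can be followed on one level without needing the other levels to be documented.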
The scope of data lineage is based on the following:
A list of (critical) data sets or elements for which data lineage will be documented.
Critical data elements are those elements that derive from the scope of the data management initiative and those that have the greatest impact on company performance. The larger the organization, the less likely it is to be able to document data lineage for all existing data sets, because data lineage documentation consumes significant time and resources.
Therefore, choosing the critical data sets or elements is the best way to make data lineage documentation attempts feasible.
How to define critical data elements (CDEs) is a complicated topic in its own right and is outside the scope of this article.
The ‘length’ of the horizontal data lineage.
The intention may be to document data lineage from the ‘golden’ source to its ultimate destination, and larger companies will have longer chains. In practice, many organizations choose to document only a part of the whole data lineage chain. The starting point of data lineage can be ‘relative’, corresponding to the needs and the scope of the whole data lineage initiative. In this respect, a ‘golden’ source could be some application in the middle of the complete chain.
The ‘depth’ of the vertical data lineage.
Data lineage can be documented on any of the three data model levels: on one level only, or on two levels combined. The number of levels depends mostly on the chosen method of documentation.
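The three elements of scope described above (the list of critical elements, the ‘length’ and the ‘depth’) can be written down explicitly. The sketch below is only one possible way to do so; the `LineageScope` class, its field names, and the system and element names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LineageScope:
    critical_elements: set[str]  # data elements selected for documentation
    start_system: str            # 'relative' golden source where the chain starts
    end_system: str              # point of usage where the chain ends
    levels: tuple[str, ...]      # data model levels to document ('depth')


def in_scope(scope: LineageScope, element: str, level: str) -> bool:
    """An element is documented only if it is critical and the level is chosen."""
    return element in scope.critical_elements and level in scope.levels


def chain_in_scope(scope: LineageScope, full_chain: list[str]) -> list[str]:
    """Cut the complete chain down to the documented 'length'."""
    start = full_chain.index(scope.start_system)
    end = full_chain.index(scope.end_system)
    return full_chain[start:end + 1]


scope = LineageScope(
    critical_elements={"customer_id", "net_revenue"},
    start_system="data_warehouse",  # an application in the middle of the chain
    end_system="finance_report",
    levels=("conceptual", "logical"),
)
full_chain = ["crm", "billing", "data_warehouse", "finance_mart", "finance_report"]
print(chain_in_scope(scope, full_chain))
```

Here the data warehouse acts as the ‘relative’ golden source, so the upstream CRM and billing applications fall outside the documented chain.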
Method of documentation: descriptive vs automated
Most companies start their journey with descriptive data lineage. What does descriptive data lineage mean?
Descriptive data lineage
Descriptive data lineage means that the description of data lineage is created manually in some application. The most commonly used applications are Microsoft Office PowerPoint, Word, Excel and Visio. Some well-known data governance applications can also be used for documenting data lineage. Regardless of the chosen tooling, descriptive data lineage has several common features:
- It is time- and resource-consuming, even when the work is done with dedicated data governance applications.
- Data lineage is usually documented at either the conceptual or the logical level.
  - Conceptual level. On this level, the following components are documented:
    - business processes
    - applications, including reports and ‘golden’ sources
    - data sets or data terms interlinked with the help of restriction rules; data sets should be provided with business definitions.
  - Logical level. On this level, the following components of data lineage are linked:
    - applications, including databases, interfaces, reports, and ‘golden’ sources
    - data entities and data attributes interlinked with the help of business rules; data entities should be provided with corresponding business definitions, and business rules should be kept in a dedicated repository
    - data checks and controls.
- Although it requires a tremendous amount of time and effort, some IT professionals manage to document data lineage at the physical level in Excel. Such an approach is not recommended; for the physical level, an automated solution is the better option.
- If data lineage is documented on at least two levels, vertical data lineage between those levels should be created:
  - Data sets or terms at the conceptual level are to be linked with data entities and attributes at the logical level.
  - Restriction rules are to be linked to business (transformation) rules.
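The vertical links described in the last point can be kept as simple mappings, whatever tool holds them. The following sketch assumes hypothetical terms, attributes and rule identifiers (`R1`, `BR7`) and a hypothetical helper, `unlinked`, that reports upper-level items with no counterpart on the lower level.

```python
# Conceptual level: data terms and restriction rules (hypothetical examples).
conceptual_terms = {"Customer", "Net Revenue"}
restriction_rules = {"R1": "Net revenue excludes VAT"}

# Logical level: entity attributes and business (transformation) rules.
logical_attributes = {
    "Customer.customer_id",
    "Invoice.net_amount",
    "Invoice.vat_amount",
}
business_rules = {"BR7": "net_amount = gross_amount - vat_amount"}

# Vertical lineage: conceptual items linked to their logical counterparts.
term_to_attributes = {
    "Customer": ["Customer.customer_id"],
    "Net Revenue": ["Invoice.net_amount"],
}
restriction_to_business_rules = {"R1": ["BR7"]}


def unlinked(items, links):
    """Return upper-level items that have no link to the lower level."""
    return sorted(item for item in items if not links.get(item))


print(unlinked(conceptual_terms, term_to_attributes))
print(unlinked(restriction_rules, restriction_to_business_rules))
```

A check of this kind is one way to find gaps in manually maintained documentation before they are noticed by users.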
Automated data lineage
Automated data lineage means that metadata is recorded automatically at the physical level of data processing, using one of the applications available on the market. Several vendors offer automated metadata capture. An extended list of providers of such solutions can be found on www.metaintegration.com. The company provides meta integration components to major providers of metadata lineage functionality. Such a solution requires the following:
- Carefully defining the scope of the data lineage initiative, as automated data lineage is an expensive and resource-consuming task.
- Managing expectations, since the more legacy systems there are, the more difficulties can be expected.
There are two sources of metadata for automated data lineage:
‘As built’ – meaning that metadata which is needed for the solution is already available in the database structure.
‘As designed’ – meaning that information can be read from the documentation of system design, for example, from data modeling tools.
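To illustrate what ‘as built’ metadata looks like, the sketch below reads table and column structure directly from a database catalog. It uses Python's built-in `sqlite3` module and an in-memory database purely as an assumption for the example; the tables are hypothetical, and commercial lineage tools use their own connectors. Note that foreign keys describe structural relationships, not data flows, so this is only the raw material from which lineage is derived.

```python
import sqlite3

# Hypothetical in-memory database standing in for a real application.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE invoice (
        invoice_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        net_amount REAL
    );
""")


def as_built_metadata(connection):
    """Read tables, columns and foreign keys from the database structure."""
    tables = [row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    columns, references = {}, []
    for table in tables:
        # PRAGMA table_info rows: cid, name, type, notnull, dflt_value, pk
        columns[table] = [
            row[1] for row in connection.execute(f"PRAGMA table_info({table})")]
        # PRAGMA foreign_key_list rows: id, seq, table, from, to, ...
        for row in connection.execute(f"PRAGMA foreign_key_list({table})"):
            references.append((table, row[3], row[2], row[4]))
    return columns, references


columns, references = as_built_metadata(conn)
print(columns)
print(references)
```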
Often, legacy systems cannot provide either of these metadata sources, which may make reverse engineering necessary. It is therefore recommended to check all applications before making decisions about automated data lineage, and to keep the following in mind:
- When purchasing specialized metadata software, it is important to consider which set of modules is needed to obtain automated data lineage. Since different types of metadata need different repositories, for example a metadata repository and a business rules repository, it is important to identify what types of lineage are required.
- Existing software usually allows documentation of data lineage at the physical level. In some cases, it may also provide a capability to document data lineage at the logical level. Still, documentation at the conceptual and logical levels will have to be done manually, as will the mapping between the physical and logical levels.
- The number of metadata lineage object types can run into the dozens or even hundreds. The corresponding amount of metadata can therefore exceed your storage capacity.
- Frequently, business stakeholders overestimate the capabilities of automated data lineage.
Different groups of stakeholders have different requirements for data lineage. There are at least two key stakeholder groups: IT professionals, and business users such as financial and business controllers, business analysts and auditors. The key expectations of business users are the ability to follow changes in data values and in the meaning of data, and the ability to get historical information on data processing up to 5-7 months in the past.
An automated solution cannot satisfy these requirements. Automated data lineage is documentation of the data processing design. Business users may instead require what is called ‘value data lineage’, since design-level data lineage does not address most of their requirements. Value data lineage can also be described as ‘drill-down capabilities’. The most challenging aspect is that none of the existing data lineage providers have a solution for ‘value’ data lineage.
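The difference between the two kinds of lineage can be made concrete with a small sketch. It does not describe any vendor's product; the structures, the `drill_down` helper, and all names and figures are hypothetical. Design lineage records how a target is derived, once. Value lineage would record the actual input values behind a result for every run, which is what makes drilling down into a specific figure possible.

```python
from dataclasses import dataclass
from datetime import date

# Process design lineage: HOW a target is derived, documented once.
design_lineage = {
    "report.net_revenue": {
        "sources": ["invoice.gross_amount", "invoice.vat_amount"],
        "rule": "gross_amount - vat_amount",
    }
}


@dataclass
class ValueRecord:
    """Value lineage: WHICH values produced a result in one specific run."""
    run_date: date
    target: str
    value: float
    inputs: dict[str, float]


value_log = [
    ValueRecord(date(2024, 1, 31), "report.net_revenue", 100.0,
                {"invoice.gross_amount": 120.0, "invoice.vat_amount": 20.0}),
    ValueRecord(date(2024, 2, 29), "report.net_revenue", 90.0,
                {"invoice.gross_amount": 110.0, "invoice.vat_amount": 20.0}),
]


def drill_down(target: str, run_date: date):
    """Return the input values behind a reported figure, or None if unknown."""
    for record in value_log:
        if record.target == target and record.run_date == run_date:
            return record.inputs
    return None


print(design_lineage["report.net_revenue"]["rule"])
print(drill_down("report.net_revenue", date(2024, 2, 29)))
```

The design entry stays the same across both runs, while the value log grows with every run, which hints at why historical value tracking is a different problem from documenting the processing design.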