Data lineage is an important aspect of a variety of legislation affecting data collected from individuals and organizations.
Several pieces of legislation have revived interest in data lineage. The Basel Committee on Banking Supervision’s standard number 239, “Principles for effective risk data aggregation and risk reporting” (BCBS239 or PERDARR), and the EU General Data Protection Regulation (GDPR) are two key pieces of legislation that require data lineage, even though the term is never literally mentioned in them.
The general practice is that data management professionals investigate the requirements and translate them into data management language.
Key data lineage components from the data management perspective
Data lineage consists of the following interlinked components, shown in Figure 1:
Figure 1 Key data lineage components from legislative viewpoint
IT systems (application, database, network segment)
Data flows through the chain of systems or applications in which data is being transformed and integrated.
‘Golden sources’ and reports/dashboards are two boundaries that denote, respectively, the point where data is created and its final destination.
Business processes describe the set of activities related to data processing. Business processes usually include references to the related applications.
Data elements themselves form the key components of data lineage. Data elements can be specified on different levels of abstraction and details. Usually, this is done on one of the following data model levels:
- Conceptual: data elements are presented in the form of terms and related constraints.
- Logical, application related: data entities & attributes of a specific database and related data transformation rules.
- Logical, not application related: data entities & attributes and related data transformation rules.
- Physical: tables & columns & related ETLs (Extract, Transform, Load).
Data elements on different levels of data models need to be linked. Such a link is sometimes called ‘vertical data lineage’, as opposed to ‘horizontal data lineage’, which represents the path of data from the point of origination to the point of usage. In any case, physical data models are always linked to a specific application.
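The vertical links between model levels can be sketched as a small data structure. This is a minimal illustration, not a prescribed implementation; the term, attribute, column, and application names are invented for the example:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataElement:
    level: str                          # "conceptual", "logical", or "physical"
    name: str                           # term, entity.attribute, or table.column
    application: Optional[str] = None   # physical models always belong to one application
    maps_to: List["DataElement"] = field(default_factory=list)  # link one level down

# Conceptual term -> logical attribute -> physical column (vertical lineage)
term = DataElement("conceptual", "Counterparty")
attribute = DataElement("logical", "Counterparty.LegalName")
column = DataElement("physical", "CPTY.LEGAL_NM", application="RiskDataMart")
term.maps_to.append(attribute)
attribute.maps_to.append(column)

def vertical_lineage(element):
    """Walk the vertical links from a term down to its physical columns."""
    path = [f"{element.level}:{element.name}"]
    for child in element.maps_to:
        path.extend(vertical_lineage(child))
    return path

print(vertical_lineage(term))
# ['conceptual:Counterparty', 'logical:Counterparty.LegalName', 'physical:CPTY.LEGAL_NM']
```

Following such links downward answers the question ‘where is this business term physically stored, and in which application?’.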
Data checks and controls
According to the definition of data lineage specified by the Enterprise Data Management Council in The Standard Glossary of Data Management Concepts, ‘lineage may include a mapping of the data controls’.
There are certain requirements in the legislation that can be interpreted as components of data lineage, see Figure 2:
Figure 2: Legislative requirements in relation to data lineage
Information / reports
BCBS239 stresses that ‘the right information needs to be presented to the right people at the right time’, followed by requirements to ‘distribute risk reports to the relevant parties’. In Figure 2, this component is noted as ‘Dashboards/Reports’.
BCBS239 also specifies that it is necessary ‘to document and explain all of their risk data aggregation processes whether automated or manual’.
BCBS239 also draws organizations’ attention to a business dictionary, which defines ‘the concepts used in a report such that data is defined consistently across the organization’.
From the data management perspective on data lineage, the business dictionary or glossary, i.e. the set of business terms, corresponds to the conceptual level of data models.
Data elements and business rules at logical level
BCBS239 points out the requirement to maintain an ‘inventory and classification of risk data items’, which can be translated as data elements at the logical level of data models. In addition, ‘automated and manual edit and reasonableness checks, including an inventory of the validation rules that are applied to quantitative information’ are also required. ‘The inventory should include explanations of the conventions used to describe any mathematical or logical relationships that should be verified through these validations or checks’. In the language of data management, this is interpreted as a repository of business rules.
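Such a repository of business rules can be sketched as an inventory that pairs each convention with an executable check. The data items, conventions, and thresholds below are illustrative assumptions, not taken from BCBS239 itself:

```python
# Illustrative inventory of validation rules: each entry records the data item
# it applies to, its plain-language convention, and an executable check.
rules = [
    {
        "data_item": "exposure_amount",
        "convention": "Exposure must be non-negative",
        "check": lambda record: record["exposure_amount"] >= 0,
    },
    {
        "data_item": "pd",  # probability of default
        "convention": "PD must lie in [0, 1]",
        "check": lambda record: 0.0 <= record["pd"] <= 1.0,
    },
]

def validate(record):
    """Return the conventions violated by a single record."""
    return [r["convention"] for r in rules if not r["check"](record)]

print(validate({"exposure_amount": -5, "pd": 0.02}))
# ['Exposure must be non-negative']
```

Keeping the plain-language convention next to the executable check is what lets the inventory double as documentation of ‘the conventions used to describe any mathematical or logical relationships’.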
One of BCBS239 principles states that ‘a bank should design, build and maintain data architecture and IT infrastructure which fully supports its risk data aggregation capabilities and risk reporting practices’.
GDPR requires that a company should ‘implement appropriate technical and organizational measures to ensure and to be able to demonstrate that processing is performed in accordance with this Regulation’. Several articles in GDPR (e.g. 24, 25, and 32) focus on the necessity of appropriate technical and organizational measures to ensure proper processing of personal data.
Even if there is no direct requirement to document data flow through applications, every data management professional still ‘translates’ these requirements as such.
Business and technical metadata
Metadata is one of the crucial components of data lineage. Metadata describes all other types of data, including all other components of data lineage which are mentioned above.
BCBS239 stresses the necessity to record business metadata, for example, in the form of ‘ownership of risk data and information for both the Business and IT function’. It also recommends documenting ‘integrated data taxonomies and architecture […], which includes information on the characteristics of the data (metadata) as well as use of single identifiers and / or unified naming conventions for data including legal entities, counterparties, customers and accounts’. This last requirement is related to both business and technical metadata.
GDPR has extended requirements for recording personal information, such as the requirement that ‘each controller […] shall maintain a record of processing activities under its responsibility. That record shall contain all of the following information: (b) the purposes of the processing; (c) a description of the categories of data subjects and of the categories of personal data; (d) the categories of recipients to whom the personal data have been or will be disclosed including recipients in third countries or international organizations; (e) where applicable, transfers of personal data to a third country or an international organization, including the identification of that third country or international organization; (f) where possible, the envisaged time limits for erasure of the different categories of data; (g) where possible, a general description of the technical and organizational security measures’. Everything that is mentioned in Article 30 of GDPR can be recognized as business metadata.
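A record of processing activities structured around the Article 30 items quoted above might look as follows. The field names and the sample values are illustrative; the GDPR prescribes the content, not the representation:

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

# Sketch of an Article 30 record of processing activities; each field
# corresponds to one of the quoted items (b)-(g). Values are invented.
@dataclass
class ProcessingRecord:
    purposes: str                           # (b) purposes of the processing
    data_subject_categories: List[str]      # (c) categories of data subjects
    personal_data_categories: List[str]     # (c) categories of personal data
    recipient_categories: List[str]         # (d) categories of recipients
    third_country_transfers: Optional[str]  # (e) where applicable
    erasure_time_limits: Optional[str]      # (f) where possible
    security_measures: Optional[str]        # (g) where possible

record = ProcessingRecord(
    purposes="Credit risk scoring",
    data_subject_categories=["customers"],
    personal_data_categories=["contact details", "account history"],
    recipient_categories=["internal risk function"],
    third_country_transfers=None,
    erasure_time_limits="7 years after contract end",
    security_measures="encryption at rest, role-based access",
)
print(asdict(record)["purposes"])
# Credit risk scoring
```

Storing such records alongside the data lineage repository makes the ‘business metadata’ reading of Article 30 concrete: each processing activity becomes a queryable record rather than a document on a shelf.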
Furthermore, to ensure the exercise of certain rights of the data subject, a company definitely needs knowledge of metadata and data lineage capabilities in place. This applies, for example, to such data subject rights as the ‘right to obtain from the controller the erasure of personal data concerning him or her’, the ‘right to obtain from the controller restriction of processing’, and ‘the right to receive the personal data […] in a structured, commonly used and machine-readable format and […] the right to transmit those data to another controller’. Knowing how data flows through applications at the physical level seems to be an unavoidable requirement.
Data (quality) controls
BCBS239 (PERDARR) states directly the necessity to ‘measure and monitor accuracy of data’. It stresses that ‘Banks must produce aggregated risk data that is complete and measure and monitor the completeness of their risk data’ and ‘controls surrounding risk data should be as robust as those applicable to accounting data’. ‘Integrated procedures for identifying, reporting and explaining data errors or weaknesses in data integrity via exceptions reports’ are to be in place.
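A completeness control of the kind BCBS239 describes can be sketched as a routine that both measures completeness per data item and produces an exceptions report. The required fields and sample records are assumptions for the illustration:

```python
# Minimal sketch of a completeness control with an exceptions report.
# The list of required fields is an illustrative assumption.
REQUIRED_FIELDS = ["counterparty_id", "exposure_amount", "rating"]

def completeness_report(records):
    """Measure completeness per required field and list exceptions."""
    exceptions = []
    counts = {f: 0 for f in REQUIRED_FIELDS}
    for i, rec in enumerate(records):
        for f in REQUIRED_FIELDS:
            if rec.get(f) is None:
                exceptions.append({"row": i, "field": f, "error": "missing value"})
            else:
                counts[f] += 1
    total = len(records)
    completeness = {f: counts[f] / total for f in REQUIRED_FIELDS} if total else {}
    return {"completeness": completeness, "exceptions": exceptions}

report = completeness_report([
    {"counterparty_id": "C1", "exposure_amount": 100.0, "rating": "AA"},
    {"counterparty_id": "C2", "exposure_amount": None, "rating": "B"},
])
print(report["exceptions"])
# [{'row': 1, 'field': 'exposure_amount', 'error': 'missing value'}]
```

The two outputs map directly onto the quoted requirements: the completeness ratios ‘measure and monitor the completeness’ of the data, while the exceptions list feeds the ‘exceptions reports’ for identifying and explaining data errors.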
GDPR, in contrast to BCBS239, focuses on ‘technical and organizational measures to ensure a level of security appropriate to the risk’ related to the processing of personal data.
Based on the requirements of these pieces of legislation, there are several components of data flow/lineage that a company should document and maintain:
- Report (catalogue)
- Application flow
- Conceptual level of data model: terms and business dictionary
- Logical level of data model: data entities and repository of related business / validation rules
- Physical level of data model: database schemas and ETL repository
- Business processes
- Data (quality) checks and controls
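The application-flow component in this list, i.e. horizontal data lineage from a ‘golden source’ to a report, can be sketched as a simple directed graph. The system names below are invented for the example:

```python
# Illustrative horizontal lineage: the application flow from a 'golden source'
# to a report, stored as a directed graph (system -> downstream systems).
flow = {
    "CoreBanking": ["RiskDataMart"],    # golden source -> integration layer
    "RiskDataMart": ["RiskDashboard"],  # integration layer -> report
    "RiskDashboard": [],                # final destination
}

def lineage_path(source, target, graph):
    """Depth-first search for the path data takes from source to target."""
    if source == target:
        return [source]
    for nxt in graph.get(source, []):
        rest = lineage_path(nxt, target, graph)
        if rest:
            return [source] + rest
    return []

print(lineage_path("CoreBanking", "RiskDashboard", flow))
# ['CoreBanking', 'RiskDataMart', 'RiskDashboard']
```

Once each edge in such a graph is annotated with the business processes, data elements, and controls listed above, the documented flow can answer both BCBS239-style questions (‘which systems feed this risk report?’) and GDPR-style ones (‘which systems must act on an erasure request?’).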