Skip to content.

Sections
Home » Resource Center » Real-World Decision Support (RWDS) Journal » April 2000 - Volume 1, Issue 07 » Generic Meta Data Repository Design - Part 2

Generic Meta Data Repository Design - Part 2

By Michael F. Jennings

Hewitt Associates LLC
100 Half Day Road, MS: 1OP-4S
Lincolnshire, IL 60069
(847) 295-5000
Fax: (847) 295-7634
Email: mfjennin@hewitt.com

Figure 1

Abstract:

Many data warehouse projects face the decision of purchasing or building a meta data repository for their environment. The decision to build a repository is often made due to the lack of data integration between the various data warehouse products or as a result of budgetary constraints on the project. For those projects that make the decision to implement their own meta data repository solution a generic model is often the best choice to advance that effort.

This article explores the design of generic meta data repository data model for use in a decision support system project. The design will address the various base components that are typically required in meta data repository in order to support a data warehouse environment. The intent of the model is to offer those individuals who wish to implement a meta data repository a generic solution that can be integrated into an in-house implementation. Alternatively, this model design can be used to complement a requirement checklist when evaluating meta data repository products in the market place.

Part 1 of this article discussed the design of the generic meta data repository from a business logical modeling or macro level perspective. Part 2 will explore design of the model at a logical modeling level or more micro level.

Assumption:

  • The reader has had some level of database data modeling experience that covers entity relationship diagramming using information engineering notation.
  • The article is intended to provide the experienced reader with sufficient modeling detail to customize, physicalize and implement the meta data repository in a database for their environment.

Defining the Generic Repository Model Design

From part one of this paper, you should now have a clear understanding of the importance a meta data repository has in your decision support environment and of the advantages a generic model can have in either creating or evaluating this essential warehouse product. The base components comprising the generic model will now be reviewed in greater detail in order for you to determine their applicability to your particular decision support environment.

It is important to remember that the generic repository model should only be used as a guide, not as an all encompassing road map. Individual components of the model may be revised, supplemented and or removed depending on the business and infrastructure environment needs of your specific enterprise. Methods of versioning and source control for the model components are not addressed in sufficient detail for most projects, but are left to the discretion and design of the repository architect.

The Data Warehouse Model - Logical and Physical

The four tables the store both logical and physical information about the actual data warehouse model are the Target Table, Target Column, Target Column Map and Target Domain (see below).

Figure 2

The Target Table is able to store both logical and physical views of tables in the warehouse through use of the Target Table Type column, which distinguishes the two categories. An implied business rule for this table is that only physical table types are actually mapped to the source systems feeding information to the warehouse. The Target Table Name column contains either the logical or the actual physical, database name, of the warehouse table. The Target Table Effective Date provides a mean of distinguishing revisions made to a particular table. The Status Code column indicates the current approval status of the warehouse table using domain values such as pending, approved or removed. The Table Business Name column contains the common name of the table that will be presented to users through the front-end access method of the warehouse. The Table Business Alias contains a common alternative name or acronym of the field (e.g., alias for profitability ratio is PR). The Table Business Definition column contains a detailed descriptive explanation of the business process the field plays in the enterprise allowing the end user to make an educated determination about its use. The Table Business Rules is used to denote any conventions or practices the table must adhere to beyond referential integrity constraints. The Business Data Steward column lists the individual and/or groups that are responsible in the organization for defining this particular table. The Integration Flag column is used as a quick query method to denote which tables obtain source information from more than one system.

The Table Column Map table simply provides a cross reference of the logical and/or physical tables to their associated columns.

The Target Column table contains both technical and business information about the field in the table (logical or physical since the Target Table Type is inherited onto this table from its parent). The Target Column Name column contains either the logical or the actual physical, database name, of the warehouse field. The Target Column Effective Date provides a mean of distinguishing revisions made to a particular field. The Base Unit column contains a description or code of the measurement unit used for entries in the field (e.g., dollars, pounds, meters). The Column Business Rules field is used to denote any conventions or practices the column must follow. For example, an amount field must have a corresponding currency value. The Calculation Formula column contains a description of the method used to determine the value stored in the field including the names of other tables/columns used. The Column Business Acronym contains a common acronym coding of the field (e.g., UOM for unit of measure). The Column Business Definition field contains a detailed descriptive explanation of the business meaning of the column in the context of the enterprise. The Column Business Name field contains the common name of the column that will be shown to users through the front-end access method of the warehouse. The Degree of Accuracy column is used to denote the required number of places after the decimal point that entries in this column are required to contain. For example, currency conversions involving the Euro monetary unit should contain six decimals of precision to accurately convert. The Maximum Range and Minimum Range columns indicate the upper and lower bounds of numeric values for the column. The Status Code column indicates the current approval status of the column (e.g., pending, approved, removed). The Length, Data Type and Null Flag columns provide the basic technical information required to describe the fields into a data modeling tool or to a DBMS. For example, a company name column could be described as having a length of hundred, data type of character or varchar and does not allow null values.

The Target Domain table contains definitions of values for columns containing codes (lookup list) only. Each possible domain value or code for a column is stored in this table. For example, the column country code has domain values that include USA for United States and CAN for Canada, etc.

Warehouse Data Sources

The three tables that store physical information about the operational systems that feed the data warehouse information are the Source Table, Source Column and Source Domain.

Figure 3

The Source table contains information about the source system database or extract file used to populate the target tables of the warehouse. The Source ID column is used to uniquely identify a particular system of record (e.g., ERP, Order Management 1, Trouble Ticket, Client Extract 5). The Source Effective Date provides a mean of distinguishing revisions made to an operational system or extract file. The Source Format Type provides a means to identify the category of the source information such as a server/database, directory/file or spreadsheet file. In cases where operational information is being extracted directly from a source database table the Source DBMS column contains the physicals names of these items. The Source Description column contains a detailed narrative describing the particular source of information for the warehouse. The Source Update Frequency column denotes the cycle at which this source is updated (e.g., daily, weekly, monthly, annually). The Status Code column indicates the current approval status of the source system or extract file (e.g., pending, approved, cancelled).

The remaining tables and columns in this meta data repository component, Source Column and Source Domain, follow the same definitions found in the Target component tables, Target Column and Target Domain, by just substituting source in place of target.

Source to Target Mappings

The Source To Target Column and Domain Map tables provide the necessary design strategy to build extraction, transformation and load (ETL) processes by linking the operational sources to the warehouse tables they feed.

Figure 4

Besides the actually mapping documented in these cross references tables, any additional instructions that need to be described are located in the Mapping Semantic Resolution columns on both tables. An example of a meta data repository query from these tables is illustrated below.

Figure 5

Extraction, Transformation and Loading (ETL)

Three tables, ETL Process, ETL Process Source Map and ETL Process Statistics, map warehouse tables and operational sources to the procedures that access them plus capture statistics about load processing.

Note: How your organization actually populates these tables will depend on the ETL methodology and products utilized by your organization thus is outside the scope of this paper.

Figure 6

The ETL Process table contains a Process Id column to uniquely identify the ETL procedure in the warehouse environment. The Process Effective Date identifies when this particular process was included into the load cycle steps. The Process Description column contains a detailed description of the ETL process from a technical and business perspective. The Process Owner column lists the individual and/or groups that are technically responsible for this process.

The ETL Process Statistics table contains both the mapping of the process to a warehouse table or tables plus the date/time statistics of the process for a specific batch cycle load. A single ETL process can load one or many target warehouse tables depending on the sources and/or business requirements involved. The Batch Cycle ID is a sequential identifier assigned during each load cycle to the data warehouse regardless of the refresh frequency (e.g., daily, weekly, monthly, etc.). The Load Date column provides a date/time stamp of when the specific ETL process ran. The Elapsed Time and CPU Time columns provide information on the computer resources that the ETL process used during this batch cycle. The Process Return Code column captures the completion, warning or error return code value from the ETL process for examination by the data acquisition developer. The Process Return Code Message column contains a narrative explanation of warning or error return codes for the particular operating system (if available).

The ETL Process Source Map tables provides a cross reference of ETL processes to operational systems or extract files.

Subject Areas

The Subject Area and Subject Area Table Map provide the end user with a logical grouping or business view of the data warehouse tables (e.g., sales, financial, human resources).

Figure 7

The Subject Area table contains a Subject Area Id column to uniquely identify the group of tables within the enterprise from a business perspective. The Subject Area Effective Date identifies when this particular business view was last update or created. The Status Code column indicates the current approval status of the grouping (e.g., pending, approved, cancelled). The Subject Area Description column contains a detailed description of the business grouping.

The Subject Area Table Map simple groups the warehouse tables, logical and physical, to a subject area.

Warehouse Queries

Two tables, Query Statistics and Query Table Column Hits, map warehouse tables and columns to the query requests that access them plus capture statistics about processing.

Note: The actual method of how your organization populates these tables will depend on the DBMS, reporting tool and/or data-monitoring product being employed thus is also outside the scope of this paper.

Figure 8

The Query Statistics table contains various information about query results made against the warehouse. These requests can be from your warehouse front-end reporting tool (e.g., OLAP) or from adhoc request made directly against the database depending on the informational needs of the DBA's, Data Acquisition Developers, Data Access Developers or the Architect. The Query ID is a sequential identifier assigned by the DBMS or the data monitoring tool to uniquely identify a query request. The Query Start and End Time columns capture the date/time of when the query was initiated and completed. The Number Of Rows Returned column indicates just that the count of rows successfully gathered by the query. The Size of the Result Set column contains the size in kilobytes of the row returned in the query. This can be useful in diagnosing report problems with query reporting tools. The User Id column contains the operating system identifier of the individual or group who requested the query. This column can be useful for identifying business needs from users who have frequent requests or long running queries. The Server and Database Name columns identify which version of the warehouse database is being queried such as development, QA, production or training.

The Query Table Column Hits table provides a cross reference of database queries to warehouse tables and columns.

Summary

For those firms that decide to develop a meta data repository in-house, a generic model can help to quickly advance these efforts by providing a starting template to edit against. This generic model can also assist those firms evaluating meta data repository products in the marketplace.

The generic repository model will need to be tailored for the decision support environment it is being integrated into by the repository architect. Revisions, additions or deletions to the generic repository model will depend on your warehouse development methodology and decision support products being used. Capture of statistical information in some cases may not be possible due use of proprietary standards or lack of integration between developmental components.

About the Author

Michael F. Jennings is a data warehouse architect, columnist and conference speaker on data warehousing and business intelligence issues at Hewitt Associates where his current role is Unit Manager of Business Intelligence & Data Warehousing Infrastructure. His talks on decision support can be heard at many of the major data warehousing conferences in the U.S. and abroad. His experience includes the design and implementation of decision support systems in the manufacturing, telecommunication, insurance and human resource industries. Mike is a contributing author to the new book "Building and Managing the Meta Data Repository: A Full Life-Cycle Guide" being published this summer. He has more than seventeen years of information technology experience.

Michael F. Jennings
Hewitt Associates LLC
100 Half Day Road, MS: 1OP-4S
Lincolnshire, IL 60069
(847) 295-5000
Fax: (847) 295-7634
Email: mfjennin@hewitt.com