Skip to content.

Sections

Generic Meta Data Repository Design - Part 1

By Michael F. Jennings

Hewitt Associates LLC
100 Half Day Road, MS: 1OP-4S
Lincolnshire, IL 60069
(847) 295-5000
Fax: (847) 295-7634
Email: mfjennin@hewitt.com

Figure 1

Abstract:

Many data warehouse projects face the decision of purchasing or building a meta data repository for their environment. The decision to build a repository is often made due to the lack of data integration between the various data warehouse products or as a result of budgetary constraints on the project. For those projects that make the decision to implement their own meta data repository solution a generic model is often the best choice to advance that effort.

This article explores the design of generic meta data repository data model for use in a decision support system project. The design will address the various base components that are typically required in meta data repository in order to support a data warehouse environment. The intent of the model is to offer those individuals who wish to implement a meta data repository a generic solution that can be integrated into an in-house implementation. Alternatively, this model design can be used to complement a requirements checklist when evaluating meta data repository products in the market place.

Part 1 of this article will discuss the design of the generic meta data repository from a business logical modeling or macro level perspective. Part 2 will explore design of the model at a logical modeling level or more micro level.

Assumption:

  • The reader has had some level of database data modeling experience that covers entity relationship diagramming using information engineering notation.
  • The article is intended to provide the experienced reader with sufficient modeling detail to physicalize and implement the meta data repository in a database.

Introduction

There is not a more important data model that the data warehouse developer needs to understand in the decision support environment than the meta data repository model. The focus on many projects is on operational source system models or data warehouse model because that is where the actual raw data is stored. While this is unquestionably the primary purpose of the decision support environment, it is the meta data repository and its data model structure that allow the various function areas in the warehouse to communicate. The repository’s role in furnishing context to the data content, processes and the reports that are elements of the environment is often overlooked. The meta data repository model is the central hub of the environment. It is the one place where integration between operational source systems, the data warehouse, the ETL (extraction, transformation and loading) processes, the business views, reports and operational statistics occurs.

While many firms understand the significance of the meta data repository, few have the experience to construct one from a blank sheet of paper. This is where a generic meta data repository steps in to help get the project started by providing a template that can be tailored by the repository architect for specific business needs.

Unlike a generic data warehouse model that is useful only for a specific vertical industry or functional purpose, the generic meta data repository model can be utilized across various businesses. This reusability is possible because the repository's purpose is to capture and store meta data, data about data. This function is not influenced by the type of business being supported but by the data warehouse environment infrastructure needed. The generic repository model is not an all encompassing "shrink wrap" solution for any company. Its purpose is to provide the repository architect with a point of departure to begin the revision process needed for their enterprise. The repository architect's modeling decisions are made primarily on what other type of warehouse components need to be supported.

Meta Data Repository Components

There are typically seven standard components to a meta data repository data model.

Figure 1

These components can be viewed as the subject areas of the generic meta data repository model. Depending on the individual decision support environment's business requirements, the model may be revised by removal, editing or integration of a new component area. If a company is using a data cleansing tool to scrub dirty data from a source system the ETL component area of the model may need to be revised to capture pertinent information. Alternatively, if capture of query statistics is deemed extraneous or is accommodated through some other means in the environment (e.g., OLAP tool meta data layer database) this component area may be removed from the model.

The first component area of the generic meta data repository model contains information about the logical data warehouse model. The logical data warehouse model contains all the business entities for specific business subject areas, relationships between the entities and attributes of each entity comprising the target model. The level of detail found in the logical model will vary depending on your firm's data modeling methodology and practices. Individual modeling practices for the logical data model may also necessitate changes in the generic model to accommodate additional information.

The second component contains very detailed information about the physical data warehouse target model. The logical data warehouse model must go through a transformation process in order to become a physical data warehouse dimensional model. A fundamental set of dimensional modeling steps is applied to the logical model in order to produce the physical model (e.g., addition of time). Depending on the requirements of your dimensional data modeling and/or database administration departments, this repository component may be extended to include additional physical attributes such as indexes and fragmentation strategies.

The third component of the generic model contains physical information about the sources feeding data to the data warehouse. This operational information can originate from source databases, file extractions, spreadsheets, Internet and other formats. The information stored in this component can be used to alert the data warehouse development team of pending changes to a source that will potentially affect the data warehouse model, ETL and report processing. Additionally, the information in this component can be used by warehouse analysts, both business and technical, for reconciliation and auditing purposes.

The fourth component of the generic model is the source to target data warehouse mappings. This component stores the table/column cross reference mapping and semantic resolution between the source operational systems and the target physical data warehouse model. This is one of the primary purposes of a meta data repository since other environment tools; i.e. data modeling, often do not do or inadequately perform this function. Once again, the detail to which the semantic resolution is documented will vary based on your firm's own desires and goals. You may decide to extend the model to take a more structured approach towards business rule identification and use.

The fifth component is the business subject areas. The logical grouping or business views of physical data warehouse tables is stored here (e.g., sales, financial, human resources). This information provides business end users with a more intuitive navigation method or view of the information stored in the data warehouse. This information can be used by both ETL and front-end reporting tools. Your particular business needs may require a more hierarchical method to store and navigate subject area information, thus requiring revisions to the generic model.

The sixth component of the generic model is ETL statistics. Information on individual extract, transformation and load processes is captured within this component. This information can be used to determine process improvements, database enhancements, fault isolation and other optimization procedures for the warehouse. Use of this particular component will depend greatly on your specific ETL methodology and tool set capabilities. The capture of this type of information in the repository database will require some degree of additional time during ETL processing.

The seventh and last component of the generic meta data repository model is query statistics. Information about every query made against the data warehouse database is stored in this component. This information is used to determine a variety of optimizations that can be performed against the warehouse database. Usage statistics on tables and columns is analyzed to identify dormant data, aggregation and index candidates. Use of this component will be dependent on whether your company has purchased a data monitoring tool for the decision support environment since these products typically come with their own database.

Logical View of the Generic Repository Model

The logical modeling view of the generic meta data repository is comprised of sixteen (16) entities and eighty two (82) attributes.

Figure 2

The tables that comprise the seven components of the repository are:

Logical Data Warehouse Model:

Primary Tables

Target Table (Target Table Type = Logical)

Target Column

Target Domain

Associative Tables

Subject Area Table Map

Table Column Map

ETL Process Statistics

Table Column Map

Source To Target Domain Map

Physical Data Warehouse Model

Primary Tables

Target Table (Target Table Type = Physical)

Target Column

Target Domain

Associative Tables

Subject Area Table Map

Table Column Map

ETL Process Statistics

Table Column Map

Source To Target Domain Map

Source (Operational) Data Models

Primary Tables

Source

Source Column

Source Domain

Associative Tables

ETL Process Source Map

Source To Target Column Map

Source To Target Domain Map

Source to Target Data Warehouse Mappings

Primary Tables

Source To Target Column Map

Associative Tables

Target Column Map

Source Column

Business Subject Areas

Primary Tables

Subject Area

Associative Tables

Subject Area Table Map

ETL Statistics

Primary Tables

ETL Process

ETL Process Statistics

Associative Tables

ETL Process Source Map

Target Table

Query Statistics

Primary Tables

Query Statistics

Associative Tables

Query Table Column Hits

Part 2 of this article will explore in detail the table and columns that comprise the generic meta data repository.

Summary

For those firms that decide to develop a meta data repository in-house, a generic model can help to quickly advance these efforts by providing a starting template to edit against. This generic model can also assist those firms evaluating meta data repository products in the marketplace. The generic repository model will need to be tailored for the decision support environment it is being integrated into by the repository architect. The repository model is typically comprised of seven subject area views; logical and physical data warehouses models, operational source system data models, data mappings between the source and target warehouse, business subject area views of the physical warehouse, ETL processing step descriptions and statistics and query statistics against the warehouse database.

About the Author

Michael F. Jennings is a data warehouse architect, columnist and conference speaker on data warehousing and business intelligence issues at Hewitt Associates where his current role is Unit Manager of Business Intelligence & Data Warehousing Infrastructure. His talks on decision support can be heard at many of the major data warehousing conferences in the U.S. and abroad. His experience includes the design and implementation of decision support systems in the manufacturing, telecommunication, insurance and human resource industries. Mike is a contributing author to the new book "Building and Managing the Meta Data Repository: A Full Life-Cycle Guide" being published this summer. He has more than seventeen years of information technology experience.

Michael F. Jennings
Hewitt Associates LLC
100 Half Day Road, MS: 1OP-4S
Lincolnshire, IL 60069
(847) 295-5000
Fax: (847) 295-7634
Email: mfjennin@hewitt.com