Performance Benefits of Surrogate Keys in Dimensional Models

April 1, 2007 Peter Stiglich, CBIP Data Warehousing & Business Intelligence

There are many benefits to implementing surrogate keys in a dimensional model design for a data warehouse data model

Introduction

There are many reasons for implementing surrogate keys in Dimensional Models (enterprise data modeling) such as insulating dimensions from changes to source systems and enabling historical versioning of dimension members. However, query performance in the data warehouse is another primary reason for incorporating surrogate keys into your dimensional models that should not be neglected.

A surrogate key is a non-intelligent, system generated, numeric (integer or smallint) value assigned as the primary key of a dimension. An alternate key (aka natural key) must always be defined as well for dimensions to understand the granularity of the dimension – something vitally important for enabling conformed dimensions across multiple fact tables and areas of analysis in any data warehouse environment.

Some database and DW designers believe that surrogate keys are not needed due to increases in performance capabilities in hardware. However, due to the ever-expanding volume of data, the increasing volume of queries, and the data volumes queried it is important to make query joins as efficient as possible for every application and database management system.

According to a recent OLAP Survey, the most commonly reported problem is poor query performance. Using surrogate keys is the foremost means to optimize dimensional queries on a RDBMS platform. The primary reasons are:

Simplified, High Performance Joins

Using a surrogate key will simplify the join between a fact and dimension table. A single, small numeric value (usually 4 bytes (integer) or 2 bytes (smallint)) is scanned in the fact table or fact table index, rather than a large character value or multiple attributes of mixed data types. Obviously, the join on the single, small numeric value will be faster. It is very common for dimensions to have multi-part natural keys comprised of the business key (attributes that are well known to business users as providing uniqueness in source systems) plus requisite meta data tags (e.g. source system identifier, effective date). The natural key describes the uniqueness and granularity of the dimension table. It is critical that the metadata associated with the natural and surrogate keys be identified and documented.

These multi-part natural keys often consist of character values that can add up to a significant amount of space when your fact tables consist of millions or billions of records. It is very common for a Business Intelligence (BI) query to need to retrieve a very large number of records from the database, even though the results presented to the user are usually summarized.

In dimensional models, it is very common to need to view the state of a dimension member as it appeared as of fact occurrence date. If the Type II Slowly Changing Dimension method is used for handling changes in a dimension, changes to a dimension member will result in the existing record being “deactivated” – e.g. the inactive date is set to the previous date, and the current values are inserted into a new record. Without a surrogate key, an equi-join cannot be performed to accurately join the fact and dimension tables – instead, an inefficient date range operation must be performed.

Example:
select d.col1, d.col2, sum (f.amt)
from dimension d, fact f
where d.col3 = f.col3
and f.tran_date between d.effective_date and d.inactive_date
and d.col4 = “Widgets”
group by d.col1, d.col2

When a surrogate key is used, the task of optimizing queries is reduced – there is only a single possible way to join a fact table to the dimension. A side benefit of this is that users cannot incorrectly join a dimension and fact table. In the above example, if the user forgot to include the range join on the fact transaction date, double counting could result.

Simplifying dimension joins by utilizing a surrogate key will be appreciated by anyone who has to troubleshoot or tune complex analytical queries generated by BI tools. A fact table usually has many associated dimension tables – it would not be unusual to have fifteen (15) dimensions that could join to a fact table. In addition, many dimensions may have multiple relationships (roles) to a fact table. There are usually many types of dates that need to be represented with a fact record – each of which would need to join to the date dimension. For example, on an Order fact table you might have an Order Placed Date, Order Shipped Date, Order Invoice Date, etc.

Reduced I/O Operations

While disk space is (relatively) cheap, usually many fact table rows have to be read into memory to satisfy queries – if the records are not already in memory. Relational databases perform I/O using pages. A page can contain many data or index rows, so if a page contains a row required to satisfy a query, the data page (with all the rows contained in it) is read into memory. Using natural keys increase the width of the fact table, so fewer fact table rows can be stored on a data page. The result is increased I/O operations – usually the main cause of performance bottlenecks.

A significant amount of the time spent during ETL processing can be attributed to index builds. Utilizing surrogate keys will reduce amount of time spent in building and maintaining indexes and will reduce the amount of space allocated for indexes. As a result, the indexes will be more compact and efficient.

Enables RDBMS Optimizations

Using surrogate keys help enable database optimizations developed specifically for use with dimensional models, e.g. bitmap indexing, star transformations.

Some database features developed specifically for Star Schemas require single part foreign keys. Due to the performance impact of enabling Referential Integrity (RI) for a dimensional model, foreign key constraints are usually not enforced in the database. Foreign keys in the fact table, however, usually have a bitmap or b-tree index on the column. Unlike a B-Tree index, most bitmap indices cannot be comprised of multiple columns. Bitmap indices were developed specifically for BI application and can have a dramatic impact on query performance.

In order for a RDBMS query optimizer to execute a query using a Star Transformation, a single part foreign key with a bitmap index is required. Star Transformations will “rewrite” a query to optimize it by using sub-queries that take advantage of the bitmap indices on the foreign keys.

Conclusion

The increases in performance capabilities of hardware and software plus the low cost of DASD may tempt one to forego using surrogate keys. More powerful infrastructure does not negate the need for good data architecture. Using natural keys in dimensional models will usually be a mistake. Use of surrogate keys should be considered the standard for dimensional models, unless there are very specific, valid, and justified reasons. Surrogate keys have significant performance benefits, in addition to other data architecture benefits.

Peter Stiglich, CBIP

Pete Stiglich, CBIP, is a Principal Consultant with Data-Principles, LLC and has written and presented extensively on data architecture, data management, and Big Data. He is an AWS Technical Professional and a Hortonworks Architecture Professional. Pete also is an experienced trainer in data architecture and data modeling, and has a background in data governance and metadata management.

Performance Benefits of Surrogate Keys in Dimensional Models

Peter Stiglich, CBIP

More DMU Knowledge

DMU Sponsors

DMU Partners

DMU Categories

Online Courses

Contact Us

Subscribe To DMU