Skip to content.

Sections

Relationships Between Dimensions - Part II

 

By Steve Hoberman

In Part I of Relationships Between Dimensions, I discussed the benefits of creating relationships between dimensions in our data marts, along with some of the drawbacks. I discussed an "all or nothing" approach. That is, a "nothing" approach where we avoid relationships between dimensions (see Figure 1), and an "all" approach where we create all possible relationships between dimensions, (see Figure 2). This article will focus on two additional approaches, followed by a set of guidelines to help determine the most appropriate design option for your situation. I will run through an example of applying these guidelines and arrive at the most appropriate option. As always, please visit www.stevehoberman.com, click on Discussion Topics, and share your experiences (both good and bad) and I will possibly include them in a future article.

Figure 1 contains a typical star schema design described in my last article.

Figure 1 - Typical star schema design (no relationships between dimensions)

Figure 2, also described in my last article, replaces all of the dates with relationships back to the Time dimension. Customer First Order Date is now replaced with Time First Order Date Identifier, a foreign key back to the Time dimension. So for example, a customer first order date of December 15, 2001 might contain a Time First Order Date Identifier of 123 which points back to a Time dimension row with Time Identifier 123 containing December 15, 2001 in the Date data element and "Day" in the Level data element.

Figure 2 - Representing all relationships between dimensions.

Figures 1 and 2 seem to imply an all or nothing approach to having relationships between dimensions. Pros and cons of either approach appear in my previous article. There are also two intermediate approaches, shown in Figures 3 and 4. Figure 3 contains Time First Order Date Identifier, used in the relationship back to time, while the other date data elements in Customer remain natural foreign keys, such as Customer Birthday Date. Figure 4 also contains the natural foreign key, that is the Customer First Order Date. Figures 3 and 4 demonstrate that you don't have to relate all of the natural foreign keys but could select a subset of one or more, depending on the user requirements. Figure 4 additionally shows that you can redundantly store both the natural foreign key (the date) and foreign key (the time identifier), if required.

Figure 3 - A subset of relationships created between dimensions, in this case just Customer First Order Date represented by Time First Order Date Identifier.

Figure 4 - Same as Figure 3, except additionally contains the actual Customer First Order Date.

Which of these four do we select? Do we close our eyes and wherever the pen lands choose it? Do we do eni-meni-mini-mo? Hopefully not! Table 1 contains a set of guidelines we can use to determine the most appropriate design. In this table I've listed six important factors that come into play in selecting either Figures 1, 2, 3 or 4. For each factor, I've rated each figure. This way we can simply fill in the response column with how the factor influences our design decision. This will help us make an educated decision as to which design to go with. There could be more factors to consider but using these six will illustrate the point.

Table 1 - Guidelines for choosing the most appropriate design.

#

Factor

Example

Figure 1

Figure 2

Figure 3

Figure 4

Response

1

Frequency of direct reporting from dimension to fact table

"How much did Bob Jones the Customer order on their first order date?"

Low

High

High on subset

High on subset

 

2

Frequency of summarized reporting on reference data

"How many customer birthdays in October 1960?"

Low

High

High on subset

High on subset

 

3

Frequency of single dimension reference data reporting

"Let's look at Bob Jones' complete customer record."

High

Low

Low on subset

High

 

4

Referential integrity

If someone accidentally enters February 30th 1960 as a Customer First Order Date, it would not permit it because February 30th 1960 does not exist in the Time dimension.

Low

High

High on subset

Low

 

5

Money and time

Which design option costs the least and takes the least amount of time to complete?

Least effort

Most effort

Average effort

Average effort

 

6

Frequency of updates or inserts to data

When we first created the record for Bob the Customer, we needed to take his Customer First Order Date of February 28th 1960 and do a lookup against the Time dimension to get the appropriate Time Identifier.

High

Low

Low on subset

Low on subset

 

Let's apply these guidelines to the example of Customer, Order Fact, and Time that are shown in Figures 1 through 4. The completed table with our responses filled in appears on Table 2.

Factors 1 through 3 focus on the user requirements - what types of reports or questions will the user initiate? To determine how important each of these first three factors are requires referencing functional requirements document or validating their frequency of occurrence directly with the users. Factors 1 and 2 lean heavily towards having relationships between dimensions (all or a subset). Factor 1, including questions such as "How much did Bob Jones the Customer order on their first order date?", would mean we do not have to go back to the Time dimension to answer such questions. We can simply take the Time identifier foreign key that might exist in the Customer dimension and join directly to the fact table without going through the Time dimension. Factor 2 addresses when we need to use the hierarchies of one dimension when looking at another dimension. That is, using Time's hierarchy when reporting on customer in answering the question, "How many customer birthdays in October 1960?" We can rule out Figure 1 if factors 1 and 2 are of high occurrence. For our example, let's say we have a high frequency of occurrence for factors 1 and 2, especially on the Customer First Order Date.

Factor 3 addresses how much reporting the users will be doing on solely the dimension in question. This is mostly documented in terms of looking at purely customer reference data, purely item data, and so on. Whenever there is a join required to complete the reference data picture, there is a performance hit. Figures 2 and 3 require joins and Figures 1 and 4 do not. Let's assume in our example that the users will do a substantial amount of reporting just on the customer. Therefore this factor is of high occurrence, at least on the Customer First Order Date. This leaves us with only Figure 4 at this point (because Figure 1 has been ruled out in the first two factors). Figure 4 will be the design option we will select unless one of the other factors weigh much more heavily against Figure 4 and towards another design option.

Factor 4 discusses the additional referential integrity such relationships provide. Because our database would catch invalid dates such as February 30th, this is a very low priority for us and therefore not a concern.

Factor 5 is a tricky one because at heart we always want to do the right thing, but in reality we often cut corners. If Time and Money are the highest priority for your project (as it is in a project I am working on now), then please choose the design option in Figure 1 and do not continue this exercise any further. Although it is important to mention that the extra effort required to build in relationships is minimal. Let us assume in our example that the amount of effort is small enough so it does not weigh heavily in our decision, and therefore we ignore Factor 5 in our example.

Factor 6 is mentioned for two reasons. One is the amount of additional time it takes to load data when you have to first do lookups against another table. Figure 1 takes the least amount of time and Figure 2 takes the most amount of time. One reader of my previous article mentioned that this factor was the main reason they removed the relationships between dimensions. The second reason is that if we repeat the same value more than once and there are lots of updates to this value, there is a greater chance the data might become out of synch. Figure 4 has both the natural foreign key and the foreign key, therefore if there are lots of updates the values could become out of synch. Let's assume for First Order Date that there are a minimal number of updates and therefore Figure 4 is still the optimal choice in our example. Table 2 shows the final results (see the Response column), which documents why we've chosen Figure 4.

Table 2 - Example of applying these guidelines.

#

Factor

Example

Figure 1

Figure 2

Figure 3

Figure 4

Response

1

Frequency of direct reporting from dimension to fact table

"How much did Bob Jones the Customer order on their first order date?"

Low

High

High on subset

High on subset

High on subset

2

Frequency of summarized reporting on reference data

"How many customer birthdays in October 1960?"

Low

High

High on subset

High on subset

High on subset

3

Frequency of single dimension reference data reporting

"Let's look at Bob Jones' complete customer record."

High

Low

Low on subset

High

High

4

Referential integrity

If someone accidentally enters February 30th 1960 as a Customer First Order Date, it would not permit it because February 30th 1960 does not exist in the Time dimension.

Low

High

High on subset

Low

Not important (caught by database)

5

Money and time

Which design option costs the least and takes the least amount of time to complete?

Least effort

Most effort

Average effort

Average effort

Not a huge factor

6

Frequency of updates or inserts to data

When we first created the record for Bob the Customer, we needed to take his Customer First Order Date of February 28th 1960 and do a lookup against the Time dimension to get the appropriate Time Identifier.

High

Low

Low on subset

Low on subset

Low on subset

This article presented four options for capturing relationships between dimensions, along with a listing of factors to make the decision of which option to choose more methodological and less "gut feel". The most important factors we discuss (Factors 1 through 3) focus on the user requirements. That is, what does the user want? We need to keep this in mind in all of our design decisions. Thanks to all who responded to my first article, and as mentioned earlier please visit my web site at www.stevehoberman.com and click on Discussion Topics to share your experiences with this technique, including your experiences with other design options, or other factors in addition to the six discussed here.


Steve Hoberman is the Lead Data Warehouse Developer for Mars, Inc. He has over ten years data modeling experience in the Telecommunications, Financial, and Manufacturing industries. He speaks often on data modeling for the Data Warehousing Institute (www.dw-institute.com) and is the author of The Data Modeler's Workbench, Tools and Techniques for Analysis and Design, published by John Wiley (www.wiley.com). Steve has a Master of Science in Information Networking from Carnegie Mellon University. He can be reached at me@stevehoberman.com, and provides a number of resources for the data modeler at www.stevehoberman.com.