Relationships Between Dimensions - Part I
by Steve Hoberman
Did you ever see the movie "The Matrix"? In several scenes, a diskette was inserted on a particular topic into the computer and then downloaded into a person's brain. This download process made you an instant expert on a particular subject. For example, if you wanted to become a black belt in karate, simply insert the karate diskette into the computer and download it directly into your brain. It's that easy and effortless. Wouldn't it be great if we had the same process to become an expert in data modeling? Simply select the data modeling diskette, perform the download, and we are an instant data modeling expert. (If anyone knows of such a process, please let me and all of the readers know as soon as possible!) Instead it becomes a career-long continuous process to become a data modeling expert. And what we find as we get better and better at data modeling, is that we move from learning the general "textbook" structures such as normalized designs, star schemas, etc., into more specific techniques that make us better data modelers. Little by little, by understanding and mastering these techniques, we become experts in data modeling. This quarterly column will focus on data modeling techniques.
In this article, I will discuss Relationships Between Dimensions. I will give an example of this design situation and mention the advantages and disadvantages to using this technique. It is important to note that the context for the technique in this article is for a data mart design. We intuitively apply this technique in our operational designs and possibly in certain areas of our data warehouse architecture, but we tend to overlook this technique when it comes to designing reporting applications. I believe this overlook is due to our image of a data mart data model as a pristine star schema design, where the only relationships are from dimension to fact tables. We will see in this article that sometimes building relationships between dimensions can lead to certain reporting efficiencies.
This article focuses on the pros and cons of this technique. Part II to Relationships Between Dimensions will focus on experiences with applying this technique, including a hybrid approach that I have used several times. I would also like the next article to include some of your experiences with this technique. Please visit, www.stevehoberman.com, and share your experiences (both good and bad) and I will possibly include them in Part II.
Figure 1 contains a typical star schema design. We have an Order fact table and 2 dimensions, Customer and Time. An order quantity is defined by both a time period and a particular customer. Bob orders 50 on September 1, 2001. In reality there will be more facts in Order Fact, such as financial amounts, units of measure (e.g. Are these 50 cases or 50 pallets?), and so on, but this single quantity data element will illustrate this technique.
Figure 1 - Typical star schema design.
Look at the data elements within the Customer table. Nothing particularly suspicious until we get to the dates: Customer First Order Date, Customer Birthday Day, and Customer Order Preference Day. These dates are really natural keys back to the Time table. By a natural key, I mean a data element a user would use instead of the unintelligent key (also know as a counter or surrogate key) that defines a table. So for example, a Customer Social Security Number might be the natural key to the Customer table. A user would consider the Social Security Number to uniquely identify a customer, not the Customer Identifier. The Customer First Order Date would be a natural key to the Time table, not the Time Identifier.
Are we missing key relationships by keeping these dates in Customer? In Figure 2, you will see that we replaced all of the dates with relationships back to the Time table. Customer First Order Date is now replaced with Time First Order Date Identifier, a surrogate key back to the Time dimension. Is the model in Figure 2 the preferred design?
Figure 2 - Example of relationships between dimensions.
Well, let's think about the advantages to the model in Figure 2. Building these relationships between dimensions gives us four main advantages:
1) Easier and faster to do reporting with facts. Having the relationships means that there will be certain queries that will be much more efficient. If the user is asking the question, "How much did Bob Jones the Customer order on their first order date?", we can answer that without joining back to the Time table. Naively assuming that the customer first and last names are enough for uniqueness, the query becomes:
SELECT SUM(ORDER FACT.QUANTITY)
FROM ORDER FACT, CUSTOMER
WHERE
ORDER FACT.CUSTOMER IDENTIFER = CUSTOMER.CUSTOMER IDENTIFIER
AND ORDER FACT.TIME IDENTIFIER = CUSTOMER.TIME FIRST ORDER DATE IDENTIFIER
AND CUSTOMER.CUSTOMER FIRST NAME = 'Bob'
AND CUSTOMER.CUSTOMER LAST NAME = 'Jones'
2) Easier and faster to do summarized reporting on reference data. Having the relationships means that summarized reference data reporting will be much more efficient. For example, "How many customer birthdays in October 1960?" This query becomes:
SELECT COUNT(*)
FROM CUSTOMER
WHERE
TIME FIRST ORDER DATE IDENTIFIER IN
(SELECT TIME IDENTIFIER
FROM TIME
WHERE MONTH = 'OCTOBER'
AND YEAR = '1960'
AND LEVEL = 'DAY'
3) Provides additional referential integrity. Having the relationship to the Time table ensures we will have high quality data in this data element. For example, if someone accidentally enters February 30th 1960 as a Customer First Order Date, it would not permit it because February 30th 1960 does not exist in the Time table. However, this benefit of additional referential integrity may provide limited value because by defining a data element as a date format, most databases would automatically flag a February 30th value as a problem. If your dates were defined as Character(8) format for example, having this integrity would be very important. Also, if there was a particular domain in the Time table that was a subset of all dates, such as a ten-year range, having this relationship in place would enforce that all customer dates be between this range.
4) Encourages greater understanding of data. I am a firm believer that the more we define relationships on a data model, the greater understanding we have of the structures. That is because knowing a data element is really a natural key of another table and defining this relationship requires some amount of analysis, leading to greater insight and understanding.
However, there is also a dark side to having the relationships defined as in Figure 2. I can think of at least four disadvantages:
- Requires additional design and development effort. Doing the analysis and modeling to determine that Customer First Order Date is really a natural key back to Time requires more work. It is easier to just define the dates rather than take the time to realize we really might need relationships back to other entities. It also takes more development time because we have to build in lookups into our mappings to get the appropriate Time Identifier when loading data. So we would do a lookup against the Time table to determine Time Identifier 123 represents February 28th 1960.
- Causes loops in reporting tool. Many reporting tools have a tough time when there are two or more paths back to the same table. For example, in Figure 2 we can navigate to the Customer table a couple of ways. We can get to Customer from Order Fact, and we can get to Customer from Order Fact to Time to Customer. We will need to resolve these loops in the reporting tool.
- Causes performance hit in loading and changing data. Having the relationships means data integrity checks are being done and we are doing lookups whenever we load or change data. When we first created the record for Bob the Customer, we needed to take his Customer First Order Date of February 28th 1960 and do a lookup against the Time table to get the appropriate Time Identifier. This takes longer than if we just have the dates within the Customer table.
- Causes performance hit in single reference subject area reporting. If I need to view the Customer records independent of any other subject areas on my design, I would need to join back to the Time table whenever I encountered one of these dates. So if I wanted to view Bob's record in customer, I will be joining back to the Time table three times, one for each date. This could cause performance problems if the users are doing lots of reporting on just customer information.
This article discussed the pros and cons of defining relationships between dimensions. If the pros outweigh the cons, we learn towards adding the relationships between dimensions. But there is also a hybrid design option. In Part II of this technique, we will discuss this hybrid approach to this technique as well as relationships between dimensions other than to Time. As mentioned earlier please visit my web site at www.stevehoberman.com to share your experiences with this technique, which might be included as part of the next article.
About the Author
Steve Hoberman is an expert in the fields of data modeling and data warehousing, and teaches several data modeling courses throughout the year including a brand new Data Modeling Master Class. He is currently a global reference data expert for Mars, Inc. He has been data modeling since 1990 across industries as diverse as telecommunications, finance, and manufacturing. Steve speaks regularly for the Data Warehousing Institute. He is the author of The Data Modeler's Workbench, Tools and Techniques for Analysis & Design. Steve specializes in data modeling training, design strategy, and in creating techniques to improve the data modeling process and deliverables. He enjoys reviewing data models and is the founder of Design Challenges, a discussion group which tackles complex data modeling scenarios. To learn more about his data model reviews and to add your email address to the Design Challenge distribution list, please visit his web site at www.stevehoberman.com. He can be reached at me@stevehoberman.com