The Power of Abstraction - Part 1 of a series on abstraction
by Steve Hoberman
I remember meandering through the large galleries of a modern art museum, and stopping in my tracks in front of a very large canvas painted completely red with a single white dot in the center. The title was "City Skyline", and as hard as a squinted or twisted my head, I just couldn't see anything but a large red canvas with a white dot in the center. This was an abstract painting of a detailed city skyline yet was so generic and vague it could have been a painting of a tiger, a summer beach scene, or anything else for that matter. Abstraction is a tool that lets the artist efficiently capture and represent complex topics. As data modelers, we are all artists to an extent, and abstraction is an efficient tool at our disposal as well.
The data modeler is responsible for not only correctly representing the requirements of an application, but also having the foresight to design flexible structures in areas where requirements might change in the near future. There are certain designs we create such as for the data warehouse, where flexibility can make the difference between success and failure of the entire project. The data warehouse is a living design, one that is constantly being expected to easily accommodate new requirements, yet at the same time minimize design changes that impact existing downstream data marts. A very good design tactic I have used over the years to accommodate this need for flexibility is abstraction.
Abstraction is a very important tool for data modelers to be aware of and use appropriately. It is worth dedicating a series of articles to this topic, with this article being the first. This article explains abstraction including providing examples along with its benefits and costs. Subsequent articles will include:
- Guidelines on when to abstract. I will explain three questions that must be asked before we abstract, to avoid over- or under-abstracting.
- Use of abstraction in our data warehouse architecture. I'll discuss here those areas within our architecture where abstraction is extremely valuable and those areas where we need to minimize the use of abstraction.
- Use of meta data helper entities. I'll show an approach to capturing definitions and business rules on an abstracted model and therefore in the corresponding database.
- Reusable abstract entities. I will share the abstract entities I use most often.
- Reusable abstract relationships. I will share the abstract relationships I use most often.
- Reusable abstract data elements. I will share the abstract data elements I use most often.
If you have questions on abstraction or if there are other areas within flexible design strategies you would like me to address, please let me know. I can be reached at me@stevehoberman.com. For more on abstraction and other modeling techniques, please refer to my book, The Data Modeler's Workbench, Tools and Techniques for Analysis & Design. Also, if you are up for a data modeling challenge, sign up for my Design Challenges on my web site (www.stevehoberman.com). I send out periodic complex data modeling scenarios followed by several possible solutions.
What is Abstraction?
Abstraction, which we will discuss in detail in the sections that follow, is a method to bring flexibility to your logical and physical data models by redefining and combining some of the data elements, entities, and relationships within the model into more generic terms. Abstraction is the removal of details in such a way as to make applicable to a wide class of situations, while preserving the important properties and essential nature from concepts or subjects. By removing these details, we remove differences and therefore change the way we view these concepts or subjects, including seeing similarities that were not apparent or even existent before. For example, instead of having separate Customer and Employee entities, there might be situations where it would be valuable to have the generic Person entity, which groups both customers and employees together.
Let's go through a more detailed example. Say you are creating a data model for the owner of a tuxedo shop. You have already completed a fully normalized design for this project. One of the main entities on your design is Customer. The Customer entity on your data model includes a number of data elements including the following:
- Customer waist size
- Customer leg length
- Customer neck size
- Customer arm length
- Customer shoulder width
The owner of the tuxedo shop is very happy with your design and his application is developed based on this model. Two months later you receive a phone call from the owner of the shop, asking where he can store the customer's shoe size. You explain to him there is no place to put the shoe size, as that was not part of the original requirements. There is a long silence. The owner of the shop explains that he is expanding his business and many of his customers who rent tuxedos will also require shoes. You explain how much money and time it would take to make this change to his application. Again there is a long silence. And then you hear a few clicks as the owner hangs up the phone in disgust.
To paint a brighter picture, let's say you realize that all of the data elements around leg, waist, neck, arm and shoulder measurements are really just a person's apparel measurements. Instead of listing each measurement as a separate data element, you abstract this concept in your data model to just two data elements:
- Customer Measurement Number
- Customer Measurement Type Name
The Customer Measurement Number data element includes any size value such as:
- 42
- 17 1/2
- 34
And each of these values has a type, captured in Customer Measurement Type Name. Such as:
- 42 = Arm length
- 17 1/2 = Neck size
- 34 = Leg length
A Customer can have one or more of these measurements. Then when we get the phone call from the tuxedo shop owner requesting where shoe size should be stored, we simply add a new row in the Measurement Type table called "Shoe size" and the owner can now enter his shoe size values with this shoe size type. This flexibility in our design has led to effortlessly supporting this new requirement.
Benefits Of Abstraction
Abstraction offers the following benefits:
Provides greater flexibility. Flexibility is the primary benefit of abstraction. We are able to design our structures to better handle the unknown. New situations can arise that we did not originally anticipate, and if we use abstraction these situations may require little if any changes to our data model and therefore keep the database and code intact. In a recent design, I decided to use abstraction in the part of the model that captured sales organization information. I abstracted all of the relationships between salespeople and their positions. Even though we did not immediately need all possible combinations of relationships, my abstracted model would handle them if they arose. Sure enough, shortly after the sales organization subject area went into production, we had the need to add a new type of relationship. Our sales organization structure was able to handle this new requirement without any changes to the data model and hence keep the database structure and code intact.
Reduces analysis and design time. Once we are familiar with abstraction, we can apply it to areas of our model before completing a thorough analysis of the requirements. In the example mentioned previously, instead of taking the time to capture all of the additional types of measurements for a tuxedo fitting, we can simply add the measurement number and type data elements and have all possible measurements automatically covered within your model. Within two weeks, I was able to create a model showing the entire manufacturing industry by using a substantial amount of abstraction. Instead of spending significant amounts of time analyzing each data element, I abstracted to make sure any possible requirement was handled within the model. Realize that you might save time here but you can also forfeit an understanding of the business requirements.
Costs Of Abstraction
Abstraction might sound like the cure for everything from the common cold to world peace (slight exaggeration). But its benefits come with a price:
Hides business concepts on the model. When we just have Measurement Number and Measurement Type Name in our data model, we can no longer see the individual tuxedo fitting measurements. Customer Arm Length is no longer a data element we can clearly identify and explain, but a value within the Measurement Number data element. When we abstract we lose the actual business data element name. The data model however, still contains the same information. These measurements become row values instead of individual data elements. Six months after we create the tuxedo data model with abstraction, someone might ask us, "Hey, what does Measurement Type mean?" What sounded like a great idea when we were doing the data model, six months later might no longer make sense. So there is a task we need to do even more rigorously than before. We need to create very good definitions.
This way down the road when you or someone else needs to understand what Measurement Type Name encompasses, they can view all of the values of this code within the definition. You can also use meta data helper entities to capture this important name and definition meta data, which will be discussed in a future article.
Leads to indulgence. Too much abstraction can give us a design totally meaningless in understanding and impossible to develop. Let's say we continue to abstract the tuxedo data model. Eventually, we will wind up with a single entity called "Thing" with a recursive relationship, as shown in Figure 1. Imagine the power of this entity! We can represent anything with this entity. In fact, we can represent an entire Customer Relationship Management (CRM) application using this single entity! However imagine explaining this model to someone in our business area. Imagine trying to write the code that uses this entity. Now the "Thing" concept is a bit of an exaggeration, but I have seen an item data model where abstraction was taken to the point where it was difficult to find anything business related on the model. It was also very difficult to walk a business user through this model.
Lengthens development time. Although we can decrease analysis and design time using abstraction, we increase development time. It takes longer to write code and logic against row values than against columns. For example, it would be more complex and take longer to write the code to populate or select information out of the Measurement Number and Measurement Type Name data elements than it would be if we had the specific measurement data elements at our disposal. We would need to have more complicated query statements. Without abstraction, our queries would be much simpler and easier to create. For example, without abstraction, returning the name and leg length of all of the customers with a waist size of 42 would look like this:
SELECT CUSTOMER_FIRST_NAME, CUSTOMER_LAST_NAME, CUSTOMER_LEG_LENGTH
FROM CUSTOMER
WHERE
CUSTOMER_WAIST_SIZE = '42'
It is very easy and straightforward to create this query. On the other hand, using abstraction to get the same results might look like this:
SELECT CUSTOMER.CUSTOMER_FIRST_NAME,
CUSTOMER.CUSTOMER_LAST_NAME,
MEASUREMENT_TYPE.CUSTOMER_LEG_LENGTH
FROM CUSTOMER, MEASUREMENT_TYPE
WHERE
CUSTOMER.CUSTOMER_IDENTIFIER = MEASUREMENT_TYPE.CUSTOMER_IDENTIFIER
AND MEASUREMENT_TYPE.MEASUREMENT_TYPE_NAME = 'Customer waist size'
AND MEASUREMENT_TYPE.MEASUREMENT_NUMBER = '42'
Note that we need to do an additional table join on the customer's identifier. Also, abstract queries can be much more complicated than this example, especially if we have a separate code table or lookup table for the different measurement type names (e.g. 01 = Customer Waist Size).
Increases reporting difficulty and complexity. There are some reporting tools on the market today that have a very difficult if not impossible time extracting data from abstract structures. There have been several cases where I introduced abstraction into the logical and physical data models, and then had to reverse this decision because the reporting tool, and the query language in general, had difficulty retrieving information from these structures.
This article provided an overview to abstraction along with its pros and cons. Our next article in this series will contain several guidelines I put to use in determining when to abstract. Abstraction is such a powerful technique, that it can be easily abused by the modeler, and therefore I have some guidelines to make sure we only use it where most appropriate. We want to avoid the "Thing" model that is really nothing more than a large red canvas with a white dot in the center.
About the Author
Steve Hoberman is an expert in the fields of data modeling and data warehousing, and teaches several data modeling courses throughout the year including a brand new Data Modeling Master Class. He is currently a global reference data expert for Mars, Inc. He has been data modeling since 1990 across industries as diverse as telecommunications, finance, and manufacturing. Steve speaks regularly for the Data Warehousing Institute. He is the author of The Data Modeler's Workbench, Tools and Techniques for Analysis & Design. Steve specializes in data modeling training, design strategy, and in creating techniques to improve the data modeling process and deliverables. He enjoys reviewing data models and is the founder of Design Challenges, a discussion group which tackles complex data modeling scenarios. To learn more about his data model reviews and to add your email address to the Design Challenge distribution list, please visit his web site at www.stevehoberman.com. He can be reached at me@stevehoberman.com