Five Commandments of practical data management can ensure that fundamental best practices are followed consistently
Commandments are like best practices (see the DAMA-DMBOK©); they are basic performance statements. In data management there are some practical, fundamental administrative rules for behavior. These are commandments and should only be compromised when there is no other option (I would like to say never, but that usually gets me in trouble). These basic rules apply regardless of the methodology or modeling tool you use. Like the Ten Commandments, these data management statements may seem like common sense, but we still seem to have trouble following them.
I Thou Shalt Consistently Name Thy Attributes
II Thou Shalt Use Consistent Data Formats
III Thou Shalt Not Bundle Thy Elements
IV Thou Shalt Define Business Keys
V Thy Code Attributes Shalt Have DeCode Tables
We should very rarely compromise the enforcement of common naming practices. Common naming is necessary to insure a corporate view of the data. Logically we should use consistent business names. We should always abbreviate the business names the same way in the physical models. If we call it “camel” one place, we should not call it “dromedary” somewhere else. A more practical example may be using ‘emp_id’ to uniquely identify an employee. We should be able to search the repository (or rdbms catalog) for ‘emp_id’ and have a good deal of confidence that we located every instance of the employee identifier. Without common naming, many of the benefits of modeling are lost.
In order to have a solid basis for common names it is imperative that a list of standard abbreviations be developed. I do not think there are any great methodologies for abbreviations, but I agree with Nike: “Just Do It”. When you publish the abbreviation list be prepared for the complaints. It is almost guaranteed that not everyone will be happy, but it is important that once you start using the list you stick to it. This is a critical issue.
Start your list out as small as possible. Abbreviate what you will use then add to it as you add new elements. It is imperative that, as they are used, new abbreviations are added to the list. This list of standard abbreviations, along with consistent attribute naming, will enable everyone to name elements the same way. It is a critical step in gaining an enterprise view of the data.
At one company I worked for we were pretty good about abbreviating consistently. One of our abbreviations was “dt”, for date. We ensured that all columns that contained a date or a derivation of date ended with “dt”. This enabled us to provide a list of all date columns by running a query against the catalog for all columns ending with “dt”. Because we set this standard before we developed many tables we had a high level of confidence.
Consistent naming is one of the most fundamental rules of data modeling. Violations to this rule cause confusion and incomplete analysis. I cannot think of a single example when this rule should be violated. If a customer insists on naming an element something other than the standard, and if we are not empowered to stop them, I would consider physically naming the element correctly and creating a view for the user using the non-standard name. This approach still has its drawbacks. The databases would be consistent but there could still be confusion caused by any programs or queries using the non-standard view.
Along with consistently naming elements we should also use consistent data formats. If emp_id is determined to be a five-digit number, it should be a five-digit number everywhere. This will help enforce integrity and increase the performance and accuracy of joins between tables. It is much easier to compare or join on two five digit numeric columns then on one five digit numeric and one six digit alpha. If you want an enterprise view of your data this is another one worth sticking to your guns on.
Each attribute should represent one and only one thing. We should avoid the temptation of combining what should be two distinct attributes into one (bundling). This greatly complicates validation and enforcement of business rules. In order to reduce the number of attributes many people have the tendency to combine multiple attributes into one element. They think they are simplifying things but it only leads to complexity and headaches. Bundled elements are a frequent occurrence when mapping older legacy data. Believe me, it will be worth the initial headache to unbundle them.
I worked for a company that had many different clubs that each marketed their own accounts. The clubs were grouped into companies for reporting purposes. Each club could belong to one and only one company. Each company could have one or many clubs. In the older legacy systems it was common to combine company and club into one attribute. This poses many problems in logical and physical design. It is harder to validate the values and relationships to other entities may not be clear. If we wanted to report on a particular company, we would have to use substring functions to do any joins or groupings (this is a very bad and inefficient thing).
With one combined column we must also rely on application logic to ensure referential integrity. If company and club are separate columns we can let the rdbms validate the relationship with the company and club tables. This field should always be separated into two attributes on a logical and into two columns on a physical model.
Another example involves combining two related code values. On a return of a product we stored a code that represented a combination of the source of the return and the reason for the return (see example 1). What, at first, looked like a short cut actually resulted in a table with many more values then needed. It was also difficult to use. It should be separated into two attributes, carrier id and return type (see example 2). The headaches bundled elements cause also makes this rule worth standing your ground on. As Albert Einstein once said “Any intelligent fool can make things bigger and more complex. It takes a touch of genius, and a lot of courage, to move in the opposite direction.”
Example 1 – bundled attribute
Example 2 – unbundled attributes
|Mail Carrier Id||Description|
|Return Rsn Code||Description|
In a practical sense, the only exception to this rule could involve a complex bundled element that is being used in a backend reporting or transaction entity. What I mean by complex bundled element is that different positions of the element stand for different things depending on the value of other positions of the same element. In other words an element that would not be easy to unbundle. If this element is to be stored in an enterprise data store it should be unbundled and the unbundling process should be clearly documented. If we can not correct the problem at the source it may not provide any practical benefit to unbundle it.
If we would like to provide a link to the original value while preparing the user community for the future, we could store the element bundled and unbundled. There are benefits of this approach but if we do this, we should clearly document our intention. This approach enables the user and applications to start using the correct elements while still maintaining the link to the legacy system feeding the data. The actual approach you use will vary depending upon the situation and requirements.
Even if we use sequence numbers as keys to our entities we should always document the true business key. In the interest of simplicity I am considering the business key to be the attribute or group of attributes needed to uniquely identify an entity. It is important to the business, and to the model, that we identify the attributes that make the entity unique. Identifying a business key is an important step in defining the boundaries of an entity. Simply assigning a sequence number without understanding what actually makes the entity unique makes it easier for us to dump attributes in an entity that do not truly belong there.
There are times when we may be required to carry a sequence number as part of a business key:
- No other elements or combination of elements can uniquely and clearly identify the row
- If the elements needed to uniquely identify the row are optional (nullable).
- The combination of elements needed to uniquely identify a row is so large that it is impractical to use it as a foreign key
A practical example of the use of a sequence number would be an order id. While you may be able to uniquely identify a row with a combination of order attributes (customer, timestamp, items ordered etc.) Using these as the primary key would only add confusion to the model. It would also not be practical to carry that combination of elements as a foreign key on every entity that must relate to an order. The attributes may uniquely identify the row but they do not clearly identify the row. Creating an unintelligent order id would make the model more understandable and easier to use.
Nullable elements should never be part of the business key. The definition of nullable means it does not exist. It does not mean that it contains low values or whatever method your RDMS uses to represent them. A null value means the attribute does not exist for all cases of this entity. If something does not exist, it cannot be part of the business key.
Code values should always be unintelligent and the values should be documented in code tables. Every code should have a table documenting the values. This will enable you to enforce referential integrity. This will ensure that the values entered into the code attribute are defined as valid. Code tables also are a reference for anyone writing queries or programs using the code. Code tables help provide boundaries on the use of the code element. A review of code values will quickly reveal if a code is being misused or bundled. As we move toward an enterprise view of data, having the code values documented in a table will provide an excellent resource to coders and users querying data.
If we are forced to deviate from good database design make sure it is documented. At one place that I worked we developed three spreadsheets. One sheet contained all the exceptions we had to make (whether from uncooperative users or having to link to legacy data), the second contained all the data stores we created that would be shareable beyond a specific application, and the third contained uses of the shareable data. This clearly documented the benefits of what we were doing. Remember that just because you decide to give in on a battle does not mean you lost the war. Take your victories where you can and move on, but do not give up the principles you know to be correct.
On those days when you feel like you are standing alone, remember the wisdom of Mahatma Gandhi. “Just because you are the only voice in the desert, does not mean that you are wrong.” Follow these commandments and practice practical data management.