Data Quality and the Data Analyst
by Dan Roth
Sometimes Data Analysts remind me of Don Quixote. They think they are brave knights slaying dragons when in reality they are chasing windmills. They can become so caught up in the theoretical that they lose touch with reality. We must be careful that we are fighting Dragons, not windmills. If we are not, we will waste our time and energy and just appear foolish and proud. We may brag about the Dragons that we have slain while everyone else is snickering about the windmills they saw us chasing.
Like any good knight there are 3 things we must always keep in mind:
- Slay Dragons not Windmills (data quality)
- Improve the Castle and life inside (add business value)
- Keep the King Happy (PR with users)
First, we must be sure we are slaying Dragons and not windmills. We may believe we are attacking the real problem (all the magazine articles said so) when in reality we are chasing theories and ideas that have no real impact. How do we know if the issue we are facing is a Dragon or a windmill? One good test is to see if it will burn the villagers. Will the issue cause real problems for the users or IT later? Real dragons breathe fire and do damage; windmills just sit there and spin.
Letting one department redundantly store an account's address when it is available in a corporate data store is probably a real dragon. If we have created a table to hold account addresses, everyone should use that table whenever they need to send correspondence (see diagram 1). Each customer's address is stored only once, so every area benefits from the feedback customers give to other departments. If the billing department is informed of a change of address and updates the database, customer service will automatically have access to the new address.
Now let's assume that the fraud department wants to store its own address table (see diagram 2). This separate address table will cause many headaches. If customer service receives a change of address, how will fraud know about it? Correspondence could mistakenly be sent to an old address. What if a customer contacts the fraud department with an address change? How will customer service know about it? When we evaluate this problem, it is evident that the villagers will get burned by this dragon. This is an issue we should fight.
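To make the burn concrete, here is a minimal sketch of how the two copies drift apart. The table and column names (account_address, fraud_address, account_id, address) and the sample addresses are hypothetical; the diagrams do not name them. It uses Python's built-in sqlite3 module only to keep the example self-contained.

```python
import sqlite3

def stale_fraud_addresses(conn):
    """Return (account_id, shared_address, fraud_address) rows where the copies disagree."""
    return conn.execute(
        """
        SELECT a.account_id, a.address, f.address
        FROM account_address AS a
        JOIN fraud_address AS f ON f.account_id = a.account_id
        WHERE a.address <> f.address
        ORDER BY a.account_id
        """
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical shared corporate table (diagram 1).
    CREATE TABLE account_address (
        account_id INTEGER PRIMARY KEY,
        address    TEXT NOT NULL
    );
    -- Hypothetical private copy kept by the fraud department (diagram 2).
    CREATE TABLE fraud_address (
        account_id INTEGER PRIMARY KEY,
        address    TEXT NOT NULL
    );
    INSERT INTO account_address VALUES (1, '12 Old Mill Rd'), (2, '7 Castle St');
    INSERT INTO fraud_address SELECT account_id, address FROM account_address;
    """
)

# Customer service records a change of address, but only in the shared table.
conn.execute(
    "UPDATE account_address SET address = '99 New Keep Ln' WHERE account_id = 1"
)

# The fraud copy is now stale for account 1.
drift = stale_fraud_addresses(conn)
print(drift)
```

A check like this can only report the damage after the fact. With a single address table there is nothing to reconcile in the first place.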
Let's look at another example. XYZ Company takes orders by catalog. The catalog id is derivable by joining the line to the order then joining the order to the customer. (see diagram 3)
The users in the warehouse have requested that the catalog be redundantly stored on the line item. (see diagram 4)
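As a rough illustration of what the warehouse faces today, the sketch below derives the catalog for each line item through the two joins described above. The table names (customer, orders, order_line), the columns, and the sample rows are assumptions made for the example, not XYZ Company's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical normalized schema (diagram 3): the catalog lives on the customer.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        catalog_id  TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    );
    CREATE TABLE order_line (
        line_id  INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES orders(order_id),
        item     TEXT NOT NULL,
        quantity INTEGER NOT NULL
    );
    INSERT INTO customer VALUES (1, 'SPRING'), (2, 'WINTER');
    INSERT INTO orders VALUES (10, 1), (11, 2);
    INSERT INTO order_line VALUES
        (100, 10, 'lance', 1),
        (101, 10, 'shield', 2),
        (102, 11, 'helmet', 1);
    """
)

def lines_for_catalog(conn, catalog_id):
    """Find line items for a catalog by joining line -> order -> customer."""
    return conn.execute(
        """
        SELECT l.line_id, l.item, l.quantity
        FROM order_line AS l
        JOIN orders   AS o ON o.order_id = l.order_id
        JOIN customer AS c ON c.customer_id = o.customer_id
        WHERE c.catalog_id = ?
        ORDER BY l.line_id
        """,
        (catalog_id,),
    ).fetchall()

spring_lines = lines_for_catalog(conn, "SPRING")
print(spring_lines)
```

Every warehouse query that needs the catalog has to repeat both joins, which is the inconvenience behind the users' request.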
We all know the problems that redundant data can cause, so our first reaction is probably to tell them that the data is derivable and that this change would violate third normal form. Before we go off on a crusade as the lonely defender of the data universe, let's be sure it isn't a windmill. Let's get the facts.
- Customers are assigned to a particular catalog and their orders are credited to that catalog
- In order to process the orders the warehouse only cares about the line items to be shipped
- There are many times when the line items need to be accessed by catalog:
a) The catalog id of the customer is used to prioritize shipments.
b) There are often misprints in a catalog that need to be corrected at the line item level.
c) Daily sales by catalog is important to the business.
Denormalizing (redundantly storing) the catalog id on the order line table may not be correct in a pure modeling sense, but will it have a negative data quality impact? One important question to ask when considering denormalization is how frequently the denormalized field(s) are updated. How hard will the data be to keep synchronized? In this case the catalog id should never change, so the risk of data getting out of sync is small. What benefit will be gained by the denormalization? Queries by item and catalog, an important business fact, will be faster and easier. After we review the facts, it becomes obvious that if we put on our armor and attack this request as a problem, we will be chasing windmills.
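The trade-off can be sketched as follows, again with hypothetical table and column names and made-up rows. The catalog id is copied onto the line once, when the line is created; catalog queries then need no joins, and a simple check can confirm that the copy still agrees with the customer. This is one possible way to do it, not a prescribed design, and it sums quantities rather than modeling dated sales.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        catalog_id  TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    );
    -- Hypothetical denormalized line table (diagram 4): catalog_id is stored redundantly.
    CREATE TABLE order_line (
        line_id    INTEGER PRIMARY KEY,
        order_id   INTEGER NOT NULL REFERENCES orders(order_id),
        catalog_id TEXT NOT NULL,
        item       TEXT NOT NULL,
        quantity   INTEGER NOT NULL
    );
    INSERT INTO customer VALUES (1, 'SPRING'), (2, 'WINTER');
    INSERT INTO orders VALUES (10, 1), (11, 2);
    """
)

def add_line(conn, line_id, order_id, item, quantity):
    """Insert a line, copying catalog_id from the customer who placed the order."""
    conn.execute(
        """
        INSERT INTO order_line (line_id, order_id, catalog_id, item, quantity)
        SELECT ?, o.order_id, c.catalog_id, ?, ?
        FROM orders AS o
        JOIN customer AS c ON c.customer_id = o.customer_id
        WHERE o.order_id = ?
        """,
        (line_id, item, quantity, order_id),
    )

def units_by_catalog(conn):
    """Total quantity per catalog, read from the line table alone (no joins)."""
    return conn.execute(
        "SELECT catalog_id, SUM(quantity) FROM order_line "
        "GROUP BY catalog_id ORDER BY catalog_id"
    ).fetchall()

def out_of_sync_lines(conn):
    """Lines whose stored catalog_id disagrees with the customer's catalog_id."""
    return conn.execute(
        """
        SELECT l.line_id
        FROM order_line AS l
        JOIN orders   AS o ON o.order_id = l.order_id
        JOIN customer AS c ON c.customer_id = o.customer_id
        WHERE l.catalog_id <> c.catalog_id
        ORDER BY l.line_id
        """
    ).fetchall()

add_line(conn, 100, 10, "lance", 1)
add_line(conn, 101, 10, "shield", 2)
add_line(conn, 102, 11, "helmet", 1)

totals = units_by_catalog(conn)
mismatches = out_of_sync_lines(conn)
print(totals, mismatches)
```

Because the catalog id is written once and never updated, the synchronization check should stay empty. If the field changed often, the same check would start returning rows, and the answer to the windmill question might be different.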
The second thing we need to accomplish is to improve the Castle and life inside. We must be sure that we are adding business value with our decisions. Every process, standard and project that we are involved in should further the business direction. If what we are doing does not add to the bottom line of the business, directly or indirectly, we should question why we are doing it. Some things, like standards, do not appear to add business value, but in reality they can have a large indirect effect. Common ways of doing things enable us to learn from the mistakes (and successes) of others. They prevent us from repeating the same mistakes and provide a consistent approach across the company. That is added business value. That is improving Castle life.
I once worked with someone who wanted to change the way we model. I asked why, and the only answer they could give was that it was easier for them. It would have saved them some time on each model, but it would have reduced the clarity of the model. This would have been a selfish decision with a negative business impact. The models would have become less usable. We need to improve Castle life by making the best decision for everyone in the Castle, not just for ourselves.
Third, we must keep the King happy. If we think we have a perfect design but our customers do not agree, then we do not have a perfect design. It is important that we listen to the customers and deliver what they need to do their jobs. We are providing a service, and if the user is not happy then we were not successful. If a builder believes he built the perfect house but the person paying the bill doesn't agree, then he was not successful. We are no better. If the King (the business) isn't happy with our work, then it is not correct. Part of our job is to "sell" data quality and convince the users of its benefits, but we cannot, and should not, force it on them. Often we need to proceed in incremental steps, showing benefit with each step. This enables us to eventually accomplish the ultimate goal while still keeping the King happy.
If we keep these simple rules in mind, we will be successful. Stand up for data quality, not data theory; prioritize tasks that add real business value; and understand that the user (the one paying the bills) is the ultimate judge of the success of our work.
Good Luck!!