Data Warehouse Standards
By Sid Adelman
Many dog owners give their dogs what they consider to be commands. They are really more like guidelines. ("Boscoe come!… pause, pause, pause… Well I guess Boscoe is busy with his chewy toy and doesn't want to come just now.") A number of organizations claim to have standards but they are also just guidelines. People will follow those "standards" if they feel like it and if they feel it benefits them. A standard should be "Thou shall" while a guideline is a recommendation, more like "You should if your situation warrants." This article will discuss standards, not guidelines. They are:
- Data sourcing
- Meta data
- Service level agreements
- Code standardization
- Project prioritization
- Data stewardship
- Ease of use for access and analysis
- Data Quality
There is probably no other area in data warehousing that is so labor intensive and has such exposure for mistakes. The methodology by which data sourcing is executed will have a major impact on the success of the project.
Determining user requirements - The first step in developing a data warehouse is determining what the users need, want and are willing to pay for. The pay comes in the form of budget, the users' effort and involvement and the elapsed time it will take to implement all that the user wants. It is critical that the user understands the impact of the request for "just one more source file."
In this step the users can be prompted on their requirements (not their desire) for the cleanliness of the data. There are a number of approaches, three of which are one-on-one interviews with users, Joint Application Design (JAD), and some more formalized approaches. These methods of determining user requirements can all be effective. The important point is that they should be followed and followed in a rigorous manner. Seat-of-the-pants methods are almost sure to fail.
Researching source data - Data warehouse data can often come from multiple sources. What is the most accurate, timely and which data provides the most comfort to the user? While some of the source data may come from external sources, it is usually more difficult to understand data from outside the organization. In the process of searching source data, the use of timely and accurate meta data can be invaluable. Data models can aid both IT and the users in their understanding of potential data and the interrelationships of the data. Since almost all source data has some quality problems, this is the time to determine how clean the different sources are. There are tools that can be effective in evaluating cleanliness, basically giving you a report card of the quality of your data.
Transformation standards - There are a number of ways data can be moved to the data warehouse. The language or tools used in the transformation process must be determined. There are a number of ETL tools that can aid in the migration and cleansing process. This is the place to implement business rules to minimize bad data from making their way into the data warehouse.
Meta data is information about the data in your data warehouse. Meta data can document the business definitions of the data, the valid values, security characteristics, ownership, timeliness, quality, data type, and data length. Meta data is a key component of a data warehouse and it's important to know what meta data will be captured and how it will be used.
A number of the data warehouse tools have meta data capability and there are some interfaces and even some integration among those tools. The meta data capability of the data warehouse tools and how they interface and integrate with other selected tools should be an important determinant in the tool evaluation process.
Meta data standards relate to the how developers will be using the meta data to improve their own productivity and the quality of their work. It also relates to the documentation they produce and the documentation that is subsequently available to others in the organization.
Service Level Agreements
A service level agreement (SLA) is a written agreement between IT and the project sponsor who employs the users of the system. The agreement is that IT will provide a level of service that is, hopefully, both reasonable and cost effective.
SLAs are commonly written for availability - the hours/day and the days/week the system is scheduled for access, e.g. 18/6. In addition, availability also includes the percentage of time the system is up and running during the scheduled hours, usually represented as a percentage, e.g. 98%.
While performance SLAs are appropriate for online transaction processing systems they are not relevant to the data warehouse due to the extremely high variability of the data warehouse ad-hoc query characteristics. For example, the first query might access 20 rows and the next query might access 20,000,000 rows - performance will vary.
Timeliness SLAs would indicate by what date following the close of business the data warehouse would be accessible and timely. For example, "two days after the close of the month month-end data will be available."
If an organization had SLAs for problem resolution and response to requests, it should also have these SLAs for the data warehouse environment.
According to an FBI study, the average cost of an attack by an insider on a computer system is $2.7 million dollars. This does not include the impact on morale, the reputation of the organization, the embarrassment to the CIO, and the cost of management attention. Security is becoming more important to every segment. The data warehouse presents some new security challenges especially as some portions of the data warehouse are made available to people outside the organization. These would typically include suppliers and large customers.
Lines-of-business are often retentive about their own data, severely reserving access to only their department. This anal retentive characteristic is usually explained by the inability of anyone outside of the department to be able to understand the subtle nuances that allow an understanding of the data. The unwashed outside of the department would have neither the experience nor the mental capacity to decipher the raw data. It is only when the department analysts examine the data - applying an appropriate spin - and explaining the results that the information could be disseminated to the rest of the organization.
As data warehouse tools are being selected, their security capabilities must be evaluated not just for the function they provide but also for the effort involved in administering security - some security administration is very labor intensive.
The in-house security office must be aware of the potential exposures and must work with the IT people responsible for the security capabilities of the data warehouse tools. The security office should know what the requirements are and the IT personnel should take these requirements and determine how the tools will satisfy the requirements. The security office would then validate the implementation.
Privacy is becoming more and more important and relevant in the lives of people whose evenings are disturbed by cold-call brokers promoting a sure-fire winner, the initial public offering of beefstake.com. Applications that use customer information, most notably customer relationship management (CRM) applications that may overstep the line into a person's private life have grave implications for a company wishing to optimize its marketing efforts while not offending and annoying its existing customer base.
The types of data that must be kept private are often industry specific. For example, access to health care data including patient history, diagnosis, laboratory results and pharmaceuticals prescribed are specifically restricted by federal law.
Some privacy standards may come from the legal department while others may come from the CEO, marketing or the public relations department.
Code standardization is especially important for companies with multiple divisions, for companies in more than one location and definitely for multinational organizations. Without common codes, rolling up numbers is all but impossible and is fraught with potential errors as numbers are assigned to the wrong buckets.
For each field, the domains (valid values) will dictate the edit rules in the operational systems that feed the data warehouse. Unfortunately, most organizations have not enforced such standards in their operational systems. Even when domains have been defined, the edits rules in the operational systems have not followed suit and are often incomplete.
Agreements on code standardization are always a problem as groups jockey for getting their own codes accepted. It is expensive and disruptive for a department to alter the codes they have been using and they will not be happy if they are forced to change. For this reason, commitment for code standardization must come from the top and budgets should be allocated for the additional expense of changing codes. There will be cases where it becomes a Herculean effort to standardize all the codes and so an organization should just focus on the codes that can reasonably be standardized.
The powerful manager is usually the one who gets his project implemented, often pushing out a project with greater justification. A standard for project prioritization that includes cost justification should put the projects in the correct implementation order and should eliminate projects that cannot be cost justified.
A best practice is a Business Advisory Board that meets to determine the priority sequence in which projects will be implemented as well as deciding which projects should never be implemented at all. Information about the ROI of each project will help the Business Advisory Board with its efforts.
Data stewardship, the care and feeding of the data warehouse database is a critical success factor, i.e., a factor that is necessary for the project to be a success. Data stewardship includes the responsibility for performance, incorporating a good design, monitoring and tuning the database. Stewardship extends to designing and administering the database security including updates, deletions (if they are allowed) and access. The data steward is responsible for backup and recovery and is also responsible for archiving old data and the restoration of that archived data as needed.
Data stewardship implies not just responsibility for protecting and nurturing the database but also the authority to make decisions and take actions, which cannot to be overridden by a ham-fisted manager pushing for his or her own agenda.
Ease of use for access and analysis
Many designated users of the data warehouse are reluctant to use the tools and access the data warehouse and only do so when forced. Most often, these users will ask someone in their department who is more comfortable with the access tool to create the report for them. Insufficient or inappropriate training is sometimes the culprit but quite often power users were the ones who choose the access tool. While it might have been easy and obvious for the power users, the more casual users found access to be frightening and difficult.
A standard for ease of use must be incorporated in the tool selection process. The implementation of the tool, including a well-documented and complete query and report library will go a long way to giving the casual users a feeling of comfort and acceptance. The only ones who can validate the ease of use of the tool and its implementation are the technophobic end users whose use or abandonment of the system will determine its success.
There are those who take the position that testing in the data warehouse environment is always an option. The idea that the data warehouse has allowed us to abandon all the important lessons we learned in developing operational systems is WRONG! You still must test. Testing is critical for the ETL process. A best practice verifies that the ETL process has run correctly by verifying the number of records from the source systems match those of the target (allowing for discarded records) and cross checking for numerical values and dollar amounts.
Just because a query ran to completion and produced a result, it does not mean the answer is correct. The query may have been written incorrectly, the data might not have been understood, the data may have been wrong or incomplete, or old data may have been accessed with the user believing he or she was looking at current data. User training should include techniques for validation including reasonableness checks.
Any queries or report programs that become a part of the libraries must go through a rigorous test since users will be counting on the correctness of these programs. The testing should include user acceptance tests that incorporate the documentation associated with the queries and reports.
Data quality is a mixed bag that includes no duplicates, no missing values, correct data types, valid values and accurate data. It's best to look at each of these data quality characteristics separately as the tasks to correct -or not correct - the dirty data is often quite different.
Very few organizations have a process to determine how clean the data should be. Some misguided organizations make the assumption that all the data should and will be clean. Other organizations don't even consider cleansing since they believe that if it's clean enough for the operational systems, it's clean enough for the data warehouse. The fact is that while it may be clean enough for the operational systems, it just isn't clean enough for the data warehouse.
The approach to data quality must be pragmatic. First you need to determine just how bad the data is - it's almost always worse than you thought. Next you need to determine the value of cleaning up each data field and if it's even feasible to do so - some data can never be corrected. Your data will never be perfect and so you need to determine where you will spend your valuable time and resources. Cleansing of some data will cost more than its worth.
The owner of the data, usually the line-of-business manager responsible for the data in the data warehouse will decide how clean the data needs to be.
Data warehouse standards are critical success factors and can spell the difference between the success and failure of your data warehouse projects. It's time for the CIO to step up to making a commitment to these standards, communicating not just the importance of the standards, but that they are standards, not guidelines, meaning "You will follow them. There will be no exceptions or dispensations without my expressed and written approval. If you think (and you surely should) following the standards will result in additional tasks, time and budget, I expect you to include those factors in your project plan and budget."
About the Author
Sid Adelman founded Sid Adelman & Associates, an organization specializing in planning and implementing Data Warehouses. He consults and writes exclusively on data warehouse topics and presents regularly at DCI and other data warehouse conferences.
Sid can be reached at 818.783.9634, firstname.lastname@example.org, www.sidadelman.com.