Effective data warehouse development requires an iterative approach that results in a robust, well-defined and usable system for analytics.
This article is excerpted from Data Warehouse Project Management (Addison-Wesley, Adelman and Moss, © September 2000).
Data Warehouse Iterations
A data warehouse cannot and should not be built in one Big Bang. Instead, a data warehouse is an evolving system that must be built in small iterations, which over time evolve into a robust, well-defined, quality-oriented data warehouse. Although this concept is easy to understand and appreciate, it is difficult to apply.
Users have been conditioned over the past few decades to ask for all they can during their initial identification of requirements, because they may never get a second chance. With this idea of “If I don’t ask for everything under the sun now, it will take years before I’ll get it” they continue this pattern of:
- asking for everything (usually at transaction level detail),
- wanting to keep everything forever,
- having been influenced by vendor hype, wanting it all now, which usually translates into 90-120 days or less,
- expecting that, since this is a data warehouse, everything, forever, and now will instantly be clean and integrated – as if by magic.
This won’t work. It won’t work for a number of reasons.
Reasons for Big Bang Failures
First of all, in a decision support / analytics environment, the requirements are never as stable and as well understood as in an operational system. Moreover, in a data warehouse environment, where users can “play” with their data as never before, they are constantly discovering new requirements, and they are changing or dropping some of their old ones. This means that the underlying system would constantly have to change as well. The larger the system, the longer it takes to change it.
Secondly, “everything” and “now” have never been able to live in the same sentence, not in an operational environment, and especially not in data warehousing. “Everything” in data warehousing means data from many operational systems, data that has never been cleansed or integrated. It also means data from external sources and sources completely new to the business and the developers. The time and effort required for all that dirty and disparate data to be cleansed and integrated definitely will not fit the category of “now”.
Thirdly, since “everything” implies a colossal scope, and “now” implies a minuscule amount of time, all the vendor hype in the world won’t change the laws of physics. There is great merit in asking for immediate delivery, so that the organization can quickly gain from added value to its business. However, since it is not possible to deliver everything quickly, let alone immediately, there must be a strategy in place to deliver small increments over time.
Finally, the larger the scope, the harder it is to manage, and the higher the risk of failure. Add to this formula the probability that the requirements are unstable and subject to change, and any Big Bang data warehouse project is heading straight for disaster: it is guaranteed to miss target dates and budgets and, most importantly, to fail to meet user expectations. All of these outcomes are difficult to recover from.
Considerations for First Data Warehouse Project
The first data warehouse project is the most crucial one and should be kept restrained in scope, both functional scope and data scope. Users, especially those still wary of IT’s past non-responsive track record, will resist this, and it will be up to you to explain the reasons for this approach.
Huge learning curve
The staff may have attended training seminars for the new tools, they may have gone to data warehouse conferences, and they may even have a mentor aboard who will help overcome hurdles. Still, there will be a tremendous learning curve for everyone on both the IT and the user side, because training is in no way a substitute for experience. It is essential to provide training that combines concepts with practical experience.
Besides all the adjustments the team members will have to make, coming up to speed with new technology will be a challenge. This will be especially true for those whose first exposure to the client/server or web environment is the data warehouse project. The web, especially, adds an additional layer of complexity to building, testing, maintaining and operating the data warehouse. In addition to learning new technology components, such as new development tools (ETL) and new delivery tools (OLAP), the team members will have to learn to navigate the new environment. That means learning a new operating system, learning how to manipulate files, learning new languages and their compilers, learning new testing tools, and learning new utilities. It also means learning about operational features on their workstation that they never had to worry about before, such as the network setup, library management, file backup and recovery.
Roles and responsibilities will have shifted and the team members may be struggling with their new roles. Some may not have the skills or the confidence to carry out their new responsibilities. Some may not be used to working in a core team group where brainstorming on issues, reaching consensus decisions, and occasionally challenging old work habits are the norm.
Lack of infrastructure
When speaking of infrastructure, we must divide the topic into two sections: one being the technical infrastructure, the other being the non-technical infrastructure.
Although most organizations already have a technical infrastructure in place, some new technical infrastructure pieces may have to be added, such as new hardware, network components, a new RDBMS to be used, as well as new utilities and tools. This is not a trivial effort, and it must take into account future data warehouse iterations in order to be scalable.
As involved as the technical infrastructure components are, the non-technical infrastructure components often take more time to implement. Standards have to be established. A methodology for data warehouse projects has to be purchased or developed. As part of that methodology, roles and responsibilities have to be established, change control procedures have to be written, and an issues log has to be created. Security considerations for hardware, network, database access, web access, tool usage and for sensitive corporate data have to be documented. Procedures for enforcing security measures also have to be set up and published. In addition, a help facility along with personnel to assist end-users must be established.
Lack of standards
Many organizations are very lax in enforcing their standards and procedures, if they have any at all. Since operational systems and traditional decision support systems have always been separate and non-integrated, the lack of standards might have been tolerable, although never advisable. In an integrated data warehouse environment, lack of standards and managed metadata will contribute to a chaotic development environment, and will result in non-cohesive data delivery to the users. When developing standards and procedures, the following categories have to be considered:
Prioritization of requirements
Users may disagree about whose requirements should be the first or the next data warehouse deliverable. If a procedure for prioritization is not in place, IT will be caught in a political fight that it cannot resolve without causing bad feelings amongst users, and possibly losing some future data warehouse users.
Data naming (including synonyms and aliases)
Both the business names and the technical names should follow standards. A common standard for creating a business name is to base it on the concise business definition, to eliminate articles, prepositions, conjunctions and excess words, and to classify each remaining word as a prime word, a qualifier, or a class word. This is often a responsibility of data governance professionals and data stewards.
Technical data names should always use abbreviations. This requires that an approved abbreviation list be compiled and published. A rule of thumb for creating these abbreviations is to eliminate the vowels and double consonants of a word. A common standard for creating a technical name is to use the approved abbreviation of each component of the business name. If the resulting technical name is still too long, one or more qualifiers are eliminated. Under no circumstances should an approved abbreviation be further abbreviated.
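As an illustration only, the rule of thumb above can be sketched in a few lines of Python. The function names and the sample approved-abbreviation list here are hypothetical, not prescribed by the standard itself:

```python
def abbreviate(word: str) -> str:
    """Rule of thumb from the text: keep the first letter, then drop
    vowels and collapse double consonants."""
    word = word.upper()
    result = [word[0]]
    for ch in word[1:]:
        if ch in "AEIOU":
            continue          # eliminate vowels
        if ch == result[-1]:
            continue          # eliminate double consonants
        result.append(ch)
    return "".join(result)

# A hypothetical approved-abbreviation list; in practice this list is
# compiled, approved, and published before any technical names are created.
APPROVED = {"CUSTOMER": "CSTMR", "ACCOUNT": "ACNT", "BALANCE": "BLNC"}

def technical_name(business_name: str) -> str:
    """Build a technical name from the approved abbreviation of each
    component of the business name."""
    parts = business_name.upper().split()
    return "_".join(APPROVED.get(p, abbreviate(p)) for p in parts)
```

For example, `technical_name("Customer Account Balance")` yields `CSTMR_ACNT_BLNC`; if the result were still too long, qualifiers would be eliminated rather than the approved abbreviations shortened further.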
Metadata capture and usage
Metadata provides the navigation through a data warehouse, so it is imperative that the metadata be complete and accurate. Procedures must therefore be set up, with clearly assigned responsibilities and authorization, for capturing and maintaining the metadata, as well as for how and when to use it. There are two types of metadata: business metadata and technical metadata. Business metadata serves the users, providing daily clarification of the data they access and use. Technical metadata assists both users and IT staff in navigating the labyrinth of tables, columns, and programs used to define and maintain the data warehouse and its associated processes. These two sets of metadata have very different focuses and purposes. Appropriate individuals should be assigned the task of capturing and maintaining metadata as completely and promptly as possible.
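A minimal sketch of what a combined metadata record for one data element might hold follows; the field names are our illustration, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class ColumnMetadata:
    # Business metadata: serves the users on a daily basis
    business_name: str        # e.g. "Customer Account Balance"
    business_definition: str  # concise business definition of the element
    steward: str              # person accountable for the definition
    # Technical metadata: helps users and IT navigate tables and programs
    table_name: str
    column_name: str
    source_system: str        # the feeding operational system
    transformation_rule: str  # how the value is derived during load
```

Keeping both halves in one record makes the different audiences explicit while still assigning a single point of capture and maintenance.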
Data quality measures
Since a main objective of every data warehouse is to improve data quality to some degree, standards and procedures should be established on how to accomplish that. These standards and procedures should include a process for determining what level of quality is required of the data warehouse data and for assessing the data quality of the feeding source systems. The standards and procedures should also include guidelines for determining the severity of the problem and for performing a cost benefit analysis for the data cleansing effort. Finally, the standards and procedures should also outline a program for periodically monitoring and measuring the quality of the data within the data warehouse.
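A program for periodically monitoring data quality can be as simple as running validation rules against warehouse rows and reporting the percentage of records with accurate values. In this sketch the rules and sample data are hypothetical; a real program would derive its rules from the documented quality standards:

```python
import re

# Hypothetical validation rules, one per column.
RULES = {
    "customer_id": lambda v: v.strip() != "",
    "zip_code": lambda v: re.fullmatch(r"\d{5}", v) is not None,
}

def quality_score(rows: list[dict], column: str) -> float:
    """Percentage of rows whose value in `column` passes its rule."""
    rule = RULES[column]
    passed = sum(1 for row in rows if rule(row[column]))
    return 100.0 * passed / len(rows)

sample = [
    {"customer_id": "C001", "zip_code": "30301"},
    {"customer_id": "C002", "zip_code": "3030"},   # too short: fails
    {"customer_id": "   ",  "zip_code": "90210"},  # blank id: fails
    {"customer_id": "C004", "zip_code": "60601"},
]
```

Scores computed this way can then be compared against the quality level determined to be required for the warehouse, and tracked over time.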
Testing procedures
Just because we deliver data and functionality for a data warehouse incrementally does not mean “skip testing – we’ll fix it in the next iteration.” There may not be a next iteration if the quality of the deliverable does not meet the stated user requirements and expectations. Testing standards for each development track should include guidelines for developing test scripts, writing test cases, documenting expected test results, and logging actual results, as well as a process for test reruns.
Security
Similar to testing, just because data warehousing is a decision support environment does not mean “ignore security.” Security standards should include procedures for defining the security requirements for each data element and for identifying the data owners responsible for assigning access privileges to the data. The standards should include rules for reviewing the current data security features in the operational environment, so that the requested security features for the same data in the data warehouse environment can be better evaluated. The standards should also have guidelines for identifying approaches to implement the security requirements and for assigning security responsibilities to the appropriate staff members. Security may be set by group or by individual, and may be controlled at the data access tool, the database, or both. If the data warehouse application is going to be web-enabled, additional security exposures must be identified and addressed. Firewalls must be tested by someone who is looking for ways to break through, not by someone who is attempting to prove the firewall’s invulnerability.
Service level agreements
Most likely your organization already has some standards in place regarding service level agreements (SLA). These standards need to be adapted to data warehouse projects. If these standards do not exist, they must be created, and the guidelines for the SLA categories should be described. SLA categories are typically:
- Response Time: A commitment on how fast a report will run or a table will be refreshed. Note that response time for ad-hoc queries is impossible to estimate.
- Availability: A commitment to the percentage of time the data warehouse will be available during scheduled hours.
- Timeliness: A commitment to the currency of the data.
- Data Quality: A commitment to the reliability of data. An indication of how many records or what percentage of records have accurate values.
- Ongoing Support: A commitment on how quickly problems will be responded to.
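The categories above can be recorded in a machine-checkable form so that actuals can be compared against commitments. In this sketch the target numbers are purely illustrative, not recommended values:

```python
# Hypothetical SLA targets for one data warehouse subject area.
SLA = {
    "response_time": {"standard_report_seconds": 30, "refresh_window_hours": 4},
    "availability_pct": 99.0,   # during scheduled hours
    "timeliness": "current as of the prior business day",
    "data_quality_pct": 98.0,   # records with accurate values
    "support_response_hours": 8,
}

def availability_met(scheduled_hours: float, downtime_hours: float) -> bool:
    """Compare actual availability during scheduled hours to the commitment."""
    actual_pct = 100.0 * (scheduled_hours - downtime_hours) / scheduled_hours
    return actual_pct >= SLA["availability_pct"]
```

A monthly check might then report, for example, whether 3.6 hours of downtime in a 720-hour scheduled month still satisfied the 99% availability commitment.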
Guidelines for First Data Warehouse Project
You should consider the following guidelines when planning your first data warehouse project:
Small in size
What constitutes small in size? This is very difficult to judge because we cannot give a guideline in terms of gigabytes, the number of new reports, or even the number of users. In fact, the meaning of small is directly related and balanced against the time it will take to implement or at least to consider all the other guidelines in this section. Small in size is defined by:
Whatever is doable, in terms of functionality and amount of data, in the amount of time given to this project, considering:
- the cleanliness of the data, or lack thereof,
- the number of new tools,
- the current skill set of resources,
- the number of standards in place, or the standards yet to be established,
- the availability of the users.
Few legacy source files and few data elements
The number of source files to include cannot be judged by count alone; it depends purely on the effort involved to:
- cleanse and transform the data elements of each file,
- integrate the data from the files based on the business rules of the logical data model.
Availability of the source files must also be taken into consideration, and whether they are internal files or externally generated files.
We have measured the time it typically takes to process a data element, from the time it is identified as a requirement to writing the transformation and cleansing programming specifications for it. The average was two to three data elements per day, although some clients averaged only one per day.
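Those measured rates translate directly into a rough effort estimate for the source-to-target mapping work. The element count below is illustrative only:

```python
def mapping_effort_days(num_elements: int, elements_per_day: float) -> float:
    """Person-days to take data elements from identified requirement to
    written transformation and cleansing specifications."""
    return num_elements / elements_per_day

# e.g. 200 data elements at the rates cited in the text:
typical = mapping_effort_days(200, 2.5)     # mid-range of two to three per day
worst_case = mapping_effort_days(200, 1.0)  # the slower clients' rate
```

Even this crude arithmetic shows why "few data elements" belongs in the guidelines: 200 elements can mean anywhere from roughly 80 to 200 person-days of mapping work alone.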
Fairly clean data
Never base your assessment of the source data on simply asking users or IT staff: “How clean do you think the data is?” The answer will always be “Oh, pretty good,” because the person answering will think only of typographical or programming errors. In practice there are ten different categories of dirty data to deal with, and without a thorough analysis of the candidate source files there is no way to estimate how long the cleansing and transformation effort will take. When a source file is externally generated, it may be very difficult to obtain it in advance in order to assess it. In that case, generous estimates and a contingency plan must be developed for any portion of the data warehouse that relies on the external file.
Value-added data, but not mission critical
Although the first deliverable will be small in size, it is nevertheless important that it provides some value-added functionality and new ways for the users to view the data. This could be the integration of two disparate source files, or the collection of all customer data into one customer table, or some derived data whose calculation is very complex and often misused or misinterpreted by users. This value-added functionality or data, however, should not be so mission critical that it will harm the business operation if not delivered in time or as promised.
Few new tools required
This is another category that can be somewhat controlled. For every new piece of technology you have to build in a learning curve: the more new tools, the longer the learning curve, and the smaller the first deliverable has to be. Leveraging technology that already exists in the work environment, where feasible, is the most prudent approach. When new technology is introduced to IT or the users, be sure there is adequate and timely training, as well as time allocated to become familiar with the tool after the training.
Standards to be defined
The status of your standards will be the biggest influencing factor on all of the guidelines described in this section. The fewer standards that exist, the more that will have to be established, and the smaller the first deliverable can be. This is not to say that the project has to come to a screeching halt until all standards are in place. Like all other aspects of data warehousing, the standards will develop in an iterative fashion. A lot will be learned during the first project, and the standards will have to be adjusted. It is crucial that the standards be documented as they are being developed, or they will be postponed indefinitely, much to the detriment of subsequent projects.
Strong and supportive end-user sponsor
This is the most critical ingredient of any data warehouse project, especially your first. Many companies claim they have strong sponsorship for their data warehouse project, but when the challenges start to appear, these strong sponsorships turn out to be nothing but lip service. The sponsor should have a record of being supportive on other projects and a reputation for removing roadblocks for the team. Unfortunately, many data warehouse teams don’t know enough about their sponsors and find out about their strength and support only when something goes wrong. If, when faced with a critical decision, the sponsor’s immediate reaction is to threaten to cancel the project, voice doubts about the project manager’s ability to manage the project, or express concerns about the competency of the staff, you do not have real support. Instead, you have an additional problem, which must be resolved. If, on the other hand, the sponsor’s reaction is a willingness to explore all possibilities for correction, an offer to get involved and run interference, or a supportive pep talk, you probably have a competent and caring sponsor.
Experienced project manager
Data warehouse projects are not for rookie managers. The development environment is much too dynamic to be controlled by an inexperienced project manager. Development activities for the back-end developers run at a different speed than those for the front-end developers. Roadblocks abound, and adjustments have to be made frequently. Managing these projects with a tighter change control procedure than usual is the only recipe for not losing control. Prior experience with a data warehouse project would be very helpful; at a minimum, the project manager must be educated on data warehousing and data management through seminars, conferences, or books and articles. This, coupled with strong management and communication skills, will contribute much to the project’s success.
Users involved in all development steps
Data warehouse projects cannot be developed without full-time user participation. In this dynamic environment, unanticipated problems come up constantly, the impact of those problems has to be evaluated immediately, and decisions for correction have to be negotiated. This is not a process the IT project manager can or should take on alone. These decisions have a business impact and should be made jointly by the IT project manager and the business project manager, after taking all the technical and business-related repercussions into account.
Implemented as a pilot
After all is said and done, and all the testing has been performed, and all the data and queries have been validated against the requirements, it may still be wise not to rush into the production environment. It has been our experience that during the first month or two in production, users create a laundry list of changes. This list might include anything from minor adjustments to major changes of the original requirements. IT often scrambles to apply the minor adjustments to the production tables. Since the production libraries reside in a tightly controlled environment, this scramble often proves time-consuming and painful. We suggest delaying moving the tables and programs into the frozen production environment until these minor adjustments have been identified and applied.
A data warehouse is an evolving environment, which cannot be built with a Big Bang approach. In this dynamic environment the requirements are usually not stable, the learning curve is very high, users have unrealistic expectations, the IT staff is lacking skills, and the infrastructure is incomplete. Therefore, the first project is the most crucial one, and in order to succeed, it should be kept restrained in scope, both functional scope and data scope.