Data warehousing is the proven architecture for delivering subject-oriented, stable, analytical information for decision making. Effective data warehouse development provides many business benefits.
A data warehouse is an environment that combines an integrated decision support database with software to collect, cleanse, transform, and store data from a variety of operational and external sources. These technologies are combined to support historical, analytical, and business intelligence (BI) requirements. A data warehouse may include dependent data marts, which are subject-area databases that are aligned with the data warehouse database. In most references, the term “data warehouse” includes all the components of the environment.
Components of a Data Warehouse
Figure 1: Components of a Data Warehouse
To understand the challenges of building and managing an enterprise data warehouse, it is important to provide some definitions of commonly used terms:
- Data Warehouse – An environment (includes data, processes, hardware, software) that provides data for effective decision-making. Typically, a data warehouse is an enterprise level initiative.
- Atomic Data Warehouse (ADW): A data architecture construct that is the central hub of the Data Warehouse. It integrates, cleanses, and stores data from operational systems to be fed into dependant data marts. It can also support ad-hoc data mining, and other applications.
- Business Intelligence (BI): Application programs and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applications include query and reporting, online analytical processing (OLAP), balanced scorecards with KPI’s, statistical analysis, forecasting, and data mining. Some consider business intelligence to be the delivery aspects of a data warehouse environment.
- ETL Transformation (Extract, Transform, and Load): This is the process of extracting data from operational data sources or external data sources, transforming the data which includes cleansing, matching, standardization, aggregation, summaries, integration, as well as basic transformation, and loading the data into some data store of the data warehouse (ODS, atomic data warehouse, data mart) environment.
- Data Mart: A data structure that is optimized for analysis and reporting purposes for a business process or group of users. Typically, a data mart utilizes a dimensional schema using RDBMS and/or BI cube
- Strategic Reporting: Summarized reporting needed for analysis and prediction to help guide the business from a strategic perspective
- Tactical Reporting: Reporting and analysis to help the company run day to day business
- Operational Reporting or Operational BI: Helps the company to apply intelligence to business transactions to improve performance in “real-time”.
- Operational Data Store (ODS): An integrated, current-valued data structure that is optimized for operational reporting or data service purposes.
- Real-Time Data Warehousing: Integrating and making data available through the data warehouse in “real” time. Also called Active Data Warehousing. Traditional use of phrase “real time” means instantaneous or very nearly instantaneous. For Data Warehousing, some latency is almost always implied.
Different Views of Data Warehousing
The term “Data Warehouse” was coined by William H. Inmon in his book, “Building the Data Warehouse“. In this book, Inmon defined the data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process. This view was the accepted architecture for enterprise data warehousing, where the source applications provided data to be transformed through cleansing and re-formatting based on analytical requirements, then sent to a large central database used by a variety of applications and users. The large database is normalized, where tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). When applied in large enterprises the result is dozens of tables that are linked together by a web of relationships, providing a large-scale view of historical data for analysis. This design can be understood by business stakeholders during requirements development, but it can be cumbersome for developers to implement without sophisticated programming and tool manipulation. In addition, it can be difficult for users to join data from different sources into meaningful information and to access the information without a precise understanding of the sources of data, its metadata, and the structure of the data warehouse.
Figure 2: Traditional Inmon Enterprise Data Warehouse
In response to some of the challenges faced by organizations attempting to implement enterprise data warehouses, an alternative approach was developed by Ralph Kimball, author of The Data Warehouse Toolkit and the creator of dimensional data modeling. In the dimensional approach, data warehouse facts (generally numeric transaction data) are associated with “dimensions” (generally reference information that give meaning / context to the facts).
Figure 3: Dimensional Model Example
Dimensional models are used to create “data marts,” not enterprise data warehouses. The Kimball approach to data warehousing recommends creating a collection of data marts instead of a single enterprise data warehouse, and relating those data marts through a set of common dimension tables (“conformed dimensions”). Advantages to this approach include ease of use for business stakeholders as well as faster technical construction. The main disadvantages to the dimensional approach to data warehousing are the challenges of loading data to the various data marts from the disparate sources while maintaining data integrity within the dimensions, and the difficulty in modifying the various data mart structures to accommodate changes to business processes or business analytical requirements after implementation.
Figure 4: Traditional Kimball Data Mart Architecture
To address the challenges of both approaches, Inmon and Kimball, and to exploit the advantages of the two architectures, a third data warehouse architecture arose and gained acceptance among many practitioners and experts. The hybrid architecture combines an enterprise data warehouse designed using a normalized atomic level database along with data marts targeted to a subject area or more detailed analytical data presentation.
Figure 5: Hybrid Data Warehouse General Architecture
Goals of Data Warehouse
Most organizations have the following goals when they embark upon a data warehouse initiative:
- Integrate operational systems data from a variety of sources
Without integration and standardization, analysts must repeatedly integrate data from the source systems. Research shows that up to 80% of an analyst’s time can be spent in finding the data and only 20% in analysis. Providing integrated, cleansed, and standardized data to the analysts in an easy to use structure allows them to spend more time focused on analysis and less on the mechanisms of getting the data into the proper format.
- Improve strategic decision-making through analysis of data from a variety of sources using historical data
Systems should not simply feed information, they should help lead a company to make decisions that are aligned with the organization’s strategy. An enterprise data warehouse can provide: access to the appropriate, decision-oriented information when needed; a business view of the data, from business vantage points for strategy development. Entities such as customer and product are often disparate due to the nature of a business (e.g., multiple lines of business) and must be reconciled to an enterprise view. Not having a standardized, enterprise-wide view of a company’s master data leads to information silos. An enterprise data warehouse can provide that enterprise view and reduce or eliminate information silos.
- Improve operational decision-making through analysis of historical data of specific subject areas, or across subject areas
Strategic corporate decisions focus on reporting, analysis and prediction – tactical corporate decisions focus upon attempting to drive behaviors in real-time.
For example, a casino knows what drink a customer likes, and after a certain amount of time, software prompts a manager to send a waitress to deliver the drink to the customer, thus reducing the likelihood that the customer will visit other casinos. That was a tactical corporate decision. Making tactical corporate decisions enabled by the Data Warehouse is supported by relatively new “real-time” data warehousing methodologies.
Characteristics of a Data Warehouse
All data warehouses are defined by the following characteristics: Subject-oriented, integrated, non-volatile, time variant.
Goals of Data Warehouse
Although each organization will have its own, specific, goals for a data warehouse, there are certain common goals that can be used as the foundation for creating the list of expectations for any data warehouse initiative. Some of these goals could include:
- To enable users appropriate access to a homogenized and comprehensive view of the organization, supporting forecasting and decision-making processes at the enterprise
- To achieve information consistency. By bringing data from disparate data sources into a centralized database, the data warehouse provides a homogenized view of the organization’s data. Users from across the organization making use of the data warehouse all view a single and consistent version of the enterprise’s information that is available for analysis and reporting, at tactical and strategic decision-making levels.
All data warehouse goals should be aligned with the organization’s business goals, so that business value of the data warehouse’s development and maintenance can be justified and supported.
Challenges to Data Warehouse Development
Each data warehouse will experience specific challenges; however, most challenges will fall into one or more of the following categories. According to The Data Warehousing Institute, these data warehousing challenges are the most commonly experienced by organizations of every size and industry:
- Data quality management in data warehouse database and data marts: Most data warehouses bring data from multiple source systems, with varying levels of quality found in the data. When a data warehouse project tries to integrate the data from the source systems, it encounters issues such as inconsistent data, repetitions, omissions, and semantic conflicts. All these issues lead to data quality challenges. Resolving these issues and conflicts become difficult due to limited knowledge of business users outside the scope of their own systems.
- Data warehouse performance expectations and results, metrics and measurements: Many organizations neglect performance requirements when designing data warehouse projects initially. Such oversight can hamper data warehouse success significantly, and little can be done to improve data warehouse performance after implementation, since performance objectives are easier to be designed into the environment than to be tuned. Data warehouses should be built for performance rather than tuned for performance. Database tuning for the data warehouse must include the atomic data warehouse and all data marts, and performance tuning requirements will vary based on architecture, platform, and user populations.
- Data warehouse testing, from unit to user acceptance: Data warehouse testing is a major project itself, and is often neglected by organizations. Testing for each ETL process from source to target (including testing from atomic data warehouse to all data marts), testing for all reports and analytics applications, and regression testing for all additions to the data warehouse environment must be included in the data warehouse testing plans.
- Data reconciliation, data transformation (ETL), data lineage mapping, source system analysis: Reconciliation is a process of ensuring correctness and consistency of data in an application or environment. Unlike testing, which is a part of the software development life cycle, reconciliation is a continuous process that must be performed during and after the development cycle has been completed. Reconciliation is challenging due to the complexity of the data warehouse development process, which requires that all transformations and their ETL processes be examined for logic and performance consistency. Additionally, the reconciliation process must comply with performance requirements. Other challenges in this area include the need to document all data lineage and source system analysis that is performed to support data selection from each source for inclusion into the atomic data warehouse and / or a data mart.
- Cultural change management, user process change acceptance: Generally, data warehouse implementation means changes will occur to existing business processes for reporting, analysis (ad hoc and standard), and all forms of decision-making as a result of the availability of integrated, historical, analytically formatted data. In many organizations, the largest challenge to the success of a data warehouse initiative is the cultural change management. It is important to include re-designed processes and procedures to take advantage of this new source of data and information, and to incorporate this enterprise approach to data management instead of relying on a variety of data / information silos. Resistance to change can be extremely strong in many organizations, and this issue should not be overlooked when planning for a data warehouse.
- Data warehouse metadata management Metadata has been identified as a key success factor in data warehouse projects. Metadata captures a variety of information necessary to extract, transform and load data from source systems into the data warehouse and data marts, and to use and interpret the data warehouse’s (and data marts’) contents. The lack of a managed metadata environment, or a central metadata storage facility for the data warehouse, can be a main reason for a data warehouse project’s failure.
- Data warehouse data governance and data stewardship: Data governance refers to the overall management of the availability, usability, integrity, and security of the data found in an enterprise. A data warehouse serves as the focus for analytical and decision making querying and reporting, and, therefore, needs the attention to data requirements across the enterprise that a robust data governance program would provide. Data warehouse initiatives require organizations to make many decisions that involve data from several sources, to enable the cross-application analysis. In addition to these foundational data challenges, a data governance program for a data warehouse can provide analysis for external data that is brought into the warehouse, and can offer the oversight to enforce standards and rules after the decision support system becomes operational.
For any organization that wants to take advantage of the wealth of information that exists in its variety of source applications, make strategic and tactical decisions based on facts that have been formatted for analytical purposes, and improve the quality of the data in their systems and the processes that govern those systems, embarking on a well-designed data warehouse initiative is the answer.