An Overview of Data Warehouse Testing
Data warehouse and data integration testing should focus on ETL processes, BI engines, and applications that rely on data from the data warehouse and data marts.
There is little that casts doubt on a data warehouse and BI project more quickly than incorrectly reported data. It is crucial that data warehouse project teams do all in their power to deliver reports with high quality data.
This article presents an overview of key concepts for planning tests that assure there’s no need to continually address quality issues on your project. If a team follows the framework, the integrity and stability of any BI solutions should increase significantly.
Data warehouse testing is important and qualified testers are in demand. The reasons are clear. Prominent among them is the increase in business mergers, data center migrations, compliance regulations, and management’s focus on data and data-driven decisions related to business intelligence (BI) initiatives. Among the data warehouse testing focus areas are the ETL process, BI engines, and applications that rely on data warehouses.
Preparing for the Data Warehouse QA Process
A good understanding of data modeling and source to target data mappings help equip Quality Assurance (QA) analysts with information to develop an appropriate testing strategy. Consequently, it’s important that during the project’s requirement analysis phase, the QA team (along with business analysts and data analysts who may be testing as well) work to understand the data warehouse implementation to the greatest extent.
Various stages of the data warehouse implementation (source data profiling, data warehouse design, ETL development, data loading, and transformations, etc.), require the testing team’s participation and support. Unlike some traditional testing, test execution does not start at the end of the data warehouse implementation and BI application use. In short, testing involves many development phases spread throughout the life cycle of the data warehouse implementation.
A key element contributing to the success of the data warehouse solution is the ability of the testing team to plan, design, and execute a set of effective tests that will identify multiple issues related to data inconsistency, data quality, data security, failures in the extract, transform and load (ETL) process, performance related issues, accuracy of business flows and fitness for use from an end user perspective.
Overall, the primary focus of testing should be on the end-to-end ETL process. This includes, validating the loading of all required rows/columns, correct execution of all data transformations, and successful completion of data cleansing operations. The team also needs to thoroughly test SQL queries, stored procedures, and other code that produces aggregate or summary tables. Keeping in tune with emerging trends, it is also important for test team to design and execute a set of tests that are customer experience–centric.
While basic QA philosophies hold true when testing a data warehouse implementation, it’s important for test teams to understand that testing a data warehouse is different from most other software testing. Since a data warehouse primarily deals with data, a major portion of the test effort is spent on planning, designing, and executing tests that are data oriented. Such tests include SQL queries, validating that ETL sessions execute as expected, that ETL tool and store procedure exceptions are handled effectively, application performance meets the SLAs and finally, ensuring that data integration points are working as expected.
Data Warehouse Testing Goals and Related Verification Methods
Primary goals for verification over all data warehouse project testing phases include:
- Data completeness.Ensure that all expected data is loaded by means of each ETL procedure.
- Data transformations.Ensure that all data to be transformed is completed correctly according to business rules and design specifications.
- Data quality.Ensure that the ETL process correctly rejects, substitutes default values, modifies, handles invalid data.
- Performance and scalability.Ensure that data loads and queries perform within expected periods and that the technical architecture is scalable.
- Integration testing.Confirm that the ETL process functions well with other upstream and downstream processes.
- User-acceptance testing.Certify the data warehousing solution meets users’ current expectations and anticipates their future expectations.
- Regression testing.Ensure existing functionality remains intact each time a new release of ETL code and data is completed.
Listed below are a few of the many reasons to thoroughly test each data warehouse and use a QA process that is specific to data and ETL testing:
- Source data is often huge in volume and gathered from varied types of data repositories
- The quality of source data cannot be assumed and should be profiled and often cleaned
- Inconsistent and redundancy may exist in source data
- Many source data records may be rejected; ETL/stored procedure logs will contain messages that must be acted upon
- Source field values may be missing where they should always be present
- Source data history, business rules, and audits of source data may not be documented
- Enterprise-wide data knowledge and business rules may not be available to verify runs
- Since data ETL’s must often pass through multiple phases before loading into the data warehouse, extraction, transformation and loading components must be thoroughly tested to ensure that the variety of data behaves as expected, within each phase
- Heterogeneous sources of data (e.g., mainframe, spreadsheets, Unix files) will be updated asynchronously through time then incrementally loaded.
- Transaction-level traceability may be difficult to attain in a data warehouse
- The data warehouse will be a strategic enterprise resource and heavily replied upon.
Testing Phases to Be Considered For the Data Warehouse Test Strategy
Figure 1 display a representative data warehouse implementation from identification of source data (lower left) to report and portal reporting (upper left). In between, several typical phases of the end to end data warehouse development process are depicted such as source extract to staging, dimension data to the operational data store (ODS), fact data to the data warehouse and report and portal functions extracting data for display and reporting. The graphic illustrates that all data load programs and resulting data loads should be verified throughout the end-to-end QA process.
Figure 1: End to End Data Warehouse Process and Associated Testing
An end-to-end data warehouse test strategy is important for documenting the approach to test the warehouse implementation process. The strategy often contains a high-level understanding of what the eventual testing workflow will be. The strategy will be used to verify and ensure that the data warehouse system meets its design specifications and other requirements.
For businesses to take full advantage of the benefits that latest DWH/BI technologies have granted, planning foundations for an effective testing strategy need to be laid out strategically, before implementation. The success and ultimate competitive advantage of an organization rely directly on how the team of testers verifies large volumes of BI data. Organizations faced with this situation must fully realize the benefits of the DWH/BI initiative; create an effective test strategy and appropriate test environment as part of their overall project strategy.