Explaining data warehouse automation can be challenging, but all data warehouse / business intelligence professionals should understand the concepts and practices.
Data warehouse automation is not some sort of wizardry. It does not automagically build, model, and load a data warehouse. This level of automation is (for the time being at least) wishful thinking.
Data warehouse automation is no different to any other type of automation. The process identifies patterns and instead of implementing tasks manually and repeatedly, the tasks are automated.
Examine an ETL (extraction, transformation, loading) process. Data integration uses the same patterns continually. Using data warehouse automation, these design patterns can be converted into reusable code templates. The template itself contains placeholders, which are populated from metadata when the code template is run (instantiated).
A simple example can illustrate the point. A very common (and simple) data integration pattern is to truncate-load a table. The truncate-load operation requires two steps.
- Truncate the target table
- Insert the data from one or more source tables to the target table
These steps are always the same. What changes are the instances of objects in the template: What is the name of the target table? What are the names of the source tables? What are the mappings between source and target tables? What is the underlying technology? All this information can be stored in a metadata catalog. From there it can be retrieved at runtime to populate the placeholders in the code template. With this information the automation engine can generate the code that implements and executes the data integration pattern. It is important to note that one crucial pillar for any data warehouse automation effort is the availability of a rich set of business and technical metadata.
For a practical example of data warehouse automation and code templates in action review an article that shows how to create a code template for loading data from S3 to Redshift.
Like data integration, certain rules can be enabled to auto-generate a dimensional model from a normalized model. The approach works well, but it will not generate some production ready model. It provides a starting point for a model and adds agility to the process. Another area that can be easily automated is data landing and staging from various data sources.
What are the benefits of data warehouse automation?
Encapsulating recurring patterns in a code template and automating common tasks offers a wide variety of benefits:
- Increased productivity. The ability to reuse the same code template repeatedly increases productivity. Engineers must write less boilerplate code.
- Using a code template will result in fewer bugs, a higher level of consistency, and higher quality of code.
- Changes can be rolled out much faster and with greater confidence. Assume a data integration pattern must be modified, perhaps to add a third step to the truncate-load pattern, e.g. to collect table statistics after the INSERT has completed. Simply add this step to the code template and by “magic” it is rolled out to all the instances where it is used.
- Using an automated approach to code generation aligns well with automated approaches to testing.
Data warehouse automation. A sample use case
Over the years, many things have emerged and gone in the data warehouse environment, either to become standard, or to disappear. However, there are some fundamental truths. Projects that involve complex XML as a data source either fail or run over time and budget. There are many reasons for this including but not limited to:
- Many XML files are based on industry data standards. A data standard typically covers and standardizes on a wide area of business processes. The business processes themselves contain many different entities, sometimes with hundreds or thousands of entities. Many developers work with one of these standards daily: Office Open XML. It is the standard that underlies Microsoft office documents. The documentation for this standard has more than 8,000 pages. Liberating data from spreadmarts and Excel requires a lot of time to understand the standard. Mapping back the concepts to XML is not straightforward. As a result, data analysts spend a lot of time trying to make sense of the standard. The fact that it is very awkward to query data locked away in XML files does not make the life of an analyst easier. Most would rather use SQL against tables than XQuery against an XML file to query data.
- Data engineers are good at working with databases, SQL, and scripting languages. They typically lack the niche skills such as XSLT or XQuery to work with XML files. Most developers and data engineers have little interest or incentive to acquire these esoteric skills.
- Standard data integration tools have limited support for working with XML. Typically, they provide a GUI on top of XPath. The whole process is still very manual and time consuming. The approach works reasonably well for simple XML but not for complex data standards. Apart from the lack of automation these tools also demonstrate terrible performance. For example, there are instances of ETL running for more than 24 hours for a relatively small number of 50,000 XML documents.
- All these issues lead to long analysis and development lifecycles, poor quality of code, badly performing data pipelines, and a lack of agility. In summary, there are significant risks to these projects.
The case for data warehouse automation for industry data standards and XML
Data warehouse automation for XML addresses all these issues. As a result, data warehouse analysts and engineers can focus on adding value to the enterprise rather than converting XML from one format to another. Some of the typical tasks than can be automated include:
- The analysis phase can be automated, to collect information and metadata such as data types, relationships between elements and types, data profiles, and data distribution from the source XML files.
- The generation of the target data model to a database can be automated.
- Using a metadata layer, the documentation can be auto-generated. Think of data lineage or source to target maps for the data analysts or ER diagrams of the target data models for the data engineers.
- Logging of errors and issues can be automated.
- Rogue XML files with unexpected structures can be auto-detected and parked for inspection.
- Relationships and globally unique keys can be automatically defined.
- The data warehouse target schema can be auto-populated.
With automation of these steps, several benefits can be realized:
- Code is consistent and of high quality
- Performance has been optimized
- Testing can be automated
- Increased productivity and fewer bugs. Data engineers and analysts can focus on adding real value to the organization
Data warehouse automation is an essential and important activity that can ensure high-quality performance and results from any data warehouse or business intelligence environment. Enabling the use of a metadata layer, using XML, and automating portions of the analysis phase, the organization can reap substantial benefits for the developers and data engineers.