ETL is changing to ELT. This is important since data preparation remains one of the toughest obstacles that a data-aware organization must overcome. Consider these critical facts.
Cycles of innovation in data management and analytics are driving the classic ETL (Extract-Transform-Load) pattern toward its reversal, ELT (Extract-Load-Transform). The implications of this reversal go beyond a mere reordering of steps.
The cornerstone of ETL was a schema, or several schemas, with nothing done on the fly. The source data could be incomprehensible, but ETL always knew its target. ELT does not necessarily know the target in advance.
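To make the reversal concrete, here is a minimal sketch, with SQLite standing in for the target system; the table names, sample rows, and cleanup rule are hypothetical illustrations, not any particular product's API.

```python
import sqlite3

# SQLite stands in for the target system; the table names and the cleanup
# rule are hypothetical.
wh = sqlite3.connect(":memory:")
raw_rows = [("1001", " 19.99 "), ("1002", None), ("1003", "5.50")]

# ETL would clean these rows in pipeline code *before* loading them into a
# known target schema. ELT loads them as-is and transforms inside the target:
wh.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw_rows)  # Load first

wh.execute("""
    CREATE TABLE orders AS
    SELECT order_id, CAST(TRIM(amount) AS REAL) AS amount
    FROM raw_orders
    WHERE amount IS NOT NULL
""")  # Transform second, inside the target

print(wh.execute("SELECT * FROM orders").fetchall())
```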
Today, across the many variations of information architecture, “prep” (shorthand for turning raw data into valuable materializations, metadata, abstractions, processes, pipelines, and models) continues to play a crucial role. With ELT, its role is elevated, because the “T” in ELT is more lightweight than its predecessor. The numerous ways of handling information, such as cloud architectures, data lakes, data lakehouses, and cloud-native data warehouses, each require distinct prep capabilities for data science, AI, analytics, edge intelligence, or data warehousing.
The terms “data prep” (or simply “prep”) and “wrangling” (or “data wrangling”) refer to preparing data for analytics, reporting, or any downstream application. Prep refers to directly manipulating data after accessing it from one or more sources. Wrangling used to refer separately to preparing data during interactive data analysis and model building. For simplicity, “prep” here covers both activities; once distinct disciplines, they are now nearly the same.
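As a toy illustration of what “prep” covers in practice, a few lines of pandas-style manipulation; the column names and values are hypothetical.

```python
import pandas as pd

# A hypothetical raw extract with cryptic column names and messy values.
raw = pd.DataFrame({
    "CUST_NM": ["Acme Corp", "Acme Corp", None],
    "RESP_DT": ["2023-05-01", "2023-05-01", "not recorded"],
})

# Typical prep: rename columns, coerce types, drop unusable rows, de-duplicate,
# then materialize something a downstream tool can trust.
prepped = (
    raw.rename(columns={"CUST_NM": "customer_name", "RESP_DT": "response_date"})
       .assign(response_date=lambda df: pd.to_datetime(df["response_date"], errors="coerce"))
       .dropna(subset=["customer_name", "response_date"])
       .drop_duplicates()
)
print(prepped)
```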
In the same way, a profusion of data management architectures now exists that aligns with an ELT approach, and prep plays an essential role in each of them.
Drivers of change in data
“Après moi, le déluge” (“After me, the flood”), attributed to either French King Louis XV or Madame de Pompadour (history is a little fuzzy on the attribution), alluded to signs of the approaching Revolution. The virtually unlimited amounts of data, processing capacity, and tools to leverage them are without a doubt a modern deluge. The challenge today is not the volume of data; it is making sense of it, at scale, continuously.
The vision of big data freed organizations to capture far more data sources at lower levels of detail and vastly greater volumes, which exposed a massive semantic dissonance problem. For example, data science always consumes “historical” data, and there is no guarantee that the semantics of older datasets are the same, even if their names are. Pushing data to a data lake and assuming it is ready for use is a mistake.
Answering the call to be “data-driven” (or “data-aware”), organizations strove to ramp up skills in all manner of predictive modeling, machine learning, AI, and even deep learning. And, of course, existing analytics could not be left behind, so any solution must satisfy those requirements as well. Integrating data from ERP and CRM systems may be a chore, but for today’s data-aware applications the fabric of data is multi-colored. So-called secondary data, such as structured interviews, transcripts from focus groups, email, query logs, published texts, literature reviews, and observation records, is challenging to understand. Records written and kept by individuals (such as diaries and journals) and later accessed by others are also secondary sources.
The primary issue is that enterprise data no longer lives solely in a data center, or even in a single cloud; it may span several clouds, or combinations of cloud and on-premises systems. Edge analytics for IoT, for example, capture, digest, curate, and even pull data from other application platforms and live connections to partners. Previously this was a snail-like process built on aging approaches such as EDI or batch ETL. Edge computing can be thought of as decentralized from on-premises networks, cellular networks, data center networks, or the cloud. As a result, data increasingly originates in far-flung environments where its structures and semantics are not well understood or documented. Moving that data smoothly from place to place, or moving the logic to the data while everything is in motion, is too complex for manual methods. The deeper problem is that these sources are semantically incompatible, yet useful analytics can only be derived by drawing on several of them together.
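A sketch of the reconciliation this implies before two far-flung sources can be combined; the field names, units, and sample values are hypothetical.

```python
import pandas as pd

# Two hypothetical feeds describing the same readings with different names,
# units, and timestamp conventions.
plant = pd.DataFrame({"device": ["A1"], "temp_f": [98.6], "ts": ["2023-05-01 12:00:00"]})
edge = pd.DataFrame({"sensor_id": ["A1"], "temp_c": [37.0], "epoch_s": [1682942400]})

# Reconcile the semantics explicitly before combining: common keys, common
# units, and a common time representation.
plant_std = plant.rename(columns={"device": "sensor_id"}).assign(
    temp_c=lambda df: (df["temp_f"] - 32) * 5 / 9,
    observed_at=lambda df: pd.to_datetime(df["ts"]),
)[["sensor_id", "temp_c", "observed_at"]]

edge_std = edge.assign(
    observed_at=lambda df: pd.to_datetime(df["epoch_s"], unit="s"),
)[["sensor_id", "temp_c", "observed_at"]]

combined = pd.concat([plant_std, edge_std], ignore_index=True)
print(combined)
```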
The path to data prep in ELT
Today’s data warehouses only slightly resemble their predecessors of a decade ago. Gone are the single physical location and the “single version of the truth” schema, and the lines between data warehouses and so-called “data lakes” have blurred. Formerly mixed workloads are now ultra-mixed workloads, serving the usual discovery and analytics plus data science, AI/ML, and private and public data exchanges, and providing a platform for data of all types and formats. Clusters of servers, even distributed ones, are the norm. As a result, the valuable qualities of ETL are insufficient for this evolved ecosystem, and ELT emerged as the logical choice for data warehousing.
The “T” in ELT
With the onslaught of AI/ML, data volumes, cadence, and complexity have exploded. Cloud providers Amazon AWS, Microsoft Azure, Google, and Alibaba not only provide capacity beyond what a data center can offer; their current and emerging capabilities and services pull the execution of AI/ML away from the data center. To make the transition, or to streamline processes already there, a cloud-ready data prep platform is essential, enabling all the enhancements a cloud environment offers:
- Moving/Integrating Data in the Cloud; Data Exploration and Quality Assessment
- Special Requirements for AI/ML
- Data Pipeline Orchestration (see the sketch after this list)
- Collaboration and Data Governance
- Low-code, No-code Operation
- Support for Programming Languages and SQL
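As a minimal, library-agnostic sketch of what data pipeline orchestration means here: declare steps and their dependencies, then run them in order. The step names are hypothetical placeholders; real platforms add scheduling, retries, and monitoring.

```python
# Requires Python 3.9+ for graphlib. A toy orchestration: declare steps and
# their dependencies, then execute them in dependency order.
from graphlib import TopologicalSorter

def extract():   print("pull raw files from object storage")
def load():      print("load raw files into the warehouse")
def transform(): print("run in-warehouse transformations")
def publish():   print("refresh downstream marts and dashboards")

# Each step maps to the set of steps it depends on.
pipeline = {
    load: {extract},
    transform: {load},
    publish: {transform},
}

for step in TopologicalSorter(pipeline).static_order():
    step()
```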
Data scientists and data engineers search, assemble and manipulate data sources. At each step, the prep tools assist with:
- Predictive Transformation – an inference algorithm that presents the analyst with a ranked set of suggested transformations.
- Parametrization – a technique to automate changes in iterative passes.
- Collaboration – support for multiple analysts to work together, with the facility to share quality work for reuse.
- Pattern Matching – useful for exposing patterns in the data (see the sketch after this list).
- Visual Profiling – the ability to interact with the actual data and perform analysis on it.
- Sampling – automatic sampling to test transformations.
- Scheduling – the facility to set a time or an event for a job to run, with useful post-run information.
- Target Matching – given a target such as a data warehouse schema, the prep tool automates development of the recipe to match it.
- Common Data Types – CSV, JSON, XML, Avro, Parquet, Hyper, TDE, and more.
- Native Cloud Operation.
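A toy illustration of the pattern-matching and sampling items above: a regular expression flags values that do not fit the expected shape, and a quick sample avoids scanning everything. The column values and pattern are hypothetical.

```python
import random
import re

# Hypothetical raw phone-number values pulled from a source column.
values = ["212-555-0100", "(415) 555-0199", "n/a", "555 0123", "303-555-0147"]

pattern = re.compile(r"^\d{3}-\d{3}-\d{4}$")  # the shape most values follow

sample = random.sample(values, k=min(3, len(values)))  # quick look, not a full scan
nonconforming = [v for v in values if not pattern.match(v)]

print("sampled:", sample)
print("values needing a transformation:", nonconforming)
```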
Though not entirely “self-service,” prep is designed for people who understand the data and its uses, and it very seldom requires the dev/test/prod coding process. In the context of using information for operations or decisions, that seems like a pretty good idea. In the past, there were only two choices for gathering data for reporting and analysis: put in a request to IT for a report and wait six months or more, or get a microfiche reader. Compared to that, wrangling some data and developing a report, a dashboard, or a visualization was a huge step forward.
The data lake:
The concept of a data lake came from Hadoop, which was indifferent to the size and type of the files it could process. In place of the rigid and far less scalable relational data warehouse, Hadoop hatched the idea of a single place for everything in the organization – the data lake. Though it simplified searching for and locating data, it provided no analytical processing tools. Moving a JSON file from Paris, France, to a data lake in a cloud location in Paris, Texas, adds no value beyond some per-unit economies of scale in storage and processing.
Something to consider is that “organization” is often an oxymoron. While most organizations may have a single “strategy” for data architecture, the result of acquisitions, legacies, geography, and the usual punctuated progress is commonly a collection of architectures, distributed both physically and architecturally. The best advice is to:
- Pay more attention to what the data means than where you put it.
- Recognize that, to patch some of the data lake idea’s manifest deficiencies, cloud providers have steadily added processing capabilities that mimic early data warehousing features – building the data lakehouse.
The data lakehouse:
According to Databricks, “A data lakehouse is a new, open data management paradigm that combines the capabilities of data lakes and data warehouses, enabling BI and ML on all data. Merging them into a single system means that data teams can move faster to use data without accessing multiple systems.” This statement is more aspirational than fact. Data warehouses represent sixty years of continuous (though not consistently smooth) progress and provide all the services that are needed, such as:
- AI-driven query optimizer
- Complex query formation
- Massively parallel operation based on the data model, not just sharding
- Workload Management
- Load Balancing
- Scaling to thousands of simultaneous queries
- Full ANSI SQL and beyond
- In-database Advanced Analytics (see the sketch after this list)
- Ability to handle native data types such as spatial and time-series
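As one small example of the in-database analytics item above, a window-function query computes a rolling average inside the engine rather than in client code; SQLite stands in for a real warehouse so the sketch runs anywhere, and the table and columns are hypothetical.

```python
import sqlite3  # SQLite 3.25+ (bundled with recent Python) supports window functions

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings (sensor_id TEXT, observed_at TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?, ?)",
    [("A1", f"2023-05-0{d}", 20.0 + d) for d in range(1, 8)],
)

# The rolling average is computed inside the database engine, not client-side.
rows = conn.execute("""
    SELECT sensor_id, observed_at,
           AVG(temp_c) OVER (
               PARTITION BY sensor_id
               ORDER BY observed_at
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS rolling_avg
    FROM sensor_readings
""").fetchall()
print(rows)
```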
Cloud data warehouse:
There are principally four cloud-native data warehouses: AWS Redshift, Azure Synapse, Snowflake, and Google BigQuery. Many other relational data warehouse technologies have acceptable cloud versions, but the cloud natives currently claim the high ground. They provide all the functions listed above at a certain level of maturity, rather than as bolt-on capabilities atop generic cloud features. It does get a little blurry, though, because the CDWs provide more than a traditional data warehouse.
ELT and prep work perfectly with CDW because:
- They use optimized cloud storage and accept the same kinds and volumes of data as a data lake(house).
- Prep tools leverage the internal analytical, statistical, and even AI capabilities of the CDW to produce ever more valuable data and models for the organization, serving a wide swath of business functions and skills.
Conclusion
It is worth noting that, despite the relentless expansion of data management and the technologies that enable it, not everything is solved with technology. Forty years ago, a major stumbling block in data warehousing and Business Intelligence was the inability to understand data at its source. That problem still exists, even with metadata management. Data can never be taken at face value, and context is always critical.
A semi-cryptic name for an attribute may seem like a match, but looking across millions of instances to check that assumption is essential to effective use of data. This issue nullifies the idea of a data lake, which overlooked the metadata and scale problems. It may be the most challenging problem (other than the lyrics of “Louie, Louie”), and it is exacerbated by algorithmic opacity in AI. Prep, however, is the key to progress.
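A small sketch of the check this implies: compare the value sets behind two similarly named attributes instead of trusting the names. The column names and contents are hypothetical.

```python
# Two hypothetical columns that share a cryptic name ("CUST_ST") but may not
# share semantics: one holds US state codes, the other customer status codes.
system_a_cust_st = ["NY", "CA", "TX", "NY", "WA"]
system_b_cust_st = ["ACTIVE", "LAPSED", "ACTIVE", "NY", "ACTIVE"]

a_vals, b_vals = set(system_a_cust_st), set(system_b_cust_st)
overlap = len(a_vals & b_vals) / len(a_vals | b_vals)  # Jaccard similarity of value sets

print(f"value overlap: {overlap:.0%}")  # low overlap means the name match is misleading
```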
A high-level list of the prep functions AI could enable:
- Predictive Transformation – an inference algorithm that presents the analyst with a ranked set of suggested transformations.
- Parametrization – a technique to automate changes in iterative passes.
- Collaboration – support for multiple analysts to work together, with the facility to share quality work for reuse.
- Pattern Matching – useful for exposing patterns in the data.
- Visual Profiling – the ability to interact with the actual data and perform analysis on it.
- Sampling – automatic sampling to test transformations.
- Scheduling – the facility to set a time or an event for a job to run, with useful post-run information.
- Target Matching – given a target such as a data warehouse schema, the prep tool automates development of the recipe to match it.
- Common Data Types – CSV, JSON, XML, Avro, Parquet, Hyper, TDE, and more.
- Native Cloud Operation.
In other words, plenty of work ahead.
A version of this article was published at https://diginomica.com/out-etl-elt-why-data-preparation-needs-change