Real-World Decision Support (RWDS) Journal, April 2002 - Volume 1, Issue 15

Near Real Time (NRT) and The Next Generation ETL Tool

By Dan Linstedt

Introduction

This article is a short look at ETL, the future of the ETL tool, and what EAI (Near-Real-Time feeds - NRT) brings to the table. We look briefly at the feature set that the new ETL tools will need, not only to cover NRT but also to cover additional requirements in data movement and integration. We also take a very quick look at EAI and how it differs from ETL.

For EAI, latency is an issue, as is integration. It will no longer suffice to simply have coded efforts developed in an EAI solution, because they lack data integration. It will no longer be good enough to "initialize" the world of ETL. There will not be enough time in the day to download ever-changing data, and the complex world of cross-integrated solutions will require that the data be kept in sync across never-ending streams of processes.

These requirements are based on the new volumes being pushed into warehouses, shrinking load windows, and lights-out back-room operation desires; in other words, business requirements. The new requirements of ETL (I believe) will be in the broad categories below:

  • Produce favorable benchmarks with 1 Terabyte Volumes on medium sized machines
  • Parallelize data streams under the covers.
  • Integrate best-of-breed Database Features (ELT - upserts, views, partitioning, indexing)
  • Implementation of event driven processing, and RDBMS trigger based processing
  • Messaging between streams, across streams, and in and out of other processes
  • Workflow management
  • Fault-Tolerance
  • Fail-Over
  • Auto-Process recovery (based on meta-data rule sets created by the designer).
  • Auto-Load Balancing
  • Auto-Process Prioritization
  • Metadata rules based metrics generation (run-times, CPU utilization, disk utilization - capture and recording of all metrics).
  • Configuration Management Facilities built-in.
  • Audit trails generated for edits to data flows.
  • Capable of handling batch and NRT at the same time.
  • Query Rewrite algorithms embedded into sourcing engines.

Recently I've been challenged by EAI and ETL vendors as to what the user population might want to see next year. What's around the corner, they ask, as if I had a magic mirror. Well, just for an instant, let's glance into that magic mirror on the wall and make a wish list of functions and features. Let's take a realistic look at integrating data - the task at hand in an NRT functional environment. I've also been asked: will EAI take over the ETL world, or will ETL take over the EAI world? I'm here to answer NEITHER. They both play in concert in tomorrow's vision. Each has its role, and each plays it well.

The Role of EAI

EAI is Enterprise Application Integration (or Interface, etc.). Basically, the software that plays in this segment has taken on the responsibility of guaranteed delivery. Rain or shine, sleet or snow, it gets there - as fast as possible. Even if the network goes down and comes up again next year, it will get there. If there are alternate routes, it re-routes the information and makes sure it gets delivered.

These tools also ensure distribution. If there is more than one destination, they make sure that all destinations hear the message. Again, Guaranteed Delivery. They play a huge role in transaction-by-transaction analysis, and quite possibly in re-formatting transactions one at a time, but these tools do NOT do well (without loads of programming) at taking care of integration, aggregation, and consolidation of information - especially at an enterprise level. This is the job of the ETL tool.
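
The delivery-and-distribution behavior described above can be sketched in a few lines. This is only an illustration, not how any particular EAI product works: the `Destination` class, the `distribute` function, and the `fail_first` and `max_rounds` parameters are all hypothetical names invented for the example, and a real system would persist undelivered messages to disk rather than hold them in memory.

```python
from collections import deque


class Destination:
    """Hypothetical endpoint that may be unreachable for a while."""

    def __init__(self, name, fail_first=0):
        self.name = name
        self.fail_first = fail_first  # number of initial attempts that fail
        self.received = []

    def deliver(self, message):
        if self.fail_first > 0:
            self.fail_first -= 1
            raise ConnectionError(f"{self.name} unreachable")
        self.received.append(message)


def distribute(message, destinations, max_rounds=10):
    """Retry until every destination has the message or rounds run out.

    Returns the names of destinations that are still undelivered.
    """
    pending = deque(destinations)
    rounds = 0
    while pending and rounds < max_rounds:
        rounds += 1
        for _ in range(len(pending)):
            dest = pending.popleft()
            try:
                dest.deliver(message)
            except ConnectionError:
                pending.append(dest)  # keep it queued for the next round
    return [d.name for d in pending]


if __name__ == "__main__":
    warehouse = Destination("warehouse")
    ods = Destination("ods", fail_first=2)  # down for the first two attempts
    print(distribute({"txn": 1}, [warehouse, ods]))
```

Note what the sketch does not do: it moves one message at a time and never combines, aggregates, or reconciles anything across messages, which is exactly the gap the ETL tool fills.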

EAI plays an important role - but has yet to master the cross-stream integration layers that ETL already offers, not to mention the GUI development environment for data flow validation and control. This, again, is where the ETL tools play. This ISN'T where EAI vendors should go. They should continue focusing on data delivery, making it faster, better, cheaper - and of course more reliable. Consistency is the backbone of EAI - that will never go away. It is also EAI's role to ensure transactions are delivered on time. In this case, speed IS an issue - and will continue to be as the data volumes grow, alongside the generation of these volumes from our source systems.

So what does EAI hold in its future? Don't know - haven't made that analysis of the market yet. Hopefully by this time next year we'll be able to write intelligently on what is currently a young market (by software standards). At least for tomorrow, ETL and EAI won't necessarily merge functionality, though there will be some bleed-over in the technology sector.

Best of breed ETL will not be able to compete with best of breed EAI - simply because the technology band is too wide, and we can't be best at everything, now can we?

Just remember, unless the data warehouse requires NRT, you may not need an EAI system - unless the company can add value in a different department, and has already installed the system to meet a different business need.

ETL's Future - What does it hold?

It holds a whole lot, especially when it comes to supporting NRT. As far as integration with EAI - this is important, and one of the foundational cornerstones for the industry to survive. It is my prediction that the ETL will not survive on batch alone in the new industry. Of course that's no secret to anyone - almost everyone these days is investing in some sort of EAI solution, and has had an ETL to choose from for quite a while.

But before you go and try NRT with your ETL, be warned, there are serious implications dictated often by complex business rules which will stop you dead in your tracks today. It's a game of "push the synchronization point around" until it lands in the right hole. In other words, today it's a series of work-arounds, and undulations (a nightmare for the systems architect) to ensure success.

What the heck is he talking about? The EAI vendors preach it as: "Simply load your database whenever the data is ready…" Ok - re-read the statement, what does it say? "Database" not "data warehouse". It falls in line with the saturated OLTP market, doesn't it? Wow. Never thought about that before… Yeah, right. The focus of most of today's database vendors is OLTP - EAI is big money, and guess where all the big money is? Integrating OLTP systems; of course it makes sense.

All right - enough soap box. There are serious complications to utilizing an ETL tool in NRT. Let's not get off the subject here. Below is a list of items that need to be addressed from an ETL individual process stand-point:

  1. Each process must be capable of passing messages to other running processes.
  2. Processes must be able to be defined as "never ending," running until they receive a message to stop.
  3. Queuing mechanisms must be threaded and parallel, so that messages between processes are not "lost" due to CPU time-out.
  4. Look-ups against tables and source files must be responsive to messaging systems - able to dynamically add, update, and delete rows on demand. In this case, if multiple streams are loading a single target table - the integration must be such that the lookups across these multiple streams can be synchronized without deadlock contention.
  5. Aggregations must be equally shared and synchronized across multiple running streams.
  6. Input information must be able to be dynamically re-assigned to other processes based on rules designed by the business (meta-data controlled).
  7. Multiple streams must be able to be dynamically consumed by a single process based on business rules.
  8. Once initialized - continuous build and append must be available.
  9. Checkpoints and failure recovery must be built in (or allowed to be designated) at certain points in the process. Recovery for a process must be a matter of seconds, not minutes or hours.
  10. Recovery should consist of returning to the most recent checkpoint.
  11. Data should not be lost between processes.
  12. ETL processes should be distributable across all registered resources such that the workload is shared.
  13. Ability to reconcile, and synchronize across process flows, the information arriving at particular points in time.
  14. Ability to construct data hierarchies and dependencies that span process flows - producing queued environments, and time-tables for latency processing of particular information.
  15. Ability to allow the developer of the processes to construct message flow diagrams that, based on conditions, execute different messages across the processes.
  16. Ability to manage (internally) and possibly eliminate target contention or deadlock, allowing the developers of the process to focus on the complex task of getting the NRT data in, and integrating it.
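
To make a few of these items concrete (2, 9, 10 and 11 in particular), here is a minimal sketch of a "never ending" process that runs until it receives a stop message and recovers by returning to its most recent checkpoint. Everything in it is an assumption made for illustration: the `STOP` sentinel, the `state` dictionary layout, the `checkpoint_every` parameter, and the idea that the feed can replay rows by offset. It is not drawn from any actual ETL product.

```python
import queue
import threading

STOP = object()  # hypothetical control message meaning "shut down"


def new_state():
    # target: committed rows; pending: rows applied since the last checkpoint
    return {"target": [], "pending": [], "checkpoint": -1}


def commit(state, offset):
    """Make pending rows durable and advance the checkpoint."""
    state["target"].extend(state["pending"])
    state["pending"].clear()
    state["checkpoint"] = offset


def stream_worker(inbox, state, checkpoint_every=3):
    """Run until STOP arrives, checkpointing every few rows."""
    while True:
        msg = inbox.get()  # blocks; queued messages wait rather than vanish
        if msg is STOP:
            break
        offset, row = msg
        if offset <= state["checkpoint"]:
            continue  # replayed row already covered by the checkpoint
        state["pending"].append(row)
        if len(state["pending"]) >= checkpoint_every:
            commit(state, offset)


def recover(state):
    """Drop uncommitted work; return the offset the feed should resume from."""
    state["pending"].clear()
    return state["checkpoint"] + 1


if __name__ == "__main__":
    state, inbox = new_state(), queue.Queue()
    worker = threading.Thread(target=stream_worker, args=(inbox, state))
    worker.start()
    for i in range(5):
        inbox.put((i, f"row{i}"))
    inbox.put(STOP)
    worker.join()
    print(state["checkpoint"], recover(state))
```

Recovery here is cheap because it only discards the rows since the last checkpoint and asks the feed to replay from the next offset, which is the "seconds, not minutes or hours" property item 9 asks for. The sketch sidesteps the hard parts of the list: cross-process synchronization, workload distribution, and target contention.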

What we're discussing here is the ability of the ETL to expand beyond individual processes - to encompass workflows and dynamic data sharing between workflows - as well as messaging between the workflows. The more data driven the ETL and its processes become, the more likely it will be enabled to handle NRT feeds.
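
As one small example of that kind of data sharing, item 4 in the list above asks for lookups that several streams can read and update at once without deadlocking. A rough sketch, assuming an in-memory lookup that hands out surrogate keys, might look like the following. The `SharedLookup` class and `load_stream` helper are hypothetical, and a single short-held lock is just one simple way to avoid deadlock (no stream ever holds two locks, so no lock-ordering cycle can form); a real tool would need something far more scalable.

```python
import threading


class SharedLookup:
    """Hypothetical lookup table shared by every stream loading one target."""

    def __init__(self):
        self._lock = threading.Lock()
        self._keys = {}
        self._next = 1

    def get_or_add(self, natural_key):
        """Return the surrogate key, assigning a new one on first sight."""
        with self._lock:  # held briefly and never nested
            if natural_key not in self._keys:
                self._keys[natural_key] = self._next
                self._next += 1
            return self._keys[natural_key]

    def delete(self, natural_key):
        """Remove a row on demand; missing keys are ignored."""
        with self._lock:
            self._keys.pop(natural_key, None)


def load_stream(lookup, natural_keys, out):
    """One stream resolving its rows against the shared lookup."""
    out.extend(lookup.get_or_add(k) for k in natural_keys)


if __name__ == "__main__":
    lookup, s1, s2 = SharedLookup(), [], []
    threads = [
        threading.Thread(target=load_stream, args=(lookup, ["a", "b", "c"], s1)),
        threading.Thread(target=load_stream, args=(lookup, ["b", "c", "d"], s2)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(s1, s2)  # both streams agree on the keys for "b" and "c"
```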

Conclusion

This is definitely tomorrow's market for ETL. Will it come right away? No, there's a lot of engineering work involved in just these suggestions alone, never mind the other 100 that I didn't mention here. However, the smart ETL vendors will begin to adapt their tools to handle these and other critical features in the NRT world, in order to handle additional on-the-fly information.

I think the industry will begin to see a blur between what is RDBMS, EAI, and ETL functionality today. There will be a lot of changes, possibly mergers or at least alliances between RDBMS vendors and ETL vendors. EAI vendors will continue to link to ETL and integrate as best as possible. Just don't forget, the focus of successful warehousing will be a paradigm shift - from ETL compensating for deficiencies in the RDBMS engines, to the RDBMS designed for data warehousing and ETL utilizing best of breed functionality.

© Copyright 2001-2002 Core Integration Partners, Inc.· 455 Sherman St. Suite 207 · Denver, CO 80203