Database management systems and query engines may seem to be suited for analytics. However, it is important to understand the technical and business requirements for the solution before choosing.
As we travel through life, we are constantly assessing our choices. Should you eat that salad, or opt for the burger? Should you marry your partner or seek greener pastures elsewhere? All of us do these assessments in both our personal and business lives. However, it may have never occurred to you in your moments of private introspection to consider the question: Do I need a database management system or a query engine?
Of course, query engine decisions may not have as much impact as marriage, but like marriage, you can live happily ever after or become committed to a relationship that won’t work in the long run. It’s a confusing landscape because often query engines are sold as database management systems (DBMS). Database management systems appear to be significantly more costly than query engines while being more capable for business intelligence. And, vendors tend to blur the line between query engines and DBMS. Without careful thought, you may find yourself married to the wrong solution with promises to deliver analytics that you cannot meet.
Exploring the Differences
What makes a database management system (DBMS)? A DBMS is a place where you can load and store data in the most optimal way for queries. You move data into a database built with a DBMS to take advantage of its optimization capabilities, and because someone is holding your team to a service level agreement on the amount of data stored, how fast questions are answered and how many concurrent users are asking their questions about the data.
In a database, you might need ACID compliance (atomicity, consistency, isolation, durability); if you ask for a change, you want to be reasonably sure the change happens even if a hardware error or other random error occurs. Examples of DBMS include MySQL, Oracle, Vertica, Microsoft SQL and many others.
On the other hand, a query engine is a piece of software that you can bring to the data to query it. In this case, you don’t really want to move the data. Maybe the data is too big and will take too long to move. Instead, you’d like to bring a query engine directly to the nodes or cloud that’s holding it and use the resources in place. In querying the data with a query engine, you are less concerned about optimizations and more interested seeing what’s possible with the engine’s capabilities against the data. You may want to explore the data that is outside the constraints of service level agreements. Since some other process is probably writing the data, things like ACID compliance are less important. Examples of query engines include Apache Drill, Cloudera Impala, Vertica’s SQL on Hadoop, and even Apache Spark.
Use Case Differences
You need a query engine when you have a lot of data stored and need to bring analytics to it. Companies will frequently store data in Amazon S3 or Hadoop without knowing the value of much of it. They may peel off portions of the data to their database, usually a data warehouse, to perform analytics. You can also point multiple query engines at the data, but the engines won’t necessarily run in isolation, nor will you automatically have workload management that can handle fast running queries and long running queries without bumping into each other.
You need a DBMS when your database needs a new home that can deliver compliance with standards for SQL, ACID compliance and where backup and restoration are part of the system. A DBMS provides advanced methods for optimization, for faster analytics. Most importantly, you store data in a database with a DBMS when you’re expecting it to meet service level agreements on analytics. In other words, if you have to run x number of reports in x number of minutes, use a database with a DBMS.
Let’s look at some technical comparisons of a DBMS and a query engine.
|Technical Capability||Database Management System (DBMS)||Query Engine|
|Transactional Isolation||A DBMS manages the transactions that it is asked to do and has built-in features that ensure no transaction is forgotten. A DBMS is aware that other transactions may be coming and has a strategy for queuing and managing actions||A query engine works in isolation. You may point several query engines at the same data. If you point too many queries at the same data, some of them may fail.|
|Data Optimization||A DBMS prefers to store structured data to know exactly where the data is and how to access it. By storing structured data, it can optimize data access to boost performance. It can also compress data more efficiently to lower the storage footprint.||Query engines usually use something like HIVE to define structure around the data. Because data of dissimilar types are often mixed together, there is less chance of optimization.|
|Unstructured Data Query||Most modern DBMS have ways to handle unstructured data, but it’s mostly about adding structure to the data blobs. “Schema on read” is a popular feature in modern DBMS, for example, allowing it to read in poly/semi-structured data like JSON.||Most query engines were design to sit on top of Hadoop and perform analytics on unstructured data|
|Query Volume and workload management||A DBMS is designed to handle many users concurrently asking many questions of the data. DBMS have workload management built in. Resource pools can be set up so that the queries that need to be fast run in one pool, while less critical queries can be allocated to pools with fewer resources.||For query engines, workload management is often done at the cluster level. Although technologies like Ambari, YARN and Mesos exist, they are still immature and may be difficult to implement to manage workloads.|
|Data Assurance||DBMS Responsibility||User Responsibility|
Clarifying Your Use Case
Database management systems have been designed with the idea that structured data and data integrity are needed for very good reasons. It’s important to consider the advantages that are offered by a DBMS, particularly around concurrency of analytical workloads, as well as the consistency of data as it is being inserted, copied and deleted. Performance of data analytics is also a key difference between the use of a DBMS and a query engine.
While query engines won’t be the replacement for a data warehouse, it’s important to know when and how it is safe to bend or break the rules for choosing between a database management system and a query engine in a particular situation. Query engines may help in getting around the need to impose tight structure on the data. However, they essentially remove the safety guard from data management. It’s important to make a careful assessment about which one you need before you head down the path for the infrastructure to big data analytics.