Close Encounters with a Third Kind of Database

Gary Orenstein, Chief Marketing Officer, MemSQL

For nearly forty years, we have known two kinds of data systems: one for recording transactions, typically a database, and one for conducting analytics, typically a data warehouse. Keeping these systems connected through data transfer processes remains universally troublesome for database administrators.

Fortunately, new breakthroughs in in-memory computing and distributed systems make it possible to merge these two types of systems and create a third kind of database that can handle both transactions and analytics. Gartner refers to this combination as HTAP, or Hybrid Transaction/Analytical Processing.

Let’s explore a brief history of the market space and how HTAP came about. 

Database Beginnings 

The early days of computing focused on recording transactions. Examples included accounting and financial transactions, inventory management, and human resources. Databases became a way for us to organize these transactions with more speed and accuracy than manual methods.

Basic database operations involved storing and retrieving information reliably and quickly. When we wanted to ask a question of that data, however, we often had to wait a very long time. The original databases were built to operate on disk, and disk performance limited how rapidly data could be stored and retrieved. They were also designed with transactions in mind first, so they were poorly suited to handling multiple types of workloads simultaneously. It became clear that simply increasing the size and performance of the database was not going to end the wait for analytical results.

Enter the Data Warehouse 

Next to arrive was the data warehouse, which focused on answering questions about the data. However, since data had to be moved from the database to the data warehouse, we were left with an ETL (extract, transform, and load) gap, a process every data administrator loves to hate.

This separation costs users on three fronts:

1. You have to move data
a.  When dealing with large data volumes and online operations, this process can be complex, time-consuming, and fragile

2. Analytics are inherently out of date
a.  Once data has been moved, the information is no longer up to the moment, and your analytics are always one step behind

3. The transactional system cannot get feedback from the analytical system
a.  Because the systems are separate, the transactional system cannot benefit directly, in an automated fashion, from the analytical system

Origins of Combined Transactional and Analytical Systems

Two primary shifts in computing led to the creation of combined transactional and analytical systems. First was a rapid decline in the cost of dynamic random access memory, or DRAM. Second was the ability to link multiple servers together in a distributed system, so that a cluster of inexpensive computers could be treated as a single pool of memory.

Since DRAM is orders of magnitude faster than disk, systems that operate in memory deliver the performance needed to handle both transactional and analytical workloads simultaneously.

Benefits of an HTAP-Capable System

Once the functions of transactions and analytics reside within a single system, users can break through traditional data processing barriers to achieve significant benefits including: 

Elimination of the ETL Process 

HTAP-capable systems do not require an ETL process, as transactional data can also be used for analytics. Given the significant effort required to implement, manage, and maintain ETL processes, this frees organizations to put that time toward more sophisticated data analysis.
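
As a minimal sketch, assume a hypothetical orders table that the application writes to directly; in an HTAP-capable system, an analytical query like the following runs against that same live table, with no export step in between:

```sql
-- Hypothetical schema: orders(order_ts, product_id, amount).
-- The query reads the very table the application writes to,
-- so there is no ETL step between transactions and analytics.
SELECT product_id,
       COUNT(*)    AS orders_today,
       SUM(amount) AS revenue_today
FROM   orders
WHERE  order_ts >= CURRENT_DATE
GROUP BY product_id
ORDER BY revenue_today DESC;
```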

Data Accurate to the Last Transaction 

With disparate systems, the data warehouse is inherently out of date, since data must be moved before analytics can run. For many companies, ETL processes take hours or even days, leaving analytics that reflect the past but not the present. With HTAP-capable systems, every query is accurate to the last transaction or last event recorded in the system, providing the freshest and most accurate representation of the data.

An Analytical Feedback Loop for Transactions 

Perhaps the most exciting part of HTAP systems is the potential for transactions themselves to incorporate analytical processing.

Consider two types of HTAP implementations. In the first, the transactional and analytical processes share the same data but do not necessarily interact with each other. An example might be the logging of web metrics on an ecommerce site (transactions) alongside a real-time dashboard of ecommerce activity (analytics). Gartner refers to this type of HTAP as Point of Decision HTAP: the analytics operate on a live data set and deliver the most accurate information at the point of decision.
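
As a sketch of Point of Decision HTAP, a dashboard query might look like the following; the page_events table and its columns are hypothetical, and the interval arithmetic uses MySQL-style syntax:

```sql
-- Hypothetical schema: page_events(event_ts, page, session_id).
-- Each dashboard refresh re-runs this query against the live event
-- table, so the results are accurate to the last recorded event.
SELECT page,
       COUNT(*)                   AS views_last_hour,
       COUNT(DISTINCT session_id) AS sessions_last_hour
FROM   page_events
WHERE  event_ts >= NOW() - INTERVAL 1 HOUR
GROUP BY page
ORDER BY views_last_hour DESC
LIMIT 10;
```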

The second type of HTAP goes one step further and incorporates analytics within the transaction itself. Gartner refers to this as In-Process HTAP, where the transaction may embed analytical functions that affect its outcome.

A simple example here is the collection of sensor data. Consider data that follows the typical bell curve of a normal distribution. Generally, the most interesting data lies beyond two standard deviations from the mean. The data in the middle, representing the most common occurrences, is often well understood, but the outlying areas merit further investigation.

In this case, the transaction could execute in conjunction with a query that checks whether the data lies more than two standard deviations from the mean. If it does, the outlying cases could be stored in a second table for further analysis, as sketched below. Because this "anomalies" table contains only a small subset of the data rather than the entire dataset, it is much easier to work with.
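
Here is a minimal sketch of that idea in MySQL-style SQL. The sensor_readings and sensor_anomalies tables are hypothetical, and the session variables stand in for values arriving from the application:

```sql
-- Placeholder incoming values (in practice, supplied by the application).
SET @sensor = 42;
SET @val = 97.3;

-- Record the incoming reading as part of the transaction.
INSERT INTO sensor_readings (sensor_id, value) VALUES (@sensor, @val);

-- Embed the analytical test in the same flow: copy the reading into
-- the anomalies table only if it lies more than two standard
-- deviations from the mean of all readings.
INSERT INTO sensor_anomalies (sensor_id, value)
SELECT @sensor, @val
FROM (SELECT AVG(value) AS mu, STDDEV(value) AS sigma
      FROM sensor_readings) AS stats
WHERE ABS(@val - stats.mu) > 2 * stats.sigma;
```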

Of course, you could go further: if a data point is an anomaly, run additional queries against other datasets to test its severity, as in the sketch that follows. This points to the full potential of In-Process HTAP to drive complex outcomes from a variety of incoming data.
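
Continuing the hypothetical example, a follow-up query could grade each anomaly against a reference table of per-sensor tolerances (sensor_tolerances is likewise invented for illustration):

```sql
-- Join the small anomalies table against per-sensor tolerances
-- to classify each outlier's severity.
SELECT a.sensor_id,
       a.value,
       CASE WHEN ABS(a.value) > t.critical_threshold THEN 'critical'
            ELSE 'warning'
       END AS severity
FROM   sensor_anomalies a
JOIN   sensor_tolerances t ON t.sensor_id = a.sensor_id;
```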

Getting Started with HTAP 

HTAP presents exciting potential for applications, particularly those dealing with real-time data across on-demand services and the Internet of Things. The application frontiers of these emerging market spaces present numerous opportunities to build both Point of Decision and In-Process HTAP functions.

When getting started with HTAP systems, look for the following characteristics: 

Memory-Optimized or In-Memory Databases

The speed and performance of HTAP databases come from the ability to keep data in DRAM. But not all in-memory databases are alike. Identify those that capture and retain data in DRAM as the primary storage mechanism. Many databases retain legacy capture paths that write to disk first and only then move data into memory; these approaches limit the ability to capture large volumes of incoming data. Furthermore, even a short ETL process within the system creates separate datasets that are incapable of true In-Process HTAP.

Understandably, memory-optimized databases also need to include options for high availability should a node go down, along with persistence mechanisms such as transaction logs, snapshots, and backups.

Distributed Systems 

As datasets grow, databases must expand while supporting low-cost implementation models. Distributed systems enable HTAP databases to scale across multiple nodes, which is particularly useful for memory-optimized systems, since a single server has a fixed memory capacity. Distribution also keeps costs low, as inexpensive industry-standard servers can serve as building blocks for a wide variety of database configurations across clusters.

Relational Engine 

Since half the goal of an HTAP system is analytics, it helps to have a relational engine at the core of the system. This provides peak performance for SQL, the most widely deployed analytics solution in the enterprise. Other approaches promote general-purpose engines that may not deliver optimal SQL performance, or bolt SQL on as a separate system, which adds administrative complexity.

Look for Capabilities in Both Operational Database and Data Warehouse Workloads 

Finally, ensure the system you choose has capabilities across both operational database and data warehouse workloads. This gives you maximum flexibility to deploy HTAP in both Point of Decision and In-Process modes, and to define a center of excellence for real-time analytics across your organization.
