SQL-in-Hadoop: Beyond the Benchmark

Abstract

In June, Actian gave the keys to the Hadoop kingdom to millions of business-savvy SQL users and analysts with the launch of its SQL-in-Hadoop* offering. With it, Hadoop data repositories are accessible to the entire enterprise, which means that companies investing in Hadoop can broaden the scope of data discovery, increase the accuracy of decisions, and speed time to value.

Oh, and it's fast. Really fast. Our benchmarks show that the Actian Analytics Platform–Hadoop SQL Edition is between 16 and 30 times faster than Cloudera's Impala (Fig. 1). Impressive, right?

FIG. 1.

Comparison between the Actian Analytics Platform–Hadoop SQL Edition and Cloudera's Impala.

What's really impressive is what users can accomplish if they can perform low-latency analytics on Hadoop data directly from within HDFS. In the financial services industry, faster means more dollars. In healthcare, faster means saved lives. For entrepreneurial companies, faster means more market share.

One of our customers in the clean energy space uses Actian's Hadoop SQL Edition to collect, integrate, and analyze data from the Internet of Things, Google Earth, local governments, NASA, drones, and other sources to pinpoint those who are ready for clean energy, and offer tailored solutions that make it completely affordable. They estimate that, by leveraging Hadoop analytics—something they never dreamed they'd be able to accomplish prior to Actian's SQL-in-Hadoop offering—they'll see a 60% increase in customers.

But let's get back to benchmarks. How exactly can Actian prove that it is faster than Impala? In early May, as we were readying our product for launch, we took a great deal of care in our benchmark testing to create an apples-to-apples comparison with what was then the most current performance data from Cloudera. The results were impressive. We ran the tests several times to make sure that we weren't seeing things. We felt a little like a Tour de France winner riding a bike with the training wheels still on. And this is only the beginning. We are extremely excited about the newest innovations coming out of our research and development lab, led by Peter Boncz, technical advisor to Actian and father of Vector Database Processing. We anticipate that these technology advances will further extend our performance lead, both in benchmark results and in real-world customer usage.

And for those who enjoy benchmark battles and banter, we'd like to point out a few observations about the Cloudera benchmark:

• Cloudera uses a 19-query subset of the normally 99-query TPC-DS benchmark. Where are the results of the other 80 queries?

• Cloudera modified the queries by removing all rollup and window functions, which are not supported by Impala.

• It added extra SQL clauses that help Impala prune partitions.

• Impala uses just one core per node for query processing and runs each query single-threaded.
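To make the partition-pruning point concrete, here is a minimal Python sketch of why an extra predicate on a partition key can change how much data is read without changing the answer. It is an illustration only, not Cloudera's queries or Impala's planner: the table contents, `DATE_DIM`, `SALES`, and the `sk_range` parameter are all hypothetical, and the assumption is that the fact table is partitioned by a date surrogate key while the original filter is expressed on a date dimension.

```python
from collections import defaultdict

# Hypothetical date dimension: surrogate key -> calendar year.
DATE_DIM = {1: 2012, 2: 2013, 3: 2013, 4: 2014}

# Hypothetical fact rows; assume they are partitioned on disk by date_sk.
SALES = [
    {"date_sk": 1, "amount": 10},
    {"date_sk": 2, "amount": 20},
    {"date_sk": 3, "amount": 5},
    {"date_sk": 4, "amount": 7},
]


def partition_by(rows, key):
    """Group rows into partitions keyed by the value of `key`."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)


def total_for_year(partitions, year, sk_range=None):
    """Sum sales for `year` and report how many partitions were read.

    `sk_range` is the optional extra predicate (lo, hi) on the partition
    key. It does not change the result, only which partitions are opened.
    """
    total, partitions_read = 0, 0
    for date_sk, rows in partitions.items():
        if sk_range is not None and not (sk_range[0] <= date_sk <= sk_range[1]):
            continue  # pruned: this partition is never read
        partitions_read += 1
        for row in rows:
            if DATE_DIM[row["date_sk"]] == year:  # the original dimension filter
                total += row["amount"]
    return total, partitions_read


PARTS = partition_by(SALES, "date_sk")
plain = total_for_year(PARTS, 2013)           # (25, 4): every partition read
hinted = total_for_year(PARTS, 2013, (2, 3))  # (25, 2): same total, half the reads
```

Both calls return the same total, but the hinted version opens only the partitions that can contain matching rows. That is the kind of help a hand-added clause gives an engine that does not derive the range on its own.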

You can read further details and a more technical breakdown of the benchmark by visiting Peter's blog.

At the end of the day, do benchmarks performed in a technology company's own environment, with their data and their parameters, mean anything to the people who are making purchasing decisions? Yes, they provide a point of comparison, but benchmarks shouldn't be the only point of consideration. What else should organizations consider when they're looking for a SQL-in-Hadoop platform?

• How well does it integrate with Hadoop? Is integration achieved through a connector? Does it move data between databases when querying?

• Where does it store the data? Inside or outside of HDFS?

• Is it ACID compliant?

• How is failover managed?

• Is it secure? How is access controlled?

• How is the workload managed?

• Does it support trickle updates?

• Are queries run single threaded or multithreaded?

• Does it work well with your current environment, or is it necessary to purchase additional hardware?

So, while benchmarks are usually worth as much as the paper they're printed on, Actian does outperform Impala on its own benchmark. The industry is littered with vendor claims, but one thing is certain: architectural differences are hard to beat, and Actian's columnar design, native support of YARN, and strong partnership with Hortonworks give us a sustainable advantage. Hadoop is one of the most disruptive technologies. With Actian's Hadoop SQL Edition, and through our strong partnership with Hortonworks, we aim to move even more Hadoop projects from the sandbox to the enterprise (Fig. 2).

FIG. 2.

Flowchart of improved visual data and analytics workflow.

Let's talk architecture. Here are some of our findings when looking under Impala's hood:

• Data format: Impala uses Parquet, a format that leverages Snappy compression, which is slow by Actian Vector standards. Parquet is columnar, but uses simple compression techniques. It also performs poorly at pushing down selections. Columns and column groups in Parquet have to be read fully, which makes it difficult to skip pages.

• YARN compliance: The latest version of Impala adds YARN support, which we applaud. Actian was one of the first YARN-compliant vendors.

• Reliability: Impala uses default data placement on HDFS. Actian has more advanced control over where HDFS stores its files; this replica placement control enables it to react better if a machine goes down. Impala's approach requires data to be fetched across the network. Actian instead redistributes responsibilities such that each node reads only local data. With this more efficient approach, Actian is better equipped to handle cluster failures.

• Updates: Impala is designed for read-only queries, and updates are supported by appending data to existing tables. Actian can demonstrate individual update support today.

• Design environment: Actian DataFlow provides a graphical user interface that operates natively from within Hadoop and comes complete with prebuilt operators and data-mining algorithms. Organizations design their data flows in an easy drag-and-drop designer, which makes fast, parallel analytics accessible to more people in your company, not just your most sophisticated data scientists and developers. DataFlow can read and write many Hadoop data sources, such as HBase. The result: Users can easily design powerful big data pipelines that store, query, and analyze data, without having to write one line of MapReduce code. Impala's approach requires users to gather a host of tools, including Kite, Flume, Oozie, Crunch, and Morphlines, among others, none of which is nearly as functional as DataFlow.
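The reliability point above, that responsibilities can be redistributed so each node still reads only local data, can be sketched in a few lines of Python. This is a simplified greedy illustration under stated assumptions, not Actian's actual algorithm: the block names, node names, replica layout, and the `assign_local` helper are all hypothetical.

```python
def assign_local(replicas, live_nodes):
    """Assign each block to a live node that already holds a replica of it.

    replicas:   dict mapping block id -> list of nodes storing a copy
    live_nodes: nodes currently available
    Picks the least-loaded candidate (ties broken by node name), so every
    read stays local. Raises if a block has no surviving replica.
    """
    load = {node: 0 for node in live_nodes}
    assignment = {}
    for block in sorted(replicas):
        candidates = [n for n in replicas[block] if n in load]
        if not candidates:
            raise RuntimeError(f"no live replica for block {block}")
        chosen = min(candidates, key=lambda n: (load[n], n))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical layout: four blocks, two replicas each, three nodes.
REPLICAS = {
    "b0": ["n1", "n2"],
    "b1": ["n2", "n3"],
    "b2": ["n3", "n1"],
    "b3": ["n1", "n2"],
}

before = assign_local(REPLICAS, ["n1", "n2", "n3"])
# {'b0': 'n1', 'b1': 'n2', 'b2': 'n3', 'b3': 'n1'}

after = assign_local(REPLICAS, ["n2", "n3"])  # node n1 has failed
# {'b0': 'n2', 'b1': 'n3', 'b2': 'n3', 'b3': 'n2'}
```

After the failure, every block is still served by a node that holds a copy, so no block has to cross the network. How well this works in practice depends on where the replicas were placed to begin with, which is why control over replica placement matters.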

Also of high importance is the ability of enterprises to support industry-revolutionizing technologies like Hadoop. We hang our hat on being the only vendor to deliver the necessary, and often-overlooked, fully industrialized, enterprise-grade capabilities, such as the following:

• End-to-end analytic processing natively in Hadoop with full SQL support

• Natively YARN ready

• Direct access to data stored in HDFS and the ability to manage files to improve disaster recovery

• Fully ACID compliant with multiversion read consistency, plus system-wide failover protection

• Business-critical security

• Native DBMS security, including authentication, data protection, and encryption at the user and role levels

• Mature, proven planner and optimizer; optimal use of every node, CPU, memory, and cache

• Libraries of analytics functions to make the entire analytic process more consumable and easier to manage

• Multithreaded queries

• Resources managed automatically in Hadoop via YARN

• Standard ANSI SQL-92 to support standard BI tools; plus key advanced analytics, including cube, grouping sets, and windowing functions
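For readers less familiar with the grouping constructs named in the last item, and with the rollup functions mentioned in the benchmark discussion, the following Python sketch emulates what they compute. It is an illustration of the SQL semantics only, not of any engine's implementation: the `grouping_sets` and `rollup_sets` helpers and the sample rows are hypothetical. It relies on the standard equivalence that `ROLLUP(a, b)` is shorthand for the grouping sets `(a, b)`, `(a)`, and `()`.

```python
from collections import defaultdict


def grouping_sets(rows, keys, sets, measure):
    """Sum `measure` once per grouping set.

    Columns that are not part of the active grouping set are reported as
    None, mirroring the NULL placeholders in SQL output. (Real SQL uses
    GROUPING() to tell these apart from genuine NULL data values.)
    """
    totals = defaultdict(int)
    for row in rows:
        for active in sets:
            group = tuple(row[k] if k in active else None for k in keys)
            totals[group] += row[measure]
    return dict(totals)


def rollup_sets(keys):
    """ROLLUP(k1, ..., kn) = every prefix of the key list, longest first."""
    return [tuple(keys[:depth]) for depth in range(len(keys), -1, -1)]


# Hypothetical sales rows.
ROWS = [
    {"region": "east", "product": "a", "amount": 10},
    {"region": "east", "product": "b", "amount": 5},
    {"region": "west", "product": "a", "amount": 7},
]

KEYS = ["region", "product"]
result = grouping_sets(ROWS, KEYS, rollup_sets(KEYS), "amount")
# ('east', 'a'): 10, ('east', 'b'): 5, ('west', 'a'): 7   detail rows
# ('east', None): 15, ('west', None): 7                   per-region subtotals
# (None, None): 22                                        grand total
```

A single rollup query therefore returns detail rows, subtotals, and a grand total in one pass. Without it, the same report needs several separate aggregations stitched together.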

While we do think that Actian's approach to SQL-in-Hadoop is superior, we want to give credit to other companies that have dedicated resources to creating analytical database systems for Hadoop. We support Hortonworks in their efforts to build a rock-solid Hadoop ecosystem for the growing number of users around the world. And this is only the beginning: we are excited about our continued technology advances, which we anticipate will further extend our performance lead.
