Hadoop – Whose to Choose (Part 3)

By David Teplow


Big Data is the new normal in data centers today – the inevitable result of the fact that so much of what we buy and what we do is now digitally recorded, and so many of the products we use are leaving their own “digital footprint” (known as the “Internet of Things / IoT”). The cornerstone technology of the Big Data era is Hadoop, which is now a common and compelling component of the modern data architecture. The question these days is not so much whether to embrace Hadoop but rather which distribution to choose. The three most popular and viable distributions come from Cloudera, Hortonworks and MapR Technologies. Their respective products are CDH (Cloudera Distribution of Apache Hadoop), HDP (Hortonworks Data Platform) and MapR. This series of posts looks at the differences between CDH, HDP and MapR. The first focused on The Companies behind them; the second on their respective Management / Administration Tools; this third post will tackle the important differences between their primary SQL-on-Hadoop Offerings; and the fourth and final post will take a look at some recent and relevant Performance Benchmarks.

SQL-on-Hadoop Offerings

The SQL language is what virtually every programmer and every tool uses to define, manipulate and query data, and has been since the advent of relational database management systems (RDBMS) some 35 years ago. The ability to use SQL to access data in Hadoop was therefore a critical development. Hive was the first tool to provide SQL access to data in Hadoop, through a declarative abstraction layer, the Hive Query Language (HiveQL), and a data dictionary (metadata) known as the Hive metastore. Hive was originally developed at Facebook and was contributed to the open source community / Apache Software Foundation in 2008 as a subproject of Hadoop; it graduated to top-level status in September 2010. Because Hive was originally designed to run queries as MapReduce jobs in the background, it is still seen more as a batch-oriented tool than an interactive one.
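As an illustrative sketch (the table and column names here are hypothetical), a HiveQL table definition and query look much like standard SQL, while the data itself stays in files in HDFS:

```sql
-- Hypothetical example: define a table over files already in HDFS.
-- This is "schema on read" -- the DDL records metadata in the Hive
-- metastore; it does not reformat the underlying data.
CREATE EXTERNAL TABLE page_views (
  view_time  TIMESTAMP,
  user_id    BIGINT,
  page_url   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- A familiar aggregate query; classic Hive compiles it into one or
-- more MapReduce jobs behind the scenes.
SELECT page_url, COUNT(*) AS views
FROM   page_views
GROUP  BY page_url
ORDER  BY views DESC
LIMIT  10;
```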

To address the need for an interactive SQL-on-Hadoop capability, Cloudera developed Impala. Impala is based on Dremel, a real-time, distributed query and analysis technology developed by Google. It uses the same metadata that Hive uses but provides direct access to data in HDFS or HBase through a specialized distributed query engine. Impala streams query results back as they become available rather than all at once upon query completion. While Impala offers significant benefits in terms of interactivity and speed (which will be clearly demonstrated in the next post), it also comes with certain limitations. For example, Impala is not fault-tolerant (queries must be restarted if a node fails) and, therefore, may not be suitable for long-running queries. In addition, Impala requires the working set of a query to fit into the aggregate physical memory of the cluster it's running on and, therefore, may not be suitable for multi-terabyte datasets. Version 2.0.0 of Impala, which was introduced with CDH 5.2.0, has a "Spill to Disk" option that can relax this particular limitation. Lastly, User-Defined Functions (UDFs) can only be written in Java or C++.
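Because Impala shares the Hive metastore, a table defined in Hive can be queried from impala-shell without any new DDL. A sketch (the table name is the same hypothetical one as above):

```sql
-- Tell Impala to reload metastore metadata for a table it has not
-- seen yet (or that changed outside Impala).
INVALIDATE METADATA page_views;

-- The same query now runs on Impala's own distributed query engine
-- rather than MapReduce, with results streamed back as rows are
-- produced.
SELECT page_url, COUNT(*) AS views
FROM   page_views
GROUP  BY page_url
ORDER  BY views DESC
LIMIT  10;
```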

Consistent with its commitment to develop and support only open source software, Hortonworks has stayed with Hive as its SQL-on-Hadoop offering and has worked to make it orders of magnitude faster with innovations such as Tez. Tez was introduced in Hive 0.13 / HDP 2.1 (April 2014) as part of the "Stinger Initiative". It improves Hive performance by modeling a query as a single Directed Acyclic Graph (DAG) of tasks rather than a chain of separate MapReduce jobs. From Hive 0.10 (released in January 2013) to Hive 0.13 (April 2014), performance improved an average of 52X on 50 TPC-DS Queries[1] (the total time to run all 50 queries decreased from 187.2 hours to 9.3 hours). Hive 0.14, which was released in November 2014 and comes with HDP 2.2, has support for INSERT, UPDATE and DELETE statements via ACID[2] transactions. Hive 0.14 also includes the initial version of a Cost-Based Optimizer (CBO), which has been named Calcite (f.k.a. Optiq). As we'll see in the next post, Hive is still slower than its SQL-on-Hadoop alternatives, in part because it writes intermediate results to disk (unlike Impala, which streams data between stages of a query, or Spark SQL, which holds data in memory).
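A sketch of the Hive 0.13 / 0.14 features described above (table and column names are hypothetical; in Hive 0.14, ACID DML requires a bucketed, transactional table stored as ORC):

```sql
-- Run Hive on Tez rather than classic MapReduce (Hive 0.13+).
SET hive.execution.engine=tez;

-- ACID DML (Hive 0.14+) requires a transactional, bucketed ORC table.
CREATE TABLE user_profiles (
  user_id BIGINT,
  email   STRING
)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- The new row-level DML statements:
INSERT INTO TABLE user_profiles VALUES (1, 'a@example.com');
UPDATE user_profiles SET email = 'b@example.com' WHERE user_id = 1;
DELETE FROM user_profiles WHERE user_id = 1;
```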

Like Cloudera with Impala, MapR is building its own interactive SQL-on-Hadoop tool with Drill. Like Impala, Drill is also based on Google's Dremel. Drill began in August 2012 as an incubator project under the Apache Software Foundation and graduated to top-level status in December 2014; MapR employs 13 of the 16 committers on the project. Drill can use the same metadata that Hive and Impala use (the Hive metastore), but what's unique about Drill is that it doesn't need metadata at all: schemas can be discovered on the fly (as opposed to an RDBMS's schema on write, or Hive's and Impala's schema on read, where the schema must still be declared in advance) by taking advantage of self-describing data formats such as XML, JSON, BSON, Avro, or Parquet.
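Drill's schema-on-the-fly means a self-describing file can be queried with no DDL at all. A sketch (the path is hypothetical; `dfs` is Drill's default filesystem storage plugin):

```sql
-- No CREATE TABLE and no metastore entry: Drill infers the schema
-- from the JSON documents themselves as it reads them.
SELECT t.user_id, t.page_url
FROM   dfs.`/data/events/clicks.json` AS t
WHERE  t.page_url LIKE '%/checkout%';
```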

A fourth option that none of the three companies are touting as their primary SQL-on-Hadoop offering but all have included in their distributions is Spark SQL (f.k.a. "Shark"). Spark is another implementation of the DAG approach (like Tez). A significant innovation that Spark offers is Resilient Distributed Datasets (RDDs), an abstraction that makes it possible to work with distributed data in memory. Spark is a top-level project under the Apache Software Foundation. It was originally developed at the UC Berkeley AMPLab, became an incubator project in June 2013, and graduated to top-level status in February 2014. Spark currently has 32 committers from 12 different organizations (the most active being Databricks with 11 committers, UC Berkeley with 7, and Yahoo! with 4). CDH 5.3.2 includes Spark 1.2.0; HDP 2.2.2 includes Spark 1.2.1; and MapR 4.1 includes Spark 1.2.1 (as well as Impala 1.4.1). Furthermore, most major tool vendors have native Spark SQL connectors, including MicroStrategy, Pentaho, QlikView, Tableau, Talend, etc. In addition to HDFS, Spark can run against HBase, MongoDB, Cassandra, JSON, and text files. Spark not only provides database access (with Spark SQL), but also has built-in libraries for continuous data processing (with Spark Streaming), machine learning (with MLlib), and graph analytics (with GraphX). While Spark and Spark SQL are still relatively new to the market, they have been rapidly enhanced and embraced, and have the advantage of vendor neutrality – not being owned or invented by any of the three companies, while being endorsed by all three. In my opinion, this gives Spark SQL the best chance of becoming the predominant – if not the standard – SQL-on-Hadoop tool.
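In the Spark 1.2 releases bundled with these distributions, Spark SQL's new data sources API lets a JSON file be exposed as a table directly from SQL. A sketch (paths and names are hypothetical):

```sql
-- Register a JSON file as a temporary table via the data sources API
-- introduced in Spark SQL 1.2; the schema is inferred from the data.
CREATE TEMPORARY TABLE clicks
USING org.apache.spark.sql.json
OPTIONS (path '/data/events/clicks.json');

-- Queried like any other table; intermediate data is held in memory
-- as RDDs rather than written to disk between stages.
SELECT page_url, COUNT(*) AS views
FROM   clicks
GROUP  BY page_url
ORDER  BY views DESC
LIMIT  10;
```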

My fourth and final post will compare the relative performance of the different distributions and their SQL-on-Hadoop offerings by looking at some recent and relevant Performance Benchmarks.

[1] 30TB dataset / Scale Factor of 30,000 on a 20-node cluster.

[2] Atomic, Consistent, Isolated and Durable – see: http://en.wikipedia.org/wiki/ACID.
