Deciding on Hadoop

If you’re considering Hadoop but are unsure which distribution, framework, or SQL-on-Hadoop tool to use, Integra can help.

  

Hadoop is a breakthrough in data storage and processing technology. Your decision to use it can be justified by any number of measures: price, performance, time-to-value, flexibility, and functionality, among others. The decision to use Hadoop, however, is only the first of many decisions that must be made.

The open source community, which gave us Hadoop and the ever-expanding list of projects in its ecosystem, is great at developing new tools and technologies, but not very good at coalescing around its leaders. That makes things harder for potential buyers, who must pick the winners and losers themselves and decide on company standards where no industry standards exist.

That puts the burden of research and decision-making in your hands, starting with whose distribution to use. The leading vendors are Cloudera, Hortonworks, IBM, and MapR Technologies. In my opinion, we have an exception here to the old IT adage: “No one ever got fired for buying IBM.”

Next, you have to choose the best processing framework for each use case. That was easy in the Hadoop 1.x days, when MapReduce was the only option. These days, the most prominent and, in most cases, the most appropriate framework is Spark. An exception is search applications, for which Solr is better suited.

When it comes to picking a SQL-on-Hadoop tool, there are far too many options to choose from. Your short list should include:

  • Hive
  • Impala
  • Spark SQL
  • Drill

That list could be lengthened with any or all of:

  • Presto
  • Big SQL
  • Phoenix
  • HAWQ

If you’re in a highly concurrent environment, where many users run the same queries at the same time, then Impala is probably the best choice. If you’re in a high-reuse environment, where the same queries are rerun frequently throughout the day, then Spark SQL is probably the best choice.
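For readers who like to see a rule of thumb written down, here is a minimal sketch of the guidance above in Python. Everything in it is illustrative: the `Workload` fields, the two thresholds, and the choice to check concurrency before reuse are assumptions made for this example, not measured cutoffs. Treat the output as a starting point for benchmarking rather than a verdict.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Rough description of a query workload (hypothetical fields)."""
    concurrent_users: int  # users running the same queries at the same time
    reruns_per_day: int    # how often the same query is rerun in a day


# Hypothetical thresholds, chosen only for illustration.
HIGH_CONCURRENCY = 50
HIGH_REUSE = 10


def suggest_sql_engine(workload: Workload) -> str:
    """Apply the rule of thumb from the text to a workload description."""
    # Assumption: when both conditions hold, concurrency is checked first.
    if workload.concurrent_users >= HIGH_CONCURRENCY:
        return "Impala"
    if workload.reruns_per_day >= HIGH_REUSE:
        return "Spark SQL"
    # Neither condition applies, so the rule of thumb gives no answer.
    return "benchmark the short list"


print(suggest_sql_engine(Workload(concurrent_users=200, reruns_per_day=2)))
print(suggest_sql_engine(Workload(concurrent_users=5, reruns_per_day=40)))
```

A workload that meets neither condition falls through to the last branch, which is the common case: the only reliable answer is to run your own queries against your own data on the short-listed tools.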

Integra can help with these decisions and the many others that come up when selecting and implementing Hadoop and its ecosystem. We can show you the relative performance of one distribution or technology versus another, running your queries against your data. We’ve done this for other customers to help with their selections and implementations. We’ve also given conference presentations and published papers on these issues. Let us share our knowledge and experience with you.
