Hadoop – Whose to Choose (Part 2)

By David Teplow


Big Data is the new normal in data centers today – the inevitable result of the fact that so much of what we buy and what we do is now digitally recorded, and so many of the products we use now leave their own “digital footprint” (the so-called “Internet of Things”, or IoT). The cornerstone technology of the Big Data era is Hadoop, which is now a common and compelling component of the modern data architecture. The question these days is not so much whether to embrace Hadoop but rather which distribution to choose. The three most popular and viable distributions come from Cloudera, Hortonworks and MapR Technologies. Their respective products are CDH (Cloudera Distribution of Apache Hadoop), HDP (Hortonworks Data Platform) and MapR. This series of posts looks at the differences between CDH, HDP and MapR. My first post focused on The Companies behind them; this second post discusses their respective Management/Administration Tools; the third will tackle the important differences between their primary SQL-on-Hadoop Offerings; and the fourth and final post will take a look at some recent and relevant Performance Benchmarks.

Part 2 – Management/Administration Tools

All three vendors have comprehensive tools for configuring, managing and monitoring your Hadoop cluster. In fact, all three received scores of 5 (on a scale of 0 to 5) for “Setup, management, and monitoring tools” in Forrester’s report on “Big Data Hadoop Solutions, Q1 2014”. The main difference between the three is that Hortonworks offers a completely open source, completely free tool (Apache Ambari) while Cloudera and MapR offer their own proprietary tools (Cloudera Manager and MapR Control System, respectively). While free versions of these tools do come with the free versions of Cloudera’s and MapR’s distributions (Cloudera Express and MapR Community Edition, respectively), the tools’ advanced features only come with the paid editions of those distributions (Cloudera Enterprise and MapR Enterprise Edition, respectively).

That’s sort of like having a car but only getting satellite radio when you pay a subscription. Although with Cloudera Manager and MapR Control System, it’s more like having the navigation system, Bluetooth connectivity, and the airbags enabled only when you pay a subscription. You can get from place to place just fine without these extras but, in certain cases, it sure would be nice to have the use of these “advanced features”. When you drive Ambari off the lot, on the other hand, you’re free to use any and all available features.

The advanced features of Cloudera Manager, which are only enabled by subscription, include:

  • Quota Management for setting/tracking user and group-based quotas/usage.
  • Configuration History / Rollbacks for tracking all actions and configuration changes, with the ability to roll back to previous states.
  • Rolling Updates for staging service updates and restarts to portions of the cluster sequentially to minimize downtime during cluster upgrades/updates.
  • AD Kerberos and LDAP / SAML Integration for authenticating users against existing enterprise directory and single sign-on services.
  • SNMP Support for sending Hadoop-specific events/alerts to global monitoring tools as SNMP traps.
  • Scheduled Diagnostics for sending a snapshot of the cluster state to Cloudera support for optimization and issue resolution.
  • Automated Backup and Disaster Recovery for configuring/managing snapshotting and replication workflows for HDFS, Hive and HBase.

The advanced features of MapR Control System (MCS), which are only enabled by subscription, include:

  • Advanced Multi-Tenancy with control over job placement and data placement.
  • Consistent Point-In-Time Snapshots for hot backups and to recover data from deletions or corruptions due to user or application error.
  • Disaster Recovery through remote replicas created with block level, differential mirroring with multiple topology configurations.

Apache Ambari has a number of advanced features (which are always free and enabled), such as:

  • Configuration versioning and history provide visibility, auditing and coordinated control over configuration changes, and management of all services and components deployed on your Hadoop cluster (rollback will be supported in the next release of Ambari).
  • Views Framework provides plug-in UI capabilities to surface custom visualization, management and monitoring features in the Ambari Web console. A “view” is a way of extending Ambari that allows 3rd parties to plug in new resource types along with the APIs, providers and UI to support them. In other words, a view is an application that is deployed into the Ambari container.
  • Blueprints provide declarative definitions of a cluster, which allows you to specify a Stack, the Component layout and the configurations to materialize a Hadoop cluster instance (via a REST API) without the need for any user interaction.
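To make the Blueprints idea a little more concrete, here is a minimal Python sketch of the two JSON documents involved: a blueprint (stack, component layout, configurations) and a cluster creation template that maps real hosts onto the blueprint's host groups. The field names and the `/blueprints/<name>` and `/clusters/<name>` endpoints follow my understanding of the Ambari REST API; the server URL, blueprint name, cluster name, host names and the `build_*` helpers are all hypothetical. The component list is deliberately short and is not a complete, validated topology, and authentication is left out. The sketch only builds the requests; it does not send them.

```python
import json
import urllib.request

# Hypothetical values, for illustration only.
AMBARI_URL = "http://ambari.example.com:8080/api/v1"
BLUEPRINT_NAME = "small-demo"
CLUSTER_NAME = "demo-cluster"


def build_blueprint(stack_name, stack_version):
    """Declarative definition: the stack, the component layout, the configs."""
    return {
        "Blueprints": {"stack_name": stack_name, "stack_version": stack_version},
        "host_groups": [
            {
                "name": "master",
                "cardinality": "1",
                "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}],
            },
            {
                "name": "workers",
                "cardinality": "3",
                "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}],
            },
        ],
        # Assumed layout of the configurations section; check it against
        # the Ambari version you are running.
        "configurations": [{"hdfs-site": {"dfs.replication": "3"}}],
    }


def build_cluster_template(blueprint_name, hosts_by_group):
    """Map concrete hosts onto the host groups named in the blueprint."""
    return {
        "blueprint": blueprint_name,
        "host_groups": [
            {"name": group, "hosts": [{"fqdn": host} for host in hosts]}
            for group, hosts in hosts_by_group.items()
        ],
    }


def build_post(url, payload):
    """Prepare (but do not send) a POST request carrying a JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Requested-By": "ambari", "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    blueprint = build_blueprint("HDP", "2.1")
    template = build_cluster_template(
        BLUEPRINT_NAME,
        {
            "master": ["master1.example.com"],
            "workers": ["worker%d.example.com" % i for i in (1, 2, 3)],
        },
    )
    # Step 1 registers the blueprint; step 2 asks Ambari to build the cluster.
    for req in (
        build_post("%s/blueprints/%s" % (AMBARI_URL, BLUEPRINT_NAME), blueprint),
        build_post("%s/clusters/%s" % (AMBARI_URL, CLUSTER_NAME), template),
    ):
        print(req.get_method(), req.full_url)
```

Keeping the layout (the blueprint) separate from the host mapping (the template) is what lets the same blueprint be reused to stand up several clusters without anyone clicking through an install wizard.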

Ambari leverages other open source tools that may already be in use within your data center, such as Ganglia for metrics collection and Nagios for system alerting (e.g. sending emails when a node goes down, remaining disk space is low, etc.). Furthermore, Apache Ambari provides APIs to integrate with existing management systems including Microsoft System Center and Teradata Viewpoint.
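For readers who have not worked with Nagios, the “remaining disk space is low” alert mentioned above boils down to a small check that reports a status code. The Python sketch below follows the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING) and 2 (CRITICAL). The thresholds, function names and the path being checked are assumptions for illustration; this is not the check that Ambari itself ships.

```python
import shutil
import sys

# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_free_space(free_bytes, total_bytes, warn_pct=20.0, crit_pct=10.0):
    """Return (status, free_pct) for the given free and total byte counts."""
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct < crit_pct:
        return CRITICAL, free_pct
    if free_pct < warn_pct:
        return WARNING, free_pct
    return OK, free_pct


def check_path(path="/"):
    """Measure the filesystem holding `path` and classify its free space."""
    usage = shutil.disk_usage(path)
    return classify_free_space(usage.free, usage.total)


if __name__ == "__main__":
    status, free_pct = check_path("/")
    # Nagios reads the first line of output and the process exit code.
    print("DISK %s - %.1f%% free" % (LABELS[status], free_pct))
    sys.exit(status)
```

The monitoring server runs a check like this on a schedule and turns a WARNING or CRITICAL result into a notification, such as the emails described above.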

My next post will tackle the important differences between the three distributions’ primary SQL-on-Hadoop Offerings.

