Archive for March, 2015

Hadoop – Whose to Choose (Part 2)

March 25, 2015

By David Teplow

Background  

Big Data is the new normal in data centers today – the inevitable result of the fact that so much of what we buy and what we do is now digitally recorded, and so many of the products we use (the so-called “Internet of Things,” or IoT) are leaving their own “digital footprint.” The cornerstone technology of the Big Data era is Hadoop, which is now a common and compelling component of the modern data architecture. The question these days is not so much whether to embrace Hadoop but rather which distribution to choose. The three most popular and viable distributions come from Cloudera, Hortonworks and MapR Technologies. Their respective products are CDH (Cloudera Distribution of Apache Hadoop), HDP (Hortonworks Data Platform) and MapR. This series of posts will look at the differences between CDH, HDP and MapR. My first post focused on The Companies behind them; this second post will discuss their respective Management/Administration Tools; the third will tackle the important differences between their primary SQL-on-Hadoop Offerings; and the fourth and final post will take a look at some recent and relevant Performance Benchmarks.

Part 2 – Management/Administration Tools

All three vendors have comprehensive tools for configuring, managing and monitoring your Hadoop cluster. In fact, all three received scores of 5 (on a scale of 0 to 5) for “Setup, management, and monitoring tools” in Forrester’s report on “Big Data Hadoop Solutions, Q1 2014”. The main difference among the three is that Hortonworks offers a completely open source, completely free tool (Apache Ambari), while Cloudera and MapR offer their own proprietary tools (Cloudera Manager and MapR Control System, respectively). While free versions of these tools do come with the free versions of Cloudera’s and MapR’s distributions (Cloudera Express and MapR Community Edition, respectively), the tools’ advanced features only come with the paid editions of those distributions (Cloudera Enterprise and MapR Enterprise Edition, respectively).

That’s sort of like having a car but only getting satellite radio when you pay a subscription. With Cloudera Manager and MapR Control System, though, it’s more like having the navigation system, Bluetooth connectivity and the airbags enabled only when you pay a subscription. You can get from place to place just fine without these extras, but in certain cases it sure would be nice to have these “advanced features” at your disposal. When you drive Ambari off the lot, on the other hand, you’re free to use any and all available features.

The advanced features of Cloudera Manager, which are only enabled by subscription, include:

  • Quota Management for setting/tracking user and group-based quotas/usage.
  • Configuration History / Rollbacks for tracking all actions and configuration changes, with the ability to roll back to previous states.
  • Rolling Updates for staging service updates and restarts to portions of the cluster sequentially to minimize downtime during cluster upgrades/updates.
  • Active Directory Kerberos and LDAP / SAML Integration for enterprise authentication and single sign-on.
  • SNMP Support for sending Hadoop-specific events/alerts to global monitoring tools as SNMP traps.
  • Scheduled Diagnostics for sending a snapshot of the cluster state to Cloudera support for optimization and issue resolution.
  • Automated Backup and Disaster Recovery for configuring/managing snapshotting and replication workflows for HDFS, Hive and HBase.
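
These features are surfaced in Cloudera Manager’s web console, but the tool can also be driven programmatically through its REST API. Below is a minimal sketch, with a hypothetical host and default credentials, that simply lists each managed cluster and the health of its services; the endpoint paths follow the standard /api/vN/clusters resources.

```python
# A minimal sketch of reading cluster and service status from the Cloudera
# Manager REST API. The host, credentials and API version are illustrative
# placeholders, not values from the original post.
import requests

CM_URL = "http://cm-host.example.com:7180/api/v10"  # hypothetical CM server
AUTH = ("admin", "admin")                           # replace with real credentials

# List every cluster this Cloudera Manager instance manages
clusters = requests.get(CM_URL + "/clusters", auth=AUTH).json()["items"]

for cluster in clusters:
    # For each cluster, report the health summary of each deployed service
    services = requests.get(CM_URL + "/clusters/%s/services" % cluster["name"],
                            auth=AUTH).json()["items"]
    for service in services:
        print(cluster["name"], service["name"], service.get("healthSummary"))
```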

The advanced features of MapR Control System (MCS), which are only enabled by subscription, include:

  • Advanced Multi-Tenancy with control over job placement and data placement.
  • Consistent Point-In-Time Snapshots for hot backups and to recover data from deletions or corruptions due to user or application error.
  • Disaster Recovery through remote replicas created with block-level differential mirroring and support for multiple topology configurations.

Apache Ambari has a number of advanced features (which are always free and enabled), such as:

  • Configuration versioning and history provide visibility, auditing and coordinated control over configuration changes, as well as management of all services and components deployed on your Hadoop cluster (rollback will be supported in the next release of Ambari).
  • Views Framework provides plug-in UI capabilities to surface custom visualization, management and monitoring features in the Ambari Web console. A “view” is a way of extending Ambari that allows third parties to plug in new resource types along with the APIs, providers and UI to support them. In other words, a view is an application that is deployed into the Ambari container.
  • Blueprints provide declarative definitions of a cluster, which allow you to specify a Stack, the component layout and the configurations needed to materialize a Hadoop cluster instance (via a REST API) without any user interaction; a minimal sketch of this API appears just after this list.
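
Because Blueprints are exercised entirely through Ambari’s REST API, here is a minimal sketch of what that looks like in practice. The host names, credentials, stack version and component layout below are illustrative assumptions rather than a recommended topology: the first call registers a blueprint, and the second maps real hosts onto its host groups to materialize the cluster.

```python
# A minimal sketch of creating a cluster from an Ambari Blueprint via the
# Ambari REST API. The blueprint contents, host names and credentials are
# placeholders; a production blueprint would include additional components
# (ZooKeeper, clients, etc.) and configuration overrides.
import requests

AMBARI = "http://ambari-host.example.com:8080/api/v1"  # hypothetical Ambari server
AUTH = ("admin", "admin")
HEADERS = {"X-Requested-By": "ambari"}  # required by Ambari for modifying requests

# 1. Register a blueprint: the stack to use and which components go in each host group
blueprint = {
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.2"},
    "host_groups": [
        {"name": "master", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}]},
        {"name": "workers", "cardinality": "3",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
}
requests.post(AMBARI + "/blueprints/demo-blueprint",
              json=blueprint, auth=AUTH, headers=HEADERS)

# 2. Instantiate a cluster by mapping real hosts onto the blueprint's host groups
cluster = {
    "blueprint": "demo-blueprint",
    "host_groups": [
        {"name": "master", "hosts": [{"fqdn": "master1.example.com"}]},
        {"name": "workers", "hosts": [{"fqdn": "worker%d.example.com" % i}
                                      for i in (1, 2, 3)]},
    ],
}
requests.post(AMBARI + "/clusters/demo-cluster",
              json=cluster, auth=AUTH, headers=HEADERS)
```

Because the blueprint is plain JSON, the same definition can be kept in version control and replayed to stand up matching development, test and production clusters.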

Ambari leverages other open source tools that may already be in use within your data center, such as Ganglia for metrics collection and Nagios for system alerting (e.g., sending an email when a node goes down or when remaining disk space runs low). Furthermore, Apache Ambari provides APIs to integrate with existing management systems, including Microsoft System Center and Teradata ViewPoint.

My next post will tackle the important differences between the three distributions’ primary SQL-on-Hadoop Offerings.

Hadoop – Whose to Choose (Part 1)

March 18, 2015

By David Teplow

Background

Big Data is the new normal in data centers today – the inevitable result of the fact that so much of what we buy and what we do is now digitally recorded, and so many of the products we use (the so-called “Internet of Things,” or IoT) are leaving their own “digital footprint.” The cornerstone technology of the Big Data era is Hadoop, which is now a common and compelling component of the modern data architecture. The question these days is not so much whether to embrace Hadoop but rather which distribution to choose. The three most popular and viable distributions come from Cloudera, Hortonworks and MapR Technologies. Their respective products are CDH (Cloudera Distribution of Apache Hadoop), HDP (Hortonworks Data Platform) and MapR. This series of posts will look at the differences between CDH, HDP and MapR. This first post will focus on The Companies behind them; the second on their respective Management/Administration Tools; the third will tackle the important differences between their primary SQL-on-Hadoop Offerings; and the fourth and final post will take a look at some recent and relevant Performance Benchmarks.

Part 1 – The Three Contenders: Cloudera, Hortonworks and MapR

The table below shows some key facts and figures related to each company.

[Table 1: key facts and figures for Cloudera, Hortonworks and MapR]

Of the three companies, only Hortonworks is traded publicly (as of 12/12/14; NASDAQ: HDP). So valuations, revenues and other financial measures are harder to ascertain for Cloudera and MapR.

Cloudera’s valuation is based on a $740M investment by Intel in March 2014 for an 18% stake in the company. Hortonworks’ valuation is based on its stock price of $27 on 12/31/14, which happens to be equivalent to its valuation in July 2014, when HP invested $50M for just under 5% of the company. When Hortonworks went public in December 2014, they raised $110M on top of the $248M they had raised privately before their Initial Public Offering (IPO), which brings the total to the $358M in the table above. That’s one third of what Cloudera has raised but twice what MapR has. I haven’t found any information that would allow me to determine a valuation for MapR, though Google Capital and others made an $80M investment in MapR (for an undisclosed equity stake) in June 2014.

Cloudera announced that for the twelve months ending 1/31/15, “Preliminary unaudited total revenue surpassed $100 million.” Hortonworks’ $46M revenue is for the year ending 12/31/14. I haven’t seen revenue figures for MapR for 2014 or any recent 12-month period, so the figure above is Wikibon’s estimate of 2013 revenue. My best guess is that their 2014 revenue was in the $40M to $45M range. Price/Earnings, or P/E, is a common financial measure for comparing companies, but since none of the three companies has yet earned a profit, I’ve used Price/Sales in the table above. For comparison, Oracle typically trades at around 5.1X Sales and Red Hat at around 7.8X Sales. So 41X for Cloudera and 24X for Hortonworks, while not quite off the scale, are exceedingly high.
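
As a quick sanity check, the Price/Sales multiples in the table can be reproduced from the figures cited above. The sketch below treats “just under 5%” as roughly 4.8%, which is my assumption; everything else comes straight from the numbers in this post.

```python
# Reproducing the Price/Sales multiples from the figures cited in this post
# (all amounts in millions of dollars; "just under 5%" is assumed to be ~4.8%).

# Cloudera: Intel's $740M bought an 18% stake, implying the overall valuation
cloudera_valuation = 740 / 0.18              # ~ $4,111M
cloudera_ps = cloudera_valuation / 100       # ~ 41x on ~$100M of trailing revenue

# Hortonworks: HP's $50M bought just under 5%, implying a valuation around $1B
hortonworks_valuation = 50 / 0.048           # ~ $1,042M
hortonworks_ps = hortonworks_valuation / 46  # ~ 23x on $46M of 2014 revenue

print(round(cloudera_ps), round(hortonworks_ps))   # roughly 41 and 23
```

The small difference from the 24X shown in the table comes down to rounding and the assumed size of HP’s stake.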

The last two rows in the table above show how many employees each company has on the Project Management Committee and as additional Committers on the Apache Hadoop project. This reflects each company’s level of involvement in, and commitment to, sustaining and enhancing Hadoop through the open source community. From the first four rows of Table 1, it is clear that Cloudera has the lead in terms of being first to market, raising the most money, having the highest valuation, and selling the most software. Hortonworks, on the other hand, is the leading proponent of Hadoop as a vibrant and innovative open source project. This is true not only for the core Hadoop project and its most essential sub-projects like Hadoop Common, Hadoop MapReduce and Hadoop YARN (the Project Lead for each of these is employed by Hortonworks[1]), but also for most related projects like Ambari, Hive, Pig, etc.

My next post will explore the different Management/Administration Tools offered by Cloudera, Hortonworks and MapR.

[1] Facebook employs the Project Lead for Hadoop HDFS.

The Database Emperor Has No Clothes

March 12, 2015

Hadoop’s Inherent Advantages Over RDBMS for DW / BI; Especially in the “Big Data” Era

By David Teplow
[NOTE: This article was originally published in The Data Warehousing Institute’s quarterly BI Journal in March 2013.  It was also one of two articles chosen from 2013 to be republished in TDWI’s annual Best of BI in February 2014.]

Background

The relational model was specified by IBM’s E.F. Codd in 1970 and first commercialized as a relational database management system (RDBMS) by Oracle Corporation (then Relational Software, Inc.) in 1979. Since that time, practically every database has been built using an RDBMS—either proprietary (Oracle, SQL Server, DB2, and so on) or open source (MySQL, PostgreSQL). This was entirely appropriate for transactional systems that dealt with structured data and benefited when that data was normalized.

In the late 1980s, we began building decision support systems (DSS)—also referred to as business intelligence (BI), data warehouse (DW), or analytical systems. We used RDBMS for these, too, because it was the de facto standard and essentially the only choice. To meet the performance requirements of DSS, we denormalized the data to eliminate the need for most table joins, which are costly from a resource and time perspective. We accepted this adaptation (some would say “misuse”) of the relational model because there were no other options—until recently.

Relational databases are even less suitable for handling so-called “Big Data.” Transactional systems were designed for just that—transactions; data about a point in time when a purchase occurred or an event happened. Big Data is largely a result of the electronic record we now have about the activity that precedes and follows a purchase or event. This data includes the path taken to a purchase—either physical (surveillance video, location service, or GPS device) or virtual (server log files or clickstream data). It also includes data on where customers may have veered away from a purchase (product review article or comment, shopping cart removal or abandonment, jumping to a competitor’s site), and it certainly includes data about what customers say or do as a result of purchases or events via tweets, likes, yelps, blogs, reviews, customer service calls, and product returns. All this data dwarfs transactional data in terms of volume, and it usually does not lend itself to the structure of tables and fields.

The Problems with RDBMS for DW / BI

To meet the response-time demands of DSS, we pre-joined and pre-aggregated data into star schemas or snowflake schemas (dimensional models) instead of storing data in third normal form (relational models). This implied that we already knew what questions we would need to answer, so we could create the appropriate dimensions by which to measure facts. In the real world, however, the most useful data warehouses and data marts are built iteratively. Over time, we realize that additional data elements or whole new dimensions are needed or that the wrong definition or formula was used to derive a calculated field value. These iterations entail changes to the target schema along with careful and often significant changes to the extract-transform-load (ETL) process.

The benefit of denormalizing data in a data warehouse is that it largely avoids the need for joining tables, which are usually quite large and require an inordinate amount of machine resources and time to join. The risk associated with denormalization is that it makes the data susceptible to update anomalies if field values change.

For example, suppose the price of a certain item changes on a certain date. In our transactional system, we would simply update the Price field in the Item table or “age out” the prior price by updating the effective date and adding a new row to the table with the new price and effective dates. In our data warehouse, however, the price would most likely be contained within our fact table and replicated for each occurrence of the item.

Anomalies can be introduced by an update statement that misses some occurrences of the old price or catches some it shouldn’t have. Anomalies might also result from an incremental data load that runs over the weekend and selects the new price for every item purchased in the preceding week when, in fact, the price change was effective on Wednesday (which may have been the first of the month) and should not have been applied to earlier purchases.
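
To make the anomaly concrete, here is a small sketch using SQLite with an invented, denormalized sales fact table: the weekend load’s UPDATE applies the new price to every sale from the preceding week, even though the change only took effect on Wednesday.

```python
# A small illustration of the update anomaly described above, using an invented,
# denormalized sales fact table in SQLite. The unit price is repeated on every
# sales row, so an UPDATE that ignores the effective date rewrites history.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_fact (item_id INTEGER, sale_date TEXT, unit_price REAL)")
db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    (42, "2015-03-02", 9.99),   # Monday: sold at the old price
    (42, "2015-03-03", 9.99),   # Tuesday: sold at the old price
    (42, "2015-03-06", 9.99),   # Friday: should reflect the new price of 10.99
])

# Anomalous weekend load: applies the new price to everything sold last week,
# even though the change only took effect on Wednesday, 2015-03-04.
db.execute("UPDATE sales_fact SET unit_price = 10.99 "
           "WHERE item_id = 42 AND sale_date >= '2015-03-02'")

# The correct statement must carry the effective date into the predicate:
#   UPDATE sales_fact SET unit_price = 10.99
#    WHERE item_id = 42 AND sale_date >= '2015-03-04'

for row in db.execute("SELECT * FROM sales_fact ORDER BY sale_date"):
    print(row)   # Monday's and Tuesday's sales now (incorrectly) show 10.99
```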

With any RDBMS, the schema must be defined and created in advance, which means that before we can load our data into the data warehouse or data mart, it must be transformed—the dreaded “T” in ETL. Transformation processes tend to be complex, as they involve some combination of deduplicating, denormalizing, translating, homogenizing, and aggregating data, as well as maintaining metadata (that is, “data about the data” such as definitions, sources, lineage, derivations, and so on). Typically, they also entail the creation of an additional, intermediary database—commonly referred to as a staging area or an operational data store (ODS). This additional database comes with the extra costs of another license and database administrator (DBA). This is also true for any data marts that are built, which is often done for each functional area or department of a company.

Each step in the ETL process involves not only effort, expense, and risk, but also requires time to execute (not to mention the time required to design, code, test, maintain, and document the process). Decision support systems are increasingly being called on to support real-time operations such as call centers, military intelligence, recommendation engines, and personalization of advertisements or offers. When update cycles must execute more frequently and complete more rapidly, a complex, multi-step ETL process simply will not keep up when high volumes of data arriving at high velocity must be captured and consumed.

The Problems with Big Data

Big Data is commonly characterized as having high levels of volume, velocity, and variety. Volume has always been a factor in BI/DW, as discussed earlier. The velocity of Big Data is high because it flows from the so-called “Internet of Things,” which is always on and includes not just social media and mobile devices but also RFID tags, Web logs, sensor networks, on-board computers, and more. To make sense of the steady stream of data that these devices emit requires a DSS that, likewise, is always on. Unfortunately, high availability is not standard with RDBMS, although each brand offers options that provide fault resilience or even fault tolerance. These options are neither inexpensive to license nor easy to understand and implement. To ensure that Oracle is always available requires RAC (Real Application Clusters for server failover) and/or Data Guard (for data replication). RAC will add over 48 percent to the cost of your Oracle license; Data Guard, over 21 percent[1].

Furthermore, to install and configure RAC or Data Guard properly is not simple or intuitive, but instead requires specialized expertise about Oracle as well as your operating system. We were willing to pay this price for transactional systems because our businesses depended on them to operate. When the DSS was considered a “downstream” system, we didn’t necessarily need it to be available all the time. For many businesses today, however, decision support is a mainstream system that is needed 24/7.

Variety is perhaps the biggest “Big Data” challenge and the primary reason it’s poorly suited for RDBMS. The many formats of Big Data can be broadly categorized as structured, semi-structured, or unstructured. Most data about a product return and some data about a customer service call could be considered structured and is readily stored in a relational table. For the most part, however, Big Data is semi-structured (such as server log files or likes on a Facebook page) or completely unstructured (such as surveillance video or product-related articles, reviews, comments, or tweets). These data types do not fit neatly—if at all—into tables made up of fields that are rigidly typed (for example, six-digit integer, floating point number, fixed- or variable-length character string of exactly X or no more than Y characters, and so on) and often come with constraints (for example, range checks or foreign key lookups).

Like high availability, high performance is an option for an RDBMS, and vendors have attempted to address this with features that enable partitioning, caching, and parallelization. To take advantage of these features, we have to license these software options and also purchase high-end (that is, expensive) hardware to run it on—full of disks, controllers, memory, and CPUs. We then have to configure the database and the application to take advantage of components such as data partitions, memory caches and/or parallel loads, parallel joins/selects, and parallel updates.

A New Approach

In December of 2004, Google published a paper on MapReduce, which was a method it devised to store data across hundreds or even thousands of servers, then use the power of each of those servers as worker nodes to “map” its own local data and pass along the results to a master node that would “reduce” the result sets to formulate an answer to the question or problem posed. This allowed a “Google-like” problem (such as which servers across the entire Internet have content related to a particular subject and which of those are visited most often) to be answered in near real time using a divide-and-conquer approach that is both massively parallel and infinitely scalable.
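
The canonical way to illustrate this divide-and-conquer model is counting words across a set of documents. The sketch below simulates the map, shuffle and reduce phases in a single Python process purely to show how the data flows; on a real cluster, each mapper runs on a worker node against its own local data, and the shuffle routes each key to a reducer on a (possibly different) node.

```python
# A toy, single-process illustration of the map -> shuffle -> reduce flow
# described above, using the classic word-count example.
from collections import defaultdict

def mapper(document):
    # "Map" a block of local data to intermediate (key, value) pairs
    for word in document.split():
        yield word.lower(), 1

def reducer(word, counts):
    # "Reduce" all values that the shuffle grouped under one key
    return word, sum(counts)

documents = ["Hadoop stores data in HDFS",
             "MapReduce processes data where the data lives"]

# Shuffle: group intermediate pairs by key across all mappers
grouped = defaultdict(list)
for doc in documents:
    for word, count in mapper(doc):
        grouped[word].append(count)

results = dict(reducer(w, c) for w, c in grouped.items())
print(results["data"])   # -> 3
```

On Hadoop itself, the same logic would typically be packaged as a Java MapReduce job or run through Hadoop Streaming, which accepts mappers and reducers written as scripts that read from standard input and write to standard output.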

Yahoo! used this MapReduce framework with its distributed file system (which grew to nearly 50,000 servers) to handle Internet searches and the required indexing of millions of websites and billions of associated documents. Doug Cutting, who led these efforts at Yahoo!, contributed this work to the open source community by creating the Apache Hadoop project, which he named for his son’s toy elephant. Hadoop has since been used by Yahoo! and Facebook to process over 300 petabytes of data. In recent years, Hadoop has been embraced by more and more companies for the analysis of more massive and more diverse data sets.

Data is stored in the Hadoop Distributed File System (HDFS) in its raw form. There is no need to normalize (or denormalize) the data, nor to transform it to fit a fixed schema, as there is with RDBMS. Hadoop requires no data schema—and no index schema. There is no need to create indexes, which often have to be dropped and then recreated after data loads in order to accelerate performance. The common but cumbersome practice of breaking large fact tables into data partitions is also unnecessary in Hadoop because HDFS does that by default. All of your data can be readily stored in Hadoop regardless of its volume (inexpensive, commodity disk drives are the norm), velocity (there is no transformation process to slow things down), or variety (there is no schema to conform to).
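
For example, landing a day of raw clickstream logs in HDFS is nothing more than copying the files in with the standard HDFS shell commands; there is no schema to declare first and no transformation step in between. The paths below are placeholders.

```python
# Landing a raw log file in HDFS as-is, with no schema and no transformation.
# The local and HDFS paths are placeholders; `hadoop fs` is the standard HDFS shell.
import subprocess

subprocess.check_call(["hadoop", "fs", "-mkdir", "-p", "/data/raw/clickstream/2015-03-12"])
subprocess.check_call(["hadoop", "fs", "-put",
                       "/var/log/web/clickstream-2015-03-12.log",
                       "/data/raw/clickstream/2015-03-12/"])
```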

As for availability and performance, Hadoop was designed from the beginning to be fault tolerant and massively parallel. Data is replicated on three separate servers by default, and if a node is unavailable or merely slow, one of the other nodes takes over processing of that data. Servers that recover or new servers that are added are automatically registered with the system and immediately leveraged for storage and processing. High availability and high performance are “baked in” without the need for any additional work, optional software, or high-end hardware.

Although getting data into Hadoop is remarkably straightforward, getting it out is not as simple as with RDBMS. Data in Hadoop is accessed by MapReduce routines that can be written in Java, Python, or Ruby, for example. This requires significantly more work than writing a SQL query. A scripting language called Pig, which is part of the Apache Hadoop project, can be used to eliminate some of the complexity of a programming language such as Java. However, even Pig is not as easy to learn and use as SQL.

Hive is another tool within the Apache Hadoop project that allows developers to build a metadata layer on top of Hadoop (called “HCatalog”) and then access data using a SQL-like interface (called “HiveQL”). In addition to these open source tools, several commercial products can simplify data access in Hadoop. I expect many more products to come from both the open source and commercial worlds to ease or eliminate the complexity inherent in MapReduce, which is currently the biggest inhibitor to Hadoop adoption. One that bears watching is a tool called “Impala,” which is being developed by Cloudera and allows you to run SQL queries against Hadoop in real time. Unlike Pig and Hive, which must be compiled into MapReduce routines and then run in “batch mode,” Impala runs interactively and directly with the data in Hadoop so that query results begin to return immediately.
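
To give a feel for the SQL-like access Hive provides, the sketch below runs a HiveQL aggregation from Python. The PyHive client, host, port and table name are my own assumptions for illustration (the article does not prescribe a particular client); the same statement could equally be run from the Hive command line or, for interactive response times, against Impala.

```python
# A minimal sketch of querying data in Hadoop with HiveQL rather than
# hand-written MapReduce. Host, port and table name are placeholders.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# HiveQL looks like SQL but is compiled into MapReduce jobs under the covers
cursor.execute("""
    SELECT page, COUNT(*) AS hits
      FROM clickstream_logs
     GROUP BY page
     ORDER BY hits DESC
     LIMIT 10
""")
for page, hits in cursor.fetchall():
    print(page, hits)
```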

Conclusion

Relational Database Management Systems (RDBMS) have been around for more than 30 years and have proven to be a far better way to handle data than their predecessors. They are particularly well suited for transactional systems, which quickly and rightfully made them a standard for the type of data processing that was prevalent in the 1980s and 1990s. Decades ago, when BI / DW / Analytical systems became common, we found ways to adapt RDBMS for these systems. However, these adaptations were unnatural in terms of the relational model, and inefficient in terms of the data staging and transformation processes they created. We tolerated this because it achieved acceptable results—for the most part. Besides, what other option did we have?

When companies such as Google, Yahoo!, and Facebook found that relational databases were simply unable to handle the massive volumes of data they have to deal with—and necessity being the mother of invention—a new and better way was developed to put data in context (after all, data in context is information; information in context is knowledge). In this age of Big Data, more companies must now deal with data that not only comes in much higher volumes, but also at much faster velocity and in much greater variety.

Relational databases are no longer the only game in town, and for analytical systems (versus transactional systems), they are no longer the best available option.

About the Author
David Teplow began using Oracle with version 2 in 1981, and has worked as a consultant for Database Technologies (1986-1999) and Integra Technology Consulting (2000-present). He can be reached at: DTeplow@IntegraTC.com
[1] Per Processor License costs from the Oracle Technology Global Price List dated July 19, 2012.