March 11, 2013

Hadoop execution enhancements

Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:

Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:

With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others.  The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.

This is similar to the approach of BDAS Spark:

Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.

although Tez won’t match Spark’s richer list of primitive operations.

More specifically, there will be six primitive Tez operations:

A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step compounds — you guessed it! — input sorting, input shuffling, and HDFS output.

I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:

Comments

14 Responses to “Hadoop execution enhancements”

  1. Francois on March 11th, 2013 12:40 pm

    Are you seeing companies starting to offload apps from their data warehouses onto Hadoop? It seems like the roadmap points to continuous improvement, with issues such as concurrency being solved relatively rapidly.

  2. aaron on March 11th, 2013 6:03 pm

    An important point to make here is that Hadoops are not just ecosystems but also assembly languages for distributed filesystems and databases.

    The original frameworks are adequate for batch big file processing, perhaps an order of magnitude slower than a quickly written program plus some startup time.

    The startup time is a current hot button for twitchy fingered web priorities. This is leading to YARN and Impala and many other approaches to short and mixed workloads.

    Everyone needs to be clear that we are in transition. Current technology is broken, and the next iteration of Hadoops (not Hadoop2!) will be better at complex workload management.

    What will evolve will be a collection of:
    - Java API components to perform tasks at a high level (and higher level languages consuming them)
    - a local “OS” and distributed high level OS to perform work
    - A distributed file system that will take on either file system semantics or structured ones
    We have these now, but everything is brute force. YARN can’t do much without constrained workloads and dishing out uncontrolled processes, and with manual optimization. We have no Hadoop of Hadoops with hierarchical management….

    If you are planning a YARN implementation, you should still segregate small from large workloads on different clusters. You should also make sure your work is constrained, with little mix of small and medium size work.

  3. Curt Monash on March 11th, 2013 7:44 pm

    Francois,

    If by “offload” you mean “port”, that’s probably much more myth than reality. But workloads on Hadoop, if they could be done at all without Hadoop, would otherwise be done in RDBMS and ETL tools. Or Splunk. Heck, sometimes they’d be done BETTER in alternative technologies, e.g. RDBMS with an analytic platform flavor. So in that sense, Hadoop often offloads from somewhere else.

  4. aaron on March 12th, 2013 12:04 pm

    Most hadoop I see (99%) can be done in RDBMS, so most of it is a combination of
    - tech feeler
    - RDBMS license play (not economical to use RDBMS, archive logs and hope, archive/preprocess inputs, …)
    - tactical easy stuff, such as distributed work
    With 1% of enormous scale that would not fit in tactical RDBMS. With the exception of HBase (evolving schema in iterative development and good random access), most hadoop won’t work with any sort of traditional warehouse.

    BUT – it is starting to suck in stuff at the periphery. Semistructured or DQ failing stuff that would have been 99% discarded is now brought in to hadoop with weak links to DW. It’s nibbling at the edges.

    No sane person would migrate a DW to hadoop – it’s missing too much: it can scan, but mostly can’t join, little indexing, no governance.

  5. Francois on March 12th, 2013 1:53 pm

    Thanks aaron/Curt. Do you see any sort of boomerang or backlash effect, where the economics of putting the data back onto an RDBMS start to make sense?

    I’ve heard Teradata is down to $30k/TB list for its 6000 series. If data warehouse appliances are down to under $10k/TB in two or three years, do people view that as making more sense for workloads that are better suited for an RDBMS?

    I suppose some would argue the late-binding flexibility of Hadoop is what makes it so attractive, rather than the cost.

  6. aaron on March 12th, 2013 5:10 pm

    My take -

    Current price model:

    Server:
    - $3K 16 core
    - $1K 64G RAM + 400G SSD
    - $.15 X 24 3TB SATA drives
    Total of <$4K for a server with ~70TB net, no RAID

    Cost of RDBMS – license range:
    - pg or a few community editions ($0, chat support)
    - traditional RDBMS $15-25K/core – (~$250K + $35K annually)
    - classic analytics DBMS ranges from ~$100K to $3,000M+)

    Cost of Hadoop support – annual range from $0-$150K

    (granted – this HW intentionally has no redundancy, so hadoop needs a master and 3x redundancy, rdbms not as clear. Compression changes density as well)

    So – this nifty workstation for a data scientist costs virtually nothing and is a rounding error in cost relative to license and support.

    The rest of the decade will be a dynamic in RDBMS where SQL is a commodity and with a conflict between established vendors attempting to preserve pricing (lock in, enterprise integration and recovery) and new ones looking to establish themselves (where price is less critical to them if they don't have recurring revenues to cannibalize). New vendors are at a big disadvantage in the near term, but….

    The license/sales model will drive how the other stuff goes, mostly. This in turn will drive most of how analytics shapes up. My guess is the market breaks apart in 3-5 years. The legacy vendors will be in a real legacy mode – needing to preserve revenue with little growth and unable to compete with lower cost spreads.

    Meanwhile, Hadoop will grow in different ways. It is a long way from a generic analytic solution, but is a useful thing for specialized use cases such as giant data and compute intensive scanning (e.g., scoring) and weak semantics data. Data will grow in both camps independently. Most structured data will remain in RDBMS.

  7. Curt Monash on March 12th, 2013 6:33 pm

    Francois,

    You’re behind the times. :) Netezza was down to $20K/TB after compression in 2009. http://www.dbms2.com/2009/07/30/the-netezza-price-point/ Other firms have responded. Infobright is trying to go especially low.

  8. aaron on March 12th, 2013 7:29 pm

    Yup – the low end of supported I’ve seen is http://aws.amazon.com/redshift/pricing/ ($2-10K/tb/year; with networking and secondary disk, you can be under $5K/tb year.) Established vendors SW-only are breaking under 10K/year, twice that for appliances – based on source data….

  9. Francois on March 21st, 2013 1:13 pm

    Thanks guys. You are right, Curt, but Teradata has traditionally been at the high end of the pricing spectrum. I’d note that their 2000 series is down below 15k/TB list and their 1500 (singularity) stuff is down below $4k/TB list.

    Question: how do you think solid state drives impact the price/performance dynamic between Hadoop and MPP systems? Would SSDs (particularly for systems like Teradata that heavily use random i/o) dramatically swing the price/performance conversation back in MPP’s favor?

  10. Curt Monash on March 21st, 2013 6:31 pm

    Francois,

    1. Hadoop is just as MPP as Teradata, Netezza, Vertica, et al. I reject terminology to the contrary. :)

    2. If we impute software license revenue to both Teradata and the entire world of Hadoop distributions in Hive uses, Teradata’s advantage might be 100X. (Please nobody quote that figure; I didn’t actually check any numbers.) Hardware, of course, is a different story.

    3. You’re right that flash helps Teradata in analytic processing vs. Hadoop for those analytic workloads Teradata can handle, but I might question whether it’s “dramatic”. Teradata’s done pretty well in the disk world.

    4. Flash potentially helps the HBase/Hadoop integration story.

    5. Tez potentially cuts Hadoop’s sequential I/O burden.

  11. Francois on March 26th, 2013 4:39 pm

    Fair point on the terminology, Curt!

    It seems like there is a limit to what Hadoop can address, however. Workloads that you might term “operational analytics” with hundreds or thousands of users, for example. I’m sure this is on the roadmap for Cloudera and Hortonworks, but Hadoop seems like it has limitations from a performance perspective (particularly with regards to concurrency), at least until the cost of memory drops enough. Please correct me if I’m wrong.

  12. Hortonworks, Hadoop, Stinger and Hive | DBMS 2 : DataBase Management System Services on August 6th, 2013 6:49 pm

    [...] a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 [...]

  13. Distinctions in SQL/Hadoop integration | DBMS 2 : DataBase Management System Services on February 9th, 2014 2:20 pm

    [...] Stinger is Hortonworks’ (and presumably also Apache’s) answer to Impala, but is more of a Hive upgrade than an outright replacement. In particular, Stinger’s answer to the new Impala engine is a port of Hive to the new engine Tez. [...]

  14. Spark on fire | DBMS 2 : DataBase Management System Services on April 30th, 2014 6:42 am

    [...] seems to agree, except perhaps for Hortonworks folks betting on the more limited and less mature Tez. Spark’s biggest technical advantages as a general data processing engine are [...]

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.