Discussion of workload management technology, typically in analytic or mixed-workload DBMS.
Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:
- Yahoo is a big YARN user.
- There are other — paying — YARN users.
- YARN general availability is now targeted for well before the end of 2013.
Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:
With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.
This is similar to the approach of BDAS Spark:
Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.
although Tez won’t match Spark’s richer list of primitive operations.
More specifically, there will be six primitive Tez operations:
- HDFS (Hadoop Distributed File System) input and output.
- Sorting on input and output (I’m not sure why that’s two operations rather than one).
- Shuffling of input and output (ditto).
A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step compounds — you guessed it! — input sorting, input shuffling, and HDFS output.
I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:
- The requirement for materializing (onto disk) intermediate results that you don’t want to is gone. (Yay!)
- Hadoop jobs will step on each other’s toes less. Instead of Maps and Reduces from unrelated jobs getting interleaved, all the operations from a single job will by default be executed in one chunk. (Even so, I see no reason to expect early releases of Tez to do a great job on highly concurrent mixed workload management.)
- Added granularity brings opportunities for additional performance enhancements, for example in the area of sorting. (Arun loves sorts.)
|Categories: Databricks, Spark and BDAS, Hadoop, Hortonworks, MapReduce, Workload management, Yahoo||14 Comments|
I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:
- How big is your database? How big is your budget?
- How do you feel about appliances?
- How do you feel about the cloud?
- What are the size and shape of your workload?
- How fresh does the data need to be?
Let’s drill down. Read more
Vertica 6 was recently announced, and so it seemed like a good time to catch up on Vertica features. The main topics I want to address are:
- External tables and the associated new Hadoop connector.
- Online schema evolution.
- Workload management.
- I have some tidbits to add to my June, 2011 coverage of Vertica’s analytic functionality.
- I’ll stand for now on my previous coverage of Vertica’s database organization.
In general, the main themes of Vertica 6 appear to be:
- Enterprise/SaaS-friendliness, high uptime, and so on.
- Improved analytic usefulness.
Let’s do the analytic functionality first. Notes on that include:
- Vertica has extended its user-defined function/analytic procedure/whatever functionality to include user-defined load. (Same SDK, different specific classes.)
- One of the languages Vertica supports is R. But for now, parallel R is limited to “Of course, you can run the same functions and procedures on many nodes at once.”
- Based on community activity around bugs and so on, it seems there are users for Vertica’s JSON-based Twitter sentiment analysis plug-in.
I’ll also take this opportunity to expand on something I wrote about a few vendors — including Vertica — at the end of my post on approximate query results. When I probed how customers of Vertica and other RDBMS-based analytic platform vendors used vendor-proprietary advanced analytic SQL and other analytic capabilities, answers included: Read more
|Categories: Columnar database management, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Investment research and trading, Predictive modeling and advanced analytics, SQL/Hadoop integration, Vertica Systems, Workload management||1 Comment|
A lot of confusion seems to have built around the facts:
- Hadoop MapReduce is being opened up into something called MapReduce 2 (MRv2).
- Something called YARN (Yet Another Resource Negotiator) is involved.
- One purpose of the whole thing is to make MapReduce not be required for Hadoop.
- MPI (Message Passing Interface) was mentioned as a paradigmatic example of a MapReduce alternative, yet the MPI/YARN/Hadoop effort is somehow troubled.
- Cloudera shipped YARN in June, yet simultaneously warned people away from actually using it.
Here’s my best effort to make sense of all that, helped by a number of conversations with various Hadoop companies, but most importantly a chat Friday with Arun Murthy and other Hortonworks folks.
- YARN, as an aspect of Hadoop, has two major kinds of benefits:
- The ability to use programming frameworks other than MapReduce.
- Scalability, no matter what programming framework you use.
- The YARN availability story goes:
- YARN is in alpha.
- YARN is expected to be in production at year-end, give or take.
- Cloudera made the marketing decision to include YARN in its June Hadoop distribution release anyway, but advised that it was for experimentation rather than production.
- Hortonworks, in its own June release, only shipped code it advised putting into production.
- My take on the YARN/MPI story goes something like this:
- Numerous people have told me of YARN/MPI delays.
- One person suggested that Greenplum is taking the lead in YARN/MPI integration, but has gotten slow and reclusive, apparently due to some big company-itis.
- I find that credible because of the Greenplum/SAS/MPI connection.
- If I understood Arun correctly, the latency story on Hadoop MapReduce is approximately:
- Arun says that Hadoop’s reputation for taking 10s of seconds to start a Hadoop job is old news. It takes a low single-digit number of seconds.
- However, starting all that Java does take 100s of milliseconds at best — 200 milliseconds in an ideal case, 500 milliseconds more realistically, and that’s just on a single server.
- Thus, if you want human real-time interaction, Hadoop MapReduce is not and likely never will be the way to go. Getting Hadoop MapReduce latencies under a few seconds is likely to be more trouble than it’s worth — because of MapReduce, not because of Hadoop.
- In particular — instead of incurring the overhead of starting processes up, Arun thinks low-latency needs should be met in a different way, namely by serving them from already-running processes. The examples he kept mentioning were the event processing projects Storm (out of Twitter, via an acquisition) and S4 (out of Yahoo).
Shortly before Tuesday’s launch of DB2 10, IBM’s Conor O’Mahony checked in for a relatively non-technical briefing.* More precisely, this is about DB2 for “distributed” systems, aka LUW (Linux/Unix/Windows); some of the features have already been in the mainframe version of DB2 for a while. IBM is graciously permitting me to post the associated DB2 10 announcement slide deck.
*I hope any errors in interpretation are minor.
Major aspects of DB2 10 include new or improved capabilities in the areas of:
- Analytic query performance.
- Data ingest.
- Multi-temperature data management.
- Workload management.
- Graph management/relationship analytics.
- Time-travel, bitemporal features, and bitemporal time-travel.
Of course, there are various other enhancements too, including to security (fine-grained access control), Oracle compatibility, and DB2 pureScale. Everything except the pureScale part is also reflected in IBM InfoSphere Warehouse, which is a near-superset of DB2.*
*Also, the data ingest part isn’t in base DB2.
|Categories: Data warehousing, Database compression, IBM and DB2, RDF and graphs, Solid-state memory, Workload management||6 Comments|
I’d like to survey a few related ideas:
- Enterprises should each have a variety of different analytic data stores.
- Vendors — especially but not only IBM and Teradata — are acknowledging and marketing around the point that enterprises should each have a number of different analytic data stores.
- In addition to having multiple analytic data management technology stacks, it is also desirable to have an agile way to spin out multiple virtual or physical relational data marts using a single RDBMS. Vendors are addressing that need.
- Some observers think that the real essence of analytic data management will be in data integration, not the actual data management.
Here goes. Read more
|Categories: Data warehousing, Database diversity, EAI, EII, ETL, ELT, ETLT, Exadata, Greenplum, Hadoop, Hortonworks, IBM and DB2, Informatica, Netezza, Oracle, Sybase, Teradata, Workload management||11 Comments|
Talking with my clients at SAND can be confusing. That said:
- I need to revise my figures for SAND’s customer count way downward.
- SAND finally has a reasonably clear positioning.
- SAND’s product actually seems to have a lot of features.
A few months ago, I wrote:
SAND Technology reported >600 total customers, including >100 direct.
Upon talking with the company, I need to revise that figure downward, from > 600 to 15.
I last wrote about Exasol in 2008. After talking with the team Friday, I’m fixing that now. The general theme was as you’d expect: Since last we talked, Exasol has added some new management, put some effort into sales and marketing, got some customers, kept enhancing the product and so on.
Top-level points included:
- Exasol’s technical philosophy is substantially the same as before, albeit not with as extreme a focus on fitting everything in RAM.
- Exasol believes its flagship DBMS EXASolution has great performance on a load-and-go basis.
- Exasol has 25 EXASolution customers, all in Germany.*
- 5 of those are “cloud” customers, at hosting providers engaged by Exasol.
- EXASolution database sizes now range from the low 100s of gigabytes up to 30 terabytes.
- Pretty much the whole company is in Nuremberg.
|Categories: Benchmarks and POCs, Columnar database management, Data warehousing, Database compression, Exasol, Market share and customer counts, Pricing, Software as a Service (SaaS), Specific users, Sybase, Workload management||1 Comment|
I’ve talked with my clients at Hadapt a couple of times recently. News highlights include:
- The Hadapt 1.0 product is going “Early Access” today.
- General availability of Hadapt 1.0 is targeted for an officially unspecified time frame, but it’s soon.
- Hadapt raised a nice round of venture capital.
- Hadapt added Sharmila Mulligan to the board.
- Dave Kellogg is in the picture too, albeit not as involved as Sharmila.
- Hadapt has moved the company to Cambridge, which is preferable to Yale environs for obvious reasons. (First location = space they’re borrowing from their investors at Bessemer.)
- Headcount is in the low teens, with a target of doubling fast.
|Categories: Hadapt, Hadoop, MapReduce, PostgreSQL, SQL/Hadoop integration, Theory and architecture, Workload management||6 Comments|
It’s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic’s new Hadoop connector.
Most of what’s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you:
- Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API for streaming documents into or out of a MarkLogic database.
- Hadoop can make requests to MarkLogic in MarkLogic’s normal mode of operation, namely to address any node in the MarkLogic cluster, which then serves as a “head” node for the duration of that particular request. But alternatively, Hadoop can use a long-standing MarkLogic option to circumvent the whole DBMS cluster and only talk to one specific MarkLogic node.
Otherwise, the whole thing is just what you would think:
- Hadoop can read from and write to MarkLogic, in parallel at both ends.
- If Hadoop is just writing to MarkLogic, there’s a good chance the process is properly called “ETL.”
- If Hadoop is reading a lot from MarkLogic, there’s a good chance the process is properly called “batch analytics.”
MarkLogic said that it wrote this Hadoop connector itself.
|Categories: Clustering, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, MarkLogic, Parallelization, Workload management||2 Comments|