Analysis of implementations of and issues associated with the parallel programming framework MapReduce.
Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:
- Rather than persisting data to disk after every step, as MapReduce does, Spark instead writes to something called RDDs (Resilient Distributed Datasets), which can live in memory.
- Rather than being restricted to maps and reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order. All the primitives are parallel with respect to the RDDs.
- Shark is a lot like Hive, only rewritten (in significant parts) and running over Spark.
- There’s an approach to launching tasks quickly — ~5 milliseconds or so — that I unfortunately didn’t grasp.
The key concept here seems to be the RDD. Any one RDD:
- Is a collection of Java objects, which should have the same or similar structure.
- Can be partitioned/distributed and shuffled/redistributed across the cluster.
- Doesn’t have to be entirely in memory at once.
Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:
- At the moment, RDDs expire at the end of a job.
- This restriction will be lifted in a future release.
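To make the RDD idea a bit more concrete, here is a minimal sketch in plain Python. It is not Spark's API; `ToyRDD` and all of its method names are made up for illustration. It only shows the shape of the idea: a dataset split into partitions that stay in memory, with map, filter, group-by, and reduce chained in whatever order you like, each applied partition by partition.

```python
from collections import defaultdict
from functools import reduce as fold


class ToyRDD:
    """Hypothetical stand-in for an RDD: a list of in-memory partitions."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def from_records(cls, records, num_partitions=2):
        # Distribute the records round-robin across the partitions.
        parts = [[] for _ in range(num_partitions)]
        for i, rec in enumerate(records):
            parts[i % num_partitions].append(rec)
        return cls(parts)

    def map(self, fn):
        # Applied per partition; in a real cluster each partition runs in parallel.
        return ToyRDD([[fn(x) for x in p] for p in self.partitions])

    def filter(self, pred):
        return ToyRDD([[x for x in p if pred(x)] for p in self.partitions])

    def group_by_key(self):
        # "Shuffle": route every (key, value) pair to the partition owning that key.
        n = len(self.partitions)
        buckets = [defaultdict(list) for _ in range(n)]
        for p in self.partitions:
            for key, value in p:
                buckets[hash(key) % n][key].append(value)
        return ToyRDD([list(b.items()) for b in buckets])

    def reduce(self, fn):
        # Reduce inside each non-empty partition, then combine the partial results.
        partials = [fold(fn, p) for p in self.partitions if p]
        return fold(fn, partials)

    def collect(self):
        return [x for p in self.partitions for x in p]


events = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
rdd = ToyRDD.from_records(events, num_partitions=2)

# Chain several primitives; nothing is written to disk between the steps.
totals = (
    rdd.filter(lambda kv: kv[1] > 1)
    .group_by_key()
    .map(lambda kv: (kv[0], sum(kv[1])))
)
print(dict(totals.collect()))
```

Each operation returns a new `ToyRDD` instead of modifying the old one, which is what makes the chaining work. Fault tolerance, spilling to disk, and actual parallel execution are all left out.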
|Categories: Data models and architecture, Databricks, Spark and BDAS, Hadoop, MapReduce, Memory-centric data management, Open source, Parallelization, SQL/Hadoop integration||9 Comments|
UC Berkeley’s AMPLab is working on a software stack that:
- Is meant (among other goals) to improve upon Hadoop …
- … but also to interoperate with it, and which in fact …
- … uses significant parts of Hadoop.
- Seems to have the overall name BDAS (Berkeley Data Analytics System).
The whole thing has $30 million in projected funding (half government, half industry) and a 6-year plan (which they’re 2 years into).
Specific projects of note in all that include:
- Mesos, a cluster manager. I don’t know much about Mesos, but it seems to be in production use, most notably at Twitter supporting Storm.
- Spark, a replacement for MapReduce and the associated execution stack.
- Shark, a replacement for Hive.
|Categories: ClearStory Data, Databricks, Spark and BDAS, Hadoop, MapReduce, Parallelization, Specific users, SQL/Hadoop integration||10 Comments|
I’ve been known to gripe that covering big companies such as Microsoft is hard. Still, Doug Leland of Microsoft’s SQL Server team checked in for phone calls in August and again today, and I think I got enough to be worth writing about, albeit at a survey level only.
Subjects I’ll mention include:
- Parallel Data Warehouse
- Columnar data management
- In-memory data management (Hekaton)
One topic I can’t yet comment about is MOLAP/ROLAP, which is a pity; if anybody can refute my claim that ROLAP trumps MOLAP, it’s either Microsoft or Oracle.
Microsoft’s slides mentioned Yahoo refining a 6 petabyte Hadoop cluster into a 24 terabyte SQL Server “cube”, which was surprising in light of Yahoo’s history as an Oracle reference.
|Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Hadoop, Hortonworks, In-memory DBMS, MapReduce, Market share and customer counts, Microsoft and SQL*Server, Oracle, Yahoo||10 Comments|
My clients at Couchbase checked in.
- After multiple delays, Couchbase 2.0 is well into beta, with general availability being delayed by the holiday season as much as anything else.
- Couchbase (the company) now has >350 subscription customers, almost all for Couchbase (the product) — which is to say for what was known as Membase, which is basically a persistent version of Memcached.
- There also are many users of open source Couchbase, most famously LinkedIn.
- Orbitz is a much-mentioned flagship paying Couchbase customer.
- Couchbase customers mainly seem to be replacing a caching layer, Memcached or otherwise.
- Couchbase headcount is just under 100.
The big changes in Couchbase 2.0 versus the previous (1.8.x) version are:
- JSON storage, including secondary indexes.
- Multi-data-center replication.
- A back-end change from SQLite to a heavily forked version of CouchDB, called Couchstore.
Couchbase 2.0 is upwards-compatible with prior versions of Couchbase (and hence with Memcached), but not with CouchDB.
Technology notes on Couchbase 2.0 include: Read more
|Categories: Basho and Riak, Cache, Cassandra, Clustering, Couchbase, MapReduce, Market share and customer counts, MongoDB and 10gen, NoSQL, Open source, Structured documents||4 Comments|
My clients at Cloudant, Couchbase, and 10gen/MongoDB (Edit: See Alex Popescu’s comment below) all boast the feature incremental MapReduce. (And they’re not the only ones.) So I feel like making a quick post about it. For starters, I’ll quote myself about Cloudant:
The essence of Cloudant’s incremental MapReduce seems to be that data is selected only if it’s been updated since the last run. Obviously, this only works for MapReduce algorithms whose eventual output can be run on different subsets of the target data set, then aggregated in a simple way.
These implementations of incremental MapReduce are hacked together by teams vastly smaller than those working on Hadoop, and surely fall short of Hadoop in many areas such as performance, fault-tolerance, and language support. That’s a given. Still, if the jobs are short and simple, those deficiencies may be tolerable.
A StackOverflow thread about MongoDB’s version of incremental MapReduce highlights some of the implementation challenges.
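The basic idea can be sketched in a few lines of Python. This is not how Cloudant, Couchbase, or MongoDB actually implement it; the class, the `seq` update counter, and the word-count job are all assumptions made for illustration. The sketch re-maps only documents updated since the last run, and keeps the reduced totals current by backing out each changed document's old contribution.

```python
from collections import Counter


def map_doc(doc):
    """Map step: emit per-document word counts."""
    return Counter(doc["text"].split())


class IncrementalWordCount:
    """Hypothetical incremental view: re-maps only documents changed since the last run."""

    def __init__(self):
        self.last_run_seq = 0    # highest update sequence number already processed
        self.per_doc = {}        # doc id -> mapped output from the previous run
        self.totals = Counter()  # reduced result, maintained across runs

    def run(self, docs):
        """Process changed documents only; return how many were re-mapped."""
        changed = [d for d in docs if d["seq"] > self.last_run_seq]
        for doc in changed:
            # Back out the document's old contribution, then add the new one.
            self.totals.subtract(self.per_doc.get(doc["id"], Counter()))
            mapped = map_doc(doc)
            self.totals.update(mapped)
            self.per_doc[doc["id"]] = mapped
        if changed:
            self.last_run_seq = max(d["seq"] for d in changed)
        # Unary plus drops words whose count has fallen to zero.
        self.totals = +self.totals
        return len(changed)


docs = [
    {"id": 1, "seq": 1, "text": "a b a"},
    {"id": 2, "seq": 2, "text": "b c"},
]
view = IncrementalWordCount()
view.run(docs)  # first run maps both documents
docs[0] = {"id": 1, "seq": 3, "text": "c c"}
view.run(docs)  # second run re-maps only document 1
print(view.totals)
```

This works because the reduce step is a sum, which can be applied to subsets and then combined, and can also be undone by subtraction. A reduce such as a median has no comparably simple way to merge or back out partial results, which is the limitation in question.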
But all practicality aside, let’s return to the point that incremental MapReduce only works for some kinds of MapReduce-based algorithms, and consider how much of a limitation that really is. Looking at the Map steps sheds a little light: Read more
|Categories: Cloudant, Couchbase, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, MongoDB and 10gen, RDF and graphs||1 Comment|
What I wrote before about Cloudera Impala was quite incomplete. After a followup call, I now feel I have a better handle on the whole thing.
First, some basics:
- Impala is open source code, developed to date entirely by Cloudera people, which adds analytic DBMS capabilities to Hadoop as an alternative to Hive.
- Impala is in public beta, and is targeted for general availability Q1 2013 or so.
- Cloudera plans to get paid for Impala by providing support, and by offering Impala management through its proprietary Cloudera Manager.
- Impala has been under development for about 2 years. A team of 7 or so developers has been mainly in place for over a year. Furthermore, …
- … notwithstanding that it’s best viewed as a Hive alternative, Impala actually reuses a lot of Hive.
The general technical idea of Impala is:
- It’s an additional daemon that runs on each of your Hadoop nodes.
- Thus, Impala is not subject to Hadoop MapReduce’s latency in starting up Java processes or in storing intermediate result sets to disk.
- Impala operates as a distributed parallel analytic DBMS.
- Impala works with a variety of Hadoop storage options, each with its own implications for latency or performance.
|Categories: Cloudera, Data models and architecture, Data warehousing, Hadoop, HBase, MapReduce, Open source, Predictive modeling and advanced analytics, SQL/Hadoop integration||11 Comments|
I chatted with Todd Papaioannou about his new company Continuuity. Todd is as handy at combining buzzwords as he is at concatenating vowels, and so Continuuity — with two “U”s — is making a big data fabric platform as a service with REST APIs that runs over Hadoop and HBase in the private or public clouds. I found the whole thing confusing, in that:
- I recoil against buzzwords. In particular …
- … I pay as little attention to distinctions among PaaS/IaaS/WaaS — Platform/Infrastructure/Whatever as a Service — as I can.
- The Continuuity story sounds Heroku-like, but Todd doesn’t want Continuuity compared to Heroku.
- Todd does want Continuuity discussed in terms of the application server category, but:
- It is hard to discuss app servers without segueing quickly amongst development, deployment, and data connectivity, and Continuuity is no exception to that rule.
- There is doubt as to whether using app servers makes any sense.
But all confusion aside, there are some interesting aspects to Continuuity. Read more
|Categories: Application servers, Cloud computing, Hadoop, HBase, MapReduce, Parallelization, Predictive modeling and advanced analytics, Software as a Service (SaaS)||6 Comments|
Edit: There is now a follow-up post on Cloudera Impala with substantially more detail.
In my world it’s possible to have a hasty 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like Hive, Impala turns Hadoop into a basic analytic RDBMS, with similar SQL/Hadoop integration benefits to those of Hadapt. In particular:
- Impala is Hive-compatible in query language (HQL, which is a whole lot like SQL), metadata, JDBC/ODBC drivers, etc.
- Unlike Hive, Impala does not work through Hadoop MapReduce.
- Unlike Hadoop MapReduce and hence Hive, Impala does not persist intermediate results to disk. This is good for performance, but on extremely long-running queries it increases the risk you’ll have a node failure and have to restart the query from scratch.
- Impala in its first version is missing some Hive syntax, notably in support for UDFs (User-Defined Functions).
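The tradeoff around persisting intermediate results can be illustrated with a toy Python pipeline. Nothing here reflects Impala's or Hadoop's actual code; `run_pipeline`, `SimulatedNodeFailure`, and the three stages are hypothetical. A "query" is a list of stages, one stage fails once, and the sketch counts how many stage executions are needed with and without saved intermediate outputs.

```python
class SimulatedNodeFailure(Exception):
    """Raised once to imitate a node dying mid-query."""


def run_pipeline(stages, data, fail_at=None, checkpoint=False):
    """Run stages in order; return (result, number of stage executions).

    fail_at: index of a stage that fails on its first attempt.
    checkpoint: if True, keep each stage's output so a retry can resume from it.
    """
    saved = {}  # stage index -> output; stands in for results persisted to disk
    executions = 0
    failed_once = False
    while True:
        # Resume after the last saved stage, or start over from the raw input.
        start = max(saved) + 1 if saved else 0
        current = saved[start - 1] if saved else data
        try:
            for i in range(start, len(stages)):
                if i == fail_at and not failed_once:
                    failed_once = True
                    raise SimulatedNodeFailure(f"stage {i} lost its node")
                current = stages[i](current)
                executions += 1
                if checkpoint:
                    saved[i] = current
            return current, executions
        except SimulatedNodeFailure:
            continue  # retry the query


stages = [
    lambda rows: [r * 2 for r in rows],     # "scan and project"
    lambda rows: [r for r in rows if r > 2],  # "filter"
    sum,                                    # "aggregate"
]

print(run_pipeline(stages, [1, 2, 3], fail_at=2, checkpoint=False))
print(run_pipeline(stages, [1, 2, 3], fail_at=2, checkpoint=True))
```

Without saved outputs, the retry repeats the first two stages; with them, it resumes at the failed stage. The sketch ignores the cost of writing the checkpoints, which is the price paid on every query that does not fail.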
Beyond that: Read more
|Categories: Cloudera, Columnar database management, Database compression, Hadapt, Hadoop, MapReduce, Open source, SQL/Hadoop integration||6 Comments|
Stuart Frost, of DATAllegro fame, has started a small family of companies, and they’ve become my clients sort of as a group. The first one that I’m choosing to write about is Cirro, for which the basics are:
- Cirro does data federation for analytics.
- Cirro has 10 full-time people plus 4 part-timers.
- Cirro launched its product in June.
- Cirro doesn’t have customers yet, but hopes to fix that soon.
Data federation stories are often hard to understand because, until you drill down, they implausibly sound as if they do anything for everybody. That said, it’s reasonable to think of Cirro as a layer between Hadoop and your BI tool that:
- Helps with data transformations.
- Helps join Hadoop data to relational tables, even if the joins are large ones.
In both cases, Cirro is calling on your data management software for help, RDBMS or Hadoop as the case may be.
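As a generic illustration of that kind of federation (not Cirro's actual design, which I haven't described yet), here is a small Python sketch. An in-memory SQLite database stands in for the RDBMS, a plain Python function stands in for a job run on Hadoop, and every table, column, and function name is made up. Each system does its own heavy lifting, and the federation layer only joins the two reduced results.

```python
import sqlite3
from collections import Counter


def hadoop_side_clicks_per_customer(log_lines):
    """Stand-in for work pushed down to Hadoop: aggregate raw logs by customer id."""
    counts = Counter()
    for line in log_lines:
        customer_id, _url = line.split(",", 1)
        counts[int(customer_id)] += 1
    return counts


def rdbms_side_customers(conn, customer_ids):
    """Stand-in for work pushed down to the RDBMS: fetch only the rows needed."""
    ids = sorted(customer_ids)
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)
    cursor = conn.execute(
        f"SELECT id, region FROM customers WHERE id IN ({placeholders})", ids
    )
    return dict(cursor.fetchall())


def federated_clicks_by_region(conn, log_lines):
    """Federation layer: join the two pre-reduced results and roll up by region."""
    clicks = hadoop_side_clicks_per_customer(log_lines)
    regions = rdbms_side_customers(conn, clicks.keys())
    by_region = Counter()
    for customer_id, n in clicks.items():
        # Inner join: customers unknown to the RDBMS are dropped.
        if customer_id in regions:
            by_region[regions[customer_id]] += n
    return dict(by_region)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "EU"), (2, "US"), (3, "EU")]
)
logs = ["1,/home", "2,/home", "1,/pricing", "3,/docs", "9,/home"]
print(federated_clicks_by_region(conn, logs))
```

The sketch sidesteps the hard case, in which both sides of the join remain large after aggregation.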
More precisely, Cirro’s approach is: Read more
|Categories: Business intelligence, Cirro, Data integration and middleware, Hadoop, MapReduce, Tableau Software||4 Comments|
My clients at Hadapt are coming out with a Version 2 to be available in Q1 2013, and perhaps slipstreaming some of the features before then. At that point, it will be reasonable to regard Hadapt as offering:
- A very tight integration between an RDBMS-based analytic platform and Hadoop …
- … that is decidedly immature as an analytic RDBMS …
- … but which strongly improves the SQL capabilities of Hadoop (vs., say, the alternative of using Hive).
Solr is in the mix as well.
Hadapt+Hadoop is positioned much more as “better than Hadoop” than “a better scale-out RDBMS”, and rightly so, due to its limitations when viewed strictly from an analytic RDBMS standpoint. I.e., Hadapt is meant for enterprises that want to do several of the following:
- Dump multi-structured data into Hadoop.
- Refine or just move some of it into an RDBMS.
- Bring in data from other RDBMS.
- Process all of the above via Hadoop MapReduce.
- Process all of the above via SQL.
- Use full-text indexes on the data.
Hadapt has 6 or so production customers, a dozen or so more coming online soon, 35 or so employees (mainly in Cambridge or Poland), reasonable amounts of venture capital, and the involvement of a variety of industry luminaries. Hadapt’s biggest installation seems to have 10s of terabytes of relational data and 100s of TBs of multi-structured; Hadapt is very confident in its ability to scale an order of magnitude beyond that with the Version 2 product, and reasonably confident it could go even further.
At the highest level, Hadapt works like this: Read more
|Categories: Analytic technologies, Cloudera, Columnar database management, Data models and architecture, Data warehousing, Hadapt, Hadoop, MapR, MapReduce, Market share and customer counts, SQL/Hadoop integration, Text||4 Comments|