Hadoop
Discussion of the open source MapReduce implementation Hadoop.
The Naming of the Foo
Let’s start from some reasonable premises. Read more
| Categories: Data models and architecture, Database diversity, Hadoop, MapReduce, Mark Logic, NoSQL, OLTP, Theory and architecture | 25 Comments |
TwinFin(i) – Netezza’s version of a parallel analytic platform
Much like Aster Data did in Aster 4.0 and now Aster 4.5, Netezza is announcing a general parallel big data analytic platform strategy. It is called Netezza TwinFin(i), it is a chargeable option for the Netezza TwinFin appliance, and many announced details are on the vague side, with Netezza promising more clarity at or before its Enzee Universe conference in June. At a high level, the Aster and Netezza approaches compare/contrast as follows: Read more
| Categories: Analytic technologies, Aster Data, Data warehouse appliances, Data warehousing, Hadoop, MapReduce, Netezza, SAS Institute, Teradata | 2 Comments |
Clearing up MapReduce confusion, yet again
I’m frustrated by a constant need, or at least urge, to correct myths and errors about MapReduce. Let’s try one more time: Read more
| Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Google, Hadoop, MapReduce, SenSage, Splunk | 7 Comments |
The legit part of the NoSQL idea
I’ve written some snarky things about the “NoSQL” concept – or at least the moniker. (Carl Olofson’s term “non-schematic databases” seems less bad.) Yet I’m actually favorable about the increasing use of SQL alternatives. Perhaps I should pull those thoughts together. Read more
| Categories: Data models and architecture, Database diversity, Hadoop, NoSQL, Theory and architecture | 20 Comments |
Three big myths about MapReduce
Once again, I find myself writing and talking a lot about MapReduce. But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:
- MapReduce is something very new
- MapReduce involves strict adherence to the Map-Reduce programming paradigm
- MapReduce is a single technology
| Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Google, Greenplum, Hadoop, Log analysis, MapReduce, Michael Stonebraker, Parallelization, Web analytics | 11 Comments |
How 30+ enterprises are using Hadoop
MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:
- Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
- Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
- Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
- Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
- Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
- Application areas mentioned — and these overlap in some cases — include:
  - Log and/or clickstream analysis of various kinds
  - Marketing analytics
  - Machine learning and/or sophisticated data mining
  - Image processing
  - Processing of XML messages
  - Web crawling and/or text processing
  - General archiving, including of relational/tabular data, e.g. for compliance
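For readers who haven’t seen the paradigm spelled out, here is a minimal sketch of the first application area above, log analysis, written in the map/shuffle/reduce style. It is plain single-process Python rather than Hadoop code, and it is not drawn from any of the customers discussed. The two-field log format (client address, then URL) and all function names are assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit (url, 1) for each well-formed log line; skip anything else.

    Assumes a hypothetical log format: "<client_ip> <url>".
    """
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all the counts for one URL into a single total."""
    return key, sum(values)


def count_hits(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["10.0.0.1 /home", "10.0.0.2 /cart", "10.0.0.1 /home", "garbage"]
    print(count_hits(sample))
```

In a real cluster the map and reduce steps would run in parallel across nodes, with the framework handling the grouping in between; the sketch only shows the shape of the computation.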
Jacek Becla on issues in scientific data management
Just as Martin Kersten did, Jacek Becla emailed a response to my post on issues in scientific data management. With his permission, I’ve lightly edited his email too, and am posting it below, with some interspersed comments of my own. Read more
| Categories: Analytic technologies, Hadoop, MapReduce, Open source, Parallelization, SciDB, Scientific research | 3 Comments |
Issues in scientific data management
In the opinion of the leaders of the XLDB and SciDB efforts, key requirements for scientific data management include:
- A data model based on multidimensional arrays, not sets of tuples
- A storage model based on versions and not update in place
- Built-in support for provenance (lineage), workflows, and uncertainty
- Scalability to 100s of petabytes and 1,000s of nodes with high degrees of tolerance to failures
- Support for “external” data objects so that data sets can be queried and manipulated without ever having to be loaded into the database
- Open source in order to foster a community of contributors and to ensure that data is never “locked up”, which is a critical requirement for scientists
However: Read more
| Categories: Analytic technologies, Data integration and middleware, Data warehousing, EAI, EII, ETL, ELT, ETLT, Facebook, Hadoop, Open source, SciDB, Scientific research, Specific users | 2 Comments |
MapReduce tidbits
I’ve never had children, and so have never had to supervise squabbling siblings, each accusing the other of selfishness and insufficient sharing. Perhaps the MapReduce vendors are a form of karmic payback. Be that as it may, my client Cloudera has organized Hadoop World on October 2 in New York, and my other client Aster Data is hosting a MapReduce-centric Big Data Summit the night before, at the same venue. Even if you don’t go, both conferences’ agenda pages offer a peek into what’s going on in MapReduce applications. I’m not going either, but even so I hope to post an overview of MapReduce uses after the conferences serve to publicize some of them.
Even better, I plan to hold a couple of webinars on MapReduce, the first at 10 am (blech) and 1 pm Eastern time on October 15. They’re sponsored by Aster Data, and so will have a strong SQL/MapReduce orientation.
In connection with its conference, Aster is introducing an nCluster-Hadoop connector — i.e., a loader from HDFS (Hadoop Distributed File System) implemented in SQL/MapReduce. In particular: Read more
| Categories: Aster Data, Cloudera, Data warehousing, Hadoop, MapReduce | 5 Comments |
Yahoo wants to do decapetabyte-scale data warehousing in Hadoop
My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo’s Hadoop effort — everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans to, within a year or so, get Hadoop to the point that it is managing 10s of petabytes of data for Yahoo, with reasonable data warehousing functionality.
Highlights of our visit included:
- There are dozens of people at Yahoo doing Hadoop development that will wind up getting open sourced. (Full-time or close to it.) In particular, everything Mark’s team does goes to open source.
- Yahoo is moving as much of its analytics to Hadoop as possible. Much of this is being moved away from Oracle and from Yahoo’s own Everest.
- A column store is being put on top of HDFS, based on Yahoo technology. Columns will be striped across nodes. Perhaps that’s why the effort is called Project Zebra.
- Mark believes that in a year Hadoop will be much further along in meeting traditional data warehousing requirements, in areas such as:
  - Metadata
  - SLAs/high availability/other workload management
  - Data retention policies
  - Security/privacy*
- Yahoo views the time-to-market benefits of Hadoop as being more important than TCO.
