Parallelization
Analysis of issues in parallel computing, especially parallelized database management.
Aster Data 4.0 and the evolution of “advanced analytic(s) servers”
Since Linda and I are leaving on vacation in a few hours, Aster Data graciously gave me permission to morph its “12:01 am Monday, November 2” embargo into “late Friday night.”
Aster Data is officially announcing the 4.0 release of nCluster. There are two big pieces to this announcement:
- Aster is offering a slick vision for integrating big-database management and general analytic processing on the same MPP cluster, under the not-so-slick name “Data-Application Server.”
- Aster is also offering a sophisticated vision for workload management.
In addition, Aster has matured nCluster in various ways, for example cleaning up a performance problem with single-row updates.
Highlights of the Aster “Data-Application Server” story include: Read more
Categories: Aster Data, Cloud computing, Data warehousing, EAI, EII, ETL, ELT, ETLT, MapReduce, Market share and customer counts, Teradata, Theory and architecture, Workload management | 9 Comments |
Three big myths about MapReduce
Once again, I find myself writing and talking a lot about MapReduce. But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:
- MapReduce is something very new
- MapReduce involves strict adherence to the Map-Reduce programming paradigm
- MapReduce is a single technology
Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Google, Greenplum, Hadoop, Log analysis, MapReduce, Michael Stonebraker, Parallelization, Web analytics | 11 Comments |
Introduction to SenSage
I visited with SenSage on my two most recent trips to San Francisco. Both visits were, through no fault of SenSage’s, hasty. Still, I think I have enough of a handle on SenSage basics to be worth writing up.
General SenSage highlights include:
Technical introduction to Splunk
As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it’s probably OK to assume they’re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used for general schema-free analytics, but that’s in early days at best.
Splunk’s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include: Read more
Categories: Analytic technologies, Log analysis, MapReduce, Splunk, Structured documents, Text, Web analytics | 12 Comments |
MapReduce webinars and annotated slides
As previously noted, I’m giving a webinar twice today — i.e., Thursday, October 15 — at 10:00 am and 1:00 pm Eastern time.
- The subject is MapReduce.
- The sponsor is Aster Data.
- Part of the webinar will be an explanation of MapReduce basics, especially the conflict between theory/propaganda and reality.
- As you might guess from the identity of the sponsor, there will be an emphasis on how MapReduce and SQL play nicely with each other.
- You can register for the webinar on Aster’s site.
- (Edit) The webinar replay can be found here.
- I’ve already uploaded the slides from which I will present. (But not the ones from which Aster folks will be talking. I’ve seen those, and there’s some good technical crunch in some of them.) The “Notes” under the slides have a number of relevant URLs for follow-up, as well as a small number of explanatory comments (e.g., as to why one slide simply has a quote from and corresponding picture of Shakespeare).
Categories: Aster Data, MapReduce, Presentations | 6 Comments |
How 30+ enterprises are using Hadoop
MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:
- Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
- Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
- Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
- Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
- Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
- Application areas mentioned — and these overlap in some cases — include:
  - Log and/or clickstream analysis of various kinds
  - Marketing analytics
  - Machine learning and/or sophisticated data mining
  - Image processing
  - Processing of XML messages
  - Web crawling and/or text processing
  - General archiving, including of relational/tabular data, e.g. for compliance
I have some presentations coming up (all on October Thursdays)
On Thursday, October 15, at two different times (10:00 am and 1:00 pm Eastern time), I’ll be giving a webinar for Aster Data on MapReduce. The content is very much a work in progress, but it definitely will:
- Be overviewy in nature
- Emphasize SQL/MapReduce integration
Then, on the evening of Thursday, October 22, there’s something called the Boston Big Data Summit, in Waltham, where “Big Data” evidently is to be construed as anything from a few terabytes on up. (Things are smaller in the Northeast than in California …) It’s being put together by Amrith Kumar (who I don’t really know) and Bob Zurek (who everybody knows). This is the inaugural meeting. It seems I’m both giving the keynote and running the subsequent panel, one of whose participants will be Ellen Rubin. Read more
Categories: Analytic technologies, Aster Data, Cloud computing, MapReduce, Presentations | 4 Comments |
Oracle’s version of “actually, we’ve been doing MapReduce all along too”
In a recent blog post, Jean-Pierre Dijcks of Oracle makes the argument that Oracle has supported MapReduce all along, essentially because:
- You can do lots of procedural logic in the Oracle database, in a broad choice of languages, so in particular you can do Map steps.
- You can do lots of procedural logic in the Oracle database, in a broad choice of languages, so in particular you can do Reduce steps.
- Oracle offers a mechanism for parallelizing procedural logic.
Oracle doesn’t appear to have an explicit Map/Reduce programming interface, but I wouldn’t be surprised if Oracle Consulting cranked one out at some point to meet customer demand.
The post goes on to claim the usual in-database MapReduce benefit of avoiding the overhead of intermediate query result materialization. Presumably, then, Oracle’s quasi-MapReduce would also lack query fault-tolerance.
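To make the shape of that argument concrete, here is a minimal generic sketch of the three ingredients: a procedural Map step, a procedural Reduce step, and a mechanism for running them in parallel. It is plain Python, not Oracle's interface or anything taken from the Oracle post; the names `map_step`, `reduce_step`, and `run_map_reduce`, and the word-count workload, are hypothetical choices made purely for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_step(record):
    """Map: turn one input record into a list of (key, value) pairs."""
    return [(token.lower(), 1) for token in record.split()]


def reduce_step(key, values):
    """Reduce: collapse every value seen for one key into a single result."""
    return key, sum(values)


def run_map_reduce(records, workers=4):
    # Parallelize the Map step: each record is processed independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_step, records))

    # Shuffle: group intermediate values by key. Everything stays in memory,
    # i.e. the intermediate result is never written out (materialized).
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)

    # Parallelize the Reduce step: each key is reduced independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        reduced = pool.map(lambda item: reduce_step(*item), groups.items())
        return dict(reduced)


if __name__ == "__main__":
    logs = ["GET /home", "GET /about", "POST /home"]
    print(run_map_reduce(logs))
```

Note that the sketch keeps all intermediate pairs in memory between the Map and Reduce steps. That is the same tradeoff in miniature: skipping materialization saves overhead, but if a worker fails partway through there is nothing to resume from, so the whole job has to be rerun.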
Categories: Analytic technologies, MapReduce, Oracle, Parallelization | 1 Comment |
Jacek Becla on issues in scientific data management
Just as Martin Kersten did, Jacek Becla emailed a response to my post on issues in scientific data management. With his permission, I’ve lightly edited his email too, and am posting it below, with some interspersed comments of my own. Read more
Categories: Analytic technologies, Hadoop, MapReduce, Objectivity and Infinite Graph, Open source, Parallelization, SciDB, Scientific research | 4 Comments |
Martin Kersten on issues in scientific data management
Martin Kersten emailed a response to my post on issues in scientific data management. With his permission, I’ve lightly edited it, and am posting it below. Read more
Categories: Analytic technologies, Clustering, Parallelization, SciDB, Scientific research | 3 Comments |