MapReduce
Analysis of implementations of, and issues associated with, the parallel programming framework MapReduce.
The substance of Pentaho’s Hadoop strategy
Pentaho has been talking about a Hadoop-related strategy. Unfortunately, in support of its Hadoop efforts, Pentaho has been — quite insistently — saying things that don’t make a lot of sense to people who know anything about Hadoop.
That said, I think I found four sensible points in Pentaho’s Hadoop strategy, namely:
- If you use an ETL tool like Pentaho’s to move things in and out of HDFS, you may be able to orchestrate two more steps in the ETL process than if you used Hadoop’s native orchestration tools.
- A lot of what you want to do in MapReduce is things that can be graphically specified in an ETL tool like Pentaho’s. (That would include tokenization or regex.)
- If you have some really lightweight BI requirements (ad hoc, reporting, or whatever) against HDFS data, you might be content to do it straight against HDFS, rather than moving the data into a real DBMS. If so, BI tools like Pentaho’s might be useful.
- Somebody might want to use a screwy version of MapReduce, where by “screwy” I mean anything that isn’t Cloudera Enterprise, Aster Data SQL/MapReduce, or some other implementation/distribution with a lot of supporting tools. In that case, they might need all the tools they can get.
The first of those points is, in the grand scheme of things, pretty trivial.
The third one makes sense. While Hadoop’s Hive client means you could roll your own integration with your favorite BI tool in any case, having somebody certify it for you could be nice. So if Pentaho ships something that works before other vendors do, good on them. (Target date seems to be October.)
The fourth one is kind of sad.
But if there’s any shovel-meet-pony aspect to all this (or indeed a reason for writing this blog post), it would be the second point. If one understands data management, but is in the “Oh no! Hadoop wants me to PROGRAM!” crowd, then being able to specify one’s MapReduce graphically might be a really nice alternative to having to actually code it.
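For a sense of what "actually coding it" means for the tokenization/regex case in the second point, here is a minimal sketch in plain Python. It is not Hadoop or Pentaho code: the function names (`map_tokens`, `shuffle`, `reduce_counts`, `run_job`) and the token regex are hypothetical, and the in-memory `shuffle` merely stands in for what a real MapReduce framework does across a cluster.

```python
import re
from collections import defaultdict
from itertools import chain

# Illustrative token pattern (an assumption): runs of letters, digits, apostrophes.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+")


def map_tokens(line):
    """Map step: emit a (token, 1) pair for every regex match in one input line."""
    return [(token.lower(), 1) for token in TOKEN_PATTERN.findall(line)]


def shuffle(pairs):
    """Group values by key; a stand-in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one token."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_tokens(line) for line in lines)
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    sample = ["Hadoop wants me to PROGRAM", "Hadoop wants ETL"]
    print(run_job(sample))
```

The map and reduce steps here are just a regex and a sum, which is the kind of logic a graphical ETL tool could plausibly expose as configurable steps instead of code.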
| Categories: Analytic technologies, Business intelligence, Hadoop, MapReduce, Parallelization, Pentaho | 6 Comments |
Big Data is Watching You!
There’s a boom in large-scale analytics. The subjects of this analysis may be categorized as:
- People
- Financial trades
- Electronic networks
- Everything else
The most varied, interesting, and valuable of those four categories is the first one.
| Categories: Analytic technologies, Aster Data, Data warehousing, Investment research and trading, Log analysis, MapReduce, RDF and graphs, Specific users, Telecommunications, Web analytics | 3 Comments |
Some interesting links
In no particular order: Read more
| Categories: Business intelligence, EnterpriseDB and Postgres Plus, Fun stuff, Hadoop, Humor, In-memory DBMS, MapReduce, Memory-centric data management, Open source, Oracle, SAP AG | 1 Comment |
Cloudera Enterprise and Hadoop evolution
I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I’d say: Read more
Clarifying the state of MPP in-database SAS
I am routinely briefed way in advance of products’ introductions. For that reason and others, it can be hard for me to keep straight what’s been officially announced, introduced for test, introduced for general availability, vaguely planned for the indefinite future, and so on. Perhaps nothing has confused me more in that regard than the SAS Institute’s multi-year effort to get SAS integrated into various MPP DBMS, specifically Teradata, Netezza TwinFin(i), and Aster Data nCluster.
However, I chatted briefly Thursday with Michelle Wilkie, who is the SAS product manager overseeing all this (and also some other stuff, like SAS running on grids without being integrated into a DBMS). As best I understood, the story is: Read more
| Categories: Analytic technologies, Aster Data, Data warehouse appliances, MapReduce, Netezza, Parallelization, SAS Institute, Specific users, Teradata | 10 Comments |
Aster Data’s mapreduce.org site
Aster Data has started a site, mapreduce.org, which purports to compile “the best information about MapReduce.” At the moment, mapreduce.org highlights include:
- A feed of MapReduce-related posts from several blogs, including this one.
- A calendar of MapReduce-related events, not necessarily Aster-specific, integrated with a feed combining …
- … Aster MapReduce-related press releases and also …
- … not necessarily Aster-specific MapReduce-related press articles.
- Links to a lot of Aster Data MapReduce-related collateral. Some of that stuff is quite good.*
- A sycophantic introduction from Colin White praising the value of the mapreduce.org “independent forum.”
*I did a couple of MapReduce-related webinars for Aster late last year.
But seriously — Aster does a good job of writing clear and informative collateral.
| Categories: Analytic technologies, Aster Data, MapReduce | 3 Comments |
Introduction to Datameer
Elder care issues have flared up with a vengeance, so I’m not going to be blogging much for a while, and surely not at any length. That said, my first post about Datameer was never going to be very long, so let’s get right to it:
- Datameer offers a business intelligence and analytics stack that runs on any distribution of Hadoop.
- Datameer is still building a lot of features that it talks about, for target release in (I think) the fall.
- Datameer’s pride and joy is its user interface. Very laudably for a software start-up, Datameer claims to have spent considerable time with professional user interface designers.
- Datameer’s core user interface metaphor is formula definition via a spreadsheet.
- Datameer includes 124 functions one can use in these formulae, ranging from math stuff to text tokenization.
- Datameer does some straight BI, with 4 kinds of “visualization” headed for 20 kinds later. But if you want to do hard-core BI, use Datameer to dump data into an RDBMS and then use the BI tool of your choice. (Datameer’s messaging does tend to obscure or even contradict that point.)
- Rather, Datameer seems to be designed for the classic MapReduce use cases of ETL and heavy data crunching.
- Datameer’s messaging includes a bit about “Datameer is real-time, even though Hadoop is generally thought of as batch.” So far as I can tell, what that boils down to is …
- … Datameer will let you examine sample and/or partial query results before a full Hadoop run is over. Apparently, there are three different ways Datameer lets you do this:
- You can truly query against a sample of the data set.
- You can query against intermediate results, when only some stages of the Hadoop process have already been run.
- You can drill down into a “distributed index,” whatever the heck that means when Datameer says it.
- Datameer will let you import data from 15 or so different kinds of sources, SQL, NoSQL, and file system alike.
| Categories: Analytic technologies, Business intelligence, Datameer, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce | 2 Comments |
The Naming of the Foo
Let’s start from some reasonable premises. Read more
| Categories: Data models and architecture, Database diversity, Hadoop, MapReduce, MarkLogic, NoSQL, OLTP, Theory and architecture | 32 Comments |
TwinFin(i) – Netezza’s version of a parallel analytic platform
Much like Aster Data did in Aster 4.0 and now Aster 4.5, Netezza is announcing a general parallel big data analytic platform strategy. It is called Netezza TwinFin(i), it is a chargeable option for the Netezza TwinFin appliance, and many announced details are on the vague side, with Netezza promising more clarity at or before its Enzee Universe conference in June. At a high level, the Aster and Netezza approaches compare/contrast as follows: Read more
| Categories: Analytic technologies, Aster Data, Data warehouse appliances, Data warehousing, Hadoop, MapReduce, Netezza, SAS Institute, Teradata | 6 Comments |
More patent nonsense — Google MapReduce
Google recently received a patent for MapReduce. The first and most general claim is (formatting and emphasis mine): Read more
| Categories: Google, MapReduce, Parallelization | 11 Comments |
