The substance of Pentaho’s Hadoop strategy
Pentaho has been talking about a Hadoop-related strategy. Unfortunately, in support of its Hadoop efforts, Pentaho has been — quite insistently — saying things that don’t make a lot of sense to people who know anything about Hadoop.
That said, I think I found four sensible points in Pentaho’s Hadoop strategy, namely:
- If you use an ETL tool like Pentaho’s to move things in and out of HDFS, you may be able to orchestrate two more steps in the ETL process than if you used Hadoop’s native orchestration tools.
- A lot of what you want to do in MapReduce consists of operations that can be graphically specified in an ETL tool like Pentaho’s. (That would include tokenization or regex matching.)
- If you have some really lightweight BI requirements (ad hoc, reporting, or whatever) against HDFS data, you might be content to do it straight against HDFS, rather than moving the data into a real DBMS. If so, BI tools like Pentaho’s might be useful.
- Somebody might want to use a screwy version of MapReduce, where by “screwy” I mean anything that isn’t Cloudera Enterprise, Aster Data SQL/MapReduce, or some other implementation/distribution with a lot of supporting tools. In that case, they might need all the tools they can get.
The first of those points is, in the grand scheme of things, pretty trivial.
The third one makes sense. While Hadoop’s Hive client means you could roll your own integration with your favorite BI tool in any case, having a vendor certify that integration for you could be nice. So if Pentaho ships something that works before other vendors do, good on them. (Target date seems to be October.)
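For what it’s worth, the roll-your-own route can be pretty short. Here’s a minimal sketch of an ad hoc query against Hive from Python; the host, port, table, and column names are made up for illustration, and PyHive is just one of several clients that can talk to a Hive server. This is emphatically not Pentaho’s integration.

```python
# Minimal sketch of rolling your own BI-ish access to Hive from Python.
# Connection details, table, and column names are illustrative assumptions.
from pyhive import hive  # one of several Hive client libraries

conn = hive.Connection(host="hive-gateway.example.com", port=10000)
cursor = conn.cursor()

# A lightweight ad hoc report: event counts by day over HDFS-resident data.
cursor.execute(
    "SELECT event_date, COUNT(*) AS events "
    "FROM weblogs GROUP BY event_date ORDER BY event_date"
)
for event_date, events in cursor.fetchall():
    print(event_date, events)
```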
The fourth one is kind of sad.
But if there’s any shovel-meet-pony aspect to all this — or indeed a reason for writing this blog post — it would be the second point. If one understands data management, but is in the “Oh no! Hadoop wants me to PROGRAM!” crowd, then being able to specify one’s MapReduce jobs graphically might be a really nice alternative to actually coding them.
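To illustrate what that second point covers, here’s a minimal sketch of the kind of MapReduce logic in question: a Hadoop Streaming-style mapper that tokenizes its input with a regex for a word count. The regex is an assumption for illustration, and this is not anything Pentaho generates; the point is just that such logic is mechanical enough to be drawn as boxes and arrows in an ETL tool rather than hand-coded.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: regex tokenization for a word count.
# Illustrative sketch only, not Pentaho's (or anyone else's) implementation.
import re
import sys

TOKEN = re.compile(r"[A-Za-z']+")  # assumed tokenization rule

def main():
    for line in sys.stdin:
        for word in TOKEN.findall(line.lower()):
            # Hadoop Streaming convention: tab-separated key/value pairs.
            print(f"{word}\t1")

if __name__ == "__main__":
    main()
```

A matching reducer would just sum the counts per key; the whole job is the sort of tokenize/filter/aggregate pipeline an ETL tool can express graphically.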
I’m collecting data points on NoSQL and HVSP adoption
I was asked to do a magazine article on NoSQL, where by “NoSQL” is meant “whatever they talk about at NoSQL conferences.” By now the number of publications planning to run the article is up to two, the deadline is next week, and, crucially, it has been agreed that I may talk about HVSP in general, NoSQL and SQL alike.
It also is understood that, realistically, I can’t be expected to know and mention the very latest news for all the many products in these categories. Even so, I think this would be a fine time to check just where NoSQL and HVSP adoption stand. Here is most of what I know, or links to same; it would be great if you guys would contribute additional data in the comment thread.
In the NoSQL area: Read more
Finally confirmed: Membase has a reasonable product roadmap
On my recent trip to California, neither I nor my clients at Northscale covered ourselves in meeting-arranging glory. Still, from the rushed 30-minute meeting we did wind up having, I finally came away feeling good about Membase’s product direction.
To review, Membase is a reasonably elastic persistent data store, sporting the memcached API, making memcached/Membase an attractive alternative to memcached/sharded MySQL. As of now, Membase is a pure key-value store.
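As a reminder of what “sporting the memcached API” means in practice, here’s a minimal sketch using the python-memcached client; the server address and keys are assumptions for illustration, and the same calls would work whether the back end is plain memcached or Membase.

```python
# Minimal sketch of the memcached API that Membase speaks.
# Server address and keys are illustrative assumptions.
import memcache  # python-memcached; any memcached-protocol client would do

mc = memcache.Client(["127.0.0.1:11211"])

# Pure key-value operations: set, get, delete. No secondary indexes, no queries.
mc.set("session:42", {"user_id": 42, "cart": ["sku-1", "sku-2"]})
session = mc.get("session:42")
mc.delete("session:42")
```

The appeal versus memcached/sharded MySQL is that the same simple protocol is backed by a persistent, rebalanceable store.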
Northscale defends pure key-value stores by arguing, in effect: Read more
DB2 workload management
DB2 has added a lot of workload management features in recent releases. So when we talked Tuesday afternoon, Tim Vincent and I didn’t bother going through every one. Even so, we covered some interesting subjects in the area of DB2 workload management, including: Read more
More on temp space, compression, and “random” I/O
My PhD was in a probability-related area of mathematics (game theory), so I tend to squirm when something is described as “random” that clearly is not. That said, a comment by Shilpa Lawande on our recent flash/temp space discussion suggests the following way of framing a key point:
- You really, really want to have multiple data streams coming out of temp space, as close to simultaneously as possible.
- The storage performance characteristics of such a workload are more reminiscent of “random” than “sequential” I/O.
If everybody else is cool with it too, I can live with that. 🙂
Meanwhile, I talked again with Tim Vincent of IBM this afternoon. Tim endorsed the temp space/Flash fit, but with a different emphasis, which upon review I find I don’t really understand. The idea is:
- Analytic DBMS processing generally stresses reads over writes.
- Temp space is an exception — read and write use of temp space is pretty balanced. (You spool data out once, you read it back in once, and that’s the end of that; next time it will be overwritten.)
My problem with that is: flash typically has lower write IOPS (I/O operations per second) than read IOPS, so being (relatively) write-intensive would, to a first approximation, seem if anything to disfavor a workload for flash.
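To put rough numbers on that objection, here’s a back-of-envelope calculation; the IOPS figures are made up for illustration, and the only point is that a balanced read/write mix is gated by the slower write side.

```python
# Back-of-envelope effective IOPS for a mixed read/write workload.
# The device figures below are illustrative assumptions, not benchmarks.
READ_IOPS = 30_000    # assumed flash read IOPS
WRITE_IOPS = 7_500    # assumed flash write IOPS (typically much lower)

def effective_iops(read_fraction: float) -> float:
    """Weighted harmonic mean of read and write rates for a given mix."""
    write_fraction = 1.0 - read_fraction
    return 1.0 / (read_fraction / READ_IOPS + write_fraction / WRITE_IOPS)

print(round(effective_iops(0.9)))   # read-heavy scanning: ~23,000 IOPS
print(round(effective_iops(0.5)))   # balanced temp-space traffic: ~12,000 IOPS
```

Both figures still dwarf what spinning disk delivers, for whatever that’s worth.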
On the plus side, I was reminded of something I should have noted when I wrote about DB2 compression before:
Much like Vertica, DB2 operates on compressed data all the way through, including in temp space.
Vertica’s innovative architecture for flash, plus more about temp space than you perhaps wanted to know
Vertica is announcing:
- Technology it already has released*, but has not published any reference architectures for.
- A Barney partnership.**
In other words, Vertica has succumbed to the common delusion that it’s a good idea to put out half-baked press releases the week of TDWI conferences. But if we look past that kind of all-too-common nonsense, Vertica is highlighting an interesting technical story, about how the analytic DBMS industry can exploit solid-state memory technology.
*Upgrades to Vertica FlexStore to handle flash memory, actually released as part of Vertica 4.0
** With Fusion I/O
To set the context, let’s recall a few points I’ve noted in the past:
- Solid-state memory’s price/throughput tradeoffs obviously make it the future of database storage.
- The flash future is coming soon, in part because flash’s propensity to wear out is overstated. This is especially true in the case of modern analytic DBMS, which tend to write to blocks all at once, and most particularly the case for append-only systems such as Vertica.
- Being able to intelligently split databases among various cost tiers of storage – e.g. flash and disk – makes a whole lot of sense.
Taken together, those points tell us:
For optimal price/performance, analytic DBMS should support databases that run part on flash, part on disk.
While all this is a future for some other analytic DBMS vendors, Vertica is shipping it today.* What’s more, three aspects of Vertica’s architecture make it particularly well-suited for hybrid flash/disk storage, in each case for a similar reason – you can get most of the performance benefit of all-flash for a relatively low actual investment in flash chips: Read more
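As an illustration of what “intelligently split databases among various cost tiers of storage” can mean, here’s a toy temperature-based placement policy. It is not Vertica’s FlexStore algorithm, just a sketch assuming per-object access statistics and a fixed flash budget.

```python
# Toy hot/cold placement policy for a hybrid flash/disk database.
# Not Vertica FlexStore; an illustrative sketch with assumed inputs.
from dataclasses import dataclass

@dataclass
class StorageObject:
    name: str
    size_gb: float
    reads_per_day: int   # assumed access statistics

def place_objects(objects: list[StorageObject], flash_budget_gb: float) -> dict[str, str]:
    """Greedily put the most frequently read data on flash until the budget runs out."""
    placement = {}
    remaining = flash_budget_gb
    # Hottest data first, measured as reads per GB.
    for obj in sorted(objects, key=lambda o: o.reads_per_day / o.size_gb, reverse=True):
        if obj.size_gb <= remaining:
            placement[obj.name] = "flash"
            remaining -= obj.size_gb
        else:
            placement[obj.name] = "disk"
    return placement

# Example: a few columns/partitions with made-up sizes and access counts.
catalog = [
    StorageObject("orders.recent_partition", 40.0, 5000),
    StorageObject("orders.history_partition", 400.0, 200),
    StorageObject("customers.dim", 10.0, 3000),
]
print(place_objects(catalog, flash_budget_gb=60.0))
```

Real-world policies are of course more nuanced, but the economics are the same: a modest flash budget can capture most of the hot I/O.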
Teradata’s future product strategy
I think Teradata’s future product strategy is coming into focus. I’ll start by outlining some particular aspects, and then show how I think it all ties together.
Read more
Big Data is Watching You!
There’s a boom in large-scale analytics. The subjects of this analysis may be categorized as:
- People
- Financial trades
- Electronic networks
- Everything else
The most varied, interesting, and valuable of those four categories is the first one.
Links and observations
I’m back from a trip to the SF Bay area, with a lot of writing ahead of me. I’ll dive in with some quick comments here, then write at greater length about some of these points when I can. From my trip: Read more
Notes on EMC’s Greenplum subsidiary
I spent considerable time last week with my clients at both Greenplum and EMC (if we ignore the fact that the deal has closed and they’re now the same company). I also had more of a hardcore engineering discussion than I’ve had with Greenplum for quite a while (I should have been pushier about that earlier). Takeaways included:
- This is starting off as a honeymoon deal. Everything Greenplum was planning to do is being continued. Additional resources are being poured into Greenplum to do more.
- Some Greenplum execs seem to envision staying long term, some seem to envision moving on to their next startups. The ones who envision moving on are, however, going to work hard first to make the merger a success.
- Greenplum has, for quite a while, had more of an advanced analytics/embedded predictive modeling story than I realized. Bad on them for not fleshing it out more in marketing and product packaging alike.
- Greenplum both denies the concurrency problems I previously noted and also has a very credible story as to how it will eliminate them. 🙂 Seriously, Greenplum tells of one customer that routinely runs 150 simultaneous queries – on what I think is not a terribly big system — and a number of POCs (Proofs of Concept) that simulated similar levels of concurrency.