Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
Notes and links, October 10, 2010
More quick-hit notes, links, and so on: Read more
EMC/Greenplum notes
I dropped by the former Greenplum for my quarterly consulting visit (scheduled for the first week of Q4 for a couple of reasons, one of them XLDB4). Much of what we discussed was purely advisory and/or confidential — duh! — but there were real, nonconfidential takeaways in two areas.
First, feelings about the EMC acquisition are still very positive.
- Hiring has been rapid, on track to roughly quadruple Greenplum’s size over a year and a half. These don’t seem to be EMC imports, but rather outside hires, although EMC folks are surely helping with the recruiting.
- The former Greenplum is clearly going to pursue more product possibilities than it would have on its own. This augurs well for Greenplum customers.
- Griping about big-company bureaucracy is minimal.
- I didn’t hear one word about any unwelcome product/business strategy constraints. On the other hand …
- … the next Greenplum product announcement you’ll hear about will be one designed to be appealing to the EMC customer base — i.e., to enterprises that EMC is generally successful in selling to.
It can be hard to analyze analytics
When vendors talk about the integration of advanced analytics into database technology, confusion tends to ensue. For example: Read more
eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more
I chatted with Oliver Ratzesberger of eBay around a Stanford picnic table yesterday (the XLDB 4 conference is being held at Jacek Becla’s home base of SLAC, which used to stand for “Stanford Linear Accelerator Center”). Todd Walter of Teradata also sat in on the latter part of the conversation. Things I learned included: Read more
Some thoughts on the announcement that IBM is buying Netezza
As you’ve probably read, IBM and Netezza announced a deal today for IBM to buy Netezza. I didn’t sit in on the conference call, but I’ve seen the reporting. Naturally, I have some quick thoughts, which I’ve broken up into several sections below:
- Clearing some underbrush.
- Speculation about what IBM/Netezza will do.
- Speculation about alternative acquirers for Netezza.
- Speculation about what IBM/Netezza competitors will do.
Aster Data nCluster Version 4.6
The main thing in Aster Data nCluster Version 4.6 is Aster’s version of hybrid row-column store technology. Technical highlights include:
- Aster Data is simply taking the number of storage options in nCluster up from 1 to 2; you can now store a table either in the Aster Data nCluster row store or the Aster Data nCluster column store.
- In fact, you can store parts of a table in the Aster Data nCluster row store and other parts in the Aster Data nCluster column store. I’m a bit foggy on the details of that – Aster makes discussions of partitioning more complicated than they need to be — but it definitely sounds pretty flexible. Edit: See comment thread below.
- Anything you can do with the Aster Data nCluster row store you can also do with the Aster Data nCluster column store. In particular, that includes all of Aster Data’s analytic functionality.
- The same is true vice versa: anything you can do with the column store, you can also do with the row store. In particular, there is no columnar-oriented compression in Aster Data nCluster at this time. (A sketch of the row/column layout difference follows this list.)
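To make the row store/column store distinction concrete, here is a minimal sketch in Python of the same table laid out both ways. It is a generic illustration, not Aster’s actual implementation; the table and query are made up:

```python
# Generic illustration of row-store vs. column-store layouts for one table.
# This is NOT Aster Data's implementation -- just the basic idea.

rows = [  # row store: each record stored contiguously
    ("2010-10-01", "widget", 3),
    ("2010-10-01", "gadget", 5),
    ("2010-10-02", "widget", 2),
]

columns = {  # column store: each attribute stored contiguously
    "date":     ["2010-10-01", "2010-10-01", "2010-10-02"],
    "product":  ["widget", "gadget", "widget"],
    "quantity": [3, 5, 2],
}

# The same query -- total quantity of widgets -- runs against either layout,
# which is why feature parity between the two stores is plausible.
total_from_rows = sum(q for (_, p, q) in rows if p == "widget")
total_from_cols = sum(
    q for p, q in zip(columns["product"], columns["quantity"]) if p == "widget"
)
assert total_from_rows == total_from_cols == 5

# Note the column store touches only 2 of the 3 attributes for this query,
# which is the classic I/O argument for columnar storage.
```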
So Aster Data has now joined Greenplum/EMC among row-based analytic DBMS vendors with hybrid row-column stores. Oracle will join them some day, and the same probably applies to other row-based vendors as well. Similarly, Aster Data will probably join Oracle some day in having columnar compression. And so this all fits the model:
- Aster Data has an impressively competitive analytic relational DBMS, considering the youth and size of the company.
- Aster Data is a leader in extending its analytic relational DBMS by integrating in other analytic processing capabilities.
The Workday architecture — a new kind of OLTP software stack
One of my coolest company visits in some time was to SaaS (Software as a Service) vendor Workday, Inc., earlier this month. Reasons included:
- Workday has forward-thinking ideas about SaaS enterprise applications and the integration of business intelligence into same.
- Workday has highly innovative ideas in how it manages data.
- Companies founded by Dave Duffield tend to feature smart, likeable people who talk to one pleasantly and forthrightly. Workday is no exception; CTO Stan Swete and the other Workday folks present were a delight to talk with.
- I’d invited Merv Adrian to come along with me. He asked great questions, and I could gather myself a bit despite how sleep-deprived I was for the first part of that trip.
Workday kindly allowed me to post this Workday slide deck. Otherwise, I’ve split out a quick Workday, Inc. company overview into a separate post.
The biggie for me was the data and object management part. Specifically: Read more
The substance of Pentaho’s Hadoop strategy
Pentaho has been talking about a Hadoop-related strategy. Unfortunately, in support of its Hadoop efforts, Pentaho has been — quite insistently — saying things that don’t make a lot of sense to people who know anything about Hadoop.
That said, I think I found four sensible points in Pentaho’s Hadoop strategy, namely:
- If you use an ETL tool like Pentaho’s to move data in and out of HDFS, you may be able to orchestrate those two extra steps together with the rest of your ETL process, something Hadoop’s native orchestration tools don’t help you with.
- Much of what you want to do in MapReduce consists of things that can be specified graphically in an ETL tool like Pentaho’s. (Tokenization and regex matching are examples.)
- If you have some really lightweight BI requirements (ad hoc queries, reporting, or whatever) against HDFS data, you might be content to run them straight against HDFS, rather than first moving the data into a real DBMS. If so, BI tools like Pentaho’s might be useful.
- Somebody might want to use a screwy version of MapReduce, where by “screwy” I mean anything that isn’t Cloudera Enterprise, Aster Data SQL/MapReduce, or some other implementation/distribution with a lot of supporting tools. In that case, they might need all the tools they can get.
The first of those points is, in the grand scheme of things, pretty trivial.
The third one makes sense. While Hadoop’s Hive client means you could roll your own integration with your favorite BI tool in any case, having a vendor certify that integration for you could be nice. So if Pentaho ships something that works before other vendors do, good on them. (Target date seems to be October.)
The fourth one is kind of sad.
But if there’s any shovel-meet-pony aspect to all this (or indeed a reason for writing this blog post), it would be the second point. If one understands data management, but is in the “Oh no! Hadoop wants me to PROGRAM!” crowd, then being able to specify one’s MapReduce graphically could be a really nice alternative to coding it by hand.
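To underline how little code such a step involves, here is a minimal sketch of the kind of MapReduce logic in question: a Hadoop Streaming-style mapper that does regex tokenization for a word count. The tokenizer regex and the word-count framing are my own illustrative choices, not anything Pentaho actually generates:

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming-style mapper: regex tokenization for word count.
# Illustrative only -- the kind of trivial MapReduce logic an ETL tool
# could plausibly let you specify graphically instead of coding by hand.
import re
import sys

TOKEN = re.compile(r"[A-Za-z']+")  # crude word tokenizer; an illustrative choice

def main():
    for line in sys.stdin:
        for word in TOKEN.findall(line.lower()):
            # Hadoop Streaming expects tab-separated key/value pairs on stdout
            print(f"{word}\t1")

if __name__ == "__main__":
    main()
```

Paired with an equally short reducer that sums the 1s per word, that is a complete MapReduce job under Hadoop Streaming.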
DB2 workload management
DB2 has added a lot of workload management features in recent releases. So when we talked Tuesday afternoon, Tim Vincent and I didn’t bother going through every one. Even so, we covered some interesting subjects in the area of DB2 workload management, including: Read more
More on temp space, compression, and “random” I/O
My PhD was in a probability-related area of mathematics (game theory), so I tend to squirm when something is described as “random” that clearly is not. That said, a comment by Shilpa Lawande on our recent flash/temp space discussion suggests the following way of framing a key point:
- You really, really want to have multiple data streams coming out of temp space, as close to simultaneously as possible.
- The storage performance characteristics of such a workload are more reminiscent of “random” than “sequential” I/O.
If everybody else is cool with it too, I can live with that. 🙂
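To see why “random” is a fair description, here is a tiny simulation, with a made-up disk layout and stream count, of several individually sequential spill streams being read in near-simultaneous round-robin fashion:

```python
# Sketch: interleave reads from several sequential spill streams and look at
# the resulting block-offset sequence. Layout and stream count are made up.
NUM_STREAMS = 4
BLOCKS_PER_STREAM = 5
REGION = 1000  # each stream's blocks live in their own region of the disk

# Each stream on its own is perfectly sequential...
streams = [
    [s * REGION + b for b in range(BLOCKS_PER_STREAM)]
    for s in range(NUM_STREAMS)
]

# ...but reading them near-simultaneously (round-robin here) interleaves them.
interleaved = [stream[b] for b in range(BLOCKS_PER_STREAM) for stream in streams]
print(interleaved)
# [0, 1000, 2000, 3000, 1, 1001, 2001, 3001, ...]
# Every successive read seeks to a different region: sequential per stream,
# but effectively "random" from the storage device's point of view.
```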
Meanwhile, I talked again with Tim Vincent of IBM this afternoon. Tim endorsed the temp space/Flash fit, but with a different emphasis, which upon review I find I don’t really understand. The idea is:
- Analytic DBMS processing generally stresses reads over writes.
- Temp space is an exception — read and write use of temp space is pretty balanced. (You spool data out once, you read it back in once, and that’s the end of that; next time it will be overwritten.)
My problem with that is: flash typically has lower write IOPS (I/O operations per second) than read IOPS, so being (relatively) write-intensive would, to a first approximation, seem if anything to make a workload a worse fit for flash. A quick back-of-envelope calculation is below.
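Here is that arithmetic. The IOPS figures are illustrative assumptions, not measurements of any particular device, and the harmonic-mean model of a mixed workload is a simplification:

```python
# Back-of-envelope: effective IOPS for a mixed read/write workload.
# The IOPS figures below are illustrative assumptions, not vendor specs.
READ_IOPS = 30000.0   # hypothetical flash random-read rate
WRITE_IOPS = 5000.0   # hypothetical flash random-write rate (much lower)

def effective_iops(read_fraction):
    """Harmonic-weighted throughput when reads and writes share the device."""
    write_fraction = 1.0 - read_fraction
    return 1.0 / (read_fraction / READ_IOPS + write_fraction / WRITE_IOPS)

print(f"90/10 read-heavy:  {effective_iops(0.90):,.0f} IOPS")  # 20,000
print(f"50/50 temp space:  {effective_iops(0.50):,.0f} IOPS")  # ~8,571
# With these numbers, the balanced temp-space workload gets less than half
# the throughput of the read-heavy one -- the sense in which write-balance
# disfavors a workload for flash.
```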
On the plus side, I was reminded of something I should have noted when I wrote about DB2 compression before:
Much like Vertica, DB2 operates on compressed data all the way through, including in temp space.
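As a generic illustration of what “operating on compressed data” can mean (dictionary encoding here; this is not a description of DB2’s or Vertica’s actual machinery), consider a filter and an aggregate that run entirely on compressed codes:

```python
# Generic dictionary-compression sketch -- not DB2's actual implementation.
# Values are replaced by small integer codes; operations run on the codes.
from collections import Counter

values = ["MA", "CA", "MA", "NY", "CA", "MA"]

# Build the dictionary once; store only compact codes thereafter.
dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
codes = [dictionary[v] for v in values]          # [0, 1, 0, 2, 1, 0]

# Predicate "state = 'MA'": encode the literal once, then compare codes.
target = dictionary["MA"]
matches = sum(1 for c in codes if c == target)   # no decompression needed

# A GROUP BY count also works directly on codes; decode only the final result.
decode = {i: v for v, i in dictionary.items()}
counts = {decode[c]: n for c, n in Counter(codes).items()}
print(matches, counts)   # 3 {'MA': 3, 'CA': 2, 'NY': 1}
```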
