Donut holes converted to code
And with impressively linear scalability.
| Categories: Humor, Parallelization | Leave a Comment |
Are analytic DBMS vendors overcomplicating their interconnect architectures?
I don’t usually spend a lot of time researching Ethernet switches. But I do think a lot about high-end data warehousing, and as I noted back in July, networking performance is a big challenge there. Among the very-large-scale MPP data warehouse software vendors, Greenplum is unusual in that its interconnect of choice is (sufficiently many) cheap 1 gigabit Ethernet switches.
A recent Network World story suggested that Greenplum wasn’t alone in this preference; other people also feel that clusters of commodity 1 gigabit Ethernet switches can be superior to higher-performing ones. So I pinged CTO Luke Lonergan of Greenplum for more comment. His response, which I got permission to publish, was: Read more
| Categories: Data warehousing, Greenplum, Parallelization | 4 Comments |
Sales figures for analytic DBMS
One of my clients asked how many new customers I thought were buying analytic DBMS each quarter. I don’t generally track such things, but hey — a client asked, so I did the best I could. And since I did the work, now I’ll share it generally. To wit:
Read more
Enterprises are buying multiple brands of analytic DBMS each
Over the past few weeks I’ve had a lot of NDA discussions about analytic DBMS vendors’ specific customers. And so I’ve been acutely aware of something I already sort of knew: just as in prior generations of database management technology, there’s huge overlap among analytic DBMS vendors’ customer bases. As they always have, enterprises are investing in multiple different brands of DBMS, even in cases where those DBMS can do pretty much the same things.
For example:
- Many Teradata users are buying newer technology too. But they aren’t actually throwing out Teradata.
- The same already applies to Netezza in some cases. At least two Netezza references are also references for a rival vendor.
- One outfit is among the biggest customers for two different analytic DBMS vendors, neither of which is Teradata or Netezza.
- One corporation is using or deploying four different brands of analytic DBMS.
- TEOCO is a big user of both DATAllegro and Netezza.
Vertica’s paying customer count
In a recent Computerworld article, Andy Ellicott of Vertica was cited as saying Vertica has 50 paying customers total. That’s very much on par with Greenplum’s figure, leaving aside any questions of deal size. (Greenplum runs a number of databases much larger than Vertica’s biggest. However, I believe Greenplum also charges a lot less per terabyte of user data.)
Previous Vertica paying customer count figures include:
| Categories: Data warehousing, Greenplum, Vertica Systems | 8 Comments |
Three approaches to parallelizing data transformation
Many MPP data warehousing vendors have told me their products are used for ELT (Extract/Load/Transform) instead of ETL (Extract/Transform/Load). I.e., needed data transformations are done on the MPP system, rather than on the — probably SMP — system the data comes from.* If the data transformation is being applied on a record-by-record basis, then it’s automatically fully parallelized. Even if the transforms are more complex, considerable parallel processing may still be going on.
*Or it’s some of each, at which point it’s called ETLT — I bet you can work out what that stands for.
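To make the record-by-record case concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `transform` rule, the round-robin `partition` function, and the use of threads as stand-ins for MPP nodes. It is not any vendor's actual mechanism. The point it shows is only that when a transform looks at one record at a time, each node can process its own slice of the data with no coordination, and the result matches a serial run.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record):
    """Hypothetical per-record cleanup: normalize a name, convert cents to dollars."""
    name, cents = record
    return (name.strip().upper(), cents / 100)


def partition(records, n_nodes):
    """Round-robin rows across n_nodes, standing in for an MPP data distribution."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def transform_partition(rows):
    """Work done by one 'node': it touches only its own rows."""
    return [transform(r) for r in rows]


def parallel_elt(records, n_nodes=4):
    """Transform every partition concurrently, then gather the results."""
    parts = partition(records, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        transformed_parts = list(pool.map(transform_partition, parts))
    return [row for part in transformed_parts for row in part]


records = [(" alice ", 1250), ("bob", 300), (" carol", 99), ("dave ", 100000)]
serial = sorted(transform(r) for r in records)
parallel = sorted(parallel_elt(records, n_nodes=3))
```

Because no record depends on any other, the number of partitions changes only how the work is split, not the answer. Transforms that need to compare or combine records (deduplication, say) would require data movement between nodes, which is where the more complex cases come in.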
| Categories: Aster Data, Data integration and middleware, Data warehousing, EAI, EII, ETL, ELT, ETLT, MapReduce, Parallelization, Pervasive Software | 8 Comments |
Why MapReduce matters to SQL data warehousing
Greenplum and Aster Data have both just announced the integration of MapReduce into their SQL MPP data warehouse products. So why do I think this could be a big deal? The short answer is “Because MapReduce offers dramatic performance gains in analytic application areas that still need great performance speed-up.” The long answer goes something like this.
The core ideas of MapReduce are: Read more
| Categories: Analytic technologies, Data warehousing, MapReduce, Parallelization | 24 Comments |
Known applications of MapReduce
Most of the actual MapReduce applications I’ve heard of fall into a few areas:
- Text tokenization, indexing, and search
- Creation of other kinds of data structures (e.g., graphs)
- Data mining and machine learning
That covers all MapReduce apps I recall hearing about via commercial companies and users, and also includes most of what’s in the two big sources I found online.
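As an illustration of the first category, here is a minimal single-process sketch of building an inverted index in the map/shuffle/reduce style. It is a toy under stated assumptions: the function names `map_phase`, `shuffle`, and `reduce_phase`, the whitespace tokenizer, and the sample documents are all made up for this example, and nothing here is the API of Hadoop, Greenplum, or Aster Data. In a real system the map and reduce calls would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (token, doc_id) pair per token (naive whitespace tokenizer)."""
    for token in text.lower().split():
        yield token, doc_id


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(token, doc_ids):
    """Reduce: collapse one token's doc ids into a sorted, de-duplicated posting list."""
    return token, sorted(set(doc_ids))


# Hypothetical sample documents.
docs = {
    1: "MapReduce meets SQL",
    2: "SQL data warehouse",
    3: "MapReduce data mining",
}

pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
index = dict(reduce_phase(token, ids) for token, ids in shuffle(pairs).items())
```

Each `map_phase` call sees only one document and each `reduce_phase` call sees only one token's values, which is what lets both steps be spread across nodes.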
| Categories: MapReduce, RDF and graphs, Text | 16 Comments |
MapReduce links
For whatever reason, I seem to be making the peripheral posts about MapReduce tonight before getting to the meat of the issues. So be it. There’s a rich set of links out there about MapReduce, and here are some of the best of them:
- Aster Data introduced MapReduce integrated into its SQL data warehouse DBMS tonight. Aster’s site features an excellent white paper.
- Exactly the same is true of Greenplum.
- Google Labs offers the seminal MapReduce research paper. It also has a broken link to an associated slide presentation, which fortunately is available here.
- One can get a good sense of MapReduce by reading up on the open source implementation Hadoop.
- In particular, this list of Hadoop applications is the longest list of MapReduce applications I know of (ahead even of Google’s long internal list).
- Joel Spolsky explained the core MapReduce concept a couple of years ago.
| Categories: MapReduce, Parallelization | 8 Comments |
MapReduce sound bites
Last Thursday, Greenplum and Aster Data, the two most recent of my numerous data warehouse specialist customers, each told me of the same major innovation. Each was rushing to announce it first. This led to considerable tap dancing, with the upshot being that both are releasing the information tonight or tomorrow morning.
What’s going on is that Aster Data and Greenplum have both integrated MapReduce into their respective MPP shared-nothing data warehouse DBMS.
| Categories: Analytic technologies, Aster Data, Greenplum, MapReduce, Parallelization | 11 Comments |
