Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
Mike Stonebraker’s counterarguments to MapReduce’s popularity
In response to some recent posts I've written about MapReduce, Mike Stonebraker got on the phone to give me his views. His core claim, more or less, is that anything you can do in MapReduce you could already do in a parallel database that complies with SQL-92 and/or has PostgreSQL underpinnings. In particular, Mike says: Read more
| Categories: Data warehousing, MapReduce, Michael Stonebraker, PostgreSQL | 5 Comments |
More data on data warehouse sizes and issues
I spoke today with Paul Barth and Randy Bean of consultancy NewVantage Partners. The core of NewVantage’s business seems to be helping large enterprises (especially financial services) with their data warehouse strategies. Takeaways — none of which should shock regular readers of DBMS2 — included:
- Administrative cost and difficulty are often the single biggest issue in selecting analytic DBMS products.
- Oracle hits a wall around 10 terabytes of user data. The one customer NewVantage can think of with an Oracle data warehouse over 10 terabytes is fleeing Oracle for Netezza.
- NewVantage says that very specialized data warehouses on Oracle could conceivably be larger than that.
- NewVantage does have a customer on DB2/UDB in the 30-40 terabyte range. That customer does a lot of careful tuning to make it work.
- About 15% of NewVantage’s customers use Netezza. Few if any use newer analytic DBMS (but I got the sense more will soon). The rest rely on “traditional” DBMS, a group that includes Teradata.
| Categories: Data warehousing, IBM and DB2, Netezza, Oracle | 1 Comment |
Introduction to Aster Data and nCluster
I’ve been writing a lot about Greenplum since a recent visit. But on the same trip I met with Aster Data, and have talked with them further since. Let me now redress the balance and outline some highlights of the Aster Data story.
| Categories: Analytic technologies, Aster Data, Data warehousing, Parallelization, Specific users | 4 Comments |
Estimating user data vs. spinning disk
There’s a lot of confusion about how to measure data warehouse database size. Major complicating factors include:
- Indexes and temporary working space. That’s what I emphasized a couple of years ago in my post about Expansion Ratios.
- Compression. I write about database compression a lot.
- Disk redundancy. I usually gloss over that one, but I’ll try to make amends in this post.
- Replication other than that which is primarily designed for redundancy. I usually gloss over that one too, and I think it’s safe to continue doing so. That’s because data warehouse replication – at least in most of the system architectures I know of – generally divides into three categories:
- a lot like redundancy
- a lot like an index
- only a minor issue (e.g., when small fact tables are replicated across each node of an MPP cluster)
Greenplum’s CTO Luke Lonergan recently walked me through the general disk usage arithmetic for Greenplum’s most common configuration (Sun Thors*, configured as RAID 10). I found it pretty interesting, and a good guide to factors that also affect other systems from other vendors.
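The general shape of that arithmetic can be sketched in a few lines. To be clear, every ratio below is a hypothetical placeholder for illustration, not Greenplum's (or anyone's) actual configuration figures:

```python
# Back-of-envelope sketch of user data vs. raw spinning disk.
# All factors are assumed for illustration, not vendor numbers.

def user_data_tb(raw_disk_tb,
                 mirror_factor=2.0,      # RAID 10 mirroring halves usable capacity
                 temp_fraction=0.3,      # share of usable space held for temp/work space
                 index_overhead=0.2,     # indexes as a fraction of base table data
                 compression_ratio=3.0): # on-disk data is 1/3 its uncompressed size
    usable = raw_disk_tb / mirror_factor
    after_temp = usable * (1 - temp_fraction)
    base_data_on_disk = after_temp / (1 + index_overhead)
    # Compression means each on-disk terabyte represents several of user data.
    return base_data_on_disk * compression_ratio

print(round(user_data_tb(100), 1))  # 100 TB of raw disk -> 87.5 TB of user data
```

Plug in different assumptions and you can see why two vendors quoting "database size" for the same hardware can differ by several multiples.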
Yes, but what are the Very Biggest benefits of MapReduce?
On behalf of On-Demand Enterprise, née Grid Today, Dennis Barker asked me to clarify the most important benefits, features, etc. to various constituencies (business users, programmers, DBAs, etc.) of the Greenplum and Aster Data MapReduce announcements. Questions like that are hard to answer simply. Here’s why.
The core benefit of MapReduce is price/performance (because it allows the cost benefits of parallelization to be applied to analyses that are hard to parallelize otherwise). Large price/performance gains commonly mix together three kinds of benefits.
1. They let you do what you did before, for less money.
2. They let you do a better version of what you did before, for similar money.
3. They let you do new things that didn’t make economic sense before, but now do.
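The parallelization claim above is easiest to see in a toy example. Here is a minimal, single-process word-count sketch of the MapReduce pattern; the point is that map calls are independent per record and reduce calls are independent per key, so both phases can be spread across a cluster:

```python
from collections import defaultdict

# Toy MapReduce word count. Each map() call touches one record and each
# reduce group touches one key, which is exactly what makes the pattern
# cheap to parallelize across nodes.

def map_phase(records):
    for record in records:
        for word in record.split():
            yield word, 1  # emit (key, value) pairs

def reduce_phase(pairs):
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value  # combine all values for a key
    return dict(groups)

counts = reduce_phase(map_phase(["to be or not", "to be"]))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

A real deployment shards the records across nodes and shuffles the pairs by key between phases, but the programming model the analyst writes against is just these two functions.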
Read more
| Categories: Analytic technologies, Data warehousing, MapReduce | Leave a Comment |
Are analytic DBMS vendors overcomplicating their interconnect architectures?
I don’t usually spend a lot of time researching Ethernet switches. But I do think a lot about high-end data warehousing, and as I noted back in July, networking performance is a big challenge there. Among the very-large-scale MPP data warehouse software vendors, Greenplum is unusual in that its interconnect of choice is (sufficiently many) cheap 1 gigabit Ethernet switches.
A recent Network World story suggested that Greenplum wasn’t alone in this preference; other people also feel that clusters of commodity 1 gigabit Ethernet switches can be superior to higher-performing ones. So I pinged CTO Luke Lonergan of Greenplum for more comment. His response, which I got permission to publish, was: Read more
| Categories: Data warehousing, Greenplum, Parallelization | 4 Comments |
Sales figures for analytic DBMS
One of my clients asked how many new customers I thought were buying analytic DBMS each quarter. I don’t generally track such things, but hey — a client asked, so I did the best I could. And since I did the work, now I’ll share it generally. To wit:
Read more
Enterprises are buying multiple brands of analytic DBMS each
Over the past few weeks I’ve had a lot of NDA discussions about analytic DBMS vendors’ specific customers. And so I’ve been acutely aware of something I already sort of knew — just as in prior generations of database management technology, there’s huge overlap among analytic DBMS vendors’ customer bases. As they always have, enterprises are investing in multiple different brands of DBMS, even in cases where those DBMS can do pretty much the same things.
For example:
- Many Teradata users are buying newer technology too. But they aren’t actually throwing out Teradata.
- The same sometimes applies to Netezza already. At least two Netezza references are also references for a rival vendor.
- One outfit is among the biggest customers for two different analytic DBMS vendors, neither of which is Teradata or Netezza.
- One corporation is using or deploying four different brands of analytic DBMS.
- TEOCO is a big user of both DATAllegro and Netezza.
Vertica’s paying customer count
In a recent Computerworld article, Andy Ellicott of Vertica was cited as saying Vertica has 50 paying customers total. That’s very much on par with Greenplum’s figure, leaving aside any questions of deal size. (Greenplum runs a number of databases much larger than Vertica’s biggest. However, I believe Greenplum also charges a lot less per terabyte of user data.)
Previous Vertica paying customer count figures include:
| Categories: Data warehousing, Greenplum, Vertica Systems | 8 Comments |
Three approaches to parallelizing data transformation
Many MPP data warehousing vendors have told me their products are used for ELT (Extract/Load/Transform) instead of ETL (Extract/Transform/Load). I.e., needed data transformations are done on the MPP system, rather than on the — probably SMP — system the data comes from.* If the data transformation is being applied on a record-by-record basis, then it’s automatically fully parallelized. Even if the transforms are more complex, considerable parallel processing may still be going on.
*Or it’s some of each, at which point it’s called ETLT — I bet you can work out what that stands for.
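The "automatically fully parallelized" point deserves a quick illustration. If a transform touches one record at a time, the input can be split across MPP nodes in any way at all and the results simply unioned. The function names below are illustrative, not any vendor's API:

```python
# Why record-by-record ELT transforms parallelize trivially: the transform
# reads one record and writes one record, so any partitioning of the input
# across nodes yields the same union of outputs.

def transform(record):
    # A typical per-record cleanup step, e.g. normalizing a name field.
    return {**record, "name": record["name"].strip().upper()}

def run_on_node(partition):
    return [transform(r) for r in partition]

records = [{"name": " alice "}, {"name": "bob"}, {"name": " Carol"}]

# Split across two hypothetical nodes; any split gives the same results.
node_outputs = [run_on_node(records[:2]), run_on_node(records[2:])]
result = [row for out in node_outputs for row in out]
print([r["name"] for r in result])  # ['ALICE', 'BOB', 'CAROL']
```

Transforms that join, aggregate, or sort across records are the harder case; that's where the MPP engine's own parallel query machinery earns its keep.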
