Parallelization
Analysis of issues in parallel computing, especially parallelized database management.
What are the best choices for scaling Postgres?
March, 2011 edit: In its quaintness, this post is a reminder of just how fast Short Request Processing DBMS technology has been moving ahead. If I had to do it all over again, I’d suggest they use one of the high-performance MySQL options like dbShards, Schooner, or both together. I actually don’t know what they finally decided on in that area. (I do know that for analytic DBMS they chose Vertica.)
I have a client who wants to build a new application with a peak update volume of several million transactions per hour. (Their base business is data mart outsourcing, but now they’re building update-heavy technology as well.) They have a small budget. They’ve been a MySQL shop in the past, but would prefer to contract (not eliminate) their use of MySQL rather than expand it.
My client actually signed a deal for EnterpriseDB’s Postgres Plus Advanced Server and GridSQL, but unwound the transaction quickly. (They say EnterpriseDB was very gracious about the reversal.) There seem to have been two main reasons for the flip-flop. First, it seems that EnterpriseDB’s version of Postgres isn’t up to PostgreSQL’s 8.4 feature set yet, although EnterpriseDB’s timetable for catching up might have been tolerable. But GridSQL apparently is even further behind, with no timetable for up-to-date PostgreSQL compatibility. That was the dealbreaker.
The current base-case plan is to use generic open source PostgreSQL, with scale-out achieved via hand sharding, Hibernate, or … ??? Experience and thoughts along those lines would be much appreciated.
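By way of illustration only, here is a minimal sketch of what application-level hand sharding might look like: hash a customer key to pick one of N PostgreSQL nodes. The shard count, connection strings, table name, and psycopg2 usage are all my own assumptions, not anything the client has settled on.

```python
# Minimal hand-sharding sketch: route each customer to one of N PostgreSQL
# shards by hashing the customer key. DSNs, table, and shard count are placeholders.
import hashlib
import psycopg2  # assumes the standard PostgreSQL driver is available

SHARD_DSNS = [
    "dbname=app_shard0 host=pg0",
    "dbname=app_shard1 host=pg1",
    "dbname=app_shard2 host=pg2",
    "dbname=app_shard3 host=pg3",
]

def shard_for(customer_id: str) -> str:
    """Pick a shard deterministically from the customer key."""
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]

def record_update(customer_id: str, delta: int) -> None:
    """Apply one update on the shard that owns this customer."""
    with psycopg2.connect(shard_for(customer_id)) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE balances SET amount = amount + %s WHERE customer_id = %s",
                (delta, customer_id),
            )
```

The obvious catch is that cross-shard queries, rebalancing, and distributed transactions all become the application’s problem, which is precisely the burden middleware such as dbShards aims to take on.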
Another option for OLTP performance and scale-out is of course memory-centric options such as VoltDB or the Groovy SQL Switch. But this client’s database is terabyte-scale, so hardware costs could be an issue, as of course could be product maturity.
By the way, a large fraction of these updates will be actual changes, as opposed to new records, in case that matters. I expect that the schema being updated will be very simple — i.e., clearly simpler than in a classic order entry scenario.
Hasso Plattner calls for in-memory OLTP column stores
Former SAP CEO Hasso Plattner has written a paper called A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database, in association with a SIGMOD keynote address.* The approach Plattner advocates is an MPP in-memory column store, presumably somewhat akin to SAP’s frequently renamed Business Warehouse Accelerator/Business Intelligence Accelerator/BWA/BIA/Son-of-TREX technology. There also are strong similarities to the MPP in-memory row store project H-Store/VoltDB, although I don’t know whether Plattner would go so far as to adopt the H-Store view that all transactions should run in stored procedures. Unsurprisingly, SAP applications are used as the OLTP paradigm throughout.
*Thanks to Dave Kellogg for tipping me off to Plattner’s paper. I only went to two SIGMOD sessions, neither of which was Plattner’s. Nobody actually mentioned Plattner’s talk to me when I was down at SIGMOD.
Perhaps the most interesting part is Plattner’s claim that what’s demanding about OLTP isn’t database updating per se, but rather maintaining aggregates for quick-response analytics. In his main example of that point, Plattner cites a real-life schema of more than 18 tables, of which only 2 are base tables; (most of?) the rest are materialized views that his proposed database architecture dispenses with (because analytic performance is sufficiently good without them). Thus, Plattner’s core columnar argument seemingly is
columnar -> natively fast analytics -> no need to maintain aggregates -> much lower update burden.
That said — if Plattner’s paper contained a clear statement of how much more expensive it is to insert or update a single row in a columnar vs. row-based system, I overlooked it. Instead, Plattner seems to be arguing that the volume of base-table updates is low enough that — whatever it may be — column-store update overhead is an acceptable price to pay. (At one point he claims that only 5% of the data inserted in a financial application ever gets changed.) That may actually be true in a financial accounting system, but seems more questionable in a sufficiently large application that gets its updates from automatic devices, or from the consumer web.
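To make the aggregate-maintenance argument concrete, here is a toy contrast in plain Python (mine, not Plattner’s): one version updates a materialized total on every insert, the other touches only base data and computes totals by scanning a column when asked. Plattner’s bet is that if the scan is fast enough, the per-transaction bookkeeping can simply go away.

```python
# Toy contrast: maintaining an aggregate on every write vs. computing it on demand.
# Pure illustration of the argument, not Plattner's actual design.

class MaterializedTotals:
    """OLTP-style: every insert also touches an aggregate structure."""
    def __init__(self):
        self.rows = []
        self.total_by_account = {}          # the "materialized view"

    def insert(self, account, amount):
        self.rows.append((account, amount))
        self.total_by_account[account] = (  # extra write on every transaction
            self.total_by_account.get(account, 0) + amount
        )

class ScanOnDemand:
    """Column-store-style: writes touch only base data; analytics scan a column."""
    def __init__(self):
        self.accounts = []                  # columnar layout: one list per column
        self.amounts = []

    def insert(self, account, amount):
        self.accounts.append(account)
        self.amounts.append(amount)

    def total_for(self, account):
        # A fast columnar scan stands in for "natively fast analytics".
        return sum(a for acct, a in zip(self.accounts, self.amounts) if acct == account)
```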
Other highlights include: Read more
User data vs. raw disk space as a marketing metric
I tried to post a comment on Daniel Abadi’s blog, but doing so seems to require some sort of registration process, so I’m posting here instead.
In a comment to his post on node scalability, Daniel Abadi argued that disk space is a better metric to use in marketing than (presumably compressed) user data. Well, I imagine he didn’t quite mean to say that, but that’s actually what he wound up saying, starting from the accurate observation that compression ratios vary wildly from one data set to another, even more than they vary from product to product on the same data.
Nonetheless, I favor user data as a metric because:
- That’s what users care about.
- That’s how a number of analytic DBMS vendors, including Vertica, actually price.
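To see why raw disk is such a slippery proxy, here is a back-of-the-envelope sketch; every number in it is invented purely to show the arithmetic.

```python
# Why raw disk is a poor proxy for user data: the same spindles hold wildly
# different amounts of user data depending on compression, mirroring, and
# temp/work space. All numbers are invented for illustration.

def user_data_capacity(raw_tb, compression_ratio, mirroring=2, workspace_fraction=0.3):
    """Estimate user-data capacity from raw disk, under stated assumptions."""
    usable_raw = raw_tb / mirroring * (1 - workspace_fraction)
    return usable_raw * compression_ratio

raw = 100  # terabytes of raw disk on some hypothetical rack
for name, ratio in [("clickstream text", 8), ("already-terse numeric data", 2)]:
    print(f"{name}: ~{user_data_capacity(raw, ratio):.0f} TB of user data from {raw} TB raw")
```

With the same 100 TB of raw disk, the user-data figure swings by a factor of 4 just from the compression assumption, which is exactly why the metric users actually care about is the better one to quote.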
Categories: Data warehousing, Parallelization, Pricing | 3 Comments |
NoSQL?
Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:
- In most cases, no.
- Most of these technologies are designed for simple, high-volume OLTP (OnLine Transaction Processing). Most large enterprises have an established way of doing OLTP, probably via relational database management systems. Why change?
- MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS.
- There’s one big countervailing factor to all these generalities — schema flexibility.
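Since schema flexibility is the countervailing factor, here is a trivial sketch of what it means in practice: records of different shapes coexisting with no ALTER TABLE, versus a fixed relational row that silently drops anything it wasn’t designed for. This is purely illustrative and implies no particular NoSQL product’s API.

```python
# Schema flexibility in miniature: heterogeneous records stored as-is,
# versus a fixed relational row that must be migrated to add a field.
# No specific NoSQL product's API is implied.

events = []  # stands in for a schemaless store keyed however you like

# Two "rows" with different shapes coexist with no schema change:
events.append({"user": "alice", "action": "click", "url": "/home"})
events.append({"user": "bob", "action": "purchase", "sku": "X-42", "coupon": "SPRING"})

# A relational design would instead require every column up front,
# plus an ALTER TABLE (and application changes) to add "coupon" later:
FIXED_COLUMNS = ("user", "action", "url")

def to_fixed_row(event):
    return tuple(event.get(col) for col in FIXED_COLUMNS)  # "coupon" is silently lost

print(to_fixed_row(events[1]))  # ('bob', 'purchase', None)
```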
As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL. Read more
Aster Data enters the appliance game
Aster Data is rolling out a line of nCluster appliances today. Highlights include:
- Configurations ranging from 6.25 terabytes to 1 petabyte of user data. (Edit: Here’s the up-to-date data sheet.)
- A $50K “Express Edition” price for <1 terabyte of user data. Unfortunately, that’s the only stated price.
- The option of bundled MicroStrategy.
- “MapReduce” in the name, which suggests something about the positioning — i.e., enterprise decision support, rather than Aster’s usual web/”frontline” emphasis. (Edit: That also fits with Aster’s recent MapReduce-for-.NET announcement.) (Edit: Actual name is Aster MapReduce Data Warehouse Appliance.)
- Claims that because Aster runs effectively on cheaper, more truly “commodity” hardware than competitors, you get more hardware bang for the buck if you buy from Aster.
I don’t have a lot more to add right now, mainly because I wrote at some length about Aster’s non-appliance-specific, non-MapReduce technology and positioning a couple of weeks ago.
Categories: Analytic technologies, Aster Data, Business intelligence, Data warehouse appliances, Data warehousing, Database compression, MapReduce, Pricing | 16 Comments |
Aster Data on parallelism
Aster Data’s core claim boils down to “We do parallelism better.” Aster has shied away from saying that for marketing purposes, for fear of the response “Yeah, right, everybody says that.” But when I talked with Mayank Bawa, Steve Wooledge, et al. yesterday, I focused discussions on just that point. Based on that chat and others before, here are some highlights (as I understand them) of what Aster claims, believes, or believes to be differentiated about its nCluster technology: Read more
Categories: Analytic technologies, Aster Data, Data warehousing, MapReduce, Parallelization, Theory and architecture | 3 Comments |
Aster Data sticks by its SQL/MapReduce guns
Aster Data continues to think that MapReduce, integrated with SQL, is an important technology. For example:
- Aster announced today that it’s providing .NET support for SQL/MapReduce. Perhaps not coincidentally, Aster’s biggest customer is MySpace, which is apparently a big Microsoft shop. (And MySpace parent Fox Interactive Media is a SQL/MapReduce fan, albeit running on Greenplum.)
- Aster generally puts more emphasis on MapReduce than SQL/MapReduce rival Greenplum. That’s a non-trivial comparison, because Greenplum is making progress in SQL/MapReduce itself.
- When talking with Aster folks, I hear a lot about SQL/MapReduce.
I was a big fan of SQL/MapReduce when it was first announced last August. Notwithstanding persuasive examples favoring pure DBMS or pure MapReduce over DBMS/MapReduce integration, I continue to think the SQL/MapReduce idea has great potential. But I do wish more successful production examples would become visible …
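For readers who haven’t seen the pattern, here is a rough sketch in Python of the kind of thing a SQL/MapReduce table function does: a user-written function runs once per partition of rows and emits new rows that downstream SQL can query like any other table. Sessionization is the canonical example. This is a generic illustration of the idea, not Aster’s actual syntax or API.

```python
# Generic sketch of the SQL/MapReduce idea: a user-defined function runs once
# per partition of rows and emits new rows that SQL can keep querying.
# Illustration only -- not Aster's actual syntax or API.
from itertools import groupby
from operator import itemgetter

clicks = [  # (user_id, timestamp_seconds), a stand-in for a clickstream table
    ("u1", 10), ("u1", 40), ("u1", 5000), ("u2", 7), ("u2", 9),
]

def sessionize(rows, timeout=1800):
    """Per-partition 'row function': assign a session id within one user's clicks."""
    session, last_ts = 0, None
    for user, ts in rows:
        if last_ts is not None and ts - last_ts > timeout:
            session += 1
        last_ts = ts
        yield (user, ts, session)

# The "PARTITION BY user_id" step: group rows, then apply the function per group.
output = []
for _, rows in groupby(sorted(clicks), key=itemgetter(0)):
    output.extend(sessionize(rows))

print(output)  # downstream SQL would treat this as just another relation
```

As I understand it, in Aster’s implementation the function invocation sits right in the FROM clause, so its output can be joined, filtered, and aggregated without leaving SQL; that composability is much of the pitch.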
Categories: Analytic technologies, Aster Data, Data warehousing, Fox and MySpace, Greenplum, MapReduce, Parallelization | 4 Comments |
The future of data marts
Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC). Key observations around the concept — mixing mine and Greenplum’s together — include:
- Data marts aren’t just for performance (or price/performance). They also exist to give individual analysts or small teams control of their analytic destiny.
- Thus, it would be really cool if business users could have their own analytic “sandboxes” — virtual or physical analytic databases that they can manipulate without breaking anything else.
- In any case, business users want to analyze data when they want to analyze it. It is often unwise to ask business users to postpone analysis until after an enterprise data model can be extended to fully incorporate the new data they want to look at.
- Whether or not you agree with that, it’s an empirical fact that enterprises have many legacy data marts (or even, especially due to M&A, multiple legacy data warehouses). Similarly, it’s an empirical fact that many business users have the clout to order up new data marts as well.
- Consolidating data marts onto one common technological platform has important benefits.
In essence, Greenplum is pitching the story:
- Thesis: Enterprise Data Warehouses (EDWs)
- Antithesis: Data Warehouse Appliances
- Synthesis: Greenplum’s Enterprise Data Cloud vision
When put that starkly, it’s overstated, not least because
Specialized Analytic DBMS != Data Warehouse Appliance
But basically it makes sense, for two main reasons:
- Analysis is performed on all sorts of novel data, from sources far beyond an enterprise’s core transactions. This data neither has to fit into the core enterprise data model nor particularly benefits from being tightly fitted into it. Requiring that it do so just imposes an unnecessary and painful bureaucratic delay.
- On the other hand, consolidation can be a good idea even when systems don’t particularly interoperate. Data marts, which commonly do in part interoperate with central data stores, have all the more reason to be consolidated onto a central technology platform/stack.
Greenplum update — Release 3.3 and so on
I visited Greenplum in early April, and talked with them again last night. As I noted in a separate post, there are a couple of subjects I won’t write about today. But that still leaves me free to cover a number of other points about Greenplum, including: Read more
Categories: Data warehousing, Database compression, EAI, EII, ETL, ELT, ETLT, Greenplum, MapReduce, Market share and customer counts, Parallelization, PostgreSQL, Pricing | 11 Comments |
How much state is saved when an MPP DBMS node fails?
Mark Callaghan raised an interesting question in the comment thread to my recent Facebook/Hadoop/Hive post:
My question is about how commercial MPP RDBMS vendors recover from single or a small number of node failures during a long running SQL query. Do any of them save enough state to avoid starting the query over?
Honestly, I’d just be guessing at the answer.
Would any vendors or other knowledgeable folks care to take a crack at answering directly?
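To make the question concrete, here is a toy contrast between throwing away all partial work on any failure and keeping per-slice results so that only the failed slice gets re-run, which is roughly the MapReduce/Hadoop task-retry model. It is purely conceptual and makes no claim about what any commercial MPP DBMS actually does.

```python
# Toy contrast: restart-the-whole-query vs. re-run only the failed slice.
# Purely conceptual; no claim about any commercial MPP DBMS's behavior.
import random

def run_slice(slice_id):
    """Pretend to scan/aggregate one node's partition; sometimes a node dies."""
    if random.random() < 0.2:
        raise RuntimeError(f"node for slice {slice_id} failed")
    return slice_id * 100  # stand-in partial result

def query_without_saved_state(num_slices):
    """Any failure throws away all partial work and starts the query over."""
    while True:
        try:
            return sum(run_slice(s) for s in range(num_slices))
        except RuntimeError:
            continue  # start from scratch

def query_with_saved_state(num_slices):
    """Completed slices are kept; only the failed slices are retried."""
    results = {}
    while len(results) < num_slices:
        for s in range(num_slices):
            if s not in results:
                try:
                    results[s] = run_slice(s)
                except RuntimeError:
                    pass  # retry just this slice on the next pass
    return sum(results.values())

print(query_without_saved_state(8))
print(query_with_saved_state(8))
```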
Categories: Data warehousing, Parallelization | 10 Comments |