Parallelization

Analysis of issues in parallel computing, especially parallelized database management.

July 29, 2009

What are the best choices for scaling Postgres?

March, 2011 edit: In its quaintness, this post is a reminder of just how fast Short Request Processing DBMS technology has been moving ahead.  If I had to do it all over again, I’d suggest they use one of the high-performance MySQL options like dbShards, Schooner, or both together.  I actually don’t know what they finally decided on in that area. (I do know that for analytic DBMS they chose Vertica.)

I have a client who wants to build a new application with peak update volume of several million transactions per hour. (Their base business is data mart outsourcing, but now they’re building update-heavy technology as well.) They have a small budget. They’ve been a MySQL shop in the past, but would prefer to contract (not eliminate) their use of MySQL rather than expand it.

My client actually signed a deal for EnterpriseDB’s Postgres Plus Advanced Server and GridSQL, but unwound the transaction quickly. (They say EnterpriseDB was very gracious about the reversal.) There seem to have been two main reasons for the flip-flop. First, it seems that EnterpriseDB’s version of Postgres isn’t up to PostgreSQL’s 8.4 feature set yet, although EnterpriseDB’s timetable for catching up might have been tolerable. But GridSQL apparently is even further behind, with no timetable for up-to-date PostgreSQL compatibility. That was the dealbreaker.

The current base-case plan is to use generic open source PostgreSQL, with scale-out achieved via hand sharding, Hibernate, or … ??? Experience and thoughts along those lines would be much appreciated.
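For concreteness, here’s a minimal sketch of what hand sharding over plain PostgreSQL usually amounts to: the application hashes a shard key to pick a database, and routes every statement for that key accordingly. The table, connection strings, and choice of the psycopg2 driver are all my illustrative assumptions, not anything the client has specified.

```python
# Hypothetical hand-sharding sketch, not the client's actual design:
# route each key's statements to one of N PostgreSQL instances.
import zlib

import psycopg2  # standard PostgreSQL driver; an assumption here

SHARD_DSNS = [  # placeholder connection strings
    "dbname=app_shard0 host=db0",
    "dbname=app_shard1 host=db1",
    "dbname=app_shard2 host=db2",
]

def shard_for(key: str) -> str:
    # crc32 is stable across processes, unlike Python's built-in hash()
    return SHARD_DSNS[zlib.crc32(key.encode()) % len(SHARD_DSNS)]

def apply_update(user_id: str, delta: int) -> None:
    conn = psycopg2.connect(shard_for(user_id))
    try:
        with conn, conn.cursor() as cur:  # commits on clean exit
            cur.execute(
                "UPDATE counters SET value = value + %s WHERE user_id = %s",
                (delta, user_id),
            )
    finally:
        conn.close()
```

The well-known catch is that cross-shard queries, rebalancing, and distributed transactions all become the application’s problem, which is exactly the work products like GridSQL aim to take over.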

Other options for OLTP performance and scale-out are of course memory-centric products such as VoltDB or the Groovy SQL Switch. But this client’s database is terabyte-scale, so hardware costs could be an issue, as of course could product maturity.

By the way, a large fraction of these updates will be actual changes, as opposed to new records, in case that matters.  I expect that the schema being updated will be very simple — i.e., clearly simpler than in a classic order entry scenario.

July 7, 2009

Hasso Plattner calls for in-memory OLTP column stores

Former SAP CEO Hasso Plattner has written a paper called A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database, in association with a SIGMOD keynote address.* The approach Plattner advocates is an MPP in-memory column store, presumably somewhat akin to SAP’s frequently renamed Business Warehouse Accelerator/Business Intelligence Accelerator/BWA/BIA/Son-of-TREX technology. There also are strong similarities to the MPP in-memory row store project H-Store/VoltDB, although I don’t know whether Plattner would go so far as to adopt the H-Store view that all transactions should run in stored procedures. Unsurprisingly, SAP applications are used as the OLTP paradigm throughout.

*Thanks to Dave Kellogg for tipping me off to Plattner’s paper. I only went to two SIGMOD sessions, neither of which was Plattner’s. Nobody actually mentioned Plattner’s talk to me when I was down at SIGMOD.

Perhaps the most interesting part is Plattner’s claim that what’s demanding about OLTP isn’t database updating per se, but rather maintaining aggregates for quick-response analytics. In his main example of that point, Plattner offers a real-life schema of more than 18 tables, of which 2 are base tables and (most of?) the rest are materialized views that his proposed database architecture dispenses with, because analytic performance is sufficiently good without them. Thus, Plattner’s core columnar argument seemingly is:

columnar → natively fast analytics → no need to maintain aggregates → much lower update burden.

That said — if Plattner’s paper contained a clear statement of how much more expensive it is to insert or update a single row in a columnar vs. row-based system, I overlooked it. Instead, Plattner seems to be arguing that the volume of base-table updates is low enough that — whatever it may be — column-store update overhead is an acceptable price to pay.  (At one point he claims that only 5% of the data inserted in a financial application ever gets changed.) That may actually be true in a financial accounting system, but seems more questionable in a sufficiently large application that gets its updates from automatic devices, or from the consumer web.
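To make the update-overhead question concrete, here’s a toy sketch (mine, not Plattner’s) of why a single-row write is more work in a columnar layout: a row store appends one record to one structure, while a column store must touch every column, and compression or sort order can add further work on top.

```python
# Toy illustration, not any real engine: the work done by a
# single-row insert in a row layout vs. a columnar layout.

rows = []  # row store: all attributes of a record stay together

def row_store_insert(record: tuple) -> None:
    rows.append(record)  # one append to one structure

columns = {"order_id": [], "customer": [], "amount": []}

def column_store_insert(record: dict) -> None:
    # one append per column: N writes for N attributes, before any
    # extra cost for keeping columns compressed or sorted
    for name, values in columns.items():
        values.append(record[name])

row_store_insert((1, "acme", 99.5))
column_store_insert({"order_id": 1, "customer": "acme", "amount": 99.5})
```

Real column stores soften this with a write-optimized delta buffer that gets merged into the main store in batches; Plattner’s proposal relies on a mechanism of that general kind.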

Other highlights include:

July 2, 2009

User data vs. raw disk space as a marketing metric

I tried to post a comment on Daniel Abadi’s blog, but doing so seems to require some sort of registration process, so I’m posting here instead.

In a comment to his post on node scalability, Daniel Abadi argued that disk space is a better metric to use in marketing than (presumably compressed) user data.  Well, I imagine he didn’t quite mean to say that, but that’s actually what he wound up saying, starting from the accurate observation that compression ratios vary wildly from one data set to another, even more than they vary from product to product on the same data.
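A quick invented calculation shows why the two metrics can diverge so sharply; the ratios below are made up purely for illustration.

```python
# Invented numbers, purely illustrative: the same user data implies
# very different raw disk footprints as the compression ratio varies.
user_data_tb = 10.0
for ratio in (2.0, 5.0, 10.0):  # a plausible spread across data sets
    print(f"{user_data_tb} TB user data at {ratio}:1 -> "
          f"{user_data_tb / ratio} TB of disk")
```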

Nonetheless, I favor user data as a metric because:

July 1, 2009

NoSQL?

Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:

As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL.

June 29, 2009

Aster Data enters the appliance game

Aster Data is rolling out a line of nCluster appliances today.  Highlights include:

I don’t have a lot more to add right now, mainly because I wrote at some length about Aster’s non-appliance-specific, non-MapReduce technology and positioning a couple of weeks ago.

June 16, 2009

Aster Data on parallelism

Aster Data’s core claim boils down to “We do parallelism better.” Aster has shied away from saying that for marketing purposes, for fear of the response “Yeah, right, everybody says that.” But when I talked with Mayank Bawa, Steve Wooledge, et al. yesterday, I focused discussions on just that point. Based on that chat and others before, here are some highlights (as I understand them) of what Aster claims, believes, or believes to be differentiated about its nCluster technology:

June 9, 2009

Aster Data sticks by its SQL/MapReduce guns

Aster Data continues to think that MapReduce, integrated with SQL, is an important technology. For example:

I was a big fan of SQL/MapReduce when it was first announced last August. Notwithstanding persuasive examples favoring pure DBMS or pure MapReduce over DBMS/MapReduce integration, I continue to think the SQL/MapReduce idea has great potential.  But I do wish more successful production examples would become visible …
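As a reminder of what the integration buys: a SQL/MapReduce-style table function lets arbitrary procedural code run partition by partition inside a query. Below is a toy Python rendering of sessionization, a canonical example for this style of function; the function name, timeout, and data shapes are all illustrative, and this is not Aster’s actual syntax or API.

```python
# Toy stand-in for a SQL/MapReduce table function: sessionize clicks
# per user, something that is awkward in plain SQL but trivial in
# procedural code running partition by partition.
from itertools import groupby
from operator import itemgetter

TIMEOUT = 60  # seconds of inactivity that ends a session (illustrative)

def sessionize(clicks):
    """clicks: iterable of (user_id, timestamp) tuples."""
    out = []
    # PARTITION BY user_id ... ORDER BY timestamp, done by hand here
    for user, group in groupby(sorted(clicks), key=itemgetter(0)):
        session, last_ts = 0, None
        for _, ts in group:
            if last_ts is not None and ts - last_ts > TIMEOUT:
                session += 1
            out.append((user, ts, session))
            last_ts = ts
    return out

print(sessionize([("u1", 0), ("u1", 30), ("u1", 500), ("u2", 10)]))
```

In nCluster, the analogous computation runs as a parallel table function invoked from SQL, with the partitioning clause doing what the sorted, grouped loop does here.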

June 8, 2009

The future of data marts

Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC). Key observations around the concept — mixing mine and Greenplum’s together — include:

In essence, Greenplum is pitching the story:

When put that starkly, it’s overstated, not least because

Specialized Analytic DBMS != Data Warehouse Appliance

But basically it makes sense, for two main reasons:


June 5, 2009

Greenplum update — Release 3.3 and so on

I visited Greenplum in early April, and talked with them again last night. As I noted in a separate post, there are a couple of subjects I won’t write about today. But that still leaves me free to cover a number of other points about Greenplum, including:

May 12, 2009

How much state is saved when an MPP DBMS node fails?

Mark Callaghan raised an interesting question in the comment thread to my recent Facebook/Hadoop/Hive post:

My question is about how commercial MPP RDBMS vendors recover from single or a small number of node failures during a long running SQL query. Do any of them save enough state to avoid starting the query over?

Honestly, I’d just be guessing at the answer.

Would any vendors or other knowledgeable folks care to take a crack at answering directly?
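While the vendors weigh in, the best-known answer from outside the MPP RDBMS world is Hadoop’s: materialize each task’s intermediate output, so that a node failure forces re-execution of only that node’s tasks rather than the whole job. Here is a minimal sketch of that idea; every name in it is illustrative rather than any vendor’s API.

```python
# Minimal sketch of task-level retry over materialized intermediate
# results, the Hadoop-style answer; all names are illustrative.
import random

class NodeFailure(Exception):
    pass

def flaky_partition_scan(pid: int) -> int:
    # Stand-in for one node scanning/aggregating one partition;
    # randomly fails to simulate losing a node mid-query.
    if random.random() < 0.3:
        raise NodeFailure(pid)
    return sum(range(pid * 100, pid * 100 + 100))  # fake partial result

def run_query(partitions) -> int:
    completed = {}  # partition id -> materialized partial result
    for pid in partitions:
        while pid not in completed:
            try:
                completed[pid] = flaky_partition_scan(pid)
            except NodeFailure:
                # reschedule just this task on a healthy node; work
                # already materialized in `completed` is not redone
                pass
    return sum(completed.values())

print(run_query(range(8)))
```

Classic MPP RDBMS, as far as I know, tend instead to pipeline intermediate results between nodes, which is faster in the common case but leaves little durable state to resume from.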
