July 29, 2009

What are the best choices for scaling Postgres?

March, 2011 edit: In its quaintness, this post is a reminder of just how fast Short Request Processing DBMS technology has been moving ahead.  If I had to do it all over again, I’d suggest they use one of the high-performance MySQL options like dbShards, Schooner, or both together.  I actually don’t know what they finally decided on in that area. (I do know that for analytic DBMS they chose Vertica.)

I have a client who wants to build a new application with peak update volume of several million transactions per hour.  (Their base business is data mart outsourcing, but now they’re building update-heavy technology as well.)  They have a small budget.  They’ve been a MySQL shop in the past, but would prefer to contract (not eliminate) their use of MySQL rather than expand it.

My client actually signed a deal for EnterpriseDB’s Postgres Plus Advanced Server and GridSQL, but unwound the transaction quickly. (They say EnterpriseDB was very gracious about the reversal.) There seem to have been two main reasons for the flip-flop.  First, it seems that EnterpriseDB’s version of Postgres isn’t up to PostgreSQL’s 8.4 feature set yet, although EnterpriseDB’s timetable for catching up might have been tolerable. But GridSQL apparently is further behind still, with no timetable for up-to-date PostgreSQL compatibility.  That was the dealbreaker.

The current base-case plan is to use generic open source PostgreSQL, with scale-out achieved via hand sharding, Hibernate, or … ??? Experience and thoughts along those lines would be much appreciated.
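A minimal sketch of what hand sharding’s routing layer might look like, in Python. The shard list and key format here are hypothetical; a real deployment would map shard names to PostgreSQL connection strings and send each query through the chosen connection.

```python
import hashlib

# Hypothetical list of PostgreSQL shards; in practice these would be DSNs.
SHARDS = ["pg_shard_0", "pg_shard_1", "pg_shard_2", "pg_shard_3"]

def shard_for(key: str) -> str:
    # Use a stable hash (not Python's per-process salted hash()) so the
    # same key always routes to the same shard across restarts.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

The catch, discussed in the comments below, is that changing the number of shards remaps almost every key, which is why lookup tables and consistent hashing come up as alternatives.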

Another option for OLTP performance and scale-out is of course memory-centric options such as VoltDB or the Groovy SQL Switch.  But this client’s database is terabyte-scale, so hardware costs could be an issue, as of course could be product maturity.

By the way, a large fraction of these updates will be actual changes, as opposed to new records, in case that matters.  I expect that the schema being updated will be very simple — i.e., clearly simpler than in a classic order entry scenario.


30 Responses to “What are the best choices for scaling Postgres?”

  1. Fazal Majid on July 29th, 2009 3:11 am

    One word: SSD. I have a 320GB FusionIO drive on loan from HP, and I am seeing 8,000 to 11,000 transactions per second on my application benchmark on it with PostgreSQL, with comparable numbers on Oracle.

  2. Peter on July 29th, 2009 3:43 am

    Pricing might be an issue with Fusion I/O …..

  3. Curt Monash on July 29th, 2009 4:05 am

    Good idea. One would think an SSD could be cheaper than an equivalent amount of RAM, especially if you wind up needing extra CPUs to drive that RAM.

    Um — is that really true?

  4. Tony Bain on July 29th, 2009 4:44 am

    I would be very cautious about going down the SSD/EFD route for write intensive databases. SSD still seems to lag FC disk in write performance. At best you will have to have a careful mix of SSD and FC Disk.

    The requirements will come down to how big the peaks are & how much data is being updated in a single transaction of course. But while I don’t have a Postgres example, I do know of examples on other RDBMS platforms with write intensive databases processing ~2500-3500 tps on pretty modest hardware (<$100k). One example I have in mind is an 800GB OLTP DB.

  5. Imre Samu on July 29th, 2009 11:44 am

It is worth mentioning HadoopDB as a future choice [an open source stack that includes PostgreSQL, Hadoop, and Hive].

  6. Fazal Majid on July 29th, 2009 11:59 am

    Well, FusionIO is expensive, but our application also calls for ultra low latency, not just improvements in throughput (measured transaction time is mean 0.9ms with 1.4ms standard deviation). 10K rpm FC drives do have better write throughput but still exhibit worse latency.

    Some of the best practices I have read for Oracle on TMS RamSan DRAM-based SAN SSDs recommend storing redo logs on disk (as sequential I/O throughput is not a limiting factor) and using SSDs for datafiles and indexes. Then again, with PCIe or SATA SSDs, the latency of a FC SAN is no longer negligible compared to the disk’s.

    The main problem with FusionIO or other PCIe-attached devices is limited fanout, and lack of HA options beyond DataGuard or the PostgreSQL equivalent.

I did perform some testing with inexpensive Intel X25-E drives (about $10/GB) and achieved about 2,000 tps with a single drive set up with ZFS.

    I expect much of the in-memory database market is going to shift to SSDs instead. SSD costs are falling fast, and RDBMS technology, while not optimized for SSDs, is still far more mature and manageable than clustered in-memory databases like Scalaris or Cassandra. You can also have full ACID-compliant databases, with an emphasis on D, whereas IMDBs tend to offer only checkpoints for persistence, i.e. some risk of data loss.

  7. Curt Monash on July 29th, 2009 12:04 pm

    New-generation memory-centric DBMS, I’m pretty sure, are ACID-compliant — the Groovy SQL Switch, VoltDB, probably solidDB (not so new) as well. But I digress …

  8. eggyknap on July 29th, 2009 12:30 pm

    Skype built PL/Proxy precisely because they needed to shard PostgreSQL instances fairly seamlessly.

  9. RC on July 29th, 2009 12:38 pm

    @Imre Samu

Is it possible to use HadoopDB in an OLTP environment? We are talking about “several million transactions per hour”!

  10. Shawn Fox on July 29th, 2009 1:04 pm

One thing to look out for, if you are going to roll your own, is that it is generally very difficult to scale your system by adding new nodes once you have built it. You should run multiple nodes per server (say, 4 nodes per server) so that you can later move some of the nodes to new servers to gain capacity, instead of having to re-shard when adding new servers.

  11. Curt Monash on July 29th, 2009 1:46 pm

    Why can’t you have some kind of a look-up table that lets you refactor the nodes as you desire?

    Still, great point on having extra shards just in case.

  12. Daniel Abadi on July 29th, 2009 1:55 pm

    Hi Imre,

    Thanks for the HadoopDB mention. At least for now, HadoopDB only supports bulk loads, so I wouldn’t recommend using it for this workload. We’re really focused on the analytical data management space.

    I don’t know much about Groovy, but I agree with Curt that this is exactly the type of application VoltDB is designed for.

  13. Edward on July 29th, 2009 2:21 pm

    If they’re willing to use Hibernate then “Hibernate Shards” is the project to look at. It provides sharding while being compatible with traditional Hibernate.

  14. RC on July 29th, 2009 2:27 pm

    @Shawn Fox,

    I thought that consistent hashing (http://michaelnielsen.org/blog/?p=613) has solved the problem of adding new nodes.
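    For concreteness, a minimal consistent-hash ring in Python, along the lines of the linked article (node names and the virtual-point count are illustrative):

```python
import bisect
import hashlib

def _h(s: str) -> int:
    # Stable hash for placing both nodes and keys on the ring.
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

class Ring:
    """Minimal consistent-hash ring sketch."""
    def __init__(self, nodes, vnodes=100):
        # Give each node several virtual points to even out the load.
        self._points = sorted((_h(f"{n}#{i}"), n)
                              for n in nodes for i in range(vnodes))
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        # First ring point clockwise of the key's hash, wrapping around.
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]
```

    Adding a node claims only the ring segments adjacent to its points, so most keys stay where they are; as Shawn notes below, though, the keys in those segments still have to move.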

  15. Bob Zurek on July 29th, 2009 3:33 pm

    Maybe this is something that could help the client.

  16. Jonah H. Harris on July 29th, 2009 5:36 pm

    Why is it that RethinkDB is trending so much these past few days? I can’t believe how fast people focus on shiny new toys and fail to realize that these toys are nothing new at all. Append-only, log-structured systems like RethinkDB have been around for a long time. Perhaps throwing in the words “lock free” makes everyone think it’s great.

    Based on their feature set, I see *nothing* new here. Rather than continuing to promote old-but-marketed-as-new technology, I wish people would do some personal research and move along.

  17. Curt Monash on July 29th, 2009 6:23 pm

    Well, append-now-and-clean-up-the-disk-later (if at all) is a reasonable strategy for a lot of apps, including this client’s. It seems to fail largely where there’s some kind of “balance” — account, inventory, whatever — or other aggregate that needs to be kept current on a transactional basis.

  18. Shawn Fox on July 29th, 2009 7:27 pm

    A lookup table mechanism works. I remember reading an article a while back written by someone from either myspace or facebook and that was how they did it. It does allow you to add new shards on the fly as long as you don’t have to move data around (just put new data on the new shards). Moving data is possible but it would still take some significant coding to make it happen cleanly.

    The lookup table mechanism does introduce additional latency though. If you have to do a database lookup across a network connection to find the right node to put data on, that is going to add 1 ms or so to each transaction. Not a problem for a social networking site and it doesn’t limit scalability, but the latency could be a problem for some applications.

    If the lookup database was also spread across the shards based on a fixed shard lookup (no dictionary) then scalability should never be a problem. You wouldn’t be able to spread the lookup database across new shards but I doubt you’d ever need to.

    The lookup database system is also limited to situations where the ratio of data to the lookup key is high (for example a userid where you have 1000s of records associated with the customer). It doesn’t work if the ratio is low because then the lookup database is nearly as big as the data you are storing.
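    Shawn’s directory approach can be sketched in a few lines of Python (the dict stands in for the lookup database, and the shard names are placeholders):

```python
# Explicit user_id -> shard mapping, consulted on each transaction.
# In production this lives in its own small database; a dict stands in here.
directory = {}
shards = ["shard0", "shard1"]

def place(user_id: int) -> str:
    # New users go to the newest shard; existing users stay put,
    # so adding a shard never forces data to move.
    if user_id not in directory:
        directory[user_id] = shards[-1]
    return directory[user_id]

first = place(42)        # new user, assigned to the newest shard
shards.append("shard2")  # capacity added on the fly
second = place(42)       # existing user stays where they were
third = place(99)        # new user lands on the new shard
```

    New shards absorb only new data, so nothing moves; the cost is the extra lookup per transaction that Shawn mentions.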

    As to the other post about consistent hashing: it doesn’t solve the problem of having to move data around to add a node, it just makes it easier to do.

  19. Curt Monash on July 29th, 2009 8:15 pm

    ZIP code would work fine for this lookup table. The whole app is geographically-based. There’s a composite key that might work even a little better.

  20. paul on July 29th, 2009 9:00 pm

    Aster Data and Greenplum are both based on Postgres, so swapping one in should be easy. Are they too expensive?

  21. Curt Monash on July 29th, 2009 11:27 pm

    They’re not optimized for OLTP. Indeed, I’m not sure one gets much benefit from the parallelism in those systems if one’s doing transactional updates.

  22. Michalis on July 30th, 2009 7:54 am

    So, you’re actually asking your readers for what your client is paying you for..?

  23. Curt Monash on July 30th, 2009 10:51 am


    Therein lies a tale about the real value consultants bring. A big part is “You know, that’s an interesting question you hired me to answer, but there’s actually a much more important one that needs answering too …”

    Beyond that, my learning doesn’t stop when the consulting project begins.

  24. Peter on July 30th, 2009 2:22 pm

    Just wondering. I went to a Jobnob happy hour yesterday. Most of the startups go the LAMP route. Why is there no LAPP stack? I wonder, since Oracle has their hands on MySQL, shouldn’t it be called LAOP – or OLAP, because the big one always comes first? 🙂

    So is there any reason for not having a LAPP?

  25. Curt Monash on July 30th, 2009 2:38 pm

    I don’t think anybody focused in time on making PostgreSQL easy to download and install. Packaging was a huge part of why MySQL got ahead.

  26. Zman on July 31st, 2009 12:14 am

    I haven’t used this product but you might want to take a look at CodeFuture’s dbShards

    Continuent’s Tungsten Enterprise might also work for you. I’m referring to its “parallel replication to support sites with high write rates and/or sharded data sets” feature.

    I have no first-hand experience with either so this falls under the thought category not the experience one 🙂

  27. Fazal Majid on July 31st, 2009 5:13 am

    Not to start a flame war, but I don’t find MySQL any easier to install than PostgreSQL (or less so). Then again I insist on building from source myself on Solaris. Postgres is also easier to administer, and I much prefer pgAdmin to phpMyAdmin.

    I suspect a big reason for MySQL’s popularity stems from the fact it used to be bundled with PHP. That, combined with its multithreaded architecture, which scales down well on low-end machines, even if it hits a performance wall on larger configurations.

  28. Log Buffer #156: a Carnival of the Vanities for DBAs | Pythian Group Blog on July 31st, 2009 6:33 pm

    […] Monash of DBMS2 asks, what are the best choices for scaling Postgres? “I have a client who wants to build a new application with peak update volume of several […]

  29. jt on August 20th, 2009 2:20 pm

    It depends on how they want to scale out. They could have gone with EnterpriseDB and used InfiniteCache. InfiniteCache is based on memcached, and you can scale out memory across as many servers as you want. So you get one giant data buffer cache, essentially caching the whole database.

    But if the 8.4 feature set was a no-go, then stick with standard Postgres and go with PL/Proxy. That is what Skype does.

  30. Cory Isaacson on June 2nd, 2010 12:23 pm

    Our dbShards product can in fact help with this. We use a concept called virtual shards that lets you split a shard when further scalability is required.

