Amazon and its cloud
Analysis of Amazon’s role in database and analytic technology, especially via the S3/EC2 cloud computing initiative. Also covered are SimpleDB and Amazon’s role as a technology user. Related subjects include:
Quick thoughts on Oracle-on-Amazon
Amazon has a page up for what it calls Amazon RDS for Oracle Database. You can rent Amazon instances suitable for running Oracle, and bring your own license (BYOL), or you can rent a “License Included” instance that includes Oracle Standard Edition One (a cheap version of Oracle that is limited to two sockets).
My quick thoughts start:
- Mainly, this isn’t for production usage. But exceptions might arise when:
- An application, from creation to abandonment, is only expected to have a short lifespan, in support of a specific project.
- There is an extreme internal-politics bias to operating versus capital expenses, or something like that, forcing a user department to cloud production deployment even when it doesn’t make much rational sense.
- An application is small enough, or the situation is sufficiently desperate, that any inefficiencies are outweighed by convenience.
- There is non-production appeal. In particular:
- Spinning up a quick cloud instance can make a lot of sense for a developer.
- The same goes if you want to sell an Oracle-based application and need to offer demo/test capabilities.
- The same might go for off-site replication/disaster recovery.
Of course, those are all standard observations every time something that’s basically on-premises software is offered in the cloud. They’re only reinforced by the fact that the only Oracle software Amazon can actually license you is a particularly low-end edition.
And Oracle is indeed on-premises software. In particular, Oracle is hard enough to manage when it’s on your premises, with a known hardware configuration; who would want to try to manage a production instance of Oracle in the cloud?
| Categories: Amazon and its cloud, Cloud computing, Oracle | 6 Comments |
Cassandra technical overview
Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I’m finally finding time to clear my Cassandra/Riptano backlog. I’ll cover the more technical parts below, and the more business- or usage-oriented ones in a companion Cassandra/Riptano post.
Jonathan’s core claims for Cassandra include:
- Cassandra is shared-nothing.
- Cassandra has good approaches to replication and partitioning, right out of the box.
- In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)
- Cassandra’s scale-out is application-transparent, unlike sharded MySQL’s.
- Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.
In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he’s concerned, may well belong in a more traditional SQL DBMS. Read more
| Categories: Amazon and its cloud, Cassandra, DataStax, Facebook, Google, Log analysis, NoSQL, Open source, Parallelization | 4 Comments |
Daniel Abadi on NoSQL design tradeoffs
In a thought-provoking post, Daniel Abadi points out NoSQL-related terminological problems similar to the ones I just railed against, and argues
To me, CAP should really be PACELC — if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?
and goes on to say
For example, Amazon’s Dynamo (and related systems like Cassandra and SimpleDB) are PA/EL in PACELC — upon a partition, they give up consistency for availability; and under normal operation they give up consistency for lower latency. Giving up C in both parts of PACELC makes the design simpler — once the application is configured to be able to handle inconsistencies, it makes sense to give up consistency for both availability and lower latency.
However, I think Daniel’s improved formulation is still misleading, in at least two ways:
- Daniel implicitly assumes any given NoSQL system makes a fixed set of tradeoffs, when actually — as he in fact notes in his post – some of them offer tradeoffs that are quite tunable.
- I think Daniel is at best oversimplifying when he appears to assert that best-case network latency is an important design criterion for all that many NoSQL systems. Naively, anything that acknowledges reads or writes requires two hops. Two-phase commit (2PC) requires three hops. 33% latency reductions are not the kinds of goals that drive dramatic DBMS redesigns, even though tenths of seconds — i.e. 100s of milliseconds — matter in the kinds of environments where NoSQL is sprouting up.
| Categories: Amazon and its cloud, Cassandra, NoSQL, Theory and architecture | 2 Comments |
Read-your-writes (RYW), aka immediate, consistency
In which we reveal the fundamental inequality of NoSQL, and why NoSQL folks are so negative about joins.
Discussions of NoSQL design philosophies tend to quickly focus in on the matter of consistency. “Consistency”, however, turns out to be a rather overloaded concept, and confusion often ensues.
In this post I plan to address one essential subject, while ducking various related ones as hard as I can. It’s what Werner Vogel of Amazon called read-your-writes consistency (a term to which I was actually introduced by Justin Sheehy of Basho). It’s either identical or very similar to what is sometimes called immediate consistency, and presumably also to what Amazon has recently called the “read my last write” capability of SimpleDB.
This is something every database-savvy person should know about, but most so far still don’t. I didn’t myself until a few weeks ago.
Considering the many different kinds of consistency outlined in the Werner Vogel link above or in the Wikipedia consistency models article — whose names may not always be used in, er, a wholly consistent manner — I don’t think there’s much benefit to renaming read-your-writes consistency yet again. Rather, let’s just call it RYW consistency, come up with a way to pronounce “RYW”, and have done with it. (I suggest “ree-ooh”, which evokes two syllables from the original phrase. Thoughts?)
Definition: RYW (Read-Your-Writes) consistency is achieved when the system guarantees that, once a record has been updated, any attempt to read the record will return the updated value.
| Categories: Amazon and its cloud, NoSQL, OLTP, Parallelization, Theory and architecture | 11 Comments |
Some NoSQL links
I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I’m poking around a bit reading stuff on the subjects. Here are some links I found. Read more
| Categories: Amazon and its cloud, Cassandra, Continuent, Google, MySQL, NoSQL, Open source, RDF and graphs, Tokutek | 5 Comments |
Introduction to Gooddata
Around the end of the Cold War, Esther Dyson took it upon herself to go repeatedly to Eastern Europe and do a lot of rah-rah and catalysis, hoping to spark software and other computer entrepreneurs. I don’t know how many people’s lives she significantly affected – I’d guess it’s actually quite a few – but in any case the number is not zero. Roman Stanek, who has built and sold a couple of software business, cites her as a key influence setting him on his path.
Roman’s latest venture is business intelligence firm Gooddata. Gooddata was founded in 2007 and has been soliciting and getting attention for a while, so I was surprised to learn that Gooddata officially launched just a few weeks ago. Anyhow, some less technical highlights of the Gooddata story include: Read more
Sneakernet to the cloud
Recently, Amazon CTO Werner Vogels put up a blog post which suggested that, now and in the future, the best way to get large databases into the cloud is via sneakernet. In some circumstances, he is surely right. Possible implications include:
- When sending data to the cloud, you probably want to compress it to the max before sending. Clearpace’s new RainStor structured-data archiving service emphasizes that idea. RainStor marketing says cloud, cloud, cloud — but Clearpace thinks you really should have a bit of its software onsite too, to compress the data before sending it across the wire.
- Getting data from one cloud to another cloud could be problematic. I’m fond of saying that weblog data naturally lives in the cloud at your hosting company’s location, so you should analyze it there too. But this makes the most sense if you analyze it or at least filter/reduce it in place. (That said, the really, really big web companies have lots of different data centers, and presumably do move huge amounts of log data from place to place.)
But for one-time moves of data sets — sure, sneaker net/snail mail should work just fine.
| Categories: Amazon and its cloud, Cloud computing, Database compression, EAI, EII, ETL, ELT, ETLT, Web analytics | 2 Comments |
Maybe Amazon should be using a real DBMS after all
Amazon managers found that an employee who happened to work in France had filled out a field incorrectly and more than 50,000 items got flipped over to be flagged as “adult,” the source said. (Technically, the flag for adult content was flipped from ‘false’ to ‘true.’)
“It’s no big policy change, just some field that’s been around forever filled out incorrectly,” the source said.
Amazon employees worked on the problem well past midnight, and then handed it over to an international team, he said.
This was the best practice for reversing an error — how? Is SimpleDB somehow implicated? If this story is remotely true, and if there’s a sensible database architecture, I can’t imagine why there wouldn’t be a faster fix.
| Categories: Amazon and its cloud | 7 Comments |
Amazon Elastic MapReduce
Amazon is introducing a beta of Amazon Elastic MapReduce. What it boils down to is cheap, on-demand Hadoop.
This seems like a great way to experiment with MapReduce and see if you like it. But for serious use, I don’t know why you wouldn’t prefer MapReduce more closely integrated into a DBMS.
| Categories: Amazon and its cloud, Cloud computing, MapReduce | 1 Comment |
April Fool’s Day highlights
Amazon says it’s taking “cloud” computing to new heights, as it were.
Derivative funds and large government-subsidized entities will be especially interested in FACE’s transmodal operation. They can allocate a dedicated FACE, load it up with data, and then send it out to sea to perform advanced processing in safety. The government will have absolutely no chance of acting against them, because they will be too busy trying to decide which Federal Air Regulation (FAR) was violated, not to mention scheduling news conferences.
First excellent April Fool’s joke I saw this year was from The Guardian. The best so far is from Expedia. Others are linked in my Twitter feed. And personally, I’m encouraging the concept of April No-Fooling Day.
| Categories: Amazon and its cloud, Cloud computing, Humor | 1 Comment |
