Amazon and its cloud
Analysis of Amazon’s role in database and analytic technology, especially via the S3/EC2 cloud computing initiative. Also covered are SimpleDB and Amazon’s role as a technology user. Related subjects include:
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production (this post).
- In general, how “enterprise-ready” is Hadoop?
- HBase 0.92.
The posts depend on each other in various ways.
My clients at Cloudera and Hortonworks have somewhat different views as to the maturity of various pieces of Hadoop technology. In particular:
- Cloudera introduced CDH 4* and Hortonworks introduced HDP 1*, both timed for the recent Hadoop Summit.
- CDH 4 is based mainly on Hadoop 2.0, which Cloudera says it has tested extensively.
- HDP 1 is based on Hadoop 1.0, on the theory that nobody has properly tested Hadoop 2.0, which is still characterized as “alpha”.
- CDH 4 boasts sub-second NameNode failover.
- Hortonworks is partnering with third parties such as VMware to address the high-availability problems caused by failover potentially taking several minutes.
- Hadoop 2.0 and CDH 4 also incorporate improvements to NameNode scalability, HDFS (Hadoop Distributed File System) performance, HBase performance, and HBase functionality.
- As does CDH 4, HDP 1 includes HCatalog, an extension of Hive technology that serves as a more general metadata store. (Edit: Actually, see the comment thread below.)
- Hortonworks thinks HCatalog is a big deal in improving Hadoop data management and connectivity, and already has a Talend partnership based on HCatalog. Cloudera is less sure, especially in HCatalog’s current form.
- HDP 1 includes Ambari, an Apache open source competitor to Cloudera Manager (the closed-source part of Cloudera Enterprise). Hortonworks concedes a functionality gap between Ambari and Cloudera Manager, but perhaps a smaller one than Cloudera sees.
- Hortonworks thinks Ambari being open source means better integration with other management platforms. Cloudera touts the integration features and integrations of Cloudera Manager 4.
- Nobody seems confident that MapReduce 2 is ready for prime time. While it’s in CDH 4, so is MapReduce 1.
*”CDH” stands, due to some trademarking weirdness, for “Cloudera’s Distribution including Apache Hadoop”. “HDP” stands for “Hortonworks Data Platform”.
|Categories: Amazon and its cloud, Cloudera, Hadoop, Hortonworks, MapReduce, Market share and customer counts||10 Comments|
This is part of a three-post series:
The canonical Metamarkets batch ingest pipeline is a bit complicated.
- Data lands on Amazon S3 (uploaded or because it was there all along).
- Metamarkets processes it, primarily via Hadoop and Pig, to summarize and denormalize it, and then puts it back into S3.
- Metamarkets then pulls the data into Hadoop a second time, to get it ready to be put into Druid.
- Druid is notified, and pulls the data from Hadoop at its convenience.
By “get data read to be put into Druid” I mean:
- Build the data segments (recall that Druid manages data in rather large segments).
- Note metadata about the segments.
That metadata is what goes into the MySQL database, which also retains data about shards that have been invalidated. (That part is needed because of the MVCC.)
By “build the data segments” I mean:
- Make the sharding decisions.
- Arrange data columnarly within shard.
- Build a compressed bitmap for each shard.
When things are being done that way, Druid may be regarded as comprising three kinds of servers: Read more
I visited my clients at Cloudera and Hortonworks last week, along with scads of other companies. A few of the takeaways were:
- Cloudera now has 220 employees.
- Cloudera now has over 100 subscription customers.
- Over the past year, Cloudera has more than doubled in size by every reasonable metric.
- Over half of Cloudera’s customers use HBase, vs. a figure of 18+ last July.
- Omer Trajman — who by the way has made a long-overdue official move into technical marketing — can no longer keep count of how many petabyte-scale Hadoop clusters Cloudera supports.
- Cloudera gets the majority of its revenue from subscriptions. However, professional services and training continue to be big businesses too.
- Cloudera has trained over 12,000 people.
- Hortonworks is training people too.
- Hortonworks now has 70 employees, and plans to have 100 or so by the end of this quarter.
- A number of those Hortonworks employees are executives who come from seriously profit-oriented backgrounds. Hortonworks clearly has capitalist intentions.
- Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
- There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.
- I met with 4 young-company clients who I regard as building vertical analytic stacks (WibiData, MarketShare, MetaMarkets, and ClearStory). All 4 are heavily dependent on Hadoop. (The same isn’t as true of older companies who built out a lot of technology before Hadoop was invented.)
- There should be more HBase information at HBaseCon on May 22.
- If MapR still has momentum, nobody I talked with has noticed.
|Categories: Amazon and its cloud, ClearStory Data, Cloud computing, Cloudera, Hadoop, HBase, Hortonworks, MapR, MapReduce, Market share and customer counts, Petabyte-scale data management, WibiData||1 Comment|
Amazon has a page up for what it calls Amazon RDS for Oracle Database. You can rent Amazon instances suitable for running Oracle, and bring your own license (BYOL), or you can rent a “License Included” instance that includes Oracle Standard Edition One (a cheap version of Oracle that is limited to two sockets).
My quick thoughts start:
- Mainly, this isn’t for production usage. But exceptions might arise when:
- An application, from creation to abandonment, is only expected to have a short lifespan, in support of a specific project.
- There is an extreme internal-politics bias to operating versus capital expenses, or something like that, forcing a user department to cloud production deployment even when it doesn’t make much rational sense.
- An application is small enough, or the situation is sufficiently desperate, that any inefficiencies are outweighed by convenience.
- There is non-production appeal. In particular:
- Spinning up a quick cloud instance can make a lot of sense for a developer.
- The same goes if you want to sell an Oracle-based application and need to offer demo/test capabilities.
- The same might go for off-site replication/disaster recovery.
Of course, those are all standard observations every time something that’s basically on-premises software is offered in the cloud. They’re only reinforced by the fact that the only Oracle software Amazon can actually license you is a particularly low-end edition.
And Oracle is indeed on-premises software. In particular, Oracle is hard enough to manage when it’s on your premises, with a known hardware configuration; who would want to try to manage a production instance of Oracle in the cloud?
Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I’m finally finding time to clear my Cassandra/Riptano backlog. I’ll cover the more technical parts below, and the more business- or usage-oriented ones in a companion Cassandra/Riptano post.
Jonathan’s core claims for Cassandra include:
- Cassandra is shared-nothing.
- Cassandra has good approaches to replication and partitioning, right out of the box.
- In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)
- Cassandra’s scale-out is application-transparent, unlike sharded MySQL’s.
- Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.
In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he’s concerned, may well belong in a more traditional SQL DBMS. Read more
|Categories: Amazon and its cloud, Cassandra, DataStax, Facebook, Google, Log analysis, NoSQL, Open source, Parallelization||4 Comments|
To me, CAP should really be PACELC — if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?
and goes on to say
For example, Amazon’s Dynamo (and related systems like Cassandra and SimpleDB) are PA/EL in PACELC — upon a partition, they give up consistency for availability; and under normal operation they give up consistency for lower latency. Giving up C in both parts of PACELC makes the design simpler — once the application is configured to be able to handle inconsistencies, it makes sense to give up consistency for both availability and lower latency.
However, I think Daniel’s improved formulation is still misleading, in at least two ways:
- Daniel implicitly assumes any given NoSQL system makes a fixed set of tradeoffs, when actually — as he in fact notes in his post – some of them offer tradeoffs that are quite tunable.
- I think Daniel is at best oversimplifying when he appears to assert that best-case network latency is an important design criterion for all that many NoSQL systems. Naively, anything that acknowledges reads or writes requires two hops. Two-phase commit (2PC) requires three hops. 33% latency reductions are not the kinds of goals that drive dramatic DBMS redesigns, even though tenths of seconds — i.e. 100s of milliseconds — matter in the kinds of environments where NoSQL is sprouting up.
In which we reveal the fundamental inequality of NoSQL, and why NoSQL folks are so negative about joins.
Discussions of NoSQL design philosophies tend to quickly focus in on the matter of consistency. “Consistency”, however, turns out to be a rather overloaded concept, and confusion often ensues.
In this post I plan to address one essential subject, while ducking various related ones as hard as I can. It’s what Werner Vogel of Amazon called read-your-writes consistency (a term to which I was actually introduced by Justin Sheehy of Basho). It’s either identical or very similar to what is sometimes called immediate consistency, and presumably also to what Amazon has recently called the “read my last write” capability of SimpleDB.
This is something every database-savvy person should know about, but most so far still don’t. I didn’t myself until a few weeks ago.
Considering the many different kinds of consistency outlined in the Werner Vogel link above or in the Wikipedia consistency models article — whose names may not always be used in, er, a wholly consistent manner — I don’t think there’s much benefit to renaming read-your-writes consistency yet again. Rather, let’s just call it RYW consistency, come up with a way to pronounce “RYW”, and have done with it. (I suggest “ree-ooh”, which evokes two syllables from the original phrase. Thoughts?)
Definition: RYW (Read-Your-Writes) consistency is achieved when the system guarantees that, once a record has been updated, any attempt to read the record will return the updated value.
I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I’m poking around a bit reading stuff on the subjects. Here are some links I found. Read more
|Categories: Amazon and its cloud, Cassandra, Continuent, Google, MySQL, NoSQL, Open source, RDF and graphs, Tokutek and TokuDB||5 Comments|
Around the end of the Cold War, Esther Dyson took it upon herself to go repeatedly to Eastern Europe and do a lot of rah-rah and catalysis, hoping to spark software and other computer entrepreneurs. I don’t know how many people’s lives she significantly affected – I’d guess it’s actually quite a few – but in any case the number is not zero. Roman Stanek, who has built and sold a couple of software business, cites her as a key influence setting him on his path.
Roman’s latest venture is business intelligence firm Gooddata. Gooddata was founded in 2007 and has been soliciting and getting attention for a while, so I was surprised to learn that Gooddata officially launched just a few weeks ago. Anyhow, some less technical highlights of the Gooddata story include: Read more
Recently, Amazon CTO Werner Vogels put up a blog post which suggested that, now and in the future, the best way to get large databases into the cloud is via sneakernet. In some circumstances, he is surely right. Possible implications include:
- When sending data to the cloud, you probably want to compress it to the max before sending. Clearpace’s new RainStor structured-data archiving service emphasizes that idea. RainStor marketing says cloud, cloud, cloud — but Clearpace thinks you really should have a bit of its software onsite too, to compress the data before sending it across the wire.
- Getting data from one cloud to another cloud could be problematic. I’m fond of saying that weblog data naturally lives in the cloud at your hosting company’s location, so you should analyze it there too. But this makes the most sense if you analyze it or at least filter/reduce it in place. (That said, the really, really big web companies have lots of different data centers, and presumably do move huge amounts of log data from place to place.)
But for one-time moves of data sets — sure, sneaker net/snail mail should work just fine.
|Categories: Amazon and its cloud, Cloud computing, Database compression, EAI, EII, ETL, ELT, ETLT, Web analytics||2 Comments|