Amazon and its cloud
Analysis of Amazon’s role in database and analytic technology, especially via the S3/EC2 cloud computing initiative. Also covered are SimpleDB and Amazon’s role as a technology user. Related subjects include:
Merv Adrian and Doug Henschen both reported more details about Amazon Redshift than I intend to; see also the comments on Doug’s article. I did talk with Rick Glick of ParAccel a bit about the project, and he noted:
- Amazon Redshift is missing parts of ParAccel, notably the extensibility framework.
- ParAccel did some engineering to make its DBMS run better in the cloud.
- Amazon did some engineering in the areas it knows better than ParAccel — cloud provisioning, cloud billing, and so on.
“We didn’t want to do the deal on those terms” comments from other companies suggest ParAccel’s main financial take from the deal is an already-reported venture investment.
The cloud-related engineering was mainly around communications, e.g. strengthening error detection/correction to make up for the lack of dedicated switches. In general, Rick seemed more positive on running in the (Amazon) cloud than analytic RDBMS vendors have been in the past.
So who should and will use Amazon Redshift? For starters, I’d say: Read more
|Categories: Amazon and its cloud, Business intelligence, Cloud computing, Data mart outsourcing, Data warehousing, Infobright, ParAccel, Predictive modeling and advanced analytics, Pricing, Vertica Systems||4 Comments|
In connection with Amazon’s Redshift announcement, ParAccel reached out, and so I talked with them for the first time in a long while. At the highest level:
- ParAccel now has 60+ customers, up from 30+ two years ago and 40ish soon thereafter.
- ParAccel is now focusing its development and marketing on analytic platform capabilities more than raw database performance.
- ParAccel is focusing on working alongside other analytic data stores — relational or Hadoop — rather than supplanting them.
There wasn’t time for a lot of technical detail, but I gather that the bit about working alongside other data stores:
- Is relatively new.
- Works via SELECT statements that reach out to the other data stores.
- Is called “on-demand integration”.
- Is built in ParAccel’s extensibility/analytic platform framework.
- Uses HCatalog when reaching into Hadoop.
Also, it seems that ParAccel:
- Is in the early stages of writing its own analytic functions.
- Bundles Fuzzy Logix and actually has some users for that.
|Categories: Amazon and its cloud, Cloud computing, Data warehousing, Hadoop, Market share and customer counts, ParAccel, Predictive modeling and advanced analytics, Specific users||5 Comments|
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production (this post).
- In general, how “enterprise-ready” is Hadoop?
- HBase 0.92.
The posts depend on each other in various ways.
My clients at Cloudera and Hortonworks have somewhat different views as to the maturity of various pieces of Hadoop technology. In particular:
- Cloudera introduced CDH 4* and Hortonworks introduced HDP 1*, both timed for the recent Hadoop Summit.
- CDH 4 is based mainly on Hadoop 2.0, which Cloudera says it has tested extensively.
- HDP 1 is based on Hadoop 1.0, on the theory that nobody has properly tested Hadoop 2.0, which is still characterized as “alpha”.
- CDH 4 boasts sub-second NameNode failover.
- Hortonworks is partnering with third parties such as VMware to address the high-availability problems caused by failover potentially taking several minutes.
- Hadoop 2.0 and CDH 4 also incorporate improvements to NameNode scalability, HDFS (Hadoop Distributed File System) performance, HBase performance, and HBase functionality.
- As does CDH 4, HDP 1 includes HCatalog, an extension of Hive technology that serves as a more general metadata store. (Edit: Actually, see the comment thread below.)
- Hortonworks thinks HCatalog is a big deal in improving Hadoop data management and connectivity, and already has a Talend partnership based on HCatalog. Cloudera is less sure, especially in HCatalog’s current form.
- HDP 1 includes Ambari, an Apache open source competitor to Cloudera Manager (the closed-source part of Cloudera Enterprise). Hortonworks concedes a functionality gap between Ambari and Cloudera Manager, but perhaps a smaller one than Cloudera sees.
- Hortonworks thinks Ambari being open source means better integration with other management platforms. Cloudera touts the integration features and integrations of Cloudera Manager 4.
- Nobody seems confident that MapReduce 2 is ready for prime time. While it’s in CDH 4, so is MapReduce 1.
*”CDH” stands, due to some trademarking weirdness, for “Cloudera’s Distribution including Apache Hadoop”. “HDP” stands for “Hortonworks Data Platform”.
|Categories: Amazon and its cloud, Cloudera, Hadoop, Hortonworks, MapReduce, Market share and customer counts||10 Comments|
This is part of a three-post series:
The canonical Metamarkets batch ingest pipeline is a bit complicated.
- Data lands on Amazon S3 (uploaded or because it was there all along).
- Metamarkets processes it, primarily via Hadoop and Pig, to summarize and denormalize it, and then puts it back into S3.
- Metamarkets then pulls the data into Hadoop a second time, to get it ready to be put into Druid.
- Druid is notified, and pulls the data from Hadoop at its convenience.
By “get data read to be put into Druid” I mean:
- Build the data segments (recall that Druid manages data in rather large segments).
- Note metadata about the segments.
That metadata is what goes into the MySQL database, which also retains data about shards that have been invalidated. (That part is needed because of the MVCC.)
By “build the data segments” I mean:
- Make the sharding decisions.
- Arrange data columnarly within shard.
- Build a compressed bitmap for each shard.
When things are being done that way, Druid may be regarded as comprising three kinds of servers: Read more
I visited my clients at Cloudera and Hortonworks last week, along with scads of other companies. A few of the takeaways were:
- Cloudera now has 220 employees.
- Cloudera now has over 100 subscription customers.
- Over the past year, Cloudera has more than doubled in size by every reasonable metric.
- Over half of Cloudera’s customers use HBase, vs. a figure of 18+ last July.
- Omer Trajman — who by the way has made a long-overdue official move into technical marketing — can no longer keep count of how many petabyte-scale Hadoop clusters Cloudera supports.
- Cloudera gets the majority of its revenue from subscriptions. However, professional services and training continue to be big businesses too.
- Cloudera has trained over 12,000 people.
- Hortonworks is training people too.
- Hortonworks now has 70 employees, and plans to have 100 or so by the end of this quarter.
- A number of those Hortonworks employees are executives who come from seriously profit-oriented backgrounds. Hortonworks clearly has capitalist intentions.
- Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
- There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.
- I met with 4 young-company clients who I regard as building vertical analytic stacks (WibiData, MarketShare, MetaMarkets, and ClearStory). All 4 are heavily dependent on Hadoop. (The same isn’t as true of older companies who built out a lot of technology before Hadoop was invented.)
- There should be more HBase information at HBaseCon on May 22.
- If MapR still has momentum, nobody I talked with has noticed.
|Categories: Amazon and its cloud, ClearStory Data, Cloud computing, Cloudera, Hadoop, HBase, Hortonworks, MapR, MapReduce, Market share and customer counts, Petabyte-scale data management, WibiData||1 Comment|
Amazon has a page up for what it calls Amazon RDS for Oracle Database. You can rent Amazon instances suitable for running Oracle, and bring your own license (BYOL), or you can rent a “License Included” instance that includes Oracle Standard Edition One (a cheap version of Oracle that is limited to two sockets).
My quick thoughts start:
- Mainly, this isn’t for production usage. But exceptions might arise when:
- An application, from creation to abandonment, is only expected to have a short lifespan, in support of a specific project.
- There is an extreme internal-politics bias to operating versus capital expenses, or something like that, forcing a user department to cloud production deployment even when it doesn’t make much rational sense.
- An application is small enough, or the situation is sufficiently desperate, that any inefficiencies are outweighed by convenience.
- There is non-production appeal. In particular:
- Spinning up a quick cloud instance can make a lot of sense for a developer.
- The same goes if you want to sell an Oracle-based application and need to offer demo/test capabilities.
- The same might go for off-site replication/disaster recovery.
Of course, those are all standard observations every time something that’s basically on-premises software is offered in the cloud. They’re only reinforced by the fact that the only Oracle software Amazon can actually license you is a particularly low-end edition.
And Oracle is indeed on-premises software. In particular, Oracle is hard enough to manage when it’s on your premises, with a known hardware configuration; who would want to try to manage a production instance of Oracle in the cloud?
Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I’m finally finding time to clear my Cassandra/Riptano backlog. I’ll cover the more technical parts below, and the more business- or usage-oriented ones in a companion Cassandra/Riptano post.
Jonathan’s core claims for Cassandra include:
- Cassandra is shared-nothing.
- Cassandra has good approaches to replication and partitioning, right out of the box.
- In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)
- Cassandra’s scale-out is application-transparent, unlike sharded MySQL’s.
- Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.
In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he’s concerned, may well belong in a more traditional SQL DBMS. Read more
|Categories: Amazon and its cloud, Cassandra, DataStax, Facebook, Google, Log analysis, NoSQL, Open source, Parallelization||4 Comments|
To me, CAP should really be PACELC — if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?
and goes on to say
For example, Amazon’s Dynamo (and related systems like Cassandra and SimpleDB) are PA/EL in PACELC — upon a partition, they give up consistency for availability; and under normal operation they give up consistency for lower latency. Giving up C in both parts of PACELC makes the design simpler — once the application is configured to be able to handle inconsistencies, it makes sense to give up consistency for both availability and lower latency.
However, I think Daniel’s improved formulation is still misleading, in at least two ways:
- Daniel implicitly assumes any given NoSQL system makes a fixed set of tradeoffs, when actually — as he in fact notes in his post – some of them offer tradeoffs that are quite tunable.
- I think Daniel is at best oversimplifying when he appears to assert that best-case network latency is an important design criterion for all that many NoSQL systems. Naively, anything that acknowledges reads or writes requires two hops. Two-phase commit (2PC) requires three hops. 33% latency reductions are not the kinds of goals that drive dramatic DBMS redesigns, even though tenths of seconds — i.e. 100s of milliseconds — matter in the kinds of environments where NoSQL is sprouting up.
In which we reveal the fundamental inequality of NoSQL, and why NoSQL folks are so negative about joins.
Discussions of NoSQL design philosophies tend to quickly focus in on the matter of consistency. “Consistency”, however, turns out to be a rather overloaded concept, and confusion often ensues.
In this post I plan to address one essential subject, while ducking various related ones as hard as I can. It’s what Werner Vogel of Amazon called read-your-writes consistency (a term to which I was actually introduced by Justin Sheehy of Basho). It’s either identical or very similar to what is sometimes called immediate consistency, and presumably also to what Amazon has recently called the “read my last write” capability of SimpleDB.
This is something every database-savvy person should know about, but most so far still don’t. I didn’t myself until a few weeks ago.
Considering the many different kinds of consistency outlined in the Werner Vogel link above or in the Wikipedia consistency models article — whose names may not always be used in, er, a wholly consistent manner — I don’t think there’s much benefit to renaming read-your-writes consistency yet again. Rather, let’s just call it RYW consistency, come up with a way to pronounce “RYW”, and have done with it. (I suggest “ree-ooh”, which evokes two syllables from the original phrase. Thoughts?)
Definition: RYW (Read-Your-Writes) consistency is achieved when the system guarantees that, once a record has been updated, any attempt to read the record will return the updated value.
I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I’m poking around a bit reading stuff on the subjects. Here are some links I found. Read more
|Categories: Amazon and its cloud, Cassandra, Continuent, Google, MySQL, NoSQL, Open source, RDF and graphs, Tokutek and TokuDB||5 Comments|