Amazon and its cloud
Analysis of Amazon’s role in database and analytic technology, especially via the S3/EC2 cloud computing initiative. Also covered are SimpleDB and Amazon’s role as a technology user. Related subjects include:
Let’s start with some terminology biases:
- I dislike the term “big data” but like the Vs that define it — Volume, Velocity, Variety and Variability.
- Though I think it’s silly, I understand why BI innovators flee from the term “business intelligence” (they’re afraid of not sounding new).
So when my clients at Zoomdata told me that they’re in the business of providing “the fastest visual analytics for big data”, I understood their choice, but rolled my eyes anyway. And then I immediately started to check how their strategy actually plays against the “big data” Vs.
It turns out that:
- Zoomdata does its processing server-side, which allows for load-balancing and scale-out. Scale-out and claims of great query speed are relevant when data is of high volume.
- Zoomdata depends heavily on Spark.
- Zoomdata’s UI assumes data can be a mix of historical and streaming, and that if looking at streaming data you might want to also check history. This addresses velocity.
- Zoomdata assumes data can be in a variety of data stores, including:
- Relational (operational RDBMS, analytic RDBMS, or SQL-on-Hadoop).
- Files (generic HDFS — Hadoop Distributed File System or S3).*
- NoSQL (MongoDB and HBase were mentioned).
- Search (Elasticsearch was mentioned among others).
- Zoomdata also tries to detect data variability.
- Zoomdata is OEM/embedding-friendly.
*The HDFS/S3 aspect seems to be a major part of Zoomdata’s current story.
Core aspects of Zoomdata’s technical strategy include: Read more
I talked with my clients at MemSQL about the release of MemSQL 4.0. Let’s start with the reminders:
- MemSQL started out as in-memory OTLP (OnLine Transaction Processing) DBMS …
- … but quickly positioned with “We also do ‘real-time’ analytic processing” …
- … and backed that up by adding a flash-based column store option …
- … before Gartner ever got around to popularizing the term HTAP (Hybrid Transaction and Analytic Processing).
- There’s also a JSON option.
The main new aspects of MemSQL 4.0 are:
- Geospatial indexing. This is for me the most interesting part.
- A new optimizer and, I suppose, query planner …
- … which in particular allow for serious distributed joins.
- Some rather parallel-sounding connectors to Spark. Hadoop and Amazon S3.
- Usual-suspect stuff including:
- More SQL coverage (I forgot to ask for details).
- Some added or enhanced administrative/tuning/whatever tools (again, I forgot to ask for details).
- Surely some general Bottleneck Whack-A-Mole.
There’s also a new free MemSQL “Community Edition”. MemSQL hopes you’ll experiment with this but not use it in production. And MemSQL pricing is now wholly based on RAM usage, so the column store is quasi-free from a licensing standpoint is as well.
|Categories: Amazon and its cloud, Columnar database management, Databricks, Spark and BDAS, GIS and geospatial, Hadoop, Investment research and trading, Market share and customer counts, MemSQL, NewSQL, Pricing, Structured documents||8 Comments|
I chatted last night with Ion Stoica, CEO of my client Databricks, for an update both on his company and Spark. Databricks’ actual business is Databricks Cloud, about which I can say:
- Databricks Cloud is:
- Currently running on Amazon only.
- Not dependent on Hadoop.
- Databricks Cloud, despite having a 1.0 version number, is not actually in general availability.
- Even so, there are a non-trivial number of paying customers for Databricks Cloud. (Ion gave me an approximate number, but is keeping it NDA until Spark Summit East.)
- Databricks Cloud gets at data from S3 (most commonly), Redshift, Elastic MapReduce, and perhaps other sources I’m forgetting.
- Databricks Cloud was initially focused on ad-hoc use. A few days ago the capability was added to schedule jobs and so on.
- Unsurprisingly, therefore, Databricks Cloud has been used to date mainly for data exploration/visualization and ETL (Extract/Transform/Load). Visualizations tend to be scripted/programmatic, but there’s also an ODBC driver used for Tableau access and so on.
- Databricks Cloud customers are concentrated (but not unanimously so) in the usual-suspect internet-centric business sectors.
- The low end of the amount of data Databricks Cloud customers are working with is 100s of gigabytes. This isn’t surprising.
- The high end of the amount of data Databricks Cloud customers are working with is petabytes. That did surprise me, and in retrospect I should have pressed for details.
I do not expect all of the above to remain true as Databricks Cloud matures.
Ion also said that Databricks is over 50 people, and has moved its office from Berkeley to San Francisco. He also offered some Spark numbers, such as: Read more
|Categories: Amazon and its cloud, Cloud computing, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Parallelization, Petabyte-scale data management, Predictive modeling and advanced analytics, Software as a Service (SaaS)||6 Comments|
While I don’t find the Open Data Platform thing very significant, an associated piece of news seems cooler — Pivotal is open sourcing a bunch of software, with Greenplum as the crown jewel. Notes on that start:
- Greenplum has been an on-again/off-again low-cost player since before its acquisition by EMC, but open source is basically a commitment to having low license cost be permanently on.
- In most regards, “free like beer” is what’s important here, not “free like speech”. I doubt non-Pivotal employees are going to do much hacking on the long-closed Greenplum code base.
- That said, Greenplum forked PostgreSQL a long time ago, and the general PostgreSQL community might gain ideas from some of the work Greenplum has done.
- The only other bit of newly open-sourced stuff I find interesting is HAWQ. Redis was already open source, and I’ve never been persuaded to care about GemFire.
Greenplum, let us recall, is a pretty decent MPP (Massively Parallel Processing) analytic RDBMS. Various aspects of it were oversold at various times, and I’ve never heard that they actually licked concurrency. But Greenplum has long had good SQL coverage and petabyte-scale deployments and a columnar option and some in-database analytics and so on; i.e., it’s legit. When somebody asks me about open source analytic RDBMS to consider, I expect Greenplum to consistently be on the short list.
Further, the low-cost alternatives for analytic RDBMS are adding up. Read more
|Categories: Amazon and its cloud, Citus Data, Data warehouse appliances, EAI, EII, ETL, ELT, ETLT, EMC, Greenplum, Hadoop, Infobright, MonetDB, Open source, Pricing||5 Comments|
Hortonworks, IBM, EMC Pivotal and others have announced a project called “Open Data Platform” to do … well, I’m not exactly sure what. Mainly, it sounds like:
- An attempt to minimize the importance of any technical advantages Cloudera or MapR might have.
- A face-saving way to admit that IBM’s and Pivotal’s insistence on having their own Hadoop distributions has been silly.
- An excuse for press releases.
- A source of an extra logo graphic to put on marketing slides.
Edit: Now there’s a press report saying explicitly that Hortonworks is taking over Pivotal’s Hadoop distro customers (which basically would mean taking over the support contracts and then working to migrate them to Hortonworks’ distro).
The claim is being made that this announcement solves some kind of problem about developing to multiple versions of the Hadoop platform, but to my knowledge that’s a problem rarely encountered in real life. When you already have a multi-enterprise open source community agreeing on APIs (Application Programming interfaces), what API inconsistency remains for a vendor consortium to painstakingly resolve?
Anyhow, it now seems clear that if you want to use a Hadoop distribution, there are three main choices:
- Cloudera’s flavor, whether as software (from Cloudera) or in an appliance (e.g. from Oracle).
- MapR’s flavor, as software from MapR.
- Hortonworks’ flavor, from a number of vendors, including Hortonworks, IBM, Pivotal, Teradata et al.
In saying that, I’m glossing over a few points, such as: Read more
|Categories: Amazon and its cloud, Cloudera, EMC, Emulation, transparency, portability, Greenplum, Hadoop, Hortonworks, IBM and DB2, MapR, Open source||11 Comments|
It seems reasonable to wonder whether analytic data management is headed for the cloud. In no particular order:
- Amazon Redshift appears to be prospering.
- So are some SaaS (Software as a Service) business intelligence vendors.
- Amazon Elastic MapReduce is still around.
- Snowflake Computing launched with a cloud strategy.
- Cazena, with vague intentions for cloud data warehousing, destealthed.*
- Cloudera made various cloud-related announcements.
- Data is increasingly machine-generated, and machine-generated data commonly originates off-premises.
- The general argument for cloud-or-at-least-colocation has compelling aspects.
- Analytic workloads can be “bursty”, and so could benefit from true cloud elasticity.
I talked with the Snowflake Computing guys Friday. For starters:
- Snowflake is offering an analytic DBMS on a SaaS (Software as a Service) basis.
- The Snowflake DBMS is built from scratch (as opposed, to for example, being based on PostgreSQL or Hadoop).
- The Snowflake DBMS is columnar and append-only, as has become common for analytic RDBMS.
- Snowflake claims excellent SQL coverage for a 1.0 product.
- Snowflake, the company, has:
- 50 people.
- A similar number of current or past users.
- 5 referenceable customers.
- 2 techie founders out of Oracle, plus Marcin Zukowski.
- Bob Muglia as CEO.
Much of the Snowflake story can be summarized as cloud/elastic/simple/cheap.*
*Excuse me — inexpensive. Companies rarely like their products to be labeled as “cheap”.
In addition to its purely relational functionality, Snowflake accepts poly-structured data. Notes on that start:
- Ingest formats are JSON, XML or AVRO for now.
- I gather that the system automagically decides which fields/attributes are sufficiently repeated to be broken out as separate columns; also, there’s a column for the documents themselves.
I don’t know enough details to judge whether I’d call that an example of schema-on-need.
A key element of Snowflake’s poly-structured data story seems to be lateral views. I’m not too clear on that concept, but I gather: Read more
|Categories: Amazon and its cloud, Cloud computing, Data mart outsourcing, Data models and architecture, Data warehousing, Market share and customer counts, Parallelization, Pricing, Software as a Service (SaaS), Structured documents||2 Comments|
1. I wish I had some good, practical ideas about how to make a political difference around privacy and surveillance. Nothing else we discuss here is remotely as important. I presumably can contribute an opinion piece to, more or less, the technology publication(s) of my choice; that can have a small bit of impact. But I’d love to do better than that. Ideas, anybody?
2. A few thoughts on cloud, colocation, etc.:
- The economies of scale of colocation-or-cloud over operating your own data center are compelling. Most of the reasons you outsource hardware manufacture to Asia also apply to outsourcing data center operation within the United States. (The one exception I can think of is supply chain.)
- The arguments for cloud specifically over colocation are less persuasive. Colo providers can even match cloud deployments in rapid provisioning and elastic pricing, if they so choose.
- Surely not coincidentally, I am told that Rackspace is deemphasizing cloud, reemphasizing colocation, and making a big deal out of Open Compute. In connection with that, Rackspace has pulled back from its leadership role in OpenStack.
- I’m hearing much more mention of Amazon Redshift than I used to. It seems to have a lot of traction as a simple and low-cost option.
- I’m hearing less about Elastic MapReduce than I used to, although I imagine usage is still large and growing.
- In general, I get the impression that progress is being made in overcoming the inherent difficulties in cloud (and even colo) parallel analytic processing. But it all still seems pretty vague, except for the specific claims being made for traction of Redshift, EMR, and so on.
- Teradata recently told me that in colocation pricing, it is common for floor space to be everything, with power not separately metered. But I don’t think that trend is a big deal, as it is not necessarily permanent.
- Cloud hype is of course still with us.
- Other than the above, I stand by my previous thoughts on appliances, clusters and clouds.
3. As for the analytic DBMS industry: Read more
After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.
- My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
- My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
- My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
- My Cloudera/SQL/Impala/Hive apparently was well-balanced, in that it got attacked from multiple sides via Twitter & email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
- My post on the Intel/Cloudera deal garnered a comment reminding us Dell had pushed the Intel distro.
- My CitusDB post picked up a few clarifying comments.
Here is a catch-all post to complete the set. Read more
My California trip last week focused mainly on software — duh! — but I had some interesting hardware/storage/architecture discussions as well, especially in the areas of:
- Rack- or data-center-scale systems.
- The real or imagined demise of Moore’s Law.
I also got updated as to typical Hadoop hardware.
If systems are designed at the whole-rack level or higher, then there can be much more flexibility and efficiency in terms of mixing and connecting CPU, RAM and storage. The Google/Facebook/Amazon cool kids are widely understood to be following this approach, so others are naturally considering it as well. My most interesting of several mentions of that point was when I got the chance to talk with Berkeley computer architecture guru Dave Patterson, who’s working on plans for 100-petabyte/terabit-networking kinds of systems, for usage after 2020 or so. (If you’re interested, you might want to contact him; I’m sure he’d love more commercial sponsorship.)
One of Dave’s design assumptions is that Moore’s Law really will end soon (or at least greatly slow down), if by Moore’s Law you mean that every 18 months or so one can get twice as many transistors onto a chip of the same area and cost than one could before. However, while he thinks that applies to CPU and RAM, Dave thinks flash is an exception. I gathered that he thinks the power/heat reasons for Moore’s Law to end will be much harder to defeat than the other ones; note that flash, because of what it’s used for, has vastly less power running through it than CPU or RAM do.
|Categories: Amazon and its cloud, Buying processes, Cloudera, Facebook, Google, Intel, Memory-centric data management, Pricing, Solid-state memory||19 Comments|