Discussion of relational database management systems that are offered through some version of open source licensing. Related subjects include:
One pleasure in talking with my clients at MongoDB is that few things are NDA. So let’s start with some numbers:
- >2,000 named customers, the vast majority of which are unique organizations who do business with MongoDB directly.
- ~75,000 users of MongoDB Cloud Manager.
- Estimated ~1/4 million production users of MongoDB total.
Also >530 staff, and I think that number is a little out of date.
MongoDB lacks many capabilities RDBMS users take for granted. MongoDB 3.2, which I gather is slated for early November, narrows that gap, but only by a little. Features include:
- Some JOIN capabilities.
- Specifically, these are left outer joins, so they’re for lookup but not for filtering.
- JOINs are not restricted to specific shards of data …
- … but do benefit from data co-location when it occurs.
- A BI connector. Think of this as a MongoDB-to- SQL translator. Using this does require somebody to go in and map JSON schemas and relational tables to each other. Once that’s done, the flow is:
- Basic SQL comes in.
- Filters and GroupBys are pushed down to MongoDB. A result set … well, it results.
- The result set is formatted into a table and returned to the system — for example a business intelligence tool — that sent the SQL.
- Database-side document validation, in the form of field-specific rules that combine into a single expression against which to check a document.
- This is fairly simple stuff — no dependencies among fields in the same document, let alone foreign key relationships.
- MongoDB argues, persuasively, that this simplicity makes it unlikely to recreate the spaghetti code maintenance nightmare that was 1990s stored procedures.
- MongoDB concedes that, for performance, it will ordinarily be a good idea to still do your validation on the client side.
- MongoDB points out that enforcement can be either strict (throw errors) or relaxed (just note invalid documents to a log). The latter option is what makes it possible to install this feature without breaking your running system.
There’s also a closed-source database introspection tool coming, currently codenamed MongoDB Scout. Read more
|Categories: Business intelligence, EAI, EII, ETL, ELT, ETLT, Market share and customer counts, MongoDB, NoSQL, Open source, Structured documents, Text||6 Comments|
Occasionally I talk with an astute reporter — there are still a few left — and get led toward angles I hadn’t considered before, or at least hadn’t written up. A blog post may then ensue. This is one such post.
There is a group of questions going around that includes:
- Is Hadoop overhyped?
- Has Hadoop adoption stalled?
- Is Hadoop adoption being delayed by skills shortages?
- What is Hadoop really good for anyway?
- Which adoption curves for previous technologies are the best analogies for Hadoop?
To a first approximation, my responses are: Read more
|Categories: Application areas, Data warehousing, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, Hortonworks, MapR, MapReduce, Market share and customer counts, Open source, Pricing||6 Comments|
At the highest level:
- Presto is, roughly speaking, Facebook’s replacement for Hive, at least for queries that are supposed to run at interactive speeds.
- Teradata is announcing support for Presto with a classic open source pricing model.
- Presto will also become, roughly speaking, Teradata’s replacement for Hive.
- Teradata’s Presto efforts are being conducted by the former Hadapt.
Now let’s make that all a little more precise.
Regarding Presto (and I got most of this from Teradata)::
- To a first approximation, Presto is just another way to write SQL queries against HDFS (Hadoop Distributed File System). However …
- … Presto queries other data stores too, such as various kinds of RDBMS, and federates query results.
- Facebook at various points in time created both Hive and now Presto.
- Facebook started the Presto project in 2012 and now has 10 engineers on it.
- Teradata has named 16 engineers – all from Hadapt – who will be contributing to Presto.
- Known serious users of Presto include Facebook, Netflix, Groupon and Airbnb. Airbnb likes Presto well enough to have 1/3 of its employees using it, via an Airbnb-developed tool called Airpal.
- Facebook is known to have a cluster cited at 300 petabytes and 4000 users where Presto is presumed to be a principal part of the workload.
Daniel Abadi said that Presto satisfies what he sees as some core architectural requirements for a modern parallel analytic RDBMS project: Read more
I chatted with the MariaDB folks on Tuesday. Let me start by noting:
- MariaDB, the product, is a MySQL fork.
- MariaDB, product and company alike, are essentially a reaction to Oracle’s acquisition of MySQL. A lot of the key players are previously from MySQL.
- MariaDB, the company, is the former SkySQL …
- … which acquired or is the surviving entity of a merger with The Monty Program, which originated MariaDB. According to Wikipedia, something called the MariaDB Foundation is also in the mix.
- I get the impression SkySQL mainly provided services around MySQL, especially remote DBA.
- It appears that a lot of MariaDB’s technical differentiation going forward is planned to be in a companion product called MaxScale, which was released into Version 1.0 general availability earlier this year.
The numbers around MariaDB are a little vague. I was given the figure that there were ~500 customers total, but I couldn’t figure out what they were customers for. Remote DBA services? MariaDB support subscriptions? Something else? I presume there are some customers in each category, but I don’t know the mix. Other notes on MariaDB the company are:
- ~80 people in ~15 countries.
- 20-25 engineers, which hopefully doesn’t count a few field support people.
- “Tiny” headquarters in Helsinki.
- Business leadership growing in the US and especially the SF area.
MariaDB, the company, also has an OEM business. Part of their pitch is licensing for connectors — specifically LGPL — that hopefully gets around some of the legal headaches for MySQL engine suppliers.
MaxScale is a proxy, which starts out by intercepting and parsing MariaDB queries. Read more
|Categories: Database compression, Hadoop, IBM and DB2, Market share and customer counts, Mid-range, MySQL, Open source, Tokutek and TokuDB, Transparent sharding||1 Comment|
1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:
- Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
- Files have indexes and optional Bloom filters.
- Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.
2. The discussion of which kinds of data are originally put into HBase was a bit confusing.
- HBase is commonly used to receive machine-generated data. Everybody knows that.
- Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
- Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.
OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.
3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that: Read more
|Categories: Cloudera, eBay, Facebook, Hadoop, HBase, Market share and customer counts, NoSQL, Open source, Petabyte-scale data management, Specific users, Yahoo||4 Comments|
I talked with a couple of Cloudera folks about HBase last week. Let me frame things by saying:
- The closest thing to an HBase company, ala MongoDB/MongoDB or DataStax/Cassandra, is Cloudera.
- Cloudera still uses a figure of 20% of its customers being HBase-centric.
- HBaseCon and so on notwithstanding, that figure isn’t really reflected in Cloudera’s marketing efforts. Cloudera’s marketing commitment to HBase has never risen to nearly the level of MongoDB’s or DataStax’s push behind their respective core products.
- With Cloudera’s move to “zero/one/many” pricing, Cloudera salespeople have little incentive to push HBase hard to accounts other than HBase-first buyers.
- Cloudera no longer dominates HBase development, if it ever did.
- Cloudera is the single biggest contributor to HBase, by its count, but doesn’t make a majority of the contributions on its own.
- Cloudera sees Hortonworks as having become a strong HBase contributor.
- Intel is also a strong contributor, as are end user organizations such as Chinese telcos. Not coincidentally, Intel was a major Hadoop provider in China before the Intel/Cloudera deal.
- As far as Cloudera is concerned, HBase is just one data storage technology of several, focused on high-volume, high-concurrency, low-latency short-request processing. Cloudera thinks this is OK because of HBase’s strong integration with the rest of the Hadoop stack.
- Others who may be inclined to disagree are in several cases doing projects on top of HBase to extend its reach. (In particular, please see the discussion below about Apache Phoenix and Trafodion, both of which want to offer relational-like functionality.)
|Categories: Cloudera, Clustering, Data models and architecture, Database diversity, Hadoop, HBase, Hortonworks, HP and Neoview, Intel, Market share and customer counts, NoSQL, Open source||4 Comments|
While I don’t find the Open Data Platform thing very significant, an associated piece of news seems cooler — Pivotal is open sourcing a bunch of software, with Greenplum as the crown jewel. Notes on that start:
- Greenplum has been an on-again/off-again low-cost player since before its acquisition by EMC, but open source is basically a commitment to having low license cost be permanently on.
- In most regards, “free like beer” is what’s important here, not “free like speech”. I doubt non-Pivotal employees are going to do much hacking on the long-closed Greenplum code base.
- That said, Greenplum forked PostgreSQL a long time ago, and the general PostgreSQL community might gain ideas from some of the work Greenplum has done.
- The only other bit of newly open-sourced stuff I find interesting is HAWQ. Redis was already open source, and I’ve never been persuaded to care about GemFire.
Greenplum, let us recall, is a pretty decent MPP (Massively Parallel Processing) analytic RDBMS. Various aspects of it were oversold at various times, and I’ve never heard that they actually licked concurrency. But Greenplum has long had good SQL coverage and petabyte-scale deployments and a columnar option and some in-database analytics and so on; i.e., it’s legit. When somebody asks me about open source analytic RDBMS to consider, I expect Greenplum to consistently be on the short list.
Further, the low-cost alternatives for analytic RDBMS are adding up. Read more
|Categories: Amazon and its cloud, Citus Data, Data warehouse appliances, EAI, EII, ETL, ELT, ETLT, EMC, Greenplum, Hadoop, Infobright, MonetDB, Open source, Pricing||6 Comments|
Hortonworks, IBM, EMC Pivotal and others have announced a project called “Open Data Platform” to do … well, I’m not exactly sure what. Mainly, it sounds like:
- An attempt to minimize the importance of any technical advantages Cloudera or MapR might have.
- A face-saving way to admit that IBM’s and Pivotal’s insistence on having their own Hadoop distributions has been silly.
- An excuse for press releases.
- A source of an extra logo graphic to put on marketing slides.
Edit: Now there’s a press report saying explicitly that Hortonworks is taking over Pivotal’s Hadoop distro customers (which basically would mean taking over the support contracts and then working to migrate them to Hortonworks’ distro).
The claim is being made that this announcement solves some kind of problem about developing to multiple versions of the Hadoop platform, but to my knowledge that’s a problem rarely encountered in real life. When you already have a multi-enterprise open source community agreeing on APIs (Application Programming interfaces), what API inconsistency remains for a vendor consortium to painstakingly resolve?
Anyhow, it now seems clear that if you want to use a Hadoop distribution, there are three main choices:
- Cloudera’s flavor, whether as software (from Cloudera) or in an appliance (e.g. from Oracle).
- MapR’s flavor, as software from MapR.
- Hortonworks’ flavor, from a number of vendors, including Hortonworks, IBM, Pivotal, Teradata et al.
In saying that, I’m glossing over a few points, such as: Read more
|Categories: Amazon and its cloud, Cloudera, EMC, Emulation, transparency, portability, Greenplum, Hadoop, Hortonworks, IBM and DB2, MapR, Open source||11 Comments|
- Question: Why do policemen work in pairs?
- Answer: One to read and one to write.
A lot has happened in MongoDB technology over the past year. For starters:
- The big news in MongoDB 3.0* is the WiredTiger storage engine. The top-level claims for that are that one should “typically” expect (individual cases can of course vary greatly):
- 7-10X improvement in write performance.
- No change in read performance (which however was boosted in MongoDB 2.6).
- ~70% reduction in data size due to compression (disk only).
- ~50% reduction in index size due to compression (disk and memory both).
- MongoDB has been adding administration modules.
- A remote/cloud version came out with, if I understand correctly, MongoDB 2.6.
- An on-premise version came out with 3.0.
- They have similar features, but are expected to grow apart from each other over time. They have different names.
*Newly-released MongoDB 3.0 is what was previously going to be MongoDB 2.8. My clients at MongoDB finally decided to give a “bigger” release a new first-digit version number.
To forestall confusion, let me quickly add: Read more
|Categories: Database compression, Hadoop, Humor, In-memory DBMS, MongoDB, NoSQL, Open source, Structured documents, Sybase||9 Comments|
Hadoop World/Strata is this week, so of course my clients at Cloudera will have a bunch of announcements. Without front-running those, I think it might be interesting to review the current state of the Cloudera product line. Details may be found on the Cloudera product comparison page. Examining those details helps, I think, with understanding where Cloudera does and doesn’t place sales and marketing focus, which given Cloudera’s Hadoop market stature is in my opinion an interesting thing to analyze.
So far as I can tell (and there may be some errors in this, as Cloudera is not always accurate in explaining the fine details):
- CDH (Cloudera Distribution … Hadoop) contains a lot of Apache open source code.
- Cloudera has a much longer list of Apache projects that it thinks comprise “Core Hadoop” than, say, Hortonworks does.
- Specifically, that list currently is: Hadoop, Flume, HCatalog, Hive, Hue, Mahout, Oozie, Pig, Sentry, Sqoop, Whirr, ZooKeeper.
- In addition to those projects, CDH also includes HBase, Impala, Spark and Cloudera Search.
- Cloudera Manager is closed-source code, much of which is free to use. (I.e., “free like beer” but not “free like speech”.)
- Cloudera Navigator is closed-source code that you have to pay for (free trials and the like excepted).
- Cloudera Express is Cloudera’s favorite free subscription offering. It combines CDH with the free part of Cloudera Manager. Note: Cloudera Express was previously called Cloudera Standard, and that terminology is still reflected in parts of Cloudera’s website.
- Cloudera Enterprise is the umbrella name for Cloudera’s three favorite paid offerings.
- Cloudera Enterprise Basic Edition contains:
- All the code in CDH and Cloudera Manager, and I guess Accumulo code as well.
- Commercial licenses for all that code.
- A license key to use the entirety of Cloudera Manager, not just the free part.
- Support for the “Core Hadoop” part of CDH.
- Support for Cloudera Manager. Note: Cloudera is lazy about saying this explicitly, but it seems obvious.
- The code for Cloudera Navigator, but that’s moot, as the corresponding license key for Cloudera Navigator is not part of the package.
- Cloudera Enterprise Data Hub Edition contains:
- Everything in Cloudera Basic Edition.
- A license key for Cloudera Navigator.
- Support for all of HBase, Accumulo, Impala, Spark, Cloudera Search and Cloudera Navigator.
- Cloudera Enterprise Flex Edition contains everything in Cloudera Basic Edition, plus support for one of the extras in Data Hub Edition.
In analyzing all this, I’m focused on two particular aspects:
- The “zero, one, many” system for defining the editions of Cloudera Enterprise.
- The use of “Data Hub” as a general marketing term.
|Categories: Cloudera, Data warehousing, Databricks, Spark and BDAS, Hadoop, HBase, Hortonworks, Open source, Pricing||2 Comments|