DBMS product categories
Analysis of database management technology in specific product categories. Related subjects include:
Hortonworks, IBM, EMC Pivotal and others have announced a project called “Open Data Platform” to do … well, I’m not exactly sure what. Mainly, it sounds like:
- An attempt to minimize the importance of any technical advantages Cloudera or MapR might have.
- A face-saving way to admit that IBM’s and Pivotal’s insistence on having their own Hadoop distributions has been silly.
- An excuse for press releases.
- A source of an extra logo graphic to put on marketing slides.
Edit: Now there’s a press report saying explicitly that Hortonworks is taking over Pivotal’s Hadoop distro customers (which basically would mean taking over the support contracts and then working to migrate them to Hortonworks’ distro).
The claim is being made that this announcement solves some kind of problem about developing to multiple versions of the Hadoop platform, but to my knowledge that’s a problem rarely encountered in real life. When you already have a multi-enterprise open source community agreeing on APIs (Application Programming interfaces), what API inconsistency remains for a vendor consortium to painstakingly resolve?
Anyhow, it now seems clear that if you want to use a Hadoop distribution, there are three main choices:
- Cloudera’s flavor, whether as software (from Cloudera) or in an appliance (e.g. from Oracle).
- MapR’s flavor, as software from MapR.
- Hortonworks’ flavor, from a number of vendors, including Hortonworks, IBM, Pivotal, Teradata et al.
In saying that, I’m glossing over a few points, such as: Read more
|Categories: Amazon and its cloud, Cloudera, EMC, Emulation, transparency, portability, Greenplum, Hadoop, Hortonworks, IBM and DB2, MapR, Open source||10 Comments|
- Question: Why do policemen work in pairs?
- Answer: One to read and one to write.
A lot has happened in MongoDB technology over the past year. For starters:
- The big news in MongoDB 3.0* is the WiredTiger storage engine. The top-level claims for that are that one should “typically” expect (individual cases can of course vary greatly):
- 7-10X improvement in write performance.
- No change in read performance (which however was boosted in MongoDB 2.6).
- ~70% reduction in data size due to compression (disk only).
- ~50% reduction in index size due to compression (disk and memory both).
- MongoDB has been adding administration modules.
- A remote/cloud version came out with, if I understand correctly, MongoDB 2.6.
- An on-premise version came out with 3.0.
- They have similar features, but are expected to grow apart from each other over time. They have different names.
*Newly-released MongoDB 3.0 is what was previously going to be MongoDB 2.8. My clients at MongoDB finally decided to give a “bigger” release a new first-digit version number.
To forestall confusion, let me quickly add: Read more
|Categories: Database compression, Hadoop, Humor, In-memory DBMS, MongoDB, NoSQL, Open source, Structured documents, Sybase||9 Comments|
I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:
1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:
- Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
- Replicating an entire database into different geographies, often for reasons of latency and/or availability/ disaster recovery,
The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.
What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.
2. Three years ago I posted about agile (predictive) analytics. One of the points was:
… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.
Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with: Read more
Hadoop World/Strata is this week, so of course my clients at Cloudera will have a bunch of announcements. Without front-running those, I think it might be interesting to review the current state of the Cloudera product line. Details may be found on the Cloudera product comparison page. Examining those details helps, I think, with understanding where Cloudera does and doesn’t place sales and marketing focus, which given Cloudera’s Hadoop market stature is in my opinion an interesting thing to analyze.
So far as I can tell (and there may be some errors in this, as Cloudera is not always accurate in explaining the fine details):
- CDH (Cloudera Distribution … Hadoop) contains a lot of Apache open source code.
- Cloudera has a much longer list of Apache projects that it thinks comprise “Core Hadoop” than, say, Hortonworks does.
- Specifically, that list currently is: Hadoop, Flume, HCatalog, Hive, Hue, Mahout, Oozie, Pig, Sentry, Sqoop, Whirr, ZooKeeper.
- In addition to those projects, CDH also includes HBase, Impala, Spark and Cloudera Search.
- Cloudera Manager is closed-source code, much of which is free to use. (I.e., “free like beer” but not “free like speech”.)
- Cloudera Navigator is closed-source code that you have to pay for (free trials and the like excepted).
- Cloudera Express is Cloudera’s favorite free subscription offering. It combines CDH with the free part of Cloudera Manager. Note: Cloudera Express was previously called Cloudera Standard, and that terminology is still reflected in parts of Cloudera’s website.
- Cloudera Enterprise is the umbrella name for Cloudera’s three favorite paid offerings.
- Cloudera Enterprise Basic Edition contains:
- All the code in CDH and Cloudera Manager, and I guess Accumulo code as well.
- Commercial licenses for all that code.
- A license key to use the entirety of Cloudera Manager, not just the free part.
- Support for the “Core Hadoop” part of CDH.
- Support for Cloudera Manager. Note: Cloudera is lazy about saying this explicitly, but it seems obvious.
- The code for Cloudera Navigator, but that’s moot, as the corresponding license key for Cloudera Navigator is not part of the package.
- Cloudera Enterprise Data Hub Edition contains:
- Everything in Cloudera Basic Edition.
- A license key for Cloudera Navigator.
- Support for all of HBase, Accumulo, Impala, Spark, Cloudera Search and Cloudera Navigator.
- Cloudera Enterprise Flex Edition contains everything in Cloudera Basic Edition, plus support for one of the extras in Data Hub Edition.
In analyzing all this, I’m focused on two particular aspects:
- The “zero, one, many” system for defining the editions of Cloudera Enterprise.
- The use of “Data Hub” as a general marketing term.
|Categories: Cloudera, Data warehousing, Databricks, Spark and BDAS, Hadoop, HBase, Hortonworks, Open source, Pricing||2 Comments|
I spent a day with Teradata in Rancho Bernardo last week. Most of what we discussed is confidential, but I think the non-confidential parts and my general impressions add up to enough for a post.
First, let’s catch up with some personnel gossip. So far as I can tell:
- Scott Gnau runs most of Teradata’s development, product management, and product marketing, the big exception being that …
- … Darryl McDonald run the apps part (Aprimo and so on), and no longer is head of marketing.
- Oliver Ratzesberger runs Teradata’s software development.
- Jeff Carter has returned to his roots and runs the hardware part, in place of Carson Schmidt.
- Aster founders Mayank Bawa and Tasso Argyros have left Teradata (perhaps some earn-out period ended).
- Carson is temporarily running Aster development (in place of Mayank), and has some sort of evangelism role waiting after that.
- With the acquisition of Hadapt, Teradata gets some attention from Dan Abadi. Also, they’re retaining Justin Borgman.
The biggest change in my general impressions about Teradata is that they’re having smart thoughts about the cloud. At least, Oliver is. All details are confidential, and I wouldn’t necessarily expect them to become clear even in October (which once again is the month for Teradata’s user conference). My main concern about all that is whether Teradata’s engineering team can successfully execute on Oliver’s directives. I’m optimistic, but I don’t have a lot of detail to support my good feelings.
In some quick-and-dirty positioning and sales qualification notes, which crystallize what we already knew before:
- The Teradata 1xxx series is focused on cost-per-bit.
- The Teradata 2xxx series is focused on cost-per-query. It is commonly Teradata’s “lead” product, at least for new customers.
- The Teradata 6xxx series is supposed to be able to do “everything”.
- The Teradata Aster “Discovery Analytics” platform is sold mainly to customers who have a specific high-value problem to solve. (Randy Lea gave me a nice round dollar number, but I won’t share it.) I like that approach, as it obviates much of the concern about “Wait — is this strategic for us long-term, given that we also have both Teradata database and Hadoop clusters?”
Also: Read more
|Categories: Aster Data, Data warehouse appliances, Data warehousing, Hadapt, Hadoop, MapReduce, Solid-state memory, Teradata||2 Comments|
As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.
DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.
In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.
Buyer inertia is a greater concern.
- A significant minority of enterprises are highly committed to their enterprise DBMS standards.
- Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
- FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.
A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.
- First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
- Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
- Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
- Meanwhile, Hadoop has emerged as serious competitor for at least some analytic data management, especially but not only at internet companies.
Otherwise I’d say: Read more
I’m commonly asked to assess vendor claims of the kind:
- “Our system lets you do multiple kinds of processing against one database.”
- “Otherwise you’d need two or more data managers to get the job done, which would be a catastrophe of unthinkable proportion.”
So I thought it might be useful to quickly review some of the many ways organizations put multiple data stores to work. As usual, my bottom line is:
- The most extreme vendor marketing claims are false.
- There are many different choices that make sense in at least some use cases each.
Horses for courses
It’s now widely accepted that different data managers are better for different use cases, based on distinctions such as:
- Short-request vs. analytic.
- SQL vs. non-SQL (NoSQL or otherwise).
- Expensive/heavy-duty vs. cheap/easy-to-support.
Vendors are part of this consensus; already in 2005 I observed
For all practical purposes, there are no DBMS vendors left advocating single-server strategies.
Vendor agreement has become even stronger in the interim, as evidenced by Oracle/MySQL, IBM/Netezza, Oracle’s NoSQL dabblings, and various companies’ Hadoop offerings.
Multiple data stores for a single application
We commonly think of one data manager managing one or more databases, each in support of one or more applications. But the other way around works too; it’s normal for a single application to invoke multiple data stores. Indeed, all but the strictest relational bigots would likely agree: Read more
After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.
- My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
- My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
- My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
- My Cloudera/SQL/Impala/Hive apparently was well-balanced, in that it got attacked from multiple sides via Twitter & email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
- My post on the Intel/Cloudera deal garnered a comment reminding us Dell had pushed the Intel distro.
- My CitusDB post picked up a few clarifying comments.
Here is a catch-all post to complete the set. Read more
I stopped by MemSQL last week, and got a range of new or clarified information. For starters:
- Even though MemSQL (the product) was originally designed for OLTP (OnLine Transaction Processing), MemSQL (the company) is now focused on analytic use cases …
- … which was the point of introducing MemSQL’s flash-based columnar option.
- One MemSQL customer has a 100 TB “data warehouse” installation on Amazon.
- Another has “dozens” of terabytes of data spread across 500 machines, which aggregate 36 TB of RAM.
- At customer Shutterstock, 1000s of non-MemSQL nodes are monitored by 4 MemSQL machines.
- A couple of MemSQL’s top references are also Vertica flagship customers; one of course is Zynga.
- MemSQL reports encountering Clustrix and VoltDB in a few competitive situations, but not NuoDB. MemSQL believes that VoltDB is still hampered by its traditional issues — Java, reliance on stored procedures, etc.
On the more technical side: Read more
|Categories: Clustering, Clustrix, Columnar database management, Data warehousing, Database compression, In-memory DBMS, MemSQL, NewSQL, NuoDB, Specific users, Vertica Systems, VoltDB and H-Store, Workload management, Zynga||18 Comments|
I frequently am asked questions that boil down to:
- When should one use NoSQL?
- When should one use a new SQL product (NewSQL or otherwise)?
- When should one use a traditional RDBMS (most likely Oracle, DB2, or SQL Server)?
The details vary with context — e.g. sometimes MySQL is a traditional RDBMS and sometimes it is a new kid — but the general class of questions keeps coming. And that’s just for short-request use cases; similar questions for analytic systems arise even more often.
My general answers start:
- Sometimes something isn’t broken, and doesn’t need fixing.
- Sometimes something is broken, and still doesn’t need fixing. Legacy decisions that you now regret may not be worth the trouble to change.
- Sometimes — especially but not only at smaller enterprises — choices are made for you. If you operate on SaaS, plus perhaps some generic web hosting technology, the whole DBMS discussion may be moot.
In particular, migration away from legacy DBMS raises many issues: Read more
|Categories: Columnar database management, Couchbase, HBase, In-memory DBMS, Microsoft and SQL*Server, NewSQL, NoSQL, OLTP, Oracle, Parallelization, SAP AG||17 Comments|