Theory and architecture
Analysis of design choices in databases and database management systems. Related subjects include:
I visited DataStax on my recent trip. That was a tipping point leading to my recent discussions of NoSQL DBAs and misplaced fear of vendor lock-in. But of course I also learned some things about DataStax and Cassandra themselves.
On the customer side:
- DataStax customers still overwhelmingly use Cassandra for internet back-ends — web, mobile or otherwise as the case might be.
- This includes — and “includes” might be understating the point — traditional enterprises worried about competition from internet-only ventures.
Customers in large numbers want cloud capabilities, as a potential future if not a current need.
One customer example was a large retailer, who in the past was awful at providing accurate inventory information online, but now uses Cassandra for that. DataStax brags that its queries come back in 20 milliseconds, but that strikes me as a bit beside the point; what really matters is that data accuracy has gone from “batch” to some version of real-time. Also, Microsoft is a DataStax customer, using Cassandra (and Spark) for the Office 365 backend, or at least for the associated analytics.
Per Patrick McFadin, the four biggest things in DataStax Enterprise 5 are: Read more
- I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
- The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
- I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
- I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.
A running list of recent posts is:
- As a companion to this post, I’m publishing a very long one on vendor lock-in.
- Spark and Databricks are both prospering, and of course enhancing their technology as well.
- Ditto DataStax.
- Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.
Subjects I’d like to add to that list include:
- MemSQL, Zoomdata, and Neo Technology (also prospering).
- Cloudera (multiple topics, as usual).
- Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
- Microsoft’s reinvention (it feels real).
- Metadata (it’s ever more of a thing).
- Machine learning (it’s going to be a big portion of my research going forward).
- Transitions to the cloud — this subject affects almost everything else.
Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.
Making Cloudera run in the cloud has three major aspects:
- Cloudera’s usual software, ported to run on the cloud platform(s).
- Cloudera Director, which for example launches cloud instances.
- Points of integration, e.g. taking information about security-oriented roles from the platform and feeding then to the role-based security that is specific to Cloudera Enterprise.
Features new in this week’s release of Cloudera Director include:
- An API for job submission.
- Support for spot and preemptable instances.
- High availability.
- Some cluster repair.
- Some cluster cloning.
I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.
As for porting, let me start by noting: Read more
When I find myself making the same observation fairly frequently, that’s a good impetus to write a post based on it. And so this post is based on the thought that there are many analogies between:
- Oracle and the Oracle DBMS.
- IBM and the IBM mainframe.
And when you look at things that way, Oracle seems to be swimming against the tide.
Drilling down, there are basically three things that can seriously threaten Oracle’s market position:
- Growth in apps of the sort for which Oracle’s RDBMS is not well-suited. Much of “Big Data” fits that description.
- Outright, widespread replacement of Oracle’s application suites. This is the least of Oracle’s concerns at the moment, but could of course be a disaster in the long term.
- Transition to “the cloud”. This trend amplifies the other two.
Oracle’s decline, if any, will be slow — but I think it has begun.
There’s a clear market lead in the core product category. IBM was dominant in mainframe computing. While not as dominant, Oracle is definitely a strong leader in high-end OTLP/mixed-use (OnLine Transaction Processing) RDBMS.
That market lead is even greater than it looks, because some of the strongest competitors deserve asterisks. Many of IBM’s mainframe competitors were “national champions” — Fujitsu and Hitachi in Japan, Bull in France and so on. Those were probably stronger competitors to IBM than the classic BUNCH companies (Burroughs, Univac, NCR, Control Data, Honeywell).
Similarly, Oracle’s strongest direct competitors are IBM DB2 and Microsoft SQL Server, each of which is sold primarily to customers loyal to the respective vendors’ full stacks. SAP is now trying to play a similar game.
The core product is stable, secure, richly featured, and generally very mature. Duh.
The core product is complicated to administer — which provides great job security for administrators. IBM had JCL (Job Control Language). Oracle has a whole lot of manual work overseeing indexes. In each case, there are many further examples of the point. Edit: A Twitter discussion suggests the specific issue with indexes has been long fixed.
Niche products can actually be more reliable than the big, super-complicated leader. Tandem Nonstop computers were super-reliable. Simple, “embeddable” RDBMS — e.g. Progress or SQL Anywhere — in many cases just work. Still, if you want one system to run most of your workload 24×7, it’s natural to choose the category leader. Read more
|Categories: Cloud computing, Database diversity, Exadata, IBM and DB2, Market share and customer counts, Microsoft and SQL*Server, NoSQL, Oracle, Software as a Service (SaaS)||25 Comments|
Mike Stonebraker and Larry Ellison have numerous things in common. If nothing else:
- They’re both titanic figures in the database industry.
- They both gave me testimonials on the home page of my business website.
- They both have been known to use the present tense when the future tense would be more accurate.
I mention the latter because there’s a new edition of Readings in Database Systems, aka the Red Book, available online, courtesy of Mike, Joe Hellerstein and Peter Bailis. Besides the recommended-reading academic papers themselves, there are 12 survey articles by the editors, and an occasional response where, for example, editors disagree. Whether or not one chooses to tackle the papers themselves — and I in fact have not dived into them — the commentary is of great interest.
But I would not take every word as the gospel truth, especially when academics describe what they see as commercial market realities. In particular, as per my quip in the first paragraph, the data warehouse market has not yet gone to the extremes that Mike suggests,* if indeed it ever will. And while Joe is close to correct when he says that the company Essbase was acquired by Oracle, what actually happened is that Arbor Software, which made Essbase, merged with Hyperion Software, and the latter was eventually indeed bought by the giant of Redwood Shores.**
*When it comes to data warehouse market assessment, Mike seems to often be ahead of the trend.
**Let me interrupt my tweaking of very smart people to confess that my own commentary on the Oracle/Hyperion deal was not, in retrospect, especially prescient.
Mike pretty much opened the discussion with a blistering attack against hierarchical data models such as JSON or XML. To a first approximation, his views might be summarized as: Read more
I only have mixed success at getting my clients to reach out to me for messaging advice when they’re introducing something new. Cloudera Navigator Optimizer, which is being announced along with Cloudera 5.5, is one of my failures in that respect; I heard about it for the first time Tuesday afternoon. I hate the name. I hate some of the slides I saw. But I do like one part of the messaging, namely the statement that this is about “refactoring” queries.
All messaging quibbles aside, I think the Cloudera Navigator Optimizer story is actually pretty interesting, and perhaps not just to users of SQL-on-Hadoop technologies such as Hive (which I guess I’d put in that category for simplicity) or Impala. As I understand Cloudera Navigator Optimizer:
- It’s all about analytic SQL queries.
- Specifically, it’s about reducing duplicated work.
- It is not an “optimizer” in the ordinary RDBMS sense of the word.
- It’s delivered via SaaS (Software as a Service).
- Conceptually, it’s not really tied to SQL-on-Hadoop. However, …
- … in practice it likely will be used by customers who want to optimize performance of Cloudera’s preferred styles of SQL-on-Hadoop, either because they’re already using SQL-on-Hadoop or in connection with an initial migration.
|Categories: Business intelligence, Cloudera, Data pipelining, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, SQL/Hadoop integration||4 Comments|
I talked with Cloudera shortly ahead of today’s announcement of Cloudera 5.5. Much of what we talked about had something or other to do with SQL data management. Highlights include:
- Impala and Kudu are being donated to Apache. This actually was already announced Tuesday. (Due to Apache’s rules, if I had any discussion with Cloudera speculating on the likelihood of Apache accepting the donations, I would not be free to relay it.)
- Cloudera is introducing SQL extensions so that Impala can query nested data structures. More on that below.
- The basic idea for the nested datatype support is that there are SQL extensions with a “dot” notation to let you get at the specific columns you need.
- From a feature standpoint, we’re definitely still in the early days.
- When I asked about indexes on these quasi-columns, I gathered that they’re not present in beta but are hoped for by the time of general availability.
- Basic data skipping, also absent in beta, seems to be more confidently expected in GA.
- This is for Parquet first, Avro next, and presumably eventually native JSON as well.
- This is said to be Dremel-like, at least in the case of Parquet. I must confess that I’m not familiar enough with Apache Drill to compare the two efforts.
- Cloudera is increasing its coverage of Spark in several ways.
- Cloudera is adding support for MLlib.
- Cloudera is adding support for SparkSQL. More on that below.
- Cloudera is adding support for Spark going against S3. The short answer to “How is this different from the Databricks service?” is:
- More “platform” stuff from the Hadoop stack (e.g. for data ingest).
- Less in the way of specific Spark usability stuff.
- Cloudera is putting into beta what it got in the Xplain.io acquisition, which it unfortunately is naming Cloudera Navigator Optimizer. More on that in a separate post.
- Impala and Hive are getting column-level security via Apache Sentry.
- There are other security enhancements.
- Some policy-based information lifecycle management is being added as well.
While I had Cloudera on the phone, I asked a few questions about Impala adoption, specifically focused on concurrency. There was mention of: Read more
|Categories: Benchmarks and POCs, Cloudera, Data warehousing, Databricks, Spark and BDAS, Market share and customer counts, Petabyte-scale data management, Predictive modeling and advanced analytics, SQL/Hadoop integration||4 Comments|
In the previous post I broke product differentiation into 6-8 overlapping categories, which may be abbreviated as:
- (Other) trustworthiness
- User experience
and sometimes also issues in adoption and administration.
Now let’s use this framework to examine two market categories I cover — data management and, in separate post, business intelligence.
Applying this taxonomy to data management:
|Categories: Buying processes, Clustering, Data warehousing, Database diversity, Microsoft and SQL*Server, Predictive modeling and advanced analytics, Pricing||2 Comments|
- I’ve suggested in the past that multi-data-center capabilities are important for “data sovereignty”/geo-compliance.
- The need for geo-compliance just got a lot stronger, with the abolition of the European Union’s Safe Harbour rule for the US. If you collect data in multiple countries, you should be at least thinking about geo-compliance.
- Cassandra is an established leader in multi-data-center operation.
But when I made that connection and checked in accordingly with my client Patrick McFadin at DataStax, I discovered that I’d been a little confused about how multi-data-center Cassandra works. The basic idea holds water, but the details are not quite what I was envisioning.
The story starts:
- Cassandra groups nodes into logical “data centers” (i.e. token rings).
- As a best practice, each physical data center can contain one or more logical data center, but not vice-versa.
- There are two levels of replication — within a single logical data center, and between logical data centers.
- Replication within a single data center is planned in the usual way, with the principal data center holding a database likely to have a replication factor of 3.
- However, copies of the database held elsewhere may have different replication factors …
- … and can indeed have different replication factors for different parts of the database.
In particular, a remote replication factor for Cassandra can = 0. When that happens, then you have data sitting in one geographical location that is absent from another geographical location; i.e., you can be in compliance with laws forbidding the export of certain data. To be clear (and this contradicts what I previously believed and hence also implied in this blog):
- General multi-data-center operation is not what gives you geo-compliance, because the default case is that the whole database is replicated to each data center.
- Instead, you get that effect by tweaking your specific replication settings.
|Categories: Cassandra, Clustering, DataStax, HBase, NoSQL, Open source, Specific users, Surveillance and privacy||3 Comments|
Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.
- Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
- Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
- Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
- Basho’s revenue is ~90% subscription.
- Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
- Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would be close to half the total, actually), comes from 2 particular deals of >$4 million each.
Basho’s product line has gotten a bit confusing, but as best I understand things the story is:
- There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
- Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
- Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
- Riak TS is for time series, and just coming out now.
- Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
- There’s an umbrella marketing term of “Basho Data Platform”.
Technical notes on some of that include: Read more
|Categories: Aerospike, Basho and Riak, Cassandra, Clustering, Couchbase, Databricks, Spark and BDAS, DataStax, HBase, Health care, Log analysis, MapR, Market share and customer counts, MongoDB, NoSQL, Pricing, Specific users, Splunk||Leave a Comment|