Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:
- What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third tier support. Fair enough. But distributors of stacks including Spark, for whatever combination of on-premise and cloud as the case may be, may in many cases be viewed as competitors to Databricks cloud-only service. So why should Databricks help them?
- Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
- Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.
Ali also walked me through customer use cases and adoption in wonderful detail. In general:
- A large majority of Databricks customers have machine learning use cases.
- Predicting and preventing user/customer churn is a huge issue across multiple market sectors.
The story on those sectors, per Ali, is: Read more
I visited DataStax on my recent trip. That was a tipping point leading to my recent discussions of NoSQL DBAs and misplaced fear of vendor lock-in. But of course I also learned some things about DataStax and Cassandra themselves.
On the customer side:
- DataStax customers still overwhelmingly use Cassandra for internet back-ends — web, mobile or otherwise as the case might be.
- This includes — and “includes” might be understating the point — traditional enterprises worried about competition from internet-only ventures.
Customers in large numbers want cloud capabilities, as a potential future if not a current need.
One customer example was a large retailer, who in the past was awful at providing accurate inventory information online, but now uses Cassandra for that. DataStax brags that its queries come back in 20 milliseconds, but that strikes me as a bit beside the point; what really matters is that data accuracy has gone from “batch” to some version of real-time. Also, Microsoft is a DataStax customer, using Cassandra (and Spark) for the Office 365 backend, or at least for the associated analytics.
Per Patrick McFadin, the four biggest things in DataStax Enterprise 5 are: Read more
Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.
1. The most basic form of lock-in is:
- You do application development for a target set of platform technologies.
- Your applications can’t run without those platforms underneath.
- Hence, you’re locked into those platforms.
2. Enterprise vendor standardization is closely associated with lock-in. The core idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:
- That simplifies your environment, requiring less integration and interoperability.
- That simplifies your staffing; the same skill sets apply to multiple needs and projects.
- That simplifies your vendor support relationships; there’s “one throat to choke”.
- That simplifies your price negotiation.
3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works:
- For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
- A few years later, the price is renegotiated, based on then-current levels of usage.
|Categories: Amazon and its cloud, Buying processes, Cassandra, Exadata, Facebook, IBM and DB2, Microsoft and SQL*Server, MongoDB, Neo Technology and Neo4j, Open source, Oracle, SAP AG||8 Comments|
- I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
- The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
- I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
- I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.
A running list of recent posts is:
- As a companion to this post, I’m publishing a very long one on vendor lock-in.
- Spark and Databricks are both prospering, and of course enhancing their technology as well.
- Ditto DataStax.
- Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.
Subjects I’d like to add to that list include:
- MemSQL, Zoomdata, and Neo Technology (also prospering).
- Cloudera (multiple topics, as usual).
- Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
- Microsoft’s reinvention (it feels real).
- Metadata (it’s ever more of a thing).
- Machine learning (it’s going to be a big portion of my research going forward).
- Transitions to the cloud — this subject affects almost everything else.
- Kafka has gotten considerable attention and adoption in streaming.
- Kafka is open source, out of LinkedIn.
- Folks who built it there, led by Jay Kreps, now have a company called Confluent.
- Confluent seems to be pursuing a fairly standard open source business model around Kafka.
- Confluent seems to be in the low to mid teens in paying customers.
- Confluent believes 1000s of Kafka clusters are in production.
- Confluent reports 40 employees and $31 million raised.
At its core Kafka is very simple:
- Kafka accepts streams of data in substantially any format, and then streams the data back out, potentially in a highly parallel way.
- Any producer or consumer of data can connect to Kafka, via what can reasonably be called a publish/subscribe model.
- Kafka handles various issues of scaling, load balancing, fault tolerance and so on.
So it seems fair to say:
- Kafka offers the benefits of hub vs. point-to-point connectivity.
- Kafka acts like a kind of switch, in the telecom sense. (However, this is probably not a very useful metaphor in practice.)
|Categories: Data integration and middleware, Humor, Kafka and Confluent, Market share and customer counts, Microsoft and SQL*Server, Open source, Specific users, Streaming and complex event processing (CEP)||10 Comments|
This is part of a four post series spanning two blogs.
- One post gives a general historical overview of the artificial intelligence business.
- One post specifically covers the history of expert systems.
- One post gives a general present-day overview of the artificial intelligence business.
- One post (this one) explores the close connection between machine learning and (the rest of) AI.
1. I think the technical essence of AI is usually:
- Inputs come in.
- Decisions or actions come out.
- More precisely — inputs come in, something intermediate is calculated, and the intermediate result is mapped to a decision or action.
- The intermediate results are commonly either numerical (a scalar or perhaps a vector of scalars) or a classification/partition into finitely many possible intermediate outputs.
Of course, a lot of non-AI software can be described the same way.
To check my claim, please consider:
- It fits rules engines/expert systems so simply it’s barely worth saying.
- It fits any kind of natural language processing; the intermediate results might be words or phrases or concepts or whatever.
- It fits machine vision beautifully.
To see why it’s true from a bottom-up standpoint, please consider the next two points.
2. It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response. Examples of what I mean include: Read more
|Categories: Facebook, Google, IBM and DB2, Microsoft and SQL*Server, Predictive modeling and advanced analytics||6 Comments|
- I’ve suggested in the past that multi-data-center capabilities are important for “data sovereignty”/geo-compliance.
- The need for geo-compliance just got a lot stronger, with the abolition of the European Union’s Safe Harbour rule for the US. If you collect data in multiple countries, you should be at least thinking about geo-compliance.
- Cassandra is an established leader in multi-data-center operation.
But when I made that connection and checked in accordingly with my client Patrick McFadin at DataStax, I discovered that I’d been a little confused about how multi-data-center Cassandra works. The basic idea holds water, but the details are not quite what I was envisioning.
The story starts:
- Cassandra groups nodes into logical “data centers” (i.e. token rings).
- As a best practice, each physical data center can contain one or more logical data center, but not vice-versa.
- There are two levels of replication — within a single logical data center, and between logical data centers.
- Replication within a single data center is planned in the usual way, with the principal data center holding a database likely to have a replication factor of 3.
- However, copies of the database held elsewhere may have different replication factors …
- … and can indeed have different replication factors for different parts of the database.
In particular, a remote replication factor for Cassandra can = 0. When that happens, then you have data sitting in one geographical location that is absent from another geographical location; i.e., you can be in compliance with laws forbidding the export of certain data. To be clear (and this contradicts what I previously believed and hence also implied in this blog):
- General multi-data-center operation is not what gives you geo-compliance, because the default case is that the whole database is replicated to each data center.
- Instead, you get that effect by tweaking your specific replication settings.
|Categories: Cassandra, Clustering, DataStax, HBase, NoSQL, Open source, Specific users, Surveillance and privacy||3 Comments|
Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.
- Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
- Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
- Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
- Basho’s revenue is ~90% subscription.
- Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
- Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would be close to half the total, actually), comes from 2 particular deals of >$4 million each.
Basho’s product line has gotten a bit confusing, but as best I understand things the story is:
- There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
- Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
- Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
- Riak TS is for time series, and just coming out now.
- Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
- There’s an umbrella marketing term of “Basho Data Platform”.
Technical notes on some of that include: Read more
|Categories: Aerospike, Basho and Riak, Cassandra, Clustering, Couchbase, Databricks, Spark and BDAS, DataStax, HBase, Health care, Log analysis, MapR, Market share and customer counts, MongoDB, NoSQL, Pricing, Specific users, Splunk||Leave a Comment|
MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, somebody with whom I’ve had strong consulting relationships at a user and vendor both. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.
It seems fair to say that in most cases:
- Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
- Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data.
Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:
- DataStax trumpets British Gas‘ plans collecting a lot of sensor data and immediately offering it up for analysis.*
- Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
- A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.
*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data.
While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind: Read more
|Categories: Business intelligence, Cassandra, Databricks, Spark and BDAS, DataStax, NoSQL, Open source, Petabyte-scale data management, Predictive modeling and advanced analytics, Specific users, Text||6 Comments|
At the highest level:
- Presto is, roughly speaking, Facebook’s replacement for Hive, at least for queries that are supposed to run at interactive speeds.
- Teradata is announcing support for Presto with a classic open source pricing model.
- Presto will also become, roughly speaking, Teradata’s replacement for Hive.
- Teradata’s Presto efforts are being conducted by the former Hadapt.
Now let’s make that all a little more precise.
Regarding Presto (and I got most of this from Teradata)::
- To a first approximation, Presto is just another way to write SQL queries against HDFS (Hadoop Distributed File System). However …
- … Presto queries other data stores too, such as various kinds of RDBMS, and federates query results.
- Facebook at various points in time created both Hive and now Presto.
- Facebook started the Presto project in 2012 and now has 10 engineers on it.
- Teradata has named 16 engineers – all from Hadapt – who will be contributing to Presto.
- Known serious users of Presto include Facebook, Netflix, Groupon and Airbnb. Airbnb likes Presto well enough to have 1/3 of its employees using it, via an Airbnb-developed tool called Airpal.
- Facebook is known to have a cluster cited at 300 petabytes and 4000 users where Presto is presumed to be a principal part of the workload.
Daniel Abadi said that Presto satisfies what he sees as some core architectural requirements for a modern parallel analytic RDBMS project: Read more