December 15, 2017

The technology industry is under broad political attack

I apologize for posting a December downer, but this needs to be said.

The technology industry is under attack:

From politicians and political pundits …
… especially from “populists” and/or the political right …
… in the United States and other countries.

These attacks:

Are in some cases specific to internet companies such as Google and Facebook.
In some cases threaten the tech industry more broadly.
Are in some cases part of general attacks on the educated/professional/“globalist”/“coastal” “elites”.

You’ve surely noticed some of these attacks. But you may not have noticed just how many different attacks and criticisms there are, on multiple levels.

Categories: Amazon and its cloud, Facebook, Google, Public policy

4 Comments

August 21, 2016

More about Databricks and Spark

Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:

What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third tier support. Fair enough. But distributors of stacks including Spark, for whatever combination of on-premise and cloud as the case may be, may in many cases be viewed as competitors to Databricks’ cloud-only service. So why should Databricks help them?
Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.

Ali also walked me through customer use cases and adoption in wonderful detail. In general:

A large majority of Databricks customers have machine learning use cases.
Predicting and preventing user/customer churn is a huge issue across multiple market sectors.

The story on those sectors, per Ali, is: Read more

Categories: Data warehousing, Databricks, Spark and BDAS, Investment research and trading, Market share and customer counts, Open source, Predictive modeling and advanced analytics, Software as a Service (SaaS), Specific users, Streaming and complex event processing (CEP), Web analytics

4 Comments

August 7, 2016

Notes on DataStax and Cassandra

I visited DataStax on my recent trip. That was a tipping point leading to my recent discussions of NoSQL DBAs and misplaced fear of vendor lock-in. But of course I also learned some things about DataStax and Cassandra themselves.

On the customer side:

DataStax customers still overwhelmingly use Cassandra for internet back-ends — web, mobile or otherwise as the case might be.
This includes — and “includes” might be understating the point — traditional enterprises worried about competition from internet-only ventures.

Customers in large numbers want cloud capabilities, as a potential future if not a current need.

One customer example was a large retailer, who in the past was awful at providing accurate inventory information online, but now uses Cassandra for that. DataStax brags that its queries come back in 20 milliseconds, but that strikes me as a bit beside the point; what really matters is that data accuracy has gone from “batch” to some version of real-time. Also, Microsoft is a DataStax customer, using Cassandra (and Spark) for the Office 365 backend, or at least for the associated analytics.

Per Patrick McFadin, the four biggest things in DataStax Enterprise 5 are: Read more

Categories: Cassandra, DataStax, Microsoft and SQL*Server, NoSQL, Specific users

2 Comments

July 19, 2016

Notes on vendor lock-in

Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.

1. The most basic form of lock-in is:

You do application development for a target set of platform technologies.
Your applications can’t run without those platforms underneath.
Hence, you’re locked into those platforms.

2. Enterprise vendor standardization is closely associated with lock-in. The core idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:

That simplifies your environment, requiring less integration and interoperability.
That simplifies your staffing; the same skill sets apply to multiple needs and projects.
That simplifies your vendor support relationships; there’s “one throat to choke”.
That simplifies your price negotiation.

3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works:

For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
A few years later, the price is renegotiated, based on then-current levels of usage.

Categories: Amazon and its cloud, Buying processes, Cassandra, Exadata, Facebook, IBM and DB2, Microsoft and SQL*Server, MongoDB, Neo Technology and Neo4j, Open source, Oracle, SAP AG

12 Comments

July 19, 2016

Notes from a long trip, July 19, 2016

For starters:

I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.

A running list of recent posts is:

As a companion to this post, I’m publishing a very long one on vendor lock-in.
Spark and Databricks are both prospering, and of course enhancing their technology as well.
Ditto DataStax.
Flink is interesting as the streaming technology it’s now positioned to be, rather than as the general Spark alternative it was formerly positioned as, which the world didn’t need.

Subjects I’d like to add to that list include:

MemSQL, Zoomdata, and Neo Technology (also prospering).
Cloudera (multiple topics, as usual).
Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
Microsoft’s reinvention (it feels real).
Metadata (it’s ever more of a thing).
Machine learning (it’s going to be a big portion of my research going forward).
Transitions to the cloud — this subject affects almost everything else.

Categories: About this blog, Business intelligence, Cassandra, ClearStory Data, Cloud computing, Cloudera, Data warehousing, Databricks, Spark and BDAS, DataStax, Facebook, HBase, Market share and customer counts, MemSQL, Microsoft and SQL*Server, NoSQL, Open source, Platfora, Predictive modeling and advanced analytics, Streaming and complex event processing (CEP), Workday, Zoomdata

7 Comments

January 25, 2016

Kafka and Confluent

For starters:

Kafka has gotten considerable attention and adoption in streaming.
Kafka is open source, out of LinkedIn.
Folks who built it there, led by Jay Kreps, now have a company called Confluent.
Confluent seems to be pursuing a fairly standard open source business model around Kafka.
Confluent seems to be in the low to mid teens in paying customers.
Confluent believes 1000s of Kafka clusters are in production.
Confluent reports 40 employees and $31 million raised.

At its core Kafka is very simple:

Kafka accepts streams of data in substantially any format, and then streams the data back out, potentially in a highly parallel way.
Any producer or consumer of data can connect to Kafka, via what can reasonably be called a publish/subscribe model.
Kafka handles various issues of scaling, load balancing, fault tolerance and so on.
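The publish/subscribe idea in the list above can be illustrated with a toy, single-process sketch. This is not the Kafka API; the `MiniBroker` class, its method names, and the topic and consumer names are all hypothetical, and the scaling, load balancing and fault tolerance that Kafka handles are left out entirely.

```python
from collections import defaultdict


class MiniBroker:
    """Toy in-memory stand-in for a Kafka-like hub (hypothetical, not the Kafka API)."""

    def __init__(self):
        # topic name -> append-only list of messages, in arrival order
        self._logs = defaultdict(list)
        # (topic, consumer name) -> index of the next message that consumer will read
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        """Append a message in any format to the topic's log; return its offset."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, topic, consumer):
        """Return every message on the topic that this consumer has not yet seen."""
        start = self._offsets[(topic, consumer)]
        batch = self._logs[topic][start:]
        self._offsets[(topic, consumer)] = len(self._logs[topic])
        return batch


broker = MiniBroker()
broker.publish("clicks", b'{"user": 1}')   # JSON-ish bytes
broker.publish("clicks", b"raw,csv,row")   # a different format on the same topic

first_batch = broker.poll("clicks", "analytics")    # both messages
second_batch = broker.poll("clicks", "analytics")   # nothing new
archive_batch = broker.poll("clicks", "archiver")   # an independent consumer sees both
```

Each consumer tracks its own position, so adding a consumer does not require any change on the producer side.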

So it seems fair to say:

Kafka offers the benefits of hub vs. point-to-point connectivity.
Kafka acts like a kind of switch, in the telecom sense. (However, this is probably not a very useful metaphor in practice.)
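The hub vs. point-to-point benefit comes down to simple counting. The sketch below assumes the worst case, in which every producer feeds every consumer; the function names and the example counts are illustrative only.

```python
def point_to_point_links(producers, consumers):
    """Connections needed if every producer talks directly to every consumer."""
    return producers * consumers


def hub_links(producers, consumers):
    """Connections needed if everything connects once to a central hub."""
    return producers + consumers


direct = point_to_point_links(10, 8)   # 80 connections
via_hub = hub_links(10, 8)             # 18 connections
```

The gap widens as systems are added, since one count grows as a product and the other as a sum.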

Categories: Data integration and middleware, Humor, Kafka and Confluent, Market share and customer counts, Microsoft and SQL*Server, Open source, Specific users, Streaming and complex event processing (CEP)

11 Comments

December 1, 2015

Machine learning’s connection to (the rest of) AI

This is part of a four-post series spanning two blogs.

One post gives a general historical overview of the artificial intelligence business.
One post specifically covers the history of expert systems.
One post gives a general present-day overview of the artificial intelligence business.
One post (this one) explores the close connection between machine learning and (the rest of) AI.

1. I think the technical essence of AI is usually:

Inputs come in.
Decisions or actions come out.
More precisely — inputs come in, something intermediate is calculated, and the intermediate result is mapped to a decision or action.
The intermediate results are commonly either numerical (a scalar or perhaps a vector of scalars) or a classification/partition into finitely many possible intermediate outputs.
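A minimal sketch of that shape, assuming a made-up churn example: inputs are mapped to a numerical intermediate result, that result is classified into finitely many labels, and each label is mapped to an action. The weights, threshold, labels and actions are all hypothetical.

```python
def score(inputs, weights):
    """Intermediate result: a single scalar computed from the inputs."""
    return sum(w * x for w, x in zip(weights, inputs))


def classify(value, threshold=0.5):
    """Partition the scalar into finitely many intermediate outputs."""
    return "high_risk" if value >= threshold else "low_risk"


# Final step: each intermediate output maps to a decision or action.
ACTIONS = {"high_risk": "offer_discount", "low_risk": "do_nothing"}


def decide(inputs, weights):
    """Inputs come in; a decision or action comes out."""
    intermediate = score(inputs, weights)
    label = classify(intermediate)
    return ACTIONS[label]


weights = [0.7, 0.2]
action_a = decide([1.0, 0.0], weights)   # score 0.7 -> high_risk
action_b = decide([0.0, 1.0], weights)   # score 0.2 -> low_risk
```

In a machine learning setting the weights would be learned from data rather than written by hand, but the overall pipeline has the same structure.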

Of course, a lot of non-AI software can be described the same way.

To check my claim, please consider:

It fits rules engines/expert systems so simply it’s barely worth saying.
It fits any kind of natural language processing; the intermediate results might be words or phrases or concepts or whatever.
It fits machine vision beautifully.

To see why it’s true from a bottom-up standpoint, please consider the next two points.

2. It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response. Examples of what I mean include: Read more

Categories: Facebook, Google, IBM and DB2, Microsoft and SQL*Server, Predictive modeling and advanced analytics

6 Comments

October 15, 2015

Cassandra and privacy requirements

For starters:

I’ve suggested in the past that multi-data-center capabilities are important for “data sovereignty”/geo-compliance.
The need for geo-compliance just got a lot stronger, with the abolition of the European Union’s Safe Harbour rule for the US. If you collect data in multiple countries, you should be at least thinking about geo-compliance.
Cassandra is an established leader in multi-data-center operation.

But when I made that connection and checked in accordingly with my client Patrick McFadin at DataStax, I discovered that I’d been a little confused about how multi-data-center Cassandra works. The basic idea holds water, but the details are not quite what I was envisioning.

The story starts:

Cassandra groups nodes into logical “data centers” (i.e. token rings).
As a best practice, each physical data center can contain one or more logical data centers, but not vice-versa.
There are two levels of replication — within a single logical data center, and between logical data centers.
Replication within a single data center is planned in the usual way, with the principal data center holding a database likely to have a replication factor of 3.
However, copies of the database held elsewhere may have different replication factors …
… and can indeed have different replication factors for different parts of the database.

In particular, a remote replication factor for Cassandra can be 0. When that happens, you have data sitting in one geographical location that is absent from another geographical location; i.e., you can be in compliance with laws forbidding the export of certain data. To be clear (and this contradicts what I previously believed and hence also implied in this blog):

General multi-data-center operation is not what gives you geo-compliance, because the default case is that the whole database is replicated to each data center.
Instead, you get that effect by tweaking your specific replication settings.
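As a rough illustration of "tweaking your specific replication settings", the sketch below builds a CQL `CREATE KEYSPACE` statement using Cassandra's `NetworkTopologyStrategy`, under which replicas are placed only in the data centers that are listed. The helper function, the keyspace name, and the data center names `dc_eu` and `dc_us` are hypothetical, and treating a keyspace as the "part of the database" being restricted is an assumption.

```python
def keyspace_cql(name, replication_by_dc):
    """Build a CREATE KEYSPACE statement, leaving out data centers whose factor is 0."""
    options = {"class": "NetworkTopologyStrategy"}
    # A data center that is not listed receives no replicas of this keyspace.
    options.update({dc: rf for dc, rf in replication_by_dc.items() if rf > 0})
    body = ", ".join(f"'{key}': {value!r}" for key, value in options.items())
    return f"CREATE KEYSPACE {name} WITH replication = {{{body}}};"


# Data that must stay in Europe: three replicas there, none in the US.
eu_only = keyspace_cql("eu_customers", {"dc_eu": 3, "dc_us": 0})

# Data that may travel: different replication factors per data center.
shared = keyspace_cql("product_catalog", {"dc_eu": 3, "dc_us": 2})

print(eu_only)
print(shared)
```

The first statement names only `dc_eu`, so under these assumptions the restricted keyspace would never be copied to the other data center, while the second is replicated to both.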

Categories: Cassandra, Clustering, DataStax, HBase, NoSQL, Open source, Specific users, Surveillance and privacy

3 Comments

October 15, 2015

Basho and Riak

Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.

For starters:

Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
Basho’s revenue is ~90% subscription.
Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would actually be close to half the total) comes from 2 particular deals of >$4 million each.
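The "close to half the total" remark can be checked with back-of-envelope arithmetic. The sketch below treats the claimed figures (>200 clients, >$100K average contract value) as lower bounds, so the implied total is a floor and the computed share is a ceiling; the variable names are illustrative.

```python
clients = 200                   # claimed: more than 200 enterprise clients
avg_contract_value = 100_000    # claimed: more than $100K, typically over 2-3 years

# Lower bound on total contract value implied by the two claims.
implied_total = clients * avg_contract_value

big_deals = 9_000_000           # the two deals of more than $4 million each

# Upper bound on the share of the total that those two deals represent.
share = big_deals / implied_total

print(f"implied total: ${implied_total:,}")
print(f"share from two deals: {share:.0%}")
```

Because contracts typically run 2-3 years, the implied total is contract value rather than annual revenue.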

Basho’s product line has gotten a bit confusing, but as best I understand things the story is:

There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
Riak TS is for time series, and just coming out now.
Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
There’s an umbrella marketing term of “Basho Data Platform”.

Technical notes on some of that include: Read more

Categories: Aerospike, Basho and Riak, Cassandra, Clustering, Couchbase, Databricks, Spark and BDAS, DataStax, HBase, Health care, Log analysis, MapR, Market share and customer counts, MongoDB, NoSQL, Pricing, Specific users, Splunk

1 Comment

September 14, 2015

DataStax and Cassandra update

MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, somebody with whom I’ve had strong consulting relationships at both a user and a vendor. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.

It seems fair to say that in most cases:

Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data.

Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:

DataStax trumpets British Gas’ plans to collect a lot of sensor data and immediately offer it up for analysis.*
Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.

*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data. 🙂

While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind: Read more

Categories: Business intelligence, Cassandra, Databricks, Spark and BDAS, DataStax, NoSQL, Open source, Petabyte-scale data management, Predictive modeling and advanced analytics, Specific users, Text

6 Comments

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.
