In 1981, Gerry Chichester and Vaughan Merlyn did a user-survey-based report about transaction-oriented fourth-generation languages, the leading application development technology of their day. The report included top-ten lists of important features during the buying cycle and after implementation. The items on each list were very similar — but the order of the items was completely different. And so the report highlighted what I regard as an eternal truth of the enterprise software industry:
What users value in the product-buying process is quite different from what they value once a product is (being) put into use.
Here are some thoughts about how that comes into play today.
Wants outrunning needs
1. For decades, BI tools have been sold in large part via demos of snazzy features the CEO would like to have on his desk. First it was pretty colors; then it was maps; now sometimes it’s “real-time” changing displays. Other BI features, however, are likely to be more important in practice.
2. In general, the need for “real-time” BI data freshness is often exaggerated. If you’re a human being doing a job that’s also often automated at high speed — for example network monitoring or stock trading — there’s a good chance you need fully human real-time BI. Otherwise, how much does a 5-15 minute delay hurt? Even if you’re monitoring website sell-through — are your business volumes really high enough that 5 minutes matters much? eBay answered “yes” to that question many years ago, but few of us work for businesses anywhere near eBay’s scale.
Even so, the want for speed keeps growing stronger.
3. Similarly, some desires for elastic scale-out are excessive. Your website selling koi pond accessories should always run well on a single server. If you diversify your business to the point that that’s not true, you’ll probably rewrite your app by then as well.
4. Some developers want to play with cool new tools. That doesn’t mean those tools are the best choice for the job. In particular, boring old SQL has merits — such as joins! — that shiny NoSQL hasn’t yet replicated.
5. Some developers, on the other hand, want to keep using their old tools, on which they are their employers’ greatest experts. That doesn’t mean those tools are the best choice for the job either.
6. More generally, some enterprises insist on brand labels that add little value but lots of expense. Yes, there are many benefits to vendor consolidation, and you may avoid many headaches if you stick with not-so-cutting-edge technology. But “enterprise-grade” hardware failure rates may not differ enough from “consumer-grade” ones to be worth paying for.
|Categories: Benchmarks and POCs, Business intelligence, Cloud computing, Clustering, Data models and architecture, Data warehousing, NoSQL, Software as a Service (SaaS), Vertica Systems||3 Comments|
In my latest post, I noted that
The “real-time analytics” gold rush I called out last year continues.
I also recently mocked the slogan
Analytics for everybody!
So when I saw today an email subject line
[Vendor X] to announce real-time analytics for everyone …
I laughed. Indeed, I snorted so loudly that Linda — who was on a different floor of our house — called to check that I was OK.
As the day progressed, I had a consulting call with a client, and what did I see on the first substantive slide? There were references to:
real-time data analysis
The trends — real or imaginary — are melting into each other!
The Spark buzz keeps increasing; almost everybody I talk with expects Spark to win big, probably across several use cases.
Disclosure: I’ll soon be in a substantial client relationship with Databricks, hoping to improve their stealth-mode marketing.
The “real-time analytics” gold rush I called out last year continues. A large fraction of the vendors I talk with have some variant of “real-time analytics” as a central message.
Hadapt laid off its sales and marketing folks, and perhaps some engineers as well. In a nutshell, Hadapt’s approach to SQL-on-Hadoop wasn’t selling vs. the many alternatives, and Hadapt is doubling down on poly-structured data*/schema-on-need.
*While Hadapt doesn’t to my knowledge use the term “poly-structured data”, some other vendors do. And so I may start using it more myself, at least when the poly-structured/multi-structured distinction actually seems significant.
WibiData is partnering with DataStax, WibiData is of course pleased to get access to Cassandra’s user base, which gave me the opportunity to ask why they thought Cassandra had beaten HBase in those accounts. The answer was performance and availability, while Cassandra’s traditional lead in geo-distribution wasn’t mentioned at all.
Disclosure: My fingerprints are all over that deal.
In other news, WibiData has had some executive departures as well, but seems to be staying the course on its strategy. I continue to think that WibiData has a really interesting vision about how to do large-data-volume interactive computing, and anybody in that space would do well to talk with them or at least look into the open source projects WibiData sponsors.
I encountered another apparently-popular machine-learning term — bandit model. It seems to be glorified A/B testing, and it seems to be popular. I think the point is that it tries to optimize for just how much you invest in testing unproven (for good or bad) alternatives.
I had an awkward set of interactions with Gooddata, including my longest conversations with them since 2009. Gooddata is in the early days of trying to offer an all-things-to-all-people analytic stack via SaaS (Software as a Service). I gather that Hadoop, Vertica, PostgreSQL (a cheaper Vertica alternative), Spark, Shark (as a faster version of Hive) and Cassandra (under the covers) are all in the mix — but please don’t hold me to those details.
I continue to think that computing is moving to a combination of appliances, clusters, and clouds. That said, I recently bought a new gaming-class computer, and spent many hours gaming on it just yesterday.* I.e., there’s room for general-purpose workstations as well. But otherwise, I’m not hearing anything that contradicts my core point.
*The last beta weekend for The Elder Scrolls Online; I loved Morrowind.
Some technical background about Splunk
In an October, 2009 technical introduction to Splunk, I wrote (emphasis added):
Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs.
It turns out that the bolded part was changed several years ago. However, I don’t have further details, so let’s move on to Splunk’s DBMS-like aspects.
I also wrote:
The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.
That remains true. Confusingly, Splunk refers to these log increments as “rows”, even though they’re really structured and queried more like documents.
I further wrote:
Splunk has a simple ILM (Information Lifecycle management) story based on time. I didn’t probe for details.
Splunk’s ILM story turns out to be simple indeed.
- As data streams in, Splunk adds it to the most recent — “hot” — bucket. Once a bucket is full, it becomes immutable — “warm” — and a new hot bucket is opened to receive data.
- Splunk executes queries against whichever of these time-slice buckets make sense, then unions results together as needed.
Finally, I wrote:
I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a <name, value> pair is clearly marked, its software does a good job of recognizing same. Beyond that, fields seem to be specified by users when they define searches.
I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.
The point of what I in October, 2013 called
a high(er)-performance data store into which you can selectively copy columns of data
and which Splunk enthusiastically calls its “High Performance Analytic Store” is to meet that latter need.
Inverted list technology is confusing for several reasons, which start: Read more
|Categories: Data models and architecture, NoSQL, SAP AG, Splunk, Structured documents, Text||1 Comment|
For quite some time, one of the most frequent marketing pitches I’ve heard is “Analytics made easy for everybody!”, where by “quite some time” I mean “over 30 years”. “Uniquely easy analytics” is a claim that I meet with the greatest of skepticism.* Further confusing matters, these claims are usually about what amounts to business intelligence tools, but vendors increasingly say “Our stuff is better than the BI that came before, so we don’t want you to call it ‘BI’ as well.”
*That’s even if your slide deck doesn’t contain a picture of a pyramid of user kinds; if there actually is such a drawing, then the chance that I believe you is effectively nil.
All those caveats notwithstanding, there are indeed at least three forms of widespread analytics:
- Fairly standalone, eas(ier) to use business intelligence tools, sometimes marketed as focusing on “data exploration” or “data discovery”.
- Charts and graphs integrated or at least well-embedded into production applications. This technology is on a long-term rise. But in some sense, integrated reporting has been around since the invention of accounting.
- Predictive analytics built into automated systems, for example ad selection. This is not what is usually meant by the “easy analytics” claim, and I’ll say no more about it in this post.
It would be nice to say that the first two bullet points represent a fairly clean operational/investigative BI split, but that would be wrong; human real-time dashboards can at once be standalone and operational.
|Categories: Business intelligence, Data integration and middleware, Data warehousing||Leave a Comment|
A couple of points that arise frequently in conversation, but that I don’t seem to have made clearly online.
“Metadata” is generally defined as “data about data”. That’s basically correct, but it’s easy to forget how many different kinds of metadata there are. My list of metadata kinds starts with:
- Data about data structure. This is the classical sense of the term. But please note:
- In a relational database, structural metadata is rather separate from the data itself.
- In a document database, each document might carry structure information with it.
- Other inputs to core data management functions. Two major examples are:
- Column statistics that inform RDBMS optimizers.
- Value ranges that inform partition pruning or, more generally, data skipping.
- Inputs to ancillary data management functions — for example, security privileges.
- Support for human decisions about data — for example, information about authorship or lineage.
What’s worse, the past year’s most famous example of “metadata”, telephone call metadata, is misnamed. This so-called metadata, much loved by the NSA (National Security Agency), is just data, e.g. in the format of a CDR (Call Detail Record). Calling it metadata implies that it describes other data — the actual contents of the phone calls — that the NSA strenuously asserts don’t actually exist.
And finally, the first bullet point above has a counter-intuitive consequence — all common terminology notwithstanding, relational data is less structured than document data. Reasons include:
- Relational databases usually just hold strings — or maybe numbers — with structural information being held elsewhere.
- Some document databases store structural metadata right with the document data itself.
- Some document databases store data in the form of (name, value) pairs. In some cases additional structure is imposed by naming conventions.
- Actual text documents carry the structure imposed by grammar and syntax.
- A lengthy survey of metadata kinds, biased to Hadoop (August, 2012)
- Metadata as derived data (May, 2011)
- Dataset management (May, 2013)
- Structured/unstructured … multi-structured/poly-structured (May, 2011)
|Categories: Data models and architecture, Hadoop, Structured documents, Surveillance and privacy, Telecommunications||4 Comments|
Memory-centric data management is confusing. And so I’m going to clarify a couple of things about MemSQL 3.0 even though I don’t yet have a lot of details.* They are:
- MemSQL has historically been an in-memory row store, which as of last year scales out.
- It turns out that the MemSQL row store actually has two table types. One is scaled out. The other — called “reference” — is replicated on every node.
- MemSQL has now added a third table type, which is columnar and which resides in flash memory.
- If you want to keep data in, for example, both the scale-out row store and the column store, you’d have to copy/replicate it within MemSQL. And if you wanted to access data from both versions at once (e.g. because different copies cover different time periods), you’d likely have to do a UNION or something like that.
*MemSQL’s first columnar offering sounds pretty basic; for example, there’s no columnar compression yet. (Edit: Oops, that’s not accurate. See comment below.) But at least they actually have one, which puts them ahead of many other row-based RDBMS vendors that come to mind.
And to hammer home the contrast:
- IBM, Oracle and Microsoft, which all sell row-based DBMS meant to run on disk or other persistent storage, have added or will add columnar options that run in RAM.
- MemSQL, which sells a row-based DBMS that runs in RAM, has added a columnar option that runs in persistent solid-state storage.
|Categories: Columnar database management, Database compression, In-memory DBMS, MemSQL, Solid-state memory||12 Comments|
Ever more products try to integrate SQL with Hadoop, and discussions of them seem confused, in line with Monash’s First Law of Commercial Semantics. So let’s draw some distinctions, starting with (and these overlap):
- Are the SQL engine and Hadoop:
- Necessarily on the same cluster?
- Necessarily or at least most naturally on different clusters?
- How, if at all, is Hadoop invoked by the SQL engine? Specifically, what is the role of:
- HDFS (Hadoop Distributed File System)?
- Hadoop MapReduce?
- How, if at all, is the SQL engine invoked by Hadoop?
- If something is called a “connector”, then Hadoop and the SQL engine are most likely on separate clusters. Good features include (but these can partially contradict each other):
- A way of making data transfer maximally parallel.
- Query planning that is smart about when to process on the SQL engine and when to use Hadoop’s native SQL (Hive or otherwise).
- If something is called “SQL-on-Hadoop”, then Hadoop and the SQL engine are or should be on the same cluster, using the same nodes to store and process data. But while that’s a necessary condition, I’d prefer that it not be sufficient.
Let’s go to some examples. Read more
|Categories: Cloudera, Data integration and middleware, EAI, EII, ETL, ELT, ETLT, Hadapt, Hadoop, HBase, Hortonworks, MapReduce, Microsoft and SQL*Server, NewSQL, PostgreSQL, SQL/Hadoop integration, Teradata||26 Comments|
From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that includes:
- Hadoop (always, and please see below).
- Analytic RDBMS (ditto).
- NoSQL and NewSQL.
- Specifically, SQL-on-Hadoop
- Spark and other memory-centric technology, including streaming.
- Public policy, mainly but not only in the area of surveillance/privacy.
- General strategic advice for all sizes of tech company.
Other stuff on my mind includes but is not limited to:
1. Certain categories of buying organizations are inherently leading-edge.
- Internet companies have adopted Hadoop, NoSQL, NewSQL and all that en masse. Often, they won’t even look at things that are conventional or expensive.
- US telecom companies have been buying 1 each of every DBMS on the market since pre-relational days.
- Financial services firms — specifically algorithmic traders and broker-dealers — have been in their own technical world for decades …
- … as have national-security agencies …
- … as have pharmaceutical research departments.
Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.
I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.
- Spark is very new. All Spark adoption is recent.
- Databricks was founded to commercialize Spark. It is very much in stealth mode …
- … except insofar as Databricks folks are going out and trying to drum up Spark adoption.
- Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated.
- Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
- Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s views of Spark’s role in the world is different from the Spark team’s.
The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:
- Spark is a distributed execution engine for analytic processes …
- … which works well with Hadoop.
- Spark is distinguished by a flexible in-memory data model …
- … and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
- Intended analytic use cases for Spark include:
- SQL data manipulation.
- ETL-like data manipulation.
- Streaming-like data manipulation.
- Machine learning.
- Graph analytics.