Data models and architecture
Discussion of issues in data modeling, and whether databases should be consolidated or loosely coupled. Related subjects include:
The Spark buzz keeps increasing; almost everybody I talk with expects Spark to win big, probably across several use cases.
Disclosure: I’ll soon be in a substantial client relationship with Databricks, hoping to improve their stealth-mode marketing.
The “real-time analytics” gold rush I called out last year continues. A large fraction of the vendors I talk with have some variant of “real-time analytics” as a central message.
Hadapt laid off its sales and marketing folks, and perhaps some engineers as well. In a nutshell, Hadapt’s approach to SQL-on-Hadoop wasn’t selling vs. the many alternatives, and Hadapt is doubling down on poly-structured data*/schema-on-need.
*While Hadapt doesn’t to my knowledge use the term “poly-structured data”, some other vendors do. And so I may start using it more myself, at least when the poly-structured/multi-structured distinction actually seems significant.
WibiData is partnering with DataStax, WibiData is of course pleased to get access to Cassandra’s user base, which gave me the opportunity to ask why they thought Cassandra had beaten HBase in those accounts. The answer was performance and availability, while Cassandra’s traditional lead in geo-distribution wasn’t mentioned at all.
Disclosure: My fingerprints are all over that deal.
In other news, WibiData has had some executive departures as well, but seems to be staying the course on its strategy. I continue to think that WibiData has a really interesting vision about how to do large-data-volume interactive computing, and anybody in that space would do well to talk with them or at least look into the open source projects WibiData sponsors.
I encountered another apparently-popular machine-learning term — bandit model. It seems to be glorified A/B testing, and it seems to be popular. I think the point is that it tries to optimize for just how much you invest in testing unproven (for good or bad) alternatives.
I had an awkward set of interactions with Gooddata, including my longest conversations with them since 2009. Gooddata is in the early days of trying to offer an all-things-to-all-people analytic stack via SaaS (Software as a Service). I gather that Hadoop, Vertica, PostgreSQL (a cheaper Vertica alternative), Spark, Shark (as a faster version of Hive) and Cassandra (under the covers) are all in the mix — but please don’t hold me to those details.
I continue to think that computing is moving to a combination of appliances, clusters, and clouds. That said, I recently bought a new gaming-class computer, and spent many hours gaming on it just yesterday.* I.e., there’s room for general-purpose workstations as well. But otherwise, I’m not hearing anything that contradicts my core point.
*The last beta weekend for The Elder Scrolls Online; I loved Morrowind.
Some technical background about Splunk
In an October, 2009 technical introduction to Splunk, I wrote (emphasis added):
Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs.
It turns out that the bolded part was changed several years ago. However, I don’t have further details, so let’s move on to Splunk’s DBMS-like aspects.
I also wrote:
The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.
That remains true. Confusingly, Splunk refers to these log increments as “rows”, even though they’re really structured and queried more like documents.
I further wrote:
Splunk has a simple ILM (Information Lifecycle management) story based on time. I didn’t probe for details.
Splunk’s ILM story turns out to be simple indeed.
- As data streams in, Splunk adds it to the most recent — “hot” — bucket. Once a bucket is full, it becomes immutable — “warm” — and a new hot bucket is opened to receive data.
- Splunk executes queries against whichever of these time-slice buckets make sense, then unions results together as needed.
Finally, I wrote:
I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a <name, value> pair is clearly marked, its software does a good job of recognizing same. Beyond that, fields seem to be specified by users when they define searches.
I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.
The point of what I in October, 2013 called
a high(er)-performance data store into which you can selectively copy columns of data
and which Splunk enthusiastically calls its “High Performance Analytic Store” is to meet that latter need.
Inverted list technology is confusing for several reasons, which start: Read more
|Categories: Data models and architecture, NoSQL, SAP AG, Splunk, Structured documents, Text||1 Comment|
A couple of points that arise frequently in conversation, but that I don’t seem to have made clearly online.
“Metadata” is generally defined as “data about data”. That’s basically correct, but it’s easy to forget how many different kinds of metadata there are. My list of metadata kinds starts with:
- Data about data structure. This is the classical sense of the term. But please note:
- In a relational database, structural metadata is rather separate from the data itself.
- In a document database, each document might carry structure information with it.
- Other inputs to core data management functions. Two major examples are:
- Column statistics that inform RDBMS optimizers.
- Value ranges that inform partition pruning or, more generally, data skipping.
- Inputs to ancillary data management functions — for example, security privileges.
- Support for human decisions about data — for example, information about authorship or lineage.
What’s worse, the past year’s most famous example of “metadata”, telephone call metadata, is misnamed. This so-called metadata, much loved by the NSA (National Security Agency), is just data, e.g. in the format of a CDR (Call Detail Record). Calling it metadata implies that it describes other data — the actual contents of the phone calls — that the NSA strenuously asserts don’t actually exist.
And finally, the first bullet point above has a counter-intuitive consequence — all common terminology notwithstanding, relational data is less structured than document data. Reasons include:
- Relational databases usually just hold strings — or maybe numbers — with structural information being held elsewhere.
- Some document databases store structural metadata right with the document data itself.
- Some document databases store data in the form of (name, value) pairs. In some cases additional structure is imposed by naming conventions.
- Actual text documents carry the structure imposed by grammar and syntax.
- A lengthy survey of metadata kinds, biased to Hadoop (August, 2012)
- Metadata as derived data (May, 2011)
- Dataset management (May, 2013)
- Structured/unstructured … multi-structured/poly-structured (May, 2011)
|Categories: Data models and architecture, Hadoop, Structured documents, Surveillance and privacy, Telecommunications||5 Comments|
Cassandra’s reputation in many quarters is:
- World-leading in the geo-distribution feature.
- Impressively scalable.
- Hard to use.
This has led competitors to use, and get away with, sales claims along the lines of “Well, if you really need geo-distribution and can’t wait for us to catch up — which we soon will! — you should use Cassandra. But otherwise, there are better choices.”
My friends at DataStax, naturally, don’t think that’s quite fair. And so I invited them — specifically Billy Bosworth and Patrick McFadin — to educate me. Here are some highlights of that exercise.
DataStax and Cassandra have some very impressive accounts, which don’t necessarily revolve around geo-distribution. Netflix, probably the flagship Cassandra user — since Cassandra inventor Facebook adopted HBase instead — actually hasn’t been using the geo-distribution feature. Confidential accounts include:
- A petabyte or so of data at a very prominent company, geo-distributed, with 800+ nodes, in a kind of block storage use case.
- A messaging application at a very prominent company, anticipated to grow to multiple data centers and a petabyte of so of data, across 1000s of nodes.
- A 300 terabyte single-data-center telecom account (which I can’t find on DataStax’s extensive customer list).
- A huge health records deal.
- A Fortune 10 company.
DataStax and Cassandra won’t necessarily win customer-brag wars versus MongoDB, Couchbase, or even HBase, but at least they’re strongly in the competition.
DataStax claims that simplicity is now a strength. There are two main parts to that surprising assertion. Read more
|Categories: Cassandra, Clustering, Couchbase, Data models and architecture, DataStax, Facebook, HBase, Health care, Log analysis, Market share and customer counts, MongoDB and 10gen, NoSQL, Petabyte-scale data management, Specific users||10 Comments|
It took me a bit of time, and an extra call with Vertica’s long-time R&D chief Shilpa Lawande, but I think I have a decent handle now on Vertica 7, code-named Crane. The two aspects of Vertica 7 I find most interesting are:
- Flex Zone, a schema-on-need technology very much like Hadapt’s (but of course with access to Vertica performance).
- What sounds like an alternate query execution capability for short-request queries, the big point of which is that it saves them from being broadcast across the whole cluster, hence improving scalability. (Adding nodes of course doesn’t buy you much for the portion of a workload that’s broadcast.)
Other Vertica 7 enhancements include:
- A lot of Bottleneck Whack-A-Mole.
- “Significant” improvements to the Vertica management console.
- Security enhancements (Kerberos), Hadoop integration enhancements (HCatalog), and enhanced integration with Hadoop security (Kerberos again).
- Some availability hardening. (“Fault groups”, which for example let you ensure that data is replicated not just to 2+ nodes, but also that the nodes aren’t all on the same rack.)
- Java as an option to do in-database analytics. (Who knew that feature was still missing?)
- Some analytic functionality. (Approximate COUNT DISTINCT, but not yet Approximate MEDIAN.)
Overall, two recurring themes in our discussion were:
- Load and ETL (Extract/Transform/Load) performance, and/or obviating ETL.
- Short-request performance, in the form of more scalable short-request concurrency.
I’m a little shaky on embargo details — but I do know what was in my own quote in a Splunk press release that went out yesterday.
Splunk has been rolling out a lot of news. In particular:
- Hunk follows through on the Hadoop/Splunk (get it?) co-opetition I foreshadowed last year, including access to Hadoop via the same tools that run over the Splunk data store, plus …
- … some Datameer-like capabilities to view partial Hadoop-job results as they flow in.
- Splunk 6 has lots of new features, including a bunch of better please-don’t-call-it-BI capabilities, and …
- … a high(er)-performance data store into which you can selectively copy columns of data.
I imagine there are some operationally-oriented use cases for which Splunk instantly offers the best Hadoop business intelligence choice available. But what I really think is cool is Splunk’s schema-on-need story, wherein:
- Data comes in wholly schema-less, in a time series of text snippets.
- Some of the fields in the text snippets are indexed for faster analysis, automagically or upon user decree.
- All this can now happen over the Splunk data store or (new option) over Hadoop.
- Fields can (in another new option) also be copied to a separate data store, claimed to be of much higher performance.
That highlights a pretty serious and flexible vertical analytic stack. I like it.
|Categories: Business intelligence, Data models and architecture, Data warehousing, Hadoop, Schema on need, Splunk||2 Comments|
Glassbeam checked in recently, and they turn out to exemplify quite a few of the themes I’ve been writing about. For starters:
- Glassbeam has an analytic technology stack focused on poly-structured machine-generated data.
- Glassbeam partially organizes that data into event series …
- … in a schema that is modified as needed.
Glassbeam basics include:
- Founded in 2009.
- Based in Santa Clara. Back-end engineering in Bangalore.
- $6 million in angel money; no other VC.
- High single-digit customer count, …
- … plus another high single-digit number of end customers for an OEM offering a limited version of their product.
All Glassbeam customers except one are SaaS/cloud (Software as a Service), and even that one was only offered a subscription (as oppose to perpetual license) price.
So what does Glassbeam’s technology do? Glassbeam says it is focused on “machine data analytics,” specifically for the “Internet of Things”, which it distinguishes from IT logs.* Specifically, Glassbeam sells to manufacturers of complex devices — IT (most of its sales so far ), medical, automotive (aspirational to date), etc. — and helps them analyze “phone home” data, for both support/customer service and marketing kinds of use cases. As of a recent release, the Glassbeam stack can: Read more
I coined the term schema-on-need last month. More precisely, I coined it while being briefed on JSON-in-Teradata, which was announced earlier this week, and is slated for availability in the first half of 2014.
The basic JSON-in-Teradata story is as you expect:
- A JSON document is stuck into a relational field.
(Oddly, Teradata wasn’t yet sure whether the field would be a BLOB or VARCHAR or something else.)Edit: See Dan Graham’s comment below.
- Fields within the JSON document can be indexed on.
- Those fields can be referenced in SQL statements much as regular Teradata columns can.
You have to retrieve the whole document.Edit: See Dan Graham’s comment below.
- To avert the performance pain of retrieving the whole document, you can of course copy any particular field into a column of its own. (That’s the schema-on-need part of the story.)
JSON virtual columns are referenced a little differently than ordinary physical columns are. Thus, if you materialize a virtual column, you have to change your SQL. If you’re doing business intelligence through a semantic layer, or otherwise have some kind of declarative translation, that’s probably not a big drawback. If you’re coding analytic procedures directly, it still may not be a big drawback — hopefully you won’t reference the virtual column too many times in code before you decide to materialize it instead.
My Bobby McFerrin* imitation notwithstanding, Hadapt illustrates a schema-on-need approach that is slicker than Teradata’s in two ways. First, Hadapt has full SQL transparency between virtual and physical columns. Second, Hadapt handles not just JSON, but anything represented by key-value pairs. Still, like XML before it but more concisely, JSON is a pretty versatile data interchange format. So JSON-in-Teradata would seem to be useful as it stands.
*The singer in the classic 1988 music video Don’t Worry Be Happy. The other two performers, of course, were Elton John and Robin Williams.
|Categories: Data models and architecture, Data warehousing, Hadapt, Schema on need, Structured documents, Teradata||2 Comments|
Teradata Aster 6 has been preannounced (beta in Q4, general release in Q1 2014). The general architectural idea is:
- There are multiple data stores, the first two of which are:
- The classic Aster relational data store.
- A file system that emulates HDFS (Hadoop Distributed File System).
- There are multiple processing “engines”, where an engine is what occupies and controls a processing thread. These start with:
- Generic analytic SQL, as Aster has had all along.
- SQL-MR, the MapReduce Aster has also had all along.
- SQL-Graph aka SQL-GR, a graph analytics system.
- The Aster parser and optimizer accept glorified SQL, and work across all the engines combined.
There’s much more, of course, but those are the essential pieces.
Just to be clear: Teradata Aster 6, aka the Teradata Aster Discovery Platform, includes HDFS compatibility, native MapReduce and ways of invoking Hadoop MapReduce on non-Aster nodes or clusters — but even so, you can’t run Hadoop MapReduce within Aster over Aster’s version of HDFS.
The most dramatic immediate additions are in the graph analytics area.* The new SQL-Graph is supported by something called BSP (Bulk Synchronous Parallel). I’ll start by observing (and some of this is confusing):
- BSP was thought of a long time ago, as a general-purpose computing model, but recently has come to the fore specifically for graph analytics. (Think Pregel and Giraph, along with Teradata Aster.)
- BSP has a kind of execution-graph metaphor, which is different from the graph data it helps analyze.
- BSP is described as being a combination hardware/software technology, but Teradata Aster and everybody else I know of implements it in software only.
- Aster long ago talked of adding a graph data store, but has given up that plan; rather, it wants you to do graph analytics on data stored in tables (or accessed through views) in the usual way.
Use cases suggested are a lot of marketing, plus anti-fraud.
*Pay no attention to Aster’s previous claims to do a good job on graph — and not only via nPath — in SQL-MR.
So far as I can infer from examples I’ve seen, the semantics of Teradata Aster SQL-Graph start:
- Ordinary SQL except in the FROM clause.
- Functions/operators that are the arguments for FROM; of course, they output tables. You can write these yourself, or use Teradata Aster’s prebuilt ones.
Within those functions, the core idea is: Read more
|Categories: Application areas, Aster Data, Business intelligence, Data models and architecture, Data warehousing, Hadoop, Parallelization, Predictive modeling and advanced analytics, RDF and graphs, Teradata||4 Comments|
ClearStory Data is:
- One of the two start-ups I’m most closely engaged with.
- Run by a CEO for whom I have great regard, but who does get rather annoying about secrecy.
- On the verge, finally, of fully destealthing.
I think I can do an interesting post about ClearStory while tap-dancing around the still-secret stuff, so let’s dive in.
- Has developed a full-stack business intelligence technology — which will however be given a snazzier name than “BI” — that is focused on incorporating a broad variety of third-party information, usually along with some of the customer’s own data. Thus, ClearStory …
- … pushes Variety and Variability to extremes, more so than it stresses Volume and Velocity. But it does want to be used at interactive/memory-centric speeds.
- Has put a lot of effort into user interface, but in ways that fit my theory that UI is more about navigation than actual display.
- Has much of its technical differentiation in the area of data mustering …
- … and much of the rest in DBMS-like engineering.
- Is a flagship user of Spark.
- Also relies on Storm, HDFS (Hadoop Distributed File System) and various lesser open source projects (e.g. the ubiquitous Zookeeper).
- Is to a large extent written in Scala.
- Is at this time strictly a multi-tenant SaaS (Software as a Service) offering, except insofar as there’s an on-premises agent to help feed customers’ own data into the core ClearStory cloud service.
To a first approximation, ClearStory ingests data in a system built on Storm (code name: Stormy), dumps it into HDFS, and then operates on it in a system built on Spark (code name: Sparky). Along the way there’s a lot of interaction with another big part of the system, a metadata catalog with no code name I know of. Or as I keep it straight:
- ClearStory’s end-user UI talks mainly to Sparky, and also to the metadata store.
- ClearStory’s administrative UI talks mainly to Stormy, and also to the metadata store.