Data models and architecture
Discussion of issues in data modeling, and whether databases should be consolidated or loosely coupled. Related subjects include:
I’m a little shaky on embargo details — but I do know what was in my own quote in a Splunk press release that went out yesterday.
Splunk has been rolling out a lot of news. In particular:
- Hunk follows through on the Hadoop/Splunk (get it?) co-opetition I foreshadowed last year, including access to Hadoop via the same tools that run over the Splunk data store, plus …
- … some Datameer-like capabilities to view partial Hadoop-job results as they flow in.
- Splunk 6 has lots of new features, including a bunch of better please-don’t-call-it-BI capabilities, and …
- … a high(er)-performance data store into which you can selectively copy columns of data.
I imagine there are some operationally-oriented use cases for which Splunk instantly offers the best Hadoop business intelligence choice available. But what I really think is cool is Splunk’s schema-on-need story, wherein:
- Data comes in wholly schema-less, in a time series of text snippets.
- Some of the fields in the text snippets are indexed for faster analysis, automagically or upon user decree.
- All this can now happen over the Splunk data store or (new option) over Hadoop.
- Fields can (in another new option) also be copied to a separate data store, claimed to be of much higher performance.
That highlights a pretty serious and flexible vertical analytic stack. I like it.
|Categories: Business intelligence, Data models and architecture, Data warehousing, Hadoop, Schema on need, Splunk||2 Comments|
Glassbeam checked in recently, and they turn out to exemplify quite a few of the themes I’ve been writing about. For starters:
- Glassbeam has an analytic technology stack focused on poly-structured machine-generated data.
- Glassbeam partially organizes that data into event series …
- … in a schema that is modified as needed.
Glassbeam basics include:
- Founded in 2009.
- Based in Santa Clara. Back-end engineering in Bangalore.
- $6 million in angel money; no other VC.
- High single-digit customer count, …
- … plus another high single-digit number of end customers for an OEM offering a limited version of their product.
All Glassbeam customers except one are SaaS/cloud (Software as a Service), and even that one was only offered a subscription (as oppose to perpetual license) price.
So what does Glassbeam’s technology do? Glassbeam says it is focused on “machine data analytics,” specifically for the “Internet of Things”, which it distinguishes from IT logs.* Specifically, Glassbeam sells to manufacturers of complex devices — IT (most of its sales so far ), medical, automotive (aspirational to date), etc. — and helps them analyze “phone home” data, for both support/customer service and marketing kinds of use cases. As of a recent release, the Glassbeam stack can: Read more
I coined the term schema-on-need last month. More precisely, I coined it while being briefed on JSON-in-Teradata, which was announced earlier this week, and is slated for availability in the first half of 2014.
The basic JSON-in-Teradata story is as you expect:
- A JSON document is stuck into a relational field.
(Oddly, Teradata wasn’t yet sure whether the field would be a BLOB or VARCHAR or something else.)Edit: See Dan Graham’s comment below.
- Fields within the JSON document can be indexed on.
- Those fields can be referenced in SQL statements much as regular Teradata columns can.
You have to retrieve the whole document.Edit: See Dan Graham’s comment below.
- To avert the performance pain of retrieving the whole document, you can of course copy any particular field into a column of its own. (That’s the schema-on-need part of the story.)
JSON virtual columns are referenced a little differently than ordinary physical columns are. Thus, if you materialize a virtual column, you have to change your SQL. If you’re doing business intelligence through a semantic layer, or otherwise have some kind of declarative translation, that’s probably not a big drawback. If you’re coding analytic procedures directly, it still may not be a big drawback — hopefully you won’t reference the virtual column too many times in code before you decide to materialize it instead.
My Bobby McFerrin* imitation notwithstanding, Hadapt illustrates a schema-on-need approach that is slicker than Teradata’s in two ways. First, Hadapt has full SQL transparency between virtual and physical columns. Second, Hadapt handles not just JSON, but anything represented by key-value pairs. Still, like XML before it but more concisely, JSON is a pretty versatile data interchange format. So JSON-in-Teradata would seem to be useful as it stands.
*The singer in the classic 1988 music video Don’t Worry Be Happy. The other two performers, of course, were Elton John and Robin Williams.
|Categories: Data models and architecture, Data warehousing, Hadapt, Schema on need, Structured documents, Teradata||2 Comments|
Teradata Aster 6 has been preannounced (beta in Q4, general release in Q1 2014). The general architectural idea is:
- There are multiple data stores, the first two of which are:
- The classic Aster relational data store.
- A file system that emulates HDFS (Hadoop Distributed File System).
- There are multiple processing “engines”, where an engine is what occupies and controls a processing thread. These start with:
- Generic analytic SQL, as Aster has had all along.
- SQL-MR, the MapReduce Aster has also had all along.
- SQL-Graph aka SQL-GR, a graph analytics system.
- The Aster parser and optimizer accept glorified SQL, and work across all the engines combined.
There’s much more, of course, but those are the essential pieces.
Just to be clear: Teradata Aster 6, aka the Teradata Aster Discovery Platform, includes HDFS compatibility, native MapReduce and ways of invoking Hadoop MapReduce on non-Aster nodes or clusters — but even so, you can’t run Hadoop MapReduce within Aster over Aster’s version of HDFS.
The most dramatic immediate additions are in the graph analytics area.* The new SQL-Graph is supported by something called BSP (Bulk Synchronous Parallel). I’ll start by observing (and some of this is confusing):
- BSP was thought of a long time ago, as a general-purpose computing model, but recently has come to the fore specifically for graph analytics. (Think Pregel and Giraph, along with Teradata Aster.)
- BSP has a kind of execution-graph metaphor, which is different from the graph data it helps analyze.
- BSP is described as being a combination hardware/software technology, but Teradata Aster and everybody else I know of implements it in software only.
- Aster long ago talked of adding a graph data store, but has given up that plan; rather, it wants you to do graph analytics on data stored in tables (or accessed through views) in the usual way.
Use cases suggested are a lot of marketing, plus anti-fraud.
*Pay no attention to Aster’s previous claims to do a good job on graph — and not only via nPath — in SQL-MR.
So far as I can infer from examples I’ve seen, the semantics of Teradata Aster SQL-Graph start:
- Ordinary SQL except in the FROM clause.
- Functions/operators that are the arguments for FROM; of course, they output tables. You can write these yourself, or use Teradata Aster’s prebuilt ones.
Within those functions, the core idea is: Read more
|Categories: Application areas, Aster Data, Business intelligence, Data models and architecture, Data warehousing, Hadoop, Parallelization, Predictive modeling and advanced analytics, RDF and graphs, Teradata||4 Comments|
ClearStory Data is:
- One of the two start-ups I’m most closely engaged with.
- Run by a CEO for whom I have great regard, but who does get rather annoying about secrecy.
- On the verge, finally, of fully destealthing.
I think I can do an interesting post about ClearStory while tap-dancing around the still-secret stuff, so let’s dive in.
- Has developed a full-stack business intelligence technology — which will however be given a snazzier name than “BI” — that is focused on incorporating a broad variety of third-party information, usually along with some of the customer’s own data. Thus, ClearStory …
- … pushes Variety and Variability to extremes, more so than it stresses Volume and Velocity. But it does want to be used at interactive/memory-centric speeds.
- Has put a lot of effort into user interface, but in ways that fit my theory that UI is more about navigation than actual display.
- Has much of its technical differentiation in the area of data mustering …
- … and much of the rest in DBMS-like engineering.
- Is a flagship user of Spark.
- Also relies on Storm, HDFS (Hadoop Distributed File System) and various lesser open source projects (e.g. the ubiquitous Zookeeper).
- Is to a large extent written in Scala.
- Is at this time strictly a multi-tenant SaaS (Software as a Service) offering, except insofar as there’s an on-premises agent to help feed customers’ own data into the core ClearStory cloud service.
To a first approximation, ClearStory ingests data in a system built on Storm (code name: Stormy), dumps it into HDFS, and then operates on it in a system built on Spark (code name: Sparky). Along the way there’s a lot of interaction with another big part of the system, a metadata catalog with no code name I know of. Or as I keep it straight:
- ClearStory’s end-user UI talks mainly to Sparky, and also to the metadata store.
- ClearStory’s administrative UI talks mainly to Stormy, and also to the metadata store.
There’s a growing trend for DBMS to beef up their support for multiple data manipulation languages (DMLs) or APIs — and there’s a special boom in JSON support, MongoDB-compatible or otherwise. So I talked earlier tonight with IBM’s Bobbie Cochrane about how JSON is managed in DB2.
For starters, let’s note that there are at least four strategies IBM could have used.
- Store JSON in a BLOB (Binary Large OBject) or similar existing datatype. That’s what IBM actually chose.
- Store JSON in a custom datatype, using the datatype extensibility features DB2 has had since the 1990s. IBM is not doing this, and doesn’t see a need to at this time.
- Use DB2 pureXML, along with some kind of JSON/XML translator. DB2 managed JSON this way in the past, via UDFs (User-Defined Functions), but that implementation is superseded by the new BLOB-based approach, which offers better performance in ingest and query alike.
- Shred — to use a term from XML days — JSON into a bunch of relational columns. IBM experimented with this approach, but ultimately rejected it. In dismissing shredding, Bobbie also disdained any immediate support for schema-on-need.
IBM’s technology choices are of course influenced by its use case focus. It’s reasonable to divide MongoDB use cases into two large buckets:
- Hardcore internet and/or machine-generated data, for example from a website.
- Enterprise data aggregation, for example a “360-degree customer view.”
IBM’s DB2 JSON features are targeted at the latter bucket. Also, I suspect that IBM is generally looking for a way to please users who enjoy working on and with their MongoDB skills. Read more
|Categories: Data models and architecture, IBM and DB2, MongoDB and 10gen, NoSQL, pureXML, Structured documents||2 Comments|
Two years ago I wrote about how Zynga managed analytic data:
Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay‘s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) … Zynga adds data into the real schema when it’s clear it will be needed for a while.
What was then the province of a few huge web companies is now poised to be a broader trend. Specifically:
- Relational DBMS are adding or enhancing their support for complex datatypes, to accommodate various kinds of machine-generated data.
- MongoDB-compatible JSON is the flavor of the day on the short-request side, but alternatives include other JSON, XML, other key-value, or text strings.
- It is often possible to index on individual attributes inside the complex datatype.
- The individual attributes inside the complex datatypes amount to virtual columns, which can play similar roles in SQL statements as physical columns do.
- Over time, the DBA may choose to materialize virtual columns as additional physical columns, to boost query performance.
That migration from virtual to physical columns is what I’m calling “schema-on-need”. Thus, schema-on-need is what you invoke when schema-on-read no longer gets the job done.
|Categories: Data models and architecture, Data warehousing, MongoDB and 10gen, PostgreSQL, Schema on need, Structured documents||8 Comments|
Two subjects in one post, because they were too hard to separate from each other
Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.
Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:
- The object/relational, aka universal, extensibility features developed in the 1990s for Oracle, DB2, Informix, Illustra, and Postgres. The most successful extensions probably have been:
- Geospatial indexing via ESRI.
- Full-text indexing, notwithstanding questionable features and performance.
- MySQL storage engines.
- MPP (Massively Parallel Processing) analytic RDBMS relying on single-node PostgreSQL, Ingres, and/or Microsoft SQL Server — e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or Hadapt.
- Splits in which a DBMS has serious processing both in a “database” layer and in a predicate-pushdown “storage” layer — most famously Oracle Exadata, but also MarkLogic, InfiniDB, and others.
- SQL-on-HDFS — Hive, Impala, Stinger, Shark and so on (including Hadapt).
Other examples on my mind include:
- Data manipulation APIs being added to key-value stores such as Couchbase and Aerospike.
- TokuMX, the Tokutek/MongoDB hybrid I just blogged about.
- NuoDB’s willing reliance on third-party key-value stores (or HDFS in the role of one).
- FoundationDB’s strategy, and specifically its acquisition of Akiban.
And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.
In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include: Read more
I’ll start with three observations:
- Computer systems can’t be entirely tightly coupled — nothing would ever get developed or tested.
- Computer systems can’t be entirely loosely coupled — nothing would ever get optimized, in performance and functionality alike.
- In an ongoing trend, there is and will be dramatic refactoring as to which connections wind up being loose or tight.
As written, that’s probably pretty obvious. Even so, it’s easy to forget just how pervasive the refactoring is and is likely to be. Let’s survey some examples first, and then speculate about consequences. Read more
I visited Cloudera Friday for, among other things, a chat about Impala with Marcel Kornacker and colleagues. Highlights included:
- Impala is meant to someday be a competitive MPP (Massively Parallel Processing) analytic RDBMS.
- At the moment, it is not one. For example, Impala lacks any meaningful form of workload management or query optimization.
- While Impala will run against any HDFS (Hadoop Distributed File System) file format, claims of strong performance assume that the data is in Parquet …
- … which is the replacement for the short-lived Trevni …
- … and which for most practical purposes is true columnar.
- Impala is also meant to be more than an RDBMS; Parquet and presumably in the future Impala can accommodate nested data structures.
- Just as Impala runs against most or all HDFS file formats, Parquet files can be used by most Hadoop execution engines, and of course by Pig and Hive.
- The Impala roadmap includes workload management, query optimization, data skipping, user-defined functions, hash distribution, two turtledoves, and a partridge in a pear tree.
Data gets into Parquet via batch jobs only — one reason it’s important that Impala run against multiple file formats — but background format conversion is another roadmap item. A single table can be split across multiple formats — e.g., the freshest data could be in HBase, with the rest is in Parquet.