Data models and architecture
Discussion of issues in data modeling, and whether databases should be consolidated or loosely coupled.
WibiData, derived data, and analytic schema flexibility
My clients at Odiago, vendors of WibiData, have changed their company name simply to WibiData. Even better, they blogged with more detail as to how WibiData works, in what is essentially a follow-on to my original WibiData post last October. Among other virtues, WibiData turns out to be a poster child for my views on derived data and the corresponding schema evolution.
Interesting quotes include:
WibiData is designed to store … transactional data side-by-side with profile and other derived data attributes.
… the ability to add new ad-hoc columns to a table enables more flexible analysis: output data that is the result of one analytic pipeline is stored adjacent to its input data, meaning that you can easily use this as input to second- or third-order derived data as well.
schemas can vary over time; you can easily add a field to a record, or delete a field. … But even though you start collecting that new data, your existing analysis pipelines can treat records like they always did; programs that don’t yet know about the new cookie are still compatible with both the old records already collected, and the new records with the additional field. New programs fill in default values for old data recorded before a field was added, applying the new schema at read time.
schemas for every column are stored in a data dictionary that matches column names with their schemas, as well as human-readable descriptions of the data.
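To make the read-time schema idea in these quotes concrete, here is a minimal sketch in plain Python. It is not WibiData's or Avro's API; the `DATA_DICTIONARY` layout, the `read_with_schema` helper, and all field names and defaults are hypothetical, chosen only to show how a reader can fill in defaults for old records and ignore fields it does not know about.

```python
from copy import deepcopy

# Hypothetical data dictionary: column name -> human-readable description
# plus the fields of the current schema and their defaults.
DATA_DICTIONARY = {
    "clicks": {
        "description": "One record per page view",
        "fields": {
            "url": {"default": ""},
            "timestamp": {"default": 0},
            "cookie_id": {"default": None},  # field added after collection began
        },
    },
}


def read_with_schema(column, stored_record, wanted_fields=None):
    """Project a stored record onto the reader's view of the schema.

    Fields missing from old records get the dictionary default; fields the
    reader does not ask for are simply ignored.
    """
    fields = DATA_DICTIONARY[column]["fields"]
    names = wanted_fields if wanted_fields is not None else list(fields)
    return {
        name: deepcopy(stored_record.get(name, fields[name]["default"]))
        for name in names
    }


old_record = {"url": "/home", "timestamp": 1325376000}
new_record = {"url": "/cart", "timestamp": 1325376060, "cookie_id": "abc123"}

# A new program sees a default for the field the old record never had.
new_view = read_with_schema("clicks", old_record)
# An old program that only knows two fields reads new records unchanged.
old_view = read_with_schema("clicks", new_record, ["url", "timestamp"])
```

The stored records are never rewritten in this sketch; the schema is applied only when a program reads them, which is the property the quotes emphasize.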
Interesting aspects of the post that don’t lend themselves as well to being excerpted include:
- How the Produce-Gather “analysis calculus” — i.e. framework — works.
- How this all ties into Apache projects (and sub-projects) such as Hadoop, HBase, and Avro.
Categories: Data models and architecture, Data warehousing, NoSQL, Odiago and WibiData
Splunk update
Splunk is announcing the Splunk 4.3 point release. Before discussing it, let’s recall a few things about Splunk, starting with:
- Splunk is first and foremost an analytic DBMS …
- … used to manage logs and similar multistructured data.
- Splunk’s DML (Data Manipulation Language) is based on text search, not on SQL.
- Splunk has extended its DML in natural ways (e.g., you can use it to do calculations and even some statistics).
- Splunk bundles some (very) basic, Splunk-specific business intelligence capabilities.
- The paradigmatic use of Splunk is to monitor IT operations in real time. However:
- There also are plenty of non-real-time uses for Splunk.
- Splunk is proudest of its growth in non-IT quasi-real-time uses, such as the marketing side of web operations.
As in any release, a lot of Splunk 4.3 is about “Oh, you didn’t have that before?” features and Bottleneck Whack-A-Mole performance speed-up. One performance enhancement is Bloom filters, which are a very hot topic these days. More important is a switch from Flash to HTML5, so as to accommodate mobile devices with less server-side rendering. Splunk reports that its users (especially the non-IT ones) really want to get Splunk information on tablet devices. While this somewhat contradicts what I wrote a few days ago pooh-poohing mobile BI, let me hasten to point out:
- Splunk is used for a lot of (quasi) real-time monitoring.
- Splunk’s desktop user interfaces are, by BI standards, quite primitive.
That’s pretty much the ideal scenario for mobile BI: Timeliness matters and prettiness doesn’t.
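As background on the Bloom filters mentioned above as a performance enhancement, here is a minimal generic sketch. A Bloom filter is a compact bit array that can say "definitely not present" or "possibly present" for an item, which lets a search skip chunks of data that cannot contain a term. How Splunk 4.3 actually implements or applies them is not described here; the class name, sizes, and hashing scheme below are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means the item was definitely never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical use: one filter per chunk of log data, holding its terms.
chunk_terms = BloomFilter()
for term in ["error", "timeout", "login"]:
    chunk_terms.add(term)

can_skip_chunk = not chunk_terms.might_contain("error")
```

A search for a term only has to open the chunks whose filters answer "possibly present", at the cost of occasionally opening a chunk that turns out not to contain it.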
Categories: Business intelligence, Data models and architecture, Data warehousing, Log analysis, Specific users, Splunk, Structured documents, Web analytics
Big data terminology and positioning
Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions:
- Bigness — Volume, Velocity, size
- Structure — Variety, Variability, Complexity
given that
- High-velocity “big data” problems are usually high-volume as well.*
- Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction.
But the conflation should stop there.
*Low-volume/high-velocity problems are commonly referred to as “event processing” and/or “streaming”.
When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2×2 matrix of possibilities. For want of better alternatives, my suggestions are:
- Relational big data is data of high volume that fits well into a relational DBMS.
- Multi-structured big data is data of high volume that doesn’t fit well into a relational DBMS. Alternative: Poly-structured big data.
- Conventional relational data is data of not-so-high volume that fits well into a relational DBMS. Alternatives: Ordinary/normal/smaller relational data.
- Smaller poly-structured data is data for which dynamic schema capabilities are important, but which doesn’t rise to “big data” volume.
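The 2×2 matrix behind these four terms can be written down as a tiny lookup. This is only a restatement of the list above in code; the function name and the two boolean inputs are hypothetical, and deciding what counts as "high volume" or "fits well" remains a judgment call that the code does not make for you.

```python
# Keys are (high_volume, fits_relational_dbms).
TERMINOLOGY = {
    (True, True): "relational big data",
    (True, False): "multi-structured big data",
    (False, True): "conventional relational data",
    (False, False): "smaller poly-structured data",
}


def classify(high_volume: bool, fits_relational_dbms: bool) -> str:
    """Return the suggested term for one cell of the 2x2 matrix."""
    return TERMINOLOGY[(high_volume, fits_relational_dbms)]


example = classify(high_volume=True, fits_relational_dbms=False)
```

Bigness and structure are separate inputs here, which is the point: neither one determines the other.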
The cool aspects of Odiago WibiData
Christophe Bisciglia and Aaron Kimball have a new company.
- It’s called Odiago, and is one of my gratifyingly more numerous tiny clients.
- Odiago’s product line is called WibiData, after the justly popular We Be Sushi restaurants.
- We’ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on TechCrunch. But this is the place for — well, for the tech crunch.
WibiData is designed for management of, investigative analytics on, and operational analytics on consumer internet data, the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop,* is a data management and analytic execution layer. That’s where the secret sauce resides. Also included are:
- REST APIs for interactive access.
- Import/export tools, including JDBC access.
- Management tools.
- Analytic libraries — data mining, predictive analytics, machine learning, and so on.
The whole thing is in beta, with about three (paying) beta customers.
*And Avro and so on.
The core ideas of WibiData include:
- ALL data pertaining to a single user (or mobile device) is kept in a single, possibly very long, HBase row.
- There are two primary operators in WibiData, Produce and Gather.
- Produce operates on single rows. It can operate on one row at HBase speed (milliseconds) if you need to inform an interactive user response. Or it can operate on the whole database in batch via Hadoop MapReduce.
- It is reasonable to think of Produce as mainly doing two things. One is the aforementioned serving of data out of WibiData into interactive applications. The other is scoring, classifying, recommending, etc. on individual users (i.e. rows), in line with an analytic model.
- Gather typically operates on all your rows at once, and emits suitable input for a MapReduce Reduce step. It is reasonable to think of Gather as being a key cog in the training of analytic models.
- HBase schema management is done at the WibiData system level, not directly in applications. There’s a WibiData HBase data dictionary, powered by a set of system tables, that specifies cell data types/record types and, in effect, primitive schemas.
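The division of labor between Produce and Gather can be sketched in a few lines of plain Python. This is not WibiData code and uses neither HBase nor Hadoop; the in-memory `rows` dictionary, the function names, the spending threshold, and the segment labels are all hypothetical stand-ins chosen to show the shape of the two operators.

```python
from collections import defaultdict

# One row per user; input data and derived data live in the same row.
rows = {
    "user1": {"purchases": [20.0, 35.0]},
    "user2": {"purchases": [5.0]},
    "user3": {"purchases": [120.0, 80.0, 40.0]},
}


def produce(row):
    """Row-at-a-time: score or classify one user from that user's row alone."""
    total = sum(row["purchases"])
    return {"total_spend": total, "segment": "high" if total >= 100 else "low"}


def gather(row):
    """Emit (key, value) pairs from one row, as input to a reduce step."""
    yield row["segment"], row["total_spend"]


def reduce_step(pairs):
    """Stand-in for a MapReduce Reduce: average the values for each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) / len(values) for key, values in grouped.items()}


# Batch Produce: run over every row and store the output next to its input.
for row in rows.values():
    row.update(produce(row))

# Gather runs across all rows and feeds the reduce step.
segment_averages = reduce_step(
    pair for row in rows.values() for pair in gather(row)
)
```

The interactive case would call the same `produce` function on a single row fetched by key, while the batch case runs it over everything, which mirrors the millisecond versus MapReduce distinction described above.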
Categories: Data models and architecture, HBase, Hadoop, NoSQL, Odiago and WibiData, Predictive modeling and advanced analytics, Web analytics
What those nested data structures are about
As I’ve noted before, the very big web companies have an issue with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.
The explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of eBay’s Singularity project. In simplest terms, one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs, which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:
- All 50 search results you were shown, and their positions in the search rankings.
- Every ad, image, or graphical element.
- An ID as to which test you were participating in (every page you see on eBay has some element being tested).
*Oliver is leaving eBay for a still-secret large company. I would conjecture that Michael McIntire is on the move too, either to replace Oliver or to go with him, but Oliver did a very good job of not commenting on the matter.
There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all this information via joins would be brutally expensive. What’s more, reconstructing all this information via joins could be impractical. Some of it comes from third-party ad servers, which might not reproduce the same ads upon demand. Other information is in the form of rankings, which can’t always be reliably reproduced from one query to the next. (That’s just one of several reasons text search and relational DBMS are an awkward fit.)
Also, there’s a strong dynamic schema flavor to these databases. The list of attributes for one web click might be very different in kind from the list for the next page. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn’t necessarily make a lot of sense.
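A small sketch may help show what "storing it together" means in practice. The record below is a made-up page-view event, not eBay's actual format; every field name and value is hypothetical. The `flatten` helper shows how the same event explodes into many rows once it is forced into a fixed (event, attribute, value) relational shape.

```python
# One event, with its attribute information nested inside it.
page_view = {
    "event_id": "e1",
    "page": "search_results",
    "attributes": {
        "test_id": "T42",
        "search_results": [
            {"position": 1, "item_id": "A"},
            {"position": 2, "item_id": "B"},
        ],
        "ads": [{"slot": "top", "ad_id": "ad-9"}],
    },
}


def flatten(event):
    """Explode one nested event into (event_id, path, value) rows."""
    rows = []

    def walk(prefix, value):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(f"{prefix}.{key}" if prefix else key, child)
        elif isinstance(value, list):
            for index, child in enumerate(value):
                walk(f"{prefix}[{index}]", child)
        else:
            rows.append((event["event_id"], prefix, value))

    walk("", event["attributes"])
    return rows


flat_rows = flatten(page_view)
```

Even this toy event becomes seven rows; with 50 search results and every ad and graphical element recorded, a single page view would become hundreds of rows that have to be joined back together to see what the user was actually shown.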
Categories: Data models and architecture, Data warehousing, Log analysis, Web analytics, eBay
The database architecture of salesforce.com, force.com, and database.com
salesforce.com, force.com, and database.com use exactly the same database infrastructure and architecture. That’s the good news. The bad news is that salesforce.com is somewhat obscure about technical details, for reasons such as:
- A long-ago marketing decision to not give infrastructure details, so as to convey a “Don’t worry; we’ll take care of everything” message.
- Even so, a long-ago and perhaps now-regretted marketing decision to disclose and even exaggerate salesforce.com’s reliance on Oracle, as part of an early-days attempt to prove salesforce was using enterprise-class technology.
- A desire to hide the recipe for salesforce.com’s secret sauce.
- Force of habit — I’m not sure salesforce even knows how to tell its technical story with any clarity.
Actually, salesforce.com has moved some kinds of data out of Oracle that used to be stored there. Besides Oracle, salesforce uses at least a file system and a RAM-based data store about which I have no details. Even so, much of salesforce.com’s data is stored in Oracle, specifically in a single instance of Oracle, which it believes may be the largest instance of Oracle in the world.
Categories: Data models and architecture, Market share and customer counts, Memory-centric data management, OLTP, Object, Oracle, Software as a Service (SaaS), salesforce.com
“Big data” has jumped the shark
I frequently observe that no market categorization is ever precise and, in particular, that bad jargon drives out good. But when it comes to “big data” or “big data analytics”, matters are worse yet. The definitive shark-jumping moment may be Forrester Research’s Brian Hopkins’ claim that:
… typical data warehouse appliances, even if they are petascale and parallel, [are] NOT big data solutions.
Nonsense almost as bad can be found in other venues.
Forrester seems to claim that “big data” is characterized by Volume, Velocity, Variety, and Variability. Others, less alliteratively-inclined, might put Complexity in the mix. So far, so good; after all, much of what people call “big data” is collections of disparate data streams, all collected somewhere in a big bit bucket. But when people start defining “big data” to include Variety and/or Variability, they’ve gone too far.
Derived data, progressive enhancement, and schema evolution
The emphasis I’m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts:
- Derived data.
- Many-step processes to produce derived data.
- Schema evolution.
- Temporary data constructs.
So let’s dive in.
Categories: Data models and architecture, Data warehousing, MarkLogic, Text
Data management at Zynga and LinkedIn
Mike Driscoll and his Metamarkets colleagues organized a bit of a bash Thursday night. Among the many folks I chatted with were Ken Rudin of Zynga, Sam Shah of LinkedIn, and D. J. Patil, late of LinkedIn. I now know more about analytic data management at Zynga and LinkedIn, plus some bonus stuff on LinkedIn’s People You May Know application.
It’s blindingly obvious that Zynga is one of Vertica’s petabyte-scale customers, given that Zynga sends 5 TB/day of data into Vertica, and keeps that data for about a year. (Zynga may retain even more data going forward; in particular, Zynga regrets ever having thrown out the first month of data for any game it’s tried to launch.*) This is game actions, for the most part, rather than log files; true logs generally go into Splunk.
*I don’t know whether the missing data is completely thrown away, or just stashed on inaccessible tapes somewhere.
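The petabyte-scale inference is simple arithmetic, spelled out here. The only inputs are the 5 TB/day and "about a year" figures quoted above; treating a year as 365 days, using decimal units, and ignoring compression or replication are assumptions, so the result is a rough figure for raw ingested data rather than for what is actually on disk.

```python
TB_PER_DAY = 5          # figure quoted above
RETENTION_DAYS = 365    # "about a year" (assumed to be 365 days)
TB_PER_PB = 1000        # decimal units (assumption)

retained_tb = TB_PER_DAY * RETENTION_DAYS
retained_pb = retained_tb / TB_PER_PB

print(f"{retained_tb} TB retained, roughly {retained_pb:.1f} PB")
```

So a year of retention at that rate lands a little under 2 PB of raw data, comfortably in petabyte territory.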
I found two aspects of the Zynga story particularly interesting. First, those 5 TB/day are going straight into Vertica (from, I presume, memcached/Membase/Couchbase), as Zynga decided that sending the data to some kind of log first was more trouble than it was worth. Second, there’s Zynga’s approach to analytic database design.
Categories: Aster Data, Couchbase, Data models and architecture, Games and virtual worlds, Greenplum, Hadoop, Petabyte-scale data management, Specific users, Vertica Systems, Zynga
Terminology: Dynamic- vs. fixed-schema databases
E. F. “Ted” Codd taught the computing world that databases should have fixed logical schemas (which protect the user from having to know about physical database organization). But he may not have been as universally correct as he thought. Cases I’ve noted in which fixed schemas may be problematic include:
- “A bunch of apps in one, similar but not the same” (in my recent post on MongoDB).
- Out-of-control product catalogs (ditto).
- Analytic use cases in which one keeps enhancing the database with derived data.
And if marketing profile analysis is ever done correctly, that will be a huge example for the list.
So what do we call those DBMS — for example NoSQL, object-oriented, or XML-based systems — that bake the schema into the applications or the records themselves? In the MongoDB post I went with “schemaless,” but I wasn’t really comfortable with that, so I took the discussion to Twitter. Comments from Vlad Didenko (in particular), Ryan Prociuk, Merv Adrian, and Roland Bouman favored the idea that schemas in such systems are changeable or late-bound, rather than entirely absent. I quickly agreed.
Categories: Data models and architecture, NoSQL, Object, Structured documents
