Cache – DBMS 2 : DataBase Management System Services

Couchbase 4.0 and related subjects

Curt Monash — Thu, 15 Oct 2015 15:17:44 +0000

I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.

Now that you’re duly warned, let me remind you of some aspects of the Couchbase timeline.

2 corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL.
Couchbase’s original value proposition, under the name Membase, was to provide persistence and of course support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
Couchbase 4.0 has been in beta for a few months.

Technical notes on Couchbase 4.0 — and related riffs — start:

There’s a new SQL-like language called N1QL (pronounced like “nickel”). I’m hearing a lot about SQL-on-NoSQL these days. More on that below.
“Index”, “data” and “query” are three different services/tiers.
- You can run them all on the same nodes or separately. Couchbase doesn’t have enough experience yet with the technology to know which choice will wind up as a best practice.
- I’m hearing a lot about heterogeneous-node/multi-tier DBMS architectures these days, and would no longer stand by my 2009 statement that they are unusual. Other examples include Oracle Exadata, MySQL, MongoDB (now that it has pluggable storage engines), MarkLogic, and of course the whole worlds of Hadoop and Spark.
To be clear — the secondary indexes are global, and not tied to the same nodes as the data they index.
There’s a new back end called ForestDB, but if I understood correctly, it’s used just for the indexes, not for the underlying data.
ForestDB represents Couchbase indexes in something that resembles b-trees, but also relies on tries. Indeed, if I’m reading the relevant poster correctly, it’s based on a trie of b-trees.
In another increasingly common trend, Couchbase uses Bloom filters to help decide which partitions to retrieve for any particular query.
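As a generic illustration of Bloom-filter partition pruning (not Couchbase's actual code), here is a minimal Python sketch. It assumes each partition keeps its own Bloom filter over the keys it holds, so the query layer can skip any partition whose filter says a key is definitely absent. The class name, `partitions_to_scan`, and the filter sizes are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # the bit array, stored in a Python int

    def _positions(self, key: str):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def partitions_to_scan(filters: dict, key: str) -> list:
    """Keep only partitions whose filter does not rule the key out."""
    return [pid for pid, bf in filters.items() if bf.might_contain(key)]


# Hypothetical example: three partitions, one filter each.
filters = {pid: BloomFilter() for pid in ("p0", "p1", "p2")}
for key, pid in [("user:1", "p0"), ("user:2", "p1"), ("user:3", "p2")]:
    filters[pid].add(key)

print(partitions_to_scan(filters, "user:2"))  # always includes "p1"
```

Other partitions appear in the result only on a false positive, which costs a wasted fetch but never a wrong answer.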

Up to a point, SQL-on-NoSQL stories can be fairly straightforward.

You define some kind of a table,* perhaps in a SQL-like DDL (Data Definition Language).
SELECT, FROM and WHERE clauses work in the usual way.
Hopefully, if a column is going to have a lot of WHERE clauses on it, it also has an index.

For example, I think that’s the idea behind most ODBC/JDBC drivers for NoSQL systems. I think it’s also the idea behind most “SQL-like” languages that NoSQL vendors ship.

*Nobody I talk to about this ever wants to call it a “view”, but it sure sounds like a view to me — not a materialized view, of course, but a view nonetheless.
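A minimal Python sketch of that pattern, assuming documents are plain dicts: the "table" is nothing more than a mapping from column names to paths inside each document (a non-materialized view), WHERE is a predicate on the projected row, and a secondary index on one column lets the query skip a full scan. All names here are hypothetical and not any vendor's API.

```python
from collections import defaultdict

documents = [
    {"id": 1, "name": "Ann", "address": {"city": "Boston"}, "age": 34},
    {"id": 2, "name": "Bob", "address": {"city": "Austin"}},
    {"id": 3, "name": "Cy", "address": {"city": "Boston"}, "age": 51},
]


def get_path(doc, path):
    """Follow a dotted path like 'address.city'; missing fields yield None."""
    value = doc
    for part in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value


# The "table": column name -> path into each document. Nothing is materialized.
people_view = {"id": "id", "name": "name", "city": "address.city", "age": "age"}


def select(view, docs, where=None):
    """Project each document through the view, then apply the WHERE predicate."""
    for doc in docs:
        row = {col: get_path(doc, path) for col, path in view.items()}
        if where is None or where(row):
            yield row


# A secondary index on the "city" column: value -> matching documents.
city_index = defaultdict(list)
for doc in documents:
    city_index[get_path(doc, "address.city")].append(doc)

# SELECT name FROM people WHERE city = 'Boston', answered via the index.
boston = [row["name"] for row in select(people_view, city_index["Boston"])]
print(boston)  # ['Ann', 'Cy']
```

Note that Bob's missing `age` simply surfaces as `None` (a NULL), which is one of the small decisions any SQL-over-documents layer has to make.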

JOIN syntax can actually be straightforward as well under these assumptions. As for JOIN execution, Couchbase pulls all the data into the relevant tier and does nested loop execution there. My new clients at SequoiaDB have a similar strategy, by the way, although in their case there’s a hash join option as well.
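For readers who want the difference between the two join strategies spelled out, here is a textbook sketch in Python. It assumes both inputs have already been pulled into the query tier as lists of dicts; the function names and sample data are hypothetical, and this is not Couchbase or SequoiaDB code.

```python
def nested_loop_join(left, right, left_key, right_key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[left_key] == r[right_key]
    ]


def hash_join(left, right, left_key, right_key):
    """Build a hash table on one input, then probe it: roughly O(len(left) + len(right))."""
    table = {}
    for r in right:
        table.setdefault(r[right_key], []).append(r)
    return [{**l, **r} for l in left for r in table.get(l[left_key], [])]


orders = [{"order_id": 10, "cust": 1}, {"order_id": 11, "cust": 2}, {"order_id": 12, "cust": 1}]
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]

nl = nested_loop_join(orders, customers, "cust", "cust_id")
hj = hash_join(orders, customers, "cust", "cust_id")
print(nl == hj)  # True: same result, very different cost on large inputs
```

Both produce identical rows; the hash join simply avoids comparing every pair, which is why having it as an option matters once the inputs get large.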

But if things stopped there, they would miss an important complication: NoSQL has nested data. I.e., a value can actually be an array, whose entries are arrays themselves, and so on. That said, the “turtles all the way down” joke doesn’t quite apply, because at some point there are actual scalar or string values, and those are the ones SQL wants to actually operate on.

Most approaches I know of to that problem boil down to identifying particular fields as table columns, with or without aliases/renaming; I think that’s the old Hadapt/Vertica strategy, for example. Couchbase claims to be doing something a little different, however, with a SQL-extending operator called UNNEST. Truth be told, I’m finding the N1QL language reference a bit terse, and haven’t figured out what the practical differences vs. the usual approach are, if any. But it sounds like there may be some interesting ideas in there somewhere.
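As a rough illustration, assuming UNNEST behaves like a join between each document and the elements of one of its own nested arrays (check the N1QL reference for the exact semantics), the row-generating effect can be sketched in Python. The `unnest` helper and the sample data are hypothetical.

```python
def unnest(docs, array_field):
    """Yield (parent, element) pairs, one per element of the nested array.

    Documents whose array is missing or empty produce no rows, as with an inner join.
    """
    for doc in docs:
        for element in doc.get(array_field) or []:
            yield doc, element


orders = [
    {"id": "o1", "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"id": "o2", "items": [{"sku": "A", "qty": 5}]},
    {"id": "o3", "items": []},
]

# Roughly analogous to: SELECT o.id, i.sku, i.qty FROM orders o UNNEST o.items i
rows = [(o["id"], i["sku"], i["qty"]) for o, i in unnest(orders, "items")]
print(rows)  # [('o1', 'A', 2), ('o1', 'B', 1), ('o2', 'A', 5)]
```

The contrast with the column-picking approach is that nested array entries become rows of their own, so ordinary WHERE, GROUP BY and aggregation can then work on the scalar values inside them.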

Layering of database technology & DBMS with multiple DMLs

Curt Monash — Sun, 08 Sep 2013 08:52:13 +0000

Two subjects in one post, because they were too hard to separate from each other

Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.

Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:

The object/relational, aka universal, extensibility features developed in the 1990s for Oracle, DB2, Informix, Illustra, and Postgres. The most successful extensions probably have been:
- Geospatial indexing via ESRI.
- Full-text indexing, notwithstanding questionable features and performance.
MySQL storage engines.
MPP (Massively Parallel Processing) analytic RDBMS relying on single-node PostgreSQL, Ingres, and/or Microsoft SQL Server — e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or Hadapt.
Splits in which a DBMS has serious processing both in a “database” layer and in a predicate-pushdown “storage” layer — most famously Oracle Exadata, but also MarkLogic, InfiniDB, and others.
SQL-on-HDFS — Hive, Impala, Stinger, Shark and so on (including Hadapt).
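To make the database-layer/storage-layer split concrete, here is a minimal Python sketch of predicate pushdown across a pluggable storage interface. It is a generic illustration, not how Exadata, MySQL storage engines, or any other product is actually built; `StorageLayer`, `ListStorage`, `DatabaseLayer`, and `rows_shipped` are hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterator, Optional

Row = dict
Predicate = Callable[[Row], bool]


class StorageLayer(ABC):
    """Lower layer; could be swapped out, much like a storage engine."""

    @abstractmethod
    def scan(self, predicate: Optional[Predicate] = None) -> Iterator[Row]:
        ...


class ListStorage(StorageLayer):
    def __init__(self, rows):
        self.rows = rows
        self.rows_shipped = 0  # rows that crossed the layer boundary

    def scan(self, predicate=None):
        for row in self.rows:
            if predicate is None or predicate(row):
                self.rows_shipped += 1
                yield row


class DatabaseLayer:
    def __init__(self, storage: StorageLayer, pushdown: bool):
        self.storage = storage
        self.pushdown = pushdown

    def select(self, predicate: Predicate):
        if self.pushdown:
            # The storage layer filters, so only matching rows come up.
            return list(self.storage.scan(predicate))
        # Otherwise every row comes up and is filtered here.
        return [row for row in self.storage.scan() if predicate(row)]


rows = [{"id": i, "region": "EU" if i % 4 == 0 else "US"} for i in range(1000)]
for pushdown in (False, True):
    storage = ListStorage(rows)
    result = DatabaseLayer(storage, pushdown).select(lambda r: r["region"] == "EU")
    print(pushdown, len(result), storage.rows_shipped)
# False 250 1000
# True 250 250
```

The answer is the same either way; what changes is how much data moves between layers, which is the whole point of pushing work down.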

Other examples on my mind include:

Data manipulation APIs being added to key-value stores such as Couchbase and Aerospike.
TokuMX, the Tokutek/MongoDB hybrid I just blogged about.
NuoDB’s willing reliance on third-party key-value stores (or HDFS in the role of one).
FoundationDB’s strategy, and specifically its acquisition of Akiban.

And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.

In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include:

The object/relational DBMS previously mentioned.
The new DMLs (Data Manipulation Languages) or APIs previously mentioned over key-value stores.
The SQL interfaces offered for a considerable number of non-SQL systems — InterSystems Caché, MarkLogic, Hadoop (and thus HBase) and many more.
Text search interfaces for a variety of DBMS.
The JSON/MongoDB-compatibility interfaces that are popping up for multiple DBMS, e.g. DB2 or MarkLogic.
FoundationDB, previously mentioned.

So will these trends take over the DBMS world?

Developing a multi-purpose DBMS is extremely difficult, and even harder if it’s layered.

Developing any kind of DBMS is very hard.
Developing a multi-purpose DBMS is harder yet. Try, for example, to imagine a caching and memory-management subsystem that’s optimal for multiple datatypes and DMLs at once.
Layering carries performance costs. The best-case performance scenario is when you can optimize the flow of data all the way from client-server connection down to persistent storage, and back. Layering interferes with that.

But on the plus side, it can be great to have one DBMS handle multiple kinds of data.

Almost irrespective of product category, there are obvious benefits to buying, installing and administering one thing that can meet multiple needs.
Further, there are major use cases for manipulating the same data in different ways. For example:
- Almost any kind of large object is likely to have tabular metadata attached.
- Many kinds of database can, at times, be usefully addressed via full-text search.
- In scenarios where you incrementally derive and enhance data, it’s natural to want to keep everything in the same place. (That also helps with lineage, security and so on.) But derived data may be structured very differently than the raw data it’s based on.

And by the way — the more different functions a DBMS performs, the more they may need to be walled off from each other. In particular, I’ve long argued that it’s a best practice for e-commerce sites to manage access control, transactions, and interaction data in at least two separate databases, and preferably in three. General interaction logs do not need the security or durability that access control and transactions do, and there can be considerable costs to giving them what they don’t need. A classic example is the 2010 Chase fiasco, in which recovery from an Oracle outage was delayed by database clutter that would have fit better into a NoSQL system anyway. Building a single DBMS that refutes my argument would not be easy.

So will these trends succeed? The foregoing caveats notwithstanding, my answers are more Yes than No.

Layered and multi-purpose DBMS will likely always have performance penalties, but over time the penalties should become small enough to be affordable in most cases.
Exadata-like tiering in an otherwise integrated system seems like a smart way to avoid the traditional shared-everything vs. shared-nothing tradeoffs. Tiering could also be a good way to combine the ever more numerous kinds of storage — disk, flash, multiple levels of cache, etc.
Machine-generated data and “content” both call for multi-datatype DBMS. And taken together, those are a large fraction of the future of computing. Consequently …
… strong support for multiple datatypes and DMLs is a must for “general-purpose” RDBMS. Oracle and IBM have been working on that for 20 years already, with mixed success. I doubt they’ll get much further without a thorough rewrite, but rewrites happen; one of these decades they’re apt to get it right.

Related links

The refactoring of everything (July, 2013)
JSON in DB2 (September, 2013)
Multi-structured data support in Hadapt (September, 2013)

Introduction to NuoDB

Curt Monash — Sat, 12 Jan 2013 23:06:37 +0000

NuoDB has an interesting NewSQL story. NuoDB’s core design goals seem to be:

SQL.
Transactions.
Very flexible topology, including:
- Local replicas.
- Remote replicas.
- Easy deployment and management.

Please don’t blame me for various annoying aspects of NuoDB’s marketing; they only became a client this month.

Key aspects of NuoDB’s architecture include:

A modular approach, which seems to fall somewhere between true object-orientation and Spark RDDs.
Much tiering.
Very optimistic execution.

Less unusual technical highlights include:

MVCC (Multi-Version Concurrency Control).
B-trees.
Some kind of tokenized compression (but data is decompressed for execution).
Encryption across the network.

Online schema change is in the mix as well.
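For readers unfamiliar with MVCC, here is a generic Python sketch of the core idea (not NuoDB's implementation): every write appends a new version stamped with a timestamp, and a reader sees the newest version at or before its snapshot, so writers never disturb in-flight readers. The class and method names are hypothetical.

```python
class MVCCStore:
    """Each write appends a timestamped version; reads are against a snapshot."""

    def __init__(self):
        self.current_ts = 0
        self.versions = {}  # key -> list of (commit_ts, value), oldest first

    def write(self, key, value) -> int:
        self.current_ts += 1
        self.versions.setdefault(key, []).append((self.current_ts, value))
        return self.current_ts

    def snapshot(self) -> int:
        # A reader's snapshot is simply the latest timestamp issued so far.
        return self.current_ts

    def read(self, key, snapshot_ts):
        # Scan newest-first for the latest version visible to this snapshot.
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None


store = MVCCStore()
store.write("balance", 100)
snap = store.snapshot()          # a reader starts here
store.write("balance", 80)       # a later writer neither blocks nor disturbs it
print(store.read("balance", snap))              # 100
print(store.read("balance", store.snapshot()))  # 80
```

A real system also has to garbage-collect versions no snapshot can see any more; that is omitted here.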

The heart of NuoDB’s design seems to be the optimistic execution. Highlights include:

Any change to data is immediately propagated to all replicas, throughout the system, straight down to disk.
Commit messages eventually say “Yeah, we’re good to go; commit that.”
Conflicts are unwound via timestamps.
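The post doesn't spell out NuoDB's exact conflict rules, so as a stand-in here is a generic sketch of timestamp-based optimistic concurrency control (first committer wins). Transactions proceed without locks; at commit time, a transaction that touched a key someone else committed after it started is unwound. Replication and the asynchronous commit messages are not modeled, and all names are hypothetical.

```python
import itertools
from dataclasses import dataclass, field

_clock = itertools.count(1)  # global logical timestamp source


@dataclass
class Txn:
    start_ts: int = field(default_factory=lambda: next(_clock))
    writes: dict = field(default_factory=dict)


class Store:
    def __init__(self):
        self.data = {}
        self.last_commit_ts = {}  # key -> timestamp of the last committed write

    def commit(self, txn: Txn) -> bool:
        # Conflict: another transaction committed one of our keys after we started.
        if any(self.last_commit_ts.get(k, 0) > txn.start_ts for k in txn.writes):
            txn.writes.clear()  # unwind: discard the loser's changes
            return False
        commit_ts = next(_clock)
        for key, value in txn.writes.items():
            self.data[key] = value
            self.last_commit_ts[key] = commit_ts
        return True


store = Store()
t1, t2 = Txn(), Txn()          # t1 starts at ts 1, t2 at ts 2
t1.writes["seat_12A"] = "alice"
t2.writes["seat_12A"] = "bob"
print(store.commit(t2))        # True  (commits at ts 3)
print(store.commit(t1))        # False (3 > 1: conflict, so t1 is unwound)
print(store.data)              # {'seat_12A': 'bob'}
```

The appeal of the optimistic approach is that the common, conflict-free case needs no coordination until commit; the price is that losers do their work for nothing.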

Consequences include:

NuoDB doesn’t guarantee consistent state of the database; it merely guarantees database consistency in response to any kind of request. (This is akin to the philosophy behind RYW consistency.)
The synchronous parts of network communications are very lightweight.

NuoDB’s consistency, by the way, is tunable. Commits can require:

Acknowledgement by N different storage engine copies that persistence has happened (pretty much the traditional approach).
Acknowledgement by K different transaction engine copies that the change is in RAM (what the cool kids are doing).
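The tunable-commit idea reduces to counting acknowledgements of two different kinds. A minimal Python sketch, assuming each acknowledgement is tagged as either a persistence ack from a storage copy or an in-RAM ack from a transaction-engine copy; `CommitPolicy` and its fields are hypothetical and are not NuoDB configuration parameters.

```python
from dataclasses import dataclass


@dataclass
class CommitPolicy:
    """How many acknowledgements of each kind a commit waits for."""
    storage_acks: int = 0     # N: storage copies confirming the change is persisted
    in_memory_acks: int = 0   # K: transaction-engine copies holding the change in RAM


def commit_is_durable(policy: CommitPolicy, acks: list) -> bool:
    """acks holds the kinds received so far: "storage" or "memory"."""
    storage = sum(1 for kind in acks if kind == "storage")
    memory = sum(1 for kind in acks if kind == "memory")
    return storage >= policy.storage_acks and memory >= policy.in_memory_acks


disk_safe = CommitPolicy(storage_acks=2)   # the traditional approach
ram_safe = CommitPolicy(in_memory_acks=3)  # what the cool kids are doing
acks = ["memory", "memory", "memory", "storage"]
print(commit_is_durable(disk_safe, acks))  # False: only 1 storage ack so far
print(commit_is_durable(ram_safe, acks))   # True
```

The RAM-based policy usually returns sooner, since it never waits on a disk write; its safety rests on the assumption that K machines won't all fail before the data reaches persistent storage.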

NuoDB’s modularity story starts:

Everything — data, indexes, metadata, whatever — is chunked into “atoms” of 50Kb or so.
There are common services to do things like replication of atoms.
NuoDB speaks, inaccurately, as if the atoms actually carry their own logic; this seems to be the basis for various soaring metaphors about emergent flocks of birds.
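A deliberately simplified Python sketch of the "everything is atoms" idea: any serialized object is cut into roughly 50-kilobyte pieces, and one generic service can then replicate every piece the same way without caring what it is. Two assumptions: "50Kb or so" is read as about 50 kilobytes, and atoms are modeled as fixed-size byte slices, whereas real NuoDB atoms are presumably logical objects. Function names are hypothetical.

```python
ATOM_SIZE = 50 * 1024  # assumption: "50Kb or so" taken to mean about 50 kilobytes


def chunk_into_atoms(blob: bytes, atom_size: int = ATOM_SIZE) -> list:
    """Split any serialized object (data, index, metadata) into fixed-size chunks."""
    return [blob[i:i + atom_size] for i in range(0, len(blob), atom_size)]


def replicate(atoms, nodes):
    """A common service: copy every atom to every node, whatever the atom holds."""
    return {node: list(atoms) for node in nodes}


atoms = chunk_into_atoms(b"x" * 120_000)
print([len(a) for a in atoms])  # [51200, 51200, 17600]
copies = replicate(atoms, ["te-1", "sm-1"])
```

The design payoff is that replication, caching and recovery are written once, against atoms, rather than once per kind of database object.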

Traditionally, DBMS have one tier. Exadata, MarkLogic, and InfiniDB — among others — have two to three each. NuoDB, however, effectively has four:

A lightweight load balancer (just like any other peer-to-peer system), which travels with …
… “transaction engines” that:
- Talk with clients.
- Parse and plan SQL.
- Manage distributed transactions.
- Cache data.
- Etc.
“Storage managers” that manage persistence, with the help of …
… “key-value stores”, a term NuoDB uses loosely but reasonably to encompass your choice of:
- A lightweight, proprietary NuoDB key-value store.
- Amazon S3.
- HDFS (Hadoop Distributed File System).
- Presumably more options later on.

The most counterintuitive part of this is probably that one instance of a storage manager can talk to a whole key-value cluster (e.g. one managed by HDFS); thus, this is neither a classic shared-nothing approach nor a sharded one (transparently or otherwise). Rather, each NuoDB storage manager sees the whole database.

Seeing only part of the database isn’t an option in NuoDB Version 1; hence, compliance-oriented geo-partitioning isn’t supported at this time.

Paradigmatically,

NuoDB transaction engines and storage managers run on different machines.
Each database has its own transaction engines and storage managers; transaction engines for different databases can run on the same machine.
Replication factors can be different for different databases.
Local replication gives high availability and elastic scale-out.
Remote replication gives disaster recovery.

Wrinkles include:

Transaction engines and storage managers have very similar code.
You could run everything on one machine if you really insisted.
Conversely, NuoDB fondly thinks that it makes sense to run transaction engines on the same machines as your application servers. (I’m more skeptical, since app servers and DBMS are both heavily consumptive of RAM.)

And finally, NuoDB company highlights include:

25 employees.
Single-digit number of production customers.
VCs who were previously CEOs of Relational Technology/Ingres (Gary Morgenthaler), Illustra (ditto), and Sybase (Mitchell Kertzman).

NuoDB took a “rolling thunder” approach to launch, so product news is already out, but some customer wins and internal benchmarks are being held back for a press event next week.

Bottom line: No new DBMS could possibly justify NuoDB’s hype. But NuoDB is an interesting new product for the cloud era.

Introduction to GenieDB

Curt Monash — Mon, 07 Jan 2013 18:35:31 +0000

GenieDB is one of the newer and smaller NewSQL companies. GenieDB’s story is focused on wide-area replication and uptime, coupled to claims about ease and the associated low TCO (Total Cost of Ownership).

GenieDB is in my same family of clients as Cirro.

The GenieDB product is more interesting if we conflate the existing GenieDB Version 1 and a soon-forthcoming (mid-year or so) Version 2. On that basis:

GenieDB has three tiers.
GenieDB’s top tier is the usual MySQL front-end.
GenieDB’s bottom tier is either Berkeley DB or a conventional MySQL storage engine.
GenieDB’s bottom tier stores your entire database at every node.
If you replicate locally, GenieDB’s middle tier operates a distributed cache.
If you replicate wide-area, GenieDB’s middle tier allows active-active/multi-master replication.

The heart of the GenieDB story is probably wide-area replication. Specifics there include:

Lamport clock.
Self-healing technology to detect errors and out-of-sync conditions, and to request data retransmission accordingly.
VPNs (Virtual Private Networks) to tie the whole thing together.
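For reference, a Lamport clock is simple enough to show in full. This is the textbook algorithm in Python, not GenieDB's code: each site increments a counter for local events, stamps outgoing messages, and on receipt jumps past the sender's timestamp, so an update is always ordered after the event that caused it.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Advance for a local event (e.g. a local write) and return the new time."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Timestamp to attach to an outgoing replication message."""
        return self.tick()

    def receive(self, remote_time: int) -> int:
        """Merge a timestamp from another site: jump past it, then count the receipt."""
        self.time = max(self.time, remote_time) + 1
        return self.time


# Two data centers exchanging an update.
east, west = LamportClock(), LamportClock()
east.tick()                   # east: 1
msg_ts = east.send()          # east: 2, message carries 2
west.tick()                   # west: 1
print(west.receive(msg_ts))   # 3: ordered after the event that caused it
```

For active-active conflict resolution, systems commonly break ties between equal timestamps by comparing (timestamp, node_id) pairs; whether GenieDB does exactly that isn't stated here.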

Obviously, replicating the whole database to every node imposes some limitations, most notably:

GenieDB database sizes are limited to what fits well on a node — unless, for example, some transparent sharding technology is added to the mix.
GenieDB doesn’t offer the regulatory compliance benefits of partitioning data in line with its geographical origin.

However, GenieDB does offer:

Redundancy among cloud data centers.
Response-time benefits of keeping data close to the user.
Support for occasionally-connected topologies. (The example GenieDB cites is oil rigs.)

Oddly, I can’t find any notes on GenieDB company particulars. But I think GenieDB’s employee count is in the teens and a couple of customer sites are going into production just around now. Technically, I don’t think GenieDB has raised a Series A round yet; but Stuart Frost is involved, and his fund-raising skills are exemplary.

Couchbase 2.0

Curt Monash — Tue, 20 Nov 2012 02:14:12 +0000

My clients at Couchbase checked in.

After multiple delays, Couchbase 2.0 is well into beta, with general availability being delayed by the holiday season as much as anything else.
Couchbase (the company) now has >350 subscription customers, almost all for Couchbase (the product) — which is to say for what was known as Membase, which is basically a persistent version of Memcached.
There also are many users of open source Couchbase, most famously LinkedIn.
Orbitz is a much-mentioned flagship paying Couchbase customer.
Couchbase customers mainly seem to be replacing a caching layer, Memcached or otherwise.
Couchbase headcount is just under 100.

The big changes in Couchbase 2.0 versus the previous (1.8.x) version are:

JSON storage, including secondary indexes.
Multi-data-center replication.
A back-end change from SQLite to a heavily forked version of CouchDB, called Couchstore.

Couchbase 2.0 is upwards-compatible with prior versions of Couchbase (and hence with Memcached), but not with CouchDB.

Technology notes on Couchbase 2.0 include:

Couchbase’s query interface is via http.
Couchbase’s indexing strategy is append-only b-trees.
Couchbase doesn’t index data until it’s persisted to disk.
Hence, it might take a few hundred milliseconds between the time data arrives and when it shows up in query results. (Of course, you can do a key-value retrieval as soon as the data arrives in RAM.)
Couchbase has incremental MapReduce.
Couchbase’s replication is active-active/master-master.
Couchbase does not offer wide-area data partitioning. Hence, Couchbase is not yet suitable for certain compliance-driven use cases.
One difference between Couchbase and CouchDB is that indexes are per-node, not per-database.
Otherwise, that fork of CouchDB had a lot to do with Couchbase deciding that Erlang is too hard to optimize, and hence moving to C/C++ instead. The distributed view indexes (but not the single-node b-trees) are still in Erlang, and are “strong candidates” to move to C/C++ as well.
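The visibility gap between key-value reads and indexed queries can be modeled in a few lines. This hypothetical Python sketch assumes, as described above, that writes land in RAM immediately but the secondary index (here, on a field called `type`) is only updated when documents are flushed to disk; the append-only b-tree structure itself is not modeled, and the class is not Couchbase's API.

```python
class Node:
    """Key-value reads see RAM at once; the index only covers flushed documents."""

    def __init__(self):
        self.ram = {}
        self.disk = {}
        self.index = {}  # value of the "type" field -> set of keys, persisted docs only

    def set(self, key, doc):
        self.ram[key] = doc  # visible to get() right away

    def get(self, key):
        return self.ram.get(key)

    def flush(self):
        """Persist documents, then index them: indexing follows persistence."""
        for key, doc in self.ram.items():
            old = self.disk.get(key)
            if old is not None:
                self.index.get(old.get("type"), set()).discard(key)
            self.disk[key] = doc
            self.index.setdefault(doc.get("type"), set()).add(key)

    def query(self, type_value):
        return sorted(self.index.get(type_value, set()))


node = Node()
node.set("u1", {"type": "user"})
print(node.get("u1"))      # {'type': 'user'}
print(node.query("user"))  # []  (not yet persisted, so not yet indexed)
node.flush()
print(node.query("user"))  # ['u1']
```

The window between `set` and `flush` is exactly the lag an application sees between a key-value read succeeding and the same document showing up in query results.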

Couchbase has built up its customer base by offering just a key-value store — GET, SET, INCREMENT, DECREMENT and the like. In that world, Couchbase makes credible claims about performance, reliability, and manageability — fast, scalable, high-concurrency, always-on, etc. Specific reasons Couchbase offers to believe in its key-value/whole-document performance include:

Active cache management (not just memory-mapping).
Fine-grained locking.
A benchmark it bought showing that, in this use case, it far outshines MongoDB or Cassandra.
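To show what "fine-grained locking" means for a key-value store, here is a minimal Python sketch with memcached-style operations and one lock per key, so concurrent updates to different keys never wait on each other. It is a generic illustration, not Couchbase's code, and its semantics are simplified relative to memcached's (for example, incrementing a missing key here starts from 0).

```python
import threading
from collections import defaultdict


class KeyValueStore:
    """GET/SET/INCR/DECR with a separate lock for each key."""

    def __init__(self):
        self._data = {}
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()  # protects creation of per-key locks

    def _lock_for(self, key):
        with self._locks_guard:
            return self._locks[key]

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        with self._lock_for(key):
            self._data[key] = value

    def incr(self, key, delta=1):
        # Read-modify-write under the key's own lock, so no update is lost.
        with self._lock_for(key):
            value = int(self._data.get(key, 0)) + delta
            self._data[key] = value
            return value

    def decr(self, key, delta=1):
        return self.incr(key, -delta)


def hammer(store, n=1000):
    for _ in range(n):
        store.incr("hits")


store = KeyValueStore()
workers = [threading.Thread(target=hammer, args=(store,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(store.get("hits"))  # 4000
```

A single global lock would give the same answer but would serialize every operation in the store; per-key locks only serialize operations that touch the same key.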

If you want a key-value store, Couchbase is obviously a/the market-leading alternative.

I gather from Couchbase that Basho/Riak is a solid key-value competitor, perhaps more so than in the past.

What remains to be seen is how Couchbase will fare as a document store. In particular, any update to a Couchbase document replaces the whole document, which is not necessarily the case in other document stores. Similarly, Couchbase’s secondary indexes are newer than some competitors’. And so Couchbase still needs to prove its document-store mettle, in reading and writing alike.

Integrated internet system design

Curt Monash — Fri, 07 Sep 2012 04:43:39 +0000

What are the central challenges in internet system design? We probably all have similar lists, comprising issues such as scale, scale-out, throughput, availability, security, programming ease, UI, or general cost-effectiveness. Screw those up, and you don’t have an internet business.

Much new technology addresses those challenges, with considerable success. But the success is usually one silo at a time — a short-request application here, an analytic database there. When it comes to integration, unsolved problems abound.

The top integration and integration-like challenges for me, from a practical standpoint, are:

Integrating silos — a decades-old problem still with us in a big way.
Dynamic schemas with joins.
Low-latency business intelligence.
Human real-time personalization.

Other concerns that get mentioned include:

Geographical distribution due to privacy laws, which for some users is a major requirement for compliance.
Logical data warehouse, a term that doesn’t actually mean anything real.
In-memory data grids, which some day may no longer always be hand-coupled to the application and data stacks they accelerate.

Let’s skip those latter issues for now, focusing instead on the first four.

Integrating silos

While the software industry has been working on application integration for decades, there’s clearly a long way yet to go. Let me illustrate by way of personal story.

I needed a new laptop computer on short notice, and decided to go with an HP Folio.* Driving to a local Wal-Mart seemed more practical than ordering online, as a couple of stores near my house were listed by Walmart.com as being in stock. I called just to check; both were out of stock. The Wal-Mart folks on the phone told me such errors are routine.

*It was pretty much the cheapest all-solid-state credible alternative I could find, is said to have a good keyboard, and has an Ethernet port for all those client visits when guest Wi-Fi doesn’t work.

You may recall my outraged tweets about a similar silos-of-non-integration story in Dell customer support, a couple of years back. Yet Dell is one of the larger computer companies in the world, while Wal-Mart is one of the most accomplished computer users. If Wal-Mart and Dell can’t get basic system functionality right, just imagine how screwed up everybody else is.

Dynamic schemas with joins

There are multiple reasons to use dynamic schemas over fixed ones. This is especially true when recording web interaction data, because every page can have very different information to log. But there are also multiple reasons to want to use joins, especially when your application combines two or more of:

User-specific reference, demographic, and/or psychographic data.
System-wide reference data driving user-specific personalization.
Orders and inventory.
Verbose log data.

That doesn’t mean that a fully general join syntax is needed in every DBMS. But it does mean that the workarounds to joinlessness I wrote about a couple of years ago often don’t suffice.

Fortunately, much better stuff is being developed. The best that I know of still awaits launch — but I’ve begun to connect users with vendors who can address that problem head-on.

Low-latency business intelligence

If you have data pounding into a short-request system, there are several levels of BI you could try to do on it in human real time.

System monitoring. There are lots of tools for that.
Simple business aggregations. Top-end system monitoring stacks can help with that too (notably Splunk). Alternatively, you can maintain a few aggregates in even a NoSQL database.*
More serious BI, drawing on the various information in your data warehouse. That one’s tougher.

*Counters are the canonical example.
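Counters can be made slightly more useful for low-latency BI by bucketing them by time. A hypothetical Python sketch, assuming one counter per (metric, minute); in a NoSQL store each such pair would be a key updated with an atomic increment, and "last few minutes" totals just sum a handful of keys.

```python
from collections import Counter


class BucketedCounters:
    """One counter per (metric, time bucket); bucket width defaults to a minute."""

    def __init__(self, bucket_seconds: int = 60):
        self.bucket_seconds = bucket_seconds
        self.counts = Counter()

    def record(self, metric: str, timestamp: float, amount: int = 1) -> None:
        bucket = int(timestamp // self.bucket_seconds)
        self.counts[(metric, bucket)] += amount

    def recent_total(self, metric: str, now: float, num_buckets: int) -> int:
        """Sum the current bucket and the num_buckets - 1 before it."""
        current = int(now // self.bucket_seconds)
        return sum(
            self.counts[(metric, b)]
            for b in range(current - num_buckets + 1, current + 1)
        )


c = BucketedCounters()
c.record("orders", 0)       # bucket 0
c.record("orders", 59)      # bucket 0
c.record("orders", 61, 3)   # bucket 1
c.record("orders", 200)     # bucket 3
print(c.recent_total("orders", now=200, num_buckets=3))  # 4: buckets 1 through 3
```

This covers the "simple business aggregations" level; anything that needs to join against warehouse data still calls for the heavier approaches discussed below.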

Single-server RDBMS have, for years, combined OLTP (OnLine Transaction Processing) and a reasonable amount of reporting or BI. As needs get more intense, Oracle and SAP are throwing hardware at the problem, via Exadata, Exalytics, HANA, and so on. But suppose you prefer a short-request system that scales out, runs on cheap commodity hardware, and fits well into the cloud. What do you do then?

One approach, which in some form I’ve recommended to multiple clients, is to stream the data to some kind of analytic data store, and serve your analytics from there. That technology is getting better all the time, even though many vendors haven’t yet recognized the magnitude of the need and opportunity.

More responsive personalization

Another kind of human-real-time analytics is even more important — automated response, such as ad personalization. Ideally, you want your response to be well-informed by everything the user has been doing over the past few minutes and even seconds. But two difficulties loom.

First, if we combine this point and the previous two, we might ideally want to stream data from a NoSQL store to an analytic one and back to a short-request SQL DBMS. That would be — complicated. Fortunately, there are a variety of not-crazy approaches, with varying degrees of cost, pain, or risk, with more coming soon as different kinds of data stores somewhat re-converge.

The second problem is more conceptual. What are the models and algorithms that tell us how to personalize based on up-to-the-second information? Since only the most simple-minded approaches seem practical to implement, only the most simple-minded answers have ever been worked out. A lot of data science lies ahead — and for once I don’t think that term is overwrought.

And with that I’m shutting down for 2 1/2 weeks for vacation. Depending on how things go with my new HP Folio :), as well as Wi-Fi in Istanbul, I hope to be fairly responsive to blog comments and email, and indeed will work on setting up a long October California trip. But I also hope that, for once, there isn’t any vacation-busting news; I’ve had some bad luck in that regard before, professionally and personally alike.

In-memory, (hybrid) memory-centric DBMS — three analytic glossary draft entries

Curt Monash — Mon, 20 Aug 2012 07:07:48 +0000

These are three closely-related draft entries for the DBMS2 analytic glossary. Please comment with any ideas you have for their improvement!

1. We coined the term memory-centric data management to comprise several kinds of technology that manage data in RAM (Random Access Memory), including:

In-memory DBMS (DataBase Management Systems).
Hybrid memory-centric DBMS.
Other kinds of in-memory data stores, such as:
- Caching layers.
- In-memory data stores that are tightly tied to specific analytic tools, for example the in-memory data management part of QlikView.
Complex event/stream processing.

Related link

Many examples of memory-centric data management (April, 2012)

2. An in-memory DBMS is a DBMS designed under the assumption that substantially all database operations will be performed in RAM (Random Access Memory). Thus, in-memory DBMS form a subcategory of memory-centric data management systems.

Ways in which in-memory DBMS are commonly different from those that query and update persistent storage include:

Data access processes which include a larger number of individually cheaper steps. In-memory database access is orders of magnitude cheaper than disk access, so it’s not as important to minimize the number of accesses.
Reduced locking. At RAM speeds, the cost of database locks can be significant, so in-memory DBMS are designed to minimize their use.

If what otherwise appears to be an in-memory DBMS routinely queries data from disk, then we refer to it as being hybrid memory-centric. However, even true in-memory DBMS may copy data into persistent storage, so as to keep it safe.

Examples of in-memory DBMS include:

SAP HANA.
Oracle TimesTen.
IBM TM1.
Several NewSQL systems, such as VoltDB.
Several NoSQL systems, such as Citrusleaf.

3. Hybrid memory-centric DBMS is our term for a DBMS that has two modes:

In-memory.
Querying and updating (or loading into) persistent storage.

It is difficult to make the boundaries of this category precise, because:

Almost any DBMS runs faster when the entire database is kept in RAM (Random Access Memory).
For some DBMS, especially in high-volume short-request processing use cases, it is a best practice to keep one’s entire data working set in RAM.

That said, we prefer to reserve the term “hybrid memory-centric” for DBMS designed according to the same principles as in-memory DBMS, for example IBM solidDB.

Hybrid memory-centric DBMS form a subcategory of memory-centric data management systems.

Notes, links and comments August 6, 2012

Curt Monash — Mon, 06 Aug 2012 05:11:08 +0000

I haven’t done a notes/link/comments post for a while. Time for a little catch-up.

1. MySQL now has a memcached integration story. I haven’t checked the details. The MySQL team is pretty hard to talk with, due to the heavy-handedness of Oracle’s analyst relations.

2. The Large Hadron Collider offers some serious numbers, including:

1 petabyte/second.
6 x 10⁹ collisions/second.
Only 1 in 10¹³ collision records kept (which I guess knocks things down to a 100 bytes/second average, from the standpoint of persistent storage).
Real-time filtering by a cluster of several thousand machines, over a 25 nanosecond period.
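The 100 bytes/second figure follows from dividing the raw data rate by the retention ratio, assuming 1 PB = 10¹⁵ bytes and that stored data shrinks in the same proportion as the collision records kept:

```latex
\frac{1\ \text{PB/s}}{10^{13}}
  = \frac{10^{15}\ \text{bytes/s}}{10^{13}}
  = 10^{2}\ \text{bytes/s}
  = 100\ \text{bytes/s}
```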

3. One application area we don’t talk about much for analytic technologies is education. However:

Knewton vigorously talks up the idea of online learning that adapts to the students’ previous responses, complete with the “Big Data” buzzword.
Knewton evidently likes graphs, and seems to be eagerly awaiting scale-out capabilities in Neo4j.
The New York Times offered a survey article about analytics in education. It seemed to be focused on Arizona State University — where I attended the only educational software conference I’ve ever gone to, in approximately 1984. One concerning aspect: There didn’t seem to be any reason to be sure the outcomes they were working toward had much to do with an actually better education.

So how soon will budgets emerge for all this, especially in the United States? I’m not sure.

Education has all sorts of problems at both the grade-school and collegiate levels, including bureaucratic weirdness and huge financial pressures.
Textbook publisher Macmillan is investing significant capital in education technology businesses … but diversifications of that kind have often gone wrong before.

4. Recent posts with robust comment threads — and this is a very partial list — include:

Pros and cons of Microsoft SQL Server were explored after I opined about SQL Server to MySQL migration.
There was a lot of commentary on my May series of graph analysis and management posts.
Later, Neo’s Philip Rathle added clarifying detail to my post on Neo Technology and Neo4j.
My June series on Hadoop drew numerous comments and clarifications too.
There was vigorous response when I suggested in May that “Big Data” might be overhyped …
… but nothing like what transpired when I said something similar in September, 2011.

5. Finally — and thoroughly superseding my post on disk, flash, and RAM — I saw an awesome round-up of latency numbers, which I’ll just quote below:

L1 cache reference ............................ 0.5 ns
Branch mispredict ............................... 5 ns
L2 cache reference .............................. 7 ns
Mutex lock/unlock .............................. 25 ns
Main memory reference ......................... 100 ns
Compress 1K bytes with Zippy ................ 3,000 ns
Send 2K bytes over 1 Gbps network .......... 20,000 ns
SSD random read ........................... 150,000 ns
Read 1 MB sequentially from memory ........ 250,000 ns
Round trip within same datacenter ......... 500,000 ns
Read 1 MB sequentially from SSD* ........ 1,000,000 ns
Disk seek .............................. 10,000,000 ns
Read 1 MB sequentially from disk ....... 20,000,000 ns
Send packet CA -> Netherlands -> CA ... 150,000,000 ns

Repeating that in different units, it’s:

    L1 cache reference ......................... 0.5 ns
    Branch mispredict ............................ 5 ns
    L2 cache reference ........................... 7 ns
    Mutex lock/unlock ........................... 25 ns
    Main memory reference ...................... 100 ns
    Compress 1K bytes with Zippy ................. 3 µs
    Send 2K bytes over 1 Gbps network ........... 20 µs
    SSD random read ............................ 150 µs
    Read 1 MB sequentially from memory ......... 250 µs
    Round trip within same datacenter .......... 0.5 ms
    Read 1 MB sequentially from SSD* ............. 1 ms
    Disk seek ................................... 10 ms
    Read 1 MB sequentially from disk ............ 20 ms
    Send packet CA -> Netherlands -> CA ........ 150 ms
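These figures span more than eight orders of magnitude, so ratios are often easier to reason about than raw numbers. The following small Python sketch uses only the figures quoted above (the dictionary and helper names are illustrative). It renders each figure in the largest unit that keeps it at or above 1, and it computes a couple of ratios:

```python
"""Express the quoted latency figures in convenient units and compare them by ratio."""

# Figures from the list above, in nanoseconds.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Branch mispredict": 5,
    "L2 cache reference": 7,
    "Mutex lock/unlock": 25,
    "Main memory reference": 100,
    "Compress 1K bytes with Zippy": 3_000,
    "Send 2K bytes over 1 Gbps network": 20_000,
    "SSD random read": 150_000,
    "Read 1 MB sequentially from memory": 250_000,
    "Round trip within same datacenter": 500_000,
    "Read 1 MB sequentially from SSD": 1_000_000,
    "Disk seek": 10_000_000,
    "Read 1 MB sequentially from disk": 20_000_000,
    "Send packet CA -> Netherlands -> CA": 150_000_000,
}

# Larger units first, so each figure gets the largest unit that keeps it >= 1.
UNITS = [(1_000_000, "ms"), (1_000, "µs"), (1, "ns")]


def humanize(ns: float) -> str:
    """Render a nanosecond figure as ns, µs, or ms."""
    for scale, unit in UNITS:
        if ns >= scale:
            return f"{ns / scale:g} {unit}"
    return f"{ns:g} ns"  # sub-nanosecond figures, e.g. the L1 cache reference


def ratio(slower: str, faster: str) -> float:
    """How many times longer one operation takes than another."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


if __name__ == "__main__":
    for name, ns in LATENCY_NS.items():
        print(f"{name:.<40} {humanize(ns):>8}")
    print(ratio("Disk seek", "Main memory reference"))  # 100000.0
    print(ratio("Read 1 MB sequentially from disk",
                "Read 1 MB sequentially from memory"))  # 80.0
```

By this arithmetic, a disk seek costs as much as 100,000 main memory references, and reading 1 MB sequentially from disk takes 80 times as long as reading it from memory. The unit rule here differs slightly from the list above. For example, the datacenter round trip comes out as 500 µs rather than 0.5 ms, which is the same value.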

Memory-centric data management when locality matters

Curt Monash — Mon, 16 Jul 2012 01:13:40 +0000

Ron Pressler of Parallel Universe/SpaceBase pinged me about a data grid product he was open sourcing, called Galaxy. The idea is that a distributed RAM grid will allocate data, not randomly or via consistent hashing, but rather via a locality-sensitive approach. Notes include:

The original technology was developed to track moving objects on behalf of the Israeli Air Force.
The commercial product is focused on MMO (Massively MultiPlayer Online) games (or virtual worlds).
The underpinnings are being open sourced.
Ron suggests that, among other use cases, Galaxy might work well for graphs.
Ron argues that one benefit is that when lots of things cluster together — e.g. characters in a game — there’s a natural way to split them elastically (shrink the radius for proximity).
The design philosophy seems to be to adapt as many ideas as possible from the way CPUs manage (multiple levels of) RAM cache.
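Galaxy's actual algorithm isn't described here, so what follows is only a toy Python sketch of the general contrast. Objects with 2-D positions are assigned to nodes in one of two ways: by hashing their IDs, or by hashing the grid cell their position falls in. The node count, cell sizes, helper names, and data are all illustrative assumptions, not Galaxy's API.

```python
"""Toy comparison of hash-based vs. locality-sensitive (grid) placement."""
import hashlib
from functools import partial

NUM_NODES = 4  # hypothetical cluster size


def bucket(key: str) -> int:
    """Map a string key to a node with SHA-1, which (unlike str hash) is stable across runs."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES


def hash_placement(obj_id: str, x: float, y: float) -> int:
    """Random-ish placement: the object's position is ignored."""
    return bucket(obj_id)


def grid_placement(obj_id: str, x: float, y: float, cell_size: float = 100.0) -> int:
    """Locality-sensitive placement: everything in one grid cell shares a node."""
    cell_x, cell_y = int(x // cell_size), int(y // cell_size)
    return bucket(f"cell:{cell_x},{cell_y}")


def nodes_touched(positions: dict, placer) -> int:
    """Number of distinct nodes a group of objects is spread across."""
    return len({placer(obj_id, x, y) for obj_id, (x, y) in positions.items()})


# Hypothetical data: twenty game characters gathered in one small area.
crowd = {f"char{i}": (10.0 + i, 10.0 + i % 3) for i in range(20)}

print("hash placement:", nodes_touched(crowd, hash_placement), "nodes")
print("grid placement:", nodes_touched(crowd, grid_placement), "node(s)")

# Shrinking the cells splits the same crowd across several cells, which may then
# be assigned to different nodes: a crude analogue of shrinking the proximity radius.
fine_grid = partial(grid_placement, cell_size=5.0)
print("5-unit cells:", nodes_touched(crowd, fine_grid), "node(s)")
```

With grid placement, a query about "everything near this character" hits one node. With hashed placement, the same query fans out across the cluster.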

The whole thing is discussed in considerable detail in a blog post and especially in a Hacker News comment thread. There’s also an error-riddled TechCrunch article.

In the areas I cover, “error-riddled TechCrunch article” is pretty much a redundant phrase — but that post looked particularly bad.

Meanwhile, I just noticed a May, 2009 blog post out of Progress Apama. The idea was that event streaming technology could be used to track moving objects, something I heard directly from the CEP (Complex Event Processing) vendors in the 2007 – 2009 period as well.

My tentative opinions on all this start:

Locality is really important for graphs. Random partitioning is crazy if there’s a locality-friendly alternative.
Ron plays different MMOs than I do. That said, the real market would more likely be new games than existing ones. And Guild Wars 2, for example, is showing how to gather many characters together in a small game area.
It’s easy to conceive of cases in which there’s so much specific information about moving objects’ locations that you have to throw much of it away, rather than persisting it all. That speaks for memory-centric technology in general, and data reduction in particular (in the CEP sense of “data reduction”, not the statistics meaning).
Sensor and scientific data often have strong locality.
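The following small Python sketch puts a number on the claim that random partitioning is crazy. It isn't tied to any product. It builds a hypothetical graph of four tightly knit communities joined by single bridge edges. It then counts how many edges cross partition boundaries (the "edge cut") under two assignments: a random one, and one that keeps each community together.

```python
"""Edge cut of a toy community-structured graph under random vs. locality-aware partitioning."""
import random

NUM_GROUPS = 4    # hypothetical communities, one per partition
GROUP_SIZE = 25   # vertices per community


def build_graph():
    """Dense ring-plus-chord edges inside each group, one bridge edge between groups."""
    edges = []
    for g in range(NUM_GROUPS):
        base = g * GROUP_SIZE
        for i in range(GROUP_SIZE):
            edges.append((base + i, base + (i + 1) % GROUP_SIZE))
            edges.append((base + i, base + (i + 2) % GROUP_SIZE))
        edges.append((base, ((g + 1) % NUM_GROUPS) * GROUP_SIZE))
    return edges


def edge_cut(edges, part):
    """Count edges whose endpoints land on different partitions.

    Each such edge means a traversal step has to cross the network.
    """
    return sum(1 for u, v in edges if part[u] != part[v])


edges = build_graph()
vertices = sorted({v for edge in edges for v in edge})

rng = random.Random(42)  # fixed seed so the run is repeatable
random_part = {v: rng.randrange(NUM_GROUPS) for v in vertices}
locality_part = {v: v // GROUP_SIZE for v in vertices}

print("edges:", len(edges))
print("random partitioning cuts:", edge_cut(edges, random_part))
print("locality partitioning cuts:", edge_cut(edges, locality_part))
```

Under random assignment, the two endpoints of an edge share a partition with probability 1/4, so roughly three-quarters of the edges would be expected to cross the network. The locality-aware assignment cuts only the four bridges.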

Related link

I’ve written a fair amount recently about graph data management, although I haven’t tackled the partitioning issue head-on.

Workday update

Curt Monash — Thu, 14 Jun 2012 17:22:03 +0000

In August 2010, I wrote about Workday’s interesting technical architecture, highlights of which included:

Lots of small Java objects in memory.
A very simple MySQL backing store (append-only, <10 tables).
Some modernistic approaches to application navigation.
A faceted approach to BI.

I caught up with Workday recently, and things have naturally evolved. Most of what we talked about (by my choice) dealt with data management, business intelligence, and the overlap between the two.

It is now reasonable to say that Workday’s servers fall into at least seven tiers, although we talked mainly about five that work together as a kind of giant app/database server amalgamation. The three that do noteworthy data management can be described as:

In-memory objects and transactions. This is similar to what Workday had before.
Persistent MySQL. Part of this is similar to what Workday had before. In addition, Workday is now storing certain data in tables in the ordinary relational way.
In-memory caching and indexing. This has three aspects:
- Indexes for the ordinary relational tables, organized in interesting ways.
- Indexes for Workday’s search-box navigation (as per my original Workday technical post, you can search across objects, task-names, etc.).
- Compressed copies of the Java objects, used to instantiate other servers as needed. The most obvious uses of this are:
  - Recovery for the object/transaction tier.
  - Launch for the elastic compute tier. (Described below.)

Two other Workday server tiers may be described as:

Elastic compute. This is used for a few kinds of tasks, such as payroll processing, batch reporting or, I presume, batch ETL.
Assorted management services. The list CTO Stan Swete sent over included (and I quote verbatim):
- Environment management. What servers are deployed. What is available. Etc…
- Verification services. To verify that related services are in the various tiers are in sync with the state of the persistence layer.
- Credentials services for authentications.
- Management utilities. Scripts to manipulate (create/copy/move/delete) tenants.
- Services to manage unstructured data.
- PCI services for credit card processing.
- Print services.
- Messaging services.

Finally, Workday has a couple of server types or tiers for talking with other systems, namely for user interface and integrations.

Besides data management, the other cool thing we discussed was a type of live report called worklets. The idea is:

At their heart, worklets are 2-dimensional reports, with other attributes serving as drill-down dimensions.
Worklets take up little screen real estate.
Thus, worklets are suitable for mobile (tablet) devices, or for embedding in various parts of the application (including otherwise transactional parts).
A worklet conveys a little bit of information; if you want more, you can pull it from the server.
Worklets are cached on the user interface server. The total result set in a worklet should be relatively small.
The two main examples Stan gave me of what a worklet starts with were:
- A few rows from a result set.
- A graphical result (but not the detailed data you might want to drill down to behind it).
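Workday's implementation wasn't described in detail, so the following is only a hypothetical Python sketch of the idea as stated. A worklet sums a measure along two chosen dimensions and keeps the result small enough to cache. It answers a drill-down request by filtering on some other attribute. The function name, field names, and records are all made up for illustration.

```python
"""Hypothetical worklet: a small 2-D summary with drill-down on other attributes."""
from collections import defaultdict

# Made-up HR-style records; every field name here is an assumption.
RECORDS = [
    {"region": "East", "dept": "Sales", "level": "Senior", "headcount": 4},
    {"region": "East", "dept": "Sales", "level": "Junior", "headcount": 6},
    {"region": "East", "dept": "Eng", "level": "Senior", "headcount": 8},
    {"region": "West", "dept": "Sales", "level": "Junior", "headcount": 3},
    {"region": "West", "dept": "Eng", "level": "Senior", "headcount": 5},
    {"region": "West", "dept": "Eng", "level": "Junior", "headcount": 7},
]


def worklet(records, row_dim, col_dim, measure, **filters):
    """Sum `measure` into a {row: {col: total}} grid, after optional drill-down filters."""
    grid = defaultdict(lambda: defaultdict(int))
    for rec in records:
        if all(rec.get(k) == v for k, v in filters.items()):
            grid[rec[row_dim]][rec[col_dim]] += rec[measure]
    # Plain nested dicts: a small result that is easy to cache on a UI server.
    return {row: dict(cols) for row, cols in grid.items()}


summary = worklet(RECORDS, "region", "dept", "headcount")
print(summary)  # {'East': {'Sales': 10, 'Eng': 8}, 'West': {'Sales': 3, 'Eng': 12}}

drill = worklet(RECORDS, "region", "dept", "headcount", level="Senior")
print(drill)    # {'East': {'Sales': 4, 'Eng': 8}, 'West': {'Eng': 5}}
```

The compact grid is what would be shown and cached. Each drill-down is a separate, equally small request back to the server.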

Circling back to the five app-server-like tiers, further notes include:

The elastic compute tier currently sits in the same data centers as the rest of Workday’s system. In the future, however, it could run on Amazon as an alternative.
Behind the scenes, reports can run either against tabular data or by traversing the in-memory objects. While some reports are 100-1000x faster on tables, traversing the object graph is in other cases actually more performant. Anyhow, this choice is transparent to users.
Some data is duplicated between objects and ordinary tables; some is tabular-only.
The indexes on tabular data are custom to Workday, not native to MySQL. They’re organized in line with the object structure, in a way that sounds somewhat reminiscent of the Akiban hKey.
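Neither Workday's index format nor Akiban's exact encoding is spelled out here. The following Python sketch only illustrates the general idea of a hierarchical key. Each child row's key is its parent's key plus its own identifier. As a result, one sorted index stores each parent immediately followed by its children, and a whole subtree can be read with a single range scan. The customer/order/item tables, ordinals, and helper names are hypothetical.

```python
"""Hierarchical ("hKey-style") keys: each child row sorts right after its parent."""
from bisect import bisect_left

# Hypothetical table ordinals for a customer -> order -> item hierarchy.
CUSTOMER, ORDER, ITEM = 1, 2, 3


def customer_key(cid):
    return (CUSTOMER, cid)


def order_key(cid, oid):
    # An order's key starts with its parent customer's key.
    return customer_key(cid) + (ORDER, oid)


def item_key(cid, oid, iid):
    # An item's key starts with its parent order's key.
    return order_key(cid, oid) + (ITEM, iid)


rows = {
    item_key(2, 20, 1): "item 2/20/1",
    customer_key(1): "customer 1",
    order_key(2, 20): "order 2/20",
    customer_key(2): "customer 2",
    order_key(1, 10): "order 1/10",
    item_key(1, 10, 1): "item 1/10/1",
    order_key(1, 11): "order 1/11",
}
index = sorted(rows)  # tuples sort lexicographically, with a prefix before its extensions


def subtree(prefix):
    """Range scan: seek to `prefix`, then read until keys stop sharing it."""
    out = []
    for key in index[bisect_left(index, prefix):]:
        if key[:len(prefix)] != prefix:
            break  # sorted order keeps each subtree contiguous
        out.append(rows[key])
    return out


print([rows[k] for k in index])
print(subtree(customer_key(1)))
```

In sorted order, each customer is followed directly by its orders and each order by its items. Fetching "customer 1 and everything under it" therefore touches one contiguous stretch of the index.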

Besides (or in some cases including) the above, the development team is very concerned with controlling the memory footprint of the in-memory Workday system, and it sounds like improvements over time have literally been at the order(s) of magnitude level. Stan seems to attribute this largely to:

Choosing the right kind of Java collection for various groups of Java objects.
Compression, although he didn’t give particulars.

Going forward, Workday hopes to get a further 2X+ reduction via lazy loading driven by object usage stats.
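Stan didn't describe the mechanism, so the following is only a hypothetical Python sketch of "lazy loading driven by object usage stats." Every object is kept compressed and is decompressed only when first accessed. If its access count stays below a threshold, its materialized copy is dropped again. Python dicts stand in for Java objects, and zlib plus JSON stand in for whatever compression Workday actually uses. The class and method names are illustrative.

```python
"""Hypothetical lazy loading: objects stay compressed until used; cold ones are evicted."""
import json
import zlib
from collections import Counter


class LazyObjectStore:
    """Keeps every object compressed and materializes it only on access.

    Objects are treated as read-only here; a real system would also have to
    write changes back before evicting a materialized copy.
    """

    def __init__(self):
        self._compressed = {}           # object id -> zlib-compressed JSON bytes
        self._live = {}                 # object id -> materialized dict
        self.access_counts = Counter()  # usage statistics that drive eviction

    def put(self, obj_id, obj):
        self._compressed[obj_id] = zlib.compress(json.dumps(obj).encode("utf-8"))
        self._live.pop(obj_id, None)  # drop any stale materialized copy

    def get(self, obj_id):
        self.access_counts[obj_id] += 1
        if obj_id not in self._live:
            # Lazy load: decompress only when the object is actually needed.
            self._live[obj_id] = json.loads(zlib.decompress(self._compressed[obj_id]))
        return self._live[obj_id]

    def evict_cold(self, min_accesses):
        """Drop materialized copies of objects used fewer than `min_accesses` times."""
        cold = [oid for oid in self._live if self.access_counts[oid] < min_accesses]
        for oid in cold:
            del self._live[oid]  # the compressed copy is still available
        return cold

    def live_count(self):
        return len(self._live)


store = LazyObjectStore()
for i in range(100):
    store.put(f"worker{i}", {"id": i, "name": f"Worker {i}"})
print(store.live_count())  # 0: nothing materialized yet

for _ in range(5):
    store.get("worker7")   # a frequently used object
store.get("worker42")      # an object used once
print(store.live_count())  # 2

print(store.evict_cold(min_accesses=2))  # ['worker42']
print(store.live_count())  # 1
```

Only objects that are actually used occupy memory in uncompressed form. The usage counts decide which of those copies are worth keeping.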

If I were starting a transactional SaaS (Software as a Service) vendor today, I might look at an architecture a lot like Workday’s. In particular:

Having everything start in a Java object model makes loads of sense. (This is a use case for which Java seems far from obsolete.)
The dual data model (object and tabular) seems very appealing. Rather than shoehorn data into an uncomfortable model, deal instead with the discomfort of running a couple of (logical) data stores side by side.
Having multiple server tiers is pretty much a best practice.

However:

Workday’s extreme wheel-reinvention in the area of database management might not make sense for smaller companies.
In any given case, the exact choice of tiers might be different from Workday’s. In particular, there might need to be more explicitly analytics-oriented tiers than Workday chooses to split out.