Short-request and analytic processing
A few years ago, I suggested that database workloads could be divided into two kinds — transactional and analytic. The advent of non-transactional NoSQL has suggested that we need a replacement term for “transactional” or “OLTP”, but finding one has been a bit difficult. Numerous tries, including high-volume simple processing, online request processing, internet request processing, network request processing, short request processing, and rapid request processing, have turned out to be imperfect, as per the discussion at each of those links. But then, no category name is ever perfect anyway. I’ve finally settled on short-request processing, largely because I think it does a good job of preserving the analytic-vs-bang-bang-not-analytic workload distinction.
The easy part of the distinction goes roughly like this (a toy sketch follows the list):
- Anything transactional or “OLTP” is short-request.
- Anything “OLAP” is analytic.
- Updates of small amounts of data are probably short-request, be they transactional or not.
- Retrievals of one or a few records in the ordinary course of update-intensive processing are probably short-request.
- Queries that return or aggregate large amounts of data — even in intermediate result sets — are probably analytic.
- Queries that would take a long time to run on a badly chosen or badly configured DBMS are probably analytic (even if they run nice and fast on your actual system).
- Analytic processes that go beyond querying or simple arithmetic are — you guessed it! — analytic.
- Anything expressed in MDX is probably analytic.
- Driving a dashboard is usually analytic.
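For illustration only, those rules of thumb can be boiled down into a toy classifier. Everything in this sketch, thresholds and field names alike, is my own invention, not anything from a real workload manager:

```python
# Toy illustration of the short-request-vs-analytic rules of thumb above.
# All thresholds and argument names are invented for illustration.
def classify(op_type, rows_touched, intermediate_rows=0, uses_mdx=False):
    if uses_mdx:
        return "analytic"              # anything expressed in MDX
    if op_type == "update" and rows_touched < 100:
        return "short-request"         # small updates, transactional or not
    if op_type == "read" and rows_touched < 100 and intermediate_rows < 1000:
        return "short-request"         # point lookups amid update-heavy work
    return "analytic"                  # big scans, aggregations, dashboards

print(classify("update", 1))                           # short-request
print(classify("read", 5))                             # short-request
print(classify("read", 10, intermediate_rows=10**7))   # analytic
```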
Where the terminology gets more difficult is in a few areas of what one might call real-time or near-real-time analytics. My first takes are: Read more
Categories: Analytic technologies, Data warehousing, MySQL, NoSQL, OLTP | 34 Comments |
Introduction to Citrusleaf
Citrusleaf is the vendor of yet another short-request/NoSQL database management system, conveniently named Citrusleaf. Highlights for Citrusleaf the company include:
- 8 employees.
- $2 million in recently acquired venture capital.
- 1 1/2 – 2 1/2 years of total company history, depending on how you count.
- An undisclosed but nonzero number of paying customers, concentrated in the real-time advertising market, with a typical application being cookie management.
Citrusleaf the product is a kind of key-value store; however, the values are in the form of rows, so what you really look up is (key, field name, value) triples. Right now only the keys are indexed; futures include indexing on the individual fields, so as to support some basic analytics. SQL support is an eventual goal. Other Citrusleaf buzzword basics include:
- ACID-compliant.
- Log-structured.
- Tunable consistency model.
To date, Citrusleaf customers have focused on sub-millisecond data retrieval, preferably 0.2-0.3 milliseconds. Accordingly, none has chosen to put the primary Citrusleaf data store on disk. Rather:
- Citrusleaf indexes are always in RAM. (Citrusleaf forces this, actually.)
- You can keep data in RAM and copy it to disk.
- You can keep data on solid-state drives. (Just A Bunch Of Flash or Fusion I/O.)
I don’t have a good grasp on what the data structure for those indexes is.
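To make that data model concrete, here is a minimal sketch of a key-value store whose values are rows, so that lookups amount to (key, field name, value) triples. It is purely my own illustration, not Citrusleaf's actual API or internals:

```python
# Minimal sketch of a key-value store whose values are rows, making
# lookups effectively (key, field name, value) triples.
# My own illustration, not Citrusleaf's actual API or data structure.
class RowStore:
    def __init__(self):
        self.index = {}              # key -> row; the index stays in RAM

    def put(self, key, field, value):
        self.index.setdefault(key, {})[field] = value

    def get(self, key, field=None):
        row = self.index.get(key, {})
        return row if field is None else row.get(field)

store = RowStore()
store.put("user:123", "last_seen", "2011-04-18")
store.put("user:123", "segment", "auto-intender")
print(store.get("user:123", "segment"))   # -> auto-intender
print(store.get("user:123"))              # -> the whole row
```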
Citrusleaf characterizes its customers as firms that have “a couple of KB” of data on “every” person in North America. Naively, that sounds like a terabyte or less to me, but Citrusleaf says 1-3 terabytes is most common. Or to quote the press release, “The most common deployments for Citrusleaf 2.0 are terabytes of data, billions of objects, and 200K plus transactions per second per node, with sub-millisecond latency.” 4-8 nodes seem to be typical for Citrusleaf databases (all figures pre-replication). I didn’t ask what kind of hardware is at each node.
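For what it's worth, here is the naive arithmetic behind my "terabyte or less" guess, with a round population figure of my own choosing:

```python
# Back-of-envelope check on "a couple of KB on every person in North America".
# The population figure is a round assumption on my part.
people = 400_000_000           # rough North American population
bytes_per_person = 2 * 1024    # "a couple of KB"
total_tb = people * bytes_per_person / 1024**4
print(f"{total_tb:.2f} TB")    # ~0.75 TB, i.e. a terabyte or less pre-replication
```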
Citrusleaf data distribution features include: Read more
Categories: Aerospike, NoSQL, Parallelization | 6 Comments |
MySQL, hash joins and Infobright
Over a 24-hour-or-so period, Daniel Abadi, Dmitriy Ryaboy, and Randolph Pullen all remarked on MySQL’s lack of hash joins. (It relies on nested loops instead, which were state-of-the-art technology around the time of the Boris Yeltsin administration.) This led me to wonder — why is this not a problem for Infobright?
Per Infobright chief scientist Dominik Slezak, the answer is:
Infobright perform joins using its own optimization/execution layers (that actually include hash join algorithms and advanced knowledge-grid-based nested loop optimizations in particular).
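For readers who want the contrast spelled out, here is a schematic sketch of the two join strategies. It is textbook pseudocode made runnable, not MySQL's or Infobright's actual implementation:

```python
# Textbook contrast between nested-loop and hash joins; a schematic
# sketch, not actual MySQL or Infobright internals.

def nested_loop_join(left, right, key):
    # O(len(left) * len(right)): rescan the inner table for every outer row.
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    # O(len(left) + len(right)): build a hash table on one input,
    # then probe it once per row of the other.
    table = {}
    for l in left:
        table.setdefault(l[key], []).append(l)
    return [(l, r) for r in right for l in table.get(r[key], [])]

orders = [{"cust_id": 1, "amount": 9}, {"cust_id": 2, "amount": 17}]
custs  = [{"cust_id": 1, "name": "Acme"}, {"cust_id": 2, "name": "Bolt"}]
assert nested_loop_join(orders, custs, "cust_id") == hash_join(orders, custs, "cust_id")
```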
Categories: Infobright, MySQL, Theory and architecture | 4 Comments |
Analytic performance — the persistent need for speed
Analytic DBMS and other analytic platform technologies are much faster than they used to be, both in absolute and price/performance terms. So the question naturally arises, “When is the performance enough?” My answer, to a first approximation, is “Never.” Obviously, your budget limits what you can spend on analytics, and anyhow the benefit of incremental expenditure at some point can grow quite small. But if analytic processing capabilities were infinite and free, we’d do a lot more with analytics than anybody would consider today.
I have two lines of argument supporting this view. One is application-oriented. Machine-generated data will keep growing rapidly. So using that data requires ever more processing resources as well. Analytic growth, rah-rah-rah; company valuation, sis-boom-bah. Application areas include but are not at all limited to marketing, law enforcement, investing, logistics, resource extraction, health care, and science.
The other approach is to point out some computational areas where vastly more analytic processing resources could be used than are available today. Consider, if you will, statistical modeling, graph analytics, optimization, and stochastic planning. Read more
Categories: Analytic technologies, RDF and graphs | Leave a Comment |
DataStax introduces a Cassandra-based Hadoop distribution called Brisk
Cassandra company DataStax is introducing a Hadoop distribution called Brisk, for use cases that combine short-request and analytic processing. Brisk in essence replaces HDFS (Hadoop Distributed File System) with a Cassandra-based file system called CassandraFS. The whole thing is due to be released (Apache open source) within the next 45 days.
The core claims for Cassandra/Brisk/CassandraFS are:
- CassandraFS has the same interface as HDFS. So, in particular, you should be able to use most Hadoop add-ons with Brisk (see the sketch after this list).
- CassandraFS has comparable performance to HDFS on sequential scans. That’s without predicate pushdown to Cassandra, which is Coming Soon but won’t be in the first Brisk release.
- Brisk/CassandraFS is much easier to administer than HDFS. In particular, there is no NameNode or JobTracker single point of failure, nor any other form of head node. Brisk/CassandraFS is strictly peer-to-peer.
- Cassandra is far superior to HBase for short-request use cases, specifically with 5-6X the random-access performance.
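The "same interface" point is what lets Hadoop add-ons run unchanged. Here is a generic duck-typing sketch of my own devising, not Brisk's actual (Java) API:

```python
# Why "same interface as HDFS" matters: anything written against the
# filesystem interface runs unchanged when the backend is swapped.
# A generic sketch of my own, not Brisk's actual API.
class HDFSLike:
    def read(self, path): return f"hdfs bytes of {path}"

class CassandraFSLike:
    def read(self, path): return f"cassandra bytes of {path}"

def hadoop_addon(fs, path):
    # An "add-on" only cares about the interface, not the implementation.
    return len(fs.read(path))

for fs in (HDFSLike(), CassandraFSLike()):
    print(hadoop_addon(fs, "/logs/day1"))
```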
There’s a pretty good white paper around all this, which also recites general Cassandra claims — [edit] and here at last is the link.
Categories: Cassandra, DataStax, Hadoop, HBase, MapReduce, Open source | 3 Comments |
Hadapt (commercialized HadoopDB)
The HadoopDB company Hadapt is finally launching, based on the HadoopDB project, albeit with code rewritten from scratch. As you may recall, the core idea of HadoopDB is to put a DBMS on every node, and use MapReduce to talk to the whole database. The idea is to get the same SQL/MapReduce integration as you get if you use Hive, but with much better performance* and perhaps somewhat better SQL functionality.** Advantages vs. a DBMS-based analytic platform that includes MapReduce — e.g. Aster Data — are less clear. Read more
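To illustrate the HadoopDB idea, here is a toy sketch of my own, with SQLite standing in for the per-node DBMS (it is not Hadapt's code): push the aggregate down to each node's local database, then merge the partial results MapReduce-style.

```python
# Toy sketch of the HadoopDB idea: a DBMS on every node, a "map" phase
# that pushes SQL down to each local database, and a "reduce" phase that
# combines the partial results. SQLite stands in for the per-node DBMS.
import sqlite3

def make_node(rows):                  # one "node" = one local database
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (region TEXT, amount INT)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return db

nodes = [make_node([("east", 10), ("west", 5)]),
         make_node([("east", 7), ("west", 20)])]

# Map: each node runs the aggregate locally.
partials = [db.execute("SELECT region, SUM(amount) FROM sales "
                       "GROUP BY region").fetchall() for db in nodes]

# Reduce: merge the per-node partial aggregates.
totals = {}
for partial in partials:
    for region, amount in partial:
        totals[region] = totals.get(region, 0) + amount
print(totals)   # {'east': 17, 'west': 25}
```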
MySQL soundbites
Oracle announced MySQL enhancements, plus intentions to use MySQL to compete against Microsoft SQL Server. My thoughts, lightly edited from an instant message Q&A, include:
- Given how hard Oracle fought the antitrust authorities to keep MySQL around the time of the acquisition, we always knew they were serious about the business.
- We’ll know they’re even more serious if they buy MySQL enhancements such as Infobright, dbShards, or Schooner MySQL.
- Oracle-quality MySQL’s most obvious target is SQL Server.
- But if you’ve bought into the Windows stack, why not stay bought-in?
- MySQL vs. SQL Server competition is mainly about new applications; few users will actually switch.
- A lot of SaaS vendors use Oracle Standard Edition, and have some MySQL somewhere as well. They don’t want to pay up for Oracle Enterprise Edition or Exadata. Good MySQL could suit them.
- Mainly, I see the Short Request Processing market as being a battle between MySQL versions and NoSQL systems. (I’m a VoltDB pessimist.)
The last question was “Is there an easy shorthand to describe how Oracle DB is superior to MySQL even with these improvements?” My responses, again lightly edited, were: Read more
Categories: Analytic technologies, Exadata, MySQL, NoSQL, Oracle, Software as a Service (SaaS) | 2 Comments |
So how many columns can a single table have anyway?
I have a client who is hitting a 1,000-column-per-table limit in Oracle Standard Edition. As you might imagine, I’m encouraging them to consider columnar alternatives. Be that as it may, just what ARE the table width limits in various analytic or general-purpose DBMS products?
By the way — the answer SHOULD be “effectively unlimited.” Like it or not,* there are a bunch of multi-thousand-column marketing-prospect-data tables out there.
*Relational purists may dislike the idea for one reason, privacy-concerned folks for quite another.
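If you want to probe a given DBMS's limit empirically, one quick approach is to generate ever-wider CREATE TABLE statements until one fails. The snippet below just emits the SQL, since connection details vary by product:

```python
# Generate a wide CREATE TABLE statement to probe a DBMS's column limit.
# Feed the output to your DBMS of choice; Oracle, for instance, documents
# a 1,000-column-per-table maximum.
def wide_table_sql(n_cols, table="wide_test"):
    cols = ", ".join(f"c{i} NUMBER" for i in range(n_cols))
    return f"CREATE TABLE {table} ({cols})"

print(wide_table_sql(1001)[:80], "...")   # 1,001 columns: over Oracle's limit
```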
Categories: Data warehousing, Surveillance and privacy | 37 Comments |
Notes for my March 10 Investigative Analytics webinar
It turns out that the slide deck I posted a couple of days ago underwent more changes than I expected. Here’s a more current version. A number of the changes arose when I thought more about how to categorize analytic business benefits; hence that blog post a few minutes ago with more detail on the same subject.
Unchanged, however, is the more technical list of six things you can do with analytic technology, taken from a blog post late last year. Also unaltered are my definitions of investigative analytics and machine-generated data.
I write extensively on privacy. This technological overview of privacy threats doubles as a survey of advanced investigative analytics techniques now coming into practical use.
And finally, on a happier note — if you enjoyed the xkcd cartoon, here are two links to that one and a few more.
Categories: Analytic technologies, Presentations | Leave a Comment |
The three principal kinds of analytic business benefit
When I tweaked the slide deck for Thursday’s Investigative Analytics webinar — I’ll post an updated version soon — the part that needed the most work was the section on “What business problems do you solve with this stuff anyway?” I’ve posted about that kind of thing at least five times in the past five years, across three different blogs (linked below). But perhaps this time I can really simplify matters, albeit at the cost of being not quite complete.
A large fraction of all analytic efforts ultimately serve one or more of three purposes:
- Marketing
- Problem and anomaly detection and diagnosis
- Planning and optimization
Those areas obviously overlap. Indeed, it can be argued that everything one does in business amounts to “optimization,” and everything in analysis boils down to noticing and understanding anomalies. Still, I am hopeful that this is an instructive categorization, as per the many examples below. Read more
Categories: Analytic technologies | 8 Comments |