March 30, 2011

Short-request and analytic processing

A few years ago, I suggested that database workloads could be divided into two kinds — transactional and analytic. The advent of non-transactional NoSQL has suggested that we need a replacement term for “transactional” or “OLTP”, but finding one has been a bit difficult. Numerous tries — high-volume simple processing, online request processing, internet request processing, network request processing, short request processing, and rapid request processing — have turned out to be imperfect, as per the discussion at each of those links. But then, no category name is ever perfect anyway. I’ve finally settled on short request processing, largely because I think it does a good job of preserving the analytic-vs-bang-bang-not-analytic workload distinction.

The easy part of the distinction goes roughly like this:

Where the terminology gets more difficult is in a few areas of what one might call real-time or near-real-time analytics. My first takes are:

March 29, 2011

Introduction to Citrusleaf

Citrusleaf is the vendor of yet another short-request/NoSQL database management system, conveniently named Citrusleaf. Highlights for Citrusleaf the company include:

Citrusleaf the product is a kind of key-value store; however, the values are in the form of rows, so what you really look up is (key, field name, value) triples. Right now only the keys are indexed; futures include indexing on the individual fields, so as to support some basic analytics. SQL support is an eventual goal. Other Citrusleaf buzzword basics include:

To date, Citrusleaf customers have focused on sub-millisecond data retrieval, preferably 0.2-0.3 milliseconds. Accordingly, none has chosen to put the primary Citrusleaf data store on disk. Rather:

I don’t have a good grasp on what the data structure for those indexes is.
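To make the (key, field name, value) model concrete, here is a minimal sketch of a key-value store whose values are rows, with only the primary key indexed. This is purely illustrative — a hypothetical toy, not Citrusleaf's actual API or data structures:

```python
class RowStore:
    """Keys map to rows; each row is a dict of field name -> value.
    Only the key is indexed, so every lookup starts from the key."""

    def __init__(self):
        self._index = {}  # primary-key index: key -> row

    def put(self, key, field, value):
        # Store one (key, field name, value) triple.
        self._index.setdefault(key, {})[field] = value

    def get(self, key, field):
        # Resolve (key, field name) -> value via the key index only.
        return self._index.get(key, {}).get(field)

store = RowStore()
store.put("user:42", "name", "Alice")
store.put("user:42", "visits", 17)
print(store.get("user:42", "visits"))  # -> 17
```

Indexing on the individual fields — the future Citrusleaf mentions — would amount to maintaining secondary indexes from field values back to keys, which is what basic analytics over this model would need.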

Citrusleaf characterizes its customers as firms that have “a couple of KB” of data on “every” person in North America. Naively, that sounds like a terabyte or less to me, but Citrusleaf says 1-3 terabytes is most common. Or to quote the press release, “The most common deployments for Citrusleaf 2.0 are terabytes of data, billions of objects, and 200K plus transactions per second per node, with sub-millisecond latency.” 4-8 nodes seems to be typical for Citrusleaf databases (all figures pre-replication). I didn’t ask what kind of hardware is at each node.
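The "terabyte or less" estimate is easy to check on the back of an envelope. Assuming a North American population of roughly 350 million (my figure, not Citrusleaf's):

```python
# Sizing check: "a couple of KB" of data on "every" person in North America.
population = 350_000_000        # assumed; roughly North America's population
bytes_per_person = 2 * 1024     # "a couple of KB"
total_tb = population * bytes_per_person / 1024**4
print(f"{total_tb:.2f} TB")     # prints "0.65 TB"
```

That lands below a terabyte pre-replication, so Citrusleaf's 1-3 terabyte figure implies either more than a couple of KB per person, replicas counted in, or indexes and overhead on top of the raw data.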

Citrusleaf data distribution features include:

March 24, 2011

MySQL, hash joins and Infobright

Over a period of 24 hours or so, Daniel Abadi, Dmitriy Ryaboy and Randolph Pullen all remarked on MySQL’s lack of hash joins. (It relies on nested loops instead, which were state-of-the-art technology around the time of the Boris Yeltsin administration.) This led me to wonder — why is this not a problem for Infobright?

Per Infobright chief scientist Dominik Slezak, the answer is:

Infobright perform[s] joins using its own optimization/execution layers (that actually include hash join algorithms and advanced knowledge-grid-based nested loop optimizations in particular).
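For readers who want the distinction spelled out: on an equi-join, a nested loop scans one table once per row of the other (O(n·m) comparisons), while a hash join builds a hash table in one pass and probes it in another (O(n+m)). A toy illustration — not MySQL's or Infobright's actual code:

```python
def nested_loop_join(left, right, key):
    # For every left row, scan all of right: O(len(left) * len(right)).
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    # Build phase: hash the right table once on the join key.
    table = {}
    for r in right:
        table.setdefault(r[key], []).append(r)
    # Probe phase: one hash lookup per left row.
    return [(l, r) for l in left for r in table.get(l[key], [])]

users  = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "item": "x"}, {"id": 1, "item": "y"}]
assert nested_loop_join(users, orders, "id") == hash_join(users, orders, "id")
```

The two produce identical results; the difference only shows up as table sizes grow, which is exactly when an analytic DBMS needs it to.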

March 24, 2011

Analytic performance — the persistent need for speed

Analytic DBMS and other analytic platform technologies are much faster than they used to be, both in absolute and price/performance terms. So the question naturally arises, “When is the performance enough?” My answer, to a first approximation, is “Never.” Obviously, your budget limits what you can spend on analytics, and anyhow the benefit of incremental expenditure at some point can grow quite small. But if analytic processing capabilities were infinite and free, we’d do a lot more with analytics than anybody would consider today.

I have two lines of argument supporting this view. One is application-oriented. Machine-generated data will keep growing rapidly. So using that data requires ever more processing resources as well. Analytic growth, rah-rah-rah; company valuation, sis-boom-bah. Application areas include but are not at all limited to marketing, law enforcement, investing, logistics, resource extraction, health care, and science.

The other approach is to point out some computational areas where vastly more analytic processing resources could be used than are available today. Consider, if you will, statistical modeling, graph analytics, optimization, and stochastic planning.

March 23, 2011

DataStax introduces a Cassandra-based Hadoop distribution called Brisk

Cassandra company DataStax is introducing a Hadoop distribution called Brisk, for use cases that combine short-request and analytic processing. Brisk in essence replaces HDFS (Hadoop Distributed File System) with a Cassandra-based file system called CassandraFS. The whole thing is due to be released (Apache open source) within the next 45 days.

The core claims for Cassandra/Brisk/CassandraFS are:

There’s a pretty good white paper around all this, which also recites general Cassandra claims — [edit] and here at last is the link.

March 23, 2011

Hadapt (commercialized HadoopDB)

Hadapt, the company commercializing the HadoopDB research project, is finally launching, albeit with code rewritten from scratch. As you may recall, the core idea of HadoopDB is to put a DBMS on every node, and use MapReduce to talk to the whole database. The goal is to get the same SQL/MapReduce integration as you get if you use Hive, but with much better performance* and perhaps somewhat better SQL functionality.** Advantages vs. a DBMS-based analytic platform that includes MapReduce — e.g. Aster Data — are less clear.
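The DBMS-per-node pattern can be sketched in a few lines. This is a hypothetical toy using in-memory SQLite as each "node's" local DBMS — not Hadapt's or HadoopDB's actual code — with the map step pushing SQL down to every node and the reduce step merging the partial aggregates:

```python
import sqlite3

def make_node(rows):
    # Each node runs its own local DBMS holding a shard of the data.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (region TEXT, amount INT)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return db

nodes = [make_node([("east", 10), ("west", 5)]),
         make_node([("east", 7)])]

# Map: run the aggregate locally, inside each node's DBMS.
partials = [dict(db.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
    for db in nodes]

# Reduce: merge the per-node partial aggregates.
totals = {}
for p in partials:
    for region, s in p.items():
        totals[region] = totals.get(region, 0) + s
print(sorted(totals.items()))  # [('east', 17), ('west', 5)]
```

The point of the architecture is that the expensive relational work (filtering, aggregation) happens inside a real DBMS on each node, while MapReduce only has to shuffle and merge the much smaller partial results.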

March 15, 2011

MySQL soundbites

Oracle announced MySQL enhancements, plus intentions to use MySQL to compete against Microsoft SQL Server. My thoughts, lightly edited from an instant message Q&A, include:

The last question was “Is there an easy shorthand to describe how Oracle DB is superior to MySQL even with these improvements?” My responses, again lightly edited, were:

March 13, 2011

So how many columns can a single table have anyway?

I have a client who is hitting the 1,000-column-per-table limit in Oracle Standard Edition. As you might imagine, I’m encouraging them to consider columnar alternatives. Be that as it may, just what ARE the table-width limits in various analytic or general-purpose DBMS products?

By the way — the answer SHOULD be “effectively unlimited.” Like it or not,* there are a bunch of multi-thousand-column marketing-prospect-data tables out there.

*Relational purists may dislike the idea for one reason, privacy-concerned folks for quite another.

March 10, 2011

Notes for my March 10 Investigative Analytics webinar

It turns out that the slide deck I posted a couple of days ago underwent more changes than I expected. Here’s a more current version. A number of the changes arose when I thought more about how to categorize analytic business benefits; hence that blog post a few minutes ago with more detail on the same subject.

Unchanged, however, is the more technical list of six things you can do with analytic technology, taken from a blog post late last year. Also unaltered are my definitions of investigative analytics and machine-generated data.

I write extensively on privacy. This technological overview of privacy threats doubles as a survey of advanced investigative analytics techniques now coming into practical use.

And finally, on a happier note — if you enjoyed the xkcd cartoon, here are two links to that one and a few more.

March 10, 2011

The three principal kinds of analytic business benefit

When I tweaked the slide deck for Thursday’s Investigative Analytics webinar — I’ll post an updated version soon — the part that needed the most work was the section on “What business problems do you solve with this stuff anyway?” I’ve posted about that kind of thing at least five times in the past five years, across three different blogs (linked below). But perhaps this time I can really simplify matters, albeit at the cost of being not quite complete.

A large fraction of all analytic efforts ultimately serve one or more of three purposes:

Those areas obviously overlap. Indeed, it can be argued that everything one does in business amounts to “optimization,” and everything in analysis boils down to noticing and understanding anomalies. Still, I am hopeful that this is an instructive categorization, as per the many examples below.
