GIS and geospatial – DBMS 2 : DataBase Management System Services

New legal limits on surveillance in the US

Curt Monash — Mon, 25 Jun 2018 19:55:21 +0000

The United States has new legal limits on electronic surveillance, both in one specific way and — more important — in prevailing judicial theory. This falls far short of the protections we ultimately need, but it’s a welcome development even so.

The recent Supreme Court case Carpenter v. United States is a big deal. Let me start by saying:

Most fundamentally, the Carpenter decision was based on and implicitly reaffirms the Katz test.* This is good.
The Carpenter decision undermines the third-party doctrine.** This is great. Strict adherence to the third-party doctrine would eventually have given the government unlimited rights of Orwellian surveillance.
The Carpenter decision suggests the Court has adopted an equilibrium-adjustment approach to Fourth Amendment jurisprudence.
- The “equilibrium” being maintained here is the balance between governmental rights to intrude on privacy and citizens’ rights not to be intruded on.
- I.e., equilibrium-adjustment is a commitment to maintaining approximately the same level of liberty (with respect to surveillance) we’ve had all along.
- I got the equilibrium-adjustment point from Eugene Volokh’s excellent overview of the Carpenter decision.

*The Katz test basically says that an individual’s right to privacy is whatever society regards as a reasonable expectation of privacy at that time.

**The third-party doctrine basically says that any information of yours given voluntarily to a third party isn’t private. This includes transactional information such as purchases or telephone call detail records (CDRs).

Key specifics include:

The actual issue in Carpenter is whether the government needs a warrant to access cell phone location data that allows a cell phone user’s movements to be quite accurately tracked. The decision on that issue was Yes.
This was a 5-4 decision. Chief Justice Roberts and the four liberal justices voted for it. Swing voter Justice Kennedy and the other three conservative justices voted against it.
The majority was united. Justice Roberts wrote the decision and there were no concurrences.
The dissents were not united, but did generally focus on two kinds of arguments:
- Reliance on the third-party doctrine, commonly expressed as reliance on the Smith and Miller precedents.
- Disagreement with the Katz test.

Also very relevant was the 2012 case requiring warrants for GPS tracking, United States v. Jones. But discussion of the Jones decision is confusing, because while some justices at the time addressed the issue of general electronic tracking of a person’s movements, others focused narrowly on the physical action of installing the GPS device.

Unfortunately, all the good news in Carpenter notwithstanding, the decision doesn’t come close to accomplishing as much as we need. I stand by my oft-repeated observations:

Massive surveillance is inevitable.

Unless the uses of the resulting information are VERY limited, freedoms will be chilled into oblivion.

Justice Roberts correctly wrote in the Carpenter decision:

Mapping a cell phone’s location … provides an intimate window into a person’s life, revealing not only his particular movements, but through them his “familial, political, professional, religious, and sexual associations.”

Justice Kennedy, however, rejoined:

What persons purchase and to whom they talk might disclose how much money they make; the political and religious organizations to which they donate; whether they have visited a psychiatrist, plastic surgeon, abortion clinic, or AIDS treatment center; whether they go to gay bars or straight ones; and who are their closest friends and family members.

His point, also correct, was that the data that police are allowed to get without warrants is even more privacy-violating than the data Carpenter keeps away from them. And so, as good as the Carpenter decision apparently is, privacy invasion and surveillance are still among the gravest threats to liberty that we face.

Related links

My January post on the chaotic politics of privacy is relevant both to the content of the Carpenter decision and to the fact that the decision-makers did not split perfectly along traditional partisan lines.
My 2013 series on privacy theory suggests that gaps in judicial reasoning could be filled by referencing the problem of chilling effects.

Introduction to Crate.io and CrateDB

Curt Monash — Sun, 18 Dec 2016 05:27:15 +0000

Crate.io and CrateDB basics include:

Crate.io makes CrateDB.
CrateDB is a quasi-RDBMS designed to receive sensor data and similar IoT (Internet of Things) inputs.
CrateDB’s creators were perhaps a little slow to realize that the “R” part was needed, but are playing catch-up in that regard.
Crate.io is an outfit founded by Austrian guys, headquartered in Berlin, that is turning into a San Francisco company.
Crate.io says it has 22 employees and 5 paying customers.
Crate.io cites bigger numbers than that for confirmed production users, clearly active clusters, and overall product downloads.

In essence, CrateDB is an open source and less mature alternative to MemSQL. The opportunity for MemSQL and CrateDB alike exists in part because analytic RDBMS vendors didn’t close it off.

CrateDB’s not-just-relational story starts:

A column can contain ordinary values (of usual-suspect datatypes) or “objects”, …
… where “objects” presumably are the kind of nested/hierarchical structures that are common in the NoSQL/internet-backend world, …
… except when they’re just BLOBs (Binary Large OBjects).
There’s a way to manually define “strict schemas” on the structured objects, and a syntax for navigating their structure in WHERE clauses.
There’s also a way to automagically infer “dynamic schemas”, but it’s simplistic enough to be more suitable for development/prototyping than for serious production.

Crate gave an example of data from >800 kinds of sensors being stored together in a single table. This leads to significant complexity in the FROM clauses. But querying the same data in a relational schema would be at least as complicated, and probably worse.

One key to understanding Crate’s architectural choices is to note that they’re willing to have different latency/consistency standards for:

Writes and single-row look-ups.
Aggregates and joins.

And so it makes sense that:

Data is banged into CrateDB in a NoSQL-ish kind of way as it arrives, with RYW consistency.
The indexes needed for SQL functionality are updated in microbatches as soon as possible thereafter. (Think 100 milliseconds as a base case.) Crate.io characterizes the consistency for this part as “eventual”.
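To make the two-speed idea concrete, here is a minimal Python sketch of a store whose primary-key reads see writes immediately, while a secondary index only catches up when a refresh runs. The class and method names (`MicrobatchStore`, `refresh`, `find`) are hypothetical illustrations, not CrateDB APIs, and the sketch is insert-only (it does not clean up index entries when a row is overwritten).

```python
from collections import defaultdict


class MicrobatchStore:
    """Toy store: primary-key reads see writes at once; the secondary index lags."""

    def __init__(self):
        self.rows = {}                  # primary key -> row, updated on every write
        self.pending = []               # keys written since the last index refresh
        self.index = defaultdict(set)   # (column, value) -> set of primary keys

    def write(self, key, row):
        self.rows[key] = row            # immediately visible to get(): read-your-writes
        self.pending.append(key)

    def get(self, key):
        return self.rows.get(key)

    def refresh(self):
        """Fold pending writes into the index; a real system would run this on a timer."""
        for key in self.pending:
            for column, value in self.rows[key].items():
                self.index[(column, value)].add(key)
        self.pending.clear()

    def find(self, column, value):
        """Index-backed query; it only sees rows indexed by the last refresh()."""
        return sorted(self.index.get((column, value), set()))
```

A row written a moment ago is returned by `get` right away, but `find` misses it until the next `refresh`, which is the "eventual" part of the story.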

CrateDB will never have real multi-statement transactions, but it has simpler levels of isolation that may be called “transactions” in some marketing contexts.

CrateDB technical highlights include:

CrateDB records are stored as JSON documents. (Actually, I didn’t ask whether this was true JSON or rather something “JSON-like”.)
- In the purely relational case, the documents may be regarded as glorified text strings.
- I got the impression that BLOB storage was somewhat separate from the rest.
CrateDB’s sharding story starts with consistent hashing.
- Shards are physical-only. CrateDB lacks the elasticity-friendly feature of there being many logical shards for each physical shard.
- However, you can change your shard count, and any future inserts will go into the new set of shards.
In line with its two consistency models, CrateDB also has two indexing strategies.
- Single-row/primary-key lookups have a “forward lookup” index, whatever that is.
- Tables also have a columnar index.
  - More complex queries and aggregations are commonly done straight against the columnar index, rather than the underlying data.
  - CrateDB’s principal columnar indexing strategy sounds a lot like inverted-list, which in turn is a lot like standard text indexing.
  - Specific datatypes — e.g. geospatial — can be indexed in different ways.
- The columnar index is shard-specific, and located at the same node as the shard.
- At least the hotter parts of the columnar index will commonly reside in memory. (I didn’t ask whether this was via straightforward caching or some more careful strategy.)
While I didn’t ask about CrateDB’s replication model in detail, I gathered that:
- Data is written synchronously to all nodes. (That’s sort of implicit in RYW consistency anyway.)
- Common replication factors are either 1 or 3, depending on considerations such as the value of the data. But as is usual, some tables can be replicated across all nodes.
- Data can be read from all replicas, for obvious reasons of performance.
Where relevant — e.g. the wire protocol or various SQL syntax specifics — CrateDB tends to emulate Postgres.
The CrateDB stack includes Elasticsearch and Lucene, both of which make sense in connection with Crate’s text/document orientation.
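As a rough illustration of why an inverted-list style of index can answer aggregations without touching the underlying records, here is a small Python sketch. It is an assumption-laden toy, not CrateDB's or Lucene's actual data structure; `build_inverted_index` and `count_by_value` are names made up for the example.

```python
from collections import defaultdict


def build_inverted_index(rows, column):
    """Map each distinct value of `column` to the list of row ids holding it."""
    postings = defaultdict(list)
    for row_id, row in enumerate(rows):
        postings[row[column]].append(row_id)
    return dict(postings)


def count_by_value(index):
    """Answer a COUNT(*) ... GROUP BY on the indexed column from the index alone."""
    return {value: len(row_ids) for value, row_ids in index.items()}


rows = [
    {"sensor": "temp", "reading": 21},
    {"sensor": "humidity", "reading": 40},
    {"sensor": "temp", "reading": 23},
]
sensor_index = build_inverted_index(rows, "sensor")
counts = count_by_value(sensor_index)
```

The group-by count never reads `rows` again once the index exists, which is the sense in which queries can run "straight against the columnar index".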

Crate.io is proud of its distributed/parallel story.

Any CrateDB node can plan a query. Necessary metadata for that is replicated across the cluster.
Execution starts on a shard-by-shard basis. Data is sorted at each shard before being sent onward.
Crate.io encourages you to run Spark and CrateDB on the same nodes.
- This is supported by parallel Spark-CrateDB integration of the obvious kind.
- Crate.io notes a happy synergy to this plan, in that Spark stresses CPU while CrateDB is commonly I/O-bound.
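The "sort at each shard, then send onward" step can be sketched as a local sort followed by a streaming merge on whichever node is coordinating the query. This is a generic illustration in Python, with hypothetical function names, and says nothing about how CrateDB actually implements it.

```python
import heapq


def shard_scan(shard_rows, key):
    """Each shard sorts its own rows locally before shipping them onward."""
    return sorted(shard_rows, key=key)


def merge_sorted_shards(sorted_streams, key, limit=None):
    """The coordinating node merges already-sorted streams without re-sorting them."""
    result = []
    for row in heapq.merge(*sorted_streams, key=key):
        if limit is not None and len(result) >= limit:
            break
        result.append(row)
    return result
```

Because each stream arrives sorted, a query with a small LIMIT can stop after pulling only a few rows from each shard.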

The CrateDB-Spark integration was the only support I could find for various marketing claims about combining analytics with data management.

Given how small and young Crate.io is, there are of course many missing features in CrateDB. In particular:

A query can only reshuffle data once. Hence, CrateDB isn’t currently well-designed for queries that join more than 2 tables together.
The only join strategy currently implemented is nested loop. Others are in the future.
CrateDB has most of ANSI SQL 92, but little or nothing specific to SQL 99. In particular, SQL windowing is under development.
Geo-distribution is still under development (even though most CrateDB data isn’t actually about people).
I imagine CrateDB administrative tools are still rather primitive.
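For readers who want a reminder of what "nested loop" means as a join strategy, here is the textbook form in Python. It is a generic sketch (the function name and row layout are assumptions), not CrateDB code.

```python
def nested_loop_join(outer, inner, outer_key, inner_key):
    """Compare every outer row with every inner row: O(len(outer) * len(inner))."""
    joined = []
    for outer_row in outer:
        for inner_row in inner:
            if outer_row[outer_key] == inner_row[inner_key]:
                joined.append({**outer_row, **inner_row})
    return joined
```

The quadratic cost is why other strategies, such as hash or merge joins, matter once both inputs are large.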

In any case, creating a robust DBMS is an expensive and time-consuming process. Crate has a long road ahead of it.

Edit: For some clarification and even correction, please see the first comment below.

MemSQL 4.0

Curt Monash — Wed, 20 May 2015 09:41:34 +0000

I talked with my clients at MemSQL about the release of MemSQL 4.0. Let’s start with the reminders:

MemSQL started out as an in-memory OLTP (OnLine Transaction Processing) DBMS …
… but quickly positioned with “We also do ‘real-time’ analytic processing” …
… and backed that up by adding a flash-based column store option …
… before Gartner ever got around to popularizing the term HTAP (Hybrid Transaction and Analytic Processing).
There’s also a JSON option.

The main new aspects of MemSQL 4.0 are:

Geospatial indexing. This is for me the most interesting part.
A new optimizer and, I suppose, query planner …
… which in particular allow for serious distributed joins.
Some rather parallel-sounding connectors to Spark, Hadoop and Amazon S3.
Usual-suspect stuff including:
- More SQL coverage (I forgot to ask for details).
- Some added or enhanced administrative/tuning/whatever tools (again, I forgot to ask for details).
- Surely some general Bottleneck Whack-A-Mole.

There’s also a new free MemSQL “Community Edition”. MemSQL hopes you’ll experiment with this but not use it in production. And MemSQL pricing is now wholly based on RAM usage, so the column store is quasi-free from a licensing standpoint as well.

Before MemSQL 4.0, distributed joins were restricted to the easy cases:

Two tables are distributed (i.e. sharded) on the same key.
One table is small enough to be broadcast to each node.

Now arbitrary tables can be joined, with data reshuffling as needed. Notes on MemSQL 4.0 joins include:

Join algorithms are currently nested-loop and hash, and in “narrow cases” also merge.
MemSQL fondly believes that its in-memory indexes work very well for nested-loop joins.
The new optimizer is fully cost-based (but I didn’t get much clarity as to the cost estimators for JSON).
MemSQL’s indexing scheme, skip lists, had histograms anyway, with the cutesy name skiplistogram.
MemSQL’s queries have always been compiled, and of course have to be planned before compilation. However, there’s a little bit of plan flexibility built in based on the specific values queried for, aka “parameter-sensitive plans” or “run-time plan choosing”.
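A join "with data reshuffling as needed" can be pictured as two steps: redistribute both tables by the join key so that matching rows land on the same node, then join locally. The Python sketch below uses a hash join for the local step. All names are hypothetical, the "nodes" are just lists in one process, and nothing here reflects MemSQL's actual planner or executor.

```python
from collections import defaultdict


def reshuffle(partitions, key, num_nodes):
    """Redistribute rows so that equal join keys land on the same node."""
    shuffled = [[] for _ in range(num_nodes)]
    for partition in partitions:
        for row in partition:
            shuffled[hash(row[key]) % num_nodes].append(row)
    return shuffled


def hash_join(left, right, left_key, right_key):
    """Build a hash table on one input, then probe it with the other."""
    table = defaultdict(list)
    for left_row in left:
        table[left_row[left_key]].append(left_row)
    joined = []
    for right_row in right:
        for left_row in table.get(right_row[right_key], []):
            joined.append({**left_row, **right_row})
    return joined


def distributed_join(left_parts, right_parts, left_key, right_key):
    """Reshuffle both inputs on the join key, then join node by node."""
    num_nodes = len(left_parts)
    left_shuffled = reshuffle(left_parts, left_key, num_nodes)
    right_shuffled = reshuffle(right_parts, right_key, num_nodes)
    result = []
    for node in range(num_nodes):  # each iteration stands in for one node's local work
        result.extend(hash_join(left_shuffled[node], right_shuffled[node], left_key, right_key))
    return result
```

In the two "easy cases" the reshuffle is unnecessary: tables sharded on the same key are already co-located, and a small table can simply be copied to every node.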

To understand the Spark/MemSQL connector, recall that MemSQL has “leaf” nodes, which store data, and “aggregator” nodes, which combine query results and ship them back to the requesting client. The Spark/MemSQL connector manages to skip the aggregation step, instead shipping data directly from the various MemSQL leaf nodes to a Spark cluster. In the other direction, a Spark RDD can be saved into MemSQL as a table. This is also somehow parallel, and can be configured either as a batch update or as an append; intermediate “conflict resolution” policies are possible as well.

In other connectivity notes:

MemSQL’s idea of a lambda architecture involves a Kafka stream, with data likely being stored twice (in Hadoop and MemSQL).
MemSQL likes and supports the Spark DataFrame API, and says financial trading firms are already using it.

Other application areas cited for streaming/lambda kinds of architectures are — you guessed it! — ad-tech and “anomaly detection”.

And now to the geospatial stuff. I thought I heard:

A “point” is actually a square region less than 1 mm per side.
There are on the order of 2^30 such points on the surface of the Earth.

Given that Earth’s surface area is a little over 500,000,000 square meters, I’d think 2^50 would be a better figure, but fortunately that discrepancy doesn’t matter to the rest of the discussion. (Edit: As per a comment below, that’s actually square kilometers, so unless I made further errors we’re up to the 2^70 range.)
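The arithmetic is easy to check. The sketch below assumes a surface area of about 510 million square kilometers (my figure, not one from the briefing) and squares exactly 1 mm on a side; smaller squares would push the exponent a little higher.

```python
import math

EARTH_SURFACE_KM2 = 510e6          # approximate surface area; an assumption for this sketch
MM_PER_KM = 1_000_000
MM_PER_M = 1_000

# Number of 1 mm x 1 mm squares, reading the area as square kilometers.
cells_km_reading = EARTH_SURFACE_KM2 * MM_PER_KM ** 2
bits_km_reading = math.log2(cells_km_reading)

# The same count if the figure is misread as square meters.
cells_m_reading = EARTH_SURFACE_KM2 * MM_PER_M ** 2
bits_m_reading = math.log2(cells_m_reading)
```

The square-kilometer reading comes out at a bit under 69 bits and the square-meter misreading at a bit under 49, which matches the "2^70 range" and "2^50" estimates respectively.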

Anyhow, if the two popular alternatives for geospatial indexing are R-trees or space-filling curves, MemSQL favors the latter. (One issue MemSQL sees with R-trees is concurrency.) Notes on space-filling curves start:

In this context, a space-filling curve is a sequential numbering of points in a higher-dimensional space. (In MemSQL’s case, the dimension is two.)
Hilbert curves seem to be in vogue, including at MemSQL.
Nice properties of Hilbert space-filling curves include:
- Numbers near each other always correspond to points near each other.
- The converse is almost always true as well.*
- If you take a sequence of numbers that is simply the set of all possibilities with a particular prefix string, that will correspond to a square region. (The shorter the prefix, the larger the square.)

*You could say it’s true except in edge cases … but then you’d deserve to be punished.
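For the curious, here is the standard textbook conversion between grid cells and positions along a Hilbert curve, written in Python. It is a generic sketch of the curve itself, on a small n-by-n grid with n a power of 2, and makes no claim about how MemSQL encodes its points; the function names are mine.

```python
def _rotate(n, x, y, rx, ry):
    """Rotate/flip a quadrant so the sub-curve has the right orientation."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y


def xy_to_hilbert(n, x, y):
    """Position along the curve of cell (x, y) in an n-by-n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d


def hilbert_to_xy(n, d):
    """Inverse mapping: the cell visited at position d along the curve."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = _rotate(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Walking `d` from 0 upward visits every cell exactly once, each step moving to a neighboring cell, and all positions sharing a leading base-4 digit fall in the same quadrant of the grid.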

Given all that, my understanding of the way MemSQL indexes geospatial stuff — specifically points and polygons — is:

Points have numbers assigned to them by the space-filling curve; those are indexed in MemSQL’s usual way. (Skip lists.)
A polygon is represented by its vertices. Take the longest prefix they share. That could be used to index them (you’d retrieve a square region that includes the polygon). But actually …
… a polygon is covered by a union of such special square regions, and indexed accordingly, and I neglected to ask exactly how the covering set of squares was chosen.
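The "longest shared prefix" idea can be sketched as follows. This Python example includes a standard Hilbert index function so that it runs on its own, then finds the smallest curve-aligned square containing all of a polygon's vertices. The grid size, function names and return format are assumptions for illustration; how MemSQL actually picks its covering squares is exactly the part I didn't ask about.

```python
def xy_to_hilbert(n, x, y):
    """Position along a Hilbert curve of cell (x, y) in an n-by-n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def shared_prefix_square(n, vertices):
    """Return (shared base-4 digits, side length, lower-left corner) of the enclosing square."""
    k = n.bit_length() - 1                  # number of base-4 digits in an index
    codes = [xy_to_hilbert(n, x, y) for x, y in vertices]
    digits = k
    while digits > 0:
        shift = 2 * (k - digits)            # drop the trailing digits, keep the prefix
        if len({code >> shift for code in codes}) == 1:
            break
        digits -= 1
    side = n >> digits                      # shorter prefix means a larger square
    x0, y0 = vertices[0]
    return digits, side, (x0 // side * side, y0 // side * side)
```

A triangle tucked into one quadrant gets a tight square, but a tiny shape that straddles the center of the grid shares no prefix at all and gets the whole grid. That weakness is a plausible reason to cover a polygon with a union of smaller squares instead.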

As for company metrics — MemSQL cites >50 customers and >60 employees.

Related links

I’ve posted about earlier versions of MemSQL technology, e.g. in May, 2014, April, 2013 and June, 2012.

Notes on indexes and index-like structures

Curt Monash — Thu, 16 Apr 2015 22:42:59 +0000

Indexes are central to database management.

My first-ever stock analyst report, in 1982, correctly predicted that index-based DBMS would supplant linked-list ones …
… and to this day, if one wants to retrieve a small fraction of a database, indexes are generally the most efficient way to go.
Recently, I’ve had numerous conversations in which indexing strategies played a central role.

Perhaps it’s time for a round-up post on indexing.

1. First, let’s review some basics. Classically:

An index is a DBMS data structure that you probe to discover where to find the data you really want.
Indexes make data retrieval much more selective and hence faster.
While indexes make queries cheaper, they make writes more expensive — because when you write data, you need to update your index as well.
Indexes also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for “zero-DBA” RDBMS installations.)

2. Further:

A DBMS or other system can index data it doesn’t control.
- This is common in the case of text indexing, and not just in public search engines like Google. Performance design might speak against recopying text documents. So might security.
- This capability overlaps with but isn’t exactly the same thing as an “external tables” feature in an RDBMS.
Indexes can be updated in batch mode, rather than real time.
- Most famously, this is why Google invented MapReduce.
- Indeed, in cases where you index external data, it’s almost mandatory.
Indexes written in real-time are often cleaned up in batch, or at least asynchronously with the writes.
- The most famous example is probably the rebalancing of B-trees.
- Append-only index writes call for later clean-up as well.

3. There are numerous short-request RDBMS indexing strategies, with various advantages and drawbacks. But better indexing, as a general rule, does not a major DBMS product make.

The latest example is my former clients at Tokutek, who just got sold to Percona in a presumably small deal — regrettably without having yet paid me all the money I’m owed. (By the way, the press release for that acquisition highlights TokuDB’s advantages in compression much more than it mentions straight performance.)
In a recent conversation with my clients at MemSQL, I basically heard from Nikita Shamgunov that:
- He felt that lockless indexes were essential to scale-out, and to that end …
- … he picked skip lists, not because they were the optimal lockless index, but because they were good enough and a lot easier to implement than the alternatives. (Edit: Actually, see Nikita’s comment below.)
Red-black trees are said to be better than B-trees. But they come up so rarely that I don’t really understand how they work.
solidDB did something cool with Patricia tries years ago. McObject and ScaleDB tried them too. Few people noticed or cared.

I’ll try to explain this paradox below.

4. The analytic RDBMS vendors who arose in the previous decade were generally index-averse. Netezza famously does not use indexes at all. Neither does Vertica, although the columns themselves played some of the role of indexes, especially given the flexibility in their sort orders. Others got by with much less indexing than was common in, for example, Oracle data warehouses.

Some of the reason was indexes’ drawbacks in terms of storage space and administrative overhead. Also, sequential scans can be much faster from spinning disk than more selective retrieval, so table scans often outperformed index-driven retrieval.

5. It is worth remembering that almost any data access method brings back more data than you really need, at least as an intermediate step. For starters, data is usually retrieved in whole pages, whether you need all their contents or not. But some indexing and index-alternative technologies go well beyond that.

To avoid doing true full table scans, Netezza relies on “zone maps”. These are a prominent example of what is now often called data skipping.
Bloom filters in essence hash data into a short string of bits. If there’s a hash collision, excess data is returned.
Geospatial queries often want to return data for regions that have no simple representation in the database. So instead they bring back data for a superset of the desired region, which the DBMS does know how to return.
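Data skipping is simple enough to show in a few lines. The Python sketch below keeps a (min, max) pair per block of a column and skips any block whose range cannot overlap the query. It illustrates the general zone-map idea only; block sizes, names and layout are assumptions, not Netezza's implementation.

```python
def build_zone_map(values, block_size):
    """Record (start offset, min, max) for each fixed-size block of a column."""
    zones = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        zones.append((start, min(block), max(block)))
    return zones


def scan_with_skipping(values, zones, block_size, low, high):
    """Return matching values plus the number of blocks that actually had to be read."""
    matches, blocks_read = [], 0
    for start, zone_min, zone_max in zones:
        if zone_max < low or zone_min > high:
            continue                        # whole block is outside the range: skip it
        blocks_read += 1
        block = values[start:start + block_size]
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_read
```

Note that a block that is read still returns more rows than needed, and those are filtered afterward. How much skipping happens depends heavily on how well the data is clustered on the queried column.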

6. Geospatial indexing is actually one of the examples that gave me the urge to write this post. There are two main geospatial indexing strategies I hear about. One is the R-tree, which basically divides things up into rectangles, rectangles within those rectangles, rectangles within those smaller rectangles, and so on. A query initially brings back the data within a set of rectangles whose union contains the desired region; that intermediate result is then checked row by row for whether it belongs in the final result set.
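The filter-then-check pattern can be sketched without building a full R-tree. The Python example below groups points into leaves with bounding rectangles (a single level only, with no nesting, balancing or insertion logic, so it is a stand-in rather than a real R-tree), retrieves every point in any rectangle that overlaps the query, and then checks each candidate exactly. All names are hypothetical.

```python
import math


def bounding_box(points):
    """Smallest axis-aligned rectangle (min_x, min_y, max_x, max_y) around the points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def build_leaves(points, leaf_size):
    """Group points (sorted by x, then y) into leaves, each with a bounding rectangle."""
    ordered = sorted(points)
    leaves = []
    for i in range(0, len(ordered), leaf_size):
        chunk = ordered[i:i + leaf_size]
        leaves.append((bounding_box(chunk), chunk))
    return leaves


def points_within(leaves, center, radius):
    """Filter by rectangle, then refine each candidate with the exact distance test."""
    cx, cy = center
    query_box = (cx - radius, cy - radius, cx + radius, cy + radius)
    candidates = [p for box, chunk in leaves if boxes_overlap(box, query_box) for p in chunk]
    hits = [p for p in candidates if math.hypot(p[0] - cx, p[1] - cy) <= radius]
    return hits, len(candidates)
```

The gap between the number of candidates and the number of hits is the "excess data" that the intermediate step brings back.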

The other main approach to geospatial indexing is the space-filling curve. The idea behind this form of geospatial indexing is roughly:

For computational purposes, a geographic region is of course a lattice of points rather than a true 2-dimensional continuum.
So you take a lattice — perhaps in the overall shape of a square — and arrange its points in a sequence, so that each point is adjacent in some way to its predecessor.
Then regions on a plane are covered by subsequences (or unions of same).

The idea gets its name because, if you trace a path through the sequence of points, what you get is an approximation to a true space-filling curve.

7. And finally — mature DBMS use multiple indexing strategies. One of the best examples of a DBMS winning largely on the basis of its indexing approach is Sybase IQ, which popularized bitmap indexing. But when last I asked, some years ago, Sybase IQ actually used 9 different kinds of indexing. Oracle surely has yet more. This illustrates that different kinds of indexes are good in different use cases, which in turn suggests obvious reasons why clever indexing rarely gives a great competitive advantage.
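Since bitmap indexing comes up here, a minimal Python sketch may help. It keeps one integer per distinct value, with bit i set when row i has that value, so that multi-column predicates reduce to bitwise AND/OR. This is the general idea only; it is not Sybase IQ's implementation, and the function names are made up.

```python
def build_bitmap_index(rows, column):
    """One integer bitmap per distinct value; bit i is set if row i has that value."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps


def matching_rows(bitmap):
    """Expand a bitmap back into the list of row numbers it selects."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


rows = [
    {"region": "EU", "status": "open"},
    {"region": "US", "status": "open"},
    {"region": "EU", "status": "closed"},
    {"region": "EU", "status": "open"},
]
by_region = build_bitmap_index(rows, "region")
by_status = build_bitmap_index(rows, "status")
eu_and_open = matching_rows(by_region["EU"] & by_status["open"])
```

Bitmaps shine on low-cardinality columns combined in ad hoc predicates, and do much less for high-cardinality lookups, which is one small instance of different indexes suiting different use cases.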

MongoDB is growing up

Curt Monash — Thu, 17 Apr 2014 08:56:09 +0000

I caught up with my clients at MongoDB to discuss the recent MongoDB 2.6, along with some new statements of direction. The biggest takeaway is that the MongoDB product, along with the associated MMS (MongoDB Management Service), is growing up. Aspects include:

An actual automation and management user interface, as opposed to the current management style, which is almost entirely via scripts (except for the monitoring UI).
- That’s scheduled for public beta in May, and general availability later this year.
- It will include some kind of integrated provisioning with VMware, OpenStack, et al.
- One goal is to let you apply database changes, software upgrades, etc. without taking the cluster down.
A reasonable backup strategy.
- A snapshot copy is made of the database.
- A copy of the log is streamed somewhere.
- Periodically — the default seems to be 6 hours — the log is applied to create a new current snapshot.
- For point-in-time recovery, you take the last snapshot prior to the point, and roll forward to the desired point.
A reasonable locking strategy!
- Document-level locking is all-but-promised for MongoDB 2.8.
- That means what it sounds like. (I mention this because sometimes an XML database winds up being one big document, which leads to confusing conversations about what’s going on.)
Security. My eyes glaze over at the details, but several major buzzwords have been checked off.
A general code rewrite to allow for (more) rapid addition of future features.
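The backup scheme in the list above (snapshot plus streamed log, with roll-forward for point-in-time recovery) can be sketched in Python as follows. The data layout, with (timestamp, key, value) log entries and None marking a delete, is an assumption for the example and not MongoDB's actual oplog format.

```python
def apply_log(snapshot, log, up_to):
    """Replay (timestamp, key, value) entries with timestamp <= up_to onto a copy."""
    state = dict(snapshot)
    for timestamp, key, value in log:
        if timestamp > up_to:
            break
        if value is None:
            state.pop(key, None)        # None marks a delete in this toy log
        else:
            state[key] = value
    return state


def restore(snapshots, log, point_in_time):
    """Take the last snapshot at or before the point, then roll the log forward to it.

    snapshots: list of (timestamp, state); log: entries sorted by timestamp.
    """
    base_time, base_state = max(
        (s for s in snapshots if s[0] <= point_in_time), key=lambda s: s[0]
    )
    tail = [entry for entry in log if entry[0] > base_time]
    return apply_log(base_state, tail, point_in_time)
```

Periodically applying the log to produce a fresh snapshot keeps the amount of log that must be replayed during a restore bounded.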

Of course, when a DBMS vendor rewrites its code, that’s a multi-year process. (I think of it at Oracle as spanning 6 years and 2 main-number releases.) With that caveat, the MongoDB rewrite story is something like:

Updating has been reworked. Most of the benefits are coming later.
Query optimization and execution have been reworked. Most of the benefits are coming later, except that …
… you can now directly filter on multiple indexes in one query; previously you could only simulate doing that by pre-building a compound index.
One of those future benefits is more index types, for example R-trees or inverted lists.
Concurrency improvements are down the road.
So are rewrites of the storage layer, including the introduction of compression.
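Filtering on multiple indexes in one query amounts to looking up each condition in its own single-field index and intersecting the resulting sets of document ids. The Python sketch below shows that idea in the abstract; it is not MongoDB's query engine, and `build_index` and `find_with_intersection` are hypothetical names.

```python
def build_index(docs, field):
    """Single-field index: value -> set of document ids."""
    index = {}
    for doc_id, doc in enumerate(docs):
        index.setdefault(doc[field], set()).add(doc_id)
    return index


def find_with_intersection(docs, indexes, criteria):
    """Look up each criterion in its own index, then intersect the id sets."""
    id_sets = [indexes[field].get(value, set()) for field, value in criteria.items()]
    matching = set.intersection(*id_sets) if id_sets else set(range(len(docs)))
    return [docs[i] for i in sorted(matching)]
```

A pre-built compound index on both fields gets the same answer with a single lookup, at the cost of having to anticipate the combination in advance.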

Also, you can now straightforwardly transform data in a MongoDB database and write it into new datasets, something that evidently wasn’t easy to do before.

One thing that MongoDB is not doing is offering any ODBC/JDBC or other SQL interfaces. Rather, there’s some other API — I don’t know the details — whereby business intelligence tools or other systems can extract views, and a few BI vendors evidently are doing just that. In particular, MicroStrategy and QlikView were named, as well as a couple of open source usual-suspects.

As of 2.6, MongoDB seems to have a basic integrated text search capability — which however does not rise to the search functionality level that was in Oracle 7.3.2. In particular:

15 Western languages are supported with stopwords, tokenization, etc.
Search predicates can be mixed into MongoDB queries.
The search language isn’t very rich; for example, it lacks WHERE NEAR semantics.
You can’t tweak the lexicon yourself.

And finally, some business and pricing notes:

Two big aspects of the paid-versus-free version of MongoDB (the product line) are:
- Security.
- Management tools.
Well, actually, you can get the management tools for free, but only on a SaaS basis from MongoDB (the company).
- If you want them on premises or in your part of the cloud, you need to pay.
- If you want MongoDB (the company) to maintain your backups for you, you need to pay.
Customer counts include:
- At least 1000 or so subscribers (counting by organization).
- Over 500 (additional?) customers for remote backup.
- 30 of the Fortune 100.

And finally, MongoDB did something many companies should, which is aggregate user success stories for which they may not be allowed to publish full details. Tidbits include:

Over 100 organizations run clusters with more than 100 nodes. Some clusters exceed 1,000 nodes.

Many clusters deliver hundreds of thousands of operations per second (combined read and write).

MongoDB clusters routinely store hundreds of terabytes, and some store multiple petabytes of data. Over 150 clusters exceed 1 billion documents in size. Many manage more than 100 billion documents.

The Hemisphere program

Curt Monash — Tue, 03 Sep 2013 08:04:52 +0000

Another surveillance slide deck has emerged, as reported by the New York Times and other media outlets. This one is for the Hemisphere program, which apparently:

Stores CDRs (Call Detail Records), many or all of which are collected via …
… some kind of back door into the AT&T switches that many carriers use. (See Slide 2.)
Has also included “subscriber information” for AT&T phones since July, 2012.
Contains “long distance and international” CDRs back to 1987.
Currently adds 4 billion CDRs per day.
Is administered by a Federal drug-related law enforcement agency but …
… is used to combat many non-drug-related crimes as well. (See Slides 21-26.)

Other notes include:

The agencies specifically mentioned on Slide 16 as making numerous Hemisphere requests are the DEA (Drug Enforcement Administration) and DHS (Department of Homeland Security).
“Roaming” data giving city/state is mentioned in the deck, but more precise geo-targeting is not.

I’ve never gotten a single consistent figure, but typical CDR size seems to be in the 100s of bytes range. So I conjecture that Project Hemisphere spawned one of the first petabyte-scale databases ever.
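Here is the back-of-the-envelope version of that conjecture in Python. The 4 billion CDRs/day figure is from the deck. The 200 bytes per CDR is purely my stand-in for "100s of bytes", and the calculation assumes a constant collection rate, which is not known to be true.

```python
CDRS_PER_DAY = 4_000_000_000       # current rate cited in the slide deck
BYTES_PER_CDR = 200                # assumption: "100s of bytes" taken as 200

bytes_per_day = CDRS_PER_DAY * BYTES_PER_CDR
terabytes_per_year = bytes_per_day * 365 / 1e12       # decimal terabytes
years_to_petabyte = 1e15 / (bytes_per_day * 365)      # 1 PB taken as 10**15 bytes
```

At those assumed numbers the raw CDRs come to roughly 0.8 terabytes per day, a bit under 300 terabytes per year, and so about three and a half years of collection to reach a petabyte before any indexing or replication overhead.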

Hemisphere Project unknowns start:

Is that “back door into AT&T switches” inference really reliable? (I’m basing it on just a few words in the deck, and such decks can have inaccuracies in them.)
Just which calls’ metadata is currently being collected?
How long has this approximate rate of CDR collection been going on; can we just extrapolate back from the current 4 billion calls/day?

It seems that a primary use case for Project Hemisphere is to guess what phone numbers baddies are using, especially those of disposable “burner” cell phones that are otherwise very hard to trace. (The key benefit mentioned to such analysis is that those new phones can then be tapped.) There aren’t many details as to how the phone numbers are inferred, but since almost nothing is initially known about the target phone numbers except calling patterns, those are surely a huge part of the puzzle. In particular, it doesn’t seem to have been disclosed which other databases, if any, are linked into the analysis. There is no hint in the deck that the Hemisphere program directly collects telephone call contents. Rather, it’s used to help determine which telephone numbers to tap.

The government apparently trains its people to keep Hemisphere secret, to the point of lying about it, even though Slide 2 states that Hemisphere is “an unclassified program”.

Slides 8-12 generally emphasize the Hemisphere program’s secrecy.
Slide 10 seems to advocate outright deception. Specifically — and this is both complicated and ironic — it seems to say that the government should get subpoenas for information it already had without subpoena, so that those subpoenas can be the claimed source of the information when applying for yet other subpoenas.

So it seems as if Hemisphere is yet another example of the pattern:

The US government has long lied about how far it invades privacy …
… and about the assistance it receives from the telecom/technology industry in doing so.
Little tangible harm has been done by those invasions, except to those who clearly deserved it.

Up to a point, this is reassuring. But it still bodes badly for a future in which there are many more ways surveillance can be used to hurt us than were possible before.

Hortonworks business notes

Curt Monash — Sat, 24 Aug 2013 11:07:53 +0000

Hortonworks did a business-oriented round of outreach, talking with at least Derrick Harris and me. Notes from my call — for which Rob Bearden didn’t bother showing up — include, in no particular order:

Hortonworks denies advanced acquisition discussions with either Microsoft or Intel. Of course, that doesn’t exactly contradict the widespread story of Intel having made an acquisition offer. Edit: I have subsequently heard, very credibly, that the denial was untrue.
As vendors usually do, Hortonworks denies the extreme forms of Cloudera’s suggestion that Hortonworks’ competitive wins relate to price slashing. But Hortonworks does believe that its license fees often wind up being lower than Cloudera’s, due especially to Hortonworks offering fewer extra-charge items than Cloudera.
Hortonworks used a figure of ~75 subscription customers. Edit: That figure turns out in retrospect to have been inflated. This does not include OEM sales through, for example, Teradata, Microsoft Azure, or Rackspace. However, that does include …
… a small number of installations hosted in the cloud — e.g. ~2 on Amazon Web Services — or otherwise remotely. Also, testing in the cloud seems to be fairly frequent, and the cloud can also be a source of data ingested into Hadoop.
Since Hortonworks a couple of times made it seem that Rackspace was an important partner, behind only Teradata and Microsoft, I finally asked why. Answers boiled down to a Rackspace Hadoop-as-a-service offering, plus joint work to improve Hadoop-on-OpenStack.
Other Hortonworks reseller partners seem more important in terms of helping customers consume HDP (Hortonworks Data Platform), rather than for actually doing Hortonworks’ selling for it. (This is unsurprising — channel sales rarely are a path to success for a product that is also appropriately sold by a direct force.)
Hortonworks listed its major industry sectors as:
- Web and retailing, which it identifies as one thing.
- Media.
- Telecommunications.
- Health care (various subsectors).
- Financial services, which it called “competitive” in the kind of tone that usually signifies “we lose a lot more than we win, and would love to change that”.

In Hortonworks’ view, Hadoop adopters typically start with a specific use case around a new type of data, such as clickstream, sensor, server log, geolocation, or social.

These use cases can be any of a true new application, an enhancement to an existing application, or a general investigative analytics environment.
This adoption is typically driven by a line-of-business group, but IT is a key influencer, and IT usually winds up running the project.
Overall, this accounts for 70% of Hortonworks’ business by some metric.

The other 30% Hortonworks sees is efficiency-oriented — i.e., a cheaper way to store and/or process data.

Hortonworks assigns ELT (Extract/Load/Transform) to this group. Based in part on a subsequent conversation with Cloudera, I gather that batch ELT offload — especially but not only from large Teradata installations — is a significant fraction of the total.
“Data lake” and similar buzzwords fall into this group, as does “re-architecting”.
Hortonworks asserts that adopters from the 70% rapidly move to this kind of use as well, while Teradata customers typically start out in this part.
Unsurprisingly, this part is IT all the way.

One customer apparently estimates its fully burdened Hadoop costs at $900/terabyte/year.

Edit: I followed up on these efficiency-oriented use cases in a conversation with Cloudera.

And finally: One of my favorite things to ask is “When you win, why do you win?” (at least when I think the vendor won’t just reiterate their core marketing messages). Hortonworks gave a great, threefold answer:

Its relationships with Teradata, Microsoft, et al.
Its promise that it can get specific customer-requested features into Apache Hadoop on a specific timeframe. (Yes, the Contribution Olympics are still with us.)
Its claim of greater experience with truly huge clusters — not just Yahoo, but I don’t know who its other examples are.

Related link

A few weeks ago, I talked with Hortonworks at length about technology and other subjects.

Analytic application themes

Curt Monash — Thu, 25 Apr 2013 08:41:59 +0000

I talk with a lot of companies, and repeatedly hear some of the same application themes. This post is my attempt to collect some of those ideas in one place.

1. So far, the buzzword of the year is “real-time analytics”, generally with “operational” or “big data” included as well. I hear variants of that positioning from NewSQL vendors (e.g. MemSQL), NoSQL vendors (e.g. AeroSpike), BI stack vendors (e.g. Platfora), application-stack vendors (e.g. WibiData), log analysis vendors (led by Splunk), data management vendors (e.g. Cloudera), and of course the CEP industry.

Yeah, yeah, I know — not all the named companies are in exactly the right market category. But that’s hard to avoid.

Why this gold rush? On the demand side, there’s a real or imagined need for speed. On the supply side, I’d say:

There are vast numbers of companies offering data-management-related technology. They need ways to differentiate.
Doing analytics at short-request speeds is an obvious data-management-related challenge, and not yet comprehensively addressed.

2. More generally, most of the applications I hear about are analytic, or have a strong analytic aspect. The three biggest areas — and these overlap — are:

Customer interaction
Network and sensor monitoring
Game and mobile application back-ends

Also arising fairly frequently are:

Algorithmic trading
Anti-fraud
Risk measurement
Law enforcement/national security
Healthcare
Stakeholder-facing analytics

I’m hearing less about quality, defect tracking, and equipment maintenance than I used to, but those application areas have anyway been ebbing and flowing for decades.

3. Much of customer interaction revolves around recommendation and personalization. In connection with that I’ll remind you:

Multiple sources say that 5 millisecond response is a real need. Srini Srinivasan explained why in a January comment.
The results of the recommendation and personalization can be delivered in many different ways — product recommendations, ads, special offers, email, snail mail, call center scripts and more. This is the paradigmatic example for my skepticism about complete analytic applications.

4. Networks and sensors emit the epitome of machine-generated data. Data sources include web logs, network logs (in the IT sense), telecommunication networks, other utilities (e.g. electric), vehicle fleets, and more. Application themes include:

Human monitoring, via some kind of real-time business intelligence view. I hear about that a lot.
Various kinds of automated response. (Security is an obvious example.)
Integration with other kinds of application, data source, or use case.

As one example of the last point, Oliver Ratzesberger told me years ago that eBay had up-to-the-minute BI cubes integrating customer response and log data, for the purpose of quickly detecting technology problems. Acunu recently told me that similar applications are one of their sales focuses.

5. In another example, games and mobile applications can be a lot like websites in terms of the analytics that support them (all the more so if we’re talking about games with in-app purchases). Two special features come up repeatedly, however — leaderboards for games, and geospatial data sent by mobile devices.

6. Algorithmic trading is flashy because of the sums of money involved, and because of what is often hyper-low latency; I’ve even heard 50 microseconds, and that’s a slightly out of date figure for a sequence of several atomic operations. But otherwise it’s not one of the more interesting areas to me, for at least two reasons:

It depends on a lot of latency-specific stuff, such as hand-crafted hardware.
The participants are secretive (understandably so, as they’re literally in a race with each other) and don’t reveal much.

Another reason I don’t study it much is that high-frequency trading could be devastated at any time by some simple regulatory changes.

7. I finally figured out one of the big drivers for better risk analysis. Banks need to keep capital lying around to cover a fraction of the risk they take on. If they can estimate the risk more precisely, and come up with a lower number, then they need to keep less capital. That’s a lot like finding large bags of money.

8. Anti-fraud applications arise in many industries, with many different kinds of data and latency requirements. For example:

Insurers don’t want to pay bogus claims. They usually have weeks to think about that problem.
Telcos don’t want to provision services for customers who will defraud them. They have to decide at call-center speed.
Similarly, retailers don’t want to accept bogus returns.
Stockbrokers don’t want rogue traders to defeat their controls. A lot of data and analysis go into that mission, as billions of dollars — literally — can be at stake.

9. And finally, the recent Boston Marathon bombing has brought law-enforcement/anti-terrorism applications to the fore. The Boston Globe criticized difficulties in information sharing, but the money quote is:

The FBI followed up by checking government databases and looking for things such as “derogatory telephone communications, possible use of online sites associated with the promotion of radical activity, associations with other persons of interest, travel history and plans, and education history,” according to FBI Supervisory Agent Jason J. Pack. “The FBI also interviewed Tamerlan Tsarnaev and family members. The FBI did not find any terrorism activity.”

Neither the telephone intercept nor the web-surfing tracking is a capability the government routinely admits, unless there was something like a wiretap order that I so far haven’t seen reported.

Related links

Government surveillance is even more inevitable than when I wrote in 2010 that freedom can only be preserved by limiting government USES of data.
Stakeholder-facing analytics isn’t much better understood than when I wrote about it in 2010.
I wrote up a different list of analytic use cases back in 2006.
The continued drop in high-frequency trading latency strengthens my 2009 contrast between the speed of a turtle and the speed of light; we’re now over a 3 * 10^10 difference between the speed of trading and the speed of generic planning, and many turtles walk well faster than 1 cm/sec.

One database to rule them all?

Curt Monash — Thu, 21 Feb 2013 05:52:05 +0000

Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times.

Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:

Every level of storage (disk, RAM, etc.).
Indexes, aggregates and raw data alike.

To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to:

Big blocks of data compress better, and can also be faster to retrieve than a number of smaller blocks holding the same amount of data. Small blocks of data can be less wasteful to write. And different kinds of storage have different minimum block sizes.
Operating on compressed data offers multiple significant efficiencies. But you have to spend cycles (de)compressing it, and it’s only practical for some compression schemes.
Fixed-length tabular records can let you compute addresses rather than looking them up in indexes. Yay! But they also waste space.
Tokenization can help with the fixed-/variable-length tradeoff.
Pointers are wonderfully efficient for some queries, at least if you’re not using spinning disk. But they can create considerable overhead to write and update.
Indexes, materialized views, etc. speed query performance, but can be costly to write and maintain.
Storing something as a BLOB (Binary Large OBject), key-value payload, etc. is super-fast — but if you want to look at it, you usually have to pay for retrieving the whole thing.
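
Two of the tradeoffs above, fixed-length records and tokenization, are easy to illustrate. What follows is only a minimal sketch in Python, not any particular DBMS’s storage format; the record layout and the names RECORD, read_row, and tokenize are all invented for the example.

```python
import struct

# Hypothetical fixed-length record: 4-byte int id, 8-byte float amount,
# 2-byte token standing in for a variable-length string. "<" disables padding.
RECORD = struct.Struct("<idH")

def pack_table(rows):
    """Concatenate fixed-length records into one byte string."""
    return b"".join(RECORD.pack(*row) for row in rows)

def read_row(table, n):
    """Fetch row n by computing its byte offset; no index lookup is needed."""
    return RECORD.unpack_from(table, n * RECORD.size)

def tokenize(values):
    """Replace variable-length strings with small fixed-size integer tokens."""
    dictionary = {}
    tokens = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return tokens, dictionary

cities = ["Boston", "San Francisco", "Boston", "Austin"]
tokens, dictionary = tokenize(cities)
rows = [(i, 10.0 * i, t) for i, t in enumerate(tokens)]
table = pack_table(rows)

print(RECORD.size, read_row(table, 2), dictionary)
```

The address arithmetic works only because every record is the same size. The tokens are what let a variable-length city name fit in a fixed two-byte slot, at the price of maintaining the dictionary on every write of a new value.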

What’s more, different data layouts can have different implications for logging, locking, replication, backup and more.

So what would happen if somebody tried to bundle all conceivable functionality into a single DBMS, with a plan to optimize the layout of any particular part of the database as appropriate? I think the outcome would be tears — for the development effort would be huge, while the benefits would be scanty. The most optimistic cost estimates could run in the 100s of millions of dollars, with more realistic ones adding a further order of magnitude. But no matter what the investment, the architects would be on the horns of a nasty dilemma:

If there’s much commonality among the component DBMS, each one would be sub-optimal.
If there’s little commonality among them, then there’s also little benefit to the combination.

Adding insult to injury, all the generality would make it hard to select optimum hardware for this glorious DBMS — unless, of course, a whole other level of development effort made it work well across very heterogeneous clusters.

Less megalomaniacally, there have been many attempts to combine two or more alternate data layouts in a single DBMS, with varying degrees of success. In the relational-first world:

Analytic DBMS have combined row and column data models so fluidly that I’ve made fun of Oracle for not being able to pull it off. SAP HANA sort of does the same thing, but perhaps with a columnar bias, and not just for analytics.
Relational DBMS can also have a variety of index types, suitable for different relational use cases. This is especially true for analytic uses of general-purpose RDBMS.
Oracle, DB2, PostgreSQL, and Informix have had full extensibility architectures since the 1990s. That said:
- Almost all the extensions come from the DBMS vendors themselves.
- Extensions that resemble (or are) a tabular datatype — for example geospatial or financial-date — are often technically well-regarded.
- Others are usually not so strong technically, but in a few cases sell well anyway (e.g. Oracle Text).
- While Microsoft never went to the trouble of offering full extensibility, otherwise the SQL Server story is similar.
- Sybase’s extensibility projects went badly in the 1990s, and Sybase doesn’t seem to have tried hard in that area since.
IBM DB2, Microsoft SQL Server, and Oracle added XML capabilities around the middle of the last decade.
Analytic platforms can wind up with all sorts of temporary data structures.
Analytic DBMS have various ways to reach out and touch Hadoop.

Further:

Non-relational DBMS commonly have indexes that at least support relational-like SELECTs. JOINs can be more problematic, but MarkLogic finally has them. Tokutek even offers a 3rd-party indexing option for MongoDB.
Hadoop is growing into what is in effect a family of DBMS and other data stores: generic HDFS, HBase, generic Hive, Impala, and so on. At the moment, however, none of them is very mature. BDAS/Spark/Shark ups the ante further, but of course that’s less mature yet.
Hadapt combines Hadoop and PostgreSQL.
DataStax combines Cassandra, Hadoop, and Solr.
Akiban fondly thinks its data layouts are well-suited for relational tables, JSON, and XML alike. (But business at Akiban may be in flux.)
GenieDB (Version 1 only) and NuoDB are both implemented over key-value stores. GenieDB Version 2 is implemented over Berkeley DB or MySQL.
Membase/Couchbase was first implemented over SQLite, then over (a forked version of) CouchDB.

Related links

A taxonomy of database use cases (July, 2012)
An early form of this discussion in the single domain of analytic RDBMS (February, 2009)

Memory-centric data management when locality matters

Curt Monash — Mon, 16 Jul 2012 01:13:40 +0000

Ron Pressler of Parallel Universe/SpaceBase pinged me about a data grid product he was open sourcing, called Galaxy. The idea is that a distributed RAM grid will allocate data, not randomly or via consistent hashing, but rather via a locality-sensitive approach. Notes include:

The original technology was developed to track moving objects on behalf of the Israeli Air Force.
The commercial product is focused on MMO (Massively MultiPlayer Online) games (or virtual worlds).
The underpinnings are being open sourced.
Ron suggests that, among other use cases, Galaxy might work well for graphs.
Ron argues that one benefit is that when lots of things cluster together — e.g. characters in a game — there’s a natural way to split them elastically (shrink the radius for proximity).
The design philosophy seems to be to adapt as many ideas as possible from the way CPUs manage (multiple levels of) RAM cache.
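
To make the contrast with hashing concrete, here is a rough sketch in Python. It is not Galaxy’s actual algorithm, which I haven’t seen described at this level; the node count, cell size, and function names are all assumptions made up for illustration. Objects are placed by grid cell rather than by hashed ID, and shrinking the cell size stands in for “shrink the radius”.

```python
import hashlib

NODES = 4      # hypothetical number of grid nodes
CELL = 100.0   # hypothetical cell width, in world units

def hash_node(object_id):
    """Placement by hashing the object id; position is ignored entirely."""
    digest = hashlib.md5(object_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NODES

def locality_node(x, y, cell=CELL):
    """Placement by grid cell, so objects in the same cell share a node."""
    cell_x, cell_y = int(x // cell), int(y // cell)
    return (cell_x + 2 * cell_y) % NODES

# Three characters standing close together in a game world.
crowd = {"elf": (10.0, 10.0), "orc": (12.0, 15.0), "mage": (20.0, 5.0)}

by_hash = {name: hash_node(name) for name in crowd}
by_cell = {name: locality_node(x, y) for name, (x, y) in crowd.items()}
# If the crowd overloads its node, use smaller cells to split it up.
by_small_cell = {name: locality_node(x, y, cell=10.0)
                 for name, (x, y) in crowd.items()}

print(by_hash, by_cell, by_small_cell)
```

With the large cell all three characters land on one node, so interactions among them stay local. With the small cell they spread over two nodes, which is the elastic split in miniature.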

The whole thing is discussed in considerable detail in a blog post and especially in a Hacker News comment thread. There’s also an error-riddled TechCrunch article.

In the areas I cover, “error-riddled TechCrunch article” is pretty much a redundant phrase — but that post looked particularly bad.

Meanwhile, I just noticed a May, 2009 blog post out of Progress Apama. The idea was that event streaming technology could be used to track moving objects, something I heard directly from the CEP (Complex Event Processing) vendors in the 2007 – 2009 period as well.

My tentative opinions on all this start:

Locality is really important for graphs. Random partitioning is crazy if there’s a locality-friendly alternative.
Ron plays different MMOs than I do. That said, the real market would more likely be new games than existing ones. And Guild Wars 2 (for example) is showing the way to gathering many characters together in a small game area.
It’s easy to conceive of cases in which there’s so much specific information about moving objects’ locations that you have to throw much of it away, rather than persisting it all. That speaks for memory-centric technology in general, and data reduction in particular (in the CEP sense of “data reduction”, not the statistics meaning).
Sensor and scientific data often have strong locality.
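
The graph point can be seen in a toy case. The Python sketch below uses a made-up six-vertex graph, with an odd/even split standing in for locality-blind partitioning. It counts the edges that cross partitions, each of which would mean a network hop during traversal.

```python
def cut_edges(edges, assignment):
    """Count edges whose two endpoints live on different partitions."""
    return sum(1 for a, b in edges if assignment[a] != assignment[b])

# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by one bridge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Locality-blind split: odd and even vertex ids go to different partitions.
by_parity = {v: v % 2 for v in range(6)}
# Locality-friendly split: each triangle stays together.
by_cluster = {v: 0 if v < 3 else 1 for v in range(6)}

print(cut_edges(edges, by_parity), cut_edges(edges, by_cluster))
```

Here the blind split cuts 5 of the 7 edges, while the cluster-aware split cuts only the bridge.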

Related link

I’ve written a fair amount recently about graph data management, although I haven’t tackled the partitioning issue head-on.