Data types – DBMS 2 : DataBase Management System Services

How to beat “fake news”

Curt Monash — Thu, 27 Jun 2019 10:42:50 +0000

Most observers hold several or all of the views:

“Fake news” and the like are severe problems.
Algorithmic solutions have not worked well to date.
Neither have manual ones.
Trusting governments to censor is a bad idea.
In light of the previous points, trusting large social media corporations to censor is a bad idea too.
Educating consumers to evaluate news and opinions accurately would be … difficult.

And further:

Whatever you think of the job traditional journalistic organizations previously did as news arbiters, they can’t do it as well anymore, for a variety of economic, structural and societal reasons.

But despite all those difficulties, I also believe that a good solution to news/opinion filtering is feasible; it just can’t be as simple as everybody would like.

1. When people think about these problems, they’re probably most focused on social media platforms such as Facebook, YouTube or Twitter. But before getting to those, let’s consider the simpler case of search engines. In essence, what search engines do is:

Assign a relevance score to the relationship between a site (or particular page) and your query.
Assign a quality score to a site.
Combine those two scores into an overall ranking, and serve up results accordingly.

How well does this work? I’d say that search engines:

Are good at directing you to information that is generally related to what you want to know. This is the technological core of what they do.
Are good at shielding you from the worst cheating/spammer/hacker sites. That’s also a major technical focus for them.
Are poor but not horrible at distinguishing between good and bad sources of information, opinion or advice. Generally, they do this via some kind of popularity contest, whether via Google’s venerable PageRank or by more directly observing which sites users seem to go to and stay at.
Don’t even try to filter sites according to leanings such as political bias.

Lessons from that start:

Huge technology companies can actually do pretty well at the parts of the problem that technology alone can solve.
A lot of the challenge boils down to adversarial information retrieval, where the adversaries range from somewhat honest polemicists or hucksters to completely awful hackers and spammers.

2. When defending against bad actors, scale helps a lot. In my favorite example:

A significant fraction of all the world’s email goes through Gmail.
Thus, it is very hard for email spam blasts to escape Google’s honeypots.
Informed by those honeypots, Google does what in my opinion is a very good job of fighting spam.

Similarly, as the publisher of multiple blogs, I can tell you that much the same is true of WordPress’ Akismet’s fight against spam comments. Akismet isn’t perfect; indeed, I’ve stopped adding new content to the blog where this post would fit best — Text Technologies – because of a multi-year spam attack. But on the whole Akismet works very well.

Thus, in contradiction to many observers, I believe that the huge scale of social media companies is NOT the root of the problem.

3. Of course, concern is really focused on social media, and especially on the concern that people communicate things they (supposedly) shouldn’t, where:

“Communicating” includes words, pictures, videos, etc.
“Shouldn’t” covers outright lies, great factual distortions, hate speech … or just opinions that the would-be censor doesn’t think should be spread.

And even if you don’t worry so much about those problems, some kind of censorship, filtering or gatekeeping is inevitable anyway, simply because there’s vastly more information in the world than any one person can consume.

So what are the main options for censorship and other gatekeeping? My opinions start:

Having governments be in charge of censorship is a terrible idea.
Having large, non-journalist corporations be directly in charge of censorship is also a bad idea, because ultimately they’ll just succumb to government or other political pressure.
The traditional modern gatekeepers are journalistic organizations, who both deserved and received trust that they’d do the job responsibly. But that model no longer suffices in its old form, for several sets of reasons:
- New requirements. Traditional journalistic gatekeeping boils down to organizations vetting the content they themselves produce. It doesn’t work nearly as well for third-party content, worthy efforts such as fact-checking columns notwithstanding.
- Trust. Walter Cronkite is long dead; journalists aren’t nearly as trusted as they used to be.
- Deliberate bias. Opinion and bias are now part of many “news” organizations’ business models, to a much greater extent than they were a few decades in the past.
- Money. Most journalistic organizations have much slimmer news budgets than they had at their peak.

4. So if we need gatekeeping, and no natural kind of gatekeeper can on its own be effective or safe, what’s left? In simplest terms, we need gatekeeping by (technological) committee. Mainly, what I propose comprises:

Different kinds of gatekeeper for different aspects of the problem, including at a minimum:
- Human-led filters to deal with various issues in credibility and bias.
- Technology-led filters to deal with pure fakery and false provenance.
- Further filtering of the kinds that would be needed even in a more benign world.
Multiple choices for at least the human-led filters.
Good, simple (!) user interfaces for combining those filters’ results.

Above all, people must be able to choose their own censors.

5. What I envision for the “human-led filters to deal with various issues in credibility and bias” is something like:

An organization maintains slowly-changing whitelists and blacklists of information sources.
The same organization fact-checks or other vets specific stories, claims and content in near-real-time.
The results scale to other stories, claims and content via very-rapidly-retrained machine learning models, whether those are based on a single gatekeeper’s hand editing or, more likely, on multiple hand-edited training sets and other collaborative inputs at once.

Here an “organization” can be anything trusted by enough people to be economically viable, for example:

An offshoot of an existing journalistic organization.
An offshoot of an existing political party or advocacy organization.
A successfully started-up new entity.

Obviously, there would be business issues, notably:

Costs. Who pays, in an economy where news is commonly “free”?
Chicken-egg adoption. Which gets developed first: Human-led filtering services that can’t yet be integrated into actual social media filtering, or technology to integrate human-led filtering services that don’t yet exist?

But given the importance and visibility of the problem, optimism about solving the business issues is appropriate. The hardest part is the technology itself. Can machine learning models be retrained on a sub-hour or even sub-minute basis? Sure. That’s been confirmed many times. But what I’m suggesting is a pretty complex case, with global scale, intermediate results passed among organizations, with plenty of adversarial elements, all done at very high speed.

That is not yet a solved problem. But it certainly seems solvable. Further, it’s a problem that must be solved, lest liberal democracy be as doomed as some people fear it actually is.

Related links

I wrote a bit about adversarial analytics in May, 2016.
I outlined my views about the “War(s) on Truth” in February, 2018.
Earlier this month, Cory Doctorow offered a hard-hitting column on the dangers of expecting internet companies to do our censorship for us.
More sedately but with more explanation, so did Will Oremus.

New legal limits on surveillance in the US

Curt Monash — Mon, 25 Jun 2018 19:55:21 +0000

The United States has new legal limits on electronic surveillance, both in one specific way and — more important — in prevailing judicial theory. This falls far short of the protections we ultimately need, but it’s a welcome development even so.

The recent Supreme Court case Carpenter v. United States is a big deal. Let me start by saying:

Most fundamentally, the Carpenter decision was based on and implicitly reaffirms the Katz test.* This is good.
The Carpenter decision undermines the third-party doctrine.** This is great. Strict adherence to the third-party doctrine would eventually have given the government unlimited rights of Orwellian surveillance.
The Carpenter decision suggests the Court has adopted an equilibrium-adjustment approach to Fourth Amendment jurisprudence.
- The “equilibrium” being maintained here is the balance between governmental rights to intrude on privacy and citizens’ rights not to be intruded on.
- e., equilibrium-adjustment is a commitment to maintaining approximately the same level of liberty (with respect to surveillance) we’ve had all along.
- I got the equilibrium-adjustment point from Eugene Volokh’s excellent overview of the Carpenter decision.

*The Katz test basically says that that an individual’s right to privacy is whatever society regards as a reasonable expectation of privacy at that time.

**The third-party doctrine basically says that any information of yours given voluntarily to a third party isn’t private. This includes transactional information such as purchases or telephone call detail records (CDRs)

Key specifics include:

The actual issue in Carpenter is whether the government needs a warrant to access cell phone location data that allows a cell phone user’s movements to be quite accurately tracked. The decision on that issue was Yes.
This was a 5-4 decision. Chief Justice Roberts and the four liberal justices voted for it. Swing voter Justice Kennedy and the other three conservative justices voted against it.
The majority was united. Justice Roberts wrote the decision and there were no concurrences.
The dissents were not united, but did generally focus on two kinds of arguments:
- Reliance on the third-party doctrine, commonly expressed as reliance on the Smith and Miller precedents.
- Disagreement with the Katz test.

Also very relevant was the 2012 case requiring warrants for GPS tracking, United States v Jones. But discussion of the Jones decision is confusing, because while some justices at the time addressed the issue of general electronic tracking of a person’s movements, others focused narrowly on the physical action of installing the GPS device.

Unfortunately, all the good news in Carpenter notwithstanding, the decision doesn’t come close to accomplishing as much as we need. I stand by my oft-repeated observations:

Massive surveillance is inevitable.

Unless the uses of the resulting information are VERY limited, freedoms will be chilled into oblivion.

Justice Roberts correctly wrote in the Carpenter decision:

Mapping a cell phone’s location … provides an intimate window into a person’s life, revealing not only his particular movements, but through them his “familial, political, professional, religious, and sexual associations.”

Justice Kennedy however rejoindered:

What persons purchase and to whom they talk might disclose how much money they make; the political and religious organizations to which they donate; whether they have visited a psychiatrist, plastic surgeon, abortion clinic, or AIDS treatment center; whether they go to gay bars or straight ones; and who are their closest friends and family members.

His point, also correct, was that the data that police are allowed to get without warrants is even more privacy-violating than the data Carpenter keeps away from them. And so, as good as the Carpenter decision apparently is, privacy invasion and surveillance are still among the gravest threats to liberty that we face.

Related links

My January post on the chaotic politics of privacy is relevant both to the content of the Carpenter decision and to the fact that the decision-makers did not split perfectly along traditional partisan lines.
My 2013 series on privacy theory suggests that gaps in judicial reasoning could be filled by referencing the problem of chilling effects.

Brittleness, Murphy’s Law, and single-impetus failures

Curt Monash — Wed, 20 Jun 2018 09:15:44 +0000

In my initial post on brittleness I suggested that a typical process is:

Build something brittle.
Strengthen it over time.

In many engineering scenarios, a fuller description could be:

Design something that works in the base cases.
Anticipate edge cases and sources of error, and design for them too.
Implement the design.
Discover which edge cases and error sources you failed to consider.
Improve your product to handle them too.
Repeat as needed.

So it’s necesseary to understand what is or isn’t likely to go wrong. Unfortunately, that need isn’t always met.

Murphy’s Law and exaggerated fears

We should always bear in mind Murphy’s Law, which in its simplest form states: Anything that can go wrong, will. But also remember that Murphy’s Law is a joke; and even if it were serious, nothing concise is ever precise.

People who tend to over-believe in Murphy’s Law include but are hardly limited to:

Bureaucrats.
Worried parents, especially of only children. (Later kids tend to have it easier, as their parents have more experience.)
Any buyer or voter you believe has been over-persuaded toward fear, uncertainty and doubt.
Relational bigots who view the Ted Codd guarantee as an absolute requirement for data management.

Adversaries

The strongest scenarios for Murphy’s Law should be adversarial ones, in which somebody is actively trying to cause problems. But even there it doesn’t always apply. For example:

Information security commonly fits the Murphy model. Hackers keep outwitting defenders.
Email spam, however, does not. It’s pretty much of a solved problem; the few spam emails that still get through hardly matter.
Web search is somewhere in between. Both sides are partially successful in the combat over adversarial information retrieval, as “good” and “bad” sites alike are both well-represented in search results.

Single-impetus failures

Since bad or scary things will happen — Murphy’s Law isn’t entirely wrong — a standard design practice is to avoid single points of failure. Brittleness has a lot to do with which single points of failure have been overlooked; improvement has a lot to do with belatedly cleaning them up. In adversarial scenarios, avoiding single points of failure relates closely to defense in depth.

Some of the nastiest surprises occur when failures have no obvious single point, yet wind up being possible from a single impetus.* This happens when multiple points or moments of failure are somehow correlated, or when they actually cascade. Examples vary widely, including:

The collapse of the World Trade Center buildings.
An authoritarian leader who manages to destroy a whole democratic system of government.

IT examples that are relatively big deals include:

Security breaches in which an attacker becomes able to fully impersonate a well-credentialed user.
Power outages or other whole-building breakdowns that bring down all parts of a (locally) redundant cluster.
Software bugs that bring down all parts of a supposedly redundant system at once.
Analytic failures that stem from misleading data sets. (Garbage in, garbage out.)

*I chose the phrase “single impetus” rather than “single cause” because NOTHING has a truly single cause; things only can happen when all kinds of conditions are satisfied for them to succeed. But there can indeed be an identifiable force, plan or occurrence that sets a chain of events in motion, and that’s what I’m calling the “impetus”.

Related link

A lot of analytics turns out to be adversarial.

Introduction to SequoiaDB and SequoiaCM

Curt Monash — Sun, 12 Mar 2017 18:19:49 +0000

For starters, let me say:

SequoiaDB, the company, is my client.
SequoiaDB, the product, is the main product of SequoiaDB, the company.
SequoiaDB, the company, has another product line SequoiaCM, which subsumes SequoiaDB in content management use cases.
SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
… and is usually sold for RDBMS-like use cases …
… except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
SequoiaDB’s products are open source.
SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.

Also:

SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
- SequoiaDB’s founders were Chinese nationals.
- Most of them went back to China.
- Other employees to date have been entirely Chinese.
- Sales to date have been entirely in China, but SequoiaDB has international aspirations
SequoiaDB has >100 employees, a large majority of which are split fairly evenly between “engineering” and “implementation and technical support”.
SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)

Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.

While I usually think that the advantages of open source are overstated, in SequoiaDB’s case open source will have* an additional benefit when SequoiaDB does go international — it addresses any concerns somebody might have about using Chinese technology.

*Edit: Actually, this claim is overstated based on SequoiaDB’s current open source practices. Please see the comment thread below.

SequoiaDB’s technology story starts:

SequoiaDB is a layered DBMS.
It manages JSON via update-in-place. MVCC (Multi-Version Concurrency Control) is on the roadmap.
Indexes are B-tree.
Transparent sharding and elasticity happen in what by now is the industry-standard/best-practices way:
- There are many (typically 4096) logical partitions, many of which are assigned to each physical partition.
- If the number of physical partitions changes, logical partitions are reassigned accordingly.
Relational OLTP (OnLine Transaction Processing) functionality is achieved by using a kind of PostgreSQL front end.
Relational batch processing is done via SparkSQL.
There also is a block/LOB (Large OBject) storage engine meant for content management applications.
SequoiaCM boils down technically to:
- SequoiaDB, which is used to store JSON metadata about the LOBs …
- … and whose generic-DBMS coordination capabilities are also used over the block/LOB engine.
- A Java library focused on content management.

SequoiaDB’s relationship with PostgreSQL is complicated, but as best I understand SequoiaDB’s relational operations:

SQL parsing, optimization, and so on rely mainly on PostgreSQL code. (Of course, there are some hacks, such as to the optimizer’s cost functions.)
Actual data storage is done via SequoiaDB’s JSON store, using PostgreSQL Foreign Data Wrappers. Each record goes in a separate JSON document. Locks, commits and so on — i.e. “write prevention” — are handled by the JSON store.
PostgreSQL’s own storage engine is actually part of the stack, but only to manage temp space and the like.

PostgreSQL stored procedures are already in the SequoiaDB product. Triggers and referential integrity are not. Neither, so far as I can tell, are PostgreSQL’s datatype extensibility capabilities.

I neglected to ask how much of that remains true when SparkSQL is invoked.

SequoiaDB’s use cases to date seem to fall mainly into three groups:

Content management via SequoiaCM.
“Operational data lakes”.
Pretty generic replacement of legacy RDBMS.

Internet back-ends, however — and this is somewhat counter-intuitive for an open-source JSON store — are rare, at least among paying subscription customers. But SequoiaDB did tell me of one classic IoT (Internet of Things) application, with lots of devices “phoning home” and the results immediately feeding a JSON-based dashboard.

To understand SequoiaDB’s “operational data lake” story, it helps to understand the typical state of data warehousing at SequoiaDB’s customers and prospects, which isn’t great:

2-3 years of data, and not all the data even from that time period.
Only enough processing power to support structured business intelligence …
… and hence little opportunity for ad-hoc query.

SequoiaDB operational data lakes offer multiple improvements over that scenario:

They hold as much relational data as customers choose to dump there.
That data can be simply copied from operational stores, with no transformation.
Or if data arrives via JSON — from external organizations or micro-services as the case may be — the JSON can be stored unmodified as well.
Queries can be run straight against this data soup.
Of course, views can also be set up in advance to help with querying.

Views are particularly useful with what might be called slowly changing schemas. (I didn’t check whether what SequoiaDB is talking about matches precisely with the more common term “slowly changing dimensions”.) Each time the schema changes, a new table is created in SequoiaDB to receive copies of the data. If one wants to query against the parts of the database structure that didn’t change — well, a view can be establish to allow for that.

Finally, it seems that SequoiaCM uses are concentrated in what might be called “security and checking-up” areas, such:

Photographs as part of an authentication process.
Video of in-person banking transactions, both for fraud prevention and for general service quality assurance.
Storage of security videos (for example from automated teller machines).

SequoiaCM deals seem to be bigger than other SequoiaDB ones, surely in part because the amounts of data managed are larger.

Introduction to Crate.io and CrateDB

Curt Monash — Sun, 18 Dec 2016 05:27:15 +0000

Crate.io and CrateDB basics include:

Crate.io makes CrateDB.
CrateDB is a quasi-RDBMS designed to receive sensor data and similar IoT (Internet of Things) inputs.
CrateDB’s creators were perhaps a little slow to realize that the “R” part was needed, but are playing catch-up in that regard.
Crate.io is an outfit founded by Austrian guys, headquartered in Berlin, that is turning into a San Francisco company.
Crate.io says it has 22 employees and 5 paying customers.
Crate.io cites bigger numbers than that for confirmed production users, clearly active clusters, and overall product downloads.

In essence, CrateDB is an open source and less mature alternative to MemSQL. The opportunity for MemSQL and CrateDB alike exists in part because analytic RDBMS vendors didn’t close it off.

CrateDB’s not-just-relational story starts:

A column can contain ordinary values (of usual-suspect datatypes) or “objects”, …
… where “objects” presumably are the kind of nested/hierarchical structures that are common in the NoSQL/internet-backend world, …
… except when they’re just BLOBs (Binary Large OBjects).
There’s a way to manually define “strict schemas” on the structured objects, and a syntax for navigating their structure in WHERE clauses.
There’s also a way to automagically infer “dynamic schemas”, but it’s simplistic enough to be more suitable for development/prototyping than for serious production.

Crate gave an example of data from >800 kinds of sensors being stored together in a single table. This leads to significant complexity in the FROM clauses. But querying the same data in a relational schema would be at least as complicated, and probably worse.

One key to understanding Crate’s architectural choices is to note that they’re willing to have different latency/consistency standards for:

Writes and single-row look-ups.
Aggregates and joins.

And so it makes sense that:

Data is banged into CrateDB in a NoSQL-ish kind of way as it arrives, with RYW consistency.
The indexes needed for SQL functionality are updated in microbatches as soon as possible thereafter. (Think 100 milliseconds as a base case.) Crate.io characterizes the consistency for this part as “eventual”.

CrateDB will never have real multi-statement transactions, but it has simpler levels of isolation that may be called “transactions” in some marketing contexts.

CrateDB technical highlights include:

CrateDB records are stored as JSON documents. (Actually, I didn’t ask whether this was true JSON or rather something “JSON-like”.)
- In the purely relational case, the documents may be regarded as glorified text strings.
- I got the impression that BLOB storage was somewhat separate from the rest.
CrateDB’s sharding story starts with consistent hashing.
- Shards are physical-only. CrateDB lacks the elasticity-friendly feature of there being many logical shards for each physical shard.
- However, you can change your shard count, and any future inserts will go into the new set of shards.
In line with its two consistency models, CrateDB also has two indexing strategies.
- Single-row/primary-key lookups have a “forward lookup” index, whatever that is.
- Tables also have a columnar index.
  - More complex queries and aggregations are commonly done straight against the columnar index, rather than the underlying data.
  - CrateDB’s principal columnar indexing strategy sounds a lot like inverted-list, which in turn is a lot like standard text indexing.
  - Specific datatypes — e.g. geospatial — can be indexed in different ways.
- The columnar index is shard-specific, and located at the same node as the shard.
- At least the hotter parts of the columnar index will commonly reside in memory. (I didn’t ask whether this was via straightforward caching or some more careful strategy.)
While I didn’t ask about CrateDB’s replication model in detail, I gathered that:
- Data is written synchronously to all nodes. (That’s sort of implicit in RYW consistency anyway.)
- Common replication factors are either 1 or 3, depending on considerations such as the value of the data. But as is usual, some tables can be replicated across all nodes.
- Data can be read from all replicas, for obvious reasons of performance.
Where relevant — e.g. the wire protocol or various SQL syntax specifics — CrateDB tends to emulate Postgres.
The CrateDB stack includes Elasticsearch and Lucene, both of which make sense in connection with Crate’s text/document orientation.

Crate.io is proud of its distributed/parallel story.

Any CrateDB node can plan a query. Necessary metadata for that is replicated across the cluster.
Execution starts on a shard-by-shard basis. Data is sorted at each shard before being sent onward.
Crate.io encourages you to run Spark and CrateDB on the same nodes.
- This is supported by parallel Spark-CrateDB integration of the obvious kind.
- Crate.io notes a happy synergy to this plan, in that Spark stresses CPU while CrateDB is commonly I/O-bound.

The CrateDB-Spark integration was the only support I could find for various marketing claims about combining analytics with data management.

Given how small and young Crate.io is, there are of course many missing features in CrateDB. In particular:

A query can only reshuffle data once. Hence, CrateDB isn’t currently well-designed for queries that join more than 2 tables together.
The only join strategy currently implemented is nested loop. Others are in the future.
CrateDB has most of ANSI SQL 92, but little or nothing specific to SQL 99. In particular, SQL windowing is under development.
Geo-distribution is still under development (even though most CrateDB data isn’t actually about people).
I imagine CrateDB administrative tools are still rather primitive.

In any case, creating a robust DBMS is an expensive and time-consuming process. Crate has a long road ahead of it.

Edit: For some clarification and even correction, please see the first comment below.

MongoDB 3.4 and “multimodel” query

Curt Monash — Wed, 23 Nov 2016 12:01:25 +0000

“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the train as well, but they’ve taken a clear and interesting stance:

A query layer with multiple ways to query and analyze data.
A separate data storage layer in which you have a choice of data storage engines …
… each of which has the same logical (JSON-based) data structure.

When I pointed out that it would make sense to call this “multimodel query” — because the storage isn’t “multimodel” at all — they quickly agreed.

To be clear: While there are multiple ways to read data in MongoDB, there’s still only one way to write it. Letting that sink in helps clear up confusion as to what about MongoDB is or isn’t “multimodel”. To spell that out a bit further:

In query, MongoDB mixes multiple paradigms for DML (Data Manipulation Language). The main one is of course JSON.
When writing, the DML paradigm is unmixed — it’s just JSON.

Further, MongoDB query DML statements can be mixed with analytic functions rooted in Spark.

The main ways to query data in MongoDB, to my knowledge, are:

Native/JSON. Duh.
SQL.
- MongoDB has used MySQL as a guide to what SQL coverage they think the market is calling for.
- More to the point, they’re trying to provide enough SQL so that standard business intelligence tools work well (enough) against MongoDB.
- I neglected to ask why this changed from MongoDB’s adamantly non-SQL approach of 2 1/2 years ago.
Search.
- MongoDB has been adding text search features for a few releases.
- MongoDB’s newest search feature revolves around “facets”, in the Endeca sense of the term. MongoDB characterizes as a kind of text-oriented GroupBy.
Graph. MongoDB just introduced a kind of recursive join capability, which is useful for detecting multi-hop relationships (e.g. ancestor/descendant rather than just parent/child). MongoDB declares that the “graph” box is thereby checked.

Three years ago, in an overview of layered and multi-DML architectures, I suggested:

Layered DBMS and multimodel functionality fit well together.
Both carried performance costs.
In most cases, the costs could be affordable.

MongoDB seems to have bought strongly into that view on the query side — which is, of course, exactly the right way for them to have started.

Rapid analytics

Curt Monash — Fri, 21 Oct 2016 14:17:04 +0000

“Real-time” technology excites people, and has for decades. Yet the actual, useful technology to meet “real-time” requirements remains immature, especially in cases which call for rapid human decision-making. Here are some notes on that conundrum.

1. I recently posted that “real-time” is getting real. But there are multiple technology challenges involved, including:

General streaming. Some of my posts on that subject are linked at the bottom of my August post on Flink.
Low-latency ingest of data into structures from which it can be immediately analyzed. That helps drive the (re)integration of operational data stores, analytic data stores, and other analytic support — e.g. via Spark.
Business intelligence that can be used quickly enough. This is a major ongoing challenge. My clients at Zoomdata may be thinking about this area more clearly than most, but even they are still in the early stages of providing what users need.
Advanced analytics that can be done quickly enough. Answers there may come through developments in anomaly management, but that area is still in its super-early days.
Alerting, which has been under-addressed for decades. Perhaps the anomaly management vendors will finally solve it.

2. In early 2011, I coined the phrase investigative analytics, about which I said three main things:

It is meant to contrast with “operational analytics”.
It is meant to conflate “several disciplines, namely”:
- Statistics, data mining, machine learning, and/or predictive analytics.
- The more research-oriented aspects of business intelligence tools.
- Analogous technologies as applied to non-tabular data types such as text or graph.
A simple definition would be “Seeking (previously unknown) patterns in data.”

Generally, that has held up pretty well, although “exploratory” is the more widely used term. But the investigative/operational dichotomy obscures one key fact, which is the central point of this post: There’s a widespread need for very rapid data investigation.

3. This is not just a niche need. There are numerous rapid-investigation use cases in mind, some already mentioned in my recent posts on anomaly management and real-time applications.

Network operations. This is my paradigmatic example.
- Data is zooming all over the place, in many formats and structures, among many kinds of devices. That’s log data, header data and payload data alike. Many kinds of problems can arise …
- … which operators want to diagnose and correct, in as few minutes as possible.
- Interfaces commonly include real-time business intelligence, some drilldown, and a lot of command-line options.
- I’ve written about various specifics, especially in connection with the vendors Splunk and Rocana.
Security and anti-fraud. Infosec and cyberfraud, to a considerable extent, are just common problems in network operations. Much of the response is necessarily automated — but the bad guys are always trying to outwit your automation. If you think they may have succeeded, you want to figure that out very, very fast.
Consumer promotion and engagement. Consumer marketers feel a great need for speed. Some of it is even genuine.
- If an online promotion is going badly (or particularly well), they can in theory react almost instantly. So they’d like to know almost instantly, perhaps via BI tools with great drilldown.
- The same is even truer in the case of social media eruptions and the like. Obviously, the tools here are heavily text-oriented.
- Call centers and even physical stores have some of the same aspects as internet consumer operations.
Consumer internet backends, for e-commerce, publishing, gaming or whatever. These cases combine and in some cases integrate the previous three points. For example, if you get a really absurd-looking business result, that could be your first indication of network malfunctions or automated fraud.
Industrial technology, such as factory operations, power/gas/water networks, vehicle fleets or oil rigs. Much as in IT networks, these contain a diversity of equipment — each now spewing its own logs — and have multiple possible modes of failure. More often than is the case in IT networks, you can recognize danger signs, then head off failure altogether via preventive maintenance. But when you can’t, it is crucial to identify the causes of failure fast.
General IoT (Internet of Things) operation. This covers several of the examples above, as well as cases in which you sell a lot of devices, have them “phone home”, and labor to keep that whole multi-owner network working.
National security. If I told you what I meant by this one, I’d have to … [redacted].

4. And then there’s the investment industry, which obviously needs very rapid analysis. When I was a stock analyst, I could be awakened by a phone call and told news that I would need to explain to 1000s of conference call listeners 20 minutes later. This was >30 years ago. The business moves yet faster today.

The investment industry has invested greatly in high-speed supporting technology for decades. That’s how Mike Bloomberg got so rich founding a vertical market tech business. But investment-oriented technology indeed remains a very vertical sector; little of it get more broadly applied.

I think the reason may be that investing is about guesswork, while other use cases call for more definitive answers. In particular:

If you’re wrong 49.9% of the time in investing, you might still be a big winner.
In high-frequency trading, speed is paramount; you have to be faster than your competitors. In speed/accuracy trade-offs, speed wins.

5. Of course, it’s possible to overstate these requirements. As in all real-time discussions, one needs to think hard about:

How much speed is important in meeting users’ needs.
How much additional speed, if any, is important in satisfying users’ desires.

But overall, I have little doubt that rapid analytics is a legitimate area for technology advancement and growth.

Notes on the transition to the cloud

Curt Monash — Tue, 04 Oct 2016 02:22:21 +0000

1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:

The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
Software as a service, aka SaaS.
Co-location in off-premises data centers, aka colo.
On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.

Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.

This is a good example of Monash’s Laws of Commercial Semantics.

2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.

This fact now seems to be widely understood.

3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.

This is frequently a good reason for new — i.e. “greenfield” — apps to run in the cloud.

4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.

In connection to that point, it might be interesting to note:

In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
Other big users were hospitals and stockbrokers.
The US intelligence agencies are building out their own shared, dedicated cloud.

5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.

In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.

6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.

7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:

Does your application need to run close to your customers’ largest databases?
Do your customers still avoid the public cloud?

If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.

8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:

Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
Departments don’t necessarily care about issues that give central IT pause.
Departments sometimes buy things that only are available via remote delivery, e.g. narrowly focused SaaS applications or market data.

I discussed some of this in my recent post on vendor lock-in.

9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.

Those concerns seem to have been largely alleviated.

10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.

11. If you think about system requirements:

There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
Server-side computing, however, is more homogenous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.

I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.

So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.

Otherwise, clusters will probably be organized by database, in the most expansive sense of term. For example, there could be separate clusters for:

Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
Relational data warehousing, whether in an analytic RDBMS or otherwise.
Log files, perhaps managed in Hadoop or Splunk.
Your website and other internet back-ends, perhaps running over NoSQL data stores.
Text documents managed by some kind of search engine.
Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)

Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.

Adversarial analytics and other topics

Curt Monash — Mon, 30 May 2016 10:15:33 +0000

Five years ago, in a taxonomy of analytic business benefits, I wrote:

A large fraction of all analytic efforts ultimately serve one or more of three purposes:

Marketing

Problem and anomaly detection and diagnosis

Planning and optimization

That continues to be true today. Now let’s add a bit of spin.

1. A large fraction of analytics is adversarial. In particular:

Many of the analytics companies I talk with tell me that they have important use cases in security, anti-fraud or both.
Click fraud steals a large fraction of the revenue in online advertising and other promotion. Combating it is a major application need.
Spam is another huge, ongoing fight.
- When Google et al. fight web spammers — which is of course a great part of what web search engine developers do — they’re engaged in adversarial information retrieval.
- Blog comment spam is still a problem, even though the vast majority of instances can now be caught.
- Ditto for email.
There’s an adversarial aspect to algorithmic trading. You’re trying to beat other investors. What’s more, they’re trying to identify your trading activity, so you’re trying to obscure it. Etc.
Unfortunately, unfree countries can deploy analytics to identify attempts to evade censorship. I plan to post much more on that point soon.
Similarly, de-anonymization can be adversarial.
Analytics supporting national security often have an adversarial aspect.
Banks deploy analytics to combat money-laundering.

Adversarial analytics are inherently difficult, because your adversary actively wants you to get the wrong answer. Approaches to overcome the difficulties include:

Deploying lots of data. Email spam was only defeated by large providers who processed lots of email and hence could see when substantially the same email was sent to many victims at once. (By the way, that’s why “spear-phishing” still works. Malicious email sent to only one or a few victims still can’t be stopped.)
Using unusual analytic approaches. For example, graph analytics are used heavily in adversarial situations, even though they have lighter adoption otherwise.
Using many analytic tests. For example, Google famously has 100s (at least) of sub-algorithms contributing to its search rankings. The idea here is that even the cleverest adversary might find it hard to perfectly simulate innocent behavior.

2. I was long a skeptic of “real-time” analytics, although I always made exceptions for a few use cases. (Indeed, I actually used a form of real-time business intelligence when I entered the private sector in 1981, namely stock quote machines.) Recently, however, the stuff has gotten more-or-less real. And so, in a post focused on data models, I highlighted some use cases, including:

It is increasingly common for predictive decisions to be made at [real-timeish] speeds. (That’s what recommenders and personalizers do.) Ideally, such decisions can be based on fresh and historical data alike.

The long-standing desire for business intelligence to operate on super-fresh data is, increasingly, making sense, as we get ever more stuff to monitor. However …

… most such analysis should look at historical data as well.

Streaming technology is supplying ever more fresh data.

Let’s now tie those comments into the analytic use case trichotomy above. From the standpoint of mainstream (or early-life/future-mainstream) analytic technologies, I think much of the low-latency action is in two areas:

Recommenders/personalizers.
Monitoring and troubleshooting networked equipment. This is generally an exercise in anomaly detection and interpretation.

Beyond that:

At sufficiently large online companies, there’s a role for low-latency marketing decision support.
Low-latency marketing-oriented BI can also help highlight system malfunctions.
Investments/trading has a huge low-latency aspect, but that’s somewhat apart from the analytic mainstream. (And it doesn’t fit well into my trichotomy anyway.)
Also not in the analytic mainstream are the use cases for low-latency (re)planning and optimization.

Related links

My April, 2015 post Which analytic technology problems are important to solve for whom? has a round-up of possibly relevant links.

Readings in Database Systems

Curt Monash — Thu, 10 Dec 2015 12:26:40 +0000

Mike Stonebraker and Larry Ellison have numerous things in common. If nothing else:

They’re both titanic figures in the database industry.
They both gave me testimonials on the home page of my business website.
They both have been known to use the present tense when the future tense would be more accurate.

I mention the latter because there’s a new edition of Readings in Database Systems, aka the Red Book, available online, courtesy of Mike, Joe Hellerstein and Peter Bailis. Besides the recommended-reading academic papers themselves, there are 12 survey articles by the editors, and an occasional response where, for example, editors disagree. Whether or not one chooses to tackle the papers themselves — and I in fact have not dived into them — the commentary is of great interest.

But I would not take every word as the gospel truth, especially when academics describe what they see as commercial market realities. In particular, as per my quip in the first paragraph, the data warehouse market has not yet gone to the extremes that Mike suggests,* if indeed it ever will. And while Joe is close to correct when he says that the company Essbase was acquired by Oracle, what actually happened is that Arbor Software, which made Essbase, merged with Hyperion Software, and the latter was eventually indeed bought by the giant of Redwood Shores.**

*When it comes to data warehouse market assessment, Mike seems to often be ahead of the trend.

**Let me interrupt my tweaking of very smart people to confess that my own commentary on the Oracle/Hyperion deal was not, in retrospect, especially prescient.

Mike pretty much opened the discussion with a blistering attack against hierarchical data models such as JSON or XML. To a first approximation, his views might be summarized as:

Logical hierarchical models can be OK in certain cases. In particular, JSON could be a somewhat useful datatype in an RDBMS.
Physical hierarchical models are horrible.
Rather, you should implement the logical hierarchical model over a columnar RDBMS.

My responses start:

Nested data structures are more important than Mike’s discussion seems to suggest.
Native XML and JSON stores are apt to have an index on every field. If you squint, that index looks a lot like a column store.
Even NoSQL stores should and I think in most cases will have some kind of SQL-like DML (Data Manipulation Language). In particular, there should be some ability to do joins, because total denormalization is not always a good choice.

In no particular order, here are some other thoughts about or inspired by the survey articles in Readings in Database Systems, 5th Edition.

I agree that OLTP (OnLine Transaction Processing) is transitioning to main memory.
I agree with the emphasis on “data in motion”.
While I needle him for overstating the speed of the transition, Mike is right that columnar architectures are winning for analytics. (Or you could say they’ve won, if you recognize that mop-up from the victory will still take 1 or 2 decades.)
The guys seem to really hate MapReduce, which is an old story for Mike, but a bit of a reversal for Joe.
MapReduce is many things, but it’s not a data model, and it’s also not something that Hadoop 1.0 was an alternative to. Saying each of those things was sloppy writing.
The guys characterize consistency/transaction isolation as a rather ghastly mess. That part was an eye-opener.
Mike is a big fan of arrays. I suspect he’s right in general, although I also suspect he’s overrating SciDB. I also think he’s somewhat overrating the market penetration of cube stores, aka MOLAP.
The point about Hadoop (in particular) and modern technologies in general showing the way to modularization of DBMS is an excellent one.
Joe and Mike disagreed about analytics; Joe’s approach rang truer for me. My own opinion is:
- Business intelligence has been important for quite a while, and won’t stop.
- Machine learning is becoming ever more important.
- It’s still early days for the integration of the two areas, but much more will come.
The challenge of whether anybody wants to do machine learning (or other advanced analytics) over a DBMS is sidestepped in part by the previously mentioned point about the modularization of a DBMS. Hadoop, for example, can be both an OK analytic DBMS (although not fully competitive with mature, dedicated products) and of course also an advanced analytics framework.
Similarly, except in the short-term I’m not worried about the limitations of Spark’s persistence mechanisms. Almost every commercial distribution of Spark I can think of is part of a package that also contains a more mature data store.
Versatile DBMS and analytic frameworks suffer strategic contention for memory, with different parts of the system wanting to use it in different ways. Raising that as a concern about the integration of analytic DBMS with advanced analytic frameworks is valid.
I used to overrate the importance of abstract datatypes, in large part due to Mike’s influence. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. A big part of the problem is what I mentioned in the previous point — different parts of a versatile DBMS would prefer to do different things with memory.
I used to overrate the importance of user-defined functions in an analytic RDBMS. Mike had nothing to do with my error. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. Looser coupling between analytics and data management seems more flexible.
Excellent points are made about the difficulties of “First we build the perfect schema” data warehouse projects and, similarly, MDM (Master Data Management).
There’s an interesting discussion that helps explain why optimizer progress is so slow (both for the industry in general and for each individual product).

Related links

I did a deep dive into MarkLogic’s indexing strategy in 2008, which informed my comment about XML/JSON stores above.
Again with MarkLogic as the focus, in 2010 I was skeptical about document stores not offering joins. MarkLogic has since capitulated.
I’m not current on SciDB, but I did write a bit about it in 2010.
I’m surprised that I can’t find a post to point to about modularization of DBMS. I’ll leave this here as a placeholder until I can.
Edit: As promised, I’ve now posted about the object-relational/abstract datatype boom of the 1990s.