Text – DBMS 2 : DataBase Management System Services

How to beat “fake news”

Curt Monash — Thu, 27 Jun 2019 10:42:50 +0000

Most observers hold several or all of the following views:

“Fake news” and the like are severe problems.
Algorithmic solutions have not worked well to date.
Neither have manual ones.
Trusting governments to censor is a bad idea.
In light of the previous points, trusting large social media corporations to censor is a bad idea too.
Educating consumers to evaluate news and opinions accurately would be … difficult.

And further:

Whatever you think of the job traditional journalistic organizations previously did as news arbiters, they can’t do it as well anymore, for a variety of economic, structural and societal reasons.

But despite all those difficulties, I also believe that a good solution to news/opinion filtering is feasible; it just can’t be as simple as everybody would like.

1. When people think about these problems, they’re probably most focused on social media platforms such as Facebook, YouTube or Twitter. But before getting to those, let’s consider the simpler case of search engines. In essence, what search engines do is:

Assign a relevance score to the relationship between a site (or particular page) and your query.
Assign a quality score to a site.
Combine those two scores into an overall ranking, and serve up results accordingly.
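To make the three steps above concrete, here is a minimal sketch. The `Page` class, the 0-to-1 score scales, the linear blend and the 0.3 weight are all assumptions made for illustration; real search engines combine far more signals, in ways they don't publish.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    relevance: float  # how well the page matches the query, 0..1 (assumed scale)
    quality: float    # query-independent quality of the site, 0..1 (assumed scale)


def rank(pages, quality_weight=0.3):
    """Order pages by a weighted blend of relevance and quality.

    The linear blend and the default weight are illustrative assumptions,
    not how any real search engine combines its signals.
    """
    def overall(page):
        return (1 - quality_weight) * page.relevance + quality_weight * page.quality

    return sorted(pages, key=overall, reverse=True)


# Hypothetical pages: the spam page is the most "relevant" but has very low quality.
pages = [
    Page("spam.example/cheap-pills", relevance=0.9, quality=0.05),
    Page("encyclopedia.example/aspirin", relevance=0.8, quality=0.9),
    Page("blog.example/headache-diary", relevance=0.5, quality=0.6),
]
ranked_urls = [page.url for page in rank(pages)]
print(ranked_urls)
```

In this toy data the quality score is what lifts the encyclopedia page above the slightly more "relevant" spam page. The spam page still ranks second, which is the adversarial part of the problem in miniature.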

How well does this work? I’d say that search engines:

Are good at directing you to information that is generally related to what you want to know. This is the technological core of what they do.
Are good at shielding you from the worst cheating/spammer/hacker sites. That’s also a major technical focus for them.
Are poor but not horrible at distinguishing between good and bad sources of information, opinion or advice. Generally, they do this via some kind of popularity contest, whether via Google’s venerable PageRank or by more directly observing which sites users seem to go to and stay at.
Don’t even try to filter sites according to leanings such as political bias.

Lessons from that start:

Huge technology companies can actually do pretty well at the parts of the problem that technology alone can solve.
A lot of the challenge boils down to adversarial information retrieval, where the adversaries range from somewhat honest polemicists or hucksters to completely awful hackers and spammers.

2. When defending against bad actors, scale helps a lot. In my favorite example:

A significant fraction of all the world’s email goes through Gmail.
Thus, it is very hard for email spam blasts to escape Google’s honeypots.
Informed by those honeypots, Google does what in my opinion is a very good job of fighting spam.

Similarly, as the publisher of multiple blogs, I can tell you that much the same is true of the fight WordPress’ Akismet wages against spam comments. Akismet isn’t perfect; indeed, I’ve stopped adding new content to the blog where this post would fit best (Text Technologies) because of a multi-year spam attack. But on the whole Akismet works very well.

Thus, in contradiction to many observers, I believe that the huge scale of social media companies is NOT the root of the problem.

3. Of course, concern is really focused on social media, and especially on the worry that people communicate things they (supposedly) shouldn’t, where:

“Communicating” includes words, pictures, videos, etc.
“Shouldn’t” covers outright lies, great factual distortions, hate speech … or just opinions that the would-be censor doesn’t think should be spread.

And even if you don’t worry so much about those problems, some kind of censorship, filtering or gatekeeping is inevitable anyway, simply because there’s vastly more information in the world than any one person can consume.

So what are the main options for censorship and other gatekeeping? My opinions start:

Having governments be in charge of censorship is a terrible idea.
Having large, non-journalist corporations be directly in charge of censorship is also a bad idea, because ultimately they’ll just succumb to government or other political pressure.
The traditional modern gatekeepers are journalistic organizations, who both deserved and received trust that they’d do the job responsibly. But that model no longer suffices in its old form, for several sets of reasons:
- New requirements. Traditional journalistic gatekeeping boils down to organizations vetting the content they themselves produce. It doesn’t work nearly as well for third-party content, worthy efforts such as fact-checking columns notwithstanding.
- Trust. Walter Cronkite is long dead; journalists aren’t nearly as trusted as they used to be.
- Deliberate bias. Opinion and bias are now part of many “news” organizations’ business models, to a much greater extent than they were a few decades in the past.
- Money. Most journalistic organizations have much slimmer news budgets than they had at their peak.

4. So if we need gatekeeping, and no natural kind of gatekeeper can on its own be effective or safe, what’s left? In simplest terms, we need gatekeeping by (technological) committee. Mainly, what I propose comprises:

Different kinds of gatekeeper for different aspects of the problem, including at a minimum:
- Human-led filters to deal with various issues in credibility and bias.
- Technology-led filters to deal with pure fakery and false provenance.
- Further filtering of the kinds that would be needed even in a more benign world.
Multiple choices for at least the human-led filters.
Good, simple (!) user interfaces for combining those filters’ results.

Above all, people must be able to choose their own censors.

5. What I envision for the “human-led filters to deal with various issues in credibility and bias” is something like:

An organization maintains slowly-changing whitelists and blacklists of information sources.
The same organization fact-checks or otherwise vets specific stories, claims and content in near-real-time.
The results scale to other stories, claims and content via very-rapidly-retrained machine learning models, whether those are based on a single gatekeeper’s hand editing or, more likely, on multiple hand-edited training sets and other collaborative inputs at once.
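Here is one rough sketch of how such a filter, and a user's choice among several of them, might fit together. Everything in it is hypothetical: the `Gatekeeper` class, its field names, the precedence order (hand-made verdict, then source lists, then model estimate) and the weighted average are assumptions for illustration, and `model_score` stands in for the output of whatever rapidly-retrained model is used.

```python
from dataclasses import dataclass, field


@dataclass
class Gatekeeper:
    """One human-led filter; all field names here are hypothetical."""
    name: str
    whitelist: set = field(default_factory=set)         # trusted sources
    blacklist: set = field(default_factory=set)         # distrusted sources
    checked_claims: dict = field(default_factory=dict)  # claim id -> True/False verdict

    def score(self, source, claim_id, model_score):
        """Return a credibility score in [0, 1].

        Hand-made verdicts win; the source lists come next; the machine
        learning estimate (model_score) is only the fallback.
        """
        if claim_id in self.checked_claims:
            return 1.0 if self.checked_claims[claim_id] else 0.0
        if source in self.blacklist:
            return 0.0
        if source in self.whitelist:
            return 1.0
        return model_score


def combined_score(gatekeepers, weights, source, claim_id, model_score):
    """Weighted average over the gatekeepers a user has chosen to trust."""
    total = sum(weights[g.name] for g in gatekeepers)
    return sum(
        weights[g.name] * g.score(source, claim_id, model_score)
        for g in gatekeepers
    ) / total


# A user who trusts gatekeeper A three times as much as gatekeeper B.
a = Gatekeeper("A", blacklist={"rumors.example"})
b = Gatekeeper("B")
user_weights = {"A": 3, "B": 1}
result = combined_score([a, b], user_weights, "rumors.example", "claim-17", model_score=0.6)
print(result)
```

The point of the design is that the weights belong to the user, not to the platform: swapping in different gatekeepers or different weights changes what gets through, which is what choosing your own censors would mean in practice.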

Here an “organization” can be anything trusted by enough people to be economically viable, for example:

An offshoot of an existing journalistic organization.
An offshoot of an existing political party or advocacy organization.
A successfully started-up new entity.

Obviously, there would be business issues, notably:

Costs. Who pays, in an economy where news is commonly “free”?
Chicken-egg adoption. Which gets developed first: Human-led filtering services that can’t yet be integrated into actual social media filtering, or technology to integrate human-led filtering services that don’t yet exist?

But given the importance and visibility of the problem, optimism about solving the business issues is appropriate. The hardest part is the technology itself. Can machine learning models be retrained on a sub-hour or even sub-minute basis? Sure. That’s been confirmed many times. But what I’m suggesting is a pretty complex case, with global scale, intermediate results passed among organizations, with plenty of adversarial elements, all done at very high speed.

That is not yet a solved problem. But it certainly seems solvable. Further, it’s a problem that must be solved, lest liberal democracy be as doomed as some people fear it actually is.

Related links

I wrote a bit about adversarial analytics in May, 2016.
I outlined my views about the “War(s) on Truth” in February, 2018.
Earlier this month, Cory Doctorow offered a hard-hitting column on the dangers of expecting internet companies to do our censorship for us.
More sedately but with more explanation, so did Will Oremus.

Brittleness, Murphy’s Law, and single-impetus failures

Curt Monash — Wed, 20 Jun 2018 09:15:44 +0000

In my initial post on brittleness I suggested that a typical process is:

Build something brittle.
Strengthen it over time.

In many engineering scenarios, a fuller description could be:

Design something that works in the base cases.
Anticipate edge cases and sources of error, and design for them too.
Implement the design.
Discover which edge cases and error sources you failed to consider.
Improve your product to handle them too.
Repeat as needed.

So it’s necessary to understand what is or isn’t likely to go wrong. Unfortunately, that need isn’t always met.

Murphy’s Law and exaggerated fears

We should always bear in mind Murphy’s Law, which in its simplest form states: Anything that can go wrong, will. But also remember that Murphy’s Law is a joke; and even if it were serious, nothing concise is ever precise.

People who tend to over-believe in Murphy’s Law include but are hardly limited to:

Bureaucrats.
Worried parents, especially of only children. (Later kids tend to have it easier, as their parents have more experience.)
Any buyer or voter you believe has been over-persuaded toward fear, uncertainty and doubt.
Relational bigots who view the Ted Codd guarantee as an absolute requirement for data management.

Adversaries

The strongest scenarios for Murphy’s Law should be adversarial ones, in which somebody is actively trying to cause problems. But even there it doesn’t always apply. For example:

Information security commonly fits the Murphy model. Hackers keep outwitting defenders.
Email spam, however, does not. It’s pretty much a solved problem; the few spam emails that still get through hardly matter.
Web search is somewhere in between. Both sides are partially successful in the combat over adversarial information retrieval, as “good” and “bad” sites alike are well-represented in search results.

Single-impetus failures

Since bad or scary things will happen — Murphy’s Law isn’t entirely wrong — a standard design practice is to avoid single points of failure. Brittleness has a lot to do with which single points of failure have been overlooked; improvement has a lot to do with belatedly cleaning them up. In adversarial scenarios, avoiding single points of failure relates closely to defense in depth.

Some of the nastiest surprises occur when failures have no obvious single point, yet wind up being possible from a single impetus.* This happens when multiple points or moments of failure are somehow correlated, or when they actually cascade. Examples vary widely, including:

The collapse of the World Trade Center buildings.
An authoritarian leader who manages to destroy a whole democratic system of government.

IT examples that are relatively big deals include:

Security breaches in which an attacker becomes able to fully impersonate a well-credentialed user.
Power outages or other whole-building breakdowns that bring down all parts of a (locally) redundant cluster.
Software bugs that bring down all parts of a supposedly redundant system at once.
Analytic failures that stem from misleading data sets. (Garbage in, garbage out.)

*I chose the phrase “single impetus” rather than “single cause” because NOTHING has a truly single cause; things only can happen when all kinds of conditions are satisfied for them to succeed. But there can indeed be an identifiable force, plan or occurrence that sets a chain of events in motion, and that’s what I’m calling the “impetus”.
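A little arithmetic shows why correlated failures are so nasty. The sketch below assumes a cluster of identical replicas, each failing independently with probability `p_node`, plus one shared event (a power outage, a common software bug) that takes all of them down at once with probability `p_common`. The function names and the numbers are made up for illustration.

```python
def p_all_fail_independent(p_node, replicas):
    """Probability that every replica fails, if failures are independent."""
    return p_node ** replicas


def p_all_fail_with_common_impetus(p_node, replicas, p_common):
    """Same, but a shared event (probability p_common) downs all replicas at once."""
    return p_common + (1 - p_common) * p_node ** replicas


# Assumed figures: 1% chance per replica, 3 replicas, 0.1% chance of a shared event.
independent = p_all_fail_independent(0.01, 3)
correlated = p_all_fail_with_common_impetus(0.01, 3, 0.001)
print(f"independent: {independent:.9f}, with common impetus: {correlated:.9f}")
```

With these assumed figures, triple redundancy looks like a one-in-a-million risk on paper, but a one-in-a-thousand shared impetus makes total failure roughly a thousand times more likely. The redundancy is real; it just isn't protecting against the thing that dominates the risk.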

Related link

A lot of analytics turns out to be adversarial.

MongoDB 3.4 and “multimodel” query

Curt Monash — Wed, 23 Nov 2016 12:01:25 +0000

“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the train as well, but they’ve taken a clear and interesting stance:

A query layer with multiple ways to query and analyze data.
A separate data storage layer in which you have a choice of data storage engines …
… each of which has the same logical (JSON-based) data structure.

When I pointed out that it would make sense to call this “multimodel query” — because the storage isn’t “multimodel” at all — they quickly agreed.

To be clear: While there are multiple ways to read data in MongoDB, there’s still only one way to write it. Letting that sink in helps clear up confusion as to what about MongoDB is or isn’t “multimodel”. To spell that out a bit further:

In query, MongoDB mixes multiple paradigms for DML (Data Manipulation Language). The main one is of course JSON.
When writing, the DML paradigm is unmixed — it’s just JSON.

Further, MongoDB query DML statements can be mixed with analytic functions rooted in Spark.

The main ways to query data in MongoDB, to my knowledge, are:

Native/JSON. Duh.
SQL.
- MongoDB has used MySQL as a guide to what SQL coverage they think the market is calling for.
- More to the point, they’re trying to provide enough SQL so that standard business intelligence tools work well (enough) against MongoDB.
- I neglected to ask why this changed from MongoDB’s adamantly non-SQL approach of 2 1/2 years ago.
Search.
- MongoDB has been adding text search features for a few releases.
- MongoDB’s newest search feature revolves around “facets”, in the Endeca sense of the term. MongoDB characterizes it as a kind of text-oriented GroupBy.
Graph. MongoDB just introduced a kind of recursive join capability, which is useful for detecting multi-hop relationships (e.g. ancestor/descendant rather than just parent/child). MongoDB declares that the “graph” box is thereby checked.
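For readers who haven't met the idea, a "recursive join" just means following the same parent/child link repeatedly until nothing new turns up. MongoDB 3.4 exposes this through the `$graphLookup` aggregation stage; the sketch below does not call MongoDB at all. It is plain Python over a hypothetical list of JSON-like documents, and the field names `_id` and `parent` are assumptions.

```python
from collections import deque

# Hypothetical collection of JSON-like documents, each pointing at its parent.
people = [
    {"_id": "alice", "parent": None},
    {"_id": "bob", "parent": "alice"},
    {"_id": "carol", "parent": "bob"},
    {"_id": "dave", "parent": "carol"},
    {"_id": "erin", "parent": "alice"},
]


def descendants(docs, root_id):
    """Repeatedly join parent -> child, returning (id, depth) pairs."""
    children = {}
    for doc in docs:
        children.setdefault(doc["parent"], []).append(doc["_id"])

    found = []
    seen = {root_id}              # guards against cycles in bad data
    queue = deque([(root_id, 0)])
    while queue:
        current, depth = queue.popleft()
        for child in children.get(current, []):
            if child in seen:
                continue
            seen.add(child)
            found.append((child, depth + 1))
            queue.append((child, depth + 1))
    return found


alice_line = descendants(people, "alice")
print(alice_line)
```

A single ordinary join would only find bob and erin; the repeated version also reaches carol and dave, which is the ancestor/descendant case mentioned above.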

Three years ago, in an overview of layered and multi-DML architectures, I suggested:

Layered DBMS and multimodel functionality fit well together.
Both carried performance costs.
In most cases, the costs could be affordable.

MongoDB seems to have bought strongly into that view on the query side — which is, of course, exactly the right way for them to have started.

Rapid analytics

Curt Monash — Fri, 21 Oct 2016 14:17:04 +0000

“Real-time” technology excites people, and has for decades. Yet the actual, useful technology to meet “real-time” requirements remains immature, especially in cases which call for rapid human decision-making. Here are some notes on that conundrum.

1. I recently posted that “real-time” is getting real. But there are multiple technology challenges involved, including:

General streaming. Some of my posts on that subject are linked at the bottom of my August post on Flink.
Low-latency ingest of data into structures from which it can be immediately analyzed. That helps drive the (re)integration of operational data stores, analytic data stores, and other analytic support — e.g. via Spark.
Business intelligence that can be used quickly enough. This is a major ongoing challenge. My clients at Zoomdata may be thinking about this area more clearly than most, but even they are still in the early stages of providing what users need.
Advanced analytics that can be done quickly enough. Answers there may come through developments in anomaly management, but that area is still in its super-early days.
Alerting, which has been under-addressed for decades. Perhaps the anomaly management vendors will finally solve it.

2. In early 2011, I coined the phrase investigative analytics, about which I said three main things:

It is meant to contrast with “operational analytics”.
It is meant to conflate “several disciplines, namely”:
- Statistics, data mining, machine learning, and/or predictive analytics.
- The more research-oriented aspects of business intelligence tools.
- Analogous technologies as applied to non-tabular data types such as text or graph.
A simple definition would be “Seeking (previously unknown) patterns in data.”

Generally, that has held up pretty well, although “exploratory” is the more widely used term. But the investigative/operational dichotomy obscures one key fact, which is the central point of this post: There’s a widespread need for very rapid data investigation.

3. This is not just a niche need. I have numerous rapid-investigation use cases in mind, some already mentioned in my recent posts on anomaly management and real-time applications.

Network operations. This is my paradigmatic example.
- Data is zooming all over the place, in many formats and structures, among many kinds of devices. That’s log data, header data and payload data alike. Many kinds of problems can arise …
- … which operators want to diagnose and correct, in as few minutes as possible.
- Interfaces commonly include real-time business intelligence, some drilldown, and a lot of command-line options.
- I’ve written about various specifics, especially in connection with the vendors Splunk and Rocana.
Security and anti-fraud. Infosec and cyberfraud, to a considerable extent, are just common problems in network operations. Much of the response is necessarily automated — but the bad guys are always trying to outwit your automation. If you think they may have succeeded, you want to figure that out very, very fast.
Consumer promotion and engagement. Consumer marketers feel a great need for speed. Some of it is even genuine.
- If an online promotion is going badly (or particularly well), they can in theory react almost instantly. So they’d like to know almost instantly, perhaps via BI tools with great drilldown.
- The same is even truer in the case of social media eruptions and the like. Obviously, the tools here are heavily text-oriented.
- Call centers and even physical stores have some of the same aspects as internet consumer operations.
Consumer internet backends, for e-commerce, publishing, gaming or whatever. These cases combine and in some cases integrate the previous three points. For example, if you get a really absurd-looking business result, that could be your first indication of network malfunctions or automated fraud.
Industrial technology, such as factory operations, power/gas/water networks, vehicle fleets or oil rigs. Much as in IT networks, these contain a diversity of equipment — each now spewing its own logs — and have multiple possible modes of failure. More often than is the case in IT networks, you can recognize danger signs, then head off failure altogether via preventive maintenance. But when you can’t, it is crucial to identify the causes of failure fast.
General IoT (Internet of Things) operation. This covers several of the examples above, as well as cases in which you sell a lot of devices, have them “phone home”, and labor to keep that whole multi-owner network working.
National security. If I told you what I meant by this one, I’d have to … [redacted].

4. And then there’s the investment industry, which obviously needs very rapid analysis. When I was a stock analyst, I could be awakened by a phone call and told news that I would need to explain to 1000s of conference call listeners 20 minutes later. This was >30 years ago. The business moves yet faster today.

The investment industry has invested greatly in high-speed supporting technology for decades. That’s how Mike Bloomberg got so rich founding a vertical market tech business. But investment-oriented technology indeed remains a very vertical sector; little of it gets more broadly applied.

I think the reason may be that investing is about guesswork, while other use cases call for more definitive answers. In particular:

If you’re wrong 49.9% of the time in investing, you might still be a big winner.
In high-frequency trading, speed is paramount; you have to be faster than your competitors. In speed/accuracy trade-offs, speed wins.
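The first point is easy to check with toy numbers. The sketch below assumes a long run of equal-sized, independent bets with symmetric payoffs, which is a large simplification of real investing; `expected_profit` and its parameters are hypothetical names.

```python
def expected_profit(p_wrong, gain, loss, trades):
    """Expected total profit over many equal-sized bets.

    gain and loss are the amounts won or lost per bet; treating every
    bet as the same size and independent is a simplifying assumption.
    """
    p_right = 1 - p_wrong
    return trades * (p_right * gain - p_wrong * loss)


# Wrong 49.9% of the time, winning or losing one unit per bet, a million bets.
even_odds = expected_profit(p_wrong=0.499, gain=1.0, loss=1.0, trades=1_000_000)
print(round(even_odds))
```

Under those assumptions the tiny 0.2-point edge is worth about 2,000 units of expected profit over a million bets, and it grows in proportion to the number of bets. That is part of why speed and volume can matter more than being definitively right.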

5. Of course, it’s possible to overstate these requirements. As in all real-time discussions, one needs to think hard about:

How much speed is important in meeting users’ needs.
How much additional speed, if any, is important in satisfying users’ desires.

But overall, I have little doubt that rapid analytics is a legitimate area for technology advancement and growth.

Notes on the transition to the cloud

Curt Monash — Tue, 04 Oct 2016 02:22:21 +0000

1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:

The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
Software as a service, aka SaaS.
Co-location in off-premises data centers, aka colo.
On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.

Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing them, including Oracle and Microsoft.

This is a good example of Monash’s Laws of Commercial Semantics.

2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.

This fact now seems to be widely understood.

3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.

This is frequently a good reason for new — i.e. “greenfield” — apps to run in the cloud.

4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.

In connection with that point, it might be interesting to note:

In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
Other big users were hospitals and stockbrokers.
The US intelligence agencies are building out their own shared, dedicated cloud.

5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.

In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.

6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.

7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:

Does your application need to run close to your customers’ largest databases?
Do your customers still avoid the public cloud?

If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.

8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:

Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
Departments don’t necessarily care about issues that give central IT pause.
Departments sometimes buy things that only are available via remote delivery, e.g. narrowly focused SaaS applications or market data.

I discussed some of this in my recent post on vendor lock-in.

9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.

Those concerns seem to have been largely alleviated.

10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.

11. If you think about system requirements:

There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
Server-side computing, however, is more homogeneous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.

I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.

So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.

Otherwise, clusters will probably be organized by database, in the most expansive sense of the term. For example, there could be separate clusters for:

Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
Relational data warehousing, whether in an analytic RDBMS or otherwise.
Log files, perhaps managed in Hadoop or Splunk.
Your website and other internet back-ends, perhaps running over NoSQL data stores.
Text documents managed by some kind of search engine.
Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)

Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.

What is AI, and who has it?

Curt Monash — Tue, 01 Dec 2015 09:25:46 +0000

This is part of a four-post series spanning two blogs.

One post gives a general historical overview of the artificial intelligence business.
One post specifically covers the history of expert systems.
One post (this one) gives a general present-day overview of the artificial intelligence business.
One post explores the close connection between machine learning and (the rest of) AI.

1. “Artificial intelligence” is a term that usually means one or more of:

“Smart things that computers can’t do yet.”
“Smart things that computers couldn’t do until recently.”
“Technology that has emerged from the work of computer scientists who said they were doing AI.”
“Underpinnings for other things that might be called AI.”

But that covers a lot of ground, especially since reasonable people might disagree as to what constitutes “smart”.

2. Examples of what has been called “AI” include:

Rule-based processing, especially if it is referred to as “expert systems”.
Machine learning.
Many aspects of “natural language processing” — a term almost as overloaded as “artificial intelligence” — including but not limited to:
- Text search.
- Speech recognition, especially but not only if it seems somewhat lifelike.
- Automated language translation.
- Natural language database query.
Machine vision.
Autonomous vehicles.
Robots, especially but not only ones that seem somewhat lifelike.
Automated theorem proving.
Playing chess at an Elo rating of 1600 or better.
Beating the world champion at chess.
Beating the world champion at Jeopardy.
Anything that IBM brands or rebrands as “Watson”.

That last bit is awkward, as IBM is doing the industry a major disservice via its recklessly confusing Watson marketing, which is instantiating Monash’s First Law of Commercial Semantics — Bad jargon drowns out good. I suspect there’s an interesting debate under it all, in which IBM stands almost alone against the whole rest of the industry by sticking to the old academic belief that sophisticated knowledge representation is the key to AI. But it’s hard to be sure, because IBM’s Watson marketing is so full of smoke that reality, if any, doesn’t show through.

3. When I think of present-day AI commercialization, what comes to mind is mainly:

Multiple efforts in speech recognition, from Google, Microsoft, Apple, and Nuance Communications. (I’m not sure whether Apple’s is mainly in-house or mainly outsourced.)
Other natural language efforts, such as Google’s in machine translation.
Technology related to robots and autonomous vehicles, specifically in machine vision, other senses (e.g. touch), and reactions (e.g. driving decisions).
- Google is the most visible player here. It’s gotten a lot of press for driverless automobiles, and it bought up a lot of robotics companies when they were hurting due to a hiatus in DARPA funding.
- Large auto companies will surely compete.
Gesture interpretation and similar kinds of recognition.
- Microsoft has the most visibility here, due to Kinect, and is trying to bring similar technology to general computing.
- Facebook, Google et al. are making major investments into the closely related area of virtual reality. Facebook is also building an AI team.
Machine learning.
- Machine learning in general can be regarded as part of AI, at least historically.
- Machine learning is a key component of many AI efforts. Google in particular has made a big fuss about it, suggesting that data is generally more important than algorithms.
Whatever parts of the IBM story, if any, are actually real.

So with one big exception, commercial AI seems to be concentrated at a small number of behemoth companies. The exception is machine learning itself, which is being adopted and developed on a much broader basis.

4. AngelList seems to say I’m wrong, citing 576 different AI startups. CrunchBase offers 436 AI startups. So maybe some of those startups will succeed. We’ll see.

5. Some of the reasons for AI’s concentrated industry structure lie in general business and economics.

A large company can risk research with unclear payoffs a lot more easily than a small one can.
AI is prestigious and/or cool. Some large companies like to indulge in stuff like that.

Yes, those reasons are somewhat counteracted by the facts that:

VCs know they’re investing in companies whose eventual exit will likely be an acquisition.
Some of those acquisitions are for a LOT of money.

But I think they apply even so. And by the way — to date, most AI companies have not been acquired for very high prices.

6. Some of the reasons for AI industry concentration are more specifically technological.

Some AI — e.g. speech recognition or autonomous vehicle navigation — could be the “sizzle” that differentiates offerings in huge business sectors. Thus, a “win” in AI could have more value to an already-large electronics, search or automobile company than to a startup.
The largest companies in those huge sectors can afford huge amounts of training data, or may even get it as a byproduct of their other activities. Hence they can more easily afford massive exercises in the relevant machine learning.

My paradigmatic example for the latter point is Google with anything connected to search, such as translation (which it does of search results) or natural language recognition (which it does of search queries).

If you want to do an AI startup, those are some of the competitive factors that you need to beat.

Related links

An earlier version of some of this material was in my January, 2014 post on The games of Watson.
Earlier this year, I posted about robotics.
There is quite a bit of AI humor.

Sources of differentiation

Curt Monash — Mon, 26 Oct 2015 19:31:38 +0000

Obviously, a large fraction of what I write about involves technical differentiation. So let’s try for a framework where differentiation claims can be placed in context. This post will get through the generalities. The sequels will apply them to specific cases.

Many buying and design considerations for IT fall into six interrelated areas:

Scope: What does the technology even purport to do? This consideration applies to pretty much everything.
- Usually, this means something like features.
- However, there’s an important special case in which the important features are the information content. (Examples: Arguably Google, and the Bloomberg service for sure.)
Accuracy: How correctly does the technology do it? This can take multiple forms.
- Sometimes, a binary right/wrong distinction pretty much suffices, with an acceptable error rate of zero. If you’re writing data, it shouldn’t get lost. If you’re doing arithmetic, it should be correct. Etc.
- Sometimes, there’s a clear right/wrong distinction, but error rates are necessarily non-zero, often with a trade-off between the rates for false positives and false negatives. (In text search and similar areas, those rates are measured respectively as precision and recall.) Security is a classic example. Many other cases arise when trying to identify problems or
- Sometimes accuracy is on a scale. Predictive modeling results are commonly of that kind. So are text search, voice recognition and so on.
Other trustworthiness.
- Reliability, availability and security are considerations in almost any IT scenario.
- Also crucial are any factors that are perceived as affecting the risk of project failure. Sometimes, these are lumped together as (part of) maturity.
Speed. There’s a great real and/or perceived “need for speed”.
- On the user level:
  - There are many advantages to quick results, “real time” or otherwise.
  - In particular, analysis is often more accurate if you have time for more iterations or intermediate steps.
  - Please recall that speed can actually have multiple kinds of benefit. For example, it can reduce costs, it can improve accuracy, it can improve user experience, or it can enable capabilities that would otherwise be wholly impractical.
- There can also be considerations of time to (initial) value, although people sometimes overrate how often this is a function of the technology itself.
- Consistency of performance can be an important aspect of product maturity.
User experience. Ideally, using a system is easy and pleasurable, or at least not unpleasant.
- Ease of use often equates to ease of (re)learning …
- … but there are exceptions, generally for what might be considered “power users”.
- Speed and performance can avoid a lot of unpleasant frustration.
- In some cases you can compel somebody — usually an employee — to use your interface. Often, however, you can’t, and that’s when user experience may matter most.
- An important category of user experience that doesn’t directly equate to ease or speed is recommendations. Of course, the more accurate the recommendations are, the better.
- Most systems have at least two categories of user experience — one for the true users, and one for the IT folks who manage it. The IT folks’ experience often depends not just on true UI features, but on how easy or difficult the underlying system is to deal with in the first place.
Cost, or more precisely TCO (Total Cost of Ownership). Cost is always important, and especially so if there are numerous viable alternatives.
- Sometimes money paid to the vendor really is the largest component of TCO.
- Often, however, hardware or IT personnel expenditures are the lion’s share of overall cost.
- Administrators’ user experience can affect a large chunk of TCO.
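
To make the Accuracy point about false positives and false negatives concrete, here is a minimal sketch of how precision and recall are computed. The function name and the sample sets are illustrative assumptions, not taken from any particular product.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two sets of item IDs.

    predicted: items the system flagged (e.g. documents a search returned)
    actual:    items that truly are relevant (hypothetical ground truth)
    """
    true_positives = len(predicted & actual)
    false_positives = len(predicted - actual)   # flagged, but wrongly
    false_negatives = len(actual - predicted)   # relevant, but missed
    precision = true_positives / (true_positives + false_positives) if predicted else 0.0
    recall = true_positives / (true_positives + false_negatives) if actual else 0.0
    return precision, recall


relevant = {1, 2, 3, 4}
strict = {1, 2}              # cautious system: flags little, rarely wrong
loose = {1, 2, 3, 5, 6, 7}   # eager system: flags a lot, misses less

print(precision_recall(strict, relevant))
print(precision_recall(loose, relevant))
```

The strict system never raises a false alarm but misses half of what matters; the loose one finds more but is wrong half the time. That is the trade-off in miniature.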

Related links

This post is starting out with two sequels, on data management and business intelligence respectively.
Issues of differentiation are central to my strategic worksheet.
When thinking about differentiation, keep in mind the distinction between wants and needs.
This post fits well with my claim that every product in a category is positioned along the same set of attributes.
In a post last year about differentiation, I wrote “Your spiffy innovation is important in fewer situations than you would like to believe.”
If you think you’re a rare exception to that rule, please see my post about over-optimism.

DataStax and Cassandra update

Curt Monash — Mon, 14 Sep 2015 06:02:59 +0000

MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, somebody with whom I’ve had strong consulting relationships at both a user and a vendor. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.

It seems fair to say that in most cases:

Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data.

Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:

DataStax trumpets British Gas’ plans for collecting a lot of sensor data and immediately offering it up for analysis.*
Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.

*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data.

While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind:

You can get any kind of data into them very fast; indeed, that’s a central part of what they were designed for.
In the general case, getting it back out for low-latency analytics is problematic …
… but there’s an increasing list of exceptions.

For DataStax Enterprise, exceptions start:

Formally, you can do almost anything in at least one of Solr or Spark/SparkSQL. So if volumes are low enough, you’re fine. In particular, Spark offers the potential to do many things at in-memory speeds.
Between Spark, the new functions, and general scripting, there are several ways to do low-latency aggregations. This can lead to “twinkling dashboards”.*
DataStax is alert to the need to stream data into Cassandra.
- That’s central to the NoSQL expectation of ingesting internet data very quickly.
- Kafka, Storm and Spark Streaming all seem to be in the mix.
Solr over Cassandra has a searchable RAM buffer, which can give the effect of real-time text indexing within a second or so of ingest.

*As much as I love the “twinkling dashboard” term — it reminds me of my stock analyst days — it does raise some concerns. In many use cases, human real-time BI should be closely integrated with the more historical kind.

DataStax Enterprise:

Is based on Cassandra 2.1.
Will probably never include Cassandra 2.2, waiting instead for …
… Cassandra 3.0, which will feature a storage engine rewrite …
… and will surely include Cassandra 2.2 features of note.

This connects to what I said previously in that Cassandra 2.2 adds some analytic features, specifically in the area of user-defined functions. Notes on Cassandra 2.2 UDFs include:

These are functions — not libraries, a programming language, or anything like that.
The “user-defined” moniker notwithstanding, the capability has been used to implement COUNT, SUM, AVG, MAX and so on.
You are meant to run user-defined functions on data in a single Cassandra partition; run them across partitions at your own performance risk.
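
As a rough illustration of how a function-based aggregate such as AVG can work, here is a plain-Python sketch of the state-function-plus-final-function pattern. It is an emulation under my own assumptions, not Cassandra code; the partition keys, readings and helper names are all hypothetical.

```python
from functools import reduce


def avg_state(state, value):
    """State function: fold one more value into a (count, total) pair."""
    count, total = state
    return (count + 1, total + value)


def avg_final(state):
    """Final function: turn the accumulated state into the answer."""
    count, total = state
    return total / count if count else None


def aggregate(rows, state_func, final_func, init):
    """Run the state function over every row, then apply the final function."""
    return final_func(reduce(state_func, rows, init))


# Hypothetical data: one list of temperature readings per partition key.
partitions = {
    "boiler-1": [60.0, 62.0, 64.0],
    "boiler-2": [70.0, 74.0],
}

# The intended use: each aggregate touches the rows of a single partition.
per_partition = {
    key: aggregate(rows, avg_state, avg_final, (0, 0.0))
    for key, rows in partitions.items()
}

# The risky use: one aggregate has to read rows from every partition.
all_rows = [t for rows in partitions.values() for t in rows]
overall = aggregate(all_rows, avg_state, avg_final, (0, 0.0))

print(per_partition, overall)
```

In this toy version the cross-partition case is just a longer list. In a real cluster it would mean pulling rows from many nodes, which is where the performance risk comes from.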

And finally, some general tidbits:

A while ago, Apple said it had >75,000 Cassandra nodes. The figure is surely bigger now.
There are at least several other petabyte range Cassandra installations, and several more half-petabyte ones.
Netflix is not one of those. Instead, it has many 10s of smaller Cassandra clusters.
There are Cassandra users with >1 million reads+writes per second.

Finally, a couple of random notes:

One of the text search use cases for Solr/Cassandra is to — in one query — get at information that originated in multiple places, e.g. for reasons of time period or geography. (I hear this about text search across lots of database technologies, relational and non-relational alike.)
As big a change as Cassandra 3.0 will be, it will not require that you take down your applications for an upgrade. That hasn’t been necessary since Cassandra 0.7.

MongoDB update

Curt Monash — Thu, 10 Sep 2015 10:33:13 +0000

One pleasure in talking with my clients at MongoDB is that few things are NDA. So let’s start with some numbers:

>2,000 named customers, the vast majority of which are unique organizations who do business with MongoDB directly.
~75,000 users of MongoDB Cloud Manager.
Estimated ~1/4 million production users of MongoDB total.

Also >530 staff, and I think that number is a little out of date.

MongoDB lacks many capabilities RDBMS users take for granted. MongoDB 3.2, which I gather is slated for early November, narrows that gap, but only by a little. Features include:

Some JOIN capabilities.
- Specifically, these are left outer joins, so they’re for lookup but not for filtering.
- JOINs are not restricted to specific shards of data …
- … but do benefit from data co-location when it occurs.
A BI connector. Think of this as a MongoDB-to-SQL translator. Using this does require somebody to go in and map JSON schemas and relational tables to each other. Once that’s done, the flow is:
- Basic SQL comes in.
- Filters and GroupBys are pushed down to MongoDB. A result set … well, it results.
- The result set is formatted into a table and returned to the system — for example a business intelligence tool — that sent the SQL.
Database-side document validation, in the form of field-specific rules that combine into a single expression against which to check a document.
- This is fairly simple stuff — no dependencies among fields in the same document, let alone foreign key relationships.
- MongoDB argues, persuasively, that this simplicity makes it unlikely to recreate the spaghetti code maintenance nightmare that was 1990s stored procedures.
- MongoDB concedes that, for performance, it will ordinarily be a good idea to still do your validation on the client side.
- MongoDB points out that enforcement can be either strict (throw errors) or relaxed (just note invalid documents to a log). The latter option is what makes it possible to install this feature without breaking your running system.
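
The strict-versus-relaxed distinction for document validation can be sketched in a few lines of plain Python. This is only an illustration of the idea as described above, not MongoDB’s actual API; the rule set, the `insert` helper and the list standing in for a collection are all hypothetical.

```python
import logging

logger = logging.getLogger("validation_sketch")

# Hypothetical field-specific rules. Each rule looks at one field only,
# so there are no dependencies among fields.
RULES = {
    "name": lambda v: isinstance(v, str) and len(v) > 0,
    "age": lambda v: isinstance(v, int) and 0 <= v < 150,
}


def is_valid(doc):
    """Combine the per-field rules into a single yes/no check."""
    return all(field in doc and rule(doc[field]) for field, rule in RULES.items())


def insert(collection, doc, strict=True):
    """Append doc to collection, enforcing or merely logging validation."""
    if not is_valid(doc):
        if strict:
            raise ValueError(f"document failed validation: {doc!r}")
        logger.warning("invalid document accepted: %r", doc)
    collection.append(doc)


people = []
insert(people, {"name": "Ada", "age": 36})               # valid, stored
insert(people, {"name": "", "age": 36}, strict=False)    # invalid, logged, stored
```

The relaxed mode is what lets you turn validation on for a system that already holds imperfect documents: nothing breaks, and the log tells you what to clean up.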

There’s also a closed-source database introspection tool coming, currently codenamed MongoDB Scout.

The name will change, in part because if you try to search on that name you’ll probably find an unrelated Scout.
Scout samples data, runs stats, and all that stuff.
Scout is referred to as a “schema introspection” tool, but I’m not sure why; schema introspection sounds more like a feature or architectural necessity than an actual product.

As for storage engines:

WiredTiger, which was the biggest deal in MongoDB 3.0, will become the default in 3.2. I continue to think analogies to InnoDB are reasonably appropriate.
An in-memory storage engine option was also announced with MongoDB 3.0. Now there’s a totally different in-memory option. However, details were not available at posting time. Stay tuned.
Yet another MongoDB storage engine, based on or akin to WiredTiger, will do encryption. Presumably, overhead will be acceptably low. Key management and all that will be handled by usual-suspect third parties.

Finally — most data management vendors brag to me about how important their text search option is, although I’m not necessarily persuaded. MongoDB does have built-in text search, of course, of which I can say:

It’s a good old-fashioned TF/IDF algorithm. (Term Frequency/Inverse Document Frequency.)
About the fanciest stuff they do is tokenization and stemming. (In a text search context, tokenization amounts to the identification of word boundaries and the like. Stemming is noticing that alternate forms of the same word really are the same thing.)
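
For readers who haven’t seen it spelled out, here is a minimal sketch of tokenization, stemming and TF/IDF scoring. The suffix-stripping stemmer is deliberately crude and English-only, and the weighting is one common textbook variant; neither is claimed to be what MongoDB actually implements.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Find word boundaries: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Crude stemmer: strip a trailing 'ing' or 's' if enough word remains."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def terms(text):
    return [stem(t) for t in tokenize(text)]


def tf_idf(docs):
    """Return one {term: score} dict per document."""
    term_lists = [terms(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(t for tl in term_lists for t in set(tl))
    scores = []
    for tl in term_lists:
        counts = Counter(tl)
        scores.append({
            t: (c / len(tl)) * math.log(n / doc_freq[t])
            for t, c in counts.items()
        })
    return scores


docs = [
    "Databases store documents",
    "Indexing a document speeds lookups",
    "Databases index documents",
]
scores = tf_idf(docs)
print(scores[0])
```

Because stemming maps “documents” and “document” to the same term, that term appears in every document and scores zero, while a rarer term such as “store” scores highest in the first document.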

This level of technology was easy to get in the 1990s. One thing that’s changed in the intervening decades, however, is that text search commonly supports more languages. MongoDB offers stemming in 8 or 9 languages for free, plus a paid option via Basis for other languages yet.

Related links

BI for NoSQL (March, 2015)
Uninterrupted DBMS operation (September, 2012)

IT-centric notes on the future of health care

Curt Monash — Tue, 26 May 2015 05:02:09 +0000

It’s difficult to project the rate of IT change in health care, because:

Health care is suffused with technology — IT, medical device and biotech alike — and hence has the potential for rapid change. However, it is also the case that …
… health care is heavily bureaucratic, political and regulated.

Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:

The human-generated part of what’s in ordinary paper health records today, but across a patient’s entire lifetime. This of course includes notes created by doctors and other care-givers.
Large amounts of machine-generated data, including:
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
  - Most tests exploit electronic technology. Progress in electronics is intense.
  - Biomedical research is itself intense.
  - In particular, most research technologies (for example gene sequencing) can be made cheap enough over time to be affordable clinically.
- The output of consumer health-monitoring devices — e.g. Fitbit and its successors. The buzzword here is “quantified self”, but what it boils down to is that every moment of our lives will be measured and recorded.

The vastly greater amounts of data cited above will allow for greatly changed analytics.

Right now, medical decisions are made based on research that looks at a few data points each for a specially-recruited sample of patients, then draws conclusions based on simplistic and questionable statistical methods.
More sophisticated analytic methods are commonly used, but almost always still to aid in the discovery and formation of hypotheses that will then be validated, if at all, using the bad old analytic techniques.
State of the art predictive modeling, applied to vastly more data, will surely yield greatly better results.

And so I believe that health care itself will be revolutionized.

Diagnosis will be much more accurate, pretty much across the board, except in those limited areas where it’s already excellent today.
Medication regimens will be much more personalized. (Pharma manufacturing may have to change greatly as a result.) So will other treatments. So will diet/fitness regimens.
The vulnerable (elderly, hospital patients) will be more accurately and comprehensively monitored. Also, their care will likely be aided by robotics.
Some of the same things will be true of infants and toddlers. (In other cases, they get such close attention today that I can’t imagine how it could be greatly increased.)

I believe that this will all happen because I believe that it will make health care vastly more successful. And if I’m right about that, no obstacles will be able to prevent it from coming into play — not cost (which will keep going down in a quasi-Moore’s-Law way), not bureaucratic inertia (although that will continue to slow things greatly), and not privacy fears (despite the challenges cited below).

So what are the IT implications of all this?

I already mentioned the need for new (or newly-used) kinds of predictive modeling.
Probably in association with those, event detection — which in many but not all cases will amount to anomaly detection — will be huge. If one goal is to let the elderly and ailing live independently, but receive help when it’s needed — well, recognizing when that help is needed will be crucial. Similar dynamics will occur in hospitals.
And in support of that, there will be a great amount of monitoring, and hence strong demands upon sensors and recognition. Potentially, all five human senses will be mimicked, among others. These technologies will become even more important in health care if I’m right that robotics will play a big role.
Data quality will be a major challenge, especially in the doctors’-notes parts of health records. Reasons start:
- Different medical professionals might evaluate the same situation differently; diagnosis is a craft rather than a dumb, repeatable skill.
- If entries are selected from a predefined set of options, none may be a perfect match to the doctor’s actual opinion.
- Doctors often say what’s needful to have their decisions (care, tests, etc.) approved, whether or not it precisely matches what they really think. Thus, there are significant incentives to enter bad data.
- Free-text data is more central to health care than to many other application areas, and text data is inherently dirty.
- Health records are decades later than many other applications in moving from paper to IT.
Data integration problems will also be and indeed already are huge, because different health care providers have addressed the tough challenges of record-keeping in different ways.
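
To give a flavor of the event detection mentioned above, here is a very simple anomaly detector over a stream of sensor readings. It flags a reading that sits far from the mean of the readings just before it. The window size, threshold and heart-rate numbers are illustrative assumptions; real clinical monitoring would need far more care than this.

```python
from statistics import mean, stdev


def detect_anomalies(readings, window=5, threshold=3.0):
    """Return indexes of readings that deviate from the mean of the
    preceding `window` readings by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(readings)):
        recent = readings[i - window:i]
        mu = mean(recent)
        sigma = stdev(recent)
        if sigma == 0:
            # Flat window: no scale to judge against, so skip in this sketch.
            continue
        if abs(readings[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical heart-rate samples with one obvious spike.
heart_rate = [72, 74, 71, 73, 72, 75, 73, 140, 74, 72]
print(detect_anomalies(heart_rate))
```

Note that the spike itself then inflates the trailing window’s spread, so the readings right after it are judged against a much looser standard. That is one reason production systems tend to use more robust statistics.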

As for data management — well, almost everything discussed in this blog could come into play.

A person’s entire medical record resembles the kind of mess increasingly often dumped these days into NoSQL — typically MongoDB, Cassandra, or HBase.
There are plenty of business-transaction records in the mix, of the kind that have long been managed by RDBMS.
There are a whole lot of diverse machines in the mix, and managing the data to keep such a menagerie running is commonly the job of Splunk or streaming-enhanced Hadoop.
There’s a lot of free text in medical records. Also images, video and so on.
Since graph analytics is used in research today, it might at some point make its way into clinical use.

Finally, let me say:

Data-driven medicine cannot live up to its potential unless researchers can investigate data sets comprising private information of large numbers of people.
Researchers will not have the appropriate permissions unless privacy law moves toward a basis in data use, rather than exclusively regulating data possession.

Related links

The New York Times and Hacker News discussed the benefits of using your own medical records a couple months ago.
I wrote about the monitoring/early response aspects of health care in February, 2015.
Perhaps my most recent survey of privacy issues was in September, 2014.
A pretty good survey of the debate about statistical methods in medical research came out in December, 2013.