Google – DBMS 2 : DataBase Management System Services

The technology industry is under broad political attack

Curt Monash — Fri, 15 Dec 2017 09:25:55 +0000

I apologize for posting a December downer, but this needs to be said.

The technology industry is under attack:

From politicians and political pundits …
… especially from “populists” and/or the political right …
… in the United States and other countries.

These attacks:

Are in some cases specific to internet companies such as Google and Facebook.
In some cases threaten the tech industry more broadly.
Are in some cases part of general attacks on the educated/professional/“globalist”/“coastal” “elites”.

You’ve surely noticed some of these attacks. But you may not have noticed just how many different attacks and criticisms there are, on multiple levels.

1. Concerns about jobs, disruption, gentrification and so on are a Really Big Deal, causing large swaths of the population to regard technology as bad for their pocketbooks. In particular:

There’s tremendous concern about job loss to automation and/or globalization. Technology helps cause the first and enable the second.
Generally, when an industry destroys jobs, one hopes that it will create new ones to take their place. But while US technology companies have created many jobs, a lot of those are overseas.
Flaps about overseas finances, taxes, and so on aren’t helping. Apple, for example, has major issues in Europe and the US alike.
Working-class jobs that tech companies do create are often attacked for their pay and conditions, e.g. for Amazon warehouse workers or Uber drivers.
Even when the technology industry unquestionably creates good, domestic jobs, the industry may be attacked for them. Consider for example the concerns about cost of living/gentrification in Northern California.
“Sharing economy” companies such as Uber and Airbnb and others are involved in local political fights all around the world, as they undercut traditional service providers.

People who believe that technologists harm them are a major political force.

2. The technology industry is under considerable legislative, regulatory, and judicial pressure. For starters:

Tech companies are attacked for doing too little to aid law enforcement and government surveillance.
Tech companies are attacked for doing too much to aid law enforcement and government surveillance.
Tech companies are attacked for doing too little censorship.
Tech companies are attacked for doing too much censorship.
Privacy regulations are ever-changing.

Complicating things further, these challenges take different forms in different countries around the world.

Also:

China pressures foreign vendors to transfer technology into China.
Recent network neutrality developments in the US favor older telecom providers, at the expense of newer internet companies.
Anti-immigrant policies in the US threaten tech vendors.

I could keep going much longer than that. Government relations are a major, major issue for tech.

3. It is traditional to claim that advances in communication/media technologies will wreck society.

Television was going to make us mass-conformist couch potatoes.
Video games were going to make us violent couch potatoes.

This era brings similar concerns.

Social media makes us couch potatoes sitting in niche-conformist echo chambers.
Modern media over-stimulate us and wreck our attention spans.

I.e., the apocalypse is imminent, and tech is what will bring it on.

The most compelling version of that argument I’ve seen is Jean Twenge’s claims that there’s a teen mental health crisis perfectly matched in time to the rise of the smartphone. And to make any such claim seem particularly damning, please recall: Social media and gaming companies are clearly trying to foster a form of addiction in — well, in their users.

Current concerns may ebb just like previous generations’ did. But for now, they’re yet another aspect of a threat-filled environment.

4. What worries me most is this: The United States and other countries face relentless attacks on education, educators, science, scientists, and rationality itself. And there are no obvious limits to how bad these can get. China’s Cultural Revolution and the Cambodian genocide happened during my lifetime. Stalin and Hitler ruled during my parents’. All four took particular aim at people like us.

Bottom line: EVERYBODY in the technology industry should be or quickly become politically aware. We have an awful lot of politics to deal with.

More notes on the transition to the cloud

Curt Monash — Thu, 17 Aug 2017 09:11:01 +0000

Last year I posted observations about the transition to the cloud. Here are some further thoughts.

0. In case any doubt remained, the big questions about transitioning to the cloud are “When?” and “How?”. “Whether”, by way of contrast, is pretty much settled.

1. The answer to “When?” is generally “Over many years”. In particular, at most enterprises the cloud transition will span the tenures of multiple CIOs.

Few enterprises will ever execute on simple, consistent, unchanging “cloud strategies”.

2. The SaaS (Software as a Service) vs. on-premises tradeoffs are being reargued, except that proponents now spell SaaS C-L-O-U-D. (Ali Ghodsi of Databricks made a particularly energetic version of that case in a recent meeting.)

3. In most countries (at least in the US and the rest of the West), the cloud vendors deemed to matter are Amazon, followed by Microsoft, followed by Google. And so, when it comes to the public cloud, Microsoft is much, much more enterprise-savvy than its key competitors.

4. In another non-technical competitive factor: Wal-Mart isn’t the only huge company that is hostile to the Amazon cloud because of competition with other Amazon businesses.

5. It was once thought that in many small countries around the world, there would be OpenStack-based “national champion” cloud winners, perhaps as subsidiaries of the leading telecom vendors. This doesn’t seem to be happening.

Even so, some of the larger managed-economy and/or generally authoritarian countries will have one or more “national champion” cloud winners each — surely China, presumably Russia, obviously Iran, and probably some others as well.

6. While OpenStack in general seems to have fizzled, S3 compatibility has momentum.

7. Finally, let’s return to our opening points: The cloud transition will happen, but it will take considerable time. A principal reason for slowness is that, as a general rule, apps aren’t migrated to platforms directly; rather, they get replaced by new apps on new platforms when the time is right for them to be phased out anyway.

However, there’s a codicil to those generalities — in some cases it’s easier to migrate to the new platform than in others. The hardest migration was probably when the rise of RDBMS, the shift from mainframes to UNIX and the switch to client/server all happened at once; just about nothing got ported from the old platforms to the new. Easier migrations included:

The switch from Unix to Linux. They were very similar.
The adoption of virtualization. A major purpose of the technology was to make migration easy.
The initial adoption of DBMS. Then-legacy apps relied on flat file systems, which DBMS often found easy to emulate.

The cloud transition is somewhere in the middle between those extremes. On the “easy” side:

Popular database management technologies and so on are available in the cloud just as they are on-premises.
Major app vendors are doing the hard work of cloud ports themselves.

Nonetheless, the public cloud is in many ways a whole new computing environment — and so for the most part, customer-built apps will prove too difficult to migrate. Hence my belief that overall migration to the cloud will be very incremental.

Analyzing the right data

Curt Monash — Thu, 13 Apr 2017 12:05:43 +0000

0. A huge fraction of what’s important in analytics amounts to making sure that you are analyzing the right data. To a large extent, “the right data” means “the right subset of your data”.

1. In line with that theme:

Relational query languages, at their core, subset data. Yes, they all also do arithmetic, and many do more math or other processing than just that. But it all starts with the set theory.
Underscoring the power of this approach, other data architectures over which analytics is done usually wind up with SQL or “SQL-like” language access as well.
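To make the "subset first, arithmetic second" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns and its rows are invented purely for illustration.

```python
import sqlite3

# In-memory database with a small, made-up fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", "widgets", 100.0),
        ("East", "gadgets", 250.0),
        ("West", "widgets", 75.0),
        ("West", "gadgets", 300.0),
    ],
)

# WHERE picks the subset (the set theory); SUM is the arithmetic layered on top.
east_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = ?", ("East",)
).fetchone()[0]
print(east_total)
```

The interesting design point is how little of the query is arithmetic: almost all of it specifies which rows count.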

2. Business intelligence interfaces today don’t look that different from what we had in the 1980s or 1990s. The biggest visible* changes, in my opinion, have been in the realm of better drilldown, à la QlikView and then Tableau. Drilldown, of course, is the main UI for business analysts and end users to subset data themselves.

*I used the word “visible” on purpose. The advances at the back end have been enormous, and much of that redounds to the benefit of BI.
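Stripped of its UI, drilldown is just "aggregate, pick a slice, aggregate again at a finer grain". The sketch below assumes a hypothetical list of (region, store, revenue) rows and a helper named `total_by`; neither comes from any BI product.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, store, revenue).
rows = [
    ("East", "Boston", 120),
    ("East", "New York", 300),
    ("West", "Seattle", 90),
    ("West", "San Francisco", 210),
]


def total_by(rows, key_index, value_index=2):
    """Sum the value column, grouped by the column at key_index."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return dict(totals)


# Top level: revenue by region.
by_region = total_by(rows, key_index=0)

# Drill down: keep only the East rows, then regroup by store.
east_rows = [row for row in rows if row[0] == "East"]
east_by_store = total_by(east_rows, key_index=1)

print(by_region)
print(east_by_store)
```

The drill step is nothing more than a subset followed by the same aggregation at a lower level of the hierarchy.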

3. I wrote 2 1/2 years ago that sophisticated predictive modeling commonly fit the template:

Divide your data into clusters.
Model each cluster separately.

That continues to be tough work. Attempts to productize shortcuts have not caught fire.
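The two-step template can be sketched in a few lines. This is a toy under stated assumptions: the cluster labels are simply given (in practice they would come from a clustering step), the per-cluster model is a hand-rolled least-squares line, and the names `fit_line`, `observations` and the account labels are invented.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical observations: (cluster label, x, y).
observations = [
    ("small_accounts", 1, 2), ("small_accounts", 2, 4), ("small_accounts", 3, 6),
    ("large_accounts", 1, 10), ("large_accounts", 2, 9), ("large_accounts", 3, 8),
]

# Step 1: divide the data into clusters.
clusters = defaultdict(list)
for label, x, y in observations:
    clusters[label].append((x, y))

# Step 2: model each cluster separately.
models = {label: fit_line(points) for label, points in clusters.items()}
print(models)
```

In this made-up data the two clusters trend in opposite directions, which a single pooled fit would hide. That is the kind of effect the template is meant to capture.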

4. In an example of the previous point, anomaly management technology can, in theory, help shortcut any type of analytics, in that it tries to identify what parts of your data to focus on (and why). But it’s in its early days; none of the approaches to general anomaly management has gained much traction.

5. Marketers have vast amounts of information about us. It starts with every credit card transaction line item and a whole lot of web clicks. But it’s not clear how many of those (10s of) thousands of columns of data they actually use.

6. In some cases, the “right” amount of data to use may actually be tiny. Indeed, some statisticians claim that fewer than 10 data points may be enough to get a good model. I’m skeptical, at least as to the practical significance of such extreme figures. But on the more plausible side — if you’re hunting bad guys, it may not take very many separate facts before you have good evidence of collusion or fraud.

Internet fraud excepted, of course. Identifying that usually involves sifting through a lot of log entries.

7. All the needle-hunting in the world won’t help you unless what you seek is in the haystack somewhere.

Often, enterprises explicitly invest in getting more data.
Keeping everything you already generate is the obvious choice for most categories of data, but some of the lowest-value-per-bit logs may forever be thrown away.

8. Google is famously in the camp that there’s no such thing as too much data to analyze. For example, it famously uses >500 “signals” in judging the quality of potential search results. I don’t know how many separate data sources those signals are informed by, but surely there are a lot.

9. Few predictive modeling users demonstrate a need for vast data scaling. My support for that claim is a lot of anecdata. In particular:

Some predictive modeling techniques scale well. Some scale poorly. The level of pain around the “scale poorly” aspects of that seems to be fairly light (or “moderate” at worst). For example:
- In the previous technology generation, analytic DBMS and data warehouse appliance vendors tried hard to make statistical packages scale across their systems. Success was limited. Nobody seemed terribly upset.
- Cloudera’s Data Science Workbench messaging isn’t really scaling-centric.
Spark’s success in machine learning is rather rarely portrayed as centering on scaling. And even when it is, Spark basically runs in memory, so each Spark node isn’t processing all that much data.

10. Somewhere in this post — i.e. right here — let’s acknowledge that the right data to analyze may not be exactly what was initially stored. Data munging/wrangling/cleaning/preparation is often a big deal. Complicated forms of derived data can be important too.

11. Let’s also mention data marts. Basically, data marts subset and copy data, either because the data will be easier to analyze in its copied form, or because their builders want to separate workloads between the original and copied data stores.

If we assume the data is on spinning disks or even flash, then the need for that strategy declined long ago.
Suppose you want to keep data entirely in memory? Then you might indeed want to subset-and-copy it. But with so many memory-centric systems doing decent jobs of persistent storage too, there’s often a viable whole-dataset management alternative.

But notwithstanding the foregoing:

Security/access control can be a good reason for subset-and-copy.
So can other kinds of administrative simplification.
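Here is a minimal sketch of subset-and-copy for the access-control case, again using Python's built-in sqlite3 module. The `orders` table, the `east_mart` name and the data are assumptions made up for the example; a real mart would usually live in a separate database or system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer TEXT, region TEXT, card_number TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("alice", "East", "4111-0000", 50.0),
        ("bob", "West", "4222-0000", 70.0),
        ("carol", "East", "4333-0000", 20.0),
    ],
)

# Subset (East rows, non-sensitive columns only) and copy into a separate table.
conn.execute(
    "CREATE TABLE east_mart AS "
    "SELECT customer, amount FROM orders WHERE region = 'East'"
)

# The mart simply has no card_number column to leak.
mart_columns = [row[1] for row in conn.execute("PRAGMA table_info(east_mart)")]
mart_rows = conn.execute(
    "SELECT customer, amount FROM east_mart ORDER BY customer"
).fetchall()
print(mart_columns, mart_rows)
```

Analysts who are granted access only to the copy cannot see the sensitive column at all, which is often simpler to administer than fine-grained permissions on the original.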

12. So what does this all suggest going forward? I believe:

Drilldown is and will remain central to BI. If your BI doesn’t support robust drilldown, you’re doing it wrong. “Real-time” use cases are not exceptions to this rule.
In a strong overlap with the previous point, drilldown is and will remain central to monitoring. Whatever monitoring means to you, the ability to pinpoint the specific source of interesting signals is crucial.
The previous point can be recast as saying that it’s crucial to identify, isolate and explain anomalies. Some version(s) of anomaly management will become a big deal.
SQL and “SQL-like” languages will remain integral to analytic processing for a long time.
Memory-centric analytic frameworks such as Spark will continue to win. The data size constraints imposed by memory-centric processing will rarely cause difficulties.

Related links

Other recent “unifying-theme” posts focused on monitoring and coordination.
My 2013 post on what matters in investigative analytics still holds up pretty well.

Notes on the transition to the cloud

Curt Monash — Tue, 04 Oct 2016 02:22:21 +0000

1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:

The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
Software as a service, aka SaaS.
Co-location in off-premises data centers, aka colo.
On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.

Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.

This is a good example of Monash’s Laws of Commercial Semantics.

2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.

This fact now seems to be widely understood.

3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.

This is frequently a good reason for new — i.e. “greenfield” — apps to run in the cloud.

4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.

In connection to that point, it might be interesting to note:

In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
Other big users were hospitals and stockbrokers.
The US intelligence agencies are building out their own shared, dedicated cloud.

5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.

In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.

6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.

7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:

Does your application need to run close to your customers’ largest databases?
Do your customers still avoid the public cloud?

If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.

8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:

Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
Departments don’t necessarily care about issues that give central IT pause.
Departments sometimes buy things that only are available via remote delivery, e.g. narrowly focused SaaS applications or market data.

I discussed some of this in my recent post on vendor lock-in.

9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.

Those concerns seem to have been largely alleviated.

10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.

11. If you think about system requirements:

There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
Server-side computing, however, is more homogeneous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.

I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.

So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.

Otherwise, clusters will probably be organized by database, in the most expansive sense of the term. For example, there could be separate clusters for:

Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
Relational data warehousing, whether in an analytic RDBMS or otherwise.
Log files, perhaps managed in Hadoop or Splunk.
Your website and other internet back-ends, perhaps running over NoSQL data stores.
Text documents managed by some kind of search engine.
Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)

Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.

Governments vs. tech companies — it’s complicated

Curt Monash — Thu, 19 May 2016 03:42:01 +0000

Numerous tussles fit the template:

A government wants access to data contained in one or more devices (mobile/personal or server as the case may be).
The computer’s manufacturer or operator doesn’t want to provide it, for reasons including:
- That’s what customers prefer.
- That’s what other governments require.
- Being pro-liberty is the right and moral choice. (Yes, right and wrong do sometimes actually come into play.)

As a general rule, what’s best for any kind of company is — pricing and so on aside — whatever is best or most pleasing for their customers or users. This would suggest that it is in tech companies’ best interest to favor privacy, but there are two important quasi-exceptions:

Recommendation/personalization. E-commerce and related businesses rely heavily on customer analysis and tracking.
When the customer is the surveiller. Governments pay well for technology that is used to watch over their citizens.

I used the “quasi-” prefix because screwing the public is risky, especially in the long term.

Something that is not even a quasi-exception to the tech industry’s actual or potential pro-privacy bias is governmental mandates to let their users be watched. In many cases, governments compel privacy violations, by threat of severe commercial or criminal penalties. Tech companies should and often do resist these mandates as vigorously as they can, in the courts and/or via lobbying as the case may be. Yes, companies have to comply with the law. However, it’s against their interests for the law to compel privacy violations, because those make their products and services less appealing.

The most visible example of all this right now is the FBI/Apple kerfuffle. To borrow a phrase — it’s complicated. Among other aspects:

Syed Rizwan Farook, one of the San Bernardino terrorist murderers, had 3 cell phones. He carefully destroyed his 2 personal phones before his attack, but didn’t bother with his iPhone from work.
Notwithstanding this clue that the surviving phone contained nothing of interest, the FBI wanted to unlock it. It needed technical help to do so.
The FBI got a court order commanding Apple’s help. Apple refused and appealed the order.
The FBI eventually hired a third party to unlock Farook’s phone, for a price that was undisclosed but >$1.3 million.
Nothing of interest was found on the phone.
Stories popped up of the FBI asking for Apple’s help unlocking numerous other iPhones. The courts backed Apple or not depending on how they interpreted the All Writs Act. The All Writs Act was passed in the first-ever session of the US Congress, in 1789, and can reasonably be assumed to reflect all the knowledge that the Founders possessed about mobile telephony.
It’s widely assumed that the NSA could have unlocked the phones for the FBI — but it didn’t.

Russell Brandom of The Verge collected links explaining most of the points above.

With that as illustration, let’s go to some vendor examples:

Apple — which sells devices much more than advertising — has clearly decided that being (seen as) pro-privacy is its preferred course.
Microsoft — all rumors about Skype backdoors and the like notwithstanding — has made a similar choice. Notably, it is struggling to keep data hosted on its European servers out of US subpoena reach.
Amazon and Google, by way of contrast, whose core consumer businesses depend on recommendation/personalization, have not been so visible about protecting the privacy of their cloud services’ data.
BlackBerry, meanwhile, seems to split the difference, being pro-privacy in its enterprise server business but acquiescing to surveillance in its consumer operations.

All of these cases seem consistent with my comments about vendors’ privacy interests above.

Bottom line: The technology industry is correct to resist government anti-privacy mandates by all means possible.

Cloudera in the cloud(s)

Curt Monash — Fri, 22 Jan 2016 07:46:34 +0000

Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.

Making Cloudera run in the cloud has three major aspects:

Cloudera’s usual software, ported to run on the cloud platform(s).
Cloudera Director, which for example launches cloud instances.
Points of integration, e.g. taking information about security-oriented roles from the platform and feeding them to the role-based security that is specific to Cloudera Enterprise.

Features new in this week’s release of Cloudera Director include:

An API for job submission.
Support for spot and preemptible instances.
High availability.
Kerberos.
Some cluster repair.
Some cluster cloning.

I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.

As for porting, let me start by noting:

Shared-nothing analytic systems, RDBMS and Hadoop alike, run much better in the cloud than they used to.
Even so, it seems that the future of Hadoop in the cloud is to rely on object storage, such as Amazon S3.

That makes sense in part because:

The applications where shared nothing most drastically outshines object storage are probably the ones in which data can just be filtered from disk — spinning-rust or solid-state as the case may be — and processed in place.
By way of contrast, if data is being redistributed a lot then the shared nothing benefit applies to a much smaller fraction of the overall workload.
The latter group of apps are probably the harder ones to optimize for.

But while it makes sense, much of what’s hardest about the ports involves the move to object storage. The status of that is roughly:

Cloudera already has a lot of its software running on Amazon S3, with Impala/Parquet in beta.
Object storage integration for Windows Azure is “in progress”.
Object storage integration for Google GCP is “to be determined”.
Security for object storage — e.g. encryption — is a work in progress.
Cloudera Navigator for object storage is a roadmap item.

When I asked about particularly hard parts of porting to object storage, I got three specifics. Two of them sounded like challenges around having less detailed control, specifically in the area of consistency model and capacity planning. The third I frankly didn’t understand,* which was the semantics of move operations, relating to the fact that they were constant time in HDFS, but linear in size on object stores.

*It’s rarely obvious to me why something is O(1) until it is explained to me.
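One plausible reading of the move-semantics point, offered as a toy model rather than as Cloudera's explanation: in a hierarchical file system a directory rename can be a single metadata update, whereas an object store with flat keys and no rename primitive has to copy and delete every object under the old prefix. The classes `TreeNamespace` and `FlatObjectStore` below are hypothetical stand-ins, not real HDFS or S3 APIs, and the toy counts objects where a real store would also pay for the bytes copied.

```python
class TreeNamespace:
    """Toy hierarchical namespace: a directory is one entry pointing at its files."""

    def __init__(self):
        self.dirs = {}  # directory name -> {file name: contents}

    def rename_dir(self, old, new):
        # One metadata update, however many files the directory holds.
        self.dirs[new] = self.dirs.pop(old)
        return 1  # metadata operations performed


class FlatObjectStore:
    """Toy object store: flat keys, no directories, no rename primitive."""

    def __init__(self):
        self.objects = {}  # full key -> contents

    def rename_prefix(self, old, new):
        # Emulate a "move" as copy-then-delete for every key under the prefix.
        keys = [k for k in self.objects if k.startswith(old)]
        for key in keys:
            self.objects[new + key[len(old):]] = self.objects[key]
            del self.objects[key]
        return len(keys)  # objects copied


tree = TreeNamespace()
tree.dirs["tmp/"] = {f"part-{i}": b"x" for i in range(1000)}

store = FlatObjectStore()
store.objects = {f"tmp/part-{i}": b"x" for i in range(1000)}

tree_ops = tree.rename_dir("tmp/", "final/")
store_ops = store.rename_prefix("tmp/", "final/")
print(tree_ops, store_ops)
```

If that reading is right, the work on the object store grows with the amount of data moved, while the file-system rename stays constant.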

Naturally, we talked about competition, differentiation, adoption and all that stuff. Highlights included:

In general, Cloudera’s three big marketing messages these days can be summarized as “Fast”, “Easy”, and “Secure”.
Notwithstanding the differences as to which parts of the Cloudera stack run on premises, on Amazon AWS, on Microsoft Azure or on Google GCP, Cloudera thinks it’s important that its offering is the “same” on all platforms, which allows “hybrid” deployment.
In general, Cloudera still sees Hortonworks as a much bigger competitor than MapR or IBM.
Cloudera fondly believes that Cloudera Manager is a significant competitive advantage vs. Ambari. (This would presumably be part of the “Easy” claim.)
In particular, Cloudera asserts it has better troubleshooting/monitoring than the cloud alternatives do, because of superior drilldown into details.
Cloudera’s big competitor on the Amazon platform is Elastic MapReduce (EMR). Cloudera points out that EMR lacks various capabilities that are in the Cloudera stack. Of course, versions of these capabilities are sometimes found in other Amazon offerings, such as Redshift.
Cloudera’s big competitor on Azure is HDInsight. Cloudera sells against that via:
- General Cloudera vs. Hortonworks distinctions.
- “Hybrid”/portability.

Cloudera also offered a distinction among three types of workload:

ETL (Extract/Transform/Load) and “modeling” (by which Cloudera seems to mean predictive modeling).
- Cloudera pitches this as batch work.
- Cloudera tries to deposition competitors as being good mainly at these kinds of jobs.
- This can be reasonably said to be the original sweet spot of Hadoop and MapReduce — which fits with Cloudera’s attempt to portray competitors as technical laggards.
- Cloudera observes that these workloads tend to call for “transient” jobs. Lazier marketers might trot out the word “elasticity”.
BI (Business Intelligence) and “analytics”, by which Cloudera seems to mainly mean Impala and Spark.
“Application delivery”, by which Cloudera means operational stuff that can’t be allowed to go down. Presumably, this is a rough match to what I — and by now a lot of other folks as well — call short-request processing.

While I don’t agree with terminology that says modeling is not analytics, the basic distinction being drawn here makes considerable sense.

Machine learning’s connection to (the rest of) AI

Curt Monash — Tue, 01 Dec 2015 09:28:22 +0000

This is part of a four-post series spanning two blogs.

One post gives a general historical overview of the artificial intelligence business.
One post specifically covers the history of expert systems.
One post gives a general present-day overview of the artificial intelligence business.
One post (this one) explores the close connection between machine learning and (the rest of) AI.

1. I think the technical essence of AI is usually:

Inputs come in.
Decisions or actions come out.
More precisely — inputs come in, something intermediate is calculated, and the intermediate result is mapped to a decision or action.
The intermediate results are commonly either numerical (a scalar or perhaps a vector of scalars) or a classification/partition into finitely many possible intermediate outputs.

Of course, a lot of non-AI software can be described the same way.

To check my claim, please consider:

It fits rules engines/expert systems so simply it’s barely worth saying.
It fits any kind of natural language processing; the intermediate results might be words or phrases or concepts or whatever.
It fits machine vision beautifully.

To see why it’s true from a bottom-up standpoint, please consider the next two points.

2. It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response. Examples of what I mean include:

Think of what’s on an IQ test, or a commonly accepted substitute for same. (The SAT sometimes substitutes.) A lot of that is pattern recognition.
When the “multiple intelligences” or just “emotional intelligence” concepts gained currency, the core idea was the recognition of various different kinds of pattern. (E.g., reading somebody else’s emotions, something that I’m not nearly as good at as I am at the skills measured by standard IQ tests.)
The central mechanism of neurotransmission is a neuron recognizing that an action potential has crossed a certain threshold, and firing as a result.
Traditional areas of AI include natural language recognition, machine vision, and so on.
Another traditional area of AI is rules-based processing — conditions in, decision out.
Back in the 1980s (less so today), it was thought that a core underpinning for AI technology was knowledge representation. That said, as much as I like interesting data structures, I have my doubts.
- The Semantic Web grew out of this idea.
- Also, the single most enduring proponent of the centrality of knowledge representation was probably Doug Lenat, who gave his name to a famed unit of bogosity.
- While the previous two points are probably just coincidence, the juxtaposition is suggestive.

3. In most computational cases, pattern recognition and response boil down to scoring and/or classification (whether in a narrow machine learning sense of “classification” or otherwise). What I mean by this is:

I’m thinking of scoring as a function that maps inputs into scalar values. (Or a vector of scalars.)
I’m thinking of classification as a function that maps inputs into a finite range of possible values. (Note that this is mathematically equivalent to a finite partition on the set of inputs.)
I’m also assuming that the system maps each possible score or classification to a decision or response (deterministically or probabilistically as the case may be).
Then if you compose the two maps, you wind up with a function from {possible input patterns} to {possible responses}.
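As a minimal sketch of that composition, here is a toy pipeline in Python. Everything in it is hypothetical and chosen only for illustration: the feature names, the weights, the thresholds, and the response table are all assumptions, not anything taken from a real system.

```python
from typing import Callable, Dict

# Hypothetical weights for an illustrative fraud-style score.
WEIGHTS = {"amount": 0.002, "foreign": 1.5, "night": 0.5}

def score(features: Dict[str, float]) -> float:
    """Scoring: map an input pattern to a scalar."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())

def classify(s: float) -> str:
    """Classification: map the score into a finite set of labels.
    The labels partition the set of possible inputs."""
    if s < 1.0:
        return "low"
    if s < 2.5:
        return "medium"
    return "high"

# Each possible classification maps (deterministically here) to a response.
RESPONSES = {"low": "approve", "medium": "review", "high": "decline"}

def respond(label: str) -> str:
    return RESPONSES[label]

def compose(*fns: Callable) -> Callable:
    """Compose functions left to right: compose(f, g)(x) == g(f(x))."""
    def composed(x):
        for f in fns:
            x = f(x)
        return x
    return composed

# The composed map goes from {possible input patterns} to {possible responses}.
decide = compose(score, classify, respond)

if __name__ == "__main__":
    print(decide({"amount": 100.0}))
    print(decide({"amount": 100.0, "foreign": 1.0}))
    print(decide({"amount": 1000.0, "foreign": 1.0}))
```

A probabilistic version would replace the `RESPONSES` lookup with a draw from a per-label distribution; the shape of the composition stays the same.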

4. If you want a good algorithm for classification, of course, it’s natural to pursue it via machine learning. And the same is true of scoring, at least if we recall that the domains of machine learning and statistics have essentially merged.

5. It took people remarkably long to figure out the previous point. Through at least the end of the previous century, it was generally assumed that the way to come up with clever algorithms for, for example, text analytics or machine vision was — well, to think them up.

6. As spelled out in my overview of present-day commercial AI, there’s a somewhat paradoxical industry structure, in that:

Even though machine learning is a sine qua non of many businesses, tech and non-tech alike …
… the rest of AI is largely concentrated at a few behemoth technology companies.

Of course, there are plenty of startups hoping to change that structure. I hope some of them succeed.

Notes on indexes and index-like structures

Curt Monash — Thu, 16 Apr 2015 22:42:59 +0000

Indexes are central to database management.

My first-ever stock analyst report, in 1982, correctly predicted that index-based DBMS would supplant linked-list ones …
… and to this day, if one wants to retrieve a small fraction of a database, indexes are generally the most efficient way to go.
Recently, I’ve had numerous conversations in which indexing strategies played a central role.

Perhaps it’s time for a round-up post on indexing.

1. First, let’s review some basics. Classically:

An index is a DBMS data structure that you probe to discover where to find the data you really want.
Indexes make data retrieval much more selective and hence faster.
While indexes make queries cheaper, they make writes more expensive — because when you write data, you need to update your index as well.
Indexes also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for “zero-DBA” RDBMS installations.)
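The trade-off in those basics can be sketched with a toy table in Python. This is an illustration under simplifying assumptions (an in-memory list as the "table", a dictionary as the "index", hypothetical names such as `Table` and `probe`), not how any particular DBMS lays out its structures.

```python
from collections import defaultdict

class Table:
    """Toy heap table with an optional secondary index on one column."""

    def __init__(self, indexed_column=None):
        self.rows = []
        self.indexed_column = indexed_column
        self.index = defaultdict(list)  # column value -> row positions
        self.index_writes = 0           # extra work caused by the index

    def insert(self, row):
        self.rows.append(row)
        if self.indexed_column is not None:
            # Every write must also update the index.
            position = len(self.rows) - 1
            self.index[row[self.indexed_column]].append(position)
            self.index_writes += 1

    def scan(self, column, value):
        """Full scan: returns (matching rows, number of rows examined)."""
        hits = [r for r in self.rows if r[column] == value]
        return hits, len(self.rows)

    def probe(self, value):
        """Index probe: returns (matching rows, number of rows examined)."""
        positions = self.index.get(value, [])
        return [self.rows[p] for p in positions], len(positions)

if __name__ == "__main__":
    table = Table(indexed_column="city_id")
    for i in range(1000):
        table.insert({"id": i, "city_id": i % 100})
    print(table.scan("city_id", 7)[1])   # rows examined by the scan
    print(table.probe(7)[1])             # rows examined via the index
    print(table.index_writes)            # index maintenance on writes
```

The probe touches only the rows it returns, while the scan touches all of them; the price is one index update per insert, plus the space the index occupies.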

2. Further:

A DBMS or other system can index data it doesn’t control.
- This is common in the case of text indexing, and not just in public search engines like Google. Performance design might speak against recopying text documents. So might security.
- This capability overlaps with but isn’t exactly the same thing as an “external tables” feature in an RDBMS.
Indexes can be updated in batch mode, rather than real time.
- Most famously, this is why Google invented MapReduce.
- Indeed, in cases where you index external data, it’s almost mandatory.
Indexes written in real-time are often cleaned up in batch, or at least asynchronously with the writes.
- The most famous example is probably the rebalancing of B-trees.
- Append-only index writes call for later clean-up as well.
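The "write now, clean up later" pattern can be sketched as follows. This is a simplified illustration, assuming a sorted main structure plus an unsorted append-only delta; the class and method names are hypothetical, and real systems use far more elaborate merge schemes.

```python
import bisect

class DeltaIndex:
    """Sorted main index plus an append-only delta of recent writes."""

    def __init__(self):
        self.main = []   # sorted list of (key, row_id)
        self.delta = []  # recent writes, in arrival order

    def write(self, key, row_id):
        # Cheap at write time: append only, no reordering.
        self.delta.append((key, row_id))

    def lookup(self, key):
        # Binary search in the sorted part ...
        i = bisect.bisect_left(self.main, (key,))
        hits = []
        while i < len(self.main) and self.main[i][0] == key:
            hits.append(self.main[i][1])
            i += 1
        # ... plus a linear pass over the not-yet-merged delta.
        hits.extend(row_id for k, row_id in self.delta if k == key)
        return hits

    def compact(self):
        # Batch clean-up: fold the delta into the sorted main index.
        self.main = sorted(self.main + self.delta)
        self.delta = []

if __name__ == "__main__":
    idx = DeltaIndex()
    idx.write("b", 1)
    idx.write("a", 2)
    idx.write("b", 3)
    print(idx.lookup("b"))
    idx.compact()
    print(idx.lookup("b"), idx.main)
```

Lookups stay correct before compaction but get slower as the delta grows, which is why the clean-up cannot be deferred forever.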

3. There are numerous short-request RDBMS indexing strategies, with various advantages and drawbacks. But better indexing, as a general rule, does not a major DBMS product make.

The latest example is my former clients at Tokutek, who just got sold to Percona in a presumably small deal — regrettably without having yet paid me all the money I’m owed. (By the way, the press release for that acquisition highlights TokuDB’s advantages in compression much more than it mentions straight performance.)
In a recent conversation with my clients at MemSQL, I basically heard from Nikita Shamgunov that:
- He felt that lockless indexes were essential to scale-out, and to that end …
- … he picked skip lists, not because they were the optimal lockless index, but because they were good enough and a lot easier to implement than the alternatives. (Edit: Actually, see Nikita’s comment below.)
Red-black trees are said to be better than B-trees. But they come up so rarely that I don’t really understand how they work.
solidDB did something cool with Patricia tries years ago. McObject and ScaleDB tried them too. Few people noticed or cared.

I’ll try to explain this paradox below.

4. The analytic RDBMS vendors who arose in the previous decade were generally index-averse. Netezza famously does not use indexes at all. Neither does Vertica, although the columns themselves played some of the role of indexes, especially given the flexibility in their sort orders. Others got by with much less indexing than was common in, for example, Oracle data warehouses.

Some of the reason was indexes’ drawbacks in terms of storage space and administrative overhead. Also, sequential scans can be much faster from spinning disk than more selective retrieval, so table scans often outperformed index-driven retrieval.

5. It is worth remembering that almost any data access method brings back more data than you really need, at least as an intermediate step. For starters, data is usually retrieved in whole pages, whether you need all their contents or not. But some indexing and index-alternative technologies go well beyond that.

To avoid doing true full table scans, Netezza relies on “zone maps”. These are a prominent example of what is now often called data skipping.
Bloom filters in essence hash data into a short string of bits. If there’s a hash collision, excess data is returned.
Geospatial queries often want to return data for regions that have no simple representation in the database. So instead they bring back data for a superset of the desired region, which the DBMS does know how to return.
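To make the data-skipping idea concrete, here is a small sketch of a zone map in Python. It assumes the simplest possible form (a minimum and maximum kept per fixed-size block of a single column); the class name and block size are hypothetical, and this is not a description of Netezza's actual implementation.

```python
class ZoneMappedColumn:
    """Column stored in fixed-size blocks, with a (min, max) pair per block."""

    def __init__(self, values, block_size=4):
        self.blocks = [values[i:i + block_size]
                       for i in range(0, len(values), block_size)]
        # The zone map: one small summary per block.
        self.zone_map = [(min(b), max(b)) for b in self.blocks]

    def range_query(self, lo, hi):
        """Return (values in [lo, hi], number of blocks actually read)."""
        hits, blocks_read = [], 0
        for block, (bmin, bmax) in zip(self.blocks, self.zone_map):
            if bmax < lo or bmin > hi:
                continue  # skip: no value in this block can qualify
            blocks_read += 1
            # A block that is read still contains non-qualifying values,
            # so more data comes back than the query needs.
            hits.extend(v for v in block if lo <= v <= hi)
        return hits, blocks_read

if __name__ == "__main__":
    column = ZoneMappedColumn(list(range(100)), block_size=10)
    hits, blocks_read = column.range_query(25, 34)
    print(len(hits), blocks_read, len(column.blocks))
```

Skipping works well here because the values are stored in sorted order; on randomly ordered data, most blocks would span the query range and little would be skipped.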

6. Geospatial indexing is actually one of the examples that gave me the urge to write this post. There are two main geospatial indexing strategies I hear about. One is the R-tree, which basically divides things up into rectangles, rectangles within those rectangles, rectangles within those smaller rectangles, and so on. A query initially brings back the data within a set of rectangles whose union contains the desired region; that intermediate result is then checked row by row for whether it belongs in the final result set.
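The filter-then-refine step can be sketched in Python as below. This is not a real R-tree: it assumes a single level of bounding rectangles over pre-grouped points, with no nesting, insertion, or node splitting, and all names are hypothetical. It only illustrates how an intermediate superset is fetched and then checked row by row.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Rect") -> bool:
        return not (self.xmax < other.xmin or other.xmax < self.xmin or
                    self.ymax < other.ymin or other.ymax < self.ymin)

def bounding_rect(points: List[Point]) -> Rect:
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return Rect(min(xs), min(ys), max(xs), max(ys))

def circle_query(leaves, center: Point, radius: float):
    """leaves: list of (Rect, points). Returns (candidates, results)."""
    cx, cy = center
    query_box = Rect(cx - radius, cy - radius, cx + radius, cy + radius)
    # Filter: fetch every point in a rectangle that touches the query box.
    candidates = [p for rect, pts in leaves
                  if rect.intersects(query_box) for p in pts]
    # Refine: check each candidate against the true region (a circle).
    results = [p for p in candidates
               if (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= radius ** 2]
    return candidates, results

if __name__ == "__main__":
    groups = [[(0, 0), (1, 1), (3, 2)], [(10, 10), (11, 12)], [(3, 3), (4, 4)]]
    leaves = [(bounding_rect(g), g) for g in groups]
    candidates, results = circle_query(leaves, center=(1, 0), radius=2)
    print(candidates, results)
```

In a real R-tree the same intersection test is applied recursively from the root down, so whole subtrees of rectangles are discarded at once.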

The other main approach to geospatial indexing is the space-filling curve. The idea behind this form of geospatial indexing is roughly:

For computational purposes, a geographic region is of course a lattice of points rather than a true 2-dimensional continuum.
So you take a lattice — perhaps in the overall shape of a square — and arrange its points in a sequence, so that each point is adjacent in some way to its predecessor.
Then regions on a plane are covered by subsequences (or unions of same).

The idea gets its name because, if you trace a path through the sequence of points, what you get is an approximation to a true space-filling curve.
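One well-known curve with the adjacency property described above is the Hilbert curve. Choosing it here is my assumption for illustration; nothing above says which curve a given product uses. The sketch maps a lattice point to its position in the sequence, using the standard bit-manipulation formulation for a square lattice whose side is a power of two.

```python
def _rotate(n, x, y, rx, ry):
    """Rotate/flip a quadrant so the sub-curve is oriented correctly."""
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        x, y = y, x
    return x, y

def hilbert_index(n, x, y):
    """Position of lattice point (x, y) along the Hilbert curve
    covering an n-by-n lattice; n must be a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d

def curve_order(n):
    """All lattice points, arranged in curve order."""
    points = [(x, y) for x in range(n) for y in range(n)]
    return sorted(points, key=lambda p: hilbert_index(n, p[0], p[1]))

if __name__ == "__main__":
    for point in curve_order(4):
        print(point)
```

Once each row carries such a position as a one-dimensional key, an ordinary ordered index can serve region queries as one or more key ranges, with the usual refinement step for points that fall in the ranges but outside the region.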

7. And finally — mature DBMS use multiple indexing strategies. One of the best examples of a DBMS winning largely on the basis of its indexing approach is Sybase IQ, which popularized bitmap indexing. But when last I asked, some years ago, Sybase IQ actually used 9 different kinds of indexing. Oracle surely has yet more. This illustrates that different kinds of indexes are good in different use cases, which in turn suggests obvious reasons why clever indexing rarely gives a great competitive advantage.
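For readers who haven't seen one, here is a textbook-style sketch of a bitmap index in Python. It is a generic illustration, not Sybase IQ's implementation: it assumes one uncompressed bitmap per distinct value, stored as a Python integer, and the class and method names are hypothetical.

```python
class BitmapIndex:
    """One bitmap per distinct column value; bit r is set if row r has it."""

    def __init__(self, values):
        self.n = len(values)
        self.bitmaps = {}
        for row, value in enumerate(values):
            self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << row)

    def bitmap(self, value):
        return self.bitmaps.get(value, 0)

    def rows(self, bitmap):
        """Decode a bitmap back into a list of row numbers."""
        return [r for r in range(self.n) if (bitmap >> r) & 1]

if __name__ == "__main__":
    region = BitmapIndex(["east", "west", "east", "north"])
    status = BitmapIndex(["open", "open", "closed", "open"])
    # AND and OR of predicates become bitwise operations on the bitmaps.
    both = region.bitmap("east") & status.bitmap("open")
    either = region.bitmap("east") | region.bitmap("north")
    print(region.rows(both), region.rows(either))
```

This also hints at why one index type is never enough: bitmaps suit low-cardinality columns combined in ad hoc predicates, and are a poor fit for, say, a unique key under heavy update.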

Hardware and storage notes

Curt Monash — Thu, 01 May 2014 02:05:16 +0000

My California trip last week focused mainly on software — duh! — but I had some interesting hardware/storage/architecture discussions as well, especially in the areas of:

Rack- or data-center-scale systems.
The real or imagined demise of Moore’s Law.
Flash.

I also got updated as to typical Hadoop hardware.

If systems are designed at the whole-rack level or higher, then there can be much more flexibility and efficiency in terms of mixing and connecting CPU, RAM and storage. The Google/Facebook/Amazon cool kids are widely understood to be following this approach, so others are naturally considering it as well. The most interesting of several discussions of that point came when I got the chance to talk with Berkeley computer architecture guru Dave Patterson, who’s working on plans for 100-petabyte/terabit-networking kinds of systems, for usage after 2020 or so. (If you’re interested, you might want to contact him; I’m sure he’d love more commercial sponsorship.)

One of Dave’s design assumptions is that Moore’s Law really will end soon (or at least greatly slow down), if by Moore’s Law you mean that every 18 months or so one can get twice as many transistors onto a chip of the same area and cost as one could before. However, while he thinks that applies to CPU and RAM, Dave thinks flash is an exception. I gathered that he thinks the power/heat reasons for Moore’s Law to end will be much harder to defeat than the other ones; note that flash, because of what it’s used for, has vastly less power running through it than CPU or RAM do.

Otherwise, I didn’t gain much new insight into actual flash uptake. Everybody thinks flash is or soon will be very important; but in many segments, folks are trading off disk vs. RAM without worrying much about the intermediate flash alternative.

I visited two Hadoop distribution vendors this trip, namely the ones who are my clients – Cloudera and MapR. I remembered to ask one of them, Cloudera, about typical Hadoop hardware, and got answers that sounded consistent with hardware trends Hortonworks told me about last August. The story is, more or less:

The default assumption remains $20-30K/node, 2 sockets, 12 disks. (Edit: See lively price discussion in the comments below.)
Most hardware vendors have standard/default Hadoop boxes by now, and in many cases customers just buy what’s on offer.
The aforementioned disks sometimes get up to 4 terabytes now.
128GB is now the norm for RAM. 256GB is common. Higher amounts are seen, up to – in rare cases – 2-4 TB.
Flash is of interest, but isn’t being demanded much yet. This could change when flash’s storage density matches disk’s.
Flash interest is highest for Impala.

Cloudera suggested that the larger amounts of RAM tend to be used when customers frame the need as putting certain analytic datasets entirely in RAM. This rings true to me; there’s lots of evidence that users think that way, and not just in analytic cases. This is probably one of the reasons that they often jump straight from disk to RAM without fully exploring the opportunities of flash.

One last thing — the big cloud vendors are at least considering the use of their own non-Intel chip designs, which might be part of the reason for Intel’s large Hadoop investment.

More on public policy

Curt Monash — Sat, 01 Feb 2014 11:35:22 +0000

Occasionally I take my public policy experience out for some exercise. Last week I wrote about privacy and network neutrality. In this post I’ll survey a few more subjects.

1. Censorship worries me, a lot. A classic example is Vietnam, which basically has outlawed online political discussion.

And such laws can have teeth. It’s hard to conceal your internet usage from an inquisitive government.

2. Software and software-related patents are back in the news. Google, which said it was paying $5.5 billion or so for a bunch of Motorola patents, turns out to really have paid $7 billion or more. Twitter and IBM did a patent deal as well. Big numbers, and good for certain shareholders. But this all benefits the wider world — how?

As I wrote 3 1/2 years ago:

The purpose of legal intellectual property protections, simply put, is to help make it a good decision to create something. …

Why does “securing … exclusive Right[s]” to the creators of things that are patented, copyrighted, or trademarked help make it a good decision for them to create stuff? Because it averts competition from copiers, thus making the creator a monopolist in what s/he has created, allowing her to at least somewhat value-price her creation.

I.e., the core point of intellectual property rights is to prevent copying-based competition. By way of contrast, any other kind of intellectual property “right” should be viewed with great suspicion.

That Constitutionally-based principle makes as much sense to me now as it did then. By way of contrast, “Let’s give more intellectual property rights to big corporations to protect middle-managers’ jobs” is — well, it’s an argument I view with great suspicion.

But I find it extremely hard to think of a technology industry example in which development was stimulated by the possibility of patent protection. Yes, the situation may be different in pharmaceuticals, or for gadgeteering home inventors, but I can think of no case in which technology has been better, or faster to come to market, because of the possibility of a patent-law monopoly. So if software and business-method patents were abolished entirely — even the ones that I think could be realistically adjudicated — I’d be pleased.

3. In November, 2008 I offered IT policy suggestions for the incoming Obama Administration, especially:

Pick the right Chief Technology Officer.

Fix the government technology contracting process in general.

Fix the air traffic control system in particular.

Generally take a businesslike approach to government IT. Obama’s focus on making government “transparent” and searchable would be just one byproduct of that effort.

Continue to beef up internal search and knowledge management (remember the FBI agent who guessed the 9/11 plans, but couldn’t communicate his ideas to anybody who cared).

Write privacy laws of the sort that will, for example, allow electronic health records to be adopted without great fear of misuse. (I have some strong opinions as to what form those laws should take.)

Drastically beef up math education!! (Science too, but math is especially important.) This takes leadership to convince people it’s CRUCIAL to be numerate, perhaps even more than it takes specific policy initiatives. Little else is as important.

and

… we need an experienced technology implementation leader to:

Recommend major changes in government IT contracting. Right now, information technology is bought at the wrong level of granularity, too coarse and too fine at once. Private sector CIOs make broad technology architecture decisions, then make incremental purchases as needed. Public sector IT managers, however, are generally compelled to make purchases on a “project” basis, which allows neither the sanity of broad-scale planning nor the economies and adaptability of just-in-time acquisition.

Establish best practices in a broad range of IT areas. Obama’s “transparency” initiative involves pushing the state of the art in public-facing technology for search, query, and audio/video, at a minimum. Other areas of major technical challenge include internal search, knowledge management, and social networking; disaster robustness; planning in the face of political budgeting uncertainty; numbers-based management without the benefit of a profit/loss statement … and the list could easily be twice as long.

Interact with the private sector. From electronic health records to the general supply chain, there are huge opportunities for public/private interoperability, quite apart from the obvious customer/vendor relationships the government has with the IT industry.

Improve training, recruiting, and retention. Anywhere government needs employees whose skills are also in high demand in the private sector, government pay scales cause difficulties. IT is a top area for that problem. Outstanding leadership is needed to overcome it.

Little of that actually happened.

Kudos if you noticed the link — which I herewith repeat — to what I wrote about privacy in 2006.

In particular — and even after the HealthCare.gov fiasco — I think few voters or legislators understand how incredibly broken government IT contracting is. Almost all major projects go through a five-stage process:

Specify.
Bid.
Select.
Complain.
Adjudicate.

Re-competes usually follow as well.

And so government IT is subject to extreme forms of two inevitable project killers:

Waterfall methodology.
Delay.

Procurement cycles take years, and in the worst cases decades. Project specifications are often fixed until the next procurement, which is often 7-10 years down the road. This, to put it mildly, is the opposite of agility, and widespread project failure ensues.