Scientific research – DBMS 2 : DataBase Management System Services

Notes on predictive modeling, October 10, 2014

Curt Monash — Fri, 10 Oct 2014 08:40:02 +0000

As planned, I’m getting more active in predictive modeling. Anyhow …

1. I still believe most of what I said in a July, 2013 predictive modeling catch-all post. However, I haven’t heard as much subsequently about Ayasdi as I had expected to.

2. The most controversial part of that post was probably the claim:

I think the predictive modeling state of the art has become:

Cluster in some way.

Model separately on each cluster.

In particular:

It is always possible to instead go with a single model formally.
A lot of people think accuracy, ease-of-use, or both are better served by a true single-model approach.
Conversely, if you have a single model that’s pretty good, it’s natural to look at the subset of the data for which it works poorly and examine that first. Voila! You’ve just done a kind of clustering.
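
To make the cluster-then-model idea concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anybody's product: the synthetic data with two regimes, the tiny 1-D two-means routine, and the helper names fit_line, sse and two_means_1d. It only shows that one straight line fits two-regime data poorly, while one line per cluster fits it well.

```python
from statistics import mean

def fit_line(points):
    """Ordinary least squares for y = a + b*x on a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    b = sxy / sxx if sxx else 0.0
    return y_bar - b * x_bar, b

def sse(points, model):
    """Sum of squared errors of a fitted line on the given points."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in points)

def two_means_1d(xs, iterations=20):
    """Tiny 1-D k-means with k=2; returns the boundary between the clusters."""
    lo, hi = min(xs), max(xs)
    for _ in range(iterations):
        left = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        right = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(left), mean(right)
    return (lo + hi) / 2

# Synthetic data (an assumption): two regimes with different slopes.
points = [(x, 2.0 * x) for x in range(10)]
points += [(x, 100.0 - 3.0 * x) for x in range(20, 30)]

# Step 1: cluster in some way.
boundary = two_means_1d([x for x, _ in points])
clusters = [
    [p for p in points if p[0] <= boundary],
    [p for p in points if p[0] > boundary],
]

# Step 2: model separately on each cluster, and compare with a single model.
single_sse = sse(points, fit_line(points))
clustered_sse = sum(sse(c, fit_line(c)) for c in clusters)
print(f"single model SSE: {single_sse:.1f}, per-cluster SSE: {clustered_sse:.6f}")
```

Formally the two per-cluster lines are still one piecewise model, which is the sense in which you can always go with a single model.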

3. Nutonian is now a client. I just had my first meeting with them this week. To a first approximation, they’re somewhat like KXEN (sophisticated math, non-linear models, ease of modeling, quasi-automagic feature selection), but with differences that start:

While KXEN was distinguished by how limited its choice of model templates was, Nutonian is distinguished by its remarkable breadth. Is the best model for your data a quadratic polynomial in which some of the terms are trigonometric functions? Nutonian is happy to find that for you.
Nutonian is starting out as a SaaS (Software as a Service) vendor.
A big part of Nutonian’s goal is to find a simple/parsimonious model, because — although this is my phrasing rather than theirs — the simpler the model, the more likely it is to have robust explanatory power.

With all those possibilities, what do Nutonian models actually wind up looking like? In internet/log analysis/whatever kinds of use cases, I gather that:

The model is likely to be a polynomial — of multiple variables of course — of degree no more than 3 or 4.
Variables can have time delays built into them (e.g., sales today depend on email sent 2 weeks ago). Indeed, some of Nutonian’s flashiest early modeling successes seem to be based around the ease with which they capture time-delayed causality.
In each monomial, all variables except 1 are likely to be “control”/”capping”/”transition-point”/”on-off switch”/logical/conditional/whatever variables — i.e., variables whose range is likely to be either {0,1} or perhaps [0,1] instead.
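
As a rough illustration of that model shape, here is a single monomial with a time delay and an on-off switch. This is a sketch of my own, not Nutonian output: the variable names, the 14-step delay (the "2 weeks ago" example above, with one step per day), and the coefficients are all assumptions.

```python
def lagged(series, t, delay):
    """Value of a series `delay` steps before t, or 0.0 before the series starts."""
    return series[t - delay] if t - delay >= 0 else 0.0

def predicted_sales(t, emails_sent, is_promo_week, base=50.0, k=0.25):
    """sales(t) = base + k * emails_sent(t - 14) * is_promo_week(t).

    emails_sent is the one "real" variable in the monomial, with a time delay;
    is_promo_week is a {0, 1} switch that turns the term on or off.
    """
    return base + k * lagged(emails_sent, t, 14) * is_promo_week[t]

# Hypothetical 30 days of data.
emails_sent = [0] * 30
emails_sent[6] = 1000      # a mailing on day 6
emails_sent[7] = 1000      # and another on day 7
is_promo_week = [0] * 30
is_promo_week[20] = 1      # switch is on for day 20 only

print(predicted_sales(20, emails_sent, is_promo_week))  # delayed email counts
print(predicted_sales(21, emails_sent, is_promo_week))  # switch off, base only
```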

Nutonian also serves real scientists, however, and their models can be all over the place.

4. One set of predictive modeling complexities goes something like this:

A modeling exercise may have 100s or 1000s of potential variables to work with. (For simplicity, think of a potential variable as a column or field in the input data.)
The winning models are likely to use only a small fraction of these variables.
Those may not be variables you’re thrilled about using.
Fortunately, many variables have strong covariances with each other, so it’s often possible to exclude your disfavored variables and come out with a model almost as good.

I pushed the Nutonian folks to brainstorm with me about why one would want to exclude variables, and quite a few kinds of reasons came up, including:

(My top example.) Regulatory compliance may force you to exclude certain variables. E.g., credit scores in the US mustn’t be based on race.
(Their top example.) Some data is just expensive to get. E.g., a life insurer would like to come up with a way to avoid using blood test results in their decision making, because they’d like to drop the expense of the blood tests.
(Perhaps our joint other top example.) Clarity of explanation is an important goal. Some models are black boxes, and that’s that. Others are also supposed to uncover causality that helps humans make all kinds of better decisions. Regulators may also want clear models. Note: Model clarity can be affected by model structure and variable(s) choice alike.
Certain variables can simply be more or less trusted, in terms of the accuracy of the data.
Certain variables can be more or less certain to be available in the future. However, I wonder how big a concern that is in a world where models are frequently retrained anyway.

5. I’m not actually seeing much support for the theory that Julia will replace R except perhaps from Revolution Analytics, the company most identified with R. Go figure.

6. And finally, I don’t think it’s wholly sunk in among predictive modeling folks that Spark both:

Has great momentum.
Was designed with machine learning in mind.

Data as an asset

Curt Monash — Mon, 22 Sep 2014 03:49:00 +0000

We all tend to assume that data is a great and glorious asset. How solid is this assumption?

Yes, data is one of the most proprietary assets an enterprise can have. Any of the Goldman Sachs big three* — people, capital, and reputation — are easier to lose or imitate than data.
In many cases, however, data’s value diminishes quickly.
Determining the value derived from owning, analyzing and using data is often tricky — but not always. Examples where data’s value is pretty clear start with:
- Industries which long have had large data-gathering research budgets, in areas such as clinical trials or seismology.
- Industries that can calculate the return on mass marketing programs, such as internet advertising or its snail-mail predecessors.

*“Our assets are our people, capital and reputation. If any of these is ever diminished, the last is the most difficult to restore.” I love that motto, even if Goldman Sachs itself eventually stopped living up to it. If nothing else, my own business depends primarily on my reputation and information.

This all raises the idea — if you think data is so valuable, maybe you should get more of it. Areas in which enterprises have made significant and/or successful investments in data acquisition include:

Actual scientific, clinical, seismic, or engineering research.
Actual selling of (usually proprietary) data, with the straightforward economic proposition of “Get once, sell to multiple customers more cheaply than they could get it themselves.” Examples start:
- This is the essence of the stock quote business. And Michael Bloomberg started building his vast fortune by adding additional data to what the then-incumbents could offer, for example by getting fixed-income prices from Cantor Fitzgerald.*
- Multiple marketing-data businesses operate on this model.
- Back when there was a small but healthy independent paper newsletter and directory business, its essence was data.
- And now there are many online data selling efforts, in niches large and small.
Internet ad-targeting businesses. Making money from your great ad-targeting technology usually involves access to lots of user-impression and de-anonymization data as well.
Aggressive testing by internet businesses, of substantive offers and marketing-display choices alike. At the largest, such as eBay, you’ll rarely see a page that doesn’t have at least one experiment on it. Paper-based direct marketers take a similar approach. Call centers perhaps should follow suit more than they do.
Surveys, focus groups, etc. These are commonly expensive and unreliable (and the cheap internet ones commonly irritate people who do business with you). But sometimes they are, or seem to be, the only kind of information available.
Free-text data. On the whole I’ve been disappointed by the progress in text analytics. Still — and this overlaps with some previous points — there’s a lot of information in text or narrative form out there for the taking.
- Internally you might have customer emails, call center notes, warranty reports and a lot more.
- Externally there’s a lot of social media to mine.

*Sadly, Cantor Fitzgerald later became famous for being hit especially hard on 9/11/2001.

And then there’s my favorite example of all. Several decades ago, especially in the 1990s, supermarkets and mass merchants implemented point-of-sale (POS) systems to track every item sold, and then added loyalty cards through which they bribed their customers to associate their names with their purchases. Casinos followed suit. Airlines of course had loyalty/frequent-flyer programs too, which were heavily related to their marketing, although in that case I think loyalty/rewards were truly the core element, with targeted marketing just being an important secondary benefit. Overall, that’s an awesome example of aggressive data gathering. But here’s the thing, and it’s an example of why I’m confused about the value of data — I wouldn’t exactly say that grocers, mass merchants or airlines have been bastions of economic success. Good data will rarely save a bad business.

Related links

I first wrote up this point in a 2005 Computerworld column, and added a text-analytics nuance a year later, but since then I seem to have talked about it much more than I’ve written it down.
Please always keep in mind the risks to privacy in whatever you do.

The games of Watson

Curt Monash — Thu, 09 Jan 2014 20:57:56 +0000

IBM excels at game technology, most famously in Deep Blue (chess) and Watson (Jeopardy!). But except at the chip level — PowerPC — IBM hasn’t accomplished much at game/real world crossover. And so I suspect the Watson hype is far overblown.

I believe that for two main reasons. First, whenever IBM talks about big initiatives like Watson, it winds up bundling a bunch of dissimilar things together and claiming they’re a seamless whole. Second, some core Watson claims are eerily similar to artificial intelligence (AI) over-hype three or more decades past. For example, the leukemia treatment advisor that is being hopefully built in Watson now sounds a lot like MYCIN from the early 1970s, and the idea of collecting a lot of tidbits of information sounds a lot like the Cyc project. And by the way:

MYCIN led to E-MYCIN, which led to the company Teknowledge, which raised a lot of money* but now has almost faded from memory.
Cyc is connected to the computer science community’s standard unit of bogosity.

*Much of it, I’m ashamed to say, with my help, back in my stock analyst days.

AI is something of an umbrella category, often just meaning “Computerized stuff that we don’t know how to do yet”, or “… only recently figured out how to do.” Automated decision-making is an aspect of AI, for example, but so also is natural language recognition. It used to be believed that most AI should be approached in the same way:

Come up with a clever way to represent knowledge.
Match the actual situation against the knowledge.
Produce a smart result.

But that template unfortunately proved disappointing time after time. The problem was typically that not enough knowledge could in practice be represented, and thus well-informed automated decisions could not be made. In particular, there was a “first step fallacy,” in which a demo system would solve a “toy problem”, but robust real-life systems never emerged.

Of course, there are exceptions to this general rule of disappointment; for example, Teknowledge and its fellow over-hyped expert system technology vendors of the 1980s (Intellicorp, Inference, et al.) did get a few solid production references. But the ones I remember best (e.g. American Express credit, United Airlines seat pricing, some equipment maintenance scheduling) were often for use cases that we’d now address in more straightforwardly mathematical ways.

Watson is generally promoted as helping with decision-making, but that message has to be scrutinized carefully. So far as I’ve been able to guess, the true core technology of IBM Watson is extracting knowledge from text — or primarily from text — and representing it in some way that is reasonably useful in answering natural language queries. The hope would then be to eventually achieve a rich enough knowledge base to support the Star Trek computer. But automated decision-making doesn’t just require knowledge; it also requires decision-making rules. And if Watson is significantly ahead of the 1980s decisioning state of the art (Rete, backward chaining, etc.), I’m not aware of how.

So if Watson is going to accomplish anything soon, it will probably be in areas where serious decision-making chops aren’t needed. Indeed, the application areas that I’ve seen mentioned for the past or near term are mainly:

Playing Jeopardy! That’s pretty simple from a decision-making standpoint.
Advising on treatments for a specific disease (not actually built yet). As noted above, that’s 1970s-level decisioning.
Knowledge extraction from medical research articles. That has very little to do with decisioning, and incidentally sounds a lot like what SPSS (before it was acquired by IBM) and Temis were already doing years ago.
Natural-language customer interaction. That may not involve any decisioning at all.

Returning to the point that Watson’s core technology is probably natural language, it seems fair to say that IBM these days is probably better at the text mining side than at speech understanding. Evidence I’m thinking of includes:

That seems to be what IBM itself is saying on its speech recognition page.
I also recall IBM’s natural language recognition projects being regarded as not going well in the late 1990s. (Project Penelope, I believe, although I can’t confirm that via googling.)
IBM’s LanguageWare sounded more oriented to text mining in 2008.
IBM bought SPSS, which had decent text mining technology.

And while this is too old to really count as evidence, IBM had a famously unsuccessful language recognition deal with Artificial Intelligence Corporation way back in 1983-4.*

*Yeah, I helped raise money for AICorp too, and also for Symbolics. As you might imagine, my investment banking trophies do not have pride of place on my desk.

One last observation — text mining has a very mixed track record. Watson will have to go far beyond predecessor text technologies to become nearly the big deal IBM is suggesting it will be.

Related links

Some language recognition humor
Jack Vaughan reports on Watson

Notes, links and comments August 6, 2012

Curt Monash — Mon, 06 Aug 2012 05:11:08 +0000

I haven’t done a notes/link/comments post for a while. Time for a little catch-up.

1. MySQL now has a memcached integration story. I haven’t checked the details. The MySQL team is pretty hard to talk with, due to the heavy-handedness of Oracle’s analyst relations.

2. The Large Hadron Collider offers some serious numbers, including:

1 petabyte/second.
6 x 10⁹ collisions/second.
Only 1 in 10¹³ collision records kept (which I guess knocks things down to a 100 byte/second average, from the standpoint of persistent storage).
Real-time filtering by a cluster of several thousand machines, over a 25 nanosecond period.
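
The arithmetic behind that 100 byte/second guess is easy to check. The sketch below takes the quoted figures at face value and assumes a decimal petabyte (10^15 bytes) and that kept records are of average size; the variable names are mine.

```python
raw_bytes_per_second = 10**15        # 1 petabyte/second, decimal units assumed
collisions_per_second = 6 * 10**9
keep_one_in = 10**13                 # 1 in 10^13 collision records kept

# Average persistent-storage rate if kept records are of average size.
persisted_bytes_per_second = raw_bytes_per_second // keep_one_in

# Implied average record size and time between kept records.
avg_bytes_per_collision = raw_bytes_per_second / collisions_per_second
seconds_per_kept_record = keep_one_in / collisions_per_second

print(persisted_bytes_per_second)         # bytes/second persisted
print(round(avg_bytes_per_collision))     # bytes per collision record
print(round(seconds_per_kept_record))     # seconds between kept records
```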

3. One application area we don’t talk about much for analytic technologies is education. However:

Knewton vigorously talks up the idea of online learning that adapts to the students’ previous responses, complete with the “Big Data” buzzword.
Knewton evidently likes graphs, and seems to be eagerly awaiting scale-out capabilities in Neo4j.
The New York Times offered a survey article about analytics in education. It seemed to be focused on Arizona State University — where I attended the only educational software conference I’ve ever gone to, in approximately 1984. One concerning aspect: There didn’t seem to be any reason to be sure the outcomes they were working toward had much to do with an actually better education.

So how soon will budgets emerge for all this, especially in the United States? I’m not sure.

Education has all sorts of problems at both the grade-school and collegiate levels, including bureaucratic weirdness and huge financial pressures.
Textbook publisher Macmillan is investing significant capital in education technology businesses … but diversifications of that kind have often gone wrong before.

4. Recent posts with robust comment threads — and this is a very partial list — include:

Pros and cons of Microsoft SQL Server were explored after I opined about SQL Server to MySQL migration.
There was a lot of commentary on my May series of graph analysis and management posts.
Later, Neo’s Philip Rathle added clarifying detail to my post on Neo Technology and Neo4j.
My June series on Hadoop drew numerous comments and clarifications too.
There was vigorous response when I suggested in May that “Big Data” might be overhyped …
… but nothing like what transpired when I said something similar in September, 2011.

5. Finally — and thoroughly superseding my post on disk, flash, and RAM — I saw an awesome round-up of latency numbers, which I’ll just quote below:

L1 cache reference .................................. 0.5 ns
Branch mispredict ..................................... 5 ns
L2 cache reference .................................... 7 ns
Mutex lock/unlock .................................... 25 ns
Main memory reference ............................... 100 ns
Compress 1K bytes with Zippy ...................... 3,000 ns
Send 2K bytes over 1 Gbps network ................ 20,000 ns
SSD random read ................................. 150,000 ns
Read 1 MB sequentially from memory .............. 250,000 ns
Round trip within same datacenter ............... 500,000 ns
Read 1 MB sequentially from SSD* .............. 1,000,000 ns
Disk seek .................................... 10,000,000 ns
Read 1 MB sequentially from disk ............. 20,000,000 ns
Send packet CA -> Netherlands -> CA ......... 150,000,000 ns

Repeating that in different units, it’s:

    L1 cache reference ......................... 0.5 ns
    Branch mispredict ............................ 5 ns
    L2 cache reference ........................... 7 ns
    Mutex lock/unlock ........................... 25 ns
    Main memory reference ...................... 100 ns
    Compress 1K bytes with Zippy ................. 3 µs
    Send 2K bytes over 1 Gbps network ........... 20 µs
    SSD random read ............................ 150 µs
    Read 1 MB sequentially from memory ......... 250 µs
    Round trip within same datacenter .......... 0.5 ms
    Read 1 MB sequentially from SSD* ............. 1 ms
    Disk seek ................................... 10 ms
    Read 1 MB sequentially from disk ............ 20 ms
    Send packet CA -> Netherlands -> CA ........ 150 ms
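
If you want to play with those numbers yourself, the unit conversion is a few lines of Python. This is just a sketch: humanize_ns is a name I made up, and its rule (use the largest unit that keeps the value at 1 or more) differs slightly from the quoted table, which shows 500,000 ns as 0.5 ms rather than 500 µs.

```python
LATENCIES_NS = {
    "L1 cache reference": 0.5,
    "Branch mispredict": 5,
    "L2 cache reference": 7,
    "Mutex lock/unlock": 25,
    "Main memory reference": 100,
    "Compress 1K bytes with Zippy": 3_000,
    "Send 2K bytes over 1 Gbps network": 20_000,
    "SSD random read": 150_000,
    "Read 1 MB sequentially from memory": 250_000,
    "Round trip within same datacenter": 500_000,
    "Read 1 MB sequentially from SSD": 1_000_000,
    "Disk seek": 10_000_000,
    "Read 1 MB sequentially from disk": 20_000_000,
    "Send packet CA -> Netherlands -> CA": 150_000_000,
}

def humanize_ns(ns):
    """Format nanoseconds in the largest unit that keeps the value >= 1."""
    for unit, factor in (("ms", 1_000_000), ("µs", 1_000)):
        if ns >= factor:
            return f"{ns / factor:g} {unit}"
    return f"{ns:g} ns"

for name, ns in LATENCIES_NS.items():
    print(f"{name:<38} {humanize_ns(ns):>8}")

# One ratio worth remembering: a disk seek costs this many memory references.
seek_vs_memory = LATENCIES_NS["Disk seek"] // LATENCIES_NS["Main memory reference"]
print(seek_vs_memory)
```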

People’s facility with statistics — extremely difficult to predict

Curt Monash — Mon, 06 Aug 2012 05:11:06 +0000

My recent post on broadening the usefulness of statistics presupposed two things about the statistical sophistication of business intelligence tool users:

It varies a lot.
In many cases, it isn’t very high.

Let me now say a little more on the subject. My basic message is — people’s facility with statistics is extremely difficult to predict.

If you DO have to make a point estimate, however, you could do worse than just putting quotation marks around the last four words of that sentence …

Suppose we measure people’s statistical understanding on a 5-point scale:

1. People who haven’t a clue what a p-value is.
2. People who think a p-value of .05 signifies a 95% chance of truth.
3. People who know better than that, but who still think that “statistically significant” is pretty close to the same as “true”.
4. People who know better yet, but aren’t fluent in using statistical techniques correctly.
5. People who are fluent in statistics.
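
Why is "p = .05 means a 95% chance of truth" wrong? A quick Bayes' rule calculation shows that the answer depends on things the p-value doesn't know about. The sketch below treats "p < 0.05" as the decision rule; the prior (10% of tested hypotheses are actually true) and the power (80%) are purely assumed numbers, and the function name is mine.

```python
def prob_true_given_significant(prior_true, power, alpha):
    """P(hypothesis is true | result is statistically significant), via Bayes' rule."""
    true_positives = prior_true * power          # real effects that get detected
    false_positives = (1 - prior_true) * alpha   # null effects that pass anyway
    return true_positives / (true_positives + false_positives)

# Assumed scenario: 10% of hypotheses tested are true, 80% power, alpha = 0.05.
skeptical = prob_true_given_significant(prior_true=0.10, power=0.80, alpha=0.05)

# Same test, but in a field where half the hypotheses tested are true.
optimistic = prob_true_given_significant(prior_true=0.50, power=0.80, alpha=0.05)

print(f"{skeptical:.2f} vs {optimistic:.2f}")
```

Under the first set of assumptions a "significant" result is true only about 64% of the time, not 95%.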

Just knowing somebody’s job description, can you confidently predict their ranking to within, say, +/- 1 point? I suggest you can’t. People differ wildly in general numeracy and in specific statistical knowledge.

Even our guesses about average knowledge may be off, not least because education is changing things. I got to graduate school without even knowing what a conditional probability was;* now a whole generation of kids is growing up with the option of taking AP Statistics. On the flip side, a long list of recent studies suggests that research scientists, physicians, et al. are less clueful about statistics than we might have thought. A quick googling on statistical errors by scientists turned up:

Several stories about a paper uncovering a particular, frequent error in published neuroscience papers.
A list of 20 common errors in biomedical papers.
Another paper about common errors in medical research.
An Elsevier guide to common errors reviewers might find in submitted papers.

No wonder that a large fraction of medical research can’t be reproduced.**

*3 years later I had taught a low-end college course in statistics and written a PhD thesis on game theory … but let’s not over-generalize from that part of the story. Anyhow, these days my ranking would be somewhere in the 4 range.

**Another reason might be HeLa cell contamination, but I digress.

Related link

A new twist on statistics education

Memory-centric data management when locality matters

Curt Monash — Mon, 16 Jul 2012 01:13:40 +0000

Ron Pressler of Parallel Universe/SpaceBase pinged me about a data grid product he was open sourcing, called Galaxy. The idea is that a distributed RAM grid will allocate data, not randomly or via consistent hashing, but rather via a locality-sensitive approach. Notes include:

The original technology was developed to track moving objects on behalf of the Israeli Air Force.
The commercial product is focused on MMO (Massively MultiPlayer Online) games (or virtual worlds).
The underpinnings are being open sourced.
Ron suggests that, among other use cases, Galaxy might work well for graphs.
Ron argues that one benefit is that when lots of things cluster together — e.g. characters in a game — there’s a natural way to split them elastically (shrink the radius for proximity).
The design philosophy seems to be to adapt as many ideas as possible from the way CPUs manage (multiple levels of) RAM cache.
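
To see why locality-sensitive placement might matter, here is a toy comparison of hash-based placement against a simple grid-cell placement. To be clear, this is not Galaxy's algorithm, which I haven't examined; the grid scheme, the function names, the four-node cluster and the object positions are all assumptions made for illustration.

```python
import hashlib
from itertools import combinations
from math import dist

def hash_placement(object_id, num_nodes):
    """Place an object on a node by hashing its id (ignores position)."""
    digest = hashlib.sha256(object_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def cell_placement(position, cell_size, cells_per_row):
    """Place an object on the node that owns its grid cell (uses position)."""
    x, y = position
    return int(y // cell_size) * cells_per_row + int(x // cell_size)

# Hypothetical world: 100 x 100, four groups of five objects huddled together.
centers = [(10, 10), (60, 10), (10, 60), (60, 60)]
objects = {
    f"obj-{c}-{i}": (cx + i, cy + i)
    for c, (cx, cy) in enumerate(centers)
    for i in range(5)
}

# Pairs of objects close enough that they are likely to interact.
near_pairs = [
    (a, b) for a, b in combinations(objects, 2)
    if dist(objects[a], objects[b]) < 10
]

by_hash = {name: hash_placement(name, 4) for name in objects}
by_cell = {name: cell_placement(pos, 50, 2) for name, pos in objects.items()}

def colocated(pairs, placement):
    """Count the pairs whose two members live on the same node."""
    return sum(1 for a, b in pairs if placement[a] == placement[b])

colocated_by_hash = colocated(near_pairs, by_hash)
colocated_by_cell = colocated(near_pairs, by_cell)
print(len(near_pairs), colocated_by_hash, colocated_by_cell)
```

With the grid scheme every nearby pair shares a node, so their interactions need no network hop. The price is that a crowded cell overloads one node, which is presumably where elastic splitting comes in.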

The whole thing is discussed in considerable detail in a blog post and especially in a Hacker News comment thread. There’s also an error-riddled TechCrunch article.

In the areas I cover, “error-riddled TechCrunch article” is pretty much a redundant phrase — but that post looked particularly bad.

Meanwhile, I just noticed a May, 2009 blog post out of Progress Apama. The idea was that event streaming technology could be used to track moving objects, something I heard directly from the CEP (Complex Event Processing) vendors in the 2007 – 2009 period as well.

My tentative opinions on all this start:

Locality is really important for graphs. Random partitioning is crazy if there’s a locality-friendly alternative.
Ron plays different MMOs than I do. That said, the real market would more likely be new games than existing ones. And Guild Wars 2 (for example) is showing the way to gathering many characters together in a small game area.
It’s easy to conceive of cases in which there’s so much specific information about moving objects’ locations that you have to throw much of it away, rather than persisting it all. That speaks for memory-centric technology in general, and data reduction in particular (in the CEP sense of “data reduction”, not the statistics meaning).
Sensor and scientific data often have strong locality.

Related link

I’ve written a fair amount recently about graph data management, although I haven’t tackled the partitioning issue head-on.

Cool analytic stories

Curt Monash — Mon, 21 May 2012 07:25:37 +0000

There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are …

… all that bloody important, then probably people have already been making do to get it done as best they can, even in an inferior way.

Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.

Even so, I have two questions in my inbox that boil down to “What are the coolest or most significant analytics stories out there?” So let’s round up some of what I know.

Financial trading is highly dependent on analytics, except perhaps for low-speed dealing in some traditional securities classes.* Specific algorithms are highly competitive, and can change within weeks. A broad range of analytic techniques have at least been tried. Some have failed so spectacularly as to precipitate a global financial crisis.

*Equities, corporate bonds, maybe municipal bonds — i.e., the stuff that’s still governed by “fundamental” — as opposed to generic quantitative — analysis.

It’s tough to be sure just what governments are doing with all that questionably-legal data they’ve assembled to fight terrorism, but we can be pretty sure that there’s a lot of relationship analytics involved. And whatever they’re doing seems to be working. As Glenn Greenwald frequently points out, there are substantially no terrorist plots in the United States, except for the ones that law enforcement itself invents. Anti-fraud applications have used relationship analytics for the past few years as well.

Also in the graph area, LinkedIn is doing a good job with People You May Know.

Also in crime-fighting, anti-spam techniques have gotten really good. Akismet blocks well over 99% of the spam comments to this blog, and I can’t recall the last false positive I saw. Google Mail approaches 99% in spam-catching (ironically, the ones that get through commonly come from mailing list sellers), and it’s been a long time since I noticed a false positive worth worrying about.

Shifting gears now, a huge fraction of analytic efforts are devoted to being more effective in one-on-one consumer marketing offers. Some of the most visible examples are in telecommunications, specifically in churn prevention. My favorite may be the case of multilingual text analytic integration in Switzerland:

I once confirmed that SPSS customer Cablecom’s statistical models for churn and the like absolutely included text data; Cablecom even assigned different weights to the same apparent level of emotion depending on whether the text was in German, French, or Italian.

Other noteworthy adventures in marketing analytics include:

Harrah’s successes in targeting gambling customers, even though I’m having trouble validating something I thought I’d read, namely that over 100% of Harrah’s profits came from analytics on loyalty card data.
Various creepily effective efforts in de-anonymization.
The recently famous story of Target targeting pregnant women.

So is it all about trading, marketing, and crime fighting? The answer is closer to being “Yes” than I would like. That said:

Biomedical research is a hugely important area for analytic progress, and bioinformatics has gone from being a minor task to a full-time laboratory job. I hope we’ll have a bunch of great stories soon.
Just as medicine needs to be personalized, so does education. Perhaps some low-hanging analytic fruit will be plucked soon.
General optimization and operations research have been around for decades. They’re still going strong, with a lot more data points.
Ditto process control. And the UIs are cooler now.
Planning is a mess. Conceptual breakthroughs are sorely needed.

Notes on the analysis of large graphs

Curt Monash — Mon, 14 May 2012 03:35:29 +0000

This post is part of a series on managing and analyzing graph data.

My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.

How big can a graph be? That of course depends on:

The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.

*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.
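
The 34-bit figure, and what it implies for storage, can be checked in a few lines. In this sketch the one-trillion-edge count is an assumed round number, the 100 bytes per triple is the upper-end guess from above, and the helper name bits_needed is mine.

```python
def bits_needed(distinct_values):
    """Smallest number of bits that can give each value its own token."""
    return max(1, (distinct_values - 1).bit_length())

node_count = 10 * 10**9                      # 10 billion nodes
node_id_bits = bits_needed(node_count)       # bits per tokenized node id
pair_bits = 2 * node_id_bits                 # the two node ids in one triple

edge_count = 10**12                          # assumption: one trillion edges
bytes_per_triple = 100                       # post-compression guess from above
total_bytes = edge_count * bytes_per_triple

print(node_id_bits, pair_bits, total_bytes / 10**12, "TB")
```

So the two node ids take under 9 bytes of each triple, and a trillion edges at 100 bytes apiece comes to about 100 terabytes.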

The biggest graph-size estimates I’ve gotten are from my clients at Yarcdata, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:

An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.
A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.

Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the edges/node average seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.

Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be:

Store a table with one row per edge.
Do an (n-1)-way join, where n is the number of edges in the path or subgraph.

In many cases the cardinality of intermediate result sets would be high, and you’d basically be doing a series of full table scans. Those could take a while.
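
Here is what the naive relational approach looks like in practice, sketched with Python's built-in sqlite3 module. The table name, column names and the five sample edges are made up for illustration. A 3-edge path needs three copies of the edge table and hence two joins, which is the (n-1) above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per edge.
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [
        ("alice", "bob"),
        ("bob", "carol"),
        ("carol", "dave"),
        ("bob", "erin"),
        ("erin", "dave"),
    ],
)

# All 3-edge paths starting at a given node: three copies of edges, two joins.
paths = conn.execute(
    """
    SELECT e1.src, e1.dst, e2.dst, e3.dst
    FROM edges AS e1
    JOIN edges AS e2 ON e2.src = e1.dst
    JOIN edges AS e3 ON e3.src = e2.dst
    WHERE e1.src = ?
    ORDER BY e2.dst
    """,
    ("alice",),
).fetchall()

for path in paths:
    print(" -> ".join(path))
```

Without an index on src, each of those joins is a scan of the whole edge table, and each additional hop adds another one.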

There are various approaches to dealing with this challenge. For example:

Graph analysis has been around long enough that much of it has surely been done relationally.
I wrote about some specific relational strategies for graph analysis five years ago.
A lot of graph analysis these days is being done in Hadoop (or other MapReduce, notably Aster Data’s).
Objectivity Infinite Graph and Google Pregel emphasize pre-fetching (or pre-shipping) edges that might soon be needed.
Yarcdata, with its Cray genes, tries to optimize hardware (single RAM image across a cluster, with a whole lot of multithreading) for in-memory Apache Jena performance. Unfortunately, I’m not clear as to which data structure(s) Jena uses.

When trying to figure out which of these techniques is likely to win in the most demanding cases, I run into the key controversy around analytic graph data management — how successfully can graphs be partitioned? Opinions vary widely, with the correct answers in each case surely depending on:

The topology of the graph.
The size of the graph.
The length of the paths that need to be examined.

But in the interest of getting this posted tonight, I’ll leave further discussion of graph partitioning to another time.

MarkLogic 5, and why you might care

Curt Monash — Tue, 01 Nov 2011 04:03:59 +0000

MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:

More-of-the-same in line with MarkLogic’s core positioning.
A new bi-directional Hadoop connector.
A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post.

Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called Isys. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.

*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.
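For readers who want a picture of what "writes go to the fast tier first, then get flushed via a caching-style algorithm" can mean, here is a minimal sketch. It is emphatically not MarkLogic's implementation; the `TieredStore` class, its capacity parameter, and the plain LRU eviction policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict

class TieredStore:
    """Toy two-tier store: a small fast tier in front of a large slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for solid-state storage, least recently used first
        self.slow = {}             # stands in for disk

    def write(self, key, value):
        # Every write lands in the fast tier first and counts as a use.
        self.fast[key] = value
        self.fast.move_to_end(key)
        self._flush_if_full()

    def read(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)  # mark as most recently used
            return self.fast[key]
        return self.slow[key]           # fall back to the slow tier

    def _flush_if_full(self):
        # Move the least recently used entries down to the slow tier.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value

store = TieredStore(fast_capacity=2)
store.write("a", 1)
store.write("b", 2)
store.read("a")      # "a" is now more recently used than "b"
store.write("c", 3)  # fast tier is over capacity, so "b" is flushed
print(list(store.fast), store.slow)
```

After the last write, the fast tier holds "a" and "c", and "b" has been flushed to the slow tier, because it was the least recently used entry.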

MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:

MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of the MarkLogic deck) …
… which has been optimized from the get-go for poly-structured data.
MarkLogic can and does scale out to handle large amounts of data.
MarkLogic is a general-purpose DBMS, suitable for both short-request and analytic tasks.
MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about derived data).
MarkLogic often plays the role of a content assembler and/or search engine, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.

Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.

My November, 2010 overview of MarkLogic technology remains pretty relevant. One correction, however: Node heterogeneity configurations, in which “data” and “evaluation” nodes reside on separate servers, are the exception rather than the rule.

Like Vertica, MarkLogic has laudably said that true academic researchers can get MarkLogic for free without the severe license restrictions. Free MarkLogic should be of particular interest to researchers who:

Are studying natural networks or graphs, such as social networks or biological pathways. (This might be a fit in the social or biological sciences.)
Are managing metadata for, say, a variety of disparate kinds of experimental files. (This might be a fit anywhere in the natural sciences.)
Are managing actual documents, images, videos, etc., or data about such things. (This might be a fit in the humanities or social sciences.)

MarkLogic provided some disclosable financial substance by email, which I shall quote verbatim:

MarkLogic has 45% revenue growth and 55-60% license growth year over year.
We expect to finish this year with over $85 million in revenue, up from $55 million last year.

Arithmetical purists might note that 85/55 is more than 145%, but I’m just going to settle for the information I got and move on.
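For anybody who wants the arithmetic spelled out, here is a short check using only the figures quoted above. The variable names are mine, and "over $85 million" is treated as exactly $85 million, which makes the implied growth figure a lower bound.

```python
last_year = 55.0      # revenue in $ millions, from the quoted email
this_year = 85.0      # expected revenue in $ millions ("over $85 million")
quoted_growth = 0.45  # revenue growth stated in the same email

ratio = this_year / last_year
implied_growth = ratio - 1.0
revenue_at_quoted_growth = last_year * (1 + quoted_growth)

print(f"85/55 = {ratio:.3f}, implied growth = {implied_growth:.1%}")
print(f"45% growth on $55M would give ${revenue_at_quoted_growth:.2f}M")
```

So the two revenue figures imply growth of about 54.5%, whereas 45% growth on $55 million would land at about $79.75 million.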

Edit: I posted separately about the MarkLogic Hadoop connector. As for that Hadoop connector – stay tuned for a short follow-up post, as writing about it now would not be convenient. (My backup discipline isn’t what it should be, and the only copy of my notes about that product is on a heavy tower computer in a house that doesn’t have working power.)

Commercial software for academic use

Curt Monash — Fri, 14 Oct 2011 06:21:21 +0000

As Jacek Becla explained:

Academic scientists like their software to be open source, for reasons that include both free-like-speech and free-like-beer.
What’s more, they like their software to be dead-simple to administer and use, since they often lack the dedicated human resources for anything else.

Even so, I think that academic researchers, in the natural and social sciences alike, commonly overlook the wealth of commercial software that could help them in their efforts.

I further think that the commercial software industry could do a better job of exposing its work to academics, where by “expose” I mean:

Give your stuff to academics for free.
Call their attention to your free offering.

Reasons to do so include:

Public benefit. Scientific research is important.
Training future customers. There’s huge academic/commercial crossover, especially as students join the for-profit workforce.

The biggest issue is probably large-scale database management. There’s a feeling, permeating for example parts of the XLDB conference and the associated SciDB project, that data stores suitable for holding large amounts of data are either:

Hadoop or
Forbiddingly expensive.

I think that’s overstated. In particular:

You can put >10 terabytes of machine-generated data (or any other kind) into Infobright and have it well taken care of; Infobright is open source.
You can put >1 petabyte into [name redacted],* among others; [name redacted]* should be out soon with a generously free offering for academic users. Edit: That would be Vertica.
Conventional relational queries, graph analysis, statistical analysis preparation and more can all be much faster in a good analytic DBMS than in alternative kinds of data stores.
Integration between SQL and other analytic languages is ever improving, as analytic DBMS evolve into “analytic platforms”.

*My permission to use the name was yanked after this post was largely drafted. I’m sufficiently pleased with the forthcoming offering itself that I can’t get upset about the procedural confusion.

With a couple of exceptions, the statistics/predictive analytics situation seems more reasonable. Industry leaders such as SAS Institute and SPSS (now an IBM company) have engaged in varying degrees of academic outreach. R is in the process of crossing over from academia to business.

Unfortunately, I know next to nothing about Stata or, elsewhere in the technical languages area, Mathworks/Matlab. (Who knew that Mathworks was a $600 million company, local to my geographical area?)

One statistical tool that should perhaps be more present in academia is KXEN. KXEN seems to have some nice differentiation in not making you understand in advance which of your variables are most important. Econometricians and others with large numbers of independent variables might wish to take note.

If you think the true situation is nonlinear, and you’re trying to approximate it with linear models, you almost always have a large number of variables to consider. True, monomials in independent variables aren’t actually independent, but it might be interesting to pretend that they are and see if any insights fall out that could help in more rigorous analysis.
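To show how quickly the variable count grows when monomials are treated as if they were independent variables, here is a minimal sketch. It has nothing to do with how KXEN works internally; the helper names, the three example variables, and the sample row are assumptions made for illustration.

```python
from itertools import combinations_with_replacement
from math import comb

def monomials(variables, max_degree):
    """All monomials of degree 1..max_degree, each as a tuple of variable names."""
    terms = []
    for degree in range(1, max_degree + 1):
        terms.extend(combinations_with_replacement(variables, degree))
    return terms

def evaluate(term, row):
    """Value of one monomial for a row given as {variable name: value}."""
    value = 1.0
    for name in term:
        value *= row[name]
    return value

variables = ["x", "y", "z"]
terms = monomials(variables, 2)  # x, y, z, x*x, x*y, x*z, y*y, y*z, z*z
row = {"x": 2.0, "y": 3.0, "z": 5.0}
features = [evaluate(term, row) for term in terms]  # inputs for a linear model

# The count of non-constant monomials is C(n + d, d) - 1 for n variables, degree <= d.
wide = monomials([f"v{i}" for i in range(20)], 3)
print(len(terms), comb(3 + 2, 2) - 1)
print(len(wide), comb(20 + 3, 3) - 1)
```

Three variables at degree 2 already give 9 candidate terms, and 20 variables at degree 3 give 1,770. Each term can be fed to a linear model as though it were a separate column, which is exactly the situation in which not having to pick the important variables in advance would help.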

I’d further argue that, as part of neglecting commercial analytic DBMS, the scientific community in particular neglects the potential of integrated analytic platforms. Admittedly, the early leaders in that area — Aster Data, perhaps followed by Netezza (now an IBM company) — aren’t exactly priced in an academic-friendly way. But Vertica, EMC Greenplum, et al. are playing catch-up with analogous technology, and they’re more likely to offer appealing academic pricing.

There’s also the investigative analytics side of business intelligence, especially in the area of visualization/discovery. While Spotfire (now a TIBCO company) got much of its start in research-oriented areas, the otherwise more visible — no pun intended — QlikTech and Tableau don’t seem to have done much in academia. Datameer and yet-younger Hadoop-oriented business intelligence startups don’t seem to be doing much on the academic front either, more’s the pity.

Frankly, I think that most scientific analytic technology needs are also found in the business world.* That overlap will only grow as businesses focus more on machine-generated data. Commercial software companies should pay more attention to scientists, and scientists should gaze out more often from their ramshackle, budget-constrained ivory towers.

*The converse isn’t as true. Businesses have issues not well reflected in science, derived (for example) from the complexity of their transactional schemas, or from office-politics considerations around “one version of the truth”.

Edit: Some links that seem relevant to this year’s XLDB program

Zynga and LinkedIn
Objectivity Infinite Graph
eBay as of last year’s XLDB (the most expensive blog post I ever wrote, in light of Greenplum’s subsequent response)