As Jacek Becla explained:
- Academic scientists like their software to be open source, for reasons that include both free-like-speech and free-like-beer.
- What’s more, they like their software to be dead-simple to administer and use, since they often lack the dedicated human resources for anything else.
Even so, I think that academic researchers, in the natural and social sciences alike, commonly overlook the wealth of commercial software that could help them in their efforts.
I further think that the commercial software industry could do a better job of exposing its work to academics, where by “expose” I mean:
- Give your stuff to academics for free.
- Call their attention to your free offering.
Reasons to do so include:
- Public benefit. Scientific research is important.
- Training future customers. There’s huge academic/commercial crossover, especially as students join the for-profit workforce.
The biggest issue is probably large-scale database management. There’s a feeling, evident for example in parts of the XLDB conference and the associated SciDB project, that data stores suitable for very large scientific datasets are either:
- Hadoop or
- Forbiddingly expensive.
I think that’s overstated. In particular:
- You can put >10 terabytes of machine-generated data (or any other kind) into Infobright and have it well taken care of; Infobright is open source.
- You can put >1 petabyte into [name redacted],* among others; [name redacted]* should be out soon with a generously free offering for academic users. Edit: That would be Vertica.
- Conventional relational queries, graph analysis, statistical analysis preparation and more can all be much faster in a good analytic DBMS than in alternative kinds of data stores.
- Integration between SQL and other analytic languages is ever improving, as analytic DBMS evolve into “analytic platforms”.
*My permission to use the name was yanked after this post was largely drafted. I’m sufficiently pleased with the forthcoming offering itself that I can’t get upset about the procedural confusion.
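The SQL-plus-analytic-language pattern above can be sketched in miniature. This toy uses SQLite purely for illustration (an analytic DBMS would do the same thing at scale): the relational aggregation runs inside the database, and only a compact summary crosses over into the host language for further analysis.

```python
import sqlite3
import statistics

# Build a tiny in-memory table of machine-generated readings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)],
)

# The heavy relational work -- scan, group, aggregate -- stays in SQL...
rows = conn.execute(
    "SELECT sensor, AVG(value), COUNT(*) FROM readings "
    "GROUP BY sensor ORDER BY sensor"
).fetchall()

# ...and only the small result set is handed to the analytic language.
means = {sensor: avg for sensor, avg, _ in rows}
spread = statistics.pstdev(means.values())
```

The design point is the data movement: shipping a few aggregates out of the DBMS is cheap, while shipping terabytes of raw rows into a statistics package is not.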
With a couple of exceptions, the statistics/predictive analytics situation seems more reasonable. Industry leaders such as SAS Institute and SPSS (now an IBM company) have engaged in varying degrees of academic outreach. R is in the process of crossing over from academia to business.
Unfortunately, I know next to nothing about Stata or, elsewhere in the technical languages area, Mathworks/Matlab. (Who knew that Mathworks was a $600 million company, local to my geographical area?)
One statistical tool that should perhaps be more present in academia is KXEN. KXEN seems to have some nice differentiation in not making you understand in advance which of your variables are most important. Econometricians and others with large numbers of independent variables might wish to take note.
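KXEN’s algorithms are proprietary, so the following is not its actual method; it is only a rough open-source analogue of the general idea of letting the tool rank candidate variables, rather than making the analyst guess in advance which ones matter. Here a simple univariate screen orders twenty candidate predictors by their absolute correlation with the target.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))   # 20 candidate independent variables
# Only variables 4 and 11 actually drive the target.
y = 3 * X[:, 4] - 2 * X[:, 11] + rng.normal(scale=0.1, size=500)

# Score every candidate by |correlation| with the target, then rank.
scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
ranked = sorted(range(X.shape[1]), key=lambda j: -scores[j])
```

Real tools go well beyond univariate correlation, of course, but even this crude screen surfaces the two truly influential variables at the top of the ranking.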
If you think the true situation is nonlinear, and you’re trying to approximate it with linear models, you almost always have a large number of variables to consider. True, monomials in independent variables aren’t actually independent, but it might be interesting to pretend that they are and see if any insights fall out that could help in more rigorous analysis.
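A minimal NumPy sketch of that pretend-linearity trick (variable names mine): expand three independent variables into all monomials up to degree two, then fit an ordinary linear model on the expanded features. The true relationship below is nonlinear in the original variables, yet linear in the monomials, so least squares recovers it.

```python
import itertools
import numpy as np

def monomial_features(X, degree=2):
    """Expand the columns of X into all monomials up to `degree`,
    including the constant term and cross terms such as x0*x1."""
    n, d = X.shape
    feats = []
    for deg in range(degree + 1):
        for combo in itertools.combinations_with_replacement(range(d), deg):
            col = np.ones(n)
            for j in combo:
                col = col * X[:, j]
            feats.append(col)
    return np.column_stack(feats)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # 3 independent variables
y = 2.0 * X[:, 0] * X[:, 1] + X[:, 2]    # nonlinear in the originals

Phi = monomial_features(X, degree=2)     # 3 variables -> 10 monomial columns
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

Note how quickly the variable count grows: three variables become ten degree-two monomials, which is exactly why the large-numbers-of-variables point above matters.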
I’d further argue that, as part of neglecting commercial analytic DBMS, the scientific community in particular neglects the potential of integrated analytic platforms. Admittedly, the early leaders in that area — Aster Data, perhaps followed by Netezza (now an IBM company) — aren’t exactly priced in an academic-friendly way. But Vertica, EMC Greenplum, et al. are playing catch-up with analogous technology, and they’re more likely to offer appealing academic pricing.
There’s also the investigative analytics side of business intelligence, especially in the area of visualization/discovery. While Spotfire (now a TIBCO company) got much of its start in research-oriented areas, the otherwise more visible — no pun intended — QlikTech and Tableau don’t seem to have done much in academia. Datameer and yet-younger Hadoop-oriented business intelligence startups don’t seem to be doing much on the academic front either, more’s the pity.
Frankly, I think that most scientific analytic technology needs are also found in the business world.* That convergence will only get closer as businesses focus more on machine-generated data. Commercial software companies should pay more attention to scientists, and scientists should gaze out more often from their ramshackle, budget-constrained ivory towers.
*The converse isn’t as true. Businesses have issues not well reflected in science, derived (for example) from the complexity of their transactional schemas, or from office-politics considerations around “one version of the truth”.
Edit: Some links that seem relevant to this year’s XLDB program
- Zynga and LinkedIn
- Objectivity Infinite Graph
- eBay as of last year’s XLDB (the most expensive blog post I ever wrote, in light of Greenplum’s subsequent response)