October 14, 2011

Commercial software for academic use

As Jacek Becla explained:

Even so, I think that academic researchers, in the natural and social sciences alike, commonly overlook the wealth of commercial software that could help them in their efforts.

I further think that the commercial software industry could do a better job of exposing its work to academics, where by “expose” I mean:

Reasons to do so include:

The biggest issue is probably large-scale database management. There’s a feeling, permeating for example parts of the XLDB conference and the associated SciDB project, that data stores suitable for holding large amounts of data are either:

I think that’s overstated. In particular:

*My permission to use the name was yanked after this post was largely drafted. I’m sufficiently pleased with the forthcoming offering itself that I can’t get upset about the procedural confusion.

With a couple of exceptions, the statistics/predictive analytics situation seems more reasonable. Industry leaders such as SAS Institute and SPSS (now an IBM company) have engaged in varying degrees of academic outreach. R is in the process of crossing over from academia to business.

Unfortunately, I know next to nothing about Stata or, elsewhere in the technical languages area, Mathworks/Matlab. (Who knew that Mathworks was a $600 million company, local to my geographical area?)

One statistical tool that should perhaps be more present in academia is KXEN. KXEN seems to have a nice point of differentiation, in that it doesn’t require you to know in advance which of your variables are most important. Econometricians and others with large numbers of independent variables might wish to take note.

If you think the true situation is nonlinear, and you’re trying to approximate it with linear models, you almost always have a large number of variables to consider. True, monomials in independent variables aren’t actually independent, but it might be interesting to pretend that they are and see if any insights fall out that could help in more rigorous analysis.
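
To make that concrete, here is a minimal sketch in Python with scikit-learn (my choice of tools, not anything named above), on synthetic data: expand the original variables into monomials, fit an ordinary linear model, and read the coefficients as a rough first hint at which terms matter.

```python
# Sketch: treat monomials of the original variables as if they were
# independent features, then fit a plain linear model on them.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))  # two underlying variables
y = 3 * X[:, 0] ** 2 - X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 500)  # nonlinear truth

# Expand to monomials up to degree 2: x0, x1, x0^2, x0*x1, x1^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)

# Coefficient magnitudes give a crude ranking of which monomials matter.
for name, coef in zip(poly.get_feature_names_out(), model.coef_):
    print(f"{name:8s} {coef:8.3f}")
```

Since the monomials are correlated with one another, that ranking is only a heuristic, but it may surface terms worth carrying into more rigorous analysis.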

I’d further argue that, as part of neglecting commercial analytic DBMS, the scientific community in particular neglects the potential of integrated analytic platforms. Admittedly, the early leaders in that area — Aster Data, perhaps followed by Netezza (now an IBM company) — aren’t exactly priced in an academic-friendly way. But Vertica, EMC Greenplum, et al. are playing catch-up with analogous technology, and they’re more likely to offer appealing academic pricing.

There’s also the investigative analytics side of business intelligence, especially in the area of visualization/discovery. While Spotfire (now a TIBCO company) got much of its start in research-oriented areas, the otherwise more visible — no pun intended — QlikTech and Tableau don’t seem to have done much in academia. Datameer and yet-younger Hadoop-oriented business intelligence startups don’t seem to be doing much on the academic front either, more’s the pity.

Frankly, I think that most scientific analytic technology needs are also found in the business world.* That convergence will only get closer as businesses focus more on machine-generated data. Commercial software companies should pay more attention to scientists, and scientists should gaze out more often from their ramshackle, budget-constrained ivory towers.

*The converse isn’t as true. Businesses have issues not well reflected in science, derived (for example) from the complexity of their transactional schemas, or from office-politics considerations around “one version of the truth”.

Edit: Some links that seem relevant to this year’s XLDB program


Comments

7 Responses to “Commercial software for academic use”

  1. Robert Hodges on October 14th, 2011 2:46 pm

    Hi Curt, there are some notable exceptions to the rule that commercial DBMS do not support scientific projects. Microsoft seems to have been very generous in providing DBMS technology to universities. For example, the Pan-STARRS PS1 project (http://pan-starrs.ifa.hawaii.edu/public/home.html) uses MS SQL Server, unless they have changed recently. Pan-STARRS incidentally gives new meaning to the phrase “spatial query.”

  2. Jacek Becla on October 14th, 2011 3:59 pm

    Curt,

    You touched on many important points, and you did it very well!

    One comment I’d make is that many scientific projects do not fall under the umbrella of academic use from the perspective of commercial software licenses. I suspect you meant both academic and scientific use here.

    I will also point out that sometimes it is not just the cost of commercial *software* that is the barrier. Commercial software often comes in an appliance, and that is problematic for multi-decade experiments, which are (pretty much always) required to be able to reproduce (all published) results.

    The good news is that many larger scientific projects are willing to try… Pan-STARRS was just mentioned, SDSS is a good example, GAIA chose InterSystems’ Caché… I think it is a battle we can win!

  3. Curt Monash on October 17th, 2011 10:13 am

    Jacek,

    I’m not sure I understand the appliance problem. A SQL query will return the same results from one DBMS to another, wrapped in hardware or otherwise, unless you use vendor-specific extensions. What’s more, relatively few of those extensions involve approximation, and fewer still are non-reproducible. Yes, there are some time series interpolations, but they’re deterministic. Yes, there are some fast approximate medians/deciles/whatever, but you can also do slower precise ones.

    Price perhaps aside, I’m not understanding the reason not to use Vertica or Aster nCluster or Infobright or whatever, if they seem well-suited to the job.
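
    A minimal sketch of that reproducibility point, in Python standing in for a DBMS’s aggregate functions (the sampling scheme is purely illustrative, not any particular vendor’s): an approximate median is reproducible so long as its randomness is seeded, and the precise version remains available as a slower fallback.

    ```python
    # Sketch: precise vs. approximate medians. The approximation samples
    # the data; fixing the seed makes it deterministic, hence reproducible.
    import numpy as np

    def precise_median(values: np.ndarray) -> float:
        """Slower but exact: considers every value."""
        return float(np.median(values))

    def approx_median(values: np.ndarray, sample_size: int = 10_000,
                      seed: int = 42) -> float:
        """Faster approximation: median of a fixed-seed random sample."""
        rng = np.random.default_rng(seed)
        sample = rng.choice(values, size=min(sample_size, len(values)),
                            replace=False)
        return float(np.median(sample))

    data = np.random.default_rng(0).lognormal(size=1_000_000)
    print(precise_median(data))  # exact answer
    print(approx_median(data))   # close to it, and identical on every rerun
    ```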

  4. Jacek Becla on October 24th, 2011 1:56 pm

    Curt,

    The appliance problem is related to being locked into specialized (and yes, typically expensive) hardware:

    (1) In scientific environments, with multi-lab projects and multi-project labs, hardware is often shared between projects or repurposed; specialized hardware makes that much harder.

    (2) Reproducing results generated 10 years ago is often an issue: it is way easier to virtualize the environment if you don’t have to deal with specialized software that runs only on specialized, no-longer-supported hardware.

    (3) A lot of data is correlated, and crossing the boundaries between different appliance boxes can be non-trivial in some cases.

    (4) Debugging is another issue.

  5. Curt Monash on October 24th, 2011 3:29 pm

    Jacek,

    Got it. Anyhow, there are lots of commercial analytic DBMS products that aren’t tied to hardware. Indeed, pretty much the only ones that are tied are Netezza (strictly), Teradata (for all practical purposes, unless you have a pretty small database), and Oracle (ditto, if you’re viewing Oracle as an analytic rather than OLTP option).

  6. Joe Hellerstein on October 28th, 2011 2:24 pm

    FYI, Greenplum’s been giving away its engine to research for a long while now. Since it’s almost completely Postgres-compliant in its client tools, UDF APIs, etc., it’s easy for academics to ramp up to it.

    Chris Re’s Hazy project at Wisconsin is starting to use it, I believe — they do in-database statistical machine learning. Very impressive work, BTW: http://research.cs.wisc.edu/hazy/. Also, our former student Daisy Wang, now on the faculty at Florida, has used it to do in-database statistical ML for text analysis and entity extraction: http://www.cise.ufl.edu/~daisyw/

    We’re planning to harvest research efforts like these via open source in MADlib: http://madlib.net. Currently only Postgres and Greenplum ports are supported, but I’m eager to get community energy around both new algorithmics and ports to other DBMSs. There are at least a couple of folks on the MADlib mailing lists talking about a Vertica port.
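
    For a concrete, if hypothetical, sense of what in-database ML looks like here: a minimal sketch in Python via psycopg2, calling MADlib’s linear-regression training function so the data never leaves the database. The connection string, table, and column names are invented for illustration; check madlib.linregr_train’s exact signature against the MADlib docs for your version.

    ```python
    # Sketch: in-database ML with MADlib, driven from Python. The DSN,
    # table, and column names are hypothetical; verify the
    # madlib.linregr_train signature against your MADlib version's docs.
    import psycopg2

    conn = psycopg2.connect("dbname=research user=analyst")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        # Train a linear model inside the database; no data export step.
        cur.execute("""
            SELECT madlib.linregr_train(
                'observations',             -- source table (hypothetical)
                'observations_model',       -- output table for the fit
                'brightness',               -- dependent variable
                'ARRAY[1, redshift, mass]'  -- independent variables, with intercept
            );
        """)
        cur.execute("SELECT coef FROM observations_model;")
        print(cur.fetchone())  # fitted coefficients as an array
    conn.close()
    ```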

  7. Curt Monash on October 28th, 2011 4:11 pm

    Joe,

    Greenplum’s single-node edition (http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/) disappointed me when it turned out to be a dud. Glad to hear the real thing is being given away for free!
