Discussion of how database and related technologies are used to support scientific research. Related subjects include:
IBM is acquiring Platform Computing, a company with which I had one briefing, last August. Quick background includes: Read more
|Categories: Hadoop, IBM and DB2, Investment research and trading, MapReduce, Parallelization, Scientific research||5 Comments|
I’m not a big fan of conferences, but I really like XLDB. Last year I got a lot out of XLDB, even though I couldn’t stay long (my elder care issues were in full swing). The year before I attended the whole thing — in Lyon, France, no less — and learned a lot more. This year’s XLDB conference is at SLAC — the organization formerly known as the Stanford Linear Accelerator Center — on Sand Hill Road in Menlo Park, October 18-19. As of right now, I plan to be there, at least on the first day. XLDB’s agenda and registration details (inexpensive) can be found on the XLDB conference website.
The only reason I wouldn’t go is if that turned out to be a lousy week for me to travel to California.
The people who go XLDB tend to be really smart — either research scientists, hardcore database technologists, or others who can hold their own with those folks. Audience participation can be intense; the most talkative members I can recall were Mike Stonebraker, Martin Kersten, Michael McIntire, and myself. Even the vendor folks tend to the smart — past examples include Stephen Brobst, Jeff Hammerbacher, Luke Lonergan, and IBM Fellow Laura Haas. When we had a datageek bash on my last trip to the SF area, several guys said they were planning to attend XLDB as well.
XLDB stands for eXtremely Large DataBases, and those are indeed what gets talked about there. Read more
|Categories: Data warehousing, Predictive modeling and advanced analytics, Scientific research||5 Comments|
I visited California recently, and chatted with numerous companies involved in Hadoop — Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I’ll defer further Hadoop technical discussions for now — my target to restart them is later this month — but that still leaves some other issues to discuss, namely adoption and partnering.
The total number of enterprises in the world paying subscription and license fees that they would regard as being for “Hadoop or something Hadoop-related” probably is not much over 100 right now, but I’d expect to see pretty rapid growth. Beyond that, let’s divide customers into three groups:
- Internet businesses.
- Traditional enterprises ‘ internet operations.
- Traditional enterprises’ other operations.
Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of machine-generated data, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play — web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it’s one area of scientific research that actually enjoys fat for-profit research budgets.
|Categories: Cloudera, Hadoop, Health care, Hortonworks, Investment research and trading, Log analysis, MapR, MapReduce, Market share and customer counts, Scientific research, Web analytics||5 Comments|
I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.
Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo’s latest stated figures are:
- 42,000 Hadoop nodes …
- … holding 180-200 petabytes of data.
|Categories: Cloudera, Facebook, Hadoop, Investment research and trading, Log analysis, MapReduce, Market share and customer counts, Petabyte-scale data management, Scientific research, Web analytics, Yahoo||13 Comments|
In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear. Read more
Analytic data management technology has blossomed, leading to many questions along the lines of “So which products should I use for which category of problem?” The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for “big data” is little help.
Let’s try eight categories instead. While no categorization is ever perfect, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need — and in most cases you’ll need several — is a great early step in your analytic technology planning. Read more
|Categories: Analytic technologies, Health care, Michael Stonebraker, MySQL, Open source, Parallelization, Petabyte-scale data management, Scientific research, Surveillance and privacy||2 Comments|
Nested data structures have come up several times now, almost always in the context of log files.
- Google has published about a project called Dremel. Per Tasso Agyros, one of Dremel’s key concepts is nested data structures.
- Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. eBay was very interested in that project too.
- Facebook’s log files have a big nested data structure flavor.
I don’t have a grasp yet on what exactly is happening here, but it’s something.
|Categories: eBay, Facebook, Google, Log analysis, Scientific research, Theory and architecture||7 Comments|
Scientific data commonly:
- Comes in large volumes
- Is machine-generated
- Is augmented by synthetic and/or derived data
- Has a spatial and/or temporal structure
In those respects, it is akin to some of the hottest areas for big data analytics, including:
- Investment trade data – big, partly machine generated, augmented (often), temporal
- Web/network log data – big, machine-generated, post-processed into derived form, temporal
- Marketing analytic data – big, post-processed into derived form
- Genomic data
So when Jacek Becla started the XLDB conferences on the premise that scientific and big data analytic challenges have a lot in common, he had a point. There are several tough database problems that the science-focused folks have taken the leading in thinking about, but which are soon going to matter to the commercial world as well. And that’s one of two big reasons why you should consider participating in XLDB4, October 6-7, at the SLAC facility in Menlo Park, CA, as an attendee, sponsor, or both.
The other big reason is that it is important for the world that XLDB succeed. Read more
|Categories: Investment research and trading, Log analysis, Scientific research, Web analytics||2 Comments|
I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That’s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about issues in scientific data management. Here’s some of what has transpired since then.
|Categories: Analytic technologies, Data warehousing, eBay, GIS and geospatial, Microsoft and SQL*Server, SciDB, Scientific research, Web analytics||3 Comments|