Discussion of how database and related technologies are used to support scientific research. Related subjects include:
I visited California recently, and chatted with numerous companies involved in Hadoop — Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I’ll defer further Hadoop technical discussions for now — my target to restart them is later this month — but that still leaves some other issues to discuss, namely adoption and partnering.
The total number of enterprises in the world paying subscription and license fees that they would regard as being for “Hadoop or something Hadoop-related” probably is not much over 100 right now, but I’d expect to see pretty rapid growth. Beyond that, let’s divide customers into three groups:
- Internet businesses.
- Traditional enterprises ‘ internet operations.
- Traditional enterprises’ other operations.
Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of machine-generated data, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play — web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it’s one area of scientific research that actually enjoys fat for-profit research budgets.
|Categories: Cloudera, Hadoop, Health care, Hortonworks, Investment research and trading, Log analysis, MapR, MapReduce, Market share and customer counts, Scientific research, Web analytics||5 Comments|
I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.
Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo’s latest stated figures are:
- 42,000 Hadoop nodes …
- … holding 180-200 petabytes of data.
|Categories: Cloudera, Facebook, Hadoop, Investment research and trading, Log analysis, MapReduce, Market share and customer counts, Petabyte-scale data management, Scientific research, Web analytics, Yahoo||11 Comments|
In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear. Read more
Analytic data management technology has blossomed, leading to many questions along the lines of “So which products should I use for which category of problem?” The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for “big data” is little help.
Let’s try eight categories instead. While no categorization is ever perfect, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need — and in most cases you’ll need several — is a great early step in your analytic technology planning. Read more
|Categories: Analytic technologies, Health care, Michael Stonebraker, MySQL, Open source, Parallelization, Petabyte-scale data management, Scientific research, Surveillance and privacy||2 Comments|
Nested data structures have come up several times now, almost always in the context of log files.
- Google has published about a project called Dremel. Per Tasso Agyros, one of Dremel’s key concepts is nested data structures.
- Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. eBay was very interested in that project too.
- Facebook’s log files have a big nested data structure flavor.
I don’t have a grasp yet on what exactly is happening here, but it’s something.
|Categories: eBay, Facebook, Google, Log analysis, Scientific research, Theory and architecture||7 Comments|
Scientific data commonly:
- Comes in large volumes
- Is machine-generated
- Is augmented by synthetic and/or derived data
- Has a spatial and/or temporal structure
In those respects, it is akin to some of the hottest areas for big data analytics, including:
- Investment trade data – big, partly machine generated, augmented (often), temporal
- Web/network log data – big, machine-generated, post-processed into derived form, temporal
- Marketing analytic data – big, post-processed into derived form
- Genomic data
So when Jacek Becla started the XLDB conferences on the premise that scientific and big data analytic challenges have a lot in common, he had a point. There are several tough database problems that the science-focused folks have taken the leading in thinking about, but which are soon going to matter to the commercial world as well. And that’s one of two big reasons why you should consider participating in XLDB4, October 6-7, at the SLAC facility in Menlo Park, CA, as an attendee, sponsor, or both.
The other big reason is that it is important for the world that XLDB succeed. Read more
|Categories: Investment research and trading, Log analysis, Scientific research, Web analytics||2 Comments|
I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That’s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about issues in scientific data management. Here’s some of what has transpired since then.
|Categories: Analytic technologies, Data warehousing, eBay, GIS and geospatial, Microsoft and SQL*Server, SciDB, Scientific research, Web analytics||3 Comments|
Greenplum is announcing today that you can run Greenplum software on a single 8-core commodity server, free. First and foremost, that’s a strong statement that Greenplum wants enterprises to pay it for Greenplum’s parallelization/”private cloud” capabilities. Second, it may be an attractive gift to a variety of folks who want to extract insight from terabyte-scale databases of various kinds.
Greenplum Single-Node Edition:
- Is free of charge, although you can buy support.
- Has no restrictions on use, production or otherwise.
- Has no restrictions on database size.
- Is closed-source.
For those who want free, terabyte-scale data warehousing software, Greenplum Single-Node Edition may be quite appealing, considering that the main available alternatives are:
- General-purpose open-source DBMS, such as PostgreSQL and MySQL (lacking analytic DBMS performance and features)
- Infobright Community Edition (the other best choice – Infobright’s commercial sales success indicates the solidity of Infobright’s technology)
- Rough research-project code and other other questionable open source offerings
- Crippleware from other commercial analytic DBMS vendors (e.g., Teradata)
For example, comparing PostgreSQL-based Greenplum with PostgreSQL itself, Greenplum offers:
- The ability to scale out queries across all cores in your box (and no, pgpool is not a serious alternative)
- Storage alternatives such as columnar (I am told that EnterpriseDB recently stopped funding a project for a PostgreSQL columnar option)
|Categories: Analytic technologies, Data warehousing, EnterpriseDB and Postgres Plus, Greenplum, Infobright, Open source, PostgreSQL, Pricing, Scientific research||12 Comments|
MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:
- Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
- Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
- Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
- Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
- Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
- Application areas mentioned — and these overlap in some cases — include:
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance