I have some presentations coming up (all on October Thursdays)
On Thursday, October 15, at two different times (10:00 am and 1:00 pm Eastern time), I’ll be giving a webinar for Aster Data on MapReduce. The content is very much a work in progress, but it definitely will:
- Be overviewy in nature
- Emphasize SQL/MapReduce integration
Then, on the evening of Thursday, October 22, there’s something called the Boston Big Data Summit, in Waltham, where “Big Data” evidently is to be construed as anything from a few terabytes on up. (Things are smaller in the Northeast than in California …) It’s being put together by Amrith Kumar (whom I don’t really know) and Bob Zurek (whom everybody knows). This is the inaugural meeting. It seems I’m both giving the keynote and running the subsequent panel, one of whose participants will be Ellen Rubin. Read more
Categories: Analytic technologies, Aster Data, Cloud computing, MapReduce, Presentations | 4 Comments |
Oracle Exadata customers presenting at Oracle Open World
Greg Rahn tweeted a list of Exadata-focused sessions at Oracle Open World next week. As Oracle employees and supporters have been foreshadowing, there will be Exadata users and user-like folks presenting. I identified what look like half a dozen (not counting any who, for example, will make surprise appearances at keynote addresses), specifically: Read more
Categories: Data warehousing, Exadata, Market share and customer counts, Oracle, Teradata | 5 Comments |
Oracle and Vertica on compression and other physical data layout features
In my recent post on Exadata pricing, I highlighted the importance of Oracle’s compression figures to the discussion, and the uncertainty about same. This led to a Twitter discussion featuring Greg Rahn* of Oracle and Dave Menninger and Omer Trajman of Vertica. I also followed up with Omer on the phone. Read more
Categories: Columnar database management, Data models and architecture, Data warehousing, Database compression, Oracle, Theory and architecture, Vertica Systems | 14 Comments |
Oracle’s version of “actually, we’ve been doing MapReduce all along too”
In a recent blog post, Jean-Pierre Dijcks of Oracle makes the argument that Oracle has supported MapReduce all along, essentially because:
- You can do lots of procedural logic in the Oracle database, in a broad choice of languages, so in particular you can do Map steps.
- You can do lots of procedural logic in the Oracle database, in a broad choice of languages, so in particular you can do Reduce steps.
- Oracle offers a mechanism for parallelizing procedural logic. (A sketch of the pattern these three points add up to follows the list.)
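To make those three bullets concrete, here is a minimal single-process sketch of the pattern they describe: per-row procedural logic as the Map, per-group procedural logic as the Reduce, and grouping in between (which a real engine would parallelize across nodes). This is a generic Python illustration of the programming model, not Oracle’s interface.

```python
from itertools import groupby
from operator import itemgetter

def word_count_map(row):
    """Map step: emit (key, value) pairs from one input row.
    Any per-row procedural logic would do; word count is the
    canonical example."""
    for word in row.split():
        yield (word, 1)

def word_count_reduce(key, values):
    """Reduce step: fold all values sharing a key into one result."""
    return (key, sum(values))

def mapreduce(rows, map_fn, reduce_fn):
    # Map: apply the per-row procedural logic.
    pairs = [pair for row in rows for pair in map_fn(row)]
    # Shuffle: group intermediate pairs by key. A parallel engine
    # would partition these across workers instead of sorting locally.
    pairs.sort(key=itemgetter(0))
    # Reduce: apply the per-group procedural logic.
    return [reduce_fn(k, (v for _, v in group))
            for k, group in groupby(pairs, key=itemgetter(0))]

print(mapreduce(["the quick fox", "the lazy dog"],
                word_count_map, word_count_reduce))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```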
Oracle doesn’t appear to have an explicit Map/Reduce programming interface, but I wouldn’t be surprised if Oracle Consulting cranked one out at some point to meet customer demand.
The post goes on to claim the usual in-database MapReduce benefit of avoiding the overhead of intermediate query result materialization. Presumably, then, Oracle’s quasi-MapReduce would also lack query fault-tolerance, since Hadoop-style mid-job recovery depends on precisely that materialization of intermediate results.
Categories: Analytic technologies, MapReduce, Oracle, Parallelization | 1 Comment |
Oracle Exadata 2 capacity pricing
Summary of Oracle Exadata 2 capacity pricing
Analyzing Oracle Exadata pricing is always harder than one would first think. But I’ve finally gotten around to doing an Oracle Exadata 2 pricing spreadsheet. The main takeaways are:
- If we believe Oracle’s claims of 10X compression, Exadata 2 costs more per terabyte of user data than Netezza TwinFin — $22-26K/TB vs. TwinFin’s <$20K — but less than the Teradata 2550.
- These figures are highly sensitive to assumptions about Oracle’s hybrid columnar compression. (The sketch below shows just how sensitive.)
- Similarly, if Netezza or Teradata were to significantly upgrade their own compression, the price comparison would look quite different.
- Options such as Data Mining or Oracle Spatial add 12% or so each to Exadata’s total system price.
Longer version
When Oracle introduced Exadata last year it was, well, expensive. Exadata 2 has now been announced, and it is significantly cheaper than Exadata 1 per terabyte of user data, based on:
- Similar overall pricing
- Twice the disk capacity
- Better compression
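To see why the per-terabyte figures swing so much, here is a back-of-the-envelope sketch. The system price and raw capacity below are made-up round numbers purely for illustration, not Oracle’s actual figures; the point is just that the assumed compression ratio divides straight into the price per terabyte of user data, and that doubling disk at a similar system price halves it.

```python
def price_per_user_tb(system_price, raw_tb, compression):
    """Price per TB of *user* data: compression multiplies effective
    capacity, so it divides the per-TB price."""
    return system_price / (raw_tb * compression)

# Hypothetical round numbers, purely for illustration.
PRICE = 3_000_000   # total system price, USD
RAW = 50            # raw disk capacity, TB

for c in (3, 5, 10):
    print(f"{c:>2}X compression: ${price_per_user_tb(PRICE, RAW, c):,.0f}/TB")
#  3X compression: $20,000/TB
#  5X compression: $12,000/TB
# 10X compression: $6,000/TB

# The Exadata 1 -> Exadata 2 style of change: similar price, twice the disk.
print(f"2x disk at 10X: ${price_per_user_tb(PRICE, RAW * 2, 10):,.0f}/TB")
# 2x disk at 10X: $3,000/TB
```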
Categories: Analytic technologies, Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Exadata, Netezza, Oracle, Pricing, Teradata | 13 Comments |
Jacek Becla on issues in scientific data management
Just as Martin Kersten did, Jacek Becla emailed a response to my post on issues in scientific data management. With his permission, I’ve lightly edited his email too, and am posting it below, with some interspersed comments of my own. Read more
Categories: Analytic technologies, Hadoop, MapReduce, Objectivity and Infinite Graph, Open source, Parallelization, SciDB, Scientific research | 4 Comments |
Martin Kersten on issues in scientific data management
Martin Kersten emailed a response to my post on issues in scientific data management. With his permission, I’ve lightly edited it, and am posting it below. Read more
Categories: Analytic technologies, Clustering, Parallelization, SciDB, Scientific research | 3 Comments |
Issues in scientific data management
In the opinion of the leaders of the XLDB and SciDB efforts, key requirements for scientific data management include:
- A data model based on multidimensional arrays, not sets of tuples
- A storage model based on versions, not update-in-place (illustrated in the sketch after this list)
- Built-in support for provenance (lineage), workflows, and uncertainty
- Scalability to 100s of petabytes and 1,000s of nodes with high degrees of tolerance to failures
- Support for “external” data objects so that data sets can be queried and manipulated without ever having to be loaded into the database
- Open source, in order to foster a community of contributors and to ensure that data is never “locked up” — a critical requirement for scientists
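As a concrete (if toy) illustration of the first three bullets, here is a Python sketch of an array-oriented store that versions instead of updating in place and keeps a provenance note per version. Real systems would store array chunks and deltas rather than full copies; this just shows the data-model contrast with update-in-place tuples.

```python
import numpy as np

class VersionedArrayStore:
    """Toy store: the unit of storage is a multidimensional array,
    and updates append new versions instead of overwriting in place."""

    def __init__(self, base: np.ndarray):
        self.versions = [base.copy()]       # version 0
        self.provenance = ["initial load"]  # lineage note per version

    def update(self, index, value, note):
        """'Update' = copy-on-write: derive a new version, keep the old."""
        new = self.versions[-1].copy()
        new[index] = value
        self.versions.append(new)
        self.provenance.append(note)
        return len(self.versions) - 1       # new version number

    def read(self, version=-1):
        return self.versions[version]

# A 2-D array (say, a tiny image tile), then a calibrated revision.
store = VersionedArrayStore(np.zeros((3, 3)))
v1 = store.update((1, 1), 42.0, "calibration pass")
print(store.read(0)[1, 1], store.read(v1)[1, 1])  # 0.0 42.0; both survive
```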
However: Read more
MapReduce tidbits
I’ve never had children, and so have never had to supervise squabbling siblings, each accusing the other of selfishness and insufficient sharing. Perhaps the MapReduce vendors are a form of karmic payback. Be that as it may, my client Cloudera has organized Hadoop World on October 2 in New York, and my other client Aster Data is hosting a MapReduce-centric Big Data Summit the night before, at the same venue. Even if you don’t go, both conferences’ agenda pages offer a peek into what’s going on in MapReduce applications. I’m not going either, but even so I hope to post an overview of MapReduce uses after the conferences serve to publicize some of them.
Even better, I plan to hold a couple of webinars on MapReduce, the first to be given twice on October 15, at 10 am (blech) and 1 pm Eastern time. They’re sponsored by Aster Data, and so will have a strong SQL/MapReduce orientation.
In connection with its conference, Aster is introducing an nCluster-Hadoop connector — i.e., a loader from HDFS (Hadoop Distributed File System) implemented in SQL/MapReduce. In particular: Read more
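Pending those particulars, here is what a loader from HDFS does at its simplest, sketched as a generic Python analogue rather than Aster’s SQL/MapReduce implementation. It assumes a Hadoop client on the path, tab-delimited two-column files, and a DB-API cursor; all of those specifics are my own placeholders. In the SQL/MapReduce version, each parallel worker would read its own slice of HDFS blocks; this is the single-stream equivalent.

```python
import subprocess

def hdfs_lines(path):
    """Stream the lines of an HDFS file via the standard
    `hadoop fs -cat` CLI (assumes a Hadoop client is installed)."""
    proc = subprocess.Popen(["hadoop", "fs", "-cat", path],
                            stdout=subprocess.PIPE, text=True)
    yield from proc.stdout

def load_from_hdfs(cursor, path, batch_size=1000):
    """Parse tab-delimited lines and bulk-insert them in batches.
    The two-column 'clicks' table and %s paramstyle are illustrative;
    placeholder syntax varies by database driver."""
    batch = []
    for line in hdfs_lines(path):
        batch.append(tuple(line.rstrip("\n").split("\t")))
        if len(batch) >= batch_size:
            cursor.executemany("INSERT INTO clicks VALUES (%s, %s)", batch)
            batch.clear()
    if batch:
        cursor.executemany("INSERT INTO clicks VALUES (%s, %s)", batch)
```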
Categories: Aster Data, Cloudera, Data warehousing, Hadoop, MapReduce | 7 Comments |
Yahoo wants to do decapetabyte-scale data warehousing in Hadoop
My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo’s Hadoop effort — everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans to, within a year or so, get Hadoop to the point that it is managing 10s of petabytes of data for Yahoo, with reasonable data warehousing functionality.
Highlights of our visit included:
- There are dozens of people at Yahoo (full-time or close to it) doing Hadoop development work that will wind up getting open-sourced. In particular, everything Mark’s team does goes to open source.
- Yahoo is moving as much of its analytics to Hadoop as possible. Much of this is being moved away from Oracle and from Yahoo’s own Everest.
- A column store is being put on top of HDFS, based on Yahoo technology. Columns will be striped across nodes. Perhaps that’s why the effort is called Project Zebra. (A toy sketch of the striping idea follows this list.)
- Mark believes that in a year Hadoop will be much further along in meeting traditional data warehousing requirements, in areas such as:
  - Metadata
  - SLAs/high availability/other workload management
  - Data retention policies
  - Security/privacy*
- Yahoo views the time-to-market benefits of Hadoop as being more important than TCO.
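Since Project Zebra is the most concretely architectural item above, here is a toy Python sketch of what “columns striped across nodes” could mean: pivot rows into per-column lists, then cut each column into chunks placed round-robin across nodes. This is my guess at the general idea, not a description of Yahoo’s actual design.

```python
def to_columns(rows, names):
    """Column store 101: pivot row-major data into one contiguous
    list per column, so a query scans only the columns it needs."""
    return dict(zip(names, map(list, zip(*rows))))

def stripe(values, nodes, chunk=2):
    """Stripe one column across nodes: cut it into chunks and place
    them round-robin, spreading a big column scan over the cluster."""
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    return [(nodes[i % len(nodes)], c) for i, c in enumerate(chunks)]

rows = [(1, "/home"), (2, "/search"), (3, "/home"), (4, "/news")]
cols = to_columns(rows, ["user_id", "page"])
print(stripe(cols["page"], ["node-a", "node-b"]))
# [('node-a', ['/home', '/search']), ('node-b', ['/home', '/news'])]
```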