May 14, 2009

The secret sauce to Clearpace’s compression

In an introduction to archiving vendor Clearpace last December, I noted that Clearpace claimed huge compression successes for its NParchive product (Clearpace likes to use a figure of 40X), but didn’t give much reason to believe that NParchive could compress a lot more effectively than other columnar DBMS. Let me now follow up on that.

To the extent there’s a Clearpace secret sauce, it seems to lie in NParchive’s unusual data access method. NParchive doesn’t just tokenize the values in individual columns; it tokenizes multi-column fragments of rows. Which particular columns to group together in that way seems to be decided automagically; the obvious guess is that this is based on estimates of the cardinality of their Cartesian products.

Off the top of my head, examples for which this strategy might be particularly successful include:

Denormalized databases
Message stores with lots of header information
Addresses
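To make the multi-column idea concrete, here is a minimal sketch that compares tokenizing each column separately with tokenizing a group of correlated columns as one fragment. It is not Clearpace's algorithm, which isn't described in any detail above. The function names, the made-up address rows, and the simple "bits per token" cost model (which ignores the space taken by the dictionaries themselves) are all assumptions for illustration.

```python
from math import ceil, log2, prod

def token_bits(n_distinct):
    """Bits per token for a dictionary holding n_distinct entries."""
    return max(1, ceil(log2(n_distinct)))

def distinct_counts(rows):
    """Number of distinct values in each column."""
    return [len(set(col)) for col in zip(*rows)]

def per_column_bits(rows):
    """Bits per row when every column is tokenized on its own."""
    return sum(token_bits(n) for n in distinct_counts(rows))

def grouped_bits(rows, group):
    """Bits per row when the columns in `group` share a single token."""
    fragments = {tuple(row[i] for i in group) for row in rows}
    others = [n for i, n in enumerate(distinct_counts(rows)) if i not in group]
    return token_bits(len(fragments)) + sum(token_bits(n) for n in others)

# Hypothetical address rows: (city, state, zip, customer_id)
rows = [
    ("Boston", "MA", "02108", 1),
    ("Boston", "MA", "02108", 2),
    ("Acton",  "MA", "01720", 3),
    ("Austin", "TX", "73301", 4),
    ("Austin", "TX", "73301", 5),
    ("Dallas", "TX", "75201", 6),
    ("Dallas", "TX", "75201", 7),
    ("Acton",  "MA", "01720", 8),
]

address = (0, 1, 2)  # city, state, and zip are strongly correlated
possible = prod(distinct_counts(rows)[i] for i in address)       # size of the Cartesian product
observed = len({tuple(row[i] for i in address) for row in rows})  # combinations that actually occur

print("possible combinations:", possible, "observed:", observed)
print("bits/row, per-column tokens:", per_column_bits(rows))
print("bits/row, grouped address token:", grouped_bits(rows, address))
```

When far fewer combinations occur than the Cartesian product allows, as with city, state, and zip here, one token per fragment costs fewer bits than separate tokens per column. That is presumably the kind of signal an automatic column-grouping heuristic would look for.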

Categories: Archiving and information preservation, Columnar database management, Database compression, Rainstor

8 Comments

May 13, 2009

Microsoft announced CEP this week too

Microsoft still hasn’t worked out all the kinks regarding when and how intensely to brief me. So most of what I know about their announcement earlier this week of a CEP/stream processing product* is what I garnered on a consulting call in March. That said, I sent Microsoft my notes from that call, they responded quickly and clearly to my question as to what remained under NDA, and for good measure they included a couple of clarifying comments that I’ll copy below.

*“in the SQL Server 2008 R2 timeframe,” about which Microsoft wrote “the first Community Technology Preview (CTP) of SQL Server 2008 R2 will be available for download in the second half of 2009 and the release is on track to ship in the first half of calendar year 2010.”

Perhaps it is more than coincidence that IBM rushed out its own announcement of an immature CEP technology — due to be more mature in a 2010 release — immediately after Microsoft revealed its plans. Anyhow, taken together, these announcements support my theory that the small independent CEP/stream processing vendors are more or less ceding broad parts of the potential stream processing market.

The main use cases Microsoft talks about for CEP are in the area of sensor data. Read more

Categories: Analytic technologies, Application areas, Microsoft and SQL*Server, Streaming and complex event processing (CEP)

8 Comments

May 13, 2009

IBM System S Streams, aka InfoSphere Streams, aka stream processing, aka “please don’t call it CEP”

IBM has hastily announced System S Streams, a product that was supposed to be called InfoSphere Streams and introduced only in 2010. Apparently, the rush is because senior management wanted to talk about it later this week, and perhaps also because it was implicitly baked into some of IBM’s advertising already. Scrambling ensued. Even so, Jeff Jones and team got to me fast, and briefed me: fairly non-technically, unfortunately, but otherwise how I like it, namely on a harmless embargo* and without any NDAs. That’s more than can be said for my clients at Microsoft, who also introduced CEP this week, but I digress …

*Indeed, as I draft this post-Celtics-game, the embargo is already expired.

Marketing aside, IBM System S/InfoSphere Streams is indeed a CEP/stream processing engine + language (with an Eclipse-based development environment). Apparently, IBM thinks InfoSphere Streams (if that’s what it winds up being renamed to) is or will be differentiated from other CEP packages in:

Scale-out. (That’s the one that appears to be real today. In fact, there’s a prototype running on Blue Gene.)
Support for complex datatypes such as XML, text, voice, video, etc.
Security and general industrial-strengthness.

Categories: Analytic technologies, Application areas, IBM and DB2, Investment research and trading, Scientific research, Streaming and complex event processing (CEP)

3 Comments

May 12, 2009

How much state is saved when an MPP DBMS node fails?

Mark Callaghan raised an interesting question in the comment thread to my recent Facebook/Hadoop/Hive post:

My question is about how commercial MPP RDBMS vendors recover from single or a small number of node failures during a long running SQL query. Do any of them save enough state to avoid starting the query over?

Honestly, I’d just be guessing at the answer.

Would any vendors or other knowledgeable folks care to take a crack at answering directly?

Categories: Data warehousing, Parallelization

10 Comments

May 12, 2009

Chuck Norris Java jokes

There’s a long list of Chuck Norris Java jokes. Most are pretty lame, but I liked a few, including:

Code runs faster when Chuck Norris watches it.

Garbage collector only runs on Chuck Norris code to collect the bodies.

Categories: Humor

2 Comments

May 11, 2009

Facebook, Hadoop, and Hive

A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.

Updating the metrics in my Cloudera post:

Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly better than 6:1 overall compression ratio. So the 2 1/2 petabytes figure for user data is reasonable.
Facebook’s Hadoop/Hive system ingests 15 terabytes of new data per day now, not 10.
Hadoop/Hive cycle times aren’t as fast as I thought I heard from Jeff. Ad targeting queries are the most frequent, and they’re run hourly. Dashboards are repopulated daily.
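The arithmetic behind the first bullet can be checked in a few lines. This is only a back-of-the-envelope sketch. It assumes decimal units (1 petabyte = 1,000 terabytes) and that the 6:1 ratio applies to all 400 terabytes of disk. The variable names are made up for illustration.

```python
TB_PER_PB = 1000  # assumption: decimal units

disk_tb = 400            # disk managed by Hadoop/Hive
compression_ratio = 6.0  # "slightly better than 6:1", taken here as exactly 6

# Uncompressed user data implied by the disk figure and the ratio
user_data_pb = disk_tb * compression_ratio / TB_PER_PB

# Compression ratio that would give exactly 2.5 petabytes
ratio_for_2_5_pb = 2.5 * TB_PER_PB / disk_tb

print(f"user data at 6:1: {user_data_pb:.1f} PB")
print(f"ratio needed for 2.5 PB: {ratio_for_2_5_pb:.2f}:1")
```

At exactly 6:1 the disk figure implies about 2.4 petabytes of user data, and a ratio of 6.25:1 would give 2.5, which is consistent with "slightly better than 6:1."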

Nothing else in my Cloudera post was called out as being wrong.

In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters. Read more

Categories: Data warehousing, EAI, EII, ETL, ELT, ETLT, Facebook, Hadoop, MapReduce, Parallelization, Petabyte-scale data management, Specific users, Web analytics, Yahoo

68 Comments

May 8, 2009

Oracle’s hardware strategy

Larry Ellison stated clearly in an email interview with Reuters (links here and here) that Oracle intends to keep Sun’s hardware business and indeed intends to invest in the SPARC chip. Naturally, I have a few thoughts about this.

As Stephen O’Grady points out, Sun’s main strength lay in selling to the large enterprise market. Well, that’s Oracle’s overwhelming focus too. As I noted two years ago:

One Oracle response is to provide lots of add-on technologies for high-end customers, on the database and middle tiers alike. In app servers it’s done surprisingly well against BEA. It’s sold a lot of clustering. And it’s bought into and tried to popularize niche technologies like TimesTen and Tangosol’s.

This all makes perfect sense – it’s a great fit for Oracle’s best customers, and a way to get thousands of extra dollars per server from enterprises that may already have bought all-you-can-eat licenses to the Oracle DBMS. And being so sensible, it fits into the Clayton Christensen disruption story in two ways:

Oracle may be helpless against mid-tier competition, but it sure has the high-end core of its market locked up.

As one type of technology is commoditized, value is created in other parts of the technology stack.

Oracle’s ongoing acquisition spree in system software, application software, and now hardware just supports that story. MySQL, embedded Java, and so on may be welcome to Oracle as yet more opportunities to tap additional markets — but Oracle’s emphasis is and surely will remain on the large enterprise market.

The next notable point may be found in Larry’s key quote: Read more

Categories: Data warehouse appliances, Data warehousing, Exadata, HP and Neoview, IBM and DB2, Oracle

8 Comments

May 4, 2009

37 Ways To Get More From Analytics, Version 2.0

As I hoped, there were some very helpful responses to my post listing ways to improve analytic effectiveness. Here’s a second draft incorporating them. Comments continue to be very welcome. I need to finalize this soon. Read more

Categories: Analytic technologies, Business intelligence, Data warehousing, Presentations, Web analytics

4 Comments

April 30, 2009

eBay’s two enormous data warehouses

A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I’ve already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn’t like) and the astonishingly great difference between high- and low-end disk drives (to which eBay clued me in). Now I’m finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.

Metrics on eBay’s main Teradata data warehouse include:

>2 petabytes of user data
10s of 1000s of users
Millions of queries per day
72 nodes
>140 GB/sec of I/O, or 2 GB/node/sec, or maybe that’s a peak when the workload is scan-heavy
100s of production databases being fed in

Metrics on eBay’s Greenplum data warehouse (or, if you like, data mart) include:

6 1/2 petabytes of user data
17 trillion records
150 billion new records/day, which seems to suggest an ingest rate well over 50 terabytes/day
96 nodes
200 MB/node/sec of I/O (that’s the order of magnitude difference that triggered my post on disk drives)
4.5 petabytes of storage
70% compression
A small number of concurrent users
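Several of the figures in the two lists above are derived from the others, and the derivations are easy to reproduce. The sketch below assumes decimal units and assumes that new records are about the same average size as the records already stored. Neither assumption comes from eBay, and the variable names are made up for illustration.

```python
GB, TB, PB = 1e9, 1e12, 1e15  # assumption: decimal units

# Teradata warehouse: aggregate I/O spread across the nodes
teradata_io_per_sec = 140 * GB
teradata_nodes = 72
teradata_gb_per_node = teradata_io_per_sec / teradata_nodes / GB

# Greenplum warehouse: average record size, then daily ingest volume
greenplum_user_data = 6.5 * PB
greenplum_records = 17e12
new_records_per_day = 150e9
bytes_per_record = greenplum_user_data / greenplum_records
ingest_tb_per_day = bytes_per_record * new_records_per_day / TB

# Per-node I/O gap between the two systems
greenplum_gb_per_node = 0.2  # 200 MB/node/sec
io_ratio = teradata_gb_per_node / greenplum_gb_per_node

print(f"Teradata I/O: {teradata_gb_per_node:.2f} GB/node/sec")
print(f"Greenplum average record: {bytes_per_record:.0f} bytes")
print(f"Greenplum ingest: {ingest_tb_per_day:.0f} TB/day")
print(f"Per-node I/O ratio: {io_ratio:.1f}x")
```

Under those assumptions the average Greenplum record is a bit under 400 bytes, which puts daily ingest in the high 50s of terabytes. The per-node I/O gap comes out at a little under 10x, in line with the "order of magnitude" remark.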

Categories: Analytic technologies, Data warehouse appliances, Data warehousing, eBay, Greenplum, Petabyte-scale data management, Teradata, Web analytics

49 Comments

April 29, 2009

37 Ways To Get More From Analytics

I posted several stages of my thinking in connection with a February presentation on how to buy an analytic DBMS. The whole process seemed like a success, with good input early on, and at least one new client directly attracted by the uploaded slide presentation. So now I’m trying the same idea again, starting at an even earlier stage of the process.

I’m going to be speaking this September at six of the seven installments of Netezza’s 2009 traveling regional user conference, namely those in London, Milan, and the United States. (Edited for schedule changes.) The topic is going to be something like “N Ways to Get More From Analytics”, for N a decent-sized two-digit integer. The talk is meant to be more conceptual, upbeat, rah-rah, and/or inspirational than is my usual style, at the cost of perhaps being less complete, detailed, or carefully organized. Right now I’m at the point of sharing an initial list of ideas, and throwing open the question: What did I leave out?

The initial list is: Read more

Categories: Analytic technologies, Memory-centric data management, Presentations

21 Comments


Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.
