Data warehousing

Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data. Related subjects include:

May 11, 2009

Facebook, Hadoop, and Hive

I few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.

Updating the metrics in my Cloudera post,

Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly better than 6:1 overall compression ratio. So the 2 1/2 petabytes figure for user data is reasonable.
Facebook’s Hadoop/Hive system ingests 15 terabytes of new data per day now, not 10.
Hadoop/Hive cycle times aren’t as fast as I thought I heard from Jeff. Ad targeting queries are the most frequent, and they’re run hourly. Dashboards are repopulated daily.

Nothing else in my Cloudera post was called out as being wrong.

In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters. Read more

Categories: Data warehousing, EAI, EII, ETL, ELT, ETLT, Facebook, Hadoop, MapReduce, Parallelization, Petabyte-scale data management, Specific users, Web analytics, Yahoo

68 Comments

May 8, 2009

Oracle’s hardware strategy

Larry Ellison stated clearly in an email interview with Reuters (links here and here) that Oracle intends to keep Sun’s hardware business and indeed intends to invest in the SPARC chip. Naturally, I have a few thoughts about this.

As Stephen O’Grady points out, Sun’s main strength lay in selling to the large enterprise market. Well, that’s Oracle’s overwhelming focus too. As I noted two years ago:

One Oracle response is to provide lots of add-on technologies for high-end customers, on the database and middle tiers alike. In app servers it’s done surprisingly well against BEA. It’s sold a lot of clustering. And it’s bought into and tried to popularize niche technologies like TimesTen and Tangosol’s.

This all makes perfect sense – it’s a great fit for Oracle’s best customers, and a way to get thousands of extra dollars per server from enterprises that may already have bought all-you-can-eat licenses to the Oracle DBMS. And being so sensible, it fits into the Clayton Christensen disruption story in two ways:

Oracle may be helpless against mid-tier competition, but it sure has the high-end core of its market locked up.

As one type of technology is commoditized, value is created in other parts of the technology stack.

Oracle’s ongoing acquisition spree in system software, application software, and now hardware just supports that story. MySQL, embedded Java, and so on may be welcome to Oracle as yet more opportunities to tap additional markets — but Oracle’s emphasis is and surely will remain on the large enterprise market.

The next notable point may be found in Larry’s key quote: Read more

Categories: Data warehouse appliances, Data warehousing, Exadata, HP and Neoview, IBM and DB2, Oracle

8 Comments

May 4, 2009

37 Ways To Get More From Analytics, Version 2.0

As I hoped, there were some very helpful responses to my post listing ways to improve analytic effectiveness. Here’s a second draft incorporating them. Comments continue to be very welcome. I need to finalize this soon. Read more

Categories: Analytic technologies, Business intelligence, Data warehousing, Presentations, Web analytics

4 Comments

April 30, 2009

eBay’s two enormous data warehouses

A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I’ve already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn’t like) and the astonishingly great difference between high- and low-end disk drives (to which eBay clued me in). Now I’m finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.

Metrics on eBay’s main Teradata data warehouse include:

>2 petabytes of user data
10s of 1000s of users
Millions of queries per day
72 nodes
>140 GB/sec of I/O, or 2 GB/node/sec, or maybe that’s a peak when the workload is scan-heavy
100s of production databases being fed in

Metrics on eBay’s Greenplum data warehouse (or, if you like, data mart) include:

6 1/2 petabytes of user data
17 trillion records
150 billion new records/day, which seems to suggest an ingest rate well over 50 terabytes/day
96 nodes
200 MB/node/sec of I/O (that’s the order of magnitude difference that triggered my post on disk drives)
4.5 petabytes of storage
70% compression
A small number of concurrent users

Categories: Analytic technologies, Data warehouse appliances, Data warehousing, eBay, Greenplum, Petabyte-scale data management, Teradata, Web analytics

49 Comments

April 28, 2009

Data warehouse storage options — cheap, expensive, or solid-state disk drives

This is a long post, so I’m going to recap the highlights up front. In the opinion of somebody I have high regard for, namely Carson Schmidt of Teradata:

There’s currently a huge — one order of magnitude — performance difference between cheap and expensive disks for data warehousing workloads.
New disk generations coming soon will have best-of-both-worlds aspects, combining high-end performance with lower-end cost and power consumption.
Solid-state drives will likely add one or two orders of magnitude to performance a few years down the road. Echoing the most famous logjam in VC history — namely the 60+ hard disk companies that got venture funding in the 1980s — 20+ companies are vying to cash in.

In other news, Carson likes 10 Gigabit Ethernet, dislikes Infiniband, and is “ecstatic” about Intel’s Nehalem, which will be the basis for Teradata’s next generation of servers.

Categories: Data warehouse appliances, Data warehousing, eBay, Solid-state memory, Storage, Teradata

16 Comments

April 28, 2009

The SAP/Teradata deal explained

When I first saw the press release about the latest SAP/Teradata deal, I thought it sounded very Barney. But it turns out there’s a little bit of substance, as well. Amazingly, SAP BW doesn’t really run on Teradata right now. This deal will fix that. The time frame seems to be that SAP-BW-on-Teradata will ship with SAP BW 7.2 whenever that goes out. (First half of 2010?) Early adopters may be able to get their hands on it as early as Q3 2009.

Note: It surely would be more precise to insert “NetWeaver” a few times into that paragraph.

Just to be clear — I still don’t see this as a big deal. It doesn’t portend any grand SAP/Teradata joint mission to smite Oracle, IBM, and/or Microsoft. Nor is it a telling first step toward an SAP/Teradata merger. It just removes a particular competitive disadvantage Teradata had vs. Oracle et al., from which Teradata’s smaller specialist competitors still suffer. And it offers SAP BW customers another high-quality DBMS option.

Categories: Business intelligence, Data warehousing, SAP AG, Teradata

Vertica pricing and customer metrics

Since last fall, Vertica’s stated pricing has been “$100K per terabyte of user data.” Vertica hastens to point out that unlike, for example, appliance vendors or Sybase, it only charges for deployment licenses; development and test are free (although of course you have to Bring Your Own hardware). Offer the past few weeks, I’ve gotten other pricing comments from Vertica to the effect that:

Of course, Vertica offers substantial negotiated quantity discounts. (Specifics that Vertica told me are confidential.)
Actually,Vertica’s official price list (unpublished but apparently freely available to prospects) contains quantity discounts too.
Finally, Vertica told me that its actual average price is around $25K/terabyte, and gave me person to publish same.

I didn’t press my luck and ask exactly what “average” means in this context.

As for customers, metrics I got include: Read more

Categories: Data warehousing, Market share and customer counts, Pricing, Vertica Systems

4 Comments

April 24, 2009

Some DB2 highlights

I chatted with IBM Thursday, about recent and imminent releases of DB2 (9.5 through 9.7). Highlights included:

DB2 is getting Oracle emulation, which I posted about separately.
IBM says that it had >50 new DB2 data warehouse customers last year. I neglected to ask how many of these had been general-purpose DB2 customers all along.
By “data warehouse customer” I mean a user for InfoSphere Warehouse, which previously was called DB2’s DPF (Data Partitioning Feature). Apparently, this includes both logical and physical partitioning. E.g., DB2 isn’t shared-nothing without this feature.
IBM is proud of DB2’s compression, which it claims commonly reaches 70-80%. It calls this “industry-leading” in comparison to Oracle, SQL Server, and other general-purpose relational DBMS.
DB2 compression’s overall effect on performance stems from a trade-off between I/O (lessened) and CPU burden (increased). For OLTP workloads, this is about a wash. For data warehousing workloads, IBM says 20% performance improvement from compression is average.
DB2 now has its version of one of my favorite Oracle security features, called Label Based Access Control. A label-control feature can make it much easier to secure data on a row-by-row, value-by-value basis. The obvious big user is national intelligence, followed by financial services. IBM says the health care industry also has interest in LBAC.
Also in the security area, IBM reworked DB2’s audit feature for 9.5
I think what I heard in our discussion of DB2 virtualization is:
- Increasingly, IBM is seeing production use of VMware, rather than just test/development.
- IBM believes it is a much closer partner to VMware than Oracle or Microsoft is, because it’s not pushing its own competing technology.
- Generally, virtualization is more important for OLTP workloads than data warehousing ones, because OLTP apps commonly only need part of the resources of a node while data warehousing often wants the whole node.
- AIX data warehousing is an exception. I think this is because AIX equates to big SMP boxes, and virtualization lets you spread out the data warehousing processing across more nodes, with the usual parallel I/O benefits.
When IBM talks of new autonomic/self-tuning features in DB2, they’re used mainly for databases under 1 terabyte in size. Indeed, the self-tuning feature set doesn’t work with InfoSphere Warehouse.
Even with the self-tuning feature it sounds as if you need at least a couple of DBA hours per instance per week, on average.
DB2 on Linux/Unix/Windows has introduced some enhanced workload management features analogous to those long found in mainframe DB2. For example, resource allocation rules can be scheduled by time. (The point of workload management is to allocate resources such as CPU or I/O among the simultaneous queries or other tasks that contend for them.) Workload management rules can have thresholds for amounts of resources consumed, after which the priority for a task can go up (“Get it over with!”) or down (“Stop hogging my system!”).

Categories: Application areas, Data warehousing, Database compression, IBM and DB2, Market share and customer counts, OLTP, Parallelization, Workload management

2 Comments

April 22, 2009

Clearing some of my buffer

I have a large number of posts still in backlog. For starters, there are ones based on recent visits with Aster, Greenplum, Sybase, Vertica, and a Very Large User. I suspect I’ll write more soon on Oracle as well. Plus there’s my whole future-of-online-media area. And quite a bit more will grow out of planned research.

So there are a whole lot of other worthy subjects I doubt I’ll be getting to any time soon. In some cases, of course, other people are doing great jobs of writing about same. Here are pointers to a few links that I am glad to recommend:

I wrote recently that I’ve discovered a number of different in-memory OLAP engines. Cindi Howson far outdid that, writing at length for Intelligent Enterprise on in-memory analytics, in an article that seems to itself be a teaser for a longer, free white paper on the subject.
CouchDB posted an eye-catching, risque slide presentation promoting CouchDB and, more generally, key-value stores, at least for internet applications. And yes, they’ve integrated MapReduce.
Merv Adrian posted favorably about Birst, with special reference to its OEM efforts. As previously noted, I was highly unimpressed with Birst’s end-user BI story at the time of its September roll-out, and Jerome Pineau’s recent examination did nothing to reassure me. But perhaps OEM is a different matter.
Merv also offers an interesting post about data integration upstart Expressor, and a highly favorable one about “visualization” vendor Tableau.
Ann All interviewed Nigel Pendse, who grumped that BI features are overrated, and what end users really want is great query performance. I’m not so sure about the features side of that, but I’m hugely in agreement about the performance. That’s a big part of why the analytic DBMS industry is so vibrant. It’s also why in-memory OLAP is suddenly so hot.

Categories: Analytic technologies, Business intelligence, CouchDB, Data warehousing, EAI, EII, ETL, ELT, ETLT, Expressor, MapReduce, Memory-centric data management, MOLAP, Presentations, Tableau Software, Theory and architecture

MySQL storage engine round-up, with Oracle-related thoughts

Here’s what I know about MySQL storage engines, more or less.

MySQL with MyISAM is fast. But it’s not transactional. Except for limited purposes, MySQL with MyISAM is a pretty crummy DBMS. Nothing can change that.
MySQL with InnoDB is transactional. But it’s not particularly fast. MySQL with InnoDB is a pretty mediocre DBMS. Oracle could fix that, at least partially, over time.
I don’t know much about Falcon, Maria, and so on. With Oracle winding up owning both MySQL and InnoDB, the motivation for those engines (except as Oracle-free forks) might fade.
Infobright is the most established of the rest. At the moment I’m not recommending it for most industrial-strength uses unless the user is particularly cash-constrained. But I wouldn’t be surprised if that changed soon. A cheap, fast, simple columnar analytic DBMS has a place in the world.
Kickfire is next in line, offering a hardware-based growth path for users who’ve maxed out on what unaided MySQL can do. It remains to be seen for how many users the desire to keep things simple and stay with MySQL outweighs the desire to avoid custom hardware. Having Oracle salespeople all over those accounts surely wouldn’t help. Kickfire also has a second market, namely OEM vendors who are mainly interested in the superfast chip. That would probably be pretty unaffected by Oracle.
Tokutek offers a technical proposition that’s hard to match head-on without going the CEP route. Users who care are likely to be MySQL shops. Tokutek’s main challenge is to prove that it sufficiently outdoes competing technical strategies for sufficiently many users. Oracle ownership of MySQL seems pretty irrelevant to Tokutek’s success or failure.
Calpont offers a kind of lightweight Exadata alternative. With Calpont’s packaging and positioning perennially unclear, it’s difficult to predict the effect of a particular change — i.e., Oracle buying MySQL — in Calpont’s market environment.
I haven’t heard from transactionally-oriented ScaleDB since I wrote about them a year ago. Apparently, they’re rolling out beta product this week, and their venerable techie guru sadly passed away earlier this month.

Categories: Calpont, Columnar database management, Data warehousing, Exadata, Infobright, Kickfire, MySQL, Open source, Oracle, Tokutek and TokuDB

14 Comments

← Previous Page — Next Page →

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in