Reports of perfectly balanced hardware configurations are greatly exaggerated
Data warehouse appliance and software appliance vendors like to claim that they’ve worked out just the right hardware configuration(s), and that a single configuration is correct for a fairly broad range of workloads. But there are a lot of reasons to be dubious about that. Specific vendor evidence includes:
- Teradata ascribes considerable importance to a Virtual Storage technology whose main purpose is to allow mixing of heterogeneous storage devices in a single system. And the discussion rarely suggests that those heterogeneous parts will stand in a rigid, fixed relationship.
- Netezza — as Teradata keeps reminding me — often sells boxes with the expectation that they won’t be filled with data, so as to increase spindle count and hence performance.
- Oracle/Sun have dropped some comments about Exadata being more flexibly configured going forward.
- Kickfire’s new “high-end” appliance lets you attach fairly arbitrary amounts of external storage.
- And of course, software-only analytic DBMS vendors run their software in all sorts of hardware and storage environments.
What’s more, the claim never made a lot of sense anyway. With the rarest of exceptions, even a single data warehouse’s workload will contain different queries that strain different parts of the system in different ratios. Calculating the “ideal” hardware configuration for that single workload would be forbiddingly difficult. And even if one could calculate it, it would almost surely be different from another user’s “ideal” configuration. How a single hardware configuration can be “ideally balanced” for a broad class of use cases boggles the imagination.
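To see why, consider a toy calculation. Every number below is invented purely for illustration:

```python
# Toy illustration (all numbers invented) of why no single CPU-to-disk
# ratio is "balanced" for a mixed workload. A box is balanced for a
# query when CPUs and disks finish at the same time, i.e. when
#   cores per (GB/s of scan bandwidth) = CPU-seconds of work per GB scanned
queries = {
    "light scan + aggregate": 0.2,   # CPU-seconds per GB scanned
    "big multi-way join":     5.0,
    "complex analytic SQL":  20.0,
}

for name, cpu_sec_per_gb in queries.items():
    # Cores needed to keep up with each 1 GB/s of disk bandwidth.
    print(f"{name}: balanced at {cpu_sec_per_gb:.1f} cores per GB/s of disk")

# The "ideal" ratios span two orders of magnitude across three ordinary
# queries, so one fixed configuration can't be ideal for the whole mix.
```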
Categories: Data warehouse appliances, Data warehousing, Exadata, Kickfire, Netezza, Oracle, Teradata | 6 Comments |
Greenplum Single-Node Edition — sometimes free is a real cool price
Greenplum is announcing today that you can run Greenplum software on a single 8-core commodity server, free. First and foremost, that’s a strong statement that Greenplum wants enterprises to pay it for Greenplum’s parallelization/”private cloud” capabilities. Second, it may be an attractive gift to a variety of folks who want to extract insight from terabyte-scale databases of various kinds.
Greenplum Single-Node Edition:
- Is free of charge, although you can buy support.
- Has no restrictions on use, production or otherwise.
- Has no restrictions on database size.
- Is closed-source.
For those who want free, terabyte-scale data warehousing software, Greenplum Single-Node Edition may be quite appealing, considering that the main available alternatives are:
- General-purpose open-source DBMS, such as PostgreSQL and MySQL (lacking analytic DBMS performance and features)
- Infobright Community Edition (the other leading choice; Infobright’s commercial sales success indicates the solidity of its technology)
- Rough research-project code and other questionable open source offerings
- Crippleware from other commercial analytic DBMS vendors (e.g., Teradata)
For example, comparing PostgreSQL-based Greenplum with PostgreSQL itself, Greenplum offers:
- The ability to scale out queries across all cores in your box (and no, pgpool is not a serious alternative), as sketched below
- Storage alternatives such as columnar (I am told that EnterpriseDB recently stopped funding a project for a PostgreSQL columnar option)
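To make the scale-out point concrete, here is a minimal sketch of the partition/aggregate/combine pattern a parallel DBMS applies across the cores of one box. It's a toy using Python's multiprocessing module, not Greenplum code; the data, sizes, and worker counts are all invented:

```python
# Toy version of parallelizing SELECT SUM(amount) across all cores:
# partition the data, aggregate each partition in its own process,
# then combine the partial results (roughly what a parallel DBMS does
# within one box, with segments instead of Python workers).
from multiprocessing import Pool
import os

def partial_sum(chunk):
    # Each worker scans and aggregates only its own partition.
    return sum(chunk)

if __name__ == "__main__":
    amounts = list(range(1_000_000))        # stand-in for a fact-table column
    workers = os.cpu_count() or 8           # one partition per core
    step = -(-len(amounts) // workers)      # ceiling division
    chunks = [amounts[i:i + step] for i in range(0, len(amounts), step)]

    with Pool(workers) as pool:
        partials = pool.map(partial_sum, chunks)  # scatter: parallel aggregation
    print(sum(partials))                          # gather: combine partials
```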
Categories: Analytic technologies, Data warehousing, EnterpriseDB and Postgres Plus, Greenplum, Infobright, Open source, PostgreSQL, Pricing, Scientific research | 14 Comments |
This week at the Teradata Partners user conference
Teradata tells me that its press embargoes are ending at 9:00 this morning. Here are some highlights of what’s going on, although names, dates, and details will have to await conversations and press releases this week.
- Teradata is productizing “private cloud,” under names including “Teradata Enterprise Analytics Cloud,” “Teradata Agile Analytics Cloud,” and “Teradata Elastic Mart Builder.” I.e., Teradata hopes to leapfrog Greenplum in its “Enterprise Data Cloud” strategy. This is only fair, in that Greenplum lifted the idea from Teradata and eBay in the first place. It also provides major support for what I think is an extremely sensible trend. Give or take issues of who announces and ships what a couple months before or after a competitor, my early thinking is that the main differences between Greenplum and Teradata in this regard will be:
- Virtual as opposed to just physical data marts, based on robust workload management software. (Advantage: Teradata)
- Pricing and deployment options. (Advantage: Greenplum)
- Features that don’t directly relate to enterprise/private cloud. (Advantage: Either, often Teradata.)
- Teradata is generally strengthening its data movement technology, e.g. for making various appliances work in sync. I’m not too clear yet on the details of that. I think this is what Teradata’s phrase “ecosystem management” refers to.
- Teradata is (pre-)announcing – at least as a statement of direction – an appliance based on solid-state drives (SSDs). I’ve thought for a while that Teradata was a leader in thinking through the issues around solid-state memory in data warehousing, so it makes sense that they’re among the leaders in actually coming to market as well. I plan to say more after meeting with, e.g., Carson Schmidt.
- Teradata has achieved a 300%ish speed-up in geospatial processing. I gather this is largely a byproduct of the parallel analytics work Teradata did around strengthening its SAS integration. However, there don’t seem to be a lot of Teradata geospatial users yet.
- Teradata Express, Teradata’s free Windows-based crippleware, is being ported to Amazon EC2 and VMware as well. Presumably to avoid cannibalizing Teradata product sales, there are quite a few limitations on Teradata Express, including system capacity, database size, and “no production use.”
- Teradata continues to extend its optimizations to handle queries issued by business intelligence tools. Previously, the focus of what Teradata discussed in this regard was query rewrite. But soon automatic recommendation and creation of Aggregate Join Indexes – i.e., materialized views – will be included as well. (The materialized-view idea is sketched below for those unfamiliar with the term.)
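For readers who don't know the term: an aggregate join index answers a query from a precomputed summary instead of rescanning the base tables. Here is a minimal sketch of the idea, with made-up tables; it is my illustration, not Teradata's implementation:

```python
# Toy version of the aggregate join index / materialized view idea:
# join and aggregate once ahead of time, then let repeated BI-style
# queries read the summary instead of the base tables.
from collections import defaultdict

sales = [(1, 100.0), (1, 250.0), (2, 75.0), (2, 25.0)]  # fact: (store_id, amount)
stores = {1: "Boston", 2: "Chicago"}                     # dimension: store_id -> city

# "Create" the aggregate join index: precompute revenue by city.
revenue_by_city = defaultdict(float)
for store_id, amount in sales:
    revenue_by_city[stores[store_id]] += amount

# Query rewrite then turns "revenue by city" into a cheap lookup,
# transparently to the BI tool that issued the query.
print(revenue_by_city["Boston"])   # 350.0, without rescanning sales
```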
Greenplum customer notes
In a briefing about a forthcoming product announcement, Greenplum threw in a slide saying:
- Greenplum is getting 12-15 new (paying) customers per quarter, all of whom it fondly refers to as “Tier 1” enterprises.
- Greenplum will hit the 100+ customer mark this quarter (thus joining Vertica and Infobright).
- <10% of Greenplum business is now “influenced” by Sun hardware.
I asked Ben Werther to unpack that last claim for me. He quickly noted that it wasn’t his slide, but rather had been put together by colleagues. That said:
- As of the past quarter or two, <10% of Greenplum’s sales activity is on Sun, which works out to maybe one sale per quarter and at most a small number of sales cycles. (That’s down from 50%+ not that long ago.)
- Most Greenplum business is now on HP or Dell equipment. Some is on IBM. There are some interesting sales cycles on Cisco’s new UCS (Unified Computing System) blades, but no closed deals yet. EMC seems to be part of the Cisco story.
No doubt part of the reason for the move away from Sun equipment is the impending Oracle acquisition. Another may be that the Greenplum/Sun appliance is somewhat underpowered. E.g., without particularly high levels of compression, eBay puts over 60 terabytes of data on each Greenplum node, which probably isn’t ideal from the standpoint of query performance.
Greenplum also says that 50% or so of sales are subscription-priced, rather than perpetual-licensed. I don’t have a sense for how long that’s been going on. (Edit: Ben Werther tells me this has been true for over a year.)
Categories: Data warehouse appliances, Data warehousing, Greenplum, Market share and customer counts, Pricing | 2 Comments |
Three big myths about MapReduce
Once again, I find myself writing and talking a lot about MapReduce. But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:
- MapReduce is something very new
- MapReduce involves strict adherence to the Map-Reduce programming paradigm (the canonical paradigm is sketched below)
- MapReduce is a single technology
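For reference, here is what strict adherence to the paradigm looks like: the canonical word-count example, as a minimal Python sketch rather than any particular vendor's implementation.

```python
# Canonical Map-Reduce word count (illustration only).
from collections import defaultdict

def map_phase(document):
    # Map: emit (key, value) pairs -- here, (word, 1) per occurrence.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result.
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 3
```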
Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Google, Greenplum, Hadoop, Log analysis, MapReduce, Michael Stonebraker, Parallelization, Web analytics | 11 Comments |
Introduction to SenSage
I visited with SenSage on my two most recent trips to San Francisco. Both visits were, through no fault of SenSage’s, hasty. Still, I think I have enough of a handle on SenSage basics to be worth writing up.
General SenSage highlights include:
Technical introduction to Splunk
As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. The logs can be of various kinds, but for the purpose of understanding Splunk technology, it’s probably OK to assume they’re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used for general schema-free analytics, but that’s in early days at best.
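To give a concrete feel for indexing log text before getting into specifics, here is a toy inverted-index sketch over log lines. It is my illustration of the general technique; Splunk's actual data structures are, of course, far more sophisticated.

```python
# Toy inverted index over log lines (illustration only, not Splunk's design):
# map each token to the set of lines it appears in, then AND-search.
from collections import defaultdict

log_lines = [
    "2009-10-12 10:01:22 GET /index.html 200",
    "2009-10-12 10:01:23 GET /missing 404",
    "2009-10-12 10:01:24 POST /login 200",
]

index = defaultdict(set)                     # token -> line numbers
for line_no, line in enumerate(log_lines):
    for token in line.lower().split():
        index[token].add(line_no)

def search(*tokens):
    # Return the lines containing every search token.
    hits = set.intersection(*(index[t.lower()] for t in tokens))
    return [log_lines[i] for i in sorted(hits)]

print(search("get", "404"))                  # the one failed GET
```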
Splunk’s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include:
Categories: Analytic technologies, Log analysis, MapReduce, Splunk, Structured documents, Text, Web analytics | 12 Comments |
General introduction to Splunk
I dropped by log analysis software vendor Splunk a few weeks ago for a chat with Marketing VP Steve Sommer (whom some of you may know from Cognos and/or Informix), Product Management VP Christina Noren, and above all co-founder/CTO Erik Swan. Splunk turns out to be a pretty interesting company, from both business and technical standpoints. For one thing, Splunk seems highly regarded by most people I mention it to.
Splunk’s technical stories include:
- Text search over log files.
- Business intelligence over text search. (That part sounds a lot like Attivio.)
- MapReduce with schema flexibility and smart multi-stage execution plans. (That part sounds a lot like Aster Data.)
More on those in a separate post.
Less technical Splunk highlights include:
Categories: Analytic technologies, Fox and MySpace, Investment research and trading, Log analysis, Splunk, Telecommunications, Text, Web analytics | 1 Comment |
Kickfire capacity and pricing
Kickfire’s marketing communication efforts are still a work in progress. Kickfire did finally relax its secrecy about FPGA-vs.-custom-silicon – not coincidentally during Netezza’s recent publicity cycle. That wise choice helped Kickfire get some favorable attention recently for its technical and market strategy, e.g. from Daniel Abadi, Merv Adrian and, kicking things off — as it were — me. Weeks after a recent Kickfire product release, there’s finally a fairly accurate data sheet up, although there’s still one self-defeatingly misleading line I’ll comment on below. Pricing is a whole other area of confusion, although it seems that current list prices have been inadvertently* leaked in Merv’s post linked above, with only one inaccuracy that I can detect.**
*I gather from the company that they forgot to tell Merv pricing was NDA.
** Merv cited a price as “starting” that I believe to be top-of-the-line. No criticism of Merv is implied in that; Kickfire has not been very clear in communicating hard numbers.
All that said, if one takes Kickfire’s marketing statements literally, Kickfire list pricing is around $20-50K per terabyte for a few small, fixed, high-performance configurations. That’s all-in, for plug-and-play appliances. What’s more, that range is based on the actual published user data capacity numbers for various Kickfire models, which I think are low for several reasons (a back-of-envelope illustration follows this list):
- Kickfire doesn’t officially admit that its model with 14.4 terabytes of disk can manage more than 6 terabytes of data, even though it clearly can.
- Actually, those 14.4 terabytes of disk can be increased or decreased as you choose.
- The basic compression figures implied in those calculations seem conservative.
- The compression figures are more conservative still, in that Kickfire assumes you’ll have a lot of actual indexes on your data. I’m not sure that’s necessary for most workloads.
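To illustrate why I think the rated figures are low, here is a back-of-envelope calculation. The 14.4 terabytes of raw disk and the 6-terabyte rating come from Kickfire's materials; the mirroring, indexing, and compression inputs are purely my assumptions.

```python
# Back-of-envelope check on the 6 TB user-data rating (assumptions mine).
raw_disk_tb = 14.4          # published raw disk on the model in question
rated_user_data_tb = 6.0    # Kickfire's official user-data figure

usable_fraction = 0.5       # assume mirroring halves usable disk
index_overhead = 2.0        # assume heavy indexing doubles stored bytes
compression_ratio = 3.0     # assume a modest columnar compression ratio

usable_tb = raw_disk_tb * usable_fraction                    # 7.2 TB
implied_tb = usable_tb / index_overhead * compression_ratio  # 10.8 TB
print(f"Implied capacity ~{implied_tb:.1f} TB vs rated {rated_user_data_tb} TB")
# Even with cautious inputs, the implied capacity well exceeds the rating.
```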
Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Kickfire, Pricing | 3 Comments |
MapReduce webinars and annotated slides
As previously noted, I’m giving a webinar twice today — i.e., Thursday, October 15 — at 10:00 am and 1:00 pm Eastern time.
- The subject is MapReduce.
- The sponsor is Aster Data.
- Part of the webinar will be an explanation of MapReduce basics, especially the conflict between theory/propaganda and reality.
- As you might guess from the identity of the sponsor, there will be an emphasis on how MapReduce and SQL play nicely with each other.
- You can register for the webinar on Aster’s site.
- (Edit) The webinar replay can be found here.
- I’ve already uploaded the slides from which I will present. (But not the ones from which Aster folks will be talking. I’ve seen those, and there’s some good technical crunch in some of them.) The “Notes” under the slides have a number of relevant URLs for follow-up, as well as a small number of explanatory comments (e.g., as to why one slide simply has a quote from and corresponding picture of Shakespeare).
Categories: Aster Data, MapReduce, Presentations | 6 Comments |