Data warehousing

Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data.

September 13, 2009

HadoopDB

Despite a thoughtful heads-up from Daniel Abadi at the time of his original posting about HadoopDB, I’m just getting around to writing about it now. HadoopDB is a research project carried out by a couple of Abadi’s students. Further research is definitely planned. But it seems too early to say that HadoopDB will ever get past the “research and oh by the way the code is open sourced” stage and become a real code line — whether commercialized, open source, or both.

The basic idea of HadoopDB is to put copies of a DBMS at different nodes of a grid, and use Hadoop to parcel work among them. Major benefits when compared with massively parallel DBMS are said to be:

Open/cheap/free
Query fault-tolerance
The related concept of tolerating node degradation that isn’t an outright node failure.

HadoopDB has actually been built with PostgreSQL. That version achieved performance well below that of a commercial DBMS “DBX”, where X=2. Column-store guru Abadi has repeatedly signaled his intention to try out HadoopDB with VectorWise at the nodes instead. (Recall that VectorWise is shared-everything.) It will be interesting to see how that configuration performs.
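To make the basic idea concrete, here is a minimal sketch of the pattern: each node runs its own single-node DBMS over a partition of the data, the same SQL is pushed down to every node, and a reduce step merges the partial answers. It is an illustration only, not HadoopDB's code. SQLite in-memory databases stand in for the PostgreSQL instances, ordinary Python functions stand in for Hadoop's map and reduce phases, and the sales table, its columns, and the function names are all made up for the example.

```python
import sqlite3
from collections import Counter

def make_node(rows):
    """Create one 'node': an independent in-memory database holding one partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def map_phase(node):
    """Push the aggregation down to the node's own DBMS; return its partial sums."""
    sql = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return node.execute(sql).fetchall()

def reduce_phase(partials):
    """Merge the per-node partial sums into global totals."""
    totals = Counter()
    for partial in partials:
        for region, subtotal in partial:
            totals[region] += subtotal
    return dict(totals)

# Hypothetical data, spread round-robin across three stand-in nodes.
ROWS = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
NUM_NODES = 3
partitions = [ROWS[i::NUM_NODES] for i in range(NUM_NODES)]
nodes = [make_node(p) for p in partitions]

result = reduce_phase(map_phase(n) for n in nodes)
```

With the three hypothetical partitions above, result holds the same per-region totals a single database would return. The point is that each node's DBMS does the local aggregation, and only small partial results need to be combined afterward.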

In my opinion, however, the real opportunity for HadoopDB may lie elsewhere. Read more

Categories: Analytic technologies, Columnar database management, Data models and architecture, Data types, Data warehousing, Database diversity, Hadoop, MapReduce, Open source, Parallelization, PostgreSQL, Scientific research, Structured documents, Theory and architecture

September 13, 2009

Fault-tolerant queries

MapReduce/Hadoop fans sometimes raise the question of query fault-tolerance. That is — if a node fails, does the query need to be restarted, or can it keep going? For example, Daniel Abadi et al. trumpet query fault-tolerance as one of the virtues of HadoopDB. Some of the scientists at XLDB spoke of query fault-tolerance as being a good reason to leave 100s or 1000s of terabytes of data in Hadoop-managed file systems.

When we discussed this subject a few months ago in a couple of comment threads, it seemed to be the case that:

Hadoop generally has query fault-tolerance. Intermediate result sets are materialized, and data isn’t tied to nodes anyway. So if a node goes down, its work can be sent to another node.
Hive actually did not have query fault-tolerance at that time, but it was on the roadmap. (Edit: Actually, it did within a single MapReduce job. But one Hive job can comprise several rounds of MapReduce.)
Most DBMS vendors do not have query fault-tolerance. If a query fails, it gets restarted from scratch.
Aster Data’s nCluster, however, does appear to have some kind of query fault-tolerance.
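The difference between restarting a query and reassigning one node's work can be sketched as a toy simulation. The assumptions are all illustrative: a query is a sum over partitions, exactly one node failure occurs while a given partition is being processed, and the function name run_query and its parameters are hypothetical. With materialized intermediate results, only the failed partition's work is redone; without them, everything starts over.

```python
def run_query(partitions, fail_at, restart_from_scratch):
    """Sum all partitions, simulating one node failure on partition `fail_at`.

    Returns (query_result, number_of_partition_executions_attempted).
    """
    executions = 0
    failed_once = False
    materialized = {}  # partition index -> materialized intermediate result
    idx = 0
    while idx < len(partitions):
        executions += 1
        if idx == fail_at and not failed_once:
            failed_once = True
            if restart_from_scratch:
                # No query fault-tolerance: discard everything and start over.
                materialized.clear()
                idx = 0
            # With fault-tolerance: idx is unchanged, so only this
            # partition is retried (conceptually, on another node).
            continue
        materialized[idx] = sum(partitions[idx])
        idx += 1
    return sum(materialized.values()), executions

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
tolerant = run_query(parts, fail_at=3, restart_from_scratch=False)
restarted = run_query(parts, fail_at=3, restart_from_scratch=True)
```

Both runs return the same answer, but in this example the restarted query attempts 8 partition executions against 5 for the fault-tolerant one. The gap grows with the number of partitions and with how late in the query the failure hits.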

This raises an obvious (pair of) question(s) — why and/or when would anybody ever care about query fault-tolerance? Read more

Categories: Analytic technologies, Aster Data, Data warehousing, Hadoop, Parallelization, Scientific research, Theory and architecture

September 10, 2009

Thinking about analytic speed

For a variety of reasons, I don’t plan to post my complete Enzee Universe keynote slide deck soon, if ever. But perhaps one or more of its subjects are worth spinning out in their own blog posts.

I’m going to start with analytic speed or, equivalently, analytic latency. There is, obviously, a huge industry emphasis on speed. Indeed, there’s so much emphasis that confusion often ensues. My goal in this post is not really to resolve the confusion; that would be ambitious to the max. But I’m at least trying to call attention to it, so that we can all be more careful in our discussions going forward, and perhaps contribute to a framework for those discussions as well.

Key points include:

1. There are two important senses of “latency” in analytics. One is just query response time. The other is the length of the interval between when data is captured and when it is available for analytic purposes. They’re often conflated — and indeed I shall do so for the remainder of this post.

2. There are many different kinds of analytic speed, which to a large extent can be viewed separately. Major areas include:

Data exploration. In-memory OLAP is a huge trend, and QlikView is a hot BI product line.
Budgeting/planning. In an unprecedentedly frightening economy, annual planning/forecasting cycles may well be too slow.
Operational integration. This is probably the biggest current area of mission-critical IT advancement. Not coincidentally, it is also the mainstay of the most expensive and complex data warehousing technologies. It’s also an ongoing area of application for event/stream processing, aka CEP.
General or deep analytics. This is what I seem to spend much of my time writing about — data warehousing price/performance, parallelized data mining, and much more.
Data administration. Ease of data mart spin-out and administration is becoming a major concern. And of course analytic appliance and DBMS vendors have been telling ease-of-deployment, low-DBA-involvement kinds of stories at least since Netezza first came to market.

There certainly are relationships among those; e.g., a really great analytic DBMS could help speed up any and all of the last three categories. But when assessing your needs, you can go quite far viewing each of those areas separately.

3. It is indeed important to carefully assess your need-for-speed. Acceptable levels of analytic latency vary widely, ranging from sub-millisecond to multi-month. Read more
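The two senses of latency in point 1 are easy to tell apart once timestamps are written down. In this small sketch, with invented timestamps and hypothetical function names, data latency runs from capture to availability, while query latency runs from query submission to answer.

```python
from datetime import datetime

def data_latency(captured_at, available_at):
    """Interval between data capture and its availability for analysis."""
    return available_at - captured_at

def query_latency(submitted_at, answered_at):
    """Plain query response time."""
    return answered_at - submitted_at

# Invented timestamps for one record and one query against it.
captured = datetime(2009, 9, 10, 8, 0, 0)
available = datetime(2009, 9, 10, 20, 0, 0)
submitted = datetime(2009, 9, 10, 20, 5, 0)
answered = datetime(2009, 9, 10, 20, 5, 3)

staleness = data_latency(captured, available)   # 12 hours
response = query_latency(submitted, answered)   # 3 seconds
end_to_end = answered - captured                # age of the data when the answer arrives
```

In this made-up case a three-second query still delivers an answer about data that is more than twelve hours old, which is why conflating the two measures can mislead.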

Categories: Analytic technologies, Business intelligence, Data warehousing, Presentations

September 3, 2009

Teradata really means that those 100+ appliances are in PRODUCTION

I was misremembering. It turns out that when Teradata said it had over 100 appliances “in production”, it meant that >100 hardware-based appliances are actually in production. If you add in the software-only “appliances,” and count test/development as well as true production, the total rises to >200.

I tried to get a finer breakdown out of Teradata on a disclosable basis, but failed. The ostensible reason is that public companies often don’t do that sort of thing without permission from the investor relations department, and Teradata’s marketers evidently haven’t felt a sense of urgency about getting permission to, for example, communicate how well just the 25xx series is doing.

Categories: Data warehouse appliances, Data warehousing, Market share and customer counts, Teradata

September 3, 2009

Teradata and Netezza are doing MapReduce too

Netezza told me a while ago that it planned to introduce MapReduce, and agreed yesterday this was no longer NDAed. Stephen Brobst of Teradata* let slip at XLDB that Teradata has MapReduce too, apparently implemented but not yet generally available.

I don’t have details in either case. Netezza and Teradata evidently aren’t taking MapReduce as seriously as Aster Data does, or even as seriously as Greenplum or Vertica do. Even so, MapReduce has become pretty much a “checkmark” item for large-database analytic DBMS vendors.

*Technically, Brobst is not and never has been a Teradata employee — but he’s widely and correctly regarded as being “of Teradata” even so. 🙂

Categories: Data warehousing, MapReduce, Netezza, Teradata

September 3, 2009

SAS on Netezza and other Netezza extensibility

I chatted with SAS CTO Keith Collins yesterday about the new SAS/Netezza in-database parallel data mining scoring offering. My impression is that this is very similar to SAS’ current Teradata support, notwithstanding SAS’ and Teradata’s apparent original intention of offering in-database modeling by now as well.

I gather this is a big performance-enhancing deal, just as it is for SPSS or Oracle’s own data mining over Oracle. However, I must confess to not yet understanding why. That is, I don’t know what’s so complicated about data mining scoring algorithms that makes hand-coding them in SQL particularly forbidding. My naive view of data mining is that you do a big regression to get a bunch of weights, and the resulting scoring algorithm is a linear combination of a few dozen variables. Evidently, that’s not quite right.
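For what it's worth, the naive view described above is easy to write down. The sketch below generates a SQL scoring expression from a set of regression weights and runs it in SQLite. The weights, column names, and table are invented for illustration, and nothing here reflects how SAS, SPSS, or Oracle actually implement scoring.

```python
import sqlite3

# Hypothetical model: an intercept plus one weight per input column.
INTERCEPT = -10.0
WEIGHTS = {"age": 0.5, "income_k": 0.25, "num_orders": 2.0}

def scoring_sql(intercept, weights, table):
    """Build a SELECT computing intercept + sum(weight * column) for each row."""
    terms = " + ".join(f"{w!r} * {col}" for col, w in weights.items())
    return f"SELECT id, {intercept!r} + {terms} AS score FROM {table} ORDER BY id"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, age INTEGER, income_k INTEGER, num_orders INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, 40, 50, 3), (2, 30, 20, 0)],
)

query = scoring_sql(INTERCEPT, WEIGHTS, "customers")
scores = conn.execute(query).fetchall()
```

A linear model like this one reduces to a single arithmetic expression per row. Whatever makes in-database scoring hard in practice presumably lies in models and transformations that do not reduce so neatly.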

Anyhow, it turns out that SAS held off on this work until it could be done for TwinFin. That’s largely because TwinFin lets partners write code on Intel CPUs, while previously they had to write in C for Netezza’s FPGAs. I got a similar sense from at least one other Netezza partner as well.

Categories: Data warehouse appliances, Data warehousing, Netezza, Predictive modeling and advanced analytics, SAS Institute

September 3, 2009

Oracle Exadata hybrid columnar compression

Oracle Database 11g Release 2 is out, and as usual I wasn’t briefed, perhaps because Oracle is more scared than its competitors are of hard questions, perhaps for some other reason entirely.* Anyhow, Oracle Database 11g Release 2 contains an Exadata-only feature called hybrid columnar compression. The Oracle Database 11g Release 2 white paper says “data is grouped, ordered, and stored one column at a time.” But Kevin Closson clarifies:

The word hybrid is important.

Rows are still used. They are stored in an object called a Compression Unit. Compression Units can span multiple blocks. Like values are stored in the compression unit with metadata that maps back to the rows.

So, “hybrid” is the word. But, none of that matters as much as the effectiveness. This form of compression is extremely effective.

That sounds a whole lot like PAX. Specifically, in Oracle’s case I would guess “hybrid columnar compression” provides the compression benefits of column stores, but not column stores’ I/O benefits, and also not any kind of in-memory compression. Read more
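Here is a rough sketch of what a PAX-like layout means, under assumptions that are not taken from anything Oracle has published. Rows are grouped into units, and within each unit the values are rearranged column by column before compression, so that like values sit next to each other. The unit size, the use of JSON and zlib, and the function names are all stand-ins chosen for brevity.

```python
import json
import zlib

def to_compression_units(rows, rows_per_unit):
    """Group rows into units; store each unit column-major, then compress it."""
    units = []
    for start in range(0, len(rows), rows_per_unit):
        chunk = rows[start:start + rows_per_unit]
        # Transpose: one list per column, so like values are adjacent.
        columns = [list(col) for col in zip(*chunk)]
        units.append(zlib.compress(json.dumps(columns).encode("utf-8")))
    return units

def from_compression_units(units):
    """Decompress every unit and map the column values back to whole rows."""
    rows = []
    for blob in units:
        columns = json.loads(zlib.decompress(blob).decode("utf-8"))
        rows.extend(tuple(values) for values in zip(*columns))
    return rows

# Invented table: (id, country, status), with repetitive column values.
rows = [(i, "US" if i % 2 else "CA", "open") for i in range(10)]
units = to_compression_units(rows, rows_per_unit=4)
restored = from_compression_units(units)
```

Because whole rows still live inside one unit, reading a single column in this sketch still means reading and decompressing every unit. That is consistent with the guess above about getting column-store compression without column-store I/O savings.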

Categories: Columnar database management, Data warehousing, Database compression, Exadata, Oracle, Theory and architecture

September 2, 2009

Teradata has over 100 appliances in production

I recently wrote that Teradata had gotten serious about appliance product lines, and had non-trivial sales figures for them. In a press release today, Teradata is now explicitly saying (emphasis mine):

Teradata now has more than 100 appliances in production, including the Data Mart Appliance 551, the Data Warehouse Appliance 2550, and the Extreme Data Appliance 1550, which complement the core platform, the Teradata Active Enterprise Data Warehouse 5550.

The breakdowns on that are NDA, and anyhow I can’t find them immediately in my notes.* But if memory serves — while a lot of those appliances are used for test and development, a whole other lot of them are used to do actual production query-answering work. (Edit: Memory turned out to be wrong.) Read more

Categories: Data warehouse appliances, Data warehousing, Market share and customer counts, Teradata

August 25, 2009

Sybase IQ technical highlights

General highlights of the Sybase IQ technical story include:

Sybase IQ is an analytic DBMS with a columnar/column-store architecture.
Unlike most analytic DBMS, Sybase IQ has a shared-disk architecture.
The Sybase IQ indexing story is a bit complicated, with a bunch of different index kinds. Most are focused on columns with low cardinality, and at least in some cases are a lot like bitmaps. (Sybase IQ when first introduced was a pure bitmap index product, with a single index type, “Fast Project”.) But one index kind, “High Group”, which is designed for columns with high cardinality, is an exception to most generalities about other Sybase IQ index kinds, and instead is more akin to a b-tree.
Unlike Vertica, Sybase stores each column of data only once. I don’t see how it would make sense to have multiple indexes on the same column, but I didn’t actually ask whether doing so is possible or common.
Sybase estimates that Sybase IQ requires ¼ the DBA effort of, say, Oracle. (Frankly, that’s not a particularly good figure.) Obviously, this is just a broad-brush average.
Sybase recently repurposed an acquired ETL tool to be focused on Sybase IQ. IQ of course also works with various third-party tools, certified or otherwise.
Sybase’s PowerDesigner CASE (Computer-Aided Software Engineering)/database design tool works with Sybase IQ.
Sybase is proud of Sybase IQ’s new in-database analytics capabilities, but I haven’t yet grasped what, if anything, is differentiated about them.
Sybase has an ILM (Information Lifecycle Management) story built around the point that different columns can be stored on different kinds of media.
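As a reminder of why bitmap-style indexes suit low-cardinality columns, here is a generic sketch. It is not Sybase IQ's actual index format. Each distinct value gets one bitmap with a bit per row, and multi-column predicates become bitwise ANDs. Python integers serve as the bitmaps, and the column names and data are invented.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an integer whose set bits mark matching row ids."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)

def rows_of(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Two invented low-cardinality columns over the same six rows.
region = ["east", "west", "east", "north", "west", "east"]
status = ["open", "open", "closed", "open", "closed", "open"]

region_index = build_bitmap_index(region)
status_index = build_bitmap_index(status)

# WHERE region = 'east' AND status = 'open' becomes a single bitwise AND.
matches = rows_of(region_index["east"] & status_index["open"])
```

The number of bitmaps grows with the number of distinct values, which is one reason a high-cardinality column is usually better served by something closer to a b-tree.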

Highlights of the Sybase IQ compression story include: Read more

Categories: Analytic technologies, Columnar database management, Data warehousing, Database compression, EAI, EII, ETL, ELT, ETLT, Sybase, Theory and architecture

August 25, 2009

Sybase IQ business notes

As specialized analytic DBMS go, Sybase is near the top of the charts both in age (Sybase IQ was first introduced in the mid 1990s) and adoption. That’s even more true, of course, if we restrict the discussion strictly to columnar DBMS, aka column stores. Basic Sybase IQ adoption claims include:

>1500 users
>3000 installations (Sybase has variously cited 2.1 and 2.5+ as the installation/user ratio)
At least ~50-60 installations with >5 terabytes of user data

Note that 98% of Sybase IQ installations are under 5 terabytes; the heart of Sybase IQ’s business is the sub-terabyte data warehouse market.* Read more
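Those figures roughly hang together, as a quick back-of-the-envelope check shows. The inputs are the lower bounds quoted above, treated as if they were exact, which is an assumption.

```python
# Lower bounds from the adoption claims, treated as exact for a rough check.
users = 1500
installations = 3000
large_low, large_high = 50, 60   # installations with >5 terabytes of user data

installs_per_user = installations / users
large_share_low = 100 * large_low / installations    # percent
large_share_high = 100 * large_high / installations  # percent
under_5tb_share = 100 - large_share_high             # percent
```

On these inputs the installation/user ratio is 2.0, a little under the 2.1 and 2.5+ figures Sybase has cited. Installations over 5 terabytes come to roughly 1.7% to 2% of the total, which matches the 98% figure.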

Categories: Analytic technologies, Data mart outsourcing, Data warehousing, Investment research and trading, Sybase
