February 27th, 2008 Curt Monash
I’ve posted a couple times about eBay’s analytics side. As a companion, Don Burleson pointed me at a fascinating November, 2006 slide presentation outlining eBay’s transactional architecture and evolution. Highlights include:
- A whole lot of manual slicing of Oracle databases, so as not to exceed their capacity.
- A whole lot of careful design and ordering of transactions.
- Putting all the business logic in the application tier, with a custom O/R mapper. There’s lots of caching there, but very little state.
The presentation has a bunch of specific numbers, in case anybody wants to dive in.
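To make "manual slicing" concrete, here is a minimal sketch of how an application tier might route a request to one of several databases. It is only an illustration: the split points, the database names, and the choice of range-based routing on a numeric user ID are all assumptions of mine, not details taken from the eBay presentation.

```python
from bisect import bisect_right

# Hypothetical, manually maintained split points on a numeric user ID.
# IDs below 1,000,000 go to the first database, and so on.
SPLIT_POINTS = [1_000_000, 2_000_000, 3_000_000]

# Hypothetical database names, one more than the number of split points.
DATABASES = ["user_db_a", "user_db_b", "user_db_c", "user_db_d"]


def database_for(user_id: int) -> str:
    """Return the name of the database that holds the given user ID."""
    # bisect_right counts how many split points are <= user_id,
    # which is exactly the index of the slice the ID falls into.
    return DATABASES[bisect_right(SPLIT_POINTS, user_id)]


if __name__ == "__main__":
    for uid in (42, 1_000_000, 2_500_000, 9_999_999):
        print(uid, "->", database_for(uid))
```

With a scheme like this, keeping any one database under its capacity limit means somebody has to add a split point and move data by hand, which is presumably where the "whole lot of manual" work comes in.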
Please subscribe to our feed!
Posted in OLTP database management, Specific users, eBay | No Comments »
February 26th, 2008 Curt Monash
I had a non-technical introduction today to Exasol, a data warehouse specialist that has gotten a little buzz recently for publishing TPC-H results even faster than ParAccel’s. Here are some highlights:
- Exasol was founded back in 2000.
- Exasol is a German company, with 60 employees. While I didn’t ask, the vast majority are surely German.
- Exasol has two customers. 6-8 more are Coming Real Soon. Most or all of those are in Germany, although one may be in Asia.
- Karstadt (big German retailer) has had Exasol deployed for 3 years. The other deployed customer is the German subsidiary of data provider IMS Health.
- [Redacted for confidentiality] is a strategic investor in and partner of Exasol. [Redacted for confidentiality]’s only competing partnership is with Oracle.
- Exasol’s system is more completely written from scratch than many. E.g., all they use from Linux are some drivers, and maybe a microkernel.
- Exasol runs in-memory. There doesn’t seem to be a disk-centric mode.
- Exasol’s data access methods are sort of like columnar, but not exactly. I look forward to a more technical discussion to sort that out.
- Exasol’s claimed typical compression is 5-7X. As in the Vertica story, database operations are carried out on compressed data.
- Exasol says it has performed a very fast TPC-H in-house at the 30 terabyte level. However, its deployed sites are probably a lot smaller than that. IMS Health is cited in its literature as 145 gigabytes.
- Oracle and Microsoft are listed as Exasol partners, so there may be some kind of plug-compatibility or back-end processing story.
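As a rough sanity check on the in-memory and compression claims above, the sketch below works out how much memory the cited database sizes would occupy at 5-7X compression. It assumes, and this is my assumption rather than anything Exasol said, that the 145 gigabyte and 30 terabyte figures refer to uncompressed user data and that the compression ratio applies uniformly.

```python
def compressed_size(raw_size: float, ratio: float) -> float:
    """Size after compression, in the same unit as raw_size."""
    return raw_size / ratio


def footprint_range(raw_size: float, low_ratio: float = 5.0, high_ratio: float = 7.0):
    """Return (smallest, largest) compressed footprint for a range of ratios."""
    # The higher ratio gives the smaller footprint, and vice versa.
    return compressed_size(raw_size, high_ratio), compressed_size(raw_size, low_ratio)


if __name__ == "__main__":
    # Sizes in gigabytes; 30 TB is taken as 30,000 GB for simplicity.
    for label, raw_gb in (("IMS Health, 145 GB", 145.0), ("TPC-H, 30 TB", 30_000.0)):
        smallest, largest = footprint_range(raw_gb)
        print(f"{label}: roughly {smallest:,.0f} to {largest:,.0f} GB compressed")
```

Under those assumptions the IMS Health database would fit in a few tens of gigabytes of RAM, while the 30 terabyte benchmark would need several terabytes spread across the cluster.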
Posted in Analytics and analytic technologies, Data warehousing, Exasol, Relational database management systems, Specific users | No Comments »
February 26th, 2008 Curt Monash
There’s been some confusion over my post about eBay’s multiple petabytes of data. So to clarify, let me say:
- eBay’s figure of >1.4 petabytes of data — for its largest single analytic database — counts disks or something, not raw user data.
- I previously published a strong conjecture that the database vendor in question was Teradata, which is definitely an eBay supplier. In particular, it is definitely not an Oracle data warehouse.
- While eBay isn’t saying who it is either — not even off-the-record — the 50%ish compression figures they experience just happen to map well to Teradata’s usual range.
- Edit: Just to be clear — not that there was any doubt, but I have reconfirmed that eBay is a Teradata user, in or including eBay’s Paypal division.
Posted in Analytics and analytic technologies, Data warehouse appliances, Data warehousing, Relational database management systems, Specific users, Teradata, eBay | No Comments »
February 23rd, 2008 Curt Monash
The server move has completed. The brief outage is behind us. Comments have been turned back on. All SHOULD be well.
I plan to write a little more soon about web hosting over on the Monash Report, if for no other reason than that what’s there is not wholly accurate and needs updating.
Posted in About this blog | No Comments »
February 22nd, 2008 Curt Monash
I’m moving servers again. In connection with that, I’m turning comments off for a few hours.
Everything SHOULD be fine again by Saturday.
Posted in About this blog | No Comments »
February 20th, 2008 Curt Monash
Billy Newport of IBM sees a lot of similarities between his app-server-based product ObjectGrid and H-Store. In both cases, constrained tree schemas are assumed, and OLTP performance goodness ensues. A couple of points I noted on a quick skim through his blog:
- He calls out RAM consumption as a challenge for this kind of architecture.
- He points out that it’s a big advantage to have data called and used in the same address space.
Being based in RAM is obviously a huge part of the H-Store scheme. But so is having transaction execution be close to the database.
IBM now has both ObjectGrid and a memory-centric DBMS (solidDB) that they’ve been using as a front end for DBMS. Integration of the two could be pretty interesting.
Posted in Cache, Database theory and practice, H-Store, IBM and DB2, Memory-centric data management, OLTP database management, Relational database management systems, solidDB | No Comments »
February 19th, 2008 Curt Monash
I wrote yesterday about the H-Store project, the latest from the team of researchers who also brought us C-Store and its commercialization Vertica. H-Store is designed to drastically improve efficiency in OLTP database processing, in two ways. First, it puts everything in RAM. Second, it tries to gain an additional order of magnitude on in-memory performance versus today’s DBMS designs by, for example, taking a very different approach to ensuring ACID compliance.
Today I had the chance to talk with two more of the H-Store researchers, Sam Madden and Daniel Abadi.
Read the rest of this entry »
Posted in Database diversity, H-Store, Memory-centric data management, OLTP database management | 1 Comment »
February 19th, 2008 Curt Monash
Kalido briefed me last week, under pre-TDWI embargo. To a first approximation, their story is confusingly buzzword-laden, as is evident from their product names. The Kalido suite is called the Kalido Information Engine, and it comprises:
- Kalido Business Information Modeler (the newest part)
- Kalido Dynamic Information Warehouse
- Kalido Universal Information Director
- Kalido Master Data Management
But those mouthfuls aside, Kalido has some pretty interesting things to say about data warehouse schema complexity and change.
Read the rest of this entry »
Posted in Data warehousing, EII, ETL, and/or EAI, Kalido | 1 Comment »
February 18th, 2008 Curt Monash
I recently caught up with ParAccel’s CTO Barry Zane and Marketing VP Kim Stanick for a long technical discussion, which they have graciously continued by email. It would be impolitic in the extreme to comment on what led up to that. Let’s just note that many things I’ve previously written about ParAccel are now inoperative, and go straight to the highlights.
Read the rest of this entry »
Posted in Columnar architectures, Data warehousing, Microsoft and SQL*Server, ParAccel, Portability, transparency, and plug-compatibility | 4 Comments »
February 18th, 2008 Curt Monash
Last week, Dan Weinreb tipped me off to something very cool: Mike Stonebraker and a group of MIT/Brown/Yale colleagues are calling for a complete rewrite of OLTP DBMS. And they have a plan for how to do it, called H-Store, as per a paper and an associated slide presentation.
Read the rest of this entry »
Posted in Database diversity, Database theory and practice, H-Store, Memory-centric data management, Michael Stonebraker, OLTP database management | 28 Comments »
February 16th, 2008 Curt Monash
In a response to my recent five-part series on DBMS diversity, Mike Stonebraker has proposed his own taxonomy of data management technologies over on Vertica’s Database Column blog.
- OLTP DBMSs focused on fast, reliable transaction processing
- Analytic/Data Warehouse DBMSs focused on efficient load and ad-hoc query performance
- Science DBMSs — after all MatLab does not scale to disk-sized arrays
- RDF stores focused on efficiently storing semi-structured data in this format
- XML stores focused on semi-structured data in this format
- Search engines — the big players all use proprietary engines in this area
- Stream Processing Engines focused on real-time StreamSQL
- “Lean and Mean,” less-than-a-database engines focused on doing a small number of things very well (embedded databases are probably in this category)
- MapReduce and Hadoop — after all Google has enough “throw weight” to define a category
He goes on to say that each will be architected differently, except that — as he already convinced me back in July — RDF will be well-managed by specialty data warehouse DBMS. Read the rest of this entry »
Posted in Data types, Database diversity, Database theory and practice, Michael Stonebraker, Mid-range DBMS, OLTP database management, RDF and graphs, Relational database management systems | No Comments »
February 15th, 2008 Curt Monash
This is the fifth of a five-part series on database management system choices. For the first post in the series, please click here.
Relational database management systems have three essential elements:
- Rows and columns. Theoretically, rows and columns may be inessential to the relational model. But in reality, they are built into the design of every real-world relational product. If you don’t have rows and columns, you’re not using the product to do what it was well-designed for.
- Predicate logic. Theoretically, everything can be fitted into a predicate Procrustean bed. But if you’re looking for relevancy rankings on a text search, binary logic is a highly convoluted way to get them.
- Fixed schemas. Database theorists commonly assume that databases have fixed schemas. If this means that 90%+ of all information is null or missing, they have elegant ways of dealing with that. Even so, as computing gets ever more concerned with individuals — each with his/her/its unique “profile(s)” — fixed schemas get ever harder to maintain.
If any of these three elements is missing or inappropriate, then a traditional relational database management system may not be the best choice.
Read the rest of this entry »
Posted in Data types, Database diversity, Database theory and practice | 1 Comment »
February 15th, 2008 Curt Monash
This is the fourth of a five-part series on database management system choices. For the first post in the series, please click here.
The other threat to the high-end relational DBMS vendors aims squarely at the heart of their business. It’s the mid-range relational database management systems, which are doing an ever-larger fraction of what their high-end cousins can. That said, different products do different things well. So if you’re not blindly paying up for the security of an all-things-to-all-people high-end DBMS, there are a number of factors you might want to consider.
Read the rest of this entry »
Posted in Database diversity, Database theory and practice, Mid-range DBMS, OLTP database management, Relational database management systems | 2 Comments »
February 15th, 2008 Curt Monash
This is the third of a five-part series on database management system choices. For the first post in the series, please click here.
High-end OLTP relational database management system vendors try to offer one-stop shopping for almost all data management needs. But as I noted in my prior post, their product category is facing two major competitive threats. One comes from specialty data warehouse database management system products. I’ve covered those extensively in this blog, with key takeaways including:
- Specialty data warehouse products offer huge cost advantages versus less targeted DBMS. This applies to purchase/maintenance and administrative costs alike. And it’s true even when the general-purpose DBMS boast data warehousing features such as star indexes, bitmap indexes, or sophisticated optimizers.
- The larger the database, the bigger the difference. It’s almost inconceivable to use Oracle for a 100+ terabyte data warehouse. But if you only have 5 terabytes, Oracle is a perfectly viable – albeit annoying and costly – alternative.
- Most specialty data warehouse products have a shared-nothing architecture. Smaller parts are cheaper per unit of capacity. Hence shared nothing/grid architectures are inherently cheaper, at least in theory. In data warehousing, that theoretical possibility has long been made practical.
- Specialty data warehouse products with row-based architectures are commonly sold in appliance formats. In particular, this is true of Teradata, Netezza, DATAllegro, and Greenplum. One reason is that they’re optimized to stream data off of disk fairly sequentially, as opposed to relying on random seeks.
- Specialty data warehouse products with columnar architectures are commonly available in software-only formats. Even so, Vertica and ParAccel also boast appliance deals, with HP and Sun respectively.
- There is tremendous technical diversity and differentiation in the specialty data warehouse system market.
Let me expand on that last point. Different features may or may not be important to you, depending on whether your precise application needs include:
Read the rest of this entry »
Posted in Analytics and analytic technologies, Data warehouse appliances, Data warehousing, Database diversity, Database theory and practice, Relational database management systems | 17 Comments »
February 15th, 2008 Curt Monash
This is the second of a five-part series on database management system choices. For the first post in the series, please click here.
For the most part, relational database management systems divide into four major classes:
- High-end OLTP (OnLine Transaction Processing) relational DBMS. Oracle is the flagship for this category, followed by DB2.
- Specialty data warehouse DBMS. Teradata is the leader here, followed by Netezza, DATAllegro, ParAccel, Vertica, Infobright, Greenplum, Kognitio, Sybase IQ, and a host of others.
- Mid-range relational database management systems. Most of the contenders here fall into one or more of three categories: Open-source-based relational DBMS (MySQL, PostgreSQL, EnterpriseDB); reseller-focused relational DBMS (Progress OpenEdge, Pervasive PSQL); or crippled “editions” of high-end systems. Microsoft SQL Server was once a clear mid-range system, but now is better classified as high-end OLTP.
- Embedded relational database management systems. The leader of this category is Sybase’s SQL Anywhere. Also significant are memory-centric products Oracle TimesTen and solidDB.
Read the rest of this entry »
Posted in Database diversity, Database theory and practice, OLTP database management, Relational database management systems | 8 Comments »
February 15th, 2008 Curt Monash
This is the first in a 5-part series of posts on data management product choices. By pre-arrangement, Mike Stonebraker is responding on The Database Column, starting with his own taxonomy of DBMS types.
In the 1990s, most database management experts believed that a single general-purpose DBMS could meet substantially all needs. If you just kept adding in enough datatypes and data access methods (e.g., specialized indexes), your DBMS could eventually do a good job of meeting almost any requirement. And so, from the late 1990s into the beginning of this decade, it seemed that technology was supporting business trends, and the DBMS industry was inexorably consolidating. There was an oligopoly of high-end vendors, who sold increasingly similar super-sophisticated database management systems. Nothing else in database management seemed to matter.
Well, we were wrong. The big thing we overlooked is that database optimizations go down to the level of actual storage.
Read the rest of this entry »
Posted in Database diversity, Database theory and practice | 9 Comments »
February 14th, 2008 Curt Monash
I finally caught up with Bob Zurek about EnterpriseDB’s foray into the Elastra cloud. Here are some highlights:
- There have been dozens of applicants for the EnterpriseDB/Elastra beta program. As is usual in limited beta programs, EnterpriseDB is trying to sort out the ones who’ll make a big commitment from the tire-kickers.
- The main interest in EnterpriseDB/Elastra has come from ISVs, and secondarily from purely online businesses (e.g., SaaS vendors, web businesses, and a large MMO game vendor). There’s been a little interest from enterprises.
- Significant fractions of the EnterpriseDB/Elastra beta applications come from each of the Oracle, PostgreSQL, and MySQL user communities. A few come from SQL Server. None come from DB2.
- Bob praised Elastra for its technology in clustering, starting/stopping instances, etc. He also said that EnterpriseDB had “educated” Elastra on EnterpriseDB internals and/or admin tools, to make the integration work.
- EnterpriseDB will start turning on a few beta Elastra customers any day now (i.e., it may well not take until March, the original target).
Posted in Cloud computing, Elastra, EnterpriseDB and Postgres Plus, Mid-range DBMS, OLTP database management, Open source RDBMS, Relational database management systems | No Comments »
February 11th, 2008 Curt Monash
Single largest database >1.4 petabytes.
From Oliver Ratzesberger’s LinkedIn profile:
Our systems process in excess of 10 billion records per day, serving thousands of users and delivering hundreds of millions of queries per month in a true global 24×7 operation with distributed teams around the globe on systems over 5 PB in size (largest single system >1.4PB).
Posted in Specific users, eBay | 3 Comments »
February 8th, 2008 Curt Monash
Please do not rely on the parts of the post below that are about ParAccel. See our February 18 post about ParAccel instead.
I’ve already posted about a chat I had with Mike Stonebraker regarding Vertica yesterday. I naturally raised the subject of load speed, unaware that Mike’s colleague Stan Zlodnik had posted at length about load speed the day before. Given that post, it seems timely to go into a bit more detail, and in particular to address three questions:
1. Can columnar DBMS do operational BI?
2. Can columnar DBMS do ELT (Extract-Load-Transform, as opposed to ETL)?
3. Are columnar DBMS’ load speeds a problem other than in issues #1 and #2?
Read the rest of this entry »
Posted in Analytics and analytic technologies, Business intelligence, Columnar architectures, Data warehousing, Database theory and practice, EII, ETL, and/or EAI, Michael Stonebraker, ParAccel, Sybase, Vertica Systems | No Comments »
February 7th, 2008 Curt Monash
While chatting with Mike Stonebraker today, I finally understood why he and Dave DeWitt launched the Great MapReduce Debate:
It was all about academia.
DeWitt noticed cases where study of MapReduce replaced study of real database management in the computer science curriculum. And he thought some MapReduce-related research papers were at best misleading. So DeWitt and Stonebraker decided to set the record straight.
Fireworks ensued.
Posted in Google, BigTable, and MapReduce, Michael Stonebraker | 5 Comments »
February 7th, 2008 Curt Monash
I chatted with Andy Ellicott and Mike Stonebraker of Vertica today. Some of the content is embargoed until February 19 (for TDWI), but here are some highlights of the rest.
- Vertica now is “approaching” 50 paid customers, up from 15 or so in early November. (Compared to most of Vertica’s fellow data warehouse specialists, that’s a lot.) Many — perhaps most — of these customers are hedge funds or telcos.
- Vertica’s typical lag from sale to deployment is about one quarter.
- Vertica’s typical initial selling price is $250K. Or maybe it’s $100-150K. The Vertica guys are generally pretty forthcoming, but pricing is an exception. Whatever they charge, it’s strictly per terabyte of user data. They think they are competitive with other software vendors, and cheaper, all-in, than appliance vendors.
- One subject on which they’re totally non-forthcoming (lawyers’ orders) is the recent patent lawsuit filed by Sybase. They wouldn’t even say whether they thought it was bogus because they didn’t infringe, or whether they thought it was bogus because the patent shouldn’t have been granted.
- Average Vertica database size is a little under 10 terabytes of user data, with many examples in the 15-20 TB range. Lots of customers plan to expand to 50-100 TB.
- Vertica claims sustainable load speeds of 3-5 megabytes/sec/node, irrespective of database size. Data is sucked into RAM uncompressed, then written out a gig/node at a time, compressed. Gigabyte chunks are then merged on disk, which is superfast as it doesn’t involve sorting. (30 megabytes/second.) Mike insists this doesn’t compromise compression.
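To put the claimed 3-5 megabytes/sec/node in perspective, here is a small back-of-the-envelope calculation of what that rate means per day. The 10-node cluster is a hypothetical example of mine, not a configuration Vertica mentioned, and the sketch assumes the rate really is sustained around the clock.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_load_gb(mb_per_sec_per_node: float, nodes: int) -> float:
    """Gigabytes loaded per day at a sustained per-node rate (1 GB = 1000 MB)."""
    return mb_per_sec_per_node * nodes * SECONDS_PER_DAY / 1000.0


if __name__ == "__main__":
    nodes = 10  # hypothetical cluster size
    for rate in (3.0, 5.0):
        print(f"{rate} MB/sec/node x {nodes} nodes: "
              f"{daily_load_gb(rate, nodes):,.0f} GB/day")
```

Per node, that works out to roughly a quarter to a half terabyte per day, so how long it takes to fill a database of a given size depends almost entirely on the node count.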
We also addressed the subject of Vertica’s schema assumptions, but I’ll leave that to another post.
Posted in Analytics and analytic technologies, Data warehousing, Michael Stonebraker, Relational database management systems, Sybase, Vertica Systems | 5 Comments »
February 5th, 2008 Curt Monash
The Register reports on PostgreSQL 8.3, and emphasizes OLTP speedups and reductions in administrative burden:
Among the changes, Heap Only Tuples (HOT) that may cut the maintenance overhead of frequently updated tables by up to 75 per cent, spread checkpoints and background writer autotuning to reduce the impact of check points on response times, and an asynchronous commit option that also speeds the response times of certain transactions.
I wonder how EnterpriseDB compares on these features.
Edit: Slashdot has discussion and links. And here’s a PostgreSQL feature matrix.
Posted in EnterpriseDB and Postgres Plus, Mid-range DBMS, OLTP database management, Open source RDBMS, PostgreSQL | 1 Comment »
February 1st, 2008 Curt Monash
Dan Weinreb was one of the key techies at Object Design, the company that made the object-oriented database management system ObjectStore. (Object Design later merged into Excelon, which was eventually sold to Progress, which has deemphasized but still supports ObjectStore.) Recently he wrote a pair of long and fascinating articles about Object Design, ObjectStore, and OODBMS, the first of which makes the case that “object-oriented database management systems succeeded.” Read the rest of this entry »
Posted in Objects, Progress, Apama, and DataDirect | No Comments »
February 1st, 2008 Curt Monash
I’ve run into a research/alpha/whatever project called CouchDB a couple of times now. It’s yet another “Who needs relational databases? Who needs schemas?” kind of idea. Rather, CouchDB is for taking random documents and banging them into databases, then calculating views on the fly as needed. It’s REST-friendly. Lucene and a web server are built in.
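The general idea of schema-free documents plus views calculated on demand can be sketched in a few lines. To be clear, this is a plain in-memory illustration and not CouchDB's actual API; the sample documents, the `posts_by_author` map function, and `compute_view` are all names I made up for the example.

```python
from collections import defaultdict

# Arbitrary documents with no shared schema (hypothetical sample data).
documents = [
    {"_id": "1", "type": "post", "author": "ann", "tags": ["dbms", "oltp"]},
    {"_id": "2", "type": "comment", "author": "bob", "text": "Nice post"},
    {"_id": "3", "type": "post", "author": "ann", "tags": ["couchdb"]},
]


def posts_by_author(doc):
    """Map function: emit (key, value) pairs only for documents that qualify."""
    if doc.get("type") == "post":
        yield doc["author"], doc["_id"]


def compute_view(docs, map_fn):
    """Build a view on the fly by running map_fn over every document."""
    view = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            view[key].append(value)
    return dict(view)


if __name__ == "__main__":
    print(compute_view(documents, posts_by_author))
```

Documents that don't match a view's map function are simply skipped, which is what lets differently shaped documents share one database without a schema.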
Damien Katz seems to be the driving force behind CouchDB, and his discussion of document-oriented development seems to be a good starting point. Read the rest of this entry »
Posted in CouchDB, Database diversity, Database theory and practice, Native XML | 3 Comments »