Theory and architecture
Analysis of design choices in databases and database management systems. Related subjects include:
- Database diversity
- Explicit support for specific data types
- (in Text Technologies) Text search
Teradata Developer Exchange (DevX) begins to emerge
Every vendor needs developer-facing web resources, and Teradata turns out to have been working on a new umbrella site of its own. It’s called Teradata Developer Exchange — DevX for short. Teradata DevX seems to be in a low-volume beta now, with a press release and bigger roll-out coming in the next week or so. Major elements are about what one would expect:
- Articles
- Blogs
- Downloads
- Surprisingly, so far as I can tell, no forums
If you’re a Teradata user, you absolutely should check out Teradata DevX. If you just research Teradata — my situation 🙂 — there are some aspects that might be of interest anyway. In particular, I found Teradata’s downloads instructive, most particularly those in the area of extensibility. Mainly, these are UDFs (User-Defined Functions), in areas such as:
- Compression
- Geospatial data
- Imitating Oracle or DB2 UDFs (as migration aids)
Also of potential interest is a custom-portlet framework for Teradata’s management tool Viewpoint. A straightforward use would be to plunk some Viewpoint data into a more general system management dashboard. A yet cooler use — and I couldn’t get a clear sense of whether anybody’s ever done this yet — would be to offer end users some insight as to how long their queries are apt to run.
Categories: Database compression, Emulation, transparency, portability, GIS and geospatial, Teradata | 2 Comments |
MySQL forking heats up, but not yet to the benefit of non-GPLed storage engine vendors
Last month, I wrote “This is a REALLY good time to actively strengthen the MySQL forkers,” largely on behalf of closed-source/dual-source MySQL storage engine vendors such as Infobright, Kickfire, Calpont, Tokutek, or ScaleDB. Yesterday, two of my three candidates to lead the effort — namely Monty Widenius/MariaDB/Monty Program AB and Percona — came together to form something called the Open Database Alliance. Details may be found:
- On the Open Database Alliance website
- In a press release
- On Monty Widenius’ blog
- In a Stephen O’Grady blog post based on a discussion with Monty Widenius
- In an ars technica blog post based on a discussion with Monty Program AB’s Kurt von Finck
But there’s no joy for the non-GPLed MySQL storage engine vendors in the early news. Read more
Categories: MySQL, Open source, Theory and architecture | 16 Comments |
Facebook’s experiences with compression
One little topic didn’t make it into my long post on Facebook’s Hadoop/Hive-based data warehouse: Compression. The story seems to be:
- Facebook uses gzip, and gets a little bit more than 6X compression.
- Experiments suggest bzip2 would reduce data by another 20% or so, increasing compression to the 7.5X range (arithmetic sketched after this list).
- The downside of bzip2 is 15-25% processing overhead, depending on the kind of data.
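For readers who, like me, pause at that arithmetic: the extra 20% applies to the already-gzipped size, not to the ratio, which is how 6X becomes roughly 7.5X rather than 7.2X. A minimal Python check, using only the figures above:

```python
# Back-of-envelope check of the compression figures above.
gzip_ratio = 6.0              # reported gzip compression, a bit over 6X
bzip2_extra_reduction = 0.20  # bzip2 reportedly shaves another ~20% off the gzipped size

gzip_fraction = 1.0 / gzip_ratio                                # data shrinks to ~17% of raw size
bzip2_fraction = gzip_fraction * (1.0 - bzip2_extra_reduction)  # ~13% of raw size
bzip2_ratio = 1.0 / bzip2_fraction

print(f"gzip ~{gzip_ratio:.1f}X, bzip2 ~{bzip2_ratio:.1f}X")    # gzip ~6.0X, bzip2 ~7.5X
```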
Categories: Data warehousing, Database compression, Facebook, Hadoop | 2 Comments |
The secret sauce to Clearpace’s compression
In an introduction to archiving vendor Clearpace last December, I noted that Clearpace claimed huge compression successes for its NParchive product (Clearpace likes to use a figure of 40X), but didn’t give much reason to believe that NParchive could compress a lot more effectively than other columnar DBMS. Let me now follow up on that.
To the extent there’s a Clearpace secret sauce, it seems to lie in NParchive’s unusual data access method. NParchive doesn’t just tokenize the values in individual columns; it tokenizes multi-column fragments of rows. Which particular columns to group together in that way seems to be decided automagically; the obvious guess is that this is based on estimates of the cardinality of their Cartesian products.
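To make that concrete, here is a minimal sketch of the idea as I understand it (my guess, not Clearpace's actual implementation): dictionary-encode multi-column fragments rather than single columns, and pick which columns to group by comparing the distinct combinations actually observed against the cardinality of the columns' Cartesian product.

```python
from itertools import combinations

def distinct(rows, cols):
    """Count distinct value-combinations for a tuple of columns."""
    return len({tuple(row[c] for c in cols) for row in rows})

def best_column_group(rows, columns, size=2):
    """Pick the column group whose observed combinations are fewest relative to
    the Cartesian product of the individual cardinalities, i.e. the columns whose
    values are most correlated and therefore tokenize best."""
    def score(cols):
        product = 1
        for c in cols:
            product *= distinct(rows, (c,))
        return distinct(rows, cols) / product
    return min(combinations(columns, size), key=score)

def tokenize(rows, group):
    """Dictionary-encode multi-column fragments: each distinct fragment
    gets one small token, and its values are stored only once."""
    dictionary, encoded = {}, []
    for row in rows:
        fragment = tuple(row[c] for c in group)
        token = dictionary.setdefault(fragment, len(dictionary))
        encoded.append(token)
    return dictionary, encoded

# Toy example: city and zip are highly correlated, so grouping them yields
# far fewer dictionary entries than their Cartesian product would suggest.
rows = [
    {"city": "Boston", "zip": "02110", "amount": 10},
    {"city": "Boston", "zip": "02110", "amount": 25},
    {"city": "Acton",  "zip": "01720", "amount": 10},
]
group = best_column_group(rows, ["city", "zip", "amount"])
dictionary, encoded = tokenize(rows, group)
print(group, dictionary, encoded)
```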
Off the top of my head, examples for which this strategy might be particularly successful include:
- Denormalized databases
- Message stores with lots of header information
- Addresses
Categories: Archiving and information preservation, Columnar database management, Database compression, Rainstor | 8 Comments |
Facebook, Hadoop, and Hive
A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook that he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon, and in a couple of instances correct, what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.
Updating the metrics in my Cloudera post,
- Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly better than 6:1 overall compression ratio. So the 2 1/2 petabytes figure for user data is reasonable (see the quick check after this list).
- Facebook’s Hadoop/Hive system ingests 15 terabytes of new data per day now, not 10.
- Hadoop/Hive cycle times aren’t as fast as I thought I heard from Jeff. Ad targeting queries are the most frequent, and they’re run hourly. Dashboards are repopulated daily.
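As a quick sanity check on those numbers (my arithmetic, not Facebook's):

```python
# Rough consistency check of the Facebook figures above.
disk_tb = 400        # terabytes of disk managed by Hadoop/Hive
compression = 6.0    # "slightly better than 6:1"
user_data_pb = disk_tb * compression / 1000
print(f"~{user_data_pb:.1f} PB of user data")  # ~2.4 PB, consistent with "2 1/2 petabytes"
```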
Nothing else in my Cloudera post was called out as being wrong.
In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters. Read more
Categories: Data warehousing, EAI, EII, ETL, ELT, ETLT, Facebook, Hadoop, MapReduce, Parallelization, Petabyte-scale data management, Specific users, Web analytics, Yahoo | 55 Comments |
eBay’s two enormous data warehouses
A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I’ve already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn’t like) and the astonishingly great difference between high- and low-end disk drives (to which eBay clued me in). Now I’m finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.
Metrics on eBay’s main Teradata data warehouse include:
- >2 petabytes of user data
- 10s of 1000s of users
- Millions of queries per day
- 72 nodes
- >140 GB/sec of I/O, or 2 GB/node/sec, or maybe that’s a peak when the workload is scan-heavy
- 100s of production databases being fed in
Metrics on eBay’s Greenplum data warehouse (or, if you like, data mart) include:
- 6 1/2 petabytes of user data
- 17 trillion records
- 150 billion new records/day, which seems to suggest an ingest rate well over 50 terabytes/day (back-of-envelope arithmetic after this list)
- 96 nodes
- 200 MB/node/sec of I/O (that’s the order of magnitude difference that triggered my post on disk drives)
- 4.5 petabytes of storage
- 70% compression
- A small number of concurrent users
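A few back-of-envelope checks on those figures (my arithmetic; the bytes-per-record number is simply what falls out of dividing the figures above):

```python
# Rough arithmetic behind the eBay figures above.

# Teradata warehouse: per-node I/O
teradata_io_gb_per_sec = 140
teradata_nodes = 72
print(f"Teradata: ~{teradata_io_gb_per_sec / teradata_nodes:.1f} GB/node/sec")  # ~1.9, i.e. ~2 GB/node/sec

# Greenplum warehouse: implied record size and daily ingest
greenplum_user_data_pb = 6.5
greenplum_records = 17e12
bytes_per_record = greenplum_user_data_pb * 1e15 / greenplum_records
new_records_per_day = 150e9
ingest_tb_per_day = new_records_per_day * bytes_per_record / 1e12
print(f"Greenplum: ~{bytes_per_record:.0f} bytes/record, ~{ingest_tb_per_day:.0f} TB/day")  # ~382 bytes, ~57 TB/day

# The per-node I/O gap that triggered the disk-drive post
greenplum_io_mb_per_node_sec = 200
gap = teradata_io_gb_per_sec / teradata_nodes * 1000 / greenplum_io_mb_per_node_sec
print(f"Per-node I/O gap: ~{gap:.0f}x")  # ~10x, i.e. an order of magnitude
```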
Categories: Analytic technologies, Data warehouse appliances, Data warehousing, eBay, Greenplum, Petabyte-scale data management, Teradata, Web analytics | 48 Comments |
Some DB2 highlights
I chatted with IBM Thursday, about recent and imminent releases of DB2 (9.5 through 9.7). Highlights included:
- DB2 is getting Oracle emulation, which I posted about separately.
- IBM says that it had >50 new DB2 data warehouse customers last year. I neglected to ask how many of these had been general-purpose DB2 customers all along.
- By “data warehouse customer” I mean a user of InfoSphere Warehouse, which previously was called DB2’s DPF (Database Partitioning Feature). Apparently, this includes both logical and physical partitioning. I.e., DB2 isn’t shared-nothing without this feature.
- IBM is proud of DB2’s compression, which it claims commonly reaches 70-80%. It calls this “industry-leading” in comparison to Oracle, SQL Server, and other general-purpose relational DBMS.
- DB2 compression’s overall effect on performance stems from a trade-off between I/O (lessened) and CPU burden (increased). For OLTP workloads, this is about a wash. For data warehousing workloads, IBM says 20% performance improvement from compression is average.
- DB2 now has Label-Based Access Control (LBAC), its version of one of my favorite Oracle security features (Oracle calls its equivalent Label Security). A label-control feature can make it much easier to secure data on a row-by-row, value-by-value basis. The obvious big user is national intelligence, followed by financial services. IBM says the health care industry also has interest in LBAC.
- Also in the security area, IBM reworked DB2’s audit feature for 9.5.
- I think what I heard in our discussion of DB2 virtualization is:
- Increasingly, IBM is seeing production use of VMware, rather than just test/development.
- IBM believes it is a much closer partner to VMware than Oracle or Microsoft is, because it’s not pushing its own competing technology.
- Generally, virtualization is more important for OLTP workloads than data warehousing ones, because OLTP apps commonly only need part of the resources of a node while data warehousing often wants the whole node.
- AIX data warehousing is an exception. I think this is because AIX equates to big SMP boxes, and virtualization lets you spread out the data warehousing processing across more nodes, with the usual parallel I/O benefits.
- When IBM talks of new autonomic/self-tuning features in DB2, they’re used mainly for databases under 1 terabyte in size. Indeed, the self-tuning feature set doesn’t work with InfoSphere Warehouse.
- Even with the self-tuning feature it sounds as if you need at least a couple of DBA hours per instance per week, on average.
- DB2 on Linux/Unix/Windows has introduced some enhanced workload management features analogous to those long found in mainframe DB2. For example, resource allocation rules can be scheduled by time. (The point of workload management is to allocate resources such as CPU or I/O among the simultaneous queries or other tasks that contend for them.) Workload management rules can have thresholds for amounts of resources consumed, after which the priority for a task can go up (“Get it over with!”) or down (“Stop hogging my system!”). A minimal sketch of the threshold idea follows this list.
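Purely as illustration, in Python rather than anything resembling DB2's actual interface; the rule names and numbers below are made up:

```python
from dataclasses import dataclass

@dataclass
class WorkloadRule:
    """A toy workload-management rule: once a task has consumed more than
    `threshold` of some resource, its priority is adjusted by `delta`."""
    resource: str      # e.g. "cpu_seconds" or "rows_read"
    threshold: float
    delta: int         # positive = boost ("get it over with"), negative = demote ("stop hogging")

def adjusted_priority(base_priority, consumed, rules):
    """Apply every rule whose threshold the task has already crossed."""
    priority = base_priority
    for rule in rules:
        if consumed.get(rule.resource, 0) > rule.threshold:
            priority += rule.delta
    return priority

# Hypothetical rules: boost queries that have already burned a minute of CPU,
# but demote anything that has read more than a billion rows.
rules = [
    WorkloadRule("cpu_seconds", 60, +1),
    WorkloadRule("rows_read", 1_000_000_000, -2),
]
print(adjusted_priority(5, {"cpu_seconds": 90, "rows_read": 2_000_000_000}, rules))  # 4
```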
Categories: Application areas, Data warehousing, Database compression, IBM and DB2, Market share and customer counts, OLTP, Parallelization, Workload management | 2 Comments |
Clearing some of my buffer
I have a large number of posts still in backlog. For starters, there are ones based on recent visits with Aster, Greenplum, Sybase, Vertica, and a Very Large User. I suspect I’ll write more soon on Oracle as well. Plus there’s my whole future-of-online-media area. And quite a bit more will grow out of planned research.
So there are a whole lot of other worthy subjects I doubt I’ll be getting to any time soon. In some cases, of course, other people are doing a great job of writing about the same subjects. Here are pointers to a few links that I am glad to recommend:
- I wrote recently that I’ve discovered a number of different in-memory OLAP engines. Cindi Howson far outdid that, writing at length for Intelligent Enterprise on in-memory analytics, in an article that itself seems to be a teaser for a longer, free white paper on the subject.
- CouchDB posted an eye-catching, risque slide presentation promoting CouchDB and, more generally, key-value stores, at least for internet applications. And yes, they’ve integrated MapReduce.
- Merv Adrian posted favorably about Birst, with special reference to its OEM efforts. As previously noted, I was highly unimpressed with Birst’s end-user BI story at the time of its September roll-out, and Jerome Pineau’s recent examination did nothing to reassure me. But perhaps OEM is a different matter.
- Merv also offers an interesting post about data integration upstart Expressor, and a highly favorable one about “visualization” vendor Tableau.
- Ann All interviewed Nigel Pendse, who grumped that BI features are overrated, and what end users really want is great query performance. I’m not so sure about the features side of that, but I’m hugely in agreement about the performance. That’s a big part of why the analytic DBMS industry is so vibrant. It’s also why in-memory OLAP is suddenly so hot.
MySQL storage engine round-up, with Oracle-related thoughts
Here’s what I know about MySQL storage engines, more or less.
- MySQL with MyISAM is fast. But it’s not transactional. Except for limited purposes, MySQL with MyISAM is a pretty crummy DBMS. Nothing can change that.
- MySQL with InnoDB is transactional. But it’s not particularly fast. MySQL with InnoDB is a pretty mediocre DBMS. Oracle could fix that, at least partially, over time.
- I don’t know much about Falcon, Maria, and so on. With Oracle winding up owning both MySQL and InnoDB, the motivation for those engines (except as Oracle-free forks) might fade.
- Infobright is the most established of the rest. At the moment I’m not recommending it for most industrial-strength uses unless the user is particularly cash-constrained. But I wouldn’t be surprised if that changed soon. A cheap, fast, simple columnar analytic DBMS has a place in the world.
- Kickfire is next in line, offering a hardware-based growth path for users who’ve maxed out on what unaided MySQL can do. It remains to be seen for how many users the desire to keep things simple and stay with MySQL outweighs the desire to avoid custom hardware. Having Oracle salespeople all over those accounts surely wouldn’t help. Kickfire also has a second market, namely OEM vendors who are mainly interested in the superfast chip. That would probably be pretty unaffected by Oracle.
- Tokutek offers a technical proposition that’s hard to match head-on without going the CEP route. Users who care are likely to be MySQL shops. Tokutek’s main challenge is to prove that it sufficiently outdoes competing technical strategies for sufficiently many users. Oracle ownership of MySQL seems pretty irrelevant to Tokutek’s success or failure.
- Calpont offers a kind of lightweight Exadata alternative. With Calpont’s packaging and positioning perennially unclear, it’s difficult to predict the effect of a particular change — i.e., Oracle buying MySQL — in Calpont’s market environment.
- I haven’t heard from transactionally-oriented ScaleDB since I wrote about them a year ago. Apparently, they’re rolling out beta product this week, and their venerable techie guru sadly passed away earlier this month.
Categories: Calpont, Columnar database management, Data warehousing, Exadata, Infobright, Kickfire, MySQL, Open source, Oracle, Tokutek and TokuDB | 14 Comments |
Calpont update — you read it here first!
Calpont has gone through a lot of strategy iterations since its founding. The super-short version is that Calpont originally planned an appliance built around a SQL chip, much like Kickfire. But after various changes in management and venture backing, Calpont turned itself into a software-only analytic DBMS vendor relying on a MySQL front end. Calpont is now at the stage of announcing an Early Adopter program at the MySQL conference on Wednesday, although details of Calpont’s product release timing, pricing, feature set, etc. are all To Be Determined.
Minor highlights of the Calpont technical story include: Read more