September 19, 2009

Oracle gives a few customer database size examples

In its recent quarterly conference call, Oracle said (as per the Seeking Alpha transcript):

AC Neilsen, for instance, we deployed a 45-terabyte data [mart], they called it; Adidas, 13 terabytes; Australian Bureau of Statistics, 250 terabytes; and of course, some of our high-end ones that you have probably heard of in the past, AT&T, 250 terabytes; Yahoo!, 700 terabytes — just gives you an idea of the size of the databases that are out there and how they are growing, and that’s driving the need for greater throughput.

I don’t know what’s being counted there, but I wouldn’t be surprised if those were legit user-data figures.

Some other notes:

September 13, 2009

HadoopDB

Despite a thoughtful heads-up from Daniel Abadi at the time of his original posting about HadoopDB, I’m just getting around to writing about it now. HadoopDB is a research project carried out by a couple of Abadi’s students. Further research is definitely planned. But it seems too early to say that HadoopDB will ever get past the “research and oh by the way the code is open sourced” stage and become a real code line — whether commercialized, open source, or both.

The basic idea of HadoopDB is to put copies of a DBMS at different nodes of a grid, and use Hadoop to parcel work among them. Major benefits when compared with massively parallel DBMS are said to be:

HadoopDB has actually been built with PostgreSQL. That version achieved performance well below that of a commercial DBMS “DBX”, where X=2. Column-store guru Abadi has repeatedly signaled his intention to try out HadoopDB with VectorWise at the nodes instead. (Recall that VectorWise is shared-everything.) It will be interesting to see how that configuration performs.
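To make the basic idea concrete, here is a minimal sketch of "a DBMS at each node, with a coordinating layer parceling out work." It is not HadoopDB code. As loud assumptions for illustration only, in-memory SQLite databases stand in for the PostgreSQL instances, a thread pool stands in for Hadoop, and the `sales` table and all function names are hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_node(rows):
    """Create one 'node': an independent in-memory SQLite database
    holding a single partition of the (hypothetical) sales table."""
    # check_same_thread=False lets a worker thread use a connection
    # that was created in the main thread.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def local_query(conn):
    """'Map' side: every node runs the same SQL against its own partition."""
    sql = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return conn.execute(sql).fetchall()


def merge(partials):
    """'Reduce' side: combine the per-node partial aggregates."""
    totals = {}
    for rows in partials:
        for region, subtotal in rows:
            totals[region] = totals.get(region, 0) + subtotal
    return totals


def run_distributed(partitions):
    """Stand-in for the coordinating layer: one task per node, then a merge."""
    nodes = [make_node(p) for p in partitions]
    try:
        with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
            partials = list(pool.map(local_query, nodes))
    finally:
        for conn in nodes:
            conn.close()
    return merge(partials)
```

Calling `run_distributed` with a list of row lists, one per node, returns a dictionary of per-region totals. The point of the design is that each node's own DBMS does the local SQL work, while the coordinating layer only distributes tasks and merges the small partial results.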

The real opportunity for HadoopDB, however, in my opinion may lie elsewhere. Read more

September 13, 2009

Fault-tolerant queries

MapReduce/Hadoop fans sometimes raise the question of query fault-tolerance. That is — if a node fails, does the query need to be restarted, or can it keep going? For example, Daniel Abadi et al. trumpet query fault-tolerance as one of the virtues of HadoopDB. Some of the scientists at XLDB spoke of query fault-tolerance as being a good reason to leave 100s or 1000s of terabytes of data in Hadoop-managed file systems.
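The restart-versus-keep-going distinction can be illustrated with a toy sketch. Nothing here is taken from Hadoop, HadoopDB, or any real DBMS. `FlakyNode` and both runner functions are hypothetical names, and a "query" is assumed to be nothing more than a sum over per-node partial results.

```python
class FlakyNode:
    """Hypothetical node that fails a fixed number of times before succeeding."""

    def __init__(self, data, failures=0):
        self.data = data
        self.failures_left = failures
        self.attempts = 0  # how many times this node's task was started

    def scan(self):
        self.attempts += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise RuntimeError("node failed")
        return sum(self.data)


def run_with_restart(nodes):
    """No query fault-tolerance: any failure throws away all partial
    results and the whole query starts over."""
    while True:
        try:
            return sum(node.scan() for node in nodes)
        except RuntimeError:
            continue  # restart the entire query


def run_fault_tolerant(nodes):
    """Query fault-tolerance: only the failed task is re-run, and
    completed partial results are kept."""
    partials = []
    for node in nodes:
        while True:
            try:
                partials.append(node.scan())
                break
            except RuntimeError:
                continue  # retry just this node's task
    return sum(partials)


def total_attempts(nodes):
    """Total task executions across all nodes, as a rough cost measure."""
    return sum(node.attempts for node in nodes)
```

With three nodes where the last one fails once, both runners return the same answer, but the restart approach executes six tasks while the fault-tolerant one executes four. The gap widens as the number of nodes and the duration of the query grow, which is the heart of the argument for caring about it at large scale.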

When we discussed this subject a few months ago in a couple of comment threads, it seemed to be the case that:

This raises an obvious (pair of) question(s) — why and/or when would anybody ever care about query fault-tolerance? Read more

September 12, 2009

Introduction to the XLDB and SciDB projects

Before I write anything else about the overlapping efforts known as XLDB and SciDB, I probably should explain and disambiguate what they are as best I can. XLDB was organized and still is run by guys who want to solve a scientific problem in eXtremely Large DataBase Management, most especially Jacek Becla of SLAC (the organization previously known as Stanford Linear Accelerator Center). Becla’s original motivation was that he needs a DBMS to manage what will be 55 petabytes of raw image data and 100 petabytes of astronomical data total for LSST (Large Synoptic Survey Telescope). Read more

September 12, 2009

Availability nightmares continue

We’re having a lot of outages on our blogs.  Downtown Host tells me that huge numbers of MySQL processes are being spawned. I have trouble understanding why, as WP-SuperCache (Edit: Actually, just WP-Cache) is enabled, robots.txt has a crawl delay, and so on.

As of yesterday, we were getting 1 1/2 megabytes/hour of “MySQL database has gone away” errors. After Downtown Host declined to discuss that subject with us, Melissa Bradshaw implemented, at least for this blog, a workaround to change the MySQL wait_timeout settings ourselves. Clever idea, and seemed to work for half a day, but now the problems have returned.

Downtown Host isn’t saying much more than “Look at these logs. Your blogs are experiencing a lot of queries and spawning dozens upon dozens of MySQL processes. The main offender is DBMS2.” I don’t know when we’ll get this sorted out. I fly to Europe tomorrow. I have a cough. I’m exhausted. I’m sorry.

September 11, 2009

Xkoto Gridscale highlights

I talked yesterday with cofounders Albert Lee and Ariff Kassam of Xkoto. Highlights included: Read more

September 10, 2009

Thinking about analytic speed

For a variety of reasons, I don’t plan to post my complete Enzee Universe keynote slide deck soon, if ever. But perhaps one or more of its subjects are worth spinning out in their own blog posts.

I’m going to start with analytic speed or, equivalently, analytic latency. There is, obviously, a huge industry emphasis on speed. Indeed, there’s so much emphasis that confusion often ensues. My goal in this post is not really to resolve the confusion; that would be ambitious to the max. But I’m at least trying to call attention to it, so that we can all be more careful in our discussions going forward, and perhaps contribute to a framework for those discussions as well.

Key points include:

1. There are two important senses of “latency” in analytics. One is just query response time. The other is the length of the interval between when data is captured and when it is available for analytic purposes. They’re often conflated — and indeed I shall do so for the remainder of this post.

2. There are many different kinds of analytic speed, which to a large extent can be viewed separately. Major areas include:

There certainly are relationships among those; e.g., a really great analytic DBMS could help speed up any and all of the last three categories. But when assessing your needs, you can go quite far viewing each of those areas separately.

3. It is indeed important to carefully assess your need-for-speed. Acceptable levels of analytic latency vary widely, ranging from sub-millisecond to multi-month. Read more

September 10, 2009

What could or should make Oracle/MySQL antitrust concerns go away?

When the Oracle/MySQL deal was first announced, I wrote:

I can probably come up with business practices that could make things very hard on Oracle/MySQL competitors … but I haven’t found a compelling antitrust trigger on my first pass over the subject.

Subsequently, there’s been a lot of discussion about whether or not Oracle can use control of MySQL to make life difficult for third-party MySQL storage engine vendors.

Now the European Commission is delaying the Oracle/Sun deal, explicitly because of Oracle/MySQL antitrust fears. That is, the European Commission wants to be reassured that an Oracle takeover of MySQL won’t unduly impinge upon the future availability of open source/low cost DBMS alternatives. This raises the natural question:

What could Oracle do to assure concerned parties that its ownership of MySQL won’t unduly hamper open-source-based DBMS competition?

I think that’s indeed the crucial question. The Oracle/Sun deal has enough momentum at this point that it both should and will be allowed to happen — perhaps with safeguards — rather than banned outright. If  you have concerns about Oracle’s pending acquisition of MySQL, you should speak up and outline what kinds of regulatory safeguards would alleviate the problems you foresee.

More or less obvious possibilities include:

September 3, 2009

Teradata really means that those 100+ appliances are in PRODUCTION

I was misremembering.  It turns out that when Teradata said it had over 100 appliances “in production”, it meant that >100 hardware-based appliances are actually in production. If you add in the software-only “appliances,” and count test/development as well as true production, the total rises to >200.

I tried to get a finer breakdown out of Teradata on a disclosable basis, but failed. The ostensible reason is that public companies often don’t do that sort of thing without permission from the investor relations department, and Teradata’s marketers evidently haven’t felt a sense of urgency about getting permission to, for example, communicate how well just the 25xx series is doing.

September 3, 2009

Continuent on clustering

Robert Hodges, CTO of my client Continuent, put up a blog post laying out his and Continuent’s views on database clustering. Continuent offers Tungsten, its third try at database clustering technology, targeted at MySQL, PostgreSQL, and perhaps Oracle. Unlike Continuent’s more ambitious second-generation product, Tungsten offers single-master replication, which in Robert’s view allows for great ease of deployment and administration (he likes the phrase “bone-simple”).

The downside to Continuent Tungsten’s stripped-down architecture is that it doesn’t solve the most extreme performance scale-out problems. Instead, Continuent focuses on the other big benefits of keeping your data in more than one place, namely high availability and data loss prevention (i.e., backup).

Continuent has been around for a number of years, starting out in Finland but now being based in Silicon Valley. For most purposes, however, it’s reasonable to think of Continuent and Tungsten as start-up efforts.

As you might guess from the references to Finland and MySQL, Continuent’s products are open source, or at least have open source versions. I’m still a little fuzzy as to which features are open sourced and which are not. For that matter, I’m still unclear as to Tungsten’s feature list overall …
