Availability nightmares continue
We’re having a lot of outages on our blogs. Downtown Host tells me that huge numbers of MySQL processes are being spawned. I have trouble understanding why, as WP-SuperCache (Edit: Actually, just WP-Cache) is enabled, robots.txt has a crawl delay, and so on.
As of yesterday, we were getting 1 1/2 megabytes/hour of “MySQL database has gone away” errors. After Downtown Host declined to discuss that subject with us, Melissa Bradshaw implemented, at least for this blog, a workaround to change the MySQL wait_timeout settings ourselves. Clever idea, and it seemed to work for half a day, but now the problems have returned.
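A workaround of this kind usually comes down to raising MySQL's `wait_timeout` for the application's own sessions, so that idle connections are not dropped so quickly. The sketch below is only an illustration of that idea, not what was actually deployed on this blog: the helper name `raise_wait_timeout` and the 600-second default are assumptions, and `cursor` stands for any Python DB-API cursor already connected to MySQL.

```python
def raise_wait_timeout(cursor, seconds=600):
    """Ask MySQL to keep this session's idle connection open longer.

    `cursor` is any DB-API cursor connected to MySQL. The 600-second
    default is an arbitrary illustrative value, not a recommendation.
    """
    seconds = int(seconds)  # coerce to int so only a number reaches the SQL text
    if seconds <= 0:
        raise ValueError("wait_timeout must be a positive number of seconds")
    statement = f"SET SESSION wait_timeout = {seconds}"
    cursor.execute(statement)  # affects only the current session
    return statement
```

Because the statement is session-scoped, it has to be issued on every new connection, and it does nothing about the underlying query load.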
Downtown Host isn’t saying much more than “Look at these logs. Your blogs are experiencing a lot of queries and spawning dozens upon dozens of MySQL processes. The main offender is DBMS2.” I don’t know when we’ll get this sorted out. I fly to Europe tomorrow. I have a cough. I’m exhausted. I’m sorry.
Categories: About this blog, MySQL | 4 Comments |
Xkoto Gridscale highlights
I talked yesterday with cofounders Albert Lee and Ariff Kassam of Xkoto. Highlights included: Read more
Categories: Clustering, IBM and DB2, Market share and customer counts, Microsoft and SQL*Server, Parallelization, Pricing, Xkoto | 15 Comments |
Thinking about analytic speed
For a variety of reasons, I don’t plan to post my complete Enzee Universe keynote slide deck soon, if ever. But perhaps one or more of its subjects are worth spinning out in their own blog posts.
I’m going to start with analytic speed or, equivalently, analytic latency. There is, obviously, a huge industry emphasis on speed. Indeed, there’s so much emphasis that confusion often ensues. My goal in this post is not really to resolve the confusion; that would be ambitious to the max. But I’m at least trying to call attention to it, so that we can all be more careful in our discussions going forward, and perhaps contribute to a framework for those discussions as well.
Key points include:
1. There are two important senses of “latency” in analytics. One is just query response time. The other is the length of the interval between when data is captured and when it is available for analytic purposes. They’re often conflated — and indeed I shall do so for the remainder of this post.
2. There are many different kinds of analytic speed, which to a large extent can be viewed separately. Major areas include:
- Data exploration. In-memory OLAP is a huge trend, and QlikView is a hot BI product line.
- Budgeting/planning. In an unprecedentedly frightening economy, annual planning/forecasting cycles may well be too slow.
- Operational integration. This is probably the biggest current area of mission-critical IT advancement. Not coincidentally, it is also the mainstay of the most expensive and complex data warehousing technologies. It’s also an ongoing area of application for event/stream processing, aka CEP.
- General or deep analytics. This is what I seem to spend much of my time writing about — data warehousing price/performance, parallelized data mining, and much more.
- Data administration. Ease of data mart spin-out and administration is becoming a major concern. And of course analytic appliance and DBMS vendors have been telling ease-of-deployment, low-DBA-involvement kinds of stories at least since Netezza first came to market.
There certainly are relationships among those; e.g., a really great analytic DBMS could help speed up any and all of the last three categories. But when assessing your needs, you can go quite far viewing each of those areas separately.
3. It is indeed important to carefully assess your need-for-speed. Acceptable levels of analytic latency vary widely, ranging from sub-millisecond to multi-month. Read more
Categories: Analytic technologies, Business intelligence, Data warehousing, Presentations | 5 Comments |
What could or should make Oracle/MySQL antitrust concerns go away?
When the Oracle/MySQL deal was first announced, I wrote:
I can probably come up with business practices that could make things very hard on Oracle/MySQL competitors … but I haven’t found a compelling antitrust trigger on my first pass over the subject.
Subsequently, there’s been a lot of discussion about whether or not Oracle can use control of MySQL to make life difficult for third-party MySQL storage engine vendors.
Now the European Commission is delaying the Oracle/Sun deal, explicitly because of Oracle/MySQL antitrust fears. That is, the European Commission wants to be reassured that an Oracle takeover of MySQL won’t unduly impinge upon the future availability of open source/low cost DBMS alternatives. This raises the natural question:
What could Oracle do to assure concerned parties that its ownership of MySQL won’t unduly hamper open-source-based DBMS competition?
I think that’s indeed the crucial question. The Oracle/Sun deal has enough momentum at this point that it both should and will be allowed to happen — perhaps with safeguards — rather than banned outright. If you have concerns about Oracle’s pending acquisition of MySQL, you should speak up and outline what kinds of regulatory safeguards would alleviate the problems you foresee.
More or less obvious possibilities include:
- Divest MySQL. This is obviously an extreme measure, but it surely would work.
- Provide some money and trademark rights to MySQL forkers. If MariaDB and Drizzle were put into strong competitive positions with MySQL today, it’s hard to see how regulators could object to any future maneuverings Oracle might envision with the GPLed side of MySQL.
- Offer a standard, attractive, long-term deal to MySQL bundlers. The commercial/non-GPL version of MySQL is a requirement for appliance vendors (surely), OEM vendors (probably), and storage engine vendors (maybe — I disagree, but I’m evidently in the minority).
- Strengthen PostgreSQL. 🙂 Realistically, that’s not going to be part of any Oracle/MySQL resolution, so I’ll leave it as a subject for another time.
Categories: Mid-range, MySQL, Open source, Oracle, PostgreSQL | 9 Comments |
Teradata really means that those 100+ appliances are in PRODUCTION
I was misremembering. It turns out that when Teradata said it had over 100 appliances “in production”, it meant that >100 hardware-based appliances are actually in production. If you add in the software-only “appliances,” and count test/development as well as true production, the total rises to >200.
I tried to get a finer breakdown out of Teradata on a disclosable basis, but failed. The ostensible reason is that public companies often don’t do that sort of thing without permission from the investor relations department, and Teradata’s marketers evidently haven’t felt a sense of urgency about getting permission to, for example, communicate how well just the 25xx series is doing.
Categories: Data warehouse appliances, Data warehousing, Market share and customer counts, Teradata | 1 Comment |
Continuent on clustering
Robert Hodges, CTO of my client Continuent, put up a blog post laying out his and Continuent’s views on database clustering. Continuent offers Tungsten, its third try at database clustering technology, targeted at MySQL, PostgreSQL, and perhaps Oracle. Unlike Continuent’s more ambitious second-generation product, Tungsten offers single-master replication, which in Robert’s view allows for great ease of deployment and administration (he likes the phrase “bone-simple”).
The downside to Continuent Tungsten’s stripped-down architecture is that it doesn’t solve the most extreme performance scale-out problems. Instead, Continuent focuses on the other big benefits of keeping your data in more than one place, namely high availability and data loss prevention (i.e., backup).
Continuent has been around for a number of years, starting out in Finland but now being based in Silicon Valley. For most purposes, however, it’s reasonable to think of Continuent and Tungsten as start-up efforts.
As you might guess from the references to Finland and MySQL, Continuent’s products are open source, or at least have open source versions. I’m still a little fuzzy as to which features are open sourced and which are not. For that matter, I’m still unclear as to Tungsten’s feature list overall …
Categories: Clustering, Continuent, MySQL, Open source, PostgreSQL | 3 Comments |
Teradata and Netezza are doing MapReduce too
Netezza told me a while ago that it planned to introduce MapReduce, and agreed yesterday this was no longer NDAed. Stephen Brobst of Teradata* let slip at XLDB that Teradata has MapReduce too, apparently implemented but not yet generally available.
I don’t have details in either case. Netezza and Teradata evidently aren’t taking MapReduce as seriously as Aster Data, or even Greenplum or Vertica. But MapReduce has become pretty much of a “checkmark” item for large-database analytic DBMS vendors even so.
*Technically, Brobst is not and never has been a Teradata employee — but he’s widely and correctly regarded as being “of Teradata” even so. 🙂
Categories: Data warehousing, MapReduce, Netezza, Teradata | 6 Comments |
SAS on Netezza and other Netezza extensibility
I chatted with SAS CTO Keith Collins yesterday about the new SAS/Netezza in-database parallel data mining scoring offering. My impression is that this is very similar to SAS’ current Teradata support, notwithstanding SAS’ and Teradata’s apparent original intention of offering in-database modeling by now as well.
I gather this is a big performance-enhancing deal, just as it is for SPSS or Oracle’s own data mining over Oracle. However, I must confess to not yet understanding why. That is, I don’t know what’s so complicated about data mining scoring algorithms that makes hand-coding them in SQL particularly forbidding. My naive view of data mining is that you do a big regression to get a bunch of weights, and the resulting scoring algorithm is a linear combination of a few dozen variables. Evidently, that’s not quite right.
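For concreteness, here is roughly what that naive view would look like if one did hand-code it: a short Python sketch that turns a set of regression weights into a SQL scoring query. Everything in it is made up for illustration (the `scoring_sql` helper, the `customers` table, the weights), it runs against SQLite rather than Netezza or Teradata, and it covers only the simple linear case, which is presumably not the whole story.

```python
import sqlite3

def scoring_sql(table, intercept, weights):
    """Build a SELECT that scores each row as intercept + sum(weight * column)."""
    terms = [repr(float(intercept))]
    for column, weight in weights.items():
        terms.append(f'({float(weight)!r} * "{column}")')
    return f'SELECT id, {" + ".join(terms)} AS score FROM "{table}"'

def demo():
    """Score two hypothetical customers with hypothetical weights."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, age REAL, visits REAL)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, 40.0, 2.0), (2, 25.0, 10.0)],
    )
    sql = scoring_sql("customers", 1.0, {"age": 0.5, "visits": 2.0})
    return conn.execute(sql + " ORDER BY id").fetchall()
```

Calling `demo()` scores the first row as 1.0 + 0.5 × 40 + 2.0 × 2 = 25.0. A model with a few dozen variables just makes the generated expression longer, which is why the linear case alone doesn't explain the difficulty.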
Anyhow, it turns out that SAS held off on this work until it could be done for TwinFin. That’s largely because TwinFin lets partners write code on Intel CPUs, while previously they had to write in C for Netezza’s FPGAs. I got a similar sense from at least one other Netezza partner as well.
Categories: Data warehouse appliances, Data warehousing, Netezza, Predictive modeling and advanced analytics, SAS Institute | 5 Comments |
Oracle Exadata hybrid columnar compression
Oracle Database 11g Release 2 is out, and as usual I wasn’t briefed, perhaps because Oracle is more scared than its competitors are of hard questions, perhaps for some other reason entirely.* Anyhow, Oracle Database 11g Release 2 contains an Exadata-only feature called hybrid columnar compression. The Oracle Database 11g Release 2 white paper says “data is grouped, ordered, and stored one column at a time.” But Kevin Closson clarifies:
The word hybrid is important.
Rows are still used. They are stored in an object called a Compression Unit. Compression Units can span multiple blocks. Like values are stored in the compression unit with metadata that maps back to the rows.
So, “hybrid” is the word. But, none of that matters as much as the effectiveness. This form of compression is extremely effective.
That sounds a whole lot like PAX. Specifically, in Oracle’s case I would guess “hybrid columnar compression” provides the compression benefits of column stores, but not column stores’ I/O benefits, and also not any kind of in-memory compression. Read more
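To make the PAX comparison concrete, here is a toy Python sketch of that general style of layout: rows are batched into a unit, the unit is stored column by column, and the whole unit is then compressed. This is not Oracle's format; the function names, the JSON serialization, the use of zlib, and the 1000-row unit size are all assumptions chosen to keep the example short.

```python
import json
import zlib

def pack_unit(rows):
    """Pack a batch of rows into one compressed unit, laid out column by column."""
    columns = [list(col) for col in zip(*rows)]  # transpose rows into columns
    payload = json.dumps(columns).encode("utf-8")
    return zlib.compress(payload)

def unpack_unit(blob):
    """Recover the original rows from a compressed unit."""
    columns = json.loads(zlib.decompress(blob).decode("utf-8"))
    return [tuple(values) for values in zip(*columns)]

def pack_table(rows, unit_size=1000):
    """Split a table into fixed-size batches and pack each one separately."""
    return [pack_unit(rows[i:i + unit_size]) for i in range(0, len(rows), unit_size)]
```

Putting like values next to each other tends to help the compressor. But in this toy layout, reading even a single column still means fetching and decompressing the entire unit, which is the sense in which such a scheme can offer column-store compression without column-store I/O savings.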
Categories: Columnar database management, Data warehousing, Database compression, Exadata, Oracle, Theory and architecture | 20 Comments |
Teradata has over 100 appliances in production
I recently wrote that Teradata had gotten serious about appliance product lines, and had non-trivial sales figures for them. In a press release today, Teradata is now explicitly saying (emphasis mine):
Teradata now has more than 100 appliances in production, including the Data Mart Appliance 551, the Data Warehouse Appliance 2550, and the Extreme Data Appliance 1550, which complement the core platform, the Teradata Active Enterprise Data Warehouse 5550.
The breakdowns on that are NDA, and anyhow I can’t find them immediately in my notes.* But if memory serves — while a lot of those appliances are used for test and development, a whole other lot of them are used to do actual production query-answering work. (Edit: Memory turned out to be wrong.) Read more