Theory and architecture
Analysis of design choices in databases and database management systems. Related subjects include:
- Any subcategory
- Database diversity
- Explicit support for specific data types
- (in Text Technologies) Text search
Notes on columnar/TPC-H compression
I was chatting with Omer Trajman of Vertica, and he said that a 70% compression figure for ParAccel’s recent TPC-H filing sounded about right.* When I noted that seemed kind of low, Omer pointed out that TPC-H data is pseudo-random, while real-life data has much more correlation among the values in different columns. (E.g., in retail, a customer is likely to consistently shop at the same stores and to put similar items into his shopping basket.)
*Omer was involved in Vertica’s TPC-H-data-based load speed benchmark, and is Vertica’s representative to the TPC.
But why does this matter? After all, Vertica compresses one column at a time (unlike, say, Clearpace). Well, the reason is that Vertica — like other column stores — wants to store different columns in the same row order, for obvious benefits in both reading and writing. So, for example, if all the rows that include Gotham City are grouped sequentially, then all the rows mentioning Bruce Wayne are likely to be near each other as well, while none of the rows that mention Clark Kent will be mixed in.
And when a set of consecutive entries has low cardinality, it’s easier to get high levels of compression.
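To make that concrete, here is a minimal sketch of the effect using plain run-length encoding. The toy data, the names, and the choice of run-length encoding are all my assumptions for illustration; this is not a description of Vertica's actual encoders.

```python
import random
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

# Hypothetical toy table: each customer always appears with the same city,
# mimicking the cross-column correlation found in real-life data.
random.seed(0)
cities = ["Gotham City", "Metropolis", "Star City", "Central City"]
customers = {f"customer_{i}": cities[i % len(cities)] for i in range(200)}
rows = [(customers[name], name)
        for name in random.choices(list(customers), k=10_000)]

def total_runs(table):
    """Number of runs needed per column when both columns share one row order."""
    city_runs = run_length_encode([city for city, _ in table])
    customer_runs = run_length_encode([customer for _, customer in table])
    return len(city_runs), len(customer_runs)

unsorted_runs = total_runs(rows)
sorted_runs = total_runs(sorted(rows))  # sort by city, then customer

print("runs (city, customer), arrival order:", unsorted_runs)
print("runs (city, customer), sorted order: ", sorted_runs)
```

Because the two columns are correlated, the single sort order that groups the city column also groups the customer column, so both columns collapse into a small number of runs. With pseudo-random data of the TPC-H kind, sorting on one column would do little for the others.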
| Categories: Benchmarks and POCs, Columnar database management, Data warehousing, Database compression, Vertica Systems | Leave a Comment |
NoSQL?
Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:
- In most cases, no.
- Most of these technologies are designed for simple, high-volume OLTP (OnLine Transaction Processing). Most large enterprises have an established way of doing OLTP, probably via relational database management systems. Why change?
- MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS.
- There’s one big countervailing factor to all these generalities — schema flexibility.
As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL.
Correction to a recent quote
I’m quoted in a recent article around Aster’s appliance announcement as saying data warehouse appliances are more suitable for small workgroups of analysts crunching small amounts of data than they are for other uses.
But that’s not what I think at all.
I do think the ease-of-administration pitch for appliances makes them particularly well suited for users who want to scrape by without doing much database administration. This is especially appealing to departments or smaller enterprises. And the first/best scenario that comes to mind is indeed a small team of analysts, with good SQL skills but lightweight DBA experience, although Netezza has proved that many other kinds of users can find appliances appealing as well.
But that small team of analysts may maintain the largest database in the firm.
And by the way — notwithstanding the MySpace counterexample, most of Aster’s initial customers had <10 terabyte databases, and I think indeed <5 terabyte. The “frontline” pitch succeeded for Aster before (MySpace again aside) any better-big-data-crunching story did.
| Categories: Analytic technologies, Aster Data, Data warehouse appliances, Data warehousing, Theory and architecture | Leave a Comment |
Xtreme Data readies a different kind of FPGA-based data warehouse appliance
Xtreme Data called me to talk about its plans in the data warehouse appliance business, almost all details of which are currently embargoed. Still, a few points may be worth noting ahead of more precise information, namely:
- Xtreme Data’s basic idea is to take a custom board and build a data warehouse appliance around it.
- An Xtreme Data board looks a lot like a conventional two-socket board, but has only one four-core CPU. In addition, it sports some FPGAs (Field-Programmable Gate Arrays).
- In the Xtreme Data appliance, the FPGAs will be used for core SQL processing, after the data is ingested via conventional I/O. This is different from Netezza’s approach to FPGA-based data warehouse appliances, in which the FPGA sits in the place of a disk controller and touches the data first, before passing it off to a more or less conventional CPU.
- While preparing entry into the data warehouse appliance business, Xtreme Data has sold its board to 150 other outfits, many quite impressive. Buyers seem to be FPGA users who previously had to craft their own custom boards. According to Xtreme Data, major uses by these customers include:
- Military/intelligence/digital signal processing.
- Military/intelligence/cybersecurity (a newish area for Xtreme Data)
- Bioinformatics/high-throughput gene sequencing (a “handful” of customers)
- Medical imaging
- More or less pure university research of various sorts (around 50 customers)
- … but not database management.
- Xtreme Data’s website has a non-obvious URL.
So far as I can tell, Xtreme Data’s 1.0 product will, like most other 1.0 analytic database management products, be focused on price/performance, with little or no positive differentiation in the way of features.
| Categories: Data warehouse appliances, Data warehousing, Netezza, Theory and architecture | 6 Comments |
Aster Data enters the appliance game
Aster Data is rolling out a line of nCluster appliances today. Highlights include:
- Configurations ranging from 6.25 terabytes to 1 petabyte of user data. (Edit: Here’s the up-to-date data sheet.)
- A $50K “Express Edition” price for <1 terabyte of user data. Unfortunately, that’s the only stated price.
- The option of bundled MicroStrategy.
- “MapReduce” in the name, which suggests something about the positioning, i.e., enterprise decision support, rather than Aster’s usual web/“frontline” emphasis. (Edit: That also fits with Aster’s recent MapReduce-for-.NET announcement.) (Edit: Actual name is Aster MapReduce Data Warehouse Appliance.)
- Claims that because Aster runs effectively on cheaper, more truly “commodity” hardware than competitors, you get more hardware bang for the buck if you buy from Aster.
I don’t have a lot more to add right now, mainly because I wrote at some length about Aster’s non-appliance-specific, non-MapReduce technology and positioning a couple of weeks ago.
| Categories: Analytic technologies, Aster Data, Business intelligence, Data warehouse appliances, Data warehousing, Database compression, MapReduce, Pricing | 16 Comments |
The TPC-H benchmark is a blight upon the industry
ParAccel has released a 30,000-gigabyte TPC-H benchmark, and no less a sage than Merv Adrian paid attention. Now, the TPCs may have had some use in the 1990s. Indeed, Merv was my analyst relations contact for a visit to my clients at Sybase around the time (1996 or so) I was advising Sybase on how to market against its poor benchmark results. But TPCs are worthless today.
It’s not just that TPCs are highly tuned (ParAccel’s claim of “load-and-go” is laughable). (Edit: Looking at Appendix A of the full disclosure report, maybe it’s more justified than I thought.) It’s also not just that different analytic database management products perform very differently on different workloads, making the TPC-H not much of an indicator of anything real-life. The biggest problem is: Most TPC benchmarks are run on absurdly unrealistic hardware configurations.
For example, if you look at some details, the ParAccel 30-terabyte benchmark ran on 43 nodes, each with 64 gigabytes of RAM and 24 terabytes of disk. That’s 961,124.9 gigabytes of disk, officially, for a 32:1 disk/data ratio. By way of contrast, real-life analytic DBMS with good compression often have disk/data ratios of well under 1:1.
Meanwhile, the RAM:data ratio is around 1:11. It’s clear that ParAccel’s early TPC-H benchmarks ran entirely in RAM; indeed, ParAccel even admits that. And so I conjecture that ParAccel’s latest TPC-H benchmark ran (almost) entirely in RAM as well. Once again, this would illustrate that the TPC-H is irrelevant to judging an analytic DBMS’ real world performance.
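For anybody who wants to check the ratios, here is the arithmetic as a short sketch. The inputs are the figures quoted above; the variable names are mine.

```python
# Figures quoted above for the ParAccel 30-terabyte TPC-H run.
nodes = 43
ram_gb_per_node = 64
reported_disk_gb = 961_124.9   # the "official" total disk figure
data_gb = 30_000               # benchmark scale: 30,000 gigabytes

disk_to_data = reported_disk_gb / data_gb
total_ram_gb = nodes * ram_gb_per_node
data_to_ram = data_gb / total_ram_gb

print(f"disk:data ratio is about {disk_to_data:.1f}:1")
print(f"total RAM is {total_ram_gb} GB, so RAM:data is about 1:{data_to_ram:.1f}")
```

That works out to roughly 32:1 for disk and roughly 1:11 for RAM, before any compression of the data is taken into account.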
More generally — I would not advise anybody to consider ParAccel’s product, for any use, except after a proof-of-concept in which ParAccel was not given the time and opportunity to perform extensive off-site tuning. I tend to feel that way about all analytic DBMS, but it’s a particular concern in the case of ParAccel.
| Categories: Analytic technologies, Benchmarks and POCs, Buying processes, Columnar database management, Data warehousing, Database compression, ParAccel | 85 Comments |
Aster Data on parallelism
Aster Data’s core claim boils down to “We do parallelism better.” Aster has shied away from saying that for marketing purposes, for fear of the response “Yeah, right, everybody says that.” But when I talked with Mayank Bawa, Steve Wooledge, et al. yesterday, I focused discussions on just that point. Based on that chat and others before, here are some highlights (as I understand them) of what Aster claims, believes, or believes to be differentiated about its nCluster technology:
| Categories: Analytic technologies, Aster Data, Data warehousing, MapReduce, Parallelization, Theory and architecture | 3 Comments |
An example of what’s wrong with big vendors’ approaches to BI (SAP in this case)
I just found Chris Kanaracus’ article about SAP’s rollout last month of its “clear enterprises” strategy. The money quote comes from Sara Lee, the user SAP seems to have trotted out:
But Sara Lee has not yet decided to purchase the software, and there are substantial underlying tasks to perform as well, he added.
“This is giving us the horsepower [to analyze data] but we need to have harmonized and structured data underneath it.”
This is from the leading test user of the product?
Business intelligence and the associated data management processes need to be reimagined, and I’m increasingly coming to suspect that the big BI conglomerates aren’t up to the task.
| Categories: Analytic technologies, Business intelligence, SAP AG, Specific users, Theory and architecture | Leave a Comment |
Google Fusion Tables
Google has announced an experimental cloud-based data management system called Fusion Tables. A press article and Slashdot thread ensued, based on some bizarre-sounding analyst quotes that I will not attempt to parse.
What Fusion Tables really seems to be is a spreadsheet without the formulae. That is, it’s a place to dump data in a grid of cells, comment on it, version it, and do elementary data manipulation. This could, I guess, be useful as an alternative to traditional RDBMS — assuming, of course, that you want to have a row-by-row debate about 100 megs of data.
Seriously, while Google Fusion Tables bears some vague resemblance to what I’m thinking about for the future of both business intelligence and data marts, it sounds as if it has a long way to go before it’s something most enterprises should spend time looking at.
| Categories: Analytic technologies, Google, Theory and architecture | 1 Comment |
Per-terabyte pricing
Software-only DBMS vendors sometimes price per terabyte of user data. Vertica’s list price is $100K/TB. Greenplum’s list price is $70K/TB. In practice, both offer substantial discounts, especially at higher volumes. In both cases, this means raw data, uncompressed, without counting indexes or temp space.
Client experience teaches me that this definition is easy to forget, so let me reemphasize the key point:
Per-terabyte pricing is based on a calculated figure. Per-terabyte pricing is not based on the current disk space used by your database when managed by the DBMS you are replacing.
There’s at least one important difference in how Vertica and Greenplum calculate database size. No matter how many times you copy the data, Vertica only charges you for it once.* But if you spin out data marts and recopy data into it — as Greenplum rightly encourages you to do — Greenplum wants to be paid for each copy. Similarly, Vertica charges only for deployment, and not for test or development; I didn’t remember to ask what Greenplum’s policies are in those regards. (Edit: Greenplum says in a comment below that it doesn’t charge for test or development data either.)
*That policy is a great fit with Vertica’s performance recommendation that you should store columns in different sort orders, perhaps an average of two copies per column.
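Here is a minimal sketch of how the two copy policies play out. The list prices are the ones quoted above; the database size, the number of copies, and the function itself are hypothetical, and the sketch ignores the discounts both vendors offer in practice.

```python
def license_cost(raw_tb, price_per_tb, copies=1, charge_per_copy=False):
    """List-price cost for raw_tb terabytes of raw, uncompressed user data.

    copies counts how many times the data is stored in total (1 = no extra
    copies). charge_per_copy models a policy that bills every copy; with
    False, the data is billed once no matter how often it is copied.
    """
    billable_tb = raw_tb * (copies if charge_per_copy else 1)
    return billable_tb * price_per_tb

VERTICA_LIST = 100_000    # $/TB list price, from the text above
GREENPLUM_LIST = 70_000   # $/TB list price, from the text above

raw_tb = 10   # hypothetical database size
copies = 2    # hypothetical: the warehouse plus one fully recopied data mart

vertica = license_cost(raw_tb, VERTICA_LIST, copies, charge_per_copy=False)
greenplum = license_cost(raw_tb, GREENPLUM_LIST, copies, charge_per_copy=True)

print(f"Vertica list:   ${vertica:,.0f}")
print(f"Greenplum list: ${greenplum:,.0f}")
```

In this made-up case the lower per-terabyte price ends up costing more once the data is copied in full. Real data marts usually recopy only part of the data, so treat this as an upper bound on the effect.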
| Categories: Columnar database management, Data warehousing, Greenplum, Pricing, Vertica Systems | 4 Comments |
Merv Adrian on SAND Technology
Merv Adrian blogged about SAND Technology, casting significant doubt on SAND’s business prospects. At this point, I can’t say I disagree. On the other hand, SAND does have public, audited financial statements showing it generating more revenue than a lot of other analytic DBMS or archiving vendors probably make. Columnar DBMS vendors doing better than SAND are Sybase, Vertica, maybe Infobright — and who else?
| Categories: Archiving and information preservation, Columnar database management, Data warehousing, SAND Technology | 1 Comment |
Daniel Abadi on Kickfire and related subjects
Daniel Abadi has a new blog, whose first post centers around Kickfire. The money quote is (emphasis mine):
In order for me to get excited about Kickfire, I have to ignore Mike Stonebraker’s voice in my head telling me that DBMS hardware companies have been launched many times in the past and ALWAYS fail (the main reasoning is that Moore’s law allows for commodity hardware to catch up in performance, eventually making the proprietary hardware overpriced and irrelevant). But given that Moore’s law is transforming into increased parallelism rather than increased raw speed, maybe hardware DBMS companies can succeed now where they have failed in the past.
Good point.
More generally, Abadi speculates about the market for MySQL-compatible data warehousing. My responses include:
- OF COURSE there are many MySQL users who need to move to a serious analytic DBMS.
- What’s less clear is whether there’s any big advantage to those users in remaining MySQL-compatible when they do move. I’m not sure what MySQL-specific syntax or optimizations they’d have that would be difficult to port to a non-MySQL system.
- It’s nice to see Abadi speaking well of Infobright and its technology.
- To say that Infobright went open source because it was “desperate” is overstated. That said, I don’t think Infobright was on track to prosper without going open source.
- While open source and MySQL go together, an appliance like Kickfire loses many (not all) of the benefits of open source.
- Calpont has indeed never disclosed a customer win. Any year now … (Just kidding, Vogel!)
- In general, seeing Abadi be so favorable toward Vertica competitors adds credibility to the recent Hadoop vs. DBMS paper.
Anyhow, as previously noted, I’m a big Daniel Abadi fan. I look forward to seeing what else he posts in his blog, and am optimistic he’ll live up to or exceed its stated goals.
| Categories: Calpont, Columnar database management, DBMS product categories, Data warehouse appliances, Data warehousing, Infobright, Kickfire, MySQL, Open source, Theory and architecture | 2 Comments |
Greenplum update — Release 3.3 and so on
I visited Greenplum in early April, and talked with them again last night. As I noted in a separate post, there are a couple of subjects I won’t write about today. But that still leaves me free to cover a number of other points about Greenplum, including:
| Categories: Data warehousing, Database compression, EAI, EII, ETL, ELT, ETLT, Greenplum, MapReduce, Parallelization, PostgreSQL | 9 Comments |
Sneakernet to the cloud
Recently, Amazon CTO Werner Vogels put up a blog post which suggested that, now and in the future, the best way to get large databases into the cloud is via sneakernet. In some circumstances, he is surely right. Possible implications include:
- When sending data to the cloud, you probably want to compress it to the max before sending. Clearpace’s new RainStor structured-data archiving service emphasizes that idea. RainStor marketing says cloud, cloud, cloud — but Clearpace thinks you really should have a bit of its software onsite too, to compress the data before sending it across the wire.
- Getting data from one cloud to another cloud could be problematic. I’m fond of saying that weblog data naturally lives in the cloud at your hosting company’s location, so you should analyze it there too. But this makes the most sense if you analyze it or at least filter/reduce it in place. (That said, the really, really big web companies have lots of different data centers, and presumably do move huge amounts of log data from place to place.)
But for one-time moves of data sets, sure, sneakernet/snail mail should work just fine.
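A back-of-the-envelope sketch shows why. Every number below (data size, link speed, compression ratio) is a hypothetical input of mine, not a figure from Vogels’ post or from Clearpace.

```python
def network_days(data_tb, link_mbps, compression_ratio=1.0):
    """Days needed to push data_tb decimal terabytes over a link that
    sustains link_mbps megabits/second, after compressing the data by
    compression_ratio (e.g. 6.0 means 6X smaller)."""
    bits_to_send = data_tb * 1e12 * 8 / compression_ratio
    seconds = bits_to_send / (link_mbps * 1e6)
    return seconds / 86_400

# Hypothetical scenario: 100 TB over a fully available 100 Mbps link.
uncompressed = network_days(100, 100)
compressed = network_days(100, 100, compression_ratio=6.0)

print(f"uncompressed:  {uncompressed:.1f} days on the wire")
print(f"6X compressed: {compressed:.1f} days on the wire")
```

Under those assumptions the transfer takes about three months uncompressed and still over two weeks at 6X compression, which is why shipping disks can win for a one-time move, and why compressing first matters either way.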
| Categories: Amazon and its cloud, Cloud computing, Database compression, EAI, EII, ETL, ELT, ETLT, Web analytics | 1 Comment |
Teradata Developer Exchange (DevX) begins to emerge
Every vendor needs developer-facing web resources, and Teradata turns out to have been working on a new umbrella site for its own. It’s called Teradata Developer Exchange, or DevX for short. Teradata DevX seems to be in a low-volume beta now, with a press release/bigger roll-out coming next week or so. Major elements are about what one would expect:
- Articles
- Blogs
- Downloads
- Surprisingly, so far as I can tell, no forums
If you’re a Teradata user, you absolutely should check out Teradata DevX. If you just research Teradata (my situation), there are some aspects that might be of interest anyway. In particular, I found Teradata’s downloads instructive, most particularly those in the area of extensibility. Mainly, these are UDFs (User-Defined Functions), in areas such as:
- Compression
- Geospatial data
- Imitating Oracle or DB2 UDFs (as migration aids)
Also of potential interest is a custom-portlet framework for Teradata’s management tool Viewpoint. A straightforward use would be to plunk some Viewpoint data into a more general system management dashboard. A yet cooler use — and I couldn’t get a clear sense of whether anybody’s ever done this yet — would be to offer end users some insight as to how long their queries are apt to run.
| Categories: Database compression, Emulation, transparency, portability, GIS and geospatial, Teradata | 2 Comments |
MySQL forking heats up, but not yet to the benefit of non-GPLed storage engine vendors
Last month, I wrote “This is a REALLY good time to actively strengthen the MySQL forkers,” largely on behalf of closed-source/dual-source MySQL storage engine vendors such as Infobright, Kickfire, Calpont, Tokutek, or ScaleDB. Yesterday, two of my three candidates to lead the effort — namely Monty Widenius/MariaDB/Monty Program AB and Percona — came together to form something called the Open Database Alliance. Details may be found:
- On the Open Database Alliance website
- In a press release
- On Monty Widenius’ blog
- In a Stephen O’Grady blog post based on a discussion with Monty Widenius
- In an ars technica blog post based on a discussion with Monty Program AB’s Kurt von Finck
But there’s no joy for the non-GPLed MySQL storage engine vendors in the early news.
| Categories: MySQL, Open source, Theory and architecture | 16 Comments |
Facebook’s experiences with compression
One little topic didn’t make it into my long post on Facebook’s Hadoop/Hive-based data warehouse: Compression. The story seems to be:
- Facebook uses gzip, and gets a little bit more than 6X compression.
- Experiments suggest bzip2 would reduce data by another 20% or so, increasing compression to the 7.5X range.
- The downside of bzip2 is 15-25% processing overhead, depending on the kind of data.
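The trade-off is easy to try at small scale with Python’s standard gzip and bz2 modules. The log-like sample data below is made up, so the ratios and timings it produces say nothing about Facebook’s data; the last lines just check the arithmetic behind the 6X and 7.5X figures.

```python
import bz2
import gzip
import random
import time

# Hypothetical log-like sample data; real results depend heavily on the data.
random.seed(0)
actions = ["view", "click", "share"]
lines = [
    f"2009-05-{random.randint(1, 28):02d} user={random.randint(1, 5000)} "
    f"action={random.choice(actions)} page=/p/{random.randint(1, 300)}\n"
    for _ in range(50_000)
]
raw = "".join(lines).encode("utf-8")

def measure(name, compress):
    """Return (name, compression ratio, seconds spent compressing)."""
    start = time.perf_counter()
    packed = compress(raw)
    elapsed = time.perf_counter() - start
    return name, len(raw) / len(packed), elapsed

results = [measure("gzip", gzip.compress), measure("bzip2", bz2.compress)]
for name, ratio, seconds in results:
    print(f"{name:5s} ratio {ratio:4.1f}X  time {seconds:.3f}s")

# Shrinking already-6X-compressed data by a further 20% gives 6 / 0.8 = 7.5X.
gzip_ratio = 6.0
bzip2_ratio = gzip_ratio / (1 - 0.20)
print(f"6X with another 20% reduction is {bzip2_ratio:.1f}X")
```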
| Categories: Data warehousing, Database compression, Facebook and Cassandra, Hadoop | 2 Comments |
The secret sauce to Clearpace’s compression
In an introduction to archiving vendor Clearpace last December, I noted that Clearpace claimed huge compression successes for its NParchive product (Clearpace likes to use a figure of 40X), but didn’t give much reason that NParchive could compress a lot more effectively than other columnar DBMS. Let me now follow up on that.
To the extent there’s a Clearpace secret sauce, it seems to lie in NParchive’s unusual data access method. NParchive doesn’t just tokenize the values in individual columns; it tokenizes multi-column fragments of rows. Which particular columns to group together in that way seems to be decided automagically; the obvious guess is that this is based on estimates of the cardinality of their Cartesian products.
Off the top of my head, examples for which this strategy might be particularly successful include:
- Denormalized databases
- Message stores with lots of header information
- Addresses
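To illustrate the idea, here is a minimal sketch of tokenizing multi-column fragments, with column groups scored by comparing the observed number of distinct fragments against the size of the full Cartesian product. This follows my guess above; it is not Clearpace’s actual algorithm, and the data and function names are hypothetical.

```python
def cardinality(rows, columns):
    """Number of distinct value combinations observed for the given columns."""
    return len({tuple(row[c] for c in columns) for row in rows})

def correlation_score(rows, columns):
    """Observed distinct fragments divided by the Cartesian-product size.
    Values near 0 mean the columns are highly correlated, so grouping
    them into one token is attractive."""
    worst_case = 1
    for column in columns:
        worst_case *= cardinality(rows, [column])
    return cardinality(rows, columns) / worst_case

def tokenize(rows, columns):
    """Replace each multi-column fragment with a small integer token."""
    dictionary = {}
    tokens = []
    for row in rows:
        fragment = tuple(row[c] for c in columns)
        tokens.append(dictionary.setdefault(fragment, len(dictionary)))
    return dictionary, tokens

# Hypothetical address-like data: state is fully determined by city,
# while product varies independently of both.
places = [("Gotham City", "NJ"), ("Metropolis", "NY"),
          ("Star City", "CA"), ("Central City", "MO")]
products = ["cape", "cowl", "boots"]
rows = [
    {"city": places[i % 4][0], "state": places[i % 4][1],
     "product": products[(i // 4) % 3]}
    for i in range(1200)
]

print("city+state score:  ", correlation_score(rows, ["city", "state"]))
print("city+product score:", correlation_score(rows, ["city", "product"]))

dictionary, tokens = tokenize(rows, ["city", "state"])
print(len(dictionary), "fragment tokens cover", len(tokens), "rows")
```

Here city and state share 4 distinct fragments out of a possible 16, so they are a good group; city and product hit all 12 combinations, so grouping them buys nothing.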
| Categories: Archiving and information preservation, Clearpace, Columnar database management, Database compression | 7 Comments |
Facebook, Hadoop, and Hive
A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.
Updating the metrics in my Cloudera post,
- Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly better than 6:1 overall compression ratio. So the 2 1/2 petabytes figure for user data is reasonable.
- Facebook’s Hadoop/Hive system ingests 15 terabytes of new data per day now, not 10.
- Hadoop/Hive cycle times aren’t as fast as I thought I heard from Jeff. Ad targeting queries are the most frequent, and they’re run hourly. Dashboards are repopulated daily.
Nothing else in my Cloudera post was called out as being wrong.
In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters.
| Categories: Data warehousing, EAI, EII, ETL, ELT, ETLT, Facebook and Cassandra, Hadoop, MapReduce, Parallelization, Petabyte-scale data management, Specific users, Web analytics, Yahoo | 30 Comments |
eBay’s two enormous data warehouses
A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I’ve already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn’t like) and the astonishingly great difference between high- and low-end disk drives (to which eBay clued me in). Now I’m finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.
Metrics on eBay’s main Teradata data warehouse include:
- >2 petabytes of user data
- 10s of 1000s of users
- Millions of queries per day
- 72 nodes
- >140 GB/sec of I/O, or 2 GB/node/sec, or maybe that’s a peak when the workload is scan-heavy
- 100s of production databases being fed in
Metrics on eBay’s Greenplum data warehouse (or, if you like, data mart) include:
- 6 1/2 petabytes of user data
- 17 trillion records
- 150 billion new records/day, which seems to suggest an ingest rate well over 50 terabytes/day
- 96 nodes
- 200 MB/node/sec of I/O (that’s the order of magnitude difference that triggered my post on disk drives)
- 4.5 petabytes of storage
- 70% compression
- A small number of concurrent users
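The derived figures above can be checked with a few lines of arithmetic. This sketch assumes decimal units and that new records are about the same size as the average stored record; neither assumption came from eBay.

```python
TB = 1000 ** 4   # decimal terabyte, an assumption
PB = 1000 ** 5   # decimal petabyte, an assumption

# Greenplum system: implied record size and daily ingest volume.
user_data_bytes = 6.5 * PB
total_records = 17e12
new_records_per_day = 150e9

bytes_per_record = user_data_bytes / total_records
ingest_tb_per_day = bytes_per_record * new_records_per_day / TB

# I/O per node on each system, in GB/sec.
teradata_gb_per_node = 140 / 72     # >140 GB/sec across 72 nodes
greenplum_gb_per_node = 200 / 1000  # 200 MB/node/sec
greenplum_total_gb = greenplum_gb_per_node * 96

print(f"average record: about {bytes_per_record:.0f} bytes")
print(f"implied ingest: about {ingest_tb_per_day:.0f} TB/day")
print(f"Teradata:  {teradata_gb_per_node:.2f} GB/node/sec")
print(f"Greenplum: {greenplum_gb_per_node:.2f} GB/node/sec, "
      f"{greenplum_total_gb:.1f} GB/sec in total")
```

The implied ingest rate lands in the high 50s of terabytes per day, consistent with “well over 50,” and the per-node I/O gap between the two systems is close to a factor of 10.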
| Categories: Analytic technologies, Data warehouse appliances, Data warehousing, Greenplum, Petabyte-scale data management, Teradata, Web analytics, eBay | 20 Comments |
Some DB2 highlights
I chatted with IBM Thursday, about recent and imminent releases of DB2 (9.5 through 9.7). Highlights included:
- DB2 is getting Oracle emulation, which I posted about separately.
- IBM says that it had >50 new DB2 data warehouse customers last year. I neglected to ask how many of these had been general-purpose DB2 customers all along.
- By “data warehouse customer” I mean a user for InfoSphere Warehouse, which previously was called DB2’s DPF (Data Partitioning Feature). Apparently, this includes both logical and physical partitioning. E.g., DB2 isn’t shared-nothing without this feature.
- IBM is proud of DB2’s compression, which it claims commonly reaches 70-80%. It calls this “industry-leading” in comparison to Oracle, SQL Server, and other general-purpose relational DBMS.
- DB2 compression’s overall effect on performance stems from a trade-off between I/O (lessened) and CPU burden (increased). For OLTP workloads, this is about a wash. For data warehousing workloads, IBM says 20% performance improvement from compression is average.
- DB2 now has its version of one of my favorite Oracle security features, called Label Based Access Control. A label-control feature can make it much easier to secure data on a row-by-row, value-by-value basis. The obvious big user is national intelligence, followed by financial services. IBM says the health care industry also has interest in LBAC.
- Also in the security area, IBM reworked DB2’s audit feature for 9.5.
- I think what I heard in our discussion of DB2 virtualization is:
- Increasingly, IBM is seeing production use of VMware, rather than just test/development.
- IBM believes it is a much closer partner to VMware than Oracle or Microsoft is, because it’s not pushing its own competing technology.
- Generally, virtualization is more important for OLTP workloads than data warehousing ones, because OLTP apps commonly only need part of the resources of a node while data warehousing often wants the whole node.
- AIX data warehousing is an exception. I think this is because AIX equates to big SMP boxes, and virtualization lets you spread out the data warehousing processing across more nodes, with the usual parallel I/O benefits.
- When IBM talks of new autonomic/self-tuning features in DB2, they’re used mainly for databases under 1 terabyte in size. Indeed, the self-tuning feature set doesn’t work with InfoSphere Warehouse.
- Even with the self-tuning feature it sounds as if you need at least a couple of DBA hours per instance per week, on average.
- DB2 on Linux/Unix/Windows has introduced some enhanced workload management features analogous to those long found in mainframe DB2. For example, resource allocation rules can be scheduled by time. (The point of workload management is to allocate resources such as CPU or I/O among the simultaneous queries or other tasks that contend for them.) Workload management rules can have thresholds for amounts of resources consumed, after which the priority for a task can go up (“Get it over with!”) or down (“Stop hogging my system!”).
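As a generic illustration of the threshold idea in that last point, here is a minimal sketch of a time-scheduled rule that changes a task’s priority once it has consumed too much CPU. The class names, fields, and numbers are all hypothetical; this is not DB2’s workload management syntax or API.

```python
from dataclasses import dataclass

@dataclass
class ThresholdRule:
    cpu_seconds_limit: float            # threshold on resources consumed
    priority_change: int                # +1 "Get it over with!", -1 "Stop hogging my system!"
    active_hours: range = range(0, 24)  # hours of the day when the rule applies

@dataclass
class Task:
    name: str
    priority: int
    cpu_seconds_used: float = 0.0
    rule_applied: bool = False

def apply_rule(task, rule, hour):
    """Adjust the task's priority once, if the rule is scheduled for this
    hour and the task has crossed the resource threshold."""
    over_limit = task.cpu_seconds_used > rule.cpu_seconds_limit
    if hour in rule.active_hours and over_limit and not task.rule_applied:
        task.priority += rule.priority_change
        task.rule_applied = True
    return task.priority

# Hypothetical policy: during business hours, demote long-running queries.
daytime_demote = ThresholdRule(cpu_seconds_limit=60, priority_change=-1,
                               active_hours=range(8, 18))

report = Task("monthly_report", priority=5, cpu_seconds_used=300)
print(apply_rule(report, daytime_demote, hour=23))  # outside the schedule: unchanged
print(apply_rule(report, daytime_demote, hour=10))  # inside the schedule: demoted
```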
| Categories: Application areas, Data warehousing, Database compression, IBM and DB2, Market share, OLTP, Parallelization | 2 Comments |
Clearing some of my buffer
I have a large number of posts still in backlog. For starters, there are ones based on recent visits with Aster, Greenplum, Sybase, Vertica, and a Very Large User. I suspect I’ll write more soon on Oracle as well. Plus there’s my whole future-of-online-media area. And quite a bit more will grow out of planned research.
So there are a whole lot of other worthy subjects I doubt I’ll be getting to any time soon. In some cases, of course, other people are doing great jobs of writing about same. Here are pointers to a few links that I am glad to recommend:
- I wrote recently that I’ve discovered a number of different in-memory OLAP engines. Cindi Howson far outdid that, writing at length for Intelligent Enterprise on in-memory analytics, in an article that seems to itself be a teaser for a longer, free white paper on the subject.
- CouchDB posted an eye-catching, risque slide presentation promoting CouchDB and, more generally, key-value stores, at least for internet applications. And yes, they’ve integrated MapReduce.
- Merv Adrian posted favorably about Birst, with special reference to its OEM efforts. As previously noted, I was highly unimpressed with Birst’s end-user BI story at the time of its September roll-out, and Jerome Pineau’s recent examination did nothing to reassure me. But perhaps OEM is a different matter.
- Merv also offers an interesting post about data integration upstart Expressor, and a highly favorable one about “visualization” vendor Tableau.
- Ann All interviewed Nigel Pendse, who grumped that BI features are overrated, and what end users really want is great query performance. I’m not so sure about the features side of that, but I’m hugely in agreement about the performance. That’s a big part of why the analytic DBMS industry is so vibrant. It’s also why in-memory OLAP is suddenly so hot.
MySQL storage engine round-up, with Oracle-related thoughts
Here’s what I know about MySQL storage engines, more or less.
- MySQL with MyISAM is fast. But it’s not transactional. Except for limited purposes, MySQL with MyISAM is a pretty crummy DBMS. Nothing can change that.
- MySQL with InnoDB is transactional. But it’s not particularly fast. MySQL with InnoDB is a pretty mediocre DBMS. Oracle could fix that, at least partially, over time.
- I don’t know much about Falcon, Maria, and so on. With Oracle winding up owning both MySQL and InnoDB, the motivation for those engines (except as Oracle-free forks) might fade.
- Infobright is the most established of the rest. At the moment I’m not recommending it for most industrial-strength uses unless the user is particularly cash-constrained. But I wouldn’t be surprised if that changed soon. A cheap, fast, simple columnar analytic DBMS has a place in the world.
- Kickfire is next in line, offering a hardware-based growth path for users who’ve maxed out on what unaided MySQL can do. It remains to be seen for how many users the desire to keep things simple and stay with MySQL outweighs the desire to avoid custom hardware. Having Oracle salespeople all over those accounts surely wouldn’t help. Kickfire also has a second market, namely OEM vendors who are mainly interested in the superfast chip. That would probably be pretty unaffected by Oracle.
- Tokutek offers a technical proposition that’s hard to match head-on without going the CEP route. Users who care are likely to be MySQL shops. Tokutek’s main challenge is to prove that it sufficiently outdoes competing technical strategies for sufficiently many users. Oracle ownership of MySQL seems pretty irrelevant to Tokutek’s success or failure.
- Calpont offers a kind of lightweight Exadata alternative. With Calpont’s packaging and positioning perennially unclear, it’s difficult to predict the effect of a particular change — i.e., Oracle buying MySQL — in Calpont’s market environment.
- I haven’t heard from transactionally-oriented ScaleDB since I wrote about them a year ago. Apparently, they’re rolling out beta product this week, and their venerable techie guru sadly passed away earlier this month.
| Categories: Calpont, Columnar database management, Data warehousing, Exadata, Infobright, Kickfire, MySQL, Open source, Oracle, Tokutek | 13 Comments |
Calpont update — you read it here first!
Calpont has gone through a lot of strategy iterations since its founding. The super-short version is that Calpont originally planned an appliance built around a SQL chip, much like Kickfire. But after various changes in management and venture backing, Calpont turned itself into a software-only analytic DBMS vendor relying on a MySQL front end. Calpont is now at the stage of announcing an Early Adopter program at the MySQL conference on Wednesday, although details of Calpont’s product release timing, pricing, feature set, etc. are all To Be Determined.
Minor highlights of the Calpont technical story include:
| Categories: Calpont, Columnar database management, Data warehousing, MySQL, Open source, Parallelization, Theory and architecture | Leave a Comment |
Infobright update
For the past couple of quarters, Infobright has been MySQL’s partner of choice for larger data warehousing applications. Infobright’s stated business metrics, and I quote, include:
> 50 Customers in 7 Countries
> 25 Partners on 4 continents
A vibrant open source community
+1 million visitors
Approaching 10,000 downloads
2,000 active community participants
These may be compared with analogous metrics Infobright offered in February.
Infobright has also made or promised a variety of technological enhancements. Ones that are either shipping now or promised soon include:
| Categories: Columnar database management, Data warehousing, Infobright, MySQL, Open source | 4 Comments |
