Data warehousing
Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data.
Calpont update — you read it here first!
Calpont has gone through a lot of strategy iterations since its founding. The super-short version is that Calpont originally planned an appliance built around a SQL chip, much like Kickfire. But after various changes in management and venture backing, Calpont turned itself into a software-only analytic DBMS vendor relying on a MySQL front end. Calpont is now at the stage of announcing an Early Adopter program at the MySQL conference on Wednesday, although details of Calpont’s product release timing, pricing, feature set, etc. are all To Be Determined.
Minor highlights of the Calpont technical story include: Read more
Categories: Calpont, Columnar database management, Data warehousing, MySQL, Open source, Parallelization, Theory and architecture | Leave a Comment |
Infobright update
For the past couple of quarters, Infobright has been MySQL’s partner of choice for larger data warehousing applications. Infobright’s stated business metrics, and I quote, include:
> 50 Customers in 7 Countries
> 25 Partners on 4 continents
A vibrant open source community
+1 million visitors
Approaching 10,000 downloads
2,000 active community participants
These may be compared with analogous metrics Infobright offered in February.
Infobright has also made or promised a variety of technological enhancements. Ones that are either shipping now or promised soon include: Read more
Categories: Columnar database management, Data warehousing, Infobright, MySQL, Open source | 6 Comments |
Introduction to Tokutek
Tokutek has a paradoxical pitch: Tokutek writes data particularly quickly, and therefore you’re supposed to buy Tokutek for query-oriented uses. Highlights of the Tokutek story include:
- Tokutek is a MySQL storage engine.
- MySQL/Tokutek writes indexed data a lot faster than B-tree-based alternatives. (The claim is 10s of 1000s of rows per second on a single server.)
- MySQL/Tokutek reads data at B-tree speeds. (But not, I presume, at the speed of specialized analytic DBMS.)
- Tokutek is not yet ACID-compliant. They’re working on that, but we don’t know what the performance implications will be when they achieve it. ACID compliance won’t come as soon as the May release (Tokutek Version 2.0).
- Tokutek has made one sale. Others are in the pipeline.
Tokutek’s initial target market is the usual combination of clickstream/personalization/other network management. The idea is that many data warehouse technologies have trouble getting latency below, say, 15 seconds to 5 minutes, at least at very high update volumes. So if immediacy is more important than raw complex query performance, Tokutek’s performance profile could be attractive. Read more
Categories: Data warehousing, MySQL, Tokutek and TokuDB, Web analytics | 14 Comments |
Cloudera presents the MapReduce bull case
Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment:
Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.
Tuesday afternoon the dial turned a couple notches more positive yet, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.
Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:
- 2 1/2 petabytes of data managed via Hadoop
- 10 terabytes/day of data ingested via Hadoop (Edit: Some of these metrics have been updated in a subsequent post about Facebook.)
- Ad targeting queries run every 15 minutes in Hadoop
- Dashboard roll-up queries run every hour in Hadoop
- Ad-hoc research/analytic Hadoop queries run whenever
- Anti-fraud analysis done in Hadoop
- Text mining (e.g., of things written on people’s “walls”) done in Hadoop
- 100s or 1000s of simultaneous Hadoop queries
- JSON-based social network analysis in Hadoop
Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.
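For readers who haven't seen the MapReduce pattern, the kind of hourly dashboard roll-up mentioned above can be sketched in a few lines of plain Python. This is a toy, single-process illustration, not Hadoop's API and not Facebook's actual code; the event records and the names `map_event`, `reduce_counts`, and `run_mapreduce` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical clickstream records as (hour, page) pairs. Invented sample data.
EVENTS = [
    ("2009-04-01T10", "/home"),
    ("2009-04-01T10", "/profile"),
    ("2009-04-01T10", "/home"),
    ("2009-04-01T11", "/home"),
]


def map_event(event):
    """Map step: emit one ((hour, page), 1) pair per click."""
    hour, page = event
    yield (hour, page), 1


def reduce_counts(key, values):
    """Reduce step: sum all the counts emitted for one key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, group-by-key (the "shuffle"), and reduce in one process.

    A real framework would spread each phase across many machines;
    here everything happens in memory for illustration only.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


rollup = run_mapreduce(EVENTS, map_event, reduce_counts)
for (hour, page), clicks in rollup.items():
    print(hour, page, clicks)
```

The point of the pattern is that the map and reduce functions carry no shared state, so a framework can run them in parallel over very large inputs. The price is the grouping step in the middle, which is one plausible reason query execution tends to be slower than in a relational DBMS.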
Jeff’s reasons for liking Hadoop over relational DBMS at Facebook included: Read more
Ingres update
I talked with Ingres today. Much of the call was fluff — open-source rah-rah, plus some numbers showing purported success, but so finely parsed as to be pretty meaningless. (To Ingres’ credit, they did offer to let me talk w/ their CFO, even if they offered no promises as to whether he’d offer any more substantive information.) Highlights included: Read more
Categories: Actian and Ingres, Data warehousing, EnterpriseDB and Postgres Plus, MySQL, Open source, Oracle, PostgreSQL, Sybase | 6 Comments |
Lots of analytic DBMS vendors are hiring
After writing about a Twitter jobs page, it occurred to me to check out whether analytic DBMS vendors are still hiring. Based on the Careers pages on their websites, I determined that Aster, Greenplum, Kickfire, and ParAccel all evidently are, in various mixes of (mainly) technical and field positions. At that point I got bored and stopped.
I didn’t choose those vendors entirely at random. If I had to name three vendors who are said to have had small layoffs at some point over the past few quarters, it would be ParAccel, Greenplum, and Kickfire. So if even they are hiring, the analytic DBMS sector is still pretty healthy … or at least thinks it is. 😉
Categories: Aster Data, Data warehousing, Greenplum, Kickfire, ParAccel | 5 Comments |
Somebody is spreading Teradata acquisition rumors again
A mass email from Tom Coffing was forwarded to me today that starts:
I have heard from reliable sources that both HP and SAP have purchased more than 5% of Teradata stock. My sources tell me that both companies appear to be positioning themselves for a bid.
I got my version of the same email from Coffing yesterday with a different introduction but otherwise the same substance (he’s pushing a new product of his). It also had a different From address.
Possible explanations include but are not limited to:
- Coffing knows something (seems unlikely, but I haven’t actually checked www.sec.gov to confirm or disconfirm)
- Coffing thinks he knows something
- Coffing just made this up (I hope not)
- There’s an April Fool’s Day prank going on (not by me — after my bizarre March, I’m recusing myself from April Fool’s pranks this year)
Categories: Data warehousing, HP and Neoview, SAP AG, Teradata | 4 Comments |
Twitter is considering using MapReduce
From a Twitter job listing (formatting mine). The most interesting section is “Additional preferred experience.” Read more
Categories: Analytic technologies, Data warehousing, MapReduce, Specific users, Web analytics | 6 Comments |
Kickfire update
I talked recently with my clients at Kickfire, especially newish CEO Bruce Armstrong. I also visited the Kickfire blog, which among other virtues features a fairly clear overview of Kickfire technology. (I did my own Kickfire overview in October.) Highlights of the current Kickfire story include:
- Kickfire is initially focused on three heavily overlapping markets — network event analysis, the general Web 2.0/clickstream/online marketing analytics area, and MySQL/LAMP data warehousing.
- Kickfire has blogged about a few sales to unnamed customers in those markets.
- I think network management is a market that’s potentially friendly to five-figure-cost appliances. After all, networking equipment is generally sold in appliance form. Kickfire doesn’t dispute this analysis.
- Kickfire’s sales so far are to run databases in the sub-terabyte range, although both Kickfire and its customers intend to run bigger databases soon. (Kickfire describes the range as 300 GB – 1 TB.) Not coincidentally, Kickfire believes that MySQL doesn’t scale very well past 100 GB without a lot of partitioning effort (in the case of data warehouses) or sharding (in the case of OLTP).
- When Bruce became CEO, he let go some sales, marketing, and/or business development folks. He likes to call this a restructuring of Kickfire rather than a reduction-in-force, but anyhow — that’s what happened. There are now about 50 employees, and Kickfire still has most of the $20 million it raised last August in the bank. Edit: The company clarifies that it actually wound up with more sales and marketing people than before.
- Kickfire has thankfully deemphasized various marketing themes I found annoying, such as ascribing great weight to TPC-H benchmarks or explaining why John von Neumann originally made bad choices in his principles of computer design.
Categories: Data warehouse appliances, Data warehousing, Kickfire, MySQL, Open source, Web analytics | 1 Comment |
Oracle introduces a half-rack version of Exadata
Oracle has introduced what amounts to a half-rack Exadata machine. My thoughts on this basically boil down to “makes sense” and “no big deal.” Specifically:
- The new Baby Exadata still holds 10 terabytes or more.
- Most specialty analytic DBMS purchases are still for databases of 10 terabytes or smaller.
- Large enterprise data warehouse projects are often being deferred or cut back due to the economic crunch, but smaller projects with credible, quick ROIs are doing fine.
- Exadata is evidently being sold overwhelmingly to Oracle loyalists. Other analytic DBMS vendors aren’t telling me of serious Exadata competition yet. If the market for Exadata is primarily “happy Oracle data warehouse users”, that’s mainly folks who have <5-10 terabytes of user data today.
- Oracle Exadata beta tests were done on a kind of half-rack configuration anyway.
Categories: Data warehouse appliances, Data warehousing, Exadata, Oracle | Leave a Comment |