Data warehousing
Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data.
Greenplum blogs about some customers
I’ve written some about Greenplum’s customers at eBay and Fox Interactive Media. But as I recently grumped, I’m not in the mood right now to write much about other Greenplum customers. Fortunately, Greenplum has filled the gap itself. Marketing chief Paul Salazar just blogged about a number of other big Greenplum customers. And last month Paul blogged in considerable detail about what he characterizes as an enterprise data warehouse (EDW) conversion — Oracle replacement — at a large pharmaceutical company.
Categories: Application areas, Data warehousing, Greenplum, Oracle | Leave a Comment |
The future of data marts
Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC). Key observations around the concept — mixing mine and Greenplum’s together — include:
- Data marts aren’t just for performance (or price/performance). They also exist to give individual analysts or small teams control of their analytic destiny.
- Thus, it would be really cool if business users could have their own analytic “sandboxes” — virtual or physical analytic databases that they can manipulate without breaking anything else.
- In any case, business users want to analyze data when they want to analyze it. It is often unwise to ask business users to postpone analysis until after an enterprise data model can be extended to fully incorporate the new data they want to look at.
- Whether or not you agree with that, it’s an empirical fact that enterprises have many legacy data marts (or even, especially due to M&A, multiple legacy data warehouses). Similarly, it’s an empirical fact that many business users have the clout to order up new data marts as well.
- Consolidating data marts onto one common technological platform has important benefits.
In essence, Greenplum is pitching the story:
- Thesis: Enterprise Data Warehouses (EDWs)
- Antithesis: Data Warehouse Appliances
- Synthesis: Greenplum’s Enterprise Data Cloud vision
When put that starkly, it’s overstated, not least because
Specialized Analytic DBMS != Data Warehouse Appliance
But basically it makes sense, for two main reasons:
- Analysis is performed on all sorts of novel data, from sources far beyond an enterprise’s core transactions. This data neither has to fit nor particularly benefits from being tightly fitted into the core enterprise data model. Requiring it to do so is just an unnecessary and painful bureaucratic delay.
- On the other hand, consolidation can be a good idea even when systems don’t particularly interoperate. Data marts, which commonly do in part interoperate with central data stores, have all the more reason to be consolidated onto a central technology platform/stack.
More on Fox Interactive Media’s use of Greenplum
Greenplum’s most important reference is probably its energetic advocate Fox Interactive Media, even ahead of much larger Greenplum user eBay, and notwithstanding Aster Data’s large presence in Fox subsidiary MySpace. I just ran across a “review” of Greenplum by FIM’s Brian Dolan, neatly summarizing his views about Greenplum’s strengths, weaknesses, and uses inside Fox. Highlights include: Read more
Categories: Data warehousing, Fox and MySpace, Greenplum, Web analytics | 2 Comments |
Merv Adrian on SAND Technology
Merv Adrian blogged about SAND Technology, casting significant doubt on SAND’s business prospects. At this point, I can’t say I disagree. On the other hand, SAND does have public, audited financial statements showing it generating more revenue than a lot of other analytic DBMS or archiving vendors probably make. Columnar DBMS vendors doing better than SAND are Sybase, Vertica, maybe Infobright — and who else?
Categories: Archiving and information preservation, Columnar database management, Data warehousing, SAND Technology | 1 Comment |
Daniel Abadi on Kickfire and related subjects
Daniel Abadi has a new blog, whose first post centers around Kickfire. The money quote is (emphasis mine):
In order for me to get excited about Kickfire, I have to ignore Mike Stonebraker’s voice in my head telling me that DBMS hardware companies have been launched many times in the past and ALWAYS fail (the main reasoning is that Moore’s law allows for commodity hardware to catch up in performance, eventually making the proprietary hardware overpriced and irrelevant). But given that Moore’s law is transforming into increased parallelism rather than increased raw speed, maybe hardware DBMS companies can succeed now where they have failed in the past.
Good point.
More generally, Abadi speculates about the market for MySQL-compatible data warehousing. My responses include:
- OF COURSE there are many MySQL users who need to move to a serious analytic DBMS.
- What’s less clear is whether there’s any big advantage to those users in remaining MySQL-compatible when they do move. I’m not sure what MySQL-specific syntax or optimizations they’d have that would be difficult to port to a non-MySQL system.
- It’s nice to see Abadi speaking well of Infobright and its technology.
- To say that Infobright went open source because it was “desperate” is overstated. That said, I don’t think Infobright was on track to prosper without going open source.
- While open source and MySQL go together, an appliance like Kickfire loses many (not all) of the benefits of open source.
- Calpont has indeed never disclosed a customer win. Any year now … (Just kidding, Vogel!)
- In general, seeing Abadi be so favorable toward Vertica competitors adds credibility to the recent Hadoop vs. DBMS paper.
Anyhow, as previously noted, I’m a big Daniel Abadi fan. I look forward to seeing what else he posts in his blog, and am optimistic he’ll live up to or exceed its stated goals.
Categories: Calpont, Columnar database management, Data warehouse appliances, Data warehousing, DBMS product categories, Infobright, Kickfire, MySQL, Open source, Theory and architecture | 2 Comments |
Greenplum update — Release 3.3 and so on
I visited Greenplum in early April, and talked with them again last night. As I noted in a separate post, there are a couple of subjects I won’t write about today. But that still leaves me free to cover a number of other points about Greenplum, including: Read more
Categories: Data warehousing, Database compression, EAI, EII, ETL, ELT, ETLT, Greenplum, MapReduce, Market share and customer counts, Parallelization, PostgreSQL, Pricing | 11 Comments |
Greenplum will be announcing some stuff
Greenplum is having a webinar Monday to announce “The Next Big Leap in Data Warehousing” (capitalization theirs). The idea they’ll be talking about is a genuinely good one. And off the top of my head I can only think of a few vendors who implemented it before Greenplum, and even fewer who emphasize it explicitly. So if you like webinars, you might want to listen in. I plan to blog about the general concept soon after the 12:01 am Monday embargo lifts. (Uh, guys, it is Monday rather than Tuesday, right?) Read more
Categories: Data warehousing, Greenplum, Specific users | 1 Comment |
How big are the intelligence agencies’ data warehouses?
Edit: The relevant part of the article cited has now been substantially changed, in line with Jeff Jonas’ remarks in the comment thread below.
Joe Harris linked me to an article that made a rather extraordinary claim:
At another federal agency Jonas worked at (he wouldn’t say which), they had a very large data warehouse in the basement. The size of the data warehouse was a secret, but Jonas estimated it at 4 exabytes (EB), and increasing at the rate of 5 TB per day.
Now, if one does the division, the quote claims it takes 800,000 days for the database to double in size, which is absurd. Perhaps this (Jeff) Jonas guy was just talking about a 4 petabyte system and got confused. (Of course, that would still be pretty big.) But before I got my arithmetic straight, I ran the 4 exabyte figure past a couple of folks, as a target for the size of the US government’s largest classified database. Best guess turns out to be that it’s 1-2 orders of magnitude too high for the government’s largest database, not 3. But that’s only a guess …
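For anybody who wants to check the division, here is a minimal Python sketch. It assumes decimal (SI) units, i.e. 1 EB = 1,000,000 TB and 1 PB = 1,000 TB, and a constant growth rate; the function and variable names are mine, purely for illustration.

```python
# Sanity check of the quoted figures. SI units are an assumption:
# 1 EB = 1,000,000 TB and 1 PB = 1,000 TB.
TB_PER_PB = 10**3
TB_PER_EB = 10**6


def days_to_double(size_tb: float, growth_tb_per_day: float) -> float:
    """Days needed to add as much data as is already stored,
    assuming the daily growth rate stays constant."""
    return size_tb / growth_tb_per_day


growth_tb_per_day = 5  # growth rate quoted in the article

# The figure as quoted: 4 exabytes.
exabyte_days = days_to_double(4 * TB_PER_EB, growth_tb_per_day)

# The figure if "petabytes" was what was actually meant.
petabyte_days = days_to_double(4 * TB_PER_PB, growth_tb_per_day)

print(f"4 EB at 5 TB/day doubles in {exabyte_days:,.0f} days")
print(f"4 PB at 5 TB/day doubles in {petabyte_days:,.0f} days")
```

Under those assumptions the 4 EB figure gives the 800,000 days mentioned above, while a 4 PB system growing at the same rate would double in 800 days, which is far more believable.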
Categories: Data warehousing, Specific users | 5 Comments |
Facebook’s experiences with compression
One little topic didn’t make it into my long post on Facebook’s Hadoop/Hive-based data warehouse: Compression. The story seems to be:
- Facebook uses gzip, and gets a little bit more than 6X compression.
- Experiments suggest bzip2 would reduce data by another 20% or so, increasing compression to the 7.5X range.
- The downside of bzip2 is 15-25% processing overhead, depending on the kind of data.
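The step from "another 20% or so" to "the 7.5X range" can be checked with a little arithmetic. The Python sketch below is my own illustration, not anything from Facebook; it assumes the 20% reduction applies to the size of the gzip-compressed data, and the function name is hypothetical.

```python
def ratio_after_further_reduction(base_ratio: float, extra_reduction: float) -> float:
    """Compression ratio after already-compressed data shrinks by a
    further fraction (e.g. 0.20 for 20%).

    Assumption: the extra reduction is measured against the compressed
    size, not against the raw size.
    """
    return base_ratio / (1.0 - extra_reduction)


gzip_ratio = 6.0  # "a little bit more than 6X", rounded down
bzip2_ratio = ratio_after_further_reduction(gzip_ratio, 0.20)

print(f"gzip:  {gzip_ratio:.1f}X")
print(f"bzip2: {bzip2_ratio:.1f}X")
```

With a 6X starting point, shrinking the output by a further 20% works out to 6 / 0.8 = 7.5X, consistent with the range quoted above. Whether that is worth 15-25% more processing depends on how storage and CPU costs compare for the workload in question.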
Categories: Data warehousing, Database compression, Facebook, Hadoop | 2 Comments |
How much state is saved when an MPP DBMS node fails?
Mark Callaghan raised an interesting question in the comment thread to my recent Facebook/Hadoop/Hive post:
My question is about how commercial MPP RDBMS vendors recover from single or a small number of node failures during a long running SQL query. Do any of them save enough state to avoid starting the query over?
Honestly, I’d just be guessing at the answer.
Would any vendors or other knowledgeable folks care to take a crack at answering directly?
Categories: Data warehousing, Parallelization | 10 Comments |