The Vertica story (with soundbites!)
I’ve blogged separately that:
- Vertica has a bunch of customers, including seven with 1 or more petabytes of data each.
- Vertica has progressed down the analytic platform path, with Monday’s release of Vertica 5.0.
And of course you know:
- Vertica (the product) is columnar, MPP, and fast.*
- Vertica (the company) was recently acquired by HP.**
Categories: Benchmarks and POCs, Columnar database management, ParAccel, Parallelization, Vertica Systems | 4 Comments |
Vertica as an analytic platform
Vertica 5.0 is coming out today, and delivering the down payment on Vertica’s analytic platform strategy. In Vertica lingo, there’s now a Vertica SDK (Software Development Kit), featuring Vertica UDT(F)s* (User-Defined Transform Functions). Vertica UDT syntax basics start: Read more
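Vertica's actual UDTF syntax sits behind the "Read more" link, so here is a purely conceptual stand-in: a tiny Python sketch (invented function and data, not Vertica SDK code) of what a transform function can do that a scalar UDF cannot, namely consume a whole partition of rows and emit a different number and shape of rows. Sessionizing a clickstream is the classic example.

```python
# A conceptual sketch of a user-defined transform function (UDTF), as opposed
# to a scalar UDF: it consumes a whole partition of rows and may emit a
# different number and shape of rows.  Illustrative Python, not Vertica SDK code.

def sessionize(rows, gap_seconds=1800):
    """Transform a partition of (user_id, timestamp) rows, sorted by timestamp,
    into (user_id, session_id, timestamp) rows, a classic clickstream transform
    that a one-row-in/one-row-out scalar function cannot express."""
    session_id = 0
    last_ts = None
    for user_id, ts in rows:
        if last_ts is not None and (ts - last_ts) > gap_seconds:
            session_id += 1          # gap too big: start a new session
        last_ts = ts
        yield (user_id, session_id, ts)

# Usage: one partition's worth of rows for a single user.
partition = [("u1", 0), ("u1", 60), ("u1", 4000), ("u1", 4100)]
for row in sessionize(partition):
    print(row)   # rows 1-2 land in session 0, rows 3-4 in session 1
```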
Categories: Analytic technologies, Data warehousing, GIS and geospatial, Predictive modeling and advanced analytics, RDF and graphs, Vertica Systems, Workload management | 7 Comments |
Temporal data, time series, and imprecise predicates
I’ve been confused about temporal data management for a while, because there are several different things going on.
- Date arithmetic. This of course has been around for a very long — er, for a very long time.
- Time-series-aware compression. This has been around for quite a while too.
- “Time travel”/snapshotting — preserving the state of the database at previous points in time. This is a matter of exposing (and not throwing away) the information you capture via MVCC (Multi-Version Concurrency Control) and/or append-only updates (as opposed to update-in-place). Those update strategies are increasingly popular for pretty much anything except update-intensive OLTP (OnLine Transaction Processing) DBMS, so time-travel/snapshotting is an achievable feature for most vendors.
- Bitemporal data access. This occurs when a fact has both a transaction timestamp and a separate validity duration. A Wikipedia article seems to cover the subject pretty well, and I touched on Teradata’s bitemporal plans back in 2009. (A small sketch below illustrates the basic idea.)
- Time series SQL extensions. Vertica explained its version of these to me a few days ago. I imagine Sybase IQ and other serious financial-trading market players have similar features.
In essence, the point of time series/event series SQL functionality is to do SQL against incomplete, imprecise, or derived data.* Read more
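To make the time series point concrete: the workhorse operation in such SQL extensions is gap-filling plus interpolation, i.e., manufacturing one output row per fixed time slice from irregularly spaced ticks and carrying the last observed value forward. Below is a rough single-process Python equivalent; the data and function name are invented for illustration, not any vendor's actual syntax.

```python
# Rough Python equivalent of gap-fill-and-interpolate: tick data arrives at
# irregular times, but the query wants one row per fixed time slice, with
# missing slices filled in by the most recent earlier observation.

ticks = [(0, 10.0), (2, 10.5), (7, 11.0)]   # (seconds, price); nothing at 1 or 3-6

def gap_fill_last_value(ticks, start, end, step=1):
    """Emit one (slice_time, price) row per step, interpolating gaps with the
    last observed value (None before the first tick)."""
    out, i, last = [], 0, None
    for t in range(start, end, step):
        while i < len(ticks) and ticks[i][0] <= t:
            last = ticks[i][1]
            i += 1
        out.append((t, last))
    return out

print(gap_fill_last_value(ticks, 0, 9))
# [(0, 10.0), (1, 10.0), (2, 10.5), (3, 10.5), ..., (7, 11.0), (8, 11.0)]
```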
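The time-travel and bitemporal bullets above likewise boil down to a simple mechanical idea, sketched here in deliberately naive Python: keep every version of a fact with both a transaction timestamp and a validity interval, append corrections rather than updating in place, and answer "as of" queries by filtering on either or both time dimensions. All names and data are invented.

```python
from dataclasses import dataclass

@dataclass
class FactVersion:
    value: str
    valid_from: int      # when the fact became true in the real world
    valid_to: int        # when it stopped being true (exclusive)
    tx_time: int         # when the database recorded this version

# Append-only store: corrections add rows; nothing is updated in place.
history = [
    FactVersion("address: 10 Main St", valid_from=100, valid_to=200, tx_time=150),
    # Late-arriving correction, recorded at tx_time=300, about the same period:
    FactVersion("address: 12 Main St", valid_from=100, valid_to=200, tx_time=300),
]

def as_of(history, tx_as_of, valid_at):
    """Bitemporal lookup: what did the database believe, as of transaction
    time tx_as_of, about the fact's value at real-world time valid_at?"""
    candidates = [f for f in history
                  if f.tx_time <= tx_as_of
                  and f.valid_from <= valid_at < f.valid_to]
    # Take the most recently recorded belief among the candidates.
    return max(candidates, key=lambda f: f.tx_time, default=None)

print(as_of(history, tx_as_of=200, valid_at=150).value)  # "10 Main St" (pre-correction)
print(as_of(history, tx_as_of=400, valid_at=150).value)  # "12 Main St" (post-correction)
```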
Categories: Analytic technologies, Data types, Investment research and trading, Log analysis, Sybase, Telecommunications, Theory and architecture, Vertica Systems | 2 Comments |
Columnar DBMS vendor customer metrics
Last April, I asked some columnar DBMS vendors to share customer metrics. They answered, but it took until now to iron out a couple of details. Overall, the answers are pretty impressive. Read more
Investigative analytics and derived data: Enzee Universe 2011 talk
I’ll be speaking Monday, June 20 at IBM Netezza’s Enzee Universe conference. Thus, as is my custom:
- I’m posting draft slides.
- I’m encouraging comment (especially in the short time window before I have to actually give the talk).
- I’m offering links below to more detail on various subjects covered in the talk.
The talk concept started out as “advanced analytics” (as opposed to fast query, a subject amply covered in the rest of any Netezza event), slotted in as a lunch break in what is otherwise a detailed “best practices” session. So I suggested we constrain the subject by focusing on a specific application area — customer acquisition and retention, something of importance to almost any enterprise, and one that exploits most areas of analytic technology. Then I actually prepared the slides — and guess what? The mix of subjects will be skewed somewhat more toward generalities than I first intended, specifically in the areas of investigative analytics and derived data. And, as always when I speak, I’ll try to raise consciousness about the issues of liberty and privacy, our options as a society for addressing them, and the crucial role we as an industry play in helping policymakers deal with these technologically intense subjects.
Slide 3 refers back to a post I made last December, saying there are six useful things you can do with analytic technology:
- Operational BI/Analytically-infused operational apps: You can make an immediate decision.
- Planning and budgeting: You can plan in support of future decisions.
- Investigative analytics (multiple disciplines): You can research, investigate, and analyze in support of future decisions.
- Business intelligence: You can monitor what’s going on, to see when it is necessary to decide, plan, or investigate.
- More BI: You can communicate, to help other people and organizations do these same things.
- DBMS, ETL, and other “platform” technologies: You can provide support, in technology or data gathering, for one of the other functions.
Slide 4 observes that investigative analytics:
- Is the most rapidly advancing of the six areas …
- … because it most directly exploits performance & scalability.
Slide 5 gives my simplest overview of investigative analytics technology to date: Read more
Notes and links, June 15, 2011
Five things: Read more
Metaphors amok
It all started when I disputed James Kobielus’ blogged claim that Hadoop is the nucleus of the next-generation cloud EDW. Jim posted again to reiterate the claim, only this time he wrote that all EDW vendors [will soon] bring Hadoop into the heart of their architectures. (All emphasis mine.)
That did it. I tweeted, in succession:
- Actually, I vote for Hadoop as the lungs of the EDW — first place of entry for essential nutrients.
- Data integration can be the heart of the EDW, pumping stuff around. RDBMS/analytic platform can be the brain.
- iPad-based dashboards that may engender envy, but which actually are only used occasionally and briefly … well, you get the picture.*
*Woody Allen said in Sleeper that the brain was his second-favorite organ.
Of course, that body of work was quickly challenged. Responses included: Read more
Categories: Analytic technologies, Business intelligence, Data warehousing, EAI, EII, ETL, ELT, ETLT, Fun stuff, Hadoop, Humor, MapReduce | Leave a Comment |
Infobright 4.0
Infobright is announcing its 4.0 release, with imminent availability. In marketing and product alike, Infobright is betting the farm on machine-generated data. This hasn’t been Infobright’s strategy from the get-go, but it is these days, with pretty good focus and commitment. While some fraction of Infobright’s customer base is in the Sybase-IQ-like data mart market — and indeed Infobright put out a customer-win press release in that market a few days ago — Infobright’s current customer targets seem to be mainly:
- Web companies, many of which are already MySQL users.
- Telecommunications and similar log data applications, especially in OEM relationships.
- Trading/financial services, especially at mid-tier companies.
Key aspects of Infobright 4.0 include: Read more
Categories: Data warehousing, Database compression, Infobright, Investment research and trading, Log analysis, Open source, Telecommunications, Web analytics | 8 Comments |
Patent nonsense: Parallel Iron/HDFS edition
Alan Scott commented with concern about Parallel Iron’s patent lawsuit attacking HDFS (Hadoop Distributed File System), filed in — where else? — Eastern Texas. The patent in question — US 7,415,565 — seems in essence to cover any shared-nothing block storage that exploits a “configurable switch fabric”; indeed, it’s more oriented to OLTP (OnLine Transaction Processing) than to analytics. For example, the Background section starts: Read more
Categories: EMC, Hadoop, MapReduce, Parallelization, Storage | 9 Comments |
Hadoop confusion from Forrester Research
Jim Kobielus started a recent post:
Most Hadoop-related inquiries from Forrester customers come to me. These have moved well beyond the “what exactly is Hadoop?” phase to the stage where the dominant query is “which vendors offer robust Hadoop solutions?”
What I tell Forrester customers is that, yes, Hadoop is real, but that it’s still quite immature.
So far, so good. But I disagree with almost everything Jim wrote after that.
Jim’s thesis seems to be that Hadoop will only be mature when a significant fraction of analytic DBMS vendors have own-branded versions of Hadoop alongside their DBMS, possibly via acquisition. Based on this, he calls for a formal, presumably vendor-driven Hadoop standardization effort, evidently for the whole Hadoop stack. He also says that
Hadoop is the nucleus of the next-generation cloud EDW, but that promise is still 3-5 years from fruition
where by “cloud” I presume Jim means first and foremost “private cloud.”
I don’t think any of that matches Hadoop’s actual strengths and weaknesses, whether now or in the 3-7 year future. My reasoning starts:
- Hadoop is well on its way to being a surviving data-storage-plus-processing system — like an analytic DBMS or DBMS-imitating data integration tool …
- … but Hadoop is best-suited for somewhat different use cases than those technologies are, and the gap won’t close as long as the others remain a moving target.
- I don’t think MapReduce is going to fail altogether; it’s too well-suited for too many use cases. (For a reminder of what the model looks like, see the sketch after this list.)
- Hadoop (as opposed to general MapReduce) has too much momentum to fizzle, unless perhaps it is supplanted by one or more embrace-and-extend MapReduce-plus systems that do a lot more than it does.
- The way for Hadoop to avoid being a MapReduce afterthought is to evolve sufficiently quickly itself; ponderous standardization efforts are quite beside the point.
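For anyone who wants that reminder of what the MapReduce programming model actually looks like, here is the canonical toy example (word count) in single-process Python. This obviously is not Hadoop code; it just shows the map/shuffle/reduce shape that fits so many scan-and-aggregate workloads.

```python
# The map/shuffle/reduce shape in miniature.  Single-process Python, not
# Hadoop; map calls and reduce calls are independent of each other, which is
# why the model parallelizes so naturally.

from collections import defaultdict

def map_phase(record):
    # Emit one (key, value) pair per word in the input record.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key; in Hadoop this is the distributed sort/shuffle.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return (key, sum(values))   # reducers run independently, one key at a time

records = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = (pair for record in records for pair in map_phase(record))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"], counts["fox"])   # 3 2
```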
As for the rest of Jim’s claim — I see three main candidates for the “nucleus of the next-generation enterprise data warehouse,” each with better claims than Hadoop:
- Relational DBMS, much like today. (E.g., Teradata, DB2, Exadata or their successors.) This is the case in which robustness of the central data store matters most.
- Grand cosmic data integration tools. (The descendants of Informatica PowerCenter, et al.) This is the case in which the logic of data relationships can safely be separated from physical storage.
- Nothing. (The architecture could have several strong members, none of which is truly the “nucleus.”) This is the case in which new ways keep being invented to extract high value from data, outrunning what grandly centralized solutions can adapt to. I think this is the most likely case of all.