IBM and DB2
Analysis of IBM and various of its product lines in database management, analytics, and data integration.
There’s a growing trend for DBMS to beef up their support for multiple data manipulation languages (DMLs) or APIs — and there’s a special boom in JSON support, MongoDB-compatible or otherwise. So I talked earlier tonight with IBM’s Bobbie Cochrane about how JSON is managed in DB2.
For starters, let’s note that there are at least four strategies IBM could have used.
- Store JSON in a BLOB (Binary Large OBject) or similar existing datatype. That’s what IBM actually chose.
- Store JSON in a custom datatype, using the datatype extensibility features DB2 has had since the 1990s. IBM is not doing this, and doesn’t see a need to at this time.
- Use DB2 pureXML, along with some kind of JSON/XML translator. DB2 managed JSON this way in the past, via UDFs (User-Defined Functions), but that implementation is superseded by the new BLOB-based approach, which offers better performance in ingest and query alike.
- Shred — to use a term from XML days — JSON into a bunch of relational columns. IBM experimented with this approach, but ultimately rejected it. In dismissing shredding, Bobbie also disdained any immediate support for schema-on-need.
IBM’s technology choices are of course influenced by its use case focus. It’s reasonable to divide MongoDB use cases into two large buckets:
- Hardcore internet and/or machine-generated data, for example from a website.
- Enterprise data aggregation, for example a “360-degree customer view.”
IBM’s DB2 JSON features are targeted at the latter bucket. Also, I suspect that IBM is generally looking for a way to please users who enjoy working on and with their MongoDB skills. Read more
|Categories: Data models and architecture, IBM and DB2, MongoDB and 10gen, NoSQL, pureXML, Structured documents||2 Comments|
Oracle announced its in-memory columnar option Sunday. As usual, I wasn’t briefed; still, I have some observations. For starters:
- Oracle, IBM (Edit: See the rebuttal comment below), and Microsoft are all doing something similar …
- … because it makes sense.
- The basic idea is to take the technology that manages indexes — which are basically columns+pointers — and massage it into an actual column store. However …
- … the devil is in the details. See, for example, my May post on IBM’s version, called BLU, outlining all the engineering IBM did around that feature.
- Notwithstanding certain merits of this approach, I don’t believe in complete alternatives to analytic RDBMS. The rise of analytic DBMS oriented toward multi-structured data just strengthens that point.
I’d also add that Larry Ellison’s pitch “build columns to avoid all that index messiness” sounds like 80% bunk. The physical overhead should be at least as bad, and the main saving in administrative overhead should be that, in effect, you’re indexing ALL columns rather than picking and choosing.
Anyhow, this technology should be viewed as applying to traditional business transaction data, much more than to — for example — web interaction logs, or other machine-generated data. My thoughts around that distinction start:
- I argued back in 2011 that traditional databases will wind up in RAM, basically because …
- … Moore’s Law will make it ever cheaper to store them there.
- Still, cheaper != cheap, so this is a technology only to use with your most valuable data — i.e., that transactional stuff.
- These are very tabular technologies, without much in the way of multi-structured data support.
|Categories: Columnar database management, Data warehousing, IBM and DB2, Memory-centric data management, Microsoft and SQL*Server, OLTP, Oracle, SAP AG, Workday||6 Comments|
- Developed by Cloudera.
- An Apache incubator project.
- Slated to be rolled into CDH — Cloudera’s Hadoop distribution — over the next couple of weeks.
- Only useful with Hive in Version 1, but planned to also work in the future with other Hadoop data access systems such as Pig, search and so on.
- Lacking in administrative scalability in Version 1, something that is also slated to be fixed in future releases.
Apparently, Hadoop security options pre-Sentry boil down to:
- Kerberos, which only works down to directory or file levels of granularity.
- Third-party products.
Sentry adds role-based permissions for SQL access to Hadoop:
- By server.
- By database.
- By table.
- By view.
for a variety of actions — selections, transformations, schema changes, etc. Sentry does this by examining a query plan and checking whether each step in the plan is permissible. Read more
I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:
- Hortonworks founder J. Eric “Eric14″ Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)
- John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.
- ~250 employees.
- ~70-75 subscription customers.
Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:
- 10ish nodes for a typical starting cluster.
- 100ish nodes for a typical “data lake” committed adoption.
- Teradata UDA (Unified Data Architecture)* customers sometimes (typically?) jumping straight to a data lake scenario.
- A few users in the 10s of 1000s of nodes. (Obviously Yahoo is one.)
- HBase used in >50% of installations.
- Hive probably even more than that.
- Hortonworks is seeing a fair amount of interest in Windows Hadoop deployments.
*By the way — Teradata seems serious about pushing the UDA as a core message.
Ecosystem notes, in Hortonworks’ perception, included:
- Cloudera is obviously Hortonworks’ biggest distro competitor. Next is IBM, presumably in its blue-forever installed base. MapR is barely on the radar screen; Pivotal’s likely rise hasn’t yet hit sales reports.
- Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.
- Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)
I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.
Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:
- It’s been in preview/release candidate/commercial beta mode for weeks.
- Q3 is the goal; H2 is the emphatic goal.
- Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)
- The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to serve for a similar multi-year period as Hadoop 1’s have. But there also was some YARN stabilization into May.
Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and it’s more than a year between Cloudera starting supporting them and when Hortonworks is offering Hadoop 2.0.
Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include: Read more
I’ll start with three observations:
- Computer systems can’t be entirely tightly coupled — nothing would ever get developed or tested.
- Computer systems can’t be entirely loosely coupled — nothing would ever get optimized, in performance and functionality alike.
- In an ongoing trend, there is and will be dramatic refactoring as to which connections wind up being loose or tight.
As written, that’s probably pretty obvious. Even so, it’s easy to forget just how pervasive the refactoring is and is likely to be. Let’s survey some examples first, and then speculate about consequences. Read more
My July 2 comments on predictive modeling were far from my best work. Let’s try again.
1. Predictive analytics has two very different aspects.
Developing models, aka “modeling”:
- Is a big part of investigative analytics.
- May or may not be difficult to parallelize and/or integrate into an analytic RDBMS.
- May or may not require use of your whole database.
- Generally is done by humans.
- Often is done by people with special skills, e.g. “statisticians” or “data scientists”.
More precisely, some modeling algorithms are straightforward to parallelize and/or integrate into RDBMS, but many are not.
Using models, most commonly:
- Is done by machines …
- … that “score” data according to the models.
- May be done in batch or at run-time.
- Is embarrassingly parallel, and is much more commonly integrated into analytic RDBMS than modeling is.
2. Some people think that all a modeler needs are a few basic algorithms. (That’s why, for example, analytic RDBMS vendors are proud of integrating a few specific modeling routines.) Other people think that’s ridiculous. Depending on use case, either group can be right.
3. If adoption of DBMS-integrated modeling is high, I haven’t noticed.
|Categories: Ayasdi, Data warehousing, Hadoop, Health care, IBM and DB2, KXEN, Predictive modeling and advanced analytics, SAS Institute||4 Comments|
- Cloudera changed CEOs last week. Tom Reilly, late of ArcSight, is the new guy (I don’t know him), while Mike Olson’s titles become Chairman and Chief Strategy Officer. Mike told me Friday that Reilly had secretly been working with him for months.
- Mike shared good-sounding numbers with me. But little is for public disclosure except the stat >400 employees.
- There are always rumors of infighting at Cloudera, perhaps because from earliest days Cloudera was a place where tempers are worn on sleeves. That said, Mike denied stories of problems between him and COO Kirk Dunn, and greatly praised Kirk’s successes at large-account sales.
- Cloudera now self-identifies pretty clearly as an analytic data management company. The vision is multiple execution engines – MapReduce, Impala, something more memory-centric, etc. – talking to any of a variety of HDFS file formats. While some formats may be optimized for specific engines – e.g. Parquet for Impala – anything can work with more or less anything.*
- Mike told me that Cloudera didn’t have any YARN users in production, but thought there would be some by year-end. Even so, he thinks it’s fair to say that Cloudera users have substantial portions of Hadoop 2 in production, for example NameNode failover and HDFS (Hadoop Distributed File System) performance enhancements. Ditto HCatalog.
*Of course, there will always be exceptions. E.g., some formats can be updated on a short-request basis, while others can only be written to via batch conversions.
- There’s a widespread belief that Hortonworks is being shopped. Numerous folks – including me — believe the rumor of an Intel offer for $700 million. Higher figures and alternate buyers aren’t as widely believed.
- Views of MapR market traction, never high, are again on the downswing.
- IBM Big Insights seems to have some traction.
- In case there was any remaining doubt — DBMS vendors are pretty unanimous in agreeing that it makes sense to have Hadoop too. To my knowledge SAP hasn’t been as clear about showing a markitecture incorporating Hadoop as most of the others have … but then, SAP’s markitecture is generally less clear than other vendors’.
- Folks I talk with are generally wondering where and why Datameer lost its way. That still leaves Datameer ahead of other first-generation Hadoop add-on vendors (Karmasphere, Zettaset, et al.), in that I rarely hear them mentioned at all.
- I visited with my client Platfora. Things seem to be going very well.
- My former client Revelytix seems to have racked up some nice partnerships. (I had something to do with that. :))
|Categories: Cloudera, Data warehousing, Datameer, Hadoop, Hortonworks, IBM and DB2, Intel, MapR, Market share and customer counts, Platfora, SAP AG, Zettaset||11 Comments|
I had a good chat with IBM about IBM BLU, aka BLU Accelerator or Acceleration. BLU basics start:
- BLU is a part of DB2.
- BLU works like a columnar analytic DBMS.
- If you want to do a join combining BLU and non-BLU tables, all the BLU tables are joined first, and the result set is joined to the other tables by the rest of DB2.
And yes — that means Oracle is now the only major relational DBMS vendor left without a true columnar story.
BLU’s maturity and scalability basics start:
- BLU is coming out in IBM DB2 10.5, this quarter.
- BLU will initially be single-server, but …
- … IBM claims “near-linear” scalability up to 64 cores, and further says that …
- … scale-out for BLU is coming “soon”.
- IBM already thinks all your analytically-oriented DB2 tables should be in BLU.
- IBM describes the first version of BLU as being optimized for 10 TB databases, but capable of handling 20 TB.
BLU technical highlights include: Read more
|Categories: Columnar database management, Data pipelining, Data warehousing, Database compression, IBM and DB2, Workload management||20 Comments|
Way back in 2006, I wrote about a cool Netezza feature called the zone map, which in essence allows you to do partition elimination even in the absence of strict range partitioning.
Netezza’s substitute for range partitioning is very simple. Netezza features “zone maps,” which note the minimum and maximum of each column value (if such concepts are meaningful) in each extent. This can amount to effective range partitioning over dates; if data is added over time, there’s a good chance that the data in any particular date range is clustered, and a zone map lets you pick out which data falls in the desired data range.
I further wrote
… that seems to be the primary scenario in which zone maps confer a large benefit.
But I now think that part was too pessimistic. For example, in bulk load scenarios, it’s easy to imagine ways in which data can be clustered or skewed. And in such cases, zone maps can let you skip a large fraction of potential I/O.
Over the years I’ve said that other things were reminiscent of Netezza zone maps, e.g. features of Infobright, SenSage, InfiniDB and even Microsoft SQL Server. But truth be told, when I actually use the phrase “zone map”, people usually give me a blank look.
In a recent briefing about BLU, IBM introduced me to a better term — data skipping. I like it and, unless somebody comes up with a good reason not to, I plan to start using it myself.
2. Numerous vendors are blending SQL and JSON management in their short-request DBMS. It will take some more work for me to have a strong opinion about the merits/demerits of various alternatives.
The default implementation — one example would be Clustrix’s — is to stick the JSON into something like a BLOB/CLOB field (Binary/Character Large Object), index on individual values, and treat those indexes just like any others for the purpose of SQL statements. Drawbacks include:
- You have to store or retrieve the JSON in whole documents at a time.
- If you are spectacularly careless, you could write JOINs with odd results.
IBM DB2 is one recent arrival to the JSON party. Unfortunately, I forgot to ask whether IBM’s JSON implementation was based on IBM DB2 pureXML when I had the chance, and IBM hasn’t gotten around to answering my followup query.
3. Nor has IBM gotten around to answering my followup queries on the subject of BLU, an interesting-sounding columnar option for DB2.
4. Numerous clients have asked me whether they should be active in DBaaS (DataBase as a Service). After all, Amazon, Google, Microsoft, Rackspace and salesforce.com are all in that business in some form, and other big companies have dipped toes in as well. Read more