IBM and DB2
Analysis of IBM and various of its product lines in database management, analytics, and data integration.
1. Censorship worries me, a lot. A classic example is Vietnam, which basically has outlawed online political discussion.
And such laws can have teeth. It’s hard to conceal your internet usage from an inquisitive government.
2. Software and software related patents are back in the news. Google, which said it was paying $5.5 billion or so for a bunch of Motorola patents, turns out to really have paid $7 billion or more. Twitter and IBM did a patent deal as well. Big numbers, and good for certain shareholders. But this all benefits the wider world — how?
The purpose of legal intellectual property protections, simply put, is to help make it a good decision to create something. …
Why does “securing … exclusive Right[s]” to the creators of things that are patented, copyrighted, or trademarked help make it a good decision for them to create stuff? Because it averts competition from copiers, thus making the creator a monopolist in what s/he has created, allowing her to at least somewhat value-price her creation.
I.e., the core point of intellectual property rights is to prevent copying-based competition. By way of contrast, any other kind of intellectual property “right” should be viewed with great suspicion.
That Constitutionally-based principle makes as much sense to me now as it did then. By way of contrast, “Let’s give more intellectual property rights to big corporations to protect middle-managers’ jobs” is — well, it’s an argument I view with great suspicion.
But I find it extremely hard to think of a technology industry example in which development was stimulated by the possibility of patent protection. Yes, the situation may be different in pharmaceuticals, or for gadgeteering home inventors, but I can think of no case in which technology has been better, or faster to come to market, because of the possibility of a patent-law monopoly. So if software and business-method patents were abolished entirely – even the ones that I think could be realistically adjudicated — I’d be pleased.
3. In November, 2008 I offered IT policy suggestions for the incoming Obama Administration, especially: Read more
|Categories: Buying processes, Google, IBM and DB2, Public policy, Surveillance and privacy||1 Comment|
IBM excels at game technology, most famously in Deep Blue (chess) and Watson (Jeopardy!). But except at the chip level — PowerPC — IBM hasn’t accomplished much at game/real world crossover. And so I suspect the Watson hype is far overblown.
I believe that for two main reasons. First, whenever IBM talks about big initiatives like Watson, it winds up bundling a bunch of dissimilar things together and claiming they’re a seamless whole. Second, some core Watson claims are eerily similar to artificial intelligence (AI) over-hype three or more decades past. For example, the leukemia treatment advisor that is being hopefully built in Watson now sounds a lot like MYCIN from the early 1970s, and the idea of collecting a lot of tidbits of information sounds a lot like the Cyc project. And by the way:
- MYCIN led to E-MYCIN, which led to the company Teknowledge, which raised a lot of money* but now has almost faded from memory.
- Cyc is connected to the computer science community’s standard unit of bogosity.
Relational DBMS used to be fairly straightforward product suites, which boiled down to:
- A big SQL interpreter.
- A bunch of administrative and operational tools.
- Some very optional add-ons, often including an application development tool.
Now, however, most RDBMS are sold as part of something bigger.
- Oracle has hugely thickened its stack, as part of an Innovator’s Solution strategy — hardware, middleware, applications, business intelligence, and more.
- IBM has moved aggressively to a bundled “appliance” strategy. Even before that, IBM DB2 long sold much better to committed IBM accounts than as a software-only offering.
- Microsoft SQL Server is part of a stack, starting with the Windows operating system.
- Sybase was an exception to this rule, with thin(ner) stacks for both Adaptive Server Enterprise and Sybase IQ. But Sybase is now owned by SAP, and increasingly integrated as a business with …
- … SAP HANA, which is closely associated with SAP’s applications.
- Teradata has always been a hardware/software vendor. The most successful of its analytic DBMS rivals, in some order, are:
- Netezza, a pure appliance vendor, now part of IBM.
- Greenplum, an appliance-mainly vendor for most (not all) of its existence, and in particular now as a part of EMC Pivotal.
- Vertica, more of a software-only vendor than the others, but now owned by and increasingly mainstreamed into hardware vendor HP.
- MySQL’s glory years were as part of the “LAMP” stack.
- Various thin-stack RDBMS that once were or could have been important market players … aren’t. Examples include Progress OpenEdge, IBM Informix, and the various strays adopted by Actian.
The 2013 Gartner Magic Quadrant for Operational Database Management Systems is out. “Operational” seems to be Gartner’s term for what I call short-request, in each case the point being that OLTP (OnLine Transaction Processing) is a dubious term when systems omit strict consistency, and when even strictly consistent systems may lack full transactional semantics. As is usually the case with Gartner Magic Quadrants:
- I admire the raw research.
- The opinions contained are generally reasonable (especially since Merv Adrian joined the Gartner team).
- Some of the details are questionable.
- There’s generally an excessive focus on Gartner’s perception of vendors’ business skills, and on vendors’ willingness to parrot all the buzzphrases Gartner wants to hear.
- The trends Gartner highlights are similar to those I see, although our emphasis may be different, and they may leave some important ones out. (Big omission — support for lightweight analytics integrated into operational applications, one of the more genuine forms of real-time analytics.)
Anyhow: Read more
There’s a growing trend for DBMS to beef up their support for multiple data manipulation languages (DMLs) or APIs — and there’s a special boom in JSON support, MongoDB-compatible or otherwise. So I talked earlier tonight with IBM’s Bobbie Cochrane about how JSON is managed in DB2.
For starters, let’s note that there are at least four strategies IBM could have used.
- Store JSON in a BLOB (Binary Large OBject) or similar existing datatype. That’s what IBM actually chose.
- Store JSON in a custom datatype, using the datatype extensibility features DB2 has had since the 1990s. IBM is not doing this, and doesn’t see a need to at this time.
- Use DB2 pureXML, along with some kind of JSON/XML translator. DB2 managed JSON this way in the past, via UDFs (User-Defined Functions), but that implementation is superseded by the new BLOB-based approach, which offers better performance in ingest and query alike.
- Shred — to use a term from XML days — JSON into a bunch of relational columns. IBM experimented with this approach, but ultimately rejected it. In dismissing shredding, Bobbie also disdained any immediate support for schema-on-need.
IBM’s technology choices are of course influenced by its use case focus. It’s reasonable to divide MongoDB use cases into two large buckets:
- Hardcore internet and/or machine-generated data, for example from a website.
- Enterprise data aggregation, for example a “360-degree customer view.”
IBM’s DB2 JSON features are targeted at the latter bucket. Also, I suspect that IBM is generally looking for a way to please users who enjoy working on and with their MongoDB skills. Read more
|Categories: Data models and architecture, IBM and DB2, MongoDB and 10gen, NoSQL, pureXML, Structured documents||2 Comments|
Oracle announced its in-memory columnar option Sunday. As usual, I wasn’t briefed; still, I have some observations. For starters:
- Oracle, IBM (Edit: See the rebuttal comment below), and Microsoft are all doing something similar …
- … because it makes sense.
- The basic idea is to take the technology that manages indexes — which are basically columns+pointers — and massage it into an actual column store. However …
- … the devil is in the details. See, for example, my May post on IBM’s version, called BLU, outlining all the engineering IBM did around that feature.
- Notwithstanding certain merits of this approach, I don’t believe in complete alternatives to analytic RDBMS. The rise of analytic DBMS oriented toward multi-structured data just strengthens that point.
I’d also add that Larry Ellison’s pitch “build columns to avoid all that index messiness” sounds like 80% bunk. The physical overhead should be at least as bad, and the main saving in administrative overhead should be that, in effect, you’re indexing ALL columns rather than picking and choosing.
Anyhow, this technology should be viewed as applying to traditional business transaction data, much more than to — for example — web interaction logs, or other machine-generated data. My thoughts around that distinction start:
- I argued back in 2011 that traditional databases will wind up in RAM, basically because …
- … Moore’s Law will make it ever cheaper to store them there.
- Still, cheaper != cheap, so this is a technology only to use with you most valuable data — i.e., that transactional stuff.
- These are very tabular technologies, without much in the way of multi-structured data support.
|Categories: Columnar database management, Data warehousing, IBM and DB2, Memory-centric data management, Microsoft and SQL*Server, OLTP, Oracle, SAP AG, Workday||5 Comments|
- Developed by Cloudera.
- An Apache incubator project.
- Slated to be rolled into CDH — Cloudera’s Hadoop distribution — over the next couple of weeks.
- Only useful with Hive in Version 1, but planned to also work in the future with other Hadoop data access systems such as Pig, search and so on.
- Lacking in administrative scalability in Version 1, something that is also slated to be fixed in future releases.
Apparently, Hadoop security options pre-Sentry boil down to:
- Kerberos, which only works down to directory or file levels of granularity.
- Third-party products.
Sentry adds role-based permissions for SQL access to Hadoop:
- By server.
- By database.
- By table.
- By view.
for a variety of actions — selections, transformations, schema changes, etc. Sentry does this by examining a query plan and checking whether each step in the plan is permissible. Read more
I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:
- Hortonworks founder J. Eric “Eric14″ Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)
- John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.
- ~250 employees.
- ~70-75 subscription customers.
Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:
- 10ish nodes for a typical starting cluster.
- 100ish nodes for a typical “data lake” committed adoption.
- Teradata UDA (Unified Data Architecture)* customers sometimes (typically?) jumping straight to a data lake scenario.
- A few users in the 10s of 1000s of nodes. (Obviously Yahoo is one.)
- HBase used in >50% of installations.
- Hive probably even more than that.
- Hortonworks is seeing a fair amount of interest in Windows Hadoop deployments.
*By the way — Teradata seems serious about pushing the UDA as a core message.
Ecosystem notes, in Hortonworks’ perception, included:
- Cloudera is obviously Hortonworks’ biggest distro competitor. Next is IBM, presumably in its blue-forever installed base. MapR is barely on the radar screen; Pivotal’s likely rise hasn’t yet hit sales reports.
- Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.
- Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)
I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.
Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:
- It’s been in preview/release candidate/commercial beta mode for weeks.
- Q3 is the goal; H2 is the emphatic goal.
- Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)
- The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to serve for a similar multi-year period as Hadoop 1′s have. But there also was some YARN stabilization into May.
Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and it’s more than a year between Cloudera starting supporting them and when Hortonworks is offering Hadoop 2.0.
Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include: Read more
I’ll start with three observations:
- Computer systems can’t be entirely tightly coupled — nothing would ever get developed or tested.
- Computer systems can’t be entirely loosely coupled — nothing would ever get optimized, in performance and functionality alike.
- In an ongoing trend, there is and will be dramatic refactoring as to which connections wind up being loose or tight.
As written, that’s probably pretty obvious. Even so, it’s easy to forget just how pervasive the refactoring is and is likely to be. Let’s survey some examples first, and then speculate about consequences. Read more
My July 2 comments on predictive modeling were far from my best work. Let’s try again.
1. Predictive analytics has two very different aspects.
Developing models, aka “modeling”:
- Is a big part of investigative analytics.
- May or may not be difficult to parallelize and/or integrate into an analytic RDBMS.
- May or may not require use of your whole database.
- Generally is done by humans.
- Often is done by people with special skills, e.g. “statisticians” or “data scientists”.
More precisely, some modeling algorithms are straightforward to parallelize and/or integrate into RDBMS, but many are not.
Using models, most commonly:
- Is done by machines …
- … that “score” data according to the models.
- May be done in batch or at run-time.
- Is embarrassingly parallel, and is much more commonly integrated into analytic RDBMS than modeling is.
2. Some people think that all a modeler needs are a few basic algorithms. (That’s why, for example, analytic RDBMS vendors are proud of integrating a few specific modeling routines.) Other people think that’s ridiculous. Depending on use case, either group can be right.
3. If adoption of DBMS-integrated modeling is high, I haven’t noticed.
|Categories: Data warehousing, Hadoop, Health care, IBM and DB2, KXEN, Predictive modeling and advanced analytics, SAS Institute||2 Comments|