Discussion of Hadoop. Related subjects include:

Open source database management systems

September 28, 2015

The potential significance of Cloudera Kudu

This is part of a three-post series on Kudu, a new data storage system from Cloudera.

Combined with Impala, Kudu is (among other things) an attempt to build a no-apologies analytic DBMS (DataBase Management System) into Hadoop. My reactions to that start:

I’ll expand on that last point. Analytics is no longer just about fast queries on raw or simply-aggregated data. Data transformation is getting ever more complex — that’s true in general, and it’s specifically true in the case of transformations that need to happen in human real time. Predictive models now often get rescored on every click. Sometimes, they even get retrained at short intervals. And while data reduction in the sense of “event extraction from high-volume streams” isn’t that a big deal yet in commercial apps featuring machine-generated data — if growth trends continue as much of us expect, it’s only a matter of time before that changes.

Of course, this is all a bullish argument for Spark (or Flink, if I’m wrong to dismiss its chances as a Spark competitor). But it also all requires strong low-latency analytic data underpinnings, and I suspect that several kinds of data subsystem will prosper. I expect Kudu-supported Hadoop/Spark to be a strong contender for that role, along with the best of the old-school analytic RDBMS, Tachyon-supported Spark, one or more contenders from the Hana/MemSQL crowd (i.e., memory-centric RDBMS that purport to be good at analytics and transactions alike), and of course also whatever Cloudera’s strongest competitor(s) choose to back.

September 28, 2015

Cloudera Kudu deep dive

This is part of a three-post series on Kudu, a new data storage system from Cloudera.

Let’s talk in more detail about how Kudu stores data.

Read more

September 28, 2015

Introduction to Cloudera Kudu

This is part of a three-post series on Kudu, a new data storage system from Cloudera.

Cloudera is introducing a new open source project, Kudu,* which from Cloudera’s standpoint is meant to eventually become the single best underpinning for analytics on the Hadoop stack. I’ve spent multiple hours discussing Kudu with Cloudera, mainly with Todd Lipcon. Any errors are of course entirely mine.

*Like the impala, the kudu is a kind of antelope. I knew that, because I enjoy word games. What I didn’t know — and which is germane to the naming choice — is that the kudu has stripes. :)

For starters:

Read more

September 17, 2015

Rocana’s world

For starters:

Rocana portrays itself as offering next-generation IT operations monitoring software. As you might expect, this has two main use cases:

Rocana’s differentiation claims boil down to fast and accurate anomaly detection on large amounts of log data, including but not limited to:

Read more

August 3, 2015

Data messes

A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.

To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — :) — mine.)

Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:

Inconsistency can take multiple forms, including:  Read more

June 10, 2015

Hadoop generalities

Occasionally I talk with an astute reporter — there are still a few left :) — and get led toward angles I hadn’t considered before, or at least hadn’t written up. A blog post may then ensue. This is one such post.

There is a group of questions going around that includes:

To a first approximation, my responses are:  Read more

June 8, 2015

Teradata will support Presto

At the highest level:

Now let’s make that all a little more precise.

Regarding Presto (and I got most of this from Teradata)::

Daniel Abadi said that Presto satisfies what he sees as some core architectural requirements for a modern parallel analytic RDBMS project:  Read more

May 26, 2015

IT-centric notes on the future of health care

It’s difficult to project the rate of IT change in health care, because:

Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:

These vastly greater amounts of data cited above will allow for greatly changed analytics.
Read more

May 20, 2015

MemSQL 4.0

I talked with my clients at MemSQL about the release of MemSQL 4.0. Let’s start with the reminders:

The main new aspects of MemSQL 4.0 are:

There’s also a new free MemSQL “Community Edition”. MemSQL hopes you’ll experiment with this but not use it in production. And MemSQL pricing is now wholly based on RAM usage, so the column store is quasi-free from a licensing standpoint is as well.

Read more

May 13, 2015

Notes on analytic technology, May 13, 2015

1. There are multiple ways in which analytics is inherently modular. For example:

Also, analytics is inherently iterative.

If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need.

2. In 2011, I wrote, in the context of agile predictive analytics, that

… the “business analyst” role should be expanded beyond BI and planning to include lightweight predictive analytics as well.

I gather that a similar point is at the heart of Gartner’s new term citizen data scientist. I am told that the term resonates with at least some enterprises.  Read more

Next Page →

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:


Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.