EAI, EII, ETL, ELT, ETLT

Analysis of data integration products and technologies, especially ones related to data warehousing, such as ELT (Extract/Transform/Load). Related subjects include:

November 19, 2012

Incremental MapReduce

My clients at Cloudant, Couchbase, and 10gen/MongoDB (Edit: See Alex Popescu’s comment below) all boast the feature incremental MapReduce. (And they’re not the only ones.) So I feel like making a quick post about it. For starters, I’ll quote myself about Cloudant:

The essence of Cloudant’s incremental MapReduce seems to be that data is selected only if it’s been updated since the last run. Obviously, this only works for MapReduce algorithms whose eventual output can be run on different subsets of the target data set, then aggregated in a simple way.

These implementations of incremental MapReduce are hacked together by teams vastly smaller than those working on Hadoop, and surely fall short of Hadoop in many areas such as performance, fault-tolerance, and language support. That’s a given. Still, if the jobs are short and simple, those deficiencies may be tolerable.

A StackOverflow thread about MongoDB’s version of incremental MapReduce highlights some of the implementation challenges.

But all practicality aside, let’s return to the point that incremental MapReduce only works for some kinds of MapReduce-based algorithms, and consider how much of a limitation that really is. Looking at the Map steps sheds a little light: Read more

October 18, 2012

Notes on Hadoop adoption and trends

With Strata/Hadoop World being next week, there is much Hadoop discussion. One theme of the season is BI over Hadoop. I have at least 5 clients claiming they’re uniquely positioned to support that (most of whom partner with a 6th client, Tableau); the first 2 whose offerings I’ve actually written about are Teradata Aster and Hadapt. More generally, I’m hearing “Using Hadoop is hard; we’re here to make it easier for you.”

If enterprises aren’t yet happily running business intelligence against Hadoop, what are they doing with it instead? I took the opportunity to ask Cloudera, whose answers didn’t contradict anything I’m hearing elsewhere. As Cloudera tells it (approximately — this part of the conversation* was rushed):   Read more

October 7, 2012

IBM’s ETL

Bearing in mind the difficulties in covering big companies and their products, I had a call with IBM about its core ETL technology (Extract/Transform/Load), and have some notes accordingly. It’s pretty reasonable to say that there are and were a Big Three of high-end ETL vendors:

However, IBM fondly thinks there are a Big Two, on the theory that Informatica Powercenter can’t scale as well as IBM and Ab Initio can, and hence gets knocked out of deals when particularly strong scalability and throughput are required. Read more

September 24, 2012

Notes on Hadoop adoption

I successfully resisted telephone consulting while on vacation, but I did do some by email. One was on the oft-recurring subject of Hadoop adoption. I think it’s OK to adapt some of that into a post.

Notes on past and current Hadoop adoption include:

Thoughts on how Hadoop adoption will look going forward include: Read more

September 7, 2012

Integrated internet system design

What are the central challenges in internet system design? We probably all have similar lists, comprising issues such as scale, scale-out, throughput, availability, security, programming ease, UI, or general cost-effectiveness. Screw those up, and you don’t have an internet business.

Much new technology addresses those challenges, with considerable success. But the success is usually one silo at a time — a short-request application here, an analytic database there. When it comes to integration, unsolved problems abound.

The top integration and integration-like challenges for me, from a practical standpoint, are:

Other concerns that get mentioned include:

Let’s skip those latter issues for now, focusing instead on the first four.

Read more

August 24, 2012

Hadoop notes: Informatica, Splunk, and IBM

Informatica, Splunk, and IBM are all public companies, and correspondingly reticent to talk about product futures. Hence, anything I might suggest about product futures from any of them won’t be terribly detailed, and even the vague generalities are “the Good Lord willin’ an’ the creek don’ rise”.

Never let a rising creek overflow your safe harbor.

Anyhow:

1. Hadoop can be an awesome ETL (Extract/Transform/Load) execution engine; it can handle huge jobs and perform a great variety of transformations. (Indeed, MapReduce was invented to run giant ETL jobs.) Thus, if one offers a development-plus-execution stack for ETL processes, it might seem appealing to make Hadoop an ETL execution option. And so:

Informatica told me about other interesting Hadoop-related plans as well, but I’m not sure my frieNDA allows me to mention them at all.

IBM, however, is standing aside. Specifically, IBM told me that it doesn’t see the point of doing the same thing, as its ETL engine — presumably derived from the old Ascential product line — is already parallel and performant enough.

2. Last year, I suggested that Splunk and Hadoop are competitors in managing machine-generated data. That’s still true, but Splunk is also preparing a Hadoop co-opetition strategy. To a first approximation, it’s just Hadoop import/export. However, suppose you view Splunk as offering a three-layer stack: Read more

August 8, 2012

What kinds of metadata are important anyway?

In today’s post about HCatalog, I noted that the Hadoop/HCatalog community didn’t necessarily understand all the kinds of metadata that enterprises need and want, especially in the context of data integration and ETL and ELT (Extract/Transform/Load/Transform). That raises a natural question — what kinds of metadata do users need or want? In the hope of spurring discussion, from vendors and users alike, I’m splitting this question out into a separate post.

Please comment with your thoughts about ETL-related metadata needs. The conversation needs to advance.

In the relational world, there are at least three kinds of metadata:

As I see it:

Read more

July 28, 2012

Some Vertica 6 features

Vertica 6 was recently announced, and so it seemed like a good time to catch up on Vertica features. The main topics I want to address are:

Also:

In general, the main themes of Vertica 6 appear to be:

Let’s do the analytic functionality first. Notes on that include:

I’ll also take this opportunity to expand on something I wrote about a few vendors — including Vertica — at the end of my post on approximate query results. When I probed how customers of Vertica and other RDBMS-based analytic platform vendors used vendor-proprietary advanced analytic SQL and other analytic capabilities, answers included: Read more

July 24, 2012

Notes on Datameer

In a short October, 2011 post about Datameer, I wrote:

Datameer is designed to let you do simple stuff on large amounts of data, where “large amounts of data” typically means data in Hadoop, and “simple stuff” includes basic versions of a spreadsheet, of BI, and of EtL (Extract/Transform/Load, without much in the way of T).

That’s all still mainly true, although with the recent Datameer 2.0:

In essence, Datameer has two positionings.

Read more

July 8, 2012

Database diversity revisited

From time to time, I try to step back and build a little taxonomy for the variety in database technology. One effort was 4 1/2 years ago, in a pre-planned exchange with Mike Stonebraker (his side, alas, has since been taken down). A year ago I spelled out eight kinds of analytic database.

The angle I’ll take this time is to say that every sufficiently large enterprise needs to be cognizant of at least 7 kinds of database challenge. General notes on that include:

The Big Seven database challenges that almost any enterprise faces are: Read more

← Previous PageNext Page →

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.