EAI, EII, ETL, ELT, ETLT

Analysis of data integration products and technologies, especially ones related to data warehousing, such as ELT (Extract/Load/Transform).

May 29, 2013

Syncsort extends Hadoop MapReduce

My client Syncsort:

*Perhaps we should question Syncsort’s previous claims of having strong multi-node parallelism already. :)

The essence of the Syncsort DMX-h ETL Edition story is:

More details can be found in a slide deck Syncsort graciously allowed me to post. Read more

April 1, 2013

Some notes on new-era data management, March 31, 2013

Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.

Performance confusion

Discussions of DBMS performance are always odd, for starters because:

But in NoSQL/NewSQL short-request processing performance claims seem particularly confused. Reasons include but are not limited to:

MongoDB and 10gen

I caught up with Ron Avnur at 10gen. Technical highlights included: Read more

March 26, 2013

Platfora at the time of first GA

Well-resourced Silicon Valley start-ups typically announce their existence multiple times. Company formation, angel funding, Series A funding, Series B funding, company launch, product beta, and product general availability may not be 7 different “news events”, but they’re apt to be at least 3-4. Platfora, no exception to this rule, is hitting general availability today, and in connection with that I learned a bit more about what they are up to.

In simplest terms, Platfora offers exploratory business intelligence against Hadoop-based data. As per last weekend’s post about exploratory BI, a key requirement is speed; and so far as I can tell, any technological innovation Platfora offers relates to the need for speed. Specifically, I drilled into Platfora’s performance architecture on the query processing side (and associated data movement); Platfora also brags of rendering 100s of 1000s of “marks” quickly in HTML5 visualizations, but I haven’t a clue as to whether that’s much of an accomplishment in itself.

Platfora’s marketing suggests it obviates the need for a data warehouse altogether; for most enterprises, of course, that is a great exaggeration. But another dubious aspect of Platfora marketing actually serves to understate the product’s merits — Platfora claims to have an “in-memory” product, when what’s really the case is that Platfora’s memory-centric technology uses both RAM and disk to manage larger data marts than could reasonably fit into RAM alone. Expanding on what I wrote about Platfora when it de-stealthed: Read more
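As a generic illustration of the difference between “in-memory” and “memory-centric” (this is a sketch of the general pattern under my own assumptions, not a description of Platfora’s actual implementation, and all names are invented), consider a segment store that serves hot data-mart segments from a bounded RAM cache and faults cold ones back in from disk:

```python
# Generic memory-centric (RAM + disk) segment store, purely illustrative.
# Hot segments are served from a bounded LRU cache in RAM; cold segments
# stay durable on disk and are faulted back in on demand.
import os
import pickle
from collections import OrderedDict

class SegmentStore:
    def __init__(self, directory, max_in_ram=2):
        self.directory = directory
        self.max_in_ram = max_in_ram
        self.ram = OrderedDict()              # segment_id -> rows, in LRU order
        os.makedirs(directory, exist_ok=True)

    def _path(self, segment_id):
        return os.path.join(self.directory, segment_id + ".pkl")

    def put(self, segment_id, rows):
        with open(self._path(segment_id), "wb") as f:
            pickle.dump(rows, f)              # always durable on disk
        self._cache(segment_id, rows)

    def get(self, segment_id):
        if segment_id in self.ram:            # hot path: served from RAM
            self.ram.move_to_end(segment_id)
            return self.ram[segment_id]
        with open(self._path(segment_id), "rb") as f:
            rows = pickle.load(f)             # cold path: read from disk
        self._cache(segment_id, rows)
        return rows

    def _cache(self, segment_id, rows):
        self.ram[segment_id] = rows
        self.ram.move_to_end(segment_id)
        while len(self.ram) > self.max_in_ram:
            self.ram.popitem(last=False)      # evict least recently used

store = SegmentStore("/tmp/mart_segments")
store.put("q1_sales", [("widgets", 100), ("gadgets", 40)])
store.put("q2_sales", [("widgets", 120)])
store.put("q3_sales", [("gadgets", 90)])      # evicts q1_sales from RAM
print(store.get("q1_sales"))                  # transparently reloaded from disk
```

The point is simply that a bounded cache over durable segments can address marts larger than RAM while still answering most queries at memory speed.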

February 13, 2013

It’s hard to make data easy to analyze

It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.

Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:

*Complex event/stream processing terminology is always problematic.

My thoughts on all this start:  Read more

November 19, 2012

Incremental MapReduce

My clients at Cloudant, Couchbase, and 10gen/MongoDB (Edit: See Alex Popescu’s comment below) all boast a feature called incremental MapReduce. (And they’re not the only ones.) So I feel like making a quick post about it. For starters, I’ll quote myself about Cloudant:

The essence of Cloudant’s incremental MapReduce seems to be that data is selected only if it’s been updated since the last run. Obviously, this only works for MapReduce algorithms that can be run on different subsets of the target data set, with the partial outputs then aggregated in a simple way.
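To make that constraint concrete, here is a minimal Python sketch (mine, not any of these vendors’ actual APIs; the partition layout and helper names are invented) of an incrementally maintainable job: word counting re-aggregates cleanly, so only a changed partition needs to be re-processed and its fresh partial result merged with the older ones.

```python
# Hypothetical sketch of incremental MapReduce for a re-aggregatable job
# (word counting). Only partitions updated since the last run are re-mapped;
# their fresh partial counts replace the stale ones, and the final answer is
# a simple merge of the per-partition results.
from collections import Counter

def run_partition(docs):
    """Map + reduce over one partition: word -> count."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

def merge(partials):
    """Final aggregation: sum the per-partition counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Full run over three partitions.
partitions = {
    "p1": ["a b a", "c"],
    "p2": ["b b"],
    "p3": ["a c c"],
}
partials = {pid: run_partition(docs) for pid, docs in partitions.items()}
print(merge(partials.values()))   # Counter({'a': 3, 'b': 3, 'c': 3})

# Incremental run: only p2 changed, so only p2 is re-processed.
partitions["p2"] = ["b b", "d"]
partials["p2"] = run_partition(partitions["p2"])
print(merge(partials.values()))   # Counter({'a': 3, 'b': 3, 'c': 3, 'd': 1})
```

Summing per-partition counters is associative, which is what makes the shortcut valid; a median, by contrast, cannot be rebuilt from per-partition medians, so jobs of that kind fall outside the scheme.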

These implementations of incremental MapReduce are hacked together by teams vastly smaller than those working on Hadoop, and surely fall short of Hadoop in many areas such as performance, fault-tolerance, and language support. That’s a given. Still, if the jobs are short and simple, those deficiencies may be tolerable.

A StackOverflow thread about MongoDB’s version of incremental MapReduce highlights some of the implementation challenges.

But all practicality aside, let’s return to the point that incremental MapReduce only works for some kinds of MapReduce-based algorithms, and consider how much of a limitation that really is. Looking at the Map steps sheds a little light: Read more

October 18, 2012

Notes on Hadoop adoption and trends

With Strata/Hadoop World being next week, there is much Hadoop discussion. One theme of the season is BI over Hadoop. I have at least 5 clients claiming they’re uniquely positioned to support that (most of whom partner with a 6th client, Tableau); the first 2 whose offerings I’ve actually written about are Teradata Aster and Hadapt. More generally, I’m hearing “Using Hadoop is hard; we’re here to make it easier for you.”

If enterprises aren’t yet happily running business intelligence against Hadoop, what are they doing with it instead? I took the opportunity to ask Cloudera, whose answers didn’t contradict anything I’m hearing elsewhere. As Cloudera tells it (approximately — this part of the conversation* was rushed):   Read more

October 7, 2012

IBM’s ETL

Bearing in mind the difficulties in covering big companies and their products, I had a call with IBM about its core ETL technology (Extract/Transform/Load), and have some notes accordingly. It’s pretty reasonable to say that there are and were a Big Three of high-end ETL vendors:

However, IBM fondly thinks there are a Big Two, on the theory that Informatica PowerCenter can’t scale as well as IBM and Ab Initio can, and hence gets knocked out of deals when particularly strong scalability and throughput are required. Read more

September 24, 2012

Notes on Hadoop adoption

I successfully resisted telephone consulting while on vacation, but I did do some by email. One was on the oft-recurring subject of Hadoop adoption. I think it’s OK to adapt some of that into a post.

Notes on past and current Hadoop adoption include:

Thoughts on how Hadoop adoption will look going forward include: Read more

September 7, 2012

Integrated internet system design

What are the central challenges in internet system design? We probably all have similar lists, comprising issues such as scale, scale-out, throughput, availability, security, programming ease, UI, and general cost-effectiveness. Screw those up, and you don’t have an internet business.

Much new technology addresses those challenges, with considerable success. But the success is usually one silo at a time — a short-request application here, an analytic database there. When it comes to integration, unsolved problems abound.

The top integration and integration-like challenges for me, from a practical standpoint, are:

Other concerns that get mentioned include:

Let’s skip those latter issues for now, focusing instead on the first four.

Read more

August 24, 2012

Hadoop notes: Informatica, Splunk, and IBM

Informatica, Splunk, and IBM are all public companies, and correspondingly reticent to talk about product futures. Hence, anything I might suggest about product futures from any of them won’t be terribly detailed, and even the vague generalities are “the Good Lord willin’ an’ the creek don’ rise”.

Never let a rising creek overflow your safe harbor.

Anyhow:

1. Hadoop can be an awesome ETL (Extract/Transform/Load) execution engine; it can handle huge jobs and perform a great variety of transformations. (Indeed, MapReduce was invented to run giant ETL jobs.) Thus, if one offers a development-plus-execution stack for ETL processes, it might seem appealing to make Hadoop an ETL execution option. And so:

Informatica told me about other interesting Hadoop-related plans as well, but I’m not sure my frieNDA allows me to mention them at all.

IBM, however, is standing aside. Specifically, IBM told me that it doesn’t see the point of doing the same thing, as its ETL engine — presumably derived from the old Ascential product line — is already parallel and performant enough.
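To make point 1 concrete, here is a hypothetical Hadoop Streaming-style ETL job in Python (my own sketch, not any vendor’s generated code): the mapper extracts and cleans raw comma-separated sales records, and the reducer rolls them up by product. The input layout is invented for illustration.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming-style ETL job. The mapper parses and cleans
# raw "date,product,amount" lines; the reducer sums amounts per product from
# the sorted mapper output. Test locally with:
#   cat sales.csv | python etl.py map | sort | python etl.py reduce
import sys

def mapper(lines):
    """Extract and transform: drop malformed rows, normalize the key."""
    for line in lines:
        parts = line.rstrip("\n").split(",")
        if len(parts) != 3:
            continue                          # discard malformed records
        _, product, amount = parts
        try:
            value = float(amount)
        except ValueError:
            continue                          # discard non-numeric amounts
        print("%s\t%s" % (product.strip().lower(), value))

def reducer(lines):
    """Aggregate: sum amounts per product, relying on sorted input."""
    current, total = None, 0.0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print("%s\t%s" % (current, total))
            current, total = key, 0.0
        total += float(value)
    if current is not None:
        print("%s\t%s" % (current, total))

if __name__ == "__main__":
    {"map": mapper, "reduce": reducer}[sys.argv[1]](sys.stdin)
```

Because each mapper and reducer is an independent process over a slice of the input, Hadoop can spread such a job across as many nodes as the data warrants, which is exactly what makes it attractive as an ETL execution engine.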

2. Last year, I suggested that Splunk and Hadoop are competitors in managing machine-generated data. That’s still true, but Splunk is also preparing a Hadoop co-opetition strategy. To a first approximation, it’s just Hadoop import/export. However, suppose you view Splunk as offering a three-layer stack: Read more
