EAI, EII, ETL, ELT, ETLT
Analysis of data integration products and technologies, especially ones related to data warehousing, such as ELT (Extract/Transform/Load). Related subjects include:
ClearStory Data is:
- One of the two start-ups I’m most closely engaged with.
- Run by a CEO for whom I have great regard, but who does get rather annoying about secrecy.
- On the verge, finally, of fully destealthing.
I think I can do an interesting post about ClearStory while tap-dancing around the still-secret stuff, so let’s dive in.
- Has developed a full-stack business intelligence technology — which will however be given a snazzier name than “BI” — that is focused on incorporating a broad variety of third-party information, usually along with some of the customer’s own data. Thus, ClearStory …
- … pushes Variety and Variability to extremes, more so than it stresses Volume and Velocity. But it does want to be used at interactive/memory-centric speeds.
- Has put a lot of effort into user interface, but in ways that fit my theory that UI is more about navigation than actual display.
- Has much of its technical differentiation in the area of data mustering …
- … and much of the rest in DBMS-like engineering.
- Is a flagship user of Spark.
- Also relies on Storm, HDFS (Hadoop Distributed File System) and various lesser open source projects (e.g. the ubiquitous Zookeeper).
- Is to a large extent written in Scala.
- Is at this time strictly a multi-tenant SaaS (Software as a Service) offering, except insofar as there’s an on-premises agent to help feed customers’ own data into the core ClearStory cloud service.
To a first approximation, ClearStory ingests data in a system built on Storm (code name: Stormy), dumps it into HDFS, and then operates on it in a system built on Spark (code name: Sparky). Along the way there’s a lot of interaction with another big part of the system, a metadata catalog with no code name I know of. Or as I keep it straight:
- ClearStory’s end-user UI talks mainly to Sparky, and also to the metadata store.
- ClearStory’s administrative UI talks mainly to Stormy, and also to the metadata store.
I talked with Teradata about a bunch of stuff yesterday, including this week’s announcements in in-database predictive modeling. The specific news was about partnerships with Fuzzy Logix and Revolution Analytics. But what I found more interesting was the surrounding discussion. In a nutshell:
- Teradata is finally seeing substantial interest in in-database modeling, rather than just in-database scoring (which has been important for years) and in-database data preparation (which is a lot like ELT — Extract/Load/transform).
- Teradata is seeing substantial interest in R.
- It seems as if similar groups of customers are interested in both parts of that, such as:
This is the strongest statement of perceived demand for in-database modeling I’ve heard. (Compare Point #3 of my July predictive modeling post.) And fits with what I’ve been hearing about R.
|Categories: EAI, EII, ETL, ELT, ETLT, Parallelization, Predictive modeling and advanced analytics, Revolution Analytics, SAS Institute, Telecommunications, Teradata||1 Comment|
Hortonworks did a business-oriented round of outreach, talking with at least Derrick Harris and me. Notes from my call — for which Rob Bearden didn’t bother showing up — include, in no particular order:
- Hortonworks denies advanced acquisition discussions with either Microsoft and Intel. Of course, that doesn’t exactly contradict the widespread story of Intel having made an acquisition offer. Edit: I have subsequently heard, very credibly, that the denial was untrue.
- As vendors usually do, Hortonworks denies the extreme forms of Cloudera’s suggestion that Hortonworks competitive wins relate to price slashing. But Hortonworks does believe that its license fees often wind up being lower than Cloudera’s, due especially to Hortonworks offering few extra-charge items than Cloudera.
- Hortonworks used a figure of ~75 subscription customers. Edit: That figure turns out in retrospect to have been inflated. This does not include OEM sales through, for example, Teradata, Microsoft Azure, or Rackspace. However, that does include …
- … a small number of installations hosted in the cloud — e.g. ~2 on Amazon Web Services — or otherwise remotely. Also, testing in the cloud seems to be fairly frequent, and the cloud can also be a source of data ingested into Hadoop.
- Since Hortonworks a couple of times made it seem that Rackspace was an important partner, behind only Teradata and Microsoft, I finally asked why. Answers boiled down to a Rackspace Hadoop-as-a-service offering, plus joint work to improve Hadoop-on-OpenStack.
- Other Hortonworks reseller partners seem more important in terms of helping customers consume HDP (Hortonworks Data Platform), rather than for actually doing Hortonworks’ selling for it. (This is unsurprising — channel sales rarely are a path to success for a product that is also appropriately sold by a direct force.)
- Hortonworks listed its major industry sectors as:
- Web and retailing, which it identifies as one thing.
- Health care (various subsectors).
- Financial services, which it called “competitive” in the kind of tone that usually signifies “we lose a lot more than we win, and would love to change that”.
In Hortonworks’ view, Hadoop adopters typically start with a specific use case around a new type of data, such as clickstream, sensor, server log, geolocation, or social. Read more
My client Syncsort:
- Is an ETL (Extract/Transform/Load) vendor, whose flagship product DMExpress was evidently renamed to DMX.
- Has a strong history in and fondness for sort.
- Has announced a new ETL product, DMX-h ETL Edition, which uses Hadoop MapReduce to parallelize DMX by controlling a copy of DMX that resides on every data node of the Hadoop cluster.*
- Has also announced the closely-related DMX-h Sort Edition, offering acceleration for the sorts inherent in Map and Reduce steps.
- Contributed a patch to Apache Hadoop to open up Hadoop MapReduce to make all this possible.
*Perhaps we should question Syncsort’s previous claims of having strong multi-node parallelism already.
The essence of the Syncsort DMX-h ETL Edition story is:
- DMX-h inherits the various ETL-suite trappings of DMX.
- Syncsort claims DMX-h has major performance advantages vs., for example, Hive- or Pig-based alternatives.
- With a copy of DMX on every node, DMX-h can do parallel load/export.
Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.
Discussions of DBMS performance are always odd, for starters because:
- Workloads and use cases vary greatly.
- In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.
- It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
- Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
- In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
- Many workloads are inherently single node (replication aside). Others are not.
MongoDB and 10gen
I caught up with Ron Avnur at 10gen. Technical highlights included: Read more
Well-resourced Silicon Valley start-ups typically announce their existence multiple times. Company formation, angel funding, Series A funding, Series B funding, company launch, product beta, and product general availability may not be 7 different “news events”, but they’re apt to be at least 3-4. Platfora, no exception to this rule, is hitting general availability today, and in connection with that I learned a bit more about what they are up to.
In simplest terms, Platfora offers exploratory business intelligence against Hadoop-based data. As per last weekend’s post about exploratory BI, a key requirement is speed; and so far as I can tell, any technological innovation Platfora offers relates to the need for speed. Specifically, I drilled into Platfora’s performance architecture on the query processing side (and associated data movement); Platfora also brags of rendering 100s of 1000s of “marks” quickly in HTML5 visualizations, but I haven’t a clue as to whether that’s much of an accomplishment in itself.
Platfora’s marketing suggests it obviates the need for a data warehouse at all; for most enterprises, of course, that is a great exaggeration. But another dubious aspect of Platfora marketing actually serves to understate the product’s merits — Platfora claims to have an “in-memory” product, when what’s really the case is that Platfora’s memory-centric technology uses both RAM and disk to manage larger data marts than could reasonably be fit into RAM alone. Expanding on what I wrote about Platfora when it de-stealthed: Read more
|Categories: Business intelligence, Columnar database management, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Market share and customer counts, Memory-centric data management, Platfora, Workload management||12 Comments|
It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to an analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- “Forget all that transformation foofarah — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start: Read more
My clients at Cloudant, Couchbase, and 10gen/MongoDB (Edit: See Alex Popescu’s comment below) all boast the feature incremental MapReduce. (And they’re not the only ones.) So I feel like making a quick post about it. For starters, I’ll quote myself about Cloudant:
The essence of Cloudant’s incremental MapReduce seems to be that data is selected only if it’s been updated since the last run. Obviously, this only works for MapReduce algorithms whose eventual output can be run on different subsets of the target data set, then aggregated in a simple way.
These implementations of incremental MapReduce are hacked together by teams vastly smaller than those working on Hadoop, and surely fall short of Hadoop in many areas such as performance, fault-tolerance, and language support. That’s a given. Still, if the jobs are short and simple, those deficiencies may be tolerable.
A StackOverflow thread about MongoDB’s version of incremental MapReduce highlights some of the implementation challenges.
But all practicality aside, let’s return to the point that incremental MapReduce only works for some kinds of MapReduce-based algorithms, and consider how much of a limitation that really is. Looking at the Map steps sheds a little light: Read more
|Categories: Cloudant, Couchbase, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, MongoDB, RDF and graphs||1 Comment|
With Strata/Hadoop World being next week, there is much Hadoop discussion. One theme of the season is BI over Hadoop. I have at least 5 clients claiming they’re uniquely positioned to support that (most of whom partner with a 6th client, Tableau); the first 2 whose offerings I’ve actually written about are Teradata Aster and Hadapt. More generally, I’m hearing “Using Hadoop is hard; we’re here to make it easier for you.”
If enterprises aren’t yet happily running business intelligence against Hadoop, what are they doing with it instead? I took the opportunity to ask Cloudera, whose answers didn’t contradict anything I’m hearing elsewhere. As Cloudera tells it (approximately — this part of the conversation* was rushed): Read more
|Categories: Business intelligence, Cloudera, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, Health care, Investment research and trading, MapR, Market share and customer counts, Telecommunications, Web analytics||5 Comments|
Bearing in mind the difficulties in covering big companies and their products, I had a call with IBM about its core ETL technology (Extract/Transform/Load), and have some notes accordingly. It’s pretty reasonable to say that there are and were a Big Three of high-end ETL vendors:
- Ab Initio
However, IBM fondly thinks there are a Big Two, on the theory that Informatica Powercenter can’t scale as well as IBM and Ab Initio can, and hence gets knocked out of deals when particularly strong scalability and throughput are required. Read more