EAI, EII, ETL, ELT, ETLT
Analysis of data integration products and technologies, especially ones related to data warehousing, such as ELT (Extract/Transform/Load). Related subjects include:
From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that includes:
- Hadoop (always, and please see below).
- Analytic RDBMS (ditto).
- NoSQL and NewSQL.
- Specifically, SQL-on-Hadoop
- Spark and other memory-centric technology, including streaming.
- Public policy, mainly but not only in the area of surveillance/privacy.
- General strategic advice for all sizes of tech company.
Other stuff on my mind includes but is not limited to:
1. Certain categories of buying organizations are inherently leading-edge.
- Internet companies have adopted Hadoop, NoSQL, NewSQL and all that en masse. Often, they won’t even look at things that are conventional or expensive.
- US telecom companies have been buying 1 each of every DBMS on the market since pre-relational days.
- Financial services firms — specifically algorithmic traders and broker-dealers — have been in their own technical world for decades …
- … as have national-security agencies …
- … as have pharmaceutical research departments.
Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.
I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.
- Spark is very new. All Spark adoption is recent.
- Databricks was founded to commercialize Spark. It is very much in stealth mode …
- … except insofar as Databricks folks are going out and trying to drum up Spark adoption.
- Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated.
- Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
- Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s views of Spark’s role in the world is different from the Spark team’s.
The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:
- Spark is a distributed execution engine for analytic processes …
- … which works well with Hadoop.
- Spark is distinguished by a flexible in-memory data model …
- … and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
- Intended analytic use cases for Spark include:
- SQL data manipulation.
- ETL-like data manipulation.
- Streaming-like data manipulation.
- Machine learning.
- Graph analytics.
It took me a bit of time, and an extra call with Vertica’s long-time R&D chief Shilpa Lawande, but I think I have a decent handle now on Vertica 7, code-named Crane. The two aspects of Vertica 7 I find most interesting are:
- Flex Zone, a schema-on-need technology very much like Hadapt’s (but of course with access to Vertica performance).
- What sounds like an alternate query execution capability for short-request queries, the big point of which is that it saves them from being broadcast across the whole cluster, hence improving scalability. (Adding nodes of course doesn’t buy you much for the portion of a workload that’s broadcast.)
Other Vertica 7 enhancements include:
- A lot of Bottleneck Whack-A-Mole.
- “Significant” improvements to the Vertica management console.
- Security enhancements (Kerberos), Hadoop integration enhancements (HCatalog), and enhanced integration with Hadoop security (Kerberos again).
- Some availability hardening. (“Fault groups”, which for example let you ensure that data is replicated not just to 2+ nodes, but also that the nodes aren’t all on the same rack.)
- Java as an option to do in-database analytics. (Who knew that feature was still missing?)
- Some analytic functionality. (Approximate COUNT DISTINCT, but not yet Approximate MEDIAN.)
Overall, two recurring themes in our discussion were:
- Load and ETL (Extract/Transform/Load) performance, and/or obviating ETL.
- Short-request performance, in the form of more scalable short-request concurrency.
Glassbeam checked in recently, and they turn out to exemplify quite a few of the themes I’ve been writing about. For starters:
- Glassbeam has an analytic technology stack focused on poly-structured machine-generated data.
- Glassbeam partially organizes that data into event series …
- … in a schema that is modified as needed.
Glassbeam basics include:
- Founded in 2009.
- Based in Santa Clara. Back-end engineering in Bangalore.
- $6 million in angel money; no other VC.
- High single-digit customer count, …
- … plus another high single-digit number of end customers for an OEM offering a limited version of their product.
All Glassbeam customers except one are SaaS/cloud (Software as a Service), and even that one was only offered a subscription (as oppose to perpetual license) price.
So what does Glassbeam’s technology do? Glassbeam says it is focused on “machine data analytics,” specifically for the “Internet of Things”, which it distinguishes from IT logs.* Specifically, Glassbeam sells to manufacturers of complex devices — IT (most of its sales so far ), medical, automotive (aspirational to date), etc. — and helps them analyze “phone home” data, for both support/customer service and marketing kinds of use cases. As of a recent release, the Glassbeam stack can: Read more
Much of modern analytic technology deals with what might be called an entity-centric sequence of events. For example:
- You receive and open various emails.
- You click on and look at various web sites and pages.
- Specific elements are displayed on those pages.
- You study various products, and even buy some.
Analytic questions are asked along the lines “Which sequences of events are most productive in terms of leading to the events we really desire?”, such as product sales. Another major area is sessionization, along with data preparation tasks that boil down to arranging data into meaningful event sequences in the first place.
A number of my clients are focused on such scenarios, including WibiData, Teradata Aster (e.g. via nPath), Platfora (in the imminent Platfora 3), and others. And so I get involved in naming exercises. The term entity-centric came along a while ago, because “user-centric” is too limiting. (E.g., the data may not be about a person, but rather specifically about the actions taken on her mobile device.) Now I’m adding the term event series to cover the whole scenario, rather than the “event sequence(s)” I might appear to have been hinting at above.
I decided on “event series” earlier this week, after noting that: Read more
|Categories: Aster Data, Business intelligence, Data warehousing, EAI, EII, ETL, ELT, ETLT, Platfora, Predictive modeling and advanced analytics, Teradata, Vertica Systems, Web analytics, WibiData||19 Comments|
ClearStory Data is:
- One of the two start-ups I’m most closely engaged with.
- Run by a CEO for whom I have great regard, but who does get rather annoying about secrecy.
- On the verge, finally, of fully destealthing.
I think I can do an interesting post about ClearStory while tap-dancing around the still-secret stuff, so let’s dive in.
- Has developed a full-stack business intelligence technology — which will however be given a snazzier name than “BI” — that is focused on incorporating a broad variety of third-party information, usually along with some of the customer’s own data. Thus, ClearStory …
- … pushes Variety and Variability to extremes, more so than it stresses Volume and Velocity. But it does want to be used at interactive/memory-centric speeds.
- Has put a lot of effort into user interface, but in ways that fit my theory that UI is more about navigation than actual display.
- Has much of its technical differentiation in the area of data mustering …
- … and much of the rest in DBMS-like engineering.
- Is a flagship user of Spark.
- Also relies on Storm, HDFS (Hadoop Distributed File System) and various lesser open source projects (e.g. the ubiquitous Zookeeper).
- Is to a large extent written in Scala.
- Is at this time strictly a multi-tenant SaaS (Software as a Service) offering, except insofar as there’s an on-premises agent to help feed customers’ own data into the core ClearStory cloud service.
To a first approximation, ClearStory ingests data in a system built on Storm (code name: Stormy), dumps it into HDFS, and then operates on it in a system built on Spark (code name: Sparky). Along the way there’s a lot of interaction with another big part of the system, a metadata catalog with no code name I know of. Or as I keep it straight:
- ClearStory’s end-user UI talks mainly to Sparky, and also to the metadata store.
- ClearStory’s administrative UI talks mainly to Stormy, and also to the metadata store.
I talked with Teradata about a bunch of stuff yesterday, including this week’s announcements in in-database predictive modeling. The specific news was about partnerships with Fuzzy Logix and Revolution Analytics. But what I found more interesting was the surrounding discussion. In a nutshell:
- Teradata is finally seeing substantial interest in in-database modeling, rather than just in-database scoring (which has been important for years) and in-database data preparation (which is a lot like ELT — Extract/Load/transform).
- Teradata is seeing substantial interest in R.
- It seems as if similar groups of customers are interested in both parts of that, such as:
This is the strongest statement of perceived demand for in-database modeling I’ve heard. (Compare Point #3 of my July predictive modeling post.) And fits with what I’ve been hearing about R.
|Categories: EAI, EII, ETL, ELT, ETLT, Parallelization, Predictive modeling and advanced analytics, Revolution Analytics, SAS Institute, Telecommunications, Teradata||1 Comment|
Hortonworks did a business-oriented round of outreach, talking with at least Derrick Harris and me. Notes from my call — for which Rob Bearden didn’t bother showing up — include, in no particular order:
- Hortonworks denies advanced acquisition discussions with either Microsoft and Intel. Of course, that doesn’t exactly contradict the widespread story of Intel having made an acquisition offer. Edit: I have subsequently heard, very credibly, that the denial was untrue.
- As vendors usually do, Hortonworks denies the extreme forms of Cloudera’s suggestion that Hortonworks competitive wins relate to price slashing. But Hortonworks does believe that its license fees often wind up being lower than Cloudera’s, due especially to Hortonworks offering few extra-charge items than Cloudera.
- Hortonworks used a figure of ~75 subscription customers. Edit: That figure turns out in retrospect to have been inflated. This does not include OEM sales through, for example, Teradata, Microsoft Azure, or Rackspace. However, that does include …
- … a small number of installations hosted in the cloud — e.g. ~2 on Amazon Web Services — or otherwise remotely. Also, testing in the cloud seems to be fairly frequent, and the cloud can also be a source of data ingested into Hadoop.
- Since Hortonworks a couple of times made it seem that Rackspace was an important partner, behind only Teradata and Microsoft, I finally asked why. Answers boiled down to a Rackspace Hadoop-as-a-service offering, plus joint work to improve Hadoop-on-OpenStack.
- Other Hortonworks reseller partners seem more important in terms of helping customers consume HDP (Hortonworks Data Platform), rather than for actually doing Hortonworks’ selling for it. (This is unsurprising — channel sales rarely are a path to success for a product that is also appropriately sold by a direct force.)
- Hortonworks listed its major industry sectors as:
- Web and retailing, which it identifies as one thing.
- Health care (various subsectors).
- Financial services, which it called “competitive” in the kind of tone that usually signifies “we lose a lot more than we win, and would love to change that”.
In Hortonworks’ view, Hadoop adopters typically start with a specific use case around a new type of data, such as clickstream, sensor, server log, geolocation, or social. Read more
My client Syncsort:
- Is an ETL (Extract/Transform/Load) vendor, whose flagship product DMExpress was evidently renamed to DMX.
- Has a strong history in and fondness for sort.
- Has announced a new ETL product, DMX-h ETL Edition, which uses Hadoop MapReduce to parallelize DMX by controlling a copy of DMX that resides on every data node of the Hadoop cluster.*
- Has also announced the closely-related DMX-h Sort Edition, offering acceleration for the sorts inherent in Map and Reduce steps.
- Contributed a patch to Apache Hadoop to open up Hadoop MapReduce to make all this possible.
*Perhaps we should question Syncsort’s previous claims of having strong multi-node parallelism already.
The essence of the Syncsort DMX-h ETL Edition story is:
- DMX-h inherits the various ETL-suite trappings of DMX.
- Syncsort claims DMX-h has major performance advantages vs., for example, Hive- or Pig-based alternatives.
- With a copy of DMX on every node, DMX-h can do parallel load/export.
Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.
Discussions of DBMS performance are always odd, for starters because:
- Workloads and use cases vary greatly.
- In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.
- It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
- Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
- In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
- Many workloads are inherently single node (replication aside). Others are not.
MongoDB and 10gen
I caught up with Ron Avnur at 10gen. Technical highlights included: Read more