EAI, EII, ETL, ELT, ETLT
Analysis of data integration products and technologies, especially ones related to data warehousing, such as ETL (Extract/Transform/Load). Related subjects include:
In a companion introduction to Kafka post, I observed that Kafka at its core is remarkably simple. Confluent offers a marchitecture diagram that illustrates what else is on offer, about which I’ll note:
- The red boxes — “Ops Dashboard” and “Data Flow Audit” — are the initial closed-source part. No surprise that they sound like management tools; that’s the traditional place for closed source add-ons to start.
- “Schema Management”
  - Is used to define fields and so on.
  - Is not equivalent to what is ordinarily meant by schema validation, in that …
  - … it allows schemas to change, but puts constraints on which changes are allowed.
  - Is done in plug-ins that live with the producer or consumer of data.
  - Is based on the Hadoop-oriented file format Avro.
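To make "allows schemas to change, but puts constraints on which changes are allowed" concrete, here is a minimal sketch of one such constraint, loosely in the spirit of Avro-style backward compatibility. It is an illustration only, not Confluent's implementation: the function name `is_backward_compatible`, the example schemas and the rule set (an added field needs a default; no type changes at all) are simplifying assumptions.

```python
def is_backward_compatible(old_schema, new_schema):
    """Return True if records written with old_schema can be read with new_schema.

    Simplified, hypothetical rules:
      * a field that is new in new_schema must declare a default;
      * a field present in both schemas must keep the same type.
    Fields dropped from new_schema are simply ignored by the reader.
    """
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        old = old_fields.get(field["name"])
        if old is None:
            # Old records lack this field, so the reader needs a default value.
            if "default" not in field:
                return False
        elif old["type"] != field["type"]:
            # Real systems permit some type promotions; this sketch permits none.
            return False
    return True


v1 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string"},
]}

# Allowed change: the added field carries a default.
v2_ok = {"type": "record", "name": "User", "fields": v1["fields"] + [
    {"name": "country", "type": "string", "default": "unknown"},
]}

# Disallowed change: the added field has no default.
v2_bad = {"type": "record", "name": "User", "fields": v1["fields"] + [
    {"name": "country", "type": "string"},
]}

ok_result = is_backward_compatible(v1, v2_ok)
bad_result = is_backward_compatible(v1, v2_bad)
```

A producer-side or consumer-side plug-in could run a check of this kind before accepting a new schema version, which is what distinguishes it from validating individual records against a fixed schema.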
Kafka offers little in the way of analytic data transformation and the like. Hence, it’s commonly used with companion products.
Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.
Making Cloudera run in the cloud has three major aspects:
- Cloudera’s usual software, ported to run on the cloud platform(s).
- Cloudera Director, which for example launches cloud instances.
- Points of integration, e.g. taking information about security-oriented roles from the platform and feeding them to the role-based security that is specific to Cloudera Enterprise.
Features new in this week’s release of Cloudera Director include:
- An API for job submission.
- Support for spot and preemptible instances.
- High availability.
- Some cluster repair.
- Some cluster cloning.
I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.
As for porting, let me start by noting: Read more
I’m on two overlapping posting kicks, namely “lessons from the past” and “stuff I keep saying so might as well also write down”. My recent piece on Oracle as the new IBM is an example of both themes. In this post, another example, I’d like to memorialize some points I keep making about business intelligence and other analytics. In particular:
- BI relies on strong data access capabilities. This is always true. Duh.
- Therefore, BI and other analytics vendors commonly reinvent the data management wheel. This trend ebbs and flows with technology cycles.
Similarly, BI has often been tied to data integration/ETL (Extract/Transform/Load) functionality.* But I won’t address that subject further at this time.
*In the Hadoop/Spark era, that’s even truer of other analytics than it is of BI.
My top historical examples include:
- The 1970s analytic fourth-generation languages (RAMIS, NOMAD, FOCUS, et al.) commonly combined reporting and data management.
- The best BI visualization technology of the 1980s, Executive Information Systems (EIS), was generally unsuccessful. The core reason was a lack of what we’d now call drilldown. Not coincidentally, EIS vendors — notably leader Comshare — didn’t do well at DBMS-like technology.
- Business Objects, one of the pioneers of the modern BI product category, rose in large part on the strength of its “semantic layer” technology. (If you don’t know what that is, you can imagine it as a kind of virtual data warehouse modest enough in its ambitions to actually be workable.)
- Cognos, the other pioneer of modern BI, depended on capabilities for which it needed a bundled MOLAP (Multidimensional OnLine Analytic Processing) engine.
  - But Cognos later stopped needing that engine, which underscores my point about technology ebbing and flowing.
I only have mixed success at getting my clients to reach out to me for messaging advice when they’re introducing something new. Cloudera Navigator Optimizer, which is being announced along with Cloudera 5.5, is one of my failures in that respect; I heard about it for the first time Tuesday afternoon. I hate the name. I hate some of the slides I saw. But I do like one part of the messaging, namely the statement that this is about “refactoring” queries.
All messaging quibbles aside, I think the Cloudera Navigator Optimizer story is actually pretty interesting, and perhaps not just to users of SQL-on-Hadoop technologies such as Hive (which I guess I’d put in that category for simplicity) or Impala. As I understand Cloudera Navigator Optimizer:
- It’s all about analytic SQL queries.
- Specifically, it’s about reducing duplicated work.
- It is not an “optimizer” in the ordinary RDBMS sense of the word.
- It’s delivered via SaaS (Software as a Service).
- Conceptually, it’s not really tied to SQL-on-Hadoop. However, …
- … in practice it likely will be used by customers who want to optimize performance of Cloudera’s preferred styles of SQL-on-Hadoop, either because they’re already using SQL-on-Hadoop or in connection with an initial migration.
One pleasure in talking with my clients at MongoDB is that few things are NDA. So let’s start with some numbers:
- >2,000 named customers, the vast majority of which are unique organizations who do business with MongoDB directly.
- ~75,000 users of MongoDB Cloud Manager.
- Estimated ~1/4 million production users of MongoDB total.
Also >530 staff, and I think that number is a little out of date.
MongoDB lacks many capabilities RDBMS users take for granted. MongoDB 3.2, which I gather is slated for early November, narrows that gap, but only by a little. Features include:
- Some JOIN capabilities.
  - Specifically, these are left outer joins, so they’re for lookup but not for filtering.
  - JOINs are not restricted to specific shards of data …
  - … but do benefit from data co-location when it occurs.
- A BI connector. Think of this as a MongoDB-to-SQL translator. Using this does require somebody to go in and map JSON schemas and relational tables to each other. Once that’s done, the flow is:
  - Basic SQL comes in.
  - Filters and GroupBys are pushed down to MongoDB. A result set … well, it results.
  - The result set is formatted into a table and returned to the system (for example, a business intelligence tool) that sent the SQL.
- Database-side document validation, in the form of field-specific rules that combine into a single expression against which to check a document.
  - This is fairly simple stuff: no dependencies among fields in the same document, let alone foreign key relationships.
  - MongoDB argues, persuasively, that this simplicity makes it unlikely to recreate the spaghetti code maintenance nightmare that was 1990s stored procedures.
  - MongoDB concedes that, for performance, it will ordinarily be a good idea to still do your validation on the client side.
  - MongoDB points out that enforcement can be either strict (throw errors) or relaxed (just note invalid documents to a log). The latter option is what makes it possible to install this feature without breaking your running system.
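The validation idea above (field-specific rules combined into one expression, enforced either strictly or in a relaxed, log-only way) can be sketched in plain Python. This is not MongoDB's API. `RULES`, `is_valid`, `insert`, `DocumentInvalid` and the `mode` values are all hypothetical names chosen for illustration, and the in-memory list stands in for a collection.

```python
import logging

logger = logging.getLogger("validation_sketch")

# Hypothetical field-specific rules. Each rule looks at one field only,
# so there are no dependencies among fields in the same document.
RULES = {
    "email": lambda v: isinstance(v, str) and "@" in v,
    "age": lambda v: isinstance(v, int) and v >= 0,
}


class DocumentInvalid(Exception):
    """Raised in strict mode when a document fails validation."""


def is_valid(doc):
    # The per-field rules combine into a single expression: all must hold.
    # Assumption of this sketch: a missing field counts as a failure.
    return all(name in doc and rule(doc[name]) for name, rule in RULES.items())


def insert(collection, doc, mode="strict"):
    """Append doc to collection, enforcing RULES according to mode."""
    if not is_valid(doc):
        if mode == "strict":
            # Strict enforcement: reject the write with an error.
            raise DocumentInvalid(f"document failed validation: {doc!r}")
        # Relaxed enforcement: accept the write, but note it in a log.
        logger.warning("invalid document accepted: %r", doc)
    collection.append(doc)
```

With `mode="relaxed"`, existing writers keep working while the log shows which documents would have been rejected, which is why that option makes it safe to turn the feature on for a running system.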
There’s also a closed-source database introspection tool coming, currently codenamed MongoDB Scout.
A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.
To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example, mine.)
Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:
- Inconsistent, in which case humans might not know how to look it up and database JOINs might fail.
- Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)
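As a small illustration of how inconsistency makes data unreachable, the sketch below joins two hypothetical record sets on a customer name that is spelled differently in each. The data, the `join` helper and the `normalize` rule are all made up for the example; real cleanup rules are usually far messier.

```python
customers = [
    {"name": "Acme Corp.", "region": "East"},
    {"name": "Globex", "region": "West"},
]
orders = [
    {"customer": "ACME Corp", "amount": 100},
    {"customer": "Globex", "amount": 250},
]


def join(orders, customers, key=lambda s: s):
    """Inner-join orders to customers on the (optionally normalized) name."""
    by_name = {key(c["name"]): c for c in customers}
    return [
        (o["amount"], by_name[key(o["customer"])]["region"])
        for o in orders
        if key(o["customer"]) in by_name
    ]


def normalize(name):
    # Hypothetical cleanup rule: ignore case and a trailing period.
    return name.lower().rstrip(".").strip()


# An exact-match join silently drops the Acme order.
exact = join(orders, customers)

# Joining on normalized names recovers it.
normalized = join(orders, customers, key=normalize)
```

Here `exact` contains only the Globex order, while `normalized` contains both, so the Acme order was in the data all along but could not be reached by the join.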
Inconsistency can take multiple forms, including: Read more
Let’s start with some terminology biases:
- I dislike the term “big data” but like the Vs that define it — Volume, Velocity, Variety and Variability.
- Though I think it’s silly, I understand why BI innovators flee from the term “business intelligence” (they’re afraid of not sounding new).
So when my clients at Zoomdata told me that they’re in the business of providing “the fastest visual analytics for big data”, I understood their choice, but rolled my eyes anyway. And then I immediately started to check how their strategy actually plays against the “big data” Vs.
It turns out that:
- Zoomdata does its processing server-side, which allows for load-balancing and scale-out. Scale-out and claims of great query speed are relevant when data is of high volume.
- Zoomdata depends heavily on Spark.
- Zoomdata’s UI assumes data can be a mix of historical and streaming, and that if looking at streaming data you might want to also check history. This addresses velocity.
- Zoomdata assumes data can be in a variety of data stores, including:
  - Relational (operational RDBMS, analytic RDBMS, or SQL-on-Hadoop).
  - Files (generic HDFS, i.e. Hadoop Distributed File System, or S3).*
  - NoSQL (MongoDB and HBase were mentioned).
  - Search (Elasticsearch was mentioned among others).
- Zoomdata also tries to detect data variability.
- Zoomdata is OEM/embedding-friendly.
*The HDFS/S3 aspect seems to be a major part of Zoomdata’s current story.
Core aspects of Zoomdata’s technical strategy include: Read more
Occasionally I talk with an astute reporter — there are still a few left — and get led toward angles I hadn’t considered before, or at least hadn’t written up. A blog post may then ensue. This is one such post.
There is a group of questions going around that includes:
- Is Hadoop overhyped?
- Has Hadoop adoption stalled?
- Is Hadoop adoption being delayed by skills shortages?
- What is Hadoop really good for anyway?
- Which adoption curves for previous technologies are the best analogies for Hadoop?
To a first approximation, my responses are: Read more
I’m skeptical of data federation. I’m skeptical of all-things-to-all-people claims about logical data layers, and in particular of Gartner’s years-premature “Logical Data Warehouse” buzzphrase. Still, a reasonable number of my clients are stealthily trying to do some kind of data layer middleware, as are other vendors more openly, and I don’t think they’re all crazy.
Here are some thoughts as to why, and also as to challenges that need to be overcome.
There are many things a logical data layer might be trying to facilitate — writing, querying, batch data integration, real-time data integration and more. That said:
- When you’re writing data, you want it to be banged into a sufficiently-durable-to-acknowledge condition fast. If acknowledgements are slow, performance nightmares can ensue. So writing is the last place you want an extra layer, perhaps unless you’re content with the durability provided by an in-memory data grid.
- Queries are important. Also, they formally are present in other tasks, such as data transformation and movement. That’s why data manipulation packages (originally Pig, now Hive and fuller SQL) are so central to Hadoop.
Over the past couple years, there have been various quick comments and vague press releases about “BI for NoSQL”. I’ve had trouble, however, imagining what it could amount to that was particularly interesting, with my confusion boiling down to “Just what are you aggregating over what?” Recently I raised the subject with a few leading NoSQL companies. The result is that my confusion was expanded. Here’s the small amount that I have actually figured out.
As I noted in a recent post about data models, many databases — in particular SQL and NoSQL ones — can be viewed as collections of <name, value> pairs.
- In a relational database, a record is a collection of <name, value> pairs with a particular and predictable — i.e. derived from the table definition — sequence of names. Further, a record usually has an identifying key (commonly one of the first values).
- Something similar can be said about structured-document stores — i.e. JSON or XML — except that the sequence of names may not be consistent from one document to the next. Further, there’s commonly a hierarchical relationship among the names.
- For these purposes, a “wide-column” NoSQL store like Cassandra or HBase can be viewed much as a structured-document store, albeit with different performance optimizations and characteristics and a different flavor of DML (Data Manipulation Language).
Consequently, a NoSQL database can often be viewed as a table or a collection of tables, except that:
- The NoSQL database is likely to have more null values.
- The NoSQL database, in a naive translation toward relational, may have repeated values. So a less naive translation might require extra tables.
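The two caveats above can be seen in a minimal sketch that turns a handful of JSON-like documents into relational-style rows. The documents, the `to_tables` helper and the choice to put every repeated value into a single child table are assumptions made for illustration.

```python
docs = [
    {"_id": 1, "name": "Ann", "tags": ["a", "b"]},
    {"_id": 2, "name": "Bob", "city": "Boston"},
]


def to_tables(docs, key="_id"):
    """Split documents into a parent table of scalars and a child table of repeated values."""
    # The parent table's columns are the union of scalar field names,
    # in first-seen order.
    columns = []
    for d in docs:
        for name, value in d.items():
            if not isinstance(value, list) and name not in columns:
                columns.append(name)

    # A field missing from a document becomes a null (None) in its row.
    parent = [tuple(d.get(c) for c in columns) for d in docs]

    # Repeated values go to an extra table keyed by the parent's identifier.
    child = [
        (d[key], name, item)
        for d in docs
        for name, value in d.items()
        if isinstance(value, list)
        for item in value
    ]
    return columns, parent, child


columns, parent, child = to_tables(docs)
```

With these inputs the parent table has columns `_id`, `name` and `city`, Ann's row has a null `city`, and her two tags land as two rows in the child table.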
That’s all straightforward to deal with if you’re willing to write scripts to extract the NoSQL data and transform or aggregate it as needed. But things get tricky when you try to insist on some kind of point-and-click. And by the way, that last comment pertains to BI and ETL (Extract/Transform/Load) alike. Indeed, multiple people I talked with on this subject conflated BI and ETL, and they were probably right to do so.