EAI, EII, ETL, ELT, ETLT
Analysis of data integration products and technologies, especially ones related to data warehousing, such as ELT (Extract/Transform/Load). Related subjects include:
Human real-time
I first became an analyst in 1981. And so I was around for the early days of the movement from batch to interactive computing, as exemplified by:
- The rise of minicomputers as mainframe alternatives (first VAXen, then the ‘nix systems that did largely supplant mainframes).
- The move from batch to interactive computing even on mainframes, a key theme of 1980s application software industry competition.
Of course, wherever there is interactive computing, there is a desire for interaction so fast that users don’t notice any wait time. Dan Fylstra, when he was pitching me the early windowing system VisiOn, characterized this as response so fast that the user didn’t tap his fingers waiting.* And so, with the move to any kind of interactive computing at all came a desire that the interaction be quick-response/low-latency. Read more
DataStax Enterprise 2.0
Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.
My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:
- Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
- Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair.
- Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
- Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
- log4j – an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
- DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.
DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:
- You can manage the interactive data for a web site.
- You can store the logs for that website.
- You can analyze all of the above in Hadoop.
Juggling analytic databases
I’d like to survey a few related ideas:
- Enterprises should each have a variety of different analytic data stores.
- Vendors — especially but not only IBM and Teradata — are acknowledging and marketing around the point that enterprises should each have a number of different analytic data stores.
- In addition to having multiple analytic data management technology stacks, it is also desirable to have an agile way to spin out multiple virtual or physical relational data marts using a single RDBMS. Vendors are addressing that need.
- Some observers think that the real essence of analytic data management will be in data integration, not the actual data management.
Here goes. Read more
Kinds of data integration and movement
“Data integration” can mean many different things, to an extent that’s impeding me from writing about the area. So I’ll start by simply laying out some of the myriad ways that data can be brought to where it is needed, and worry about other subjects later. Yes, this is a massive wall of text, and incomplete even so — but that in itself is my central point.
There are two main paradigms for data integration:
- Movement or replication — you take data from one place and copy it to another.
- Federation — you treat data in multiple different places logically as if it were all in one database.
Data movement and replication typically take one of three forms:
- Logical, transactional, or trigger-based — sending data across the wire every time an update happens, or as the result of a large-result-set query/extract, or in response to a specific request.
- Log-based — like logical replication, but driven by the transaction/update log rather than the core data management mechanism itself, so as to avoid directly overstressing the DBMS.
- Block/file-based — sending chunks of data, and expecting the target system to store them first and only make sense of them afterward.
Beyond the core functions of movement, replication, and/or federation, there are other concerns closely connected to data integration. These include:
- Transparency and emulation, e.g. via a layer of software that makes data in one format look like it’s in another. (If memory serves, this is the use case for which Larry DeBoever coined the term “middleware.”)
- Cleaning and quality — with new uses of data can come new requirements for accuracy.
- Master, reference, or canonical data –
- Archiving and information preservation — part of keeping data safe is ensuring that there are copies at various physical locations. Another part can be making it logically tamper-proof, or at least highly auditable.
In particular, the following are largely different from each other. Read more
| Categories: Clustering, Data integration and middleware, EAI, EII, ETL, ELT, ETLT, eBay, Hadoop, MapReduce | 8 Comments |
Departmental analytics — best practices
I believe IT departments should support and encourage departmental analytics efforts, where “support” and “encourage” are not synonyms for “control”, “dominate”, “overwhelm”, or even “tame”. A big part of that is:
Let, and indeed help, departments have the data they want, when they want it, served with blazing performance.
Three things that absolutely should NOT be obstacles to these ends are:
- Corporate DBMS standards.
- Corporate data governance processes.
- The difficulties of ETL.
| Categories: Business intelligence, Data mart outsourcing, Data warehousing, EAI, EII, ETL, ELT, ETLT, Predictive modeling and advanced analytics | 4 Comments |
Agile predictive analytics — the “easy” parts
I’m hearing a lot these days about agile predictive analytics, albeit rarely in those exact terms. The general idea is unassailable, in that it boils down to using data as quickly as reasonably possible. But discussing particulars is hard, for several reasons:
- Pundits tend to sketch castles in the air.
- Vendors tend to confuse part of the story — generally the part they happen to offer
— with the whole. - Different use cases give rise to different kinds of issues.
At least three of the generic arguments for agility apply to predictive analytics:
- Doing the correct thing soon is usually better than doing the same correct thing later.
- If it doesn’t take much time to do something, hopefully it doesn’t take that much expense (labor and so on) either.
- It’s hard to get new stuff completely right on the first try. Often, the best strategy is to come close fast, then fix what’s still not ideal.
But the reasons to want agile predictive analytics don’t stop there.
| Categories: EAI, EII, ETL, ELT, ETLT, Investment research and trading, Predictive modeling and advanced analytics | 13 Comments |
QlikView 11 and the rise of collaborative BI
QlikView 11 came out last month. Let me start by pointing out:
- As one might expect, QlikView 11 contains fairly leading-edge stuff, but also some “better late than never” features.
- The leading-edge stuff is concentrated in the general area of “collaboration”.
- Additionally, QlikTech is always pushing the QlikView user interface ahead in various ways.
- The “Well, it’s about time!” feature list starts with the ability to load QlikView via third-party ETL tools (Informatica now, others coming).
- QlikTech is generally good at putting up pretty pictures of its product. You can find some in the “What’s New in QlikView 11″ document via a general QlikView resource page.*
- Stephen Swoyer wrote a good article summarizing QlikView 11.
*One confusing aspect to that paper: non-standard uses of the terms “analytic app” and “document”.
As QlikTech tells it, QlikView 11 adds two kinds of collaboration features:
- Integration with social media, which QlikTech calls “asynchronous integration.”
- Direct sharing of the QlikView UI, which QlikTech calls “synchronous integration.”
I’d add a third kind, because QlikView 11 also takes some baby steps toward what I regard as a key aspect of BI collaboration — the ability to define and track your own metrics. It’s way, way short of what I called for in metric flexibility in a post last year, but at least it’s a small start.
MarkLogic’s Hadoop connector
It’s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic’s new Hadoop connector.
Most of what’s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you:
- Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API for streaming documents into or out of a MarkLogic database.
- Hadoop can make requests to MarkLogic in MarkLogic’s normal mode of operation, namely to address any node in the MarkLogic cluster, which then serves as a “head” node for the duration of that particular request. But alternatively, Hadoop can use a long-standing MarkLogic option to circumvent the whole DBMS cluster and only talk to one specific MarkLogic node.
Otherwise, the whole thing is just what you would think:
- Hadoop can read from and write to MarkLogic, in parallel at both ends.
- If Hadoop is just writing to MarkLogic, there’s a good chance the process is properly called “ETL.”
- If Hadoop is reading a lot from MarkLogic, there’s a good chance the process is properly called “batch analytics.”
MarkLogic said that it wrote this Hadoop connector itself.
| Categories: Clustering, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, MarkLogic, Parallelization, Workload management | 2 Comments |
Where Datameer is positioned
I’ve chatted with Datameer a couple of times recently, mainly with CEO Stefan Groschupf, most recently after XLDB last Tuesday. Nothing I learned greatly contradicts what I wrote about Datameer 1 1/2 years ago. In a nutshell, Datameer is designed to let you do simple stuff on large amounts of data, where “large amounts of data” typically means data in Hadoop, and “simple stuff” includes basic versions of a spreadsheet, of BI, and of EtL (Extract/Transform/Load, without much in the way of T).
Stefan reports that these capabilities are appealing to a significant fraction of enterprise or other commercial Hadoop users, especially the EtL and the BI. I don’t doubt him.
| Categories: Business intelligence, Datameer, EAI, EII, ETL, ELT, ETLT, Hadoop | 2 Comments |
Eight kinds of analytic database (Part 2)
In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear. Read more
