My client Syncsort:
- Is an ETL (Extract/Transform/Load) vendor, whose flagship product DMExpress was evidently renamed to DMX.
- Has a strong history in and fondness for sort.
- Has announced a new ETL product, DMX-h ETL Edition, which uses Hadoop MapReduce to parallelize DMX by controlling a copy of DMX that resides on every data node of the Hadoop cluster.*
- Has also announced the closely-related DMX-h Sort Edition, offering acceleration for the sorts inherent in Map and Reduce steps.
- Contributed a patch to Apache Hadoop to open up Hadoop MapReduce to make all this possible.
*Perhaps we should question Syncsort’s previous claims of having strong multi-node parallelism already.
The essence of the Syncsort DMX-h ETL Edition story is:
- DMX-h inherits the various ETL-suite trappings of DMX.
- Syncsort claims DMX-h has major performance advantages vs., for example, Hive- or Pig-based alternatives.
- With a copy of DMX on every node, DMX-h can do parallel load/export.
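To make concrete where a sort sits between the Map and Reduce steps, here is a minimal single-process sketch in Python. It is not Hadoop code and not Syncsort's implementation; `run_job`, `default_sort`, and the `sort_fn` parameter are hypothetical names used only to show the shape of a replaceable sort step.

```python
from itertools import groupby
from operator import itemgetter
from typing import Callable, Iterable, Iterator

KV = tuple[str, int]
SortFn = Callable[[list[KV]], list[KV]]


def default_sort(pairs: list[KV]) -> list[KV]:
    """Stand-in for a framework's built-in sort of map output (hypothetical)."""
    return sorted(pairs, key=itemgetter(0))


def map_words(line: str) -> Iterator[KV]:
    """Map step: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield (word.lower(), 1)


def run_job(lines: Iterable[str], sort_fn: SortFn = default_sort) -> dict[str, int]:
    """Toy word count: map, sort by key, then reduce each group of equal keys."""
    mapped = [kv for line in lines for kv in map_words(line)]
    # The sort between Map and Reduce is the piece a replaceable sort would take over.
    ordered = sort_fn(mapped)
    # Reduce step: groupby relies on equal keys being adjacent after the sort.
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }
```

Passing a different `sort_fn` changes how the intermediate pairs are ordered without touching the map or reduce logic, which is the general idea behind a sort that can be swapped out.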
Teradata is announcing its new high-end systems, the Teradata 6700 series. Notes on that include:
- Teradata tends to get 35-55% (roughly speaking) annual performance improvements, as measured by its internal blended measure Tperf. A big part of this is exploiting new-generation Intel processors.
- This year the figure is around 40%.
- The 6700 is based on Intel’s Sandy Bridge.
- Teradata previously told me that Ivy Bridge — the next one after Sandy Bridge — could offer a performance “discontinuity”. So, while this is just a guess, I expect that next year’s Teradata performance improvement will beat this year’s.
- Teradata has now largely switched over to InfiniBand.
Teradata is also talking about data integration and best-of-breed systems, with buzzwords such as:
- Teradata Unified Data Architecture.
- Fabric-based computing, even though this isn’t really about storage.
- Teradata SQL-H.
As vendors so often do, Teradata has caused itself some naming confusion. SQL-H was introduced as a facility of Teradata Aster, to complement SQL-MR.* But while SQL-MR is in essence a set of SQL extensions, SQL-H is not. Rather, SQL-H is a transparency interface that makes Hadoop data responsive to the same code that would work on Teradata Aster …
*Speaking of confusion — Teradata Aster seems to use the spellings SQL/MR and SQL-MR interchangeably.
… except that now there’s also a SQL-H for regular Teradata systems as well. While it has the same general features and benefits as SQL-H for Teradata Aster, the details are different, since the underlying systems are.
I hope that’s clear.
Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.
Discussions of DBMS performance are always odd, for starters because:
- Workloads and use cases vary greatly.
- In particular, benchmarks such as YCSB or TPC-H aren’t very helpful.
- It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
- Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
- In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
- Many workloads are inherently single node (replication aside). Others are not.
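The "synchronously to RAM on 2 or more servers, then asynchronously to disk" pattern can be sketched as follows. This is a toy in-process illustration, not any particular DBMS's write path; `Replica`, `replicated_write`, and `min_copies` are hypothetical names, and plain dictionaries stand in for memory and disk.

```python
import queue
import threading


class Replica:
    """Toy replica: writes land in RAM at once and reach 'disk' later."""

    def __init__(self, name):
        self.name = name
        self.memory = {}  # in-RAM copy, updated synchronously
        self.disk = {}    # stand-in for durable storage
        self._pending = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def write_to_memory(self, key, value):
        self.memory[key] = value
        self._pending.put((key, value))  # the disk write is deferred

    def _flush_loop(self):
        # Background thread: drains queued writes to 'disk'.
        while True:
            key, value = self._pending.get()
            self.disk[key] = value
            self._pending.task_done()

    def wait_for_flush(self):
        """Block until every queued write has reached 'disk'."""
        self._pending.join()


def replicated_write(replicas, key, value, min_copies=2):
    """Acknowledge once the value is in RAM on every replica, before any disk write."""
    if len(replicas) < min_copies:
        raise ValueError("not enough replicas for the requested durability")
    for replica in replicas:
        replica.write_to_memory(key, value)
    return True
```

The caller gets its acknowledgement as soon as the in-memory copies exist, so write latency is decoupled from disk latency. The background flush still has to keep up over time, which is the point about needing sufficient disk I/O.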
MongoDB and 10gen
I caught up with Ron Avnur at 10gen. Technical highlights included:
Well-resourced Silicon Valley start-ups typically announce their existence multiple times. Company formation, angel funding, Series A funding, Series B funding, company launch, product beta, and product general availability may not be 7 different “news events”, but they’re apt to be at least 3-4. Platfora, no exception to this rule, is hitting general availability today, and in connection with that I learned a bit more about what they are up to.
In simplest terms, Platfora offers exploratory business intelligence against Hadoop-based data. As per last weekend’s post about exploratory BI, a key requirement is speed; and so far as I can tell, any technological innovation Platfora offers relates to the need for speed. Specifically, I drilled into Platfora’s performance architecture on the query processing side (and associated data movement); Platfora also brags of rendering 100s of 1000s of “marks” quickly in HTML5 visualizations, but I haven’t a clue as to whether that’s much of an accomplishment in itself.
Platfora’s marketing suggests it obviates the need for a data warehouse at all; for most enterprises, of course, that is a great exaggeration. But another dubious aspect of Platfora marketing actually serves to understate the product’s merits — Platfora claims to have an “in-memory” product, when what’s really the case is that Platfora’s memory-centric technology uses both RAM and disk to manage larger data marts than could reasonably be fit into RAM alone. Expanding on what I wrote about Platfora when it de-stealthed:
It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- “Forget all that transformation foofarah — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start:
I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:
- How big is your database? How big is your budget?
- How do you feel about appliances?
- How do you feel about the cloud?
- What are the size and shape of your workload?
- How fresh does the data need to be?
Let’s drill down.
The 2012 Gartner Magic Quadrant for Data Warehouse Database Management Systems is out. I’ll split my comments into two posts — this one on concepts, and a companion on specific vendor evaluations.
- Maintaining working links to Gartner Magic Quadrants is an adventure. But as of early February, 2013, this link seems live.
- I also commented on the 2011, 2010, 2009, 2008, 2007, and 2006 Gartner Magic Quadrants for Data Warehouse DBMS.
Let’s start by again noting that I regard Gartner Magic Quadrants as a bad use of good research. On the facts:
- Gartner collects a lot of input from traditional enterprises. I envy that resource.
- Gartner also does a good job of rounding up vendor claims about user base sizes and the like. If nothing else, you should skim the MQ report for that reason.
- Gartner observations about product feature sets are usually correct, although not so consistently that they should be relied on.
When it comes to evaluations, however, the Gartner Data Warehouse DBMS Magic Quadrant doesn’t do as well. My concerns (which overlap) start:
- The Gartner MQ conflates many different use cases into one ranking (inevitable in this kind of work, but still regrettable).
- A number of the MQ vendor evaluations seem hard to defend. So do some of Gartner’s specific comments.
- Some of Gartner’s criteria seemingly amount to “parrots back our opinions to us”.
- As do I, Gartner thinks a vendor’s business and financial strength are important. But Gartner overdoes the matter, drilling down into picky issues it can’t hope to judge, such as assessing a vendor’s “ability to generate and develop leads.” *
- The 2012 Gartner Data Warehouse DBMS Magic Quadrant is closer to being a 1-dimensional ranking than 2-dimensional, in that entries are clustered along the line x=y. This suggests strong correlation among the results on various specific evaluation criteria.
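As a rough illustration of the last point, a correlation coefficient puts a number on how close a two-axis plot is to a one-dimensional ranking. The sketch below uses invented coordinates, not Gartner's actual placements; `pearson`, `near_diagonal`, and `scattered` are hypothetical names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented (x, y) positions for illustration only; NOT real Magic Quadrant data.
near_diagonal = [(1.0, 1.2), (2.0, 1.8), (3.0, 3.3), (4.0, 3.9), (5.0, 5.1)]
scattered = [(1.0, 4.0), (2.0, 1.0), (3.0, 5.0), (4.0, 2.0), (5.0, 3.0)]

r_diagonal = pearson(*zip(*near_diagonal))
r_scattered = pearson(*zip(*scattered))
```

A coefficient near 1 means the second axis adds little information beyond the first, which is what "closer to 1-dimensional" amounts to. A value near 0 would mean the two axes are measuring different things.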
Data/database virtualization seems to be a hot subject right now, and vendors of a broad variety of different technologies are all claiming to be in the space. A terminological mess has ensued, as Monash’s First and Third Laws of Commercial Semantics are borne out in spades.
If something is like “virtualization”, then it should resemble hypervisors such as VMware. To me:
- The core feature of a hypervisor is that it allows many somethings to run and coexist where ordinarily only one something would come into play. Here the “many somethings” are virtual machines and what’s going on inside them, and the “one something” is the ordinary operating system/hardware computing stack.
- A core feature of original VMware was that the “many somethings” could be quite different — for example, the operating environments of numerous different hardware systems you wanted to decommission, or of new systems that you didn’t want to buy quite yet.
- Important features of hypervisors include:
- The ability to have multiple virtual machines run side by side at once, safely.
- Flexible and powerful workload management if the virtual machines do contend for resources.
- Easy management.
- The negative feature of having sufficiently low overhead.
Anything that claims to be “like virtualization” should be viewed in that light.
My clients at Cloudant, Couchbase, and 10gen/MongoDB (Edit: See Alex Popescu’s comment below) all boast incremental MapReduce as a feature. (And they’re not the only ones.) So I feel like making a quick post about it. For starters, I’ll quote myself about Cloudant:
The essence of Cloudant’s incremental MapReduce seems to be that data is selected only if it’s been updated since the last run. Obviously, this only works for MapReduce algorithms whose eventual output can be run on different subsets of the target data set, then aggregated in a simple way.
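A minimal sketch of that idea, using word count as the job: only records updated since the last run are mapped, and the partial result is folded into the running totals by simple addition. This is not Cloudant's, Couchbase's, or MongoDB's API; `IncrementalWordCount`, `map_reduce`, and the `(updated_at, text)` record shape are assumptions made for illustration.

```python
from collections import Counter


def map_reduce(docs):
    """Full word count over a batch of document texts."""
    counts = Counter()
    for text in docs:
        counts.update(text.lower().split())
    return counts


class IncrementalWordCount:
    """Keeps running totals and only processes records newer than the last run."""

    def __init__(self):
        self.totals = Counter()
        self.last_run = 0

    def run(self, records, now):
        # records: iterable of (updated_at, text) pairs (hypothetical shape)
        fresh = [text for updated_at, text in records if updated_at > self.last_run]
        # Simple aggregation: partial counts just add onto the earlier totals.
        self.totals += map_reduce(fresh)
        self.last_run = now
        return self.totals
```

The sketch treats every update as a newly appended record. If existing documents could be edited or deleted, their old contributions would have to be subtracted first. An output such as a median could not be maintained this way at all, since it cannot be rebuilt by adding partial results.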
These implementations of incremental MapReduce are hacked together by teams vastly smaller than those working on Hadoop, and surely fall short of Hadoop in many areas such as performance, fault-tolerance, and language support. That’s a given. Still, if the jobs are short and simple, those deficiencies may be tolerable.
A StackOverflow thread about MongoDB’s version of incremental MapReduce highlights some of the implementation challenges.
But all practicality aside, let’s return to the point that incremental MapReduce only works for some kinds of MapReduce-based algorithms, and consider how much of a limitation that really is. Looking at the Map steps sheds a little light: