Analysis of data management technology optimized for text data. Related subjects include:
It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
  - Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
  - Many things that purport to be “analytic applications” or data warehouse “quick starts”.
  - “Data reduction” use cases in event processing.*
  - Text analytics tools.
- “Forget all that transformation foofarah — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
  - Relational DBMS, according to their inventor E. F. Codd.
  - MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
  - Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
  - Newer kinds of analytic DBMS that are faster than older kinds.
  - The “data mart spin-out” feature of certain analytic DBMS.
  - In-memory analytic data stores.
  - NoSQL DBMS that have a few analytic features.
  - TokuDB, similarly.
  - Electronic spreadsheets, from VisiCalc to Datameer.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
  - The business intelligence industry.
  - The predictive analytics industry.
  - Algorithmic trading use cases in complex event processing.*
  - Some analytic applications.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start: Read more
The 2012 Gartner Magic Quadrant for Data Warehouse Database Management Systems is out. I’ll split my comments into two posts — this one on concepts, and a companion on specific vendor evaluations.
- Maintaining working links to Gartner Magic Quadrants is an adventure. But as of early February, 2013, this link seems live.
- I also commented on the 2011, 2010, 2009, 2008, 2007, and 2006 Gartner Magic Quadrants for Data Warehouse DBMS.
Let’s start by again noting that I regard Gartner Magic Quadrants as a bad use of good research. On the facts:
- Gartner collects a lot of input from traditional enterprises. I envy that resource.
- Gartner also does a good job of rounding up vendor claims about user base sizes and the like. If nothing else, you should skim the MQ report for that reason.
- Gartner observations about product feature sets are usually correct, although not so consistently that they should be relied on.
When it comes to evaluations, however, the Gartner Data Warehouse DBMS Magic Quadrant doesn’t do as well. My concerns (which overlap) start:
- The Gartner MQ conflates many different use cases into one ranking (inevitable in this kind of work, but still regrettable).
- A number of the MQ vendor evaluations seem hard to defend. So do some of Gartner’s specific comments.
- Some of Gartner’s criteria seemingly amount to “parrots back our opinions to us”.
- As do I, Gartner thinks a vendor’s business and financial strength are important. But Gartner overdoes the matter, drilling down into picky issues it can’t hope to judge, such as assessing a vendor’s “ability to generate and develop leads.” *
- The 2012 Gartner Data Warehouse DBMS Magic Quadrant is closer to being a 1-dimensional ranking than 2-dimensional, in that entries are clustered along the line x=y. This suggests strong correlation among the results on various specific evaluation criteria.
Categories: Data integration and middleware, Data warehousing, Database compression, Emulation, transparency, portability, Hadoop, Market share and customer counts, Oracle, Text | 5 Comments
I recently opined that, especially for cutting-edge internet businesses, analytic applications were not a realistic option; rather, analytic application subsystems are the most you can currently expect. Erin Griffith further observed that the problem isn’t just confined to analytics:
“We didn’t need 90 percent of the stuff they were offering, and when we told them what we did need — integration with social, curation tools, individual boutiques and analytics — they had nothing”
… a suitable solution to merge his editorial staff’s output with his separate site for selling tickets to events and goods … was not available, so had to build his own hybrid publishing and commerce platform. Likewise, Birchbox had to build a custom backend so that it could include videos and editorial content alongside its e-commerce site.
… it’s DIY or die.
With that as background, let’s consider why building business-to-consumer internet software is so complicated.
I’d suggest that a consumer website starts with four major conceptual parts: Read more
My clients at Hadapt are coming out with a Version 2 to be available in Q1 2013, and perhaps slipstreaming some of the features before then. At that point, it will be reasonable to regard Hadapt as offering:
- A very tight integration between an RDBMS-based analytic platform and Hadoop …
- … that is decidedly immature as an analytic RDBMS …
- … but which strongly improves the SQL capabilities of Hadoop (vs., say, the alternative of using Hive).
Solr is in the mix as well.
Hadapt+Hadoop is positioned much more as “better than Hadoop” than “a better scale-out RDBMS”, and rightly so, due to its limitations when viewed strictly from an analytic RDBMS standpoint. I.e., Hadapt is meant for enterprises that want to do several of:
- Dump multi-structured data into Hadoop.
- Refine or just move some of it into an RDBMS.
- Bring in data from other RDBMS.
- Process all of the above via Hadoop MapReduce.
- Process all of the above via SQL.
- Use full-text indexes on the data.
Hadapt has 6 or so production customers, a dozen or so more coming online soon, 35 or so employees (mainly in Cambridge or Poland), reasonable amounts of venture capital, and the involvement of a variety of industry luminaries. Hadapt’s biggest installation seems to have 10s of terabytes of relational data and 100s of TBs of multi-structured; Hadapt is very confident in its ability to scale an order of magnitude beyond that with the Version 2 product, and reasonably confident it could go even further.
At the highest level, Hadapt works like this: Read more
Categories: Analytic technologies, Cloudera, Columnar database management, Data models and architecture, Data warehousing, Hadapt, Hadoop, MapR, MapReduce, Market share and customer counts, SQL/Hadoop integration, Text | 4 Comments
From time to time, I hear of regulatory requirements to retain, analyze, and/or protect data in various ways. It’s hard to get a comprehensive picture of these, as they vary both by industry and jurisdiction; so I generally let such compliance issues slide. Still, perhaps I should use one post to pull together what is surely a very partial list.
Most such compliance requirements have one of two emphases: Either you need to keep your customers’ data safe against misuse, or else you’re supposed to supply information to government authorities. From a data management and analysis standpoint, the former area mainly boils down to:
- Information security. This can include access control, encryption, masking, auditing, and more.
- Keeping data in an approved geographical area. (E.g., its country of origin.) This seems to be one of the three big drivers for multi-data-center processing (along with latency and disaster recovery), and hence is an influence upon numerous users’ choices in areas such as clustering and replication.
The latter, however, has numerous aspects.
First, there are many purposes for the data retention and analysis, including but by no means limited to: Read more
Categories: Archiving and information preservation, Clustering, Data warehousing, Health care, Investment research and trading, Text | 2 Comments
- A database query is a predicate.
- A DBMS matches the data it manages against the predicate and sends back those records for which the predicate is true.
And so it would seem that query results always have to be exact. Even so, there are at least four different practical scenarios in which query results can reasonably be regarded as approximate, each associated with query languages that can supersede standard set-theoretic SQL.
Actually, there’s a fifth, and it’s a huge one — some fraction of your data is just plain wrong. But that’s not what this post is about.
First, some queries don’t have binary results, even in principle. Notably, text queries are answered via relevancy rankings, which fit badly into the relational model.
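To make the contrast concrete, here is a minimal sketch in Python. The records, field names, and the crude term-count score are all invented for illustration; real text engines use far more elaborate relevancy formulas. The exact query returns a set of rows that either satisfy the predicate or don't, while the text query returns scored, ordered hits.

```python
from collections import Counter

# Hypothetical toy records; ids, cities, and text are illustrative only.
records = [
    {"id": 1, "city": "Boston", "text": "analytic database for text data"},
    {"id": 2, "city": "Austin", "text": "text search and text indexing"},
    {"id": 3, "city": "Boston", "text": "relational database basics"},
]

def exact_query(rows, predicate):
    """Set-style query: a row is in the result or it is not."""
    return [row for row in rows if predicate(row)]

def ranked_query(rows, terms):
    """Text-style query: each matching row gets a score, and results are ordered by it."""
    scored = []
    for row in rows:
        counts = Counter(row["text"].split())
        score = sum(counts[term] for term in terms)  # crude term-frequency score
        if score > 0:
            scored.append((score, row["id"]))
    # Highest score first; ties broken by id so the order is stable.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

boston_ids = [row["id"] for row in exact_query(records, lambda r: r["city"] == "Boston")]
text_hits = ranked_query(records, ["text", "database"])
```

Note that the ranked result carries an ordering and a score per row, neither of which has a natural home in an unordered relational result set.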
Second — and this can be combined with the first — you might want to generalize the query to look for partial matches. For example, Yarcdata suggested to me a scenario in which:
- You do a SPARQL query.
- You modify the query to accept results higher up in the taxonomy. (Which is likely to be possible, because where there’s SPARQL, there’s apt to be a taxonomy as well.) For example, if you really want to query on two people living in the house, you might extend the query to cover two people connected by any kind of address or building.
Similarly, if you’re looking for geographic proximity, it’s common to extend the allowed radius to fish for more results. Or one can walk up the hierarchy in a dimensional model.
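The radius-widening idea can be sketched roughly as follows. Everything here is an assumption for illustration: the flat-plane coordinates, the store names, the doubling rule, and the `min_hits` and `max_km` parameters.

```python
import math

# Hypothetical points on a flat plane, coordinates in km.
stores = {"a": (0.0, 1.0), "b": (3.0, 4.0), "c": (6.0, 8.0), "d": (30.0, 40.0)}

def within(center, radius_km, points):
    """Exact query: names of all points no farther than radius_km from center."""
    cx, cy = center
    return sorted(
        name for name, (x, y) in points.items()
        if math.hypot(x - cx, y - cy) <= radius_km
    )

def widening_search(center, points, start_km=2.0, min_hits=2, max_km=64.0):
    """Relax the predicate by doubling the radius until enough results come back."""
    radius = start_km
    while True:
        hits = within(center, radius, points)
        if len(hits) >= min_hits or radius >= max_km:
            return radius, hits
        radius *= 2

radius_used, hits = widening_search((0.0, 0.0), stores)
```

Each individual query is still exact; the approximation lies in the fact that the predicate finally answered is not the one originally asked.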
Third, sometimes you just don’t have the data for any kind of precise answer at all. One adaptation I’ve mentioned before is to interpolate time series with synthetic data, and send back “precise” results based on that. In the same post I mentioned the Vertica “range join”, wherein users deliberately throw away part of their data — only storing the range it was in — and then join accordingly.
As Donald Rumsfeld might have said — and would have done well to reflect upon — you go into decision-making with the data you have, not the data you wish you had.
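As a rough illustration of joining against ranges rather than exact values, consider the sketch below. It is not Vertica's implementation or syntax; the band table, its labels, and the function names are hypothetical. Each value is matched to the range that contains it, which is all the information a range-only store would keep.

```python
from bisect import bisect_right

# Hypothetical range table: (inclusive lower bound, label), sorted by lower bound.
# Each range extends up to the next lower bound; the last one is open-ended.
price_bands = [(0, "low"), (100, "mid"), (1000, "high")]
lower_bounds = [lower for lower, _ in price_bands]

def band_for(value):
    """Return the label of the range containing value, or None if below all ranges."""
    idx = bisect_right(lower_bounds, value) - 1
    if idx < 0:
        return None
    return price_bands[idx][1]

def range_join(values):
    """Join each value to its containing range instead of to an exact key."""
    return [(value, band_for(value)) for value in values]

joined = range_join([42, 100, 5000, -5])
```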
Finally, sometimes there’s a precise answer in principle, but for performance reasons you accept an approximate one, at least to start with. Numerous companies have told me stories around this, including:
- Infobright, whose “Rough Query” gives fast approximate results to a broad range of queries.
- Metamarkets, which does fast cardinality estimates via HyperLogLog.
- Aster Data, which was the first company to point out to me that median, decile, quintile, and so on calculations are a lot faster in a shared-nothing setting if you’re willing to settle for approximate results.
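One way to see why approximate quantiles are cheap in a shared-nothing setting is the sketch below. It is an assumed technique chosen for illustration (merging fixed-width histograms), not a description of what Aster Data or anyone else actually ships. Each node summarizes its own data into a small array of counts, only those counts cross the network, and the answer is accurate to within about one bucket width. The value range `lo`/`hi` and the bucket count are assumed to be known in advance.

```python
import statistics

def local_histogram(values, lo, hi, buckets):
    """Runs on each node: summarize local values as fixed-width bucket counts."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        idx = min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return counts

def merge(histograms):
    """Runs on the coordinator: histograms over the same buckets simply add."""
    return [sum(column) for column in zip(*histograms)]

def approx_quantile(counts, lo, hi, q):
    """Return the midpoint of the bucket where the cumulative count reaches q."""
    width = (hi - lo) / len(counts)
    target = q * sum(counts)
    running = 0
    for idx, count in enumerate(counts):
        running += count
        if running >= target:
            return lo + (idx + 0.5) * width
    return hi

# Three hypothetical nodes, each holding a third of the integers 0..299.
nodes = [list(range(start, 300, 3)) for start in range(3)]
merged = merge([local_histogram(node, 0, 300, 30) for node in nodes])
approx_median = approx_quantile(merged, 0, 300, 0.5)
exact_median = statistics.median(v for node in nodes for v in node)
```

An exact median, by contrast, generally needs either a global sort or several rounds of data exchange between nodes.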
The latter two categories led me to ask vendors how customers actually make use of their exotic SQL capabilities. Answers boiled down to:
- (Always) Well, there’s a lot of custom coding.
- (Sometimes) We’re working with partner BI vendors to make direct use of the capabilities, but that’s not done yet, so it’s too early to talk about any details.
Perhaps the answers will never get much better; it’s tough to get packaged software vendors to support vendor-specific SQL, unless the vendor is Oracle. Even so, we’re seeing ever more ways in which conventional SQL DBMS are being superseded by data management and analytic alternatives.
Categories: Aster Data, Business intelligence, Data models and architecture, Data warehousing, Database compression, Infobright, Text, Vertica Systems, Yarcdata and Cray | 2 Comments
From time to time, I try to step back and build a little taxonomy for the variety in database technology. One effort was 4 1/2 years ago, in a pre-planned exchange with Mike Stonebraker (his side, alas, has since been taken down). A year ago I spelled out eight kinds of analytic database.
The angle I’ll take this time is to say that every sufficiently large enterprise needs to be cognizant of at least 7 kinds of database challenge. General notes on that include:
- I’m using the weasel words “database challenge” to evade questions as to what is or isn’t exactly a DBMS.
- One “challenge” can call for multiple products and technologies even within a single enterprise, let alone at different ones. For example, in this post the “eight kinds of analytic database” are reduced to just a single category.
- Even so, one product or technology may be well-suited to address a couple different kinds of challenges.
The Big Seven database challenges that almost any enterprise faces are: Read more
Categories: Data integration and middleware, Data models and architecture, Database diversity, EAI, EII, ETL, ELT, ETLT, Hadoop, Memory-centric data management, NoSQL, Object, OLTP, RDF and graphs, Structured documents, Talend, Text | 3 Comments
My last post about DataStax Enterprise and Cassandra didn’t go so well. As follow-up, I chatted for two hours with Rick Branson and Billy Bosworth of DataStax. Hopefully I can do better this time around.
For starters, let me say there are three kinds of data management nodes in DataStax Enterprise:
- Vanilla Cassandra.
- Cassandra plus Solr. Solr is a superset of the text-indexing system Lucene.
  - Solr adds a lot more secondary indexing to Cassandra.
  - In addition, these nodes serve as Solr emulation; you can run generic Solr apps on them.
- Cassandra plus Hadoop.
  - You can use Hadoop MapReduce to manipulate generic Cassandra data.
  - In addition, these nodes serve as Hadoop/HDFS (Hadoop Distributed File System) emulation; you can run generic Hadoop apps on them.
  - Hadoop jobs can interweave access to the two kinds of data structure.
Cassandra, Solr, Lucene, and Hadoop are all Apache projects.
If we look at this from the standpoint of DML (Data Manipulation Language) and data access APIs:
- Cassandra is a column-group kind of NoSQL DBMS. You can get at its data programmatically.
- There’s something called CQL (Cassandra Query Language), said to be SQL-like.
- There’s a JDBC driver for CQL.
- With Hadoop MapReduce also come Hive, Pig, and Sqoop.
- With Solr and Lucene come full-text search.
In addition, it is sometimes recommended that you use “in-entity caching”, where an entire data structure (e.g. in JSON) winds up in a single Cassandra column.
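The general shape of in-entity caching can be sketched as below. A plain Python dict stands in for a Cassandra row, so none of this is a Cassandra client call, and the column name `profile_json` and the entity fields are made up for the example.

```python
import json

# A plain dict stands in for one row; this is not a Cassandra API.
row = {}

# Hypothetical entity with nested structure.
profile = {"user": "alice", "plan": "pro", "tags": ["beta", "eu"]}

# Write path: the whole entity is serialized into a single column value.
row["profile_json"] = json.dumps(profile, sort_keys=True)

# Read path: one column fetch returns the complete entity.
restored = json.loads(row["profile_json"])
```

The tradeoff in a sketch like this is that the store sees only an opaque string, so the entity is read and written as a unit.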
The two main ways to get direct SQL* access to data in DataStax Enterprise are:
*or very SQL-like, depending on how you view things
Before going further, let’s recall some Cassandra basics: Read more
Categories: Cassandra, DataStax, Hadoop, MapReduce, Market share and customer counts, NoSQL, Open source, Text | 6 Comments
Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.
My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:
- Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
- Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair.
- Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
- Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
- log4j – an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
- DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.
DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:
- You can manage the interactive data for a web site.
- You can store the logs for that website.
- You can analyze all of the above in Hadoop.
Categories: Cassandra, Clustering, DataStax, EAI, EII, ETL, ELT, ETLT, Games and virtual worlds, Hadoop, Log analysis, Market share and customer counts, NoSQL, Parallelization, Text, Web analytics | 5 Comments
SAP HANA has gotten much attention, mainly for its potential. I finally got briefed on HANA a few weeks ago. While we didn’t have time for all that much detail, it still might be interesting to talk about where SAP HANA stands today.
SAP HANA is positioned as an “appliance”. So far as I can tell, that really means it’s a software product for which there are a variety of emphatically-recommended hardware configurations — Intel-only, from what right now are eight usual-suspect hardware partners. Anyhow, the core of SAP HANA is an in-memory DBMS. Particulars include:
- Mainly, HANA is an in-memory columnar DBMS, based on SAP’s confusingly-renamed BI Accelerator/BW Accelerator. Analytics and most OLTP (OnLine Transaction Processing) go against the columnar part of HANA.
- The HANA DBMS also has an in-memory row storage option, used to store metadata, small tables, and so on.
- SAP HANA talks both SQL and MDX.
- The HANA DBMS is shared-nothing across blades or rack servers. I imagine that within an individual blade it’s shared everything. The usual-suspect data distribution or partitioning strategies are available — hash, range, round-robin.
- SAP HANA has what sounds like a natural disk-based persistence strategy: logs, snapshots, and so on. SAP says that this is synchronous enough to give ACID compliance. For some hardware partners, those “disks” are actually Fusion-io cards.
- HANA is fault-tolerant “across servers”.
- Text support is “coming soon”, which makes sense, given that BI Accelerator was based on the TREX search engine in the first place. Inxight is also in the HANA text mix.
- You can put data into SAP HANA in a variety of obvious ways:
  - Writing it directly.
  - Trigger-based replication (perhaps from the DBMS that runs your SAP apps).
  - Log-based replication (based on Sybase Replication Server).
  - SAP Business Objects’ ETL tool.
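For readers unfamiliar with the hash, range, and round-robin distribution strategies mentioned above, here is a generic sketch of how each assigns a row to a node. It says nothing about how HANA itself implements them; the node count, the range boundaries, and the choice of CRC32 as the hash are all assumptions for illustration.

```python
import zlib
from bisect import bisect_right
from itertools import count

NODES = 4  # hypothetical cluster size

def hash_partition(key, nodes=NODES):
    """Hash: the same key always lands on the same node; keys spread roughly evenly."""
    return zlib.crc32(str(key).encode("utf-8")) % nodes

# Hypothetical boundaries: node 0 gets keys < 100, node 1 gets 100-199,
# node 2 gets 200-299, node 3 gets everything from 300 up.
RANGE_BOUNDS = [100, 200, 300]

def range_partition(key):
    """Range: neighboring keys stay together, which helps range scans."""
    return bisect_right(RANGE_BOUNDS, key)

def make_round_robin(nodes=NODES):
    """Round-robin: rows are dealt out in turn, ignoring their content."""
    counter = count()
    def assign(_row=None):
        return next(counter) % nodes
    return assign

assign = make_round_robin()
round_robin_nodes = [assign() for _ in range(6)]
range_nodes = [range_partition(k) for k in (99, 100, 299, 300)]
```

Hash and range placement let a query locate a key's node from the key alone, whereas round-robin gives even balance but requires asking every node.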
SAP says that the row-store part is based both on P*Time, an acquisition from Korea some time ago, and also on SAP’s own MaxDB. The IBM white paper mentions only the MaxDB aspect. (Edit: Actually, see the comment thread below.) Based on a variety of clues, I conjecture that this was an aspect of SAP HANA development that did not go entirely smoothly.
Other SAP HANA components include: Read more