Analysis of data warehouse DBMS vendor Aster Data. Related subjects include:
Relational DBMS used to be fairly straightforward product suites, which boiled down to:
- A big SQL interpreter.
- A bunch of administrative and operational tools.
- Some very optional add-ons, often including an application development tool.
Now, however, most RDBMS are sold as part of something bigger.
- Oracle has hugely thickened its stack, as part of an Innovator’s Solution strategy — hardware, middleware, applications, business intelligence, and more.
- IBM has moved aggressively to a bundled “appliance” strategy. Even before that, IBM DB2 long sold much better to committed IBM accounts than as a software-only offering.
- Microsoft SQL Server is part of a stack, starting with the Windows operating system.
- Sybase was an exception to this rule, with thin(ner) stacks for both Adaptive Server Enterprise and Sybase IQ. But Sybase is now owned by SAP, and increasingly integrated as a business with …
- … SAP HANA, which is closely associated with SAP’s applications.
- Teradata has always been a hardware/software vendor. The most successful of its analytic DBMS rivals, in some order, are:
- Netezza, a pure appliance vendor, now part of IBM.
- Greenplum, an appliance-mainly vendor for most (not all) of its existence, and in particular now as a part of EMC Pivotal.
- Vertica, more of a software-only vendor than the others, but now owned by and increasingly mainstreamed into hardware vendor HP.
- MySQL’s glory years were as part of the “LAMP” stack.
- Various thin-stack RDBMS that once were or could have been important market players … aren’t. Examples include Progress OpenEdge, IBM Informix, and the various strays adopted by Actian.
Much of modern analytic technology deals with what might be called an entity-centric sequence of events. For example:
- You receive and open various emails.
- You click on and look at various web sites and pages.
- Specific elements are displayed on those pages.
- You study various products, and even buy some.
Analytic questions are asked along the lines “Which sequences of events are most productive in terms of leading to the events we really desire?”, such as product sales. Another major area is sessionization, along with data preparation tasks that boil down to arranging data into meaningful event sequences in the first place.
A number of my clients are focused on such scenarios, including WibiData, Teradata Aster (e.g. via nPath), Platfora (in the imminent Platfora 3), and others. And so I get involved in naming exercises. The term entity-centric came along a while ago, because “user-centric” is too limiting. (E.g., the data may not be about a person, but rather specifically about the actions taken on her mobile device.) Now I’m adding the term event series to cover the whole scenario, rather than the “event sequence(s)” I might appear to have been hinting at above.
I decided on “event series” earlier this week, after noting that: Read more
|Categories: Aster Data, Business intelligence, Data warehousing, EAI, EII, ETL, ELT, ETLT, Platfora, Predictive modeling and advanced analytics, Teradata, Vertica Systems, Web analytics, WibiData||10 Comments|
Teradata Aster 6 has been preannounced (beta in Q4, general release in Q1 2014). The general architectural idea is:
- There are multiple data stores, the first two of which are:
- The classic Aster relational data store.
- A file system that emulates HDFS (Hadoop Distributed File System).
- There are multiple processing “engines”, where an engine is what occupies and controls a processing thread. These start with:
- Generic analytic SQL, as Aster has had all along.
- SQL-MR, the MapReduce Aster has also had all along.
- SQL-Graph aka SQL-GR, a graph analytics system.
- The Aster parser and optimizer accept glorified SQL, and work across all the engines combined.
There’s much more, of course, but those are the essential pieces.
Just to be clear: Teradata Aster 6, aka the Teradata Aster Discovery Platform, includes HDFS compatibility, native MapReduce and ways of invoking Hadoop MapReduce on non-Aster nodes or clusters — but even so, you can’t run Hadoop MapReduce within Aster over Aster’s version of HDFS.
The most dramatic immediate additions are in the graph analytics area.* The new SQL-Graph is supported by something called BSP (Bulk Synchronous Parallel). I’ll start by observing (and some of this is confusing):
- BSP was thought of a long time ago, as a general-purpose computing model, but recently has come to the fore specifically for graph analytics. (Think Pregel and Giraph, along with Teradata Aster.)
- BSP has a kind of execution-graph metaphor, which is different from the graph data it helps analyze.
- BSP is described as being a combination hardware/software technology, but Teradata Aster and everybody else I know of implements it in software only.
- Aster long ago talked of adding a graph data store, but has given up that plan; rather, it wants you to do graph analytics on data stored in tables (or accessed through views) in the usual way.
Use cases suggested are a lot of marketing, plus anti-fraud.
*Pay no attention to Aster’s previous claims to do a good job on graph — and not only via nPath — in SQL-MR.
So far as I can infer from examples I’ve seen, the semantics of Teradata Aster SQL-Graph start:
- Ordinary SQL except in the FROM clause.
- Functions/operators that are the arguments for FROM; of course, they output tables. You can write these yourself, or use Teradata Aster’s prebuilt ones.
Within those functions, the core idea is: Read more
|Categories: Application areas, Aster Data, Business intelligence, Data models and architecture, Data warehousing, Hadoop, Parallelization, Predictive modeling and advanced analytics, RDF and graphs, Teradata||4 Comments|
I recently wrote (emphasis added):
My clients at Teradata Aster probably see things differently, but I don’t think their library of pre-built analytic packages has been a big success. The same goes for other analytic platform vendors who have done similar (generally lesser) things. I believe that this is because such limited libraries don’t do enough of what users want.
The bolded part has been, shall we say, confirmed. As Randy Lea tells it, Teradata Aster sales qualification includes the determination that at least one SQL-MR operator — be relevant to the use case. (“Operator” seems to be the word now, rather than “function”.) Randy agreed that some users prefer hand-coding, but believes a large majority would like to push work to data analysts/business analysts who might have strong SQL skills, but be less adept at general mathematical programming.
This phrasing will all be less accurate after the release of Aster 6, which extends Aster’s capabilities beyond the trinity of SQL, the SQL-MR library, and Aster-supported hand-coding.
Randy also said:
- A typical Teradata Aster production customer uses 8-12 of the prebuilt functions (but now they seem to be called operators).
- nPath is used in almost every Aster account. (And by now nPath has morphed into a family of about 5 different things.)
- The Aster collaborative filtering operator is used in almost every account.
- Ditto a/the text operator.
- Several business intelligence vendors are partnering for direct access to selected Teradata Aster operators — mentioned were Tableau, TIBCO Spotfire, and Alteryx.
- I don’t know whether this is on the strength of a specific operator or not, but Aster is used to help with predictive parts failure applications in multiple industries.
And Randy seemed to agree when I put words in his mouth to the effect that the prebuilt operators save users months of development time.
Meanwhile, Teradata Aster has started a whole new library for relationship analytics.
|Categories: Application areas, Aster Data, Data warehousing, Predictive modeling and advanced analytics, Teradata, Text||1 Comment|
In a general pontification on positioning, I wrote:
every product in a category is positioned along the same set of attributes,
and went on to suggest that summary attributes were more important than picky detailed ones. So how does that play out for investigative analytics?
First, summary attributes that matter for almost any kind of enterprise software include:
- Performance and scalability. I write about analytic performance and scalability a lot. Usually that’s in the context of analytic DBMS, but it also arises in analytic stacks such as Platfora, Metamarkets or even QlikView, and also in the challenges of making predictive modeling scale.
- Reliability, availability and security.* This is more crucial for short-request applications than analytic ones, but even your analytic systems shouldn’t leak data or crash.
- Goodness of fit with legacy systems. I hate that one, because enterprises often sacrifice way too much in favor of that benefit.
- Price. Duh.
*I picked up that phrase when — abbreviated as RAS — it was used to characterize the emphasis for Oracle 8. I like it better than a general and ambiguous concept of “enterprise-ready”.
The reason I’m writing this post, however, is to call out two summary attributes of special importance in investigative analytics — which regrettably which often conflict with each other — namely:
- Agility. People don’t want to submit requests for reports or statistical analyses; they want to get answers as soon as the questions come to mind.
- Completeness of feature set — for a particular use case, that is. There’s no such thing as an investigative analytics offering with a feature set that’s close to complete for all purposes; even SAS, IBM and other behemoths fall short.
Much of what I work on boils down to those two subjects. For example: Read more
|Categories: Aster Data, Business intelligence, Data warehousing, KXEN, Predictive modeling and advanced analytics, SAS Institute, Teradata||8 Comments|
Two subjects in one post, because they were too hard to separate from each other
Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.
Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:
- The object/relational, aka universal, extensibility features developed in the 1990s for Oracle, DB2, Informix, Illustra, and Postgres. The most successful extensions probably have been:
- Geospatial indexing via ESRI.
- Full-text indexing, notwithstanding questionable features and performance.
- MySQL storage engines.
- MPP (Massively Parallel Processing) analytic RDBMS relying on single-node PostgreSQL, Ingres, and/or Microsoft SQL Server — e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or Hadapt.
- Splits in which a DBMS has serious processing both in a “database” layer and in a predicate-pushdown “storage” layer — most famously Oracle Exadata, but also MarkLogic, InfiniDB, and others.
- SQL-on-HDFS — Hive, Impala, Stinger, Shark and so on (including Hadapt).
Other examples on my mind include:
- Data manipulation APIs being added to key-value stores such as Couchbase and Aerospike.
- TokuMX, the Tokutek/MongoDB hybrid I just blogged about.
- NuoDB’s willing reliance on third-party key-value stores (or HDFS in the role of one).
- FoundationDB’s strategy, and specifically its acquisition of Akiban.
And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.
In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include: Read more
The genesis of this post is:
- Dave DeWitt sent me a paper about Microsoft Polybase.
- I argued with Dave about the differences between Polybase and Hadapt.
- I asked Daniel Abadi for his opinion.
- Dan agreed with Dave, in a long email …
- … that he graciously permitted me to lightly-edit and post.
I love my life.
Per Daniel (emphasis mine): Read more
|Categories: Aster Data, Data warehousing, Greenplum, Hadapt, Hadoop, MapReduce, Microsoft and SQL*Server, SQL/Hadoop integration, Theory and architecture||9 Comments|
The cardinal rules of DBMS development
Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
- Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
- Mixed workload management is harder than you’re assuming it is.
- Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
DBMS with Hadoop underpinnings …
… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact. Read more
I’ve hacked both the PHP and CSS that drive this website. But if I had to write PHP or CSS from scratch, I literally wouldn’t know how to begin.
Something similar, I suspect, is broadly true of “business analysts.” I don’t know how somebody can be a competent business analyst without being able to generate, read, and edit SQL. (Or some comparable language; e.g., there surely are business analysts who only know MDX.) I would hope they could write basic SELECT statements as well.
But does that mean business analysts are comfortable with the fancy-schmantzy extended SQL that the analytic platform vendors offer them? I would assume that many are but many others are not. And thus I advised such a vendor recently to offer sample code, and lots of it — dozens or hundreds of isolated SQL statements, each of which does a specific task.* A business analyst could reasonably be expected to edit any of those to point them his own actual databases, even though he can’t necessarily be expected to easily write such statements from scratch. Read more
I took the opportunity of Teradata’s Aster/Hadoop appliance announcement to catch up with Teradata hardware chief Carson Schmidt. I love talking with Carson, about both general design philosophy and his views on specific hardware component technologies.
From a hardware-requirements standpoint, Carson seems to view Aster and Hadoop as more similar to each other than either is to, say, a Teradata Active Data Warehouse. In particular, for Aster and Hadoop:
- I/O is more sequential.
- The CPU:I/O ratio is higher.
- Uptime is a little less crucial.
The most obvious implication is differences in the choice of parts, and of their ratio. Also, in the new Aster/Hadoop appliance, Carson is content to skate by with RAID 5 rather than RAID 1.
I think Carson’s views about flash memory can be reasonably summarized as: Read more
|Categories: Aster Data, Data warehouse appliances, Data warehousing, Hadoop, Solid-state memory, Storage, Teradata||2 Comments|