Aster Data – DBMS 2 : DataBase Management System Services

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
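To make the ELT idea concrete, here is a minimal sketch of "load first, transform afterward in SQL". It is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for an analytic RDBMS, and the table names, column names and sample rows are all hypothetical.

```python
import sqlite3

# Assumption: sqlite3 stands in for an analytic RDBMS.
# All table names, column names and rows below are hypothetical.
raw_rows = [
    ("2016-08-01", "  widget ", "19.99"),
    ("2016-08-01", "gadget", "5.00"),
    ("2016-08-02", "WIDGET", "19.99"),
]

conn = sqlite3.connect(":memory:")

# Extract + Load: land the data as-is, with every column stored as text.
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, product TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)

# Transform: clean and aggregate inside the database, using SQL.
conn.execute("""
    CREATE TABLE sales_by_product AS
    SELECT LOWER(TRIM(product)) AS product,
           SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_sales
    GROUP BY LOWER(TRIM(product))
""")

elt_result = conn.execute(
    "SELECT product, revenue FROM sales_by_product ORDER BY product"
).fetchall()
print(elt_result)
```

In a traditional ETL flow, the trimming, lower-casing and type conversion would instead happen in a separate engine before the load step.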

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Notes from a visit to Teradata

Curt Monash — Sun, 31 Aug 2014 09:17:29 +0000

I spent a day with Teradata in Rancho Bernardo last week. Most of what we discussed is confidential, but I think the non-confidential parts and my general impressions add up to enough for a post.

First, let’s catch up with some personnel gossip. So far as I can tell:

Scott Gnau runs most of Teradata’s development, product management, and product marketing, the big exception being that …
… Darryl McDonald runs the apps part (Aprimo and so on), and is no longer head of marketing.
Oliver Ratzesberger runs Teradata’s software development.
Jeff Carter has returned to his roots and runs the hardware part, in place of Carson Schmidt.
Aster founders Mayank Bawa and Tasso Argyros have left Teradata (perhaps some earn-out period ended).
Carson is temporarily running Aster development (in place of Mayank), and has some sort of evangelism role waiting after that.
With the acquisition of Hadapt, Teradata gets some attention from Dan Abadi. Also, they’re retaining Justin Borgman.

The biggest change in my general impressions about Teradata is that they’re having smart thoughts about the cloud. At least, Oliver is. All details are confidential, and I wouldn’t necessarily expect them to become clear even in October (which once again is the month for Teradata’s user conference). My main concern about all that is whether Teradata’s engineering team can successfully execute on Oliver’s directives. I’m optimistic, but I don’t have a lot of detail to support my good feelings.

In some quick-and-dirty positioning and sales qualification notes, which crystallize what we already knew before:

The Teradata 1xxx series is focused on cost-per-bit.
The Teradata 2xxx series is focused on cost-per-query. It is commonly Teradata’s “lead” product, at least for new customers.
The Teradata 6xxx series is supposed to be able to do “everything”.
The Teradata Aster “Discovery Analytics” platform is sold mainly to customers who have a specific high-value problem to solve. (Randy Lea gave me a nice round dollar number, but I won’t share it.) I like that approach, as it obviates much of the concern about “Wait — is this strategic for us long-term, given that we also have both Teradata database and Hadoop clusters?”

Also:

1xxx and 2xxx systems are meant to be I/O-constrained. 6xxx systems are meant to be constrained mainly by CPU, but every system will be I/O-constrained at some point.
There is at least one example of a Very Well Known organization buying Teradata’s Hadoop-only appliance despite not otherwise being a Hadoop customer. Teradata concedes, however, that this is not a common occurrence.
Customers are increasingly using co-location rather than their own data centers. Many colo organizations charge more or less strictly by floor space. Hence, there’s a push for maximum processing density per rack, power density and weight be damned.

Speaking of not being CPU-constrained — I heard 7-10% as an estimate for typical Hadoop utilization, and also 10-15%. While I didn’t ask, I presume these figures assume traditional MapReduce types of Hadoop workloads. I’m not sure why these figures are yet lower than eBay’s long-ago estimates of Hadoop “parallel efficiency”.

Like Carson used to do, Jeff shared a variety of hardware and networking tidbits with me. In particular:

Jeff is confident in Moore’s Law continuing for at least 5 more years. (I think that’s a near-consensus; the 2020s, however, are another matter.)
Teradata still uses SAS rather than SATA for all disk (spinning or solid-state) controllers. They’re now seeing 600-700 MB/sec/device on SSDs (Solid State Disk), up from 300-400.
SSD prices are down 60% over the past 6 months, vs. much slower declines previously.
Formerly a SanDisk/Pliant partisan, Teradata now thinks there are multiple vendors of good SSDs. (I’m not sure whether they’d be happy if I said which one they currently like best.)
Jeff foresees InfiniBand and Ethernet more or less merging. Right now Teradata is using a lot of 56 Gb/sec InfiniBand.

Since Oliver is now a Teradata mucky-muck, I asked about virtual data marts, an idea that he pretty much invented or at least popularized back in his eBay days. Comments included:

Teradata now calls them Data Labs.
Adoption is very high.
One major feature is “time boxing” — they expire after a period of time unless you renew them.
Analysis of virtual data mart usage is a good guide as to what you might want to add to your permanent data warehouse.

And I’ll stop here, although I hope that a couple more-focused posts will also eventually flow from the visit.

Teradata bought Hadapt and Revelytix

Curt Monash — Wed, 23 Jul 2014 08:29:02 +0000

My client Teradata bought my (former) clients Revelytix and Hadapt.* Obviously, I’m in confidentiality up to my eyeballs. That said — Teradata truly doesn’t know what it’s going to do with those acquisitions yet. Indeed, the acquisitions are too new for Teradata to have fully reviewed the code and so on, let alone made strategic decisions informed by that review. So while this is just a guess, I conjecture Teradata won’t say anything concrete until at least September, although I do expect some kind of stated direction in time for its October user conference.

*I love my business, but it does have one distressing aspect, namely the combination of subscription pricing and customer churn. When your customers transform really quickly, or even go out of existence, so sometimes does their reliance on you.

I’ve written extensively about Hadapt, but to review:

The HadoopDB project was started by Dan Abadi and two grad students.
HadoopDB tied a bunch of PostgreSQL instances together with Hadoop MapReduce. Lab benchmarks suggested it was more performant than the coyly named DBx (where x=2), but not necessarily competitive with top analytic RDBMS.
Hadapt was formed to commercialize HadoopDB.
After some fits and starts, Hadapt was a Cambridge-based company. Former Vertica CEO Chris Lynch invested even before he was a VC, and became an active chairman. Not coincidentally, Hadapt had a bunch of Vertica folks.
Hadapt decided to stick with row-based PostgreSQL, Dan Abadi’s previous columnar enthusiasm notwithstanding. Not coincidentally, Hadapt’s performance never blew anyone away.
Especially after the announcement of Cloudera Impala, Hadapt’s SQL-on-Hadoop positioning didn’t work out. Indeed, Hadapt laid off most or all of its sales and marketing folks. Hadapt pivoted to emphasize its schema-on-need story.
Chris Lynch, who generally seems to think that IT vendors are created to be sold, shopped Hadapt aggressively.

As for what Teradata should do with Hadapt:

My initial thought for Hadapt was to just double down, pushing the technology forward, presumably including a columnar option such as the one Citus Data developed.
But upon reflection, if it made technical sense to merge the Aster and Hadapt products, that would be better yet.

I herewith apologize to Aster co-founder and Hadapt skeptic Tasso Argyros (who by the way has moved on from Teradata) for even suggesting such heresy.

Complicating the story further:

Impala lets you treat data in HDFS (Hadoop Distributed File System) as if it were in a SQL DBMS. So does Teradata SQL-H. But Hadapt makes you decide whether the data is in HDFS or the SQL DBMS, and it can’t be in both at once. Edit: Actually, see Dan Abadi’s comments below.
Impala and Oracle’s new SQL-H competitor have daemons running on every data node. So does one option in Hadapt. But I don’t think SQL-H does that yet.

I was less involved with Revelytix than with Hadapt (although I’m told I served as the “catalyst” for the original Teradata/Revelytix partnership). That said, Teradata, like Oracle, is always building out a data integration suite to cover a limited universe of data stores. And Revelytix’ dataset management technology is a nice piece toward an integrated data catalog.

Related posts

Dan Abadi and Dave DeWitt both drew distinctions among various SQL/Hadoop integrations.
Hadapt was one of the original examples for my Cardinal Rules of DBMS Development.

Using multiple data stores

Curt Monash — Wed, 18 Jun 2014 16:03:10 +0000

I’m commonly asked to assess vendor claims of the kind:

“Our system lets you do multiple kinds of processing against one database.”
“Otherwise you’d need two or more data managers to get the job done, which would be a catastrophe of unthinkable proportion.”

So I thought it might be useful to quickly review some of the many ways organizations put multiple data stores to work. As usual, my bottom line is:

The most extreme vendor marketing claims are false.
There are many different choices that make sense in at least some use cases each.

Horses for courses

It’s now widely accepted that different data managers are better for different use cases, based on distinctions such as:

Short-request vs. analytic.
SQL vs. non-SQL (NoSQL or otherwise).
Expensive/heavy-duty vs. cheap/easy-to-support.

Vendors are part of this consensus; already in 2005 I observed

For all practical purposes, there are no DBMS vendors left advocating single-server strategies.

Vendor agreement has become even stronger in the interim, as evidenced by Oracle/MySQL, IBM/Netezza, Oracle’s NoSQL dabblings, and various companies’ Hadoop offerings.

Multiple data stores for a single application

We commonly think of one data manager managing one or more databases, each in support of one or more applications. But the other way around works too; it’s normal for a single application to invoke multiple data stores. Indeed, all but the strictest relational bigots would likely agree:

It’s common and sensible to manage authentication and authorization data in its own data store. Commonly, the data format is LDAP (Lightweight Directory Access Protocol).
It’s common and sensible to manage the “content” and “e-commerce transaction records” aspects of websites separately.
Even beyond that case, there are often performance reasons to manage BLOBs (Binary Large OBjects) outside your relational database.
Internet “interaction” data is also often best managed outside an RDBMS, in part because of its very non-tabular data structures.

The spectacular 2010 JP Morgan Chase outage was largely caused, I believe, by disregard of these precepts.

There also are cases in which applications dutifully get all their data via SQL queries, but send those queries to two or more DBMS. Teradata is proud that its systems can support rather transactional queries (for example in call-center use cases), but the same application may read from and write to a true OLTP database as well.

Further, many OLTP (OnLine Transaction Processing) applications do some fraction of their work via inbound or outbound messaging. Many buzzwords can come into play here, including but not limited to:

SOA (Service-Oriented Architecture). This is the most current and flexible one.
EAI (Enterprise Application Integration). This was a hot concept in the late 1990s, but was generally implemented with difficulties that SOA was later designed to alleviate.
Message-oriented middleware (MOM) and Publish/Subscribe. These are even older, and overlap greatly.

Finally, every dashboard that combines information from different data stores could be assigned to this category as well.

Multiple storage approaches in a single DBMS

In theory, a single DBMS could operate like two or more different ones glued together. A few functions should or must be centralized, such as administration, and communication with the outside world (connection handling, parsing, etc.). But data storage, query execution and so on could for the most part be performed by rather loosely coupled subsystems. And so you might have the best of both worlds — something that’s multiple data stores in the ways you want that diversity, but a single system in how it fits into your environment.

I discussed this idea last year with cautious optimism, writing:

So will these trends succeed? The foregoing caveats notwithstanding, my answers are more Yes than No.

… multi-purpose DBMS will likely always have performance penalties, but over time the penalties should become small enough to be affordable in most cases.

…

Machine-generated data and “content” both call for multi-datatype DBMS. And taken together, those are a large fraction of the future of computing. Consequently …

… strong support for multiple datatypes and DMLs is a must for “general-purpose” RDBMS. Oracle and IBM [have] been working on that for 20 years already, with mixed success. I doubt they’ll get much further without a thorough rewrite, but rewrites happen; one of these decades they’re apt to get it right.

In 2005 I had been more ambivalent, in part because my model was a full 1990s-dream “universal” DBMS:

IBM, Oracle, and Microsoft have all worked out ways to have integrated query parsing and query optimization, while letting storage be more or less separate. More precisely, Oracle actually still sticks everything into one data store (hence the lack of native XML support), but allows near-infinite flexibility in how it is accessed. Microsoft has already had separate servers for tabular data, text, and MOLAP, although like Sybase, it doesn’t have general datatype extensibility that it can expose to customers, or exploit itself to provide a great variety of datatypes. IBM has had Oracle-like extensibility all along, although it hasn’t been quite as aggressive at exploiting it; now it’s introduced a separate-server option for XML.

That covers most of the waterfront, but I’d like to more explicitly acknowledge three trends:

Among other things, Hadoop is a collection of DBMS (HBase, Impala, et al.) that in some cases are very loosely coupled to each other. The question is less how well the various data stores work together, and more how mature any one of them is on its own.
The multiple-data-models idea has been extended into schema-on-need, which is sometimes but not always housed in Hadoop.
Even on the relational side, multiple storage capabilities exist in one product.
- Vertica was designed that way from the get-go. (Like the old joke about police duos, one is to read and one is to write.)
- IBM, Microsoft and Oracle have all recently added some kind of in-memory columnar capability.
- Teradata, Aster (before Teradata bought them), Greenplum and Vertica all added some variant on row/column dual stores.

Related links

SQL vs. NoSQL, legacy vs. clean-up. (March, 2014)
The difficulty of DBMS development, including Hadoop-based ones (March, 2013)

Introduction to CitusDB

Curt Monash — Fri, 02 May 2014 08:00:08 +0000

One of my lesser-known clients is Citus Data, a largely Turkish company that is however headquartered in San Francisco. They make CitusDB, which puts a scale-out layer over a collection of fully-functional PostgreSQL nodes, much like Greenplum and Aster Data before it. However, in contrast to those and other Postgres-based analytic MPP (Massively Parallel Processing) DBMS:

CitusDB does not permanently fork PostgreSQL; Citus Data has committed to always working with the latest PostgreSQL release, or at least with one that’s less than a year old.
Citus Data never made the “fat head” mistake — if a join can’t be executed directly on the CitusDB data-storing nodes, it can’t be executed in CitusDB at all.
CitusDB follows the modern best-practice of having many virtual nodes on each physical node. Default size of a virtual node is one gigabyte. Each virtual node is technically its own PostgreSQL table.*
Citus Data has already introduced an open source column-store option for PostgreSQL, which CitusDB of course exploits.

*One benefit to this strategy, besides the usual elasticity and recovery stuff, is that while PostgreSQL may be single-core for any given query, a CitusDB query can use multiple cores by virtue of hitting multiple PostgreSQL tables on each node.

Citus has thrown a few things against the wall; for example, there are two versions of its product, one of which involves HDFS (Hadoop Distributed File System) and one of which doesn’t. But I think Citus’ focus will be scale-out PostgreSQL for at least the medium-term future. Citus does have actual customers, and they weren’t all PostgreSQL users previously. Still, the main hope, at least until the product is more built-out, is that existing PostgreSQL users will find CitusDB easy to adopt, in technology and price alike.

Notwithstanding what I said about “fat heads”, CitusDB does have a concept of Master nodes. These:

Also use single-node copies of PostgreSQL.
Are blessedly able to scale out, although their underlying databases are entirely replicated.
Store no actual data, but do store metadata about each virtual node, including:
- Structural metadata.
- Location.
- Min/max column values (for data skipping).
- But not (yet) stats to help with query optimization.
Do some query planning and rewriting.
Handle administration, some of which is nicely parallelized/centralized. (E.g., an index choice can be made once and automatically propagated across all the relevant virtual nodes.)
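The data-skipping use of min/max metadata can be sketched as follows. This is not CitusDB code; the `ShardMeta` class, the `shards_to_scan` function and the sample catalog are hypothetical names chosen for illustration. The only idea taken from the notes above is that a master keeps location and min/max column values per virtual node, and can use them to avoid touching nodes that cannot contain matching rows.

```python
from dataclasses import dataclass


@dataclass
class ShardMeta:
    """Hypothetical per-virtual-node metadata a master node might keep."""
    shard_id: int
    worker: str      # location of the virtual node
    min_value: int   # smallest value of the filtered column in this shard
    max_value: int   # largest value of the filtered column in this shard


def shards_to_scan(shards, low, high):
    """Return shards whose [min, max] range overlaps the query range [low, high]."""
    return [s for s in shards if s.max_value >= low and s.min_value <= high]


catalog = [
    ShardMeta(1, "worker-a", 0, 999),
    ShardMeta(2, "worker-b", 1000, 1999),
    ShardMeta(3, "worker-a", 2000, 2999),
]

# A filter such as "WHERE col BETWEEN 1500 AND 2100" can skip shard 1 entirely.
selected = shards_to_scan(catalog, 1500, 2100)
print([s.shard_id for s in selected])
```

The skipping is only as good as the clustering of the data: if every shard spans the full value range, every shard still has to be scanned.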

CitusDB is definitely in its early days. For example:

If I understand correctly, the recent CitusDB 3.0 release is the first one on which data is redistributed among shards. Before that, you could only join tables that were either sharded on the same key, or else small enough to be broadcast-replicated across the whole cluster.
SQL coverage isn’t great. (E.g., no Windowing.)
Some hard-to-parallelize things aren’t implemented yet, e.g. exact median or generally-usable COUNT DISTINCT.
ACID is still lacking. Writes are batch-only, micro-batch or otherwise as the case may be.
CitusDB’s backup story is primitive, with the main options being:
- You can rely on having replicas on multiple nodes, even — if you like — in different data centers.
- You can back up each of the PostgreSQL nodes separately; CitusDB doesn’t yet offer automation for that.
CitusDB’s query optimization sounds pretty primitive.
I don’t recall Citus telling me of serious workload management.
CitusDB compression is block-level only. (PostgreSQL’s version of Lempel-Ziv.)
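As a small illustration of why something like COUNT DISTINCT is hard to parallelize, consider the sketch below. The shard contents are made up, and plain Python lists stand in for tables on separate nodes. A row count can be computed per shard and summed, but distinct counts cannot, because the same value may live on several shards.

```python
# Hypothetical data: each inner list stands in for one shard's column values.
shards = [
    ["alice", "bob", "carol", "alice"],
    ["bob", "dave"],
    ["alice", "erin"],
]

# COUNT(*) parallelizes trivially: count per shard, then add.
total_rows = sum(len(shard) for shard in shards)

# Summing per-shard distinct counts overcounts values present on several shards.
naive_distinct = sum(len(set(shard)) for shard in shards)

# An exact answer needs the distinct values themselves brought together
# (or the data repartitioned on that column first).
exact_distinct = len(set().union(*shards))

print(total_rows, naive_distinct, exact_distinct)
```

Exact median has the same flavor: per-shard medians do not combine into the global one.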

Still, the Citus Data folks seem to have good ideas, including some — as yet undisclosed — plans going forward. So if it sounds as if CitusDB might fit your needs better than more established scale-out RDBMS do, I’d encourage you to take a look at what Citus offers.

RDBMS and their bundle-mates

Curt Monash — Sun, 10 Nov 2013 19:22:48 +0000

Relational DBMS used to be fairly straightforward product suites, which boiled down to:

A big SQL interpreter.
A bunch of administrative and operational tools.
Some very optional add-ons, often including an application development tool.

Now, however, most RDBMS are sold as part of something bigger.

Oracle has hugely thickened its stack, as part of an Innovator’s Solution strategy — hardware, middleware, applications, business intelligence, and more.
IBM has moved aggressively to a bundled “appliance” strategy. Even before that, IBM DB2 long sold much better to committed IBM accounts than as a software-only offering.
Microsoft SQL Server is part of a stack, starting with the Windows operating system.
Sybase was an exception to this rule, with thin(ner) stacks for both Adaptive Server Enterprise and Sybase IQ. But Sybase is now owned by SAP, and increasingly integrated as a business with …
… SAP HANA, which is closely associated with SAP’s applications.
Teradata has always been a hardware/software vendor. The most successful of its analytic DBMS rivals, in some order, are:
- Netezza, a pure appliance vendor, now part of IBM.
- Greenplum, an appliance-mainly vendor for most (not all) of its existence, and in particular now as a part of EMC Pivotal.
- Vertica, more of a software-only vendor than the others, but now owned by and increasingly mainstreamed into hardware vendor HP.
MySQL’s glory years were as part of the “LAMP” stack.
Various thin-stack RDBMS that once were or could have been important market players … aren’t. Examples include Progress OpenEdge, IBM Informix, and the various strays adopted by Actian.

This phenomenon is, I think, much more driven by vendors than users. Most of the examples I listed work or could work perfectly well on their own.* But relational database management systems are seen as “strategic” products, which means in particular:

They’re often expensive to adopt (software, hardware, people costs).
They’re also often expensive to switch away from.

And strategic products, high price tags, and thick product stacks commonly go together.

*Netezza is an exception. But Exadata is not; while Oracle data warehousing was in a bad technical place before Exadata, Exadata software is what cleaned the problem up.

Also relevant is that I took those examples from relatively mature RDBMS market segments: high-end OLTP (OnLine Transaction Processing)/general-purpose, mid-range OLTP/general-purpose, and analytic. Products in those sectors have had enough time to be built out. They also tend to have fairly close competitors, as the most important product features (e.g. columnar storage in analytic RDBMS, or online backup across the board) have been imitated numerous times each.

NewSQL, by way of contrast, is just as thin-stack as NoSQL is. Products in those sectors are immature; vendors are completing them first before wedding them to other technology layers. They’re also strongly differentiated; if you tell me what topology you need and which style(s) of API or DML (Data Manipulation Language) you prefer, the list of product candidates I give you may be short indeed.

HBase is the obvious exception to my “NoSQL products stand alone” generalization, but its market position is a matter of debate.

I have mixed feelings about this trend. For starters, I’m grudgingly becoming more sympathetic to DBMS/hardware bundles, notwithstanding their role as a way to gouge more money from customers than the hardware is actually worth. Why? Because of my opinion that there’s a general move toward appliances, clusters and clouds. In particular:

As DBMS become better at straddling and melding RAM, flash and disk, legitimate reasons to optimize hardware/software integration will increase.
Microsoft (with Parallel Data Warehouse) and SAP (with HANA) induce customers to adopt hardware “appliances” even though they don’t sell and profit from the hardware themselves. This shoots down the argument that appliances are only a vendor trick to squeeze out more profits.
Netezza’s super-easy installation was a really nice feature.

When it comes to RDBMS/business intelligence bundles, my thoughts start:

As a general rule, a benefit of BI is that it can get at data from lots of different sources. This speaks against tying it to a specific DBMS.
The vendor-specific evidence is mixed:
- IBM has never explained any user advantages to including Cognos in its analytic “appliance” product lines.
- Teradata did some special optimizations for MicroStrategy. This suggests that, conversely, MicroStrategy could benefit from DBMS-specific features.
- QlikView built a custom in-memory data store.
- Specialized business intelligence stacks are on the rise, although generally with a beyond-just-relational flavor.

And so I’m skeptical about RDBMS/BI integration, but willing to be persuaded otherwise.

The integration of advanced analytics with RDBMS leaves me perplexed. Gains in performance, scalability and/or development ease would seem, in many cases, too great to pass up. (E.g., the Teradata Aster 6 story, analytic libraries and all.) And indeed most analytic platform vendors report some level of adoption. But the whole thing is moving more slowly than I expected. Meanwhile in the Hadoop world, a much lesser SQL capability, Hive, seems to be integrated into other analytic processing with enthusiasm. Perhaps the problem is that enterprises have to figure out which analytic techniques to use in the first place, before they worry too much about making them efficient.

And finally, when it comes to bundling of packaged applications with RDBMS — that depends on the class of application.

At the high end, it’s almost purely a pricing ploy, as those apps are usually written for lowest-common-denominator SQL functionality, so as to preserve portability.
A lot of mid-range apps are written against a specific DBMS, which is then resold along with the app. What’s more …
… most of those apps will migrate over time to a SaaS (Software as a Service) delivery model, which allows for a wholly integrated stack. And as the Workday example teaches us, database choices for SaaS apps can be pretty imaginative.

Related links

The refactoring of everything (July, 2013)
Comments about Gartner’s comments about a bunch of DBMS products (November, 2013)
The cardinal rules of DBMS development (March, 2013)

Entity-centric event series analytics

Curt Monash — Fri, 18 Oct 2013 08:29:24 +0000

Much of modern analytic technology deals with what might be called an entity-centric sequence of events. For example:

You receive and open various emails.
You click on and look at various web sites and pages.
Specific elements are displayed on those pages.
You study various products, and even buy some.

Analytic questions are asked along the lines of “Which sequences of events are most productive in terms of leading to the events we really desire?”, such as product sales. Another major area is sessionization, along with data preparation tasks that boil down to arranging data into meaningful event sequences in the first place.
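For readers who haven't met sessionization, here is a minimal sketch of one common form of it: splitting each entity's events into sessions wherever there is a long enough gap of inactivity. The function name, the 30-minute default gap and the sample events are all assumptions made for illustration, not anything specific to the products named below.

```python
from collections import defaultdict


def sessionize(events, gap_seconds=1800):
    """Group (entity, timestamp) pairs into per-entity sessions.

    A new session starts whenever the gap since the entity's previous
    event exceeds gap_seconds (30 minutes by default, an assumed value).
    """
    by_entity = defaultdict(list)
    for entity, ts in events:
        by_entity[entity].append(ts)

    sessions = {}
    for entity, stamps in by_entity.items():
        stamps.sort()
        current = [stamps[0]]
        result = []
        for ts in stamps[1:]:
            if ts - current[-1] > gap_seconds:
                result.append(current)   # close the session at the gap
                current = [ts]
            else:
                current.append(ts)
        result.append(current)
        sessions[entity] = result
    return sessions


# Hypothetical events: (entity id, timestamp in seconds), arriving out of order.
events = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100), ("u1", 4100)]
sessions = sessionize(events)
print(sessions)
```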

A number of my clients are focused on such scenarios, including WibiData, Teradata Aster (e.g. via nPath), Platfora (in the imminent Platfora 3), and others. And so I get involved in naming exercises. The term entity-centric came along a while ago, because “user-centric” is too limiting. (E.g., the data may not be about a person, but rather specifically about the actions taken on her mobile device.) Now I’m adding the term event series to cover the whole scenario, rather than the “event sequence(s)” I might appear to have been hinting at above.

I decided on “event series” earlier this week, after noting that:

“Time series” isn’t quite right, because it generally refers to a collection of time-stamped data of a single datatype.
“Event stream” isn’t quite right, because it connotes the immediacy of complex event/stream processing.
“Series” sounds better than “sequence”. While “sequence” would be the more accurate term from a strict mathematical standpoint, that ship sailed when time series weren’t called “time sequences” instead.

And that was even before I recalled hearing the term from Vertica a couple of years ago.

Analyzing event series is tricky even when all the events are of the same kind, and hence naturally fit into the same database table. For example:

Even the most specific of pattern-matches can, in SQL, require several nestings of time-stamp range sub-queries. (How else do you ensure that Event 2 happened after Event 1 but before Event 3?)
The most common end-user business intelligence UIs aren’t well suited to such analyses; specific new ones are being invented instead. I think they’re already OK for static views – trees, funnels, etc. – but I haven’t seen anything yet that seems great for navigation, or for human real-time interaction.
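For contrast with the nested SQL sub-queries, here is a rough Python sketch of the “Event 1, then Event 2, then Event 3” check, done procedurally. It assumes the same hypothetical (entity_id, timestamp, event_type) tuples as a stand-in for a single event table, and it allows other events to occur in between the matched ones. It is not meant to represent nPath or any other product feature.

```python
def matches_in_order(events, pattern):
    """Return the entity ids whose time-ordered events contain `pattern`
    as an ordered (not necessarily contiguous) subsequence."""
    by_entity = {}
    # Sort by entity, then by timestamp, so each entity's list is in time order.
    for entity_id, ts, event_type in sorted(events, key=lambda e: (e[0], e[1])):
        by_entity.setdefault(entity_id, []).append(event_type)

    matched = []
    for entity_id, types in by_entity.items():
        remaining = iter(types)
        # `step in remaining` consumes the iterator up to the first match,
        # so each later step can only match an event that came afterwards.
        if all(step in remaining for step in pattern):
            matched.append(entity_id)
    return matched


events = [
    ("a", 1, "email_open"), ("a", 2, "page_view"), ("a", 3, "purchase"),
    ("b", 1, "purchase"), ("b", 2, "email_open"), ("b", 3, "page_view"),
]
buyers = matches_in_order(events, ["email_open", "page_view", "purchase"])
```

Entity "a" matches because the three events occur in the required order. Entity "b" has all three event types but purchased first, so it does not.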

When you’re correlating events from multiple database columns or tables – or their nested data structure equivalents – things get hairier yet.

I also think that predictive modeling on event series, a huge subject for consumer internet companies, still has a long way to go. How exactly do you characterize the independent variables? For that matter, how do you characterize the dependent ones?

Bottom line: Event series are likely to be a major subject of data management and analytics innovation for a number of years to come.

Aster 6, graph analytics, and BSP

Curt Monash — Thu, 10 Oct 2013 11:42:38 +0000

Teradata Aster 6 has been preannounced (beta in Q4, general release in Q1 2014). The general architectural idea is:

There are multiple data stores, the first two of which are:
- The classic Aster relational data store.
- A file system that emulates HDFS (Hadoop Distributed File System).
There are multiple processing “engines”, where an engine is what occupies and controls a processing thread. These start with:
- Generic analytic SQL, as Aster has had all along.
- SQL-MR, the MapReduce Aster has also had all along.
- SQL-Graph aka SQL-GR, a graph analytics system.
The Aster parser and optimizer accept glorified SQL, and work across all the engines combined.

There’s much more, of course, but those are the essential pieces.

Just to be clear: Teradata Aster 6, aka the Teradata Aster Discovery Platform, includes HDFS compatibility, native MapReduce and ways of invoking Hadoop MapReduce on non-Aster nodes or clusters — but even so, you can’t run Hadoop MapReduce within Aster over Aster’s version of HDFS.

The most dramatic immediate additions are in the graph analytics area.* The new SQL-Graph is supported by something called BSP (Bulk Synchronous Parallel). I’ll start by observing (and some of this is confusing):

BSP was thought of a long time ago, as a general-purpose computing model, but recently has come to the fore specifically for graph analytics. (Think Pregel and Giraph, along with Teradata Aster.)
BSP has a kind of execution-graph metaphor, which is different from the graph data it helps analyze.
BSP is described as being a combination hardware/software technology, but Teradata Aster and everybody else I know of implement it in software only.
Aster long ago talked of adding a graph data store, but has given up that plan; rather, it wants you to do graph analytics on data stored in tables (or accessed through views) in the usual way.

Use cases suggested are a lot of marketing, plus anti-fraud.

*Pay no attention to Aster’s previous claims to do a good job on graph — and not only via nPath — in SQL-MR.

So far as I can infer from examples I’ve seen, the semantics of Teradata Aster SQL-Graph start:

Ordinary SQL except in the FROM clause.
Functions/operators that are the arguments for FROM; of course, they output tables. You can write these yourself, or use Teradata Aster’s prebuilt ones.

Within those functions, the core idea is:

Various tables are explicitly given the roles of “Vertices”, “Edges”, and so on. (It can get reasonably complicated; e.g., “Vertices_1” and “Vertices_2” for a bipartite graph.)
Those “tables” can actually instead be views, subqueries or whatever.

Specific prebuilt functions — Aster is big on prebuilt functions — include but surely aren’t limited to:

PageRank (which of course generally is a way to estimate individual vertices’ relative influence).
Various things that seem to focus on measuring which relationships are or aren’t significant. (I’m not sure whether they’re NDA or not, so to stay on the safe side I won’t spell them out.)
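For readers who want a reminder of what PageRank computes, here is a textbook power-iteration sketch in Python. It is a generic illustration and says nothing about how Aster's prebuilt function is implemented or invoked. The damping factor of 0.85, the iteration count and the handling of vertices with no outgoing edges are conventional assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Estimate relative vertex influence from a list of (src, dst) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex gets a small baseline share of rank.
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            # A vertex with no outgoing edges spreads its rank over all vertices.
            targets = out_links[v] or vertices
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
```

In this three-vertex example, "c" is pointed to by both other vertices and ends up with the highest rank, while "b", which nothing points to, ends up with the lowest.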

Truth be told, these prebuilt functions sound pretty interesting.

As for underpinnings — the idea behind BSP is:

You have a computing job that is both iterative and parallel.
You parallelize it among a bunch of logical vertices, which may or may not correspond to physical computing servers.
The job is broken up into “supersteps”, wherein local processing happens at each vertex.
At the end of a superstep, each vertex can send messages to other vertices. The next superstep can’t start until all the messages have arrived.
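The superstep idea can be sketched in a few lines of single-process Python. The example below labels connected components by having each vertex repeatedly adopt the smallest vertex id it has heard of. It only simulates the BSP pattern (local processing, message sending, then a barrier) and is an assumption-laden toy, not how Aster, Pregel or Giraph is actually built.

```python
def bsp_connected_components(edges):
    """Label every vertex with the smallest vertex id in its component.

    Returns (labels, supersteps_run).
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    label = {v: v for v in neighbors}

    # Initial step: every vertex sends its own id to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    supersteps = 0
    while any(inbox.values()):          # stop when no messages are in flight
        supersteps += 1
        outbox = {v: [] for v in neighbors}
        # Local processing: each vertex looks only at its own state and inbox.
        for v, messages in inbox.items():
            if messages and min(messages) < label[v]:
                label[v] = min(messages)
                for n in neighbors[v]:
                    outbox[n].append(label[v])
        # Barrier: messages become visible only once every vertex has finished.
        inbox = outbox
    return label, supersteps


labels, steps = bsp_connected_components([(1, 2), (2, 3), (4, 5)])
```

Swapping `inbox` for `outbox` only at the end of the loop body is what models the rule that the next superstep cannot start until all messages have arrived.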

Hopefully, various problems with message latency and unreliability that arise in other models of parallel computing are obviated by BSP.

So why use BSP for graph analytics? Well, it’s pretty obvious why BSP would be a decent model; the real question is why something that relies on classical data partitioning isn’t even better. And of course the answer to that one is that data partitioning doesn’t work for most graphs; whatever you do, there are going to be a whole lot of edges crossing partition boundaries.

Real-world graphs have short average path lengths — Six Degrees of Separation and all that. While that isn’t in itself a proof that partitioning can’t work, it should at least serve as a strong plausibility argument.

Since this is a first release of a graph-processing capability, it’s safe to assume there’s a lot missing. For example, every SQL-GR graph operation starts by retrieving data and building a graph; there’s no reuse. I presume that some analytic operations aren’t explicitly supported yet, or are of questionable performance. (Subgraph pattern matching was mentioned as an area that was not yet optimized for.) But with all those caveats, this still feels like a pretty interesting entry into the relationship analytics market.

Libraries in Teradata Aster

Curt Monash — Thu, 10 Oct 2013 11:40:38 +0000

I recently wrote (emphasis added):

My clients at Teradata Aster probably see things differently, but I don’t think their library of pre-built analytic packages has been a big success. The same goes for other analytic platform vendors who have done similar (generally lesser) things. I believe that this is because such limited libraries don’t do enough of what users want.

The bolded part (“My clients at Teradata Aster probably see things differently”) has been, shall we say, confirmed. As Randy Lea tells it, Teradata Aster sales qualification includes the determination that at least one SQL-MR operator be relevant to the use case. (“Operator” seems to be the word now, rather than “function”.) Randy agreed that some users prefer hand-coding, but believes a large majority would like to push work to data analysts/business analysts who might have strong SQL skills, but be less adept at general mathematical programming.

This phrasing will all be less accurate after the release of Aster 6, which extends Aster’s capabilities beyond the trinity of SQL, the SQL-MR library, and Aster-supported hand-coding.

Randy also said:

A typical Teradata Aster production customer uses 8-12 of the prebuilt functions (but now they seem to be called operators).
nPath is used in almost every Aster account. (And by now nPath has morphed into a family of about 5 different things.)
The Aster collaborative filtering operator is used in almost every account.
Ditto a/the text operator.
Several business intelligence vendors are partnering for direct access to selected Teradata Aster operators — mentioned were Tableau, TIBCO Spotfire, and Alteryx.
I don’t know whether this is on the strength of a specific operator or not, but Aster is used to help with predictive parts failure applications in multiple industries.

And Randy seemed to agree when I put words in his mouth to the effect that the prebuilt operators save users months of development time.

Meanwhile, Teradata Aster has started a whole new library for relationship analytics.

What matters in investigative analytics?

Curt Monash — Sun, 06 Oct 2013 12:10:21 +0000

In a general pontification on positioning, I wrote:

every product in a category is positioned along the same set of attributes,

and went on to suggest that summary attributes were more important than picky detailed ones. So how does that play out for investigative analytics?

First, summary attributes that matter for almost any kind of enterprise software include:

Performance and scalability. I write about analytic performance and scalability a lot. Usually that’s in the context of analytic DBMS, but it also arises in analytic stacks such as Platfora, Metamarkets or even QlikView, and also in the challenges of making predictive modeling scale.
Reliability, availability and security.* This is more crucial for short-request applications than analytic ones, but even your analytic systems shouldn’t leak data or crash.
Goodness of fit with legacy systems. I hate that one, because enterprises often sacrifice way too much in favor of that benefit.
Price. Duh.

*I picked up that phrase when — abbreviated as RAS — it was used to characterize the emphasis for Oracle 8. I like it better than a general and ambiguous concept of “enterprise-ready”.

The reason I’m writing this post, however, is to call out two summary attributes of special importance in investigative analytics (which regrettably often conflict with each other), namely:

Agility. People don’t want to submit requests for reports or statistical analyses; they want to get answers as soon as the questions come to mind.
Completeness of feature set — for a particular use case, that is. There’s no such thing as an investigative analytics offering with a feature set that’s close to complete for all purposes; even SAS, IBM and other behemoths fall short.

Much of what I work on boils down to those two subjects. For example:

I recently suggested that navigation is a huge part of business intelligence differentiation. That’s because good navigation pretty much equates to BI agility. With luck, a BI tool that has the right navigation on the right data will get you to the result set you want, all within a few minutes.
There’s an obvious demand for agile predictive analytics. But if agility were all that mattered, KXEN — which excels in agility — would probably have done a lot better; KXEN’s problem was that it didn’t offer enough algorithmic breadth to meet enough users’ demands or needs.
Conversely, SAS has an exceptionally broad feature set. But few parts of the SAS product line offer much in the way of agility.
I’ve argued that analytic apps need to be continually customized, which is about as strong a pitch for agility as one can make. And that’s one of the major reasons that packaged analytic apps can’t really be feature-complete.
On the other hand, if you view incomplete predictive modeling apps as agility-enhancing application quick-starts — well, you’ve just described some of the most agile and also some of the most important parts of the SAS product line.
From an agility standpoint, the integration of predictive modeling into business intelligence would seem like pure goodness. Unfortunately, the most natural ways to do such integration would have very limited predictive features.
My clients at Teradata Aster probably see things differently — Edit: indeed they do — but I don’t think their library of pre-built analytic packages has been a big success. The same goes for other analytic platform vendors who have done similar (generally lesser) things. I believe that this is because such limited libraries don’t do enough of what users want.
I noted in July that complex, multi-stage predictive modeling is increasingly in vogue. Well, if predictive modeling is much more complicated than before, then things have to happen to make each step — or at least the average step — a lot easier and faster. I think that’s a core part of the value proposition for startups such as Ayasdi.

And finally: It is easier to be feature-complete — or at least feature-rich — for particular markets than across-the-board. That’s why I’ve steered a number of full-stack BI or predictive modeling technology clients toward vertical strategies.