Aster Data

Analysis of data warehouse DBMS vendor Aster Data. Related subjects include:

October 17, 2012

Hadoop/RDBMS integration: Aster SQL-H and Hadapt

Two of the more interesting approaches for integrating Hadoop and MapReduce with relational DBMS come from my clients at Teradata Aster (via SQL/MR and SQL-H) and Hadapt. In both cases, the story starts:

Of course, there are plenty of differences. Those start: Read more

October 17, 2012

The Teradata Aster Big Analytics Aster/Hadoop appliance

My clients at Teradata are introducing a mix-em/match-em Aster/Hadoop box, officially called the Teradata Aster Big Analytics Appliance. Basics include:

My views on the Teradata Aster Big Analytics Appliance start: Read more

September 27, 2012

Hoping for true columnar storage in Oracle12c

I was asked to clarify one of my July comments on Oracle12c,

I wonder whether Oracle will finally introduce a true columnar storage option, a year behind Teradata. That would be the obvious enhancement on the data warehousing side, if they can pull it off. If they can’t, it’s a damning commentary on the core Oracle codebase.

by somebody smart who however seemed to have half-forgotten my post comparing (hybrid) columnar compression to (hybrid) columnar storage.

In simplest terms:

August 26, 2012

How immediate consistency works

This post started as a minor paragraph in another one I’m drafting. But it grew. Please also see the comment thread below.

Increasingly many data management systems store data in a cluster, putting several copies of data — i.e. “replicas” — onto different nodes, for safety and reliable accessibility. (The number of copies is called the “replication factor”.) But how do they know that the different copies of the data really have the same values? It seems there are three main approaches to immediate consistency, which may be called:

I shall explain.

Two-phase commit has been around for decades. Its core idea is:

Unless a piece of the system malfunctions at exactly the wrong time, you’ll get your consistent write. And if there indeed is an unfortunate glitch — well, that’s what recovery is for.

But 2PC has a flaw: If a node is inaccessible or down, then the write is blocked, even if other parts of the system were able to accept the data safely. So the NoSQL world sometimes chooses RYW consistency, which in essence is a loose form of 2PC: Read more

August 19, 2012

In-database analytics — analytic glossary draft entry

This is a draft entry for the DBMS2 analytic glossary. Please comment with any ideas you have for its improvement!

Note: Words and phrases in italics will be linked to other entries when the glossary is complete.

“In-database analytics” is a catch-all term for analytic capabilities, beyond standard SQL, running on the same machine as and under the management of an analytic DBMS. These can run in one or both of two modes:

In-database analytics may offer great performance and scalability advantages versus the alternative of extracting data and having it be processed on a separate server. This is particularly likely to be the case in MPP (Massively Parallel Processing) analytic DBMS environments.

Examples of in-database analytics include:

Other common domains for in-database analytics include sessionization, time series analysis, and relationship analytics.

Notable products offering in-database analytics include:

August 19, 2012

Analytic platform — analytic glossary draft entry

This is a draft entry for the DBMS2 analytic glossary. Please comment with any ideas you have for its improvement!

Note: Words and phrases in italics will be linked to other entries when the glossary is complete.

In our usage, an “analytic platform” is an analytic DBMS with well-integrated in-database analytics, or a data warehouse appliance that includes one. The term is also sometimes used to refer to:

To varying extents, most major vendors of analytic DBMS or data warehouse appliances have extended their products into analytic platforms; see, for example, our original coverage of analytic platform versions of as Aster, Netezza, or Vertica.

Related posts

July 12, 2012

Approximate query results

In theory:

And so it would seem that query results always have to be exact. Even so, there are at least four different practical scenarios in which query results can reasonably be regarded as approximate, each associated with query languages that can supersede standard set-theoretic SQL.

Actually, there’s a fifth, and it’s a huge one — some fraction of your data is just plain wrong. But that’s not what this post is about.

First, some queries don’t have binary results, even in principle. Notably, text queries are answered via relevancy rankings, which fit badly into the relational model.

Second — and this can be combined with the first — you might want to generalize the query to look for partial matches. For example, Yarcdata suggested to me a scenario in which:

Similarly, if you’re looking for geographic proximity, it’s common to extend the allowed radius to fish for more results. Or one can walk up the hierarchy in a dimensional model.

Third, sometimes you just don’t have the data for any kind of precise answer at all. One adaptation I’ve mentioned before is to interpolate time series with synthetic data, and send back “precise” results based on that. In the same post I mentioned the Vertica “range join”, wherein users deliberately throw away part of their data — only storing the range it was in — and then join accordingly.

As Donald Rumsfeld might have said — and would have done well to reflect upon — you go into decision-making with the data you have, not the data you wish you had.

Finally, sometimes there’s a precise answer in principle, but for performance reasons you accept an approximate one, at least to start with. Numerous companies have told me stories around this, including:

The latter two categories led me to ask vendors how customers actually make use of their exotic SQL capabilities. Answers boiled down to:

Perhaps the answers will never get much better; it’s tough to get packaged software vendors to support vendor-specific SQL, unless the vendor is Oracle. Even so, we’re seeing ever more ways in which conventional SQL DBMS are being superseded by data management and analytic alternatives.

June 26, 2012

Teradata SQL-H, using HCatalog

When I grumbled about the conference-related rush of Hadoop announcements, one example of many was Teradata Aster’s SQL-H. Still, it’s an interesting idea, and a good hook for my first shot at writing about HCatalog. Indeed, other than the Talend integration bundled into Hortonworks’ HDP 1, Teradata SQL-H is the first real use of HCatalog I’m aware of.

The Teradata SQL-H idea is:

At least in theory, Teradata SQL-H lets you use a full set of analytic tools against your Hadoop data, with little limitation except price and/or performance. Teradata thinks the performance of all this can be much better than if you just use Hadoop (35X was mentioned in one particularly favorable example), but perhaps much worse than if you just copy/extract the data to an Aster cluster in the first place.

So what might the use cases be for something like SQL-H? Offhand, I’d say:

By way of contrast, the whole thing makes less sense for dashboarding kinds of uses, unless the dashboard users are very patient when they want to drill down.

May 13, 2012

Notes on the analysis of large graphs

This post is part of a series on managing and analyzing graph data. Posts to date include:

My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.

How big can a graph be? That of course depends on:

*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.

The biggest graph-size estimates I’ve gotten are from my clients at Yarcdata, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:

Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the nodes/edge average seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.

Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be: Read more

November 23, 2011

Hope for a new PostgreSQL era?

In a comedy of briefing errors, I’m not too clear on the details of my client salesforce.com’s new PostgreSQL-as-a-service offering, nor exactly on what my clients at VMware are bringing to the PostgreSQL virtualization/cloud party. That said:

So I think it would be cool if one or the other big company put significant wood behind the PostgreSQL arrow.

*While Vertica was originally released using little or no PostgreSQL code — reports varied — it featured high degrees of PostgreSQL compatibility.

← Previous PageNext Page →

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.