October 17, 2012

Notes on analytic hardware

I took the opportunity of Teradata’s Aster/Hadoop appliance announcement to catch up with Teradata hardware chief Carson Schmidt. I love talking with Carson, about both general design philosophy and his views on specific hardware component technologies.

From a hardware-requirements standpoint, Carson seems to view Aster and Hadoop as more similar to each other than either is to, say, a Teradata Active Data Warehouse. In particular, for Aster and Hadoop:

The most obvious implication is differences in the choice of parts, and in their ratios. Also, in the new Aster/Hadoop appliance, Carson is content to skate by with RAID 5 rather than RAID 1.
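
To see why RAID 5 is the thriftier choice, consider a quick back-of-the-envelope comparison of usable capacity. The sketch below uses made-up drive counts and sizes, not Teradata's actual configuration:

```python
# Illustrative usable-capacity comparison of RAID 1 vs. RAID 5.
# Drive count and size are hypothetical, not Teradata's actual configuration.

def usable_tb(drives: int, drive_tb: float, raid_level: int) -> float:
    """Usable capacity of a single RAID group, ignoring hot spares."""
    if raid_level == 1:
        return drives / 2 * drive_tb      # mirroring: half the raw capacity
    if raid_level == 5:
        return (drives - 1) * drive_tb    # lose one drive's worth to parity
    raise ValueError("only RAID 1 and RAID 5 are modeled here")

drives, drive_tb = 12, 3.0                # e.g., one tray of 12 x 3 TB drives
print(f"RAID 1: {usable_tb(drives, drive_tb, 1):.0f} TB usable")  # 18 TB
print(f"RAID 5: {usable_tb(drives, drive_tb, 5):.0f} TB usable")  # 33 TB
```

The flip side is that RAID 5 pays for that extra capacity with parity-update overhead on writes and slower rebuilds, which is presumably why mirroring remains attractive for more update-intensive systems.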

I think Carson’s views about flash memory can be reasonably summarized as: Read more

October 17, 2012

Hadoop/RDBMS integration: Aster SQL-H and Hadapt

Two of the more interesting approaches for integrating Hadoop and MapReduce with relational DBMS come from my clients at Teradata Aster (via SQL/MR and SQL-H) and Hadapt. In both cases, the story starts:

Of course, there are plenty of differences. Those start: Read more
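
For a flavor of what the SQL/MR half of that integration looks like from client code, here is a sketch. The host, table, column, and function names are all hypothetical, the SQL/MR argument grammar is simplified, and the connection goes through psycopg2 on the assumption that Aster's PostgreSQL-derived wire protocol accepts it:

```python
# Sketch: invoking an Aster-style SQL/MR table function from Python.
# Host, database, table, and function names are all hypothetical, and
# the argument syntax is simplified -- this shows the shape of SQL/MR,
# not its exact grammar.
import psycopg2  # assumes a PostgreSQL-compatible driver works here

conn = psycopg2.connect(host="aster-queen", dbname="beehive",
                        user="analyst", password="...")
cur = conn.cursor()

# A SQL/MR function sits in the FROM clause like a table: the ON clause
# feeds it rows, PARTITION BY / ORDER BY control how the underlying
# MapReduce job sees those rows, and the output is just another relation.
cur.execute("""
    SELECT user_id, session_id, COUNT(*) AS clicks
    FROM sessionize(
        ON weblog_clicks
        PARTITION BY user_id
        ORDER BY click_time
    )
    GROUP BY user_id, session_id
""")
for row in cur.fetchmany(10):
    print(row)
```

The point of the pattern is that MapReduce-style polymorphic functions and ordinary SQL compose in a single statement, which is also roughly the spirit of SQL-H's reach into Hadoop-resident data.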

October 17, 2012

The Teradata Aster Big Analytics Aster/Hadoop appliance

My clients at Teradata are introducing a mix-em/match-em Aster/Hadoop box, officially called the Teradata Aster Big Analytics Appliance. Basics include:

My views on the Teradata Aster Big Analytics Appliance start: Read more

October 16, 2012

Hadapt Version 2

My clients at Hadapt are coming out with a Version 2 to be available in Q1 2013, and perhaps slipstreaming some of the features before then. At that point, it will be reasonable to regard Hadapt as offering:

Solr is in the mix as well.

Hadapt+Hadoop is positioned much more as “better than Hadoop” than as “a better scale-out RDBMS”, and rightly so, given its limitations when viewed strictly from an analytic RDBMS standpoint. I.e., Hadapt is meant for enterprises that want to do several of:

Hadapt has 6 or so production customers, a dozen or so more coming online soon, 35 or so employees (mainly in Cambridge or Poland), reasonable amounts of venture capital, and the involvement of a variety of industry luminaries. Hadapt’s biggest installation seems to have tens of terabytes of relational data and hundreds of terabytes of multi-structured data; Hadapt is very confident in its ability to scale an order of magnitude beyond that with the Version 2 product, and reasonably confident it could go even further.

At the highest level, Hadapt works like this: Read more

October 15, 2012

What is meant by “iterative analytics”

A number of people and companies are using the term “iterative analytics”. This is confusing, because it can mean at least three different things:

  1. You analyze something quickly, decide the result is not wholly satisfactory, and try again. Examples might include:
    • Aggressive use of drilldown, perhaps via an advanced-interface business intelligence tool such as Tableau or QlikView.
    • Any case where you run a query or a model, think about the results, and run another one after that.
  2. You develop an intermediate analytic result, and use it as input to the next round of analysis (see the sketch below). This is roughly equivalent to saying that iterative analytics refers to a multi-step analytic process involving a lot of derived data.
  3. #1 and #2 conflated/combined. This is roughly equivalent to saying that iterative analytics refers to all of investigative analytics, sometimes known instead as exploratory analytics.

Based on both my personal conversations and a quick Google check, it’s reasonable to say that #1 and #3 seem to be the most common usages, with #2 trailing a little bit behind.

But often it’s hard to be sure which of the various possible meanings somebody has in mind.
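
To make meaning #2 concrete, here is a minimal sketch of a multi-step analysis in which each round's derived data feeds the next. The data, column names, and threshold are all made up:

```python
# Sketch of "iterative analytics" in sense #2: each step's derived
# result becomes the input to the next step. All data here is made up.
import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "a", "b", "c", "c", "c"],
    "amount":   [120,  80, 300,  40,  60, 500],
})

# Step 1: derive per-customer aggregates from the raw data.
per_customer = orders.groupby("customer")["amount"].agg(["sum", "count"])

# Step 2: the derived table -- not the raw one -- is the input here.
high_value = per_customer[per_customer["sum"] > 250]

# Step 3: and that result in turn drives the next round of analysis.
share = high_value["sum"].sum() / orders["amount"].sum()
print(f"High-value customers account for {share:.0%} of revenue")
```

In meaning #1, by contrast, the iteration happens in the analyst's head between queries rather than in the dataflow itself.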

Related links

Monash’s First and Third Laws of Commercial Semantics state:

October 12, 2012

(Relational) database (management system) — three analytic glossary draft entries

These are three closely-related draft entries for the DBMS2 analytic glossary. Please comment with any ideas you have for their improvement!

1. Database management system (DBMS)

In our definition, a database management system (DBMS) is:

Commonly, that API takes the form of a data manipulation language (DML) such as SQL or MDX, but our definition allows for APIs as simple as those of key-value stores.
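
To illustrate how simple the low end of that API spectrum can be, here is a sketch of a generic key-value interface (an in-memory stand-in, not any particular product's API):

```python
# Sketch: a data management API can be as simple as get/put/delete.
# This in-memory stand-in mimics a key-value store's interface; real
# products add persistence, distribution, and concurrency control.
from typing import Optional

class KeyValueStore:
    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", b'{"name": "Ada"}')
print(store.get("user:42"))  # the whole "query language" is get-by-key
```

By the definition above, something exposing only this interface still qualifies as a DBMS; under the first alternative definition below, it would merely be “data management software”.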

There are two major alternatives to our definition:

  1. The above could be a definition of “data management software”, with the term “DBMS” reserved for systems with a true DML.
  2. Many vendors and industry observers abbreviate “database management system” or “data management software” as “database”.

Two important distinctions among categories of DBMS and the processing they’re optimized for are:

2. Database

The term database has two common meanings in IT: Read more

October 11, 2012

Oracle and IBM — strategic context

By my standards, I’ve been writing a lot about Oracle and IBM recently. Let me now step back and review the context in which I view them.

At the highest level, Oracle and IBM have similar strategic priorities, in line with the Innovator’s Dilemma/Innovator’s Solution issues I keep mentioning. That is:

Of course, there are major differences in the two companies’ product and service portfolios. Some of the biggest are: Read more

October 9, 2012

IBM Pure jargon

As best I can tell, IBM now has three related families of hardware/software bundles, aka appliances, aka PureSystems, aka something that sounds like “expert system” but in fact has nothing to do with the traditional rules-engine meaning of that term. In particular,

Within the PureData line, there are three sub-families:

The Netezza part of the story seems to start:

Perhaps someday I’ll be able to supply interesting details, for example about the concurrency improvement or about the uses (if any) customers are finding for Netezza’s in-database analytics — but as previously noted, analyzing big companies is hard.

October 7, 2012

IBM’s ETL

Bearing in mind the difficulties in covering big companies and their products, I had a call with IBM about its core ETL technology (Extract/Transform/Load), and have some notes accordingly. It’s pretty reasonable to say that there are and were a Big Three of high-end ETL vendors:

However, IBM fondly thinks there are a Big Two, on the theory that Informatica PowerCenter can’t scale as well as IBM and Ab Initio can, and hence gets knocked out of deals when particularly strong scalability and throughput are required. Read more
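
For readers who don't live in this world, the pattern itself is simple; the sketch below shows the three stages in miniature. File and table names are hypothetical, and the high-end products' value lies in scalability, metadata, and connectivity rather than in the basic pattern:

```python
# Sketch of the core ETL pattern: extract, transform, load.
# Source and target names are hypothetical; the high-end tools discussed
# here differ mainly in how far they can parallelize each stage.
import csv
import sqlite3

def extract(path):
    """Extract: pull raw rows out of a source system (here, a CSV feed)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: cleanse and reshape rows on their way through."""
    for row in rows:
        yield (row["customer_id"].strip(), float(row["amount"]))

def load(rows, conn):
    """Load: write the conformed rows into the warehouse target table."""
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (customer_id TEXT, amount REAL)")
load(transform(extract("sales_feed.csv")), conn)
```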

October 6, 2012

Analyzing big companies is hard

Analyzing companies of any size is hard. Analyzing large ones, however, is harder yet.

Such limitations should be borne in mind in connection with anything I write about, for example, Oracle, Microsoft, IBM, or SAP.

There are many reasons for large companies to communicate less usefully with analysts than smaller ones do. Some of the biggest are:

Read more
