October 2, 2008

History, focus, and technology of HP Neoview

On the basis of market impact to date, HP Neoview is just another data warehouse market participant – a dozen sales or so, a few systems in production, some evidence that it can handle 100 TB+ workloads, and so on. But HP’s BI Group CTO Greg Battas thinks Neoview is destined for greater things, because:

HP is Really Serious about making Neoview into a great, high-end system, no matter how long it takes. Reasons are much as you’d expect, including that data warehousing is a large fraction of all computing expenditure, and HP CEO Mark Hurd came over from Teradata.
There’s been a lot of investment and a long technical pedigree so far.
Greg thinks Neoview’s technology is really cool.

Greg says that he actually started the Neoview project as –- if I may be so bold as to paraphrase — a skunkworks, big-honking-data-mart, data warehouse appliance effort. However, his bosses redirected him toward a super-high-end emphasis, and the Neoview technical mandate is now to focus on “Three Cs” — Concurrency, Complexity, and Capacity – with Capacity being the easiest of the three. Performance, by way of contrast, is a relatively low priority, occupying perhaps 10-15% of the Neoview R&D budget. Workload management (25% of all R&D) is a bigger deal, as is the ability to execute a broad variety of queries, including complex ones.

But there also was some “technology in the bank” to draw on. Everybody knows that Neoview in some way grew out of Tandem’s NonstopSQL, which was one of the great relational DBMS of the 1980s. What is less obvious is exactly how that great OLTP DBMS – which was most famously used for running automatic teller machine (ATM) networks — turned into a data warehousing product. The story turns out to be that in the 1990s, Microsoft threw $40-50 million at the Nonstop SQL group, to port Tandem’s system software onto Windows. Greg confusingly reports both that this effort occurred in the “late” 1990s and that it was shelved after the Tandem/Compaq merger, which occurred in 1997. Anyhow …

… at some time in the semi-distant past, Tandem wrote 1 million lines of new code. This included two major new components or redesigns to the DBMS that are highly relevant to data warehousing, namely:

A shared-nothing, pipelined dataflow.
A new optimizer based on the University of Wisconsin Cascades project.

And then the code just sat on the shelf, until HP bought Compaq, Mark Hurd came over to run HP, and HP decided to get into the data warehouse appliance business.

So what does this all amount to for the HP Neoview technology architecture? Well, highlights include:

Neoview’s basic approach, as in most comparable systems, relies heavy on hashes and table scans. But Neoview’s reliance on these isn’t absolute. For example, Greg specifically mentioned nested-loop joins being used at Wal-Mart.
Neoview’s default block size is a Teradata-ish 128K. The other option is 32K.
Expressions – I assume this means projects and selects – are done via a kind of byte code, on the CPU. Greg suggested Teradata uses a similar approach.
Neoview uses a proprietary interconnect based on Tandem technology in the form of a switch fabric, in a full or fairly full mesh topology. Every processor can do 128 MB/sec of, I presume, data transfer. 250 MB/sec/leg of interconnect was also mentioned. (I’m not clear as to where the factor of 2 went.)
Every Neoview processor has its own disk. However, the disks are attached via the interconnect for failover. Typically there are 8 drives per node. I didn’t probe further as to exactly how disks, CPUs, and cores are arranged.
A disk’s contents are mirrored; in case of failover the work goes to other nodes. Something about how Greg described this – perhaps the lack of post-failover data distribution – made me worry about post-failure performance. Greg said that while this might be an issue in benchmarks, it didn’t arise in real life.*
Thanks to the pipelined dataflow engine, data is almost never spooled to disk. Rather, it is almost always streamed from operation to operation. This contrasts with a common characterization of Teradata, in which intermediate result sets are spooled to disk.
Neoview’s Cascades-based optimizer seems to be smart enough to, for example, do aggregations before joins when it makes sense. (Much the same is true of Aster Data’s optimizer.)
By the way, the basic idea behind Cascades – or at least Neoview’s version of it — is that it uses more heuristics than conventional cost-based optimizers do. That it, Neoview starts out with a candidate plan – perhaps derived in the usual way – and then considers variants on it. I didn’t quite gather whether this was basically just a way of navigating the tree of possibly query plans more effectively, with the final decision being made by cost estimates in exactly the same way conventional optimizers do, or if the differences ran deeper than that. However it works, Greg is quite proud of it, citing as one example a 300-join Microstrategy query that ran to 60 pages of SQL.
The Neoview folks are very proud of a feature they call Skew Buster. The idea is that hash joins can get effectively de-parallelized if the data clusters around a single hash key. What Skew Buster does is take the data associated with the offending key(s) and broadcast it to other nodes, thus reestablishing more even parallel processing.
Another feature Neoview trumpets is called Adaptive Segmentation, which sounds a lot like the “degrees of parallelism” flexibility in Oracle Parallel Query. The idea is that not all processors are used for a query if it seems the communication overhead isn’t worth the trouble.
Greg says that Neoview does analytics beyond query also in parallel, but all details are secret. Shut down on that subject, I forgot to ask about support for geospatial data.
When Neoview gets compression, HP plans to do it “right.” I.e., they plan to operate on compressed data in RAM to the extent possible (e.g., decompressing right before joins, or in a few cases even joining compressed data).

* A cynic might wonder exactly what vast real-life Neoview production experience Greg was referring to. But as I’ve said before — if you have a serious problem with disk failures affecting performance, you might want to reconsider either the quality of disks you’re using, or your system management practices … .

Categories: Data warehouse appliances, Data warehousing, HP and Neoview

Subscribe to our complete feed!

Comments

12 Responses to “History, focus, and technology of HP Neoview”

Joe Harris on October 2nd, 2008 9:21 am

Great post (as usual). This is by far the most insightful piece on NeoView to date. Which, frankly, does not speak well of HP’s PR.

Here’s my question though: What’s so “high end” about NeoView? If it’s as clever and fast and whizz bang as they say then why not offer it in bite size pieces as an appliance?

I don’t see this going anywhere unless they make it appliance and put it out in the public eye for scrutiny.

Pretend you’re a big telco for a minute… Teradata offers a reputation stability with speed, Netezza offers simplicity and speed, you’ve already got a ton of Oracle so you’ll give them a look.

Where is the NeoView hook? Being as good as Teradata isn’t enough. Even being twice as fast isn’t enough because Netezza is 5x when it counts.

Maybe they should give it away with the Itanium hardware it needs and ask Intel to foot the bill as a marketing effort.

Just a thought.
Glenn Paulley on October 2nd, 2008 1:34 pm

Some comments on a few of these technology points:

“Expressions – I assume this means projects and selects – are done via a kind of byte code, on the CPU. Greg suggested Teradata uses a similar approach.” Actually this pertains to the computation of any expression value in the engine, including aggregate functions, arithmetic functions, string functions, and so on. The idea behind using a byte-code machine is that the machine can, in principle, be “compiled” (optimized) at query build time to eliminate code that is unnecessary for this computation in this particular context. Other systems, including Sybase SQL Anywhere, use this approach.

“Neoview’s Cascades-based optimizer seems to be smart enough to, for example, do aggregations before joins when it makes sense. (Much the same is true of Aster Data’s optimizer.)” – Pushing/pulling aggregation above/below a join was studied by a fellow graduate student, Paul Yan, at the University of Waterloo in the mid-1990s as part of his PhD thesis (under the direction of Paul Larson, now at Microsoft Research). As far as I know DB2 was the first product to incorporate these optimizations.

“By the way, the basic idea behind Cascades – or at least Neoview’s version of it — is that it uses more heuristics than conventional cost-based optimizers do. That it, Neoview starts out with a candidate plan – perhaps derived in the usual way – and then considers variants on it.” – It is difficult to know how much Neoview’s implementation differs from other transformation-based optimizer implementations (such as Microsoft SQL Server) based on the Cascades framework (originally developed by Goetz Graefe, now at HP Labs). Every optimizer uses heuristics to reduce the size of the search space; whether or not one uses “more” heuristics over another is difficult to assess, because those assumptions are rarely documented, if even made public.
Curt Monash on October 2nd, 2008 2:35 pm

Thanks, Glenn — good points all!

CAM
Curt Monash on October 3rd, 2008 10:12 pm

Joe,

Erin McCabe recently joined HP’s BI unit. Expect better PR from them in the future! 🙂

Best,

CAM
Tom Williams on December 11th, 2008 9:43 am

You have to also consider the join algorithms when evaluating a decision support RDBMS. Teradata is the only vendor who can guarantee linear scalability and it is because of the hashed based file system which was built to solve decision support problems. Oracle, IBM and Neoview are all deployed on a b-tree file system which was designed for OLTP. This forces them to use nlogn join algorithms when the queries involve very large tables or the concurrency level is high.
Curt Monash on December 12th, 2008 1:37 am

Tom,

Most of the row-based competitors can, as one implementation option, do a hash partition, forgo indexes, and expect the queries to be satisfied by table scans.

So I’m not clear as to exactly what architectural point you are making that puts Teradata ahead of the newer guys, or for that matter that makes it impossible to use Oracle in the way that you described.

If all you’re saying is that b-trees aren’t the way to do decision support, and that the architectures of specialty products reflects this fact better than Oracle’s does, I agree completely. But it looked as if you were going to an extreme that I don’t see the foundation for.

CAM
Tom Williams on December 13th, 2008 7:00 pm

Which implementation besides Teradata provides linear scalability regardless of the size of the tables and the concurrent user level? From what I understand, Oracle, IMB, Neoview hash join plan is linear but is dependent on availability of sufficient meory. After that, their join plans are nlogn.

Linear scalability is very rare in computing and I’d be interested in knowing if anyone besides Teradata provides it in their RDBMS.
Curt Monash on December 14th, 2008 9:51 am

Tom,

I have a design that will ensure SUB-linear scalability, up to over a petabyte. On one terabyte of data, I’ll throttle performance by a factor of 10. On four terabytes, I’ll throttle it only by a factor of 8 … OK, I’m kidding. But to compare constant_1 times n vs. constant_2 times nlogn, it’s interesting to know what constant_1 and constant_2 are.

More generally, I’m confused by what you’re saying. You seem to be assigning a single scalability function to all join plans on a particular product, no matter what strategy the particular query’s execution plan uses. Taken literally, that’s totally absurd, and I’m not guessing successfully at your actual and surely more sensible meaning.
Tom Williams on December 15th, 2008 12:42 am

It is a bit difficult to go through it detail (and I did like your joke).

In short, most joins involve sorts and merges, which are expensive and can get very expensive when large data sets are involved. There are only two join plans that provide linear scalability, the hash join and the hash merge join. The hash merge join requires a hash based file sytem (different from hash distribution). The hash join employs a similiar technique but in memory. The problem is that memory runs out quickly and is often used for other operations like buffering. Teradata is the only RDBMS that provides the hash merge join.

So if I have to sort and merge large data sets, I really want the hash merge join available to the optimzer.
Goetz Graefe on January 27th, 2009 7:44 pm

For what it’s worth, the Cascades project was never associated with the University of Wisconsin – Madison. The only possible connection is that I got my degree there. I wrote the query optimizer code 1993-94 while on the faculty of Portland State University (in Oregon) and consulting for Tandem. In addition to the Tandem project (and now HP Neoview), the code also formed the foundation for query optimization in Microsoft SQL Server 7.0 and onwards.
Database Virtualization = Location Transparency. Old Wine in a New Bottle? « Share Virtual Machines on February 5th, 2009 2:59 am

[…] 2006. Oracle’s acquisition of TangoSol, Microsoft’s Project Velocity are following HP NeoView’s usage of distributed caches for solving large BI queries. Strictly speaking these are not […]
Notes on HBase | DBMS 2 : DataBase Management System Services on March 10th, 2015 2:24 pm

[…] Another such project is Trafodion — supposedly the Welsh word for “transaction” — open sourced by HP. This seems to be based on NonStop SQL and Neoview code, which counter-intuitively have always been joined at the hip. […]

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

History, focus, and technology of HP Neoview

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin