September 28, 2008

Exadata and Oracle Database Machine parallelization clarified

Some kind Oracle development managers have reached out and helped me better understand where Oracle does or doesn’t stand in query and analytic parallelization. This post supersedes prior discussions of the subject over the past week.

Let’s start with the part everybody pretty much knows already:

There are two parts to a parallelization story — how you get data off of disk, and what you do with it once you have it.
To a first approximation, the best way to get a lot of data off of disk is in parallel, specifically with different CPUs talking to different disk drives. Until last week’s announcement of Exadata, Oracle was the most prominent holdout against this view. (That dubious honor now goes to Sybase.)
If processing units are working in parallel to get data off disk, it is optimal to, on the same node that receives the data:
- Do your projections. I.e., whittle down the data to just the columns you need. (That’s if you have a row-based system, as Oracle does; projections are moot in column-based systems.)
- Do your selections. I.e., execute whichever WHERE clauses can be handled before the joins start.
- Do any joins you can. For example, you may be joining two tables that have been partitioned on the same hash key, or you may be joining a large fact table to a small dimension table, where the latter has been replicated across every node.
Exadata is the first mainstream Oracle product that follows this optimal storage-facing parallelization strategy. Actually, opinions differ as to whether rigid coupling of processors to specific disks is actually necessary. But after supporting one extreme (the disk part of shared-everything), Oracle with Exadata has gone to the other extreme (the disk part of shared-nothing). Other vendors taking this approach include Teradata and Netezza. Greenplum and Vertica are less extreme.
After accessing data in parallel and filtering it to the extent possible on the nodes that retrieved it, Oracle then ships the data to a conventional Oracle database server. That server does any further query processing, along with any other analytics.

All that has been pretty clear from the getgo. Less obvious has been: How does the Oracle database server process the data it receives from the Exadata component? In particular, how parallel is the Oracle database server’s processing?

It turns out that the answer has little to do with Oracle Real Application Clusters (RAC). Indeed, it has so little to do with RAC that I’m wondering what RAC does to justify its >10% share of overall Oracle Database Machine pricing. In particular, different CPUs generally do not share RAM or cache when doing what Oracle refers to as DSS (Decision Support System — an old term) work. Thus — while I’m still not clear on all the specifics or exceptions! — it is generally fair to say that Oracle’s architecture on the database server side is akin to shared-nothing.

Please note that we’re talking here about two different pools of CPUs — the ones built into the Exadata part of the system, in charge of talking to their own private disk drives, and the ones in the RAC cluster, which do non-basic query execution, along with the rest of the analytics. Indeed, those two pools of CPUs could be of completely different brands and configurations, although at the moment they are similarly-named HP servers using identical Intel chips. I.e., Oracle has moved into the node heterogeneity camp. By way of contrast, the usual-suspect MPP vendors — Teradata, Netezza, Greenplum, Vertica, Aster Data, Paraccel, Exasol, DATAllegro — do most or all of their subsequent processing on the same nodes that retrieve data. Thus, Oracle is the first major vendor for whom it is important to remember that different parts of a query plan get parallelized across completely distinct sets of processors and processing units.*

*Yes, I know that each Netezza SPU (Snippet Processing Unit) couples a PowerPC chip and an FPGA (Field-Programmable Gate Array). But that’s a very different thing from having your data access occur on 14 servers and having the initial results sent to a different set of 8 servers.

So with all that background, I’m finally ready to lay out what I’ve gleaned about Oracle query and analytic parallelization, whether from public materials or private discussions.

Commericial database parallelization started in the mid-1990s. Indeed, I was writing about it back in 1994, and it was part of the story in my Sybase Sell recommendation that year. Ken Jacobs et al. explained to me at the time that in the choice between “static” and “dynamic” partitioning and parallelization, Oracle had opted for the “dynamic.” I.e., all CPUs could look at all data, but there would be a form of parallel processing anyway.
Oracle introduced a feature called Parallel Query long ago. Then, with the release of 10g, Oracle said in effect “Now we’ve removed the prior limitations and gotten Parallel Query right!”
Oracle tables or queries have to be explicitly enabled for parallelization. Degrees of parallelization (minimum and maximum processors devoted to a task) are also explicitly declared. Naturally, there are defaults and administrative tools that make that all pretty automatic or else easy. The default default, as it were, is for most or all available processors to be used in parallel.
Oracle also says it has parallelized a broad range of analytic functionality — data mining, OLAP, and so on, including generic UDFs (User-Defined Functions).

So how good is all this parallel technology? On the one hand, we know Oracle has been shipping it for a long time, and has it widely deployed. On the other, we also know that Oracle performance has been very problematic for large parallel queries. Surely most of those problems were due to the shared-disk bottleneck. But were they all (or so close to all as not to matter)? I don’t yet know.

Categories: Clustering, Data warehouse appliances, Data warehousing, Exadata, Oracle, Parallelization

Subscribe to our complete feed!

Comments

10 Responses to “Exadata and Oracle Database Machine parallelization clarified”

So what’s Oracle’s MPP-aware optimizer and query execution plan story? | DBMS2 -- DataBase Management System Services on September 28th, 2008 10:43 pm

[…] Edit: Answers to the title question have now shown up, and so the post below is now superseded by this one. […]
Ben Werther on September 29th, 2008 2:25 pm

With Oracle Exadata, Oracle has addressed the obvious low-hanging fruit of slow I/O performance, but all the existing architectural problems of RAC remain (i.e. weaknesses in parallel query planning and inefficient movement of data between nodes for execution of complex joins and aggregations).

Oracle’s Exadata isn’t a shared-nothing architecture. It is the same old RAC shared-disk architecture, but with a faster I/O pipe. Why else would they force you to have separate storage and database nodes? It is because RAC has a fundamental limitation that requires that every server can see all the storage in order to function.

This means that the processing is happening on an entirely different box than the one accessing the data in disk. Oracle’s ‘intelligent storage’ story (i.e. predicate pushdown) is a baby step to reduce this deficiency by pushing around less data, but doesn’t address the core issue.

By contrast Greenplum’s shared-nothing architecture means that the database lives on the storage nodes. This allows it to fully parallelize SQL, MapReduce and R (for statistical analysis), and do state-of-the-art analysis directly against the data — which makes Oracle’s ‘intelligent storage’ look extremely trivial by comparison.

Until Oracle is able to run the entire database directly against their disks (i.e. a shared-nothing architecture) they are going to be living with fundamental processing limitations.

For more details– http://www.greenplum.com/news/blogs/
Curt Monash on September 29th, 2008 5:16 pm

Hi Ben,

What part of what I posted do you believe is incorrect? Your post seems to contradict mine in part, but I’m not sure what your evidence or basis is for doing so.

Also — in another thread (the one based on Eric Lai’s article) — I posted a question to Greenplum, namely what did Scott mean by $20 K/TB — spinning disk or user data? Help would be appreciated.

Best,

CAM
Ben Werther on October 1st, 2008 1:08 am

Hi Curt,

Thanks for the note. I’m impressed by your coverage of all of these recent announcements and the good questions you have been asking.

In my note I was really questioning the notion that Oracle now has a shared-nothing MPP architecture and that Exadata increases parallelism and reduces the dependency on RAC.

Consider a few points:

– Oracle isn’t claiming to have made any notable changes to the database layer. As always, to execute a large analytical query across multiple machines on Oracle, you need a combination of Oracle ASM (to handle data partitioning), Oracle RAC (to coordinate between the multiple machines) and Parallel Query (to break up the work into multiple threads of execution).

– Oracle’s RAC architecture (prior to 10g) had all inter-machine communication (for joins processing, etc) going through the shared file system. All this disk coordination was really slow, so in 10g they split RAC’s processing into two layers — file system sharing (which still required that every database machine be able to see all the storage) — and ‘memory channel’ which is their name for the network connection between machines through which they push data for join processing.
–> i.e. Exadata provides more storage bandwidth, but doesn’t provide parallel query execution or inter-node coordination — that is RAC’s job. Without RAC you’d just have a bunch of database servers independently reading and writing from shared Exadata storage – but not working together to execute large queries in parallel.

– When designing Exadata, they had to make some interesting compromises. They wanted to leverage commodity storage and eliminate SAN bandwidth limitations. But, because each RAC database server fundamentally needs to see all the storage, they couldn’t use the shared-nothing MPP approach of Teradata, Netezza and Greenplum where nodes have dedicated storage and all communication is just inter-node traffic for join processing. So instead they built a commodity ‘SAN replacement’ where the storage and compute are separate.

– However with that there is a problem. The Exadata architecture requires that lots of data must be pushed between the storage and server nodes — data that would never need to be pumped over the network if the storage and compute were colocated. Even with expensive Infiniband they need to reduce the data transmitted – hence the ‘intelligent storage’ concept. Because they’ve split the processing they need to invent a separate pre-processing layer to try to reduce the overhead they’ve introduced. But this layer can’t do too much or it will interfere with Oracle’s processing, so they are only pushing down the trivial predicate filtering operation.
–> i.e. Exadata provides a shared storage infrastructure suitable for a shared-disk architecture. It doesn’t make Oracle a shared-nothing architecture, and by consequence most processing can’t be pushed down to the disks.

– By contrast, in a Greenplum system we’ll often get upwards of 2GB/s storage bandwidth within a node, and each node will execute its part of the query plan which could include arbitrary SQL (incl pre-aggregation, windowing and more), MapReduce processing, R/Perl/Python functions, etc. Our commodity interconnect fabric moves data between nodes for join processing, and there’s plenty of bandwidth because we don’t need to compete with storage traffic. Finally there’s no artificial split of processing, and everything is processed as close to the data as possible.

Anyway, this is a long-winded answer, but hopefully this clarifies things a little further.

Regards,
Ben
[http://www.greenplum.com/news/blogs/]
Rick Powell on October 1st, 2008 4:13 pm

Oracle has been working solutions on clustered databases since 1987 when version 6.2 came out with Oracle Parallel Server (OPS). This product was improved through version 8.n and was renamed Real Application Clusters when version 9.n came out. What it provides is the ability to pull together the power of multiple servers to satisfy database needs. It also provides fault tolerance to allow a database to continue in the event that a server is lost.

What is interesting in this latest offering is the Exadata Storage Server and how it can be employed to provide data throughput (20GB/Sec uncompressed / 50GB/Sec compressed). The ability to couple the exadata storage server, infiniband and RAC clusters will yield a significant improvement for database servers and those applications consume large amounts of data (such as DW, DSS and BI).

To me the question is to what extent can the Infiniband and Exadata Storage Server provide improvement for OLTP environments. Exadata Storage Server is optimized for data reads, but can it be optimized for writes too.
Mike Ferguson on October 12th, 2008 9:41 am

Curt,
I wonder if you could help me. I am still trying to get my head around Exadata. More specifically the link between the DBMS servers and the Exadata Storage Servers (cells). The Oracle white paper states that “Oracle Exadata enables function shipping from the database instance to the underlying storage”. My question is this. If there is function shipping occuring between the DBMS server and the underlying storage then why exactly do you need the ASM to present a shared disk view of all cells to each of the 8 Oracle DBMS servers? Are we saying that a query is NOT parallelised across the 8 servers at all? Are we saying that a query runs within 1 of the 8 only and the parallelisation is in the back end. In that case would it not be true that each Oracle instance would need access to a “global catalog” to know precisely where the data is (i.e. on each disk) in order to parallelise a query and function ship the parts of the query to just the nodes that have the data? I have seen nothing that states that the Oracle optimiser in an instance of Oracle running in the Database Machine has such a global catalog (or does each of the 8 database servers simply see it via the ASM on a virtual shared disk?) to know the precise location of the data on the disks. If a global catalog is not visible at the Oracle DBMS level then it would seem that a query would be parallelised across all cells and the CELLSRV software running in each cell would have to function like a cut down version of the Oracle DBMS? Would you agree? This CELLSRV software appears to have the smarts to do selections and projections on data within its cell and potentially some in-place join processing also. It seems interesting that each cell has two quad-core processors and 32GB memory. The DBMS nodes are also two quad-core processors with 8GB memory. While I can easily see that more memory is needed in a cell to get through more data before sending back the filtered result subset I wondered if the CELLSRV software is parallelising its subquery to exploit all 8 cores in the cell?

My key question is where is the global catalog? If there isn’t one then I am confused as to how function shipping occurs. I would also like to know more about what CELLSRV actually is.
Curt Monash on October 12th, 2008 10:27 am

Mike,

I don’t know, but those are excellent questions. Hopefully somebody from Oracle will address them when I visit on Thursday. We’ve set aside 2 1/2 + hours for DW + Exadata, so I’m looking forward to a fairly deep dive.
Mike Ferguson on October 12th, 2008 11:32 am

Curt,
I would be interested in what you discover at your meeting. I would be interested in asking a rather direct question – Is CELLSRV really an Oracle Instance in disguise?
Greg Rahn on October 13th, 2008 1:58 am

@Mike

I’m assuming you are quoting from the Exadata Technical Whitepaper.

If there is function shipping occuring between the DBMS server and the underlying storage then why exactly do you need the ASM to present a shared disk view of all cells to each of the 8 Oracle DBMS servers?

The “function shipping” applies to the smart scan offload processing. ASM is the integrated database volume and file system manager. Two different bits of functionality, both are required.

Are we saying that a query is NOT parallelised across the 8 servers at all?

Queries can be parallelized on one database server (intra-node parallelism), or across database servers (inter-node parallelism). The latter closely resembles shared-nothing query parallelism.

Are we saying that a query runs within 1 of the 8 only and the parallelisation is in the back end.

Operations are parallelized at both the database and Exadata layers.
I might suggest the Oracle SQL Parallel Execution paper for more details on Oracle Parallel Query.

In that case would it not be true that each Oracle instance would need access to a “global catalog” to know precisely where the data is (i.e. on each disk) in order to parallelise a query and function ship the parts of the query to just the nodes that have the data?

The database instance knows what files have a given table’s blocks, but ASM knows where those files are physically located on the LUNs.

It seems interesting that each cell has two quad-core processors and 32GB memory. The DBMS nodes are also two quad-core processors with 8GB memory.

You have it reversed. Exadata has 2 quad-core CPUs and 8GB memory, the database nodes have 2 quad-core CPUs and 32GB memory. Less memory on the Exadata servers.

My key question is where is the global catalog? If there isn’t one then I am confused as to how function shipping occurs. I would also like to know more about what CELLSRV actually is.

The only “global catalog” that contains the physical layout is ASM. The Exadata Storage Software (CELLSRV) is just simply a “smart” block server and filtering application. This is also covered in the Exadata Technical Whitepaper.

It might help to mention that one can have have multiple non-Exadata storage arrays in the same manor that one can have multiple Exadata storage servers. The difference being that w/o Exadata, all the processing (query and disk operations) are handled by the database CPUs. With Exadata, the disk operations as well as the offload processing (smart scans) are handled by the storage server CPUs.

Hopefully that addresses your questions.
Oracle Exadata and Oracle data warehouse appliance sound bites | DBMS2 -- DataBase Management System Services on November 11th, 2008 11:33 pm

[…] Medium-long term, the Exadata technical strategy could work very well. Exadata storage management addresses some of the problems with shared-everything; Oracle RAC addresses other; and it may not take many releases before Oracle gets query parallelization right as well. Edit: This point is superseded by my updated take on Oracle query parallelization. […]

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

Exadata and Oracle Database Machine parallelization clarified

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin