Comments on: Exadata and Oracle Database Machine parallelization clarified

By: Oracle Exadata and Oracle data warehouse appliance sound bites | DBMS2 -- DataBase Management System Services

Wed, 12 Nov 2008 03:33:07 +0000

[…] Medium-long term, the Exadata technical strategy could work very well. Exadata storage management addresses some of the problems with shared-everything; Oracle RAC addresses other; and it may not take many releases before Oracle gets query parallelization right as well. Edit: This point is superseded by my updated take on Oracle query parallelization. […]

By: Greg Rahn

Greg Rahn — Mon, 13 Oct 2008 05:58:41 +0000

@Mike I'm assuming you are quoting from the Exadata Technical Whitepaper. If there is function shipping occuring between the DBMS server and the underlying storage then why exactly do you need the ASM to present a shared disk view of all cells to each of the 8 Oracle DBMS servers? The "function shipping" applies to the smart scan offload processing. ASM is the integrated database volume and file system manager. Two different bits of functionality, both are required. Are we saying that a query is NOT parallelised across the 8 servers at all? Queries can be parallelized on one database server (intra-node parallelism), or across database servers (inter-node parallelism). The latter closely resembles shared-nothing query parallelism. Are we saying that a query runs within 1 of the 8 only and the parallelisation is in the back end. Operations are parallelized at both the database and Exadata layers. I might suggest the Oracle SQL Parallel Execution paper for more details on Oracle Parallel Query. In that case would it not be true that each Oracle instance would need access to a “global catalog” to know precisely where the data is (i.e. on each disk) in order to parallelise a query and function ship the parts of the query to just the nodes that have the data? The database instance knows what files have a given table's blocks, but ASM knows where those files are physically located on the LUNs. It seems interesting that each cell has two quad-core processors and 32GB memory. The DBMS nodes are also two quad-core processors with 8GB memory. You have it reversed. Exadata has 2 quad-core CPUs and 8GB memory, the database nodes have 2 quad-core CPUs and 32GB memory. Less memory on the Exadata servers. My key question is where is the global catalog? If there isn’t one then I am confused as to how function shipping occurs. I would also like to know more about what CELLSRV actually is. The only "global catalog" that contains the physical layout is ASM. The Exadata Storage Software (CELLSRV) is just simply a "smart" block server and filtering application. This is also covered in the Exadata Technical Whitepaper. It might help to mention that one can have have multiple non-Exadata storage arrays in the same manor that one can have multiple Exadata storage servers. The difference being that w/o Exadata, all the processing (query and disk operations) are handled by the database CPUs. With Exadata, the disk operations as well as the offload processing (smart scans) are handled by the storage server CPUs. Hopefully that addresses your questions.

By: Mike Ferguson

Mike Ferguson — Sun, 12 Oct 2008 15:32:23 +0000

Curt,
I would be interested in what you discover at your meeting. I would be interested in asking a rather direct question – Is CELLSRV really an Oracle Instance in disguise?

By: Curt Monash

Curt Monash — Sun, 12 Oct 2008 14:27:50 +0000

Mike,

I don’t know, but those are excellent questions. Hopefully somebody from Oracle will address them when I visit on Thursday. We’ve set aside 2 1/2 + hours for DW + Exadata, so I’m looking forward to a fairly deep dive.

By: Mike Ferguson

Mike Ferguson — Sun, 12 Oct 2008 13:41:18 +0000

Curt,
I wonder if you could help me. I am still trying to get my head around Exadata. More specifically the link between the DBMS servers and the Exadata Storage Servers (cells). The Oracle white paper states that “Oracle Exadata enables function shipping from the database instance to the underlying storage”. My question is this. If there is function shipping occuring between the DBMS server and the underlying storage then why exactly do you need the ASM to present a shared disk view of all cells to each of the 8 Oracle DBMS servers? Are we saying that a query is NOT parallelised across the 8 servers at all? Are we saying that a query runs within 1 of the 8 only and the parallelisation is in the back end. In that case would it not be true that each Oracle instance would need access to a “global catalog” to know precisely where the data is (i.e. on each disk) in order to parallelise a query and function ship the parts of the query to just the nodes that have the data? I have seen nothing that states that the Oracle optimiser in an instance of Oracle running in the Database Machine has such a global catalog (or does each of the 8 database servers simply see it via the ASM on a virtual shared disk?) to know the precise location of the data on the disks. If a global catalog is not visible at the Oracle DBMS level then it would seem that a query would be parallelised across all cells and the CELLSRV software running in each cell would have to function like a cut down version of the Oracle DBMS? Would you agree? This CELLSRV software appears to have the smarts to do selections and projections on data within its cell and potentially some in-place join processing also. It seems interesting that each cell has two quad-core processors and 32GB memory. The DBMS nodes are also two quad-core processors with 8GB memory. While I can easily see that more memory is needed in a cell to get through more data before sending back the filtered result subset I wondered if the CELLSRV software is parallelising its subquery to exploit all 8 cores in the cell?

My key question is where is the global catalog? If there isn’t one then I am confused as to how function shipping occurs. I would also like to know more about what CELLSRV actually is.

By: Rick Powell

Rick Powell — Wed, 01 Oct 2008 20:13:20 +0000

Oracle has been working solutions on clustered databases since 1987 when version 6.2 came out with Oracle Parallel Server (OPS). This product was improved through version 8.n and was renamed Real Application Clusters when version 9.n came out. What it provides is the ability to pull together the power of multiple servers to satisfy database needs. It also provides fault tolerance to allow a database to continue in the event that a server is lost.

What is interesting in this latest offering is the Exadata Storage Server and how it can be employed to provide data throughput (20GB/Sec uncompressed / 50GB/Sec compressed). The ability to couple the exadata storage server, infiniband and RAC clusters will yield a significant improvement for database servers and those applications consume large amounts of data (such as DW, DSS and BI).

To me the question is to what extent can the Infiniband and Exadata Storage Server provide improvement for OLTP environments. Exadata Storage Server is optimized for data reads, but can it be optimized for writes too.

By: Ben Werther

Ben Werther — Wed, 01 Oct 2008 05:08:43 +0000

Hi Curt,

Thanks for the note. I’m impressed by your coverage of all of these recent announcements and the good questions you have been asking.

In my note I was really questioning the notion that Oracle now has a shared-nothing MPP architecture and that Exadata increases parallelism and reduces the dependency on RAC.

Consider a few points:

– Oracle isn’t claiming to have made any notable changes to the database layer. As always, to execute a large analytical query across multiple machines on Oracle, you need a combination of Oracle ASM (to handle data partitioning), Oracle RAC (to coordinate between the multiple machines) and Parallel Query (to break up the work into multiple threads of execution).

– Oracle’s RAC architecture (prior to 10g) had all inter-machine communication (for joins processing, etc) going through the shared file system. All this disk coordination was really slow, so in 10g they split RAC’s processing into two layers — file system sharing (which still required that every database machine be able to see all the storage) — and ‘memory channel’ which is their name for the network connection between machines through which they push data for join processing.
–> i.e. Exadata provides more storage bandwidth, but doesn’t provide parallel query execution or inter-node coordination — that is RAC’s job. Without RAC you’d just have a bunch of database servers independently reading and writing from shared Exadata storage – but not working together to execute large queries in parallel.

– When designing Exadata, they had to make some interesting compromises. They wanted to leverage commodity storage and eliminate SAN bandwidth limitations. But, because each RAC database server fundamentally needs to see all the storage, they couldn’t use the shared-nothing MPP approach of Teradata, Netezza and Greenplum where nodes have dedicated storage and all communication is just inter-node traffic for join processing. So instead they built a commodity ‘SAN replacement’ where the storage and compute are separate.

– However with that there is a problem. The Exadata architecture requires that lots of data must be pushed between the storage and server nodes — data that would never need to be pumped over the network if the storage and compute were colocated. Even with expensive Infiniband they need to reduce the data transmitted – hence the ‘intelligent storage’ concept. Because they’ve split the processing they need to invent a separate pre-processing layer to try to reduce the overhead they’ve introduced. But this layer can’t do too much or it will interfere with Oracle’s processing, so they are only pushing down the trivial predicate filtering operation.
–> i.e. Exadata provides a shared storage infrastructure suitable for a shared-disk architecture. It doesn’t make Oracle a shared-nothing architecture, and by consequence most processing can’t be pushed down to the disks.

– By contrast, in a Greenplum system we’ll often get upwards of 2GB/s storage bandwidth within a node, and each node will execute its part of the query plan which could include arbitrary SQL (incl pre-aggregation, windowing and more), MapReduce processing, R/Perl/Python functions, etc. Our commodity interconnect fabric moves data between nodes for join processing, and there’s plenty of bandwidth because we don’t need to compete with storage traffic. Finally there’s no artificial split of processing, and everything is processed as close to the data as possible.

Anyway, this is a long-winded answer, but hopefully this clarifies things a little further.

Regards,
Ben
[http://www.greenplum.com/news/blogs/]

By: Curt Monash

Curt Monash — Mon, 29 Sep 2008 21:16:17 +0000

Hi Ben,

What part of what I posted do you believe is incorrect? Your post seems to contradict mine in part, but I’m not sure what your evidence or basis is for doing so.

Also — in another thread (the one based on Eric Lai’s article) — I posted a question to Greenplum, namely what did Scott mean by $20 K/TB — spinning disk or user data? Help would be appreciated.

Best,

CAM

By: Ben Werther

Ben Werther — Mon, 29 Sep 2008 18:25:55 +0000

With Oracle Exadata, Oracle has addressed the obvious low-hanging fruit of slow I/O performance, but all the existing architectural problems of RAC remain (i.e. weaknesses in parallel query planning and inefficient movement of data between nodes for execution of complex joins and aggregations).

Oracle’s Exadata isn’t a shared-nothing architecture. It is the same old RAC shared-disk architecture, but with a faster I/O pipe. Why else would they force you to have separate storage and database nodes? It is because RAC has a fundamental limitation that requires that every server can see all the storage in order to function.

This means that the processing is happening on an entirely different box than the one accessing the data in disk. Oracle’s ‘intelligent storage’ story (i.e. predicate pushdown) is a baby step to reduce this deficiency by pushing around less data, but doesn’t address the core issue.

By contrast Greenplum’s shared-nothing architecture means that the database lives on the storage nodes. This allows it to fully parallelize SQL, MapReduce and R (for statistical analysis), and do state-of-the-art analysis directly against the data — which makes Oracle’s ‘intelligent storage’ look extremely trivial by comparison.

Until Oracle is able to run the entire database directly against their disks (i.e. a shared-nothing architecture) they are going to be living with fundamental processing limitations.

For more details– http://www.greenplum.com/news/blogs/

By: So what’s Oracle’s MPP-aware optimizer and query execution plan story? | DBMS2 -- DataBase Management System Services

Mon, 29 Sep 2008 02:43:11 +0000

[…] Edit: Answers to the title question have now shown up, and so the post below is now superseded by this one. […]