Comments on: A question on MDX performance

By: Curt Monash

Curt Monash — Sat, 30 Jan 2010 06:11:40 +0000

Essbase actually does calculations erroneously?? Could you please say more about that? Thanks!

By: Steve J.

Steve J. — Fri, 29 Jan 2010 18:59:06 +0000

Reply to Paul Johnson – regarding preaggregation, you will find that an Essbase cube can be fully calc’d, but this can lead to DATA EXPLOSION and long calculation times. Optimization methods employed to reduce this phenomenon can cause erroneous results when run against sparse asymmetrical dimensions, but otherwise I found Essbase to be a fantastic DataMart solution, and very intuitive since the GUI bypasses the need to use MDX. Use a star or snowflake schema in an RDBMS to feed your Essbase cubes for best results.

By: Cubegeek

Cubegeek — Sat, 28 Nov 2009 09:49:10 +0000

I’ve worked with MSAS a bit and I’ve had engineers who worked for me tell me things that confirm my brief experiences as I got further from the technology.

Most people who get up the rather steep curve for MDX admire it for its elegance and prefer it to SQL on that basis. Nevertheless the experience is that it is generally not worth it to learn the language if you have appreciable experience in SQL.

If you have become something of an expert in tuning databases in the generations of products before the Greenplum and Vertica days, say with Essbase, Microstrategy, Teradata, Oracle Express or Sybase ASE, then you will have some familiarity with arcane 4G languages and dialects of SQL without many cross-product similarities. MDX was the language destined to solve all that, a sort of OLAP Esperanto. Unfortunately, the number of programmers deciding to hack SQL to perform against this and that sort of schema tended to dominate the few cross-platform specialists. In the end, it is my opinion that the lack of a predominating visualization stack obviated the need for widespread MDX adoption. As fat visual programming OLAP clients matured, enterprise customers began demanding thin clients. As full featured stacks became available, developers wanted LAMP, and so on. Now we are met on a battlefield testing whether PHP and MySQL will long endure, with people like me wondering how they ever got started considering the maturity and performance of products like Sybase and Essbase.

My direct experience is that flatly, for applications of any sophistication, on MSAS, stored procedures with T-SQL always performed better than their functional equivalents in MDX. In their aborted product stack, Performance Point, Microsoft engineers created a middle language PEL(?) that would supposedly choose which language was best suited for a task and then generate that code. All jokes about code-generators aside, my engineers always had to second guess PEL and ended up writing it all in T-SQL. So as a practical matter, there was no sense in learning MDX if you had already mastered T-SQL on the platform for which MDX was specifically designed. It would have been nice if MDX performed with speed commensurate with its elegance, but it simply didn’t. This was two years ago.

I know MDX lives on in Essbase but that Essbase expresses it differently than does MSAS. It does rather boil down to ‘it depends’, because language to language different platforms have different strengths, etc.

I expect that we will not learn the definitive answer to this question until there is a shakeout of DBs that survive the transition to cloud infrastructure. And in that regard I think a Greenplum or a Vertica or a Hadoop-based solution will win out. In other words it won’t come down to the semantic layer. The market never forced it to because there is no real OLAP standard (chicken or egg?).

In the end I say MDX iff you love MDX.

By: Robert Folkerts

Robert Folkerts — Thu, 19 Nov 2009 19:20:42 +0000

@ Daniel Lemire. I’m using Mondrian in a commercial setting and it is quite adequate as a ROLAP engine. The key to getting high performance is to make a few aggregate tables. For example, once you aggregate and reduce row counts from millions to tens of thousands, the responses become ‘snappy’ rather than mildly irritating. This does mean that I had to go out and actually watch users to see where the bottlenecks are. Then I only had to build the aggregates that get used, since building the aggregates can be time consuming. We are running daily ETL, so daily rebuilds of the aggregates are practical.
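
The aggregate-table idea described above can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module and an in-memory database; the table and column names (sales_fact, agg_sales_day_product, fact_count) are hypothetical and are not taken from the commenter's schema or from any naming convention Mondrian requires.

```python
import sqlite3

# Hypothetical miniature fact table; a real one would hold millions of rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact "
    "(day TEXT, product_id INTEGER, store_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [
        ("2009-11-18", 1, 10, 5.0),
        ("2009-11-18", 1, 11, 7.5),
        ("2009-11-18", 2, 10, 3.0),
        ("2009-11-19", 1, 10, 4.0),
    ],
)


def rebuild_daily_product_aggregate(conn: sqlite3.Connection) -> None:
    """Drop and rebuild a summary table at (day, product) grain.

    The store dimension is collapsed, so queries that do not need
    store-level detail can read far fewer rows than the fact table holds.
    """
    conn.execute("DROP TABLE IF EXISTS agg_sales_day_product")
    conn.execute(
        "CREATE TABLE agg_sales_day_product AS "
        "SELECT day, product_id, SUM(amount) AS amount, COUNT(*) AS fact_count "
        "FROM sales_fact GROUP BY day, product_id"
    )


# In a daily ETL run, this rebuild would follow the fact-table load.
rebuild_daily_product_aggregate(conn)
agg = conn.execute(
    "SELECT day, product_id, amount, fact_count "
    "FROM agg_sales_day_product ORDER BY day, product_id"
).fetchall()
print(agg)
```

Keeping a row count alongside the sum lets averages be derived from the aggregate later without going back to the fact table.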

By: Paul Johnson

Paul Johnson — Tue, 03 Nov 2009 20:06:20 +0000

“In my own work, I have set up a star schema model centered on a Fact table of 100 million rows (approx 60 columns), with dimensions ranging in cardinality from 5 to 10,000.”

OK, even at a generous 20 bytes/column we’re dealing with ~120GB of data here i.e. not a lot.
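
For anyone who wants to check that back-of-envelope figure, here is a rough sketch. The row and column counts come from the question, the 20 bytes/column average is the generous guess above, and decimal gigabytes are assumed.

```python
# Rough, uncompressed size estimate for the fact table in the question.
ROWS = 100_000_000       # 100 million fact rows (from the question)
COLUMNS = 60             # approximate column count (from the question)
BYTES_PER_COLUMN = 20    # generous average width assumed in the comment


def raw_size_gb(rows: int, columns: int, bytes_per_column: int) -> float:
    """Return the raw data size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return rows * columns * bytes_per_column / 10**9


size_gb = raw_size_gb(ROWS, COLUMNS, BYTES_PER_COLUMN)
print(f"{size_gb:.0f} GB")  # prints "120 GB"
```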

“In ad hoc analytics, is it expected that any query against such a dataset should return a result within a minute or two (i.e. before a user gets impatient), regardless of whether that query returns 100 cells or 50,000 cells (without relying on any aggregate table or caching mechanism)?”

Against a DBMS it depends on the complexity of the query and the database concurrency – what else is running at the same time. Assuming you have all CPU and IO resources to yourself, and assuming a simple scan only or fact:dimension join query, the performance is largely (but not wholly) a product of the capability of the disk/IO sub-system, which dictates your database read/scan rate.
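
To make the dependence on scan rate concrete, here is a minimal sketch. The three scan rates are hypothetical placeholders rather than measurements of the servers in the question, and the model assumes one full pass over ~120GB with no compression, column pruning, indexes or caching.

```python
def full_scan_seconds(data_gb: float, scan_mb_per_s: float) -> float:
    """Seconds needed to read data_gb once at a sustained scan rate.

    Uses decimal units (1 GB = 1000 MB) and ignores CPU cost entirely.
    """
    if scan_mb_per_s <= 0:
        raise ValueError("scan rate must be positive")
    return data_gb * 1000 / scan_mb_per_s


DATA_GB = 120  # rough fact-table size estimated earlier in this comment

# Hypothetical sustained read rates for three IO sub-systems.
for rate in (100, 500, 2000):
    minutes = full_scan_seconds(DATA_GB, rate) / 60
    print(f"{rate:>5} MB/s -> {minutes:.1f} min")
```

On these assumptions, only the fastest of the three placeholder rates brings a full scan down to about a minute, which is why the IO sub-system matters more than the core count here.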

“Or is that level of performance only expected with a high end massively parallel software/hardware solution?”

I hope not, as with those data volumes I wouldn’t expect you to go down the MPP route.

“The server specs I’m testing with are: 32-bit 4 core, 4GB RAM, 7.2k RPM SATA drive, running Windows Server 2003; 64-bit 8 core, 32GB RAM, 3Gb/s SAS drive, running Windows Server 2003 (x64).”

What about the IO sub-system? What DBMS are you running?

“…it is not possible to have all combinations of dimensions calculated in advance, in addition to being maintained.”

I beg to differ. That’s exactly what Queryobject from CrossZ systems delivers. Full pre-calculation of all measures across all combinations of dimensions at all levels of the hierarchy, for any input dataset size.

Most (all other?) OLAP vendors consider this akin to ‘boiling the ocean’ and consider it not achievable, so don’t even try.
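
A quick way to see why this looks like ‘boiling the ocean’ is to count the distinct group-by combinations. The sketch below is only illustrative: the dimension and level counts are hypothetical, and it says nothing about how Queryobject itself stores or computes its results.

```python
from math import prod


def cuboid_count(levels_per_dimension: list[int]) -> int:
    """Count the distinct group-by combinations across all dimensions.

    Each dimension can be grouped at any one of its hierarchy levels or
    left out entirely (rolled up to 'all'), giving levels + 1 choices.
    """
    return prod(levels + 1 for levels in levels_per_dimension)


# Hypothetical model: 10 dimensions.
flat = cuboid_count([1] * 10)  # one level each: 2**10 combinations
deep = cuboid_count([3] * 10)  # three levels each: 4**10 combinations
print(flat, deep)  # prints "1024 1048576"
```

This counts only the combinations to be materialised, not the cells inside each one, which also grow with the cardinality of the dimensions involved.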

Without full pre-calculation, the problem is that there is scanning to do at query time to satisfy the queries for which the answer is not pre-built.

Once the user gets the egg-timer for 10-15 minutes, how can they distinguish between busy and broken?

There are tier1 telcos in the US that have been using Queryobject in production for many years.

It’s SQL-compliant so you don’t have to learn MDX. Sorry Chris!

See: http://www.queryobject.com

By: Chris Webb

Chris Webb — Tue, 03 Nov 2009 16:31:14 +0000

Tom,

To answer your question, tuning of the Analysis Services cube would include building some aggregations (but not aggregation tables – SSAS creates aggregations internally and is very good at working out how to use them) but not necessarily include warming the cache (which is a widely used tuning technique, but personally I always aim for fast performance on a cold cache).

The volumes you’re talking about are just about average for SQL Server Analysis Services cubes today, although you will need to have a reasonable amount of Analysis Services knowledge to get the best possible performance. Generating a set of test queries is a good technique to use but I would say that a query that returns 50,000 cells is on the large side. In general I encourage users to run queries whose results can be displayed on one screen rather than do a ‘data dump’ style query and then try to manipulate the results in, say, Excel. Obviously the larger the amount of data returned the longer the query will take to run and I don’t think there’s any value in a user asking for large amounts of data in one big chunk, rather than in more digestible, smaller pieces.

By: John Sequeira

John Sequeira — Mon, 02 Nov 2009 21:00:05 +0000

Tom,

Sorry if I misdirected you with Qlikview. I was listening to a podcast about Gemini [1] where one of the team members described snappy responses in Excel using a 100 million row test data set, and how it compressed down to 180Meg or so. I had assumed they used a pretty similar architecture, but maybe not.

The query you mention is pretty trivial in terms of MDX->SQL. In other words, you could implement your own star schema and write a simple query to bring those items back. (What Mondrian automates).

This would not really test the underlying store (Oracle/MySQL/etc.), if that is your goal. It would probably serve to throw out horribly implemented MDX stores, but that’s about it.

I appreciate that you’re trying to simplify the problem to make it tractable. I think that this analysis is just not so amenable to a blog post or an email response.

I would take everyone’s collective unease as a sign that you might not get the meaningful results you require without digging deeper into aggregates and modeling, which I can certainly see why you’d want to avoid.

Given how central caching is to BI performance, and how much variety there is to implementing this, I just don’t think a performance test that ignores platform-specific pre-aggregation strategies is useful.

You may as well do what Curt said and write SQL and benchmark the underlying RDBMS stores, for the MDX-fronted ROLAP stores.

[1] http://www.dotnetrocks.com/default.aspx?showNum=490

By: Tom Howley

Tom Howley — Mon, 02 Nov 2009 14:43:16 +0000

Many thanks for all of the interesting responses to my question, which Curt kindly posted for me. Here are some comments:

– I have evaluated different OLAP technologies that each support MDX. Using the same set of MDX queries on the same dataset loaded into the different OLAP systems seemed like a reasonable way to compare performance. It also means that MOLAP and ROLAP systems can be compared side by side. So Chris is correct in his statement that I am evaluating the performance of the OLAP technology, rather than the performance of MDX itself; a TPC benchmark for OLAP is a good way to describe it.

– John suggests trying Gemini or QlikView. I have tried QlikView on the 32 GB RAM 64-bit system with the 100 million row dataset mentioned above. It ran out of memory trying to load this dataset.

– Chris states “with those data volumes and that hardware then I would be very confident that for a properly designed and tuned cube you’d get query times of a few seconds or less for most reasonable queries.” Does design and tuning of the cube include the pre-calculation of aggregate tables (or some pre-loading of a cache)? Can I expect a response in the order of seconds for queries that the cube has not been prepared for (on the hardware mentioned)? One example of a query I have run is a simple crossjoin between two dimensions (50 members, 1000 members), selecting one measure, thus returning 50,000 cells. Is this a reasonable query for performance assessment?
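
The cell count for that crossjoin example can be checked with a short sketch. The member names are hypothetical stand-ins, and itertools.product is used only to mimic what a crossjoin of two sets produces; it is not how any OLAP engine evaluates the query.

```python
from itertools import product

# Hypothetical member lists matching the sizes quoted above.
dim_a = [f"A{i}" for i in range(50)]
dim_b = [f"B{i}" for i in range(1000)]


def crossjoin(*dimensions):
    """Yield every combination of one member from each dimension."""
    return product(*dimensions)


# With a single measure selected, each member pair is one result cell.
cells = sum(1 for _ in crossjoin(dim_a, dim_b))
print(cells)  # prints 50000
```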

Thanks again for all of your insight.

By: Jerome Pineau

Jerome Pineau — Sun, 01 Nov 2009 09:43:00 +0000

Thanks for the insight Chris. I think you’re obviously also right in your last sentence – hands down.
J.

By: Chris Webb

Chris Webb — Sat, 31 Oct 2009 21:00:36 +0000

Actually, rereading the question I think it actually makes more sense than the comments here (including mine) suggest. The questioner here is interested in “assessing the performance of an OLAP technology using a set of MDX queries”; I’m not sure they are confusing the query language with the underlying database technology at all.

What they seem to want is something like a TPC benchmark for OLAP: a dataset that can be loaded onto multiple OLAP platforms, and a set of MDX queries that can be run against that dataset on each platform so their performance can be compared directly. This seems like a reasonable requirement to me, but as I said in my original comment I don’t know of any such benchmarks.

Jerome, to pick up on some of your questions:
– No, MDX isn’t always converted to SQL. In a pure MOLAP scenario (like Microsoft Analysis Services in MOLAP mode) nothing gets converted to SQL. Even in ROLAP scenarios, while SQL queries are issued to get the raw data needed for an MDX query, it’s still likely that there’ll be a lot of calculation and processing that will happen inside the OLAP engine afterwards.
– MDX is widely used in the Microsoft BI world, but given the number of other platforms that support it I am a bit mystified why you don’t hear so much about it in the wider world…
– …and this is probably because, as Jerome says, there is a steep learning curve. In fact for anyone that’s spent a lot of time thinking in SQL it can be particularly confusing. Which is a shame because I truly, honestly believe that MDX is an immensely powerful language and much better suited for expressing BI queries and calculations than SQL.