Vertica — just star and snowflake schemas?
One of the longest-running technotheological disputes I know of is the one pitting flat/normalized data warehouse architectures vs. cubes, stars, and snowflake schemas. Teradata, for example, is a flagwaver for the former camp; Microstrategy is firmly in the latter. (However, that doesn’t keep lots of retailers from running Microstrategy on Teradata boxes.) Attensity (a good Teradata partner) is in the former camp; text mining rival Clarabridge (sort of a Microstrategy spinoff) is in the latter. And so on.
Vertica is clearly in the star/snowflake camp as well. I asked them about this, and Vertica’s CTO Mike Stonebraker emailed a response. I’m reproducing it below, with light edits; the emphasis is also mine. Key points include:
- Almost everybody (that Vertica sees) wants stars and snowflakes, so that’s what Vertica optimizes for.
- Replicating small dimension tables across nodes is great for performance.
- Even so, Vertica is broadening its support for more general schemas as well.
Great question. This is something that we’ve thought a lot about and have done significant research on with large enterprise customers. … short answer is as follows:
Vertica supports star and snowflake schemas because that is the desired data structure for data warehousing. The overwhelming majority of the schemas we see are of this form, and we have highly optimized for this case.
This includes horizontally partitioning the fact table and replicating the dimension tables. This generates query plans with maximum parallelism and minimum execution time.
There are occasional non-snowflake schemas. For example, one customer insisted on running Vertica on the identical schema he was using for his current row store. To help his row-store perform better, he had split his fact table in half,* and used what we call a “barbell” schema, and he did not want to go to the effort of changing the design. Obviously, there is no penalty for “fat” fact tables in a column store. Hence the customer’s kludge was row-store specific. Nevertheless, he wanted to use his existing schema.
*Ed: I assume this means that different columns were stored in different tables, rather than that the rows were split up. Otherwise it wouldn’t really make sense.
For this reason, we are now extending the Vertica engine to support non-star/snowflake schemas, and will deliver this functionality in a few months.
In a similar vein, the overwhelming percentage of the schemas we see have an enormous fact table and (by comparison) tiny dimension tables (made even tinier by Vertica’s aggressive use of data compression). Therefore, replicating the dimension tables to improve query performance is the prudent thing to do — pay a little space to get vastly better query performance.
According to our customers, the goal is superior performance on star/snowflake schemas, and that is what Vertica delivers.
Comments
One Response to “Vertica — just star and snowflake schemas?”
Leave a Reply










[...] distributed system replicates the dimension tables on each node and partitions the fact table. Vertica says that they have customers that use more transactional models, but what does that mean for overall performance? Greenplum’s website says: “Utilizes [...]