Benchmarks and POCs – DBMS 2 : DataBase Management System Services

Notes on Spark and Databricks — technology

Curt Monash — Sun, 31 Jul 2016 14:30:18 +0000

During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:

Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
A lot of this is via a focus on simplified APIs, based on DataFrames:
- Unlike similarly named APIs in R and Python, Spark DataFrames work with nested data.
- Machine learning and Spark Streaming both work with Spark DataFrames.
There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-DS queries.

The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are NDA, but it seems fair to mention at least:

Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
Databricks has been working on security, and even on the associated certifications.

Two of the technical initiatives Reynold told me about seemed particularly cool. One, on the machine learning side, was a focus on training models online as new data streams in. In most cases this seems to require new algorithms for old model types, with a core idea being that the algorithm does a mini gradient descent for each new data point.
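To make the "mini gradient descent for each new data point" idea concrete, here is a minimal sketch in plain Python. It is not Spark or Databricks code; the class name, the constant learning rate and the simulated stream are all assumptions chosen for illustration.

```python
# Hypothetical sketch: online linear regression, one gradient step per data point.
from typing import Iterable, List, Tuple


class OnlineLinearModel:
    """Linear model y ~ w.x + b that is updated one observation at a time."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.w: List[float] = [0.0] * n_features
        self.b: float = 0.0
        self.lr = learning_rate  # assumed constant; real systems often decay it

    def predict(self, x: List[float]) -> float:
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x: List[float], y: float) -> float:
        """Take one small gradient step on the squared error of a single point."""
        error = self.predict(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error
        return error


def train_on_stream(
    model: OnlineLinearModel, stream: Iterable[Tuple[List[float], float]]
) -> OnlineLinearModel:
    """Consume (features, label) pairs as they arrive; no batch is ever stored."""
    for x, y in stream:
        model.update(x, y)
    return model


# Simulated stream drawn from y = 2x + 1 (made-up data, replayed many times).
points = [([i / 10.0], 2.0 * (i / 10.0) + 1.0) for i in range(10)]
model = train_on_stream(OnlineLinearModel(n_features=1), points * 500)
```

The point of the sketch is only that the model never needs to revisit old data: each arriving point nudges the weights and is then discarded.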

The other cool idea fits the trend of alternatives to the “lambda architecture”. Under the name “structured streaming”, which seems to be a replacement for “DStreaming”, the idea is to do set-based SQL processing even though membership of the set changes over time. Result sets are extracted on a snapshot basis; you can keep either all the results from each snapshot query or just the deltas.
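The snapshot idea can be sketched without any streaming engine at all. The following plain-Python toy keeps a grouped count over an input set that keeps growing, and lets the caller ask for either the full result or just what changed since the previous snapshot. The class and the mode names "all" and "deltas" are hypothetical and are not Spark's API.

```python
# Hypothetical sketch: a set-based query whose input set grows over time.
from collections import Counter
from typing import Dict, Iterable


class IncrementalCountQuery:
    """Maintains the equivalent of SELECT key, COUNT(*) GROUP BY key."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()
        self._last_snapshot: Dict[str, int] = {}

    def add_rows(self, keys: Iterable[str]) -> None:
        """New rows arrive; the aggregate is updated incrementally."""
        self._counts.update(keys)

    def snapshot(self, mode: str = "all") -> Dict[str, int]:
        """Return the whole result set, or only rows changed since last time."""
        current = dict(self._counts)
        if mode == "all":
            result = current
        elif mode == "deltas":
            result = {
                k: v for k, v in current.items()
                if self._last_snapshot.get(k) != v
            }
        else:
            raise ValueError(f"unknown mode: {mode}")
        self._last_snapshot = current
        return result


query = IncrementalCountQuery()
query.add_rows(["a", "b", "a"])
first = query.snapshot("all")       # full result after the first batch
query.add_rows(["b", "c"])
changed = query.snapshot("deltas")  # only the groups that moved
```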

Despite all this, there’s some non-trivial dissatisfaction with Spark, fair or otherwise.

Some of the reason is that SparkSQL is too immature to be great.
Some is annoyance that Databricks isn’t putting everything it has into open source.
Some is that everything has its architectural trade-offs.

To the last point, I raised one of the biggest specifics with Reynold, namely Spark’s lack of a strong built-in data persistence capability. Reynold’s answer was that they’re always working to speed up reading and writing from other forms of persistent storage. E.g., he cited a figure of ~100 million rows/core/second decoded from Parquet.

CDH 5.5

Curt Monash — Thu, 19 Nov 2015 11:52:01 +0000

I talked with Cloudera shortly ahead of today’s announcement of Cloudera 5.5. Much of what we talked about had something or other to do with SQL data management. Highlights include:

Impala and Kudu are being donated to Apache. This actually was already announced Tuesday. (Due to Apache’s rules, if I had any discussion with Cloudera speculating on the likelihood of Apache accepting the donations, I would not be free to relay it.)
Cloudera is introducing SQL extensions so that Impala can query nested data structures. More on that below.
- The basic idea for the nested datatype support is that there are SQL extensions with a “dot” notation to let you get at the specific columns you need.
- From a feature standpoint, we’re definitely still in the early days.
  - When I asked about indexes on these quasi-columns, I gathered that they’re not present in beta but are hoped for by the time of general availability.
  - Basic data skipping, also absent in beta, seems to be more confidently expected in GA.
- This is for Parquet first, Avro next, and presumably eventually native JSON as well.
- This is said to be Dremel-like, at least in the case of Parquet. I must confess that I’m not familiar enough with Apache Drill to compare the two efforts.
Cloudera is increasing its coverage of Spark in several ways.
- Cloudera is adding support for MLlib.
- Cloudera is adding support for SparkSQL. More on that below.
- Cloudera is adding support for Spark going against S3. The short answer to “How is this different from the Databricks service?” is:
  - More “platform” stuff from the Hadoop stack (e.g. for data ingest).
  - Less in the way of specific Spark usability stuff.
Cloudera is putting into beta what it got in the Xplain.io acquisition, which it unfortunately is naming Cloudera Navigator Optimizer. More on that in a separate post.
Impala and Hive are getting column-level security via Apache Sentry.
There are other security enhancements.
Some policy-based information lifecycle management is being added as well.
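As a rough illustration of the "dot" notation for nested data mentioned in the highlights above, here is a plain-Python sketch of resolving a dotted path against a nested record. It is not Impala's SQL syntax; the function name and the sample record are assumptions made up for the example.

```python
# Hypothetical sketch: pull one "quasi-column" out of nested records by dotted path.
from typing import Any, Iterator, List


def resolve(record: Any, path: str) -> Iterator[Any]:
    """Yield every value reachable via the dotted path; lists fan out."""

    def walk(node: Any, parts: List[str]) -> Iterator[Any]:
        if not parts:
            yield node
        elif isinstance(node, list):
            for item in node:          # repeated field: visit each element
                yield from walk(item, parts)
        elif isinstance(node, dict) and parts[0] in node:
            yield from walk(node[parts[0]], parts[1:])
        # anything else: the path does not exist here, so yield nothing

    yield from walk(record, path.split("."))


# Made-up record with a struct ("customer") and nested repeated fields.
record = {
    "customer": {"name": "Ann"},
    "orders": [
        {"id": 1, "items": [{"sku": "A1"}, {"sku": "B2"}]},
        {"id": 2, "items": [{"sku": "C3"}]},
    ],
}
skus = list(resolve(record, "orders.items.sku"))
```

A columnar format such as Parquet can store each such path as its own column, which is why only the needed columns have to be read.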

While I had Cloudera on the phone, I asked a few questions about Impala adoption, specifically focused on concurrency. There was mention of:

Petabyte scale databases — at least one clear case for Impala/business intelligence only, and the likelihood that the Impala/BI part of other bigger installations was also in that range.
Hundreds of nodes.
10s of simultaneous queries in dashboard use cases.
1 – 3 million queries/month as a common figure.

Cloudera also expressed the opinions that:

An “overwhelming majority” of Cloudera customers have adopted Impala. (I imagine there’s a bit of hyperbole in that — for one thing, Cloudera has a pricing option in which Impala is not included.)
It is common for Impala customers to use Hive for “data preparation”.
SparkSQL has “order of magnitude” less performance than Impala, but a little more performance than Hive running over either Spark or Tez.
SparkSQL’s main use cases are (and these overlap heavily):
- As part of an analytic process (as opposed to straightforwardly DBMS-like use).
- To persist data outside the confines of a single Spark job.

Wants vs. needs

Curt Monash — Sun, 23 Mar 2014 11:51:54 +0000

In 1981, Gerry Chichester and Vaughan Merlyn did a user-survey-based report about transaction-oriented fourth-generation languages, the leading application development technology of their day. The report included top-ten lists of important features during the buying cycle and after implementation. The items on each list were very similar — but the order of the items was completely different. And so the report highlighted what I regard as an eternal truth of the enterprise software industry:

What users value in the product-buying process is quite different from what they value once a product is (being) put into use.

Here are some thoughts about how that comes into play today.

Wants outrunning needs

1. For decades, BI tools have been sold in large part via demos of snazzy features the CEO would like to have on his desk. First it was pretty colors; then it was maps; now sometimes it’s “real-time” changing displays. Other BI features, however, are likely to be more important in practice.

2. In general, the need for “real-time” BI data freshness is often exaggerated. If you’re a human being doing a job that’s also often automated at high speed — for example network monitoring or stock trading — there’s a good chance you need fully human real-time BI. Otherwise, how much does a 5-15 minute delay hurt? Even if you’re monitoring website sell-through — are your business volumes really high enough that 5 minutes matters much? eBay answered “yes” to that question many years ago, but few of us work for businesses anywhere near eBay’s scale.

Even so, the want for speed keeps growing stronger.

3. Similarly, some desires for elastic scale-out are excessive. Your website selling koi pond accessories should always run well on a single server. If you diversify your business to the point that that’s not true, you’ll probably rewrite your app by then as well.

4. Some developers want to play with cool new tools. That doesn’t mean those tools are the best choice for the job. In particular, boring old SQL has merits — such as joins! — that shiny NoSQL hasn’t yet replicated.

5. Some developers, on the other hand, want to keep using their old tools, on which they are their employers’ greatest experts. That doesn’t mean those tools are the best choice for the job either.

6. More generally, some enterprises insist on brand labels that add little value but lots of expense. Yes, there are many benefits to vendor consolidation, and you may avoid many headaches if you stick with not-so-cutting-edge technology. But “enterprise-grade” hardware failure rates may not differ enough from “consumer-grade” ones to be worth paying for.

7. Some enterprises still insist on keeping their IT operations on-premises. In a number of cases, that perceived need is hard to justify.

8. Conversely, I’ve steered clients away from data warehouse appliances and toward, say, Vertica, because they had a clear desire to be cloud-ready. However, I’m not aware that any of those companies ever actually deployed Vertica in the cloud.

Needs ahead of wants

1. Enterprises often don’t realize how much their lives can be improved via a technology upgrade. Those queries that take 6 hours on your current systems, but only 6 minutes on the gear you’re testing? They’d probably take 15 minutes or less on any competitive product as well. Just get something reasonably modern, please!

2. Every application SaaS vendor should offer decent BI. Despite their limited scope, dashboards specific to the SaaS application will likely provide customer value. As a bonus, they’re also apt to demo well.

3. If your customer personal-identity data that resides on internet-facing systems isn’t encrypted — why not? And please don’t get me started on passwords that are stored and mailed around in plain text.

4. Notwithstanding what I said above about elasticity being overrated, buyers often either underrate their needs for concurrent usage, or else don’t do a good job of testing concurrency. A lot of performance disappointments are really problems with concurrency.

5. As noted above, it’s possible to underrate one’s need for boring old SQL goodness.
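On the concurrency-testing point, a minimal sketch of what such a test can look like: run the same query at several concurrency levels and compare average latency. Everything here is an assumption for illustration. `fake_query` stands in for a real database call, and the lock stands in for whatever shared resource your system contends on.

```python
# Hypothetical sketch: compare average query latency across concurrency levels.
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, List

_shared_resource = threading.Lock()  # stand-in for a contended resource


def fake_query() -> float:
    """Pretend query: holds the shared resource briefly, returns its latency."""
    start = time.perf_counter()
    with _shared_resource:
        time.sleep(0.002)
    return time.perf_counter() - start


def measure(concurrency: int, queries_per_worker: int = 5,
            query: Callable[[], float] = fake_query) -> float:
    """Run `concurrency` workers in parallel; return mean per-query latency."""

    def worker(_: int) -> List[float]:
        return [query() for _ in range(queries_per_worker)]

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        batches = list(pool.map(worker, range(concurrency)))
    return mean(t for batch in batches for t in batch)


results: Dict[int, float] = {c: measure(c) for c in (1, 4, 8)}
```

In a real POC you would swap `fake_query` for a call that issues a representative query, and use concurrency levels that match expected peak usage.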

Wants and needs in balance

1. Twenty years ago, I thought security concerns were overwrought. But in an internet-connected world, with customer data privacy and various forms of regulatory compliance in play, wants and needs for security seem pretty well aligned.

2. There also was a time when ease of set-up and installation was underrated. Not any more, however; people generally understand its great importance.

Things I keep needing to say

Curt Monash — Mon, 12 Aug 2013 06:45:54 +0000

Some subjects just keep coming up. And so I keep saying things like:

Most generalizations about “Big Data” are false. “Big Data” is a horrific catch-all term, with many different meanings.

Most generalizations about Hadoop are false. Reasons include:

Hadoop is a collection of disparate things, most particularly data storage and application execution systems.
The transition from Hadoop 1 to Hadoop 2 will be drastic.
For key aspects of Hadoop — especially file format and execution engine — there are or will be widely varied options.

Hadoop won’t soon replace relational data warehouses, if indeed it ever does. SQL-on-Hadoop is still very immature. And you can’t replace data warehouses unless you have the power of SQL.

Note: SQL isn’t the only way to provide “the power of SQL”, but alternative approaches are just as immature.

Most generalizations about NoSQL are false. Different NoSQL products are … different. It’s not even accurate to say that all NoSQL systems lack SQL interfaces. (For example, SQL-on-Hadoop often includes SQL-on-HBase.)

“Big Data” doesn’t create rapid IT growth. If we only had traditional kinds of data, IT growth would be drastically negative, since Moore’s Law swamps traditional data growth. Whole new categories of data are always needed to fill the gap. And these days, they’re all categorized as “Big Data”.

The single central database is a myth. Things are never that simple, at least at large enterprises. Hence, in particular, the ideal EDW (Enterprise Data Warehouse) is a myth.

Analytic RDBMS and appliances aren’t necessarily expensive. Deals can be had. Yes, most vendors want at least a few hundred thousand dollars for most sales, but there are plenty of exceptions even to that rule. And at either large or small scales, things get very cheap, for example:

Various vendors’ free/“community” editions.
The $2 million/petabyte hardware+software price I published for Vertica.

And Infobright is typically an economical option in between those extremes, if you’re cool with its focus on machine-generated data.

Columnar relational DBMS are relational. Examples include Sybase IQ, Vertica, ParAccel, Infobright and numerous others.

Yes, that’s a tautology. Even so, distressingly many people forget it, columnar RDBMS vendor employees not excepted.

Amazon Redshift proves very little about ParAccel. Amazon bought some stock in ParAccel, and got a cheap license to a subset of ParAccel’s code, perhaps in the same deal. Big whoop. Yes,

It is claimed that there are a lot of Redshift users, I presume low-end ones.
ParAccel is fast.*

But none of that speaks to some profound, ongoing Amazon/ParAccel/Actian relationship.

*I hear that ParAccel is usually faster than Vertica and other alternatives in POCs (Proofs of Concept) and benchmarks. But I also hear that ParAccel’s installation complexity continues to be a POC problem.

New technology in old categories of application will only be adopted as quickly as firms replace their apps. Yes, that’s a tautology too. Even so, it puts an upper bound on, for example, the speed with which on-premises applications will be replaced by cloud alternatives.

SAP HANA is not yet a serious OLTP (OnLine Transaction Processing) DBMS. Yes,

HANA has in some form been under development for a long time; its major antecedent is BI Accelerator, which shipped back in 2006.
RAM-centric processing makes sense.
HANA has a cool-sounding feature list.
SAP claims lots of HANA sales, and not just in conjunction with a few new SAP apps that require HANA to run.

But the stories of HANA sales and deployment momentum sure seem concentrated on analytic use cases. And by the way — even among analytic DBMS vendors, I don’t hear much emphasis on competing vs. HANA.

Current BI trends reflect 1990s deja vu. The hottest business intelligence products and vendors are adopted by departments, on the strength of their snazzy interfaces and short adoption cycles.* That’s exactly how BI spread in the 1990s, only now the word “visualization” gets used more.

*A common phrase for that is land-and-expand.

And finally,

I’m not impressed that your future products will in some small ways be superior to what your competitors have had in production for over a year.

Dave DeWitt responds to Daniel Abadi

Curt Monash — Thu, 06 Jun 2013 04:02:48 +0000

A few days ago I posted Daniel Abadi’s thoughts in a discussion of Hadapt, Microsoft PDW (Parallel Data Warehouse)/PolyBase, Pivotal/Greenplum Hawq, and other SQL-Hadoop combinations. This is Dave DeWitt’s response. Emphasis mine.

A key point that Daniel seems to be making is that parallel relational database systems are significantly faster than those that rely on the use of MapReduce for their query engines. I totally agree. In fact, several of us have been making the same point for years now (starting with the blog posts that Mike Stonebraker and I wrote more than 5 years ago). Time and time again relational database systems have been shown to be significantly faster. Last year we published a paper comparing PDW (w/o PolyBase) to Hive on two identical clusters (http://vldb.org/pvldb/vol5/p1712_avriliafloratou_vldb2012.pdf). We found that PDW was 3-10 times faster than Hive when executing the TPC-H benchmark.

Cloudera (Impala) and Pivotal (Hawq) seem to have come around to this same viewpoint. While both systems use HDFS for storage, neither uses MapReduce for executing query plans of relational operators. The Impala query engine was written from scratch in C++, and Hawq (apparently) uses a version of the Greenplum query engine adapted to read data directly from HDFS.

Hadapt, like Impala and Hawq, assumes (in general) that there is a single cluster in play and that they (as the DBMS vendor) can elect to use the resources (CPU, memory, and disk) on each node of the cluster in the way they think is best. For example, all three use the CPU and memory resources of the cluster to run a relational DB engine on each node. While Impala and Hawq leave the data in HDFS, Hadapt has concluded that they can get better performance by loading the data into PostgreSQL tables before executing most queries. In the case of Hadapt, one conceptually could think of PostgreSQL instances on each Hadoop “datanode” as a special type of “file format” that has the capability of not only storing local data, but also performing some query execution on that local chunk of data. Otherwise, the overall global execution is based on the traditional MapReduce engine accessing the data either from HDFS or from PostgreSQL. It is interesting that all three systems have, to varying degrees, concluded that many of their customers are willing to sacrifice the fault tolerance and ultimate scalability that MapReduce provides for performance.

Unlike Hadapt, Impala, and Hawq, in designing and building PolyBase our goal was not to build a general purpose scalable data warehousing solution. For that we already have SQL Server PDW. Rather, our goal was to extend the capabilities of PDW by allowing customers to use inexpensive Hadoop clusters while preserving the same T-SQL interface (which is used by a large number of third party applications and BI tools) to easily query and combine data regardless of where it lives and what format it is in. We expect that customers will primarily use the Hadoop cluster for their “cold” data or as their “digital shoebox”.

In addition, unlike Hadapt, Hawq, and Impala, PDW with PolyBase does not make a single-cluster assumption. Users may have two or more clusters, or two or more regions of the same cluster dedicated to different types of data. A deliberate design decision that we made at the beginning of the project was not to assume any control over a customer’s Hadoop cluster. Rather, PolyBase only assumes that it (1) can read and write HDFS files and (2) can submit MapReduce jobs to the cluster for execution. PolyBase is agnostic about what operating system the nodes of the Hadoop cluster are running, whose Hadoop distribution the cluster is running (Hortonworks, Cloudera, etc.), or whether the cluster is on-premise or in the cloud. We deliberately adopted this approach as we felt that it gave our customers the degree of flexibility that they need to be successful for a wide range of applications. In PolyBase, clusters are treated as first-class citizens by the system; the decision as to where to process parts of a query is determined by the system’s parallel query optimizer, based on the location of the data and capabilities of the cluster (e.g., # of nodes, CPU, memory of the cluster, load, etc.). In some situations, PDW data may be pushed into a Hadoop cluster to do the processing there instead. Furthermore, PolyBase can also be used to seamlessly query Hadoop-only data from two or more distinct clusters (e.g., one on-premise and another in the cloud) without combining with any RDBMS data.

Some notes on new-era data management, March 31, 2013

Curt Monash — Mon, 01 Apr 2013 08:44:42 +0000

Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.

Performance confusion

Discussions of DBMS performance are always odd, for starters because:

Workloads and use cases vary greatly.
In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.

But in NoSQL/NewSQL short-request processing, performance claims seem particularly confused. Reasons include but are not limited to:

It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
Many workloads are inherently single node (replication aside). Others are not.
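The "synchronously to RAM on 2 or more servers, then asynchronously to disk" pattern noted above can be sketched in a few lines. This is a single-process toy, not any particular product's design; the `Replica` class, the `min_copies` parameter and the explicit `flush` call are assumptions for illustration.

```python
# Hypothetical sketch: acknowledge a write once it is in RAM on several replicas,
# and persist it to disk later, off the request path.
from collections import deque
from typing import Any, Deque, Dict, List, Optional, Tuple


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.memory: Dict[str, Any] = {}   # stands in for RAM
        self.disk: Dict[str, Any] = {}     # stands in for durable storage
        self._pending: Deque[Tuple[str, Any]] = deque()

    def write_to_memory(self, key: str, value: Any) -> None:
        self.memory[key] = value
        self._pending.append((key, value))  # queued for a later flush

    def flush(self, max_items: Optional[int] = None) -> int:
        """Persist queued writes; in a real system this runs in the background."""
        flushed = 0
        while self._pending and (max_items is None or flushed < max_items):
            key, value = self._pending.popleft()
            self.disk[key] = value
            flushed += 1
        return flushed


def replicated_write(replicas: List[Replica], key: str, value: Any,
                     min_copies: int = 2) -> bool:
    """Return (acknowledge) as soon as every replica holds the value in RAM."""
    if len(replicas) < min_copies:
        raise ValueError("not enough replicas for the requested durability")
    for replica in replicas:
        replica.write_to_memory(key, value)
    return True


replicas = [Replica("node-a"), Replica("node-b")]
acked = replicated_write(replicas, "user:1", {"name": "Ann"})
on_disk_before_flush = ["user:1" in r.disk for r in replicas]
flushed_counts = [r.flush() for r in replicas]
```

The window between acknowledgement and flush is exactly where performance and durability claims diverge from systems that commit every write to disk first.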

MongoDB and 10gen

I caught up with Ron Avnur at 10gen. Technical highlights included:

MongoDB’s tunable consistency seems really interesting, with numerous choices available at the program-statement level.
All rumored performance problems notwithstanding, Ron asserts that MongoDB often “kicks butt” in actual proof-of-concept (POC) bake-offs.
Ron cites “12 different language bindings” as a key example of developer functionality giving 10gen an advantage vs. Ron’s previous employer MarkLogic.
10gen is working hard on management tools, security, and so on.
Ron claims that the “MongoDB loses data” knock is a relic of the distant — i.e. 1-2 years ago — past.
We had the same “Who needs joins?” discussion that I used to have with MarkLogic — Ron’s former company — and which MarkLogic has since disavowed.
There’s nothing special about MongoDB’s b-tree indexes. (I mention that because Tokutek thinks it offers a faster MongoDB indexing option.)

While this wasn’t a numbers-oriented conversation, business highlights included:

A lot of MongoDB’s competition is RDBMS — Oracle, SQL Server, MySQL, etc.
MongoDB’s top NoSQL competitor is Cassandra. 10gen sees less Couchbase than before, and also less HBase than Cassandra.
There’s yet another favorable MongoDB soft metric — 50,000 registrants for free online education, 2/3 outside the US.

I can add that anecdotal evidence from other industry participants suggests there’s a lot of MongoDB mindshare.

Specific traditional-enterprise use cases we discussed focused on combining data from heterogeneous systems. Specifically mentioned were:

Reference data/360-degree customer view.
Reference data about securities.
Aggregation of analytic results from various analytic systems across an enterprise. (For risk management).

DBAs’ roles in development

A lot of marketing boils down to “We don’t need no stinking DBAs!!!” I’m thinking in particular of:

NoSQL.
Hadoop and/or exploratory BI* messaging that positions against the alleged badness of “traditional data warehousing”.

*See in particular the comments to that post.

The worst-case data warehousing scenario is indeed pretty bad. It could feature:

Much internal discussion and politicking to determine the One True Way to view various data fields, with …
… lots of ongoing bureaucratic safeguards in the area of data governance.
Long additional efforts in the area of performance tuning.
Data integration projects up the wazoo.

But if the goal is just to grab some data from an existing data warehouse, perhaps add in some additional data from the outside, and start analyzing it — well, then there are many attempted solutions to that problem, including from within the analytic RDBMS world. The question is whether the data warehouse administrators try to help — which usually means “Here’s your data; now go away and stop bothering me!” — or whether they focus on “business prevention”.

Meanwhile, on the NoSQL side:

The smart folks at WibiData felt the need for schema-definition tools over HBase.
Per Ron Avnur, MongoDB users are clamoring for consistency-rule specification via an administrative (rather than programmatic) UI.

It’s the old loose-/tight-coupling trade-off. Traditional relational practices offer a clean interface between database and code, but bundle the database characteristics for different applications tightly together. NoSQL tends to tie the database for any one app tightly to that app, at the cost of difficulties if multiple applications later try to use the same data. Either can make sense, depending on (for example):

How it seems natural to organize your development and data administration talent.
Whether the app is likely to survive long enough that you’ll want to run many other applications against the same database.

YCSB benchmark notes

Curt Monash — Fri, 18 Jan 2013 00:42:49 +0000

Two different vendors recently tried to inflict benchmarks on me. Both were YCSBs, so I decided to look up what the YCSB (Yahoo! Cloud Serving Benchmark) actually is. It turns out that the YCSB:

Was developed by — you guessed it! — Yahoo.
Is meant to simulate workloads that fetch web pages, including the writing portions of those workloads.
Was developed with NoSQL data managers in mind.
Bakes in one kind of sensitivity analysis — latency vs. throughput.
Is implemented in extensible open source code.

That actually sounds pretty good, especially the extensibility part;* it’s likely that the YCSB can be useful in a variety of product selection scenarios. Still, as recent examples show, benchmark marketing is an annoying blight upon the database industry.

*With extensibility you can test your own workloads and do your own sensitivity analyses.

A YCSB overview page features links both to the code and to the original explanatory paper. The clearest explanation of the YCSB I found there was:

Each operation against the data store is randomly chosen to be one of:

Insert: Insert a new record.

Update: Update a record by replacing the value of one field.

Read: Read a record, either one randomly chosen field or all fields.

Scan: Scan records in order, starting at a randomly chosen record key. The number of records to scan is randomly chosen.
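A minimal sketch of that operation mix, run against an in-memory dict rather than a real data store. This is not the YCSB code itself; the proportions, field names and key format are assumptions chosen for illustration.

```python
# Hypothetical sketch of a YCSB-style operation mix against an in-memory store.
import random
from collections import Counter
from typing import Dict, Tuple

FIELDS = [f"field{i}" for i in range(4)]  # made-up field names


def run_workload(n_ops: int, mix: Dict[str, float],
                 seed: int = 0) -> Tuple[Counter, Dict[str, Dict[str, str]]]:
    rng = random.Random(seed)
    store: Dict[str, Dict[str, str]] = {}
    counts: Counter = Counter()
    ops, weights = list(mix), list(mix.values())
    next_id = 0
    for _ in range(n_ops):
        op = rng.choices(ops, weights=weights)[0]
        if not store:
            op = "insert"            # nothing to read, update or scan yet
        keys = sorted(store)         # simple but slow; fine for a sketch
        if op == "insert":
            store[f"user{next_id:06d}"] = {f: "x" for f in FIELDS}
            next_id += 1
        elif op == "update":         # replace the value of one field
            store[rng.choice(keys)][rng.choice(FIELDS)] = "updated"
        elif op == "read":           # one random field, or all fields
            record = store[rng.choice(keys)]
            _ = record[rng.choice(FIELDS)] if rng.random() < 0.5 else dict(record)
        elif op == "scan":           # ordered scan from a random start key
            start = rng.randrange(len(keys))
            length = rng.randint(1, 10)
            _ = [store[k] for k in keys[start:start + length]]
        else:
            raise ValueError(f"unknown operation: {op}")
        counts[op] += 1
    return counts, store


counts, store = run_workload(
    1000, {"read": 0.5, "update": 0.3, "insert": 0.1, "scan": 0.1}
)
```

Changing the weights in `mix` is the kind of do-it-yourself sensitivity analysis that the benchmark's extensibility is meant to allow.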

As was anyway obvious from the benchmark’s purpose, there’s nothing about joins, distributed transactions, or other hallmarks of OLTP (OnLine Transaction Processing).

NuoDB generated some mediocre YCSB results, then made a big fuss because NuoDB got those numbers while operating through SQL. Blech. I guess they proved that NuoDB’s SQL parsing/execution layer is better than the worst thing one can imagine an undergraduate writing as a homework project, but otherwise little substance was demonstrated.

Aerospike’s YCSB story isn’t as bad. Aerospike seems to have used the benchmark pretty much the way it was intended, and produced numbers that look better than NuoDB’s. Still, a few vertical markets aside, why does it matter how far under 10 milliseconds latency can get?* Further, Aerospike managed a 60 GB database with 30 GB of RAM per server, which is an awkward fit with its “You don’t need to put everything in RAM because we’re so fast on flash memory” positioning.

*If you really do care about that, maybe your app shouldn’t be making so many round trips.

So once again, I stand by my position that benchmark marketing is an annoying waste of everybody’s time.

Notes on HBase 0.92

Curt Monash — Tue, 19 Jun 2012 21:52:14 +0000

This is part of a four-post series, covering:

Annoying Hadoop marketing themes that should be ignored.
Hadoop versions and distributions, and their readiness or lack thereof for production.
In general, how “enterprise-ready” is Hadoop?
HBase 0.92 (this post)

As part of my recent round of Hadoop research, I talked with Cloudera’s Todd Lipcon. Naturally, one of the subjects was HBase, and specifically HBase 0.92. I gather that the major themes to HBase 0.92 are:

Performance, scalability, and so on.
“Coprocessors”, which are like triggers or stored procedures.
Security, as the first major application of coprocessors.

HBase coprocessors are Java code that links straight into HBase. As with other DBMS extensions of the “links straight into the DBMS code” kind,* HBase coprocessors seem best suited for very sophisticated users and third parties.** Evidently, coprocessors have already been used to make HBase security more granular — role-based, per-column-family/per-table, etc. Further, Todd thinks coprocessors could serve as a good basis for future HBase enhancements in areas such as aggregation or secondary indexing.

*Examples include unfenced C++ extensions to analytic RDBMS or (which mattered more in the 1990s than now) “blade”/“cartridge”/datatype extensions to extensible RDBMS such as Illustra, Informix, Oracle, or DB2.

**Admittedly, in the current HBase community, a considerable fraction of user organizations fit the “very sophisticated”/co-developer template.

As for scalability and performance, it seems the advances there match clichés such as “low-hanging fruit” or Bottleneck Whack-a-Mole.

HBase b-trees used to be restricted to two levels; now they aren’t.
Replication among data centers has been strengthened (I eventually hear that about most NoSQL projects that aren’t Cassandra).
HBase inherits some performance improvements in Hadoop itself.

Overall, Todd says several tests have indicated HBase performance improvements of 60% or better, with some particular cases of course going much higher (up to 2 1/2X).

My whole HBase discussion with Todd was pretty short, actually; just one of several subjects in a one-hour call. But we did squeeze in one topic that wasn’t 0.92-specific — namely, what does HBase storage tend to be like? Notes on that included:

HBase working sets are commonly in RAM, or else have cache hit ratios in at least the 60-80% range.
Solid-state memory isn’t generally used for HBase persistence. Small fast disks are beginning to appear.
When you do short-request and MapReduce processing against the same HBase database, the MapReduce part is usually still done using cheaper disks.

Exasol update

Curt Monash — Sun, 13 Nov 2011 02:37:13 +0000

I last wrote about Exasol in 2008. After talking with the team Friday, I’m fixing that now. The general theme was as you’d expect: Since last we talked, Exasol has added some new management, put some effort into sales and marketing, got some customers, kept enhancing the product and so on.

Top-level points included:

Exasol’s technical philosophy is substantially the same as before, albeit not with as extreme a focus on fitting everything in RAM.
Exasol believes its flagship DBMS EXASolution has great performance on a load-and-go basis.
Exasol has 25 EXASolution customers, all in Germany.*
5 of those are “cloud” customers, at hosting providers engaged by Exasol.
EXASolution database sizes now range from the low 100s of gigabytes up to 30 terabytes.
Pretty much the whole company is in Nuremberg.

*That excludes some money from Hitachi. Exasol’s Hitachi partnership is still in limbo, an apparent casualty of the world economic crisis.

On the technical side:

As noted in my 2008 post, EXASolution is a columnar, no-head-node MPP (Massively Parallel Processing) DBMS.
The main way EXASolution compresses data is via dictionary/tokenization. 5:1 is a typical compression ratio before mirroring and so on, out of a 2-10:1 range.
EXASolution writes data to blocks in memory that are smaller than what is otherwise its preferred size (1/2 to 5 megabytes). These are sent to disk, where merge eventually happens. Exasol insists that write performance has always been fully satisfactory to customers to date.
EXASolution doesn’t have much in the way of performance tuning knobs. Exasol says they aren’t needed, and says that one really can start an EXASolution POC (Proof of Concept) in a day or so.
EXASolution doesn’t have much in the way of workload management capabilities, except what’s automagic (e.g., short query bias). However, it does collect statistics you can query via your favorite BI tool.
EXASolution doesn’t have much in the way of analytic platform capabilities, although there is some Lua-based scripting. However, there’s something NDA in the analytic platform area Coming Soon.*

In general, the whole thing sounds somewhat like ParAccel, at least at a high level.

*Exasol is not and never has been our client, but we can keep secrets for them even so.
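To make the dictionary/tokenization idea from the technical list above concrete, here is a minimal Python sketch. It is not Exasol's implementation, whose internals weren't discussed; the function names and the sample column are hypothetical. It only illustrates the general technique of storing each distinct value once and replacing the column's entries with small integer tokens.

```python
from typing import Hashable, List, Sequence, Tuple


def dictionary_encode(column: Sequence[Hashable]) -> Tuple[list, List[int]]:
    """Return (dictionary, tokens): each distinct value is stored once,
    and the column itself becomes a list of small integer tokens."""
    dictionary: list = []   # token -> original value
    token_of: dict = {}     # original value -> token
    tokens: List[int] = []
    for value in column:
        if value not in token_of:
            # First time we see this value: assign the next free token.
            token_of[value] = len(dictionary)
            dictionary.append(value)
        tokens.append(token_of[value])
    return dictionary, tokens


def dictionary_decode(dictionary: list, tokens: Sequence[int]) -> list:
    """Rebuild the original column from the dictionary and the tokens."""
    return [dictionary[t] for t in tokens]


# Hypothetical sample column with many repeated values.
cities = ["Nuremberg", "Berlin", "Nuremberg", "Munich", "Berlin", "Nuremberg"]
dictionary, tokens = dictionary_encode(cities)
```

The savings come from repetition: the fewer distinct values a column has relative to its length, the more the token list shrinks versus the raw values, which is why this approach pairs naturally with columnar storage.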

Naturally, Exasol believes EXASolution has fine concurrency, with at least one customer routinely running 2000 concurrent users, 200 concurrent sessions (via connection pooling), and 5-10 concurrent queries. Another customer has 3500 Cognos users. 100-200 concurrent queries appears to be the record peak load. Anyhow, Exasol says that plans to offer real workload management could be accelerated if a need were discovered.
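The funnel in that first customer example (users, then pooled sessions, then queries actually running) is easier to appreciate as ratios. The following Python sketch just does the arithmetic on the figures quoted above; the function name and the dictionary keys are my own, not anything Exasol reports.

```python
def concurrency_fan_in(users, sessions, queries_low, queries_high):
    """Ratios between the layers of a users -> sessions -> queries funnel.

    Ranges are returned as (low, high) tuples, because the number of
    concurrently running queries is itself given as a range.
    """
    return {
        "users_per_session": users / sessions,
        "sessions_per_query": (sessions / queries_high, sessions / queries_low),
        "users_per_query": (users / queries_high, users / queries_low),
    }


# Figures quoted for the one customer: 2000 users, 200 sessions, 5-10 queries.
ratios = concurrency_fan_in(users=2000, sessions=200, queries_low=5, queries_high=10)
```

On those figures, connection pooling folds about 10 users into each session, and only one session in 20 to 40 is running a query at a given moment. That is the kind of load where limited workload management can be tolerable.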

Exasol says it almost never loses POCs, but admits that it competes fairly rarely against Vertica and ParAccel, no doubt for reasons of geography. Exasol boasts one visible Sybase IQ replacement (Sony Music).

While Exasol’s sales to date have been in Germany, there are plans to change that soon. At least one sales cycle is well underway in Eastern Europe. Offices in other Germanic countries are planned. Existing customers are planning to deploy additional copies outside Germany. Discussions are underway regarding other geographies, e.g. English-speaking ones.

Eight kinds of analytic database (Part 1)

Curt Monash — Tue, 05 Jul 2011 08:17:44 +0000

Analytic data management technology has blossomed, leading to many questions along the lines of “So which products should I use for which category of problem?” The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for “big data” is little help.

Let’s try eight categories instead. While no categorization is ever perfect, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need — and in most cases you’ll need several — is a great early step in your analytic technology planning.

Enterprise data warehouse (Full or partial)

Kinds of data likely to be included: All, but especially operational
Likely use styles: All
Canonical example: Central EDW for a big enterprise
Stresses: Concurrency, reliability, workload management

The enterprise data warehouse (EDW) ideal says that you copy all your data into one place, and drive all decision-making from there. Full EDWs are pipe dreams. Still, a partial EDW makes sense for most large enterprises, and many indeed already have one. The first product lines to consider for classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server, especially if you’re going to stress concurrency and/or operational use cases.

Traditional data mart

Kinds of data likely to be included: All
Likely use styles: Business intelligence, budgeting/consolidation, investigative
Examples: Reporting servers, planning/consolidation servers, anything MOLAP, etc.
Stresses: Performance, concurrency, TCO

Whether or not you have something like an enterprise data warehouse, it’s common to have lighter-weight data marts as well. A traditional data mart might drive reports and dashboards. Or it might be specialized for budgeting, planning, and/or consolidation. Some investigative analytics may be in the mix as well.

Any DBMS that can support an EDW can also support a data mart, but it may not be the most cost-effective way to do so. Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them — e.g. Sybase IQ and Vertica — have excellent track records in concurrent usage as well. Ted Codd pushed what amounts to MOLAP (Multidimensional OnLine Analytic Processing) systems for these use cases. But relational DBMS commonly do a better job, which is one reason most major MOLAP products have wound up at RDBMS companies.

Investigative data mart — agile

Kinds of data likely to be included: All, especially customer-centric
Likely use styles: Investigative
Canonical example: A few analysts getting a few TB to examine
Stresses: Ease of setup/load, ease of admin, price/performance

Besides the traditional data mart, there are at least two other kinds. Both are focused on investigative analytics, but they’re differentiated by database size.

If you have just a few analysts,* looking at no more than a few terabytes of data (perhaps even just some gigabytes), and if that data is “single-subject” and fairly homogeneous, your watchwords should be “cheap”, “easy”, and “fast”. You don’t need to invest in much hardware, in expensive software, or in much administrative effort (the analysts can be their own DBAs), nor should you endure much set-up time. Just grab a product, grab some data, and start running queries (or extracts into the statistical tool of your choice).

*If you have dozens or even hundreds of analysts hitting the same database, you’re probably back to the more concurrency-oriented scenarios outlined above.

Infobright is often cost-effective among columnar analytic DBMS. Other vendors might cut you a price break as well. If you have multiple terabytes of data, don’t rule out Netezza’s lowest-end products (even if they’d really rather sell you something bigger). Or, if you’re in the sub-terabyte range, maybe you can get by with an in-memory BI tool such as QlikView, and not do anything special on the DBMS side at all.

Investigative data mart — big

Kinds of data likely to be included: All, especially customer-centric, logs, financial trade, scientific
Likely use styles: Investigative
Canonical example: Single-subject 20 TB – 20 PB relational database
Stresses: Performance, scale-out, analytic functionality

But if you’re looking at tens of terabytes of relational data, or even more, you really do have a “big data” problem. Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum. Performance POCs (Proofs Of Concept) are a big part of the buying process. Vendor price negotiations are crucial too.

Actually, in the low tens of terabytes you might be able to get away with a shared-disk system that has excellent compression. That brings columnar products like Sybase IQ, Infobright, or SAND into consideration, rather than just Vertica and ParAccel.

Assuming you have affordable, scalable query performance, the competitive differentiator can switch to additional analytic functionality. Aster, Netezza, ParAccel, Vertica, and Greenplum either offer full analytic platforms, or seem to be on the path to doing so. Teradata, which now owns Aster Data, offers substantial built-in analytic capability in its traditional products as well, and the same goes for Sybase IQ.
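As a rough summary of the four categories so far, the rules of thumb above can be written down as a small Python decision function. This is only a sketch of my reading of the text. The numeric cut-offs (`FEW_ANALYSTS`, `BIG_DATA_TB`), the `Workload` fields, and the order of the checks are illustrative assumptions, since the categories are deliberately fuzzy and no exact thresholds are given.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    data_tb: float                 # size of the relational data, in terabytes
    analysts: int                  # people hitting the same database
    investigative: bool            # True if the main use style is investigative
    enterprise_wide: bool = False  # True if it is meant as the central store


# Illustrative cut-offs only; the text says "a few analysts" versus "dozens",
# and "a few terabytes" versus "tens of terabytes".
FEW_ANALYSTS = 12
BIG_DATA_TB = 10.0


def suggest_category(w: Workload) -> str:
    """Map a workload onto one of the four categories covered in Part 1."""
    if w.enterprise_wide:
        return "Enterprise data warehouse (full or partial)"
    if not w.investigative:
        return "Traditional data mart"
    if w.data_tb >= BIG_DATA_TB:
        return "Investigative data mart (big)"
    if w.analysts >= FEW_ANALYSTS:
        # Many analysts on one database: back to concurrency-oriented setups.
        return "Traditional data mart"
    return "Investigative data mart (agile)"


small_team = suggest_category(Workload(data_tb=2, analysts=3, investigative=True))
```

Real deployments rarely fit a single branch, which is consistent with the point that most enterprises will need several of these database types at once.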

Continued in Part 2, where we cover some of the more difficult use cases.