Discussion of workload management technology, typically in analytic or mixed-workload DBMS.
As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.
DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.
In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.
Buyer inertia is a greater concern.
- A significant minority of enterprises are highly committed to their enterprise DBMS standards.
- Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
- FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.
A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.
- First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
- Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
- Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
- Meanwhile, Hadoop has emerged as serious competitor for at least some analytic data management, especially but not only at internet companies.
Otherwise I’d say: Read more
One of my lesser-known clients is Citus Data, a largely Turkish company that is however headquartered in San Francisco. They make CitusDB, which puts a scale-out layer over a collection of fully-functional PostgreSQL nodes, much like Greenplum and Aster Data before it. However, in contrast to those and other Postgres-based analytic MPP (Massively Parallel Processing) DBMS:
- CitusDB does not permanently fork PostgreSQL; Citus Data has committed to always working with the latest PostgreSQL release, or at least with one that’s less than a year old.
- Citus Data never made the “fat head” mistake — if a join can’t be executed directly on the CitusDB data-storing nodes, it can’t be executed in CitusDB at all.
- CitusDB follows the modern best-practice of having many virtual nodes on each physical node. Default size of a virtual node is one gigabyte. Each virtual node is technically its own PostgreSQL table.*
- Citus Data has already introduced an open source column-store option for PostgreSQL, which CitusDB of course exploits.
*One benefit to this strategy, besides the usual elasticity and recovery stuff, is that while PostgreSQL may be single-core for any given query, a CitusDB query can use multiple cores by virtue of hitting multiple PostgreSQL tables on each node.
Citus has thrown a few things against the wall; for example, there are two versions of its product, one which involves HDFS (Hadoop Distributed File System) and one of which doesn’t. But I think Citus’ focus will be scale-out PostgreSQL for at least the medium-term future. Citus does have actual customers, and they weren’t all PostgreSQL users previously. Still, the main hope — at least until the product is more built-out — is that existing PostgreSQL users will find CitusDB easy to adopt, in technology and price alike.
|Categories: Aster Data, Citus Data, Columnar database management, Data warehousing, Database compression, Greenplum, Hadoop, Parallelization, PostgreSQL, SQL/Hadoop integration, Transparent sharding, Workload management||6 Comments|
I stopped by MemSQL last week, and got a range of new or clarified information. For starters:
- Even though MemSQL (the product) was originally designed for OLTP (OnLine Transaction Processing), MemSQL (the company) is now focused on analytic use cases …
- … which was the point of introducing MemSQL’s flash-based columnar option.
- One MemSQL customer has a 100 TB “data warehouse” installation on Amazon.
- Another has “dozens” of terabytes of data spread across 500 machines, which aggregate 36 TB of RAM.
- At customer Shutterstock, 1000s of non-MemSQL nodes are monitored by 4 MemSQL machines.
- A couple of MemSQL’s top references are also Vertica flagship customers; one of course is Zynga.
- MemSQL reports encountering Clustrix and VoltDB in a few competitive situations, but not NuoDB. MemSQL believes that VoltDB is still hampered by its traditional issues — Java, reliance on stored procedures, etc.
On the more technical side: Read more
|Categories: Clustering, Clustrix, Columnar database management, Data warehousing, Database compression, In-memory DBMS, MemSQL, NewSQL, NuoDB, Specific users, Vertica Systems, VoltDB and H-Store, Workload management, Zynga||15 Comments|
There’s much confusion about Cloudera’s SQL plans and beliefs, and the company has mainly itself to blame. That said, here’s what I think is going on.
- Hive is good at some tasks and terrible at others.
- Hive is good at batch data transformation.
- Hive is bad at ad-hoc query, unless you really, really need Hive’s scale and low license cost. One example, per Eli Collins: Facebook has a 500 petabyte Hive warehouse, but jokes that on a good day an analyst can run 6 queries against it.
- Impala is meant to be good at what Hive is bad at – i.e., fast-response query. (Cloudera mentioned reliable 100 millisecond response times for at least one user.)
- Impala is also meant to be good at what Hive is good at, and will someday from Cloudera’s standpoint completely supersede Hive, but Cloudera is in no hurry for that day to arrive. Hive is more mature. Hive still has more SQL coverage than Impala. There’s a lot of legacy investment in Hive. Cloudera gets little business advantage if a customer sunsets Hive.
- Impala is already decent at some tasks analytic RDBMS are commonly used for. Cloudera insists that some queries run very quickly on Impala. I believe them.
- Impala is terrible at others, including some of the ones most closely associated with the concept of “data warehousing”. Data modeling is a big zero right now. Impala’s workload management, concurrency and all that are very immature.
- There are some use cases for which SQL-on-Hadoop blows away analytic RDBMS, for example ones involving data transformations – perhaps on multi-structured data – that are impractical in RDBMS.
And of course, as vendors so often do, Cloudera generally overrates both the relative maturity of Impala and the relative importance of the use cases in which its offerings – Impala or otherwise – shine.
- A survey of SQL/Hadoop integration (February, 2014)
- The cardinal rules of DBMS development (March, 2013)
|Categories: Cloudera, Data warehousing, Facebook, Hadoop, SQL/Hadoop integration, Workload management||4 Comments|
It took me a bit of time, and an extra call with Vertica’s long-time R&D chief Shilpa Lawande, but I think I have a decent handle now on Vertica 7, code-named Crane. The two aspects of Vertica 7 I find most interesting are:
- Flex Zone, a schema-on-need technology very much like Hadapt’s (but of course with access to Vertica performance).
- What sounds like an alternate query execution capability for short-request queries, the big point of which is that it saves them from being broadcast across the whole cluster, hence improving scalability. (Adding nodes of course doesn’t buy you much for the portion of a workload that’s broadcast.)
Other Vertica 7 enhancements include:
- A lot of Bottleneck Whack-A-Mole.
- “Significant” improvements to the Vertica management console.
- Security enhancements (Kerberos), Hadoop integration enhancements (HCatalog), and enhanced integration with Hadoop security (Kerberos again).
- Some availability hardening. (“Fault groups”, which for example let you ensure that data is replicated not just to 2+ nodes, but also that the nodes aren’t all on the same rack.)
- Java as an option to do in-database analytics. (Who knew that feature was still missing?)
- Some analytic functionality. (Approximate COUNT DISTINCT, but not yet Approximate MEDIAN.)
Overall, two recurring themes in our discussion were:
- Load and ETL (Extract/Transform/Load) performance, and/or obviating ETL.
- Short-request performance, in the form of more scalable short-request concurrency.
I visited Cloudera Friday for, among other things, a chat about Impala with Marcel Kornacker and colleagues. Highlights included:
- Impala is meant to someday be a competitive MPP (Massively Parallel Processing) analytic RDBMS.
- At the moment, it is not one. For example, Impala lacks any meaningful form of workload management or query optimization.
- While Impala will run against any HDFS (Hadoop Distributed File System) file format, claims of strong performance assume that the data is in Parquet …
- … which is the replacement for the short-lived Trevni …
- … and which for most practical purposes is true columnar.
- Impala is also meant to be more than an RDBMS; Parquet and presumably in the future Impala can accommodate nested data structures.
- Just as Impala runs against most or all HDFS file formats, Parquet files can be used by most Hadoop execution engines, and of course by Pig and Hive.
- The Impala roadmap includes workload management, query optimization, data skipping, user-defined functions, hash distribution, two turtledoves, and a partridge in a pear tree.
Data gets into Parquet via batch jobs only — one reason it’s important that Impala run against multiple file formats — but background format conversion is another roadmap item. A single table can be split across multiple formats — e.g., the freshest data could be in HBase, with the rest is in Parquet.
I had a good chat with IBM about IBM BLU, aka BLU Accelerator or Acceleration. BLU basics start:
- BLU is a part of DB2.
- BLU works like a columnar analytic DBMS.
- If you want to do a join combining BLU and non-BLU tables, all the BLU tables are joined first, and the result set is joined to the other tables by the rest of DB2.
And yes — that means Oracle is now the only major relational DBMS vendor left without a true columnar story.
BLU’s maturity and scalability basics start:
- BLU is coming out in IBM DB2 10.5, this quarter.
- BLU will initially be single-server, but …
- … IBM claims “near-linear” scalability up to 64 cores, and further says that …
- … scale-out for BLU is coming “soon”.
- IBM already thinks all your analytically-oriented DB2 tables should be in BLU.
- IBM describes the first version of BLU as being optimized for 10 TB databases, but capable of handling 20 TB.
BLU technical highlights include: Read more
|Categories: Columnar database management, Data pipelining, Data warehousing, Database compression, IBM and DB2, Workload management||20 Comments|
Well-resourced Silicon Valley start-ups typically announce their existence multiple times. Company formation, angel funding, Series A funding, Series B funding, company launch, product beta, and product general availability may not be 7 different “news events”, but they’re apt to be at least 3-4. Platfora, no exception to this rule, is hitting general availability today, and in connection with that I learned a bit more about what they are up to.
In simplest terms, Platfora offers exploratory business intelligence against Hadoop-based data. As per last weekend’s post about exploratory BI, a key requirement is speed; and so far as I can tell, any technological innovation Platfora offers relates to the need for speed. Specifically, I drilled into Platfora’s performance architecture on the query processing side (and associated data movement); Platfora also brags of rendering 100s of 1000s of “marks” quickly in HTML5 visualizations, but I haven’t a clue as to whether that’s much of an accomplishment in itself.
Platfora’s marketing suggests it obviates the need for a data warehouse at all; for most enterprises, of course, that is a great exaggeration. But another dubious aspect of Platfora marketing actually serves to understate the product’s merits — Platfora claims to have an “in-memory” product, when what’s really the case is that Platfora’s memory-centric technology uses both RAM and disk to manage larger data marts than could reasonably be fit into RAM alone. Expanding on what I wrote about Platfora when it de-stealthed: Read more
|Categories: Business intelligence, Columnar database management, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Market share and customer counts, Memory-centric data management, Platfora, Workload management||12 Comments|
Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:
- Yahoo is a big YARN user.
- There are other — paying — YARN users.
- YARN general availability is now targeted for well before the end of 2013.
Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:
With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.
This is similar to the approach of BDAS Spark:
Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.
although Tez won’t match Spark’s richer list of primitive operations.
More specifically, there will be six primitive Tez operations:
- HDFS (Hadoop Distributed File System) input and output.
- Sorting on input and output (I’m not sure why that’s two operations rather than one).
- Shuffling of input and output (ditto).
A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step compounds — you guessed it! — input sorting, input shuffling, and HDFS output.
I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:
- The requirement for materializing (onto disk) intermediate results that you don’t want to is gone. (Yay!)
- Hadoop jobs will step on each other’s toes less. Instead of Maps and Reduces from unrelated jobs getting interleaved, all the operations from a single job will by default be executed in one chunk. (Even so, I see no reason to expect early releases of Tez to do a great job on highly concurrent mixed workload management.)
- Added granularity brings opportunities for additional performance enhancements, for example in the area of sorting. (Arun loves sorts.)
|Categories: Databricks, Spark and BDAS, Hadoop, Hortonworks, MapReduce, Workload management, Yahoo||14 Comments|
I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:
- How big is your database? How big is your budget?
- How do you feel about appliances?
- How do you feel about the cloud?
- What are the size and shape of your workload?
- How fresh does the data need to be?
Let’s drill down. Read more