Yahoo – DBMS 2 : DataBase Management System Services

More notes on HBase

Curt Monash — Tue, 17 Mar 2015 18:13:50 +0000

1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:

Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
Files have indexes and optional Bloom filters.
Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
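
To make the data-skipping idea concrete, here is a minimal Python sketch of how per-file min/max metadata and a Bloom filter can let a point lookup avoid reading most files. It illustrates the general technique only, not HBase's actual code; `SortedFile`, `BloomFilter`, and `lookup` are hypothetical names, and the choice to key everything on a single row key is an assumption.

```python
import hashlib
from dataclasses import dataclass, field


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


@dataclass
class SortedFile:
    """Hypothetical stand-in for one immutable sorted file."""
    rows: dict  # row key -> value
    min_key: str = field(init=False)
    max_key: str = field(init=False)
    bloom: BloomFilter = field(init=False)

    def __post_init__(self):
        self.min_key, self.max_key = min(self.rows), max(self.rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def lookup(files, key):
    """Return (value, number_of_files_actually_read)."""
    files_read = 0
    for f in files:
        if not (f.min_key <= key <= f.max_key):
            continue  # skipped using min/max metadata alone
        if not f.bloom.might_contain(key):
            continue  # skipped using the Bloom filter
        files_read += 1
        if key in f.rows:
            return f.rows[key], files_read
    return None, files_read


files = [SortedFile({"a1": 1, "a9": 2}), SortedFile({"m1": 3, "m9": 4})]
print(lookup(files, "m1"))  # the first file is skipped by its key range
```

The min/max check is cheap and exact, while the Bloom filter handles keys that fall inside a file's range but are not actually present in it.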

Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.

2. The discussion of which kinds of data are originally put into HBase was a bit confusing.

HBase is commonly used to receive machine-generated data. Everybody knows that.
Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.

OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.

3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that:

Pinterest has a large HBase database on SSDs (Solid-State Drives), a large fraction of which is actually in RAM.
eBay has an HBase database largely on spinning disk, used to inform its search engine.

Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data. That sounds like multiple terabytes per node even before considering a replication factor, so I presume it’s disk-based as well. The takeaway from those examples, other than general big-use-case impressiveness, is that storage choices for HBase can vary greatly by user and application.

4. HBase has master/master geographically remote replication. I gather that Yahoo replicates between a couple of 1000-node clusters, on behalf of its Flurry operation. HBase also has the technical capability to segment data across geographies — i.e., the geo-partitioning feature essential to data sovereignty compliance — but no actual implementations came to mind.

5. Besides the ones already mentioned, and famed HBase user Facebook, a few other users came up.

It seems to be common for ad-tech companies to store in HBase the data that arrives from many different computers and mobile devices.
An agency that Cloudera didn’t name, but which is obviously something like the SEC or CFTC, stores all trade data in HBase.
Cerner — or perhaps its software — stores data in HBase on a patient-by-patient basis.

In general, Cloudera suggested that HBase was used in a fair number of OEM situations.

6. Finally, I have one number: As of January, 2014 there were 20,000 HBase nodes managed by Cloudera software. Obviously, that number is growing very quickly, and of course there are many HBase nodes that Cloudera has nothing to do with.

Related link

A lot of this echoes what I hear from DataStax (December, 2013), notwithstanding the consensus that HBase and Cassandra rarely compete in the marketplace.

Notes on the Hortonworks IPO S-1 filing

Curt Monash — Sun, 07 Dec 2014 13:53:10 +0000

Given my stock research experience, perhaps I should post about Hortonworks’ initial public offering S-1 filing. For starters, let me say:

Hortonworks’ subscription revenues for the 9 months ended last September 30 appear to be:
- $11.7 million from everybody but Microsoft, …
- … plus $7.5 million from Microsoft, …
- … for a total of $19.2 million.
Hortonworks states subscription customer counts (as per Page 55 this includes multiple “customers” within the same organization) of:
- 2 on April 30, 2012.
- 9 on December 31, 2012.
- 25 on April 30, 2013.
- 54 on September 30, 2013.
- 95 on December 31, 2013.
- 233 on September 30, 2014.
Per Page 70, Hortonworks’ total September 30, 2014 customer count was 292, including professional services customers.
Non-Microsoft subscription revenue in the quarter ended September 30, 2014 seems to have been $5.6 million, or $22.5 million annualized. This suggests Hortonworks’ average subscription revenue per non-Microsoft customer is a little over $100K/year.
This IPO looks to be a sharply “down round” vs. Hortonworks’ Series D financing earlier this year.
- In March and June, 2014, Hortonworks sold stock at $12.1871 per share, each share of which was subsequently converted into 1/2 a Hortonworks share.
- The tentative top of the offering’s price range is $14/share.
- That’s also slightly down from the Series C price in mid-2013.
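
A quick back-of-the-envelope check of the "down round" comparison, as a Python sketch. It assumes (this is an interpretation of the bullets above, not a figure quoted from the filing) that the $12.1871 was paid per pre-conversion share, so the effective price per post-conversion share is twice that.

```python
series_d_price = 12.1871   # dollars per Series D share, from the notes above
conversion_ratio = 0.5     # each such share became half a Hortonworks share
ipo_range_top = 14.00      # tentative top of the offering's price range

# ASSUMPTION: effective price per post-conversion share.
effective_price = series_d_price / conversion_ratio

# Fraction by which the top of the IPO range sits below that price.
discount = 1 - ipo_range_top / effective_price

print(f"effective Series D price: ${effective_price:.2f}")
print(f"top of IPO range is {discount:.0%} below that")
```

Under that reading, even the top of the tentative range sits well below the Series D price.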

And, perhaps of interest only to me — there are approximately 50 references to YARN in the Hortonworks S-1, but only 1 mention of Tez.

Overall, the Hortonworks S-1 is about 180 pages long, and — as is typical — most of it is boilerplate, minutiae or drivel. As is also typical, two of the most informative sections of the Hortonworks S-1 are:

The section called Management’s Discussion and Analysis.
The footnotes to the numbers, starting a couple of pages in.

The clearest financial statements in the Hortonworks S-1 are probably the quarterly figures on Page 62, along with the tables on Pages F-3, F-4, and F-7.

Special difficulties in interpreting Hortonworks’ numbers include:

A large fraction of revenue has come from a few large customers, most notably Microsoft. Details about those revenues are further confused by:
- Difficulty in some cases getting a fix on the subscription/professional services split. (It does seem clear that Microsoft revenues are 100% subscription.)
- Some revenue deductions associated with stock deals, called “contra-revenue”.
Hortonworks changed the end of its fiscal year from April to December, leading to comparisons of a couple of eight-month periods.
There was a $6 million lawsuit settlement (some kind of employee poaching/trade secrets case), discussed on Page F-21.
There is some counter-intuitive treatment of Windows-related development (cost of revenue rather than R&D).

One weirdness is that cost of professional services revenue far exceeds 100% of such revenue in every period Hortonworks reports. Hortonworks suggests that this is because:

Professional services revenue is commonly bundled with support contracts.
Such revenue is recognized ratably over the life of the contract, as opposed to a more natural policy of recognizing professional services revenue when the services are actually performed.

I’m struggling to come up with a benign explanation for this.

In the interest of space, I won’t quote Hortonworks’ S-1 verbatim; instead, I’ll just note where some of the more specifically informative parts may be found.

Page 53 describes Hortonworks’ typical sales cycles (they’re long).
Page 54 says the average customer has increased subscription payments 25% year over year, but emphasizes that the sample size is too small to be reliable.
Pages 55-63 have a lot of revenue and expense breakdowns.
Deferred revenue numbers (which are a proxy for billings and thus signed contracts) are on Page 65.
Pages II 2-3 list all (I think) Hortonworks financings in a concise manner.

And finally, Hortonworks’ dealings with its largest customers and strategic partners are cited in a number of places. In particular:

Pages 52-3 cover dealings with Yahoo, Teradata, Microsoft, and AT&T.
Pages 82-3 discuss OEM revenue from Hewlett-Packard, Red Hat, and Teradata, none of which amounts to very much.
Page 109 covers the Teradata agreement. It seems that there’s less going on than originally envisioned, in that Teradata made a nonrefundable prepayment far greater than turns out to have been necessary for subsequent work actually done. That could produce a sudden revenue spike or else positive revenue restatement as of February, 2015.
Page F-10 has a table showing revenue from Hortonworks’ biggest customers (Company A is Microsoft and Company B is Yahoo).
Pages F-37 and F-38 further cover Hortonworks’ relationships with Yahoo, Teradata and AT&T.

Correction notice: Some of the page numbers in this post were originally wrong, surely because Hortonworks posted an original and amended version of this filing, and I got the two documents mixed up. A huge Thank You goes to Merv Adrian for calling my attention to this, and I think I’ve now fixed them. I apologize for the errors!

Related links

Hortonworks business notes, not all of which turn out in retrospect to have been wholly accurate (August, 2013)
Spark vs. Tez (October, 2014)

Spark on fire

Curt Monash — Wed, 30 Apr 2014 10:41:17 +0000

Spark is on the rise, to an even greater degree than I thought last month.

Numerous clients and other companies I talk with have adopted Spark, plan to adopt Spark, or at least think it’s likely they will. In particular:
- A number of analytic-stack companies are joining ClearStory in using Spark. Most of the specifics are confidential, but I hope some will be announced soon.
- MapR has joined Cloudera in supporting Spark, and indeed — unlike Cloudera — is supporting the full Spark stack.
Mike Olson of Cloudera is on record as predicting that Spark will be the replacement for Hadoop MapReduce. Just about everybody seems to agree, except perhaps for Hortonworks folks betting on the more limited and less mature Tez.
Spark’s biggest technical advantages as a general data processing engine are probably:
- The Directed Acyclic Graph processing model. (Any serious MapReduce-replacement contender will probably echo that aspect.)
- A rich set of programming primitives in connection with that model.
- Support also for highly-iterative processing, of the kind found in machine learning.
- Flexible in-memory data structures, namely the RDDs (Resilient Distributed Datasets).
- A clever approach to fault-tolerance.
Spark is a major contender in streaming.
There’s some cool machine-learning innovation using Spark.
Spark 1.0 will drop by mid-May, Apache voters willin’ an’ the creek don’ rise. Publicity will likely ensue, with strong evidence of industry support.*

*Yes, my fingerprints are showing again.
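
The DAG model, in-memory datasets, and the fault-tolerance approach listed above fit together in a way that a toy example may make clearer. The Python sketch below is not Spark code: `LazyDataset` is a hypothetical single-machine stand-in for an RDD, and it assumes that "clever fault-tolerance" means recomputing lost data from recorded lineage rather than from replicas.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records how to compute data, not the data."""

    def __init__(self, source=None, parent=None, op=None, name="source"):
        self._source = source  # only set on the root of the graph
        self._parent = parent  # upstream node in the DAG
        self._op = op          # function from list -> list
        self.name = name
        self._cache = None     # in-memory copy, which may be lost

    def map(self, fn):
        return LazyDataset(parent=self, name="map",
                           op=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LazyDataset(parent=self, name="filter",
                           op=lambda rows: [r for r in rows if pred(r)])

    def lineage(self):
        """List the chain of operations that produces this dataset."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node._parent
        return list(reversed(names))

    def collect(self):
        # Nothing is computed until an action such as collect() is called.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._op(self._parent.collect())
        return self._cache

    def lose_partition(self):
        """Simulate losing the in-memory copy, e.g. a failed worker."""
        self._cache = None


numbers = LazyDataset(source=range(10))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.lose_partition()      # the cached result disappears ...
recovered = evens_squared.collect() # ... and is rebuilt from lineage
print(evens_squared.lineage(), first == recovered)
```

Real RDDs are partitioned across machines and recover at partition granularity; the sketch collapses all of that into one list to keep the lineage idea visible.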

The most official description of what Spark now contains is probably the “Spark ecosystem” diagram from Databricks. However, at the time of this writing it is slightly out of date, as per some email from Databricks CEO Ion Stoica (quoted with permission):

… but if I were to redraw it, SparkSQL will replace Shark, and Shark will eventually become a thin layer above SparkSQL and below BlinkDB.

With this change, all the modules on top of Spark (i.e., SparkStreaming, SparkSQL, GraphX, and MLlib) are part of the Spark distribution. You can think of these modules as libraries that come with Spark.

In an unfortunate non-development, Tachyon is not (yet?) part of Spark, and so it is hard for a Spark job’s data to be shared with other jobs (Spark or otherwise) or processes. That said:

The tight integration of data structures and processes gives similar performance benefits to those of in-process vs. out-of-process in-database analytic functions. (It also of course raises similar stability concerns, but those seem less important in the case of Spark than of a true DBMS.)
From a Hadoop vendor’s standpoint, Tachyon’s benefit of not requiring HDFS (Hadoop Distributed File System) isn’t important, and Tachyon somewhat conflicts with a newish effort called HDFS Caching.

A couple of Spark machine learning stories are very cool, in that they involve intra-day retraining of models. The better-known one is Yahoo’s, which in a prototype built in 120 lines of code trains a new model for recommendation of each candidate top-page news story. When I challenged that anecdote, Ion told me about his own former company Conviva, which retrains models every minute to decide which particular source of streaming video each client system will be connected to.

I am generally skeptical of immature SQL efforts, and SparkSQL is no exception. That said, it seems to be going in sensible directions, which should be welcome to those folks who used or were planning to use Shark anyway.

SparkSQL actually has its own optimizer, rather than using the inappropriate Hive one. As with many new optimizers, it’s starting out rule-based, but is planned to become cost-based down the road.
SparkSQL can run queries against data that’s either inside Spark or outside-but-accessible.
SparkSQL can be accessed via Python and other APIs.
Spark works with the Hive metastore, nee’ HCatalog.

And finally, there’s no public news as to what Databricks’ own business is. I think that’s a bit silly, but in fairness:

The Spark 1.0 launch will consume every bit of marketing bandwidth they have.
They don’t yet want to commit to a delivery date of their first offering.

Hortonworks business notes

Curt Monash — Sat, 24 Aug 2013 11:07:53 +0000

Hortonworks did a business-oriented round of outreach, talking with at least Derrick Harris and me. Notes from my call — for which Rob Bearden didn’t bother showing up — include, in no particular order:

Hortonworks denies advanced acquisition discussions with either Microsoft or Intel. Of course, that doesn’t exactly contradict the widespread story of Intel having made an acquisition offer. Edit: I have subsequently heard, very credibly, that the denial was untrue.
As vendors usually do, Hortonworks denies the extreme forms of Cloudera’s suggestion that Hortonworks’ competitive wins relate to price slashing. But Hortonworks does believe that its license fees often wind up being lower than Cloudera’s, due especially to Hortonworks offering fewer extra-charge items than Cloudera.
Hortonworks used a figure of ~75 subscription customers. Edit: That figure turns out in retrospect to have been inflated. This does not include OEM sales through, for example, Teradata, Microsoft Azure, or Rackspace. However, that does include …
… a small number of installations hosted in the cloud — e.g. ~2 on Amazon Web Services — or otherwise remotely. Also, testing in the cloud seems to be fairly frequent, and the cloud can also be a source of data ingested into Hadoop.
Since Hortonworks a couple of times made it seem that Rackspace was an important partner, behind only Teradata and Microsoft, I finally asked why. Answers boiled down to a Rackspace Hadoop-as-a-service offering, plus joint work to improve Hadoop-on-OpenStack.
Other Hortonworks reseller partners seem more important in terms of helping customers consume HDP (Hortonworks Data Platform), rather than for actually doing Hortonworks’ selling for it. (This is unsurprising — channel sales rarely are a path to success for a product that is also appropriately sold by a direct force.)
Hortonworks listed its major industry sectors as:
- Web and retailing, which it identifies as one thing.
- Media.
- Telecommunications.
- Health care (various subsectors).
- Financial services, which it called “competitive” in the kind of tone that usually signifies “we lose a lot more than we win, and would love to change that”.

In Hortonworks’ view, Hadoop adopters typically start with a specific use case around a new type of data, such as clickstream, sensor, server log, geolocation, or social.

These use cases can be any of a true new application, an enhancement to an existing application, or a general investigative analytics environment.
This adoption is typically driven by a line-of-business group, but IT is a key influencer, and IT usually winds up running the project.
Overall, this accounts for 70% of Hortonworks’ business by some metric.

The other 30% Hortonworks sees is efficiency-oriented — i.e., a cheaper way to store and/or process data.

Hortonworks assigns ELT (Extract/Load/Transform) to this group. Based in part on a subsequent conversation with Cloudera, I gather that batch ELT offload — especially but not only from large Teradata installations — is a significant fraction of the total.
“Data lake” and similar buzzwords fall into this group, as does “re-architecting”.
Hortonworks asserts that adopters from the 70% rapidly move to this kind of use as well, while Teradata customers typically start out in this part.
Unsurprisingly, this part is IT all the way.

One customer apparently estimates its fully burdened Hadoop costs at $900/terabyte/year.

Edit: I followed up on these efficiency-oriented use cases in a conversation with Cloudera.

And finally: One of my favorite things to ask is “When you win, why do you win?” — at least when I think the vendor won’t just reiterate their core marketing messages. Hortonworks gave a great, threefold answer:

Its relationships with Teradata, Microsoft, et al.
Its promise that it can get specific customer-requested features into Apache Hadoop on a specific timeframe. (Yes, the Contribution Olympics are still with us.)
Its claim of greater experience with truly huge clusters — not just Yahoo, but I don’t know who its other examples are.

Related link

A few weeks ago, I talked with Hortonworks at length about technology and other subjects.

Hortonworks, Hadoop, Stinger and Hive

Curt Monash — Tue, 06 Aug 2013 22:49:41 +0000

I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:

Hortonworks founder J. Eric “Eric14” Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)
John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.
~250 employees.
~70-75 subscription customers.

Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:

10ish nodes for a typical starting cluster.
100ish nodes for a typical “data lake” committed adoption.
Teradata UDA (Unified Data Architecture)* customers sometimes (typically?) jumping straight to a data lake scenario.
A few users in the 10s of 1000s of nodes. (Obviously Yahoo is one.)
HBase used in >50% of installations.
Hive probably even more than that.
Hortonworks is seeing a fair amount of interest in Windows Hadoop deployments.

*By the way — Teradata seems serious about pushing the UDA as a core message.

Ecosystem notes, in Hortonworks’ perception, included:

Cloudera is obviously Hortonworks’ biggest distro competitor. Next is IBM, presumably in its blue-forever installed base. MapR is barely on the radar screen; Pivotal’s likely rise hasn’t yet hit sales reports.
Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.
Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)

I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.

Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:

It’s been in preview/release candidate/commercial beta mode for weeks.
Q3 is the goal; H2 is the emphatic goal.
Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)
The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to serve for a similar multi-year period as Hadoop 1’s have. But there also was some YARN stabilization into May.

Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and more than a year will have passed between Cloudera starting to support them and Hortonworks offering Hadoop 2.0.

Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include:

Providing a Hive-friendly execution environment in Hadoop 2.0. For example, this seems to be a main point of Tez, although Tez is also meant to support Pig and so on as well. (Recall the close relationship between Hortonworks and Pig fan Yahoo.)
Providing a Hive-friendly HDFS file format, called ORC. To a first approximation, ORC sounds a lot like Cloudera Impala’s preferred format Parquet.
Improving Hive itself, notably in:
- SQL functionality.
- Query planning and optimization.
- Vectorized execution (Microsoft seems to be helping significantly with that).

Specific notes include:

Some of the Hive improvements — e.g. SQL windowing, better query planning over MapReduce 1 — came out in May.
Others — e.g. Tez port — seem to be coming soon.
Yet others — notably a true cost-based optimizer — haven’t even been designed yet.
Hive apparently often takes 4-5 seconds to plan a query, with a lot of the problem being slowness in the metadata store. (I hope that that’s already improved in HCatalog, but I didn’t think to ask.) Hortonworks thinks 100 milliseconds would be a better number.
Other SQL functionality that got mentioned was UDFs (User Defined Functions) and sub-queries. In general, it sounds as if the Hive community is determined to someday falsify the “Hive supports a distressingly small subset of SQL” complaint.

As for ORC:

ORC manages data in 256 megabyte chunks of rows. Within such chunks, ORC is columnar.
Hortonworks asserts that ORC is ahead of Parquet in such areas as indexing and predicate pushdown, and only admits a Parquet advantage in one area — the performance advantages of being written in C.
The major contributors to ORC are Hortonworks, Microsoft, and Facebook. There are ~10 contributors in all.
ORC has a 2-tiered compression story.
- “Lightweight” type-specific compression is mandatory, for example:
  - Dictionary/tokenization, for single columns within chunks.
  - Run-length encoding for integers.
- Block-level compression on top of that is optional, via a collection of usual-suspect algorithms.
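
The two tiers can be illustrated with a small Python sketch. This is not ORC's on-disk format: the function names are hypothetical, applying run-length encoding to the dictionary tokens is a choice made for this example, and `zlib` merely stands in for whichever block-level algorithm a user might select.

```python
import json
import zlib


def dictionary_encode(values):
    """Replace each value with a small integer token plus a lookup table."""
    dictionary, tokens, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        tokens.append(index[v])
    return dictionary, tokens


def run_length_encode(ints):
    """Collapse runs of equal integers into [value, count] pairs."""
    runs = []
    for n in ints:
        if runs and runs[-1][0] == n:
            runs[-1][1] += 1
        else:
            runs.append([n, 1])
    return runs


def run_length_decode(runs):
    return [value for value, count in runs for _ in range(count)]


# Tier 1: lightweight, type-specific encoding of one column in one chunk.
column = ["US", "US", "US", "DE", "DE", "US"]
dictionary, tokens = dictionary_encode(column)
runs = run_length_encode(tokens)

# Tier 2 (optional): generic block compression on top of the encoded bytes.
block = zlib.compress(json.dumps([dictionary, runs]).encode())

# Round trip back to the original column values.
restored_dict, restored_runs = json.loads(zlib.decompress(block))
restored = [restored_dict[t] for t in run_length_decode(restored_runs)]
print(restored == column)
```

The lightweight tier exploits what is known about the column (few distinct values, long runs), which is why it can be mandatory without costing much; the generic tier knows nothing about the data and is left optional.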

Finally, I asked Hortonworks what it sees as a typical or default Hadoop node these days. Happily, the answers seemed like straightforward upgrades to what Cloudera said in October, 2012. Specifics included:

2 x 6 = 12 cores.
12 or so disks, usually 2-3 terabytes each. 4 TB disks are beginning to show up in “outlier” cases.
Usually 72 gigs or more of RAM. 128 gigs is fairly common. 256 sometimes happens.
10GigE is showing up at some web companies, but Hortonworks groaned a bit about the expense. Hearing that, I didn’t even ask about Infiniband, its use in certain Hadoop appliances notwithstanding.
Hortonworks isn’t seeing much solid-state drive adoption yet, some NameNodes excepted. No doubt that’s a cost issue.
Hortonworks sees GPUs only for “outlier” cases.

Related links

I’ve been posting quite a bit about SQL-on-Hadoop. Links can be found in my June Dan Abadi post.
When I posted in March about the great expense and difficulty of building a good DBMS, I was thinking especially of SQL-on-Hadoop.

Hadoop execution enhancements

Curt Monash — Mon, 11 Mar 2013 10:21:27 +0000

Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:

Yahoo is a big YARN user.
There are other — paying — YARN users.
YARN general availability is now targeted for well before the end of 2013.

Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:

With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.

This is similar to the approach of BDAS Spark:

Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.

although Tez won’t match Spark’s richer list of primitive operations.

More specifically, there will be six primitive Tez operations:

HDFS (Hadoop Distributed File System) input and output.
Sorting on input and output (I’m not sure why that’s two operations rather than one).
Shuffling of input and output (ditto).

A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step compounds — you guessed it! — input sorting, input shuffling, and HDFS output.
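
One way to picture that composition is the Python sketch below. The primitive names are labels invented for this example, not identifiers from Tez, and the intermediate step is an assumption about what a non-MapReduce stage could look like. The point is only to count how often a plan has to touch HDFS, which is the materialization overhead the quoted blog post complains about.

```python
# Hypothetical labels for the six primitives listed above; these are
# names invented for this sketch, not identifiers from the Tez codebase.
HDFS_INPUT, HDFS_OUTPUT = "hdfs_input", "hdfs_output"
SORT_INPUT, SORT_OUTPUT = "sort_input", "sort_output"
SHUFFLE_INPUT, SHUFFLE_OUTPUT = "shuffle_input", "shuffle_output"

PRIMITIVES = {HDFS_INPUT, HDFS_OUTPUT, SORT_INPUT, SORT_OUTPUT,
              SHUFFLE_INPUT, SHUFFLE_OUTPUT}

# The two classic steps, composed as described in the text.
MAP_STEP = (HDFS_INPUT, SORT_OUTPUT, SHUFFLE_OUTPUT)
REDUCE_STEP = (SORT_INPUT, SHUFFLE_INPUT, HDFS_OUTPUT)

# ASSUMPTION: an intermediate stage that consumes shuffled input and
# emits shuffled output without reading or writing HDFS.
INTERMEDIATE_STEP = (SORT_INPUT, SHUFFLE_INPUT, SORT_OUTPUT, SHUFFLE_OUTPUT)


def hdfs_touches(plan):
    """Count HDFS reads and writes across all steps of a plan."""
    return sum(op in (HDFS_INPUT, HDFS_OUTPUT) for step in plan for op in step)


# Two chained MapReduce jobs versus one DAG with an intermediate stage.
chained_mapreduce = [MAP_STEP, REDUCE_STEP, MAP_STEP, REDUCE_STEP]
single_dag = [MAP_STEP, INTERMEDIATE_STEP, REDUCE_STEP]

print(hdfs_touches(chained_mapreduce), hdfs_touches(single_dag))
```

In this toy accounting the chained jobs touch HDFS four times and the single DAG twice; whether a real plan saves that much depends on details the sketch ignores.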

I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:

The requirement for materializing (onto disk) intermediate results that you don’t want to is gone. (Yay!)
Hadoop jobs will step on each other’s toes less. Instead of Maps and Reduces from unrelated jobs getting interleaved, all the operations from a single job will by default be executed in one chunk. (Even so, I see no reason to expect early releases of Tez to do a great job on highly concurrent mixed workload management.)
Added granularity brings opportunities for additional performance enhancements, for example in the area of sorting. (Arun loves sorts.)

YCSB benchmark notes

Curt Monash — Fri, 18 Jan 2013 00:42:49 +0000

Two different vendors recently tried to inflict benchmarks on me. Both were YCSBs, so I decided to look up what the YCSB (Yahoo! Cloud Serving Benchmark) actually is. It turns out that the YCSB:

Was developed by — you guessed it! — Yahoo.
Is meant to simulate workloads that fetch web pages, including the writing portions of those workloads.
Was developed with NoSQL data managers in mind.
Bakes in one kind of sensitivity analysis — latency vs. throughput.
Is implemented in extensible open source code.

That actually sounds pretty good, especially the extensibility part;* it’s likely that the YCSB can be useful in a variety of product selection scenarios. Still, as recent examples show, benchmark marketing is an annoying blight upon the database industry.

*With extensibility you can test your own workloads and do your own sensitivity analyses.

A YCSB overview page features links both to the code and to the original explanatory paper. The clearest explanation of the YCSB I found there was:

Each operation against the data store is randomly chosen to be one of:

Insert: Insert a new record.

Update: Update a record by replacing the value of one field.

Read: Read a record, either one randomly chosen field or all fields.

Scan: Scan records in order, starting at a randomly chosen record key. The number of records to scan is randomly chosen.

As was anyway obvious from the benchmark’s purpose, there’s nothing about joins, distributed transactions, or other hallmarks of OLTP (OnLine Transaction Processing).
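
The quoted operation mix is simple enough to sketch in a few lines of Python. This is not the YCSB client: `run_workload` is a hypothetical function, the store is a plain dict, keys are picked uniformly at random (an assumption, since the quote does not say how records are chosen), and reads always fetch the whole record.

```python
import random

OPS = ("insert", "update", "read", "scan")


def run_workload(store, num_ops, mix, seed=0, max_scan=5):
    """Apply randomly chosen operations to a dict-backed store.

    `mix` maps an operation name from OPS to its probability weight.
    Returns a count of how many times each operation ran.
    """
    rng = random.Random(seed)
    weights = [mix.get(name, 0) for name in OPS]
    counts = dict.fromkeys(OPS, 0)
    next_id = len(store)
    for _ in range(num_ops):
        keys = sorted(store)
        # An empty store can only be inserted into.
        op = rng.choices(OPS, weights=weights)[0] if keys else "insert"
        counts[op] += 1
        if op == "insert":
            store[f"user{next_id:06d}"] = {"field0": rng.random()}
            next_id += 1
        elif op == "update":
            # Replace the value of one field in a randomly chosen record.
            store[rng.choice(keys)]["field0"] = rng.random()
        elif op == "read":
            _ = store[rng.choice(keys)]
        else:  # scan
            # Read a random number of records in key order from a random start.
            start = rng.randrange(len(keys))
            length = rng.randint(1, max_scan)
            _ = [store[k] for k in keys[start:start + length]]
    return counts


store = {}
counts = run_workload(store, 200, {"insert": 0.1, "update": 0.2,
                                   "read": 0.6, "scan": 0.1})
print(counts, len(store))
```

Changing the weights in `mix` is the kind of extensibility praised earlier: a read-heavy or write-heavy workload is just a different dictionary.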

NuoDB generated some mediocre YCSB results, then made a big fuss because NuoDB got those numbers while operating through SQL. Blech. I guess they proved that NuoDB’s SQL parsing/execution layer is better than the worst thing one can imagine an undergraduate writing as a homework project, but otherwise little substance was demonstrated.

Aerospike’s YCSB story isn’t as bad. Aerospike seems to have used the benchmark pretty much the way it was intended, and produced numbers that look better than NuoDB’s. Still, a few vertical markets aside, why does it matter how far under 10 milliseconds latency can get?* Further, Aerospike managed a 60 GB database with 30 GB of RAM per server, which is an awkward fit with its “You don’t need to put everything in RAM because we’re so fast on flash memory” positioning.

*If you really do care about that, maybe your app shouldn’t be making so many round trips.

So once again, I stand by my position that benchmark marketing is an annoying waste of everybody’s time.

Notes on Microsoft SQL Server

Curt Monash — Thu, 29 Nov 2012 10:35:27 +0000

I’ve been known to gripe that covering big companies such as Microsoft is hard. Still, Doug Leland of Microsoft’s SQL Server team checked in for phone calls in August and again today, and I think I got enough to be worth writing about, albeit at a survey level only.

Subjects I’ll mention include:

Hadoop
Parallel Data Warehouse
PolyBase
Columnar data management
In-memory data management (Hekaton)

One topic I can’t yet comment about is MOLAP/ROLAP, which is a pity; if anybody can refute my claim that ROLAP trumps MOLAP, it’s either Microsoft or Oracle.

Microsoft’s slides mentioned Yahoo refining a 6 petabyte Hadoop cluster into a 24 terabyte SQL Server “cube”, which was surprising in light of Yahoo’s history as an Oracle reference.

But first we need some housekeeping. As best I understood Microsoft’s lingo:

Microsoft talks about selling in three form factors, collectively “ABC”:
- A = Appliance, which is how PDW (Parallel Data Warehouse, nee’ DATAllegro) is sold, in partnership with either Dell or HP.
- B = Box, which is a catchy word for “software”.
- C = Cloud*
Names of major releases go with years — SQL Server 2005, 2008, 2012.
- Timing on the next major SQL Server release hasn’t been disclosed yet …
- … but hopefully will be clarified in the first half of 2013.
- In the meantime, it’s safe to say that it’s a small number of years away, not a small number of quarters.
Point releases of SQL Server are called “Service Packs”, and Service Pack 1 for SQL Server 2012 is now generally available.
Public betas for Azure are called “preview”, and that lingo has slipped into other form factors as well.
Microsoft’s Hadoop efforts are called HDInsight, across at least the Box and Cloud form factors.

*I.e. Azure; pay no attention to dictionaries and poets, who say that skies are azure, while clouds are puffy white.

Microsoft’s Hadoop/HDInsight story starts with what you’d expect:

You can get it in the cloud or on-premises.
Hortonworks did a lot of the work.
Microsoft does Tier 1 support; Hortonworks does Tiers 2 & 3.

The first level of HDInsight management tools will be based on Ambari and donated back to Apache open source, but you might want to integrate the use of those with Microsoft’s long-standing proprietary management suites.

Notes on SQL Server Parallel Data Warehouse include:

PDW apparently has real reference customers and so on.
PDW now uses DAS (Direct-Attached Storage) and the like, versus a previous strategy of simulating shared-nothing on a SAN (Storage Area Network).

What sounds like it might be cool is PolyBase, a PDW extension comparable to Hadapt or Teradata Aster SQL-H. Notes on that start:

Amusingly, PolyBase was developed in the lab of famed MapReduce skeptic Dave DeWitt.
PolyBase development has been underway for around 18 months.
PolyBase will ship with the next release of PDW, scheduled for the first half of 2013.

Technically, I gather:

It has or is a “new query processor” for PDW.
HDFS (Hadoop Distributed File System) now will look like an external table to SQL Server.
SQL Server’s query planner/cost-based optimizer has the choice of either pulling data from HDFS into SQL Server, or kicking off MapReduce jobs straight in Hadoop.
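
To make the pull-versus-pushdown choice concrete, here is a toy sketch of how a cost-based decision of that kind could look. It is not a description of Microsoft’s optimizer. The function name `choose_plan`, the cost constants, and the idea of estimating cost from row counts and filter selectivity are all my assumptions, chosen only for illustration.

```python
def choose_plan(rows_in_hdfs: int,
                selectivity: float,
                transfer_cost_per_row: float = 1.0,
                mapreduce_startup_cost: float = 50_000.0,
                hadoop_scan_cost_per_row: float = 0.2) -> str:
    """Return 'pull' or 'pushdown' by comparing two made-up cost estimates.

    All cost constants are hypothetical units, not measured values.
    """
    # Pull: ship every row out of HDFS and filter inside the RDBMS.
    pull_cost = rows_in_hdfs * transfer_cost_per_row

    # Pushdown: pay job startup, filter in Hadoop, ship only surviving rows.
    pushdown_cost = (mapreduce_startup_cost
                     + rows_in_hdfs * hadoop_scan_cost_per_row
                     + rows_in_hdfs * selectivity * transfer_cost_per_row)

    return "pull" if pull_cost <= pushdown_cost else "pushdown"


small_table_plan = choose_plan(rows_in_hdfs=10_000, selectivity=0.5)
big_table_plan = choose_plan(rows_in_hdfs=10_000_000, selectivity=0.01)
```

Under these invented numbers, a small table is cheaper to pull wholesale, while a large table with a selective filter is cheaper to reduce in Hadoop first.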

I didn’t ask whether HDFS and SQL Server live on the same nodes, à la Hadapt, or different ones, à la Teradata Aster, but I’m guessing the latter, based on Microsoft’s PolyBase page.

And by the way — if SQL Server has significant analytic platform capabilities, nobody’s ever briefed me on them. To the extent it doesn’t, PolyBase/Hadoop might evolve into a partial substitute.

Microsoft SQL Server has for a while had a columnar capability, kludged from its indexing system. The big limitations were:

The column store was read-only.
You had to have a row-organized version of the data sitting around somewhere.

Both those restrictions are being lifted — initially just in PDW appliances, but later in the “box” products as well. Naturally, Microsoft reports that compression is great, calling it “10X” just like the other cool columnar kids now do. At one point there were hasty mentions of “vector processing” and something that sounds like Netezza zone maps, but I didn’t get details of either.

Actually, I suspect there’s a bit of kludge left in there somewhere, as the no-row-based-version feature is “optional”, and the column store is being described as a “clustered index”.

That takes us to Hekaton, which is already in “preview” with about 100 customers even though it won’t be generally available until the next major SQL Server release a few years out. As on other subjects, I lack detail, but I gather that Hekaton has some serious in-memory DBMS design features. Specifically mentioned was the absence of locking and latching.

A key point is that you only have to move some of your tables into Hekaton; you can manage the rest on disk as you always did. This may be regarded as somewhere in between storage tiering and full federation, in that SQL Server is one DBMS, but can invoke several very different storage engines.

And that’s all I have for now. Greater substance may or may not follow.

Related links

A vehement, multi-party debate on SAN versus DAS (2008)
DATAllegro’s one and only production customer (2009)
What Netezza zone maps and range partitioning evolved into (2010)
Andrew Brust’s post on PolyBase (last month)
Microsoft’s blog post on Hekaton (also last month)

Hadoop YARN — beyond MapReduce

Curt Monash — Mon, 23 Jul 2012 05:26:14 +0000

A lot of confusion seems to have built up around the following facts:

Hadoop MapReduce is being opened up into something called MapReduce 2 (MRv2).
Something called YARN (Yet Another Resource Negotiator) is involved.
One purpose of the whole thing is to make MapReduce not be required for Hadoop.
MPI (Message Passing Interface) was mentioned as a paradigmatic example of a MapReduce alternative, yet the MPI/YARN/Hadoop effort is somehow troubled.
Cloudera shipped YARN in June, yet simultaneously warned people away from actually using it.

Here’s my best effort to make sense of all that, helped by a number of conversations with various Hadoop companies, but most importantly a chat Friday with Arun Murthy and other Hortonworks folks.

YARN, as an aspect of Hadoop, has two major kinds of benefits:
- The ability to use programming frameworks other than MapReduce.
- Scalability, no matter what programming framework you use.
The YARN availability story goes:
- YARN is in alpha.
- YARN is expected to be in production at year-end, give or take.
- Cloudera made the marketing decision to include YARN in its June Hadoop distribution release anyway, but advised that it was for experimentation rather than production.
- Hortonworks, in its own June release, only shipped code it advised putting into production.
My take on the YARN/MPI story goes something like this:
- Numerous people have told me of YARN/MPI delays.
- One person suggested that Greenplum is taking the lead in YARN/MPI integration, but has gotten slow and reclusive, apparently due to some big company-itis.
- I find that credible because of the Greenplum/SAS/MPI connection.
If I understood Arun correctly, the latency story on Hadoop MapReduce is approximately:
- Arun says that Hadoop’s reputation for taking 10s of seconds to start a Hadoop job is old news. It takes a low single-digit number of seconds.
- However, starting all that Java does take 100s of milliseconds at best — 200 milliseconds in an ideal case, 500 milliseconds more realistically, and that’s just on a single server.
- Thus, if you want human real-time interaction, Hadoop MapReduce is not and likely never will be the way to go. Getting Hadoop MapReduce latencies under a few seconds is likely to be more trouble than it’s worth — because of MapReduce, not because of Hadoop.
In particular — instead of incurring the overhead of starting processes up, Arun thinks low-latency needs should be met in a different way, namely by serving them from already-running processes. The examples he kept mentioning were the event processing projects Storm (out of Twitter, via an acquisition) and S4 (out of Yahoo).

The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly) JobTracker:

Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.
Managing the parallel execution of any specific job. Under YARN, this will be done separately for each job.
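
The split described above can be sketched in a few lines. This is a minimal illustration under my own assumptions, not YARN’s API: the class names `ClusterResourceTracker` and `PerJobMaster`, the "most free RAM first" placement rule, and the node names are all hypothetical. It tracks only RAM, and it assumes the cluster has at least one node.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClusterResourceTracker:
    """Global view: which nodes have how much free RAM (in MB)."""
    free_mb: Dict[str, int]

    def allocate(self, needed_mb: int) -> Optional[str]:
        """Reserve RAM on the node with the most free RAM, or return None."""
        node = max(self.free_mb, key=self.free_mb.get)
        if self.free_mb[node] < needed_mb:
            return None
        self.free_mb[node] -= needed_mb
        return node


@dataclass
class PerJobMaster:
    """Per-job view: decides how one job's tasks are spread over the cluster."""
    job_id: str
    task_mb: int
    placements: List[str] = field(default_factory=list)

    def launch_tasks(self, tracker: ClusterResourceTracker, n_tasks: int) -> int:
        """Ask the global tracker for one slot per task; return how many started."""
        for _ in range(n_tasks):
            node = tracker.allocate(self.task_mb)
            if node is None:
                break  # cluster is full for tasks of this size
            self.placements.append(node)
        return len(self.placements)


# One shared tracker, one master per job.
tracker = ClusterResourceTracker({"node1": 4096, "node2": 2048})
job_a = PerJobMaster("job_a", task_mb=1024)
job_b = PerJobMaster("job_b", task_mb=2048)
started_a = job_a.launch_tasks(tracker, 3)
started_b = job_b.launch_tasks(tracker, 2)
```

The point of the design is that only the tracker knows about the whole cluster, while each job’s scheduling logic lives in its own object and can differ from job to job.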

The current Hadoop MapReduce system is fairly scalable: Yahoo runs 5000 Hadoop jobs, truly concurrently, on a single cluster, for a total of 1.5 to 2 million jobs/cluster/month. Still, YARN will remove scalability bottlenecks.

At my current level of understanding, I don’t think it would be productive for me to try to explain things in a lot more detail than that.

After we talked, Arun sent over a list of links that I’ll just quote verbatim:

Real-time processing:
# Twitter Storm – https://github.com/nathanmarz/storm/wiki
# Apache S4 – http://incubator.apache.org/s4/
– YARN port: https://issues.apache.org/jira/browse/S4-25

Alternate programming paradigms to MapReduce:
# UCB Spark: http://www.spark-project.org/
– YARN port: https://github.com/mesos/spark-yarn/
# OpenMPI – http://www.open-mpi.org/
# HAMA
– YARN port: https://issues.apache.org/jira/browse/HAMA-431
# Giraph (graph processing based on Google Pregel) – http://giraph.apache.org/
– YARN port: https://issues.apache.org/jira/browse/GIRAPH-13

I’ll add that a September, 2011 post on Twitter Storm by David Bienvenido III was extremely helpful, as is a GitHub page on Storm concepts.

A couple more notes on all this:

I finally understand how speculative execution works, in the context of Hadoop. Namely, if the resource scheduler perceives a risk that a subtask will finish late, bottlenecking the overall job, the system will clone the process and run a second copy. Whichever finishes first wins.
Apache Zookeeper is pretty central to Hadoop high availability, and is expected to stay that way even when YARN comes around.
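
The speculative execution idea in the first note can be sketched as follows. This is a simplified illustration, not Hadoop’s scheduler: the function names are hypothetical, a plain timeout (`patience_s`) stands in for "the scheduler perceives a risk that a subtask will finish late", and the sleeping threads stand in for real subtasks.

```python
import concurrent.futures as cf
import time


def run_attempt(attempt_id: str, duration_s: float) -> str:
    """Stand-in for one attempt of a subtask; it sleeps, then reports its id."""
    time.sleep(duration_s)
    return attempt_id


def run_with_speculation(primary_s: float, backup_s: float, patience_s: float) -> str:
    """Start a primary attempt; if it is still running after patience_s seconds,
    start a duplicate attempt and return whichever finishes first."""
    pool = cf.ThreadPoolExecutor(max_workers=2)
    attempts = [pool.submit(run_attempt, "primary", primary_s)]
    done, _ = cf.wait(attempts, timeout=patience_s)
    if not done:
        # The primary looks like a straggler, so launch a speculative copy.
        attempts.append(pool.submit(run_attempt, "backup", backup_s))
        done, _ = cf.wait(attempts, return_when=cf.FIRST_COMPLETED)
    winner = next(iter(done)).result()
    # Do not block on the loser. A real system would kill it; Python threads
    # cannot be killed, so the losing attempt simply runs to completion.
    pool.shutdown(wait=False)
    return winner


straggler_case = run_with_speculation(primary_s=0.5, backup_s=0.05, patience_s=0.05)
healthy_case = run_with_speculation(primary_s=0.01, backup_s=1.0, patience_s=0.3)
```

In the first call the slow primary triggers a duplicate that wins the race; in the second the primary finishes within the patience window and no duplicate is ever started.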

Finally, if you’re coming from an RDBMS background, it’s natural to think of YARN as a workload management system. In that context, I’d observe:

YARN has heretofore only managed RAM. However, …
… Arun said he planned to check in some form of CPU management within the next week.
I think the YARN folks need to talk with some workload management experts at the RDBMS companies to better understand the workload management state of the art.

Big data terminology and positioning

Curt Monash — Mon, 09 Jan 2012 01:35:57 +0000

Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions:

Bigness — Volume, Velocity, size
Structure — Variety, Variability, Complexity

given that

High-velocity “big data” problems are usually high-volume as well.*
Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction.

But the conflation should stop there.

*Low-volume/high-velocity problems are commonly referred to as “event processing” and/or “streaming”.

When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2×2 matrix of possibilities. For want of better alternatives, my suggestions are:

Relational big data is data of high volume that fits well into a relational DBMS.
Multi-structured big data is data of high volume that doesn’t fit well into a relational DBMS. Alternative: Poly-structured big data.
Conventional relational data is data of not-so-high volume that fits well into a relational DBMS. Alternatives: Ordinary/normal/smaller relational data.
Smaller poly-structured data is data for which dynamic schema capabilities are important, but which doesn’t rise to “big data” volume.
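
The 2×2 matrix can be written down as a tiny lookup, which makes it plain that the two dimensions are independent. The function name `classify` and the two boolean inputs are my own framing for illustration; where the line between "high volume" and "not-so-high volume" falls is deliberately left to the caller.

```python
def classify(high_volume: bool, fits_relational: bool) -> str:
    """Map the two quasi-dimensions (bigness, structure) to one of four terms."""
    if high_volume and fits_relational:
        return "relational big data"
    if high_volume and not fits_relational:
        return "multi-structured big data"
    if not high_volume and fits_relational:
        return "conventional relational data"
    return "smaller poly-structured data"


# Example: high-volume log files that do not fit a relational DBMS well.
log_files = classify(high_volume=True, fits_relational=False)
```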

Notes on all this include:

“Relational big data” is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.
The paradigmatic example of “multi-structured big data” is log files. Thus, multi-structured big data is commonly what you need a big bit bucket for.
One might want to equate non-analytic relational big data technology to “NewSQL”. However, I’m struggling to think of a database size range in which the entire NewSQL industry can match Oracle’s market share alone.
One might want to equate non-analytic multi-structured big data technology to “NoSQL”. However:
- “NoSQL” is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.
- “NoSQL” has non-ACID/low(er)-data-integrity connotations that aren’t appropriate for all non-relational systems.
Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
- 1-2 terabytes if you’ve never bought anything past Oracle Standard Edition.
- 5-10 terabytes if you’re already paying for Oracle Enterprise Edition.
- A lot higher than that if you actually find Oracle Exadata to be cost-effective.
Depending on how big one acknowledges as “big”, the market share leader in “big bit bucket” use cases is either Splunk or Hadoop.
If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.
It is wrong to say that the large web companies invented “big data” technology. But it is more reasonable to say they invented much of “multi-structured big data” management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.