eBay – DBMS 2 : DataBase Management System Services

More notes on HBase

Curt Monash — Tue, 17 Mar 2015 18:13:50 +0000

1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:

Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
Files have indexes and optional Bloom filters.
Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
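
To make the data-skipping idea concrete, here is a minimal sketch, not HBase's actual implementation. It assumes files are sorted on a row key (which, as noted, I didn't confirm), and the names `BloomFilter`, `SortedFile`, and `lookup` are hypothetical. A point lookup consults the min/max range and the Bloom filter first, and only searches files that pass both checks.

```python
import bisect
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SortedFile:
    """Hypothetical stand-in for one immutable file sorted by key."""

    def __init__(self, rows):
        self.rows = sorted(rows)                 # (key, value) pairs
        self.keys = [k for k, _ in self.rows]
        self.min_key, self.max_key = self.keys[0], self.keys[-1]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)   # binary search in sorted keys
        if i < len(self.keys) and self.keys[i] == key:
            return self.rows[i][1]
        return None


def lookup(files, key):
    """Return (value, number_of_files_actually_searched)."""
    searched = 0
    for f in files:
        if key < f.min_key or key > f.max_key:
            continue                             # skipped via min/max marks
        if not f.bloom.might_contain(key):
            continue                             # skipped via Bloom filter
        searched += 1
        value = f.get(key)
        if value is not None:
            return value, searched
    return None, searched


files = [
    SortedFile([("user001", "a"), ("user005", "b")]),
    SortedFile([("user100", "c"), ("user200", "d")]),
]
```

With these two files, a lookup of `"user100"` never touches the first file, because the key falls outside its min/max range.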

Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.

2. The discussion of which kinds of data are originally put into HBase was a bit confusing.

HBase is commonly used to receive machine-generated data. Everybody knows that.
Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.

OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.

3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that:

Pinterest has a large HBase database on SSDs (Solid-State Drives), a large fraction of which is actually in RAM.
eBay has an HBase database largely on spinning disk, used to inform its search engine.

Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data. That sounds like multiple terabytes per node even before considering a replication factor, so I presume it’s disk-based as well. The takeaway from those examples, other than general big-use-case impressiveness, is that storage choices for HBase can vary greatly by user and application.

4. HBase has master/master geographically remote replication. I gather that Yahoo replicates between a couple of 1000-node clusters, on behalf of its Flurry operation. HBase also has the technical capability to segment data across geographies — i.e., the geo-partitioning feature essential to data sovereignty compliance — but no actual implementations came to mind.

5. Besides the ones already mentioned, and famed HBase user Facebook, a few other users came up.

It seems to be common for ad-tech companies to store in HBase the data that arrives from many different computers and mobile devices.
An agency that Cloudera didn’t name, but which is obviously something like the SEC or CFTC, stores all trade data in HBase.
Cerner — or perhaps its software — stores data in HBase on a patient-by-patient basis.

In general, Cloudera suggested that HBase was used in a fair number of OEM situations.

6. Finally, I have one number: As of January, 2014 there were 20,000 HBase nodes managed by Cloudera software. Obviously, that number is growing very quickly, and of course there are many HBase nodes that Cloudera has nothing to do with.

Related link

A lot of this echoes what I hear from DataStax (December, 2013), notwithstanding the consensus that HBase and Cassandra rarely compete in the marketplace.

Data as an asset

Curt Monash — Mon, 22 Sep 2014 03:49:00 +0000

We all tend to assume that data is a great and glorious asset. How solid is this assumption?

Yes, data is one of the most proprietary assets an enterprise can have. Any of the Goldman Sachs big three* — people, capital, and reputation — are easier to lose or imitate than data.
In many cases, however, data’s value diminishes quickly.
Determining the value derived from owning, analyzing and using data is often tricky — but not always. Examples where data’s value is pretty clear start with:
- Industries which long have had large data-gathering research budgets, in areas such as clinical trials or seismology.
- Industries that can calculate the return on mass marketing programs, such as internet advertising or its snail-mail predecessors.

*“Our assets are our people, capital and reputation. If any of these is ever diminished, the last is the most difficult to restore.” I love that motto, even if Goldman Sachs itself eventually stopped living up to it. If nothing else, my own business depends primarily on my reputation and information.

This all raises the idea — if you think data is so valuable, maybe you should get more of it. Areas in which enterprises have made significant and/or successful investments in data acquisition include:

Actual scientific, clinical, seismic, or engineering research.
Actual selling of (usually proprietary) data, with the straightforward economic proposition of “Get once, sell to multiple customers more cheaply than they could get it themselves.” Examples start:
- This is the essence of the stock quote business. And Michael Bloomberg started building his vast fortune by adding additional data to what the then-incumbents could offer, for example by getting fixed-income prices from Cantor Fitzgerald.*
- Multiple marketing-data businesses operate on this model.
- Back when there was a small but healthy independent paper newsletter and directory business, its essence was data.
- And now there are many online data selling efforts, in niches large and small.
Internet ad-targeting businesses. Making money from your great ad-targeting technology usually involves access to lots of user-impression and de-anonymization data as well.
Aggressive testing by internet businesses, of substantive offers and marketing-display choices alike. At the largest, such as eBay, you’ll rarely see a page that doesn’t have at least one experiment on it. Paper-based direct marketers take a similar approach. Call centers perhaps should follow suit more than they do.
Surveys, focus groups, etc. These are commonly expensive and unreliable (and the cheap internet ones commonly irritate people who do business with you). But sometimes they are, or seem to be, the only kind of information available.
Free-text data. On the whole I’ve been disappointed by the progress in text analytics. Still — and this overlaps with some previous points — there’s a lot of information in text or narrative form out there for the taking.
- Internally you might have customer emails, call center notes, warranty reports and a lot more.
- Externally there’s a lot of social media to mine.

*Sadly, Cantor Fitzgerald later became famous for being hit especially hard on 9/11/2001.

And then there’s my favorite example of all. Several decades ago, especially in the 1990s, supermarkets and mass merchants implemented point-of-sale (POS) systems to track every item sold, and then added loyalty cards through which they bribed their customers to associate their names with their purchases. Casinos followed suit. Airlines of course had loyalty/frequent-flyer programs too, which were heavily related to their marketing, although in that case I think loyalty/rewards were truly the core element, with targeted marketing just being an important secondary benefit. Overall, that’s an awesome example of aggressive data gathering. But here’s the thing, and it’s an example of why I’m confused about the value of data — I wouldn’t exactly say that grocers, mass merchants or airlines have been bastions of economic success. Good data will rarely save a bad business.

Related links

I first wrote up this point in a 2005 Computerworld column, and added a text-analytics nuance a year later, but since then I seem to have talked about it much more than I’ve written it down.
Please always keep in mind the risks to privacy in whatever you do.

Kinds of data integration and movement

Curt Monash — Mon, 12 Mar 2012 08:53:50 +0000

“Data integration” can mean many different things, to an extent that’s impeding me from writing about the area. So I’ll start by simply laying out some of the myriad ways that data can be brought to where it is needed, and worry about other subjects later. Yes, this is a massive wall of text, and incomplete even so — but that in itself is my central point.

There are two main paradigms for data integration:

Movement or replication — you take data from one place and copy it to another.
Federation — you treat data in multiple different places logically as if it were all in one database.
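
As a toy illustration of federation, the sketch below uses SQLite's `ATTACH DATABASE` so that one connection can join tables living in two separate databases. The table and schema names (`customers`, `billing.invoices`) are made up for the example, and real federation usually spans different systems rather than two SQLite databases.

```python
import sqlite3

# Two separate in-memory databases reachable through one connection.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO main.customers VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO billing.invoices VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])


def totals_by_customer(connection):
    """Query both databases logically as if they were one."""
    return connection.execute(
        "SELECT c.name, SUM(i.amount) "
        "FROM main.customers AS c "
        "JOIN billing.invoices AS i ON i.customer_id = c.id "
        "GROUP BY c.name ORDER BY c.name"
    ).fetchall()
```

No data is copied between the two databases; the join is resolved at query time.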

Data movement and replication typically take one of three forms:

Logical, transactional, or trigger-based — sending data across the wire every time an update happens, or as the result of a large-result-set query/extract, or in response to a specific request.
Log-based — like logical replication, but driven by the transaction/update log rather than the core data management mechanism itself, so as to avoid directly overstressing the DBMS.
Block/file-based — sending chunks of data, and expecting the target system to store them first and only make sense of them afterward.
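
The log-based form can be sketched as follows. This is a simplified illustration under my own assumptions, with hypothetical `Primary` and `Replica` classes: the replica never queries the primary's data, it only replays the append-only update log from the last position it applied.

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []  # append-only update log: (sequence, op, key, value)

    def put(self, key, value):
        self.data[key] = value
        self.log.append((len(self.log), "put", key, value))

    def delete(self, key):
        self.data.pop(key, None)
        self.log.append((len(self.log), "delete", key, None))


class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0  # next log sequence number to apply

    def catch_up(self, log):
        """Replay only the log entries not yet applied."""
        for seq, op, key, value in log[self.applied:]:
            if op == "put":
                self.data[key] = value
            else:
                self.data.pop(key, None)
            self.applied = seq + 1


primary, replica = Primary(), Replica()
primary.put("a", 1)
primary.put("b", 2)
replica.catch_up(primary.log)
```

Because the replica tracks its own position in the log, it can fall behind and later resume without the primary doing any extra query work.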

Beyond the core functions of movement, replication, and/or federation, there are other concerns closely connected to data integration. These include:

Transparency and emulation, e.g. via a layer of software that makes data in one format look like it’s in another. (If memory serves, this is the use case for which Larry DeBoever coined the term “middleware.”)
Cleaning and quality — with new uses of data can come new requirements for accuracy.
Master, reference, or canonical data —
Archiving and information preservation — part of keeping data safe is ensuring that there are copies at various physical locations. Another part can be making it logically tamper-proof, or at least highly auditable.

In particular, the following are largely different from each other.

Local replication, for high availability and/or read scalability. If you have an important database application — SQL or non-SQL as the case may be — and you want to ensure its availability, you may choose to have a second machine capable of running it, on standby, possessed of a fully and reliably current copy of the data. These needs are usually met by local, synchronous replication, where “synchronous” means that an update is not committed on the first machine until there’s assurance it will also go through on the second one. (Fortunately, there are faster forms of synchronous replication than two-phase commit.)

In some cases, you also need the application to run against several copies of the data, for performance. That can be achieved with similar technology.

Remote replication, most commonly (but not only) for disaster recovery. If you really, really want to ensure application availability, you may also choose to have a live system ready to go at a different physical location. In that case you still want best-realistic-speed for data replication, but it needs to be asynchronous — your updates on the main system can’t afford to wait for acknowledgements to be sent back and forth across a wide-area network.

Synchronously replicated or not, you might want to send work to your high-availability or disaster recovery database copies, for performance, to the extent that you have them anyway. In particular, asynchronous replication is fast enough for almost any analytic use case.
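
The difference between the synchronous and asynchronous cases can be shown in a few lines. This is a deliberately simplified sketch with hypothetical class names; it ignores networking, failures, and ordering guarantees, and only shows when the commit happens relative to the copy.

```python
from collections import deque


class Standby:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledgement back to the primary


class SyncPrimary:
    """Commits locally only after the standby has acknowledged."""

    def __init__(self, standby):
        self.data = {}
        self.standby = standby

    def commit(self, key, value):
        if not self.standby.apply(key, value):
            raise RuntimeError("standby did not acknowledge; commit aborted")
        self.data[key] = value


class AsyncPrimary:
    """Commits locally at once; updates are shipped to the standby later."""

    def __init__(self, standby):
        self.data = {}
        self.standby = standby
        self.pending = deque()

    def commit(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))  # no waiting on the wide-area link

    def ship(self):
        while self.pending:
            self.standby.apply(*self.pending.popleft())


local_standby, remote_standby = Standby(), Standby()
sync_primary = SyncPrimary(local_standby)
async_primary = AsyncPrimary(remote_standby)
```

After `async_primary.commit(...)` the remote copy lags until `ship()` runs, which is the window a disaster recovery site has to tolerate.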

Low-latency replication, to populate an analytic database. It’s increasingly desirable to stream/trickle operational updates straight into an analytic database. But the two database management systems involved are likely to be different. That places additional demands on the replication technology beyond what is needed for replicating like-to-like.

Rebalancing, in a shared-nothing and/or adaptable DBMS. Databases are not always kept in single, homogeneous systems. That can create the need to move or copy data from one place to another. An obvious example is when you add, delete, or repair nodes in a shared-nothing system. Another is when data is moved among different tiers of storage (e.g. solid-state vs. hard-disk).

Cross-database information lifecycle management (ILM). Sometimes you rebalance among different databases, managed by different hardware or software. Even if we assume that off-line storage isn’t involved — “disk is the new tape” — general ILM is a lot more complicated than the single-DBMS kind.

ETL (Extract/Transform/Load). In the simplest cases, ETL takes data from one database and puts it in another, often on a batch basis, sometimes in a trickle/stream. But unlike what we call “replication”, ETL also allows significant changes to data along the way. Even ETL distinguished primarily by performance puts data through complex processing pipelines.

Conceptually, and irrespective of what really is or isn’t going on, it’s probably easier to think of ETL as something you copy data into and then back out of than simply a set of pipes.

ELT (Extract/Load/Transform). On the other hand, sometimes ETL’s main function is indeed piping, with the transformation happening after the data gets to its new location. ELT can be appealing when (for example) the destination is a cheap, efficient analytic DBMS. Also, a particularly rich form of ELT is possible using Hadoop (or other) MapReduce.

Often, it’s reasonable to say that the “E” and “L” parts of ELT are done via ETL technology, while the “T” is done via something else.
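
The ETL/ELT contrast is mostly about where the “T” runs. The sketch below uses SQLite as a stand-in for the destination DBMS; the sample rows and the table names `raw_sales` and `sales` are invented for illustration. In `etl` the cleanup happens in the pipeline before loading, while in `elt` the raw rows are loaded first and the destination's own SQL engine does the transformation.

```python
import sqlite3

# Messy source rows: (customer name, amount as text).
RAW_ROWS = [("  Alice ", "12.50"), ("BOB", "7"), ("alice", "2.5")]


def etl(rows):
    """Transform in the pipeline, then load the cleaned rows."""
    dest = sqlite3.connect(":memory:")
    dest.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in rows]
    dest.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    return dest


def elt(rows):
    """Load the raw rows as-is, then transform inside the destination."""
    dest = sqlite3.connect(":memory:")
    dest.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
    dest.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    dest.execute(
        "CREATE TABLE sales AS "
        "SELECT lower(trim(customer)) AS customer, "
        "       CAST(amount AS REAL) AS amount "
        "FROM raw_sales"
    )
    return dest


def totals(dest):
    return dest.execute(
        "SELECT customer, SUM(amount) FROM sales "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
```

Both paths end with the same `sales` table; the ELT path also keeps the untouched `raw_sales` copy in the destination, which is part of its appeal when storage there is cheap.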

Data mart spin-out. When he was at eBay, Oliver Ratzesberger made a big deal of the ability to spin out a data mart very quickly. There are two kinds of ways to do this:

Virtual data mart spin-out, with no physical data movement at all. (This is Oliver’s preferred way.)
Physical data mart spin-out, based on copying the data. Greenplum, inspired by then-customer eBay, was probably the first to make a big fuss about that.

One way and/or the other, fast data-mart spin-out has become an important — albeit still forward-leaning — feature for analytic DBMS.

Business intelligence tools, querying multiple databases. Often, enterprises have looked to BI to achieve what they see as data integration. It’s pretty straightforward for a BI tool to query multiple relational databases. This is not exactly the same thing as doing ETL to support a BI project, or even as selling BI and ETL more or less bundled together.

Indexing, search, and/or query of external databases, from within a particular data store. Sometimes it’s the data store itself that reaches out to other databases. Text search engines are particularly likely to index information stored in other places — and by the way, in the case of text, the index usually holds a complete copy of the information being indexed. But relational DBMS have occasionally-used “external table” functionality as well.

Different storage engines in the same DBMS. Frequently, the makers of a single database management system find it advantageous to have two or more rather different storage engines under the covers. The base case is that some data gets put in one engine, some in another, and that’s that. But in a few cases, data might move from one engine to another. An example of the latter strategy is Vertica, which features both a write-optimized store (in memory) and a read-optimized store (what we really think of as Vertica).

Bidirectional Hadoop (HDFS)/DBMS connectors. A variant on these two approaches is the set of bidirectional Hadoop connectors that various DBMS vendors have announced. Details vary, but functionality can include the ability to do DBMS queries that incorporate Hadoop data and/or Hadoop jobs that directly access the DBMS.

Service-Oriented Architecture (SOA), Enterprise Application Integration (EAI), Enterprise Service Bus (ESB), Composite applications … There are a whole lot of concepts that boil down to “letting loosely-coupled applications or databases — usually operational ones — exchange data as needed.” I probably should know more about these areas.

Of course, even that taxonomy understates the complexity of the subject. Most notably, various capabilities are combined in single vendors, products, or projects. For example:

If you’re going NewSQL or NoSQL, you probably expect answers for local replication, remote replication, and rebalancing. You may want to stream to an analytic DBMS or Hadoop cluster as well.
ELT and ETL are commonly combined into an ETLT strategy. In particular, ETL vendors are trying to subsume the data transformation capabilities of Hadoop.

I’m sure there’s much here to disagree with, and even more to criticize in the way of omissions. Fire away! Just please recall — no market categorization is ever precise.

QlikView 11 and the rise of collaborative BI

Curt Monash — Wed, 16 Nov 2011 13:19:52 +0000

QlikView 11 came out last month. Let me start by pointing out:

As one might expect, QlikView 11 contains fairly leading-edge stuff, but also some “better late than never” features.
The leading-edge stuff is concentrated in the general area of “collaboration”.
Additionally, QlikTech is always pushing the QlikView user interface ahead in various ways.
The “Well, it’s about time!” feature list starts with the ability to load QlikView via third-party ETL tools (Informatica now, others coming).
QlikTech is generally good at putting up pretty pictures of its product. You can find some in the “What’s New in QlikView 11” document via a general QlikView resource page.*
Stephen Swoyer wrote a good article summarizing QlikView 11.

*One confusing aspect to that paper: non-standard uses of the terms “analytic app” and “document”.

As QlikTech tells it, QlikView 11 adds two kinds of collaboration features:

Integration with social media, which QlikTech calls “asynchronous integration.”
Direct sharing of the QlikView UI, which QlikTech calls “synchronous integration.”

I’d add a third kind, because QlikView 11 also takes some baby steps toward what I regard as a key aspect of BI collaboration — the ability to define and track your own metrics. It’s way, way short of what I called for in metric flexibility in a post last year, but at least it’s a small start.

That direct sharing of user interfaces is a cool feature, which every business intelligence vendor should offer. In an era of distributed workforces, when people can’t be assumed able to huddle around the same desk, it has value even for use among close coworkers. But it also should prove useful in a variety of more naturally remote use cases, multiple examples of which can be found in each of the areas of:

Support (internal or external).
Faceoffs — I mean collaborations — between two or more enterprise departments. Examples might include: manufacturing and purchasing, manufacturing and sales, or accounting and anybody else.

As for social media being used for BI collaboration — that’s generally in the air. For example:

salesforce.com is pushing enterprise social media use broadly, and will surely increase its emphasis on the social media/BI intersection now that Dave Kellogg is there.
Spotfire has announced similar features in its latest release.
The more cumbersome side of the feature set (portal-based collaboration, emailing of individual reports) has been available from multiple vendors for years.
eBay open-sourced a more dataset-centric version of the idea, just as Oliver Ratzesberger left the firm.*

*Edit: That didn’t turn out to actually happen.

BI has been a communication tool since the first green paper report was dumped on the first desk. And there’s been collaboration in doing analysis at least since it’s been possible to email .XLS file attachments. Still, BI is too often used as bludgeon rather than binocular. Hopefully, the current generation of technology will finally serve to change that.

What those nested data structures are about

Curt Monash — Wed, 19 Oct 2011 17:29:59 +0000

As I’ve noted before, the very big web companies have an issue with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.

The explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of eBay’s Singularity project. In simplest terms, one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs, which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:

All 50 search results you were shown, and their positions in the search rankings.
Every ad, image, or graphical element.
An ID as to which test you were participating in (every page you see on eBay has some element being tested).

*Edit: Oliver subsequently moved on to Sears and then Teradata.

There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all this information via joins would be brutally expensive. What’s more, reconstructing all this information via joins could be impractical. Some comes from third party ad servers, which might not reproduce the same ads upon demand. Other information is in the form of rankings, which can’t always be reliably reproduced from one query to the next. (That’s just one of several reasons text search and relational DBMS are an awkward fit.)

Also, there’s a strong dynamic schema flavor to these databases. The list of attributes for one web click might be very different in kind from the list for the next page. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn’t necessarily make a lot of sense.
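
One way to store such events, sketched here purely as an assumption rather than as anybody's actual schema, is to promote a few attributes that appear on nearly every event into fixed columns and string the rest together as name-value pairs. The attribute names and the `PROMOTED` list are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical attributes common enough to deserve their own columns.
PROMOTED = ("page_id", "user_id", "test_id")


def encode_event(attrs):
    """Split one event into fixed columns plus a name-value overflow string."""
    columns = {name: attrs.get(name) for name in PROMOTED}
    rest = {k: v for k, v in attrs.items() if k not in PROMOTED}
    overflow = urlencode(sorted(rest.items()))  # e.g. "a=1&b=2"
    return columns, overflow


def decode_overflow(overflow):
    """Recover the variable attributes from the overflow string."""
    return dict(parse_qsl(overflow, keep_blank_values=True))


search_page = {
    "page_id": "p1", "user_id": "u9", "test_id": "t42",
    "result_1": "item77", "ad_slot_top": "ad5",
}
image_page = {"page_id": "p2", "user_id": "u9", "image_3": "img8"}
```

The two sample events have different attribute lists, yet both fit the same three columns plus one string, with no joins needed to reassemble an event.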

Some notes on Hadoop (mainly) and appliances

Curt Monash — Fri, 23 Sep 2011 19:59:42 +0000

1. EMC Greenplum has evolved its appliance product line. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says “Hooray! Our appliance is all-in-one!” Big whoop.

2. That said, the Hadoop part of EMC’s story is based on MapR, which so far as I can tell is actually a pretty good Hadoop implementation. More precisely, MapR makes strong claims about performance and so on, and Apache Hadoop folks don’t reply “MapR is full of &#$!” Rather, they say “We’re going to close the gap with MapR a lot faster than the MapR folks like to think — and by the way, guys, thanks for the butt-kick.” A lot more precision about MapR may be found in this M. C. Srivas SlideShare.

3. On its latest earnings call, Oracle clearly said it would introduce a Hadoop appliance, versus just hinting at a Hadoop appliance the prior quarter. The money quote was:

Finally, big data or the searching of large amounts of data using Hadoop. After Hadoop finishes filtering the data, the place you want to put that data is an Oracle Database, and that’s what a lot of our customers are doing. And we are exploiting the trend, the big data technology and the big data trend, if you prefer, by building a Hadoop appliance that attaches to the Oracle Exadata database or any Oracle Database for that matter. But you don’t have to buy our Hadoop appliance if you can use whatever servers you want running Hadoop, and we provide the interface between Hadoop and the Oracle Database.

In other words, Oracle is saying “We’d like to sell you a Hadoop appliance, but you can run Hadoop in some other way and we’ll coexist with it just fine.” That makes sense; refusing to coexist with Hadoop is not exactly a realistic option.

4. Back in June, I expressed great skepticism about the idea of a Hadoop appliance. There was at least partial pushback in the comment thread from both Amr Awadallah and Eric Baldeschwieler. Oops.

Their reasoning seems to be centered around matters of installation, administration, and general packaging.

5. A month ago I noted aggressive near-term plans for Apache Hadoop evolution. As noted above, one reason this is needed is competition from folks like MapR. Also, I note that:

Three years ago, Oliver Ratzesberger’s group at eBay complained that CPU utilization running Hadoop was at 18%.
Now Oliver uses a figure of 10-15%, and attributes an even lower figure to — I’m guessing here — Yahoo. (Another possibility might be Facebook.)
In between, eBay became one of the biggest and most prominent users of Hadoop.

The moral of eBay’s Hadoop adventures, as I see it, is neither “Hadoop sucks!” nor “Hadoop doesn’t suck!”; rather, it’s that there’s a lot of scope for Hadoop to operate differently in the future than it does today.

Similarly, whatever throughput Yahoo does or doesn’t get, it clearly has adopted Hadoop at the expense of the columnar-in-Postgres system it previously was so proud of.

Also, there has been a claim going around that — notwithstanding NameNode’s status as a single point of Hadoop failure — no Hadoop installation has ever lost data due to a NameNode failure. The folks at MapR beg to differ, and sent over some links that sure seem to say the opposite.

6. Since we’ve just established that Hadoop will change, rapidly and pretty fundamentally, what exactly is the benefit of an appliance that is “balanced” for Hadoop usage today?

Notes and links October 22, 2010

Curt Monash — Fri, 22 Oct 2010 06:47:05 +0000

A number of recent posts have had good comments. This time, I won’t call them out individually.

Evidently Mike Olson of Cloudera is still telling the machine-generated data story, exactly as he should be. The Information Arbitrage/IA Ventures folks said something similar, focusing specifically on “sensor data” …

… and, even better, went on to say:

Privacy is dead.
What do we consider to be the boundaries of privacy, especially with respect to items like medical data? In a data privacy-free world, should we be regulating data usage instead? How do we deal with asymmetric access to our personal data, e.g., how is it that insurance companies claim the right to our personal information?

Obviously, my answer to the second question is Yes!!!!

Also from Hadoop World — Dave Menninger, now an analyst, reports on some Hadoop metrics:

How big is “big data”? In his opening remarks, Mike shared some statistics from a survey of attendees. The average Hadoop cluster among respondents was 66 nodes and 114 terabytes of data. However there is quite a range. The largest in the survey responses was a cluster of 1,300 nodes and more than 2 petabytes of data. (Presenters from eBay blew this away, describing their production cluster of 8,500 nodes and 16 petabytes of storage.) Over 60 percent of respondents had 10 terabytes or less, and half were running 10 nodes or less.

That eBay comment was particularly interesting.

A while back, Doug Henschen noted that Netezza flagship reference Catalina Marketing is now at 2.5 petabytes. Most of that is in one 600 billion row table. Oddly, the article talks of the Netezza/SAS partnership accelerating model-building via in-database scoring (not modeling) technology. Doug also wrote of a lot of analytic DBMS replacements, including:

Microsoft by ParAccel
Oracle by Aster Data, IBM, Oracle Exadata, probably Netezza, and probably Hadoop
Netezza by Greenplum
IBM by Teradata

Carl Olofson pointed out on Twitter that DataScaler was an in-memory database technology just bought by Oracle. This inspired me to google on them, and I found a sparse DataScaler CEO blog. I link it because of an amusing juxtaposition — the second-to-last post says, in effect, “We make appliances and we recommend all these awesome technology design partners who helped us design the hardware,” while the very last post says “Designing our own hardware was a mistake.”

Fred Holahan is now VP of Marketing at VoltDB, which is a lesson to me about giving free consulting … Anyhow, Fred tells me that VoltDB has about a dozen users on their way to production, some of whom are headed to being VoltDB paying customers, some of whom are not.

eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more

Curt Monash — Wed, 06 Oct 2010 13:21:00 +0000

I chatted with Oliver Ratzesberger of eBay around a Stanford picnic table yesterday (the XLDB 4 conference is being held at Jacek Becla’s home base of SLAC, which used to stand for “Stanford Linear Accelerator Center”). Todd Walter of Teradata also sat in on the latter part of the conversation. Things I learned included:

eBay has thrown out Greenplum. (Edit: As per the comments below, eBay wouldn’t endorse that wording itself.) eBay’s 6 ½ petabyte Greenplum database has turned into a >10 petabyte Teradata database, which will grow 2 1/2x further in size soon.
- Specifically, Oliver told me there are 8 petabytes of spinning disk, with 80% compression. So that’s 40 petabytes before you multiply by a reducing factor to cover mirroring, temp space, and so on. My low end for that factor would be 25-28%; my high end would be 35-40%; either way, we’re talking about >10 petabytes of true user data.
- The 8 petabytes of spinning disk are headed to 20 petabytes next year.
- Oliver gave the impression that Greenplum got thrown out more for reliability reasons than performance. (While eBay saw a major performance difference between Teradata and Greenplum, Oliver previously indicated he was inclined to attribute this more to specific Sun Thumper hardware/storage choices than to software.)
That database, called “Singularity,” has some interesting aspects. Notably, in a table that otherwise has dozens of conventional relational columns, there is a character field that’s a string of name-value pairs, on which you can do views and so on for virtual tables.
- The system ingests log data in the form of lots and lots of name-value pairs.
- The most commonly found ones go into columns in the usual way.
- The rest are strung together into, well, a character string.
- Teradata has developed some features for eBay that make it easier to index, query, etc. on that character string of name-value pairs.
eBay’s more EDW-like (Enterprise Data Warehouse) multi-petabyte Teradata database continues to grow, with the main system apparently up to 4 ½ petabytes from the previous 2 ½.
I took the opportunity to ask what kinds of data marts (virtual or otherwise) were spun out in practice.
- In Oliver’s ranking,
  - #1 was derived data based on other data already in the data warehouse.
  - #2 was other data within eBay that had never been put into the data warehouse in the first place.
  - #3 was data truly from outside eBay.
- Todd Walter chimed in to point out that at other Teradata customers who perhaps didn’t have as fully fleshed out an EDW, #1 and #2 could be reversed.
eBay sees Hadoop as an interesting tool for certain special purposes.
- eBay likes Hadoop for certain tasks such as image analysis. (Edit: And analysis of search results.)
- eBay doesn’t like Hadoop for anything that requires data movement, such as a join.
- Similarly, eBay doesn’t like HBase.
eBay is enamored of the idea of doing “social networking around analytics.”
- This is something that has been built but not rolled out yet.
- It seems more focused on actual business intelligence than on the underlying data, unlike Greenplum Chorus, which seems more focused on the databases themselves.
- Since it hasn’t been rolled out yet, we don’t know which (if any) of activity streams, forums, or whatever will actually get significant adoption.
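
Returning to the Singularity sizing figures above, the arithmetic can be checked with a short sketch. The function name is hypothetical, and I'm using only the extremes of the ranges quoted (25% and 40%) for the reducing factor.

```python
def user_data_range(disk_pb, compression, low_factor, high_factor):
    """Estimate true user data from raw spinning disk.

    disk_pb:     petabytes of spinning disk
    compression: fraction of space saved by compression (0.80 means 80%)
    *_factor:    share left after mirroring, temp space, and so on
    """
    uncompressed_pb = disk_pb / (1 - compression)
    return (uncompressed_pb,
            uncompressed_pb * low_factor,
            uncompressed_pb * high_factor)


uncompressed, low, high = user_data_range(8, 0.80, 0.25, 0.40)
growth = 20 / 8  # planned disk growth from 8 to 20 petabytes
```

That gives roughly 40 petabytes uncompressed, a range of about 10 to 16 petabytes of user data, and a growth multiple of 2.5, which matches the “2 1/2x” figure.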

Nested data structures keep coming up, especially for log files

Curt Monash — Sat, 31 Jul 2010 10:42:06 +0000

Nested data structures have come up several times now, almost always in the context of log files.

Google has published about a project called Dremel. Per Tasso Argyros, one of Dremel’s key concepts is nested data structures.
Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. eBay was very interested in that project too.
Facebook’s log files have a big nested data structure flavor.

I don’t have a grasp yet on what exactly is happening here, but it’s something.
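For readers who want a concrete picture of what “nested” means for a log record, here is a small sketch. The record and the `flatten` helper are invented for illustration, and this is not how Dremel, SciDB, or Facebook actually store anything. It only shows that a log event naturally contains repeated sub-records, and that naively flattening it into column paths loses the grouping between them.

```python
def flatten(value, prefix=""):
    """Yield (path, leaf) pairs for every leaf in a nested dict/list structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        # A repeated field: every element shares the same path.
        for child in value:
            yield from flatten(child, prefix)
    else:
        yield prefix, value

# Hypothetical log event with a repeated group of request parameters.
event = {
    "user": "u1",
    "request": {
        "url": "/search",
        "params": [{"k": "q", "v": "shoes"}, {"k": "page", "v": "2"}],
    },
}
rows = list(flatten(event))
```

After flattening, `request.params.k` and `request.params.v` each appear twice, and nothing in the output says which key belongs with which value. Keeping that association is the part a system with real nested-structure support has to handle.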

Cloudera Enterprise and Hadoop evolution

Curt Monash — Wed, 30 Jun 2010 17:22:27 +0000

I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I’d say:

If you are or want to be a serious MapReduce user, and you’re past the “play around over the weekend” stage, you probably should have either:
- A serious non-DBMS MapReduce distribution.
- MapReduce integrated into your analytic DBMS.
- Both.
The obvious choice for non-DBMS MapReduce is Hadoop.
The obvious choice for a Hadoop distribution is Cloudera Enterprise.
Cloudera Enterprise has three main aspects, in an inseparable bundle:
- Distributions for a double-digit number of open source projects. It’s nice having all that in one package – unless, of course, you like playing with Tinkertoys.
- Proprietary Cloudera code.
- Cloudera support.
Cloudera says its proprietary code is concentrated, and is planned to stay concentrated, at least in large part on integrating open source technology with closed source products. This has the virtue of being targeted directly at that segment of the market which has proven it’s actually willing to pay money for software.
Cloudera Enterprise areas of focus, now and in the presumed future, include:
- Core Hadoop engine, which Cloudera says is quite predictably and appropriately evolving more slowly than the tools around it.
- Development, management and administrative tools, including:
  - Pig and Hive. Cloudera says >70% of Facebook Hadoop jobs are initiated through Hive, and the same is true of Yahoo and Pig.
  - Connectivity to commercial tools.
  - The product formerly known as “Cloudera Desktop.”
- Workflow, which in this context refers to letting you create a Hadoop application as a sequence of small steps, rather than forcing you to kluge it into being one unwieldy thing. At the moment, this is much less widely adopted than Pig and Hive, but Cloudera has high hopes for it, because of its obvious benefits in modularity and manageability.
- Quasi-DBMS technology. Besides Hive and Pig, this includes HBase. Cloudera says there has been considerable demand for HBase, and it is pleased that the project is now mature enough to ship. Cloudera stresses that it intends HBase not for OLTP, but as an adjunct to analytic processing. E.g., Cloudera suggests HBase would be a fine vehicle for replicating dimension tables across each node of a cluster.
- Data connectivity, e.g. to MySQL or to sensor log files.
Cloudera Enterprise pricing is well below DBMS prices – not by a full order of magnitude, if I’m right about everybody’s quantity discount policies, but even so by a lot. Details are NDA.

Cloudera sometimes sends confusing signals about its beliefs and strategies. For example, one can get different stories depending on whether one talks to:

Somebody at Cloudera who comes primarily from the user and open source communities.
Somebody at Cloudera who has actually worked at a software company before.

But I predict that Cloudera will now stick for a while with more or less the strategy outlined above.

Naturally, we also talked about Hadoop adoption. Highlights of that part – no doubt somewhat biased towards Cloudera’s own customer base — included:

Notwithstanding eBay’s prior skepticism about MapReduce, it is quoted saying nice things in a Cloudera press release, and has apparently become quite a large Hadoop user, starting out with a search-quality use case.
Typical Hadoop deployment sizes are 10 nodes or so when experimenting, 80-500+ in production.
10 terabytes/node – I’m pretty sure Cloudera meant of user data — is not inconceivable, so a cost-conscious 500-node user could have 5 petabytes of data managed by Hadoop.
Cloudera has half a dozen customers at the 75+ node production level.
Web and financial services are the two vertical markets moving most aggressively into Hadoop production. The government is also in significant Hadoop production, but the details of that are classified.
Web uses for Hadoop include:
- Clickstream – sessionization, etc. – that’s a super-mainstream use.
- Search – analyzing search attempts in conjunction with structured data.
- Machine learning (for ad serving, etc.).
Financial services uses for Hadoop include:
- Internal trading rule enforcement/fraud detection.
- Complex ETL.
- Portfolio risk assessment (typically overnight).
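Since clickstream sessionization is cited above as the super-mainstream use, here is a minimal single-machine sketch of what it computes. The 30-minute inactivity gap and the `(user, timestamp)` input shape are assumptions, not anything Cloudera specified. In a MapReduce setting the sort-and-group step would roughly correspond to shuffling on the user key, with the inner loop running in the reducer.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP = 30 * 60  # assumed inactivity timeout, in seconds

def sessionize(clicks, gap=SESSION_GAP):
    """Group (user, timestamp) clicks into per-user sessions split by inactivity gaps."""
    sessions = []
    # Sort by user, then by time, so each user's clicks are contiguous and ordered.
    ordered = sorted(clicks, key=itemgetter(0, 1))
    for user, user_clicks in groupby(ordered, key=itemgetter(0)):
        current = []
        for _, ts in user_clicks:
            # A gap longer than the timeout closes the current session.
            if current and ts - current[-1] > gap:
                sessions.append((user, current))
                current = []
            current.append(ts)
        sessions.append((user, current))
    return sessions

clicks = [("a", 0), ("b", 100), ("a", 600), ("a", 4000), ("b", 150)]
result = sessionize(clicks)
```

Here user `a` ends up with two sessions, because the jump from 600 to 4000 seconds exceeds the assumed timeout, while user `b` has one.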

None of this is inconsistent with previous surveys of Hadoop use cases.

Various users talked at the Hadoop Summit this week. I wasn’t there, and won’t write about their stories for now. That said, Twitter’s slide deck from same has some interesting stuff, including:

7 TB/day ETLed from MySQL.
Petabytes-being-stored accordingly coming soon.
Open sourcing their ETL tool Crane.
3-4X LZO compression at little CPU cost.
HBase is more usable for them than HDFS, which isn’t mutable enough.
Pig = 5% of the code and coding effort of vanilla Hadoop, at a performance hit of 30% or less.