Discussion of eBay’s use of database and analytic technology. Related subjects include:
- The use of analytic technologies to study web and network event data
1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:
- Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
- Files have indexes and optional Bloom filters.
- Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.
2. The discussion of which kinds of data are originally put into HBase was a bit confusing.
- HBase is commonly used to receive machine-generated data. Everybody knows that.
- Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
- Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.
OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.
3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that: Read more
|Categories: Cloudera, eBay, Facebook, Hadoop, HBase, Market share and customer counts, NoSQL, Open source, Petabyte-scale data management, Specific users, Yahoo||4 Comments|
We all tend to assume that data is a great and glorious asset. How solid is this assumption?
- Yes, data is one of the most proprietary assets an enterprise can have. Any of the Goldman Sachs big three* — people, capital, and reputation — are easier to lose or imitate than data.
- In many cases, however, data’s value diminishes quickly.
- Determining the value derived from owning, analyzing and using data is often tricky — but not always. Examples where data’s value is pretty clear start with:
- Industries which long have had large data-gathering research budgets, in areas such as clinical trials or seismology.
- Industries that can calculate the return on mass marketing programs, such as internet advertising or its snail-mail predecessors.
*”Our assets are our people, capital and reputation. If any of these is ever diminished, the last is the most difficult to restore.” I love that motto, even if Goldman Sachs itself eventually stopped living up to it. If nothing else, my own business depends primarily on my reputation and information.
This all raises the idea – if you think data is so valuable, maybe you should get more of it. Areas in which enterprises have made significant and/or successful investments in data acquisition include: Read more
|Categories: Data mart outsourcing, eBay, Health care, Investment research and trading, Log analysis, Scientific research, Text, Web analytics||5 Comments|
“Data integration” can mean many different things, to an extent that’s impeding me from writing about the area. So I’ll start by simply laying out some of the myriad ways that data can be brought to where it is needed, and worry about other subjects later. Yes, this is a massive wall of text, and incomplete even so — but that in itself is my central point.
There are two main paradigms for data integration:
- Movement or replication — you take data from one place and copy it to another.
- Federation — you treat data in multiple different places logically as if it were all in one database.
Data movement and replication typically take one of three forms:
- Logical, transactional, or trigger-based — sending data across the wire every time an update happens, or as the result of a large-result-set query/extract, or in response to a specific request.
- Log-based — like logical replication, but driven by the transaction/update log rather than the core data management mechanism itself, so as to avoid directly overstressing the DBMS.
- Block/file-based — sending chunks of data, and expecting the target system to store them first and only make sense of them afterward.
Beyond the core functions of movement, replication, and/or federation, there are other concerns closely connected to data integration. These include:
- Transparency and emulation, e.g. via a layer of software that makes data in one format look like it’s in another. (If memory serves, this is the use case for which Larry DeBoever coined the term “middleware.”)
- Cleaning and quality — with new uses of data can come new requirements for accuracy.
- Master, reference, or canonical data –
- Archiving and information preservation — part of keeping data safe is ensuring that there are copies at various physical locations. Another part can be making it logically tamper-proof, or at least highly auditable.
In particular, the following are largely different from each other. Read more
|Categories: Clustering, Data integration and middleware, EAI, EII, ETL, ELT, ETLT, eBay, Hadoop, MapReduce||10 Comments|
QlikView 11 came out last month. Let me start by pointing out:
- As one might expect, QlikView 11 contains fairly leading-edge stuff, but also some “better late than never” features.
- The leading-edge stuff is concentrated in the general area of “collaboration”.
- Additionally, QlikTech is always pushing the QlikView user interface ahead in various ways.
- The “Well, it’s about time!” feature list starts with the ability to load QlikView via third-party ETL tools (Informatica now, others coming).
- QlikTech is generally good at putting up pretty pictures of its product. You can find some in the “What’s New in QlikView 11″ document via a general QlikView resource page.*
- Stephen Swoyer wrote a good article summarizing QlikView 11.
*One confusing aspect to that paper: non-standard uses of the terms “analytic app” and “document”.
As QlikTech tells it, QlikView 11 adds two kinds of collaboration features:
- Integration with social media, which QlikTech calls “asynchronous integration.”
- Direct sharing of the QlikView UI, which QlikTech calls “synchronous integration.”
I’d add a third kind, because QlikView 11 also takes some baby steps toward what I regard as a key aspect of BI collaboration — the ability to define and track your own metrics. It’s way, way short of what I called for in metric flexibility in a post last year, but at least it’s a small start.
As I’ve noted before, the very big web companies have an issue with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.
The explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of eBay’s Singularity project. In simplest terms, one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs, which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:
- All 50 search results you were shown, and their positions in the search rankings.
- Every ad, image, or graphical element.
- An ID as to which test you were participating in (every page you see on eBay has some element being tested).
*Edit: Oliver subsequently moved on to Sears and then Teradata.
There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all this information via joins would be brutally expensive. What’s more, reconstructing all this information via joins could be impractical. Some comes from third party ad servers, which might not reproduce the same ads upon demand. Other is in the form of rankings, which can’t always be reliably reproduced from one query to the next. (That’s just one of several reasons text search and relational DBMS are an awkward fit.)
Also, there’s a strong dynamic schema flavor to these databases. The list of attributes for one web click might be very different in kind from the list for the next page. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn’t necessarily make a lot of sense.
|Categories: Data models and architecture, Data warehousing, eBay, Log analysis, Web analytics||6 Comments|
1. EMC Greenplum has evolved its appliance product line. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says “Hooray! Our appliance is all-in-one!” Big whoop.
2. That said, the Hadoop part of EMC ‘s story is based on MapR, which so far as I can tell is actually a pretty good Hadoop implementation. More precisely, MapR makes strong claims about performance and so on, and Apache Hadoop folks don’t reply “MapR is full of &#$!” Rather, they say “We’re going to close the gap with MapR a lot faster than the MapR folks like to think — and by the way, guys, thanks for the butt-kick.” A lot more precision about MapR may be found in this M. C. Srivas SlideShare.
|Categories: Data warehouse appliances, eBay, EMC, Greenplum, Hadoop, MapR, MapReduce, Open source, Oracle||2 Comments|
A number of recent posts have had good comments. This time, I won’t call them out individually.
Evidently Mike Olson of Cloudera is still telling the machine-generated data story, exactly as he should be. The Information Arbitrage/IA Ventures folks said something similar, focusing specifically on “sensor data” …
… and, even better, went on to say: Read more
I chatted with Oliver Ratzesberger of eBay around a Stanford picnic table yesterday (the XLDB 4 conference is being held at Jacek Becla’s home base of SLAC, which used to stand for “Stanford Linear Accelerator Center”). Todd Walter of Teradata also sat in on the latter part of the conversation. Things I learned included: Read more
|Categories: Data warehousing, Derived data, eBay, Greenplum, Hadoop, HBase, Log analysis, Petabyte-scale data management, Teradata||30 Comments|
Nested data structures have come up several times now, almost always in the context of log files.
- Google has published about a project called Dremel. Per Tasso Agyros, one of Dremel’s key concepts is nested data structures.
- Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. eBay was very interested in that project too.
- Facebook’s log files have a big nested data structure flavor.
I don’t have a grasp yet on what exactly is happening here, but it’s something.
|Categories: eBay, Facebook, Google, Log analysis, Scientific research, Theory and architecture||7 Comments|
I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I’d say: Read more