Web analytics
Discussion of how data warehousing and analytic technologies are applied to clickstream analysis and other web analytics challenges. Related subjects include:
- The use of analytic technologies for logfile analysis
- (in Text Technologies) Online marketing
What those nested data structures are about
As I’ve noted before, the very big web companies have an issue with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.
The explanation was led by Oliver Ratzesberger, late of eBay*and progenitor of eBay’s Singularity project. In simplest terms, one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs, which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:
- All 50 search results you were shown, and their positions in the search rankings.
- Every ad, image, or graphical element.
- An ID as to which test you were participating in (every page you see on eBay has some element being tested).
*Oliver is leaving eBay for a still-secret large company. I would conjecture that Michael McIntire is on the move too, either to replace Oliver or to go with him, but Oliver did a very good job of not commenting on the matter.
There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all this information via joins would be brutally expensive. What’s more, reconstructing all this information via joins could be impractical. Some comes from third party ad servers, which might not reproduce the same ads upon demand. Other is in the form of rankings, which can’t always be reliably reproduced from one query to the next. (That’s just one of several reasons text search and relational DBMS are an awkward fit.)
Also, there’s a strong dynamic schema flavor to these databases. The list of attributes for one web click might be very different in kind from the list for the next page. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn’t necessarily make a lot of sense.
| Categories: Data models and architecture, Data warehousing, eBay, Log analysis, Web analytics | 2 Comments |
Hadoop notes
I visited California recently, and chatted with numerous companies involved in Hadoop — Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I’ll defer further Hadoop technical discussions for now — my target to restart them is later this month — but that still leaves some other issues to discuss, namely adoption and partnering.
The total number of enterprises in the world paying subscription and license fees that they would regard as being for “Hadoop or something Hadoop-related” probably is not much over 100 right now, but I’d expect to see pretty rapid growth. Beyond that, let’s divide customers into three groups:
- Internet businesses.
- Traditional enterprises ‘ internet operations.
- Traditional enterprises’ other operations.
Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of machine-generated data, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play — web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it’s one area of scientific research that actually enjoys fat for-profit research budgets.
| Categories: Cloudera, Hadoop, Health care, Hortonworks, Investment research and trading, Log analysis, MapR, MapReduce, Market share and customer counts, Scientific research, Web analytics | 5 Comments |
Aster Data business trends
Last month, I reviewed with the Aster Data folks which markets they were targeting and selling into, subsequent to acquisition by their new orange overlords. The answers aren’t what they used to be. Aster no longer focuses much on what it used to call frontline (i.e., low-latency, operational) applications; those are of course a key strength for Teradata. Rather, Aster focuses on investigative analytics — they’ve long endorsed my use of the term — and on the batch run/scoring kinds of applications that inform operational systems.
| Categories: Analytic technologies, Application areas, Aster Data, Data warehousing, DataStax, Liberty and privacy, RDF and graphs, Teradata, Web analytics | 1 Comment |
Couchbase business update
I decided I needed some Couchbase drilldown, on business and technology alike, so I had solid chats with both CEO Bob Wiederhold and Chief Architect Dustin Sallings. Pretty much everything I wrote at the time Membase and CouchOne merged to form Couchbase (the company) still holds up. But I have more detail now.
Context for any comments on customer traction includes:
- Membase went into limited production release in October, and full release in January. Similar things are true of CouchDB.
- Hence, most sales of Couchbase’s products have been made over the past 6 months.
- Couchbase (the merged product) is at this point only in a pre-production developer’s release.
- Couchbase has both a direct sales force and a classic open-source “funnel”-based online selling model. Naturally, Couchbase’s understanding of what its customers are doing is more solid with respect to the direct sales base.
- Most of Couchbase’s revenue to date seems to have come from a limited number of big-ticket “lighthouse” accounts (as opposed to, say, the larger number of smaller deals that come in through the online funnel).
That said,
- Most Membase purchases are for new applications, as opposed to memcached migrations. However, customers are the kinds of companies that probably also are using memcached elsewhere.
- Most other Membase purchases are replacements for the Membase/MySQL combination. Bob says those are easy sales with short sales cycles.
- Pure memcached support is a small but non-zero business for Couchbase, and a fine source of upsell opportunities.
- In the pipeline but not so much yet in the customer base are SaaS vendors and the like who use and may want to replace traditional DBMS such as Oracle. Other than among those, Couchbase doesn’t compete much yet with Oracle et al.
- Pure CouchDB isn’t all that much of a business, at least relative to community size, as CouchDB is a single-server product commonly used by people who are content not to pay for support.
Membase sales are concentrated in five kinds of internet-centric companies, which in declining order are: Read more
MongoDB users and use cases
I spoke with Eliot Horowitz and Max Schierson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren’t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100 node one we talked the most about had 33 replica sets, each with about 100 gigabytes of data, so that’s in the 3-4 terabyte range total. In general, the largest MongoDB databases are 20-30 TB; I’d guess those really do use the bulk of available disk space. Read more
| Categories: Data models and architecture, Games and virtual worlds, Log analysis, MongoDB and 10gen, NoSQL, Solid-state memory, Specific users, Splunk, Telecommunications, Web analytics | 12 Comments |
HBase is not broken
It turns out that my impression that HBase is broken was unfounded, in at least two ways. The smaller is that something wrong with the HBase/Hadoop interface or Hadoop’s HBase support cannot necessarily be said to be wrong with HBase (especially since HBase is no longer a Hadoop subproject). The bigger reason is that, according to consensus, HBase has worked pretty well since the .90 release in January of this year.
After Michael Stack of StumbleUpon beat me up for a while,* Omer Trajman of Cloudera was kind enough to walk me through HBase usage. He is informed largely by 18 Cloudera customers using, plus a handful of other well-known HBase users such as Facebook, StumbleUpon, and Yahoo. Of the 18 Cloudera customers using HBase that Omer was thinking of, 15 are in HBase production, one is in HBase “early production”, one is still doing R&D in the area of HBase, and one is a classified government customer not providing such details. Read more
| Categories: Cloudera, Derived data, Facebook, Hadoop, HBase, Log analysis, Market share and customer counts, Open source, Specific users, Web analytics | 6 Comments |
Petabyte-scale Hadoop clusters (dozens of them)
I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.
Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo’s latest stated figures are:
- 42,000 Hadoop nodes …
- … holding 180-200 petabytes of data.
Eight kinds of analytic database (Part 2)
In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear. Read more
Eight kinds of analytic database (Part 1)
Analytic data management technology has blossomed, leading to many questions along the lines of “So which products should I use for which category of problem?” The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for “big data” is little help.
Let’s try eight categories instead. While no categorization is ever perfect, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need — and in most cases you’ll need several — is a great early step in your analytic technology planning. Read more
Citrusleaf RTA
Citrusleaf has released an add-on product called Citrusleaf RTA (Real-Time Attribution). It’s to be used when:
- You want to update dashboards within a minute.
- You want to update predictive models fairly quickly (within the hour?), although it’s not clear to me how much the models are being updated or changed with that latency.
The metrics envisioned are:
- 100 or so ad impressions per person …
- … for 1 billion or so people …
- … stored for 30-90 days …
- … where each ad impression is a fairly short record …
- … stored on disk …
- … but indexed in a way so that the index can fit into RAM.
- 50-100,000 writes per second. (I didn’t ask on what amount of hardware.)
- Several hundred reads per second.
A consistent relational schema is NOT assumed.
Citrusleaf’s solution is:
- Have one index entry for each of the 1 billion people.
- Bang each new object/record to disk. Include in it a pointer to the previous object/record for the same person.
- Each time a new object/record is added, update the index in place so that it now points to the new once. Hence, the index is sized according to the number of people, not according to the total number of objects/records.
- Eventually let objects/records age off in the obvious way.
The downside is that when you do read 100 objects/records per person, you might need to do 100 seeks.
