What those nested data structures are about
As I’ve noted before, the very big web companies wrestle with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.
The explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of eBay’s Singularity project. In simplest terms, one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs, which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:
- All 50 search results you were shown, and their positions in the search rankings.
- Every ad, image, or graphical element.
- An ID as to which test you were participating in (every page you see on eBay has some element being tested).
*Edit: Oliver subsequently moved on to Sears and then Teradata.
There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all of it via joins would be brutally expensive. What’s more, such reconstruction could be impractical or even impossible: some of the information comes from third-party ad servers, which might not reproduce the same ads upon demand, while other information is in the form of rankings, which can’t always be reliably reproduced from one query to the next. (That’s just one of several reasons text search and relational DBMS are an awkward fit.)
Also, there’s a strong dynamic schema flavor to these databases. The list of attributes for one page view might be very different in kind from the list for the next. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn’t necessarily make a lot of sense.
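To make that concrete, here is a minimal sketch of what a single nested event record might look like. The field names are invented for illustration; they aren’t eBay’s actual schema.

```python
# A hypothetical page-view event stored as one nested record, rather than
# normalized across many relational tables. All names here are made up.
page_view_event = {
    "event_id": "pv-000123",
    "timestamp": "2011-10-19T12:34:56Z",
    "test_cell": "search-ranking-experiment-7",  # which A/B test this page was part of
    "search_results": [                          # all 50 results and their positions
        {"position": 1, "item_id": "item-111", "score": 0.97},
        {"position": 2, "item_id": "item-222", "score": 0.95},
        # ... and so on through position 50
    ],
    "ads": [                                     # served by third parties, so not
        {"slot": "top", "ad_id": "ad-789"},      # necessarily reproducible on demand
    ],
    "extra_attributes": {                        # free-form name-value pairs; the set of
        "browser": "Firefox 7",                  # names varies from one event to the next
        "page_load_ms": 412,
    },
}
```

Everything about the event lands in one place, which is exactly what the join-avoidance argument above is about.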
Oracle is buying Endeca
Oracle is buying Endeca. The official talking points for the deal aren’t a perfect match for Endeca’s actual technology, but so be it.
In an earlier post about Endeca, I wrote:
… the Endeca paradigm is really to help you make your way through a structured database, where different portions of the database have different structures. Thus, at various points in your journey, it automagically provides you a list of choices as to where you could go next.
That kind of thing could help Oracle with apps like the wireless telco product catalog deal MongoDB got.
Going back to the Endeca-post quote well, Endeca itself said:
Inside the MDEX Engine there is no overarching schema; each data record carries its own metadata. This enables the rapid combination of a wide range of structured and unstructured content into Latitude’s unified data model. Once inside, the MDEX Engine derives common dimensions and metrics from the available metadata, instantly exposing each for high-performance refinement and analysis in the Discovery Framework. Have a new data source? Simply add it and the MDEX Engine will create new relationships where possible. Changes in source data schema? No problem, adjustments on the fly are easy.
And I pointed out that the MDEX engine was a columnar DBMS.
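To illustrate the “no overarching schema” idea from that quote, here is a toy sketch (not Endeca’s actual engine or API) of discovering which attributes a set of self-describing records happen to share:

```python
# Toy illustration of deriving "common dimensions" from self-describing records.
# This is not the MDEX engine or its API; it only shows the general idea that each
# record carries its own attributes, and shared ones can be discovered after the fact.
records = [
    {"type": "camera", "brand": "Nikon", "price": 499, "megapixels": 16},
    {"type": "book", "title": "Moby-Dick", "price": 12, "author": "Melville"},
    {"type": "camera", "brand": "Canon", "price": 650, "megapixels": 18},
]

# Attributes present in every record can be exposed as common dimensions ...
common_dimensions = set(records[0]).intersection(*records[1:])
print(common_dimensions)  # {'type', 'price'}

# ... while everything else stays available as record-specific metadata.
for record in records:
    specific = {k: v for k, v in record.items() if k not in common_dimensions}
    print(record["type"], specific)
```

Add a new data source with new attribute names, and the intersection simply shifts; nothing resembling a schema migration is required.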
Meanwhile, Oracle’s own columnar DBMS efforts have been disappointing. Endeca could be an intended answer to that. However, while Oracle’s track record with standalone DBMS acquisitions is admirable (DEC Rdb, MySQL, etc.), Oracle’s track record of integrating DBMS acquisitions into the Oracle product itself is not so good. (Express? Essbase? The text product line? None of that has gone particularly well.)
So while I would expect Endeca’s flagship e-commerce shopping engine products to flourish under Oracle’s ownership, I would be cautious about the integration of Endeca’s core technology into the Oracle product line.
Vertica Community Edition
The press release announcing Vertica’s Community Edition is a bit vague. And indeed, much of what I know about Vertica Community Edition is along the lines of “This is what I think will happen, but of course it could still change.” That said, I believe:
- Vertica Community Edition has all of regular Vertica’s features. However …
- … HP Vertica reserves the right to open a feature gap in future releases.
- The license restriction on Vertica Community Edition is that you’re limited to 1 terabyte of data, and 3 nodes. I imagine that’s for one production copy, and you’re perfectly free to also set up mirrors for test, development, disaster recovery, and so on. However …
- … HP Vertica would be annoyed if you stuck a free copy of Vertica on each of 50 nodes and managed the whole thing via, say, Hadapt.
- HP Vertica plans to be very generous with true academic researchers, suspending or waiving limits on database size and node count. Not coincidentally, Vertica Community Edition is being announced at XLDB, where Vertica is also a top-level sponsor. (I introduced Vertica and XLDB’s Jacek Becla to each other as soon as I heard about Vertica’s Community Edition plans.)
- The only support available for Vertica Community Edition is through forums. This could change.
I’m a big supporter of the Vertica Community Edition idea, for four reasons:
- It should now be easier to download and evaluate Vertica.
- Vertica Community Edition could be a big help to academic researchers.
- Vertica could now be more appealing to some of the “Omigod, we’re outgrowing Oracle Standard Edition and we don’t want to pay up for Oracle Enterprise Edition/Exadata” crowd.
- People are under the impression that what Vertica actually charges today still resembles its long-ago list prices. This announcement may help puncture that outdated image of Vertica’s pricing.
Commercial software for academic use
As Jacek Becla explained:
- Academic scientists like their software to be open source, for reasons that include both free-like-speech and free-like-beer.
- What’s more, they like their software to be dead-simple to administer and use, since they often lack the dedicated human resources for anything else.
Even so, I think that academic researchers, in the natural and social sciences alike, commonly overlook the wealth of commercial software that could help them in their efforts.
I further think that the commercial software industry could do a better job of exposing its work to academics, where by “expose” I mean:
- Give your stuff to academics for free.
- Call their attention to your free offering.
Reasons to do so include:
- Public benefit. Scientific research is important.
- Training future customers. There’s huge academic/commercial crossover, especially as students join the for-profit workforce.
Compression in Sybase ASE 15.7
Sybase recently came out with Adaptive Server Enterprise 15.7, which is essentially the “Make SAP happy” release. Features that were slated for 2012 release, but which SAP wanted, were accelerated into 2011. Features that weren’t slated for 2012, but which SAP wanted, were also brought into 2011. Not coincidentally, SAP Business Suite will soon run on Sybase Adaptive Server Enterprise 15.7.
15.7 turns out to be the first release of Sybase ASE with data compression. Sybase fondly believes that it is matching DB2 and leapfrogging Oracle in compression rate with a single compression scheme, namely page-level tokenization. More precisely, SAP and Sybase seem to believe that about compression rates for actual SAP application databases, based on some degree of testing.
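For readers unfamiliar with the term, here is a deliberately simplified sketch of what page-level tokenization amounts to in general: distinct values within a data page go into a per-page dictionary, and each occurrence is replaced by a short token. This is only a generic illustration, not Sybase’s actual on-disk format.

```python
# Generic sketch of page-level tokenization (dictionary compression within a page).
# Not Sybase ASE's actual implementation; just the general idea.

def compress_page(rows):
    """Replace each distinct value on the page with a small integer token."""
    dictionary = []            # per-page dictionary: token -> value
    value_to_token = {}
    encoded_rows = []
    for row in rows:
        encoded_row = []
        for value in row:
            if value not in value_to_token:
                value_to_token[value] = len(dictionary)
                dictionary.append(value)
            encoded_row.append(value_to_token[value])
        encoded_rows.append(encoded_row)
    return dictionary, encoded_rows

def decompress_page(dictionary, encoded_rows):
    return [[dictionary[token] for token in row] for row in encoded_rows]

# Repetitive values (country, status) compress well; each repeat costs only a token.
page = [
    ("Germany", "shipped", "2011-10-01"),
    ("Germany", "shipped", "2011-10-02"),
    ("France",  "shipped", "2011-10-02"),
]
dictionary, encoded = compress_page(page)
assert decompress_page(dictionary, encoded) == [list(row) for row in page]
```

Highly repetitive columns compress well under this scheme, which presumably is part of why the tested compression rates on SAP application databases look good.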
IBM is buying parallelization expert Platform Computing
IBM is acquiring Platform Computing, a company with which I had one briefing last August. Quick background includes:
Text data management, Part 3: Analytic and progressively enhanced
This is Part 3 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
I’ve gone on for two long posts about text data management already, but even so I’ve glossed over a major point:
Using text data commonly involves a long series of data enhancement steps.
Even before you do what we’d normally think of as “analysis”, text markup can include steps such as:
- Figure out where the words break.
- Figure out where the clauses and sentences break.
- Figure out where the paragraphs, sections, and chapters break.
- (Where necessary) map the words to similar ones — spelling correction, stemming, etc.
- Figure out which words are grammatically which parts of speech.
- Figure out which pronouns and so on refer to which other words. (Technical term: Anaphora resolution.)
- Figure out what was being said, one clause at a time.
- Figure out the emotion — or “sentiment” — associated with it.
Those processes can add up to dozens of steps. And maybe, six months down the road, you’ll think of more steps yet.
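To give a feel for how the steps chain together, here is a deliberately naive sketch of a few of them in sequence. Real systems would use proper NLP libraries; the regexes and word lists below are placeholders, not a recommendation.

```python
import re

# A deliberately naive text-enhancement pipeline, just to show how steps chain.
# Real systems would use proper NLP toolkits; these rules are placeholders.

def split_sentences(text):
    # Step: figure out where the sentences break (very crudely).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Step: figure out where the words break.
    return re.findall(r"[a-z']+", sentence.lower())

def stem(word):
    # Step: map words to similar ones (a toy stand-in for real stemming).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

POSITIVE = {"great", "love", "fast"}
NEGATIVE = {"terrible", "slow", "broken"}

def sentiment(tokens):
    # Step: figure out the emotion ("sentiment") associated with the clause.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

text = "Shipping was terribly slow. But I love the product!"
for sentence in split_sentences(text):
    tokens = [stem(t) for t in tokenize(sentence)]
    print(sentence, "->", tokens, sentiment(tokens))
```

Each step consumes the previous step’s output, which is why the enhancement pipeline keeps getting longer as you think of new analyses to run.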
Text data management, Part 2: General and short-request
This is Part 2 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
I’ve recently given widely varied advice about managing text (and similar files — images and so on), ranging from
Sure, just keep going with your old strategy of keeping .PDFs in the file system and pointing to them from the relational database. That’s an easy performance optimization vs. having the RDBMS manage them as BLOBs.
to
I suspect MongoDB isn’t heavyweight enough for your document management needs, let alone just dumping everything into Hadoop. Why don’t you take a look at MarkLogic?
Here are some reasons why.
There are three basic kinds of text management use case:
- Text as payload.
- Text as search parameter.
- Text as analytic input.
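As a concrete illustration of the “text as payload” case, and of the file-system-plus-pointer advice quoted above, here is a minimal sketch using SQLite. The table and column names are invented for the example.

```python
import sqlite3

# Minimal sketch of "text as payload": keep the documents themselves in the file
# system and store only metadata plus a pointer (the path) in the RDBMS, rather
# than stuffing each PDF into a BLOB column. Table and column names are invented.
conn = sqlite3.connect("documents.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        doc_id     INTEGER PRIMARY KEY,
        title      TEXT,
        mime_type  TEXT,
        file_path  TEXT        -- pointer to the payload on the file system
    )
""")

conn.execute(
    "INSERT INTO documents (title, mime_type, file_path) VALUES (?, ?, ?)",
    ("Q3 contract", "application/pdf", "/var/docs/contracts/q3_contract.pdf"),
)
conn.commit()

# The database answers metadata queries; the file system serves the bytes.
for title, path in conn.execute("SELECT title, file_path FROM documents"):
    print(title, "->", path)
```

This division of labor is the performance optimization alluded to in the quote: the RDBMS never has to shuttle large document BLOBs through its own storage engine.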
Text data management, Part 1: Confusion
This is Part 1 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:
- The terminology around text data is inaccurate.
- Data volume estimates for text are misleading.
- Multiple different technologies are in the mix, including:
- Enterprise text search.
- Text analytics — text mining, sentiment analysis, etc.
- Document stores — e.g. document-oriented NoSQL, or MarkLogic.
- Log management and parsing — e.g. Splunk.
- Text archiving — e.g., various specialty email archiving products I couldn’t even name.
- Public web search — Google et al.
- Text search vendors have disappointed, especially technically.
- Text analytics vendors have disappointed, especially financially.
- Other analytic technology vendors ignore what the text analytic vendors actually have accomplished, and reinvent inferior wheels rather than OEM the state of the art.
Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.
There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include:
Cloudera versus Hortonworks
A few weeks ago I wrote:
The other big part of Hortonworks’ story is the claim that it holds the axe in Apache Hadoop development.
and
… just how dominant Hortonworks really is in core Hadoop development is a bit unclear. Meanwhile, Cloudera people seem to be leading a number of Hadoop companion or sub-projects, including the first two I can think of that relate to Hadoop integration or connectivity, namely Sqoop and Flume. So I’m not persuaded that the “we know this stuff better” part of the Hortonworks partnering story really holds up.
Now Mike Olson — CEO of my client Cloudera — has posted his analysis of the matter, in response to an earlier Hortonworks post asserting its claims. In essence, Mike argues:
- It’s ridiculous to say any one company, e.g. Hortonworks, has a controlling position in Hadoop development.
- Such diversity is a Very Good Thing.
- Cloudera folks now contribute and always have contributed to Hadoop at a higher rate than Hortonworks folks.
- If you consider just core Hadoop projects — the most favorable way of counting from a Hortonworks standpoint — Hortonworks has a lead, but not all that big of one.