IBM is buying parallelization expert Platform Computing
IBM is acquiring Platform Computing, a company with which I had one briefing, last August. Quick background includes: Read more
| Categories: Hadoop, IBM and DB2, Investment research and trading, MapReduce, Parallelization, Scientific research | 5 Comments |
Text data management, Part 3: Analytic and progressively enhanced
This is Part 3 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
I’ve gone on for two long posts about text data management already, but even so I’ve glossed over a major point:
Using text data commonly involves a long series of data enhancement steps.
Even before you do what we’d normally think of as “analysis”, text markup can include steps such as:
- Figure out where the words break.
- Figure out where the clauses and sentences break.
- Figure out where the paragraphs, sections, and chapters break.
- (Where necessary) map the words to similar ones — spelling correction, stemming, etc.
- Figure out which words are grammatically which parts of speech.
- Figure out which pronouns and so on refer to which other words. (Technical term: Anaphora resolution.)
- Figure out what was being said, one clause at a time.
- Figure out the emotion — or “sentiment” — associated with it.
Those processes can add up to dozens of steps. And maybe, six months down the road, you’ll think of more steps yet.
| Categories: Data warehousing, Hadoop, NoSQL, Text | 4 Comments |
Text data management, Part 2: General and short-request
This is Part 2 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
I’ve recently given widely varied advice about managing text (and similar files — images and so on), ranging from
Sure, just keep going with your old strategy of keeping .PDFs in the file system and pointing to them from the relational database. That’s an easy performance optimization vs. having the RDBMS manage them as BLOBs.
to
I suspect MongoDB isn’t heavyweight enough for your document management needs, let alone just dumping everything into Hadoop. Why don’t you take a look at MarkLogic?
Here are some reasons why.
There are three basic kinds of text management use case:
- Text as payload.
- Text as search parameter.
- Text as analytic input.
| Categories: MarkLogic, NoSQL, Text | 5 Comments |
Text data management, Part 1: Confusion
This is Part 1 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seems to include:
- The terminology around text data is inaccurate.
- Data volume estimates for text are misleading.
- Multiple different technologies are in the mix, including:
- Enterprise text search.
- Text analytics — text mining, sentiment analysis, etc.
- Document stores — e.g. document-oriented NoSQL, or MarkLogic.
- Log management and parsing — e.g. Splunk.
- Text archiving — e.g., various specialty email archiving products I couldn’t even name.
- Public web search — Google et al.
- Text search vendors have disappointed, especially technically.
- Text analytics vendors have disappointed, especially financially.
- Other analytic technology vendors ignore what the text analytic vendors actually have accomplished, and reinvent inferior wheels rather than OEM the state of the art.
Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.
There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: Read more
| Categories: Analytic technologies, Archiving and information preservation, Google, Log analysis, MarkLogic, NoSQL, Oracle, Splunk, Text | 2 Comments |
Cloudera versus Hortonworks
A few weeks ago I wrote:
The other big part of Hortonworks’ story is the claim that it holds the axe in Apache Hadoop development.
and
… just how dominant Hortonworks really is in core Hadoop development is a bit unclear. Meanwhile, Cloudera people seem to be leading a number of Hadoop companion or sub-projects, including the first two I can think of that relate to Hadoop integration or connectivity, namely Sqoop and Flume. So I’m not persuaded that the “we know this stuff better” part of the Hortonworks partnering story really holds up.
Now Mike Olson — CEO of my client Cloudera — has posted his analysis of the matter, in response to an earlier Hortonworks post asserting its claims. In essence, Mike argues:
- It’s ridiculous to say any one company, e.g. Hortonworks, has a controlling position in Hadoop development.
- Such diversity is a Very Good Thing.
- Cloudera folks now contribute and always have contributed to Hadoop at a higher rate than Hortonworks folks.
- If you consider just core Hadoop projects — the most favorable way of counting from a Hadoop standpoint — Hortonworks has a lead, but not all that big of one.
| Categories: Cloudera, Hadoop, Hortonworks, MapReduce, Open source | 6 Comments |
Teradata Unity and the idea of active-active data warehouse replication
Teradata is having its annual conference, Teradata Partners, at the same time as Oracle OpenWorld this week. That made it an easy decision for Teradata to preannounce its big news, Teradata Columnar and the rest of Teradata 14. But of course it held some stuff back, notably Teradata Unity, which is the name chosen for replication technology based on Teradata’s Xkoto acquisition.
The core mission of Teradata Unity is asynchronous, near-real-time replication across Teradata systems. The point of “asynchronous” is performance. The point of “near-real-time” is that it Teradata Unity can be used for high availability and disaster recovery, and further can be used to allow real work on HA and DR database copies. Teradata Unity works request-at-a-time, which limits performance somewhat;* Unity has a lock manager that makes sure updates are applied in the same order on all copies, in cases where locks are needed at all.
| Categories: Data warehousing, Teradata | 2 Comments |
Defining NoSQL
A reporter tweeted: “Is there a simple plain English definition for NoSQL?” After reminding him of my cynical yet accurate Third Law of Commercial Semantics, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what’s below; the rest is commentary added for this post.
NoSQL is most easily defined by what it excludes: SQL, joins, strong analytic alternatives to those, and some forms of database integrity. If you leave all four out, and you have a strong scale-out story, you’re in the NoSQL mainstream. Read more
