October 11, 2011

IBM is buying parallelization expert Platform Computing

IBM is acquiring Platform Computing, a company with which I had one briefing, last August. Quick background includes: Read more

Categories: Hadoop, IBM and DB2, Investment research and trading, MapReduce, Parallelization, Scientific research

5 Comments

October 10, 2011

Text data management, Part 3: Analytic and progressively enhanced

This is Part 3 of a three post series. The posts cover:

Confusion about text data management.

Choices for text data management (general and short-request).

Choices for text data management (analytic).

I’ve gone on for two long posts about text data management already, but even so I’ve glossed over a major point:

Using text data commonly involves a long series of data enhancement steps.

Even before you do what we’d normally think of as “analysis”, text markup can include steps such as:

Figure out where the words break.
Figure out where the clauses and sentences break.
Figure out where the paragraphs, sections, and chapters break.
(Where necessary) map the words to similar ones — spelling correction, stemming, etc.
Figure out which words are grammatically which parts of speech.
Figure out which pronouns and so on refer to which other words. (Technical term: Anaphora resolution.)
Figure out what was being said, one clause at a time.
Figure out the emotion — or “sentiment” — associated with it.

Those processes can add up to dozens of steps. And maybe, six months down the road, you’ll think of more steps yet.

Categories: Data warehousing, Hadoop, NoSQL, Text

4 Comments

October 10, 2011

Text data management, Part 2: General and short-request

This is Part 2 of a three post series. The posts cover:

I’ve recently given widely varied advice about managing text (and similar files — images and so on), ranging from

Sure, just keep going with your old strategy of keeping .PDFs in the file system and pointing to them from the relational database. That’s an easy performance optimization vs. having the RDBMS manage them as BLOBs.

I suspect MongoDB isn’t heavyweight enough for your document management needs, let alone just dumping everything into Hadoop. Why don’t you take a look at MarkLogic?

Here are some reasons why.

There are three basic kinds of text management use case:

Text as payload.
Text as search parameter.
Text as analytic input.

Categories: MarkLogic, NoSQL, Text

5 Comments

October 10, 2011

Text data management, Part 1: Confusion

This is Part 1 of a three post series. The posts cover:

There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seems to include:

The terminology around text data is inaccurate.
Data volume estimates for text are misleading.
Multiple different technologies are in the mix, including:
- Enterprise text search.
- Text analytics — text mining, sentiment analysis, etc.
- Document stores — e.g. document-oriented NoSQL, or MarkLogic.
- Log management and parsing — e.g. Splunk.
- Text archiving — e.g., various specialty email archiving products I couldn’t even name.
- Public web search — Google et al.
Text search vendors have disappointed, especially technically.
Text analytics vendors have disappointed, especially financially.
Other analytic technology vendors ignore what the text analytic vendors actually have accomplished, and reinvent inferior wheels rather than OEM the state of the art.

Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.

There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: Read more

Categories: Analytic technologies, Archiving and information preservation, Google, Log analysis, MarkLogic, NoSQL, Oracle, Splunk, Text

2 Comments

October 4, 2011

Cloudera versus Hortonworks

A few weeks ago I wrote:

The other big part of Hortonworks’ story is the claim that it holds the axe in Apache Hadoop development.

and

… just how dominant Hortonworks really is in core Hadoop development is a bit unclear. Meanwhile, Cloudera people seem to be leading a number of Hadoop companion or sub-projects, including the first two I can think of that relate to Hadoop integration or connectivity, namely Sqoop and Flume. So I’m not persuaded that the “we know this stuff better” part of the Hortonworks partnering story really holds up.

Now Mike Olson — CEO of my client Cloudera — has posted his analysis of the matter, in response to an earlier Hortonworks post asserting its claims. In essence, Mike argues:

It’s ridiculous to say any one company, e.g. Hortonworks, has a controlling position in Hadoop development.
Such diversity is a Very Good Thing.
Cloudera folks now contribute and always have contributed to Hadoop at a higher rate than Hortonworks folks.
If you consider just core Hadoop projects — the most favorable way of counting from a Hadoop standpoint — Hortonworks has a lead, but not all that big of one.

Categories: Cloudera, Hadoop, Hortonworks, MapReduce, Open source

6 Comments

October 3, 2011

Teradata Unity and the idea of active-active data warehouse replication

Teradata is having its annual conference, Teradata Partners, at the same time as Oracle OpenWorld this week. That made it an easy decision for Teradata to preannounce its big news, Teradata Columnar and the rest of Teradata 14. But of course it held some stuff back, notably Teradata Unity, which is the name chosen for replication technology based on Teradata’s Xkoto acquisition.

The core mission of Teradata Unity is asynchronous, near-real-time replication across Teradata systems. The point of “asynchronous” is performance. The point of “near-real-time” is that it Teradata Unity can be used for high availability and disaster recovery, and further can be used to allow real work on HA and DR database copies. Teradata Unity works request-at-a-time, which limits performance somewhat;* Unity has a lock manager that makes sure updates are applied in the same order on all copies, in cases where locks are needed at all.

Categories: Data warehousing, Teradata

2 Comments

October 2, 2011

Defining NoSQL

A reporter tweeted: “Is there a simple plain English definition for NoSQL?” After reminding him of my cynical yet accurate Third Law of Commercial Semantics, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what’s below; the rest is commentary added for this post.

NoSQL is most easily defined by what it excludes: SQL, joins, strong analytic alternatives to those, and some forms of database integrity. If you leave all four out, and you have a strong scale-out story, you’re in the NoSQL mainstream. Read more

Categories: Cassandra, dbShards and CodeFutures, MarkLogic, MySQL, Object, Open source, Petabyte-scale data management, Schooner Information Technology

7 Comments

← Previous Page

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

IBM is buying parallelization expert Platform Computing

Text data management, Part 3: Analytic and progressively enhanced

Text data management, Part 2: General and short-request

Text data management, Part 1: Confusion

Cloudera versus Hortonworks

Teradata Unity and the idea of active-active data warehouse replication

Defining NoSQL

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin