October 10, 2011

Text data management, Part 1: Confusion

This is Part 1 of a three post series. The posts cover:

  1. Confusion about text data management.
  2. Choices for text data management (general and short-request).
  3. Choices for text data management (analytic).

There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seems to include:

Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.

There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: Read more

July 27, 2011

MongoDB users and use cases

I spoke with Eliot Horowitz and Max Schierson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren’t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100 node one we talked the most about had 33 replica sets, each with about 100 gigabytes of data, so that’s in the 3-4 terabyte range total. In general, the largest MongoDB databases are 20-30 TB; I’d guess those really do use the bulk of available disk space.   Read more

July 26, 2011

Remote machine-generated data

I refer often to machine-generated data, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a big bit bucket before you try to do much analysis on it. The term has caught on, to the point that perhaps it’s time to distinguish more carefully among different kinds of machine-generated data. In particular, I think it may be useful to distinguish between:

Here’s what I’m thinking of for the second category. I rather frequently hear of cases in which data is generated by large numbers of remote machines, which occasionally send messages home. For example:  Read more

June 4, 2011

Dirty data, stored dirt cheap

A major driver of Hadoop adoption is the “big bit bucket” use case. Users take a whole lot of data, often machine-generated data in logs of different kinds, and dump it into one place, managed by Hadoop, at open-source pricing. Hadoop hardware doesn’t need to be that costly either. And once you get that data into Hadoop, there are a whole lot of things you can do with it.

Of course, there are various outfits who’d like to sell you not-so-cheap bit buckets. Contending technologies include Hadoop appliances (which I don’t believe in), Splunk (which in many use cases I do), and MarkLogic (ditto, but often the cases are different from Splunk’s). Cloudera and IBM, among other vendors, would also like to sell you some proprietary software to go with your standard Apache Hadoop code.

So the question arises — why would you want to spend serious money to look after your low-value data? The answer, of course, is that maybe your log data isn’t so low-value. Read more

May 15, 2011

What to do about “unstructured data”

We hear much these days about unstructured or semi-structured (as opposed to) structured data. Those are misnomers, however, for at least two reasons. First, it’s not really the data that people think is un-, semi-, or fully structured; it’s databases.* Relational databases are highly structured, but the data within them is unstructured — just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.

*Here I’m using the term “database” literally, rather than as a concise synonym for “database management system”. But see below.

Second, a more accurate distinction is not whether a database has one structure or none – it’s whether a database has one structure or many. The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.

One small terminological problem is easily handled, namely that people don’t talk about true databases very often, at least when they’re discussing generalities; rather, they talk about data and DBMS.* So let’s talk of DBMS being “structured” singly or multiply or whatever, just as the databases they’re designed to manage are.

*And they refer to the DBMS as “databases,” because they don’t have much other use for the word.

All that said — I think that single vs. multiple database structures isn’t a bright-line binary distinction; rather, it’s a spectrum. For example:  Read more

February 28, 2011

Updating our vendor client disclosures

Edit: This disclosure has been superseded by a March, 2012 version.

From time to time, I disclose our vendor client lists. Another iteration is below. To be clear:

With that said, our vendor client disclosures at this time are:

Read more

December 30, 2009

Clearing up MapReduce confusion, yet again

I’m frustrated by a constant need — or at least urge :) — to correct myths and errors about MapReduce. Let’s try one more time: Read more

October 18, 2009

Technical introduction to Splunk

As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it’s probably OK to assume they’re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used for general schema-free analytics, but that’s in early days at best.

Splunk’s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include: Read more

October 18, 2009

General introduction to Splunk

I dropped by log analysis software vendor Splunk a few weeks ago for a chat with Marketing VP Steve Sommer (who some you may know from Cognos and/or Informix), Product Management VP Christina Noren, and above all co-founder/CTO Erik Swan. Splunk turns out to be a pretty interesting company, from both business and technical standpoints. For one thing, Splunk seems highly regarded by most people I mention it to.

Splunk’s technical stories include:

More on those in a separate post.

Less technical Splunk highlights include: Read more

← Previous Page

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:


Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.