Hadoop

Discussion of Hadoop. Related subjects include:

MapReduce
Open source database management systems

January 8, 2012

Big data terminology and positioning

Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions:

Bigness — Volume, Velocity, Size
Structure — Variety, Variability, Complexity

given that

High-velocity “big data” problems are usually high-volume as well.*
Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction.

But the conflation should stop there.

*Low-volume/high-velocity problems are commonly referred to as “event processing” and/or “streaming”.

When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2×2 matrix of possibilities. For want of better alternatives, my suggestions are:

Relational big data is data of high volume that fits well into a relational DBMS.
Multi-structured big data is data of high volume that doesn’t fit well into a relational DBMS. Alternative: Poly-structured big data.
Conventional relational data is data of not-so-high volume that fits well into a relational DBMS. Alternatives: Ordinary/normal/smaller relational data.
Smaller poly-structured data is data for which dynamic schema capabilities are important, but which doesn’t rise to “big data” volume.
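
The 2×2 matrix above can be written out as a tiny lookup. This is only an illustrative sketch: the `Dataset` type, its two boolean fields, and the `big_data_term` function are hypothetical names invented here, and reducing "bigness" and "structure" to yes/no flags is a simplifying assumption.

```python
from typing import NamedTuple


class Dataset(NamedTuple):
    """Hypothetical description of a dataset along the two quasi-dimensions."""
    high_volume: bool      # the "bigness" axis
    fits_relational: bool  # the "structure" axis


def big_data_term(ds: Dataset) -> str:
    """Map a point in the 2x2 matrix to the suggested terminology."""
    if ds.high_volume:
        if ds.fits_relational:
            return "relational big data"
        return "multi-structured big data"
    if ds.fits_relational:
        return "conventional relational data"
    return "smaller poly-structured data"


if __name__ == "__main__":
    # Walk all four cells of the matrix.
    for volume in (True, False):
        for relational in (True, False):
            ds = Dataset(high_volume=volume, fits_relational=relational)
            print(ds, "->", big_data_term(ds))
```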

Categories: Cassandra, Data models and architecture, Data warehousing, Exadata, Facebook, Google, Hadoop, HBase, Log analysis, Market share and customer counts, MarkLogic, NewSQL, NoSQL, Oracle, Splunk, Yahoo

10 Comments

November 21, 2011

Some big-vendor execution questions, and why they matter

When I drafted a list of key analytics-sector issues in honor of look-ahead season, the first item was “execution of various big vendors’ ambitious initiatives”. By “execute” I mean mainly:

“Deliver products that really meet customers’ desires and needs.”
“Successfully convince them that you’re doing so …”
“… at an attractive overall cost.”

Vendors mentioned here are Oracle, SAP, HP, and IBM. Anybody smaller got left out due to the length of this post. Among the bigger omissions were:

salesforce.com (multiple subjects).
SAS HPA.
The evolution of Hadoop.

Categories: Business intelligence, Cognos, Columnar database management, Data warehouse appliances, Data warehousing, Exadata, Hadoop, HP and Neoview, IBM and DB2, In-memory DBMS, Investment research and trading, Memory-centric data management, Netezza, NoSQL, Oracle, SAP AG, Vertica Systems

2 Comments

November 8, 2011

Hadapt is moving forward

I’ve talked with my clients at Hadapt a couple of times recently. News highlights include:

The Hadapt 1.0 product is going “Early Access” today.
General availability of Hadapt 1.0 is targeted for an officially unspecified time frame, but it’s soon.
Hadapt raised a nice round of venture capital.
Hadapt added Sharmila Mulligan to the board.
Dave Kellogg is in the picture too, albeit not as involved as Sharmila.
Hadapt has moved the company to Cambridge, which is preferable to Yale environs for obvious reasons. (First location = space they’re borrowing from their investors at Bessemer.)
Headcount is in the low teens, with a target of doubling fast.

The Hadapt product story hasn’t changed significantly from what it was before. Specific points I can add include: Read more

Categories: Hadapt, Hadoop, MapReduce, PostgreSQL, SQL/Hadoop integration, Theory and architecture, Workload management

6 Comments

November 3, 2011

MarkLogic’s Hadoop connector

It’s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic’s new Hadoop connector.

Most of what’s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you:

Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API for streaming documents into or out of a MarkLogic database.
Hadoop can make requests to MarkLogic in MarkLogic’s normal mode of operation, namely to address any node in the MarkLogic cluster, which then serves as a “head” node for the duration of that particular request. But alternatively, Hadoop can use a long-standing MarkLogic option to circumvent the whole DBMS cluster and only talk to one specific MarkLogic node.

Otherwise, the whole thing is just what you would think:

Hadoop can read from and write to MarkLogic, in parallel at both ends.
If Hadoop is just writing to MarkLogic, there’s a good chance the process is properly called “ETL.”
If Hadoop is reading a lot from MarkLogic, there’s a good chance the process is properly called “batch analytics.”

MarkLogic said that it wrote this Hadoop connector itself.

Categories: Clustering, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, MarkLogic, Parallelization, Workload management

2 Comments

November 2, 2011

The cool aspects of Odiago WibiData

Christophe Bisciglia and Aaron Kimball have a new company.

It’s called Odiago, and is one of my gratifyingly more numerous tiny clients.
Odiago’s product line is called WibiData, after the justly popular We Be Sushi restaurants.
We’ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on TechCrunch. But this is the place for — well, for the tech crunch.

WibiData is designed for management of, investigative analytics on, and operational analytics on consumer internet data, the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop,* is a data management and analytic execution layer. That’s where the secret sauce resides. Also included are:

REST APIs for interactive access.
Import/export tools, including JDBC access.
Management tools.
Analytic libraries — data mining, predictive analytics, machine learning, and so on.

The whole thing is in beta, with about three (paying) beta customers.

*And Avro and so on.

The core ideas of WibiData include:

ALL data pertaining to a single user (or mobile device) is kept in a single, possibly very long, HBase row.
There are two primary operators in WibiData, Produce and Gather.
- Produce operates on single rows. It can operate on one row at HBase speed (milliseconds) if you need to inform an interactive user response. Or it can operate on the whole database in batch via Hadoop MapReduce.
- It is reasonable to think of Produce as mainly doing two things. One is the aforementioned serving of data out of WibiData into interactive applications. The other is scoring, classifying, recommending, etc. on individual users (i.e. rows), in line with an analytic model.
- Gather typically operates on all your rows at once, and emits suitable input for a MapReduce Reduce step. It is reasonable to think of Gather as being a key cog in the training of analytic models.
HBase schema management is done at the WibiData system level, not directly in applications. There’s a WibiData HBase data dictionary, powered by a set of system tables, that specifies cell data types/record types and, in effect, primitive schemas.
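
To make the Produce/Gather split concrete, here is a toy sketch in plain Python. It is not WibiData's API: the in-memory `table` stands in for HBase, and `produce`, `produce_one`, `produce_all`, `gather`, and `reduce_counts` are hypothetical names chosen for illustration. The only ideas taken from the description above are that Produce works on one row at a time (either a single row interactively or every row in batch) and that Gather runs across all rows and emits input for a reduce step.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Toy stand-in for an HBase table: one (possibly very long) row per user.
Row = Dict[str, List[str]]
table: Dict[str, Row] = {
    "user1": {"page_views": ["home", "sushi", "sushi"]},
    "user2": {"page_views": ["home", "games"]},
}


def produce(row: Row) -> str:
    """Per-row step: 'score' one user, here simply their most-viewed page."""
    counts = Counter(row["page_views"])
    return counts.most_common(1)[0][0]


def produce_one(user_id: str) -> str:
    """Interactive use: run produce on a single row."""
    return produce(table[user_id])


def produce_all() -> Dict[str, str]:
    """Batch use: run the same per-row function over every row."""
    return {user_id: produce(row) for user_id, row in table.items()}


def gather(rows: Iterable[Row]) -> Iterator[Tuple[str, int]]:
    """Cross-row step: emit key/value pairs suitable for a reduce step."""
    for row in rows:
        for page in row["page_views"]:
            yield page, 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Stand-in for the reduce step: sum the values for each key."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    print(produce_one("user1"))
    print(produce_all())
    print(reduce_counts(gather(table.values())))
```

In a real deployment the batch path would run as a MapReduce job rather than a dictionary comprehension; the point of the sketch is only that the same per-row function serves both the interactive and the batch case.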

Categories: Data models and architecture, Hadoop, HBase, NoSQL, Predictive modeling and advanced analytics, Web analytics, WibiData

14 Comments

November 1, 2011

MarkLogic 5, and why you might care

MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:

More-of-the-same in line with MarkLogic’s core positioning.
A new bi-directional Hadoop connector.
A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post.

Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called Isys. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.

*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.
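
As a rough illustration of the tiered-storage idea, the sketch below sends writes to a small "solid-state" tier first and flushes the least recently used entries to a "disk" tier once the fast tier is full. It is an assumption-laden toy, not MarkLogic's actual algorithm: the `TieredStore` class, its capacity parameter, and the two in-memory dictionaries are all hypothetical.

```python
from collections import OrderedDict
from typing import Any, Dict


class TieredStore:
    """Toy two-tier store: writes land on 'SSD', LRU entries flush to 'disk'."""

    def __init__(self, ssd_capacity: int) -> None:
        self.ssd_capacity = ssd_capacity
        # Ordered from least recently used (front) to most recently used (back).
        self.ssd: "OrderedDict[str, Any]" = OrderedDict()
        self.disk: Dict[str, Any] = {}

    def write(self, key: str, value: Any) -> None:
        """All writes go to the fast tier first."""
        self.ssd[key] = value
        self.ssd.move_to_end(key)
        self._flush_if_full()

    def read(self, key: str) -> Any:
        """Check the fast tier first; a hit counts as a recent use."""
        if key in self.ssd:
            self.ssd.move_to_end(key)
            return self.ssd[key]
        return self.disk[key]

    def _flush_if_full(self) -> None:
        """Move least recently used entries to disk until the SSD tier fits."""
        while len(self.ssd) > self.ssd_capacity:
            old_key, old_value = self.ssd.popitem(last=False)
            self.disk[old_key] = old_value


if __name__ == "__main__":
    store = TieredStore(ssd_capacity=2)
    store.write("a", 1)
    store.write("b", 2)
    store.read("a")      # "a" is now more recently used than "b"
    store.write("c", 3)  # the fast tier is over capacity, so "b" is flushed
    print(list(store.ssd), store.disk)
```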

MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:

MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of the MarkLogic deck) …
… which has been optimized from the get-go for poly-structured data.
MarkLogic can and does scale out to handle large amounts of data.
MarkLogic is a general-purpose DBMS, suitable for both short-request and analytic tasks.
MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about derived data).
MarkLogic often plays the role of a content assembler and/or search engine, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.

Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.

Categories: Hadoop, Market share and customer counts, MarkLogic, Scientific research, Solid-state memory, Structured documents, Text

1 Comment

October 25, 2011

Where Datameer is positioned

I’ve chatted with Datameer a couple of times recently, mainly with CEO Stefan Groschupf, most recently after XLDB last Tuesday. Nothing I learned greatly contradicts what I wrote about Datameer 1 1/2 years ago. In a nutshell, Datameer is designed to let you do simple stuff on large amounts of data, where “large amounts of data” typically means data in Hadoop, and “simple stuff” includes basic versions of a spreadsheet, of BI, and of EtL (Extract/Transform/Load, without much in the way of T).

Stefan reports that these capabilities are appealing to a significant fraction of enterprise or other commercial Hadoop users, especially the EtL and the BI. I don’t doubt him.

Categories: Business intelligence, Datameer, EAI, EII, ETL, ELT, ETLT, Hadoop

4 Comments

October 11, 2011

IBM is buying parallelization expert Platform Computing

IBM is acquiring Platform Computing, a company with which I had one briefing, last August. Quick background includes: Read more

Categories: Hadoop, IBM and DB2, Investment research and trading, MapReduce, Parallelization, Scientific research

5 Comments

October 10, 2011

Text data management, Part 3: Analytic and progressively enhanced

This is Part 3 of a three-post series. The posts cover:

Confusion about text data management.
Choices for text data management (general and short-request).
Choices for text data management (analytic).

I’ve gone on for two long posts about text data management already, but even so I’ve glossed over a major point:

Using text data commonly involves a long series of data enhancement steps.

Even before you do what we’d normally think of as “analysis”, text markup can include steps such as:

Figure out where the words break.
Figure out where the clauses and sentences break.
Figure out where the paragraphs, sections, and chapters break.
(Where necessary) map the words to similar ones — spelling correction, stemming, etc.
Figure out which words are grammatically which parts of speech.
Figure out which pronouns and so on refer to which other words. (Technical term: Anaphora resolution.)
Figure out what was being said, one clause at a time.
Figure out the emotion — or “sentiment” — associated with it.

Those processes can add up to dozens of steps. And maybe, six months down the road, you’ll think of more steps yet.
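
A chain of enhancement steps like this can be sketched as a list of functions, each of which adds one derived field to a document. The sketch below is deliberately crude and purely illustrative: the step functions, the `PIPELINE` list, and the `enhance` helper are hypothetical names, the sentence and word splitting are naive regular expressions, and the "stemming" just strips a trailing "s". Real text processing would use a proper NLP toolkit.

```python
import re
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Step = Callable[[Doc], Doc]


def split_sentences(doc: Doc) -> Doc:
    """Figure out where the sentences break (naively, on ., ! and ?)."""
    parts = re.split(r"(?<=[.!?])\s+", doc["text"])
    doc["sentences"] = [p.strip() for p in parts if p.strip()]
    return doc


def split_words(doc: Doc) -> Doc:
    """Figure out where the words break, one word list per sentence."""
    doc["words"] = [re.findall(r"[A-Za-z']+", s.lower()) for s in doc["sentences"]]
    return doc


def stem_words(doc: Doc) -> Doc:
    """Map words to similar ones (crude stand-in: drop a trailing 's')."""
    doc["stems"] = [
        [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
        for words in doc["words"]
    ]
    return doc


# Adding a step six months later is just one more entry in this list.
PIPELINE: List[Step] = [split_sentences, split_words, stem_words]


def enhance(text: str, steps: List[Step] = PIPELINE) -> Doc:
    """Run each enhancement step in order, keeping every intermediate result."""
    doc: Doc = {"text": text}
    for step in steps:
        doc = step(doc)
    return doc


if __name__ == "__main__":
    print(enhance("Cats chase mice. Dogs bark!"))
```

Because every step keeps the earlier fields in place, a later step can be added or rerun without redoing the whole chain from raw text.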

Categories: Data warehousing, Hadoop, NoSQL, Text

4 Comments

October 4, 2011

Cloudera versus Hortonworks

A few weeks ago I wrote:

The other big part of Hortonworks’ story is the claim that it holds the axe in Apache Hadoop development.

and

… just how dominant Hortonworks really is in core Hadoop development is a bit unclear. Meanwhile, Cloudera people seem to be leading a number of Hadoop companion or sub-projects, including the first two I can think of that relate to Hadoop integration or connectivity, namely Sqoop and Flume. So I’m not persuaded that the “we know this stuff better” part of the Hortonworks partnering story really holds up.

Now Mike Olson — CEO of my client Cloudera — has posted his analysis of the matter, in response to an earlier Hortonworks post asserting its claims. In essence, Mike argues:

It’s ridiculous to say any one company, e.g. Hortonworks, has a controlling position in Hadoop development.
Such diversity is a Very Good Thing.
Cloudera folks now contribute and always have contributed to Hadoop at a higher rate than Hortonworks folks.
If you consider just core Hadoop projects — the most favorable way of counting from a Hortonworks standpoint — Hortonworks has a lead, but not all that big of one.

Categories: Cloudera, Hadoop, Hortonworks, MapReduce, Open source

6 Comments
