Notes on graph data management
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics (this post)
- Relationship analytics definition
- Relationship analytics applications
- Analysis of large graphs
Interest in graph data models keeps increasing. But it’s tough to discuss them with any generality, because “graph data model” encompasses so many different things. Indeed, just as all data structures can be mapped to relational ones, it is also the case that all data structures can be mapped to graphs.
Formally, a graph is a collection of (node, edge, node) triples. In the simplest case, the edge has no properties other than existence or maybe direction, and the triple can be reduced to a (node, node) pair, unordered or ordered as the case may be. It is common, however, for edges to encapsulate additional properties, the canonical examples of which are the two below (see the sketch after the list):
- Weight. Usually, the intuition here is that the weight is a number indicating the strength of the connection. This is generally derived from more basic data.
- Kind. The edge can encapsulate one or more descriptors indicating the kind of relationship between the nodes.
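To make the formalism concrete, here is a minimal sketch of a directed property graph along those lines, in Python. The nodes, edge kinds, and weights are all made up for illustration; the point is just that each edge is a (node, node) pair plus a small bag of properties.

```python
from collections import defaultdict

class PropertyGraph:
    """Toy directed property graph: edges carry weight and kind properties."""

    def __init__(self):
        # adjacency list: source node -> list of (target node, edge properties)
        self.adjacency = defaultdict(list)

    def add_edge(self, source, target, weight=1.0, kind=None):
        """Record a directed (source, {weight, kind}, target) triple."""
        self.adjacency[source].append((target, {"weight": weight, "kind": kind}))

    def neighbors(self, node, kind=None):
        """Yield a node's neighbors, optionally filtered by edge kind."""
        for target, props in self.adjacency[node]:
            if kind is None or props["kind"] == kind:
                yield target, props

# Hypothetical example: a tiny communication/social graph
g = PropertyGraph()
g.add_edge("alice", "bob", weight=0.9, kind="emails")
g.add_edge("alice", "carol", weight=0.2, kind="follows")

for target, props in g.neighbors("alice", kind="emails"):
    print(target, props)  # bob {'weight': 0.9, 'kind': 'emails'}
```

The weights here are simply asserted; in practice, as noted above, they would usually be derived from more basic data, such as counts of interactions.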
Many of the graph examples I can think of fit into four groups.
Big Data hype?
A reporter wrote in to ask whether investor interest in “Big Data” was justified or hype. (More precisely, that’s how I reinterpreted his questions. 🙂 ) His examples were Splunk’s IPO, Teradata’s stock price increase, and Birst’s financing. In a nutshell:
- My comments, lightly edited, are in plain text below.
- Further thoughts are in italics.
- Of course I also linked him to my post “Big Data” has jumped the shark.
- Overall, my responses boil down to “Of course there’s some hype.”
1. A great example of hype is that anybody is calling Birst a “Big Data” or “Big Data analytics” company. If anything, Birst is a “little data” analytics company that claims, as a differentiating feature, that it can handle ordinary-sized data sets as well.
Thinking about market segments
It is a reasonable (over)simplification to say that my business boils down to:
- Advising vendors what/how to sell.
- Advising users what/how to buy.
One complication that commonly creeps in is that different groups of users have different buying practices and technology needs. Usually, I nod to that point in passing, perhaps by listing different application areas for a company or product. But now let’s address it head on. Whether or not you care about the particulars, I hope the sheer length of this post reminds you that there are many different market segments out there.
Last June I wrote:
In almost any IT decision, there are a number of environmental constraints that need to be acknowledged. Organizations may have standard vendors, favored vendors, or simply vendors who give them particularly deep discounts. Legacy systems are in place, application and system alike, and may or may not be open to replacement. Enterprises may have on-premise or off-premise preferences; SaaS (Software as a Service) vendors probably have multitenancy concerns. Your organization can determine which aspects of your system you’d ideally like to see be tightly integrated with each other, and which you’d prefer to keep only loosely coupled. You may have biases for or against open-source software. You may be pro- or anti-appliance. Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as budget, timeframe, security, or trained personnel.
I’d further say that it matters whether the buyer:
- Is a large central IT organization.
- Is the well-staffed IT organization of a particular business department.
- Is a small, frazzled IT organization.
- Has strong engineering or technical skills, but less in the way of IT specialists.
- Is trying to skate by without much technical knowledge of any kind.
Now let’s map those considerations (and others) to some specific market segments.
Notes on the Hadoop and HBase markets
I visited my clients at Cloudera and Hortonworks last week, along with scads of other companies. A few of the takeaways were:
- Cloudera now has 220 employees.
- Cloudera now has over 100 subscription customers.
- Over the past year, Cloudera has more than doubled in size by every reasonable metric.
- Over half of Cloudera’s customers use HBase, vs. a figure of 18+ last July.
- Omer Trajman — who by the way has made a long-overdue official move into technical marketing — can no longer keep count of how many petabyte-scale Hadoop clusters Cloudera supports.
- Cloudera gets the majority of its revenue from subscriptions. However, professional services and training continue to be big businesses too.
- Cloudera has trained over 12,000 people.
- Hortonworks is training people too.
- Hortonworks now has 70 employees, and plans to have 100 or so by the end of this quarter.
- A number of those Hortonworks employees are executives who come from seriously profit-oriented backgrounds. Hortonworks clearly has capitalist intentions.
- Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
- There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.
- I met with 4 young-company clients who I regard as building vertical analytic stacks (WibiData, MarketShare, MetaMarkets, and ClearStory). All 4 are heavily dependent on Hadoop. (The same isn’t as true of older companies who built out a lot of technology before Hadoop was invented.)
- There should be more HBase information at HBaseCon on May 22.
- If MapR still has momentum, nobody I talked with has noticed.
Three quick notes about derived data
I had one of “those” trips last week:
- 20 meetings, a number of them very multi-hour.
- A broken laptop.
- Flights that arrived 10:30ish Sunday night and left 7:00 Saturday morning.
So please pardon me if things are a bit disjointed …
I’ve argued for a while that:
- All human-generated data should be retained.
- The more important kinds of machine-generated data should be retained as well.
- Raw data isn’t enough; it’s really important to store derived data as well.
Here are a few notes on the derived data trend.
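To pin down what I mean by “derived”: here is a toy sketch, with a made-up event format and a made-up scoring rule, of raw events plus a derived per-user/item score that is worth retaining alongside them.

```python
from collections import Counter

# Raw events, human- or machine-generated (hypothetical format)
raw_events = [
    {"user": "u1", "item": "sku-7", "action": "view"},
    {"user": "u1", "item": "sku-7", "action": "view"},
    {"user": "u1", "item": "sku-7", "action": "buy"},
]

# Derived data: an affinity score per (user, item), weighting purchases more
# heavily than views. The weighting is made up purely for illustration.
WEIGHTS = {"view": 1, "buy": 5}
scores = Counter()
for event in raw_events:
    scores[(event["user"], event["item"])] += WEIGHTS[event["action"]]

# Retain both: the raw events (replayable, re-derivable under new rules) and
# the derived scores (cheap to query, and a record of what the app acted on).
print(dict(scores))  # {('u1', 'sku-7'): 7}
```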
Many kinds of memory-centric data management
I’m frequently asked to generalize in some way about in-memory or memory-centric data management. I can start:
- The desire for human real-time interactive response naturally leads to keeping data in RAM. (The back-of-envelope arithmetic after this list suggests why.)
- Thanks to Moore’s Law, it will keep getting cheaper to put many databases entirely in RAM; (most) traditional databases will eventually wind up there.
- However, there will be exceptions, mainly on the machine-generated side: where the cost of creating data falls about as fast as the cost of RAM, the overall cost of keeping that data in RAM may not decline much at all.
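As back-of-envelope support for the first point above: compare how many random lookups fit into a roughly 100-millisecond interactive response budget at RAM vs. disk latencies. The latency figures below are order-of-magnitude assumptions, not measurements.

```python
# Order-of-magnitude latency assumptions (not measurements)
RAM_ACCESS_SEC = 100e-9       # ~100 nanoseconds per random RAM access
DISK_SEEK_SEC = 10e-3         # ~10 milliseconds per random disk seek
INTERACTIVE_BUDGET_SEC = 0.1  # ~100 ms still feels "instant" to a human

print(int(INTERACTIVE_BUDGET_SEC / RAM_ACCESS_SEC))  # ~1,000,000 RAM accesses
print(int(INTERACTIVE_BUDGET_SEC / DISK_SEEK_SEC))   # ~10 disk seeks
```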
Getting more specific than that is hard, however, because:
- The possibilities for in-memory data storage are as numerous and varied as those for disk.
- The individual technologies and products for in-memory storage are much less mature than those for disk.
- Solid-state options such as flash just confuse things further.
Consider, for example, some of the in-memory data management ideas kicking around.
Clarifying IBM DB2 Express-C crippleware
When Conor O’Mahony briefed me about DB2 10, he kept commenting that cool features he was talking about could be found in all editions of DB2, even the free one. So I asked what the limitations were on free DB2. He researched the matter and got back to me — and the limits he described sounded like the ones that applied when free DB2 was first introduced, over 6 years ago.
I tweeted about this, and was very fortunate that Ian Bjorhovde spoke up and said it wasn’t correct. Some scrambling ensued. It seems that the main sources of error were:
- People tend to confuse DB2 Express and DB2 Express-C; only the latter is free.
- What IBM said about the limitations of DB2 Express-C upon its introduction 6 years ago should not be interpreted the way a plain reading might suggest.
In particular, we shouldn’t take IBM’s repeated 2006 statements that
DB2 Express-C may be deployed … on AMD or Intel x86 systems with up to 2 dual-core chips. 4 GB of memory is the maximum supported.
to mean that you were ever allowed to use DB2 Express-C with 4 cores, or with 4 GB of RAM.
To clarify things, Conor sent over an email, with permission to quote.
IBM DB2 10
Shortly before Tuesday’s launch of DB2 10, IBM’s Conor O’Mahony checked in for a relatively non-technical briefing.* More precisely, this is about DB2 for “distributed” systems, aka LUW (Linux/Unix/Windows); some of the features have already been in the mainframe version of DB2 for a while. IBM is graciously permitting me to post the associated DB2 10 announcement slide deck.
*I hope any errors in interpretation are minor.
Major aspects of DB2 10 include new or improved capabilities in the areas of:
- Compression.
- Analytic query performance.
- Data ingest.
- Multi-temperature data management.
- Workload management.
- Graph management/relationship analytics.
- Time-travel, bitemporal features, and bitemporal time-travel. (A toy sketch of the bitemporal idea follows this list.)
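DB2’s temporal features are, of course, expressed in SQL. Purely to illustrate what “bitemporal” means, here is a toy Python sketch (with made-up dates and values) in which each record version carries both a business-time interval (when the fact held in the real world) and a system-time interval (when the database believed it), and “time travel” queries pick along both dimensions.

```python
from datetime import date

# Each version of a record carries two time intervals:
#   business time - when the fact held in the real world
#   system time   - when the database recorded/believed that version
versions = [
    # (value, business_from, business_to, system_from, system_to)
    ("rate=5%", date(2012, 1, 1), date(2012, 7, 1), date(2012, 1, 1), date(2012, 3, 1)),
    ("rate=6%", date(2012, 1, 1), date(2012, 7, 1), date(2012, 3, 1), date(9999, 12, 31)),
]

def as_of(business_day, system_day):
    """Bitemporal time travel: what did we believe on system_day about business_day?"""
    for value, b_from, b_to, s_from, s_to in versions:
        if b_from <= business_day < b_to and s_from <= system_day < s_to:
            return value
    return None

print(as_of(date(2012, 2, 1), date(2012, 2, 1)))  # rate=5% -- what we believed at the time
print(as_of(date(2012, 2, 1), date(2012, 4, 1)))  # rate=6% -- after a later correction
```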
Of course, there are various other enhancements too, including to security (fine-grained access control), Oracle compatibility, and DB2 pureScale. Everything except the pureScale part is also reflected in IBM InfoSphere Warehouse, which is a near-superset of DB2.*
*Also, the data ingest part isn’t in base DB2.
Our clients, and where they are located
From time to time, I disclose our vendor client lists. Another iteration is below, the first since a little over a year ago. To be clear:
- This is a list of Monash Advantage members.
- All our vendor clients are Monash Advantage members, unless …
- … we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
- We do not usually disclose our user clients.
- We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
- Excluded from this round of disclosure is one vendor I have never written about.
- Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.
For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me.
DataStax Enterprise and Cassandra revisited
My last post about DataStax Enterprise and Cassandra didn’t go so well. As follow-up, I chatted for two hours with Rick Branson and Billy Bosworth of DataStax. Hopefully I can do better this time around.
For starters, let me say there are three kinds of data management nodes in DataStax Enterprise:
- Vanilla Cassandra.
- Cassandra plus Solr. Solr is a superset of the text-indexing system Lucene.
  - Solr adds a lot more secondary indexing to Cassandra.
  - In addition, these nodes serve as Solr emulation; you can run generic Solr apps on them.
- Cassandra plus Hadoop.
  - You can use Hadoop MapReduce to manipulate generic Cassandra data.
  - In addition, these nodes serve as Hadoop/HDFS (Hadoop Distributed File System) emulation; you can run generic Hadoop apps on them.
  - Hadoop jobs can interweave access to the two kinds of data structure.
Cassandra, Solr, Lucene, and Hadoop are all Apache projects.
If we look at this from the standpoint of DML (Data Manipulation Language) and data access APIs:
- Cassandra is a column-group kind of NoSQL DBMS. You can get at its data programmatically.
  - There’s something called CQL (Cassandra Query Language), said to be SQL-like.
  - There’s a JDBC driver for CQL.
- With Hadoop MapReduce also come Hive, Pig, and Sqoop.
- With Solr and Lucene comes full-text search.
In addition, it is sometimes recommended that you use “in-entity caching”, where an entire data structure (e.g. in JSON) winds up in a single Cassandra column.
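As an illustration of in-entity caching, here is a sketch using the DataStax Python driver (which the post doesn’t mention, so treat it as one plausible choice), with a made-up keyspace, table, and entity. The whole entity is serialized to JSON and stored in a single column, so reading it back is one narrow row fetch.

```python
import json
from cassandra.cluster import Cluster  # DataStax Python driver, assumed installed

cluster = Cluster(["127.0.0.1"])            # hypothetical local Cassandra node
session = cluster.connect("demo_keyspace")  # hypothetical pre-existing keyspace

# One text column holds the entire entity as a JSON blob ("in-entity caching").
session.execute("""
    CREATE TABLE IF NOT EXISTS user_profiles (
        user_id text PRIMARY KEY,
        profile_json text
    )
""")

profile = {"name": "Alice", "follows": ["bob", "carol"], "prefs": {"lang": "en"}}
session.execute(
    "INSERT INTO user_profiles (user_id, profile_json) VALUES (%s, %s)",
    ("alice", json.dumps(profile)),
)

for row in session.execute(
    "SELECT profile_json FROM user_profiles WHERE user_id = %s", ("alice",)
):
    print(json.loads(row.profile_json)["follows"])  # ['bob', 'carol']
```

The tradeoff is the usual one for denormalized blobs: the entity comes and goes in a single-row operation, but fields inside the blob can’t be individually indexed or updated without rewriting the whole thing.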
The two main ways to get direct SQL* access to data in DataStax Enterprise are:
- JDBC/SQL.
- Hive/Hadoop.
*or very SQL-like, depending on how you view things
Before going further, let’s recall some Cassandra basics.