Thinking about market segments
It is a reasonable (over)simplification to say that my business boils down to:
- Advising vendors what/how to sell.
- Advising users what/how to buy.
One complication that commonly creeps in is that different groups of users have different buying practices and technology needs. Usually, I nod to that point in passing, perhaps by listing different application areas for a company or product. But now let’s address it head on. Whether or not you care about the particulars, I hope the sheer length of this post reminds you that there are many different market segments out there.
Last June I wrote:
In almost any IT decision, there are a number of environmental constraints that need to be acknowledged. Organizations may have standard vendors, favored vendors, or simply vendors who give them particularly deep discounts. Legacy systems are in place, application and system alike, and may or may not be open to replacement. Enterprises may have on-premise or off-premise preferences; SaaS (Software as a Service) vendors probably have multitenancy concerns. Your organization can determine which aspects of your system you’d ideally like to see be tightly integrated with each other, and which you’d prefer to keep only loosely coupled. You may have biases for or against open-source software. You may be pro- or anti-appliance. Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as budget, timeframe, security, or trained personnel.
I’d further say that it matters whether the buyer:
- Is a large central IT organization.
- Is the well-staffed IT organization of a particular business department.
- Is a small, frazzled IT organization.
- Has strong engineering or technical skills, but less in the way of IT specialists.
- Is trying to skate by without much technical knowledge of any kind.
Now let’s map those considerations (and others) to some specific market segments. Read more
Notes on the Hadoop and HBase markets
I visited my clients at Cloudera and Hortonworks last week, along with scads of other companies. A few of the takeaways were:
- Cloudera now has 220 employees.
- Cloudera now has over 100 subscription customers.
- Over the past year, Cloudera has more than doubled in size by every reasonable metric.
- Over half of Cloudera’s customers use HBase, vs. a figure of 18+ last July.
- Omer Trajman — who by the way has made a long-overdue official move into technical marketing — can no longer keep count of how many petabyte-scale Hadoop clusters Cloudera supports.
- Cloudera gets the majority of its revenue from subscriptions. However, professional services and training continue to be big businesses too.
- Cloudera has trained over 12,000 people.
- Hortonworks is training people too.
- Hortonworks now has 70 employees, and plans to have 100 or so by the end of this quarter.
- A number of those Hortonworks employees are executives who come from seriously profit-oriented backgrounds. Hortonworks clearly has capitalist intentions.
- Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
- There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.
- I met with 4 young-company clients who I regard as building vertical analytic stacks (WibiData, MarketShare, MetaMarkets, and ClearStory). All 4 are heavily dependent on Hadoop. (The same isn’t as true of older companies who built out a lot of technology before Hadoop was invented.)
- There should be more HBase information at HBaseCon on May 22.
- If MapR still has momentum, nobody I talked with has noticed.
Three quick notes about derived data
I had one of “those” trips last week:
- 20 meetings, a number of them very multi-hour.
- A broken laptop.
- Flights that arrived 10:30ish Sunday night and left 7:00 Saturday morning.
So please pardon me if things are a bit disjointed …
I’ve argued for a while that:
- All human-generated data should be retained.
- The more important kinds of machine-generated data should be retained as well.
- Raw data isn’t enough; it’s really important to store derived data as well.
Here are a few notes on the derived data trend. Read more
Categories: Derived data, Hadoop, Hortonworks, KXEN, Predictive modeling and advanced analytics | 8 Comments |
Many kinds of memory-centric data management
I’m frequently asked to generalize in some way about in-memory or memory-centric data management. I can start:
- The desire for human real-time interactive response naturally leads to keeping data in RAM.
- Many databases will be ever cheaper to put into RAM over time, thanks to Moore’s Law. (Most) traditional databases will eventually wind up in RAM.
- However, there will be exceptions, mainly on the machine-generated side. Where data creation and RAM data storage are getting cheaper at similar rates … well, the overall cost of RAM storage may not significantly decline.
Getting more specific than that is hard, however, because:
- The possibilities for in-memory data storage are as numerous and varied as those for disk.
- The individual technologies and products for in-memory storage are much less mature than those for disk.
- Solid-state options such as flash just confuse things further.
Consider, for example, some of the in-memory data management ideas kicking around. Read more
Human real-time
I first became an analyst in 1981. And so I was around for the early days of the movement from batch to interactive computing, as exemplified by:
- The rise of minicomputers as mainframe alternatives (first VAXen, then the ‘nix systems that did largely supplant mainframes).
- The move from batch to interactive computing even on mainframes, a key theme of 1980s application software industry competition.
Of course, wherever there is interactive computing, there is a desire for interaction so fast that users don’t notice any wait time. Dan Fylstra, when he was pitching me the early windowing system VisiOn, characterized this as response so fast that the user didn’t tap his fingers waiting.* And so, with the move to any kind of interactive computing at all came a desire that the interaction be quick-response/low-latency. Read more
Clarifying IBM DB2 Express-C crippleware
When Conor O’Mahony briefed me about DB2 10, he kept commenting that cool features he was talking about could be found in all editions of DB2, even the free one. So I asked what the limitations were on free DB2. He researched the matter and got back to me — and they sounded like what appeared to have been the limits when free DB2 was first introduced, over 6 years ago.
I tweeted about this, and was very fortunate that Ian Bjorhovde spoke up and said it wasn’t correct. Some scrambling ensued. It seems that the main sources of error were:
- People tend to confuse DB2 Express and DB2 Express-C; only the latter is free.
- What IBM said about the limitations DB2 Express-C upon its introduction 6 years ago should not be interpreted in line with what a plain reading might suggest.
In particular, we shouldn’t take IBM’s repeated 2006 statements that
DB2 Express-C may be deployed on … on AMD or Intel x86 systems with up to 2 dual-core chips. 4 GB of memory is the maximum supported.
to mean that you were ever allowed to use DB2 Express-C with 4 cores, nor with 4 GB of RAM.
To clarify things, Conor sent over email with permission to quote, as follows: Read more
Categories: IBM and DB2, Pricing | 14 Comments |
IBM DB2 10
Shortly before Tuesday’s launch of DB2 10, IBM’s Conor O’Mahony checked in for a relatively non-technical briefing.* More precisely, this is about DB2 for “distributed” systems, aka LUW (Linux/Unix/Windows); some of the features have already been in the mainframe version of DB2 for a while. IBM is graciously permitting me to post the associated DB2 10 announcement slide deck.
*I hope any errors in interpretation are minor.
Major aspects of DB2 10 include new or improved capabilities in the areas of:
- Compression.
- Analytic query performance.
- Data ingest.
- Multi-temperature data management.
- Workload management.
- Graph management/relationship analytics.
- Time-travel, bitemporal features, and bitemporal time-travel.
Of course, there are various other enhancements too, including to security (fine-grained access control), Oracle compatibility, and DB2 pureScale. Everything except the pureScale part is also reflected in IBM InfoSphere Warehouse, which is a near-superset of DB2.*
*Also, the data ingest part isn’t in base DB2.
Categories: Data warehousing, Database compression, IBM and DB2, RDF and graphs, Solid-state memory, Workload management | 6 Comments |
Our clients, and where they are located
From time to time, I disclose our vendor client lists. Another iteration is below, the first since a little over a year ago. To be clear:
- This is a list of Monash Advantage members.
- All our vendor clients are Monash Advantage members, unless …
- … we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
- We do not usually disclose our user clients.
- We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
- Excluded from this round of disclosure is one vendor I have never written about.
- Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.
For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me. Read more
DataStax Enterprise and Cassandra revisited
My last post about DataStax Enterprise and Cassandra didn’t go so well. As follow-up, I chatted for two hours with Rick Branson and Billy Bosworth of DataStax. Hopefully I can do better this time around.
For starters, let me say there are three kinds of data management nodes in DataStax Enterprise:
- Vanilla Cassandra.
- Cassandra plus Solr. Solr is a superset of the text-indexing system Lucene.
- Solr adds a lot more secondary indexing to Cassandra.
- In addition, these nodes serve as Solr emulation; you can run generic Solr apps on them.
- Cassandra plus Hadoop.
- You can use Hadoop MapReduce to manipulate generic Cassandra data.
- In addition, these nodes serve as Hadoop/HDFS (Hadoop Distributed File System) emulation; you can run generic Hadoop apps on them.
- Hadoop jobs can interweave access to the two kinds of data structure.
Cassandra, Solr, Lucene, and Hadoop are all Apache projects.
If we look at this from the standpoint of DML (Data Manipulation Language) and data access APIs:
- Cassandra is a column-group kind of NoSQL DBMS. You can get at its data programmatically.
- There’s something called CQL (Cassandra Query Language), said to be SQL-like.
- There’s a JDBC driver for CQL.
- With Hadoop MapReduce also come Hive, Pig, and Sqoop.
- With Solr and Lucene come full-text search.
In addition, it is sometimes recommended that you use “in-entity caching”, where an entire data structure (e.g. in JSON) winds up in a single Cassandra column.
The two main ways to get direct SQL* access to data in DataStax Enterprise are:
- JDBC/SQL.
- Hive/Hadoop.
*or very SQL-like, depending on how you view things
Before going further, let’s recall some Cassandra basics: Read more
Categories: Cassandra, DataStax, Hadoop, MapReduce, Market share and customer counts, NoSQL, Open source, Text | 6 Comments |
CodeFutures/dbShards update
I’ve been talking a fair bit with Cory Isaacson, CEO of my client CodeFutures, which makes dbShards. Business notes include:
- 7 production users, plus an 8th imminent.
- 12-14 signed contracts beyond that.
- ~160 servers in production.
- One customer who has almost 15 terabytes of data (in the cloud).
- Still <10 people, pretty much all engineers.
- Profitable, but looking to raise a bit of growth capital.
Apparently, the figure of 6 dbShards customers in July, 2010 is more comparable to today’s 20ish contracts than to today’s 7-8 production users. About 4 of the original 6 are in production now.
NDA stuff aside, the main technical subject we talked about is something Cory calls “relational sharding”. The point is that dbShards’ transparent sharding can be done in such a way as to make many joins be single-server. Specifically:
- When a table is sufficiently small to be replicated in full at every nodes, you can join on it without moving data across the network.
- When two tables are sharded on the same key, you can join on that key without moving data across the network.
dbShards can’t do cross-shard joins, but it can do distributed transactions comprising multiple updates. Cory argues persuasively that in almost all cases this is enough; but I see cross-shard joins as a feature that should someday be added to dbShards even so.
The real issue with dbShards’ transparent sharding is ensuring it’s really transparent. Cory regards as typical a customer with a couple thousand tables, who had to change a dozen or so SQL statements to implement dbShards. But there are near-term plans to automate the number of SQL changes needed down to 0. The essence of that change is this: Read more