Analysis of data management technology based on a structured-document model, or optimized for XML data. Related subjects include:
It feels like time to write about Clustrix, which I last covered in detail in May, 2010, and which is releasing Clustrix 4.0 today. Clustrix and Clustrix 4.0 basics include:
- Clustrix makes a short-request processing appliance.
- As you might guess from the name, Clustrix is clustered — peer-to-peer, with no head node.
- The Clustrix appliance uses flash/solid-state storage.
- Traditionally, Clustrix has run a MySQL-compatible DBMS.
- Clustrix 4.0 introduces JSON support. More on that below.
- Clustrix 4.0 introduces a bunch of administrative features, and parallel backup.
- Also in today’s announcement is a Rackspace partnership to offer Clustrix remotely, at monthly pricing.
- Clustrix has been shipping product for about 4 years.
- Clustrix has 20 customers in production, running >125 Clustrix nodes total.
- Clustrix has 60 people.
- List price for a (smallest size) Clustrix system is $150K for 3 nodes. Highest-end maintenance costs 15%.
- There’s also a $100K version meant for high availability/disaster recovery. Over half of Clustrix’s customers use off-site disaster recovery.
- Clustrix is raising a C round. Part of it has already been raised from insiders, as a kind of bridge.
The biggest Clustrix installation seems to be 20 nodes or so. Others seem to have 10+. I presume those disaster recovery customers have 6 or more nodes each. I’m not quite sure how the arithmetic on that all works; perhaps the 125ish count of nodes is a bit low.
Clustrix technical notes include: Read more
|Categories: Cloud computing, Clustering, Clustrix, Database compression, Market share and customer counts, MySQL, OLTP, Pricing, Structured documents||4 Comments|
From time to time, I try to step back and build a little taxonomy for the variety in database technology. One effort was 4 1/2 years ago, in a pre-planned exchange with Mike Stonebraker (his side, alas, has since been taken down). A year ago I spelled out eight kinds of analytic database.
The angle I’ll take this time is to say that every sufficiently large enterprise needs to be cognizant of at least 7 kinds of database challenge. General notes on that include:
- I’m using the weasel words “database challenge” to evade questions as to what is or isn’t exactly a DBMS.
- One “challenge” can call for multiple products and technologies even within a single enterprise, let alone at different ones. For example, in this post the “eight kinds of analytic database” are reduced to just a single category.
- Even so, one product or technology may be well-suited to address a couple different kinds of challenges.
The Big Seven database challenges that almost any enterprise faces are: Read more
|Categories: Data integration and middleware, Data models and architecture, Database diversity, EAI, EII, ETL, ELT, ETLT, Hadoop, Memory-centric data management, NoSQL, Object, OLTP, RDF and graphs, Structured documents, Talend, Text||4 Comments|
I’ve been talking some with the Neo Technology/Neo4j guys, including Emil Eifrem (CEO/cofounder), Johan Svensson (CTO/cofounder), and Philip Rathle (Senior Director of Products). Basics include:
- Neo Technology came up with Neo4j, open sourced it, and is building a company around the open source core product in the usual way.
- Neo4j is a graph DBMS.
- Neo4j is unlike some other graph DBMS in that:
- Neo4j is designed for OLTP (OnLine Transaction Processing), or at least as a general-purpose DBMS, rather than being focused on investigative analytics.
- To every node or edge managed by Neo4j you can associate an arbitrary collection of (name,value) pairs — i.e., what might be called a document.
Numbers and historical facts include:
- > 50 paying Neo4j customers.
- Estimated 1000s of production Neo4j users of open source version.*
- Estimated 1/3 of paying customers and free users using Neo4j as a “system of record”.
- >30,000 downloads/month, in some sense of “download”.
- 35 people in 6 countries, vs. 25 last December.
- $13 million in VC, most of it last October.
- Started in 2000 as the underpinnings for a content management system.
- A version of the technology in production in 2003.
- Neo4j first open-sourced in 2007.
- Big-name customers including Cisco, Adobe, and Deutsche Telekom.
- Pricing of either $6,000 or $24,000 per JVM per year for two different commercial versions.
|Categories: Market share and customer counts, Neo Technology and Neo4j, Open source, Pricing, RDF and graphs, Structured documents, Telecommunications||10 Comments|
Splunk is announcing the Splunk 4.3 point release. Before discussing it, let’s recall a few things about Splunk, starting with:
- Splunk is first and foremost an analytic DBMS …
- … used to manage logs and similar multistructured data.
- Splunk’s DML (Data Manipulation Language) is based on text search, not on SQL.
- Splunk has extended its DML in natural ways (e.g., you can use it to do calculations and even some statistics).
- Splunk bundles some (very) basic, Splunk-specific business intelligence capabilities.
- The paradigmatic use of Splunk is to monitor IT operations in real time. However:
- There also are plenty of non-real-time uses for Splunk.
- Splunk is proudest of its growth in non-IT quasi-real-time uses, such as the marketing side of web operations.
As in any release, a lot of Splunk 4.3 is about “Oh, you didn’t have that before?” features and Bottleneck Whack-A-Mole performance speed-up. One performance enhancement is Bloom filters, which are a very hot topic these days. More important is a switch from Flash to HTML5, so as to accommodate mobile devices with less server-side rendering. Splunk reports that its users — especially the non-IT ones — really want to get Splunk information on the tablet devices. While this somewhat contradicts what I wrote a few days ago pooh-poohing mobile BI, let me hasten to point out:
- Splunk is used for a lot of (quasi) real-time monitoring.
- Splunk’s desktop user interfaces are, by BI standards, quite primitive.
That’s pretty much the ideal scenario for mobile BI: Timeliness matters and prettiness doesn’t.
|Categories: Business intelligence, Data models and architecture, Data warehousing, Log analysis, Specific users, Splunk, Structured documents, Web analytics||3 Comments|
MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:
- More-of-the-same in line with MarkLogic’s core positioning.
- A new bi-directional Hadoop connector.
- A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post.
Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called Isys. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.
*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.
MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:
- MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of the MarkLogic deck) …
- … which has been optimized from the getgo for poly-structured data.
- MarkLogic can and does scale out to handle large amounts of data.
- MarkLogic is a general-purpose DBMS, suitable for both short-request and analytic tasks.
- MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about derived data).
- MarkLogic often plays the role of a content assembler and/or search engine, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.
Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.
|Categories: Hadoop, Market share and customer counts, MarkLogic, Scientific research, Solid-state memory, Structured documents, Text||1 Comment|
Alex Williams noticed that there will be a NoSQL session at Oracle OpenWorld next week, and is wondering whether this will be a big deal. I think it won’t be.
There really are three major points to NoSQL.
- Dynamic schemas. This is the only one of the three that truly depends on NoSQL.
- Scale-out short-request processing. If you want to scale out efficiently at high request volumes, you’re best off not using all the flexibility SQL/relational DBMS offer. (In particular, you don’t want to do cross-node joins). Not coincidentally, a number of the best scale-out offerings were built to be NoSQL.
- Open source. Doing a relational DBMS is a big project. It seems easier to build NoSQL ones.
Oracle can address the latter two points as aggressively as it wishes via MySQL. It so happens I would generally recommend MySQL enhanced by dbShards, Schooner, and/or dbShards/Schooner, rather than Oracle-only MySQL … but that’s a detail. In some form or other, Oracle’s MySQL is a huge player in the scale-out, open source, short-request database management market.
So that leaves us with dynamic schemas. Oracle has at least four different sets of technology in that area:
- As Workday noticed years ago, MySQL can be used as a functional, basic key-value store.
- Oracle also has XML-based Berkeley DB/SleepyCat kicking around.*
- The XML extensions to Oracle’s core DBMS could be alleged to have a dynamic schema/NoSQL flavor. (Blech.)
- A dynamic schema argument could also be made for object-oriented DBMS technology. While Oracle doesn’t to my knowledge exactly sell that, it does have the Tangosol Coherence line of technology, with a potentially similar programming model.
If Oracle is now refreshing and rebranding one or more of these as “NoSQL”, there’s no reason to view that as a big deal at all.
*That’s Mike Olson’s former company, if you’re keeping score at home.
|Categories: MySQL, NoSQL, Object, OLTP, Open source, Oracle, Parallelization, Schooner Information Technology, Structured documents||13 Comments|
HP has announced that:
- HP is buying Autonomy.
- HP is pulling back from WebOS.
- HP may spin off its PC business altogether.
On a high level, this means:
- HP is doubling down on enterprise IT.
- HP is taking a more software-centric approach to the enterprise IT business.
- HP is backing away from the consumer electronics business.
- HP in particular is backing away from the generic desktop/laptop PC business, which may with only moderate exaggeration be regarded as:
- The intersection of the enterprise IT and consumer electronics businesses.
- The least attractive sector of each.
My coverage of Autonomy isn’t exactly current, but I don’t know of anything that contradicts long-time competitor* Dave Kellogg’s skeptical view of Autonomy. Autonomy is a collection of businesses involved in the management, search, and retrieval of poly-structured data, in some cases with strong market share, but even so not necessarily with the strongest of reputations for technology or technology momentum. Autonomy started from a text search engine and a Bayesian search algorithm on top of that, which did a decent job for many customers. But if there’s been much in the way of impressive enhancement over the past 8-10 years, I’ve missed the news.
*Dave, of course, was CEO of MarkLogic.
Questions obviously arise about how the Autonomy acquisition relates to other HP businesses. My early thoughts include: Read more
|Categories: HP and Neoview, Market share and customer counts, Structured documents, Text, Vertica Systems||10 Comments|
I decided I needed some Couchbase drilldown, on business and technology alike, so I had solid chats with both CEO Bob Wiederhold and Chief Architect Dustin Sallings. Pretty much everything I wrote at the time Membase and CouchOne merged to form Couchbase (the company) still holds up. But I have more detail now.
Context for any comments on customer traction includes:
- Membase went into limited production release in October, and full release in January. Similar things are true of CouchDB.
- Hence, most sales of Couchbase’s products have been made over the past 6 months.
- Couchbase (the merged product) is at this point only in a pre-production developer’s release.
- Couchbase has both a direct sales force and a classic open-source “funnel”-based online selling model. Naturally, Couchbase’s understanding of what its customers are doing is more solid with respect to the direct sales base.
- Most of Couchbase’s revenue to date seems to have come from a limited number of big-ticket “lighthouse” accounts (as opposed to, say, the larger number of smaller deals that come in through the online funnel).
- Most Membase purchases are for new applications, as opposed to memcached migrations. However, customers are the kinds of companies that probably also are using memcached elsewhere.
- Most other Membase purchases are replacements for the Membase/MySQL combination. Bob says those are easy sales with short sales cycles.
- Pure memcached support is a small but non-zero business for Couchbase, and a fine source of upsell opportunities.
- In the pipeline but not so much yet in the customer base are SaaS vendors and the like who use and may want to replace traditional DBMS such as Oracle. Other than among those, Couchbase doesn’t compete much yet with Oracle et al.
- Pure CouchDB isn’t all that much of a business, at least relative to community size, as CouchDB is a single-server product commonly used by people who are content not to pay for support.
Membase sales are concentrated in five kinds of internet-centric companies, which in declining order are: Read more
E. F. “Ted” Codd taught the computing world that databases should have fixed logical schemas (which protect the user from having to know about physical database organization). But he may not have been as universally correct as he thought. Cases I’ve noted in which fixed schemas may be problematic include:
- “A bunch of apps in one, similar but not the same” (in my recent post on MongoDB).
- Out-of-control product catalogs (ditto).
- Analytic use cases in which one keeps enhancing the database with derived data.
And if marketing profile analysis is ever done correctly, that will be a huge example for the list.
So what do we call those DBMS — for example NoSQL, object-oriented, or XML-based systems — that bake the schema into the applications or the records themselves? In the MongoDB post I went with “schemaless,” but I wasn’t really comfortable with that, so I took the discussion to Twitter. Comments from Vlad Didenko (in particular), Ryan Prociuk, Merv Adrian, and Roland Bouman favored the idea that schemas in such systems are changeable or late-bound, rather than entirely absent. I quickly agreed.
My recent argument that the common terms “unstructured data” and “semi-structured data” are misnomers, and that a word like “multi-” or “poly-structured”* would be better, seems to have been well-received. But which is it — “multi-” or “poly-”?
*Everybody seems to like “poly-structured” better when it has a hyphen in it — including me.
The big difference between the two is that “multi-” just means there are multiple structures, while “poly-” further means that the structures are subject to change. Upon reflection, I think the “subject to change” part is essential, so poly-structured it is.
The definitions I’m proposing are:
- A database is poly-structured to the extent that its structure is apt to be changed in the ordinary course of query, update, or programming.
- Data is poly-structured to the extent that it is best represented in a poly-structured database.
- A DBMS is poly-structured to the extent that it is oriented to managing poly-structured databases.