Discussion of relational database management systems that are offered through some version of open source licensing. Related subjects include:
I’m usually annoyed by lists of year-end predictions. Still, a reporter asked me for some, and I found one kind I was comfortable making.
Trends that I think will continue in 2013 include:
Growing attention to machine-generated data. Human-generated data grows at the rate business activity does, plus 0-25%. Machine-generated data grows at the rate of Moore’s Law, also plus 0-25%, which is a much higher total. In particular, the use of remote machine-generated data is becoming increasingly real.
Hadoop adoption. Everybody has the big bit bucket use case, largely because of machine-generated data. Even today’s technology is plenty good enough for that purpose, and hence justifies initial Hadoop adoption. Development of further Hadoop technology, which I post about frequently, is rapid. And so the Hadoop trend is very real.
Application SaaS. The on-premises application software industry has hopeless problems with product complexity and rigidity. Any suite new enough to cut the Gordian Knot is or will be SaaS (Software as a Service).
Newer BI interfaces. Advanced visualization — e.g. Tableau or QlikView — and mobile BI are both hot. So, more speculatively, are “social” BI (Business Intelligence) interfaces.
Price discounts. If you buy software at 50% of list price, you’re probably doing it wrong. Even 25% can be too high.
MySQL alternatives. NoSQL and NewSQL products often are developed as MySQL alternatives. Oracle has actually done a good job on MySQL technology, but now its business practices are scaring companies away from MySQL commitments, and newer short-request SQL DBMS are ready for use.
|Categories: Business intelligence, Hadoop, MySQL, NewSQL, NoSQL, Open source, Oracle, Pricing, Software as a Service (SaaS), Surveillance and privacy||3 Comments|
My clients at Couchbase checked in.
- After multiple delays, Couchbase 2.0 is well into beta, with general availability being delayed by the holiday season as much as anything else.
- Couchbase (the company) now has >350 subscription customers, almost all for Couchbase (the product) — which is to say for what was known as Membase, which is basically a persistent version of Memcached.
- There also are many users of open source Couchbase, most famously LinkedIn.
- Orbitz is a much-mentioned flagship paying Couchbase customer.
- Couchbase customers mainly seem to be replacing a caching layer, Memcached or otherwise.
- Couchbase headcount is just under 100.
The big changes in Couchbase 2.0 versus the previous (1.8.x) version are:
- JSON storage, including secondary indexes.
- Multi-data-center replication.
- A back-end change from SQLite to a heavily forked version of CouchDB, called Couchstore.
Couchbase 2.0 is upwards-compatible with prior versions of Couchbase (and hence with Memcached), but not with CouchDB.
Technology notes on Couchbase 2.0 include: Read more
|Categories: Basho and Riak, Cache, Cassandra, Clustering, Couchbase, MapReduce, Market share and customer counts, MongoDB and 10gen, NoSQL, Open source, Structured documents||4 Comments|
What I wrote before about Cloudera Impala was quite incomplete. After a followup call, I now feel I have a better handle on the whole thing.
First, some basics:
- Impala is open source code, developed to date entirely by Cloudera people, which adds analytic DBMS capabilities to Hadoop as an alternative to Hive.
- Impala is in public beta, and is targeted for general availability Q1 2013 or so.
- Cloudera plans to get paid for Impala by providing support, and by offering Impala management through its proprietary Cloudera Manager.
- Impala has been under development for about 2 years. A team of 7 or so developers has been mainly in place for a over a year. Furthermore, …
- … notwithstanding that it’s best viewed as a Hive alternative, Impala actually reuses a lot of Hive.
The general technical idea of Impala is:
- It’s an additional daemon that runs on each of your Hadoop nodes.
- Thus, Impala is not subject to Hadoop MapReduce’s latency in starting up Java processes or in storing intermediate result sets to disk.
- Impala operates as a distributed parallel analytic DBMS.*
- Impala works with a variety of Hadoop storage options, each with its own implications for latency or performance.
|Categories: Cloudera, Data models and architecture, Data warehousing, Hadoop, HBase, MapReduce, Open source, Predictive modeling and advanced analytics, SQL/Hadoop integration||12 Comments|
Time for another catch-all post. First and saddest — one of the earliest great commenters on this blog, and a beloved figure in the Boston-area database community, was Dan Weinreb, whom I had known since some Symbolics briefings in the early 1980s. He passed away recently, much much much too young. Looking back for a couple of examples — even if you’ve never heard of him before, I see that Dan ‘s 2009 comment on Tokutek is still interesting today, and so is a post on his own blog disagreeing with some of my choices in terminology.
Otherwise, in no particular order:
1. Chris Bird is learning MongoDB. As is common for Chris, his comments are both amusing and enlightening.
2. When I relayed Cloudera’s comments on Hadoop adoption, I left out a couple of categories. One Cloudera called “mobile”; when I probed, that was about HBase, with an example being messaging apps.
The other was “phone home” — i.e., the ingest of machine-generated data from a lot of different devices. This is something that’s obviously been coming for several years — but I’m increasingly getting the sense that it’s actually arrived.
|Categories: Cloudera, Data integration and middleware, Hadoop, HBase, Informatica, Metamarkets and Druid, MongoDB and 10gen, NoSQL, Open source, Telecommunications||2 Comments|
Edit: There is now a follow-up post on Cloudera Impala with substantially more detail.
In my world it’s possible to have a hasty 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like Hive, Impala turns Hadoop into a basic analytic RDBMS, with similar SQL/Hadoop integration benefits to those of Hadapt. In particular:
- Impala is Hive-compatible in query language (HQL, which is a whole lot like SQL), metadata, JDBC/ODBC drivers, etc.
- Unlike Hive, Impala does not work through Hadoop MapReduce.
- Unlike Hadoop MapReduce and hence Hive, Impala does not persist intermediate results to disk. This is good for performance, but on extremely long-running queries it increases the risk you’ll have a node failure and have to restart the query from scratch.
- Impala in its first version is missing some Hive syntax, notably in support for UDFs (User-Defined Functions).
Beyond that: Read more
|Categories: Cloudera, Columnar database management, Database compression, Hadapt, Hadoop, MapReduce, Open source, SQL/Hadoop integration||6 Comments|
I haven’t done a notes/link/comments post for a while. Time for a little catch-up.
1. MySQL now has a memcached integration story. I haven’t checked the details. The MySQL team is pretty hard to talk with, due to the heavy-handedness of Oracle’s analyst relations.
2. The Large Hadron Collider offers some serious numbers, including:
- 1 petabyte/second.
- 6 x 109 collisions/second.
- Only 1 in 1013 collision records kept (which I guess knocks things down to a 100 byte/second average, from the standpoint of persistent storage).
- Real-time filtering by a cluster of several thousand machines, over a 25 nanosecond period.
3. One application area we don’t talk about much for analytic technologies is education. However: Read more
|Categories: Cache, memcached, Memory-centric data management, MySQL, Open source, Petabyte-scale data management, RDF and graphs, Scientific research||Leave a Comment|
Ron Pressler of Parallel Universe/SpaceBase pinged me about a data grid product he was open sourcing, called Galaxy. The idea is that a distributed RAM grid will allocate data, not randomly or via consistent hashing, but rather via a locality-sensitive approach. Notes include:
- The original technology was developed to track moving objects on behalf of the Israeli Air Force.
- The commercial product is focused on MMO (Massively MultiPlayer Online) games (or virtual worlds).
- The underpinnings are being open sourced.
- Ron suggests that, among other use cases, Galaxy might work well for graphs.
- Ron argues that one benefit is that when lots of things cluster together — e.g. characters in a game — there’s a natural way to split them elastically (shrink the radius for proximity).
- The design philosophy seems to be to adapt as many ideas as possible from the way CPUs manage (multiple levels of) RAM cache.
|Categories: Cache, Clustering, Complex event processing (CEP), Games and virtual worlds, GIS and geospatial, Open source, RDF and graphs, Scientific research||2 Comments|
I’ve been talking some with the Neo Technology/Neo4j guys, including Emil Eifrem (CEO/cofounder), Johan Svensson (CTO/cofounder), and Philip Rathle (Senior Director of Products). Basics include:
- Neo Technology came up with Neo4j, open sourced it, and is building a company around the open source core product in the usual way.
- Neo4j is a graph DBMS.
- Neo4j is unlike some other graph DBMS in that:
- Neo4j is designed for OLTP (OnLine Transaction Processing), or at least as a general-purpose DBMS, rather than being focused on investigative analytics.
- To every node or edge managed by Neo4j you can associate an arbitrary collection of (name,value) pairs — i.e., what might be called a document.
Numbers and historical facts include:
- > 50 paying Neo4j customers.
- Estimated 1000s of production Neo4j users of open source version.*
- Estimated 1/3 of paying customers and free users using Neo4j as a “system of record”.
- >30,000 downloads/month, in some sense of “download”.
- 35 people in 6 countries, vs. 25 last December.
- $13 million in VC, most of it last October.
- Started in 2000 as the underpinnings for a content management system.
- A version of the technology in production in 2003.
- Neo4j first open-sourced in 2007.
- Big-name customers including Cisco, Adobe, and Deutsche Telekom.
- Pricing of either $6,000 or $24,000 per JVM per year for two different commercial versions.
|Categories: Market share and customer counts, Neo Technology and Neo4j, Open source, Pricing, RDF and graphs, Structured documents, Telecommunications||10 Comments|
In my recent series of Hadoop posts, there were several cases where I had to choose between recommending that enterprises:
- Go with the most advanced features any vendor was credibly advocating.
- Be more cautious, and only adopt features that have been solidly proven in the field.
I favored the more advanced features each time. Here’s why.
To a first approximation, I divide Hadoop use cases into two major buckets, only one of which I was addressing with my comments:
1. Analytic data management.* Here I favored features over reliability because they are more important, for Hadoop as for analytic RDBMS before it. When somebody complains about an analytic data store not being ready for prime time, never really working, or causing them to tear their hair out, what they usually mean is that:
- It couldn’t do the work that needed doing …
- … with reasonable performance and turnaround time …
- … without undue effort in administration and/or programming.
Those complaints are much, much, more frequent than “It crashed”. So it was for Netezza, DATAllegro, Greenplum, Aster Data, Vertica, Infobright, et al. So it also is for Hadoop. And how does one address those complaints? By performance and feature enhancements, of the kind that the Hadoop community is introducing at high speed. Read more
|Categories: Buying processes, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, Hortonworks, Open source||Leave a Comment|
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production.
- In general, how “enterprise-ready” is Hadoop?
- HBase 0.92 (this post)
As part of my recent round of Hadoop research, I talked with Cloudera’s Todd Lipcon. Naturally, one of the subjects was HBase, and specifically HBase 0.92. I gather that the major themes to HBase 0.92 are:
- Performance, scalability, and so on.
- “Coprocessors”, which are like triggers or stored procedures.
- Security, as the first major application of co-processors.
HBase coprocessors are Java code that links straight into HBase. As with other DBMS extensions of the “links straight into the DBMS code” kind,* HBase coprocessors seem best suited for very sophisticated users and third parties.** Evidently, coprocessors have already been used to make HBase security more granular — role-based, per-column-family/per-table, etc. Further, Todd thinks coprocessors could serve as a good basis for future HBase enhancements in areas such as aggregation or secondary indexing. Read more
|Categories: Benchmarks and POCs, Cloudera, Hadoop, HBase, MapReduce, NoSQL, Open source, Storage, Theory and architecture||2 Comments|