Analysis of open source DBMS vendor MySQL (recently acquired by Sun Microsystems), its products, and other products in the MySQL ecosystem. Related subjects include:
“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the train as well, but they’ve taken a clear and interesting stance:
- A query layer with multiple ways to query and analyze data.
- A separate data storage layer in which you have a choice of data storage engines …
- … each of which has the same logical (JSON-based) data structure.
When I pointed out that it would make sense to call this “multimodel query” — because the storage isn’t “multimodel” at all — they quickly agreed.
To be clear: While there are multiple ways to read data in MongoDB, there’s still only one way to write it. Letting that sink in helps clear up confusion as to what about MongoDB is or isn’t “multimodel”. To spell that out a bit further: Read more
Categories: Database diversity, Emulation, transparency, portability, MongoDB, MySQL, NoSQL, Open source, RDF and graphs, Structured documents, Text | 2 Comments
I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.
Now that you’re duly warned, let me remind you of some aspects of the Couchbase timeline.
- 2 corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL.
- Couchbase’s original value proposition, under the name Membase, was to provide persistence and of course support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
- A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
- By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
- Couchbase 4.0 has been in beta for a few months.
Technical notes on Couchbase 4.0 — and related riffs — start: Read more
Categories: Cache, Clustering, Couchbase, Data models and architecture, Databricks, Spark and BDAS, Exadata, Hadoop, MarkLogic, MongoDB, MySQL, NoSQL, Open source, Schema on need, Structured documents, Web analytics | 1 Comment
Indexes are central to database management.
- My first-ever stock analyst report, in 1982, correctly predicted that index-based DBMS would supplant linked-list ones …
- … and to this day, if one wants to retrieve a small fraction of a database, indexes are generally the most efficient way to go.
- Recently, I’ve had numerous conversations in which indexing strategies played a central role.
Perhaps it’s time for a round-up post on indexing.
1. First, let’s review some basics. Classically:
- An index is a DBMS data structure that you probe to discover where to find the data you really want.
- Indexes make data retrieval much more selective and hence faster.
- While indexes make queries cheaper, they make writes more expensive — because when you write data, you need to update your index as well.
- Indexes also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for “zero-DBA” RDBMS installations.)
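The read/write trade-off in those basics can be made concrete with a toy example. This is a minimal sketch, not how any real DBMS implements indexes: the `IndexedTable` class, its single hash-style index, and the `index_updates` counter are all invented for illustration.

```python
from collections import defaultdict


class IndexedTable:
    """Toy row store with one optional secondary index on a named column."""

    def __init__(self, indexed_column=None):
        self.rows = []                      # rows, addressed by position
        self.indexed_column = indexed_column
        self.index = defaultdict(list)      # column value -> row positions
        self.index_updates = 0              # extra write work caused by the index

    def insert(self, row):
        position = len(self.rows)
        self.rows.append(row)
        if self.indexed_column is not None:
            # Write-side cost: every insert also has to maintain the index.
            self.index[row[self.indexed_column]].append(position)
            self.index_updates += 1

    def find(self, column, value):
        """Return (matching rows, number of rows examined)."""
        if column == self.indexed_column:
            # Probe the index to learn where the wanted rows live.
            positions = self.index.get(value, [])
            return [self.rows[p] for p in positions], len(positions)
        # No usable index: fall back to scanning every row.
        matches = [r for r in self.rows if r[column] == value]
        return matches, len(self.rows)


table = IndexedTable(indexed_column="city")
for i in range(1000):
    table.insert({"id": i, "city": "Boston" if i % 100 == 0 else "Other"})

by_index, examined_with_index = table.find("city", "Boston")
by_scan, examined_with_scan = table.find("id", 500)
```

The indexed lookup examines only the rows it returns, while the unindexed lookup examines the whole table; the price is one index update per insert, which is what `index_updates` counts.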
2. Further: Read more
Categories: Data warehousing, Database compression, GIS and geospatial, Google, MapReduce, McObject, MemSQL, MySQL, ScaleDB, solidDB, Sybase, Text, Tokutek and TokuDB | 18 Comments
I chatted with the MariaDB folks on Tuesday. Let me start by noting:
- MariaDB, the product, is a MySQL fork.
- MariaDB, product and company alike, are essentially a reaction to Oracle’s acquisition of MySQL. A lot of the key players are previously from MySQL.
- MariaDB, the company, is the former SkySQL …
- … which acquired or is the surviving entity of a merger with The Monty Program, which originated MariaDB. According to Wikipedia, something called the MariaDB Foundation is also in the mix.
- I get the impression SkySQL mainly provided services around MySQL, especially remote DBA.
- It appears that a lot of MariaDB’s technical differentiation going forward is planned to be in a companion product called MaxScale, which was released into Version 1.0 general availability earlier this year.
The numbers around MariaDB are a little vague. I was given the figure that there were ~500 customers total, but I couldn’t figure out what they were customers for. Remote DBA services? MariaDB support subscriptions? Something else? I presume there are some customers in each category, but I don’t know the mix. Other notes on MariaDB the company are:
- ~80 people in ~15 countries.
- 20-25 engineers, which hopefully doesn’t count a few field support people.
- “Tiny” headquarters in Helsinki.
- Business leadership growing in the US and especially the SF area.
MariaDB, the company, also has an OEM business. Part of their pitch is licensing for connectors — specifically LGPL — that hopefully gets around some of the legal headaches for MySQL engine suppliers.
MaxScale is a proxy, which starts out by intercepting and parsing MariaDB queries. Read more
Categories: Database compression, Hadoop, IBM and DB2, Market share and customer counts, Mid-range, MySQL, Open source, Tokutek and TokuDB, Transparent sharding | 1 Comment
I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:
1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:
- Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
- Replicating an entire database into different geographies, often for reasons of latency and/or availability/disaster recovery.
The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.
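To pin the terminology down, here is a minimal sketch of the two cases and the write-acceptance sub-case. The region names, the `home_region` field and the function names are hypothetical, and treating a record's home region as its single master is an assumption made only for the example.

```python
REGIONS = ["us", "eu", "apac"]   # hypothetical region names


def partitioned_placement(record):
    """Case 1: each record is stored only in its home region (e.g. privacy rules)."""
    return [record["home_region"]]


def replicated_placement(record):
    """Case 2: every record is copied to every region (latency, availability)."""
    return list(REGIONS)


def writable_regions(record, multi_master):
    """Sub-case of replication: which copies may accept the first write."""
    if multi_master:
        return list(REGIONS)             # active-active: any copy takes writes
    return [record["home_region"]]       # one clear master for this part of the data


record = {"id": 42, "home_region": "eu"}
```

For this record, `partitioned_placement` yields only the home region, `replicated_placement` yields all three regions, and `writable_regions` distinguishes the multi-master case from the single-master one.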
What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.
2. Three years ago I posted about agile (predictive) analytics. One of the points was:
… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.
Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
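Bandit testing can be illustrated with an epsilon-greedy loop, which is one common approach among several. Nothing here reflects how WibiData or its clients actually run experiments; the conversion rates, the `epsilon` value and the function name are invented for the sketch.

```python
import random


def epsilon_greedy(true_rates, rounds=10000, epsilon=0.1, seed=0):
    """Simulate showing one of several offers per round and learning as we go.

    true_rates are the hidden conversion rates of each offer (made up here);
    returns how often each offer was shown and how often it converted.
    """
    rng = random.Random(seed)
    shows = [0] * len(true_rates)
    wins = [0] * len(true_rates)
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in shows:
            # Explore: try a random offer (also used until every offer has data).
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: show the offer with the best observed conversion rate.
            arm = max(range(len(true_rates)), key=lambda a: wins[a] / shows[a])
        shows[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return shows, wins


shows, wins = epsilon_greedy([0.02, 0.05, 0.20])
```

Unlike a fixed A/B split, the loop shifts traffic toward whatever currently looks best while still spending a fraction `epsilon` of rounds gathering fresh information, which is the point of the quoted passage: each change you make generates data your earlier models did not have.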
3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with: Read more
As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.
DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.
In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.
Buyer inertia is a greater concern.
- A significant minority of enterprises are highly committed to their enterprise DBMS standards.
- Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
- FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.
A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.
- First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
- Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
- Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
- Meanwhile, Hadoop has emerged as a serious competitor for at least some analytic data management, especially but not only at internet companies.
Otherwise I’d say: Read more
Relational DBMS used to be fairly straightforward product suites, which boiled down to:
- A big SQL interpreter.
- A bunch of administrative and operational tools.
- Some very optional add-ons, often including an application development tool.
Now, however, most RDBMS are sold as part of something bigger.
- Oracle has hugely thickened its stack, as part of an Innovator’s Solution strategy — hardware, middleware, applications, business intelligence, and more.
- IBM has moved aggressively to a bundled “appliance” strategy. Even before that, IBM DB2 long sold much better to committed IBM accounts than as a software-only offering.
- Microsoft SQL Server is part of a stack, starting with the Windows operating system.
- Sybase was an exception to this rule, with thin(ner) stacks for both Adaptive Server Enterprise and Sybase IQ. But Sybase is now owned by SAP, and increasingly integrated as a business with …
- … SAP HANA, which is closely associated with SAP’s applications.
- Teradata has always been a hardware/software vendor. The most successful of its analytic DBMS rivals, in some order, are:
  - Netezza, a pure appliance vendor, now part of IBM.
  - Greenplum, an appliance-mainly vendor for most (not all) of its existence, and in particular now as a part of EMC Pivotal.
  - Vertica, more of a software-only vendor than the others, but now owned by and increasingly mainstreamed into hardware vendor HP.
- MySQL’s glory years were as part of the “LAMP” stack.
- Various thin-stack RDBMS that once were or could have been important market players … aren’t. Examples include Progress OpenEdge, IBM Informix, and the various strays adopted by Actian.
The general Tokutek strategy has always been:
- Write indexes efficiently, which …
- … makes it reasonable to have more indexes, which …
- … lets more queries run fast.
But the details of “write indexes efficiently” have been hard to nail down. For example, my post about Tokutek indexing last January, while not really mistaken, is drastically incomplete.
Adding further confusion is that Tokutek now has two product lines:
- TokuDB, a MySQL storage engine.
- TokuMX, in which the parts of MongoDB 2.2 that roughly equate to a storage engine are ripped out and replaced with Tokutek code.
TokuMX further adds language support for transactions and a rewrite of MongoDB’s replication code.
So let’s try again. I had a couple of conversations with Martin Farach-Colton, who:
- Is a Tokutek co-founder.
- Stayed in academia.
- Is a data structures guy, not a database expert per se.
The core ideas of Tokutek’s architecture start: Read more
I lampoon the word “disruptive” for being badly overused. On the other hand, I often refer to the concept myself. Perhaps I should clarify.
- Market leaders serve high-end customers with complex, high-end products and services, often distributed through a costly sales channel.
- Upstarts serve a different market segment, often cheaply and/or simply, perhaps with a different business model (e.g. a different sales channel).
- Upstarts expand their offerings, and eventually attack the leaders in their core markets.
In response (this is the Innovator’s Solution part):
- Leaders expand their product lines, increasing the value of their offerings in their core markets.
- In particular, leaders expand into adjacent market segments, capturing margins and value even if their historical core businesses are commoditized.
- Leaders may also diversify into direct competition with the upstarts, but that generally works only if it’s via a separate division, perhaps acquired, that has permission to compete hard with the main business.
But not all cleverness is “disruption”.
- Routine product advancement by leaders — even when it’s admirably clever — is “sustaining” innovation, as opposed to the disruptive stuff.
- Innovative new technology from small companies is not, in itself, disruption either.
Here are some of the examples that make me think of the whole subject. Read more
Categories: Business intelligence, Data warehousing, Hadoop, Microsoft and SQL*Server, MongoDB, MySQL, Netezza, NewSQL, NoSQL, Oracle, Predictive modeling and advanced analytics, QlikTech and QlikView, Tableau Software | 13 Comments
The third of my three MySQL-oriented clients I alluded to yesterday is MemSQL. When I wrote about MemSQL last June, the product was an in-memory single-server MySQL workalike. Now scale-out has been added, with general availability today.
MemSQL’s flagship reference is Zynga, across 100s of servers. Beyond that, the company claims (to quote a late draft of the press release):
Enterprises are already using distributed MemSQL in production for operational analytics, network security, real-time recommendations, and risk management.
All four of those use cases fit MemSQL’s positioning in “real-time analytics”. Besides Zynga, MemSQL cites penetration into traditional low-latency markets — financial services (various subsectors) and ad-tech.
Highlights of MemSQL’s new distributed architecture start: Read more
Categories: Clustering, Database compression, Emulation, transparency, portability, Games and virtual worlds, Investment research and trading, Log analysis, MemSQL, MySQL, NewSQL, Transparent sharding, Zynga | 6 Comments