Analysis of technology that compresses data within a database management system. Related subjects include:
Indexes are central to database management.
- My first-ever stock analyst report, in 1982, correctly predicted that index-based DBMS would supplant linked-list ones …
- … and to this day, if one wants to retrieve a small fraction of a database, indexes are generally the most efficient way to go.
- Recently, I’ve had numerous conversations in which indexing strategies played a central role.
Perhaps it’s time for a round-up post on indexing.
1. First, let’s review some basics. Classically:
- An index is a DBMS data structure that you probe to discover where to find the data you really want.
- Indexes make data retrieval much more selective and hence faster.
- While indexes make queries cheaper, they make writes more expensive — because when you write data, you need to update your index as well.
- Indexes also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for “zero-DBA” RDBMS installations.)
2. Further: Read more
|Categories: Data warehousing, Database compression, GIS and geospatial, Google, MapReduce, McObject, MemSQL, MySQL, ScaleDB, solidDB, Sybase, Text, Tokutek and TokuDB||17 Comments|
I chatted with the MariaDB folks on Tuesday. Let me start by noting:
- MariaDB, the product, is a MySQL fork.
- MariaDB, product and company alike, are essentially a reaction to Oracle’s acquisition of MySQL. A lot of the key players are previously from MySQL.
- MariaDB, the company, is the former SkySQL …
- … which acquired or is the surviving entity of a merger with The Monty Program, which originated MariaDB. According to Wikipedia, something called the MariaDB Foundation is also in the mix.
- I get the impression SkySQL mainly provided services around MySQL, especially remote DBA.
- It appears that a lot of MariaDB’s technical differentiation going forward is planned to be in a companion product called MaxScale, which was released into Version 1.0 general availability earlier this year.
The numbers around MariaDB are a little vague. I was given the figure that there were ~500 customers total, but I couldn’t figure out what they were customers for. Remote DBA services? MariaDB support subscriptions? Something else? I presume there are some customers in each category, but I don’t know the mix. Other notes on MariaDB the company are:
- ~80 people in ~15 countries.
- 20-25 engineers, which hopefully doesn’t count a few field support people.
- “Tiny” headquarters in Helsinki.
- Business leadership growing in the US and especially the SF area.
MariaDB, the company, also has an OEM business. Part of their pitch is licensing for connectors — specifically LGPL — that hopefully gets around some of the legal headaches for MySQL engine suppliers.
MaxScale is a proxy, which starts out by intercepting and parsing MariaDB queries. Read more
|Categories: Database compression, Hadoop, IBM and DB2, Market share and customer counts, Mid-range, MySQL, Open source, Tokutek and TokuDB, Transparent sharding||1 Comment|
- Question: Why do policemen work in pairs?
- Answer: One to read and one to write.
A lot has happened in MongoDB technology over the past year. For starters:
- The big news in MongoDB 3.0* is the WiredTiger storage engine. The top-level claims for that are that one should “typically” expect (individual cases can of course vary greatly):
- 7-10X improvement in write performance.
- No change in read performance (which however was boosted in MongoDB 2.6).
- ~70% reduction in data size due to compression (disk only).
- ~50% reduction in index size due to compression (disk and memory both).
- MongoDB has been adding administration modules.
- A remote/cloud version came out with, if I understand correctly, MongoDB 2.6.
- An on-premise version came out with 3.0.
- They have similar features, but are expected to grow apart from each other over time. They have different names.
*Newly-released MongoDB 3.0 is what was previously going to be MongoDB 2.8. My clients at MongoDB finally decided to give a “bigger” release a new first-digit version number.
To forestall confusion, let me quickly add: Read more
|Categories: Database compression, Hadoop, Humor, In-memory DBMS, MongoDB and 10gen, NoSQL, Open source, Structured documents, Sybase||7 Comments|
I have a small blacklist of companies I won’t talk with because of their particularly unethical past behavior. Actian is one such; they evidently made stuff up about me that Josh Berkus gullibly posted for them, and I don’t want to have conversations that could be dishonestly used against me.
That said, Peter Boncz isn’t exactly an Actian employee. Rather, he’s the professor who supervised Marcin Zukowski’s PhD thesis that became Vectorwise, and I chatted with Peter by Skype while he was at home in Amsterdam. I believe his assurances that no Actian personnel sat in on the call.
In other news, Peter is currently working on and optimistic about HyPer. But we literally spent less than a minute talking about that
Before I get to the substance, there’s been a lot of renaming at Actian. To quote Andrew Brust,
… the ParAccel, Pervasive and Vectorwise technologies are being unified under the Actian Analytics Platform brand. Specifically, the ParAccel technology … is being re-branded Actian Matrix; Pervasive’s technologies are rechristened Actian DataFlow and Actian DataConnect; and Vectorwise becomes Actian Vector.
Actian … is now “one company, with one voice and one platform” according to its John Santaferraro
The bolded part of the latter quote is untrue — at least in the ordinary sense of the word “one” — but the rest can presumably be taken as company gospel.
All this is by way of preamble to saying that Peter reached out to me about Actian’s new Vector Hadoop Edition when he blogged about it last June, and we finally talked this week. Highlights include: Read more
|Categories: Actian and Ingres, Clustering, Database compression, Hadoop, ParAccel, Pervasive Software, SQL/Hadoop integration, VectorWise, Workload management||4 Comments|
One of my lesser-known clients is Citus Data, a largely Turkish company that is however headquartered in San Francisco. They make CitusDB, which puts a scale-out layer over a collection of fully-functional PostgreSQL nodes, much like Greenplum and Aster Data before it. However, in contrast to those and other Postgres-based analytic MPP (Massively Parallel Processing) DBMS:
- CitusDB does not permanently fork PostgreSQL; Citus Data has committed to always working with the latest PostgreSQL release, or at least with one that’s less than a year old.
- Citus Data never made the “fat head” mistake — if a join can’t be executed directly on the CitusDB data-storing nodes, it can’t be executed in CitusDB at all.
- CitusDB follows the modern best-practice of having many virtual nodes on each physical node. Default size of a virtual node is one gigabyte. Each virtual node is technically its own PostgreSQL table.*
- Citus Data has already introduced an open source column-store option for PostgreSQL, which CitusDB of course exploits.
*One benefit to this strategy, besides the usual elasticity and recovery stuff, is that while PostgreSQL may be single-core for any given query, a CitusDB query can use multiple cores by virtue of hitting multiple PostgreSQL tables on each node.
Citus has thrown a few things against the wall; for example, there are two versions of its product, one which involves HDFS (Hadoop Distributed File System) and one of which doesn’t. But I think Citus’ focus will be scale-out PostgreSQL for at least the medium-term future. Citus does have actual customers, and they weren’t all PostgreSQL users previously. Still, the main hope — at least until the product is more built-out — is that existing PostgreSQL users will find CitusDB easy to adopt, in technology and price alike.
|Categories: Aster Data, Citus Data, Columnar database management, Data warehousing, Database compression, Greenplum, Hadoop, Parallelization, PostgreSQL, SQL/Hadoop integration, Transparent sharding, Workload management||6 Comments|
I stopped by MemSQL last week, and got a range of new or clarified information. For starters:
- Even though MemSQL (the product) was originally designed for OLTP (OnLine Transaction Processing), MemSQL (the company) is now focused on analytic use cases …
- … which was the point of introducing MemSQL’s flash-based columnar option.
- One MemSQL customer has a 100 TB “data warehouse” installation on Amazon.
- Another has “dozens” of terabytes of data spread across 500 machines, which aggregate 36 TB of RAM.
- At customer Shutterstock, 1000s of non-MemSQL nodes are monitored by 4 MemSQL machines.
- A couple of MemSQL’s top references are also Vertica flagship customers; one of course is Zynga.
- MemSQL reports encountering Clustrix and VoltDB in a few competitive situations, but not NuoDB. MemSQL believes that VoltDB is still hampered by its traditional issues — Java, reliance on stored procedures, etc.
On the more technical side: Read more
|Categories: Clustering, Clustrix, Columnar database management, Data warehousing, Database compression, In-memory DBMS, MemSQL, NewSQL, NuoDB, Specific users, Vertica Systems, VoltDB and H-Store, Workload management, Zynga||18 Comments|
I caught up with my clients at MongoDB to discuss the recent MongoDB 2.6, along with some new statements of direction. The biggest takeaway is that the MongoDB product, along with the associated MMS (MongoDB Management Service), is growing up. Aspects include:
- An actual automation and management user interface, as opposed to the current management style, which is almost entirely via scripts (except for the monitoring UI).
- That’s scheduled for public beta in May, and general availability later this year.
- It will include some kind of integrated provisioning with VMware, OpenStack, et al.
- One goal is to let you apply database changes, software upgrades, etc. without taking the cluster down.
- A reasonable backup strategy.
- A snapshot copy is made of the database.
- A copy of the log is streamed somewhere.
- Periodically — the default seems to be 6 hours — the log is applied to create a new current snapshot.
- For point-in-time recovery, you take the last snapshot prior to the point, and roll forward to the desired point.
- A reasonable locking strategy!
- Document-level locking is all-but-promised for MongoDB 2.8.
- That means what it sounds like. (I mention this because sometimes an XML database winds up being one big document, which leads to confusing conversations about what’s going on.)
- Security. My eyes glaze over at the details, but several major buzzwords have been checked off.
- A general code rewrite to allow for (more) rapid addition of future features.
Memory-centric data management is confusing. And so I’m going to clarify a couple of things about MemSQL 3.0 even though I don’t yet have a lot of details.* They are:
- MemSQL has historically been an in-memory row store, which as of last year scales out.
- It turns out that the MemSQL row store actually has two table types. One is scaled out. The other — called “reference” — is replicated on every node.
- MemSQL has now added a third table type, which is columnar and which resides in flash memory.
- If you want to keep data in, for example, both the scale-out row store and the column store, you’d have to copy/replicate it within MemSQL. And if you wanted to access data from both versions at once (e.g. because different copies cover different time periods), you’d likely have to do a UNION or something like that.
*MemSQL’s first columnar offering sounds pretty basic; for example, there’s no columnar compression yet. (Edit: Oops, that’s not accurate. See comment below.) But at least they actually have one, which puts them ahead of many other row-based RDBMS vendors that come to mind.
And to hammer home the contrast:
- IBM, Oracle and Microsoft, which all sell row-based DBMS meant to run on disk or other persistent storage, have added or will add columnar options that run in RAM.
- MemSQL, which sells a row-based DBMS that runs in RAM, has added a columnar option that runs in persistent solid-state storage.
|Categories: Columnar database management, Database compression, In-memory DBMS, MemSQL, Solid-state memory||12 Comments|
The general Tokutek strategy has always been:
- Write indexes efficiently, which …
- … makes it reasonable to have more indexes, which …
- … lets more queries run fast.
But the details of “writes indexes efficiently” have been hard to nail down. For example, my post about Tokutek indexing last January, while not really mistaken, is drastically incomplete.
Adding further confusion is that Tokutek now has two product lines:
- TokuDB, a MySQL storage engine.
- TokuMX, in which the parts of MongoDB 2.2 that roughly equate to a storage engine are ripped out and replaced with Tokutek code.
TokuMX further adds language support for transactions and a rewrite of MongoDB’s replication code.
So let’s try again. I had a couple of conversations with Martin Farach-Colton, who:
- Is a Tokutek co-founder.
- Stayed in academia.
- Is a data structures guy, not a database expert per se.
The core ideas of Tokutek’s architecture start: Read more
I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:
- Hortonworks founder J. Eric “Eric14″ Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)
- John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.
- ~250 employees.
- ~70-75 subscription customers.
Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:
- 10ish nodes for a typical starting cluster.
- 100ish nodes for a typical “data lake” committed adoption.
- Teradata UDA (Unified Data Architecture)* customers sometimes (typically?) jumping straight to a data lake scenario.
- A few users in the 10s of 1000s of nodes. (Obviously Yahoo is one.)
- HBase used in >50% of installations.
- Hive probably even more than that.
- Hortonworks is seeing a fair amount of interest in Windows Hadoop deployments.
*By the way — Teradata seems serious about pushing the UDA as a core message.
Ecosystem notes, in Hortonworks’ perception, included:
- Cloudera is obviously Hortonworks’ biggest distro competitor. Next is IBM, presumably in its blue-forever installed base. MapR is barely on the radar screen; Pivotal’s likely rise hasn’t yet hit sales reports.
- Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.
- Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)
I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.
Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:
- It’s been in preview/release candidate/commercial beta mode for weeks.
- Q3 is the goal; H2 is the emphatic goal.
- Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)
- The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to serve for a similar multi-year period as Hadoop 1’s have. But there also was some YARN stabilization into May.
Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and it’s more than a year between Cloudera starting supporting them and when Hortonworks is offering Hadoop 2.0.
Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include: Read more