Analysis of cloud computing, especially as applied to database management and analytics. Related subjects include:
I talked Friday with Deep Information Sciences, makers of DeepDB. Much like TokuDB — albeit with different technical strategies — DeepDB is a single-server DBMS in the form of a MySQL engine, whose technology is concentrated around writing indexes quickly. That said:
- DeepDB’s indexes can help you with analytic queries; hence, DeepDB is marketed as supporting OLTP (OnLine Transaction Processing) and analytics in the same system.
- DeepDB is marketed as “designed for big data and the cloud”, with reference to “Volume, Velocity, and Variety”. What I could discern in support of that is mainly:
- DeepDB has been tested at up to 3 terabytes at customer sites and up to 1 billion rows internally.
- Like most other NewSQL and NoSQL DBMS, DeepDB is append-only, and hence could be said to “stream” data to disk.
- DeepDB’s indexes could at some point in the future be made to work well with non-tabular data.*
- The Deep guys have plans and designs for scale-out — transparent sharding and so on.
*For reasons that do not seem closely related to product reality, DeepDB is marketed as if it supports “unstructured” data today.
Other NewSQL DBMS seem “designed for big data and the cloud” to at least the same extent DeepDB is. However, if we’re interpreting “big data” to include multi-structured data support — well, only half or so of the NewSQL products and companies I know of share Deep’s interest in branching out. In particular:
- Akiban definitely does. (Note: Stay tuned for some next-steps company news about Akiban.)
- Tokutek has planted a small stake there too.
- Key-value-store-backed NuoDB and GenieDB probably leans that way. (And SanDisk evidently shut down Schooner’s RDBMS while keeping its key-value store.)
- VoltDB, Clustrix, ScaleDB and MemSQL seem more strictly tabular, except insofar as text search is a requirement for everybody. (Edit: Oops; I forgot about Clustrix’s approach to JSON support.)
Edit: MySQL has some sort of an optional NoSQL interface, and hence so presumably do MySQL-compatible TokuDB, GenieDB, Clustrix, and MemSQL.
Also, some of those products do not today have the transparent scale-out that Deep plans to offer in the future.
- The trend to clustered computing is sustainable.
- The trend to appliances is also sustainable.
- The “single” enterprise cluster is almost as much of a pipe dream as the single enterprise database.
I shall explain.
Arguments for hosting applications on some kind of cluster include:
- If the workload requires more than one server — well, you’re in cluster territory!
- If the workload requires less than one server — throw it into the virtualization pool.
- If the workload is uneven — throw it into the virtualization pool.
Arguments specific to the public cloud include:
- A large fraction of new third-party applications are SaaS (Software as a Service). Those naturally live in the cloud.
- Cloud providers have efficiencies that you don’t.
That’s all pretty compelling. However, these are not persuasive reasons to put everything on a SINGLE cluster or cloud. They could as easily lead you to have your VMware cluster and your Exadata rack and your Hadoop cluster and your NoSQL cluster and your object storage OpenStack cluster — among others — all while participating in several different public clouds as well.
Why would you not move work into a cluster at all? First, if ain’t broken, you might not want to fix it. Some of the cluster options make it easy for you to consolidate existing workloads — that’s a central goal of VMware and Exadata — but others only make sense to adopt in connection with new application projects. Second, you might just want device locality. I have a gaming-class PC next to my desk; it drives a couple of monitors; I like that arrangement. Away from home I carry a laptop computer instead. Arguments can be made for small remote-office servers as well.
|Categories: Cloud computing, Clustering, Data warehouse appliances, Exadata, NoSQL, Software as a Service (SaaS)||2 Comments|
1. It boggles my mind that some database technology companies still don’t view compression as a major issue. Compression directly affects storage and bandwidth usage alike — for all kinds of storage (potentially including RAM) and for all kinds of bandwidth (network, I/O, and potentially on-server).
Trading off less-than-maximal compression so as to minimize CPU impact can make sense. Having no compression at all, however, is an admission of defeat.
2. People tend to misjudge Hadoop’s development pace in either of two directions. An overly expansive view is to note that some people working on Hadoop are trying to make it be all things for all people, and to somehow imagine those goals will soon be achieved. An overly narrow view is to note an important missing feature in Hadoop, and think there’s a big business to be made out of offering it alone.
At this point, I’d guess that Cloudera and Hortonworks have 500ish employees combined, many of whom are engineers. That allows for a low double-digit number of 5+ person engineering teams, along with a number of smaller projects. The most urgently needed features are indeed being built. On the other hand, a complete monument to computing will not soon emerge.
3. Schooner’s acquisition by SanDisk has led to the discontinuation of Schooner’s SQL DBMS SchoonerSQL. Schooner’s flash-optimized key-value store Membrain continues. I don’t have details, but the Membrain web page suggests both data store and cache use cases.
4. There’s considerable personnel movement at Boston-area database technology companies right now. Please ping me directly if you care.
I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:
- How big is your database? How big is your budget?
- How do you feel about appliances?
- How do you feel about the cloud?
- What are the size and shape of your workload?
- How fresh does the data need to be?
Let’s drill down. Read more
I must start by apologizing for giving a quote in a press release whose contents I deplore. Unlike occasions on which I’ve posted about inaccurate quotes, in this case the fault is mine. The quote is quite accurate. And NuoDB didn’t mislead me about the release’s contents; I just neglected to ask.
NuoDB evidently subscribes to the marketing fallacy:
- Big DBMS companies hit people repeatedly with marketing cudgels.
- We want to be a big DBMS company.
- Therefore we will hit people repeatedly with marketing cudgels too.
But to my taste, NuoDB’s worst travesty is not the deafening drumroll before launch (I asked off their mailing list months before), nor the claim that NuoDB’s launch would be a “big day” for the database industry (annoying but ordinary hype), nor the emergent flock of birds foofarah, nor even NuoDB’s overwrought benchmark marketing (distressingly many vendors do that).
Rather, I think NuoDB’s greatest marketing offense to date is its Codd-imitating “12 rules” for cloud database management. Read more
NuoDB has an interesting NewSQL story. NuoDB’s core design goals seem to be:
- Very flexible topology, including:
- Local replicas.
- Remote replicas.
- Easy deployment and management.
GenieDB is one of the newer and smaller NewSQL companies. GenieDB’s story is focused on wide-area replication and uptime, coupled to claims about ease and the associated low TCO (Total Cost of Ownership).
GenieDB is in my same family of clients as Cirro.
The GenieDB product is more interesting if we conflate the existing GenieDB Version 1 and a soon-forthcoming (mid-year or so) Version 2. On that basis:
- GenieDB has three tiers.
- GenieDB’s top tier is the usual MySQL front-end.
- GenieDB’s bottom tier is either Berkeley DB or a conventional MySQL storage engine.
- GenieDB’s bottom tier stores your entire database at every node.
- If you replicate locally, GenieDB’s middle tier operates a distributed cache.
- If you replicate wide-area, GenieDB’s middle tier allows active-active/multi-master replication.
The heart of the GenieDB story is probably wide-area replication. Specifics there include: Read more
|Categories: Cache, Cloud computing, Clustering, GenieDB, Market share and customer counts, MySQL, NewSQL||4 Comments|
Merv Adrian and Doug Henschen both reported more details about Amazon Redshift than I intend to; see also the comments on Doug’s article. I did talk with Rick Glick of ParAccel a bit about the project, and he noted:
- Amazon Redshift is missing parts of ParAccel, notably the extensibility framework.
- ParAccel did some engineering to make its DBMS run better in the cloud.
- Amazon did some engineering in the areas it knows better than ParAccel — cloud provisioning, cloud billing, and so on.
“We didn’t want to do the deal on those terms” comments from other companies suggest ParAccel’s main financial take from the deal is an already-reported venture investment.
The cloud-related engineering was mainly around communications, e.g. strengthening error detection/correction to make up for the lack of dedicated switches. In general, Rick seemed more positive on running in the (Amazon) cloud than analytic RDBMS vendors have been in the past.
So who should and will use Amazon Redshift? For starters, I’d say: Read more
|Categories: Amazon and its cloud, Business intelligence, Cloud computing, Data mart outsourcing, Data warehousing, Infobright, ParAccel, Predictive modeling and advanced analytics, Pricing, Vertica Systems||4 Comments|
In connection with Amazon’s Redshift announcement, ParAccel reached out, and so I talked with them for the first time in a long while. At the highest level:
- ParAccel now has 60+ customers, up from 30+ two years ago and 40ish soon thereafter.
- ParAccel is now focusing its development and marketing on analytic platform capabilities more than raw database performance.
- ParAccel is focusing on working alongside other analytic data stores — relational or Hadoop — rather than supplanting them.
There wasn’t time for a lot of technical detail, but I gather that the bit about working alongside other data stores:
- Is relatively new.
- Works via SELECT statements that reach out to the other data stores.
- Is called “on-demand integration”.
- Is built in ParAccel’s extensibility/analytic platform framework.
- Uses HCatalog when reaching into Hadoop.
Also, it seems that ParAccel:
- Is in the early stages of writing its own analytic functions.
- Bundles Fuzzy Logix and actually has some users for that.
|Categories: Amazon and its cloud, Cloud computing, Data warehousing, Hadoop, Market share and customer counts, ParAccel, Predictive modeling and advanced analytics, Specific users||5 Comments|
I chatted with Todd Papaioannou about his new company Continuuity. Todd is as handy at combining buzzwords as he is at concatenating vowels, and so Continuuity — with two “U”s — is making a big data fabric platform as a service with REST APIs that runs over Hadoop and HBase in the private or public clouds. I found the whole thing confusing, in that:
- I recoil against buzzwords. In particular …
- … I pay as little attention to distinctions among PaaS/IaaS/WaaS — Platform/Infrastructure/Whatever as a Service — as I can.
- The Continuuity story sounds Heroku-like, but Todd doesn’t want Continuuity compared to Heroku.
- Todd does want Continuuity discussed in terms of the application server category, but:
- It is hard to discuss app servers without segueing quickly amongst development, deployment, and data connectivity, and Continuuity is no exception to that rule.
- There is doubt as to whether using app servers makes any sense.
But all confusion aside, there are some interesting aspects to Continuuity. Read more
|Categories: Application servers, Cloud computing, Hadoop, HBase, MapReduce, Parallelization, Predictive modeling and advanced analytics, Software as a Service (SaaS)||6 Comments|