Analysis of storage technologies, especially in the context of database management. Related subjects include:
Three months ago, I pointed out that it is hard to generalize about memory-centric database management, because there are so many different kinds. That said, there are some basic points that I’d like to record as background for any future discussion of the subject, focusing on differences between disk and RAM. And while I’m at it, I’ll throw in a few comments about flash memory as well.
This post would probably be better if I had actual numbers for the speeds of various kinds of silicon operations, but I’ll do what I can without them.
For most purposes, database speed is a function of a few kinds of number:
- CPU cycles consumed.
- I/O throughput.
- I/O wait time.
- Network throughput.
- Network wait time.
The amount of storage used is also important, both directly — storage hardware costs money — and because if you save storage via compression, you may get corresponding benefits in I/O. Power consumption and similar costs are usually tied to hardware efficiency; the less gear you use, the less floor space and cooling you may be able to get away with.
When databases move to RAM from spinning disk, major consequences include: Read more
|Categories: Database compression, Memory-centric data management, Solid-state memory, solidDB||6 Comments|
SanDisk has acquired my client Schooner Information Technology. Notes on that include:
- Schooner used to be a flash-based appliance company.
- Then Schooner pivoted to be a database software company with strong flash expertise.
- Then Schooner pivoted further to emphasize general modern OLTP (OnLine Transaction Processing) clustered goodness.
- SanDisk makes flash memory. That’s the fit.
- Specifically, Schooner is being put in the division that grew out of the acquisition of Pliant, which makes solid-state disks for database applications, and gets rave reviews from Teradata.
- Schooner had a few dozen customers, but not a lot of evident traction. Hence, I would imagine, the acquisition.
That’s about all I have at this time.
|Categories: Market share and customer counts, Schooner Information Technology, Solid-state memory||3 Comments|
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production.
- In general, how “enterprise-ready” is Hadoop?
- HBase 0.92 (this post)
As part of my recent round of Hadoop research, I talked with Cloudera’s Todd Lipcon. Naturally, one of the subjects was HBase, and specifically HBase 0.92. I gather that the major themes to HBase 0.92 are:
- Performance, scalability, and so on.
- “Coprocessors”, which are like triggers or stored procedures.
- Security, as the first major application of co-processors.
HBase coprocessors are Java code that links straight into HBase. As with other DBMS extensions of the “links straight into the DBMS code” kind,* HBase coprocessors seem best suited for very sophisticated users and third parties.** Evidently, coprocessors have already been used to make HBase security more granular — role-based, per-column-family/per-table, etc. Further, Todd thinks coprocessors could serve as a good basis for future HBase enhancements in areas such as aggregation or secondary indexing. Read more
|Categories: Benchmarks and POCs, Cloudera, Hadoop, HBase, MapReduce, NoSQL, Open source, Storage, Theory and architecture||2 Comments|
Cloudant is one of the few NoSQL companies with >100 paying subscription customers. For starters:
- Cloudant’s core software is a fork of CouchDB.
- Cloudant only sells you software as a service.
- More precisely, whether Cloudant offers DBaaS (DataBase as a Service) or PaaS (Platform as a Service) or a “data layer” (Cloudant’s preferred terminology) depends on your taste in buzzwords.
- I gather that Cloudant (the company) wants to handle pretty much all your data management needs. But Cloudant (the product) isn’t there yet, especially on the analytic side.
- Before CouchDB and Membase joined together, Cloudant was positioned as the big(ger) data version of CouchDB.
Company demographics include:
- Cloudant is based in Boston.
- Cloudant started out as a Y Combinator company in 2008, and “got serious” in 2009.
- Cloudant now has ~20 employees.
- Management hires include a couple of former Vertica guys.
The Cloudant guys gave me some customer counts in May that weren’t much higher than those they gave me in February, and seem to have forgotten to correct the discrepancy. Oh well. The latter (probably understated) figures included ~160 paying customers, of which:
- ~100 were multitenant.
- ~60 were single tenant.
- 1 was on-premise (but still managed by Cloudant) because of privacy concerns.
The largest Cloudant deployments seem to be in the 10s of terabytes, across a very low double digit number of servers.
|Categories: Cloudant, Clustering, Couchbase, CouchDB, MapReduce, Market share and customer counts, NoSQL, Pricing, Specific users, Storage||2 Comments|
Shortly before Tuesday’s launch of DB2 10, IBM’s Conor O’Mahony checked in for a relatively non-technical briefing.* More precisely, this is about DB2 for “distributed” systems, aka LUW (Linux/Unix/Windows); some of the features have already been in the mainframe version of DB2 for a while. IBM is graciously permitting me to post the associated DB2 10 announcement slide deck.
*I hope any errors in interpretation are minor.
Major aspects of DB2 10 include new or improved capabilities in the areas of:
- Analytic query performance.
- Data ingest.
- Multi-temperature data management.
- Workload management.
- Graph management/relationship analytics.
- Time-travel, bitemporal features, and bitemporal time-travel.
Of course, there are various other enhancements too, including to security (fine-grained access control), Oracle compatibility, and DB2 pureScale. Everything except the pureScale part is also reflected in IBM InfoSphere Warehouse, which is a near-superset of DB2.*
*Also, the data ingest part isn’t in base DB2.
|Categories: Data warehousing, Database compression, IBM and DB2, RDF and graphs, Solid-state memory, Workload management||6 Comments|
I love talking with Carson Schmidt, chief of Teradata’s hardware engineering (among other things), even if I don’t always understand the details of what he’s talking about. It had been way too long since our last chat, so I requested another one. We were joined by Keith Muller, who I presume is pictured here. Takeaways included:
- Teradata performance growth was slow in the early 2000s, but has accelerated since then; Intel gets a lot of the credit (and blame) for that.
- Carson hopes for a performance “discontinuity” with Intel Ivy Bridge.
- Teradata is not afraid to use niche special-purpose chips.
- Teradata’s views can be taken as well-informed endorsements of InfiniBand and SAS 2.0.
|Categories: Data warehouse appliances, Data warehousing, Database compression, Solid-state memory, Storage, Teradata||13 Comments|
MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:
- More-of-the-same in line with MarkLogic’s core positioning.
- A new bi-directional Hadoop connector.
- A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post.
Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called Isys. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.
*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.
MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:
- MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of the MarkLogic deck) …
- … which has been optimized from the getgo for poly-structured data.
- MarkLogic can and does scale out to handle large amounts of data.
- MarkLogic is a general-purpose DBMS, suitable for both short-request and analytic tasks.
- MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about derived data).
- MarkLogic often plays the role of a content assembler and/or search engine, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.
Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.
|Categories: Hadoop, Market share and customer counts, MarkLogic, Scientific research, Solid-state memory, Structured documents, Text||1 Comment|
It is widely rumored that there will be a leadership change at HP (Meg Whitman in, Leo Apotheker out). In connection with that, I found myself holding forth on points such as:
- HP needs to make outstanding enterprise systems again.
- They fell away from that target under Mark Hurd, but they surely can hit it again, based on the remnants of DEC (Digital Equipment Corporation), Tandem, the higher-end part of Compaq, and of course the original HP systems group.
- In particular:
- Rumors say that Oracle Exadata 1 boxes, made by HP, were much lower quality than Exadata 2 boxes made by Sun.
- HP Neoview was a waste of good engineering talent.
- I’d like to see a few excellent Vertica appliances.
- I hope the SAP HANA appliances go well, whenever HANA finally becomes a serious product.
- The general move from disk to solid-state memory should offer some opportunities.
Once again, I’m working with an OLTP SaaS vendor client on the architecture for their next-generation system. Parameters include:
- 100s of gigabytes of data at first, growing to >1 terabyte over time.
- High peak loads.
- Public cloud portability (but they have private data centers they can use today).
- Simple database design — not a lot of tables, not a lot of columns, not a lot of joins, and everything can be distributed on the same customer_ID key.
- Stream the data to a data warehouse, that will grow to a few terabytes. (Keeping only one year of OLTP data online actually makes sense in this application, but of course everything should go into the DW.)
So I’m leaning to saying: Read more
|Categories: Analytic technologies, Cloud computing, Clustering, Data warehousing, dbShards and CodeFutures, Facebook, Infobright, MySQL, OLTP, Open source, Parallelization, Software as a Service (SaaS), Solid-state memory||13 Comments|
Kaminario, which used to be in the business of solid state storage via DRAM, now is emphasizing hybrid DRAM/flash storage appliances instead. The reason is evidently price. Per terabyte of primary storage (before mirroring onto disk and so on):
- A Kaminario K2 DRAM-only appliance costs $100K.
- A Kaminario K2 flash-only appliance costs $30K (but nobody buys that configuration).
- A typical Kaminario K2 hybrid DRAM/flash appliance might cost $35K (which tells us that there’s a lot more flash than DRAM).
Kaminario positions DRAM as where you focus your most write-intensive/ bottlenecking loads, such as logging or temp space, with the primary benefit being performance and a secondary benefit being slowing the wear on your flash.