CodeFutures/dbShards update
I’ve been talking a fair bit with Cory Isaacson, CEO of my client CodeFutures, which makes dbShards. Business notes include:
- 7 production users, plus an 8th imminent.
- 12-14 signed contracts beyond that.
- ~160 servers in production.
- One customer who has almost 15 terabytes of data (in the cloud).
- Still <10 people, pretty much all engineers.
- Profitable, but looking to raise a bit of growth capital.
Apparently, the figure of 6 dbShards customers in July, 2010 is more comparable to today’s 20ish contracts than to today’s 7-8 production users. About 4 of the original 6 are in production now.
NDA stuff aside, the main technical subject we talked about is something Cory calls “relational sharding”. The point is that dbShards’ transparent sharding can be done in such a way as to make many joins be single-server. Specifically:
- When a table is sufficiently small to be replicated in full at every nodes, you can join on it without moving data across the network.
- When two tables are sharded on the same key, you can join on that key without moving data across the network.
dbShards can’t do cross-shard joins, but it can do distributed transactions comprising multiple updates. Cory argues persuasively that in almost all cases this is enough; but I see cross-shard joins as a feature that should someday be added to dbShards even so.
The real issue with dbShards’ transparent sharding is ensuring it’s really transparent. Cory regards as typical a customer with a couple thousand tables, who had to change a dozen or so SQL statements to implement dbShards. But there are near-term plans to automate the number of SQL changes needed down to 0. The essence of that change is this: Read more
Categories: Clustering, Data integration and middleware, Data models and architecture, dbShards and CodeFutures, Market share and customer counts, Transparent sharding | 1 Comment |
Notes on the ClearStory Data launch, including an inaccurate quote from me
ClearStory Data launched, with nice coverage in the New York Times, Computerworld, and elsewhere. But from my standpoint, there were some serious problems:
- (Bad.) I was planning to cover the launch as well, in a split exclusive, but that plan was changed, costing me considerable wasted work.
- (Worse.) I wasn’t told of the change as soon as it was known. Indeed, I wasn’t told at all; I was left to infer it from the fact that I was now being asked to talk with other reporters.
- (Horrific.) I was quoted in the ClearStory launch press release, but while the sentiments were reasonably in line with my own, the quote was incorrect.*
I’m utterly disgusted with this whole mess, although after talking with her a lot I’m fine with CEO Sharmila Mulligan’s part in it, which is to say with ClearStory’s part in general.
*I avoid the term “platform” as much as possible; indeed, I still don’t really know what the “new platforms” part was supposed to refer to. The Frankenquote wound up with some odd grammar as well.
Actually, in principle I’m a pretty close adviser to ClearStory (for starters, they’re one of my stealth-mode clients). That hasn’t really ramped up yet; in particular, I haven’t had a technical deep dive. So for now I’ll just say:
Categories: Business intelligence, ClearStory Data, Data integration and middleware, Data mart outsourcing | 1 Comment |
DataStax Enterprise 2.0
Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.
My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:
- Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
- Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair. 🙂
- Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
- Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
- log4j — an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
- DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.
DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:
- You can manage the interactive data for a web site.
- You can store the logs for that website.
- You can analyze all of the above in Hadoop.
Comments on Oracle’s third quarter 2012 earnings call
Various reporters have asked me about Oracle’s third quarter 2012 earnings conference call. Specific Q&A includes:
What did Oracle do to have its earnings beat Wall Street’s estimates?
Have a bad second quarter and then set Wall Street’s expectations too low for Q3. This isn’t about strong results; it’s about modest expectations.
Can Oracle be a leader in both hardware and software?
- It’s not inconceivable.
- The observation that Oracle, IBM, and Teradata all are pushing hardware-software combinations has been intriguing ever since IBM bought Netezza. (SAP really isn’t, however; ditto Microsoft.)
- I do think Oracle may be somewhat overoptimistic as to how cooperative the Sun user base will be in buying more high-end product and in paying more in maintenance for the gear they already have.
Beyond that, please see below.
What about Oracle in the cloud?
MySQL is an important cloud supplier. But Oracle overall hasn’t demonstrated much understanding of what cloud technology and business are all about. An expensive SaaS acquisition here or there could indeed help somewhat, but it seems as if Oracle still has a very long way to go.
Other comments
Other comments on the call, whose transcript is available, include: Read more
Categories: Cloud computing, Exadata, Humor, In-memory DBMS, Oracle, SAP AG, Software as a Service (SaaS) | 5 Comments |
Akiban update
I have a bunch of backlogged post subjects in or around short-request processing, based on ongoing conversations with my clients at Akiban, Cloudant, Code Futures (dbShards), DataStax (Cassandra) and others. Let’s start with Akiban. When I posted about Akiban two years ago, it was reasonable to say:
- Akiban is in the short-request DBMS business.
- MySQL compatibility is one way to access Akiban, but it’s not the whole story.
- Akiban’s main point of technical differentiation is to arrange data hierarchically on disk so that many joins are “zero-cost”.
- Walking the hierarchy isn’t a great way to get at data for every possible query; Akiban recognizes the need for other access techniques as well.
All of the above are still true. But unsurprisingly, plenty of the supporting details have changed. Read more
Categories: Akiban, Data models and architecture, MySQL, NewSQL, Object | 9 Comments |
Juggling analytic databases
I’d like to survey a few related ideas:
- Enterprises should each have a variety of different analytic data stores.
- Vendors — especially but not only IBM and Teradata — are acknowledging and marketing around the point that enterprises should each have a number of different analytic data stores.
- In addition to having multiple analytic data management technology stacks, it is also desirable to have an agile way to spin out multiple virtual or physical relational data marts using a single RDBMS. Vendors are addressing that need.
- Some observers think that the real essence of analytic data management will be in data integration, not the actual data management.
Here goes. Read more
Kinds of data integration and movement
“Data integration” can mean many different things, to an extent that’s impeding me from writing about the area. So I’ll start by simply laying out some of the myriad ways that data can be brought to where it is needed, and worry about other subjects later. Yes, this is a massive wall of text, and incomplete even so — but that in itself is my central point.
There are two main paradigms for data integration:
- Movement or replication — you take data from one place and copy it to another.
- Federation — you treat data in multiple different places logically as if it were all in one database.
Data movement and replication typically take one of three forms:
- Logical, transactional, or trigger-based — sending data across the wire every time an update happens, or as the result of a large-result-set query/extract, or in response to a specific request.
- Log-based — like logical replication, but driven by the transaction/update log rather than the core data management mechanism itself, so as to avoid directly overstressing the DBMS.
- Block/file-based — sending chunks of data, and expecting the target system to store them first and only make sense of them afterward.
Beyond the core functions of movement, replication, and/or federation, there are other concerns closely connected to data integration. These include:
- Transparency and emulation, e.g. via a layer of software that makes data in one format look like it’s in another. (If memory serves, this is the use case for which Larry DeBoever coined the term “middleware.”)
- Cleaning and quality — with new uses of data can come new requirements for accuracy.
- Master, reference, or canonical data —
- Archiving and information preservation — part of keeping data safe is ensuring that there are copies at various physical locations. Another part can be making it logically tamper-proof, or at least highly auditable.
In particular, the following are largely different from each other. Read more
Categories: Clustering, Data integration and middleware, EAI, EII, ETL, ELT, ETLT, eBay, Hadoop, MapReduce | 10 Comments |
Hardware and components — lessons from Teradata
I love talking with Carson Schmidt, chief of Teradata’s hardware engineering (among other things), even if I don’t always understand the details of what he’s talking about. It had been way too long since our last chat, so I requested another one. We were joined by Keith Muller, who I presume is pictured here. Takeaways included:
- Teradata performance growth was slow in the early 2000s, but has accelerated since then; Intel gets a lot of the credit (and blame) for that.
- Carson hopes for a performance “discontinuity” with Intel Ivy Bridge.
- Teradata is not afraid to use niche special-purpose chips.
- Teradata’s views can be taken as well-informed endorsements of InfiniBand and SAS 2.0.
Categories: Data warehouse appliances, Data warehousing, Database compression, Solid-state memory, Storage, Teradata | 13 Comments |
SAP HANA today
SAP HANA has gotten much attention, mainly for its potential. I finally got briefed on HANA a few weeks ago. While we didn’t have time for all that much detail, it still might be interesting to talk about where SAP HANA stands today.
The HANA section of SAP’s website is a confusing and sometimes inaccurate mess. But an IBM whitepaper on SAP HANA gives some helpful background.
SAP HANA is positioned as an “appliance”. So far as I can tell, that really means it’s a software product for which there are a variety of emphatically-recommended hardware configurations — Intel-only, from what right now are eight usual-suspect hardware partners. Anyhow, the core of SAP HANA is an in-memory DBMS. Particulars include:
- Mainly, HANA is an in-memory columnar DBMS, based on SAP’s confusingly-renamed BI Accelerator/BW Accelerator. Analytics and most OLTP (OnLine Transaction Processing) go against the columnar part of HANA.
- The HANA DBMS also has an in-memory row storage option, used to store metadata, small tables, and so on.
- SAP HANA talks both SQL and MDX.
- The HANA DBMS is shared-nothing across blades or rack servers. I imagine that within an individual blade it’s shared everything. The usual-suspect data distribution or partitioning strategies are available — hash, range, round-robin.
- SAP HANA has what sounds like a natural disk-based persistence strategy — logs, snapshots, and so on. SAP says that this is synchronous enough to give ACID compliance. For some hardware partners, those “disks” are actually Fusion I/O cards.
- HANA is fault-tolerant “across servers”.
- Text support is “coming soon”, which makes sense, given that BI Accelerator was based on the TREX search engine in the first place. Inxight is also in the HANA text mix.
- You can put data into SAP HANA in a variety of obvious ways:
- Writing it directly.
- Trigger-based replication (perhaps from the DBMS that runs your SAP apps).
- Log-based replication (based on Sybase Replication Server).
- SAP Business Objects’ ETL tool.
SAP says that the row-store part is based both on P*Time, an acquisition from Korea some time ago, and also on SAP’s own MaxDB. The IBM white paper mentions only the MaxDB aspect. (Edit: Actually, see the comment thread below.) Based on a variety of clues, I conjecture that this was an aspect of SAP HANA development that did not go entirely smoothly.
Other SAP HANA components include: Read more
Third-party analytics
This is one of a series of posts on business intelligence and related analytic technology subjects, keying off the 2011/2012 version of the Gartner Magic Quadrant for Business Intelligence Platforms. The four posts in the series cover:
- Overview comments about the 2011/2012 Gartner Magic Quadrant for Business Intelligence Platforms, as well as a link to the actual document.
- Business intelligence industry trends — some of Gartner’s thoughts but mainly my own.
- Company-by-company comments based on the 2011/2012 Gartner Magic Quadrant for Business Intelligence Platforms.
- (This post) Third-party analytics, pulling together and expanding on some points I made in the first three posts.
I’ve written a lot this weekend about various areas of business intelligence and related analytics. A recurring theme has been what we might call third-party analytics — i.e., anything other than buying analytic technology and deploying it in your own enterprise. Four main areas include:
- Business intelligence software OEMed to packaged operational application vendors.
- Business intelligence software OEMed to SaaS (Software as a Service) application vendors.
- Business intelligence software bundled into information-selling businesses.
- Stakeholder-facing analytics, which usually is just BI allowing customers (or suppliers, investors, citizens, etc.) to look into one of your databases.