Archiving and information preservation – DBMS 2 : DataBase Management System Services

Issues in regulatory compliance

Curt Monash — Sun, 15 Jul 2012 06:27:28 +0000

From time to time, I hear of regulatory requirements to retain, analyze, and/or protect data in various ways. It’s hard to get a comprehensive picture of these, as they vary both by industry and jurisdiction; so I generally let such compliance issues slide. Still, perhaps I should use one post to pull together what is surely a very partial list.

Most such compliance requirements have one of two emphases: Either you need to keep your customers’ data safe against misuse, or else you’re supposed to supply information to government authorities. From a data management and analysis standpoint, the former area mainly boils down to:

Information security. This can include access control, encryption, masking, auditing, and more.
Keeping data in an approved geographical area. (E.g., its country of origin.) This seems to be one of the three big drivers for multi-data-center processing (along with latency and disaster recovery), and hence is an influence upon numerous users’ choices in areas such as clustering and replication.

The latter, however, has numerous aspects.

First, there are many purposes for the data retention and analysis, including but by no means limited to:

Financial reporting (all industries)
Facilitating discovery in case you’re ever sued (all industries)
Anti-discrimination (especially financial services, but also labor law in any industry)
Safety and environmental review (many bricks-and-mortar industries, most notably pharma)
Rate setting (industries with regulated prices, such as insurance and utilities)
Financial risk evaluation, e.g. Basel 3 (financial services)
Ratting out your customers (especially commercial banking and internet)

Second, there are a variety of technical issues supporting the authorities-informing side of compliance, such as:

Keeping a whole lot of data, cost effectively and …
… using those archives, since you have them anyway. While it’s not a big focus area for me, I’ve written a number of posts about archiving and information preservation.
Making your document data sufficiently searchable by somebody who’s suing you. I posted about text e-discovery a couple of times in 2008.
Getting your regulatory reports together quickly. Examples include:
- Closing the books promptly after quarter end. A lot of software technology goes into that.
- Doing risk analysis promptly, even though it’s computationally very demanding. Risk analysis keeps coming up as an application area for scale-out analytic technologies.
Packaging up data nicely for regulators. The classic example of this is pharmaceutical regulatory filings, which is the application that fueled Documentum’s growth 20 years ago.
Ensuring that the data is accurate and hasn’t been tampered with. Some of that, again, is a matter of information security. But also important are the highly overlapping areas of data lineage and data provenance.
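
As one illustration of the tamper-evidence point, here is a minimal Python sketch of a hash chain, in which each record's hash also covers the hash of the record before it. Hash chaining is my choice of example technique and is not something the list above prescribes; the function names `chain_records` and `verify_chain` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def chain_records(records):
    """Wrap each record with the previous entry's hash and its own hash."""
    prev = GENESIS
    chained = []
    for rec in records:
        payload = json.dumps(rec, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
        chained.append({"record": rec, "prev": prev, "hash": digest})
        prev = digest
    return chained


def verify_chain(chained):
    """Recompute every hash; return False if any entry was altered."""
    prev = GENESIS
    for entry in chained:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True


ledger = chain_records([
    {"account": "A-1", "amount": 100},
    {"account": "A-2", "amount": 250},
])
```

Changing any stored record after the fact makes `verify_chain` return False, because the recomputed hash no longer matches the stored one. This only detects tampering; it says nothing about lineage or provenance, which need their own metadata.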

Combining all that, and more, I’d say that a considerable fraction of data management and analysis efforts are devoted to meeting legal and regulatory obligations.

Clarifying SAND’s customer metrics, positioning and technical story

Curt Monash — Sun, 13 Nov 2011 02:45:36 +0000

Talking with my clients at SAND can be confusing. That said:

I need to revise my figures for SAND’s customer count way downward.
SAND finally has a reasonably clear positioning.
SAND’s product actually seems to have a lot of features.

A few months ago, I wrote:

SAND Technology reported >600 total customers, including >100 direct.

Upon talking with the company, I need to revise that figure downward, from > 600 to 15.

One embarrassing point: SAND is a client, and I view it as part of my job to save clients from that kind of inadvertent misstatement.

It turns out that SAND has a very impressive customer — Dunnhumby, a data mart outsourcer with 200 terabytes of data in SAND, 30 or so incoming data streams, 400 or so nodes … and 600 or so end customers, all of which SAND was counting as OEM end customers for its DBMS. But I, other industry observers, and other vendors generally don’t count that way.

Besides Dunnhumby, SAND has 14 other customers on maintenance, with < 1 terabyte of data each. Until recently, SAND had a couple dozen more customers than that, but it sold its SAP-oriented archiving/near-line storage product line to Informatica.

I still don’t know where the “> 100 direct” part came from.

After the sale of its other product line, SAND is squarely in the market for analytic DBMS. SAND’s sales efforts seem to be focused on investigative analytics, although some of its existing users seem to be more focused on operational analytics. Most specifically, SAND is trying to focus on “people data” — customer loyalty, health care, etc. — rather than purely machine-generated data, with the paradigmatic target application being personalized marketing.

SAND technical highlights include:

SAND sells a columnar analytic DBMS.
The SAND DBMS operates on bitmaps, with heavy use of run-length encoding on the bitmaps. Bitmaps are used for everything except BLOBs (Binary Large OBjects).
Actual data compression also comes into play, e.g. as result sets are being assembled. This is based on a true global dictionary — multiple columns are tokenized together.
Indeed, SAND can decompose columns and tokenize their parts (e.g. time stamps).
SAND’s workload management sees RAM and CPU, but not explicitly I/O.
SAND lets you pin certain tables or even table segments in RAM if you want to.
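
To make the bitmap and run-length-encoding points concrete, here is a minimal Python sketch of the general technique: one bitmap per distinct column value, each bitmap stored as runs. It is an illustration only and makes no claim about how SAND actually lays out or encodes its bitmaps; all function names are hypothetical.

```python
from itertools import groupby


def build_bitmaps(column):
    """Return one bitmap (a list of 0/1) per distinct value in the column."""
    return {
        value: [1 if v == value else 0 for v in column]
        for value in set(column)
    }


def rle_encode(bits):
    """Run-length encode a bitmap as (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into a flat bitmap."""
    bits = []
    for bit, length in runs:
        bits.extend([bit] * length)
    return bits


column = ["CA", "CA", "CA", "NY", "NY", "CA", "TX", "TX"]
bitmaps = build_bitmaps(column)
encoded = {value: rle_encode(bits) for value, bits in bitmaps.items()}

# A predicate such as state IN ('CA', 'TX') becomes a bitwise OR of two bitmaps.
ca_or_tx = [a | b for a, b in zip(bitmaps["CA"], bitmaps["TX"])]
```

Long runs of identical bits, which are common when a column is sorted or has few distinct values, are what make the run-length form compact.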

SAND’s update story is straightforward — when data comes in, all the columns and bitmaps are updated as needed. Still, since SAND is columnar, you wouldn’t expect true updates in place, and you’d be right. Rather, there’s a story with MVCC (MultiVersion Concurrency Control) and garbage collection, lock-free. The MVCC is also exploited for a kind of time travel, and further for some kind of virtual data mart capability.
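
For readers who want the MVCC idea spelled out, here is a toy Python sketch: writes append new versions instead of overwriting, reads can ask for the state "as of" an earlier version (the time-travel part), and garbage collection drops versions nobody can see any more. It is single-threaded, so it does not demonstrate the lock-free aspect, and it is not a description of SAND's internals; the class and method names are hypothetical.

```python
class VersionedStore:
    """Toy multi-version key/value store (append-only, single-threaded)."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_version, value), ascending
        self._current = 0

    def write(self, key, value):
        """Commit a new version of key; older versions are kept."""
        self._current += 1
        self._versions.setdefault(key, []).append((self._current, value))
        return self._current

    def read(self, key, as_of=None):
        """Return the value visible at version as_of (default: latest)."""
        if as_of is None:
            as_of = self._current
        visible = [val for ver, val in self._versions.get(key, []) if ver <= as_of]
        return visible[-1] if visible else None

    def garbage_collect(self, oldest_needed):
        """Keep the newest version <= oldest_needed and everything newer."""
        for key, versions in self._versions.items():
            keep_from = 0
            for i, (ver, _) in enumerate(versions):
                if ver <= oldest_needed:
                    keep_from = i
            self._versions[key] = versions[keep_from:]


store = VersionedStore()
v1 = store.write("price", 100)
v2 = store.write("price", 120)
v3 = store.write("price", 90)
```

Here `store.read("price")` sees the latest value while `store.read("price", as_of=v1)` sees the original one, until garbage collection discards versions older than any reader still needs.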

SAND’s parallelization story is a bit complicated.

SAND has, or at least has the potential for, node specialization, with database and storage nodes being different.
In principle, disks are specific to storage nodes, and it’s a configuration option as to whether a database node sees one, some, or all storage nodes.
In practice, only Dunnhumby among SAND’s customers operates on other than a shared-disk basis. Dunnhumby’s configuration is mixed/matched among various SAND sharing options.

SAND is proud of its PMML (Predictive Modeling Markup Language) scoring capabilities, but otherwise hasn’t shipped much in the way of analytic platform capabilities. That said, work is underway on a user-defined table function capability that can also query external tables, fire off MapReduce jobs, and so on, under the code name UQL.

Text data management, Part 1: Confusion

Curt Monash — Tue, 11 Oct 2011 01:58:03 +0000

This is Part 1 of a three-post series.

There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:

The terminology around text data is inaccurate.
Data volume estimates for text are misleading.
Multiple different technologies are in the mix, including:
- Enterprise text search.
- Text analytics — text mining, sentiment analysis, etc.
- Document stores — e.g. document-oriented NoSQL, or MarkLogic.
- Log management and parsing — e.g. Splunk.
- Text archiving — e.g., various specialty email archiving products I couldn’t even name.
- Public web search — Google et al.
Text search vendors have disappointed, especially technically.
Text analytics vendors have disappointed, especially financially.
Other analytic technology vendors ignore what the text analytic vendors actually have accomplished, and reinvent inferior wheels rather than OEM the state of the art.

Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.

There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include:

The terms “unstructured” or “semi-structured” data are inherently misleading. That’s why I favor “multi-structured” or “poly-structured” instead. (“Multi-structured” seems to be winning; e.g., it’s been adopted by Teradata and Teradata/Aster.)
The “social media” text data any one enterprise brings in house isn’t all that much. For example, Attensity serves many different enterprises’ social media needs from a single 20-terabyte data store, and reports that no single enterprise has required as much as 1 terabyte of text yet. Text data may consume a lot of storage on spinning disks somewhere, but it’s not that big a factor in future DBMS industry growth. (That 20 terabyte figure does seem low.)
Structured databases are typically worth a lot more per bit than other kinds. The most valuable electronic data, per-bit, is probably records of significant economic transactions — purchases, sales, money transfers, etc. The least valuable may be sensor log files, whose contents consist mainly of “Nothing going on here; ping you again in a minute.” Email logs, web interaction data and many other kinds fall somewhere in between. Highly valuable documents — such as signed contracts — generally persist in paper as well as electronic forms. Investors commonly overlook this point.
The enterprise text search industry is screwed up.
- FAST was a goofy company before it was acquired for far too much money by Microsoft.
- Autonomy was a goofy company before it was acquired for far too much money by HP.
- Google’s enterprise efforts are quiet.
- The integration of text search and relational DBMS — e.g. at Oracle — has languished, with poor performance and evident lack of management attention.
- Smaller text search vendors don’t seem to be getting a lot of traction — e.g., Coveo has a decent reputation, but when’s the last time you heard much about them? What has Attivio actually accomplished?
Text analytics is a small business. Add up the revenue for Attensity, Clarabridge, Lexalytics, Temis, and all the others, and you might poke above $100 million, especially now that Attensity had a three-way merger. Then again, you might not.
Even so, the text analytics vendors have developed sophisticated technology. In particular, you can use it to get a pretty good idea as to what people are writing about you, individually or as groups.

Teradata Columnar and Teradata 14 compression

Curt Monash — Thu, 22 Sep 2011 05:25:42 +0000

Teradata is pre-announcing Teradata 14, for delivery by the end of this year, where by “Teradata 14” I mean the latest version of the DBMS that drives the classic Teradata product line. Teradata 14’s flagship feature is Teradata Columnar, a hybrid-columnar offering that follows in the footsteps of Greenplum (now part of EMC) and Aster Data (now part of Teradata).

The basic idea of Teradata Columnar is:

Each table can be stored in Teradata in row format, column format, or a mix.
You can do almost anything with a Teradata columnar table that you can do with a row-based one.
If you choose column storage, you also get some new compression choices.

The “mix” option is like Vertica’s FlexStore, in that different columns (e.g. different components of a street address) can be grouped into a mini-row, even if you otherwise choose to store that table in a columnar way. Teradata does not at this time offer the Greenplum or Aster way of mixing rows and columns, whereby some of the rows in a table can be stored in a column-store way, while other rows are stored in entire-row row-store solidarity.

Thus, Teradata Columnar gives you many of the basic I/O and compression benefits of columnar DBMS, along with all the usual Teradata goodness of concurrency, workload management, system management, SQL support, and so on. By way of comparison:

Similar things are true of Greenplum’s offering (except for the parts about concurrency, advanced workload management, and so on).
Aster doesn’t have columnar compression.
Oracle has columnar compression but no true columnar storage.*

Also, as I noted above, Teradata mixes rows and columns in a different way than Aster or EMC Greenplum do.

*However, I won’t be surprised if Oracle soon announces true hybrid-columnar as well. I originally heard about Teradata Columnar and Oracle’s efforts to develop true hybrid-columnar storage the same week, 23 months ago.

Going hybrid-columnar is a big deal. Aster Data, for example, told me that a considerable fraction of all its workloads ran faster with columnar than row-based storage.* And it’s of extra importance to a vendor that, like Teradata, needs to play catch-up in the compression derby.

*Anything in which the queries eliminated more than half or so of the columns (60%, if I recall correctly, but it was definitely an approximate figure). That pretty much means any query except full and near-full table scans.

Teradata’s columnar compression story is pretty complicated. To quote from a forthcoming press release:

Teradata automatically chooses from among six types of compression: run length, dictionary, trim, delta on mean, null and UTF8, based on the column demographics.

The trickiest words in that are “automatic” and “dictionary”. Teradata divides column-store data into “column containers” of, say, 8 KB. (Current thinking is 8 KB default, 65 KB maximum, but that could change by the time of product release.) By default, Teradata software decides separately for each column container which compression algorithm(s) to use. It can even change its mind dynamically over time, as the contents of the container change.
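
The "decide separately for each column container" idea can be sketched in a few lines of Python. This is a guess at the general shape of such a chooser, not Teradata's algorithm: it covers only three candidate encodings (raw, run length, dictionary), and the size formulas, the 8-byte value width and the 2-byte run counter are all assumptions made for illustration.

```python
import math
from itertools import groupby


def raw_size(values, width=8):
    """Bytes needed to store every value at a fixed width."""
    return len(values) * width


def run_length_size(values, width=8, count_width=2):
    """Bytes needed if each run is stored as (value, run length)."""
    runs = sum(1 for _ in groupby(values))
    return runs * (width + count_width)


def dictionary_size(values, width=8):
    """Bytes for a container-local dictionary plus bit-packed tokens."""
    distinct = len(set(values))
    token_bits = math.ceil(math.log2(max(distinct, 2)))
    token_bytes = math.ceil(len(values) * token_bits / 8)
    return distinct * width + token_bytes


def choose_encoding(values):
    """Pick whichever candidate encoding gives the smallest estimate."""
    sizes = {
        "raw": raw_size(values),
        "run_length": run_length_size(values),
        "dictionary": dictionary_size(values),
    }
    return min(sizes, key=sizes.get)


sorted_container = ["A"] * 500 + ["B"] * 500
interleaved_container = ["A", "B", "C", "D"] * 250
unique_container = list(range(1000))
```

With these assumptions the sorted container comes out as run length, the interleaved low-cardinality one as dictionary, and the all-unique one as raw. Re-running the chooser when a container's contents change is one way to read "change its mind dynamically".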

What I find weird about Teradata’s columnar dictionary compression is that the dictionary is container-specific. One benefit versus having a more global dictionary is that, since you compress fewer items, compression tokens can each be shorter. (The length of a typical token is a lot like the log of the cardinality of the dictionary.) Another benefit is that smaller dictionaries are faster to search. The obvious offsetting drawback is that a larger and more global dictionary has the potential to compress various items that wind up being left uncompressed in this smaller-scale scheme.
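
The token-length arithmetic can be checked with a small Python sketch. It assumes fixed-width tokens of ceil(log2(cardinality)) bits, which is a simplification, and it counts one cost of container-local dictionaries that is easy to measure (the same value being stored in several dictionaries) rather than the uncompressed-items drawback described above. The function names and sample data are hypothetical.

```python
import math


def token_bits(cardinality):
    """Bits per fixed-width token for a dictionary of the given size."""
    return math.ceil(math.log2(max(cardinality, 2)))


def local_vs_global(containers):
    """Compare per-container dictionaries with one dictionary over all data."""
    global_values = set().union(*containers)
    return {
        "global_token_bits": token_bits(len(global_values)),
        "local_token_bits": [token_bits(len(set(c))) for c in containers],
        "global_dictionary_entries": len(global_values),
        # A value that appears in several containers is stored once per container.
        "local_dictionary_entries": sum(len(set(c)) for c in containers),
    }


# Three containers whose value ranges overlap.
containers = [list(range(0, 16)), list(range(8, 24)), list(range(16, 32))]
summary = local_vs_global(containers)
```

In this example each local token needs 4 bits against 5 for the global dictionary, while the local dictionaries together hold 48 entries against 32. At realistic scale the gap in token width is larger: a million distinct values needs 20-bit tokens, while a container with a couple of hundred distinct values needs 8.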

Other notes about Teradata compression include:

Teradata has for a while had a more manual form of dictionary compression.
Teradata also has block-level compression.
You can do block-level compression even on top of the columnar compression described above.
The Teradata/Rainstor partnership for archiving-level compression that Rainstor made so much fuss about doesn’t seem to actually be happening; Teradata seems content with the other compression choices it offers.

And finally, Teradata 14 extends Teradata Virtual Storage with a feature called Compress on Cold. The idea is that “cold” data can safely get (extra) compression — that block-level stuff — automatically. If the data heats up again (e.g. by becoming relevant for a while to the latest year-over-year comparisons) it can be just as automatically removed from compression. Teradata thinks this is significantly better than the alternative of making manual compression choices based on not-so-granular range partitions.
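
A minimal Python sketch of temperature-driven compression follows. It is only meant to show the idea of compressing blocks that have gone cold and uncompressing ones that heat up again; the read-count threshold, the use of `zlib`, and the `Block` and `rebalance` names are assumptions for illustration, not Teradata Virtual Storage behavior.

```python
import zlib


class Block:
    """A storage block that counts reads since the last rebalance."""

    def __init__(self, payload: bytes):
        self.payload = payload
        self.compressed = False
        self.recent_reads = 0

    def read(self) -> bytes:
        """Return the uncompressed contents and record the access."""
        self.recent_reads += 1
        if self.compressed:
            return zlib.decompress(self.payload)
        return self.payload


def rebalance(blocks, hot_threshold=3):
    """Compress cold blocks, uncompress hot ones, then reset read counters."""
    for block in blocks:
        is_hot = block.recent_reads >= hot_threshold
        if is_hot and block.compressed:
            block.payload = zlib.decompress(block.payload)
            block.compressed = False
        elif not is_hot and not block.compressed:
            block.payload = zlib.compress(block.payload)
            block.compressed = True
        block.recent_reads = 0
```

A real system would track temperature over a longer window and weigh the CPU cost of recompression, but the control loop has this general shape: measure access, then move data between compressed and uncompressed form without anyone making a manual choice.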

Unsurprisingly, Teradata lacks some features and benefits found in certain columnar-first analytic DBMS. One biggie is that, absent clever workarounds such as Vertica’s in-memory write-optimized store, columnar DBMS have a single-row-update performance problem, because you are putting the information in many places on disk rather than just one. I generally take it for granted that a columnar-first vendor has such a workaround. Row-based vendors gone columnar, however, are a different story. Teradata et al. are also likely to decompress data and reassemble it into full rows as soon as it hits RAM, which obviates the potential benefit that you have less data per row clogging up cache.* (Edit: As per Todd Walter’s comments below, this is not accurate — and that’s a potentially important feature.)

*Late decompression actually depends on columnar compression, not columnar storage, and hence can also be enjoyed by row-based DBMS such as DB2.

To use Teradata Columnar, you need to be using round-robin data distribution rather than, say, hash. Teradata jargon for this is NoPI, where the “PI” stands for Primary Index.* Drawbacks to that include:

You don’t get the hash distribution benefit of saving a data redistribution step on joins whose join key happens to be the same as the hash key.
In Teradata-land, NoPI implies append-only, so you get the garbage collection/compactification that implies.

However, that’s a physical append-only; you can still do logical updates.
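
The join drawback can be illustrated with a small Python sketch that places two tables on four nodes, once by hashing the join key and once round-robin, and counts how many join matches are already on the same node. The modulo operation stands in for a real hash function on integer keys, and the table contents and function names are made up for the example.

```python
def hash_distribute(rows, key, n_nodes):
    """Place each row on the node chosen by its key (modulo as a stand-in hash)."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[row[key] % n_nodes].append(row)
    return nodes


def round_robin_distribute(rows, n_nodes):
    """Place rows on nodes in arrival order, ignoring their contents."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


def local_join_matches(left_nodes, right_nodes, key):
    """Count join matches that can be found without moving any rows."""
    total = 0
    for left, right in zip(left_nodes, right_nodes):
        right_counts = {}
        for row in right:
            right_counts[row[key]] = right_counts.get(row[key], 0) + 1
        total += sum(right_counts.get(row[key], 0) for row in left)
    return total


customers = [{"cust_id": i} for i in range(8)]
orders = [{"order_id": i, "cust_id": (i * 3) % 8} for i in range(16)]

hashed = local_join_matches(
    hash_distribute(customers, "cust_id", 4),
    hash_distribute(orders, "cust_id", 4),
    "cust_id",
)
round_robin = local_join_matches(
    round_robin_distribute(customers, 4),
    round_robin_distribute(orders, 4),
    "cust_id",
)
```

All 16 order-to-customer matches are node-local under hash distribution; under round-robin only some of them are, so the rest require a redistribution step before the join can complete.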

*PI is not to be confused with PPI, which stands for Partitioned Primary Index, and is Teradata’s name for range (or case-statement-based) partitioning. PPI works just fine with Teradata Columnar. As of Teradata 14, you can do PPI up to 62 levels deep.

The Teradata folks also sent along a slide deck laying out parts of the Teradata Columnar story. But it’s not one of the better Teradata decks I’ve ever posted.

Eight kinds of analytic database (Part 2)

Curt Monash — Tue, 05 Jul 2011 08:18:18 +0000

In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear.

Bit bucket

Kinds of data likely to be included: Logs, other technical/external
Likely use styles: Staging/ETL, investigative
Canonical example: Log files in a Hadoop cluster
Stresses: TCO, scale-out, transform/big-query performance, ETL functionality

With the explosion of machine-generated data has come the need for a place to put it all, sometimes called the big bit bucket. This is like the investigative data mart for big databases, but more poly-structured. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.

The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.

Archival data store

Kinds of data likely to be included: Operational, CDR (call detail record), security log
Likely use styles: Archival, reporting (for compliance), possibly also investigative
Examples: Any long-term detailed historical store
Stresses: TCO, compression, scale-out, performance (if multi-use)

Analytic DBMS vendors have been insulting each other with the claim “that’s just an archival data store,” dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only Rainstor truly embraces the archival positioning, and I’ve become pretty dubious about their technical claims and their company alike.

Still, there’s a legitimate need for data stores, especially relational analytic DBMS, that:

Store data cheaply, with high rates of compression.
Have decent performance if you do want to query the data.
May have archiving/compliance-specific features as well.

Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.

Outsourced data mart

Kinds of data likely to be included: All
Likely use styles: Traditional BI, investigative analytics, staging/ETL
Examples: Advertising tracking, SaaS CRM
Stresses: Performance, TCO, reliability, concurrency

Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I’ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that’s an analytic data management challenge. The possibilities expand from there.

Data outsourcers are in the IT business, and so their IT development is — hopefully! — more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.* Multitenancy is commonly an issue, as is running in the cloud.

*Even so, there’s often That Guy who doesn’t want to migrate away from Oracle, no matter what.

Vertica gets the nod in a number of these cases; it’s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you’re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.

Operational analytic(s) server

Kinds of data likely to be included: Customer-centric, log, financial trade
Likely use styles: Advanced operational analytics
Examples:
- Lower latency: Web or call-center personalization, anti-fraud
- Higher latency: Customer profiling, Basel 3 risk analysis
Stresses: Performance, reliability, analytic functionality, perhaps concurrency

Even with eight different choices, I need a “catch-all” category; this is it.

Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in integrating short-request and analytic processing. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you’re set. Otherwise, you may want to pipe derived data into a more “industrial-strength” DBMS, ideally the one that runs your operational apps anyway.

Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you’re starting out with the data in a convenient bit bucket.

Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.

So did I get them all? Or are there yet other analytic data management use cases that I don’t fit into my eight categories?

Rainstor update

Curt Monash — Fri, 11 Jun 2010 10:54:09 +0000

I was tired and cranky when I talked with my former clients at Rainstor (formerly Clearpace) yesterday, so our call was shorter than it otherwise might have been. Anyhow, there’s a new version called Rainstor 4, the two main themes of which are:

Compliance-specific features.
Bottleneck Whack-A-Mole.

The point is that Rainstor is focusing its efforts on enterprises that:

Have a compliance mandate to keep detailed information, either now or coming down the pike.
Would like to query the information, either as part of the compliance mandate or for the usual business reasons one does analysis (or for that matter pinpoint lookup of historical information).
Might want to delete the information as soon as the compliance mandate runs out. (That’s a new feature. Frankly, I think the clients demanding it are being foolish. Information is valuable and should never be thrown away if one can afford to keep it.)
Might want to annotate the information, even though it is being preserved immutably. (Also a new feature. I think that one is smart.)
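
The last two requirements, deleting once the mandate expires and annotating records that are otherwise immutable, fit together as in the following Python sketch. It is an illustration of the requirements only, not of Rainstor 4's features or API; the `Record` and `Archive` classes and the one-year retention period are assumptions.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass(frozen=True)
class Record:
    """An archived record; frozen, so its fields cannot be reassigned."""
    record_id: int
    body: str
    stored_on: date


@dataclass
class Archive:
    retention: timedelta
    records: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)

    def store(self, record):
        self.records[record.record_id] = record

    def annotate(self, record_id, note):
        """Attach a note alongside the record without modifying the record."""
        if record_id not in self.records:
            raise KeyError(record_id)
        self.annotations.setdefault(record_id, []).append(note)

    def purge_expired(self, today):
        """Delete records (and their notes) whose retention period has run out."""
        expired = [
            rid for rid, rec in self.records.items()
            if rec.stored_on + self.retention <= today
        ]
        for rid in expired:
            del self.records[rid]
            self.annotations.pop(rid, None)
        return expired


archive = Archive(retention=timedelta(days=365))
archive.store(Record(1, "call detail", date(2009, 1, 1)))
archive.store(Record(2, "call detail", date(2010, 1, 1)))
archive.annotate(1, "reviewed for audit")
```

Keeping annotations in a side structure is what lets the preserved record stay untouched, and a purge keyed on the storage date is the simplest form of "delete as soon as the mandate runs out".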

“Application retirement” was mentioned only in the context of Rainstor’s flagship Informatica partnership, and even then mainly for clients who had a compliance reason to keep old application data around. “Cloud” and “private cloud” get mentioned, but they don’t seem to be as central as Rainstor was previously hoping they would be. (This is one area we could and probably should have touched on more had I been more awake.)

One thing that hasn’t changed: “Information preservation,” which I coined for Rainstor at our first meeting, is still the company catchphrase.

So far as I could tell, the big point on Rainstor 4 Bottleneck Whack-A-Mole is this: When you load data into Rainstor (bulk or otherwise), it likes to do some metadata analysis first. (I imagine this is related to the sophisticated Rainstor compression scheme.) Well, that isn’t much of a performance hit for schemas with small numbers of tables, but is a bigger deal for more complex schemas. The Rainstor 4 fix is to remember/persist some of that analysis from one time the database is updated until the next time. Sounds obvious, but so do a lot of bottleneck fixes once they are made.
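
The fix described above amounts to caching the pre-load analysis between loads. Here is a minimal Python sketch of that pattern, with an in-memory dictionary standing in for whatever persistence Rainstor actually uses; the `Loader` class, its methods, and the contents of the "analysis" are hypothetical.

```python
class Loader:
    """Toy loader that reuses per-table metadata analysis across loads."""

    def __init__(self):
        self._analysis_cache = {}  # (table, columns) -> analysis result
        self.analyses_run = 0

    def _analyze(self, table, columns):
        """Stand-in for an expensive metadata analysis step."""
        self.analyses_run += 1
        return {"table": table, "column_count": len(columns)}

    def load(self, table, columns, rows):
        """Analyze the table only if this exact schema has not been seen before."""
        signature = (table, tuple(columns))
        if signature not in self._analysis_cache:
            self._analysis_cache[signature] = self._analyze(table, columns)
        analysis = self._analysis_cache[signature]
        return {"rows_loaded": len(rows), "analysis": analysis}


loader = Loader()
loader.load("cdr", ["caller", "callee", "seconds"], [("a", "b", 30)])
loader.load("cdr", ["caller", "callee", "seconds"], [("c", "d", 12)])
```

After the two loads above the analysis has run once. With hundreds of tables per schema, skipping the repeat analysis on every update is where the saving would come from.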

I’ll be speaking in Washington, DC on May 6

Curt Monash — Sun, 18 Apr 2010 21:48:15 +0000

My clients at Aster Data are putting on a sequence of conferences called “Big Data Summit(s)”, and wanted me to keynote one. I agreed to the one in Washington, DC, on May 6, on the condition that I would be allowed to start with the same liberty and privacy themes I started my New England Database Summit keynote with. Since I already knew Aster to be one of the multiple companies in this industry that is responsibly concerned about the liberty and privacy threats we’re all helping cause, I expected them to agree to that condition immediately, and indeed they did.

On a rough-draft basis, my talk concept is:

Implications of New Analytic Technology in four areas:

Liberty & privacy
Data acquisition & retention
Data exploration
Operationalized analytics

I haven’t done any work yet on the talk besides coming up with that snippet, and probably won’t until the week before I give it. Suggestions are welcome.

If anybody actually has a link to a clear discussion of legislative and regulatory data retention requirements, that would be cool. I know they’ve exploded, but I don’t have the details.

The retention of everything

Curt Monash — Sun, 04 Apr 2010 07:25:37 +0000

I’d like to reemphasize a point I’ve been making for a while about data retention:

As costs go down, the wisdom of keeping detailed data goes up. I’d go so far as to say that every piece of data generated by a human being should be preserved and kept online, legal and privacy considerations permitting.* Most forms of capital-, labor-, and/or location-based competitive advantage are being commoditized and/or globalized away. But information remains a unique corporate asset. Don’t discard it lightly.

*Unless there’s an explicit law mandating data destruction, legal considerations should permit. The idea “Let’s destroy something of irreplaceable value today, against the possibility we might be brought to judgment tomorrow” is both morally and pragmatically weird. Privacy, however, may be a different matter.

That applies to the structured/tabular kinds of data I tend to focus on in this blog. It applies even more to anything that’s like a document (or email, instant message, whatever) somebody has taken the trouble to place into words. A top document-oriented archiving analyst (and my good friend), David Ferris, quite agrees. As David puts it:

I think we’ll end up archiving everything, except egregious garbage like spam:

It’s too hard to get users to conform to policy.

Automated methods of capturing a human-understandable policy, for example “tax records,” are too hard to implement through automatic filters. The filters are too inaccurate.

It’s impractical to get users to classify everything, and automatic classification is too crude.

You never know what you might want later. Stuff you think you won’t want now may end up being very useful.

The cost of storage is trivial when looked at on a per-user basis.

In particular, I think information destruction is a crude instrument for the protection of privacy, wasteful at best, and likely to be vigorously resisted by governments and large businesses. For example:

Businesses are increasingly subject to retention-oriented compliance regulation. Your lawyers may want you to destroy information that could be used to sue you, but governments won’t let you.
Information about individuals’ web surfing is being retained, under law, so that they may be fingered later for pornography consumption or illegal file sharing. I deplore some of the ways web-surfing data can be and is being used, and want laws passed to rein them in. But the retention will happen.
Marketers want all that data. Duh.
Electronic health records are coming — slowly, but they’ll get here some day.

Besides, archiving technologies are getting ever more cost-effective.

Notes on RainStor, the company formerly known as Clearpace

Curt Monash — Sat, 12 Dec 2009 00:15:02 +0000

Information preservation* DBMS vendor Clearpace officially changed its name to RainStor this week. RainStor is also relocating its CEO John Bantleman and more generally its headquarters to San Francisco. This all led to a visit with John and his colleague Ramon Chen, highlights of which included:

RainStor expects to finish the year with > 50 users (overwhelmingly via partners)
A big market for RainStor (at least in terms of signed partnerships and large deal activity) is retention of telecom records, for compliance purposes, typically for a 1-3 year period. This includes:
- CDRs (Call Detail Records)
- Mobile phone records including CDRs and missed calls
- SMS (Short Message Service), including the complete text of same
RainStor thinks a number of larger telcos have the need to store a billion records per day each. (I’m not sure how many subscribers such a telco would have to have.)
John further thinks that, for the same query performance, RainStor can handle such a database on 4 blades. More precisely, he says that’s what happened at a test conducted by a major technology firm. In the same test case, SenSage required 40 blades, and Oracle required 80 or more cores on a pair of big SMP machines. John further says that the Oracle solution required a new table and new tablespace every day, while RainStor’s took 3 days for initial installation and required no DBA afterwards. However, I’m in no position to verify this report independently.
In a different kind of proof point, so extreme it gives even the RainStor folks pause, a user has retired 300 different applications and put their databases onto a single 2-core box. (Presumably, this is via RainStor’s OEM relationship with Informatica.)
Coming Very Soon are some services tying RainStor’s DBMS to obvious-suspect SaaS offerings. The core positioning is “SaaS data escrow”, i.e., RainStor will help you ensure that, in a worst-case scenario, there’s a nice safe copy of your data you can get at. RainStor also encourages you to do basic reporting and BI against the RainStor copy of the data, if you choose.
The idea I’ve been pushing lately of taking a heterogeneous replication offering like Continuent’s and having it feed an archiving store like RainStor’s has hit a rather basic snag. RainStor doesn’t actually consume change data capture kinds of information directly, at least as of yet, because of difficulties fitting such a stream into its guaranteed-data-immutability model.
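
The billion-records-per-day figure is easier to reason about with some arithmetic. The Python sketch below turns it into a per-second rate, a total for the 1-3 year retention periods mentioned above, and an implied subscriber count. The subscriber figure depends entirely on how many records one subscriber generates per day, which the notes above do not state; the value of 20 used here is purely an assumption.

```python
RECORDS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def records_per_second(records_per_day=RECORDS_PER_DAY):
    """Average sustained ingest rate implied by a daily volume."""
    return records_per_day / SECONDS_PER_DAY


def records_retained(years, records_per_day=RECORDS_PER_DAY):
    """Total records kept online for a retention period (365-day years)."""
    return records_per_day * 365 * years


def implied_subscribers(records_per_subscriber_per_day,
                        records_per_day=RECORDS_PER_DAY):
    """Subscriber count, given an ASSUMED per-subscriber daily record rate."""
    return records_per_day / records_per_subscriber_per_day


rate = records_per_second()               # roughly 11,600 records per second
three_years = records_retained(3)         # a little over a trillion records
subscribers = implied_subscribers(20)     # 50 million, if 20 records each per day
```

Changing the assumed per-subscriber rate moves the subscriber estimate proportionally, so treat that last number as a way to frame the question rather than an answer to it.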

*I coined that category description for John in the tea room of the Park Lane Hotel. He’s subsequently embraced it enthusiastically, and I kind of like it myself.

Related links

RainStor’s approach to compression, as described by me and by RainStor itself

Boston Big Data Summit keynote outline

Curt Monash — Mon, 23 Nov 2009 06:25:50 +0000

Last month, Bob Zurek asked me to give a talk on “Big Data”, where “big” is anything from a few terabytes on up, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I’m posting them below.

The top two points from Q&A probably were:

Big Data and the cloud actually have relatively little to do with each other, a few exceptions notwithstanding, especially if the data is in a shared-nothing DBMS (as opposed to, say, a MapReduce-oriented file cluster). Two principal reasons are:
- Redistributing data from node to node is a little slow, undermining some of the elasticity benefits of the cloud.
- Getting data into the cloud in the first place is a lot slow.
The NoSQL movement is a lot like the Ron Paul campaign — it consists of people who are dissatisfied with the status quo, whose dissatisfaction has a lot to do with insufficient liberty and/or excessive expenditure, and who otherwise don’t have a whole lot in common with each other.

Anyhow, here are my notes for the talk, edited in just a couple of places for readability or linkage.

Quick introduction

Big Data vs. cloud
How big is Big Data?
At the low end of that range, there’s little you can’t do with conventional technology if you have:
- An unlimited budget for hardware
- An unlimited budget for software
- An unlimited budget for people, especially Oracle DBAs

Big Data in OLTP

Hard-core OLTP
- Focus of DBMS technology for a long time
- Big budgets because each transaction has significant value
- Tough to get users to change technologies
Lighter-weight OLTP
- Classic example = web companies
  - Big ones — retail-oriented ones (eBay, Amazon) partially excepted — rolled their own technology stacks
  - Reluctant to give money to anybody
    - Open source, etc.
- Difficulty finding market
  - Product vs. feature
    - Clustering/HA/DR/whatever
    - Ditto cloud enablement
  - True products haven’t found much traction yet

Analytic Big Data use cases

Kinds of data for analytics
- More of same != big
- More detail and/or new kinds
  - Complete data sets
  - Transactions
  - Call details
  - Tick/trade history
  - Web clickstreams
  - Network event logs
  - Other machine-generated data
  - CAM bottom line
    - Anything human-generated should and will be retained in its entirety
    - Quantities of machine-generated data retained should and will grow roughly in line w/ computing cost reductions (Moore’s Law, etc.)
Analytic uses of Big Data
- Analytics is mainly about three things
  - Problem detection
  - Customer relationship improvement
    - (Those overlap when the customer relationship is bad)
  - Financial statements on steroids
- Main kinds of analytics
  - What BI vendors traditionally sell
    - General reporting and dashboards
    - Ad-hoc query (now driven from those reports and dashboards)
    - Planning (allegedly integrated with BI)
  - Research
    - Ad hoc relational query (worth mentioning twice because it drives so much of the market)
    - Data mining
    - Most web search and web mining
  - Operational/near-real-time
  - Archiving/compliance
- What gets Big?
  - Mainly research and archiving
  - But when reporting or operational get Big, you have really interesting computing problems

Technology issues and trends

Moore’s Law
- CPUs — All about cores, hence parallelism is key
- RAM
- SSDs – hence replace disks
- Sensors – hence generate lots more data
Kryder’s Law
- But rotational speeds up only 12.5X since Eisenhower Administration
- Hence solid-state memory (or RAM) will soon take over
In the meantime, I/O bottlenecks have had to be beaten
- Hence sequential scans
- Hence index-light architectures
- Hence columnar
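
The I/O argument in the notes above can be sketched with simple arithmetic. The two spindle speeds are assumptions chosen because their ratio matches the 12.5X figure (1,200 RPM for an early drive, 15,000 RPM for a fast modern one); the table dimensions are hypothetical, and the column-store estimate ignores compression and storage overhead.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average wait for the platter to come around: half a rotation, in ms."""
    return 0.5 * 60_000 / rpm


def bytes_scanned_row_store(rows: int, columns: int, bytes_per_value: int) -> int:
    """A full scan of a row store reads every column of every row."""
    return rows * columns * bytes_per_value


def bytes_scanned_column_store(rows: int, columns_needed: int,
                               bytes_per_value: int) -> int:
    """A column store reads only the columns the query touches."""
    return rows * columns_needed * bytes_per_value


if __name__ == "__main__":
    old_rpm, new_rpm = 1_200, 15_000  # assumed endpoints, ratio 12.5
    print(f"Rotational latency: {avg_rotational_latency_ms(old_rpm):.1f} ms "
          f"-> {avg_rotational_latency_ms(new_rpm):.1f} ms")

    # Hypothetical table: 1 billion rows, 100 columns, 8 bytes per value,
    # and a query that needs 3 of those columns.
    row_bytes = bytes_scanned_row_store(10**9, 100, 8)
    col_bytes = bytes_scanned_column_store(10**9, 3, 8)
    print(f"Row store scan:    {row_bytes / 1e9:,.0f} GB")
    print(f"Column store scan: {col_bytes / 1e9:,.0f} GB")
```

Under these assumptions, the per-access rotational wait has improved only 12.5X, while reading 3 of 100 columns cuts the bytes scanned by about 33X, which is the kind of gap that favors sequential, index-light, and columnar designs.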
DBMS “overhead”
- Raw license and maintenance fees – software increasing fraction of total
- OLTP vestiges – locking and all that
- DBAs
  - People costs = huge fraction of total
  - Index-lightness addresses
  - So does appliance
- Many people don’t really know how to write SQL
Configuration
- Appliance/tightly-balanced
  - Netezza
  - Teradata earlier
  - Greenplum/Sun
  - Oracle
  - IBM
  - Microsoft/Madison
- Commodity/do what you want
  - Vertica
  - Greenplum now
  - Infobright, Aster and others
  - MapReduce-oriented file systems
- Extreme rigidity is silly
  - Teradata, Oracle have both signaled moving to more modularity
  - Big driver of that = heterogeneous storage
    - Cheap disk
    - Expensive disk
    - Solid-state
    - RAM
  - CPU/storage ratio is even more of a driver

Theoretically defensible ways to segment the market

Latency requirements
- High availability and low latency go together
Query types
- Simultaneous users for same
Database size
Budget

Actual segments right now

Utter ADW/EDW
Data mart
- Size
- Naturally columnar vs. naturally row-based
Operational/frontline
Less dramatic/smaller EDW