SenSage – DBMS 2 : DataBase Management System Services

An idealized log management and analysis system — from whom?

Curt Monash — Sun, 07 Sep 2014 12:38:52 +0000

I’ve talked with many companies recently that believe they are:

Focused on building a great data management and analytic stack for log management …
… unlike all the other companies that might be saying the same thing …
… and certainly unlike expensive, poorly-scalable Splunk …
… and also unlike less-focused vendors of analytic RDBMS (which are also expensive) and/or Hadoop distributions.

At best, I think such competitive claims are overwrought. Still, it’s a genuinely important subject and opportunity, so let’s consider what a great log management and analysis system might look like.

Much of this discussion could apply to machine-generated data in general. But right now I think more players are doing product management with an explicit conception either of log management or event-series analytics, so for this post I’ll share that focus too.

A short answer might be “Splunk, but with more analytic functionality and more scalable performance, at lower cost, plus numerous coupons for free pizza.” A more constructive and bottoms-up approach might start with:

Agents for any kind of machine that emits streams of data.
Parsers that:
- Immediately identify explicit name-value pairs in popular formats such as JSON or XML.
- Also immediately extract a significant fraction of all implicit fields in text strings — timestamps for sure, but also a lot else. (Splunk is the current gold standard for such capabilities.)
- Allow you to easily write rules for more such extractions.
Immediate indexing in line with everything the parsers do.
Easy import of log files, relational tables, and other relevant data structures.
Queries that can exploit all the indexes, at least up to the functionality level of SQL 2003 analytics (including windowing) and StreamSQL, of course with …
… blazing scalable performance.
Strong workload management and concurrent performance support. (Teradata is the gold standard for such capabilities in the analytic sphere.)
Various other mature-DBMS features, e.g. in backup, manageability, and uptime.
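
To make the parser items in that list concrete, here is a minimal sketch in Python. It is not how Splunk or any other product actually works. The regular expressions, the `parse_log_line` function name, and the sample log formats are all assumptions invented for illustration. It covers the three parser behaviors: explicit name-value pairs (JSON), an implicit field (a timestamp in free text), and a user-written extraction rule (key=value tokens).

```python
import json
import re
from datetime import datetime

# Hypothetical rule for one implicit field: an ISO-8601-style timestamp.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")
# Hypothetical user-written rule: key=value tokens embedded in free text.
KEY_VALUE_RE = re.compile(r"(\w+)=(\S+)")


def parse_log_line(line):
    """Return a dict of fields extracted from one log line."""
    line = line.strip()
    # Explicit name-value pairs: if the line is a JSON object, use it as-is.
    if line.startswith("{"):
        try:
            record = json.loads(line)
            if isinstance(record, dict):
                return record
        except json.JSONDecodeError:
            pass  # not valid JSON after all; treat the line as free text
    fields = {"raw": line}
    # Implicit field: the first timestamp found anywhere in the text.
    match = TIMESTAMP_RE.search(line)
    if match:
        stamp = match.group(0).replace("T", " ")
        fields["timestamp"] = datetime.fromisoformat(stamp)
    # Extraction rule: every key=value token becomes its own field.
    for key, value in KEY_VALUE_RE.findall(line):
        fields[key] = value
    return fields
```

For example, `parse_log_line("2014-09-07 12:38:52 login failed user=bob src=10.0.0.5")` yields a `timestamp`, a `user` and a `src` field alongside the raw text. A real system would of course need far more rules than this, plus indexing on whatever the parser extracts.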

Further, there would be numerous styles of business intelligence interface, at least including:

Generic BI like we generally see for tabular data.
Constantly-changing displays of streaming data.
BI with an event-series orientation.
Strong alerting.
Mobile versions of everything.

And there would be good support for quick-turnaround, easily-operationalized predictive analytics, of the sort that’s fairly central to the visions for Kiji and Spark.

The data management part of that is particularly hard, in that:

Different architectures seem naturally well-suited for different parts of the problem.
Maturing a new data management product is always difficult, costly and slow.

My thoughts on strengths and weaknesses of some obvious log data management contenders start:

Oracle, IBM, and Microsoft have a lot of heft in all things database. But while each of those vendors has great resources and occasionally impressive pieces of new database engineering, none shows much evidence of framing, let alone solving, the problem in the right way(s).
SAP owns Sybase, HANA, several old CEP companies, and Business Objects. Add them to the Oracle/IBM/Microsoft list.
Teradata has a lot going for them. Their core analytic data management strengths are obvious. They’ve owned Aster for a while, and Aster innovated nPath quite some time ago. They recently added Hadapt, a leader in schema-on-need, as well as Revelytix, which has some good ideas in dataset management. Like most other DBMS vendors, however, Teradata doesn’t yet have much of a story for streaming data, and anyhow the most optimistic case for Teradata involves the difficult task of stitching together disparate data management technologies.
HP Vertica has a decent position as well. Probably more proven in general concurrent, scalable performance than others in their peer group (Netezza, Greenplum, et al.), Vertica also was relatively early in innovations relevant to log analysis, including a range of time series/event series features and its own schema-on-need effort. Vertica was founded by people who were also streaming pioneers (there were heavily overlapping groups of academics behind StreamBase, Vertica and VoltDB), but it’s not clear how that background is reflected in the present Vertica product.
Splunk, of course, has a complete stack. At the data acquisition and parsing layers, it’s second to none, and it has a considerable set of log-appropriate BI capabilities as well. And for data management it in effect is stitching together two different inverted-list data stores, plus Hadoop.
Hadoop distribution vendors such as Cloudera, MapR or Hortonworks typically bundle a range of relevant capabilities. HDFS (Hadoop Distributed File System) is the default place to dump entire logs. In most distros, Spark offers a new approach to streaming. Impala, Drill and so on offer query. Flume gathers the log data in the first place. But a lot of the cooler capabilities are immature or unproven, and in some cases that’s putting it mildly.

In the interest of length, I’ll omit discussion of smaller vendors, except to say that Platfora’s integrated-stack event series analytics story deserves attention, and I’m disappointed that I never hear about Sumo Logic. And I don’t know a lot about companies positioned as SIEM (Security Information and Event Management), especially now that SenSage has left the scene.

Eight kinds of analytic database (Part 2)

Curt Monash — Tue, 05 Jul 2011 08:18:18 +0000

In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear.

Bit bucket

Kinds of data likely to be included: Logs, other technical/external
Likely use styles: Staging/ETL, investigative
Canonical example: Log files in a Hadoop cluster
Stresses: TCO, scale-out, transform/big-query performance, ETL functionality

With the explosion of machine-generated data has come the need for a place to put it all, sometimes called the big bit bucket. This is like the investigative data mart for big databases, but more poly-structured. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.

The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.

Archival data store

Kinds of data likely to be included: Operational, CDR (call detail record), security log
Likely use styles: Archival, reporting (for compliance), possibly also investigative
Examples: Any long-term detailed historical store
Stresses: TCO, compression, scale-out, performance (if multi-use)

Analytic DBMS vendors have been insulting each other with the claim “that’s just an archival data store,” dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only Rainstor truly embraces the archival positioning, and I’ve become pretty dubious about their technical claims and their company alike.

Still, there’s a legitimate need for data stores (especially relational analytic DBMS) that:

Store data cheaply, with high rates of compression.
Have decent performance if you do want to query the data.
May have archiving/compliance-specific features as well.

Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.

Outsourced data mart

Kinds of data likely to be included: All
Likely use styles: Traditional BI, investigative analytics, staging/ETL
Examples: Advertising tracking, SaaS CRM
Stresses: Performance, TCO, reliability, concurrency

Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I’ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that’s an analytic data management challenge. The possibilities expand from there.

Data outsourcers are in the IT business, and so their IT development is — hopefully! — more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.* Multitenancy is commonly an issue, as is running in the cloud.

*Even so, there’s often That Guy who doesn’t want to migrate away from Oracle, no matter what.

Vertica gets the nod in a number of these cases; it’s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you’re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.

Operational analytic(s) server

Kinds of data likely to be included: Customer-centric, log, financial trade
Likely use styles: Advanced operational analytics
Examples:
- Lower latency: Web or call-center personalization, anti-fraud
- Higher latency: Customer profiling, Basel 3 risk analysis
Stresses: Performance, reliability, analytic functionality, perhaps concurrency

Even with eight different choices, I need a “catch-all” category; this is it.

Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in integrating short-request and analytic processing. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you’re set. Otherwise, you may want to pipe derived data into a more “industrial-strength” DBMS, ideally the one that runs your operational apps anyway.

Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you’re starting out with the data in a convenient bit bucket.

Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.

So did I get them all? Or are there yet other analytic data management use cases that I don’t fit into my eight categories?

Advice for some non-clients

Curt Monash — Fri, 30 Jul 2010 14:35:52 +0000

Edit: Any further anonymous comments to this post will be deleted. Signed comments are permitted as always.

Most of what I get paid for is in some form or other consulting. (The same would be true for many other analysts.) And so I can be a bit stingy with my advice toward non-clients. But my non-clients are a distinguished and powerful group, including in their number Oracle, IBM, Microsoft, and most of the BI vendors. So here’s a bit of advice for them too.

Oracle. On the plus side, you guys have been making progress against your reputation for untruthfulness. Oh, I’ve dinged you for some past slip-ups, but on the whole they’ve been no worse than other vendors’. But recently you pulled a doozy. The analyst reports section of your website fails to distinguish between unsponsored and sponsored work.* That is a horrible ethical stumble. Fix it fast. Then put processes in place to ensure nothing that dishonest happens again for a good long time.

*Merv Adrian’s “report” listed high on that page is actually a sponsored white paper. That Merv himself screwed up by not labeling it clearly as such in no way exonerates Oracle. Besides, I’m sure Merv won’t soon repeat the error — but for Oracle, this represents a whole pattern of behavior.

Oracle. And while I’m at it, outright dishonesty isn’t your only unnecessary credibility problem. You’re also playing too many games in analyst relations.

HP. Neoview will never succeed. Admit it to yourselves. Go buy something that can.

Smaller BI vendors. Analytic DBMS evaluations commonly include BI strategy and tool selection as well. If an analytic DBMS expert tells you he needs to learn more about your product line, don’t blow him off. In fact, you should be particularly embracing anybody who’s shown a fondness for small DBMS vendors; maybe he or his clients will like small BI vendors as well. That means (among others) Jaspersoft, Endeca, and Tableau.

Information Builders. Is there anything about your BI products that is in any way technologically differentiated? If so, you might want to mention some examples to somebody some time.

Kalido. I’ve said this to you before, but it bears repeating — your positioning translates to “I-CASE for analytics,” and that’s not a good thing. If your product is not as cumbersome and entrapping as that sounds, you need to do a much better job of explaining why not.

SenSage. You are what you are. Sell out while the selling is good. You don’t have the corporate personality to make it into the analytic DBMS mainstream on your own.

Ingres. You need to be more engaged with analysts than you are. Ingres navel-gazed too much 25 years ago, and evidently you haven’t outgrown it yet.

TIBCO. You probably have a lot of cool analytic technology, but I don’t know of an influencer who has much relationship with or trust in you. Rethink how you’re approaching influencer relations top to bottom.

Tableau. You had a lot of mindshare, but it’s fading. Do something.

MarkLogic, graph DBMS vendors, etc. You’re clinging too hard to the NoSQL label. Nobody is out there deciding among Cassandra, neo4j, and MarkLogic. They might be deciding between MongoDB and MarkLogic, I guess, but if you admit to yourself that’s all it is you’ll probably change your messaging somewhat.

Objectivity. Get real about marketing. Infinite Graph is a cool opportunity. But I didn’t even ping you for a meeting when I’m in your area next week, because I wouldn’t have known who to reach out to.

Everybody (especially Objectivity). “First X deployed in the cloud” is almost surely an inaccurate claim. Don’t make it. And by the way, even if it were true, it probably wouldn’t be interesting.

Clearing up MapReduce confusion, yet again

Curt Monash — Wed, 30 Dec 2009 10:50:53 +0000

I’m frustrated by a constant need — or at least urge — to correct myths and errors about MapReduce. Let’s try one more time:

MapReduce was named and popularized — but not invented — by Google.
“MapReduce” variously refers to:
- A programming paradigm
- Execution engines that implement the programming paradigm
- Distributed file systems that work with the execution engines
In particular, Hadoop is a MapReduce execution engine that includes or is closely associated with HDFS (Hadoop Distributed File System).
MapReduce and analytic DBMS can interact in a number of different ways, including:
- Tight integration between a DBMS and exposed MapReduce functionality, e.g. Aster Data’s SQL/MapReduce or Greenplum.
- Integrated MapReduce “under the covers”, e.g. SenSage or Oracle. This may or may not follow all the rules Google laid out for MapReduce, but it’s at least similar in spirit.
- Looser coupling between DBMS and a MapReduce system, e.g. Vertica/Hadoop, in which MapReduce may or may not run on a different cluster than the DBMS.
- Not at all, except perhaps insofar as a quasi-DBMS such as Hive is implemented over a MapReduce system such as Hadoop/HDFS.
As predicted by Monash’s First Law of Commercial Semantics, different vendors have individual variants on those themes. For example, as per a registration-required white paper, Splunk is moving to publicly expose a not-quite-complete form of MapReduce.
MapReduce implementations such as Hadoop are sometimes regarded as part of the NoSQL “movement”. When they are, many generalities about NoSQL — such as that it doesn’t deal with analytics — are falsified.
So far as I can tell, mainstream enterprise (as opposed to web, scientific, investment, etc.) data mining folks may be looking at MapReduce for data mining, but they haven’t done much to adopt it yet. Probably that’s because the outfits who have the greatest need are the same ones that have the largest sunk investments in more traditional ways of doing data mining.
Cloudera != Hadoop. On the other hand, if you want to use Hadoop, it makes a lot of sense to do business with Cloudera.
Non-DBMS MapReduce != Hadoop. On the other hand, Hadoop is the default choice for non-DBMS MapReduce.
MapReduce != Hadoop, period. DBMS-based MapReduce is also a legitimate technical strategy.
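
Since the first sense of “MapReduce” above is a programming paradigm, a toy single-process illustration may help separate the paradigm from any particular execution engine. This is only a sketch. The function names (`run_job`, `status_mapper`, `count_reducer`) and the two-field log format are assumptions made up for the example, and nothing here is distributed the way Hadoop or a DBMS-integrated implementation would be.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, the step an execution engine does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run map, shuffle and reduce in sequence, all in one process."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def status_mapper(line):
    # Hypothetical log format: "<host> <http_status>".
    host, status = line.split()
    yield status, 1


def count_reducer(key, values):
    # Sum the 1s emitted by the mapper to count hits per status code.
    return sum(values)
```

Calling `run_job(["web1 200", "web2 500", "web1 200"], status_mapper, count_reducer)` counts log lines per status code. The user supplies only the mapper and reducer; everything else is the engine's job, which is why the same paradigm can sit on Hadoop, inside a DBMS, or elsewhere.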

Notes on RainStor, the company formerly known as Clearpace

Curt Monash — Sat, 12 Dec 2009 00:15:02 +0000

Information preservation* DBMS vendor Clearpace officially changed its name to RainStor this week. RainStor is also relocating its CEO John Bantleman and more generally its headquarters to San Francisco. This all led to a visit with John and his colleague Ramon Chen, highlights of which included:

RainStor expects to finish the year with >50 users (overwhelmingly via partners).
A big market for RainStor (at least in terms of signed partnerships and large deal activity) is retention of telecom records, for compliance purposes, typically for a 1-3 year period. This includes:
- CDRs (Call Detail Records)
- Mobile phone records including CDRs and missed calls
- SMS (Short Message Service), including the complete text of same
RainStor thinks a number of larger telcos have the need to store a billion records per day each. (I’m not sure how many subscribers such a telco would have to have.)
John further thinks that, for the same query performance, RainStor can handle such a database on 4 blades. More precisely, he says that’s what happened at a test conducted by a major technology firm. In the same test case, SenSage required 40 blades, and Oracle required 80 or more cores on a pair of big SMP machines. John further says that the Oracle solution required a new table and new tablespace every day, while RainStor’s took 3 days for initial installation and required no DBA afterwards. However, I’m in no position to verify this report independently.
In a different kind of proof point, so extreme it gives even the RainStor folks pause, a user has retired 300 different applications and put their databases onto a single 2-core box. (Presumably, this is via RainStor’s OEM relationship with Informatica.)
Coming Very Soon are some services tying RainStor’s DBMS to obvious-suspect SaaS offerings. The core positioning is “SaaS data escrow”, i.e., RainStor will help you ensure that, in a worst-case scenario, there’s a nice safe copy of your data you can get at. RainStor also encourages you to do basic reporting and BI against the RainStor copy of the data, if you choose.
The idea I’ve been pushing lately of taking a heterogeneous replication offering like Continuent’s and having it feed an archiving store like RainStor’s has hit a rather basic snag. RainStor doesn’t actually consume change data capture kinds of information directly, at least as of yet, because of difficulties fitting such a stream into its guaranteed-data-immutability model.

*I coined that category description for John in the tea room of the Park Lane Hotel. He’s subsequently embraced it enthusiastically, and I kind of like it myself.

Related links

RainStor’s approach to compression, as described by me and by RainStor itself

Introduction to SenSage

Curt Monash — Sun, 18 Oct 2009 16:02:42 +0000

I visited with SenSage on my two most recent trips to San Francisco. Both visits were, through no fault of SenSage’s, hasty. Still, I think I have enough of a handle on SenSage basics to be worth writing up.

General SenSage highlights include:

SenSage used to be known as Addamark.
SenSage used to characterize itself as being in the Security Information Management (SIM) market.
Now SenSage characterizes itself (approximately) as selling technology built around a columnar DBMS that happens to be pretty good at log analysis, compliance, and/or archiving.
More concisely, SenSage says it is in the event data warehouse category. (The same could arguably be said of Splunk.)
SenSage says it has >400 paying customers, of which ~200 are direct.
SenSage has >120 employees and, like Splunk, is profitable.
SenSage has enjoyed >50% annual revenue growth the past four years.
Some SenSage deals are in the multiple-million dollar range.
A major SenSage channel partner (dozens of installations) is SAP, which resells SenSage software on HP hardware as a “Compliance Log Warehouse.”
A hot market for SenSage is CDRs (Call Detail Records).
SenSage says that, among analytic DBMS vendors, it competes with Oracle, IBM, Teradata, Netezza and, to some extent, Vertica and Greenplum.

Technical SenSage highlights include:

SenSage’s core technology is an append-only columnar DBMS, with no master node.
SenSage’s DBMS uses no indexes and requires “no” database administration.
SenSage’s database is range-partitioned, with the range-partition key always being time.
SenSage has something it calls SQO (Sparse Query Optimization), which sounds a lot like Netezza zone maps. SQO never yields a false negative on whether data is in a block, never yields a false positive on equality predicates, and only rarely yields a false positive on range predicates.
SenSage’s database uses large block sizes – typically 250,000 records/block, at 200-250 bytes per record. (That’s in the range of 64 megabytes/block.)
SenSage says its software can load 10,000-50,000 records/second/node. If I’m doing the arithmetic correctly, that’s roughly 7-40 gigabytes/node/hour.
SenSage collects log data into its event data warehouse in what it characterizes as an agentless manner. Even so, it seems that for a majority of kinds of data sources one does have to write custom agents. The two other ways to get data into SenSage – and presumably most of the data volume comes through these – are:
- File transfer in the usual way
- syslog
SenSage says its software can read 100s of data sources, and that this is a huge competitive advantage. I’m not totally sure how that jibes with the prior point.
SenSage says it gets 5X compression on CDR data, 10-20X on other kinds of logs. That’s not too far off from Vertica’s compression figures.
SenSage says that it has datatype-aware compression as well as more standard stuff, with VARCHAR compressing particularly well.
In particular, SenSage uses both dictionary/token and delta compression.
SenSage’s software is pretty agnostic with respect to storage kind – DAS (Direct Attached Storage), SAN (Storage-Area Network), or content-addressable. In particular, there’s only about a 4% performance hit for using content-addressable storage.
When using WORM (Write Once Read Many) storage like EMC’s Centera, SenSage leaves record locator information behind on ordinary storage and otherwise queries the WORM storage just like it queries anything else.
SenSage says it has been using MapReduce since “Day 1”.
Probably not coincidentally, you can use Perl and other aggregates in SenSage SQL statements.
Perhaps also not coincidentally, SenSage says it has a number of advanced built-in analytic functions, including some focused on sessionization.
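
The block-size and load-rate arithmetic in the list above is easy to check. The sketch below assumes decimal units (1 megabyte = 1,000,000 bytes, 1 gigabyte = 1,000,000,000 bytes), reads the load figure as 10,000 to 50,000 records/second/node, and applies the 200-250 bytes/record figure to both calculations. The function names are made up for the example.

```python
def block_size_megabytes(records_per_block, bytes_per_record):
    """Size of one block in decimal megabytes."""
    return records_per_block * bytes_per_record / 1_000_000


def load_rate_gb_per_hour(records_per_second, bytes_per_record):
    """Per-node load rate in decimal gigabytes per hour (3600 seconds)."""
    return records_per_second * bytes_per_record * 3600 / 1_000_000_000


if __name__ == "__main__":
    # 250,000 records/block at 200-250 bytes/record.
    print(block_size_megabytes(250_000, 200), block_size_megabytes(250_000, 250))
    # 10,000-50,000 records/second/node at 200-250 bytes/record.
    print(load_rate_gb_per_hour(10_000, 200), load_rate_gb_per_hour(50_000, 250))
```

Under those assumptions a block comes to 50-62.5 megabytes, and the load rate to about 7.2-45 gigabytes/node/hour, which is in line with the rough figures quoted above.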

In addition to all that, SenSage offers a built-in event processing engine, consisting of:

A finite-state machine correlation engine.
A proprietary event processing language.
A GUI to “abstract” (i.e., generate?) the event processing language.

The SenSage event processing engine is used to generate alerts. Data that comes into SenSage actually is passed to two places at once, namely to both the event processing engine and the database itself.
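
For readers who haven’t seen a finite-state machine correlation engine, here is a minimal sketch of the general idea, including the “two places at once” routing. It is emphatically not SenSage’s engine or its proprietary language. The `FailedLoginCorrelator` class, the `ingest` function, the event fields, and the failed-login rule are all hypothetical.

```python
class FailedLoginCorrelator:
    """Tiny per-user state machine.

    The state is the count of consecutive failed logins. A successful login
    after `threshold` or more failures raises an alert and resets the state.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = {}  # user -> consecutive failure count

    def process(self, event):
        user = event["user"]
        count = self.failures.get(user, 0)
        if event["action"] == "login_failed":
            self.failures[user] = count + 1
            return None
        if event["action"] == "login_ok":
            self.failures[user] = 0
            if count >= self.threshold:
                return f"ALERT: {user} logged in after {count} failed attempts"
        return None


def ingest(events, correlator, store):
    """Send every event to both the store and the correlation engine."""
    alerts = []
    for event in events:
        store.append(event)                # path 1: the database (a list here)
        alert = correlator.process(event)  # path 2: the event processing engine
        if alert:
            alerts.append(alert)
    return alerts
```

Feeding it three `login_failed` events for one user followed by a `login_ok` produces a single alert, while every event still lands in the store for later querying.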