After my call with Truviso and blog post referencing same, I had the chance to discuss stream processing with Mike Stonebraker, who among his many other distinctions is also StreamBase’s Founder/CTO. We focused almost exclusively on the financial trading market. Here are some of the highlights.
- The current need is for trades to be completed in 10 milliseconds, give or take; my 30 milliseconds figure was old news. Database lookups should take no more than 1/10 of that, i.e., about 1 millisecond. I wasn’t too clear, actually, on how many lookups have to fit into that millisecond, but no matter for now. The main point is that disk lookup will not get the job done.
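The latency arithmetic above can be made explicit. The figures are the ones quoted in the conversation, not measurements of mine; the disk-seek number is a generic ballpark for rotating disks of the era:

```python
# Back-of-envelope latency budget from the figures above.
TRADE_BUDGET_MS = 10.0   # a trade must complete in ~10 ms
LOOKUP_FRACTION = 0.1    # lookups get at most 1/10 of that

lookup_budget_ms = TRADE_BUDGET_MS * LOOKUP_FRACTION  # 1.0 ms

# A random disk seek runs roughly 5-10 ms, so even a single
# disk-bound lookup blows the whole 1 ms lookup slice -- hence
# the pressure to keep all needed state in memory.
DISK_SEEK_MS = 5.0
print(lookup_budget_ms, DISK_SEEK_MS > lookup_budget_ms)
```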
- Because of that, Mike challenges Truviso’s assertion that close integration with a disk-based DBMS matters. (Truviso credibly claims that its integration into PostgreSQL is tighter than StreamBase has with Sleepycat.) If you need to use data from disk, you’d better suck it all into memory in advance. For this reason and others, apps commonly need to preserve “state” totaling up to a few gigabytes. StreamBase also has ODBC/JDBC access to external databases, but that takes 30 milliseconds or so, which is too slow for financial apps. (In retrospect, I should have pushed back more about integration with disk-based DBMS in non-financial apps that aren’t so demanding about latency.)
- Interestingly, standard stock ticker feeds such as Reuters have latency up to 100 milliseconds or so. Hence, large investors take raw feeds from stock exchanges and so on.
- Stream processor architectures are radically different from those of conventional DBMS, even beyond the obvious ways. E.g., there’s no time to put tasks on a queue. Each StreamBase step, if at all possible, calls its successor directly. And filters (i.e., queries or query parts) are compiled straight down into machine code, whereas conventional DBMS usually stop at p-code and then interpret from there.
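The direct-call style can be sketched in a few lines. The names here are my own invention for illustration; this is emphatically not StreamBase code, and a real engine compiles the chain down to machine code rather than dispatching through Python objects:

```python
# Toy sketch of direct-call operator chaining: each operator hands its
# output straight to its successor as a plain function call -- no queue,
# no scheduler in between.

class Operator:
    def __init__(self, successor=None):
        self.successor = successor

    def emit(self, item):
        if self.successor is not None:
            self.successor.process(item)

class Filter(Operator):
    """A filter step: pass the tuple along only if the predicate holds."""
    def __init__(self, predicate, successor=None):
        super().__init__(successor)
        self.predicate = predicate

    def process(self, item):
        if self.predicate(item):
            self.emit(item)  # synchronous call, nothing enqueued

class Collect(Operator):
    """Terminal step that just gathers results."""
    def __init__(self):
        super().__init__()
        self.items = []

    def process(self, item):
        self.items.append(item)

sink = Collect()
pipeline = Filter(lambda t: t["price"] > 100, successor=sink)
for tick in [{"price": 99}, {"price": 101}, {"price": 150}]:
    pipeline.process(tick)
print(sink.items)  # the two ticks priced above 100
```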
- The performance wars, in Mike’s opinion, are fought largely, but not entirely, over which features have great performance. Sophistication of windowing is one example, presumably since any way to slice data into “windows” is apt to be an uncomfortable approximation to what you really want to know.
- Another is stream disorder (possibly an infelicitous phrase if your audience is into toilet humor). For starters, one needs to be able to deal with data that arrives with its timestamps out of order. For some apps (sensor networks perhaps more than financial ones) one even needs to handle cases where some data doesn’t arrive at all, at least not in an acceptable timeframe.
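One standard way to cope with disorder is to buffer events and release them only once a “slack” bound guarantees nothing earlier can still show up; anything later than that is treated as the data that never arrived in time. The sketch below is a generic textbook mechanism, not any particular vendor’s:

```python
import heapq

def reorder(events, slack):
    """Yield (ts, value) pairs in timestamp order, tolerating arrival
    disorder up to `slack` time units. Events arriving later than the
    slack bound are dropped -- the 'never arrives in an acceptable
    timeframe' case."""
    heap = []          # min-heap keyed on timestamp
    watermark = None   # nothing at or before this can still arrive
    out = []
    for ts, val in events:
        watermark = ts - slack if watermark is None else max(watermark, ts - slack)
        if ts < watermark:
            continue   # too late; drop (or route to an error stream)
        heapq.heappush(heap, (ts, val))
        while heap and heap[0][0] <= watermark:
            out.append(heapq.heappop(heap))
    while heap:        # flush whatever remains at end of stream
        out.append(heapq.heappop(heap))
    return out

arrivals = [(1, "a"), (3, "b"), (2, "c"), (7, "d"), (1, "late")]
print(reorder(arrivals, slack=2))  # timestamps in order; the late (1, 'late') is dropped
```

The obvious tension: a bigger slack tolerates more disorder but adds latency, which is exactly the resource financial apps don’t have.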
- Incredibly important is intrastream pattern recognition – e.g., self-joins of streams.
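A self-join of a stream means, loosely, matching each event against earlier events of the same stream within some window. A toy sketch of my own construction, flagging pairs of trades in the same symbol whose prices diverge sharply within a time window:

```python
from collections import defaultdict, deque

def self_join(trades, window, threshold):
    """Join a trade stream with itself: pair each trade with earlier
    trades in the same symbol at most `window` old whose price differs
    by more than `threshold`. A toy version of intrastream pattern
    recognition."""
    recent = defaultdict(deque)  # symbol -> deque of (ts, price)
    matches = []
    for ts, symbol, price in trades:
        buf = recent[symbol]
        while buf and buf[0][0] < ts - window:
            buf.popleft()        # expire trades outside the window
        for old_ts, old_price in buf:
            if abs(price - old_price) > threshold:
                matches.append((symbol, old_ts, ts))
        buf.append((ts, price))
    return matches

trades = [(1, "IBM", 100.0), (2, "IBM", 100.2), (3, "IBM", 103.5)]
print(self_join(trades, window=5, threshold=2.0))
# the ts=3 trade pairs with both earlier IBM trades
```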
- While Mike concedes that StreamSQL is incomplete as SQL goes, he claims nobody cares about the missing features (e.g., nested queries). Take that, Truviso! He also thinks there’s going to be an agreed-upon de facto standard for stream extensions to SQL. That said, precedent suggests to me that this standard will be in the form of extensions, a la SQL/MM and so on. Thus, unless and until I’m proven wrong on that, the idea of an emerging standard is not by itself an argument for lesser vs. richer base SQL.
- StreamBase and Truviso alike make a big deal about having super-high performance at “replaying” sequential streams for backtesting. Presumably, this means that you suck historical data in as a batch, then check query results as if the data had streamed in. I don’t know how the performance of this compares to what happens when the data actually arrives live.
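My reading of replay, sketched below under that assumption: stored history is pushed through the exact same per-event query path live data would take, as fast as the engine can consume it, with no pacing to the original wall-clock timestamps.

```python
def run_query(events, query):
    """Feed events one at a time through a query function. The point of
    replay is that this code path is identical whether the events come
    from a live feed or from stored history."""
    results = []
    for event in events:
        out = query(event)
        if out is not None:
            results.append(out)
    return results

# Backtesting: 'replay' stored (timestamp, price) history at full speed,
# with no sleeps pacing events to their original arrival times.
history = [(1, 100.0), (2, 101.5), (3, 99.0)]
flag_drops = lambda e: e if e[1] < 100.0 else None
print(run_query(history, flag_drops))  # [(3, 99.0)]
```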
Next up, I’ve been assured, is a talk with Progress Software’s Apama division.