I’ve been confused about temporal data management for a while, because there are several different things going on.
- Date arithmetic. This of course has been around for a very long — er, for a very long time.
- Time-series-aware compression. This has been around for quite a while too.
- “Time travel”/snapshotting — preserving the state of the database at previous points in time. This is a matter of exposing (and not throwing away) the information you capture via MVCC (Multi-Version Concurrency Control) and/or append-only updates (as opposed to update-in-place). Those update strategies are increasingly popular for pretty much anything except update-intensive OLTP (OnLine Transaction Processing) DBMS, so time-travel/snapshotting is an achievable feature for most vendors.
- Bitemporal data access. This arises when a fact carries both a transaction timestamp (when the database recorded it) and a separate valid-time interval (when it held true in the real world). A Wikipedia article seems to cover the subject pretty well, and I touched on Teradata’s bitemporal plans back in 2009.
- Time series SQL extensions. Vertica explained its version of these to me a few days ago. I imagine Sybase IQ and other serious financial-trading market players have similar features.
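To make the time-travel and bitemporal points above concrete, here is a toy sketch of an append-only store in which every fact carries both a transaction time and a valid-time interval. It is purely illustrative; the class and method names are mine, not any vendor's API, and a real MVCC engine is of course far more involved.

```python
from dataclasses import dataclass

# Toy bitemporal store. Each fact has a transaction time (when the
# database recorded it) and a valid-time interval (when it held in the
# real world). Appending versions instead of updating in place is what
# makes time travel possible. All names here are illustrative.

@dataclass
class Fact:
    key: str
    value: object
    tx_time: int        # when this version was recorded
    valid_from: int     # start of real-world validity
    valid_to: float     # end of real-world validity (inf = still valid)

class BitemporalStore:
    def __init__(self):
        self.facts = []  # append-only log; nothing is ever overwritten

    def record(self, key, value, tx_time, valid_from, valid_to=float("inf")):
        self.facts.append(Fact(key, value, tx_time, valid_from, valid_to))

    def as_of(self, key, tx_time, valid_time):
        """What did we believe, as of tx_time, about the value at valid_time?"""
        candidates = [f for f in self.facts
                      if f.key == key
                      and f.tx_time <= tx_time
                      and f.valid_from <= valid_time < f.valid_to]
        # The most recently recorded matching version wins.
        return max(candidates, key=lambda f: f.tx_time, default=None)

store = BitemporalStore()
store.record("rate", 0.05, tx_time=1, valid_from=0)
store.record("rate", 0.07, tx_time=5, valid_from=3)  # correction, recorded later

# Time travel: before transaction time 5, we still believed the old rate
# applied at valid time 4; afterward, the correction wins.
assert store.as_of("rate", tx_time=2, valid_time=4).value == 0.05
assert store.as_of("rate", tx_time=6, valid_time=4).value == 0.07
```

Snapshotting falls out of the same structure: fixing `tx_time` and querying freely over `valid_time` reads the database exactly as it stood at that moment.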
In essence, the point of time series/event series SQL functionality is to do SQL against incomplete, imprecise, or derived data.* For example, suppose in one time series events happen at times 3.00, 3.01, 3.03, and 3.05; in another time series events happen at times 3.00, 3.02, 3.03, 3.04, and 3.05; and you want to join the time series together. Then you can do an event series join — i.e., you can join on each of the times 3.00, 3.01, 3.02, 3.03, 3.04, and 3.05, using interpolated values to check WHERE conditions. Vertica says that the only interpolation methods anybody ever wants are “first value in the interval,” “last value in the interval,” and “linear average of the endpoint values” (I forget whether that’s weighted by time-distance from the endpoints, or is a simple arithmetic mean).
*This is a limited counterexample to my dictum that you should explicitly store derived data because it’s too much trouble to keep re-deriving it on the fly.
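The event series join above can be sketched in a few lines: take the union of the two series’ timestamps and fill each side’s gaps by interpolation, so conditions can be evaluated at every time. This uses the “last value in the interval” method from the list above; Vertica exposes the real thing as SQL syntax, which this Python version makes no attempt to mimic.

```python
# Event series join sketch: union the timestamps of two series, then
# interpolate each side so every output row has a value from both.

def last_value(series, t):
    """'Last value' interpolation: most recent observation at or before t."""
    prior = [(ts, v) for ts, v in series if ts <= t]
    return max(prior)[1] if prior else None

# The two example series from the text (values are made up).
a = [(3.00, 10), (3.01, 11), (3.03, 13), (3.05, 15)]
b = [(3.00, 20), (3.02, 22), (3.03, 23), (3.04, 24), (3.05, 25)]

times = sorted({ts for ts, _ in a} | {ts for ts, _ in b})
joined = [(t, last_value(a, t), last_value(b, t)) for t in times]

# At 3.02 there is no event in series a, so its 3.01 value is carried
# forward: joined[2] == (3.02, 11, 22).
```

Swapping in “first value” or linear interpolation only changes the helper function; the union-of-timestamps join shape stays the same.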
Also cool is the “CONDITION_TRUE_EVENT” syntax Vertica has had since Release 4.0, which generalizes SQL:1999 windowing; you can now look at all the rows that meet a specific criterion — via an arbitrary expression — rather than being restricted to a row count or a strict time duration. Vertica says it has gone further in the direction of event series pattern matching in Vertica 5.0; I didn’t grasp the details, but it sounded philosophically akin to Aster Data’s nPath, albeit without the arbitrary-language procedural extensibility.
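My reading of what condition-based windowing buys you, sketched in Python rather than Vertica's actual SQL: instead of windows defined by a row count or a fixed time span, a new window starts each time an arbitrary predicate fires.

```python
# Condition-based windowing sketch: a new window opens whenever the
# predicate is true for a row, rather than after N rows or T seconds.
# This illustrates the idea only; it is not Vertica's syntax.

def condition_windows(rows, predicate):
    window_id = 0
    labeled = []
    for row in rows:
        if predicate(row):
            window_id += 1  # the triggering row opens a new window
        labeled.append((window_id, row))
    return labeled

ticks = [100, 101, 99, 105, 104, 110]
# Start a new window whenever the price jumps above 104.
labeled = condition_windows(ticks, lambda p: p > 104)
# → [(0, 100), (0, 101), (0, 99), (1, 105), (1, 104), (2, 110)]
```

Aggregates can then be computed per window id, just as with ordinary windowed SQL.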
Finally, Vertica also gave me an imprecise-SQL example that has little to do with time series. Vertica has a concept of “range join,” implemented so that telecom firms can save space by storing partial IP addresses. I’ve noted before that while we should retain all human-generated data, it will never be practical to retain all machine-generated data (because its volume will keep going up based on the same technological factors that keep storage cost per unit volume going down). This sounds like one interesting (if specialized) approach to storing machine-generated data in summarized form.
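The range-join idea can be sketched as follows: one table stores only IP prefixes (i.e., ranges) with associated attributes, and individual addresses join against whichever range contains them. The labels and prefixes below are invented for illustration; the space saving comes from summarizing many addresses into one range row.

```python
import ipaddress

# Range join sketch: store IP prefixes instead of full addresses, and
# join each point address against the range that contains it. The
# containment test stands in for a SQL BETWEEN low AND high predicate.
# Example networks and labels are hypothetical.

ranges = [
    (ipaddress.ip_network("10.1.0.0/16"), "region-east"),
    (ipaddress.ip_network("10.2.0.0/16"), "region-west"),
]

def range_join(addr_str):
    addr = ipaddress.ip_address(addr_str)
    for net, label in ranges:
        if addr in net:
            return label
    return None  # address falls in no stored range

assert range_join("10.2.3.4") == "region-west"
```

The same point-in-range join shape works for any summarized machine-generated data, not just IP addresses.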