Infobright is announcing its 4.0 release, with imminent availability. In marketing and product alike, Infobright is betting the farm on machine-generated data. This hasn’t been Infobright’s strategy from the get-go, but it is these days, with pretty good focus and commitment. While some fraction of Infobright’s customer base is in the Sybase-IQ-like data mart market — and indeed Infobright put out a customer-win press release in that market a few days ago — Infobright’s current customer targets seem to be mainly:
- Web companies, many of which are already MySQL users.
- Telecommunication and similar log data, especially in OEM relationships.
- Trading/financial services, especially at mid-tier companies.
Key aspects of Infobright 4.0 include:
- “Rough Query,” which lets you get approximate query results >10X faster than you could get precise ones, which is a good thing for iterative investigative analytics.
- The start of a plan — “DomainExpert” — to compress and otherwise optimize data in specific, commonly machine-generated patterns, such as URLs or CDRs (call detail records).
- “Distributed Load Manager” — i.e., load nodes that are separate from (and more parallelized than) query nodes.
- A Hadoop connector.
- Lots of cleanup and Bottleneck Whack-A-Mole, although I haven’t paid close attention as to which parts of that are truly new, and which were already handled in recent Infobright point releases.
Items on that list focused on the machine-generated data market include:
- DomainExpert — obviously.
- The Hadoop connector — also obviously.
- The Distributed Load Manager — why would you need such load speeds unless the data is flowing in from machines?
To understand Infobright Rough Query, recall the essence of Infobright’s architecture:
Infobright’s core technical idea is to chop columns of data into chunks of 64K values apiece, called data packs, and then store concise information about what’s in the packs. The more basic information is stored in data pack nodes, one per data pack. If you’re familiar with Netezza zone maps, data pack nodes sound like zone maps on steroids. They store maximum values, minimum values, and (where meaningful) aggregates, and also encode information as to which intervals between the min and max values do or don’t contain actual data values.
I.e., a concise, imprecise representation of the database is always kept in RAM, in something Infobright calls the “Knowledge Grid.” Rough Query estimates query results based solely on the information in the Knowledge Grid — i.e., Rough Query always executes against information that’s already in RAM.
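The idea can be illustrated with a minimal sketch — emphatically not Infobright’s actual code. The pack size matches the 64K-values-per-pack figure above, but the field names, the `rough_count` function, and the restriction to just min/max/count metadata are all my own simplifications; real data pack nodes also carry aggregates and interval-occupancy information. A rough query over this metadata classifies each pack as irrelevant (skip), fully relevant (every row qualifies), or suspect (can’t tell without touching the data), and returns bounds accordingly:

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_SIZE = 65_536  # 64K values per data pack, per Infobright's design


@dataclass
class DataPackNode:
    # Simplified stand-in for a data pack node: just min, max, and row count.
    lo: int
    hi: int
    count: int


def build_knowledge_grid(column: List[int]) -> List[DataPackNode]:
    """Chop one column into 64K-value packs and keep per-pack metadata."""
    nodes = []
    for i in range(0, len(column), PACK_SIZE):
        pack = column[i:i + PACK_SIZE]
        nodes.append(DataPackNode(min(pack), max(pack), len(pack)))
    return nodes


def rough_count(nodes: List[DataPackNode], lo: int, hi: int) -> Tuple[int, int]:
    """Approximate SELECT COUNT(*) WHERE lo <= x <= hi from metadata alone.

    Returns (lower_bound, upper_bound) without reading any actual data:
    packs wholly inside [lo, hi] contribute to both bounds; packs that
    merely overlap the range contribute only to the upper bound.
    """
    lower = upper = 0
    for n in nodes:
        if n.hi < lo or n.lo > hi:
            continue                  # irrelevant pack: no row can qualify
        if lo <= n.lo and n.hi <= hi:
            lower += n.count          # fully relevant pack: every row qualifies
            upper += n.count
        else:
            upper += n.count          # suspect pack: unknowable without I/O
    return lower, upper
```

Because the metadata fits in RAM, a rough answer costs a scan over pack nodes rather than over data — which is exactly why Rough Query can be so much faster than a precise query on the same predicate.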
To me, Rough Query is the most impressive part of the Infobright 4.0 announcement. DomainExpert sounds like it will be somewhat better than straightforward prefix/suffix compression, but Infobright hasn’t yet convinced me that the difference is substantial. Distributed Load Manager is indeed important, but only because Infobright doesn’t have a shared-nothing MPP (Massively Parallel Processing) option at this time. And the rest is mainly catch-up toward Infobright’s larger and more expensive peers.