Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
More grist for the column vs. row mill
Daniel Abadi and Sam Madden are at it again, following up on their blog posts of six months ago arguing for the general superiority of column stores over row stores (for analytic query processing). The gist is to recite a number of bases for superiority beyond the two standard ones of less I/O and better compression; the argument seems to be based largely on Section 5 of a SIGMOD paper they wrote with Neil Hachem.
A big part of their argument is that if you carry the processing of columnar and/or compressed data all the way through in memory, you get lots of advantages, especially because everything’s smaller and hence fits better into Level 2 cache. There is also some kind of join-algorithm enhancement, which seems to be based on noticing when results fall into a range along some dimension, and perhaps on using dictionary encoding in a way that helps induce such an outcome.
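To make the cache argument concrete, here is a minimal Python sketch of dictionary encoding. It is my own illustration, not anything from the Abadi/Madden/Hachem paper: each distinct column value is replaced by a small integer code, and predicates can be evaluated against the codes without ever decompressing the column. The compact code array, rather than the original strings, is what has to fit in Level 2 cache.

```python
# Toy illustration (not from the paper): dictionary-encode a string column
# into small integer codes, then evaluate a predicate directly on the codes.
from array import array

def dictionary_encode(values):
    """Map each distinct value to a small integer code; return codes + dictionary."""
    dictionary = {}
    codes = array("i")
    for v in values:
        code = dictionary.setdefault(v, len(dictionary))
        codes.append(code)
    return codes, dictionary

def filter_equals(codes, dictionary, target):
    """Return matching row positions without ever decoding the column."""
    target_code = dictionary.get(target)
    if target_code is None:
        return []
    return [i for i, c in enumerate(codes) if c == target_code]

states = ["NY", "CA", "NY", "TX", "CA", "NY"]
codes, dictionary = dictionary_encode(states)
print(filter_equals(codes, dictionary, "NY"))  # -> [0, 2, 5]
```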
The main enemy here is row-store vendors who say, in effect, “Oh, it’s easy to shoehorn almost all the benefits of a column-store into a row-based system.” They also take a swipe — for being insufficiently purely columnar — at unnamed columnar Vertica competitors, described in terms that seemingly apply directly to ParAccel.
| Categories: Columnar database management, Data warehousing, Database compression, ParAccel, Vertica Systems | 2 Comments |
Introduction to SAND Technology
SAND Technology has a confused history. For example:
- SAND has been around in some form or other since 1982, starting out as a Hitachi reseller in Canada.
- In 1992 SAND acquired a columnar DBMS product called Nucleus, which originally was integrated with hardware (in the form of a card). Notwithstanding what development chief Richard Grodin views as various advantages vs. Sybase IQ, SAND has only had limited success in that market.
- Thus, SAND introduced a second, similarly-named product, which could also be viewed as a columnar DBMS. (As best I can tell, both are called SAND/DNA.) But it’s actually focused on archiving, aka the clunkily named “near-line storage.” And it’s evidently not the same code line; e.g., the newer product isn’t bit-mapped, while the older one is.
- The near-line product was originally focused on the SAP market. Now it’s moving beyond.
- Canada-based SAND had offices in Germany and the UK before it did in the US. This leads to an oddity – SAND is now less focused on the SAP aftermarket in Germany than it is in the US.
SAND is publicly traded, so its numbers are on display. It turns out to be doing $7 million in annual revenue, and losing money.
OK. I just wanted to get all that out of the way. My main thoughts about the DBMS archiving market are in a separate post.
| Categories: Archiving and information preservation, Columnar database management, Data warehousing, SAND Technology | 6 Comments |
How to buy an analytic DBMS (overview)
I went to London for a couple of days last week, at the behest of Kognitio. Since I was in the neighborhood anyway, I visited their offices for a briefing. But the main driver for the trip was a seminar Thursday at which I was the featured speaker. As promised, the slides have been uploaded here.
The material covered on the first 13 slides should be very familiar to readers of this blog. I touched on database diversity and the disk-speed barrier, after which I zoomed through a quick survey of the data warehouse DBMS market. But then I turned to material I’ve been working on more recently – practical advice directly on the subject of how to buy an analytic DBMS.
I started by proposing a seven-part segmentation self-assessment: Read more
| Categories: Buying processes, Data warehousing, Presentations | 10 Comments |
The “baseball bat” test for analytic DBMS and data warehouse appliances
More and more, I’m hearing about reliability, resilience, and uptime as criteria for choosing among data warehouse appliances and analytic DBMS. Possible reasons include:
- More data warehouses are mission-critical now, with strong requirements for uptime.
- Maybe reliability is a bit of a luxury, but the products are otherwise good enough now that users can afford to be a bit pickier.
- Vendor marketing departments are blowing the whole subject out of proportion.
The truth probably lies in a combination of all these factors.
Making the most fuss on the subject is probably Aster Data, who like to talk at length both about mission-critical data warehouse applications and Aster’s approach to making them robust. But I’m also hearing from multiple vendors that proofs-of-concept now regularly include stress tests against failure, in what can be – and indeed has been – called the “baseball bat” test. Prospects are encouraged to go on a rampage, pulling out boards, disk drives, switches, power cables, and almost anything else their devious minds can come up with to cause computer carnage. Read more
| Categories: Benchmarks and POCs, Buying processes, Data warehouse appliances, Data warehousing | 6 Comments |
Kognitio and WX-2 update
I went to Bracknell Wednesday to spend time with the Kognitio team. I think I came away with a better understanding of what the technology is all about, and why certain choices have been made.
Like almost every other contender in the market,* Kognitio WX-2 queries disk-based data in the usual way. Even so, WX-2’s design is very RAM-centric. Data gets on and off disk in mind-numbingly simple ways – table scans only, round-robin partitioning only (as opposed to the more common hash), and no compression. However, once the data is in RAM, WX-2 gets to work, happily redistributing as seems optimal, with little concern about which node retrieved the data in the first place. (I must confess that I don’t yet understand why this strategy doesn’t create ridiculous network bottlenecks.) How serious is Kognitio about RAM? Well, they believe they’re in the process of selling a system that will include 40 terabytes of the stuff. Apparently, the total hardware cost will be in the $4 million range.
*Exasol is the big exception. They basically use disk as a source from which to instantiate in-memory databases.
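To illustrate the round-robin vs. hash distinction mentioned above, here is a toy Python sketch; it is my own illustration, not Kognitio code. Round-robin assigns rows to nodes irrespective of any key, which keeps loading simple but means equal join keys don’t land on the same node, hence the heavy in-RAM redistribution.

```python
# Toy sketch of the two partitioning schemes (my own illustration, not Kognitio code).
# With round-robin, a row's home node says nothing about its key, so joins and
# aggregations require redistributing rows across the network once they are in RAM.
def round_robin_partition(rows, num_nodes):
    partitions = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        partitions[i % num_nodes].append(row)      # key-agnostic, evenly spread
    return partitions

def hash_partition(rows, key, num_nodes):
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key]) % num_nodes].append(row)  # co-locates equal keys
    return partitions

rows = [{"cust_id": i, "amount": i * 10} for i in range(8)]
print(round_robin_partition(rows, 3))
print(hash_partition(rows, "cust_id", 3))
```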
Other technical highlights of the Kognitio WX-2 story include: Read more
| Categories: Application areas, Data warehousing, Kognitio, Scientific research | 2 Comments |
Data warehouse load speeds in the spotlight
Syncsort and Vertica combined to devise and run a benchmark in which a data warehouse got loaded at 5 ½ terabytes per hour, several times faster than any figure other vendors have cited in similar press releases. Takeaways include:
- Syncsort isn’t just a mainframe sort utility company, but also does data integration. Who knew?
- Vertica’s design for overcoming the traditionally slow load speeds of columnar DBMS works.
The latter is unsurprising. Back in February, I wrote at length about how Vertica makes rapid columnar updates. I don’t have much new detail since then, but the design made sense at the time and it still does. Read more
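For background, here is a generic toy sketch in Python of the usual trick for making columnar loads fast: buffer incoming rows in a small write-optimized structure, then flush them into the column-oriented read store in sorted batches. This is my own simplification of the general idea, not a description of Vertica’s actual implementation.

```python
# Generic sketch (not Vertica's actual code) of fast columnar loading: new rows
# land in a small row-oriented write buffer, which is periodically flushed into
# column-oriented storage as a sorted run.
class ToyColumnStore:
    def __init__(self, column_names):
        self.column_names = column_names
        self.write_buffer = []                            # row-oriented, cheap appends
        self.read_store = {c: [] for c in column_names}   # column-oriented runs

    def insert(self, row):
        self.write_buffer.append(row)                     # no per-row column rewrites

    def flush(self):
        # One sort per batch, then a bulk append to each column.
        for row in sorted(self.write_buffer, key=lambda r: r[self.column_names[0]]):
            for c in self.column_names:
                self.read_store[c].append(row[c])
        self.write_buffer.clear()

store = ToyColumnStore(["order_id", "amount"])
for i in (3, 1, 2):
    store.insert({"order_id": i, "amount": i * 100})
store.flush()
print(store.read_store)   # {'order_id': [1, 2, 3], 'amount': [100, 200, 300]}
```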
The Teradata Accelerate program
An article in Intelligent Enterprise clued me in that Teradata has announced the Teradata Accelerate program. A little poking around revealed a press release in which — lo and behold — I am quoted,* to wit:
“The Teradata Accelerate program is a great idea. There’s no safer choice than Teradata technology plus Teradata consulting, bundled in a fixed-cost offering,” said Curt Monash, president of Monash Research. “The Teradata Purpose Built Platform Family members are optimized for a broad range of business intelligence and analytic uses.”
| Categories: Data warehousing, Pricing, Teradata | Leave a Comment |
Interpreting the results of data warehouse proofs-of-concept (POCs)
When enterprises buy new brands of analytic DBMS, they almost always run proofs-of-concept (POCs) in the form of private benchmarks. The results are generally confidential, but that doesn’t keep a few stats from occasionally leaking out. As I noted recently, those leaks are problematic on multiple levels. For one thing, even if the results are to be taken as accurate and basically not-misleading, the way vendors describe them leaves a lot to be desired.
Here’s a concrete example to illustrate the point. One of my vendor clients sent over the stats from a recent POC, in which its data warehousing product was compared against a name-brand incumbent. Sixteen reports were run. The new product beat the old 16 out of 16 times. The smallest margin was a 1.8X speed-up, while the largest was a whopping 335.5X.
My client helpfully took the “simple average” – i.e., the arithmetic mean – of the 16 factors, and described the result as an average 62X drubbing. But is that really fair? Read more
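To see why the “simple average” can mislead, consider a quick illustration with hypothetical speedup factors; only the 1.8X floor and the 335.5X ceiling are borrowed from the POC, and the other fourteen numbers are invented for the sake of the example.

```python
# Hypothetical numbers (NOT the client's actual sixteen factors) showing how a
# single outlier can dominate a "simple average" of per-report speedups.
import statistics

speedups = [1.8, 2, 3, 3, 4, 5, 5, 6, 8, 10, 12, 15, 20, 30, 50, 335.5]

print(f"arithmetic mean: {statistics.mean(speedups):.1f}X")            # dragged up by the outlier
print(f"geometric mean:  {statistics.geometric_mean(speedups):.1f}X")
print(f"median:          {statistics.median(speedups):.1f}X")          # closer to the typical report
```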
| Categories: Benchmarks and POCs, Buying processes, Data warehousing | 7 Comments |
Graphjam: I can haz BI
Charts and graphs, from the folks who brought you a whole lot of cute kitten photos:
| Categories: Business intelligence, Fun stuff, Humor | Leave a Comment |
When people don’t want accurate predictions made about them
In a recent article on governmental anti-terrorism data mining efforts — and the privacy risks associated with same — The Economist wrote (emphasis mine):
Abdul Bakier, a former official in Jordan’s General Intelligence Department, says that tips to foil data-mining systems are discussed at length on some extremist online forums. Tricks such as calling phone-sex hotlines can help make a profile less suspicious. “The new generation of al-Qaeda is practising all that,” he says.
Well, duh. Terrorists and fraudsters don’t want to be detected. Algorithms that rely on positive evidence of bad intent may work anyway. But if you rely on evidence that shows people are not bad actors, that’s likely to work about as well as Bayesian spam detectors.* Read more
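For the sake of illustration, here is a toy Python sketch of the scoring problem; the features and likelihood ratios are entirely made up, and no real system is being described. A naive-Bayes-style score that gives weight to “innocent” behaviors can be dragged down simply by padding the profile with them.

```python
# Toy naive-Bayes-style score (my own illustration, not any real system) showing
# how padding a profile with innocuous behavior dilutes the overall suspicion
# score -- the gaming described in the Economist article.
import math

# Hypothetical per-feature likelihood ratios P(feature | bad) / P(feature | benign).
likelihood_ratios = {
    "wired money to known front": 50.0,
    "visited extremist forum": 20.0,
    "called phone-sex hotline": 0.1,   # "innocent" behavior, deliberately added
    "streams cat videos": 0.2,         # more deliberate noise
}

def suspicion_score(features):
    """Sum of log likelihood ratios; higher means more suspicious."""
    return sum(math.log(likelihood_ratios[f]) for f in features)

print(suspicion_score(["wired money to known front", "visited extremist forum"]))
print(suspicion_score(list(likelihood_ratios)))  # same bad acts plus padding: lower score
```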
