Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
More on Fox Interactive Media’s use of Greenplum
Greenplum’s most important reference is probably its energetic advocate Fox Interactive Media, even ahead of much larger user Greenplum user eBay, and notwithstanding Aster Data’s large presence in Fox subsidiary MySpace. I just ran across a “review” of Greenplum by FIM’s Brian Dolan, neatly summarizing his views about Greenplum’s strengths, weaknesses, and uses inside Fox. Highlights include: Read more
| Categories: Data warehousing, Fox and MySpace, Greenplum, Web analytics | 2 Comments |
Merv Adrian on SAND Technology
Merv Adrian blogged about SAND Technology, casting significant doubt on SAND’s business prospects. At this point, I can’t say I disagree. On the other hand, SAND does have public, audited financial statements showing it generating more revenue than a lot of other analytic DBMS or archiving vendors probably make. Columnar DBMS vendors doing better than SAND are Sybase, Vertica, maybe Infobright — and who else?
| Categories: Archiving and information preservation, Columnar database management, Data warehousing, SAND Technology | 1 Comment |
Daniel Abadi on Kickfire and related subjects
Daniel Abadi has a new blog, whose first post centers around Kickfire. The money quote is (emphasis mine):
In order for me to get excited about Kickfire, I have to ignore Mike Stonebraker’s voice in my head telling me that DBMS hardware companies have been launched many times in the past are ALWAYS fail (the main reasoning is that Moore’s law allows for commodity hardware to catch up in performance, eventually making the proprietary hardware overpriced and irrelevant). But given that Moore’s law is transforming into increased parallelism rather than increased raw speed, maybe hardware DBMS companies can succeed now where they have failed in the past
Good point.
More generally, Abadi speculates about the market for MySQL-compatible data warehousing. My responses include:
- OF COURSE there are many MySQL users who need to move to a serious analytic DBMS.
- What’s less clear is whether there’s any big advantage to those users in remaining MySQL-compatible when they do move. I’m not sure what MySQL-specific syntax or optimizations they’d have that would be difficult to port to a non-MySQL system.
- It’s nice to see Abadi speaking well of Infobright and its technology.
- To say that Infobright went open source because it was “desperate” is overstated. That said, I don’t think Infobright was on track to prosper without going open source.
- While open source and MySQL go together, an appliance like Kickfire loses many (not all) of the benefits of open source.
- Calpont has indeed never disclosed a customer win. Any year now … (Just kidding, Vogel!)
- In general, seeing Abadi be so favorable toward Vertica competitors adds credibiity to the recent Hadoop vs. DBMS paper.
Anyhow, as previously noted, I’m a big Daniel Abadi fan. I look forward to seeing what else he posts in his blog, and am optimistic he’ll live up to or exceed its stated goals.
| Categories: Calpont, Columnar database management, Data warehouse appliances, Data warehousing, DBMS product categories, Infobright, Kickfire, MySQL, Open source, Theory and architecture | 2 Comments |
Greenplum update — Release 3.3 and so on
I visited Greenplum in early April, and talked with them again last night. As I noted in a separate post, there are a couple of subjects I won’t write about today. But that still leaves me free to cover a number of other points about Greenplum, including: Read more
| Categories: Data warehousing, Database compression, EAI, EII, ETL, ELT, ETLT, Greenplum, MapReduce, Market share and customer counts, Parallelization, PostgreSQL, Pricing | 11 Comments |
Greenplum will be announcing some stuff
Greenplum is having a webinar Monday to announce “The Next Big Leap in Data Warehousing” (capitalization theirs). The idea they’ll be talking about is a genuinely good one. And off the top of my head I can only think of a few vendors who implemented it before Greenplum, and even fewer who emphasize it explicitly. So if you like webinars, you might want to listen in. I plan to blog about the general concept soon after the 12:01 am Monday embargo lifts. (Uh, guys, it is Monday rather than Tuesday, right?) Read more
| Categories: Data warehousing, Greenplum, Specific users | 1 Comment |
What statistics texts and other analytics books should we recommend to people?
On a message board I frequent, two different guys have asked for recommendations for statistics textbooks, in a kind of general knowledge vein. One phrases it as:
I’m looking for a general purpose statistics textbook for reference purposes.
giving his background as
I took Calculus-level Statistics in college. (i.e. 2 semesters of Calc was a prerequisite; this was the stats class that stat majors took.)
He was a computer science major and is now a professional programmer. (And if somebody can use a tournament-chess-smart programmer with outstandingly clear communication skills in the Buffalo area, I’m pretty sure he’d be glad to know about the opportunity. But I digress …)
The other is a law student with a more general need, which he phrases as
I want to use them for work to help identify trends; do multiple regressions; put values on things that aren’t easy to quantify, etc.
Economics I already know most of the basics from my undergrad studies, but I need more advanced economic theory and such.
He’s interested in what I’d call “pop” analytics books as well as hardcore stuff; e.g., the one book he’s identified already is “Competing with Analytics.” I’m thinking some good vendor white papers might be just as useful for him as that class of books. But he obviously also wants to learn the hardcore stuff as well.
I haven’t attended or taught a college course since 1981, and I tend to find the business books on analytics too simple for my tastes, so I’m not the right guy to answer from his own experience.
Does anybody have any helpful thoughts? Thanks!
| Categories: Analytic technologies | 11 Comments |
Reinventing business intelligence
I’ve felt for quite a while that business intelligence tools are due for a revolution. But I’ve found the subject daunting to write about because — well, because it’s so multifaceted and big. So to break that logjam, here are some thoughts on the reinvention of business intelligence technology, with no pretense of being in any way comprehensive.
Natural language and classic science fiction
Actually, there’s a pretty well-known example of BI near-perfection — the Star Trek computers, usually voiced by the late Majel Barrett Roddenberry. They didn’t have a big role in the recent movie, which was so fast-paced nobody had time to analyze very much, but were a big part of the Star Trek universe overall. Star Trek’s computers integrated analytics, operations, and authentication, all with a great natural language/voice interface and visual displays. That example is at the heart of a 1998 article on natural language recognition I just re-posted.
As for reality: For decades, dating back at least to Artificial Intelligence Corporation’s Intellect, there have been offerings that provided “natural language” command, control, and query against otherwise fairly ordinary analytic tools. Such efforts have generally fizzled, for reasons outlined at the link above. Wolfram Alpha is the latest try; fortunately for its prospects, natural language is really only a small part of the Wolfram Alpha story.
A second theme has more recently emerged — using text indexing to get at data more flexibly than a relational schema would normally allow, either by searching on data values themselves (stressed by Attivio) or more by searching on the definitions of pre-built reports (the Google OneBox story). SAP’s Explorer is the latest such view, but I find Doug Henschen’s skepticism about SAP Explorer more persuasive than Cindi Howson’s cautiously favorable view. Partly that’s because I know SAP (and Business Objects); partly it’s because of difficulties such as those I already noted.
Flexibility and data exploration
It’s a truism that each generation of dashboard-like technology fails because it’s too inflexible. Users are shown the information that will provide them with the most insight. They appreciate it at first. But eventually it’s old hat, and when they want to do something new, the baked-in data model doesn’t support it.
The latest attempts to overcome this problem lie in two overlapping trends — cool data exploration/visualization tools, and in-memory analytics. Read more
| Categories: Analytic technologies, Business intelligence, Google, Memory-centric data management, Microsoft and SQL*Server, SAP AG | 19 Comments |
How big are the intelligence agencies’ data warehouses?
Edit: The relevant part of the article cited has now been substantially changed, in line with Jeff Jonas’ remarks in the comment thread below.
Joe Harris linked me to an article that made a rather extraordinary claim:
At another federal agency Jonas worked at (he wouldn’t say which), they had a very large data warehouse in the basement. The size of the data warehouse was a secret, but Jonas estimated it at 4 exabytes (EB), and increasing at the rate of 5 TB per day.
Now, if one does the division, the quote claims it takes 800,000 days for the database to double in size, which is absurd. Perhaps this (Jeff) Jonas guy was just talking about a 4 petabyte system and got confused. (Of course, that would still be pretty big.) But before I got my arithmetic straight, I ran the 4 exabyte figure past a couple of folks, as a target for the size of the US government’s largest classified database. Best guess turns out to be that it’s 1-2 orders of magnitude too high for the government’s largest database, not 3. But that’s only a guess …
| Categories: Data warehousing, Specific users | 5 Comments |
Notes on CEP application development
While performance may not be all that great a source of CEP competitive differentiation, event processing vendors find plenty of other bases for technological competition, including application development, analytics, packaged applications, and data integration. In particular:
- Most independent CEP vendors have some kind of application story in the capital markets vertical, such as packaged applications, ISV partners with packaged applications, application frameworks, and so on.
- CEP vendors offer lots of connectors to specific financial industry price/quote/trade feeds, as well as the usual other kinds of database connectivity (SQL, XML, etc.)
- Aleri/Coral8 (separately and now together) like to call attention to their business intelligence/analytics offerings. Analytics is front-and-center on Truviso’s web site too, not that Truviso does much to call attention to itself, period. (Roman Bukary once said he’d outline Truviso’s new strategy to me in 6-8 weeks or so … it’s now 14 months and counting.)
So far as I can tell, the areas of applications and analytics are fairly uncontroversial. Different CEP vendors have implemented different kinds of things, no doubt focusing on those they thought they would find easiest to build and then sell. But these seem to be choices in business execution, not in core technical philosophy.
In CEP application development, however, real philosophical differences do seem to arise. There are at least three different CEP application development paradigms: Read more
| Categories: Aleri and Coral8, Business intelligence, Microsoft and SQL*Server, Progress, Apama, and DataDirect, StreamBase, Streaming and complex event processing (CEP) | 5 Comments |
Followup on IBM System S/InfoSphere Streams
After posting about IBM’s System S/InfoSphere Streams CEP offering, I sent three followup questions over to Jeff Jones. It seems simplest to just post the Q&A verbatim.
1. Just how many processors or cores does it take to get those 5 million messages/sec through? A little birdie says 4,000 cores. Read more
