Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
Introduction to Gooddata
Around the end of the Cold War, Esther Dyson took it upon herself to go repeatedly to Eastern Europe and do a lot of rah-rah and catalysis, hoping to spark software and other computer entrepreneurs. I don’t know how many people’s lives she significantly affected – I’d guess it’s actually quite a few – but in any case the number is not zero. Roman Stanek, who has built and sold a couple of software business, cites her as a key influence setting him on his path.
Roman’s latest venture is business intelligence firm Gooddata. Gooddata was founded in 2007 and has been soliciting and getting attention for a while, so I was surprised to learn that Gooddata officially launched just a few weeks ago. Anyhow, some less technical highlights of the Gooddata story include: Read more
Ray Wang on SAP
Ray Wang made a terrific post based on SAP’s annual influencer love-in, an event which I no longer attend. Ray believes SAP has been in a “crisis”, and sums up his views as
The Bottom Line – SAP’s Turning The Corner
Credit must be given to SAP for charting a new course. A shift in the management philosophy and product direction will take years to realize, however, its not too late for change. SAP must remember its roots and become more German and less American. The renewed focus must put customer requests and priorities ahead of SAP’s bureaucracy. The emphasis must focus on the relationship. When that reemerges in how SAP works with customers, partners, influencers, and its own employees, SAP will be back in good graces. In the meantime, its time to get to work and deliver. Oracle’s Fusions Apps are coming soon and competitors such as IBM, Microsoft, Epicor, IFS, and SalesForce.com will not relent.
I recall the 1980s, when SAP’s main differentiator, at least in the English-speaking US, was a total commitment to customer success, and when it could be taken for granted that SAP would do business ethically. Things change, and not always for the better.
Anyhow, the reason I’m highlighting Ray’s post is that he makes reference to a number of interesting SAP-cetric technology trends or initiatives. Read more
| Categories: Analytic technologies, Business intelligence, Memory-centric data management, MOLAP, SAP AG, Solid-state memory | 1 Comment |
A framework for thinking about data warehouse growth
There are only three ways that the amount of data stored in data warehouses can grow:
- The same kinds of data are stored as before, with more being added over time.
- The same kinds of data are stored as before, but in more detail.
- New kinds of data are stored.
| Categories: Analytic technologies, Application areas, Data warehousing, Investment research and trading, Log analysis, Solid-state memory, Storage, Telecommunications, Text, Web analytics | 9 Comments |
Webinar on MapReduce for complex analytics (Thursday, December 3, 10 am and 2 pm Eastern)
The second in my two-webinar series for Aster Data will occur tomorrow, twice (both live), at 10 am and 2 pm Eastern time. The other presenters will be Jonathan Goldman, who was a Principal Scientist at LinkedIn but now has joined Aster himself, and Steve Wooledge of Aster (playing host). Key links are:
- Registration for tomorrow’s webinars
- Replay of the first webinar
- My slides from the first webinar
The main subjects of the webinar will be:
- Some review of material from the first webinar (all three presenters)
- Discussion of how MapReduce can help with three kinds of analytics:
- Pattern matching (Jonathan will give detail)
- Number-crunching (I’ll cover that, and it will be short)
- Graph analytics (I haven’t written the slides yet, but my starting point will be some of the relationship analytics ideas we discussed in August)
Arguably, aspects of data transformation fit into each of those three categories, which may help explain why data transformation has been so prominent among the early applications of MapReduce.
As you can see from Aster’s title for the webinar (which they picked while I was on vacation), at least their portion will be focused on customer analytics, e.g. web analytics.
| Categories: Analytic technologies, Aster Data, Data integration and middleware, EAI, EII, ETL, ELT, ETLT, MapReduce, RDF and graphs, Web analytics | 4 Comments |
New England Database Summit (January 28, 2010)
New England Database Day has now, in its third year, become a “Summit.” It’s a nice event, providing an opportunity for academics and business folks to mingle. The organizers are basically the local branch of the Mike Stonebraker research tree, with this year’s programming head being Daniel Abadi. It will be on Thursday, January 28, 2010, once again in the Stata Center at MIT. It would be reasonable to park in the venerable 4/5 Cambridge Center parking lot, especially if you’d like to eat at Legal Seafood afterwards.
So far there are two confirmed speakers — Raghu Ramakrishnan of Yahoo and me. My talk title will be something like “Database and analytic technology: The state of the union”, with all wordplay intended.
There’s more information at the official New England Database Summit website. There’s also a post with similar information on Daniel Abadi’s DBMS Musings blog.
Edit after the event:
Posts based on my January, 2010 New England Database Summit keynote address
- Data-based snooping — a huge threat to liberty that we’re all helping make worse
- Flash, other solid-state memory, and disk
- Interesting trends in database and analytic technology
- Open issues in database and analytic technology
| Categories: Analytic technologies, Data warehousing, Michael Stonebraker, Presentations, Theory and architecture | 4 Comments |
Comments on a fabricated press release quote
My clients at Kickfire put out a press release last week quoting me as saying things I neither said nor believe. The press release is about a “Queen For A Day” kind of contest announced way back in April, in which users were invited to submit stories of their data warehouse problems, with the biggest sob stories winning free Kickfire appliances. The fabricated “quote” reads: Read more
| Categories: About this blog, Data warehouse appliances, Data warehousing, Kickfire, Market share and customer counts, Sybase | 3 Comments |
Boston Big Data Summit keynote outline
Last month, Bob Zurek asked me to give a talk on “Big Data”, where “big” is anything from a few terabytes on up, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I’m posting them below.
Calpont’s InfiniDB
Since its inception, Calpont has gone through multiple management teams, strategies, and investor groups. What it hadn’t done, ever, is actually shipped a product. Last week, however, Calpont introduced a free/open source DBMS, InfiniDB, with technical details somewhat reminiscent of what Calpont was promising last April. Highlights include:
- Like Infobright, Calpont’s InfiniDB is a columnar DBMS consisting of a MySQL front end and a columnar storage engine.
- Community edition InfiniDB runs on a single server.
- One of commercial/enterprise edition InfiniDB’s main claims to fame will be MPP support.
- There’s no announced time frame for commercial edition InfiniDB.
- InfiniDB’s current compression story is dictionary/token only, with decompression occurring before joins are executed. Improvement is a roadmap item.
- Indeed, InfiniDB has many roadmap items, a few of which can be found here. Also, a great overview of InfiniDB’s current state and roadmap can be found in this MySQL Performance Blog thread. (And follow the links there to find performance discussions of other free analytic DBMS.)
- One thing InfiniDB already has that is still a roadmap item for Infobright is the ability to run a query across multiple cores at once.
- One thing free InfiniDB has that Infobright only offers in its Enterprise Edition is ACID-compliant Insert/Update/Delete. (Note: I wish people would stop saying that Infobright Enterprise Edition isn’t ACID-compliant, since that point was cleared up a while ago.)
- InfiniDB has no indexes or materialized views.
- However, InfiniDB’s retrieval is expedited by something called “Extents,” which sounds a lot like Netezza’s zone maps.
Being on vacation, I’ll stop there for now. (If it weren’t for Tropical Storm/ depression Ida, I might not even be posting this much until I get back.)
| Categories: Analytic technologies, Calpont, Columnar database management, Data warehousing, Database compression, Infobright, MySQL, Open source | 3 Comments |
Aster Data 4.0 and the evolution of “advanced analytic(s) servers”
Since Linda and I are leaving on vacation in a few hours, Aster Data graciously gave me permission to morph its “12:01 am Monday, November 2” embargo into “late Friday night.”
Aster Data is officially announcing the 4.0 release of nCluster. There are two big pieces to this announcement:
- Aster is offering a slick vision for integrating big-database management and general analytic processing on the same MPP cluster, under the not-so-slick name “Data-Application Server.”
- Aster is also offering a sophisticated vision for workload management.
In addition, Aster has matured nCluster in various ways, for example cleaning up a performance problem with single-row updates.
Highlights of the Aster “Data-Application Server” story include: Read more
| Categories: Aster Data, Cloud computing, Data warehousing, EAI, EII, ETL, ELT, ETLT, MapReduce, Market share and customer counts, Teradata, Theory and architecture, Workload management | 9 Comments |
A question on MDX performance
An enterprise user wrote in with a question that boils down to:
What are reasonable MDX performance expectations?
MDX doesn’t come up in my life very much, and I don’t have much intuition about it. E.g., I don’t know whether one can slap an MDX-to-SQL converter on top of a fast analytic RDBMS and go to town. What’s more, I’m heading off on vacation and don’t feel like researching the matter myself in the immediate future. 🙂
So here’s the long form of the question. Any thoughts?
I have a general question on assessing the performance of an OLAP technology using a set of MDX queries. I would be interested to know if there are any benchmark MDX performance tests/results comparing different OLAP technologies (which may be based on different underlying DBMS’s if appropriate) on similar hardware setup, or even comparisons of complete appliance solutions. More generally, I want to determine what performance limits I could reasonably expect on what I think are fairly standard servers.
In my own work, I have set up a star schema model centered on a Fact table of 100 million rows (approx 60 columns), with dimensions ranging in cardinality from 5 to 10,000. In ad hoc analytics, is it expected that any query against such a dataset should return a result within a minute or two (i.e. before a user gets impatient), regardless of whether that query returns 100 cells or 50,000 cells (without relying on any aggregate table or caching mechanism)? Or is that level of performance only expected with a high end massively parallel software/hardware solution? The server specs I’m testing with are: 32-bit 4 core, 4GB RAM, 7.2k RPM SATA drive, running Windows Server 2003; 64-bit 8 core, 32GB RAM, 3 Gb/s SAS drive, running Windows Server 2003 (x64).
I realise that caching of query results and pre-aggregation mechanisms can significantly improve performance, but I’m coming from the viewpoint that in purely exploratory analytics, it is not possible to have all combinations of dimensions calculated in advance, in addition to being maintained.
| Categories: Analytic technologies, Benchmarks and POCs, Data warehousing, MOLAP | 16 Comments |
