Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
Notes from the Fusion-io S-1 filing
Fusion-io has filed for an initial public offering. With public offerings come S-1 filings, which, along with 10-Ks, are the kinds of SEC filing that typically contain a few nuggets of business information. Notes from Fusion-io’s S-1 include:
Fusion-io is growing very, very fast, doubling or better in revenue every 6 months.
Fusion-io’s marketing message revolves around “data centralization”. Fusion-io is competing against storage-area networks and storage arrays.
Fusion-io’s list of application types includes
… systems dedicated to decision support, high performance financial analysis, web search, content delivery and enterprise resource planning.
Fusion-io says it has shipped over 20 petabytes of storage.
Fusion-io has a shifting array of big customers, including OEMs: Read more
| Categories: Analytic technologies, Data warehousing, Facebook, Solid-state memory, Storage |
Traditional databases will eventually wind up in RAM
In January, 2010, I posited that it might be helpful to view data as being divided into three categories:
- Human/Tabular data — i.e., human-generated data that fits well into relational tables or arrays.
- Human/Nontabular data — i.e., all other data generated by humans.
- Machine-Generated data.
I won’t now stand by every nuance in that post; the nuances may differ slightly from those in my more recent posts about machine-generated data and poly-structured databases. But one general idea is hard to dispute:
Traditional database data — records of human transactional activity, referred to as “Human/Tabular data” above — will not grow as fast as Moore’s Law makes computer chips cheaper.
And that point has a straightforward corollary, namely:
It will become ever more affordable to put traditional database data entirely into RAM. Read more
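To make the corollary concrete, here is a toy calculation; the data-growth and RAM-price-decline rates are purely illustrative assumptions on my part, not forecasts:

```python
# Toy illustration of the corollary above. All rates are assumed, not sourced.
data_tb = 10.0               # today's traditional database data, in TB
ram_dollars_per_tb = 10_000  # assumed current RAM cost per TB
data_growth = 1.20           # assume transactional data grows 20%/year
ram_price_factor = 0.65      # assume RAM $/TB falls 35%/year

for year in range(0, 11, 2):
    cost = (data_tb * data_growth ** year
            * ram_dollars_per_tb * ram_price_factor ** year)
    print(f"Year {year:2d}: ~${cost:,.0f} to hold it all in RAM")
```

So long as the product of the two factors stays below 1 (here 1.20 × 0.65 = 0.78), an all-RAM configuration gets roughly 22% cheaper every year.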
IBM InfoSphere Warehouse pricing, packaging, compression and more
IBM InfoSphere Warehouse 9.7.3 has been announced, and is planned for general availability late this month. IBM InfoSphere Warehouse is, in essence, DB2-plus, where the “plus” comprises:
- DPF (Database Partitioning Feature) — i.e., the ability to do shared-nothing scale-out.
- Unimportant add-ons — e.g., a mere 5 seats of the Cognos BI tool.
The main news in this release of InfoSphere Warehouse is probably pricing. While IBM has long had a funky server-power-based pricing scheme, it is now adding per-terabyte pricing, with a twist: IBM InfoSphere Warehouse now can be bought per terabyte of compressed user data. Specifically (a quick worked example follows this list):
- IBM InfoSphere Warehouse 9.7.3 Enterprise Edition can be bought for production for $70K or so per terabyte of compressed user data.
- IBM InfoSphere Warehouse 9.7.3 Departmental Edition can be bought for production for $35K or so per terabyte of compressed user data.
- Development/test seats of IBM InfoSphere Warehouse cost about $2K per user.
- High availability/disaster recovery instances are priced as if they were managing 1 TB each — unless, of course, you have an active-active configuration, in which case they’re priced according to their full amount of data.
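To see why the “compressed” qualifier matters, here is a hypothetical worked example; the 12 TB dataset and the 3:1 compression ratio are my assumptions, and only the $70K figure comes from the list above:

```python
# Hypothetical example of compressed-terabyte pricing. The dataset size and
# compression ratio are assumptions; only the $70K/TB figure is from above.
raw_user_data_tb = 12.0
compression_ratio = 3.0             # assumed; actual ratios vary by data
price_per_compressed_tb = 70_000    # Enterprise Edition, per the list above

compressed_tb = raw_user_data_tb / compression_ratio    # 4.0 TB on disk
license_cost = compressed_tb * price_per_compressed_tb  # $280,000
print(f"${license_cost:,.0f}, i.e. ~${license_cost / raw_user_data_tb:,.0f} per raw TB")
```

The better your compression, the lower your effective price per terabyte of raw user data.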
Per-terabyte pricing is generally a good way to think about analytic DBMS costs, for at least two reasons: Read more
| Categories: Data warehousing, Database compression, IBM and DB2, Pricing |
Oracle and IBM workload management
When last night’s Oracle/Exadata post got too long — and before I knew Oracle would request a different section be cut — I set aside my comments on Oracle’s workload management story to post separately. Elements of Oracle’s workload management story include:
- Oracle’s workload management product is called Oracle Database Resource Manager.
- Oracle Database Resource Manager has long managed CPU. For Exadata, Oracle added in management of I/O. Management of RAM is coming.
- Another aspect of Oracle workload management is “instance caging.” If you’re running multiple instances of Oracle on the same box – e.g., one with 128 cores and thus 256 threads – instance caging can keep an instance confined to a specific number of threads (see the sketch after this list).
- Policies can let some classes of user get access to more threads in Oracle Parallel Query than others do.*
- Oracle offers a QoS (Quality of Service) layer, at least on Exadata, that tries to use Oracle’s workload management capabilities to enforce SLAs (Service Level Agreements). For example, if you want a certain query to always be answered in no more than 0.3 seconds, it tries to make that happen. However, this technology is new in the current Oracle release, and will be enhanced going forward.
*Recall that “degrees of parallelism” in Oracle Parallel Query can now be set automagically.
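To illustrate the instance-caging idea, here is a minimal conceptual sketch in Python; the instance names and thread caps are made up, and this is emphatically not Oracle’s actual interface:

```python
# Conceptual sketch of instance caging: cap each database instance at a
# fixed number of hardware threads. Instance names and caps are made up.
TOTAL_THREADS = 256  # the 128-core, 256-thread box from the example above

caps = {"prod": 160, "reporting": 64, "dev": 32}
in_use = {"prod": 160, "reporting": 10, "dev": 0}  # prod is at its cage

def admit_worker(instance: str) -> bool:
    """Allow a new worker thread only if the instance is under its cage."""
    return in_use[instance] < caps[instance]

print(admit_worker("prod"))       # False: prod has hit its 160-thread cage
print(admit_worker("reporting"))  # True: plenty of headroom left
```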
One reason I split out this discussion of workload management is that I also talked with IBM’s Tim Vincent yesterday, who added some insight to what I already wrote last August about DB2/InfoSphere Warehouse workload management. Specifically:
- DB2/InfoSphere Warehouse workload management has multiple ways to manage use of CPU resources.
- DB2/InfoSphere Warehouse workload management doesn’t directly manage consumption of I/O or RAM resources. However, it can influence usage of I/O or RAM by:
  - Limiting the number of rows read or returned.
  - Adjusting priorities as to which queries get to prefetch the most records.
- DB2/InfoSphere Warehouse workload management doesn’t allow you to directly set an SLA mandating query response time. However, if query response times exceed a target SLA, DB2/InfoSphere Warehouse workload management can cause a statistics dump that might help you tune your way out of the problem.
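That last behavior is easy to sketch. The wrapper below is my illustration only; the function names and the 0.3-second threshold are hypothetical:

```python
# Sketch of SLA-miss handling in the DB2 style described above: don't enforce
# the response time, but dump statistics when it is exceeded. Illustrative only.
import json
import time

SLA_SECONDS = 0.3  # hypothetical response-time target

def run_query(query, execute, collect_stats):
    """Run a query; on an SLA miss, leave evidence for later tuning."""
    start = time.monotonic()
    result = execute(query)
    elapsed = time.monotonic() - start
    if elapsed > SLA_SECONDS:
        # No hard guarantee; just record statistics to tune your way out.
        with open("sla_violation_stats.json", "w") as out:
            json.dump(collect_stats(query, elapsed), out)
    return result
```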
| Categories: Data warehousing, IBM and DB2, Oracle, Workload management |
Oracle and Exadata: Business and technical notes
Last Friday I stopped by Oracle for my first conversation since January, 2010, in this case for a chat with Andy Mendelsohn, Mark Townsend, Tim Shetler, and George Lumpkin, covering Exadata and the Oracle DBMS. Key points included: Read more
Application areas for SAS HPA
When I talked with SAS about its forthcoming in-memory parallel SAS HPA offering, we talked briefly about application areas. The three areas SAS cited were:
- Consumer financial services. The idea here is to combine information about customers’ use of all kinds of services — banking, credit cards, loans, etc. SAS believes this serves both marketing and risk analysis purposes.
- Insurance. We didn’t go into detail.
- Mobile communications. SAS’ customers aren’t giving it details, but they’re excited about geocoding/geospatial data.
Meanwhile, in another interview I heard about, SAS emphasized retailers. Indeed, that’s what spawned my recent post about logistic regression.
The mobile communications one is a bit scary. Your cell phone — and hence your cellular company — knows where you are, pretty much from moment to moment. Even without advanced analytic technology applied to it, that’s a pretty direct privacy threat. Throw in some analytics, and your cell company might know, for example, who you hang out with (in person), where you shop, and how those things predict your future behavior. And so the government — or just your employer — might know those things too.
| Categories: Application areas, Predictive modeling and advanced analytics, SAS Institute, Surveillance and privacy, Telecommunications |
In-memory, parallel, not-in-database SAS HPA does make sense after all
I talked with SAS about its new approach to parallel modeling. The two key points are:
- SAS no longer plans to go as far with in-database modeling as it previously intended.
- Rather, SAS plans to run in RAM on MPP DBMS appliances, exploiting MPI (Message Passing Interface); a sketch of the pattern appears at the end of this post.
The whole thing is called SAS HPA (High-Performance Analytics), in an obvious reference to HPC (High-Performance Computing). It will run initially on RAM-heavy appliances from Teradata and EMC Greenplum.
A lot of what’s going on here is that SAS found it annoyingly difficult to parallelize modeling within the framework of a massively parallel DBMS such as Teradata. Notes on that aspect include:
- SAS wasn’t exploiting the capabilities of individual DBMS to their fullest; rather, it was looking for an approach that would work across multiple brands of DBMS. Thus, for example, the fact that Aster’s analytic platform architecture is more flexible or powerful than Teradata’s didn’t help much with making SAS run within the Aster nCluster database.
- Notwithstanding everything else, SAS did make a certain set of modeling procedures run in-database.
- SAS’ previous plans to run in-database modeling in Aster and/or Netezza DBMS may never come to fruition.
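For the curious, the MPI pattern at issue is straightforward to sketch. The mpi4py example below is my generic illustration of data-parallel model fitting, not SAS code; each rank works on its own RAM-resident shard, and partial gradients are combined with an allreduce:

```python
# Generic MPI data-parallel modeling sketch (mpi4py), illustrating the kind
# of pattern described above. Not SAS code; the data here is synthetic.
# Run with e.g.: mpiexec -n 4 python hpa_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n, d = 100_000, 10                      # rows per rank, features

# Each rank holds its own in-memory shard (random stand-in data).
rng = np.random.default_rng(seed=comm.rank)
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)

w = np.zeros(d)
for _ in range(50):                     # logistic regression by gradient descent
    p = 1.0 / (1.0 + np.exp(-X @ w))    # local predictions
    local_grad = X.T @ (p - y)          # gradient on this rank's shard
    grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, grad, op=MPI.SUM)  # sum across all ranks
    w -= 0.1 * grad / (n * comm.size)   # identical update on every rank

if comm.rank == 0:
    print("fitted weights:", w)
```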
Endeca topics
I visited my then-clients at Endeca in January. We focused on underpinnings (and strategic counsel) more than on the coolness of what the product actually does. But going over my notes, I think there’s enough to write up now.
Before saying much else about Endeca, there’s one confusion to dispose of: What’s the relationship between Endeca’s efforts in e-commerce (helping shoppers navigate websites) and business intelligence (helping people navigate their own data)? As Endeca tells it:
- Endeca’s e-commerce and business intelligence efforts are reflections of the same technical approach (guided/faceted navigation, sketched after this list). Indeed, I’m pretty sure Endeca’s product lines still share much/most of the same technology.
- Endeca went after e-commerce first because that’s where the provable ROI was. As I pointed out a couple of times in 2007, Endeca became a market leader in that area.
- Endeca increased its BI efforts later.
- Circa 2009-10, Endeca differentiated its e-commerce and BI product lines from each other.
- An e-commerce line extension called Page Builder is what really got Endeca through the recent recession.
- The BI product line Latitude was launched in the fall of 2010.
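The shared approach in question is guided (faceted) navigation. Here is a minimal sketch of the core idea; the records and facet names are toy data of my own, not anything from Endeca’s engine:

```python
# Minimal sketch of faceted (guided) navigation, the idea underlying both
# product lines. Records and facet names are made up for illustration.
from collections import Counter

RECORDS = [
    {"brand": "Acme",   "color": "red",  "price_band": "$"},
    {"brand": "Acme",   "color": "blue", "price_band": "$$"},
    {"brand": "Zenith", "color": "red",  "price_band": "$$"},
]
FACETS = ("brand", "color", "price_band")

def navigate(selections):
    """Filter by the current selections, then count remaining facet values."""
    hits = [r for r in RECORDS
            if all(r[f] == v for f, v in selections.items())]
    counts = {f: Counter(r[f] for r in hits)
              for f in FACETS if f not in selections}
    return hits, counts

hits, counts = navigate({"color": "red"})
print(len(hits), counts)  # 2 hits, plus value counts for the open facets
```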
Endeca’s positioning in the business intelligence market boils down to “investigative analytics for people who aren’t hardcore analysts.” Endeca’s technological support for that stresses: Read more
| Categories: Business intelligence, Columnar database management, Database compression, Endeca |
Netezza TwinFin i-Class overview
I have long complained about difficulties in discussing Netezza’s TwinFin i-Class analytic platform. But I’m ready now, and in the grand sweep of the product’s history I’m not even all that late. The Netezza i-Class timing story goes something like this:
- Netezza i-Class was first foreshadowed in February, 2010.
- Netezza i-Class customer testing started in October, 2010 or so. Netezza i-Class evidently has been shipped to 4-5 partners and a single-digit number of end-user organizations, spread across some usual-suspect industries (financial services, telecom, and so on).
- Netezza i-Class 1.0 general availability is still in the (near) future.
My advice to Netezza as to how it should describe TwinFin i-Class boils down to: Read more
| Categories: Cloudera, Data warehouse appliances, Data warehousing, GIS and geospatial, Hadoop, IBM and DB2, MapReduce, Netezza, Parallelization, Predictive modeling and advanced analytics |
Attensity update
I talked with Michelle de Haaff and Ian Hersey of Attensity back in February. We covered a lot of ground, so let’s start with a very high-level view.
- Two years ago, Attensity merged with two other companies in somewhat related businesses, thus expanding 4X or so in size.
- Due to the merger, Attensity now has two core lines of business:
  - Text analytics.
  - Driving actions, such as call center or social media response, based on text analytics.
- The combined Attensity is part American, part German.
- Attensity’s German part compels it to do some public financial reporting. Attensity will do $50-60 million in 2011 revenue.
- Attensity crunches text in 17 languages. English is preeminent. #2 is — you guessed it! — German.
- A big part of Attensity’s business (or at least of its value proposition) is analyzing the text in social media. Attensity boasts coverage of 75 million social media sources, such as blogs, forums, or review sites.
The four most interesting technical points were probably:
- Attensity has changed how it does exhaustive extraction. I’m having some trouble writing that part up, so for now I’ll just refer you to Attensity’s own description of the new way of doing things.
- Attensity has development work underway meant to address some of the problems in text analytics/other analytics integration. I don’t feel I got enough detail to want to talk about that yet.
- Attensity runs its own data centers, with approximately 60 Hadoop/HBase nodes and 30 nodes of Apache Solr (open source text search). More on that below.
- Attensity now OEMs Vertica. More on that below too.
Some more specific notes include: Read more
