Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
ParAccel PADB technical notes
I posted last October about PADB (ParAccel Analytic DataBase), but held back on various topics since PADB 3.0 was still under NDA. By the time PADB 3.0 was released, I was on blogging hiatus. Let’s do a bit of ParAccel catch-up now.
One big part of PADB 3.0 was an analytics extensibility framework. If we match PADB against my recent analytic computing system checklist, Read more
| Categories: Analytic technologies, Data warehousing, EMC, MapReduce, ParAccel, Parallelization, Storage | 2 Comments |
Do we still need EDWs?
Colin White reopened the question of whether enterprise data warehouses (EDWs) are still needed, lining up and knocking down a number of traditional pro-EDW arguments, in more detail than I ever have. So this feels like a good time to revisit my answer to the question of the EDW’s role, whose money quote was:
At conventional enterprises … Manage some of your data to enterprise data warehouse standards, but not all of it. Specifically, your highest-value data should be in something that looks like a classic enterprise data warehouse, and your lower-value data shouldn’t.
For sufficiently small enterprises, the “something that looks like a classic enterprise data warehouse” might just be your One Central Database, combining OLTP (OnLine Transaction Processing) and analytics. Otherwise, the chances are high that you’re going to want to copy your data crown jewels to an EDW, even if they’re also being used as analytic inputs directly from the OLTP systems that first capture them.
As I’ve recently reviewed, there are huge amounts of specialized technology for SQL queries and other analytics. Classical EDW vendors may not be the best or lowest-cost providers of such technology. And even when the EDW is technically competitive, the bureaucratic processes around it can impede rapid adoption of important analytic tools. So Colin is directionally right, in that most large enterprises should be taking the EDW concept less seriously than they currently do. But core EDW technology and business attitudes shouldn’t be entirely discarded either.
| Categories: Analytic technologies, Data warehousing | 3 Comments |
Choices in analytic computing system design
When I posted a long list of architectural options for analytic DBMS, I left a couple of IOUs in for missing parts. One was in the area of what is sometimes called advanced-analytics functionality, which roughly speaking means aspects of analytic database management systems that are not directly related to conventional* SQL queries.
*Main examples of “conventional” = filtering, simple aggregations.
The point of such functionality is generally twofold. First, it helps you execute analytic algorithms with high performance, due to reducing data movement and/or executing the analytics in parallel. Second, it helps you create and execute sophisticated analytic processes with (relatively) little effort.
For now, I’m going to refer to an analytic RDBMS that has been extended by advanced-analytics functionality as an analytic computing system, rather than as some kind of “platform,” although I suspect the latter term is more likely to wind up winning. So far, there have been five major categories of subsystem or add-on module that contribute to making an analytic DBMS a more fully-fledged analytic computing system:
- SQL extensions. Examples include SQL-2003 analytics (notably windowing) and vendor-specific temporal functionality.
- A framework for UDFs (User-Defined Functions) to further extend SQL. At its core, a relational DBMS is a big SQL interpreter. SQL, while powerful, only does a limited number of things; User-Defined Functions add new operations to the SQL language that do additional things.
- An execution engine for analytic processes that is less coupled to the SQL engine than a pure UDF framework might be. The two main approaches are MapReduce (e.g. Aster Data) and general C++ libraries (Netezza, ParAccel); a conceptual sketch of the MapReduce style appears after this list.
- Libraries of pre-built analytic processes. Commonly included are statistics (and other machine learning), general linear algebra, and Monte Carlo analysis. Some of these functions are fully parallelized (perhaps tens per vendor). Others just play nicely with the vendor’s execution framework, in that a separate copy can be run on each node (up to thousands per vendor, for those who bring in open source statistics libraries).
- Development tools such as integrated development environments (IDEs). Aster keeps trying to convince me that having built a nice Eclipse IDE is a major competitive differentiator.
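To make the UDF and execution-engine points a bit more concrete, here is a minimal conceptual sketch in Python of the MapReduce style of in-database extensibility. It is not any vendor's actual API; the function names and the toy orders table are invented for illustration. The point is simply that a "row function" touches one row at a time and a "partition function" touches one group at a time, so a parallel engine can run many copies of each at once, close to where the data lives.

```python
# Conceptual sketch only -- not Aster's, Netezza's, or ParAccel's real API.
# A "row function" transforms each input row independently; a "partition
# function" aggregates all rows sharing a key. Because neither needs data
# outside its own row or partition, the engine can run copies in parallel,
# one per node or per partition, without shipping data to a client.

from collections import defaultdict
from statistics import median

def row_function(row):
    # Per-row transformation; trivially parallelizable.
    return {"customer": row["customer"], "spend": row["price"] * row["quantity"]}

def partition_function(rows):
    # Per-partition aggregation (here, one partition per customer).
    spends = [r["spend"] for r in rows]
    return {"customer": rows[0]["customer"],
            "median_spend": median(spends),
            "order_count": len(spends)}

def run(table, key):
    # What the engine does conceptually: map rows, group by key, reduce groups.
    mapped = [row_function(r) for r in table]
    partitions = defaultdict(list)
    for r in mapped:
        partitions[r[key]].append(r)
    return [partition_function(rows) for rows in partitions.values()]

orders = [
    {"customer": "a", "price": 10.0, "quantity": 2},
    {"customer": "a", "price": 5.0, "quantity": 1},
    {"customer": "b", "price": 7.5, "quantity": 4},
]
print(run(orders, "customer"))
```

A real system would invoke such functions from within a SQL query rather than from a driver script, but even this toy version shows the twin payoffs claimed above: less data movement, and per-partition parallelism.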
| Categories: Aster Data, MapReduce, Netezza, ParAccel, Parallelization, Predictive modeling and advanced analytics, Workload management | 8 Comments |
Mega-trends driving data warehousing and business intelligence
Philip Russom opines (emphasis mine):
What’s driving change in data warehousing (DW) and business intelligence (BI)? There are obvious scalability issues, due to burgeoning data, reports, and user communities. Plus, end-users need more real-time and on-demand BI. For many organizations, integrating existing systems into DW/BI is a higher priority than putting in new ones. And the “do more with less” economy demands more BI at lower costs. Hence, most drivers of change in BI and DW concern four Mega-Trends: size, speed, interoperability, and economics.
Depending on which universe of enterprises and vendors you’re looking at, Philip’s claim of “most” may be technically true. But from where I sit, Philip omitted two other crucial trends: new kinds of data and increased analytic sophistication.
A year ago, I divided data into three kinds:
- Human/tabular, which is what Philip’s comments seem to be focused on.
- Human/nontabular, e.g. what is best handled via text analytics.
- Machine-generated, such as web log or sensor data.
Most organizations on the planet could benefit from better understanding or exploiting their human-generated tabular data. But even so, many of the best opportunities to add analytic value come from capturing and analyzing fundamentally newer kinds of information.
I would further suggest that analytic sophistication is going up, for at least two reasons:
- New kinds of data call for or at least allow new kinds of analytics.
- Better price-performance (on bigger data sets) allows for more sophisticated analytic techniques.
Some of the best examples of these trends, especially the second one, may be found in what I recently called analytic profiling.
| Categories: Analytic technologies, Business intelligence | 4 Comments |
Notes, links, and comments January 20, 2011
I haven’t done a pure notes/links/comments post for a while. Let’s fix that now. (A bunch of saved-up links, however, did find their way into my recent privacy threats overview.)
First and foremost, the fourth annual New England Database Summit (nee “Day”) is next week, specifically Friday, January 28. As per my posts in previous years, I think well of the event, which has a friendly, gathering-of-the-clan flavor. Registration is free, but the organizers would prefer that you register online by the end of this week, if you would be so kind.
The two things potentially wrong with the New England Database Summit are parking and the rush hour drive home afterwards. I would listen with interest to any suggestions about dinner plans.
One thing I hope to figure out at the Summit or before is what the hell is going on on Vertica’s blog or, for that matter, at Vertica. The recent Mike Stonebraker post that spawned a lot of discussion and commentary has disappeared. Meanwhile, Vertica has had three consecutive heads of marketing leave the company since June, and I don’t know who to talk to there any more. Read more
| Categories: About this blog, Analytic technologies, Data warehousing, GIS and geospatial, Investment research and trading, MongoDB, OLTP, Open source, PostgreSQL, Vertica Systems | 4 Comments |
Sound bites on HP/Microsoft and Neoview
HP and Microsoft put out a press release. Three new appliances are being announced, and we’re being reminded of at least one past announcement. I wasn’t briefed, and wouldn’t want to comment on, say, price/performance or feature particulars. That said:
- HP Neoview seems pretty dead.
- I haven’t heard a single favorable reference to HP Neoview since I remarked in March, 2010 that “HP Neoview is reeling.”
- A reporter asked me “What went wrong?” Well, almost any new analytic DBMS/appliance product will compete mainly on two things in its early days — price/performance (or absolute performance), and just how (im)mature it initially is. (Aster Data may be the only prominent exception to that rule.) Presumably, HP Neoview did badly by those metrics.
- HP Neoview was widely conjectured to be a pet project of ousted former HP CEO Mark Hurd.
- Nobody tells me about competing against Microsoft SQL Server 2008 Parallel Data Warehouse (i.e. Madison/DATAllegro) either. Thus, in particular, I haven’t heard any reason to believe there’s anything good about the technology, especially now that the ever-upbeat Stuart Frost has left Microsoft. I’m conjecturing that Parallel Data Warehouse is focused heavily on the existing Microsoft installed base.
- Speaking of Aster — even under NDA, they won’t tell me or give me any useful hints as to who their undisclosed strategic investor is. Well, HP has a long history of investing in sometimes-competing DBMS vendors (back to Oracle and Informix), and a good reason to keep quiet (reluctance to admit the end of Neoview). Hmm …
- The consolidation appliance in the HP/Microsoft announcement is a clear response to Oracle’s Exadata strategy, or (which is probably more accurate) to the same market opportunity Oracle identified.
- I couldn’t quite figure out whether the cheap data warehouse appliance included Microsoft PowerPivot support, but that would make sense if it did.
| Categories: Aster Data, Data warehouse appliances, Data warehousing, HP and Neoview, Microsoft and SQL*Server | 3 Comments |
Architectural options for analytic database management systems
Mike Stonebraker recently kicked off some discussion about desirable architectural features of a columnar analytic DBMS. Let’s expand the conversation to cover desirable architectural characteristics of analytic DBMS in general. Read more
The technology of privacy threats
This post is the second of a series. The first one was an overview of privacy dangers, replete with specific examples of kinds of data that are stored for good reasons, but can also be repurposed for more questionable uses. More on this subject may be found in my August, 2010 post Big Data is Watching You!
There are two technology trends driving electronic privacy threats. Taken together, these trends make possible scenarios such as the following:
- Your web surfing behavior indicates you’re a sports car buff, and you further like to look at pictures of scantily-clad young women. A number of your Facebook friends are single women. As a result, you’re deemed a risk to have a mid-life crisis and divorce your wife, thus increasing the interest rate you have to pay when refinancing your house.
- Your cell phone GPS indicates that you drive everywhere, instead of walking. There is no evidence of you pursuing fitness activities, but forum posting activity suggests you’re highly interested in several TV series. Your credit card bills show that your taste in restaurant food tends to the fatty. Your online photos make you look fairly obese, and a couple have ashtrays in them. As a result, you’re judged a high risk of heart attack, and your medical insurance rates are jacked up accordingly.
- You did actually have that mid-life crisis and get divorced. At the child-custody hearing, your ex-spouse’s lawyer quotes a study showing that football-loving upper income Republicans are 27% more likely to beat their children than yoga-class-attending moderate Democrats, and the probability goes up another 8% if they ever bought a jersey featuring a defensive lineman. What’s more, several of the more influential people in your network of friends also fit angry-male patterns, taking the probability of abuse up another 13%. Because of the sound statistics behind such analyses, the judge listens.
Not all these stories are quite possible today, but they aren’t far off either; the sketch below suggests how such scoring could work.
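To see why such scenarios are technically plausible, here is a deliberately simplified, hypothetical sketch of how signals harvested from different sources might be fused into one risk score. The feature names, weights, and logistic form are all invented for illustration; real scoring models are more elaborate, but the basic move is the same: many weak signals combined into one probability that then drives a decision about you.

```python
import math

# Hypothetical illustration only: invented features and weights showing how
# signals from different sources (web browsing, social graph, location,
# purchases, photos) might be fused into a single probability via a
# logistic model, which is then used to price insurance or credit.
WEIGHTS = {
    "sports_car_browsing": 0.8,
    "single_women_in_network": 0.5,
    "drives_everywhere": 0.4,
    "fatty_restaurant_spend": 0.6,
    "smoking_signals_in_photos": 0.9,
}
BIAS = -3.0  # baseline log-odds before any signals are observed

def risk_score(signals):
    # Combine observed signals (0/1 flags or fractional strengths) into a probability.
    log_odds = BIAS + sum(WEIGHTS[name] * value for name, value in signals.items())
    return 1.0 / (1.0 + math.exp(-log_odds))

profile = {
    "sports_car_browsing": 1,
    "single_women_in_network": 1,
    "drives_everywhere": 1,
    "fatty_restaurant_spend": 1,
    "smoking_signals_in_photos": 0,
}
print(f"estimated risk: {risk_score(profile):.2f}")
```

None of the individual inputs is damning on its own; the composite score is what the lender, insurer, or lawyer actually sees.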
| Categories: Facebook, Predictive modeling and advanced analytics, Surveillance and privacy, Telecommunications, Web analytics | 4 Comments |
Privacy dangers — an overview
This post is the first of a series. The second one delves into the technology behind the most serious electronic privacy threats.
The privacy discussion has gotten more active, and more complicated as well. A year ago, I still struggled to get people to pay attention to privacy concerns at all, at least in the United States, with my first public breakthrough coming at the end of January. But much has changed since then.
On the commercial side, Facebook modified its privacy policies, garnering great press attention and an intense user backlash, leading to a quick partial retreat. The Wall Street Journal then launched a long series of articles — 13 so far — recounting multiple kinds of privacy threats. Other media joined in, from Forbes to CNet. Various forms of US government rule-making to inhibit advertising-related tracking have been proposed as an apparent result.
In the US, the government had a lively year as well. The Transportation Security Administration (TSA) rolled out what have been dubbed “porn scanners,” and backed them up with “enhanced patdowns.” For somebody who is, for example, female, young, a sex abuse survivor, and/or a follower of certain religions, those can be highly unpleasant, if not traumatic. Meanwhile, the Wikileaks/Cablegate events have spawned a government reaction whose scope is only beginning to be seen. A couple of “highlights” so far are some very nasty laptop seizures, and the recent demand for information on over 600,000 Twitter accounts. (Christopher Soghoian provided a detailed, nuanced legal analysis of same.)
At this point, it’s fair to say there are at least six different kinds of legitimate privacy fear. Read more
| Categories: Analytic technologies, Facebook, GIS and geospatial, Health care, Surveillance and privacy, Telecommunications, Web analytics | 6 Comments |
The six useful things you can do with analytic technology
I seem to be in the mode of sharing some of my frameworks for thinking about analytic technology. Here’s another one.
Ultimately, there are six useful things you can do with analytic technology:
- You can make an immediate decision.
- You can plan in support of future decisions.
- You can research, investigate, and analyze in support of future decisions.
- You can monitor what’s going on, to see when it is necessary to decide, plan, or investigate.
- You can communicate, to help other people and organizations do these same things.
- You can provide support, in technology or data gathering, for one of the other functions.
Technology vendors often cite similar taxonomies, claiming to have all the categories (as they conceive them) nicely represented, in slickly integrated fashion. They exaggerate. Most of these categories are in rapid flux, and the rest should be. Analytic technology still has a long way to go.
In more detail: Read more
