Parallelization
Analysis of issues in parallel computing, especially parallelized database management.
Pervasive DataRush today
In my first post-fire briefing, I had a long-scheduled dinner with the Pervasive DataRush folks. Much of DataRush’s positioning, feature evolution, and so on remains To Be Determined. Most existing customers and applications remain To Be Disclosed. What’s more, DataRush is a technology to accelerate applications that
- Need to be parallelized
- Should run on SMP rather than shared-nothing hardware
and Pervasive hasn’t done a great job of explaining where #2 applies.
That said, there’s at least one use case for which DataRush should clearly be considered today. Suppose you have a messy ETL/data transformation task that requires custom code. Then I see three main choices:
- Write the code within the confines of an off-the-shelf ETL tool.
- Write the code to run on an analytic DBMS platform, ideally an MPP/shared-nothing one.
- Use something like DataRush (and I’m not familiar with any good alternatives to DataRush).
In some cases, DataRush may be the best possibility.
Categories: Analytic technologies, Data integration and middleware, Data warehousing, EAI, EII, ETL, ELT, ETLT, Parallelization, Pervasive Software | 1 Comment |
Three Greenplum customers’ applications of MapReduce
Greenplum (and Truviso) advisor Joseph Hellerstein offers a few examples of MapReduce applications (specifically Greenplum MapReduce), namely:
The big aha moment occurred for me during our panel discussion, which included Luke Lonergan from Greenplum, Roger Magoulas from O’Reilly, and Brian Dolan from Fox Interactive Media (which runs MySpace among other web properties).
Roger talked about using MapReduce to extract structured entities from text for doing tech trend analyses from billions of rows of online job postings. Brian (who is a mathematician by training) was talking about implementing conjugate gradient and Support Vector Machines in parallel SQL to support “hypertargeting” for advertisers. I mentioned how Jonathan Goldman at LinkedIn was using SQL and MapReduce to do graph algorithms for social network analysis.
Incidentally: While it’s been some months since I asked, my sense is that the O’Reilly text extraction is home-grown, and primitive compared to what one could do via commercial products. That said, if the specific application is examining job postings, I’m not sure how much value more sophisticated products would add. After all, tech job listings are generally written in a style explicitly designed to ensure that most or all of their meaning is conveyed simply by a bag of keywords. And by the way, this effort has been underway for quite some time.
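To make the bag-of-keywords point concrete, here’s a minimal sketch of what such an extraction could look like. This is purely my own illustration, not O’Reilly’s code; the dictionary of tech terms and the tokenization rule are made up, and a real system would use a far larger term list.

```java
import java.util.*;

public class KeywordBag {
    // Hypothetical dictionary of tech terms; a real effort would use a far larger list.
    private static final Set<String> TECH_TERMS = new HashSet<>(Arrays.asList(
            "java", "sql", "hadoop", "mapreduce", "etl", "mysql", "oracle", "linux"));

    // Count how often each known tech term appears in one job posting.
    public static Map<String, Integer> extract(String posting) {
        Map<String, Integer> bag = new HashMap<>();
        for (String token : posting.toLowerCase().split("[^a-z0-9+#]+")) {
            if (TECH_TERMS.contains(token)) {
                bag.merge(token, 1, Integer::sum);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        String posting = "Seeking a Java developer with SQL and Hadoop/MapReduce experience; ETL a plus.";
        System.out.println(extract(posting)); // e.g. {etl=1, java=1, mapreduce=1, sql=1, hadoop=1}
    }
}
```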
Related link
- Greenplum has a page on the O’Reilly relationship. However, the part that isn’t behind a registration barrier is trivial — and I wouldn’t know one way or the other about the registration-required part.
Categories: Analytic technologies, Data warehousing, Fox and MySpace, Greenplum, MapReduce, Specific users, Web analytics | 3 Comments |
MapReduce user eHarmony chose Netezza over Aster or Greenplum
Depending on which IDG reporter you believe, eHarmony has either 4 TB of data or more than 12 TB, stored in Oracle but now analyzed on Netezza. Interestingly, eHarmony is a Hadoop/MapReduce shop, but chose Netezza over Aster Data or Greenplum even so. Price was apparently an important aspect of the purchase decision. Netezza also seems to have had a very smooth POC. Read more
Categories: Application areas, Aster Data, Benchmarks and POCs, Data warehousing, Greenplum, MapReduce, Netezza, Oracle, Predictive modeling and advanced analytics, Pricing | 5 Comments |
An example of Aster Data’s nPath/MapReduce syntax
Perhaps in response to my prior post on Aster Data’s introduction of MapReduce-based nPath, Steve Wooledge of Aster offers a more detailed example. The particular case he works through is:
… the question: for SEO/SEM-driven traffic that stays on our site only for 5 or fewer pageviews and then leaves our site and never returns in the same session, what are the top referring search queries and what are the top paths of navigated pages on our site?
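For intuition only, here is a rough plain-Java sketch of the same analysis done outside the database. It has nothing to do with Aster’s actual nPath syntax; the event fields and the test for “SEO/SEM-driven” are hypothetical, and sessions are assumed to be already delimited (so “never returns in the same session” is implicit).

```java
import java.util.*;
import java.util.stream.*;

public class ShortSessionPaths {
    // One pageview event; field names are made up for illustration.
    record PageView(String sessionId, long ts, String page, String referrerQuery, boolean fromSearch) {}

    public static void main(String[] args) {
        List<PageView> events = List.of(
                new PageView("s1", 1, "/home",  "cheap flights", true),
                new PageView("s1", 2, "/deals", "cheap flights", true),
                new PageView("s2", 1, "/home",  "cheap flights", true),
                new PageView("s2", 2, "/deals", "cheap flights", true),
                new PageView("s3", 1, "/home",  null,            false));

        Map<String, Long> topQueries = new HashMap<>();
        Map<String, Long> topPaths   = new HashMap<>();

        // Group pageviews into sessions; each session is assumed complete.
        Map<String, List<PageView>> sessions =
                events.stream().collect(Collectors.groupingBy(PageView::sessionId));

        for (List<PageView> s : sessions.values()) {
            s.sort(Comparator.comparingLong(PageView::ts));
            boolean searchDriven = s.get(0).fromSearch();  // SEO/SEM-driven entry
            if (!searchDriven || s.size() > 5) continue;   // keep only short search-driven visits
            topQueries.merge(s.get(0).referrerQuery(), 1L, Long::sum);
            String path = s.stream().map(PageView::page).collect(Collectors.joining(" -> "));
            topPaths.merge(path, 1L, Long::sum);
        }

        System.out.println("Top referring queries: " + topQueries);
        System.out.println("Top paths: " + topPaths);
    }
}
```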
Categories: Analytic technologies, Aster Data, Data warehousing, MapReduce, Web analytics | Leave a Comment |
Aster Data nPath
Edit: Unfortunately, this post and its sequel rely on Aster Data posts that Aster’s buyer Teradata no longer makes easily available.
At the same time as it rolled out its cloud story, Aster Data told of nPath, a MapReduce-based feature in nCluster. As best I understand it, the core idea of nPath is that it preprocesses sequential data via MapReduce so that you can then do ordinary SQL on it. (Steve Wooledge’s blog post about nPath outlines why that might be needed. Point 1 in Mayank Bawa’s August, 2008 post is much more concise. 😉 ) Now, that might seem to contradict the syntax, which is all about MapReduce being invoked via SQL — still, it’s what’s really going on.
That leads to two obvious questions: What is nPath used (or useful) for? and How is the preprocessing done anyway? Read more
Categories: Aster Data, Data warehousing, MapReduce, Predictive modeling and advanced analytics, Web analytics | 2 Comments |
Aster Data in the cloud
Aster Data is in the news, bragging about a cloud version of nCluster, and providing both a press release and a blog post on the subject. It seems there are three actual customers, two of which have been publicly named. One of them, ShareThis, is in production. (2 terabytes of data on 9 nodes, planning to scale to 10-18 TB on 24 or so nodes by year-end.) All seem to be doing something in the area of internet marketing, web analytics or otherwise — which makes sense, as the same could be said of almost all Aster customers overall. That said, it seems that these customers are doing their primary analytic processing remotely, which makes Aster’s experience in that regard more akin to Kognitio’s than to Vertica’s. Read more
Categories: Analytic technologies, Application areas, Aster Data, Cloud computing, Data warehousing, MapReduce, Software as a Service (SaaS), Web analytics | 1 Comment |
Pervasive DataRush
I’ve made a few references to Pervasive DataRush in the past — like this one — but I’ve never gotten around to seriously writing it up. I’ll now try to make partial amends. The key points about Pervasive DataRush are:
- DataRush grew out of Pervasive Software’s ETL business, as the underpinnings for a new data transformation tool they were building.
- DataRush is a Java framework for doing parallel programming automagically.
- Unlike most modern parallelization technologies, DataRush is focused on single SMP (Symmetric MultiProcessing) boxes rather than loosely-coupled grids.
- DataRush is based on dataflow programming (see the sketch after this list).
- Pervasive says that DataRush is really fast.
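I’m not attempting to reproduce DataRush’s actual API here. But as a minimal sketch of what dataflow programming on a single SMP box means in practice, consider a pipeline of operators connected by queues, each operator running in its own thread:

```java
import java.util.concurrent.*;

public class DataflowSketch {
    // A tiny three-operator dataflow graph: generate -> transform -> sink.
    // Each operator runs in its own thread; queues carry records between them,
    // so all operators are busy at once on a multi-core SMP machine.
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> raw     = new ArrayBlockingQueue<>(1024);
        BlockingQueue<Integer> cleaned = new ArrayBlockingQueue<>(1024);
        final int EOF = Integer.MIN_VALUE;  // sentinel marking end of stream

        Thread generate = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) raw.put(i);
                raw.put(EOF);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread transform = new Thread(() -> {
            try {
                for (int v; (v = raw.take()) != EOF; ) cleaned.put(v * v);  // the "data transformation"
                cleaned.put(EOF);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread sink = new Thread(() -> {
            try {
                for (int v; (v = cleaned.take()) != EOF; ) System.out.println(v);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        generate.start(); transform.start(); sink.start();
        generate.join();  transform.join();  sink.join();
    }
}
```

A real dataflow engine would go much further, for example by partitioning each operator across many threads, but the essential point is that the framework rather than the application programmer manages the concurrency.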
More details may be found at the rather rich Pervasive DataRush website, or in the following excerpt from an email by Pervasive’s Steve Hochschild: Read more
Categories: Parallelization, Pervasive Software | 1 Comment |
Another dubious “end of computer history” argument
In a typically snarky Register article, Chris Mellor raises a caution about the use of future many-cored chips in IT. In essence, he says that today’s apps run in a relatively small number of threads each, and modifying them to run in many threads is too difficult. Hence, most of the IT use for many-cored chips will be via hypervisors that assign apps to cores as makes sense.
Mellor has a point, but he’s overstating it. Read more
Categories: Parallelization, Theory and architecture | 3 Comments |
High-end MySQL use
To a large extent, MySQL lives in two different alternate universes from most other DBMS. One is for low-end, simple database applications. For example, of all the DBMS I write about, MySQL is the one I actually use in my own business — because MySQL sits underneath WordPress, and WordPress is what runs my blogs. My largest database (the one for DBMS2) contains 12 megabytes of data in 11 tables, none of which has yet reached 5000 rows in size. Read more
Categories: Google, MySQL, OLTP, Open source, Parallelization | 1 Comment |
High-performance analytics
For the past few months, I’ve collected a lot of data points to the effect that high-performance analytics – i.e., beyond straightforward query – is becoming increasingly important. And I’ve written about some of them at length. For example:
- MapReduce – controversial or in some cases even disappointing though it may be – has a lot of use cases.
- It’s early days, but Netezza and Teradata (and others) are beefing up their geospatial analytic capabilities.
- Memory-centric analytics is in the spotlight.
Ack. I can’t decide whether “analytics” should be a singular or plural noun. Thoughts?
Another area that’s come up, which I haven’t blogged about as much, is data mining in the database. Data mining accounts for a large part of data warehouse use. The traditional way to do data mining is to extract data from the database and dump it into SAS. But there are problems with this scenario, including: Read more
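To make the contrast concrete, here is a deliberately trivial sketch, with hypothetical table and column names and a toy scoring formula, of the difference between extracting rows to score them in application code and pushing the same formula into SQL so the (ideally MPP) DBMS evaluates it in parallel, next to the data. It is my own illustration, not any vendor’s API.

```java
import java.sql.*;

public class ScoringSketch {
    // Toy "model": score = 0.3 * recency + 0.7 * spend. Table/column names are hypothetical.

    // Traditional route: pull every row out of the warehouse and score it client-side
    // (or hand the extract to a mining tool such as SAS).
    static void scoreOutsideDatabase(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT customer_id, recency, spend FROM customers")) {
            while (rs.next()) {
                double score = 0.3 * rs.getDouble("recency") + 0.7 * rs.getDouble("spend");
                // ... write the score back, export it, etc.
            }
        }
    }

    // In-database route: express the same formula in SQL, so only results move.
    // (Exact CREATE TABLE AS syntax varies by DBMS.)
    static void scoreInsideDatabase(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(
                "CREATE TABLE customer_scores AS " +
                "SELECT customer_id, 0.3 * recency + 0.7 * spend AS score FROM customers");
        }
    }
}
```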