Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
Greenplum et alia’s BigDataNews.com site
Greenplum recently started a website BigDataNews.com, and quickly signed up Aster Data as a co-sponsor. (Edit: As per a comment below, the decision to sign up additional sponsors was made by the site’s independent publisher.) It’s actually being run by Brett Sheppard, a former Gartner/DataQuest analyst who now gets involved in this kind of thing. (Brett and I may be working on another project soon, with Greenplum funding.)
The heart of the site is feeds* from a variety of high-profile blogs (DBMS2, Daniel Abadi’s, Joe Hellerstein’s, James Kobielus’, et al.), plus some additional posts written by Brett (primarily) or Greenplum folks. Highlights of Brett’s posts include:
- What I am told was an unauthorized revelation that Greenplum Chorus is built on CouchDB and Erlang.
- An impassioned defense of the integrity of Gartner’s analysis.
*At least in my case, that’s just a post title or snippet, plus a link back to the main post. The same goes for mapreduce.org, actually.
| Categories: Analytic technologies, Data warehousing, Greenplum, NoSQL | 2 Comments |
Aster Data’s mapreduce.org site
Aster Data has started a site mapreduce.org, which purports to compile “the best information about MapReduce.” At the moment, mapreduce.org highlights include:
- A feed of MapReduce-related posts from several blogs, including this one.
- A calendar of MapReduce-related events, not necessarily Aster-specific, integrated with a feed combining …
- … Aster MapReduce-related press releases and also …
- … not necessarily Aster-specific MapReduce-related press articles.
- Links to a lot of Aster Data MapReduce-related collateral. Some of that stuff is quite good.*
- A sycophantic introduction from Colin White praising the value of the mapreduce.org “independent forum.”
*I did a couple of MapReduce-related webinars for Aster late last year. 🙂 But seriously — Aster does a good job of writing clear and informative collateral.
| Categories: Analytic technologies, Aster Data, MapReduce | 3 Comments |
Introduction to Datameer
Elder care issues have flared up with a vengeance, so I’m not going to be blogging much for a while, and surely not at any length. That said, my first post about Datameer was never going to be very long, so let’s get right to it:
- Datameer offers a business intelligence and analytics stack that runs on any distribution of Hadoop.
- Datameer is still building a lot of features that it talks about, for target release in (I think) the fall.
- Datameer’s pride and joy is its user interface. Very laudably for a software start-up, Datameer claims to have spent considerable time with professional user interface designers.
- Datameer’s core user interface metaphor is formula definition via a spreadsheet.
- Datameer includes 124 functions one can use in these formulae, ranging from math stuff to text tokenization.
- Datameer does some straight BI, with 4 kinds of “visualization” headed for 20 kinds later. But if you want to do hard-core BI, use Datameer to dump data into an RDBMS and then use the BI tool of your choice. (Datameer’s messaging does tend to obscure or even contradict that point.)
- Rather, Datameer seems to be designed for the classic MapReduce use cases of ETL and heavy data crunching.
- Datameer’s messaging includes a bit about “Datameer is real-time, even though Hadoop is generally thought of as batch.” So far as I can tell, what that boils down to is …
- … Datameer will let you examine sample and/or partial query results before a full Hadoop run is over. Apparently, there are three different ways Datameer lets you do this:
- You can truly query against a sample of the data set.
- You can query against intermediate results, when only some stages of the Hadoop process have already been run.
- You can drill down into a “distributed index,” whatever the heck that means when Datameer says it.
- Datameer will let you import data from 15 or so different kinds of sources, SQL, NoSQL, and file system alike.
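The first of the three preview approaches above, querying against a sample of the data set, can be illustrated with a minimal sketch. This is not Datameer's implementation, which the post does not describe; the function name `estimate_sum_from_sample` and its parameters are hypothetical, and the sketch assumes a uniform random sample of an in-memory list rather than anything running on Hadoop.

```python
import random


def estimate_sum_from_sample(values, sample_fraction, seed=0):
    """Estimate sum(values) from a uniform random sample.

    Hypothetical helper for illustration only: it sums a random
    subset of the data and scales the result up to the full size.
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    if not values:
        return 0.0

    rng = random.Random(seed)  # fixed seed keeps the preview repeatable
    sample_size = max(1, round(len(values) * sample_fraction))
    sample = rng.sample(values, sample_size)  # sampling without replacement

    # Scale the sample total by the inverse of the sampling rate.
    scale = len(values) / sample_size
    return sum(sample) * scale


if __name__ == "__main__":
    data = list(range(1000))
    print("exact sum:       ", sum(data))
    print("10% sample guess:", estimate_sum_from_sample(data, 0.10))
```

The point of the design is the trade-off the bullet implies: the sampled answer arrives after touching only a fraction of the rows, at the cost of being an estimate rather than the exact result a full run would produce.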
| Categories: Analytic technologies, Business intelligence, Datameer, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce | 3 Comments |
Story of an analytic DBMS evaluation
One of our readers was kind enough to walk me through his analytic DBMS evaluation process. The story is:
- The X Company (XCo) has a <1 TB database.
- 100s of XCo’s customers log in at once to run reports. 50-200 concurrent queries is a good target number.
- XCo had been “suffering” with Oracle and wanted to upgrade.
- XCo didn’t have a lot of money to spend. Netezza pulled out of the sales cycle early due to budget (and this was recently enough that Netezza Skimmer could have been bid).
- Greenplum didn’t offer any references that approached the desired number of concurrent users.
- Ultimately the evaluation came down to Vertica and ParAccel.
- Vertica won.
Notes on the Vertica vs. ParAccel selection include: Read more
| Categories: Analytic technologies, Benchmarks and POCs, Buying processes, Data warehousing, Greenplum, Netezza, Oracle, ParAccel, Vertica Systems | 7 Comments |
Greenplum Chorus and Greenplum 4.0
Greenplum is making two product announcements this morning. Greenplum 4.0 is a revision of the core Greenplum database technology. In addition, Greenplum is announcing Greenplum Chorus, which is the first product release instantiating last year’s EDC (Enterprise Data Cloud) vision statement and marketing campaign.
Greenplum 4.0 highlights and related observations include: Read more
Is the enterprise data warehouse a myth?
An enterprise data warehouse should:
- Manage data to high standards of accuracy, consistency, cleanliness, clarity, and security.
- Manage all the data in your organization.
Pick ONE. Read more
| Categories: Data models and architecture, Data warehousing, Database diversity, Teradata, Theory and architecture | 8 Comments |
Examples of machine-generated data
Not long ago I pointed out that much future Big Data growth will be in the area of machine-generated data, examples of which include: Read more
| Categories: Analytic technologies, Data warehousing, Games and virtual worlds, Investment research and trading, Log analysis, Oracle, Telecommunications, Web analytics | 29 Comments |
Information found in public-facing social networks
Here are some examples illustrating two recent themes of mine, namely:
- Easily-available information reveals all sorts of things about us.
- Graph-based analysis is on the rise.
Pete Warden scraped all of Facebook’s social graph (at least for the United States), and put up a really interesting-looking visualization of same. Facebook’s lawyers came down on him, and he quickly agreed to destroy the data he’d scraped, but also published ideas on how other people could duplicate his work.
Warden has since given an interview in which he outlines some of the things researchers hoped to do with this data: Read more
| Categories: Analytic technologies, Facebook, RDF and graphs, Surveillance and privacy | 1 Comment |
Notes on the evolution of OLTP database management systems
The past few years have seen a spate of startups in the analytic DBMS business. Netezza, Vertica, Greenplum, Aster Data and others are all reasonably prosperous, alongside older specialty product vendors Teradata and Sybase (the Sybase IQ part). OLTP (OnLine Transaction Processing) and general-purpose DBMS startups, however, have not yet done as well, with such success as there has been (MySQL, InterSystems Caché, solidDB’s exit, etc.) generally accruing to products that originated in the 20th Century.
Nonetheless, OLTP/general-purpose data management startup activity has recently picked up, targeting what I see as some very real opportunities and needs. So as a jumping-off point for further writing, I thought it might be interesting to collect a few observations about the market in one place. These include:
- Big-brand OLTP/general-purpose DBMS have more “stickiness” than analytic DBMS.
- By number, most of an enterprise’s OLTP/general-purpose databases are low-volume and low-value.
- Most interesting new OLTP/general-purpose data management products are either MySQL-based or NoSQL.
- It’s not yet clear whether MySQL will prevail over MySQL forks, or vice-versa, or whether they will co-exist.
- The era of silicon-centric relational DBMS is coming.
- The emphasis on scale-out and reducing the cost of joins spans the NoSQL and SQL-based worlds.
- Users’ insistence on “free” could be a major problem for OLTP DBMS innovation.
I shall explain. Read more
Liberty and privacy, once again
I’ve long argued three points:
- It is inevitable* that governments and other constituencies will obtain huge amounts of information, which can be used to drastically restrict everybody’s privacy and freedom.
- To protect against this grave threat, multiple layers of defense are needed, technical and legal/regulatory/social/political alike.
- One particular layer is getting insufficient attention, namely restrictions upon the use (as opposed to the acquisition or retention) of data.
*And indeed in many ways even desirable
I surprised people by leading with the liberty/privacy subject at my New England Database Summit keynote; considerable discussion ensued, largely supportive. I hope for a similar outcome when I keynote the Aster Big Data Summit in Washington, DC in May. And I expect to do even more to advance the liberty/privacy discussion as 2010 unfolds.
Fortunately, I’m not the only one thinking or talking about these liberty/privacy issues. Read more
