October 18, 2009

Three big myths about MapReduce

Once again, I find myself writing and talking a lot about MapReduce. But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:

MapReduce is something very new
MapReduce involves strict adherence to the Map-Reduce programming paradigm
MapReduce is a single technology

Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Google, Greenplum, Hadoop, Log analysis, MapReduce, Michael Stonebraker, Parallelization, Web analytics

11 Comments

October 18, 2009

Introduction to SenSage

I visited with SenSage on my two most recent trips to San Francisco. Both visits were, through no fault of SenSage’s, hasty. Still, I think I have enough of a handle on SenSage basics to be worth writing up.

General SenSage highlights include:

Categories: Analytic technologies, Columnar database management, Data warehousing, Database compression, Log analysis, MapReduce, SenSage, Streaming and complex event processing (CEP), Telecommunications

3 Comments

October 18, 2009

Technical introduction to Splunk

As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it’s probably OK to assume they’re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used for general schema-free analytics, but that’s in early days at best.

Splunk’s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include: Read more

Categories: Analytic technologies, Log analysis, MapReduce, Splunk, Structured documents, Text, Web analytics

12 Comments

October 18, 2009

General introduction to Splunk

I dropped by log analysis software vendor Splunk a few weeks ago for a chat with Marketing VP Steve Sommer (who some of you may know from Cognos and/or Informix), Product Management VP Christina Noren, and above all co-founder/CTO Erik Swan. Splunk turns out to be a pretty interesting company, from both business and technical standpoints. For one thing, Splunk seems highly regarded by most people I mention it to.

Splunk’s technical stories include:

Text search over log files.
Business intelligence over text search. (That part sounds a lot like Attivio.)
MapReduce with schema flexibility and smart multi-stage execution plans. (That part sounds a lot like Aster Data.)
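
To make the "MapReduce with schema flexibility" idea concrete, here is a minimal Python sketch. It is not Splunk's implementation or API. The key=value log format, the field name "status", and the function names are all assumptions made up for illustration. The point is only that fields are discovered per line at map time rather than declared up front, so lines lacking a field are simply skipped.

```python
import re
from collections import defaultdict

# Assumed log format: whitespace-separated key=value pairs (hypothetical).
PAIR = re.compile(r"(\w+)=(\S+)")

def map_line(line):
    """Map step: extract whatever fields the line happens to contain."""
    fields = dict(PAIR.findall(line))
    # No schema is declared; a line without "status" emits nothing.
    if "status" in fields:
        yield fields["status"], 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run(lines):
    """Run the map step over every line, then reduce the combined output."""
    return reduce_counts(pair for line in lines for pair in map_line(line))
```

Calling run() on a handful of log lines returns a dictionary of counts per status value. A real system would distribute both steps across nodes, which this single-process sketch does not attempt.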

Kickfire capacity and pricing

Kickfire’s marketing communication efforts are still a work in progress. Kickfire did finally relax its secrecy about FPGA-vs.-custom-silicon – not coincidentally during Netezza’s recent publicity cycle. That wise choice helped Kickfire get some favorable attention recently for its technical and market strategy, e.g. from Daniel Abadi, Merv Adrian and, kicking things off — as it were — me. Weeks after a recent Kickfire product release, there’s finally a fairly accurate data sheet up, although there’s still one self-defeatingly misleading line I’ll comment on below. Pricing is a whole other area of confusion, although it seems that current list prices have been inadvertently* leaked in Merv’s post linked above, with only one inaccuracy that I can detect.**

*I gather from the company that they forgot to tell Merv pricing was NDA.

** Merv cited a price as “starting” that I believe to be top-of-the-line. No criticism of Merv is implied in that; Kickfire has not been very clear in communicating hard numbers.

All that said, if one takes Kickfire’s marketing statements literally, Kickfire list pricing is around $20-50K per terabyte for a few small, fixed, high-performance configurations. That’s all-in, for plug-and-play appliances. What’s more, that range is based on the actual published user data capacity numbers for various Kickfire models, which I think are low for several reasons:

Kickfire doesn’t officially admit that its model with 14.4 terabytes of disk can manage more than 6 terabytes of data, even though it clearly can.
Actually, those 14.4 terabytes of disk can be increased or lowered as you choose.
The basic compression figures implied in those calculations seem conservative.
Compression figures are a lot more conservative yet, in that Kickfire assumes you’ll have a lot of actual indexes on your data. I’m not sure that’s necessary for most workloads.

Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Kickfire, Pricing

3 Comments

October 15, 2009

MapReduce webinars and annotated slides

As previously noted, I’m giving a webinar twice today — i.e., Thursday, October 15 — at 10:00 am and 1:00 pm Eastern time.

The subject is MapReduce.
The sponsor is Aster Data.
Part of the webinar will be an explanation of MapReduce basics, especially the conflict between theory/propaganda and reality.
As you might guess from the identity of the sponsor, there will be an emphasis on how MapReduce and SQL play nicely with each other.
You can register for the webinar on Aster’s site.
(Edit) The webinar replay can be found here.
I’ve already uploaded the slides from which I will present. (But not the ones from which Aster folks will be talking. I’ve seen those, and there’s some good technical crunch in some of them.) The “Notes” under the slides have a number of relevant URLs for follow-up, as well as a small number of explanatory comments (e.g., as to why one slide simply has a quote from and corresponding picture of Shakespeare).

Categories: Aster Data, MapReduce, Presentations

6 Comments

October 14, 2009

Infobright notes

I had lunch w/ Bob Zurek and Susan Davis of Infobright today. This wasn’t primarily a briefing, but a few takeaways are:

Infobright now has >100 paying customers.
Typical database size is from the low 100s of gigabytes to the low single-digit number of terabytes.
Agile development is at or approaching two-week release cycles.
Like Kickfire, Infobright has a multi-year deal with MySQL that insulates it against many potential Oracle/MySQL shenanigans.
From an industry perspective, Infobright’s customer base sounds a lot like other vendors’:
- Data mart outsourcing/online analytics
- Log files for websites
- Telecommunications
- Financial services
- OEM, especially in the markets cited above
- “Hey, we’re beginning to see the occasional energy deal”
- A few random others
Infobright is seeing some household-name customers, who surely have big-name analytic DBMS products, but who also have a policy that open source is the default choice, and if open source can get the job done then the favorite closed-source choices aren’t used.
Infobright has the usual open-source community story — lots of involvement and engagement in the forums, but contributions are limited mainly to connectivity, utility scripts, etc. (Maybe some national language translation too; I’m not sure.)

Categories: Analytic technologies, Data mart outsourcing, Data warehousing, Infobright, Investment research and trading, Kickfire, Log analysis, Market share and customer counts, MySQL, Open source, Telecommunications, Web analytics

7 Comments

October 14, 2009

Greenplum is going hybrid columnar as well

Over the past summer, Vertica, VectorWise, and Oracle all announced flavors of hybrid row/columnar storage. Now it’s Greenplum’s turn. Greenplum is actually offering true columnar storage, as opposed to Oracle’s PAX-like scheme — and also as opposed to the kind of Frankencolumn storage Daniel Abadi decries. For example, you don’t have to do a join to retrieve multiple columns; you just ask for them and there they are. Similarly, Greenplum doesn’t maintain explicit row IDs – whether in row-oriented or column-oriented append-only storage – relying instead on block-level header information. Read more
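
As a rough illustration of the last two points, here is a minimal Python sketch of columnar storage in which rows are identified purely by position and each block carries a small header instead of per-row IDs. This is not Greenplum's actual on-disk format. The class name ColumnBlock, the header fields first_row and row_count, and the helper functions are assumptions made up for this sketch.

```python
from typing import Any

class ColumnBlock:
    """A block of append-only columnar storage covering a contiguous run of rows."""

    def __init__(self, first_row: int, columns: dict[str, list[Any]]):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns in a block must have the same length")
        # Hypothetical block-level header: where the block starts and how many rows it holds.
        self.first_row = first_row
        self.row_count = lengths.pop() if lengths else 0
        self.columns = columns

def fetch(blocks, names):
    """Return the requested columns side by side, aligned by position (no join, no row IDs)."""
    for block in blocks:
        yield from zip(*(block.columns[name] for name in names))

def row_at(blocks, position, names):
    """Locate one row using only block headers and an offset within the block."""
    for block in blocks:
        offset = position - block.first_row
        if 0 <= offset < block.row_count:
            return tuple(block.columns[name][offset] for name in names)
    raise IndexError(position)
```

Asking for two columns is just a matter of walking both lists in step, and finding a given row needs only the block header plus an offset. Compression and on-disk layout are left out of the sketch entirely.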

Categories: Analytic technologies, Columnar database management, Data warehousing, Database compression, Greenplum, Theory and architecture

12 Comments

October 10, 2009

How 30+ enterprises are using Hadoop

MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:

Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
Application areas mentioned — and these overlap in some cases — include:
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance

Categories: Application areas, Aster Data, Cloudera, Data types, Data warehousing, Database diversity, EAI, EII, ETL, ELT, ETLT, Hadoop, Investment research and trading, Log analysis, MapReduce, Open source, Parallelization, Predictive modeling and advanced analytics, Scientific research, Structured documents, Telecommunications, Text, Vertica Systems, Web analytics

9 Comments

October 9, 2009

I have some presentations coming up (all on October Thursdays)

On Thursday, October 15, at two different times (10:00 am and 1:00 pm Eastern time), I’ll be giving a webinar for Aster Data on MapReduce. The content is very much work in progress, but it definitely will:

Be overviewy in nature
Emphasize SQL/MapReduce integration

Then, on the evening of Thursday, October 22, there’s something called the Boston Big Data Summit, in Waltham, where “Big Data” evidently is to be construed as anything from a few terabytes on up. (Things are smaller in the Northeast than in California …) It’s being put together by Amrith Kumar (who I don’t really know) and Bob Zurek (who everybody knows). This is the inaugural meeting. It seems I’m both giving the keynote and running the subsequent panel, one of whose participants will be Ellen Rubin. Read more

Categories: Analytic technologies, Aster Data, Cloud computing, MapReduce, Presentations

4 Comments


