An odd claim attributed to Mike Stonebraker
This post has a sequel.
Last week, Mike Stonebraker insulted MySQL and Facebook’s use of it, by implication advocating VoltDB instead. Kerfuffle ensued. To the extent Mike was saying that non-transparently sharded MySQL isn’t an ideal way to do things, he’s surely right. That still leaves a lot of options for massive short-request databases, however, including transparently sharded RDBMS, scale-out in-memory DBMS (whether or not VoltDB*), and various NoSQL options. If nothing else, Couchbase would seem superior to memcached/non-transparently sharded MySQL if you were starting a project today.
*The big problem with VoltDB, last I checked, was its reliance on Java stored procedures to get work done.
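For those who haven't seen one, below is a minimal sketch of what a VoltDB stored procedure looks like, written against VoltDB's documented Java API. The table, columns, and logic are invented purely for illustration; this isn't code VoltDB has shown me.

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Minimal sketch of a VoltDB stored procedure. The transactional logic
// lives in the server as a Java class; the PLAYERS table and its columns
// are hypothetical, purely for illustration.
public class UpdateScore extends VoltProcedure {

    // SQL is declared up front so VoltDB can plan and compile it ahead of time.
    public final SQLStmt updateScore =
        new SQLStmt("UPDATE PLAYERS SET score = score + ? WHERE player_id = ?;");
    public final SQLStmt readScore =
        new SQLStmt("SELECT score FROM PLAYERS WHERE player_id = ?;");

    public VoltTable[] run(long playerId, long delta) {
        voltQueueSQL(updateScore, delta, playerId);
        voltQueueSQL(readScore, playerId);
        // Both statements execute as a single serialized transaction.
        return voltExecuteSQL(true);
    }
}
```

The footnote's complaint, restated: to get full performance, even simple read-modify-write logic like this has to be written, compiled, and loaded into the server as Java, rather than issued ad hoc from the application.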
Pleasantries continued in The Register, which got an amazing-sounding quote from Mike. If The Reg is to be believed — something I wouldn’t necessarily take for granted — Mike claimed that he (i.e. VoltDB) knows how to solve the distributed join performance problem. Read more
Categories: Cache, Clustering, Couchbase, Games and virtual worlds, In-memory DBMS, memcached, Michael Stonebraker, MySQL, Parallelization, Theory and architecture, VoltDB and H-Store | 20 Comments |
Hadoop futures and enhancements
Hadoop is immature technology. As such, it naturally offers much room for improvement in both industrial-strengthness and performance. And since Hadoop is booming, multiple efforts are underway to fill those gaps. For example:
- Cloudera’s proprietary code is focused on management, set-up, etc.
- The “Phase 1” plans Hortonworks shared with me for Apache Hadoop are focused on industrial-strengthness, as are significant parts of “Phase 2”.*
- MapR tells a performance story versus generic Apache Hadoop HDFS and MapReduce. (One aspect of same is just C++ vs. Java.)
- So does Hadapt, but mainly vs. Hive.
- Cloudera also tells me there’s a potential 4-5X performance improvement in Hive coming down the pike from what amounts to an optimizer rewrite.
(Zettaset belongs in the discussion too, but made an unfortunate choice of embargo date.)
Categories: Cloudera, Greenplum, Hadapt, Hadoop, HBase, MapR, MapReduce, Parallelization, Zettaset | 20 Comments |
Cloudera and Hortonworks
My clients at Cloudera have been around for a while, in effect positioned as “the Hadoop company.” Their business, in a nutshell, consists of:
- Packaging up a Cloudera distribution of Apache Hadoop. This distribution doesn’t have proprietary code; it’s just packaged by Cloudera from Apache projects (with a decent minority of the code happening to have been contributed by Cloudera engineers).
- Paid subscription support for Apache Hadoop and, in connection with that …
- … proprietary software that all support customers automatically get. There are two points to this proprietary software:
- It adds value for the customer.
- It makes Cloudera’s support job easier.
- Professional services around Hadoop.
- Training and conferences around Hadoop, which probably don’t generate all that much money, but are great marketing in terms of visibility, thought leadership, and lead generation.
Hortonworks spun out of Yahoo last week, with parts of the Cloudera business model, namely Hadoop support, training, and I guess conferences. Hortonworks emphatically rules out professional services, and says that it will contribute all code back to Apache Hadoop. Hortonworks does grudgingly admit that it might get into the proprietary software business at some point — but evidently hopes that day will never actually come.
Categories: Cloudera, Hadoop, Hortonworks, IBM and DB2, MapReduce, Open source, Yahoo | 9 Comments |
Sybase IQ soundbites
Sybase made a total hash of the timing of this week’s press release. I got annoyed after they promised to inform me of the new embargo time, then broke the promise. Other people got annoyed earlier than that.
So be it. Below is the draft of a post I was holding, with brackets added around one word that is no longer accurate.
I don’t write enough about Sybase IQ. That said, I offered a couple of quotes to a reporter [yesterday] in connection with the general availability of Sybase IQ 15.3. Lightly edited, they go:
- “Shared-everything MPP” isn’t a total contradiction in terms. It’s great for scaling to larger numbers of concurrent users. And there’s little doubt that Sybase IQ can support robust access to databases tens of terabytes in size.
- As I first noted a couple of years ago, virtual data marts are a good idea. Too few vendors are making it easy to spin them out. They let departments start doing analytics very quickly, yet allow IT to keep partial control.
Beyond that, I should note:
- Sybase IQ is the classic choice for what I call traditional data marts.
- Sybase IQ is a leader in temporal functionality, which is no coincidence given its presence in the financial services market.
Categories: Columnar database management, Data warehousing, Parallelization, Sybase, Theory and architecture | Leave a Comment |
Hadapt update
I met with the Hadapt guys today. I think I can be a bit crisper than before in positioning Hadapt and its use cases, namely:
- Hadapt is additional software on a cluster that also runs fully functional Hadoop/HDFS. (To date that has more often been Cloudera's Hadoop distribution than straight-from-Apache Hadoop, but that's not a requirement.)
- The cluster also runs a DBMS on every node, such as PostgreSQL or one of Infobright/Vectorwise.
- Hadapt’s software manages parallel SQL queries by distributing them to the DBMS living on each node. Hadapt says that the resulting query performance far outshines Hive’s. (There's a toy sketch of the general pattern after this list.)
- Hadapt further says that, by exploiting the partner DBMS, its SQL functionality outpaces Hive’s as well.
- Target Hadapt use cases are centered around keeping machine-generated or other poly-structured data in Hadoop, and extracting, enhancing, or otherwise deriving some of it to live in the relational store.
- In particular, Hadapt seems like an interesting choice when you want to use that relational data as you work on other data that’s still in HDFS, or if you want to keep using the relational data in other kinds of MapReduce jobs.
- That all fits well with my thoughts about the importance of derived data.
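Hadapt hasn't walked me through its internals, so treat the following as a toy sketch of the general pattern those bullets describe (push the same SQL fragment to a node-local PostgreSQL on each node, then combine the partial results), not anything resembling Hadapt's actual code. Hostnames, credentials, and schema are all invented.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Toy illustration of split query execution: run the same aggregate
// fragment against the node-local DBMS on every node, then combine the
// partial results centrally. All names here are hypothetical.
public class SplitQueryDemo {
    public static void main(String[] args) throws Exception {
        String[] nodes = {"node1", "node2", "node3"};
        String fragment = "SELECT COUNT(*) FROM events WHERE event_type = 'click'";

        long total = 0;
        for (String node : nodes) {
            String url = "jdbc:postgresql://" + node + ":5432/warehouse";
            try (Connection conn = DriverManager.getConnection(url, "demo", "demo");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(fragment)) {
                rs.next();
                total += rs.getLong(1); // partial count from this node's local DBMS
            }
        }
        // A real system also has to plan multi-stage joins, repartition data,
        // and survive node failures; that is where the hard work lies.
        System.out.println("Combined count across nodes: " + total);
    }
}
```

The appeal of the pattern, per Hadapt's pitch, is that the per-node work gets done by a mature relational engine rather than being translated into generic MapReduce.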
Other evolution from what I wrote about Hadapt a few months ago includes:
- Hadapt is in beta now.
- Hadapt has added adult supervision in the form of Philip Wickline, late of Endeca.
In other news, Hadapt is our newest client.
Petabyte-scale Hadoop clusters (dozens of them)
I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.
Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo’s latest stated figures are:
- 42,000 Hadoop nodes …
- … holding 180-200 petabytes of data.
Hadoop hardware and compression
A month ago, I posted about typical Hadoop hardware. After talking today with Eric Baldeschwieler of Hortonworks, I have an update. I also learned some things from Eric and from Brian Christian of Zettaset about Hadoop compression.
First the compression part. Eric thinks 6-10X compression is common for “curated” Hadoop data — i.e., the data that actually gets used a lot. Brian used an overall figure of 6-8X, and told of a specific customer who had 6X or a little more. By way of comparison, those sound like the same kinds of data for which Vertica claimed 10-60X compression almost three years ago.
Eric also made an excellent point about low-value machine-generated data. I was suggesting that as Moore’s Law made sensor networks ever more affordable: Read more
Categories: Cloudera, Database compression, Hadoop, Hortonworks, Storage, Vertica Systems, Zettaset | 10 Comments |
Eight kinds of analytic database (Part 2)
In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear. Read more
Eight kinds of analytic database (Part 1)
Analytic data management technology has blossomed, leading to many questions along the lines of “So which products should I use for which category of problem?” The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for “big data” is little help.
Let’s try eight categories instead. While no categorization is ever perfect, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need — and in most cases you’ll need several — is a great early step in your analytic technology planning. Read more
What colleges should teach in analytics
Based on a Teradata press release calling attention to the small amount of explicit university instruction in business intelligence, I was asked:
Does BI really need a dedicated undergrad track? What sort of BI and analytics-related skills should students look to obtain now in order to be viable in the job marketplace five years out?
My answers were (slightly edited):
- Most important is a basic, intuitive understanding of statistical significance. If you’re looking at an apparent trend, is it real or just random variation? (See the small worked example after this list.)
- Also crucial are general analytic and quantitative problem-solving skills.
- One should also be comfortable learning how to use new software tools.
- Everybody in business should have those skill sets. So should people in science, medicine, teaching, journalism, government, and most other vocations.
- The more analytically oriented should add basic programming skills, and basic knowledge of SQL. While SQL’s utter dominance is ebbing a bit, it still will be with us for a very long time.
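To make the statistical-significance point concrete, here's a tiny worked example on invented numbers: a conversion rate appears to rise from 2.0% to 2.3%, with 10,000 visits measured in each period. A quick two-proportion z-test says whether that uptick is evidence of anything.

```java
// Two-proportion z-test on made-up numbers: is an apparent lift real,
// or could it easily be random variation?
public class SignificanceCheck {
    public static void main(String[] args) {
        double n1 = 10_000, c1 = 200; // baseline period: 2.0% conversion
        double n2 = 10_000, c2 = 230; // later period:    2.3% conversion

        double p1 = c1 / n1;
        double p2 = c2 / n2;
        double pooled = (c1 + c2) / (n1 + n2);
        double stdErr = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
        double z = (p2 - p1) / stdErr;

        // |z| below roughly 1.96 means not significant at the usual 95% level.
        System.out.printf("z = %.2f -> %s%n", z,
            Math.abs(z) >= 1.96 ? "probably a real difference"
                                : "could easily be random variation");
    }
}
```

Here z comes out around 1.5, below the usual 1.96 cutoff, so a 15% relative lift on samples of that size could easily be noise. That's exactly the kind of intuition I have in mind.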
Of course, there are more specialized skills also worth teaching, in a number of areas, starting with statistics and other predictive modeling technologies. But it’s OK to go through life not knowing those.