Fox and MySpace
Discussion of MySpace’s and MySpace parent Fox Interactive Media’s use of database and analytic technology. Related subjects include:
- The use of analytic technologies to study web and network event data
There’s a point I keep making in speeches, and used to keep making in white papers, yet have almost never spelled out in this blog. Let me now (somewhat) correct the oversight.
Analytic technology isn’t only for you. It’s also for your customers, citizens, and other stakeholders.
I am not referring here to what is well understood to be an important, fast-growing activity — providing data and its analysis to customers as your primary or only business — nor to the related business of taking people’s data, crunching it for them, and giving them results. That combined sector — which I am pretty alone in aggregating into one and calling data mart outsourcing — is one of the top several vertical markets for a lot of the analytic DBMS vendors I write about. Rather, I’m talking about enterprises that gather data for some primary purpose, and have discovered that a good secondary use of the data is to reflect it back to stakeholders, often the same ones who provided or created it in the first place.
For now I’ll call this category stakeholder-facing analytics, as the shorter phrase “stakeholder analytics” would be ambiguous.* I first picked up the idea early this decade from Information Builders, for whom it had become something of a specialty. I’ve been asking analytics vendors for examples of stakeholder-facing analytics ever since, and a number have been able to comply. But the whole thing is in its early days even so; almost any sufficiently large enterprise should be more active in stakeholder-facing analytics than it currently is.
|Categories: Analytic technologies, Business intelligence, Data mart outsourcing, Fox and MySpace, PostgreSQL||4 Comments|
Some notes based on what I’ve been reading recently: Read more
|Categories: Akiban, Analytic technologies, Data warehousing, EMC, Exadata, Fox and MySpace, Games and virtual worlds, Groovy Corporation, IBM and DB2, Open source, Oracle, SAP AG, Theory and architecture||Leave a Comment|
I dropped by log analysis software vendor Splunk a few weeks ago for a chat with Marketing VP Steve Sommer (who some you may know from Cognos and/or Informix), Product Management VP Christina Noren, and above all co-founder/CTO Erik Swan. Splunk turns out to be a pretty interesting company, from both business and technical standpoints. For one thing, Splunk seems highly regarded by most people I mention it to.
Splunk’s technical stories include:
- Text search over log files.
- Business intelligence over text search. (That part sounds a lot like Attivio.)
- MapReduce with schema flexibility and smart multi-stage execution plans. (That part sounds a lot like Aster Data.)
More on those in a separate post.
Less technical Splunk highlights include: Read more
|Categories: Analytic technologies, Fox and MySpace, Investment research and trading, Log analysis, Splunk, Telecommunications, Text, Web analytics||1 Comment|
Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:
- In most cases, no.
- Most of these technologies are designed for simple, high-volume OLTP (OnLine Transaction Processing.) Most large enterprises have an established way of doing OLTP, probably via relational database management systems. Why change?
- MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS.
- There’s one big countervailing factor to all these generalities — schema flexibility.
As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL. Read more
Aster Data continues to think that MapReduce, integrated with SQL, is an important technology. For example:
- Aster announced today that it’s providing .NET support for SQL/MapReduce. Perhaps not coincidentally, Aster’s biggest customer is MySpace, which is apparently a big Microsoft shop. (And MySpace parent Fox Interactive Media is a SQL/MapReduce fan, albeit running on Greenplum.)
- Aster generally puts more emphasis on MapReduce than SQL/MapReduce rival Greenplum. That’s a non-trivial comparison, because Greenplum is making progress in SQL/MapReduce itself.
- When talking with Aster folks, I can’t get them to shut up hear a lot about SQL/MapReduce.
I was a big fan of SQL/MapReduce when it was first announced last August. Notwithstanding persuasive examples favoring pure DBMS or pure MapReduce over DBMS/MapReduce integration, I continue to think the SQL/MapReduce idea has great potential. But I do wish more successful production examples would become visible …
|Categories: Analytic technologies, Aster Data, Data warehousing, Fox and MySpace, Greenplum, MapReduce, Parallelization||4 Comments|
Greenplum’s most important reference is probably its energetic advocate Fox Interactive Media, even ahead of much larger user Greenplum user eBay, and notwithstanding Aster Data’s large presence in Fox subsidiary MySpace. I just ran across a “review” of Greenplum by FIM’s Brian Dolan, neatly summarizing his views about Greenplum’s strengths, weaknesses, and uses inside Fox. Highlights include: Read more
Eric Lai offers more facts, figures, explanation, and competitive insight than I did on Greenplum’s loading of the Fox/MySpace database, including that Greenplum is being loaded with data at the 4 TB/hour rate only for half an hour at a time.
Also, Eric cites the Greenplum Fox Interactive Media database as being only 200 TB in size. Surely there is some confusion somewhere, since Greenplum described it as being 400 TB back in August.
Data warehouse load speeds are a contentious issue. Vertica contrived a benchmark with a 5 1/2 terabyte/hour load rate. Oracle has gotten dinged for very low load speeds, which then are hotly debated. I was told recently of a Greenplum partner’s salesman steering a prospect who needed rapid load speeds away from Greenplum, which seemed odd to me.
Now Greenplum has come out swinging, claiming “consistent” load speeds of 4 terabytes/hour at its Fox Interactive Media account, and armed with a customer quote saying just that. Note however that load speeds tend to be proportional to the number of disks, and there are a LOT of disks at that installation.
One way to think about load speeds is — how long would it take to load the entire database? It seems as if the Fox database could be loaded, perhaps not in one week, but certainly in less than two. Flipping that around, the Fox site only has enough capacity to hold less than 2 weeks of detailed data. (This is not uncommon in network event kinds of databases.) And a corollary of that is — worldwide storage sales are still constrained by cost, not by absolute limits on the amounts of data enterprises would like to store.
|Categories: Data warehousing, EAI, EII, ETL, ELT, ETLT, Fox and MySpace, Greenplum, Theory and architecture, Web analytics||3 Comments|
Greenplum (and Truviso) advisor Joseph Hellerstein offers a few examples of MapReduce applications (specifically Greenplum MapReduce), namely:
The big aha moment occured for me during our panel discussion, which included Luke Lonergan from Greenplum, Roger Magoulas from O’Reilly, and Brian Dolan from Fox Interactive Media (which runs MySpace among other web properties).
Roger talked about using MapReduce to extract structured entities from text for doing tech trend analyses from billions of rows of online job postings. Brian (who is a mathematician by training) was talking about implementing conjugate gradiant and Support Vector Machines in parallel SQL to support “hypertargeting” for advertisers. I mentioned how Jonathan Goldman at LinkedIn was using SQL and MapReduce to do graph algorithms for social network analysis.
Incidentally: While it’s been some months since I asked, my sense is that the O’Reilly text extraction is home-grown, and primitive compared to what one could do via commercial products. That said, if the specific application is examining job postings, I’m not sure how much value more sophisticated products would add. After all, tech job listings are generally written in a style explicitly designed to ensure that most or all of their meaning is conveyed simply by a bag of keywords. And by the way, this effort has been underway for quite some time.
- Greenplum has a page on the O’Reilly relationship. However, the part that isn’t behind a registration barrier is trivial — and I wouldn’t know one way or the other about the registration-required part.
|Categories: Analytic technologies, Data warehousing, Fox and MySpace, Greenplum, MapReduce, Specific users, Web analytics||3 Comments|
Greenplum’s largest named account is Fox Interactive Media — the parent organization of MySpace — which has a multi-hundred terabyte database that it uses for hardcore data mining/analytics. Greenplum has been engaging in regrettable business practices, claiming that it is in the process of supplanting Aster Data at Fox/MySpace. In fact, MySpace’s use of Aster is more mission-critical than Fox’s use of Greenplum, and is increasing significantly.
|Categories: Analytic technologies, Aster Data, Data warehousing, Fox and MySpace, Greenplum, Specific users, Theory and architecture, Web analytics||3 Comments|