Discussion of Google’s data management technologies MapReduce and BigTable. Related subjects include:
Some notes, follow-up, and links before I head out to California: Read more
|Categories: GIS and geospatial, Google, HP and Neoview, Humor, Kickfire, Netezza, Solid-state memory, Teradata, Web analytics||3 Comments|
Nested data structures have come up several times now, almost always in the context of log files.
- Google has published about a project called Dremel. Per Tasso Agyros, one of Dremel’s key concepts is nested data structures.
- Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. eBay was very interested in that project too.
- Facebook’s log files have a big nested data structure flavor.
I don’t have a grasp yet on what exactly is happening here, but it’s something.
|Categories: eBay, Facebook, Google, Log analysis, Scientific research, Theory and architecture||7 Comments|
Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I’m finally finding time to clear my Cassandra/Riptano backlog. I’ll cover the more technical parts below, and the more business- or usage-oriented ones in a companion Cassandra/Riptano post.
Jonathan’s core claims for Cassandra include:
- Cassandra is shared-nothing.
- Cassandra has good approaches to replication and partitioning, right out of the box.
- In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)
- Cassandra’s scale-out is application-transparent, unlike sharded MySQL’s.
- Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.
In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he’s concerned, may well belong in a more traditional SQL DBMS. Read more
|Categories: Amazon and its cloud, Cassandra, DataStax, Facebook, Google, Log analysis, NoSQL, Open source, Parallelization||4 Comments|
As you might imagine, there are a lot of blog posts I’d like to write I never seem to get around to, or things I’d like to comment on that I don’t want to bother ever writing a full post about. In some cases I just tweet a comment or link and leave it at that.
And it’s not going to get any better. Next week = the oft-postponed elder care trip. Then I’m back for a short week. Then I’m off on my quarterly visit to the SF area. Soon thereafter I’ve have a lot to do in connection with Enzee Universe. And at that point another month will have gone by.
Anyhow: Read more
|Categories: Analytic technologies, Business intelligence, Data warehousing, Exadata, GIS and geospatial, Google, IBM and DB2, Netezza, Oracle, Parallelization, SAP AG, SAS Institute||3 Comments|
Rumors are flying that Google may acquire ITA Software. I know nothing of their validity, but I have known about ITA Software for a while. Random notes include:
- ITA Software builds huge OLTP systems that it runs itself on behalf of airlines.
- Very, very unusually, ITA Software builds these huge OLTP systems in LISP.
- ITA Software is an Oracle shop (see Dan Weinreb’s comment).
- ITA Software is run by a techie (again, see Dan Weinreb’s comment).
- ITA Software has an interesting screen-scraping/web ETL project called Needlebase
ITA’s software does both price/reservation lookup/checking and reservation-making. I’ve had trouble keeping it straight, but I think the lookup is ITA’s actual business, and the reservation-making is ITA’s Next Big Thing. This is one of the ultimate federated-transaction-processing applications, because it involves coordinating huge OLTP systems run, in some cases, by companies that are bitter competitors with each other. Network latencies have to allow for intercontinental travel of the data itself.
Indeed, airline reservation systems are pretty much the OLTP ultimate in themselves. As the story goes, transaction monitors were pretty much invented for airline reservation systems in the 1960s.
A really small project for ITA Software is Needlebase. I stopped by ITA to look at Needlebase in January, and what it is is a very smart and hence interesting screen-scraping system. The idea is people publish database information to the web, and you may want to look at their web pages and recover the database records it is based on. Applications of this to the airline industry, which has 100s of 1000s of price changes per day — and I may be too low by one or two orders of magnitude when I say that — should be fairly obvious. ITA Software has aspirations of applying Needlebase to other sectors as well, or more precisely having users who do so. Last I looked, ITA hadn’t put significant resources behind stimulating Needlebase adoption — but Google might well change that.
Edit: I just re-found an old characterization of (some of) what ITA Software does by — who else? — Dan Weinreb:
I am working on our new product, an airline reservation system. It’s an online transaction-processing system that must be up 99.99% of the time, maintaining maximum response time (e.g. on www.aircanada.com). It’s a very, very complicated system. The presentation layer is written in Java using conventional techniques. The business rule layer is written in Common Lisp; about 500,000 lines of code (plus another 100,000 or so of open source libraries). The database layer is Oracle RAC. We operate our own data centers, some here in Massachusetts and a disaster-recovery site in Canada (separate power grid).
|Categories: Data integration and middleware, EAI, EII, ETL, ELT, ETLT, Google, OLTP, Oracle||4 Comments|
I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I’m poking around a bit reading stuff on the subjects. Here are some links I found. Read more
|Categories: Amazon and its cloud, Cassandra, Continuent, Google, MySQL, NoSQL, Open source, RDF and graphs, Tokutek and TokuDB||5 Comments|
|Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Google, Hadoop, MapReduce, SenSage, Splunk||8 Comments|
Once again, I find myself writing and talking a lot about MapReduce. But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:
- MapReduce is something very new
- MapReduce involves strict adherence to the Map-Reduce programming paradigm
- MapReduce is a single technology
|Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Google, Greenplum, Hadoop, Log analysis, MapReduce, Michael Stonebraker, Parallelization, Web analytics||11 Comments|
Google has announced an experimental cloud-based data management system called Fusion Tables. A press article and Slashdot thread ensued, based on some bizarre-sounding analyst quotes that I will not attempt to parse.
What Fusion Tables really seems to be is a spreadsheet without the formulae. That is, it’s a place to dump data in a grid of cells, comment on it, version it, and do elementary data manipulation. This could, I guess, be useful as an alternative to traditional RDBMS — assuming, of course, that you want to have a row-by-row debate about 100 megs of data.
Seriously, while Google Fusion Tables bears some vague resemblance to what I’m thinking about for the future of both business intelligence and data marts, it sounds as if it has a long way to go before it’s something most enterprises should spend time looking at.