Sneakernet to the cloud
Recently, Amazon CTO Werner Vogels put up a blog post suggesting that, now and in the future, the best way to get large databases into the cloud is via sneakernet. In some circumstances, he is surely right. Possible implications include:
- When sending data to the cloud, you probably want to compress it to the max before sending (a sketch below makes this concrete). Clearpace’s new RainStor structured-data archiving service emphasizes that idea. RainStor marketing says cloud, cloud, cloud — but Clearpace thinks you really should have a bit of its software onsite too, to compress the data before sending it across the wire.
- Getting data from one cloud to another cloud could be problematic. I’m fond of saying that weblog data naturally lives in the cloud at your hosting company’s location, so you should analyze it there too. But this makes the most sense if you analyze it or at least filter/reduce it in place. (That said, the really, really big web companies have lots of different data centers, and presumably do move huge amounts of log data from place to place.)
But for one-time moves of data sets — sure, sneakernet/snail mail should work just fine.
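To make the compress-before-sending idea concrete, here’s a minimal Python sketch. The file name is a placeholder, and the actual transfer step (S3 upload, rsync, or a FedEx box) is left as an exercise:

```python
import gzip
import shutil

# Compress a local export before it crosses the wire (or rides in a box).
# "warehouse_export.csv" is a placeholder -- substitute your own extract.
with open("warehouse_export.csv", "rb") as src, \
        gzip.open("warehouse_export.csv.gz", "wb", compresslevel=9) as dst:
    shutil.copyfileobj(src, dst)  # streams in chunks, so the file needn't fit in RAM

# Only the (much smaller) .gz file then needs to be shipped.
```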
Categories: Amazon and its cloud, Cloud computing, Database compression, EAI, EII, ETL, ELT, ETLT, Web analytics | 2 Comments |
Song of the contract programming firm, and other filks
I once heard a different version of the same idea at Boskone, but here is a pretty good send-up of what might occur at a customer review session. (Warning, however: low production values.) Also, in case you missed them, a couple of classic Star Trek filksongs are considerably funnier — especially the first.
While I’m on the subject, a couple of more serious filksongs I really like are:
- Jordin Kare’s Fire in the Sky
- Heather Alexander singing Demonbane.
Other great serious filksongs are “Queen of Air and Darkness” (Poul Anderson lyrics) and Jordin Kare’s “When the Ship Lifts, All Debts Are Paid”, but I can’t find recordings of those now.
And finally, back to the humor: I just found a video for a song I posted previously.
Categories: Fun stuff, Humor | 2 Comments |
Teradata Developer Exchange (DevX) begins to emerge
Every vendor needs developer-facing web resources, and Teradata turns out to have been working on a new umbrella site for them. It’s called Teradata Developer Exchange — DevX for short. Teradata DevX seems to be in a low-volume beta now, with a press release and bigger rollout coming in the next week or so. Major elements are about what one would expect:
- Articles
- Blogs
- Downloads
- Surprisingly, so far as I can tell, no forums
If you’re a Teradata user, you absolutely should check out Teradata DevX. If you just research Teradata — my situation 🙂 — there are some aspects that might be of interest anyway. In particular, I found Teradata’s downloads instructive, especially those in the area of extensibility. Mainly, these are UDFs (User-Defined Functions), in areas such as:
- Compression
- Geospatial data
- Imitating Oracle or DB2 UDFs (as migration aids)
Also of potential interest is a custom-portlet framework for Teradata’s management tool Viewpoint. A straightforward use would be to plunk some Viewpoint data into a more general system management dashboard. A yet cooler use — and I couldn’t get a clear sense of whether anybody’s ever done this yet — would be to offer end users some insight as to how long their queries are apt to run.
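To illustrate the dashboard idea — and this is a purely hypothetical sketch, since I haven’t seen the portlet framework’s actual interfaces — one might pull a couple of Viewpoint-style metrics into a general-purpose monitoring page like so:

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint and field names, invented purely for illustration;
# the real Viewpoint portlet framework may look nothing like this.
VIEWPOINT_URL = "http://viewpoint.example.com/api/system-health"

with urlopen(VIEWPOINT_URL) as resp:
    health = json.load(resp)

# Republish a few metrics into a broader system management dashboard.
print("CPU busy %:", health.get("cpu_busy_pct"))
print("Active queries:", health.get("active_query_count"))
```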
Categories: Database compression, Emulation, transparency, portability, GIS and geospatial, Teradata | 2 Comments |
Yet more on MySQL forks and storage engines
The issue of MySQL forks and their possible effect on closed-source storage engine vendors continues to get attention. The underlying question is:
Suppose Oracle wants to make life difficult for third-party storage engine vendors via its incipient control of MySQL. Can the storage engine vendors insulate themselves from that risk by working with a MySQL fork?
Categories: MySQL, Open source, PostgreSQL | 11 Comments |
How big are the intelligence agencies’ data warehouses?
Edit: The relevant part of the article cited has now been substantially changed, in line with Jeff Jonas’ remarks in the comment thread below.
Joe Harris linked me to an article that made a rather extraordinary claim:
At another federal agency Jonas worked at (he wouldn’t say which), they had a very large data warehouse in the basement. The size of the data warehouse was a secret, but Jonas estimated it at 4 exabytes (EB), and increasing at the rate of 5 TB per day.
Now, if one does the division, the quote claims it takes 800,000 days for the database to double in size, which is absurd. Perhaps this (Jeff) Jonas guy was just talking about a 4 petabyte system and got confused. (Of course, that would still be pretty big.) But before I got my arithmetic straight, I ran the 4 exabyte figure past a couple of folks, as a target for the size of the US government’s largest classified database. Best guess turns out to be that it’s 1-2 orders of magnitude too high for the government’s largest database, not 3. But that’s only a guess …
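For the record, the division goes like this (decimal units; binary units barely change the answer):

```python
# 4 exabytes, growing at 5 TB/day: how long until it doubles?
size_tb = 4 * 1_000_000      # 4 EB expressed in terabytes
growth_tb_per_day = 5

days = size_tb / growth_tb_per_day
print(days)        # 800000.0 days
print(days / 365)  # ~2192 years -- which is why the claim looks absurd
```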
Categories: Data warehousing, Specific users | 5 Comments |
Notes on CEP application development
While performance may not be all that great a source of CEP competitive differentiation, event processing vendors find plenty of other bases for technological competition, including application development, analytics, packaged applications, and data integration. In particular:
- Most independent CEP vendors have some kind of application story in the capital markets vertical, such as packaged applications, ISV partners with packaged applications, application frameworks, and so on.
- CEP vendors offer lots of connectors to specific financial industry price/quote/trade feeds, as well as the usual other kinds of database connectivity (SQL, XML, etc.)
- Aleri/Coral8 (separately and now together) like to call attention to their business intelligence/analytics offerings. Analytics is front-and-center on Truviso’s web site too, not that Truviso does much to call attention to itself, period. (Roman Bukary once said he’d outline Truviso’s new strategy to me in 6-8 weeks or so … it’s now 14 months and counting.)
So far as I can tell, the areas of applications and analytics are fairly uncontroversial. Different CEP vendors have implemented different kinds of things, no doubt focusing on those they thought they would find easiest to build and then sell. But these seem to be choices in business execution, not in core technical philosophy.
In CEP application development, however, real philosophical differences do seem to arise. There are at least three different CEP application development paradigms: Read more
Categories: Aleri and Coral8, Business intelligence, Microsoft and SQL*Server, Progress, Apama, and DataDirect, StreamBase, Streaming and complex event processing (CEP) | 5 Comments |
Notes on CEP performance
I’ve been talking to CEP vendors on and off for a few years, so what I hear about performance is fairly patchwork. On the other hand, per-core performance figures that are a year or two old may still be meaningful today. After all, Moore’s Law now shows up more in core count than in per-core speed, and CEP vendors’ development efforts don’t seem to have been concentrated on raw engine speed.
So anyway, what do you guys have to add to the following observations?
- Super-low-latency financial services industry tasks are often “embarrassingly parallel.” Thus, near-linear scale-out is common.
- That said, good parallelism seems fairly new in CEP engines (of course, CEP engines are fairly new themselves — for all I know, some have been parallel since inception).
- I’ve heard claims of up to 400,000 messages/second/core for simple queries or patterns.
- I’ve heard claims of 70,000 messages/second/core for not-so-simple queries or patterns — and probably higher than that, depending on what the meaning of “simple” is.
- IBM just disclosed >15,000 messages/second/core on a pretty low-powered processor.
- I’ve heard that Coral8, Apama, and StreamBase rarely lost deals due to performance or throughput problems. I’ve heard that the same is not as true of Aleri.
- StreamBase proudly says it’s been fully multithreaded since academic research-project days. For Apama multithreading is evidently a more recent feature. But does it matter much?
Categories: Aleri and Coral8, IBM and DB2, Memory-centric data management, Progress, Apama, and DataDirect, StreamBase, Streaming and complex event processing (CEP) | 13 Comments |
Followup on IBM System S/InfoSphere Streams
After posting about IBM’s System S/InfoSphere Streams CEP offering, I sent three followup questions over to Jeff Jones. It seems simplest to just post the Q&A verbatim.
1. Just how many processors or cores does it take to get those 5 million messages/sec through? A little birdie says 4,000 cores. Read more
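If the birdie is right, simple division puts System S at 1,250 messages/second/core — far below the >15,000 messages/second/core figure in my previous post, presumably reflecting a much more complex workload:

```python
# Back-of-the-envelope: 5 million messages/second across 4,000 cores.
messages_per_second = 5_000_000
cores = 4_000
print(messages_per_second / cores)  # 1250.0 messages/second/core
```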
Categories: Analytic technologies, IBM and DB2, Investment research and trading, Streaming and complex event processing (CEP) | 7 Comments |
MySQL forking heats up, but not yet to the benefit of non-GPLed storage engine vendors
Last month, I wrote “This is a REALLY good time to actively strengthen the MySQL forkers,” largely on behalf of closed-source/dual-source MySQL storage engine vendors such as Infobright, Kickfire, Calpont, Tokutek, or ScaleDB. Yesterday, two of my three candidates to lead the effort — namely Monty Widenius/MariaDB/Monty Program AB and Percona — came together to form something called the Open Database Alliance. Details may be found:
- On the Open Database Alliance website
- In a press release
- On Monty Widenius’ blog
- In a Stephen O’Grady blog post based on a discussion with Monty Widenius
- In an ars technica blog post based on a discussion with Monty Program AB’s Kurt von Finck
But there’s no joy for the non-GPLed MySQL storage engine vendors in the early news. Read more
Categories: MySQL, Open source, Theory and architecture | 16 Comments |
Facebook’s experiences with compression
One little topic didn’t make it into my long post on Facebook’s Hadoop/Hive-based data warehouse: Compression. The story seems to be:
- Facebook uses gzip, and gets a little bit more than 6X compression.
- Experiments suggest bzip2 would reduce data by another 20% or so, increasing compression to the 7.5X range.
- The downside of bzip2 is 15-25% processing overhead, depending on the kind of data.
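The arithmetic is consistent, by the way: shrinking the gzipped data by another 20% takes the ratio from 6X to 6/0.8 = 7.5X. And if you want to check the size/CPU tradeoff on your own data, a minimal sketch (with a placeholder file name) looks like this:

```python
import bz2
import gzip
import time

# Compare gzip vs. bzip2 compression ratio and CPU cost on a sample file.
# "sample.log" is a placeholder; real ratios depend heavily on the data.
with open("sample.log", "rb") as f:
    raw = f.read()

for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
    start = time.perf_counter()
    packed = compress(raw)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(raw) / len(packed):.1f}X in {elapsed:.3f}s")
```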
Categories: Data warehousing, Database compression, Facebook, Hadoop | 2 Comments |