Some subjects just keep coming up. And so I keep saying things like:
Most generalizations about “Big Data” are false. “Big Data” is a horrific catch-all term, with many different meanings.
Most generalizations about Hadoop are false. Reasons include:
- Hadoop is a collection of disparate things, most particularly data storage and application execution systems.
- The transition from Hadoop 1 to Hadoop 2 will be drastic.
- For key aspects of Hadoop — especially file format and execution engine — there are or will be widely varied options.
Hadoop won’t soon replace relational data warehouses, if indeed it ever does. SQL-on-Hadoop is still very immature. And you can’t replace data warehouses unless you have the power of SQL.
Note: SQL isn’t the only way to provide “the power of SQL”, but alternative approaches are just as immature.
Most generalizations about NoSQL are false. Different NoSQL products are … different. It’s not even accurate to say that all NoSQL systems lack SQL interfaces. (For example, SQL-on-Hadoop often includes SQL-on-HBase.)
I’m not having a productive week, part of the reason being a hard drive crash that took out early drafts of what were to be last weekend’s blog posts. Now I’m operating from a laptop, rather than my preferred dual-monitor set-up. So please pardon me if I’m concise even by comparison to my usual standards.
- My recent posts based on surveillance news have been partly superseded by – well, by more news. Some of that news, along with some good discussion, may be found in the comment threads.
- The same goes for my recent Hadoop posts.
- The replay for my recent webinar on real-time analytics is now available. My part ran <25 minutes.
- One of my numerous clients using or considering a “real-time analytics” positioning is Sqrrl, the company behind the NoSQL DBMS Accumulo. Last month, Derrick Harris reported on a remarkable Accumulo success story – multiple US intelligence instances managing 10s of petabytes each, and supporting a variety of analytic (I think mainly query/visualization) approaches.
- Several sources have told me that MemSQL’s Zynga sale is (in part) for Membase replacement. This is noteworthy because Zynga was the original pay-for-some-of-the-development Membase customer.
- More generally, the buzz out of Couchbase is distressing. Ex-employees berate the place; job-seekers check around and then decide not to go there; rivals tell me of resumes coming out in droves. Yes, there’s always some of that, even at obviously prospering companies, but this feels like more than the inevitable low-level buzz one hears anywhere.
- I think the predictive modeling state of the art has become:
- Cluster in some way.
- Model separately on each cluster.
- And if you still want to do something that looks like a regression – linear or otherwise – then you might want to use a tool that lets you shovel training data in WITHOUT a whole lot of preparation* and receive a model back out. Even if you don’t accept that as your final model, it can at least be a great guide to feature selection (in the statistical sense of the phrase) and the like.
- Champion/challenger model testing is also a good idea, at least if you’re in some kind of personalization/recommendation space, and have enough traffic to test like that.**
- Most companies have significant turnover after being acquired, perhaps after a “golden handcuff” period. Vertica is no longer an exception.
- Speaking of my clients at HP Vertica – they’ve done a questionable job of communicating that they’re willing to price their product quite reasonably. (But at least they allowed me to write about $2K/terabyte for hardware/software combined.)
- I’m hearing a little more Amazon Redshift buzz than I expected to. Just a little.
- StreamBase was bought by TIBCO. The rumor says $40 million.
*Basic and unavoidable ETL (Extract/Transform/Load) of course excepted.
**I could call that ABC (Always Be Comparing) or ABT (Always Be Testing), but they each sound like – well, like The Glove and the Lions.
Edit: Please see the comment thread below for updates. Please also see a follow-on post about how the surveillance data is actually used.
US government surveillance has exploded into public consciousness since last Thursday. With one major exception, the news has just confirmed what was already thought or known. So where do we stand?
My views about domestic data collection start:
- I’ve long believed that the Feds — specifically the NSA (National Security Agency) — are storing metadata/traffic data on every telephone call and email in the US. The recent news, for example Senator Feinstein’s responses to the Verizon disclosure, just confirms it. That the Feds sometimes claim this has to be “foreign” data or they won’t look at it hardly undermines my opinion.
- Even private enterprises can more or less straightforwardly buy information about every credit card purchase we make. So of course the Feds can get that as well, as the Wall Street Journal seems to have noticed. More generally, I’d assume the Feds have all the financial data they want, via the IRS if nothing else.
- Similarly, many kinds of social media postings are aggregated for anybody to purchase, or can be scraped by anybody who invests in the equipment and bandwidth. Attensity’s service is just one example.
- I’m guessing that web use data (http requests, search terms, etc.) is not yet routinely harvested by the US government.* Ditto deanonymization of same. I guess that way basically because I’ve heard few rumblings to the contrary. Further, the consumer psychographic profiles that are so valuable to online retailers might be of little help to national security analysts anyway.
- Video surveillance seems likely to grow, from fixed cameras perhaps to drones; note for example the various officials who called for more public cameras after that Boston Marathon bombing. But for the present discussion, that’s of lesser concern to me, simply because it’s done less secretively than other kinds of surveillance. If there’s a camera that can see us, often we can see it too.
*Recall that these comments are US-specific. Data retention legislation has been proposed or passed in multiple countries to require recording of, among other things, all URL requests, with the stated goal of fighting either digital piracy or child pornography.
As for foreign data: Read more
|Categories: Hadoop, HP and Neoview, Petabyte-scale data management, Pricing, Surveillance and privacy, Telecommunications, Text, Vertica Systems, Web analytics||10 Comments|
Teradata is announcing its new high-end systems, the Teradata 6700 series. Notes on that include:
- Teradata tends to get 35-55% (roughly speaking) annual performance improvements, as measured by its internal blended measure Tperf. A big part of this is exploiting new-generation Intel processors.
- This year the figure is around 40%.
- The 6700 is based on Intel’s Sandy Bridge.
- Teradata previously told me that Ivy Bridge — the next one after Sandy Bridge — could offer a performance “discontinuity”. So, while this is just a guess, I expect that next year’s Teradata performance improvement will beat this year’s.
- Teradata has now largely switched over to InfiniBand.
Teradata is also talking about data integration and best-of-breed systems, with buzzwords such as:
- Teradata Unified Data Architecture.
- Fabric-based computing, even though this isn’t really about storage.
- Teradata SQL-H.
|Categories: Data integration and middleware, Data warehouse appliances, Data warehousing, Pricing, SAS Institute, Teradata||3 Comments|
I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:
- How big is your database? How big is your budget?
- How do you feel about appliances?
- How do you feel about the cloud?
- What are the size and shape of your workload?
- How fresh does the data need to be?
Let’s drill down. Read more
I’m usually annoyed by lists of year-end predictions. Still, a reporter asked me for some, and I found one kind I was comfortable making.
Trends that I think will continue in 2013 include:
Growing attention to machine-generated data. Human-generated data grows at the rate business activity does, plus 0-25%. Machine-generated data grows at the rate of Moore’s Law, also plus 0-25%, which is a much higher total. In particular, the use of remote machine-generated data is becoming increasingly real.
Hadoop adoption. Everybody has the big bit bucket use case, largely because of machine-generated data. Even today’s technology is plenty good enough for that purpose, and hence justifies initial Hadoop adoption. Development of further Hadoop technology, which I post about frequently, is rapid. And so the Hadoop trend is very real.
Application SaaS. The on-premises application software industry has hopeless problems with product complexity and rigidity. Any suite new enough to cut the Gordian Knot is or will be SaaS (Software as a Service).
Newer BI interfaces. Advanced visualization — e.g. Tableau or QlikView — and mobile BI are both hot. So, more speculatively, are “social” BI (Business Intelligence) interfaces.
Price discounts. If you buy software at 50% of list price, you’re probably doing it wrong. Even 25% can be too high.
MySQL alternatives. NoSQL and NewSQL products often are developed as MySQL alternatives. Oracle has actually done a good job on MySQL technology, but now its business practices are scaring companies away from MySQL commitments, and newer short-request SQL DBMS are ready for use.
|Categories: Business intelligence, Hadoop, MySQL, NewSQL, NoSQL, Open source, Oracle, Pricing, Software as a Service (SaaS), Surveillance and privacy||3 Comments|
Merv Adrian and Doug Henschen both reported more details about Amazon Redshift than I intend to; see also the comments on Doug’s article. I did talk with Rick Glick of ParAccel a bit about the project, and he noted:
- Amazon Redshift is missing parts of ParAccel, notably the extensibility framework.
- ParAccel did some engineering to make its DBMS run better in the cloud.
- Amazon did some engineering in the areas it knows better than ParAccel — cloud provisioning, cloud billing, and so on.
“We didn’t want to do the deal on those terms” comments from other companies suggest ParAccel’s main financial take from the deal is an already-reported venture investment.
The cloud-related engineering was mainly around communications, e.g. strengthening error detection/correction to make up for the lack of dedicated switches. In general, Rick seemed more positive on running in the (Amazon) cloud than analytic RDBMS vendors have been in the past.
So who should and will use Amazon Redshift? For starters, I’d say: Read more
|Categories: Amazon and its cloud, Business intelligence, Cloud computing, Data mart outsourcing, Data warehousing, Infobright, ParAccel, Predictive modeling and advanced analytics, Pricing, Vertica Systems||4 Comments|
Two of the more interesting approaches for integrating Hadoop and MapReduce with relational DBMS come from my clients at Teradata Aster (via SQL/MR and SQL-H) and Hadapt. In both cases, the story starts:
- You can dump any kind of data you want into Hadoop’s file system.
- You can have data in a scale-out RDBMS to get good performance on analytic SQL.
- You can access all the data (not just the relationally stored part) via SQL.
- You can do MapReduce on all the data (not just the Hadoop-stored part).
- To varying degrees, Hadapt and Aster each offer three kinds of advantage over Hadoop-with-Hive:
- SQL performance is (much) better.
- SQL functionality is better.
- At least some of your employees — the “business analysts” — can invoke MapReduce processes through SQL, if somebody else (e.g. your techies or the vendor’s) coded them up in the first place.
Of course, there are plenty of differences. Those start: Read more
My clients at Teradata are introducing a mix-em/match-em Aster/Hadoop box, officially called the Teradata Aster Big Analytics Appliance. Basics include:
- You can fill a rack with nodes either for the Aster DBMS or for Hadoop (Hortonworks flavor), or you can combine them in the same box.
- If you combine them, they share management software (adapted from mainstream Teradata’s) and Infiniband.
- An Aster node has 16 2.6-gigahertz cores and 24 900GB disk drives.
- A Hadoop node has 12 2.0-gigahertz cores and 12 3TB drives.
- A central part of Teradata’s strategy is that Aster and Hadoop nodes can work together via SQL-H.
- The Teradata Aster Big Analytics Appliance is based on a family of Dell servers that fit more compactly into racks than do Teradata’s traditional products.
- The Teradata Aster Big Analytics Appliance replaces a previous interim Teradata Aster appliance that used similar hardware to that in other Teradata systems.
My views on the Teradata Aster Big Analytics Appliance start: Read more
My new clients at Aerospike have a range of minor news to announce:
- A company and product name change (they used to be Citrusleaf).
- Some new people and funding.
- In association with an acqui-hire — of AlchemyDB guy Russ Sullivan — some unspecified future technical plans.
- A community edition (Aerospike, nee’ Citrusleaf, is closed-source).
Mainly, however, they want to call your attention to the fact that they’ve been selling a fast, reliable key-value store, with a number of production references, and want to suggest that other organizations should perhaps buy it as well.
- Aerospike has a key-value data model.
- Secondary indexes and so on are still futures.
- Aerospike is clustered, of course.
- Two hardware/storage choices are encouraged:
- Spinning disk, but you keep all your data in RAM.
- Solid-state disk.
AeroSpike’s three core marketing claims are performance, consistent performance, and uninterrupted operations.
- Aerospike’s performance claims are supported by a variety of blazing internal benchmarks.
- Aerospike’s consistent performance claims are along the lines of sub-millisecond latency, with 99.9% of responses being within 5 milliseconds, and even a node outage only borking performance for some 10s of milliseconds.
- Uninterrupted operation is a core AeroSpike design goal, and the company says that to date, no AeroSpike production cluster has ever gone down.
Aerospike technical details start with the expected: Read more
|Categories: Aerospike, Market share and customer counts, Memory-centric data management, NoSQL, Pricing||2 Comments|