Hadoop news and rumors, June 23, 2013
Cloudera
- Cloudera changed CEOs last week. Tom Reilly, late of ArcSight, is the new guy (I don’t know him), while Mike Olson’s titles become Chairman and Chief Strategy Officer. Mike told me Friday that Reilly had secretly been working with him for months.
- Mike shared good-sounding numbers with me. But little is for public disclosure, except the stat that Cloudera now has >400 employees.
- There are always rumors of infighting at Cloudera, perhaps because from its earliest days Cloudera has been a place where tempers are worn on sleeves. That said, Mike denied stories of problems between him and COO Kirk Dunn, and greatly praised Kirk’s successes at large-account sales.
- Cloudera now self-identifies pretty clearly as an analytic data management company. The vision is multiple execution engines – MapReduce, Impala, something more memory-centric, etc. – talking to any of a variety of HDFS file formats. While some formats may be optimized for specific engines – e.g. Parquet for Impala – anything can work with more or less anything.*
- Mike told me that Cloudera didn’t have any YARN users in production, but thought there would be some by year-end. Even so, he thinks it’s fair to say that Cloudera users have substantial portions of Hadoop 2 in production, for example NameNode failover and HDFS (Hadoop Distributed File System) performance enhancements. Ditto HCatalog.
*Of course, there will always be exceptions. E.g., some formats can be updated on a short-request basis, while others can only be written to via batch conversions.
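To make the engines-times-formats point concrete, here’s a minimal sketch, in Python, of one Parquet-backed table being queried from both Impala (via the impyla package) and Hive (via PyHive). The host name, ports, and table name are illustrative assumptions on my part, not anything Cloudera showed me.

```python
# Illustrative sketch only: one Parquet-backed table, two execution engines.
# Host name, ports, and the table name are assumptions for the example.
from impala.dbapi import connect as impala_connect  # impyla package
from pyhive import hive                              # PyHive package

QUERY = "SELECT COUNT(*) FROM web_events"  # hypothetical Parquet-backed table

# Query through an Impala daemon (HiveServer2-style protocol, default port 21050).
impala_conn = impala_connect(host="cdh-node1.example.com", port=21050)
impala_cur = impala_conn.cursor()
impala_cur.execute(QUERY)
print("Impala:", impala_cur.fetchall())

# Query the same table through Hive (HiveServer2, default port 10000).
hive_conn = hive.connect(host="cdh-node1.example.com", port=10000)
hive_cur = hive_conn.cursor()
hive_cur.execute(QUERY)
print("Hive:", hive_cur.fetchall())
```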
Everybody else
- There’s a widespread belief that Hortonworks is being shopped. Numerous folks – including me — believe the rumor of an Intel offer for $700 million. Higher figures and alternate buyers aren’t as widely believed.
- Views of MapR market traction, never high, are again on the downswing.
- IBM Big Insights seems to have some traction.
- In case there was any remaining doubt — DBMS vendors are pretty unanimous in agreeing that it makes sense to have Hadoop too. To my knowledge SAP hasn’t been as clear about showing a markitecture incorporating Hadoop as most of the others have … but then, SAP’s markitecture is generally less clear than other vendors’.
- Folks I talk with are generally wondering where and why Datameer lost its way. That still leaves Datameer ahead of other first-generation Hadoop add-on vendors (Karmasphere, Zettaset, et al.), in that I rarely hear them mentioned at all.
- I visited with my client Platfora. Things seem to be going very well.
- My former client Revelytix seems to have racked up some nice partnerships. (I had something to do with that. :))
Impala and Parquet
I visited Cloudera Friday for, among other things, a chat about Impala with Marcel Kornacker and colleagues. Highlights included:
- Impala is meant to someday be a competitive MPP (Massively Parallel Processing) analytic RDBMS.
- At the moment, it is not one. For example, Impala lacks any meaningful form of workload management or query optimization.
- While Impala will run against any HDFS (Hadoop Distributed File System) file format, claims of strong performance assume that the data is in Parquet …
- … which is the replacement for the short-lived Trevni …
- … and which for most practical purposes is true columnar.
- Impala is also meant to be more than an RDBMS; Parquet and presumably in the future Impala can accommodate nested data structures.
- Just as Impala runs against most or all HDFS file formats, Parquet files can be used by most Hadoop execution engines, and of course by Pig and Hive.
- The Impala roadmap includes workload management, query optimization, data skipping, user-defined functions, hash distribution, two turtledoves, and a partridge in a pear tree.
Data gets into Parquet via batch jobs only — one reason it’s important that Impala run against multiple file formats — but background format conversion is another roadmap item. A single table can be split across multiple formats — e.g., the freshest data could be in HBase, with the rest in Parquet.
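To illustrate what “true columnar plus nested data” buys you in practice, here’s a minimal sketch using the pyarrow library to write a Parquet file containing a nested column and then read back only a subset of columns. The library choice, field names, and values are my own illustrative assumptions, not anything Cloudera demonstrated.

```python
# Minimal sketch: Parquet is columnar (you can read back a subset of columns)
# and supports nested structures (the list-of-structs "visits" column below).
# Field names and values are illustrative assumptions.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "JP"],
    # Nested column: each row holds a list of (page, seconds) structs.
    "visits": [
        [{"page": "/home", "seconds": 12}],
        [{"page": "/docs", "seconds": 40}, {"page": "/home", "seconds": 5}],
        [],
    ],
})
pq.write_table(table, "events.parquet")

# Columnar read: only the columns you ask for are scanned and decoded.
subset = pq.read_table("events.parquet", columns=["user_id", "visits"])
print(subset.to_pydict())
```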
Webinar Wednesday, June 26, 1 pm EST — Real-Time Analytics
I’m doing a webinar Wednesday, June 26, at 1 pm EST/10 am PST called:
Real-Time Analytics in the Real World
The sponsor is MemSQL, one of my numerous clients to have recently adopted some version of a “real-time analytics” positioning. The webinar sign-up form has an abstract that I reviewed and approved … albeit before I started actually outlining the talk. 😉
Our plan is:
- I’ll review the multiple technologies and use cases that various companies call “real-time analytics”. I’m not planning for this part to be at all MemSQL-focused.*
- MemSQL will review some specific use cases they feel their product — memory-centric scale-out RDBMS — has proven it supports.
*MemSQL is debuting pretty high in my rankings of content sponsors who are cool with vendor neutrality. I sent them a draft of my slides mentioning other tech vendors and not them, and they didn’t blink.
In other news, I’ll be in California over the next week. Mainly I’ll be visiting clients — and 2 non-clients and some family — from 10:00 am through dinner, but I did set aside time to stop by GigaOm Structure on Wednesday. I have sniffles/cough/other stuff even before I go. So please don’t expect a lot of posts until I’ve returned, rested up a bit, and also prepared my webinar deck.
How is the surveillance data used?
Over the past week, discussion has exploded about US government surveillance. After summarizing, as best I could, what data the government appears to collect, now I’d like to consider what they actually do with it. More precisely, I’d like to focus on the data’s use(s) in combating US-soil terrorism. In a nutshell:
- Reporting is persuasive that electronic surveillance data is helpful in following up on leads and tips obtained by other means.
- Reporting is not persuasive that electronic surveillance data on its own uncovers or averts many terrorist plots.
- With limited exceptions, neither evidence nor logic suggests that data mining or predictive modeling does much to prevent domestic terrorist attacks.
Consider the example of Tamerlan Tsarnaev:
In response to this 2011 request, the FBI checked U.S. government databases and other information to look for such things as derogatory telephone communications, possible use of online sites associated with the promotion of radical activity, associations with other persons of interest, travel history and plans, and education history.
While that response was unsuccessful in preventing a dramatic act of terrorism, at least they tried.
As for actual success stories — well, that’s a bit tough. In general, there are few known examples of terrorist plots being disrupted by law enforcement in the United States, except for fake plots engineered to draw terrorist-leaning individuals into committing actual crimes. One of those examples, that of Najibullah Zazi, was indeed based on an intercepted email — but the email address itself was uncovered through more ordinary anti-terrorism efforts.
As for machine learning/data mining/predictive modeling, I’ve never seen much of a hint of it being used in anti-terrorism efforts, whether in the news or in my own discussions inside the tech industry. And I think there’s a great reason for that — what would they use for a training set? Here’s what I mean. Read more
Where things stand in US government surveillance
Edit: Please see the comment thread below for updates. Please also see a follow-on post about how the surveillance data is actually used.
US government surveillance has exploded into public consciousness since last Thursday. With one major exception, the news has just confirmed what was already thought or known. So where do we stand?
My views about domestic data collection start:
- I’ve long believed that the Feds — specifically the NSA (National Security Agency) — are storing metadata/traffic data on every telephone call and email in the US. The recent news, for example Senator Feinstein’s responses to the Verizon disclosure, just confirms it. That the Feds sometimes claim this has to be “foreign” data or they won’t look at it hardly undermines my opinion.
- Even private enterprises can more or less straightforwardly buy information about every credit card purchase we make. So of course the Feds can get that as well, as the Wall Street Journal seems to have noticed. More generally, I’d assume the Feds have all the financial data they want, via the IRS if nothing else.
- Similarly, many kinds of social media postings are aggregated for anybody to purchase, or can be scraped by anybody who invests in the equipment and bandwidth. Attensity’s service is just one example.
- I’m guessing that web use data (http requests, search terms, etc.) is not yet routinely harvested by the US government.* Ditto deanonymization of same. I guess that way basically because I’ve heard few rumblings to the contrary. Further, the consumer psychographic profiles that are so valuable to online retailers might be of little help to national security analysts anyway.
- Video surveillance seems likely to grow, from fixed cameras perhaps to drones; note for example the various officials who called for more public cameras after the Boston Marathon bombing. But for the present discussion, that’s of lesser concern to me, simply because it’s done less secretively than other kinds of surveillance. If there’s a camera that can see us, often we can see it too.
*Recall that these comments are US-specific. Data retention legislation has been proposed or passed in multiple countries to require recording of, among other things, all URL requests, with the stated goal of fighting either digital piracy or child pornography.
As for foreign data: Read more
Dave DeWitt responds to Daniel Abadi
A few days ago I posted Daniel Abadi’s thoughts in a discussion of Hadapt, Microsoft PDW (Parallel Data Warehouse)/PolyBase, Pivotal/Greenplum Hawq, and other SQL-Hadoop combinations. This is Dave DeWitt’s response. Emphasis mine.
Read more
SQL-Hadoop architectures compared
The genesis of this post is:
- Dave DeWitt sent me a paper about Microsoft Polybase.
- I argued with Dave about the differences between Polybase and Hadapt.
- I asked Daniel Abadi for his opinion.
- Dan agreed with Dave, in a long email …
- … that he graciously permitted me to lightly edit and post.
I love my life.
Per Daniel (emphasis mine): Read more
WibiData and its Kiji technology
My clients at WibiData:
- Think they’re an application software company …
- … but actually are talking about what I call analytic application subsystems.
- Haven’t announced or shipped any of those either …
- … but will shortly.
- Have meanwhile shipped some cool enabling technology.
- Name their products after sushi restaurants.
Yeah, I like these guys. 🙂
If you’re building an application that “obviously” calls for a NoSQL database, and which has a strong predictive modeling aspect, then WibiData has thought more cleverly about what you need than most vendors I can think of. More precisely, WibiData has thought cleverly about your data management, movement, crunching, serving, and integration. For pure modeling sophistication, you should look elsewhere — but WibiData will gladly integrate with or execute those models for you.
WibiData’s enabling technology, now called Kiji, is a collection of modules, libraries, and so on — think Spring — running over Hadoop/HBase. Except for some newfound modularity, it is much like what I described at the time of WibiData’s launch or what WibiData further disclosed a few months later. Key aspects include:
- A way to define schemas in HBase, including ones that change as rapidly as consumer-interaction applications require.
- An analytic framework called “Produce/Gather”, which can execute at human real-time speeds (via its own execution engine) or with higher throughput in batch mode (by invoking Hadoop MapReduce).
- Enough load capabilities, Hive interaction, and so on to get data into the proper structure in Kiji in the first place.
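I won’t try to reproduce Kiji’s own API here, but as a rough sketch of the HBase pattern it layers on (row per entity, column-family schemas, timestamped cell versions), here’s a minimal example using the generic happybase HBase client. The host, table, and column names are made up for illustration; this is plain HBase, not Kiji.

```python
# Rough sketch of the HBase layout an entity-centric layer like Kiji sits on:
# one row per user, column families for profile data vs. interaction events,
# with HBase's cell versioning keeping a short history per column.
# This uses the generic happybase client, not Kiji's API; names are made up.
import happybase

conn = happybase.Connection("hbase-master.example.com")  # hypothetical host
conn.create_table(
    "users",
    {
        "info": dict(max_versions=1),      # slowly changing profile fields
        "events": dict(max_versions=100),  # recent interactions, versioned by timestamp
    },
)

table = conn.table("users")
table.put(b"user#1001", {b"info:email": b"alice@example.com"})
table.put(b"user#1001", {b"events:last_click": b"/checkout"})

# Read the row back, including multiple stored versions of the event column.
row = table.row(b"user#1001")
history = table.cells(b"user#1001", b"events:last_click", versions=5)
print(row, history)
```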
Syncsort extends Hadoop MapReduce
My client Syncsort:
- Is an ETL (Extract/Transform/Load) vendor, whose flagship product DMExpress was evidently renamed to DMX.
- Has a strong history in and fondness for sort.
- Has announced a new ETL product, DMX-h ETL Edition, which uses Hadoop MapReduce to parallelize DMX by controlling a copy of DMX that resides on every data node of the Hadoop cluster.*
- Has also announced the closely-related DMX-h Sort Edition, offering acceleration for the sorts inherent in Map and Reduce steps.
- Contributed a patch to Apache Hadoop to open up Hadoop MapReduce to make all this possible.
*Perhaps we should question Syncsort’s previous claims of having strong multi-node parallelism already. 🙂
The essence of the Syncsort DMX-h ETL Edition story is:
- DMX-h inherits the various ETL-suite trappings of DMX.
- Syncsort claims DMX-h has major performance advantages vs., for example, Hive- or Pig-based alternatives.
- With a copy of DMX on every node, DMX-h can do parallel load/export.
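Syncsort didn’t walk me through their internals, but the general pattern of parallelizing an external engine by having each MapReduce task drive a locally installed copy can be sketched with Hadoop Streaming. Everything below (the script, the “dmx” command, its flags and paths) is an illustrative assumption on my part, not Syncsort’s actual mechanism.

```python
#!/usr/bin/env python
# mapper.py -- illustrative Hadoop Streaming mapper, NOT Syncsort's implementation.
# The idea sketched: each map task drives a locally installed copy of an external
# ETL engine (here a hypothetical "dmx" command) over its input split, so the
# engine gets parallelized by MapReduce without being rewritten.
import subprocess
import sys

# Hypothetical local engine invocation; the binary, flags, and paths are made up.
proc = subprocess.Popen(
    ["/opt/dmx/bin/dmx", "--job", "/opt/dmx/jobs/cleanse.dxj"],
    stdin=sys.stdin,
    stdout=sys.stdout,
)
sys.exit(proc.wait())
```

A mapper like that would be submitted with the stock Hadoop Streaming jar (hadoop jar hadoop-streaming.jar -input … -output … -mapper mapper.py -file mapper.py), with the framework handling the sort between Map and Reduce — which is the stage DMX-h Sort Edition claims to accelerate.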
More details can be found in a slide deck Syncsort graciously allowed me to post. Read more
IBM BLU
I had a good chat with IBM about IBM BLU, aka BLU Accelerator or Acceleration. BLU basics start:
- BLU is a part of DB2.
- BLU works like a columnar analytic DBMS.
- If you want to do a join combining BLU and non-BLU tables, all the BLU tables are joined first, and the result set is joined to the other tables by the rest of DB2.
And yes — that means Oracle is now the only major relational DBMS vendor left without a true columnar story.
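To picture that join rule, here’s a toy Python sketch that just mimics the ordering (column-organized tables joined with each other first, then the intermediate handed to the row side). It has nothing to do with DB2’s actual executor, and the tables and values are made up.

```python
# Toy illustration of the stated join order, not DB2 internals:
# join the column-organized (BLU) tables with each other first,
# then join that intermediate result to the row-organized tables.
blu_sales = {"order_id": [1, 2, 3], "amount": [10, 25, 40]}       # columnar: dict of columns
blu_dates = {"order_id": [1, 2, 3], "day": ["Mon", "Tue", "Tue"]}

row_customers = [  # row-organized table: list of row dicts
    {"order_id": 1, "customer": "Acme"},
    {"order_id": 3, "customer": "Initech"},
]

# Step 1: BLU-to-BLU join, column-at-a-time (trivial here because the
# order_id columns are already aligned; a real engine would hash-join).
blu_result = {
    "order_id": blu_sales["order_id"],
    "amount": blu_sales["amount"],
    "day": blu_dates["day"],
}

# Step 2: hand the intermediate result to the row side and finish the join there.
lookup = {c["order_id"]: c["customer"] for c in row_customers}
final = [
    {"order_id": oid, "amount": amt, "day": day, "customer": lookup[oid]}
    for oid, amt, day in zip(blu_result["order_id"], blu_result["amount"], blu_result["day"])
    if oid in lookup
]
print(final)  # only orders 1 and 3 survive the join to the row-organized table
```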
BLU’s maturity and scalability basics start:
- BLU is coming out in IBM DB2 10.5 this quarter.
- BLU will initially be single-server, but …
- … IBM claims “near-linear” scalability up to 64 cores, and further says that …
- … scale-out for BLU is coming “soon”.
- IBM already thinks all your analytically-oriented DB2 tables should be in BLU.
- IBM describes the first version of BLU as being optimized for 10 TB databases, but capable of handling 20 TB.
BLU technical highlights include: Read more