Predictive modeling and advanced analytics
Discussion of technologies and vendors in the overlapping areas of predictive analytics, predictive modeling, data mining, machine learning, Monte Carlo analysis, and other “advanced” analytics.
Everybody is confused about privacy and surveillance. So I’m renewing my efforts to consciousness-raise within the tech community. For if we don’t figure out and explain the issues clearly enough, there isn’t a snowball’s chance in Hades our lawmakers will get it right without us.
How bad is the confusion? Well, even Edward Snowden is getting it wrong. A Wired interview with Snowden says:
“If somebody’s really watching me, they’ve got a team of guys whose job is just to hack me,” he says. “I don’t think they’ve geolocated me, but they almost certainly monitor who I’m talking to online. Even if they don’t know what you’re saying, because it’s encrypted, they can still get a lot from who you’re talking to and when you’re talking to them.”
That is surely correct. But the same article also says:
“We have the means and we have the technology to end mass surveillance without any legislative action at all, without any policy changes.” The answer, he says, is robust encryption. “By basically adopting changes like making encryption a universal standard—where all communications are encrypted by default—we can end mass surveillance not just in the United States but around the world.”
That is false, for a myriad of reasons, and indeed is contradicted by the first excerpt I cited.
What privacy/surveillance commentators evidently keep forgetting is:
- There are many kinds of privacy-destroying information. I think people frequently overlook just how many kinds there are.
- Many kinds of organization capture that information, can share it with each other, and gain benefits from eroding or destroying privacy. Similarly, I think people overlook just how pervasive the incentive is to snoop.
- Privacy is invaded through a variety of analytic techniques applied to that information.
So closing down a few vectors of privacy attack doesn’t solve the underlying problem at all.
Worst of all, commentators forget that the correct metric for danger is not just harmful information use, but chilling effects on the exercise of ordinary liberties. But in the interest of space, I won’t reiterate that argument in this post.
Perhaps I can refresh your memory why each of those bulleted claims is correct. Major categories of privacy-destroying information (raw or derived) include:
- The actual content of your communications – phone calls, email, social media posts and more.
- The metadata of your communications — who you communicate with, when, how long, etc.
- What you read, watch, surf to or otherwise pay attention to.
- Your purchases, sales and other transactions.
- Video images, via stationary cameras, license plate readers in police cars, drones or just ordinary consumer photography.
- Monitoring via the devices you carry, such as phones or medical monitors.
- Your health and physical state, via those devices, but also inferred from, for example, your transactions or search engine entries.
- Your state of mind, which can be inferred to various extents from almost any of the other information areas.
- Your location and movements, ditto. Insurance companies also want to put monitors in cars to track your driving behavior in detail.
Categories: Health care, Predictive modeling and advanced analytics, Surveillance and privacy, Telecommunications
I’ve talked with many companies recently that believe they are:
- Focused on building a great data management and analytic stack for log management …
- … unlike all the other companies that might be saying the same thing …
- … and certainly unlike expensive, poorly-scalable Splunk …
- … and also unlike less-focused vendors of analytic RDBMS (which are also expensive) and/or Hadoop distributions.
At best, I think such competitive claims are overwrought. Still, it’s a genuinely important subject and opportunity, so let’s consider what a great log management and analysis system might look like.
Much of this discussion could apply to machine-generated data in general. But right now I think more players are doing product management with an explicit conception either of log management or event-series analytics, so for this post I’ll share that focus too.
A short answer might be “Splunk, but with more analytic functionality and more scalable performance, at lower cost, plus numerous coupons for free pizza.” A more constructive, bottom-up approach might start with …
Many of the companies I talk with boast of freeing business analysts from reliance on IT. This, to put it mildly, is not a unique value proposition. As I wrote in 2012, when I went on a history of analytics posting kick,
- Most interesting analytic software has been adopted first and foremost at the departmental level.
- People seem to be forgetting that fact.
In particular, I would argue that the following analytic technologies started and prospered largely through departmental adoption:
- Fourth-generation languages (the analytically-focused ones, which in fact started out being consumed on a remote/time-sharing basis)
- Electronic spreadsheets
- 1990s-era business intelligence
- Fancy-visualization business intelligence
- Predictive analytics
- Text analytics
- Rules engines
What brings me back to the topic is conversations I had this week with Paxata and Metanautix. The Paxata story starts:
- Paxata is offering easy — and hopefully in the future comprehensive — “data preparation” tools …
- … that are meant to be used by business analysts rather than ETL (Extract/Transform/Load) specialists or other IT professionals …
- … where what Paxata means by “data preparation” is not specifically what a statistician would mean by the term, but rather generally refers to getting data ready for business intelligence or other analytics.
Metanautix seems to aspire to a more complete full-analytic-stack-without-IT kind of story, but clearly sees the data preparation part as a big part of its value.
If there’s anything new about such stories, it has to be on the transformation side; BI tools have been helping with data extraction since, well, the dawn of BI.
Categories: Business intelligence, Datameer, EAI, EII, ETL, ELT, ETLT, Predictive modeling and advanced analytics, Progress, Apama, and DataDirect
After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.
- My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
- My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
- My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
- My Cloudera/SQL/Impala/Hive post apparently was well-balanced, in that it got attacked from multiple sides via Twitter & email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
- My post on the Intel/Cloudera deal garnered a comment reminding us Dell had pushed the Intel distro.
- My CitusDB post picked up a few clarifying comments.
Here is a catch-all post to complete the set.
Spark is on the rise, to an even greater degree than I thought last month.
- Numerous clients and other companies I talk with have adopted Spark, plan to adopt Spark, or at least think it’s likely they will. In particular:
- A number of analytic-stack companies are joining ClearStory in using Spark. Most of the specifics are confidential, but I hope some will be announced soon.
- MapR has joined Cloudera in supporting Spark, and indeed — unlike Cloudera — is supporting the full Spark stack.
- Mike Olson of Cloudera is on record as predicting that Spark will be the replacement for Hadoop MapReduce. Just about everybody seems to agree, except perhaps for Hortonworks folks betting on the more limited and less mature Tez.
- Spark’s biggest technical advantages as a general data processing engine are probably:
- The Directed Acyclic Graph processing model. (Any serious MapReduce-replacement contender will probably echo that aspect.)
- A rich set of programming primitives in connection with that model.
- Support also for highly-iterative processing, of the kind found in machine learning.
- Flexible in-memory data structures, namely the RDDs (Resilient Distributed Datasets).
- A clever approach to fault-tolerance.
- Spark is a major contender in streaming.
- There’s some cool machine-learning innovation using Spark.
- Spark 1.0 will drop by mid-May, Apache voters willin’ an’ the creek don’ rise. Publicity will likely ensue, with strong evidence of industry support.*
*Yes, my fingerprints are showing again.
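To make the DAG-plus-RDD idea a bit more concrete, here is a toy sketch in plain Python. It is not Spark's API: `ToyDataset` and all of its methods are hypothetical names chosen for illustration. The one assumption baked in is that fault tolerance works by remembering how each dataset was derived (its lineage) and recomputing a lost in-memory copy from that, which is how RDDs are usually described.

```python
class ToyDataset:
    """Hypothetical RDD-like dataset: lazy, lineage-tracked, optionally in memory."""

    def __init__(self, parents=(), op=None, source=None):
        self.parents = parents        # upstream nodes in the DAG
        self.op = op                  # how to derive rows from the parents' rows
        self.source = source          # only set on root datasets
        self.keep_in_memory = False
        self._cached = None

    @classmethod
    def from_list(cls, items):
        return cls(source=list(items))

    # Transformations only record lineage; nothing is computed yet.
    def map(self, fn):
        return ToyDataset((self,), lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyDataset((self,), lambda rows: [r for r in rows if pred(r)])

    def union(self, other):
        return ToyDataset((self, other), lambda a, b: a + b)

    def persist(self):
        """Ask for the computed rows to be kept in memory after first use."""
        self.keep_in_memory = True
        return self

    def evict(self):
        """Simulate losing the in-memory copy (e.g. a failed node)."""
        self._cached = None

    def collect(self):
        """Action: walk the lineage graph and actually compute the rows."""
        if self._cached is not None:
            return list(self._cached)
        if self.source is not None:
            rows = list(self.source)
        else:
            rows = self.op(*(p.collect() for p in self.parents))
        if self.keep_in_memory:
            self._cached = rows
        return list(rows)


base = ToyDataset.from_list(range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).persist()
combined = evens_squared.union(base.filter(lambda x: x > 7))
first = combined.collect()
evens_squared.evict()            # lose the cached copy
second = combined.collect()      # recomputed from lineage, same answer
```

The point of the sketch is only that the data lives in memory when you ask for that, while the recipe for rebuilding it is always retained.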
The most official description of what Spark now contains is probably the “Spark ecosystem” diagram from Databricks. However, at the time of this writing it is slightly out of date, as per some email from Databricks CEO Ion Stoica (quoted with permission):
… but if I were to redraw it, SparkSQL will replace Shark, and Shark will eventually become a thin layer above SparkSQL and below BlinkDB.
With this change, all the modules on top of Spark (i.e., SparkStreaming, SparkSQL, GraphX, and MLlib) are part of the Spark distribution. You can think of these modules as libraries that come with Spark.
Categories: Cloudera, Complex event processing (CEP), Databricks, Spark and BDAS, Hadoop, Hortonworks, MapR, MapReduce, Predictive modeling and advanced analytics, SQL/Hadoop integration, Yahoo
The Spark buzz keeps increasing; almost everybody I talk with expects Spark to win big, probably across several use cases.
Disclosure: I’ll soon be in a substantial client relationship with Databricks, hoping to improve their stealth-mode marketing.
The “real-time analytics” gold rush I called out last year continues. A large fraction of the vendors I talk with have some variant of “real-time analytics” as a central message.
Hadapt laid off its sales and marketing folks, and perhaps some engineers as well. In a nutshell, Hadapt’s approach to SQL-on-Hadoop wasn’t selling vs. the many alternatives, and Hadapt is doubling down on poly-structured data*/schema-on-need.
*While Hadapt doesn’t to my knowledge use the term “poly-structured data”, some other vendors do. And so I may start using it more myself, at least when the poly-structured/multi-structured distinction actually seems significant.
WibiData is partnering with DataStax. WibiData is of course pleased to get access to Cassandra’s user base, which gave me the opportunity to ask why they thought Cassandra had beaten HBase in those accounts. The answer was performance and availability, while Cassandra’s traditional lead in geo-distribution wasn’t mentioned at all.
Disclosure: My fingerprints are all over that deal.
In other news, WibiData has had some executive departures as well, but seems to be staying the course on its strategy. I continue to think that WibiData has a really interesting vision about how to do large-data-volume interactive computing, and anybody in that space would do well to talk with them or at least look into the open source projects WibiData sponsors.
I encountered another apparently-popular machine-learning term: bandit model. It seems to be glorified A/B testing. I think the point is that it tries to optimize how much you invest in testing unproven (for good or bad) alternatives, rather than fixing that split in advance.
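For the curious, here is a minimal sketch of one common bandit strategy, epsilon-greedy, run against simulated data. It is only an illustration of the general idea, not anything a particular vendor described to me; the function name, the success rates and the default parameters are all assumptions made up for the example. The `epsilon` parameter is the knob for how much traffic keeps going to unproven alternatives.

```python
import random


def epsilon_greedy(true_rates, rounds=10_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over arms with the given success rates.

    Returns (pulls, wins), each a list with one entry per arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms

    def observed_rate(arm):
        # Untried arms look infinitely good, so each one gets tried early on.
        return wins[arm] / pulls[arm] if pulls[arm] else float("inf")

    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                  # explore: random arm
        else:
            arm = max(range(n_arms), key=observed_rate)  # exploit: best so far
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:               # simulated outcome
            wins[arm] += 1
    return pulls, wins


# Three hypothetical page variants with made-up conversion rates.
pulls, wins = epsilon_greedy([0.05, 0.10, 0.30])
```

A plain A/B/C test would split traffic evenly for the whole run; here most of the traffic should drift toward whichever variant is performing best, while a small fixed share keeps testing the others.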
I had an awkward set of interactions with GoodData, including my longest conversations with them since 2009. GoodData is in the early days of trying to offer an all-things-to-all-people analytic stack via SaaS (Software as a Service). I gather that Hadoop, Vertica, PostgreSQL (a cheaper Vertica alternative), Spark, Shark (as a faster version of Hive) and Cassandra (under the covers) are all in the mix, but please don’t hold me to those details.
I continue to think that computing is moving to a combination of appliances, clusters, and clouds. That said, I recently bought a new gaming-class computer, and spent many hours gaming on it just yesterday.* I.e., there’s room for general-purpose workstations as well. But otherwise, I’m not hearing anything that contradicts my core point.
*The last beta weekend for The Elder Scrolls Online; I loved Morrowind.
From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that includes:
- Hadoop (always, and please see below).
- Analytic RDBMS (ditto).
- NoSQL and NewSQL.
- Specifically, SQL-on-Hadoop
- Spark and other memory-centric technology, including streaming.
- Public policy, mainly but not only in the area of surveillance/privacy.
- General strategic advice for all sizes of tech company.
Other stuff on my mind includes but is not limited to:
1. Certain categories of buying organizations are inherently leading-edge.
- Internet companies have adopted Hadoop, NoSQL, NewSQL and all that en masse. Often, they won’t even look at things that are conventional or expensive.
- US telecom companies have been buying one of every DBMS on the market since pre-relational days.
- Financial services firms — specifically algorithmic traders and broker-dealers — have been in their own technical world for decades …
- … as have national-security agencies …
- … as have pharmaceutical research departments.
Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.
I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.
- Spark is very new. All Spark adoption is recent.
- Databricks was founded to commercialize Spark. It is very much in stealth mode …
- … except insofar as Databricks folks are going out and trying to drum up Spark adoption.
- Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated.
- Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
- Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s view of Spark’s role in the world is different from the Spark team’s.
The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:
- Spark is a distributed execution engine for analytic processes …
- … which works well with Hadoop.
- Spark is distinguished by a flexible in-memory data model …
- … and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
- Intended analytic use cases for Spark include:
- SQL data manipulation.
- ETL-like data manipulation.
- Streaming-like data manipulation.
- Machine learning.
- Graph analytics.
It took me a bit of time, and an extra call with Vertica’s long-time R&D chief Shilpa Lawande, but I think I have a decent handle now on Vertica 7, code-named Crane. The two aspects of Vertica 7 I find most interesting are:
- Flex Zone, a schema-on-need technology very much like Hadapt’s (but of course with access to Vertica performance).
- What sounds like an alternate query execution capability for short-request queries, the big point of which is that it saves them from being broadcast across the whole cluster, hence improving scalability. (Adding nodes of course doesn’t buy you much for the portion of a workload that’s broadcast.)
Other Vertica 7 enhancements include:
- A lot of Bottleneck Whack-A-Mole.
- “Significant” improvements to the Vertica management console.
- Security enhancements (Kerberos), Hadoop integration enhancements (HCatalog), and enhanced integration with Hadoop security (Kerberos again).
- Some availability hardening. (“Fault groups”, which for example let you ensure that data is replicated not just to 2+ nodes, but also that the nodes aren’t all on the same rack.)
- Java as an option to do in-database analytics. (Who knew that feature was still missing?)
- Some analytic functionality. (Approximate COUNT DISTINCT, but not yet Approximate MEDIAN.)
Overall, two recurring themes in our discussion were:
- Load and ETL (Extract/Transform/Load) performance, and/or obviating ETL.
- Short-request performance, in the form of more scalable short-request concurrency.
I talked tonight with Lee Edlefsen, Chief Scientist of Revolution Analytics, and now think I understand Revolution’s parallel R much better than I did before.
There are four primary ways that people try to parallelize predictive modeling:
- They can run the same algorithm on different parts of a dataset on different nodes, then return all the results, and claim they’ve parallelized. This is trivial and not really a solution. It is also the last-ditch fallback position for those who parallelize more seriously.
- They can generate intermediate results from different parts of a dataset on different nodes, then generate and return a single final result. This is what Revolution does.
- They can parallelize the linear algebra that underlies so many algorithms. Netezza and Greenplum tried this, but I don’t think it worked out very well in either case. Lee cited a saying in statistical computing “If you’re using matrices, you’re doing it wrong”; he thinks shortcuts and workarounds are almost always the better way to go.
- They can jack up the speed of inter-node communication, perhaps via MPI (Message Passing Interface), so that full parallelization isn’t needed. That’s SAS’ main approach.
One confusing aspect of this discussion is that it could reference several heavily-overlapping but not identical categories of algorithms, including:
- External memory algorithms, which operate on datasets too big to fit in main memory, by, for starters, reading in and working on a part of the data at a time. Lee observes that these are almost always parallelizable.
- What Revolution markets as External Memory Algorithms, which are those external memory algorithms it has gotten around to implementing so far. These are all parallelized. They are also all in the category of …
- … algorithms that can be parallelized by:
- Operating on data in parts.
- Getting intermediate results.
- Combining them in some way for a final result.
- Algorithms of the previous category, where the way of combining them specifically is in the form of summation, such as those discussed in the famous paper Map-Reduce for Machine Learning on Multicore. Not all of Revolution’s current parallel algorithms fall into this group.
To be clear, all Revolution’s parallel algorithms are in Category #2 by definition and Category #3 in practice. However, they aren’t all in Category #4.
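As an illustration of the "operate on parts, get intermediate results, combine" pattern, here is a minimal sketch of a one-variable least-squares fit whose intermediate results combine by summation (so it sits in the last category as well). It is not Revolution's code or algorithm; `Partial`, `summarize`, `combine` and `fit` are hypothetical names, and the chunks are processed in a simple loop where a real system would send each one to a different node.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Intermediate result for one chunk: counts and running sums."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def __add__(self, other):
        # Combining two intermediate results is plain summation.
        return Partial(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                       self.sxx + other.sxx, self.sxy + other.sxy)


def summarize(chunk):
    """Work done on one part of the data (conceptually, on one node)."""
    p = Partial()
    for x, y in chunk:
        p.n += 1
        p.sx += x
        p.sy += y
        p.sxx += x * x
        p.sxy += x * y
    return p


def combine(partials):
    """Merge per-chunk intermediate results into one."""
    return sum(partials, Partial())


def fit(p):
    """Final step: slope and intercept of y = slope * x + intercept.

    Assumes the x values are not all identical (otherwise the denominator is 0).
    """
    slope = (p.n * p.sxy - p.sx * p.sy) / (p.n * p.sxx - p.sx ** 2)
    intercept = (p.sy - slope * p.sx) / p.n
    return slope, intercept


points = [(x, 2 * x + 1) for x in range(10)]
chunks = [points[0:3], points[3:7], points[7:10]]   # three "nodes"
slope, intercept = fit(combine(summarize(c) for c in chunks))
```

Only five numbers per chunk ever cross the network, and the answer is the same no matter how the data is split, which is what makes this kind of algorithm parallelize cleanly.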
Categories: Greenplum, Hadoop, MapReduce, Netezza, Parallelization, Predictive modeling and advanced analytics, Revolution Analytics, Teradata