Predictive modeling and advanced analytics
Discussion of technologies and vendors in the overlapping areas of predictive analytics, predictive modeling, data mining, machine learning, Monte Carlo analysis, and other “advanced” analytics.
In which I observe that Tim Cook and the EFF, while thankfully on the right track, haven’t gone nearly far enough.
Traditionally, the term “chilling effect” referred specifically to inhibitions on what in the US are regarded as First Amendment rights — the freedoms of speech, the press, and in some cases public assembly. Similarly, when the term “chilling effect” is used in a surveillance/privacy context, it usually refers to the fear that what you write or post online can later be held against you. This concern has been expressed by, among others, Tim Cook of Apple, Laura Poitras, and the Electronic Frontier Foundation, and several research studies have supported the point.
But that’s only part of the story. As I wrote in July, 2013,
… with the new data collection and analytic technologies, pretty much ANY action could have legal or financial consequences. And so, unless something is done, “big data” privacy-invading technologies can have a chilling effect on almost anything you want to do in life.
The reason, in simplest terms, is that your interests could be held against you. For example, models can estimate your future health, your propensity for risky hobbies, or your likelihood of changing your residence, career, or spouse. Any of these insights could be useful to employers or financial services firms, and not in a way that redounds to your benefit. And if you think enterprises (or governments) would never go that far, please consider an argument from the sequel to my first “chilling effects” post: Read more
It’s difficult to project the rate of IT change in health care, because:
- Health care is suffused with technology — IT, medical device and biotech alike — and hence has the potential for rapid change. However, it is also the case that …
- … health care is heavily bureaucratic, political and regulated.
Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:
- The human-generated part of what’s in ordinary paper health records today, but across a patient’s entire lifetime. This of course includes notes created by doctors and other care givers.
- Large amounts of machine-generated data, including:
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
- Most tests exploit electronic technology. Progress in electronics is intense.
- Biomedical research is itself intense.
- In particular, most research technologies (for example gene sequencing) can be made cheap enough over time to be affordable clinically.
- The output of consumer health-monitoring devices — e.g. Fitbit and its successors. The buzzword here is “quantified self”, but what it boils down to is that every moment of our lives will be measured and recorded.
The vastly greater amounts of data cited above will allow for greatly changed analytics.
1. There are multiple ways in which analytics is inherently modular. For example:
- Business intelligence tools can reasonably be viewed as application development tools. But the “applications” may be developed one report at a time.
- The point of a predictive modeling exercise may be to develop a single scoring function that is then integrated into a pre-existing operational application.
- Conversely, a recommendation-driven website may be developed a few pages — and hence also a few recommendations — at a time.
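To make the scoring-function case concrete, here is a minimal Python sketch of a single scoring function being called from existing operational code. Everything in it is an assumption for illustration: the feature names, the coefficients, the 0.5 threshold, and the function names `response_score` and `choose_action` are hypothetical, and a logistic model is just one plausible form such a function might take.

```python
import math

# Hypothetical coefficients, standing in for whatever a modeling exercise produced.
INTERCEPT = -1.2
COEFFICIENTS = {"visits_last_30d": 0.08, "support_tickets": -0.35}


def response_score(features):
    """Logistic score in (0, 1); features missing from the record count as 0."""
    z = INTERCEPT + sum(
        weight * features.get(name, 0.0) for name, weight in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def choose_action(customer, threshold=0.5):
    """Stand-in for pre-existing operational code that now consults the score."""
    return "send_offer" if response_score(customer) >= threshold else "no_action"
```

The modular part is that only `response_score` changes when the model is refit; the operational code around it can stay as it was.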
Also, analytics is inherently iterative.
- Everything I just called “modular” can reasonably be called “iterative” as well.
- So can any work process of the nature “OK, we got an insight. Let’s pursue it and get more accuracy.”
If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need.
2. In 2011, I wrote, in the context of agile predictive analytics, that
… the “business analyst” role should be expanded beyond BI and planning to include lightweight predictive analytics as well.
I gather that a similar point is at the heart of Gartner’s new term citizen data scientist. I am told that the term resonates with at least some enterprises. Read more
|Categories: Business intelligence, Data warehousing, Datameer, Hadoop, Log analysis, Oracle, Platfora, Predictive modeling and advanced analytics, SAS Institute, Software as a Service (SaaS), Tableau Software, Web analytics||2 Comments|
I hear much discussion of shortfalls in analytic technology, especially from companies that want to fill in the gaps. But how much do these gaps actually matter? In many cases, that depends on what the analytic technology is being used for. So let’s think about some different kinds of analytic task, and where they each might most stress today’s available technology.
In separating out the task areas, I’ll focus first on the spectrum “To what extent is this supposed to produce novel insights?” and second on the dimension “To what extent is this supposed to be integrated into a production/operational system?” Issues of latency, algorithmic novelty, etc. can follow after those. In particular, let’s consider the tasks: Read more
|Categories: Business intelligence, Data warehousing, Databricks, Spark and BDAS, Hadoop, Netezza, NoSQL, Predictive modeling and advanced analytics, Tableau Software||1 Comment|
I chatted last night with Ion Stoica, CEO of my client Databricks, for an update both on his company and Spark. Databricks’ actual business is Databricks Cloud, about which I can say:
- Databricks Cloud is:
- Currently running on Amazon only.
- Not dependent on Hadoop.
- Databricks Cloud, despite having a 1.0 version number, is not actually in general availability.
- Even so, there are a non-trivial number of paying customers for Databricks Cloud. (Ion gave me an approximate number, but is keeping it NDA until Spark Summit East.)
- Databricks Cloud gets at data from S3 (most commonly), Redshift, Elastic MapReduce, and perhaps other sources I’m forgetting.
- Databricks Cloud was initially focused on ad-hoc use. A few days ago the capability was added to schedule jobs and so on.
- Unsurprisingly, therefore, Databricks Cloud has been used to date mainly for data exploration/visualization and ETL (Extract/Transform/Load). Visualizations tend to be scripted/programmatic, but there’s also an ODBC driver used for Tableau access and so on.
- Databricks Cloud customers are concentrated (but not exclusively so) in the usual-suspect internet-centric business sectors.
- The low end of the amount of data Databricks Cloud customers are working with is 100s of gigabytes. This isn’t surprising.
- The high end of the amount of data Databricks Cloud customers are working with is petabytes. That did surprise me, and in retrospect I should have pressed for details.
I do not expect all of the above to remain true as Databricks Cloud matures.
Ion also said that Databricks is over 50 people, and has moved its office from Berkeley to San Francisco. He also offered some Spark numbers, such as: Read more
|Categories: Amazon and its cloud, Cloud computing, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Parallelization, Petabyte-scale data management, Predictive modeling and advanced analytics, Software as a Service (SaaS)||6 Comments|
There are numerous ways that technology, now or in the future, can significantly improve personal safety. Three of the biggest areas of application are or will be:
- Crime prevention.
- Vehicle accident prevention.
- Medical emergency prevention and response.
Implications will be dramatic for numerous industries and government activities, including but not limited to law enforcement, automotive manufacturing, infrastructure/construction, health care and insurance. Further, these technologies create a near-certainty that individuals’ movements and status will be electronically monitored in fine detail. Hence their development and eventual deployment constitute a ticking clock toward a deadline for society to decide what to do about personal privacy.
Theoretically, humans aren’t the only potential kind of tyrants. Science fiction author Jack Williamson postulated a depressing nanny-technology in With Folded Hands, the idea for which was later borrowed by the humorous Star Trek episode I, Mudd.
Of these three areas, crime prevention is the furthest along; in particular, sidewalk cameras, license plate cameras and internet snooping are widely deployed around the world. So let’s consider the other two.
Vehicle accident prevention
|Categories: Health care, Predictive modeling and advanced analytics, Public policy, Surveillance and privacy||3 Comments|
I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. But if we abandon any hope that this post could be comprehensive, I can at least say:
1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.
- Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
- Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
- Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.
2. Even so, there’s much room for innovation around data movement and management. I’d start with:
- Product maturity is a huge issue for all the above, and will remain one for years.
- Hadoop and Spark show that application execution engines:
- Have a lot of innovation ahead of them.
- Are tightly entwined with data management, and with data movement as well.
- Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
- There are many issues in storage that can affect data technologies as well, including but not limited to:
- Solid-state (flash or post-flash) vs. spinning disk.
- Networked vs. direct-attached.
- Virtualized vs. identifiable-physical.
- Graph analytics and data management are still confused.
Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.
1. There are many kinds of machine-generated data. Important categories include:
- Web, network and other IT logs.
- Game and mobile app event data.
- CDRs (telecom Call Detail Records).
- “Phone-home” data from large numbers of identical electronic products (for example set-top boxes).
- Sensor network output (for example from a pipeline or other utility network).
- Vehicle telemetry.
- Health care data, in hospitals.
- Digital health data from consumer devices.
- Images from public-safety camera networks.
- Stock tickers (if you regard them as being machine-generated, which I do).
That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.
2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly: Read more
A conversation I have too often with vendors goes something like:
- “That confidential thing you told me is interesting, and wouldn’t harm you if revealed; probably quite the contrary.”
- “Well, I guess we could let you mention a small subset of it.”
- “I’m sorry, that’s not enough to make for an interesting post.”
That was the genesis of some tidbits I recently dropped about WibiData and predictive modeling, especially but not only in the area of experimentation. However, Wibi just reversed course and said it would be OK for me to tell more or less the full story, as long as I note that we’re talking about something that’s still in beta test, with all the limitations (to the product and my information alike) that beta implies.
As you may recall:
- WibiData started out with a rich technology stack …
- … but decided to cast itself as an application company …
- … whose first vertical market is retailing.
With that as background, WibiData’s approach to predictive modeling as of its next release will go something like this: Read more
1. A couple years ago I wrote skeptically about integrating predictive modeling and business intelligence. I’m less skeptical now.
- The predictive experimentation I wrote about over Thanksgiving calls naturally for some BI/dashboarding to monitor how it’s going.
- If you think about Nutonian’s pitch, it can be approximated as “Root-cause analysis so easy a business analyst can do it.” That could be interesting to jump to after BI has turned up anomalies. And it should be pretty easy to whip up a UI for choosing a data set and objective function to model on, since those are both things that the BI tool would know how to get to anyway.
I’ve also heard a couple of ideas about how predictive modeling can support BI. One is via my client Omer Trajman, whose startup ScalingData is still semi-stealthy, but says they’re “working at the intersection of big data and IT operations”. The idea goes something like this:
- Suppose we have lots of logs about lots of things.* Machine learning can help:
- Notice what’s an anomaly.
- Group* together things that seem to be experiencing similar anomalies.
- That can inform a BI-plus interface for a human to figure out what is happening.
Makes sense to me. (Edit: ScalingData subsequently launched, under the name Rocana.)
* The word “cluster” could have been used here in a couple of different ways, so I decided to avoid it altogether.
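The two machine learning steps in that idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not a description of what ScalingData actually does: the host names, metric names and numbers are made up, "anomaly" is reduced to a simple z-score test of the latest value against earlier history, and "similar anomalies" is reduced to hosts sharing exactly the same set of anomalous metrics.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_anomalies(series_by_host, z_threshold=3.0):
    """Map each host to the set of metrics whose latest value is a z-score outlier."""
    anomalies = {}
    for host, metrics in series_by_host.items():
        flagged = set()
        for metric, values in metrics.items():
            history, latest = values[:-1], values[-1]
            if len(history) < 2:
                continue  # too little history to judge
            mu, sigma = mean(history), pstdev(history)
            if sigma == 0:
                continue  # flat history: z-score undefined, skipped in this sketch
            if abs(latest - mu) / sigma > z_threshold:
                flagged.add(metric)
        if flagged:
            anomalies[host] = frozenset(flagged)
    return anomalies


def group_by_signature(anomalies):
    """Group hosts that share the same set of anomalous metrics."""
    groups = defaultdict(list)
    for host, signature in anomalies.items():
        groups[signature].append(host)
    return {signature: sorted(hosts) for signature, hosts in groups.items()}


# Made-up sample data: the last value in each list is the newest observation.
sample = {
    "web1": {"errors": [1, 2, 1, 2, 1, 2, 40], "latency_ms": [100, 102, 98, 101, 99, 100, 101]},
    "web2": {"errors": [0, 1, 0, 1, 0, 1, 35], "latency_ms": [90, 91, 89, 90, 92, 90, 91]},
    "db1": {"errors": [0, 0, 1, 0, 0, 1, 0], "latency_ms": [10, 11, 10, 11, 10, 11, 90]},
    "db2": {"errors": [0, 1, 0, 0, 1, 0, 0], "latency_ms": [10, 11, 10, 11, 10, 11, 10]},
}
groups = group_by_signature(find_anomalies(sample))
```

A BI-plus interface would then show a human the groups rather than the raw logs, which is where the "figure out what is happening" part comes in.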
Finally, I’m hearing a variety of “smart ETL/data preparation” and “we recommend what columns you should join” stories. I don’t know how much machine learning there’s been in those to date, but it’s usually at least on the roadmap to make the systems (yet) smarter in the future. The end benefit is usually to facilitate BI.
2. Discussion of graph DBMS can get confusing. For example: Read more
|Categories: Business intelligence, Greenplum, Hadoop, Hortonworks, Log analysis, Neo Technology and Neo4j, Nutonian, Predictive modeling and advanced analytics, RDF and graphs, WibiData||5 Comments|