From time to time, I hear of regulatory requirements to retain, analyze, and/or protect data in various ways. It’s hard to get a comprehensive picture of these, as they vary both by industry and jurisdiction; so I generally let such compliance issues slide. Still, perhaps I should use one post to pull together what is surely a very partial list.
Most such compliance requirements have one of two emphases: Either you need to keep your customers’ data safe against misuse, or else you’re supposed to supply information to government authorities. From a data management and analysis standpoint, the former area mainly boils down to:
- Information security. This can include access control, encryption, masking, auditing, and more.
- Keeping data in an approved geographical area. (E.g., its country of origin.) This seems to be one of the three big drivers for multi-data-center processing (along with latency and disaster recovery), and hence influences numerous users’ choices in areas such as clustering and replication.
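As a purely illustrative aside on that second point, here is a minimal Python sketch of routing each record to a data center in its country of origin. The region map, field names, and function are hypothetical assumptions of mine, not anything mandated by a particular regulation.

```python
# Hypothetical sketch: keep each record in an approved geographical area.
# The region map and record format below are illustrative assumptions only.

APPROVED_REGIONS = {
    "DE": "eu-frankfurt",   # e.g., German data stays in an EU data center
    "FR": "eu-paris",
    "US": "us-east",
}

def choose_data_center(record: dict) -> str:
    """Pick a data center based on the record's country of origin."""
    country = record.get("country_of_origin")
    try:
        return APPROVED_REGIONS[country]
    except KeyError:
        raise ValueError(f"No approved data center for country {country!r}")

if __name__ == "__main__":
    print(choose_data_center({"id": 42, "country_of_origin": "DE"}))  # eu-frankfurt
```

In practice this is less about application code than about where clustering and replication place the data, which is why it shows up as a multi-data-center driver.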
The latter, however, has numerous aspects.
First, there are many purposes for the data retention and analysis, including but by no means limited to: Read more
Cray’s strategy these days seems to be:
- Move forward with the classic supercomputer business.
- Diversify into related areas.
At the moment, the main diversifications are:
- Boxes that are like supercomputers, but at a lower price point.
- “(Big) data”.
The last of these is what Cray subsidiary Yarcdata is all about. Read more
There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For, as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are all that bloody important, then people have probably already been making do, getting the job done as best they can, even if in an inferior way.
Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.
Even so, I have two questions in my inbox that boil down to “What are the coolest or most significant analytics stories out there?” So let’s round up some of what I know. Read more
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition
- Relationship analytics applications
- Analysis of large graphs (this post)
My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.
How big can a graph be? That of course depends on:
- The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
- The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
- The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.
*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.
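To make that footnote concrete, here is a quick back-of-envelope check in Python. The 10-billion-node cap and the 100-byte triple budget are the rough figures from this post, not measurements.

```python
import math

NODES = 10_000_000_000  # ~10 billion nodes, the rough cap suggested above

# Bits needed to give every node a distinct token.
bits_per_node_id = math.ceil(math.log2(NODES))            # 34 bits

# A (node, edge, node) triple needs two node IDs (~9 bytes together),
# leaving most of a 100-byte post-compression budget for edge attributes
# such as weights and timestamps, or a pointer to detail stored elsewhere.
bytes_for_two_ids = math.ceil(2 * bits_per_node_id / 8)   # 9 bytes
attribute_budget = 100 - bytes_for_two_ids                # ~91 bytes

print(bits_per_node_id, bytes_for_two_ids, attribute_budget)
```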
The biggest graph-size estimates I’ve gotten are from my clients at Yarcdata, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:
- An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.
- A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.
Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the edges/node average seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.
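Combining those per-node estimates with the ~100-byte triple figure above gives a rough sense of scale. This is a hedged back-of-envelope sketch; the node counts are placeholders in the ranges quoted in this post, not real data.

```python
# Rough sizing of the scenarios above, using the post's own ballpark figures.
TRIPLE_BYTES = 100  # rough post-compression size of a (node, edge, node) triple

scenarios = {
    "intelligence": {"nodes": 1_000_000_000, "edges_per_node": 1_000},
    "telecom":      {"nodes": 100_000_000,   "edges_per_node": 1_000},
}

for name, s in scenarios.items():
    # Simple multiplication, as in the post; an undirected graph would halve this.
    edges = s["nodes"] * s["edges_per_node"]
    terabytes = edges * TRIPLE_BYTES / 1e12
    print(f"{name}: ~{edges:.1e} edges, ~{terabytes:,.0f} TB of triples")
```

At these sizes, even 100 bytes per edge means tens to hundreds of terabytes of raw triples, which is why fitting such graphs comfortably into a single server’s RAM is doubtful.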
Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be: Read more
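The excerpt is cut off here, so the following is not the post’s own continuation, just a hedged illustration of what a naive relational approach to path-finding typically looks like: store the edges in one table and chase each extra hop with another self-join. The schema and data are assumptions of mine, shown via Python’s built-in sqlite3.

```python
import sqlite3

# Illustrative only: a single edges table, with paths found via repeated self-joins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a','b'), ('b','c'), ('c','d'), ('a','c');
""")

# All 3-hop paths starting at node 'a'. Every additional hop is another
# self-join over the (potentially enormous) edges table, which is the
# scalability problem with doing this naively over 100s of billions of edges.
paths = conn.execute("""
    SELECT e1.src, e1.dst, e2.dst, e3.dst
    FROM edges e1
    JOIN edges e2 ON e2.src = e1.dst
    JOIN edges e3 ON e3.src = e2.dst
    WHERE e1.src = 'a'
""").fetchall()

print(paths)  # [('a', 'b', 'c', 'd')]
```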
I visited California recently, and chatted with numerous companies involved in Hadoop — Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I’ll defer further Hadoop technical discussions for now — I aim to resume them later this month — but that still leaves some other issues to discuss, namely adoption and partnering.
The total number of enterprises in the world paying subscription and license fees that they would regard as being for “Hadoop or something Hadoop-related” probably is not much over 100 right now, but I’d expect to see pretty rapid growth. Beyond that, let’s divide customers into three groups:
- Internet businesses.
- Traditional enterprises’ internet operations.
- Traditional enterprises’ other operations.
Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of machine-generated data, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play — web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it’s one area of scientific research that actually enjoys fat for-profit research budgets.
I wasn’t too impressed when I spoke with Revolution Analytics at the time of its relaunch last year. But a conversation Thursday evening was much clearer. And I even learned some cool stuff about general predictive modeling trends (see the bottom of this post).
Revolution Analytics business and business model highlights include:
- Revolution Analytics is an open-core vendor built around the R language. That is, Revolution Analytics offers proprietary code and support, with subscription pricing, that help in the use of open source software.
- Unlike most open-core vendors I can think of, Revolution Analytics takes little responsibility for the actual open source part. Some “grants” for developing certain open source R pieces seem to be the main exception. While this has caused some hard feelings, I don’t have an accurate sense for their scope or severity.
- Revolution Analytics also sells a single-user/workstation version of its product, freely admitting that this is mainly a lead generation strategy or, in my lingo, a “break-even leader.”
- Revolution Analytics boasts around 100 customers, split about 70-30 between the workstation seeding stuff and the real server product.
- Revolution Analytics has “about” 37 employees. Headquarters are at 101 University Avenue (do I have to say in what city?). There are also a development office in Seattle and a sales office in New York.
- Revolution Analytics’ pricing is by size of server. “Small” servers — i.e. up to 12 cores — start at $25K/year.
- Unsurprisingly, adoption is more alongside SAS et al. than rip-and-replace.
This post is the first of a series. The second one delves into the technology behind the most serious electronic privacy threats.
The privacy discussion has gotten more active, and more complicated as well. A year ago, I still struggled to get people to pay attention to privacy concerns at all, at least in the United States, with my first public breakthrough coming at the end of January. But much has changed since then.
On the commercial side, Facebook modified its privacy policies, garnering great press attention and an intense user backlash, leading to a quick partial retreat. The Wall Street Journal then launched a long series of articles — 13 so far — recounting multiple kinds of privacy threats. Other media joined in, from Forbes to CNet. Apparently as a result, various forms of US government rule-making to inhibit advertising-related tracking have been proposed.
In the US, the government had a lively year as well. The Transportation Security Administration (TSA) rolled out what have been dubbed “porn scanners,” and backed them up with “enhanced patdowns.” For somebody who is, for example, female, young, a sex abuse survivor, and/or a follower of certain religions, those can be highly unpleasant, if not traumatic. Meanwhile, the Wikileaks/Cablegate events have spawned a government reaction whose scope is only beginning to be seen. A couple of “highlights” so far are some very nasty laptop seizures, and the recent demand for information on over 600,000 Twitter accounts. (Christopher Soghoian provided a detailed, nuanced legal analysis of same.)
At this point, it’s fair to say there are at least six different kinds of legitimate privacy fear. Read more
Internet privacy issues are getting more and more attention. Frankly, I think we’re getting past the point where the only big risk is loss of liberty. More and more, the risk of an excessive backlash is upon us as well. (In the medical area, I’d say it’s already more than a risk — it’s a life-wrecking reality. But now the problem is poised to become more widespread.) Read more
I’ve previously ranted about the medical information problems in connection with my father’s care at Friendship Village of Dublin and Riverside Methodist Hospital (among others). Well, they’re getting worse. Read more