Discussion of how database and related technologies are used to support scientific research. Related subjects include:
As planned, I’m getting more active in predictive modeling. Anyhow …
1. I still believe most of what I said in a July, 2013 predictive modeling catch-all post. However, I haven’t heard as much subsequently about Ayasdi as I had expected to.
2. The most controversial part of that post was probably the claim:
I think the predictive modeling state of the art has become:
- Cluster in some way.
- Model separately on each cluster.
- It is always possible to instead go with a single model formally.
- A lot of people think accuracy, ease-of-use, or both are better served by a true single-model approach.
- Conversely, if you have a single model that’s pretty good, it’s natural to look at the subset of the data for which it works poorly and examine that first. Voila! You’ve just done a kind of clustering.
3. Nutonian is now a client. I just had my first meeting with them this week. To a first approximation, they’re somewhat like KXEN (sophisticated math, non-linear models, ease of modeling, quasi-automagic feature selection), but with differences that start: Read more
|Categories: Ayasdi, Databricks, Spark and BDAS, Log analysis, Nutonian, Predictive modeling and advanced analytics, Revolution Analytics, Scientific research, Web analytics||6 Comments|
We all tend to assume that data is a great and glorious asset. How solid is this assumption?
- Yes, data is one of the most proprietary assets an enterprise can have. Any of the Goldman Sachs big three* — people, capital, and reputation — are easier to lose or imitate than data.
- In many cases, however, data’s value diminishes quickly.
- Determining the value derived from owning, analyzing and using data is often tricky — but not always. Examples where data’s value is pretty clear start with:
- Industries which long have had large data-gathering research budgets, in areas such as clinical trials or seismology.
- Industries that can calculate the return on mass marketing programs, such as internet advertising or its snail-mail predecessors.
*”Our assets are our people, capital and reputation. If any of these is ever diminished, the last is the most difficult to restore.” I love that motto, even if Goldman Sachs itself eventually stopped living up to it. If nothing else, my own business depends primarily on my reputation and information.
This all raises the idea – if you think data is so valuable, maybe you should get more of it. Areas in which enterprises have made significant and/or successful investments in data acquisition include: Read more
|Categories: Data mart outsourcing, eBay, Health care, Investment research and trading, Log analysis, Scientific research, Text, Web analytics||7 Comments|
IBM excels at game technology, most famously in Deep Blue (chess) and Watson (Jeopardy!). But except at the chip level — PowerPC — IBM hasn’t accomplished much at game/real world crossover. And so I suspect the Watson hype is far overblown.
I believe that for two main reasons. First, whenever IBM talks about big initiatives like Watson, it winds up bundling a bunch of dissimilar things together and claiming they’re a seamless whole. Second, some core Watson claims are eerily similar to artificial intelligence (AI) over-hype three or more decades past. For example, the leukemia treatment advisor that is being hopefully built in Watson now sounds a lot like MYCIN from the early 1970s, and the idea of collecting a lot of tidbits of information sounds a lot like the Cyc project. And by the way:
- MYCIN led to E-MYCIN, which led to the company Teknowledge, which raised a lot of money* but now has almost faded from memory.
- Cyc is connected to the computer science community’s standard unit of bogosity.
I haven’t done a notes/link/comments post for a while. Time for a little catch-up.
1. MySQL now has a memcached integration story. I haven’t checked the details. The MySQL team is pretty hard to talk with, due to the heavy-handedness of Oracle’s analyst relations.
2. The Large Hadron Collider offers some serious numbers, including:
- 1 petabyte/second.
- 6 x 109 collisions/second.
- Only 1 in 1013 collision records kept (which I guess knocks things down to a 100 byte/second average, from the standpoint of persistent storage).
- Real-time filtering by a cluster of several thousand machines, over a 25 nanosecond period.
3. One application area we don’t talk about much for analytic technologies is education. However: Read more
|Categories: Cache, memcached, Memory-centric data management, MySQL, Open source, Petabyte-scale data management, RDF and graphs, Scientific research||Leave a Comment|
My recent post on broadening the usefulness of statistics presupposed two things about the statistical sophistication of business intelligence tool users:
- It varies a lot.
- In many cases, it isn’t be very high.
Let me now say a little more on the subject. My basic message is — people’s facility with statistics is extremely difficult to predict.
If you DO have to make a point estimate, however, you could do worse than just putting quotation marks around the last four words of that sentence …
Suppose we measure people’s statistical understanding on a 5-point scale:
- People who haven’t clue what a p-value is.
- People who think a p-value of .05 signifies a 95% chance of truth.
- People who know better than that, but who still think that “statistically significant” is pretty close to the same as “true”.
- People who know better yet, but aren’t fluent in using statistical techniques correctly.
- People who are fluent in statistics.
Just knowing somebody’s job description, can you confidently predict their ranking to within, say, +/- 1 point? I suggest you can’t. People differ wildly in general numeracy and in specific statistical knowledge.
Even our guesses about average knowledge may be off, not least because education is changing things. Read more
Ron Pressler of Parallel Universe/SpaceBase pinged me about a data grid product he was open sourcing, called Galaxy. The idea is that a distributed RAM grid will allocate data, not randomly or via consistent hashing, but rather via a locality-sensitive approach. Notes include:
- The original technology was developed to track moving objects on behalf of the Israeli Air Force.
- The commercial product is focused on MMO (Massively MultiPlayer Online) games (or virtual worlds).
- The underpinnings are being open sourced.
- Ron suggests that, among other use cases, Galaxy might work well for graphs.
- Ron argues that one benefit is that when lots of things cluster together — e.g. characters in a game — there’s a natural way to split them elastically (shrink the radius for proximity).
- The design philosophy seems to be to adapt as many ideas as possible from the way CPUs manage (multiple levels of) RAM cache.
|Categories: Cache, Clustering, Games and virtual worlds, GIS and geospatial, Open source, RDF and graphs, Scientific research, Streaming and complex event processing (CEP)||2 Comments|
There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are …
… all that bloody important, then probably people have already been making do to get it done as best they can, even in an inferior way.
Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.
Even so, I have two questions in my inbox that boil down to “What are the coolest or most significant analytics stories out there?” So let’s round up some of what I know. Read more
|Categories: Analytic technologies, Google, Health care, Investment research and trading, Predictive modeling and advanced analytics, Scientific research, Telecommunications, Web analytics||6 Comments|
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition
- Relationship analytics applications
- Analysis of large graphs (this post)
My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.
How big can a graph be? That of course depends on:
- The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
- The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
- The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.
*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.
The biggest graph-size estimates I’ve gotten are from my clients at Yarcdata, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:
- An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.
- A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.
Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the nodes/edge average seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.
Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be: Read more
|Categories: Analytic technologies, Aster Data, Data models and architecture, Hadoop, Health care, MapReduce, RDF and graphs, Scientific research, Telecommunications, Yarcdata and Cray||20 Comments|
MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:
- More-of-the-same in line with MarkLogic’s core positioning.
- A new bi-directional Hadoop connector.
- A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post.
Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called Isys. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.
*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.
MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:
- MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of the MarkLogic deck) …
- … which has been optimized from the getgo for poly-structured data.
- MarkLogic can and does scale out to handle large amounts of data.
- MarkLogic is a general-purpose DBMS, suitable for both short-request and analytic tasks.
- MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about derived data).
- MarkLogic often plays the role of a content assembler and/or search engine, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.
Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.
|Categories: Hadoop, Market share and customer counts, MarkLogic, Scientific research, Solid-state memory, Structured documents, Text||1 Comment|
As Jacek Becla explained:
- Academic scientists like their software to be open source, for reasons that include both free-like-speech and free-like-beer.
- What’s more, they like their software to be dead-simple to administer and use, since they often lack the dedicated human resources for anything else.
Even so, I think that academic researchers, in the natural and social sciences alike, commonly overlook the wealth of commercial software that could help them in their efforts.
I further think that the commercial software industry could do a better job of exposing its work to academics, where by “expose” I mean:
- Give your stuff to academics for free.
- Call their attention to your free offering.
Reasons to do so include:
- Public benefit. Scientific research is important.
- Training future customers. There’s huge academic/commercial crossover, especially as students join the for-profit workforce.
|Categories: Business intelligence, Data warehousing, Infobright, Petabyte-scale data management, Predictive modeling and advanced analytics, Scientific research||7 Comments|