Notes on the analysis of large graphs
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition
- Relationship analytics applications
- Analysis of large graphs (this post)
My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.
How big can a graph be? That of course depends on:
- The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
- The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
- The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.
*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.
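The 34-bit figure can be checked with a few lines of arithmetic. This is a minimal sketch, not anything from a vendor: the function names are made up for illustration, and the 100-byte budget is simply the rough figure suggested above.

```python
def bits_needed(distinct_ids: int) -> int:
    """Return the fewest bits that give each of `distinct_ids` items a unique token."""
    if distinct_ids < 1:
        raise ValueError("need at least one item to tokenize")
    # Tokens run from 0 to distinct_ids - 1; bit_length() is exact integer arithmetic.
    return max(1, (distinct_ids - 1).bit_length())


def node_token_bytes_per_triple(node_count: int) -> int:
    """Bytes used by the two node tokens of one (node, edge, node) triple, rounded up."""
    token_bits = 2 * bits_needed(node_count)
    return (token_bits + 7) // 8


NODE_COUNT = 10_000_000_000   # the 10 billion node cap discussed above
TRIPLE_BUDGET_BYTES = 100     # assumed rough per-triple budget, post-compression

node_bits = bits_needed(NODE_COUNT)
token_bytes = node_token_bytes_per_triple(NODE_COUNT)
edge_detail_bytes = TRIPLE_BUDGET_BYTES - token_bytes

print(f"bits per node token: {node_bits}")
print(f"bytes for both node tokens: {token_bytes}")
print(f"bytes left for edge detail: {edge_detail_bytes}")
```

Under those assumptions the two node tokens take about 9 bytes, so nearly all of a 100-byte triple is available for edge properties such as weights and timestamps.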
The biggest graph-size estimates I’ve gotten are from my clients at Yarcdata, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:
- An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.
- A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.
Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the edges/node average seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.
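To see why such graphs don't fit comfortably into RAM on a single server, here is a back-of-envelope sketch. The specific inputs are my assumptions, chosen at the low end of the ranges quoted above (1 billion or 100 million nodes, 1,000 edges per node, 100 bytes per triple); they are not figures supplied by Yarcdata.

```python
BYTES_PER_TRIPLE = 100  # assumed per-triple size, post-compression


def estimate_edges(nodes: int, edges_per_node: int) -> int:
    """Order-of-magnitude edge count; ignores that an edge is shared by two nodes."""
    return nodes * edges_per_node


def estimate_bytes(edges: int, bytes_per_triple: int = BYTES_PER_TRIPLE) -> int:
    """Total storage if every edge is stored as one triple."""
    return edges * bytes_per_triple


def terabytes(num_bytes: int) -> float:
    """Convert bytes to decimal terabytes."""
    return num_bytes / 10**12


# Hypothetical scenarios, low end of the ranges in the text.
scenarios = {
    "intelligence": (1_000_000_000, 1_000),
    "telecom": (100_000_000, 1_000),
}

for name, (nodes, edges_per_node) in scenarios.items():
    edges = estimate_edges(nodes, edges_per_node)
    size_tb = terabytes(estimate_bytes(edges))
    print(f"{name}: {edges:,} edges, roughly {size_tb:,.0f} TB")
```

Even at these low-end assumptions the totals come out in the tens to hundreds of terabytes, and the real numbers could be several times higher.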
Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be:
| Categories: Analytic technologies, Aster Data, Data models and architecture, Hadoop, Health care, MapReduce, RDF and graphs, Scientific research, Telecommunications, Yarcdata and Cray | 12 Comments |
We’re back
Our blogs have been moved to a new hosting company, and everything should be working. Ditto our business site.
If you notice any counterexamples, please be so kind as to ping me.
| Categories: About this blog | Leave a Comment |
Comments are briefly being turned off
I need to move web hosts, and am initiating the process now. This involves a large file copy, a recopy of same, and a variety of manual steps. So until the process is complete, updating site databases is a bad idea.
A comment is, of course, an update. So we’re closing off comments across DBMS 2, Strategic Messaging, Text Technologies, Software Memories, and the Monash Report. I hope to turn them back on shortly.
The sites should remain readable all the way through — unless, of course, there are more hosting company outages.
| Categories: About this blog | Leave a Comment |
Site reliability has been ghastly
Unfortunately, we’ve had serious site outages over the past few days, as well as an increased frequency of shorter-term problems. My ordinarily excellent hosting company is going through a bad stretch, and I’ll have to move away from them. (As usual, I’ll rely on http://www.webhostingtalk.com for recommendations.)
When I pull the trigger on the move, there will be a short period when I turn off comments across all my blogs. I’ll post again here to announce when that is happening.
I apologize for the inconvenience.
| Categories: About this blog | 2 Comments |
Relationship analytics application notes
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition
- Relationship analytics applications (this post)
- Analysis of large graphs
In my recent post on graph data models, I cited various application categories for relationship analytics. For most applications, it’s hard to get a lot of details. Reasons include:
- In adversarial domains such as national security, anti-fraud, or search engine ranking, it’s natural to keep algorithms secret.
- The big exception, influencer analytics (aka social network analysis), is obscured by a major hype/reality gap. (So, come to think of it, is a lot of other predictive modeling.)
Even so, it’s fairly safe to say:
- Much of relationship analytics is about subgraph pattern matching.
- Much of relationship analytics is about identifying subgraph patterns that are predictive of certain characteristics or outcomes.
- An important kind of relationship analytics challenge is to identify influential individuals.
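For a concrete feel of subgraph pattern matching, here is about the simplest case there is: finding every triangle (three mutually connected nodes) in an undirected graph. It is only a toy sketch with made-up names and data. Real relationship analytics would match richer patterns, and at the scales discussed in this series it would not hold everything in in-memory Python sets.

```python
from itertools import combinations


def find_triangles(edges):
    """Return every set of three mutually connected nodes, as sorted tuples."""
    # Build an undirected adjacency map: node -> set of neighbors.
    adjacency = {}
    for a, b in edges:
        if a == b:
            continue  # a self-loop cannot be part of a triangle
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    triangles = set()
    for node, neighbors in adjacency.items():
        # Two neighbors of the same node that are also linked close a triangle.
        for u, v in combinations(sorted(neighbors), 2):
            if v in adjacency[u]:
                triangles.add(tuple(sorted((node, u, v))))
    return sorted(triangles)


# Hypothetical call records: who has talked to whom.
calls = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(find_triangles(calls))
```

Each triangle is discovered once per member node, which is why the results are collected in a set before being returned.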
| Categories: Predictive modeling and advanced analytics, RDF and graphs, Telecommunications | 5 Comments |
Terminology: Relationship analytics
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition (this post)
- Relationship analytics applications
- Analysis of large graphs
In late 2005, I encountered a company called Cogito that was using a graphical data manager to analyze relationships. They called this “relational analytics”, which I thought was a terrible name for something that they were trying to claim should NOT be done in a relational DBMS. On the spot, I coined relationship analytics as an alternative. A business relationship ensued, which included a short white paper. Cogito didn’t do so well, however, and for a while the term “relationship analytics” faltered too. But recently it’s made a bit of a comeback, having been adopted by Objectivity, QlikTech, Yarcdata and others.
“Relationship analytics” is not a perfect name, both because it’s longish and because it might over-connote a social-network focus. But then, no other term would be perfect either. So we might as well stick with it.
In that case, “relationship analytics” could use an actual definition, preferably one a little heftier than just:
Analytics on graphs.
| Categories: Cogito and 7 Degrees, Objectivity and Infinite Graph, QlikTech and QlikView, RDF and graphs, Yarcdata and Cray | 6 Comments |
Notes on graph data management
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics (this post)
- Relationship analytics definition
- Relationship analytics applications
- Analysis of large graphs
Interest in graph data models keeps increasing. But it’s tough to discuss them with any generality, because “graph data model” encompasses so many different things. Indeed, just as all data structures can be mapped to relational ones, it is also the case that all data structures can be mapped to graphs.
Formally, a graph is a collection of (node, edge, node) triples. In the simplest case, the edge has no properties other than existence or maybe direction, and the triple can be reduced to a (node, node) pair, unordered or ordered as the case may be. It is common, however, for edges to encapsulate additional properties, the canonical examples of which are:
- Weight. Usually, the intuition here is that the weight is a number indicating the strength of the connection. This is generally derived from more basic data.
- Kind. The edge can encapsulate one or more descriptors indicating the kind of relationship between the nodes.
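The triple view translates fairly directly into code. What follows is a minimal sketch under my own assumptions: the class and field names are invented for illustration, and a real graph data manager would store and index triples very differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    """One (node, edge, node) triple, with optional edge properties."""
    source: str
    target: str
    weight: Optional[float] = None  # strength of the connection, if known
    kind: Optional[str] = None      # descriptor of the relationship, if any
    directed: bool = True


class Graph:
    """A graph held as a list of triples plus a per-node lookup table."""

    def __init__(self):
        self.edges = []
        self._adjacent = defaultdict(list)

    def add_edge(self, edge: Edge) -> None:
        self.edges.append(edge)
        self._adjacent[edge.source].append(edge)
        # An undirected edge can be followed from either end.
        if not edge.directed and edge.target != edge.source:
            self._adjacent[edge.target].append(edge)

    def neighbors(self, node: str, kind: Optional[str] = None) -> list:
        """Nodes reachable from `node` in one hop, optionally filtered by kind."""
        found = []
        for edge in self._adjacent.get(node, []):
            if kind is not None and edge.kind != kind:
                continue
            found.append(edge.target if edge.source == node else edge.source)
        return found


graph = Graph()
graph.add_edge(Edge("alice", "bob", weight=3.0, kind="calls"))
graph.add_edge(Edge("bob", "carol", kind="emails", directed=False))
print(graph.neighbors("bob"))
```

With the weight and kind fields left at None and directed set as needed, an Edge reduces to the plain ordered or unordered (node, node) pair described above.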
Many of the graph examples I can think of fit into four groups:
| Categories: RDF and graphs, Telecommunications, Workday | 6 Comments |
Big Data hype?
A reporter wrote in to ask whether investor interest in “Big Data” was justified or hype. (More precisely, that’s how I reinterpreted his questions.) His examples were Splunk’s IPO, Teradata’s stock price increase, and Birst’s financing. In a nutshell:
- My comments, lightly edited, are in plain text below.
- Further thoughts are in italics.
- Of course I also linked him to my post “Big Data” has jumped the shark.
- Overall, my responses boil down to “Of course there’s some hype.”
1. A great example of hype is that anybody is calling Birst a “Big Data” or “Big Data analytics” company. If anything, Birst is a “little data” analytics company that claims, as a differentiating feature, that it can handle ordinary-sized data sets as well.
| Categories: Business intelligence, Data warehousing, IBM and DB2, Microsoft and SQL*Server, Oracle, Splunk | 13 Comments |
Thinking about market segments
It is a reasonable (over)simplification to say that my business boils down to:
- Advising vendors what/how to sell.
- Advising users what/how to buy.
One complication that commonly creeps in is that different groups of users have different buying practices and technology needs. Usually, I nod to that point in passing, perhaps by listing different application areas for a company or product. But now let’s address it head on. Whether or not you care about the particulars, I hope the sheer length of this post reminds you that there are many different market segments out there.
Last June I wrote:
In almost any IT decision, there are a number of environmental constraints that need to be acknowledged. Organizations may have standard vendors, favored vendors, or simply vendors who give them particularly deep discounts. Legacy systems are in place, application and system alike, and may or may not be open to replacement. Enterprises may have on-premise or off-premise preferences; SaaS (Software as a Service) vendors probably have multitenancy concerns. Your organization can determine which aspects of your system you’d ideally like to see be tightly integrated with each other, and which you’d prefer to keep only loosely coupled. You may have biases for or against open-source software. You may be pro- or anti-appliance. Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as budget, timeframe, security, or trained personnel.
I’d further say that it matters whether the buyer:
- Is a large central IT organization.
- Is the well-staffed IT organization of a particular business department.
- Is a small, frazzled IT organization.
- Has strong engineering or technical skills, but less in the way of IT specialists.
- Is trying to skate by without much technical knowledge of any kind.
Now let’s map those considerations (and others) to some specific market segments.
Notes on the Hadoop and HBase markets
I visited my clients at Cloudera and Hortonworks last week, along with scads of other companies. A few of the takeaways were:
- Cloudera now has 220 employees.
- Cloudera now has over 100 subscription customers.
- Over the past year, Cloudera has more than doubled in size by every reasonable metric.
- Over half of Cloudera’s customers use HBase, vs. a figure of 18+ last July.
- Omer Trajman — who by the way has made a long-overdue official move into technical marketing — can no longer keep count of how many petabyte-scale Hadoop clusters Cloudera supports.
- Cloudera gets the majority of its revenue from subscriptions. However, professional services and training continue to be big businesses too.
- Cloudera has trained over 12,000 people.
- Hortonworks is training people too.
- Hortonworks now has 70 employees, and plans to have 100 or so by the end of this quarter.
- A number of those Hortonworks employees are executives who come from seriously profit-oriented backgrounds. Hortonworks clearly has capitalist intentions.
- Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
- There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.
- I met with 4 young-company clients who I regard as building vertical analytic stacks (WibiData, MarketShare, MetaMarkets, and ClearStory). All 4 are heavily dependent on Hadoop. (The same isn’t as true of older companies who built out a lot of technology before Hadoop was invented.)
- There should be more HBase information at HBaseCon on May 22.
- If MapR still has momentum, nobody I talked with has noticed.
