This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition
- Relationship analytics applications
- Analysis of large graphs (this post)
My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.
How big can a graph be? That of course depends on:
- The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
- The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
- The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.
*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.
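To make that arithmetic concrete, here’s a back-of-the-envelope calculation in Python. Every number in it is an assumption pulled from the discussion above, not a measurement:

```python
import math

# Assumptions taken from the discussion above -- guesses, not measurements.
nodes = 10_000_000_000                          # ~10 billion people-ish nodes
bits_per_node_id = math.ceil(math.log2(nodes))  # tokenized node IDs
print(bits_per_node_id)                         # -> 34 bits

edges = nodes * 1_000                           # 1000s of edges per node
bytes_per_edge = 100                            # post-compression guess
total = edges * bytes_per_edge
print(f"{total / 10**15:g} PB")                 # -> 1 PB
```

In other words, even trillions of 100-byte edges land around a petabyte: far too big for single-server RAM, but not for clustered storage.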
The biggest graph-size estimates I’ve gotten are from my clients at YarcData, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:
- An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.
- A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.
YarcData further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the edges-per-node average seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.
Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be:
- Store a table with one row per edge.
- Self-join that table n-1 times, where n is the number of edges in the path or subgraph you’re looking for.
In many cases the cardinality of intermediate result sets would be high, and you’d basically be doing a series of full table scans. Those could take a while.
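To make the naive approach concrete, here’s a minimal sketch using Python’s built-in sqlite3. The table layout, column names, and toy data are my own inventions for illustration:

```python
import sqlite3

# One row per edge, as described above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
con.executemany("INSERT INTO edge VALUES (?, ?)",
                [(1, 2), (2, 3), (2, 4), (3, 4)])

# A path of n = 3 edges requires self-joining the table n-1 = 2 times.
rows = con.execute("""
    SELECT e1.src, e1.dst, e2.dst, e3.dst
    FROM edge e1
    JOIN edge e2 ON e2.src = e1.dst
    JOIN edge e3 ON e3.src = e2.dst
""").fetchall()
print(rows)   # -> [(1, 2, 3, 4)]
```

Each extra edge in the target path adds another self-join, and without a selective starting node each join step can devolve into the full scans just described.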
There are various approaches to dealing with this challenge. For example:
- Graph analysis has been around long enough that much of it has surely been done relationally.
- I wrote about some specific relational strategies for graph analysis five years ago.
- A lot of graph analysis these days is being done in Hadoop (or other MapReduce implementations, notably Aster Data’s SQL-MapReduce).
- Objectivity’s InfiniteGraph and Google’s Pregel emphasize pre-fetching (or pre-shipping) edges that might soon be needed; a toy sketch of Pregel’s vertex-centric model appears after this list.
- YarcData, with its Cray genes, tries to optimize hardware (single RAM image across a cluster, with a whole lot of multithreading) for in-memory Apache Jena performance. Unfortunately, I’m not clear as to which data structure(s) Jena uses.
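For a feel of the vertex-centric model Pregel popularized, here’s a toy single-process sketch that computes hop distances from a source node via supersteps of message passing. The graph data is invented, and this is nothing like Pregel’s actual API:

```python
from collections import defaultdict

# Toy adjacency lists (made-up data). Real systems shard vertices
# across machines; this sketch just shows the superstep/message loop.
graph = {1: [2, 3], 2: [4], 3: [4], 4: []}
dist = {v: float("inf") for v in graph}

messages = {1: [0]}                # seed: source vertex 1 at distance 0
while messages:
    next_messages = defaultdict(list)
    for v, incoming in messages.items():
        best = min(incoming)
        if best < dist[v]:         # vertex updates its own state...
            dist[v] = best
            for nbr in graph[v]:   # ...and messages its out-neighbors
                next_messages[nbr].append(best + 1)
    messages = next_messages       # barrier between supersteps

print(dist)   # -> {1: 0, 2: 1, 3: 1, 4: 2}
```

Once vertices are spread across machines, each of those messages is potentially a network hop, which is where pre-shipping data you’ll soon need comes in.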
When trying to figure out which of these techniques is likely to win in the most demanding cases, I run into the key controversy around analytic graph data management — how successfully can graphs be partitioned? Opinions vary widely, with the correct answers in each case surely depending on:
- The topology of the graph.
- The size of the graph.
- The length of the paths that need to be examined.
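As a tiny illustration of why those three factors matter, consider hash-partitioning a synthetic random graph across a few machines and counting the “cut” edges whose endpoints land on different machines; every cut edge is a potential network hop during traversal. Everything here is made up and only suggestive:

```python
import random

random.seed(0)
k = 4                                  # hypothetical machine count
nodes = range(1_000)
edges = [(random.randrange(1_000), random.randrange(1_000))
         for _ in range(10_000)]       # synthetic random graph

# Assign each node to a machine by hashing its ID, then count cut edges.
part = {v: hash(v) % k for v in nodes}
cut = sum(1 for a, b in edges if part[a] != part[b])
print(f"{cut / len(edges):.0%} of edges cross partitions")  # ~75% here
```

Random placement of a random graph cuts roughly (k-1)/k of the edges; graphs with strong community structure can often do far better under smarter partitioners, which is one reason topology matters so much.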
But in the interest of getting this posted tonight, I’ll leave further discussion of graph partitioning to another time.