I chatted Wednesday night with Darren Wood, the Australia-based lead developer of Objectivity’s Infinite Graph database product. Background includes:
- Objectivity is a profitable, decades-old object-oriented DBMS vendor with about 50 employees.
- Like some other object-oriented DBMSs of its generation, Objectivity is as much a toolkit for building a DBMS as it is a finished DBMS product. Objectivity sales are typically custom deals, in which Objectivity helps with the programming.
- The way Objectivity works is basically:
- You manage objects in memory, in the format of your choice.
- Objectivity bangs them to disk, across a network.
- Objectivity manages the (distributed) pointers to the objects.
- You can, if you choose, hard code exactly which objects are banged to which node.
- Objectivity’s DML for reading data is very different from Objectivity’s DML for writing data. (I think the latter is more like the program code itself, while the former is more like regular DML.)
- The point of Objectivity is not so much to have fast I/O. Rather, it is to minimize the CPU cost of getting the data that comes across the wire into useful form.
- Darren got the idea of putting a generic graph DBMS front-end on Objectivity while doing a relationship analytics project for an Australian intelligence agency.
- Darren redoubled his efforts to sell the project internally at Objectivity after reading what I wrote about relationship analytics back in 2006 or so.
- There is now a 5 or so person team developing Infinite Graph.
- Infinite Graph is just now going out to beta test.
Infinite Graph is an API or language binding on top of Objectivity that:
- Hides a lot of Objectivity’s complexity.
- Is suitable for graph/relationship analytics.
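To make that concrete, here is a hypothetical sketch of what a graph API layered over an object store might look like. This is not Infinite Graph code; every name here is invented to illustrate the general shape of such an API.

```python
# Hypothetical sketch of a graph front-end over an object store.
# All names are invented; this is not Infinite Graph's actual API.

class Node:
    def __init__(self, node_id, **attrs):
        self.id, self.attrs = node_id, attrs
        self.edges = []          # edges stored alongside the node

class Edge:
    def __init__(self, src, dst, **attrs):
        self.src, self.dst, self.attrs = src, dst, attrs

class GraphDB:
    """Front end that hides where objects physically live."""
    def __init__(self):
        self._nodes = {}

    def add_node(self, node_id, **attrs):
        node = Node(node_id, **attrs)
        self._nodes[node_id] = node
        return node

    def add_edge(self, src_id, dst_id, **attrs):
        src, dst = self._nodes[src_id], self._nodes[dst_id]
        edge = Edge(src, dst, **attrs)
        src.edges.append(edge)   # the edge lives with its source node
        return edge

    def neighbors(self, node_id):
        return [e.dst for e in self._nodes[node_id].edges]

g = GraphDB()
g.add_node("alice"); g.add_node("bob")
g.add_edge("alice", "bob", kind="calls")
print([n.id for n in g.neighbors("alice")])   # ['bob']
```

The point of such a wrapper is exactly what the bullets above say: the caller thinks in nodes and edges, while the object store underneath worries about persistence and distribution.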
The main point of the Infinite Graph beta test is to see whether Objectivity got the API right. By way of contrast, Objectivity is still just researching the DBMS optimization side of things. According to Darren, what makes that so hard is that if you partition the graph in some smart way, probably through some kind of costly algorithm to determine “least connectedness,” a bit of additional data can thoroughly invalidate your results. Thus, Darren is focused more on ensuring that performance is good even if data is distributed around the network in annoying ways.
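A toy example may help show why incremental data makes partitioning so fragile. In this sketch (my illustration, not anything from Objectivity), a two-machine split that starts out with zero cross-machine edges becomes a poor one after a handful of new edges arrive.

```python
# Toy illustration: a partition that was optimal for yesterday's graph
# can be invalidated by a little new data.

def edge_cut(edges, partition):
    """Count edges whose endpoints land on different machines."""
    return sum(1 for a, b in edges if partition[a] != partition[b])

edges = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f")]
# Put the two clusters on separate machines: zero cross-machine edges.
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(edge_cut(edges, partition))   # 0

# New data arrives: three edges tying "c" tightly to the other machine.
edges += [("c", "d"), ("c", "e"), ("c", "f")]
print(edge_cut(edges, partition))   # 3 -- the old split is now poor;
# moving "c" to machine 1 would cut only 1 edge instead of 3.
```

Rerunning the expensive partitioning algorithm after every batch of new data is exactly the cost Darren wants to avoid, hence his focus on being fast despite awkward placement.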
One performance win that Infinite Graph seems to get (almost?) for free from being built on top of Objectivity is lots of prefetching. Specifically, graph nodes and their edges are stored together, just like objects and their pointers are in traditional Objectivity — and if a node is retrieved, the nodes it’s connected to might also get retrieved as a background operation, before they’re even needed. More generally, Objectivity has always tried to be fast about traversing pointers, and that is a whole lot like traversing graph edges.
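The prefetching idea can be sketched as follows. This is a simplified single-machine model of my own, not Objectivity’s actual mechanism: fetching a node kicks off background loads of its neighbors into a cache, so the next traversal step is likely a cache hit.

```python
# Simplified sketch of neighbor prefetching (not Objectivity's actual
# mechanism): fetching a node warms the cache with its neighbors in a
# background thread, anticipating the next traversal step.

import threading

ADJACENCY = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}

class PrefetchingStore:
    def __init__(self):
        self.cache = {}
        self.lock = threading.Lock()

    def _load(self, node_id):
        # Stand-in for an expensive disk/network read.
        return {"id": node_id, "neighbors": ADJACENCY[node_id]}

    def _prefetch(self, neighbor_ids):
        for nid in neighbor_ids:
            rec = self._load(nid)
            with self.lock:
                self.cache.setdefault(nid, rec)

    def get(self, node_id):
        with self.lock:
            if node_id in self.cache:
                return self.cache[node_id]   # hit: a prior prefetch paid off
        record = self._load(node_id)
        with self.lock:
            self.cache[node_id] = record
        # Kick off background loads of the connected nodes.
        t = threading.Thread(target=self._prefetch, args=(record["neighbors"],))
        t.start()
        t.join()   # joined here only so the example is deterministic
        return record

store = PrefetchingStore()
store.get("a")
print(sorted(store.cache))   # ['a', 'b', 'c'] -- the neighbors arrived too
```

Because nodes and their edges are stored together, the neighbor IDs come along free with the node itself, which is what makes this kind of speculation cheap.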
As a future, Infinite Graph is looking at ideas from Google’s Pregel. As Darren characterizes it, in Pregel you wrap up information about a graph node and ship it off to another computing node if the next graph node you need is over there. Darren suspects that the extreme form of this strategy would not be ideal. (I gather from Darren that Google has realized the same thing from the get-go.) Instead, he’s pinning his hopes more on smarts about when to do that (costly) shipping, and when to just fetch the information back to the compute node currently being used.
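For readers unfamiliar with the Pregel model, here is a toy single-process simulation of its superstep structure (my illustration, not Google’s implementation): each vertex consumes its incoming messages, updates its value, and sends messages to its neighbors, with a barrier between rounds. This one propagates the maximum value through the graph.

```python
# Toy single-process simulation of Pregel-style supersteps (not Google's
# implementation): vertices consume messages, update state, and message
# their neighbors; the run stops when no messages are in flight.

graph = {"a": ["b"], "b": ["c", "d"], "c": [], "d": ["a"]}
value = {"a": 3, "b": 6, "c": 2, "d": 1}

# Superstep 0: every vertex announces its value to its neighbors.
inbox = {v: [value[u] for u in graph if v in graph[u]] for v in graph}

while any(inbox.values()):
    outbox = {v: [] for v in graph}
    for v, messages in inbox.items():
        best = max(messages, default=value[v])
        if best > value[v]:        # learned a larger value:
            value[v] = best        # adopt it and tell the neighbors
            for w in graph[v]:
                outbox[w].append(best)
    inbox = outbox                 # barrier between supersteps

print(value)   # every vertex converges to the maximum, 6
```

In a real distributed setting, each `outbox` message may cross machine boundaries; the question Darren raises is whether shipping that state is cheaper than pulling the remote node’s data back to where you already are.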
The most interesting part of our discussion, in my opinion, was about applications and application functionality. In a nutshell, Darren seems to think that it’s all about the edges, rather than the nodes themselves. (My words, not his.) In particular:
- Edges are first-class citizens in Infinite Graph, just as nodes are.
- Graphs typically are polluted with lots of insignificant edges. Examples include:
- If you’re tracking people’s telephone traffic, lots of folks call the local pizza parlor. Indeed, it’s common to look for “star” nodes like that, which have very high connectivity, and excise them from the graph to reduce noise.
- Many measures of relationship include minor relationships. Facebook friends? LinkedIn connections? Occasional phone calls? Next door neighbors? All of those can indicate very minor relationships.
- Therefore, in Infinite Graph, edges (can) have weights. Darren says this is a widely-used capability in graph applications. The core reason is to let you distinguish between significant and insignificant edges. Note that these weights can be calculated based on the raw data and stored back into the database.
- In Infinite Graph, edges can also have effectiveness date intervals. E.g., if you live at an address for a certain period, that’s when the edge connecting you to it is valid.
- In general in Infinite Graph, edges can carry arbitrary or at least flexible “qualifier”/attribute information.
- For many applications, the number of possible nodes is fundamentally limited. There are only so many people in the world, so many street addresses, so many telephone numbers, and so on. (There was a time this wasn’t believed to be the case, because timestamping was done at the node rather than edge level. But I find persuasive Darren’s argument that it works better on edges.) Edit: Even so, DARPA is thinking in the billions-of-nodes range.
- Darren is in general agreement with my observation that the “social graph” shouldn’t primarily be regarded as a graph.
- Yes, the paradigmatic examples of intelligence agency graph analytics are telephone or even IP traffic analysis. Nodes can wind up with lots of edges connecting them. Full analysis of the graphs exceeds even the computing capacity available to governments.
- On a happy civil liberties note, Darren observed that Australian intelligence has a lot of red tape restricting them from getting this kind of information. Basically, they can only get chunks of information “on demand”. An awkward side effect of this is that when they do get it, it could be in any number of formats.
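Pulling the edge ideas above together, here is a hypothetical sketch (invented names, not Infinite Graph’s API) of edges that carry weights and validity date intervals, plus the excision of high-degree “star” nodes such as a pizza parlor everyone calls.

```python
# Hypothetical sketch (invented API, not Infinite Graph's) of edges with
# weights and validity date intervals, plus pruning of high-degree
# "star" nodes that add noise rather than signal.

from datetime import date
from collections import defaultdict

# Each edge: (source, target, weight, valid_from, valid_to)
edges = [
    ("alice", "bob",   0.9, date(2009, 1, 1), date(2010, 12, 31)),
    ("alice", "pizza", 0.1, date(2009, 1, 1), date(2010, 12, 31)),
    ("bob",   "pizza", 0.1, date(2009, 1, 1), date(2010, 12, 31)),
    ("carol", "pizza", 0.1, date(2009, 1, 1), date(2010, 12, 31)),
    ("bob",   "carol", 0.7, date(2008, 1, 1), date(2008, 12, 31)),
]

def significant(edges, as_of, min_weight, max_degree):
    """Keep edges valid on `as_of` and above `min_weight`, then drop
    edges touching star nodes whose degree exceeds `max_degree`."""
    live = [e for e in edges
            if e[3] <= as_of <= e[4] and e[2] >= min_weight]
    degree = defaultdict(int)
    for s, t, *_ in live:
        degree[s] += 1
        degree[t] += 1
    stars = {n for n, d in degree.items() if d > max_degree}
    return [e for e in live if e[0] not in stars and e[1] not in stars]

kept = significant(edges, date(2010, 6, 1), min_weight=0.0, max_degree=2)
# The bob--carol edge fails the date filter; the pizza parlor has
# degree 3 and is excised as a star node. Only alice--bob survives.
print(kept)
```

Note that a weight like the 0.9 above need not arrive with the raw data; per the bullet on weights, it can be calculated from the raw data and stored back onto the edge.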