July 5, 2012

Introduction to Neo Technology and Neo4j

I’ve been talking some with the Neo Technology/Neo4j guys, including Emil Eifrem (CEO/cofounder), Johan Svensson (CTO/cofounder), and Philip Rathle (Senior Director of Products). Basics include:

Numbers and historical facts include:

*I forgot to ask why the paying/production ratio was so low, but my guesses start with:

I also forgot to ask if there were any OEM users involved.

If we look at my basic list of graph data model application areas, Neo4j seems to be involved in most of what you would think. Exceptions include:

To scope what kind of databases Neo4j can or can’t handle, it may be helpful to note that:

Neo4j is built around pointers and linked lists. The record for an edge, aka relationship, consists of:

(As you might imagine, any of those pointers could conceivably be null.) The property list is, you guessed it, another doubly linked list, in this case of (name,value) pairs. Similarly, I believe that a node record contains a node ID, a pointer to one edge leading from the node, a pointer to one edge leading to the node, and a pointer to a property list.
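
For readers who think better in code, here is a minimal Python sketch of the node record and property chain as described above. It is an illustration only, not Neo4j's actual storage format: the class names, field names, and the use of `None` as the null pointer are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional, Tuple


@dataclass
class PropertyRecord:
    """One (name, value) pair in a doubly linked property chain (hypothetical layout)."""
    name: str
    value: object
    prev_id: Optional[int] = None  # None plays the role of a null pointer
    next_id: Optional[int] = None


@dataclass
class NodeRecord:
    """Node record with the four fields described in the text (names are hypothetical)."""
    node_id: int
    first_out_edge: Optional[int] = None   # one edge leading from the node
    first_in_edge: Optional[int] = None    # one edge leading to the node
    first_property: Optional[int] = None   # head of the property chain


def iter_properties(
    node: NodeRecord, store: Dict[int, PropertyRecord]
) -> Iterator[Tuple[str, object]]:
    """Walk the property chain by following next pointers until a null is hit."""
    prop_id = node.first_property
    while prop_id is not None:
        record = store[prop_id]
        yield record.name, record.value
        prop_id = record.next_id


# A two-element property chain hanging off one node.
property_store = {
    0: PropertyRecord("name", "Alice", prev_id=None, next_id=1),
    1: PropertyRecord("city", "Malmo", prev_id=0, next_id=None),
}
alice = NodeRecord(node_id=7, first_property=0)
alice_properties = dict(iter_properties(alice, property_store))
```

Walking the chain from its head touches every property in turn, which is the in-code analogue of "fetch the first property and the rest come along."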

The physical data retrieval story in Neo4j starts:

An exception to the “pointers are supposed to be in memory” rule occurs when property lists are kept on disk — but when you fetch the first property in the chain, all the rest are retrieved too, placing their pointers in memory from that time on.

Neo4j records are (almost) always fixed length, so they can be found just from offset calculations. Notes on that include:
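
To make the offset arithmetic concrete, here is a small Python sketch of a fixed-length record file. The 32-byte layout (four signed 64-bit integers), the `-1` null marker, and the function names are all assumptions chosen for illustration; the real Neo4j record sizes and fields are not given above.

```python
import io
import struct

# Hypothetical layout: four signed 64-bit integers per node record
# (node ID, first outgoing edge, first incoming edge, first property).
NODE_RECORD = struct.Struct("<qqqq")
NULL = -1  # stand-in for a null pointer


def record_offset(record_id: int) -> int:
    """With fixed-length records, the byte offset is just ID times record size."""
    return record_id * NODE_RECORD.size


def write_node(f, record_id: int, first_out: int, first_in: int, first_prop: int) -> None:
    f.seek(record_offset(record_id))
    f.write(NODE_RECORD.pack(record_id, first_out, first_in, first_prop))


def read_node(f, record_id: int):
    """Fetch a record directly by offset, with no index lookup or scan."""
    f.seek(record_offset(record_id))
    return NODE_RECORD.unpack(f.read(NODE_RECORD.size))


# An in-memory file stands in for the on-disk store.
store = io.BytesIO()
for node_id in range(3):
    write_node(store, node_id, NULL, NULL, NULL)
write_node(store, 1, 42, NULL, 5)  # overwrite record 1 in place

node_1 = read_node(store, 1)
```

The point of the design is that locating a record costs one multiplication and one seek, regardless of how many records the file holds.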

Indexes do play a limited role, in determining at which node(s) to start the pointer chase. Notes on Neo4j indexing include:
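
As a rough illustration of that division of labor, the following Python sketch uses an index exactly once, to find the starting node, and then reaches everything else by following edges. The dictionaries standing in for the index and the edge store, and the function name, are hypothetical; this is not how Neo4j or Lucene is actually invoked.

```python
from collections import deque

# Hypothetical index: property value -> ID of the node to start from.
name_index = {"Alice": 0, "Bob": 1}

# Hypothetical edge store: node ID -> IDs of nodes one edge away.
out_edges = {0: [1, 2], 1: [3], 2: [3], 3: []}


def reachable_within(start_name: str, max_hops: int) -> set:
    """Find the start node via the index, then chase pointers breadth-first."""
    start = name_index[start_name]  # the only index lookup in the whole query
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the requested path length
        for neighbor in out_edges[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen


one_hop = reachable_within("Alice", 1)
two_hops = reachable_within("Alice", 2)
```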

You can get at Neo4j via either its “Cypher” declarative query language or an older Java API.

Finally, there’s Neo4j’s durability story — it’s just full durability/ACID, with no tunable durability or anything like that. Neo4j doesn’t seem to feel any great rush about sending a write all the way to the database in persistent storage; but there’s also an update log, and the write doesn’t get acknowledged until that log has been flushed to disk.
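
The "acknowledge only after the log is flushed" pattern can be sketched in a few lines of Python. This is a generic write-ahead-log illustration under my own assumptions (the `TinyStore` class, the `key=value` log format, and the `replay` helper are all made up for the example), not a description of Neo4j's actual log implementation.

```python
import os
import tempfile


class TinyStore:
    """Toy key-value store that logs every update before acknowledging it."""

    def __init__(self, log_path: str):
        self._log = open(log_path, "ab")
        self._data = {}  # in-memory state; persisting the main store is deferred

    def put(self, key: str, value: str) -> bool:
        record = f"{key}={value}\n".encode("utf-8")
        self._log.write(record)
        self._log.flush()             # push Python's buffer to the OS
        os.fsync(self._log.fileno())  # ask the OS to push it to disk
        self._data[key] = value       # main store can be updated lazily
        return True                   # acknowledge only after the fsync

    def close(self) -> None:
        self._log.close()


def replay(log_path: str) -> dict:
    """Rebuild state from the log, as a recovery step would after a crash."""
    recovered = {}
    with open(log_path, "rb") as log:
        for line in log:
            key, _, value = line.decode("utf-8").rstrip("\n").partition("=")
            recovered[key] = value
    return recovered


log_path = os.path.join(tempfile.mkdtemp(), "update.log")
store = TinyStore(log_path)
acked = store.put("node:1:name", "Alice")
store.close()
```

Because every acknowledged write is already in the log, the database proper can be brought up to date at leisure; a crash in between is covered by replaying the log.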

Comments

10 Responses to “Introduction to Neo Technology and Neo4j”

  1. Notes on graph data management | DBMS 2 : DataBase Management System Services on July 5th, 2012 5:32 am

    [...] Neo Technology (the Neo4j guys) started out doing a content management system, and eventually decided that what they really wanted underneath it was a graph-oriented DBMS. [...]

  2. M-A-O-L » Introduction to Neo Technology and Neo4j on July 5th, 2012 1:01 pm

    [...] Introduction to Neo Technology and Neo4j. Good stuff as usual from Curt Monash, going into a lot of detail about product and company (but not how to use or develop for Neo4j – use Google for that). [...]

  3. Philip Rathle on July 5th, 2012 1:14 pm

    > *I forgot to ask why the paying/production ratio was so low, but my guesses start with:
    Hi Curt, I’m happy to answer this question directly:

    Part of Neo Technology’s mission as an open source company is to promote broad-based adoption of our graph database. It delights us to see the free Community Edition being used across all contexts. This has resulted in an amazingly vibrant community, and in the long-run is also healthy for the product.

    Our commercial-to-free ratio, while it may appear low at first glance, is actually much higher than most open source projects: including MySQL.

  4. Philip Rathle on July 5th, 2012 1:22 pm

    The question of batch analytics raises an interesting aspect of graph databases:

    Because they can so efficiently traverse data in real time, graph databases can take on certain analytic activities that—in a relational context—would need to take place in batch mode. The ability to migrate certain (not all) types of analytic activities, such as recommendations and fraud analysis, into the OLTP system with ACID properties, has been an important driver behind graph database adoption.

  5. Philip Rathle on July 5th, 2012 1:46 pm

    Some additional clarification on the technical implementation details, for those who are thirsty for more:

    – Pointers in Neo4j are explicit relationships between data records created at insert time. This differs *significantly* from relationships in RDBMSs, which are physically calculated on the fly at query time via join operations and index lookups, and are consequently *much* more expensive. Graph database traversals don’t require index lookups, which, for complex connected-data operations, yields order-of-magnitude performance improvements over relational and other NoSQL options.
    – The pointer scheme is entirely hidden from the user. Users need only walk the relationships, or tell Neo4j what they need to bring back, and let the database traverse the graph and return the results.
    – Both the data in the graph (including indexes) and the graph structure are always consistent at any point in time.
    – While Lucene is our default choice of indexing (because of its predictability/maturity/performance characteristics), it is possible to swap out Lucene for other types of indexes. These must be JTA compliant in order to reap the consistency benefits that we have built into our Lucene solution.
    – Buffers can be caused to flush more or less frequently by tuning the size of the logical logs.
    – Regarding scale out vs. scale up: few users bump up against the current limits. Those who have needed to scale across multiple instances have done so by scaling out via the application. This leads to hybrid performance characteristics: extremely fast traversals for queries local to an instance, and slower indexed lookups across nodes, equivalent to non-graph databases.

    Nice write up!

  6. Curt Monash on July 6th, 2012 4:41 am

    > Our commercial-to-free ratio, while it may appear low at first glance, is actually much higher than most open source projects: including MySQL.

    The MySQL example is something of a special case, because of all the WordPress and other embeds. Indeed, I’m a MySQL “user” numerous times over. :)

  7. Introduction to Neo Technology and Neo4j [Questions of Scale?] « Another Word For It on July 7th, 2012 4:21 pm

    [...] Introduction to Neo Technology and Neo4j by Curt Monash. [...]

  8. Database diversity revisited | DBMS 2 : DataBase Management System Services on July 8th, 2012 8:55 pm

    [...] an object-oriented DBMS does the job (or a graph DBMS or [...]

  9. Disk, flash, and RAM | DBMS 2 : DataBase Management System Services on October 9th, 2012 12:53 am

    [...] and Neo4j both rely on direct [...]

  10. Sky Hester on August 13th, 2013 12:42 pm

    In response to the statement: “The cost [C] of the pointer chase, to a first approximation, is [C = k*E^(L-1)], where… the constant [k] is sub-microsecond.”

    Using data from an experiment in Partner and Vukotic’s “Neo4j in Action” (cited in “Graph Databases”, Robinson and Webber p. 20), where paths of length up to 5 are queried from a social network of about 50 friends per person, this would give

    C(L) = k*50^(L-1)

    implying that the constant is about 8 microseconds using a linear regression when the equation is in the form

    log C = L + log k – 1 (with logarithms taken base 50).

    Of course, it could be machine-dependent, but if the claim is that the constant is sub-microsecond, this experiment does not support that claim.
