I've seen a fair amount of evidence online (videos, blogs etc) describing how Neo4j performs very well when scaling up data volumes on a single machine, partly because relationships (edges) between data items can be followed directly instead of indirectly via FKs as in relational systems, and Neo4j's ACID transaction support is a real bonus in the NoSQL world. But to what extent is it amenable to scaling out across multiple servers in the way that Hadoop or MongoDB allow? How far is it possible to achieve high performance across multiple servers, given the fact that your data will inevitably have dependencies that cross machine boundaries, affecting how your transactions are managed?
I can see potential for Neo4j in some applications in my workplace, where we currently have a lot of relationships between tables on a strictly 3NF (unfortunately!) relational database that require lots of joins, because those relationships are expensive to traverse and Neo4j would make that process much more efficient. But I'm not clear on how far it would be possible to scale that solution out if the need arose, especially as the general trend is away from scaling up on big expensive kit and towards scaling out on cheaper servers.
Scaling graph storage horizontally (data sharding) is a hard problem: once a traversal pays the cost of a network hop for each relationship that crosses a machine boundary, it quickly becomes too expensive.
So in those terms, Neo4j will not compete with the likes of Hadoop in the near future.
What Neo4j does provide is a very intelligent caching engine (both JVM heap caching and off-heap caching), which should speed things up significantly if the entire data set fits into memory.
For cases where the data is larger than the amount of memory available, you can run Neo4j in a clustered master-slave setup. Each node (master and slaves alike) holds a full copy of the data, but with smart routing in your application you can keep different parts of the data set cached on different nodes, making traversals scalable over large data sets.
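To make the "smart routing" idea concrete, here is a minimal sketch of cache-affinity routing on the application side. The server names and the hashing scheme are illustrative assumptions, not a Neo4j API: the only point is that reads for the same entity always land on the same slave, so that slave's cache stays warm for that region of the graph.

```python
# Sketch: cache-affinity routing across a master-slave Neo4j cluster.
# The slave addresses below are hypothetical; replace with your own cluster.
import hashlib

SLAVES = ["neo4j-slave-1:7474", "neo4j-slave-2:7474", "neo4j-slave-3:7474"]

def route_read(entity_id: str) -> str:
    """Pick a slave deterministically from the entity id, so traversals
    rooted at the same entity keep hitting the same (warm) cache."""
    digest = hashlib.sha1(entity_id.encode("utf-8")).hexdigest()
    return SLAVES[int(digest, 16) % len(SLAVES)]
```

Writes would still go to the master; only reads are spread out this way, which is exactly what makes the setup a read-scaling (not a sharding) solution.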
In addition, the core Neo4j data structures are very small - 9 bytes for a node and 33 bytes for a relationship - which means that you can fit billions of nodes and relationships in a standard server's RAM. It is the properties that add size to the data, so if you are after raw traversal performance, you can store only a minimal amount of information in Neo4j and use another storage system to persist the rest - this is what polyglot persistence is all about.
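A tiny sketch of that polyglot split, with plain dicts standing in for the two stores (Neo4j for the relationship structure, a document/KV store for the bulky properties) - the ids, names, and helper function are all made up for illustration:

```python
# Graph store: only ids and relationships (what Neo4j would hold).
graph = {
    "u1": ["u2", "u3"],
    "u2": ["u3"],
    "u3": [],
}

# Document store: full property payloads, keyed by the same ids
# (what e.g. a document database would hold).
doc_store = {
    "u1": {"name": "Alice", "bio": "long text..."},
    "u2": {"name": "Bob", "bio": "long text..."},
    "u3": {"name": "Carol", "bio": "long text..."},
}

def friends_of(node_id):
    """Traverse in the lean graph store, then hydrate names from the doc store."""
    return [doc_store[n]["name"] for n in graph[node_id]]
```

The traversal only ever touches the small graph records, so it stays cache-friendly; the property lookups happen once, at the end, for just the nodes you actually return.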
Thanks again, Aleksa. It sounds like server-based Neo4j would probably be fine for some of our applications, and I like the idea of polyglot persistence using the right DB for the right kind of data. Just need to persuade my bosses...