One of the more interesting upcoming growth areas in Data Science is the use of Graph Databases and graph-based analytics on large, unstructured datasets.
This is a natural next step along the progression we’ve already been on: first raw MapReduce and Hadoop for large-scale data processing, then tools and frameworks (such as Streaming and Cascading), and then the addition of SQL-like layers on top such as Pig and Hive.
Why do we need Graph Databases?
Today, applications and devices generate a flood of data. This high volume of data is typically dense and highly interrelated; it does not fall neatly into pre-defined schemas. Facebook’s Open Graph and Twitter’s interest graph are two obvious examples, but there are many other domains where this applies, such as healthcare, e-commerce and sensor data.
Data made up of highly connected entities is not easily modeled using traditional relational schemas; graph data structures, by contrast, make it easy to represent connected data and to perform rapid analysis on these large datasets.
In a graph database, data is stored as nodes and relationships; both nodes and relationships have properties. Instead of capturing relationships between entities in a join table as in a Relational Database, a Graph Database captures the relationships themselves and their properties directly within the stored data.
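To make the data model described above concrete, here is a minimal in-memory sketch in Python. It is not how Neo4J or any other product stores data; the class names (Node, Relationship, PropertyGraph) and the sample "FOLLOWS" relationship are hypothetical, chosen only to show that both nodes and relationships carry their own properties and that no join table is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: str
    end: str
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy property graph: relationships are stored with the node they start from."""

    def __init__(self):
        self.nodes = {}      # node_id -> Node
        self.outgoing = {}   # node_id -> list of Relationship

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        self.outgoing.setdefault(node_id, [])

    def add_relationship(self, start, end, rel_type, **properties):
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must exist before they can be related")
        rel = Relationship(start, end, rel_type, properties)
        self.outgoing[start].append(rel)
        return rel

    def related(self, node_id, rel_type):
        # Follow relationships directly from the node; no join table is consulted.
        return [r for r in self.outgoing[node_id] if r.rel_type == rel_type]


# Hypothetical sample data for illustration only.
graph = PropertyGraph()
graph.add_node("alice", name="Alice")
graph.add_node("bob", name="Bob")
graph.add_relationship("alice", "bob", "FOLLOWS", since=2012)
follows = graph.related("alice", "FOLLOWS")
```

In a relational schema the "since" value would live in a separate join table keyed by two foreign keys; here it sits on the relationship itself, which is the distinction the paragraph above draws.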
To quote Derrick Harris from his GigaOm article:
Graph analysis is among the hottest techniques around for making sense of large datasets, primarily by determining how tightly different data points are related or how similar they are.
Although graphs have recently become more popular because of their applicability in modeling social networks, graph analysis can be widely applied to analyze any kind of relationships.
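One simple way to put a number on "how similar" two data points are is to compare the sets of things each one is connected to. The sketch below uses the Jaccard index for this; it is one common choice among many, not the specific technique the quoted article describes, and the function name and the sample purchase data are assumptions made up for illustration.

```python
def jaccard_similarity(adjacency, a, b):
    """Size of the shared neighbor set divided by the size of the combined neighbor set."""
    neighbors_a = set(adjacency.get(a, ()))
    neighbors_b = set(adjacency.get(b, ()))
    union = neighbors_a | neighbors_b
    if not union:
        # Two nodes with no connections at all are treated as not similar.
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)


# Hypothetical customer -> purchased item edges.
purchases = {
    "ann": {"book", "lamp", "desk"},
    "ben": {"book", "lamp", "chair"},
    "cat": {"kettle"},
}

ann_ben = jaccard_similarity(purchases, "ann", "ben")
ann_cat = jaccard_similarity(purchases, "ann", "cat")
```

Nothing here is specific to social networks: the same comparison works for customers and purchases, patients and diagnoses, or sensors and events.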
Modeling the data as graphs allows data scientists to discover localized patterns, that is, how specific items are related to other items. Even with large datasets, most analytics queries end up acting locally within a graph. All graphs share common patterns; simple examples include the diamond, butterfly and star patterns, and these simple patterns can be composed into arbitrarily complex ones.
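The idea of queries acting locally can be sketched in a few lines of Python. The first function below walks outward from one node for a limited number of hops using breadth-first search, so it only touches the part of the graph near the starting point. The second looks for a simple star pattern, here defined (as an assumption for this example) as a hub with several neighbors that connect to nothing else. The function names and the sample graph are hypothetical.

```python
from collections import deque


def neighborhood(adjacency, start, max_hops):
    """Map each node within max_hops of start to its hop distance (breadth-first)."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


def star_centers(adjacency, min_leaves):
    """Return hubs with at least min_leaves neighbors whose only connection is the hub."""
    centers = []
    for hub, neighbors in adjacency.items():
        leaves = [n for n in neighbors if adjacency.get(n, set()) == {hub}]
        if len(leaves) >= min_leaves:
            centers.append(hub)
    return sorted(centers)


# Hypothetical undirected graph: every edge is listed in both directions.
adjacency = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}

nearby = neighborhood(adjacency, "a", 2)
hubs = star_centers(adjacency, 3)
```

Because the search stops at the hop limit, its cost depends on the size of the local neighborhood rather than on the size of the whole graph, which is what makes this style of query practical on large datasets.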
Graph Database: Neo4J
One of the exciting entries in this area is Neo4J, a Java-based open source graph database from Swedish company Neo Technology. Neo4J stores graph data directly and offers large-scale horizontal scalability using replication; in addition it offers ACID transactions and indexes similar to a traditional database. Neo4J also has a REST API and its own graph query language called Cypher.
A good starting point to learn about Neo4J is Robert Scoble’s video interview with Neo Technology’s CEO, Emil Eifrem. To me, the most interesting section is where Emil nails the value of graph databases in general (at the 6:30 mark in the video):
Sophisticated intelligence and reasoning is all about how are things related to one another. … Whenever the value is in the connection between things, that's when a graph database excels.
You can watch the video here: http://vimeo.com/56040747
Interestingly, he says that the next step for Neo4J is to work on transparent partitioning, to enable the database to automatically co-locate highly-connected nodes. There is a lot of good information about graph databases on the Neo4J site, here: http://www.neo4j.org/learn/neo4j.
As a student of data visualization, I also found the following infographic an interesting way to represent Graph databases:
[Infographic on graph databases, hosted on visual.ly.]
Clearly, this is a hot area right now, with many different technologies and frameworks emerging. One offering comes from Intel: GraphBuilder, an open source Java library for constructing graphs out of large datasets for data analytics. YarcData, a Cray company, has a graph analytics hardware platform for real-time analysis: an in-memory appliance called Urika. And of course, there is an Apache project, Apache Giraph, that addresses graph processing of big data.
This is still very much an emerging space, so it’s not clear which of these projects will evolve into popular successes. But given the explosion of data all around us and the prevalence of relationships among entities within that data, it’s certain that this area of computer science will receive a lot of attention in the near future.