
FACILITATING THE SPREAD OF KNOWLEDGE AND INNOVATION IN PROFESSIONAL SOFTWARE DEVELOPMENT

Graph Databases
NoSQL Database Solution for Managing Linked Data
eMag Issue 34 - October 2015

ARTICLE: Full Stack Web Development Using Neo4j
INTERVIEW: Data Modeling in Graph Databases
ARTICLE: Graph Databases in the Enterprise: Fraud Detection
Let Me Graph That For You
In this article on Graph Databases, author Ian Robinson discusses the problems Graph DBs aim to
solve. He also talks about the data, storage, and query models for managing graph data.

Data Modeling in Graph Databases: Interview with Jim Webber and Ian Robinson

Data modeling with Graph databases requires a different paradigm than modeling in Relational or other NoSQL databases like Document databases, Key Value data stores, or Column Family databases. InfoQ spoke with Jim Webber and Ian Robinson about data modeling efforts when using Graph databases.

High Tech, High Sec.: Security Concerns in Graph Databases

Graph NoSQL databases support data models with connected data and relationships. In this article, the author discusses the security implications of graph database technology. He talks about the privacy and security concerns in use cases like graph discovery, knowledge management, and prediction.

Graph Databases in the Enterprise: Fraud Detection

Financial services organizations lose billions of dollars every year to fraudulent transactions. Graph NoSQL databases can be used to uncover fraud rings and other complex scams using contextual link analysis, and potentially stop advanced fraud scenarios in real time. Check out this article for a detailed discussion of one of the most impactful and interesting use cases of graph database technologies: fraud detection.

Full Stack Web Development Using Neo4j

When building a web application, there are a lot of choices for the database. In this article, the author discusses why the Neo4j graph database is a good choice as a data store for your web application if your data model contains a lot of connected data and relationships.

FOLLOW US: facebook.com/InfoQ | @InfoQ | google.com/+InfoQ | linkedin.com/company/infoq

CONTACT US:
GENERAL FEEDBACK [email protected]
ADVERTISING [email protected]
EDITORIAL [email protected]
SRINI PENCHIKALA currently works as Senior Software Architect at a financial services organization in Austin, Texas. He is also the Lead Editor for the Big Data and NoSQL Database communities at InfoQ. Srini has over 20 years of experience in software architecture, design, and development. He is currently authoring a book on Big Data Processing with Apache Spark. He is also the co-author of the "Spring Roo in Action" book from Manning Publications. Srini has presented at conferences like Big Data Conference, Enterprise Data World, JavaOne, SEI Architecture Technology Conference (SATURN), IT Architect Conference (ITARC), No Fluff Just Stuff, NoSQL Now, and Project World Conference. He has also published several articles on software architecture, security and risk management, and NoSQL databases on websites like InfoQ, TheServerSide, O'Reilly Network (ONJava), DevX Java, java.net, and JavaWorld.

A LETTER FROM
THE EDITOR

Graph databases are getting a lot of attention lately. They are used to manage connected data and are better solutions for several real-world use cases which require the mapping of relationships between data entities for data-driven business decision making.

Real-world use cases of graph databases include fraud detection, Anti-Money Laundering (AML), Master Data Management (MDM), trading analytics, and online gaming.

The graph data management space includes three different areas:
1. Specialized graph databases to store the connected data. These databases include Neo4j, TitanDB, InfiniteGraph, and AllegroGraph.
2. Graph data processing and real-time graph analytics frameworks like Spark GraphX and GraphLab.
3. Graph data visualization tools, which give non-technical users insights into the connected data, to use for data-driven business decision making.

Another emerging trend in the graph database space is multi-model databases. NoSQL databases like OrientDB support storing document and graph data sets.

This eMag focuses on the graph database landscape and the real-world use cases of graph databases. It includes articles and interviews covering topics like data modeling in graph databases and how companies like Pinterest use graph databases in their applications. It also includes an article on full stack web development using a graph database so readers can see the power of graph databases to manage connected data.


Read online on InfoQ

Let Me Graph That for You

Ian Robinson works on research and development for future versions of the Neo4j graph
database. Harbouring a long-held interest in connected data, he was for many years one of the
foremost proponents of REST architectures before turning his focus from the Web's global graph
to the realm of graph databases. Follow him on Twitter: @iansrobinson

Neo4j is designed to address challenges in the contemporary data landscape.

Why would we consider using a graph database? To tackle complexity, generate insight, and bring value to end users; more specifically, to wrest insight from the kind of complexity that arises wherever three contemporary forces meet: an increase in the amount of generated and stored data, a need to accommodate a high degree of structural variation, and a need to understand the multiple facets of connectedness inherent in the domain to which the data belongs.

Increased data size — big data — is perhaps the best understood of these three forces. The volume of new data is growing exponentially each year, a trend that looks set to continue for the foreseeable future. But as the volume of data increases and we learn more about the instances in our domain, each instance begins to look subtly unique. In other words, as data volumes grow, we trade insight for uniformity. The more data we gather about a group of entities, the more likely that data is to be variably structured.

Variably structured data is the kind of messy, real-world data that doesn't fit comfortably into a uniform, one-size-fits-all, rigid relational schema; the kind that gives rise to lots of sparse tables and null-checking logic. It's the increasing prevalence of variably structured data in today's applications that has led many organisations to adopt schema-free alternatives to the relational model, such as document stores.

But the challenges that face us today aren't just the management of increasingly large volumes of data, nor do they simply extend to us having to accommodate ever-increasing degrees of structural variation in that data. The real challenge to generating significant insight is understanding connectedness: to answer the most important questions we want to ask of our domains, we must first know which things are connected and then, having identified these connected entities, understand in what ways and with what strength, weight, or quality they are connected.
Consider these questions:
• Which friends and colleagues do we have in common?
• Which applications and services in my network will be affected if a particular network element — a router or switch, for example — fails? Do we have redundancy throughout the network for our most important customers?
• What's the quickest route between two stations on the underground?
• What do you recommend this customer should buy, view, or listen to next?
• Which products, services, and subscriptions does a user have permission to access and modify?
• What's the cheapest or fastest means of delivering this parcel from A to B?
• Which parties are likely working together to defraud their bank or insurer?
• Which institutions are most at risk of poisoning the financial markets?

Have you had to answer questions such as these? If so, you've already encountered the need to manage and make sense of large volumes of variably structured, densely connected data. These are the kinds of problems for which graph databases are ideally suited. Understanding what depends on what, and how things flow; identifying and assessing risk, and analysing the impact of events on deep dependency chains: these are all connected data problems. Today, Neo4j is being used in business-critical applications in domains as diverse as social networking, recommendations, datacentre management, logistics, entitlements and authorization, route finding, telecommunications network monitoring, fraud analysis, and many others. Its widespread adoption challenges the notion that relational databases are the best tools for working with connected data. At the same time, it proposes an alternative to the simplified, aggregate-oriented data models adopted by NoSQL.

The rise of NoSQL was largely driven by a need to remedy the perceived performance and operational limitations of relational technology. But in addressing performance and scalability, NoSQL has tended to surrender the expressive and flexible modelling capabilities of its relational predecessor, particularly with regard to connected data. Graph databases, in contrast, revitalise the world of connected data, shunning the simplifications of the NoSQL models yet outperforming relational databases by several orders of magnitude.

To understand how graphs and graph databases help tackle complexity, we need first to understand Neo4j's graph data model.

The Labelled Property Graph Model
Neo4j uses a particular graph data model, called the labelled property graph model, to represent network structures. A labelled property graph consists of nodes, relationships, properties, and labels. Here's an example of a property graph:

Diagram 1

Nodes
Nodes represent entity instances. To capture an entity's attributes, we attach key-value pairs (properties) to a node, thereby creating a record-like structure for each individual thing in our domain. Because Neo4j is a schema-free database, no two nodes need share the same set of properties: no two nodes representing persons, for example, need have the exact same attributes.

Relationships
Relationships represent the connections between entities. By connecting pairs of nodes with relationships, we introduce structure into the model. Every relationship must have a start node and an end node. Just as importantly, every relationship must have a name and a direction. A relationship's name and direction lend semantic clarity and context to the nodes attached to the relationship. This allows us — in, for example, a Twitter-like graph — to say that "Bill" (a node) "follows" (a named and directed relationship) "Sally" (another node). Like nodes, relationships can also contain properties. We typically use relationship properties to represent some distinguishing feature of each connection. This is particularly important when, in answering the questions we want to ask of our domain, we must not only trace the connections between things, but also take account of the strength, weight, or quality of each of those connections.
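To make the "Bill follows Sally" description concrete, here is a minimal Cypher sketch; the since property is an illustrative assumption rather than something from the article, but it shows a named, directed relationship carrying a distinguishing property:

// A named, directed relationship with a property on it
CREATE (:Person {name: 'Bill'})-[:FOLLOWS {since: 2012}]->(:Person {name: 'Sally'})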
Node labels
Nodes, relationships, and properties provide for tremendous flexibility. In effect, no two parts of the graph need have anything in common. Labels, in contrast, allow us to introduce an element of commonality with which to group nodes together and indicate the roles they play within our domain. We do this by attaching one or more labels to each of the nodes we want to group: we can, for example, label a node to make it represent both a user and, more specifically, an administrator. (Labels are optional, and therefore a node can have zero labels.) Node labels are similar to relationship names insofar as they lend additional semantic context to elements in the graph, but while a relationship instance must perform exactly one role because it connects precisely two nodes, a node, by virtue of the fact it can be connected to any number of other nodes (or to none), may fulfil several different roles.

On top of this simple grouping capability, labels also allow us to associate indexes and constraints with nodes bearing specific labels. We can, for example, require that all nodes labelled "Book" are indexed by their ISBN property, and then further require that each ISBN property value is unique within the context of the graph.
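As a sketch of that Book/ISBN example, using the Cypher syntax of the Neo4j 2.x releases current when this eMag was published (the isbn property name is an assumption):

// A uniqueness constraint on :Book(isbn); in Neo4j 2.x it is backed by an index on that property
CREATE CONSTRAINT ON (b:Book) ASSERT b.isbn IS UNIQUE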
Representing complexity
This graph model is probably the best abstraction we have for modelling both variable structure and connectedness. Variable structure grows by virtue of connections being specified at the instance level rather than the class level. Relationships join individual nodes, not classes of nodes: in consequence, no two nodes need connect in exactly the same way to their neighbours and no two subgraphs need be structured exactly alike. Each relationship in the graph represents a specific connection between two particular things. It's this instance-level focus on things and the connections between things that makes graphs ideal for representing and navigating a variably structured domain. Relationships not only specify that two things are connected, they also describe the nature and quality of that connection. To the extent that complexity is a function of the ways in which the semantic, structural, and qualitative aspects of the connections in a domain can vary, our data models require a means to express and exploit this connectedness. Neo4j's labelled property graph model, wherein every relationship can not only be specified independently of every other but also may be annotated with properties that describe how, in what degree, and with what weight, strength, or quality those entities connect, provides one of the most powerful means for managing complexity today.

And doing it fast
Join-intensive queries in a relational database are notoriously expensive, in large part because joins must be resolved at query time by way of an indirect index lookup. As an application's dataset size grows, these join-inspired lookups slow down, causing performance to deteriorate. In Neo4j, in contrast, every relationship acts as a pre-computed join and every node acts as an index of its associated nodes. By having each element maintain direct references to its adjacent entities in this way, a graph database avoids the performance penalty imposed by index lookups — a feature sometimes known as index-free adjacency. As a result, Neo4j can be many thousands of times faster than a join-intensive operation in a relational database for complexly connected queries.

Index-free adjacency provides for queries whose performance characteristics are a function of the amount of the graph they choose to explore, rather than the overall size of the dataset. In other words, query performance tends to remain reasonably constant even as the dataset grows. Consider, for example, a social network in which every person has, on average, fifty friends. Given this invariant, friend-of-a-friend queries will remain reasonably constant, irrespective of whether the network has a thousand, a million, or a billion nodes.
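A hedged sketch of such a friend-of-a-friend query, assuming a social graph with Person nodes and FRIEND relationships (this is not the skills-finder dataset used later in the article):

// Friends of my friends who are not already my friends
MATCH (me:Person {username: 'ian'})-[:FRIEND]->()-[:FRIEND]->(foaf:Person)
WHERE foaf <> me
  AND NOT (me)-[:FRIEND]->(foaf)
RETURN DISTINCT foaf.username

Because the traversal only touches the two hops around me, its cost tracks the size of my neighbourhood, not the size of the whole network.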
Graph data modelling
This section looks at how to design and implement an application's graph data model and associated queries.

From user story to domain questions
Imagine we're building a cross-organizational skills finder: an application that allows us to find people with particular skills in a network of professional relationships.
To see how we might design a data model and associated queries for this application, we'll follow the progress of one of our agile user stories, from analysis through to implementation in the database. The story follows.

As an employee, I want to know which of my colleagues have similar skills to mine so that I can exchange knowledge with them or ask them for help.

Given this end-user goal, our first task is to identify the questions we would have to ask of our domain in order to satisfy it. Here's the story rephrased as a question.


Which people, who work for the same company as me, have similar skills to mine?

Whereas the user story describes what it is we're trying to achieve, the questions we pose to our domain hint as to how we might satisfy our users' goals. A good application graph data model makes it easy to ask and answer such questions. Fortunately, the questions themselves contain the germ of the structure we're looking for.
Language itself is a structuring of logical relationships. At its simplest, a sentence describes a person or thing, some action performed by this person or thing, and the target or recipient of that action, together with circumstantial detail such as when, where, or how this action takes place. By attending closely to the language we use to describe our domain and the questions we want to ask of our domain, we can readily identify a graph structure that represents this logical structuring in terms of nodes, relationships, properties, and labels.

From domain questions to Cypher path expressions
The particular question we outlined earlier names some of the significant entities in our domain: people, companies, and skills. Moreover, the question tells us something about how these entities are connected to one another:
• A person works for a company.
• A person has several skills.
These simple natural-language representations of our domain can now be transformed into Neo4j's query language, Cypher. Cypher is a declarative, SQL-like graph-pattern-matching language built around the concept of path expressions: declarative structures that allow us to describe to the database the kinds of graph patterns we wish either to find or to create inside our graph.
When translating our ordinary-language descriptions of the domain into Cypher path expressions, the nouns become candidate node labels, the verbs relationship names:

(:Person)-[:WORKS_FOR]->(:Company),
(:Person)-[:HAS_SKILL]->(:Skill)

Cypher uses parentheses to represent nodes, and dashes and less-than and greater-than signs (<- and ->) to represent relationships and their directions. Node labels and relationship names are prefixed with a colon; relationship names are placed inside square brackets in the middle of the relationship.
In creating our Cypher expressions, we've tweaked some of the language. The labels we've chosen refer to entities in the singular. More importantly, we've used HAS_SKILL rather than HAS to denote the relationship that connects a person to a skill, because HAS is too general a term. Rightsizing a graph's relationship names is key to developing a good application graph model. If the same relationship name is used with different semantics in several different contexts, queries that traverse those relationships will tend to explore far more of the graph than is strictly necessary — something we are mindful to avoid.
The expressions we've derived from the questions we want to ask of our domain form a prototypical path for our data model. In fact, we can refactor the expressions to form a single path expression:

(:Company)<-[:WORKS_FOR]-(:Person)-[:HAS_SKILL]->(:Skill)

While there are likely many other requirements for our application and many other data elements to discover as a result of analysing those requirements, for the story at hand, this path structure captures all that is needed to meet our end users' immediate goals. There is still some work to do to design an application that can create instances of this path structure at run time as users add and amend their details, but insofar as this article is focussed on the design and implementation of the data model and associated queries, our next task is to implement the queries that target this structure.

A sample graph
To illustrate the query examples, we'll use Cypher's CREATE statement to build a small sample graph comprising two companies, their employees, and the skills and levels of proficiency possessed by each employee. (Code 1)
This statement uses Cypher path expressions to declare or describe the kind of graph structure we wish to introduce into the graph. In the first half, we create all the nodes we're interested in — in this instance, nodes representing companies, people, and skills — and in the second half, we connect these nodes using appropriately named and directed relationships. The entire statement, however, executes as a single transaction.


// Create skills-finder network

CREATE (p1:Person{username:'ben'}),
       (p2:Person{username:'charlie'}),
       (p3:Person{username:'lucy'}),
       (p4:Person{username:'ian'}),
       (p5:Person{username:'sarah'}),
       (p6:Person{username:'emily'}),
       (p7:Person{username:'gordon'}),
       (p8:Person{username:'kate'}),
       (c1:Company{name:'Acme'}),
       (c2:Company{name:'Startup'}),
       (s1:Skill{name:'Neo4j'}),
       (s2:Skill{name:'REST'}),
       (s3:Skill{name:'DotNet'}),
       (s4:Skill{name:'Ruby'}),
       (s5:Skill{name:'SQL'}),
       (s6:Skill{name:'Architecture'}),
       (s7:Skill{name:'Java'}),
       (s8:Skill{name:'Python'}),
       (s9:Skill{name:'Javascript'}),
       (s10:Skill{name:'Clojure'}),
       (p1)-[:WORKS_FOR]->(c1),
       (p2)-[:WORKS_FOR]->(c1),
       (p3)-[:WORKS_FOR]->(c1),
       (p4)-[:WORKS_FOR]->(c1),
       (p5)-[:WORKS_FOR]->(c2),
       (p6)-[:WORKS_FOR]->(c2),
       (p7)-[:WORKS_FOR]->(c2),
       (p8)-[:WORKS_FOR]->(c2),
       (p1)-[:HAS_SKILL{level:1}]->(s1),
       (p1)-[:HAS_SKILL{level:3}]->(s2),
       (p2)-[:HAS_SKILL{level:2}]->(s1),
       (p2)-[:HAS_SKILL{level:1}]->(s9),
       (p2)-[:HAS_SKILL{level:2}]->(s5),
       (p3)-[:HAS_SKILL{level:3}]->(s3),
       (p3)-[:HAS_SKILL{level:2}]->(s6),
       (p3)-[:HAS_SKILL{level:1}]->(s8),
       (p4)-[:HAS_SKILL{level:2}]->(s7),
       (p4)-[:HAS_SKILL{level:3}]->(s1),
       (p4)-[:HAS_SKILL{level:2}]->(s2),
       (p5)-[:HAS_SKILL{level:1}]->(s1),
       (p5)-[:HAS_SKILL{level:3}]->(s7),
       (p5)-[:HAS_SKILL{level:2}]->(s2),
       (p5)-[:HAS_SKILL{level:1}]->(s10),
       (p6)-[:HAS_SKILL{level:2}]->(s1),
       (p6)-[:HAS_SKILL{level:1}]->(s3),
       (p6)-[:HAS_SKILL{level:2}]->(s8),
       (p7)-[:HAS_SKILL{level:3}]->(s3),
       (p7)-[:HAS_SKILL{level:1}]->(s4),
       (p8)-[:HAS_SKILL{level:2}]->(s6),
       (p8)-[:HAS_SKILL{level:3}]->(s8)

Code 1


Let's take a look at the first node definition.

(p1:Person{username:'ben'})

This expression describes a node labelled Person. The node has a username property whose value is "ben". The node definition is contained within parentheses. Inside the parentheses, we specify a colon-prefixed list of the labels attached to the node (there's just one here, Person), together with the node's properties. Cypher uses a JSON-like syntax to define the properties that belong to a node.
Having created the node, we then assign it to an identifier, p1. This identifier allows us to refer to the newly created node elsewhere in the query. Identifiers are arbitrarily named, ephemeral, in-memory phenomena; they exist only within the scope of the query (or subquery) in which they are declared. They are not considered part of the graph and therefore are discarded when the data is persisted to disk.
Having created all the nodes representing people, companies, and skills, we then connect them as per our prototypical path expression: each person WORKS_FOR a company; each person HAS_SKILL one or more skills. Here's the first of the HAS_SKILL relationships.

(p1)-[:HAS_SKILL{level:1}]->(s1)

This relationship connects the node identified by p1 to the node identified by s1. Besides specifying the relationship name, we've also attached a level property to this relationship using the same JSON-like syntax we used for node properties.
(We've used a single CREATE statement here to create an entire sample graph. This is not how we would populate a graph in a running application, where individual end-user activities trigger the creation or modification of data. For such applications, we'd use a mixture of CREATE, SET, MERGE, and DELETE to create and modify portions of the graph. You can read more about these operations in the online Cypher documentation.)
Diagram 2 shows a portion of the sample data. Within this structure you can clearly see multiple instances of our prototypical path.

Diagram 2
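To illustrate that parenthetical point, here is a hypothetical sketch of how a running application might upsert a single person's details with MERGE and SET instead of one bulk CREATE; the username 'zoe' is an invented example that is not part of the sample graph:

// Upsert one person, her employer, and one skill without recreating the graph
MERGE (p:Person {username: 'zoe'})
MERGE (c:Company {name: 'Startup'})
MERGE (p)-[:WORKS_FOR]->(c)
MERGE (s:Skill {name: 'Neo4j'})
MERGE (p)-[r:HAS_SKILL]->(s)
SET r.level = 2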


To answer this question, we’re going to have to colleagues), and the names and the directions of the
find a particular graph pattern in our sample data. relationships that must be present between nodes
Let’s assume that somewhere in the existing data for them to match (a Person must be connected to
is a node labelled Person that represents me (I have a Company with an outgoing WORKS_FOR relation-
the  username  “ian”). That node will be connected ship, and to a Skill with an outgoing HAS_SKILL rela-
to a node labelled  Company  by way of an outgo- tionship). Where we want to refer to a matched node
ing  WORKS_FOR  relationship. It will also be con- later in the query, we assign it to an identifier (we’ve
nected to one or more nodes labelled Skill by way chosen  colleague,  company,  and  skill). By being as
of several outgoing HAS_SKILL relationships. To find explicit as we can about the pattern, we help ensure
colleagues who share my skill set, we’re going to that Cypher explores no more of the graph than is
have to find all the other nodes labelled Person that strictly necessary to answer the query.
are connected to the same company node as me The  RETURN  clause generates a tabular pro-
and that are also connected to at least one of the jection of the results. As I mentioned earlier, we’re
skill nodes to which I’m connected. In diagrammatic matching multiple instances of the pattern. Col-
form, the following is the pattern we’re looking for. leagues with more than one skill in common with
(Diagram 3) me will match multiple times. In the results, howev-
Our query will look for multiple instances of this er, we only want to see one line per colleague. Us-
pattern inside the existing graph data. For each col- ing the  count  and  collect  functions, we aggregate
league who shares one skill with me, we’ll match the the results on a per colleague basis. The count func-
pattern once. If a person has two skills in common tion counts the number of skills we’ve matched per
with me, we’ll match the pattern twice, and so on. colleague, and aliases this as their score. The  col-
Each match will be anchored on the node that rep- lect  function creates a comma-separated list of the
resents me. Using Cypher path expressions, we can skills that each colleague has in common with me,
describe this pattern to Neo4j. Here’s the full query. and aliases this as skills. Finally, we order the results,
highest score first.
001 // Find colleagues with similar
skills Executing this query against the sample data-
002 MATCH (me:Person{username:’ian’}) set generates the following results:
003    -[:WORKS_FOR]->(company:Company), username score skills
004     (me)-[:HAS_SKILL]->(skill:Skill), ben 2 [‘Neo4j’, ‘REST’
005     (colleague:Person)-[:WORKS_FOR]-
>(company), charlie 1 [‘Neo4j’]
006      (colleague)-[:HAS_SKILL]- The important point about this query and the pro-
>(skill) cess that led to its formulation is that the paths we
007 RETURN colleague.username AS use to search the data are very similar to the paths
username, we use to create the data in the first place. The dia-
008       count(skill) AS score, mond-shaped pattern at the heart of our query has
009       collect(skill.name) AS skills
two legs, each comprising a path that joins a person
010       ORDER BY score DESC
to a company and a skill:
This query comprises two clauses: a  MATCH  clause
and a  RETURN  clause. The  MATCH  clause describes 001 (:Company)<-[:WORKS_FOR]-(:Person)
[:HAS_SKILL]->(:Skill)
the graph pattern we want to find in the existing
data; the  RETURN  clause generates a projection of This is the very same path structure we came up with
the results on behalf of the client. for our data model. The similarity shouldn’t surprise
The first line of the  MATCH  clause,  (me:Per- us; after all, both the underlying model and the que-
son{username:’ian’}), locates the node in the exist- ry we execute against that model are derived from
ing data that represents me — a node labelled Per- the question we wanted to ask of our domain.
son  with a  username  property whose value is “ian”
— and assigns it to the identifier me. If there are mul- Filter by skill level
tiple nodes matching these criteria (unlikely because In our sample graph, we qualified each HAS_SKILL re-
username ought to be unique), me will be bound to lationship with a level property that indicates an in-
a list of nodes. dividual’s proficiency with regard to the skill to which
The rest of the  MATCH  clause then describes the relationship points: 1 for beginner, 2 for interme-
the diamond-shaped pattern we want to find in the diate, 3 for expert. We can use this property in our
graph. In describing this pattern, we specify the la- query to restrict matches to only those people who
bels that must be attached to a node for it to match are, for example, level 2 or above in the skills they
(Company  for companies,  Skill  for skills,  Person  for share with us:

Graph Databases // eMag Issue 34 - Oct 2015 11


Filter by skill level
In our sample graph, we qualified each HAS_SKILL relationship with a level property that indicates an individual's proficiency with regard to the skill to which the relationship points: 1 for beginner, 2 for intermediate, 3 for expert. We can use this property in our query to restrict matches to only those people who are, for example, level 2 or above in the skills they share with us:

// Find colleagues with shared skills, level 2 or above
MATCH (me:Person{username:'ian'})-[:WORKS_FOR]->(company),
      (me)-[:HAS_SKILL]->(skill),
      (colleague)-[:WORKS_FOR]->(company),
      (colleague)-[r:HAS_SKILL]->(skill)
WHERE r.level >= 2
RETURN colleague.username AS username,
       count(skill) AS score,
       collect(skill.name) AS skills
ORDER BY score DESC

I've highlighted the changes to the original query. In the MATCH clause, we now assign a colleague's HAS_SKILL relationships to an identifier r (meaning that r will be bound to a list of such relationships). We then introduce a WHERE clause that limits the match to cases where the value of the level property on the relationships bound to r is 2 or greater.
Running this query against the sample data returns the following results:

username | score | skills
charlie  | 1     | ['Neo4j']
ben      | 1     | ['REST']

Search across companies
As a final illustration of the flexibility of our simple data model, we'll tweak the query again so that we no longer limit it to the company where I work, but instead search across all companies for people with skills in common with me:

// Find people with shared skills, level 2 or above
MATCH (me:Person{username:'ian'})-[:HAS_SKILL]->(skill),
      (other)-[:WORKS_FOR]->(company),
      (other)-[r:HAS_SKILL]->(skill)
WHERE r.level >= 2
RETURN other.username AS username,
       company.name AS company,
       count(skill) AS score,
       collect(skill.name) AS skills
ORDER BY score DESC

To facilitate this search, we've removed the requirement that the other person must be connected to the same company node as me. We do, however, still identify the company for whom this other person works. This then allows us to add the company name to the results. The pattern described by the MATCH clause now looks as follows. (Diagram 4)

Diagram 4

Running this query against the sample data returns the following results:

username | company      | score | skills
sarah    | Startup, Ltd | 2     | ['Java', 'REST']
ben      | Acme, Inc    | 1     | ['REST']
emily    | Startup, Ltd | 1     | ['Neo4j']
charlie  | Acme, Inc    | 1     | ['Neo4j']
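The same model supports other questions without change. As a hedged example that is not from the article, here is how a skills finder might look up anyone, at any company, who has a particular skill at level 2 or above:

// Find people with a named skill at level 2 or above, and where they work
MATCH (person:Person)-[r:HAS_SKILL]->(:Skill {name: 'Neo4j'}),
      (person)-[:WORKS_FOR]->(company:Company)
WHERE r.level >= 2
RETURN person.username AS username,
       company.name AS company,
       r.level AS level
ORDER BY level DESC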
Modelling
We've looked at how we derive an application's graph data model and associated queries from end-user requirements. In summary, we:
• describe the client or end-user goals that motivate our model;
• rewrite those goals as questions we would have to ask of our domain;
• identify the entities and the relationships between them that appear in these questions;
• translate these entities and relationships into Cypher path expressions; and
• express the questions we want to ask of our domain as graph patterns using path expressions similar to the ones we used to model the domain.


Use Cypher to describe your model
Use Cypher path expressions rather than an intermediate modelling language such as UML to describe your domain and its model. As we've seen, many of the noun and verb phrases in the questions we want to ask of our domain can be straightforwardly transformed into Cypher path expressions, which then become the basis of both the model itself and the queries we want to execute against that model. In such circumstances, the use of an intermediate modelling language adds very little. This is not to say that Cypher path expressions comprehensively address all of our modelling needs. Besides capturing the structure of the graph, we also need to describe how both the graph structure and the values of individual node and relationship properties ought to be constrained. Cypher does provide for some constraints today, and the number of constraints it supports will rise with each release, but there are occasions where domain invariants must be expressed as annotations to the expressions we use to capture the core of the model.

Name relationships based on use cases
Derive your relationship names from your use cases. Doing so creates paths in your model that easily align with the patterns you want to find in your data. This ensures that queries that take advantage of these paths will ignore all other nodes and relationships.
Relationships both compose and partition the graph. In connecting nodes, they structure the whole, creating a complex composite from what would otherwise be simple islands of data. Because they can be differentiated based on their names, directions, and property values, relationships also serve to partition the graph, allowing us to identify specific subgraphs within a larger, more generally connected structure. By focussing our queries on certain relationship names and directions and the paths they form, we exclude other relationships and other paths, effectively materializing a particular view of the graph dedicated to addressing a particular need.
You might think this smacks somewhat of an overly specialized approach, and in many ways it is one. But it's rarely an issue. Graphs don't exhibit the same degree of specialization tax as relational models. The relational world has an uneasy relationship with specialization, both abhorring it yet requiring it, and then suffering when it's used.
Consider that we apply the normal forms in order to derive a logical structure capable of supporting ad hoc queries — that is, queries we haven't yet thought of. All is well and good — until we go into production. At that point, for the sake of performance, we denormalize the data, effectively specializing it on behalf of an application's specific access patterns. This denormalization helps in the near term but poses a risk for the future, for in specializing for one access pattern, we effectively close the door on many others. Relational modellers frequently face these kinds of either/or dilemmas: either stick with the normal forms and degrade performance, or denormalize and limit the scope for evolving the application further down the line.
Not so with graph modelling. Because the graph allows us to introduce new relationships at the level of individual node instances, we can specialize it over and over again, use case by use case, in an additive fashion — that is, by adding new routes to an existing structure. We don't need to destroy the old to accommodate the new; rather, we simply introduce the new configuration by connecting old nodes with new relationships. These new relationships effectively materialize previously unthought-of graph structures to new queries. Their introduction into the graph, however, need not upset the view enjoyed by existing queries.
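As a hypothetical sketch of that additive specialization, a new mentoring use case could be layered onto the existing skills graph without altering the WORKS_FOR or HAS_SKILL structure; the MENTORS relationship is an invented example, not part of the article's model:

// Add a new relationship type for a new use case; existing queries are unaffected
MATCH (senior:Person {username: 'ian'}), (junior:Person {username: 'ben'})
MERGE (senior)-[:MENTORS]->(junior)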
Pay attention to language
In our modelling example, we derived a couple of path expressions from the noun and verb phrases we used to describe common relationships. There are a few rules of thumb for analysing a natural-language representation of a domain.


Common nouns become candidates for labels: "person", "company", and "skill" become Person, Company, and Skill. Verbs that take an object — "owns", "wrote", and "bought", for example — are candidates for relationship names. Proper nouns — a person or company name, for example — refer to an instance of a thing, which we then typically model as a node.
Things aren't always so straightforward. Subject-verb-object constructs easily transform into graph structures, but a lot of the sentences we use to describe our domain are not so simple. Adverbial phrases, for example — those additional parts of a sentence that describe how, when, or where an action takes place — result in what entity-relational modelling calls "n-ary" relationships: complex, multidimensional relationships that bind several things and concepts.
The representation of n-ary relationships would appear to require something more sophisticated than a property graph, like a model that allows relationships to connect more than two nodes or that permits one relationship to connect to and thereby qualify another. Such data-model constructs are, however, almost always unnecessary. To express a complex interrelation of several things, we need only introduce an intermediate node: a hub-like node that connects all the parties to an n-ary relationship.
Intermediate nodes are a common occurrence in many application graph data models. Does their widespread use imply that there is a deficiency in the property graph model? I think not. More often than not, an intermediate node makes visible one more element of the domain — a hidden or implicit concept with informational content and a meaningful domain semantic all its own.
Intermediate nodes are usually self-evident wherever an adverbial phrase qualifies a clause. "Bill worked at Acme from 2005-2007 as a software engineer" leads us to introduce an intermediate node that connects Bill, Acme, and the role of software engineer. It quickly becomes apparent that this node represents a job or an instance of employment, to which we can attach the date properties "from" and "to".
It's not always so straightforward. Some intermediate nodes lie hidden in far more obscure locales. Verbing — the language habit whereby a noun is used as a verb — can often occlude the presence of an intermediate node. Technical and business jargon is particularly rife with such neologisms: we "email" one another, rather than send an email, and "google" for results, rather than search Google.
The verb "email" provides a ready example of the kinds of difficulties we can encounter if we ignore the noun origins of some verbs. The following path shows the result of us treating "email" as a relationship name.

(:Person{name:'Alice'})-[:EMAILED]->(:Person{name:'Lucy'})

This looks straightforward enough. In fact, it's a little too straightforward, for with this construct it becomes extremely difficult to indicate that Alice also carbon-copied Alex. But if we unpack the noun origins of "email", we discover both an important domain concept — the electronic communication itself — and an intermediate node that connects senders and receivers:

(:Person{name:'Alice'})-[:SENT]->(e:Email{subject:'Annual report'})-[:TO]->(:Person{name:'Lucy'}),
(e)-[:CC]->(:Person{name:'Alex'})

If you're struggling to come up with a graph structure that captures the complex interdependencies among several things in your domain, look for the nouns, and hence the domain concepts, hidden on the far side of some of the verb phrases you use to describe the structure of your domain.
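Returning to the earlier "Bill worked at Acme from 2005-2007 as a software engineer" example, a hypothetical Cypher rendering of that intermediate node might look like the following; the relationship names and the Role label are illustrative assumptions rather than the article's own model:

// The intermediate Job node connects the person, the company, and the role,
// and carries the "from" and "to" properties
CREATE (bill:Person {name: 'Bill'}),
       (acme:Company {name: 'Acme'}),
       (role:Role {title: 'Software Engineer'}),
       (bill)-[:HAS_JOB]->(job:Job {from: 2005, to: 2007}),
       (job)-[:AT]->(acme),
       (job)-[:AS]->(role)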
Conclusion
Once a niche academic topic, graphs are now a commodity technology. Neo4j makes it easy to model, store, and query large amounts of variably structured, densely connected data, and we can design and implement an application graph data model by transforming user stories into graph structures and declarative graph-pattern-matching queries. If you're beginning to think in graphs, head over to http://neo4j.com and grab a copy of Neo4j.


Read online on InfoQ

Data Modelling in Graph Databases: Interview with Jim Webber and Ian Robinson

by Srini Penchikala

Jim Webber is chief scientist at Neo Technology, a distributed-systems specialist working on very-large-scale graph data technology.

Ian Robinson works on research and development for future versions of the Neo4j graph database. Harbouring a long-held interest in connected data, he was for many years one of the foremost proponents of REST architectures before turning his focus from the Web's global graph to the realm of graph databases. Follow him on Twitter: @iansrobinson

Graph databases are NoSQL database systems that use a graph data model for storage and processing of data.

Matt Aslett from the 451 Group notes that graphs are now emerging from the general NoSQL umbrella as a category in their own right. In 2014-2015, there has been growth in the category of all things graph.

Data modeling with a graph database is different from how you model the data stored in relational or other NoSQL databases like document databases, key-value data stores, or column-family databases. You can use graph data models to create rich and highly connected data to represent real-world use cases and applications.

InfoQ spoke with Ian Robinson and Jim Webber of Neo Technology, who co-authored O'Reilly's Graph Databases, about data modelling and best practices when using graph databases for data management and analytics.


InfoQ: What type of data is not suitable for storing in a relational database but is a good candidate for a graph database?

Ian Robinson and Jim Webber: That's pretty straightforward to answer: anything that's interconnected either immediately, because the coding and schema design is complicated, or eventually, because of the join-bomb problem inherent in any practical application of the relational model.
Relational databases are fine things, even for large data sets, up to the point where you have to join. And in every relational-database use case that we've seen, there's always a join — and in extreme cases, when an ORM has written and hidden particularly poor SQL, many indiscriminate joins.
The problem with a join is that you never know what intermediate set will be produced, meaning you never quite know the memory use or latency of a query with a join. Multiplying that out with several joins means you have enormous potential for queries to run slowly while consuming lots of (scarce) resources.

What are the advantages of a graph database over a relational database?

As an industry, we've become rather creative at forcing all kinds of data into a relational database (and we've become philosophical about the consequences!). Relational databases are truly the golden hammer of computer-systems development, to the point where many of us are reluctant to drop RDBMS from our tool chain in favour of something better, simply because of familiarity.
However, there are often two drivers underlying a move from a relational database to Neo4j. The first is the observation that your domain is a connected data structure (e.g. social network, healthcare, rights management, real-time logistics, recommendations...) and then understanding that such domains are easy and pleasant to store and query in Neo4j, but difficult and unpleasant in a relational database. Typically these cases are driven by technologists who understand, at least to some degree, that they are dealing with a graph problem and are prepared to use Neo4j to solve that graph problem elegantly and quickly — they don't want to be stuck in a miasma of sparse tables and the über-join table.
The second driver is performance. Join pain in relational databases is debilitating for systems that use them. Perhaps your first join performs well; if you're lucky, maybe even your second does, too. But as the size of a dataset grows, confidence in those joins diminishes as query times get longer and longer. Join-intensive models usually come about because they're trying to solve some kind of connection or path problem, but the maths underlying relational databases simply aren't well suited to emulating path operations. Neo4j has no such penalties for path operations: query times scale linearly with the amount of data you choose to explore as part of your query, not with the overall size of the dataset (which can be enormous). Having join pain is another indicator that a Neo4j graph will be a superior solution to a complex data model in a relational database.

Why use a graph database over other NoSQL databases?

Fowler and Sadalage answer this, we think, very clearly in their book NoSQL Distilled. They point out that of the four types of NoSQL store — graph, key-value, column, and document — the latter three are what they term "aggregate stores". Aggregate stores work well when the pattern of storage and retrieval is symmetric. Store a shopping basket by key, retrieve it by key; store a customer document and retrieve it, and so on.
But when you want to analyse the data across aggregates, things get trickier. How do you find the popular products for different customer demographics? How do you do this in real time, as you're serving a customer on your system right now?
These activities, though basic in domain terms, are tricky to solve with aggregate stores. As a result, developers using these stores are forced to compute rather than query to get an answer. This is why aggregate stores are so keen on map-reduce-style interactions.
Neo4j's data model is far more expressive than aggregate stores or relational databases. Importantly, the model stresses connectivity as a first-class concept. It is connectivity in the graph between customers, products, demographics, and trends that yields the answers to these kinds of real-time analytics problems. Neo4j provides answers by traversing (querying) the graph rather than resorting to latent map-reduce computations.


In a graph, you bring together arbitrary dimensions (different relationship types) at query time to answer sophisticated questions with ease and excellent performance. In non-native graph databases (which includes the other kinds of NoSQL stores), traversals are faked: they happen at the application level in code you have to write and maintain. They're also over the network and are orders of magnitude slower than the graph-native, highly optimised graph query engine that underpins Neo4j.
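The "popular products for different customer demographics" question mentioned earlier could then be asked directly of the graph. A sketch, assuming Customer, Product, and Demographic labels with BOUGHT and IN_DEMOGRAPHIC relationships; these names are illustrative, not from the interview:

// Top products bought by customers in one demographic
MATCH (:Demographic {name: '18-25'})<-[:IN_DEMOGRAPHIC]-(c:Customer)-[:BOUGHT]->(p:Product)
RETURN p.name AS product, count(c) AS purchases
ORDER BY purchases DESC
LIMIT 10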
How do you typically model data with a graph database?

Neo4j uses a graph model called the "labelled property graph". This is a pragmatic model that eschews some of the more esoteric mathematical bits of graph theory in favour of ease of understanding and design.
The labelled property graph consists of nodes (which we typically use to represent entities) connected by relationships. Nodes can be tagged with one or more labels to indicate the role each node plays within our dataset. Every relationship has a name and a direction, which together provide semantic context for the nodes connected by that relationship. Both nodes and relationships can contain one or more properties. We typically use node properties to represent the attributes of an entity, and relationship properties to specify the weight, strength, or quality of that particular relationship.
These graph primitives provide us with a simple, compact, and easy-to-reason-about modelling kit. For example, using Neo4j's Cypher query language, we can easily describe that Alice loves cookies:

(:Person {name: 'Alice'})-[:LOVES]->(:Food {type: 'Cookie'})

This path expression says that there's a node representing a Person named Alice who loves a particular food type, cookie. It's easy to imagine how repeating this many times gives a large and interesting graph of people and their food preferences (or allergies, etc.).
Data modelling consists of using the property-graph primitives — nodes, relationships, properties, and labels — to build an application-specific graph data model that allows us to easily express the questions we want to ask of that application's domain.
When building an application with Neo4j, we typically employ a multistep process, which starts with a description of the problem we're trying to solve and ends with the queries we want to execute against an application-specific graph data model. This process can be applied in an iterative and incremental manner to build a data model that evolves in step with the iterative and incremental development of the rest of the application.

Step 1: Describe the client or end-user goals that motivate our model.
What's the problem we're trying to solve? We've found that agile user stories are great for providing concise, natural-language descriptions of the problems we intend our model to address, but pretty much any description of a requirement can serve as the basis of our modelling efforts.

Step 2: Rewrite those goals as questions we would have to ask of our domain.
An agile user story describes what it is we're trying to achieve with our application. By rewriting each application goal in terms of the questions the domain would have to answer to achieve that goal, we take a step towards identifying how we might go about implementing that application. We're still dealing with an informal description of the solution at this stage, but the domain-specific questions we describe now provide rich input for the next step of the process.

Step 3: Identify the entities and the relationships between them that appear in these questions.
Language itself is a structuring of logical relationships. By attending closely to the language we use to describe our domain and the questions we want to ask of it, we can readily identify a graph structure that represents this logical structuring in terms of nodes, relationships, properties, and labels. Common nouns — "person" or "company", for example — tend to refer to groups or classes of things, or the roles that such things perform: these common nouns become candidate label names. Verbs that take an object indicate how things are connected: these then become candidate relationship names. Proper nouns — a person's or company's name, for example — tend to refer to an instance of a thing, which we model as a node and its properties.

Step 4: Translate these entities and relationships into Cypher path expressions.
In this step, we formalize the description of our candidate nodes, relationships, properties, and labels using Cypher path expressions. These path expressions form the basis of our application graph data model in that they describe the paths, and hence the structure, that we would expect to find in a graph dedicated to addressing our application's needs.

Step 5: Express the questions we want to ask of our domain as graph patterns using path expressions similar to the ones we used to model the domain.


Having derived the basic graph structure from the questions we would want to ask of our domain, we're now in a position to express the questions themselves as graph queries that target this same structure. At the heart of most Cypher queries is a MATCH clause containing one or more path expressions that describe the kind of graph structure we want either to find or to create inside our dataset. If we've been diligent in allowing our natural-language descriptions of the domain to guide the basic graph structure, we will find that many of the queries we execute against our data will use similar path expressions to the ones we used to structure the graph. The key point here is that the resulting structure is an expression of the questions we want to ask of the domain: the model is isomorphic to the queries we want to execute against the model.

Should the modelling happen in the database or application layer?

As will be evident from our description of the modelling process, much of the modelling takes place in the database. The labelled-property-graph primitives allow us to create extremely expressive, semantically rich graph models, with little or no accidental complexity — there are no foreign keys or join tables, for example, to obscure our modelling intent.
That said, there are several characteristics of the domain that are best implemented in the application. Neo4j doesn't store behaviour. Nor does it have any strict concept of what domain-driven design calls an "aggregate": a composite structure bounded by a root entity that controls access to and manages the lifecycle of the whole.
That the graph has no notion of an aggregate boundary (beyond the boundary imposed by a node's record-like structure) is not necessarily a drawback. The graph model, after all, emphasizes interconnectedness. Some of the most valuable insights we can generate from our data require us to take account of connections between things that in any other context would be considered discrete entities. Many techniques of predictive analysis and forensic analysis depend on our being able to infer or identify new composites, new boundary-breaking connected structures that don't appear in our initial conception of the domain. Cypher's rich path-expression syntax, with its support for variable-length paths and optional subgraph structures, effectively allows us to identify and materialize these new composite structures at query time.

What are hyperedges and how should they be modelled?

Hyperedges come from a different graph model known as a hypergraph. A hyperedge is a special kind of relationship that connects more than two nodes. You can imagine a somewhat contrived example where you like something on Facebook and your friend likes that like. While they're beloved of theoreticians and some other graph databases, hyperedges are not a first-class citizen in Neo4j. Our experience is that they're only useful in a relatively small number of use cases. Their narrow utility is offset by their additional complexity and intellectual cost.
You can, of course, model hyperedges in Neo4j just by adding another node (an intermediate node) to the graph. For example, we could cast the original like, which was (alice)-[:LIKES]->(post), as (alice)-[:CREATED]->(like)-[:FOR]->(post) — and now that we have the (like) node, it's easy to like it as (bob)-[:LIKES]->(like), giving hyperedge-equivalent functionality when you need it and avoiding those complexities when you don't (which is most of the time).
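Spelled out as a runnable sketch, assuming Alice, Bob, and the post already exist in the graph (the Post label and id property are invented here for illustration):

// Reify the like as an intermediate node so that it can itself be liked
MATCH (alice:Person {name: 'Alice'}), (bob:Person {name: 'Bob'}), (post:Post {id: 42})
CREATE (alice)-[:CREATED]->(l:Like)-[:FOR]->(post),
       (bob)-[:LIKES]->(l)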
of the domain to guide the basic Many techniques of predictive
graph structure, we will find that analysis and forensic analysis de-
many of the queries we execute pend on our being able to infer
against our data will use similar or identity new composites, new What are the best practices for
path expressions to the ones we boundary-breaking connected modelling the graph data?
used to structure the graph. The structures that don’t appear in
key point here is that the result- our initial conception of the do-
ing structure is an expression of main. Cypher’s rich path-expres- Derive your relationship names
the questions we want to ask of sion syntax, with its support for from your use cases. Doing so
the domain: the model is isomor- variable-length paths and op- creates paths in your model that
phic to the queries we want to tional subgraph structures, ef- align easily with the patterns you
execute against the model. fectively allows us to identify and want to find in your data. This
materialize these new composite ensures that the queries you de-
structures at query time. rive from your use cases only see
these paths, thereby eliminating
Should the modelling happen irrelevant parts of the surround-
in the database or application ing graph from consideration.
layer? What are hyperedges and how As new use cases emerge, you
should they be modelled? can reuse existing relationships
or introduce new ones, as needs
As will be evident from our de- dictate.
scription of the modelling pro- Hyperedges come from a dif- Use intermediate nodes to
cess, much of the modelling ferent graph model known as connect multiple dimensions.
takes place in the database. The a hypergraph. A hyperedge is a Intermediate nodes represent
labelled-property-graph primi- special kind of relationship that domain-meaningful hubs that
tives allow us to create extreme- connects more than two nodes. connect an arbitrary number of
ly expressive, semantically rich You can imagine a somewhat entities. A job node, for exam-
graph models, with little or no contrived example where you ple, can connect a person, a role,
accidental complexity — there like something on Facebook and and a company to represent a
are no foreign keys or join tables, your friend likes that like. While time-bounded instance of em-
for example, to obscure our mod- they’re beloved of theoreticians ployment. As we learn more
elling intent. and some other graph databases, about the context of that job
That said, there are sever- hyperedges are not a first-class (where the individual worked,
al characteristics of the domain citizen in Neo4j. Our experience for example), we can enrich our
that are best implemented in is that they’re only useful in a rel- initial structure with additional
the application. Neo4j doesn’t atively small number of use cas- nodes.
store behaviour. Nor does it have es. Their narrow utility is offset by Connect nodes in linked
any strict concept of what do- their additional complexity and lists to represent logical or tem-
main-driven design calls an “ag- intellectual cost. poral ordering. Using different
gregate”: a composite structure You can, of course, mod- relationships, we can even inter-
bounded by a root entity that el hyperedges in Neo4j just by leave linked lists. The episodes

The episodes of a TV programme, for example, can be connected in one linked list to represent the order in which they were broadcast (using, say, NEXT_BROADCAST relationships), while at the same time being connected in another linked list to represent the order in which they were made (using NEXT_IN_PRODUCTION relationships). Same nodes, same entities, but two different structures.
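For example (a hedged sketch; the episode titles are invented), two orderings can coexist over the same nodes:

MATCH (e1:Episode {title: 'Pilot'}), (e2:Episode {title: 'The Empty Child'})
CREATE (e1)-[:NEXT_BROADCAST]->(e2),
       (e2)-[:NEXT_IN_PRODUCTION]->(e1)

// Walk the broadcast ordering only
MATCH (first:Episode {title: 'Pilot'})-[:NEXT_BROADCAST*]->(later:Episode)
RETURN later.title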
Can you discuss the design considerations for graph traversal?

Unlike RDBMS, query latency in a graph is proportional to how much of the graph you choose to search. That means you as a query designer should try to minimize the amount of graph you search. You should search just enough to answer your query.

In Neo4j, that means adding as many constraints into your query as you can. Constraints will prevent the database from searching paths that you already know will be fruitless for your result. For example, if you know that you’re only interested in other people in a large social graph who are single and share compatible sexual orientation/hobbies/interests, you should constrain your search accordingly. That way, you’ll avoid visiting parts of that social network with incompatible sexual orientation/boring hobbies/bizarre interests, and you’ll get an answer much more quickly.

In Cypher, you can express a poor match very loosely, like this:

(me)-[*]->(other)

This query matches all relationship types, at any depth (that’s the asterisk), to any kind of thing. As a result, it would likely visit the whole graph many times over, thereby impacting performance. In contrast, because we know some domain invariants, we can instead cast this query as:

(me)-[:FRIEND]->()-[:FRIEND]->(other)

Here we’ve constrained both the relationship type and the depth to find only friends of friends (because dating friends is yucky). This is better, but we can go even further, adding additional constraints such as gender, sexual preference, and interests:

(me)-[:FRIEND]->()-[:FRIEND]->(other:Heterosexual:Female)-[:LIKES]->(:TV_SHOW {title: ‘Doctor Who’})

Now we’ll only match against heterosexual females (the two colon-separated labels on the other node) who also like the TV show titled Doctor Who. Neo4j can now aggressively prune any parts of the graph that don’t match, significantly reducing the amount of work that needs to be done, and thereby keeping latency very low (typically small milliseconds even for large data sets).
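Putting those constraints into a complete, runnable query might look like the following sketch (the Person label, the name property, and the starting person are our assumptions, not part of the interview):

MATCH (me:Person {name: 'Brian'})-[:FRIEND]->()-[:FRIEND]->(other:Heterosexual:Female)-[:LIKES]->(:TV_SHOW {title: 'Doctor Who'})
WHERE other <> me
RETURN DISTINCT other.name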
Are there any anti-patterns when working with graph data?

There certainly are anti-patterns that can catch out the unwary or people who are new to graph modelling. For example, in Graph Databases, we discuss a case of e-mail forensics (in Chapter 3). In that example, we’re looking for potentially illegal activities in an organization by individuals swapping information for purposes like insider trading (think: Enron).

When we design graph models, we sanity-check the graph by reading it. For example, if we had the structure (Alice)-[:EMAILED]->(Bob) then we might think we’ve built a sound model since it reads well left to right (Alice e-mailed Bob) and makes sense the reverse way (Bob was e-mailed by Alice).

Initially, we went with this model but it soon became apparent that it was lossy. When it came time to query an e-mail that could violate communications policy, we found the e-mail didn’t actually exist — quite a problem! Furthermore, where we expected to see several e-mails confirming the corrupt activity, all we saw was that Alice and Bob had e-mailed each other several times. Because of our imprecise use of English, we’d accidentally encoded a core domain entity — the e-mail itself — into a relationship when it should have been a node.

Fortunately, once we’d understood that we’d created a lossy model, it was straightforward to correct it using an intermediate node: (Alice)-[:SENT]->(email)-[:TO]->(Bob). Now we have that intermediate node representing the e-mail, we know who sent it (Alice) and to whom it was addressed (Bob). It’s easy to extend the model so that we can capture who was CC’d or BCC’d like so: (Charlie)<-[:CC]-(email)-[:BCC]->(Daisy). From here, it’s easy to see how we’d construct a large-scale graph of all e-mail exchanged and map out patterns that would catch anyone violating the rules, but we’d have missed them if we hadn’t thought carefully about nodes and relationships.

If there was just one piece of advice for people coming fresh to graph modelling, it would be don’t (accidentally) encode entities as relationships.
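To make the corrected model concrete, here is a hedged Cypher sketch (the subject property and the follow-up query are our own illustration):

// The e-mail is a node in its own right
MATCH (alice:Person {name: 'Alice'}), (bob:Person {name: 'Bob'}),
      (charlie:Person {name: 'Charlie'}), (daisy:Person {name: 'Daisy'})
CREATE (alice)-[:SENT]->(email:Email {subject: 'Re: the numbers'}),
       (email)-[:TO]->(bob),
       (charlie)<-[:CC]-(email)-[:BCC]->(daisy)

// Everyone a given sender's e-mails reached, however they were addressed
MATCH (:Person {name: 'Alice'})-[:SENT]->(email:Email)-[:TO|:CC|:BCC]->(recipient:Person)
RETURN email.subject, collect(recipient.name) AS recipients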

Can you talk about any gotchas or limitations of graph databases?

The biggest gotcha we both found when moving from RDBMS to graphs some years back (when we were users of early Neo4j versions, way before we were developers on the product) was the years of entrenched relational modelling. We found it difficult to leave behind the design process of creating a normalized model and then denormalizing it for performance — we didn’t feel like we’d worked hard enough with the graph to be successful.

It was agonizing in a way. You’d finish up some piece of work quickly and the model would work efficiently with high fidelity, then you’d spend ages worrying that you’d missed something.

Today, there are materials to help get people off the ground far more quickly. There are good books, blog posts, Google groups, and a healthy Meetup community all focussed on graphs and Neo4j. That sense of not working hard enough is quickly dispelled once you’ve gotten involved with graphs and you get on with the real work of modelling and querying your domain — the data model gets out of the way, which is as it should be.

What is the current status of standards in graph data management with respect to data queries, analytics, etc.?

As with most NoSQL stores, it’s a little early to think about standards (though the folks at Oracle would seemingly disagree). At last year’s NoSQL Now! conference, the general opinion was that operational interfaces should be standardized so that users can plug their databases into a standard monitoring and alerting infrastructure. Indeed, that’s now happening.

In terms of standards specific to the graph space, it’s still premature. For example, Neo4j recently introduced labels for nodes, something that didn’t exist in the previous 10-year history of the database. Now that graph databases are taking off, this kind of rapid innovation in the model and in the implementation of graph databases means it’s too early to try to standardize: our industry is still too immature and the lack of standards is definitely not inhibiting take-up of graph databases, which are now the fastest growing segment of the whole data market (RDBMS and NoSQL).

In the meantime, since Neo4j makes up in excess of 90% of the entire graph database market, it acts as a de facto standard itself, with a plethora of third-party connectors and frameworks allowing you to plug it into your application or monitoring stack.

What is the future of graph databases in general and Neo4j in particular?

Graphs are a very-general-purpose data model. Much as we have seen RDBMS become a technology that has been widely deployed in many domains in the past, we expect graph databases to be even more widely deployed across many domains in the future. That is, we expect graph databases to be the default choice and the first model that you think of when you hear the word “database.”

To do that, there are a few things that we need to solve. Firstly, graph databases need to be even easier to use — the out-of-box experience has to be painless or even pleasurable. Although the model is always going to take a little more learning than a document store (it is, after all, a richer model), there are certain mechanical things we can do just to make that learning curve easier.

You’ve seen this with the recent release of Neo4j 2.0, where we introduced labels to the graph model, provided declarative indexing based on those labels, and built a fabulous new UI with excellent visualization and a new version of Cypher with a productive REPL. But our efforts don’t end there: soon we’ll be releasing a point version of Neo4j that takes all the pain out of bulk imports to the database (something all users do at least once) and then we’ll keep refining our ease of use.

The other research and development thread that we’re following is performance and scale. For example, we’ve got some great ideas to vertically scale ACID transactional writes in Neo4j by orders of magnitude using a write window to batch IO. From the user’s perspective, Neo4j remains ACID compliant, but under the covers, we amortize the cost of IO across many writes. This is only possible in native graph databases like Neo4j because we own the code all the way down to the disk and can optimize the whole stack accordingly. Non-native graph databases simply don’t have this option.

For performance in the large, we’re excited by the work by Bailis et al (2013) on highly available transactions (HATs), which provide non-blocking transactional agreement across a cluster, and by RAMP transactions, which maintain ACID constraints in a database cluster while allowing non-contending transactions to execute in parallel. These kinds of ideas form the mechanical bedrock for our work on highly distributed graph databases, on which you’ll hear more from us in the coming months.


Read online on InfoQ

High Tech, High Security:


Security Concerns in Graph Databases

George Hurlburt is chief scientist at STEMCorp, a nonprofit that works to further economic development via the
adoption of network science and to advance autonomous technologies as useful tools for human use. Contact him at
[email protected].

Cybersecurity measures are best accommodated in system design because retrofits can be costly. New technologies and applications, however, bring new security and privacy challenges, and the consequences of new technology are often difficult to anticipate. Such is the case with graph databases, a relatively new database technology that’s gaining popularity.

The emergence of NoSQL
The relational-database management system (RDBMS), initially designed to maximize highly expensive storage, has indeed proven to be highly effective in transaction-rich and process-stable environments. For example, the RDBMS excels in large-scale credit-card transaction processing and cyclic billing operations. It offers superior performance in the realm of indexed spatial data, but it fares poorly in highly dynamic environments, such as in a management information system that depends on volatile data or a systems architecture with a high churn of many-to-many relationships. In such environments, RDBMS design imposes far too much mathematical and managerial overhead.

The rise of the NoSQL database represents an alternative to the decades-long reign of the RDBMS. Various forms of NoSQL database opened doors to a vastly improved dynamic data portrayal with far less overhead. For example, schemas need not be as rigorous in the NoSQL world. NoSQL database designs include wide-column stores, document stores, key-value (tuple) stores, multimodal databases, object databases, grid/cloud databases, and graph databases.
The graph database, crossing many lines in the NoSQL world, stands poised to become a successful technology.

The graph database
The graph database relies on the familiar node-arc-node or, perhaps more simplistically, noun-verb-noun relationship of a network (see Figure 1). A node can be any object. An arc represents the relationship between nodes. Both nodes and arcs can contain properties. This simple node-arc-node triad, often called a triple, is the fundamental building block for describing all manner of complex networks in great detail.

Figure 1. This simple node-arc-node triad, often called a triple, is the fundamental building block for describing all manner of complex networks in great detail.
Networks such as an electrical grid, a corporate supply chain, or an entire ecosystem are often composed of numerous nodes that share huge numbers of multiple relationships across arcs. Networks of all kinds lend themselves well to graph representation. The graph database harnesses this powerful capability to represent network composition and connectivity. Graph databases have matured to support discovery, knowledge management, and even prediction.

In an Internet-connected world, where networks of all types become increasingly preeminent, such a network capability is becoming essential to modern sense making. However, like the RDBMS, the graph database is just another tool in the box, and it can be harnessed for good or ill. Thus, it’s not premature to consider the large-scale security implications of this new and rather exciting technology, at least at the highest levels.

Graph discovery
Because they deal with properties and connections, graph databases represent rich pools of information, often hidden until discovered. Discovery is a means by which a large collection of related data is mined for new insights, without a strong precognition of what these insights might be.

The graph database wasn’t initially considered a useful tool for discovery. It took a specially designed family of supercomputers to realize the full power of graph discovery. Although it’s straightforward to represent graphs, as the volume of triples increases into the billions, the ability to rapidly traverse multiple paths becomes compute-bound in all but the most powerful machines.

This is particularly true in the case of dense graphs, such as tightly woven protein-protein networks. Here, detailed graph queries can overwhelm less capable computational systems. The graph supercomputer, built from the ground up to traverse graphs, overcomes time and capacity limitations. Such devices, some complete with Hadoop analysis tools, recently became available in the high-end graph-database marketplace via Cray.

The high-end graph supercomputer, built for discovery, brings great promise. For example, it can support a detailed build-out of the complex relationships between the ocean and atmosphere that compose climatic conditions. In a time of great climate change, further discovery of indirect, nonlinear causes and effects becomes increasingly crucial. Likewise, a graph supercomputer could hasten a discovery concerning the spread of Ebola in Western Africa, which could serve to stem the spread of the disease. Figure 2 illustrates the notion of discovery using a graph database.

Figure 2. A graph database harnessed for discovery. Such discovery could support a detailed build-out of the complex relationships between ocean and atmosphere that compose climatic conditions, or could hasten the discovery of how Ebola might spread in Western Africa.


Discovery: Privacy and security
Graph discovery, which has great promise for re-
solving complex interrelated problems, presents
privacy and security concerns, however. For ex-
ample, one’s identity can be further laid bare if
the graph supercomputer becomes the device
of choice to further mine our social and financial
transactions for purposes of surveillance, target-
ed advertising, and other overt exploits that tend
to rob individuals of their privacy.
While perhaps it’s an alien thought in a
thriving free enterprise system, placing an ethi-
cal bar on the acceptable extent of intrusion into
one’s personal life might well prove necessary for
financial, if not constitutional, reasons. It’s quite
acceptable to expect law enforcement to use all
necessary means to remove real threats from our
midst, but at what expense to the rest of society?
Likewise, those anxious to move their products
will take advantage of every opportunity to do
so by whatever means possible, but at what per-
sonal price for those targeted? Such high-end
exploitation amounts to nothing more than a
projection of currently established trends.
In the design of such socioeconomic stud-
ies, especially involving a wide range of social
and business transaction relationships, the se-
curity bar must be set exceedingly high. Any in-
tentionally perpetrated breach could be far more
devastating than recent massive hacks against
corporations such as credit-card issuers or mo-
tion-picture companies. This is further exacerbat-
ed by the notion that the Internet of Anything
(IoA) consists of myriads of sensors, actuators,
and mobile devices, all of which seem to be opti-
mized for privacy leakage.

Graph knowledge management


The concept of the node-arc-node triple strong-
ly resembles the subject-predicate-object rela-
tionships expressed in the Resource Description
Framework (RDF) descriptive language. RDF
creates a level of formal expression that lets us
describe and reason about the data housed in
a graph database. Moreover, RDF nicely feeds
a formal ontology, thus permitting a rigorous
semantic definition of terms. The “how much is
enough” question, however, might take years to
resolve with regard to a tolerable degree of prac-
tical formalization.
Together, RDF and a formal ontology speak
to the World Wide Web Consortium (W3C) view
of linked data: an endeavor to make reusable
structured knowledge generally available in a
common referential format via the Web. There’s a
downside though. Whereas it’s relatively straightforward to convert highly structured data, such as well-organized spreadsheets and databases, into RDF, only high-end tools can reliably convert unstructured data into RDF, and that carries some restrictive caveats.

Not all graph databases, however, require RDF-style triple representation. A number of thriving commercial graph databases employ triples in their own unique ways without engaging RDF. Many offer a number of attractive features, such as graph visualization, backup, and recovery. Emil Eifrem, founder and CEO of Neo Technology, expects that these tools will attract the corporate nod and the consumer base will continue to grow, pushing the graph-database industry from 2 percent to an estimated 25 percent of the database market by 2017. Of course, many companies employ their own languages and techniques for data management. A real need exists for standards that, at a minimum, support data transportability.

Knowledge management: Privacy and security
Security, particularly for proprietary architectural designs, must be taken into consideration. If Web sharing is envisioned as a reasonable means to generate a lot of system-representative triples from resident experts, a secure portal to the RDF data store becomes exceedingly important. User authentication and verification also become important.

Although knowledge management is perhaps less extensive than discovery, related databases still might possess specific identity attributes that must be well protected. Front-end provisions must assure the existence of both security against intrusion and the privacy of any personal data contained in the graph database. Failure to offer adequate protection could disqualify otherwise promising candidate graph databases for their vulnerability to attack.

Graph prediction
In dynamic circumstances involving an unfolding process such as weather or economic trends, the ability to predict future behavior becomes highly desirable.

Graph representations facilitate predictions because they let us both qualify and quantify a system represented as a network. The ability to assign properties to nodes and arcs — such as location, time, weights, or quantities — lets us qualitatively evaluate the graph on the basis of similar properties. More importantly, quantitative techniques let us evaluate metrics inherent in almost all graphs.

The ability to apply proven metrics to graphs means that their characteristics might be quantified to allow an objective evaluation of the graph. In cases where graph data is dynamic, such as in an ongoing process, a powerful predictive capability becomes possible, assuming the data stream is accessible. This approach presumes combinations of graph theory and combinatorial mathematics can be applied against a real-time data stream. Moreover, various graph configurations could be classified based on their metrics. Such classification templates, each with a graph signature based on its metrics, could then permit identification of and a predictive baseline for similar graphs as they arise.

Prediction: Security and privacy
Current cybersecurity best practices suggest the importance of taking a snapshot of a system under study to determine its security and privacy vulnerabilities, leading to the accreditation of systems proven to be “secure”. The fallacy of such practice is that most systems are influenced by ever-changing environments, which serve to change systemic behaviors over time. Thus, the accreditation is good only for the moment of time at which the snapshot was taken.

Given their growing sophistication, graph databases offer the potential to let us monitor dynamic change in near real time. By monitoring data streams with quantitative methods, looking for anomalous node or changing relationship patterns, we could detect and investigate intrusions and other security breaches early on, quickly prosecuting any identified perpetrators.
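One very simple structural metric of this kind is node degree. A hedged Cypher sketch (the threshold is an arbitrary assumption) that flags unusually connected nodes in a monitored graph might look like this:

// Surface nodes whose connectivity is far above the norm
MATCH (n)-[r]-()
WITH n, count(r) AS degree
WHERE degree > 1000
RETURN n, degree
ORDER BY degree DESC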
From the predictive perspective, data integrity must take a front seat. Data provenance becomes a crucial issue because the stakes of prediction are high. The results of a prediction are as accurate as the data underlying the predictive tools. False data could gravely affect outcomes and literally endanger security. Consider the consequence of a faulty predictive model for disaster relief, which calls for distributing resources in an unaffected region as opposed to the affected region. In this regard, good security practice results in the highest ethical standards of applied science.

Although graph databases hold great promise in a world wrapped in networks of all kinds, they also contain some inherent security risks that have yet to be fully understood, much less appreciated. Rather than piling on the bandwagon, the prudent IT professional must carefully evaluate potential risks in the context of the intended operating environment and perform the necessary tradeoffs to achieve acceptable levels of security and data protection. If security and privacy issues surrounding relatively new technologies, such as increasingly popular graph databases, aren’t considered up-front, they become far more costly to implement downstream.


Read online on neo4j.com

Graph Databases in the Enterprise:


Fraud Detection

By Jim Webber and Ian Robinson

Banks and insurance companies lose billions of dollars every year to fraud.
Traditional methods of fraud detection fail to minimize these losses since they perform discrete analyses that are susceptible to false positives (and false negatives). Knowing this, increasingly sophisticated fraudsters have developed a variety of ways to exploit the weaknesses of discrete analysis.

Graph databases, on the other hand, offer new methods of uncovering fraud rings and other complex scams with a high level of accuracy through advanced contextual link analysis, and they are capable of stopping advanced fraud scenarios in real time.

The key challenges in fraud detection
Between the enormous amounts of data available for analysis and today’s experienced fraud rings (and solo fraudsters), fraud-detection professionals are beset with challenges. Here are some of their biggest:
• Complex link analysis to discover fraud patterns — Uncovering fraud rings requires you to traverse data relationships with high computational complexity, a problem that’s exacerbated as a fraud ring grows.
• Detect and prevent fraud as it happens — To prevent a fraud ring, you need real-time link analysis on an interconnected dataset, from the time a false account is created to when a fraudulent transaction occurs.
• Evolving and dynamic fraud rings — Fraud rings are continuously growing in shape and size, and your application needs to detect these fraud patterns in this highly dynamic and emerging environment.

Overcoming fraud detection challenges with graph databases
While no fraud-prevention measures are perfect, significant improvements occur when you look beyond individual data points to the connections that link them.
Understanding the connections between data, and deriving meaning from these links, doesn’t necessarily mean gathering new data. You can draw significant insights from your existing data simply by reframing the problem in a new way: as a graph.

Unlike most other ways of looking at data, graphs are designed to express relatedness. Graph databases uncover patterns that are difficult to detect with traditional representations such as tables. An increasing number of companies use graph databases to solve a variety of connected-data problems, including fraud detection.

Example: E-commerce fraud
As our lives become increasingly digital, a growing number of financial transactions are conducted online. Fraudsters have adapted quickly to this trend and have devised clever ways to defraud online payment systems.

While this type of activity can and does involve criminal rings, even a single well-informed fraudster can create a large number of synthetic identities to carry out sizeable schemes.

Consider an online transaction with the following identifiers: user ID, IP address, location, tracking cookie, and credit-card number. Typically, the relationships between these identifiers should be (almost) one-to-one. Some variations naturally account for shared machines, families sharing a single credit-card number, individuals using multiple computers, and the like.

However, as soon as the relationships between these variables exceed a reasonable number, fraud should be considered as a strong possibility. The more interconnections exist amongst identifiers, the greater the cause for concern. Large and tightly knit graphs are very strong indicators that fraud is taking place. See Figure 1 for an example.

Figure 1. A graph of a series of transactions from different IP addresses with a likely fraud event occurring from IP1, which has carried out multiple transactions with five different credit cards.

By putting checks into place and associating them with the appropriate event triggers, such schemes can be discovered before they are able to inflict significant damage.

Triggers can include events such as logging in, placing an order, or registering a credit card — any of which can cause the transaction to be evaluated against the fraud graph. Fan-out might be skipped, but complex graphs can be flagged as a possible instance of fraud.
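A hedged Cypher sketch of one such check (the labels, relationship types, and threshold are our own assumptions, not a prescribed Neo4j fraud model):

// Flag IP addresses whose transactions fan out to many different credit cards
MATCH (ip:IPAddress)<-[:FROM_IP]-(txn:Transaction)-[:USED_CARD]->(card:CreditCard)
WITH ip, count(DISTINCT card) AS cards
WHERE cards > 3
RETURN ip.address AS suspect_ip, cards
ORDER BY cards DESC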
Conclusion
When it comes to graph-based fraud detection, you need to augment your fraud-detection capability with link analysis. That being said, two points are clear:
• As business processes become faster and more automated, the time margins for detecting fraud are narrowing, increasing the need for a real-time solution.
• Traditional technologies are not designed to detect elaborate fraud rings. Graph databases add value through analysis of connected data points.
Graph databases are the ideal enabler for efficient and manageable fraud-detection solutions. Graph databases uncover a variety of important fraud patterns, from fraud rings and collusive groups to educated criminals operating on their own — all in real time.


Read online on InfoQ

Full-Stack Web Development Using Neo4j

Brian Underwood is a software engineer and lover of all things data. As a developer advocate
for Neo4j and co-maintainer of the Neo4j Ruby gem, Brian regularly lectures and writes on his
blog about the power and simplicity of graph databases. Brian is currently traveling the world
with his wife and son. Follow Brian on Twitter or join him on LinkedIn.

When building a full-stack Web application, you have many choices for
the database that you will put on the bottom of the stack. You want a
database that is dependable, certainly, but which also allows you to
model your data well. Neo4j is a good choice as the foundation of your
Web-application stack if your data model contains a lot of connected
data and relationships.
What is Neo4j?
Neo4j is a graph database, which simply means that instead of storing data in tables or collections, it stores data as nodes and relationships between nodes. In Neo4j, both nodes and relationships can contain properties with values. In addition:
• Nodes can have zero or more labels (like “Author” or “Book”).
• Relationships have exactly one type (like “WROTE” or “FRIEND_OF”).
• Relationships are always directed from one node to another (but can be queried regardless of direction).
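As a quick, hedged illustration of that model (the names and property values here are invented):

CREATE (a:Author {name: 'Jane Doe'})-[:WROTE {year: 2015}]->(b:Book {title: 'Graphs at Work'})

This creates two labelled nodes, a directed WROTE relationship, and properties on both the nodes and the relationship.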
Why Neo4j?
To choose a database for a Web application, you should consider what it is that you want from it. Top criteria include:
• Is it easy to use?
• Will it let you easily respond to changes in requirements?
• Is it capable of high-performance queries?
• Does it allow for easy data modeling?
• Is it transactional?
• Does it scale?
• Is it fun (sadly, an often overlooked quality in a database)?



Figure 1. The Neo4j Web console.

In this respect, Neo4j fits the bill nicely:
• It has its own, easy-to-learn query language (Cypher).
• It’s schema-less, which allows it to be whatever you want it to be.
• It can perform queries on highly related data (graph data) much faster than traditional databases.
• It has an entity and relationship structure that naturally fits human intuition.
• It supports ACID-compliant transactions.
• It has a high-availability mode for query throughput, scaling, backups, data locality, and redundancy.
• It’s hard to grow tired of its visual query console.

When to not use Neo4j?
While Neo4j, as a graph NoSQL database, has a lot to offer, no solution can be perfect. Some use cases where Neo4j isn’t as good of a fit are for:
• recording large amounts of event-based data such as log entries or sensor data,
• large-scale distributed data processing like with Hadoop,
• binary data storage, and
• structured data that’s a good candidate for storage in a relational database.

In the sample graph at the beginning of this section, you can see a graph of Author, City, Book, and Category and the relationships that tie these together. To use Cypher to find all authors in Chicago and show that result in the Neo4j console, you could execute the following search.

001 MATCH
002 (city:City)<-[:LIVES_IN]-(:Author)-[:WROTE]->
003 (book:Book)-[:HAS_CATEGORY]->(category:Category)
004 WHERE city.name = “Chicago”
005 RETURN *

Note the ASCII-art syntax that has parentheses surrounding nodes and arrows representing the relationships that point from one node to another. This is Cypher’s way to match a given subgraph pattern.

Of course, Neo4j isn’t just about pretty graphs. If you wanted to count the categories of books by the location (City) of the author, you can use the same MATCH pattern to return a different set of columns, like so:

001 MATCH
002 (city:City)<-[:LIVES_IN]-(:Author)-[:WROTE]->
003 (book:Book)-[:HAS_CATEGORY]->(category:Category)
004 RETURN city.name, category.name, COUNT(book)

That would return the following:

city.name    category.name    COUNT(book)
Chicago      Fantasy          1
Chicago      Non-Fiction      2

While Neo4j can handle big data, it isn’t Hadoop, HBase, or Cassandra and you won’t typically be crunching massive (petabyte) analytics directly in your Neo4j database. But when you’re interested in serving up information about an entity and its data neighborhood (like you would when generating a webpage or an API result), it is a great choice for anything from simple CRUD access to a complicated, deeply nested view of a resource.
Which stack should you use with Neo4j?
All major programming languages have support for Neo4j via the HTTP API, either via a basic HTTP library or via a number of native libraries that offer higher-level abstractions. Since Neo4j is written in Java, all languages that have a JVM interface can take advantage of the high-performance APIs in Neo4j.

Neo4j also has its own “stack” to allow you to choose different access methods ranging from easy access to raw performance. It offers:
• an HTTP API for making Cypher queries and retrieving results in JSON,
• an “unmanaged extension” facility in which you can write your own endpoints for your Neo4j database,
• a Java API for specifying traversals of nodes and relationships at a higher level,
• a low-level batch-loading API for massive initial data ingestion, and
• a core Java API for direct access to nodes and relationships for maximum performance.

An application example
I recently took on a project to expand a Neo4j-based application. The application (which you can see at graphgist.neo4j.com) is a portal for GraphGists. A GraphGist is a simple AsciiDoc text file (with images if need be) that describes the data model, setup, and use-case queries to be executed. A reader’s browser interactively renders the file and visualizes it live. A GraphGist is much like an IPython notebook or an interactive white paper. It also allows readers to write their own queries to explore the dataset from the browser.

Neo Technology, the creators of Neo4j, wanted to provide a showcase for GraphGists created by the community, and this was my project. Neo4j was used as the back end, of course, but for the rest of the stack I used Node.js with Express.js and the Neo4j package, Angular.js, and Swagger UI.


All of the code is open-sourced and available on GitHub.

This GraphGist portal is conceptually a simple app, providing a list of GraphGists and allowing users to view details about each as well as the GraphGist itself. The data domain consists of Gist, gist categories of Keyword/Domain/UseCase, and Person (for the authors). (Figure 2)

Now that you’re familiar with the model, I’d like to give you a quick intro to the Cypher query language before we dig deeper. For example, if we wanted to return all gists and their keywords, we could do the following.

001 MATCH (gist:Gist)-[:HAS_KEYWORD]->(keyword:Keyword)
002 RETURN gist.title, keyword.name

This would give a table with one row for every combination of Gist and Keyword, just like an SQL join. If we want to find all Domains for which a given Person has written Gists, we could perform the following query.

001 MATCH (person:Person)-[:WRITER_OF]->(gist:Gist)-[:HAS_DOMAIN]->(domain:Domain)
002 WHERE person.name = “John Doe”
003 RETURN domain.name, COUNT(gist)

This would return another table of results. Each row of the table would have the name of the Domain accompanied by the number of Gists that the Person has written for that Domain. There’s no need for a GROUP BY clause because when we use an aggregate function like COUNT(), Neo4j automatically groups by the other columns in the RETURN clause.

Let’s look at a real-world query from our app. When building the portal, it was useful to be able to provide a way to make just one request to the database and retrieve all the data that we need with almost exactly the format in which we want it.

Let’s build the query that the portal’s API uses (you can view it on GitHub). First, we need to match the Gist in question by its title property to any related Gist nodes.

001 // Match Gists based on title
002 MATCH (gist:Gist) WHERE gist.title =~ {search_query}
003 // Optionally match Gists with the same keyword
004 // and pass on these related Gists with the
005 // most common keywords first
006 OPTIONAL MATCH (gist)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(related_gist)

There are a couple of things to note here. Firstly, the WHERE clause is matching the title using a regular expression (that’s the =~ operator) and a parameter. Parameters are features of Neo4j that separate the query from the data that the query uses. Parameters let Neo4j cache queries and query plans, and it also means that you don’t need to worry about query-injection attacks. Secondly, we’re using an OPTIONAL MATCH clause here, which simply means that we still want to return the Gist that we’re originally matching with even if there are no related gists.

Now, let’s take that part of the query and expand on it by replacing the RETURN clause with a WITH clause. (Code 1)

The COLLECT() in the RETURN transforms a result with pairs of Gist and related_gist nodes so that each row has the Gist only once, along with an array of related_gist nodes. Inside COLLECT(), we specify only the data we need from the related gists in order to reduce the size of our response.

Lastly, we’ll take the query so far and use WITH one last time. (Code 2)

In this last part, we optionally match all associated Domain, UseCase, Keyword, and Person nodes and collect them together just like we did with related Gists. Rather than having a flat, denormalized result, we now return a list of Gist nodes with arrays of associated “has many” relationships with no duplication. Pretty cool!

If tables of data are too old school for you, Cypher can return objects as well.

001 RETURN
002 {
003 gist: gist,
004 domains: collect(DISTINCT domain.name) AS domains,
005 usecases: collect(DISTINCT usecase.name) AS usecases,
006 writers: collect(DISTINCT writer.name) AS writers,
007 keywords: collect(DISTINCT keyword.name) AS keywords,
008 related_gists: related
009 }
010 ORDER BY gist.title

Traditionally in a decently sized Web application, you need a number of database calls to populate the HTTP response. Even if you can execute queries in parallel, it is often necessary to get the results of one query before you can make a second to get related data. In SQL, you can execute complicated and expensive joins on tables to get results from many tables in one query, but anybody who has done more than a couple of SQL joins in the same query knows how quickly that can get complicated — not to mention that the database still needs to scan tables or indexes to get the associated data. In Neo4j, retrieving entities via relationships uses pointers directly to the related nodes so that the server can traverse right where it needs to go.


001 MATCH (gist:Gist) WHERE gist.title =~ {search_query}
002 OPTIONAL MATCH (gist)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(related_gist)
003 WITH gist, related_gist, COUNT(DISTINCT keyword.name) AS keyword_count
004 ORDER BY keyword_count DESC
005
006 RETURN
007 gist,
008 COLLECT(DISTINCT {related: { id: related_gist.id, title: related_gist.title,
poster_image: related_gist.poster_image, url: related_gist.url }, weight: keyword_
count }) AS related

Code 1

001 MATCH (gist:Gist) WHERE gist.title =~ {search_query}


002 OPTIONAL MATCH (gist)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(related_gist)
003 WITH gist, related_gist, COUNT(DISTINCT keyword.name) AS keyword_count
004 ORDER BY keyword_count DESC
005
006 WITH
007 gist,
008 COLLECT(DISTINCT {related: { id: related_gist.id, title: related_gist.title,
poster_image: related_gist.poster_image, url: related_gist.url }, weight: keyword_
count }) AS related
009
010 // Optionally match domains, use cases, writers, and keywords for each Gist
011 OPTIONAL MATCH (gist)-[:HAS_DOMAIN]->(domain:Domain)
012 OPTIONAL MATCH (gist)-[:HAS_USECASE]->(usecase:UseCase)
013 OPTIONAL MATCH (gist)<-[:WRITER_OF]-(writer:Person)
014 OPTIONAL MATCH (gist)-[:HAS_KEYWORD]->(keyword:Keyword)
015
016 // Return one Gist per row with arrays of domains, use cases, writers, and
keywords
017 RETURN
018 gist,
019 related,
020 COLLECT(DISTINCT domain.name) AS domains,
021 COLLECT(DISTINCT usecase.name) AS usecases,
022 COLLECT(DISTINCT writer.name) AS writers,
023 COLLECT(DISTINCT keyword.name) AS keywords
024 ORDER BY gist.title

Code 2

That said, there are a couple of downsides to this approach. While it’s possible to retrieve all of the data required in one query, the query is quite long. I haven’t yet found a way to modularize it for reuse. Along the same lines, we might want to use this same endpoint in another place but show more information about the related gists. We could modify the query to return that data but then it would be returning data unnecessary for the original use case.

We are fortunate to have many excellent database choices. While relational databases are still the best choice for storing structured data, NoSQL databases are a good choice for managing semi-structured, unstructured, and graph data. If you have a data model with a lot of connected data and want a database that is intuitive, fun, and fast, you should get to know Neo4j.

This article was authored by Brian Underwood with contributions from Michael Hunger.

