Graph Databases: NoSQL Database Solution for Managing Linked Data
eMag Issue 34 - October 2015
Srini Penchikala currently works as Senior Software Architect at a financial
services organization in Austin, Texas. He is also the Lead Editor for Big Data
and NoSQL Database communities at InfoQ. Srini has over 20 years of experience
in software architecture, design and development. He is currently authoring a
book on Big Data Processing with Apache Spark. He is also the co-author of the
book "Spring Roo in Action" from Manning Publications. Srini has presented at
conferences like Big Data Conference, Enterprise Data World, JavaOne, SEI
Architecture Technology Conference (SATURN), IT Architect Conference (ITARC),
No Fluff Just Stuff, NoSQL Now and Project World Conference. He has also
published several articles on software architecture, security and risk
management, and NoSQL databases on websites like InfoQ, The ServerSide,
O'Reilly Network (ONJava), DevX Java, java.net and JavaWorld.
A LETTER FROM THE EDITOR
Graph databases are getting a lot of attention lately. They are used to manage connected data and are better solutions for several real-world use cases which require the mapping of relationships between data entities for data-driven business decision making.

Real-world use cases of graph databases include fraud detection, Anti-Money Laundering (AML) and Master Data Management (MDM), Trading Analytics, and Online Gaming.

The graph data management space includes three different areas:
1. Specialized graph databases to store the connected data. These databases include Neo4J, TitanDB, InfiniteGraph, and AllegroGraph.
2. Graph data processing and real-time graph analytics frameworks like Spark GraphX and GraphLab.
3. Graph data visualization tools which give non-technical users insights into the connected data, to use for data-driven business decision making.

Another emerging trend in the graph database space is multi-model databases. NoSQL databases like OrientDB support storing both document and graph data sets.

This eMag focuses on the graph database landscape and the real-world use cases of graph databases. It includes articles and interviews covering topics like data modeling in graph databases and how companies like Pinterest use graph databases in their applications. It also includes an article on full-stack web development using a graph database so that readers can see the power of graph databases to manage connected data.
Ian Robinson works on research and development for future versions of the Neo4j graph
database. Harbouring a long-held interest in connected data, he was for many years one of the
foremost proponents of REST architectures before turning his focus from the Web's global graph
to the realm of graph databases. Follow him on Twitter: @iansrobinson
Why we would consider using a graph database to tackle complexity, generate insight, and bring value to end users. More specifically, to wrest insight from the kind of complexity that arises wherever three contemporary forces meet: an increase in the amount of generated and stored data, a need to accommodate a high degree of structural variation, and a need to understand the multiple facets of connectedness inherent in the domain to which the data belongs.

Increased data size (big data) is perhaps the best understood of these three forces. The volume of new data is growing exponentially each year, a trend that looks set to continue for the foreseeable future. But as the volume of data increases and we learn more about the instances in our domain, each instance begins to look subtly unique. In other words, as data volumes grow, we trade insight for uniformity. The more data we gather about a group of entities, the more likely that data is to be variably structured.

Variably structured data is the kind of messy, real-world data that doesn't fit comfortably into a uniform, one-size-fits-all, rigid relational schema; the kind that gives rise to lots of sparse tables and null-checking logic. It's the increasing prevalence of variably structured data in today's applications that has led many organisations to adopt schema-free alternatives to the relational model and document stores.

But the challenges that face us today aren't just the management of increasingly large volumes of data, nor do they simply extend to us having to accommodate ever-increasing degrees of structural variation in that data. The real challenge to generating significant insight is understanding connect-
by Srini Penchikala
Matt Aslett from the 451 Group notes that graphs are now emerging from the general NoSQL umbrella as a category in their own right. In 2014-2015, there has been growth in the category of all things graph.

Data modeling with a graph database is different from how you model the data stored in relational or other NoSQL databases like document databases, key-value data stores, or column-family databases. You can use graph data models to create rich and highly connected data to represent real-world use cases and applications.

InfoQ spoke with Ian Robinson and Jim Webber of Neo Technology, who co-authored O'Reilly's Graph Databases, about data modelling and best practices when using graph databases for data management and analytics.
George Hurlburt is chief scientist at STEMCorp, a nonprofit that works to further economic development via the
adoption of network science and to advance autonomous technologies as useful tools for human use.
The emergence of NoSQL

The relational-database management system (RDBMS), initially designed to maximize highly expensive storage, has indeed proven to be highly effective in transaction-rich and process-stable environments. For example, the RDBMS excels in large-scale credit-card transaction processing and cyclic billing operations. It offers superior performance in the realm of indexed spatial data, but it fares poorly in highly dynamic environments, such as in a management information system that depends on volatile data or a systems architecture with a high churn of many-to-many relationships. In such environments, RDBMS design imposes far too much mathematical and managerial overhead.

The rise of the NoSQL database represents an alternative to the decades-long reign of the RDBMS. Various forms of NoSQL database opened doors to a vastly improved dynamic data portrayal with far less overhead. For example, schemas need not be as rigorous in the NoSQL world. NoSQL database designs include wide-column stores, document stores, key-value (tuple) stores, multimodal databases, object databases, grid/cloud databases, and graph databases. The graph
forward to convert highly structured data, such as well-organized spreadsheets and databases, into RDF, only high-end tools can reliably convert unstructured data into RDF, and that carries some restrictive caveats.

Not all graph databases, however, require RDF-style triple representation. A number of thriving commercial graph databases employ triples in their own unique ways without engaging RDF. Many offer a number of attractive features, such as graph visualization, backup, and recovery. Emil Eifrem, founder and CEO of Neo Technology, expects that these tools will attract the corporate nod and the consumer base will continue to grow, pushing the graph-database industry from 2 percent to an estimated 25 percent of the database market by 2017. Of course, many companies employ their own languages and techniques for data management. A real need exists for standards that, at a minimum, support data transportability.

Knowledge management: Privacy and security

Security, particularly for proprietary architectural designs, must be taken into consideration. If Web sharing is envisioned as a reasonable means to generate a lot of system-representative triples from resident experts, a secure portal to the RDF data store becomes exceedingly important. User authentication and verification also become important.

Although knowledge management is perhaps less extensive than discovery, related databases still might possess specific identity attributes that must be well protected. Front-end provisions must assure the existence of both security against intrusion and the privacy of any personal data contained in the graph database. Failure to offer adequate protection could disqualify otherwise promising candidate graph databases for their vulnerability to attack.

Graph prediction

In dynamic circumstances involving an unfolding process such as weather or economic trends, the ability to predict future behavior becomes highly desirable.

Graph representations facilitate predictions because they let us both qualify and quantify a system represented as a network. The ability to assign properties to nodes and arcs (such as location, time, weights, or quantities) lets us qualitatively evaluate the graph on the basis of similar properties. More importantly, quantitative techniques let us evaluate metrics inherent in almost all graphs.

The ability to apply proven metrics to graphs means that their characteristics might be quantified to allow an objective evaluation of the graph. In cases where graph data is dynamic, such as in an ongoing process, a powerful predictive capability becomes possible, assuming the data stream is accessible. This approach presumes combinations of graph theory and combinatorial mathematics can be applied against a real-time data stream. Moreover, various graph configurations could be classified based on their metrics. Such classification templates, each with a graph signature based on its metrics, could then permit identification of and a predictive baseline for similar graphs as they arise.

Prediction: Security and privacy

Current cybersecurity best practices suggest the importance of taking a snapshot of a system under study to determine its security and privacy vulnerabilities, leading to the accreditation of systems proven to be "secure". The fallacy of such practice is that most systems are influenced by ever-changing environments, which serve to change systemic behaviors over time. Thus, the accreditation is good only for the moment of time at which the snapshot was taken.

Given their growing sophistication, graph databases offer the potential to let us monitor dynamic change in near real time. By monitoring data streams with quantitative methods, looking for anomalous node or changing relationship patterns, we could detect and investigate intrusions and other security breaches early on, quickly prosecuting any identified perpetrators.

From the predictive perspective, data integrity must take a front seat. Data provenance becomes a crucial issue because the stakes of prediction are high. The results of a prediction are as accurate as the data underlying the predictive tools. False data could gravely affect outcomes and literally endanger security. Consider the consequence of a faulty predictive model for disaster relief, which calls for distributing resources in an unaffected region as opposed to the affected region. In this regard, good security practice results in the highest ethical standards of applied science.

Although graph databases hold great promise in a world wrapped in networks of all kinds, they also contain some inherent security risks that have yet to be fully understood, much less appreciated. Rather than piling on the bandwagon, the prudent IT professional must carefully evaluate potential risks in the context of the intended operating environment and perform the necessary tradeoffs to achieve acceptable levels of security and data protection. If security and privacy issues surrounding relatively new technologies, such as increasingly popular graph databases, aren't considered up-front, they become far more costly to implement downstream.
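As a rough illustration of the graph metrics and "graph signature" idea discussed under "Graph prediction", the sketch below computes a few basic metrics for a snapshot of a graph and reports which of them moved noticeably between two snapshots. The function names, the choice of metrics, and the 25 percent tolerance are all assumptions made for this example; the article does not prescribe any particular metric set.

```python
from collections import defaultdict


def graph_signature(edges):
    """Summarise a simple undirected graph given as (u, v) pairs."""
    neighbours = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # self-loops are ignored in this sketch
        neighbours[u].add(v)
        neighbours[v].add(u)

    node_count = len(neighbours)
    degrees = [len(adjacent) for adjacent in neighbours.values()]
    # Every edge appears in two adjacency sets, so halve the total.
    edge_count = sum(degrees) // 2
    possible_edges = node_count * (node_count - 1) / 2
    return {
        "nodes": node_count,
        "edges": edge_count,
        "density": edge_count / possible_edges if possible_edges else 0.0,
        "max_degree": max(degrees, default=0),
        "mean_degree": sum(degrees) / node_count if node_count else 0.0,
    }


def changed_metrics(before, after, tolerance=0.25):
    """List metrics whose relative change between snapshots exceeds `tolerance`."""
    changed = []
    for name, old in before.items():
        new = after[name]
        baseline = abs(old) if old else 1.0
        if abs(new - old) / baseline > tolerance:
            changed.append(name)
    return changed


# Two hypothetical snapshots of the same system taken at different times.
earlier = graph_signature([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
later = graph_signature([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"),
                         ("d", "a"), ("d", "b")])
print(earlier)
print(changed_metrics(earlier, later))
```

Comparing signatures rather than raw edge lists keeps the check cheap enough to repeat on each new snapshot, which is the kind of near-real-time monitoring the section describes.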
Graph databases, on the other hand, offer new methods of uncovering fraud rings and other complex scams with a high level of accuracy through advanced contextual link analysis, and they are capable of stopping advanced fraud scenarios in real time.

The key challenges in fraud detection

Between the enormous amounts of data available for analysis and today's experienced fraud rings (and solo fraudsters), fraud-detection professionals are beset with challenges. Here are some of their biggest:
• Complex link analysis to discover fraud patterns: Uncovering fraud rings requires you to traverse data relationships with high computational complexity, a problem that's exacerbated as a fraud ring grows.
• Detect and prevent fraud as it happens: To prevent a fraud ring, you need real-time link analysis on an interconnected dataset, from the time a false account is created to when a fraudulent transaction occurs.
• Evolving and dynamic fraud rings: Fraud rings are continuously growing in shape and size, and your application needs to detect these fraud patterns in this highly dynamic and emerging environment.

Overcoming fraud detection challenges with graph databases

While no fraud-prevention measures are perfect, significant improvements occur when you look beyond individual data points to the connections that link them.

Understanding the connections between data, and deriving meaning from these links, doesn't necessarily mean gathering new data. You can draw significant insights from your existing data simply by reframing the problem in a new way: as a graph.

Unlike most other ways of looking at data, graphs are designed to express relatedness. Graph databases uncover patterns that are difficult to detect with traditional representations such as tables. An increasing number of companies use graph databases to solve a variety of connected-data problems, including fraud detection.

Example: E-commerce fraud

As our lives become increasingly digital, a growing number of financial transactions are conducted online. Fraudsters have adapted quickly to this trend and have devised clever ways to defraud online payment systems.

While this type of activity can and does involve criminal rings, even a single well-informed fraudster can create a large number of synthetic identities to carry out sizeable schemes.

Consider an online transaction with the following identifiers: user ID, IP address, location, tracking cookie, and credit-card number. Typically, the relationships between these identifiers should be (almost) one-to-one. Some variations naturally account for shared machines, families sharing a single credit-card number, individuals using multiple computers, and the like.

However, as soon as the relationships between these variables exceed a reasonable number, fraud should be considered as a strong possibility. The more interconnections exist amongst identifiers, the greater the cause for concern. Large and tightly knit graphs are very strong indicators that fraud is taking place. See Figure 1 for an example.

Figure 1. A graph of a series of transactions from different IP addresses with a likely fraud event occurring from IP1, which has carried out multiple transactions with five different credit cards.

By putting checks into place and associating them with the appropriate event triggers, such schemes can be discovered before they are able to inflict significant damage.

Triggers can include events such as logging in, placing an order, or registering a credit card.

Conclusion

When it comes to graph-based fraud detection, you need to augment your fraud-detection capability with link analysis. That being said, two points are clear:
• As business processes become faster and more automated, the time margins for detecting fraud are narrowing, increasing the need for a real-time solution.
• Traditional technologies are not designed to detect elaborate fraud rings. Graph databases add value through analysis of connected data points.

Graph databases are the ideal enabler for efficient and manageable fraud-detection solutions. Graph databases uncover a variety of important fraud patterns, from fraud rings and collusive groups to educated criminals operating on their own, all in real time.
Brian Underwood is a software engineer and lover of all things data. As a developer advocate
for Neo4j and co-maintainer of the Neo4j Ruby gem, Brian regularly lectures and writes on his
blog about the power and simplicity of graph databases. Brian is currently traveling the world
with his wife and son. Follow Brian on Twitter or join him on LinkedIn.
When building a full-stack Web application, you have many choices for
the database that you will put on the bottom of the stack. You want a
database that is dependable, certainly, but which also allows you to
model your data well. Neo4j is a good choice as the foundation of your
Web-application stack if your data model contains a lot of connected
data and relationships.
What is Neo4j?

Neo4j is a graph database, which simply means that instead of storing data in tables or collections, it stores data as nodes and relationships between nodes. In Neo4j, both nodes and relationships can contain properties with values. In addition:
• Nodes can have zero or more labels (like "Author" or "Book").
• Relationships have exactly one type (like "WROTE" or "FRIEND_OF").
• Relationships are always directed from one node to another (but can be queried regardless of direction).

Why Neo4j?

To choose a database for a Web application, you should consider what it is that you want from it. Top criteria include:
• Is it easy to use?
• Will it let you easily respond to changes in requirements?
• Is it capable of high-performance queries?
• Does it allow for easy data modeling?
• Is it transactional?
• Does it scale?
• Is it fun (sadly, an often overlooked quality in a database)?
anything from simple CRUD access to a complicated, deeply nested view of a resource.

Which stack should you use with Neo4j?

All major programming languages have support for Neo4j via the HTTP API, either via a basic HTTP library or via a number of native libraries that offer higher-level abstractions. Since Neo4j is written in Java, all languages that have a JVM interface can take advantage of the high-performance APIs in Neo4j.

Neo4j also has its own "stack" to allow you to choose different access methods ranging from easy access to raw performance. It offers:
• an HTTP API for making Cypher queries and retrieving results in JSON,
• an "unmanaged extension" facility in which you can write your own endpoints for your Neo4j database,
• a Java API for specifying traversals of nodes and relationships at a higher level,
• a low-level batch-loading API for massive initial data ingestion, and
• a core Java API for direct access to nodes and relationships for maximum performance.

An application example

I recently took on a project to expand a Neo4j-based application. The application (which you can see at graphgist.neo4j.com) is a portal for GraphGists. A GraphGist is a simple AsciiDoc text file (with images if need be) that describes the data model, setup, and use-case queries to be executed. A reader's browser interactively renders the file and visualizes it live. A GraphGist is much like an IPython notebook or an interactive white paper. It also allows readers to write their own queries to explore the dataset from the browser.

Neo Technology, the creators of Neo4j, wanted to provide a showcase for GraphGists created by the community, and this was my project. Neo4j was used as the back end, of course, but for the rest of the stack I used Node.js with Express.js and the Neo4j package, Angular.js, and Swagger UI.
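As a hedged sketch of the first access method in the list above (Cypher over HTTP with JSON results), the following builds a request for a local Neo4j server using only the Python standard library. Several details are assumptions rather than anything stated in the article: the server address, the transactional endpoint path `/db/data/transaction/commit` and the `{name}` parameter syntax (both as used by Neo4j 2.x and 3.x; later versions use a different path and `$name`), and the example query itself. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

# Assumption: a local Neo4j 2.x/3.x server exposing the transactional endpoint.
NEO4J_URL = "http://localhost:7474/db/data/transaction/commit"


def build_cypher_request(statement, parameters=None, url=NEO4J_URL):
    """Wrap one Cypher statement and its parameters in a JSON POST request."""
    payload = {
        "statements": [
            {"statement": statement, "parameters": parameters or {}},
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def run(request):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical query: titles of books written by a given author.
request = build_cypher_request(
    "MATCH (a:Author)-[:WROTE]->(b:Book) WHERE a.name = {name} RETURN b.title",
    {"name": "Ian Robinson"},
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Passing values as parameters rather than concatenating them into the query string lets the server reuse the query plan and avoids injection problems. Calling `run(request)` against a real server would return the decoded JSON reply.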
related nodes so that the server can traverse right where it needs to go.

That said, there are a couple of downsides to this approach. While it's possible to retrieve all of the data required in one query, the query is quite long. I haven't yet found a way to modularize it for reuse. Along the same lines, we might want to use this same endpoint in another place but show more information about the related gists. We could modify the query to return that data but then it would be returning data unnecessary for the original use case.

We are fortunate to have many excellent database choices. While relational databases are still the best choice for storing structured data, NoSQL databases are a good choice for managing semi-structured, unstructured, and graph data. If you have a data model with a lot of connected data and want a database that is intuitive, fun, and fast, you should get to know Neo4j.

This article was authored by Brian Underwood with contributions from Michael Hunger.