Neo4j

Graph Data Modeling


Departament de Ciències de la Computació
Graph Data Modeling

1. Introduction to Graph Data Modeling

2. Designing the Initial Graph Data Model

3. Graph Data Modeling Core Principles

4. Common Graph Structures

5. Refactoring and Evolving a Graph Data Model

Non-Relational Databases (Bases de Dades no Relacionals). Neo4j


What is Graph Data Modeling?

Graph data modeling is a collaborative effort by stakeholders, including developers

▪ Stakeholders include business analysts, architects, managers, project leaders…

The application domain is analyzed by stakeholders and developers


▪ They develop a data model
▪ Stakeholders must understand the domain and provide answers

Neo4j is a full-featured graph database


▪ It includes tools used to create property graphs
▪ It supports application access in retrieving data for business use cases by
traversing the graph

Neo4j Property Graph Model

• Nodes (Entities)

• Relationships

• Properties

• Labels
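
A minimal Cypher sketch of the four building blocks, assuming a hypothetical Person who owns a Residence. The `name` and `since` properties and their values are illustrative only:

```cypher
// Two nodes (entities), each with a label and properties
CREATE (p:Person {name: 'Patrick Scott'})
CREATE (r:Residence {address: '475 Broad Street'})
// A relationship with a type, a direction, and a property of its own
CREATE (p)-[:OWNS {since: 2019}]->(r)
```
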
Graph Traversal

MATCH (r:Residence)<-[:OWNS]-(p:Person)
WHERE r.address = '475 Broad Street'
RETURN p


Designing the Initial Data Model

1. Understand the domain

2. Create high-level sample data

3. Define specific questions for the application

4. Identify entities

5. Identify connections between entities

6. Test the questions against the model

7. Test scalability

Identify Entities from Questions

Entities are the nouns in the application questions:


▪ What ingredients are used in a recipe?

▪ Who is married to this person?

o The generic nouns often become labels in the model


o Use domain knowledge when deciding how to further group or differentiate entities

Define Properties

Two purposes for properties:


1. Unique identification
2. Answering application questions
Otherwise properties are decoration (these properties should not be added)

Properties are used for:


– Anchoring (where to begin the query)
– Traversing the graph (navigation)
– Returning data from the query
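
The three uses can appear in a single query. A sketch that reuses the Residence/Person example, assuming a hypothetical `since` property on the `OWNS` relationship and `firstName`/`lastName` properties on `Person`:

```cypher
MATCH (r:Residence {address: '475 Broad Street'})  // anchoring: the property picks the start node
      <-[o:OWNS]-(p:Person)
WHERE o.since >= 2015                              // traversing: a property decides which paths to keep
RETURN p.firstName, p.lastName                     // returning: properties delivered to the application
```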

Identify Connections Between Entities

Connections are the verbs in the application questions:

▪ What ingredients are used in a recipe?

▪ Who is married to this person?
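
Read as nouns and verbs, the two sample questions suggest the sketch below. The labels `Recipe`, `Ingredient`, `Person`, the types `USES` and `MARRIED_TO`, and the `name` properties are assumptions drawn from the wording of the questions:

```cypher
// What ingredients are used in a recipe?
MATCH (:Recipe {name: $recipeName})-[:USES]->(i:Ingredient)
RETURN i.name;

// Who is married to this person?
MATCH (:Person {name: $personName})-[:MARRIED_TO]-(spouse:Person)
RETURN spouse.name;
```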

Naming Relationships

▪ Stakeholders must agree upon a name (type) for each relationship


▪ Avoid names that could be construed as nouns (e.g. email)

Do not do this: Instead do this:
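
The figures for this slide are not reproduced here. As an illustration only, a noun-like type such as `EMAIL` leaves open whether it names an action or a thing, while a verb states what happened. The type `EMAILED` and the `firstName` property are hypothetical:

```cypher
// Ambiguous: is EMAIL an action or a thing?
//   (a:Person)-[:EMAIL]->(b:Person)

// Clearer: the type reads as a verb between the two nodes
MATCH (a:Person)-[:EMAILED]->(b:Person)
RETURN a.firstName, b.firstName;
```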

Direction and Type

Direction and type are required for relationships

Select direction and type based on expected questions:

1. What episode follows ‘The Ark in Space’? (NEXT)

2. What episode came before ‘Genesis of the Daleks’? (PREVIOUS)
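
A sketch of both questions, assuming an `Episode` label with a `title` property. Neo4j can traverse a relationship against its stored direction, so the second question can also be answered with `NEXT` alone. That is one reason to choose the type and direction deliberately:

```cypher
// 1. What episode follows 'The Ark in Space'?
MATCH (:Episode {title: 'The Ark in Space'})-[:NEXT]->(e:Episode)
RETURN e.title;

// 2. What episode came before 'Genesis of the Daleks'?
//    Same NEXT relationship, read in the opposite direction
MATCH (e:Episode)-[:NEXT]->(:Episode {title: 'Genesis of the Daleks'})
RETURN e.title;
```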

Node Fanout
Before: all data on a single Person node

Person
firstName: 'Patrick'
lastName: 'Scott'
age: 34
addr1: 'Flat 3B'
addr2: '83 Landor St'
city: 'Axebridge'
postalCode: 'DF3 0AS'

After: the address is fanned out to its own Residence node

Person
firstName: 'Patrick'
lastName: 'Scott'
age: 34

Residence
addr1: 'Flat 3B'
addr2: '83 Landor St'
city: 'Axebridge'
postalCode: 'DF3 0AS'

(Person)-[:LIVES_AT]->(Residence)
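
One possible way to perform this fanout on existing data, sketched under the assumption that every `Person` to be migrated has all four address properties set (`MERGE` rejects null property values) and that the relationship points from `Person` to `Residence`:

```cypher
MATCH (p:Person)
WHERE p.addr1 IS NOT NULL AND p.addr2 IS NOT NULL
  AND p.city IS NOT NULL AND p.postalCode IS NOT NULL
// Reuse an existing Residence with the same address, or create one
MERGE (r:Residence {addr1: p.addr1, addr2: p.addr2,
                    city: p.city, postalCode: p.postalCode})
MERGE (p)-[:LIVES_AT]->(r)
// The address now lives on the Residence node only
REMOVE p.addr1, p.addr2, p.city, p.postalCode;
```

People who share an address end up connected to the same `Residence` node, which is where the reduction in duplicated properties comes from.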

How Much Node Fanout?


Graph Modeling Core Principles

● Nodes
○ Uniqueness
○ Fanout
● Relationships
○ Naming best practices
○ Semantic redundancy
○ Types vs. Properties
● Properties
● Data object accessibility

Node Best Practices
Uniqueness of Nodes: Before
Notes:
▪ Country nodes are considered super nodes (a node with lots of fan-in or fan-out)
▪ Be careful when using them in a design
▪ Be aware of queries that might select all paths in or out of a super node

Node Best Practices
Uniqueness of Nodes: After
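
The figure for this slide is not reproduced. One common way to keep nodes unique is sketched below. It assumes `Country` nodes are identified by a `name` property and linked through a hypothetical `LIVES_IN` relationship, and it uses the constraint syntax of recent Neo4j versions:

```cypher
// Enforce one Country node per name
CREATE CONSTRAINT country_name_unique IF NOT EXISTS
FOR (c:Country) REQUIRE c.name IS UNIQUE;

// MERGE reuses the existing Country node instead of creating a duplicate
MATCH (p:Person {firstName: 'Patrick', lastName: 'Scott'})
MERGE (c:Country {name: 'United Kingdom'})
MERGE (p)-[:LIVES_IN]->(c);
```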

Complex Data
Use Fanout Judiciously for Complex Data
▪ Reduce property duplication
▪ Reduce gather-and-inspect

Best Practices for Modeling Relationships

Data models should address:

• Using specific relationship types

• Using types vs. properties

• Reducing symmetric relationships
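
A sketch of the first two points, assuming hypothetical data in which the kind of address was stored as a `type` property on a generic `HAS_ADDRESS` relationship:

```cypher
// Generic type + property: every HAS_ADDRESS relationship must be opened and inspected
MATCH (p:Person)-[a:HAS_ADDRESS]->(r:Residence)
WHERE a.type = 'home'
RETURN p, r;

// Specific type: only relationships of the matching type are traversed
MATCH (p:Person)-[:LIVES_AT]->(r:Residence)
RETURN p, r;
```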


Using Specific Relationship Types

But Not Too Specific

Do Not Use Symmetric Relationships
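
A sketch assuming a `MARRIED_TO` relationship between `Person` nodes with a hypothetical `name` property. The fact is stored once, in an arbitrary direction, and read without a direction:

```cypher
// Store the relationship once; an undirected MERGE matches either direction
MATCH (a:Person {name: $nameA}), (b:Person {name: $nameB})
MERGE (a)-[:MARRIED_TO]-(b);

// Ignore direction when reading, so either spouse can be the anchor
MATCH (:Person {name: $nameA})-[:MARRIED_TO]-(spouse:Person)
RETURN spouse.name;
```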

Semantics of Symmetry are Important

Using Types vs. Properties

Property Best Practices

▪ Property lookups have a cost


▪ Parsing a complex property adds more cost

▪ Anchors and properties used for traversal should be as simple as possible


▪ Identifiers, outputs, and decoration are OK as complex values

Best practices for Data Accessibility

For each query, how much work must Neo4j do to evaluate if the
traversal represents a “good” or a “bad” path?

Hierarchy of Accessibility
For each data object, how much work must Neo4j do to evaluate if this is a “good”
path or a “bad” one?

Most accessible (least processing required)

1. Anchor node label
   Anchor node properties (indexed)
2. Relationship type
3. Anchor node properties (non-indexed)
4. Downstream node labels
5. Relationship properties
   Downstream node properties

Least accessible (most processing required)
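
A sketch of where the levels show up in a query, reusing the Residence/Person example. The index name and the `since` and `age` filters are assumptions:

```cypher
// Level 1: an index makes the anchor lookup cheap
CREATE INDEX residence_address IF NOT EXISTS
FOR (r:Residence) ON (r.address);

MATCH (r:Residence {address: '475 Broad Street'})  // 1. anchor label + indexed property
      <-[o:OWNS]-                                   // 2. relationship type
      (p:Person)                                    // 4. downstream node label
WHERE o.since >= 2015                               // 5. relationship property
  AND p.age > 30                                    // 5. downstream node property
RETURN p;
```
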

Common Graph Structures

● Intermediate node

● Linked list

● Timeline tree

● Multiple structures in a single model

Intermediate Nodes

Create intermediate nodes when you need to:

▪ Connect more than two nodes in a single context

▪ Relate something to a relationship
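
A sketch of the second case, using a hypothetical employment example. A `WORKED_AT` relationship cannot itself point at a `Role` node, so an intermediate `Employment` node carries the context. All labels, types and properties here are illustrative:

```cypher
// Before: (p:Person)-[:WORKED_AT {role: 'Engineer'}]->(c:Company)
// After: the employment becomes a node that other nodes can connect to
MATCH (p:Person)-[w:WORKED_AT]->(c:Company)
WHERE w.role IS NOT NULL
CREATE (e:Employment {startDate: w.startDate, endDate: w.endDate})
CREATE (p)-[:HAD_EMPLOYMENT]->(e)-[:AT_COMPANY]->(c)
MERGE (r:Role {name: w.role})
CREATE (e)-[:IN_ROLE]->(r)
DELETE w;
```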

Intermediate Nodes

Intermediate Nodes: Sharing Context

Intermediate Nodes: Sharing Data

Intermediate Nodes: Organizing Data

Linked Lists

Do NOT

Interleaved Linked List

Head and Tail of Linked List
Some possible use cases:
▪ Add episodes as they are broadcast
▪ Maintain pointer to first and last episodes
▪ Find all broadcast episodes
▪ Find latest broadcast episode
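
These use cases can be sketched as follows, assuming a hypothetical `Series` node that points at the list with `FIRST` and `LAST` relationships, and `Episode` nodes chained by `NEXT`:

```cypher
// Add an episode as it is broadcast and move the LAST pointer
MATCH (s:Series {name: $series})-[oldLast:LAST]->(prev:Episode)
CREATE (e:Episode {title: $title})
CREATE (prev)-[:NEXT]->(e)
CREATE (s)-[:LAST]->(e)
DELETE oldLast;

// Find the latest broadcast episode
MATCH (:Series {name: $series})-[:LAST]->(e:Episode)
RETURN e.title;

// Find all broadcast episodes, in broadcast order
MATCH (:Series {name: $series})-[:FIRST]->(first:Episode)
MATCH path = (first)-[:NEXT*0..]->(e:Episode)
RETURN e.title
ORDER BY length(path);
```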

Timeline Tree

Using Multiple Structures

Using the Timeline Tree

Using Intermediate Nodes

Using Linked Lists


What is Refactoring?
Important: Your model depends on your data and your queries

Refactoring is the process of …


– Changing the data structure ...
– Without altering its semantic meaning

Refactoring often involves moving data from one structure to another


Sometimes refactoring involves adding additional data from other
sources
The most common type of refactoring is ...
– Restructure the graph to use a property value
– A property value is used to create a label, a node, or a relationship

Hierarchy of Accessibility (reminder)
For each data object, how much work must Neo4j do to evaluate if this is a “good”
path or a “bad” one?

Most accessible (least processing required)

1. Anchor node label
   Anchor node properties (indexed)
2. Relationship type
3. Anchor node properties (non-indexed)
4. Downstream node labels
5. Relationship properties
   Downstream node properties

Least accessible (most processing required)

Why Refactor?

Data models can be optimized for:

– Query performance
– Model simplicity & intuitiveness
– Query simplicity (i.e., simpler Cypher strings)
– Easier data updates

Note: Improving behavior in one of these areas frequently involves sacrifices in others

Another important reason to refactor is to accommodate new application questions in the same model

Goal: Eliminate Duplicate Data in Properties

Refactor Example: Extracting Nodes From Properties

Goal: Use Labels Instead of Property Values

Refactor Example: Turn Property Values into Labels for Nodes
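
A sketch under assumed data: `Person` nodes carry a hypothetical `role` property, and the value `'pilot'` is promoted to a `Pilot` label so that queries can anchor on the label instead of inspecting a property:

```cypher
// Promote the property value to a label, then drop the property
MATCH (p:Person)
WHERE p.role = 'pilot'
SET p:Pilot
REMOVE p.role;

// Afterwards the label alone selects the nodes
MATCH (p:Pilot)
RETURN p.firstName, p.lastName;
```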

Goal: Use Nodes Instead of Properties for Relationships

Possible dense node

Refactor: Extract Nodes from Relationship Properties


Refactoring example: Modeling airline flights

Leonardo DiCaprio as Frank Abagnale in the Steven Spielberg movie “Catch Me If You Can”

Credit: Max De Marzi https://maxdemarzi.com/2015/08/26/modeling-airline-flights-in-neo4j/

Refactoring example: Modeling airline flights
Important: Your model depends on your data and your queries

Our data → Airports and Flights between them


Ask yourself:
• What are the entities?

• What are the connections between the entities?

• What properties do we need?

Initial Question for Our Model


▪ What flights will take me from Malmo to New York on Friday?

Initial Model

Question: What flights will take me from Malmo to New York on Friday?

Comment: The concept of a Flight is expressed as a relationship

The model can answer this question, so the model seems fine
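
A sketch of Question 1 against this initial model. The slide's diagram is not reproduced, so the `FLIGHT` relationship type and the `city`, `code` and `date` properties are assumptions:

```cypher
// A flight is a relationship between two Airport nodes
MATCH (:Airport {city: 'Malmo'})-[f:FLIGHT]->(:Airport {city: 'New York'})
WHERE f.date = $friday   // $friday: the date of the Friday in question
RETURN f.code, f.date;
```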

Initial Model

Question 1: What flights will take me from Malmo to New York on Friday?

Question 2: Mom is on flight AY189. When will she land?

Comment: To find flight AY189, we need to traverse every relationship in the graph,
because it is impossible to anchor on relationships. This query is very inefficient!

Initial Model

Question 1: What flights will take me from Malmo to New York on Friday?

More questions:
• What if we want to connect Customers or Staff to a flight? → Not possible!
• What if a flight was rerouted to another Airport due to weather? → Not possible!

Given some of the queries we imagine for our data, a flight really should be a node
Refactor: Create Intermediate Flight Nodes
Question 1: What flights will take me from Malmo to New York on Friday?

Question 2: Mom is on flight AY189. When will she land?

Adding Flight nodes lets us anchor on flight data, reducing traversal
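
A sketch of this refactor and of Question 2 afterwards. It assumes the initial model stored each flight as a hypothetical `FLIGHT` relationship whose properties include `code` and `arrival`. The types `HAS_FLIGHT` and `ARRIVES_AT` are also assumptions:

```cypher
// Turn every FLIGHT relationship into a Flight node that keeps its properties
MATCH (origin:Airport)-[old:FLIGHT]->(dest:Airport)
CREATE (f:Flight)
SET f = properties(old)
CREATE (origin)-[:HAS_FLIGHT]->(f)-[:ARRIVES_AT]->(dest)
DELETE old;

// Question 2: anchor directly on the Flight node
MATCH (f:Flight {code: 'AY189'})-[:ARRIVES_AT]->(dest:Airport)
RETURN f.arrival, dest.city;
```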

Note: Airlines are required to publish flight plans 12 months in advance.


How much work must Neo4j do to answer Question 1?
• Neo4j must check every flight leaving Malmo and read its data to find the date.
Then it must check which of those flights land at the desired destination!
How can we elevate the flight date for better efficiency?
Refactor: Create AirportDay Intermediate Nodes
Question 1: What flights will take me from Malmo to New York on Friday?

Question 2: Mom is on flight AY189. When will she land?

Adding the Intermediate node AirportDay:


▪ It reduces the number of relationships on Airport nodes, since there are fewer days than flights

▪ We still need to check every AirportDay to find the right date, but the number of traversals is reduced
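
A sketch of Question 1 with the intermediate node. `HAS_DAY` comes from the model. The `date` and `city` properties and the `HAS_FLIGHT` and `ARRIVES_AT` types are assumptions:

```cypher
MATCH (:Airport {city: 'Malmo'})-[:HAS_DAY]->(d:AirportDay)
WHERE d.date = $friday   // every AirportDay of the airport is still inspected here
MATCH (d)-[:HAS_FLIGHT]->(f:Flight)-[:ARRIVES_AT]->(:Airport {city: 'New York'})
RETURN f.code;
```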

If the model changes, we must check that older queries still work. What about Q1 and Q2? Both are OK
But… how can we reduce wasted traversal even further for dates?

Possible Refactor: Change Relationship Type to Date
Question 1: What flights will take me from Malmo to New York on Friday?

Question 2: Mom is on flight AY189. When will she land?

We make date a relationship type


▪ It hardly changes the model, but performance improves. Now, we can traverse only to the
relevant AirportDay. And Q2 is unaffected.
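
A sketch of Question 1 after this change. The type name `ON_2024_05_17` is a hypothetical naming scheme for one date, and the application is assumed to build it into the query text. `HAS_FLIGHT`, `ARRIVES_AT` and the `city` property are assumptions as well:

```cypher
// One relationship type per date replaces the generic HAS_DAY,
// so only the AirportDay for that date is reached
MATCH (:Airport {city: 'Malmo'})-[:ON_2024_05_17]->(:AirportDay)
      -[:HAS_FLIGHT]->(f:Flight)-[:ARRIVES_AT]->(:Airport {city: 'New York'})
RETURN f.code;
```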

Comment: Are Airport nodes necessary? If we remove them, then:
• We could remove a modest number of Airport nodes and many HAS_DAY relationships

Possible Refactor: Remove Airport Nodes
Question 1: What flights will take me from Malmo to New York on Friday?

Question 2: Mom is on flight AY189. When will she land?

We remove Airport nodes → it is less intuitive but more efficient


Comment: But what if no direct flight is available? How do we find an itinerary (connecting flights)?
Neo4j must check each flight and its destinations, and second-order destinations... → Inefficient!!

Refactor: Add Destination Intermediate Nodes
Question 1: What flights will take me from Malmo to New York on Friday?

Question 2: Mom is on flight AY189. When will she land?

Adding the intermediate node Destination → queries on destination are efficient

The scope of the graph grows proportionally to the number of Destinations served by an airport,
not the number of Flights. Airports have multiple flights per destination (at different times of day)

Comment: Does this refactor affect Q2? No
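
A sketch of Question 1 against this model. It assumes Airport nodes have been removed, so each `AirportDay` carries hypothetical `city` and `date` properties. The types `HAS_DESTINATION` and `HAS_FLIGHT` and the `Destination.city` property are also assumptions:

```cypher
// Anchor on the day, hop to the one Destination of interest, then list its flights
MATCH (d:AirportDay {city: 'Malmo', date: $friday})
      -[:HAS_DESTINATION]->(dest:Destination {city: 'New York'})
      -[:HAS_FLIGHT]->(f:Flight)
RETURN f.code;
```

Flights to other destinations are never touched, which is the efficiency gain the slide describes.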
