Introduction to scalable graph analysis with
Apache Giraph and Spark GraphX
Roman Shaposhnik rvs@apache.org @rhatr
Director of Open Source, Pivotal Inc.
Shameless plug #1
Agenda:
Let's define some terms
•  A graph is G = (V, E), where E ⊆ V × V
•  Directed multigraphs with properties attached to each vertex and edge

[Figure: an example directed multigraph, built up over four slides; labels shown in the figure: foo, bar, fee, fum, 1, 2, 42]
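
To make the definition concrete, here is a minimal sketch of a directed multigraph with a property on every vertex and edge. It is plain Java with hypothetical names (`PropertyMultigraph`, `propertyOf`, `countEdges`); it is not the Giraph or GraphX data model, and the labels are borrowed from the figure only as sample values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a directed multigraph with properties on vertices and edges.
class PropertyMultigraph {
    // One directed edge: source id, target id and an attached property.
    static final class Edge {
        final long src;
        final long dst;
        final String property;

        Edge(long src, long dst, String property) {
            this.src = src;
            this.dst = dst;
            this.property = property;
        }
    }

    private final Map<Long, String> vertexProperties = new HashMap<>();
    // A list rather than a set, so parallel edges between the same pair are allowed.
    private final List<Edge> edges = new ArrayList<>();

    void addVertex(long id, String property) {
        vertexProperties.put(id, property);
    }

    void addEdge(long src, long dst, String property) {
        if (!vertexProperties.containsKey(src) || !vertexProperties.containsKey(dst)) {
            throw new IllegalArgumentException("both endpoints must be vertices");
        }
        edges.add(new Edge(src, dst, property));
    }

    String propertyOf(long id) {
        return vertexProperties.get(id);
    }

    // Number of edges going from src to dst; can exceed 1 in a multigraph.
    int countEdges(long src, long dst) {
        int n = 0;
        for (Edge e : edges) {
            if (e.src == src && e.dst == dst) {
                n++;
            }
        }
        return n;
    }

    // Sample graph: two vertices and two parallel edges from 1 to 2.
    static PropertyMultigraph example() {
        PropertyMultigraph g = new PropertyMultigraph();
        g.addVertex(1L, "foo");
        g.addVertex(2L, "bar");
        g.addEdge(1L, 2L, "fee");
        g.addEdge(1L, 2L, "fum");
        return g;
    }

    public static void main(String[] args) {
        PropertyMultigraph g = example();
        System.out.println(g.countEdges(1L, 2L)); // parallel edges from 1 to 2
        System.out.println(g.countEdges(2L, 1L)); // edges are directed
    }
}
```

Keeping edges in a list is what makes this a multigraph: the same (source, target) pair can appear more than once, each time with its own property.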
What kind of graphs are we talking about?
• Page ranking on the Facebook social graph (mid-2013)
•  10^9 (a billion) vertices
•  10^12 (a trillion) edges
•  10^15 bytes (petabyte scale) of cold storage data
•  200 servers
•  …all in under 4 minutes!
“On day one Doug created
HDFS and MapReduce”
Google papers that started it all
• GFS (file system)
•  distributed
•  replicated
•  non-POSIX

• MapReduce (computational framework)
•  distributed
•  batch-oriented (long jobs; final results)
•  data-gravity aware
•  designed for “embarrassingly parallel” algorithms
HDFS pools and abstracts direct-attached storage
[Figure: MapReduce (MR) tasks running on top of HDFS]
A Unix analogy
§ It is as though instead of:
$ grep foo bar.txt | tr "," " " | sort -u

§ We are doing:
$ grep foo < bar.txt > /tmp/1.txt
$ tr "," " " < /tmp/1.txt > /tmp/2.txt
$ sort -u < /tmp/2.txt
Enter Apache Spark
RAM is the new disk, Disk is the new tape
Source: UC Berkeley Spark project (just the image)
RDDs instead of HDFS files, RAM instead of Disk
warnings = textFile(…).filter(_.contains("warning"))
                      .map(_.split(' ')(1))
[Figure: RDD lineage held in pooled RAM: HadoopRDD (path = hdfs://) → FilteredRDD (contains…) → MappedRDD (split…)]
RDDs: resilient distributed datasets
§ Distributed on a cluster in RAM
§ Immutable (mostly)
§ Can be evicted, snapshotted, etc.
§ Manipulated via parallel operators (map, etc.)
§ Automatically rebuilt on failure
§ A parallel ecosystem
§ A solution to iterative and multi-stage apps
What’s so special about Graphs and
big data?
Graph relationships
§ Entities in your data: tuples
-  customer data
-  product data
-  interaction data
§ Connection between entities: graphs
-  social network of my customers
-  clustering of customers vs. products
A word about Graph databases
§  Plenty available
-  Neo4j, Titan, etc.
§  Benefits
-  Query language
-  Tightly integrated systems with few moving parts
-  High performance on known data sets
§  Shortcomings
-  Not easy to scale horizontally
-  Don’t integrate with HDFS
-  Combine storage and computational layers
-  A sea of APIs
What’s the key API?
§ Directed multi-graph with labels attached to vertices and edges
§ Defining vertices and edges dynamically
§ Selecting sub-graphs
§ Mutating the topology of the graph
§ Partitioning the graph
§ Computing model that is
-  iterative
-  scalable (shared nothing)
-  resilient
-  easy to manage at scale
Bulk Synchronous Parallel
BSP compute model
BSP in a nutshell
[Figure: timeline of a BSP computation; phases of local processing and communications are separated by barrier #1, barrier #2 and barrier #3]
Vertex-centric BSP application
[Figure: example graph with vertices @rhatr, @TheASF and @c0sin]
“Think like a vertex”
•  I know my local state
•  I know my neighbors
•  I can send messages to vertices
•  I can declare that I am done
•  I can mutate graph topology
Local state, global messaging
[Figure: timeline with communications between supersteps; during superstep #1 vertices are doing local computing and pooling messages; superstep #2 starts once all vertices are done computing]
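
The superstep cycle can be imitated in a single process. The sketch below is a toy, not Giraph: the class `BspMaxValue` and its methods are hypothetical, and the algorithm (every vertex learns the largest value in the graph) is chosen only because it is short. Messages sent during one superstep are delivered at the barrier and become visible in the next; a vertex that sends nothing has effectively voted to halt and is woken up only by an incoming message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of vertex-centric BSP.
class BspMaxValue {
    // Every vertex must appear as a key of outEdges, even if it has no out-edges.
    // Returns the number of supersteps executed; 'value' is updated in place.
    static int run(Map<Integer, List<Integer>> outEdges, Map<Integer, Integer> value) {
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        int superstep = 0;
        while (true) {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            for (Integer v : outEdges.keySet()) {
                List<Integer> msgs = inbox.getOrDefault(v, new ArrayList<>());
                // A halted vertex stays idle unless a message wakes it up.
                if (superstep > 0 && msgs.isEmpty()) {
                    continue;
                }
                // Local computing: combine own state with the received messages.
                int best = value.get(v);
                for (int m : msgs) {
                    best = Math.max(best, m);
                }
                boolean changed = best > value.get(v);
                value.put(v, best);
                // Pool messages for the neighbors; sending nothing means voting to halt.
                if (superstep == 0 || changed) {
                    for (Integer n : outEdges.get(v)) {
                        outbox.computeIfAbsent(n, k -> new ArrayList<>()).add(best);
                    }
                }
            }
            superstep++;
            // Barrier: nothing in flight means every vertex has halted.
            if (outbox.isEmpty()) {
                return superstep;
            }
            inbox = outbox;
        }
    }

    // Sample run on a three-vertex ring 1 -> 2 -> 3 -> 1.
    static String demo() {
        Map<Integer, List<Integer>> outEdges = new TreeMap<>();
        outEdges.put(1, Arrays.asList(2));
        outEdges.put(2, Arrays.asList(3));
        outEdges.put(3, Arrays.asList(1));
        Map<Integer, Integer> value = new TreeMap<>();
        value.put(1, 3);
        value.put(2, 6);
        value.put(3, 2);
        int supersteps = run(outEdges, value);
        return value + " after " + supersteps + " supersteps";
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the inbox and outbox are separate maps, the order in which vertices are visited inside a superstep does not affect the result, which is the property the barrier provides in a real distributed run.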
Let's put it all together
Hadoop ecosystem view
[Figure: component diagram showing HDFS, Pig, Sqoop, Flume, MR, Hive, Tez, Giraph, Mahout, Spark, SparkSQL, MLlib, GraphX, HAWQ, Kafka, YARN and MADlib]
Spark view
[Figure: component diagram showing HDFS, Ceph, GlusterFS, S3; Hive, Spark, SparkSQL, MLlib, GraphX, Kafka; YARN, Mesos, MR]
Enough boxology!
Let's look at some code
Our toy for the rest of this talk
Adjacency lists stored on HDFS
$ hadoop fs -cat /tmp/graph/1.txt
1
2 1 3
3 1 2
[Figure: the toy graph with vertex IDs 1, 2 and 3 and labels @rhatr, @TheASF and @c0sin]
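
Each line of the file is a vertex ID followed by the IDs of its out-neighbors. As a rough illustration of that format, the plain-Java sketch below turns such lines into (source, target) pairs. The class and method names are hypothetical, and splitting on any whitespace is an assumption about the separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for adjacency-list lines such as "2 1 3".
class AdjacencyListParser {
    // Turns one line into edges: "2 1 3" becomes (2,1) and (2,3).
    // A line holding only a vertex ID, or a blank line, yields no edges.
    static List<long[]> parseLine(String line) {
        List<long[]> edges = new ArrayList<>();
        String[] tokens = line.trim().split("\\s+");
        if (tokens[0].isEmpty()) {
            return edges;
        }
        long src = Long.parseLong(tokens[0]);
        for (int i = 1; i < tokens.length; i++) {
            edges.add(new long[] {src, Long.parseLong(tokens[i])});
        }
        return edges;
    }

    // Renders all edges of a file as "src->dst" pairs separated by spaces.
    static String describe(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            for (long[] e : parseLine(line)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(e[0]).append("->").append(e[1]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(Arrays.asList("1", "2 1 3", "3 1 2")));
    }
}
```

Note that vertex 1 has no outgoing edges, so it shows up only as a target.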
Graph modeling in GraphX
§  The property graph is parameterized over the vertex (VD) and edge (ED) types
class Graph[VD, ED] {
val vertices: VertexRDD[VD]
val edges: EdgeRDD[ED]
}
§  Graph[(String, String), String]
Hello world in GraphX
$ spark*/bin/spark-shell
scala> import org.apache.spark.graphx._
scala> val inputFile = sc.textFile("hdfs:///tmp/graph/1.txt")
scala> val edges = inputFile.flatMap(s => { // "2 1 3"
         val l = s.split("\t"); // [ "2", "1", "3" ]
         l.drop(1).map(x => (l.head.toLong, x.toLong)) // [ (2, 1), (2, 3) ]
       })
scala> val graph = Graph.fromEdgeTuples(edges, "") // Graph[String, Int]
scala> val result = graph.collectNeighborIds(EdgeDirection.Out).map(x =>
         println("Hello world from the: " + x._1 + " : " + x._2.mkString(" ")) )
scala> result.collect() // don't try this @home
Hello world from the: 1 :
Hello world from the: 2 : 1 3
Hello world from the: 3 : 1 2
Graph modeling in Giraph
BasicComputation<I extends WritableComparable,  // VertexID    -- vertex ref
                 V extends Writable,            // VertexData  -- a vertex datum
                 E extends Writable,            // EdgeData    -- an edge label
                 M extends Writable>            // MessageData -- message payload

V is sort of like VD
E is sort of like ED
Hello world in Giraph
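import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

public class GiraphHelloWorld extends
    BasicComputation<IntWritable, IntWritable, NullWritable, NullWritable> {
  public void compute(Vertex<IntWritable, IntWritable, NullWritable> vertex,
                      Iterable<NullWritable> messages) {
    System.out.print("Hello world from the: " + vertex.getId() + " : ");
    // Print the target of every outgoing edge of this vertex
    for (Edge<IntWritable, NullWritable> e : vertex.getEdges()) {
      System.out.print(" " + e.getTargetVertexId());
    }
    System.out.println("");
    // Nothing left to do for this vertex
    vertex.voteToHalt();
  }
}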
How to run it
$ giraph target/*.jar giraph.GiraphHelloWorld \
    -vip /tmp/graph/ \
    -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat \
    -w 1 \
    -ca giraph.SplitMasterWorker=false,giraph.logLevel=error
Hello world from the: 1 :
Hello world from the: 2 : 1 3
Hello world from the: 3 : 1 2
Anatomy of Giraph run
BSP assumes an exclusively vertex view
Turning Twitter into Facebook
[Figure: two copies of the example graph with vertices @rhatr, @TheASF and @c0sin]
Hello world in Giraph
public void compute(Vertex<Text, DoubleWritable, DoubleWritable> vertex, Iterable<Text> ms) {
  if (getSuperstep() == 0) {
    // Superstep 0: tell every out-neighbor who points at it
    sendMessageToAllEdges(vertex, vertex.getId());
  } else {
    // Later supersteps: add an edge back to each sender that is not yet a neighbor
    for (Text m : ms) {
      if (vertex.getEdgeValue(m) == null) {
        vertex.addEdge(EdgeFactory.create(m, SYNTHETIC_EDGE));
      }
    }
  }
  vertex.voteToHalt();
}
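
The same two supersteps can be traced without a cluster. The following plain-Java sketch is only an imitation of the Giraph program: `Symmetrize` is a hypothetical class, and the follow relationships in `demo()` are invented for illustration. Step one collects, for each vertex, the IDs of the vertices pointing at it; step two adds the missing reverse edges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical imitation of turning a directed "follows" graph into a symmetric one.
class Symmetrize {
    static Map<String, Set<String>> symmetrize(Map<String, Set<String>> outEdges) {
        // Superstep 0: every vertex sends its own ID along each outgoing edge.
        Map<String, List<String>> inbox = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : outEdges.entrySet()) {
            for (String target : e.getValue()) {
                inbox.computeIfAbsent(target, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Superstep 1: every vertex adds an edge to each sender; the sets
        // silently ignore senders that are already neighbors.
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : outEdges.entrySet()) {
            result.put(e.getKey(), new TreeSet<>(e.getValue()));
        }
        for (Map.Entry<String, List<String>> e : inbox.entrySet()) {
            result.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
        }
        return result;
    }

    // Invented sample: @rhatr follows the other two, @c0sin follows @TheASF.
    static String demo() {
        Map<String, Set<String>> follows = new TreeMap<>();
        follows.put("@rhatr", new TreeSet<>(List.of("@TheASF", "@c0sin")));
        follows.put("@c0sin", new TreeSet<>(List.of("@TheASF")));
        follows.put("@TheASF", new TreeSet<>());
        return symmetrize(follows).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

After the second step every edge has a counterpart in the opposite direction, which is the one-way "follow" versus mutual "friend" distinction the slide title alludes to.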
BSP in GraphX
Single source shortest path
scala> val sssp = graph.pregel(Double.PositiveInfinity)( // Initial message
         (id, dist, newDist) => math.min(dist, newDist), // Vertex Program
         triplet => { // Send Message
           if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
             Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
           } else {
             Iterator.empty
           }
         },
         (a,b) => math.min(a,b)) // Merge Messages
scala> println(sssp.vertices.collect.mkString("\n"))
[Figure: example graph annotated with the values 2, 42, 0 and 3]
[Figure: the example graph annotated with the values 2, 5, 0 and 3]
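
For readers without a Spark shell at hand, the three Pregel functions (vertex program, send message, merge messages) can be mimicked in plain Java. This is a sketch under stated assumptions, not GraphX: `PregelSssp` is a hypothetical class, the edge weights in `demo()` are invented, and the source vertex is initialized to distance 0 with all others at infinity, a step the slide does not show.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical Pregel-style single source shortest path in plain Java.
class PregelSssp {
    static final class Edge {
        final int src;
        final int dst;
        final double weight;

        Edge(int src, int dst, double weight) {
            this.src = src;
            this.dst = dst;
            this.weight = weight;
        }
    }

    static Map<Integer, Double> shortestPaths(int[] vertices, Edge[] edges, int source) {
        // Assumed initialization: 0 at the source, infinity everywhere else.
        Map<Integer, Double> dist = new TreeMap<>();
        for (int v : vertices) {
            dist.put(v, v == source ? 0.0 : Double.POSITIVE_INFINITY);
        }
        while (true) {
            // Send message: an edge proposes a distance only if it improves the target.
            // Merge messages: proposals for the same target are combined with min.
            Map<Integer, Double> inbox = new HashMap<>();
            for (Edge e : edges) {
                double candidate = dist.get(e.src) + e.weight;
                if (candidate < dist.get(e.dst)) {
                    inbox.merge(e.dst, candidate, Math::min);
                }
            }
            // No messages in flight: the computation has converged.
            if (inbox.isEmpty()) {
                return dist;
            }
            // Vertex program: keep the smaller of the old and the proposed distance.
            for (Map.Entry<Integer, Double> m : inbox.entrySet()) {
                dist.put(m.getKey(), Math.min(dist.get(m.getKey()), m.getValue()));
            }
        }
    }

    // Invented sample: the direct edge 1 -> 3 is expensive, the detour via 2 is cheap.
    static Map<Integer, Double> demo() {
        int[] vertices = {1, 2, 3};
        Edge[] edges = {
            new Edge(1, 2, 2.0),
            new Edge(2, 3, 3.0),
            new Edge(1, 3, 42.0)
        };
        return shortestPaths(vertices, edges, 1);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the sample, vertex 3 first receives the direct distance 42 and is corrected to 5 one round later, once the cheaper path through vertex 2 has propagated.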
Operational views of the graph
Masking instead of mutation
§ def subgraph(
    epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
    vpred: (VertexID, VD) => Boolean = ((v, d) => true))
  : Graph[VD, ED]
§ def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
Built-in algorithms
§  def pageRank(tol: Double, resetProb: Double = 0.15):
Graph[Double, Double]
§  def connectedComponents(): Graph[VertexID, ED]
§  def triangleCount(): Graph[Int, ED]
§  def stronglyConnectedComponents(numIter: Int): Graph[VertexID, ED]
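
To give a feel for what one of these returns: GraphX's `connectedComponents` labels every vertex with the lowest vertex ID in its component. The plain-Java sketch below reproduces that labeling with repeated min-label propagation. It is an in-place, single-machine variant with hypothetical names (`MinLabelComponents`), not the GraphX implementation, and it treats edges as undirected.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical min-label propagation for connected components.
class MinLabelComponents {
    // Each edge is a two-element array {a, b}; direction is ignored.
    static Map<Long, Long> components(long[] vertices, long[][] edges) {
        // Every vertex starts out labeled with its own ID.
        Map<Long, Long> label = new TreeMap<>();
        for (long v : vertices) {
            label.put(v, v);
        }
        // Keep pushing the smaller label across each edge until nothing changes.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (long[] e : edges) {
                long a = label.get(e[0]);
                long b = label.get(e[1]);
                long min = Math.min(a, b);
                if (a != min) {
                    label.put(e[0], min);
                    changed = true;
                }
                if (b != min) {
                    label.put(e[1], min);
                    changed = true;
                }
            }
        }
        return label;
    }

    // Sample: two components, {1, 2, 3} and {4, 5}.
    static Map<Long, Long> demo() {
        long[] vertices = {1L, 2L, 3L, 4L, 5L};
        long[][] edges = {{1L, 2L}, {2L, 3L}, {4L, 5L}};
        return components(vertices, edges);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Two vertices belong to the same component exactly when they end up with the same label.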
Final thoughts
Giraph
§ An unconstrained BSP framework
§ Specialized fully mutable,
dynamically balanced in-memory
graph representation
§ Very procedural, vertex-centric
programming model
§ Genuine part of Hadoop ecosystem
§ Definitely a 1.0
GraphX
§ An RDD framework
§ Graphs are “views” on RDDs and
thus immutable
§ Functional-like, “declarative”
programming model
§ Genuine part of Spark ecosystem
§ Technically still an alpha
Q&A
Thanks!
