RDD Actions

The document provides examples of common transformations and actions that can be performed on RDDs in PySpark. It demonstrates how to create pair RDDs of (key, value) tuples and use transformations such as reduceByKey(), groupByKey(), and combineByKey() to perform aggregations by key. Other examples show how to use actions such as foreach(), foreachPartition(), fold(), reduce(), and takeOrdered(), as well as sampling functions, on RDDs. It also discusses persisting RDDs in memory or on disk to avoid recomputation.

Actions on RDDs

foreach() example
from pyspark import SparkContext

sc = SparkContext("local", "ForEachExample")
rdd = sc.parallelize([1, 2, 3, 4, 5])

def my_function(x):
    print(x)

rdd.foreach(my_function)

sc.stop()
foreachPartition() example
from pyspark import SparkContext

sc = SparkContext("local", "ForEachPartitionExample")
rdd = sc.parallelize([1, 2, 3, 4, 5], 2) # Creating 2 partitions

def my_partition_function(iterator):
    for x in iterator:
        print(x)

rdd.foreachPartition(my_partition_function)

sc.stop()
fold() example
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("FoldExample").getOrCreate()
# Create an RDD of numbers
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# Define the binary function for multiplication
def multiply(x, y):
    return x * y
# Use the fold function
product_result = numbers_rdd.fold(1, multiply)
# Print the result
print("Product using fold:", product_result)
# Stop the Spark session
spark.stop()
reduce() example
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("ReduceExample").getOrCreate()
# Create an RDD of numbers
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# Define the binary function for addition
def add(x, y):
    return x + y
# Use the reduce function
sum_result = numbers_rdd.reduce(add)
# Print the result
print("Sum using reduce:", sum_result)
# Stop the Spark session
spark.stop()
aggregate() example
import findspark
findspark.init()
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("AggregateExample").getOrCreate()

# Create an RDD of numbers
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

# Define the zero value and the aggregate functions
zero_value = (0, 1)  # Accumulator for (sum, product)

def seq_op(accumulator, element):
    # Update the accumulator by adding the element to the sum and multiplying it into the product
    return (accumulator[0] + element, accumulator[1] * element)

def comb_op(acc1, acc2):
    # Combine two accumulators by adding their sums and multiplying their products
    return (acc1[0] + acc2[0], acc1[1] * acc2[1])

# Use the aggregate function
(sum_result, product_result) = numbers_rdd.aggregate(zero_value, seq_op, comb_op)

# Print the results
print("Sum:", sum_result)
print("Product:", product_result)

# Stop the Spark session
spark.stop()
takeOrdered() example

from pyspark.sql import SparkSession


# Create a Spark session
spark = SparkSession.builder.appName("takeOrderedExample").getOrCreate()
# Sample data
data = [(3, "Alice"), (1, "Bob"), (5, "Charlie"), (2, "David"), (4, "Eve")]
# Create an RDD from the sample data
rdd = spark.sparkContext.parallelize(data)
# Take the top 3 elements based on the first element of each tuple (ascending order)
top_elements = rdd.takeOrdered(3, key=lambda x: x[0])
# Print the top elements
for element in top_elements:
    print(element)
# Stop the Spark session
spark.stop()
Sampling from an RDD: takeSample() example
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("takeSampleExample").getOrCreate()
# Sample data
data = list(range(1, 20))
# Create an RDD from the sample data
rdd = spark.sparkContext.parallelize(data)
# Take a random sample of 5 elements without replacement
sample_without_replacement = rdd.takeSample(False, 5)
# Take a random sample of 5 elements with replacement
sample_with_replacement = rdd.takeSample(True, 5)
# Print the samples
print("Sample without replacement:", sample_without_replacement)
print("Sample with replacement:", sample_with_replacement)
# Stop the Spark session
spark.stop()
Persistence in RDD
• Spark RDDs are lazily evaluated
• Hence Spark will recompute an RDD and its dependencies every time an action is called
• This can become expensive for iterative algorithms
• Persisting the data is a better option
persist()
RDD.persist(storageLevel)
• storageLevel specifies where and how to persist the RDD. It is an optional argument that determines the
storage level.

• Common storage levels include:


• MEMORY_ONLY: Cache the RDD in memory as deserialized Java objects (default).

• MEMORY_ONLY_SER: Cache the RDD in memory as serialized Java objects.

• MEMORY_AND_DISK: Cache the RDD in memory, and spill to disk if the memory is not sufficient.

• MEMORY_AND_DISK_SER: Cache the RDD in memory as serialized Java objects, and spill to disk if the memory is not sufficient.

• DISK_ONLY: Cache the RDD on disk.
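The persist() example on the next slide uses MEMORY_ONLY. As a complementary illustration, here is a minimal sketch (the data and app name are made up, not from the slides) that persists an RDD with MEMORY_AND_DISK, so partitions that do not fit in memory spill to disk instead of being recomputed:

from pyspark import SparkContext
from pyspark.storagelevel import StorageLevel

sc = SparkContext("local", "MemoryAndDiskPersist")

# A small RDD for illustration; in practice this would be a large, expensive-to-recompute dataset
rdd = sc.parallelize(range(1, 1001)).map(lambda x: x * x)

# Keep partitions in memory, spilling to disk if memory is insufficient
rdd.persist(StorageLevel.MEMORY_AND_DISK)

# Both actions reuse the persisted partitions instead of recomputing the map()
print("Count:", rdd.count())
print("Max:", rdd.max())

sc.stop()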


Example of persist()
import findspark
findspark.init()
from pyspark.storagelevel import StorageLevel
from pyspark import SparkContext
sc = SparkContext("local", "RDD Persistence Example")
# Create an RDD
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
# Persist the RDD in memory as deserialized Java objects
rdd.persist(storageLevel=StorageLevel.MEMORY_ONLY)
# Perform some operations on the RDD
sum_result = rdd.reduce(lambda x, y: x + y)
print("Sum of elements:", sum_result)
# The RDD is cached in memory, so it can be reused without recomputation
product_result = rdd.map(lambda x: x * 2).collect()
print("Doubled elements:", product_result)
# Stop the SparkContext
sc.stop()
More about persist() in spark
• If memory overflows, Spark evicts cached partitions based on an LRU (least recently used) policy
• unpersist() can be used to remove an RDD from the cache manually
• rdd.unpersist()
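A minimal sketch of unpersist() in use (the RDD and app name are illustrative, not from the slides):

from pyspark import SparkContext
from pyspark.storagelevel import StorageLevel

sc = SparkContext("local", "UnpersistExample")

rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.persist(StorageLevel.MEMORY_ONLY)

# Use the cached RDD
print("Sum:", rdd.sum())

# Manually remove the RDD from the cache once it is no longer needed
rdd.unpersist()

sc.stop()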
WORKING WITH (KEY,VALUE) PAIRS
Pair RDD
ETL is often performed on an RDD to get it into (key, value) pairs
Special operations are defined on pair RDDs, for example:
reduceByKey()
join() (a minimal sketch is included after the combineByKey() example below)
Creating a pair RDD
import findspark
findspark.init()
from pyspark import SparkContext
# Create a SparkContext
sc = SparkContext("local", "Pair RDD Example")
# Create an RDD with tuples (key, value)
data = [(1, "apple"), (2, "banana"), (3, "cherry"), (4, "date"), (5, "elderberry")]
rdd = sc.parallelize(data)
# Now, 'rdd' is a Pair RDD
# Perform operations on the Pair RDD
# For example, let's filter the fruits with keys greater than 2
filtered_rdd = rdd.filter(lambda x: x[0] > 2)
# Collect and print the results
results = filtered_rdd.collect()
for result in results:
    print(result)
# Stop the SparkContext
sc.stop()
Note: Other languages such as Scala and Java require the RDD's data type to be changed to a pair-RDD type before key-based aggregation functions can be applied; in PySpark, an RDD of 2-tuples can be used directly.
Transformations on Pair RDDs
reduceByKey()
import findspark
findspark.init()
from pyspark import SparkContext
# Create a SparkContext
sc = SparkContext("local", "reduceByKey Example")
# Create a Pair RDD with key-value pairs
data = [(1, 2), (2, 4), (1, 6), (2, 8), (3, 1)]
pair_rdd = sc.parallelize(data)

# Use reduceByKey to calculate the sum of values for each key


sum_rdd = pair_rdd.reduceByKey(lambda x, y: x + y)
# Collect and print the results
results = sum_rdd.collect()
for result in results:
    print("Key:", result[0], "Sum:", result[1])
# Stop the SparkContext
sc.stop()
groupByKey()
import findspark
findspark.init()
from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "groupByKey Example")

# Create a Pair RDD with key-value pairs
data = [(1, 'apple'), (2, 'banana'), (1, 'cherry'), (2, 'date'), (3, 'elderberry')]
pair_rdd = sc.parallelize(data)

# Use groupByKey to group values by key
grouped_rdd = pair_rdd.groupByKey()

# Iterate through the grouped results and print them
for key, values in grouped_rdd.collect():
    print(f"Key: {key}, Values: {list(values)}")

# Stop the SparkContext
sc.stop()

Output:
Key: 1, Values: ['apple', 'cherry']
Key: 2, Values: ['banana', 'date']
Key: 3, Values: ['elderberry']
combineByKey()
from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "combineByKey Example")

# Create a Pair RDD with student scores
data = [("Alice", 85), ("Bob", 90), ("Alice", 78), ("Bob", 88), ("Alice", 92)]
pair_rdd = sc.parallelize(data)

# Use combineByKey to calculate the average score for each student
# - createCombiner initializes an accumulator (sum, count) for each key
# - mergeValue adds a new score to the accumulator
# - mergeCombiners combines the accumulators from different partitions
def createCombiner(score):
    return (score, 1)

def mergeValue(accumulator, score):
    total_score, count = accumulator
    return (total_score + score, count + 1)

def mergeCombiners(accumulator1, accumulator2):
    total_score1, count1 = accumulator1
    total_score2, count2 = accumulator2
    return (total_score1 + total_score2, count1 + count2)

average_scores_rdd = pair_rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)

# Calculate the average score for each student
average_scores = average_scores_rdd.map(lambda x: (x[0], x[1][0] / x[1][1]))

# Collect and print the results
results = average_scores.collect()
for result in results:
    print("Student:", result[0], "Average Score:", result[1])

# Stop the SparkContext
sc.stop()

OUTPUT:
Student: Alice Average Score: 85.0
Student: Bob Average Score: 89.0
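join() was listed earlier among the special pair-RDD operations but is not demonstrated above. Here is a minimal sketch, with made-up data, showing an inner join of two pair RDDs on their keys:

from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "join Example")

# Two pair RDDs sharing some keys (illustrative data)
names = sc.parallelize([(1, "Alice"), (2, "Bob"), (3, "Charlie")])
scores = sc.parallelize([(1, 85), (2, 90), (4, 70)])

# join() keeps only keys present in both RDDs and pairs up their values
joined = names.join(scores)

# Collect and print results such as (1, ('Alice', 85)) and (2, ('Bob', 90))
for key, value in joined.collect():
    print("Key:", key, "Value:", value)

# Stop the SparkContext
sc.stop()

join() returns (key, (value1, value2)) tuples only for keys present in both RDDs; variants such as leftOuterJoin() also keep unmatched keys.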
