
Module V: SPARK

Introduction, Spark Jobs and API, Spark 2.0 Architecture, Resilient Distributed Datasets:
Internal Working, Creating RDDs, Transformations, Actions. Data Frames: Python to RDD
Communications, speeding up PySpark with Data Frames, Creating Data Frames and Simple
Data Frame Queries, Interoperating with RDDs, Querying with Data Frame.

Introduction to Apache Spark:

Apache Spark is an open-source, distributed computing system that provides fast and general-
purpose cluster computing for big data processing. It was developed to overcome the limitations
of the Hadoop MapReduce model, offering better performance and ease of use. Spark supports
various programming languages, including Scala, Java, Python, and R.

Spark Jobs:

In Spark, a job refers to a parallel computation consisting of multiple tasks that can be executed
on a cluster. The central abstraction for Spark jobs is the Resilient Distributed Dataset (RDD),
which is a fault-tolerant collection of elements that can be processed in parallel. Jobs are divided
into stages, and each stage consists of tasks that run on different nodes of the cluster.

Key concepts related to Spark jobs include:

 Transformation: Operations applied to an RDD to create a new RDD (e.g., map, filter).

 Action: Operations that trigger the execution of transformations and return results to the
driver program (e.g., count, collect).

 DAG (Directed Acyclic Graph): Spark represents the execution plan of a job as a DAG,
allowing for efficient parallel execution.

 Shuffling: The process of redistributing data across partitions, often occurring during
operations like grouping or joining.
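
As a quick illustration of these ideas, here is a minimal PySpark sketch (assuming a SparkContext named sc, as created in the examples later in this unit): the transformations only describe the computation, the reduceByKey step introduces a shuffle and hence a stage boundary, and no job runs until the action is called.

# Transformations: lazy, only describe the computation (no job runs yet)
words = sc.parallelize(["spark", "hadoop", "spark", "hive"])
pairs = words.map(lambda w: (w, 1))                 # narrow transformation
counts = pairs.reduceByKey(lambda a, b: a + b)      # wide transformation -> shuffle

# Action: triggers a job, which Spark splits into stages at the shuffle boundary
print(counts.collect())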

Spark API:

Spark provides a rich set of APIs in different programming languages, making it accessible to a
wide range of developers. The primary APIs include:

 RDD API: The core abstraction in Spark, representing an immutable distributed
collection of objects. RDDs support both parallel processing and fault tolerance.

 DataFrame API: A higher-level API providing a more structured and optimized
interface for data manipulation. It introduces the concept of DataFrames, similar to those
in R and Python.
 Spark SQL: A Spark module for structured data processing, enabling the execution of
SQL queries on Spark data.

 MLlib (Machine Learning Library): A scalable machine learning library for Spark,
providing various algorithms and utilities for data analysis.

 Spark Streaming: An extension of the core Spark API for processing real-time
streaming data.

Spark 2.0 Architecture: Efficient Distributed Processing

Spark 2.0 introduced significant improvements over its predecessor, offering a robust and
efficient framework for distributed processing of large datasets. Here's a breakdown of its key
components:

Master-Slave Architecture:

Spark follows a master-slave architecture, consisting of:

 Spark Driver: The master node, responsible for submitting jobs, scheduling tasks, and
coordinating communication across the cluster.

 Executors: Worker processes that run on the cluster's machines and execute the actual
tasks distributed by the driver. Each application gets its own set of executors, ensuring
isolation and resource allocation.

 Cluster Manager: An external service (e.g., YARN, Mesos, Standalone) responsible for
managing resources and allocating them to Spark applications.
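
As a hedged sketch of how an application attaches to a cluster manager, the snippet below builds a SparkSession with a master URL: "local[*]" runs the driver and executors in a single process for testing, while a value such as "yarn" or a standalone URL like "spark://host:7077" (placeholder host) would hand resource allocation to that cluster manager.

from pyspark.sql import SparkSession

# The master URL tells the driver which cluster manager to request executors from
spark = (SparkSession.builder
         .appName("architecture_demo")
         .master("local[*]")                       # or "yarn", "spark://host:7077" (placeholder)
         .config("spark.executor.memory", "2g")    # resources requested per executor
         .getOrCreate())

sc = spark.sparkContext   # driver-side entry point used by the RDD examples below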

Data Abstractions:

Spark 2.0 relies on two fundamental data abstractions for processing:


 Resilient Distributed Datasets (RDDs): Represent in-memory, fault-tolerant collections
of elements distributed across the cluster. RDDs can be created from various data sources
and manipulated through transformations (lazy operations that produce new RDDs) and
actions (operations that trigger computation and return a result).

 Directed Acyclic Graph (DAG): Represents the execution plan for a Spark job. The
driver translates the program logic into a DAG, where each vertex represents a
transformation or action and edges represent dependencies between them.

Processing Workflow:

1. Job Submission: The Spark driver submits the Spark job with the program logic to the
cluster manager.

2. DAG Generation: The driver converts the program logic into a DAG, optimizing the
execution plan.

3. Task Distribution: The driver partitions the RDDs into smaller chunks and distributes
them across the executors.

4. Task Execution: Executors run the assigned tasks on the data partitions in memory,
leveraging in-memory processing for faster performance.

5. Result Aggregation: Results from individual tasks on different executors are shuffled
and aggregated by the driver as per the DAG dependencies.

6. Job Completion: Once all tasks are complete and results are aggregated, the job is
considered finished.

Key Improvements in Spark 2.0:

 Unified DataFrames and Datasets: Introduced DataFrames and Datasets as unified
interfaces for structured data, offering improved performance and ease of use compared
to RDDs.

 Structured Streaming: Enabled real-time stream processing capabilities for continuous
data ingestion and analysis.

 Tungsten Engine: Enhanced performance through optimized memory management and
code generation techniques.

Benefits of Spark 2.0 Architecture:

 Scalability: Handles large datasets efficiently by distributing processing across a cluster
of machines.
 Fault Tolerance: RDDs are fault-tolerant, meaning lost data can be recomputed
automatically.

 Performance: Leverages in-memory processing for faster computations compared to
traditional disk-based methods.

 Ease of Use: High-level APIs like DataFrames simplify data manipulation and analysis.

Resilient Distributed Datasets: Internal Working

Resilient Distributed Datasets (RDDs) are the foundation for data processing in Apache Spark.
Here's a deeper dive into their internal workings:

Key Characteristics:

 Immutable: Once created, an RDD cannot be modified. This simplifies fault tolerance
and parallelization.
 Partitioned: Large datasets are split into smaller logical partitions, allowing for parallel
processing across the cluster's executors.
 Lazy Evaluation: Transformations on RDDs are not immediately computed. Spark builds
a DAG (Directed Acyclic Graph) representing the sequence of operations, and only
executes the required computations when an action (e.g., count, collect) triggers the final
result.

Internal Structure:

An RDD internally consists of the following components:

 List of Partitions: Defines the data distribution across the cluster. Each partition holds a
portion of the entire dataset.
 Partition Function: A function that specifies how to compute each partition. It can access
data from external storage systems like HDFS or generate data on the fly.
 Lineage Information: Tracks the dependencies between RDDs. When a partition is lost
due to a node failure, lineage information allows Spark to recompute the lost data based
on its parent RDDs.
 Preferred Locations (Optional): These specify the nodes on which certain partitions
are preferably computed. This can optimize data locality and minimize network traffic.
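
These internals can be inspected from a program. A small sketch, assuming the SparkContext sc used elsewhere in this unit: getNumPartitions reports the size of the partition list, and toDebugString prints the lineage Spark would use to recompute lost partitions.

rdd = sc.parallelize(range(100), numSlices=4)          # explicitly request 4 partitions
mapped = rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

print(mapped.getNumPartitions())   # number of partitions (4 here)
print(mapped.toDebugString())      # lineage: filter <- map <- parallelize (PySpark may return bytes)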

Transformations and Actions:

 Transformations: Operations that create new RDDs based on existing ones. Examples
include map (applies a function to each element), filter (selects elements based on a
condition), join (combines data from two RDDs). Transformations are lazy and only
define the operations in the DAG.
 Actions: Operations that trigger actual computations on the RDD and return a value to the
driver program. Examples include count (returns the number of elements), collect
(returns all elements as an array), first (returns the first element). Actions trigger the
execution of the DAG and the processing of RDD partitions.

Fault Tolerance:

RDDs achieve fault tolerance through lineage information and in-memory caching. When a node
failure occurs and a partition is lost, Spark can recompute the lost partition using its lineage
information and the data from its parent RDD. Additionally, Spark can cache frequently used
RDDs in memory across the cluster, minimizing recomputation in case of failures.
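
A minimal caching sketch, assuming the sc from the earlier examples and an illustrative HDFS path: the first action computes and caches the filtered partitions, the second reuses them, and any partitions lost to a failure are rebuilt from the lineage.

logs = sc.textFile("hdfs://path/to/logs")              # illustrative path
errors = logs.filter(lambda line: "ERROR" in line)

errors.cache()            # or errors.persist() for other storage levels

print(errors.count())     # first action: computes and caches the partitions
print(errors.count())     # second action: served from the in-memory cache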

Benefits of RDDs:

 Scalability: Distributed processing across a cluster allows handling large datasets
efficiently.
 Reliability: Lineage information and caching ensure fault tolerance and data recovery in
case of failures.
 Flexibility: A rich set of transformations allows for complex data manipulation
workflows.

Creating RDDs:
There are two primary ways to create RDDs in Spark:
1. Parallelizing an Existing Collection:
 This approach is suitable for smaller datasets that can fit entirely in-memory on the driver
program.
 You leverage the sparkContext.parallelize(collection) method.
 The collection argument can be:
o A list or array of elements in your program.
o The contents of a local file on the driver node, once read into a Python collection.
Here's an example:
# Import SparkContext
from pyspark import SparkContext

# Create (or reuse) a SparkContext
sc = SparkContext.getOrCreate()

# Create a list of numbers
data = [1, 2, 3, 4, 5]

# Parallelize the list to create an RDD
numbers_rdd = sc.parallelize(data)

# Print the first element of the RDD (action)
print(numbers_rdd.first())

2. Referencing External Datasets:


 This approach is ideal for large datasets stored in external storage systems like HDFS,
Cassandra, or any data source supported by Hadoop InputFormat.
 Spark provides various methods for reading data depending on the data format (text files,
JSON, CSV, etc.). Here are some common examples:
Text, CSV, and JSON Files:
# Read a text file line by line into an RDD
text_rdd = sc.textFile("hdfs://path/to/your/file.txt")
# CSV and JSON are usually read through the DataFrame API of a SparkSession
# (spark = SparkSession.builder.getOrCreate()); the result can be converted to an RDD
csv_rdd = spark.read.csv("hdfs://path/to/your/data.csv", header=True).rdd
json_rdd = spark.read.json("hdfs://path/to/your/data.json").rdd
Key Points:
 When creating RDDs from external datasets, Spark reads the data in partitions for parallel
processing across the cluster.
 You can specify the number of partitions while reading the data to control the level of
parallelism.
 The default number of partitions is determined by Spark based on the cluster
configuration.
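The level of parallelism can be set at creation time, as in this small sketch (paths and partition counts are illustrative):

# Ask for 8 partitions when parallelizing a local collection
nums_rdd = sc.parallelize(range(1000), numSlices=8)

# Ask for at least 16 input partitions when reading from storage
text_rdd = sc.textFile("hdfs://path/to/big/file.txt", minPartitions=16)

print(nums_rdd.getNumPartitions(), text_rdd.getNumPartitions())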
Additional Considerations:
 Once you have created an RDD, you can manipulate it using transformations (e.g., map,
filter) to create new RDDs or use actions (e.g., count, collect) to obtain results.
 RDDs are immutable, so any transformation results in a new RDD, leaving the original
data untouched.
In Apache Spark, Resilient Distributed Datasets (RDDs) are the fundamental data structure for
representing and processing large datasets across a cluster. Understanding how to create,
manipulate, and interact with RDDs is essential for working with Spark. Here's a breakdown of
RDD transformations and actions:

Transformations:

 Operations that create a new RDD based on an existing one.

 Transformations define a series of steps to be performed on the data but don't
immediately execute any computations.

 Spark builds a Directed Acyclic Graph (DAG) representing the sequence of
transformations applied.

 Common Transformations:

o map(func): Applies a function to each element of the RDD, resulting in a new
RDD with transformed elements.

o filter(func): Selects elements based on a condition defined by the function,
creating a new RDD with only the matching elements.

o flatMap(func): Similar to map, but allows each element to be transformed into
zero, one, or more elements in the resulting RDD.

o join(otherRDD): For key-value (pair) RDDs, combines elements from the two
RDDs that share the same key.

o union(otherRDD): Creates a new RDD by combining all elements from two
RDDs.

o distinct(): Returns a new RDD with duplicate elements removed.

o sortBy(keyfunc, ascending=True): Sorts elements based on a key-extraction
function, with optional control over ascending or descending order (sortByKey
does the same for pair RDDs, sorting by their keys). A short sketch of these
transformations follows below.
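
The sketch below strings several of these transformations together (with made-up data); nothing executes until the actions at the end.

lines = sc.parallelize(["spark is fast", "spark is easy"])
words = lines.flatMap(lambda line: line.split())       # one line -> many words
unique_words = words.distinct()                        # drop duplicates
all_words = unique_words.union(sc.parallelize(["hadoop", "hive"]))
sorted_words = all_words.sortBy(lambda w: w)           # sort by the word itself

# Pair-RDD join: combines elements that share the same key
ages = sc.parallelize([("alice", 25), ("bob", 30)])
cities = sc.parallelize([("alice", "Pune"), ("bob", "Delhi")])

print(sorted_words.collect())
print(ages.join(cities).collect())   # e.g. [('alice', (25, 'Pune')), ('bob', (30, 'Delhi'))]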

Actions:

 Operations that trigger the actual computation on the RDD and return a value to
the driver program.

 Actions force the execution of the DAG and processing of RDD partitions based on the
defined transformations.

 Common Actions:

o count(): Returns the total number of elements in the RDD.

o first(): Returns the first element in the RDD.

o collect(): Returns all elements of the RDD as a collection to the driver program
(caution: use with smaller datasets due to memory limitations).

o reduce(func): Reduces the RDD to a single value by applying a function that
combines elements pairwise.

o take(n): Returns the first n elements from the RDD.

o saveAsTextFile(path): Saves the RDD data as a text file.

Key Points:

 Transformations and actions are crucial for manipulating and interacting with RDDs in
Spark.

 Transformations are lazy, meaning they don't execute until an action triggers the DAG.
This allows for efficient optimization and reduces unnecessary computations.
 RDDs are immutable, so transformations create new RDDs without modifying the
original data.

Example:

# Create a sample RDD

numbers = sc.parallelize([1, 2, 3, 4, 5])

# Apply transformations (define DAG)

squared_numbers = numbers.map(lambda x: x * x) # Square each element

filtered_numbers = squared_numbers.filter(lambda x: x % 2 == 0) # Filter even numbers

# Trigger action (execute DAG)

count = filtered_numbers.count() # Count the even squared numbers

print(f"Number of even squared numbers: {count}")

In this example, the map and filter transformations define the operations but don't perform
calculations until the count action is called. This triggers the execution of the DAG and returns
the final result to the driver program.

Data Frames: Python to RDD Communications

1. DataFrame to RDD Conversion:

 You can convert a DataFrame into an RDD using the rdd attribute.

 This creates an RDD of Row objects, where each Row object represents a row in the
DataFrame.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df_to_rdd").getOrCreate()

# Create a DataFrame

data = [("Alice", 25), ("Bob", 30), ("Charlie", 28)]

df = spark.createDataFrame(data, ["name", "age"])

# Convert DataFrame to RDD

name_age_rdd = df.rdd

# Access elements in the RDD (Row objects)
for row in name_age_rdd.collect():
    print(f"Name: {row.name}, Age: {row.age}")

2. RDD to DataFrame Conversion:

 You can convert an RDD back into a DataFrame using the toDF method of the RDD
object.

 This requires providing a schema (column names and data types) that aligns with the
structure of the RDD data.

# Sample RDD of tuples

data_rdd = sc.parallelize([("David", 35), ("Emily", 22)])

# Define the schema for the DataFrame

schema = ["name", "age"]

# Convert RDD to DataFrame

df = data_rdd.toDF(schema)

# Print the DataFrame

df.show()

Use Cases:

 Leveraging Existing RDD Code: If you have existing code that manipulates data using
RDDs, you can convert a DataFrame to an RDD to utilize those functionalities.

 Fine-Grained Control: RDDs offer more low-level control over data manipulation
compared to DataFrames. In specific scenarios, converting a DataFrame to an RDD
might be necessary for achieving the desired outcome.

 Interoperability: In some cases, you might need to interact with data structures from
other Spark libraries (e.g., MLlib) that primarily work with RDDs. Converting a
DataFrame to an RDD can facilitate this interaction.
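
As a small illustration of the first use case, the sketch below reuses the df defined above and hands its rows to existing RDD-style code that expects plain Python tuples (the labelling logic is made up for the example).

# Existing RDD-style logic that expects (name, age) tuples
def label_adult(record):
    name, age = record
    return (name, "adult" if age >= 18 else "minor")

tuples_rdd = df.rdd.map(lambda row: (row.name, row.age))   # Row -> plain tuple
labelled_rdd = tuples_rdd.map(label_adult)

print(labelled_rdd.collect())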

Important Considerations:

 Converting a DataFrame to an RDD loses the benefits of DataFrame optimizations and
type safety.
 Frequent conversions between DataFrames and RDDs can be inefficient. It's generally
recommended to primarily use DataFrames for structured data processing and leverage
RDDs only when necessary.

 Consider using Spark Datasets (introduced in Spark 2.0) as a unified approach for
structured data, offering the advantages of both DataFrames and RDDs. (The typed
Dataset API is available in Scala and Java; in Python, DataFrames fill this role.)

Speeding up PySpark with Data Frames:

Data Serialization and Caching:

 Kryo Serializer: The default serialization format in Spark is Java serialization
(JavaSerializer). Kryo serialization offers significant performance improvements by
being faster and more compact. You can configure your Spark application to use Kryo
by setting spark.serializer to org.apache.spark.serializer.KryoSerializer in your Spark
configuration (a configuration sketch follows this list).

 Data Caching: Frequently accessed DataFrames can benefit from caching in memory
across executors. This minimizes redundant computations by reusing the cached data for
subsequent operations. Use the cache() method on your DataFrame to enable caching.
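
A configuration sketch for these two suggestions (the application name and file path are placeholders): Kryo is selected through spark.serializer when the session is built, and cache() marks the DataFrame for reuse; the cache is only materialized when the first action runs.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuning_demo")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

df = spark.read.csv("data.csv", header=True, inferSchema=True)

df.cache()          # keep the DataFrame in executor memory after first use
print(df.count())   # materializes the cache
print(df.count())   # reuses the cached data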

Optimizing DataFrame Operations:

 Filter Pushdown: Push filtering operations down to the data source during data reading.
This reduces the amount of data transferred to the cluster for processing, improving
performance. Spark can often push down filters automatically; applying the filter right
after the read (e.g., spark.read.parquet("path/to/data").filter("column_name > 10")) lets
the optimizer push it into the scan.

 Partition Pruning: Similar to filter pushdown, partition pruning eliminates irrelevant
data partitions based on filtering conditions. Spark can intelligently prune partitions
during reading, further reducing data transfer and processing time.

 Columnar Processing: DataFrames store data in a columnar format, where each column
is stored separately. This allows Spark to process only the required columns for specific
operations, instead of the entire row, leading to performance gains.
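
A minimal pushdown sketch, assuming a Parquet dataset at an illustrative path with illustrative column names: selecting only the needed columns and filtering right after the read lets the optimizer push both the column pruning and the filter into the Parquet scan, which df.explain() makes visible in the physical plan.

df = (spark.read.parquet("path/to/data")   # illustrative path
      .select("user_id", "amount")         # column pruning: read only these columns
      .filter("amount > 10"))              # filter pushed into the Parquet scan

df.explain()   # physical plan shows the pushed filters and read schema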

Leveraging Hardware and Cluster Configuration:

 Utilizing In-Memory Computing: Spark can leverage available memory on worker
nodes for in-memory processing. Ensure you have sufficient memory allocated to your
Spark executors to maximize in-memory operations for DataFrames.
 Tuning Shuffle Operations: Shuffle operations involve data exchange between
executors during certain DataFrame operations (e.g., joins, aggregations). Optimize
shuffle performance by adjusting configurations such as spark.sql.shuffle.partitions
(the number of shuffle partitions) and the memory settings (e.g., spark.memory.fraction)
that govern when data spills to disk. A configuration sketch follows this list.
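
The configuration sketch referred to above shows where these knobs are usually set; the values are placeholders to be tuned for a real cluster.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("resource_tuning_demo")
         .config("spark.executor.memory", "4g")           # memory per executor
         .config("spark.executor.cores", "4")             # cores per executor
         .config("spark.sql.shuffle.partitions", "200")   # partitions used by joins/aggregations
         .config("spark.memory.fraction", "0.6")          # share of heap for execution and storage
         .getOrCreate())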

Advanced Techniques:

 Broadcast Variables: For small, frequently used datasets (e.g., lookup tables), consider
broadcasting them as broadcast variables. This makes a single copy of the data available
on each executor, avoiding redundant data transfer for each task.

 Tungsten Engine (Spark 2.0+): Introduced in Spark 2.0, the Tungsten Engine optimizes
memory management and code generation for DataFrame operations, leading to
significant performance improvements.
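
A sketch of both ideas, assuming the sc and spark handles used elsewhere in this unit (the lookup data is made up): a broadcast variable ships one read-only copy of a small dataset to each executor, and the broadcast() hint asks Spark to use a broadcast join for a small DataFrame.

from pyspark.sql.functions import broadcast

# Broadcast variable: a read-only lookup shared by all tasks
country_names = sc.broadcast({"IN": "India", "US": "United States"})
resolved = sc.parallelize(["IN", "US", "IN"]).map(lambda c: country_names.value[c])
print(resolved.collect())

# Broadcast join hint: avoids shuffling the large side when one side is small
small_df = spark.createDataFrame([("IN", "Asia"), ("US", "Americas")], ["code", "region"])
big_df = spark.createDataFrame([("IN", 100), ("US", 200)], ["code", "sales"])
big_df.join(broadcast(small_df), "code").show()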

Additional Tips:

 Code Optimization: Review your code and identify potential bottlenecks. Use clear and
concise DataFrame operations to avoid unnecessary computations.

 Monitoring and Profiling: Utilize Spark's built-in monitoring tools (e.g., Spark UI) to
identify slow tasks and execution bottlenecks within your application.

 Consider Alternatives: For specific use cases, libraries like Apache Arrow or Koalas
might offer performance advantages over PySpark DataFrames. Evaluate your needs and
explore alternative options if performance is a critical factor.

Creating Data Frames and Simple Data Frame Queries:

Here's a breakdown of creating DataFrames and performing simple queries in PySpark:

1. Creating DataFrames:

There are several ways to create DataFrames in PySpark:

 From Lists or Tuples:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df_example").getOrCreate()

# Create a list of tuples

data = [("Alice", 25), ("Bob", 30), ("Charlie", 28)]

# Create a DataFrame with schema (optional)


df = spark.createDataFrame(data, ["name", "age"])

# Print the DataFrame

df.show()

 From CSV, Parquet, or JSON Files:

# Read a CSV file with header (infers schema)

df = spark.read.csv("data.csv", header=True)

# Read a Parquet file

df = spark.read.parquet("data.parquet")

# Read a JSON file

df = spark.read.json("data.json")

 Using SQL Commands:

# Register an existing DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Create a DataFrame using SQL syntax
people_df = spark.sql("SELECT * FROM people")

2. Simple Data Frame Queries:

DataFrames offer a variety of methods for querying and manipulating data:

 Selecting Columns:

o select(col1, col2): Selects specific columns.

o df.name (or df["name"]): References a single column by name, for use inside select, filter, etc.

 Filtering Data:

o filter(condition): Filters rows based on a condition. (e.g., df.filter(df.age > 25))

 Sorting:

o orderBy(col1.desc(), col2.asc()): Sorts by multiple columns with optional
ascending/descending order.
 Aggregation:

o count(): Counts the number of rows.

o avg(col): Calculates the average of a numeric column.

o min(col) or max(col): Finds the minimum or maximum value in a column.

 Grouping:

o groupBy(col): Groups rows by a specific column.

o You can then use aggregation functions on the grouped DataFrame.

Here's an example demonstrating some of these operations:

# Filter for adults (age >= 18) and keep the name and age columns
adults = df.filter(df.age >= 18).select("name", "age")

# Sort adults by age in descending order
sorted_adults = adults.orderBy(adults.age.desc())

# Count the number of adults
adult_count = sorted_adults.count()

print(f"Number of adults: {adult_count}")

Interoperating with RDDs:

1. DataFrame to RDD Conversion:

 You can convert a DataFrame into an RDD using the rdd attribute.

 This creates an RDD of Row objects, where each Row object represents a row in the
DataFrame.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df_to_rdd").getOrCreate()

# Create a DataFrame

data = [("Alice", 25), ("Bob", 30), ("Charlie", 28)]

df = spark.createDataFrame(data, ["name", "age"])


# Convert DataFrame to RDD

name_age_rdd = df.rdd

# Access elements in the RDD (Row objects)
for row in name_age_rdd.collect():
    print(f"Name: {row.name}, Age: {row.age}")

2. RDD to DataFrame Conversion:

 You can convert an RDD back into a DataFrame using the toDF method of the RDD
object.

 This requires providing a schema (column names and data types) that aligns with the
structure of the RDD data.

# Sample RDD of tuples

data_rdd = sc.parallelize([("David", 35), ("Emily", 22)])

# Define the schema for the DataFrame

schema = ["name", "age"]

# Convert RDD to DataFrame

df = data_rdd.toDF(schema)

# Print the DataFrame

df.show()

Use Cases for Interoperability:

 Leveraging Existing RDD Code: If you have existing code that manipulates data using
RDDs, you can convert a DataFrame to an RDD to utilize those functionalities.

 Fine-Grained Control: RDDs offer more low-level control over data manipulation
compared to DataFrames. In specific scenarios, converting a DataFrame to an RDD
might be necessary for achieving the desired outcome.

 Interoperability: In some cases, you might need to interact with data structures from other
Spark libraries (e.g., MLlib) that primarily work with RDDs. Converting a DataFrame to
an RDD can facilitate this interaction.
Important Considerations:

 Converting a DataFrame to an RDD loses the benefits of DataFrame optimizations and
type safety.

 Frequent conversions between DataFrames and RDDs can be inefficient. It's generally
recommended to primarily use DataFrames for structured data processing and leverage
RDDs only when necessary.

 Consider using Spark Datasets (introduced in Spark 2.0) as a unified approach for
structured data, offering the advantages of both DataFrames and RDDs. (The typed
Dataset API is available in Scala and Java; in Python, DataFrames fill this role.)

Additional Techniques:

 Working with Specific Columns: You can use the select method on a DataFrame to select
specific columns and convert the resulting DataFrame subset into an RDD. This allows
you to focus on a particular portion of the data for RDD manipulation.

 Custom RDD Transformations: If you require specific transformations not readily
available in DataFrames, you can convert the DataFrame to an RDD, define a custom
transformation function, and then convert the resulting RDD back to a DataFrame, as
sketched below.
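
A sketch of that round trip, reusing the df with name and age columns from earlier in this section; the age-group logic is arbitrary and only stands in for a transformation that is awkward to express with DataFrame operations.

# Custom row-level transformation expressed on the RDD
def add_age_group(row):
    group = "30s" if row.age >= 30 else "20s"
    return (row.name, row.age, group)

transformed_rdd = df.rdd.map(add_age_group)

# Back to a DataFrame, supplying column names for the new schema
result_df = transformed_rdd.toDF(["name", "age", "age_group"])
result_df.show()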

Querying with Data Frame:

In Apache Spark, DataFrames offer a powerful and user-friendly way to query and manipulate
structured data. Here's a breakdown of querying with DataFrames:

1. Selecting Columns:

 You can select specific columns from a DataFrame using the select method. This returns
a new DataFrame containing only the chosen columns.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df_query").getOrCreate()

# Create a sample DataFrame

data = [("Alice", 25, "Engineer"), ("Bob", 30, "Manager"), ("Charlie", 28, "Analyst")]

df = spark.createDataFrame(data, ["name", "age", "occupation"])

# Select name and age columns

selected_df = df.select("name", "age")

# Print the selected DataFrame


selected_df.show()

2. Filtering Data:

 You can filter rows based on specific conditions using the filter method. This creates a
new DataFrame containing only rows that meet the filtering criteria.

# Filter for adults (age >= 18)

adults_df = df.filter(df.age >= 18)

# Filter for people with specific occupations

managers_df = df.filter(df.occupation == "Manager")

# Print the filtered DataFrames

adults_df.show()

managers_df.show()

3. Sorting Data:

 You can sort the DataFrame by one or more columns using the orderBy method. This
allows you to order the data in ascending or descending order.

# Sort by age in descending order

sorted_df = df.orderBy(df.age.desc())

# Sort by name in ascending order and then by age in descending order

sorted_df = df.orderBy("name", df.age.desc())

# Print the sorted DataFrame

sorted_df.show()

4. Joining DataFrames:

 You can combine data from two DataFrames based on a shared column using
the join method. This is useful for relational operations between datasets.

# Create another DataFrame that maps occupations to locations
departments_df = spark.createDataFrame(
    [("Engineer", "Tech"), ("Manager", "HR")],
    ["department", "location"],
)

# Join DataFrames where occupation matches department
joined_df = df.join(
    departments_df,
    df.occupation == departments_df.department,
    "inner",   # Choose join type (inner, outer, left, right, etc.)
)

# Print the joined DataFrame

joined_df.show()

5. Aggregation:

 You can perform various aggregation operations on DataFrames using functions
like count, avg, min, max, and sum. These functions summarize data across the entire
DataFrame or grouped subsets.

# Count the total number of rows

total_count = df.count()

# Calculate the average age

avg_age = df.agg({"age": "avg"}) # Returns a DataFrame with the aggregate result

# Find the minimum and maximum age (using aggregate functions from pyspark.sql.functions)
from pyspark.sql.functions import min as min_, max as max_

min_max_age = df.select(min_("age").alias("min_age"), max_("age").alias("max_age"))

# Print the aggregation results

print(f"Total Count: {total_count}")

avg_age.show()

min_max_age.show()

6. Grouping:

 You can group rows based on a specific column using the groupBy method. This allows
you to perform aggregate operations on groups of rows that share the same value in the
grouping column.

# Group by occupation and calculate average age within each group

avg_age_by_occupation = df.groupBy("occupation").agg({"age": "avg"})


# Print the grouped DataFrame with average age

avg_age_by_occupation.show()

Spark SQL Integration:

DataFrames also seamlessly integrate with Spark SQL, allowing you to use SQL-like syntax for
querying. Here's an example:


# Register the DataFrame as a temporary table

df.createOrReplaceTempView("people")

# Query the data using SQL syntax

sql_df = spark.sql("SELECT name, age FROM people WHERE age > 25 ORDER BY age
DESC")

# Print the result DataFrame

sql_df.show()
