Unit 5 Note
Introduction, Spark Jobs and API, Spark 2.0 Architecture, Resilient Distributed Datasets:
Internal Working, Creating RDDs, Transformations, Actions. Data Frames: Python to RDD
Communications, speeding up PySpark with Data Frames, Creating Data Frames and Simple
Data Frame Queries, Interoperating with RDDs, Querying with Data Frame.
Apache Spark is an open-source, distributed computing system that provides fast and general-
purpose cluster computing for big data processing. It was developed to overcome the limitations
of the Hadoop MapReduce model, offering better performance and ease of use. Spark supports
various programming languages, including Scala, Java, Python, and R.
Spark Jobs:
In Spark, a job refers to a parallel computation consisting of multiple tasks that can be executed
on a cluster. The central abstraction for Spark jobs is the Resilient Distributed Dataset (RDD),
which is a fault-tolerant collection of elements that can be processed in parallel. Jobs are divided
into stages, and each stage consists of tasks that run on different nodes of the cluster.
Transformation: Operations applied to an RDD to create a new RDD (e.g., map, filter).
Action: Operations that trigger the execution of transformations and return results to the
driver program (e.g., count, collect).
DAG (Directed Acyclic Graph): Spark represents the execution plan of a job as a DAG,
allowing for efficient parallel execution.
Shuffling: The process of redistributing data across partitions, often occurring during
operations like grouping or joining.
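As a minimal sketch of these ideas (the input data is made up), the following PySpark job builds a DAG from two transformations, incurs a shuffle in reduceByKey, and only executes when the collect action is called:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
words = sc.parallelize(["spark", "hadoop", "spark", "rdd"])
# Transformations: lazily recorded in the DAG, nothing runs yet
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)  # requires a shuffle across partitions
# Action: triggers execution of the whole DAG and returns results to the driver
print(counts.collect())  # e.g. [('spark', 2), ('hadoop', 1), ('rdd', 1)]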
Spark API:
Spark provides a rich set of APIs in different programming languages, making it accessible to a
wide range of developers. The primary APIs include:
Spark Core (RDD API): The base API for working with Resilient Distributed Datasets and
low-level cluster operations.
Spark SQL and DataFrames: APIs for querying and manipulating structured data using
SQL-like syntax and the DataFrame abstraction.
MLlib (Machine Learning Library): A scalable machine learning library for Spark,
providing various algorithms and utilities for data analysis.
Spark Streaming: An extension of the core Spark API for processing real-time
streaming data.
Spark 2.0 Architecture:
Spark 2.0 introduced significant improvements over its predecessor, offering a robust and
efficient framework for distributed processing of large datasets. Here's a breakdown of its key
components:
Master-Slave Architecture:
Spark Driver: The master node, responsible for submitting jobs, scheduling tasks, and
coordinating communication across the cluster.
Executors: Slave nodes that reside on worker machines and execute the actual tasks
distributed by the driver. Each application gets its own set of executors, ensuring
isolation and resource allocation.
Cluster Manager: An external service (e.g., YARN, Mesos, Standalone) responsible for
managing resources and allocating them to Spark applications.
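As a small illustration (the master URL and resource settings below are only examples), the driver is typically configured through a SparkSession that tells Spark which cluster manager to connect to and how much memory and how many cores each executor should receive:
from pyspark.sql import SparkSession
# "local[*]" runs driver and executors inside one local JVM for testing;
# on a real cluster this would be a YARN, Mesos, or standalone master URL
spark = (SparkSession.builder
         .appName("architecture_demo")
         .master("local[*]")
         .config("spark.executor.memory", "2g")   # memory per executor (illustrative)
         .config("spark.executor.cores", "2")     # cores per executor (illustrative)
         .getOrCreate())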
Data Abstractions:
RDDs (Resilient Distributed Datasets): Immutable, partitioned collections of elements that
can be processed in parallel (covered in detail below).
DataFrames and Datasets: Higher-level abstractions built on top of RDDs for structured
data, enabling simpler manipulation and additional optimizations.
Directed Acyclic Graph (DAG): Represents the execution plan for a Spark job. The
driver translates the program logic into a DAG, where each vertex represents a
transformation or action and edges represent dependencies between them.
Processing Workflow:
1. Job Submission: The Spark driver submits the Spark job with the program logic to the
cluster manager.
2. DAG Generation: The driver converts the program logic into a DAG, optimizing the
execution plan.
3. Task Distribution: The driver partitions the RDDs into smaller chunks and distributes
them across the executors.
4. Task Execution: Executors run the assigned tasks on the data partitions in memory,
leveraging in-memory processing for faster performance.
5. Result Aggregation: Intermediate results are shuffled between executors as required by
the DAG dependencies, and the final results are returned to the driver for aggregation.
6. Job Completion: Once all tasks are complete and results are aggregated, the job is
considered finished.
Ease of Use: High-level APIs like DataFrames simplify data manipulation and analysis.
Resilient Distributed Datasets (RDDs) are the foundation for data processing in Apache Spark.
Here's a deeper dive into their internal workings:
Key Characteristics:
Immutable: Once created, an RDD cannot be modified. This simplifies fault tolerance
and parallelization.
Partitioned: Large datasets are split into smaller logical partitions, allowing for parallel
processing across the cluster's executors.
Lazy Evaluation: Transformations on RDDs are not immediately computed. Spark builds
a DAG (Directed Acyclic Graph) representing the sequence of operations, and only
executes the required computations when an action (e.g., count, collect) triggers the final
result.
Internal Structure:
List of Partitions: Defines the data distribution across the cluster. Each partition holds a
portion of the entire dataset.
Partition Function: A function that specifies how to compute each partition. It can access
data from external storage systems like HDFS or generate data on the fly.
Lineage Information: Tracks the dependencies between RDDs. When a partition is lost
due to a node failure, lineage information allows Spark to recompute the lost data based
on its parent RDDs.
Preferred Locations (Optional): These specify the executors on which certain partitions
are preferably computed. This can optimize data locality and minimize network traffic.
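As a brief sketch (the element count and partition count are arbitrary), the number of partitions and the lineage of an RDD can be inspected directly:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100), numSlices=4)  # split the data into 4 partitions
doubled = rdd.map(lambda x: x * 2)
print(doubled.getNumPartitions())  # 4
print(doubled.toDebugString())     # lineage: the map RDD points back to the parallelized data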
Transformations: Operations that create new RDDs based on existing ones. Examples
include map (applies a function to each element), filter (selects elements based on a
condition), join (combines data from two RDDs). Transformations are lazy and only
define the operations in the DAG.
Actions: Operations that trigger actual computations on the RDD and return a value to the
driver program. Examples include count (returns the number of elements), collect
(returns all elements as an array), first (returns the first element). Actions trigger the
execution of the DAG and the processing of RDD partitions.
Fault Tolerance:
RDDs achieve fault tolerance through lineage information and in-memory caching. When a node
failure occurs and a partition is lost, Spark can recompute the lost partition using its lineage
information and the data from its parent RDD. Additionally, Spark can cache frequently used
RDDs in memory across the cluster, minimizing recomputation in case of failures.
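As a small sketch (the storage level choice is just an example), an RDD that several actions will reuse can be persisted in memory so that it is not recomputed from its lineage each time:
from pyspark import SparkContext, StorageLevel
sc = SparkContext.getOrCreate()
doubled = sc.parallelize(range(1000)).map(lambda x: x * 2)
doubled.persist(StorageLevel.MEMORY_ONLY)  # cache partitions in executor memory
print(doubled.count())  # first action computes and caches the partitions
print(doubled.sum())    # reuses the cached data; lost partitions would be rebuilt from lineage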
Benefits of RDDs:
Fault tolerance through lineage, in-memory computation and caching for speed, parallel
processing across partitions, lazy evaluation that lets Spark optimize the execution plan, and
immutability that simplifies reasoning about distributed computations.
Creating RDDs:
There are two primary ways to create RDDs in Spark:
1. Parallelizing an Existing Collection:
This approach is suitable for smaller datasets that can fit entirely in memory on the driver
program.
You leverage the sparkContext.parallelize(collection) method.
The collection argument is an in-memory collection in your driver program, for example:
o A Python list or tuple of elements.
o A range or other iterable that can be materialized on the driver.
(Loading data from files is handled by the second method below rather than by parallelize.)
Here's an example:
# Import SparkContext
from pyspark import SparkContext
# Create (or reuse) a SparkContext
sc = SparkContext.getOrCreate()
# Parallelize an in-memory collection into an RDD
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5])
print(numbers_rdd.collect())  # [1, 2, 3, 4, 5]
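2. Referencing an External Dataset:
The second way is to load data from external storage (e.g., HDFS, S3, or the local file system)
using sparkContext.textFile, which turns each line of the input into one element of the RDD. A
minimal sketch (the path is illustrative):
# Each line of the file becomes one element of the RDD
lines_rdd = sc.textFile("hdfs://namenode:9000/data/input.txt")
print(lines_rdd.count())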
Transformations:
Operations that create a new RDD from an existing one. Transformations are lazy: they only
record the operation in the DAG, and nothing is computed until an action runs.
Common Transformations:
o map(func): Applies a function to each element and returns a new RDD.
o filter(func): Keeps only the elements for which the function returns True.
o flatMap(func): Like map, but each input element can produce zero or more output elements.
o join(other): Combines two key-value RDDs on their keys.
Actions:
Operations that trigger the actual computation on the RDD and return a value to
the driver program.
Actions force the execution of the DAG and processing of RDD partitions based on the
defined transformations.
Common Actions:
o collect(): Returns all elements of the RDD as a collection to the driver program
(caution: use with smaller datasets due to driver memory limitations).
o count(): Returns the number of elements in the RDD.
o first(): Returns the first element.
o take(n): Returns the first n elements.
Key Points:
Transformations and actions are crucial for manipulating and interacting with RDDs in
Spark.
Transformations are lazy, meaning they don't execute until an action triggers the DAG.
This allows for efficient optimization and reduces unnecessary computations.
RDDs are immutable, so transformations create new RDDs without modifying the
original data.
Example:
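A minimal sketch consistent with the description below (the numbers and the threshold are illustrative):
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])
# Transformations: lazily define the DAG
squares = numbers.map(lambda x: x * x)
large_squares = squares.filter(lambda x: x > 10)
# Action: triggers execution of the DAG
print(large_squares.count())  # 3 (the squares 16, 25, and 36)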
In this example, the map and filter transformations define the operations but don't perform
calculations until the count action is called. This triggers the execution of the DAG and returns
the final result to the driver program.
Python to RDD Communications:
You can convert a DataFrame into an RDD using the rdd attribute. This creates an RDD of
Row objects, where each Row object represents a row in the DataFrame.
spark = SparkSession.builder.appName("df_to_rdd").getOrCreate()
# Create a DataFrame
name_age_rdd = df.rdd
You can convert an RDD back into a DataFrame using the toDF method of the RDD
object.
This requires providing a schema (column names and data types) that aligns with the
structure of the RDD data.
data_rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])
schema = ["name", "age"]  # column names; the types are inferred from the data
df = data_rdd.toDF(schema)
df.show()
Use Cases:
Leveraging Existing RDD Code: If you have existing code that manipulates data using
RDDs, you can convert a DataFrame to an RDD to utilize those functionalities (see the
sketch after this list).
Fine-Grained Control: RDDs offer more low-level control over data manipulation
compared to DataFrames. In specific scenarios, converting a DataFrame to an RDD
might be necessary for achieving the desired outcome.
Interoperability: In some cases, you might need to interact with data structures from
other Spark libraries (e.g., MLlib) that primarily work with RDDs. Converting a
DataFrame to an RDD can facilitate this interaction.
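For instance, a small sketch (reusing the name/age DataFrame df created above) of applying RDD-level Python logic and converting the result back to a DataFrame:
rows_rdd = df.rdd  # RDD of Row objects from the DataFrame above
# Apply arbitrary Python logic at the RDD level
older_rdd = rows_rdd.map(lambda row: (row.name, row.age + 1))
older_df = older_rdd.toDF(["name", "age"])
older_df.show()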
Important Considerations:
Consider using Spark Datasets (introduced in Spark 2.0) as a unified approach for
structured data, offering the advantages of both DataFrames and RDDs.
Speeding up PySpark with DataFrames:
DataFrames let Spark apply several optimizations that are not available to plain RDD code:
Data Caching: Frequently accessed DataFrames can benefit from caching in memory
across executors. This minimizes redundant computations by reusing the cached data for
subsequent operations. Use the cache() method on your DataFrame to enable caching.
Filter Pushdown: Push filtering operations down to the data source during data reading.
This reduces the amount of data read and transferred to the cluster, improving
performance. Spark can often push filters down automatically when you filter a
DataFrame right after reading it, e.g.
spark.read.parquet("path/to/data").filter("column_name > 10") (see the sketch after this
list).
Columnar Processing: DataFrames store data in a columnar format, where each column
is stored separately. This allows Spark to process only the required columns for specific
operations, instead of the entire row, leading to performance gains.
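A short sketch combining caching and filter pushdown (the path and column name are illustrative, as in the bullets above):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Spark can push this filter down to the Parquet scan, reading less data
df = spark.read.parquet("path/to/data")
filtered = df.filter(df.column_name > 10).select("column_name")
# Cache a DataFrame that several later queries will reuse
filtered.cache()
filtered.count()  # the first action materializes the cache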
Advanced Techniques:
Broadcast Variables: For small, frequently used datasets (e.g., lookup tables), consider
broadcasting them as broadcast variables. This makes a single read-only copy of the data
available on each executor, avoiding redundant data transfer for each task (see the sketch
after this list).
Tungsten Engine (Spark 2.0+): Introduced in Spark 2.0, the Tungsten Engine optimizes
memory management and code generation for DataFrame operations, leading to
significant performance improvements.
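A minimal sketch of a broadcast variable (the lookup table and codes are made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# A small lookup table broadcast once to every executor
lookup = {1: "US", 2: "UK", 3: "IN"}
bc_lookup = spark.sparkContext.broadcast(lookup)
codes = spark.sparkContext.parallelize([1, 3, 2, 1])
countries = codes.map(lambda c: bc_lookup.value.get(c, "unknown"))
print(countries.collect())  # ['US', 'IN', 'UK', 'US']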
Additional Tips:
Code Optimization: Review your code and identify potential bottlenecks. Use clear and
concise DataFrame operations to avoid unnecessary computations.
Monitoring and Profiling: Utilize Spark's built-in monitoring tools (e.g., Spark UI) to
identify slow tasks and execution bottlenecks within your application.
Consider Alternatives: For specific use cases, Apache Arrow-accelerated conversion to
pandas or Koalas (the pandas API on Spark) might offer convenience or performance
advantages over plain PySpark DataFrames. Evaluate your needs and explore alternative
options if performance is a critical factor.
1. Creating DataFrames:
spark = SparkSession.builder.appName("df_example").getOrCreate()
df.show()
df = spark.read.csv("data.csv", header=True)
df = spark.read.parquet("data.parquet")
df = spark.read.json("data.json")
data.toDF().createOrReplaceTempView("people")
2. Simple DataFrame Queries:
Selecting Columns:
names_df = df.select("name", "age")
Filtering Data:
adults = df.filter(df.age > 21)
Sorting:
sorted_adults = adults.orderBy(df.age.desc())
adult_count = sorted_adults.count()
Grouping:
age_counts = df.groupBy("age").count()
Interoperating with RDDs:
DataFrames and RDDs interoperate in both directions. As shown above under Python to RDD
communications, a DataFrame can be converted into an RDD of Row objects through its rdd
attribute, and an RDD can be converted back into a DataFrame with toDF and a schema; the
same use cases (reusing existing RDD code, fine-grained control, and interoperability with
RDD-based libraries such as MLlib) apply here.
Important Considerations:
Frequent conversions between DataFrames and RDDs can be inefficient. It's generally
recommended to primarily use DataFrames for structured data processing and leverage
RDDs only when necessary.
Additional Techniques:
Working with Specific Columns: You can use the select method on a DataFrame to select
specific columns and convert the resulting DataFrame subset into an RDD. This allows
you to focus on a particular portion of the data for RDD manipulation.
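A brief sketch of this (assuming the SparkSession spark from the earlier examples):
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])
# Keep only the name column, then drop down to the RDD level
name_rdd = df.select("name").rdd.map(lambda row: row.name)
print(name_rdd.collect())  # ['Alice', 'Bob']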
In Apache Spark, DataFrames offer a powerful and user-friendly way to query and manipulate
structured data. Here's a breakdown of querying with DataFrames:
1. Selecting Columns:
You can select specific columns from a DataFrame using the select method. This returns
a new DataFrame containing only the chosen columns.
spark = SparkSession.builder.appName("df_query").getOrCreate()
data = [("Alice", 25, "Engineer"), ("Bob", 30, "Manager"), ("Charlie", 28, "Analyst")]
2. Filtering Data:
You can filter rows based on specific conditions using the filter method. This creates a
new DataFrame containing only rows that meet the filtering criteria.
adults_df = df.filter(df.age > 26)
adults_df.show()
managers_df = df.filter(df.occupation == "Manager")
managers_df.show()
3. Sorting Data:
You can sort the DataFrame by one or more columns using the orderBy method. This
allows you to order the data in ascending or descending order.
sorted_df = df.orderBy(df.age.desc())
sorted_df.show()
4. Joining DataFrames:
You can combine data from two DataFrames based on a shared column using
the join method. This is useful for relational operations between datasets.
# A hypothetical second DataFrame sharing the name column
salaries_df = spark.createDataFrame([("Alice", 70000), ("Bob", 80000)], ["name", "salary"])
joined_df = df.join(salaries_df, on="name", how="inner")
joined_df.show()
5. Aggregation:
You can compute summary values such as counts, averages, and minimum/maximum values
using the count and agg methods.
from pyspark.sql import functions as F
total_count = df.count()
avg_age = df.agg(F.avg("age"))
avg_age.show()
min_max_age = df.agg(F.min("age"), F.max("age"))
min_max_age.show()
6. Grouping:
You can group rows based on a specific column using the groupBy method. This allows
you to perform aggregate operations on groups of rows that share the same value in the
grouping column.
avg_age_by_occupation = df.groupBy("occupation").agg(F.avg("age"))
avg_age_by_occupation.show()
DataFrames also seamlessly integrate with Spark SQL, allowing you to use SQL-like syntax for
querying. Here's an example:
df.createOrReplaceTempView("people")
sql_df = spark.sql("SELECT name, age FROM people WHERE age > 25 ORDER BY age DESC")
sql_df.show()