Batch Processing
Back to index
Next: Streaming
Batch vs Streaming
There are 2 ways of processing data: batch processing (processing chunks of data at regular intervals) and streaming (processing data on the fly). This lesson will cover batch processing; the next lesson will cover streaming.
Batch jobs may be scheduled in many different ways:
Weekly
Daily (very common)
Hourly (very common)
X times per hour
Every 5 minutes
Etc...
(Diagram: an example batch job that pulls CSV data from a Data Lake and processes it with Python.)
The main disadvantage of batch jobs is the delay between the data being generated and the data being processed and available. However, the advantages of batch jobs often compensate for their shortcomings, and as a result most companies that deal with data tend to work with batch jobs most of the time (probably 90%).
Introduction to Spark
Video source
What is Spark?
Apache Spark is an open-source multi-language unified analytics engine for large-scale
data processing.
(Diagram: Spark pulls data from a Data Lake, transforms it and outputs the results back to the Data Lake.)
Spark can be run in clusters with multiple nodes, each pulling and transforming data.
Spark is multi-language because we can use Java and Scala natively, and there are
wrappers for Python, R and other languages.
Spark can deal with both batch and streaming data. The technique for streaming data is to see the stream as a sequence of small batches and then apply to them techniques similar to those used on regular batches. We will cover streaming in detail in the next lesson.
There are tools such as Hive, Presto or Athena (an AWS-managed Presto) that allow you to express jobs as SQL queries. However, there are times when you need to apply more complex manipulations which are very difficult or even impossible to express with SQL (such as ML models); in those instances, Spark is the tool to use.
(Diagram: can the Data Lake job be expressed with SQL? If yes, use Hive/Presto/Athena; if no, use Spark.)
A typical workflow may combine both tools. Here's an example of a workflow involving
Machine Learning:
(Diagram of an example ML workflow: raw data lands in a Data Lake; an SQL/Athena job pre-processes it; a Spark job transforms it further; a Python job trains an ML model; another Spark job applies the model and saves the output.)
Installing Spark
Video source
Install instructions for Linux, MacOS and Windows are available on the course repo.
After installing the appropriate JDK and Spark, make sure that you set up PySpark by
following these instructions.
We can use Spark with Python code by means of PySpark. We will be using Jupyter
Notebooks for this lesson.
import pyspark
from pyspark.sql import SparkSession
We now need to instantiate a Spark session, an object that we use to interact with Spark.
spark = SparkSession.builder \
.master("local[*]") \
.appName('test') \
.getOrCreate()
SparkSession is the class of the object that we instantiate. builder is the builder
method.
master() sets the Spark master URL to connect to. The local string means that
Spark will run on a local cluster. [*] means that Spark will run with as many CPU
cores as possible.
appName() defines the name of our application/session. This will show in the Spark UI.
getOrCreate() will create the session or recover the object if it was previously
created.
Note: Spark dataframes use custom data types; we cannot use regular Python types.
For this example we will use the High Volume For-Hire Vehicle Trip Records for January
2021 available from the NYC TLC Trip Record Data website. The file should be about 720MB
in size.
df = spark.read \
.option("header", "true") \
.csv('fhvhv_tripdata_2021-01.csv')
option() contains options for the read method. In this case, we're specifying that
the first line of the CSV file contains the column names.
csv() is for reading CSV files.
You can see the contents of the dataframe with df.show() (only a few rows will be shown)
or df.head() . You can also check the current schema with df.schema ; you will notice that
all values are strings.
Since all values were read as strings, we need to provide a schema ourselves. A quick way of figuring it out is the following:
1. Create a smaller CSV file with the first 1000 records or so.
2. Import Pandas and create a Pandas dataframe. This dataframe will have inferred
datatypes.
3. Create a Spark dataframe from the Pandas dataframe and check its schema.
spark.createDataFrame(my_pandas_dataframe).schema
4. Based on the output of the previous method, import types from pyspark.sql and
create a StructType containing a list of the datatypes.
types contains all of the available data types for Spark dataframes.
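For this dataset, the final result might look something like the following sketch (the field names and types are assumptions based on the High Volume FHV CSV; adjust them to whatever the previous step actually outputs):

from pyspark.sql import types

schema = types.StructType([
    types.StructField('hvfhs_license_num', types.StringType(), True),
    types.StructField('dispatching_base_num', types.StringType(), True),
    types.StructField('pickup_datetime', types.TimestampType(), True),
    types.StructField('dropoff_datetime', types.TimestampType(), True),
    types.StructField('PULocationID', types.IntegerType(), True),
    types.StructField('DOLocationID', types.IntegerType(), True),
    types.StructField('SR_Flag', types.StringType(), True)
])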
df = spark.read \
.option("header", "true") \
.schema(schema) \
.csv('fhvhv_tripdata_2021-01.csv')
You may find an example Jupyter Notebook file using this trick in this link.
Partitions
Video Source
A Spark cluster is composed of multiple executors. Each executor can process data
independently in order to parallelize and speed up work.
In the previous example we read a single large CSV file. A file can only be read by a single
executor, which means that the code we've written so far isn't parallelized and thus will
only be run by a single executor rather than many at the same time.
In order to solve this issue, we can split a file into multiple parts so that each executor can
take care of a part and have all executors working simultaneously. These splits are called
partitions.
We will now read the CSV file, partition the dataframe and parquetize it. This will create
multiple files in parquet format.
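A minimal sketch of that step, assuming the dataframe and schema from before (24 partitions is an arbitrary choice; repartition() is lazy and only takes effect when the write action is triggered):

# repartition the dataframe read earlier into 24 partitions and write it as parquet
df = df.repartition(24)

df.write.parquet('fhvhv/2021/01/')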
You may check the Spark UI at any time and see the progress of the current job, which is
divided into stages which contain tasks. The tasks in a stage will not start until all tasks on
the previous stage are finished.
When creating a dataframe, Spark creates as many partitions as CPU cores available by default, and each partition creates a task. Thus, assuming that the dataframe was initially partitioned into 6 partitions and we repartition it to 24, the write.parquet() method will have 2 stages: the first with 6 tasks and the second one with 24 tasks.
Besides the 24 parquet files, you should also see a _SUCCESS file which should be empty.
This file is created when the job finishes successfully.
Trying to write the files again will output an error because Spark will not write to a non-
empty folder. You can force an overwrite with the mode argument:
df.write.parquet('fhvhv/2021/01/', mode='overwrite')
The opposite of partitioning (joining multiple partitions into a single partition) is called
coalescing.
Spark dataframes
Video source
We can create a dataframe from the parquet files we created in the previous section:
df = spark.read.parquet('fhvhv/2021/01/')
Unlike CSV files, parquet files contain the schema of the dataset, so there is no need to
specify a schema like we previously did when reading the CSV file. You can check the
schema like this:
df.printSchema()
(One of the reasons why parquet files are smaller than CSV files is because they store the
data according to the datatypes, so integer values will take less space than long or string
values.)
There are many Pandas-like operations that we can do on Spark dataframes, such as:
Filtering by value - returns a dataframe whose records match the condition stated in
the filter.
And many more. The official Spark documentation website contains a quick guide for
dataframes.
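For example, a sketch combining filtering by value and column selection (the column names follow the FHVHV dataset used above; the license number value is just an illustration):

df.filter(df.hvfhs_license_num == 'HV0003') \
    .select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID') \
    .show()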
Actions vs Transformations
Video source
Some Spark methods are "lazy", meaning that they are not executed right away. You can
test this with the last instructions we ran in the previous section: after running them, the
Spark UI will not show any new jobs. However, running df.show() right after will execute
right away and display the contents of the dataframe; the Spark UI will also show a new
job.
These lazy commands are called transformations and the eager commands are called
actions. Computations only happen when actions are triggered.
df.select(...).filter(...).show()
Both select() and filter() are transformations, but show() is an action. The whole
instruction gets evaluated only when the show() action is triggered.
Typical transformations (lazy) include:
Selecting columns
Filtering
Joins
Group by
Partitions
...
Typical actions (eager) include show(), take(), head() and write().
Video source
Besides the SQL and Pandas-like commands we've seen so far, Spark provides additional
built-in functions that allow for more complex data manipulation. By convention, these
functions are imported as follows:
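# Conventional alias: the F.to_date(), F.lit() and F.udf() calls below come from this module
from pyspark.sql import functions as F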
df \
.withColumn('pickup_date', F.to_date(df.pickup_datetime)) \
.withColumn('dropoff_date', F.to_date(df.dropoff_datetime)) \
.select('pickup_date', 'dropoff_date', 'PULocationID', 'DOLocationID') \
.show()
Besides these built-in functions, Spark allows us to create User Defined Functions (UDFs)
with custom behavior for those instances where creating SQL queries for that behaviour
becomes difficult both to manage and test.
UDFs are regular functions which are then passed as parameters to a special builder. Let's
create one:
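A sketch of what such a UDF might look like (the body of the function is arbitrary; what matters is wrapping a plain Python function with F.udf() together with a return type from the types module imported earlier):

def crazy_stuff(base_num):
    # arbitrary example logic applied to the dispatching base number
    num = int(base_num[1:])
    if num % 7 == 0:
        return f's/{num:03x}'
    else:
        return f'e/{num:03x}'

crazy_stuff_udf = F.udf(crazy_stuff, returnType=types.StringType())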
We can then use our UDF in transformations just like built-in functions:
df \
.withColumn('pickup_date', F.to_date(df.pickup_datetime)) \
.withColumn('dropoff_date', F.to_date(df.dropoff_datetime)) \
.withColumn('base_id', crazy_stuff_udf(df.dispatching_base_num)) \
.select('base_id', 'pickup_date', 'dropoff_date', 'PULocationID', 'DOLocationID') \
.show()
Spark SQL
We already mentioned at the beginning that there are other tools for expressing batch
jobs as SQL queries. However, Spark can also run SQL queries, which can come in handy if
you already have a Spark cluster and setting up an additional tool for sporadic use isn't
desirable.
Note: this block makes use of the yellow and green taxi datasets for 2020 and 2021 as
parquetized local files. You may create a DAG with Airflow as seen in lesson 2 or you
may download and parquetize the files directly; check out this extra lesson to see
how.
Let's now load all of the yellow and green taxi data for 2020 and 2021 to Spark dataframes.
Assuming the parquet files for the datasets are stored in a data/pq/color/year/month
folder structure:
df_green = spark.read.parquet('data/pq/green/*/*')
df_green = df_green \
.withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \
.withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')
df_yellow = spark.read.parquet('data/pq/yellow/*/*')
df_yellow = df_yellow \
.withColumnRenamed('tpep_pickup_datetime', 'pickup_datetime') \
.withColumnRenamed('tpep_dropoff_datetime', 'dropoff_datetime')
Because the pickup and dropoff column names don't match between the 2 datasets, we use the withColumnRenamed transformation to make them match.
We need to find out which are the common columns. We could do this:
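# one option: intersect the two column sets
set(df_green.columns) & set(df_yellow.columns)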
However, this command will not respect the column order. We can do this instead to
respect the order:
yellow_columns = set(df_yellow.columns)

# keep the column order of df_green
common_colums = [col for col in df_green.columns if col in yellow_columns]
Before we combine the datasets, we need to figure out how we will keep track of the taxi
type for each record (the service_type field in dm_monthyl_zone_revenue.sql ). We will
add the service_type column to each dataframe.
df_green_sel = df_green \
.select(common_colums) \
.withColumn('service_type', F.lit('green'))
df_yellow_sel = df_yellow \
.select(common_colums) \
.withColumn('service_type', F.lit('yellow'))
Finally, we combine both dataframes:
df_trips_data = df_green_sel.unionAll(df_yellow_sel)
We can also count the amount of records per service type:
df_trips_data.groupBy('service_type').count().show()
Video source
We can make SQL queries with Spark with spark.sql("SELECT * FROM ???") . SQL expects
a table for retrieving records, but a dataframe is not a table, so we need to register the
dataframe as a table first:
df_trips_data.registerTempTable('trips_data')
With our registered table, we can now perform regular SQL operations.
spark.sql("""
SELECT
service_type,
count(1)
FROM
trips_data
GROUP BY
service_type
""").show()
Note that the SQL query is wrapped with 3 double quotes ( " ).
The query output can be manipulated as a dataframe, which means that we can perform
any queries on our table and manipulate the results with Python as we see fit.
We can now slightly modify the dm_monthyl_zone_revenue.sql , and run it as a query with
Spark and store the output in a dataframe:
df_result = spark.sql("""
SELECT
-- Revenue grouping
PULocationID AS revenue_zone,
date_trunc('month', pickup_datetime) AS revenue_month,
service_type,
-- Revenue calculation
SUM(fare_amount) AS revenue_monthly_fare,
SUM(extra) AS revenue_monthly_extra,
SUM(mta_tax) AS revenue_monthly_mta_tax,
SUM(tip_amount) AS revenue_monthly_tip_amount,
SUM(tolls_amount) AS revenue_monthly_tolls_amount,
SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,
SUM(total_amount) AS revenue_monthly_total_amount,
SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,
-- Additional calculations
AVG(passenger_count) AS avg_montly_passenger_count,
AVG(trip_distance) AS avg_montly_trip_distance
FROM
trips_data
GROUP BY
1, 2, 3
""")
We removed the with statement from the original query because it operates on an
external table that Spark does not have access to.
We removed the count(tripid) as total_monthly_trips, line in Additional
calculations because it also depends on that external table.
We changed the grouping from field names to references in order to avoid mistakes.
Once we're happy with the output, we can also store it as a parquet file just like any other
dataframe. We could run this:
df_result.write.parquet('data/report/revenue/')
However, with our current dataset, this will create more than 200 parquet files of very small
size, which isn't very desirable.
In order to reduce the amount of files, we need to reduce the amount of partitions of the
dataset, which is done with the coalesce() method:
df_result.coalesce(1).write.parquet('data/report/revenue/', mode='overwrite')
Spark internals
Spark Cluster
Video source
Until now, we've used a local cluster to run our Spark code, but Spark clusters often
contain multiple computers that behave as executors.
Spark clusters are managed by a master, which behaves similarly to an entry point of a
Kubernetes cluster. A driver (an Airflow DAG, a computer running a local script, etc.) that
wants to execute a Spark job will send the job to the master, which in turn will divide the
work among the cluster's executors. If any executor fails and becomes offline for any
reason, the master will reassign the task to another executor.
(Diagram: a driver submits a Spark job via spark-submit to the cluster's master (Spark UI on port 4040); the master distributes the work among the executors, and each executor processes dataframe partitions.)
Each executor will fetch a dataframe partition stored in a Data Lake (usually S3, GCS or a
similar cloud provider), do something with it and then store it somewhere, which could be
the same Data Lake or somewhere else. If there are more partitions than executors,
executors will keep fetching partitions until every single one has been processed.
This is in contrast to Hadoop, another data analytics engine, whose executors locally store
the data they process. Partitions in Hadoop are duplicated across several executors for
redundancy, in case an executor fails for whatever reason (Hadoop is meant for clusters
made of commodity hardware computers). However, data locality has become less
important as storage and data transfer costs have dramatically decreased and nowadays
it's feasible to separate storage from computation, so Hadoop has fallen out of fashion.
GROUP BY in Spark
Video source
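The query below selects from a table called green, so this assumes the green taxi dataframe has been registered as a temporary table beforehand, along these lines:

df_green.registerTempTable('green')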
df_green_revenue = spark.sql("""
SELECT
date_trunc('hour', lpep_pickup_datetime) AS hour,
PULocationID AS zone,
SUM(total_amount) AS amount,
COUNT(1) AS number_records
FROM
green
WHERE
lpep_pickup_datetime >= '2020-01-01 00:00:00'
GROUP BY
1, 2
""")
This query will output the total revenue and amount of trips per hour per zone. We need to
group by hour and zones in order to do this.
Since the data is split along partitions, it's likely that we will need to group data which is in
separate partitions, but executors only deal with individual partitions. Spark solves this
issue by separating the grouping in 2 stages:
1. In the first stage, each executor groups the results in the partition they're working on
and outputs the results to a temporary partition. These temporary partitions are the
intermediate results.
(Diagram: each executor applies the GROUP BY to its own partition and writes intermediate results such as "hour 1, zone 1, 100 revenue, 5 trips" and "hour 1, zone 2, 200 revenue, 10 trips" to a temporary partition.)
2. The second stage shuffles the data: Spark will put all records with the same keys (in
this case, the GROUP BY keys which are hour and zone) in the same partition. The
algorithm to do this is called external merge sort. Once the shuffling has finished, we
can once again apply the GROUP BY to these new partitions and reduce the records to
the final output.
Note that the shuffled partitions may contain more than one key, but all records
belonging to a key should end up in the same partition.
(Diagram: after shuffling, all intermediate records sharing a key end up in the same partition; a final GROUP BY then reduces them, e.g. the "hour 1, zone 1" records with 100, 50 and 200 revenue become "hour 1, zone 1, 350 revenue, 17 trips".)
Running the query should display the following DAG in the Spark UI:
(Spark UI DAG screenshot showing 2 stages.)
If we were to add sorting to the query (adding ORDER BY 1,2 at the end), Spark would perform a very similar operation to GROUP BY after grouping the data. The resulting DAG would look like this:
(Spark UI DAG screenshot showing 3 stages.)
By default, Spark will repartition the dataframe to 200 partitions after shuffling data. For
the kind of data we're dealing with in this example this could be counterproductive
because of the small size of each partition/file, but for larger datasets this is fine.
Shuffling is an expensive operation, so it's in our best interest to reduce the amount of
data to shuffle when querying.
Joins in Spark
Video source
Joining tables in Spark is implemented in a similar way to GROUP BY and ORDER BY , but
there are 2 distinct cases: joining 2 large tables and joining a large table and a small table.
For the first case, let's assume we've also built a df_yellow_revenue dataframe analogous to df_green_revenue. We rename the amount and number_records columns of both dataframes so that they don't clash once joined:
df_green_revenue_tmp = df_green_revenue \
.withColumnRenamed('amount', 'green_amount') \
.withColumnRenamed('number_records', 'green_number_records')
df_yellow_revenue_tmp = df_yellow_revenue \
.withColumnRenamed('amount', 'yellow_amount') \
.withColumnRenamed('number_records', 'yellow_number_records')
Both of these queries are transformations; Spark doesn't actually do anything when we
run them.
We will now perform an outer join so that we can display the amount of trips and revenue
per hour per zone for green and yellow taxis at the same time regardless of whether the
hour/zone combo had one type of taxi trips or the other:
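A sketch of that join (df_join is just the name we give the result; the column names match the grouping keys used earlier):

df_join = df_green_revenue_tmp.join(df_yellow_revenue_tmp, on=['hour', 'zone'], how='outer')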
on= receives a list of columns by which we will join the tables. This will result in a
primary composite key for the resulting table.
how= specifies the type of JOIN to execute.
When we run either show() or write() on this query, Spark will have to create both the
temporary dataframes and the joint final dataframe. The DAG will look like this:
(Spark UI DAG screenshot: stages 1 and 2 materialize the 2 temporary dataframes; stage 3 joins them with a SortMergeJoin followed by a WholeStageCodegen step.)
For stage 3, given all records for yellow taxis Y1, Y2, ... , Yn and for green taxis G1,
G2, ... , Gn and knowing that the resulting composite key is key K = (hour H, zone Z) ,
we can express the resulting complex records as (Kn, Yn) for yellow records and (Kn,
Gn) for green records. Spark will first shuffle the data like it did for grouping (using the
external merge sort algorithm) and then it will reduce the records by joining yellow and
green data for matching keys to show the final output.
(Diagram: yellow records such as (K2, Y2) and (K3, Y3) and green records such as (K2, G1) and (K4, G3) are shuffled by key; records that share a key, like K2, are merged into joined records such as (K2, Y2, G1).)
Because we're doing an outer join, keys which only have yellow taxi or green taxi
records will be shown with empty fields for the missing data, whereas keys with both
types of records will show both yellow and green taxi data.
If we did an inner join instead, the records such as (K1, Y1, Ø) and (K4, Ø,
G3) would be excluded from the final result.
Let's now use the zones lookup table to match each zone ID to its corresponding name.
df_zones = spark.read.parquet('zones/')
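The join itself might look like this (reusing the df_join dataframe from before; the condition compares our zone column with the lookup table's LocationID):

df_result = df_join.join(df_zones, df_join.zone == df_zones.LocationID)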
df_result.drop('LocationID', 'zone').write.parquet('tmp/revenue-zones')
Because we renamed the LocationID in the joint table to zone , we can't simply
specify the columns to join and we need to provide a condition as criteria.
We use the drop() method to get rid of the extra columns we don't need anymore, because we only want to keep the zone names; both LocationID and zone are duplicate columns that contain only numerical IDs.
We also use write() instead of show() because show() might not process all of the
data.
The zones table is actually very small and joining both tables with merge sort is
unnecessary. What Spark does instead is broadcasting: Spark sends a copy of the complete
table to all of the executors and each executor then joins each partition of the big table in
memory by performing a lookup on the local broadcasted table.
(Diagram: the small zones table is broadcast to each executor; every executor joins its partition of the big table by looking values up in its local copy of zones and returns the result.)
Shuffling isn't needed because each executor already has all of the necessary info to
perform the join on each partition, thus speeding up the join operation by orders of
magnitude.
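Spark decides on its own to broadcast a table when it considers it small enough, but you can also request it explicitly with the broadcast() hint; a sketch, reusing the join from before:

from pyspark.sql.functions import broadcast

df_result = df_join.join(broadcast(df_zones), df_join.zone == df_zones.LocationID)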
Resilient Distributed Datasets (RDDs) are the basic abstraction that Spark works with internally. Dataframes are actually built on top of RDDs and contain a schema as well, which plain RDDs do not.
Let's take a look once again at the SQL query we saw in the GROUP BY section:
SELECT
date_trunc('hour', lpep_pickup_datetime) AS hour,
PULocationID AS zone,
SUM(total_amount) AS amount,
COUNT(1) AS number_records
FROM
green
WHERE
lpep_pickup_datetime >= '2020-01-01 00:00:00'
GROUP BY
1, 2
1. We can re-implement the SELECT section by choosing the 3 fields from the RDD's
rows.
rdd = df_green \
.select('lpep_pickup_datetime', 'PULocationID', 'total_amount') \
.rdd
2. We can implement the WHERE section by using the filter() and take() methods:
filter() returns a new RDD containing only the elements that satisfy a
predicate, which in our case is a function that we pass as a parameter.
take() takes as many elements from the RDD as stated.
from datetime import datetime

start = datetime(year=2020, month=1, day=1)  # same cutoff date as the WHERE clause above

def filter_outliers(row):
    return row.lpep_pickup_datetime >= start

rdd.filter(filter_outliers).take(1)
The GROUP BY section is more involved and takes several steps:
1. We first generate intermediate results by mapping each row to a composite key (hour, zone) and a composite value (amount, count):
def prepare_for_grouping(row):
    hour = row.lpep_pickup_datetime.replace(minute=0, second=0, microsecond=0)
    zone = row.PULocationID
    key = (hour, zone)

    amount = row.total_amount
    count = 1
    value = (amount, count)

    return (key, value)
We apply this function to every element of the RDD with the map() method:
rdd \
.filter(filter_outliers) \
.map(prepare_for_grouping)
2. We now need to use the reduceByKey() method, which will take all records with the
same key and put them together in a single record by transforming all the different
values according to some rules which we can define with a custom function. Since we
want to count the total amount and the total number of records, we just need to add
the values:
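A sketch of such a function (each value is an (amount, count) tuple produced in the previous step):

def calculate_revenue(left_value, right_value):
    # unpack the (amount, count) tuples of the 2 records being combined
    left_amount, left_count = left_value
    right_amount, right_count = right_value

    output_amount = left_amount + right_amount
    output_count = left_count + right_count

    return (output_amount, output_count)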
rdd \
.filter(filter_outliers) \
.map(prepare_for_grouping) \
.reduceByKey(calculate_revenue)
3. The output we have is already usable but not very nice, so we map the output again in
order to unwrap it.
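A sketch of that unwrapping function: each reduced record has the shape ((hour, zone), (amount, count)), and we flatten it into a namedtuple:

from collections import namedtuple

RevenueRow = namedtuple('RevenueRow', ['hour', 'zone', 'revenue', 'count'])

def unwrap(row):
    return RevenueRow(
        hour=row[0][0],
        zone=row[0][1],
        revenue=row[1][0],
        count=row[1][1]
    )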
rdd \
.filter(filter_outliers) \
.map(prepare_for_grouping) \
.reduceByKey(calculate_revenue) \
.map(unwrap)
Using namedtuple isn't necessary but it will help in the next step.
Finally, we can convert the resulting RDD back into a dataframe. We define the output schema first:
result_schema = types.StructType([
types.StructField('hour', types.TimestampType(), True),
types.StructField('zone', types.IntegerType(), True),
types.StructField('revenue', types.DoubleType(), True),
types.StructField('count', types.IntegerType(), True)
])
df_result = rdd \
.filter(filter_outliers) \
.map(prepare_for_grouping) \
.reduceByKey(calculate_revenue) \
.map(unwrap) \
.toDF(result_schema)
We can use toDF() without any schema as an input parameter, but Spark will have to
figure out the schema by itself which may take a substantial amount of time. Using
namedtuple in the previous step allows Spark to infer the column names but Spark
will still need to figure out the data types; by passing a schema as a parameter we skip
this step and get the output much faster.
As you can see, manipulating RDDs to perform SQL-like queries is complex and time-
consuming. Ever since Spark added support for dataframes and SQL, manipulating RDDs in
this fashion has become obsolete, but since dataframes are built on top of RDDs, knowing
how they work can help us understand how to make better use of Spark.
mapPartitions() is a convenient method for dealing with large datasets because it allows us to process them in chunks (one partition at a time), which is handy for workflows such as Machine Learning.
Let's demonstrate this workflow with an example. Let's assume we want to predict taxi
travel length with the green taxi dataset. We will use VendorID , lpep_pickup_datetime ,
PULocationID , DOLocationID and trip_distance as our features. We will now create an
RDD with these columns:
columns = ['VendorID', 'lpep_pickup_datetime', 'PULocationID', 'DOLocationID', 'trip_distance']

duration_rdd = df_green \
    .select(columns) \
    .rdd
Let's now create the method that mapPartitions() will use to transform the partitions.
This method will essentially call our prediction model on the partition that we're
transforming:
import pandas as pd
def model_predict(df):
# fancy ML code goes here
(...)
# predictions is a Pandas dataframe with the field predicted_duration in it
return predictions
def apply_model_in_batch(rows):
    df = pd.DataFrame(rows, columns=columns)
    predictions = model_predict(df)
    df['predicted_duration'] = predictions

    for row in df.itertuples():
        yield row
We're assuming that our model works with Pandas dataframes, so we need to import
the library.
We are converting the input partition into a dataframe for the model.
RDD's do not contain column info, so we use the columns param to name the
columns because our model may need them.
Pandas will crash if the dataframe is too large for memory! We're assuming that
this is not the case here, but you may have to take this into account when dealing
with large partitions. You can use the itertools package for slicing the partitions
before converting them to dataframes.
Our model will return another Pandas dataframe with a predicted_duration column
containing the model predictions.
df.itertuples() is an iterable that returns a tuple containing all the values in a single
row, for all rows. Thus, row will contain a tuple with all the values for a single row.
yield is a Python keyword that behaves similarly to return but returns a generator
object instead of a value. This means that a function that uses yield can be iterated
on. Spark makes use of the generator object in mapPartitions() to build the output
RDD.
You can learn more about the yield keyword in this link.
With our function defined, we are now ready to use mapPartitions() and run our
prediction model on our full RDD:
df_predicts = duration_rdd \
.mapPartitions(apply_model_in_batch)\
.toDF() \
.drop('Index')
df_predicts.select('predicted_duration').show()
We're not specifying the schema when creating the dataframe, so it may take some
time to compute.
We drop the Index field because it was created by Spark and it is not needed.
As a final thought, you may have noticed that the apply_model_in_batch() method does
NOT operate on single elements, but rather it takes the whole partition and does
something with it (in our case, calling a ML model). If you need to operate on individual
elements then you're better off with map() .
Google Cloud Storage is an object store, which means that it doesn't offer a fully featured
file system. Spark can connect to remote object stores by using connectors; each object
store has its own connector, so we will need to use Google's Cloud Storage Connector if
we want our local Spark instance to connect to our Data Lake.
Before we do that, we will use gsutil to upload our local files to our Data Lake. gsutil is
included with the GCP SDK, so you should already have it if you've followed the previous
chapters.
The command takes a form along the lines of gsutil -m cp -r <local_folder> gs://<bucket_name>/<destination_folder> .
-m enables parallel (multi-threaded) uploads, which speeds up the transfer of many files.
-r stands for recursive; it's used to state that the contents of the local folder are to be uploaded. For single files this option isn't needed.
We now need to follow a few extra steps before creating the Spark session in our
notebook. Import the following libraries:
import pyspark
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
Now we need to configure Spark by creating a configuration object. Run the following
code to create it:
credentials_location = '~/.google/credentials/google_credentials.json'
conf = SparkConf() \
.setMaster('local[*]') \
.setAppName('test') \
.set("spark.jars", "./lib/gcs-connector-hadoop3-2.2.5.jar") \
.set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
.set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", credentials_
You may have noticed that we're including a couple of options that we previously used
when creating a Spark Session with its builder. That's because we implicitly created a
context, which represents a connection to a Spark cluster. This time we need to explicitly
create and configure the context like so:
sc = SparkContext(conf=conf)
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.Go
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
hadoop_conf.set("fs.gs.auth.service.account.json.keyfile", credentials_location)
hadoop_conf.set("fs.gs.auth.service.account.enable", "true")
This will likely output a warning when running the code. You may ignore it.
We can now instantiate the Spark session using the configuration of the context we just created:
spark = SparkSession.builder \
.config(conf=sc.getConf()) \
.getOrCreate()
If everything is set up correctly, we can now read files straight from our Data Lake bucket:
df_green = spark.read.parquet('gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/*/*')
You should obviously change the URI in this example for yours.
Up to this point we've been creating a local cluster implicitly from within the notebook when instantiating the session, like so:
spark = SparkSession.builder \
.master("local[*]") \
.appName('test') \
.getOrCreate()
This code will start a local cluster, but once the notebook kernel is shut down, the cluster
will disappear.
We will now see how to create a Spark cluster in Standalone Mode so that the cluster can
remain running even after we stop running our notebooks.
Simply go to your Spark install directory from a terminal and run the following command:
./sbin/start-master.sh
You should now be able to open the main Spark dashboard by browsing to
localhost:8080 (remember to forward the port if you're running it on a virtual machine).
At the very top of the dashboard the URL for the dashboard should appear; copy it and
use it in your session code like so:
spark = SparkSession.builder \
.master("spark://<URL>:7077") \
.appName('test') \
.getOrCreate()
Note that we used the HTTP port 8080 for browsing to the dashboard but we use the
Spark port 7077 for connecting our code to the cluster.
Using localhost as a stand-in for the URL may not work.
You may note that in the Spark dashboard there aren't any workers listed. The actual Spark
jobs are run from within workers (or slaves in older Spark versions), which we need to
create and set up.
Similarly to how we created the Spark master, we can run a worker from the command line
by running the following command from the Spark install directory:
./sbin/start-worker.sh <master-spark-URL>
Once you've run the command, you should see a worker in the Spark dashboard.
Note that a worker may not be able to run multiple jobs simultaneously. If you're running
separate notebooks and connecting to the same Spark worker, you can check in the Spark
dashboard how many Running Applications exist. Since we haven't configured the workers,
any jobs will take as many resources as there are available for the job.
We will use the argparse library for parsing parameters. Convert a notebook to a script
with nbconvert , manually modify it or create it from scratch and add the following:
import argparse
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
parser = argparse.ArgumentParser()

parser.add_argument('--input_green', required=True)
parser.add_argument('--input_yellow', required=True)
parser.add_argument('--output', required=True)

args = parser.parse_args()

input_green = args.input_green
input_yellow = args.input_yellow
output = args.output
We can now modify previous lines using the 3 parameters we've created. For example:
df_green = spark.read.parquet(input_green)
Once we've finished our script, we simply call it from a terminal line with the parameters
we need:
python my_script.py \
--input_green=data/pq/green/2020/*/ \
--input_yellow=data/pq/yellow/2020/*/ \
--output=data/report-2020
However, we still need to specify which Spark cluster will run the script. Rather than hardcoding the master inside the script, we use spark-submit, which lets us set the master (and other parameters) from the command line:
spark-submit \
--master="spark://<URL>" \
my_script.py \
--input_green=data/pq/green/2020/*/ \
--input_yellow=data/pq/yellow/2020/*/ \
--output=data/report-2020
And the Spark session code in the script is simplified like so:
spark = SparkSession.builder \
.appName('test') \
.getOrCreate()
You may find more sophisticated uses of spark-submit in the official documentation.
After you're done running Spark in standalone mode, you will need to manually shut it down. Simply run the ./sbin/stop-worker.sh ( ./sbin/stop-slave.sh in older Spark versions) and ./sbin/stop-master.sh scripts to shut down Spark.
Dataproc is Google's cloud-managed service for running Spark and other data processing
tools such as Flink, Presto, etc.
You may access Dataproc from the GCP dashboard by typing dataproc in the search bar. The first time you access it you will have to enable the API.
In the images below you may find some example values for creating a simple cluster. Give
it a name of your choosing and choose the same region as your bucket.
We would normally choose a standard cluster, but you may choose single node if you
just want to experiment and not run any jobs.
Optionally, you may install additional components but we won't be covering them in this
lesson.
You may leave all other optional settings with their default values. After you click on
Create , it will take a few seconds to create the cluster. You may notice an extra VM
instance under VMs; that's the Spark instance.
In Dataproc's Clusters page, choose your cluster and, in the Cluster details page, click on
Submit job . Under Job type choose PySpark , then in Main Python file write the path to
your script (you may upload the script to your bucket and then copy the URL).
Make sure that your script does not specify the master cluster! Your script should take the
connection details from Dataproc; make sure it looks something like this:
spark = SparkSession.builder \
.appName('test') \
.getOrCreate()
We also need to specify arguments, in a similar fashion to what we saw in the previous section, but using the gs:// URLs of our bucket folders rather than the local paths, e.g. --input_green=gs://<your_bucket>/pq/green/2020/*/ , --input_yellow=gs://<your_bucket>/pq/yellow/2020/*/ and --output=gs://<your_bucket>/report-2020 .
Now press Submit . Sadly there is no easy way to access the Spark dashboard but you can
check the status of the job from the Job details page.
Before you can submit jobs with the SDK, you will need to grant permissions to the Service
Account we've been using so far. Go to IAM & Admin and edit your Service Account so that
the Dataproc Administrator role is added to it.
We can now submit a job from the command line with gcloud dataproc jobs submit pyspark , passing the cluster name, the region, the gs:// path to the script and, after a -- separator, the script's own arguments. You may find more details on how to run jobs in the official docs.
Back to index
Next: Streaming
Under construction