Cleaning Data with PySpark: Chapter 1

This document discusses data cleaning with Apache Spark. It introduces data cleaning as preparing raw data for use in data processing pipelines. Common data cleaning tasks include reformatting text, performing calculations, and removing garbage or incomplete data. Spark is advantageous for data cleaning due to its scalability and powerful framework for data handling. An example demonstrates cleaning raw data by reformatting names and ages and converting city names to states. Spark schemas can define the format of a DataFrame to filter garbage data and improve read performance. Spark uses immutability and lazy evaluation to efficiently handle transformations on data frames. The Parquet format is also discussed as it supports schemas, columnar data storage, and predicate pushdown for efficient data processing in Spark.


Intro to data cleaning with Apache Spark

Mike Metzger
Data Engineering Consultant
What is Data Cleaning?
Data Cleaning: Preparing raw data for use in data processing pipelines.

Possible tasks in data cleaning:

Reformatting or replacing text

Performing calculations

Removing garbage or incomplete data

Why perform data cleaning with Spark?
Problems with typical data systems:

Performance

Organizing data flow

Advantages of Spark:

Scalable

Powerful framework for data handling

Data cleaning example
Raw data:

    name          age (years)   city
    Smith, John   37            Dallas
    Wilson, A.    59            Chicago
    null          215

Cleaned data:

    last name   first name   age (months)   state
    Smith       John         444            TX
    Wilson      A.           708            IL
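
A minimal sketch of this cleanup in PySpark, assuming the raw CSV has columns name, age, and city (the file name, column names, and the city-to-state lookup are assumptions for illustration, not part of the course example):

import pyspark.sql.functions as F

# Assumed raw columns: name ("Last, First"), age (in years), city
raw_df = spark.read.csv('rawdata.csv', header=True)

# Reformat text: split "Smith, John" into separate name columns
split_name = F.split(raw_df['name'], ', ')
clean_df = raw_df.withColumn('last_name', split_name.getItem(0)) \
                 .withColumn('first_name', split_name.getItem(1))

# Perform a calculation: convert age in years to age in months
clean_df = clean_df.withColumn('age_months', clean_df['age'].cast('int') * 12)

# Replace text: map city names to state abbreviations (lookup is illustrative)
clean_df = clean_df.withColumn('state',
    F.when(clean_df['city'] == 'Dallas', 'TX')
     .when(clean_df['city'] == 'Chicago', 'IL'))

# Remove garbage / incomplete rows, then drop the original columns
clean_df = clean_df.dropna(subset=['name']).drop('name', 'age', 'city')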

Spark Schemas
Define the format of a DataFrame

May contain various data types:


Strings, dates, integers, arrays

Can filter garbage data during import

Improves read performance

Example Spark Schema
Import schema

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
peopleSchema = StructType([
# Define the name field
StructField('name', StringType(), True),
# Add the age field
StructField('age', IntegerType(), True),
# Add the city field
StructField('city', StringType(), True)
])

Read CSV file containing data

people_df = spark.read.format('csv').load('rawdata.csv', schema=peopleSchema)
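
To actually drop rows that do not match the schema during import, the CSV reader's mode option can be set; DROPMALFORMED is standard Spark behavior, though using it here goes beyond what the slide shows:

people_df = spark.read.format('csv') \
    .option('mode', 'DROPMALFORMED') \
    .load('rawdata.csv', schema=peopleSchema)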

Let's practice!
Immutability and Lazy Processing

Mike Metzger
Data Engineering Consultant
Variable review
Python variables:

Mutable

Flexibility

Potential for issues with concurrency

Likely adds complexity
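
A small, purely illustrative Python example of the mutability issue (the variable names are hypothetical):

years = [2016, 2017, 2018]
other = years            # no copy is made; both names share one mutable list
other.append(2019)
print(years)             # [2016, 2017, 2018, 2019] -- changed through the other name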

Immutability
Immutable variables are:

A component of functional programming

Defined once

Unable to be directly modified

Re-created if reassigned

Able to be shared efficiently

Immutability Example
Define a new DataFrame:

voter_df = spark.read.csv('voterdata.csv')

Making changes:

voter_df = voter_df.withColumn('fullyear',
voter_df.year + 2000)

voter_df = voter_df.drop(voter_df.year)

Lazy Processing
Isn't this slow?

Transformations (lazy; only added to the execution plan)

Actions (trigger Spark to run the planned transformations)

Allows efficient planning

# Transformations: nothing is executed yet
voter_df = voter_df.withColumn('fullyear',
                               voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)

# Action: triggers Spark to run the transformations above
voter_df.count()
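
As an optional check (not part of the slide), DataFrame.explain() prints the plan Spark has built so far without triggering any data processing:

voter_df.explain()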

Let's practice!
Understanding Parquet

Mike Metzger
Data Engineering Consultant
Difficulties with CSV files
No defined schema

Nested data requires special handling

Encoding format limited

Spark and CSV files
Slow to parse

Files cannot be filtered (no "predicate pushdown")

Any intermediate use requires redefining schema

The Parquet Format
A columnar data format

Supported in Spark and other data processing frameworks

Supports predicate pushdown

Automatically stores schema information

Working with Parquet
Reading Parquet files

df = spark.read.format('parquet').load('filename.parquet')

df = spark.read.parquet('filename.parquet')

Writing Parquet files

df.write.format('parquet').save('filename.parquet')

df.write.parquet('filename.parquet')
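
Because Parquet stores its schema and supports predicate pushdown, a filter applied right after a read can be evaluated during the file scan instead of after loading everything; a brief sketch, with the column name assumed for illustration:

df = spark.read.parquet('filename.parquet')
# The comparison on an existing column can be pushed down into the Parquet scan
filtered_df = df.filter(df['age'] >= 18)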

Parquet and SQL
Parquet files can serve as the backing store for Spark SQL operations

flight_df = spark.read.parquet('flights.parquet')

flight_df.createOrReplaceTempView('flights')

short_flights_df = spark.sql('SELECT * FROM flights WHERE flightduration < 100')
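
The result of spark.sql() is an ordinary DataFrame, so it can feed further transformations or be saved; for example (a hypothetical follow-on step):

short_flights_df.write.parquet('short_flights.parquet')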

Let's Practice!
