Cleaning Data With PySpark: Chapter 1
Cleaning with Apache Spark

Mike Metzger
Data Engineering Consultant
What is Data Cleaning?
Data Cleaning: Preparing raw data for use in data processing pipelines.
Possible tasks in data cleaning:
Performing calculations

Problems with typical data systems:
Performance
Organizing data flow

Advantages of Spark:
Scalable
[Example table: one schema defines name, age (years), and city; the raw source instead has last name, first name, age (months), and state, with problem values such as a null name and an age of 215.]
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

peopleSchema = StructType([
    # Define the name field
    StructField('name', StringType(), True),
    # Add the age field
    StructField('age', IntegerType(), True),
    # Add the city field
    StructField('city', StringType(), True)
])
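Once defined, the schema can be applied when reading data. A minimal sketch, assuming an active SparkSession named spark and a hypothetical people.csv file:

# Read the CSV using the explicit schema instead of inferring types
people_df = spark.read.csv('people.csv', header=True, schema=peopleSchema)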
Variable review

Python variables:
Mutable
Flexibility

Spark DataFrames:
Immutable
Defined once
Re-created if reassigned
voter_df = spark.read.csv('voterdata.csv')
Making changes:
voter_df = voter_df.withColumn('fullyear',
voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)
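Each call above returns a new DataFrame; reassigning voter_df replaces the previous object rather than modifying it in place.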
Transformations (recorded, but not yet executed):

voter_df = voter_df.withColumn('fullyear',
    voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)

Actions (trigger execution of everything recorded so far):

voter_df.count()
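No data is read and no transformation runs until the count() action is called. As a sketch, the plan Spark has recorded can be inspected without triggering execution:

# Print the query plan Spark has built; no data is processed
voter_df.explain()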
Difficulties with CSV files:
No defined schema
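Without a defined schema, Spark must either treat every column as a string or make an extra pass over the file to guess types. A minimal sketch, assuming a hypothetical filename.csv:

# inferSchema forces an additional read of the data to determine types (slow)
df = spark.read.csv('filename.csv', header=True, inferSchema=True)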
The Parquet format:

# Reading Parquet files (the two forms are equivalent)
df = spark.read.format('parquet').load('filename.parquet')
df = spark.read.parquet('filename.parquet')

# Writing Parquet files (the two forms are equivalent)
df.write.format('parquet').save('filename.parquet')
df.write.parquet('filename.parquet')
# Parquet files can be registered as a SQL view directly
flight_df = spark.read.parquet('flights.parquet')
flight_df.createOrReplaceTempView('flights')
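Once registered, the view can be queried with standard SQL. A minimal sketch; the flightduration column is an assumed field in the data:

# Hypothetical query against the registered 'flights' view
short_flights_df = spark.sql("SELECT * FROM flights WHERE flightduration < 100")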