Chapter 2
Into PySpark
Building Data Engineering Pipelines in Python
Oliver Willekens
Data Engineer at Data Minded
What is Spark?
A fast and general engine for large-scale data processing. Typical use cases:
Interactive analytics
Machine learning
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
prices = spark.read.csv("/mnt/data_lake/landing/prices.csv")  # no header option yet
prices.show()
+---------+-----------+------------+-----+--------+--------+----------+
|      _c0|        _c1|         _c2|  _c3|     _c4|     _c5|       _c6|
+---------+-----------+------------+-----+--------+--------+----------+
| store|countrycode| brand|price|currency|quantity| date|
| Aldi| BE|Diapers-R-Us| 6.8| EUR| 40|2019-02-03|
|Carrefour| FR| Nappy-k| 5.7| EUR| 30|2019-02-06|
| Tesco| IRL| Pampers| 6.3| EUR| 35|2019-02-07|
| DM| DE| Huggies| 6.8| EUR| 40|2019-02-01|
+---------+-----------+------------+-----+--------+--------+----------+
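Spark assigned default column names because it was not told the file has a header. Passing the header option uses the first line as column names; a minimal sketch:

prices = spark.read.options(header="true").csv("/mnt/data_lake/landing/prices.csv")
prices.show()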
+---------+-----------+------------+-----+--------+--------+----------+
| store|countrycode| brand|price|currency|quantity| date|
+---------+-----------+------------+-----+--------+--------+----------+
| Aldi| BE|Diapers-R-Us| 6.8| EUR| 40|2019-02-03|
|Carrefour| FR| Nappy-k| 5.7| EUR| 30|2019-02-06|
| Tesco| IRL| Pampers| 6.3| EUR| 35|2019-02-07|
| DM| DE| Huggies| 6.8| EUR| 40|2019-02-01|
+---------+-----------+------------+-----+--------+--------+----------+
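Inspecting the column types shows that, without further hints, Spark reads every column as a string:

print(prices.dtypes)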
[('store', 'string'),
('countrycode', 'string'),
('brand', 'string'),
('price', 'string'),
('currency', 'string'),
('quantity', 'string'),
('date', 'string')]
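An explicit schema tells Spark how to parse each column. The field types below are an assumption based on the sample rows shown above; adjust them to your actual data:

from pyspark.sql.types import (StructType, StructField, StringType,
                               FloatType, IntegerType, DateType)

# Column types assumed from the sample data
schema = StructType([
    StructField("store", StringType()),
    StructField("countrycode", StringType()),
    StructField("brand", StringType()),
    StructField("price", FloatType()),
    StructField("currency", StringType()),
    StructField("quantity", IntegerType()),
    StructField("date", DateType()),
])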
prices = (spark.read
          .options(header="true")
          .schema(schema)
          .csv("/mnt/data_lake/landing/prices.csv"))
print(prices.dtypes)
Reasons to clean data
Most data sources are not ready for analytics. This could be due to:
Invalid rows
Incomplete rows
Can our system cope with data that is 95% clean and 95% complete?
Meaning of fields
ByteType     Good for numbers within the range of -128 to 127.
ShortType    Good for numbers within the range of -32768 to 32767.
IntegerType  Good for numbers within the range of -2147483648 to 2147483647.
FloatType    Maps to Python's float.
StringType   Maps to Python's str.
BooleanType  Maps to Python's bool.
DateType     Maps to Python's datetime.date.
store,countrycode,brand,price,currency,quantity,date
Aldi,BE,Diapers-R-Us,6.8,EUR,40,2019-02-03
-----------------------------------
Kruidvat,NL,Nappy-k,5.6,EUR,40,2019-02-15
DM,AT,Huggies,7.2,EUR,40,2019-02-01
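Reading this file as-is keeps the invalid row, which shows up mangled and padded with nulls (the file path below is illustrative):

prices = (spark.read
          .options(header="true")
          .schema(schema)
          .csv("/landing/prices_with_invalid_rows.csv"))
prices.show()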
+--------------------+-----------+------------+-----+--------+--------+----------+
| store|countrycode| brand|price|currency|quantity| date|
+--------------------+-----------+------------+-----+--------+--------+----------+
| Aldi| BE|Diapers-R-Us| 6.8| EUR| 40|2019-02-03|
|-----------------...| null| null| null| null| null| null|
| Kruidvat| NL| Nappy-k| 5.6| EUR| 40|2019-02-15|
| DM| AT| Huggies| 7.2| EUR| 40|2019-02-01|
+--------------------+-----------+------------+-----+--------+--------+----------+
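The CSV reader's DROPMALFORMED mode discards rows that do not match the schema; a sketch:

prices = (spark.read
          .options(header="true", mode="DROPMALFORMED")
          .schema(schema)
          .csv("/landing/prices_with_invalid_rows.csv"))
prices.show()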
+--------+-----------+------------+-----+--------+--------+----------+
| store|countrycode| brand|price|currency|quantity| date|
+--------+-----------+------------+-----+--------+--------+----------+
| Aldi| BE|Diapers-R-Us| 6.8| EUR| 40|2019-02-03|
|Kruidvat| NL| Nappy-k| 5.6| EUR| 40|2019-02-15|
| DM| AT| Huggies| 7.2| EUR| 40|2019-02-01|
+--------+-----------+------------+-----+--------+--------+----------+
prices = (spark.read
          .options(header="true")
          .schema(schema)
          .csv('/landing/prices_with_incomplete_rows.csv'))
prices.show()
+--------+-----------+------------+-----+--------+--------+----------+
| store|countrycode| brand|price|currency|quantity| date|
+--------+-----------+------------+-----+--------+--------+----------+
| Aldi| BE|Diapers-R-Us| 6.8| EUR| 40|2019-02-03|
|Kruidvat| null| Nappy-k| 5.6| EUR| null|2019-02-15|
+--------+-----------+------------+-----+--------+--------+----------+
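Missing values can be supplied with fillna(). Filling the null quantity with a default of 25 gives the frame below:

prices.fillna(25, subset=["quantity"]).show()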
+--------+-----------+------------+-----+--------+--------+----------+
| store|countrycode| brand|price|currency|quantity| date|
+--------+-----------+------------+-----+--------+--------+----------+
| Aldi| BE|Diapers-R-Us| 6.8| EUR| 40|2019-02-03|
|Kruidvat| null| Nappy-k| 5.6| EUR| 25|2019-02-15|
+--------+-----------+------------+-----+--------+--------+----------+
# `schema` is assumed here to be a StructType matching the employees file
employees = spark.read.options(header="true").schema(schema).csv('employees.csv')
employees.show()
+-------------+----------+----------+----------+
|employee_name|department|start_date| end_date|
+-------------+----------+----------+----------+
| Bob| marketing|2012-06-01|2016-05-02|
| Alice| IT|2018-04-03|9999-12-31|
+-------------+----------+----------+----------+
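Sentinel dates like 9999-12-31 mean "no end date yet" and can skew date arithmetic. The snippet below nullifies end dates more than a year out; the definition of one_year_from_now is an assumed reconstruction:

from datetime import date
from pyspark.sql.functions import col, when

# Assumed helper: the same calendar day, one year ahead
one_year_from_now = date.today().replace(year=date.today().year + 1)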
better_frame = employees.withColumn(
    "end_date",
    when(col("end_date") > one_year_from_now, None).otherwise(col("end_date")))
better_frame.show()
+-------------+----------+----------+----------+
|employee_name|department|start_date| end_date|
+-------------+----------+----------+----------+
| Bob| marketing|2012-06-01|2016-05-02|
| Alice| IT|2018-04-03| null|
+-------------+----------+----------+----------+
Why do we need to transform data?
Process:
1. Collect data
2. Transform the data
3. Derive insights
Example:

country | purchase_order
--------|---------------
India   | 87254800912
Ukraine | 32498562223
Spain   | 74398221190

European purchases? Filter the rows, then rename the column:

country | purchase_order
--------|---------------
Ukraine | 32498562223
Spain   | 74398221190

->

country_of_purchase | purchase_order
--------------------|---------------
Ukraine             | 32498562223
Spain               | 74398221190
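In PySpark, this filter-and-rename could look like the following sketch (the purchases DataFrame and the country list are constructed here purely for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
purchases = spark.createDataFrame(
    [("India", "87254800912"), ("Ukraine", "32498562223"), ("Spain", "74398221190")],
    ["country", "purchase_order"])

european_countries = ["Ukraine", "Spain"]  # illustrative, not exhaustive

(purchases
 .filter(col("country").isin(european_countries))
 .withColumnRenamed("country", "country_of_purchase")
 .show())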
Common data transformations:
1. Filtering rows
2. Selecting and renaming columns
3. Grouping and aggregating
4. Joining multiple datasets
5. Ordering results
+---------+-----------+------------+-----+--------+--------+----------+
| store|countrycode| brand|price|currency|quantity| date|
+---------+-----------+------------+-----+--------+--------+----------+
| Aldi| BE|Diapers-R-Us| 6.8| EUR| 40|2019-02-03|
| Kruidvat| BE| Nappy-k| 4.8| EUR| 30|2019-01-28|
|Carrefour| FR| Nappy-k| 5.7| EUR| 30|2019-02-06|
| Tesco| IRL| Pampers| 6.3| EUR| 35|2019-02-07|
| DM| DE| Huggies| 6.8| EUR| 40|2019-02-01|
+---------+-----------+------------+-----+--------+--------+----------+
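The frame below keeps only the Belgian stores, ordered by date; a sketch of the code that produces it:

from pyspark.sql.functions import col

(prices
 .filter(col("countrycode") == "BE")
 .orderBy(col("date"))
 .show())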
+--------+-----------+------------+-----+--------+--------+----------+
| store|countrycode| brand|price|currency|quantity| date|
+--------+-----------+------------+-----+--------+--------+----------+
|Kruidvat| BE| Nappy-k| 4.8| EUR| 30|2019-01-28|
| Aldi| BE|Diapers-R-Us| 6.8| EUR| 40|2019-02-03|
+--------+-----------+------------+-----+--------+--------+----------+
(prices
 .groupBy(col('brand'))
 .mean('price')
).show()
+------------+------------------+
| brand| avg(price)|
+------------+------------------+
|Diapers-R-Us| 6.800000190734863|
| Pampers| 6.300000190734863|
| Huggies| 7.0|
| Nappy-k|5.3666666348775225|
+------------+------------------+
from pyspark.sql.functions import avg, count

(prices
 .groupBy(col('brand'))
 .agg(
     avg('price').alias('average_price'),
     count('brand').alias('number_of_items')
 )
).show()
+------------+------------------+---------------+
| brand| average_price|number_of_items|
+------------+------------------+---------------+
|Diapers-R-Us| 6.800000190734863| 1|
| Pampers| 6.300000190734863| 1|
| Huggies| 7.0| 2|
|     Nappy-k|5.3666666348775225|              3|
+------------+------------------+---------------+
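Joining the prices with a product ratings dataset adds the rating columns to each row. The ratings frame (with columns brand, model, absorption_rate and comfort) and the join key are assumptions inferred from the output below:

ratings.join(prices, on="brand").show()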
+------------+-------+---------------+-------+---------+-----------+-----+--------+--------+----------+
|       brand|  model|absorption_rate|comfort|    store|countrycode|price|currency|quantity|      date|
+------------+-------+---------------+-------+---------+-----------+-----+--------+--------+----------+
|Diapers-R-Us|6months|              2|      3|     Aldi|         BE|  6.8|     EUR|      40|2019-02-03|
|     Nappy-k|2months|              3|      4| Kruidvat|         BE|  4.8|     EUR|      30|2019-01-28|
|     Nappy-k|2months|              3|      4|Carrefour|         FR|  5.7|     EUR|      30|2019-02-06|
|     Pampers|3months|              4|      4|    Tesco|        IRL|  6.3|     EUR|      35|2019-02-07|
|     Huggies|newborn|              3|      5|       DM|         DE|  6.8|     EUR|      40|2019-02-01|
+------------+-------+---------------+-------+---------+-----------+-----+--------+--------+----------+
Running your pipeline locally

Your pipeline is a regular Python program, so you run it with spark-submit. Conditions: Spark must be installed locally, and any modules your job imports must be shipped along, here bundled as dependencies.zip and passed via --py-files:
spark-submit \
--py-files dependencies.zip \
pydiaper/cleaning/clean_prices.py