
CE – COLLEGE OF ENGINEERING

MASTER OF SCIENCE IN ELECTRICAL AND ELECTRONICS ENGINEERING WITH MANAGEMENT

ARTIFICIAL INTELLIGENCE AND BIG DATA

EET736

Mini Project

Big Data with Spark and Hadoop

PREPARED BY

MUHAMMAD HAFISZAN BIN MOHD AKHIR

2023761495
Final Project: Data Analysis using Spark.

This final project is similar to the Practice Project you did. In this project, you will not be provided with
hints or solutions. You will create a DataFrame by loading data from a CSV file and apply
transformations and actions using Spark SQL. This needs to be achieved by performing the following
tasks:

• Task 1: Generate DataFrame from CSV data.
• Task 2: Define a schema for the data.
• Task 3: Display schema of DataFrame.
• Task 4: Create a temporary view.
• Task 5: Execute an SQL query.
• Task 6: Calculate Average Salary by Department.
• Task 7: Filter and Display IT Department Employees.
• Task 8: Add 10% Bonus to Salaries.
• Task 9: Find Maximum Salary by Age.
• Task 10: Self-Join on Employee Data.
• Task 11: Calculate Average Employee Age.
• Task 12: Calculate Total Salary by Department.
• Task 13: Sort Data by Age and Salary.
• Task 14: Count Employees in Each Department.
• Task 15: Filter Employees with the letter o in the Name.

Prerequisites
1. For this lab assignment, you will be using Python and Spark (PySpark). Therefore, it is essential to make sure that the required libraries are installed in your lab environment or within Skills Network (SN) Labs (a minimal setup sketch follows this list).
2. Download the CSV data.
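As a minimal setup sketch, assuming a pip-based environment (the application name "EmployeeDataAnalysis" is arbitrary and not part of the assignment), the following installs PySpark and creates the SparkSession used in the tasks below:

# Install PySpark if it is not already available (pip-based environment assumed)
# pip install pyspark

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for the project
spark = SparkSession.builder.appName("EmployeeDataAnalysis").getOrCreate()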

Task 1: Generate a Spark DataFrame from the CSV data


Read data from the provided CSV file, employees.csv, and import it into a Spark DataFrame variable
named employees_df.
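A minimal sketch, assuming the CSV file has a header row and is located in the working directory:

# Read the CSV into a DataFrame, letting Spark infer the column types
employees_df = spark.read.csv("employees.csv", header=True, inferSchema=True)

# Preview the first few rows
employees_df.show(5)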

Task 2: Define a schema for the data


Construct a schema for the input data and then utilize the defined schema to read the CSV file to
create a DataFrame named employees_df.
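One possible sketch; the column names and types below (Emp_No, Emp_Name, Age, Salary, Department) are assumptions and should be adjusted to the actual header of employees.csv:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema; adjust field names and types to match the CSV
schema = StructType([
    StructField("Emp_No", IntegerType(), True),
    StructField("Emp_Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Department", StringType(), True),
])

# Re-read the CSV using the explicit schema
employees_df = spark.read.csv("employees.csv", header=True, schema=schema)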
Task 3: Display schema of DataFrame
Display the schema of the employees_df DataFrame, showing all columns and their respective data
types.
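For example:

# Print column names and their data types
employees_df.printSchema()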

Task 4: Create a temporary view


Create a temporary view named employees for the employees_df DataFrame, enabling Spark SQL
queries on the data.
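For example:

# Register the DataFrame as a temporary view named "employees"
employees_df.createOrReplaceTempView("employees")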

Task 5: Execute an SQL query


Compose and execute an SQL query to fetch the records from the employees view where the age of
employees exceeds 30. Then, display the result of the SQL query, showcasing the filtered records.
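A sketch of the query, assuming the age column is named "Age":

# Fetch employees older than 30 from the temporary view
spark.sql("SELECT * FROM employees WHERE Age > 30").show()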
Task 6: Calculate Average Salary by Department
Compose an SQL query to retrieve the average salary of employees grouped by department. Display
the result.
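A sketch, assuming "Department" and "Salary" column names:

# Average salary per department
spark.sql("""
    SELECT Department, AVG(Salary) AS avg_salary
    FROM employees
    GROUP BY Department
""").show()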

Task 7: Filter and Display IT Department Employees


Apply a filter on the employees_df DataFrame to select records where the department is 'IT'. Display
the filtered DataFrame.
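A sketch, assuming the department column is named "Department":

from pyspark.sql.functions import col

# Keep only rows where the department is 'IT'
employees_df.filter(col("Department") == "IT").show()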
Task 8: Add 10% Bonus to Salaries
Perform a transformation to add a new column named "SalaryAfterBonus" to the DataFrame.
Calculate the new salary by adding a 10% bonus to each employee's salary.
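A sketch, assuming the salary column is named "Salary":

from pyspark.sql.functions import col

# Add a SalaryAfterBonus column equal to the salary plus a 10% bonus
employees_df = employees_df.withColumn("SalaryAfterBonus", col("Salary") * 1.10)
employees_df.show(5)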

Task 9: Find Maximum Salary by Age


Group the data by age and calculate the maximum salary for each age group. Display the result.
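A sketch, assuming "Age" and "Salary" column names:

from pyspark.sql.functions import max as max_

# Maximum salary within each age group
employees_df.groupBy("Age").agg(max_("Salary").alias("max_salary")).show()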
Task 10: Self-Join on Employee Data
Join the "employees_df" DataFrame with itself based on the "Emp_No" column. Display the result.

Task 11: Calculate Average Employee Age


Calculate the average age of employees using the built-in aggregation function. Display the result.
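A sketch, assuming the age column is named "Age":

from pyspark.sql.functions import avg

# Average age across all employees
employees_df.agg(avg("Age").alias("average_age")).show()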

Task 12: Calculate Total Salary by Department


Calculate the total salary for each department using the built-in aggregation function. Display the
result.
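A sketch, assuming "Department" and "Salary" column names:

from pyspark.sql.functions import sum as sum_

# Total salary per department
employees_df.groupBy("Department").agg(sum_("Salary").alias("total_salary")).show()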
Task 13: Sort Data by Age and Salary
Apply a transformation to sort the DataFrame by age in ascending order and then by salary in
descending order. Display the sorted DataFrame.
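A sketch, assuming "Age" and "Salary" column names:

from pyspark.sql.functions import col

# Ascending by age, then descending by salary
employees_df.orderBy(col("Age").asc(), col("Salary").desc()).show()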
Task 14: Count Employees in Each Department
Calculate the number of employees in each department. Display the result.
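A sketch, assuming the department column is named "Department":

# Number of employees per department
employees_df.groupBy("Department").count().show()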

Task 15: Filter Employees with the letter o in the Name


Apply a filter to select records where the employee's name contains the letter 'o'. Display the filtered
DataFrame.
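A sketch; the name column "Emp_Name" is an assumption and should be adjusted to the actual schema:

from pyspark.sql.functions import col

# Keep rows whose name contains the letter 'o'
employees_df.filter(col("Emp_Name").contains("o")).show()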

Congratulations! You have completed the project.
Now you know how to create a DataFrame from a CSV data file and perform a variety
of DataFrame transformations and actions using Spark SQL.
