Big Data Framework

Final Project: Predictive Model Creation Using PySpark ML

Project Description:

The primary objective of your final project is to build a predictive model on a dataset of your choice, utilizing machine learning. This project is designed to provide a hands-on opportunity to apply the knowledge and skills you've acquired.

Details:

Data Selection & Acquisition: Select a dataset of your choice that can be meaningfully interpreted with machine learning. This could be anything from social media posts to product reviews or news articles.

Recommendations:

1. Loan default prediction

2. Fake news detection

3. Bitcoin price prediction

Data Handling: Import your dataset using Python and pandas, and define the schema.
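
A minimal sketch of this step, assuming a hypothetical CSV file named loans.csv with illustrative columns (loan_id, income, loan_amount, grade, defaulted); substitute the path and schema for your own dataset:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

spark = SparkSession.builder.appName("FinalProject").getOrCreate()

# Load the raw file with pandas for an initial look
pdf = pd.read_csv("loans.csv")

# Define an explicit schema before handing the data to Spark
schema = StructType([
    StructField("loan_id", StringType(), True),
    StructField("income", DoubleType(), True),
    StructField("loan_amount", DoubleType(), True),
    StructField("grade", StringType(), True),
    StructField("defaulted", IntegerType(), True),
])

df = spark.createDataFrame(pdf, schema=schema)
df.printSchema()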

Data Preprocessing: Check the dimensions, describe the data, handle any missing
or duplicate data, find the count of unique values in a column, and conduct other
necessary data preprocessing steps.
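
A minimal sketch of these checks, assuming the Spark DataFrame df and the illustrative column names from the loading sketch above:

from pyspark.sql import functions as F

# Dimensions (rows, columns)
print((df.count(), len(df.columns)))

# Summary statistics
df.describe().show()

# Missing values per column
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Remove duplicate rows
df = df.dropDuplicates()

# Count of unique values in a column
print(df.select("grade").distinct().count())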

Encode Categorical Variables: Convert your categorical variables into numerical form using techniques such as label encoding and one-hot encoding, and then combine the resulting feature columns with a VectorAssembler.

Definition for your reference: The VectorAssembler combines individual feature columns into a single feature vector column used to train the machine learning model (such as logistic regression). It accepts numeric, Boolean, and vector-type columns.
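
A minimal sketch of the encoding and assembling steps, assuming Spark 3.x and the illustrative categorical column grade and numeric columns income and loan_amount from the earlier sketches:

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Label-encode the categorical column
indexer = StringIndexer(inputCol="grade", outputCol="grade_index")
df_indexed = indexer.fit(df).transform(df)

# One-hot encode the indexed column
encoder = OneHotEncoder(inputCols=["grade_index"], outputCols=["grade_vec"])
df_encoded = encoder.fit(df_indexed).transform(df_indexed)

# Combine numeric and encoded columns into a single feature vector
assembler = VectorAssembler(
    inputCols=["income", "loan_amount", "grade_vec"],
    outputCol="features",
)
df_features = assembler.transform(df_encoded)
df_features.select("features").show(5, truncate=False)
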
Data Transformation: Apply suitable transformations and scaling techniques to
your feature-engineered data.
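
A minimal sketch of scaling, assuming the assembled features column from the previous step; StandardScaler is shown here, and other scalers such as MinMaxScaler work similarly:

from pyspark.ml.feature import StandardScaler

# Scale the assembled feature vector to unit standard deviation
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True)
df_scaled = scaler.fit(df_features).transform(df_features)
df_scaled.select("scaled_features").show(5, truncate=False)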

Model Building: Choose and implement a machine learning model using Pyspark.
Depending on your chosen dataset and research question, this can be a
classification or regression model (logistic regression, SVM, decision tree, etc.).
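
A minimal sketch of model building and evaluation using logistic regression, assuming the scaled features and the hypothetical binary label column defaulted from the earlier sketches:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Split into training and test sets
train, test = df_scaled.randomSplit([0.8, 0.2], seed=42)

# Fit a logistic regression classifier
lr = LogisticRegression(featuresCol="scaled_features", labelCol="defaulted")
model = lr.fit(train)

# Evaluate on the held-out test set
predictions = model.transform(test)
evaluator = BinaryClassificationEvaluator(labelCol="defaulted", metricName="areaUnderROC")
print("Test AUC:", evaluator.evaluate(predictions))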

Deliverables:
1. Project Report: A comprehensive report detailing your project, including dataset
choice and why, preprocessing steps, model choice and why, evaluation metrics,
and conclusions. (10 marks)

Grading criteria:

1. The document is organized and formatted well (2.5 marks)

2. The document covers dataset choice and why, preprocessing steps, model choice and why, evaluation metrics, and conclusions (2.5 marks)

3. The document defines all the techniques used and how they work, and explains how effective the model was and why (2.5 marks)

4. The document covers definitions of ML terms and demonstrates a strong understanding of the process and concepts/terms (2.5 marks)

2. Code: A well-commented Jupyter Notebook containing all your Python code. Your code should be neat, readable, and reproducible. (20 marks)

Grading criteria:

1. The code is organized and formatted well with solid use of commenting (5 marks)
2. Code is error-free and is written efficiently (5 marks)
3. Code uses all the recommended tools and libraries effectively and correctly (5 marks)
4. Code is dynamic and is made to be reusable (5 marks)
