Big Data Framework Final Project
Details:
Data Selection & Acquisition: Select a dataset of your choice that can be
meaningfully interpreted with machine learning, for example social media posts,
product reviews, or news articles.
Recommendations:
Data Handling: Import your dataset using Python and Pandas. Define the schema.
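
As an illustration of the Data Handling step, here is a minimal sketch of importing a CSV with Pandas while enforcing an explicit schema. The column names, the sample rows, and the helper name load_dataset are hypothetical and only stand in for your own dataset.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a real file such as "reviews.csv".
CSV_TEXT = """review_id,rating,verified,review_text
1,5,True,Great product
2,3,False,Average quality
3,1,True,Stopped working
"""

# Explicit schema: column name -> pandas dtype (assumed columns).
SCHEMA = {
    "review_id": "int64",
    "rating": "int64",
    "verified": "bool",
    "review_text": "string",
}


def load_dataset(source, schema):
    """Read a CSV, keeping only the schema columns and enforcing their dtypes."""
    return pd.read_csv(source, dtype=schema, usecols=list(schema))


df = load_dataset(io.StringIO(CSV_TEXT), SCHEMA)
print(df.dtypes)
```

Passing a file path instead of the in-memory buffer should work the same way, and keeping the schema in one dictionary makes the loader reusable across datasets.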
Data Preprocessing: Check the dimensions, describe the data, handle any missing
or duplicate data, find the count of unique values in a column, and conduct other
necessary data preprocessing steps.
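
The preprocessing checks listed above could be sketched in Pandas as follows. The toy DataFrame and the choice to fill missing ratings with the median are assumptions for illustration; the right strategy depends on your data.

```python
import pandas as pd

# Hypothetical toy data with one duplicate row and one missing value.
df = pd.DataFrame({
    "rating": [5, 3, 3, None, 1],
    "category": ["books", "toys", "toys", "books", "games"],
})

print(df.shape)         # dimensions as (rows, columns)
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # number of missing values per column

# Drop exact duplicate rows, then fill missing ratings with the median.
clean = df.drop_duplicates().copy()
clean["rating"] = clean["rating"].fillna(clean["rating"].median())

# Count of distinct values in a column, and how often each one occurs.
n_categories = clean["category"].nunique()
category_counts = clean["category"].value_counts()
print(n_categories)
print(category_counts)
```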
Definition for your reference: The VectorAssembler combines a list of input
columns into a single vector column, which is then used as the feature column to
train a machine learning model (such as Logistic Regression). It accepts numeric,
Boolean, and vector-type input columns.
Data Transformation: Apply suitable transformations and scaling techniques to
your feature-engineered data.
Model Building: Choose and implement a machine learning model using PySpark.
Depending on your chosen dataset and research question, this can be a
classification or regression model (logistic regression, SVM, decision tree, etc.).
Deliverables:
1. Project Report: A comprehensive report detailing your project, including dataset
choice and why, preprocessing steps, model choice and why, evaluation metrics,
and conclusions. (10 marks)
Grading criteria:
1. The code is organized and formatted well with solid use of commenting (5
marks)
2. Code is error-free and is written efficiently (5 marks)
3. Code uses all the recommended tools and libraries effectively and correctly (5
marks)
4. Code is dynamic and is made to be reusable (5 marks)