Assignment 4

This assignment involves analyzing a synthetic dataset on student performance factors, due on December 14, with a 9% weight. Students must submit a Python script, a report in PDF format, and graphs in PNG format, focusing on specific factors like study hours and teacher quality. The assignment includes tasks for data reading, statistical analysis, and trend analysis using linear regression and plotting.

Uploaded by

edwina.liu06
📊 Worth: 9%
📅 Due: December 14 at midnight
🕑 Late submissions: 5% penalty per late day. Maximum of 5 late days allowed.
Feel free to add more functions to avoid code repetition

⚠ What to Submit
One team member should submit the following:

A .py file named performance_factors.py
A report in .pdf format
All graphs in .png format

Modules
Ensure that the following modules are installed:

matplotlib
numpy

Student Performance Factors


Context

(Optional read)

In this assignment, you will study the impact of various factors affecting student performance in exams. "Student Performance Factors" is a synthetic dataset created for educational purposes only. It can be found on Kaggle, a very popular data science website where many AI competitions are hosted.

The dataset includes various factors that may affect student performance, such as study habits, attendance, and parental involvement. Your goal is to identify which factors have the greatest impact on student exam scores.

Dataset

The data contains 6608 lines and 20 columns. Each line represents the data for one student and has the following columns:

Column                      Type          Description                                                                    Index
Hours_Studied               int or float  Number of hours spent studying per week.                                       0
Attendance                  int or float  Percentage of classes attended.                                                1
Parental_Involvement        string        Level of parental involvement in the student’s education (Low, Medium, High).  2
Access_to_Resources         string        Availability of educational resources (Low, Medium, High).                     3
Extracurricular_Activities  string        Participation in extracurricular activities (Yes, No).                         4
Sleep_Hours                 int or float  Average number of hours of sleep per night.                                    5
Previous_Scores             int or float  Scores from previous exams.                                                    6
Motivation_Level            string        Student’s level of motivation (Low, Medium, High).                             7
Internet_Access             string        Availability of internet access (Yes, No).                                     8
Tutoring_Sessions           int or float  Number of tutoring sessions attended per month.                                9
Family_Income               string        Family income level (Low, Medium, High).                                       10
Teacher_Quality             string        Quality of the teachers (Low, Medium, High).                                   11
School_Type                 string        Type of school attended (Public, Private).                                     12
Peer_Influence              string        Influence of peers on academic performance (Positive, Neutral, Negative).      13
Physical_Activity           int or float  Average number of hours of physical activity per week.                         14
Learning_Disabilities       string        Presence of learning disabilities (Yes, No).                                   15
Parental_Education_Level    string        Highest education level of parents (High School, College, Postgraduate).       16
Distance_from_Home          string        Distance from home to school (Near, Moderate, Far).                            17
Gender                      string        Gender of the student (Male, Female).                                          18
Exam_Score                  int or float  Final exam score. This is the dependent variable.                              19

Columns of interest
Focus on the following factors:

Hours_Studied
Teacher_Quality
School_Type
Two additional numeric factors of your choice.
PART I

1.1 Write the function read_data()

Input parameters:

the file_name

Returns:

A list containing all the required lists: [exam_scores_lists, study_hours_list, choice1_list, choice2_list, teacher_list, school_list]

Task

Use the technique seen in class to read the .csv file

Important: Some lines are missing values in Hours_Studied and Teacher_Quality. In those cases, append None as shown in the example below:

teacher_list = []
# ... code
for line in csv_reader:
    # ... code
    if line[TEACHER_INDEX] == '':  # missing categorical value
        teacher_list.append(None)
    else:
        teacher_list.append(line[TEACHER_INDEX])

For each line make sure to:

clean the data if necessary

convert it to the appropriate data type

add each value to its associated list.
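
The steps above could be put together roughly as follows. This is only a sketch, not the technique seen in class: it assumes the file is read with Python's csv module and that the first line is a header row. The helper names to_number() and to_text() are made up for this example, and Attendance and Sleep_Hours are arbitrary picks for the two "choice" columns. The index constants come from the dataset table.

```python
import csv

# Column indexes from the dataset table. CHOICE1 and CHOICE2 are arbitrary
# picks (Attendance, Sleep_Hours) used only for illustration.
SCORE_INDEX = 19
HOURS_INDEX = 0
CHOICE1_INDEX = 1
CHOICE2_INDEX = 5
TEACHER_INDEX = 11
SCHOOL_INDEX = 12


def to_number(cell):
    """Return the cell as a float, or None when the cell is empty."""
    cell = cell.strip()
    return None if cell == '' else float(cell)


def to_text(cell):
    """Return the cleaned cell text, or None when the cell is empty."""
    cell = cell.strip()
    return None if cell == '' else cell


def read_data(file_name):
    exam_scores_list, study_hours_list = [], []
    choice1_list, choice2_list = [], []
    teacher_list, school_list = [], []

    with open(file_name, newline='') as csv_file:
        csv_reader = csv.reader(csv_file)
        next(csv_reader)  # skip the header row (assumed to be present)
        for line in csv_reader:
            exam_scores_list.append(to_number(line[SCORE_INDEX]))
            study_hours_list.append(to_number(line[HOURS_INDEX]))
            choice1_list.append(to_number(line[CHOICE1_INDEX]))
            choice2_list.append(to_number(line[CHOICE2_INDEX]))
            teacher_list.append(to_text(line[TEACHER_INDEX]))
            school_list.append(to_text(line[SCHOOL_INDEX]))

    return [exam_scores_list, study_hours_list, choice1_list,
            choice2_list, teacher_list, school_list]
```

Routing every cell through one of the two helpers keeps the missing-value check in a single place, so numeric and categorical columns both end up with None for empty cells.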

1.2 Write the function print_stats()

Input parameter

A list of exam scores scores_list

Return

None

Task

Gather statistics on the student scores: the minimum, maximum, average, standard deviation, and median, as well as the count of students.
Calculate the min_score, max_score, avg_score, the median med_score, and the standard deviation std

Calculate the count of elements in the list

Display the values as shown below

Hint: You can use NumPy's functions np.median(scores_list) to calculate the median, np.mean(scores_list) to calculate the average, and np.std(scores_list) to calculate the standard deviation.
Call the function in main(). Copy the results into your report.

Example of output

------Exam scores statistics------


Average: 67.
Median : 67.
Min : 55
Max : 101
Std : 3.
Count : 6607
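
One possible shape for print_stats(), using the NumPy functions from the hint. This is a sketch under two assumptions: the decimal places in the example output above are cut off, so rounding to two decimals is a guess, and the list is assumed to contain no None values.

```python
import numpy as np


def print_stats(scores_list):
    """Print summary statistics for a list of exam scores."""
    min_score = min(scores_list)
    max_score = max(scores_list)
    avg_score = np.mean(scores_list)
    med_score = np.median(scores_list)
    std = np.std(scores_list)  # NumPy's default: population standard deviation
    count = len(scores_list)

    print("------Exam scores statistics------")
    print(f"Average: {avg_score:.2f}")  # two decimals is an assumption
    print(f"Median : {med_score:.2f}")
    print(f"Min    : {min_score}")
    print(f"Max    : {max_score}")
    print(f"Std    : {std:.2f}")
    print(f"Count  : {count}")
```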

✅ Submit your work as a .py file

PART II

In this section, we will focus on plotting and analysing the trends. You are free to design the function in whichever way you want, but ensure that the graphs and the values are saved properly.

1.3 Write a function trend_analysis()

This step uses np.polyfit() to find a linear function that approximates the relationship between
student scores and other columns.

Task:

Fit a linear (degree 1) polynomial to the data using np.polyfit(), where score_list is a function of study_hours_list.

Plot both the original data and the model on the same graph.

Add axis labels and a graph title

Use different marker styles for the model

Add a legend and labels for the plot.

Save the figure as trend_hours_studied.png

Print the equation. Copy the results into your report.

equation = f"$y = {a:.2f}x + {b:.2f}$"


print_answer("Study Hours: ", equation)

(Optional) You can display the equation on the graph as such:


plt.text(20, 70, equation, fontsize=14, color='black', ha='center',
         va='center')

Repeat the previous steps with the two other lists choice1_list and choice2_list.
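
The fitting and plotting steps could be wrapped in one reusable function along these lines. This is a sketch, not a required design: the parameters x_label and file_name are made up for this example. Dropping pairs with a None value before fitting is one way to handle the missing values, not a stated requirement. The Agg backend is selected only so the script writes PNG files without opening a window.

```python
import matplotlib
matplotlib.use("Agg")  # file output only; no window is opened
import matplotlib.pyplot as plt
import numpy as np


def trend_analysis(x_list, score_list, x_label, file_name):
    """Fit score = a*x + b, save a plot of data and model, return (a, b)."""
    # Keep only the pairs where both values are present (missing ones are None).
    pairs = [(x, y) for x, y in zip(x_list, score_list)
             if x is not None and y is not None]
    x = np.array([p[0] for p in pairs], dtype=float)
    y = np.array([p[1] for p in pairs], dtype=float)

    a, b = np.polyfit(x, y, 1)  # degree 1 returns [slope, intercept]

    plt.figure()
    plt.plot(x, y, 'o', markersize=3, alpha=0.4, label="Data")
    x_line = np.array([x.min(), x.max()])
    plt.plot(x_line, a * x_line + b, 'r--', label="Linear model")
    plt.xlabel(x_label)
    plt.ylabel("Exam score")
    plt.title(f"Exam score vs. {x_label}")
    plt.legend()
    plt.savefig(file_name)
    plt.close()
    return a, b
```

With this signature, the three required graphs come from three calls, for example trend_analysis(study_hours_list, exam_scores_list, "Hours studied", "trend_hours_studied.png"), and the returned a and b can be formatted into the equation string.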

Example of graph

Happy Holidays! 🎄🎉✨ Wishing you a good end of semester and a restful break 🌟
