
Yash 21BSDS12 Predictive Analysis Report

The document discusses predicting student performance through educational data mining techniques. It proposes using machine learning algorithms like KNN, LDA, and SVM on educational data to predict student performance in mathematics and Portuguese lessons. The document also provides a literature review on previous research applying classification and feature selection methods to student performance data.


UNITEDWORLD SCHOOL OF COMPUTATIONAL INTELLIGENCE (USCI)

Summative Assessment (SA)

Submitted by

Yash Kamleshkumar Patel


(Enrolment No.: 20210701004)

Course Code and Title: 21BSCS35E06 Predictive Analytics

B.Sc. (Hons.) Computer Science / Data Science / AIML

V Semester – July – Nov 2023

Nov/Dec 2023
TABLE OF CONTENTS

CHAPTER NO TITLE

1 ABSTRACT

2 KEYWORDS

3 INTRODUCTION

4 LITERATURE SURVEY

5 PROPOSED METHODOLOGY

6 EXTRACTION OF FEATURES

7 DATA PREPROCESSING

8 DATASET DESCRIPTION

9 DEVELOPMENT AND IMPLEMENTATION

10 RESULTS AND CONCLUSION

11 REFERENCES

Title: Student Performance Prediction

Abstract:

Educational data mining plays a crucial role in improving the academic performance of
students. By applying data mining techniques and algorithms to educational data, it becomes
possible to predict the academic performance of both students and instructors. This paper
proposes a machine learning approach to forecast the academic success of secondary school
students in Mathematics and Portuguese lessons. To address the issue of unbalanced class
distribution, the proposed approach applies normalization and z-score normalization
during the pre-processing stage. Feature selection is carried out with a genetic algorithm.
To estimate students' achievement in both Mathematics and Portuguese lessons, k-nearest
neighbour (KNN), linear discriminant analysis (LDA), and support vector machine (SVM)
classifiers are used. The methods are compared on accuracy, precision, F-score, and
sensitivity.

Keywords:

Educational Data Mining, K-Nearest Neighbour, Linear Discriminant Analysis, Machine
Learning, Support Vector Machine

Introduction:
Predicting the factors that influence student performance early in an academic program can
help address the high dropout rates observed in higher education institutions. Educational
Data Mining (EDM) offers various data mining tools for analyzing and predicting student
performance, and interventions based on such predictions have improved pass rates, reduced
dropout rates, and increased retention rates. Several tools are available in EDM for data
manipulation and feature engineering, such as Microsoft Excel, EDM Workbench, Python with
Jupyter Notebook, and Structured Query Language (SQL). No single tool suits every EDM task;
the right choice depends on the task at hand. A wide range of classification algorithms can
also be used to predict performance, including random forest, support vector machines,
AdaBoost, decision tree, Naïve Bayes, and K-nearest neighbor. Kumar et al. used EDM to
improve retention rates by predicting slow learners among high school students and
providing them with appropriate interventions. Their study identified Naïve Bayes,
Multilayer Perceptron, SMO, J48, and REPTree as the most widely used techniques in EDM
research. This paper aims to explore additional techniques within EDM.

Educational data mining occupies a distinctive position in this context, as it applies data
mining techniques to extract pertinent information from a wide range of educational data
sets. The International Educational Data Mining Society defines the field as an emerging
discipline concerned with developing methods for exploring the unique and increasingly
large-scale data that come from educational settings, and with using those methods to
better understand students and the settings in which they learn. Put simply, data mining is
a collection of computational techniques for extracting information from large quantities
of data; when the data come from educational contexts, the activity is called educational
data mining. Other authors define it as a field that develops methods to explore data from
educational environments and uses these data to improve our understanding of teaching and
learning processes, or as a research area that aims to improve and refine techniques for
investigating data sets obtained from educational settings. According to these authors,
such data are more diverse than the data traditionally used in mining operations, which
calls for adjustments and novel approaches. At the same time, this diversity is a
significant resource with the potential to enhance education. Consequently, methods and
tools are needed to help validate, interpret, and connect these data in order to derive
valuable and pertinent information. The focus on data mining methods is motivated by their
ability to uncover behavioral patterns, which in turn leads to improved products and
services. Data mining is applied across various domains of expert systems. With the rising
demand for distance learning and computer-based courses, researchers in computer science in
education, particularly those exploring the application of artificial intelligence to
education, have turned to data mining to investigate scientific problems in education.
These problems include understanding the factors that influence education and devising ways
to create a more efficient education system. As a result, the new research field of
Educational Data Mining has attracted significant attention.

Literature Survey:

Research in Educational Data Mining has been ongoing for many years. One study aimed to
improve the accuracy of student performance prediction with an ensemble model that combines
several classification algorithms. The same study also used rule-based methods to identify
association rules that have a significant impact on student performance.

Another study examined the effect of data pre-processing on classification algorithms
applied to a student performance dataset with an unbalanced class distribution. To tackle
the imbalance, the researchers applied undersampling and oversampling techniques with
support vector machine, decision tree, and naive Bayes classifiers. The highest accuracy
values were obtained with SMOTE, an oversampling algorithm.

A further study assessed various classification algorithms on a student performance dataset
using the open-source software Weka. During the data pre-processing stage, the authors
observed that certain features had a stronger influence on student performance, and they
concluded that the accuracy of the classification algorithms improved substantially after
pre-processing.

Other authors used decision tree-based classification algorithms to forecast students'
performance. Similarly, another group of researchers estimated student performance on two
distinct datasets using linear regression, decision tree, and naive Bayes algorithms. The
accuracy of the algorithms increased after feature selection on both datasets.

Another study used iterative classifier, OneR, LogitBoost, and artificial neural network
methods to predict students' performance level on a student achievement dataset, and
obtained the best results with the OneR method.

In another study, the authors evaluated classification algorithms on a student dataset
after applying a feature selection method with the decision tree. The random forest
algorithm achieved the highest accuracy.

Furthermore, one study predicted students' success in their courses by applying the Naive
Bayes classification algorithm to a student performance dataset. Other researchers analyzed
decision tree and support vector machine algorithms using 10-fold cross-validation on a
student dataset, and tuned the hyperparameters with grid search to improve performance.

Another group studied a student dataset under three formulations: binary classification,
five-level classification, and regression. For comparison, they used k-means, nearest
neighbor, support vector machine, and naive Bayes algorithms. Their findings shed light on
the strengths and weaknesses of each approach.

Lastly, one study predicted students' willingness to attend a higher education program
using support vector machine, multilayer perceptron, and random forest algorithms. The
random forest algorithm achieved the highest success rate.

In summary, the research conducted in the field of Educational Data Mining has witnessed
significant progress over the years. Scholars have explored various classification algorithms,
association rules, and data pre-processing techniques to improve the accuracy and
performance of predicting student data. These studies have contributed valuable insights into
the factors influencing student performance and have paved the way for more advanced
methods and algorithms in the field.

Proposed methodology:

The flow of the Proposed Student Performance Prediction

Educational data mining is a process that transforms data collected in the educational
environment into useful information for analysis and prediction. It is widely used in
education to predict student performance with a variety of techniques and methods. In
developing an effective educational data mining model, feature selection plays a crucial
role in finding the attributes that contribute most to accurate predictions, and
classification methods then make predictions from the selected features. In the present
study, a feature selection approach based on genetic algorithms was proposed to assess the
performance of high school students in language lessons. The SVM, K-NN, and LDA
classification methods were used to evaluate the effectiveness of the proposed approach. To
improve the prediction accuracy of the classification models, the unbalanced distribution
of classes was addressed first. These methods are described in detail in the following
subsections.
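
The report does not give the details of its genetic algorithm, so the following is only a
minimal sketch of how genetic feature selection can work. Everything specific in it is an
assumption made for illustration: the bit-mask encoding, the leave-one-out k-NN accuracy
used as fitness, the operators (tournament selection, one-point crossover, bit-flip
mutation, elitism), the parameter values, and the synthetic data. It uses only the Python
standard library.

```python
import random

def loo_knn_accuracy(X, y, mask, k=3):
    """Leave-one-out accuracy of a k-NN classifier using only features where mask is 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, xi in enumerate(X):
        # Squared Euclidean distance from sample i to every other sample.
        neighbours = sorted(
            (sum((xi[j] - xo[j]) ** 2 for j in cols), y[o])
            for o, xo in enumerate(X) if o != i
        )
        votes = [label for _, label in neighbours[:k]]
        predicted = max(set(votes), key=votes.count)  # majority vote
        correct += predicted == y[i]
    return correct / len(X)

def select_features(X, y, generations=15, pop_size=12, mutation_rate=0.1, seed=0):
    """Search for a feature mask with high fitness using a simple genetic algorithm."""
    rng = random.Random(seed)
    n = len(X[0])
    cache = {}

    def fitness(mask):
        key = tuple(mask)
        if key not in cache:  # each distinct mask is evaluated only once
            cache[key] = loo_knn_accuracy(X, y, mask)
        return cache[key]

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = ranked[:2]  # elitism: the two best masks survive unchanged
        while len(next_gen) < pop_size:
            # Tournament selection: each parent is the best of three random masks.
            p1 = max(rng.sample(ranked, 3), key=fitness)
            p2 = max(rng.sample(ranked, 3), key=fitness)
            cut = rng.randint(1, n - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: each bit flips with probability mutation_rate.
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    return best, fitness(best)

# Synthetic two-class data (hypothetical): features 0 and 1 depend on the label,
# features 2 to 5 are pure noise.
data_rng = random.Random(42)
X, y = [], []
for i in range(40):
    label = i % 2
    informative = [label + data_rng.gauss(0, 0.3), 2 * label + data_rng.gauss(0, 0.3)]
    noise = [data_rng.gauss(0, 1) for _ in range(4)]
    X.append(informative + noise)
    y.append(label)

best_mask, best_accuracy = select_features(X, y)
print("selected mask:", best_mask, "accuracy:", best_accuracy)
```

In a setup like the one described above, the fitness function would be replaced by the
cross-validated score of the SVM, K-NN, or LDA classifier on the selected columns.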

Data Acquisition and Pre-Processing

Data obtained from real-world databases often contain inconsistencies, errors, or missing
values, or may simply be unsuitable for data mining. The data must therefore be
pre-processed to ensure its quality and suitability for further analysis. One of the key
pre-processing techniques is normalization, which brings the attributes of the feature
vectors to the same scale so that the magnitude of an attribute does not dominate the
analysis and attributes can be compared meaningfully. Normalization aims to preserve the
shape of each attribute's distribution and the information it carries. Normalized data are
commonly scaled to the interval [0, 1] or [-1, 1]. Another technique is z-score
normalization, in which each attribute is centred on its mean and divided by its standard
deviation. The result has zero mean and unit standard deviation, which places attributes
with different units on a comparable scale; it does not by itself make the data normally
distributed.
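
The two scaling techniques described above can be sketched in a few lines. This is an
illustration only: the function names and the sample grades are hypothetical, and the
population standard deviation is an assumed choice, since the report does not say which
variant it used.

```python
from statistics import mean, pstdev

def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values to the interval [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant attribute carries no information; map it to the lower bound.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]

def z_score(values):
    """Centre values on zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

grades = [0, 5, 10, 15, 20]  # hypothetical grades on a 0-20 scale
scaled = min_max_scale(grades)  # [0.0, 0.25, 0.5, 0.75, 1.0]
standardized = z_score(grades)  # zero mean, unit standard deviation
print(scaled)
print(standardized)
```

Passing `low=-1.0` to `min_max_scale` gives the [-1, 1] variant mentioned above.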

Extraction of feature

The extraction of features from the collected data is of central importance in this study.
The data used in this research came from various sources in higher education institutions
and were obtained from the UCI repository. Building a predictive model on these data posed
several challenges, including data inconsistency, class imbalance, and attribute overlap.
However, for students who have spent at least one semester in a college program, several
data characteristics were identified as highly informative for building accurate predictive
models. Feature extraction, a crucial step in the classification process, involves
selecting the input variables (i.e., the substantial features) that effectively describe
student characteristics and can be used to predict their performance. In the present study,
the data included student grades and demographic, social, and academic characteristics,
collected through school reports and questionnaires. Two sets of performance data were
provided for two subjects: Mathematics (Math) and Portuguese (Po). In a previous study,
both datasets were modelled as binary classification, five-level classification, and
regression tasks. The target attribute, G3, is closely related to attributes G1 and G2: G3
is the final grade (issued in the 3rd period), while G1 and G2 are the grades from the 1st
and 2nd periods, respectively. Predicting G3 without G1 and G2 is more difficult, but
including them significantly improves prediction accuracy, which makes them highly valuable
for predictive modelling.

In conclusion, the proposed student performance prediction approach uses data collected in
the educational environment to generate useful information for predicting student
performance. The research addresses the
challenges of feature selection and classification, proposing a novel feature selection
approach based on genetic algorithms and evaluating its effectiveness using SVM, K-NN,
and LDA classification methods. Additionally, the issue of an unbalanced distribution of
classes is addressed to enhance the accuracy of predictions. Furthermore, the study highlights
the importance of data acquisition and pre-processing, emphasizing the need for data
normalization techniques to ensure data quality and comparability. The extraction of features
from the collected data proves to be a critical step in the classification process, with
significant attributes identified to accurately predict student performance. Overall, this
research contributes to the field of educational data mining by providing insights into the
predictive modeling of student performance and showcasing the benefits of utilizing
advanced techniques and methodologies.

Data set and its description:

The data is publicly available at
https://www.kaggle.com/spscientist/students-performance-in-exams/activity.

Development and implementation

The student performance analysis project uses several libraries and modules to produce the
desired output: NumPy, pandas, Matplotlib, and seaborn, along with models such as random
forest, decision tree, and linear regression.

The data set has the following features: gender, race/ethnicity, parental level of
education, lunch, test preparation course, math score, reading score, and writing score.

We verify that the data set contains no null values, since missing values could distort the
predictions.
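
A null check of this kind can be sketched as follows. The report's notebook uses pandas,
where `DataFrame.isnull().sum()` gives the same per-column count; the version below is a
standard-library illustration only. The three-row CSV excerpt, its column names, and the
deliberately blank cell are hypothetical.

```python
import csv
import io

# Hypothetical excerpt; one reading score is left blank on purpose.
sample_csv = """gender,math score,reading score,writing score
female,72,72,74
male,69,,88
female,90,95,93
"""

def count_missing(rows):
    """Return {column: number of empty or whitespace-only cells}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value is None or value.strip() == "":
                counts[column] += 1
    return counts

rows = list(csv.DictReader(io.StringIO(sample_csv)))
missing = count_missing(rows)
print(missing)  # per-column count of missing cells
```

A count of zero in every column means no imputation or row removal is needed before
modelling.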

Description of the data set used for the project.

We compute the correlation between features to identify those that help predict the target
value.
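
As an illustration of the correlation step, the sketch below computes a Pearson correlation
matrix for three score columns. The report does not say which correlation measure it used,
so Pearson is an assumption, and the five rows of scores are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for five students.
scores = {
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
}

# Correlation of every column with every other column.
matrix = {a: {b: pearson(scores[a], scores[b]) for b in scores} for a in scores}
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Columns that correlate strongly with the target (here, the math score) are natural
candidates as predictors.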

Of all the models used to predict the math score, linear regression gives the highest
accuracy, 88%.

In this example, the predicted value matches the actual value: the actual math score is
79.5 and the model predicts 79.5.

Result and discussion

The results of the models used in the project are listed below. The three best-performing
models are shown with their accuracy.

Method used          Accuracy

Linear Regression    88%

Random Forest        85%

Decision Tree        73%

The linear regression model achieved an accuracy of 88%, while the random forest and
decision tree models achieved 85% and 73%, respectively.
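
The report does not state how accuracy was computed for these regression models. One common
choice, assumed here, is the coefficient of determination R² (this is what scikit-learn's
`score` method returns for regressors). The sketch below fits a one-variable least-squares
line and computes R² using only the standard library; the data are hypothetical and do not
reproduce the 88% figure.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical data: predict the math score from the reading score.
reading = [72, 90, 95, 57, 78]
math_score = [72, 69, 90, 47, 76]

slope, intercept = fit_line(reading, math_score)
predicted = [slope * x + intercept for x in reading]
r2 = r_squared(math_score, predicted)
print(f"R^2 on the training data: {r2:.2f}")
```

An R² of 0.88 would mean the model explains 88% of the variance in the math score, which is
not the same as classifying 88% of students correctly.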

Conclusion

Many research studies have assessed students' academic achievement using machine learning
algorithms. With an appropriate data pre-processing procedure and a well-chosen algorithm,
the accuracy of the predictions can be improved. This research proposes several classifier
algorithms to predict the success of secondary school students in Mathematics and
Portuguese classes. In the data pre-processing phase, the issue of imbalanced class
distribution is addressed using data normalization techniques, and the data are normalized
to the [0, 1] interval. Feature selection is performed with a genetic algorithm.
Hyper-parameter tuning is then carried out for the proposed classifiers during training on
a five-level classification task.

In our study, we emphasize that the characteristics of the students that we have incorporated
in our research are not restricted or limited in any way. On the contrary, it is possible to
introduce additional attributes into our database in order to enhance the effectiveness and
precision of our model. It is important to note that these new attributes have the potential to
contribute significantly to the refinement and optimization of our analysis. Furthermore, it is
not only the inclusion of new attributes that can enhance the depth of our insights, but also
the addition of more experts to our team. By involving a larger number of experts, we can
gain a more comprehensive and nuanced understanding of the factors that influence student
performance. Finally, we propose that the model we have developed in our research can also
be applied to evaluate and analyze the performance of senior students. By extending the
scope of our analysis to include senior students, we can gain valuable insights into their
academic achievements and identify potential areas for improvement. In summary, our work
highlights the potential for continuous expansion and improvement of our model through the
inclusion of new attributes and the involvement of more experts, while also demonstrating its
applicability to the analysis of senior students' performance.

GitHub link:
https://github.com/yashk2102/Predictive-Analysis/blob/main/PA_DM_STUDENT_PERFORMANCE_PREDICTION.ipynb

References

1. https://learninganalytics.upenn.edu/ryanbaker/paper323.pdf

2. https://ieeexplore.ieee.org/document/8809106

3. https://link.springer.com/article/10.1007/s10916-020-01562-1

4. https://revistas.ucv.edu.pe/index.php/ucv-scientia/article/view/1170

5. https://ieeexplore.ieee.org/document/8875405

6. https://ieeexplore.ieee.org/document/8725237

7. https://www.scielo.cl/scielo.php?script=sci_arttext&pid=S0718-50062020000500233&lng=en&nrm=iso&tlng=en

8. http://sedici.unlp.edu.ar/bitstream/handle/10915/54674/Documento_completo__.pdf-PDFA.pdf?sequence=1

9. https://ieeexplore.ieee.org/document/9080033