Development of Student's Academic Performance Prediction Model
1. INTRODUCTION
Education is an important factor that contributes immensely to the growth of society at
large. Many educational organizations and school administrations today leave no stone unturned
to improve their students’ academic performance, since the marks a student obtains in
examinations shape his or her future. They want to increase the number of students who
pass each year in order to raise the quality of education in their institutions,
maintain the reputation of the organization and educate students in a better way.
To achieve this target, Data Mining (DM) is considered one of the most suitable techniques for
giving additional insights to the Institutions of Higher Learning community to help them make
better decisions in educational activities. Data mining is the process of transforming data into
useful information by extracting and making use of previously unknown patterns and trends
from the large amounts of raw data stored in a repository. Though many factors determine a
good institution, the academic performance of its students plays a vital role (Jai & David,
2014).
Educational Data Mining (EDM) is an emerging trend in data mining. It is concerned with
developing techniques for exploring and mining useful educational patterns from the
database of an educational institution and using the extracted knowledge to make decisions and
predictions that enhance the educational system as it relates to students’ performance
and systemic improvement. Researchers in this field focus on discovering useful knowledge
either to help educational institutes manage their students better, or to help students
manage their education and deliverables better and enhance their performance (Amjad, 2016).
2. LITERATURE REVIEW
Data Mining
According to Mehmed (2011), data mining is a process of discovering various models,
summaries, and derived values from a given collection of data. It has been widely used in recent
years due to the availability of huge amounts of data in electronic form and the need to
turn such data into useful information and knowledge for large applications. These
applications are found in fields such as Artificial Intelligence, Machine Learning, Market
Analysis, Statistics and Database Systems, Business Management and Decision Support
Systems.
In practice, the two primary goals of data mining tend to be prediction and description (Mehmed,
2011). Prediction involves using some variables or fields in the data set to predict unknown or
future values of other variables of interest. Description, on the other hand, focuses on finding
patterns describing the data that can be interpreted by humans.
Therefore, it is possible to put data-mining activities into one of two categories:
1. Predictive data mining, which produces the model of the system described by the given data
set, or
2. Descriptive data mining, which produces new, nontrivial information based on the available
data set.
On the predictive end, the goal of data mining is to produce a model, expressed as executable
code, which can be used to perform classification, prediction, estimation, or other similar tasks.
On the descriptive end of the spectrum, the goal is to gain an understanding of the analyzed
system by uncovering patterns and relationships in large data sets.
Figure 1: WEKA tool front view (Source: Purva and Kamal, 2015)
Execution in WEKA
Execution in WEKA is a step-by-step process (Purva and Kamal, 2015). The first step is data
loading. Data can be loaded from various sources, including files, URLs and databases. WEKA can
read files in ".csv" (Comma Separated Values) format. A real-world Excel datasheet is first
converted to a .csv file in which the first row contains the attribute names (separated by
commas), followed by each data row with the attribute values listed in the same order (also
separated by commas). The .csv file is then loaded through the Explorer in WEKA. Once the data is
loaded into WEKA, the data set is automatically saved in ARFF format (Purva and Kamal,
2015).
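As an illustration of the two file formats mentioned above, the following Python sketch converts
CSV text whose first row holds the attribute names into ARFF text. The function name
csv_to_arff, the relation name and the sample attributes are assumptions made for illustration.
WEKA performs this conversion itself when a CSV file is loaded, so this is not WEKA's own code.
Columns whose values all parse as numbers are declared numeric, all others nominal.

```python
import csv
import io


def _is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="students"):
    """Convert CSV text (header row first) into ARFF text.

    Simplified sketch: no quoting of values with spaces, no missing values.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(_is_number(v) for v in values):
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attribute: distinct values in order of first appearance.
            distinct = list(dict.fromkeys(values))
            lines.append("@attribute " + name + " {" + ",".join(distinct) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)


# Hypothetical two-attribute sample; not taken from the study's dataset.
sample = "attendance,test_score\nhigh,62\nlow,48\nhigh,75\n"
arff_text = csv_to_arff(sample)
print(arff_text)
```

The header section declares one @attribute line per CSV column, and the @data section repeats
the CSV rows unchanged.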
Figure 2: Execution in WEKA tool (Source: Purva and Kamal, 2015)
3. METHODOLOGY
In this thesis, the students’ Grade Point Average (GPA) is selected as the dependent parameter.
The collected dataset of forty-nine (49) instances (students) was used to develop a training model
whose precision level and other parameters were considered to ascertain the model’s accuracy.
The model was further checked with a user-supplied test set: the results of the first nine (9)
students were supplied to the model, and the resulting statistics were compared with those of the
training model to confirm its accuracy.
To generate the prediction model, a class of unknown result (Predicted GPA) was created and
supplied to the model. The prediction level of the result and the other performance metrics
indicate the accuracy of the prediction for each student.
Algorithm of the model
The training dataset (I1…In), which comprises all the instances, was loaded into WEKA. The
GPA (Key Performance Indicator) of each student (GPA1…GPAn), together with the other
information about each student in the dataset, is used to build the model. Data with unknown GPA
(nG1…nGn) for each student is then fed into the model for prediction. The generated result
displays the Previous GPA and the Predicted GPA.
Step 1 - Start
Step 2 - Take the input given by the user
    In = {I1, …, In}
Step 3 - Dataset preparation
    Dn = {{I1, …, In}D}
Step 4 - Dataset elaboration
    DI = {GPA1, …, GPAn, I1, …, In, nG1, …, nGn}
Step 5 - Processing
    While (Dn != 0)
    {
        If (nGn == In)
            Check GPAn;
    }
Step 6 - Result generation
    R = {GPA, nG};
Step 7 - Stop
Where,
    In = Input given by users
    Dn = Dataset
    D = Database
    DI = Dataset contents
    GPA1, …, GPAn = Previous GPA for each student
    nG1, …, nGn = New (predicted) GPA for each student
    R = Result generated
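The steps above can be read as a loop that pairs each student's previous GPA with the GPA
predicted by the model. The Python sketch below is one possible interpretation, not the authors'
implementation: the record layout, the function name build_result and the stand-in predictor are
assumptions, and in the study itself the prediction comes from the classifier built in WEKA.

```python
def build_result(dataset, predict):
    """Pair each student's previous GPA with the predicted GPA (Steps 2-6).

    dataset: list of dicts, each with an "id" and a "gpa" key (assumed layout).
    predict: callable that returns the predicted GPA for one record.
    """
    result = []
    for record in dataset:              # Step 5: process every instance
        predicted = predict(record)     # nG: predicted GPA for this instance
        result.append({                 # Step 6: R = {GPA, nG}
            "id": record["id"],
            "previous_gpa": record["gpa"],
            "predicted_gpa": predicted,
        })
    return result


# Stand-in predictor (assumption): returns the previous GPA unchanged.
def same_as_previous(record):
    return record["gpa"]


# Hypothetical records for illustration only.
students = [{"id": "S1", "gpa": 3.5}, {"id": "S2", "gpa": 2.0}]
for row in build_result(students, same_as_previous):
    print(row["id"], row["previous_gpa"], row["predicted_gpa"])
```

Passing the predictor as a parameter keeps the loop independent of how the prediction is
produced, so a trained classifier could be substituted for the stand-in.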
Flowchart of the Model
The system flowchart for training the data is shown in the Figure below:
Start → Select the file from database → Pre-process the data → Convert the file to CSV format → Stop
Figure 6: The Pre-process panel showing the attributes and basic statistics
Of the 49 instances, 46 (93.8776%) are correctly classified, which indicates that the model is
reliable given the small number of incorrectly classified instances. The Kappa statistic is
0.9373, which indicates that the classifier performs far better than chance.
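For reference, the two figures quoted above can be computed from paired lists of actual and
predicted class labels. The sketch below uses the standard definitions: accuracy is
correct / total, and Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed
agreement and p_e the agreement expected by chance. The function names and the toy labels are
assumptions for illustration. The study's figures were reported by WEKA, and only the 46-of-49
ratio is taken from the text.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of instances whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def cohen_kappa(actual, predicted):
    """Cohen's kappa: agreement corrected for chance (undefined if p_e is 1)."""
    n = len(actual)
    p_o = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: sum over labels of the product of marginal frequencies.
    p_e = sum(actual_counts[label] * predicted_counts[label]
              for label in actual_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# The accuracy reported in the text: 46 correct out of 49 instances.
print(round(100 * 46 / 49, 4))  # 93.8776

# Toy labels (hypothetical), not the study's data.
actual = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
print(accuracy(actual, predicted))     # 0.75
print(cohen_kappa(actual, predicted))  # 0.5
```

In the toy example the chance agreement p_e is 0.5, so an observed agreement of 0.75 gives a
kappa of 0.5. A kappa near 1 means that almost all of the agreement goes beyond what chance
alone would produce.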
2. Supplied test set - evaluates the classifier on how well it predicts the class of a set of
instances loaded from a file. Clicking on the ‘Supplied test set’ button brings up a dialog
allowing you to choose the file to test on. The test data was created to control over-fitting:
after the model is created, it is tested to ensure that its accuracy does not decrease on
the test set. This helps ensure that the model will accurately predict future unknown values.
Only nine (9) instances (the first nine students) were selected and used to evaluate the model.
=== Summary ===
Correctly Classified Instances        9      100 %
Incorrectly Classified Instances      0        0 %
Kappa statistic                       1
Mean absolute error                   0
Root mean squared error               0
Relative absolute error               0 %
Root relative squared error           0 %
Total Number of Instances             9
Ignored Class Unknown Instances      40
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000     2.85
1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000     2.96
1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000     4.04
1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000     4.77
1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000     4.88
1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000     3.92
1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000     2.65
1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000     3.35
A total of nine (9) instances (students) were used to evaluate/test the model. The test results
agree with those of the Training data set model, which suggests that the model is suitable for
prediction.
The correctly classified instances are 9 (100%), the incorrectly classified instances are 0 (0%),
the Kappa statistic is 1, and 40 instances are ignored (because only nine (9) of the forty-nine
(49) students of ACU were used to test the model). The True Positive (TP) rate is 1, the False
Positive (FP) rate is 0, and Precision, Recall and F-Measure are all 1. This indicates that the
model is strong and suitable for prediction.
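The per-class figures in the Detailed Accuracy table follow the usual definitions: precision is
TP / (TP + FP), recall (the TP rate) is TP / (TP + FN), and the F-measure is the harmonic mean of
the two. A minimal Python sketch under these definitions is given below. The function name and
the toy labels are illustrative assumptions and are not the study's data.

```python
def per_class_metrics(actual, predicted, label):
    """Return (precision, recall, f_measure) for one class label."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy labels (hypothetical). A perfect prediction gives 1.0 for all three.
actual = ["x", "x", "y", "z"]
print(per_class_metrics(actual, actual, "x"))  # (1.0, 1.0, 1.0)

# One "x" mislabelled as "y": recall for "x" drops to 0.5.
predicted = ["x", "y", "y", "z"]
print(per_class_metrics(actual, predicted, "x"))
```

In the second call the precision for "x" stays at 1.0 while the recall falls to 0.5, which gives
an F-measure of about 0.667.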
Prediction Model
After training and testing the model, data of unknown academic performance was fed into
the system for prediction. The prediction output of a given student is either “Pass” or “Fail”.
Out of the forty-nine (49) instances supplied to the system, forty-six (46) were correctly
predicted and only three (3) were incorrectly predicted. The percentage of correctly predicted
instances is 93.8776%, which is high enough for the model to be relied upon.
=== Summary ===
Correctly Classified Instances       46      93.8776 %
Incorrectly Classified Instances      3       6.1224 %
Kappa statistic                       0.9373
Mean absolute error                   0.0027
Root mean squared error               0.0365
Relative absolute error               6.2624 %
Root relative squared error          25.0284 %
Total Number of Instances            49
The model's 46 correctly classified instances, Kappa statistic of 0.9373, True Positive
rate of 1, False Positive rate of 0, Precision and Recall of 1, and F-measure ranging from 0.667
to 1 suggest that most of the students are likely to have the same G.P in the coming Semester.
The table below shows and interprets the prediction of the first nine (9) students as predicted by
the model:
Student Name                   Matric. Number   Current G.P   Predicted G.P   Prediction Level
ADENIRAN Ifeoluwa Adebola      15N02/001        2.85          2.85            1
Bandele Olaomopo Oluwanifemi   15N02/005        4.88          4.88            1
In the graph, the Current G.P is represented by the blue bar (left) while the Predicted G.P is
represented by the orange bar (right). The graph shows that the Predicted G.P and the Current
G.P are at the same level for all the students, signifying that each student is likely to have
the same G.P in the coming Semester. For example, Adeniran Ifeoluwa, who had a G.P of 2.85, is
likely to have the same G.P in the coming Semester because the prediction level is 1; the same
goes for the other students.
Summary
In this thesis, the J48 method of classification was used to build a model to predict students’
academic performance, on the basis of its accuracy measure and prediction level for small
datasets and its merits over other decision tree algorithms. The model was
evaluated using students’ records from the Computer Science Department, Ajayi Crowther
University, Oyo.
The simplicity of the J48 output and its ease of interpretation make it a
convenient tool for predicting students’ academic performance.
Conclusion
The J48 decision tree achieves a high rate of accuracy. It classifies the data into
correctly and incorrectly classified instances, covering seven (7) attributes across forty-nine
(49) instances, and the model successfully identifies the students who are likely to fail. These
students can be considered for proper counselling so as to improve their results in the coming
Semester.
The system will generally help students to benchmark their grades from their entry point to their
final year, thereby encouraging them to work harder. Finally, the developed
system could help to reduce the overall failure rate in academic institutions, as
students can be well guided and counselled.
Recommendation
From the results and findings of the experiments in this work, we recommend the adoption of
student performance prediction models, as Educational Data Mining is an emerging data mining
discipline. More similar studies on different data sets using machine learning approaches are
needed to confirm the above findings.
Future work could also include applying data mining techniques to an expanded data set
with more unique attributes to get more accurate results. Also, a comparative analysis of these
results could be carried out against results obtained from other decision tree algorithms such
as CHAID and CART (J48 itself is WEKA's implementation of C4.5).
REFERENCES
A. Dinesh Kumar & V. Radhika (2014), A Survey on Predicting Student Performance,
International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 5 (5),
Pp. 6147-6149.
Amjad Abu Saa (2016), Educational Data Mining & Students’ Performance
Prediction, International Journal of Advanced Computer Science and Applications (IJACSA),
Vol. 7, No. 5.
Jai Ruby & K. David (2014), A study model on the impact of various indicators in the
performance of students in Higher Education, International Journal of Research in
Engineering and Technology (IJRET), Vol. 3, Issue 5, May 2014, Pp. 750-755.
Osmanbegovic E. & Mirza S. (2012), Data mining approach for predicting student performance,
J Econ Bus, 10: 3-12.
Purva Sewaiwar & Kamal Kant Verma (2015), Comparative Study of Various Decision
Tree Classification Algorithm Using WEKA, International Journal of Emerging Research in
Management & Technology, ISSN: 2278-9359, Volume 4, Issue 10, October 2015.
Ramesh V., Parkavi P. & Ramar K. (2013), Predicting student performance: A statistical and data
mining approach, IJCA, 63: 35-39.
Sajadin et al. (2011), Prediction of student academic performance by an application of
data mining techniques, International Conference on Management and Artificial Intelligence,
IPEDR Vol. 6, IACSIT Press, Bali, Indonesia.