Daud 2017
Educators, parents and institutions would like to know the answer to the question “Is it possible to predict the performance of a student who is enrolled in an educational institution?” For example, will the student complete the degree or not? However, learning is currently regarded as an individual’s effort. Therefore, developing models for evaluating the learning efforts of a student is not an easy task [16]. Recently, data mining techniques have been used to provide new insights into this problem. Several diverse factors influence students’ performance, and these can be identified by using data mining approaches in the educational sector [17]. Data mining for an educational system is an iterative process of hypothesis development and testing. Fig. 1 shows the applications of data mining in educational systems. A student performance evaluation system can help in decision making for awarding scholarships or, in other words, targeting the right students.

Figure 1 The cycle of applying Data Mining in educational systems [17]

In the past, student performance has mostly been predicted by using different types of feature sets, such as academic record, family income and family assets [16,25,12]. Family income and expenditure feature sets play an important role in student performance prediction. Intuitively, family expenditure and personal information feature sets seem equally effective for this task.

This paper investigates several family expenditure and personal information related feature sets based on learning analytics for improved student performance prediction. Extensive experiments are conducted to evaluate the impact of existing, proposed and hybrid feature sets. The effectiveness of the proposed feature sets is demonstrated on real data of scholarship-holding students from different Pakistani universities by using both discriminative and generative classification models. Using the proposed features outperforms existing methods, and 86% accuracy is achieved in predicting whether the student will complete the degree or not. To the best of our knowledge, the proposed features have not been exploited before, and they have a significant impact on students’ performance in studies. Our method can also be used as a benchmark for similar studies in the future.

The main contributions of this paper are as follows:

1. Two new feature sets, family expenditure and student personal information, are investigated.
2. An effective feature set of twenty-three features is constructed by combining the proposed features with existing features. The feature subset selection process is adapted by using information gain and gain ratio metrics. The impact of these features is determined as per their efficiency, and the most influential features are shortlisted.
3. The performance of standard models (both discriminative and generative) is analyzed by comprehensive experiments with baseline and proposed features.

The overall finding is that Learning Analytics based on personalized features can improve the prediction of students’ performance. Generalizing these findings requires additional research on incorporating further features such as talent, skills and personal competencies from different online web sources.

The rest of the paper is arranged as follows: Section 2 describes the related work. The problem definition is presented in Section 3. Section 4 provides the applied models, performance measures, data collection and construction of the feature space. Section 5 presents experiments conducted by using five classifiers with baseline and proposed feature sets. Finally, Section 6 concludes this work.

2. RELATED WORK
The research problem of students’ performance prediction can be analyzed from diverse angles. In the current literature, a number of complementary approaches provide a baseline for such an analysis. In an ideal scenario, a rich dataset with student identity along with numerous characteristics could be the basis for advanced learning analytics. The problem is that in most cases, not all the data are available for the dynamic construction of the student identity, which is further limited by lack of access to various sources. In Fig. 2, we briefly present some of the most representative methods of Applied Educational Data Mining and Learning Analytics based on a comprehensive literature review.

Student performance prediction has received a lot of attention from educational data mining researchers. Typical data mining methods have been employed to deal with different tasks related to students. A survey of data mining techniques for traditional educational systems such as adaptive web-based and content management systems is presented in [16].

An association rule based mining method is applied for the selection of weak students in a school and is found effective [8]. A genetic algorithm is used to assign the weights for modeling students’ grades at three levels (binary, 3-level and 9-level) [9]. It shows that the combination of multiple classifiers leads to a significant improvement in classification. A model is proposed for predicting student performance using six machine learning techniques for distance learning education, which is quite different from the traditional educational system [6]. The experimental results show that demographic and performance features are better predictors of student performance. A regression model is applied to predict the subject test scores of school students [14]. It concludes that mixed-effect models perform best as compared to Bayesian networks. A prediction model (CHAID) is developed to predict the performance of higher secondary school students, which is critical before getting admission into universities [16].

The grades of graduate students are predicted using Naïve Bayesian and Rule Induction classifiers [25]. Clusters are made from students’ data and the outliers are successfully identified. A model is presented to estimate the abilities of students and the competence of teachers in order to predict future student outcomes [19]. It shows that demographic profiles and personality trait features are correlated and have a high impact on student performance. Similarly, student performance evaluation and engineering students’ abilities are analyzed for an improved recruitment process by using data mining methods [12].
[Figure 2: representative methods of applied Educational Data Mining and Learning Analytics. Recovered labels: Advanced Learning Analytics; Association Rules; Decision Trees; Genetic Algorithms; Support Vector Machine; Feature Selection; Prediction Models; Regression Models]

3. PROBLEM DEFINITION
The formal definition of the student performance prediction problem is described as:
Given n training samples $(X_1, z_1), (X_2, z_2), \ldots, (X_n, z_n)$, where $X_i$ is the feature vector for student $a_i$ and $A$ is the set of n students, $A = \{a_1, a_2, a_3, \ldots, a_n\}$. Here $X_i \in \mathbb{R}^m$, m is the total number of features, and $z_i$ is the student performance status (degree completed or dropped), where $z_i \in \{-1, +1\}$. To predict the performance of a student, the following prediction function is proposed:

$$z = F(A / X) \tag{1}$$

where

$$F(A / X) \begin{cases} \geq 0 & \text{if } z = +1 \text{ (completed)} \\ < 0 & \text{if } z = -1 \text{ (dropped)} \end{cases} \tag{2}$$
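The decision rule in Eqs. (1) and (2) can be illustrated with a minimal sketch: a real-valued score is computed from a feature vector, and its sign gives the predicted status. The linear scorer, the weights and the bias below are hypothetical placeholders chosen only for illustration; the paper itself relies on five standard classifiers rather than this function.

```python
from typing import Sequence


def linear_score(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Hypothetical stand-in for F: a weighted sum of the m features plus a bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_status(score: float) -> int:
    """Apply the sign rule of Eq. (2): +1 (completed) if score >= 0, else -1 (dropped)."""
    return 1 if score >= 0 else -1


# Illustrative values only: two made-up features and made-up weights.
x_i = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.1
z_i = predict_status(linear_score(x_i, w, b))
print(z_i)
```

Any classifier that exposes a signed decision value could replace `linear_score` here; the sign rule stays the same.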
(true values) and 86 instances that are dropped midway or at the end (false values). In the first step, 20 student instances are used (10 completed, 10 dropped), and then 40, 60, 80 and 100 instances of the dataset are used. However, beyond 100 records/tuples, the performance remains unchanged as the number of instances in the dataset increases. So, 100 student instances (50 completed, 50 dropped) are selected for the experimental setup. The distribution of the dataset is presented pictorially in Fig. 3.

Figure 3 Characteristics of dataset: Completed 50% (50), Dropped 50% (50).

4.4 Construction of Feature Space
The feature set is constructed by considering four categories of characteristics related to the student and his/her family. Initially, a pool of 33 features is constructed by combining some existing (baseline) and proposed features, and then a feature subset selection process is applied to reduce the number of redundant features. Information gain and gain ratio are used to select the best feature subset.
Overall, four categories of features are collected (some from the literature and some proposed in this research work). Out of the twenty-three selected features, thirteen are our proposed features. Table 1 presents the description of each feature, its category and status (proposed or existing/old). The feature subset selection process is adapted in the following manner: first, for a comprehensive comparison of features and selection of the best features, two measures, information gain and gain ratio, are used. A threshold of 0.01 is used to identify the best feature subset. Finally, we get 23 features, of which 13 are new (proposed) and 10 are old, as shown in Table 1. We found larger information gain values than gain ratio values for the selected features, because our dataset does not contain an equal number of samples for both classes and information gain is biased towards attributes with many values.

5. EXPERIMENTAL RESULTS
In this section, comprehensive experiments are presented using a dataset built from student information acquired from different universities of Pakistan, as described in Section 4.3. The dataset consists of 100 student records (tuples) and 23 features. Therefore, we get a 100 × 23 feature matrix. Default parameters are used for all classifiers using Weka 3.7. Five-fold cross validation is used to evaluate the accuracy of all the classifiers. Experiments are conducted in two ways: first, the influence of each individual feature on the student’s performance is analyzed; second, the results of the classifiers on baseline methods and proposed feature sets are evaluated.

5.1 Individual Feature Analysis
This section evaluates the impact of each feature on the prediction of student performance. The twenty-three features chosen by the feature subset selection process are used in the experiments. Five classifiers (BN, NB, SVM, C4.5 and CART) are used to analyze the influence of each feature on predicting the performance of students, as shown in Fig. 4.
We find that “natural gas” expenditure is the best predictor of student performance using the C4.5 classification method, as shown in Fig. 4. The BN and NB methods show the second and third highest F1-scores using the same features. Other family expenditure features also play important roles. The “Stock Value” feature has the lowest predictive performance, and all classifiers give the same F1-score (0.333). By analyzing the performance of the best and worst features, we conclude that the proposed family expenditure feature set improves classification accuracy.
“Self Employed” is found to be the second-best feature; it belongs to the proposed feature set of student personal information, and all classifiers show an F1-score of 0.77, which represents better performance by using a proposed feature. The third best feature is “Location”, which belongs to the baseline feature set. If we critically analyze the impact of the other proposed features in comparison with the old features, better accuracy is obtained by using our proposed feature space than the old feature space. Hence, it can be concluded that a student’s “natural gas” expenditure, “electricity” expenditure, “self-employed” and “location” characteristics are most influential for the prediction of his/her academic performance.

5.2 Comparisons
The performance of the classifiers is analyzed using four baseline methods and our proposed feature set based method, and the results are critically analyzed. Performance is evaluated by F1 score. The purpose of this experiment is to analyze the influence of methods based on proposed and existing features on the student performance prediction task.
The feature sets proposed by [16,12,25,23] are considered as baselines for comparison. The proposed method significantly outperforms the baseline methods, as shown in Fig. 5. SVM performs best for our proposed feature sets with an F1 score of 0.867, which is about 13 percentage points higher than the second-best method for the SVM model (0.733). The BN and NB classifiers overall perform better for most methods as compared to C4.5 and CART. For the C4.5 model, the performance of most methods is very low and unstable.

5.3 Discussion
This research work presents student academic performance prediction methods that use four different types of features, namely family expenditure, family income, student personal information and family assets. It also adapts the process of feature subset selection in order to identify the most effective determinants for student academic performance prediction. It is evident from the comparative analysis that our proposed features are important predictors and achieve an F1-score of 86% (Fig. 5) on real-life undergraduate students’ data.
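The F1 score used as the evaluation measure in Sections 5.1 and 5.2 can be sketched as follows. The source does not say how F1 is averaged over the two classes, so the support-weighted average below is an assumption, and the label vectors are toy data rather than the study's records. Under that assumption, a degenerate classifier that assigns every student to one class on a balanced 50/50 set scores about 0.333, which happens to equal the value reported for “Stock Value”; the source does not confirm that this is the cause.

```python
from collections import Counter


def class_f1(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives, so F1 is taken as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighted by class support (assumed averaging scheme)."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(class_f1(y_true, y_pred, c) * n / total for c, n in support.items())


# Toy labels mirroring the 50/50 split: +1 completed, -1 dropped.
y_true = [1] * 50 + [-1] * 50
# Degenerate classifier that predicts "completed" for every student.
y_all_completed = [1] * 100
print(round(weighted_f1(y_true, y_all_completed), 3))  # 0.333
```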
Table 1 Features Distribution.

| Category | Name | Description | Status | Info. Gain | Gain Ratio | Features Used |
|---|---|---|---|---|---|---|
| Family Expenditure | Electricity Bill | Average of electricity bills for last six months | New | 0.38 | 0.05 | √ |
| Family Expenditure | Natural Gas Bill | Average of gas bills for last six months | New | 0.26 | 0.06 | √ |
| Family Expenditure | Telephone Bill | Average of telephone bills for last six months | New | 0.10 | 0.04 | √ |
| Family Expenditure | Water Bill | Average of water bills for last six months | New | 0.10 | 0.06 | √ |
| Family Expenditure | Food Expenses | Average of food expenses for last six months | New | 0.09 | 0.03 | √ |
| Family Expenditure | Miscellaneous Expenditure | Average of miscellaneous expenditures for last six months | New | 0.11 | 0.02 | √ |
| Family Expenditure | Medical | Average of medical expenditures for last six months | New | 0.06 | 0.01 | √ |
| Family Expenditure | Family Expenditure on Education | Average of family expenditure on education for last six months | New | 0.35 | 0.04 | √ |
| Family Expenditure | Accommodation Expenses | Average of accommodation expenses for last six months | New | 0.27 | 0.25 | √ |
| Family Expenditure | Studying Family Members | Total number of studying family members of student | Old | 0.008 | 0.003 | |
| Family Expenditure | Dependent Family Member | Total number of dependent family members of student | Old | 0.02 | 0.007 | |
| Family Income | Father Income | Monthly income of father/guardian of student | Old | 0.29 | 0.04 | √ |
| Family Income | Mother Income | Monthly income of mother of student | Old | 0.02 | 0.03 | √ |
| Family Income | Land Income | Monthly income from land of student’s family | Old | 0.02 | 0.05 | √ |
| Family Income | Miscellaneous Income | Monthly miscellaneous income of student’s family | Old | 0.08 | 0.03 | √ |
| Family Income | Earning Hands | Total number of earning hands in student’s family | Old | 0.007 | 0.005 | |
| Family Income | Father Status | Status of father of student: alive or deceased | New | 0.0008 | 0.001 | |
| Family Income | Father Retired | Father retired or in service | New | 0.002 | 0.003 | |
| Family Income | Guardian Alive | Whether student’s guardian is alive | New | 0.003 | 0.004 | |
| Student Personal Information | Gender | Gender of the student (male or female) | Old | 0.004 | 0.005 | |
| Student Personal Information | Marital Status | Marital status of student (married or unmarried) | New | 0.003 | 0.01 | √ |
| Student Personal Information | House Ownership | Whether student has his/her own house | New | 0.08 | 0.10 | √ |
| Student Personal Information | Previous Program Scholarship | Scholarship received or not in previous academic program | New | 0.0002 | 0.0003 | |
| Student Personal Information | Previous Institution Type | Type of student’s previous institution | Old | 0.001 | 0.002 | |
| Student Personal Information | Self Employed | Whether student is self-employed | New | 0.06 | 0.04 | √ |
| Family Assets | Land Value | Current value of land belonging to student’s family | Old | 0.04 | 0.02 | √ |
| Family Assets | Bank Balance | Bank balance of student’s family | Old | 0.05 | 0.07 | √ |
| Family Assets | Stock Value | Value of shares/bonds belonging to student’s family | Old | 0.01 | 0.08 | √ |
| Family Assets | House Value | Value of house belonging to student’s family | Old | 0.14 | 0.03 | √ |
| Family Assets | House Condition | Structure of house belonging to student’s family | New | 0.06 | 0.04 | √ |
| Family Assets | Miscellaneous Asset Value | Any other assets related to student | Old | 0.04 | 0.02 | √ |
| Family Assets | Location | Type of location where student resides: urban or rural | Old | 0.03 | 0.04 | √ |
| Family Assets | No. of Vehicles at Home | Number of vehicles belonging to student’s family | Old | 0.005 | 0.008 | |
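The two measures reported in Table 1 can be sketched for categorical attributes as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the attribute values and labels are toy data, numeric features such as bills would first need to be discretized, and the function names are hypothetical. The second example shows why information gain favors attributes with many distinct values while gain ratio penalizes them.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def information_gain(values, labels):
    """Reduction in label entropy after splitting on a categorical attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalized by the entropy of the attribute itself."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split the data
    return information_gain(values, labels) / split_info


# Toy data: +1 completed, -1 dropped (not taken from the study).
labels = [1, 1, -1, -1]
location = ["urban", "urban", "rural", "rural"]  # two values, separates classes
student_id = ["s1", "s2", "s3", "s4"]            # unique per record

print(information_gain(location, labels), gain_ratio(location, labels))      # 1.0 1.0
print(information_gain(student_id, labels), gain_ratio(student_id, labels))  # 1.0 0.5
```

Both toy attributes reach the same information gain, but the identifier-like attribute gets half the gain ratio because its own entropy is twice as large.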
[Figure 4: plot not recovered; y-axis: F1 Score]
| Method | Bayes Network | Naïve Bayesian | SVM | C4.5 (J48) | CART |
|---|---|---|---|---|---|
| Tair and El-Halees (2012) | 0.653 | 0.653 | 0.683 | 0.683 | 0.683 |
| Osmanbegovic and Suljic (2012) | 0.713 | 0.731 | 0.695 | 0.333 | 0.657 |
| Sree et al. (2013) | 0.748 | 0.782 | 0.699 | 0.444 | 0.577 |
| Ramaswami and Bhaskaran (2010) | 0.782 | 0.799 | 0.733 | 0.656 | 0.577 |
| Hybrid Features (Proposed) | 0.848 | 0.848 | 0.867 | 0.766 | 0.71 |

Figure 5 Comparison with baseline and proposed features (F1 score per classifier).

The features related to family expenditure, such as natural gas, electricity, telephone, water, accommodation, miscellaneous expenditures and, most importantly, family expenditure on education, are found to be most effective in predicting academic performance. Most of these features are ignored by the baseline methods and prior studies [12,25]. The best predictive performance is obtained when family expenditure based features are combined with other features (hybrid features). It has been observed that family expenditures affect students’ performance and reduce their concentration and interest in studies. The claims made on the basis of experimental outcomes are verified by 25 students studying on scholarships. Most of these students agree with the following discussion.

An increase in family expenditure reduces the opportunities for a student to grow and excel in their studies, because time and money are important factors in life and are directly related to family expenditure. An increase in expenditure, especially on medical treatment and accommodation, strongly affects the performance of students. More medical expenditure relates to health issues, and more expenditure on accommodation may affect the education budget of a middle-class family.

On the other hand, some personal characteristics of students are also important predictors of their performance; e.g., married students concentrate better on their studies than unmarried students, perhaps because of emotional stability in their personal lives. The same is the case with students who themselves, or whose parents, own property. Families that own their house save money by not paying rent and can spend these savings on the education of their children. They also do not need to keep changing rented houses, which may waste a student’s time and energy. Similarly, the self-employment status of a student enables him/her to schedule time for studies efficiently, because fewer worries about finances result in comfort and satisfaction. In addition, self-employment develops a hard-working attitude in students. Both these factors are very helpful for students in achieving better performance in their studies. Last but not least, the house condition of a student is also an influential factor, because having comfortable living accommodation enables him/her to better utilize his/her abilities in studies. On the other hand, if the house condition is not good, the student’s time and energy may be wasted on repairs or on helping his/her parents to get the repairs done. Previous studies [16,23] have also explored the gender and institution type characteristics, which we do not consider.

6. CONCLUSIONS
In this research, an effort is made to find the impact of our proposed features on student performance prediction with the help of generative and discriminative classification models. A feature space is constructed by considering characteristics of family expenditure, family income, personal information and family assets of students. Selection of the potential/dominant features is unavoidable, as it provides us with a subset of features. The SVM classifier is found effective for our proposed features of the family expenditure and student personal information categories. It can be concluded from the results that family expenditure and personal information features have a significant impact on the performance of the student, for the intuitive reasons provided in the discussion.
In future, WWW research on Digital Learning and Learning Analytics should focus on the following directions:
Which kinds of methods and flexible applications permit the construction of critical learners’ data from the WWW, e.g., can mining of social media content be a basis for personal expenditure?
What are the possibilities for dynamic profiling of personal characteristics of students from the Deep Web?
What are the standards for codifying critical student information in the WWW, and how can this help envision future WWW-based learning services?
Learning analytics of mobile and ubiquitous learning environments from the perspective of human-computer interaction [1,2] also requires detailed exploration, in addition to the aforementioned traditional and web-based features.

REFERENCES
[1] N. R. Aljohani and H. C. Davis, “Learning analytics in mobile and ubiquitous learning environments,” in 11th World Conference on Mobile and Contextual Learning, 2012.
[2] N. R. Aljohani, H. C. Davis, and S. W. Loke, “A comparison between mobile and ubiquitous learning from the perspective of human-computer interaction,” International Journal of Mobile Learning and Organization, vol. 6, no. 3/4, pp. 218-231, 2012.
[3] R. Asif, A. Merceron, and M. K. Pathan, “Investigating performance of students: a longitudinal study,” in Fifth International Conference on Learning Analytics And Knowledge (LAK '15), New York, USA, 2015, pp. 108-112.
[4] M. A. Chatti, A. L. Dyckhoff, U. Schroeder, and H. Thüs, “A reference model for learning analytics,” International Journal of Technology Enhanced Learning (IJTEL), vol. 4, no. 5/6, pp. 318-331, 2012.
[5] N. Fournier, R. Kop, and H. Sitlia, “The value of learning analytics to networked learning on a personal learning environment,” in 1st International Conference on Learning Analytics and Knowledge, 2011, pp. 104-109.
[6] S. Kotsiantis, C. Pierrakeas, and P. Pintelas, “Predicting
students' performance in distance learning using machine learning techniques,” Applied Artificial Intelligence, vol. 18, no. 5, pp. 411-426, 2004.
[7] E. Lotsari, V. Verykios, C. Panagiotakopoulos, and D. Kalles, “A Learning Analytics Methodology for Student Profiling,” in Artificial Intelligence: Methods and Applications, 2014, pp. 300-312.
[8] Y. Ma, B. Liu, C. K. Wong, P. S. Yu, and S. M. Lee, “Targeting the right students using data mining,” in 6th ACM SIGKDD International Conference on Knowledge Discovery and Data mining (KDD '00), New York, USA, 2000, pp. 457-464.
[9] B. Minaei-Bidgoli, D. A. Kashy, G. Kortemeyer, and W. F. Punch, “Predicting student performance: an application of data mining methods with an educational Web-based system,” in 33rd Annual Frontiers in Education (FIE 2003), Westminster, CO, 2003.
[10] T. Mishra, D. Kumar, and Sangeeta Gupta, “Students' Employability Prediction Model through Data Mining,” International Journal of Applied Engineering Research, vol. 11, no. 4, pp. 2275-2282, 2016.
[12] E. Osmanbegović and M. Suljić, “Data mining approach for predicting student performance,” Economic Review, vol. 10, no. 1, pp. 3-12, 2012.
[13] O. K. Oyedotun, S. N. Tackie, and Ebenezer O. Olaniyi, “Data Mining of Students' Performance: Turkish Students as a Case Study,” International Journal of Intelligent Systems and Applications, vol. 7, no. 9, pp. 20-27, 2015.
[14] Z. A. Pardos, N. T. Heffernan, B. Anderson, C. L. Heffernan, and W. P. Schools, “Using fine-grained skill models to fit student performance with Bayesian networks,” in Handbook of educational data mining, 2010, pp. 417-426.
[15] P. J. Piety, D. T. Hickey, and M. J. Bishop, “Educational data sciences: Framing emergent practices for analytics of learning, organizations, and systems,” in 4th International Conference on Learning Analytics and Knowledge, 2014, p. 193.
[16] M. Ramaswami and R. Bhaskaran, “A CHAID based performance prediction model in educational data mining,” International Journal of Computer Science, vol. 7, no. 1, pp. 10-18, 2010.
[17] C. Romero and S. Ventura, “Educational data mining: A survey from 1995 to 2005,” Expert systems with applications, vol. 33, no. 1, pp. 135-146, 2007.
[18] J. L. Santos, K. Verbert, S. Govaerts, and E. Duval, “Addressing learner issues with StepUp!: An evaluation,” in International Conference on Learning Analytics and Knowledge, 2013, pp. 14-22.
[19] B. Shalem, Y. Bachrach, J. Guiver, and C. M. Bishop, “Students, teachers, exams and MOOCs: Predicting and optimizing attainment in web-based education using a probabilistic graphical model,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2014, pp. 82-97.
[20] A. Sharabiani, F. Karim, A. Sharabiani, M. Atanasov, and H. Darabi, “An enhanced bayesian network model for prediction of students' academic performance in engineering programs,” in IEEE Global Engineering Education Conference (EDUCON), 2014, pp. 832-837.
[21] G. Siemens and P. Long, “Penetrating the fog: Analytics in learning and education,” EDUCAUSE Review, vol. 46, no. 5, 2011.
[22] S. Slater, S. Joksimović, V. Kovanovic, R. S. Baker, and D. Gasevic, “Tools for Educational Data Mining: A Review,” Journal of Educational and Behavioral Statistics, 2016.
[23] G. S. Sree and C. Rupa, “Data Mining: Performance Improvement In Education Sector Using Classification And Clustering Algorithm,” International Journal of Innovative Research and Development, vol. 2, no. 7, pp. 101-106, 2013.
[24] P. Strecht, L. Cruz, C. Soares, J. Mendes-Moreira, and R. Abreu, “A Comparative Study of Classification and Regression Algorithms for Modelling Students' Academic Performance,” in International Educational Data Mining Society, 2015, pp. 392-395.
[25] M. M. A. Tair and A. M. El-Halees, “Mining educational data to improve students' performance: a case study,” International Journal of Information, vol. 2, no. 2, pp. 140-146, 2012.
[26] S. K. Yadav, B. Bharadwaj, and S. Pal, “Data mining applications: A comparative study for predicting student's performance,” International Journal of Innovative Technology and Creative Engineering, vol. 1, no. 12, pp. 13-19, 2011.
[27] B. J. Zimmerman and A. Kitsantas, “Comparing student’s self-discipline and self-regulation measures and their prediction of academic achievement,” Contemporary Educational Psychology, vol. 39, no. 2, pp. 145-155, 2014.