BD Project
COLLEGE OF INFORMATICS
2. Meliha Hawi
3. Zintalem Getinet
ACKNOWLEDGEMENT
First of all, we would like to express our sincere thanks to "ALLAH" for giving us good health and supporting us throughout the whole process, from drafting our research proposal to the completion of the report. With great pleasure, we express our deep sense of gratitude to our instructor Dr. Kula Kekeba (Ph.D.) for his valuable guidance, constant encouragement, and patience throughout our research work, from initiating us to prepare a research proposal in Big Data Analytics to the end of the research report. We would also like to extend our sincere thanks to all other staff members of the College of Informatics, especially those of the Information Technology Department, who have contributed in their own way to making this research work successful. Finally, we would like to thank all our classmates for the fruitful discussions we had together.
ABSTRACT
In the past two decades, the biotechnology and medical fields have witnessed advances in emerging technologies that open new possibilities for identifying and quantifying the bio-analysis of disease. The basic idea behind biochip technology is to give human expertise to a machine so that it can perform daily activities on behalf of the human expert. Just as computer chips can perform millions of mathematical operations in one second, biochips can perform thousands of biological reactions, such as gene decoding, in a few seconds. Biochip technology helps identify over 80,000 genes in human DNA, an ongoing worldwide research collaboration known as the Human Genome Project. The biochip platform being developed incorporates electronics for addressing, readout, sensing, and temperature control, together with a handheld analyzer capable of multiparameter identification. The biochip platform can be plugged into a standard peripheral bus of the analyzer device or communicate through a wireless channel. Biochips enable us to realize revolutionary new bioanalysis systems that can directly manipulate and analyze the micro/nano-scale world of biomolecules, organelles, and cells.
TABLE OF CONTENTS
CONTENTS…………………………………………………………………………………………......PAGE
ACKNOWLEDGEMENT..................................................................................................................i
ABSTRACT......................................................................................................................................ii
ACRONYM.....................................................................................................................................iii
TABLE OF CONTENTS.................................................................................................................iv
LIST OF FIGURES...........................................................................................................................v
CHAPTER-ONE...............................................................................................................................1
1.1. INTRODUCTION..........................................................................................................................1
1.4. OBJECTIVES..................................................................................................................................3
1.7. SCOPE AND LIMITATION...........................................................................................................8
1.7.1. SCOPE...................................................................................................................................8
1.7.2. LIMITATION.......................................................................................................................8
1.8. TIME FRAME.................................................................................................................................9
BIBLIOGRAPHY...........................................................................................................................10
LIST OF FIGURES
Figure 1: Information flow during the literature review process.............................................................4
CHAPTER-ONE
1.1. INTRODUCTION
In recent years, universities worldwide have focused on institutional transformation to improve students' learning experience and the quality of education. Quality education is one of the Sustainable Development Goals (SDGs) adopted by the Ethiopian government. Bule Hora University, located in the southern part of Ethiopia, is one of the universities working toward enhancing quality education by improving the services provided to students. The university lies about 470 km south of Addis Ababa, in Bule Hora town, West Guji Zone, Oromia Regional State.
The recent era of machine learning and big data technology allows higher institutions to use a wide variety and volume of data to analyze students' performance and enhance their learning experience. Machine Learning (ML) and Big Data Analytics (BDA) are gaining growing momentum in providing insights that could inform the decision-making process, support institutions, and improve student performance [1].
Machine learning provides a wide variety of powerful techniques for analyzing big data to predict students' academic performance based on their educational background. Potential data mining applications in education range from predictive modelling of student retention or attrition and students' learning behaviour to students' learning experience [2], [3]. Such analysis of the big data inside educational institutions helps them better understand students' specific learning needs, which is significant in tackling the problems universities face in predicting student retention and progression.
One of the significant challenges of higher educational institutions is that most graduating students fail to complete their thesis work as planned, which extends their graduation year and hinders them from advancing their academic rank. Hence, applying big data analytics combined with machine learning techniques has a vital role in improving student performance [4].
Big data is a large and complex amount of data that is difficult to analyze using traditional analysis methods. This data can be in structured, semi-structured, and unstructured forms. The massive data flow has led to better analytical methods, as traditional methods have become inefficient for processing big data. Due to the vast amounts of data in educational databases, predicting students' performance is difficult [5], [6]. Moreover, the lack of an established framework for evaluating and tracking students' success is not currently being addressed. There are two primary reasons for this. First, research on existing prediction methods is still insufficient to determine the most appropriate methods for predicting student performance in institutions. Second is the absence of inquiry into specific courses.
Thus, this research aims to investigate the most efficient machine learning technique for predicting the final thesis results of postgraduate students at Bule Hora University, specifically those of the College of Informatics, which in turn will help the college assess their students' level of performance and decide how to help their students improve.
1.2. PROBLEM STATEMENT
One of the most significant challenges in higher educational institutions is the massive amount of data stored in the university database, mainly structured data that could be straightforward to analyze using different data mining and machine learning techniques. However, this data has not been utilized as it should be to develop productive and successful predictive models. These massive amounts of data can improve the quality of education when data mining, frequent pattern generation, and machine learning techniques are implemented to exploit them. Another challenge of higher educational institutions is that most graduating students fail to complete their thesis work as planned, which extends their graduation year and hinders them from advancing their academic rank. To the best of our knowledge, no predictive models for predicting student performance in their thesis work have been studied before. This motivates us to investigate different data mining or big data analytics and machine learning techniques to develop better predictive models that help the college. Apart from the above-stated problems, this research work tries to address the following problems:
o Difficulties in early prediction of the status of the students.
o Difficulties associated with examining an extensive, complex collection of data and information contained in an organization's database.
1.3. RESEARCH QUESTIONS
As part of this research work, to attain its overall objectives, the following research questions need to be addressed:
o RQ#1: What is the state of the art in systems developed to improve student performance?
o RQ#2: How do different machine learning and data mining techniques support improving student performance?
o RQ#3: What could be the possible causes of students' comparatively low performance in MSc thesis work?
1.4. OBJECTIVES
1.4.1. GENERAL OBJECTIVE
The general objective of this study is to investigate different machine learning techniques for predicting student performance in thesis work. This general objective gives rise to the following specific objectives, which assist us in attaining our goal.
1.5. METHODOLOGY
The methodology could be a systematic approach to execute and manage the analysis study
expeditiously. It is excellent to utilize the most straightforward methodology to conform to the
|Page
standards of any research investigation[7]. In this research work, the SWOT analysis is used as This
section provides the detailed methodology used in this research work. For this research work, the
methodology includes:
Figure 1: Information flow during the literature review process
1.5.2. DESIGN METHODOLOGY
A well-executed research project should be iterative, allowing requirement and implementation modifications during the project's lifecycle while still adhering to the established timeframes and quality standards. Following this technique, project failures can be significantly decreased.
The Spiral Paradigm is a risk-based model defined by iterative procedures that aid in risk mitigation. As illustrated below, the Spiral Model is divided into four distinct software development life cycle (SDLC) phases, and the entire development process is iterative; each repetition is denoted by the term spiral. As illustrated in Figure 2, the four major phases are as follows:
Step-1: Determine Objectives: The objectives of this project will be defined through an appropriate
feasibility assessment and milestones.
Step-2: Risk identification and mitigation: Risks will be recognized, prioritized, and mitigated based
on their severity.
Step-3: Development and Testing: During this stage, a high-quality functioning prototype will be
developed in compliance with the criteria in priority order. Later spirals will produce a working version
of the product.
Step-4: Planning Phase: This stage evaluates the project's output before moving on to the next Spiral.
The following figure shows the different phases of the spiral model.
The define stage will define the objectives and goals and pick the features and model after pre-processing the data. The model will be trained and tested during the design stage using appropriate accuracy metrics. Finally, the model will be deployed and iterated against new data as part of the implementation stage.
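The define/design/implementation loop described above can be sketched in a few lines of Python. This is only a minimal illustration with a hand-rolled majority-class baseline and an accuracy metric; the records and the "pass"/"fail" labels are hypothetical, not taken from the study's data:

```python
from collections import Counter

# Hypothetical labelled records: ((cgpa, advisor_meetings), thesis_outcome)
data = [
    ((3.2, 1), "pass"), ((2.1, 0), "fail"), ((3.8, 1), "pass"),
    ((2.5, 0), "fail"), ((3.5, 1), "pass"), ((2.0, 1), "fail"),
    ((3.9, 1), "pass"), ((2.8, 0), "pass"),
]

# Define stage: split into training and test sets (75% / 25%).
split = int(0.75 * len(data))
train, test = data[:split], data[split:]

# Design stage: "train" a majority-class baseline model.
majority = Counter(label for _, label in train).most_common(1)[0][0]

# Evaluate with a simple accuracy metric on the held-out test set.
correct = sum(1 for _, label in test if label == majority)
accuracy = correct / len(test)
print(f"baseline predicts '{majority}', accuracy = {accuracy:.2f}")
```

Any real model (decision tree, neural network, etc.) would slot into the "design stage" step; the surrounding split/train/evaluate structure stays the same.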
Figure 4: The KDD process
The different phases of the process, as shown in Figure 4, are as follows:
For this research work, we will use open-source datasets available from the Kaggle and GitHub websites, because dataset preparation is a time-consuming task that could not be completed within this limited time frame.
WEKA is used in many diverse applications, particularly for educational and scientific purposes. WEKA was chosen as our experimental environment because it supports various standard data mining tasks, including pre-processing, clustering, classification, regression, visualization, and feature selection. Weka input should be formatted according to the Attribute-Relation File Format (ARFF), bearing the .arff extension.
1.6. SIGNIFICANCE
Using a prediction system in identifying the students' status in the education system/ sector has various
importance as it is an input for the decision-making process. Using machine learning and data mining
techniques makes it easy to determine the student status and provide whatever help is needed to
improve learning quality. Thus this research work has irreplaceable advantages for the education
system. Apart from the above-stated significance, the proposed system is intended to bring the
following advantages:
It will be able to provide prompt service on a timely basis.
It will lay the groundwork for further investigation.
1.7. SCOPE AND LIMITATION
1.7.1. SCOPE
The scope of a study describes the extent to which the research area will be examined and the parameters used in the study. This essentially means establishing the scope of the study and its objectives. Due to the scarcity of time, the scope of this study is limited to examining data about the postgraduate students of the College of Informatics.
1.7.2. LIMITATION
The study's limitations are those aspects of its design or methodology that impacted or influenced the interpretation of its findings. They are the limits on the generalizability, application to practice, and value of findings that arise due to the study's design, the methods used to demonstrate internal and external validity, or the unanticipated difficulties encountered throughout the investigation [9].
The following are among the limitations that might hinder us from achieving the above-stated aims and objectives:
o Lack of enough datasets from the college, because the college launched postgraduate classes only two years ago, which reduces the learning algorithm's accuracy.
o The time constraint, which prevents us from conducting an in-depth investigation.
1.8. TIME FRAME
In project management, scheduling is the process of recording the activities and milestones within a project. A timetable often includes a specified start and end date, duration, and resource allocation for each activity. Effective project scheduling is key to time management success.
The following figure shows the time frame or the schedules for accomplishing the task throughout the
research activities.
CHAPTER-THREE
Methodology
3.1. Study area
The research is carried out based on primary data extracted from the database of Bule Hora University, which is available to researchers. Secondary data, such as reviews of important documents, are used to gain further information related to student achievement. Accordingly, this study uses both quantitative and qualitative research designs; interviews will be used to understand the domain knowledge and to interpret the findings.
3.2. Study design
In order to achieve the above stated objectives, the researcher has used the CRoss-Industry
Standard Process for Data Mining (CRISP-DM) model, which contains six phases.
3.2.1. CRISP–DM
This section deals with an overview of the data source, data cleaning, and data transformation of the data employed in this study. In general, the researcher has followed the steps of the data mining process mentioned in chapter one. The methodology adopted in this study is CRISP-DM.
CRISP-DM (CRoss-Industry Standard Process for Data Mining) is a nonproprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit. Developed by industry leaders with input from more than 200 data mining users and data mining tool and service providers, CRISP-DM is an industry-, tool-, and application-neutral model. It encourages best practices and offers organizations the structure needed to realize better, faster results from data mining. CRISP-DM demands that data mining be seen as an entire process, from communication of the business problem through data collection and management, data preprocessing, model building, model evaluation, and finally, model deployment (Chapman et al., 2000). Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use (Chapman et al., 2000).
According to Chapman et al. (2000), a given data mining project has a life cycle consisting of six phases. Note that the phase sequence is adaptive; that is, the next phase in the sequence often depends on the outcomes associated with the preceding phase. Depending on the behaviour and characteristics of the model, we may have to return to the data preparation phase for further refinement before moving forward to the model evaluation phase (Larose, 2005). CRISP-DM is complete and well documented. All the stages are properly organized, structured, and defined, allowing a project to be easily understood or revised (Santos & Azevedo, 2005).
Figure 3.1 Phases of the CRISP-DM reference model. Source: Larose (2005)
3.3. Business understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. In this study, in order to understand the business and the application, we worked closely with domain experts in NEAEA and reviewed related literature on the research topic. The data for this research were taken from the 2005 grade 10 and 2007 Informatics college students' National Examination result databases. These provide critical information for the monitoring and evaluation of the country's Plan for Accelerated and Sustained Development to End Poverty (PASDEP). The education sector's general education quality improvement policies and programs assist in monitoring progress towards meeting the Millennium Development Goals (MDGs). NEAEA conducted three consecutive learning assessments, in 2000, 2003/2004, and 2008, to measure the learning achievements of Grade 8 students and identify the factors that determine those achievements. It also aimed at providing comparative information on school improvement. However, the target population in this study was Informatics students in the country.
3.4. Data understanding
The data understanding phase starts with an initial data collection and proceeds with activities to become familiar with the data, identify data quality problems, discover first insights into the data, or detect interesting subsets. The primary data sources for this study are mainly the 2005 and 2007 NEAEA data sets and the data set of the Center for Information and Communication Technology and the Educational Management Information System Directorate under the Ministry of Education, in Excel format, which contains detailed information on student name, student address, academic results of grades 10 and 12, school type, age, gender, teachers' qualification, availability of media technology (plasma) at school, etc. The data are dispersed across different offices in unorganized form. Hence, to become familiar with the data, identify data quality problems, gain insights into the data, and detect interesting subsets of the data, different literature has been surveyed. Besides, careful analysis of the data and its structure has been done together with the domain experts.
3.5. Data preparation
Data preparation is one of the most important phases of the data analysis activity; it involves the construction of the final data set (the data that will be fed into the analysis tool) from the initial raw data. Data preparation generates a dataset smaller than the original one, which can significantly improve the efficiency of data mining. This task includes attribute selection, filling in missing values, correcting errors, removing outliers (unusual or exceptional values), and resolving data conflicts using domain knowledge or expert decisions to settle inconsistencies. Since no missing values occurred in the data set, no missing value replacement technique was applied in this research. The collected Excel files were prepared in Weka-understandable formats. Preprocessing activities were then performed, and the file was saved into the Weka-acceptable comma-separated values (CSV) format. Weka's native data format is known as ARFF (Attribute-Relation File Format); it is basically a CSV format with some extra headers specifying the type of each attribute (numerical, binary, nominal). The CSV file was converted into ARFF using the Weka mining software, to take advantage of easier data manipulation and compatible interaction with Weka. During a scan of the preprocessed data, some basic summary statistics were produced for each attribute in a form acceptable to the selected data mining software, Weka.
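The CSV-to-ARFF conversion described above is done inside Weka in this study, but the mapping between the two formats can also be scripted. The following is a rough, stdlib-only Python sketch; the column names, nominal value sets, and sample rows are hypothetical, invented for illustration:

```python
import csv
import io

def csv_to_arff(csv_text, relation, nominal):
    """Convert CSV text to ARFF. `nominal` maps a column name to its allowed
    values; every other column is treated as numeric."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for col in header:
        if col in nominal:
            # Nominal attribute: declare the explicit value set.
            lines.append(f"@attribute {col} {{{','.join(nominal[col])}}}")
        else:
            lines.append(f"@attribute {col} numeric")
    lines += ["", "@data"] + [",".join(r) for r in data]
    return "\n".join(lines)

arff = csv_to_arff("cgpa,dept,result\n3.2,IT,pass\n2.1,SE,fail",
                   "student-performance",
                   {"dept": ["IT", "SE"], "result": ["pass", "fail"]})
print(arff)
```

In practice Weka's own converter handles quoting, missing-value markers (`?`), and type inference; this sketch only shows the structural relationship between the two formats.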
3.6. Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are adjusted to optimal values. Although the choice of data mining techniques for association tasks seems to be strongly dependent on the application, one of the data mining techniques frequently employed for analysis tasks is association rules. As indicated previously, the purpose of this research is to develop an association rule model. The association rule data mining technique is applied to predict the correlation between Information Technology and Software Engineering students' academic results.
Once the data preprocessing task was completed, the researcher applied the Apriori algorithm, which is the default algorithm selected in Weka. To change the parameters for a run (e.g., support, confidence, etc.), WEKA allows the resulting rules to be sorted according to different metrics such as confidence, leverage, and lift. In this research, the researcher selected lift as the criterion and entered a certain value as its minimum. Lift (or improvement) is computed as the confidence of the rule divided by the support of the right-hand side (RHS). In simplified form, given a rule L => R, lift is the ratio of the probability that L and R occur together to the product of the two individual probabilities of L and R, i.e.

lift(L => R) = P(L and R) / (P(L) * P(R))
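The lift measure described above can be checked numerically. Below is a small stdlib-only Python sketch over a handful of hypothetical course-result transactions (the item names are invented; a lift above 1 indicates that L and R co-occur more often than independence would predict):

```python
def lift(transactions, lhs, rhs):
    """lift(L => R) = P(L and R) / (P(L) * P(R)), estimated from frequencies."""
    n = len(transactions)
    p_l = sum(1 for t in transactions if lhs <= t) / n
    p_r = sum(1 for t in transactions if rhs <= t) / n
    p_both = sum(1 for t in transactions if (lhs | rhs) <= t) / n
    return p_both / (p_l * p_r)

# Hypothetical transactions: each student's set of course-result items.
ts = [{"IT=good", "SE=good"}, {"IT=good", "SE=good"},
      {"IT=good", "SE=good"}, {"IT=low"}]

# P(L)=3/4, P(R)=3/4, P(L and R)=3/4  =>  lift = 4/3
print(round(lift(ts, {"IT=good"}, {"SE=good"}), 3))  # 1.333
```

Weka reports the same quantity per rule when lift is chosen as the ranking metric; this sketch only shows how the number is derived from the transaction counts.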
3.7. Evaluation
This phase involves interpreting the discovered patterns and possibly returning to any of the previous steps, as well as visualizing the extracted patterns, removing irrelevant patterns, and translating the useful ones into terms understandable by users. The evaluation process is also carried out to identify interesting patterns representing knowledge, based on lift as an interestingness measure. Both objective and subjective measures can be applied during association rule analysis. The subjective method requires additional expert knowledge or input, which was not fully available during this study, so the researcher used lift as an objective measure. Therefore, improvement and further analysis can be made in this area.
3.8. Deployment
The creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use. The sequence of the phases is not rigid; moving back and forth between different phases is usually required and possible (Chapman et al., 2000). As has been mentioned, the scope of this study is to build a model for students' academic performance. A successful model can support educational planners and decision makers in analyzing the education system and in decision making. As data mining is considered a powerful tool to analyze data and has the capability to merge with any system, the proposed data mining model can be integrated with the existing DSS system. The practical implementation of the model can be helpful for decision makers to modify and adapt according to the organization.
hopes of identifying interesting patterns in the data. These interesting patterns are then analyzed, yielding knowledge (Bowen, 2006).
Discovering knowledge in data presents data mining as a well-structured standard process, intimately connected with managers, decision makers, and those involved in deploying the results (Larose, 2005). Therefore, both novices and data mining specialists need assistance in knowledge discovery processes (Deshpande and Thakare, 2010). Many good data mining software products are in use, ranging from well-established tools such as Enterprise Miner by SAS, Intelligent Miner by IBM, CLEMENTINE by SPSS, and PolyAnalyst by Megaputer, to many others in a growing and dynamic industry (Witten and Frank, 2005). WEKA (from the University of Waikato in New Zealand) is an open-source tool with many useful machine learning methods (David and Delen, 2008).
Data mining tools need to be versatile, scalable, capable of accurately predicting responses between actions and results, and capable of automatic implementation (Chakrabarti et al., 2009). Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research (Han and Kamber, 2006).
Data mining tools predict future trends and behaviors and help organizations make practical, knowledge-driven decisions (Larose, 2005). The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer questions that traditionally were too time-consuming to resolve. They prepare databases for finding hidden patterns, uncovering predictive information that experts may miss because it lies outside their expectations (Deshpande and Thakare, 2010). In this research, the Weka 3.6.4 software is used as the mining tool. In addition, Microsoft Excel has been used for data cleaning and for converting the original file to CSV format, and Microsoft Word for documentation purposes.
Several factors contribute to the usefulness of data mining tools for the intended data mining tasks. The tool selected should be able to provide the required data mining functions. The data mining functionality that the researcher intends to carry out in this research is describing the relation between the factor variables and students' academic results. In addition, the methodologies used by the data mining software to perform each of the data mining functions are an important factor to consider. The researcher chose association rules in order to analyze the correlation between the subjects, students' performance related to other subgroups, and the relationship between the Information Technology and Software Engineering academic performance of the students. In addition, the selected tool can comfortably operate on the Windows operating system in a standalone environment; hence, the Windows 7 operating system on a standalone machine has been utilized. Another important consideration in tool selection is visualization capability. The variety, quality, and flexibility of visualization tools may strongly influence the usability, interpretability, and attractiveness of data mining systems; Weka has a facility to visualize its output in this regard (Witten and Frank, 2005). Other reasons behind the selection of Weka for this study are the researcher's familiarity with the tool, its comprehensiveness for this study's requirements, and the ease of availability of the tool. Weka provides a number of data mining functionalities, such as classification, clustering, association, attribute selection, and visualization. Weka was developed at the University of Waikato in New Zealand; 'Weka' stands for the Waikato Environment for Knowledge Analysis. The system is written in Java, an object-oriented programming language, and has been tested under the Linux, Windows, and Macintosh operating systems. Weka includes a variety of tools for preprocessing a data set, such as attribute selection, attribute filtering, and attribute transformation (Witten and Frank, 2005).
As mentioned in the methodology section of chapter one, the objective of this research is to find associations among variables. Hence, it is important to extract the association rule implementations for model building and for the experiments to be carried out in the data mining process, which also involves data mining tool selection and the algorithms used for modeling. To demonstrate real practicality in any data mining process, selecting the appropriate mining tool is important in order to understand clearly the techniques and algorithms to be implemented and to describe them specifically based on the tool used for the research work.
So, in this study the researcher chose to use the Weka 3.6.4 software. This tool was selected because it is a toolkit that has gained widespread adoption and survived for an extended period of time, it is freely available for download (i.e., it is open-source software), and it offers many powerful features (sometimes not found in commercial data mining software); it has become one of the most widely used data mining systems.
Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be classified into two categories, descriptive and
predictive. Descriptive mining tasks characterize the general properties of the data in the
database. Predictive mining tasks perform inference on the current data in order to make
predictions (Han and Kamber, 2006).
|Page
A descriptive model identifies patterns or relationships in the data and explores the properties of the data being examined (Deshpande and Thakare, 2010). Descriptive models belong to the realm of unsupervised learning: such models interrogate the database to identify patterns and relationships in the data. Clustering (segmentation) algorithms, pattern recognition models, and visualization methods, among others, belong to this family of descriptive models (Han and Kamber, 2006). In predictive modeling tasks, one uses patterns found in the data to predict future values. Predictive modeling comprises several types of models, such as classification, regression, and Artificial Intelligence-based models. Predictive models are built, or trained, using data for which the value of the response variable is already known. This kind of training is sometimes referred to as supervised learning, because calculated or estimated values are compared with the known results, whereas descriptive techniques are sometimes referred to as unsupervised learning because there is no already-known result to guide the algorithms (Two Crows Corporation, 2005). According to Fayyad et al. (1996), the goals of prediction and description can be achieved using a variety of particular data mining methods. These methods are broadly categorized as On-Line Analytical Processing (OLAP), classification, clustering, association rule mining, etc., and they use different types of algorithms and data. The data source can be a data warehouse, a database, a flat file, or a text file. The algorithms may be statistical algorithms, decision tree based, nearest neighbor, neural network based, genetic algorithm based, rule based, Support Vector Machines, etc. (Deshpande and Thakare, 2010). The data mining methods used for this study and the behavior of the patterns they discover are described below.
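The supervised/unsupervised distinction above can be made concrete with a small sketch. This is an illustration only, written in plain Python rather than the Weka tooling used in this study, and all scores and labels below are invented: the clustering function receives unlabeled values and discovers groups on its own (descriptive), while the classifier is trained on values whose class labels are already known (predictive).

```python
def kmeans_1d(values, iters=10):
    """Descriptive (unsupervised): group unlabeled values around 2 centroids."""
    centroids = [min(values), max(values)]  # simple initialization for k = 2
    for _ in range(iters):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            groups[nearest].append(v)
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return sorted(centroids)

def nearest_centroid_predict(train, labels, x):
    """Predictive (supervised): centroids are computed per known class label."""
    by_label = {}
    for v, lab in zip(train, labels):
        by_label.setdefault(lab, []).append(v)
    centroids = {lab: sum(vs) / len(vs) for lab, vs in by_label.items()}
    return min(centroids, key=lambda lab: abs(x - centroids[lab]))

scores = [35, 40, 38, 82, 90, 85]            # unlabeled exam scores (invented)
print(kmeans_1d(scores))                      # two discovered group centers

labels = ["fail", "fail", "fail", "pass", "pass", "pass"]
print(nearest_centroid_predict(scores, labels, 75))  # prints "pass"
```

The same data serves both tasks; only the presence of the known class labels turns the problem from a descriptive one into a predictive one.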
Figure 3.2 Weka GUI Application Main Window
After launching Weka, the first step is to select an application (Explorer, Experimenter, etc.) for the data mining tasks using the application panel on the main window. Once the Explorer window is chosen and the data file is opened, different techniques can be selected to perform the data mining tasks, as shown in the following figure. The Weka software provides separate tabs for data pre-processing, classification, clustering, association, and other techniques.
In the Preprocess tab of the Weka Explorer window, we can load data into the system by clicking on “Open file...” (or the other buttons in the window, depending on where the data is available), provided the data is in a format the tool supports, such as ARFF or CSV. Converting between data formats can be done in Weka by clicking on “Save…” in the Explorer window, and the data can be edited by clicking on “Edit…”. We then examine the data thoroughly: its attribute types, properties, and class (last attribute) distribution. Finally, we can visualize what the data looks like graphically before continuing with the other techniques in the Explorer window for further analysis; the histogram shows the distribution of each attribute, colored by class grouping.
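To illustrate what the ARFF format contains and what the class (last-attribute) distribution shown in the Preprocess tab amounts to, here is a minimal sketch. Weka's own loader handles the full ARFF specification internally; the relation, attributes, and data rows below are invented for demonstration.

```python
# Minimal sketch: read an ARFF-style text and tally the class distribution
# (the last attribute), mirroring what the Weka Preprocess tab displays.
from collections import Counter

SAMPLE_ARFF = """\
@relation student_result
@attribute sex {M,F}
@attribute region {urban,rural}
@attribute result {pass,fail}
@data
M,urban,pass
F,rural,fail
F,urban,pass
M,rural,pass
"""

def class_distribution(arff_text):
    rows, in_data = [], False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and ARFF comments
            continue
        if line.lower() == "@data":
            in_data = True                      # header ends, data rows begin
            continue
        if in_data:
            rows.append(line.split(","))
    return Counter(row[-1] for row in rows)     # last attribute = class

print(class_distribution(SAMPLE_ARFF))          # pass: 3, fail: 1
```

The header declares each attribute and its nominal values; everything after `@data` is one record per line, which is why the last column can be read directly as the class label.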
When Weka’s Explorer window is used, it provides several options that affect the type of association rules Weka produces and the way they are constructed. The associate-rule construction button on the toolbar displays a dialog box consisting of the construction options shown in the following figure.
Figure 3.4 Weka Explorer with association rule Evaluation Options dialog box
1) Find all frequent itemsets: Each of these itemsets will occur at least as frequently as a
predetermined minimum support count.
2) Generate strong association rules from the frequent itemsets: the rules must satisfy both minimum support and minimum confidence. Such rules are called strong rules.
Association rule learning is a method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measure of interestingness, and to detect relationships or associations between specific values of categorical variables in large data sets. This means that in association we are interested in finding associations between items with no particular focus on a target attribute, whereas in classification we essentially map the set of records (attributes, variables) onto the class attribute. For this study, the researcher used the NEAEA dataset, which was provided as a common dataset for 2005 and 2007; 67,200 records (cases) out of the total of 210,000 records are used for this association analysis. Association rule mining has been applied to find correlations between items in a dataset, including identifying attributes that characterize patterns of performance disparity between various groups of students, discovering interesting relationships from students’ academic results, and finding the relationships between patterns of subjects. Association rule mining is one of the most well-studied data mining tasks. It discovers relationships among attributes in databases, producing if-then statements concerning attribute values (Kumar, 2012). IF-THEN rules are one of the most popular ways of representing knowledge, due to their simplicity and comprehensibility (Han and Kamber, 2005). There are different types of rules according to the data mining technique used, for example: classification, association, sequential pattern analysis, prediction, causality induction, optimization, etc. In the area of knowledge discovery in databases, the most studied ones are association rules, classifiers, and predictors. An example of a generic IF-THEN rule format in Extended Backus-Naur Form (EBNF) notation is shown in the table below.
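The two-step process described above (find all frequent itemsets, then derive the strong rules that satisfy minimum support and confidence) can be sketched in plain Python. This is an illustrative toy implementation, not the optimized Apriori implementation that Weka runs on the NEAEA data; the transactions, item names, and thresholds below are all invented.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Step 1: every itemset whose support meets the minimum support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    frequent = {}
    current = {frozenset([i]) for t in transactions for i in t}
    size = 1
    while current:
        survivors = {}
        for cand in current:
            support = sum(cand <= t for t in transactions) / n
            if support >= min_support:
                survivors[cand] = support
        frequent.update(survivors)
        size += 1
        # candidates of the next size are unions of surviving itemsets
        current = {a | b for a in survivors for b in survivors
                   if len(a | b) == size}
    return frequent

def strong_rules(frequent, min_confidence):
    """Step 2: rules antecedent -> consequent with sufficient confidence."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                confidence = support / frequent[ante]
                if confidence >= min_confidence:
                    rules.append((set(ante), set(itemset - ante),
                                  round(confidence, 2)))
    return rules

# Invented toy transactions: subjects in which each student scored highly.
tx = [{"math", "physics"}, {"math", "physics", "english"},
      {"math", "english"}, {"physics", "english"}, {"math", "physics"}]
freq = frequent_itemsets(tx, min_support=0.6)
for ante, cons, conf in strong_rules(freq, min_confidence=0.7):
    print(ante, "->", cons, "confidence", conf)
```

On this toy data, only {math, physics} survives as a frequent pair (support 3/5), yielding the strong rules math -> physics and physics -> math, each with confidence 0.75; the two if-then statements are exactly the kind of rule the NEAEA analysis searches for among student attributes.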