
POSTGRADUATE PROGRAM

COLLEGE OF INFORMATICS

DEPARTMENT OF INFORMATION TECHNOLOGY (MSc), REGULAR

COURSE CODE: IT-5112


COURSE NAME: BIG DATA ANALYTICS

Title: Prediction of Student Performance in MSc Thesis Using Machine Learning Techniques: The Case of Bule Hora University, College of Informatics Postgraduate Students


Research proposal submitted to Dr. Kula Kekeba (Ph.D.)

Submitted By: 1. Dureso Mohammed

2. Meliha Hawi

3. Zintalem Getinet

Submitted To: Dr. Kula Kekeba (Ph.D.)

Submission Date: June 9, 2022

Bule Hora, Ethiopia

ACKNOWLEDGEMENT
First of all, we would like to express our sincere thanks to "ALLAH" for giving us good health and supporting us throughout the whole process of drafting our research proposal up to the completion of the report. With great pleasure, we express our deep sense of gratitude to our instructor, Dr. Kula Kekeba (Ph.D.), for his valuable guidance, constant encouragement, and patience throughout our research work, from initiating us to prepare a research proposal in Big Data Analytics to the end of the research report. We would also like to extend our sincere thanks to all the other staff members of the College of Informatics, especially those of the Information Technology Department, who have contributed in their own way to making this research work successful. Finally, we would like to thank all our classmates for the productive discussions we had together.

ABSTRACT
In the past two decades, the biotechnology and medical fields have seen advances in the development of emerging technologies that open new possibilities for identifying and quantifying the bio-analysis of disease. The basic idea behind biochip technology is to give human expertise to a machine so that it performs daily activities on behalf of the human expert. We all have some perceptual abilities and can thus understand each other's feelings. Just as computer chips can perform millions of mathematical operations in one second, biochips can perform thousands of biological reactions, such as gene decoding, in a few seconds. Biochip technology has helped identify over 80,000 genes in human DNA as part of an ongoing worldwide research collaboration known as the Human Genome Project. A biochip platform incorporates electronics for addressing, reading out, sensing, and controlling temperature, together with a handheld analyzer capable of multi-parameter identification. The platform can be plugged into a standard peripheral bus of the analyzer device or communicate through a wireless channel. Biochips enable us to realize revolutionary new bioanalysis systems that can directly manipulate and analyze the micro/nano-scale world of biomolecules, organelles, and cells.

TABLE OF CONTENT

ACKNOWLEDGEMENT
ABSTRACT
ACRONYM
TABLE OF CONTENT
LIST OF FIGURES
CHAPTER-ONE
1.1. INTRODUCTION
1.2. PROBLEM STATEMENT
1.3. RESEARCH QUESTIONS
1.4. OBJECTIVES
1.4.1. GENERAL OBJECTIVE
1.4.2. SPECIFIC OBJECTIVE
1.5. METHODOLOGY
1.5.1. LITERATURE REVIEW
1.5.2. DESIGN METHODOLOGY
1.5.3. DATASET PREPARATION
1.5.4. EXPERIMENTAL ENVIRONMENT SETUP
1.6. SIGNIFICANCE
1.7. SCOPE AND LIMITATION
1.7.1. SCOPE
1.7.2. LIMITATION
1.8. TIME FRAME
BIBLIOGRAPHY

LIST OF FIGURES

Figure 1: Information flow during the literature review process
Figure 2: Spiral model for software development
Figure 3: Project workflow through the different stages of the SDLC
Figure 4: The KDD process
Figure 5: Time frame for this research work

CHAPTER-ONE

1.1. INTRODUCTION
In recent years, universities worldwide have focused on institutional transformation to improve students' learning experience and the quality of education. Quality education is one of the Sustainable Development Goals (SDGs) adopted by the Ethiopian government. Bule Hora University, one of the universities of Ethiopia, located in the southern part of the country, is working toward enhancing quality education by improving the services required by students. The university is located around 470 km south of Addis Ababa, in Bule Hora town, West Guji Zone, Oromia Regional State.
The recent era of machine learning and big data technology allows higher institutions to use a wide variety and volume of data to analyze students' performance and enhance their learning experience. Machine Learning (ML) and Big Data Analytics (BDA) are gaining growing traction in providing insights that could inform the decision-making process to support institutions and improve student performance [1].
Machine learning offers a wide variety of critical techniques for analyzing big data to predict students' academic performance based on their educational background. Potential data mining applications in education range from predictive modelling of student retention or attrition and students' learning behaviour to students' learning experience [2], [3]. Such analysis of the big data inside an educational institution helps it better understand students' specific learning needs, which is significant in tackling the problems universities face in predicting student retention and progression.
One of the significant challenges of higher educational institutions is that most graduating students fail to complete their thesis work on schedule, which extends their graduation year and hinders them from advancing their academic rank. Hence, applying big data analytics combined with machine learning techniques has a vital role in improving student performance [4].

Big data refers to large and complex collections of data that are difficult to analyze using traditional methods. Such data can be in structured, semi-structured, or unstructured forms. The massive data flow has driven the development of better analytical methods, as traditional methods have become inefficient for processing big data. Due to the vast amount of data in educational databases, predicting students' performance is tricky [5], [6]. In addition, there is currently no established framework for evaluating and tracking students' success. There are two primary reasons for this. First, research on existing prediction methods is still insufficient to determine the most appropriate methods for predicting student performance in institutions. Second is the absence of inquiry into specific courses.
Thus, this research aims to investigate the most efficient machine learning technique for predicting the final thesis results of postgraduate students at Bule Hora University, specifically those of the College of Informatics, which in turn will help the college assess its students' level of performance and decide how to help them achieve improved performance.
1.2. PROBLEM STATEMENT
One of the most significant challenges in higher educational institutions is the massive amount of data stored in university databases, mainly structured data that would be simple to analyze using different data mining and machine learning techniques. However, institutions have not utilized it as they should to develop more productive and successful predictive models. These massive amounts of data can improve the quality of education when data mining, frequent pattern generation, and machine learning techniques are applied to them. Another challenge of higher educational institutions is that most graduating students fail to complete their thesis work on schedule, which extends their graduation year and hinders them from advancing their academic rank. To the best of our understanding, no predictive model for predicting student performance in thesis work has been studied before. This motivates us to investigate different data mining or big data analytics and machine learning techniques to develop better predictive models that help the college. Apart from the above-stated problems, this research work tries to address the following problems:
 Difficulties in early prediction of the status of students.
 Difficulties associated with examining the extensive, complex collection of data and information contained in an organization's database.
1.3. RESEARCH QUESTIONS

As part of this research work, to attain its overall objectives, the following research questions need to be addressed:

o RQ#1: What is the state of the art, and what systems have been developed to improve student performance?
o RQ#2: How do different machine learning and data mining techniques support improving student performance?
o RQ#3: What could be the possible causes of students' comparatively low performance in MSc thesis work?

1.4. OBJECTIVES
1.4.1. GENERAL OBJECTIVE
The general objective of this study is to investigate different machine learning techniques for predicting student performance in thesis work. This above-stated aim, or general objective, comes with the following specific objectives, which assist us in attaining our goal.

1.4.2. SPECIFIC OBJECTIVE


To address the above-stated challenges, it is necessary to provide an automated machine learning model that can efficiently predict student performance. This research work should also address the following specific objectives:
 Conduct a preliminary literature review that supports us in identifying gaps in the state of the art.
 Conduct an in-depth and comprehensive literature review covering additional research, conducted over several years, on improving student performance and quality education.
 Prioritize and identify the most significant contributions of previous research and their outcomes.
 Design, develop, and test better predictive models for predicting student performance in thesis work.
 Evaluate the performance of the proposed algorithms and conduct a comparative analysis of the applied algorithms.
 Based on the results obtained throughout our investigation, recommend and set out future directions for further researchers.

1.5. METHODOLOGY
Methodology is a systematic approach to executing and managing a research study efficiently. It is good practice to utilize the most suitable methodology, conforming to the standards of any research investigation [7]. In this research work, SWOT analysis is also employed. This section provides the detailed methodology used in this research work, which includes the following:

1.5.1. LITERATURE REVIEW


In order to critically assess previous research and identify research gaps in student performance evaluation techniques, a literature review will be conducted to analyze relevant articles. The literature review enables the search for and evaluation of important research articles in Big Data Analytics (BDA) and Machine Learning (ML), mainly those related to improving student performance. The following figure illustrates the different phases of the literature review we passed through while proposing this research work.

Figure 1: Information flow during the literature review process
1.5.2. DESIGN METHODOLOGY
A well-executed research project should be iterative to allow for requirement and implementation
modifications during the project's lifecycle while still adhering to the established timeframes and
quality standards. By following this technique, project failures can be significantly decreased.

The Spiral Paradigm is a risk-based model defined by iterative procedures that aid in risk mitigation. As illustrated below, the Spiral Model is divided into four distinct software development life cycle (SDLC) phases. The entire development process is iterative; each repetition is called a Spiral. As illustrated in Figure 2, the four major phases are as follows:

Step-1: Determine Objectives: The objectives of this project will be defined through an appropriate
feasibility assessment and milestones.

Step-2: Risk identification and mitigation: Risks will be recognized, prioritized, and mitigated based
on their severity.

Step-3: Development and Testing: During this stage, a high-quality functioning prototype will be
developed in compliance with the criteria in priority order. Later spirals will produce a working version
of the product.

Step-4: Planning Phase: This stage evaluates the project's output before moving on to the next Spiral.
The following figure shows the different phases of the spiral model.

Figure 2: Spiral model for software development


At a high level, the project workflow is divided into three phases: define, design, and implement. As seen in Figure 3, the following actions will be followed for each level of prototyping and design. The Define stage will define the objectives and goals and pick the features and model after pre-processing the data. The model will be trained and tested during the Design stage, using appropriate accuracy metrics. Finally, the model will be deployed and iterated against new data as part of the Implementation stage.

Figure 3: Project workflow through the different stages of the SDLC
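The train-and-test step of the Design stage can be sketched in code. The following is a minimal illustration in plain Python, assuming a toy dataset of hypothetical student records and a trivial majority-class baseline; it is not the actual model this proposal will build, only the shape of the workflow (split, fit, score):

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle reproducibly and split rows into train and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

def majority_class(train_rows):
    """'Train' a trivial baseline: always predict the most frequent label."""
    labels = [label for _, label in train_rows]
    return max(set(labels), key=labels.count)

def accuracy(test_rows, predicted_label):
    """Fraction of test rows whose true label matches the prediction."""
    return sum(1 for _, label in test_rows if label == predicted_label) / len(test_rows)

# Toy records: (hypothetical features, thesis-completion label).
data = [({"gpa": 3.5}, "complete"), ({"gpa": 2.1}, "delayed"),
        ({"gpa": 3.8}, "complete"), ({"gpa": 3.0}, "complete"),
        ({"gpa": 2.5}, "delayed"), ({"gpa": 3.2}, "complete"),
        ({"gpa": 2.0}, "delayed"), ({"gpa": 3.9}, "complete"),
        ({"gpa": 2.8}, "complete"), ({"gpa": 2.2}, "delayed")]

train_set, test_set = train_test_split(data)
baseline = majority_class(train_set)
print("baseline:", baseline, "accuracy:", accuracy(test_set, baseline))
```

In practice, the baseline classifier would be replaced by the machine learning algorithms under comparison, and accuracy would be reported per algorithm on the same held-out test set.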


1.5.3. DATASET PREPARATION
Big data analytics and machine learning are at the heart of the Knowledge Discovery in Databases (KDD) process, including inference from the algorithms and techniques that explore data, develop models, discover patterns, and provide insight. KDD is divided into four different phases, as shown in Figure 4 below; any of the three techniques (data mining, machine learning, or sentiment analysis) may be utilized in Phase 3 of the KDD process. Because data sources and purposes vary, each system requires a unique set of methods. Classification, prediction, clustering, relation mining, and data distillation for human judgment are the most frequently utilized data mining and machine learning techniques in education. The KDD method is interactive and iterative (with several user-driven decisions), and each phase involves various stakeholders [8].

Figure 4: The KDD process
The different phases of the process, as shown in Figure 4, are as follows:

 Data selection: Creating the target dataset.


 Data cleaning and pre-processing: Data cleaning and pre-processing for noise removal, outlier
detection, and handling missing and unknown values.
 Data Transformation: dimension reduction (feature selection and extraction) and attribute
transformation (discretization of the numerical features).
 Data mining task: Selecting appropriate data mining tasks depending on the KDD goals (e.g.,
summarization, classification, regression, and clustering) to identify meaningful insights.

For this research work, we will use open-source datasets available from the Kaggle and GitHub websites, because dataset preparation is a time-consuming task that could not be completed within this limited time frame.
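As a concrete illustration of the cleaning and transformation phases listed above, the sketch below fills a missing value with the column mean and then discretizes the numeric attribute into nominal bins. It is plain Python on toy records; the GPA attribute and the bin boundaries are hypothetical, not taken from this proposal:

```python
def clean(records, numeric_keys):
    """KDD cleaning phase: replace missing numeric values with the column mean."""
    for key in numeric_keys:
        present = [r[key] for r in records if r[key] is not None]
        mean = sum(present) / len(present)
        for r in records:
            if r[key] is None:
                r[key] = mean
    return records

def discretize_gpa(gpa):
    """KDD transformation phase: bin a numeric GPA into a nominal label."""
    if gpa >= 3.5:
        return "high"
    if gpa >= 2.75:
        return "medium"
    return "low"

# Data selection phase: a toy target dataset with one missing GPA.
records = clean([{"gpa": 3.6}, {"gpa": None}, {"gpa": 2.4}, {"gpa": 3.0}], ["gpa"])
print([discretize_gpa(r["gpa"]) for r in records])  # → ['high', 'medium', 'low', 'medium']
```

The same two steps (imputation, then discretization) correspond directly to the "data cleaning and pre-processing" and "data transformation" phases of the KDD process above.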

1.5.4. EXPERIMENTAL ENVIRONMENT SETUP


For this research work, we are going to use the WEKA toolkit. WEKA provides various tools and methods for data analysis and predictive modelling, together with graphical user interfaces enabling simple access to these operations, and a machine-learning experimentation system. WEKA is currently utilized in many diverse applications, particularly for educational and scientific purposes. WEKA was chosen as our experimental environment because of its:

 Free availability under the GNU General Public License.
 Portability: fully developed in the Java programming language, it runs on practically any modern computing platform.
 Comprehensive data pre-processing and modelling techniques.
 Graphical user interfaces that make it user-friendly.

WEKA supports various standard data mining tasks, especially pre-processing, clustering, classification, regression, visualization, and feature selection. WEKA input should be formatted according to the Attribute-Relation File Format, bearing the .arff extension.
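For illustration, a minimal ARFF file might look like the following. The relation name and attributes here are hypothetical examples, not the actual features this study will use:

```
@relation student_thesis

@attribute gpa numeric
@attribute gender {male, female}
@attribute status {complete, delayed}

@data
3.6, male, complete
2.4, female, delayed
```

The `@attribute` lines declare each column's type (numeric, or nominal with an explicit value set), and each row under `@data` lists the values in the declared order.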

1.6. SIGNIFICANCE
Using a prediction system to identify students' status in the education sector has various benefits, as it is an input for the decision-making process. Using machine learning and data mining techniques makes it easy to determine student status and provide whatever help is needed to improve learning quality. Thus, this research work has significant advantages for the education system. Apart from the above-stated significance, the proposed system is intended to bring the following advantages:
 It will be able to provide prompt service on a timely basis.
 It will lay the groundwork for further investigation.
1.7. SCOPE AND LIMITATION
1.7.1. SCOPE
The scope of a study describes the extent to which the research area will be examined and the parameters used in the study; it essentially establishes the boundaries of the study and its objectives. Due to the scarcity of time, the scope of this study is limited to examining data about the postgraduate students of the College of Informatics.
1.7.2. LIMITATION
A study's limitations are those aspects of its design or methodology that impact or influence the interpretation of its findings. They are the limits on the generalizability, application to practice, and value of findings that arise from the study's design, the methods used to establish internal and external validity, or unanticipated difficulties encountered throughout the investigation [9].

The following are among the limitations that might hinder us from achieving the above-stated aims and objectives:
 Lack of enough datasets from the college, since the college launched postgraduate classes only two years ago, which reduces the learning algorithms' accuracy.
 The time constraint, which prevents us from conducting an in-depth investigation.
1.8. TIME FRAME
In project management, scheduling is the process of recording activities and milestones within a project. A timetable often includes a specified start and end date, duration, and resource allocation for each activity. Effective project scheduling is key to successful time management. The following figure shows the time frame, or schedule, for accomplishing the tasks throughout the research activities.

Figure 5: Time frame for this research work

CHAPTER-THREE

Methodology
3.1. Study area

The research is carried out based on primary data extracted from the database of Bule Hora University, which is available to researchers. Secondary data, for instance reviewing important documents, will be used to gain further information related to student achievement. Accordingly, this study uses both quantitative and qualitative research designs; interviews will be used to understand the domain knowledge and to interpret the findings.
3.2. Study design

In order to achieve the above-stated objectives, the researchers used the CRoss-Industry Standard Process for Data Mining (CRISP-DM) model, which contains six phases.
3.2.1. CRISP–DM
This section deals with an overview of the data source, data cleaning, and data transformation employed in this study. In general, the researchers followed the steps of the data mining process mentioned in Chapter One. The methodology adopted in this study is CRISP-DM.
CRISP-DM (CRoss-Industry Standard Process for Data Mining) is a nonproprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit. Developed by industry leaders with input from more than 200 data mining users and data mining tool and service providers, CRISP-DM is an industry-, tool-, and application-neutral model. It encourages best practices and offers organizations the structure needed to realize better, faster results from data mining. CRISP-DM demands that data mining be seen as an entire process, from communication of the business problem through data collection and management, data preprocessing, model building, model evaluation, and finally, model deployment (Chapman et al., 2000). Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use (Chapman et al., 2000).
According to Chapman et al. (2000), a given data mining project has a life cycle consisting of six phases. Note that the phase sequence is adaptive: the next phase in the sequence often depends on the outcomes of the preceding phase. Depending on the behaviour and characteristics of the model, we may have to return to the data preparation phase for further refinement before moving on to the model evaluation phase (Larose, 2005). CRISP-DM is complete and well documented. All the stages are properly organized, structured, and defined, allowing a project to be easily understood or revised (Santos & Azevedo, 2005).

Figure 3.1: Phases of the CRISP-DM reference model. Source: Larose (2005)

3.3. Business understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. In this study, in order to understand the business and the application, we work closely with domain experts in NEAEA and review related literature on the research topic. The data for this research are taken from the 2005 Grade 10 and 2007 Informatics college students' national examination result databases. These data provide critical information for the monitoring and evaluation of the country's Plan for Accelerated and Sustained Development to End Poverty (PASDEP), and the education sector's general education quality improvement policies and programs assist in monitoring progress towards meeting the Millennium Development Goals (MDGs). NEAEA conducted three consecutive learning assessments, in 2000, 2003/2004, and 2008, to measure the learning achievements of Grade 8 students and identify the factors that determine those achievements; these also aimed at providing comparative information on school improvement. The target population of this study, however, is Informatics students.
3.4. Data understanding
The data understanding phase starts with initial data collection and proceeds with activities to become familiar with the data, identify data quality problems, discover first insights into the data, or detect interesting subsets. The primary data sources for this study are mainly the 2005 and 2007 NEAEA data sets, together with the data sets of the Center for Information and Communication Technology and the Educational Management Information System Directorate under the Ministry of Education, in Excel format. These contain detailed information on student name, student address, academic results for Grades 10 and 12, school type, age, gender, teachers' qualifications, availability of media technology (plasma) at school, etc. The data are dispersed across different offices in unorganized form. Hence, to become familiar with the data, identify data quality problems, gain insights into the data, and detect interesting subsets of the data to be used, different literature has been surveyed. Besides, careful analysis of the data and its structure has been done together with the domain experts.
3.5. Data preparation

Data preparation is one of the most important phases of the data analysis activity; it involves the construction of the final data set (the data that will be fed into the analysis tool) from the initial raw data. Data preparation generates a dataset smaller than the original one, which can significantly improve the efficiency of data mining. This task includes attribute selection, filling in missing values, correcting errors, removing outliers (unusual or exceptional values), and resolving data conflicts using domain knowledge or expert decisions to settle inconsistencies. Since no missing values occurred in the data set, no missing value replacement technique was applied in this research. The collected Excel-format data were prepared in a WEKA-understandable format: preprocessing activities were performed, and the file was saved in the WEKA-acceptable comma-separated values (CSV, or comma-delimited) format. WEKA's native data format is known as ARFF (Attribute-Relation File Format); it is basically a CSV format with some extra headers to specify the type of each attribute (numeric, binary, nominal). The CSV file was converted into ARFF using the WEKA mining software, to take advantage of easier data manipulation and compatible interaction with WEKA. During a scan of the preprocessed data, basic summary statistics were produced for each attribute in a form acceptable to the selected data mining software, WEKA.
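The CSV-to-ARFF conversion described above is normally done inside WEKA itself. Purely to illustrate what the conversion produces, the following is a minimal stdlib-Python sketch; the column names and nominal values are hypothetical:

```python
import csv
import io

def csv_to_arff(csv_text, relation, nominal):
    """Render CSV text as ARFF. `nominal` maps a column name to its allowed
    values; every other column is declared numeric."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    lines = ["@relation %s" % relation, ""]
    for col in columns:
        if col in nominal:
            # Nominal attributes list their allowed values in braces.
            lines.append("@attribute %s {%s}" % (col, ", ".join(nominal[col])))
        else:
            lines.append("@attribute %s numeric" % col)
    lines += ["", "@data"]
    for row in rows:
        # Each data row repeats the values in declared column order.
        lines.append(", ".join(row[col] for col in columns))
    return "\n".join(lines)

csv_text = "gpa,status\n3.6,complete\n2.4,delayed\n"
print(csv_to_arff(csv_text, "student_thesis", {"status": ["complete", "delayed"]}))
```

The output is the same CSV body preceded by `@relation`, `@attribute`, and `@data` headers, which is exactly the extra structure ARFF adds over plain CSV.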
3.6. Modeling
In this phase, various modelling techniques are selected and applied, and their parameters are adjusted to optimal values. Although the choice of data mining techniques for association tasks seems to be strongly dependent on the application, one of the techniques frequently employed for analysis tasks is association rule mining. As indicated previously, the purpose of this research is to develop an association rule model. The association rule mining technique is applied to predict the correlation between Information Technology and Software Engineering students' academic results.

Once the data preprocessing task completed, the researcher Apriority algorithm,
which is the default algorithm selected. However, in order to change the parameters for this run
(e.g., support, confidence, etc.) WEKA allows the resulting rules to be sorted according to
different metrics such as confidence, leverage, and lift. In this research, the researcher has
selected lift as the criteria entered certain value as the minimum value for lift (or improvement)
is computed as the confidence of the rule divided by the support of the right-hand-side (RHS).
In a simplified form, given a rule L => R, lift is the ratio of the probability that L and R
occur together to the multiple of the two individual probabilities for L and R, i.e.

lift = Pr(L,R) / Pr(L).Pr(R).


If this value is 1, then L and R are independent. The higher this value, the more likely it is that the co-occurrence of L and R in a transaction is not just a random occurrence, but the result of some relationship between them. Here the researcher also changed the default number of rules (10) to 20; this indicates that the program will report no more than the top 20 rules (in this case sorted according to their lift values). The upper bound for minimum support is set to 1.0 (100%) and the lower bound to 0.1 (10%). Apriori in WEKA starts with the upper-bound support and incrementally decreases it (by a delta increment, which by default is set to 0.05, or 5%). The algorithm halts when either the specified number of rules has been generated or the lower bound for minimum support is reached. The significance testing option is applicable only in the case of confidence and is by default not used (-1.0). The rules were discovered based on the specified threshold values for support and lift. For each rule, the frequency counts for the LHS and RHS are given, as well as the values for confidence, lift, leverage, and conviction.
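The support, confidence, and lift measures described above can be illustrated with a short Python sketch. The transaction data and item names below are invented for illustration only; they are not the actual student records used in this study.

```python
# Toy transaction database (invented for illustration); each transaction
# is the set of "items" (e.g. grade categories) observed for one student.
transactions = [
    {"IT:high", "SE:high"},
    {"IT:high", "SE:high"},
    {"IT:high", "SE:low"},
    {"IT:low", "SE:low"},
    {"IT:low", "SE:low"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(RHS | LHS) = support(LHS union RHS) / support(LHS)."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """lift = Pr(L, R) / (Pr(L) * Pr(R)); equivalently confidence / support(RHS)."""
    return support(lhs | rhs) / (support(lhs) * support(rhs))

L, R = {"IT:high"}, {"SE:high"}
print(support(L | R))     # 0.4
print(confidence(L, R))   # ≈ 0.667
print(lift(L, R))         # 0.4 / (0.6 * 0.4) ≈ 1.667
```

A lift above 1, as here, indicates that the two items co-occur more often than independence would predict, which is the criterion used when ranking rules by lift.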
3.7. Evaluation
At this stage, the model (or models) obtained are more thoroughly evaluated and the steps
executed to construct the model are reviewed to be certain it properly achieves the business
objectives. Evaluation is the key to making real progress in data mining. After building a
model, we must evaluate its results and interpret their significance (Two Crows
Corporation, 2005). This stage consists of the interpretation and evaluation of the mined
patterns.

This involves interpreting the discovered patterns and possibly returning to any of the previous steps, as well as visualizing the extracted patterns, removing irrelevant ones, and translating the useful ones into terms understandable by users. The evaluation process is also carried out to identify interesting patterns representing knowledge, using lift as an interestingness measure. Both objective and subjective measures can be applied during association rule analysis. The subjective method requires additional expert knowledge or input, which was not fully available during this study, so the researcher used lift as the objective measure. Improvement and further analysis can therefore be made in this area.

3.8. Deployment

Creation of the model is generally not the end of the project. Even if the purpose of the model is
to increase knowledge of the data, the knowledge gained will need to be organized and
presented in a way that the customer can use it. The sequence of the phases is not rigid. Moving
back and forth between different phases is usually required and possible (Chapman et al, 2000).
As it has been mentioned the scope of this study is to build a model for students’
academic performance. The successful model can support educational planner and
decision maker for analyzing education system and decision making system. As data mining
consider powerful tool to analyze the data and it has the capability to merge with any
system. Therefore the proposed data mining model can be integrating with the existing
DSS system. The practical implementation of the model can be helpful for the decision makers
to modify and adapt according to the organization.

DM TOOL, TECHNIQUES AND ALGORITHMS


3.9. Data Mining Tool Selection
Data mining or knowledge discovery refers to the process of finding interesting information
in large repositories of data. The term data mining also refers to the step in the knowledge
discovery process in which special algorithms are employed in hopes of identifying interesting patterns in the data. These interesting patterns are then analyzed, yielding knowledge (Bowen, 2006).
Discovering knowledge in data presents data mining as a well-structured standard process,
intimately connected with managers, decision makers, and those involved in deploying the
results (Larose, 2005). Therefore both novices and data-mining specialists need assistance
in knowledge discovery processes (Deshpande and Thakare, 2010). Many good data mining
software products are in use, ranging from well-established tools such as Enterprise Miner by SAS, Intelligent Miner by IBM, Clementine by SPSS, and PolyAnalyst by Megaputer, to many others in a growing and dynamic industry (Witten and Frank, 2005). WEKA (from
the University of Waikato in New Zealand) is an open source tool with many useful machine
learning methods (David and Delen, 2008).
Data mining tools need to be versatile, scalable, capable of accurately predicting responses
between actions and results, and capable of automatic implementation (Chackrabarti, et al,
2009). Data mining tools perform data analysis and may uncover important data patterns,
contributing greatly to the business strategies, knowledge bases, scientific and medical research
(Han and Kamber, 2006).
Data mining tools predict future trends and behaviors and help organizations to make
practical knowledge-driven decisions (Larose, 2005). The automated, prospective
analyses offered by data mining move beyond the analyses of past events provided by
retrospective tools typical of decision support systems. Data mining tools can answer the
questions that traditionally were more time consuming to resolve. They prepare databases for
finding hidden patterns, finding predictive information that experts may miss because it lies
outside their expectations (Deshpande and Thakare, 2010). In this research, Weka 3.6.4 is used as the mining tool. In addition, Microsoft Excel has been used for data cleaning and for converting the original file to CSV format, and Microsoft Word for documentation purposes.

There are factors that contribute to the usefulness of data mining tools for the intended data mining tasks. The tool selected should be able to provide the required data mining functions. The data mining functionality that the researcher intends to carry out in this research is to describe the relation between the factor variables and students' academic results. In addition, the methodologies used by the data mining software to perform each of the data mining functions are an important factor to consider. The researcher chose association rules in order to analyze the correlation between the subjects, students' performance in relation to other subgroups, and the relationship between the Information Technology and Software Engineering academic performance of the students. In addition, the selected tool can comfortably operate on the Windows operating system in a standalone environment; hence, Windows 7 on a standalone machine has been utilized. Another important consideration in tool selection is visualization capability. The variety, quality, and flexibility of visualization tools may strongly influence the usability, interpretability, and attractiveness of data mining systems, and Weka has a facility to visualize its output in this regard (Witten and Frank, 2005). Other reasons behind the selection of Weka for this study are the researcher's familiarity with the tool, its comprehensiveness for the study requirements, and its ready availability. Weka provides a number of data mining functionalities, such as classification, clustering, association, attribute selection, and visualization. Weka was developed at the University of Waikato in New Zealand; 'Weka' stands for the Waikato Environment for Knowledge Analysis. The system is written in Java, an object-oriented programming language, and has been tested under the Linux, Windows, and Macintosh operating systems. Weka includes a variety of tools for preprocessing a dataset, such as attribute selection, attribute filtering, and attribute transformation (Witten and Frank, 2005).

As mentioned in the methodology section in chapter one, the objective of this research is to find associations among variables. Hence, it is important to select the association rule implementations for model building and the experiments to be carried out in the data mining process, which also involves data mining tool selection and the algorithms used for modeling. To demonstrate real practicality in any data mining process, when selecting a potential mining tool it is important to understand clearly the techniques and algorithms to be implemented, and to describe them specifically based on the tool used for the research work.

So, in this study the researcher chose to use the Weka 3.6.4 software. This tool was selected because it has gained widespread adoption and survived for an extended period of time, it is freely available for download (i.e., it is open source software), and it offers many powerful features (sometimes not found in commercial data mining software); it has become one of the most widely used data mining systems.

3.10. Data Mining or Knowledge Discovery Process

As defined earlier, data mining or knowledge discovery is the process of finding interesting information in large repositories of data, in which special algorithms are employed in hopes of identifying interesting patterns that are then analyzed to yield knowledge (Bowen, 2006). It is a well-structured standard process, intimately connected with managers, decision makers, and those involved in deploying the results (Larose, 2005).

Data mining tasks

Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be classified into two categories, descriptive and
predictive. Descriptive mining tasks characterize the general properties of the data in the
database. Predictive mining tasks perform inference on the current data in order to make
predictions (Han and Kamber, 2006).

The descriptive model identifies the patterns or relationships in data and explores the properties of the data examined (Deshpande and Thakare, 2010). Descriptive models
belong to the realm of unsupervised learning. Such models interrogate the database to
identify patterns and relationships in the data. Clustering (segmentation)
algorithms, pattern recognition models, visualization methods, among others, belong to this
family of descriptive models (Han and Kamber, 2006). In predictive modeling tasks, one
identifies patterns found in the data to predict future values. Predictive modeling consists
of several types of models such as classification, regression and Artificial Intelligence based
models. Predictive models are built, or trained, using data for which the value of the
response variable is already known. This kind of training is sometimes referred to as supervised learning, because calculated or estimated values are compared with the known results, whereas descriptive techniques are referred to as unsupervised learning because there is no already-known result to guide the algorithms (Two Crows Corporation, 2005). According to Fayyad et al. (1996), the goals of prediction and description can be achieved using a variety of particular data mining methods. The data mining methods are broadly
categorized as: On-Line Analytical Processing (OLAP), Classification, Clustering, and
Association Rule mining, etc. These methods use different types of algorithms and data. The data
source can be data warehouse, database, flat file or text file. The algorithms may be
Statistical Algorithms, Decision Tree based, Nearest Neighbor, Neural Network based, Genetic
Algorithms based, Ruled based, Support Vector Machine etc. (Deshpande and Thakare, 2010).
The data mining methods that are used for this study and the behavior of the patterns they discover are described below.

3.11. The Weka tool


Once the Weka software has been downloaded and installed on the machine where the mining process is carried out and the optional files have been set up, everything is ready to use Weka. To employ Weka for data mining tasks, there is a graphical user interface (shown below), which consists of buttons and menu commands. This main window has four panels on its interface, where each panel is used to perform different tasks.

Figure 3.2 Weka GUI Application Main Window

The first step after running Weka is to select the application (Explorer, Experimenter, etc.) to use for the data mining tasks via the application panel on the main window. After the Explorer window is chosen and the application data file is opened, different techniques can be selected to perform data mining tasks, as shown in the following figure. The Weka software contains different techniques to preprocess, classify, cluster, associate, etc.

Figure 3.3 Weka Explorer Windows

In the preprocess tab of the Weka Explorer window, we can load data into the system by clicking on "Open file..." (or the other tabs in the window where the data are available), provided the data are in the ARFF or CSV format suited to the tool. Converting data formats through Weka can be done by clicking on "Save..." in the Explorer window, and editing data can be done by clicking on the "Edit..." tab. We then examine the data thoroughly: its attribute types, properties, and class (last attribute) distribution. Finally, we can visualize graphically what the data look like before continuing to the other techniques in the Explorer window for further analysis; the histogram shows the distribution of each attribute, colored by class grouping.

Figure 3.4 Graphical visualization of the data in the Weka Explorer window

In addition to preprocessing, a number of data mining methods/tools are implemented in the Weka software; some of them are based on the Apriori algorithm. Before running them through Weka, we should get into the specific details of each method and understand what type of data each model requires and what goals it attempts to accomplish. If the association rule panel of Weka's Explorer window is used, it provides several options that affect the type of associations that Weka produces and the way they are constructed. The association rule button on the toolbar displays a dialog box, which consists of the construction options displayed in the following figure.

Figure 3.5 Weka Explorer with the association rule evaluation options dialog box

3.12. Association Rule in Weka


There are a number of machine intelligence techniques available, but not all tools are best for all problems and data sets; different data sets produce different results depending on the algorithms used. In this study we test one algorithm, the Apriori association rule method, by adjusting different parameter values. Our aim is to find the best model. Association rule mining can be considered a two-step process (Kamber, 2006):
1) Find all frequent itemsets: Each of these itemsets will occur at least as frequently as a
predetermined minimum support count.

2) Generate strong association rules from the frequent itemsets: the rules must satisfy minimum support and confidence. These rules are called strong rules.

Association rule learning is a method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness, detecting relationships or associations between specific values of categorical variables in large data sets. This means that in association analysis we are interested in finding associations between items with no particular focus on a target one, whereas in classification we basically map the set of records (attributes, variables) onto the class attribute. For this, the researcher used the NEAEA dataset, which is provided as a common dataset for 2005 and 2007; 67,200 records or cases from the total of 210,000 records are used for this association analysis. Association rule mining has been applied to find correlations between items in a dataset, including identifying attributes characterizing patterns of performance disparity between various groups of students, discovering interesting relationships from students' academic results, and finding the relationships between each pattern of subjects. Association rule mining is one of the most well-studied data mining tasks. It discovers relationships among attributes in databases, producing if-then statements concerning attribute values (Kumar, 2012). IF-THEN rules are one of the most popular ways of representing knowledge, due to their simplicity and comprehensibility (Han and Kamber, 2005). There are different types of rules according to the data mining technique used, for example: classification, association, sequential pattern analysis, prediction, causality induction, optimization, etc. In the area of knowledge discovery in databases the most studied ones are association rules, classifiers, and predictors. An example of a generic IF-THEN rule format in Extended Backus-Naur Form (EBNF) notation is shown in the table below.
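The two-step process just described can be sketched in plain Python. The transactions, thresholds, and brute-force candidate generation below are illustrative only; real Apriori implementations, including WEKA's, prune candidates level by level using the downward-closure property rather than enumerating every subset.

```python
from itertools import combinations

def apriori_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Toy two-step association rule miner (illustrative, not WEKA's code).

    Step 1: find all itemsets whose support >= min_support.
    Step 2: from each frequent itemset, emit rules L => R whose
            confidence >= min_confidence ("strong" rules).
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Step 1: frequent itemsets (brute force over all sizes for clarity).
    frequent = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if support(set(c)) >= min_support]

    # Step 2: strong rules from each frequent itemset of size >= 2.
    rules = []
    for itemset in frequent:
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), k)):
                rhs = itemset - lhs
                conf = support(itemset) / support(lhs)
                if conf >= min_confidence:
                    rules.append((set(lhs), set(rhs), conf))
    return rules

transactions = [{"IT:high", "SE:high"}, {"IT:high", "SE:high"},
                {"IT:high", "SE:low"}, {"IT:low", "SE:low"}]
for lhs, rhs, conf in apriori_rules(transactions):
    print(lhs, "=>", rhs, f"conf={conf:.2f}")
    # → {'SE:high'} => {'IT:high'} conf=1.00
```

On this toy data only one rule survives both thresholds, illustrating how min-support first restricts the candidate itemsets and min-confidence then filters the rules generated from them.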

|Page
