Journal of Information and Telecommunication

ISSN: 2475-1839 (Print) 2475-1847 (Online) Journal homepage: https://www.tandfonline.com/loi/tjit20

Usage Apriori and clustering algorithms in WEKA tools to mining dataset of traffic accidents

Faisal Mohammed Nafie Ali & Abdelmoneim Ali Mohamed Hamed

To cite this article: Faisal Mohammed Nafie Ali & Abdelmoneim Ali Mohamed Hamed
(2018) Usage Apriori and clustering algorithms in WEKA tools to mining dataset of
traffic accidents, Journal of Information and Telecommunication, 2:3, 231-245, DOI:
10.1080/24751839.2018.1448205

To link to this article: https://doi.org/10.1080/24751839.2018.1448205

© 2018 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group

Published online: 13 Apr 2018.


Full Terms & Conditions of access and use can be found at
https://www.tandfonline.com/action/journalInformation?journalCode=tjit20
JOURNAL OF INFORMATION AND TELECOMMUNICATION
2018, VOL. 2, NO. 3, 231–245
https://doi.org/10.1080/24751839.2018.1448205

Usage Apriori and clustering algorithms in WEKA tools to mining dataset of traffic accidents
Faisal Mohammed Nafie Ali (a) and Abdelmoneim Ali Mohamed Hamed (b)
(a) Department of Computer Science, College of Science and Humanities at Alghat, Majmaah University,
Majmaah, Saudi Arabia; (b) Department of Mathematical, College of Science and Humanities at Alghat,
Majmaah University, Majmaah, Saudi Arabia

ABSTRACT
The aim of this study is to find approaches for applying association rule mining and clustering
algorithms to derive new rules from a broad set of discovered rules taken from traffic accident data
at Alghat Province in KSA. Several tools are applied in data mining to extract data. WEKA provides
implementations of learning algorithms that can be applied efficiently to any dataset. WEKA includes
many algorithms for mining data; Apriori and clustering are among the best known. Apriori is a
simple algorithm applied to mine repeated patterns from a transaction dataset in order to find
frequent itemsets and associations between itemsets. Clustering is a technique used to group a
collection of items having similar features. Association rules are applied to find connections
between data items in a transactional database, and association rule mining algorithms are used to
discover frequent associations. WEKA was used to analyse the traffic dataset, which is composed of
946 instances and 8 attributes. The Apriori algorithm and EM clustering were applied to the traffic
dataset to discover the factors that cause accidents. The results show that the Apriori algorithm
performs better than the EM clustering algorithm.

ARTICLE HISTORY
Received 1 December 2017
Accepted 1 March 2018

KEYWORDS
Data mining; association rules; clustering; traffic accidents; Apriori; EM algorithm

1. Introduction
A significant amount of data is stored in databases, and with the rapid spread of data warehouses
it has become necessary to find techniques that extract information and knowledge from these stored
data for use in problem-solving and decision-making, using modern computer applications and the
smart technology known as artificial intelligence. Data mining is an analytical process that
combines artificial intelligence, statistics, and machine learning. It is considered a step in
knowledge discovery in databases. Data mining and machine learning are topics in artificial
intelligence that focus on pattern discovery, prediction, and forecasting based on properties of
gathered data (Witten, Frank, Hall, & Pal, 2016).

CONTACT Faisal Mohammed Nafie Ali [email protected] Department of Computer Science, College of Science and
Humanities at Alghat, Majmaah University, Majmaah 11952, Saudi Arabia
© 2018 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Data mining is an iterative process in which progress is defined by discovery, through either
automatic or manual methods. Data mining activities can be put into one of two classes: predictive
and descriptive. Predictive mining produces a model of the system described by the given dataset.
Descriptive mining generates new, non-trivial information based on the available data (Han, Pei, &
Kamber, 2011).
Several tools are used in data mining to extract data, such as R programming, SPSS, IBM Clementine,
WEKA, KNIME, and Orange. Table 1 presents a comparison of several data mining tools and shows the
advantages and disadvantages of these tools (Solanki, 2013).
This research aims to suggest an approach for employing association rule mining and clustering
algorithms, by means of a data mining tool, to derive new rules from a broad set of discovered rules
taken from traffic accident data at Alghat Province in KSA over four years (1432, 1434, 1435, and
1436).
Clustering is the task of assigning a set of items to groups so that the elements in the same
cluster are more similar to each other than to those in another cluster. Clustering is an essential
task of exploratory data mining and a common method for statistical data analysis used in many
fields, including machine learning, pattern recognition, image analysis, information retrieval, and
bioinformatics. It offers a better interface to the user than other data mining tools (Han et al.,
2011). Clustering is thus a technique for grouping a set of items having similar features.
Association rules are applied to find connections between data items in a transactional database.
Association rule mining algorithms are used to discover frequent associations (Amira, Pareek, &
Araar, 2015).
Many algorithms are used to mine data. In this paper, the authors attempt to find the best
association rules using the WEKA data mining tools. Apriori and clustering are among the best-known
algorithms. Apriori is a simple algorithm applied to mine repeated patterns from a transaction
database. Apriori achieves good performance by reducing the size of the candidate sets. However, in
situations with very many frequent itemsets, large itemsets, or a very low minimum support, it still
suffers from the cost of generating a massive number of candidate sets (Wu et al., 2008). The
objective of using Apriori

Table 1. Advantages and disadvantages of data mining tools.

Tool | Advantages | Disadvantages
R-programming | Purely statistical, R has better graphics, easier to combine with other statistical calculations | Less specialized for data mining, requires knowledge of array language
WEKA | Easiness of use, can be extended in RM. WEKA product gives beginners users a tool to discover hidden information from database and file systems with simple to use options and visual interfaces | Poor documentation, weak classical statistics, poor parameter optimization, weak CSV reader
KNIME | Molecular analysis, Mass spectrometry. Chemistry Development Kit | Limited error measurements, no wrapper methods for descriptor selection, poor parameter optimization
ORANGE | Better debugger, Shortest scripts, poor statistics, suitable for novice experts | Big Installation, limited reporting capabilities
IBM SPSS | SPSS is a good tool for non-programmers. Easy to understand and use, with extensive documentation of use, and is quite strong with quantitative analysis. SPSS is non-code programme | SPSS their graphical results not improving and model intervention is limited

algorithm is to find frequent itemsets and the associations between different itemsets, that is,
association rules. Apriori is easy to implement. The algorithm uses information from previous steps
to produce the frequent itemsets (Shweta & Garg, 2013). Apriori is the simplest algorithm employed
for mining repeated patterns from a transaction database. We aimed to run the Apriori algorithm for
this study, and we used WEKA to carry out the association rule mining process. The benefits of the
Apriori algorithm are that it uses the large itemset property, is easily parallelized, and is
simple to implement. Apriori is an efficient algorithm for finding all frequent itemsets.
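
The level-wise search described above, in which each pass reuses the frequent itemsets of the previous pass, can be sketched in a few lines of Python. This is a minimal illustration and not WEKA's implementation: the function name `apriori`, the sample records, and the 0.6 threshold are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return frequent


# Hypothetical accident records, invented purely for illustration.
records = [
    {"day", "collision", "highway"},
    {"day", "collision", "highway"},
    {"day", "collision", "street"},
    {"night", "rollover", "highway"},
    {"day", "rollover", "highway"},
]
result = apriori(records, min_support=0.6)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

The prune step is where the candidate sets shrink: a k-item candidate is counted against the data only if all of its (k-1)-item subsets survived the previous pass.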
The EM algorithm is a general method for finding the maximum likelihood estimate of a data
distribution when data are partially missing or hidden (Prajwala & Sangeeta, 2014). An advantage of
the EM algorithm is that it gives useful results for real-world datasets. Moreover, the algorithm
is suitable when one wants to perform cluster analysis of a small scene or region of interest and
is not satisfied with the results obtained from other algorithms (Sharma, Bajpai, & Litoriya,
2012). EM is an essential algorithm for data mining. We used this algorithm when we were not
satisfied with the results of other algorithms. EM was chosen to cluster the data for several
reasons: first, it has a robust statistical basis; second, it is linear in database size; third, it
is robust to noisy data; fourth, it can accept the desired number of clusters as input; fifth, it
can handle high dimensionality; and finally, it converges fast given a proper initialization
(Abbas, 2008). It also offers guarantees about optimality and easily explainable results (Ordonez &
Omiecinski, 2002). EM also has several disadvantages: the algorithm is highly involved, it is hard
to initialize, and the quality of the final solution depends on the quality of the initial solution
(Slimani & Lazzez, 2014).

2. WEKA
WEKA is a collection of modern machine learning methods and data preprocessing tools. It is
described as a set of machine learning approaches for data mining tasks (Seppelt, Voinov, & Lange,
2012). It is designed so that users can quickly try out existing machine learning methods on new
datasets in very flexible ways (Frank et al., 2009). The workbench contains techniques for the main
data mining problems: regression, classification, clustering, association rule mining, conception,
and attribute selection (Seppelt et al., 2012). It is well suited for developing new machine
learning methods (Hall et al., 2009). The user can access its components through Java programming
or a command-line interface. It provides graphical user interfaces in the WEKA Knowledge Flow
Environment, which features visual programming, and in the WEKA Explorer (Parikh & Tirkha, 2013).
There are three additional graphical user interfaces to WEKA. The Knowledge Flow interface allows
the user to design configurations for streamed data processing. WEKA's third interface, the
Experimenter, is intended to help the user answer a basic practical question when applying
classification and regression techniques: which methods and parameter values work best for the
given problem? The fourth interface, called the Workbench, is a unified graphical user interface
that combines the other three into one application (Witten et al., 2016). In this study, we chose
WEKA over other software tools on the market because it is a package that can be recommended both
for people who are beginners with such software and for those who are very adept. The software is
robust and contains many built-in features that require no programming or coding knowledge. WEKA
has therefore become very common among academic and industrial researchers and is also widely used
for teaching. WEKA is well suited for mining association rules and is powerful in machine learning
techniques. WEKA is user-friendly, with a graphical interface that allows for fast setup and
operation. WEKA works on the assumption that the user's data are available as a flat file or
relation. In other words, each data object is described by a fixed number of attributes that are
usually of a specific type, normally alphanumeric or numeric values (Ramamohan, Vasantharao,
Chakravarti, & Ratnam, 2012).
WEKA offers implementations of learning algorithms that can easily be applied to a dataset. It
contains a variety of tools for transforming datasets, such as algorithms for discretization and
sampling (Witten et al., 2016).
WEKA makes it easy to compare different solution strategies based on the same evaluation method and
to identify the one that is most appropriate for the problem at hand. It is implemented in Java and
runs on almost any computing platform (Frank, Hall, Trigg, Holmes, & Witten, 2004). WEKA provides
implementations of learning algorithms that can quickly be applied to any dataset, together with a
variety of tools for transforming datasets (Frank et al., 2009). WEKA is an open-source software
tool for implementing machine learning algorithms.

3. Related work
Bansal and Bhambhu (2013) reported that association rule mining deals with frequent itemsets, as
done by many association algorithms such as Apriori, which is widely used in real-life
applications. The authors cover the use of association rule mining to extract patterns that occur
frequently within a dataset and explain the implementation of the Apriori algorithm in WEKA on a
dataset of crimes against women gathered from a Session court. The paper studies two association
rule algorithms, Apriori and Predictive Apriori, and compares the results of both using the WEKA
tool. The resulting rules clearly show that Apriori performs better and faster than Predictive
Apriori. The paper indicates that the analysis of frequent pattern matching based on support and
confidence measures has produced excellent results in various fields.
The Apriori algorithm can be seen as a two-stage operation:

(I) Find all itemsets whose support is greater than or equal to the user-specified
minimum support.
(II) Generate all rules whose confidence is greater than or equal to the user-specified
minimum confidence.
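
For reference, the two measures used in stages (I) and (II) are usually defined as follows for a rule X ⇒ Y over a transaction database D. These are the standard textbook definitions rather than notation taken from the paper; the symbols supp, conf, and the threshold names are assumptions of this note.

```latex
\[
\operatorname{supp}(X) = \frac{\lvert \{\, t \in D : X \subseteq t \,\} \rvert}{\lvert D \rvert},
\qquad
\operatorname{conf}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} .
\]
\[
\text{Stage (I): keep } X \text{ if } \operatorname{supp}(X) \ge \mathit{minsup};
\qquad
\text{Stage (II): keep } X \Rightarrow Y \text{ if } \operatorname{conf}(X \Rightarrow Y) \ge \mathit{minconf}.
\]
```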

The research explained that association rules play a key role in mining data for frequent itemsets.
The Apriori algorithm was applied to find and understand the underlying patterns in the court's
records from the data contained in various sections.
Amira, Pareek, and Araar (2015) noted that association rule mining algorithms are commonly applied
to find all rules in a database that satisfy some minimum support and minimum confidence
constraints. To reduce the number of generated rules, past research investigated adapting the
association rule mining algorithm to mine only a particular subset of association rules in which
the classification class attribute is assigned to the right-hand side. In this study, a dataset
about traffic accidents was gathered from Dubai Traffic Department, UAE. After data preprocessing,
the Apriori and Predictive Apriori association rule algorithms were applied to the dataset to
explore the link between recorded accident factors and accident severity in the Dubai area. Two
sets of class association rules were generated using the two algorithms and summarized to get the
most interesting rules using technical measures. Experimental results showed that the class
association rules generated by Apriori were more effective than those generated by Predictive
Apriori. More associations between accident factors and accident severity level were found when
applying Apriori. The paper showed that when the rule cover method was applied to the generated
class association rules, many of the rules produced by Apriori were eliminated, and the remaining
rules were more effective than those generated by Predictive Apriori.
Shrivastava and Panda (2014) explained that several algorithms have been developed to mine
association rules from huge databases. The authors stated that Apriori is the most common algorithm
for mining association rules from a dataset. Various tools exist to run the Apriori algorithm. WEKA
is an open-source software tool for implementing machine learning algorithms. The study defined
WEKA as a collection of tools for performing data mining, including the application of association
rules. Association rules are formed by analysing data for patterns and using the support and
confidence criteria to identify the most important relationships. They are divided into separate
classes in data mining and used in WEKA to perform these operations. The result shows that the
Apriori algorithm generates the best association rules for the dataset when run in WEKA. The
implementation of Apriori could be made more compatible and useful in future by implementing new
association algorithms for other operations and analyses in WEKA.
Agrawal and Agrawal (2017) gave a detailed description of the analysis of clustering algorithms in
the WEKA tool. The paper defined clustering as a method used in several areas such as image
analysis, pattern recognition, and statistical data analysis. Clustering is a partition of data
into sets of similar items. Every cluster contains items that are similar to one another and
dissimilar to the objects of other sets. Several clustering algorithms exist to produce clusters
(Chauhan, Kaur, & Alam, 2010). The WEKA tool was used to compare different clustering algorithms
because it provides a better interface to the user than other data mining tools. In this paper, the
authors analysed and compared various clustering algorithms using WEKA to find out which algorithm
is most convenient for users performing clustering. The paper presents applications of the WEKA
data mining tool, which can cluster huge datasets, and notes that clustering helps in optimizing
search engines.

Verma, Srivastava, Chack, Diswar, and Gupta (2012) described the EM algorithm as an iterative
procedure for finding maximum likelihood or maximum a posteriori estimates of parameters in
statistical models, where the model depends on unobserved latent variables. The EM iteration
alternates between an expectation (E) step, which computes the expectation of the log-likelihood
evaluated using the current estimate of the parameters, and a maximization (M) step, which computes
the parameters that maximize the expected log-likelihood found in the E step. These parameter
estimates are then used to determine the distribution of the latent variables in the next E step.
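
The alternation between the E step and the M step can be made concrete with a small sketch. The code below fits a two-component Gaussian mixture to one-dimensional data; it is an illustration under stated assumptions, not the EM clusterer shipped with WEKA. The function names, the initialization from the data extremes, the fixed iteration count, and the sample values are all invented for the example.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    # Crude initialization (an assumption of this sketch): start at the extremes.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E step: posterior probability that each point came from each component,
        # computed with the current parameter estimates.
        resp = []
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])

        # M step: re-estimate the parameters that maximize the expected
        # log-likelihood under those posteriors.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # guard against a collapsing component

    return weights, means, variances


# Hypothetical measurements forming two well-separated groups.
sample = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_two_gaussians(sample)
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

Each point is assigned to both components with a probability rather than to a single cluster, which is the main difference from K-means.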
Tanna and Ghodasara (2014) discussed the use of Apriori through WEKA for frequent pattern mining.
The paper described Apriori as a very effective algorithm for extracting frequent itemsets for
Boolean association rules. It concluded that Apriori is a simple algorithm applied to mine repeated
patterns from a transaction database. The paper presented the use of the WEKA tools for association
rule mining with the Apriori algorithm. The authors ran Apriori to obtain the association rules
with minimum support = 50% and minimum confidence = 50% using the WEKA GUI. They implemented the
Apriori algorithm for research purposes and used WEKA to illustrate the process of association rule
mining.
AprioriTID: Slimani and Lazzez (2014) reported on AprioriTID, proposed by Agrawal and Srikant
(1994). This algorithm has the additional property that the database is not used at all for
counting the support of candidate itemsets after the first pass. Instead, an encoding of the
candidate itemsets used in the previous pass is employed for this purpose. The most critical tasks
of frequent pattern mining approaches are itemset mining, sequential pattern mining, sequential
rule mining, and association rule mining. Apriori is among the first proposed algorithms dealing
with the association rule problem. Together with Apriori, the AprioriTid and AprioriHybrid
algorithms were proposed. For smaller problem sizes, AprioriTid performs as well as Apriori, but it
becomes about twice as slow when applied to large problems. The support counting method of the
Apriori algorithm has attracted a large body of research because of its effect on the performance
of the algorithm. A large number of efficient data mining algorithms exist in the literature for
mining frequent patterns. This study offered a summary of the status and future directions of
frequent pattern mining.
Hierarchical clustering: Steinbach, Karypis, and Kumar (2000) presented the results of an
experimental study of common document clustering techniques: agglomerative hierarchical clustering
and K-means. The authors noted that hierarchical clustering is often portrayed as the better
quality clustering approach but is limited because of its quadratic time complexity. The runtime of
bisecting K-means is very attractive compared with that of hierarchical clustering methods.
Moreover, their experiments showed that a simple and efficient variant of K-means, 'bisecting'
K-means, can produce clusters of documents that are better than those produced by 'regular' K-means
and as good as or better than those produced by agglomerative hierarchical clustering techniques.
The study also proposed what its authors consider a reasonable explanation for this behaviour.
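
The 'bisecting' variant mentioned above can be sketched as follows: start with one cluster holding all points, then repeatedly split one cluster in two with ordinary 2-means until the requested number of clusters is reached. This is a simplified illustration, not the procedure evaluated by Steinbach et al.: choosing the cluster with the largest squared error to split, the single trial per split, the seed, and the sample points are assumptions of this sketch.

```python
import random


def mean(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(cluster):
    # Within-cluster sum of squared distances to the cluster mean.
    m = mean(cluster)
    return sum(sq_dist(p, m) for p in cluster)


def two_means(points, rng, iterations=20):
    """Split one cluster into two halves with ordinary K-means (k = 2)."""
    centers = rng.sample(points, 2)
    for _ in range(iterations):
        halves = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            halves[nearer].append(p)
        if not halves[0] or not halves[1]:
            break
        centers = [mean(halves[0]), mean(halves[1])]
    return halves


def bisecting_kmeans(points, k, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Assumption of this sketch: always bisect the cluster with the largest error.
        worst = max(clusters, key=sse)
        clusters.remove(worst)
        a, b = two_means(worst, rng)
        clusters.extend([a, b])
    return clusters


# Hypothetical 2-D points in three loose groups (distinct points are assumed).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
clusters = bisecting_kmeans(pts, k=3)
for c in clusters:
    print(sorted(c))
```

Because each split is a 2-means run on a subset of the data, the cost grows roughly linearly with the number of points, which is the runtime advantage over agglomerative methods noted in the text.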
FP-growth: Novitasari, Hermawan, Abdullah, Sembiring, and Herawan (2015) noted that candidate set
generation and testing are two major drawbacks of Apriori-like algorithms. To deal with this
problem, a new data structure called the frequent pattern tree (FP-Tree) was introduced. FP-Growth
was then developed based on this data structure and is currently a benchmark and the fastest
algorithm for mining frequent itemsets (Lee, Kim, Cai, & Han, 2003). The benefit of FP-Growth is
that it needs only two scans of the transaction database. First, it scans the database to compute
a list of frequent items sorted in descending order of frequency and eliminates rare items. Then,
it scans the database again to compress it into an FP-Tree structure and mines the FP-Tree
recursively by constructing conditional FP-Trees.
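
The two database scans can be illustrated with a short sketch that builds the tree but leaves out the recursive mining of conditional FP-Trees. The class `FPNode`, the function `build_fp_tree`, the alphabetical tie-breaking, and the sample transactions are assumptions made for the example, not details from the cited papers.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Scan 1: count items and drop those below the minimum count.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Rank by descending count; ties are broken alphabetically so the order is stable.
    ranked = sorted(frequent, key=lambda i: (-frequent[i], i))
    order = {item: rank for rank, item in enumerate(ranked)}

    # Scan 2: insert each filtered, reordered transaction as a path in the tree.
    root = FPNode(None)
    for t in transactions:
        path = sorted((i for i in set(t) if i in frequent), key=order.get)
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1  # shared prefixes accumulate counts
    return root, order


# Hypothetical accident records, invented purely for illustration.
rows = [
    ["day", "collision", "highway"],
    ["day", "collision", "highway"],
    ["day", "collision", "street"],
    ["night", "rollover", "highway"],
    ["day", "rollover", "highway"],
]
root, order = build_fp_tree(rows, min_count=2)
print(order)
print(root.children["day"].count, root.children["day"].children["highway"].count)
```

Transactions that share a prefix share nodes, which is how the tree compresses the database and why no candidate itemsets need to be generated.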
Mansouri and Javad Kargar (2014) noted that driving accidents have always been counted among the
most likely causes of death in today's societies. In their study, the rules and factors behind road
traffic accidents were mined, and a local data model was extracted, after collecting data from a
variety of sources followed by data integration, data cleaning, and removal of inconsistent data.
The study used data mining methods such as clustering and decision trees. Its objective was to
analyse and monitor road traffic accidents on suburban roads in Isfahan Province using data mining
techniques. The results obtained are interesting and significant and can be considered by
authorities as valuable information for reducing road accidents. Five data mining algorithms were
used in the study for knowledge discovery from the accident dataset. The C5.0 decision tree
algorithm generated the best results and performance. Clustering of the data was also performed but
did not result in a separation into clusters with a specific meaning. Based on the clustering
results, it can be concluded that each route follows its own pattern, and differentiating the data
by time, vehicle, and road status does not generalize to all routes. In determining the accident
type (casualty, fatal, or car crash), the most important characteristic was the type of vehicle.
Verma, Srivastava, Chack, Diswar, and Gupta (2012) described data clustering as a process of
putting similar data into groups. The paper showed that a clustering algorithm partitions a dataset
into several groups such that the similarity within a group is larger than the similarity among
groups. The study reviews six types of clustering techniques: K-means clustering, hierarchical
clustering, DBSCAN clustering, density-based clustering, OPTICS, and the EM algorithm. These
clustering techniques are implemented and analysed using the clustering tool WEKA. The performance
of the six techniques is presented and compared. The paper reported several findings: the
performance of the K-means algorithm is better than that of the hierarchical clustering algorithm;
all the algorithms have some ambiguity in some (noisy) data when clustered; K-means and EM are very
sensitive to noise in a dataset, which makes it difficult for the algorithm to cluster data into
suitable clusters and affects its result; K-means is faster than the other clustering algorithms
and produces quality clusters; and the hierarchical clustering algorithm is more sensitive to noisy
data.
Prajwala and Sangeeta (2014) considered two clustering algorithms, EM and a density-based
algorithm. The EM algorithm is a general method of finding the maximum likelihood estimate of a
data distribution when data are missing or hidden. In density-based clustering, clusters are dense
areas in the data space, separated by regions of lower object density. The paper used WEKA, an
open-source tool, to compare these two algorithms. In terms of likelihood, the EM algorithm is
better than the density-based algorithm; the density-based algorithm takes less time than the EM
algorithm to build the model.

Kumar and Rukmani (2010) studied web usage mining and, in particular, focused on finding the web
usage patterns of websites from server log files. The study used the Apriori algorithm and the
Frequent Pattern Growth algorithm and compared their memory usage and time usage. The
characteristics of the Apriori algorithm are that it uses the large itemset property, is easily
parallelized, and is easy to implement. The research also showed some limitations of the Apriori
algorithm. Handling a huge number of candidate sets is costly. It is tedious to repeatedly scan the
database and check a large set of candidates by pattern matching, which is especially true for
mining long patterns. The main advantages of the FP-growth algorithm are that it uses a compressed
data structure and avoids repeated database scans. The main drawback of the FP-growth algorithm is
that it lacks a good candidate generation method. Future research can combine FP-Tree with Apriori
candidate generation to address the drawbacks of both Apriori and FP-growth.
Krömer et al. (2013) investigated a dataset describing traffic accidents in Ethiopia and used a
machine learning method based on artificial evolution and fuzzy systems to mine symbolic
descriptions of selected features of the dataset. The paper noted that
there are simple fuzzy classifiers as well as complex rule-based fuzzy classification
systems that usually build and maintain sophisticated rule bases. The popularity of
fuzzy classifiers can be attributed to their ability to perform soft classification, to assign
multiple labels to data samples, and to the ease of their interpretation. This study com-
pared the ability of evolutionary fuzzy rules to evolve classifiers for binary and multi-
class attributes. While the rules for a binary attribute were successful, the artificial evol-
ution as implemented in this work was not able to find fuzzy rules that would accurately
classify data according to selected multi-class attributes.
Rai, Verma, and Thoke (2012) described MSApriori as an association rule mining algorithm designed
to mine frequent itemsets that include rare items and to give better performance than approaches
that employ a single minimum support. In association rule mining, the MSApriori algorithm plays an
important role because it considers rare itemsets. The paper proposed a novel approach, the
MSApriori-T algorithm, which uses a total support tree (T-tree) structure to make MSApriori more
efficient. The T-tree stores each item as a node in a tree, with links to its child nodes. To
overcome the drawback that MSApriori has high storage requirements and processing time, the authors
proposed an approach that combines the MSApriori algorithm with a total support tree storage
structure, resulting in a more efficient algorithm in terms of storage requirement and processing
time.

4. Methodology
The importance of this research lies in suggesting a way of using data mining algorithms to
determine the causes of accidents in terms of time, road, driver nationality, and type of accident
from a large set of discovered rules extracted from real traffic accident data from Alghat. The
study was based on traffic accident data taken from the public traffic department in Alghat
Province in KSA over four years (1432, 1434, 1435, and 1436). One of the main obstacles the
researchers faced when collecting data from the traffic department was that the accident
information in the traffic registration form is incomplete. For this reason, many variables were
left out of the analysis, such as driver age, driver health status, driver behaviour, and weather
conditions. WEKA was used for preprocessing and analysing the data. In WEKA, we applied two
algorithms, the Apriori algorithm for association rules and the EM clustering algorithm. A
comparison between these algorithms (Apriori and clustering) was made to discover the factors that
caused accidents.
Figure 1 shows an Attributes relationship File Format (ARFF) for the traffic accident
dataset after converted it from excel file. The header of the data is started with the
name of the relationship (traffic), and a block knows the attributes (year, type of accident,
location, number of vehicles, driver, injured, and death). Also, the @data line includes the
values entries for each attribute. It is prepared dataset in Attribute relation format file to
execute in WIKA interface. ARFF format just gives a dataset; it cannot appoint which of

Figure 1. Traffic accident dataset an ARFF file.

Figure 2. Opening ARFF file in WEKA explorer.



the attributes is the one to be predicted. The same file can therefore be used with the
different algorithms available in the WEKA software.
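
To make the file layout concrete, the following Python sketch builds a small ARFF text with a relation name, a block of nominal attribute declarations, and an @data section. It is only an illustration: the helper name to_arff, the three attributes chosen, the value "Other", and the two rows are assumptions made for this example and are not taken from the authors' file.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text: @relation, one @attribute line per nominal attribute, then @data."""
    lines = [f"@relation {relation}", ""]
    for name, values in attributes:
        # Nominal attributes list their allowed values inside braces.
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    for row in rows:
        # One comma-separated line per instance, in attribute order.
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and rows, for illustration only.
attributes = [
    ("time", ["Day", "Night"]),
    ("location", ["Highway", "Other"]),
    ("injured", ["0", "1"]),
]
rows = [("Day", "Highway", "0"), ("Night", "Other", "1")]

arff_text = to_arff("traffic", attributes, rows)
print(arff_text)
```

Nothing in this layout marks one attribute as the class to predict, which is why the same file can be passed to association and clustering algorithms alike.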
Figure 2 displays the ARFF file for the traffic accident dataset during pre-processing
in the WEKA Explorer. The file contains 8 attributes and 946 instances. At this stage, the
data are ready for mining and information extraction using the various algorithms
supported by the WEKA tools.

Figure 3. Apply the Apriori algorithm.

Figure 4. Apply the EM cluster algorithm in WEKA.



Figure 3 shows the use of the Apriori algorithm to find the best results with a minimum
support (minSupport) of 0.4 and a minimum confidence of 0.9.
Figure 4 shows the best results obtained by the EM clustering algorithm.
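
The level-wise search behind Apriori, with a minimum support of 0.4 and a minimum confidence of 0.9, can be sketched in a few lines of Python. This is a minimal illustration and not WEKA's implementation; the function names apriori and association_rules and the five toy transactions are assumptions made for the example, and the toy rows are not the Alghat data.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset whose support meets min_support."""
    sets = [frozenset(t) for t in transactions]
    min_count = min_support * len(sets)
    # Level 1: count single items.
    counts = {}
    for t in sets:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        # Join frequent (k-1)-itemsets, keeping candidates whose subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count each candidate against the data and keep the frequent ones.
        current = {}
        for cand in candidates:
            count = sum(1 for t in sets if cand <= t)
            if count >= min_count:
                current[cand] = count
        frequent.update(current)
        k += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Return (premise, consequent, confidence) for rules meeting min_confidence."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for premise in map(frozenset, combinations(itemset, r)):
                confidence = count / frequent[premise]
                if confidence >= min_confidence:
                    rules.append((premise, itemset - premise, confidence))
    return rules


# Toy transactions, for illustration only.
toy = [
    {"time=Day", "injured=0", "Death=0"},
    {"time=Day", "injured=0", "Death=0"},
    {"time=Night", "injured=0", "Death=0"},
    {"time=Day", "injured=1", "Death=0"},
    {"time=Night", "injured=1", "Death=1"},
]
frequent = apriori(toy, min_support=0.4)
rules = association_rules(frequent, min_confidence=0.9)
for premise, consequent, confidence in rules:
    print(sorted(premise), "==>", sorted(consequent), round(confidence, 2))
```

The pruning step relies on the fact that every subset of a frequent itemset is itself frequent, so a candidate with an infrequent subset never needs to be counted.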

5. Results and discussions


After the Apriori algorithm was executed, we obtained several sets of results, grouped by
the size of the large itemsets. Table 2 presents the 10 large itemsets of size 1. The results
show that the highest number of accidents occurred in 1434, most of the incidents happened
during the day, and the most common type of accident was a collision with another vehicle.
Most accidents occurred on the highway, and the drivers who caused the accidents were most
often non-Saudis. Most accidents happened between two cars or more. The total number
of accidents was 946; there were 171 injured and 35 deaths.
Table 3 shows the 4 large itemsets that each combine three attribute values. The output
shows that most of the incidents occurred during the day and that the most common type of
accident was a collision with another vehicle. The highest number of accidents occurred on
the highway, with 51%. Most accidents happened between two vehicles or more.
Table 4 displays the results for the 13 large itemsets of size 2. The highest number of
accidents occurred in 1434, most of the incidents happened during the day, the most common
type of accident was a collision with another vehicle, most accidents occurred on the
highway, and the drivers who caused the accidents were most often non-Saudis.
Table 5 shows the best rules found by the Apriori algorithm. The rules are ranked by
comparing confidence, leverage, and conviction. All rules in this table support the results
presented in the tables above.
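
As a worked check of the rule measures, the Python sketch below recomputes confidence, lift, leverage, and conviction for rule 3 of Table 5 (injured = 0 ==> Death = 0) from the counts reported in the tables: 946 accidents, 775 with injured = 0, 911 with Death = 0, and 762 with both. The function name rule_metrics is hypothetical. The conviction formula used here adds one to the number of counter-examples; this variant is an assumption inferred from the fact that it reproduces the reported conv values, and the paper itself does not state the formula.

```python
def rule_metrics(n_total, n_premise, n_consequent, n_both):
    """Interest measures for a rule premise ==> consequent, computed from raw counts."""
    confidence = n_both / n_premise
    p_consequent = n_consequent / n_total
    # Lift: confidence relative to the baseline frequency of the consequent.
    lift = confidence / p_consequent
    # Leverage: observed joint frequency minus the frequency expected under independence.
    leverage = n_both / n_total - (n_premise / n_total) * p_consequent
    # Conviction with one added to the counter-example count (assumed variant, see lead-in).
    expected_counter_examples = n_premise * (n_total - n_consequent) / n_total
    conviction = expected_counter_examples / (n_premise - n_both + 1)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Counts for rule 3 in Table 5: injured = 0 (775) ==> Death = 0 (762).
metrics = rule_metrics(n_total=946, n_premise=775, n_consequent=911, n_both=762)
for name, value in metrics.items():
    print(name, round(value, 2))
```

Rounded to two decimals, these values agree with the conf, lift, lev, and conv figures listed for rule 3. A lift this close to 1 indicates that the premise changes the frequency of Death = 0 only slightly, since Death = 0 already holds for 911 of the 946 accidents.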

Table 2. Size of a set of large itemsets L(1): 10.

Minimum support: 0.4. Confidence: 0.9.

List of attributes       Result                                        Count
1. Year                  year = 1434                                   407
2. Time                  time = Day                                    542
                         time = Night                                  404
3. Type of accident      type_of_cc = collision_with_other_vehicle     496
4. Location              location = Highway                            487
5. Number of vehicles    number_of_v = 2                               438
                         number_of_v = 1                               478
6. Driver                driver = both_sides_non_Saudi                 380
7. Injured               injured = 0                                   775
8. Death                 Death = 0                                     911

Table 3. Size of a set of large itemsets L(3): 4.

Minimum support: 0.4. Confidence: 0.9.
List of attributes: 1. Year; 2. Time; 3. Type of accident; 4. Location; 5. Number of vehicles; 6. Driver; 7. Injured; 8. Death.

Result                                                                   Count
time = Day injured = 0 Death = 0                                         447
type_of_cc = collision_with_other_vehicle number_of_v = 2 Death = 0      406
type_of_cc = collision_with_other_vehicle injured = 0 Death = 0          411
location = Highway injured = 0 Death = 0                                 385

Table 4. Size of a set of large itemsets L(2): 13.

Minimum support: 0.4. Confidence: 0.9.
List of attributes: 1. Year; 2. Time; 3. Type of accident; 4. Location; 5. Number of vehicles; 6. Driver; 7. Injured; 8. Death.

Result                                                           Count
year = 1434 Death = 0                                            387
time = Day injured = 0                                           453
time = Day Death = 0                                             525
time = Night Death = 0                                           386
type_of_cc = collision_with_other_vehicle number_of_v = 2        419
type_of_cc = collision_with_other_vehicle injured = 0            417
type_of_cc = collision_with_other_vehicle Death = 0              480
location = Highway injured = 0                                   392
location = Highway Death = 0                                     471
number_of_v = 2 Death = 0                                        425
number_of_v = 1 injured = 0                                      383
number_of_v = 1 Death = 0                                        459
injured = 0 Death = 0                                            762

Table 5. Best rules using Apriori algorithm.

List of attributes: 1. Year; 2. Time; 3. Type of accident; 4. Location; 5. Number of vehicles; 6. Driver; 7. Injured; 8. Death.

Best rules found:
1. time = Day injured = 0 453 ==> Death = 0 447 <conf:(0.99)> lift:(1.02) lev:(0.01) [10] conv:(2.39)
2. type_of_acc = collision_with_other_vehicle injured = 0 417 ==> Death = 0 411 <conf:(0.99)> lift:(1.02) lev:(0.01) [9] conv:(2.2)
3. injured = 0 775 ==> Death = 0 762 <conf:(0.98)> lift:(1.02) lev:(0.02) [15] conv:(2.05)
4. location = Highway injured = 0 392 ==> Death = 0 385 <conf:(0.98)> lift:(1.02) lev:(0.01) [7] conv:(1.81)
5. number_of_v = 2 438 ==> Death = 0 425 <conf:(0.97)> lift:(1.01) lev:(0) [3] conv:(1.16)
6. type_of_acc = collision_with_other_vehicle number_of_v = 2 419 ==> Death = 0 406 <conf:(0.97)> lift:(1.01) lev:(0) [2] conv:(1.11)
7. time = Day 542 ==> Death = 0 525 <conf:(0.97)> lift:(1.01) lev:(0) [3] conv:(1.11)
8. type_of_acc = collision_with_other_vehicle 496 ==> Death = 0 480 <conf:(0.97)> lift:(1) lev:(0) [2] conv:(1.08)
9. location = Highway 487 ==> Death = 0 471 <conf:(0.97)> lift:(1) lev:(0) [2] conv:(1.06)
10. number_of_v = 1 478 ==> Death = 0 459 <conf:(0.96)> lift:(1) lev:(-0) [-1] conv:(0.88)

Table 6. The distribution of accidents per year.

Year     Total number of accidents   Death   Injured
1432     292                         5       58
1434     407                         1       81
1435     112                         21      22
1436     135                         8       10
Total    946                         35      171

Table 7. Summarized results obtained by using EM clustering algorithm.

Attribute             Result                         Cluster 0   Cluster 1   Cluster 2   Cluster 3   Cluster 4
                                                     (0.29)      (0.1)       (0.44)      (0)         (0.16)
Year                  1434                           152.0739    10.6103     183.5875    1           64.7282
Time                  Day                            168.9914    46.8987     245.2619    1           84.848
Type of accident      Collision with other vehicle   2.3265      89.8892     404.5182    1           3.2661
Location              Highway                        154.6209    52.2836     186.8669    1           97.2285
Number of vehicles    1                              277.908     50.6856     1.7462      1           151.6602
Driver                Both sides non Saudi           142.2255    40.4978     125.3396    1           75.9372
Injured               0                              253.7052    66.4617     364.6537    1           94.1795
Death                 Death = 0                      266         75          421         0           149
                      Death = 1                      2           3           12          0           13
                      Death = 2                      0           0           1           0           4

Table 6 presents the number of accidents that happened in the four years (1432, 1434,
1435, and 1436). The data show that the most accidents and injuries occurred in 1434,
and the highest number of deaths occurred in 1435.
Table 7 summarizes the results obtained using the EM clustering algorithm.
The results of the current study showed that, for the EM clustering algorithm, the
time taken to build the model was 1.58 s and the log-likelihood was -7.46685. The
log-likelihood here indicates how well the fitted clusters describe the data
elements. The EM algorithm is a general statistical method of maximum likelihood
estimation. EM clustering may converge to a poor locally optimal solution; therefore, it
needs an unknown number of iterations to converge to a good solution (Ordonez
& Omiecinski, 2002). When applying the Apriori algorithm in this research, we obtained
the best results, because the Apriori algorithm is an efficient algorithm for
finding all frequent itemsets. Therefore, the Apriori algorithm was more effective than
the EM clustering algorithm.
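
To illustrate what EM clustering does on nominal attributes, the sketch below fits a mixture of independent categorical attributes in plain Python and records the average log-likelihood per row at each iteration. It is a simplified illustration under stated assumptions and not WEKA's EM implementation: the function name em_categorical, the random initialisation, the starting count of one per attribute value, the fixed number of iterations, and the toy rows are all choices made for this example.

```python
import math
import random


def em_categorical(rows, n_clusters, n_iter=20, seed=0):
    """EM for a mixture of independent categorical attributes.

    Returns (priors, weighted_counts, average log-likelihood per iteration).
    """
    rng = random.Random(seed)
    n_attrs = len(rows[0])
    values = [sorted({row[j] for row in rows}) for j in range(n_attrs)]
    # Random soft assignment of every row to the clusters.
    resp = []
    for _ in rows:
        w = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(w)
        resp.append([x / total for x in w])
    history = []
    for _ in range(n_iter):
        # M-step: cluster priors and weighted value counts (starting count of one).
        priors = [sum(r[k] for r in resp) / len(rows) for k in range(n_clusters)]
        counts = [[{v: 1.0 for v in values[j]} for j in range(n_attrs)]
                  for _ in range(n_clusters)]
        for row, r in zip(rows, resp):
            for k in range(n_clusters):
                for j, v in enumerate(row):
                    counts[k][j][v] += r[k]
        denom = [[sum(counts[k][j].values()) for j in range(n_attrs)]
                 for k in range(n_clusters)]
        # E-step: recompute cluster memberships and accumulate the log-likelihood.
        log_lik = 0.0
        new_resp = []
        for row in rows:
            joint = []
            for k in range(n_clusters):
                p = priors[k]
                for j, v in enumerate(row):
                    p *= counts[k][j][v] / denom[k][j]
                joint.append(p)
            total = sum(joint)
            log_lik += math.log(total)
            new_resp.append([p / total for p in joint])
        resp = new_resp
        history.append(log_lik / len(rows))
    return priors, counts, history


# Toy rows (time, location), for illustration only.
toy_rows = [("Day", "Highway")] * 6 + [("Night", "Other")] * 4
priors, counts, history = em_categorical(toy_rows, n_clusters=2)
print([round(p, 2) for p in priors])
print(round(history[-1], 4))
```

Because the result depends on the random starting assignment, running the sketch with different seeds can end at different solutions, which is the local-optimum behaviour noted above.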

6. Conclusion
The aim of the study was to present the implementation of the WEKA tools in data mining
techniques. The Apriori and EM clustering algorithms were used to discover and interpret
the underlying patterns in the traffic accident dataset of Alghat Province. The rules and
results of both algorithms show that the Apriori algorithm performs better and faster than
the clustering algorithm. The paper shows that the Apriori algorithm is a simple and
efficient tool for analysing the dataset. In general, the WEKA interface is a very useful tool
in data mining, which allows the user to choose among many different algorithms and
compare them to reach the required results accurately.

Disclosure statement
No potential conflict of interest was reported by the authors.

Notes on contributors
Faisal Mohammed Nafie Ali received the B.Sc. from Omdurman Ahlia University, Faculty of Applied &
Computer Science, in Sudan, in 2001. He received the M.Sc. and Ph.D. degrees in Computer Science in
2009 and 2014, respectively, from Alneelain University, Faculty of Computer Science and Information
Technology, in Sudan. He worked in the field of education as a Computer Science teacher in Sudan
from 2002 to 2013. He has worked as an Assistant Professor at Majmaah University, Saudi Arabia, from
2014 until now. He worked as an Oracle Database Administrator at the National Pensions Fund in
Sudan from 2007 to 2014. He has experience in data mining using WEKA and Clementine. He holds
many certifications in Oracle database administration.
Abdelmoneim Ali Mohamed Hamed received the B.Sc. and M.Sc. in mathematics in Sudan, Alnileen
University Faculty of Science, in 1989 and 2005, respectively. He received the Ph.D. in applied
statistics from Sudan University for Science and Technology, Sudan, in 2012. From 1989 to 2008, he
worked in the field of education as a mathematics teacher in Sudan and Saudi Arabia. From 2009 to
2013, he worked as a lecturer at Al Ahfad University for Girls. From 2014 until now, he has worked as
an Assistant Professor at Al Majmaah University. He has experience in statistical analysis using SPSS and WEKA.

References
Abbas, O. A. (2008). Comparisons between data clustering algorithms. International Arab Journal of
Information Technology (IAJIT), 5(3), 320–325.
Agrawal, R., & Agrawal, J. (2017). Analysis of clustering algorithm of Weka tool on air pollution
dataset. International Journal of Computer Applications, 168(13), 1–5.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proc. 20th Int. Conf.
Very Large Data Bases, VLDB, vol. 1215 (pp. 487–499), September.
Amira, A., Vikas, P., & Abdelaziz, A. (2015). Applying Association Rules Mining Algorithms for Traffic
Accidents in Dubai. International Journal of Soft Computing and Engineering (IJSCE), 5(4), 1–12.
Bansal, D., & Bhambhu, L. (2013). Usage of Apriori algorithm of data mining as an application to grie-
vous crimes against women. International Journal of Computer Trends and Technology, 4(19), 3194–
3199.
Chauhan, R., Kaur, H., & Alam, M. A. (2010). Data clustering method for discovering clusters in spatial
cancer databases. International Journal of Computer Applications, 10(6), 9–14.
Frank, E., Hall, M., Holmes, G., Kirkby, R., Pfahringer, B., Witten, I. H., & Trigg, L. (2009). Weka-a machine
learning workbench for data mining. In Data mining and knowledge discovery handbook (pp.
1269–1277). Boston, MA: Springer.
Frank, E., Hall, M., Trigg, L., Holmes, G., & Witten, I. H. (2004). Data mining in bioinformatics using
Weka. Bioinformatics (Oxford, England), 20(15), 2479–2481.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data
mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18.
Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques. Elsevier.
Krömer, P., Beshah, T., Ejigu, D., Snášel, V., Platoš, J., & Abraham, A. (2013, April). Mining traffic accident
features by evolutionary fuzzy rules. In Computational Intelligence in Vehicles and Transportation
Systems (CIVTS), 2013 IEEE Symposium on (pp. 38–43). IEEE.
Kumar, B. S., & Rukmani, K. V. (2010). Implementation of web usage mining using APRIORI and FP
growth algorithms. International Journal of Advanced Networking and Applications, 1(06), 400–404.
Lee, Y. K., Kim, W. Y., Cai, Y. D., & Han, J. (2003). CoMine: Efficient Mining of Correlated Patterns. In ICDM,
3, 581–584. November.
Mansouri, M., & Javad Kargar, M. (2014). Analysis and monitoring of the traffic suburban road acci-
dents using data mining techniques; a case study of Isfahan Province in Iran. The Open
Transportation Journal, 8(1), 39–49.
Novitasari, W., Hermawan, A., Abdullah, Z., Sembiring, R. W., & Herawan, T. (2015). A method of dis-
covering interesting association rules from student admission dataset. International Journal of
Software Engineering and Its Applications, 9(8), 51–66.
Ordonez, C., & Omiecinski, E. (2002, November). FREM: Fast and robust EM clustering for large data
sets. In Proceedings of the eleventh international conference on Information and knowledge
management (pp. 590–599). ACM.
Parikh, D., & Tirkha, P. (2013). Data mining & data stream mining—open source tools. International
Journal of Innovative Research in Science, Engineering and Technology, 2(10), 5234–5239.
Prajwala, T. R., & Sangeeta, V. I. (2014). Comparative analysis of EM clustering algorithm and density
based clustering algorithm using WEKA tool. International Journal of Engineering Research and
Development, 9(8), 19–24.
Rai, D., Verma, K., & Thoke, A. S. (2012). MSApriori using total support tree data structure. International
Journal of Computer Applications, 43(23), 45–49.
Ramamohan, Y., Vasantharao, K., Chakravarti, C. K., & Ratnam, A. S. K. (2012). A study of data mining
tools in knowledge discovery process. International Journal of Soft Computing and Engineering
(IJSCE) ISSN, 2(3), 2231–2307.
Seppelt, R., Voinov, A. A., & Lange, S. (2012). Tools for environmental data mining and intelligent
decision support. Iemss. Org.
Sharma, N., Bajpai, A., & Litoriya, M. R. (2012). Comparison the various clustering algorithms of WEKA
tools. Facilities, 4(7), 78–80.

Shrivastava, A. K., & Panda, R. N. (2014). Implementation of Apriori algorithm using WEKA. KIET
International Journal of Intelligent Computing and Informatics, 1(1), 4.
Shweta, M., & Garg, D. K. (2013). Mining efficient association rules through Apriori algorithm using
attributes and comparative analysis of various association rule algorithms. International Journal
of Advanced Research in Computer Science and Software Engineering, 3(6), 306–312.
Slimani, T., & Lazzez, A. (2014). Efficient analysis of pattern and association rule mining approaches.
Journal of Information Technology and Computer Science (IJITCS), 6(3), 70–81.
Solanki, H. (2013). Comparative study of data mining tools and analysis with unified data mining
theory. International Journal of Computer Applications, 75(16), 23–28.
Steinbach, M., Karypis, G., & Kumar, V. (2000, August). A comparison of document clustering tech-
niques. In KDD workshop on text mining (Vol. 400, No. 1, pp. 525–526).
Tanna, P., & Ghodasara, Y. (2014). Using Apriori with WEKA for frequent pattern mining. arXiv preprint
arXiv:1406.7371.
Verma, M., Srivastava, M., Chack, N., Diswar, A. K., & Gupta, N. (2012). A comparative study of various
clustering algorithms in data mining. International Journal of Engineering Research and
Applications (IJERA), 2(3), 1379–1384.
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data mining: Practical machine learning tools and
techniques. Cambridge, MA: Morgan Kaufmann.
Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., & Zhou, Z. H. (2008). Top 10 algorithms
in data mining. Knowledge and Information Systems, 14(1), 1–37.
