
Data Preprocessing and Apriori Algorithm Improvement in Medical Data Mining

The document discusses using data mining techniques like association rule mining to analyze medical data. Association rule mining can analyze frequency relationships in transaction data to discover rules. Preprocessing medical data and improving association rule algorithms can help find risk factors and enhance disease prevention and health management.


Proceedings of the 6th International Conference on Communication and Electronics Systems (ICCES-2021)

IEEE Xplore Part Number: CFP21AWO-ART; ISBN: 978-0-7381-1405-7

Data Preprocessing and Apriori Algorithm Improvement in Medical Data Mining
Feng Lv
Library of Yunnan University of Chinese Medicine, Kunming 650500, Yunnan, China
[email protected]

2021 6th International Conference on Communication and Electronics Systems (ICCES) | 978-1-6654-3587-1/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICCES51350.2021.9489242

Abstract— In recent years, medical and health information systems have come into wide use, and hospitals have accumulated a large amount of medical data. The rise of mobile medicine has made medical information increasingly digitized, and the medical industry has entered a genuine era of big data. These data are extremely valuable for the diagnosis and treatment of diseases and for medical research. Unfortunately, most hospitals currently only collect and store patient medical data and lack in-depth analysis and use of it. Data mining can discover regularities in massive medical data, which gives medical personnel a novel way to acquire knowledge. Among medical data mining methods, intelligent methods such as association rules, artificial neural networks, and rough set theory show unique advantages. Among them, association rule mining analyzes, for a given set of items and a given set of transactions, how frequently itemsets occur together in the transactions.

Keywords— Medical Image, Data Mining, Preprocessing, Association rules, Medical Diagnosis

I. INTRODUCTION

In the past few years, information technology has developed rapidly, and people can collect large amounts of data through various modern data collection tools. At the same time, industries in every sector have collected information on production, management, operation, sales, scientific research, and so on, which has led to a continuous increase in the amount of data stored worldwide [1-4]. The methods for generating and collecting data keep multiplying, which has led to explosive growth in data volume. Traditional data processing methods cannot meet the higher demands now placed on data processing. Therefore, finding the data of interest within a large volume of data, and discovering the internal relationships between data and transaction phenomena, are the problems faced in data processing [5-8].

On the other hand, as the concept of "Internet +" continues to deepen, our country is gradually moving toward high-end medical systems such as smart medical care. As digitization and informatization deepen, a large amount of medical data on patients with chronic diseases has been generated, including physical examination data, medical diagnosis data, treatment medication lists, and so on. Much important information is hidden in these massive data [9-12]. Since the term KDD (knowledge discovery in databases) was proposed, data mining technology has been developing and updating at an unforeseen rate every year. Data mining applications have also been extended to all walks of life. People use and process data stored in servers or data warehouses to carry out tasks such as trend analysis, status quo interpretation, and even disease classification and prediction in the medical field. At present, the most representative data mining research directions are cluster analysis, decision tree classification, time series prediction, feature extraction, and association rule mining. Among them, time series analysis and association rule mining are the most widely used in the medical industry [13-16].

Association rule mining analyzes, for a given set of items and a given set of transactions, how frequently itemsets occur together in the transactions. Using association rule mining to analyze the medical data of patients with chronic diseases can identify the risk factors related to those diseases, so that patients can take preventive measures in time and improve their own health management. The association rule algorithm is the core of association rule mining, and its performance directly affects the mining result. Therefore, research on association rule mining algorithms is necessary.

Because medical standardization is relatively advanced in developed countries abroad, their medical information technology is also relatively advanced [17-20]. The Apriori algorithm is a classic algorithm in association rule mining. It has been continuously studied and improved throughout the development of data mining technology, remains a mainstream research topic for scholars and data mining enthusiasts, and is widely used in enterprise business decision-making. In 1993, in research related to the "beer-diapers" placement problem in supermarkets, R. Agrawal et al. first proposed the concept of association rules and applied it to a customer transaction database. In the following years, this algorithm gradually became the core, classic algorithm in the field of association rule mining because it is simple, intuitive, and easy to understand. With the advent of high-performance concurrent processing systems, processing massive amounts of data has become faster and more convenient. In medicine and health care, association rule algorithms are constantly being improved and applied by scholars and researchers, with satisfactory results. This confirms that data mining has broad application prospects in disease diagnosis and chronic disease prediction [21-24].
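
The "frequency relationship" discussed in the Introduction is usually quantified by two measures, support and confidence. The paper does not write them out, so the standard textbook definitions are given here as background only. The symbols $T$ (the transaction set) and $A$, $B$ (disjoint itemsets) are notation assumed for this note.

```latex
\mathrm{support}(A \Rightarrow B) = \frac{\left|\{\, t \in T : A \cup B \subseteq t \,\}\right|}{|T|},
\qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
```

As a hypothetical illustration, if $T$ holds 40 transactions, 20 of which contain $A$ and 10 of which contain both $A$ and $B$, then the rule $A \Rightarrow B$ has support $10/40 = 0.25$ and confidence $0.25 / 0.5 = 0.5$.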

II. THE PROPOSED METHODOLOGY

A. Data Mining Technology

The task and goal of data mining is to discover potentially valuable knowledge in the large amount of data held in a database. Data mining is an interdisciplinary field that draws on databases, statistics, visualization, and other related disciplines. Because data mining is applied in different fields, the mining methods adopted differ. The data mining function to be realized determines the type of pattern the mining task looks for. A pattern is a specific abstract description of a data set. Patterns are usually divided into predictive and descriptive patterns according to their role. A predictive pattern is mined from known data sets and used to make predictions about unknown data sets; a descriptive pattern describes the rules, regularities, and knowledge features that already exist in the data set without making any predictions. Each data mining method has its own advantages and its own suitable fields, and no single method can realize all mining functions. Only by combining multiple methods can their advantages complement one another.

(1) Classification analysis. By analyzing data objects, their shared characteristics are identified. A classification function or model is constructed from the analyzed data and these shared characteristics, and the function is then used to classify other data. Data with the same characteristics are placed in one category, which completes the classification. Further analysis can then be performed on the classified data.
(2) Association rule analysis. Association rules are the inherent relationships or connections between data items. Association rule analysis reveals the association rules that exist in the data. It is one of the most basic and important methods of data analysis. Its goal is to identify the rules governing when things happen.
(3) Decision tree method. A decision tree is a predictive model that classifies data according to a target and thereby finds potential, valuable information in the data.
(4) Neural network. A neural network consists of a large number of processing units modeled on human nerve cells and interconnected in a network. Once data is entered, the network can determine the pattern in the data. Because of its self-learning ability, it can analyze complex data much as the human brain does, and can analyze the development trend of things through complex patterns.
(5) Genetic algorithm. A genetic algorithm is a data processing method that simulates the natural mechanism of biological inheritance. First, a set of genetic operators is used to apply natural selection to the initial solutions; the rules are then used to search the solution space; finally, the optimal solution is obtained.
(6) Cluster analysis. Clustering groups similar data in a data set, with each group forming a class. The grouping aims to make all objects within a group highly similar while minimizing the similarity between different groups.

B. Disease Prediction Algorithm

The naive Bayes classifier is a simple and efficient classification algorithm built on Bayes' theorem. In some fields, its classification accuracy is comparable to that of algorithms such as decision trees and neural networks. The naive Bayes classifier is based on the maximum a posteriori hypothesis and assumes that the features are independent of one another, so that classification over many feature attributes can be completed in limited time.

Bayes' theorem, the core of the naive Bayes classifier, provides a method for computing the posterior probability from the prior probability:

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \tag{1}$$

Under the independence assumption, the likelihood of a sample $X = (x_1, \dots, x_n)$ given a class $C_i$, which plays the role of the hypothesis $H$ in (1), factorizes as

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) \tag{2}$$

The decision tree algorithm is a supervised learning algorithm that learns from a labeled training data set and generates a decision tree.

Fig. 1. Decision tree

Decision tree induction is based on information gain; the classic ID3 algorithm uses information gain as its attribute selection measure.

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \tag{3}$$

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \tag{4}$$

Here $p_i$ is the proportion of samples in $D$ that belong to class $i$, $m$ is the number of classes, and $\mathrm{Info}_A(D)$ is the expected information still required after partitioning $D$ on attribute $A$.

The BP (back-propagation) neural network propagates output errors backward to correct the network, and it is currently among the most widely applied neural networks. The BP algorithm is a supervised learning process: it is trained repeatedly on sample data with given inputs and outputs, continuously modifying the connection weights between the nodes of the network. This brings the output of the whole network closer to the expected values and achieves the expected experimental results.

Fig. 2. Three-layer BP neural network
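
To make the naive Bayes computation of Section II-B concrete, the following minimal Python sketch estimates the prior and the per-feature conditional probabilities by counting, then combines them as in Eqs. (1) and (2). The function names, the two features (smoker, high blood pressure), and the seven records are hypothetical and are not taken from the paper. No smoothing is applied, so a feature value never seen for a class gives that class a probability of zero.

```python
from collections import Counter, defaultdict


def train_naive_bayes(samples, labels):
    """Count class frequencies and per-feature value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(k, c)][v] = number of class-c samples whose k-th feature equals v
    value_counts = defaultdict(Counter)
    for x, c in zip(samples, labels):
        for k, v in enumerate(x):
            value_counts[(k, c)][v] += 1
    return class_counts, value_counts


def posterior(x, class_counts, value_counts):
    """Return P(C_i | X) for every class via Eq. (1), using the product in Eq. (2)."""
    total = sum(class_counts.values())
    joint = {}
    for c, n_c in class_counts.items():
        p = n_c / total                            # prior P(C_i)
        for k, v in enumerate(x):
            p *= value_counts[(k, c)][v] / n_c     # P(x_k | C_i), unsmoothed
        joint[c] = p                               # P(X | C_i) * P(C_i)
    evidence = sum(joint.values())                 # P(X)
    return {c: p / evidence for c, p in joint.items()}


# Hypothetical records: (smoker, high_blood_pressure) -> label
samples = [("yes", "yes"), ("yes", "no"), ("no", "yes"),
           ("no", "no"), ("yes", "no"), ("no", "no"), ("no", "yes")]
labels = ["sick", "sick", "sick", "healthy", "healthy", "healthy", "healthy"]

post = posterior(("yes", "yes"), *train_naive_bayes(samples, labels))
print(post)  # "sick" gets 16/19 (about 0.842), "healthy" gets 3/19
```

The predicted class is simply the one with the largest posterior, which is the maximum a posteriori decision mentioned in the text.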

C. Apriori Association Rule Algorithm

The association rules obtained by the Apriori algorithm are single-dimensional, single-level, Boolean association rules. To find the frequent itemsets in the target transaction database, the algorithm uses a two-stage strategy, a join step and a prune step, to generate frequent itemsets, and then compares candidate rules with the minimum confidence threshold to obtain the association rules.

The Apriori algorithm uses a level-by-level iterative search. The algorithm is simple and intuitive and involves no complicated theoretical derivation. It applies a pruning strategy when generating candidate itemsets, which greatly reduces their number. In practice, however, users may set a relatively small support threshold. If the transaction data set is somewhat large, the itemsets will contain many items, and the algorithm then shows obvious performance shortcomings:

(1) Frequent scans of the transaction database. The Apriori algorithm must scan the database multiple times to compute the support of each candidate itemset and compare it with the minimum support threshold, in order to decide whether to add the candidate to the set of frequent itemsets.
(2) A large number of candidate itemsets may be generated. The algorithm generates candidates by joining frequent itemsets. In this process, the number of candidate itemsets tends to grow exponentially.
(3) A single, uniform support threshold is used, which leads to many redundant rules. The rules obtained depend on the amount of transaction data and on the evaluation thresholds; a support threshold that is too large or too small affects the number of rules finally obtained.

The candidate itemset tree is shown in the figure below.

Fig. 3. The resulting knowledge base

III. EXPERIMENT

There is a breast cancer image database T, shown in Table 1, which contains 40 transactions. Each transaction in the database contains at most five entries.

Table 1. Breast cancer image database

The decision trees constructed are shown in the figures below.

Fig. 4. Decision trees
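
The decision trees of Fig. 4 are not reproduced in this copy. As a reminder of how an ID3-style tree would choose its root attribute using Eqs. (3) and (4), here is a minimal Python sketch. The attribute names, labels, and four records are invented for illustration and do not come from the paper's breast cancer database.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy Info(D) of a list of class labels, as in Eq. (3)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_after_split(rows, labels, attr):
    """Info_A(D): size-weighted entropy of the partitions induced by attribute index attr."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    n = len(labels)
    return sum(len(part) / n * info(part) for part in partitions.values())


def gain(rows, labels, attr):
    """Information gain of splitting on attribute index attr, as in Eq. (4)."""
    return info(labels) - info_after_split(rows, labels, attr)


# Hypothetical records: (lump_present, age_group) -> diagnosis
rows = [("yes", "old"), ("yes", "young"), ("no", "old"), ("no", "young")]
labels = ["malignant", "malignant", "benign", "benign"]

gains = [gain(rows, labels, a) for a in range(2)]
best_attr = max(range(2), key=lambda a: gains[a])
print(gains, best_attr)  # attribute 0 separates the classes perfectly, attribute 1 not at all
```

ID3 would place attribute 0 at the root here, because splitting on it removes all uncertainty about the label, while splitting on attribute 1 leaves the entropy unchanged.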


Finally, the generated frequent patterns are shown in Table
2.


Table 2. Generated frequent patterns
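
The contents of Tables 1 and 2 did not survive in this copy, so the following minimal Python sketch uses invented transactions (not the paper's database) to illustrate how the join, prune, and scan steps of Section II-C produce frequent patterns and rules. It implements the classic level-wise Apriori procedure with a simplified join (any two frequent (k-1)-itemsets whose union has k items), not the improved algorithm the paper refers to. The function names, item names, and thresholds are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # One pass over the database per candidate: the repeated-scan cost noted in the text
        return sum(itemset <= t for t in transactions) / n

    # Frequent 1-itemsets
    items = {i for t in transactions for i in t}
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {s: sup for s, sup in current.items() if sup >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: unions of frequent (k-1)-itemsets that form a k-itemset
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Scan step: keep candidates that reach the minimum support
        current = {c: support(c) for c in candidates}
        current = {c: sup for c, sup in current.items() if sup >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules reaching min_confidence."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(itemset, r):
                ante = frozenset(ante)
                conf = sup / frequent[ante]  # subsets of frequent itemsets are frequent
                if conf >= min_confidence:
                    found.append((ante, itemset - ante, conf))
    return found


# Hypothetical patient records, one set of findings per transaction
T = [
    {"hypertension", "obesity", "diabetes"},
    {"hypertension", "obesity"},
    {"hypertension", "diabetes"},
    {"obesity", "diabetes"},
    {"hypertension", "obesity", "diabetes"},
]

frequent = apriori(T, min_support=0.5)
found_rules = rules(frequent, min_confidence=0.7)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(sup, 2))
```

With these thresholds the three single items and the three pairs are frequent, while the full three-item set (support 0.4) is generated as a candidate but rejected in the scan step. Lowering `min_support` makes more candidates survive each level, which is the growth in candidate itemsets described in Section II-C.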

IV. CONCLUSION

Association rule mining is a main research direction in data mining, and it can directly express the relationships between data itemsets. In an era in which the influence of big data is growing ever faster, data mining technology is steadily becoming more efficient and accurate. The combination of data mining and disease prediction research has therefore become a research trend. This paper combines association rule mining with medical data analysis. Building on the traditional Apriori algorithm, the improved algorithm is applied to disease prediction.

REFERENCES

[1] Chai Yi. Research on association classification algorithm and its application in massive medical data mining of chronic diseases [D]. Beijing University of Posts and Telecommunications, 2016
[2] Zhang Kai. Research on the application of data mining technology in medical expense data [D]. Beijing University of Posts and Telecommunications, 2015
[3] Zhang Jirong, Wang Xiangyang. Research and improvement of Apriori algorithm based on XML data mining [J]. Computer measurement and control, 2016 (issue 6): 178-180
[4] Chen Miao. Application research of an improved Apriori algorithm in mobile platform teaching evaluation [D]. Chongqing Normal University, 2017
[5] Wang Xun. Application of data mining in medical field [D]. University of Electronic Science and Technology, 2016
[6] Dong Jinfeng. Improvement and parallel processing of association rule algorithm in data mining [D]. Harbin University of Science and Technology, 2016
[7] Zhao Jibeng. Improvement and research of Apriori algorithm based on Hadoop [D]. Northwest Normal University, 2016
[8] Li Lei. Analysis and research on Apriori algorithm based on cloud computing and big data [J]. Information technology, 2016, 000 (009): 93-95
[9] Yu Shoujian, Zhou Yiyang. Improvement of Apriori algorithm based on prefix itemset [J]. Computer applications and software, 2017, 34 (002): 290-294
[10] Wang Xin. Improvement of Apriori algorithm in intrusion detection system [J]. Security technology, 2016, 4 (4): 6
[11] Liu Lijuan. Research and application of improved Apriori algorithm [J]. Computer engineering and design, 2017 (12): 142-146
[12] Sun Xing. Processing and analysis of human detection data based on association algorithm [D]. Xi'an University of Science and Technology, 2019
[13] Wu Bo. Research on the application of Apriori algorithm mining technology in WANO human factor data [D]. Nanhua University, 2016
[14] Hu Shan. Improvement of Apriori algorithm and its application in college students' mental health test [J]. Automation and instrumentation, 2016, 000 (006): 222-224
[15] Li Hongwei, Ma Zhongyuan, Xie Zhenbo. Application of data mining technology based on Apriori algorithm in data processing of flight parameters of an engine [J]. Aircraft design, 2018, 038 (004): 5-8, 19
[16] Wang Xue, Shi Yuanbo, Huang Yueyang. Research on data mining based on Apriori algorithm in mobile medical terminal system [J]. Digital technology and application, 2017 (09): 60-61
[17] Zhang Chao. Two optimization analysis of Apriori algorithm for data mining based on association rules [J]. Journal of Shaoguan University, 2019 (9): 16-20
[18] Qian Cheng. Improvement and application of Apriori algorithm in the research of college students' emotional quality [D]. Shanghai Normal University, 2019
[19] Li Wei, Liu Guangming, Meng Xiangfei, et al. Application and optimization of parallel Apriori algorithm in massive medical document data mining [J]. Journal of Beijing Normal University: Natural Science Edition, 2016, 52 (4): 420-424
[20] Xu Jianjun, Zhang Guohua. Application and practice of data mining algorithm based on Apriori [J]. Computer technology and development, 2020, 030 (004): 206-210
[21] Huang Liming, Liu Zhenyu. Using improved Apriori algorithm to determine the association rules of prescription drugs in pharmacy [J]. Electronic design engineering, 2018, 26 (24): 36-40
[22] Li Jun. Application research of diabetes information system based on data mining [D]. Guangxi University, 2019
[23] Han Bing. Research on the improvement of Apriori data mining algorithm based on cloud platform [J]. Information systems engineering, 2018, 000 (010): 144-145
[24] Wu Ruidong, Zhang Maohong, Dong Jing, et al. Data analysis of medical record home page based on Apriori algorithm [J]. China Digital Medicine, 2018 (1): 79-81
