Data-Driven Decision Making Using Analytics and Computational Intelligence Techniques, 1st Edition
The objective of this series is to provide researchers a platform to present state-of-the-art innovations and research, and to design and implement methodological and algorithmic solutions to data processing problems, including evolving trends in health informatics and computer-aided diagnosis. The series supports researchers involved in designing decision support systems that will permit societal acceptance of ambient intelligence. Its overall goal is to present the latest snapshot of ongoing research and to shed further light on future directions in this space. The series presents novel technical studies as well as position and vision papers comprising hypothetical/speculative scenarios. It seeks to cover all aspects of computational intelligence techniques, from fundamental principles to current advanced concepts. For this series, we invite researchers, academicians, and professionals to contribute handbook, reference, or monograph volumes expressing their ideas and research on the application of intelligent techniques to the field of engineering.
Edited by
Parul Gandhi, Surbhi Bhatia, and Kapal Dev
CRC Press
Boca Raton and London
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
© 2022 selection and editorial matter, Parul Gandhi, Surbhi Bhatia and Kapal Dev; individual chapters,
the contributors
CRC Press is an imprint of Taylor & Francis Group, LLC
Reasonable efforts have been made to publish reliable data and information, but the author and publisher
cannot assume responsibility for the validity of all materials or the consequences of their use. The authors
and publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or
contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used
only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
A catalog record for this title has been requested
ISBN: 978-1-03-205827-6 (hbk)
ISBN: 978-1-03-205828-3 (pbk)
ISBN: 978-1-00-319940-3 (ebk)
DOI: 10.1201/9781003199403
Typeset in Times
by Newgen Publishing UK
Contents
Preface.......................................................................................................................vii
List of Contributors....................................................................................................ix
Editors’ Biography.....................................................................................................xi
Index....................................................................................................................... 137
Preface
Digitalization has increased our capabilities for collecting and generating data from different sources. As a result, tremendous amounts of data have flooded every aspect of our lives. This growth has created an urgent need to develop techniques and tools to handle, analyze, and manage data and to map it into useful information. This mapping helps improve performance, which eventually supports decision making.
This book brings new opportunities in the area of Data Analytics for Decision Making for further research targeting different verticals such as healthcare and climate change. Further, it explores the concepts of Database Technology, Machine Learning, Knowledge-Based Systems, High Performance Computing, Information Retrieval, Finding Patterns Hidden in Large Datasets, and Data Visualization. In addition, this book presents various paradigms including pattern mining, clustering and classification, and data analysis. The aim of this book is to provide technical solutions in the field of data analytics and data mining.
This book lays the required basic foundation and also covers cutting-edge topics.
With its algorithmic perspective, examples, and comprehensive coverage, this book
will offer solid guidance to researchers, students, and practitioners.
Contributors
Contributor's Name | Affiliation | Email
Pradeep Kumar Bhatia | Professor, Department of Computer Science and Engineering, Guru Jambheshwar University of Science & Technology, Hisar | [email protected]
Prerna Bhatnagar | Assistant Professor, Indirapuram Institute of Higher Studies (IIHS), Ghaziabad, Uttar Pradesh | [email protected]
Ankur Singh Bist | Chief AI Data Scientist, Signy Advanced Technologies, India | [email protected]
Tejinder Pal Singh Brar | Assistant Professor, Department of Computer Applications, CGC Landran, Punjab | [email protected]
Kapal Dev | University of Johannesburg, South Africa | [email protected]
Dagjit Singh Dhatterwal | Assistant Professor, PDM University, Bahadurgarh, Jhajjar, Haryana, India | [email protected]
Parul Gandhi | Professor, Faculty of Computer Applications, MRIIRS, Faridabad | [email protected]
Ovais Bashir Gashroo | Scholar, Department of Computer Science, Jamia Millia Islamia, New Delhi, India | [email protected]
Sharad Goel | Director & Professor, Indirapuram Institute of Higher Studies (IIHS), Ghaziabad, Uttar Pradesh | [email protected]
Sonal Kapoor | Associate Professor, Indirapuram Institute of Higher Studies (IIHS), Ghaziabad, Uttar Pradesh | [email protected]
Kuldeep Singh Kaswan | Associate Professor, Galgotias University, Greater Noida, Gautam Buddha Nagar, UP, India | [email protected]
Monica Mehrotra | Professor, Department of Computer Science, Jamia Millia Islamia, New Delhi, India | [email protected]
Preety | Assistant Professor, PDM University, Bahadurgarh, Jhajjar, Haryana, India | [email protected]
Editors’ Biography
PARUL GANDHI
Dr Gandhi holds a Doctorate in Computer Science, with a study area of Software Engineering, from Guru Jambheshwar University, Hisar. She is also a Gold Medalist in M.Sc. Computer Science, with a strong inclination toward academics and research. She has 15 years of academic, research, and administrative experience and has published more than 40 research papers in reputable international/national journals and conferences. Her research interests include software quality, soft computing, software metrics, component-based software development, data mining, and IoT. Presently, Dr Gandhi is working as Professor at Manav Rachna International Institute of Research and Studies (MRIIRS), Faridabad. She also handles the PhD program of the University. She has been associated as an Editorial Board member of SN Applied Sciences and is a reviewer for various respected IEEE journals and conferences. Dr Gandhi has published many book chapters in Scopus-indexed books and has edited various books with well-known publishers such as Wiley and Springer. She also handles special issues in journals of Elsevier and Springer as a guest editor. She has been invited as a resource person in various FDPs and has chaired sessions in various IEEE conferences. Dr Gandhi is a lifetime member of the Computer Society of India.
KAPAL DEV
Dr Dev is a Postdoctoral Research Fellow with the CONNECT Centre, School of Computer Science and Statistics, Trinity College Dublin (TCD). His education spans the ICT field: Electronics (B.E. and M.E.), Telecommunication Engineering (PhD), and a Postdoc on the fusion of 5G and Blockchain. He received his PhD degree from Politecnico di Milano, Italy, in July 2019. His research interests include blockchain, beyond-5G networks, and artificial intelligence. Previously, Dr Dev worked as a 5G Junior Consultant and Engineer at Altran Italia S.p.A., Milan, on 5G use cases. He is PI of two Erasmus+ International Credit Mobility projects. He is an evaluator of MSCA Co-Fund schemes, Elsevier book proposals, and top scientific journals and conferences including IEEE TII, IEEE TITS, IEEE TNSE, IEEE JBHI, FGCS, COMNET, TETT, IEEE VTC, and WF-IoT. Dr Dev is a TPC member of IEEE BCA 2020 (in conjunction with AICCSA 2020), ICBC 2021, DICG (colocated with Middleware 2020), and FTNCT 2020. He is also serving as guest editor (GE) in COMNET (I.F. 3.11), Associate Editor in IET Quantum Communication, GE in COMCOM (I.F. 2.8), GE in CMC-Computers, Materials & Continua (I.F. 4.89), and lead chair of one of the CCNC 2021 workshops. Dr Dev is also acting as Head of Projects for the Oceans Network, funded by the European Commission.
CONTENTS
1.1 Big Data..............................................................................................................2
1.1.1 Big Data V’s...........................................................................................2
1.1.1.1 Volume.....................................................................................3
1.1.1.2 Variety......................................................................................3
1.1.1.3 Velocity....................................................................................4
1.1.1.4 Veracity....................................................................................4
1.1.1.5 Validity.....................................................................................4
1.1.1.6 Visualization of Big Data.........................................................4
1.1.1.7 Value........................................................................................4
1.1.1.8 Big Data Hiding.......................................................................4
1.1.2 Challenges with Big Data.......................................................................4
1.1.3 Analytics of Big Data.............................................................................5
1.1.3.1 Use Cases Used in Big Data Analytics....................................5
1.1.3.1.1 Amazon’s “360-Degree View”..............................5
1.1.3.1.2 Amazon – Improving User Experience.................5
1.1.4 Social Media Analysis and Response.....................................................5
1.1.4.1 IoT – Preventive Maintenance and Support.............5
1.1.4.2 Healthcare................................................................................5
1.1.4.3 Insurance Fraud........................................................................6
1.1.5 Big Data Analytics Tools........................................................................6
1.1.5.1 Hadoop.....................................................................................6
1.1.5.2 MapReduce Optimize..............................................................7
1.1.5.3 HBase Hadoop Structure..........................................................7
1.1.5.4 Hive Warehousing Tool............................................................8
DOI: 10.1201/9781003199403-1
1.1 BIG DATA
The advent of IoT (internet of things) devices, business intelligence systems, and AI (artificial intelligence) has led to their widespread implementation and to a continuous increase in the amount of data in existence. Driven by developments such as self-driving cars, smart cities, home and factory automation, intelligent avionics systems, weaponry automation, and medical process automation, Ericsson has estimated that nearly 29 billion connected devices are expected by 2022, of which 18 billion would be IoT devices. The number of IoT units, led by new use scenarios, is projected to grow by 21% between 2016 and 2022. IDC reports that by 2025, real-time data will make up more than a quarter of all data. Over the years, control systems have kept evolving at different levels of Big Data information security. These control measures, although serving as the underlying strategies for securing big data, have limited capability in combating recent attacks, as malicious hackers have found new ways of launching destructive operations on big data infrastructures [1].
Digital data will grow to the zettabyte scale. This forecast gives insight into the higher rate of vulnerabilities and the large-scale data security loopholes that may arise. Big data companies face growing challenges in securing and managing their constantly growing data.
The V's that characterize big data include the following:
1.1.1.1 Volume
Volume refers to the cumulative amount of data. Today, Facebook contributes 500 terabytes of new data every day. A single flight across the United States can produce 240 terabytes of flight data. In the near future, mobile phones and the data that they generate and ingest will result in thousands of new, continuously changing data streams that include information on the world, location, and other matters.
1.1.1.2 Variety
Data come in various types such as text, sensor data, audio, graphics, and video. The following data forms exist.
Structured data: data that can be saved in the rows and columns of a table in a database. These data are linked and can be mapped into pre-designed fields quickly, for example in a relational database.
Semi-structured data: partially ordered data such as XML and CSV files.
Unstructured data: data that do not fit a pre-defined model, for example text, audio, and video files. Unstructured data account for approximately 80% of all data; they are fast growing, and their use could assist in a company's decision making.
1.1.1.3 Velocity
Velocity measures how quickly data arrive: data stream in constantly, and usable data must be extracted in real time, for example from a webcam feed.
1.1.1.4 Veracity
Consistency or trust of data is veracity.
It investigates whether data obtained from Twitter posts is trustworthy and correct,
with hash tags, abbreviations, styles, etc.
1.1.1.5 Validity
It is important to verify the authenticity of the data prior to processing large data sets.
1.1.1.7 Value
Value refers to the worth of the data being extracted. Bulk data with no value is not useful to the company; data need to be converted into something valuable to achieve business gains. By estimating the full costs of producing and processing big data, businesses can determine whether big data analytics really add value, relative to the ROI that the business insights are supposed to produce.
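As a hedged illustration of the cost-versus-value comparison just described, the ROI check reduces to simple arithmetic. All figures below are invented for the example:

```python
# Toy ROI check for a big data initiative. All figures are
# hypothetical, purely to illustrate the cost-vs-value comparison.

def roi(expected_gain: float, total_cost: float) -> float:
    """Return ROI as a fraction: (gain - cost) / cost."""
    return (expected_gain - total_cost) / total_cost

# Full cost of producing and processing the data (assumed):
cost = 200_000.0   # infrastructure + staff + tooling
# Revenue attributed to the resulting insights (assumed):
gain = 260_000.0

print(f"ROI: {roi(gain, cost):.0%}")  # prints "ROI: 30%"
```

If the computed ROI falls below zero, the analytics program costs more than the insights return, which is exactly the judgment the paragraph above describes.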
1.1.4.2 Healthcare
Big data in healthcare refers to vast volumes of data obtained from a number of sources, such as electronic gadgets like exercise tracking systems, smart watches, and sensors. Biometric data such as X-rays, CT scans, medical documents, EHRs, demographics, family history, conditions such as asthma, and clinical trial findings also come under big data. It helps physicians develop libraries that are vital to genetic disease prediction. For example, preventive treatment can be carried out for people at risk of a particular illness (e.g. diabetes) and, based on data from other patients, can include proposals for each patient. Clinical decision support (CDS) software in hospitals analyzes on-site diagnostic information and guides health providers in diagnosing patients and drafting orders. Wearables continually gather health data from customers and notify physicians. If something goes wrong, the doctor or another expert is immediately alerted and can, without further delay, call patients and give them all the guidance they need.
The HDFS is Hadoop's storage component. It stores information in fixed-size blocks by splitting the files. Blocks are kept in a cluster of nodes using a master/slave framework in which a cluster contains a single NameNode and the other nodes are DataNodes (slave nodes). NameNode and DataNode are the main HDFS components. The HDFS master node retains and manages the blocks on the DataNodes, runs on highly available servers, and regulates clients' access to files. It also records metadata for all cluster files, such as block location and file size. Metadata is maintained using the following two files:
• FsImage: contains all the modifications ever made across the Hadoop cluster since the NameNode started (stored on disk).
• EditLogs: comprises all recent updates, for example the updates in the last hour (stored in RAM).
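The block-splitting and master/slave placement described above can be sketched as a toy simulation. This is not the Hadoop API; the function names here are illustrative assumptions, and only the 128 MB block size and replication factor of 3 reflect HDFS defaults:

```python
# Toy model of HDFS-style storage: a file is split into fixed-size
# blocks, and each block is replicated across several DataNodes.
import itertools

BLOCK_SIZE = 128          # pretend "MB" per block (HDFS default is 128 MB)
REPLICATION_FACTOR = 3    # HDFS's default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return block sizes covering the file; the last block may be short."""
    blocks = []
    while file_size > 0:
        blocks.append(min(block_size, file_size))
        file_size -= blocks[-1]
    return blocks

def place_blocks(blocks, datanodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` DataNodes, round-robin.
    The NameNode keeps exactly this kind of metadata: block -> locations."""
    nodes = itertools.cycle(datanodes)
    return {i: [next(nodes) for _ in range(replication)]
            for i in range(len(blocks))}

blocks = split_into_blocks(300)               # a 300 "MB" file
metadata = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(blocks)     # [128, 128, 44]
print(metadata)   # block index -> the three DataNodes holding its replicas
```

A real NameNode also balances placement by rack and node load; round-robin is used here only to keep the sketch short.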
The NameNode processes the replication factor and collects a heartbeat (by default every 3 seconds) and a block report from all DataNodes to confirm that the DataNodes are live.
1. Collaborative Filtering
2. Clustering
3. Categorization/Classification [9]
1.1.5.8 Bigtable
The Bigtable system was introduced in 2004 and, together with MapReduce, is currently used by many Google services such as Google Reader, Google Maps, Google Book Search, Google Earth, Blogger.com, Google Code, Orkut, YouTube, and Gmail to store and modify information. Bigtable gives Google flexibility and greater control over the execution of particular databases. The wide table is scaled out using a distributed data storage model, which relies on partitioning the data in order to speed up retrieval.
The exponential development of structured data, and the progression of its scale and nature, drive dynamic data mining to generate meaningful information. Complex documents with rich detail populate WWW servers, web pages, social institutions, communication systems, and transport systems. Complex dependency structures within the information raise the complexity of our acceptance criteria; moreover, broad data complexity arises from a wide variety of viewpoints, including increasingly complex data types, complicated semantic relationships, and complex information association structures.
Big data contains structured information, unstructured information, semi-structured information, and so on. Relational databases, texts, hypertexts, pictures, and audio and video data are especially common: web news, Facebook comments, Picasa photographs, and YouTube videos about a single event. There is no doubt that such data include rich semantic relationships. Mining the dynamic semantic interactions across "text image video" knowledge would primarily help in developing application systems such as search engines or recommendation architectures.
There are also relations between individuals within big data. On the internet there are blogs, and hyperlinks weave web pages into an intricate structure. Social links likewise occur between individuals, who build complicated social networks, for instance the massive relationship data of Facebook, Twitter, LinkedIn, and other internet-based life, including call detail records (CDRs), gadget and sensor information, GPS and geocoded map data, huge image documents transferred through managed file transfer protocols, and web texts. Expanding research activities have begun to address problems of institutional development, swarm engagement, and documentation and communication in terms of dynamic relationship systems.
1.1.9 Conclusions
As the data volume grows every day, big data will continue to expand and become one of the most fascinating prospects of the next few years. Today we see an incredible increase in the volume, speed, and variety of data. Security and secrecy threats have likewise grown exponentially across the vulnerability ecosystem. We are now in a new era in which, by using cloud computing, we can manage all our data with less money and effort, because we take the outsourcing route for managing big data [18]. Not only can we manage it with less effort, but we can also manage big data very effectively using the MapReduce technique. Through effective data science analytics, machine learning, and statistical computation systems, the security of big data can be enhanced well beyond the existing control systems. This can give rise to highly intelligent data security implementations with the ability to self-protect and to automate actions when a suspicious pattern is detected. IT professionals will be able to advance operative, defensive, and active measures to prevent malicious attacks on big data using the vast options data science presents.
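The MapReduce technique mentioned above can be illustrated with a minimal in-memory word count, the canonical MapReduce example. This is a single-process sketch of the programming model under simplifying assumptions, not a distributed Hadoop job:

```python
# Minimal, single-process sketch of the MapReduce model:
# map emits (key, 1) pairs, shuffle groups by key, reduce sums per key.
from collections import defaultdict

def map_phase(document: str):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs big tools", "data tools evolve"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'needs': 1, 'tools': 2, 'evolve': 1}
```

In a real cluster, the map and reduce phases run in parallel across many nodes and the shuffle moves data over the network; the structure of the computation, however, is exactly this.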
REFERENCES
[1] Bhatia, S. A. (2020). Comparative Study of Opinion Summarization Techniques. IEEE
Transactions on Computational Social Systems, 8(1), 110–117.
[2] Kshatri, S. S., Singh, D., Narain, B., Bhatia, S., Quasim, M. T., & Sinha, G. R. (2021).
An empirical analysis of machine learning algorithms for crime prediction using
stacked generalization: An ensemble approach. IEEE Access, 9, 67488–67500.
[3] Vinodhan D, Vinnarasi A. (2016). IOT based smart home. International Journal of
Engineering and Innovative Technology (IJEIT), 10, 35–38.
[4] Kamalpreet Singh, Ravinder Kaur (2014). “Hadoop: Addressing Challenges Of Big
Data,” IEEE International Advanced Computing Conference (IACC).
[5] Cloudera.com (2015). Introduction to Hadoop and MapReduce [online]. Available at: www.cloudera.com/content/cloudera/en/training/courses/udacity/mapreduce.html (accessed on 16/1/2015).
[6] Charu C. Aggarwal, Naveen Ashish, Amit Sheth (2013). "The Internet of Things: A Survey from the Data-Centric Perspective", Managing and Mining Sensor Data, Springer Science+Business Media, New York, pp. 384–428.
[7] Obaidat, M. S., & Nicopolitidis, P. (2016). Smart Cities and Homes: Key Enabling
Technologies. Cambridge, MA: Morgan Kaufmann, pp. 91–108.
[8] Eugene Silberstein, "Automatic Controls and Devices", Residential Construction Academy HVAC, Chapter 7, pp. 158–184.
[9] Scuro, Carmelo & Sciammarella, Paolo & Lamonaca, Francesco & Olivito, R. &
Carnì, Domenico. (2018). IoT for structural health monitoring. IEEE Instrumentation
and Measurement Magazine. 21(6). 4–9 and 14.
[10] Alojail, M., & Bhatia, S. (2020). A novel technique for behavioral analytics using
ensemble learning algorithms in E-commerce. IEEE Access, 8, 150072-150080.
[11] Sheikh, R. A., Bhatia, S., Metre, S. G., & Faqihi, A. Y. A. (2021). Strategic value real-
ization framework from learning analytics: a practical approach. Journal of Applied
Research in Higher Education.
[12] Bisht B., & Gandhi P. (2019) “Review Study on Software Defect Prediction
Models premised upon Various Data Mining Approaches”, INDIACom-2019, 10th
INDIACom 6th International Conference on “Computing For Sustainable Global
Development” at Bharti Vidyapeeth’s Institute of Computer Applications and
Management (BVICAM).
[13] Gandhi P., & Pruthi J. (2020) Data Visualization Techniques: Traditional Data to Big
Data. In: Data Visualization. Springer, Singapore. pp. 53–74.
[14] Sethi, J. K., & Mittal, M. (2020). Monitoring the impact of air quality on the COVID-
19 fatalities in Delhi, vol-1 using machine learning techniques. Disaster Medicine
and Public Health Preparedness, 1–8.
[15] Chhetri B et al., Estimating the prevalence of stress among Indian students during
the COVID-19 pandemic: a cross-sectional study from India, Journal of Taibah
University Medical Sciences, pp. 35–50 https://doi.org/10.1016/j.jtumed.2020.12.0
[16] Dagjit Singh Dhatterwal, Preety, Kuldeep Singh Kaswan, The Knowledge
Representation in COVID-19 Springer, ISBN: 978-981-15-7317-0
[17] Chawla, S., Mittal, M., Chawla M., Goyal, L. M. (2020). Corona Virus – SARS-CoV-2: an insight to another way of natural disaster. EAI Endorsed Transactions on Pervasive Health and Technology, 6, 22.
[18] Gandhi K., & Gandhi P. (2016) “Cloud Computing Security Issues: An Analysis”,
INDIACom-2016 10th INDIACom 3rd International Conference on “Computing
for Sustainable Global Development” at Bharti Vidyapeeth’s Institute of Computer
Applications and Management (BVICAM), pp. 7670–7673.
2 Analytical Theory: Frequent Pattern Mining

Ovais Bashir Gashroo1 and Monica Mehrotra2
1 Scholar, Department of Computer Science, Jamia Millia Islamia, New Delhi, India
2 Professor, Department of Computer Science, Jamia Millia Islamia, New Delhi, India
CONTENTS
2.1 Introduction......................................................................................................15
2.2 Frequent Pattern Mining Algorithms................................................................16
2.2.1 Apriori Algorithm.................................................................................17
2.2.2 DHP Algorithm....................................................................................18
2.2.3 FP-Growth Algorithm...........................................................................18
2.2.4 EClaT Algorithm..................................................................................19
2.2.5 Tree Projection Algorithm....................................................................20
2.2.6 TM Algorithm......................................................................................21
2.2.7 P-Mine Algorithm................................................................................22
2.2.8 Can-Mining Algorithm.........................................................................22
2.3 Analysis of the Algorithms...............................................................................23
2.4 Privacy Issues...................................................................................................23
2.5 Applications of FPM........................................................................................24
2.5.1 For Customer Analysis.........................................................................24
2.5.2 Frequent Patterns for Classification.....................................................24
2.5.3 Frequent Patterns Aimed at Clustering.................................................25
2.5.4 Frequent Patterns for Outlier Analysis.................................................25
2.5.5 Frequent Patterns for Indexing.............................................................25
2.5.6 Frequent Patterns for Text Mining.......................................................25
2.5.7 Frequent Patterns for Spatial and Spatiotemporal Applications...........26
2.5.8 Applications in Chemical and Biological Fields..................................26
2.6 Resources Available for Practitioner................................................................26
2.7 Future Works and Conclusion..........................................................................27
2.1 INTRODUCTION
Frequent pattern mining (FPM) is the process of finding the associations between distinct items in a database. Frequent patterns are itemsets, sub-sequences, or sub-structures which appear within a set of data with a frequency not below a user-specified limit. For example, a collection of items, like bread and eggs, that often appear together in a transaction data set can be called a frequent itemset [1]. A sub-sequence, such as purchasing a laptop followed by a camera and afterwards an external hard drive, is termed a frequent sequential pattern if it occurs often in a shopping history database. Structural forms like subgraphs or subtrees, which can be combined with itemsets or sub-sequences, are referred to as sub-structures, and the frequent occurrence of a sub-structure within a graph database is known as a frequent structural pattern.

DOI: 10.1201/9781003199403-2
FPM has been a requisite task in data mining, and researchers have devoted much attention to this concept over the last couple of years. With its profuse use in data mining problems such as classification and clustering, FPM has been broadly researched. The advent of FPM in real-world businesses has promoted sales, which has resulted in increased profits. FPM has been applied in domains like recommender systems, bioinformatics, and decision making. The literature dedicated to this field of research is abundant and has achieved tremendous progress, such as the development of efficient and effective algorithms for mining frequent itemsets. FPM is of immense importance in many important data mining tasks like association and correlation analysis, analyzing patterns in spatiotemporal data, classification, and cluster analysis.
The process of FPM can be specified as follows: given a database DTB which contains transactions T1, T2, …, Tn, all the patterns P need to be determined which appear in no less than a fraction s of all the transactions [2]. The fraction s is usually referred to as the "minimum support". This was first put forward by Agrawal et al. [3] in 1993 for market basket analysis as a kind of association rule mining. FPM can analyze the buying habits of customers by discovering the associations among the items that customers place in their shopping baskets. For example, for customers who buy bread, what are the chances that they also buy eggs? Information like this can help increase sales, because owners will market according to it and shelf spaces will be arranged accordingly.
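The minimum-support definition above can be made concrete with a short sketch. The transactions below are invented market baskets, and the enumeration is deliberately naive, not an optimized FPM algorithm:

```python
# Naive frequent-itemset counting to illustrate "minimum support".
# The transactions are hypothetical market baskets.
from itertools import combinations

transactions = [
    {"bread", "eggs", "milk"},
    {"bread", "eggs"},
    {"bread", "butter"},
    {"eggs", "milk"},
]
min_support = 0.5  # a pattern must appear in at least half the transactions

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

# Enumerate all 1- and 2-itemsets and keep those meeting min_support.
items = sorted(set().union(*transactions))
frequent = {
    frozenset(c): support(set(c), transactions)
    for k in (1, 2)
    for c in combinations(items, k)
    if support(set(c), transactions) >= min_support
}
for pattern, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(set(pattern), s)
```

With these baskets, {bread} and {eggs} have support 0.75 and the pair {bread, eggs} has support 0.5, so it qualifies as a frequent itemset, while {bread, butter} (support 0.25) does not. Algorithms such as Apriori exist precisely to avoid this brute-force enumeration on large databases.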
This chapter will provide a comprehensive study in the field of FPM. The chapter
will also explore some of the algorithms for FPM, analysis of FPM algorithms,
privacy issues, various applications of FPM, and some of the resources that are avail-
able for those who want to practice the FPM methods. Finally, the chapter will be
concluded with future directions in this area.
2.2 FREQUENT PATTERN MINING ALGORITHMS
Analytical Theory 17
Many researchers have proposed a variety of algorithms to enhance the FPM process,
and this section analyzes several of them. Generally, FPM algorithms fall into three
categories [2]: Join-Based, Tree-Based, and Pattern Growth algorithms. Join-Based
algorithms use a bottom-up method that recognizes frequent items in a data set and
keeps enlarging them into itemsets as long as those itemsets stay above a minimum
threshold value defined by the user. Tree-Based algorithms, on the other hand, use
a set-enumeration technique for solving the frequent itemset generation problem:
they construct a lexicographic tree that enables mining items in various ways, such
as breadth-first or depth-first order. Lastly, Pattern Growth algorithms apply a
divide-and-conquer method to partition and project databases based on the currently
identified frequent patterns and expand them into longer patterns in the projected
databases [4]. Algorithms such as Apriori, DHP, AprioriTID, and AprioriHybrid are
classified as Join-Based algorithms; AIS, TreeProjection, EClaT, VIPER, MAFIA,
and TM as Tree-Based algorithms [4]; and FP-Growth, TFR, SSR, P-Mine,
LP-Growth, and Extract as Pattern Growth algorithms [4]. Algorithms from all three
categories are discussed in this chapter.
2.2.1 Apriori Algorithm
Apriori [5] was the first algorithm used for mining frequent patterns. It mines
frequent itemsets so that Boolean association rules can be generated, employing
an iterative, stepwise search technique to find (k+1)-itemsets from k-itemsets. An
example of transactional data is shown in Table 2.1; it contains the items purchased
in different transactions. First, the whole database is scanned to identify and count
all 1-itemsets, and only those that satisfy the minimum support threshold are retained
as frequent. The entire database then has to be scanned again at each subsequent
level until no further frequent k-itemset can be identified. Supposing the minimum
support count is 2, then in our example only the itemsets whose support is greater
than or equal to the minimum support pass into the next phase of the algorithm.
The Apriori algorithm reduces the size of the candidate itemsets significantly and
provides a substantial performance gain in many cases. However, the algorithm
suffers from a few critical limitations [6]. One of them is that, as the total count of
frequent k-itemsets rises, a large number of candidate itemsets must be produced;
the algorithm then has to scan the entire database repeatedly and verify a large set
of candidate itemsets using pattern matching.
TABLE 2.1
Example of Transactional Data
Transaction ID Item IDs
T100 a, b, e
T101 b, d
T102 b, c
T103 a, b, d
T104 a, c
T105 b, c
T106 a, c
T107 a, b, c, e
The main benefit of the Apriori algorithm is its iterative, stepwise search technique
for discovering (k+1)-itemsets from k-itemsets. Its disadvantages are that it needs to
generate a large number of candidate sets when the itemsets are numerous, and that
the database has to be scanned repeatedly to determine the support count of the
itemsets [4].
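The stepwise search described above can be sketched in Python on the data of Table 2.1. This is a simplified illustration, not the authors' implementation: the join step is approximated by enumerating combinations of the currently frequent items and pruning candidates that have an infrequent k-subset.

```python
from itertools import combinations

# Transactions of Table 2.1.
transactions = [
    {"a", "b", "e"}, {"b", "d"}, {"b", "c"}, {"a", "b", "d"},
    {"a", "c"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c", "e"},
]

def apriori(transactions, min_count=2):
    """Return {frozenset itemset: support count} for all frequent itemsets."""
    # One full scan finds the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        items = sorted({i for s in level for i in s})
        # Candidate generation with the Apriori pruning property:
        # a (k+1)-itemset can be frequent only if all its k-subsets are.
        candidates = {frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k))}
        # Each level needs another scan of the whole database.
        level = {}
        for cand in candidates:
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_count:
                level[cand] = count
        frequent.update(level)
        k += 1
    return frequent

freq = apriori(transactions, min_count=2)
print(freq[frozenset({"a", "b"})])  # 3: {a, b} occurs in T100, T103, T107
```

With a minimum support count of 2 this yields exactly the itemsets of Tables 2.3 and 2.4 that meet the threshold, e.g. {a, b, e} with count 2, while {a, b, c} (count 1) is discarded.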
2.2.2 DHP Algorithm
DHP stands for the Direct Hashing and Pruning method [7], which was put forward
after the Apriori algorithm. It proposes two main optimizations to speed itself
up: pruning the candidate itemsets in each iteration, and trimming the transactions
so that the support counting procedure becomes more efficient [2].
For pruning the itemsets, the DHP algorithm keeps track of partial information about
the candidate (k+1)-itemsets while explicitly counting the support of the candidate
k-itemsets. While the candidate k-itemsets are being counted, all (k+1)-subsets of
each transaction are found and hashed into a table that maintains the number of
subsets hashed into each entry [2]. During the stage of counting (k+1)-itemsets, the
count for each itemset is retrieved from the hash table. Because there are collisions
in the hash table, these counts are overestimates. Itemsets whose counts fall below
the user-stated support level are pruned from further consideration.
Trimming of transactions is the second optimization proposed in the DHP algorithm.
An item can be pruned from a transaction if it does not occur in at least k of the
frequent itemsets in Fk, because it can no longer contribute to the support counts
used to find longer frequent patterns. This follows from an important observation: if
an item does not occur in at least k of the frequent itemsets in Fk, then no frequent
(k+1)-itemset can contain that item. This step reduces the width of the transactions
and increases processing efficiency [2].
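The hash-based pruning idea can be sketched as follows. This is an illustrative simplification of DHP, assuming a deliberately small hash table so that collisions, and hence overestimated counts, can occur; since a bucket count is always an upper bound on the true support of any pair hashed into it, a bucket count below the threshold safely prunes the candidate.

```python
from itertools import combinations

# Transactions of Table 2.1.
transactions = [
    {"a", "b", "e"}, {"b", "d"}, {"b", "c"}, {"a", "b", "d"},
    {"a", "c"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c", "e"},
]

NUM_BUCKETS = 7  # deliberately tiny so different pairs may collide

def bucket(pair):
    """Hash a 2-itemset into one of NUM_BUCKETS hash-table entries."""
    return hash(tuple(sorted(pair))) % NUM_BUCKETS

# While counting 1-itemsets, also hash every 2-subset of each transaction
# and keep a per-bucket count of how many subsets landed in each entry.
bucket_counts = [0] * NUM_BUCKETS
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

min_count = 2

def maybe_frequent(pair):
    """Bucket counts only overestimate true support (collisions merge
    counts), so a bucket count below min_count safely prunes the pair."""
    return bucket_counts[bucket(pair)] >= min_count

print(maybe_frequent(("a", "b")))  # True: {a, b} alone occurs 3 times
```

Only pairs for which `maybe_frequent` is true need to be counted exactly in the next pass, which is how DHP shrinks the candidate set.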
2.2.3 FP-Growth Algorithm
The frequent pattern growth algorithm [6] mines frequent itemsets without the
costly process of candidate generation. It employs a divide-and-conquer approach
to compress the frequent items into an FP-Tree, which holds all the data related to
the associations of the frequent items. The FP-Tree is then divided into a number of
conditional FP-Trees, one for every frequent item, so that each frequent item can be
mined individually [4]. The representation of frequent items with an FP-Tree is
displayed in Figure 2.1.
The FP-Growth algorithm solves the problem of identifying long frequent patterns
by repeatedly searching conditional FP-Trees of smaller and smaller size. Examples
of the conditional FP-Trees, and the detailed conditional FP-Trees derived from
Figure 2.1, can be found in [8]. As per [4], “The Conditional Pattern Base is a
‘sub-database’ which consists of every prefix path in the FP-Tree that co-occurs
with every frequent length-1 item. It is used to construct the Conditional FP-Tree
and generate all the frequent patterns related to every frequent length-1 item”. This
significantly reduces the cost of searching for frequent patterns. According to [9],
however, constructing an FP-Tree is a time-consuming process if the available data
set is huge.
The first advantage of this algorithm is that it preserves the association information
of every itemset; the second is that it shrinks the volume of data that has to be
searched [3]. However, its disadvantage is that the time required to construct the
FP-Tree is high when the data upon which it is built is very large [4][9].
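A minimal sketch of the FP-Tree construction (two database scans: one to count items, one to insert transactions in descending frequency order so common prefixes are shared) might look like this. The node and header-table layout here is an illustrative assumption, and the conditional-tree mining step is omitted.

```python
from collections import defaultdict

class FPNode:
    """One node of the FP-Tree: an item, a count, and links to its kin."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count=2):
    """Compress the transactions into an FP-Tree; return (root, header table)."""
    # Scan 1: count item supports and keep only the frequent items.
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    frequent = {i: c for i, c in support.items() if c >= min_count}
    root = FPNode(None, None)
    header = defaultdict(list)  # item -> all tree nodes holding that item
    # Scan 2: insert each transaction with items in descending support order,
    # so transactions sharing a prefix share the same initial path.
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header

# Transactions of Table 2.1.
transactions = [
    {"a", "b", "e"}, {"b", "d"}, {"b", "c"}, {"a", "b", "d"},
    {"a", "c"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c", "e"},
]
root, header = build_fp_tree(transactions)
# 'b' is the most frequent item, so every path containing b shares one node:
print(len(header["b"]), sum(n.count for n in header["b"]))  # 1 6
```

The single shared node for b (count 6) illustrates the compression the chapter describes: all six transactions containing b reuse one path prefix instead of being stored separately.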
2.2.4 EClaT Algorithm
The EClaT (Equivalence Class Transformation) algorithm [10] mines frequent
itemsets efficiently by using the vertical data format shown in Table 2.2. With this
representation, the transactions containing a particular itemset are grouped into the
same record. The EClaT algorithm transforms the data from the horizontal format
into the vertical format after scanning the database once. Frequent (k+1)-itemsets
are then produced by intersecting the transaction sets of the frequent k-itemsets.
This process is repeated until all the frequent itemsets have been intersected with
each other and no further frequent itemset can be discovered, as depicted in
Tables 2.3 and 2.4.
The EClaT algorithm does not need to scan the database multiple times to identify
(k+1)-itemsets. For transforming the data from the horizontal to the vertical format,
the database needs to be scanned only once. The support count of each itemset is
simply the total number of transactions that contain that itemset, so the database
does not need to be scanned more than once for identifying the support
TABLE 2.2
Vertical Format of Transactional Data
Itemset Transaction ID
a {T100, T103, T104, T106, T107}
b {T100, T101, T102, T103, T105, T107}
c {T102, T104, T105, T106, T107}
d {T101, T103}
e {T100, T107}
TABLE 2.3
Vertical Data Format of 2-Itemsets
Itemset Transaction ID
{a, b} {T100, T103, T107}
{a, c} {T104, T106, T107}
{a, d} {T103}
{a, e} {T100, T107}
{b, c} {T102, T105, T107}
{b, d} {T101, T103}
{b, e} {T100, T107}
{c, e} {T107}
TABLE 2.4
Vertical Data Format of 3-Itemsets
Itemset Transaction ID
{a, b, c} {T107}
{a, b, e} {T100, T107}
count. If the transactions involved in an itemset are numerous, however, the
intersection of the itemsets will need a lot of memory space and higher processing
time [4]. The advantage of the EClaT algorithm is that the database needs to be
scanned only once [4]. However, the intersection of long transaction sets takes both
more memory space and longer processing time, which is a disadvantage of this
algorithm [4].
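The vertical transformation and tid-set intersection of Tables 2.2 to 2.4 can be reproduced directly in Python; support counting then reduces to set intersection with no further database scans.

```python
from collections import defaultdict

# Transactions of Table 2.1, keyed by transaction ID.
transactions = {
    "T100": {"a", "b", "e"}, "T101": {"b", "d"}, "T102": {"b", "c"},
    "T103": {"a", "b", "d"}, "T104": {"a", "c"}, "T105": {"b", "c"},
    "T106": {"a", "c"}, "T107": {"a", "b", "c", "e"},
}

# Single scan: build the vertical format of Table 2.2 (item -> tid set).
vertical = defaultdict(set)
for tid, items in transactions.items():
    for item in items:
        vertical[item].add(tid)

# The support of a k-itemset is the size of the intersection of its tid
# sets, so no further database scans are needed.
tids_ab = vertical["a"] & vertical["b"]
print(sorted(tids_ab))             # ['T100', 'T103', 'T107'], as in Table 2.3
tids_abe = tids_ab & vertical["e"]
print(sorted(tids_abe))            # ['T100', 'T107'], as in Table 2.4
```

Note that the intermediate tid set for {a, b} is reused when extending to {a, b, e}, which is exactly how EClaT avoids rescanning the database at each level.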
2.2.5 TreeProjection Algorithm
The TreeProjection algorithm [11] represents itemsets as the nodes of a lexicographic
tree, which can be traversed in breadth-first order, depth-first order, or a mixture of
both. The nodes of the lexicographic tree are used to store the support of every
frequent itemset once it has been calculated by the algorithm, which improves the
performance of computing the total number of transactions containing a particular
itemset [4]. A lexicographic tree representing the frequent items is shown in
Figure 2.2 as an example.
Within the hierarchical structure of the lexicographic tree, the algorithm searches
only those subsets of the transactions that can possibly contain frequent itemsets.
This is an advantage of the TreeProjection algorithm because it identifies the
frequent itemsets much faster. A top-down technique is used to traverse the
lexicographic tree while searching it. The disadvantage of this algorithm is that it
does not use memory space efficiently when there are different representations of
the lexicographic tree [12].
2.2.6 TM Algorithm
Just like the EClaT algorithm, the TM (Transaction Mapping) algorithm [13] mines
frequent itemsets using the vertical representation of the data. For every itemset,
the algorithm transforms and maps the transaction IDs into a list of transaction
intervals held in a different location. Intersection is then performed among the
transaction intervals, via a depth-first search through the lexicographic tree, to
facilitate the counting of itemsets. An example of this technique can be found
in [13].
The TM technique compresses the transaction IDs into continuous transaction
intervals especially well when the minimum support value is high. The advantage
of this algorithm is that intersection time is saved when the itemsets are compressed
into lists of transaction intervals [4]. For data sets containing short frequent patterns,
the TM algorithm performed much better than the FP-Growth and EClaT algorithms
[4]. In terms of processing speed in general, however, the TM algorithm is slower
than the FP-Growth algorithm [4].
2.2.7 P-Mine Algorithm
The P-Mine algorithm [14] mines frequent itemsets on a processor with multiple
cores by employing a parallel disk-based approach. A data structure named
VLDBMine is used to produce a dense representation of the data set on disk in
less time. VLDBMine uses a Hybrid-Tree to store the whole data set along with the
information required for data retrieval. Because slow disk access lowers performance,
a pre-fetching method is implemented that loads several projections of the data set
onto separate cores of the processor for mining frequent itemsets; this technique
improves disk-access efficiency. Finally, every processor core delivers its results,
which are combined to construct the complete set of frequent itemsets. The
architecture of the P-Mine algorithm is illustrated in [14].
The VLDBMine data structure, and in particular the Hybrid-Tree it uses, improves
the performance and scalability of frequent itemset mining. The advantage of the
P-Mine algorithm is this optimization of scalability and performance when frequent
itemsets are mined in parallel on the multiple cores of a processor. However, its
disadvantage is that the maximum level of optimization can be achieved only when
multiple cores are present in the processor.
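The partition-and-merge idea behind such parallel mining can be sketched as follows. This is not the P-Mine/VLDBMine implementation: it merely splits the data set into projections, counts item supports in parallel workers (threads standing in here for processor cores), and combines the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Transactions of Table 2.1.
transactions = [
    {"a", "b", "e"}, {"b", "d"}, {"b", "c"}, {"a", "b", "d"},
    {"a", "c"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c", "e"},
]

def count_partition(part):
    """Local item-support counts for one projection of the data set."""
    local = Counter()
    for t in part:
        local.update(t)
    return local

def parallel_counts(transactions, n_workers=2):
    """Split the data set, count each projection in parallel, merge results."""
    chunk = (len(transactions) + n_workers - 1) // n_workers
    parts = [transactions[i:i + chunk]
             for i in range(0, len(transactions), chunk)]
    merged = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for local in pool.map(count_partition, parts):
            merged.update(local)  # combine the per-worker partial counts
    return merged

counts = parallel_counts(transactions)
print(counts["b"])  # 6, identical to a single sequential scan
```

The merged counts are identical to those of a sequential scan; the benefit of the real algorithm comes from overlapping the per-projection work across cores and pre-fetching projections from disk.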
2.2.8 Can-Mining Algorithm
The Can-Mining algorithm [15] mines frequent itemsets incrementally from a
Can-Tree (Canonical-Order Tree). Like the FP-Growth algorithm, it uses a header
table containing all the items of the database. Every item, together with pointers to
the first and last nodes containing that item in the Can-Tree, is stored in the header
table. A list of the frequent items is needed for the algorithm to obtain the frequent
patterns from the Can-Tree during the mining operation. The advantage of this
algorithm is that, when the minimum support threshold is high, Can-Mining
outperforms the FP-Growth algorithm. However, its disadvantage is that mining
takes longer when the minimum support threshold is much lower. The architecture
of the Can-Mining algorithm is illustrated in [15].
TABLE 2.5
Timeline of the FPM Algorithms
S. No Algorithm Publication Year
1. Apriori [5] 1994
2. DHP [7] 1995
3. FP-Growth [6] 2000
4. EClaT [10] 2000
5. Tree Projection [11] 2001
6. TM [13] 2006
7. P-Mine [14] 2013
8. Can-Mining [15] 2015
2.3 ANALYSIS OF THE ALGORITHMS
Some of the recent and important algorithms for mining frequent patterns were
discussed above, together with their advantages and disadvantages. The recent
algorithms have to deal with new kinds of data and face problems that the earlier
algorithms did not. The overall performance of the algorithms, in terms of execution
time and the amount of memory used, is the key focus in this area. If an algorithm
implemented in a real-world application does not efficiently produce results on time,
it will cause losses to the business where it is used. The main purpose of developing
these efficient algorithms is to reduce execution time so that results are produced
quickly, which in turn leads to better decision making and eventually increases
market sales and helps businesses grow. In [9], the author illustrates, using tables
and graphs, the runtime and memory usage of several existing FPM algorithms.
Table 2.5 lists the algorithms discussed above along with their year of publication.
2.4 PRIVACY ISSUES
Privacy has become a major concern in recent years because individuals' personal
data is so widely available [16]. Data owners are often reluctant to share their data,
share it only in a very constrained manner, or share only a low-quality version of it.
These issues pose a challenge for discovering frequent patterns from the data. The
challenges faced in frequent pattern and association rule mining are discussed below:
• The problem of privacy-preserving pattern mining [17] is that the data is modified,
for example by randomization, before being shared; techniques are then needed
for the meaningful discovery of patterns while maintaining the privacy of the
modified data.
• The problem of distributed privacy preservation [18] is that the data to be mined
is held in a dispersed manner by market contestants who compete with each other.
They want to mine the data jointly to discover global knowledge without revealing
their local insights [2].
2.5 APPLICATIONS OF FPM
This section emphasizes the applications of FPM, because they serve as the main
motivation for frequent pattern algorithms. These applications span a variety of
fields and incorporate many domains of data. As space here is limited, we focus
mainly on some of the key areas where FPM is being used. Some of the key
applications of FPM are discussed in the following.
Attributes A1 and A2 take the values a1 and a2, respectively, as implied by the LHS
of the rule, and c is the class value implied by the RHS of the rule.
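A rule of this form can be sketched as a small matching function; the rule and record shown here are hypothetical, purely for illustration.

```python
# A hypothetical associative-classification rule: if every attribute-value
# pair on the LHS matches the record, predict the class value c on the RHS.
rule = {"lhs": {"A1": "a1", "A2": "a2"}, "rhs": "c"}

def matches(rule, record):
    """True when the record satisfies every condition on the rule's LHS."""
    return all(record.get(attr) == val for attr, val in rule["lhs"].items())

record = {"A1": "a1", "A2": "a2", "A3": "x"}
print(rule["rhs"] if matches(rule, record) else None)  # prints c
```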
2.6 RESOURCES AVAILABLE FOR PRACTITIONERS
Because FPM methods are utilized so often in so many applications, readily
available software can be used to apply FPM in these applications.
KDnuggets [39] is a website that contains links to different resources on FPM.
The Weka repository [40] provides implementations of different data mining
algorithms, including algorithms for FPM. Bart Goethals has implemented some
of the distinguished FPM algorithms, such as Apriori, EClaT, and FP-Growth [41].
FIMI [42] is a repository well known for its efficient implementations of many
FPM algorithms. The free R package arules can perform FPM of many types; its
details can be accessed in [43]. Commercial software such as Enterprise Miner,
provided by SAS, can perform both association and sequential pattern mining [44].
2.7 FUTURE WORKS AND CONCLUSION
Among the main challenges of data mining, such as classification, clustering,
outlier analysis, and FPM, FPM is considered the leading problem in the domain.
This chapter summarized some of the key areas within FPM and reviewed the
strong and weak points of different algorithms. In addition, we highlighted the
issues that arise from growing privacy concerns. Applications such as analyzing
customers' buying behavior, outlier analysis of data, mining of textual data, and
mining of chemical and biological data were discussed, along with other applications.
Future work will focus on how to preserve privacy while still effectively mining
the data. Some more advanced FPM algorithms, together with their different
versions, will also be analyzed. The focus will be on how earlier algorithms can be
modified to work effectively in today's environment. In addition, the variants of
FPM will be examined to understand how they impact industry.
REFERENCES
[1] Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and
future directions. Data Min Knowl Discov., 15(1), 55–86. https://doi.org/10.1007/
s10618-006-0059-1
[2] Aggarwal CC (2014) An introduction to frequent pattern mining. Freq. Pattern
Min., 1–17.
[3] Agrawal R, Imieliński T, Swami A (1993) Mining Association Rules Between
Sets of Items in Large Databases. ACM SIGMOD Rec. https://doi.org/10.1145/
170036.170072
[4] Chee CH, Jaafar J, Aziz IA, et al (2019) Algorithms for frequent itemset mining: a
literature review. Artif. Intell. Rev., 52(4), 2603–2621.
[5] Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large
databases. In: Proc. of the 20th International Conference on Very Large Data Bases
(VLDB’94).
[6] Han J, Pei J, Yin Y (2000) Mining Frequent Patterns Without Candidate Generation.
SIGMOD Rec (ACM Spec Interes Gr Manag Data). https://doi.org/10.1145/
335191.335372
[7] Park JS, Chen MS, Yu PS (1995) An Effective Hash-Based Algorithm for Mining
Association Rules. ACM SIGMOD Rec. https://doi.org/10.1145/568271.223813
[8] Han J, Kamber M, Pei J (2012) Data Mining: Concepts and Techniques.
[9] Meenakshi A (2015) Survey of frequent pattern mining algorithms in horizontal and
vertical data layouts. Int J Adv Comput Sci Technol, 4(4).
[10] Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data,
12(3), 372–390. https://doi.org/10.1109/69.846291
[11] Agarwal RC, Aggarwal CC, Prasad VVV (2001) A tree projection algorithm for gen-
eration of frequent item sets. J Parallel Distrib Comput., 61(3), 350–371. https://
doi.org/10.1006/jpdc.2000.1693
[12] Aggarwal CC, Bhuiyan MA, Hasan M Al (2014) Frequent pattern mining
algorithms: a survey. In: Frequent Pattern Mining.
[13] Song M, Rajasekaran S (2006) A transaction mapping algorithm for frequent itemsets
mining. IEEE Trans Knowl Data Eng., 18(4), 472–481. https://doi.org/10.1109/
TKDE.2006.1599386
[14] Baralis E, Cerquitelli T, Chiusano S, Grand A (2013) P-Mine: Parallel itemset mining
on large datasets. In: Proceedings of International Conference on Data Engineering.
[15] Hoseini MS, Shahraki MN, Neysiani BS (2016) A new algorithm for mining fre-
quent patterns in can tree. In: Conference Proceedings of 2015 2nd International
Conference on Knowledge-Based Engineering and Innovation, KBEI 2015
[16] Aggarwal CC, Yu PS (2008) A general survey of privacy-preserving data mining
models and algorithms. In: Privacy-preserving Data Mining (pp. 11–52).
[17] Evfimievski A, Srikant R, Agrawal R, Gehrke J (2004) Privacy preserving mining of
association rules. In: Information Systems.
[18] Clifton C, Kantarcioglu M, Vaidya J, et al (2002) Tools for privacy preserving
distributed data mining. ACM SIGKDD Explor Newsl. https://doi.org/10.1145/
772862.772867
[19] Ali K, Manganaris S, Srikant R (1997) Partial classification using association rules.
Knowl Discov Data Min., 97, 115–118.
[20] Liu B, Hsu W, Ma Y, Ma B (1998) Integrating classification and association rule
mining. Knowl Discov Data Min., 98, 80–86. https://doi.org/10.1.1.48.8380
[21] Wang K, Xu C, Liu B (1999) Clustering transactions using large items. In: International
Conference on Information and Knowledge Management, Proceedings
[22] Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional
data: a review. ACM SIGKDD Explor Newsl., 6(1), 90–105.
[23] Aggarwal CC, Reddy CK (2013) DATA Clustering Algorithms and Applications.
[24] He Z, Deng S, Xu X (2002) Outlier detection integrating semantic knowledge.
In: Lecture Notes in Computer Science (including subseries Lecture Notes in
Artificial Intelligence and Lecture Notes in Bioinformatics).
[25] Aggarwal CC, Wolf JL, Yu PS (1999) A new method for similarity indexing of
market basket data. SIGMOD Rec (ACM Spec Interes Gr Manag Data). https://
doi.org/10.1145/304181.304218
[26] Nanopoulos A, Manolopoulos Y (2002) Efficient similarity search for market basket
data. VLDB J., 11(2), 138–152. https://doi.org/10.1007/s00778-002-0068-7
[27] Yan X, Yu PS, Han J (2004) Graph indexing: a frequent structure-based approach.
In: Proceedings of the ACM SIGMOD International Conference on Management
of Data.
[28] Yan X, Yu PS, Han J (2005) Substructure similarity search in graph databases.
In: Proceedings of the ACM SIGMOD International Conference on Management
of Data.
[29] Yan X, Zhu F, Han J, Yu PS (2006) Searching substructures with superimposed dis-
tance. In: Proceedings of the International Conference on Data Engineering.
[30] Aggarwal CC, Zhai CX (Eds.) (2013) Mining text data. Springer Science &
Business Media
[31] Aggarwal CC, Abdelzaher T (2013) Social sensing. In: Managing and Mining Sensor
Data (pp. 237–297).
[32] Li Z, Ding B, Han J, Kays R (2010) Swarm: mining relaxed temporal moving object
clusters. In: Proceedings of the VLDB Endowment, 3(1). https://doi.org/10.14778/
1920841.1920934
[33] Deshpande M, Kuramochi M, Wale N, Karypis G (2005) Frequent substructure-
based approaches for classifying chemical compounds. IEEE Trans Knowl Data
Eng., 17(8), 1036-1050. https://doi.org/10.1109/TKDE.2005.127
[34] Cong G, Tung AKH, Xu X, et al (2004) FARMER: finding interesting rule groups in
microarray datasets. In: Proceedings of the ACM SIGMOD International Conference
on Management of Data.
[35] Cong G, Tan KL, Tung AKH, Xu X (2005) Mining top-k covering rule groups for gene
expression data. In: Proceedings of the ACM SIGMOD International Conference on
Management of Data.
[36] Pan F, Cong G, Tung AKH, et al (2003) Carpenter: Finding closed patterns in long
biological datasets. In: Proceedings of the ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining.
[37] Wang J, Shapiro B, Shasha D (1999) Pattern Discovery in Biomolecular Data: Tools,
Techniques and Applications. Oxford University Press.
[38] Rigoutsos I, Floratos A (1998) Combinatorial pattern discovery in biological
sequences: the TEIRESIAS algorithm. Bioinformatics, 14(1), 55–67. https://doi.org/
10.1093/bioinformatics/14.1.55
[39] KDnuggets www.kdnuggets.com/software/associations.html
[40] www.cs.waikato.ac.nz/ml/weka/
http://adrem.ua.ac.be/~goethals/software/
[42] http://fimi.ua.ac.be/
[43] https://cran.r-project.org/web/packages/arules/index.html
[44] www2.sas.com/proceedings/forum2007/132-2007.pdf
CONTENTS
3.1 Introduction......................................................................................................32
3.1.1 Comparing Conventional Data Technique and Big
Data Technique.....................................................................................32
3.2 Big Data Technique Types................................................................................33
3.2.1 Structured Big Data Type.....................................................................33
3.2.2 Unstructured Big Data Type.................................................................33
3.2.3 Semi-Structured Big Data Type............................................................33
3.3 Essence of Big Data.........................................................................................34
3.3.1 Volume..................................................................................................34
3.3.2 Variety..................................................................................................34
3.3.3 Velocity.................................................................................................34
3.3.4 Variability.............................................................................................34
3.3.5 Value.....................................................................................................34
3.4 Categorization of Data Mining Systems..........................................................37
3.4.1 Classification on the Basis of the Type of Data Source
That Is Mined.......................................................................................37
3.4.2 Classification [3] on the Basis of Kind of Knowledge
Discovered............................................................................37
3.4.3 Classification on the Basis of the Data Model on which
It Is Drawn............................................................................................37
3.4.4 Classification According to Different Mining Techniques That
Are Used...............................................................................................37
3.5 Data Mining Design.........................................................................................38
3.5.1 Data Source..........................................................................................38
3.5.2 Data Warehouse [2] Server...................................................38
3.5.3 Data Mining Engine.............................................................................38
3.5.4 Pattern Assessment Module..................................................................38
3.5.5 Graphical User Interface (GUI)............................................................39
DOI: 10.1201/9781003199403-3
3.1 INTRODUCTION
Big data consists of data sets so large that they become challenging to process with
the available database management tools and mechanisms. Businesses, governmental
institutions, HCPs (Health Care Providers), and financial and academic institutions
are all taking advantage of the competency of big data to work on different business
anticipations and to enhance the experience of many customers. Nearly 90% of the
world's data has been created over the past two years, and this rate is still growing
enormously. In addition, big data has now been introduced in almost every industry
and every field.
According to Gartner, big data can be defined as "high-volume, high-velocity, and
diversified knowledge assets that require cost-effective, contemporary modes of
knowledge processing for improved insight as well as good decision making".
There are certain basic points about the big data technique that are discussed below:
• The technique refers to large collections of data that expand exponentially
over time.
• The data is so comprehensive that it cannot be refined or evaluated with
typical data processing mechanisms.
• The technique encompasses data mining, data storage, data examination,
data sharing, and data visualization.
• The technique is an all-inclusive one, covering the data, data frameworks, and
the various tools and methods used to examine and evaluate the data.