
Data Driven Decision Making Using Analytics (Computational Intelligence Techniques), 1st Edition

Computational Intelligence Techniques
Series Editor: Vishal Jain

The objective of this series is to provide researchers with a platform to present state-of-the-art innovations and research, and to design and implement methodological and algorithmic solutions to data-processing problems, including evolving trends in health informatics and computer-aided diagnosis. The series supports researchers involved in designing decision support systems that will permit societal acceptance of ambient intelligence. Its overall goal is to present the latest snapshot of ongoing research and to shed further light on future directions in this space. The series presents novel technical studies as well as position and vision papers comprising hypothetical and speculative scenarios. The book series seeks to compile all aspects of computational intelligence techniques, from fundamental principles to current advanced concepts. For this series, we invite researchers, academicians, and professionals to contribute, expressing their ideas and research on the application of intelligent techniques to the field of engineering, in handbook, reference, or monograph volumes.

Computational Intelligence Techniques and Their Applications to Software Engineering Problems
Ankita Bansal, Abha Jain, Sarika Jain, Vishal Jain, Ankur Choudhary

Smart Computational Intelligence in Biomedical and Health Informatics
Amit Kumar Manocha, Mandeep Singh, Shruti Jain, Vishal Jain

Data Driven Decision Making Using Analytics
Parul Gandhi, Surbhi Bhatia, and Kapal Dev

Smart Computing and Self-Adaptive Systems
Simar Preet Singh, Arun Solanki, Anju Sharma, Zdzislaw Polkowski and Rajesh Kumar

For more information about this series, please visit: https://www.routledge.com/Computational-Intelligence-Techniques/book-series/CIT
Data Driven Decision Making Using Analytics

Edited by
Parul Gandhi, Surbhi Bhatia, and Kapal Dev

CRC Press
Boca Raton and London
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-​2742
and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
© 2022 selection and editorial matter, Parul Gandhi, Surbhi Bhatia and Kapal Dev; individual chapters,
the contributors
CRC Press is an imprint of Taylor & Francis Group, LLC
Reasonable efforts have been made to publish reliable data and information, but the author and publisher
cannot assume responsibility for the validity of all materials or the consequences of their use. The authors
and publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used
only for identification and explanation without intent to infringe.
Library of Congress Cataloging-​in-​Publication Data
A catalog record for this title has been requested
ISBN: 978-1-03-205827-6 (hbk)
ISBN: 978-1-03-205828-3 (pbk)
ISBN: 978-1-00-319940-3 (ebk)
DOI: 10.1201/​9781003199403
Typeset in Times
by Newgen Publishing UK
Contents

Preface ................................................................ vii
List of Contributors .................................................... ix
Editors' Biography ...................................................... xi

Chapter 1  Securing Big Data Using Big Data Mining ...................... 1
           Preety, Dagjit Singh Dhatterwal, and Kuldeep Singh Kaswan

Chapter 2  Analytical Theory: Frequent Pattern Mining .................. 15
           Ovais Bashir Gashroo and Monica Mehrotra

Chapter 3  A Journey from Big Data to Data Mining in Quality Improvement ... 31
           Sharad Goel and Prerna Bhatnagar

Chapter 4  Significance of Data Mining in the Domain of Intrusion Detection ... 45
           Parul Gandhi, Ravi Kumar Sharma, Tejinder Pal Singh Brar, and Pradeep Bhatia

Chapter 5  Data Analytics and Mining: Platforms for Real-Time Applications ... 61
           Saima Saleem and Monica Mehrotra

Chapter 6  Analysis of Government Policies to Control Pandemic and Its Effects on Climate Change to Improve Decision Making ... 81
           Vaibhav Saini and Kapal Dev

Chapter 7  Data Analytics and Data Mining Strategy to Improve Quality, Performance and Decision Making ... 95
           D. Sheema and K. Ramesh

Chapter 8  SMART Business Model: An Analytical Approach to Astute Data Mining for Successful Organization ... 111
           Sharad Goel and Sonal Kapoor

Chapter 9  AI and Healthcare: Praiseworthy Aspects and Shortcomings ... 125
           Ashay Singh and Ankur Singh Bist

Index ................................................................. 137

Preface
Digitalization has increased our capability to collect and generate data from different sources, and as a result enormous volumes of data now flood every aspect of our lives. This growth has created an urgent need to develop techniques and tools to handle, analyze, and manage data in order to map it into useful information. This mapping helps improve performance, which in turn supports decision making.

This book opens new opportunities in the area of data analytics for decision making, targeting further research in verticals such as healthcare and climate change. It explores the concepts of database technology, machine learning, knowledge-based systems, high-performance computing, information retrieval, finding patterns hidden in large datasets, and data visualization. In addition, the book presents various paradigms including pattern mining, clustering and classification, and data analysis. Its aim is to provide technical solutions in the field of data analytics and data mining.

This book lays the required basic foundation and also covers cutting-edge topics. With its algorithmic perspective, examples, and comprehensive coverage, it offers solid guidance to researchers, students, and practitioners.

Contributors

Pradeep Kumar Bhatia: Professor, Department of Computer Science and Engineering, Guru Jambheshwar University of Science & Technology, Hisar ([email protected])
Prerna Bhatnagar: Assistant Professor, Indirapuram Institute of Higher Studies (IIHS), Ghaziabad, Uttar Pradesh ([email protected])
Ankur Singh Bist: Chief AI Data Scientist, Signy Advanced Technologies, India ([email protected])
Tejinder Pal Singh Brar: Assistant Professor, Department of Computer Applications, CGC Landran, Punjab ([email protected])
Kapal Dev: University of Johannesburg, South Africa ([email protected])
Dagjit Singh Dhatterwal: Assistant Professor, PDM University, Bahadurgarh, Jhajjar, Haryana, India ([email protected])
Parul Gandhi: Professor, Faculty of Computer Applications, MRIIRS, Faridabad ([email protected])
Ovais Bashir Gashroo: Scholar, Department of Computer Science, Jamia Millia Islamia, New Delhi, India ([email protected])
Sharad Goel: Director & Professor, Indirapuram Institute of Higher Studies (IIHS), Ghaziabad, Uttar Pradesh ([email protected])
Sonal Kapoor: Associate Professor, Indirapuram Institute of Higher Studies (IIHS), Ghaziabad, Uttar Pradesh ([email protected])
Kuldeep Singh Kaswan: Associate Professor, Galgotias University, Greater Noida, Gautam Buddha Nagar, UP, India ([email protected])
Monica Mehrotra: Professor, Department of Computer Science, Jamia Millia Islamia, New Delhi, India ([email protected])
Preety: Assistant Professor, PDM University, Bahadurgarh, Jhajjar, Haryana, India ([email protected])
K. Ramesh: Professor, Department of Computer Science & Engineering, Hindustan Institute of Technology and Science, Chennai, Tamil Nadu, India ([email protected])
Vaibhav Saini: Indian Institute of Technology, Delhi, India ([email protected])
Saima Saleem: Scholar, Department of Computer Science, Jamia Millia Islamia, New Delhi, India ([email protected])
Ravi Kumar Sharma: Assistant Professor, Department of Computer Applications, CGC Landran, Punjab ([email protected])
D. Sheema: Department of Computer Applications, Hindustan Institute of Technology and Science, Chennai, Tamil Nadu, India ([email protected])
Ashay Singh: Data Scientist, US Tech Solutions Pvt Ltd, India ([email protected])
Editors' Biography

PARUL GANDHI
Dr Gandhi holds a Doctorate in Computer Science, with a research focus on Software Engineering, from Guru Jambheshwar University, Hisar. She is also a Gold Medalist in M.Sc. Computer Science, with a strong inclination toward academics and research. She has 15 years of academic, research, and administrative experience, and has published more than 40 research papers in reputable international and national journals and conferences. Her research interests include software quality, soft computing, software metrics, component-based software development, data mining, and IoT. Presently, Dr Gandhi is working as Professor at Manav Rachna International Institute of Research and Studies (MRIIRS), Faridabad, where she also handles the university's PhD program. She is an editorial board member of SN Applied Sciences and a reviewer for various respected IEEE journals and conferences. Dr Gandhi has published many book chapters in Scopus-indexed books, has edited various books with well-known publishers such as Wiley and Springer, and handles special issues in Elsevier and Springer journals as a guest editor. She has served as a resource person in various FDPs, has chaired sessions at IEEE conferences, and is a lifetime member of the Computer Society of India.

SURBHI BHATIA PMP®

Dr Bhatia holds a Doctorate in Computer Science and Engineering from Banasthali Vidyapith, India. She is currently an Assistant Professor in the Department of Information Systems, College of Computer Sciences and Information Technology, King Faisal University, Saudi Arabia, and has eight years of teaching and academic experience. She is an editorial board member of the International Journal of Hybrid Intelligence (Inderscience), SN Applied Sciences (Springer), and several IEEE conferences. Dr Bhatia has been granted seven national and international patents, and has published more than 30 papers in reputable journals and conferences indexed in well-known databases including SCI, SCIE, Web of Science, and Scopus. She has delivered keynote talks at IEEE conferences and faculty development programs. Dr Bhatia has authored two books with Springer and Wiley and is currently editing three books with CRC Press, Elsevier, and Springer. She also handles special issues in Elsevier and Springer journals as a guest editor. Dr Bhatia is an active researcher in the fields of data mining, machine learning, and information retrieval.

KAPAL DEV
Dr Dev is a Postdoctoral Research Fellow with the CONNECT Centre, School of Computer Science and Statistics, Trinity College Dublin (TCD). His education spans ICT: Electronics (B.E. and M.E.), Telecommunication Engineering (PhD), and a postdoc on the fusion of 5G and blockchain. He received his PhD degree from Politecnico di Milano, Italy, in July 2019. His research interests include blockchain, 5G and beyond networks, and artificial intelligence. Previously, Dr Dev worked as a 5G Junior Consultant and Engineer at Altran Italia S.p.A, Milan, on 5G use cases. He is PI of two Erasmus+ International Credit Mobility projects, and an evaluator of MSCA Co-Fund schemes, Elsevier book proposals, and top scientific journals and conferences including IEEE TII, IEEE TITS, IEEE TNSE, IEEE JBHI, FGCS, COMNET, TETT, IEEE VTC, and WF-IoT. Dr Dev is a TPC member of IEEE BCA 2020 (in conjunction with AICCSA 2020), ICBC 2021, DICG (colocated with Middleware 2020), and FTNCT 2020. He serves as guest editor (GE) in COMNET (IF 3.11), Associate Editor in IET Quantum Communication, GE in COMCOM (IF 2.8), GE in CMC-Computers, Materials & Continua (IF 4.89), and lead chair of one of the CCNC 2021 workshops. Dr Dev is also acting as Head of Projects for the Oceans Network funded by the European Commission.
1 Securing Big Data Using Big Data Mining

Preety (1), Dagjit Singh Dhatterwal (2), and Kuldeep Singh Kaswan (3)
(1) Assistant Professor, PDM University, Bahadurgarh, Jhajjar, Haryana, India
(2) Assistant Professor, PDM University, Bahadurgarh, Jhajjar, Haryana, India
(3) Associate Professor, Galgotias University, Greater Noida, Gautam Buddha Nagar, UP, India
Email: [email protected], [email protected], [email protected]

CONTENTS
1.1 Big Data ............................................................... 2
    1.1.1 Big Data V's ..................................................... 2
        1.1.1.1 Volume ..................................................... 3
        1.1.1.2 Variety .................................................... 3
        1.1.1.3 Velocity ................................................... 4
        1.1.1.4 Veracity ................................................... 4
        1.1.1.5 Validity ................................................... 4
        1.1.1.6 Visualization of Big Data .................................. 4
        1.1.1.7 Value ...................................................... 4
        1.1.1.8 Big Data Hiding ............................................ 4
    1.1.2 Challenges with Big Data ......................................... 4
    1.1.3 Analytics of Big Data ............................................ 5
        1.1.3.1 Use Cases Used in Big Data Analytics ....................... 5
            1.1.3.1.1 Amazon's "360-Degree View" ........................... 5
            1.1.3.1.2 Amazon: Improving User Experience .................... 5
    1.1.4 Social Media Analysis and Response ............................... 5
        1.1.4.1 IoT: Preventive Maintenance and Support .................... 5
        1.1.4.2 Healthcare ................................................. 5
        1.1.4.3 Insurance Fraud ............................................ 6
    1.1.5 Big Data Analytics Tools ......................................... 6
        1.1.5.1 Hadoop ..................................................... 6
        1.1.5.2 MapReduce Optimize ......................................... 7
        1.1.5.3 HBase Hadoop Structure ..................................... 7
        1.1.5.4 Hive Warehousing Tool ...................................... 8
        1.1.5.5 Pig Programming ............................................ 8
        1.1.5.6 Mahout Sub-Project Apache .................................. 8
        1.1.5.7 Non-Structured Query Language .............................. 8
        1.1.5.8 Bigtable ................................................... 9
    1.1.6 Security Threats for Big Data .................................... 9
    1.1.7 Big Data Mining Algorithms ....................................... 9
    1.1.8 Big Data Mining for Big Data Security ........................... 10
        1.1.8.1 Securing Big Data ......................................... 11
        1.1.8.2 Real-Time Predictive and Active Intrusion Detection Systems ... 11
        1.1.8.3 Securing Valuable Information Using Data Science .......... 12
        1.1.8.4 Pattern Discovery ......................................... 12
        1.1.8.5 Automated Detection and Response Using Data Science ....... 12
    1.1.9 Conclusions ..................................................... 13

DOI: 10.1201/9781003199403-1

1.1 BIG DATA

The advent of IoT (internet of things) devices, business intelligence systems, and AI (artificial intelligence) has led to their widespread implementation and to a continuous increase in the amount of data in existence, driven by developments such as self-driving cars, smart cities, home and factory automation, intelligent avionics systems, weaponry automation, and medical process automation. Ericsson has estimated that nearly 29 billion connected devices are expected by 2022, of which 18 billion will relate to IoT. The number of IoT units, led by new use scenarios, is projected to grow by 21% between 2016 and 2022, and IDC reports that by 2025 real-time data will make up more than a quarter of all data. Over the years, control systems have kept evolving at different levels of big data information security. Although these control measures serve as the underlying strategies for securing big data, they have limited capability against recent attacks, as malicious hackers have found new ways of launching destructive operations on big data infrastructures [1].

Digital data will keep growing into the zettabyte range. This forecast gives insight into the higher rate of vulnerabilities and the large-scale data security loopholes that may arise. Big data companies face growing challenges in securing and managing constantly expanding data.
Some of the challenges include the following:

• Interception or corruption of data in transit.
• Data in storage, which can be held to ransom by malicious parties or hackers.
• Output data, which can also be a point of malicious attack.
• Low or no encryption across the variety of data sources.
• Incompatibility resulting from the various forms of data implementation from different sources.
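The first of these risks, corruption of data in transit, is commonly countered with message authentication. The sketch below uses Python's standard hmac module to tag a payload so the receiver can detect tampering; the key and payload values are illustrative assumptions, not part of any real system described in this chapter.

```python
import hmac
import hashlib

SECRET_KEY = b"shared-secret-key"  # hypothetical key agreed by sender and receiver

def sign(payload: bytes) -> str:
    """Attach an HMAC-SHA256 tag so the receiver can detect corruption or tampering."""
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(payload), tag)

message = b"sensor-reading:42.7"
tag = sign(message)
print(verify(message, tag))               # True: payload intact
print(verify(b"sensor-reading:99", tag))  # False: payload altered in transit
```

In practice the tag travels alongside the payload; encryption (e.g. TLS) would additionally hide the content, while the HMAC only detects alteration.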

1.1.1 Big Data V's

The challenges outlined above greatly impact the V's of big data, the building blocks illustrated in Figure 1.1 [2].
FIGURE 1.1 Nine V's of big data
1.1.1.1 Volume
Volume refers to the cumulative amount of data. Today, Facebook contributes 500 terabytes of new data every day, and a single flight across the United States can produce 240 terabytes of flight data. In the near future, mobile phones and the data they generate and ingest will result in thousands of new, continuously changing data streams carrying information on the world, location, and other matters.

1.1.1.2 Variety
Data comes in various types, such as text, sensor data, audio, graphics, and video, and in several forms:

Structured data: data that can be saved in row-and-column tables in a database. These data are linked and can be mapped quickly into pre-designed fields, as in a relational database.

Semi-structured data: partially ordered data such as XML and CSV files.

Unstructured data: data that cannot be pre-defined, for example text, audio, and video files. It accounts for approximately 80% of data; it is fast growing and its use can assist a company's decision making.
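The difference between the forms shows up in how they are parsed. A small sketch, using only Python's standard library, reads the same records from a structured CSV table and a semi-structured XML document; the sample data is invented purely for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns map directly onto pre-designed fields.
csv_text = "id,name\n1,Alice\n2,Bob\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: XML carries its schema in tags rather than a fixed table.
xml_text = "<users><user id='1'>Alice</user><user id='2'>Bob</user></users>"
root = ET.fromstring(xml_text)
names = [u.text for u in root.findall("user")]

print(rows[0]["name"], names)  # Alice ['Alice', 'Bob']
```

Unstructured data (free text, audio, video) has no such parser: extracting fields from it requires analysis techniques rather than a schema.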

1.1.1.3 Velocity
Velocity measures how quickly data arrives, as streams flow in constantly, and how fast usable data can be obtained in real time, for example from a webcam.

1.1.1.4 Veracity
Veracity is the consistency or trustworthiness of data. It asks whether data obtained from Twitter posts, with their hashtags, abbreviations, and informal styles, is trustworthy and correct:

• Do you have faith in the data you gathered?
• Is the data reliable enough to gather insight from?

1.1.1.5  Validity
It is important to verify the authenticity of the data prior to processing large data sets.

1.1.1.6 Visualization of Big Data
A big data processing challenge is how findings are visualized, since the data is so broad that user-friendly visualizations are difficult to produce.

1.1.1.7 Value
Value refers to the worth of the data being extracted. Bulk data with no value is of no use to a company; data needs to be converted into something valuable to achieve business gains. By estimating the full cost of producing and processing big data, businesses can determine whether big data analytics really adds value relative to the ROI that business insights are supposed to produce.
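As a back-of-the-envelope illustration of that cost-versus-gain comparison, ROI can be computed as net gain over cost; the dollar figures below are hypothetical.

```python
def roi(insight_gain: float, total_cost: float) -> float:
    """Return on investment: net gain relative to the cost of producing it."""
    return (insight_gain - total_cost) / total_cost

# Hypothetical figures: $120k of extra revenue from analytics that cost $80k to run.
print(f"{roi(120_000, 80_000):.0%}")  # 50%
```

A negative result would indicate that, for that project, the analytics program costs more than the insights it yields.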

1.1.1.8 Big Data Hiding
Huge volumes of usable data are lost when fresh information is mainly unstructured and file-based.

1.1.2 Challenges with Big Data

Handling big data raises many challenges, including:

• Storing exponentially growing, huge data sets.
• Integrating disparate data sources.
• Generating insights in a timely manner.
• Data governance.
• Security issues.


1.1.3 Analytics of Big Data

Big data analytics examines large and diverse forms of data in order to detect hidden trends, associations, and other insights.

1.1.3.1 Use Cases Used in Big Data Analytics

1.1.3.1.1 Amazon's "360-Degree View"
To develop its recommendation engine, Amazon uses broad data obtained from consumers: what you buy, your reviews and feedback, personal details, your shipping address (to estimate your income level based on where you live), and your browsing behavior. The company also makes recommendations based on what other customers with similar profiles bought, which also helps it retain existing customers [3].

1.1.3.1.2 Amazon: Improving User Experience
Amazon analyzes every visitor's clicks on its web pages, which allows the company to understand users' web navigation behavior, their typical paths to purchase, and the paths that led them to leave the site. All this knowledge helps enhance consumers' marketing and advertising experiences.

1.1.4 Social Media Analysis and Response

Companies monitor what people are saying about their products and services on social media, collecting and analyzing posts on Facebook, Twitter, Instagram, etc. This helps them improve their products, enhance customer satisfaction, and retain existing customers.

1.1.4.1 IoT: Preventive Maintenance and Support
In factories and other installations that use costly instruments, sensors are used to track the equipment and transmit the related data over the internet. Big data technology processes this data to identify, often in real time, whether a failure is about to occur. Preventing incidents or expensive shutdowns helps sustain operations.

1.1.4.2 Healthcare
Big data in healthcare refers to the vast volumes of data obtained from sources such as electronic gadgets: exercise-tracking systems, smartwatches, and sensors. Biometric data such as X-rays, CT scans, medical documents, EHRs, demographics, family history, asthma records, and clinical trial findings also come under big data. It helps physicians develop libraries that are vital to genetic disease prediction; for example, preventive treatment can be arranged for people at risk of a particular illness (e.g. diabetes), and based on data from other patients, proposals can be made for each individual. Clinical decision support (CDS) software in hospitals investigates on-site diagnostic information and guides health providers in diagnosing patients and drafting orders. Wearables continually gather health data from customers and notify physicians: if something goes wrong, the doctor or another expert is immediately alerted and can call patients without further delay to give them all the guidance they need.

1.1.4.3 Insurance Fraud
Insurers analyze their internal data to gain insight into potentially fraudulent claims: call center notes and voice recordings, third-party social media reports, and records of individuals' bills, salaries, insolvencies, criminal history, and address changes. For example, when a claimant reports flood damage to a vehicle, the weather conditions on that day can be checked through social media feeds. Insurers can apply text analytics to this data to find small inconsistencies in the claimant's case report. Fraudsters tend to adjust their narrative over time, which makes this a valuable tool for detecting fraud.
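As a toy illustration of that text-analytics idea, one can compare two versions of a claimant's story by the fraction of words on which they disagree. The claim statements and the scoring rule below are invented for this sketch; real fraud-detection systems use far richer features.

```python
def inconsistency_score(statement_a: str, statement_b: str) -> float:
    """Crude text-analytics signal: fraction of words that differ between
    two versions of a claimant's story (higher = more inconsistent).
    Computed as the Jaccard distance over word sets."""
    a = set(statement_a.lower().split())
    b = set(statement_b.lower().split())
    return len(a ^ b) / len(a | b)

first_report = "the car was flooded during heavy rain on tuesday"
later_report = "the car was flooded during a storm on friday"
print(round(inconsistency_score(first_report, later_report), 2))  # 0.5
```

A rising score across retellings of the same claim would flag the file for human review rather than automatically denying it.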

1.1.5 Big Data Analytics Tools

1.1.5.1 Hadoop
Hadoop is an open-source software platform developed by the Apache Software Foundation in 2006. It helps to store data, delivers the power to manage big jobs, and carries out many jobs and tasks virtually simultaneously. It is intended to distribute data across compute nodes for fast processing while tolerating node failures [4]. Its two fundamental segments are the storage core, known as the Hadoop Distributed File System (HDFS), and the processing engine, known as MapReduce. HDFS is used to store colossal amounts of information, continually written and served to client applications at high bandwidth, while MapReduce is used for handling gigantic data sets in a distributed manner across many machines (Figure 1.2).

HDFS is Hadoop's storage component. It stores information in fixed-size blocks by splitting up the files. Blocks are kept in a cluster of nodes using a master/slave framework in which a cluster contains a single name node and the other nodes are data nodes (slave nodes). The NameNode and DataNodes are the main HDFS components. The HDFS master node maintains and tracks the blocks held on the DataNodes, on highly available servers that control clients' access to files. It also records metadata for all files in the cluster, such as block location and file size. Metadata is maintained using the following two files:

• FsImage: contains all the modifications ever made across the Hadoop cluster since the NameNode began (stored on disk).
• EditLogs: contains all recent updates, for example the updates of the last hour (stored in RAM).
Securing Big Data Using Big Data Mining

FIGURE 1.2 Hadoop Ecosystem

The NameNode manages the replication factor and collects a Heartbeat (every 3
seconds by default) and a block report from every DataNode to confirm which DataNodes
are alive. It is also responsible for choosing replacement DataNodes when a DataNode fails. DataNodes
behave as slave nodes in the HDFS and store the actual data. They run on com-
modity hardware, that is, inexpensive machines of modest performance, and serve read and
write requests from clients. The secondary NameNode operates alongside the main
NameNode as a daemon to assist with checkpointing. It is used to merge the EditLogs
into the NameNode's FsImage: it periodically downloads the EditLogs from the NameNode and
applies them to FsImage, and the next time the NameNode starts, this new FsImage is copied back
to the NameNode. Because it performs these checkpoints routinely (every hour by default), it is also
called the CheckpointNode.
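The block-splitting and replication scheme described above can be sketched in a few lines. The block size, node names, and round-robin placement policy below are simplifying assumptions made for illustration, not actual HDFS code:

```python
# Sketch: how HDFS-style storage splits a file into fixed-size blocks
# and assigns each block to several DataNodes (replication factor 3,
# matching the HDFS default). Illustration only, not the real HDFS.

def split_into_blocks(file_size, block_size=128):
    """Return (block_id, size) pairs for a file of file_size MB."""
    blocks = []
    offset = 0
    while offset < file_size:
        blocks.append((len(blocks), min(block_size, file_size - offset)))
        offset += block_size
    return blocks

def place_replicas(blocks, datanodes, replication=3):
    """Round-robin placement of each block's replicas on distinct DataNodes."""
    placement = {}
    for block_id, _ in blocks:
        placement[block_id] = [datanodes[(block_id + r) % len(datanodes)]
                               for r in range(replication)]
    return placement

blocks = split_into_blocks(300)          # a 300 MB file -> 3 blocks
nodes = ["dn1", "dn2", "dn3", "dn4"]
print(blocks)                            # [(0, 128), (1, 128), (2, 44)]
print(place_replicas(blocks, nodes)[0])  # ['dn1', 'dn2', 'dn3']
```

The NameNode's metadata role corresponds to the `placement` dictionary here: it records where each block lives, while the blocks themselves stay on the DataNodes.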

1.1.5.2  MapReduce Engine

This architecture serves as a wide-ranging programming paradigm for the
processing of a large volume of data on Java-based distributed clusters. It includes
two functions. The first is Map, which takes a set of data and transforms it
into another set of intermediate key/value pairs; the second is Reduce, which takes the output of Map and merges
the tuples into a smaller set of tuples. MapReduce is a major step forward, but it is used mostly for
computation; to exploit it fully, tools beyond bare computation must be layered
on top, and a number of products should be developed so that big
data can be handled effectively [5].
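The two phases just described can be illustrated with a toy, single-process word count. This is an illustration of the programming model only, not a distributed implementation:

```python
# Sketch of the MapReduce programming model: map emits (key, value) pairs,
# a shuffle step groups them by key, and reduce aggregates each group.
from collections import defaultdict

def map_phase(documents):
    # Map: each document is turned into (word, 1) pairs
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values under their key (done by the framework)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each group of values into a single result
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big data mining", "data security"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 3, 'mining': 1, 'security': 1}
```

In Hadoop, the map and reduce functions run on many machines at once and the shuffle happens over the network; the logic per key, however, is exactly this simple.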

1.1.5.3 HBase Hadoop Structure


Data Driven Decision Making Using Analytics

HBase follows the original Bigtable design: it is a distributed, column-oriented
database mechanism used to provide random, real-time access to data
when a large amount of structured data is stored. HBase relies primarily
on the fault tolerance provided by the HDFS, and a portion of HBase is dedicated
to ensuring that rows can be retrieved from its main register. Users can either store data
directly in HDFS or via HBase, and can randomly read/write in HDFS via HBase.
Data stored in HBase take the form of schema-less key/value pairs,
and attributes can be referred to in the non-key parts [6].

1.1.5.4  Hive Warehousing Tool


Hive is built on top of the Hadoop Distributed File System and is considered a
great warehousing platform for HDFS files. It is an ideal tool for evaluating
massive databases, large data sets, and ad hoc queries. Users can communicate with this framework through a web GUI
and Java Database Connectivity (JDBC). The MapReduce model requires that job methods be created; Hive
can be viewed as sitting at the core of HDFS, on top of the data distribution layer.
Real-time applications and continuous online transactions are not processed on the Hive
network [7].

1.1.5.5  Pig Programming


Pig is another really powerful tool in the Hadoop ecosystem, providing an additional
abstraction layer for better productivity. A Pig table is a collection of tuples in which each field
can itself hold a bag of tuples. Pig's procedural data-flow language is called Pig
Latin and is used primarily for programming. The language provides all the standard
SQL constructs, including joining, ordering, projecting, and grouping of samples.
Compared with the MapReduce scheme, it also offers a higher level of abstraction,
since a single Pig Latin query can be compiled into a succession of MapReduce
jobs [8].

1.1.5.6  Mahout: An Apache Sub-Project


Mahout was created in 2008 as a sub-project of Apache Lucene and is an open-source plat-
form used mainly to build scalable machine learning algorithms. The following
machine learning techniques are supported:

1. Collaborative Filtering
2. Clustering
3. Categorization/​Classification [9]
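As a sketch of the first technique in this list, collaborative filtering compares users by the similarity of their ratings. The users, items, and ratings below are invented for illustration; Mahout itself provides far more elaborate implementations:

```python
# Sketch: user-based collaborative filtering with cosine similarity
# over the items two users have rated in common.
import math

ratings = {  # user -> {item: rating}; toy data, not from Mahout
    "u1": {"A": 5, "B": 3, "C": 4},
    "u2": {"A": 4, "B": 3, "C": 5},
    "u3": {"A": 1, "B": 5},
}

def cosine(r1, r2):
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    n1 = math.sqrt(sum(r1[i] ** 2 for i in common))
    n2 = math.sqrt(sum(r2[i] ** 2 for i in common))
    return dot / (n1 * n2)

# u1 and u2 rate items similarly, so their similarity is close to 1;
# u3 disagrees with u1, so the similarity is noticeably lower.
print(round(cosine(ratings["u1"], ratings["u2"]), 3))  # 0.98
print(round(cosine(ratings["u1"], ratings["u3"]), 3))  # 0.673
```

A recommender then suggests to a user the items most liked by that user's nearest neighbors under this similarity.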

1.1.5.7  Non-​Structured Query Language


NoSQL refers to a non-SQL, or non-relational, database: one that provides a framework
for the storage and retrieval of data that is not modeled in relational form. There are different kinds of
NoSQL databases, including key-value stores, document stores, column-family stores, and graph
databases, and software engineers choose among them according to how well each
suits the structure of their application's data. Because of the growth of everyday internet use
and inexpensive storage, a massive amount of structured, semi-organized, and
unstructured information is collected, and all of it is stored for different kinds
of uses. Normally, this information is treated as big data. NoSQL databases
are used by Google, Twitter, Amazon, and other famous websites [10].

1.1.5.8  Bigtable
The Bigtable system was introduced in 2004 and is currently used by many Google
services, together with MapReduce. Bigtable underpins Google Reader, Google
Maps, Google Book Search, Google Earth, Blogger.com, Google Code hosting,
Orkut, YouTube, and Gmail to serve and modify information. It gives Google
flexibility and greater control over the performance of particular databases. Bigtable
scales by using a distributed data storage management model that relies on
partitioning the data in order to speed up retrieval.

1.1.6 Security Threats for Big Data


• An unauthorized user can access files and mount impersonation or other attacks.
• An unauthorized user can snoop on/spy on packets of information sent to a client.
• An unauthorized client can read/write a block of file data.
• An unauthorized user can gain privileges to corrupt important informa-
tion in the related files.

1.1.7 Big Data Mining Algorithms


Big data implementations involve autonomous sources and decentralized control; shipping
every distributed data source to a centralized mining facility is prohibitively expensive, owing to the
transmission costs and the privacy issues involved. In such a setting, mining proceeds at two levels:
the data level and the model level. At the data level, each site computes
statistics based on its local sources of information and shares
these summaries with the other sites to obtain a global view of the distribution. At the model level, each site
carries out local mining activities to discover local models and patterns with regard to its restricted data.
Global patterns can then be synthesized by aggregating the patterns from all the local sources and
exchanging patterns between the different sites [11]. Model correlation analysis at
the data level reveals the relationships among the models generated from the different
information sources, helping to decide how the sources are connected
and how accurate the choices that rely on models from autonomous sources can be.
Because each source is limited, there are often not enough local data points
to draw accurate conclusions. Big data is also a special category
of incomplete data, where not every data set follows any simple, well-behaved
distribution. Different factors, such as the failure of a sensor
node or conventions that deliberately skip some values, may produce
missing values. Although most data mining algorithms today include
solutions to handle missing values, data imputation is an established area of
study that aims to construct improved models by imputing missing values. Complex

Data Driven Decision Making Using Analytics

and dynamic data further complicate mining, driven by the exponential devel-
opment of structured data and changes in its scale and nature. Complex documents with
intricate relationships are found on WWW servers, website logs, interper-
sonal networks, communications systems, and transport systems. Although com-
plex dependency structures within the data increase the difficulty of our learning
tasks, broad data complexity arises from a wide variety
of perspectives, including increasingly complex data types, complex intrinsic semantic
relationships, and complex information association structures.
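The two-level scheme described at the start of this section, local mining at each site followed by global aggregation of the summaries, can be sketched as follows (the sites and transactions are invented for illustration):

```python
# Sketch: two-level distributed frequent-item mining.
# Each site mines locally; only compact summaries (item -> count) are
# exchanged and merged into a global view, so the raw data stays in place.
from collections import Counter

sites = {  # invented transaction data held at three distributed sites
    "site1": [["a", "b"], ["a", "c"], ["a", "b", "c"]],
    "site2": [["b", "c"], ["a", "b"]],
    "site3": [["a"], ["c"], ["a", "c"]],
}

def local_counts(transactions):
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # count each item once per transaction
    return counts

def global_frequent(sites, min_support):
    merged = Counter()
    for transactions in sites.values():
        merged.update(local_counts(transactions))  # only counts travel
    return {item: n for item, n in merged.items() if n >= min_support}

# items 'a' (count 6) and 'c' (count 5) reach the global threshold of 5
print(global_frequent(sites, min_support=5))
```

Exchanging counts instead of raw transactions is what keeps both the transmission cost and the privacy exposure low in this setting.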
Big data contains structured information, unstructured information, semi-structured
information, etc. Relational databases, texts, hypertexts, pictures, and audio and video data
are all present: web news, Facebook comments, Picasa photographs,
and YouTube videos about a single academic event, for example. There is no doubt that these
data carry rich semantic relationships. Mining the dynamic semantic inter-
actions among "text-image-video" data would primarily support the development of appli-
cation systems such as search engines or recommendation architectures.
Big data also captures relations between individuals. On the internet,
there are blogs, and hyperlinks create an intricate structure among web
pages. Social links likewise form between individuals, building complicated social
networks: for instance, the massive relationship data of Facebook, Twitter, LinkedIn,
and other internet-based services, including call detail records (CDRs), gadget and
sensor information, GPS and geocoded map data, huge image documents trans-
ferred through managed file transfer protocols, and web texts. Expanding
research activities have begun to address problems of community evolution,
crowd engagement, and documentation and communication within such
dynamic relationship networks.

1.1.8 Big Data Mining for Big Data Security


The field of data science involves the comprehensive analysis of vast volumes of
data, including the retrieval of useful insights from raw, structured, and unstructured data
using scientific techniques, diverse technologies, and algorithms. It uses
methods and strategies to exploit data to uncover hidden phenomena and develop
something unique and important from the knowledge surrounding us. With data
science techniques on big data, we can mitigate the rising risks and vulnerabilities across the
huge amounts of data in our global tech ecosystem, thus implementing highly resilient,
intelligent security measures built on the data science capabilities of predictive analysis,
statistical computation, machine learning, and deep learning (Figure 1.3).
Data science advanced analytic systems provide the following:

• Security Information and Event Management (SIEM)


• Security Metrics
• Vulnerability Assessment
• Risk Modeling
• Governance, Risk, and Compliance (GRC) Automation [12]
• Computer and Network Forensics
• File Integrity Scanning
FIGURE 1.3 Data sciences for big data security

1.1.8.1  Securing Big Data


By applying data science analytics and a variety of machine learning tools, we can
carry out thorough security analysis of a collection of data to reveal trends and patterns
that yield actionable intelligence about which security measures to deploy. An instance could be a
system security expert finding that all threats on corporate data take place at night, or
at a certain stage of the day, over a network or offline. He can learn enough to narrow
the likelihood down to a specific network terminal. In addition, the derived information
can be used to forecast possible future attacks. During reconnaissance queries and
movements, attackers most likely leave traces and signals behind. Data contains these signals,
which can be observed by means of data science in order to produce early warnings
[13]. Before these developments in data science, all such data was typically sent to a
security data lake managed by a SIEM (security information and event management) system. However,
recent data science developments can now correlate many incidents
in real time. It is easier to connect the dots and identify patterns using algorithms, which,
given the shortage of security analysts, were previously difficult to find manually. Supporting
the decision making of security researchers, and over time
taking the proactive steps and actions of a physical security researcher, are among the many
benefits of data-science-based security programs.
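The night-time attack example above can be turned into a toy sketch: learn an hourly activity profile from historical logs, then flag new events that fall in hours where activity was rarely or never seen. The hours and threshold below are invented for illustration:

```python
# Sketch: flagging log events that fall outside a learned "normal"
# activity profile. Hours and threshold are invented for illustration.
from collections import Counter

# historical event hours (0-23) observed during normal operation
history = [9, 10, 10, 11, 14, 15, 15, 16, 9, 10, 11, 14]
profile = Counter(history)

def is_anomalous(hour, profile, min_seen=1):
    """An event is suspicious if its hour was rarely or never seen."""
    return profile[hour] < min_seen

new_events = [10, 3, 15, 23]  # two fall in office hours, two at night
flags = [h for h in new_events if is_anomalous(h, profile)]
print(flags)  # [3, 23]
```

Real SIEM analytics replace this single hour-of-day feature with many features per user and device, but the principle, deviation from a learned baseline, is the same.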

1.1.8.2  Real-​Time Predictive and Active Intrusion Detection Systems


Hackers and destructive attackers use a variety of techniques and attack types to obtain
access to massive data stores. Intrusion detection programs track users and devices on the network through data science
pattern discovery and flag risky behavior. The nature of an attack,
its magnitude, and the degree of its future impact can be estimated using data
science, which broadens the application of these methods and simplifies them.
Such systems can identify possible problems and attack patterns specific-
ally by applying real-time and historical data to a machine learning algorithm. With
time, a system like this becomes more knowledgeable and precise; potential threats can be
forecast and various loopholes found.

1.1.8.3  Securing Valuable Information Using Data Science


Most attacks on big data infrastructure and resources aim to cause the loss
of extremely valuable data and information, which of course harms the
business. Unauthorized users are kept from inspecting the data set by authentication
mechanisms such as extremely complex signatures or encryption, and nearly impenetrable
protocols can be developed through data science. One can, for example, build
algorithms to identify the most commonly targeted chains of data by studying the back-
ground of cyber-attacks and offline intrusion attempts on organizational data,
thus extracting insight into why such data is targeted and the probable outcome. This
helps to define the appropriate security measures required to be implemented for that
focus data [14].

1.1.8.4  Pattern Discovery


Big data security requires data science techniques to play important roles in the pre-
sent and future generations of defense strategies [15]. Within enterprise data stores and
networks there exist data sources of enormous volume that can facili-
tate the discovery and prevention of attacks and other malicious offline and network
activities. At the very core, the focus is on innovating probability-model-based
and statistical approaches that utilize these data sources to identify subtle intrusion
attempts. The post-breach detection of irregularities is one of the most valuable applications
of data science, and it proves its usefulness as soon as attackers enter the
corporate data center: they appear like registered users employing legitimate authoriza-
tion to access systems and data stores. Data science self-learning models enable
anomaly detection by helping analysts understand "normal" behavior inside the
enterprise and observe the slightest deviation. The advantage brought by a data
science approach is the ability to pre-learn; from historical data, complex patterns of
normal computer and network behavior can be deduced, thereby detecting anom-
alies that would not stand out otherwise, one such scenario being unusual network
traversal using legitimate credentials. In the pre-breach context,
data science is helping big data enterprises identify malicious code
and fraudulent transactions using supervised machine learning built on tree-ensemble
models (such as random forests) that extract behavioral features from suspect executables.
Businesses that employ this method no longer rely on handwritten malware
signatures, which malicious code writers have learned to evade, making detection even
more complicated. Pre-breach detection using supervised learning and post-breach identification
of deviations through predictive approaches are the fields of data science contribution in terms
of big data security [12].

1.1.8.5  Automated Detection and Response Using Data Science


Data science contributes to describing the magnitude of an attack using automated
methodologies. Detection and response work side by side; therefore, the more
accurately the extent of an attack on any big data stockpile is detailed, the more
accurate and speedy the response can be. Data science is enhancing automated response,
which depends on the capability of effective detection systems [16]. Training
intelligent systems to be more certain, by applying a combination of analytics and
machine learning to detection before deploying an automated response that might
otherwise act on false positives, is where data science is strengthening the security response
suitable for big data protection within an enterprise [17].

1.1.9 Conclusions
As data volumes grow every day, big data will continue to expand and become one
of the fascinating prospects of the next few years. Today we see an incredible increase in
the volume, velocity, and variety of data. Security and secrecy have also seen expo-
nential development in the vulnerability ecosystem and in data protection threats. We are
now in a new era in which, by using cloud computing, we can manage all our data with
less money and effort, because we take the route of outsourcing for managing
big data [18]. Not only can we manage it with less effort, but we can manage big data in
a very effective manner using the MapReduce technique. Through effective data science
analytics, machine learning, and statistical computation systems, the security of big
data can be greatly enhanced beyond the already existing control systems. This can
give rise to highly intelligent data security implementations with the ability to self-
protect and automate actions when a suspicious pattern is detected. IT professionals
will be able to advance operative, defensive, and active measures to prevent malicious
attacks on big data through the vast options data science presents.

REFERENCES
[1]‌ Bhatia, S. A. (2020). Comparative Study of Opinion Summarization Techniques. IEEE
Transactions on Computational Social Systems, 8(1), 110–​117.
[2]‌ Kshatri, S. S., Singh, D., Narain, B., Bhatia, S., Quasim, M. T., & Sinha, G. R. (2021).
An empirical analysis of machine learning algorithms for crime prediction using
stacked generalization: An ensemble approach. IEEE Access, 9, 67488–​67500.
[3]‌ Vinodhan D, Vinnarasi A. (2016). IOT based smart home. International Journal of
Engineering and Innovative Technology (IJEIT), 10, 35–​38.
[4]‌ Kamalpreet Singh, Ravinder Kaur (2014). “Hadoop: Addressing Challenges Of Big
Data,” IEEE International Advanced Computing Conference (IACC).
[5]‌ Cloudera.com (2015). Introduction to Hadoop and MapReduce [online]. Available
at: www.cloudera.com/content/cloudera/en/training/courses/udacity/mapreduce.html
(accessed on 16/​1/​2015).
[6]‌ Charu C. Aggarwal, Naveen Ashish, Amit Sheth (2013). "The Internet of Things: A
Survey from the Data-Centric Perspective". In Managing and Mining Sensor Data, Springer
Science+Business Media, New York, pp. 384–​428.
[7]‌ Obaidat, M. S., & Nicopolitidis, P. (2016). Smart Cities and Homes: Key Enabling
Technologies. Cambridge, MA: Morgan Kaufmann, pp. 91–​108.
[8]‌ Eugene Silberstein, "Automatic Controls and Devices", Residential Construction
Academy HVAC, Chapter 7, pp. 158–​184.
[9]‌ Scuro, Carmelo & Sciammarella, Paolo & Lamonaca, Francesco & Olivito, R. &
Carnì, Domenico. (2018). IoT for structural health monitoring. IEEE Instrumentation
and Measurement Magazine. 21(6). 4–​9 and 14.

[10] Alojail, M., & Bhatia, S. (2020). A novel technique for behavioral analytics using
ensemble learning algorithms in E-​commerce. IEEE Access, 8, 150072-​150080.
[11] Sheikh, R. A., Bhatia, S., Metre, S. G., & Faqihi, A. Y. A. (2021). Strategic value real-
ization framework from learning analytics: a practical approach. Journal of Applied
Research in Higher Education.
[12] Bisht B., & Gandhi P. (2019) “Review Study on Software Defect Prediction
Models premised upon Various Data Mining Approaches”, INDIACom-​2019, 10th
INDIACom 6th International Conference on “Computing For Sustainable Global
Development” at Bharti Vidyapeeth’s Institute of Computer Applications and
Management (BVICAM).
[13] Gandhi P., & Pruthi J. (2020) Data Visualization Techniques: Traditional Data to Big
Data. In: Data Visualization. Springer, Singapore. pp. 53–​74.
[14] Sethi, J. K., & Mittal, M. (2020). Monitoring the impact of air quality on the COVID-​
19 fatalities in Delhi, vol-​1 using machine learning techniques. Disaster Medicine
and Public Health Preparedness, 1–​8.
[15] Chhetri B et al., Estimating the prevalence of stress among Indian students during
the COVID-​19 pandemic: a cross-​sectional study from India, Journal of Taibah
University Medical Sciences, pp. 35–​50 https://​doi.org/​10.1016/​j.jtumed.2020.12.0
[16] Dagjit Singh Dhatterwal, Preety, Kuldeep Singh Kaswan, The Knowledge
Representation in COVID-​19 Springer, ISBN: 978-​981-​15-​7317-​0
[17] Chawla, S., Mittal, M., Chawla M., Goyal, L. M. (2020). Corona Virus -​SARS-​
CoV-​2: an insight to another way of natural disaster. EAI Endorsed Transactions on
Pervasive Health and Technology, 6, 22.
[18] Gandhi K., & Gandhi P. (2016) “Cloud Computing Security Issues: An Analysis”,
INDIACom-​2016 10th INDIACom 3rd International Conference on “Computing
for Sustainable Global Development” at Bharti Vidyapeeth’s Institute of Computer
Applications and Management (BVICAM), pp. 7670–​7673.
2 Analytical Theory: Frequent Pattern Mining

Ovais Bashir Gashroo¹ and Monica Mehrotra²

¹ Scholar, Department of Computer Science, Jamia Millia Islamia, New Delhi, India
² Professor, Department of Computer Science, Jamia Millia Islamia, New Delhi, India

CONTENTS
2.1 Introduction......................................................................................................15
2.2 Frequent Pattern Mining Algorithms................................................................16
2.2.1 Apriori Algorithm.................................................................................17
2.2.2 DHP Algorithm....................................................................................18
2.2.3 FP-​Growth Algorithm...........................................................................18
2.2.4 EClaT Algorithm..................................................................................19
2.2.5 Tree Projection Algorithm....................................................................20
2.2.6 TM Algorithm......................................................................................21
2.2.7 P-​Mine Algorithm................................................................................22
2.2.8 Can-​Mining Algorithm.........................................................................22
2.3 Analysis of the Algorithms...............................................................................23
2.4 Privacy Issues...................................................................................................23
2.5 Applications of FPM........................................................................................24
2.5.1 For Customer Analysis.........................................................................24
2.5.2 Frequent Patterns for Classification.....................................................24
2.5.3 Frequent Patterns Aimed at Clustering.................................................25
2.5.4 Frequent Patterns for Outlier Analysis.................................................25
2.5.5 Frequent Patterns for Indexing.............................................................25
2.5.6 Frequent Patterns for Text Mining.......................................................25
2.5.7 Frequent Patterns for Spatial and Spatiotemporal Applications...........26
2.5.8 Applications in Chemical and Biological Fields..................................26
2.6 Resources Available for Practitioner................................................................26
2.7 Future Works and Conclusion..........................................................................27

2.1  INTRODUCTION
DOI: 10.1201/9781003199403-2

Frequent pattern mining (FPM) is the process of finding associations between distinct
items in a database. Frequent patterns are itemsets, sub-
structures, or sub-sequences that appear within a set of data with a frequency not
below a user-specified limit. For example, a collection of items, like bread and eggs,
that often appear together in a transaction data set can be called a frequent itemset
[1]. A sub-sequence, like purchasing a laptop followed by a camera and afterwards an
external hard drive, is termed a frequent sequential pattern if it occurs often in a
shopping-history database. Structural forms like subgraphs or subtrees,
which can be combined with itemsets or sub-sequences, are referred to as sub-structures,
and the frequent occurrence of a sub-structure within a graph database is known as
a frequent structural pattern.
FPM has been a requisite task in data mining, and researchers have devoted much
attention to this concept over the last couple of years. With its profuse use in
data mining problems such as classification and clustering, FPM has been broadly
researched. The advent of FPM in real-world businesses has led to the promotion
of sales, which has resulted in increased profits. FPM has been applied in domains like
recommender systems, bioinformatics, and decision making. The literature dedicated
to this field of research is abundant and has achieved tremendous progress, such as
the development of efficient and effective algorithms for mining frequent
itemsets. FPM is of immense importance in many important data mining tasks like
association and correlation analysis, analyzing patterns in spatiotemporal data, clas-
sification, and cluster analysis.
The process of FPM can be specified as follows: given a database DTB
containing transactions T1, T2 … Tn, all the patterns P need to be determined
that appear in no less than a fraction s of all the transactions [2]. The fraction s is usu-
ally referred to as the "minimum support". This formulation was first put forward by Agrawal
et al. [3] in 1993 for market basket analysis as a kind of association rule mining.
FPM makes it possible to analyze the buying habits of customers by discovering the asso-
ciations among the items that customers place in their respective
shopping baskets. For example, among customers who are buying bread, what
are their chances of also buying eggs? Information like this can help increase sales,
because owners can market according to it and shelf spaces can
be arranged accordingly.
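The bread-and-eggs example can be made concrete with the two standard measures behind association rules, support and confidence. The basket data below is invented for illustration:

```python
# Sketch: support and confidence for the rule {bread} -> {eggs}
# over a toy basket database.

baskets = [
    {"bread", "eggs", "milk"},
    {"bread", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, baskets):
    # fraction of baskets that contain every item of the itemset
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    # of the baskets containing the antecedent, how many also
    # contain the consequent
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

print(support({"bread", "eggs"}, baskets))       # 0.5
print(confidence({"bread"}, {"eggs"}, baskets))  # 0.6666666666666666
```

Here the rule {bread} → {eggs} has support 0.5 (half of all baskets contain both) and confidence 2/3 (two of the three bread buyers also bought eggs), which is exactly the kind of figure a shop owner would act on.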
This chapter provides a comprehensive study of the field of FPM. It
explores some of the algorithms for FPM, an analysis of FPM algorithms,
privacy issues, various applications of FPM, and some of the resources avail-
able to those who want to practice FPM methods. Finally, the chapter
concludes with future directions in this area.

2.2  FREQUENT PATTERN MINING ALGORITHMS
Many researchers have come up with algorithms to enhance the FPM process.
This section analyzes several FPM algorithms to give an understanding of them. Generally,
FPM algorithms fall into three categories [2], namely Join-Based,
Pattern Growth, and Tree-Based algorithms. Join-Based algorithms use a bottom-up
method to recognize frequent items in a data set and keep enlarging them into
itemsets as long as those itemsets appear more often than a minimum threshold value
specified by the user over the database. Tree-Based algorithms, on the other hand, use
a set-enumeration technique to solve the frequent itemset generation problem: they
construct a lexicographic tree that enables the mining
of items in a variety of orders, such as breadth-first or depth-first. Lastly,
Pattern Growth algorithms apply a divide-and-conquer method to partition and pro-
ject databases based on the currently identified frequent patterns and expand
them into longer ones in the projected databases [4]. Algorithms such as Apriori,
DHP, AprioriTID, and AprioriHybrid are classified under Join-Based algorithms,
while algorithms such as AIS, TreeProjection, EClaT, VIPER, MAFIA, and TM are
classified under Tree-Based algorithms [4]. FP-Growth, TFR, SSR, P-Mine, LP-
Growth, and Extract are classified under Pattern Growth algorithms [4]. Some of the
algorithms from all three categories are discussed in this chapter.

2.2.1 Apriori Algorithm
Apriori [5] is the first algorithm that was used for mining frequent patterns.
This algorithm mines frequent itemsets so that Boolean association rules
can be generated. An iterative, level-wise search technique is employed to
find (k+1)-itemsets from k-itemsets. An example of transactional data is shown in
Table 2.1; it contains the items purchased in different transactions. Initially, the whole
database is scanned so that all frequent 1-itemsets are identified after counting
them; only those among them that fulfill the minimum support threshold are
retained. The entire database then has to be scanned again for each subsequent level,
until no further frequent k-itemset can be identified.
Supposing the minimum support count is 2, then in our example only records having
support greater than or equal to the minimum support are included in the next
phase of processing by the algorithm.
The Apriori algorithm lessens the size of the candidate itemsets significantly
and provides a significant performance gain in many cases.
However, the algorithm suffers from a few limitations that are critical in nature
[6]. One of them is that if the total count of frequent
k-itemsets rises, a large number of candidate itemsets has to be produced. In consequence, the
algorithm has to scan the entire database repeatedly, and a large set
of candidate items must be verified using pattern matching.

TABLE 2.1
Example of Transactional Data
Transaction ID Item ID’s
T100 a, b, e
T101 b, d
T102 b, c
T103 a, b, d
T104 a, c
T105 b, c
T106 a, c
T107 a, b, c, e

The main benefit of the Apriori algorithm is that it uses an iterative, level-
wise search technique for discovering (k+1)-itemsets from k-itemsets. Its
disadvantages are that it needs to generate a large number of candidate
sets when the itemsets are numerous, and that the database has to be
scanned repeatedly to determine the support counts of the itemsets [4].
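A minimal sketch of this level-wise search, run on the transactions of Table 2.1 with a minimum support count of 2, follows. It is a simplified illustration: it generates candidates by joining frequent itemsets but omits the subset-based candidate pruning of the full algorithm.

```python
# Sketch of the Apriori level-wise search on the Table 2.1 transactions,
# with a minimum support count of 2 (simplified: no subset pruning).
from itertools import combinations

transactions = [
    {"a", "b", "e"}, {"b", "d"}, {"b", "c"}, {"a", "b", "d"},
    {"a", "c"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c", "e"},
]
MIN_SUP = 2

def frequent_itemsets(transactions, min_sup):
    items = sorted({i for t in transactions for i in t})
    result, k, current = {}, 1, [frozenset([i]) for i in items]
    while current:
        # count each candidate by scanning the whole database
        # (this repeated full scan is exactly Apriori's weakness)
        counts = {c: sum(c <= t for t in transactions) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(frequent)
        # join step: build (k+1)-candidates from pairs of frequent k-itemsets
        keys = list(frequent)
        current = list({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == k + 1})
        k += 1
    return result

freq = frequent_itemsets(transactions, MIN_SUP)
print(freq[frozenset({"a", "b"})])  # 3: {a, b} occurs in T100, T103, T107
```

On this data the search stops at level 3: {a, b, e} (support 2) is the only frequent 3-itemset, and no 4-itemset candidate survives.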

2.2.2 DHP Algorithm
DHP stands for the Direct Hashing and Pruning method [7] and it was put forward
after the Apriori algorithm. Mainly two optimizations are proposed in this algorithm
to speed up itself. First is pruning the candidate itemsets in each iteration, and second
is trimming the transactions so that the support counting procedure becomes more
effective [2].
For pruning the itemsets, the DHP algorithm keeps track of the incomplete infor-
mation regarding the candidate (k+1)-​itemsets, meanwhile counting the support
explicitly of candidate k-​itemsets. When the counting of candidate k-​itemsets is
being done, all of the (k+1) subsets are discovered and are hashed in a table that
preserves the counting of the number of subsets which are hashed into each entry
[2]. The retrieval of counts from the hash table for each itemset is done during the
stage of counting (k+1)-​itemsets. As there are collisions in the hash table, the counts
are overestimated. Itemsets which have counts under the user stated support level are
actually pruned for further attention.
Trimming of transactions is the second optimization proposed in the DHP algorithm. An item can be pruned from a transaction if it does not occur in at least k of the frequent itemsets in Fk, because such an item can no longer contribute to the support calculation of any frequent pattern. This rests on the observation that if an item does not occur in at least k frequent k-itemsets, then no frequent (k+1)-itemset can contain it. This step reduces the width of the transactions and increases processing efficiency [2].
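The hashing idea can be sketched as follows. This is a simplified illustration of the first scan only; transaction trimming is omitted, and the bucket count, function names, and toy data (taken from Table 2.2) are our own assumptions rather than the full DHP of [7]:

```python
from itertools import combinations

def dhp_first_pass(transactions, min_support, n_buckets=8):
    """First DHP-style scan: count single items and hash every 2-subset of
    each transaction into a small bucket table. A bucket whose total count
    is below min_support cannot contain any frequent 2-itemset, so
    candidates hashing there are pruned before the second scan."""
    item_counts, buckets = {}, [0] * n_buckets
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    # collisions only inflate bucket counts, so this is a safe superset
    candidates = {p for p in combinations(sorted(frequent_items), 2)
                  if buckets[hash(p) % n_buckets] >= min_support}
    return frequent_items, candidates

txns = [{'a','b','e'}, {'b','d'}, {'b','c'}, {'a','b','d'},
        {'a','c'}, {'b','c'}, {'a','c'}, {'a','b','c','e'}]
items, cands = dhp_first_pass(txns, min_support=2)
```

Because collisions can only inflate a bucket's count, the pruned candidate set is always a superset of the true frequent 2-itemsets, which is why the overestimation noted above is harmless for correctness.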

2.2.3 FP-​Growth Algorithm
FP-Growth, the frequent pattern growth algorithm [6], mines frequent itemsets without any costly candidate generation process. The algorithm employs a divide-and-conquer approach, compressing the frequent items into an FP-Tree that holds all the information about the associations of the frequent items. The FP-Tree is then divided into a set of conditional FP-Trees, one for every frequent item, so that each frequent item can be mined individually [4]. The representation of frequent items with an FP-Tree is displayed in Figure 2.1.
The problem of identifying long frequent patterns is solved in the FP-Growth algorithm by repeatedly searching conditional FP-Trees of smaller and smaller size. Examples of conditional FP-Trees, including the detailed conditional FP-Trees for Figure 2.1, can be found in [8]. As per [4], “The Conditional Pattern Base is a ‘sub-database’ which consists of every prefix path in the FP-Tree that co-occurs with every frequent length-1 item. It is used to construct the Conditional FP-Tree and generate all the frequent patterns related to every frequent length-1 item”. The cost of searching for the frequent patterns is thereby reduced significantly. According to [9], however, constructing an FP-Tree is a time-consuming process if the available data set is huge.

Analytical Theory 19

FIGURE 2.1 FP-Tree

Source: Reproduced from [8].
The first advantage of this algorithm is that the association information of every itemset is preserved; the second is that the volume of data to be searched shrinks [3]. However, its disadvantage is that the time required for constructing an FP-Tree is high if the data on which it is built is very large [4][9].
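The construction step can be sketched as follows. This minimal sketch builds only the FP-Tree and its header table of node links, omitting the conditional-tree mining phase; the class and function names are our own, not those of [6]:

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_support):
    """Scan 1 counts items; scan 2 inserts each transaction's frequent items,
    in descending global frequency order, along a shared prefix path."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    rank = {i: r for r, i in
            enumerate(sorted(frequent, key=lambda i: (-frequent[i], i)))}
    root = FPNode(None, None)
    header = {i: [] for i in rank}          # item -> list of node links
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1    # shared prefix: just count
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header

# Transactions reconstructed from the vertical lists in Table 2.2
txns = [{'a','b','e'}, {'b','d'}, {'b','c'}, {'a','b','d'},
        {'a','c'}, {'b','c'}, {'a','c'}, {'a','b','c','e'}]
root, header = build_fp_tree(txns, min_support=2)
# node-link counts along the header table recover each item's support
print(sum(n.count for n in header['c']))  # 5
```

Summing the counts along an item's node links recovers its support; following each linked node's parent pointers upward yields exactly the prefix paths that form the conditional pattern base described above.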

2.2.4 EClaT Algorithm
By using a vertical data format, the EClaT (Equivalence Class Transformation) algorithm [10] efficiently mines frequent itemsets, as shown in Table 2.2. In this representation, the transactions containing a particular itemset are grouped into the same record. The EClaT algorithm transforms the data from the horizontal into the vertical format after scanning the database once. Frequent (k+1)-itemsets are then produced by intersecting the transaction lists of the frequent k-itemsets. This process is repeated until all frequent itemsets have been intersected with each other and no further frequent itemset can be discovered, as depicted in Tables 2.3 and 2.4.
The EClaT algorithm need not scan the database multiple times to identify (k+1)-itemsets: a single scan suffices to transform the data from the horizontal to the vertical format. The support count of each itemset is simply the number of transactions that contain that itemset, so no further scans are required to determine support counts. This single database scan is the main advantage of the EClaT algorithm [4]. However, intersecting long transaction sets takes both more memory space and longer processing time, which is the disadvantage of this algorithm [4].

20 Data Driven Decision Making Using Analytics

TABLE 2.2
Vertical Format of Transactional Data

Itemset Transaction ID
a {T100, T103, T104, T106, T107}
b {T100, T101, T102, T103, T105, T107}
c {T102, T104, T105, T106, T107}
d {T101, T103}
e {T100, T107}

TABLE 2.3
Vertical Data Format of 2-Itemsets

Itemset Transaction ID
{a, b} {T100, T103, T107}
{a, c} {T104, T106, T107}
{a, d} {T103}
{a, e} {T100, T107}
{b, c} {T102, T105, T107}
{b, d} {T101, T103}
{b, e} {T100, T107}
{c, e} {T107}

TABLE 2.4
Vertical Data Format of 3-Itemsets

Itemset Transaction ID
{a, b, c} {T107}
{a, b, e} {T100, T107}
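Tables 2.2 through 2.4 can be reproduced with a small depth-first sketch of this tid-set intersection idea (the function signature and names are our own; the data below is Table 2.2 entered directly in vertical form):

```python
def eclat(vertical, min_support, prefix=frozenset(), out=None):
    """Depth-first EClaT sketch: extend each frequent itemset by
    intersecting its tid-set with those of lexicographically later items."""
    if out is None:
        out = {}
    items = sorted(vertical)
    for i, item in enumerate(items):
        tids = vertical[item]
        if len(tids) < min_support:
            continue                      # infrequent: no superset can qualify
        itemset = prefix | {item}
        out[itemset] = tids
        # conditional vertical database for extensions of `itemset`
        suffix = {o: tids & vertical[o] for o in items[i + 1:]}
        eclat(suffix, min_support, itemset, out)
    return out

# Table 2.2, entered directly in vertical format
vertical = {
    'a': {'T100', 'T103', 'T104', 'T106', 'T107'},
    'b': {'T100', 'T101', 'T102', 'T103', 'T105', 'T107'},
    'c': {'T102', 'T104', 'T105', 'T106', 'T107'},
    'd': {'T101', 'T103'},
    'e': {'T100', 'T107'},
}
freq = eclat(vertical, min_support=2)
print(sorted(freq[frozenset('abe')]))  # ['T100', 'T107'], as in Table 2.4
```

Note that no transaction database is ever rescanned: every support count is just the size of an intersected tid-set.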

2.2.5 Tree Projection Algorithm


This algorithm [11] mines frequent itemsets by constructing a lexicographic tree, which can be searched with different techniques such as depth-first search, breadth-first search, or a mixture of the two. The nodes of the lexicographic tree store the support of every frequent itemset after it has been calculated by the algorithm, which improves the performance of computing the total number of transactions containing a particular itemset [4]. A lexicographic tree representing the frequent items is shown in Figure 2.2 as an example.

FIGURE 2.2 Lexicographic tree

Source: Reproduced from [11].
The advantage of the Tree Projection algorithm is that, within the hierarchical structure of the lexicographic tree, it searches only those subsets of the transactions that can possibly contain frequent itemsets, so the frequent itemsets are identified much faster. A top-down technique is used to traverse the lexicographic tree while searching it. The disadvantage of this algorithm is that it is not memory-efficient when there are different representations of the lexicographic tree [12].

2.2.6 TM Algorithm
Just like the EClaT algorithm, the TM (Transaction Mapping) algorithm [13] mines frequent itemsets using the vertical representation of data. For every itemset, this algorithm maps the transaction IDs to a list of continuous transaction intervals in a different space. Then, using the depth-first search method, intersections are performed among the transaction intervals along the lexicographic tree to facilitate the counting of itemsets. An example of this technique can be found in [13].
The compression of transaction IDs into continuous transaction intervals by the TM technique is notable when the minimum support is high. The advantage of this algorithm is that intersection time is saved when the itemsets are compressed into lists of transaction intervals [4]. For data sets containing short frequent patterns, the TM algorithm performs much better than the FP-Growth and EClaT algorithms [4]. In terms of overall processing speed, however, the TM algorithm is slower than the FP-Growth algorithm [4].

2.2.7 P-​Mine Algorithm
The P-Mine algorithm [14] mines frequent itemsets on a processor with multiple cores by employing a parallel disk-based approach. A dense representation of the data set is produced on disk in less time using a data structure named VLDBMine, which uses a Hybrid-Tree to store the whole data set together with the information required for data retrieval. Because disk access is slow and hurts performance, a pre-fetching method is implemented that loads projections of the data set into separate cores of the processor, where the frequent itemsets are mined; this technique enhances the efficiency of disk access. Finally, every processor core delivers its results, and these are combined to construct the complete set of frequent itemsets. The P-Mine algorithm architecture is illustrated in [14].

The VLDBMine representation of the data set, and in particular its Hybrid-Tree, improves the performance and scalability of frequent itemset mining. The advantage of the P-Mine algorithm is that it optimizes scalability and performance by mining frequent itemsets in parallel across the cores of a processor. However, the disadvantage is that the maximum level of optimization can be achieved only when multiple cores are present in a processor.

2.2.8 Can-​Mining Algorithm
The Can-Mining algorithm [15] mines frequent itemsets incrementally from a Can-Tree (Canonical-Order Tree). Like the FP-Growth algorithm, it uses a header table that contains all the items in the database. Each item, together with pointers to the first and last nodes of the Can-Tree containing that item, is stored in the header table. A list of the frequent items is needed for the algorithm to obtain the frequent patterns from the Can-Tree during mining. The advantage of this algorithm is that it outperforms the FP-Growth algorithm when the minimum-support threshold is high. However, the disadvantage is that mining takes longer when the minimum-support threshold is much lower. The architecture of the Can-Mining algorithm is illustrated in [15].

TABLE 2.5
Timeline of the FPM Algorithms
S. No Algorithm Publication Year
1. Apriori [5] 1994
2. DHP [7] 1995
3. FP-Growth [6] 2000
4. EClaT [10] 2000
5. Tree Projection [11] 2001
6. TM [13] 2006
7. P-​Mine [14] 2013
8. Can-​Mining [15] 2015

2.3 ANALYSIS OF THE ALGORITHMS
Some of the recent and important algorithms for mining frequent patterns have been discussed above, along with their advantages and disadvantages. The recent algorithms have to deal with new kinds of data and face problems that the earlier algorithms did not. The key focus in this area is the overall performance of the algorithms with respect to execution time and the amount of memory space used. If an algorithm implemented in a real-world application does not produce results efficiently and on time, it will cause losses to the business where it is used. The main purpose of developing these efficient algorithms is to reduce execution time so that results are produced quickly, which in turn leads to better decision making and eventually increases market sales and helps grow businesses. In [9], the author illustrates, using tables and graphs, the runtime and the memory usage of several existing FPM algorithms. Table 2.5 lists the algorithms discussed above along with their year of publication.

2.4 PRIVACY ISSUES
Privacy has become a serious concern in recent years because individuals' personal data is widely available [16]. Data holders are often reluctant to share data, share it only in a very constrained manner, or share only a low-quality version of it. These issues pose a challenge for discovering frequent patterns from the data. Below we discuss the challenges faced in frequent pattern and association rule mining:

• Once privacy-preservation methods such as randomization are applied, discovering association rules from the data becomes a challenge. This is because a large amount of noise is added to the data, and discovering association rules in the presence of noise is a difficult task. The class association rule mining methods of [17] put forward an efficient approach for the meaningful discovery of patterns while maintaining the privacy of the modified data.
• The problem of distributed privacy preservation [18] is that the data to be mined is kept in a dispersed manner by market participants who compete with each other. They want to mine the data jointly to obtain global knowledge without revealing their local insights [2].

2.5 APPLICATIONS OF FPM
This section emphasizes the applications of FPM, since they serve as the main motivation for frequent pattern algorithms. These applications span a variety of fields and incorporate many domains of data. As space is limited, we focus on some of the key areas where FPM is being used. Some of the key applications of FPM are discussed in the following.

2.5.1 For Customer Analysis


Supermarket and customer analysis [3] was the initial research direction put forth by researchers. In this setting, the behavior of customers is recorded: which basket of items they purchased at the same time, or which sequences of items were purchased together. Frequent patterns answer the question of what the common patterns of buying behavior are. A two-tuple frequent pattern such as {Milk, Bread} suggests that Milk and Bread are often bought together. With this information, the shelves containing these two items can be placed together, and the items can be promoted more effectively. Such information supports efficient decision making and produces good results for the business: if information about customers' previous buying behavior is available, decisions about what to stock in a store can be made much better, and after analyzing past purchase information, the marketing process becomes easier. For example, if it is already known that a customer has bought a laptop, it is very likely that he will buy a printer, so targeting that customer for that item becomes easy.
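The measures behind such patterns are simple ratios. The following sketch computes the support and confidence of a {Milk} → {Bread} style rule over a handful of hypothetical baskets (all item names and numbers here are invented purely for illustration):

```python
def support(itemset, baskets):
    """Fraction of baskets containing every item of `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets holding the antecedent, the share also holding the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Hypothetical purchase records, only to illustrate the two measures
baskets = [{'milk', 'bread'}, {'milk', 'bread', 'eggs'},
           {'milk'}, {'bread', 'eggs'}, {'milk', 'bread'}]
print(support({'milk', 'bread'}, baskets))              # 0.6
print(round(confidence({'milk'}, {'bread'}, baskets), 2))  # 0.75
```

A retailer would read this as: 60% of all baskets contain both items, and 75% of the customers who bought milk also bought bread, which is the kind of evidence used to co-locate shelves or target promotions.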

2.5.2 Frequent Patterns for Classification


The problem of data classification is linked to FPM, especially in the context of rule-based methods. A condition of the following form is known as a classification rule [3]:

A1 = a1, A2 = a2 => C = c (2.1)

The LHS of the rule implies that attributes A1 and A2 take the values a1 and a2, respectively, and the RHS implies that the class value should be c.
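Applying a rule of form (2.1) to a record amounts to checking its LHS conditions and, on a match, assigning the RHS class. A minimal sketch follows (the attribute names and values are the generic placeholders from the equation, not a real data set):

```python
def rule_applies(record, conditions):
    """True when the record satisfies every attribute test on the rule's LHS."""
    return all(record.get(attr) == value for attr, value in conditions.items())

# Rule (2.1): A1 = a1, A2 = a2  =>  C = c
lhs = {'A1': 'a1', 'A2': 'a2'}
record = {'A1': 'a1', 'A2': 'a2'}
predicted_class = 'c' if rule_applies(record, lhs) else None
print(predicted_class)  # c
```

Rule-based classifiers of this kind differ from plain association rule miners mainly in how the rules are selected, as discussed next.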

Classification rules are similar in form to association rules, and suitable patterns can be determined from the data with the help of association rule mining techniques. The aim is to ensure that the patterns are sufficiently discriminative for the purpose of classification, and that the support criterion does not dominate the rule-selection process. The earliest work highlighting the connection between classification and association rule mining is [19]. Classification based on associations (CBA) [20] is one of the most popular methods that classify on the basis of associations. The CMAR [9] method is another technique for classification, based on the FP-Growth method for mining association rules.

2.5.3 Frequent Patterns Aimed at Clustering


Data mining problems such as clustering are related to FPM. In [21], the relationship between clustering and FPM is discussed at length; there, large items are used to enable the clustering process. The article [22] and the chapter on high-dimensional data in [23] contain a thorough analysis of the associations between high-dimensional clustering algorithms and FPM problems.

2.5.4 Frequent Patterns for Outlier Analysis


FPM is often used for the outlier analysis of binary and transaction data. Transaction data is inherently high-dimensional, so subspace methods are utilized to identify the relevant outliers present in it. The challenge that subspace methods are neither computationally practical nor statistically feasible for defining the sparse sets of items (or subspaces) needed for outlier detection is addressed in [24].

2.5.5 Frequent Patterns for Indexing


Apart from clustering itself, FPM methods are effective for creating representations of the transaction table on which clustering and similarity indexing methods [25][26] operate. Clustering methods partition the transaction database into groups based on the broad patterns present in them, and the use of an FPM approach is natural in the market basket data context.

The gIndex method [27] is an indexing structure that uses discriminative frequent patterns. Other methods, such as Grafil [28] and PIS [29], have been developed with similar contexts in mind.

2.5.6 Frequent Patterns for Text Mining


Frequent patterns have very important applications in text mining, for both positional and non-positional co-occurrence [2]. Positional co-occurrence refers to words occurring together in adjacent positions; sequence pattern mining methods can be adapted, or constrained FPM methods applied, to discover such patterns. Non-positional co-occurrence is associated with the problem of discovering bigrams, trigrams, or phrases that appear frequently in the data. Many applications of FPM to text collections are presented in [30].

2.5.7 Frequent Patterns for Spatial and Spatiotemporal Applications


There have been many advancements in the field of mobile sensing technology, out of which social sensing [31] has emerged as an important branch. A lot of data is accumulated continuously from cell phones, and a large portion of it is GPS data, i.e., data about position. Trajectories can be constructed from this GPS-based location data, and clusters and frequent patterns can then be determined from the constructed trajectories. FPM methods have frequently been used to cluster spatiotemporal data; the Swarm method proposed in [32] is an example of this technique.

2.5.8 Applications in Chemical and Biological Fields


Both biological and chemical data can usually be denoted by graphs. Chemical compounds can be represented by graphs in which the nodes represent the atoms and the bonds between atoms are shown as the edges. In a similar fashion, biological data can be represented by graphs or sequences in many ways, with considerable diversity in the structural representations. FPM plays a significant role in identifying useful and important properties of these chemical compounds or structures. An approach for classifying chemical compounds is discussed in [33]. Biological data is available as sequence or graph-structured data [2]; useful frequent patterns can be found in it with the help of the many algorithms developed for this purpose [34]–[38].

2.6 RESOURCES AVAILABLE FOR PRACTITIONERS
Because FPM methods are utilized so often in so many applications, readily available software exists that provides FPM services for these applications. KDD Nuggets [39] is a website that contains links to different resources on FPM. The Weka repository [40] contains implementations of many data mining algorithms, including algorithms for FPM. Bart Goethals has implemented some of the distinguished FPM algorithms, such as Apriori, EClaT, and FP-Growth [41]. FIMI [42] is a repository well known for efficient implementations of many FPM algorithms. The free R package arules supports FPM of many types; details can be accessed in [43]. Commercial software such as Enterprise Miner, provided by SAS, can perform both association and sequential pattern mining [44].

2.7 FUTURE WORK AND CONCLUSION
Among the main challenges of data mining, which include classification, clustering, outlier analysis, and FPM, the challenge of FPM is considered a leading problem in the domain. This chapter gave a summary of some of the key areas within FPM and reviewed the strong and weak points of different algorithms. In addition, we highlighted the issues that arise because of growing privacy concerns. Applications such as analyzing customers' buying behavior, outlier analysis of data, mining of textual data, and mining of chemical and biological data were discussed along with other applications.

In future work we will focus on how to preserve privacy and still mine data effectively without compromising it. Some more advanced FPM algorithms and their different versions will also be analyzed. The focus will be on how the earlier algorithms can be modified so that they work effectively in today's environment. In addition, variants of FPM will be examined to understand how they impact industry.

REFERENCES
[1]‌ Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and
future directions. Data Min Knowl Discov., 15(1), 55–​86. https://​doi.org/​10.1007/​
s10618-​006-​0059-​1
[2]‌ Aggarwal CC (2014) An introduction to frequent pattern mining. Freq. Pattern
Min., 1–​17.
[3]‌ Agrawal R, Imieliński T, Swami A (1993) Mining Association Rules Between
Sets of Items in Large Databases. ACM SIGMOD Rec. https://​doi.org/​10.1145/​
170036.170072
[4]‌ Chee CH, Jaafar J, Aziz IA, et al (2019) Algorithms for frequent itemset mining: a
literature review. Artif. Intell. Rev., 52(4), 2603–​2621.
[5]‌ Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large
databases. In: Proc. of the 20th International Conference on Very Large Data Bases
(VLDB’94).
[6]‌ Han J, Pei J, Yin Y (2000) Mining Frequent Patterns Without Candidate Generation.
SIGMOD Rec (ACM Spec Interes Gr Manag Data). https://​doi.org/​10.1145/​
335191.335372
[7]‌ Park JS, Chen MS, Yu PS (1995) An Effective Hash-​Based Algorithm for Mining
Association Rules. ACM SIGMOD Rec. https://​doi.org/​10.1145/​568271.223813
[8]‌ Han J, Kamber M, Pei J (2012) Data Mining: Concepts and Techniques.
[9]‌ Meenakshi A (2015) Survey of frequent pattern mining algorithms in horizontal and
vertical data layouts. Int J Adv Comput Sci Technol, 4(4).
[10] Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data,
12(3), 372–​390. https://​doi.org/​10.1109/​69.846291
[11] Agarwal RC, Aggarwal CC, Prasad VVV (2001) A tree projection algorithm for gen-
eration of frequent item sets. J Parallel Distrib Comput., 61(3), 350–​371. https://​
doi.org/​10.1006/​jpdc.2000.1693
[12] Aggarwal CC, Bhuiyan MA, Hasan M Al (2014) Frequent pattern mining
algorithms: a survey. In: Frequent Pattern Mining.

[13] Song M, Rajasekaran S (2006) A transaction mapping algorithm for frequent itemsets
mining. IEEE Trans Knowl Data Eng., 18(4), 472–​481. https://​doi.org/​10.1109/​
TKDE.2006.1599386
[14] Baralis E, Cerquitelli T, Chiusano S, Grand A (2013) P-​Mine: Parallel itemset mining
on large datasets. In: Proceedings of International Conference on Data Engineering.
[15] Hoseini MS, Shahraki MN, Neysiani BS (2016) A new algorithm for mining fre-
quent patterns in can tree. In: Conference Proceedings of 2015 2nd International
Conference on Knowledge-​Based Engineering and Innovation, KBEI 2015
[16] Aggarwal CC, Yu PS (2008) A general survey of privacy-​preserving data mining
models and algorithms. In: Privacy-​preserving Data Mining (pp. 11–​52).
[17] Evfimievski A, Srikant R, Agrawal R, Gehrke J (2004) Privacy preserving mining of
association rules. In: Information Systems.
[18] Clifton C, Kantarcioglu M, Vaidya J, et al (2002) Tools for privacy preserving
distributed data mining. ACM SIGKDD Explor Newsl. https://​doi.org/​10.1145/​
772862.772867
[19] Ali K, Manganaris S, Srikant R (1997) Partial classification using association rules.
Knowl Discov Data Min., 97, 115–​118.
[20] Liu B, Hsu W, Ma Y, Ma B (1998) Integrating classification and association rule
mining. Knowl Discov Data Min., 98, 80–​86. https://​doi.org/​10.1.1.48.8380
[21] Wang K, Xu C, Liu B (1999) Clustering transactions using large items. In: International
Conference on Information and Knowledge Management, Proceedings
[22] Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. SIGKDD Explor Newsl ACM Spec Interes Gr Knowl Discov Data Min, 6(1), 90–105.
[23] Aggarwal CC, Reddy CK (2013) DATA Clustering Algorithms and Applications.
[24] He Z, Deng S, Xu X (2002) Outlier detection integrating semantic knowledge.
In: Lecture Notes in Computer Science (including subseries Lecture Notes in
Artificial Intelligence and Lecture Notes in Bioinformatics).
[25] Aggarwal CC, Wolf JL, Yu PS (1999) A new method for similarity indexing of
market basket data. SIGMOD Rec (ACM Spec Interes Gr Manag Data). https://​
doi.org/​10.1145/​304181.304218
[26] Nanopoulos A, Manolopoulos Y (2002) Efficient similarity search for market basket
data. VLDB J., 11(2), 138–​152. https://​doi.org/​10.1007/​s00778-​002-​0068-​7
[27] Yan X, Yu PS, Han J (2004) Graph indexing: a frequent structure-​based approach.
In: Proceedings of the ACM SIGMOD International Conference on Management
of Data.
[28] Yan X, Yu PS, Han J (2005) Substructure similarity search in graph databases.
In: Proceedings of the ACM SIGMOD International Conference on Management
of Data.
[29] Yan X, Zhu F, Han J, Yu PS (2006) Searching substructures with superimposed dis-
tance. In: Proceedings of the International Conference on Data Engineering.
[30] Aggarwal CC, Zhai CX (Eds.) (2013) Mining text data. Springer Science &
Business Media
[31] Aggarwal CC, Abdelzaher T (2013) Social sensing. In: Managing and Mining Sensor
Data (pp. 237–​297).
[32] Li Z, Ding B, Han J, Kays R (2010) Swarm: mining relaxed temporal moving object
clusters. In: Proceedings of the VLDB Endowment, 3(1). https://​doi.org/​10.14778/​
1920841.1920934
[33] Deshpande M, Kuramochi M, Wale N, Karypis G (2005) Frequent substructure-​
based approaches for classifying chemical compounds. IEEE Trans Knowl Data
Eng., 17(8), 1036-​1050. https://​doi.org/​10.1109/​TKDE.2005.127

[34] Cong G, Tung AKH, Xu X, et al (2004) FARMER: finding interesting rule groups in
microarray datasets. In: Proceedings of the ACM SIGMOD International Conference
on Management of Data.
[35] Cong G, Tan KL, K.h.tung A, Xu X (2005) Mining top-​k covering rule groups for gene
expression data. In: Proceedings of the ACM SIGMOD International Conference on
Management of Data.
[36] Pan F, Cong G, Tung AKH, et al (2003) Carpenter: Finding closed patterns in long
biological datasets. In: Proceedings of the ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining.
[37] Wang J, Shapiro B, Shasha D (1999) Pattern Discovery in Biomolecular Data: Tools,
Techniques and Applications. Oxford University Press.
[38] Rigoutsos I, Floratos A (1998) Combinatorial pattern discovery in biological
sequences: the TEIRESIAS algorithm. Bioinformatics, 14(1), 55–​67. https://​doi.org/​
10.1093/​bioinformatics/​14.1.55
[39] KDnuggets www.kdnuggets.com/​software/​associations.html
[40] www.cs.waikato.ac.nz/​ml/​weka/​
[41] http://adrem.ua.ac.be/~goethals/software/
[42] http://​fimi.ua.ac.be/​
[43] https://​cran.r-​project.org/​web/​packages/​arules/​index.html
[44] www2.sas.com/​proceedings/​forum2007/​132-​2007.pdf

3 A Journey from Big Data to Data Mining in Quality Improvement
Sharad Goel1 and Prerna Bhatnagar2
1 Director and Professor, Indirapuram Institute of Higher Studies (IIHS), Ghaziabad, Uttar Pradesh
2 Assistant Professor, Indirapuram Institute of Higher Studies (IIHS), Ghaziabad, Uttar Pradesh

CONTENTS
3.1 Introduction..............................................................................................32
3.1.1 Comparing Conventional Data Technique and Big Data Technique.....32
3.2 Big Data Technique Types........................................................................33
3.2.1 Structured Big Data Type.............................................................33
3.2.2 Unstructured Big Data Type.........................................................33
3.2.3 Semi-Structured Big Data Type....................................................33
3.3 Essence of Big Data..................................................................................34
3.3.1 Volume..........................................................................................34
3.3.2 Variety...........................................................................................34
3.3.3 Velocity.........................................................................................34
3.3.4 Variability......................................................................................34
3.3.5 Value.............................................................................................34
3.4 Categorization of Data Mining Systems...................................................37
3.4.1 Classification on the Basis of the Type of Data Source That Is Mined.....37
3.4.2 Classification [3] on the Basis of Kind of Knowledge Discovered.....37
3.4.3 Classification on the Basis of the Data Model on which It Is Drawn.....37
3.4.4 Classification According to Different Mining Techniques That Are Used.....37
3.5 Data Mining Design..................................................................................38
3.5.1 Data Source...................................................................................38
3.5.2 Data Warehouse [2] Server...........................................................38
3.5.3 Data Mining Engine......................................................................38
3.5.4 Pattern Assessment Module..........................................................38
3.5.5 Graphical User Interface (GUI)....................................................39
3.5.6 Knowledge Base...........................................................................39
3.6 Data Mining Architecture.........................................................................39
3.6.1 Issues and Dilemma of Big Data Technique.................................39
3.6.1.1 Dilemma..........................................................................39
3.6.1.2 Issues...............................................................................40
3.6.1.3 Solution of Big Data Technique......................................40
3.6.1.4 Hadoop Distributed File System.....................................40
3.6.1.5 MapReduce.....................................................................41
3.7 Various Data Mining Techniques to Improve Data Quality......................41
3.7.1 Anomaly Detection.......................................................................42
3.7.2 Clustering......................................................................................42
3.7.3 Classification.................................................................................42
3.7.4 Regression.....................................................................................42
3.8 Conclusion.................................................................................................43

DOI: 10.1201/9781003199403-3 31

3.1 INTRODUCTION
Big data consists of data sets so large that they become challenging to process with the available database management tools and mechanisms. Businesses, governmental institutions, HCPs (health care providers), and financial and academic institutions are all taking advantage of big data to address different business needs and to enhance the experience of many customers. About 90% of the global data has been created over the past two years, and this rate is still growing enormously. In addition, big data has now been introduced in almost every industry and every field.

According to Gartner's view, big data can be defined as follows: "Big data is high-volume, high-velocity, and diversified knowledge assets which require cost-effective, contemporary modes of knowledge processing for improved insight as well as for good decision making."

There are certain basic points about the big data technique, discussed below:

• It refers to a large collection of data that is expanding exponentially over time.
• It is so comprehensive that it cannot be processed or evaluated with typical data processing mechanisms.
• It encompasses data mining, data storage, data examination, data sharing, and data visualization.
• It is all-inclusive, covering data and data frameworks, together with the various tools and methods that can be used to examine and evaluate the data.

3.1.1 Comparing Conventional Data Technique and Big Data Technique


Generally, data is a collection of letters, words, numbers, symbols, or images, but with the progression of different multi-tasking, technology-based tools and methods, data has become distinctive in terms of both content and origin. With a view
