C. S. R. Prabhu ·
Aneesh Sreevallabh Chivukula ·
Aditya Mogadala · Rohit Ghosh ·
L. M. Jenila Livingston
Big Data
Analytics:
Systems,
Algorithms,
Applications
C. S. R. Prabhu
National Informatics Centre
New Delhi, Delhi, India
Aneesh Sreevallabh Chivukula
Advanced Analytics Institute
University of Technology, Sydney
Ultimo, NSW, Australia
Aditya Mogadala
Saarland University
Saarbrücken, Saarland, Germany
Rohit Ghosh
Qure.ai
Goregaon East, Mumbai, Maharashtra, India
L. M. Jenila Livingston
School of Computing Science and Engineering
Vellore Institute of Technology
Chennai, Tamil Nadu, India
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Foreword
The Big Data phenomenon has emerged globally as the next wave of technology, one that
will have a major influence on, and contribute to, a better quality of life in all its aspects.
The advent of the Internet of things (IoT) and its associated Fog Computing paradigm
is only accentuating and amplifying the Big Data phenomenon.
This book by C. S. R. Prabhu and his co-authors comes at the right time.
It fills a timely need for a comprehensive text covering all dimensions
of Big Data Analytics: systems, algorithms, applications and case studies along
with emerging research horizons. In each of these dimensions, this book presents a
comprehensive picture to the reader in a lucid and appealing manner. This book can
be used effectively for the benefit of students of undergraduate and post-graduate
levels in IT, computer science and management disciplines, as well as research
scholars in these areas. It also helps IT professionals and practitioners who need to
learn and understand the subject of Big Data Analytics.
I wish this book all the best in its success with the global student community as
well as the professionals.
Preface
focus on open-source technologies. We also discuss the algorithms and models used
in data mining tasks such as search, filtering, association, clustering, classification,
regression, forecasting, optimization, validation and visualization. These techniques
are applicable to various categories of content generated in data streams, sequences,
graphs and multimedia in transactional, in-memory and analytic databases. Big
Data Analytics techniques comprising descriptive and predictive analytics with an
emphasis on feature engineering and model fitting are covered. For feature engi-
neering steps, we cover feature construction, selection and extraction along with
preprocessing and post-processing techniques. For model fitting, we discuss the
model evaluation techniques such as statistical significance tests, cross-validation
curves, learning curves, sufficient statistics and sensitivity analyses. Finally, we
present the latest developments and innovations in generative learning and dis-
criminative learning for large-scale pattern recognition. These techniques comprise
incremental, online learning for linear/nonlinear and convex/multi-objective opti-
mization models, feature learning or deep learning, evolutionary learning for
scalability and optimization meta-heuristics.
Machine learning algorithms for big data cover broad areas of learning such as
supervised, unsupervised, semi-supervised and reinforcement techniques. In
particular, the supervised learning subsection details several classification and
regression techniques to classify and forecast, while the unsupervised learning techniques
cover clustering approaches that are based on linear algebra fundamentals.
Similarly, semi-supervised methods presented in the chapter cover approaches that
help to scale to big data by learning from largely un-annotated information. We also
present reinforcement learning approaches which are aimed to perform collective
learning and support distributed scenarios.
The additional unique features of this book are about 15 real-life experiences as
case studies which have been provided in the above-mentioned application
domains. The case studies provide, in brief, the experiences of the different contexts
of deployment and application of the techniques of Big Data Analytics in the
diverse contexts of private and public sector enterprises. These case studies span
product companies such as Google, Facebook, Microsoft, consultancy companies
such as Kaggle and also application domains at power utility companies such as
Opower, banking and finance companies such as Deutsche Bank. They help the
readers to understand the successful deployment of analytical techniques that
maximize a company's functional effectiveness, diversity in business and customer
relationship management, in addition to improving the financial benefits. All these
companies handle real-life Big Data ecosystems in their respective businesses to
achieve tangible results and benefits. For example, Google not only harnesses, for
profit, the big data ecosystem arising out of its huge number of users with billions of
web searches and emails by offering customized advertisement services, but also is
offering to other companies to store and analyze the big datasets in cloud platforms.
Google has also developed an IoT sensor-based autonomous Google car with
real-time analytics for driverless navigation. Facebook, the largest social network in
the world, deployed big data techniques for personalized search and advertisement.
LinkedIn also deploys big data techniques for effective service delivery.
Microsoft also aspires to enter the big data business scenario by offering services of
Big Data Analytics to business enterprises on its Azure cloud services. Nokia
deploys its Big Data Analytics services on the huge buyer and subscriber base of its
mobile phones, including the mobility of its buyers and subscribers. Opower, a
power utility company, has deployed Big Data Analytics techniques on its customer
data to achieve substantial benefits on power savings. Deutsche Bank has deployed
big data techniques for achieving substantial savings and better customer rela-
tionship management (CRM). Delta Airlines improved its revenues and customer
relationship management (CRM) by deploying Big Data Analytics techniques.
Traffic management in a Chinese city was also achieved successfully by adopting big data
methods.
Thus, this book provides a complete survey of techniques and technologies in
Big Data Analytics. This book will act as a basic textbook introducing niche
technologies to undergraduate and postgraduate computer science students. It can also
act as a reference book for professionals interested in pursuing leadership-level career
opportunities in data and decision sciences by focusing on the concepts for problem
solving and solutions for competitive intelligence. Big data applications are
discussed in a plethora of books but, to the best of our knowledge, there is no textbook
covering a similar mix of technical topics. For further clarification, we provide
references to white papers and research papers on specific topics.
About This Book
Contents
1.11 Conclusion
1.12 Review Questions
References and Bibliography
2 Intelligent Systems
Aneesh Sreevallabh Chivukula
2.1 Introduction
2.1.1 Open-Source Data Science
2.1.2 Machine Intelligence and Computational Intelligence
2.1.3 Data Engineering and Data Sciences
2.2 Big Data Computing
2.2.1 Distributed Systems and Database Systems
2.2.2 Data Stream Systems and Stream Mining
2.2.3 Ubiquitous Computing Infrastructures
2.3 Conclusion
2.4 Review Questions
References
3 Analytics Models for Data Science
L. M. Jenila Livingston
3.1 Introduction
3.2 Data Models
3.2.1 Data Products
3.2.2 Data Munging
3.2.3 Descriptive Analytics
3.2.4 Predictive Analytics
3.2.5 Data Science
3.2.6 Network Science
3.3 Computing Models
3.3.1 Data Structures for Big Data
3.3.2 Feature Engineering for Structured Data
3.3.3 Computational Algorithm
3.3.4 Programming Models
3.3.5 Parallel Programming
3.3.6 Functional Programming
3.3.7 Distributed Programming
3.4 Conclusion
3.5 Review Questions
References
About the Authors
Dr. C. S. R. Prabhu has held prestigious positions with Government of India and
various institutions. He retired as Director General of the National Informatics
Centre (NIC), Ministry of Electronics and Information Technology, Government of
India, New Delhi, and has worked with Tata Consultancy Services (TCS), CMC,
TES and TELCO (now Tata Motors). He was also faculty for the Programs of the
APO (Asian Productivity Organization). He has taught and researched at the
University of Central Florida, Orlando, USA, and also had a brief stint as a
Consultant to NASA. He was Chairman of the Computer Society of India (CSI),
Hyderabad Chapter. He is presently working as an Advisor (Honorary) at KL
University, Vijayawada, Andhra Pradesh, and as a Director of Research and
Innovation at Keshav Memorial Institute of Technology (KMIT), Hyderabad.
He received his Master’s degree in Electrical Engineering with specialization in
Computer Science from the Indian Institute of Technology, Bombay. He has guided
many Master’s and doctoral students in research areas such as Big Data.
Dr. L. M. Jenila Livingston is an Associate Professor with the CSE Dept at VIT,
Chennai. Her teaching foci and research interests include artificial intelligence, soft
computing, and analytics.
Chapter 1
Big Data Analytics
1.1 Introduction
The latest disruptive trends and developments in the digital age comprise social
networking, mobility, analytics and cloud, popularly known as SMAC. The year 2016 saw
Big Data technologies being leveraged to power business intelligence applications.
What is in store for 2020 and beyond?
Big Data for governance and for competitive advantage is going to get a big
push in 2020 and beyond. The tug of war between governance and data value will
have to be balanced in 2020 and beyond. Enterprises will put to use the enormous
data or Big Data they already have about their customers, employees, partners and
other stakeholders by deploying it for both regulatory use cases and non-regulatory
use cases of value to business management and business development. Regulatory
use cases require governance, data quality and lineage so that a regulatory body can
analyze and track the data to its source all through its various transformations. On
the other hand, non-regulatory uses of data include 360° customer monitoring
or offering customer services, where high cardinality, real-time processing and a mix of
structured, semi-structured and unstructured data will produce more effective results.
It is expected that in 2020 businesses will shift to a data-driven approach. All
businesses today require analytical and operational capabilities to address customers,
process claims and use interfaces to IoT devices such as sensors in real time, at a
personalized level, for each individual customer. For example, an e-commerce site can
provide individual recommendations after checking prices in real time. Similarly,
health monitoring for providing medical advice through telemedicine can be made
operational using IoT devices that monitor each individual's vital health parameters.
Health insurance companies can process valid claims and stop paying fraudulent
claims by combining analytics techniques with their operational systems. Media
companies can deliver personalized content through set-top boxes. The list of such
use cases is endless. Delivering such use cases requires an agile platform
that provides both analytical results and operational
efficiency, so as to make office operations more relevant and accurate, backed
by analytical reasoning. In fact, in 2020 and beyond business organizations will
go beyond just asking questions and take great strides to achieve both initial and
long-term business value.
Agility, both in data and in software, will become the differentiator in business in
2020 and beyond. Instead of just maintaining large data lakes, repositories, databases
or data warehouses, enterprises will leverage data agility, that is, the ability to
understand data in context and take intelligent decisions on business actions based on
data analytics and forecasting.
Agile processing models will enable the same instance of data to support
batch analytics, interactive analytics, global messaging, database models and all
other manifestations of data, all in full synchronization. More agile data analytics
models will need to be deployed when a single instance of data can support
a broader set of tools. The end outcome will be an agile development and application
platform that supports a very broad spectrum of processing and analytical models.
Blockchain is the big thrust area in 2020 in financial services, as it provides
a disruptive way to store and process transactions. Blockchain runs on a global
network of distributed computer systems which anyone can view and examine.
Transactions are stored in blocks such that each block refers to the previous block, all
of them being time-stamped and stored in a form unchangeable by hackers, as the
world has a complete view of all transactions in a blockchain. Blockchain will
speed up financial transactions significantly, at the same time providing security
and transparency to individual customers. For enterprises, blockchain will result in
savings and efficiency. Blockchain can be implemented in a Big Data environment.
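The idea that each block refers to the previous block can be illustrated with a minimal sketch in Python. It is not a real blockchain: there is no network, consensus or proof of work, and the function names (block_hash, add_block, is_valid) and the sample transactions are hypothetical, chosen only for this illustration. It shows only why changing an earlier block becomes visible to anyone who re-checks the chain.

```python
import hashlib
import json
import time


def block_hash(block):
    """Return the SHA-256 hex digest of a block's contents."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def add_block(chain, transactions, timestamp=None):
    """Append a time-stamped block that refers to the previous block's hash."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    block = {
        "index": len(chain),
        "timestamp": time.time() if timestamp is None else timestamp,
        "transactions": transactions,
        "prev_hash": prev_hash,
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Check that every block still points at the hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


# Illustrative data only; fixed timestamps keep the example reproducible.
chain = []
add_block(chain, ["genesis"], timestamp=0)
add_block(chain, ["alice pays bob 5"], timestamp=1)
add_block(chain, ["bob pays carol 2"], timestamp=2)
print(is_valid(chain))  # True: all links are intact

chain[1]["transactions"] = ["alice pays bob 500"]  # tamper with history
print(is_valid(chain))  # False: block 2 no longer matches block 1
```

In this sketch a change to the most recent block is not detected, because no later block refers to it yet. Real systems close that gap with replication across many nodes and a consensus mechanism, which are outside the scope of this illustration.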
In 2020, microservices will be offered in a big way, leveraging Big Data
Analytics and machine learning by utilizing huge amounts of historical data to better
understand the context of newly arriving streaming data. Smart IoT devices
will collaborate and analyze one another's data, using machine learning algorithms to
adjudicate peer-to-peer decisions in real time.
There will also be a shift from post-event and real-time analytics to pre-event and
action (based on real-time data from immediate past).
Ubiquity of connected data applications will be the order of the day. In 2020,
modern data applications will be highly portable, containerized and connected, quickly
replacing vertically integrated monolithic software technologies.
Productization of data will be the order of the day in 2020 and beyond. Data will
be a product, a commodity, to buy or to sell, resulting in new business models for
monetization of data.
as Google, Facebook, Yahoo and others, operates at Internet scale that needed to
process the ever-increasing numbers of users and their data which was of very large
volume, with large variety, high veracity and changing with high velocity which had a
great value. The traditional techniques of handling and processing data were found
to be completely inadequate for the task. Therefore, new approaches and a
new paradigm were required. Building on the old technologies, the new framework of Big
Data Architecture was evolved by the very same companies that needed it. Thence
came the birth of the Internet-scale commercial supercomputing paradigm, or Big Data.
This paradigm shift brought disruptive changes to organizations and vendors across
the globe and also to large social networks, so as to encompass the whole planet, in all
walks of life, with the Internet of things (IoT) contributing in a big way to Big
Data. Big Data is not merely a trendy new fashion of computing: it is sure to transform
the way computing is performed, and it is so disruptive that its impact will last
for many generations to come.
Big Data is the commercial equivalent of HPC or supercomputing (for scientific
computing) with a difference: scientific supercomputing or HPC is computation
intensive, with scientific calculations as the main focus of computing, whereas Big
Data mainly processes very large volumes of data to find patterns of behavior
in the data which were previously unknown.
Today, Internet-scale commercial companies such as Amazon, eBay and Flipkart
use commercial supercomputing to solve their Internet-scale business problems, even
though commercial supercomputing can be harnessed for many more tasks than
simple commercial transactions, such as fraud detection, analyzing bounced checks or tracking
Facebook friends. While scientific supercomputing activity came downward and
commercial supercomputing activity went upward, both are reaching a state
of equilibrium. Big data will play an important role in 'decarbonizing' the global
economy and will also help work toward the Sustainable Development Goals.
Industry 4.0, Agriculture or Farming 4.0, Services 4.0, Finance 4.0 and beyond
are the expected outcomes of applying IoT and Big Data Analytics techniques
together to the existing versions of the same sectors of industry, agriculture or
farming, services and finance, weaving together many sectors of the economy into the
one new order of the World 4.0. Beyond this, the World 5.0 is aimed to be achieved
by the governments of China and Japan by deploying IoT and Big Data in a big way,
a situation in which governments may become 'big brothers,' too powerful in tracking
everything and aiming to control everything. That is where we need to find a scenario
of Humans 8.0 who have human values or Dharma, so as to be independent and
yet have a sustainable way of life. We shall now see how the Big Data technologies
based on Hadoop and Spark can handle, in practice, the massive amounts of data that
are pouring in in modern times.
1.4 Hadoop
Hadoop was the first commercial supercomputing software platform that works at
scale and is also affordable at scale. Hadoop is based on exploiting parallelism and
was originally developed at Yahoo to solve specific problems. It was soon realized to
have large-scale applicability to problems faced across Internet-scale companies
such as Facebook or Google. Originally, Yahoo utilized Hadoop for tracking all user
navigation clicks in the web search process and harnessing them for advertisers. This meant
processing millions of clickstream records on tens of thousands of servers across the
globe, on an Internet-scale database that was economical enough to build and operate.
No existing solution was found capable of handling this problem. Hence, Yahoo built,
from scratch, the entire ecosystem for effectively handling this requirement. Thus
was born Hadoop [1]. Like Linux, Hadoop is open source. Just as Linux
spans clusters of servers, clusters of HPC servers and clouds, so Hadoop
has created the Big Data ecosystem of new products, vendors, new startups and
disruptive possibilities. Even though Hadoop originated in the open-source domain, today even
Microsoft's operating system supports Hadoop.
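The way such a platform exploits parallelism for clickstream tracking can be sketched in a few lines of Python. This is an illustration only and does not use the Hadoop API: a thread pool on one machine stands in for a cluster of servers, and the function names (map_clicks, shuffle, reduce_counts, count_clicks) and the sample click records are hypothetical. Each partition of the click log is mapped independently, the intermediate pairs are grouped by key, and a reduce step totals the clicks per page.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_clicks(partition):
    """Map step: emit a (url, 1) pair for each click record in one partition."""
    return [(url, 1) for _user, url in partition]


def shuffle(mapped_partitions):
    """Shuffle step: group the emitted values by key (the url)."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_clicks(partitions, workers=2):
    """Run the map step on the partitions in parallel, then shuffle and reduce."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        mapped = list(executor.map(map_clicks, partitions))
    return reduce_counts(shuffle(mapped))


# Illustrative click log, already split into two partitions of (user, url) records.
partitions = [
    [("u1", "/home"), ("u2", "/search"), ("u1", "/search")],
    [("u3", "/home"), ("u2", "/ads"), ("u3", "/search")],
]
totals = count_clicks(partitions)
print(sorted(totals.items()))  # [('/ads', 1), ('/home', 2), ('/search', 3)]
```

Because the map step touches each partition independently, adding more partitions and more workers scales the same logic outward, which is the property that makes this style of processing economical on large clusters of commodity servers.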
1.5 Silos