Unit-III CC&BD CS71
Textbooks:
1. Dan C. Marinescu, Cloud Computing: Theory and Practice, Morgan Kaufmann / Elsevier.
2. Arshdeep Bahga and Vijay Madisetti, Cloud Computing: A Hands-On Approach, Universities Press.
3. Seema Acharya and Subhashini Chellappan, Big Data Analytics, 2nd edition, Wiley India Pvt. Ltd., 2019.
4. Tom White, Hadoop: The Definitive Guide, 3rd edition, O'Reilly Media, Inc., 2012.
5. Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills, Advanced Analytics with Spark: Patterns for Learning from Data at Scale, 2nd edition, O'Reilly Media, Inc., 2017.
NOTE: I declare that the PPT content is picked up from the prescribed course textbooks or reference material prescribed in the syllabus book and online portals.
Introduction to Big Data:
• What is Big Data and Why is it Important?
• Types of Digital Data
• Big Data – Definition, Characteristics, Evolution of Big Data, Challenges
• Comparison with BI; Cloud Computing and Big Data
• Cloud Services for Big Data
• In-Memory Computing Technology for Big Data
Introduction to Big Data
• The "Internet of Things" and its widely ultra-connected nature are leading to a increase
rapidly rise in big data. There is no scarcity of data for today's enterprise.
o How does it compare with the Traditional Business Intelligence (BI) environment?
Structured data: data stored in the form of rows and columns (databases, Excel).
Big data: a massive amount of data which cannot be stored, processed and analyzed
using traditional database systems and tools.
Big data is data that exceeds the processing and storage capacity of conventional
database systems.
The data is too big, moves too fast, or does not fit the structures of traditional
database architectures/systems.
Big data is a collection of data sets that are complex in nature, fast growing, and
varied, including both structured and unstructured data.
• Part I of the definition:
Walmart
• handles 1 million customer transactions per hour
• 2.5 petabytes of data.
Facebook
• handles 40 billion photos from its user base
• inserts 500 terabytes of new data every day
• stores, accesses, and analyzes 30 petabytes of user-generated data
More than 5 billion people are calling, texting, tweeting and browsing on
mobile phones worldwide.
Big data analytics: the process of extracting meaningful insights from big data,
such as hidden patterns, unknown facts and correlations, and other insights.
Big Data analytics is the process of collecting, organizing and analyzing large
sets of data (called Big Data) to discover patterns and other useful
information.
Big data analytics is the use of advanced analytic techniques against very
large, diverse data sets that include
o structured, semi-structured and unstructured data,
o from different sources, and in different sizes
Characteristics of Big Data
Why Is It Important?
• A computing "perfect storm": Big Data analytics is the natural result of four major global trends:
• Moore’s Law (which basically says that technology always gets cheaper),
• Mobile Computing (that smart phone or mobile tablet in your hand),
• Social Networking (Facebook, Foursquare, Pinterest, etc.), and
• Cloud Computing (you don’t even have to own hardware or software
anymore; you can rent or lease someone else’s).
Volumes of transactional data have been around for decades for most big firms,
but the floodgates have now opened: more volume, velocity and variety (the three Vs)
of data has arrived in unprecedented ways.
“apart from the changes in the actual hardware and software technology, there has
also been a massive change in the actual evolution of data systems. I compare it to the
stages of learning: Dependent, Independent, and Interdependent.”
Data systems were fairly new and users didn't quite know what they wanted. IT
assumed that "Build it and they shall come."
Users understood what an analytical platform was and worked together with IT to
define the business needs and approach for deriving insights for their firm.
During the Customer Relationship Management (CRM) era of the 1990s, many
companies made substantial investments in customer-facing technologies that
subsequently failed to deliver expected value.
• The reason for most of those failures was fairly straightforward: management either
forgot (or just didn't know) that big projects require a synchronized transformation
of people, process, and technology. All three must be marching in step or the project
is doomed.
Evolution of Big Data:
• 1970s and before was the era of mainframes. The data was essentially primitive and
structured.
• Relational databases evolved in 1980s and 1990s. The era was of data intensive
applications.
• The World Wide Web (WWW) and the Internet of Things (IOT) have led to an
onslaught of structured, unstructured, and multimedia data
Why Big Data? Applications
1. Understanding and Targeting Customers
• Here, big data is used to better understand customers and their behaviors and
preferences.
• Using big data, Telecom companies can now better predict customer churn;
• Wal-Mart can predict what products will sell, and
• car insurance companies understand how well their customers actually drive.
• Even government election campaigns can be optimized using big data analytics.
3. Personal Quantification and Performance Optimization
• Big data is not just for companies and governments but also for all of us
individually.
• We can now benefit from the data generated by wearable devices such as
smart watches or smart bracelets, which collect data on our calorie consumption,
activity levels, and sleep patterns.
• Most online dating sites apply big data tools and algorithms to find us the most
appropriate matches.
4. Improving Healthcare and Public Health
• The computing power of big data analytics enables us to decode entire DNA
strings in minutes and will allow us to understand and predict disease patterns.
• Big data techniques are already being used to monitor babies in a specialist
premature and sick baby unit.
• By recording and analyzing every heartbeat and breathing pattern of every
baby, the unit was able to develop algorithms that can now predict infections 24
hours before any physical symptoms appear.
5. Improving Sports Performance
• Most elite sports have now embraced big data analytics. We have the IBM
SlamTracker tool for tennis tournaments;
• we use video analytics that track the performance of every player in a
football or baseball game, and sensor technology in sports equipment such as
basketballs or golf clubs allows us to get feedback (via smartphones and
cloud servers) on our game and how to improve it.
6. Improving Science and Research
• The CERN data center has 65,000 processors to analyze its 30 petabytes of data,
and it uses thousands of computers distributed across 150 data centers worldwide
to analyze the data.
7. Optimizing Machine and Device Performance
• Big data tools are used, for example, to operate Google’s self-driving car.
• The Toyota Prius is fitted with cameras, GPS, as well as powerful computers and
sensors to safely drive on the road without the intervention of human beings.
8. Improving Security and Law Enforcement
• The National Security Agency (NSA) in the U.S. uses big data analytics to
prevent terrorist plots.
• Others use big data techniques to detect and prevent cyber attacks.
9. Improving and Optimizing Cities and Countries
• Big data is used to improve many aspects of our cities and countries.
• For example, it allows cities to optimize traffic flows based on real time traffic
information as well as social media and weather data.
• In smart city pilots, a bus would wait for a delayed train, and traffic signals
predict traffic volumes and operate to minimize jams.
10. Financial Trading
• High-Frequency Trading (HFT) is an area where big data finds a lot of use
today. Here, big data algorithms are used to make trading decisions.
• Today, the majority of equity trading takes place via data algorithms that
increasingly take into account signals from social media networks and news
websites to make buy and sell decisions in split seconds.
A Wider Variety of Data
The variety of data sources continues to increase. Traditionally, internally focused
operational systems, such as ERP (enterprise resource planning) and CRM
applications, were the major source of data used in analytic processing.
The wide variety of data leads to complexities in ingesting the data into data
storage.
3. Frictionless actions. Increased reliability and accuracy that will allow the deeper and
broader insights to be automated into systematic actions.
Big Data and the New School of Marketing
“Today's consumers have changed. They've put down the newspaper, they fast
forward through TV commercials, and they junk unsolicited email. Why? They have
new options that better fit their digital lifestyle. They can choose which marketing
messages they receive, when, where, and from whom.
• New School marketers deliver what today's consumers want: relevant interactive
communication across the digital power channels: email, mobile, social, display and
the web.”
(2) They can automate and optimize their programs and processes throughout the
customer lifecycle. Once marketers have that, they need a practical framework
for planning marketing activities.
• Let's take a look at the various loops that guide marketing strategies and tactics
in the Cross-Channel Lifecycle Marketing approach: conversion, repurchase,
stickiness, win-back, and re-permission (see Figure 2.1 ).
Web Analytics
• Web analytics is the measurement, collection, analysis and reporting of web data for
purposes of understanding and optimizing web usage.
• Hit, Page View, Visit/Session, First Visit/First Session, Repeat Visitor, New
Visitor, Bounce Rate, Exit Rate, Page Time Viewed/Page Visibility Time/Page
View Duration, Session Duration/Visit Duration, Average Page View Duration,
Click Path, etc.
• A distinctive feature of the Web is that the primary way in which data gets collected,
processed, stored, and accessed is actually at a third party.
• Big Data on the Web will completely transform a company’s ability to understand the
effectiveness of its marketing and hold its people accountable for the millions of
dollars that they spend. It will also transform a company’s ability to understand how
its competitors are behaving.
Web event data is incredibly valuable
• It tells you how your customers actually behave (in lots of detail), and how that
varies
• Between different customers
• For the same customers over time. (Seasonality, progress in customer journey)
• How behaviour drives value
• It tells you how customers engage with you via your website / webapp
• How that varies by different versions of your product
• How improvements to your product drive increased customer satisfaction and
lifetime value
• It tells you how customers and prospective customers engage with your
different marketing campaigns and how that drives subsequent behaviour
Web analytics tools are good at delivering the standard reports that are common across
different business types
Where does your traffic come from, e.g.:
• Sessions by marketing campaign / referrer
• Sessions by landing page
Understanding events common across business types (page views, transactions, ‘goals’),
e.g.:
• Page views per session
• Page views per web page
• Conversion rate by traffic source
• Transaction value by traffic source
Challenges of Big Data
• Condition: The condition of data deals with the state of data, that is,
• "Can one use this data as is for analysis?" or
• "Does it require cleansing for further enhancement and enrichment?"
Storage: Cloud computing is the answer to managing infrastructure for big data as far as
cost-efficiency, elasticity and easy upgrading/downgrading are concerned. However, this
further complicates the decision to host big data solutions outside the enterprise.
Data retention: How long should one retain this data? Some data may be required for
long-term decisions, but some data may quickly become irrelevant and obsolete.
Skilled professionals: In order to develop, manage and run those applications that
generate insights, organizations need professionals who possess a high-level proficiency in
data sciences.
Other challenges: Other challenges of big data are with respect to capture, storage,
search, analysis, transfer and security of big data.
Visualization: Big data refers to datasets whose size is typically beyond the storage
capacity of traditional database software tools.
• There is no explicit definition of how big the data set should be for it to be
considered big data.
The ability to build massively scalable platforms—platforms where you have the
option to keep adding new products and services for zero additional cost—is giving
rise to business models that weren’t possible before
Mehta calls it “the next industrial revolution, where the raw material is data and data
factories replace manufacturing factories.”
He pointed out a few guiding principles that his firm stands by:
• It's not about the fact that it is virtual, but the true value lies in delivering
software, data, and/or analytics in an “as a service” model.
• Whether that is in a private hosted model or a publicly shared one does not matter.
The delivery, pricing, and consumption model matters.
• Algorithmic trading and supply chain optimization are just two typical
examples where predictive analytics have greatly reduced the friction in
business.
• Look for predictive analytics to proliferate in every facet of our lives, both
personal and business. Here are some leading trends that are making their
way to the forefront of businesses today:
Recommendation engines similar to those used in Netflix and Amazon that use past
purchases and buying behavior to recommend new purchases.
Risk engines for a wide variety of business areas, including market and credit risk,
catastrophic risk, and portfolio risk.
Innovation engines for new product innovation, drug discovery, and consumer and
fashion trends to predict potential new product formulations and discoveries.
Customer insight engines that integrate a wide variety of customer related info,
including sentiment, behavior, and even emotions.
• Customer insight engines will be the backbone in online and set-top box
advertisement targeting, customer loyalty programs to maximize customer
lifetime value, optimizing marketing campaigns for revenue lift, and targeting
individuals or companies at the right time to maximize their spend.
Optimization engines that optimize complex interrelated operations and decisions that
are too overwhelming for people to systematically handle at scale, such as when, where,
and how to seek natural resources to maximize output while reducing operational
costs, or what potential competitive strategies should be used in a global business
that takes into account the various political, economic, and competitive pressures along
with both internal and external operational capabilities.
In-Memory Computing Technology for Big Data
The Elephant in the Room: Hadoop's Parallel World
• Hadoop runs on clusters of commodity servers and each of those servers has local
CPUs and disk storage that can be leveraged by the system.
The two critical components of Hadoop are:
1. The Hadoop Distributed File System (HDFS).
2. MapReduce.
• Because Hadoop stores the entire dataset in small pieces across a collection of
servers, analytical jobs can be distributed, in parallel, to each of the servers
storing part of the data.
• Each server evaluates the question against its local fragment simultaneously and
reports its results back for collation into a comprehensive answer.
• MapReduce is the agent that distributes the work and collects the results
HDFS continually monitors the data stored on the cluster.
• If one of those servers is slow in returning an answer or fails before completing its work,
MapReduce automatically starts another instance of that task on another server that
has a copy of the data.
Because of the way that HDFS and MapReduce work, Hadoop provides scalable, reliable,
and fault-tolerant services for data storage and analysis at very low cost.
Basics of Hadoop:
This flood of data is coming from many sources. Consider the following:
• The New York Stock Exchange generates about one terabyte of new trade
data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of
storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate
of 20 terabytes per month.
• The Large Hadron Collider near Geneva, Switzerland, will produce about 15
petabytes of data per year.
Data Storage and Analysis
The problem is simple: while the storage capacities of hard drives have increased
massively over the years, access speeds—the rate at which data can be read from
drives— have not kept up.
• One typical drive from 1990 could store 1,370 MB of data and had a transfer
speed of 4.4 MB/s, so you could read all the data from a full drive in around
five minutes.
• Over 20 years later, one terabyte drives are the norm, but the transfer speed is
around 100 MB/s, so it takes more than two and a half hours to read all the
data off the disk.
This is a long time to read all data on a single drive—and writing is even slower. The
obvious way to reduce the time is to read from multiple disks at once. Imagine if we had
100 drives, each holding one hundredth of the data. Working in parallel, we could read
the data in under two minutes.
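As a quick check on those figures (a back-of-the-envelope calculation using the round numbers quoted above):
    1 TB from a single drive:  1,000,000 MB ÷ 100 MB/s = 10,000 s  ≈ 2.8 hours
    100 drives in parallel (about 10 GB each):  10,000 MB ÷ 100 MB/s = 100 s  ≈ 1.7 minutes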
There’s more to being able to read and write data in parallel to or from multiple disks,
though.
The first problem to solve is hardware failure: as soon as you start using many pieces of
hardware, the chance that one will fail is fairly high.
The second problem is that most analysis tasks need to be able to combine the data in
some way;
• data read from one disk may need to be combined with the data from any of the
other 99 disks. Various distributed systems allow data to be combined from
multiple sources, but doing this correctly is notoriously challenging. MapReduce
provides a programming model that abstracts the problem from disk reads and
writes, transforming it into a computation over sets of keys and values.
Comparison with Other Systems
Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is
MapReduce needed?
• The answer to these questions comes from another trend in disk drives:
• seek time is improving more slowly than transfer rate.
• Seeking is the process of moving the disk’s head to a particular place on
the disk to read or write data.
• It characterizes the latency of a disk operation, whereas the transfer rate
corresponds to a disk’s bandwidth
What is MapReduce in Hadoop?
• Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• Typically the compute nodes and the storage nodes are the same, i.e., the MapReduce
framework and the Hadoop Distributed File System are running on the same set of nodes.
• The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per
cluster node.
• The master is responsible for scheduling the jobs' component tasks on the
slaves, monitoring them and re-executing the failed tasks.
• Hadoop can run MapReduce programs written in various languages; in this chapter,
we look at the same program expressed in Java, Ruby, and Python.
A Weather Dataset
• For our example, we will write a program that mines weather data.
• Weather sensors collect data every hour at many locations across the globe and
gather a large volume of log data, which is a good candidate for analysis with
MapReduce because we want to process all the data, and the data is semi-
structured and record-oriented
Data Format
• The data we will use is from the National Climatic Data Center, or
NCDC (http://www.ncdc.noaa.gov/).
• The data is stored using line-oriented ASCII format, in which each line is a
record.
• For simplicity, we focus on the basic elements, such as temperature, which are
always present and are of fixed width.
Example 2-1 shows a sample line with some of the salient fields highlighted.
• Data files are organized by date and weather station.
• There is a directory for each year from 1901 to 2001, each containing a gzipped file
for each weather station with its readings for that year.
• For example, here are the first entries for 1990 (file listing not reproduced).
• Since there are tens of thousands of weather stations, the whole dataset is made up
of a large number of relatively small files.
Processing the readings:
• The air temperature value is turned into an integer by adding 0.
• Next, a test is applied to see if the temperature is valid (the value 9999 signifies
a missing value in the NCDC dataset) and if the quality code indicates that the
reading is not suspect or erroneous.
• The temperature values in the source file are scaled by a factor of 10, so this works
out as a maximum temperature of 31.7°C for 1901 (there were very few readings at the
beginning of the century, so this is plausible).
What’s the highest recorded global temperature for each year in the dataset?
Analyzing the Data with Hadoop
• To take advantage of the parallel processing that Hadoop provides, we need to express our
query as a MapReduce job.
• After some local, small-scale testing, we will be able to run it on a cluster of machines.
• MapReduce works by breaking the processing into a map phase and a reduce phase.
• Each phase has key-value pairs as input and output, the types of which may be
chosen by the programmer.
• The programmer also specifies two functions: the map function and the reduce
function
• The input to our map phase is the raw NCDC data. We choose a text input format that
gives us each line in the dataset as a text value.
• The key is the offset of the beginning of the line from the beginning of the file, but as
we have no need for this, we ignore it.
• Our map function is simple. We pull out the year and the air temperature, since these
are the only fields we are interested in.
• The map function is also a good place to drop bad records: here we filter out
temperatures that are missing, suspect, or erroneous.
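To make the key-value flow concrete, here is a small illustrative trace (the offsets, years and temperature values are made up for illustration; real NCDC lines are much longer):

    Map input (key = offset of the line in the file, value = the line of NCDC text):
        (0,   ...1950...+0022...)
        (106, ...1950...-0011...)
        (212, ...1949...+0111...)
    Map output (year, temperature in tenths of a degree Celsius):
        (1950, 22), (1950, -11), (1949, 111)
    Shuffle/sort groups the values by key for the reduce phase:
        (1949, [111])
        (1950, [22, -11])
    Reduce output (maximum temperature per year):
        (1949, 111)
        (1950, 22)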
Java MapReduce
• Having seen how the MapReduce program works, the next step is to express it in code. We need
three things: a map function, a reduce function, and some code to run the job.
• The map function is represented by the Mapper class, which declares an abstract
map() method. Example 2-3 shows the implementation of our map function.
• The Mapper class is a generic type, with four formal type parameters that specify
the input key, input value, output key, and output value types of the map function.
• Rather than using built-in Java types, Hadoop provides its own set of basic types that
are optimized for network serialization.
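As a concrete sketch of the map and reduce functions for this query (modelled on the textbook's Example 2-3 and written against the newer org.apache.hadoop.mapreduce API; the fixed-width offsets 15-19 for the year, 87-92 for the temperature and 92 for the quality code follow the NCDC format described above, and each public class would live in its own source file):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: input key = offset of the line within the file, input value = one NCDC record.
    // Output key = year, output value = air temperature in tenths of a degree Celsius.
    public class MaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final int MISSING = 9999;   // NCDC code for a missing reading

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;
            if (line.charAt(87) == '+') {           // skip a leading plus sign before parsing
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            // Drop records that are missing, suspect, or erroneous.
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    // Reducer: receives (year, list of temperatures) and emits (year, maximum temperature).
    public class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }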
• The tasks are scheduled using YARN and run on nodes in the cluster.
• There are two types of nodes that control the job execution process: a jobtracker and a
number of tasktrackers.
• The jobtracker coordinates all the jobs run on the system by scheduling
tasks to run on tasktrackers.
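For completeness, the "code to run the job" can be sketched as a small driver class like the one below (the class name and the input/output path arguments are illustrative; on Hadoop 2 and later the job is submitted to YARN rather than to a jobtracker):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: wires the mapper and reducer together and submits the job to the cluster.
    public class MaxTemperature {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(MaxTemperature.class);
            job.setJobName("Max temperature");

            FileInputFormat.addInputPath(job, new Path(args[0]));    // directory of NCDC input files
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)

            job.setMapperClass(MaxTemperatureMapper.class);
            job.setReducerClass(MaxTemperatureReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

It would typically be packaged into a JAR and launched with something like "hadoop jar max-temperature.jar MaxTemperature input/ncdc output" (the JAR name and paths here are hypothetical).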
• Having many splits means the time taken to process each split is small
compared to the time to process the whole input.
• So if we are processing the splits in parallel, the processing is better load balanced
when the splits are small, since a faster machine will be able to process
proportionally more splits over the course of the job than a slower machine.
• Hadoop does its best to run the map task on a node where the input data resides in
HDFS, because it doesn’t use valuable cluster bandwidth. This is called the data
locality optimization.
• Sometimes, however, all the nodes hosting the HDFS block replicas for a map task’s
input split are running other map tasks, so the job scheduler will look for a free map slot
on a node in the same rack as one of the blocks.
• Very occasionally even this is not possible, so an off-rack node is used, which results in
an inter-rack network transfer. The three possibilities are illustrated in Figure 2-2.
It should now be clear why the optimal split size is the same as the block size: it is the
largest size of input that can be guaranteed to be stored on a single node.
• If the split spanned two blocks, it would be unlikely that any HDFS node stored both
blocks,
• so some of the split would have to be transferred across the network to the node
running the map task, which is clearly less efficient than running the whole map task
using local data.
• The whole data flow with a single reduce task is illustrated in Figure 2-3.
• When there are multiple reducers, the map tasks partition their output, each
creating one partition for each reduce task.
• There can be many keys (and their associated values) in each partition,
but the records for any given key are all in a single partition.
• The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4.
• This diagram makes it clear why the data flow between map and reduce tasks is
colloquially known as “the shuffle,” as each reduce task is fed by many map tasks.
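To see why all records for a given key land in a single partition, note that by default the map output is partitioned by hashing the key. A sketch of that rule is below; a custom partitioner would subclass Partitioner in the same way, and the class name here is made up:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hash-based partitioning: the same key always maps to the same reduce task,
    // so every (year, temperature) record for a given year reaches a single reducer.
    public class YearPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }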
MapReduce Data Flow with Multiple Reduce Tasks
Cloud Security:
• Risks
• Security of virtualization
• The cloud gives you access to more applications, improves data accessibility, helps
your team collaborate more effectively, and provides easier content management.
• Some people may have reservations about switching to the cloud due to security
concerns, but a reliable cloud service provider (CSP) can put your mind at ease
and keep your data safe with highly secure cloud services.
Definition: Cloud security, also known as cloud computing security, is a collection of security
measures designed to protect cloud-based infrastructure, applications, and data.
• These measures ensure user and device authentication, data and resource access
control, and data privacy protection.
Malware: About 90% of organizations moving to the cloud are more likely to
experience data breaches.
Cloud computing partners have tried to build in all the major security
protocols to keep your data safe. But cybercriminals have upped their game
too! They have familiarized themselves with these modern technologies.
Every individual has the right to control his or her own data, whether private,
public or professional. Without knowledge of the physical location of the server or of
how the processing of personal data is configured, end-users consume cloud services
without any information about the processes involved.
Data privacy is a discipline intended to keep data safe against improper access,
theft or loss. It's vital to keep data confidential and secure by exercising sound data
management and preventing unauthorized access that might result in data loss,
alteration or theft.
Types of privacy
• Information privacy.
• Communication privacy.
• Individual privacy.
Privacy and Privacy Impact Assessment
Trust in cloud computing
In cloud computing, trust helps the consumer to choose the service of a cloud provider
for storing and processing their sensitive information.
Trust in cloud computing is a critical issue and one of the most challenging
issues in the cloud.
A user trusts a cloud service with respect to performance, security, and privacy,
based on the identity of the provider.
A trust model measures the security strength and computes a trust value. CSA (Cloud
Security Alliance) service challenges are used to assess the security of a service and
the validity of the model.
Trust
Operating System Security
Protection and security in an operating system involve managing and protecting the
operating system's resources from unauthorized access.
Any vulnerability in the operating system could compromise the security of the
application.
By securing the operating system, you make the environment stable, control
access to resources, and control external access to the environment.
Virtual Machine Security in Cloud Computing
A VM is a virtualized instance of a computer that can perform almost all of the
same functions as a computer, including running applications and operating systems.
• Virtual machines run on a physical machine and access computing resources from
software called a hypervisor.
• This differs from traditional, hardware-based network security, which is static and
runs on devices such as traditional firewalls, routers, and switches.
2. Components used to boot the system. These components self-destruct before any
user VM is started. Two components discover the hardware configuration of the server,
including the PCI drivers, and then boot the system: