UNIT - 1
Big Data Analytics
Big Data
A collection of datasets that cannot be handled using traditional data processing tools.
Sample Use Case of Big Data - Banking System
(Diagram: stakeholders generate data and access the analyzed data; tools and technologies handle storing, processing, retrieving and analyzing it.)
• ETL
• Distributed storage: Hadoop, HDFS
• NoSQL: MongoDB
• MapReduce framework
• New, simpler programming: Scala, Spark
• Faster real-time data analysis: Pig, Hive
Forms of Data
(Structured, Unstructured, Semi-Structured)
Structured Data
• Structured data is data that adheres to a pre-defined data model and is
therefore straightforward to analyze.
• Structured data conforms to a tabular format with relationship between
the different rows and columns.
• Structured data is considered the most ‘traditional’ form of data
storage, since the earliest versions of database management systems
(DBMS) were able to store, process and access structured data.
• Common examples of structured data are Excel files or SQL databases.
Each of these has structured rows and columns that can be sorted.
Unstructured Data
• Unstructured data is information that either does not have a predefined
data model or is not organized in a pre-defined manner.
• The ability to store and process unstructured data has greatly grown in
recent years, with many new technologies and tools coming to the
market that are able to store specialized types of unstructured data.
• Common examples of unstructured data include audio files, video files, or
NoSQL databases.
• MongoDB, for example, is optimized to store documents. Apache Giraph,
as an opposite example, is optimized for storing relationships between
nodes.
• The ability to analyze unstructured data is especially relevant in the
context of Big Data, since a large part of the data in organizations is
unstructured. Think about pictures, videos or PDF documents. The ability
to extract value from unstructured data is one of the main drivers behind the
quick growth of Big Data.
• Unstructured data is everywhere. In fact, most individuals and organizations
conduct their lives around unstructured data. Just as with structured data,
unstructured data is either machine generated or human generated.
Here are some examples of machine-generated unstructured data:
• Satellite images: This includes weather data or the data that the government
captures in its satellite surveillance imagery. Just think about Google Earth, and you
get the picture.
• Scientific data: This includes seismic imagery, atmospheric data, and high energy
physics.
• Photographs and video: This includes security, surveillance, and traffic video.
• Radar or sonar data: This includes vehicular, meteorological, and oceanographic
seismic profiles.
The following list shows a few examples of human-generated unstructured data:
• Text internal to your company: Think of all the text within documents, logs, survey
results, and e-mails. Enterprise information actually represents a large percentage of
the text information in the world today.
• Social media data: This data is generated from the social media platforms such as
YouTube, Facebook, Twitter, LinkedIn, and Flickr.
• Mobile data: This includes data such as text messages and location information.
• Website content: This comes from any site delivering unstructured content, like
YouTube, Flickr, or Instagram.
Semi-structured Data
• Semi-structured data is a form of structured data that does not conform to the formal
structure of data models associated with relational databases or other forms of data tables,
but nonetheless contains tags or other markers to separate semantic elements and enforce
hierarchies of records and fields within the data. It is therefore also known as a self-
describing structure.
• Common examples of semi-structured data are JSON and XML.
• The reason that this third category exists (between structured and unstructured data) is
that semi-structured data is considerably easier to analyze than unstructured data.
Many Big Data solutions and tools have the ability to ‘read’ and process either JSON or XML,
which reduces the complexity of analysis compared to unstructured data.
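As a minimal illustration of this self-describing structure, consider a small hypothetical JSON record (the field names are invented for the example); the tags name the semantic elements and the nesting enforces the hierarchy of records and fields:

{
  "customer": {
    "id": 1024,
    "name": "A. Kumar",
    "transactions": [
      { "date": "2017-03-01", "amount": 250.00 },
      { "date": "2017-03-05", "amount": 75.50 }
    ]
  }
}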
Metadata – Data about Data
• A final category of data type is metadata. From a technical point of view,
this is not a separate data structure, but it is one of the most important
elements for Big Data analysis and Big Data solutions. Metadata is data
about data: it provides additional information about a specific set of data.
• In a set of photographs, for example, metadata could describe when and
where the photos were taken. The metadata then provides fields for
dates and locations which, by themselves, can be considered structured
data. For this reason, metadata is frequently used by Big Data
solutions for initial analysis.
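A tiny, made-up example of photo metadata, in the same JSON style as above; note that the date and location fields are themselves structured:

{
  "file": "IMG_0421.jpg",
  "date_taken": "2017-06-14T09:32:05",
  "location": { "lat": 17.3850, "lon": 78.4867 }
}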
The 5 V's of Big Data
Volume
8 bits = 1 Byte
10^15 bytes = 1 Petabyte (PB)
10^18 bytes = 1 Exabyte (EB)
10^21 bytes = 1 Zettabyte (ZB)
10^24 bytes = 1 Yottabyte (YB)
10^27 bytes = 1 Brontobyte (BB)
Velocity
The rate at which a system generates large amounts of data.
Examples:
• One organization stores, processes and analyzes 30+ PB of data.
• A flight generates 240 TB of data for every 6-8 hours of flight.
Variety
Variety of sources (ubiquitous computing: the ability to compute/analyze data at any time, from anywhere, using any device):
• People - using mobile devices
• Machines - sensors / IoT devices
• Organizations - generating data by capturing customer transactions

Variety of data:
• Structured
• Semi-structured
• Unstructured
Veracity
The correctness of the data being generated.
Value
Whether the data being analyzed results in some meaningful information
Use cases
• Log Analytics
• Fraud Detection Pattern
• Customer Sentiment Analysis
Log Analytics
Log analytics is the assessment of recorded information about events collected
from one or more computers, networks, and application/OS sources.
Log analytics software collects and checks logs such as error logs.
These logs help organizations diagnose an issue, such as the location and time of
the event occurrence, etc.
Fraud Detection Pattern
Traditional approach: rule-based systems flag potentially fraudulent transactions.
BDA approach:
Customer Sentiment Analysis
Companies monitor what people are saying about them in social media and
respond appropriately — and if they do not, they quickly lose customers.
Sentiment is typically classified as:
• Positive
• Negative
• Neutral
Introduction to Hadoop
Hadoop is a programming framework that provides distributed storage and
parallel processing of large data sets using commodity hardware.
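To make the parallel-processing part concrete, here is a minimal sketch of the classic WordCount job written against the Hadoop MapReduce Java API; the class name and the input/output paths passed on the command line are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}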
History of Hadoop
In 2003, Google published the concept of the Google File System (GFS), which was
distributed in nature.
Hadoop was developed by Doug Cutting and Mike Cafarella. Doug Cutting
named Hadoop after his son's toy elephant.
Hadoop Architecture
Name node in HDFS
HDFS - Hadoop Distributed File System
HDFS stores files/data in clusters of nodes. Nodes are basically computers
connected in a LAN, with a server maintaining the metadata about all these
nodes.
Advantages:
• Inexpensive
• Immutable
Disadvantages:
• Not suitable for smaller datasets
HDFS Architecture
Metadata in Disk & Metadata in RAM
Rack aware Architecture
Rules for Data node replication
1. Never place a replica on the same data node where the original block resides.
HDFS Federation
Federation means organizing several units of the same functionality under
one administration / set of rules.
The traditional HDFS architecture has been horizontally scaled to accommodate more
Name node and Data node clusters; this forms the HDFS federation.
The Namespace layer consists of Name nodes and the Block Storage layer consists of
Data nodes.
Within each Name node we'll have a namespace, which is a hierarchical structure of
directories and files, and a block pool comprising the set of blocks corresponding to
the namespace files.
The blocks of each block pool can be stored in any of the data nodes. When a Name
node is deleted, its namespace and block pool are also removed, by removing those
blocks from the Data nodes.
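As a rough illustrative sketch (not a complete configuration), federation is declared in hdfs-site.xml by listing multiple name services, each with its own Name node; the service names ns1/ns2 and the host names below are made-up placeholders:

<configuration>
  <!-- Two federated Name nodes, each owning its own namespace and block pool -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>namenode2.example.com:8020</value>
  </property>
</configuration>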
The HDFS High Availability Architecture
The High Availability feature of HDFS ensures data availability to its clients
in spite of Name node or Data node failure.
To provide high availability in case of Name node failure, the HDFS High
Availability Architecture has been provided since Hadoop 2.x. In this
architecture we'll have an alternative Name node, called the passive Name node.
Components of HA Architecture

Zookeeper
• Holds the status of the active and passive Name nodes, to enable the alternate Name node during the active Name node's failure.
• Minimum number of Zookeepers is 3.

Name node
• Maintains the metadata of the cluster.
• Updates/writes the metadata into all the journal nodes.

Data node
• Stores the data in the form of blocks.
• Sends its heartbeat (status) to the active Name node frequently.

Journal node
• Holds the metadata of the file system, which is shared between the active Name node (which writes it) and the passive Name node (which reads it).
• Minimum number of Journal nodes is 3.

Failover controller
• Monitors the health of the Name node's OS and hardware.
• Sends the Name node's status to the Zookeeper(s).
• Controls the Name nodes by using STONITH (Shoot The Other Node In The Head).

Passive Name node
• Reads and copies the metadata of the file system from the journal nodes.
• Monitors the active Name node's status from the Zookeeper, to become the active Name node in case of the present active Name node's failure.
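For concreteness, a minimal illustrative hdfs-site.xml sketch of an HA pair backed by journal nodes might look like the following; the name service mycluster and the host names are placeholders, and a real deployment needs further settings (fencing, client failover proxy, etc.):

<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- The two Name nodes: one active, one passive (standby) -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <!-- Shared edit log stored on a quorum of three journal nodes -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
  </property>
  <!-- Let the Zookeeper-based failover controllers handle failover automatically -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>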
Hadoop File Systems
Hadoop has an abstract notion of a file system, of which HDFS
is just one instance or implementation.
The Java abstract class
org.apache.hadoop.fs.FileSystem is the base file system class from
which various implementations can be made.
S.No | File System | URI Scheme | Java Implementation | Description
Step 1: Create an instance of the file system we want to access, by using one of the following factory methods:

public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException

These return the file system specified by the URI (the second one for the specified user).

Step 2: Open the file specified by the path, with a default buffer size of 4 KB:

public FSDataInputStream open(Path f) throws IOException
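Putting the two steps together, a minimal sketch of reading a file from HDFS through this API might look as follows; the URI, host name and path are placeholders:

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileRead {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode.example.com:8020/user/demo/sample.txt"; // placeholder
    Configuration conf = new Configuration();

    // Step 1: obtain the FileSystem instance for the given URI
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    // Step 2: open the file (default 4 KB buffer) and copy it to stdout
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}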
Anatomy of a File Read
1. The client opens the required file to be read by calling the open() method
on the DistributedFileSystem object.

Advantage:
HDFS can scale to a large number of concurrent clients.
Network Topology and Hadoop
In the context of high-volume data processing, the limiting
factor is the rate at which we can transfer data between
nodes: bandwidth is a scarce commodity.
In a Hadoop cluster the network is represented as a tree, and
the distance between two nodes is the sum of
their distances to their closest common ancestor.
Levels in the tree correspond to the data center, the rack,
and the node that a process is running on.
The bandwidth available for each of the following
scenarios becomes progressively less:
• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks in the same data center
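For example, using the conventional notation /datacenter/rack/node, with made-up names, the distance grows by two at each level of separation:

distance(/d1/r1/n1, /d1/r1/n1) = 0   (processes on the same node)
distance(/d1/r1/n1, /d1/r1/n2) = 2   (different nodes on the same rack)
distance(/d1/r1/n1, /d1/r2/n3) = 4   (nodes on different racks in the same data center)
distance(/d1/r1/n1, /d2/r3/n4) = 6   (nodes in different data centers)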
Anatomy of File Write
1. The client creates a new file by calling the create() method
on the DistributedFileSystem object.
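Correspondingly, a minimal sketch of writing a file through the FileSystem API; the URI, host name and path are again placeholders:

import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileWrite {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode.example.com:8020/user/demo/out.txt"; // placeholder
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    // create() returns an output stream to a brand-new file;
    // HDFS splits the data into blocks and replicates them behind the scenes
    OutputStream out = fs.create(new Path(uri));
    try {
      out.write("hello, hdfs".getBytes("UTF-8"));
    } finally {
      out.close();
    }
  }
}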
Replica Placement
Hadoop's default strategy is to place the first replica on the same node as the
client (or, for a client running outside the cluster, on a randomly chosen node),
the second replica on a different rack from the first, and the third replica on
the same rack as the second but on a different node.