
Big Data Analytics

UNIT - 1

Big Data
Collection of datasets that cannot be handled using traditional data processing tools.

Big Data Analytics
Analyzing Big Data to extract informative patterns from it.
Sample Use Case of Big Data - Banking System

Stakeholders: generate the data and access the analyzed data.
Tools & Technologies: storing, processing, retrieving and analyzing the data.
Types of Data: structured, semi-structured, unstructured.
Information perceived: reports generated as graphs and tables.
(Diagram) Data sources by type feeding the Big Data pipeline:
  Structured data - MySQL, Oracle, CRM, flat files (partner data, core data)
  Semi-structured data - JSON, XML
  Unstructured data - surveillance/security footage, Facebook pages, EMS, customer care logs, IoT/sensor data, customer sentiment, media
These sources pass through ETL into distributed storage (Hadoop HDFS with MapReduce, NoSQL stores such as MongoDB) and are processed with newer, simpler programming frameworks (Spark with Scala) and tools such as Pig and HIVE for faster, near real-time data analysis.
Forms of Data
(Structured, Unstructured, Semi-Structured)

Structured Data
• Structured data is data that adheres to a pre-defined data model and is
  therefore straightforward to analyze.
• Structured data conforms to a tabular format with relationships between
  the different rows and columns.
• Structured data is considered the most ‘traditional’ form of data
  storage, since the earliest versions of database management systems
  (DBMS) were able to store, process and access structured data.
• Common examples of structured data are Excel files or SQL databases.
  Each of these has structured rows and columns that can be sorted.
Unstructured Data
• Unstructured data is information that either does not have a predefined
  data model or is not organized in a pre-defined manner.
• The ability to store and process unstructured data has grown greatly in
  recent years, with many new technologies and tools coming to the
  market that are able to store specialized types of unstructured data.
• Common examples of unstructured data include audio files, video files or
  NoSQL databases.
• MongoDB, for example, is optimized to store documents. Apache Giraph,
  as an opposite example, is optimized for storing relationships between
  nodes.
• The ability to analyse unstructured data is especially relevant in the
  context of Big Data, since a large part of the data in organisations is
  unstructured. Think about pictures, videos or PDF documents. The ability
  to extract value from unstructured data is one of the main drivers behind the
  quick growth of Big Data.
• Unstructured data is everywhere. In fact, most individuals and organizations
  conduct their lives around unstructured data. Just as with structured data,
  unstructured data is either machine generated or human generated.
Here are some examples of machine-generated unstructured data:
• Satellite images: This includes weather data or the data that the government
captures in its satellite surveillance imagery. Just think about Google Earth, and you
get the picture.
• Scientific data: This includes seismic imagery, atmospheric data, and high energy
physics.
• Photographs and video: This includes security, surveillance, and traffic video.
• Radar or sonar data: This includes vehicular, meteorological, and oceanographic
seismic profiles.
The following list shows a few examples of human-generated unstructured data:
• Text internal to your company: Think of all the text within documents, logs, survey
results, and e-mails. Enterprise information actually represents a large percent of
the text information in the world today.
• Social media data: This data is generated from the social media platforms such as
YouTube, Facebook, Twitter, LinkedIn, and Flickr.
• Mobile data: This includes data such as text messages and location information.
• Website content: This comes from any site delivering unstructured content, like
  YouTube, Flickr, or Instagram.
Semi-structured Data
• Semi-structured data is a form of structured data that does not conform to the formal
  structure of data models associated with relational databases or other forms of data tables,
  but nonetheless contains tags or other markers to separate semantic elements and enforce
  hierarchies of records and fields within the data. Therefore, it is also known as a self-
  describing structure.
• Common examples of semi-structured data are JSON and XML (a small parsing sketch follows this list).
• The reason that this third category exists (between structured and unstructured data) is
  that semi-structured data is considerably easier to analyse than unstructured data.
  Many Big Data solutions and tools have the ability to ‘read’ and process either JSON or XML.
  This reduces the complexity of analysing semi-structured data, compared to unstructured data.
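To make the "self-describing" idea concrete, here is a minimal sketch that parses a small, invented XML customer record with the JDK's built-in DOM parser; the tags themselves carry the structure, so no pre-defined relational schema is needed. The record contents and field names are hypothetical, for illustration only.

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SemiStructuredExample {
    public static void main(String[] args) throws Exception {
        // A hypothetical customer record: the tags are the "self-describing"
        // markers that separate semantic elements and enforce a hierarchy.
        String xml =
            "<customer>"
            + "<name>Asha Rao</name>"
            + "<account type=\"savings\">1234567890</account>"
            + "</customer>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Because the structure travels with the data, a generic parser can
        // pull out fields without a pre-defined relational schema.
        String name = doc.getElementsByTagName("name").item(0).getTextContent();
        String accountType = doc.getElementsByTagName("account").item(0)
                .getAttributes().getNamedItem("type").getTextContent();

        System.out.println(name + " has a " + accountType + " account");
    }
}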
Metadata – Data about Data
• A final category of data type is metadata. From a technical point of view,
  this is not a separate data structure, but it is one of the most important
  elements for Big Data analysis and Big Data solutions. Metadata is data
  about data. It provides additional information about a specific set of data.
• In a set of photographs, for example, metadata could describe when and
  where the photos were taken. The metadata then provides fields for
  dates and locations which, by themselves, can be considered structured
  data. For this reason, metadata is frequently used by Big Data
  solutions for initial analysis.
The 5 V's of Big Data

Volume

8 bits        = 1 Byte
10^3  bytes   = 1 Kilobyte (KB)
10^6  bytes   = 1 Megabyte (MB)
10^9  bytes   = 1 Gigabyte (GB)
10^12 bytes   = 1 Terabyte (TB)
10^15 bytes   = 1 Petabyte (PB)
10^18 bytes   = 1 Exabyte (EB)
10^21 bytes   = 1 Zettabyte (ZB)
10^24 bytes   = 1 Yottabyte (YB)
10^27 bytes   = 1 Brontobyte (BB)

Data from the zettabyte scale onwards is usually referred to as Big Data.
Velocity
The rate at which a system generates large amounts of data.

  Walmart handles 1 million customer transactions per hour.
  Facebook handles 40 billion photos from its user base, inserts 500 TB of data each day,
  and stores, processes and analyzes 30+ PB of data.
  A flight generates 240 TB of data for every 6-8 hours of flight.
Variety

Variety of Sources
  Ubiquitous Computing - the ability to compute/analyze data at any time, from anywhere, using any device
  People - using mobile devices
  Machines - sensors / IoT devices
  Organizations - generate data by capturing customer transactions

Variety of Data
  Structured
  Semi-Structured
  Unstructured
Veracity
Correctness of the data being generated
  Authenticity, Reputation, Authorization

Value
Whether the data being analyzed results in some meaningful information
  Statistical, Hypothetical, Correlational
Use cases
  Log Analytics
  Fraud Detection Pattern
  Social Media / Customer Sentiment Analysis Pattern
Log Analytics
Log analytics is the assessment of recorded information about events collected
from one or more computers, networks, applications and operating systems.

  An event is an identifiable or significant occurrence on hardware or software.

  An event can be generated by a user or by a computer.

  Log analytics software collects and checks logs such as error logs.

  These logs help organizations diagnose an issue, such as the location and time of
  the event occurrence, etc.

  A state-of-the-art log analytics system is Amazon OpenSearch Service (AOSS).
Fraud Detection Pattern
Traditional approach
  Rule-based systems flag potentially fraudulent transactions.
  The credit card transaction is confirmed with the customer.

BDA approach
  Incorporates domain experts' knowledge in the form of rules.
  Uses past fraudulent transactions to apply diagnostic analysis.
  Performed on a no-sampling basis (the full data is analyzed).
Customer Sentiment Analysis
Companies monitor what people are saying about them on social media and
respond appropriately; if they do not, they quickly lose customers.

  Performed by collecting large sets of social media data from platforms like
  Twitter, Facebook and YouTube.

  Customer sentiment can take any of the following forms:
    Positive
    Negative
    Neutral
Introduction to Hadoop
Hadoop is a programming framework that provides distributed storage and
parallel processing of large data using commodity hardware.

History of Hadoop
  In 2003, Google published the concept of the Google File System (GFS), which was
  distributed in nature.

  In 2005, the Apache foundation implemented the ideas of GFS as the Hadoop Distributed
  File System (HDFS) together with MapReduce processing, and released the first version
  of Hadoop, Hadoop 0.1.0, in the year 2006.

  The original Hadoop developers are Doug Cutting and Mike Cafarella. Doug Cutting
  named Hadoop after his son's toy elephant.
Hadoop Architecture
(diagram slide)

Name node in HDFS
(diagram slide)
HDFS - Hadoop Distributed File System
HDFS stores files/data in clusters of nodes. Nodes are basically computers
connected in a LAN, with a server maintaining the metadata about all these
nodes.

Advantages:
  Inexpensive
  Immutable
  Reliable data storage
  Block structured and scalable

Disadvantages:
  Not suitable for smaller datasets
HDFS Architecture
(diagram slide)

Metadata in Disk & Metadata in RAM
(diagram slide)
Rack aware Architecture

Two design possibilities

   All master processes/nodes reside in one rack, while the slaves are placed in the
   other racks.

   Every rack has one master node.
Rules for Data node replication
1. Never place a replica on the same data node where the original block resides.

2. In a rack-aware architecture, do not place a replica on the same rack where
   the original block resides.

3. Never place more than two replicas of a block in the same rack.

(A small sketch after this list illustrates how these checks could be applied.)
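The following is a minimal, illustrative sketch (not Hadoop's actual BlockPlacementPolicy) showing how a proposed set of replica locations could be validated against the three rules above; the DataNode class and the node/rack names are hypothetical.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReplicaPlacementRules {

    // Hypothetical model of a data node: an id plus the rack it sits in.
    static class DataNode {
        final String id;
        final String rack;
        DataNode(String id, String rack) { this.id = id; this.rack = rack; }
    }

    /** Returns true if the proposed replica locations respect the three rules above. */
    static boolean isValidPlacement(DataNode original, List<DataNode> replicas) {
        Map<String, Integer> blocksPerRack = new HashMap<>();
        blocksPerRack.put(original.rack, 1);            // the original block counts too

        for (DataNode r : replicas) {
            // Rule 1: never place a replica on the node holding the original block.
            if (r.id.equals(original.id)) return false;
            // Rule 2: in a rack-aware cluster, do not place a replica
            //         on the same rack as the original block.
            if (r.rack.equals(original.rack)) return false;
            blocksPerRack.merge(r.rack, 1, Integer::sum);
        }
        // Rule 3: never place more than two replicas of a block in the same rack.
        for (int count : blocksPerRack.values()) {
            if (count > 2) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        DataNode original = new DataNode("dn1", "rack1");
        List<DataNode> ok  = List.of(new DataNode("dn4", "rack2"), new DataNode("dn5", "rack3"));
        List<DataNode> bad = List.of(new DataNode("dn2", "rack1")); // same rack as the original
        System.out.println(isValidPlacement(original, ok));   // true
        System.out.println(isValidPlacement(original, bad));  // false
    }
}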
HDFS Federation
(diagram slide)
HDFS Federation
Federation means organizing several units of the same functionality under
one administration / set of rules.

  The traditional HDFS architecture has been horizontally scaled to accommodate a larger
  number of Name nodes and Data node clusters; this forms the HDFS federation.

  The Namespace portion consists of Name nodes, and the Block storage portion consists of
  Data nodes.

  Within each Name node there is a Namespace, which is a hierarchical structure of
  directories and files, and a block pool comprising the set of blocks corresponding to
  the Namespace files.

  The blocks of each block pool can be stored in any of the data nodes. When a Name node
  is deleted, its Namespace and block pool are also removed, by removing those blocks
  from the Data nodes.
The HDFS High Availability Architecture
The High Availability feature of HDFS ensures data availability to its clients
in spite of Name node or Data node failures.

Data node High Availability
  The data replication concept of HDFS makes the blocks available even when one of the
  data nodes holding a copy of a block is corrupted. When the client requests the Name node
  for data access, the Name node searches for all the nodes in which that data is
  available. After that, it provides the user access to the data from the node in which the
  data is most quickly available.

Name node High Availability
  To provide High Availability in case of Name node failure, the HDFS High
  Availability Architecture has been available since Hadoop 2.x. In this
  architecture there is an alternative Name node called the passive Name node.
Components of HA Architecture

Data node
  Stores the data in the form of blocks.
  Sends its heartbeat (status) to the active Name node frequently.

Name node (active)
  Maintains the metadata of the cluster.
  Updates/writes the metadata into all the journal nodes.

Failover controller
  Monitors the health of the Name node's OS and hardware.
  Sends the Name node's status to the Zookeeper(s).
  Controls all the Name nodes by using STONITH (Shoot The Other Node In The Head).

Zookeeper
  Holds the status of the active and passive Name nodes, to enable the alternate Name node
  when the active Name node fails.
  Minimum number of Zookeepers is 3.

Journal node
  Holds the metadata of the file system, which can be shared between the active and
  passive Name nodes.
  Minimum number of Journal nodes is 3.

Passive Name node
  Reads and copies the metadata of the file system from the journal nodes.
  Monitors the active Name node's status from the Zookeeper, to become the active Name
  node in case of the present active Name node's failure.
Hadoop File Systems
Hadoop has an abstract notion of a file system, of which HDFS
is just one instance or implementation.
The Java abstract class
org.apache.hadoop.fs.FileSystem is the base file system class from
which the various implementations are derived.
S.No  File System       URI Scheme  Java Implementation             Description
1     Local             file        fs.LocalFileSystem              A file system for a locally connected disk
2     HDFS              hdfs        hdfs.DistributedFileSystem      Works with MapReduce efficiently
3     HFTP              hftp        hdfs.HftpFileSystem             Provides read-only access to HDFS over HTTP
4     HSFTP             hsftp       hdfs.HsftpFileSystem            Provides read-only access to HDFS over HTTP in a secured fashion
5     WebHDFS           webhdfs     hdfs.web.WebHdfsFileSystem      Provides secure read-write access to HDFS over HTTP
6     HAR               har         fs.HarFileSystem                A file system layered on another file system for archiving files
7     KFS (CloudStore)  kfs         fs.kfs.KosmosFileSystem         A file system that supports data-intensive apps, similar to GFS
8     FTP               ftp         fs.ftp.FTPFileSystem            A file system backed by an FTP server
9     S3                s3a         fs.s3a.S3AFileSystem            A file system backed by Amazon S3; replaces the older s3n (S3 native) implementation
10    Azure             wasb        fs.azure.NativeAzureFileSystem  A file system backed by Microsoft Azure
Accessing the Hadoop File System using the Java API
Hadoop's FileSystem class is the general file system API for interacting with any of
the supported file systems.

Steps to access a Hadoop file system

Step 1: Create an instance of the file system we want to access, by using one of the following
factory methods (a short usage sketch follows this list):

  public static FileSystem get(Configuration conf) throws IOException
    Returns the default file system specified in core-site.xml

  public static FileSystem get(URI uri, Configuration conf) throws IOException
    Returns the file system specified by the URI

  public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
    Returns the file system specified by the URI, for the specified user
    Can be used for secured access

  public static LocalFileSystem getLocal(Configuration conf) throws IOException
    Returns the local file system instance
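A minimal sketch of these factory methods, assuming the Hadoop client libraries and a core-site.xml are on the classpath; the namenode address is hypothetical.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;

public class GetFileSystems {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml (and friends) from the classpath.
        Configuration conf = new Configuration();

        // The default file system, as configured by fs.defaultFS in core-site.xml.
        FileSystem defaultFs = FileSystem.get(conf);

        // A file system chosen explicitly by URI (the namenode address here is hypothetical).
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

        // The client's local file system.
        LocalFileSystem local = FileSystem.getLocal(conf);

        System.out.println(defaultFs.getUri());
        System.out.println(hdfs.getUri());
        System.out.println(local.getUri());
    }
}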
Accessing the Hadoop File System using the Java API
Step 2: Open an input stream for the file to be accessed, by using one of the following
methods (a complete read sketch follows this list):

  public FSDataInputStream open(Path f) throws IOException
    Opens the file specified in the path with a default buffer size of 4 KB

  public FSDataInputStream open(Path f, int bufsize) throws IOException
    Opens the file specified in the path with the given buffer size
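Putting the two steps together, here is a minimal sketch that reads a file from HDFS and copies it to standard output; the namenode address and file path are hypothetical.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical namenode address and file path, for illustration only.
        String uri = "hdfs://namenode:8020/user/demo/sample.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);   // Step 1: get the file system

        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                         // Step 2: returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false);      // stream the file contents to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}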
Anatomy of a File Read
1. The client opens the required file to be read
   by calling the open() method on the Distributed File
   System object.

2. The Distributed File System calls the name
   node using RPC to determine the locations of the first few blocks.
   Upon determining the data nodes with the closest
   proximity, the Distributed File System returns an
   FSDataInputStream to the client.

3. The client then calls the read() method on the
   FSDataInputStream.

4. Data is streamed from the data node back to
   the client, which calls read() repeatedly.

5. When the end of the block is reached,
   DFSInputStream closes the connection to the
   data node, then finds the best data node for the next
   block.
Anatomy of a File Read (continued)
  If the DFSInputStream encounters an error while communicating with a data
  node, it will try the next closest one for that block. It will also
  remember data nodes that have failed, so that it doesn't needlessly retry
  them for later blocks.

  If a corrupted block is found, it is reported to the name node before
  the DFSInputStream attempts to read a replica of the block from another
  data node.

Advantage:
  HDFS can scale to a large number of concurrent clients.
Network Topology and Hadoop
In the context of high-volume data processing, the limiting
factor is the rate at which we can transfer data between
nodes; bandwidth is a scarce commodity.

In a Hadoop cluster the network is represented as a tree, and
the distance between two nodes is the sum of
their distances to their closest common ancestor.

Levels in the tree correspond to the data center, the rack,
and the node that a process is running on.

The bandwidth available for each of the following
scenarios becomes progressively less (a worked example of the distance measure follows this list):
• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks in the same data center
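As a worked illustration of this distance measure, using the node addressing commonly used for Hadoop clusters, where a node is written as /data-center/rack/node: the distance from a node to itself is 0; between two different nodes on the same rack it is 2; between nodes on different racks in the same data center it is 4; and between nodes in different data centers it is 6. For example, distance(/d1/r1/n1, /d1/r2/n3) = 4, since each node is two levels below their closest common ancestor, the data center d1.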
Anatomy of a File Write
1. The client creates a new file by calling the create()
   method on the Distributed File System object.

2. The Distributed File System calls the name node using
   RPC to create a new file in its namespace, with no
   blocks associated with it.

   The name node makes a record of the new file, provided the
   file doesn't already exist and the client has the
   right permissions to create the file; otherwise, file
   creation fails and the client is thrown
   an IOException. The Distributed File System then returns an
   FSDataOutputStream to the client.

3. The client calls the write() method,
   and DFSOutputStream splits the data into packets, which it
   writes to an internal queue called the data queue.
   The data queue is consumed by the DataStreamer,
   which asks the name node to allocate new blocks by
   picking a list of suitable data nodes to store the
   replicas.
Anatomy of a File Write (continued)
4. The list of data nodes forms a pipeline.
   The DataStreamer streams the packets to the first data
   node in the pipeline, which stores each packet and forwards it
   to the second data node in the pipeline.

5. DFSOutputStream also maintains an internal queue of
   packets that are waiting to be acknowledged by data nodes,
   called the ack queue. A packet is removed from the
   ack queue only when it has been acknowledged by all the
   data nodes in the pipeline.

6. When the client has finished writing, it calls
   close() on the FSDataOutputStream.

7. This action flushes all the remaining packets to the data
   node pipeline and waits for acknowledgments before
   contacting the name node to signal that the file is complete.
   (A short write sketch using this API follows.)
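A minimal client-side sketch of this write path using the FileSystem API; the namenode address, target path and file contents are hypothetical.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileWrite {
    public static void main(String[] args) throws Exception {
        // Hypothetical namenode address and target path, for illustration only.
        String uri = "hdfs://namenode:8020/user/demo/output.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        // create() asks the name node over RPC to record the new file and
        // hands back an FSDataOutputStream for the client to write to (steps 1-2).
        FSDataOutputStream out = fs.create(new Path(uri));
        try {
            // write() feeds DFSOutputStream, which packetizes the data and
            // streams it through the data node pipeline (steps 3-5).
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        } finally {
            // close() flushes the remaining packets and signals the name node
            // that the file is complete (steps 6-7).
            out.close();
        }
    }
}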
Replica Placement
Hadoop's default strategy is to place

   The first replica on the same node as the client (for
   clients running outside the cluster, a node is
   chosen at random, although the system tries not
   to pick nodes that are too full or too busy).

   The second replica on a different rack
   from the first (off-rack), chosen at random.

   The third replica on the same rack as
   the second, but on a different node chosen at
   random.

   Further replicas are placed on random nodes in
   the cluster, although the system tries to avoid
   placing too many replicas on the same rack.

Once the replica locations have been chosen, a pipeline is built, taking network topology into account.
