
Big Data Analytics

UNIT - 1

Big Data
A collection of datasets that cannot be handled using traditional data processing tools.

Big Data Analytics

Analyzing Big Data to extract informative patterns from it.


Sample Use Case of Big Data - Banking System

Stakeholders: generate data; access the analyzed data.
Tools & Technologies: storing, processing, retrieving and analyzing the data.
Types of data: Structured, Semi-Structured, Unstructured.
Information perceived: reports generated, graphs, tables.

[Figure: data sources and processing pipeline]
Structured data: CRM, MySQL, Oracle, flat files (customer, partner and core data)
Semi-structured data: JSON, XML, IoT/sensory data
Unstructured data: surveillance/security footage, Facebook pages, EMS customer-care logs, customer sentiment, media
Pipeline: ETL -> distributed storage (Hadoop HDFS, NoSQL stores such as MongoDB) -> processed storage (Pig, Hive) -> data analysis (MapReduce framework; Spark/Scala as a newer, simpler programming framework for faster, real-time analysis)
• K H Vijaya Kumari, Asst.Professor, Dept of IT, CBIT,Hyderabad
The 5 V's of Big Data
Volume

8 bits = 1 Byte
10^3 bytes = 1 Kilobyte (KB)
10^6 bytes = 1 Megabyte (MB)
10^9 bytes = 1 Gigabyte (GB)
10^12 bytes = 1 Terabyte (TB)
10^15 bytes = 1 Petabyte (PB)
10^18 bytes = 1 Exabyte (EB)
10^21 bytes = 1 Zettabyte (ZB)
10^24 bytes = 1 Yottabyte (YB)
10^27 bytes = 1 Brontobyte (BB)

Data on the zettabyte scale is what is usually referred to as Big Data.

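The unit ladder above can be captured in a few lines. The following is an illustrative sketch (not part of any Hadoop API) that walks the decimal prefixes:

```java
import java.math.BigInteger;

// Illustrative sketch: walk the decimal unit ladder above, dividing by 10^3
// until the value drops below the next threshold.
public class ByteUnits {
    static final String[] UNITS =
        {"Byte", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB", "BB"};

    // Returns the name of the largest unit whose threshold does not exceed `bytes`.
    static String unitOf(BigInteger bytes) {
        BigInteger thousand = BigInteger.valueOf(1000);
        int i = 0;
        while (i < UNITS.length - 1 && bytes.compareTo(thousand) >= 0) {
            bytes = bytes.divide(thousand);
            i++;
        }
        return UNITS[i];
    }
}
```

For example, 10^21 bytes falls in the zettabyte range, the scale the slide associates with Big Data. BigInteger is used because 10^21 overflows a 64-bit long.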

Velocity
The rate at which a system generates large amounts of data.

Walmart handles 1 million customer transactions per hour.

Facebook handles 40 billion photos from its user base, inserts 500 TB of data each day, and stores, processes and analyzes 30+ PB of data.

A flight generates 240 TB of data for every 6-8 hours of flight.
Variety
Variety of sources (ubiquitous computing: the ability to compute/analyze data at any time, from anywhere, using any device):

People - using mobile devices

Machines - sensors / IoT devices

Organizations - generating data by capturing customer transactions

Variety of formats:
Structured
Semi-Structured
Unstructured
Value
Whether the data being analyzed yields meaningful information (statistical, hypothetical, correlational).


Veracity

Authenticity, Reputation, Authorization


Big Data Life Cycle
Business case Visualization
Evaluation

Identification of
Data Analysis
Data

Data Filtering Data Aggregation

Data Extraction
Types of BDA
1. Descriptive Analytics - What happened?

Ex: The chemical company Dow identified underutilized areas and thus saved 4 million dollars annually.

2. Diagnostic Analytics - How did it happen?

Ex: Airlines - analyzing customer dropouts.

3. Predictive Analytics - What is likely to happen?

Ex: PayPal - detecting a fraudulent transaction before it happens.

4. Prescriptive Analytics - What can be done?

Ex: Airlines - automatically adjusting flight fares depending on the season.
IT Use cases
Log Analytics

Fraud Detection Pattern

Social Media/ Customer Sentiment Analysis Pattern​


Log Analytics
Log analytics is the assessment of recorded information about events, collected from one or more computers, networks, applications and operating systems.

An event is an identifiable or significant occurrence on a piece of hardware or software.

An event can be generated by a user or by a computer.

Log analytics software collects and checks logs such as error logs.

These logs help organizations diagnose an issue, such as the location and time of the event occurrence.

A state-of-the-art log analytics system is Amazon OpenSearch Service.

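The collect-and-check step described above can be sketched in a few lines. The "time host LEVEL message" line format below is an assumption for illustration, not any product's actual format:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the collect-and-check step of log analytics: keep only the error
// events so the time and location (host) of each issue can be diagnosed.
// The "time host LEVEL message" line format is assumed for illustration.
public class LogAnalytics {
    static List<String> errorEvents(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains(" ERROR "))   // keep error events only
                .collect(Collectors.toList());
    }
}
```

A real log analytics system would also parse timestamps, aggregate by host, and index the events for search, but the filtering step is the same in spirit.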

Fraud Detection Pattern
Traditional approach:

A rule-based system flags potentially fraudulent transactions.

The credit card transaction is confirmed with the customer.

BDA approach:

Incorporates domain experts' knowledge in the form of rules.

Uses past fraudulent transactions to apply diagnostic analysis.

Performed on a no-sampling basis: all transactions are analyzed, not just a sample.

Customer Sentiment Analysis
Companies monitor what people are saying about them on social media and respond appropriately; if they do not, they quickly lose customers.

Performed by collecting large sets of social media data from sources like Twitter, Facebook and YouTube.

Customer sentiment can take any of the following forms:

Positive

Negative

Neutral
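A minimal rule-based sketch of the three-way labelling above; the word lists are illustrative, and real systems use far richer models:

```java
import java.util.Set;

// Minimal rule-based sketch of three-way sentiment labelling.
// The word lists here are illustrative stand-ins.
public class SentimentAnalysis {
    static final Set<String> POSITIVE = Set.of("great", "love", "excellent");
    static final Set<String> NEGATIVE = Set.of("bad", "hate", "terrible");

    static String classify(String post) {
        int score = 0;
        for (String word : post.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(word)) score++;   // positive hit
            if (NEGATIVE.contains(word)) score--;   // negative hit
        }
        return score > 0 ? "Positive" : score < 0 ? "Negative" : "Neutral";
    }
}
```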
Introduction to Hadoop
Hadoop is a programming framework that provides distributed storage and parallel processing
of large data using commodity hardware.​

History of Hadoop
In 2003, Google published the concept of the Google File System (GFS), which was distributed in nature.

In 2005, the Apache foundation implemented GFS in the form of the Hadoop Distributed File System (HDFS) and MapReduce processing, and released the first version of Hadoop, Hadoop 0.1.0, in the year 2006.

The original Hadoop developers are Doug Cutting and Mike Cafarella. Doug Cutting named Hadoop after his son's toy elephant.

Hadoop Architecture
Name node in HDFS
HDFS - Hadoop Distributed File System
HDFS stores files/data in clusters of nodes. Nodes are basically computers connected in a LAN, with a server maintaining the metadata about all of these nodes.

Advantages:
Inexpensive

Immutable

Reliable data storage

Block-structured and scalable

Disadvantages:
Not suitable for smaller datasets
HDFS Architecture
Metadata in Disk & Metadata in RAM
Rack aware Architecture

Two design possibilities:

 All master processes/nodes reside in one rack, while the slaves are placed in the other racks.

 Every rack has one master node.
Rules for Data node replication
1. Never place a replica on the same data node where the original block resides.

2. In a rack-aware architecture, do not place a replica on the same rack where the original block resides.

3. Never place more than two replicas of a block in the same rack.
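The three rules can be checked mechanically. The following is an illustrative validator (the "rack:host" node naming is an assumption), not HDFS's actual placement code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative validator for the three rules above, not HDFS's actual code.
// A placement lists the original block's node first, then the replicas,
// each written as "rack:host" (an assumed naming scheme).
public class ReplicaRules {
    static boolean valid(List<String> placement) {
        String originalNode = placement.get(0);
        String originalRack = originalNode.split(":")[0];
        Map<String, Integer> perRack = new HashMap<>();
        for (String node : placement) {
            perRack.merge(node.split(":")[0], 1, Integer::sum);
        }
        for (String replica : placement.subList(1, placement.size())) {
            if (replica.equals(originalNode)) return false;               // rule 1
            if (replica.split(":")[0].equals(originalRack)) return false; // rule 2
        }
        return perRack.values().stream().allMatch(c -> c <= 2);           // rule 3
    }
}
```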
HDFS Federation
Federation means organizing several units of the same functionality under one administration / set of rules.

The traditional HDFS architecture has been horizontally scaled to accommodate a larger number of Name node and Data node clusters; this forms the HDFS federation.

The namespace portion consists of Name nodes and the block storage portion consists of Data nodes.

Within each Name node we'll have a namespace, which is a hierarchical structure of directories and files, and a block pool comprising the set of blocks corresponding to the namespace's files.

The blocks of each block pool can be stored in any of the Data nodes. When a Name node is deleted, its namespace and block pool are also removed, by removing those blocks from the Data nodes.

The HDFS High Availability Architecture
The High Availability feature of HDFS ensures data availability to its clients in spite of Name node and Data node failures.

Data node High Availability

The data replication concept of HDFS makes the blocks available even when one of the data nodes holding a copy of a block is corrupted. When the client requests data access from the Name node, the Name node searches for all the nodes in which that data is available. It then provides the user access to the data from the node in which the data was most quickly available.

Namenode High Availability

To provide High Availability in case of Name node failure, the HDFS High Availability Architecture has been available since Hadoop 2.x. In this architecture we'll have an alternative Name node called the passive Name node.
Components of HA Architecture

The minimum number of Journal nodes is 3.


Hadoop File Systems
S.No  File System       URI Scheme  Java Implementation             Description
1     Local             file        fs.LocalFileSystem              A filesystem for a locally connected disk
2     HDFS              hdfs        hdfs.DistributedFileSystem      Works with MapReduce efficiently
3     HFTP              hftp        hdfs.HftpFileSystem             Provides read-only access to HDFS over HTTP
4     HSFTP             hsftp       hdfs.HsftpFileSystem            Provides read-only access to HDFS over HTTP in a secured fashion
5     WebHDFS           webhdfs     hdfs.web.WebHdfsFileSystem      Provides secure read-write access to HDFS over HTTP
6     HAR               har         fs.HarFileSystem                A filesystem layered on another filesystem, for archiving files
7     KFS (CloudStore)  kfs         fs.kfs.KosmosFileSystem         A filesystem that supports data-intensive apps, like GFS
8     FTP               ftp         fs.ftp.FTPFileSystem            A filesystem backed by an FTP server
9     S3                s3a         fs.s3a.S3AFileSystem            A filesystem backed by Amazon S3; replaces the older s3n (S3 native) implementation
10    Azure             wasb        fs.azure.NativeAzureFileSystem  A filesystem backed by Microsoft Azure
Accessing the Hadoop File System using the Java API

public static FileSystem get(URI uri, Configuration conf) throws IOException

Returns the FileSystem specified by the given URI and FileSystem configuration.
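The call above resolves the URI scheme to one of the concrete implementations in the table. The following is a self-contained mock of that scheme-based dispatch using stand-in types, not the real org.apache.hadoop classes:

```java
import java.net.URI;
import java.util.Map;

// Self-contained mock of the scheme-based dispatch that FileSystem.get()
// performs; these are stand-in types, not the real org.apache.hadoop classes.
public class FileSystemDemo {
    interface FileSystem { String name(); }

    // Registry of URI scheme -> implementation, mirroring the table above.
    static final Map<String, FileSystem> BY_SCHEME = Map.of(
        "file", (FileSystem) () -> "LocalFileSystem",
        "hdfs", (FileSystem) () -> "DistributedFileSystem");

    // Resolve the URI's scheme to a registered FileSystem implementation.
    static FileSystem get(String uri) {
        String scheme = URI.create(uri).getScheme();
        FileSystem fs = BY_SCHEME.get(scheme);
        if (fs == null) {
            throw new IllegalArgumentException("No FileSystem for scheme: " + scheme);
        }
        return fs;
    }
}
```

In real Hadoop the mapping from scheme to implementation class comes from the Configuration object passed to get(); here it is hard-coded to keep the sketch self-contained.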
Anatomy of a File Read
1. The client opens the required file by calling the open() method on the Distributed File System object.

2. The Distributed File System calls the name node using RPC to determine the locations of the first few blocks. Upon determining the data nodes with the closest proximity, the Distributed File System returns an FSDataInputStream.

3. The client then calls the read() method on the FSDataInputStream.

4. Data is streamed from the data node back to the client, which calls read() repeatedly.

5. When the end of a block is reached, the DFSInputStream closes the connection to that data node, then finds the best data node for the next block.

6. When the client has finished reading, it calls close() on the FSDataInputStream.
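The steps above can be simulated end to end with in-memory stand-ins for the name node's block map and the data nodes' block storage; all names and contents below are illustrative:

```java
import java.util.List;
import java.util.Map;

// In-memory simulation of the read path: the "name node" maps a file to an
// ordered block list, each block lives on a "data node", and the client
// streams block after block. All names and contents are illustrative.
public class FileReadDemo {
    // Data node storage: block id -> block contents.
    static final Map<String, String> DATANODES = Map.of(
        "blk_1", "Hello, ", "blk_2", "HDFS ", "blk_3", "readers!");

    // Name node metadata: file path -> ordered list of block ids.
    static final Map<String, List<String>> NAMENODE = Map.of(
        "/user/demo.txt", List.of("blk_1", "blk_2", "blk_3"));

    static String read(String path) {
        StringBuilder out = new StringBuilder();        // step 1: open
        for (String blockId : NAMENODE.get(path)) {     // steps 2 and 5: locate blocks
            out.append(DATANODES.get(blockId));         // steps 3-4: repeated read()
        }
        return out.toString();                          // step 6: close
    }
}
```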
Anatomy of a File Write
1. The client creates a new file by calling the create() method on the Distributed File System object.

2. The Distributed File System calls the name node using RPC to create a new file in its namespace, with no blocks associated with it. The name node makes a record of the new file after checking that the file doesn't already exist and that the client has the right permissions to create it; otherwise, file creation fails and the client is thrown an IOException. The Distributed File System then returns an FSDataOutputStream for the client.

3. The client calls the write() method, and the DFSOutputStream splits the data into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which asks the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas.
Anatomy of a File Write
4. The list of data nodes forms a pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.

5. The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the data nodes in the pipeline.

6. When the client has finished writing, it calls close() on the FSDataOutputStream.

7. This action flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
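Steps 3-7 revolve around two queues. The following simulation sketches them, with per-node lists standing in for data node storage; the packet size and pipeline length are illustrative parameters:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simulation of steps 3-7: packets move from the data queue down a pipeline
// of "data nodes" (lists standing in for their storage), and leave the ack
// queue only after every node in the pipeline has stored them.
public class FileWriteDemo {
    static List<List<String>> writeThroughPipeline(String data, int packetSize, int pipelineLength) {
        Deque<String> dataQueue = new ArrayDeque<>();   // step 3: split into packets
        for (int i = 0; i < data.length(); i += packetSize) {
            dataQueue.add(data.substring(i, Math.min(i + packetSize, data.length())));
        }
        List<List<String>> datanodes = new ArrayList<>();
        for (int i = 0; i < pipelineLength; i++) datanodes.add(new ArrayList<>());

        Deque<String> ackQueue = new ArrayDeque<>();
        while (!dataQueue.isEmpty()) {
            String packet = dataQueue.poll();
            ackQueue.add(packet);                       // step 5: await acknowledgement
            for (List<String> node : datanodes) {       // step 4: each node stores and forwards
                node.add(packet);
            }
            ackQueue.poll();                            // acked by every node in the pipeline
        }
        return datanodes;                               // steps 6-7: all packets flushed
    }
}
```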
Replica Placement
Hadoop's default strategy is to place:

 The first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy).

 The second replica on a different rack from the first (off-rack), chosen at random.

 The third replica on the same rack as the second, but on a different node chosen at random.

 Further replicas on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack.

Once the replica locations have been chosen, a pipeline is built, taking network topology into account.
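The default strategy above can be sketched as a function from the writer's node and a rack map to three replica locations; the rack and host names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Random;

// Sketch of the default strategy: first replica on the writer's node, second
// on a random node of a different rack, third on another node of that same
// rack. Rack and host names are illustrative.
public class DefaultPlacement {
    static List<String> place(String clientNode, Map<String, List<String>> racks, Random rnd) {
        String clientRack = rackOf(clientNode, racks);
        List<String> otherRacks = new ArrayList<>(racks.keySet());
        otherRacks.remove(clientRack);                               // go off-rack
        String secondRack = otherRacks.get(rnd.nextInt(otherRacks.size()));

        List<String> candidates = new ArrayList<>(racks.get(secondRack));
        String second = candidates.remove(rnd.nextInt(candidates.size()));
        String third = candidates.get(rnd.nextInt(candidates.size())); // same rack, different node
        return List.of(clientNode, second, third);
    }

    static String rackOf(String node, Map<String, List<String>> racks) {
        for (Map.Entry<String, List<String>> entry : racks.entrySet()) {
            if (entry.getValue().contains(node)) return entry.getKey();
        }
        throw new NoSuchElementException(node);
    }
}
```

This sketch omits the "too full or too busy" checks and the further-replica case, which depend on live cluster state.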
