
Big Data Analytics

UNIT - 1

Big Data
A collection of datasets that cannot be handled using traditional data processing tools.

Big Data Analytics

Analyzing Big Data to extract informative patterns from it.


Sample Use Case of Big Data - Banking System

Stakeholders: generate data; access the analyzed data.
Tools & Technologies: storing, processing, retrieving and analyzing the data.
Types of data: Structured, Semi-Structured, Unstructured.
Information perceived: reports generated, graphs, tables.

[Figure: data sources and processing pipeline]
Structured data: CRM, MySQL, Oracle, flat files (customer, partner and core data)
Semi-structured data: JSON, XML, IoT/sensory data
Unstructured data: surveillance/security footage, Facebook pages, EMS customer-care logs, customer sentiment, media
Pipeline: ETL -> distributed storage (Hadoop HDFS, NoSQL stores such as MongoDB) -> processed storage (Pig, Hive) -> data analysis (MapReduce framework; Spark/Scala as a newer, simpler programming framework for faster, real-time analysis)
• K H Vijaya Kumari, Asst.Professor, Dept of IT, CBIT,Hyderabad
The 5 V's of Big Data
Volume

8 bits = 1 Byte
10^3 bytes = 1 Kilobyte (KB)
10^6 bytes = 1 Megabyte (MB)
10^9 bytes = 1 Gigabyte (GB)
10^12 bytes = 1 Terabyte (TB)
10^15 bytes = 1 Petabyte (PB)
10^18 bytes = 1 Exabyte (EB)
10^21 bytes = 1 Zettabyte (ZB)
10^24 bytes = 1 Yottabyte (YB)
10^27 bytes = 1 Brontobyte (BB)

Data on the zettabyte scale is what is usually referred to as Big Data.

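The unit ladder above can be captured in a few lines. The following is an illustrative sketch (not part of any Hadoop API) that walks the decimal prefixes:

```java
import java.math.BigInteger;

// Illustrative sketch: walk the decimal unit ladder above, dividing by 10^3
// until the value drops below the next threshold.
public class ByteUnits {
    static final String[] UNITS =
        {"Byte", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB", "BB"};

    // Returns the name of the largest unit whose threshold does not exceed `bytes`.
    static String unitOf(BigInteger bytes) {
        BigInteger thousand = BigInteger.valueOf(1000);
        int i = 0;
        while (i < UNITS.length - 1 && bytes.compareTo(thousand) >= 0) {
            bytes = bytes.divide(thousand);
            i++;
        }
        return UNITS[i];
    }
}
```

For example, 10^21 bytes falls in the zettabyte range, the scale the slide associates with Big Data. BigInteger is used because 10^21 overflows a 64-bit long.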

Velocity
The rate at which a system generates large amounts of data.

Walmart handles 1 million customer transactions per hour.

Facebook handles 40 billion photos from its user base, inserts 500 TB of data each day, and stores, processes and analyzes 30+ PB of data.

A flight generates 240 TB of data for every 6-8 hours of flight.
Variety
Variety of sources (ubiquitous computing: the ability to compute/analyze data at any time, from anywhere, using any device):

People - using mobile devices

Machines - sensors / IoT devices

Organizations - generating data by capturing customer transactions

Variety of formats:
Structured
Semi-Structured
Unstructured
Value
Whether the data being analyzed yields meaningful information (statistical, hypothetical, correlational).


Veracity

Authenticity, Reputation, Authorization


Big Data Life Cycle
Business case Visualization
Evaluation

Identification of
Data Analysis
Data

Data Filtering Data Aggregation

Data Extraction
Types of BDA
1. Descriptive Analytics - What happened?

Ex: The chemical company Dow identified underutilized areas and thus saved 4 million dollars annually.

2. Diagnostic Analytics - How did it happen?

Ex: Airlines - analyzing customer dropouts.

3. Predictive Analytics - What is likely to happen?

Ex: PayPal - detecting a fraudulent transaction before it happens.

4. Prescriptive Analytics - What can be done?

Ex: Airlines - automatically adjusting flight fares depending on the season.
IT Use cases
Log Analytics

Fraud Detection Pattern

Social Media/ Customer Sentiment Analysis Pattern​


Log Analytics
Log analytics is the assessment of recorded information about events, collected from one or more computers, networks, applications and operating systems.

An event is an identifiable or significant occurrence on a piece of hardware or software.

An event can be generated by a user or by a computer.

Log analytics software collects and checks logs such as error logs.

These logs help organizations diagnose an issue, such as the location and time of the event occurrence.

A state-of-the-art log analytics system is Amazon OpenSearch Service.

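The collect-and-check step described above can be sketched in a few lines. The "time host LEVEL message" line format below is an assumption for illustration, not any product's actual format:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the collect-and-check step of log analytics: keep only the error
// events so the time and location (host) of each issue can be diagnosed.
// The "time host LEVEL message" line format is assumed for illustration.
public class LogAnalytics {
    static List<String> errorEvents(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains(" ERROR "))   // keep error events only
                .collect(Collectors.toList());
    }
}
```

A real log analytics system would also parse timestamps, aggregate by host, and index the events for search, but the filtering step is the same in spirit.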

Fraud Detection Pattern
Traditional approach:

A rule-based system flags potentially fraudulent transactions.

The credit card transaction is confirmed with the customer.

BDA approach:

Incorporates domain experts' knowledge in the form of rules.

Uses past fraudulent transactions to apply diagnostic analysis.

Performed on a no-sampling basis: all transactions are analyzed, not just a sample.

Customer Sentiment Analysis
Companies monitor what people are saying about them on social media and respond appropriately; if they do not, they quickly lose customers.

Performed by collecting large sets of social media data from sources like Twitter, Facebook and YouTube.

Customer sentiment can take any of the following forms:

Positive

Negative

Neutral
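A minimal rule-based sketch of the three-way labelling above; the word lists are illustrative, and real systems use far richer models:

```java
import java.util.Set;

// Minimal rule-based sketch of three-way sentiment labelling.
// The word lists here are illustrative stand-ins.
public class SentimentAnalysis {
    static final Set<String> POSITIVE = Set.of("great", "love", "excellent");
    static final Set<String> NEGATIVE = Set.of("bad", "hate", "terrible");

    static String classify(String post) {
        int score = 0;
        for (String word : post.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(word)) score++;   // positive hit
            if (NEGATIVE.contains(word)) score--;   // negative hit
        }
        return score > 0 ? "Positive" : score < 0 ? "Negative" : "Neutral";
    }
}
```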
Introduction to Hadoop
Hadoop is a programming framework that provides distributed storage and parallel processing
of large data using commodity hardware.​

History of Hadoop
In 2003, Google published the concept of the Google File System (GFS), which was distributed in nature.

In 2005, the Apache foundation implemented GFS in the form of the Hadoop Distributed File System (HDFS) and MapReduce processing, and released the first version of Hadoop, Hadoop 0.1.0, in the year 2006.

The original Hadoop developers are Doug Cutting and Mike Cafarella. Doug Cutting named Hadoop after his son's toy elephant.

Hadoop Architecture
Name node in HDFS
HDFS - Hadoop Distributed File System
HDFS stores files/data in clusters of nodes. Nodes are basically computers connected in a LAN, with a server maintaining the metadata about all of these nodes.

Advantages:
Inexpensive

Immutable

Reliable data storage

Block-structured and scalable

Disadvantages:
Not suitable for smaller datasets
HDFS Architecture
Metadata in Disk & Metadata in RAM
Rack aware Architecture

Two design possibilities:

 All master processes/nodes reside in one rack, while the slaves are placed in the other racks.

 Every rack has one master node.
Rules for Data node replication
1. Never place a replica on the same data node where the original block resides.

2. In a rack-aware architecture, do not place a replica on the same rack where the original block resides.

3. Never place more than two replicas of a block in the same rack.
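The three rules can be checked mechanically. The following is an illustrative validator (the "rack:host" node naming is an assumption), not HDFS's actual placement code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative validator for the three rules above, not HDFS's actual code.
// A placement lists the original block's node first, then the replicas,
// each written as "rack:host" (an assumed naming scheme).
public class ReplicaRules {
    static boolean valid(List<String> placement) {
        String originalNode = placement.get(0);
        String originalRack = originalNode.split(":")[0];
        Map<String, Integer> perRack = new HashMap<>();
        for (String node : placement) {
            perRack.merge(node.split(":")[0], 1, Integer::sum);
        }
        for (String replica : placement.subList(1, placement.size())) {
            if (replica.equals(originalNode)) return false;               // rule 1
            if (replica.split(":")[0].equals(originalRack)) return false; // rule 2
        }
        return perRack.values().stream().allMatch(c -> c <= 2);           // rule 3
    }
}
```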
HDFS Federation
Federation means organizing several units of the same functionality under one administration / set of rules.

The traditional HDFS architecture has been horizontally scaled to accommodate a larger number of Name node and Data node clusters; this forms the HDFS federation.

The namespace portion consists of Name nodes and the block storage portion consists of Data nodes.

Within each Name node we'll have a namespace, which is a hierarchical structure of directories and files, and a block pool comprising the set of blocks corresponding to the namespace's files.

The blocks of each block pool can be stored in any of the Data nodes. When a Name node is deleted, its namespace and block pool are also removed, by removing those blocks from the Data nodes.

The HDFS High Availability Architecture
The High Availability feature of HDFS ensures data availability to its clients in spite of Name node and Data node failures.

Data node High Availability

The data replication concept of HDFS makes the blocks available even when one of the data nodes holding a copy of a block is corrupted. When the client requests data access from the Name node, the Name node searches for all the nodes in which that data is available. It then provides the user access to the data from the node in which the data was most quickly available.

Namenode High Availability

To provide High Availability in case of Name node failure, the HDFS High Availability Architecture has been available since Hadoop 2.x. In this architecture we'll have an alternative Name node called the passive Name node.
Components of HA Architecture

The minimum number of Journal nodes is 3.


Hadoop File Systems
S.No  File System       URI Scheme  Java Implementation             Description
1     Local             file        fs.LocalFileSystem              A filesystem for a locally connected disk
2     HDFS              hdfs        hdfs.DistributedFileSystem      Works with MapReduce efficiently
3     HFTP              hftp        hdfs.HftpFileSystem             Provides read-only access to HDFS over HTTP
4     HSFTP             hsftp       hdfs.HsftpFileSystem            Provides read-only access to HDFS over HTTP in a secured fashion
5     WebHDFS           webhdfs     hdfs.web.WebHdfsFileSystem      Provides secure read-write access to HDFS over HTTP
6     HAR               har         fs.HarFileSystem                A filesystem layered on another filesystem, for archiving files
7     KFS (CloudStore)  kfs         fs.kfs.KosmosFileSystem         A filesystem that supports data-intensive apps, like GFS
8     FTP               ftp         fs.ftp.FTPFileSystem            A filesystem backed by an FTP server
9     S3                s3a         fs.s3a.S3AFileSystem            A filesystem backed by Amazon S3; replaces the older s3n (S3 native) implementation
10    Azure             wasb        fs.azure.NativeAzureFileSystem  A filesystem backed by Microsoft Azure
Accessing the Hadoop File System using the Java API

public static FileSystem get(URI uri, Configuration conf) throws IOException

Returns the FileSystem specified by the given URI and FileSystem configuration.
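The call above resolves the URI scheme to one of the concrete implementations in the table. The following is a self-contained mock of that scheme-based dispatch using stand-in types, not the real org.apache.hadoop classes:

```java
import java.net.URI;
import java.util.Map;

// Self-contained mock of the scheme-based dispatch that FileSystem.get()
// performs; these are stand-in types, not the real org.apache.hadoop classes.
public class FileSystemDemo {
    interface FileSystem { String name(); }

    // Registry of URI scheme -> implementation, mirroring the table above.
    static final Map<String, FileSystem> BY_SCHEME = Map.of(
        "file", (FileSystem) () -> "LocalFileSystem",
        "hdfs", (FileSystem) () -> "DistributedFileSystem");

    // Resolve the URI's scheme to a registered FileSystem implementation.
    static FileSystem get(String uri) {
        String scheme = URI.create(uri).getScheme();
        FileSystem fs = BY_SCHEME.get(scheme);
        if (fs == null) {
            throw new IllegalArgumentException("No FileSystem for scheme: " + scheme);
        }
        return fs;
    }
}
```

In real Hadoop the mapping from scheme to implementation class comes from the Configuration object passed to get(); here it is hard-coded to keep the sketch self-contained.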
Anatomy of a File Read
1. The client opens the required file by calling the open() method on the Distributed File System object.

2. The Distributed File System calls the name node using RPC to determine the locations of the first few blocks. Upon determining the data nodes with the closest proximity, the Distributed File System returns an FSDataInputStream.

3. The client then calls the read() method on the FSDataInputStream.

4. Data is streamed from the data node back to the client, which calls read() repeatedly.

5. When the end of a block is reached, the DFSInputStream closes the connection to that data node, then finds the best data node for the next block.

6. When the client has finished reading, it calls close() on the FSDataInputStream.
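The steps above can be simulated end to end with in-memory stand-ins for the name node's block map and the data nodes' block storage; all names and contents below are illustrative:

```java
import java.util.List;
import java.util.Map;

// In-memory simulation of the read path: the "name node" maps a file to an
// ordered block list, each block lives on a "data node", and the client
// streams block after block. All names and contents are illustrative.
public class FileReadDemo {
    // Data node storage: block id -> block contents.
    static final Map<String, String> DATANODES = Map.of(
        "blk_1", "Hello, ", "blk_2", "HDFS ", "blk_3", "readers!");

    // Name node metadata: file path -> ordered list of block ids.
    static final Map<String, List<String>> NAMENODE = Map.of(
        "/user/demo.txt", List.of("blk_1", "blk_2", "blk_3"));

    static String read(String path) {
        StringBuilder out = new StringBuilder();        // step 1: open
        for (String blockId : NAMENODE.get(path)) {     // steps 2 and 5: locate blocks
            out.append(DATANODES.get(blockId));         // steps 3-4: repeated read()
        }
        return out.toString();                          // step 6: close
    }
}
```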
Anatomy of a File Write
1. The client creates a new file by calling the create() method on the Distributed File System object.

2. The Distributed File System calls the name node using RPC to create a new file in its namespace, with no blocks associated with it. The name node makes a record of the new file after checking that the file doesn't already exist and that the client has the right permissions to create it; otherwise, file creation fails and the client is thrown an IOException. The Distributed File System then returns an FSDataOutputStream for the client.

3. The client calls the write() method, and the DFSOutputStream splits the data into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which asks the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas.
Anatomy of a File Write
4. The list of data nodes forms a pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.

5. The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the data nodes in the pipeline.

6. When the client has finished writing, it calls close() on the FSDataOutputStream.

7. This action flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
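Steps 3-7 revolve around two queues. The following simulation sketches them, with per-node lists standing in for data node storage; the packet size and pipeline length are illustrative parameters:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simulation of steps 3-7: packets move from the data queue down a pipeline
// of "data nodes" (lists standing in for their storage), and leave the ack
// queue only after every node in the pipeline has stored them.
public class FileWriteDemo {
    static List<List<String>> writeThroughPipeline(String data, int packetSize, int pipelineLength) {
        Deque<String> dataQueue = new ArrayDeque<>();   // step 3: split into packets
        for (int i = 0; i < data.length(); i += packetSize) {
            dataQueue.add(data.substring(i, Math.min(i + packetSize, data.length())));
        }
        List<List<String>> datanodes = new ArrayList<>();
        for (int i = 0; i < pipelineLength; i++) datanodes.add(new ArrayList<>());

        Deque<String> ackQueue = new ArrayDeque<>();
        while (!dataQueue.isEmpty()) {
            String packet = dataQueue.poll();
            ackQueue.add(packet);                       // step 5: await acknowledgement
            for (List<String> node : datanodes) {       // step 4: each node stores and forwards
                node.add(packet);
            }
            ackQueue.poll();                            // acked by every node in the pipeline
        }
        return datanodes;                               // steps 6-7: all packets flushed
    }
}
```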
Replica Placement
Hadoop's default strategy is to place:

 The first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy).

 The second replica on a different rack from the first (off-rack), chosen at random.

 The third replica on the same rack as the second, but on a different node chosen at random.

 Further replicas on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack.

Once the replica locations have been chosen, a pipeline is built, taking network topology into account.
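The default strategy above can be sketched as a function from the writer's node and a rack map to three replica locations; the rack and host names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Random;

// Sketch of the default strategy: first replica on the writer's node, second
// on a random node of a different rack, third on another node of that same
// rack. Rack and host names are illustrative.
public class DefaultPlacement {
    static List<String> place(String clientNode, Map<String, List<String>> racks, Random rnd) {
        String clientRack = rackOf(clientNode, racks);
        List<String> otherRacks = new ArrayList<>(racks.keySet());
        otherRacks.remove(clientRack);                               // go off-rack
        String secondRack = otherRacks.get(rnd.nextInt(otherRacks.size()));

        List<String> candidates = new ArrayList<>(racks.get(secondRack));
        String second = candidates.remove(rnd.nextInt(candidates.size()));
        String third = candidates.get(rnd.nextInt(candidates.size())); // same rack, different node
        return List.of(clientNode, second, third);
    }

    static String rackOf(String node, Map<String, List<String>> racks) {
        for (Map.Entry<String, List<String>> entry : racks.entrySet()) {
            if (entry.getValue().contains(node)) return entry.getKey();
        }
        throw new NoSuchElementException(node);
    }
}
```

This sketch omits the "too full or too busy" checks and the further-replica case, which depend on live cluster state.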
