UNIT - 1
Big Data Analytics
Big Data
A collection of datasets that cannot be handled using traditional data processing tools.
Sample Use Case of Big Data - Banking System
(Diagram: stakeholders generate data and access the analyzed data; tools and technologies handle storing, processing, retrieving and analyzing it.)
• ETL
• Distributed storage: Hadoop, HDFS
• NoSQL: MongoDB
• MapReduce framework
• New, simpler programming: Scala, Spark
• Faster real-time data analysis: Pig, Hive
Forms of Data
(Structured, Unstructured, Semi-Structured)
Structured Data
• Structured data is data that adheres to a pre-defined data model and is
therefore straightforward to analyze.
• Structured data conforms to a tabular format with relationship between
the different rows and columns.
• Structured data is considered the most ‘traditional’ form of data
storage, since the earliest versions of database management systems
(DBMS) were able to store, process and access structured data.
• Common examples of structured data are Excel files or SQL databases.
Each of these has structured rows and columns that can be sorted.
Unstructured Data
• Unstructured data is information that either does not have a predefined
data model or is not organized in a pre-defined manner.
• The ability to store and process unstructured data has greatly grown in
recent years, with many new technologies and tools coming to the
market that are able to store specialized types of unstructured data.
• Common examples of unstructured data include audio files, video files, or
NoSQL databases.
• MongoDB, for example, is optimized to store documents. Apache Giraph,
as an opposite example, is optimized for storing relationships between
nodes.
• The ability to analyze unstructured data is especially relevant in the
context of Big Data, since a large part of the data in organizations is
unstructured. Think about pictures, videos or PDF documents. The ability
to extract value from unstructured data is one of the main drivers behind the
quick growth of Big Data.
• Unstructured data is everywhere. In fact, most individuals and organizations
conduct their lives around unstructured data. Just as with structured data,
unstructured data is either machine generated or human generated.
Here are some examples of machine-generated unstructured data:
• Satellite images: This includes weather data or the data that the government
captures in its satellite surveillance imagery. Just think about Google Earth, and you
get the picture.
• Scientific data: This includes seismic imagery, atmospheric data, and high energy
physics.
• Photographs and video: This includes security, surveillance, and traffic video.
• Radar or sonar data: This includes vehicular, meteorological, and oceanographic
seismic profiles.
The following list shows a few examples of human-generated unstructured data:
• Text internal to your company: Think of all the text within documents, logs, survey
results, and e-mails. Enterprise information actually represents a large percentage of
the text information in the world today.
• Social media data: This data is generated from the social media platforms such as
YouTube, Facebook, Twitter, LinkedIn, and Flickr.
• Mobile data: This includes data such as text messages and location information.
• Website content: This comes from any site delivering unstructured content, like
YouTube, Flickr, or Instagram.
Semi-structured Data
• Semi-structured data is a form of structured data that does not conform to the formal
structure of data models associated with relational databases or other forms of data tables,
but nonetheless contains tags or other markers to separate semantic elements and enforce
hierarchies of records and fields within the data. It is therefore also known as a self-
describing structure.
• Common examples of semi-structured data are JSON and XML.
• The reason that this third category exists (between structured and unstructured data) is
that semi-structured data is considerably easier to analyze than unstructured data.
Many Big Data solutions and tools have the ability to ‘read’ and process either JSON or XML,
which reduces the complexity of analysis compared to unstructured data.
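As a minimal illustration of this self-describing structure, consider a small hypothetical JSON record (the field names are invented for the example); the tags name the semantic elements and the nesting enforces the hierarchy of records and fields:

{
  "customer": {
    "id": 1024,
    "name": "A. Kumar",
    "transactions": [
      { "date": "2017-03-01", "amount": 250.00 },
      { "date": "2017-03-05", "amount": 75.50 }
    ]
  }
}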
Metadata – Data about Data
• A final category of data type is metadata. From a technical point of view,
this is not a separate data structure, but it is one of the most important
elements for Big Data analysis and Big Data solutions. Metadata is data
about data: it provides additional information about a specific set of data.
• In a set of photographs, for example, metadata could describe when and
where the photos were taken. The metadata then provides fields for
dates and locations which, by themselves, can be considered structured
data. For this reason, metadata is frequently used by Big Data
solutions for initial analysis.
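A tiny, made-up example of photo metadata, in the same JSON style as above; note that the date and location fields are themselves structured:

{
  "file": "IMG_0421.jpg",
  "date_taken": "2017-06-14T09:32:05",
  "location": { "lat": 17.3850, "lon": 78.4867 }
}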
The 5 V's of Big Data
Volume
8 bits = 1 Byte
10^15 bytes = 1 Petabyte (PB)
10^18 bytes = 1 Exabyte (EB)
10^21 bytes = 1 Zettabyte (ZB)
10^24 bytes = 1 Yottabyte (YB)
10^27 bytes = 1 Brontobyte (BB)
Velocity
The rate at which a system generates large amounts of data.
Examples:
• One organization stores, processes and analyzes 30+ PB of data.
• A flight generates 240 TB of data for every 6-8 hours of flight.
Variety
Variety of sources (ubiquitous computing: the ability to compute/analyze data at any time, from anywhere, using any device):
• People - using mobile devices
• Machines - sensors / IoT devices
• Organizations - generating data by capturing customer transactions

Variety of data:
• Structured
• Semi-structured
• Unstructured
Veracity
The correctness of the data being generated.
Value
Whether the data being analyzed results in some meaningful information
Use cases
• Log Analytics
• Fraud Detection Pattern
• Customer Sentiment Analysis
Log Analytics
Log analytics is the assessment of recorded information about events collected
from one or more computers, networks, and application/OS sources.
Log analytics software collects and checks logs such as error logs.
These logs help organizations diagnose an issue, such as the location and time of
the event occurrence, etc.
Fraud Detection Pattern
Traditional approach: rule-based systems flag potentially fraudulent transactions.
BDA approach:
Customer Sentiment Analysis
Companies monitor what people are saying about them in social media and
respond appropriately — and if they do not, they quickly lose customers.
Sentiment is typically classified as:
• Positive
• Negative
• Neutral
Introduction to Hadoop
Hadoop is a programming framework that provides distributed storage and
parallel processing of large data sets using commodity hardware.
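To make the parallel-processing part concrete, here is a minimal sketch of the classic WordCount job written against the Hadoop MapReduce Java API; the class name and the input/output paths passed on the command line are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}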
History of Hadoop
In 2003, Google published the concept of the Google File System (GFS), which was
distributed in nature.
Hadoop was developed by Doug Cutting and Mike Cafarella. Doug Cutting
named Hadoop after his son's toy elephant.
Hadoop Architecture
Name node in HDFS
HDFS - Hadoop Distributed File System
HDFS stores files/data in clusters of nodes. Nodes are basically computers
connected in a LAN, with a server maintaining the metadata about all these
nodes.
Advantages:
• Inexpensive
• Immutable
Disadvantages:
• Not suitable for smaller datasets
HDFS Architecture
Metadata in Disk & Metadata in RAM
Rack aware Architecture
Rules for Data node replication
1. Never place a replica on the same data node where the original block resides.
HDFS Federation
Federation means organizing several units of the same functionality under
one administration / set of rules.
The traditional HDFS architecture has been horizontally scaled to accommodate more
Name node and Data node clusters; this forms the HDFS federation.
The Namespace layer consists of Name nodes and the Block Storage layer consists of
Data nodes.
Within each Name node we'll have a namespace, which is a hierarchical structure of
directories and files, and a block pool comprising the set of blocks corresponding to
the namespace files.
The blocks of each block pool can be stored in any of the data nodes. When a Name
node is deleted, its namespace and block pool are also removed, by removing those
blocks from the Data nodes.
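As a rough illustrative sketch (not a complete configuration), federation is declared in hdfs-site.xml by listing multiple name services, each with its own Name node; the service names ns1/ns2 and the host names below are made-up placeholders:

<configuration>
  <!-- Two federated Name nodes, each owning its own namespace and block pool -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>namenode2.example.com:8020</value>
  </property>
</configuration>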
The HDFS High Availability Architecture
The High Availability feature of HDFS ensures data availability to its clients
in spite of Name node or Data node failure.
To provide high availability in case of Name node failure, the HDFS High
Availability Architecture has been provided since Hadoop 2.x. In this
architecture we'll have an alternative Name node, called the passive Name node.
Components of HA Architecture

Zookeeper
• Holds the status of the active and passive Name nodes, to enable the alternate Name node during the active Name node's failure.
• Minimum number of Zookeepers is 3.

Name node
• Maintains the metadata of the cluster.
• Updates/writes the metadata into all the journal nodes.

Data node
• Stores the data in the form of blocks.
• Sends its heartbeat (status) to the active Name node frequently.

Journal node
• Holds the metadata of the file system, which is shared between the active Name node (which writes it) and the passive Name node (which reads it).
• Minimum number of Journal nodes is 3.

Failover controller
• Monitors the health of the Name node's OS and hardware.
• Sends the Name node's status to the Zookeeper(s).
• Controls the Name nodes by using STONITH (Shoot The Other Node In The Head).

Passive Name node
• Reads and copies the metadata of the file system from the journal nodes.
• Monitors the active Name node's status from the Zookeeper, to become the active Name node in case of the present active Name node's failure.
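For concreteness, a minimal illustrative hdfs-site.xml sketch of an HA pair backed by journal nodes might look like the following; the name service mycluster and the host names are placeholders, and a real deployment needs further settings (fencing, client failover proxy, etc.):

<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- The two Name nodes: one active, one passive (standby) -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <!-- Shared edit log stored on a quorum of three journal nodes -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
  </property>
  <!-- Let the Zookeeper-based failover controllers handle failover automatically -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>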
Hadoop File Systems
Hadoop has an abstract notion of a file system, of which HDFS
is just one instance or implementation.
The Java abstract class
org.apache.hadoop.fs.FileSystem is the base file system class from
which various implementations can be made.
S.No | File System | URI Scheme | Java Implementation | Description
Step 1: Create an instance of the file system we want to access, by using one of the following factory methods:

public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException

These return the file system specified by the URI (the second one for the specified user).

Step 2: Open the file specified by the path, with a default buffer size of 4 KB:

public FSDataInputStream open(Path f) throws IOException
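Putting the two steps together, a minimal sketch of reading a file from HDFS through this API might look as follows; the URI, host name and path are placeholders:

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileRead {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode.example.com:8020/user/demo/sample.txt"; // placeholder
    Configuration conf = new Configuration();

    // Step 1: obtain the FileSystem instance for the given URI
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    // Step 2: open the file (default 4 KB buffer) and copy it to stdout
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}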
Anatomy of a File Read
1. The client opens the required file to be read by calling the open() method
on the DistributedFileSystem object.

Advantage:
HDFS can scale to a large number of concurrent clients.
Network Topology and Hadoop
In the context of high-volume data processing, the limiting
factor is the rate at which we can transfer data between
nodes: bandwidth is a scarce commodity.
In a Hadoop cluster the network is represented as a tree, and
the distance between two nodes is the sum of
their distances to their closest common ancestor.
Levels in the tree correspond to the data center, the rack,
and the node that a process is running on.
The bandwidth available for each of the following
scenarios becomes progressively less:
• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks in the same data center
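For example, using the conventional notation /datacenter/rack/node, with made-up names, the distance grows by two at each level of separation:

distance(/d1/r1/n1, /d1/r1/n1) = 0   (processes on the same node)
distance(/d1/r1/n1, /d1/r1/n2) = 2   (different nodes on the same rack)
distance(/d1/r1/n1, /d1/r2/n3) = 4   (nodes on different racks in the same data center)
distance(/d1/r1/n1, /d2/r3/n4) = 6   (nodes in different data centers)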
Anatomy of File Write
1. The client creates a new file by calling the create() method
on the DistributedFileSystem object.
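Correspondingly, a minimal sketch of writing a file through the FileSystem API; the URI, host name and path are again placeholders:

import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileWrite {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode.example.com:8020/user/demo/out.txt"; // placeholder
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    // create() returns an output stream to a brand-new file;
    // HDFS splits the data into blocks and replicates them behind the scenes
    OutputStream out = fs.create(new Path(uri));
    try {
      out.write("hello, hdfs".getBytes("UTF-8"));
    } finally {
      out.close();
    }
  }
}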
Replica Placement
Hadoop's default strategy is to place the first replica on the same node as the
client (or, for a client running outside the cluster, on a randomly chosen node),
the second replica on a different rack from the first, and the third replica on
the same rack as the second but on a different node.