1 BDA Unit1 ppt1
UNIT - 1
Big Data Analytics
Big Data
A collection of datasets too large or complex to be handled by
traditional data processing tools.
[Figure: Big Data technology landscape. Recoverable labels: ETL, Storage, Distributed, Hadoop, HDFS, NoSQL, MongoDB, MapReduce, Spark, Scala, New Simpler Programming Framework, Processed Storage, Faster Real-Time, Pig, Hive, Data Analysis]
• K H Vijaya Kumari, Asst. Professor, Dept. of IT, CBIT, Hyderabad
The 5 V's of Big Data
Volume
8 bits = 1 Byte
Stores, processes, and analyzes 30+ PB of data
A flight generates 240 TB of data for every 6-8 hours of flight
Variety
Variety of Sources
People - using mobile devices
Machines - sensors / IoT devices
Organizations - generate data by capturing customer transactions
Ubiquitous Computing
The ability to compute/analyze data at any time, from anywhere, using any device
Structured
Semi-Structured
Unstructured
Value
Whether the data being analyzed results in some meaningful information
Steps: Identification of Data, Data Extraction, Data Analysis
Types of BDA
1. Descriptive Analytics: What happened?
Ex: The chemical company Dow identified underutilized areas and thus saved
4 million dollars annually.
Log analytics software collects and checks logs such as error logs.
These logs help organizations diagnose an issue, for example the location and time of
an event's occurrence.
BDA approach
Performed by collecting a large set of social media data from sources like Twitter, Facebook, and YouTube.
Customer sentiment can take one of the following forms:
Positive
Negative
Neutral
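The three sentiment labels above can be illustrated with a minimal Java sketch. The slides do not say how sentiment is computed, so the word-counting approach, the class name SentimentSketch, and the tiny word lists are all assumptions made for illustration only; real systems use far richer models.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy lexicon-based sentiment labelling; the approach and word lists are illustrative assumptions.
final class SentimentSketch {
    enum Sentiment { POSITIVE, NEGATIVE, NEUTRAL }

    // Hypothetical, deliberately tiny word lists (not from the slides).
    private static final Set<String> POSITIVE_WORDS =
            new HashSet<>(Arrays.asList("good", "great", "love", "excellent"));
    private static final Set<String> NEGATIVE_WORDS =
            new HashSet<>(Arrays.asList("bad", "poor", "hate", "terrible"));

    // Adds 1 for each positive word, subtracts 1 for each negative word,
    // then maps the sign of the score to one of the three labels.
    static Sentiment classify(String post) {
        int score = 0;
        for (String word : post.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (POSITIVE_WORDS.contains(word)) score++;
            if (NEGATIVE_WORDS.contains(word)) score--;
        }
        if (score > 0) return Sentiment.POSITIVE;
        if (score < 0) return Sentiment.NEGATIVE;
        return Sentiment.NEUTRAL;
    }

    public static void main(String[] args) {
        System.out.println(classify("I love this phone, great battery"));
        System.out.println(classify("Terrible service and poor support"));
        System.out.println(classify("The parcel arrived on Tuesday"));
    }
}
```

A post with no words from either list, or with equally many from both, is labelled Neutral in this sketch.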
Introduction to Hadoop
Hadoop is a programming framework that provides distributed storage and parallel processing
of large datasets using commodity hardware.
History of Hadoop
In 2003, Google published the concept of the Google File System (GFS), which was distributed in nature.
In 2005, the Apache foundation implemented the GFS concept as the Hadoop Distributed File
System (HDFS), together with MapReduce processing, and released the first version of Hadoop, Hadoop 0.1.0,
in 2006.
The original developers of Hadoop are Doug Cutting and Mike Cafarella. Doug Cutting
named Hadoop after his son's toy elephant.
Hadoop Architecture
Name node in HDFS
HDFS - Hadoop Distributed File System
HDFS stores files/data across clusters of nodes. Nodes are computers connected
in a LAN, with a server that maintains the metadata about all these nodes.
Advantages:
Inexpensive
Immutable
Disadvantages:
Not suitable for smaller datasets
HDFS Architecture
Metadata in Disk & Metadata in RAM
Rack aware Architecture
The traditional HDFS architecture has been scaled horizontally to accommodate more
Name nodes and Data node clusters. This forms the HDFS federation.
The Namespace portion consists of Name nodes, and the Block storage portion consists of Data nodes.
Each Name node has a Namespace, which is a hierarchical structure of directories
and files, and a block pool comprising the set of blocks that belong to the files in that Namespace.
The blocks of each block pool can be stored on any of the Data nodes. When a Name node is
deleted, its Namespace and block pool are removed as well, by deleting those blocks from the Data
nodes.
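The relationship between Name nodes, block pools, and Data nodes can be sketched as a small in-memory model. This is not Hadoop code: the class FederationSketch, the identifiers such as "nn1" and "blk_1", and the round-robin block placement are assumptions chosen only to make the description concrete.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified in-memory model of HDFS federation; not Hadoop code.
final class FederationSketch {
    // dataNodes.get(i) maps a Name node id (one block pool per Name node)
    // to the blocks of that pool held on Data node i.
    private final List<Map<String, List<String>>> dataNodes = new ArrayList<>();
    private int nextNode = 0;

    FederationSketch(int dataNodeCount) {
        for (int i = 0; i < dataNodeCount; i++) dataNodes.add(new HashMap<>());
    }

    // Stores a block of the given Name node's block pool on the next Data node.
    // Round-robin placement is an assumption; any Data node may hold any pool's blocks.
    void addBlock(String nameNode, String blockId) {
        Map<String, List<String>> node = dataNodes.get(nextNode);
        node.computeIfAbsent(nameNode, k -> new ArrayList<>()).add(blockId);
        nextNode = (nextNode + 1) % dataNodes.size();
    }

    // Deleting a Name node removes its block pool from every Data node.
    void removeNameNode(String nameNode) {
        for (Map<String, List<String>> node : dataNodes) node.remove(nameNode);
    }

    // Counts the blocks of one Name node's block pool across all Data nodes.
    int blockCount(String nameNode) {
        int total = 0;
        for (Map<String, List<String>> node : dataNodes) {
            total += node.getOrDefault(nameNode, new ArrayList<>()).size();
        }
        return total;
    }

    public static void main(String[] args) {
        FederationSketch cluster = new FederationSketch(3);
        cluster.addBlock("nn1", "blk_1");
        cluster.addBlock("nn1", "blk_2");
        cluster.addBlock("nn2", "blk_3");
        cluster.removeNameNode("nn1");
        System.out.println(cluster.blockCount("nn1")); // pool of nn1 has been removed
        System.out.println(cluster.blockCount("nn2")); // pool of nn2 is untouched
    }
}
```

The point of the model is that block pools are independent: removing one Name node clears only its own blocks, while the pools of other Name nodes on the same Data nodes stay in place.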
The HDFS High Availability Architecture
The High Availability feature of HDFS keeps data available to clients in spite of
Name node and Data node failures.
To provide High Availability in case of Name node failure, the HDFS High Availability
Architecture was introduced in Hadoop 2.x. In this architecture there is an
alternative Name node, called the passive Name node.
Components of HA Architecture
3. HFTP (URI scheme: hftp; Java implementation: hdfs.HftpFileSystem): provides read-only access to HDFS over HTTP
7. KFS (CloudStore) (URI scheme: kfs; Java implementation: fs.kfs.KosmosFileSystem): file system that supports data-intensive apps, like GFS
public static FileSystem get(URI uri, Configuration conf) throws IOException
Returns the FileSystem specified by the URI.
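FileSystem.get() chooses the file system implementation from the scheme of the URI it is given. The standalone sketch below imitates that selection with a plain map and java.net.URI, so it runs without Hadoop on the classpath. It is an illustration, not Hadoop's actual lookup code: the class SchemeLookupSketch, the method implementationFor, and the host names are hypothetical, and the two map entries are taken from the file system list above.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Illustration of scheme-based file system selection; not the Hadoop implementation.
final class SchemeLookupSketch {
    // Scheme-to-class entries from the file system list
    // (class names are relative to the org.apache.hadoop package).
    private static final Map<String, String> IMPLEMENTATIONS = new HashMap<>();
    static {
        IMPLEMENTATIONS.put("hftp", "hdfs.HftpFileSystem");
        IMPLEMENTATIONS.put("kfs", "fs.kfs.KosmosFileSystem");
    }

    // Returns the implementation name registered for the URI's scheme,
    // or null if the scheme is missing or unknown to this sketch.
    static String implementationFor(URI uri) {
        return IMPLEMENTATIONS.get(uri.getScheme());
    }

    public static void main(String[] args) {
        // Host names below are placeholders.
        System.out.println(implementationFor(URI.create("hftp://namenode.example/data/file.txt")));
        System.out.println(implementationFor(URI.create("kfs://meta.example/logs")));
    }
}
```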
FileSystem Configuration
Accessing Hadoop File System using Java API
Anatomy of a File Read
1. The client opens the file to be read by calling the
open() method on the DistributedFileSystem object