Unit 3 Data-Analytics

BIG DATA


Data Analytics
Data Analysis

• Data analysis is defined as a process of cleaning, transforming, and modeling data to discover useful information for business decision-making.
• The purpose of data analysis is to extract useful information from data and to take decisions based upon that analysis.
Benefits
• Better communication with customers.
• Mitigation of fraudulent activities.
• New products & services.
• Allocation of capital & human resources.
• Reduced costs.
Data Analysis – Types

• There are several types of data analysis techniques, depending on the business domain and the technology involved.
• However, the major data analysis methods are:
– Text Analysis
– Statistical Analysis
– Diagnostic Analysis
– Predictive Analysis
– Prescriptive Analysis
Descriptive Analytics

• Descriptive analytics helps answer questions about what happened. These techniques summarize large datasets to describe outcomes to stakeholders.
• Specialized metrics are developed to track performance in specific industries. This process requires the collection of relevant data, processing of the data, data analysis, and data visualization, and it provides essential insight into past performance.
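As a minimal sketch of this idea in Python, past figures can be summarized with pandas; the monthly revenue numbers below are invented purely for illustration:

```python
# A minimal descriptive-analytics sketch: summarizing past sales figures.
# The revenue numbers are fabricated for illustration only.
import pandas as pd

sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "revenue": [12000, 13500, 12800, 15100, 14900, 16200],
})

# Summary statistics describe "what happened" over the period.
print(sales["revenue"].describe())          # count, mean, std, min, quartiles, max
print("Total revenue:", sales["revenue"].sum())
print("Best month:", sales.loc[sales["revenue"].idxmax(), "month"])
```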
Diagnostic analytics

• Diagnostic analytics helps answer questions about why things happened. These techniques supplement more basic descriptive analytics.
• They take the findings from descriptive analytics and dig deeper to find the cause. The performance indicators are further investigated to discover why they got better or worse. This generally occurs in three steps (a small Python sketch of the first step follows the list):
– Identify anomalies in the data. These may be unexpected changes in a metric or a particular market.
– Data that is related to these anomalies is collected.
– Statistical techniques are used to find relationships and trends that explain these anomalies.
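A rough sketch of the anomaly-identification step, using z-scores; the daily order counts and the 2-sigma threshold are assumptions made for the example, not part of the slide:

```python
# A rough sketch of anomaly identification using z-scores.
# The daily order counts and the threshold of 2 are illustrative assumptions.
import statistics

daily_orders = [102, 98, 110, 105, 97, 101, 250, 99, 103, 100]

mean = statistics.mean(daily_orders)
stdev = statistics.stdev(daily_orders)

# Flag days whose order count deviates strongly from the mean.
for day, orders in enumerate(daily_orders, start=1):
    z = (orders - mean) / stdev
    if abs(z) > 2:                      # unusually high or low day
        print(f"Day {day}: {orders} orders looks anomalous (z = {z:.1f})")
```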
Predictive analytics

• Predictive analytics helps answer questions about what will happen in the future. These techniques use historical data to identify trends and determine if they are likely to recur.
• Predictive analytical tools provide valuable insight into what may happen in the future, and its techniques include a variety of statistical and machine learning methods, such as neural networks, decision trees, and regression.
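As one hedged illustration, a decision tree (one of the techniques named above) can be fitted to historical observations to predict a future value; the feature names and numbers here are invented for the sketch:

```python
# A small predictive-analytics sketch with a decision tree regressor.
# Features (ad spend, month index) and sales values are invented for illustration.
from sklearn.tree import DecisionTreeRegressor

# Historical observations: [ad_spend_in_k, month_index] -> sales_in_k
X = [[10, 1], [12, 2], [11, 3], [15, 4], [14, 5], [18, 6]]
y = [100, 115, 108, 140, 135, 160]

model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(X, y)

# Predict sales for a hypothetical upcoming month.
print("Forecast:", model.predict([[16, 7]])[0])
```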
Prescriptive analytics

• Prescriptive analytics helps answer questions about what should be done. By using insights from predictive analytics, data-driven decisions can be made.
• This allows businesses to make informed decisions in the face of uncertainty. Prescriptive analytics techniques rely on machine learning strategies that can find patterns in large datasets.
• By analyzing past decisions and events, the likelihood of different outcomes can be estimated.
Methods

https://www.datapine.com
Cluster analysis

• The action of grouping a set of data elements in a way that said elements are more similar (in a particular sense) to each other than to those in other groups – hence the term ‘cluster.’
• Since there is no target variable when clustering, the method is often used to find hidden patterns in the data. The approach is also used to provide additional context to a trend or dataset.
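For instance, a minimal clustering sketch with k-means; the 2-D customer points and the choice of two clusters are illustrative assumptions:

```python
# A minimal clustering sketch with k-means.
# Each row could represent a customer: [annual_spend_k, visits_per_month];
# the points and k=2 are invented for illustration.
from sklearn.cluster import KMeans

points = [[5, 1], [6, 2], [5.5, 1.5], [40, 12], [42, 11], [39, 13]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print("Cluster labels:", kmeans.labels_)          # group assignment per point
print("Cluster centres:", kmeans.cluster_centers_)
```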
Cohort analysis

• This type of data analysis method uses historical data to examine and compare a determined segment of users' behavior, which can then be grouped with others with similar characteristics.
• By using this data analysis methodology, it's possible to gain a wealth of insight into consumer needs or a firm understanding of a broader target group.
• Cohort analysis can be really useful in marketing, as it allows you to understand the impact of your campaigns on specific groups of customers.
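A small sketch of the idea: group users into cohorts by signup month and compare average behavior per cohort. All user data below is fabricated for illustration:

```python
# A small cohort-analysis sketch: grouping users by signup month and comparing
# average purchases per cohort. The user records are fabricated.
import pandas as pd

users = pd.DataFrame({
    "user_id":      [1, 2, 3, 4, 5, 6],
    "signup_month": ["2023-01", "2023-01", "2023-02", "2023-02", "2023-02", "2023-03"],
    "purchases":    [5, 3, 4, 6, 2, 1],
})

# Each signup month forms a cohort; compare behaviour across cohorts.
cohorts = users.groupby("signup_month")["purchases"].agg(["count", "mean"])
print(cohorts)
```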
Regression analysis

• Regression analysis uses historical data to understand how a dependent variable's value is affected when one (linear regression) or more (multiple regression) independent variables change or stay the same.
• By understanding each variable's relationship and how it developed in the past, you can anticipate possible outcomes and make better business decisions in the future.
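A brief sketch of multiple regression with two independent variables; the advertising, price, and sales figures are invented for the example:

```python
# A brief regression-analysis sketch: fitting a multiple linear regression.
# The ad-spend / price / sales numbers are invented for illustration only.
from sklearn.linear_model import LinearRegression

# Independent variables: [ad_spend_k, unit_price]; dependent variable: units sold
X = [[10, 20], [12, 19], [15, 18], [18, 18], [20, 17], [25, 16]]
y = [200, 230, 280, 320, 350, 420]

reg = LinearRegression().fit(X, y)

print("Coefficients:", reg.coef_)        # effect of each independent variable
print("Intercept:", reg.intercept_)
print("Prediction for [22, 17]:", reg.predict([[22, 17]])[0])
```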
Neural networks

• The neural network forms the basis for the intelligent algorithms of machine learning.
• It is a form of data-driven analytics that attempts, with minimal intervention, to understand how the human brain would process insights and predict values.
• Neural networks learn from each and every data transaction, meaning that they evolve and advance over time.
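A tiny hedged sketch of a neural network using scikit-learn's MLPClassifier; the toy "churn" data and the network size are assumptions made for the example:

```python
# A tiny neural-network sketch with a multilayer perceptron classifier.
# Features: [monthly_spend_k, support_tickets]; label: 1 = churned, 0 = stayed.
# The data and the single 8-unit hidden layer are illustrative choices.
from sklearn.neural_network import MLPClassifier

X = [[1, 5], [2, 4], [1.5, 6], [8, 0], [9, 1], [7, 0]]
y = [1, 1, 1, 0, 0, 0]

net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=1000, random_state=0)
net.fit(X, y)

print("Predicted churn for [2, 5]:", net.predict([[2, 5]])[0])
```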
Factor analysis

• Factor analysis, also called “dimension reduction,” is a type of data analysis used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.
• The aim here is to uncover independent latent variables, making it an ideal method for streamlining specific data segments.
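A compact sketch: reducing four correlated survey items to two latent factors; the survey scores below are fabricated for illustration:

```python
# A compact factor-analysis sketch: four correlated questionnaire items are
# described by two latent factors. The respondent scores are fabricated.
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Rows are respondents; columns are four correlated questionnaire items.
scores = np.array([
    [5, 4, 1, 2],
    [4, 5, 2, 1],
    [5, 5, 1, 1],
    [2, 1, 5, 4],
    [1, 2, 4, 5],
    [2, 2, 5, 5],
])

fa = FactorAnalysis(n_components=2, random_state=0)
latent = fa.fit_transform(scores)

print("Factor loadings:\n", fa.components_)   # how each item loads on each factor
print("Respondent factor scores:\n", latent)
```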
Data Mining

• A method of analysis that is the umbrella term for engineering metrics and insights for additional value, direction, and context.
• By using exploratory statistical evaluation, data mining aims to identify dependencies, relations, data patterns, and trends to generate advanced knowledge.
• When considering how to analyze data, adopting a data mining mindset is essential to success; as such, it is an area worth exploring in greater detail.
Text analysis

• Text analysis, also known in the industry as text mining, is the process of taking large sets of textual data and arranging it in a way that makes it easier to manage.
• By working through this cleansing process in stringent detail, you will be able to extract the data that is truly relevant to your business and use it to develop actionable insights that will propel you forward.
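A simple sketch of the first step of text mining: turning raw feedback text into word counts. The feedback strings and the stop-word list are assumptions for the example:

```python
# A simple text-analysis sketch: tokenizing feedback and counting frequent terms.
# The feedback strings and stop-word list are invented for illustration.
from collections import Counter
import re

feedback = [
    "Great product, fast delivery",
    "Delivery was slow but the product is great",
    "Great support and great product",
]

stop_words = {"the", "is", "was", "and", "but", "a"}

words = []
for comment in feedback:
    # Lower-case, strip punctuation, and drop stop words.
    for token in re.findall(r"[a-z]+", comment.lower()):
        if token not in stop_words:
            words.append(token)

print(Counter(words).most_common(3))   # the most frequent meaningful terms
```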
Data Analysis Techniques

https://www.datapine.com
Hadoop

High Availability Distributed Object Oriented Platform

Created by Doug Cutting in 2006, who named it after his son’s stuffed yellow elephant, and based on Google’s MapReduce paper of 2004, Hadoop is an open-source framework for fault-tolerant, scalable, distributed computing on commodity hardware.
Hadoop

Hadoop is an open-source framework that allows users to store and process big data in a distributed environment across clusters of computers using simple programming models.
The latest stable version of Apache Hadoop is 3.3.1.
The four main components of Hadoop are −
Hadoop Distributed File System (HDFS) − This is a storage system that
breaks large files into smaller pieces and distributes them across multiple
computers in a cluster. It ensures data reliability and enables parallel
processing of data across the cluster.
MapReduce − This is a programming model used for processing and
analyzing large datasets in parallel across the cluster. It consists of two main
tasks: Map, which processes and transforms input data into intermediate
key-value pairs, and Reduce, which aggregates and summarizes the
intermediate data to produce the final output.
YARN (Yet Another Resource Negotiator) − YARN is a resource
management and job scheduling component of Hadoop. It allocates
resources (CPU, memory) to various applications running on the cluster and
manages their execution efficiently.
Hadoop Common − This includes libraries and utilities used by other
Hadoop components. It provides tools and infrastructure for the entire
Hadoop ecosystem, such as authentication, configuration, and logging.
Hadoop is not a data warehouse

Hadoop is not a data warehouse, because the two serve different purposes and have different architectures. Hadoop is a framework for storing and processing large volumes of unstructured and semi-structured data across distributed clusters of computers. It is designed for handling big data and supports batch processing of large datasets using technologies like HDFS and MapReduce.
advantage of Hadoop

The biggest advantage of Hadoop is its ability to handle and process large volumes of data efficiently. Hadoop is designed to distribute data and processing tasks across multiple computers in a cluster, allowing it to scale easily to handle massive datasets that traditional databases or processing systems struggle to manage.
This enables organizations to store, process, and analyze huge amounts of data, gaining valuable insights and making informed decisions that would not be possible with conventional technologies.
Which software is used in Hadoop?

Hadoop Distributed File System (HDFS) stores large datasets across a cluster of computers, breaking them into smaller pieces for efficient storage and retrieval.
YARN manages computing resources across the cluster, allocating resources to different applications and ensuring efficient execution.
MapReduce is the processing engine that divides data processing tasks into smaller parts and executes them in parallel across the cluster.
Traditional Approach
In this approach, an enterprise has a single computer to store and process big data. For storage, programmers rely on their choice of database vendors, such as Oracle or IBM.
In this approach, the user interacts with the
application, which in turn handles the part of
data storage and analysis.
Limitation
This approach works fine with those applications
that process less voluminous data that can be
accommodated by standard database servers, or
up to the limit of the processor that is processing
the data.
But when it comes to dealing with huge amounts of scalable data, processing everything through a single database server becomes a bottleneck.
Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns them to many computers, and collects the results from them, which, when integrated, form the result dataset.
Hadoop
Using the solution provided by Google, Doug
Cutting and his team developed an Open
Source Project called HADOOP.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel across the nodes of the cluster.
Hadoop is used to develop applications that
could perform complete statistical analysis on
huge amounts of data.
How does Hadoop solve the problem of Big
Data?
The proposed solution for the problem of big
data should:
Implement good recovery strategies
Be horizontally scalable as data grows
Be cost-effective
Minimize the learning curve
Be easy for programmers and data analysts, and
even for non-programmers, to work with
Hadoop Architecture
Hadoop has two major layers, namely:
Processing/Computation layer (MapReduce), and
Storage layer (Hadoop Distributed File System).
MapReduce

MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The MapReduce program runs on Hadoop, which is an Apache open-source framework.
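For instance, the classic word-count job can be sketched in Python in the style of MapReduce. This is a local simulation of the model (the shuffle step is approximated with an in-memory dictionary), not code taken from the Hadoop distribution itself:

```python
# A local, simplified simulation of the MapReduce word-count example.
# Real Hadoop jobs run the map and reduce phases on different cluster nodes;
# here the shuffle/sort step is approximated with an in-memory dictionary.
from collections import defaultdict

def map_phase(line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Reduce: aggregate all intermediate values for one key.
    return word, sum(counts)

documents = [
    "big data needs big clusters",
    "hadoop processes big data",
]

# Shuffle: group intermediate pairs by key before reducing.
grouped = defaultdict(list)
for line in documents:
    for word, count in map_phase(line):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(sorted(results))    # e.g. [('big', 3), ('clusters', 1), ...]
```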
Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware.
It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

Apart from the two core layers above, the Hadoop framework also includes the following modules:
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
Hadoop with Big Data - Applications
Is Hadoop a Database?
Typically, Hadoop is not a database. Rather, it is
a software ecosystem that allows for parallel
computing of extremely large data sets.
