Unit 3 Data-Analytics

BIG DATA


Data Analytics
Data Analysis

• Data analysis is defined as a process of cleaning, transforming, and modeling data to discover useful information for business decision-making.
• The purpose of data analysis is to extract useful information from data and to take decisions based upon that analysis.
Benefits
• Better communication with customers.
• Mitigation of fraudulent activities.
• New products & services.
• Allocation of capital & human resources.
• Reduced costs.
Data Analysis – Types

• There are several types of data analysis techniques, depending on the business domain and the technology involved.
• However, the major data analysis methods are:
– Text Analysis
– Statistical Analysis
– Diagnostic Analysis
– Predictive Analysis
– Prescriptive Analysis
Descriptive Analytics

• Descriptive analytics helps answer questions about what happened. These techniques summarize large datasets to describe outcomes to stakeholders.
• Specialized metrics are developed to track performance in specific industries. This process requires the collection of relevant data, processing of the data, data analysis, and data visualization, and it provides essential insight into past performance.
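As a minimal sketch of this idea in Python, past figures can be summarized with pandas; the monthly revenue numbers below are invented purely for illustration:

```python
# A minimal descriptive-analytics sketch: summarizing past sales figures.
# The revenue numbers are fabricated for illustration only.
import pandas as pd

sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "revenue": [12000, 13500, 12800, 15100, 14900, 16200],
})

# Summary statistics describe "what happened" over the period.
print(sales["revenue"].describe())          # count, mean, std, min, quartiles, max
print("Total revenue:", sales["revenue"].sum())
print("Best month:", sales.loc[sales["revenue"].idxmax(), "month"])
```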
Diagnostic analytics

• Diagnostic analytics helps answer questions about why things happened. These techniques supplement more basic descriptive analytics.
• They take the findings from descriptive analytics and dig deeper to find the cause. The performance indicators are further investigated to discover why they got better or worse. This generally occurs in three steps (a small Python sketch of the first step follows the list):
– Identify anomalies in the data. These may be unexpected changes in a metric or a particular market.
– Data that is related to these anomalies is collected.
– Statistical techniques are used to find relationships and trends that explain these anomalies.
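A rough sketch of the anomaly-identification step, using z-scores; the daily order counts and the 2-sigma threshold are assumptions made for the example, not part of the slide:

```python
# A rough sketch of anomaly identification using z-scores.
# The daily order counts and the threshold of 2 are illustrative assumptions.
import statistics

daily_orders = [102, 98, 110, 105, 97, 101, 250, 99, 103, 100]

mean = statistics.mean(daily_orders)
stdev = statistics.stdev(daily_orders)

# Flag days whose order count deviates strongly from the mean.
for day, orders in enumerate(daily_orders, start=1):
    z = (orders - mean) / stdev
    if abs(z) > 2:                      # unusually high or low day
        print(f"Day {day}: {orders} orders looks anomalous (z = {z:.1f})")
```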
Predictive analytics

• Predictive analytics helps answer questions about what will happen in the future. These techniques use historical data to identify trends and determine if they are likely to recur.
• Predictive analytical tools provide valuable insight into what may happen in the future, and its techniques include a variety of statistical and machine learning methods, such as neural networks, decision trees, and regression.
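As one hedged illustration, a decision tree (one of the techniques named above) can be fitted to historical observations to predict a future value; the feature names and numbers here are invented for the sketch:

```python
# A small predictive-analytics sketch with a decision tree regressor.
# Features (ad spend, month index) and sales values are invented for illustration.
from sklearn.tree import DecisionTreeRegressor

# Historical observations: [ad_spend_in_k, month_index] -> sales_in_k
X = [[10, 1], [12, 2], [11, 3], [15, 4], [14, 5], [18, 6]]
y = [100, 115, 108, 140, 135, 160]

model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(X, y)

# Predict sales for a hypothetical upcoming month.
print("Forecast:", model.predict([[16, 7]])[0])
```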
Prescriptive analytics

• Prescriptive analytics helps answer questions about what should be done. By using insights from predictive analytics, data-driven decisions can be made.
• This allows businesses to make informed decisions in the face of uncertainty. Prescriptive analytics techniques rely on machine learning strategies that can find patterns in large datasets.
• By analyzing past decisions and events, the likelihood of different outcomes can be estimated.
Methods

https://www.datapine.com
Cluster analysis

• The action of grouping a set of data elements in a way that said elements are more similar (in a particular sense) to each other than to those in other groups – hence the term ‘cluster.’
• Since there is no target variable when clustering, the method is often used to find hidden patterns in the data. The approach is also used to provide additional context to a trend or dataset.
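For instance, a minimal clustering sketch with k-means; the 2-D customer points and the choice of two clusters are illustrative assumptions:

```python
# A minimal clustering sketch with k-means.
# Each row could represent a customer: [annual_spend_k, visits_per_month];
# the points and k=2 are invented for illustration.
from sklearn.cluster import KMeans

points = [[5, 1], [6, 2], [5.5, 1.5], [40, 12], [42, 11], [39, 13]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print("Cluster labels:", kmeans.labels_)          # group assignment per point
print("Cluster centres:", kmeans.cluster_centers_)
```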
Cohort analysis

• This type of data analysis method uses historical data to examine and compare a determined segment of users' behavior, which can then be grouped with others with similar characteristics.
• By using this data analysis methodology, it's possible to gain a wealth of insight into consumer needs or a firm understanding of a broader target group.
• Cohort analysis can be really useful in marketing, as it allows you to understand the impact of your campaigns on specific groups of customers.
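A small sketch of the idea: group users into cohorts by signup month and compare average behavior per cohort. All user data below is fabricated for illustration:

```python
# A small cohort-analysis sketch: grouping users by signup month and comparing
# average purchases per cohort. The user records are fabricated.
import pandas as pd

users = pd.DataFrame({
    "user_id":      [1, 2, 3, 4, 5, 6],
    "signup_month": ["2023-01", "2023-01", "2023-02", "2023-02", "2023-02", "2023-03"],
    "purchases":    [5, 3, 4, 6, 2, 1],
})

# Each signup month forms a cohort; compare behaviour across cohorts.
cohorts = users.groupby("signup_month")["purchases"].agg(["count", "mean"])
print(cohorts)
```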
Regression analysis

• Regression analysis uses historical data to understand how a dependent variable's value is affected when one (linear regression) or more (multiple regression) independent variables change or stay the same.
• By understanding each variable's relationship and how it developed in the past, you can anticipate possible outcomes and make better business decisions in the future.
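A brief sketch of multiple regression with two independent variables; the advertising, price, and sales figures are invented for the example:

```python
# A brief regression-analysis sketch: fitting a multiple linear regression.
# The ad-spend / price / sales numbers are invented for illustration only.
from sklearn.linear_model import LinearRegression

# Independent variables: [ad_spend_k, unit_price]; dependent variable: units sold
X = [[10, 20], [12, 19], [15, 18], [18, 18], [20, 17], [25, 16]]
y = [200, 230, 280, 320, 350, 420]

reg = LinearRegression().fit(X, y)

print("Coefficients:", reg.coef_)        # effect of each independent variable
print("Intercept:", reg.intercept_)
print("Prediction for [22, 17]:", reg.predict([[22, 17]])[0])
```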
Neural networks

• The neural network forms the basis for the intelligent algorithms of machine learning.
• It is a form of data-driven analytics that attempts, with minimal intervention, to understand how the human brain would process insights and predict values.
• Neural networks learn from each and every data transaction, meaning that they evolve and advance over time.
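A tiny hedged sketch of a neural network using scikit-learn's MLPClassifier; the toy "churn" data and the network size are assumptions made for the example:

```python
# A tiny neural-network sketch with a multilayer perceptron classifier.
# Features: [monthly_spend_k, support_tickets]; label: 1 = churned, 0 = stayed.
# The data and the single 8-unit hidden layer are illustrative choices.
from sklearn.neural_network import MLPClassifier

X = [[1, 5], [2, 4], [1.5, 6], [8, 0], [9, 1], [7, 0]]
y = [1, 1, 1, 0, 0, 0]

net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=1000, random_state=0)
net.fit(X, y)

print("Predicted churn for [2, 5]:", net.predict([[2, 5]])[0])
```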
Factor analysis

• Factor analysis, also called “dimension reduction,” is a type of data analysis used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.
• The aim here is to uncover independent latent variables, making it an ideal method for streamlining specific data segments.
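A compact sketch: reducing four correlated survey items to two latent factors; the survey scores below are fabricated for illustration:

```python
# A compact factor-analysis sketch: four correlated questionnaire items are
# described by two latent factors. The respondent scores are fabricated.
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Rows are respondents; columns are four correlated questionnaire items.
scores = np.array([
    [5, 4, 1, 2],
    [4, 5, 2, 1],
    [5, 5, 1, 1],
    [2, 1, 5, 4],
    [1, 2, 4, 5],
    [2, 2, 5, 5],
])

fa = FactorAnalysis(n_components=2, random_state=0)
latent = fa.fit_transform(scores)

print("Factor loadings:\n", fa.components_)   # how each item loads on each factor
print("Respondent factor scores:\n", latent)
```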
Data Mining

• A method of analysis that is the umbrella term for engineering metrics and insights for additional value, direction, and context.
• By using exploratory statistical evaluation, data mining aims to identify dependencies, relations, data patterns, and trends to generate advanced knowledge.
• When considering how to analyze data, adopting a data mining mindset is essential to success; as such, it is an area worth exploring in greater detail.
Text analysis

• Text analysis, also known in the industry as text mining, is the process of taking large sets of textual data and arranging it in a way that makes it easier to manage.
• By working through this cleansing process in stringent detail, you will be able to extract the data that is truly relevant to your business and use it to develop actionable insights that will propel you forward.
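A simple sketch of the first step of text mining: turning raw feedback text into word counts. The feedback strings and the stop-word list are assumptions for the example:

```python
# A simple text-analysis sketch: tokenizing feedback and counting frequent terms.
# The feedback strings and stop-word list are invented for illustration.
from collections import Counter
import re

feedback = [
    "Great product, fast delivery",
    "Delivery was slow but the product is great",
    "Great support and great product",
]

stop_words = {"the", "is", "was", "and", "but", "a"}

words = []
for comment in feedback:
    # Lower-case, strip punctuation, and drop stop words.
    for token in re.findall(r"[a-z]+", comment.lower()):
        if token not in stop_words:
            words.append(token)

print(Counter(words).most_common(3))   # the most frequent meaningful terms
```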
Data Analysis Techniques

https://www.datapine.com
Hadoop

High Availability Distributed Object Oriented Platform

Created by Doug Cutting in 2006, who named it after his son’s stuffed yellow elephant, and based on Google’s MapReduce paper of 2004, Hadoop is an open-source framework for fault-tolerant, scalable, distributed computing on commodity hardware.
Hadoop

Hadoop is an open-source framework that allows users to store and process big data in a distributed environment across clusters of computers using simple programming models.
The latest stable version of Apache Hadoop is 3.3.1.
The four main components of Hadoop are −
Hadoop Distributed File System (HDFS) − This is a storage system that
breaks large files into smaller pieces and distributes them across multiple
computers in a cluster. It ensures data reliability and enables parallel
processing of data across the cluster.
MapReduce − This is a programming model used for processing and
analyzing large datasets in parallel across the cluster. It consists of two main
tasks: Map, which processes and transforms input data into intermediate
key-value pairs, and Reduce, which aggregates and summarizes the
intermediate data to produce the final output.
YARN (Yet Another Resource Negotiator) − YARN is a resource
management and job scheduling component of Hadoop. It allocates
resources (CPU, memory) to various applications running on the cluster and
manages their execution efficiently.
Hadoop Common − This includes libraries and utilities used by other
Hadoop components. It provides tools and infrastructure for the entire
Hadoop ecosystem, such as authentication, configuration, and logging.
Hadoop is not a data warehouse

Hadoop is not a data warehouse, because the two serve different purposes and have different architectures. Hadoop is a framework for storing and processing large volumes of unstructured and semi-structured data across distributed clusters of computers. It is designed for handling big data and supports batch processing of large datasets using technologies like HDFS and MapReduce.
advantage of Hadoop

The biggest advantage of Hadoop is its ability to handle and process large volumes of data efficiently. Hadoop is designed to distribute data and processing tasks across multiple computers in a cluster, allowing it to scale easily to handle massive datasets that traditional databases or processing systems struggle to manage.
This enables organizations to store, process, and analyze huge amounts of data, gaining valuable insights and making informed decisions that would not be possible with conventional technologies.
Which software is used in Hadoop?

Hadoop Distributed File System (HDFS) stores large datasets across a cluster of computers, breaking them into smaller pieces for efficient storage and retrieval.
YARN manages computing resources across the cluster, allocating resources to different applications and ensuring efficient execution.
MapReduce is the processing engine that divides data processing tasks into smaller parts and executes them in parallel across the cluster.
Traditional Approach
In this approach, an enterprise has a single computer to store and process big data. For storage, programmers rely on their choice of database vendors, such as Oracle or IBM.
In this approach, the user interacts with the
application, which in turn handles the part of
data storage and analysis.
Limitation
This approach works fine with those applications
that process less voluminous data that can be
accommodated by standard database servers, or
up to the limit of the processor that is processing
the data.
But when it comes to dealing with huge amounts of scalable data, processing everything through a single database server becomes a bottleneck.
Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns them to many computers, and collects the results from them, which, when integrated, form the result dataset.
Hadoop
Using the solution provided by Google, Doug
Cutting and his team developed an Open
Source Project called HADOOP.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel across the nodes of the cluster.
Hadoop is used to develop applications that
could perform complete statistical analysis on
huge amounts of data.
How does Hadoop solve the problem of Big
Data?
The proposed solution for the problem of big
data should:
Implement good recovery strategies
Be horizontally scalable as data grows
Be cost-effective
Minimize the learning curve
Be easy for programmers and data analysts, and
even for non-programmers, to work with
Hadoop Architecture
Hadoop has two major layers, namely:
Processing/Computation layer (MapReduce), and
Storage layer (Hadoop Distributed File System).
MapReduce

MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The MapReduce program runs on Hadoop, which is an Apache open-source framework.
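For instance, the classic word-count job can be sketched in Python in the style of MapReduce. This is a local simulation of the model (the shuffle step is approximated with an in-memory dictionary), not code taken from the Hadoop distribution itself:

```python
# A local, simplified simulation of the MapReduce word-count example.
# Real Hadoop jobs run the map and reduce phases on different cluster nodes;
# here the shuffle/sort step is approximated with an in-memory dictionary.
from collections import defaultdict

def map_phase(line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Reduce: aggregate all intermediate values for one key.
    return word, sum(counts)

documents = [
    "big data needs big clusters",
    "hadoop processes big data",
]

# Shuffle: group intermediate pairs by key before reducing.
grouped = defaultdict(list)
for line in documents:
    for word, count in map_phase(line):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(sorted(results))    # e.g. [('big', 3), ('clusters', 1), ...]
```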
Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware.
It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

Apart from the two core layers above, the Hadoop framework also includes the following modules:
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
Hadoop with Big Data - Applications
Is Hadoop a Database?
Typically, Hadoop is not a database. Rather, it is
a software ecosystem that allows for parallel
computing of extremely large data sets.
