
Data Analytics

Correlation analysis is a statistical method used to measure the strength of the linear relationship between two variables and to compute their association. Simply put, correlation analysis quantifies how much one variable tends to change when the other changes.
The Correlation Coefficient

The correlation coefficient is the measure used to quantify the strength of the linear relationship between the variables involved in a correlation analysis. It is easily identifiable since it is represented by the symbol r, and it is a unitless value that lies between -1 and 1.

Coefficient of determination = r²
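As a quick illustration, the correlation coefficient r and the coefficient of determination r² can be computed in a few lines of Python; the treadmill-style sample values below are made up for illustration.

import numpy as np

# Hypothetical sample data: time on a treadmill (minutes) vs. calories burned
minutes = np.array([10, 20, 30, 40, 50], dtype=float)
calories = np.array([80, 150, 240, 310, 400], dtype=float)

# Pearson correlation coefficient r (a unitless value between -1 and 1)
r = np.corrcoef(minutes, calories)[0, 1]

# Coefficient of determination r^2
r_squared = r ** 2

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")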

Types of correlation analysis

Correlation between two variables can be either a positive correlation, a negative correlation, or no correlation. Let's look at examples of each of these three types.
• Positive linear correlation: A positive correlation between two variables
means both the variables move in the same direction. An increase in one
variable leads to an increase in the other variable and vice versa.

For example, spending more time on a treadmill burns more calories.

• Negative linear correlation: A negative correlation between two variables means that the variables move in opposite directions. An increase in one variable leads to a decrease in the other variable and vice versa.

For example, increasing the speed of a vehicle decreases the time you take to
reach your destination.

• Weak/Zero correlation: No correlation exists when one variable does not affect the other.

For example, there is no correlation between the number of years of school a person has attended and the number of letters in his/her name.

Non-linear Correlation (known as curvilinear correlation)

There is a non-linear correlation when there is a relationship between variables but the relationship is not linear (straight).
Uses of correlation analysis

Correlation analysis is used to study practical cases in which the researcher cannot manipulate individual variables. For example, correlation analysis can measure the correlation between a patient's blood pressure and the medication used.

Marketers use it to measure the effectiveness of advertising. Researchers measure the increase/decrease in sales due to a specific marketing campaign.

Advantages of correlation analysis

In statistics, correlation refers to the fact that there is a link between various
events. One of the tools to infer whether such a link exists is correlation
analysis. Practical simplicity is undoubtedly one of its main advantages.

To perform reliable correlation analysis, it is essential to make in-depth observations of the two variables, which strengthens the results obtained. Some of the most notable benefits of correlation analysis are:
• Awareness of the behavior between two variables: A correlation helps to identify the absence or presence of a relationship between two variables, which makes it directly relevant to everyday questions.

• A good starting point for research: It proves to be a good starting point when a researcher starts investigating relationships for the first time.

• Uses for further studies: Researchers can identify the direction and strength of the relationship between two variables and then narrow the findings down in later studies.

• Simple metrics: Research findings are simple to classify. The findings can
range from -1.00 to 1.00. There can be only three potential broad
outcomes of the analysis.

Maximum Likelihood Test

In data analytics, the Maximum Likelihood Test (MLT) is a statistical method used for hypothesis testing, parameter estimation, and model selection. It is a powerful technique for making inferences about population parameters based on sample data. Maximum likelihood estimation (MLE) is closely related to MLT and is often used in conjunction with it.

Here are the key concepts and steps involved in the Maximum Likelihood Test:

Likelihood Function:

The likelihood function describes how likely a given set of parameter values is to have produced the observed data.

It is the probability of observing the data given a particular set of parameter values, viewed as a function of those parameter values.

Maximum Likelihood Estimation (MLE):

MLE is the process of finding the parameter values that maximize the
likelihood function.
In other words, it finds the parameter values that make the observed data
most probable under the assumed statistical model.

Hypothesis Testing:

MLT is often used for hypothesis testing. It helps determine whether a proposed model or parameter value is a good fit for the data.

The null hypothesis (H0) typically assumes a specific parameter value, while
the alternative hypothesis (H1) allows for different parameter values.

MLT calculates the likelihood under the null hypothesis (L(H0)) and the
likelihood under the alternative hypothesis (L(H1)).

Likelihood Ratio Test (LRT):

The Likelihood Ratio Test is a statistical test used in MLT.

It compares the likelihood of the data under the null hypothesis to the
likelihood under the alternative hypothesis.

The test statistic is often denoted as the log-likelihood ratio or -2 times the
log-likelihood ratio, and it follows a chi-squared distribution under certain
conditions.
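As a minimal sketch of the likelihood ratio test in practice, the example below compares a binomial null hypothesis (a fair coin, p = 0.5) against the alternative in which p is set to its maximum likelihood estimate. The data (12 heads in 40 tosses) are made up for illustration, and scipy is assumed to be available.

import numpy as np
from scipy import stats

# Hypothetical data: 12 heads observed in 40 coin tosses
n, heads = 40, 12

# Log-likelihood under H0 (fair coin, p = 0.5)
loglik_h0 = stats.binom.logpmf(heads, n, 0.5)

# Log-likelihood under H1 (p at its maximum likelihood estimate, heads / n)
p_hat = heads / n
loglik_h1 = stats.binom.logpmf(heads, n, p_hat)

# Test statistic: -2 times the log-likelihood ratio,
# approximately chi-squared with 1 degree of freedom here
lrt_stat = -2 * (loglik_h0 - loglik_h1)
p_value = stats.chi2.sf(lrt_stat, df=1)

print(f"LRT statistic = {lrt_stat:.3f}, p-value = {p_value:.4f}")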

Critical Value or P-value:

The critical value is a threshold value from the chi-squared distribution that helps determine whether to reject or fail to reject the null hypothesis.

Alternatively, you can calculate a p-value, which represents the probability of obtaining a test statistic as extreme as the one observed, assuming the null hypothesis is true. A small p-value suggests rejecting the null hypothesis.

Decision:

If the test statistic exceeds the critical value or if the p-value is smaller than a
chosen significance level (alpha), you reject the null hypothesis in favor of the
alternative hypothesis.

If the test statistic does not exceed the critical value or if the p-value is larger
than alpha, you fail to reject the null hypothesis.
MLT can be applied in various data analytics scenarios, such as comparing different models, testing the significance of parameters, and assessing goodness-of-fit. It is a fundamental tool in statistical analysis and hypothesis testing in data science and analytics.

Difference between Likelihood and Probability:

1. Likelihood refers to past events with known outcomes, while probability refers to the occurrence of future events. For example, "I flipped a coin 10 times and obtained 10 heads. What is the likelihood that the coin is fair?" is a likelihood question, whereas "I flipped a coin 10 times. What is the probability of it landing heads (or tails) every time?" is a probability question.

2. Likelihood asks: given the fixed outcomes (data), what is the likelihood of different parameter values? Probability asks: given the fixed parameter (p = 0.5), what is the probability of different outcomes?

3. Likelihoods do not need to add up to 1, whereas probabilities add up to 1.

Informally, probability measures the chance of an outcome occurring, while likelihood measures how well a parameter value explains outcomes that have already been observed.

Simple Explanation – Maximum Likelihood Estimation using MS Excel

Problem: What is the probability of heads for a coin that, when tossed 40 times, produced 19 heads?

Observation: When the assumed probability of heads is low, in the range of 0% to 10%, the likelihood of getting 19 heads in 40 tosses is also very low. As we move to higher values, in the range of 30% to 40%, the likelihood of getting 19 heads in 40 tosses rises higher and higher.

After an initial increase, the likelihood gradually decreases once the assumed probability passes an intermediate point, the peak value. This peak value is called the maximum likelihood, and the probability at which it occurs is the maximum likelihood estimate.
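The same exercise the Excel walkthrough above describes can be sketched in Python: evaluate the binomial likelihood of 19 heads in 40 tosses over a grid of candidate probabilities and locate the peak. This is an illustrative sketch, not part of the original worksheet.

import numpy as np
from scipy import stats

n, heads = 40, 19

# Candidate values for the probability of heads
p_grid = np.linspace(0.01, 0.99, 99)

# Likelihood of observing 19 heads in 40 tosses at each candidate p
likelihood = stats.binom.pmf(heads, n, p_grid)

# The maximum likelihood estimate is the p at the peak of the curve
p_mle = p_grid[np.argmax(likelihood)]
print(f"Maximum likelihood estimate of p: {p_mle:.2f}")  # close to 19/40 = 0.475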
Major Steps in MLE:
1. Perform a certain experiment to collect the data.
2. Choose a parametric model of the data, with certain modifiable
parameters.
3. Formulate the likelihood as an objective function to be maximized.
4. Maximize the objective function and derive the parameters of the model.
Examples:

• Toss a coin – to find the probabilities of heads and tails
• Throw a dart – to find the PDF of your distance to the bull's eye
• Sample a group of animals – to estimate the number of animals in the population

Sec-2 U-1 Data Analysis Techniques


Regression analysis is a statistical technique used in data analysis to model and explore the relationship between a dependent variable (also called the response or target variable) and one or more independent variables (predictors or explanatory variables). It is a valuable tool for understanding how changes in the independent variables are associated with changes in the dependent variable. Regression analysis comes in various types, each suited to different scenarios; a brief code sketch of a few of them appears after the applications list at the end of this section. Here are some common types of regression analysis:
1. Linear Regression:
• Simple Linear Regression: Involves a single independent variable to
predict the dependent variable.
• Multiple Linear Regression: Incorporates multiple independent
variables to predict the dependent variable.
• Polynomial Regression: Uses polynomial functions (e.g., quadratic,
cubic) to model nonlinear relationships between variables.
2. Logistic Regression:
• Logistic regression is used when the dependent variable is binary (i.e., it
has only two possible outcomes, such as yes/no, 1/0).
• It is used to estimate the probability of certain events that are
mutually exclusive, for example, happy/sad, normal/abnormal, or
pass/fail. The value of probability strictly ranges between 0 and 1.
3. Nonlinear Regression:
Used when the relationship between the dependent and independent
variables is not linear.
Models include exponential, logarithmic, and power functions.
4. Quantile Regression

Quantile Regression is an econometric technique that is used when the necessary conditions for Linear Regression are not duly met. It is an extension of Linear Regression analysis, i.e., we can use it when outliers are present in the data, as its estimates are more robust against outliers than those of linear regression.

5. Ridge Regression

To understand Ridge Regression, we first need to go through the concept of Regularization.

Regularization: There are two types of regularization, L1 regularization and L2 regularization. L1 regularization adds a penalty equal to the absolute value of the coefficients, which restricts their size and can shrink some coefficients to exactly zero, effectively removing them. L2 regularization, on the other hand, adds a penalty equal to the square of the coefficients, which shrinks them without eliminating them. Ridge Regression is linear regression fitted with the L2 penalty.

By constraining coefficient sizes in this way, regularization addresses overfitting, the scenario where a model performs well on training data but underperforms on validation data.

6. Lasso Regression

LASSO (Least Absolute Shrinkage and Selection Operator) is a regression technique that was first introduced in geophysics. The term “Lasso” was coined by Professor Robert Tibshirani. Just like Ridge Regression, it uses regularization (in this case the L1 penalty) to estimate the coefficients, and this penalty also performs variable selection, making the model more parsimonious.

7. Poisson Regression

Poisson Regression can be used, for example, to predict the number of customer-care calls related to a particular product. Poisson regression is used when the dependent variable is a count. It is also known as the log-linear model when it is used to model contingency tables. Its dependent variable y is assumed to follow a Poisson distribution.
8. Negative Binomial Regression

Similar to Poisson regression, negative binomial regression also deals with count data; the difference is that negative binomial regression does not assume that the variance of the count equals its mean, so it can handle overdispersion.

9. Quasi-Poisson Regression

Quasi-Poisson Regression is an alternative to negative binomial regression. The technique can also be used for overdispersed count data.

Applications of regression analysis:

• Forecasting
• Comparison with competition
• Problem identification
• Decision making
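As mentioned above, here is a minimal scikit-learn sketch of a few of these regression types (ordinary linear, ridge, lasso, and logistic). The data are randomly generated and the penalty strengths are arbitrary illustrative choices, not values from the text.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: one predictor x and a roughly linear response y
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)

linear = LinearRegression().fit(X, y)   # ordinary least squares
ridge = Ridge(alpha=1.0).fit(X, y)      # L2 penalty on coefficient sizes
lasso = Lasso(alpha=0.1).fit(X, y)      # L1 penalty, can zero out coefficients

print("linear slope:", linear.coef_[0])
print("ridge slope:", ridge.coef_[0])
print("lasso slope:", lasso.coef_[0])

# Logistic regression for a binary outcome (e.g., pass/fail)
y_binary = (y > y.mean()).astype(int)
logistic = LogisticRegression().fit(X, y_binary)
print("predicted pass probability at x = 5:", logistic.predict_proba([[5.0]])[0, 1])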

KNN

K-nearest neighbors (KNN) is a versatile machine learning algorithm that can be used
for both regression and classification tasks in data analytics. Here's an overview of how
KNN is applied to both types of problems:

KNN Classification:
In KNN classification, the goal is to predict a categorical class label for a given data
point based on the majority class among its K-nearest neighbors. Here's how KNN
classification works:

1. Training Phase:
• During the training phase, the KNN algorithm stores the entire training
dataset, which consists of feature vectors and their corresponding class
labels.
2. Prediction Phase:
• When making a prediction for a new, unseen data point, KNN identifies
the K-nearest neighbors of that point from the training data.
• The neighbors are determined based on a chosen distance metric
(commonly Euclidean distance) and are the training data points that are
closest to the new data point.
3. Classification:
• For KNN classification, the algorithm calculates the predicted class label
for the new data point by conducting a majority vote among its K-nearest
neighbors.
• The predicted class label is the class that occurs most frequently among
these neighbors.
4. Choosing K:
• The value of K (the number of neighbors to consider) is a hyperparameter
that you must specify.
• The choice of K can significantly impact the model's performance. Smaller
values of K may lead to more flexible and potentially noisy predictions,
while larger values of K may lead to smoother but potentially biased
predictions.
• The optimal K value is often determined through techniques like cross-
validation.
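A minimal scikit-learn sketch of the classification workflow just described; the two-feature training data and the choice K = 5 are illustrative assumptions, not values from the text.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Hypothetical training data: two features, two classes centered at (0,0) and (3,3)
X_train = rng.normal(size=(200, 2)) + np.repeat([[0, 0], [3, 3]], 100, axis=0)
y_train = np.repeat([0, 1], 100)

# K = 5 nearest neighbors, Euclidean distance (the scikit-learn default)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Majority vote among the 5 nearest neighbors of a new point
print("predicted class:", knn.predict([[2.5, 2.8]])[0])

# Cross-validation is one way to choose K, as noted above
print("5-fold accuracy:", cross_val_score(knn, X_train, y_train, cv=5).mean())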

KNN Regression:

In KNN regression, the goal is to predict a continuous target variable for a given data
point based on the average (or weighted average) of the target values of its K-nearest
neighbors. Here's how KNN regression works:

1. Training Phase:
• Similar to KNN classification, the training phase involves storing the
entire training dataset, which includes feature vectors and their
corresponding target values.
2. Prediction Phase:
• When making a prediction for a new, unseen data point, KNN identifies
the K-nearest neighbors of that point from the training data based on a
chosen distance metric.
3. Regression:
• For KNN regression, the algorithm calculates the predicted target value
for the new data point by taking the average (or weighted average) of the
target values of its K-nearest neighbors.
• The predicted target value is essentially the mean (or weighted mean) of
the target values of these neighbors.
4. Choosing K:
• As in KNN classification, the choice of K in KNN regression is a critical
hyperparameter. Smaller K values may lead to more sensitive and
potentially noisy predictions, while larger K values may lead to smoother
but potentially biased predictions.
• The optimal K value is often determined through techniques like cross-
validation.
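The regression variant follows the same pattern, averaging the target values of the K nearest neighbors; again a minimal sketch with made-up data.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

# Hypothetical training data: one feature, continuous target with noise
X_train = rng.uniform(0, 10, size=(100, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, size=100)

# Prediction = mean of the targets of the 5 nearest neighbors
# (weights="distance" would use a weighted average instead)
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)

print("predicted value at x = 4.0:", knn_reg.predict([[4.0]])[0])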

KNN is a non-parametric algorithm, meaning it doesn't make strong assumptions about the underlying data distribution. It's relatively easy to understand and implement,
making it a useful tool for initial data analysis. However, it has limitations, such as
sensitivity to the choice of K and the distance metric, as well as potential
computational costs for large datasets. Experimentation and model tuning are
essential to get the best performance from KNN for both classification and regression
tasks.

Clustering

Clustering is a fundamental technique in data analytics that involves grouping similar data points together based on their characteristics or attributes. The primary goal of clustering is to discover patterns, structures, or natural groupings in the data without any prior knowledge of the group assignments. Clustering is an unsupervised learning technique, meaning it doesn't require labeled data for training, and it's widely used in various data analysis applications. Here are some key concepts and methods related to clustering in data analytics:

• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis


• Clustering analysis is broadly used in many applications such as
market research, pattern recognition, data analysis, and image
processing.
• Clustering can also help marketers discover distinct groups in their customer base and characterize those groups based on their purchasing patterns.

• Clustering also helps in identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value, and geographic location.

• Clustering also helps in classifying documents on the web for information discovery.
• Clustering is also used in outlier detection applications such as
detection of credit card fraud.

Clustering Methods
Clustering methods can be classified into the following categories −

• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method

Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’ partitions of the data. Each partition will represent a cluster and k ≤ n. It means that it will classify the data into k groups, which satisfy the following requirements −

• Each group contains at least one object.


• Each object must belong to exactly one group.
Points to remember −

• For a given number of partitions (say k), the partitioning method will
create an initial partitioning.
• Then it uses the iterative relocation technique to improve the partitioning by moving objects from one group to another.
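k-means is the most widely used partitioning method; the sketch below partitions made-up 2-D points into k = 3 clusters with scikit-learn, illustrating the iterative-relocation idea described above.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Hypothetical data: three blobs of 2-D points
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([rng.normal(c, 0.6, size=(50, 2)) for c in centers])

# Partition the n objects into k = 3 groups; each point belongs to exactly one cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("cluster labels of first five points:", kmeans.labels_[:5])
print("cluster centers:")
print(kmeans.cluster_centers_)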
Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two approaches here −

• Agglomerative Approach
• Divisive Approach
Agglomerative Approach

This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps merging the objects or groups that are close to one another, and it keeps doing so until all of the groups are merged into one or until the termination condition holds.
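A minimal scikit-learn sketch of the bottom-up (agglomerative) approach; the data and the choice of two final clusters are illustrative assumptions.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(4)

# Hypothetical data: two groups of 2-D points
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
               rng.normal(4, 0.5, size=(30, 2))])

# Start with each point as its own group and keep merging the closest groups
# until only two clusters remain (the termination condition here)
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)

print("cluster labels:", agg.labels_)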

Divisive Approach

This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.

Approaches to Improve Quality of Hierarchical Clustering

Here are the two approaches that are used to improve the quality of
hierarchical clustering −

• Perform careful analysis of object linkages at each hierarchical partitioning.
• Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.
Density-based Method

This method is based on the notion of density. The basic idea is to continue
growing the given cluster as long as the density in the neighborhood
exceeds some threshold, i.e., for each data point within a given cluster, the
radius of a given cluster has to contain at least a minimum number of
points.
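DBSCAN is a widely used density-based method; in the sketch below, the neighborhood radius (eps) and the minimum number of points per neighborhood play the role of the density threshold described above. All values are illustrative.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(5)

# Hypothetical data: two dense blobs plus a few scattered outliers
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(3, 0.3, size=(50, 2)),
               rng.uniform(-2, 5, size=(5, 2))])

# Grow clusters while each point's neighborhood of radius eps contains
# at least min_samples points; sparse points are labeled -1 (noise)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

print("number of points flagged as noise:", int((db.labels_ == -1).sum()))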

Grid-based Method

In this method, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.

Advantages

• The major advantage of this method is fast processing time.


• It is dependent only on the number of cells in each dimension in the
quantized space.
Model-based methods

In this method, a model is hypothesized for each cluster to find the best fit of the data to the given model. This method locates the clusters by clustering the density function, which reflects the spatial distribution of the data points.

This method also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.

Constraint-based Method

In this method, the clustering is performed by incorporating user- or application-oriented constraints. A constraint refers to the user's expectations or the properties of the desired clustering results. Constraints provide an interactive way of communicating with the clustering process. Constraints can be specified by the user or by the application requirement.

Association rules analysis:


Association rule analysis, often referred to as association rule mining or simply
association analysis, is a data mining technique used in data analytics to
discover interesting and meaningful patterns or relationships within large
datasets. It primarily identifies associations between items, products, or
attributes that frequently co-occur in transactions or records. Association rule
analysis is commonly used for market basket analysis, recommendation
systems, and understanding customer behavior. Here are the key concepts and
steps involved in association rule analysis.

The following terms are commonly used in association rule analysis:

• Item: An element or attribute of interest in the dataset
• Transaction: A collection of items that occur together
• Support: The frequency with which an item or itemset appears in the dataset.
o Support(A → B) = (transactions containing both Item A and Item B) / (total transactions in the dataset)
• Confidence: The likelihood that the consequent occurs given that the antecedent occurs.
o Confidence(A → B) = Support(A and B) / Support(A)
• Lift: A measure of how much more often the antecedent and consequent occur together than would be expected by chance (a short worked example follows this list).
o Lift(A → B) = Confidence(A → B) / Support(B)
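As a sketch of how these three measures are computed, the snippet below evaluates the rule bread → milk over a handful of made-up transactions; the basket contents are purely illustrative.

# Hypothetical transactions (market baskets)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / n

# Rule: bread -> milk
sup_rule = support({"bread", "milk"})        # support of the rule
conf = sup_rule / support({"bread"})         # confidence = support(A and B) / support(A)
lift = conf / support({"milk"})              # lift = confidence / support(B)

print(f"support = {sup_rule:.2f}, confidence = {conf:.2f}, lift = {lift:.2f}")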

Data Preprocessing

Before performing association rule analysis, it is necessary to preprocess the data. This
involves data cleaning, transformation, and formatting to ensure that the data is in a
suitable format for analysis.

Data preprocessing steps may include:

• Removing duplicate or irrelevant data


• Handling missing or incomplete data
• Converting data to a suitable format (e.g., binary or numerical)
• Discretizing continuous variables into categorical variables
• Scaling or normalizing data

Association Rule Mining Algorithms


An association rule mining algorithm is a tool used to find patterns and
relationships in data. Several algorithms are used in association rule mining, each with
its own strengths and weaknesses.

Apriori Algorithm

One of the most popular association rule mining algorithms is the Apriori algorithm.
The Apriori algorithm is based on the concept of frequent itemsets, which are sets of
items that occur together frequently in a dataset.

The algorithm works by first identifying all the frequent itemsets in a dataset, and then
generating association rules from those itemsets.

These association rules can then be used to make predictions or recommendations based on the patterns and relationships discovered.

FP-Growth Algorithm

In large datasets, FP-growth is a popular method for mining frequent itemsets.

It generates frequent itemsets efficiently without generating candidate itemsets, using a tree-based data structure called the FP-tree. As a result, it is faster and more memory efficient than the Apriori algorithm when dealing with large datasets.

First, the algorithm constructs an FP-tree from the input dataset, then recursively generates frequent itemsets from it.
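Both algorithms are available in the third-party mlxtend package. The sketch below shows one possible workflow under the assumption that mlxtend and pandas are installed (exact function signatures may vary slightly between versions); the basket data are made up for illustration.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

# Hypothetical transactions
transactions = [["bread", "milk"], ["bread", "butter"],
                ["bread", "milk", "butter"], ["milk", "butter"]]

# One-hot encode the baskets into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets via Apriori and via FP-growth (same results, different algorithms)
freq_apriori = apriori(onehot, min_support=0.5, use_colnames=True)
freq_fp = fpgrowth(onehot, min_support=0.5, use_colnames=True)

# Generate association rules with a minimum confidence threshold
rules = association_rules(freq_apriori, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])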

S-2 U-2

Creating Data for Analytics through Designed Experiments:

In prescriptive analytics, designed experiments are a powerful approach to generating data for analysis. Designed experiments, also known as experimental design or DOE (Design of Experiments), are systematic procedures used to plan, conduct, and analyze experiments in a way that maximizes the information gained from the data while minimizing the number of experiments needed. Here's how you can create data for analytics through designed experiments in the context of prescriptive analysis:

1. Define the Problem or Objective:


• Start by clearly defining the problem or objective you want to
address with prescriptive analytics. What decisions or actions are
you trying to optimize or improve? What factors or variables might
influence the outcomes?
2. Identify Factors and Levels:
• Determine the factors or variables that could impact the outcome
of interest. These factors can be both controllable (e.g., process
settings, marketing strategies) and uncontrollable (e.g.,
environmental conditions).
• Define the possible levels or values for each factor. For example, if
you are optimizing a manufacturing process, factors might include
temperature, pressure, and time, each with multiple levels.
3. Design the Experiment:
• Choose an appropriate experimental design method. Common
designs include full factorial design, fractional factorial design,
response surface methodology, and Taguchi methods, among
others.
• Determine the number of experimental runs (experiments)
required to investigate the factors and levels adequately. The
design should aim to cover a wide range of factor combinations
efficiently.
4. Conduct the Experiments:
• Conduct the experiments according to the design plan. Ensure that
the factors are set at their specified levels for each experiment,
and record the outcomes or responses.
5. Collect Data:
• Collect data from each experiment, including measurements,
observations, or any other relevant information. Ensure the data is
accurate and consistent.
6. Analyze the Data:
• Use statistical analysis techniques to analyze the data collected
from the experiments. This may include regression analysis,
analysis of variance (ANOVA), or other methods depending on the
experimental design and objectives.
• Determine how the factors influence the outcome of interest and
identify any significant interactions between factors.
7. Model Development:
• Develop mathematical models or predictive models that describe
the relationship between the factors and the outcomes. These
models can be used for optimization and decision-making.
8. Optimization and Prescriptive Analysis:
• Utilize the developed models to perform optimization and
prescriptive analysis. This may involve finding the optimal set of
factor levels that maximize or minimize the desired outcomes.
• Prescriptive analytics can help you make informed decisions and
recommendations based on the experimental results.
9. Validation:
• Validate the results of your prescriptive analysis to ensure they
align with real-world conditions. Sometimes, follow-up
experiments or testing in a production environment may be
necessary to confirm the findings.
10. Implementation:
• Implement the recommended actions or decisions based on the
prescriptive analysis to achieve the desired outcomes or
improvements.

Designed experiments are a systematic and efficient way to create data for
prescriptive analysis. They allow you to explore the relationships between
factors and outcomes, optimize processes, and make data-driven decisions
that lead to improved performance and efficiency.
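As an illustrative sketch of steps 2 through 7 above, the snippet below builds a small full factorial design with itertools, simulates a response, and fits a linear model with statsmodels. The factor names, levels, and simulated yield are assumptions for illustration only, not values from the text.

import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)

# Hypothetical factors and levels for a manufacturing process
temperature = [150, 180]   # degrees C
pressure = [1.0, 2.0]      # bar
time = [30, 60]            # minutes

# Full factorial design: every combination of factor levels (2 x 2 x 2 = 8 runs)
runs = pd.DataFrame(list(itertools.product(temperature, pressure, time)),
                    columns=["temperature", "pressure", "time"])

# Simulated response (yield); in practice this comes from the real experiments
runs["yield_pct"] = (60 + 0.1 * runs["temperature"] + 5 * runs["pressure"]
                     + 0.05 * runs["time"] + rng.normal(0, 1, len(runs)))

# Fit a linear model to estimate how each factor influences the outcome
model = smf.ols("yield_pct ~ temperature + pressure + time", data=runs).fit()
print(model.params)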
