Data Analytics-11
For example, increasing the speed of a vehicle decreases the time you take to
reach your destination (a negative correlation).
Correlation analysis is used to study practical cases in which the researcher
cannot manipulate individual variables. For example, correlation analysis can
measure the relationship between a patient's blood pressure and the
medication used.
In statistics, correlation refers to the fact that there is a link between various
events. One of the tools to infer whether such a link exists is correlation
analysis. Practical simplicity is undoubtedly one of its main advantages.
• Uses for further studies: Researchers can identify the direction and
strength of the relationship between two variables, and then narrow those
findings down in later studies.
• Simple metrics: Research findings are simple to classify. The coefficient
ranges from -1.00 to +1.00, so there are only three potential broad
outcomes of the analysis: a positive correlation, a negative correlation, or
no correlation (see the sketch below).
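To make the metric concrete, here is a minimal sketch (assuming NumPy is available) that computes the Pearson coefficient for the speed/travel-time example; the data values are invented for illustration.

```python
# Pearson correlation between vehicle speed and travel time (toy data).
import numpy as np

speed = np.array([40, 50, 60, 70, 80, 90, 100])       # km/h (hypothetical)
time = np.array([3.0, 2.4, 2.0, 1.7, 1.5, 1.3, 1.2])  # hours (hypothetical)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry
# is the Pearson coefficient between the two variables.
r = np.corrcoef(speed, time)[0, 1]
print(f"Pearson r = {r:.2f}")  # close to -1.00: a strong negative correlation
```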
Here are the key concepts and steps involved in the Maximum Likelihood Test (MLT):
Likelihood Function:
MLE is the process of finding the parameter values that maximize the
likelihood function.
In other words, it finds the parameter values that make the observed data
most probable under the assumed statistical model.
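A minimal sketch of a likelihood function, assuming a normal model and made-up observations, shows what "most probable under the assumed model" means in code:

```python
# Likelihood of data under a Normal(mu, sigma) model: the product of the
# density evaluated at each observation.
import math

def normal_likelihood(data, mu, sigma):
    lik = 1.0
    for x in data:
        density = (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                   / (sigma * math.sqrt(2 * math.pi)))
        lik *= density
    return lik

data = [4.8, 5.1, 5.3, 4.9, 5.0]  # made-up observations
# Parameter values near the sample mean make the observed data more probable.
print(normal_likelihood(data, mu=5.0, sigma=0.2))  # relatively large
print(normal_likelihood(data, mu=3.0, sigma=0.2))  # vanishingly small
```

MLE searches over mu and sigma for the values that maximize this function.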
Hypothesis Testing:
The null hypothesis (H0) typically assumes a specific parameter value, while
the alternative hypothesis (H1) allows for different parameter values.
MLT calculates the likelihood of the data under the null hypothesis, L(H0),
and under the alternative hypothesis, L(H1), and compares the two.
The test statistic is typically -2 times the log-likelihood ratio,
-2 ln(L(H0) / L(H1)), which follows a chi-squared distribution under certain
conditions.
The critical value is a threshold from the chi-squared distribution that
helps determine whether to reject the null hypothesis.
Decision:
If the test statistic exceeds the critical value or if the p-value is smaller than a
chosen significance level (alpha), you reject the null hypothesis in favor of the
alternative hypothesis.
If the test statistic does not exceed the critical value or if the p-value is larger
than alpha, you fail to reject the null hypothesis.
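A minimal sketch of this decision rule, assuming SciPy is installed, testing whether a coin that produced 19 heads in 40 tosses is fair (H0: p = 0.5) against the unrestricted alternative:

```python
from scipy.stats import binom, chi2

heads, tosses = 19, 40
p_hat = heads / tosses  # MLE of p under H1

log_l0 = binom.logpmf(heads, tosses, 0.5)    # log-likelihood under H0
log_l1 = binom.logpmf(heads, tosses, p_hat)  # log-likelihood under H1

stat = -2 * (log_l0 - log_l1)    # -2 times the log-likelihood ratio
critical = chi2.ppf(0.95, 1)     # critical value, alpha = 0.05, 1 df
p_value = chi2.sf(stat, 1)

print(f"statistic = {stat:.3f}, critical = {critical:.3f}, p = {p_value:.3f}")
# The statistic is far below the critical value, so we fail to reject H0.
```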
MLT can be applied in various data analytics scenarios, such as comparing
different models, testing the significance of parameters, and assessing
goodness-of-fit. It is a fundamental tool in statistical analysis and hypothesis
testing in data science and analytics.
[Table: likelihood of 19 heads in 40 tosses for a range of single-toss probabilities]
Observation: When the probability of a single coin toss is low, in the range of 0% to
10%, the probability of getting 19 heads in 40 tosses is also very low. As we move to
higher values, in the range of 30% to 40%, the likelihood of getting 19 heads in 40
tosses rises higher and higher.
After an initial increase, the likelihood gradually decreases beyond some
intermediate probability value, which forms a peak.
The peak value is called the maximum likelihood.
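The sketch below reproduces this observation by sweeping the single-toss probability p and evaluating the binomial likelihood of 19 heads in 40 tosses (pure standard-library Python):

```python
from math import comb

heads, tosses = 19, 40
best_p, best_lik = 0.0, 0.0
for i in range(101):
    p = i / 100
    # Binomial likelihood of observing `heads` successes in `tosses` trials
    lik = comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)
    if lik > best_lik:
        best_p, best_lik = p, lik

print(f"likelihood peaks near p = {best_p:.2f}")
# The exact maximizer is heads/tosses = 19/40 = 0.475.
```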
Major Steps in MLE:
1. Perform a certain experiment to collect the data.
2. Choose a parametric model of the data, with certain modifiable
parameters.
3. Formulate the likelihood as an objective function to be maximized.
4. Maximize the objective function and derive the parameters of the model.
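A minimal sketch of these steps, assuming a normal model, synthetic data, and NumPy/SciPy availability:

```python
import numpy as np
from scipy.optimize import minimize

# Step 1: collect data (synthetic here, drawn from Normal(5, 2))
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)

# Steps 2-3: choose a Normal(mu, sigma) model and formulate the
# negative log-likelihood as the objective (minimizing it maximizes L).
def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return np.sum(np.log(sigma * np.sqrt(2.0 * np.pi))
                  + (data - mu) ** 2 / (2.0 * sigma ** 2))

# Step 4: maximize the objective and derive the parameters
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
print("MLE estimates (mu, sigma):", result.x)  # close to 5.0 and 2.0
```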
Examples:
1. Ridge Regression
Using regularization, ridge regression addresses overfitting: the scenario where a
model performs well on training data but underperforms on validation data.
2. Lasso Regression
Lasso applies the same regularization idea with an L1 penalty, which can shrink some
coefficients exactly to zero.
3. Negative Binomial Regression
Similar to Poisson regression, negative binomial regression also deals with count
data; the difference is that negative binomial regression does not assume that the
variance of the counts is equal to their mean.
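A minimal sketch of the two regularized models using scikit-learn (an assumed dependency), with synthetic data in which only two of five features matter:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty can zero some out entirely

print("ridge coefficients:", ridge.coef_.round(2))
print("lasso coefficients:", lasso.coef_.round(2))  # irrelevant features near 0
```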
• Forecasting
• Comparison with competition
• Problem identification
• Decision making
KNN
K-nearest neighbors (KNN) is a versatile machine learning algorithm that can be used
for both regression and classification tasks in data analytics. Here's an overview of how
KNN is applied to both types of problems:
KNN Classification:
In KNN classification, the goal is to predict a categorical class label for a given data
point based on the majority class among its K-nearest neighbors. Here's how KNN
classification works:
1. Training Phase:
• During the training phase, the KNN algorithm stores the entire training
dataset, which consists of feature vectors and their corresponding class
labels.
2. Prediction Phase:
• When making a prediction for a new, unseen data point, KNN identifies
the K-nearest neighbors of that point from the training data.
• The neighbors are determined based on a chosen distance metric
(commonly Euclidean distance) and are the training data points that are
closest to the new data point.
3. Classification:
• For KNN classification, the algorithm calculates the predicted class label
for the new data point by conducting a majority vote among its K-nearest
neighbors.
• The predicted class label is the class that occurs most frequently among
these neighbors.
4. Choosing K:
• The value of K (the number of neighbors to consider) is a hyperparameter
that you must specify.
• The choice of K can significantly impact the model's performance. Smaller
values of K may lead to more flexible and potentially noisy predictions,
while larger values of K may lead to smoother but potentially biased
predictions.
• The optimal K value is often determined through techniques like cross-
validation.
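A minimal sketch of KNN classification with scikit-learn (an assumed dependency), including a cross-validation loop to choose K as described above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several values of K and keep the one with the best CV accuracy.
best_k, best_score = 1, 0.0
for k in (1, 3, 5, 7, 9):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"best K = {best_k}, cross-validated accuracy = {best_score:.3f}")
```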
KNN Regression:
In KNN regression, the goal is to predict a continuous target variable for a given data
point based on the average (or weighted average) of the target values of its K-nearest
neighbors. Here's how KNN regression works:
1. Training Phase:
• Similar to KNN classification, the training phase involves storing the
entire training dataset, which includes feature vectors and their
corresponding target values.
2. Prediction Phase:
• When making a prediction for a new, unseen data point, KNN identifies
the K-nearest neighbors of that point from the training data based on a
chosen distance metric.
3. Regression:
• For KNN regression, the algorithm calculates the predicted target value
for the new data point by taking the average (or weighted average) of the
target values of its K-nearest neighbors.
• The predicted target value is essentially the mean (or weighted mean) of
the target values of these neighbors.
4. Choosing K:
• As in KNN classification, the choice of K in KNN regression is a critical
hyperparameter. Smaller K values may lead to more sensitive and
potentially noisy predictions, while larger K values may lead to smoother
but potentially biased predictions.
• The optimal K value is often determined through techniques like cross-
validation.
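A minimal sketch of KNN regression with scikit-learn (an assumed dependency) on synthetic data; passing weights='distance' would switch the average to a weighted one:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Prediction = mean of the target values of the 5 nearest neighbors.
model = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(model.predict([[2.5]]))  # roughly sin(2.5) = 0.60
```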
Clustering
Clustering Methods
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects, and the partitioning method
constructs ‘k’ partitions of the data. Each partition represents a cluster,
with k ≤ n. That is, the method classifies the data into k groups that satisfy
the following requirements −
• For a given number of partitions (say k), the partitioning method
creates an initial partitioning.
• It then uses an iterative relocation technique to improve the
partitioning by moving objects from one group to another, as in the
k-means sketch below.
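A minimal sketch of one partitioning method, k-means, using scikit-learn (an assumed dependency) on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two blobs of points centered near (0, 0) and (5, 5)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

# k = 2 partitions; centroids are iteratively relocated until stable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # approximately (0, 0) and (5, 5)
```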
Hierarchical Methods
Hierarchical clustering creates a hierarchical decomposition of the given set
of data objects. There are two approaches −
• Agglomerative Approach
• Divisive Approach
Agglomerative Approach
This bottom-up approach starts with each object forming a separate group
and keeps merging groups that are close to one another until all groups are
merged into one, or until a termination condition holds.
Divisive Approach
This top-down approach starts with all objects in the same cluster and, in
each iteration, splits a cluster into smaller clusters until each object is in
its own cluster, or until a termination condition holds.
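A minimal sketch of the agglomerative (bottom-up) approach with scikit-learn (an assumed dependency), on the same kind of toy data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(5, 1, size=(20, 2))])

# Each point starts as its own cluster; close pairs are merged
# until only two clusters remain.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)
```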
Here are the two approaches that are used to improve the quality of
hierarchical clustering −
• Perform careful analysis of object linkages at each hierarchical
partitioning.
• Integrate hierarchical agglomeration by first grouping objects into
micro-clusters, and then performing macro-clustering on the
micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue
growing the given cluster as long as the density in the neighborhood
exceeds some threshold, i.e., for each data point within a given cluster, a
neighborhood of a given radius has to contain at least a minimum number of
points.
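A minimal sketch of density-based clustering using DBSCAN from scikit-learn (an assumed dependency); eps is the neighborhood radius and min_samples the minimum number of points it must contain:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),
               rng.normal(5, 0.3, size=(30, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; -1 would mark points treated as noise
```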
Grid-based Method
In this method, the objects together form a grid: the object space is
quantized into a finite number of cells that form a grid structure.
Advantages
The major advantage of the grid-based method is fast processing time, since
it depends only on the number of cells in the quantized space rather than on
the number of data objects.
Model-based Method
In this method, a model is hypothesized for each cluster to find the best fit
of the data to a given model. This method locates the clusters by clustering
the density function, and it reflects the spatial distribution of the data
points.
Constraint-based Method
In this method, the clustering is performed by incorporating user- or
application-oriented constraints, which express the user's expectations
about the desired clustering results.
Data Preprocessing
Before performing association rule analysis, it is necessary to preprocess the data. This
involves data cleaning, transformation, and formatting to ensure that the data is in a
suitable format for analysis.
Apriori Algorithm
One of the most popular association rule mining algorithms is the Apriori algorithm.
The Apriori algorithm is based on the concept of frequent itemsets, which are sets of
items that occur together frequently in a dataset.
The algorithm works by first identifying all the frequent itemsets in a dataset, and then
generating association rules from those itemsets.
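A minimal, from-scratch sketch of the core idea: count itemset occurrences and keep those whose support clears a threshold. The transactions and threshold are invented, and a real analysis would use a library implementation (e.g., mlxtend) with Apriori's candidate-pruning step, which this brute-force loop omits for brevity:

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_support = 0.6  # an itemset must appear in at least 60% of transactions

def support(itemset):
    # Fraction of transactions that contain the itemset as a subset.
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
for size in (1, 2):
    for itemset in combinations(items, size):
        s = support(set(itemset))
        if s >= min_support:
            print(itemset, round(s, 2))
```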
FP-Growth Algorithm
FP-Growth is a popular method for mining frequent itemsets in large datasets.
It avoids Apriori's repeated candidate generation by compressing the
transactions into a compact FP-tree structure and mining frequent itemsets
directly from it.
Designed experiments are a systematic and efficient way to create data for
prescriptive analysis. They allow you to explore the relationships between
factors and outcomes, optimize processes, and make data-driven decisions
that lead to improved performance and efficiency.