BUSINESS ANALYTICS NOTES
3. Standard Deviation, Mean, Skewness
Mean
The mean is a fundamental measure of central tendency that provides a single value
representing the average of a dataset. It is calculated by summing all values and dividing by
the number of observations. The formula for mean is:
\text{Mean} = \frac{\sum x}{n}
where ∑x is the sum of all data points, and n is the number of data points. The
mean gives an idea of the "typical" value in a dataset and is widely used in business analytics
to understand key metrics like average sales, customer spending, or production output.
However, it is sensitive to extreme values (outliers), which can pull it away from the typical value. For
example, if a company’s average employee salary is calculated including a very high CEO
salary, the result may not represent the typical employee’s salary.
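The salary example can be checked with a few lines of Python. This is a minimal sketch; the salary figures are made up for illustration, and the median is shown only as a contrast because it is far less affected by the outlier.

```python
from statistics import mean, median

# Hypothetical monthly salaries (in thousands); the last value stands in for the CEO.
salaries = [30, 32, 35, 38, 40, 500]

mean_all = mean(salaries)               # pulled upward by the single very high salary
mean_without_ceo = mean(salaries[:-1])  # closer to the typical employee salary
median_all = median(salaries)           # middle value, barely moved by the outlier

print("Mean with CEO:", mean_all)
print("Mean without CEO:", mean_without_ceo)
print("Median with CEO:", median_all)
```

One very large value is enough to move the mean far above what most employees earn, which is why the median is often reported next to the mean for salary or income data.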
Standard Deviation (SD)
Standard Deviation is a measure of dispersion or spread in a dataset. It indicates how much
individual data points deviate from the mean. A low standard deviation suggests that the
values are close to the mean, whereas a high standard deviation indicates that the values are
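more spread out. The formula for standard deviation (for a sample) is:
s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
where \bar{x} is the sample mean and n is the number of data points (see notebook, page 53).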
Standard deviation is critical in business to assess risk, consistency and performance. For
example, in stock market analysis, a high standard deviation of returns means the stock is volatile and
risky, whereas a low SD means it is more stable.
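The stock comparison can be sketched in Python as follows. The function uses the usual sample form with the n - 1 divisor; the two return series are hypothetical and were chosen to have the same mean so that only the spread differs.

```python
from math import sqrt
from statistics import mean, stdev

def sample_sd(values):
    """Sample standard deviation using the n - 1 divisor."""
    x_bar = mean(values)
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / (len(values) - 1))

# Hypothetical daily returns (%) for two stocks with the same average return.
stable_stock = [1.0, 1.2, 0.9, 1.1, 0.8]
volatile_stock = [5.0, -3.0, 4.0, -2.0, 1.0]

print("Stable stock SD:", sample_sd(stable_stock))
print("Volatile stock SD:", sample_sd(volatile_stock))

# The standard library's stdev() uses the same n - 1 definition.
print("Library check:", stdev(stable_stock), stdev(volatile_stock))
```

Both stocks average a 1% return, but the second has a much larger standard deviation, which is what "more volatile" means in this context.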
Skewness
Skewness measures the asymmetry of the distribution of data. A perfectly
symmetrical distribution has zero skewness; the normal distribution is one example, but zero skewness alone does not prove normality. If the
tail is longer on the right side, the distribution is positively skewed, and if the tail is longer
on the left, it is negatively skewed. The formulas for sample skewness are in the notebook (pages 61, 64 and 65).
Skewness helps in identifying whether the mean is a reliable measure of central tendency. For
example, in customer income data, if a few customers earn significantly more than the rest,
the data will show positive skewness. This impacts the choice of statistical techniques and
summarization methods.
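A small Python sketch can show the sign of skewness for the income example. It uses one common definition of sample skewness (the adjusted Fisher-Pearson coefficient); this is an assumption, and other textbook formulas give somewhat different values. The income figures are hypothetical.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (one common definition, assumed here)."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 divisor)
    cubed_z_scores = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed_z_scores

# Hypothetical customer incomes (in thousands): one customer earns far more than the rest.
incomes = [20, 22, 25, 27, 30, 32, 35, 150]
# A symmetric dataset for comparison.
symmetric = [10, 20, 30, 40, 50]

print("Income skewness:", sample_skewness(incomes))      # positive: long right tail
print("Symmetric skewness:", sample_skewness(symmetric))  # approximately zero
```

A clearly positive value, as with the income data, suggests that the median may describe the typical customer better than the mean.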
4. Normality and Distribution of Data
What is Normality?
Normality refers to a condition where the dataset follows a normal distribution, also known
as the Gaussian distribution. A normal distribution is symmetric, bell-shaped, and centered
around the mean. In this distribution, the mean, median, and mode are all equal. It is widely
used in statistics because many real-life measurements, such as employee
performance, product weight, or exam scores, are approximately normally distributed.
Properties of Normal Distribution
The normal distribution is defined by two parameters: mean (μ) and standard deviation (σ).
The shape of the curve is determined by these two. It has key properties:
It is symmetric around the mean.
About 68.27% of the data lies within ±1σ, 95.45% within ±2σ, and 99.73% within ±3σ.
The total area under the curve is 1.
It extends infinitely in both directions, though practically most data lies within ±3σ.
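The shares of data within ±1σ, ±2σ and ±3σ can be computed directly from the normal distribution. The sketch below uses the error function from Python's math module; figures copied from printed tables may differ in the last digit because of rounding.

```python
from math import erf, sqrt

def share_within(k):
    """Probability that a normally distributed value lies within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"Within ±{k}σ: {100 * share_within(k):.2f}%")
```

The result does not depend on the particular mean or standard deviation, because the rule is stated in units of σ around μ.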
Why Normality Matters in Analytics
Many parametric methods such as t-tests, ANOVA, and regression analysis assume normality
of the data (for regression, of the residuals). If this assumption is badly violated, the results from these tests may not be
valid. For example, if you want to evaluate employee performance based on a training
program using a t-test, the data should ideally be normally distributed for accurate
interpretation.
How to Check for Normality
Normality can be visually assessed using:
Histograms
Box plots
Q-Q (quantile-quantile) plots
Additionally, statistical tests like the Shapiro-Wilk Test and Kolmogorov-Smirnov Test are
used to test normality.
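The idea behind a Q-Q plot can be sketched without a plotting library: pair each sorted observation with the matching quantile of the standard normal distribution and see how close the pairs are to a straight line. The code below is an informal check only, not a Shapiro-Wilk or Kolmogorov-Smirnov test; the plotting positions (i - 0.5) / n and both sample datasets are assumptions made for illustration.

```python
from statistics import NormalDist, mean

def qq_points(values):
    """Pair each standard normal quantile with the matching sorted observation."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n avoid the infinite quantiles at 0 and 1.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(values)))

def pearson_r(pairs):
    """Pearson correlation of the (x, y) pairs; close to 1 means nearly a straight line."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in pairs)
    denominator = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return numerator / denominator

# Hypothetical samples: one roughly bell-shaped, one strongly right-skewed.
bell_shaped = [48, 50, 51, 52, 53, 54, 55, 56, 58, 60]
right_skewed = [1, 1, 2, 2, 3, 3, 4, 5, 20, 60]

print("Bell-shaped Q-Q correlation:", pearson_r(qq_points(bell_shaped)))
print("Right-skewed Q-Q correlation:", pearson_r(qq_points(right_skewed)))
```

The skewed sample gives a noticeably lower correlation, which corresponds to the curved pattern such data shows on a Q-Q plot. For a formal decision, a proper normality test from a statistics package is still needed.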
Non-Normal Distributions
If data is not normally distributed, it could be skewed, bimodal, or uniform. In such cases,
non-parametric tests such as the Mann-Whitney U test or Kruskal-Wallis test are more
appropriate. For example, income data often follows a positively skewed distribution, and
using a non-parametric approach would yield more reliable results.
6. T-Test
Definition and Purpose
The t-test is a parametric test used to compare means, either between two groups or between one
group and a hypothesized value, and to determine whether the difference is statistically significant. It is especially useful when the sample size is
small and population standard deviation is unknown.
Types of T-Tests
1. One-sample t-test: Compares the mean of a single group with a known or
hypothesized population mean.
Example: Is the average delivery time of a service different from 30 minutes?
2. Independent (two-sample) t-test: Compares the means of two independent groups.
Example: Compare average sales of two different branches.
3. Paired t-test: Used when the same group is measured twice (before and after a
treatment).
Example: Measure productivity of employees before and after training.
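The one-sample and paired cases can be sketched together, because a paired t-test is a one-sample t-test applied to the before/after differences. The code below computes only the t statistic and degrees of freedom; the data are hypothetical, and the p-value would still have to be read from a t-table or a statistics package.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(values, hypothesized_mean):
    """Return (t, degrees of freedom) for t = (mean - hypothesized mean) / (s / sqrt(n))."""
    n = len(values)
    standard_error = stdev(values) / sqrt(n)
    t = (mean(values) - hypothesized_mean) / standard_error
    return t, n - 1

def paired_t(before, after):
    """Paired t-test: a one-sample t-test on the pairwise differences against 0."""
    differences = [a - b for a, b in zip(after, before)]
    return one_sample_t(differences, 0)

# Hypothetical delivery times (minutes), tested against the 30-minute target.
delivery_minutes = [28, 31, 35, 33, 30, 36, 32, 34]
print("One-sample:", one_sample_t(delivery_minutes, 30))

# Hypothetical productivity scores for the same five employees.
before_training = [50, 52, 48, 55, 51]
after_training = [54, 55, 50, 59, 53]
print("Paired:", paired_t(before_training, after_training))
```

A large absolute t value relative to the critical value for the given degrees of freedom indicates a statistically significant difference.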
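Formula for Independent t-test
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \quad s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}
where \bar{x}_1 and \bar{x}_2 are the group means, s_1^2 and s_2^2 are the sample variances, and n_1 and n_2 are the group sizes. This is the pooled form, which relies on the equal-variance assumption; it has n_1 + n_2 - 2 degrees of freedom.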
Assumptions
Data should be normally distributed
Samples are independent
Variances of the two groups should be equal (homogeneity of variance)
The t-test is widely used in business to compare employee performance, marketing
campaign results, or sales before and after a price change.
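For the independent two-sample case, the statistic can be computed as below. This sketch assumes the pooled (equal-variance) form of the test, in line with the homogeneity-of-variance assumption; the branch sales figures are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Return (t, degrees of freedom) for the equal-variance two-sample t-test."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: weighted average of the two sample variances.
    pooled_variance = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    t = (mean(group_a) - mean(group_b)) / standard_error
    return t, n_a + n_b - 2

# Hypothetical weekly sales (in thousands) for two branches.
branch_a = [20, 22, 24, 26, 28]
branch_b = [14, 16, 18, 20, 22]

t_statistic, degrees_of_freedom = pooled_t(branch_a, branch_b)
print("t =", t_statistic, "df =", degrees_of_freedom)
```

If the two groups have clearly different variances, a variant that does not pool the variances (Welch's t-test) is usually preferred.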
7. Chi-Square Test
Overview
The Chi-Square Test (χ²) is a non-parametric test used to examine the association
between categorical variables. It is used when data is in the form of frequencies or counts,
not continuous variables.
Types of Chi-Square Tests
1. Chi-square test of independence:
o Tests whether two categorical variables are independent.
o Example: Is customer satisfaction independent of geographic location?
2. Chi-square goodness-of-fit test:
o Checks if a sample distribution matches an expected distribution.
o Example: Are product sales equally distributed across all weekdays?
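Formula
\chi^2 = \sum \frac{(O - E)^2}{E}
where O is the observed frequency and E is the expected frequency in each category or cell.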
Assumptions
Data must be in counts
Categories must be mutually exclusive
Expected frequency in each cell should be ≥ 5
The Chi-square test is often used in business to assess customer preferences, employee
satisfaction by department, or relationship between product category and return rate.
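Both types of chi-square test use the same statistic: the sum of (observed - expected)² / expected over all categories or cells. The sketch below applies it to a goodness-of-fit case and to a test of independence. All counts are hypothetical, and the function names are illustrative rather than taken from any library.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories or cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]

# Goodness of fit: are sales spread equally over the seven weekdays?
weekday_sales = [50, 45, 55, 60, 70, 90, 50]
equal_share = [sum(weekday_sales) / len(weekday_sales)] * len(weekday_sales)
gof_statistic = chi_square_statistic(weekday_sales, equal_share)
gof_df = len(weekday_sales) - 1
print("Goodness of fit:", gof_statistic, "df =", gof_df)

# Independence: satisfaction (rows: satisfied, not satisfied) by region (columns: north, south).
satisfaction_by_region = [[30, 20],
                          [20, 30]]
expected = expected_counts(satisfaction_by_region)
independence_statistic = chi_square_statistic(
    [count for row in satisfaction_by_region for count in row],
    [count for row in expected for count in row],
)
independence_df = (len(satisfaction_by_region) - 1) * (len(satisfaction_by_region[0]) - 1)
print("Independence:", independence_statistic, "df =", independence_df)
```

The statistic is then compared with the critical chi-square value for the given degrees of freedom; a larger statistic means the observed counts differ from the expected counts more than chance alone would suggest.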
8. Cluster Analysis
Introduction
Cluster Analysis is an unsupervised learning technique that groups similar data
points into clusters, so that points within a cluster are more similar to each other than to
points in other clusters. It is widely used in market research, customer segmentation, and
pattern recognition.
Purpose and Importance
The main purpose of clustering is to discover hidden structures or patterns in large
datasets. For instance, a company can use clustering to segment its customers into groups
such as price-sensitive, brand-loyal, and occasional buyers, allowing targeted marketing
strategies for each group.
Types of Clustering
1. Hierarchical Clustering:
o Creates a tree-like structure (dendrogram) to group data.
o Useful for small datasets.
2. K-Means Clustering:
o Divides data into a predefined number (K) of clusters.
o Minimizes the within-cluster variance.
Steps in K-Means Clustering
1. Choose the number of clusters (K)
2. Randomly assign data points to clusters
3. Calculate cluster centroids
4. Reassign each point to the cluster with the nearest centroid
5. Repeat steps 3 and 4 until the assignments stop changing
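The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: it uses squared Euclidean distance, a fixed random seed, and re-seeds an empty cluster with a random data point (a choice made here, not part of the steps listed). The customer data are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    # Randomly assign every point to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iterations):
        # Calculate the centroid of each cluster.
        centroids = []
        for cluster in range(k):
            members = [p for p, label in zip(points, labels) if label == cluster]
            # Assumption of this sketch: an empty cluster restarts at a random data point.
            centroids.append(centroid(members) if members else rng.choice(points))
        # Reassign each point to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once no assignment changes (convergence).
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical customers described by (monthly spend, visits per month).
customers = [(10, 1), (12, 2), (11, 1), (95, 9), (98, 10), (102, 11)]
labels, centroids = k_means(customers, k=2)
print("Cluster labels:", labels)
print("Centroids:", centroids)
```

K-Means can settle on a poor grouping depending on the random start, so in practice it is usually run several times with different seeds and the run with the lowest within-cluster variance is kept. Features on very different scales are normally standardized first.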
Application in Business
Customer segmentation for personalized offers
Fraud detection in banking
Inventory categorization based on turnover and value
Cluster analysis helps businesses maximize marketing ROI, reduce churn, and improve
operational efficiency.
9. Importance of Data Analytics in Business
Decision-Making
Data analytics enables fact-based decision-making by transforming raw data into actionable
insights. It replaces intuition with data-driven strategies. For example, analyzing past sales
can help forecast future demand, aiding in inventory planning.
Customer Insights
Businesses can use analytics to deeply understand customer behavior, preferences, and
feedback. By analyzing customer purchase history, a retailer can personalize product
recommendations and marketing messages, leading to improved customer retention.
Forecasting and Planning
Using predictive analytics, businesses can anticipate future trends, customer behavior, or
risks. For example, banks use historical loan data to predict default risks, allowing better
credit decisions.
Operational Efficiency
Analytics helps in identifying inefficiencies in processes. By tracking KPIs (Key
Performance Indicators), companies can reduce wastage, optimize resource utilization, and
improve service delivery. In manufacturing, analytics is used for quality control and
production optimization.
Competitive Advantage
Data-driven companies gain a competitive edge by reacting faster to market trends and
making smarter strategic decisions. For instance, companies like Amazon and Netflix
use analytics to recommend products and content, which supports customer
satisfaction.
OTHER TOPICS ARE IN THE NOTEBOOK FOR REFERENCE.