Business Analytics DGV
Analytics is a broad term that may refer to almost any type of data
analysis, especially statistical analysis, data mining and machine
learning.
Analytics
▪ Descriptive
▪ Diagnostic
▪ Predictive
▪ Prescriptive
Scales of measurement
– Quantitative / Numerical: Continuous, Discrete
– Qualitative / Categorical: Nominal, Rank
Quantitative / Numerical data
Characteristics such as weight, length, height, thickness, temperature, time of arrival,
marks scored, etc., represent continuous variables.
A continuous variable assumes the finest unit of measurement,
finest in the sense that it enables measurement to the maximum degree of precision.
Quantitative / Numerical data
Discrete data are the values assumed by a discrete variable. A discrete variable is
one whose outcomes are measured in fixed numbers. Such data are essentially count
data.
They are derived from a process of counting, such as the number of items possessing
or not possessing a certain characteristic.
Nominal data are the outcome of classification into two or more categories of items or
units comprising a sample or a population according to some quality characteristic.
Given any such basis of classification, it is always possible to assign each item to a
particular class and make a summation of items belonging to each class. The count
data so obtained are called nominal data.
Qualitative / Categorical
Rank data, on the other hand, are the result of assigning ranks to specify order in terms
of the integers 1, 2, 3, ..., n. Ranks may be assigned according to the level of
performance in a test, a contest, a competition, an interview, or a show.
• Univariate
• Bivariate
• Multivariate
Univariate data
Univariate data concern only one variable. The weights of a finance
class of 30 students, presented in the following table, are an example
of univariate data.
30 78
Bivariate data
Bivariate data concern exactly two variables. Continuing with the
earlier example, if we add the height of each student to his/her
weight, the result is bivariate data.
1 75 172 80 Male
2 71 169 75 Male
3 82 174 82 Female
30 78 176 69 Male
Data Sources
Data sources are of two types: secondary and primary.
1. Secondary data
2. Primary data
• Data that do not already exist in any form and thus have to be
collected for the first time from the primary source(s). By their very nature,
these data require fresh, first-time collection covering the whole
population or a sample drawn from it.
• Surveys
• Interviews
• Experiments
• Observations
Classification of Sampling techniques
Sampling Techniques
• Nonprobability sampling techniques
• Probability sampling techniques
Quota sampling is judgmental sampling with the constraint that the sample
includes a minimum number of elements from each specified sub-group.
Non Probability sampling
Each possible sample of a given size (n) has a known and equal probability
of being the sample actually selected.
The sample is chosen by selecting a random starting point and then picking
every i-th element in succession from the sampling frame.
For example, suppose there are 100,000 elements in the population and a sample of
1,000 is desired. In this case the sampling interval, i, is 100. A random number
between 1 and 100 is selected. If, for example, this number is 64, the sample
consists of elements 64, 164, 264, 364, 464, 564, and so on.
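The procedure above (commonly called systematic sampling) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the function name `systematic_sample` and its `start` and `seed` parameters are hypothetical, and the sampling interval is assumed to be the population size divided by the sample size, as in the example.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the 1-based positions picked by systematic sampling.

    `start` is the random starting point; when omitted it is drawn
    uniformly from 1..interval (hypothetical helper, not from the slides).
    """
    interval = population_size // sample_size  # sampling interval i
    if start is None:
        start = random.Random(seed).randint(1, interval)
    if not 1 <= start <= interval:
        raise ValueError("start must lie between 1 and the sampling interval")
    # Take every i-th element after the starting point.
    return [start + k * interval for k in range(sample_size)]


# Slide example: 100,000 elements, sample of 1,000, random start 64.
positions = systematic_sample(100_000, 1_000, start=64)
print(positions[:6])   # [64, 164, 264, 364, 464, 564]
print(len(positions))  # 1000
```

Leaving `start` unset draws the starting point at random, which is what makes the procedure a probability sampling technique.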
Probability sampling – Stratified sampling
Rules Of Strata
The target population is first divided into mutually exclusive and collectively
exhaustive subpopulations, or clusters.
In stage 3, for each selected cluster, either all the elements are included in
the sample (one-stage) or a sample of elements is drawn probabilistically
(two-stage).
Probability sampling – Cluster Sampling
The target population is first divided into mutually exclusive and collectively
exhaustive subpopulations, or clusters.
Cluster Rules
Clusters
Ideally, each cluster should be a small-scale representation of
the population.
◼ Sampling errors:
❖ Faulty selection of sample – This may be due to a defective sampling technique, such as
purposive or judgment sampling, in which the researcher deliberately selects a
non-representative sample
❖ Error in compilation
Sample Size
Branches of Statistics
Statistics
Descriptive Statistics
Mean 490.8
Standard Error 6.542348114
Median 475
Mode 450
Standard Deviation 54.73721146
Sample Variance 2996.162319
Kurtosis -0.334093298
Skewness 0.924330473
Range 190
Minimum 425
Maximum 615
Sum 34356
Count 70
Types of measures
◼ Measures of Central tendency
◼ Measures of Dispersion
◼ Measures of relationship
Measures of Central tendency
Scores: 100 91 85 84 75 72 72 69 65
Mode = 72
Median = 75
Mean = 79.22
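The three measures for the nine scores above can be checked with Python's standard `statistics` module. This is a small sketch added for illustration; the variable name `scores` is an assumption.

```python
from statistics import mean, median, mode

# The nine scores listed on the slide.
scores = [100, 91, 85, 84, 75, 72, 72, 69, 65]

print(round(mean(scores), 2))  # 79.22
print(median(scores))          # 75 (middle value of the sorted scores)
print(mode(scores))            # 72 (the only value that occurs twice)
```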
What measure to use: Mean, Median, Mode
* Mode may not be a good representation if the data set is not normal
Measures of Dispersion
◼ Measures of Dispersion
◼ Describes the spread of the data (how
scores are scattered or dispersed)
Range
◼ The range is calculated by taking the maximum value and
subtracting the minimum value.
Data: 2 4 6 8 10 12 14
Range = 14 - 2 = 12
◼ If we divide the ordered scores into four parts with equal numbers of observations, the dividing values are called
“quartiles”
◼ When we divide them into 10 equal parts, the dividing values are called “deciles”
◼ When we divide them into 100 equal parts, the dividing values are called “percentiles”
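Measures of Dispersion
Percentiles
Arrange the data in ascending order.
Compute the index i = (p/100)n, where p is the percentile of interest and n is the number of observations.
If i is not an integer, round up; the p-th percentile is the value in position i.
If i is an integer, the p-th percentile is the average of the values in positions i and i + 1.
80th Percentile
◼ Example: Apartment Rents
i = (p/100)n = (80/100)(70) = 56
Average the 56th and 57th values: 80th percentile = (535 + 549)/2 = 542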
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Quartiles
Third Quartile
◼ Example: Apartment Rents
Third quartile = 75th percentile
i = (p/100)n = (75/100)(70) = 52.5, rounded up to 53
Third quartile = 525
Note: Data is in ascending order.
Measures of Dispersion
IQR
◼ Example: Apartment Rents
3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
Interquartile Range = Q3 - Q1 = 525 - 445 = 80
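The percentile, quartile and IQR figures for the apartment rents can be reproduced with the index rule i = (p/100)n. The sketch below is illustrative only: the helper name `percentile` is hypothetical, p is assumed to be a whole number strictly between 0 and 100, and the rule of averaging positions i and i + 1 when i is an integer is the common textbook convention, assumed here.

```python
# The 70 apartment rents from the slides, already in ascending order.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]


def percentile(sorted_data, p):
    """p-th percentile via the index rule i = (p/100) * n (hypothetical helper)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(sorted_data)
    # Integer arithmetic avoids floating-point surprises: i = whole + remainder/100.
    whole, remainder = divmod(p * n, 100)
    if remainder:
        # i is not an integer: round up to 1-based position whole + 1.
        return sorted_data[whole]
    # i is an integer: average the values in 1-based positions i and i + 1.
    return (sorted_data[whole - 1] + sorted_data[whole]) / 2


p80 = percentile(rents, 80)  # i = 56, average of 56th and 57th values
q3 = percentile(rents, 75)   # i = 52.5, rounded up to position 53
q1 = percentile(rents, 25)   # i = 17.5, rounded up to position 18
iqr = q3 - q1

print(p80, q3, q1, iqr)      # 542.0 525 445 80
```

Library routines such as `statistics.quantiles` use interpolation rules that can give slightly different values, so results need not match this index rule exactly.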
Standard Deviation
A measure of how widely the data points tend to diverge from the mean. A small
standard deviation indicates that most values are close to the mean, and a large
standard deviation indicates that they lie much further above or below it. The
basic idea is to sum up how different the individual data points are
from the average. Simply summing the individual differences does not work, because
some values are less than the mean and others are greater, so the differences
cancel out. The way around that is to square the differences,
because the square of any nonzero number is positive. After adding the squares
together and dividing by n - 1, we take the square root, which returns the value to
the original units of the data.
Measures of Dispersion
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
The standard deviation, S, is the positive square root of the variance.
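As a worked check of the variance formula, the sketch below applies it step by step to the seven values from the range example (2, 4, ..., 14) and compares the result with Python's `statistics` module. The variable names are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev, variance

data = [2, 4, 6, 8, 10, 12, 14]

x_bar = mean(data)                                   # 8
squared_deviations = [(x - x_bar) ** 2 for x in data]
s2 = sum(squared_deviations) / (len(data) - 1)       # 112 / 6
s = sqrt(s2)                                         # positive square root

print(round(s2, 3), round(s, 3))                     # 18.667 4.32
# The library functions use the same n - 1 divisor.
print(round(variance(data), 3), round(stdev(data), 3))
```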
Measures of Dispersion
[Figure: normal curve with the horizontal axis marked at μ, μ ± σ, μ ± 2σ and μ ± 3σ]
1) The normal distribution is completely specified by two
parameters (μ and σ).
2) μ determines the location and σ the spread.
3) The normal curve is bell shaped.
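Because μ and σ fully determine a normal distribution, the share of values lying within k standard deviations of the mean is the same for every normal curve. The sketch below illustrates this with `statistics.NormalDist` from the Python standard library; the helper name `share_within` and the second parameter pair are assumptions chosen only for illustration.

```python
from statistics import NormalDist


def share_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal value lies within k sigma of mu (hypothetical helper)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    # Same share for the standard normal and for an arbitrary mu and sigma.
    print(k, round(share_within(k), 4), round(share_within(k, mu=490.8, sigma=54.7), 4))
# 1 0.6827 0.6827
# 2 0.9545 0.9545
# 3 0.9973 0.9973
```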
Normal Distribution
◼ Measures of asymmetry
◼ When the distribution of items in a series is perfectly
symmetrical, as in the normal distribution, the curve is perfectly bell
shaped. If the curve is distorted (whether on the right side or on the
left side), the distribution is asymmetrical, which indicates
skewness.
$$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$
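The skewness formula translates directly into code. The following is a minimal sketch, assuming s is the sample standard deviation (n - 1 divisor); the function name `sample_skewness` and the two small data sets are hypothetical examples, not data from the deck.

```python
from statistics import mean, stdev


def sample_skewness(data):
    """Skewness = n / ((n - 1)(n - 2)) * sum(((x_i - x_bar) / s) ** 3)."""
    n = len(data)
    if n < 3:
        raise ValueError("at least three observations are required")
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (n - 1 divisor)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in data)


symmetric = [2, 4, 6, 8, 10, 12, 14]   # deviations cancel, skewness near 0
right_tailed = [1, 2, 3, 4, 20]        # one large value pulls the tail right

print(round(sample_skewness(symmetric), 6))
print(sample_skewness(right_tailed) > 0)  # True: positive (right) skew
```

A positive value indicates a longer right tail and a negative value a longer left tail.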
Parametric Test
For example, one assumption for the one way ANOVA is that the data comes from a
normal distribution. If your data isn’t normally distributed, you can’t run an ANOVA,
but you can run the nonparametric alternative—the Kruskal-Wallis test.
Non parametric Test
A non parametric test (sometimes called a distribution free test) does not assume
anything about the underlying distribution (for example, that the data comes from
a normal distribution). In contrast, a parametric test makes assumptions
about a population’s parameters (for example, the mean or standard deviation).
When the word “non parametric” is used in stats, it doesn’t quite mean that you
know nothing about the population. It usually means that you know the population
data does not have a normal distribution.
Non parametric vs Parametric Test
What is a hypothesis:
Ordinarily, when one talks about a hypothesis, one simply means a mere
assumption or supposition to be proved or disproved.
When your sample contains sufficient evidence, you can reject the null
hypothesis and conclude that the effect is statistically significant. Statisticians
denote the null hypothesis as H0 and the alternative hypothesis as HA (or H1).
are the variables. Why call them variables? Because they are not constants—the
data for each of these variables vary for each case (think of a case as a person).
Breaking the mass of data down into variables is a first step in getting a handle on
Variables come in different types. The simplest way to break them down is to
Continuous variables, on the other hand, are numbers or numerical. Anything you
Well, that’s the central question. If you can figure out how variables
way things in the system under examination work. There are two
◼ Measures of relationship
◼ Describes the relationship of the data
X     Y     x = X - X̄   y = Y - Ȳ   x²      y²     xy
0     20    -150        -5          22500   25     750
50    20    -100        -5          10000   25     500
100   21    -50         -4          2500    16     200
150   22    0           -3          0       9      0
150   23    0           -2          0       4      0
150   24    0           -1          0       1      0
200   26    50          1           2500    1      50
200   28    50          3           2500    9      150
250   31    100         6           10000   36     600
250   35    100         10          10000   100    1000
X̄ = 150   Ȳ = 25   Σx² = 60000   Σy² = 226   Σxy = 3250

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{3250}{\sqrt{60000 \times 226}} = \frac{3250}{\sqrt{13560000}} = \frac{3250}{3682.39} = 0.88$$
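The correlation computed in the table can be reproduced from the raw X and Y columns. The sketch below follows the same deviation-from-mean steps; the function name `pearson_r` is an assumption made for illustration.

```python
from math import sqrt

# X and Y columns from the table.
X = [0, 50, 100, 150, 150, 150, 200, 200, 250, 250]
Y = [20, 20, 21, 22, 23, 24, 26, 28, 31, 35]


def pearson_r(xs, ys):
    """r = sum(xy) / sqrt(sum(x^2) * sum(y^2)), with x and y as deviations from the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [a - x_bar for a in xs]
    dy = [b - y_bar for b in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))  # 3250 for the table
    sum_x2 = sum(a * a for a in dx)              # 60000
    sum_y2 = sum(b * b for b in dy)              # 226
    return sum_xy / sqrt(sum_x2 * sum_y2)


r = pearson_r(X, Y)
print(round(r, 2))  # 0.88
```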
Measures of Relationship
[Figure: scatter diagram; vertical axis: productivity (units / labour hour), 0 to 35; horizontal axis: 0 to 300]
Measures of Relationship
Correlation
Scatter diagram
Correlation
▪ Correlating market data and business data is definitely a step in the right
direction. It shows the organization that we are pulling information together
and making important connections.
▪ Correlations are used to understand how data sets are related. In other
words, if variable “X” changes, does variable “Y” change?
Predictive Analytics
What can Predictive Analytics do in Business?
◼ What mixture of skills, experience, and competencies would most likely lead to
high performance?
◼ With this information, analysis can be applied to predict how successful different
courses of action will be.
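Simple linear Regression
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where β0 is the intercept, β1 is the slope, and ε is the error term.
[Figure: positive linear relationship; regression line of E(y) against x with intercept β0 and positive slope β1]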
Simple linear Regression Equation
[Figure: negative linear relationship; regression line of E(y) against x with intercept β0 and negative slope β1]
Simple linear Regression Equation
[Figure: no relationship; the regression line of E(y) against x is horizontal, so the slope β1 is 0]
Estimated Simple linear Regression Equation
$$\hat{y} = b_0 + b_1 x$$
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
where:
$x_i$ = value of independent variable for ith observation
$y_i$ = value of dependent variable for ith observation
$\bar{x}$ = mean value for independent variable
$\bar{y}$ = mean value for dependent variable
Least Squares method
$$b_0 = \bar{y} - b_1 \bar{x}$$
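The least squares formulas for the slope and intercept can be sketched as follows. As an assumption made purely for illustration, the example reuses the X and Y columns from the correlation table earlier in the deck; the function name `least_squares` is hypothetical.

```python
# Illustrative data: X and Y columns from the correlation table.
X = [0, 50, 100, 150, 150, 150, 200, 200, 250, 250]
Y = [20, 20, 21, 22, 23, 24, 26, 28, 31, 35]


def least_squares(xs, ys):
    """Return (b0, b1) for the estimated line y_hat = b0 + b1 * x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Slope: sum of cross-deviations over sum of squared x-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    # Intercept: the fitted line passes through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(X, Y)
print(round(b0, 3), round(b1, 4))  # 16.875 0.0542

# Predicted y for a new x value using the estimated equation.
y_hat = b0 + b1 * 180
```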
What is Linear Regression
What is Logistic Regression
Contact
Phone : 9600066166
Web : www.transbizconsulting.com
Twitter : Transbiz1
Linked in : linkedin.com/company/transbizconsulting
Facebook : facebook.com/transbizconsulting
Thank You