UNIT II: Statistics for Data Science
Data Science
By
Shilpa Sonawani
Population and sample
• Sampling: Population → Sample
• Inference: Sample → Population
Primary & Secondary Data
• Raw or primary data: data as collected, containing a lot of unnecessary, irrelevant & unwanted information
• Treated or secondary data: data after we treat it & remove this unnecessary, irrelevant & unwanted information
• Cooked data: data that was not collected genuinely and is false and fictitious
Ungrouped & Grouped Data
• Ungrouped data: data presented or observed individually. For example, the observed no. of children in 6 families:
2, 4, 6, 4, 6, 4
• Grouped data: identical values grouped by frequency. For example, the above data of children in 6 families can be grouped as:
No. of children   Families
2                 1
4                 3
6                 2
or alternatively we can make classes:
No. of children   Frequency
2-4               4
5-7               2
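The grouping step above can be sketched in a few lines of Python. This is only an illustration using the standard library; the variable names are made up for this example, and the class limits (2-4 and 5-7) are taken from the slide.

```python
from collections import Counter

# Ungrouped data from the slide: number of children in 6 families
children = [2, 4, 6, 4, 6, 4]

# Group identical values by frequency
frequency = Counter(children)
for value in sorted(frequency):
    print(value, frequency[value])

# Alternative: group into classes (class limits as on the slide)
classes = {"2-4": 0, "5-7": 0}
for c in children:
    if 2 <= c <= 4:
        classes["2-4"] += 1
    elif 5 <= c <= 7:
        classes["5-7"] += 1
print(classes)
```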
Variable
Types of Variable
• Independent variable: typically the variable representing the value being manipulated or changed. For example, smoking.
• Dependent variable: the observed result of the independent variable being manipulated. For example, cancer of the lung.
• Confounding variable: a variable associated with both exposure and disease. For example, age is a factor for many events.
Types of measurement
• Nominal
–Categorical variables. Numbers that are
simply used as identifiers or names
represent a nominal scale of
measurement such as female vs. male.
• Ordinal
– An ordinal scale of measurement
represents an ordered series of
relationships or rank order. Likert-type
scales (such as "On a scale of 1 to 10, with
one being no pain and ten being high pain,
how much pain are you in today?")
represent ordinal data.
Types of measurement
• For each dimension: numerical or categorical
• Two types of statistics
–Descriptive statistics
–Inferential statistics
Basic statistics
• Descriptive statistics:
–“are procedures used to
summarize, organize, and make
sense of a set of scores or
observations.”
• Inferential statistics:
–“are procedures used that allow
researchers to infer or generalize
observations made with samples to
the larger population from which
they were selected.”
Why Descriptive statistics?
• Who is a better ODI batsman - Sachin or Muralidharan?
• Batting average
• Who is more reliable - Dhoni or Afridi?
• Score variance
• A triangular series among Aus, Eng & New Zealand; who will win?
• Most number of wins - Mode
• I am going to buy shoes. Which brand has more variety - Power or Adidas?
• Price range - Range
• We used average, variance, mode, and range to make some inferences. These are nothing but descriptive statistics.
• Descriptive statistics tell us what happened in the past.
• Descriptive statistics avoid inferences, but they help us to get a feel for the data.
• Sometimes they are good enough to make an inference.
Descriptive Statistics
• A statistic or a measure that describes the data
• Average salary of employees
• Describing data with tables and graphs
(quantitative or categorical variables)
• Numerical descriptions
• Center –measures of center of the data
• Variability–measures of variability of the data
• Bivariate descriptions (In practice, most studies have several
variables)
• Dependency measures (correlation)
Descriptive statistics
• Mean
– $\bar{X}$ (or M) for the sample mean and $\mu$ for the population mean
– $\bar{X}$ (x bar) $= \frac{\sum x}{n}$
– $\sum x$ means the sum of all individual scores $x_1, \dots, x_n$
– $n$ means the number of scores
Descriptive statistics
Score (X) Frequency (f) fX
60 1 60
65 2 130
70 3 210
75 4 300
80 5 400
85 4 340
90 3 270
95 2 190
100 1 100
Sum 25 2000
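For a frequency table, the mean is the sum of fX divided by the sum of f. A minimal sketch of that arithmetic for the table above (the variable names are chosen only for this illustration):

```python
# (score X, frequency f) pairs from the table above
table = [(60, 1), (65, 2), (70, 3), (75, 4), (80, 5),
         (85, 4), (90, 3), (95, 2), (100, 1)]

n = sum(f for _, f in table)            # total number of scores (sum of f)
sum_fx = sum(x * f for x, f in table)   # sum of f * X
mean = sum_fx / n                       # mean = sum(fX) / sum(f)

print(n, sum_fx, mean)
```

The totals agree with the "Sum" row of the table (25 and 2000).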
Descriptive statistics
• Distribution of Example 1
Mean = 80
Descriptive statistics
• Median
– Data: 2, 3, 4, 5, 7, 10, 80. Mean of those
scores is 15.86.
– 80 is an outlier.
– Mean fails to reflect most of the data. We
use median instead of mean to remove the
influence of an outlier.
– Median is the middle value in a distribution
of data listed in a numeric order.
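The effect of the outlier can be checked with Python's standard `statistics` module. This is a small sketch using the seven scores from the slide; the comparison without the outlier is added here only for illustration.

```python
import statistics

data = [2, 3, 4, 5, 7, 10, 80]   # 80 is the outlier

mean = statistics.mean(data)      # pulled upward by the outlier
median = statistics.median(data)  # middle value of the sorted data

# Same data without the outlier, for comparison
without_outlier = [x for x in data if x != 80]
mean_without = statistics.mean(without_outlier)

print(round(mean, 2), median, round(mean_without, 2))
```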
Descriptive statistics
• Median
–Position of median = $\frac{n+1}{2}$
–For an odd-numbered sample size: 3, 6, 5, 3, 8, 6, 7. First place each score in numeric order: 3, 3, 5, 6, 6, 7, 8.
–Position = 4, so median = 6
Descriptive statistics
• Median
–For an even-numbered sample size: 3, 6, 5, 3, 8, 6. First place each score in numeric order: 3, 3, 5, 6, 6, 8.
–Position = 3.5, so median = $\frac{5+6}{2} = 5.5$
• Example 2: we want to know the average salary of 36 cases.
Descriptive statistics
Salary Frequency
$20k 1
$25k 2
$30k 3
$35k 4
$40k 5
$45k 6
$50k 5
$55k 4
$200k 3
$205k 2
$210k 1
Total 36
Descriptive statistics
• Median = ?
• Position 18.5
• Which number is at position 18.5?
• Median = $45k
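One way to verify the median of Example 2 is to expand the frequency table into 36 individual salaries and take the middle position. The sketch below assumes salaries are in thousands of dollars, as in the table; the mean is computed only to show how far the high salaries pull it away from the median.

```python
import statistics

# (salary in $k, frequency) pairs from the table above
salary_table = [(20, 1), (25, 2), (30, 3), (35, 4), (40, 5), (45, 6),
                (50, 5), (55, 4), (200, 3), (205, 2), (210, 1)]

# Expand the table into one entry per case (already in ascending order)
salaries = [s for s, f in salary_table for _ in range(f)]

n = len(salaries)
position = (n + 1) / 2                 # position of the median
median = statistics.median(salaries)   # average of the 18th and 19th values
mean = statistics.mean(salaries)

print(n, position, median, round(mean, 2))
```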
Descriptive statistics
• Mode
–The value in a data set that occurs
most often or most frequently.
–Example: 2,3,3,3,4,4,4,4,7,7,8,8,8.
Mode = 4
Range
• Range = Max – Min
• R: range(x)
unit 1: 9.7, 11.5, 11.6, 12.1, 12.4, 12.6, 13.1, 13.5, 13.6, 14.8, 16.3, 26.9
unit 2: 9.0, 11.2, 11.3, 11.7, 12.2, 12.5, 13.2, 13.8, 14.0, 15.5, 15.6, 16.2, 16.4
Exercise
Descriptive statistics
• Dispersion
–Range
–Variance
–Standard deviation
• Range
–It is the difference between the
largest value and smallest
value.
–It is informative for data
without outliers.
• Variance
–It measures the average squared
distance that scores deviate from
their mean.
–Sample variance: $s^2$ (population variance: $\sigma^2$, sigma squared)
Descriptive statistics
• Summary
–When individual scores are close to mean,
the standard deviation (SD) is smaller.
When individual scores are spread out far
from the mean, the standard deviation is
larger.
–SD is never negative (it is zero only when all scores are equal)
–It is typically reported with mean.
Standard Deviation
A hen lays eight eggs. Each egg was weighed and recorded as follows:
• 60 g, 56 g, 61 g, 68 g, 51 g, 53 g, 69 g, 54 g.
• Calculate Mean and Standard Deviation.
Example with Solution
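A short sketch of the egg-weight calculation with Python's `statistics` module. The slide does not say whether the eight eggs are treated as a whole population (divide by n) or as a sample (divide by n - 1), so both versions are shown; which one the original solution used is an assumption left open here.

```python
import statistics

weights = [60, 56, 61, 68, 51, 53, 69, 54]   # egg weights in grams

mean = statistics.mean(weights)

# Sum of squared deviations from the mean
squared_dev = sum((w - mean) ** 2 for w in weights)

population_sd = statistics.pstdev(weights)   # divides by n
sample_sd = statistics.stdev(weights)        # divides by n - 1

print(mean, squared_dev, round(population_sd, 2), round(sample_sd, 2))
```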
Coefficient of variation
• p = 50: median
• p = 25: lower quartile (LQ)
• p = 75: upper quartile (UQ)
• Interquartile range IQR = UQ - LQ
Box Plots
• Quartiles portrayed graphically by box plots
Boxplots
• For numerical data
Box Plot calculation and interpretation
The following data are the heights of 40 students in a statistics class.
59 60 61 62 62 63 63 64 64 64 65 65 65 65 65 65 65 65 65 66 66 67 67 68 68 69 70 70 70 70 70 71 71
72 72 73 74 74 75 77
Construct a box plot with the following properties; the calculator instructions for the minimum and
maximum values as well as the quartiles follow the example.
• Minimum value = 59
• Maximum value = 77
• Q1, first quartile = 64.5
• Q2, second quartile or median = 66
• Q3, third quartile = 70
1. Each quarter has approximately 25% of the data.
2. The spreads of the four quarters are 64.5 – 59 = 5.5 (first quarter), 66 – 64.5 = 1.5 (second quarter), 70 – 66 = 4 (third quarter), and 77 – 70 = 7 (fourth quarter). So, the second quarter has the smallest spread and the fourth quarter has the largest spread.
3. Range = maximum value – minimum value = 77 – 59 = 18
4. Interquartile range: IQR = Q3 – Q1 = 70 – 64.5 = 5.5.
5. The interval 59–65 has more than 25% of the data so it has
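The five values used for the box plot can be reproduced with a small Python sketch. It assumes the "median of each half" rule for the quartiles, which matches the values on the slide; other quartile conventions (for example the interpolation used by default in many libraries) can give slightly different Q1 and Q3. The function name is made up for this illustration.

```python
import statistics

heights = [59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65,
           65, 65, 65, 65, 66, 66, 67, 67, 68, 68, 69, 70, 70, 70, 70,
           70, 71, 71, 72, 72, 73, 74, 74, 75, 77]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]                  # values below the median position
    upper = s[half + len(s) % 2:]     # values above the median position
    return (s[0], statistics.median(lower), statistics.median(s),
            statistics.median(upper), s[-1])

minimum, q1, q2, q3, maximum = five_number_summary(heights)
iqr = q3 - q1                 # interquartile range
data_range = maximum - minimum

print(minimum, q1, q2, q3, maximum, iqr, data_range)
```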
Summaries for numerical data
• Normal distribution
–Probability: the frequency of times an
outcome is likely to occur divided by
the total number of possible
outcomes.
• It varies between 0 and 1.
• Example (next slide)
Descriptive statistics
• Probability
Fail Pass Total
Male 3 2 5
Female 1 4 5
Total 4 6 10
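Probabilities can be read off the table as a count divided by the total. The sketch below is one possible reading of the table; the particular events chosen (pass, fail, female and pass) are examples picked for this illustration, not necessarily the ones used in the lecture.

```python
from fractions import Fraction

# Counts from the table above: (gender, result) -> number of students
counts = {
    ("Male", "Fail"): 3, ("Male", "Pass"): 2,
    ("Female", "Fail"): 1, ("Female", "Pass"): 4,
}
total = sum(counts.values())

# Probability = frequency of the outcome / total number of outcomes
p_pass = Fraction(sum(v for (_, r), v in counts.items() if r == "Pass"), total)
p_fail = Fraction(sum(v for (_, r), v in counts.items() if r == "Fail"), total)
p_female_and_pass = Fraction(counts[("Female", "Pass")], total)

print(p_pass, p_fail, p_female_and_pass)
```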
Descriptive statistics
• Normal curve
Histograms
• Calculate the mean and median using the formulas.
• Plot a histogram of the scores and find the mean and median.
[Figure: histogram with mean and sd]
Normal Distribution
• The standard normal distribution is a normal distribution with mean = 0 and standard deviation = 1.
• A Z score is a value on the x-axis of a standard normal distribution.
• We can take any normal distribution and convert it to the standard normal distribution.
Z Score
• z = (x – μ) / σ
– x observation
– μ mean
– σ standard deviation
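What is the probability that any observed value is less than 105? Greater than 105?
x = 105, μ = 100, σ = 5. Find the Z score & the probability under the curve.
Z = (105 – 100) / 5 = 1
From the standard normal table, P(Z < 1) ≈ 0.8413, so P(X < 105) ≈ 0.84 and P(X > 105) ≈ 0.16.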
https://www.mathsisfun.com/data/standard-normal-distribution-table.html
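Instead of a printed table, the same lookup can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8+). The numbers are those of the example above.

```python
from statistics import NormalDist

x, mu, sigma = 105, 100, 5

z = (x - mu) / sigma                        # z score of the observation
p_less = NormalDist(mu, sigma).cdf(x)       # P(X < 105)
p_greater = 1 - p_less                      # P(X > 105)

# The same probability from the standard normal distribution (mean 0, sd 1)
p_less_standard = NormalDist().cdf(z)

print(z, round(p_less, 4), round(p_greater, 4))
```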
Z-score Example
A father has two sons and wants to know which of them scored better on his standardized test relative to the other test takers: Ram, who earned an 1800 on his SAT, or Sham, who scored a 24 on his ACT exam?
We cannot simply compare the two scores, because they are measured on different scales.
So the father looks at how many standard deviations each son's score lies from the mean of its own distribution.
Ram = (1800 – 1500) / 300 = 1 standard deviation above the mean
Sham = (24 – 21) / 5 = 0.6 standard deviations above the mean
Now the father can conclude that Ram did indeed score better than Sham.
Example
Exercise
Contingency Tables
• Cross classifications of categorical variables in which rows (typically)
represent categories of explanatory variable and columns
represent categories of response variable.
• Counts in “cells” of the table give the numbers of individuals at the
corresponding combination of levels of the two variables
Example: Happiness and Family Income of 1993 families (GSS 2008 data:
“happy,” “finrela”)
                    Happiness
Income         Very   Pretty   Not too   Total
-----------------------------------------------
Above Aver.     164      233        26     423
Average         293      473       117     883
Below Aver.     132      383       172     687
-----------------------------------------------
Total           589     1089       315    1993
Contingency tables
• Example: Percentage “very happy” is
• 39% for above average income (164/423 = 0.39)
• 33% for average income (293/883 = 0.33)
• What percent for below average income?
                       Happiness
Income       Very        Pretty      Not too     Total
-------------------------------------------------------
Above        164 (39%)   233 (55%)    26 (6%)     423
Average      293 (33%)   473 (54%)   117 (13%)    883
Below        132 (19%)   383 (56%)   172 (25%)    687
-------------------------------------------------------
• What can we conclude? Does happiness depend on income, or is happiness independent of income?
• Inference questions for later chapters?
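The row percentages in the table can be recomputed from the counts. A minimal sketch (the dictionary layout and names are chosen only for this illustration):

```python
# Counts from the contingency table: income -> [very, pretty, not too] happy
table = {
    "Above average": [164, 233, 26],
    "Average": [293, 473, 117],
    "Below average": [132, 383, 172],
}

# Percentage of each happiness level within each income group (row percentages)
row_percent = {}
for income, counts in table.items():
    total = sum(counts)
    row_percent[income] = [round(100 * c / total) for c in counts]

for income, percents in row_percent.items():
    print(income, percents)
```

The first entry of the "Below average" row answers the question asked above: 132/687, about 19% "very happy".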
Correlation
• Correlation describes strength of association between
two variables
• Falls between -1 and +1, with sign indicating direction of
association (formula & other details later )
• The larger the correlation in absolute value, the stronger
the association (in terms of a straight line trend)
• Examples: (positive or negative, how strong?)
• Mental impairment and life events, correlation =
• GDP and fertility, correlation =
• GDP and percent using Internet, correlation =
Calculating Correlation
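The formula for this slide did not survive, so the following is only a sketch of one standard way to compute the (Pearson) correlation: the covariance term of x and y divided by the square root of the product of their sums of squared deviations. The data are made up for illustration and are not from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data (not from the slides)
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # rises with x: perfect positive association
y_down = [10, 8, 6, 4, 2]    # falls with x: perfect negative association

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
```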
Regression
• Regression analysis gives the line predicting y using x (algorithm & other details later)
Calculating Covariance
If your data is not normal…
Scatter Plot
Box Plot
• Box plot of Q6 by Q2
Lab
• Run proc univariate on a variable from sample data in the SAS default library (prdsale / cars)
• Run proc means on the actual & predicted variables from the product sales data
• What are the values of range, variance, and SD?
• What are the 1st, 2nd, 3rd & 4th quartile values?
• What is the 95th percentile?
• Use the “all” option to display the box plots
• Create a contingency table for the product sales data
• Find contingency tables for
• Region by product type
• Division by product type
• Find the correlation between actual sales and predicted sales.
• Find the correlation between weight & msrp in the cars data
Binomial distribution
• The binomial distribution is a discrete probability distribution giving the probabilities of the different values of the binomial random variable X in n repeated independent trials of an experiment.
• Thus, in an experiment consisting of tossing a coin 10 times (n = 10), the binomial random variable (the number of heads, counted as successes) can take the values 0 to 10, and the binomial probability distribution gives the probability of the random variable taking each of the values 0 to 10.
• The probability that a random variable X with binomial distribution B(n, p) is equal to the value k, where k = 0, 1, …, n, is given by the following formula:
• $P(X = k) = \frac{n!}{k!\,(n-k)!}\, p^{k} (1-p)^{n-k}$
• The mean and the variance of the binomial distribution of an experiment with n trials and probability of success p in each trial are as follows:
• Mean = np
• Variance = np(1-p)
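The formula and the mean/variance results above can be checked numerically. The sketch below uses the coin-tossing example with n = 10 and assumes a fair coin (p = 0.5), which the slide does not state explicitly; `binomial_pmf` is a helper name introduced for this illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5   # 10 tosses; fair coin is an assumption

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed directly from the distribution
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(binomial_pmf(5, n, p), mean, variance)   # compare with np and np(1-p)
```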
Binomial distribution
• Rolling a die: the probability of getting a six a given number of times (0, 1, 2, 3, …, 50) while rolling a die 50 times.
• Here, the random variable X is the number of “successes”, that is, the number of times a six occurs.
• The probability of getting a six is 1/6. The binomial distribution can be represented as B(50, 1/6).
• The diagram below represents the binomial distribution for 100 experiments.
Binomial distribution
$P(X = x) = {}^{n}C_{x}\, P^{x} q^{\,n-x}$
Where:
• P is the probability of success on any trial
• q = 1 – P is the probability of failure
• n is the number of trials/experiments
• x is the number of successes; it can take the values 0, 1, 2, 3, …, n
• nCx = n!/(x!(n–x)!) denotes the number of combinations of n elements taken x at a time
Examples of binomial distribution problems:
Here,
λ is the average number of occurrences
x is a Poisson random variable
e is the base of the natural logarithm, e = 2.71828 (approx).
Poisson distribution
Question: On average 3 students attend the class (only 3 students came today). Find the probability that exactly 4 students attend the class tomorrow.
Solution:
Given,
Average rate of value (λ) = 3
Poisson random variable (x) = 4
Poisson distribution: $P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}$
$P(X = 4) = \frac{e^{-3}\cdot 3^{4}}{4!}$
$P(X = 4) = 0.16803135574154$
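The arithmetic of this solution can be reproduced with a few lines of Python. `poisson_pmf` is a helper name introduced for this sketch.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with average rate lam."""
    return exp(-lam) * lam ** x / factorial(x)

lam = 3   # average number of students attending
x = 4     # number of students we ask about

p4 = poisson_pmf(x, lam)
print(p4)
```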
Poisson probability distribution
• Poisson probability distribution is used in situations where events occur
randomly and independently a number of times on average during an
interval of time or space.
• The random variable X associated with a Poisson process is discrete and therefore the Poisson distribution is discrete.
• These are examples of events that may be described as Poisson
processes:
• My computer crashes on average once every 4 months.
• Hospital emergencies receive on average 5 very serious cases every 24 hours.
• The number of cars passing through a point, on a small road, is on average 4 cars every 30
minutes.
• I receive on average 10 e-mails every 2 hours.
• Customers make on average 10 calls every hour to the customer help center
Poisson distribution
Where,
μ = Population mean
σ = Population standard deviation
$\mu_{\bar{x}}$ = Sample mean
$\sigma_{\bar{x}}$ = Sample standard deviation
n = Sample size
Example
Question: The weights of the male population follow the normal distribution. Its mean and standard deviation are 70 kg and 15 kg respectively. If a researcher considers the records of 50 males, then what would be the mean and standard deviation of the chosen sample?
Solution:
Mean of the population μ = 70 kg
Standard deviation of the population σ = 15 kg
Sample size n = 50
Mean of the sample is given by:
$\mu_{\bar{x}} = \mu = 70$ kg
Standard deviation of the sample mean is given by:
$\sigma_{\bar{x}} = \sigma/\sqrt{n} = 15/\sqrt{50}$
$\sigma_{\bar{x}} \approx 2.121 \approx 2.1$ kg
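A minimal sketch of this calculation in Python, using the relations from the solution (the mean of the sample mean equals the population mean, and its standard deviation is the population standard deviation divided by the square root of n). Variable names are chosen for this illustration.

```python
import math

mu = 70      # population mean (kg)
sigma = 15   # population standard deviation (kg)
n = 50       # sample size

mean_of_sample_mean = mu                # mean of the sample mean
sd_of_sample_mean = sigma / math.sqrt(n)   # sigma / sqrt(n)

print(mean_of_sample_mean, round(sd_of_sample_mean, 3))
```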
Confidence Interval
You survey 100 Brits and 100 Americans about their television-watching habits, and find that both groups watch an average of 35 hours of television per week.
In the TV-watching example, the point estimate is the mean number of hours watched: 35.
In the survey of Americans’ and Brits’ television watching habits, we can use the sample mean, sample standard deviation, and sample size in place of the population mean, population standard deviation, and population size.
To calculate the 95% confidence interval, we can simply plug the values into the formula.
For the USA:
So for the USA, the lower and upper bounds of the 95% confidence interval are 34.02 and 35.98.
For GB:
So for GB, the lower and upper bounds of the 95% confidence interval are 33.04 and 36.96.
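The formula and the standard deviations for this example are not shown in the text above. The sketch below assumes the usual large-sample interval, mean ± 1.96 × s/√n, and assumes sample standard deviations of 5 hours (USA) and 10 hours (GB); these two values are inferred here from the stated bounds, not quoted from the slides.

```python
import math

def confidence_interval(mean, sd, n, z=1.96):
    """Large-sample confidence interval: mean +/- z * sd / sqrt(n)."""
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

# sd values are assumptions back-calculated from the bounds in the text
usa = confidence_interval(mean=35, sd=5, n=100)
gb = confidence_interval(mean=35, sd=10, n=100)

print([round(b, 2) for b in usa], [round(b, 2) for b in gb])
```

Both groups have the same point estimate, but the wider interval for GB reflects the larger assumed spread in that sample.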
Lab Practice
• Compute minimum, 25th percentile, median,
75th, and max of a numeric series
# Input
import numpy as np
state = np.random.RandomState(100)