Univariate and Bivariate Data Analysis + Probability

This document introduces key concepts in probability and data analysis including: 1. Types of data (categorical, numerical), univariate and bivariate data analysis techniques (graphs, measures of central tendency, dispersion), and the normal distribution. 2. Probability concepts such as sample space, events, operations on events, and axioms of probability including the addition and multiplication rules. 3. Descriptive statistics for both categorical and numerical data including graphs, measures of central tendency, and measures of dispersion.

Uploaded by

Basoko_Leaks

PROBABILITY AND DATA ANALYSIS DATA SCIENCE AND ENGINEERING

INTRODUCTION

DATA (variable)

CATEGORICAL (QUALITATIVE)
• ORDINAL: Clothes size (XL, L, M, S)
• NOMINAL: Blood type (A, B, AB, 0)

NUMERICAL (QUANTITATIVE)
• DISCRETE: # of children (0, 1, 2, ...)
• CONTINUOUS: Height (1.55, 1.71, ...)

TYPES OF FREQUENCY
• Absolute: Number of times the value appeared in the sample.
• Relative: Proportion of times the value appeared in the sample.
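As a small illustration of the two frequency types, the sketch below counts a made-up sample of blood types. The sample values and variable names are assumptions chosen only for this example.

```python
from collections import Counter

# Hypothetical sample of a nominal variable (blood type); the values are made up.
sample = ["A", "0", "A", "B", "0", "A", "AB", "0", "A", "0"]

# Absolute frequency: number of times each value appears in the sample.
absolute = Counter(sample)

# Relative frequency: proportion of times each value appears in the sample.
n = len(sample)
relative = {value: count / n for value, count in absolute.items()}

print(dict(absolute))
print(relative)
```

The relative frequencies always add up to 1, which is a quick sanity check on any frequency table.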

UNIVARIATE DATA ANALYSIS


GRAPHICAL PRESENTATION OF DATA
CATEGORICAL: Piechart and barchart
NUMERICAL: Histogram, polygon, boxplot

DESCRIBING DATA NUMERICALLY


CENTRAL TENDENCY
1. MEAN: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
2. MEDIAN (ordered list): $M = x_{((n+1)/2)}$ if $n$ is odd; $M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$ if $n$ is even
3. MODE: The value that occurs most often. Not affected by outliers.
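The three measures above can be sketched in a few lines. The data set is hypothetical, and the helper names `mean` and `median` are introduced here only for illustration; the mode uses the standard-library `statistics.mode`.

```python
import statistics

data = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical sample, n = 8 (even)

def mean(xs):
    # Sum of the observations divided by their number.
    return sum(xs) / len(xs)

def median(xs):
    ordered = sorted(xs)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the observation at 1-based position (n + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the observations at 1-based positions n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

sample_mean = mean(data)
sample_median = median(data)
sample_mode = statistics.mode(data)  # most frequent value

print(sample_mean, sample_median, sample_mode)
```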

LOCATION
a) QUARTILES: Split the ordered data into four segments with the same number of observations.
• The first quartile $Q_1$ has position $\frac{1}{4}(n+1)$
• The second quartile $Q_2$ has position $\frac{1}{2}(n+1)$
• The third quartile $Q_3$ has position $\frac{3}{4}(n+1)$
b) PERCENTILES: The $p$th percentile is the value at position $\frac{p}{100}(n+1)$ of the ordered data set.
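A minimal sketch of the position rule follows. The notes do not say what to do when the position is not a whole number, so the linear interpolation used here is an assumption, as are the function names and the sample data.

```python
def value_at_position(ordered, pos):
    """Value at a 1-based, possibly fractional, position of an ordered list.

    Assumption: fractional positions are resolved by linear interpolation
    between the two neighbouring observations; positions are clamped to [1, n].
    """
    n = len(ordered)
    pos = min(max(pos, 1), n)
    lower = int(pos)       # whole part of the position (pos >= 1, so this floors)
    frac = pos - lower     # fractional part of the position
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def percentile(xs, p):
    # Position p/100 * (n + 1) in the ordered data.
    ordered = sorted(xs)
    return value_at_position(ordered, p / 100 * (len(ordered) + 1))

data = [10, 2, 7, 4, 9, 5, 4]  # hypothetical sample, n = 7
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
print(q1, q2, q3)
```

With n = 7 the quartile positions are 2, 4 and 6, so no interpolation is needed; other sample sizes generally give fractional positions.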

VARIATION
1. RANGE: $R = x_{\max} - x_{\min}$
2. INTERQUARTILE RANGE (IQR): $IQR = Q_3 - Q_1$ (3rd quartile minus 1st quartile)
3. OUTLIERS are observations that fall:
• Below the value of $Q_1 - 1.5 \cdot IQR$
• Above the value of $Q_3 + 1.5 \cdot IQR$
4. For EXTREME OUTLIERS, replace 1.5 by 3 in the above definition.
5. VARIANCE
5.1. SAMPLE VARIANCE: $\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n}$
5.2. SAMPLE QUASI-VARIANCE: $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$
6. STANDARD DEVIATION (SD)
6.1. SAMPLE ST. DEV. ($\hat{\sigma}$): $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$
6.2. SAMPLE QUASI-ST. DEV. ($s$): $s = \sqrt{s^2}$
7. COEFFICIENT OF VARIATION (CV): $CV = \frac{s}{|\bar{x}|}$
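The measures of variation can be sketched as follows. The data, the helper names `dispersion` and `outliers`, and the quartile values passed in (taken from the (n+1) position rule for this particular sample) are assumptions for illustration. The standard-library functions `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n - 1) are used only as a cross-check.

```python
import math
import statistics

def dispersion(xs):
    n = len(xs)
    x_bar = sum(xs) / n
    sum_sq = sum(x * x for x in xs)
    # Shortcut formulas: (sum of squares - n * mean^2) over n or n - 1.
    variance = (sum_sq - n * x_bar ** 2) / n
    quasi_variance = (sum_sq - n * x_bar ** 2) / (n - 1)
    s = math.sqrt(quasi_variance)
    return {
        "range": max(xs) - min(xs),
        "variance": variance,
        "quasi_variance": quasi_variance,
        "sd": math.sqrt(variance),
        "quasi_sd": s,
        "cv": s / abs(x_bar),
    }

def outliers(xs, q1, q3, k=1.5):
    # Observations outside [Q1 - k*IQR, Q3 + k*IQR]; k = 3 gives extreme outliers.
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < low or x > high]

data = [2, 4, 4, 5, 7, 9, 30]  # hypothetical sample; Q1 = 4 and Q3 = 9 for n = 7
stats = dispersion(data)
mild = outliers(data, q1=4, q3=9)
extreme = outliers(data, q1=4, q3=9, k=3)

# Cross-check against the standard library.
reference = (statistics.pvariance(data), statistics.variance(data))
print(stats, mild, extreme, reference)
```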

STANDARDIZATION
To standardize a variable means to calculate: $z = \frac{x - \bar{x}}{s}$
If you apply this formula to all observations $x_1, \ldots, x_n$ and call the transformed ones $z_1, \ldots, z_n$,
then the mean of the z's is 0 and their standard deviation is 1.
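The property can be checked numerically. This sketch assumes that $s$ is the quasi-standard deviation, which is what `statistics.stdev` computes; the data are made up.

```python
import statistics

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # hypothetical sample

x_bar = statistics.mean(data)
s = statistics.stdev(data)  # quasi-standard deviation (divides by n - 1)

# Standardize every observation.
z = [(x - x_bar) / s for x in data]

z_mean = statistics.mean(z)
z_sd = statistics.stdev(z)
print(z_mean, z_sd)  # approximately 0 and 1, up to floating-point rounding
```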

BIVARIATE DATA ANALYSIS


JOINT ABSOLUTE/RELATIVE FREQUENCY DISTRIBUTION

CONDITIONAL FREQUENCY DISTRIBUTION

MEASURES OF LINEAR ASSOCIATION

• SAMPLE COVARIANCE: $s_{xy} = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right)$

If the covariance is "much larger/smaller than 0", there is a positive/negative linear relationship between the variables.
If the covariance is "small", then either:
I. There is no linear relationship, or
II. The relationship is nonlinear.

• CORRELATION: $r_{(x,y)} = \frac{s_{xy}}{s_x s_y}$

The correlation is bounded: $-1 \le r_{(x,y)} \le 1$
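Both measures can be sketched directly from the formulas. The data are hypothetical: one exactly linear relationship and one parabola, chosen to show that a covariance of zero does not rule out a nonlinear relationship. The function names are assumptions.

```python
import math

def covariance(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Shortcut formula: (sum of products - n * mean_x * mean_y) / (n - 1).
    return (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (n - 1)

def correlation(xs, ys):
    s_x = math.sqrt(covariance(xs, xs))  # s_xx is the quasi-variance of x
    s_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (s_x * s_y)

x = [1, 2, 3, 4, 5]
y_linear = [2 * v + 1 for v in x]        # exact increasing line
y_parabola = [(v - 3) ** 2 for v in x]   # nonlinear, symmetric around x = 3

print(covariance(x, y_linear), correlation(x, y_linear))
print(covariance(x, y_parabola), correlation(x, y_parabola))
```

The parabola depends entirely on x, yet its covariance with x is zero, which is case II above.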

PROBABILITY
BASIC CONCEPTS
o SAMPLE SPACE ($\Omega$): The set of all possible outcomes of a random experiment.
o INTERSECTION ($\cap$): $A \cap B$ is the set of all elementary events of $\Omega$ that belong to both A and B.
o UNION ($\cup$): $A \cup B$ is the set of all elementary events of $\Omega$ that belong to A, to B, or to both.
o MUTUALLY EXCLUSIVE: A and B are mutually exclusive if they have no common elementary events ($A \cap B = \emptyset$).
o COMPLEMENTARY ($\bar{A}$): The complement of an event A is the set of all elementary events of $\Omega$ that do not belong to A.

OPERATIONS’ PROPERTIES
Commutative: $A \cup B = B \cup A$, $A \cap B = B \cap A$
Associative: $A \cup (B \cup C) = (A \cup B) \cup C$, $A \cap (B \cap C) = (A \cap B) \cap C$
Neutral elements: $A \cup \emptyset = A$, $A \cap \Omega = A$
Distributive: $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$, $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$

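Complementation: $A \cup \bar{A} = \Omega$, $A \cap \bar{A} = \emptyset$

Idempotence: $A \cup A = A$, $A \cap A = A$

Absorption: $A \cup \Omega = \Omega$, $A \cap \emptyset = \emptyset$

Simplification: $A \cup (A \cap B) = A = A \cap (A \cup B)$

Complementary event: $\overline{(\bar{A})} = A$, $\bar{\Omega} = \emptyset$, $\bar{\emptyset} = \Omega$

DeMorgan’s Laws: $\overline{A \cup B} = \bar{A} \cap \bar{B}$, $\overline{A \cap B} = \bar{A} \cup \bar{B}$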
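Several of these identities can be spot-checked with Python sets. The sample space (one roll of a die) and the events A, B, C are assumptions chosen for illustration; a check on one example is a sanity test, not a proof.

```python
omega = frozenset(range(1, 7))  # sample space of one die roll (illustrative)
A = frozenset({2, 4, 6})
B = frozenset({4, 5, 6})
C = frozenset({1, 2})

def complement(event):
    # All elementary events of omega that do not belong to the event.
    return omega - event

checks = {
    "distributive": A | (B & C) == (A | B) & (A | C),
    "complementation": (A | complement(A) == omega)
                       and (A & complement(A) == frozenset()),
    "simplification": A | (A & B) == A == A & (A | B),
    "double_complement": complement(complement(A)) == A,
    "de_morgan_union": complement(A | B) == complement(A) & complement(B),
    "de_morgan_intersection": complement(A & B) == complement(A) | complement(B),
}

for name, holds in checks.items():
    print(name, holds)
```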

LAPLACE RULE
Consider an experiment with k elementary events, all equiprobable. Then the probability of a set A is defined as: $P(A) = \frac{1}{k} \times \text{size of } A$
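As a quick illustration, the rule can be applied to one roll of a fair die. The function name `laplace` and the events are assumptions for this sketch; `Fraction` keeps the results exact.

```python
from fractions import Fraction

def laplace(event, omega):
    # Favourable elementary events divided by the k equiprobable ones.
    return Fraction(len(event & omega), len(omega))

omega = set(range(1, 7))  # k = 6 equiprobable outcomes of a fair die
even = {2, 4, 6}

p_even = laplace(even, omega)
p_six = laplace({6}, omega)
print(p_even, p_six)
```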
POSTULATES
• $0 \le P(A) \le 1$
• If $A = B_1 \cup B_2 \cup \cdots \cup B_n$ and $B_i \cap B_j = \emptyset$ for $i \ne j$, then $P(A) = \sum_{i=1}^{n} P(B_i)$
• $P(\Omega) = 1$

CONSEQUENCES
• Complementary: $P(\bar{A}) = 1 - P(A)$
• $P(\emptyset) = 0$
• If $A \subset B$, then $P(A) \le P(B)$
• Difference: $P(A \setminus B) = P(A) - P(A \cap B)$
• Union: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
• If A and B are incompatible ($A \cap B = \emptyset$), then $P(A \cup B) = P(A) + P(B)$
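These consequences can be verified on a small equiprobable sample space. The die example, the events, and the helper `P` are assumptions made for this sketch.

```python
from fractions import Fraction

omega = set(range(1, 7))  # one roll of a fair die (illustrative)

def P(event):
    # Laplace rule on the equiprobable sample space omega.
    return Fraction(len(event), len(omega))

A = {2, 4, 6}
B = {4, 5, 6}

results = {
    "complement": P(omega - A) == 1 - P(A),
    "empty": P(set()) == 0,
    "monotone": P({6}) <= P(A),                      # {6} is a subset of A
    "difference": P(A - B) == P(A) - P(A & B),
    "union": P(A | B) == P(A) + P(B) - P(A & B),
    "incompatible": P({1} | {2}) == P({1}) + P({2}),  # {1} and {2} are disjoint
}

for name, holds in results.items():
    print(name, holds)
```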

NOTION OF CONDITIONAL PROBABILITY

Let A and B be two events such that $P(B) > 0$. The conditional probability of A given B is defined as: $P(A|B) = \frac{P(A \cap B)}{P(B)}$

THE PRODUCT’S LAW: If $P(B) > 0 \implies P(A \cap B) = P(A|B)P(B)$
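A short sketch with two fair dice shows the definition and the product law at work. The events and helper names are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered results of rolling two fair dice.
omega = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(omega))

def conditional(a, b):
    # P(A|B) = P(A and B) / P(B), defined only when P(B) > 0.
    if P(b) == 0:
        raise ValueError("conditioning event has probability 0")
    return P(a & b) / P(b)

A = {w for w in omega if sum(w) >= 10}  # the sum is at least 10
B = {w for w in omega if w[0] == 6}     # the first die shows 6

p_a = P(A)
p_a_given_b = conditional(A, B)
print(p_a, p_a_given_b)
```

Knowing that the first die shows 6 raises the probability of a sum of at least 10 from 1/6 to 1/2.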

NOTION OF INDEPENDENCE
Event A is independent of B if conditioning on B does not change its probability:
$P(A|B) = P(A)$
Moreover, if $P(A) > 0$ and $P(B) > 0$, the above is equivalent to the following:
$P(A \cap B) = P(A)P(B)$
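The product form gives a direct test for independence. In this sketch the two-dice sample space, the events, and the function `independent` are assumptions chosen to show one independent pair and one dependent pair.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))  # two fair dice

def P(event):
    return Fraction(len(event), len(omega))

def independent(a, b):
    # Product form of independence: P(A and B) = P(A) * P(B).
    return P(a & b) == P(a) * P(b)

first_even = {w for w in omega if w[0] % 2 == 0}
second_high = {w for w in omega if w[1] >= 5}
sum_high = {w for w in omega if sum(w) >= 10}
first_six = {w for w in omega if w[0] == 6}

print(independent(first_even, second_high))  # events about different dice
print(independent(sum_high, first_six))      # the sum depends on the first die
```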

MULTIPLICATION RULE
Given n events $A_1, A_2, \ldots, A_n$ with $P(A_1 \cap A_2 \cap \cdots \cap A_{n-1}) > 0$, so that every conditional probability below is defined, it holds:
$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)P(A_2|A_1) \cdots P(A_n|A_1 \cap A_2 \cap \cdots \cap A_{n-1})$
If the events are independent: $P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)P(A_2) \cdots P(A_n)$
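As a worked example of the chain of conditional probabilities, consider drawing three aces in a row without replacement from a standard 52-card deck. The example and the helper `chain` are assumptions for illustration; the result is cross-checked by direct counting with `math.comb`.

```python
from fractions import Fraction
from math import comb

def chain(probabilities):
    # Multiply P(A1), P(A2|A1), ..., P(An|A1 and ... and A(n-1)).
    result = Fraction(1)
    for p in probabilities:
        result *= p
    return result

# First ace: 4 of 52 cards; second: 3 of 51 left; third: 2 of 50 left.
p_three_aces = chain([Fraction(4, 52), Fraction(3, 51), Fraction(2, 50)])

# Cross-check: ways to pick 3 of the 4 aces over ways to pick any 3 cards.
p_by_counting = Fraction(comb(4, 3), comb(52, 3))

# Independent case: three heads in three fair coin flips.
p_three_heads = chain([Fraction(1, 2)] * 3)

print(p_three_aces, p_by_counting, p_three_heads)
```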

TOTAL PROBABILITY RULE

Given a partition of the sample space, $B_1, B_2, \ldots, B_k$, with $P(B_i) > 0$ for all $i$, any event A satisfies:
$P(A) = P(A \cap B_1) + P(A \cap B_2) + \cdots + P(A \cap B_k) \implies$
$P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + \cdots + P(A|B_k)P(B_k)$
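A hypothetical example: three machines make 50%, 30% and 20% of a factory's output, with defect rates of 1%, 2% and 3%. The machines form a partition, and the rule gives the overall defect probability. All numbers and names below are made up for illustration.

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    # priors: P(B_i) for a partition; likelihoods: P(A|B_i) in the same order.
    if sum(priors) != 1:
        raise ValueError("the P(B_i) of a partition must add up to 1")
    return sum(l * p for l, p in zip(likelihoods, priors))

priors = [Fraction(50, 100), Fraction(30, 100), Fraction(20, 100)]   # P(B_i)
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]  # P(A|B_i)

p_defect = total_probability(priors, likelihoods)
print(p_defect)
```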

BAYES THEOREM
For two events A and B with $P(A) > 0$ and $P(B) > 0$:
$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}$
The theorem is applied when we know $P(B|A)$.
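A hypothetical diagnostic-test example: A is "has the condition" and B is "tests positive". The prevalence and test accuracies below are made-up numbers, and computing $P(B)$ by splitting on A and its complement is one common way to obtain the denominator, not something stated above.

```python
from fractions import Fraction

def bayes(p_b_given_a, p_a, p_b):
    # P(A|B) = P(B|A) * P(A) / P(B), defined only when P(B) > 0.
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

p_a = Fraction(1, 100)             # assumed prevalence P(A)
p_b_given_a = Fraction(95, 100)    # assumed P(positive | condition)
p_b_given_not_a = Fraction(5, 100) # assumed P(positive | no condition)

# Denominator from the partition {A, not A}.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = bayes(p_b_given_a, p_a, p_b)
print(p_b, p_a_given_b, float(p_a_given_b))
```

Even with a fairly accurate test, the low prevalence keeps $P(A|B)$ well below $P(B|A)$ in this example.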
