Univariate and Bivariate Data Analysis + Probability
INTRODUCTION
DATA (variable)
CATEGORICAL (QUALITATIVE)
• ORDINAL: Clothes size (XL, L, M, S)
• NOMINAL: Blood type (A, B, AB, 0)
NUMERICAL (QUANTITATIVE)
• DISCRETE: # of children (0, 1, 2, ...)
• CONTINUOUS: Height (1.55, 1.71, ...)
TYPES OF FREQUENCY
• Absolute: Number of times the value appeared in the sample.
• Relative: Proportion of times the value appeared in the sample.
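As an illustration of the two frequency types, here is a minimal Python sketch. The function name `frequency_table` and the blood-type sample are made up for the example.

```python
from collections import Counter

def frequency_table(sample):
    """Return {value: (absolute frequency, relative frequency)}."""
    counts = Counter(sample)
    n = len(sample)
    return {value: (count, count / n) for value, count in counts.items()}

# Hypothetical sample of a nominal variable (blood type).
blood_types = ["A", "0", "A", "B", "0", "A", "AB", "0"]
table = frequency_table(blood_types)
print(table["A"])  # (3, 0.375): "A" appears 3 times out of 8
```

The relative frequencies always add up to 1, which is a quick sanity check on any frequency table.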
LOCATION
a) QUARTILES: Split the ordered data into four segments with the same number of observations.
The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$
The second quartile $Q_2$ (the median) has position $\frac{1}{2}(n + 1)$
The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$
b) PERCENTILES: The pth percentile is the value below which p% of the ordered data fall. By analogy with the quartiles, its position is $\frac{p}{100}(n + 1)$.
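The position rule can be sketched in Python as follows. The sheet does not say what to do when the position is not a whole number; this sketch assumes linear interpolation between the two neighbouring ordered values. The function names are hypothetical.

```python
def value_at_position(sorted_data, position):
    """Value at a 1-based, possibly fractional, position.

    Assumption: fractional positions are resolved by linear interpolation.
    """
    n = len(sorted_data)
    position = min(max(position, 1), n)  # clamp to the valid range [1, n]
    lower = int(position)                # whole part of the position
    frac = position - lower              # fractional part
    if lower >= n:
        return sorted_data[-1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + frac * (above - below)

def quartiles(data):
    """Q1, Q2, Q3 at positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    xs = sorted(data)
    n = len(xs)
    return tuple(value_at_position(xs, k * (n + 1) / 4) for k in (1, 2, 3))

# n = 7, so the positions are 2, 4 and 6 in the ordered data 1, 3, 5, ..., 13.
q1, q2, q3 = quartiles([7, 1, 3, 9, 5, 11, 13])
```

With seven observations the positions are whole numbers, so the quartiles are simply the 2nd, 4th and 6th ordered values (3, 7 and 11).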
VARIATION
1. RANGE: $R = x_{max} - x_{min}$
2. INTERQUARTILE RANGE (IQR): $IQR = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$
3. OUTLIERS are observations that fall:
Below the value of $Q_1 - 1.5 \cdot IQR$
Above the value of $Q_3 + 1.5 \cdot IQR$
4. For EXTREME OUTLIERS, replace 1.5 by 3 in the above definition.
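The outlier rules in items 3 and 4 can be sketched as below. The function name `classify_outliers` and the data are made up; the quartiles are passed in directly (for these 11 values the (n + 1) position rule gives the 3rd and 9th ordered values, 5 and 11).

```python
def classify_outliers(data, q1, q3):
    """Return (outliers, extreme_outliers) using the 1.5*IQR and 3*IQR fences."""
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr        # outlier fences
    ext_low, ext_high = q1 - 3 * iqr, q3 + 3 * iqr    # extreme-outlier fences
    outliers = [x for x in data if x < low or x > high]
    extreme = [x for x in data if x < ext_low or x > ext_high]
    return outliers, extreme

data = [3, 4, 5, 6, 7, 8, 9, 10, 11, 22, 40]  # hypothetical sample
outliers, extreme = classify_outliers(data, q1=5, q3=11)
# IQR = 6, so the fences are -4 and 20 (outliers) and -13 and 29 (extreme).
```

Here 22 and 40 are outliers, and only 40 is an extreme outlier. Every extreme outlier is also an outlier, since the 3·IQR fences lie outside the 1.5·IQR ones.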
5. VARIANCE
5.1. SAMPLE VARIANCE: $\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n}$
5.2. SAMPLE QUASI-VARIANCE: $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$
6. STANDARD DEVIATION (SD)
6.1. SAMPLE ST. DEV. ($\hat{\sigma}$): $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$
6.2. SAMPLE QUASI-ST. DEV. ($s$): $s = \sqrt{s^2}$
7. COEFFICIENT OF VARIATION (CV): $CV = \frac{s}{|\bar{x}|}$
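The variance, standard deviation and CV formulas can be checked numerically with a short sketch. The function name `dispersion` and the data are hypothetical, and the absolute value of the mean in the CV is an assumption made so that the CV stays non-negative.

```python
from math import sqrt

def dispersion(data):
    """Variance, quasi-variance, both standard deviations and the CV."""
    n = len(data)
    mean = sum(data) / n
    sum_sq = sum(x * x for x in data)
    numerator = sum_sq - n * mean ** 2          # shared numerator of both variances
    variance = numerator / n                    # divides by n
    quasi_variance = numerator / (n - 1)        # divides by n - 1
    quasi_sd = sqrt(quasi_variance)
    return {
        "mean": mean,
        "variance": variance,
        "quasi_variance": quasi_variance,
        "sd": sqrt(variance),
        "quasi_sd": quasi_sd,
        "cv": quasi_sd / abs(mean),  # assumption: |mean| in the denominator
    }

stats = dispersion([2, 4, 4, 4, 5, 5, 7, 9])
# mean = 5, sum of squares = 232, numerator = 232 - 8 * 25 = 32
# variance = 32 / 8 = 4, quasi-variance = 32 / 7
```

The two variances differ only in the denominator, so the quasi-variance is always the larger of the two by the factor n / (n − 1).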
STANDARDIZATION
To standardize a variable means to calculate: $z = \frac{x - \bar{x}}{s}$
If you apply this formula to all observations $x_1, \dots, x_n$ and call the transformed ones $z_1, \dots, z_n$,
then the z's have mean 0 and standard deviation 1.
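A minimal sketch of standardization, checking the mean-0 and deviation-1 property numerically. The function name `standardize` and the data are made up; the quasi-standard deviation s (denominator n − 1) is used, so the z's have quasi-standard deviation exactly 1.

```python
from math import sqrt

def standardize(data):
    """Return the z-scores (x - mean) / s, with s the quasi-standard deviation."""
    n = len(data)
    mean = sum(data) / n
    s = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return [(x - mean) / s for x in data]

z = standardize([2, 4, 4, 4, 5, 5, 7, 9])

# Check the claimed properties on the transformed values.
z_mean = sum(z) / len(z)
z_quasi_var = sum((v - z_mean) ** 2 for v in z) / (len(z) - 1)
```

Up to floating-point rounding, `z_mean` is 0 and `z_quasi_var` is 1. Standardized values are unit-free, which makes variables measured on different scales comparable.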
CORRELATION: $r_{(x,y)} = \frac{s_{xy}}{s_x s_y}$
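The correlation formula can be sketched as follows. The sheet does not define $s_{xy}$; this sketch assumes it is the sample covariance computed with denominator n − 1, matching the quasi-standard deviations. The function name `correlation` and the data are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """r = s_xy / (s_x * s_y), assuming n - 1 denominators throughout."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

r_up = correlation([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly increasing line
r_down = correlation([1, 2, 3], [3, 2, 1])       # perfectly decreasing line
```

For an exact linear relation, r is 1 (increasing) or −1 (decreasing), up to rounding. Using n instead of n − 1 in all three quantities gives the same r, because the factor cancels.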
PROBABILITY
BASIC CONCEPTS
o SAMPLE SPACE ($\Omega$): The set of all possible outcomes of a random experiment.
o INTERSECTION ($\cap$): $A \cap B$ is the set of all elementary events of $\Omega$ that belong to both A and B.
o UNION ($\cup$): $A \cup B$ is the set of all elementary events of $\Omega$ that belong to A, to B, or to both.
o MUTUALLY EXCLUSIVE: A and B are mutually exclusive if they have no common elementary events ($A \cap B = \emptyset$).
o COMPLEMENTARY ($\bar{A}$): The complement of an event A is the set of all elementary events of $\Omega$ that do not belong to A.
OPERATIONS' PROPERTIES
Commutative: $A \cup B = B \cup A$, $A \cap B = B \cap A$
Associative: $A \cup (B \cup C) = (A \cup B) \cup C$, $A \cap (B \cap C) = (A \cap B) \cap C$
Neutral elements: $A \cup \emptyset = A$, $A \cap \Omega = A$
Distributive: $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$, $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$
PROBABILITY AND DATA ANALYSIS DATA SCIENCE AND ENGINEERING
Idempotence: $A \cup A = A$, $A \cap A = A$
Absorption: $A \cup \Omega = \Omega$, $A \cap \emptyset = \emptyset$
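The properties of the set operations can be checked on a concrete case with Python's built-in sets. This is only an illustration on one made-up example (a six-sided die and three arbitrary events), not a proof.

```python
omega = frozenset(range(1, 7))   # sample space of a six-sided die
empty = frozenset()
A = frozenset({1, 2, 3})         # hypothetical events
B = frozenset({2, 4, 6})
C = frozenset({3, 4, 5})

checks = {
    "commutative": (A | B == B | A) and (A & B == B & A),
    "associative": (A | (B | C) == (A | B) | C) and (A & (B & C) == (A & B) & C),
    "neutral": (A | empty == A) and (A & omega == A),
    "distributive": (A | (B & C) == (A | B) & (A | C))
                    and (A & (B | C) == (A & B) | (A & C)),
    "idempotence": (A | A == A) and (A & A == A),
    "absorption": (A | omega == omega) and (A & empty == empty),
}
all_hold = all(checks.values())
```

In Python, `|` is the union, `&` the intersection, and `omega - A` would give the complement of A.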
LAPLACE RULE
Consider an experiment with k equiprobable elementary events. The probability of a set A is then defined as: $P(A) = \frac{1}{k} \times \text{size of } A$
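A minimal sketch of the Laplace rule, using exact fractions and a fair six-sided die as a made-up example. The function name `laplace_probability` is hypothetical.

```python
from fractions import Fraction

def laplace_probability(event, sample_space):
    """P(A) = (size of A) / k for k equiprobable elementary events."""
    space = set(sample_space)
    favourable = set(event) & space   # ignore outcomes outside the sample space
    return Fraction(len(favourable), len(space))

die = range(1, 7)
p_even = laplace_probability({2, 4, 6}, die)   # 3 favourable outcomes out of 6
```

The rule only applies when all elementary events are equally likely, as with a fair die. For a loaded die, counting outcomes would give the wrong probability.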
POSTULATES
$0 \le P(A) \le 1$
If $A = B_1 \cup B_2 \cup \dots \cup B_n$ and $B_i \cap B_j = \emptyset$ for $i \ne j$, then $P(A) = \sum_{i=1}^{n} P(B_i)$
$P(\Omega) = 1$
CONSEQUENCES
Complementary: $P(\bar{A}) = 1 - P(A)$
$P(\emptyset) = 0$
If $A \subset B$, then $P(A) \le P(B)$
Difference: $P(A \setminus B) = P(A) - P(A \cap B)$
Union: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
If A and B are incompatible ($A \cap B = \emptyset$), then $P(A \cup B) = P(A) + P(B)$
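These consequences can be verified on a small example by computing each side separately. The sketch below assumes a fair six-sided die (so probabilities come from counting outcomes); the events A and B are made up.

```python
from fractions import Fraction

omega = set(range(1, 7))   # fair six-sided die

def prob(event):
    """Probability by counting equiprobable outcomes."""
    return Fraction(len(event & omega), len(omega))

A = {1, 2, 3, 4}
B = {3, 4, 5}

# Each pair holds (direct computation, value given by the formula).
complement = (prob(omega - A), 1 - prob(A))
difference = (prob(A - B), prob(A) - prob(A & B))
union = (prob(A | B), prob(A) + prob(B) - prob(A & B))
```

In each pair both entries coincide: 1/3 for the complement, 1/3 for the difference and 5/6 for the union. Subtracting $P(A \cap B)$ in the union formula corrects for the outcomes 3 and 4 being counted twice.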
NOTION OF INDEPENDENCE
Event A is independent of B if conditioning on B does not change its probability:
$P(A|B) = P(A)$
Moreover, if $P(A) > 0$ and $P(B) > 0$, the above is equivalent to the following:
$P(A \cap B) = P(A)P(B)$
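Independence can be checked by enumeration. The sketch below assumes two fair dice, with events written as hypothetical predicate functions on the outcome pair.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # 36 equiprobable outcomes

def prob(event):
    """Probability of an event given as a predicate on an outcome."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def independent(a, b):
    """Check the product rule P(A and B) = P(A) P(B)."""
    return prob(lambda w: a(w) and b(w)) == prob(a) * prob(b)

def first_even(w):
    return w[0] % 2 == 0

def sum_is_7(w):
    return sum(w) == 7

def sum_is_8(w):
    return sum(w) == 8

# Conditional probability P(first even | sum is 7) from the definition.
p_cond = prob(lambda w: first_even(w) and sum_is_7(w)) / prob(sum_is_7)
```

"First die is even" is independent of "the sum is 7" (`p_cond` equals 1/2, the unconditional probability), but not of "the sum is 8", where the product rule fails (3/36 versus 5/72).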
MULTIPLICATION RULE
Given n events $A_1, A_2, \dots, A_n$ with $P(A_i) > 0$ for all $i \ge 1$, it holds that:
$P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1) P(A_2|A_1) \cdots P(A_n|A_1 \cap A_2 \cap \dots \cap A_{n-1})$
If the events are independent: $P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1) P(A_2) \cdots P(A_n)$
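As a worked example of the multiplication rule, consider drawing cards without replacement from a standard 52-card deck and asking for the probability that the first three cards are all aces. The example and the function name `prob_all_aces` are made up; the result is cross-checked against a direct count of combinations.

```python
from fractions import Fraction
from math import comb

def prob_all_aces(draws, aces=4, deck=52):
    """P(A_1 and ... and A_n) as a chain of conditional probabilities.

    A_i = "the i-th card drawn is an ace"; each factor is
    P(A_i | all previous cards were aces).
    """
    p = Fraction(1)
    for i in range(draws):
        p *= Fraction(aces - i, deck - i)
    return p

p_chain = prob_all_aces(3)                        # 4/52 * 3/51 * 2/50
p_count = Fraction(comb(4, 3), comb(52, 3))       # favourable / possible hands
```

Both routes give 1/5525. The draws are not independent here, since each ace drawn lowers the chance of the next one, so the simpler product $P(A_1)P(A_2)P(A_3)$ would not apply.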
BAYES THEOREM
For two events A and B with $P(B) > 0$:
$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}$
The theorem is applied when $P(B|A)$ is known and $P(A|B)$ is wanted.
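A minimal sketch of Bayes' theorem, using a fair die as a made-up example: A = "the outcome is 2" and B = "the outcome is even". The function name `bayes` is hypothetical.

```python
from fractions import Fraction

def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) P(A) / P(B); requires P(B) > 0."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

p_b_given_a = Fraction(1)     # a 2 is always even
p_a = Fraction(1, 6)          # P(outcome is 2)
p_b = Fraction(1, 2)          # P(outcome is even)
posterior = bayes(p_b_given_a, p_a, p_b)
```

The result is 1/3, which matches direct counting: among the three even outcomes 2, 4 and 6, exactly one is a 2.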