The document outlines various statistical formulas and concepts related to data analytics, including combinatorics, probability calculus, and distributions such as binomial and normal distributions. It also covers distance metrics, classification techniques like decision trees and Bayesian classification, and performance metrics such as precision and recall. Key formulas are provided for each topic, serving as a reference for statistical decision-making and data analysis.

Formulas Statistics and decision making

Formulas Data Analytics


Combinatorics

Principle of inclusion-exclusion
$|A_1 \cup A_2| = |A_1| + |A_2| - |A_1 \cap A_2|$

r-permutation
$P(n, r) = \frac{n!}{(n-r)!}$

r-combination
$C(n, r) = \frac{n!}{r!\,(n-r)!}$

r-permutation with repetition
$n^r$

r-combination with repetition
$\frac{(n+r-1)!}{r!\,(n-1)!}$

Permutation with repetitive elements
$\frac{n!}{n_1!\, n_2! \cdots n_k!}$
Probability Calculus

Formula of Laplace
$P(E) = \frac{|E|}{|\Omega|}$

Probability of an event's complement
$P(\bar{E}) = 1 - P(E)$

Probability of a union of events (or)
$P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2)$

Conditional probability
$P(E|F) = \frac{P(E \cap F)}{P(F)}$

Bayes' Rule
$P(F|E) = \frac{P(E|F)\,P(F)}{P(E|F)\,P(F) + P(E|\bar{F})\,P(\bar{F})}$
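A quick numeric sketch of Bayes' rule with the law of total probability in the denominator; the prevalence, sensitivity, and false-positive rate below are hypothetical values chosen for illustration.

```python
# Bayes' rule: P(F|E) = P(E|F) P(F) / (P(E|F) P(F) + P(E|F') P(F'))
# Hypothetical test: 99% sensitivity, 5% false-positive rate, 1% prior.
p_f = 0.01              # P(F): prior probability
p_e_given_f = 0.99      # P(E|F): sensitivity
p_e_given_not_f = 0.05  # P(E|F'): false-positive rate

posterior = (p_e_given_f * p_f) / (
    p_e_given_f * p_f + p_e_given_not_f * (1 - p_f)
)
print(round(posterior, 4))  # 0.1667: a positive result alone is far from conclusive
```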

Bayes spam filter (two words)

$R(w_1, w_2) = \frac{P(w_1)\,P(w_2)}{P(w_1)\,P(w_2) + Q(w_1)\,Q(w_2)}$

with $P(w_i)$ the estimated probability that a spam message contains word $w_i$, and $Q(w_i)$ the estimated probability that a legitimate (ham) message contains it.
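A minimal sketch of the two-word spam score; the word probabilities are made-up values for illustration.

```python
# Two-word Bayes spam score R(w1, w2).
# p1, p2: estimated P(w_i | spam); q1, q2: estimated Q(w_i | ham).
def spam_score(p1, p2, q1, q2):
    return (p1 * p2) / (p1 * p2 + q1 * q2)

# Hypothetical word statistics: both words far more common in spam.
r = spam_score(0.8, 0.6, 0.1, 0.2)  # 0.48 / (0.48 + 0.02)
assert abs(r - 0.96) < 1e-12
```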
Discrete and continuous stochastic variables

Binomial distribution B(n, p)
$P(X = k) = C(n, k)\,p^k q^{n-k}$, with $E(X) = np = \mu$ and $\sigma^2 = npq$

Poisson distribution P(λ)
$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}$, with $E(X) = \mu = \lambda$ and $\sigma^2 = \lambda$

Normal distribution N(μ, σ)
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Z-score (standardisation)
$Z = \frac{X - \mu}{\sigma}$

$P(\mu - \sigma \le X \le \mu + \sigma) = P(-1 \le Z \le 1) \approx 68.27\%$
$P(\mu - 2\sigma \le X \le \mu + 2\sigma) = P(-2 \le Z \le 2) \approx 95.45\%$
$P(\mu - 3\sigma \le X \le \mu + 3\sigma) = P(-3 \le Z \le 3) \approx 99.73\%$
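The distribution formulas can be verified numerically with the standard library alone; the normal-curve percentages follow from the error function, since the N(0, 1) CDF is $\Phi(z) = \frac{1}{2}(1 + \mathrm{erf}(z/\sqrt{2}))$.

```python
from math import comb, exp, factorial, sqrt, erf

# Binomial B(n, p): P(X = k) = C(n, k) p^k q^(n-k)
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

assert abs(binom_pmf(2, 4, 0.5) - 0.375) < 1e-12
# Mean check: E(X) = n p
assert abs(sum(k * binom_pmf(k, 4, 0.5) for k in range(5)) - 4 * 0.5) < 1e-12

# Poisson P(lambda): P(X = x) = e^(-lambda) lambda^x / x!
def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

# Standard normal CDF via the error function
def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# The 68-95-99.7 rule from the Z-score table
assert abs((phi(1) - phi(-1)) - 0.6827) < 1e-4
assert abs((phi(2) - phi(-2)) - 0.9545) < 1e-4
assert abs((phi(3) - phi(-3)) - 0.9973) < 1e-4
```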

Descriptive statistics

Standard deviation
$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$

© Brian Baert and Dirk Vandycke


Data and distance

Euclidean distance
$d(X, Y) = \sqrt{\sum_{i}(x_i - y_i)^2}$

Minkowski distance
$d = \left(\sum_{i}|x_i - y_i|^p\right)^{1/p}$

Simple matching, Jaccard and cosine
$SMC = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}}$
$J = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$
$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}$

Edit distance
EditDistance(string1, string2) = length(string1) + length(string2) − 2 · LCS

with LCS = the length of the longest common subsequence.
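The distance and similarity measures above fit in a few lines each; note that this LCS-based edit distance counts only insertions and deletions (no substitutions), so it differs from Levenshtein distance.

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = lambda v: sqrt(sum(a * a for a in v))
    return dot / (norm(x) * norm(y))

def smc_jaccard(x, y):
    # x, y: binary (0/1) vectors; f_ab counts attribute pairs (a, b)
    f11 = sum(a == b == 1 for a, b in zip(x, y))
    f00 = sum(a == b == 0 for a, b in zip(x, y))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    return (f11 + f00) / (f11 + f00 + f10 + f01), f11 / (f11 + f10 + f01)

def lcs_length(s, t):
    # Classic dynamic-programming longest-common-subsequence table.
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i, a in enumerate(s, 1):
        for j, b in enumerate(t, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if a == b else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def edit_distance(s, t):
    # Insert/delete-only distance from the LCS length.
    return len(s) + len(t) - 2 * lcs_length(s, t)

assert euclidean((0, 0), (3, 4)) == 5.0
assert abs(minkowski((0, 0), (3, 4), 2) - 5.0) < 1e-12
assert abs(cosine((1, 2), (2, 4)) - 1.0) < 1e-12   # parallel vectors
smc, jac = smc_jaccard((1, 0, 0, 1, 1), (1, 0, 1, 1, 0))
assert abs(smc - 0.6) < 1e-12 and abs(jac - 0.5) < 1e-12
assert edit_distance("kitten", "sitting") == 5     # LCS "ittn" has length 4
```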

Classification – Decision Trees

Entropy, Gini and Information gain
$\mathrm{Entropy}(t) = -\sum_{i} p(i|t)\,\log_2 p(i|t)$
$\mathrm{Gini}(t) = 1 - \sum_{i} [p(i|t)]^2$
$\mathrm{Classification\ error}(t) = 1 - \max_i\,[p(i|t)]$
Information gain: $\Delta = I(\mathrm{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N}\,I(v_j)$

with $p(i|t)$ the fraction of records belonging to class $i$ in node $t$; $I(v_j)$ the impurity of child node $v_j$; $N$ the total number of records at the parent node; $k$ the number of attribute values; and $N(v_j)$ the number of records of child node $v_j$.
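The three impurity measures and the information-gain formula can be sketched directly; the 50/50 parent splitting into two pure children is a textbook best case, giving the maximum gain of 1 bit under entropy.

```python
from math import log2

def entropy(probs):
    # 0 * log2(0) is taken as 0 by skipping zero-probability classes.
    return -sum(p * log2(p) for p in probs if p > 0)

def gini(probs):
    return 1 - sum(p * p for p in probs)

def classification_error(probs):
    return 1 - max(probs)

def info_gain(parent_probs, children, impurity=entropy):
    # children: list of (record count, class-probability list) per child node.
    total = sum(count for count, _ in children)
    weighted = sum(count / total * impurity(probs) for count, probs in children)
    return impurity(parent_probs) - weighted

# A 50/50 parent split perfectly into two pure, equal-sized children.
gain = info_gain([0.5, 0.5], [(5, [1.0, 0.0]), (5, [0.0, 1.0])])
assert abs(gain - 1.0) < 1e-12
assert abs(gini([0.5, 0.5]) - 0.5) < 1e-12
assert abs(classification_error([0.7, 0.3]) - 0.3) < 1e-12
```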
Bayesian Classification

Bayes posterior probability
$P(C|A_1 A_2 \ldots A_n) = \frac{P(A_1 A_2 \ldots A_n|C)\,P(C)}{P(A_1 A_2 \ldots A_n)}$

Naive Bayes Classification
Assuming independence between the attributes $A_i$ given the class:
$P(A_1 A_2 \ldots A_n|C) = P(A_1|C)\,P(A_2|C) \cdots P(A_n|C)$
A new data point is classified as $C_j$ when $P(C_j)\prod_{i} P(A_i|C_j)$ is maximal.
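A minimal naive Bayes sketch over two binary attributes; the priors, likelihoods, and attribute names (`has_link`, `all_caps`) are hypothetical values for illustration, not from the sheet.

```python
# Naive Bayes: pick the class C maximizing P(C) * prod_i P(A_i | C).
priors = {"spam": 0.4, "ham": 0.6}
# P(attribute = 1 | class); hypothetical estimates.
likelihoods = {
    "spam": {"has_link": 0.7, "all_caps": 0.5},
    "ham":  {"has_link": 0.2, "all_caps": 0.1},
}

def score(cls, observed):
    p = priors[cls]
    for attr, value in observed.items():
        p_attr = likelihoods[cls][attr]
        p *= p_attr if value else (1 - p_attr)  # complement for value 0
    return p

def classify(observed):
    return max(priors, key=lambda c: score(c, observed))

assert classify({"has_link": 1, "all_caps": 1}) == "spam"
assert classify({"has_link": 0, "all_caps": 0}) == "ham"
```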
Confusion matrix

Recall or TPR
$r = TPR = \frac{TP}{TP + FN}$

FPR
$FPR = \frac{FP}{TN + FP}$

Precision
$p = \frac{TP}{TP + FP}$

F1
$F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$
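The four metrics computed from raw confusion-matrix counts; the counts below are an arbitrary example, chosen so precision and recall coincide.

```python
# Confusion-matrix metrics from raw counts.
def metrics(tp, fp, fn, tn):
    recall = tp / (tp + fn)           # TPR
    fpr = fp / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return recall, fpr, precision, f1

recall, fpr, precision, f1 = metrics(tp=8, fp=2, fn=2, tn=88)
assert recall == 0.8
assert precision == 0.8
assert abs(f1 - 0.8) < 1e-12  # with p == r, F1 equals both
assert abs(fpr - 2 / 90) < 1e-12
```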

