Data Analysis and Modelling
Goal
Turn data into data products
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i \;-\; n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum a_i b_i$ is the sum of the AB cross-product.
• If $r_{A,B} > 0$, A and B are positively correlated (A's values increase as B's do).
• The higher the value, the stronger the correlation.
• $r_{A,B} = 0$: A and B are uncorrelated (independent only under additional assumptions); $r_{A,B} < 0$: negatively correlated.
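The correlation formula above can be sketched directly in Python; the function name and the sample data in the assertions are illustrative, not from the source.

```python
import math

def correlation(a, b):
    """Pearson correlation r_{A,B} = (sum(a_i*b_i) - n*mean(A)*mean(B))
    / ((n-1) * std(A) * std(B)), using sample standard deviations."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))  # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / ((n - 1) * std_a * std_b)
```

A perfectly linear increasing pair gives r = 1; a decreasing pair gives a negative r.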
Covariance:

$$\mathrm{Cov}_{A,B} = E\big[(A - \bar{A})(B - \bar{B})\big] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$$

which relates to the correlation coefficient as $r_{A,B} = \mathrm{Cov}_{A,B} / (\sigma_A \sigma_B)$,

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected
values.
• Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to
be smaller than its expected value.
• Independence: if A and B are independent, then Cov_{A,B} = 0, but the converse is not true:
– Some pairs of random variables have a covariance of 0 yet are not independent. Only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
03 July 2021 Data Analysis and Modelling 37
Covariance – An Example
Covariance can be simplified in computation as

$$\mathrm{Cov}_{A,B} = E(A \cdot B) - \bar{A}\,\bar{B}$$
Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10),
(4, 11), (6, 14).
If the stocks are affected by the same industry trends, will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
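The simplified computational form used in this example can be sketched in a few lines of Python (the function name is illustrative):

```python
def covariance(a, b):
    """Cov(A,B) = E(A*B) - E(A)*E(B), the simplified computational form."""
    n = len(a)
    return sum(x * y for x, y in zip(a, b)) / n - (sum(a) / n) * (sum(b) / n)

# stock values from the example above
cov = covariance([2, 3, 5, 4, 6], [5, 8, 10, 11, 14])  # 42.4 - 4 * 9.6 = 4.0
```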
• The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (all possible splitting points for each attribute need to be enumerated)
• Ex.: Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450; thus, split on {low, medium} (and {high}), since it has the lowest Gini index
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes
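A minimal sketch of the Gini computation for a binary split; the helper names are illustrative, not from the source.

```python
def gini(labels):
    """Gini impurity of a partition: 1 - sum over classes of p_i**2."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    """Weighted Gini of a binary split D -> (D1, D2):
    |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A 50/50 class mix gives the maximum binary impurity of 0.5, and a split that separates the classes perfectly gives a weighted Gini of 0.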
• For a continuous-valued attribute, the likelihood is modelled with a Gaussian density with mean $\mu$ and standard deviation $\sigma$:

$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

and $P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
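The Gaussian density can be written out directly (the function name is illustrative):

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = 1/(sqrt(2*pi)*sigma) * exp(-(x - mu)**2 / (2*sigma**2))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
```

At x = mu the standard normal density peaks at 1/sqrt(2*pi) ≈ 0.3989, and the curve is symmetric about the mean.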
• Ex. Suppose a dataset with 1000 tuples, income=low (0), income= medium (990), and
income = high (10)
• Use the Laplacian correction (or Laplacian estimator)
– Add 1 to each count, so that no probability estimate is zero (a single zero factor would otherwise nullify the entire product P(X|Ci))
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their “uncorrected” counterparts
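The correction is a one-liner per count; this sketch reproduces the income example above (the function name is illustrative):

```python
def laplace_probs(counts):
    """Laplacian correction: add 1 to every count before normalizing,
    so no conditional probability estimate is exactly zero."""
    corrected = [c + 1 for c in counts]
    total = sum(corrected)
    return [c / total for c in corrected]

# income counts from the example: low = 0, medium = 990, high = 10
probs = laplace_probs([0, 990, 10])  # [1/1003, 991/1003, 11/1003]
```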
Bayesian Classification … Contd.
Advantages
• Easy to implement
• Good results are obtained in most cases
Disadvantages
• Assumes class-conditional independence, which causes a loss of accuracy when the assumption is violated
• In practice, dependencies exist among variables
• E.g., in hospital data, a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.) are interrelated
• Dependencies among these cannot be modeled by a Naïve Bayes classifier
• How to deal with these dependencies? Bayesian Belief Networks
• An n-dimensional input vector x is mapped into variable y by means of the scalar product
and a nonlinear function mapping
• The inputs to the unit are the outputs of the previous layer
• They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit
• A nonlinear activation function is then applied to the result
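The steps above for a single unit can be sketched as follows; the sigmoid is used as an example activation, and the function name is illustrative.

```python
import math

def unit_output(inputs, weights, bias):
    """Output of one unit: weighted sum of the inputs plus the bias,
    passed through a nonlinear activation (sigmoid here as an example)."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-net))  # sigmoid maps net into (0, 1)
```

With zero weights and zero bias the weighted sum is 0, so the sigmoid output is exactly 0.5.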
• Classifier accuracy (or recognition rate): the percentage of test-set tuples that are correctly classified: Accuracy = (TP + TN)/All
• Error rate: 1 − accuracy, i.e., Error rate = (FP + FN)/All
• Class imbalance problem:
– One class may be rare, e.g. fraud or COVID-positive
– Significant majority of the negative class and minority of the positive class
– Sensitivity: True Positive recognition rate Sensitivity = TP/P
– Specificity: True Negative recognition rate Specificity = TN/N
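The four measures above can be computed directly from the confusion-matrix counts; the function name is illustrative.

```python
def evaluation_measures(tp, tn, fp, fn):
    """Accuracy, error rate, sensitivity (TP rate) and specificity (TN rate)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,   # equals 1 - accuracy
        "sensitivity": tp / (tp + fn),     # TP / P
        "specificity": tn / (tn + fp),     # TN / N
    }
```

On an imbalanced problem (e.g., fraud detection), accuracy alone can look high while sensitivity on the rare positive class stays low, which is why both rates are reported.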
1. Arbitrarily choose k objects as the initial cluster centroids
2. Assign each object to the cluster whose centroid is nearest
3. Compute the “cluster centers” of each cluster; these become the new cluster centroids
4. Go back to Step 2; repeat until the assignments no longer change
5. Stop
• Let us consider the Euclidean distance measure (L2 Norm) as the distance measurement
• Let d1, d2, and d3 denote the distance from an object to c1, c2, and c3 respectively
• Each object is assigned to its nearest centroid, and this assignment yields the clustering
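The assignment/update loop can be sketched for 2-D points as below; a fixed iteration count stands in for a convergence test, and the function name and sample data are illustrative.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points with Euclidean (L2) distance:
    repeat the assignment and centroid-update steps a fixed number of times."""
    for _ in range(iterations):
        # assignment step: each object goes to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) ** 0.5
                     for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # update step: each centroid becomes the mean of its cluster
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c  # keep an empty cluster's centroid unchanged
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters
```

Two well-separated pairs of points settle immediately into two clusters of two, with each centroid at its pair's mean.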