Data Mining Lecture 1 - Summary
0. Introduction
From data to information
Data mining relates to the area of application.
Passive: observing relations, e.g. clustering and classification.
Active: finding the underlying model: the passive methods plus model construction and parameter estimation.
3. Statistics
Uncertainty and probability distributions: normal, uniform, Poisson, Bernoulli, …
Normal distribution
Kolmogorov-Smirnov test
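A minimal sketch of checking a sample for normality with the Kolmogorov-Smirnov test (the toy data and the use of scipy.stats.kstest are my own illustration, not from the lecture):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)   # toy sample

# Standardize and compare against the standard normal CDF.
# (Estimating mean/std from the sample makes the p-value approximate.)
z = (x - x.mean()) / x.std()
statistic, p_value = stats.kstest(z, "norm")
print(statistic, p_value)                      # large p-value: normality not rejected
```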
Multivariate sets
Mean
Mean-centered data
Correlation and correlation coefficient
Covariance matrix
Estimation
Outliers and the linear correlation coefficient
Transforming
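As a minimal illustration of mean-centering, the covariance matrix, and correlation coefficients for a multivariate set (toy data; the numpy calls are my own sketch, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples of 3 variables (toy data)

mean = X.mean(axis=0)                    # per-variable mean
Xc = X - mean                            # mean-centered data
C = np.cov(X, rowvar=False)              # 3 x 3 covariance matrix
R = np.corrcoef(X, rowvar=False)         # 3 x 3 matrix of correlation coefficients
```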
Examples of Distances
Euclidean: $\mathrm{dist}_E(p,q)^2 = (p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2$
Manhattan: $\mathrm{dist}_M(p,q) = |p_1 - q_1| + |p_2 - q_2| + \dots + |p_n - q_n|$
Max-norm: $\mathrm{dist}_{\max}(p,q) = \max\{|p_1 - q_1|, |p_2 - q_2|, \dots, |p_n - q_n|\}$
Notice the graphs (unit circles) of $\mathrm{dist}(p,0)$ in $\mathbb{R}^2$ for the Euclidean, Manhattan, and max distances.
Generalized p-norm (Minkowski distance; written here with exponent d, since p denotes a point):
$\|p\|_d = \left( \sum_{i=1}^{n} |p_i|^d \right)^{1/d}$
Notice that for $d = 2$: $\|p - q\|_d$ = Euclidean distance;
for $d = 1$: $\|p - q\|_d$ = Manhattan distance;
for $d = \infty$: $\|p - q\|_d$ = max-norm distance.
Normally $d \ge 1$ (for $d < 1$ the triangle inequality fails, so it is no longer a metric).
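A small sketch computing all three distances via the generalized d-norm (the function name and toy points are my own):

```python
import numpy as np

def minkowski(p, q, d):
    """Generalized d-norm distance ||p - q||_d."""
    if np.isinf(d):
        return np.abs(p - q).max()     # max-norm (limit d -> infinity)
    return (np.abs(p - q) ** d).sum() ** (1.0 / d)

p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(p, q, 2))              # Euclidean: 5.0
print(minkowski(p, q, 1))              # Manhattan: 7.0
print(minkowski(p, q, np.inf))         # max-norm:  4.0
```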
Riemannian Metric
Let g be a non-negative definite (positive semi-definite) matrix on $\mathbb{R}^d$ (i.e. all eigenvalues $\ge 0$);
then g induces a Riemannian metric on the space:
$\|p\|_g^2 = p^T g p$
Correlation: e.g. taking g = the inverse of the covariance matrix gives the Mahalanobis distance, which corrects for correlations between the variables.
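A minimal sketch of the induced norm $\|p\|_g^2 = p^T g p$ (the matrix g and point p below are illustrative):

```python
import numpy as np

def g_norm_sq(p, g):
    """Squared norm ||p||_g^2 = p^T g p induced by a PSD matrix g."""
    return p @ g @ p

g = np.array([[2.0, 0.5],
              [0.5, 1.0]])             # symmetric, eigenvalues > 0
p = np.array([1.0, -1.0])
print(g_norm_sq(p, g))                 # 2 - 0.5 - 0.5 + 1 = 2.0
```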
6. Dimension Reduction
Principal Components Analysis (PCA)
Find the principal directions in the data and use them to reduce the number of dimensions of
the set by representing the data as linear combinations of the principal components.
Works best for multivariate data. Find the m < d eigenvectors of the covariance matrix with
the largest eigenvalues; these eigenvectors are the principal components. Decompose the
data along these principal components and thus obtain a more concise data set.
Caution 1: depends on the normalization of the data! Ideal for data with equal units.
Caution 2: works only for linear relations between the parameters.
Caution 3: valuable information can be lost in PCA.
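A minimal PCA sketch along these lines, via the eigendecomposition of the covariance matrix (function name and toy data are assumptions for illustration):

```python
import numpy as np

def pca(X, m):
    """Project X (n samples x d variables) onto its m principal components."""
    Xc = X - X.mean(axis=0)                    # mean-center (normalization matters!)
    C = np.cov(Xc, rowvar=False)               # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    W = eigvecs[:, order[:m]]                  # the m principal components
    return Xc @ W                              # data in PC coordinates

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = pca(X, 2)                                  # 200 x 2 reduced data set
```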
Factor Analysis
Represent the data with fewer (latent) variables.
Not invariant under transformations (rotations): multiple equivalent solutions.
Widely used, especially in the humanities and social sciences and in medicine.
Multidimensional scaling
Equivalent to PCA (for Euclidean distances), but also applicable when there are non-linear relations between the parameters.
Input: a similarity or distance map between the data points. Output: a 2D (or higher-dimensional)
map of the data points.
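A sketch of classical (metric) MDS from a precomputed distance matrix, using the standard double-centering construction (one concrete variant of MDS; the function name and toy data are my own):

```python
import numpy as np

def classical_mds(D, m=2):
    """Classical MDS: embed points in m dimensions from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:m]      # m largest eigenvalues
    L = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * L               # n x m coordinate map

# toy usage: distances between 4 points on a line
X = np.array([[0.0], [1.0], [3.0], [6.0]])
D = np.abs(X - X.T)
coords = classical_mds(D, m=1)
```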