Data Mining: Understanding Data
Getting to know and understanding data
Data visualization
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix
Document data: text documents represented as term-frequency vectors
Transaction data
Graph and network
World Wide Web
Example term-frequency vectors (document data):
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Data Objects
Attributes
Attribute (dimension, feature, variable): a characteristic or feature of a data object
E.g., customer ID, name, address
Types:
Nominal
Ordinal
Binary
Numeric:
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or names of things
Hair color
Status, postal code, student ID number (NRP), etc.
Binary: a nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes are equally important
E.g., gender
Asymmetric binary: the outcomes are not equally important
E.g., a medical test (positive or negative)
The value 1 is assigned to the more important outcome (e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking), but the magnitude between successive values is not expressed as a number
Size = {small, medium, large}, grades, ranks
Numeric Attributes
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., calendar dates
No true zero-point
Ratio
Inherent zero-point
Examples: length, body weight, etc.
A value can be expressed as a multiple of another object's value
E.g., road A is twice as long as road B
Discrete and Continuous Attributes
Discrete attribute
Has a finite or countably infinite set of values
Continuous attribute
Has real numbers as values
Basic Statistical Descriptions of Data
Goal
To understand the data: central tendency, variation, and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
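Measuring Central Tendency
Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Note: n is the sample size and N is the population size.
Weighted mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: the mean computed after discarding the extreme values
Median: the middle value of the sorted data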
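The measures of central tendency above can be sketched in a few lines of Python. This is only an illustration: the `salaries` sample and the helper names `weighted_mean` and `trimmed_mean` are assumptions made for the example, not part of the slides.

```python
from statistics import mean, median

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    data = sorted(values)
    return mean(data[k:len(data) - k])

# Hypothetical sample; 110 is an extreme value that pulls the mean upward.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", mean(salaries))
print("median:", median(salaries))
print("trimmed mean (k=1):", trimmed_mean(salaries, 1))
print("weighted mean:", weighted_mean([1, 2, 3], [3, 1, 1]))
```

The trimmed mean and the median move less than the plain mean when the extreme value changes, which is why they are often reported alongside it.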
Boxplot Analysis
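Properties of the Normal Distribution Curve
The normal curve
From $\mu - \sigma$ to $\mu + \sigma$: contains about 68% of the measurements ($\mu$: mean, $\sigma$: standard deviation)
From $\mu - 2\sigma$ to $\mu + 2\sigma$: contains about 95% of the measurements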
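The 68% and 95% figures can be checked numerically. A minimal sketch, using the standard identity that a normal variable falls within k standard deviations of its mean with probability erf(k / sqrt(2)); the function name `normal_coverage` is chosen for this example.

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2):
    print(f"within {k} standard deviation(s): {normal_coverage(k):.4f}")
```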
Histogram Analysis
Histogram: a graph that displays a tabulation of data frequencies
[Figure: histogram; x-axis from 10000 to 90000, y-axis (frequency) from 0 to 40]
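A histogram is just a frequency count per bucket. The sketch below assumes equal-width buckets; the `prices` values and the `histogram` helper are hypothetical and only illustrate the tabulation step.

```python
from collections import Counter

def histogram(values, low, width, n_bins):
    """Count how many values fall into each of n_bins equal-width buckets starting at low."""
    counts = Counter()
    for v in values:
        idx = int((v - low) // width)
        if 0 <= idx < n_bins:  # values outside the covered range are ignored
            counts[idx] += 1
    return [counts[i] for i in range(n_bins)]

# Hypothetical values, bucketed into five ranges of width 20000.
prices = [12000, 15000, 31000, 33000, 35000, 52000, 58000, 71000, 88000, 95000]
counts = histogram(prices, low=0, width=20000, n_bins=5)

for i, c in enumerate(counts):
    print(f"{i * 20000:>6} to {(i + 1) * 20000:<6} {'#' * c}")
```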
Histograms Often Tell More than Boxplots
Two histograms may correspond to the same boxplot
Same values for: min, Q1, median, Q3, max
But the data distributions differ
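To make this concrete, here is a small sketch with two hypothetical data sets that share the same five-number summary but are distributed differently. Quartiles are computed with `statistics.quantiles` using the "inclusive" method, which is one of several common conventions; other conventions can give slightly different quartiles.

```python
from collections import Counter
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return (data[0], q1, med, q3, data[-1])

# Hypothetical data: A is evenly spread, B is clumped around 3 and 7.
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [1, 3, 3, 3, 5, 7, 7, 7, 9]

print(five_number_summary(a))
print(five_number_summary(b))
print(sorted(Counter(a).items()))  # every value occurs once
print(sorted(Counter(b).items()))  # 3 and 7 occur three times each
```

Both data sets would produce the same boxplot, while their frequency counts (and therefore their histograms) are clearly different.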
Scatter plot
Displays bivariate data to reveal clusters, outliers, etc.
Each point represents the pair of coordinates of one data object
Positively and Negatively Correlated Data
Uncorrelated Data
Data Visualization
Why data visualization?
Gain insight into an information space by mapping data onto graphical
primitives
Provide qualitative overview of large data sets
Help find interesting regions and suitable parameters for further
quantitative analysis
Geometric Projection Visualization Techniques
Scatterplot Matrices
Similarity and Dissimilarity
Similarity
A numerical measure of how alike two data objects are
The value is higher when the objects are more alike
Range [0,1]
Dissimilarity (e.g., distance)
A numerical measure of how different two data objects are
The value is lower when the objects are more alike
Minimum dissimilarity is 0
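Data Matrix and Dissimilarity Matrix
Data matrix
n data points with p dimensions
Two modes
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
Dissimilarity matrix
n data points, but only the distance between each pair is recorded
A triangular matrix
Single mode
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$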
Proximity Measure for Nominal Attributes
Proximity Measure for Binary Attributes
A contingency table for binary data (rows: object i, columns: object j)
Dissimilarity between Binary Variables
Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
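The patient table above can be compared pairwise with a short sketch. Several things here are assumptions, since the slide's formula did not survive extraction: Y and P are coded as 1 and N as 0, gender is left out as a symmetric attribute, and the remaining attributes are treated as asymmetric binary with dissimilarity (r + s) / (q + r + s), where q counts 1/1 matches and r, s count the two kinds of mismatch. Matches of 0/0 are ignored.

```python
def encode(row):
    """Map Y (yes) and P (positive) to 1, N to 0 (an assumed coding)."""
    return [1 if v in ("Y", "P") else 0 for v in row]

def asymmetric_binary_dissimilarity(a, b):
    """(r + s) / (q + r + s); undefined if both objects are all zeros."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1 .. Test-4 from the table (gender omitted).
patients = {
    "Jack": encode(["Y", "N", "P", "N", "N", "N"]),
    "Mary": encode(["Y", "N", "P", "N", "P", "N"]),
    "Jim": encode(["Y", "P", "N", "N", "N", "N"]),
}

for first, second in (("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")):
    d = asymmetric_binary_dissimilarity(patients[first], patients[second])
    print(f"d({first}, {second}) = {d:.2f}")
```

Under these assumptions Jack and Mary come out as the most similar pair and Jim and Mary as the least similar.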
Z-score:
$z = \frac{x - \mu}{\sigma}$
x: raw score to be standardized, $\mu$: mean of the population, $\sigma$: standard deviation
the distance between the raw score and the population mean in units of the standard deviation
negative when the raw score is below the mean, positive when above
An alternative way: Calculate the mean absolute deviation
$s_f = \frac{1}{n} (|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$
where
$m_f = \frac{1}{n} (x_{1f} + x_{2f} + \cdots + x_{nf})$
standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
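The alternative standardization, which divides by the mean absolute deviation instead of the standard deviation, can be sketched as follows. The function name `standardize_mad` and the sample values are assumptions for illustration.

```python
def standardize_mad(values):
    """Standardize one attribute: z_if = (x_if - m_f) / s_f,
    where m_f is the mean and s_f is the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n
    s = sum(abs(x - m) for x in values) / n  # must be non-zero
    return [(x - m) / s for x in values]

# Hypothetical attribute values: mean 4, mean absolute deviation 1.
z = standardize_mad([2, 4, 4, 6])
print(z)
```

A value below the mean gets a negative score and a value above it a positive one, as with the ordinary z-score.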
Example: Data Matrix and Dissimilarity Matrix
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5
Dissimilarity Matrix (with Euclidean Distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
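The triangular dissimilarity matrix in this example can be reproduced with a short sketch. It uses `math.dist` (Python 3.8+) for the Euclidean distance; the helper name `dissimilarity_matrix` is chosen for this example.

```python
import math

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def dissimilarity_matrix(pts, dist):
    """Lower-triangular matrix: row i holds d(i, j) for j = 0..i (diagonal is 0)."""
    names = list(pts)
    return [
        [dist(pts[names[i]], pts[names[j]]) for j in range(i + 1)]
        for i in range(len(names))
    ]

matrix = dissimilarity_matrix(points, math.dist)
for name, row in zip(points, matrix):
    print(name, "  ".join(f"{d:.2f}" for d in row))
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.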
Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called L-h norm)
Properties
$d(i, j) > 0$ if $i \neq j$, and $d(i, i) = 0$ (Positive definiteness)
$d(i, j) = d(j, i)$ (Symmetry)
$d(i, j) \leq d(i, k) + d(k, j)$ (Triangle Inequality)
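Special Cases of Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance
E.g., the Hamming distance: the number of bits that differ between two binary vectors
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
h = 2: Euclidean (L2 norm) distance
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
h → ∞: supremum distance
$d(i, j) = \max_{f} |x_{if} - x_{jf}|$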
Example: Minkowski Distance
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5
Dissimilarity Matrices
Manhattan (L1)
L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0
Euclidean (L2)
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
Supremum
L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
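The three matrices in this example follow from one general function. A minimal sketch, with `minkowski` as an illustrative name; passing `math.inf` as the order is this example's way of selecting the supremum distance.

```python
import math

def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

x1, x2, x3, x4 = (1, 2), (3, 5), (2, 0), (4, 5)

for label, h in (("Manhattan", 1), ("Euclidean", 2), ("Supremum", math.inf)):
    print(f"{label:<10} d(x1, x2) = {minkowski(x1, x2, h):.2f}"
          f"   d(x3, x4) = {minkowski(x3, x4, h):.2f}")
```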
Ordinal Variables
Attributes of Mixed Type
A database may contain all attribute types
Nominal, symmetric binary, asymmetric binary, numeric, ordinal
One may use a weighted formula to combine their effects
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is numeric: use the normalized distance
f is ordinal
Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, where $M_f$ is the number of ordered states of f
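A rough sketch of the weighted formula follows. The schema layout, the attribute names, and the sample objects are all hypothetical. Two further assumptions are made because the slide does not spell them out: the indicator is set to 0 when a value is missing or when an asymmetric binary attribute is 0 in both objects, and the normalized numeric distance is the absolute difference divided by the attribute's range.

```python
def mixed_dissimilarity(a, b, schema):
    """Average the per-attribute dissimilarities over the attributes that count."""
    total = 0.0
    used = 0
    for name, spec in schema.items():
        x, y = a.get(name), b.get(name)
        if x is None or y is None:
            continue  # indicator = 0: missing value (assumption)
        kind = spec["kind"]
        if kind == "asymmetric_binary" and x == 0 and y == 0:
            continue  # indicator = 0: shared absence carries no information (assumption)
        if kind in ("nominal", "binary", "asymmetric_binary"):
            d = 0.0 if x == y else 1.0
        elif kind == "numeric":
            d = abs(x - y) / (spec["max"] - spec["min"])
        elif kind == "ordinal":
            levels = spec["levels"]
            # z = (r - 1) / (M - 1), with rank r = index + 1
            zx = levels.index(x) / (len(levels) - 1)
            zy = levels.index(y) / (len(levels) - 1)
            d = abs(zx - zy)
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        total += d
        used += 1
    return total / used  # undefined if no attribute counts

# Hypothetical schema and objects.
schema = {
    "color": {"kind": "nominal"},
    "size": {"kind": "ordinal", "levels": ["small", "medium", "large"]},
    "weight": {"kind": "numeric", "min": 0, "max": 100},
}
obj_a = {"color": "red", "size": "small", "weight": 20}
obj_b = {"color": "blue", "size": "large", "weight": 70}
obj_c = {"color": "red", "size": "medium", "weight": 20}

print(round(mixed_dissimilarity(obj_a, obj_b, schema), 3))
print(round(mixed_dissimilarity(obj_a, obj_c, schema), 3))
```

Because every per-attribute term lies in [0, 1], the combined value also lies in [0, 1] regardless of the attribute types involved.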
Cosine Similarity
A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.
Example: Cosine Similarity
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$,
where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 0.94
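The worked example can be checked with a few lines of Python. The function name `cosine_similarity` is chosen for this sketch; the vectors are the ones from the example.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))  # 0.94
```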
KL Divergence: Comparing Two Probability Distributions
The Kullback-Leibler (KL) divergence: measures the difference between two probability distributions over the same variable x
From information theory, closely related to relative entropy, information divergence, and information for discrimination
$D_{KL}(p(x) \| q(x))$: divergence of q(x) from p(x), measuring the information lost when q(x) is used to approximate p(x)
Discrete form: $D_{KL}(p(x) \| q(x)) = \sum_{x \in X} p(x) \ln \frac{p(x)}{q(x)}$
If one distribution says an event e is possible (i.e., p(e) > 0), and the other predicts it is absolutely impossible (i.e., q(e) = 0), then the two distributions are absolutely different
However, in practice, P and Q are derived from frequency distributions, not counting the possibility of unseen events. Thus smoothing is needed
Example: P : (a : 3/5, b : 1/5, c : 1/5). Q : (a : 5/9, b : 3/9, d : 1/9)
Need to introduce a small constant $\epsilon$, e.g., $\epsilon = 10^{-3}$
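The smoothing step for P and Q can be sketched as below. The scheme is an assumption, since the slides only say that a small constant is introduced: each symbol missing from a distribution gets probability epsilon, and the same total mass is subtracted evenly from the symbols that were observed, so each distribution still sums to 1. The natural logarithm is used.

```python
import math

def smooth(dist, universe, eps=1e-3):
    """Give unseen symbols probability eps and take that mass evenly from the seen ones."""
    missing = [s for s in universe if s not in dist]
    cut = eps * len(missing) / len(dist)
    smoothed = {s: p - cut for s, p in dist.items()}
    smoothed.update({s: eps for s in missing})
    return smoothed

def kl_divergence(p, q):
    """Discrete KL divergence, sum of p(x) * ln(p(x) / q(x)); needs q(x) > 0 wherever p(x) > 0."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p if p[s] > 0)

P = {"a": 3 / 5, "b": 1 / 5, "c": 1 / 5}
Q = {"a": 5 / 9, "b": 3 / 9, "d": 1 / 9}
universe = sorted(set(P) | set(Q))  # a, b, c, d

P_s = smooth(P, universe)
Q_s = smooth(Q, universe)
print(P_s)
print(Q_s)
print("D_KL(P' || Q') =", kl_divergence(P_s, Q_s))
```

Without smoothing, the term for c would divide by q(c) = 0; after smoothing every term is finite.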