Clustering Fraud Detection
[Figure: scatter plot; horizontal axis x1]
• Cluster j:
[Figure: two scatter plots of x2 against x1, with “Cluster 1” marked]
• Outliers can be seen as “clusters by themselves”
But:
· For (physical, economic, ...) reasons we may have an initial idea of k without being aware that outliers exist
· “Radial/background” noise requires large values of k
[Figure: Trade data. Value (1000 euros) against Quantity (tons)]
2.- TRIMMED k-MEANS
• Trimming is the oldest and most widely used way to achieve robustness.
[Figure: sample on the x axis with two regions marked “Trimmed”]
• But... how do we trim in clustering?
[Figure: non-trimmed “bridge” points]
[Figure: symmetric trimming?]
minimizing
$$\sum_{j=1}^{k} \sum_{i \in R_j} \| x_i - m_j \|^2.$$
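The criterion above can be illustrated with a small pure-Python sketch. It assumes the usual trimmed k-means reading of the notation (m_j is the center of cluster R_j, and R_0 collects the [nα] trimmed observations) and alternates "assign, trim, update centers" steps from user-supplied starting centers. The function names `assign_and_trim` and `trimmed_kmeans` are hypothetical; this is not the tclust implementation, which among other things uses several random starts.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_and_trim(points, centers, n_trim):
    """Assign each point to its closest center, then trim the n_trim worst.

    Returns (labels, objective): labels[i] is 0 for a trimmed point (R_0)
    and j in 1..k for a point kept in cluster R_j.
    """
    n = len(points)
    nearest = [min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
               for x in points]
    d = [sq_dist(x, centers[j]) for x, j in zip(points, nearest)]
    order = sorted(range(n), key=lambda i: d[i])
    kept = order[:n - n_trim]
    labels = [0] * n
    for i in kept:
        labels[i] = nearest[i] + 1
    return labels, sum(d[i] for i in kept)


def trimmed_kmeans(points, init_centers, alpha, n_iter=20):
    """Illustrative trimmed k-means from given initial centers."""
    n_trim = math.floor(len(points) * alpha)   # #R_0 = [n * alpha]
    centers = [tuple(c) for c in init_centers]
    for _ in range(n_iter):
        labels = assign_and_trim(points, centers, n_trim)[0]
        for j in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == j + 1]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    labels, objective = assign_and_trim(points, centers, n_trim)
    return objective, centers, labels
```

With α = 0 the sketch reduces to ordinary k-means iterations; in practice one would run it from several starting configurations and keep the solution with the smallest objective.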
[Figure: panels (a) and (b), scatter plots of x2 against x1]
• Old Faithful Geyser data: x1 = “Eruption length”, x2 = “Previous
eruption length” and n = 271
[Figure: Classification, k = 3, α = 0.03. Previous eruption length (vertical axis) against Eruption length (horizontal axis)]
[Figure: two pairs of scatter plots; the second pair is labelled (a) and (b), with x1 on the horizontal axis]
> library(tclust)
• tkmeans(data, k, alpha)
k = “number of groups”
[Figure: scatter plot of x2 against x1]
• Old Faithful Geyser data again:
[Figure: two Classification panels, k = 4 with α = 0 and k = 3 with α = 0.03; vertical axis: Previous eruption length]
[Figure: Objective Function Value against α, two panels (α from 0.0 to 0.8 and α from 0.00 to 0.10); the curves are labelled 1 to 5]
4.- ROBUST CLUSTERING AROUND LINEAR SUBSPACES
• Robust linear grouping: higher dimension p, but assuming that the data “live” in k low-dimensional (affine) subspaces.
We search for
· k linear subspaces $h_1, \dots, h_k$ in $\mathbb{R}^p$
· a partition $\{R_0, R_1, \dots, R_k\}$ of $\{1, 2, \dots, n\}$ with $\#R_0 = [n\alpha]$
minimizing
$$\sum_{j=1}^{k} \sum_{i \in R_j} \| x_i - \mathrm{Pr}_{h_j}(x_i) \|^2.$$
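As an illustration of this criterion, the following Python sketch evaluates it for given candidate subspaces in the simplest case, where each h_j is an affine line described by a point m and a unit direction u (this parametrization and the function names are assumptions made for the example). It only evaluates the objective with trimming; the search over subspaces is not shown.

```python
import math


def sq_residual(x, line):
    """Squared distance from x to the affine line {m + t*u}, u a unit vector."""
    m, u = line
    diff = [a - b for a, b in zip(x, m)]
    t = sum(d * c for d, c in zip(diff, u))       # coordinate along the line
    proj = [b + t * c for b, c in zip(m, u)]      # Pr_h(x)
    return sum((a - p) ** 2 for a, p in zip(x, proj))


def trimmed_rlg_objective(points, lines, alpha):
    """Return (objective, labels) for fixed candidate lines.

    labels[i] is 0 for a trimmed point (R_0) and j in 1..k otherwise.
    """
    n = len(points)
    n_trim = math.floor(n * alpha)                # #R_0 = [n * alpha]
    best = []
    for x in points:
        r = [sq_residual(x, h) for h in lines]
        j = min(range(len(lines)), key=lambda g: r[g])
        best.append((r[j], j + 1))                # closest line and residual
    order = sorted(range(n), key=lambda i: best[i][0])
    kept = set(order[:n - n_trim])                # discard largest residuals
    labels = [best[i][1] if i in kept else 0 for i in range(n)]
    objective = sum(best[i][0] for i in kept)
    return objective, labels
```

A full algorithm would alternate this assignment-and-trimming step with refitting each subspace to its non-trimmed points.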
[Figure: two scatter plots of Y against X]
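$$\hat{x}_i = \mathrm{Pr}_h(x_i) = \hat{x}_i(B_q, A_q, m) = m + B_q a_i$$
· $A_q = \begin{pmatrix} \text{--}\, a_1 \,\text{--} \\ \vdots \\ \text{--}\, a_i \,\text{--} \\ \vdots \\ \text{--}\, a_n \,\text{--} \end{pmatrix}$ is the scores matrix (n × q)
· $B_q = \begin{pmatrix} \text{--}\, b_1 \,\text{--} \\ \vdots \\ \text{--}\, b_j \,\text{--} \\ \vdots \\ \text{--}\, b_p \,\text{--} \end{pmatrix}$ is a matrix (p × q) whose columns generate a q-dimensional approximating subspace h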
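To make the notation concrete, here is a minimal Python sketch of the reconstruction m + B_q a_i, with B_q stored as a list of its p rows b_j. The helper `scores` computes a_i = B_q^T (x_i − m), which gives the orthogonal-projection scores only under the added assumption that the columns of B_q are orthonormal; both function names are hypothetical.

```python
def reconstruct(m, B, a):
    """Return x_hat = m + B a, where B is a list of p rows of length q."""
    return [m_j + sum(b_jl * a_l for b_jl, a_l in zip(b_j, a))
            for m_j, b_j in zip(m, B)]


def scores(x, m, B):
    """Return a = B^T (x - m); assumes the columns of B are orthonormal."""
    q = len(B[0])
    return [sum(B[j][l] * (x[j] - m[j]) for j in range(len(x)))
            for l in range(q)]
```

For example, with p = 3, q = 1, m = (1, 1, 1) and the single column (1, 0, 0), the point x = (3, 2, 1) has score 2 and reconstruction (3, 1, 1).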
• Principal Components Analysis is highly non-robust!!!
xij is the “quantity-value ratio” for country i in the j-th month (or
the j-th year; the j-th product;...) for j = 1, ..., p
• Casewise trimming: Trim xi cases with (at least one) outlying xij
n = 100 × p = 4 data matrix with 2% outlying cells:
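A short calculation shows why casewise trimming can be wasteful here. With 2% of the 100 × 4 = 400 cells outlying there are 8 outlying cells, which can spoil up to 8 of the 100 rows. Under the additional simplifying assumption (not stated on the slides) that cells are contaminated independently with probability ε, the expected fraction of rows with at least one outlying cell is 1 − (1 − ε)^p. The function names below are hypothetical.

```python
def max_rows_hit(n, p, cell_rate):
    """Upper bound on the number of rows with at least one outlying cell."""
    n_cells = round(cell_rate * n * p)   # number of outlying cells
    return min(n, n_cells)


def expected_row_fraction(p, cell_rate):
    """Expected fraction of rows containing an outlying cell.

    Assumes cells are contaminated independently (a simplification).
    """
    return 1.0 - (1.0 - cell_rate) ** p
```

For ε = 0.02 and p = 4 the expected fraction is about 0.078, so trimming whole rows discards roughly four times as many cells as are actually outlying; as p grows the fraction approaches 1.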
• Cellwise trimming:
$$\hat{x}^g_{ij} = m^g_j + (a^g_i)^T b^g_j \quad \text{for } g = 1, \dots, G$$
• Minimize
$$\min_{w^g_{ij},\, B^g_q,\, A^g_q,\, m^g} \; \sum_{i=1}^{n} \sum_{j=1}^{p} \sum_{g=1}^{G} w^g_{ij} \, (x_{ij} - \hat{x}^g_{ij})^2.$$
$w^g_{ij} = 1$ if cell $x_{ij}$ is assigned to cluster g and non-trimmed, and 0 otherwise
Appropriate constraints on the $w^g_{ij}$
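The objective can be evaluated directly once the weights are fixed. The Python sketch below does that, and builds the weights under one possible constraint chosen purely for illustration (an assumption, since the slides do not spell the constraints out): each row belongs to exactly one cluster, and individual cells of that row may be trimmed. Clusters are indexed 0, ..., G−1 in the code, and all names are hypothetical.

```python
def build_weights(row_cluster, trimmed, G):
    """w[g][i][j] = 1 if row i is in cluster g and cell (i, j) is not trimmed."""
    return [[[1 if (row_cluster[i] == g and not trimmed[i][j]) else 0
              for j in range(len(trimmed[i]))]
             for i in range(len(trimmed))]
            for g in range(G)]


def cellwise_objective(x, x_hat, w):
    """Sum over i, j, g of w[g][i][j] * (x[i][j] - x_hat[g][i][j]) ** 2."""
    total = 0.0
    for g in range(len(w)):
        for i, row in enumerate(x):
            for j, x_ij in enumerate(row):
                if w[g][i][j]:
                    total += (x_ij - x_hat[g][i][j]) ** 2
    return total
```

Trimmed cells contribute nothing to the sum, so a single wild value in a row does not force the whole row to be discarded.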
[Figure: two panels; horizontal axis from 0.0 to 1.0]
• k = 2, q = 2 and α = 0.05:
[Figure: two panels; horizontal axis from 0.0 to 1.0]
• Results:
• Real data example: average daily temperatures at 83 Spanish meteorological stations from 2007 to 2009 (n = 83 and p = 1096).
• Artificial outliers:
Two periods of 50 consecutive days in Oviedo replaced by 0 °C.
[Figure: scatter plots of the first two scores within each cluster, points labelled by station name; surviving panel titles: “First two scores of cluster 3” and “First two scores of cluster 4”]
[Figure: cluster assignment (legend “Clusters”: 0, 1, 2, 3, 4) for each of the 83 stations (vertical axis) along the days (horizontal axis “Days”, 0 to 900)]
• Reconstructed curves and true data in Oviedo:
• Conclusions: