Clustering Fraud Detection

This document discusses a robust clustering approach to fraud detection using trimmed k-means clustering. It begins by introducing k-means clustering and its lack of robustness to outliers. It then describes trimmed k-means clustering, which aims to make k-means more robust by allowing a fraction of observations to be trimmed, or excluded, from the clustering process. The document also discusses modeling clusters using multivariate normal distributions and extending the trimmed approach to model-based clustering. Software tools for performing trimmed k-means and trimmed model-based clustering in R and Matlab are also mentioned.

A robust clustering approach to fraud detection

Luis Angel García-Escudero

Dpto. de Estadística e I.O. and IMUVA - Universidad de Valladolid

joint (and on-going...) work with A. Mayo-Iscar, A. Gordaliza, C. Matrán


(U. Valladolid), and with colleagues M. Riani, A. Cerioli (U. Parma) and D. Perrotta, F. Torti (JRC-Ispra)

Conference on Benford’s Law for fraud detection. Stresa, July 2019


1. CLUSTERING AND ROBUSTNESS

• Clustering is the task of grouping a set of objects in such a way that objects in the same cluster are more similar to each other than to those in other clusters.
• Sample mean:
 m = (1/n) Σ_{i=1}^n xi minimizes Σ_{i=1}^n ‖xi − m‖²
 m may be seen as the “center” of a data-cloud:

[Figure: a two-dimensional data cloud with its sample mean marked “X”]

• k clusters ⇒ k “data-clouds” ⇒ k-means


• k-means: Search for
 k centers m1, ..., mk
 a partition {R1, ..., Rk } of {1, 2, ..., n}
minimizing

    Σ_{j=1}^k Σ_{i∈Rj} ‖xi − mj‖².

• Cluster j:

Rj = {i : ‖xi − mj‖ ≤ ‖xi − ml‖ for every l = 1, ..., k}

(...assignment to the closest center...)
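The two alternating steps (assignment to the closest center, update of each center with the cluster mean) can be sketched in a few lines. A minimal pure-Python illustration, not the authors' code; the data and initial centers in the usage note are made up:

```python
def sq_dist(x, m):
    """Squared Euclidean distance ||x - m||^2."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, m))

def kmeans(X, centers, n_iter=20):
    """Lloyd's algorithm: alternate assignment and center update."""
    k = len(centers)
    labels = []
    for _ in range(n_iter):
        # Assignment: R_j = {i : ||x_i - m_j|| <= ||x_i - m_l|| for all l}
        labels = [min(range(k), key=lambda j: sq_dist(x, centers[j])) for x in X]
        # Update: each center becomes the mean of its cluster
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels
```

On two well-separated clouds, e.g. points around (0, 0) and around (10, 10) with initial centers at those locations, the obvious grouping is recovered.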


• Robustness: Many statistical procedures are strongly affected by even a few outlying observations:

 The mean is not robust:


x̄ = (1.72 + 1.67 + 1.80 + 1.70 + 1.82 + 1.73 + 1.78)/7 = 1.745

but with a single mistyped value (182 instead of 1.82):

x̄ = (1.72 + 1.67 + 1.80 + 1.70 + 182 + 1.73 + 1.78)/7 = 27.485
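The two averages above can be checked directly; a small numeric verification using only the slide's numbers:

```python
# The height example: a single typo (182 instead of 1.82)
# drags the sample mean far away from all of the data.
heights = [1.72, 1.67, 1.80, 1.70, 1.82, 1.73, 1.78]
corrupted = [1.72, 1.67, 1.80, 1.70, 182, 1.73, 1.78]

clean_mean = sum(heights) / len(heights)    # about 1.746, a sensible "center"
bad_mean = sum(corrupted) / len(corrupted)  # about 27.486, larger than every correct height
```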

 k-means inherits that lack of robustness from the mean


• Lack of robustness of k-means:

[Figure: (a) 3-means and (b) 2-means solutions on contaminated data; two true groups are artificially joined into one cluster (“Cluster 1”, “Cluster 2”, “2 groups artific. joined”)]
• Outliers can be seen as “clusters by themselves”

• So, why not increase the number of clusters...?

 But:
· For (physical, economic,...) reasons we may have an initial idea of k without being aware of the existence of outliers
· “Radial/background” noise requires large k’s

• Moreover, the detection of outliers may be the goal itself!!!


• Outliers in trade data can be associated with “frauds”:

 Heterogeneous sources of data (clustering) + Few outliers (frauds??)

[Figure: trade data, Quantity (tons) vs. Value (1000 euros): heterogeneous groups plus a few outliers]
2. TRIMMED k-MEANS

• Trimming is the oldest and most widely used way to achieve robustness.

• Trimmed mean: The proportion α/2 smallest and α/2 largest observations are discarded before computing the mean:
[Figure: one-dimensional data with the α/2 smallest and the α/2 largest observations marked “Trimmed”]
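The symmetric trimmed mean is a one-liner. A sketch (an illustration, not tied to any package), reusing the mistyped height from the earlier example:

```python
def trimmed_mean(xs, alpha):
    """Symmetric trimmed mean: drop the alpha/2 smallest and the
    alpha/2 largest observations, then average what remains."""
    xs = sorted(xs)
    cut = int(len(xs) * alpha / 2)
    kept = xs[cut:len(xs) - cut] if cut > 0 else xs
    return sum(kept) / len(kept)

# With alpha = 0.3 on 7 points, one observation is cut at each end,
# so the typo 182 no longer influences the estimate (close to 1.746).
trimmed_mean([1.72, 1.67, 1.80, 1.70, 182, 1.73, 1.78], 0.3)
```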
• But,... how to trim in clustering?

 Why not trimming outlying “bridge” points?

[Figure: two clusters joined by a few non-trimmed “bridge” points]

 Why a symmetric trimming?


[Figure: one-dimensional data with one-sided contamination, questioning a symmetric trimming]

 How to trim in multivariate clustering problems?


• Idea: The data themselves tell us which are the most outlying observations!!
 Data-driven, adaptive, impartial,... trimming!

• Trimmed k-means: we search for


 k centers m1, ..., mk and
 a partition {R0, R1, ..., Rk } of {1, 2, ..., n} with #R0 = [nα]

minimizing

    Σ_{j=1}^k Σ_{i∈Rj} ‖xi − mj‖².

[A fraction α of the data is not taken into account: trimmed]
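A minimal sketch of the trimmed k-means iteration (an illustration of the idea, not the reference implementation): assign each point to its closest center, trim the [nα] worst-fitting points into R0, and recompute the centers from the rest.

```python
def sq_dist(x, m):
    """Squared Euclidean distance ||x - m||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, m))

def trimmed_kmeans(X, centers, alpha, n_iter=20):
    """Trimmed k-means sketch: the [n*alpha] points farthest from
    their closest center form R0 and are ignored in the update."""
    k, n = len(centers), len(X)
    n_trim = int(n * alpha)
    labels = [None] * n
    for _ in range(n_iter):
        # closest center (squared distance, label) for every point
        best = [min((sq_dist(x, centers[j]), j) for j in range(k)) for x in X]
        # keep the n - n_trim points with the smallest such distance
        order = sorted(range(n), key=lambda i: best[i][0])
        kept = set(order[:n - n_trim])
        labels = [best[i][1] if i in kept else None for i in range(n)]
        # update each center with the mean of its untrimmed points
        for j in range(k):
            members = [X[i] for i in kept if best[i][1] == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels
```

With None marking the points in R0, a gross outlier ends up trimmed rather than dragging a center towards it.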


• Black circles: trimmed points (k = 3 and α = 0.05):

[Figure: two data sets, panels (a) and (b), clustered with k = 3 and α = 0.05; black circles mark the trimmed points]
• Old Faithful Geyser data: x1 = “Eruption length”, x2 = “Previous
eruption length” and n = 271
[Figure: classification of the Old Faithful data with k = 3, α = 0.03; “Eruption length” vs. “Previous eruption length”]

 k = 3 and α = 0.03 (0.03 · 271 ' 9 trimmed obs.): 6 rare “short-followed-by-short” eruptions trimmed, 3 bridge points...
3. ROBUST MODEL-BASED CLUSTERING

• k-means and trimmed k-means prefer spherical clusters:

[Figure: 2-means solutions for (a) spherical groups and (b) elliptical groups]

• Elliptically contoured clusters?


• Multivariate normal distributions with densities φ(·; µ, Σ):
 µ = (2, 2)′ and Σ = [1 0; 0 1] [spherical] in (a)
 µ = (2, 2)′ and Σ = [2 1; 1 1] [non-spherical] in (b)

[Figure: samples from the two normal distributions, panels (a) and (b)]

φ(x; µ, Σ) = (2π)^{−p/2} |Σ|^{−1/2} exp(−(x − µ)′ Σ^{−1} (x − µ)/2)

• Trimmed likelihoods: Search for
 k centers m1, ..., mk ,
 k scatter matrices S1, ..., Sk , and,
 a partition {R0, R1, ..., Rk } of {1, 2, ..., n} with #R0 = [nα]
maximizing

    Σ_{j=1}^k Σ_{xi∈Rj} log φ(xi; mj, Sj)   (obs. in R0 not taken into account)

García-Escudero et al. 2008, Neykov et al. 2007, Gallegos and Ritter 2005,...


• Constraints on the Sj scatter matrices are needed:
 the target likelihood is otherwise unbounded
 to avoid detecting (non-interesting) “spurious” clusters

• Control relative axes’ lengths (eigenvalues constraints):

[Figure: solutions obtained with c = 1 and with a large c value]
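The eigenvalue constraint can be illustrated with a simplified truncation. This is only a sketch of the idea of bounding the eigenvalue ratio by c; the actual tclust algorithm chooses the truncation level optimally:

```python
def constrain_eigenvalues(eigs, c):
    """Force max(eigs) / min(eigs) <= c by raising the smallest
    eigenvalues up to max(eigs) / c. With c = 1 all eigenvalues
    become equal (spherical clusters); a large c leaves them
    nearly untouched (flexible elliptical clusters)."""
    floor = max(eigs) / c
    return [max(e, floor) for e in eigs]
```

For instance, constraining eigenvalues [100, 1, 0.5] with c = 10 yields [100, 10, 10]: no cluster can become arbitrarily flat, which is what produces unbounded likelihoods and spurious clusters.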


• The FSDA Matlab toolbox

• The R package tclust at the CRAN repository:

> library(tclust)
• tkmeans(data, k, alpha)
 k = “number of groups”

 alpha = “trimming proportion”


• tclust(data, k, alpha, restr.fact, ...)
 restr.fact = “Strength of the constraints”
• tclust(X,k=3,alpha=0.03,restr.fact=50)
[Figure: tclust solution with k = 3, α = 0.03 and a large restr.fact]
• Old Faithful Geyser data again:
[Figure: Old Faithful data: classification with k = 4, α = 0 (left) and with k = 3, α = 0.03 (right); “Eruption length” vs. “Previous eruption length”]

• Why was k = 3 and α = 0.03 a sensible solution?


• Applying ctlcurves to the Old Faithful Geyser data:
[Figure: CTL-curves for the Old Faithful data (objective function value vs. trimming level α) for k = 1, ..., 5 with restriction factor = 50; left panel α ∈ [0, 0.8], right panel zoomed to α ∈ [0, 0.10]]
4. ROBUST CLUSTERING AROUND LINEAR SUBSPACES
• Robust linear grouping: Higher dimension p, but assuming that our data “live” in k low-dimensional (affine) subspaces...

 We search for
· k linear subspaces h1, ..., hk in Rp
· a partition {R0, R1, ..., Rk } of {1, 2, ..., n} with #R0 = [nα]
minimizing

    Σ_{j=1}^k Σ_{i∈Rj} ‖xi − Pr_{hj}(xi)‖².

 Prh(·) denotes the “orthogonal” projection onto the linear subspace h
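Orthogonal projection onto a line (a one-dimensional affine subspace h) is enough to illustrate the objective. A hypothetical helper, not taken from the cited software:

```python
def project_onto_line(x, point, direction):
    """Orthogonal projection Pr_h(x) of x onto the affine line
    h = {point + t * direction : t real}."""
    d2 = sum(d * d for d in direction)
    t = sum((xi - pi) * di for xi, pi, di in zip(x, point, direction)) / d2
    return [pi + t * di for pi, di in zip(point, direction)]

def sq_residual(x, point, direction):
    """Squared distance ||x - Pr_h(x)||^2 entering the objective."""
    p = project_onto_line(x, point, direction)
    return sum((xi - pi) ** 2 for xi, pi in zip(x, p))
```

Projecting (3, 4) onto the horizontal axis gives (3, 0) with squared residual 16. In robust linear grouping each xi is assigned to the subspace hj with the smallest such residual, and the [nα] points with the largest residuals are trimmed.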


• Example: Three linear structures in presence of noise:

[Figure: three linear structures in presence of noise, fitted with (a) α = 0 and (b) α = 0.1 (◦ = “Trimmed”)]

Trimmed “mixtures of regressions” can also be applied...


• k = 1 case ⇒ Robust “Principal Components Analysis (PCA)”:
 PCA provides a q-dimensional (q ≪ p) representation of the data by

    min_{Bq, Aq, m} Σ_{i=1}^n ‖xi − x̂i‖²   for   x̂i = Pr_h(xi) = x̂i(Bq, Aq, m) = m + Bq ai

· Aq is the scores matrix (n × q), with rows a1′, ..., an′
· Bq is a matrix (p × q), with rows b1′, ..., bp′, whose columns generate a q-dimensional approximating subspace h
• Principal Components Analysis is highly non-robust!!!

• Least Trimmed Squares PCA (Maronna 2005): Minimize

    Σ_{i=1}^n wi ‖xi − x̂i‖² = Σ_{i=1}^n wi ‖xi − x̂i(Bq, Aq, m)‖²,

with {wi}_{i=1}^n being “0-1 weights” such that

    Σ_{i=1}^n wi = [n(1 − α)]

 Weights: wi = 1 if xi is not trimmed, wi = 0 if xi is trimmed.
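For a given fit, selecting the 0-1 weights is straightforward: keep the [n(1 − α)] cases with the smallest residuals. A sketch (illustrative only; Maronna's procedure also alternates this selection with refitting the subspace):

```python
def lts_case_weights(sq_residuals, alpha):
    """Casewise LTS weights: w_i = 1 for the [n(1 - alpha)] cases
    with the smallest squared residuals ||x_i - xhat_i||^2, else 0."""
    n = len(sq_residuals)
    h = int(n * (1 - alpha))
    keep = set(sorted(range(n), key=lambda i: sq_residuals[i])[:h])
    return [1 if i in keep else 0 for i in range(n)]
```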
• Cases → xi = (xi1, ..., xip)′ ∈ Rp and Cells → xij ∈ R
 i denotes a country (or a trader; company;...) for i = 1, ..., n

 xij is the “quantity-value ratio” for country i in the j-th month (or
the j-th year; the j-th product;...) for j = 1, ..., p

• Casewise trimming: Trim xi cases with (at least one) outlying xij
n = 100 × p = 4 data matrix with 2% outlying cells:

[Figure: outlying xij cells (left); trimmed xi cases shown as black lines (right)]


• But when the dimension p increases... we do not expect many xi
completely free of outlying xij cells:
n = 100 × p = 80 data matrix with 2% outlying cells:

[Figure: outlying xij cells (left); trimmed xi cases shown as black lines (right)]

• Cellwise trimming:

 Only trimming outlying cells... (⇒ “Particular” frauds identified...??)


• PCA approximation x̂i = m + Bq ai = (x̂i1, ..., x̂ip)′ re-written as

    x̂ij = mj + ai′ bj.

• Cellwise LTS (Cevallos-Valdiviezo 2016): Minimize

    Σ_{i=1}^n Σ_{j=1}^p wij (xij − mj − ai′ bj)²

 wij = 0 if cell xij is trimmed and wij = 1 if not, with

    Σ_{i=1}^n wij = [n(1 − α)], for j = 1, ..., p.
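The only change from the casewise version is that the weights are chosen column by column. A sketch under the same caveats, where R holds the squared residuals (xij − mj − ai′bj)²:

```python
def cellwise_weights(R, alpha):
    """Cellwise LTS weights: within every column j, w_ij = 1 for the
    [n(1 - alpha)] cells with the smallest squared residuals, else 0,
    so single bad cells are trimmed without discarding whole cases."""
    n, p = len(R), len(R[0])
    h = int(n * (1 - alpha))
    W = [[1] * p for _ in range(n)]
    for j in range(p):
        order = sorted(range(n), key=lambda i: R[i][j])
        for i in order[h:]:
            W[i][j] = 0
    return W
```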
• Different patterns/structures in data ⇒ G subspace approximations:

    x̂i^g(B_{qg}^g, A_{qg}^g, m^g) = m^g + B_{qg}^g ai^g   or   x̂ij^g = mj^g + (ai^g)′ bj^g,   for g = 1, ..., G

• Minimize

    min_{wij^g, B_{qg}^g, A_{qg}^g, m^g} Σ_{i=1}^n Σ_{j=1}^p Σ_{g=1}^G wij^g (xij − x̂ij^g)².

 wij^g = 1 if cell xij is assigned to cluster g and non-trimmed, and 0 otherwise

 Appropriate constraints on the wij^g

 q1, ..., qG are intrinsic dimensions...


• Example 1: n = 400 in dimension p = 100 with 2 groups and 2%
“scattered” outliers:
[Figure: the simulated data: two groups of curves plus 2% “scattered” outliers]
• k = 2, q = 2 and α = 0.05:

“-” are the trimmed cells


• Cluster means and trimmed cells (◦):
• Example 2: n = 400 in dimension p = 100 with 2 groups and few
curves with 20% consecutive cells corrupted:
[Figure: the simulated data: two groups of curves, a few of them with 20% consecutive corrupted cells]
• Results:
• Real data example: Average daily temperatures in 83 Spanish meteorological stations between 2007-2009 (n = 83 and p = 1096).
• Artificial outliers:
 Two periods of 50 consecutive days in Oviedo replaced by 0°C.

 150 consecutive days of Huelva temperatures replaced by 0°C.


• Cluster means (four panels): “Meseta” (Central plateau, Castile); Mediterranean; Cantabrian Coast; Canary Islands


• Clustered stations:
• Clusters found and trimmed cells:
[Figure: first two scores of clusters 1-4, labeled with the station names; and a station-by-day display of the clusters found and the trimmed cells (legend: clusters 1-4, 0 = trimmed), x-axis “Days” (0-1096)]
• Reconstructed curves and true real data in Oviedo:
• Conclusions:

 Different patterns/structures in data ⇒ Cluster Analysis

 Robust clustering is aimed at (jointly) detecting the main clusters (bulk of the data) and the outliers ⇒ potential “frauds”...

 Higher-dimensional problems: assume clusters “living” in low-dimensional subspaces

 “Casewise” and “cellwise” trimming


Some References:
· Cuesta-Albertos, J.A., Gordaliza, A. and Matrán, C. (1997), “Trimmed k-means: An attempt to robustify quantizers,” Ann. Statist., 25, 553-576.
· García-Escudero, L.A. and Gordaliza, A. (1999), “Robustness properties of k-means and trimmed k-means,” J. Amer. Statist. Assoc., 94, 956-969.
· García-Escudero, L.A., Gordaliza, A., Matrán, C. and Mayo-Iscar, A. (2008), “A general trimming approach to robust cluster analysis,” Ann. Statist., 36, 1324-1345.
· García-Escudero, L.A., Gordaliza, A., Matrán, C. and Mayo-Iscar, A. (2010), “A review of robust clustering methods,” Advances in Data Analysis and Classification, 4, 89-109.
· Fritz, H., García-Escudero, L.A. and Mayo-Iscar, A. (2012), “tclust: An R package for a trimming approach to Cluster Analysis,” Journal of Statistical Software, 47, Issue 12.

Thanks for your attention!!!
