Online Techniques For Dealing With Concept Drift in Process Mining
Carmona
R. Gavaldà
UPC (Barcelona, Spain)
Outline
The Advent of Process Mining (PM)
The challenge of Concept Drift (CD)
Key ingredients
Online strategy for CD in PM
Experiments
Work in progress
The Advent of Process Mining
Process mining:
BIG DATA in Information Systems
Focus: formal analysis of the processes
Software Engineering challenges:
Process model alignment with reality
Automation!
Formal methods
[source: www.processmining.org]
Example: control flow discovery
Information System
Petri Net (PN)
Control Flow Discovery
Event Log (EL):
1: r,s,sb,p,ac,ap,c
2: r,sb,em,p,ac,ap,c
3: r,sb,p,em,ac,rj,rs,c
...
[Figure: Petri Net (PN) discovered from the log, over the events r, s, sb, em, p, ac, ap, rj, rs, c]
The Challenge of Concept Drift
Traces, in order of arrival (Time):
1: r,s,sb,p,ac,ap,c
2: r,sb,em,p,ac,ap,c
3: r,sb,p,em,ac,rj,rs,c
4: r,em,sb,p,ac,ap,c
5: r,sb,s,p,ac,rj,rs,c
6: r,sb,p,s,ac,ap,c
7: r,sb,p,em,ac,ap,c
Drift!
8: r,em,s,sb,p,ac,ap,c
9: r,sb,em,s,p,ac,ap,c
10: r,sb,em,s,p,ac,rj,rs,c
11: r,em,sb,p,s,ac,ap,c
12: r,em,sb,s,p,ac,rj,rs,c
13: r,em,sb,p,s,ac,ap,c
14: r,sb,p,em,s,ac,ap,c
...
[Figure: two Petri nets over the events r, s, sb, em, p, ac, ap, rj, rs, c: MODEL for time ≤ t (before the drift) and MODEL for time ≥ t+1 (after the drift)]
The Challenge of Concept Drift [Bose-Aalst 11]
From log traces to points in R^n
σ = a,a,b,c,b
Pref(σ):
λ = (0,0,0)
a = (1,0,0)
a,a = (2,0,0)
a,a,b = (2,1,0)
a,a,b,c = (2,1,1)
a,a,b,c,b = (2,2,1)
[Figure: the points plotted in R^3 with axes a, b, c]
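A minimal sketch of the mapping above: every prefix of a trace becomes the vector of its event counts. The function name parikh_prefixes and the fixed ordering of the alphabet are assumptions made for illustration, not names taken from the slides.

```python
def parikh_prefixes(trace, alphabet):
    """Return one count vector per prefix of the trace, starting with the empty prefix."""
    index = {event: k for k, event in enumerate(alphabet)}
    counts = [0] * len(alphabet)
    points = [tuple(counts)]          # empty prefix (lambda)
    for event in trace:
        counts[index[event]] += 1     # one more occurrence of this event
        points.append(tuple(counts))
    return points


# The trace from the slide, sigma = a,a,b,c,b, over the alphabet (a, b, c).
points = parikh_prefixes(["a", "a", "b", "c", "b"], ["a", "b", "c"])
for p in points:
    print(p)
```

A whole log would be mapped by applying the function to each trace and collecting the points.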
From points to convex polyhedra (Points2CP)
Q = Convex Hull of the set of points
[Figure: the convex hull Q drawn in R^3 with axes a, b, c]
mass(Q) = Probability of points in the log inside Q
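A small sketch of learning Q and estimating mass(Q). To stay short and dependency-free it does not compute a convex hull: it swaps in the octagon domain (bounds on ±x_i ± x_j), the simple numerical abstract domain mentioned in the conclusions, which over-approximates the hull. The names learn_octagon, inside and mass are hypothetical, and mass is estimated here as the fraction of a given set of points that falls inside Q.

```python
from itertools import combinations_with_replacement

SIGNS = ((1, 1), (1, -1), (-1, 1), (-1, -1))


def learn_octagon(points):
    """Tightest bound c for every constraint s_i*x_i + s_j*x_j <= c over the points."""
    dim = len(points[0])
    return {
        (i, si, j, sj): max(si * p[i] + sj * p[j] for p in points)
        for i, j in combinations_with_replacement(range(dim), 2)
        for si, sj in SIGNS
    }


def inside(octagon, p):
    """True if p satisfies every constraint of the octagon."""
    return all(si * p[i] + sj * p[j] <= c
               for (i, si, j, sj), c in octagon.items())


def mass(octagon, points):
    """Fraction of the given points that lie inside the octagon."""
    return sum(inside(octagon, p) for p in points) / len(points)


# Learn Q from the prefix points of sigma = a,a,b,c,b (axes a, b, c).
learned = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 1, 1), (2, 2, 1)]
Q = learn_octagon(learned)

# Hypothetical new points: (0,0,1) has a "c" before any "a" and violates c - a <= 0.
new_points = [(1, 0, 0), (2, 1, 0), (0, 0, 1), (1, 1, 1)]
print(mass(Q, new_points))
```

Because the octagon is an over-approximation, a point accepted here is not guaranteed to lie in the exact convex hull; a point rejected here is certainly outside it.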
Outline
The Advent of Process Mining (PM)
Key ingredients:
Numerical Abstract Domains
Concept Drift estimation and change detection
Online strategy for CD in PM
Experiments
Work in progress
Setting
Stream x_1, x_2, …, x_t, …
x_t drawn from distribution D_t, independently
We model change by changes in the D_t's
Windows & change detection
Problem: What size windows?
Large windows: Slow reaction to fast changes
Small windows: Inaccurate estimates, noise sensitive, can't detect small changes
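The reaction-time side of this trade-off can be illustrated with a sketch that compares the mean of the last w items with the mean of the w items before them. The function first_alarm, the threshold 0.45 and the stream are assumptions chosen for illustration: the stream is a deterministic 0/1 pattern whose mean drops from 0.9 to 0.1 at position 200.

```python
def first_alarm(stream, w, threshold):
    """Index of the first item at which the mean of the last w items differs
    from the mean of the w items before them by more than threshold."""
    for t in range(2 * w - 1, len(stream)):
        recent = stream[t - w + 1:t + 1]
        reference = stream[t - 2 * w + 1:t - w + 1]
        if abs(sum(recent) / w - sum(reference) / w) > threshold:
            return t
    return None


# Mean 0.9 for 200 items, then mean 0.1 for 200 items (abrupt change at 200).
before = [0 if i % 10 == 9 else 1 for i in range(200)]
after = [1 if i % 10 == 0 else 0 for i in range(200)]
stream = before + after

small = first_alarm(stream, w=10, threshold=0.45)
large = first_alarm(stream, w=100, threshold=0.45)
print("alarm with w=10:", small)
print("alarm with w=100:", large)
```

The small window raises its alarm a few items after the change, while the large window needs many more items. The other side of the trade-off (noise sensitivity and missed small changes with small windows) is not shown by this noise-free stream.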
ADWIN: Adaptive Window
• Time-scale independent, data-adaptive
• User does not need to guess window size
• Behaves as if the “best fixed-window size” were known
• Keeps the largest window consistent with the statistical hypothesis “no change”
• Keeps a window of size N using O(log N) memory
• O(1) amortized time per item, O(log N) worst case
• C++/Java implementation by A. Bifet available
[Bifet-G 07]
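A simplified sketch of the adaptive-window idea, not the implementation referred to above: it stores the whole window and checks every split in O(N) time per item, instead of the compressed O(log N) structure. The class name SimpleAdwin, the default delta and the exact form of the threshold (a Hoeffding-style bound with the harmonic term m = 1/(1/n0 + 1/n1) and delta divided by the window length, in the spirit of [Bifet-G 07]) are assumptions of this sketch.

```python
import math
from collections import deque


class SimpleAdwin:
    """Simplified adaptive window over values in [0, 1] (stores every item)."""

    def __init__(self, delta=0.01):
        self.delta = delta          # confidence parameter (assumed default)
        self.window = deque()

    def _split_found(self):
        """True if some split W = W0 . W1 has sub-window means that differ too much."""
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        log_term = math.log(4.0 * n / self.delta)
        head = 0.0
        for n0, x in enumerate(self.window, start=1):
            n1 = n - n0
            if n1 == 0:
                break
            head += x
            m = n0 * n1 / n                          # 1 / (1/n0 + 1/n1)
            eps_cut = math.sqrt(log_term / (2.0 * m))
            if abs(head / n0 - (total - head) / n1) > eps_cut:
                return True
        return False

    def add(self, x):
        """Insert x, then drop oldest items while "no change" is rejected.
        Returns True if the window was shrunk, i.e. a change was detected."""
        self.window.append(x)
        changed = False
        while self._split_found():
            self.window.popleft()
            changed = True
        return changed

    def estimate(self):
        """Current estimate: mean of the items kept in the window."""
        return sum(self.window) / len(self.window)


# Deterministic 0/1 stream: mean 0.9 for 200 items, then mean 0.1.
stream = [0 if i % 10 == 9 else 1 for i in range(200)] + \
         [1 if i % 10 == 0 else 0 for i in range(200)]
adwin = SimpleAdwin()
alarms = [t for t, x in enumerate(stream) if adwin.add(x)]
first_alarm = alarms[0] if alarms else None
print("first alarm at item", first_alarm, "estimate", round(adwin.estimate(), 3))
```

No window size is supplied: the window grows while the stream is stable and shrinks once the recent items disagree with the older ones.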
Outline
The Advent of Process Mining (PM)
Key ingredients
Online strategy for CD in PM
Strategy for change detection
Experiments
Work in progress
Online Strategy for CD in PM
Sequential Sampling
Learning Stage
Points2CP
Convex Polyhedron Q
Estimation Stage
[Figure: each new point is tested for inclusion in Q ("Yes" branch shown)]
Estimate: mass(Q)
Monitoring Stage
[Figure: a detected change ("Yes" branch) raises DRIFT!]
Algorithm
Input: P1, P2, ... sequence of log points
Learning
  ...
  4. W = InitADWIN
  5. i = m + 1
Estimating
  6. repeat
       update(Pi,Q,W):
  7.     if “Pi included in Q” then W = W ∪ {1}
  8.     else W = W ∪ {0}
  9.   i = i + 1
  10. until “Convergence criteria on W estimation”
Monitoring
  ...
  12. update(Pi,Q,W)
  13. i = i + 1
  14. if “Drift detected on W” then “Emit Drift” and jump to line 2
  15. endwhile
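The three stages can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' implementation: Q is an octagon rather than a general convex polyhedron, the "convergence criteria" of line 10 is replaced by a fixed number n_est of estimation points, SimpleAdwin is a simplified O(N)-per-item stand-in for ADWIN, and all function names, default parameters and the example log are hypothetical.

```python
import math
from collections import deque
from itertools import combinations_with_replacement

SIGNS = ((1, 1), (1, -1), (-1, 1), (-1, -1))


def learn_octagon(points):
    """Learning: tightest bounds on s_i*x_i + s_j*x_j over the given points."""
    dim = len(points[0])
    return {(i, si, j, sj): max(si * p[i] + sj * p[j] for p in points)
            for i, j in combinations_with_replacement(range(dim), 2)
            for si, sj in SIGNS}


def inside(octagon, p):
    return all(si * p[i] + sj * p[j] <= c
               for (i, si, j, sj), c in octagon.items())


class SimpleAdwin:
    """Simplified adaptive window (stand-in for ADWIN, assumed threshold)."""

    def __init__(self, delta):
        self.delta, self.window = delta, deque()

    def _split_found(self):
        n, total, head = len(self.window), sum(self.window), 0.0
        if n < 2:
            return False
        log_term = math.log(4.0 * n / self.delta)
        for n0, x in enumerate(self.window, start=1):
            n1 = n - n0
            if n1 == 0:
                break
            head += x
            eps_cut = math.sqrt(log_term * n / (2.0 * n0 * n1))
            if abs(head / n0 - (total - head) / n1) > eps_cut:
                return True
        return False

    def add(self, bit):
        self.window.append(bit)
        changed = False
        while self._split_found():
            self.window.popleft()
            changed = True
        return changed


def detect_drifts(points, m=40, n_est=100, delta=0.01):
    """Return the indices of the points at which a drift is emitted."""
    drifts, i = [], 0
    while i + m <= len(points):
        q = learn_octagon(points[i:i + m])            # Learning on m points
        w = SimpleAdwin(delta)
        i += m
        for p in points[i:i + n_est]:                 # Estimating (fixed length)
            w.add(1 if inside(q, p) else 0)
        i = min(i + n_est, len(points))
        drifted = False
        while i < len(points) and not drifted:        # Monitoring
            drifted = w.add(1 if inside(q, points[i]) else 0)
            i += 1
        if not drifted:
            break
        drifts.append(i - 1)                          # emit drift, then relearn
    return drifts


def parikh_prefixes(trace, alphabet):
    counts = [0] * len(alphabet)
    out = [tuple(counts)]
    for event in trace:
        counts[alphabet.index(event)] += 1
        out.append(tuple(counts))
    return out


# Hypothetical log: 150 traces a,b,c, then a Flip of b and c for 100 traces.
log = ["abc"] * 150 + ["acb"] * 100
points = [p for trace in log for p in parikh_prefixes(trace, "abc")]
drifts = detect_drifts(points)
print("drift emitted at points:", drifts)
```

In this example the drift starts at point 600. After the flip, the prefix a,c maps to (1,0,1), which violates the learned constraint c ≤ b, so the fraction of points inside Q drops and the window detects the change some points later.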
Experiments: setting
Various models have been used to generate logs
L = {L1,L2}, with L2 being the drifting part
Drifts were created by perturbing the models:
Flip: the ordering between two events is reversed
Rem: one event is removed
Conc: two ordered events become concurrent
Conf: two ordered or concurrent events become conflicting
Experiments
bench events |L1| FLIP REM CONC CONF
Problem #2: Change Localization
[Figure: example over the variables b and c, then the general case]
[Carmona-Cortadella 10]
Producer-Consumer example
Event Log (EL):
1: a,c,e,b,d,x,e,a,c,...
2: a,c,e,a,x,c,y,...
3: a,x,c,y,e,b,...
...
Points in R^8, with coordinates (a,b,c,d,e,x,y,z):
(1,0,0,0,0,0,0,0)
(1,0,1,0,0,0,0,0)
(1,0,0,0,0,1,0,0)
(1,0,1,0,1,0,0,0)
(2,0,1,0,1,0,0,0)
...
Producer-Consumer example
Inequalities:
c≤a
e≤c+d
y≤x
x≤z+1
a+b≤e+1
Problem #2: Change Localization
One ADWIN per inequality, each going through the Learning, Estimation and Monitoring stages:
c≤a        ADWIN 1
e≤c+d      ADWIN 2
y≤x        ADWIN 3
a+b≤e+1    ADWIN 4
d≤b        ADWIN 5
y≤c+d      ADWIN 6
z≤y        ADWIN 7
x≤z+1      ADWIN 8
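A sketch of the localization idea: one change detector per inequality, each fed 1 when the incoming point satisfies its inequality and 0 otherwise, so that an alarm names the inequality that stopped holding. Assumptions of this sketch: only the three inequalities over x, y, z are used, the inequalities are given rather than learned, SimpleAdwin is a simplified stand-in for ADWIN, and the log (traces x,y,z followed by traces y,x,z) is hypothetical.

```python
import math
from collections import deque


class SimpleAdwin:
    """Simplified adaptive window (stand-in for ADWIN, assumed threshold)."""

    def __init__(self, delta=0.01):
        self.delta, self.window = delta, deque()

    def _split_found(self):
        n, total, head = len(self.window), sum(self.window), 0.0
        if n < 2:
            return False
        log_term = math.log(4.0 * n / self.delta)
        for n0, x in enumerate(self.window, start=1):
            n1 = n - n0
            if n1 == 0:
                break
            head += x
            eps_cut = math.sqrt(log_term * n / (2.0 * n0 * n1))
            if abs(head / n0 - (total - head) / n1) > eps_cut:
                return True
        return False

    def add(self, bit):
        self.window.append(bit)
        changed = False
        while self._split_found():
            self.window.popleft()
            changed = True
        return changed


# Subset of the inequalities listed above, as predicates over event counts.
CONSTRAINTS = {
    "y<=x": lambda v: v["y"] <= v["x"],
    "z<=y": lambda v: v["z"] <= v["y"],
    "x<=z+1": lambda v: v["x"] <= v["z"] + 1,
}


def localize(points):
    """Map each inequality that drifted to the index of its first alarm."""
    detectors = {name: SimpleAdwin() for name in CONSTRAINTS}
    alarms = {}
    for t, v in enumerate(points):
        for name, holds in CONSTRAINTS.items():
            changed = detectors[name].add(1 if holds(v) else 0)
            if changed and name not in alarms:
                alarms[name] = t
    return alarms


def prefix_counts(trace):
    """Event counts of every prefix of the trace, as dictionaries."""
    counts = dict.fromkeys("xyz", 0)
    out = [dict(counts)]
    for event in trace:
        counts[event] += 1
        out.append(dict(counts))
    return out


# Hypothetical log: after 100 traces, y starts to occur before x.
log = ["xyz"] * 100 + ["yxz"] * 100
points = [v for trace in log for v in prefix_counts(trace)]
alarms = localize(points)
print(alarms)
```

Only the detector attached to y≤x raises an alarm in this example, which points to the part of the model affected by the drift.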
Problem #3: Unravel process evolution
c≤a
e≤c+d DRIFT!
y≤x
a+b≤e+1
.....
Problem #3: Unravel process evolution
c≤a
e≤c+d
y≤x
new model
a+b≤e+1
y≤z
x+b≤y+1
.....
Conclusions & Future Work
First online algorithm for CD in PM
Several uses: segmenting the log for later process discovery, drift detection, …
Able to find the majority of drifts in practice
Ideas to tackle gradual drift
Promising results: fast detection of concept drifts, even with simple numerical abstract domains (octagons)
Thanks!
Backup slides
The Advent of Process Mining
Disciplines involved:
Formal Methods and Models
Algorithmics
AI (e.g., Data Mining/Machine Learning)
Information Systems
Software Engineering
Databases
Business
...
Online Strategy for CD in PM
Change Detection:
Visual description of the algorithm (1-2 slides)
Example (1-2 slides, with animation)
Formal Description of the Algorithm (1 slide)
Theorem enumeration on guarantees. (1 slide)
Experiments (3-4 slides)
More elaborated strategies (1 slide)
Tackling the two other problems:
Change localization (1-2 slides)
Unraveling process evolution (1-2 slides)
Outline
The Advent of Process Mining (PM)
The challenge of Concept Drift (CD)
Key ingredients:
Process Discovery via Numerical Abstract Domains
Concept Drift estimation and change detection
Online strategy for CD in PM
Strategy for change detection
Experiments
Work in progress
More elaborated strategies
Tackling other problems
Process Discovery via Numerical Abstract Domains
From log traces to points in R^n
From points in R^n to convex polyhedra (Parikh2CP, used in this work)
From convex polyhedra to inequalities
From inequalities to Petri nets
From points to convex polyhedra
Q = Convex Hull of the set of points
[Figure: the convex hull Q drawn in R^3 with axes a, b, c]
mass(Q) = Probability of points in the log inside Q