Online Techniques For Dealing With Concept Drift in Process Mining

The document describes techniques for dealing with concept drift in process mining. It outlines the advent of process mining and the challenge of concept drift. It then discusses key ingredients like numerical abstract domains and concept drift estimation. Finally, it presents an online strategy for detecting concept drift in process mining along with experiments on applying the strategy.

Uploaded by

Kevin Mondragon

J. Carmona
R. Gavaldà
UPC (Barcelona, Spain)

ONLINE TECHNIQUES FOR DEALING WITH CONCEPT DRIFT IN PROCESS MINING

1
Outline
• The Advent of Process Mining (PM)
  • The challenge of Concept Drift (CD)
• Key ingredients
• Online strategy for CD in PM
• Experiments
• Work in progress

2
The Advent of Process Mining
• Process mining:
  • BIG DATA in Information Systems
• Focus: formal analysis of the processes
• Software Engineering challenges:
  • Process model alignment with reality
  • Automation!
  • Formal methods

3
[source: www.processmining.org] 4
Example: control flow discovery

[Figure: Information System, Event Log, Petri Net (PN)]

Event Log

Case  Event         Timestamp
1     reservation   21-02-2009 12:20h
1     arrival       22-02-2009 21:05h
2     reservation   23-02-2009 14:00h
1     payment       23-02-2009 14:50h
2     cancellation  23-02-2009 16:00h
5
Control Flow Discovery

Event Log (EL)
1: r,s,sb,p,ac,ap,c
2: r,sb,em,p,ac,ap,c
3: r,sb,p,em,ac,rj,rs,c
...

[Figure: Petri Net (PN) over the activities r, s, sb, em, p, ac, ap, rj, rs, c]

6
The Challenge of Concept Drift

Event log, ordered by time:
1: r,s,sb,p,ac,ap,c
2: r,sb,em,p,ac,ap,c
3: r,sb,p,em,ac,rj,rs,c
4: r,em,sb,p,ac,ap,c
5: r,sb,s,p,ac,rj,rs,c
6: r,sb,p,s,ac,ap,c
7: r,sb,p,em,ac,ap,c
[Figure: Petri net MODEL for time ≤ t, over r, s, sb, em, p, ac, ap, rj, rs, c]

Drift!

8: r,em,s,sb,p,ac,ap,c
9: r,sb,em,s,p,ac,ap,c
10: r,sb,em,s,p,ac,rj,rs,c
11: r,em,sb,p,s,ac,ap,c
12: r,em,sb,s,p,ac,rj,rs,c
13: r,em,sb,p,s,ac,ap,c
14: r,sb,p,em,s,ac,ap,c
...
[Figure: Petri net MODEL for time ≥ t + 1, over the same activities]

7
The Challenge of Concept Drift [Bose-Aalst 11]

• Problem #1: Change Detection!
  • “There is a drift in the previous log between traces 7 and 8”

• Problem #2: Change Localization and Characterization
  • “The activities involved in the drift are em and s, for which the causality has changed”

• Problem #3: Unravel Process Evolution
  • “In the new process, everything is the same but em and s, with em now preceding s”

DISCLAIMER: We focus on ABRUPT changes.

8


Outline
• The Advent of Process Mining (PM)
• Key ingredients:
  • Numerical Abstract Domains
  • Concept Drift estimation and change detection
• Online strategy for CD in PM
• Experiments
• Work in progress

9
From log traces to points in R^n
σ = a,a,b,c,b
Pref(σ):
λ = (0,0,0)
a = (1,0,0)
a,a = (2,0,0)
a,a,b = (2,1,0)
a,a,b,c = (2,1,1)
a,a,b,c,b = (2,2,1)
[Figure: the points plotted on axes a, b, c]
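
The mapping from a trace to points can be sketched in a few lines of Python. This is only an illustration of the prefix counting shown on the slide; the function name `prefix_parikh_vectors` and the explicit `alphabet` argument are assumptions, not part of the original work.

```python
from typing import List, Sequence, Tuple


def prefix_parikh_vectors(trace: Sequence[str],
                          alphabet: Sequence[str]) -> List[Tuple[int, ...]]:
    """Return the Parikh vector of every prefix of `trace`.

    The first vector belongs to the empty prefix (λ); component k counts
    how often alphabet[k] has occurred so far.
    """
    index = {activity: k for k, activity in enumerate(alphabet)}
    counts = [0] * len(alphabet)
    points = [tuple(counts)]
    for activity in trace:
        counts[index[activity]] += 1
        points.append(tuple(counts))
    return points


# The trace from the slide: σ = a,a,b,c,b over the alphabet (a, b, c).
points = prefix_parikh_vectors(["a", "a", "b", "c", "b"], ["a", "b", "c"])
for p in points:
    print(p)
```

Each trace of length k therefore contributes k + 1 points, and the whole log becomes a set of points in R^n with n the number of activities.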
10
From points to convex polyhedra (Points2CP)

Q = Convex Hull of the set of points
mass(Q) = Probability of points in the log inside Q
[Figure: polyhedron Q on axes a, b, c]
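
A rough sketch of the idea, under a stated simplification: instead of the exact convex hull it builds an octagon-style over-approximation (bounds on ±x_i ± x_j), which needs no geometry library. The names `points_to_octagon`, `inside` and `mass` are hypothetical, and `mass` is estimated here as the fraction of the given points that fall inside Q.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]
# Key (i, si, j, sj) stores an upper bound on si*x[i] + sj*x[j].
Octagon = Dict[Tuple[int, int, int, int], float]


def points_to_octagon(points: List[Point]) -> Octagon:
    """Tightest octagon-style constraints satisfied by all training points."""
    dim = len(points[0])
    bounds: Octagon = {}
    for i in range(dim):
        for si in (1, -1):
            # Unary bound: sj = 0 makes the second term vanish.
            bounds[(i, si, i, 0)] = max(si * p[i] for p in points)
    for i, j in combinations(range(dim), 2):
        for si in (1, -1):
            for sj in (1, -1):
                bounds[(i, si, j, sj)] = max(si * p[i] + sj * p[j] for p in points)
    return bounds


def inside(q: Octagon, p: Point) -> bool:
    """True if p satisfies every constraint of q."""
    return all(si * p[i] + sj * p[j] <= b for (i, si, j, sj), b in q.items())


def mass(q: Octagon, points: List[Point]) -> float:
    """Fraction of the given points that lie inside q."""
    return sum(inside(q, p) for p in points) / len(points)


# Training points: Parikh vectors of the prefixes of a,a,b,c,b over (a, b, c).
training = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 1, 1), (2, 2, 1)]
q = points_to_octagon(training)

# (0,1,0) has b before any a and (0,0,1) has c before any a; both fall outside.
fresh = [(1, 1, 0), (0, 1, 0), (2, 2, 1), (0, 0, 1)]
print(mass(q, fresh))
```

Because the octagon contains the convex hull, any point it rejects is also outside the hull; the converse does not hold, so this sketch can only under-report violations.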
11
Outline
• The Advent of Process Mining (PM)
• Key ingredients:
  • Numerical Abstract Domains
  • Concept Drift estimation and change detection
• Online strategy for CD in PM
• Experiments
• Work in progress

12
Setting
• stream x1, x2, …, xt, …
• xt drawn from distribution Dt, independently
• we model change by changes in the Dt’s

Two basic problems
• Detect change (in the Dt)
• Estimate some statistic (on the Dt)
  • E.g., if xt is a real number, estimate E[xt]

Only possible if the Dt do not vary too wildly

13
Windows & change detection

Sliding window: keep consistent, no explicit change detection

Reference window + Sliding window

Min-error window + growing windows

14
Windows & change detection
Problem: What size windows?
• Large windows: Slow reaction to fast changes
• Small windows: Inaccurate estimates, noise sensitive, can’t detect small changes

• Optimal size depends on unknown rate of change
  • User needs to guess
  • Or else: detect rate from the stream?
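
The reaction-speed side of this trade-off can be seen on a toy stream. The sketch below is an illustration only; the function `sliding_mean`, the window sizes 10 and 50, and the noise-free 0/1 stream are assumptions chosen to keep the arithmetic easy to follow.

```python
from collections import deque
from typing import List, Sequence


def sliding_mean(stream: Sequence[float], size: int) -> List[float]:
    """Mean of the last `size` items after each arrival (fixed sliding window)."""
    window = deque(maxlen=size)
    estimates = []
    for x in stream:
        window.append(x)
        estimates.append(sum(window) / len(window))
    return estimates


# Abrupt change: the mean jumps from 0 to 1 at position 100.
stream = [0] * 100 + [1] * 100
small = sliding_mean(stream, 10)
large = sliding_mean(stream, 50)

# Ten items after the change the small window has fully adapted,
# while the large window still holds 40 old items.
print(small[109], large[109])
```

On a noisy stream the ranking flips for accuracy: the 10-item estimate would fluctuate far more than the 50-item one, which is why neither fixed size is satisfactory.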

15
ADWIN: Adaptive Window
• Time-scale independent, data-adaptive
• User does not need to guess window size
• Behaves as if “best fixed-window size” known
• Keeps largest window consistent with statistical hypothesis “no change”
• Keeps a window of size N using O(log N) memory
• O(1) amortized time per item, O(log N) worst case
• C++/Java implementation by A. Bifet available

[Bifet-G 07]
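
To make the “largest window consistent with no change” idea concrete, here is a deliberately simplified sketch. It is not the implementation referred to above: it stores the whole window and tests every split point, so it has none of the O(log N) memory or O(1) amortized time guarantees. The class name `SimpleAdwin` is hypothetical, and the cut threshold is the Hoeffding-style bound commonly quoted for ADWIN; treat its constants as an assumption.

```python
import math
from collections import deque


class SimpleAdwin:
    """Naive adaptive window over values in [0, 1] (illustration only)."""

    def __init__(self, delta: float = 0.01):
        self.delta = delta          # confidence parameter of the cut test
        self.window = deque()

    def mean(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 0.0

    def _has_cut(self) -> bool:
        """True if some split old|recent has significantly different means."""
        n = len(self.window)
        total = sum(self.window)
        head = 0.0
        for n0, x in enumerate(list(self.window)[:-1], start=1):
            head += x
            n1 = n - n0
            m = 1.0 / (1.0 / n0 + 1.0 / n1)   # harmonic-style sample size
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(head / n0 - (total - head) / n1) > eps:
                return True
        return False

    def add(self, x: float) -> bool:
        """Insert x; drop oldest items while a cut exists. Returns True on change."""
        self.window.append(x)
        changed = False
        while len(self.window) >= 2 and self._has_cut():
            self.window.popleft()
            changed = True
        return changed


# 200 items with mean 1 followed by 200 items with mean 0.
stream = [1] * 200 + [0] * 200
detector = SimpleAdwin(delta=0.01)
alarms = [t for t, x in enumerate(stream) if detector.add(x)]
print(alarms[0], round(detector.mean(), 3))
```

The window grows while the data looks stationary and shrinks from the old end once a split shows a significant difference, so no window size has to be chosen in advance.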

16
Outline
• The Advent of Process Mining (PM)
• Key ingredients
• Online strategy for CD in PM
  • Strategy for change detection
• Experiments
• Work in progress

17
Online Strategy for CD in PM

LOG P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 ...

Sequential
Sampling

Learning Estimation Monitoring


ONLINE CONCEPT DRIFT DETECTION

18
Learning Stage

LOG P1 ... PN Log Parikh vectors

Points2CP

Convex Polyhedron Q

19
Estimation Stage

LOG → Log Parikh vectors P(N+1) ... P(N+K)
For each P(N+1) ...: inside Q? Yes → 1, No → 0
The resulting 0/1 stream feeds ADWIN
Estimate: mass(Q)

20
Monitoring Stage

LOG → Log Parikh vectors P(N+K+1) ...
For each P(N+K+1) ...: inside Q? (Yes / No) → ADWIN
ADWIN detects a change → DRIFT!

21
Algorithm
Input: P1, P2, ... sequence of log points

Learning
1. Select appropriate training size n
2. S = “Collect a random sample of m points out of the first n”
3. Q = Points2CP(S)
4. W = InitADWIN
5. i = m + 1

Estimating
6. repeat update(Pi,Q,W):
7.    if “Pi included in Q” then W = W ∪ {1}
8.    else W = W ∪ {0}
9.    i = i + 1
10. until “Convergence criteria on W estimation”

Monitoring
11. while true do
12.    update(Pi,Q,W)
13.    i = i + 1
14.    if “Drift detected on W” then “Emit Drift” and jump to line 2
15. endwhile
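
The three stages can be strung together as in the following sketch. Several stand-ins are used and should be read as assumptions: a bounding box replaces Points2CP, a fixed number `k` of points replaces the convergence criterion, and a fixed-size recent window compared against the estimated mass with a Hoeffding-style threshold replaces ADWIN. The sketch also starts estimation after the `n` training points, and all names and default parameters are hypothetical.

```python
import math
import random
from collections import deque


def points_to_box(sample):
    """Stand-in for Points2CP: the bounding box of the sample."""
    dim = len(sample[0])
    lo = [min(p[d] for p in sample) for d in range(dim)]
    hi = [max(p[d] for p in sample) for d in range(dim)]
    return lo, hi


def inside(box, p):
    lo, hi = box
    return all(l <= x <= h for l, x, h in zip(lo, p, hi))


def detect_drifts(points, n=200, m=100, k=200, recent=50, delta=0.01, seed=0):
    """Return the indices at which a drift is emitted."""
    rng = random.Random(seed)
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * recent))
    drifts = []
    start = 0
    while start + n + k <= len(points):
        # Learning: sample m of the first n points and abstract them into Q.
        q = points_to_box(rng.sample(points[start:start + n], m))
        # Estimating: fraction of the next k points that fall inside Q.
        i = start + n
        estimate = sum(inside(q, p) for p in points[i:i + k]) / k
        i += k
        # Monitoring: compare a recent window of 0/1 answers to the estimate.
        window = deque(maxlen=recent)
        found = None
        while i < len(points):
            window.append(1 if inside(q, points[i]) else 0)
            if len(window) == recent and abs(sum(window) / recent - estimate) > eps:
                found = i
                break
            i += 1
        if found is None:
            break
        drifts.append(found)
        start = found + 1          # relearn on the data after the drift
    return drifts


# Synthetic points: the first coordinate shifts by 3 at position 1000.
before = [(t % 3, t % 2) for t in range(1000)]
after = [(t % 3 + 3, t % 2) for t in range(1000)]
drifts = detect_drifts(before + after)
print(drifts)
```

After a drift is emitted the sketch relearns Q from the points that follow it, mirroring the jump back to the learning stage in the algorithm.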
22
Experiments: setting
• Various models have been used to generate logs
• L = {L1, L2}, with L2 being the drifting part
• Drifts have been created by perturbing the models:
  • Flip: ordering between events is reversed
  • Rem: one event is removed
  • Conc: two ordered events become concurrent
  • Conf: two ordered/concurrent events become conflicting
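
As a trace-level illustration of two of these perturbations, the sketch below applies Flip and Rem directly to traces. This is a simplification: in the experiments the models are perturbed and the logs are then generated from them, so the functions `flip` and `rem` and the example traces are assumptions made for illustration.

```python
from typing import List

Trace = List[str]


def flip(trace: Trace, x: str, y: str) -> Trace:
    """Flip: exchange the positions of events x and y in the trace."""
    swap = {x: y, y: x}
    return [swap.get(event, event) for event in trace]


def rem(trace: Trace, x: str) -> Trace:
    """Rem: drop every occurrence of event x from the trace."""
    return [event for event in trace if event != x]


# L1 is the original behaviour, L2 the drifting part; the log is L1 followed by L2.
l1 = [["r", "sb", "em", "s", "p", "ac", "ap", "c"] for _ in range(3)]
l2 = [flip(trace, "em", "s") for trace in l1]
log = l1 + l2

print(log[0])
print(log[-1])
print(rem(l1[0], "em"))
```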

23
Experiments

bench        events  |L1|  FLIP  REM  CONC  CONF
ShRes(6)     24      4000  115   54   183   37
ShRes(8)     32      4000  165   73   381   83
PC(8)        41      4000  337   550  262   266
PC(9)        46      4000  256   136  323   489
WMG(9)       9       4000  101   16   75    16
WMG(10)      10      4000  147   28   53    18
Cycles(4,2)  14      4000  563   23   664   22
Cycles(5,2)  20      4000  554   22   845   21
A12F0N00     12      620   83    76   117   15
A22F0N00     22      2132  340   56   99    198
A32F0N00     32      2483  67    79   258   162
A42F0N00     42      3308  178   41   185   37
T32F0N00     33      3766  143   28   394   36
24
Outline
• The Advent of Process Mining (PM)
• Key ingredients
• Online strategy for CD in PM
• Experiments
• Work in progress
  • Tackling other problems

25
Problem #2: Change Localization

[Figure: axes b and c]
In general:

[Carmona-Cortadella 10] 26
Problem #2: Change Localization

b a

27
Producer-Consumer example

EL:
1: a,c,e,b,d,x,e,a,c,...
2: a,c,e,a,x,c,y,...
3: a,x,c,y,e,b,...
...

Points in R^8:
(a,b,c,d,e,x,y,z)
(1,0,0,0,0,0,0,0)
(1,0,1,0,0,0,0,0)
(1,0,0,0,0,1,0,0)
(1,0,1,0,1,0,0,0)
(2,0,1,0,1,0,0,0)
...

28
Producer-Consumer example

Inequalities:
c≤a
e≤c+d
y≤x
x≤z+1
a+b≤e+1
d≤b
y≤c+d
z≤y

29
Problem #2: Change Localization

One ADWIN per inequality, each going through Learning, Estimation and Monitoring:

c≤a → ADWIN 1
e≤c+d → ADWIN 2
y≤x → ADWIN 3
a+b≤e+1 → ADWIN 4
d≤b → ADWIN 5
y≤c+d → ADWIN 6
z≤y → ADWIN 7
x≤z+1 → ADWIN 8
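
The per-inequality idea can be illustrated as follows. Each inequality produces its own 0/1 stream; to stay short, the sketch compares satisfaction rates before and after with a fixed threshold rather than running one ADWIN per stream. The helper names `vec`, `satisfaction_rates` and `localize`, the threshold 0.2 and the example points are assumptions.

```python
from typing import Callable, Dict, List

Vector = Dict[str, int]
ACTIVITIES = "abcdexyz"

# The eight inequalities of the producer-consumer example.
CONSTRAINTS: Dict[str, Callable[[Vector], bool]] = {
    "c<=a": lambda v: v["c"] <= v["a"],
    "e<=c+d": lambda v: v["e"] <= v["c"] + v["d"],
    "y<=x": lambda v: v["y"] <= v["x"],
    "a+b<=e+1": lambda v: v["a"] + v["b"] <= v["e"] + 1,
    "d<=b": lambda v: v["d"] <= v["b"],
    "y<=c+d": lambda v: v["y"] <= v["c"] + v["d"],
    "z<=y": lambda v: v["z"] <= v["y"],
    "x<=z+1": lambda v: v["x"] <= v["z"] + 1,
}


def vec(**counts: int) -> Vector:
    """Parikh vector with unspecified activities set to 0."""
    return {name: counts.get(name, 0) for name in ACTIVITIES}


def satisfaction_rates(points: List[Vector]) -> Dict[str, float]:
    """Fraction of points satisfying each inequality (one stream per inequality)."""
    return {name: sum(check(p) for p in points) / len(points)
            for name, check in CONSTRAINTS.items()}


def localize(before: List[Vector], after: List[Vector],
             threshold: float = 0.2) -> List[str]:
    """Inequalities whose satisfaction rate dropped by more than `threshold`."""
    old, new = satisfaction_rates(before), satisfaction_rates(after)
    return [name for name in CONSTRAINTS if old[name] - new[name] > threshold]


before = [vec(a=1), vec(a=1, c=1), vec(a=1, x=1), vec(a=1, c=1, x=1, y=1)]
# After the change, y sometimes occurs before x.
after = [vec(a=1, c=1, y=1), vec(a=1, c=1, x=1, y=1), vec(a=1), vec(a=1, c=1, y=1)]
print(localize(before, after))
```

The inequalities that are reported point at the activities involved in the change, here x and y.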

30
Problem #3: Unravel process evolution

Learning Estimation Monitoring

c≤a

e≤c+d DRIFT!
y≤x

a+b≤e+1

.....

31
Problem #3: Unravel process evolution

Learning Estimation Monitoring

c≤a

e≤c+d

y≤x
new model
a+b≤e+1

y≤z

x+b≤y+1

.....
32
Conclusions & Future Work
• First online algorithm for CD in PM
• Several uses: segmenting the log for later process discovery, drift detection, …
• Able to find the majority of drifts in practice
• Ideas to tackle gradual drift
• Promising results: fast detection of concept drifts, even with simple abstract numerical domains (octagons)

33
Thanks!

34
Backup slides

35
The Advent of Process Mining
• Disciplines involved:
  • Formal Methods and Models
  • Algorithmics
  • AI (e.g., Data Mining/Machine Learning)
  • Information Systems
  • Software Engineering
  • Databases
  • Business
  • ...

36
Online Strategy for CD in PM
• Change Detection:
  • Visual description of the algorithm (1-2 slides)
  • Example (1-2 slides, with animation)
  • Formal Description of the Algorithm (1 slide)
  • Theorem enumeration on guarantees (1 slide)
  • Experiments (3-4 slides)
  • More elaborate strategies (1 slide)
• Tackling the two other problems:
  • Change localization (1-2 slides)
  • Unraveling process evolution (1-2 slides)

37
Outline
• The Advent of Process Mining (PM)
  • The challenge of Concept Drift (CD)
• Key ingredients:
  • Process Discovery via Numerical Abstract Domains
  • Concept Drift estimation and change detection
• Online strategy for CD in PM
  • Strategy for change detection
  • Experiments
• Work in progress
  • More elaborate strategies
  • Tackling other problems

38
Process Discovery via Numerical Abstract Domains
• From log traces to points in R^n
• From points in R^n to convex polyhedra (Parikh2CP, used in this work)
• From convex polyhedra to inequalities
• From inequalities to Petri nets

[Carmona & Cortadella, ECML/PKDD’2010]

39