Tutorial5
April 2017
1 Before you start
The focus of this tutorial is to show you how to easily simulate abrupt, gradual, incremental and mixed drifts on synthetic data streams using the MOA framework.
Disclaimer:
• All examples are presented using both the MOA GUI and the command line. To run the examples in the GUI, copy and paste the command line into the MOA GUI (see the instructions in the “Running the examples” section).
• This tutorial does not cover the RecurrentConceptDriftStream class, which makes it easy to simulate recurrent concept drifts.
• We do not cover the evaluation method in detail. We simply use the default EvaluatePrequential configuration. Some comments are made about how it affects the results, but there is no thorough discussion of this topic. For those interested in state-of-the-art evaluation procedures for data streams, we suggest [2].
• To generate plots like those in this tutorial, use the script available here: https://github.com/hmgomes/data-stream-visualization
Synthetic data stream generators are often used to evaluate machine learning algorithms in the data stream setting. One motivation for using synthetic data streams is that one can easily simulate different types of drifts on them and use them to evaluate how a specific algorithm performs under each type of drift. In this tutorial we are going to use 3 different generators: SEA [3], AGRAWAL [3] and HYPERPLANE [4].
The remainder of this tutorial is divided into 6 sections. Section 2 briefly introduces how to run the examples in the MOA GUI and on the command line. Section 3 shows how to simulate abrupt and gradual drifts. Section 4 shows how to simulate multiple and mixed drifts. Section 5 shows how to simulate incremental drifts. Section 6 shows how to save the synthetic data streams generated in the previous sections to ARFF (Weka format) files, which can then be used outside MOA. Finally, Section 7 presents exercises to practice your skills.
2 Running the examples
To run the examples you can use either the MOA GUI or the command line. To run them in the GUI, the simplest approach is to copy and paste the command lines presented in each example. The whole process is as follows:
Right-click the text input next to the Configure button, as in Figure 1.
Paste the command line in the text dialog as in Figure 3.
To run the commands on the command line, change directory (cd or chdir, depending on your operating system) to the location of the MOA jar and execute a command like the one shown below (replace the text within double quotes with the command given in the examples):
java -cp moa.jar moa.DoTask "EvaluatePrequential -l bayes.NaiveBayes -s (ConceptDriftStream -s (generators.SEAGenerator -f 3) -d (generators.SEAGenerator -f 2) -p 50000 -w 20000) -i 100000 -f 1000"
2) Select “Edit” on the Stream option. Make sure that EvaluatePrequential or another evaluation method is selected in the configuration drop-down at the top. Note that the total number of instances to be generated is set in the Task window (in this case EvaluatePrequential) using the instanceLimit parameter.
3) Change the data stream (the drop-down at the top in Figure 6) to moa.streams.ConceptDriftStream, as shown in Figure 7.
• driftstream: The next concept will be generated using this stream configuration.
• position: The location of the center of the drift, expressed as a number of instances.
• width: The length of the drift window, i.e., the number of instances over which the drift takes place.
Note that the parameter alpha is also important: it configures the underlying sigmoid function that models the transition from one concept to the other. However, tweaking alpha is beyond the scope of this tutorial.
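The way these parameters map onto a command line can be sketched with a small helper. This is a minimal sketch, not part of MOA: the Python function names are hypothetical, and the only MOA-specific pieces are the -s, -d, -p, -w, -l, -i and -f flags that appear in this tutorial's own commands.

```python
def concept_drift_stream(stream: str, drift_stream: str, position: int, width: int) -> str:
    """Compose a ConceptDriftStream definition with a single drift.

    stream       -> -s : generator for the current concept
    drift_stream -> -d : generator for the next concept
    position     -> -p : center of the drift, in instances
    width        -> -w : length of the drift window, in instances
    """
    return (f"ConceptDriftStream -s ({stream}) -d ({drift_stream}) "
            f"-p {position} -w {width}")


def evaluate_prequential(learner: str, stream: str,
                         instance_limit: int, sample_frequency: int) -> str:
    """Wrap a stream definition in an EvaluatePrequential task string."""
    return (f"EvaluatePrequential -l {learner} -s ({stream}) "
            f"-i {instance_limit} -f {sample_frequency}")


# Gradual drift from SEA function 3 to SEA function 2, centered at
# instance 50000 with a window of 20000 instances.
gradual = concept_drift_stream("generators.SEAGenerator -f 3",
                               "generators.SEAGenerator -f 2",
                               position=50000, width=20000)
command = evaluate_prequential("bayes.NaiveBayes", gradual, 100000, 1000)
print(command)
```

The printed string can be pasted into the MOA GUI or passed in double quotes to moa.DoTask. Setting width to 1 turns the same definition into an abrupt drift.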
3 Simulating Abrupt and Gradual Drifts
The experimental framework for inducing artificial concept drifts in MOA simulates a window of change in which the probability that a given instance belongs to the current concept or to the new concept is governed by a sigmoid function. Let us call the current concept A and the new concept B. At the start of the window an instance is more likely to be drawn from A; towards the end of the window the probability of it being drawn from B increases; and once the window is over all instances are drawn from B (the "new concept" B is now stable).
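The sigmoid transition described above can be sketched as follows. This is a minimal sketch under stated assumptions: the logistic curve is centered at position and scaled as 4 * (t - position) / width, which follows my reading of MOA's ConceptDriftStream and should be checked against the MOA source before relying on it. The function names are hypothetical.

```python
import math
import random


def probability_new_concept(t: int, position: int, width: int) -> float:
    """Probability that instance number t is drawn from the new concept B.

    Assumed form: 1 / (1 + exp(-4 * (t - position) / width)).
    Written in a numerically stable way so that very narrow windows
    (for example width = 1) do not overflow math.exp.
    """
    z = 4.0 * (t - position) / width
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def draw_concept(t: int, position: int, width: int, rng: random.Random) -> str:
    """Return "B" with the sigmoid probability, otherwise "A"."""
    return "B" if rng.random() < probability_new_concept(t, position, width) else "A"


rng = random.Random(1)
for t in (30000, 40000, 50000, 60000, 70000):
    p = probability_new_concept(t, position=50000, width=20000)
    print(t, round(p, 4), draw_concept(t, 50000, 20000, rng))
```

Under this assumed scaling the probability of B is exactly 0.5 at the drift position and roughly 0.12 and 0.88 at the two edges of the window, so the concepts still mix slightly just outside the window. With width 1 the curve is effectively a step, which is why a width of 1 behaves as an abrupt drift.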
Sometimes this might not happen because the selected classifier includes a built-in drift adaptation technique.
[Plot: Example1 - accuracy. y-axis: accuracy; x-axis: #instances (hundreds); series: NaiveBayes]
Figure 10: Example 1 results. Solid and dashed vertical red lines indicate the drift and the start/end of the drift window, respectively.
[Plot: Example2 - accuracy. y-axis: accuracy; x-axis: #instances (hundreds); series: NaiveBayes]
Figure 11: Example 2 results. The solid red line indicates the location of the abrupt drift.
4 Simulating Multiple and Mixed Drifts
The MOA drift simulation framework can easily be extended to simulate several drifts at different locations of the same data stream. In the following examples we show how it can be used to simulate streams with multiple abrupt drifts (which could also be gradual) and streams that mix abrupt and gradual drifts.
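Multiple drifts are obtained by nesting ConceptDriftStream definitions: the -d argument of one drift is itself a ConceptDriftStream. The nesting can be sketched with a small builder. This is a minimal sketch; the function name and its argument layout are hypothetical, and the position values are simply the -p values passed at each nesting level, as in this tutorial's Example 3 and Example 4 commands (25000 at every level).

```python
def multi_drift_stream(concepts, drifts):
    """Nest ConceptDriftStream definitions for a sequence of concepts.

    concepts: generator definitions, one per concept, in order of appearance.
    drifts:   (position, width) pairs, one per transition between concepts,
              so len(drifts) must be len(concepts) - 1.
    """
    if len(drifts) != len(concepts) - 1:
        raise ValueError("need exactly one drift per transition")

    def wrap(definition):
        # Parentheses are only added when the definition carries options.
        return f"({definition})" if " " in definition else definition

    # Build from the innermost (last) concept outwards.
    stream = concepts[-1]
    for concept, (position, width) in zip(reversed(concepts[:-1]), reversed(drifts)):
        stream = (f"ConceptDriftStream -s {wrap(concept)} -d {wrap(stream)} "
                  f"-p {position} -w {width}")
    return stream


agrawal_concepts = [
    "generators.AgrawalGenerator",
    "generators.AgrawalGenerator -f 2",
    "generators.AgrawalGenerator",
    "generators.AgrawalGenerator -f 4",
]

# Three abrupt drifts (width 1), as in Example 3.
example3 = multi_drift_stream(agrawal_concepts, [(25000, 1)] * 3)
# Gradual, abrupt, gradual, as in Example 4.
example4 = multi_drift_stream(agrawal_concepts,
                              [(25000, 10000), (25000, 1), (25000, 10000)])
print(example3)
print(example4)
```

Changing only the width of an individual pair switches that transition between abrupt and gradual, which is all that distinguishes the two example streams here.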
[Plot: Example3 - accuracy. y-axis: accuracy; x-axis: #instances (hundreds); series: NaiveBayes, LevBag]
Figure 12: Example 3 results. The solid red lines indicate the drift locations.
-f 2) -d (ConceptDriftStream -s generators.AgrawalGenerator -d (generators.AgrawalGenerator
-f 4) -p 25000 -w 10000) -p 25000 -w 1) -p 25000 -w 10000) -i 100000 -f 1000
The command for Example 4 using Leveraging Bagging is shown below:
EvaluatePrequential -l meta.LeveragingBag -s (ConceptDriftStream -s generators.AgrawalGenerator -d (ConceptDriftStream -s (generators.AgrawalGenerator -f 2) -d (ConceptDriftStream -s generators.AgrawalGenerator -d (generators.AgrawalGenerator -f 4) -p 25000 -w 10000) -p 25000 -w 1) -p 25000 -w 10000) -i 100000 -f 1000
Figure 13 presents the results for Leveraging Bagging and Naive Bayes using an AGRAWAL data stream with 2 gradual drifts and 1 abrupt drift.
[Plot: Example4 - accuracy. y-axis: accuracy; x-axis: #instances (hundreds); series: NaiveBayes, LevBag]
Figure 13: Example 4 results. Solid and dashed vertical red lines indicate drifts and the start/end of drift windows, respectively. The drift in the middle is an abrupt drift; the dashed lines to its left and right are, respectively, the end of the first gradual drift and the start of the last gradual drift in this experiment.
[Plot: Example5 - accuracy. y-axis: accuracy; x-axis: #instances (hundreds); series: NaiveBayes]
[Plot: Example6 - accuracy. y-axis: accuracy; x-axis: #instances (hundreds); series: NaiveBayes]
Figure 15: Example 6 results. There is a subtle incremental drift in this stream.
uatePrequential -f parameter which indicates the sample frequency for evaluations. In fact, there
is no reason to have a sample frequency in WriteStreamToARFFFile as we are generating the files
and not evaluating the data stream.)
2 Sometimes people are confused about -m parameter as in the EvaluatePrequential the param-
Below you can find the WriteStreamToARFFFile commands to export all the data streams previously generated in this tutorial.
Example 1: WriteStreamToARFFFile -s (ConceptDriftStream -s (generators.SEAGenerator -f 3) -d (generators.SEAGenerator -f 2) -p 50000 -w 20000) -f example1.arff -m 100000
Example 2: WriteStreamToARFFFile -s (ConceptDriftStream -s (generators.SEAGenerator -f 3) -d (generators.SEAGenerator -f 2) -p 50000 -w 1) -f example2.arff -m 100000
Example 3: WriteStreamToARFFFile -s (ConceptDriftStream -s generators.AgrawalGenerator -d (ConceptDriftStream -s (generators.AgrawalGenerator -f 2) -d (ConceptDriftStream -s generators.AgrawalGenerator -d (generators.AgrawalGenerator -f 4) -p 25000 -w 1) -p 25000 -w 1) -p 25000 -w 1) -f example3.arff -m 100000
Example 4: WriteStreamToARFFFile -s (ConceptDriftStream -s generators.AgrawalGenerator -d (ConceptDriftStream -s (generators.AgrawalGenerator -f 2) -d (ConceptDriftStream -s generators.AgrawalGenerator -d (generators.AgrawalGenerator -f 4) -p 25000 -w 10000) -p 25000 -w 1) -p 25000 -w 10000) -f example4.arff -m 100000
Example 5: WriteStreamToARFFFile -s (generators.HyperplaneGenerator -k 10 -t 0.01 -s 10) -f example5.arff -m 100000
Example 6: WriteStreamToARFFFile -s (generators.HyperplaneGenerator -k 10 -t 0.001 -s 10) -f example6.arff -m 100000
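Once exported, the files can be inspected outside MOA. The following is a minimal sketch of a reader for dense ARFF files, assuming one comma-separated instance per line after @data and the class label in the last column. The function names and the tiny in-memory sample are hypothetical and are not real MOA output.

```python
import io
from collections import Counter


def read_arff(handle):
    """Return (attribute_names, rows) from a dense ARFF file-like object."""
    attributes, rows, in_data = [], [], False
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows


def class_counts(rows, window):
    """Count class labels (last column) in consecutive windows of rows."""
    return [Counter(row[-1] for row in rows[i:i + window])
            for i in range(0, len(rows), window)]


# Hypothetical four-instance sample standing in for an exported file.
SAMPLE = """@relation sample
@attribute attrib1 numeric
@attribute attrib2 numeric
@attribute class {groupA,groupB}

@data
1.0,2.0,groupA
3.0,4.0,groupA
5.0,6.0,groupB
7.0,8.0,groupB
"""

attributes, rows = read_arff(io.StringIO(SAMPLE))
counts = class_counts(rows, window=2)
print(attributes)
print(counts)
```

To read one of the exported files, pass an open file instead of the sample, for example `with open("example1.arff") as handle: attributes, rows = read_arff(handle)`. A shift in the per-window class counts is only a rough hint of drift, since a concept can change without changing the class balance.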
7 Exercises
7.1 Simulate 1 abrupt drift
Use generators.SEAGenerator to simulate 1 abrupt drift.
• The data should amount to 100 thousand instances;
If you change no parameters other than those required by the exercise, your final results should look like Figure 16.
[Plot. y-axis: accuracy; x-axis: #instances (thousands); series: HoeffdingTree]
Figure 16: Simulating 1 drift using the SEA generator. The solid vertical red line indicates the drift location.
• Generate the first concept using SEA function 1, the second concept using SEA function 2, and the final concept using SEA function 3;
• Evaluate the stream using EvaluatePrequential with a HoeffdingTree, sampling every 10,000 instances.
If you change no parameters other than those required by the exercise, your final results should look like Figure 17.
[Plot: AGRAWAL with 3 abrupt drifts - accuracy. y-axis: accuracy; x-axis: #instances (thousands); series: HoeffdingTree]
Figure 17: Simulating 2 drifts using the SEA generator. Solid and dashed vertical red lines indicate drifts and the start/end of drift windows, respectively.
If you change no parameters other than those required by the exercise, your final results should look like Figure 18. Notice the sudden drops in accuracy after every drift.
[Plot. y-axis: accuracy; x-axis: #instances (thousands); series: HoeffdingTree]
Figure 18: Simulating 3 drifts using the Agrawal generator. Solid vertical red lines indicate drifts.
References
[1] João Gama, Indre Zliobaite, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys, 46(4):44:1–44:37, March 2014.
[2] Albert Bifet, Gianmarco De Francisci Morales, Jesse Read, Geoff Holmes, and Bernhard Pfahringer. Efficient online evaluation of big data stream classifiers. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 59–68. ACM, 2015.
[3] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Database mining: A performance perspective. IEEE Trans. on Knowledge and Data Engineering, 5(6):914–925, Dec. 1993.
[4] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97–106. ACM, 2001.
[5] Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Leveraging bagging for evolving data streams. In PKDD, pages 135–150, 2010.