Stream and Pool Based Active Learning
Active Learning
Group No. 27
1. Uncertainty Sampling:
2. Query by Committee
3. Clustering
Data
1. Balanced
2. Left
3. Right
We assumed 150 data points to be labelled and 450 data points to be unlabelled.
1. Uncertainty Sampling
1. Least Confidence
For each data point we stored the probability of its most probable class. We then sorted the
points by this probability in increasing order, so the least confident predictions come first.
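The ranking described above can be sketched as follows. This is a minimal illustration, not the project's code: the function name `least_confidence_order` and the sample probabilities are assumptions made for the example.

```python
def least_confidence_order(probs):
    """Return point indices sorted by the probability of the most probable class.

    probs is a list of per-point class-probability lists (hypothetical input).
    Indices with the smallest top probability (least confident) come first.
    """
    top = [max(p) for p in probs]
    return sorted(range(len(probs)), key=lambda i: top[i])


# Illustrative probabilities for three points and three classes.
probs = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # very uncertain prediction
    [0.60, 0.30, 0.10],  # moderately uncertain prediction
]
print(least_confidence_order(probs))  # [1, 2, 0]
```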
2. Margin Sampling
For each data point we sorted the class probabilities and stored the difference between the
probabilities of the two most probable classes. We then sorted these margin scores while
keeping track of the original indices.
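A small sketch of the margin computation, under the assumption that a smaller margin means a more ambiguous point and should therefore be ranked first. The names `margin_scores` and `margin_order` are hypothetical.

```python
def margin_scores(probs):
    """Difference between the two largest class probabilities of each point."""
    scores = []
    for p in probs:
        first, second = sorted(p, reverse=True)[:2]
        scores.append(first - second)
    return scores


def margin_order(probs):
    """Indices sorted by margin in increasing order (smallest margin first)."""
    scores = margin_scores(probs)
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Illustrative probabilities for three points and three classes.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
]
print(margin_order(probs))  # [1, 2, 0]
```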
3. Entropy method
For each data point we calculated the entropy of its predicted class probabilities.
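As an illustration, the entropy score can be computed as below. The sketch assumes the natural logarithm and ranks points from highest to lowest entropy, since a higher entropy indicates a more uncertain prediction. The report does not state either choice, and the function names are hypothetical.

```python
import math


def entropy_score(p):
    """Shannon entropy of one class-probability list (natural log).

    Zero probabilities are skipped, following the convention 0 * log(0) = 0.
    """
    return -sum(q * math.log(q) for q in p if q > 0)


def entropy_order(probs):
    """Indices sorted by entropy in decreasing order (most uncertain first)."""
    scores = [entropy_score(p) for p in probs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Illustrative probabilities for three points and three classes.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
]
print(entropy_order(probs))  # [1, 2, 0]
```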
1. Stream-based approach
For the stream-based approach, we set a threshold. We then selected for labelling all the points
whose scores are below the threshold.
2. Pool-based approach
For the pool-based approach, we set a parameter k, the number of points to be labelled. Since
the list is already sorted, we selected the top k points to be labelled.
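The two selection rules can be contrasted in a short sketch. It assumes scores for which a lower value means a more uncertain point (as with least confidence and margin). For a score such as entropy, where larger means more uncertain, the comparison and sort direction would need to be reversed. The names `select_stream` and `select_pool` and the sample values are illustrative only.

```python
def select_stream(scores, threshold):
    """Stream-based: query every point whose score falls below the threshold.

    Each point is judged on its own, so the number of queries is not fixed.
    """
    return [i for i, s in enumerate(scores) if s < threshold]


def select_pool(scores, k):
    """Pool-based: rank the whole pool and query the k lowest-scoring points."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:k]


# Illustrative least-confidence scores for four unlabelled points.
scores = [0.90, 0.40, 0.60, 0.55]
print(select_stream(scores, threshold=0.6))  # [1, 3]
print(select_pool(scores, k=2))              # [1, 3]
```

The stream rule may return any number of points depending on the threshold, whereas the pool rule always returns exactly k (or fewer if the pool is smaller).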
2. Query-by-Committee
We built a committee of models (7) including:
- Logistic Regression
- Polynomial Regression (2)
- Naïve Bayes classifier
- Gaussian classifier
- Linear Discriminant Analysis classifier
- Decision Tree classifier
1. Vote-Entropy method
For each data point we calculated the probability of belonging to each class from the
committee's predictions. We then calculated the entropy score of each data point.
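A minimal sketch of a vote-entropy score, assuming that the class probability of a point is taken as the fraction of committee members voting for that class. This is the usual definition, but the report does not confirm it. The function name and the votes are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's vote distribution for a single point.

    votes holds one predicted label per committee member.
    """
    size = len(votes)
    counts = Counter(votes)
    return -sum((v / size) * math.log(v / size) for v in counts.values())


# Seven committee members voting on two illustrative points.
unanimous = [0, 0, 0, 0, 0, 0, 0]
split = [0, 0, 0, 0, 1, 1, 1]
print(vote_entropy(unanimous) < vote_entropy(split))  # True
```

A unanimous committee gives a score of zero, and the score grows as the votes spread more evenly across classes.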
2. KL-Divergence method
We calculated the KL divergence score of each data point using the formula
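The formula itself is not reproduced in this text. A common definition from the query-by-committee literature is given here as an assumption about what was used, not as a confirmed part of the project. It scores a point x by the average divergence between each committee member's prediction and the committee consensus, where C is the committee size, P_{\theta^{(c)}} is the prediction of member c, and P_C is the consensus distribution.

```latex
\[
  \mathrm{score}(x) = \frac{1}{C} \sum_{c=1}^{C}
      D\!\left( P_{\theta^{(c)}} \,\middle\|\, P_{C} \right),
  \qquad
  D\!\left( P_{\theta^{(c)}} \,\middle\|\, P_{C} \right)
      = \sum_{i} P_{\theta^{(c)}}(y_i \mid x)\,
        \log \frac{P_{\theta^{(c)}}(y_i \mid x)}{P_{C}(y_i \mid x)},
\]
\[
  P_{C}(y_i \mid x) = \frac{1}{C} \sum_{c=1}^{C} P_{\theta^{(c)}}(y_i \mid x).
\]
```

Under this definition a larger score means stronger disagreement among the committee members.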
1. Stream-based approach
For the stream-based approach, we set a threshold. We then selected for labelling all the points
whose scores are below the threshold.
2. Pool-based approach
For the pool-based approach, we set a parameter k, the number of points to be labelled. Since
the list is already sorted, we selected the top k points to be labelled.
Results:
We can see that the model-wise average accuracy increases with each iteration. This is because,
with each iteration, the model has more points correctly labelled by the oracle.
We also checked the accuracy values for uncertainty sampling with the stream-based approach.
We observe that the entropy method has the best accuracy: it reaches an accuracy of 1 in the 3rd
iteration, which we attribute to the fact that it takes all the class probabilities into account.
3. Clustering
The second active learning regime exploits cluster structure in data. A key advantage of
cluster-based heuristics is that the framework can naturally utilise the unlabelled data Xu, as
well as optimising the selection of the training data Xl.
Figure: Ideal clusters (separable and pure): (a) clustering of query points [+/−] and unlabelled
instances [○]; (b) query points and propagated labels (Xl).
Thus, data clustering must be adaptive — actively changing as more information becomes
available. Provided that there is some relationship between clustered groups of data and
diagnostic labels, at whatever resolution, cluster-based active learning can exploit these
patterns.
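The idea of propagating queried labels through clusters can be illustrated with a short sketch. It assumes fixed cluster assignments and a majority vote among the queried points of each cluster. The report does not specify the propagation rule, so both the rule and the name `propagate_labels` are assumptions.

```python
from collections import Counter


def propagate_labels(clusters, queried):
    """Spread queried labels to the unlabelled members of the same cluster.

    clusters: cluster id of each point (hypothetical input).
    queried:  dict mapping a point index to the label given by the oracle.
    Points in clusters without any queried label stay unlabelled.
    """
    votes = {}
    for i, label in queried.items():
        votes.setdefault(clusters[i], Counter())[label] += 1

    labels = {}
    for i, c in enumerate(clusters):
        if i in queried:
            labels[i] = queried[i]  # keep the oracle's label as is
        elif c in votes:
            labels[i] = votes[c].most_common(1)[0][0]  # majority label of the cluster
    return labels


# Six points in three clusters; one query in each of the first two clusters.
clusters = [0, 0, 0, 1, 1, 2]
queried = {0: "+", 3: "-"}
print(propagate_labels(clusters, queried))
# {0: '+', 1: '+', 2: '+', 3: '-', 4: '-'}
```

With ideal (separable and pure) clusters, a single query per cluster is enough to label all of its members. With impure clusters the propagated labels can be wrong, which is why the clustering has to adapt as more labels arrive.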
4: repeat
10: end if
11: for i ∈ S do
13: end for