Stream and Pool Based Active Learning

A thorough application and simulation of query strategies for stream-based and pool-based active machine learning.


ML Assignment 2

Active Learning
Group No. 27

Manasvi Agarwal – 2017A7PS0542P
Ashish Prabhune – 2017A7PS0231P
Lakshya Kwatra – 2017A7PS0365P

Date – 26th November, 2019


Active Learning:

1. Uncertainty Sampling
2. Query by Committee
3. Clustering

Data

We used the Balance dataset from the UCI ML repository.

It has three classes:

1. Balanced
2. Left
3. Right

It has 600 instances.

We assume 150 data points are labelled and the remaining 450 are unlabelled.
1. Uncertainty Sampling:

The data was divided into labelled and unlabelled sets.

We trained a Logistic Regression classifier on the labelled data.
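A minimal sketch of this setup, assuming the Balance data has already been loaded into a feature matrix X and a label vector y (the variable names, random seed and scikit-learn usage are our assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Assume X (features) and y (labels) hold the 600 Balance instances as NumPy arrays.
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))

labelled_idx, unlabelled_idx = idx[:150], idx[150:]   # 150 labelled, 450 "unlabelled"
X_lab, y_lab = X[labelled_idx], y[labelled_idx]
X_unlab = X[unlabelled_idx]                           # true labels hidden from the learner

# Classifier trained only on the labelled pool
clf = LogisticRegression(max_iter=1000)
clf.fit(X_lab, y_lab)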
Then we used three methods for selecting the points to label.

1. Least Confidence

For each data point we stored the probability of its most probable class, then sorted these probabilities in increasing order so that the least confident points come first.

2. Margin Sampling

For each data point we sorted its class probabilities and stored the difference between the probabilities of the two most probable classes. We then sorted these margin scores, keeping track of the original indices.

3. Entropy method

We calculated an entropy score from the class probabilities of each data point:

Entropy = S = -sum(pk * log(pk))
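A sketch of how the three scores can be computed from the classifier's predicted probabilities (clf, X_unlab and the variable names follow the sketch above and are our assumptions):

import numpy as np

proba = clf.predict_proba(X_unlab)        # shape (n_unlabelled, n_classes)

# 1. Least confidence: probability of the most probable class (lower = more uncertain)
least_conf = proba.max(axis=1)

# 2. Margin: gap between the two most probable classes (smaller = more uncertain)
sorted_proba = np.sort(proba, axis=1)
margin = sorted_proba[:, -1] - sorted_proba[:, -2]

# 3. Entropy: S = -sum(pk * log(pk)) (higher = more uncertain)
entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)

# Orderings with the most uncertain points first, remembering the indices
order_lc = np.argsort(least_conf)
order_margin = np.argsort(margin)
order_entropy = np.argsort(-entropy)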


Then we implemented two approaches for selecting the points to be labelled.

1. Stream-based approach

In the stream-based approach, we set a threshold and selected for labelling every point whose score falls below it.
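For example, a stream-based filter over the least-confidence scores might look like the following (the threshold value is an illustrative assumption):

THRESHOLD = 0.5   # assumed value; would be tuned per scoring method

# In the stream setting each point is examined once and queried immediately
# if its score falls below the threshold.
stream_queries = [i for i, score in zip(unlabelled_idx, least_conf) if score < THRESHOLD]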

2. Pool-based approach

In the pool-based approach, we set a parameter k, the number of points to be labelled. Since the list is already sorted by uncertainty, we selected the top k points.
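A matching sketch for the pool-based case, taking the top k points from the sorted list (the value of k is an assumption):

K = 20  # assumed labelling budget per iteration

# order_lc (from the scoring sketch) lists the unlabelled points from most to least uncertain
pool_queries = unlabelled_idx[order_lc[:K]]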

2. Query-by-Committee

We built a committee of 7 models, including:
- Logistic Regression
- Polynomial Regression (degree 2)
- Naïve Bayes classifier
- Gaussian classifier
- Linear Discriminant Analysis classifier
- Decision Tree classifier

Then, for each unlabelled point, we obtained the most probable class predicted by each model.
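A sketch of the committee, approximating the listed models with scikit-learn estimators (polynomial regression of degree 2 is emulated here by logistic regression on degree-2 polynomial features, and the Gaussian classifier by GaussianProcessClassifier; both choices are our assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

committee = [
    LogisticRegression(max_iter=1000),
    make_pipeline(PolynomialFeatures(degree=2), LogisticRegression(max_iter=1000)),
    GaussianNB(),
    GaussianProcessClassifier(),
    LinearDiscriminantAnalysis(),
    DecisionTreeClassifier(),
]

for model in committee:
    model.fit(X_lab, y_lab)

# Most probable class for each unlabelled point, one column per committee member
votes = np.stack([model.predict(X_unlab) for model in committee], axis=1)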

1. Vote-Entropy method:

For each data point, we calculated the fraction of committee votes received by each class, and then computed an entropy score over these vote fractions.

Entropy = S = -sum(pk * log(pk))
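A sketch of the vote-entropy score computed from the committee votes above (variable names are ours):

import numpy as np

classes = np.unique(y_lab)
n_members = votes.shape[1]

# Fraction of committee members voting for each class, per data point
vote_counts = np.stack([(votes == c).sum(axis=1) for c in classes], axis=1)
vote_frac = vote_counts / n_members

# Vote entropy: S = -sum(pk * log(pk)); higher means more committee disagreement
vote_entropy = -np.sum(vote_frac * np.log(np.where(vote_frac > 0, vote_frac, 1.0)), axis=1)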

2. KL-Divergence method

We calculated a KL-divergence score for each data point using the formula

S = sum(pk * log(pk / qk))
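A sketch of the KL-divergence score, taking pk as each committee member's predicted class distribution and qk as the committee's mean (consensus) distribution; this choice of qk is our assumption:

import numpy as np

# Per-member class probabilities: shape (n_members, n_unlabelled, n_classes)
probs = np.stack([model.predict_proba(X_unlab) for model in committee])

consensus = probs.mean(axis=0)     # qk: averaged over the committee
eps = 1e-12

# S = sum(pk * log(pk / qk)), averaged over committee members for each point
kl_per_member = np.sum(probs * np.log((probs + eps) / (consensus + eps)), axis=2)
kl_score = kl_per_member.mean(axis=0)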

Then we used two approaches for selecting the points to be labelled.

1. Stream-based approach

In the stream-based approach, we set a threshold and selected for labelling every point whose score falls below it.

2. Pool-based approach

In the pool-based approach, we set a parameter k, the number of points to be labelled. Since the list is already sorted, we selected the top k points.

Results:

We ran 3 iterations of active learning for QBC.

These are the results of the 3 iterations for the models.

[Results tables: model-wise accuracy for Iteration 1, Iteration 2 and Iteration 3]

We can clearly see that the model-wise average accuracy increases with each iteration. This is because, with each iteration, the model has more correctly labelled points (provided by the oracle).

We also checked the accuracy values for uncertainty sampling with the stream-based approach.

[Results figures: accuracy for Least Confidence, Margin Sampling and the Entropy method]

We can observe that the entropy method has the best accuracy; it reaches an accuracy of 1 in the 3rd iteration, as it takes all the class probabilities into account.

3. Cluster-based strategy for data labelling, given a limited budget.


Clustering is similar to classification, but the basis is different: in clustering you do not know in advance what you are looking for, and instead try to identify segments or clusters in the data. Applying clustering algorithms to a dataset can reveal structures, clusters and groupings that would not have been apparent otherwise.

This active learning regime exploits cluster structure in the data. A key advantage of cluster-based heuristics is that the framework can naturally utilise the unlabelled data Xu, as well as optimise the selection of the training data Xl.

Roughly speaking, various cluster-based methods follow a similar framework, introduced by Dasgupta and Hsu. In an ideal scenario, well-defined, separable clusters will exist that are pure in terms of labels. Once these clusters have been identified by unsupervised learning, a few informative points Xl can be selected from each cluster; any remaining unlabelled points Xu can then be assigned their cluster's most confident (majority) label, as in the figure below. A supervised classifier can then be trained on the labelled dataset XL, including queried and propagated labels YL, such that XL = (Xl ∪ Xu, YL).

Ideal clusters (separable and pure): (a) clustering of query points [ + ∕ −] and unlabelled
instances [○]; (b) query points and propagated labels (XL).
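A minimal sketch of this query-and-propagate idea, using k-means purely for illustration (the number of clusters, the per-cluster query budget, the random queries and the final classifier are all our assumptions, not the method of Dasgupta and Hsu):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

N_CLUSTERS = 5            # assumed
QUERIES_PER_CLUSTER = 3   # assumed per-cluster labelling budget

rng = np.random.default_rng(0)
clusters = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0).fit_predict(X)

propagated = np.empty(len(X), dtype=y.dtype)
for c in range(N_CLUSTERS):
    members = np.where(clusters == c)[0]
    queried = rng.choice(members, size=min(QUERIES_PER_CLUSTER, len(members)), replace=False)
    # Majority label among the queried (oracle-labelled) points in this cluster
    vals, counts = np.unique(y[queried], return_counts=True)
    propagated[members] = vals[counts.argmax()]   # propagate the majority label to the whole cluster
    propagated[queried] = y[queried]              # queried points keep their true labels

# Supervised classifier trained on queried + propagated labels (XL, YL)
clf = LogisticRegression(max_iter=1000).fit(X, propagated)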

The active/guided sampling element of cluster-based techniques is defined by the sampling procedure. Various methods have been proposed. Dasgupta and Hsu suggest a heuristic that favours instances from clusters that appear mixed as querying progresses. Alternatively, the density clustering algorithm by Wang et al. favours queries in regions populated by (relatively) dense groups of data.
The relationship between labels and clusters could be insignificant, or there might be viable
(near pure) clusters but at many different resolutions. For this reason, the performance of
cluster-based methods heavily depends on the quality of the clustering results.

Thus, data clustering must be adaptive — actively changing as more information becomes
available. Provided that there is some relationship between clustered groups of data and
diagnostic labels, at whatever resolution, cluster-based active learning can exploit these
patterns.

Algorithm 1. Agglomerative clustering

1: compute dissimilarity matrix d between all observations in X

2: initialise clusters as singletons: for i ← 1 to N do Ci ← {i}

3: initialise set of clusters available for merging: S ← {1, …, N}

4: repeat

5: pick the two most similar clusters to merge: (j, k) ← argmin j,k ∈ S d(j, k)

6: create new cluster Cl ← Cj ∪ Ck

7: mark j and k as unavailable: S ← S \ {j, k}

8: if Cl ≠ {1, …, N} then

9: mark l as available, S ← S ∪ {l}

10: end if

11: for i ∈ S do

12: update dissimilarity matrix d(i, l)

13: end for

14: until no more clusters are available for merging
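In practice Algorithm 1 corresponds to standard agglomerative (hierarchical) clustering; a brief sketch using SciPy (the distance metric, linkage rule and number of flat clusters are our assumptions):

from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Step 1: dissimilarity matrix between all observations in X (condensed form)
d = pdist(X, metric='euclidean')

# Steps 2-14: start from singletons and repeatedly merge the two most similar
# clusters, updating dissimilarities, until a single cluster remains.
Z = linkage(d, method='average')

# Cut the dendrogram at a chosen resolution to recover flat clusters
clusters = fcluster(Z, t=5, criterion='maxclust')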

Algorithm 2. Cluster-adaptive active learning.


Misclassification error e for an increasing query budget n. Plots are provided for
classifiers trained using guided sampling (the DH learner without label propagation)
vs. random sample training.
