Bernd Klein Python and Machine Learning Letter
Bernd Klein Python and Machine Learning Letter
Learning with
Python
Tutorial
by
Bernd Klein
bodenseo
© 2021 Bernd Klein
All rights reserved. No portion of this book may be reproduced or used in any
manner without written permission from the copyright owner.
www.python-course.eu
Python Course
Machine Learning
With Python by
Bernd Klein
Machine Learning Terminology .................................................................................................3
Representation and Visualization of Data ................................................................................15
Loading the Iris Data with Scikit-learn ....................................................................................18
Visualising the Features of the Iris Data Set.............................................................................23
Scatterplot 'Matrices .................................................................................................................27
Datasets in sklearn ....................................................................................................................29
Loading Digits Data..................................................................................................................31
Reading the data and conversion back into 'data' and 'labels'...................................................51
Other Interesting Distributions .................................................................................................54
k-Nearest-Neighbor Classifier ..................................................................................................72
From Dividing Lines to Neural Networks................................................................................96
Neural Networks, Structure, Weights and Matrices ...............................................................141
Running a Neural Network with Python ................................................................................153
Backpropagation in Neural Networks ....................................................................................162
Training a Neural Network with Python ................................................................................169
Softmax as Activation Function .............................................................................................182
Confusion Matrix........................................................................................................................3
Neural Network ......................................................................................................................198
Multiple Runs .........................................................................................................................210
With Bias Nodes .....................................................................................................................216
Networks with multiple hidden layers....................................................................................227
Networks with multiple hidden layers and Epochs ................................................................231
A Neural Network for the Digits Dataset ...............................................................................269
Naive Bayes Classifier with Scikit .........................................................................................316
Regression Trees.....................................................................................................................413
The maths behind regression trees..........................................................................................418
Regression Decision Trees from scratch in Python ................................................................423
Regression Trees in sklearn ....................................................................................................434
TensorFlow .............................................................................................................................437
2
MACHINE LEARNING
TERMINOLOGY
CLASSIFIER
A program or a function which maps from unlabeled instances to classes is called a classifier.
CONFUSION MATRIX
A confusion matrix, also called a contingeny table or error matrix, is used to visualize the performance of a
classifier.
The columns of the matrix represent the instances of the predicted classes and the rows represent the instances
of the actual class. (Note: It can be the other way around as well.)
In the case of binary classification the table has 2 rows and 2 columns.
male 42 8
female 18 32
This means that the classifier correctly predicted a male person in 42 cases and it wrongly predicted 8 male
instances as female. It correctly predicted 32 instances as female. 18 cases had been wrongly predicted as male
instead of female.
Accuracy is a statistical measure which is defined as the quotient of correct predictions made by a classifier
divided by the sum of predictions made by the classifier.
The classifier in our previous example predicted correctly predicted 42 male instances and 32 female instance.
which is 0.72
Let's assume we have a classifier, which always predicts "female". We have an accuracy of 50 % in this case.
male 0 50
female 0 50
spam 4 1
ham 4 91
The following classifier predicts solely "ham" and has the same accuracy.
spam 0 5
ham 0 95
The accuracy of this classifier is 95%, even though it is not capable of recognizing any spam at all.
negative TN FP
positive FN TP
SUPERVISED LEARNING
The machine learning program is both given the input data and the corresponding labelling. This means that
the learn data has to be labelled by a human being beforehand.
UNSUPERVISED LEARNING
No labels are provided to the learning algorithm. The algorithm has to figure out the a clustering of the input
data.
REINFORCEMENT LEARNING
A computer program dynamically interacts with its environment. This means that the program receives
positive and/or negative feedback to improve it performance.
INTRODUCTION
Not only in machine learning but also in
general life, especially business life, you
will hear questiones like "How accurate is
your product?" or "How precise is your
machine?". When people get replies like
"This is the most accurate product in its
field!" or "This machine has the highest
imaginable precision!", they feel
fomforted by both answers. Shouldn't
they? Indeed, the terms accurate and
precise are very often used
interchangeably. We will give exact
definitions later in the text, but in a
nutshell, we can say: Accuracy is a
measure for the closeness of some
measurements to a specific value, while
precision is the closeness of the measurements to each other.
These terms are also of extreme importance in Machine Learning. We need them for evaluating ML
algorithms or better their results.
We will present in this chapter of our Python Machine Learning Tutorial four important metrics. These metrics
are used to evaluate the results of classifications. The metrics are:
• Accuracy
• Precision
• Recall
• F1-Score
We will introduce each of these metrics and we will discuss the pro and cons of each of them. Each metric
measures something different about a classifiers performance. The metrics will be of outmost importance for
all the chapters of our machine learning tutorial.
ACCURACY
Accuracy is a measure for the closeness of the measurements to a specific value, while precision is the
closeness of the measurements to each other, i.e. not necessarily to a specific value. To put it in other words: If
we have a set of data points from repeated measurements of the same quantity, the set is said to be accurate if
their average is close to the true value of the quantity being measured. On the other hand, we call the set to be
precise, if the values are close to each other. The two concepts are independent of each other, which means
that the set of data can be accurate, or precise, or both, or neither. We show this in the following diagram:
EVALUATION METRICS 7
CONFUSION MATRIX
Before we continue with the term accuracy , we want to make sure that you understand what a confusion
matrix is about.
A confusion matrix, also called a contingeny table or error matrix, is used to visualize the performance of a
classifier.
The columns of the matrix represent the instances of the predicted classes and the rows represent the instances
of the actual class. (Note: It can be the other way around as well.)
In the case of binary classification the table has 2 rows and 2 columns.
EVALUATION METRICS 8
We want to demonstrate the concept with an example.
Example:
cat 42 8
dog 18 32
This means that the classifier correctly predicted a cat in 42 cases and it wrongly predicted 8 cat instances as
dog. It correctly predicted 32 instances as dog. 18 cases had been wrongly predicted as cat instead of dog.
ACCURACY IN CLASSIFICATION
We are interested in Machine Learning and accuracy is also used as a statistical measure. Accuracy is a
statistical measure which is defined as the quotient of correct predictions (both True positives (TP) and True
negatives (TN)) made by a classifier divided by the sum of all predictions made by the classifier, including
False positves (FP) and False negatives (FN). Therefore, the formula for quantifying binary accuracy is:
TP + TN
accuracy =
TP + TN + FP + FN
negative TN FP
positive FN TP
We will now calculate the accuracy for the cat-and-dog classification results. Instead of "True" and "False",
we see here "cat" and "dog". We can calculate the accuracy like this:
EVALUATION METRICS 9
TP = 42
TN = 32
FP = 8
FN = 18
cat 0 50
dog 0 50
ACCURACY PARADOX
We will demonstrate the so-called accuracy paradox.
spam 4 1
ham 4 91
EVALUATION METRICS 10
TP, TN, FP, FN = 4, 91, 1, 4
accuracy = (TP + TN)/(TP + TN + FP + FN)
print(accuracy)
0.95
The following classifier predicts solely "ham" and has the same accuracy.
spam 0 5
ham 0 95
The accuracy of this classifier is 95%, even though it is not capable of recognizing any spam at all.
PRECISION
Precision is the ratio of the correctly identified positive cases to all the predicted positive cases, i.e. the
correctly and the incorrectly cases predicted as positive . Precision is the fraction of retrieved documents
that are relevant to the query. The formula:
TP
precision =
TP + FP
spam 12 14
EVALUATION METRICS 11
ham 0 114
TP = 114
FP = 14
# FN (0) and TN (12) are not needed in the formuala!
precision = TP / (TP + FP)
print(f"precision: {precision:4.2f}")
precision: 0.89
Exercise: Before you go on with the text think about what the value precision means. If you look at the
precision measure of our spam filter example, what does it tell you about the quality of the spam filter? What
do the results of the confusion matrix of an ideal spam filter look like? What is worse, high FP or FN values?
Incidentally, the ideal spam filter would have 0 values for both FP and FN.
The previous result means that 11 mailpieces out of a hundred will be classified as ham, even though they are
spam. 89 are correctly classified as ham. This is a point where we should talk about the costs of
misclassification. It is troublesome when a spam mail is not recognized as "spam" and is instead presented to
us as "ham". If the percentage is not too high, it is annoying but not a disaster. In contrast, when a non-spam
message is wrongly labeled as spam, the email will not be shown in many cases or even automatically deleted.
For example, this carries a high risk of losing customers and friends. The measure precision makes no
statement about this last-mentioned problem class. What about other measures?
RECALL
Recall, also known as sensitivity, is the ratio of the correctly identified positive cases to all the actual positive
cases, which is the sum of the "False Negatives" and "True Positives".
TP
recall =
TP + FN
TP = 114
FN = 0
# FT (14) and TN (12) are not needed in the formuala!
recall = TP / (TP + FN)
print(f"recall: {recall:4.2f}")
EVALUATION METRICS 12
recall: 1.00
The value 1 means that no non-spam message is wrongly labeled as spam. It is important for a good spam
filter that this value should be 1. We have previously discussed this already.
F1-SCORE
The last measure, we will examine, is the F1-score.
2 precision ⋅ recall
F1 = 1 1
=2⋅
precision + recall
recall
+ precision
EVALUATION METRICS 13
FN FP TP pre acc rec f1
0.00 0.00 93.00 1.00 1.00 1.00 1.00
1.00 0.00 92.00 1.00 0.99 0.99 0.99
1.00 1.00 91.00 0.99 0.99 0.99 0.99
2.00 0.00 91.00 1.00 0.99 0.98 0.99
2.00 1.00 90.00 0.99 0.98 0.98 0.98
2.00 2.00 89.00 0.98 0.98 0.98 0.98
3.00 0.00 90.00 1.00 0.98 0.97 0.98
3.00 1.00 89.00 0.99 0.98 0.97 0.98
3.00 2.00 88.00 0.98 0.97 0.97 0.97
3.00 3.00 87.00 0.97 0.97 0.97 0.97
4.00 0.00 89.00 1.00 0.98 0.96 0.98
4.00 1.00 88.00 0.99 0.97 0.96 0.97
4.00 2.00 87.00 0.98 0.97 0.96 0.97
4.00 3.00 86.00 0.97 0.96 0.96 0.96
4.00 4.00 85.00 0.96 0.96 0.96 0.96
5.00 0.00 88.00 1.00 0.97 0.95 0.97
5.00 1.00 87.00 0.99 0.97 0.95 0.97
5.00 2.00 86.00 0.98 0.96 0.95 0.96
5.00 3.00 85.00 0.97 0.96 0.94 0.96
5.00 4.00 84.00 0.95 0.95 0.94 0.95
5.00 5.00 83.00 0.94 0.95 0.94 0.94
6.00 0.00 87.00 1.00 0.97 0.94 0.97
6.00 1.00 86.00 0.99 0.96 0.93 0.96
6.00 2.00 85.00 0.98 0.96 0.93 0.96
6.00 3.00 84.00 0.97 0.95 0.93 0.95
6.00 4.00 83.00 0.95 0.95 0.93 0.94
6.00 5.00 82.00 0.94 0.94 0.93 0.94
6.00 6.00 81.00 0.93 0.94 0.93 0.93
We can see that f1-score best reflects the worse case scenario that the FN value is rising, i.e. ham is
getting classified as spam!
EVALUATION METRICS 14
REPRESENTATION AND
VISUALIZATION OF DATA
In the following, we want to show how to do this using the data in the sklearn module.
The likelihood that the first dataset you will see in an introductory tutorial on machine learning will be the
"Iris dataset" is similarly high. The Iris dataset contains the measurements of 150 iris flowers from 3 different
species:
• Iris-Setosa,
• Iris-Versicolor, and
Iris Setosa
Iris Versicolor
Iris Virginica
For example, scikit-learn has a very straightforward set of data on these iris species. The data consist of the
following:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
1. Iris Setosa
2. Iris Versicolour
3. Iris Virginica
scikit-learn embeds a copy of the iris CSV file along with a helper function to load it into numpy
type(iris)
Output: sklearn.utils.Bunch
You can see what's available for this data type by using the method keys() :
iris.keys()
Output: dict_keys(['data', 'target', 'target_names', 'DESCR', 'featur
e_names', 'filename'])
A Bunch object is similar to a dicitionary, but it additionally allows accessing the keys in an attribute style:
print(iris["target_names"])
print(iris.target_names)
['setosa' 'versicolor' 'virginica']
['setosa' 'versicolor' 'virginica']
The features of each sample flower are stored in the data attribute of the dataset:
The feautures of each flower are stored in the data attribute of the data set. Let's take a look at some of the
samples:
The information about the class of each sample, i.e. the labels, is stored in the "target" attribute of the data set:
print(iris.data.shape)
print(iris.target.shape)
(150, 4)
(150,)
print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2
2 2]
import numpy as np
np.bincount(iris.target)
Output: array([50, 50, 50])
Using NumPy's bincount function (above) we can see that the classes in this dataset are evenly distributed -
there are 50 flowers of each species, with
These class names are stored in the last attribute, namely target_names :
print(iris.target_names)
['setosa' 'versicolor' 'virginica']
print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2
2 2]
Beside of the shape of the data, we can also check the shape of the labels, i.e. the target.shape :
Each flower sample is one row in the data array, and the columns (features) represent the flower measurements
in centimeters. For instance, we can represent this Iris dataset, consisting of 150 samples and 4 features, a
2-dimensional array or matrix R 150 × 4 in the following format:
[ ]
x (1) x (1) x (1) x (1)
1 2 3 4
The superscript denotes the ith row, and the subscript denotes the jth feature, respectively.
[ ]
x (1) x (1) x (1) … x (1)
1 2 3 k
print(iris.data.shape)
bincount of NumPy counts the number of occurrences of each value in an array of non-negative integers.
We can use this to check the distribution of the classes in the dataset:
import numpy as np
np.bincount(iris.target)
Output: array([50, 50, 50])
We can see that the classes are distributed uniformly - there are 50 flowers from each species, i.e.
• class 0: Iris-Setosa
• class 1: Iris-Versicolor
• class 2: Iris-Virginica
These class names are stored in the last attribute, namely target_names :
print(iris.target_names)
['setosa' 'versicolor' 'virginica']
The feauture data is four dimensional, but we can visualize one or two of the dimensions at a time using a
simple histogram or scatter-plot.
print(iris.data[iris.target==1, 0][:5])
[[7. 3.2 4.7 1.4]
[6.4 3.2 4.5 1.5]
[6.9 3.1 4.9 1.5]
[5.5 2.3 4. 1.3]
[6.5 2.8 4.6 1.5]]
[7. 6.4 6.9 5.5 6.5]
fig, ax = plt.subplots()
x_index = 3
colors = ['blue', 'red', 'green']
ax.set_xlabel(iris.feature_names[x_index])
ax.legend(loc='upper right')
fig.show()
x_index = 3
y_index = 0
ax.set_xlabel(iris.feature_names[x_index])
ax.set_ylabel(iris.feature_names[y_index])
ax.legend(loc='upper left')
plt.show()
Change x_index and y_index in the above script and find a combination of two parameters which maximally
separate the three classes.
GENERALIZATION
We will now look at all feature combinations in one combined diagram:
n = len(iris.feature_names)
fig, ax = plt.subplots(n, n, figsize=(16, 16))
for x in range(n):
for y in range(n):
xname = iris.feature_names[x]
yname = iris.feature_names[y]
for color_ind in range(len(iris.target_names)):
ax[x, y].scatter(iris.data[iris.target==color_ind,
x],
iris.data[iris.target==color_ind, y],
label=iris.target_names[color_ind],
c=colors[color_ind])
plt.show()
Instead of doing it manually we can also use the scatterplot matrix provided by the pandas module.
Scatterplot matrices show scatter plots between all features in the data set, as well as histograms to show the
distribution of each feature.
import pandas as pd
SCATTERPLOT 'MATRICES 27
3-DIMENSIONAL VISUALIZATION
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from mpl_toolkits.mplot3d import Axes3D
iris = load_iris()
X = []
for iclass in range(3):
X.append([[], [], []])
for i in range(len(iris.data)):
if iris.target[i] == iclass:
X[iclass][0].append(iris.data[i][0])
X[iclass][1].append(iris.data[i][1])
X[iclass][2].append(sum(iris.data[i][2:]))
SCATTERPLOT 'MATRICES 28
DATASETS IN SKLEARN
sklearn.datasets.load_*
• Downloadable Data: these larger datasets are available for download, and scikit-learn includes
tools which streamline this process. These tools can be found in
sklearn.datasets.fetch_*
• Generated Data: there are several datasets which are generated from models based on a random
seed. These are available in the sklearn.datasets.make_*
You can explore the available dataset loaders, fetchers, and generators using IPython's tab-completion
functionality. After importing the datasets submodule from sklearn , type
datasets.load_<TAB>
or
datasets.fetch_<TAB>
or
datasets.make_<TAB>
DATASETS IN SKLEARN 29
• n: (n_samples) The number of samples: each sample is an item to process (e.g. classify). A
sample can be a document, a picture, a sound, a video, an astronomical object, a row in database
or CSV file, or whatever you can describe with a fixed set of quantitative traits.
• m: (n_features) The number of features or distinct traits that can be used to describe each item in
a quantitative manner. Features are generally real-valued, but may be Boolean or discrete-valued
in some cases.
Be warned: many of these datasets are quite large, and can take a long time to download!
DATASETS IN SKLEARN 30
LOADING DIGITS DATA
We will have a closer look at one of these datasets. We look at the digits data set. We will load it first:
Again, we can get an overview of the available attributes by looking at the "keys":
digits.keys()
Output: dict_keys(['data', 'target', 'target_names', 'images', 'DESC
R'])
print(digits.data[0])
print(digits.target)
[ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5. 0.
0. 3.
15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8. 8. 0. 0. 5.
8. 0.
0. 9. 8. 0. 0. 4. 11. 0. 1. 12. 7. 0. 0. 2. 14. 5. 1
0. 12.
0. 0. 0. 0. 6. 13. 10. 0. 0. 0.]
[0 1 2 ... 8 9 8]
The data is also available at digits.images. This is the raw data of the images in the form of 8 lines and 8
columns.
With "data" an image corresponds to a one-dimensional Numpy array with the length 64, and "images"
representation contains 2-dimensional numpy arrays with the shape (8, 8)
Let's visualize the data. It's little bit more involved than the simple scatter-plot we used above, but we can do it
rather quickly.
EXERCISE 1
EXERCISE 2:
Create a scatter plot of the features ash and color_intensity of the wine data set.
EXERCISE 4:
SOLUTIONS
SOLUTION TO EXERCISE 1
wine = datasets.load_wine()
In [ ]:
print(wine.DESCR)
The names of the classes and the features can be retrieved like this:
print(wine.target_names)
print(wine.feature_names)
['class_0' 'class_1' 'class_2']
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesiu
m', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proant
hocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wine
s', 'proline']
data = wine.data
labelled_data = wine.target
SOLUTION TO EXERCISE 2:
plt.xlabel(features[0])
plt.ylabel(features[1])
plt.legend(loc='upper left')
plt.show()
SOLUTION TO EXERCISE 3:
import pandas as pd
from sklearn import datasets
wine = datasets.load_wine()
def rotate_labels(df, axes):
""" changing the rotation of the label output,
rotate_labels(wine_df, axs)
faces.keys()
Output: dict_keys(['data', 'images', 'target', 'DESCR'])
np.sqrt(4096)
Output: 64.0
faces.images.shape
Output: (400, 64, 64)
faces.data.shape
Output: (400, 4096)
The following Python code is a simple example in which we create artificial weather data for some German
cities. We use Pandas and Numpy to create the data:
import numpy as np
DATA GENERATION 40
import pandas as pd
n= len(cities)
data = {'Temperature': np.random.normal(24, 3, n),
'Humidity': np.random.normal(78, 2.5, n),
'Wind': np.random.normal(15, 4, n)
}
df = pd.DataFrame(data=data, index=cities)
df
Output:
Temperature Humidity Wind
DATA GENERATION 41
ANOTHER EXAMPLE
We will create artificial data for four nonexistent types of flowers. If the names remind you of programming
languages and pizza, it will be no coincidence:
• Flos Pythonem
• Flos Java
• Flos Margarita
• Flos artificialis
• (255, 0, 0)
• (245, 107, 0)
• (206, 99, 1)
• (255, 254, 101)
• 3.8
• 3.3
• 4.1
• 2.9
DATA GENERATION 42
res = truncated_normal(mean=mean, sd=sd, low=low, upp=upp)
return res.rvs(num).astype(np.uint8)
# flos Java:
number_of_items = number_of_items_per_class[1]
reds = truncated_normal_ints(mean=245, sd=17, low=226, upp=256,
num=number_of_items)
greens = truncated_normal_ints(mean=107, sd=11, low=88, upp=127,
num=number_of_items)
blues = truncated_normal_ints(mean=0, sd=10, low=0, upp=20,
num=number_of_items)
calyx_dia = truncated_normal_floats(3.3, 0.3, 3.0, 3.5,
num=number_of_items)
data = np.column_stack((reds, greens, blues, calyx_dia))
flowers["flos_java"] = data
# flos Java:
number_of_items = number_of_items_per_class[2]
reds = truncated_normal_ints(mean=206, sd=17, low=175, upp=238,
num=number_of_items)
greens = truncated_normal_ints(mean=99, sd=14, low=80, upp=120,
num=number_of_items)
blues = truncated_normal_ints(mean=1, sd=5, low=0, upp=12,
num=number_of_items)
calyx_dia = truncated_normal_floats(4.1, 0.3, 3.8, 4.4,
num=number_of_items)
data = np.column_stack((reds, greens, blues, calyx_dia))
flowers["flos_margarita"] = data
DATA GENERATION 43
# flos artificialis:
number_of_items = number_of_items_per_class[3]
reds = truncated_normal_ints(mean=255, sd=8, low=2245, upp=2255,
num=number_of_items)
greens = truncated_normal_ints(mean=254, sd=10, low=240, upp=255,
num=number_of_items)
blues = truncated_normal_ints(mean=101, sd=5, low=90, upp=112,
num=number_of_items)
calyx_dia = truncated_normal_floats(2.9, 0.4, 2.4, 3.5,
num=number_of_items)
data = np.column_stack((reds, greens, blues, calyx_dia))
flowers["flos_artificialis"] = data
data = np.concatenate((flowers["flos_pythonem"],
flowers["flos_java"],
flowers["flos_margarita"],
flowers["flos_artificialis"]
), axis=0)
target_names = list(flowers.keys())
feature_names = ['red', 'green', 'blue', 'calyx']
n = 4
fig, ax = plt.subplots(n, n, figsize=(16, 16))
for x in range(n):
DATA GENERATION 44
for y in range(n):
xname = feature_names[x]
yname = feature_names[y]
for color_ind in range(len(target_names)):
ax[x, y].scatter(data[target==color_ind, x],
data[target==color_ind, y],
label=target_names[color_ind],
c=colors[color_ind])
ax[x, y].set_xlabel(xname)
ax[x, y].set_ylabel(yname)
ax[x, y].legend(loc='upper left')
plt.show()
DATA GENERATION 45
GENERATE SYNTHETIC DATA WITH SCIKIT-LEARN
It is a lot easier to use the possibilities of Scikit-Learn to create synthetic data.
DATA GENERATION 46
GENERATORS FOR CLASSIFICATION AND CLUSTERING
We start with the the function make_blobs of sklearn.datasets to create 'blob' like data
distributions. By setting the value of centers to n_classes , we determine the number of blobs, i.e.
the clusters. n_samples corresponds to the total number of points equally divided among clusters. If
random_state is not set, we will have random results every time we call the function. We pass an int to
this parameter for reproducible output across multiple function calls.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
n_classes = 4
data, labels = make_blobs(n_samples=1000,
centers=n_classes,
random_state=100)
labels[:7]
Output: array([1, 3, 1, 3, 1, 3, 2])
fig, ax = plt.subplots()
ax.set(xlabel='X',
ylabel='Y',
title='Blobs Examples')
ax.legend(loc='upper right')
DATA GENERATION 47
Output: <matplotlib.legend.Legend at 0x7f50f92a4640>
The centers of the blobs were randomly chosen in the previous example. In the following example we set the
centers of the blobs explicitly. We create a list with the center points and pass it to the parameter centers :
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
labels[:7]
Output: array([0, 1, 1, 0, 2, 2, 2])
fig, ax = plt.subplots()
DATA GENERATION 48
label=label)
ax.set(xlabel='X',
ylabel='Y',
title='Blobs Examples')
ax.legend(loc='upper right')
Output: <matplotlib.legend.Legend at 0x7f50f91eaca0>
Usually, you want to save your artificially created datasets in a file. For this purpose, we can use the function
savetxt from numpy. Before we can do this we have to reaarange our data. Each row should contain both
the data and the label:
import numpy as np
labels = labels.reshape((labels.shape[0],1))
all_data = np.concatenate((data, labels), axis=1)
all_data[:7]
Output: array([[ 1.72415394, 4.22895559, 0. ],
[ 4.16466507, 5.77817418, 1. ],
[ 4.51441156, 4.98274913, 1. ],
[ 1.49102772, 2.83351405, 0. ],
[ 6.0386362 , 7.57298437, 2. ],
[ 5.61044976, 9.83428321, 2. ],
[ 5.69202866, 10.47239631, 2. ]])
DATA GENERATION 49
For some people it might be complicated to understand the combination of reshape and concatenate.
Therefore, you can see an extremely simple example in the following code:
import numpy as np
We use the numpy function savetxt to save the data. Don't worry about the strange name, it is just for fun
and for reasons which will be clear soon:
np.savetxt("squirrels.txt",
all_data,
fmt=['%.3f', '%.3f', '%1d'])
all_data[:10]
Output: array([[ 1.72415394, 4.22895559, 0. ],
[ 4.16466507, 5.77817418, 1. ],
[ 4.51441156, 4.98274913, 1. ],
[ 1.49102772, 2.83351405, 0. ],
[ 6.0386362 , 7.57298437, 2. ],
[ 5.61044976, 9.83428321, 2. ],
[ 5.69202866, 10.47239631, 2. ],
[ 6.14017298, 8.56209179, 2. ],
[ 2.97620068, 5.56776474, 1. ],
[ 8.27980017, 8.54824406, 2. ]])
DATA GENERATION 50
READING THE DATA AND
CONVERSION BACK INTO 'DATA'
AND 'LABELS'
We will demonstrate now, how to read in the data again and how to split it into data and labels again:
file_data = np.loadtxt("squirrels.txt")
data = file_data[:,:-1]
labels = file_data[:,2:]
labels = labels.reshape((labels.shape[0]))
We had called the data file squirrels.txt , because we imagined a strange kind of animal living in the
Sahara desert. The x-values stand for the night vision capabilities of the animals and the y-values correspond
to the colour of the fur, going from sandish to black. We have three kinds of squirrels, 0, 1, and 2. (Be aware
that our squirrals are imaginary squirrels and have nothing to do with the real squirrels of the Sahara!)
fig, ax = plt.subplots()
for n_class in range(0, n_classes):
ax.scatter(data[labels==n_class, 0], data[labels==n_class,
1],
c=colours[n_class], s=10, label=str(n_class))
ax.set(xlabel='Night Vision',
ylabel='Fur color from sandish to black, 0 to 10 ',
title='Sahara Virtual Squirrel')
ax.legend(loc='upper right')
READING THE DATA AND CONVERSION BACK INTO 'DATA' AND 'LABELS' 51
Output: <matplotlib.legend.Legend at 0x7f545b4d6340>
data_sets = train_test_split(data,
labels,
train_size=0.8,
test_size=0.2,
random_state=42 # garantees same output fo
r every run
)
# import model
from sklearn.neighbors import KNeighborsClassifier
# create classifier
knn = KNeighborsClassifier(n_neighbors=8)
# train
knn.fit(train_data, train_labels)
READING THE DATA AND CONVERSION BACK INTO 'DATA' AND 'LABELS' 52
Output: array([2., 0., 1., 1., 0., 1., 2., 2., 2., 2., 0., 1., 0.,
0., 1., 0., 1.,
2., 0., 0., 1., 2., 1., 2., 2., 1., 2., 0., 0., 2.,
0., 2., 2., 0.,
0., 2., 0., 0., 0., 1., 0., 1., 1., 2., 0., 2., 1.,
2., 1., 0., 2.,
1., 1., 0., 1., 2., 1., 0., 0., 2., 1., 0., 1., 1.,
0., 0., 0., 0.,
0., 0., 0., 1., 1., 0., 1., 1., 1., 0., 1., 2., 1.,
2., 0., 2., 1.,
1., 0., 2., 2., 2., 0., 1., 1., 1., 2., 2., 0., 2.,
2., 2., 2., 0.,
0., 1., 1., 1., 2., 1., 1., 1., 0., 2., 1., 2., 0.,
0., 1., 0., 1.,
0., 2., 2., 2., 1., 1., 1., 0., 2., 1., 2., 2., 1.,
2., 0., 2., 0.,
0., 1., 0., 2., 2., 0., 0., 1., 2., 1., 2., 0., 0.,
2., 2., 0., 0.,
1., 2., 1., 2., 0., 0., 1., 2., 1., 0., 2., 2., 0.,
2., 0., 0., 2.,
1., 0., 0., 0., 0., 2., 2., 1., 0., 2., 2., 1., 2.,
0., 1., 1., 1.,
0., 1., 0., 1., 1., 2., 0., 2., 2., 1., 1., 1., 2.])
READING THE DATA AND CONVERSION BACK INTO 'DATA' AND 'LABELS' 53
OTHER INTERESTING
DISTRIBUTIONS
import numpy as np
import sklearn.datasets as ds
data, labels = ds.make_moons(n_samples=150,
shuffle=True,
noise=0.19,
random_state=None)
data += np.array(-np.ndarray.min(data[:,0]),
-np.ndarray.min(data[:,1]))
np.ndarray.min(data[:,0]), np.ndarray.min(data[:,1])
Output: (0.0, 0.34649342272719386)
ax.set(xlabel='X',
ylabel='Y',
title='Moons')
#ax.legend(loc='upper right');
We want to scale values that are in a range [min, max] in a range [a, b] .
(b − a) ⋅ (x − min)
f(x) = +a
max − min
We now use this formula to transform both the X and Y coordinates of data into other ranges:
#np.ndarray.min(data[:,0]), np.ndarray.max(data[:,0])
data[:6]
Output: array([[71.14479608, 12.28919998],
[62.16584307, 18.75442981],
[61.02613211, 12.80794358],
[64.30752046, 12.32563839],
[81.41469127, 13.64613406],
[82.03929032, 13.63156545]])
fig, ax = plt.subplots()
ax.set(xlabel='X',
ylabel='Y',
title='moons')
ax.legend(loc='upper right');
import sklearn.datasets as ds
data, labels = ds.make_circles(n_samples=100,
shuffle=True,
fig, ax = plt.subplots()
ax.set(xlabel='X',
ylabel='Y',
title='circles')
ax.legend(loc='upper right')
Output: <matplotlib.legend.Legend at 0x7f54588c2e20>
print(__doc__)
plt.figure(figsize=(8, 8))
plt.subplots_adjust(bottom=.05, top=.9, left=.05, right=.95)
plt.subplot(322)
plt.title("Two informative features, one cluster per class", fonts
ize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_inform
ative=2,
n_clusters_per_class=1)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1,
s=25, edgecolor='k')
plt.subplot(323)
plt.title("Two informative features, two clusters per class",
fontsize='small')
X2, Y2 = make_classification(n_features=2,
n_redundant=0,
n_informative=2)
plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2,
s=25, edgecolor='k')
plt.subplot(324)
plt.title("Multi-class, two informative features, one cluster",
fontsize='small')
X1, Y1 = make_classification(n_features=2,
n_redundant=0,
n_informative=2,
n_clusters_per_class=1,
n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1,
s=25, edgecolor='k')
plt.subplot(325)
plt.title("Gaussian divided into three quantiles", fontsize='smal
l')
X1, Y1 = make_gaussian_quantiles(n_features=2, n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1,
s=25, edgecolor='k')
EXERCISES
EXERCISE 1
Create two testsets which are separable with a perceptron without a bias node.
Create two testsets which are not separable with a dividing line going through the origin.
EXERCISE 3
Create a dataset with five classes "Tiger", "Lion", "Penguin", "Dolphin", and "Python". The sets should look
similar to the following diagram:
SOLUTIONS
SOLUTION TO EXERCISE 1
fig, ax = plt.subplots()
ax.legend(loc='upper right')
Output: <matplotlib.legend.Legend at 0x7f788afb2c40>
SOLUTION TO EXERCISE 2
fig, ax = plt.subplots()
ax.set(xlabel='X',
ylabel='Y',
ax.legend(loc='upper right')
Output: <matplotlib.legend.Legend at 0x7f788af8eac0>
SOLUTION TO EXERCISE 3
import sklearn.datasets as ds
data, labels = ds.make_circles(n_samples=100,
shuffle=True,
noise=0.05,
random_state=42)
print(labels2)
labels = np.concatenate([labels, labels2])
data = data * [1.2, 1.8] + [3, 4]
fig, ax = plt.subplots()
ax.set(xlabel='X',
ylabel='Y',
title='dataset')
ax.legend(loc='upper right')
Output: <matplotlib.legend.Legend at 0x7f788b1d42b0>
For this purpose, we need to split our data into two parts:
When you consider how machine learning normally works, the idea of a split between learning and test data
makes sense. Really existing systems train on existing data and if other new data (from customers, sensors or
other sources) comes in, the trained classifier has to predict or classify this new data. We can simulate this
during training with a training and test data set - the test data is a simulation of "future data" that will go into
the system during production.
In this chapter of our Python Machine Learning Tutorial, we will learn how to do the splitting with plain
Python.
We will see also that doing it manually is not necessary, because the train_test_split function from
the model_selection module can do it for us.
DATA PREPARATION 65
We separated the dataset into a learn (a.k.a. training) dataset and a test dataset. Best practice is to split it into a
learn, test and an evaluation dataset.
We will train our model (classifier) step by step and each time the result needs to be tested. If we just have a
test dataset. The results of the testing might get into the model. So we will use an evaluation dataset for the
complete learning phase. When our classifier is finished, we will check it with the test dataset, which it has not
"seen" before!
Yet, during our tutorial, we will only use splitings into learn and test datasets.
DATA PREPARATION 66
SPLITTING EXAMPLE: IRIS DATA SET
We will demonstrate the previously discussed topics with the Iris Dataset.
The 150 data sets of the Iris data set are sorted, i.e. the first 50 data correspond to the first flower class (0 =
Setosa), the next 50 to the second flower class (1 = Versicolor) and the remaining data correspond to the last
class (2 = Virginica).
If we were to split our data in the ratio 2/3 (learning set) and 1/3 (test set), the learning set would contain all
the flowers of the first two classes and the test set all the flowers of the third flower class. The classifier could
only learn two classes and the third class would be completely unknown. So we urgently need to mix the data.
Assuming all samples are independent of each other, we want to shuffle the data set randomly before we split
the data set as shown above.
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
iris.target
Output: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
The first thing we have to do is rearrange the data so that it is not sorted anymore. For this purpose, we will
use the permutation function of the random submodul of Numpy:
indices = np.random.permutation(len(iris.data))
indices
DATA PREPARATION 67
Output: array([ 98, 56, 37, 60, 94, 142, 117, 121, 10, 15, 8
9, 85, 66,
29, 44, 102, 24, 140, 58, 25, 19, 100, 83, 12
6, 28, 118,
50, 127, 72, 99, 74, 0, 128, 11, 45, 143, 5
4, 79, 34,
32, 95, 92, 46, 146, 3, 9, 73, 101, 23, 7
7, 39, 87,
111, 129, 148, 67, 75, 147, 48, 76, 43, 30, 14
4, 27, 104,
35, 93, 125, 2, 69, 63, 40, 141, 7, 133, 1
8, 4, 12,
109, 33, 88, 71, 22, 110, 42, 8, 134, 5, 9
7, 114, 135,
108, 91, 14, 6, 137, 124, 130, 145, 55, 17, 8
0, 36, 61,
49, 62, 90, 84, 64, 139, 107, 112, 1, 70, 12
3, 38, 132,
31, 16, 13, 21, 113, 120, 41, 106, 65, 20, 11
6, 86, 68,
96, 78, 53, 47, 105, 136, 51, 57, 131, 149, 11
9, 26, 59,
138, 122, 81, 103, 52, 115, 82])
n_test_samples = 12
learnset_data = iris.data[indices[:-n_test_samples]]
learnset_labels = iris.target[indices[:-n_test_samples]]
testset_data = iris.data[indices[-n_test_samples:]]
testset_labels = iris.target[indices[-n_test_samples:]]
print(learnset_data[:4], learnset_labels[:4])
print(testset_data[:4], testset_labels[:4])
[[5.1 2.5 3. 1.1]
[6.3 3.3 4.7 1.6]
[4.9 3.6 1.4 0.1]
[5. 2. 3.5 1. ]] [1 1 0 1]
[[7.9 3.8 6.4 2. ]
[5.9 3. 5.1 1.8]
[6. 2.2 5. 1.5]
[5. 3.4 1.6 0.4]] [2 2 2 0]
DATA PREPARATION 68
We will demonstrate this below. We will use 80% of the data as training and 20% as test data. We could just as
well have taken 70% and 30%, because there are no hard and fast rules. The most important thing is that you
rate your system fairly based on data it did not see during exercise! In addition, there must be enough data in
both data sets.
n = 7
print(f"The first {n} data sets:")
print(test_data[:7])
print(f"The corresponding {n} labels:")
print(test_labels[:7])
The first 7 data sets:
[[6.1 2.8 4.7 1.2]
[5.7 3.8 1.7 0.3]
[7.7 2.6 6.9 2.3]
[6. 2.9 4.5 1.5]
[6.8 2.8 4.8 1.4]
[5.4 3.4 1.5 0.4]
[5.6 2.9 3.6 1.3]]
The corresponding 7 labels:
[1 0 2 1 1 0 1]
import numpy as np
print('All:', np.bincount(labels) / float(len(labels)) * 100.0)
print('Training:', np.bincount(train_labels) / float(len(train_lab
els)) * 100.0)
DATA PREPARATION 69
print('Test:', np.bincount(test_labels) / float(len(test_labels))
* 100.0)
All: [33.33333333 33.33333333 33.33333333]
Training: [33.33333333 34.16666667 32.5 ]
Test: [33.33333333 30. 36.66666667]
To stratify the division, we can pass the label array as an additional argument to the train_test_split function:
This was a stupid example to test the stratified random sample, because the Iris data set has the same
proportions, i.e. each class 50 elements.
We will work now with the file strange_flowers.txt of the directory data . This data set is created
in the chapter Generate Datasets in Python The classes in this dataset have different numbers of items. First
we load the data:
DATA PREPARATION 70
res = train_test_split(data, labels,
train_size=0.8,
test_size=0.2,
random_state=42,
stratify=labels)
train_data, test_data, train_labels, test_labels = res
DATA PREPARATION 71
K-NEAREST-NEIGHBOR CLASSIFIER
If you learn that Ben lives in a neighborhood where people vote conservative and that the average income is
above 200000 dollars a year? Both his neighbors make even more than 300,000 dollars per year? What do you
think of Ben? Most probably, you do not consider him to be an underdog and you may suspect him to be a
conservative as well?
The principle behind nearest neighbor classification consists in finding a predefined number, i.e. the 'k' - of
training samples closest in distance to a new sample, which has to be classified. The label of the new sample
will be defined from these neighbors. k-nearest neighbor classifiers have a fixed user defined constant for the
number of neighbors which have to be determined. There are also radius-based neighbor learning algorithms,
which have a varying number of neighbors based on the local density of points, all the samples inside of a
fixed radius. The distance can, in general, be any metric measure: standard Euclidean distance is the most
common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since
they simply "remember" all of its training data. Classification can be computed by a majority vote of the
nearest neighbors of the unknown sample.
The k-NN algorithm is among the simplest of all machine learning algorithms, but despite its simplicity, it has
been quite successful in a large number of classification and regression problems, for example character
recognition or image analysis.
As explained in the chapter Data Preparation, we need labeled learning and test data. In contrast to other
classifiers, however, the pure nearest-neighbor classifiers do not do any learning, but the so-called learning set
LS is a basic component of the classifier. The k-Nearest-Neighbor Classifier (kNN) works directly on the
K-NEAREST-NEIGHBOR CLASSIFIER 72
learned samples, instead of creating rules compared to other classification methods.
Given a set of categories C = {c 1, c 2, . . . c m}, also called classes, e.g. {"male", "female"}. There is also a
learnset LS consisting of labelled instances:
LS = {(o 1, c o ), (o 2, c o ), ⋯(o n, c o )}
1 2 n
As it makes no sense to have less lebelled items than categories, we can postulate that
• Case 1:
The instance o is an element of LS, i.e. there is a tupel (o, c) ∈ LS
In this case, we will use the class c as the classification result.
• Case 2:
We assume now that o is not in LS, or to be precise:
∀c ∈ C, (o, c) ∉ LS
o is compared with all the instances of LS. A distance metric d is used for the comparisons.
We determine the k closest neighbors of o, i.e. the items with the smallest distances.
k is a user defined constant and a positive integer, which is usually small.
The number k is typically chosen as the square root of LS, the total number of points in the training data set.
There is no general way to define an optimal value for 'k'. This value depends on the data. As a general rule
we can say that increasing 'k' reduces the noise but on the other hand makes the boundaries less distinct.
The algorithm for the k-nearest neighbor classifier is among the simplest of all machine learning algorithms.
k-NN is a type of instance-based learning, or lazy learning. In machine learning, lazy learning is understood
to be a learning method in which generalization of the training data is delayed until a query is made to the
system. On the other hand, we have eager learning, where the system usually generalizes the training data
before receiving queries. In other words: The function is only approximated locally and all the computations
K-NEAREST-NEIGHBOR CLASSIFIER 73
are performed, when the actual classification is being performed.
The following picture shows in a simple way how the nearest neighbor classifier works. The puzzle piece is
unknown. To find out which animal it might be we have to find the neighbors. If k=1 , the only neighbor is a
cat and we assume in this case that the puzzle piece should be a cat as well. If k=4 , the nearest neighbors
contain one chicken and three cats. In this case again, it will be save to assume that our object in question
should be a cat.
Before we actually start with writing a nearest neighbor classifier, we need to think about the data, i.e. the
learnset and the testset. We will use the "iris" dataset provided by the datasets of the sklearn module.
The data set consists of 50 samples from each of three species of Iris
• Iris setosa,
• Iris virginica and
• Iris versicolor.
Four features were measured from each sample: the length and the width of the sepals and petals, in
centimetres.
import numpy as np
from sklearn import datasets
K-NEAREST-NEIGHBOR CLASSIFIER 74
iris = datasets.load_iris()
data = iris.data
labels = iris.target
We create a learnset from the sets above. We use permutation from np.random to split the data
randomly.
n_training_samples = 12
learn_data = data[indices[:-n_training_samples]]
learn_labels = labels[indices[:-n_training_samples]]
test_data = data[indices[-n_training_samples:]]
test_labels = labels[indices[-n_training_samples:]]
K-NEAREST-NEIGHBOR CLASSIFIER 75
The first samples of our learn set:
index data label
0 [6.1 2.8 4.7 1.2] 1
1 [5.7 3.8 1.7 0.3] 0
2 [7.7 2.6 6.9 2.3] 2
3 [6. 2.9 4.5 1.5] 1
4 [6.8 2.8 4.8 1.4] 1
The first samples of our test set:
index data label
0 [6.1 2.8 4.7 1.2] 1
1 [5.7 3.8 1.7 0.3] 0
2 [7.7 2.6 6.9 2.3] 2
3 [6. 2.9 4.5 1.5] 1
4 [6.8 2.8 4.8 1.4] 1
The following code is only necessary to visualize the data of our learnset. Our data consists of four values per
iris item, so we will reduce the data to three values by summing up the third and fourth value. This way, we
are capable of depicting the data in 3-dimensional space:
#%matplotlib widget
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
K-NEAREST-NEIGHBOR CLASSIFIER 76
DISTANCE METRICS
We have already mentioned in detail, we calculate the distances between the points of the sample and the
object to be classified. To calculate these distances we need a distance function.
In n-dimensional vector rooms, one usually uses one of the following three distance metrics:
• Euclidean Distance
The Euclidean distance between two points x and y in either the plane or 3-dimensional
space measures the length of a line segment connecting these two points. It can be calculated
from the Cartesian coordinates of the points using the Pythagorean theorem, therefore it is also
occasionally being called the Pythagorean distance. The general formula is
d(x, y) =
√ ∑ (x i − y i) 2
i=1
• Manhattan Distance
It is defined as the sum of the absolute values of the differences between the coordinates of x
and y:
n
d(x, y) = ∑ | xi − yi |
i=1
• Minkowski Distance
The Minkowski distance generalizes the Euclidean and the Manhatten distance in one distance
metric. If we set the parameter p in the following formula to 1 we get the manhattan distance
an using the value 2 gives us the euclidean distance:
K-NEAREST-NEIGHBOR CLASSIFIER 77
( )
1
n
p
d(x, y) = ∑ | xi − yi | p
i=1
The following diagram visualises the Euclidean and the Manhattan distance:
The blue line illustrates the Eucliden distance between the green and red dot. Otherwise you can also move
over the orange, green or yellow line from the green point to the red point. The lines correspond to the
manhatten distance. The length is equal.
To determine the similarity between two instances, we will use the Euclidean distance.
We can calculate the Euclidean distance with the function norm of the module np.linalg :
K-NEAREST-NEIGHBOR CLASSIFIER 78
4.47213595499958
3.4190641994557516
The function get_neighbors returns a list with k neighbors, which are closest to the instance
test_instance :
def get_neighbors(training_set,
labels,
test_instance,
k,
distance):
"""
get_neighors calculates a list of the k nearest neighbors
of an instance 'test_instance'.
The function returns a list of k 3-tuples.
Each 3-tuples consists of (index, dist, label)
where
index is the index from the training_set,
dist is the distance between the test_instance and the
instance training_set[index]
distance is a reference to a function used to calculate the
distances
"""
distances = []
for index in range(len(training_set)):
dist = distance(test_instance, training_set[index])
distances.append((training_set[index], dist, labels[inde
x]))
distances.sort(key=lambda x: x[1])
neighbors = distances[:k]
return neighbors
for i in range(5):
neighbors = get_neighbors(learn_data,
learn_labels,
test_data[i],
3,
distance=distance)
print("Index: ",i,'\n',
"Testset Data: ",test_data[i],'\n',
"Testset Label: ",test_labels[i],'\n',
"Neighbors: ",neighbors,'\n')
K-NEAREST-NEIGHBOR CLASSIFIER 79
Index: 0
Testset Data: [5.7 2.8 4.1 1.3]
Testset Label: 1
Neighbors: [(array([5.7, 2.9, 4.2, 1.3]), 0.141421356237309
95, 1), (array([5.6, 2.7, 4.2, 1.3]), 0.17320508075688815, 1), (ar
ray([5.6, 3. , 4.1, 1.3]), 0.22360679774997935, 1)]
Index: 1
Testset Data: [6.5 3. 5.5 1.8]
Testset Label: 2
Neighbors: [(array([6.4, 3.1, 5.5, 1.8]), 0.141421356237309
3, 2), (array([6.3, 2.9, 5.6, 1.8]), 0.24494897427831783, 2), (arr
ay([6.5, 3. , 5.2, 2. ]), 0.3605551275463988, 2)]
Index: 2
Testset Data: [6.3 2.3 4.4 1.3]
Testset Label: 1
Neighbors: [(array([6.2, 2.2, 4.5, 1.5]), 0.264575131106458
6, 1), (array([6.3, 2.5, 4.9, 1.5]), 0.574456264653803, 1), (arra
y([6. , 2.2, 4. , 1. ]), 0.5916079783099617, 1)]
Index: 3
Testset Data: [6.4 2.9 4.3 1.3]
Testset Label: 1
Neighbors: [(array([6.2, 2.9, 4.3, 1.3]), 0.200000000000000
18, 1), (array([6.6, 3. , 4.4, 1.4]), 0.2645751311064587, 1), (arr
ay([6.6, 2.9, 4.6, 1.3]), 0.3605551275463984, 1)]
Index: 4
Testset Data: [5.6 2.8 4.9 2. ]
Testset Label: 2
Neighbors: [(array([5.8, 2.7, 5.1, 1.9]), 0.316227766016837
5, 2), (array([5.8, 2.7, 5.1, 1.9]), 0.3162277660168375, 2), (arra
y([5.7, 2.5, 5. , 2. ]), 0.33166247903553986, 2)]
We will write a vote function now. This functions uses the class Counter from collections to count
the quantity of the classes inside of an instance list. This instance list will be the neighbors of course. The
function vote returns the most common class:
def vote(neighbors):
K-NEAREST-NEIGHBOR CLASSIFIER 80
class_counter = Counter()
for neighbor in neighbors:
class_counter[neighbor[2]] += 1
return class_counter.most_common(1)[0][0]
for i in range(n_training_samples):
neighbors = get_neighbors(learn_data,
learn_labels,
test_data[i],
3,
distance=distance)
print("index: ", i,
", result of vote: ", vote(neighbors),
", label: ", test_labels[i],
", data: ", test_data[i])
index: 0 , result of vote: 1 , label: 1 , data: [5.7 2.8 4.1
1.3]
index: 1 , result of vote: 2 , label: 2 , data: [6.5 3. 5.5
1.8]
index: 2 , result of vote: 1 , label: 1 , data: [6.3 2.3 4.4
1.3]
index: 3 , result of vote: 1 , label: 1 , data: [6.4 2.9 4.3
1.3]
index: 4 , result of vote: 2 , label: 2 , data: [5.6 2.8 4.9
2. ]
index: 5 , result of vote: 2 , label: 2 , data: [5.9 3. 5.1
1.8]
index: 6 , result of vote: 0 , label: 0 , data: [5.4 3.4 1.7
0.2]
index: 7 , result of vote: 1 , label: 1 , data: [6.1 2.8 4.
1.3]
index: 8 , result of vote: 1 , label: 2 , data: [4.9 2.5 4.5
1.7]
index: 9 , result of vote: 0 , label: 0 , data: [5.8 4. 1.2
0.2]
index: 10 , result of vote: 1 , label: 1 , data: [5.8 2.6 4.
1.2]
index: 11 , result of vote: 2 , label: 2 , data: [7.1 3. 5.9
2.1]
We can see that the predictions correspond to the labelled results, except in case of the item with the index 8.
K-NEAREST-NEIGHBOR CLASSIFIER 81
'vote_prob' is a function like 'vote' but returns the class name and the probability for this class:
def vote_prob(neighbors):
class_counter = Counter()
for neighbor in neighbors:
class_counter[neighbor[2]] += 1
labels, votes = zip(*class_counter.most_common())
winner = class_counter.most_common(1)[0][0]
votes4winner = class_counter.most_common(1)[0][1]
return winner, votes4winner/sum(votes)
for i in range(n_training_samples):
neighbors = get_neighbors(learn_data,
learn_labels,
test_data[i],
5,
distance=distance)
print("index: ", i,
", vote_prob: ", vote_prob(neighbors),
", label: ", test_labels[i],
", data: ", test_data[i])
K-NEAREST-NEIGHBOR CLASSIFIER 82
index: 0 , vote_prob: (1, 1.0) , label: 1 , data: [5.7 2.8
4.1 1.3]
index: 1 , vote_prob: (2, 1.0) , label: 2 , data: [6.5 3.
5.5 1.8]
index: 2 , vote_prob: (1, 1.0) , label: 1 , data: [6.3 2.3
4.4 1.3]
index: 3 , vote_prob: (1, 1.0) , label: 1 , data: [6.4 2.9
4.3 1.3]
index: 4 , vote_prob: (2, 1.0) , label: 2 , data: [5.6 2.8
4.9 2. ]
index: 5 , vote_prob: (2, 0.8) , label: 2 , data: [5.9 3.
5.1 1.8]
index: 6 , vote_prob: (0, 1.0) , label: 0 , data: [5.4 3.4
1.7 0.2]
index: 7 , vote_prob: (1, 1.0) , label: 1 , data: [6.1 2.8
4. 1.3]
index: 8 , vote_prob: (1, 1.0) , label: 2 , data: [4.9 2.5
4.5 1.7]
index: 9 , vote_prob: (0, 1.0) , label: 0 , data: [5.8 4.
1.2 0.2]
index: 10 , vote_prob: (1, 1.0) , label: 1 , data: [5.8 2.6
4. 1.2]
index: 11 , vote_prob: (2, 1.0) , label: 2 , data: [7.1 3.
5.9 2.1]
We looked only at k items in the vicinity of an unknown object „UO", and had a majority vote. Using the
majority vote has shown quite efficient in our previous example, but this didn't take into account the following
reasoning: The farther a neighbor is, the more it "deviates" from the "real" result. Or in other words, we can
trust the closest neighbors more than the farther ones. Let's assume, we have 11 neighbors of an unknown item
UO. The closest five neighbors belong to a class A and all the other six, which are farther away belong to a
class B. What class should be assigned to UO? The previous approach says B, because we have a 6 to 5 vote
in favor of B. On the other hand the closest 5 are all A and this should count more.
To pursue this strategy, we can assign weights to the neighbors in the following way: The nearest neighbor of
an instance gets a weight 1 / 1, the second closest gets a weight of 1 / 2 and then going on up to 1 / k for the
farthest away neighbor.
K-NEAREST-NEIGHBOR CLASSIFIER 83
def vote_harmonic_weights(neighbors, all_results=True):
class_counter = Counter()
number_of_neighbors = len(neighbors)
for index in range(number_of_neighbors):
class_counter[neighbors[index][2]] += 1/(index+1)
labels, votes = zip(*class_counter.most_common())
#print(labels, votes)
winner = class_counter.most_common(1)[0][0]
votes4winner = class_counter.most_common(1)[0][1]
if all_results:
total = sum(class_counter.values(), 0.0)
for key in class_counter:
class_counter[key] /= total
return winner, class_counter.most_common()
else:
return winner, votes4winner / sum(votes)
for i in range(n_training_samples):
neighbors = get_neighbors(learn_data,
learn_labels,
test_data[i],
6,
distance=distance)
print("index: ", i,
", result of vote: ",
vote_harmonic_weights(neighbors,
all_results=True))
index: 0 , result of vote: (1, [(1, 1.0)])
index: 1 , result of vote: (2, [(2, 1.0)])
index: 2 , result of vote: (1, [(1, 1.0)])
index: 3 , result of vote: (1, [(1, 1.0)])
index: 4 , result of vote: (2, [(2, 0.9319727891156463), (1, 0.0
6802721088435375)])
index: 5 , result of vote: (2, [(2, 0.8503401360544217), (1, 0.1
4965986394557826)])
index: 6 , result of vote: (0, [(0, 1.0)])
index: 7 , result of vote: (1, [(1, 1.0)])
index: 8 , result of vote: (1, [(1, 1.0)])
index: 9 , result of vote: (0, [(0, 1.0)])
index: 10 , result of vote: (1, [(1, 1.0)])
index: 11 , result of vote: (2, [(2, 1.0)])
The previous approach took only the ranking of the neighbors according to their distance in account. We can
K-NEAREST-NEIGHBOR CLASSIFIER 84
improve the voting by using the actual distance. To this purpos we will write a new voting function:
for i in range(n_training_samples):
neighbors = get_neighbors(learn_data,
learn_labels,
test_data[i],
6,
distance=distance)
print("index: ", i,
", result of vote: ",
vote_distance_weights(neighbors,
all_results=True))
K-NEAREST-NEIGHBOR CLASSIFIER 85
index: 0 , result of vote: (1, [(1, 1.0)])
index: 1 , result of vote: (2, [(2, 1.0)])
index: 2 , result of vote: (1, [(1, 1.0)])
index: 3 , result of vote: (1, [(1, 1.0)])
index: 4 , result of vote: (2, [(2, 0.8490154592118361), (1, 0.1
5098454078816387)])
index: 5 , result of vote: (2, [(2, 0.6736137462184478), (1, 0.3
263862537815521)])
index: 6 , result of vote: (0, [(0, 1.0)])
index: 7 , result of vote: (1, [(1, 1.0)])
index: 8 , result of vote: (1, [(1, 1.0)])
index: 9 , result of vote: (0, [(0, 1.0)])
index: 10 , result of vote: (1, [(1, 1.0)])
index: 11 , result of vote: (2, [(2, 1.0)])
k = 2
for test_instance in [(0, 0, 0), (2, 2, 2),
(-3, -1, 0), (0, 1, 0.9),
(1, 1.5, 1.8), (0.9, 0.8, 1.6)]:
neighbors = get_neighbors(train_set,
labels,
test_instance,
k,
distance=distance)
K-NEAREST-NEIGHBOR CLASSIFIER 86
vote distance weights: ('orange', [('orange', 1.0)])
vote distance weights: ('apple', [('apple', 1.0)])
vote distance weights: ('banana', [('banana', 0.529411764705882
4), ('apple', 0.47058823529411764)])
vote distance weights: ('orange', [('orange', 1.0)])
vote distance weights: ('apple', [('apple', 1.0)])
vote distance weights: ('apple', [('apple', 0.5084745762711865),
('orange', 0.4915254237288135)])
KNN IN LINGUISTICS
The next example comes from computer linguistics. We show how we can use a k-nearest neighbor classifier
to recognize misspelled words.
We use a module called levenshtein, which we have implemented in our tutorial on Levenshtein Distance.
cities = open("data/city_names.txt").readlines()
cities = [city.strip() for city in cities]
K-NEAREST-NEIGHBOR CLASSIFIER 87
Can you help Marvin and James?
K-NEAREST-NEIGHBOR CLASSIFIER 88
You will need an English dictionary and a k-nearest Neighbor classifier to solve this problem. If you work
under Linux (especially Ubuntu), you can find a file with a British-English dictionary under /usr/share/dict/
british-english. Windows users and others can download the file as
british-english.txt
We use extremely misspelled words in the following example. We see that our simple vote_prob function is
doing well only in two cases: In correcting "holpposs" to "helpless" and "blagrufoo" to "barefoot". Whereas
our distance voting is doing well in all cases. Okay, we have to admit that we had "liberty" in mind, when we
wrote "liberdi", but suggesting "liberal" is a good choice.
words = []
with open("british-english.txt") as fh:
for line in fh:
word = line.strip()
words.append(word)
K-NEAREST-NEIGHBOR CLASSIFIER 89
for word in ["holpful", "kundnoss", "holpposs", "thoes", "innersta
nd",
"blagrufoo", "liberdi"]:
neighbors = get_neighbors(words,
words,
word,
3,
distance=levenshtein)
K-NEAREST-NEIGHBOR CLASSIFIER 90
vote_distance_weights: ('helpful', 0.5555555555555556)
vote_prob: ('helpful', 0.3333333333333333)
vote_distance_weights: ('helpful', [('helpful', 0.555555555555555
6), ('doleful', 0.22222222222222227), ('hopeful', 0.22222222222222
227)])
vote_distance_weights: ('kindness', 0.5)
vote_prob: ('kindness', 0.3333333333333333)
vote_distance_weights: ('kindness', [('kindness', 0.5), ('fondnes
s', 0.25), ('kudos', 0.25)])
vote_distance_weights: ('helpless', 0.3333333333333333)
vote_prob: ('helpless', 0.3333333333333333)
vote_distance_weights: ('helpless', [('helpless', 0.3333333333333
333), ("hippo's", 0.3333333333333333), ('hippos', 0.33333333333333
33)])
vote_distance_weights: ('hoes', 0.3333333333333333)
vote_prob: ('hoes', 0.3333333333333333)
vote_distance_weights: ('hoes', [('hoes', 0.3333333333333333),
('shoes', 0.3333333333333333), ('thees', 0.3333333333333333)])
vote_distance_weights: ('understand', 0.5)
vote_prob: ('understand', 0.3333333333333333)
vote_distance_weights: ('understand', [('understand', 0.5), ('int
erstate', 0.25), ('understands', 0.25)])
vote_distance_weights: ('barefoot', 0.4333333333333333)
vote_prob: ('barefoot', 0.3333333333333333)
vote_distance_weights: ('barefoot', [('barefoot', 0.4333333333333
333), ('Baguio', 0.2833333333333333), ('Blackfoot', 0.283333333333
3333)])
vote_distance_weights: ('liberal', 0.4)
vote_prob: ('liberal', 0.3333333333333333)
vote_distance_weights: ('liberal', [('liberal', 0.4), ('libert
y', 0.4), ('Hibernia', 0.2)])
K-NEAREST-NEIGHBOR CLASSIFIER 91
NEURAL NETWORKS
INTRODUCTION
When we say "Neural Networks", we
mean artificial Neural Networks (ANN).
The idea of ANN is based on biological
neural networks like the brain of living
being.
BIOLOGICAL NEURON
The following image by Quasar Jarosz, courtesy of Wikipedia, illustrates this:
NEURAL NETWORKS 92
ABSTRACTION OF A BIOLOGICAL NEURON AND ARTIFICIAL NEURON
Even though the above image is already an abstraction for a biologist, we can further abstract it:
It is amazingly simple, what is going on inside the body of a perceptron or neuron. The input signals get
multiplied by weight values, i.e. each input has its corresponding weight. This way the input can be adjusted
individually for every x i. We can see all the inputs as an input vector and the corresponding weights as the
weights vector.
When a signal comes in, it gets multiplied by a weight value that is assigned to this particular input. That is, if
a neuron has three inputs, then it has three weights that can be adjusted individually. The weights usually get
adjusted during the learn phase.
After this the modified input signals are summed up. It is also possible to add additionally a so-called bias 'b'
to this sum. The bias is a value which can also be adjusted during the learn phase.
Finally, the actual output has to be determined. For this purpose an activation or step function Φ is applied to
the weighted sum of the input values.
NEURAL NETWORKS 93
The simplest form of an activation function is a binary function. If the result of the summation is greater than
some threshold s, the result of Φ will be 1, otherwise 0.
Φ(x) =
{ 1
0
wx + b > s
otherwise
• Roundworm: 302
• Jellyfish
NEURAL NETWORKS 94
In [ ]:
NEURAL NETWORKS 95
FROM DIVIDING LINES TO NEURAL
NETWORKS
We will develop a simple neural network in this chapter of our tutorial. A network capable of separating two
classes, which are separable by a straight line in a 2-dimensional feature space.
LINE SEPARATION
Before we start programming a simple neural
network, we are going to develop a different concept.
We want to search for straight lines that separate two
points or two classes in a plane. We will only look at
straight lines going through the origin. We will look
at general straight lines later in the tutorial.
We could define dividing lines to define the points which are more lemon-like and which are more orange-
like.
In the following diagram, we depict one lemon and one orange. The green line is separating both points. We
assume that all other lemons are above this line and all oranges will be below this line.
y = mx
where:
m is the slope or gradient of the line and x is the independent variable of the function.
p2
m= x
p1
This means that a point P ′ = (p ′ , p ′ ) is on this line, if the following condition is fulfilled:
1 2
mp ′ − p ′ = 0
1 2
The following Python program plots a graph depicting the previously described situation:
It is clear that a point A = (a 1, a 2) is not on the line, if m ⋅ a 1 − a 2 is not equal to 0. We want to know more.
We want to know, if a point is above or below a straight line.
m ⋅ b 1 − (b 2 + δ B) = 0
m ⋅ b1 − b2 = δB
Finally, we have a criteria for a point to be below the line. m ⋅ b 1 − b 2 is positve, because δ B is positive.
The reasoning for "a point is above the line" is analogue: If a point A = (a 1, a 2) is above the line, there must
be a δ A > 0 so that the point (a 1, a 2 − δ A) will be on the line.
m ⋅ a 1 − (a 2 − δ A) = 0
m ⋅ a1 − a2 = − δA
We can now verify this on our fruits. The lemon has the coordinates (1.1, 3.9) and the orange the coordinates
3.5, 1.8. The point on the line, which we used to define our separation straight line has the values (4, 4.5). So
m is 4.5 divides by 4.
We did not calculate the green line using mathematical formulas or methods, but arbitrarily determined it by
visual judgement. We could have chosen other lines as well.
The following Python program calculates and renders a bunch of lines. All going through the origin, i.e. the
point (0, 0). The red ones are completely unusable for the purpose of separating the two fruits, because in
these cases both the lemon and the orange are on the same side of the straight line. However, it is obvious that
even the green ones might not be too useful if we have more than these two fruits. Some lemons might be
sweeter and some oranges can be quite sour.
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.set_xlabel("sweetness")
ax.set_ylabel("sourness")
x_min, x_max = -1, 7
y_min, y_max = -1, 8
ax.set_xlim([x_min, x_max])
ax.set_ylim([y_min, y_max])
X = np.arange(x_min, x_max, 0.1)
step = 0.05
for x in np.arange(0, 1+step, step):
slope = np.tan(np.arccos(x))
dist4line1 = create_distance_function(slope, -1, 0)
Y = slope * X
results = []
for point in fruits_coords:
results.append(dist4line1(*point))
if (results[0][1] != results[1][1]):
ax.plot(X, Y, "g-", linewidth=0.8, alpha=0.9)
else:
ax.plot(X, Y, "r-", linewidth=0.8, alpha=0.9)
size = 10
for (index, (x, y)) in enumerate(fruits_coords):
if index== 0:
ax.plot(x, y, "o",
color="darkorange",
markersize=size)
plt.show()
Basically, we have carried out a classification based on our dividing line. Even if hardly anyone would
describe this as such.
It is easy to imagine that we have more lemons and oranges with slightly different sourness and sweetness
values. This means we have a class of lemons ( class1 ) and a class of oranges class2 . This is depicted
in the following diagram.
import numpy as np
import matplotlib.pyplot as plt
def points_within_circle(radius,
center=(0, 0),
number_of_points=100):
center_x, center_y = center
r = radius * np.sqrt(np.random.random((number_of_points,)))
theta = np.random.random((number_of_points,)) * 2 * np.pi
x = center_x + r * np.cos(theta)
y = center_y + r * np.sin(theta)
return x, y
X = np.arange(0, 8)
fig, ax = plt.subplots()
oranges_x, oranges_y = points_within_circle(1.6, (5, 2), 100)
lemons_x, lemons_y = points_within_circle(1.9, (2, 5), 100)
ax.scatter(oranges_x,
oranges_y,
c="orange",
label="oranges")
ax.scatter(lemons_x,
ax.legend()
ax.grid()
plt.show()
The dividing line was again arbitrarily set by eye. The question arises how to do this systematically? We are
still only looking at straight lines going through the origin, which are uniquely defined by its slope. the
following Python program calculates a dividing line by going through all the fruits and dynamically adjusts
the slope of the dividing line we want to calculate. If a point is above the line but should be below the line, the
slope will be increment by the value of learning_rate . If the point is below the line but should be above
the line, the slope will be decremented by the value of learning_rate .
import numpy as np
import matplotlib.pyplot as plt
from itertools import repeat
from random import shuffle
X = np.arange(0, 8)
fig, ax = plt.subplots()
ax.scatter(oranges_x,
oranges_y,
c="orange",
label="oranges")
ax.scatter(lemons_x,
fruits = list(zip(oranges_x,
oranges_y,
repeat(0, len(oranges_x))))
fruits += list(zip(lemons_x,
lemons_y,
repeat(1, len(oranges_x))))
shuffle(fruits)
slope = adjust()
ax.plot(X,
slope * X,
linewidth=2)
ax.legend()
ax.grid()
plt.show()
X = np.arange(0, 8)
fig, ax = plt.subplots()
ax.scatter(oranges_x,
oranges_y,
c="orange",
label="oranges")
ax.scatter(lemons_x,
lemons_y,
c="y",
label="lemons")
print(slope)
We are going to define a neural network to classify the previous data sets. Our neural network will only
consist of one neuron. A neuron with two input values, one for 'sourness' and one for 'sweetness'.
The two input values - called in_data in our Python program below - have to be weighted by weight
values. So solve our problem, we define a Perceptron class. An instance of the class is a Perceptron (or
Neuron). It can be initialized with the input_length, i.e. the number of input values, and the weights, which can
be given as a list, tuple or an array. If there are no values for the weights given or the parameter is set to None,
we will initialize the weights to 1 / input_length.
In the following example choose -0.45 and 0.5 as the values for the weights. This is not the normal way to do
it. A Neural Network calculates the weights automatically during its training phase, as we will learn later.
import numpy as np
p = Perceptron(weights=[-0.45, 0.5])
We can see that we get a negative value, if we input an orange and a posive value, if we input a lemon. With
this knowledge, we can calculate the accuracy of our neural network on this data set:
print(evaluation)
Counter({'corrects': 200})
How does the calculation work? We multiply the input values with the weights and get negative and positive
values. Let us examine what we get, if the calculation results in 0:
w1 ⋅ x1 + w2 ⋅ x2 = 0
w1
x2 = − ⋅ x1
w2
y=m⋅x+c
where:
We can easily see that our equation corresponds to the definition of a line and the slope (aka gradient) m is
w1
− w and c is equal to 0.
2
This is a straight line separating the oranges and lemons, which is called the decision boundary.
import time
import matplotlib.pyplot as plt
slope = 0.1
X = np.arange(0, 8)
ax.grid()
plt.show()
print(slope)
0.9
Before we start with this task, we will separate our data into training and test data in the following Python
program. By setting the random_state to the value 42 we will have the same output for every run, which can
be benifial for debugging purposes.
As we start with two arbitrary weights, we cannot expect the result to be correct. For some points (fruits) it
may return the proper value, i.e. 1 for a lemon and 0 for an orange. In case we get the wrong result, we have to
correct our weight values. First we have to calculate the error. The error is the difference between the target or
expected value ( target_result ) and the calculated value ( calculated_result ). With this error
we have to adjust the weight values with an incremental value, i.e. w 1 = w 1 + Δw 1 and w 2 = w 2 + Δw 2
We are ready now to write the code for adapting the weights, which means training the network. For this
purpose, we add a method 'adjust' to our Perceptron class. The task of this method is to crrect the error.
import numpy as np
from collections import Counter
class Perceptron:
def __init__(self,
weights,
learning_rate=0.1):
"""
'weights' can be a numpy array, list or a tuple with the
actual values of the weights. The number of input values
is indirectly defined by the length of 'weights'
"""
self.weights = np.array(weights)
self.learning_rate = learning_rate
@staticmethod
def unit_step_function(x):
if x < 0:
return 0
else:
return 1
def adjust(self,
target_result,
calculated_result,
in_data):
if type(in_data) != np.ndarray:
in_data = np.array(in_data) #
error = target_result - calculated_result
if error != 0:
correction = error * in_data * self.learning_rate
self.weights += correction
#print(target_result, calculated_result, error, in_dat
a, correction, self.weights)
p = Perceptron(weights=[0.1, 0.1],
learning_rate=0.3)
print(p.weights)
[('correct', 160)]
[('correct', 40)]
[-1.68135341 2.07512397]
X = np.arange(0, 7)
fig, ax = plt.subplots()
w1 = p.weights[0]
w2 = p.weights[1]
m = -w1 / w2
ax.plot(X, m * X, label="decision boundary")
ax.legend()
plt.show()
print(p.weights)
[-1.68135341 2.07512397]
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
p = Perceptron(weights=[0.1, 0.1],
learning_rate=0.3)
number_of_colors = 7
colors = cm.rainbow(np.linspace(0, 1, number_of_colors))
fig, ax = plt.subplots()
ax.set_xticks(range(8))
ax.set_ylim([-2, 8])
counter = 0
for index in range(len(train_data)):
old_weights = p.weights.copy()
p.adjust(train_labels[index],
p(train_data[index]),
train_data[index])
if not np.array_equal(old_weights, p.weights):
color = "orange" if train_labels[index] == 0 else
"y"
ax.scatter(train_data[index][0],
train_data[index][1],
color=color)
ax.annotate(str(counter),
(train_data[index][0], train_data[index][1]))
m = -p.weights[0] / p.weights[1]
print(index, m, p.weights, train_data[index])
ax.plot(X, m * X, label=str(counter), color=colors[counte
r])
counter += 1
ax.legend()
plt.show()
Each of the points in the diagram above cause a change in the weights. We see them numbered in the order of
their appearance and the corresponding straight line. This way we can see how the networks "learns".
Our classes have been linearly separable. Linear separability make sense
in Euclidean geometry. Two sets of points (or classes) are called linearly
separable, if at least one straight line in the plane exists so that all the
points of one class are on one side of the line and all the points of the other
class are on the other side.
More formally:
∑ xi ⋅ wi = 0
i=1
Otherwise, i.e. if such a decision boundary does not exist, the two classes are called linearly inseparable. In
this case, we cannot use a simple neural network.
0 0 0
0 1 0
1 0 0
1 1 1
We learned in the previous chapter that a neural network with one perceptron and two input values can be
interpreted as a decision boundary, i.e. straight line dividing two classes. The two classes we want to classify
in our example look like this:
fig, ax = plt.subplots()
xmin, xmax = -0.2, 1.4
X = np.arange(xmin, xmax, 0.1)
ax.scatter(0, 0, color="r")
ax.scatter(0, 1, color="r")
ax.scatter(1, 0, color="r")
ax.scatter(1, 1, color="g")
ax.set_xlim([xmin, xmax])
ax.set_ylim([-0.1, 1.1])
m = -1
#ax.plot(X, m * X + 1.2, label="decision boundary")
plt.plot()
Output: []
We also found out that such a primitive neural network is only capable of creating straight lines going through
the origin. So dividing lines like this:
fig, ax = plt.subplots()
xmin, xmax = -0.2, 1.4
X = np.arange(xmin, xmax, 0.1)
ax.set_xlim([xmin, xmax])
ax.set_ylim([-0.1, 1.1])
m = -1
for m in np.arange(0, 6, 0.1):
ax.plot(X, m * X )
ax.scatter(0, 0, color="r")
ax.scatter(0, 1, color="r")
ax.scatter(1, 0, color="r")
ax.scatter(1, 1, color="g")
plt.plot()
Output: []
We can see that none of these straight lines can be used as decision boundary nor any other lines going
through the origin.
We need a line
y=m⋅x+c
y = − x + 1.2
fig, ax = plt.subplots()
xmin, xmax = -0.2, 1.4
X = np.arange(xmin, xmax, 0.1)
ax.scatter(0, 0, color="r")
ax.scatter(0, 1, color="r")
ax.scatter(1, 0, color="r")
ax.scatter(1, 1, color="g")
ax.set_xlim([xmin, xmax])
ax.set_ylim([-0.1, 1.1])
m, c = -1, 1.2
ax.plot(X, m * X + c )
plt.plot()
Output: []
The question now is whether we can find a solution with minor modifications of our network model? Or in
other words: Can we create a perceptron capable of defining arbitrary decision boundaries?
A perceptron with two input values and a bias corresponds to a general straight line. With the aid of the bias
value b we can train the perceptron to determine a decision boundary with a non zero intercept c .
∑ wi ⋅ xi + wn + 1 ⋅ b = 0
i=1
w1 ⋅ x1 + w2 ⋅ x2 + w3 ⋅ b = 0
w1 w3
x2 = − ⋅ x1 − ⋅b
w2 w2
This means:
w1
m= −
w2
and
w3
c= − ⋅b
w2
import numpy as np
from collections import Counter
class Perceptron:
def __init__(self,
@staticmethod
def unit_step_function(x):
if x <= 0:
return 0
else:
return 1
def adjust(self,
target_result,
in_data):
if type(in_data) != np.ndarray:
in_data = np.array(in_data) #
calculated_result = self(in_data)
error = target_result - calculated_result
if error != 0:
in_data = np.concatenate( (in_data, [self.bias]) )
correction = error * in_data * self.learning_rate
self.weights += correction
import numpy as np
from perceptrons import Perceptron
def labelled_samples(n):
for _ in range(n):
s = np.random.randint(0, 2, (2,))
yield (s, 1) if s[0] == 1 and s[1] == 1 else (s, 0)
fig, ax = plt.subplots()
xmin, xmax = -0.2, 1.4
X = np.arange(xmin, xmax, 0.1)
ax.scatter(0, 0, color="r")
ax.scatter(0, 1, color="r")
ax.scatter(1, 0, color="r")
ax.scatter(1, 1, color="g")
ax.set_xlim([xmin, xmax])
ax.set_ylim([-0.1, 1.1])
m = -p.weights[0] / p.weights[1]
c = -p.weights[2] / p.weights[1]
print(m, c)
ax.plot(X, m * X + c )
plt.plot()
We will create another example with linearly separable data sets, which need a bias node to be separable. We
will use the make_blobs function from sklearn.datasets :
n_samples = 250
samples, labels = make_blobs(n_samples=n_samples,
centers=([2.5, 3], [6.7, 7.9]),
random_state=0)
fig, ax = plt.subplots()
X = np.arange(np.max(samples[:,0]))
m = -p.weights[0] / p.weights[1]
c = -p.weights[2] / p.weights[1]
print(m, c)
ax.plot(X, m * X + c )
plt.plot()
plt.show()
-1.5513529034664024 11.736643489707035
In the following section, we will introduce the XOR problem for neural networks. It is the simplest example of
a non linearly separable neural network. It can be solved with an additional layer of neurons, which is called a
hidden layer.
0 0 0
0 1 1
1 0 1
1 1 0
This problem can't be solved with a simple neural network, as we can see in the following diagram:
No matter which straight line you choose, you will never succeed in having the blue points on one side and the
orange points on the other side. This is shown in the following figure. The orange points are on the orange
line. This means that this cannot be a dividing line. If we move this line parallel - no matter which direction,
there will be always two orange and one blue point on one side and only one blue point on the other side. If we
move the orange line in a non parallel way, there will be one blue and one orange point on either side, except
if the line goes through an orange point. So there is no way for a single straight line separating those points.
We will need only one hidden layer with two neurons. One works like an AND gate and the other one like an
OR gate. The output will "fire", when the OR gate fires and the AND gate doesn't.
As we had already mentioned, we cannot find a line which separates the orange points from the blue points.
But they can be separated by two lines, e.g. L1 and L2 in the following diagram:
The neuron N1 will determine one line, e.g. L1 and the neuron N2 will determine the other line L2. N3 will
finally solve our problem:
EXERCISE 1
We could extend the logical AND to float values between 0 and 1 in the following way:
Try to train a neural network with only one perceptron. Why doesn't it work?
EXERCISE 2
A point belongs to a class 0, if x 1 < 0.5 and belongs to class 1, if x 1 >= 0.5. Train a network with one
perceptron to classify arbitrary points. What can you say about the dicision boundary? What about the input
values x 2
def labelled_samples(n):
for _ in range(n):
s = np.random.random((2,))
yield (s, 1) if s[0] >= 0.5 and s[1] >= 0.5 else (s, 0)
The easiest way to see, why it doesn't work, is to visualize the data.
fig, ax = plt.subplots()
xmin, xmax = -0.2, 1.2
X, Y = list(zip(*ones))
ax.scatter(X, Y, color="g")
X, Y = list(zip(*zeroes))
ax.scatter(X, Y, color="r")
ax.set_xlim([xmin, xmax])
ax.set_ylim([-0.1, 1.1])
c = -p.weights[2] / p.weights[1]
m = -p.weights[0] / p.weights[1]
X = np.arange(xmin, xmax, 0.1)
ax.plot(X, m * X + c, label="decision boundary")
We can see that the green points and the red points are not separable by one straight line.
import numpy as np
from collections import Counter
def labelled_samples(n):
for _ in range(n):
s = np.random.random((2,))
yield (s, 0) if s[0] < 0.5 else (s, 1)
print(p.weights)
p.evaluate(test_data, test_labels)
fig, ax = plt.subplots()
xmin, xmax = -0.2, 1.2
X, Y = list(zip(*ones))
ax.scatter(X, Y, color="g")
X, Y = list(zip(*zeroes))
ax.scatter(X, Y, color="r")
ax.set_xlim([xmin, xmax])
ax.set_ylim([-0.1, 1.1])
c = -p.weights[2] / p.weights[1]
m = -p.weights[0] / p.weights[1]
X = np.arange(xmin, xmax, 0.1)
ax.plot(X, m * X + c, label="decision boundary")
Output: [<matplotlib.lines.Line2D at 0x7fabe8bc89d0>]
p.weights, m
Output: (array([ 2.03831116, -0.1785671 , -0.9 ]), 11.414819026
425487)
INTRODUCTION
In the previous chapter, we had
implemented a simple Perceptron class
using pure Python. The module
sklearn contains a Perceptron
class. We saw that a perceptron is an
algorithm to solve binary classifier
problems. This means that a Perceptron is
abinary classifier, which can decide
whether or not an input belongs to one or
the other class. E.g. "spam" or "ham". We
accomplished this by linearly combining
weights with the feature vector, i.e. the
input.
It is amazing that the perceptron algorithm was already invented in the year 1958 by Frank Rosenblatt. The
algorithm was implemented in custom-built hardware, called "Mark 1 perceptron". This hardware was
designed for image recognition.
The invention has been extremely overestimated: In 1958 the New York Times wrote after a press conference
with Rosenblatt: "New Navy Device Learns By Doing; Psychologist Shows Embryo of Computer Designed to
Read and Grow Wiser"
What initially seemed very promising was quickly proved incapable of keeping its promises. Thes perceptrons
could not be trained to recognise many classes of patterns.
n_samples = 50
data, labels = make_blobs(n_samples=n_samples,
centers=([1.1, 3], [4.5, 6.9]),
random_state=0)
We can calculate predictions on the learnset and testset and can evaluate the score:
p.score(train_data, train_labels)
Output: 1.0
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
We have one problem: The Perceptron classifiert can only be used on binary classification problems, but
the Iris dataset consists fo three different classes, i.e. 'setosa', 'versicolor', 'virginica', corresponding to the
labels 0, 1, and 2:
iris.target_names
Output: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
We will merge the classes 'versicolor' and 'virginica' into one class. This means that only two classes are left.
So we can differentiate with the classifier between
• Iris setose
• not Iris setosa, or in other words either 'viriginica' od 'versicolor'
targets = (iris.target==0).astype(np.int8)
Now, we are ready for predictions and we will look at some randomly chosen random X values:
import random
print(classification_report(p.predict(train_data), train_labels))
precision recall f1-score support
print(classification_report(p.predict(test_data), test_labels))
precision recall f1-score support
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
INTRODUCTION
We have to see how to initialize the weights and how to efficiently multiply the weights with the input values.
In the following chapters we will design a neural network in Python, which consists of three layers, i.e. the
input layer, a hidden layer and an output layer. You can see this neural network structure in the following
diagram. We have an input layer with three nodes i 1, i 2, i 3 These nodes get the corresponding input values
x 1, x 2, x 3. The middle or hidden layer has four nodes h 1, h 2, h 3, h 4. The input of this layer stems from the
input layer. We will discuss the mechanism soon. Finally, our output layer consists of the two nodes o 1, o 2
The input layer is different from the other layers. The nodes of the input layer are passive. This means that the
input neurons do not change the data, i.e. there are no weights used in this case. They receive a single value
and duplicate this value to their many outputs.
import numpy as np
In the algorithm, which we will write later, we will have to transpose it into a column vector, i.e. a two-
dimensional array with just one column:
import numpy as np
The value x 1 going into the node i 1 will be distributed according to the values of the weights. In the following
diagram we have added some example values. Using these values, the input values (Ih 1, Ih 2, Ih 3, Ih 4 into the
nodes (h 1, h 2, h 3, h 4) of the hidden layer can be calculated like this:
Those familiar with matrices and matrix multiplication will see where it is boiling down to. We will redraw
our network and denote the weights with w ij:
Now that we have defined our weight matrices, we have to take the next step. We have to multiply the matrix
wih the input vector. Btw. this is exactly what we have manually done in our previous example.
()( )( ) ( )
y1 w 11 w 12 w 13 w 11 ⋅ x 1 + w 12 ⋅ x 2 + w 13 ⋅ x 3
x1
y2 w 21 w 22 w 23 w 21 ⋅ x 1 + w 22 ⋅ x 2 + w 23 ⋅ x 3
= x2 =
y3 w 31 w 32 w 33 w 31 ⋅ x 1 + w 32 ⋅ x 2 + w 33 ⋅ x 3
x3
y4 w 41 w 42 w 43 w 41 ⋅ x 1 + w 42 ⋅ x 2 + w 43 ⋅ x 3
We have a similar situation for the 'who' matrix between hidden and output layer. So the output z 1 and z 2 from
the nodes o 1 and o 2 can also be calculated with matrix multiplications:
()
y1
()(
z1
z2
=
wh 11
wh 21
wh 12
wh 22
wh 13 wh 14
wh 23 wh 24 ) ( y2
y3
y4
=
wh 11 ⋅ y 1 + wh 12 ⋅ y 2 + wh 13 ⋅ y 3 + wh 14 ⋅ y 4
wh 21 ⋅ y 1 + wh 22 ⋅ y 2 + wh 23 ⋅ y 3 + wh 24 ⋅ y 4 )
You might have noticed that something is missing in our previous calculations. We showed in our introductory
The following picture depicts the whole flow of calculation, i.e. the matrix multiplication and the succeeding
application of the activation function.
The matrix multiplication between the matrix wih and the matrix of the values of the input nodes x 1, x 2, x 3
calculates the output which will be passed to the activation function.
Even though treatment is completely analogue, we will also have a detailled look at what is going on between
our hidden layer and the output layer:
As we have seen the input to all the nodes except the input nodes is calculated by applying the activation
function to the following sum:
n
yj = ∑ w ji ⋅ x i
i=1
(with n being the number of nodes in the previous layer and y j is the input to a node of the next layer)
We can easily see that it would not be a good idea to set all the weight values to 0, because in this case the
result of this summation will always be zero. This means that our network will be incapable of learning. This
is the worst choice, but initializing a weight matrix to ones is also a bad choice.
The values for the weight matrices should be chosen randomly and not arbitrarily. By choosing a random
normal distribution we have broken possible symmetric situations, which can and often are bad for the
learning process.
There are various ways to initialize the weight matrices randomly. The first one we will introduce is the unity
function from numpy.random. It creates samples which are uniformly distributed over the half-open interval
[low, high), which means that low is included and high is excluded. Each value within the given interval is
equally likely to be drawn by 'uniform'.
import numpy as np
number_of_samples = 1200
low = -1
high = 0
s = np.random.uniform(low, high, number_of_samples)
The histogram of the samples, created with the uniform function in our previous example, looks like this:
binomial(n, p, size=None)
It draws samples from a binomial distribution with specified parameters, n trials and probability p of
success where n is an integer >= 0 and p is a float in the interval [0,1]. ( n may be input as a float, but
it is truncated to an integer in use)
The standard form of this distribution is a standard normal truncated to the range [a, b] — notice that a and b
are defined over the domain of the standard normal. To convert clip values for a specific mean and standard
deviation, use:
plt.hist(s)
plt.show()
plt.hist(s)
plt.show()
Further examples:
We will create the link weights matrix now. truncated_normal is ideal for this purpose. It is a good
idea to choose random values from within the interval
1 1
(− , )
√n √n
where n denotes the number of input nodes.
no_of_input_nodes = 3
no_of_hidden_nodes = 4
rad = 1 / np.sqrt(no_of_input_nodes)
no_of_hidden_nodes = 4
no_of_output_nodes = 2
rad = 1 / np.sqrt(no_of_hidden_nodes) # this is the input in thi
s layer!
We will postpone the definition of the train and run method until later. The weight matrices should be
initialized inside of the __init__ method. We do this indirectly. We define a method
create_weight_matrices and call it in __init__ . In this way, the init method remains clear.
import numpy as np
from scipy.stats import truncnorm
class NeuralNetwork:
def __init__(self,
no_of_in_nodes,
no_of_out_nodes,
no_of_hidden_nodes,
learning_rate):
self.no_of_in_nodes = no_of_in_nodes
self.no_of_out_nodes = no_of_out_nodes
self.no_of_hidden_nodes = no_of_hidden_nodes
self.learning_rate = learning_rate
self.create_weight_matrices()
def create_weight_matrices(self):
rad = 1 / np.sqrt(self.no_of_in_nodes)
X = truncated_normal(mean=0, sd=1, low=-rad, upp=rad)
self.weights_in_hidden = X.rvs((self.no_of_hidden_nodes,
self.no_of_in_nodes))
rad = 1 / np.sqrt(self.no_of_hidden_nodes)
X = truncated_normal(mean=0, sd=1, low=-rad, upp=rad)
self.weights_hidden_out = X.rvs((self.no_of_out_nodes,
self.no_of_hidden_nodes))
def train(self):
pass
def run(self):
pass
We cannot do a lot with this code, but we can at least initialize it. We can also have a look at the weight
matrices:
simple_network = NeuralNetwork(no_of_in_nodes = 3,
The input values of a perceptron are processed by the summation function and followed by an activation
function, transforming the output of the summation function into a desired and more suitable output. The
summation function means that we will have a matrix multiplication of the weight vectors and the input
values.
There are lots of different activation functions used in neural networks. One of the most comprehensive
overviews of possible activation functions can be found at Wikipedia.
The sigmoid function is one of the often used activation functions. The sigmoid function, which we are using,
is also known as the Logistic function.
It is defined as
1
σ(x) =
1 + e −x
Let us have a look at the graph of the sigmoid function. We use matplotlib to plot the sigmoid function:
import numpy as np
X = np.linspace(-5, 5, 100)
plt.plot(X, sigma(X),'b')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('Sigmoid Function')
plt.grid()
plt.show()
Looking at the graph, we can see that the sigmoid function maps a given number x into the range of numbers
between 0 and 1. 0 and 1 not included! As the value of x gets larger, the value of the sigmoid function gets
closer and closer to 1 and as x gets smaller, the value of the sigmoid function is approaching 0.
Instead of defining the sigmoid function ourselves, we can also use the expit function from
scipy.special , which is an implementation of the sigmoid function. It can be applied on various data
classes like int, float, list, numpy,ndarray and so on. The result is an ndarray of the same shape as the input
data x.
The logistic function is often often used in neural networks to introduce nonlinearity in the model and to map
signals into a specified range, i.e. 0 and 1. It is also well liked because the derivative - needed in
backpropagation - is simple.
1
σ(x) =
1 + e −x
import numpy as np
import matplotlib.pyplot as plt
def sigma(x):
return 1 / (1 + np.exp(-x))
X = np.linspace(-5, 5, 100)
plt.plot(X, sigma(X))
plt.plot(X, sigma(X) * (1 - sigma(X)))
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('Sigmoid Function')
plt.grid()
plt.show()
@np.vectorize
def sigmoid(x):
return 1 / (1 + np.e ** -x)
#sigmoid = np.vectorize(sigmoid)
sigmoid([3, 4, 5])
Output: array([0.95257413, 0.98201379, 0.99330715])
Another easy to use activation function is the ReLU function. ReLU stands for rectified linear unit. It is also
known as the ramp function. It is defined as the positve part of its argument, i.e. y = max (0, x). This is
"currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU)"1 The
ReLu function is computationally more efficient than Sigmoid like functions, because Relu means only
choosing the maximum between 0 and the argument x . Whereas Sigmoids need to perform expensive
exponential operations.
# derivation of relu
def ReLU_derivation(x):
if x <= 0:
return 0
else:
return 1
X = np.linspace(-5, 6, 100)
plt.plot(X, ReLU(X),'b')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('ReLU Function')
plt.grid()
plt.text(0.8, 0.4, r'$ReLU(x)=max(0, x)$', fontsize=14)
plt.show()
import numpy as np
from scipy.special import expit as activation_function
class NeuralNetwork:
def __init__(self,
no_of_in_nodes,
no_of_out_nodes,
no_of_hidden_nodes,
learning_rate):
self.no_of_in_nodes = no_of_in_nodes
self.no_of_out_nodes = no_of_out_nodes
self.no_of_hidden_nodes = no_of_hidden_nodes
self.learning_rate = learning_rate
self.create_weight_matrices()
def create_weight_matrices(self):
""" A method to initialize the weight matrices of the neur
al network"""
rad = 1 / np.sqrt(self.no_of_in_nodes)
X = truncated_normal(mean=0, sd=1, low=-rad, upp=rad)
self.weights_in_hidden = X.rvs((self.no_of_hidden_nodes,
self.no_of_in_nodes))
rad = 1 / np.sqrt(self.no_of_hidden_nodes)
X = truncated_normal(mean=0, sd=1, low=-rad, upp=rad)
self.weights_hidden_out = X.rvs((self.no_of_out_nodes,
self.no_of_hidden_nodes))
We can instantiate an instance of this class, which will be a neural network. In the following example we
create a network with two input nodes, four hidden nodes, and two output nodes.
simple_network = NeuralNetwork(no_of_in_nodes=2,
no_of_out_nodes=2,
no_of_hidden_nodes=4,
learning_rate=0.6)
We can apply the run method to all arrays with a shape of (2,), also lists and tuples with two numerical
elements. The result of the call is defined by the random values of the weights:
simple_network.run([(3, 4)])
Output: array([[0.54558831],
[0.6834667 ]])
FOOTNOTES
1
Ramachandran, Prajit; Barret, Zoph; Quoc, V. Le (October 16, 2017). "Searching for Activation Functions".
INTRODUCTION
We already wrote in the previous chapters of our
tutorial on Neural Networks in Python. The networks
from our chapter Running Neural Networks lack the
capabilty of learning. They can only be run with
randomly set weight values. So we cannot solve any
classification problems with them. However, the
networks in Chapter Simple Neural Networks were
capable of learning, but we only used linear networks
for linearly separable classes.
Quite often people are frightened away by the mathematics used in it. We try to explain it in simple terms.
Explaining gradient descent starts in many articles or tutorials with mountains. Imagine you are put on a
mountain, not necessarily the top, by a helicopter at night or heavy fog. Let's further imagine that this
mountain is on an island and you want to reach sea level. You have to go down, but you hardly see anything,
maybe just a few metres. Your task is to find your way down, but you cannot see the path. You can use the
method of gradient descent. This means that you are examining the steepness at your current position. You
will proceed in the direction with the steepest descent. You take only a few steps and then you stop again to
reorientate yourself. This means you are applying again the previously described procedure, i.e. you are
looking for the steepest descend.
Each direction goes upwards. You may have reached the deepest level - the global minimum -, but you might
as well be stuck in a basin. If you start at the position on the right side of our image, everything works out fine,
but from the leftside, you will be stuck in a local minimum.
BACKPROPAGATION IN DETAIL
Now, we have to go into the details, i.e. the mathematics.
We will start with the simpler case. We look at a linear network. Linear neural networks are networks where
the output signal is created by summing up all the weighted input signals. No activation function will be
applied to this sum, which is the reason for the linearity.
When we are training the network we have samples and corresponding labels. For each output value o i we
have a label t i, which is the target or the desired value. If the label is equal to the output, the result is correct
ei = ti − oi
We will later use a squared error function, because it has better characteristics for the algorithm:
1
ei = (t − o i) 2
2 i
We want to clarify how the error backpropagates with the following example with values:
We will have a look at the output value o 1, which is depending on the values w 11, w 12, w 13 and w 14. Let's
assume the calculated value (o 1) is 0.92 and the desired value (t 1) is 1. In this case the error is
e 1 = t 1 − o 1 = 1 − 0.92 = 0.08
e 2 = t 2 − o 2 = 1 − 0.18 = 0.82
w 11
e1 ⋅
∑ 4 w 1i
i=1
0.6
0.08 ⋅ = 0.0343
0.6 + 0.1 + 0.15 + 0.25
The total error in our weight matrix between the hidden and the output layer - we called it in our previous
chapter 'who' - looks like this
[]
w 12 w 22 w 32
e1
∑ 4 w 1i ∑ 4 w 2i ∑ 4 w 3i
i=1 i=1 i=1
e who = ⋅ e2
w 13 w 23 w 33
∑ 4 w 1i ∑ 4 w 2i ∑ 4 w 3i
e3
i=1 i=1 i=1
w 14 w 24 w 34
∑ 4 w 1i ∑ 4 w 2i ∑ 4 w 3i
i=1 i=1 i=1
You can see that the denominator in the left matrix is always the same. It functions like a scaling factor. We
can drop it so that the calculation gets a lot simpler:
[ ][ ]
w 11 w 21 w 31
e1
w 12 w 22 w 32
e who = ⋅ e2
w 13 w 23 w 33
e3
w 14 w 24 w 34
If you compare the matrix on the right side with the 'who' matrix of our chapter Neuronal Network Using
Python and Numpy, you will notice that it is the transpose of 'who'.
e who = who. T ⋅ e
So, this has been the easy part for linear neural networks. We haven't taken into account the activation function
until now.
We want to calculate the error in a network with an activation function, i.e. a non-linear network. The
derivation of the error function describes the slope. As we mentioned in the beginning of the this chapter, we
want to descend. The derivation describes how the error E changes as the weight w kj changes:
The error function E over all the output nodes o i (i = 1, . . . n) where n is the total number of output nodes:
n
1
E= ∑ 2 (t i − o i) 2
i=1
If you have a look at our example network, you will see that an output node o k only depends on the input
signals created with the weights w ki with i = 1, …m and m the number of hidden nodes.
This means that we can calculate the error for every output node independently of each other. This means that
we can remove all expressions t i − o i with i ≠ k from our summation. So the calculation of the error for a node
k looks a lot simpler now:
∂E ∂ 1
= (t − o k) 2
∂w kj ∂w kj 2 k
The target value t k is a constant, because it is not depending on any input signals or weights. We can apply the
chain rule for the differentiation of the previous term to simplify things:
In the previous chapter of our tutorial, we used the sigmoid function as the activation function:
1
σ(x) =
1 + e −x
The output node o k is calculated by applying the sigmoid function to the sum of the weighted input signals.
This means that we can further transform our derivative term by replacing o k by this function:
m
∂E ∂
= (t k − o k) ⋅ σ( ∑ w h )
∂w kj ∂w kj i = 1 ki i
∂σ(x)
= σ(x) ⋅ (1 − σ(x))
∂x
The last part has to be differentiated with respect to w kj. This means that the derivation of all the products will
be 0 except the the term w kjh j) which has the derivative h j with respect to w kj:
m m
∂E
= (t k − o k) ⋅ σ( ∑ w kih i) ⋅ (1 − σ( ∑ w kih i)) ⋅ h j
∂w kj i=1 i=1
This is what we need to implement the method 'train' of our NeuralNetwork class in the following chapter.
In [ ]:
INTRODUCTION
In the chapter "Running Neural
Networks", we programmed a class in
Python code called 'NeuralNetwork'. The
instances of this class are networks with
three layers. When we instantiate an ANN
of this class, the weight matrices between
the layers are automatically and randomly
chosen. It is even possible to run such a
ANN on some input, but naturally it
doesn't make a lot of sense exept for
testing purposes. Such an ANN cannot
provide correct classification results. In
fact, the classification results are in no
way adapted to the expected results. The
values of the weight matrices have to be
set according the the classification task.
We need to improve the weight values,
which means that we have to train our network. To train it we have to implement backpropagation in the
train method. If you don't understand backpropagation and want to understand it, we recommend to go
back to the chapter Backpropagation in Neural Networks.
After knowing und hopefully understanding backpropagation, you are ready to fully understand the train
method.
The train method is called with an input vector and a target vector. The shape of the vectors can be one-
dimensional, but they will be automatically turned into the correct two-dimensional shape, i.e.
reshape(input_vector.size, 1) and reshape(target_vector.size, 1) . After this
we call the run method to get the result of the network output_vector_network =
self.run(input_vector) . This output may differ from the target_vector . We calculate the
output_error by subtracting the output of the network output_vector_network from the
target_vector .
import numpy as np
class NeuralNetwork:
def __init__(self,
no_of_in_nodes,
no_of_out_nodes,
no_of_hidden_nodes,
learning_rate):
self.no_of_in_nodes = no_of_in_nodes
self.no_of_out_nodes = no_of_out_nodes
self.no_of_hidden_nodes = no_of_hidden_nodes
self.learning_rate = learning_rate
self.create_weight_matrices()
def create_weight_matrices(self):
""" A method to initialize the weight matrices of the neur
al network"""
rad = 1 / np.sqrt(self.no_of_in_nodes)
X = truncated_normal(mean=0, sd=1, low=-rad, upp=rad)
self.weights_in_hidden = X.rvs((self.no_of_hidden_nodes,
self.no_of_in_nodes))
rad = 1 / np.sqrt(self.no_of_hidden_nodes)
X = truncated_normal(mean=0, sd=1, low=-rad, upp=rad)
self.weights_hidden_out = X.rvs((self.no_of_out_nodes,
self.no_of_hidden_nodes))
output_vector_hidden = activation_function(self.weights_i
We assume that you save the previous code in a file called neural_networks1.py . We will use it under
this name in the coming examples.
To test this neural network class we need train and test data. We create the data with make_blobs from
sklearn.datasets .
n_samples = 500
blob_centers = ([2, 6], [6, 2], [7, 7])
n_classes = len(blob_centers)
data, labels = make_blobs(n_samples=n_samples,
centers=blob_centers,
random_state=7)
labels[:7]
Output: array([2, 2, 1, 0, 2, 0, 1])
We need a one-hot representation for each label. So the labels are represented as
0 (1, 0, 0)
1 (0, 1, 0)
2 (0, 0, 1)
import numpy as np
We create a neural network with two input nodes, and three output nodes. One output node for each class:
simple_network = NeuralNetwork(no_of_in_nodes=2,
no_of_out_nodes=3,
no_of_hidden_nodes=5,
learning_rate=0.3)
The next step consists in training our network with the data and labels from our training samples:
for i in range(len(train_data)):
simple_network.train(train_data[i], train_labels[i])
simple_network.evaluate(train_data, train_labels)
Output: (390, 10)
The following diagram shows the first two layers of our previously used three-layered neural network:
We can see from this diagram that our weight matrix needs one additional column and the bias value has to be
added to the input vector:
The following is a complete Python class implementing our network with bias nodes:
import numpy as np
from scipy.stats import truncnorm
from scipy.special import expit as activation_function
def __init__(self,
no_of_in_nodes,
no_of_out_nodes,
no_of_hidden_nodes,
learning_rate,
bias=None):
self.no_of_in_nodes = no_of_in_nodes
self.no_of_hidden_nodes = no_of_hidden_nodes
self.no_of_out_nodes = no_of_out_nodes
self.learning_rate = learning_rate
self.bias = bias
self.create_weight_matrices()
def create_weight_matrices(self):
""" A method to initialize the weight matrices of the neur
al
network with optional bias nodes"""
bias_node = 1 if self.bias else 0
rad = 1 / np.sqrt(self.no_of_in_nodes + bias_node)
X = truncated_normal(mean=0, sd=1, low=-rad, upp=rad)
self.weights_in_hidden = X.rvs((self.no_of_hidden_nodes,
self.no_of_in_nodes + bia
s_node))
rad = 1 / np.sqrt(self.no_of_hidden_nodes + bias_node)
X = truncated_normal(mean=0, sd=1, low=-rad, upp=rad)
self.weights_hidden_out = X.rvs((self.no_of_out_nodes,
self.no_of_hidden_nodes
+ bias_node))
output_vector_hidden = activation_function(self.weights_i
n_hidden @ input_vector)
if self.bias:
output_vector_hidden = np.concatenate( (output_vecto
r_hidden, [[self.bias]]) )
output_vector_network = activation_function(self.weights_h
idden_out @ output_vector_hidden)
We can use again our previously created classes to test our classifier:
simple_network = NeuralNetwork(no_of_in_nodes=2,
no_of_out_nodes=3,
no_of_hidden_nodes=5,
learning_rate=0.1,
bias=1)
for i in range(len(train_data)):
simple_network.train(train_data[i], train_labels[i])
simple_network.evaluate(train_data, train_labels)
Output: (382, 18)
EXERCISE
We created in the chapter "Data Creation" a file strange_flowers.txt in the folder data . Create a
Neural Network to classify the 'flowers':
0.000,240.000,100.000,3.020
SOLUTION:
c = np.loadtxt("data/strange_flowers.txt", delimiter=" ")
We need to scale our data, because unscaled input data can result in a slow or unstable learning process. We
will use the function scale from sklearn/preprocessing . It standardizes a dataset along any axis.
It centers to the mean and component wise scale to unit variance.
data = preprocessing.scale(data)
data[:5]
data.shape
labels.shape
simple_network = NeuralNetwork(no_of_in_nodes=4,
no_of_out_nodes=4,
no_of_hidden_nodes=20,
learning_rate=0.3)
for i in range(len(train_data)):
simple_network.train(train_data[i], train_labels[i])
simple_network.evaluate(train_data, train_labels)
Output: (492, 144)
In [ ]:
SOFTMAX
The previous implementations of neural networks in our tutorial
returned float values in the open interval (0, 1). To make a final
decision we had to interprete the results of the output neurons.
The one with the highest value is a likely candidate but we also
have to see it in relation to the other results. It should be obvious
that in a two classes case (c 1 and c 2) a result (0.013, 0.95) is a
clear vote for the class c 2 but (0.73, 0.89) on the other hand is a
different thing. We could say in this situation 'c 2 is more likely
than c 1, but c 1 has still a high likelihood'. Talking about
likelihoods: The return values are not probabilities. It would be
a lot better to have a normalized output with a probability
function. Here comes the softmax function into the picture. The
softmax function, also known as softargmax or normalized
exponential function, is a function that takes as input a vector of
n real numbers, and normalizes it into a probability distribution
consisting of n probabilities proportional to the exponentials of
the input vector. A probability distribution implies that the result
vector sums up to 1. Needless to say, if some components of the
input vector are negative or greater than one, they will be in the
range (0, 1) after applying Softmax . The Softmax function is
often used in neural networks, to map the results of the output
layer, which is non-normalized, to a probability distribution over
predicted output classes.
eoi
σ(o i) =
∑n eoj
j=1
where the index i is in (0, ..., n-1) and o is the output vector of the network
o = (o 0, o 1, …, o n − 1)
def softmax(x):
""" applies softmax to an input x"""
e_x = np.exp(x)
return e_x / e_x.sum()
x = np.array([1, 0, 3, 5])
y = softmax(x)
y, x / x.sum()
Output: (array([0.01578405, 0.00580663, 0.11662925, 0.86178007]),
array([0.11111111, 0. , 0.33333333, 0.55555556]))
import numpy as np
def softmax(x):
""" applies softmax to an input x"""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum()
softmax(x)
Output: array([0.01578405, 0.00580663, 0.11662925, 0.86178007])
[][]
o1 s1
o2 s2
S(o) : ?
⋯ ⋯
on sn
[ ]
∂s 1 ∂s 1
∂o 1
⋯ ∂o n
∂S
= ⋯
∂O
∂s n ∂s n
∂o 1
⋯ ∂o n
eoi
∂
∂s i ∑n eok
k=1
=
∂o j ∂o j
the derivative of
g(x)
f(x) =
h(x)
is
g ′ (x) =
{ e o i,
0,
if i = j
otherwise
1. case: i = j:
( ∑n e o k) 2
k=1
∑n eok − eoj
eoi k=1
⋅
∑n eok ∑n eok
k=1 k=1
eoi eoj
⋅ (1 − )
∑n eok ∑n eok
k=1 k=1
s i ⋅ (1 − s j)
s i ⋅ (1 − s i)
because i = j.
1. case: i ≠ j:
( ∑n e o k) 2
k=1
eoi eoj
− ⋅
∑n eok ∑n eok
k=1 k=1
We can summarize these two cases and write the derivative as:
g ′ (x) =
{ s i ⋅ (1 − s i),
− s i ⋅ s j,
if i = j
otherwise
If we use the Kronecker delta function1, we can get rid of the case differentiation, i.e. we "let the Kronecker
delta do this work":
∂s i
= s i(δ ij − s j)
∂o j
[ ]
s 1(δ 11 − s 1) s 1(δ 12 − s 2) ⋯ s 1(δ 1n − s n)
import numpy as np
def softmax(x):
e_x = np.exp(x)
return e_x / e_x.sum()
s = softmax(np.array([0, 4, 5]))
si_sj = - s * s.reshape(3, 1)
print(s)
print(si_sj)
s_der = np.diag(s) + si_sj
s_der
import numpy as np
from scipy.stats import truncnorm
@np.vectorize
def sigmoid(x):
return 1 / (1 + np.e ** -x)
def softmax(x):
e_x = np.exp(x)
return e_x / e_x.sum()
class NeuralNetwork:
def __init__(self,
no_of_in_nodes,
no_of_out_nodes,
no_of_hidden_nodes,
learning_rate,
softmax=True):
self.no_of_in_nodes = no_of_in_nodes
self.no_of_out_nodes = no_of_out_nodes
self.no_of_hidden_nodes = no_of_hidden_nodes
self.learning_rate = learning_rate
self.softmax = softmax
self.create_weight_matrices()
def create_weight_matrices(self):
""" A method to initialize the weight matrices of the neur
al network"""
rad = 1 / np.sqrt(self.no_of_in_nodes)
X = truncated_normal(mean=0, sd=1, low=-rad, upp=rad)
output_vector_hidden = sigmoid(self.weights_in_hidden @ in
put_vector)
if self.softmax:
output_vector_network = softmax(self.weights_hidden_ou
t @ output_vector_hidden)
else:
output_vector_network = sigmoid(self.weights_hidden_ou
t @ output_vector_hidden)
return output_vector_network
n_samples = 300
samples, labels = make_blobs(n_samples=n_samples,
centers=([2, 6], [6, 2]),
random_state=0)
simple_network = NeuralNetwork(no_of_in_nodes=2,
no_of_out_nodes=2,
no_of_hidden_nodes=5,
learning_rate=0.3,
softmax=True)
for i in range(size_of_learn_sample):
#print(learn_data[i], labels[i], labels_one_hot[i])
simple_network.train(learn_data[i],
labels_one_hot[i])
FOOTNOTES
1
Kronecker delta:
δ ij =
{ 1,
0,
if i = j
if i ≠ j
A confusion matrix is a matrix (table) that can be used to measure the performance of an machine learning
algorithm, usually a supervised learning one. Each row of the confusion matrix represents the instances of an
actual class and each column represents the instances of a predicted class. This is the way we keep it in this
chapter of our tutorial, but it can be the other way around as well, i.e. rows for predicted classes and columns
for actual classes. The name confusion matrix reflects the fact that it makes it easy for us to see what kind of
confusions occur in our classification algorithms. For example the algorithms should have predicted a sample
as c i because the actual class is c i, but the algorithm came out with c j. In this case of mislabelling the element
cm[i, j] will be incremented by one, when the confusion matrix is constructed.
We will define methods to calculate the confusion matrix, precision and recall in the following class.
2-CLASS CASE
In a 2-class case, i.e. "negative" and "positive", the confusion matrix may look like this:
predicted
negative 11 0
positive 1 12
predicted
negative TN FP
True positive False Positive
positive FN TP
False negative True positive
We can define now some important performance measures used in machine learning:
Accuracy:
TN + TP
AC =
TN + FP + FN + TP
The accuracy is not always an adequate performance measure. Let us assume we have 1000 samples. 995 of
these are negative and 5 are positive cases. Let us further assume we have a classifier, which classifies
whatever it will be presented as negative. The accuracy will be a surprising 99.5%, even though the classifier
could not recognize any positive samples.
TP
recall =
FN + TP
FP
TNR =
TN + FP
Precision:
TP
precision :
FP + TP
To measure the results of machine learning algorithms, the previous confusion matrix will not be sufficient.
We will need a generalization for the multi-class case.
Let us assume that we have a sample of 25 animals, e.g. 7 cats, 8 dogs, and 10 snakes, most probably Python
snakes. The confusion matrix of our recognition algorithm may look like the following table:
predicted
dog 6 2 0
cat 1 6 0
snake 1 1 8
In this confusion matrix, the system correctly predicted six of the eight actual dogs, but in two cases it took a
dog for a cat. The seven acutal cats were correctly recognized in six cases but in one case a cat was taken to be
a dog. Usually, it is hard to take a snake for a dog or a cat, but this is what happened to our classifier in two
cases. Yet, eight out of ten snakes had been correctly recognized. (Most probably this machine learning
algorithm was not written in a Python program, because Python should properly recognize its own species :-) )
You can see that all correct predictions are located in the diagonal of the table, so prediction errors can be
easily found in the table, as they will be represented by values outside the diagonal.
We can generalize this to the multi-class case. To do this we summarize over the rows and columns of the
confusion matrix. Given that the matrix is oriented as above, i.e., that a given row of the matrix corresponds to
specific value for the "truth", we have:
M ii
Precision i =
∑ jM ji
M ii
Recall i =
∑ jM ij
This means, precision is the fraction of cases where the algorithm correctly predicted class i out of all
instances where the algorithm predicted i (correctly and incorrectly). recall on the other hand is the fraction of
cases where the algorithm correctly predicted i out of all of the cases which are labelled as i.
precision snakes = 8 / (0 + 0 + 8) = 1
EXAMPLE
We are ready now to code this into Python. The following code shows a confusion matrix for a multi-class
machine learning problem with ten labels, so for example an algorithms for recognizing the ten digits from
handwritten characters.
If you are not familiar with Numpy and Numpy arrays, we recommend our tutorial on Numpy.
import numpy as np
cm = np.array(
[[5825, 1, 49, 23, 7, 46, 30, 12, 21, 26],
[ 1, 6654, 48, 25, 10, 32, 19, 62, 111, 10],
[ 2, 20, 5561, 69, 13, 10, 2, 45, 18, 2],
[ 6, 26, 99, 5786, 5, 111, 1, 41, 110, 79],
[ 4, 10, 43, 6, 5533, 32, 11, 53, 34, 79],
[ 3, 1, 2, 56, 0, 4954, 23, 0, 12, 5],
[ 31, 4, 42, 22, 45, 103, 5806, 3, 34, 3],
[ 0, 4, 30, 29, 5, 6, 0, 5817, 2, 28],
[ 35, 6, 63, 58, 8, 59, 26, 13, 5394, 24],
[ 16, 16, 21, 57, 216, 68, 0, 219, 115, 5693]])
The functions 'precision' and 'recall' calculate values for a label, whereas the function
'precision_macro_average' the precision for the whole classification problem calculates.
def precision_macro_average(confusion_matrix):
rows, columns = confusion_matrix.shape
sum_of_precisions = 0
for label in range(rows):
sum_of_precisions += precision(label, confusion_matrix)
return sum_of_precisions / rows
def recall_macro_average(confusion_matrix):
rows, columns = confusion_matrix.shape
sum_of_recalls = 0
for label in range(columns):
sum_of_recalls += recall(label, confusion_matrix)
return sum_of_recalls / columns
def accuracy(confusion_matrix):
diagonal_sum = confusion_matrix.trace()
sum_of_all_elements = confusion_matrix.sum()
accuracy(cm)
Output: 0.95038333333333336
USING MNIST
Every line of these files consists of an image, i.e. 785 numbers between 0 and 255.
The first number of each line is the label, i.e. the digit which is depicted in the image. The following 784
numbers are the pixels of the 28 x 28 image.
import numpy as np
test_data[test_data==255]
test_data.shape
Output: (10000, 785)
The images of the MNIST dataset are greyscale and the pixels range between 0 and 255 including both
bounding values. We will map these values into an interval from [0.01, 1] by multiplying each pixel by 0.99 /
255 and adding 0.01 to the result. This way, we avoid 0 values as inputs, which are capable of preventing
weight updates, as we we seen in the introductory chapter.
We need the labels in our calculations in a one-hot representation. We have 10 digits from 0 to 9, i.e. lr =
np.arange(10).
Turning a label into one-hot representation can be achieved with the command: (lr==label).astype(np.int)
import numpy as np
We are ready now to turn our labelled images into one-hot representations. Instead of zeroes and one, we
create 0.01 and 0.99, which will be better for our calculations:
lr = np.arange(no_of_different_labels)
Before we start using the MNIST data sets with our neural network, we will have a look at some images:
for i in range(10):
img = train_imgs[i].reshape((28,28))
plt.imshow(img, cmap="Greys")
plt.show()
We will save the data in binary format with the dump function from the pickle module:
import pickle
We are able now to read in the data by using pickle.load. This is a lot faster than using loadtxt on the csv files:
import pickle
train_imgs = data[0]
import numpy as np
@np.vectorize
def sigmoid(x):
return 1 / (1 + np.e ** -x)
activation_function = sigmoid
class NeuralNetwork:
def __init__(self,
no_of_in_nodes,
no_of_out_nodes,
no_of_hidden_nodes,
learning_rate):
self.no_of_in_nodes = no_of_in_nodes
self.no_of_out_nodes = no_of_out_nodes
self.no_of_hidden_nodes = no_of_hidden_nodes
self.learning_rate = learning_rate
def create_weight_matrices(self):
"""
A method to initialize the weight
matrices of the neural network
"""
rad = 1 / np.sqrt(self.no_of_in_nodes)
X = truncated_normal(mean=0,
sd=1,
low=-rad,
upp=rad)
self.wih = X.rvs((self.no_of_hidden_nodes,
self.no_of_in_nodes))
rad = 1 / np.sqrt(self.no_of_hidden_nodes)
X = truncated_normal(mean=0, sd=1, low=-rad, upp=rad)
self.who = X.rvs((self.no_of_out_nodes,
self.no_of_hidden_nodes))
output_vector1 = np.dot(self.wih,
input_vector)
output_hidden = activation_function(output_vector1)
output_vector2 = np.dot(self.who,
output_hidden)
output_network = activation_function(output_vector2)
output_vector = np.dot(self.wih,
input_vector)
output_vector = activation_function(output_vector)
output_vector = np.dot(self.who,
output_vector)
output_vector = activation_function(output_vector)
return output_vector
for i in range(len(train_imgs)):
ANN.train(train_imgs[i], train_labels_one_hot[i])
for i in range(20):
res = ANN.run(test_imgs[i])
print(test_labels[i], np.argmax(res), np.max(res))
[7.] 7 0.9829245583409039
[2.] 2 0.7372766887508578
[1.] 1 0.9881823673106839
[0.] 0 0.9873289971465894
[4.] 4 0.9456335245615916
[1.] 1 0.9880120617106172
[4.] 4 0.976550583573903
[9.] 9 0.964909168118122
[5.] 6 0.36615932726182665
[9.] 9 0.9848677489827125
[0.] 0 0.9204097234781773
[6.] 6 0.8897871402453337
[9.] 9 0.9936811621891628
[0.] 0 0.9832119513084644
[1.] 1 0.988750833073612
[5.] 5 0.9156741221523511
[9.] 9 0.9812577974620423
[7.] 7 0.9888560485875889
[3.] 3 0.8772868556722897
[4.] 4 0.9900030761222965
cm = ANN.confusion_matrix(train_imgs, train_labels)
print(cm)
for i in range(10):
print("digit: ", i, "precision: ", ANN.precision(i, cm), "reca
ll: ", ANN.recall(i, cm))
accuracy train: 0.9469166666666666
accuracy: test 0.9459
[[5802 0 53 21 9 42 35 8 14 20]
[ 1 6620 45 22 6 29 14 50 75 7]
[ 5 22 5486 51 10 11 5 53 11 3]
[ 6 36 114 5788 2 114 1 35 76 72]
[ 8 16 54 8 5439 41 10 52 25 90]
[ 5 2 3 44 0 4922 20 3 5 11]
[ 37 4 54 19 71 72 5789 3 41 4]
[ 0 5 31 38 7 4 0 5762 1 32]
[ 52 20 103 83 9 102 43 21 5535 38]
[ 7 17 15 57 289 84 1 278 68 5672]]
digit: 0 precision: 0.9795711632618606 recall: 0.96635576282478
35
digit: 1 precision: 0.9819044793829724 recall: 0.96375018197699
81
digit: 2 precision: 0.9207787848271232 recall: 0.96977196393848
33
digit: 3 precision: 0.9440548034578372 recall: 0.92696989109545
16
digit: 4 precision: 0.9310167750770284 recall: 0.94706599338324
91
digit: 5 precision: 0.9079505626268216 recall: 0.98145563310069
79
digit: 6 precision: 0.978202095302467 recall: 0.949950771250410
3
digit: 7 precision: 0.9197126895450918 recall: 0.97993197278911
57
digit: 8 precision: 0.945992138096052 recall: 0.921578421578421
6
digit: 9 precision: 0.953437552529837 recall: 0.87422934648582
We can repeat the training multiple times. Each run is called an "epoch".
epochs = 3
NN = NeuralNetwork(no_of_in_nodes = image_pixels,
no_of_out_nodes = 10,
no_of_hidden_nodes = 100,
learning_rate = 0.1)
We want to do the multiple training of the training set inside of our network. To this purpose we rewrite the
method train and add a method train_single. train_single is more or less what we called 'train' before. Whereas
the new 'train' method is doing the epoch counting. For testing purposes, we save the weight matrices after
each epoch in
the list intermediate_weights. This list is returned as the output of train:
import numpy as np
@np.vectorize
def sigmoid(x):
class NeuralNetwork:
def __init__(self,
no_of_in_nodes,
no_of_out_nodes,
no_of_hidden_nodes,
learning_rate):
self.no_of_in_nodes = no_of_in_nodes
self.no_of_out_nodes = no_of_out_nodes
self.no_of_hidden_nodes = no_of_hidden_nodes
self.learning_rate = learning_rate
self.create_weight_matrices()
def create_weight_matrices(self):
""" A method to initialize the weight matrices of the neur
al network"""
rad = 1 / np.sqrt(self.no_of_in_nodes)
X = truncated_normal(mean=0,
sd=1,
low=-rad,
upp=rad)
self.wih = X.rvs((self.no_of_hidden_nodes,
self.no_of_in_nodes))
rad = 1 / np.sqrt(self.no_of_hidden_nodes)
X = truncated_normal(mean=0,
sd=1,
low=-rad,
upp=rad)
self.who = X.rvs((self.no_of_out_nodes,
self.no_of_hidden_nodes))
output_vectors = []
input_vector = np.array(input_vector, ndmin=2).T
target_vector = np.array(target_vector, ndmin=2).T
output_vector1 = np.dot(self.wih,
input_vector)
output_hidden = activation_function(output_vector1)
output_vector2 = np.dot(self.who,
output_hidden)
output_network = activation_function(output_vector2)
output_vector = np.dot(self.wih,
input_vector)
output_vector = activation_function(output_vector)
output_vector = np.dot(self.who,
output_vector)
output_vector = activation_function(output_vector)
return output_vector
weights = ANN.train(train_imgs,
train_labels_one_hot,
epochs=epochs,
intermediate_results=True)
**********
cm = ANN.confusion_matrix(train_imgs, train_labels)
print(ANN.run(train_imgs[i]))
[[2.60149245e-03]
[2.52542556e-03]
[6.57990628e-03]
[1.32663729e-03]
[1.34985384e-03]
[2.63840265e-04]
[2.18329159e-04]
[1.32693720e-04]
[9.84326084e-01]
[4.34559417e-02]]
cm = list(cm.items())
print(sorted(cm))
import numpy as np
@np.vectorize
def sigmoid(x):
return 1 / (1 + np.e ** -x)
activation_function = sigmoid
class NeuralNetwork:
def __init__(self,
no_of_in_nodes,
no_of_out_nodes,
no_of_hidden_nodes,
learning_rate,
bias=None
):
self.no_of_in_nodes = no_of_in_nodes
self.no_of_out_nodes = no_of_out_nodes
self.no_of_hidden_nodes = no_of_hidden_nodes
self.learning_rate = learning_rate
self.bias = bias
self.create_weight_matrices()
def create_weight_matrices(self):
"""
A method to initialize the weight
matrices of the neural network with
output_vector1 = np.dot(self.wih,
input_vector)
output_hidden = activation_function(output_vector1)
if self.bias:
output_hidden = np.concatenate((output_hidden,
[[self.bias]]) )
if self.bias:
# adding bias node to the end of the inpuy_vector
input_vector = np.concatenate((input_vector, [1]) )
input_vector = np.array(input_vector, ndmin=2).T
output_vector = np.dot(self.wih,
input_vector)
output_vector = activation_function(output_vector)
if self.bias:
output_vector = np.concatenate( (output_vector,
[[1]]) )
ANN = NeuralNetwork(no_of_in_nodes=image_pixels,
no_of_out_nodes=10,
no_of_hidden_nodes=200,
learning_rate=0.1,
bias=None)
for i in range(len(train_imgs)):
ANN.train(train_imgs[i], train_labels_one_hot[i])
for i in range(20):
res = ANN.run(test_imgs[i])
print(test_labels[i], np.argmax(res), np.max(res))
@np.vectorize
def sigmoid(x):
return 1 / (1 + np.e ** -x)
activation_function = sigmoid
class NeuralNetwork:
def __init__(self,
no_of_in_nodes,
no_of_out_nodes,
no_of_hidden_nodes,
learning_rate,
bias=None
):
self.no_of_in_nodes = no_of_in_nodes
self.no_of_out_nodes = no_of_out_nodes
self.no_of_hidden_nodes = no_of_hidden_nodes
self.learning_rate = learning_rate
self.bias = bias
self.create_weight_matrices()
def create_weight_matrices(self):
"""
A method to initialize the weight matrices
of the neural network with optional
bias nodes"""
output_vectors = []
input_vector = np.array(input_vector, ndmin=2).T
target_vector = np.array(target_vector, ndmin=2).T
output_vector1 = np.dot(self.wih,
input_vector)
output_hidden = activation_function(output_vector1)
if self.bias:
output_hidden = np.concatenate((output_hidden,
[[self.bias]]) )
output_vector2 = np.dot(self.who,
output_hidden)
output_network = activation_function(output_vector2)
if self.bias:
# adding bias node to the end of the inpuy_vector
input_vector = np.concatenate( (input_vector,
[self.bias]) )
input_vector = np.array(input_vector, ndmin=2).T
output_vector = np.dot(self.wih,
input_vector)
output_vector = activation_function(output_vector)
if self.bias:
output_vector = np.concatenate( (output_vector,
[[self.bias]]) )
return output_vector
epochs = 12
network = NeuralNetwork(no_of_in_nodes=image_pixels,
no_of_out_nodes=10,
no_of_hidden_nodes=100,
learning_rate=0.1,
bias=None)
weights = network.train(train_imgs,
train_labels_one_hot,
epochs=epochs,
intermediate_results=True)
for epoch in range(epochs):
print("epoch: ", epoch)
network.wih = weights[epoch][0]
network.who = weights[epoch][1]
corrects, wrongs = network.evaluate(train_imgs,
train_labels)
print("accuracy train: ", corrects / ( corrects + wrong
s))
corrects, wrongs = network.evaluate(test_imgs,
test_labels)
print("accuracy test: ", corrects / ( corrects + wrongs))
train_labels)
test_labels)
outstr = str(hidden_nodes) + " " + str(learnin
g_rate) + " " + str(bias)
outstr += " " + str(epoch) + " "
outstr += str(train_corrects / (train_correct
s + train_wrongs)) + " "
outstr += str(train_wrongs / (train_corrects
+ train_wrongs)) + " "
outstr += str(test_corrects / (test_corrects
+ test_wrongs)) + " "
outstr += str(test_wrongs / (test_corrects + t
est_wrongs))
fh_out.write(outstr + "\n" )
fh_out.flush()
***************************************************************************
The file nist_tests_20_50_100_120_150.csv contains the results from a run of the previous program.
We will write a new neural network class, in which we can define an arbitrary number of hidden layers. The
code is also improved, because the weight matrices are now build inside of a loop instead redundant code:
In [ ]:
import numpy as np
from scipy.special import expit as activation_function
from scipy.stats import truncnorm
class NeuralNetwork:
def __init__(self,
network_structure, # ie. [input_nodes, hidden1_no
des, ... , hidden_n_nodes, output_nodes]
learning_rate,
bias=None
):
self.structure = network_structure
self.learning_rate = learning_rate
self.bias = bias
self.create_weight_matrices()
def create_weight_matrices(self):
layer_index = 1
no_of_layers = len(self.structure)
while layer_index < no_of_layers:
no_of_layers = len(self.structure)
input_vector = np.array(input_vector, ndmin=2).T
layer_index = 0
# The output/input vectors of the various layers:
res_vectors = [input_vector]
while layer_index < no_of_layers - 1:
in_vector = res_vectors[-1]
if self.bias:
# adding bias node to the end of the 'input'_vecto
r
in_vector = np.concatenate( (in_vector,
[[self.bias]]) )
res_vectors[-1] = in_vector
x = np.dot(self.weights_matrices[layer_index],
in_vector)
out_vector = activation_function(x)
# the output of one layer is the input of the next on
e:
res_vectors.append(out_vector)
layer_index += 1
layer_index = no_of_layers - 1
target_vector = np.array(target_vector, ndmin=2).T
# The input vectors to the various layers
#if self.bias:
# tmp = tmp[:-1,:]
self.weights_matrices[layer_index-1] += self.learnin
g_rate * tmp
output_errors = np.dot(self.weights_matrices[layer_ind
ex-1].T,
output_errors)
if self.bias:
output_errors = output_errors[:-1,:]
layer_index -= 1
no_of_layers = len(self.structure)
if self.bias:
# adding bias node to the end of the inpuy_vector
input_vector = np.concatenate( (input_vector,
[self.bias]) )
in_vector = np.array(input_vector, ndmin=2).T
layer_index = 1
# The input vectors to the various layers
while layer_index < no_of_layers:
x = np.dot(self.weights_matrices[layer_index-1],
in_vector)
out_vector = activation_function(x)
layer_index += 1
return out_vector
In [ ]:
ANN = NeuralNetwork(network_structure=[image_pixels, 50, 50, 10],
learning_rate=0.1,
bias=None)
for i in range(len(train_imgs)):
ANN.train(train_imgs[i], train_labels_one_hot[i])
In [ ]:
corrects, wrongs = ANN.evaluate(train_imgs, train_labels)
print("accuracy train: ", corrects / ( corrects + wrongs))
corrects, wrongs = ANN.evaluate(test_imgs, test_labels)
print("accuracy: test", corrects / ( corrects + wrongs))
In [ ]:
import numpy as np
from scipy.special import expit as activation_function
from scipy.stats import truncnorm
class NeuralNetwork:
def __init__(self,
network_structure, # ie. [input_nodes, hidden1_no
des, ... , hidden_n_nodes, output_nodes]
learning_rate,
bias=None
):
self.structure = network_structure
self.learning_rate = learning_rate
self.bias = bias
self.create_weight_matrices()
def create_weight_matrices(self):
X = truncated_normal(mean=2, sd=1, low=-0.5, upp=0.5)
no_of_layers = len(self.structure)
input_vector = np.array(input_vector, ndmin=2).T
layer_index = 0
# The output/input vectors of the various layers:
res_vectors = [input_vector]
while layer_index < no_of_layers - 1:
in_vector = res_vectors[-1]
if self.bias:
# adding bias node to the end of the 'input'_vecto
r
in_vector = np.concatenate( (in_vector,
[[self.bias]]) )
res_vectors[-1] = in_vector
x = np.dot(self.weights_matrices[layer_index], in_vect
or)
out_vector = activation_function(x)
res_vectors.append(out_vector)
layer_index += 1
layer_index = no_of_layers - 1
target_vector = np.array(target_vector, ndmin=2).T
# The input vectors to the various layers
output_errors = target_vector - out_vector
while layer_index > 0:
out_vector = res_vectors[layer_index]
in_vector = res_vectors[layer_index-1]
#if self.bias:
# tmp = tmp[:-1,:]
self.weights_matrices[layer_index-1] += self.learnin
g_rate * tmp
output_errors = np.dot(self.weights_matrices[layer_ind
ex-1].T,
output_errors)
if self.bias:
output_errors = output_errors[:-1,:]
layer_index -= 1
no_of_layers = len(self.structure)
if self.bias:
# adding bias node to the end of the inpuy_vector
input_vector = np.concatenate( (input_vector, [self.bi
as]) )
layer_index = 1
# The input vectors to the various layers
while layer_index < no_of_layers:
x = np.dot(self.weights_matrices[layer_index-1],
in_vector)
out_vector = activation_function(x)
layer_index += 1
return out_vector
In [ ]:
epochs = 3
FOOTNOTES
1
Wan, Li; Matthew Zeiler; Sixin Zhang; Yann LeCun; Rob Fergus (2013). Regularization of Neural Network
using DropConnect. International Conference on Machine Learning(ICML).
INTRODUCTION
The term "dropout" is used for a technique which
drops out some nodes of the network. Dropping out
can be seen as temporarily deactivating or ignoring
neurons of the network. This technique is applied in
the training phase to reduce overfitting effects.
Overfitting is an error which occurs when a network
is too closely fit to a limited set of input samples.
This technique has been first proposed in a paper "Dropout: A Simple Way to Prevent Neural Networks from
Overfitting" by Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan
Salakhutdinov in 2014
We will implement in our tutorial on machine learning in Python a Python class which is capable of dropout.
Let's deactivate (drop out) the node i 2. We can see in the following diagram what's happening:
Now we will examine what happens if we take out a hidden node. We take out the first hidden node, i.e. h 1.
In this case, we can remove the complete first line of our weight matrix:
Taking out a hidden node affects the next weight matrix as well. Let's have a look at what is happening in the
network graph:
So far we have arbitrarily chosen one node to deactivate. The dropout approach means that we randomly
choose a certain number of nodes from the input and the hidden layers, which remain active and turn off the
other nodes of these layers. After this we can train a part of our learn set with this network. The next step
consists in activating all the nodes again and randomly chose other nodes. It is also possible to train the whole
training set with the randomly created dropout networks.
We present three possible randomly chosen dropout networks in the following three diagrams:
We will start with the weight matrix between input and hidden layer. We will randomly create a weight matrix
for 10 input nodes and 5 hidden nodes. We fill our matrix with random numbers between -10 and 10, which
are not proper weight values, but this way we can see better what is going on:
import numpy as np
import random
input_nodes = 10
hidden_nodes = 5
output_nodes = 7
We will choose now the active nodes for the input layer. We calculate random indices for the active nodes:
active_input_percentage = 0.7
active_input_nodes = int(input_nodes * active_input_percentage)
active_input_indices = sorted(random.sample(range(0, input_node
s),
active_input_nodes))
active_input_indices
Output: [0, 1, 2, 5, 7, 8, 9]
We learned above that we have to remove the column j, if the node i j is removed. We can easily accomplish
this for all deactived nodes by using the slicing operator with the active nodes:
wih_old = wih.copy()
wih = wih[:, active_input_indices]
wih
Output: array([[ -6, -8, -3, -9, -5, -6, 4],
[ 5, 3, 7, 8, -4, 7, 7],
[ 9, -7, 4, 0, -6, -2, 7],
[ -8, -9, -4, 8, -8, -2, -3],
[ 3, -10, 0, 0, 2, -7, -9]])
As we have mentioned before, we will have to modify both the 'wih' and the 'who' matrix:
print(who)
active_hidden_percentage = 0.7
active_hidden_nodes = int(hidden_nodes * active_hidden_percentage)
active_hidden_indices = sorted(random.sample(range(0, hidden_node
s),
active_hidden_nodes))
print(active_hidden_indices)
who_old = who.copy()
who = who[:, active_hidden_indices]
wih = wih[active_hidden_indices]
wih
Output: array([[-6, -8, -3, -9, -5, -6, 4],
[ 9, -7, 4, 0, -6, -2, 7],
[-8, -9, -4, 8, -8, -2, -3]])
import numpy as np
import random
input_nodes = 10
hidden_nodes = 5
output_nodes = 7
active_input_percentage = 0.7
active_hidden_percentage = 0.7
wih_old = wih.copy()
wih = wih[:, active_input_indices]
print("\nwih after deactivating input nodes:\n", wih)
wih = wih[active_hidden_indices]
print("\nwih after deactivating hidden nodes:\n", wih)
who_old = who.copy()
who = who[:, active_hidden_indices]
print("\nwih after deactivating hidden nodes:\n", who)
import numpy as np
import random
from scipy.special import expit as activation_function
from scipy.stats import truncnorm
class NeuralNetwork:
def __init__(self,
no_of_in_nodes,
no_of_out_nodes,
no_of_hidden_nodes,
learning_rate,
bias=None
):
self.no_of_in_nodes = no_of_in_nodes
self.no_of_out_nodes = no_of_out_nodes
self.no_of_hidden_nodes = no_of_hidden_nodes
self.learning_rate = learning_rate
self.bias = bias
self.create_weight_matrices()
def create_weight_matrices(self):
X = truncated_normal(mean=2, sd=1, low=-0.5, upp=0.5)
def dropout_weight_matrices(self,
active_input_percentage=0.70,
active_hidden_percentage=0.70):
# restore wih array, if it had been used for dropout
self.wih_orig = self.wih.copy()
self.no_of_in_nodes_orig = self.no_of_in_nodes
self.no_of_hidden_nodes = active_hidden_nodes
self.no_of_in_nodes = active_input_nodes
return active_input_indices, active_hidden_indices
def weight_matrices_reset(self,
active_input_indices,
active_hidden_indices):
"""
self.wih and self.who contain the newly adapted values fro
m the active nodes.
We have to reconstruct the original weight matrices by ass
igning the new values
from the active nodes
"""
temp = self.wih_orig.copy()[:,active_input_indices]
temp[active_hidden_indices] = self.wih
self.wih_orig[:, active_input_indices] = temp
self.wih = self.wih_orig.copy()
if self.bias:
# adding bias node to the end of the input_vector
input_vector = np.concatenate( (input_vector, [self.bi
as]) )
if self.bias:
output_vector_hidden = np.concatenate( (output_vecto
r_hidden, [[self.bias]]) )
self.weight_matrices_reset(active_in_indices, acti
ve_hidden_indices)
if self.bias:
# adding bias node to the end of the input_vector
input_vector = np.concatenate( (input_vector, [self.bi
as]) )
input_vector = np.array(input_vector, ndmin=2).T
if self.bias:
output_vector = np.concatenate( (output_vector, [[sel
f.bias]]) )
return output_vector
import pickle
train_imgs = data[0]
test_imgs = data[1]
train_labels = data[2]
test_labels = data[3]
parts = 10
partition_length = int(len(train_imgs) / parts)
print(partition_length)
start = 0
for start in range(0, len(train_imgs), partition_length):
print(start, start + partition_length)
6000
0 6000
6000 12000
12000 18000
18000 24000
24000 30000
30000 36000
36000 42000
42000 48000
48000 54000
54000 60000
epochs = 3
simple_network.train(train_imgs,
train_labels_one_hot,
active_input_percentage=1,
active_hidden_percentage=1,
no_of_dropout_tests = 100,
epochs=epochs)
epoch: 0
epoch: 1
epoch: 2
INTRODUCTION
In the previous chapters of our tutorial, we
manually created Neural Networks. This
was necessary to get a deep understanding
of how Neural networks can be
implemented. This understanding is very
useful to use the classifiers provided by
the sklearn module of Python. In this
chapter we will use the multilayer
perceptron classifier MLPClassifier
contained in
sklearn.neural_network
MLPCLASSIFIER
CLASSIFIER
We will continue with examples using the multilayer perceptron (MLP). The multilayer perceptron (MLP) is a
feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An
MLP consists of multiple layers and each layer is fully connected to the following one. The nodes of the layers
are neurons using nonlinear activation functions, except for the nodes of the input layer. There can be one or
more non-linear hidden layers between the input and the output layer.
MULTILABEL EXAMPLE
n_samples = 200
blob_centers = ([1, 1], [3, 4], [1, 3.3], [3.5, 1.8])
data, labels = make_blobs(n_samples=n_samples,
centers=blob_centers,
cluster_std=0.5,
random_state=0)
• solver:
The weight optimization can be influenced with the solver parameter. Three solver modes
are available
▪ 'lbfgs'
Without understanding in the details of the solvers, you should know the following: 'adam'
works pretty well - both training time and validation score - on relatively large datasets, i.e.
thousands of training samples or more. For small datasets, however, 'lbfgs' can converge faster
and perform better.
• 'alpha'
This parameter can be used to control possible 'overfitting' and 'underfitting'. We will cover it in
detail further down in this chapter.
clf = MLPClassifier(solver='lbfgs',
alpha=1e-5,
hidden_layer_sizes=(6,),
random_state=1)
clf.fit(train_data, train_labels)
Output: MLPClassifier(alpha=1e-05, hidden_layer_sizes=(6,), random_st
ate=1,
solver='lbfgs')
clf.score(train_data, train_labels)
Output: 1.0
predictions_train = clf.predict(train_data)
predictions_test = clf.predict(test_data)
train_score = accuracy_score(predictions_train, train_labels)
print("score on train data: ", train_score)
test_score = accuracy_score(predictions_test, test_labels)
print("score on train data: ", test_score)
score on train data: 1.0
score on train data: 0.95
predictions_train[:20]
MULTI-LAYER PERCEPTRON
from sklearn.neural_network import MLPClassifier
X = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]
y = [0, 0, 0, 1]
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
hidden_layer_sizes=(5, 2), random_state=1)
print(clf.fit(X, y))
MLPClassifier(alpha=1e-05, hidden_layer_sizes=(5, 2), random_stat
e=1,
solver='lbfgs')
The following diagram depicts the neural network, that we have trained for our classifier clf. We have two
input nodes X 0 and X 1, called the input layer, and one output neuron 'Out'. We have two hidden layers the first
one with the neurons H 00 ... H 04 and the second hidden layer consisting of H 10 and H 11. Each neuron of the
hidden layers and the output neuron possesses a corresponding Bias, i.e. B 00 is the corresponding Bias to the
neuron H 00, B 01 is the corresponding Bias to the neuron H 01 and so on.
Each neuron of the hidden layers receives the output from every neuron of the previous layers and transforms
these values with a weighted linear summation
n−1
∑ w ix i = w 0x 0 + w 1x 1 + . . . + w n − 1x n − 1
i=0
into an output value, where n is the number of neurons of the layer and w i corresponds to the ith component of
the weight vector. The output layer receives the values from the last hidden layer. It also performs a linear
summation, but a non-linear activation function
g( ⋅ ) : R → R
like the hyperbolic tan function will be applied to the summation result.
In [ ]:
print("weights between in
put and first hidden laye
r:")
print(clf.coefs_[0])
print("\nweights between
first hidden and second h
idden layer:")
print(clf.coefs_[1])
n−1
∑ w ix i = w 0x 0 + w 1x 1 + w B11 ∗ B 11
i=0
∑ w ix i = w 0x 0 + w 1x 1 + w B11
i=0
because B 11 = 1.
We can get the values for w 0 and w 1 from clf.coefs_ like this:
In [ ]:
print("w0 = ", clf.coefs_[0][0][0])
print("w1 = ", clf.coefs_[0][1][0])
In [ ]:
clf.coefs_[0][:,0]
In [ ]:
for i in range(len(clf.coefs_)):
number_neurons_in_layer = clf.coefs_[i].shape[1]
for j in range(number_neurons_in_layer):
weights = clf.coefs_[i][:,j]
print(i, j, weights, end=", ")
print()
print()
intercepts_ is a list of bias vectors, where the vector at index i represents the bias values added to layer i+1.
In [ ]:
print("Bias values for first hidden layer:")
print(clf.intercepts_[0])
print("\nBias values for second hidden layer:")
print(clf.intercepts_[1])
The main reason, why we train a classifier is to predict results for new samples. We can do this with the
predict method. The method returns a predicted class for a sample, in our case a "0" or a "1" :
In [ ]:
result = clf.predict([[0, 0], [0, 1],
[1, 0], [0, 1],
[1, 1], [2., 2.],
[1.3, 1.3], [2, 4.8]])
Instead of just looking at the class results, we can also use the predict_proba method to get the probability
estimates.
In [ ]:
prob_results = clf.predict_proba([[0, 0], [0, 1],
[1, 0], [0, 1],
[1, 1], [2., 2.],
[1.3, 1.3], [2, 4.8]])
print(prob_results)
prob_results[i][0] gives us the probability for the class0, i.e. a "0" and results[i][1] the probabilty for a "1". i
corresponds to the ith sample.
iris = load_iris()
print(train_data[:3])
[[ 1.91343191 -0.6013337 1.31398787 0.89583493]
[-0.93504278 1.48689909 -1.31208492 -1.08512683]
[ 0.4272712 -0.36930784 0.28639417 0.10345022]]
predictions_train = mlp.predict(train_data)
print(accuracy_score(predictions_train, train_labels))
predictions_test = mlp.predict(test_data)
print(accuracy_score(predictions_test, test_labels))
confusion_matrix(predictions_train, train_labels)
Output: array([[42, 0, 0],
[ 0, 37, 1],
[ 0, 2, 38]])
confusion_matrix(predictions_test, test_labels)
Output: array([[ 8, 0, 0],
[ 0, 10, 0],
[ 0, 1, 11]])
print(classification_report(predictions_test, test_labels))
precision recall f1-score support
accuracy 0.97 30
macro avg 0.97 0.97 0.97 30
weighted avg 0.97 0.97 0.97 30
MNIST DATASET
We have already used the MNIST dataset in the chapter Testing with MNIST of our tutorial. You will also find
some explanations about this dataset.
We want to apply the MLPClassifier on the MNIST data. We can load in the data with pickle:
import pickle
train_imgs = data[0]
test_imgs = data[1]
train_labels = data[2]
mlp = MLPClassifier(hidden_layer_sizes=(100, ),
max_iter=480, alpha=1e-4,
solver='sgd', verbose=10,
tol=1e-4, random_state=1,
learning_rate_init=.1)
train_labels = train_labels.reshape(train_labels.shape[0],)
print(train_imgs.shape, train_labels.shape)
mlp.fit(train_imgs, train_labels)
print("Training set score: %f" % mlp.score(train_imgs, train_label
s))
print("Test set score: %f" % mlp.score(test_imgs, test_labels))
Parameters
----------
X : ndarray or sparse matrix of shape (n_samples, n_features)
The input data.
Returns
-------
self : returns a trained MLP model.
plt.show()
Alpha is a parameter for regularization term, aka penalty term, that combats overfitting by constraining the
size of the weights. Increasing alpha may fix high variance (a sign of overfitting) by encouraging smaller
weights, resulting in a decision boundary plot that appears with lesser curvatures. Similarly, decreasing alpha
may fix high bias (a sign of underfitting) by encouraging larger weights, potentially resulting in a more
complicated decision boundary.
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classi
fication
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
alphas = np.logspace(-1, 1, 5)
classifiers = []
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(name)
ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lst
rip('0'),
size=15, horizontalalignment='right')
i += 1
EXERCISES
EXERCISE 1
SOLUTIONS
SOLUTION TO EXERCISE 1
import pandas as pd
dataset = pd.read_csv("data/strange_flowers.txt",
header=None,
names=["red", "green", "blue", "size", "labe
l"],
sep=" ")
dataset
The first four columns contain the data and the last column contains the labels:
We have to scale the data now to reduce the biases between the data:
mlp = MLPClassifier(hidden_layer_sizes=(100, ),
max_iter=480,
alpha=1e-4,
solver='sgd',
tol=1e-4,
random_state=1,
learning_rate_init=.1)
mlp.fit(X_train, y_train)
print("Training set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))
Training set score: 0.971698
Test set score: 0.981132
INTRODUCTION
In [ ]:
from sklearn.datasets imp
ort load_digits
digits = load_digits()
digits.keys()
Output: dict_keys(['data', 'target', 'frame', 'feature_names', 'targe
t_names', 'images', 'DESCR'])
The digits dataset contains 1797 images and each images contains 64 features, which correspond to the pixels:
print(digits.data[0])
[ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5. 0.
0. 3.
15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8. 8. 0. 0. 5.
8. 0.
0. 9. 8. 0. 0. 4. 11. 0. 1. 12. 7. 0. 0. 2. 14. 5. 1
0. 12.
0. 0. 0. 0. 6. 13. 10. 0. 0. 0.]
print(digits.target)
[0 1 2 ... 8 9 8]
The data is also available at digits.images. This is the raw data of the images in the form of 8 lines and 8
columns.
With "data" an image corresponds to a one-dimensional Numpy array with the length 64, and "images"
representation contains 2-dimensional numpy arrays with the shape (8, 8)
mlp = MLPClassifier(hidden_layer_sizes=(5,),
activation='logistic',
alpha=1e-4,
solver='sgd',
tol=1e-4,
random_state=1,
learning_rate_init=.3,
verbose=True)
predictions = mlp.predict(test_data)
predictions[:25] , test_labels[:25]
Output: (array([1, 5, 0, 7, 7, 0, 6, 1, 5, 4, 9, 2, 7, 8, 4, 1, 7,
3, 7, 4, 7, 4,
8, 6, 0]),
array([1, 5, 0, 7, 1, 0, 6, 1, 5, 4, 9, 2, 7, 8, 4, 6, 9,
3, 7, 4, 7, 1,
8, 6, 0]))
In [ ]:
In [ ]:
In [ ]:
DEFINITION
In machine learning, a Bayes classifier is a simple probabilistic
classifier, which is based on applying Bayes' theorem. The
feature model used by a naive Bayes classifier makes strong
independence assumptions. This means that the existence of a
particular feature of a class is independent or unrelated to the
existence of every other feature.
CONDITIONAL PROBABILITY
P(A | B) stands for "the conditional probability of A given B", or "the probability of A under the condition B",
i.e. the probability of some event A under the assumption that the event B took place. When in a random
experiment the event B is known to have occurred, the possible outcomes of the experiment are reduced to B,
and hence the probability of the occurrence of A is changed from the unconditional probability into the
conditional probability given B. The Joint probability is the probability of two events in conjunction. That is, it
is the probability of both events together. There are three notations for the joint probability of A and B. It can
be written as
• P(A ∩ B)
• P(AB) or
• P(A, B)
P(A ∩ B)
P(A | B) =
P(B)
There are about 8.4 million people living in Switzerland. About 64 % of them speak German. There are about
If some aliens randomly beam up an earthling, what are the chances that he is a German speaking Swiss?
S: being Swiss
8.4
P(S) = = 0.00112
7500
If we know that somebody is Swiss, the probability of speaking German is 0.64. This corresponds to the
conditional probability
P(GS | S) = 0.64
So the probability of the earthling being Swiss and speaking German, can be calculated by the formula:
P(GS ∩ S)
P(GS | S) =
P(S)
P(GS ∩ S)
0.64 =
0.00112
and
P(GS ∩ S) = 0.0007168
So our aliens end up with a chance of 0.07168 % of getting a German speaking Swiss person.
A medical research lab proposes a screening to test a large group of people for a disease. An argument against
such screenings is the problem of false positive screening results.
Suppose 0,1% of the group suffer from the disease, and the rest is well:
and
If you have the disease, the test will be positive 99% of the time, and if you don't have it, the test will be
negative 99% of the time:
and
Finally, suppose that when the test is applied to a person having the disease, there is a 1% chance of a false
negative result (and 99% chance of getting a true positive result), i.e.
and
Problem:
In many cases even medical professionals assume that "if you have this sickness, the test will be positive in 99
% of the time and if you don't have it, the test will be negative 99 % of the time. Out of the 1098 cases that
report positive results only 99 (9 %) cases are correct and 999 cases are false positives (91 %), i.e. if a person
gets a positive test result, the probability that he or she actually has the disease is just about 9 %. P("sick" |
"test positive") = 99 / 1098 = 9.02 %
BAYES' THEOREM
We calculated the conditional probability P(GS | S), which was the probability that a person speaks German, if
P(GS, S)
P(GS | S) =
P(S)
What about calculating the probability P(S | GS), i.e. the probability that somebody is Swiss under the
assumption that the person speeks German?
P(GS, S)
P(S | GS) =
P(GS)
As the left sides are equal, the right sides have to be equal as well:
P(GS | S)P(S)
P(S | GS) =
P(GS)
To solve our problem, - i.e. the probability that a person is Swiss, if we know that he or she speaks German -
all we have to do is calculate the right side. We know already from our previous exercise that
P(GS | S) = 0.64
and
P(S) = 0.00112
The number of German native speakers in the world corresponds to 101 millions, so we know that
101
P(GS) = = 0.0134667
7500
Finally, we can calculate P(S | GS) by substituting the values in our equation:
If the some aliens randomly beam up an earthling, what are the chances that he is a German speaking Swiss?
8.4
P(S) = = 0.00112
7500
P(B | A)P(A)
P(A | B) =
P(B)
P(A | B) is the conditional probability of A, given B (posterior probability), P(B) is the prior probability of B
and P(A) the prior probability of A. P(B | A) is the conditional probability of B given A, called the likely-hood.
An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate
the parameters necessary for classification. Because independent variables are assumed, only the variances of
the variables for each class need to be determined and not the entire covariance matrix.
INTRODUCTORY EXERCISE
The following lists 'in_time' (the train from Hamburg arrived in time to catch the connecting train to Munich)
and 'too_late' (connecting train is missed) are data showing the situation over some weeks. The first
component of each tuple shows the minutes the train was late and the second component shows the number of
time this occurred.
%matplotlib inline
X, Y = zip(*in_time)
X2, Y2 = zip(*too_late)
bar_width = 0.9
plt.bar(X, Y, bar_width, color="blue", alpha=0.75, label="in tim
From this data we can deduce that the probability of catching the connecting train if we are one minute late is
1, because we had 19 successful cases experienced and no misses, i.e. there is no tuple with 1 as the first
component in 'too_late'.
We will denote the event "train arrived in time to catch the connecting train" with S (success) and the 'unlucky'
event "train arrived too late to catch the connecting train" with M (miss)
We can now define the probability "catching the train given that we are 1 minute late" formally:
P(S | 1) = 19 / 19 = 1
We used the fact that the tuple (1, 19) is in 'in_time' and there is no tuple with the first component 1 in
'too_late'
It's getting critical for catching the connecting train to Munich, if we are 6 minutes late. Yet, the chances are
still 60 %:
P(S | 6) = 9 / 9 + 6 = 0.6
Accordingly, the probability for missing the train knowing that we are 6 minutes late is:
P(M | 6) = 6 / 9 + 6 = 0.4
We can write a 'classifier' function, which will give the probability for catching the connecting train:
in_time_dict = dict(in_time)
def catch_the_train(min):
s = in_time_dict.get(min, 0)
if s == 0:
return 0
else:
m = too_late_dict.get(min, 0)
return s / (s + m)
-1 0
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 0.6
7 0.4375
8 0.25
9 0.15
10 0.14285714285714285
11 0.11764705882352941
12 0
We will use a file called 'person_data.txt'. It contains 100 random person data, male and female, with body
sizes, weights and gender tags.
import numpy as np
Warning: There might be some confusion between a Python class and a Naive Bayes class. We try to avoid it
by saying explicitly what is meant, whenever possible!
We will now define a Python class "Feature" for the features, which we will use for classification later.
The Feature class needs a label, e.g. "heights" or "firstnames". If the feature values are numerical we may
want to "bin" them to reduce the number of possible feature values. The heights from our persons have a huge
range and we have only 50 measured values for our Naive Bayes classes "male" and "female". We will bin
them into ranges "130 to 134", "135 to 139", "140 to 144" and so on by setting bin_width to 5. There is no
way of binning the first names, so bin_width will be set to None.
The method frequency returns the number of occurrencies for a certain feature value or a binned range.
class Feature:
We will create now two feature classes Feature for the height values of the person data set. One Feature class
contains the height for the Naive Bayes class "male" and one the heights for the class "female":
fts = {}
for gender in genders:
fts[gender] = Feature(heights[gender], name=gender, bin_widt
h=5)
print(gender, fts[gender].freq_dict)
male {160: 5, 195: 2, 180: 5, 165: 4, 200: 3, 185: 8, 170: 6, 15
5: 1, 190: 8, 175: 7}
female {160: 8, 130: 1, 165: 11, 135: 1, 170: 7, 140: 0, 175: 2, 1
45: 3, 180: 4, 150: 5, 185: 0, 155: 7}
We printed out the frequencies of our bins, but it is a lot better to see these values dipicted in a bar chart. We
will do this with the following code:
plt.legend(loc='upper right')
plt.show()
We have to design now a Naive Bayes class in Python. We will call it NBclass. An NBclass contains one or
more Feature classes. The name of the NBclass will be stored in self.name.
class NBclass:
def probability_value_given_feature(self,
feature_value,
feature):
"""
p_value_given_feature returns the probability p
for a feature_value 'value' of the feature to occurr
corresponds to P(d_i | p_j)
where d_i is a feature variable of the feature i
"""
if feature.freq_sum == 0:
In the following code, we will create NBclasses with one feature, i.e. the height feature. We will use the
Feature classes of fts, which we have previously created:
cls = {}
for gender in genders:
cls[gender] = NBclass(gender, fts[gender])
The final step for creating a simple Naive Bayes classifier consists in writing a class 'Classifier', which will
use our classes 'NBclass' and 'Feature'.
class Classifier:
nbclasses = self.nbclasses
probability_list = []
for nbclass in nbclasses:
ftrs = nbclass.features
prob = 1
for i in range(len(ftrs)):
prob *= nbclass.probability_value_given_featur
e(d[i], ftrs[i])
We will create a classifier with one feature class 'height'. We check it with values between 130 and 220 cm.
c = Classifier(cls["male"], cls["female"])
There are no persons - neither male nor female - in our learn set, with a body height between 140 and 144.
This is the reason, why our classifier can't base its result on learned data and therefore comes back with a fify-
fifty result.
fts = {}
c = Classifier(cls["male"], cls["female"])
The name "Jessie" is an ambiguous name. There are about 66 boys per 100 girls with this name. We can learn
from the previous classification results that the probability for the name "Jessie" being "female" is about two-
thirds, which is calculated from our data set "person":
Jessie Washington is only 159 cm tall. If we have a look at the results of our Classifier, trained with heights,
we see that the likelihood for a person 159 cm tall of being "female" is 0.875. So what about an unknown
person called "Jessie" and being 159 cm tall? Is this person female or male?
To answer this question, we will train an Naive Bayes classifier with two feature classes, i.e. heights and
firstnames:
cls = {}
for gender in genders:
fts_heights = Feature(heights[gender], name="heights", bin_wid
c = Classifier(cls["male"], cls["female"])
P(d | c j)P(c j)
P(c j | d) =
P(d)
where
• P(c j | d) is the probability of instance d being in class c_j, it is the result we want to calculate
with our classifier
• P(c j) is the probability for the occurrence of class c j We didn't use it in our classifiers, because
both classes in our example have been equally likely.
• P(d) is the probability for the occurrence of an instance d It's not needed in the calculation,
because it is the same for all classes.
We had used only one feature in our previous examples, i.e. the 'height' or the name.
1
P(d)
is only depending on the values of d 1, d 2, . . . d n. This means that it is a constant as the values of the
feature variables are known.
In [ ]:
model.fit(dataset.data, dataset.target)
print(model)
# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
[[50 0 0]
[ 0 47 3]
[ 0 3 47]]
We use our person data from the previous chapter of our tutorial to train another classifier in the next example:
import numpy as np
def prepare_person_dataset(fname):
genders = ["male", "female"]
persons = []
with open(fname) as fh:
for line in fh:
persons.append(line.strip().split())
firstnames = []
dataset = [] # weight and height
learnset = prepare_person_dataset("data/person_data.txt")
testset = prepare_person_dataset("data/person_data_testset.txt")
print(learnset)
model.fit(w, l)
print(model)
w, l = zip(*testset)
w = np.array(w)
l = np.array(l)
predicted = model.predict(w)
print(predicted)
print(l)
# summarize the fit of the model
print(metrics.classification_report(l, predicted))
print(metrics.confusion_matrix(l, predicted))
[[40 10]
[19 31]]
In [ ]:
In [ ]:
INTRODUCTION
Document classification/categorization is a topic in information science, a
science dealing with the collection, analysis, classification, categorization,
manipulation, retrieval, storage and propagation of information.
This might sound very abstract, but there are lots of situations nowadays,
where companies are in need of automatic classification or categorization
of documents. Just think about a large company with thousands of
incoming mail pieces per day, both electronic or paper based. Lot's of these
mail pieces are without specific addressee names or departments.
Somebody has to read these texts and has to decide what kind of a letter it is ("change of address", "complaints
letter", "inquiry about products", and so on) and to whom the document should be proceeded. This
"somebody" can be an automated text classification system.
The task of text classification consists in assigning a document to one or more categories, based on the
semantic content of the document. Document (or text) classification runs in two modes:
We will implement a text classifier in Python using Naive Bayes. Naive Bayes is the most commonly used text
classifier and it is the focus of research in text classification. A Naive Bayes classifier is based on the
application of Bayes' theorem with strong independence assumptions. "Strong independence" means: the
presence or absence of a particular feature of a class is unrelated to the presence or absence of any other
feature. Naive Bayes is well suited for multiclass text classification.
FORMAL DEFINITION
Let C = { c1, c2, ... cm} be a set of categories (classes) and D = { d1, d2, ... dn} a set of documents.
The task of the text classification consists in assigning to each pair ( ci, dj ) of C x D (with 1 ≤ i ≤ m and 1 ≤ j
≤ n) a value of 0 or 1, i.e. the value 0, if the document dj doesn't belong to ci
d1 ... dj ... dn
c1 a 11 ... a 1j ... a 1n
ci a i1 ... a ij ... a in
cm a m1 ... a mj ... a mn
• Naive Bayes
• Support Vector Machine
▪ Nearest Neighbour
The probability for a class cj is the quotient of the number of Documents of cj and the number of documents of
all classes, i.e. the learn set:
Finally, we come to the formula we need to classify an unknown document, i.e. the probability for a class cj
given a document di:
We can rewrite the previous formula into the following form, our final Naive Bayes classification formula, the
one we will use in our Python implementation in the following chapter:
FURTHER READING
There are lots of articles on text classification. We just name a few, which we have used for our work:
INTRODUCTION
In the previous chapter, we have deduced the formula for calculating the
probability that a document d belongs to a category or class c, denoted as
P(c|d).
Python is ideal for text classification, because of it's strong string class with
powerful methods. Furthermore the regular expression module re of Python
provides the user with tools, which are way beyond other programming
languages.
The only downside might be that this Python implementation is not tuned
for efficiency.
DOCUMENT REPRESENTATION
The document representation, which is based on the bag of word model, is illustrated in the following
diagram:
Our implementation needs the regular expression module re and the os module:
import re
import os
We will use in our implementation the function dict_merge_sum from the exercise 1 of our chapter on
dictionaries:
dict_merge_sum(d1, d2)
Output: {'e': 9, 'd': 18, 'b': 5, 'a': 5}
BAGOFWORDSCLASS
class BagOfWords(object):
""" Implementing a bag of words, words corresponding with thei
r
frequency of usages in a "document" for usage by the
Document class, Category class and the Pool class."""
def __init__(self):
self.__number_of_words = 0
self.__bag_of_words = {}
erg = BagOfWords()
erg.__bag_of_words = dict_merge_sum(self.__bag_of_words,
other.__bag_of_words)
return erg
def add_word(self,word):
""" A word is added in the dictionary __bag_of_words"""
self.__number_of_words += 1
if word in self.__bag_of_words:
self.__bag_of_words[word] += 1
else:
self.__bag_of_words[word] = 1
def len(self):
""" Returning the number of different words of an object
"""
return len(self.__bag_of_words)
def Words(self):
""" Returning a list of the words contained in the object
"""
return self.__bag_of_words.keys()
def WordFreq(self,word):
""" Returning the frequency of a word """
if word in self.__bag_of_words:
return self.__bag_of_words[word]
else:
return 0
class Document(object):
""" Used both for learning (training) documents and for testin
g documents. The optional parameter lear
has to be set to True, if a classificator should be trained. I
f it is a test document learn has to be set to False. """
_vocabulary = BagOfWords()
self._number_of_words = 0
for word in words:
self._words_and_freq.add_word(word)
if learn:
def __add__(self,other):
""" Overloading the "+" operator. Adding two documents con
sists in adding the BagOfWords of the Documents """
res = Document(Document._vocabulary)
res._words_and_freq = self._words_and_freq + other._word
s_and_freq
return res
def vocabulary_length(self):
""" Returning the length of the vocabulary """
return len(Document._vocabulary)
def WordsAndFreq(self):
""" Returning the dictionary, containing the words (keys)
with their frequency (values) as contained
in the BagOfWords attribute of the document"""
return self._words_and_freq.BagOfWords()
def Words(self):
""" Returning the words of the Document object """
d = self._words_and_freq.BagOfWords()
return d.keys()
def WordFreq(self,word):
""" Returning the number of times the word "word" appeare
d in the document """
bow = self._words_and_freq.BagOfWords()
if word in bow:
return bow[word]
else:
return 0
This is the class consisting of the documents for one category /class. We use the term category instead of
"class" so that it will not be confused with Python classes:
class Category(Document):
def __init__(self, vocabulary):
Document.__init__(self, vocabulary)
self._number_of_docs = 0
def Probability(self,word):
""" returns the probabilty of the word "word" given the cl
ass "self" """
voc_len = Document._vocabulary.len()
SumN = 0
for i in range(voc_len):
SumN = Category._vocabulary.WordFreq(word)
N = self._words_and_freq.WordFreq(word)
erg = 1 + N
erg /= voc_len + SumN
return erg
def __add__(self,other):
""" Overloading the "+" operator. Adding two Category obje
cts consists in adding the
BagOfWords of the Category objects """
res = Category(self._vocabulary)
res._words_and_freq = self._words_and_freq + other._word
s_and_freq
return res
def NumberOfDocuments(self):
return self._number_of_docs
The pool is the class, where the document classes are trained and kept:
class Pool(object):
d = Document(self.__vocabulary)
d.read_document(doc)
for j in self.__document_classes:
sum_j = self.sum_words_in_class(j)
prod = 1
for i in d.Words():
wf_dclass = 1 + self.__document_classes[dclas
s].WordFreq(i)
wf = 1 + self.__document_classes[j].WordFre
To be able to learn and test a classifier, we offer a "Learn and test set to Download". The module NaiveBayes
consists of the code we have provided so far, but it can be downloaded for convenience as NaiveBayes.py The
learn and test sets contain (old) jokes labelled in six categories: "clinton", "lawyer", "math", "medical",
"music", "sex".
import os
base = "data/jokes/learn/"
p = Pool()
for dclass in DClasses:
p.learn(base + dclass, dclass)
print(results[:10])
FOOTNOTES
1
Please see our "Further Reading" section of our previous chapter
INTRODUCTION
We mentioned in the introductory chapter of our tutorial that a
spam filter for emails is a typical example of machine learning.
Emails are based on text, which is why a classifier to classify
emails must be able to process text as input. If we look at the
previous examples with neural networks, they always run
directly with numerical values and have a fixed input length. In
the end, the characters of a text also consist of numerical values,
but it is obvious that we cannot simply use a text as it is as input
for a neural network. This means that the text have to be
converted into a numerical representation, e.g. vectors or arrays
of numbers.
We will learn in this tutorial how to encode text in a way which is suitable for machine processing.
BAG-OF-WORDS MODEL
If we want to use texts in machine learning, we need a representation of the text which is usable for Machine
Learning purposes. This means we need a numerical representation. We cannot use texts directly.
In natural language processing and information retrievel the bag-of-words model is of crucial importance. The
bag-of-words model can be used to represent text data in a way which is suitable for machine learning
algorithms. Furthermore, this model is easy and efficient to implement. In the bag-of-words model, a text
(such as a sentence or a document) is represented as the so-called bag (a set or multiset) of its words.
We will use in the following a list of three strings to demonstrate the bag-of-words approach. In linguistics, the
collection of texts used for the experiments or tests is usually called a corpus:
We will use the submodule text from sklearn.feature_extraction . This module contains
utilities to build feature vectors from text documents.
First we need an instance of this class. When we instantiate a CountVectorizer, we can pass some optional
parameters, but it is possible to call it with no arguments, as we will do in the following. Printing the
vectorizer gives us useful information about the default values used when the instance was created:
vectorizer = text.CountVectorizer()
print(vectorizer)
CountVectorizer()
We have now an instance of CountVectorizer, but it has not seen any texts so far. We will use the method
fit to process our previously defined corpus. We learn a vocabulary dictionary of all the tokens (strings) of
the corpus:
vectorizer.fit(corpus)
Output: CountVectorizer()
fit created the vocabulary structure vocabulary_ . This contains the words of the text as keys and a
unique integer value for each word. As the default value for the parameter lowercase is set to True , the
To in the beginning of the text has been turned into to . You may also notice that the vocabulary contains
only words without any punctuation or special character. You can change this behaviour by assigning a regular
expression to the keyword parameter token_pattern of the fit method. The default is set to
(?u)\\b\\w\\w+\\b . The (?u) part of this regular expression is not necessary because it switches on
the re.U ( re.UNICODE ) flag for this expression, which is the default in Python anyway. The minimal
word length will be two characters:
If you only want to see the words without the indices, you can your the method feature_names :
print(vectorizer.get_feature_names())
['and', 'arrows', 'be', 'fortune', 'in', 'is', 'mind', 'nobler',
'not', 'of', 'or', 'outrageous', 'question', 'slings', 'suffer',
'that', 'the', 'tis', 'to', 'whether']
Alternatively, you can apply keys to the vocaulary to keep the ordering:
print(list(vectorizer.vocabulary_.keys()))
['to', 'be', 'or', 'not', 'that', 'is', 'the', 'question', 'whethe
r', 'tis', 'nobler', 'in', 'mind', 'suffer', 'slings', 'and', 'arr
ows', 'of', 'outrageous', 'fortune']
With the aid of transform we will extract the token counts out of the raw text documents. The call will
use the vocabulary which we created with fit :
token_count_matrix = vectorizer.transform(corpus)
print(token_count_matrix)
The connection between the corpus, the Vocabulary vocabulary_ and the vector created by
transform can be seen in the following image:
Just in case: You might see that people use sometimes todense instead of toarray .
Do not use todense!1
dense_tcm = token_count_matrix.toarray()
dense_tcm
Output: array([[0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1,
0, 2, 0],
[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
1, 1, 1],
[1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1,
0, 0, 0]])
The rows of this array correspond to the strings of our corpus. The length of a row corresponds to the length of
the vocabulary. The i'th value in a row corresponds to the i'th entry of the list returned by CountVectorizer
feature_names = vectorizer.get_feature_names()
for el in vectorizer.vocabulary_:
print(el)
to
be
or
not
that
is
the
question
whether
tis
nobler
in
mind
suffer
slings
and
arrows
of
outrageous
fortune
import pandas as pd
pd.DataFrame(data=dense_tcm,
index=['corpus_0', 'corpus_1', 'corpus_2'],
corpus_0 0 0 2 0 0 1 0 0 1 0 1 0 1
corpus_1 0 0 0 0 1 0 1 1 0 0 0 0 0
corpus_2 1 1 0 1 0 0 0 0 0 1 0 1 0
word = "be"
i = 1
j = vectorizer.vocabulary_[word]
print("number of times '" + word + "' occurs in:")
for i in range(len(corpus)):
print(" '" + corpus[i] + "': " + str(dense_tcm[i][j]))
number of times 'be' occurs in:
'To be, or not to be, that is the question:': 2
'Whether 'tis nobler in the mind to suffer': 0
'The slings and arrows of outrageous fortune,': 0
We will extract the token counts out of new text documents. Let's use a literally doubtful variation of Hamlet's
famous monologue and check what transform has to say about it. transform will use the vocabulary
which was previously fitted with fit.
print(vectorizer.get_feature_names())
['and', 'arrows', 'be', 'fortune', 'in', 'is', 'mind', 'nobler',
'not', 'of', 'or', 'outrageous', 'question', 'slings', 'suffer',
'that', 'the', 'tis', 'to', 'whether']
print(vectorizer.vocabulary_)
{'to': 18, 'be': 2, 'or': 10, 'not': 8, 'that': 15, 'is': 5, 'th
e': 16, 'question': 12, 'whether': 19, 'tis': 17, 'nobler': 7, 'i
n': 4, 'mind': 6, 'suffer': 14, 'slings': 13, 'and': 0, 'arrows':
1, 'of': 9, 'outrageous': 11, 'fortune': 3}
vectorizer = text.CountVectorizer()
vectorizer.fit(corpus)
token_count_matrix = vectorizer.transform(corpus)
print(token_count_matrix)
tf_idf = text.TfidfTransformer()
tf_idf.fit(token_count_matrix)
tf_idf.idf_
tf_idf.idf_[vectorizer.vocabulary_['python']]
Output: 1.916290731874155
da = vectorizer.transform(corpus).toarray()
i = 0
# check how often the word 'would' occurs in the the i'th sentenc
e:
#vectorizer.vocabulary_['would']
word_ind = vectorizer.vocabulary_['would']
da[i][word_ind]
da[:,word_ind]
Output: array([0, 1, 0, 1])
corpus = ["It does not matter what you are doing, just do it!",
"Would you work if you won the lottery?",
"You like Python, he likes Python, we like Python, every
body loves Python!"
"You said: 'I wish I were a Python programmer'",
"You can stay here, if you want to. I would, if I were y
ou."
]
n = len(corpus)
TERM FREQUENCY
Some notations:
The simplest choice to define tf(t,d) is to use the raw count of a term in a document, i.e., the number of times
that term t occurs in document d, which we can denote as f t , d
"""
if t in vectorizer.vocabulary_:
word_ind = vectorizer.vocabulary_[t]
t_occurences = da[d, word_ind] # 'd' is the document in
return result
'matter' in 'It does not matter what you are doing, just do it!''
1.00 0.10 0.69 0.75
'matter' in 'Would you work if you won the lottery?''
0.00 0.00 0.00 0.50
'matter' in 'You like Python, he likes Python, we like Python, eve
rybody loves Python!You said: 'I wish I were a Python programme
r'''
0.00 0.00 0.00 0.50
'matter' in 'You can stay here, if you want to. I would, if I wer
e you.''
0.00 0.00 0.00 0.50
'python' in 'It does not matter what you are doing, just do it!''
0.00 0.00 0.00 0.50
'python' in 'Would you work if you won the lottery?''
0.00 0.00 0.00 0.50
'python' in 'You like Python, he likes Python, we like Python, eve
rybody loves Python!You said: 'I wish I were a Python programme
r'''
5.00 0.42 1.79 1.00
'python' in 'You can stay here, if you want to. I would, if I wer
e you.''
0.00 0.00 0.00 0.50
'would' in 'It does not matter what you are doing, just do it!''
0.00 0.00 0.00 0.50
'would' in 'Would you work if you won the lottery?''
1.00 0.14 0.69 0.75
'would' in 'You like Python, he likes Python, we like Python, ever
ybody loves Python!You said: 'I wish I were a Python programmer'''
0.00 0.00 0.00 0.50
'would' in 'You can stay here, if you want to. I would, if I were
you.''
1.00 0.11 0.69 0.67
DOCUMENT FREQUENCY
The document frequency df of a term t is defined as the number of documents in the document set that contain
the term t.
df(t) = | {d ∈ D : t ∈ d} |
The inverse document frequency is a measure of how much information the word provides, i.e., if it's common
or rare across all documents. It is the logarithmically scaled inverse fraction of the document frequency. The
effect of adding 1 to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all
documents in a training set, will not be entirely ignored.
n
idf(t) = log( df ( t ) ) + 1
(Note that the idf formula above differs from the standard textbook notation that defines the idf as
n
idf(t) = log( df ( t ) + 1 ).)
The formula above is used, when TfidfTransformer() is called with smooth_idf=False ! If it is called
with smooth_idf=True (the default) the constant 1 is added to the numerator and denominator of the
idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero
divisions:
n+1
idf(t) = log( df ( t ) + 1 ) + 1
A high value of tf–idf means that the term has a high "term frequency" in the given document and a low
"document frequency" in the other documents of the corpus. This means that this wieght can be used to filter
out common terms.
def df(t):
""" df(t) is the document frequency of t; the document frequen
cy is
the number of documents in the document set that contain
the term t. """
#df("would", vectorizer)
res_idf = []
for word in vectorizer.get_feature_names():
tf_docus = []
res_idf.append([word, idf(word)])
res_idf.sort(key=lambda x:x[1])
for item in res_idf:
print(item)
corpus
Output: ['It does not matter what you are doing, just do it!',
'Would you work if you won the lottery?',
"You like Python, he likes Python, we like Python, everybod
y loves Python!You said: 'I wish I were a Python programme
r'",
'You can stay here, if you want to. I would, if I were yo
u.']
We will use another simple example to illustrate the previously introduced concepts. We use a sentence which
contains solely different words. The corpus consists of this sentence and reduced versions of it, i.e. cutting of
words from the end of the sentence.
print(corpus)
['Cold', 'Cold wind', 'Cold wind blows', 'Cold wind blows over',
'Cold wind blows over the', 'Cold wind blows over the cornfields']
vectorizer = text.CountVectorizer()
vectorizer = vectorizer.fit(corpus)
vectorized_text = vectorizer.transform(corpus)
tf_idf = text.TfidfTransformer()
tf_idf.fit(vectorized_text)
tf_idf.idf_
Output: array([1.33647224, 1. , 2.25276297, 1.55961579, 1.8472
9786,
1.15415068])
import numpy as np
Let us have a closer look at this data. As with all the other data sets in sklearn we can find the actual data
under the attribute data :
print(newsgroups_data.data[0])
Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----
print(newsgroups_data.data[200])
vectorizer.fit(newsgroups_data.data)
Output: CountVectorizer()
counter = 0
n = 10
for word, index in vectorizer.vocabulary_.items():
print(word, index)
counter += 1
if counter > n:
break
We can turn the newsgroup postings into arrays. We do it with the first one:
a = vectorizer.transform([newsgroups_data.data[0]]).toarray()[0]
print(a)
[0 0 0 ... 0 0 0]
len(vectorizer.vocabulary_)
Output: 130107
There are a lot of 'rubbish' words in this vocabulary. rubish means seen from the perspective of machine
learning. For machine learning purposes words like 'Subject', 'From', 'Organization', 'Nntp-Posting-Host',
'Lines' and many others are useless, because they occur in all or in most postings. The technical 'garbage' from
the newsgroup can be easily stripped off. We can fetch it differently. Stating that we do not want 'headers',
'footers' and 'quotes':
print(newsgroups_data_cleaned.data[0])
print(newsgroups_data.data[0])
From: [email protected] (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15
Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----
vectorizer_cleaned = vectorizer.fit(newsgroups_data_cleaned.data)
len(vectorizer_cleaned.vocabulary_)
So, we got rid of more than 30000 words, but with more than a 100000 words is it still very large.
We can also directly separate the newsgroup feeds into a train and test set:
newsgroups_train = fetch_20newsgroups(subset='train',
remove=('headers', 'footer
s', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test',
remove=('headers', 'footer
s', 'quotes'))
vectorizer = CountVectorizer()
train_data = vectorizer.fit_transform(newsgroups_train.data)
# creating a classifier
classifier = MultinomialNB(alpha=.01)
classifier.fit(train_data, newsgroups_train.target)
test_data = vectorizer.transform(newsgroups_test.data)
predictions = classifier.predict(test_data)
accuracy_score = metrics.accuracy_score(newsgroups_test.target,
predictions)
f1_score = metrics.f1_score(newsgroups_test.target,
predictions,
average='macro')
So far we added all the words to the vocabulary. However, it is questionable whether words like "the", "am",
"were" or similar words should be included at all, since they usually do not provide any significant semantic
contribution for a text. In other words: They have limited predictive power. It would therefore make sense to
exclude such words from editing, i.e. inclusion in the dictionary. This means we have to provide a list of
words which should be neglected, i.e. being filtered out before or after processing text. In natural text
recognition such words are usually called "stop words". There is no single universal list of stop words defined,
which could be used by all natural language processing tools. Usually, stop words consist of the most
frequently used words in a language. "Stop words" can be individually chosen for a given task.
By the way, stop words are an idea which is quite old. It goes back to 1959 and Hans Peter Luhn, one of the
pioneers in information retrieval.
cv = CountVectorizer(input=corpus,
stop_words=["my", "for","the", "has", "tha
n", "if",
"from", "on", "of", "it", "ther
e", "ve",
"as", "no", "be", "which", "is
n", "to",
"me", "is", "can", "then"])
count_vector = cv.fit_transform(corpus)
count_vector.shape
cv.vocabulary_
Output: {'horse': 5,
'kingdom': 8,
'sense': 16,
'thing': 18,
'keeps': 7,
'betting': 1,
'people': 13,
'often': 11,
'said': 15,
'nothing': 10,
'better': 0,
'inside': 6,
'man': 9,
'outside': 12,
'spiritually': 17,
'well': 20,
'physically': 14,
'bigger': 2,
'foot': 3,
'heaven': 4,
'welcome': 19}
sklearn contains default stop words, which are implemented as a frozenset and it can be accessed
with text.ENGLISH_STOP_WORDS :
vectorizer = CountVectorizer(stop_words=text.ENGLISH_STOP_WORDS)
vectors = vectorizer.fit_transform(newsgroups_train.data)
# creating a classifier
classifier = MultinomialNB(alpha=.01)
classifier.fit(vectors, newsgroups_train.target)
vectors_test = vectorizer.transform(newsgroups_test.data)
predictions = classifier.predict(vectors_test)
accuracy_score = metrics.accuracy_score(newsgroups_test.target,
predictions)
f1_score = metrics.f1_score(newsgroups_test.target,
predictions,
average='macro')
As in many other cases, it is a good idea to look for ways to automatically define a list of stop words. A list
that is or should be ideally adapted to the problem.
To automatically create a stop word list, we will start with the parameter min_df of
CountVectorizer . When you set this threshold parameter, terms that have a document frequency strictly
lower than the given threshold will be ignored. This value is also called cut-off in the literature. If a float value
in the range of [0.0, 1.0] is used, the parameter represents a proportion of documents. An integer will be
treated as absolute counts. This parameter is ignored if vocabulary is not None.
cv = CountVectorizer(input=corpus,
min_df=2)
count_vector = cv.fit_transform(corpus)
cv.vocabulary_
Output: {'people': 7,
'you': 9,
'cannot': 0,
'is': 3,
'horse': 2,
'my': 5,
'for': 1,
'on': 6,
'there': 8,
'man': 4}
Hardly any words from our corpus text are left. Because we have only few documents (strings) in our corpus
and also because these texts are very short, the number of words which occur in less then two documents is
We can also see the words which have been chosen as stopwords by looking at cv.stop_words_ :
cv.stop_words_
cv = CountVectorizer(input=corpus,
max_df=0.20)
count_vector = cv.fit_transform(corpus)
cv.stop_words_
EXERCISES
EXERCISE 1
Use these novels as the corpus and create a word count vector.
EXERCISE 2
Turn the previously calculated 'word count vector' into a dense ndarray representation.
EXERCISE 3
Let us have another example with a different corpus. The five strings are famous quotes from
1. William Shakespeare
2. W.C. Fields
3. Ronald Reagan
4. John Steinbeck
5. Author unknown
SOLUTIONS
SOLUTION TO EXERCISE 1
corpus = []
books = ["night_and_day_virginia_woolf.txt",
"the_way_of_all_flash_butler.txt",
"moby_dick_melville.txt",
"sons_and_lovers_lawrence.txt",
"robinson_crusoe_defoe.txt",
corpus = []
for book in books:
txt = open(path + "/" + book).read()
corpus.append(txt)
We have to get rid of the Gutenberg header and footer, because it doesn't belong to the novels. We can see by
looking at the texts that the authors works begins after lines of the following kind
There may or may not be a space after the first three stars or instead of "the" there may be "this".
We can use regular expressions to find the starting point of the novels:
corpus = []
books = ["night_and_day_virginia_woolf.txt",
"the_way_of_all_flash_butler.txt",
"moby_dick_melville.txt",
"sons_and_lovers_lawrence.txt",
"robinson_crusoe_defoe.txt",
"james_joyce_ulysses.txt"]
path = "books"
corpus = []
for book in books:
txt = open(path + "/" + book).read()
text_begin = re.search(r"\*\*\* ?START OF (THE|THIS) PROJEC
vectorizer = text.CountVectorizer()
vectorizer.fit(corpus)
token_count_matrix = vectorizer.transform(corpus)
print(token_count_matrix)
SOLUTION TO EXERCISE 2
All you have to do is applying the method toarray to get the token_count_matrix :
token_count_matrix.toarray()
Output: array([[ 0, 0, 0, ..., 0, 0, 0],
[19, 0, 0, ..., 0, 0, 0],
[20, 0, 0, ..., 0, 1, 1],
[ 0, 0, 1, ..., 0, 0, 0],
[ 0, 0, 0, ..., 0, 0, 0],
[11, 1, 0, ..., 1, 0, 0]])
SOLUTION TO EXERCISE 3
# our corpus:
quotes = ["A horse, a horse, my kingdom for a horse!",
"Horse sense is the thing a horse has which keeps it fro
m betting on people."
"I’ve often said there is nothing better for the inside
of the man, than the outside of the horse.",
"A man on a horse is spiritually, as well as physicall
y, bigger then a man on foot.",
"No heaven can heaven be, if my horse isn’t there to wel
come me."]
vectorizer = text.CountVectorizer()
vectorizer.fit(quotes)
vectorized_text = vectorizer.fit_transform(quotes)
tfidf_transformer = text.TfidfTransformer(smooth_idf=True,use_id
"""
alternative way to output the data:
import pandas as pd
df_idf = pd.DataFrame(tfidf_transformer.idf_,
index=vectorizer.get_feature_names(),
columns=["idf_weight"])
df_idf.sort_values(by=['idf_weights']) # sorting data
print(df_idf)
"""
print(f"{'word':15s}: idf_weight")
word_weight_list = list(zip(vectorizer.get_feature_names(), tfid
f_transformer.idf_))
word_weight_list.sort(key=lambda x:x[1]) # sort list by the weigh
ts (2nd component)
for word, idf_weight in word_weight_list:
print(f"{word:15s}: {idf_weight:4.3f}")
INTRODUCTION
One might think that it might not be that difficult to get good
text material for examples of text classification. After all, hardly
a minute goes by in our daily lives that we are not dealing with
written language. Newspapers, books, and most of all, most of
the internet is probably still text-based. For our example
classifiers, however, the texts must be in machine-readable form
and preferably in simple text files, i.e. not formatted in Word or
other formats. In addition, the texts may not be protected by
copyright.
AUTHOR PREDICTION
We want to demonstrate the concepts of the previous chapter of our Machine Learning tutorial in an extended
example. We will use the following novels:
Will will train a classifier with these novels. This classifier should be able to predict the author from an
arbitrary text passage.
txt = open(filename).read()
paragraphs = [para for para in txt.split("\n\n") if len(para)
> min_size]
return paragraphs
path = "books/"
import random
vectorizer = CountVectorizer(stop_words=ENGLISH_STOP_WORDS)
vectors = vectorizer.fit_transform(train_data)
vectors_test = vectorizer.transform(test_data)
predictions = classifier.predict(vectors_test)
accuracy_score = metrics.accuracy_score(test_targets,
predictions)
f1_score = metrics.f1_score(test_targets,
predictions,
average='macro')
We will test this classifier now with a different book of Virginia Woolf.
predictions = classifier.predict(vectors_test)
print(predictions)
targets = [0] * (last_para - first_para)
accuracy_score = metrics.accuracy_score(targets,
predictions)
precision_score = metrics.precision_score(targets,
predictions,
average='macro')
f1_score = metrics.f1_score(targets,
predictions,
average='macro')
predictions = classifier.predict_proba(vectors_test)
print(predictions)
[[6.26578058e-004 2.51943113e-002 4.85163038e-008 4.75065393e-005
4.00835263e-014 9.74131556e-001]
[7.12081909e-001 4.92957656e-002 5.37096844e-003 1.68824845e-009
4.99835718e-013 2.33251355e-001]
[1.11615265e-001 1.70149726e-009 8.02170949e-013 1.93038351e-008
3.38381992e-017 8.88384714e-001]
...
[9.99433053e-001 5.66946558e-004 6.87847449e-032 2.49682983e-019
9.56365457e-038 3.61259105e-033]
[9.99999991e-001 7.95355880e-009 9.29384687e-029 2.81898441e-033
1.49766211e-060 8.27077882e-010]
[1.00000000e+000 2.80028853e-054 1.53409474e-068 4.12917577e-086
3.33829236e-115 1.78467356e-057]]
You may have hoped for a better result and you may be disappointed. Yet, this result is on the other hand quite
impressive. In nearly 60 % of all cases we got the label 0, which stand for Virginia Woolf and her novel "Night
The paragraph with the index 100 was predicted as being "Ulysses by James Joyce". This paragraph contains
the name "Samuel Johnson". "Ulysses" contains many occurences of "Samuel" and "Johnson", whereas "Night
We had trained a Naive Bayes classifier by using MultinomialNB . We want to train now a Neural
Network. We will use MLPClassifier in the following. Be warned: It will take a long time, unless you
have an extremely fast computer. On my computer it takes about five minutes!
vectorizer = CountVectorizer(stop_words=ENGLISH_STOP_WORDS)
vectors = vectorizer.fit_transform(train_data)
vectors_test = vectorizer.transform(test_data)
predictions = classifier.predict(vectors_test)
accuracy_score = metrics.accuracy_score(test_targets,
predictions)
f1_score = metrics.f1_score(test_targets,
predictions,
average='macro')
LANGUAGE PREDICTION
We will train now a classifier which will be capable of recognizing the language of a text for the languages:
We will use two books of each language for training and testing purposes. The authors and book titles should
be recognizable in the following file names:
path = "books/various_languages/"
files = os.listdir("books/various_languages")
labels = {fname[:2] for fname in files if fname.endswith(".txt")}
labels = sorted(list(labels))
labels
Output: ['de', 'dk', 'en', 'es', 'fr', 'it', 'nl', 'se']
print(files)
data = []
targets = []
import random
vectorizer = CountVectorizer(stop_words=ENGLISH_STOP_WORDS)
vectors = vectorizer.fit_transform(train_data)
# creating a classifier
classifier = MultinomialNB(alpha=.01)
classifier.fit(vectors, train_targets)
vectors_test = vectorizer.transform(test_data)
predictions = classifier.predict(vectors_test)
accuracy_score = metrics.accuracy_score(test_targets,
predictions)
f1_score = metrics.f1_score(test_targets,
predictions,
average='macro')
Let us check this classifiert with some abitrary text in different languages:
vtest = vectorizer.transform(some_texts)
predictions = classifier.predict(vtest)
for label in predictions:
print(label, labels[label])
Thats where Regression Trees come in. Regression Trees work in principal in the same way as Classification
Trees with the large difference that the target feature values can now take on an infinite number of
continuously scaled values. Hence the task is now to predict the value of a continuously scaled target feature Y
given the values of a set of categorically (or continuously) scaled descriptive features X.
1. If the splitting process leads to a empty dataset, return the mode target feature value of the
original dataset
2. If the splitting process leads to a dataset where no features are left, return the mode target feature
value of the direct parent node
3. If the splitting process leads to a dataset where the target feature values are pure, return this
value
If we now consider the property of our new continuously scaled target feature we mention that the third
stopping criteria can no longer be used since the target feature values can now take on an infinite number of
different values. Consequently, it is most likely that we will not find pure target feature values until there is
only one instance left in the dataset.
To make a long story short, there is in general nothing like pure target feature values.
To address this issue, we will introduce an early stopping criteria that returns the average value of the target
feature values left in the dataset if the number of instances in the dataset is ≤ 5.
In general, while handling with Regression Trees we will return the average target feature values as prediction
at a leaf node.
The second change we have to make becomes apparent when we consider the splitting process itself.
While working with Classification Trees we used the Information Gain (IG) of a feature as splitting criteria.
That is, the feature with the largest IG was used to split the dataset on. Consider the following example where
we examine only one descriptive feature, lets say the number of bedrooms, and the costs of the house as target
feature.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Number_of_Bedrooms':[2,2,4,1,3,1,4,2],'Price_o
f_Sale':[100000,120000,250000,80000,220000,170000,500000,75000]})
df
0 2 100000
1 2 120000
2 4 250000
3 1 80000
4 3 220000
5 1 170000
6 4 500000
7 2 75000
| D Number of Bedrooms = j |
H(Number of Bedrooms) = ∑ j ∈ Number of Bedrooms ∗ ( |D|
∗ ( ∑ k ∈ Price of Sale ∗ ( − P(k | j) ∗ log2(P(k | j))
If we calculate the weighted entropies, we see that for j = 3, we get a weighted entropy of 0. We get this result
because there is only one house in the dataset with 3 bedrooms. On the other hand, for j = 2 (occurs three
times) we will get a weighted entropy of 0.59436.
To make a long story short, since our target feature is continuously scaled, the IGs of the categorically scaled
descriptive features are no longer appropriate splitting criteria.
Well, we could instead categorize the target feature along its values where for instance housing prices between
$0 and $80000 are categorized as low, between $80001 and $150000 as middle and > $150001
as high.
What we have done here is converting our regression problem into kind of a classification problem. Though,
since we want to be able to make predictions from a infinite number of possible values (regression) this is not
what we are looking for.
Lets come back to our initial issue: We want to have a splitting criteria which allows us to split the dataset in
such a way that when arriving a tree node, the predicted value (we defined the predicted value as the mean
target feature value of the instances at this leaf node where we defined the minimum number of 5 instances as
early stopping criteria) is closest to the actual value.
It turns out that the variance is one of the most commonly used splitting criteria for regression trees where we
will use the variance as splitting criteria.
The explanation therefore is, that we want to search for the feature attributes which most exactly point to the
Well, obviously that one with the smallest variance! We will introduce the maths behind the measure of
variance in the next section.
For the time being we start by illustrating these by arrows where wide arrows represent a high variance and
slim arrows a low variance. We can illustrate that by showing the variance of the target feature for each value
of the descriptive feature. As you can see, the feature layout which minimizes the variance of the target feature
values when we split the dataset along the values of the descriptive feature is the feature layout which most
As stated above, the task during growing a Regression Tree is in principle the same as during the creation of
Classification Trees. Though, since the IG turned out to be no longer an appropriate splitting criteria (neither is
the Gini Index) due to the continuous character of the target feature we must have a new splitting criteria.
Variance
∑n ( y i − yˉ )
i =1
Var(x) = n−1
Where y i are the single target feature values and yˉ is the mean of these target feature values.
Taking the example from above the total variance of the Prize_of_Sale target feature is calculated with:
( 100000 − 189375 ) 2 + ( 120000 − 189375 ) 2 + ( 250000 − 189375 ) 2 + ( 80000 − 189375 ) 2 + ( 220000 − 189375 ) 2 + ( 170000 − 18
Var(Price of Sale) = 7
Since we want to know which descriptive feature is best suited to split the target feature on, we have to
calculate the variance for each value of the descriptive feature with respect to the target feature values.
Hence for the Number_of_Rooms descriptive feature above we get for the single numbers of rooms:
Since we now want to also address the issue that there are feature values which occur relatively rarely but
have a high variance (This could lead to a very high variance for the whole feature just because of one outliner
feature value even though the variance of all other feature values may be small) we address this by calculating
the weighted variance for each feature value with:
2
WeightVar(Number of Rooms = 2) = 8 ∗ 508333333.3 = 190625000
2
WeightVar(Number of Rooms = 3) = 8 ∗ 0 = 0
2
WeightVar(Number of Rooms = 4) = 8 ∗ 31250000000 = 7812500000
Finally, we sum up these weighted variances to make an assessment about the feature as a whole:
Putting all this together finally leads to the formula for the weighted feature variance which we will use at
each node in the splitting process to determine which feature we should choose to split our dataset on next.
|f=l|
feature[choose] = argminf ∈ features ∑ l ∈ levels ( f ) | f | ∗ Var(t, f = l)
∑n ( t i − tˉ ) 2
|f=l| i = 1
= argminf ∈ features ∑ l ∈ levels ( f ) | f | ∗ n−1
Here f denotes a single feature, l denotes the value of a feature (e.g Price == medium), t denotes the value of
the target feature in the subset where f=l.
Following this calculation specification we find the feature at each node to split our dataset on.
import pandas as pd
df = pd.read_csv("data/day.csv",usecols=['season','holiday','weekd
ay','weathersit','cnt'])
df_example = df.sample(frac=0.012)
Season
= 16429.1
Weekday
Weathersit
4 ( 421 − 174.2 ) 2 + ( 165 − 174.2 ) 2 + ( 12 − 174.2 ) 2 + ( 161 − 174.2 ) 2 + ( 112 − 174.2 ) 2 2 ( 352 − 230.5 ) 2 + ( 109
WeightVar(Weathersit) = 9 ∗ 4
+9∗ 1
Since the Weekday feature has the lowest variance, this feature is used to split the dataset on and hence serves
as root node. Though due to random sampling, this example is not that robust (for instance there is no instance
ID3(D,Feature_Attributes,Target_Attr
ibutes,min_instances=5)
Create a root node r
Set r to the mean of the target feature values in D #######Cha
nged########
If num_instances <= min_instances :
return r
Else:
pass
If Feature_Attributes is empty:
return r
Else:
Att = Attribute from Feature_Attributes with the lowest we
ighted variance ########Changed########
r = Att
For values in Att:
Add a new node below r where node_values = (Att == val
ues)
In addition to the changes in the actual algorithm we also have to use another measure of accuracy because we
are no longer dealing with categorical target feature values. That is, we can no longer simply compare the
predicted classes with the real classes and calculate the percentage where we bang on the target. Instead we are
using the root mean square error (RMSE) to measure the "accuracy" of our model.
( t i − Model ( test i ) ) 2
√
∑n
i = i
RMSE = n
Where t i are the actual test target feature values of a test dataset and Model(test i) are the values predicted by
our trained regression tree model for these t i. In general, the lower the RMSE value, the better our model fits
the actual data.
Since we now have adapted our principal ID3 classification tree algorithm to handle continuously scaled target
features and therewith have made it to a regression tree model, we can start implementing these changes in
Python.
Therefore we simply take the classification tree model from the previous chapter and implement the two
changes mentioned above.
As announced for the implementation of our regression tree model we will use the UCI bike sharing dataset
where we will use all 731 instances as well as a subset of the original 16 attributes. As attributes we use the
features: {'season', 'holiday', 'weekday', 'workingday', 'wheathersit', 'cnt'} where the {'cnt'} feature serves as
our target feature and represents the number of total rented bikes per day.
The first five rows of the dataset look as follows:
import pandas as pd
dataset = pd.read_csv("data/day.csv",usecols=['season','holida
y','weekday','workingday','weathersit','cnt'])
dataset.sample(frac=1).head()
Output:
season holiday weekday workingday weathersit cnt
458 2 0 2 1 1 6772
245 3 0 6 0 1 4484
86 2 0 1 1 1 2028
333 4 0 3 1 1 3613
507 2 0 2 1 2 6073
We will now start adapting the originally created classification algorithm. For further comments to the code I
refer the reader to the previous chapter about Classification Trees.
"""
Make the imports of python packages needed
"""
import pandas as pd
import numpy as np
from pprint import pprint
import matplotlib.pyplot as plt
#Import the dataset and define the feature and target columns#
dataset = pd.read_csv("data/day.csv",usecols=['season','holida
y','weekday','workingday','weathersit','cnt']).sample(frac=1)
mean_data = np.mean(dataset.iloc[:,-1])
##################################################################
#########################################
##################################################################
#########################################
"""
Calculate the varaince of a dataset
This function takes three arguments.
1. data = The dataset for whose feature the variance should be cal
culated
2. split_attribute_name = the name of the feature for which the we
ighted variance should be calculated
3. target_name = the name of the target feature. The default for t
his example is "cnt"
"""
def var(data,split_attribute_name,target_name="cnt"):
feature_values = np.unique(data[split_attribute_name])
feature_variance = 0
for value in feature_values:
#Create the data subsets --> Split the original data alon
g the values of the split_attribute_name feature
# and reset the index to not run into an error while usin
g the df.loc[] operation below
subset = data.query('{0}=={1}'.format(split_attribute_nam
e,value)).reset_index()
#Calculate the weighted variance of each subse
t
value_var = (len(subset)/len(data))*np.var(subset[target_n
ame],ddof=1)
#Calculate the weighted variance of the feature
feature_variance+=value_var
return feature_variance
#If the dataset is empty, return the mean target feature valu
e in the original dataset
elif len(data)==0:
return np.mean(originaldata[target_attribute_name])
#If the feature space is empty, return the mean target featur
e value of the direct parent node --> Note that
#the direct parent node is that node which has called the curr
ent run of the algorithm and hence
#the mean target feature value is stored in the parent_node_cl
ass variable.
else:
#Set the default value for this node --> The mean target f
eature value of the current node
parent_node_class = np.mean(data[target_attribute_name])
#Select the feature which best splits the dataset
#Create the tree structure. The root gets the name of the
feature (best_feature) with the minimum variance.
tree = {best_feature:{}}
#Remove the feature with the lowest variance from the feat
ure space
features = [i for i in features if i != best_feature]
#Grow a branch under the root node for each possible valu
e of the root node feature
return tree
##################################################################
#########################################
##################################################################
#########################################
##################################################################
#########################################
##################################################################
#########################################
"""
Create a training as well as a testing set
"""
def train_test_split(dataset):
training_data = dataset.iloc[:int(0.7*len(dataset))].reset_ind
ex(drop=True)#We drop the index respectively relabel the index
#starting form 0, because we do not want to run into errors re
garding the row labels / indexes
testing_data = dataset.iloc[int(0.7*len(dataset)):].reset_inde
x(drop=True)
return training_data,testing_data
training_data = train_test_split(dataset)[0]
testing_data = train_test_split(dataset)[1]
##################################################################
#########################################
##################################################################
#########################################
"""
Compute the RMSE
##################################################################
#########################################
##################################################################
#########################################
"""
Train the tree, Print the tree and predict the accuracy
"""
tree = Classification(training_data,training_data,training_data.co
lumns[:-1],5,'cnt')
pprint(tree)
print('#'*50)
print('Root mean square error (RMSE): ',test(testing_data,tree))
6.0: 2398.1071428571427}},
1.0:
1.0: {'holiday': {0.0:
{1.0: 3284.28,
2.0: 3284.28,
3.0: 3284.28,
4.0: 3284.28,
5.0: 3284.28}}}}}},
2.0: {'holiday': {0.0: {'weekday': {0.0: 258
1.0: 218
65,
2.0: {'w
{1.0: 2140.6666666666665}},
3.0: {'w
{1.0: 2049.0}},
4.0: {'w
{1.0: 3105.714285714286}},
5.0: {'w
{1.0: 2844.5454545454545}},
6.0: {'w
{0.0: 1757.111111111111}}}},
1.0: 1040.0}},
3.0: 473.5}},
2: {'weathersit': {1.0: {'workingday': {0.0: {'weekday': {0.0:
{0.0: 5728.2}},
1.0:
6667,
5.0:
6.0:
{0.0: 6206.142857142857}}}},
1.0: {'holiday': {0.0:
{1.0: 5340.06,
2.0: 5340.06,
3.0: 5340.06,
4.0: 5340.06,
6.0: 4349.7692307692305}},
1.0:
{1.0: 4446.294117647059,
2.0: 4446.294117647059,
3.0: 4446.294117647059,
4.0: 4446.294117647059,
5.0: 5975.333333333333}}}}}},
3.0: 1169.0}},
3: {'weathersit': {1.0: {'holiday': {0.0: {'workingday': {0.0:
{0.0: 5715.0,
6.0: 5715.0}},
1.0:
{1.0: 6148.342857142857,
2.0: 6148.342857142857,
3.0: 6148.342857142857,
4.0: 6148.342857142857,
5.0: 6148.342857142857}}}},
1.0: 7403.0}},
2.0: {'workingday': {0.0: {'holiday': {0.0:
{0.0: 4537.5,
6.0: 5028.8}},
1.0:
1.0: {'holiday': {0.0:
{1.0: 6745.25,
2.0: 5222.4,
3.0: 5554.0,
4.0: 4580.0,
5.0: 5389.409090909091}}}}}},
6.0: 4974.772727272727}},
1.0:
{1.0: 5174.906976744186,
2.0: 5174.906976744186,
3.0: 5174.906976744186,
4.0: 5174.906976744186,
5.0: 5174.906976744186}}}},
1.0: 3101.25}},
2.0: {'weekday': {0.0: 3795.6666666666665,
1.0: 4536.0,
2.0: {'holiday': {0.0: {'w
{1.0: 4440.875}}}},
3.0: 5446.4,
4.0: 5888.4,
5.0: 5773.6,
6.0: 4215.8}},
3.0: {'weekday': {1.0: 1393.5,
2.0: 2946.6666666666665,
3.0: 1840.5,
6.0: 627.0}}}}}}
##################################################
Root mean square error (RMSE): 1623.9891244058906
Above we can see RMSE for a minimum number of 5 instances per node. But for the time being, we have no
idea how bad or good that is. To get a feeling about the "accuracy" of our model we can plot kind of a learning
curve where we plot the number of minimal instances against the RMSE.
"""
Plot the RMSE with respect to the minimum number of instances
"""
fig = plt.figure()
ax0 = fig.add_subplot(111)
RMSE_test = []
RMSE_train = []
for i in range(1,100):
tree = Classification(training_data,training_data,training_dat
ax0.plot(range(1,100),RMSE_test,label='Test_Data')
ax0.plot(range(1,100),RMSE_train,label='Train_Data')
ax0.legend()
ax0.set_title('RMSE with respect to the minumim number of instance
s per node')
ax0.set_xlabel('#Instances')
ax0.set_ylabel('RMSE')
plt.show()
As we can see, increasing the minimum number of instances per node leads to a lower RMSE of our test data
until we reach approximately the number of 50 instances per node. Here the Test_Data curve kind of flattens
out and an additional increase in the minimum number of instances per leaf does not dramatically decrease the
RMSE of our testing set.
tree = Classification(training_data,training_data,training_data.co
lumns[:-1],50,'cnt')
pprint(tree)
2.0: 6093.058823529412,
3.0: 6043.6,
4.0: 6538.428571428572,
5.0: 6050.2307692307695}}}},
1.0: 7403.0}},
2.0: 5242.617647058823,
3.0: 2276.0}},
4: {'weathersit': {1.0: {'holiday': {0.0: {'workingday': {0.0:
2727,
1.0:
4186}},
1.0: 3101.25}},
2.0: 4894.861111111111,
3.0: 1961.6}}}}
Since we have now build a Regression Tree model from scratch we will use sklearn's prepackaged Regression
Tree model sklearn.tree.DecisionTreeRegressor. The procedure follows the general sklearn API and is as
always:
RMSE = np.sqrt(np.sum(((testing_data.iloc[:,-1]-predicted)**2)/le
n(testing_data.iloc[:,-1])))
RMSE
Output: 1592.7501629176463
With a parameterized minimum number of 5 instances per leaf node, we get nearly the same RMSE as with
"""
Plot the RMSE with respect to the minimum number of instances
"""
fig = plt.figure()
ax0 = fig.add_subplot(111)
RMSE_train = []
RMSE_test = []
for i in range(1,100):
#Paramterize the model and let i be the number of minimum inst
ances per leaf node
regression_model = DecisionTreeRegressor(criterion="mse",min_s
amples_leaf=i)
#Train the model
regression_model.fit(training_data.iloc[:,:-1],training_data.i
loc[:,-1:])
#Predict query instances
predicted_train = regression_model.predict(training_data.ilo
c[:,:-1])
predicted_test = regression_model.predict(testing_data.ilo
c[:,:-1])
#Calculate and append the RMSEs
RMSE_train.append(np.sqrt(np.sum(((training_data.iloc[:,-1]-pr
edicted_train)**2)/len(training_data.iloc[:,-1]))))
RMSE_test.append(np.sqrt(np.sum(((testing_data.iloc[:,-1]-pred
icted_test)**2)/len(testing_data.iloc[:,-1]))))
ax0.plot(range(1,100),RMSE_test,label='Test_Data')
ax0.plot(range(1,100),RMSE_train,label='Train_Data')
ax0.legend()
ax0.set_title('RMSE with respect to the minumim number of instance
s per node')
ax0.set_xlabel('#Instances')
ax0.set_ylabel('RMSE')
plt.show()
References:
• https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-
in-python/
• http://nbviewer.jupyter.org/gist/jwdink/9715a1a30e8c7f50a572
• John D. Kelleher, Brian Mac Namee, Aoife D'Arcy, 2015. Machine Learning for Predictiive
Data Analytics. Cambridge, Massachusetts: The MIT Press.
• Lior Rokach, Oded Maimon, 2015. Data Mining with Decision Trees. 2nd Ed. Ben-Gurion,
Israel, Tel-Aviv, Israel: Wolrd Scientific.
• Tom M. Mitchel, 1997. Machine Learning. New York, NY, USA: McGraw-Hill.
TensorFlow is an open-source software library for machine learning across a range of tasks. It is a symbolic
math library, and also used as a system for building and training neural networks to detect and decipher
patterns and correlations, analogous to human learning and reasoning. It is used for both research and
production at Google often replacing its closed-source predecessor, DistBelief. TensorFlow was developed by
the Google Brain team for internal Google use. It was released under the Apache 2.0 open source license on 9
November 2015.
TensorFlow provides a Python API as well as C++, Haskell, Java, Go and Rust APIs.
TENSORFLOW 437
STRUCTURE OF TENSORFLOW PROGRAMS
EXAMPLE
import tensorflow as tf
# Computational Graph:
c1 = tf.constant(0.034)
c2 = tf.constant(1000.0)
x = tf.multiply(c1, c1)
y = tf.multiply(c1, c2)
final_node = tf.add(x, y)
import tensorflow as tf
# Computational Graph:
c1 = tf.constant(0.034, dtype=tf.float64)
c2 = tf.constant(1000.0, dtype=tf.float64)
x = tf.multiply(c1, c1)
y = tf.multiply(c1, c2)
final_node = tf.add(x, y)
TENSORFLOW 438
with tf.Session() as sess:
result = sess.run(final_node)
print(result, type(result))
34.001156 <class 'numpy.float64'>
import tensorflow as tf
# Computational Graph:
A computational graph is a series of TensorFlow operations arranged into a graph of nodes. Let's build a
simple computational graph. Each node takes zero or more tensors as inputs and produces a tensor as an
output. Constant nodes take no input.
Printing the nodes does not output a numerical value. We have defined a computational graph but no
numerical evaluation has taken place!
print(c1)
print(x)
print(final_node)
Tensor("Const_6:0", shape=(4,), dtype=float64)
Tensor("Mul_6:0", shape=(4,), dtype=float64)
Tensor("Add_3:0", shape=(4,), dtype=float64)
TENSORFLOW 439
To evaluate the nodes, we have to run the computational graph within a session. A session encapsulates the
control and state of the TensorFlow runtime. The following code creates a Session object and then invokes its
run method to run enough of the computational graph to evaluate node1 and node2. By running the
computational graph in a session as follows. We have to create a session object:
session = tf.Session()
Now, we can evaluate the computational graph by starting the run method of the session object:
result = session.run(final_node)
print(result)
print(type(result))
[ 23.12 165.62 2.88 162. ]
<class 'numpy.ndarray'>
session.close()
It is usually a better idea to work with the with statement, as we did in the introductory examples!
SIMILARITY TO NUMPY
We will rewrite the following program with Numpy.
import tensorflow as tf
session = tf.Session()
x = tf.range(12)
print(session.run(x))
x2 = tf.reshape(tensor=x,
shape=(3, 4))
x2 = tf.reduce_sum(x2, reduction_indices=[0])
res = session.run(x2)
print(res)
x3 = tf.eye(5, 5)
res = session.run(x3)
print(res)
TENSORFLOW 440
[ 0 1 2 3 4 5 6 7 8 9 10 11]
[12 15 18 21]
[[ 1. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1.]]
import numpy as np
x = np.arange(12)
print(x)
x2 = x.reshape((3, 4))
res = x2.sum(axis=0)
print(res)
x3 = np.eye(5, 5)
print(x3)
[ 0 1 2 3 4 5 6 7 8 9 10 11]
[12 15 18 21]
[[ 1. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1.]]
TENSORBOARD
• TensorFlow provides functions to debug and optimize programs with the help of a visualization
tool called TensorBoard.
• TensorFlow creates the necessary data during its execution.
• The data are stored in trace files.
• Tensorboard can be viewed from a browser using http://localhost:6006/
We can run the following example program, and it will create the directory "output" We can run now
tensorboard: tensorboard --logdir output
which will create a webserver: TensorBoard 0.1.8 at http://marvin:6006 (Press CTRL+C to quit)
import tensorflow as tf
p = tf.constant(0.034)
TENSORFLOW 441
c = tf.constant(1000.0)
x = tf.add(c, tf.multiply(p, c))
x = tf.add(x, tf.multiply(p, x))
PLACEHOLDERS
A computational graph can be parameterized to accept external inputs, known as placeholders. The values for
placeholders are provided when the graph is run in a session.
TENSORFLOW 442
import tensorflow as tf
c1 = tf.placeholder(tf.float32)
c2 = tf.placeholder(tf.float32)
x = tf.multiply(c1, c1)
y = tf.multiply(c1, c2)
final_node = tf.add(x, y)
Another example:
import tensorflow as tf
import numpy as np
v1 = np.array([3, 4, 5])
v2 = np.array([4, 1, 1])
c1 = tf.placeholder(tf.float32, shape=(3,))
c2 = tf.placeholder(tf.float32, shape=(3,))
x = tf.multiply(c1, c1)
y = tf.multiply(c1, c2)
final_node = tf.add(x, y)
Inserts a placeholder for a tensor that will be always fed. It returns a Tensor that may be used as a handle for
feeding a value, but not evaluated directly.
Important: This tensor will produce an error if evaluated. Its value must be fed using the feed_dict optional
argument to
Session.run()
TENSORFLOW 443
Tensor.eval()
Operation.run()
Args:
Parameter Description
shape: The shape of the tensor to be fed (optional). If the shape is not specified, you can feed a tensor of any shape.
VARIABLES
Variables are used to add trainable parameters to a graph. They are constructed with a type and initial value.
Variables are not initialized when you call tf.Variable. To initialize the variables of a TensorFlow graph, we
have to call global_variables_initializer:
import tensorflow as tf
W = tf.Variable([.5], dtype=tf.float32)
b = tf.Variable([-1], dtype=tf.float32)
x = tf.placeholder(tf.float32)
model = W * x + b
The value can be specified at run time with the feed_dict argument inside Session.run
A placeholder is used for feeding external data into a Tensorflow computation, i.e. from outside of the graph!
TENSORFLOW 444
If you are training a learning algorithm, a placeholder is used for feeding in your training data. This means that
the training data is not part of the computational graph. The placeholder behaves similar to the Python "input"
statement. On the other hand a TensorFlow variable behaves more or less like a Python variable!
Example:
import tensorflow as tf
W = tf.Variable([.5], dtype=tf.float32)
b = tf.Variable([-1], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
model = W * x + b
deltas = tf.square(model - y)
loss = tf.reduce_sum(deltas)
import tensorflow as tf
W = tf.Variable([.5], dtype=tf.float32)
b = tf.Variable([-1], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
model = W * x + b
deltas = tf.square(model - y)
loss = tf.reduce_sum(deltas)
TENSORFLOW 445
W_a = tf.assign(W, [0.])
b_a = tf.assign(b, [1.])
sess.run( W_a )
sess.run( b_a)
# sess.run( [W_a, b_a] ) # alternatively in one 'run'
import tensorflow as tf
W = tf.Variable([.5], dtype=tf.float32)
b = tf.Variable([-1], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
model = W * x + b
deltas = tf.square(model - y)
loss = tf.reduce_sum(deltas)
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
import numpy as np
import matplotlib.pyplot as plt
TENSORFLOW 446
for quantity, suffix in [(1000, "train"), (200, "test")]:
samples = np.random.multivariate_normal([-2, -2], [[1, 0],
[0, 1]], quantity)
plt.plot(samples[:, 0], samples[:, 1], '.', label="bad ones "
+ suffix)
bad_ones = np.column_stack((np.zeros(quantity), samples))
plt.legend()
plt.show()
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
number_of_samples_per_training_step = 100
num_of_epochs = 1
TENSORFLOW 447
num_labels = 2 # should be automatically determined
def evaluation_func(X):
return predicted_class.eval(feed_dict={x:X})
Z = pred_func(np.c_[xs.flatten(), ys.flatten()])
# Z is one-dimensional and will be reshaped into 300 x 300:
Z = Z.reshape(xs.shape)
def get_data(fname):
data = np.loadtxt(fname)
labels = data[:, :1] # array([[ 0.], [ 0.], [ 1.], ...]])
labels_one_hot = (np.arange(num_labels) == labels).astype(np.f
loat32)
data = data[:, 1:].astype(np.float32)
return data, labels_one_hot
data_train = "data/the_good_and_the_bad_ones_train.txt"
data_test = "data/the_good_and_the_bad_ones_test.txt"
train_data, train_labels = get_data(data_train)
test_data, test_labels = get_data(data_test)
TENSORFLOW 448
train_size, num_features = train_data.shape
# Optimization.
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cros
s_entropy)
# For the test data, hold the entire dataset in one constant node.
test_data_node = tf.constant(test_data)
# Evaluation.
predicted_class = tf.argmax(y, 1)
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
sess.run(init)
TENSORFLOW 449
# feed data into the model
train_step.run(feed_dict={x: batch_data, y_: batch_label
s})
TENSORFLOW 450
Bias vector: [-0.78089082 0.78089082]
Weight matrix:
[[-0.80193734 0.8019374 ]
[-0.831303 0.831303 ]]
Wx + b: [[ 1.36599553 -1.36599553]]
softmax(Wx + b): [[ 0.93888813 0.06111182]]
Accuracy on test data: 0.97
Accuracy on training data: 0.9725
[1 1 1 1 0]
In [ ]:
In [ ]:
TENSORFLOW 451