Data Science
Machine Learning
What is machine learning?
“Machine learning is the process by which a computer can work more
accurately as it collects and learns from the data it is given.” (Mike Roberts)
Although machine learning is mainly linked to the data-modeling step of the data
science process, it can be used at almost every step.
The modeling process
L1 and L2 regularization
● The key difference between these techniques is that Lasso (L1) shrinks the less
important features’ coefficients to zero, removing some features altogether. This
works well for feature selection when we have a huge number of features.
● From a practical standpoint, L1 tends to shrink coefficients to zero whereas
L2 tends to shrink coefficients evenly. L1 is therefore useful for feature
selection, as we can drop any variables associated with coefficients that go to
zero. L2, on the other hand, is useful when you have collinear/codependent
features.
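As a minimal sketch of this difference (the synthetic regression data set below is an assumption for illustration), fitting Lasso and Ridge side by side shows L1 driving weak coefficients to exactly zero while L2 merely shrinks them:

# Minimal sketch: L1 (Lasso) vs. L2 (Ridge) shrinkage on synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1: zeroes out the weak coefficients
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all coefficients evenly

print(np.round(lasso.coef_, 2))  # several entries are exactly 0.0
print(np.round(ridge.coef_, 2))  # small but nonzero entries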
4. Predicting new observations
● If you’ve implemented the first three steps successfully, you now have a
performant model that generalizes to unseen data. The process of applying
your model to new data is called model scoring.
● Model scoring involves two steps. First, you prepare a data set that has
features exactly as defined by your model. This boils down to repeating the
data preparation you did in step one of the modeling process but for a new
data set. Then you apply the model on this new data set, and this results in a
prediction.
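A sketch of what model scoring can look like in Scikit-learn; the toy data and the pipeline below are assumptions for illustration, not the chapter’s own code:

# Model scoring sketch: prepare new data exactly as in training, then predict.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train, y_train = make_classification(n_samples=300, n_features=5, random_state=1)

# Bundling the data preparation (scaling) and the model in one pipeline
# guarantees new observations get features exactly as the model expects.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Scoring: the pipeline repeats the preparation on the new data set and
# applies the model, which results in a prediction per observation.
X_new, _ = make_classification(n_samples=5, n_features=5, random_state=2)
print(model.predict(X_new))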
Types of machine learning
● Supervised learning techniques attempt to discern results and learn by trying
to find patterns in a labeled data set; human interaction is required to label
the data.
● Unsupervised learning techniques don’t rely on labeled data and attempt to
find patterns in a data set without human interaction.
Figure 3.3 A simple Captcha control can be used to prevent automated spam
being sent through an online web form.
● With the help of the Naïve Bayes classifier, a simple yet powerful algorithm
for categorizing observations into classes, you can recognize digits from
textual images.
● These images aren’t unlike the Captcha checks many websites have in place to
make sure you’re not a computer trying to hack into the user accounts.
● Let’s see how hard it is to let a computer recognize images of numbers.
● Our research goal is to let a computer recognize images of numbers (step
one of the data science process).
● The data we’ll be working on is the MNIST data set, which is often used in
the data science literature for teaching and benchmarking.
● The MNIST database is a large database of handwritten digits that is
commonly used for training various image processing systems.
● The MNIST images can be found in the datasets package of Scikit-learn and are
already normalized for you (all scaled to the same size: 8x8 pixels), so we won’t need
much data preparation (step three of the data science process).
● But let’s first fetch our data as step two of the data science process, with the
following listing.
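A minimal sketch of what that listing presumably looks like, using Scikit-learn’s datasets package:

# Step two: fetch the digit images that ship with Scikit-learn.
from sklearn import datasets

digits = datasets.load_digits()
print(digits.images.shape)  # (1797, 8, 8): roughly 1,800 images of 8x8 pixels
print(digits.target[:10])   # the label (the digit actually shown) per image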
In the case of a gray image, every matrix entry holds a value that depicts the gray
level to be shown. The following code demonstrates this process and is step four of the
data science process: data exploration.
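A sketch of what that exploration code plausibly looks like, using the pl alias that the text below refers to:

# Step four, data exploration: display one digit image and its gray-value matrix.
import matplotlib.pyplot as pl
from sklearn import datasets

digits = datasets.load_digits()
pl.matshow(digits.images[0], cmap=pl.cm.gray)  # render the matrix as an image
pl.show()
print(digits.images[0])  # the raw matrix of gray values behind the picture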
● For a grayscale image, the pixel value is a single number that represents the
brightness of the pixel. The most common pixel format is the byte image, where this
number is stored as an 8-bit integer, giving a range of possible values from 0 to 255.
● Typically, zero is taken to be black and 255 is taken to be white.
● The Naïve Bayes classifier expects a list of values, but each image in
digits.images is a two-dimensional array (a matrix) reflecting the shape of the
image. To flatten every image into a list of values, we need to call reshape() on
digits.images.
● The net result is an array with one row of 64 values per image; for the first
image it looks something like this:
array([[ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10., 15., 5., 0.,
0., 3., 15., 2., 0., 11., 8., 0., 0., 4., 12., 0., 0., 8., 8., 0.,
0., 5., 8., 0., 0., 9., 8., 0., 0., 4., 11., 0., 1., 12., 7., 0.,
0., 2., 14., 5., 10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.]])
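A sketch of that flattening step, reusing the digits object from the earlier listing:

# Flatten each 8x8 image matrix into a row of 64 gray values for the classifier.
from sklearn import datasets

digits = datasets.load_digits()
data = digits.images.reshape(len(digits.images), -1)  # shape becomes (1797, 64)
print(data[0:1])  # the first image as a single row, as displayed above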
Figure 3.4 Blurry grayscale representation of the number 0 with its corresponding matrix.
The higher the number, the closer it is to white; the lower the number, the closer it is to
black.
Figure 3.5 We’ll turn an image into something usable by the Naïve Bayes classifier by
getting the grayscale value for each of its pixels (shown on the right) and putting those
values in a list.
● Now that we have a way to pass the contents of an image into the classifier, we
need to pass it a training data set so it can start learning how to predict the
numbers in the images.
● We mentioned earlier that Scikit-learn contains a subset of the MNIST database
(1,800 images), so we’ll use that.
● Each image is also labeled with the number it actually shows. From these labeled
examples the classifier builds a probabilistic model in memory of the most likely
digit shown in an image given its grayscale values.
● Once the program has gone through the training set and built the model, we can
then pass it the test set of data to see how well it has learned to interpret the
images using the model.
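A plausible sketch of that training-and-testing code; the 50/50 train/test split here is an assumption of the sketch:

# Train a Gaussian Naive Bayes model on half the images, test on the other half.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

digits = datasets.load_digits()
X = digits.images.reshape(len(digits.images), -1)  # flattened gray values
y = digits.target                                  # the true digit per image

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = GaussianNB().fit(X_train, y_train)  # build the probabilistic model
predicted = model.predict(X_test)           # interpret the unseen test images
print(confusion_matrix(y_test, predicted))  # rows: actual digit; columns: predicted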
● The end result of this code is called a confusion matrix, such as the one shown in
figure 3.6. Returned as a two-dimensional array, it shows on the main diagonal how
often the predicted number was the correct number, and in entry (i, j) how often the
image showed digit i but digit j was predicted.
● Looking at figure 3.6 we can see that the model predicted the number 2 correctly
17 times (at coordinates 3,3), but also that it predicted the number 8 fifteen
times when the image actually showed the number 2 (at coordinates 3,9).
● By discerning which images were misinterpreted, we can train the model further by labeling
them with the correct number they display and feeding them back into the model as a new
training set (step five of the data science process).
● This makes the model more accurate, and so the cycle of learn, predict, correct continues,
with the predictions improving each round.
Unsupervised learning
● Unsupervised learning is a machine learning technique in which the user does not need to
supervise the model.
● Instead, the model works on its own to discover patterns and information that were
previously undetected.
● It mainly deals with unlabeled data.
● Clustering automatically splits the data set into groups based on similarity (see the
sketch after this list).
● Anomaly detection can discover unusual data points in your data set. It is useful
for finding fraudulent transactions.
● Association mining identifies sets of items that often occur together in your
data set.
● Latent variable models are widely used for data preprocessing, such as reducing the
number of features in a data set or decomposing the data set into multiple
components.
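As a small illustration of the first technique in this list (the synthetic blob data below is an assumption for the sketch), clustering groups points without ever seeing a label:

# Unsupervised sketch: k-means discovers groups in unlabeled data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # labels discarded
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])  # group assignments found without supervision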
DISCERNING A SIMPLIFIED LATENT STRUCTURE FROM YOUR DATA
● In statistics, latent variables are variables that are not directly observed but are
rather inferred (through a mathematical model) from other variables that are
observed (directly measured).
● A latent variable is hidden, and therefore can’t be observed.
● Example: Actual customer satisfaction is a hidden or latent factor that can only
be measured in comparison to a manifest variable, or observable factor.
● The company might choose to study observable variables such as sales numbers, the
price per sale, regional trends of purchasing, the gender of the customer, the age of
the customer, the percentage of return customers, and how high a customer ranked the
product on various sites, all in pursuit of the latent factor: customer
satisfaction.
CASE STUDY: FINDING LATENT VARIABLES IN A WINE QUALITY DATA SET
● In this short case study, you’ll use a technique known as Principal Component Analysis
(PCA) to find latent variables in a data set that describes the quality of wine.
● Then you’ll compare how well a set of latent variables works in predicting the quality of
wine against the original observable set.
● You’ll see how to identify and derive those latent variables.
● You’ll also see how to analyze where the sweet spot is (how many new variables return
the most utility) by generating and interpreting a scree plot produced by PCA.
● A scree plot is a line plot of the eigenvalues of factors or principal components in an
analysis.
● The scree plot is used to determine the number of factors to retain in an exploratory
factor analysis
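A sketch of how such a scree plot can be generated with Scikit-learn’s PCA; the UCI red-wine CSV used here is an assumption about where the wine-quality data lives:

# Fit PCA on the 11 physicochemical variables and draw a scree plot.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(url, sep=";")
X = StandardScaler().fit_transform(wine.drop("quality", axis=1))

pca = PCA().fit(X)
components = range(1, len(pca.explained_variance_ratio_) + 1)
plt.plot(components, pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")       # the 'elbow' hints at how many to keep
plt.ylabel("Explained variance ratio")
plt.show()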
● With the initial decision made to reduce the data set from 11 original variables to 5 latent
variables, we can check to see whether it’s possible to interpret or name them based on
their relationships with the originals.
● Actual names are easier to work with than codes such as lv1, lv2, and so on.
The rows in the resulting table (table 3.4) show the mathematical correlations. Or, in
English: the first latent variable, lv1, which captures approximately 28% of the total
information in the set, has the following formula.

lv1 = (fixed acidity * 0.489314) + (volatile acidity * -0.238584) + … + (alcohol * -0.113232)
We can now recode the original data set using only the five latent variables.
Doing this is data preparation again, so we revisit step three of the data
science process.
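A sketch of the recoding, assuming the scaled 11-variable matrix X from the scree-plot sketch above:

# Data preparation revisited: recode 11 original variables into 5 latent ones.
from sklearn.decomposition import PCA

pca = PCA(n_components=5)
Z = pca.fit_transform(X)   # X: the scaled 11-variable matrix from above
print(pca.components_[0])  # the weights of lv1, as in the formula above
print(Z[:3])               # the first three wines in latent-variable form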
Already we can see high values for wine 0 in volatile acidity, while wine 2 is
particularly high in persistent acidity.
COMPARING THE ACCURACY OF THE ORIGINAL DATA SET WITH LATENT VARIABLES
● Now that we’ve decided our data set should be recoded into 5 latent
variables rather than the 11 originals, it’s time to see how well the new
data set works for predicting the quality of wine when compared to the
original.
● We’ll use the Naïve Bayes classifier algorithm we saw in the previous
example for supervised learning to help. Let’s start by seeing how well
the original 11 variables predict the quality of wine.
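A sketch of that comparison, reusing the wine data (X), the latent variables (Z), and the quality column from the sketches above; the 5-fold cross-validation is an assumption of the sketch:

# Compare prediction accuracy: 11 original variables vs. 5 latent variables.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

y = wine["quality"]  # the target, from the earlier wine data frame

nb = GaussianNB()
print(cross_val_score(nb, X, y, cv=5).mean())  # accuracy with all 11 variables
print(cross_val_score(nb, Z, y, cv=5).mean())  # accuracy with 5 latent variables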