Chapter 4: Machine learning
This chapter covers
• Forms of machine learning beyond classification and regression
• Formal evaluation procedures for machine learning models
• Preparing data for deep learning
• Feature engineering
• Tackling overfitting
• The universal workflow for approaching machine learning problems
Four branches of machine learning
• Supervised learning
• Unsupervised learning
• Self-supervised learning
• Reinforcement learning
Supervised learning
• This is by far the most common case
• Learning to map input data to known targets (also called annotations), given a set of examples (often annotated by humans); see the code sketch after this slide
• Examples: optical character recognition, speech recognition, image classification, and language translation
• Sequence generation: given a picture, predict a caption describing it
• Syntax tree prediction: given a sentence, predict its decomposition into a syntax tree
• Object detection: given a picture, draw a bounding box around certain objects inside the picture
• Image segmentation: given a picture, draw a pixel-level mask on a specific object
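A minimal sketch of what this input-to-target mapping looks like in code, assuming Keras and synthetic NumPy data as stand-ins for a real annotated dataset (these names and shapes are illustrative, not from the slide):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in for an annotated dataset: 1,000 samples,
# 20 input features, and a known class label (0-3) for each sample.
inputs = np.random.random((1000, 20))
targets = np.random.randint(0, 4, size=(1000,))

# A small classifier that learns the input-to-target mapping.
model = keras.Sequential([
    layers.Dense(32, activation="relu"),
    layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Supervised learning: fit the model on (inputs, known targets).
model.fit(inputs, targets, epochs=5, batch_size=32)
```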
Unsupervised learning
• Finding interesting transformations of the input data without the help of any targets
• Used for the purposes of data visualization, data compression, or data denoising, or to better understand the correlations present in the data at hand
• Unsupervised learning is the bread and butter of data analytics, and it’s often a necessary step in better understanding a dataset before attempting to solve a supervised-learning problem
• Dimensionality reduction and clustering are well-known categories of unsupervised learning; see the code sketch after this slide
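A minimal sketch of those two categories, assuming scikit-learn and synthetic, unlabeled data (both are assumptions, not part of the slide): PCA for dimensionality reduction and k-means for clustering, with no targets involved.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic, unlabeled data: 500 samples with 10 features each.
data = np.random.random((500, 10))

# Dimensionality reduction: project the data down to 2 components,
# e.g. for visualization or compression.
reduced = PCA(n_components=2).fit_transform(data)

# Clustering: group the samples into 3 clusters without any targets.
cluster_ids = KMeans(n_clusters=3, n_init=10).fit_predict(data)

print(reduced.shape)      # (500, 2)
print(cluster_ids[:10])   # cluster assignment for the first 10 samples
```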
Self-supervised learning
• Self-supervised learning is supervised learning without human-annotated labels
• There are still labels involved (because the learning has to be supervised by something), but they’re generated from the input data, typically using a heuristic algorithm
• Autoencoders are a well-known instance of self-supervised learning, where the generated targets are the input, unmodified; see the code sketch after this slide
• In the same way, trying to predict the next frame in a video given past frames, or the next word in a text given previous words, are instances of self-supervised learning
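A minimal autoencoder sketch, assuming Keras and synthetic data (illustrative stand-ins): the training targets are the inputs themselves, so no human annotation is needed.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic, unlabeled data: 1,000 samples with 64 features each.
data = np.random.random((1000, 64))

# Autoencoder: compress the input to 8 dimensions, then reconstruct it.
autoencoder = keras.Sequential([
    layers.Dense(8, activation="relu"),      # encoder
    layers.Dense(64, activation="sigmoid"),  # decoder
])
autoencoder.compile(optimizer="rmsprop", loss="mse")

# Self-supervised: the generated targets are the inputs, unmodified.
autoencoder.fit(data, data, epochs=5, batch_size=32)
```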
Reinforcement learning
• An agent receives information about its environment and learns to choose actions that will maximize some reward; see the code sketch after this slide
• For instance, a neural network that “looks” at a videogame screen and outputs game actions in order to maximize its score can be trained via reinforcement learning
• Currently, reinforcement learning is mostly a research area and hasn’t yet had significant practical successes beyond games. In time, however, we expect to see reinforcement learning take over an increasingly large range of real-world applications:
• Self-driving cars, robotics, resource management, education, and so on. It’s an idea whose time has come, or will come soon
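A minimal sketch of the agent-environment loop described above, using a hypothetical toy environment and a random policy as placeholders for a real game and a real learning algorithm:

```python
import random

# Hypothetical toy environment: the agent is rewarded for choosing action 1.
def step(action):
    reward = 1.0 if action == 1 else 0.0
    observation = [random.random()]  # next piece of environment information
    return observation, reward

# Hypothetical agent: a random policy; a real RL algorithm would update
# this policy over time to maximize the total reward it receives.
def choose_action(observation):
    return random.choice([0, 1])

observation = [random.random()]
total_reward = 0.0
for t in range(100):
    action = choose_action(observation)
    observation, reward = step(action)
    total_reward += reward
print("total reward:", total_reward)
```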
Evaluating machine-learning models
• In machine learning, the goal is to achieve models that generalize, that is, perform well on never-before-seen data
• Overfitting is the central obstacle
• It’s crucial to be able to reliably measure the generalization power of your model
Training, validation, and test sets
• Why not have two sets: a training set and a test set? You’d train on the training data and evaluate on the test data
• Developing a model always involves tuning its configuration:
• For example, choosing the number of layers or the size of the layers (called the hyperparameters of the model)
• You do this tuning by using as a feedback signal the performance of the model on the validation data
• In essence, this tuning is a form of learning: a search for a good configuration in some parameter space
• As a result, tuning the configuration of the model based on its performance on the validation set can quickly result in overfitting to the validation set, even though your model is never directly trained on it
Training, validation, and test sets
• Information leaks
• Every time you tune a hyperparameter of your model based on the model’s performance on the validation set, some information about the validation data leaks into the model
• If you repeat this many times (running one experiment, evaluating on the validation set, and modifying your model as a result), you’ll leak an increasingly significant amount of information about the validation set into the model
• You care about performance on completely new data, not the validation data, so you need to use a completely different, never-before-seen dataset to evaluate the model: the test dataset
Three classic evaluation recipes
• SIMPLE HOLD-OUT VALIDATION
• Set apart some fraction of your data as your test set. Train on the remaining data, and evaluate on the test set
• You should also reserve a validation set for tuning; see the code sketch after this slide
• It suffers from one flaw:
• If little data is available, your validation and test sets may contain too few samples to be statistically representative of the data at hand
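A minimal sketch of simple hold-out validation, assuming Keras, synthetic labeled data, and a hypothetical build_model() factory (all stand-ins, not from the slide):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in for a labeled dataset.
inputs = np.random.random((1000, 16))
targets = np.random.randint(0, 2, size=(1000,))

# Hypothetical model factory; any compiled model would do here.
def build_model():
    model = keras.Sequential([
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="rmsprop", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Simple hold-out: shuffle, set apart a validation split, train on the rest.
num_validation_samples = 200
indices = np.random.permutation(len(inputs))
val_idx = indices[:num_validation_samples]
train_idx = indices[num_validation_samples:]

model = build_model()
model.fit(inputs[train_idx], targets[train_idx], epochs=5, batch_size=32)
val_loss, val_acc = model.evaluate(inputs[val_idx], targets[val_idx])
# Tune hyperparameters against val_acc, then evaluate only once on a
# held-out test set that was never used during tuning.
```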
Three classic evaluation recipes
• K-FOLD VALIDATION
• With this approach, you split your data into K partitions of equal size. For each partition i, train a model on the remaining K – 1 partitions, and evaluate it on partition i
• Your final score is then the average of the K scores obtained. This method is helpful when the performance of your model shows significant variance based on your train-test split; see the code sketch after this slide
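A minimal sketch of K-fold validation under the same assumptions as the hold-out sketch (Keras, synthetic labeled data, hypothetical build_model()):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in for a labeled dataset.
inputs = np.random.random((1000, 16))
targets = np.random.randint(0, 2, size=(1000,))

def build_model():  # hypothetical model factory
    model = keras.Sequential([
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="rmsprop", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

k = 4
num_validation_samples = len(inputs) // k
validation_scores = []
for fold in range(k):
    # Partition `fold` is the validation split for this iteration.
    start = fold * num_validation_samples
    end = start + num_validation_samples
    val_x, val_y = inputs[start:end], targets[start:end]
    train_x = np.concatenate([inputs[:start], inputs[end:]])
    train_y = np.concatenate([targets[:start], targets[end:]])

    model = build_model()  # a fresh, untrained model for each fold
    model.fit(train_x, train_y, epochs=5, batch_size=32, verbose=0)
    _, val_acc = model.evaluate(val_x, val_y, verbose=0)
    validation_scores.append(val_acc)

# The final score is the average of the K validation scores.
print("mean validation accuracy:", np.mean(validation_scores))
```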