
Fundamentals of machine learning
Chapter 4
This chapter covers
• Forms of machine learning beyond classification and
regression
• Formal evaluation procedures for machine learning models
• Preparing data for deep learning
• Feature engineering
• Tackling overfitting
• The universal workflow for approaching machine learning
problems
Four branches of machine learning
• Supervised learning
• Unsupervised learning
• Self-supervised learning
• Reinforcement learning
Supervised learning
• This is by far the most common case.
• Learning to map input data to known targets (also called annotations), given a set of examples (often annotated by humans).
• Examples: optical character recognition, speech recognition, image classification, and language translation.
• Sequence generation: Given a picture, predict a caption describing it.
• Syntax tree prediction: Given a sentence, predict its decomposition
into a syntax tree
• Object detection: Given a picture, draw a bounding box around certain
objects inside the picture
• Image segmentation: Given a picture, draw a pixel-level mask on a
specific object
Unsupervised learning
• Finding interesting transformations of the input data without
the help of any targets,
• for the purposes of data visualization, data compression, or data
denoising, or to better understand the correlations present in the
data at hand.
• Unsupervised learning is the bread and butter of data
analytics, and it’s often a necessary step in better
understanding a dataset before attempting to solve a
supervised-learning problem.
• Dimensionality reduction and clustering are well-known
categories of unsupervised learning.
Self-supervised learning
• Self-supervised learning is supervised learning without
human-annotated labels
• There are still labels involved (because the learning has to be
supervised by something), but they’re generated from the input data,
typically using a heuristic algorithm
• Autoencoders are a well-known instance of self-supervised learning, where the generated targets are the input itself, unmodified (see the sketch below).
• In the same way, predicting the next frame in a video given past frames, or the next word in a text given previous words, is an instance of self-supervised learning.
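A minimal sketch of the autoencoder idea in Keras: the targets are simply the inputs. All layer sizes, data, and training settings below are illustrative, not taken from the slides.

```python
# A minimal self-supervised autoencoder sketch (Keras Sequential API;
# sizes are illustrative).
import numpy as np
from keras import models, layers

x = np.random.random((1000, 784))  # unlabeled input data, e.g. flattened images

model = models.Sequential([
    layers.Dense(32, activation='relu', input_shape=(784,)),  # encoder
    layers.Dense(784, activation='sigmoid'),                  # decoder
])
model.compile(optimizer='rmsprop', loss='mse')

# The "labels" are the inputs themselves: no human annotation required.
model.fit(x, x, epochs=10, batch_size=128)
```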
Reinforcement learning
• An agent receives information about its environment and learns to choose actions that will maximize some reward.
• For instance, a neural network that “looks” at a videogame screen
and outputs game actions in order to maximize its score can be
trained via reinforcement learning.
• Currently, reinforcement learning is mostly a research area
and hasn’t yet had significant practical successes beyond
games. In time, however, we expect to see reinforcement
learning take over an increasingly large range of real-world
applications:
• self-driving cars, robotics, resource management, education, and so
on. It’s an idea whose time has come, or will come soon.
Evaluating machine-learning models
• In machine learning, the goal is to achieve models that generalize, that is, models that perform well on never-before-seen data.
• Overfitting is the central obstacle.
• It’s crucial to be able to reliably measure the generalization
power of your model.
Training, validation, and test sets
• Why not have two sets: a training set and a test set? You’d
train on the training data and evaluate on the test data
• developing a model always involves tuning its configuration:
• for example, choosing the number of layers or the size of the layers (called
the hyperparameters of the model)
• You do this tuning by using as a feedback signal the performance of
the model on the validation data.
• In essence, this tuning is a form of learning: a search for a good configuration
in some parameter space.
• As a result, tuning the configuration of the model based on its
performance on the validation set can quickly result in overfitting to
the validation set, even though your model is never directly trained
on it.
Training, validation, and test sets
• Information leaks
• Every time you tune a hyperparameter of your model based on the
model’s performance on the validation set, some information about
the validation data leaks into the model.
• If you repeat this many times (running one experiment, evaluating on the validation set, and modifying your model as a result), then you’ll leak an increasingly significant amount of information about the validation set into the model.
• You care about performance on completely new data, not the
validation data, so you need to use a completely different,
never-before-seen dataset to evaluate the model: the test
dataset.
Three classic evaluation recipes
• SIMPLE HOLD-OUT VALIDATION
• Set apart some fraction of your data as your test set. Train on the
remaining data, and evaluate on the test set.
• You should also reserve a validation set.
• It suffers from one flaw:
• if little data is available, then your validation and test sets may contain too few
samples to be statistically representative of the data at hand
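A minimal sketch of this recipe, assuming data and labels are NumPy arrays and get_model() is a hypothetical factory returning a fresh compiled model:

```python
import numpy as np

num_validation_samples = 10000

# Shuffle once so the split is representative (but see "the arrow of time").
indices = np.random.permutation(len(data))
data, labels = data[indices], labels[indices]

val_x, val_y = data[:num_validation_samples], labels[:num_validation_samples]
train_x, train_y = data[num_validation_samples:], labels[num_validation_samples:]

model = get_model()                        # hypothetical model factory
model.fit(train_x, train_y, epochs=10, batch_size=128)
val_score = model.evaluate(val_x, val_y)   # feedback signal for tuning
# After tuning, retrain on train + validation data and evaluate once on the test set.
```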
Three classic evaluation recipes
• K-FOLD VALIDATION
• With this approach, you split your data into K partitions of equal size. For
each partition i, train a model on the remaining K – 1 partitions, and
evaluate it on partition i.
• Your final score is then the average of the K scores obtained. This method is helpful when the performance of your model shows significant variance based on your train-test split.

• Like hold-out validation, you need a distinct validation set for model calibration.
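A sketch of K-fold validation under the same assumptions (data, labels, and a hypothetical get_model() factory):

```python
import numpy as np

k = 4
fold_size = len(data) // k
scores = []
for fold in range(k):
    start, stop = fold * fold_size, (fold + 1) * fold_size
    val_x, val_y = data[start:stop], labels[start:stop]
    train_x = np.concatenate([data[:start], data[stop:]])
    train_y = np.concatenate([labels[:start], labels[stop:]])
    model = get_model()                     # a brand-new model for every fold
    model.fit(train_x, train_y, epochs=10, batch_size=128, verbose=0)
    scores.append(model.evaluate(val_x, val_y, verbose=0))

final_score = np.mean(scores)               # average of the K fold scores
```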
Three classic evaluation recipes
• ITERATED K-FOLD VALIDATION WITH SHUFFLING
• This one is for situations in which you have relatively little data
available and you need to evaluate your model as precisely as
possible.
• I’ve found it to be extremely helpful in Kaggle competitions.
• It consists of applying K-fold validation multiple times, shuffling the
data every time before splitting it K ways. The final score is the
average of the scores obtained at each run of K-fold validation.
• Note that you end up training and evaluating P × K models (where P is the number of iterations you use), which can be very expensive.
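A sketch of the iterated version, assuming the K-fold loop above has been wrapped in a hypothetical run_k_fold(data, labels, k) helper that returns the K fold scores:

```python
import numpy as np

P = 3                      # number of K-fold iterations
all_scores = []
for _ in range(P):
    perm = np.random.permutation(len(data))            # reshuffle before each run
    data, labels = data[perm], labels[perm]
    all_scores.extend(run_k_fold(data, labels, k=4))   # trains K models per run

final_score = np.mean(all_scores)   # average over all P × K scores
```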
Things to keep in mind
• Data representativeness:
• You want both your training set and test set to be representative of the
data at hand
• The arrow of time:
• If you’re trying to predict the future given the past (for example, tomorrow’s weather, stock movements, and so on), you should not randomly shuffle your data before splitting it, because doing so would create a temporal leak. Split chronologically instead (see the sketch after this list).
• Redundancy in your data:
• If some data points in your data appear twice, then shuffling the data and
splitting it into a training set and a validation set will result in redundancy
between the training and validation sets. Make sure your training set and
validation set are disjoint.
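For the arrow-of-time point above, a minimal sketch of a chronological split (assuming data is ordered from oldest to newest; the 70/15/15 fractions are illustrative):

```python
n = len(data)
train = data[: int(0.70 * n)]                 # the past
val   = data[int(0.70 * n): int(0.85 * n)]
test  = data[int(0.85 * n):]                  # the most recent "future"
# No shuffling across these boundaries, so no temporal leak.
```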
Data preprocessing for neural
networks
• VECTORIZATION
• All inputs and targets in a neural network must be tensors of
floating-point data (or, in specific cases, tensors of integers).
• Whatever data you need to process (sound, images, text), you must first turn it into tensors, a step called data vectorization.
• For instance, in the two previous text-classification examples, we started from
text represented as lists of integers (standing for sequences of words), and we
used one-hot encoding to turn them into a tensor of float32 data.
• In the examples of classifying digits and predicting house prices, the data
already came in vectorized form, so you were able to skip this step.
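A sketch of the one-hot step mentioned above, turning lists of word indices into a float32 tensor (the 10,000 dimension is illustrative):

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per sample; a 1.0 at every word index that occurs in the sample.
    results = np.zeros((len(sequences), dimension), dtype='float32')
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

x = vectorize_sequences([[3, 5], [9, 9, 2]])  # -> float32 tensor of shape (2, 10000)
```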
Data preprocessing for neural
networks
• VALUE NORMALIZATION
• In the digit-classification example, you started from image data encoded as
floating point values in the 0–1 range.
• when predicting house prices, you had to normalize each feature
independently so that it had a standard deviation of 1 and a mean of 0.
• To make learning easier for your network, your data should have the
following characteristics:
• Take small values: Typically, most values should be in the 0–1 range.
• Be homogeneous: That is, all features should take values in roughly the same range.
• the following stricter normalization practice is common and can help:
• Normalize each feature independently to have a mean of 0.
• Normalize each feature independently to have a standard deviation of 1.
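The stricter practice in code, assuming x_train and x_test are 2D NumPy arrays of shape (samples, features):

```python
mean = x_train.mean(axis=0)       # per-feature mean
std = x_train.std(axis=0)         # per-feature standard deviation

x_train = (x_train - mean) / std
x_test = (x_test - mean) / std    # always use statistics computed on training data
```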
Data preprocessing for neural
networks
• HANDLING MISSING VALUES
• In general, with neural networks, it’s safe to input missing values as
0, with the condition that 0 isn’t already a meaningful value.
• The network will learn from exposure to the data that the value 0 means
missing data and will start ignoring the value.
• Note that if you’re expecting missing values in the test data, but the
network was trained on data without any missing values, the
network won’t have learned to ignore missing values!
• In this situation, you should artificially generate training samples with missing
entries: copy some training samples several times, and drop some of the
features that you expect are likely to be missing in the test data.
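A sketch of that augmentation, assuming missing values are encoded as 0 and x_train/y_train are NumPy arrays:

```python
import numpy as np

copies = x_train.copy()
mask = np.random.random(copies.shape) < 0.1   # drop roughly 10% of entries
copies[mask] = 0.0                            # 0 stands for "missing"

x_train_aug = np.concatenate([x_train, copies])
y_train_aug = np.concatenate([y_train, y_train])  # targets are unchanged
```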
Feature engineering
• The process of using your own
knowledge about the data and about
the machine-learning algorithm at
hand (in this case, a neural network) to
make the algorithm work better by
applying hardcoded (nonlearned)
transformations to the data before it
goes into the model
• It makes a problem easier by
expressing input data in a simpler way.
It usually requires understanding the
problem in depth.
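As an illustration (not from the slides): suppose you want to read the time from an image of a clock. Rather than feeding raw pixels to a model, you could hand-code a far simpler input, such as the angle of each clock hand:

```python
import math

def hand_angle(x, y):
    # Angle (in radians) of a clock hand whose tip is at (x, y),
    # measured relative to the clock's center.
    return math.atan2(y, x)

hand_angle(0.0, 1.0)   # pi / 2: the hand points straight up, toward 12
```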
Feature engineering
• Before deep learning, feature engineering used to be critical,
because classical shallow algorithms didn’t have hypothesis spaces
rich enough to learn useful features by themselves
• For instance, before convolutional neural networks became successful on
the MNIST digit-classification problem, solutions were typically based on
hardcoded features such as the number of loops in a digit image, the
height of each digit in an image, a histogram of pixel values, and so on.
• Fortunately, modern deep learning removes the need for most
feature engineering, because neural networks are capable of
automatically extracting useful features from raw data
• However, …
• Good features still allow you to solve problems more elegantly while using
fewer resources.
• Good features let you solve a problem with far less data
Overfitting and underfitting
• The fundamental issue in machine learning is the tension between
optimization and generalization.
• Optimization refers to the process of adjusting a model to get the best
performance possible on the training data (the learning in machine learning)
• generalization refers to how well the trained model performs on data it has never
seen before.
• At the beginning of training, optimization and generalization are
correlated: the lower the loss on training data, the lower the loss on
test data. While this is happening, your model is said to be underfit
• But after a certain number of iterations on the training data, the model begins to learn patterns that are specific to the training data but that are misleading or irrelevant when it comes to new data. The model is starting to overfit.
Overfitting and underfitting
• The best solution to overfitting is to get more training data.
• A model trained on more data will naturally generalize better.
• When that isn’t possible, the next-best solution is to
modulate the quantity of information that your model is
allowed to store or to add constraints on what information
it’s allowed to store.
• If a network can only afford to memorize a small number of patterns,
the optimization process will force it to focus on the most
prominent patterns, which have a better chance of generalizing well.
• The process of fighting overfitting this way is called regularization.
Overfitting and underfitting
• Reducing the network’s size
• The simplest way to prevent overfitting is to reduce the size of the model: the
number of learnable parameters in the model
• In deep learning, the number of learnable parameters in a model is often referred
to as the model’s capacity.
• Intuitively, a model with more parameters has more memorization capacity and
therefore can easily learn a perfect dictionary-like mapping between training
samples and their targets—a mapping without any generalization power.
• On the other hand, if the network has limited memorization resources, it won’t be
able to learn this mapping as easily
• there is no magical formula to determine the right number of layers or the right
size for each layer.
• The general workflow to find an appropriate model size is as follows:
• to start with relatively few layers and parameters, and increase the size of the layers or add
new layers until you see diminishing returns with regard to validation loss.
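A sketch of what reducing capacity looks like in Keras (layer sizes are illustrative, in the spirit of the book's IMDB example):

```python
from keras import models, layers

original_model = models.Sequential([
    layers.Dense(16, activation='relu', input_shape=(10000,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])

smaller_model = models.Sequential([
    layers.Dense(4, activation='relu', input_shape=(10000,)),  # fewer units:
    layers.Dense(4, activation='relu'),                        # less memorization
    layers.Dense(1, activation='sigmoid'),                     # capacity
])
```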
Overfitting and underfitting
• Adding weight regularization
• You may be familiar with the principle of Occam’s razor:
• given two explanations for something, the explanation most likely to be correct is
the simplest one
• A common way to mitigate overfitting is to put constraints on the
complexity of a network by forcing its weights to take only small values,
which makes the distribution of weight values more regular.
• This is called weight regularization, and it’s done by adding to the loss function
of the network a cost associated with having large weights.
• L1 regularization: The cost added is proportional to the absolute value of
the weight coefficients (the L1 norm of the weights).
• L2 regularization: The cost added is proportional to the square of the
value of the weight coefficients (the L2 norm of the weights). L2
regularization is also called weight decay in the context of neural networks
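In Keras this is a per-layer kernel_regularizer argument; a sketch with L2 (the 0.001 coefficient is illustrative):

```python
from keras import models, layers, regularizers

model = models.Sequential([
    layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                 activation='relu', input_shape=(10000,)),
    layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                 activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
# l2(0.001): every weight coefficient adds 0.001 * weight ** 2 to the total loss.
# regularizers.l1(...) and regularizers.l1_l2(...) work the same way.
```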
Overfitting and underfitting
• Adding dropout
• Dropout is one of the most effective and most commonly used
regularization techniques for neural networks, developed by Geoff Hinton
and his students at the University of Toronto
• Dropout, applied to a layer, consists of randomly dropping out (setting to
zero) a number of output features of the layer during training.
• Let’s say a given layer would normally return a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a
given input sample during training.
• After applying dropout, this vector will have a few zero entries distributed at
random: for example, [0, 0.5, 1.3, 0, 1.1]. The dropout rate is the fraction of the
features that are zeroed out; it’s usually set between 0.2 and 0.5.
• At test time, no units are dropped out; instead, the layer’s output values
are scaled down by a factor equal to the dropout rate, to balance for the
fact that more units are active than at training time.
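In Keras, dropout is a layer applied to the output of the layer right before it; a minimal sketch (Keras handles the train/test rescaling automatically):

```python
from keras import models, layers

model = models.Sequential([
    layers.Dense(16, activation='relu', input_shape=(10000,)),
    layers.Dropout(0.5),   # zero out 50% of the outputs above, at training time only
    layers.Dense(16, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),
])
```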
Overfitting and underfitting
• Adding dropout
• The core idea is that introducing noise in the output values of a
layer can break up happenstance patterns that aren’t significant
(what Hinton refers to as conspiracies), which the network will start
memorizing if no noise is present.
The universal workflow of machine
learning
• Defining the problem and assembling a dataset
• you must define the problem at hand
• What will your input data be? What are you trying to predict?
• What type of problem are you facing?
• Identifying the problem type will guide your choice of model architecture, loss
function, and so on.
• Be aware of the hypotheses you make at this stage
• Until you have a working model, all you have is merely hypotheses, waiting to
be validated or invalidated.
• Not all problems can be solved
The universal workflow of machine
learning
• Choosing a measure of success
• To control something, you need to be able to observe it.
• To achieve success, you must define what you mean by success
• Your metric for success will guide the choice of a loss function:
what your model will optimize
• For balanced-classification problems, where every class is
equally likely, accuracy and area under the receiver operating
characteristic curve (ROC AUC) are common metrics.
• For class-imbalanced problems, you can use precision and
recall.
• For ranking problems or multilabel classification, you can use
mean average precision.
• And it isn’t uncommon to have to define your own custom
metric by which to measure success.
The universal workflow of machine
learning
• Deciding on an evaluation protocol
• Once you know what you’re aiming for, you must establish how
you’ll measure your current progress. We’ve previously reviewed
three common evaluation protocols:
• Maintaining a hold-out validation set: The way to go when you have plenty of
data
• Doing K-fold cross-validation: The right choice when you have too few
samples for hold-out validation to be reliable
• Doing iterated K-fold validation: For performing highly accurate model
evaluation when little data is available
• Just pick one of these. In most cases, the first will work well enough.
The universal workflow of machine
learning
• Preparing your data
• you should format your data in a way that can be fed into a
machine-learning model
• Assuming a deep neural network…
• As you saw previously, your data should be formatted as tensors.
• The values taken by these tensors should usually be scaled to small values: for
example, in the [-1, 1] range or [0, 1] range.
• If different features take values in different ranges (heterogeneous data), then
the data should be normalized.
• You may want to do some feature engineering, especially for small-data
problems.
The universal workflow of machine
learning
• Developing a model that does better than a baseline
• Your goal at this stage is to achieve statistical power: a model that outperforms a trivial baseline.
• You need to make three key choices to build your first working model (see the sketch below):
• Last-layer activation
• Loss function
• Optimization configuration
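A sketch of those three choices for an assumed binary-classification problem (other problem types call for different last-layer activations and losses):

```python
from keras import models, layers

model = models.Sequential([
    layers.Dense(16, activation='relu', input_shape=(10000,)),
    layers.Dense(1, activation='sigmoid'),    # last-layer activation
])
model.compile(optimizer='rmsprop',            # optimization configuration
              loss='binary_crossentropy',     # loss function
              metrics=['accuracy'])
```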
The universal workflow of machine
learning
• Scaling up: developing a model that overfits
• Once you’ve obtained a model that has statistical power, the question
becomes, is your model sufficiently powerful? Does it have enough layers
and parameters to properly model the problem at hand?
• The ideal model is one that stands right at the border between
underfitting and overfitting; between undercapacity and overcapacity. To
figure out where this border lies, first you must cross it.
• To figure out how big a model you’ll need, you must develop a model that
overfits: Add layers, Make the layers bigger, Train for more epochs.
• Always monitor the training loss and validation loss, as well as the training
and validation values for any metrics you care about.
• When you see that the model’s performance on the validation data begins to
degrade, you’ve achieved overfitting
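A sketch of that monitoring loop, assuming model and the train/validation arrays come from the earlier steps:

```python
history = model.fit(x_train, y_train,
                    epochs=20, batch_size=128,
                    validation_data=(x_val, y_val))

# history.history records per-epoch 'loss' and 'val_loss'. When val_loss starts
# rising while loss keeps falling, the model has crossed into overfitting.
print(history.history['val_loss'])
```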
The universal workflow of machine
learning
• Regularizing your model and tuning your hyperparameters
• This step will take the most time: you’ll repeatedly modify your model,
train it, evaluate on your validation data (not the test data, at this point),
modify it again, and repeat, until the model is as good as it can get.
• These are some things you should try:
• Add dropout.
• Try different architectures: add or remove layers.
• Add L1 and/or L2 regularization.
• Try different hyperparameters (such as the number of units per layer or the
learning rate of the optimizer) to find the optimal configuration.
• Optionally, iterate on feature engineering: add new features, or remove features
that don’t seem to be informative.
• Once you’ve developed a satisfactory model configuration, you can train
your final production model on all the available data (training and
validation) and evaluate it one last time on the test set
