The Machine Learning Landscape

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. It involves developing algorithms that can learn from and make predictions on data. There are different types of machine learning based on the level of supervision provided during training, including supervised learning (training data includes labels), unsupervised learning (training data is unlabeled), and semi-supervised learning (training data includes some labels). Machine learning has many applications such as spam filtering, news article classification, and product recommendations.


1. What is Machine Learning?

Machine learning is the science (and art) of programming computers so they can learn from data.

Here is a slightly more general definition:


[Machine learning is the] field of study that gives computers the ability to learn
without being explicitly programmed.
—Arthur Samuel, 1959

(a) Traditional Approach


Consider how you would write a spam filter using traditional programming techniques:
1. You might notice that some words or phrases (such as “4U”, “credit card”,
“free”, and “amazing”) tend to come up a lot in the subject line.
2. You would write a detection algorithm for each of the patterns that you
noticed, and your program would flag emails as spam if a number of these
patterns were detected.
3. You would test your program and repeat steps 1 and 2 until it was good
enough to launch.

Since the problem is difficult, your program will likely become a long list of
complex rules—pretty hard to maintain.
(b) Machine Learning approach for automation
What if spammers notice that all their emails containing “4U” are blocked?
They might start writing “For U” instead. A spam filter using traditional
programming techniques would need to be updated to flag “For U” emails. If
spammers keep working around your spam filter, you will need to keep writing
new rules forever.
In contrast, a spam filter based on machine learning techniques automatically
notices that “For U” has become unusually frequent in spam flagged by users,
and it starts flagging them without your intervention.
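
To make the contrast concrete, here is a minimal sketch (not from the source text) comparing a hand-written rule list with a filter learned by scikit-learn's LogisticRegression; the example emails and labels are made up.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Made-up toy data: 1 = spam, 0 = ham
    emails = ["free credit card 4U", "meeting at noon", "amazing free offer", "project update"]
    labels = [1, 0, 1, 0]

    # (a) Traditional approach: a fixed rule list that you must keep editing by hand
    def rule_based_is_spam(email):
        return any(word in email.lower() for word in ["4u", "credit card", "free", "amazing"])

    # (b) Machine learning approach: the model learns which words predict spam from
    # user-flagged examples, so new tricks like "For U" can be caught by retraining
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)
    model = LogisticRegression().fit(X, labels)

    print(rule_based_is_spam("For U only"))                     # False: the rules miss the new trick
    print(model.predict(vectorizer.transform(["free offer"])))  # prediction learned from the data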

(c) Helping humans


Finally, machine learning can help humans learn. ML models can be inspected
to see what they have learned (although for some models this can be tricky).
For instance, once a spam filter has been trained on enough spam, it can easily
be inspected to reveal the list of words and combinations of words that it
believes are the best predictors of spam. Sometimes this will reveal
unsuspected correlations or new trends, and thereby lead to a better
understanding of the problem. Digging into large amounts of data to discover
hidden patterns is called data mining, and machine learning excels at it.

(d) To summarize, machine learning is great for:


• Problems for which existing solutions require a lot of fine-tuning or long lists
of rules (a machine learning model can often simplify code and perform better
than the traditional approach)
• Complex problems for which using a traditional approach yields no good
solution (the best machine learning techniques can perhaps find a solution)
• Fluctuating environments (a machine learning system can easily be retrained
on new data, always keeping it up to date)
• Getting insights about complex problems and large amounts of data

(e) Examples of Applications:


• Detecting tumors in brain scans
This is semantic image segmentation, where each pixel in the image is
classified (as we want to determine the exact location and shape of tumors),
typically using CNNs or transformers.

• Automatically classifying news articles


This is natural language processing (NLP), and more specifically text
classification, which can be tackled using recurrent neural networks (RNNs)
and CNNs, but transformers work even better.

• Automatically flagging offensive comments on discussion forums


This is also text classification, using the same NLP tools.

• Making your app react to voice commands


This is speech recognition, which requires processing audio samples: since
they are long and complex sequences, they are typically processed using
RNNs, CNNs, or transformers.

• Representing a complex, high-dimensional dataset in a clear and insightful diagram
This is data visualization, often involving dimensionality reduction techniques.

• Recommending a product that a client may be interested in, based on past purchases
This is a recommender system. One approach is to feed past purchases (and
other information about the client) to an artificial neural network, and get it to
output the most likely next purchase. This neural net would typically be trained
on past sequences of purchases across all clients.

2. Types of Machine Learning Systems


There are so many different types of machine learning systems that it is useful
to classify them in broad categories, based on the following criteria:

• How they are supervised during training (supervised, unsupervised,
reinforcement, self-supervised, and semi-supervised)

• Whether or not they can learn incrementally on the fly (online versus batch
learning)

• Whether they work by simply comparing new data points to known data
points, or instead by detecting patterns in the training data and building a
predictive model, much like scientists do (instance-based versus model-based
learning)

2.1 Training Supervision


ML systems can be classified according to the amount and type of supervision
they get during training.

2.1.1 Supervised Learning

In supervised learning, the training set you feed to the algorithm includes the
desired solutions, called labels.

A typical supervised learning task is classification. The spam filter is a good
example of this: it is trained with many example emails along with their class
(spam or ham), and it must learn how to classify new emails. e.g., Logistic
Regression, Decision Tree

Another typical task is to predict a target numeric value, such as the price of a
car, given a set of features (mileage, age, brand, etc.). This sort of task is called
regression. To train the system, you need to give it many examples of cars,
including both their features and their targets (i.e., their prices). e.g., Linear
Regression, Decision Tree
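
As a rough illustration (toy, made-up data rather than the book's examples), here is how the two supervised task types look with scikit-learn:

    from sklearn.linear_model import LogisticRegression, LinearRegression

    # Classification: features -> class label (e.g., 1 = spam, 0 = ham)
    X_clf = [[0.1, 3], [0.9, 1], [0.2, 4], [0.8, 0]]
    y_clf = [0, 1, 0, 1]
    clf = LogisticRegression().fit(X_clf, y_clf)
    print(clf.predict([[0.85, 1]]))        # predicted class for a new instance

    # Regression: features (e.g., mileage, age) -> target numeric value (price)
    X_reg = [[50_000, 3], [120_000, 8], [20_000, 1], [90_000, 6]]
    y_reg = [18_000, 7_500, 25_000, 10_000]
    reg = LinearRegression().fit(X_reg, y_reg)
    print(reg.predict([[60_000, 4]]))      # predicted price for a new car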

2.1.2 Unsupervised Learning

In unsupervised learning, as you might guess, the training data is unlabeled.
The system tries to learn without a teacher. e.g., Clustering, Dimensionality
reduction
For example, say you have a lot of data about your blog’s visitors. You may
want to run a clustering algorithm to try to detect groups of similar visitors. At
no point do you tell the algorithm which group a visitor belongs to: it finds
those connections without your help.
For example, it might notice that 40% of your visitors are teenagers who love
comic books and generally read your blog after school, while 20% are adults
who enjoy sci-fi and who visit during the weekends.
Visualization algorithms are also good examples of unsupervised learning:
you feed them a lot of complex and unlabeled data, and they output a 2D or
3D representation of your data that can easily be plotted.
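
A minimal clustering sketch, assuming scikit-learn and made-up visitor features; the algorithm is never told which group anyone belongs to:

    import numpy as np
    from sklearn.cluster import KMeans

    # Made-up features per visitor: [age, visits_per_week]
    visitors = np.array([[15, 5], [16, 6], [14, 4], [35, 1], [40, 2], [38, 1]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(visitors)
    print(kmeans.labels_)  # group assigned to each visitor, found without any labels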

A related task is dimensionality reduction, in which the goal is to simplify the
data without losing too much information. One way to do this is to merge
several correlated features into one.
For example, a car’s mileage may be strongly correlated with its age, so the
dimensionality reduction algorithm will merge them into one feature that
represents the car’s wear and tear. This is called feature extraction.
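
A small sketch of feature extraction with PCA, using made-up car data: the two correlated features are merged into a single component that stands in for wear and tear.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    cars = np.array([[20_000, 1], [60_000, 3], [100_000, 6], [150_000, 9]])  # [mileage, age]

    cars_scaled = StandardScaler().fit_transform(cars)   # put both features on the same scale
    pca = PCA(n_components=1)
    wear = pca.fit_transform(cars_scaled)                # one "wear and tear" feature instead of two
    print(wear.ravel())
    print(pca.explained_variance_ratio_)                 # how much information the single feature keeps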

Yet another important unsupervised task is anomaly detection.


For example, detecting unusual credit card transactions to prevent fraud,
catching manufacturing defects, or automatically removing outliers from a
dataset before feeding it to another learning algorithm. The system is shown
mostly normal instances during training, so it learns to recognize them; then,
when it sees a new instance, it can tell whether it looks like a normal one or
whether it is likely an anomaly.
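
A minimal anomaly-detection sketch (made-up transaction amounts, scikit-learn's IsolationForest chosen for illustration):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Train mostly on normal transactions so the model learns what "normal" looks like
    normal_amounts = np.array([[25.0], [40.0], [12.5], [33.0], [28.0], [45.0], [19.0], [38.0]])
    detector = IsolationForest(random_state=42).fit(normal_amounts)

    print(detector.predict([[30.0], [9_500.0]]))  # 1 = looks normal, -1 = likely an anomaly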

2.1.3 Semi-supervised Learning

Since labeling data is usually time-consuming and costly, you will often have
plenty of unlabeled instances, and few labeled instances. Some algorithms can
deal with data that’s partially labeled. This is called semi-supervised learning.

Some photo-hosting services, such as Google Photos, are good examples of
this. Once you upload all your family photos to the service, it automatically
recognizes that the same person A shows up in photos 1, 5, and 11, while
another person B shows up in photos 2, 5, and 7. This is the unsupervised part
of the algorithm (clustering). Now all the system needs is for you to tell it who
these people are. Just add one label per person and it is able to name
everyone in every photo, which is useful for searching photos.

Most semi-supervised learning algorithms are combinations of unsupervised
and supervised algorithms. For example, a clustering algorithm may be used to
group similar instances together, and then every unlabeled instance can be
labeled with the most common label in its cluster. Once the whole dataset is
labeled, it is possible to use any supervised learning algorithm.
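
A minimal sketch of that idea with made-up data: cluster everything, then spread the few known labels across each cluster (here there is only one labeled instance per cluster, standing in for "the most common label in its cluster").

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],    # one group of instances
                  [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])   # another group
    known = {0: "cat", 3: "dog"}                          # only two instances are labeled

    clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

    propagated = []
    for c in clusters:
        # give each instance the label of a labeled instance in the same cluster
        label = next(known[i] for i in known if clusters[i] == c)
        propagated.append(label)
    print(propagated)  # the whole dataset is now labeled, ready for supervised learning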

2.1.4 Self-supervised Learning

Another approach to machine learning involves actually generating a fully
labeled dataset from a fully unlabeled one.
For example, suppose that what you really want is to have a pet classification
model: given a picture of any pet, it will tell you what species it belongs to. If
you have a large dataset of unlabeled photos of pets, you can start by training
an image-repairing model using self-supervised learning.
If you have a large dataset of unlabeled images, you can randomly mask a
small part of each image and then train a model to recover the original image.
During training, the masked images are used as the inputs to the model, and
the original images are used as the labels.

Once it’s performing well, it should be able to distinguish different pet species:
when it repairs an image of a cat whose face is masked, it must know not to
add a dog’s face.
It is now possible to tweak the model so that it predicts pet species instead of
repairing images. The final step consists of fine-tuning the model on a labeled
dataset: the model already knows what cats, dogs, and other pet species look
like, so this step is only needed so the model can learn the mapping between
the species it already knows and the labels we expect from it.

2.1.5 Reinforcement learning

Reinforcement learning is a very different beast. The learning system, called an
agent in this context, can observe the environment, select and perform actions,
and get rewards in return (or penalties in the form of negative rewards). It
must then learn by itself what is the best strategy, called a policy, to get the
most reward over time. A policy defines what action the agent should choose
when it is in a given situation.
2.2 Batch Versus Online Learning
Another criterion used to classify machine learning systems is whether or not
the system can learn incrementally from a stream of incoming data.

2.2.1 Batch learning

In batch learning, the system is incapable of learning incrementally: it must be
trained using all the available data. This will generally take a lot of time and
computing resources, so it is typically done offline. First the system is trained,
and then it is launched into production and runs without learning anymore; it
just applies what it has learned. This is called offline learning.

Unfortunately, a model’s performance tends to decay slowly over time, simply
because the world continues to evolve while the model remains unchanged.
This phenomenon is often called model rot or data drift. The solution is to
regularly retrain the model on up-to-date data.

If you want a batch learning system to know about new data (such as a new
type of spam), you need to train a new version of the system from scratch on
the full dataset (not just the new data, but also the old data), then replace the
old model with the new one. Fortunately, the whole process of training,
evaluating, and launching a machine learning system can be automated fairly
easily.
This solution is simple and often works fine, but training using the full set of
data can take many hours, so you would typically train a new system only
every 24 hours or even just weekly.
A better option in all these cases is to use algorithms that are capable of
learning incrementally.

2.2.2 Online learning

In online learning, you train the system incrementally by feeding it data
instances sequentially, either individually or in small groups called
mini-batches. Each learning step is fast and cheap, so the system can learn about
new data on the fly.

Online learning is useful for systems that need to adapt to change extremely
rapidly (e.g., to detect new patterns in the stock market). It is also a good
option if you have limited computing resources; for example, if the model is
trained on a mobile device.

Additionally, online learning algorithms can be used to train models on huge
datasets that cannot fit in one machine’s main memory (this is called out-of-core
learning). The algorithm loads part of the data, runs a training step on
that data, and repeats the process until it has run on all of the data. This is
usually done offline rather than on a live system, so think of it as incremental learning.
One important parameter of online learning systems is how fast they should
adapt to changing data: this is called the learning rate. If you set a high
learning rate, then your system will rapidly adapt to new data, but it will also
tend to quickly forget the old data (and you don’t want a spam filter to flag
only the latest kinds of spam it was shown).
Conversely, if you set a low learning rate, the system will have more inertia;
that is, it will learn more slowly, but it will also be less sensitive to noise in the
new data or to sequences of nonrepresentative data points (outliers).
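
A minimal online-learning sketch using scikit-learn's SGDRegressor and partial_fit on a made-up stream of mini-batches; eta0 plays the role of the learning rate discussed above.

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    model = SGDRegressor(learning_rate="constant", eta0=0.01)  # eta0 is the learning rate

    rng = np.random.default_rng(42)
    for _ in range(100):                        # pretend each iteration brings a new mini-batch
        X_batch = rng.uniform(0, 10, size=(16, 1))
        y_batch = 3 * X_batch.ravel() + rng.normal(0, 0.5, size=16)
        model.partial_fit(X_batch, y_batch)     # fast, cheap update using only the new data

    print(model.coef_, model.intercept_)        # learned incrementally, roughly recovering the slope of 3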

2.3 Instance-Based Versus Model-Based Learning


One more way to categorize machine learning systems is by how they
generalize. Most machine learning tasks are about making predictions. This
means that given a number of training examples, the system needs to be able
to make good predictions for (generalize to) examples it has never seen
before. Having a good performance measure on the training data is good, but
insufficient; the true goal is to perform well on new instances.

2.3.1 Instance-based learning

Possibly the most trivial form of learning is simply to learn by heart. If you were
to create a spam filter this way, it would just flag all emails that are
identical to emails that have already been flagged by users—not the worst
solution, but certainly not the best.
Instead of just flagging emails that are identical to known spam emails, your
spam filter could be programmed to also flag emails that are very similar to
known spam emails. This requires a measure of similarity between two emails.
A (very basic) similarity measure between two emails could be to count the
number of words they have in common. The system would flag an email as
spam if it has many words in common with a known spam email.
This is called instance-based learning: the system learns the examples, then
generalizes to new cases by using a similarity measure to compare them to the
learned examples (or a subset of them).
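
A minimal instance-based sketch with k-nearest neighbors (toy, made-up points): the model memorizes the training instances and classifies new points by their similarity to them.

    from sklearn.neighbors import KNeighborsClassifier

    X_train = [[1, 1], [1, 2], [2, 1],    # class 0
               [8, 8], [8, 9], [9, 8]]    # class 1
    y_train = [0, 0, 0, 1, 1, 1]

    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print(knn.predict([[1.5, 1.5], [8.5, 8.5]]))  # each new point is compared to the stored examples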

2.3.2 Model based learning

Another way to generalize from a set of examples is to build a model of these
examples and then use that model to make predictions.
This is called model-based learning.
In summary:
• You studied the data.
• You selected a model.
• You trained it on the training data (i.e., the learning algorithm searched for
the model parameter values that minimize a cost function).
• Finally, you applied the model to make predictions on new cases (this is
called inference), hoping that this model will generalize well.
This is what a typical machine learning project looks like.
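
A minimal model-based sketch following those steps, with made-up data:

    from sklearn.linear_model import LinearRegression

    X = [[1.0], [2.0], [3.0], [4.0]]   # study the data: one feature
    y = [2.1, 3.9, 6.2, 8.1]           # targets, roughly y = 2x

    model = LinearRegression()          # select a model
    model.fit(X, y)                     # train it (the algorithm minimizes a cost function)
    print(model.predict([[5.0]]))       # inference on a new case, hoping it generalizes well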

3. Main Challenges of Machine Learning


In short, since your main task is to select a model and train it on some data,
the two things that can go wrong are “bad model” and “bad data”. Let’s start
with examples of bad data.

3.1.1 Insufficient Quantity of Training Data


Unlike humans, most machine learning algorithms need a lot of data to work
properly. Even for very simple problems you typically need thousands of
examples, and for complex problems such as image or speech recognition you
may need millions of examples (unless you can reuse parts of an existing
model).
Question 1. What is The Unreasonable Effectiveness of Data ?

Answer. In a famous paper published in 2001, Microsoft researchers Michele
Banko and Eric Brill showed that very different machine learning algorithms,
including fairly simple ones, performed almost identically well on a complex
including fairly simple ones, performed almost identically well on a complex
problem of natural language disambiguation once they were given enough
data.
As the authors put it, “these results suggest that we may want to reconsider
the trade-off between spending time and money on algorithm development
versus spending it on corpus development”
The idea that data matters more than algorithms for complex problems was
further popularized by Peter Norvig and his co-authors in their 2009 paper
“The Unreasonable Effectiveness of Data”.

3.1.2 Non-representative Training Data


In order to generalize well, it is crucial that your training data be representative
of the new cases you want to generalize to. This is true whether you use
instance-based learning or model-based learning.

For example, the set of countries used to train the linear model (in the GDP per
capita vs. life satisfaction example) was not perfectly representative; it did not
contain any country with a GDP per capita lower than $23,500 or higher than $62,500.

If we add countries with a GDP per capita lower than $23,500 or higher than
$62,500 and train a linear model on the extended data, we get the solid
line, while the old model is represented by the dotted line.

As you can see, not only does adding a few missing countries significantly alter
the model, but it makes it clear that such a simple linear model is probably
never going to work well. It seems that very rich countries are not happier than
moderately rich countries (in fact, they seem slightly unhappier!), and
conversely some poor countries seem happier than many rich countries.
If the sample is too small, you will have sampling noise (i.e., nonrepresentative
data as a result of chance), but even very large samples can be
nonrepresentative if the sampling method is flawed. This is called sampling
bias.

Question 2. Write an example of Sampling Bias ?

Answer. The most famous example of sampling bias happened during the US
presidential election in 1936, which pitted Landon against Roosevelt: the
Literary Digest conducted a very large poll, sending mail to about 10 million
people. It got 2.4 million answers, and predicted with high confidence that
Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the
votes. The flaw was in the Literary Digest’s sampling method:

• First, to obtain the addresses to send the polls to, the Literary Digest used
telephone directories, lists of magazine subscribers, club membership lists, and
the like. All of these lists tended to favor wealthier people, who were more
likely to vote Republican (hence Landon).

• Second, less than 25% of the people who were polled answered. Again this
introduced a sampling bias, by potentially ruling out people who didn’t care
much about politics, people who didn’t like the Literary Digest, and other key
groups. This is a special type of sampling bias called nonresponse bias.

3.1.3 Poor-Quality Data


Obviously, if your training data is full of errors, outliers, and noise (e.g., due to
poor-quality measurements), it will make it harder for the system to detect the
underlying patterns, so your system is less likely to perform well. It is often well
worth the effort to spend time cleaning up your training data. For example:

• If some instances are clearly outliers, it may help to simply discard them or
try to fix the errors manually.

• If some instances are missing a few features (e.g., 5% of your customers did
not specify their age), you must decide whether you want to ignore this
attribute altogether, ignore these instances, fill in the missing values (e.g., with
the median age), or train one model with the feature and one model without
it.
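
A small sketch of those options for missing values using pandas; the column names and values are made up.

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 40, 31, None], "spend": [120, 80, 200, 150, 90]})

    drop_attr = df.drop(columns=["age"])             # ignore the attribute altogether
    drop_rows = df.dropna(subset=["age"])            # ignore the incomplete instances
    filled = df.fillna({"age": df["age"].median()})  # fill in missing values with the median age
    print(filled)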

3.1.4 Irrelevant Features


As the saying goes: garbage in, garbage out. Your system will only be capable
of learning if the training data contains enough relevant features and not too
many irrelevant ones. A critical part of the success of a machine learning
project is coming up with a good set of features to train on. This process,
called feature engineering, involves the following steps:

• Feature selection (selecting the most useful features to train on among
existing features).

• Feature extraction (combining existing features to produce a more useful one;
as we saw earlier, dimensionality reduction algorithms can help).

• Creating new features by gathering new data.

Now let’s look at some examples of bad algorithms.

3.2.1 Overfitting the Training Data


Say you get cheated by a loved one; you might be tempted to conclude that all
men or women are cheaters. Overgeneralizing is something that we humans do
all too often, and unfortunately machines can fall into the same trap if we are
not careful. In machine learning this is called overfitting: the model performs
well on the training data, but it does not generalize well, i.e., it does not
perform well on the test set or on data the model has never seen before.
Even though a high-degree polynomial life satisfaction model performs much
better on the training data than the simple linear model, would you really trust
its predictions (for example, for a GDP per capita of $80,000)?
Constraining a model to make it simpler and reduce the risk of overfitting is
called regularization.
The dotted line represents the original model that was trained on the countries
represented as circles (without the countries represented as squares), the solid
line is our second model trained with all countries (circles and squares), and
the dashed line is a model trained with the same data as the first model but
with a regularization constraint. You can see that regularization forced the
model to have a smaller slope: this model does not fit the training data
(circles) as well as the first model, but it actually generalizes better to new
examples that it did not see during training (squares).

Question 3.1 What is regularization ?

Answer. Constraining a model to make it simpler and reduce the risk of
overfitting is called regularization.
For example, the linear model we defined earlier has two parameters, θ0 and
θ1. This gives the learning algorithm two degrees of freedom to adapt the
model to the training data: it can tweak both the height (θ0) and the slope (θ1)
of the line. If we forced θ1 = 0, the algorithm would have only one degree of
freedom and would have a much harder time fitting the data properly: all it
could do is move the line up or down to get as close as possible to the
training instances, so it would end up around the mean. A very simple model
indeed! If we allow the algorithm to modify θ1 but we force it to keep it small,
then the learning algorithm will effectively have somewhere in between one
and two degrees of freedom. It will produce a model that’s simpler than one
with two degrees of freedom, but more complex than one with just one. You
want to find the right balance between fitting the training data perfectly and
keeping the model simple enough to ensure that it will generalize well.
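
A minimal sketch of this, assuming scikit-learn: Ridge regression penalizes large parameter values, which keeps the slope small in much the same spirit as constraining θ1. The data is made up.

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
    y = np.array([1.0, 2.2, 2.9, 4.3, 4.9])

    plain = LinearRegression().fit(X, y)
    regularized = Ridge(alpha=10.0).fit(X, y)   # alpha controls how strongly the slope is kept small

    print(plain.coef_, regularized.coef_)       # the regularized slope is smaller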

Question 3.2 What is overfitting? Write some methods to prevent it.

Answer. Overfitting means the model performs well on the training data but is
not able to perform as well on the test data. Overfitting happens when the
model is too complex relative to the amount and noisiness of the training
data.

Here are possible solutions:

• Simplify the model by selecting one with fewer parameters (e.g., a linear
model rather than a high-degree polynomial model), by reducing the number
of attributes in the training data, or by constraining the model.

• Gather more training data.

• Reduce the noise in the training data (e.g., fix data errors and remove
outliers).

3.2.2 Underfitting the Training Data


Underfitting is the opposite of overfitting: it occurs when your model is too
simple to learn the underlying structure of the data. For example, a linear
model of life satisfaction is prone to underfit; reality is just more complex than
the model, so its predictions are bound to be inaccurate, even on the training
examples.

Here are the main options for fixing this problem:

• Select a more powerful model, with more parameters.

• Feed better features to the learning algorithm (feature engineering).

• Reduce the constraints on the model (for example, by reducing the
regularization).

4. Testing and Validating


The only way to know how well a model will generalize to new cases is to
actually try it out on new cases. One way to do that is to put your model in
production and monitor how well it performs. This works well, but if your
model is horribly bad, your users will complain—not the best idea.
A better option is to split your data into two sets: the training set and the test
set. As these names imply, you train your model using the training set, and you
test it using the test set. The error rate on new cases is called the
generalization error (or out-of-sample error), and by evaluating your model on
the test set, you get an estimate of this error. This value tells you how well your
model will perform on instances it has never seen before.
If the training error is low (i.e., your model makes few mistakes on the training
set) but the generalization error is high, it means that your model is overfitting
the training data.
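
A minimal sketch of this workflow with scikit-learn and synthetic data: train on the training set, then use the held-out test set to estimate the generalization error.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 1))
    y = 2 * X.ravel() + rng.normal(0, 1, size=200)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression().fit(X_train, y_train)
    train_error = mean_squared_error(y_train, model.predict(X_train))
    test_error = mean_squared_error(y_test, model.predict(X_test))   # estimate of the generalization error
    print(train_error, test_error)  # low training error with high test error would suggest overfitting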

Question 4. What are Hyperparameters ?

Answer. A machine learning model is defined as a mathematical model with a
number of parameters that need to be learned from the data. By training a
model with existing data, we are able to fit the model parameters. However,
there is another kind of parameter, known as Hyperparameters, that cannot be
directly learned from the regular training process. They are usually fixed before
the actual training process begins. These parameters express important
properties of the model such as its complexity or how fast it should learn.

Some examples of model hyperparameters include:

1. The penalty in a logistic regression classifier (i.e., L1 or L2 regularization).
2. The learning rate for training a neural network.
3. The C and sigma hyperparameters for support vector machines.
4. The k in k-nearest neighbors.
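
A small sketch (scikit-learn, toy data) showing that hyperparameters like these are set before training, while the model's own parameters are learned by fit():

    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    log_reg = LogisticRegression(penalty="l2", C=1.0)   # regularization penalty and strength
    svm = SVC(C=1.0, gamma="scale")                     # C and the kernel width (gamma/sigma)
    knn = KNeighborsClassifier(n_neighbors=5)           # the k in k-nearest neighbors

    X, y = [[0, 0], [1, 1], [0, 1], [1, 0]], [0, 1, 0, 1]
    log_reg.fit(X, y)   # only now are the model's parameters learned from the data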

5. Hyperparameter Tuning and Model Selection


Evaluating a model is simple enough: just use a test set. But suppose you are
hesitating between two types of models (say, a linear model and a polynomial
model): how can you decide between them? One option is to train both and
compare how well they generalize using the test set.
Now suppose that the linear model generalizes better, but you want to apply
some regularization to avoid overfitting. The question is, how do you choose
the value of the regularization hyperparameter? One option is to train 100
different models using 100 different values for this hyperparameter. Suppose
you find the best hyperparameter value that produces a model with the lowest
generalization error—say, just 5% error. You launch this model into production,
but unfortunately it does not perform as well as expected and produces 15%
errors. What just happened?
The problem is that you measured the generalization error multiple times on
the test set, and you adapted the model and hyperparameters to produce the
best model for that particular set. This means the model is unlikely to perform
as well on new data.
A common solution to this problem is called holdout validation: you simply
hold out part of the training set to evaluate several candidate models and
select the best one. The new held-out set is called the validation set (or the
development set, or dev set). More specifically, you train multiple models
with various hyperparameters on the reduced training set (i.e., the full training
set minus the validation set), and you select the model that performs best on
the validation set. After this holdout validation process, you train the best
model on the full training set (including the validation set), and this gives you
the final model. Lastly, you evaluate this final model on the test set to get an
estimate of the generalization error.
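
A minimal sketch of holdout validation with scikit-learn and synthetic data; Ridge's alpha stands in for the regularization hyperparameter being tuned.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 10, size=(300, 1))
    y = 2 * X.ravel() + rng.normal(0, 1, size=300)

    # Split off the test set, then carve a validation set out of the training set
    X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=42)

    best_alpha, best_err = None, float("inf")
    for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:            # candidate hyperparameter values
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        err = mean_squared_error(y_val, model.predict(X_val))
        if err < best_err:
            best_alpha, best_err = alpha, err

    final_model = Ridge(alpha=best_alpha).fit(X_train_full, y_train_full)  # retrain on the full training set
    print(best_alpha, mean_squared_error(y_test, final_model.predict(X_test)))  # single test-set estimate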

Question 5. What is No Free Lunch Theorem ?

Answer. In a famous 1996 paper, David Wolpert demonstrated that if you
make absolutely no assumption about the data, then there is no reason to
prefer one model over any other. This is called the No Free Lunch (NFL)
theorem.
For some datasets the best model is a linear model, while for other datasets it
is a neural network. There is no model that is a priori guaranteed to work
better on all datasets (hence the name of the theorem). The only way to know
for sure which model is best is to evaluate them all.
Since this is not possible, in practice you make some reasonable assumptions
about the data and evaluate only a few reasonable models. For example, for
simple tasks you may evaluate linear models with various levels of
regularization, and for a complex problem you may evaluate various neural
networks.
