The Machine Learning Landscape
Since the problem is difficult, your program will likely become a long list of
complex rules—pretty hard to maintain.
(Figure: the machine learning approach for automation)
What if spammers notice that all their emails containing “4U” are blocked?
They might start writing “For U” instead. A spam filter using traditional
programming techniques would need to be updated to flag “For U” emails. If
spammers keep working around your spam filter, you will need to keep writing
new rules forever.
In contrast, a spam filter based on machine learning techniques automatically
notices that “For U” has become unusually frequent in spam flagged by users,
and it starts flagging them without your intervention.
Machine learning systems can be classified according to several criteria, for example:
• Whether or not they can learn incrementally on the fly (online versus batch
learning)
• Whether they work by simply comparing new data points to known data
points, or instead by detecting patterns in the training data and building a
predictive model, much like scientists do (instance-based versus model-based
learning)
In supervised learning, the training set you feed to the algorithm includes the
desired solutions, called labels.
Another typical task is to predict a target numeric value, such as the price of a
car, given a set of features (mileage, age, brand, etc.). This sort of task is called
regression. To train the system, you need to give it many examples of cars,
including both their features and their targets (i.e., their prices). Typical
regression algorithms include linear regression and decision trees.
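As a rough illustration, here is a minimal scikit-learn sketch of such a regression task, assuming made-up car data (mileage and age as features, price as the target):

```python
from sklearn.linear_model import LinearRegression

# Hypothetical car data: each row is [mileage_km, age_years]; the target is the price.
X = [[120_000, 8], [30_000, 2], [60_000, 4], [90_000, 6]]
y = [4_500, 18_000, 12_000, 8_000]

model = LinearRegression()
model.fit(X, y)                         # learn from labeled examples (supervised)
print(model.predict([[45_000, 3]]))     # predicted price for an unseen car
```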
Since labeling data is usually time-consuming and costly, you will often have
plenty of unlabeled instances, and few labeled instances. Some algorithms can
deal with data that’s partially labeled. This is called semi-supervised learning.
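As a rough sketch of semi-supervised learning, scikit-learn's LabelPropagation can spread the few known labels to similar unlabeled instances; the toy data below is made up, and -1 marks an unlabeled instance:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Hypothetical toy data: six instances, only two of them labeled.
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, -1, -1, 1, -1, -1])      # -1 marks an unlabeled instance

model = LabelPropagation()
model.fit(X, y)                           # labels spread to similar unlabeled points
print(model.transduction_)                # inferred labels for all six instances
```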
Consider, for example, a model trained with self-supervised learning to repair
pet images whose faces have been masked out. Once it’s performing well, it
should be able to distinguish different pet species: when it repairs an image of
a cat whose face is masked, it must know not to add a dog’s face.
It is now possible to tweak the model so that it predicts pet species instead of
repairing images. The final step consists of fine-tuning the model on a labeled
dataset: the model already knows what cats, dogs, and other pet species look
like, so this step is only needed so the model can learn the mapping between
the species it already knows and the labels we expect from it.
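A rough sketch of this pretrain-then-fine-tune idea in Keras is shown below; here an ImageNet-pretrained base merely stands in for the pretrained image-repair model, and the three-species label set is made up:

```python
import tensorflow as tf

# Pretrained base; its weights stand in for what the model learned during pretraining.
base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False                      # freeze what the model already knows

# New head that maps the learned representations to the labels we expect.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),   # e.g., cat / dog / rabbit
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(labeled_images, labeled_species, epochs=5)   # small labeled dataset
```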
If you want a batch learning system to know about new data (such as a new
type of spam), you need to train a new version of the system from scratch on
the full dataset (not just the new data, but also the old data), then replace the
old model with the new one. Fortunately, the whole process of training,
evaluating, and launching a machine learning system can be automated fairly
easily.
This solution is simple and often works fine, but training using the full set of
data can take many hours, so you would typically train a new system only
every 24 hours or even just weekly.
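A minimal sketch of such an automated batch-retraining step follows; the data variables, classifier choice, and file name are hypothetical:

```python
import joblib
from sklearn.linear_model import LogisticRegression

def retrain_from_scratch(X_full, y_full, path="spam_model.joblib"):
    """Batch learning: train a new model on the full dataset (old + new data),
    then save it so it can replace the currently deployed model."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_full, y_full)
    joblib.dump(model, path)        # the new model overwrites the old one
    return model

# Typically run on a schedule, e.g., nightly:
# retrain_from_scratch(X_old_and_new, y_old_and_new)
```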
A better option in all these cases is to use algorithms that are capable of
learning incrementally.
Online learning is useful for systems that need to adapt to change extremely
rapidly (e.g., to detect new patterns in the stock market). It is also a good
option if you have limited computing resources; for example, if the model is
trained on a mobile device.
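As a rough sketch of online learning, scikit-learn's SGDClassifier can learn incrementally via partial_fit; the mini-batches below are randomly generated stand-ins for newly arriving labeled emails:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
model = SGDClassifier()                      # supports incremental (online) learning

for step in range(100):
    X_batch = rng.normal(size=(10, 5))       # 10 new instances, 5 features each
    y_batch = rng.integers(0, 2, size=10)    # their labels (spam / not spam)
    model.partial_fit(X_batch, y_batch, classes=[0, 1])   # learn from this batch only
```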
If you were to create a spam filter this way, it would just flag all emails that are
identical to emails that have already been flagged by users—not the worst
solution, but certainly not the best.
Instead of just flagging emails that are identical to known spam emails, your
spam filter could be programmed to also flag emails that are very similar to
known spam emails. This requires a measure of similarity between two emails.
A (very basic) similarity measure between two emails could be to count the
number of words they have in common. The system would flag an email as
spam if it has many words in common with a known spam email.
This is called instance-based learning: the system learns the examples, then
generalizes to new cases by using a similarity measure to compare them to the
learned examples (or a subset of them).
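A toy sketch of this word-overlap similarity measure is given below; the threshold and the example emails are made up:

```python
def similarity(email_a, email_b):
    """(Very basic) similarity: number of distinct words the two emails share."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))

def looks_like_spam(new_email, known_spam, threshold=3):
    """Flag the email if it shares many words with any known spam email."""
    return any(similarity(new_email, spam) >= threshold for spam in known_spam)

known_spam = ["win a free prize now", "free money for u click here"]
print(looks_like_spam("claim your free prize money now", known_spam))   # True
```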
For example, the set of countries we used to train the linear model was not
perfectly representative; it did not contain any country with a GDP per capita
lower than $23,500 or higher than $62,500.
If we add countries with a GDP per capita lower than $23,500 or higher than
$62,500 and train a linear model on the extended data, we get the solid line,
while the old model is represented by the dotted line.
As you can see, not only does adding a few missing countries significantly alter
the model, but it makes it clear that such a simple linear model is probably
never going to work well. It seems that very rich countries are not happier than
moderately rich countries (in fact, they seem slightly unhappier!), and
conversely some poor countries seem happier than many rich countries.
If the sample is too small, you will have sampling noise (i.e., nonrepresentative
data as a result of chance), but even very large samples can be
nonrepresentative if the sampling method is flawed. This is called sampling
bias.
Answer. The most famous example of sampling bias happened during the US
presidential election in 1936, which pitted Landon against Roosevelt: the
Literary Digest conducted a very large poll, sending mail to about 10 million
people. It got 2.4 million answers, and predicted with high confidence that
Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the
votes. The flaw was in the Literary Digest’s sampling method:
• First, to obtain the addresses to send the polls to, the Literary Digest used
telephone directories, lists of magazine subscribers, club membership lists, and
the like. All of these lists tended to favor wealthier people, who were more
likely to vote Republican (hence Landon).
• Second, less than 25% of the people who were polled answered. Again this
introduced a sampling bias, by potentially ruling out people who didn’t care
much about politics, people who didn’t like the Literary Digest, and other key
groups. This is a special type of sampling bias called nonresponse bias.
• If some instances are clearly outliers, it may help to simply discard them or
try to fix the errors manually.
• If some instances are missing a few features (e.g., 5% of your customers did
not specify their age), you must decide whether you want to ignore this
attribute altogether, ignore these instances, fill in the missing values (e.g., with
the median age), or train one model with the feature and one model without
it.
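For instance, filling missing values with the median is a one-liner in pandas; the toy customer table below is made up:

```python
import pandas as pd

# Hypothetical customer data in which some ages are missing.
df = pd.DataFrame({"age": [25, None, 40, None, 31],
                   "spend": [120, 80, 200, 50, 90]})

df["age"] = df["age"].fillna(df["age"].median())   # fill missing ages with the median
print(df)
```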
Answer. Overfitting means that the model performs well on the training data
but fails to perform similarly on the test data. Overfitting happens when the
model is too complex relative to the amount and noisiness of the training
data.
• Simplify the model by selecting one with fewer parameters (e.g., a linear
model rather than a high-degree polynomial model), by reducing the number
of attributes in the training data, or by constraining the model (regularization;
see the sketch after this list).
• Reduce the noise in the training data (e.g., fix data errors and remove
outliers).
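As a rough sketch of constraining a model, compare an ordinary linear fit with a Ridge (regularized) fit on noisy, made-up data; the regularized model keeps its weights small:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))                    # 20 instances, 10 features
y = X[:, 0] + rng.normal(scale=0.5, size=20)     # only the first feature matters

plain = LinearRegression().fit(X, y)
constrained = Ridge(alpha=10.0).fit(X, y)        # regularization shrinks the weights

print(np.abs(plain.coef_).round(2))              # larger, noisier weights
print(np.abs(constrained.coef_).round(2))        # smaller weights -> simpler model
```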