
Unit – III

Machine Learning
What is machine learning?
Mike Roberts
“Machine learning is the process by which a computer can work more accurately as it collects and learns from the data it is given.”

Arthur Samuel, 1959
“Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.”
Applications for machine learning in data science

● Regression and classification are of primary importance to a data scientist. To achieve these goals, one of the main tools a data scientist uses is machine learning.
● The uses for regression and automatic classification are wide ranging, such as the following:
■ Finding oil fields, gold mines, or archeological sites based on existing sites (classification and regression)
■ Finding place names or persons in text (classification)
■ Identifying people based on pictures or voice recordings (classification)
■ Recognizing birds based on their whistle (classification)

■ Identifying profitable customers (regression and classification)
■ Proactively identifying car parts that are likely to fail (regression)
■ Identifying tumors and diseases (classification)
■ Predicting the amount of money a person will spend on a product (regression)
■ Predicting the number of eruptions of a volcano in a period (regression)
■ Predicting your company’s yearly revenue (regression)
■ Predicting which team will win the Champions League in soccer (classification)
Occasionally data scientists build a model (an abstraction of reality) that provides insight into the underlying processes of a phenomenon. When the goal of a model isn’t prediction but interpretation, it’s called root cause analysis (identifying the “root causes” of problems or events). Here are a few examples:
■ Understanding and optimizing a business process, such as determining which products add value to a product line
■ Discovering what causes diabetes
■ Determining the causes of traffic jams
Where machine learning is used in the data science process

Although machine learning is mainly linked to the data-modeling step of the data
science process, it can be used at almost every step.
The modeling process

The modeling phase consists of four steps:


1 Feature engineering and model selection
2 Training the model
3 Model validation and selection
4 Applying the trained model to unseen data
● It’s possible to chain or combine multiple techniques. When you chain
multiple models, the output of the first model becomes an input for the
second model. When you combine multiple models, you train them
independently and combine their results. This last technique is also known
as ensemble learning.
● A model consists of constructs of information called features or
predictors and a target or response variable. Your model’s goal is to
predict the target variable, for example, tomorrow’s high temperature.
The variables that help you do this and are (usually) known to you are the
features or predictor variables such as today’s temperature, cloud
movements, current wind speed, and so on.
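As a minimal sketch of combining models (ensemble learning), assuming scikit-learn is available, the snippet below trains two models independently and lets a VotingClassifier combine their votes; the iris data set stands in for the features (predictors) and target. This is an illustration, not the book’s code.

# Hedged sketch of ensemble learning with scikit-learn (illustrative only).
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # X: the features (predictors), y: the target
ensemble = VotingClassifier(estimators=[
    ("logreg", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(random_state=0)),
])                                           # the two models are trained independently
ensemble.fit(X, y)                           # and their votes are combined into one prediction
print(ensemble.predict(X[:5]))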
1. Engineering features and selecting a model
● Engineering features means coming up with and creating the possible predictors
for the model. This is one of the most important steps in the process because a
model recombines these features to arrive at its predictions.
● Often you may need to consult an expert or the appropriate literature to come up
with meaningful features.
● Sometimes the features are simply the variables you get from a data set, as is the case
with the data sets provided in our exercises and in most school exercises.
● In practice you’ll need to find the features yourself, which may be scattered
among different data sets.
● In several projects we had to bring together more than 20 different data sources
before we had the raw data we required. Often you’ll need to apply a
transformation to an input before it becomes a good predictor or to combine
multiple inputs.
● Sometimes you have to use modeling techniques to derive features: the output of
a model becomes part of another model.
● In text mining, documents can first be annotated to classify their content into
categories, or you can count the number of geographic places or persons mentioned in the
text.
● This counting is often more difficult than it sounds; models are first applied to
recognize certain words as a person or a place. All this new information is then
poured into the model you want to build.
● One of the biggest mistakes in model construction is availability bias: the human
tendency to judge an event by the ease with which examples of the event can be
retrieved from memory or constructed anew. (Such heuristics can lead to inaccurate
judgments about how common things are and how representative certain things may be.)
● Once the initial features are created, a model can be trained on the data.
2. Training your model
● With the right predictors in place and a modeling technique in mind, you can
progress to model training.
● In this phase you present to your model data from which it can learn.
● The most common modeling techniques have industry-ready implementations in
almost every programming language, including Python. These enable you to train
your models by executing a few lines of code. For more state-of-the-art data
science techniques, you’ll probably end up doing heavy mathematical calculations
and implementing them with modern computer science techniques.
● Once a model is trained, it’s time to test whether it can be extrapolated to
reality: model validation.
3. Validating a model
● Data science has many modeling techniques, and the question is which one is the right one
to use.
● A good model has two properties: it has good predictive power and it generalizes well to
data it hasn’t seen.
● To achieve this you define an error measure (how wrong the model is) and a validation
strategy.
Error Measures:
● Two common error measures in machine learning are the classification error rate
for classification problems and the mean squared error for regression problems.
● The classification error rate is the percentage of observations in the test data
set that your model mislabeled; lower is better.
● The mean squared error measures how big the average error of your prediction is.
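As a small illustration (assuming scikit-learn and hypothetical numbers), both error measures can be computed as follows.

# Hedged sketch: the two error measures above, computed with scikit-learn metrics.
from sklearn.metrics import accuracy_score, mean_squared_error

y_true_class = [0, 1, 1, 0]          # hypothetical class labels
y_pred_class = [0, 1, 0, 0]
error_rate = 1 - accuracy_score(y_true_class, y_pred_class)   # classification error rate

y_true_reg = [2.0, 3.5, 4.0]         # hypothetical regression targets
y_pred_reg = [2.5, 3.0, 4.5]
mse = mean_squared_error(y_true_reg, y_pred_reg)              # mean squared error
print(error_rate, mse)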
Validation strategies:
● Dividing your data into a training set with X% of the observations and keeping the
rest as a holdout data set (a data set that’s never used for model creation)—This
is the most common technique.
● K-folds cross validation—This strategy divides the data set into k parts and uses
each part one time as a test data set while using the others as a training data set.
This has the advantage that you use all the data available in the data set.
● Leave-one-out—This approach is the same as k-folds cross validation, but with k equal to
the number of observations: you always leave exactly one observation out and train on the
rest of the data. Because the model must be retrained once per observation, this is used
only on small data sets, so it’s more valuable to people evaluating laboratory experiments
than to big data analysts.
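A minimal sketch of the three strategies, assuming scikit-learn’s model_selection utilities and toy data (illustration only):

# Hedged sketch of the holdout, k-fold, and leave-one-out validation strategies.
import numpy as np
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 1. Holdout: keep part of the data aside, never used for model creation.
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3)

# 2. K-folds cross validation: every part is used exactly once as a test set.
for train_idx, test_idx in KFold(n_splits=5).split(X):
    pass  # train on X[train_idx], test on X[test_idx]

# 3. Leave-one-out: a single observation is held out in each round.
for train_idx, test_idx in LeaveOneOut().split(X):
    pass  # the test set contains exactly one observation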
● Another popular term in machine learning is regularization.
● In the context of machine learning, regularization is a technique used to reduce errors
by fitting the function appropriately to the given training set and to avoid overfitting.
● When applying regularization, you incur a penalty for every extra variable used to
construct the model. With L1 regularization you ask for a model with as few predictors
as possible. This is important for the model’s robustness: simple solutions tend to hold
true in more situations.
● L2 regularization aims to keep the variance between the coefficients of the predictors
as small as possible. Overlapping variance between predictors makes it hard to make out
the actual impact of each predictor. Keeping their variance from overlapping will
increase interpretability. To keep it simple: regularization is mainly used to stop a model
from using too many features and thus prevent over-fitting.
L1 regularisation

● A regression model which uses the L1 regularisation technique is called
LASSO (Least Absolute Shrinkage and Selection Operator) regression.
● Lasso regression adds the “absolute value of magnitude” of each coefficient as a penalty
term to the loss function (L); that is, the penalty is λ Σ|βj|.

L2 regularisation

● A regression model that uses the L2 regularisation technique is called Ridge
regression.
● Ridge regression adds the “squared magnitude” of each coefficient as a penalty term to
the loss function (L); that is, the penalty is λ Σβj².
Difference between L1 and L2

● The key difference between these techniques is that Lasso shrinks the less
important features’ coefficients to zero, thus removing some features
altogether. This works well for feature selection when we have a huge
number of features.
● From a practical standpoint, L1 tends to shrink coefficients to zero whereas
L2 tends to shrink coefficients evenly. L1 is therefore useful for feature
selection, as we can drop any variables associated with coefficients that go to
zero. L2, on the other hand, is useful when you have collinear/codependent
features.
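A hedged sketch contrasting the two penalties on the same synthetic data (the parameter values and data are arbitrary): several Lasso coefficients come out exactly zero, while the Ridge coefficients are merely shrunk.

# Hedged sketch: L1 (Lasso) versus L2 (Ridge) regularization on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: several coefficients shrink exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: coefficients shrink evenly but stay non-zero

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)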
4. Predicting new observations
● If you’ve implemented the first three steps successfully, you now have a
performant model that generalizes to unseen data. The process of applying
your model to new data is called model scoring.
● Model scoring involves two steps. First, you prepare a data set that has
features exactly as defined by your model. This boils down to repeating the
data preparation you did in step one of the modeling process but for a new
data set. Then you apply the model on this new data set, and this results in a
prediction.
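A minimal sketch of model scoring with hypothetical feature names (not from the text): prepare a new data set with exactly the model’s features, then apply the trained model to it.

# Hedged sketch of model scoring; the feature names are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.DataFrame({"temp_today": [18.0, 21.0, 25.0],
                      "wind_speed": [10.0, 5.0, 3.0],
                      "temp_tomorrow": [19.0, 22.0, 26.0]})
model = LinearRegression().fit(train[["temp_today", "wind_speed"]],
                               train["temp_tomorrow"])

new_data = pd.DataFrame({"temp_today": [23.0], "wind_speed": [7.0]})  # same features, new observation
print(model.predict(new_data))                                        # model scoring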
Types of machine learning
● Supervised learning techniques attempt to discern results and learn by trying to find
patterns in a labeled data set. Human interaction is required to label the data.
● Unsupervised learning techniques don’t rely on labeled data and attempt to find patterns
in a data set without human interaction.
● Semi-supervised learning techniques need labeled data, and therefore human interaction,
to find patterns in the data set, but they can still progress toward a result and learn
even if passed unlabeled data as well.

Supervised learning-CASE STUDY: DISCERNING DIGITS FROM IMAGES
● One of the many common approaches on the web to stopping computers
from hacking into user accounts is the Captcha check—a picture of text
and numbers that the human user must decipher and enter into a form
field before sending the form back to the web server.

Figure 3.3 A simple Captcha control can be used to prevent automated spam
being sent through an online web form.
● With the help of the Naïve Bayes classifier, a simple yet powerful algorithm for
categorizing observations into classes, you can recognize digits from textual images.
● These images aren’t unlike the Captcha checks many websites have in place to
make sure you’re not a computer trying to hack into the user accounts.
● Let’s see how hard it is to let a computer recognize images of numbers.
● Our research goal is to let a computer recognize images of numbers (step
one of the data science process).
● The data we’ll be working on is the MNIST data set, which is often used in
the data science literature for teaching and benchmarking.
● The MNIST database is a large database of handwritten digits that is
commonly used for training various image processing systems.
● A subset of the MNIST images can be found in the datasets package of Scikit-learn and is
already normalized for you (all images scaled to the same size: 8x8 pixels, or 64 grayscale
values per image), so we won’t need much data preparation (step three of the data science process).
● But let’s first fetch our data as step two of the data science process, with the
following listing.
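The original listing isn’t reproduced in these notes; a minimal sketch, assuming the digits data set bundled with Scikit-learn, might look like this. It fetches the data and displays the first image with pl.matshow(), which is referenced again below.

# Hedged sketch of the data-retrieval step (not the book's exact listing).
from sklearn import datasets
import pylab as pl

digits = datasets.load_digits()                   # ~1,800 labeled 8x8 grayscale digit images
pl.matshow(digits.images[0], cmap=pl.cm.gray_r)   # show the first digit image
pl.show()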
In the case of a gray image, you put a value in every matrix entry that depicts the gray
value to be shown. Displaying and inspecting the images this way is step four of the
data science process: data exploration.
● For a grayscale image, the pixel value is a single number that represents the brightness
of the pixel. The most common pixel format is the byte image, where this number is
stored as an 8-bit integer, giving a range of possible values from 0 to 255.
● Typically zero is taken to be black, and 255 is taken to be white.
● The Naïve Bayes classifier expects a flat list of values, but each entry of digits.images
is a two-dimensional array (a matrix) reflecting the shape of the image. To flatten it into a
list, we need to call reshape() on digits.images.
● The net result will be a flattened row of 64 grayscale values per image that looks something like this:
array([[ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10., 15., 5., 0.,
0., 3., 15., 2., 0., 11., 8., 0., 0., 4., 12., 0., 0., 8., 8., 0.,
0., 5., 8., 0., 0., 9., 8., 0., 0., 4., 11., 0., 1., 12., 7., 0.,
0., 2., 14., 5., 10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.]])
Figure 3.4 Blurry grayscale representation of the number 0 with its corresponding matrix.
The higher the number, the closer it is to white; the lower the number, the closer it is to
black.
Figure 3.5 We’ll turn an image into something usable by the Naïve Bayes classifier by
getting the grayscale value for each of its pixels (shown on the right) and putting those
values in a list.
● Now that we have a way to pass the contents of an image into the classifier, we
need to pass it a training data set so it can start learning how to predict the
numbers in the images.
● We mentioned earlier that Scikit-learn contains a subset of the MNIST database
(1,800 images), so we’ll use that.
● Each image is also labeled with the number it actually shows. This will build a
probabilistic model in memory of the most likely digit shown in an image given its
grayscale values.
● Once the program has gone through the training set and built the model, we can
then pass it the test set of data to see how well it has learned to interpret the
images using the model.
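The book’s exact listing isn’t shown here; a hedged sketch of this train-and-test step with a Gaussian Naïve Bayes classifier (the train/test split is an assumption) might look as follows.

# Hedged sketch: fit Naive Bayes on part of the digits data, then build the
# confusion matrix discussed below from the held-out test images.
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X = digits.images.reshape(len(digits.images), -1)   # flatten each 8x8 image to 64 values
y = digits.target                                   # the digit each image actually shows

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GaussianNB().fit(X_train, y_train)          # build the probabilistic model
predicted = model.predict(X_test)                   # score the unseen test images
print(confusion_matrix(y_test, predicted))          # rows: true digit, columns: predicted digit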
● The end result of this code is called a confusion matrix, such as the one shown in
figure 3.6. Returned as a two-dimensional array, it shows on the main diagonal how often
the predicted number was the correct one, and in matrix entry (i,j) how often the model
predicted number j while the image actually showed number i.
● Looking at figure 3.6 we can see that the model predicted the number 2 correctly
17 times (at coordinates 3,3), but also that it predicted the number 8 fifteen
times when the image actually showed a 2 (at coordinates 9,3).

Figure 3.6 Confusion matrix produced by predicting what number is depicted by a blurry image
The model was correct in 75 cases (35 + 40) and incorrect in 25 cases (15 + 10),
resulting in an accuracy of 75% (75 correct out of 100 total observations).
Figure 3.7 For each blurry image a number is predicted; only the number 2 is misinterpreted as an 8.
Then an ambiguous number is predicted to be a 3, but it could just as well be a 5; even to human eyes
this isn’t clear.

● By discerning which images were misinterpreted, we can train the model further by labeling
them with the correct number they display and feeding them back into the model as a new
training set (step 5 of the data science process).
● This will make the model more accurate, so the cycle of learn, predict, correct continues and
the predictions become more accurate.
Unsupervised learning
● Unsupervised Learning is a machine learning technique in which the users do not need to
supervise the model.
● Instead, it allows the model to work on its own to discover patterns and information that was
previously undetected.
● It mainly deals with unlabelled data.

Types of Unsupervised Learning

Unsupervised learning problems are further grouped into clustering and association problems. Commonly used techniques include:
● Hierarchical clustering
● K-means clustering
● K-NN (k nearest neighbors)
● Principal Component Analysis
● Singular Value Decomposition
● Independent Component Analysis
Applications of unsupervised machine learning
Some applications of unsupervised machine learning techniques are:

● Clustering automatically splits the data set into groups based on similarity
● Anomaly detection can discover unusual data points in your data set; it is useful
for finding fraudulent transactions
● Association mining identifies sets of items that often occur together in your
data set
● Latent variable models are widely used for data preprocessing, such as reducing the
number of features in a data set or decomposing the data set into multiple
components
DISCERNING A SIMPLIFIED LATENT STRUCTURE FROM YOUR DATA

● In statistics, latent variables are variables that are not directly observed but are
rather inferred (through a mathematical model) from other variables that are
observed (directly measured).
● A latent variable is hidden, and therefore can’t be observed.
● Example: Actual customer satisfaction is a hidden or latent factor that can only
be measured in comparison to a manifest variable, or observable factor.
● The company might choose to study observable variables, such as sales numbers, the
price per sale, regional purchasing trends, the gender and age of the customer, the
percentage of return customers, and how highly customers ranked the product on various
sites, all in pursuit of the latent factor: customer satisfaction.
CASE STUDY: FINDING LATENT VARIABLES IN A WINE QUALITY DATA SET

● In this short case study, you’ll use a technique known as Principal Component Analysis
(PCA) to find latent variables in a data set that describes the quality of wine.
● Then you’ll compare how well a set of latent variables works in predicting the quality of
wine against the original observable set.
● How to identify and derive those latent variables.
● How to analyze where the sweet spot is—how many new variables return the most
utility—by generating and interpreting a scree plot generated by PCA.
● A scree plot is a line plot of the eigenvalues of factors or principal components in an
analysis.
● The scree plot is used to determine the number of factors to retain in an exploratory
factor analysis
Main components of this example

Data set

● The University of California, Irvine (UCI) has an online repository of 325 data sets
for machine learning exercises at http://archive.ics.uci.edu/ml/.
● We’ll use the Wine Quality Data Set for red wines created by P. Cortez, A. Cerdeira,
F. Almeida, T. Matos, and J. Reis. It’s 1,600 lines long and has 11 variables per line,
as shown in table 3.2.
The first three rows of the Red Wine Quality Data Set

● Principal Component Analysis—A technique to find the latent variables in your
data set while retaining as much information as possible.
● Scikit-learn—We use this library because it already implements PCA for us
and provides a way to generate the scree plot.
Part one of the data science process is to set our research goal: We want to
explain the subjective “wine quality” feedback using the different wine
properties.
With the initial data preparation behind you, you can execute the PCA. The resulting
scree plot (which will be explained shortly) is shown in figure 3.8. Because PCA is an
explorative technique, we now arrive at step four of the data science process: data
exploration, as shown in the following listing.
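The original listing isn’t reproduced here; a minimal sketch, assuming the UCI red-wine file winequality-red.csv (file name and column split are assumptions), fits PCA and draws a scree plot of the explained variance per component.

# Hedged sketch of the PCA step and the scree plot (not the book's exact code).
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

wine = pd.read_csv("winequality-red.csv", sep=";")   # assumed file name for the UCI data
X = wine.drop("quality", axis=1)                     # the 11 measured wine properties
pca = PCA().fit(X)

plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_)              # scree plot: variance per component
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.show()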
The plot generated from the wine data set is shown in figure 3.8. What you hope to see is an
elbow or hockey stick shape in the plot. This indicates that a few variables can represent
the majority of the information in the data set while the rest only add a little more. In our
plot, PCA tells us that reducing the set down to one variable can capture approximately 28%
of the total information in the set.
An elbow shape in the plot suggests that five variables can hold most of the information found
inside the data. At this point, we could go ahead and see if the original data set recoded with five
latent variables is good enough to predict the quality of the wine accurately.
INTERPRETING THE NEW VARIABLES

● With the initial decision made to reduce the data set from 11 original variables to 5 latent
variables, we can check to see whether it’s possible to interpret or name them based on
their relationships with the originals.
● Actual names are easier to work with than codes such as lv1, lv2, and so on.

The rows in the resulting table (table 3.4) show the mathematical correlation. Or, in English, the
first latent variable lv1, which captures approximately 28% of the total information in the set,
has the following formula:
lv1 = (fixed acidity * 0.489314) + (volatile acidity * -0.238584) + … + (alcohol *
-0.113232)
We can now recode the original data set with only the five latent variables.
Doing this is data preparation again, so we revisit step three of the data
science process: data preparation.
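A hedged sketch of this recoding step (the latent-variable names lv1 to lv5 and the file name are assumptions, not the book’s code):

# Hedged sketch: recode the wine measurements into five latent variables with PCA.
import pandas as pd
from sklearn.decomposition import PCA

wine = pd.read_csv("winequality-red.csv", sep=";")        # assumed file name
X = wine.drop("quality", axis=1)

pca = PCA(n_components=5)
latent = pd.DataFrame(pca.fit_transform(X),
                      columns=["lv1", "lv2", "lv3", "lv4", "lv5"])
print(latent.head(3))                                     # the recoded observations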

Already we can see high values for wine 0 in volatile acidity, while wine 2 is
particularly high in persistent acidity.
COMPARING THE ACCURACY OF THE ORIGINAL DATA SET WITH LATENT VARIABLES

● Now that we’ve decided our data set should be recoded into 5 latent variables rather
than the 11 originals, it’s time to see how well the new data set works for predicting
the quality of wine when compared to the original.
● We’ll use the Naïve Bayes Classifier algorithm we saw in the previous example for
supervised learning to help. Let’s start by seeing how well the original 11 variables
could predict the wine quality scores.
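A hedged sketch of this comparison (the file name, the train/test split, and the choice of Gaussian Naïve Bayes are assumptions; the exact accuracies will differ from the book’s figures):

# Hedged sketch: Naive Bayes accuracy with 11 original variables vs. 5 PCA latent variables.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

wine = pd.read_csv("winequality-red.csv", sep=";")        # assumed file name
X, y = wine.drop("quality", axis=1), wine["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Original 11 variables
gnb = GaussianNB().fit(X_train, y_train)
acc_original = accuracy_score(y_test, gnb.predict(X_test))

# Five latent variables (PCA fitted on the training data only)
pca = PCA(n_components=5).fit(X_train)
gnb_latent = GaussianNB().fit(pca.transform(X_train), y_train)
acc_latent = accuracy_score(y_test, gnb_latent.predict(pca.transform(X_test)))

print(acc_original, acc_latent)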


GROUPING SIMILAR OBSERVATIONS TO GAIN INSIGHT FROM THE
DISTRIBUTION OF YOUR DATA

● The general technique we’re describing here is known as clustering.


● In this process, we attempt to divide our data set into observation
subsets, or clusters, wherein observations should be similar to those in
the same cluster but differ greatly from the observations in other
clusters.
● Figure 3.10 gives you a visual idea of what clustering aims to achieve. The
circles in the top left of the figure are clearly close to each other while
being farther away from the others.
● Scikit-learn implements several common algorithms for clustering data in
its sklearn.cluster module, including the k-means algorithm, affinity
propagation, and spectral clustering.
● k-means is a good general-purpose algorithm with which to get started.
However, like many clustering algorithms, it requires you to specify the
number of desired clusters in advance, which often results in a process of
trial and error before reaching a satisfying conclusion.
● The following listing uses an iris data set to see if the algorithm can
group the different types of irises.
from sklearn import cluster, datasets
import pandas as pd

data = datasets.load_iris()                      # load the iris flower data set
X = pd.DataFrame(data.data, columns=list(data.feature_names))
print(X[:5])

model = cluster.KMeans(n_clusters=3, random_state=25)
results = model.fit(X)                           # fit k-means with 3 clusters
X["cluster"] = results.predict(X)                # cluster assigned to each observation
X["target"] = data.target                        # the actual iris species
X["c"] = "lookatmeIamimportant"                  # dummy column, used only for counting
print(X[:5])

classification_result = (
    X[["cluster", "target", "c"]]
    .groupby(["cluster", "target"])
    .agg("count")                                # how many of each species fall in each cluster
)
print(classification_result)
Semi-supervised learning
● Semi-supervised learning is the type of machine learning that uses a
combination of a small amount of labeled data and a large amount of unlabeled
data to train models.
● This approach to machine learning is a combination of supervised machine
learning, which uses labeled training data, and unsupervised learning, which
uses unlabeled training data.
● Unlabeled data, when used in conjunction with a small amount of labeled data,
can produce considerable improvement in learning accuracy.
● A common semi-supervised learning technique is label propagation.
● In this technique, you start with a labeled data set and give the same label to
similar data points.
● This is similar to running a clustering algorithm over the data set and labeling each
cluster based on the labels it contains (a short sketch follows at the end of this section).
● One special approach to semi-supervised learning worth mentioning here is active
learning.
● Active learning is a special case of machine learning in which a learning algorithm
can interactively query a user (or some other information source) to label new data
points with the desired outputs.
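A minimal sketch of the label propagation technique mentioned above, using scikit-learn’s semi_supervised module (the 80% label masking is an arbitrary illustration):

# Hedged sketch of label propagation: unlabeled points are marked with -1 and
# receive the label of similar, already-labeled points.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y_partial = np.copy(y)
y_partial[rng.rand(len(y)) < 0.8] = -1        # pretend 80% of the labels are missing

model = LabelPropagation().fit(X, y_partial)  # propagate labels to similar points
print((model.transduction_ == y).mean())      # fraction of points labeled correctly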
