
Chapter 2

Machine Learning Overview


• Introduction to Machine Learning, Deep Learning,
Artificial Intelligence
• Applications for machine learning in data science
• The modeling process
– Engineering features and selecting a model, Training the
model, Validating the model, Predicting new observations
• Types of machine learning
– Supervised learning, Unsupervised learning,
Semi-supervised learning, ensemble techniques
• Regression models
– Linear, Polynomial, Logistic
• Concept of classification, clustering and reinforcement
learning
Introduction
• Can you teach your computer to protect you
from malicious persons?
• Can you explicitly teach a computer to
recognize persons in a picture?
• Have you ever gone grocery shopping? What
do you do before going to the market?
– I always prepare a list of ingredients beforehand, and
I make decisions according to my previous purchasing
experience. Then I go and purchase the items. But with
rising inflation it’s not easy to stay within budget, and
I have observed that my actual spending often deviates
from it.
– This happens because the shopkeeper changes the
quantity and price of a product very often. Due to
such factors, I have to modify my shopping list, and it
takes a lot of effort, research, and time to update the
list for every change.
– This is where Machine Learning can come to your
rescue.
• “Machine learning is a field of study that gives
computers the ability to learn without being
explicitly programmed.” (Arthur Samuel, 1959)
• In conventional programming, we solve a specific
task by feeding the algorithm data and instructions
specific to that task.
• In machine learning, we instead develop
general-purpose algorithms that can be applied
to large classes of learning problems.
• Machine learning is the study of computer algorithms
that can improve automatically through experience
and by the use of data. It is seen as a part of artificial
intelligence.
• Machine Learning is a popular technique
for predicting the future or classifying information to
help people make decisions.
• Machine Learning algorithms are trained on
instances or examples, through which they learn
from past experience and analyze historical
data.
• As an algorithm trains on these examples, again and
again, it becomes able to identify patterns and make
predictions about the future.
• “Machine learning is the process by which a
computer can work more accurately as it
collects and learns from the data it is given.”
– E.g., autocorrecting/autocompleting messages on
a phone: the phone learns more about messages as
the user writes more text messages.
• In the broader field of science, machine
learning is a subfield of artificial intelligence
and is closely related to applied mathematics
and statistics.
• With the help of Machine Learning, we can
develop intelligent systems that are capable
of taking decisions on an autonomous basis.
• These algorithms learn from past
instances of data through statistical
analysis and pattern matching.
• Then, based on what they have learned, they
provide us with predicted results.
• Data is the core foundation of machine
learning algorithms.
• Machine learning combines CS, Mathematics,
and Statistics.
– Statistics essential for drawing inferences from
data
– Mathematics for developing machine learning
models
– CS for implementing the algorithms.
• Machine learning is used to impart
intelligence to static systems.
• In order to derive meaningful insights from this data
and learn from the way in which people and
the system interface with the data, we need
computational algorithms that can churn the data and
provide us with results that would benefit us in various
ways.
• Machine learning has facilitated the automation of
repetitive tasks, taking away the need for
manual labour.
• All of this is possible due to the massive
amount of data that we generate on a daily basis.
• Machine Learning offers several methodologies for
making sense of this data and providing you
with reliable, accurate results.
• These machine learning algorithms use the
patterns contained in the training data to
perform classification and future predictions.
• Whenever any new input is introduced to
the ML model, it applies its learned patterns
over the new data to make future predictions.
• Based on the final accuracy, one
can optimize their models using
various standardized approaches.
How does M/L work?
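A minimal sketch of the train/predict/optimize loop described above, assuming scikit-learn is available; the data set is synthetic and the model choice is purely illustrative:

```python
# Minimal train/predict/evaluate loop on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)                     # learn patterns from the training data

predictions = model.predict(X_test)             # apply learned patterns to new input
print("accuracy:", model.score(X_test, y_test)) # basis for optimizing the model further
```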
Applications for machine learning in
data science
• The two machine learning tools most important to
data scientists:
– Regression
– Classification
• The uses for regression and automatic classification
are wide ranging, such as the following:
– Finding oil fields, gold mines, or archeological sites based
on existing sites (classification and regression)
– Finding place names or persons in text (classification)
– Identifying people based on pictures or voice recordings
(classification)
– Recognizing birds based on their whistle (classification)
– Identifying profitable customers (regression and
classification)
– Proactively identifying car parts that are likely to
fail (regression)
– Identifying tumors and diseases (classification)
– Predicting the amount of money a person will
spend on product X (regression)
– Predicting the number of eruptions of a volcano
in a period (regression)
– Predicting your company’s yearly revenue
(regression)
– Predicting which team will win the Champions
League in soccer (classification)

• Root cause analysis: occasionally data
scientists build a model (an abstraction of
reality) that provides insight into the underlying
processes of a phenomenon.
– Here the goal of the model isn’t prediction but
interpretation, for instance:
• Understanding and optimizing a business process,
such as determining which products add value to a
product line
• Discovering what causes diabetes
• Determining the causes of traffic jams
• Where M/L is used in the Data science
process:
– The process of Data science involves
• Setting up a research goal
• Retrieving data
• Data preparation
• Data exploration
• Data modeling (model & variable selection, model
execution, model diagnostics and model comparison)
• Presentation & automation
– M/L is used in the data preparation phase, since
only data of sufficient quality should be passed on to
data modeling.
• An example is cleansing a list of text strings: we can
use M/L to group similar strings together, so that it
becomes easier to correct spelling errors (see the sketch below).
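As an illustrative sketch of this idea (not from the text), similar strings can be grouped against a canonical list using the standard library’s difflib; the place names and the 0.8 cutoff are assumptions:

```python
# Group misspelled strings with their closest canonical spelling (stdlib only).
import difflib

names = ["New York", "New York", "Neww York", "Boston", "Bostan"]
canonical = ["New York", "Boston"]

for name in names:
    match = difflib.get_close_matches(name, canonical, n=1, cutoff=0.8)
    print(name, "->", match[0] if match else "no match")
```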
– M/L is also useful in data exploration: M/L
algorithms can be used to extract the underlying
patterns in the data.
The modeling process
• The modeling phase consists of four steps:
– Feature engineering and model selection
– Training the model
– Model validation and selection
– Applying the trained model to unseen data
• Until we find a good model, we iterate between the
first three steps.
• The fourth step is generally present when the goal is
prediction, and absent when the goal is explanation
(root cause analysis).
– For instance, you might want to find out the causes of
species’ extinctions but not necessarily predict which one is
next in line to leave our planet.
• A variation on the fourth step is
ensemble learning.
– This technique involves chaining or combining
multiple models.
– When you chain multiple models, the output of
the first model becomes an input for the second
model.
– When multiple models are combined, they are
trained independently and their results are
combined.
• A model consists of constructs of information called
features or predictors and a target or response variable.
• A model’s goal is to predict the target variable using the
predictors.
– For example, to predict tomorrow’s high temperature, the
variables that help you do this and are (usually) known to you
are the features or predictor variables, such as today’s
temperature, cloud movements, current wind speed, and so on.
• The best models are those that accurately represent reality,
preferably while staying concise and interpretable.
• To achieve this, feature engineering is the most important
and arguably most interesting part of modeling.
– For example, an important feature in a model that tried to
explain the extinction of large land animals in the last 60,000
years in Australia turned out to be the population number and
spread of humans.
• 1. Engineering features and selecting a model:
– Feature engineering is the first step in the modeling
process.
– It involves identifying the predictors for the
model.
– This is one of the most important steps in the
process, since a model recombines these features
to achieve its predictions.
– Feature selection
• Some features are variables that we get directly from a
data set.
• At times we need to extract the necessary
features, which may be scattered among different
data sets.
• Raw data is often collected from multiple
sources; in such cases we need to apply a
transformation to one or more inputs before they
become a good feature/predictor.
• An example of combining multiple inputs is an interaction
variable: the impact of each single variable is low, but when
both are present the impact is very high (see the sketch below).
– Vinegar and bleach are common household products, but mixing
them produces a poisonous gas.
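A minimal sketch of constructing such an interaction feature; the binary inputs are hypothetical stand-ins for “vinegar present” and “bleach present”:

```python
# Each input alone has a weak effect; their product captures the joint effect.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=8)        # e.g. "vinegar present"
x2 = rng.integers(0, 2, size=8)        # e.g. "bleach present"
interaction = x1 * x2                  # high only when both are present

print(np.column_stack([x1, x2, interaction]))
```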
– Thus, in situations where we need to identify and select
features, we may use modeling techniques to derive them;
the output of one model then becomes part of
another model.
– E.g., text mining: documents can first be
grouped into categories using one modeling
technique and then passed on to the model that we
actually want to use.
– Problem of Availability Bias in model construction
• Your features are only the ones that you could easily get
your hands on, and your model consequently represents this
one-sided “truth.”
• Models suffering from availability bias often fail when
they’re validated because it becomes clear that they’re not
a valid representation of the truth.
• E.g., in World War II, analysis of where returning Allied
planes had been hit by German fire was biased, because only
the planes that survived were available as data.
• Training the Model
– Once the right predictors have been identified,
with a modeling technique in mind, the next step
is to train the model.
– Training involves presenting data to our model,
from which it can learn.
– Once a model is trained, it’s time to test whether
it can be extrapolated to reality, i.e., model
validation.
• Model Validation
– Many modeling techniques are available, and we need to
choose the right one.
– A good model has two properties:
• it has good predictive power
• It generalizes well to data it hasn’t seen.
– To achieve this we define an error measure (how
wrong the model is) and a validation strategy
– Two common error measures in machine learning
are
• The classification error rate for classification problems
• The mean squared error for regression problems.
– The classification error rate is the percentage of
observations in the test data set that your model
mislabeled; lower is better.
– The mean squared error measures how big the
average error of your prediction is.
• As a consequence of squaring, bigger errors get even more
weight than they otherwise would: small errors remain
small or can even shrink (if < 1), whereas big errors are
enlarged and will definitely draw your attention (both
measures are sketched below).
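A minimal sketch of both error measures with NumPy; the labels and predictions are made up for illustration:

```python
import numpy as np

# Classification error rate: fraction of mislabeled test observations.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])
error_rate = np.mean(y_true != y_pred)     # 2 of 5 wrong -> 0.4

# Mean squared error: squaring inflates big errors, shrinks errors below 1.
t_true = np.array([10.0, 12.0, 9.0])
t_pred = np.array([10.5, 15.0, 9.0])
mse = np.mean((t_true - t_pred) ** 2)      # (0.25 + 9.0 + 0.0) / 3 ≈ 3.08

print(error_rate, mse)
```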
• Many validation strategies exist, the common
ones are :
– Dividing your data into a training set with X% of the
observations and keeping the rest as a holdout data
set (a data set that’s never used for model
creation)—This is the most common technique.
– K-folds cross validation—This strategy divides the
data set into k parts and uses each part one time as a
test data set while using the others as a training data
set. This has the advantage that you use all the data
available in the data set.
– Leave-1 out—This approach is the same as k-folds, but
with k equal to the number of observations: you always
leave one observation out and train on all the rest. It is
used only on small data sets, so it’s more valuable to people
evaluating laboratory experiments than to big data analysts
(all three strategies are sketched below).
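A hedged sketch of the three strategies using scikit-learn (assumed available); the data set and model are synthetic placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)

# 1. Holdout: keep 30% aside, never used for model creation.
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)
print("holdout score:", model.fit(X_tr, y_tr).score(X_ho, y_ho))

# 2. K-folds cross validation: every observation is used for testing once.
print("5-fold scores:", cross_val_score(model, X, y, cv=5))

# 3. Leave-one-out: k equals the number of observations (small data sets only).
print("LOO mean score:", cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```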
– Model validation is extremely important since it
determines whether our model works in real-life
situations.
– In order to perform a good validation,
• test your models on unseen data (provided the unseen
data is a true representation of what the model would
encounter when applied to fresh observations by other
people).
• For classification models, tools such as the
confusion matrix work best (see the sketch below).
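A minimal confusion-matrix sketch, assuming scikit-learn; the label vectors are illustrative:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
# [[2 1]    2 true negatives, 1 false positive
#  [1 2]]   1 false negative, 2 true positives
```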
– Once you’ve constructed a good model, you can
(optionally) use it to predict the future.
• Predicting new observations
– Once the first three steps are implemented
successfully, we now have a performant model that
generalizes to unseen data.
– Model scoring is the process of applying your model
to new data.
• In fact, model scoring was done implicitly during validation;
the difference is that now we don’t know the correct outcome.
– Model scoring involves two steps.
• First, prepare a data set that has features exactly as defined
by our model. This boils down to repeating the data
preparation we did in step one of the modeling process but
for a new data set.
• Then apply the model to this new data set; the result
is a prediction (see the sketch below).
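A sketch of the two scoring steps; `model` is assumed to be the trained model from the earlier steps, and `prepare_features` is a hypothetical helper that repeats the step-one data preparation:

```python
import pandas as pd

def prepare_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Step 1: rebuild exactly the features the model was trained on."""
    out = raw.copy()
    out["size_per_room"] = out["sqft"] / out["rooms"]   # illustrative derived feature
    return out[["sqft", "rooms", "size_per_room"]]

new_raw = pd.DataFrame({"sqft": [1200, 800], "rooms": [4, 2]})
X_new = prepare_features(new_raw)

# Step 2: apply the trained model; the correct outcome is unknown this time.
# predictions = model.predict(X_new)
```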
Types of M/L algorithms
• Based on the amount of human interaction
required to coordinate the M/L approaches,
and on how these approaches use labelled
data, the algorithms are categorized as:
– Supervised learning algorithms
– Unsupervised learning algorithms
– Semi-supervised learning algorithms
• Supervised learning algorithms
• Supervised learning techniques attempt to discern
results and learn by trying to find patterns in a
labelled data set
• Human interaction is required to label the data.
– E.g., for house price prediction, we first need data
about houses: square footage, number of rooms, whether
the house has a garden, and other such features.
– We then need to know the prices of these houses,
i.e., the labels.
– With data from thousands of houses, their
features, and their prices, we can train a supervised
machine learning model to predict a new house’s
price based on the model’s past experience (see the
sketch below).
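A minimal sketch of this house-price example, assuming scikit-learn; the four houses and their prices are invented for illustration:

```python
from sklearn.linear_model import LinearRegression

# Features: [square feet, number of rooms, has garden (0/1)]
X = [[1000, 3, 0], [1500, 4, 1], [800, 2, 0], [2000, 5, 1]]
y = [200_000, 320_000, 150_000, 450_000]       # known prices = labels

model = LinearRegression().fit(X, y)           # learn from labeled houses
print(model.predict([[1200, 3, 1]]))           # predicted price for a new house
```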
• Two types of Supervised learning:
– Classification: a computer program is
trained on a training dataset and, based on that
training, categorizes the data into different class labels.
• These algorithms are used to predict discrete values such as
male|female, true|false, spam|not spam, etc.
• E.g., email spam detection, speech recognition, identification
of cancer cells, etc.
– Types of Classification Algorithms:
• Naive Bayes classifier
• Decision Trees
• Logistic Regression
• K-Nearest Neighbours
• Support vector machine
• Random forest classification
– Regression: the task of a regression algorithm is
to find the mapping function that maps the input
variables (x) to the continuous output variable (y).
– Regression algorithms are used to predict
continuous values such as price, salary, age,
marks, etc.
• E.g., weather (temperature) prediction, house price
prediction, salary estimation, etc.
– Types of Regression Algorithms:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
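A short sketch contrasting simple linear and polynomial regression with NumPy’s polyfit; the quadratic data is synthetic:

```python
import numpy as np

x = np.linspace(0, 10, 20)
y = 2 * x**2 + 3 * x + np.random.default_rng(0).normal(0, 5, 20)

line = np.polyfit(x, y, deg=1)      # simple linear fit: y = b1*x + b0
curve = np.polyfit(x, y, deg=2)     # polynomial fit: y = b2*x^2 + b1*x + b0
print("linear coefficients:", line)
print("quadratic coefficients:", curve)
```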
• Unsupervised learning techniques
• Unsupervised learning techniques don’t rely
on labeled data and attempt to find patterns
in a data set without human interaction.
• In an unsupervised learning model, the
algorithm learns on an unlabeled dataset and
tries to make sense of it by extracting features,
co-occurrences, and underlying patterns on its
own.
– E.g., anomaly detection, including fraud detection.
• Types of Unsupervised Learning:
– Clustering
– Anomaly detection
– Association rules learning
– Neural Networks
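A minimal clustering sketch with scikit-learn’s KMeans; the six points are synthetic and the choice of two clusters is an assumption:

```python
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 1], [1.5, 2], [8, 8], [8, 9], [1, 0.5], [9, 8]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)    # cluster assignments discovered without any labels
```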
• Semi-supervised techniques
– Uses a small amount of labeled data and a large
amount of unlabeled data, which provides the
benefits of both unsupervised and supervised
learning while avoiding the challenge of finding a
large amount of labeled data.
– These techniques need labeled data, and hence
human interaction, to find patterns in the data
set, but they can still progress toward a result and
learn even when unlabeled data is passed in as well
(see the sketch below).
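A hedged sketch using scikit-learn’s SelfTrainingClassifier (assumed available), where unlabeled observations are conventionally marked with -1; the data split is illustrative:

```python
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1                  # only the first 50 labels are kept

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000)).fit(X, y_partial)
print(model.score(X, y))             # learns from labeled + unlabeled data
```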
• Reinforcement Learning:
– Reinforcement learning is a type of machine
learning where the model learns to behave in an
environment by performing some actions and
analyzing the reactions.
– RL takes appropriate actions in order to maximize
the positive response in a particular situation.
– The reinforcement model decides what actions to
take to perform a given task; that is why it is
bound to learn from experience itself.
• E.g., consider a baby learning how to walk. In the
first case, the baby starts walking and makes it to
the chocolate; since the chocolate is the end goal,
the baby’s response is positive: she is happy. In the
second case, the baby starts walking but gets hit by
a chair on the way, cannot reach the chocolate, and
starts crying, which is a negative response. This
illustrates how we humans learn from trial and error.
Here, the baby is the “agent”, the chocolate is the
“reward”, and there are many hurdles in between. The
agent tries several paths and finds the best possible
one to reach the reward (see the toy sketch below).
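A toy sketch of this story as tabular Q-learning: states are positions in a short corridor, reaching the chocolate gives +10, and stepping backward (toward the “chair”) costs -1. All numbers are illustrative assumptions, not from the text:

```python
import random

N_STATES, GOAL = 5, 4
Q = [[0.0, 0.0] for _ in range(N_STATES)]      # actions: 0 = left, 1 = right
alpha, gamma, eps = 0.5, 0.9, 0.2

for _ in range(200):                           # episodes of trial and error
    s = 0
    while s != GOAL:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        a = random.choice([0, 1]) if random.random() < eps else int(Q[s][1] >= Q[s][0])
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 10 if s2 == GOAL else (-1 if a == 0 else 0)   # reward signal
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([int(q[1] > q[0]) for q in Q[:GOAL]])    # learned policy: all 1s (walk right)
```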
Regression models
• Predictive modelling techniques such as regression analysis
may be used to determine the relationship between a
dataset’s dependent (goal) and independent variables.
• It is widely used when the dependent and independent
variables are linked in a linear or non-linear fashion, and the
target variable has a set of continuous values.
• Thus, regression analysis approaches help in
establishing relationships between variables, modelling
time series, and forecasting.
• Regression analysis, for example, is the best way to examine
the relationship between sales and advertising expenditures
for a corporation.
– For example, the relationship between rash driving and the
number of road accidents caused by a driver is best studied
through regression
• Regression analysis is an important tool for
modelling and analyzing data.
• Here, we fit a curve / line to the data points, in
such a manner that the sum of the squared distances
between the data points and the curve or line is
minimized (see the sketch below).
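A minimal sketch of fitting such a line with NumPy’s ordinary-least-squares polyfit; the data points are invented:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

slope, intercept = np.polyfit(x, y, deg=1)     # minimizes the sum of squared errors
print(f"y ≈ {slope:.2f}x + {intercept:.2f}")
```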
• There are multiple benefits of using regression
analysis. They are as follows:
– It indicates the significant relationships between
the dependent variable and the independent variables.
– It indicates the strength of the impact of
multiple independent variables on a dependent
variable.
• Regression analysis also allows us to compare
the effects of variables measured on different
scales, such as the effect of price changes and
the number of promotional activities.
• These benefits help market researchers / data
analysts / data scientists to evaluate and select
the best set of variables for building predictive
models.
Types of Regression techniques
• There are various kinds of regression
techniques available to make predictions.
• These techniques are mostly driven by three
metrics:
– Number of independent variables
– Type of dependent variable
– Shape of the regression line
• Linear Regression
