DSF Unit 3

UNIT III

MACHINE LEARNING
The modeling process - Types of machine learning -
Supervised learning – Unsupervised learning -Semi-
supervised learning- Classification, regression -
Clustering – Outliers and Outlier Analysis.
MODELING PROCESS
 Ten steps are involved in building a better
machine learning model:
1. Problem Definition
2. Data Collection
3. Data Exploration and Preprocessing
4. Feature Selection
5. Model Selection
6. Model Training
7. Model Evaluation
8. Model Tuning
9. Model Deployment
10. Model Maintenance
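The steps above, from data collection through evaluation, can be sketched as a scikit-learn pipeline. This is only an illustrative sketch on a synthetic dataset; the particular estimators and parameters (StandardScaler, SelectKBest with k=5, LogisticRegression) are example choices, not part of the process itself.

```python
# Illustrative sketch of steps 2-7 of the modeling process with scikit-learn.
# The dataset is synthetic and every estimator choice here is an example.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 2 (data collection): a synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 3-5: preprocessing, feature selection, and model selection,
# chained as one pipeline.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(k=5)),
    ("model", LogisticRegression()),
])

pipe.fit(X_train, y_train)                          # step 6: model training
acc = accuracy_score(y_test, pipe.predict(X_test))  # step 7: model evaluation
```

Step 8 (tuning) would typically wrap this pipeline in a cross-validated search; steps 9 and 10 (deployment and maintenance) happen outside the training script.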
Machine learning
 Machine learning is a subset of AI, which enables the
machine to automatically learn from data, improve
performance from past experiences, and make
predictions.
 Machine learning uses algorithms and data sets to
teach computers to learn from data and improve with
experience.
 In simple words, ML teaches the systems to think and
understand like humans by learning from the data.
Types of Machine Learning
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
Supervised Learning
 Supervised machine learning is based on
supervision. In this, we train the machines using
a "labelled" dataset, and based on that
training, the machine predicts the output.
 First, we train the machine with the input and
corresponding output, and then we ask the
machine to predict the output using the test
dataset.
 The main goal of the supervised learning
technique is to map the input variable (x) to
the output variable (y). Some real-world
applications of supervised learning are risk
assessment, fraud detection, spam filtering, etc.
TYPES OF SUPERVISED MACHINE LEARNING
 Supervised machine learning can be classified
into two types of problems, which are given
below:
1. Classification
2. Regression
CLASSIFICATION:
 Classification deals with predicting categorical
target variables, which represent discrete
classes or labels.
 Eg: classifying emails as spam or not spam,
or predicting whether a patient has a high risk
of heart disease.
There are two types of Classifications:
 Binary Classifier: If the classification problem
has only two possible outcomes, then it is
called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM
or NOT SPAM, CAT or DOG, etc.
 Multi-class Classifier: If a classification
problem has more than two outcomes, then it is
called a Multi-class Classifier.
Examples: classification of types of crops,
classification of types of music.
Some popular classification algorithms are
given below:
 1. Random Forest Algorithm
 2. Decision Tree Algorithm
 3. Logistic Regression Algorithm
 4. Support Vector Machine Algorithm
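As a brief illustration of one of these algorithms, the sketch below trains a logistic regression binary classifier on a synthetic dataset; the data and every parameter value are made up for demonstration.

```python
# Hedged sketch of binary classification: a logistic regression model learns
# a two-class (0/1) mapping from synthetic data; all values are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=1)
clf = LogisticRegression().fit(X, y)   # learn the input -> label mapping
pred = clf.predict(X[:5])              # predicted class labels, each 0 or 1
```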
REGRESSION:
 Regression algorithms are used if there is a
relationship between the input variable and the
output variable.
 It is used for the prediction of continuous
variables, such as weather forecasting, market
trends, etc.
 Below are some popular Regression algorithms:
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression
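A minimal linear regression sketch, assuming a single input variable with a linear relationship to the output: the closed-form least-squares fit below recovers an assumed slope of 3 and intercept of 2 from noisy synthetic data.

```python
# Minimal linear regression sketch: fit y = w*x + b by closed-form least
# squares on synthetic data (the true slope 3 and intercept 2 are assumed).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 50)      # linear signal plus noise

A = np.column_stack([x, np.ones_like(x)])       # design matrix [x, 1]
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)  # estimated slope, intercept
```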
UNSUPERVISED LEARNING
 As its name suggests, there is no need for
supervision: in unsupervised machine learning,
the machine is trained using an unlabeled
dataset, and it predicts the output without any
supervision.
 Here, the models are trained with data that is
neither classified nor labelled, and the model
acts on that data without any supervision.
 It analyzes and clusters unlabeled datasets
using machine learning algorithms. These
algorithms find hidden patterns in data
without any human intervention.
 In other words, we do not give outputs to our
model. The training data has only input
parameter values, and the model discovers
groups or patterns on its own.
TYPES OF UNSUPERVISED MACHINE LEARNING
 Unsupervised Learning can be further
classified into two types, which are given
below:
1. Clustering
2. Association
CLUSTERING
 Clustering in unsupervised machine learning
is the process of grouping unlabeled data into
clusters based on their similarities.
 The goal of clustering is to identify patterns
and relationships in the data without any prior
knowledge of the data's meaning.
 Some common clustering algorithms:
1) K-means Clustering: partitioning data into
K clusters
2) Hierarchical Clustering: building a
hierarchical structure of clusters
3) Density-Based Clustering (DBSCAN):
identifying clusters based on density
4) Mean-Shift Clustering: finding clusters
based on mode seeking
5) Spectral Clustering: utilizing spectral
graph theory for clustering
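A brief K-means sketch on synthetic 2-D data; the two blobs and the choice K = 2 are illustrative assumptions.

```python
# K-means sketch: partition unlabeled 2-D points into K = 2 clusters.
# The two synthetic blobs and the choice K = 2 are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (30, 2)),   # blob around (0, 0)
                 rng.normal(5, 0.5, (30, 2))])  # blob around (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
labels = km.labels_            # cluster index (0 or 1) for every point
```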
 ASSOCIATION:
 Association rule learning finds interesting relations
among variables within a large dataset. The main
aim is to find the dependency of one data item on
another data item and map those variables
accordingly so that it can generate maximum profit.
 Eg: Market Basket analysis, Web usage mining,
continuous production, etc.
 Some common association rule learning algorithms:
 1) Apriori Algorithm
 2) FP-Growth Algorithm
 3) Eclat Algorithm
 4) Efficient Tree-based Algorithms
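The core of these algorithms is support counting. The sketch below computes the support of item pairs in a toy market-basket dataset; the baskets are invented for illustration, and a full Apriori implementation would iterate to larger itemsets and prune by a minimum-support threshold.

```python
# Sketch of the first step of association rule mining: counting the support
# of item pairs in a toy market-basket dataset (the baskets are invented).
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of baskets containing both items.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}
```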
SEMI-SUPERVISED LEARNING
 Semi-Supervised learning is a type of Machine
Learning algorithm that represents the
intermediate ground between Supervised and
Unsupervised learning algorithms.
 It uses a combination of labeled and
unlabeled datasets during the training period.
 Assumptions followed by Semi-Supervised
Learning:
Continuity Assumption:
 Also known as the smoothness assumption, this
assumption states that data points that are close
to each other are likely to have the same label.
Cluster Assumption:
 This assumption states that data naturally form
discrete clusters, and that points in the same
cluster are more likely to have a common label.
Manifold Assumption:
 This assumption states that the data lies roughly
in a lower-dimensional space than the input space.
APPLICATIONS OF SEMI-SUPERVISED LEARNING
 Text document classification
 Anomaly detection - fraud detection
 NLP - sentiment analysis
 Speech recognition
 Medical imaging - tumor detection, disease
classification
 Drug discovery
OUTLIER
 Outliers in machine learning refer to data
points that are significantly different from the
majority of the data. These data points can be
anomalous, noisy, or errors in measurement.
 An outlier is a data point that significantly
deviates from the rest of the data.
 It can be either much higher or much lower
than the other data points, and its presence
can have a significant impact on the results of
machine learning algorithms.
 They can be caused by measurement or
execution errors. The analysis of outlier data
is referred to as outlier analysis or outlier
mining.
TYPES OF OUTLIERS
1. Global outliers:
 Isolated data points that are far away from the
main body of the data.
 Often easy to identify and remove.
2. Contextual outliers:
 These are unusual in a specific context but
may not be outliers in a different context.
 Often more difficult to identify and may
require additional information or domain
knowledge to determine their significance.
Outlier Detection Methods in Machine Learning
1. Statistical Methods:
 Z-Score: This method calculates how many standard
deviations each data point lies from the mean, and
identifies outliers as those with Z-scores exceeding
a certain threshold (typically 3 or -3).
 Interquartile Range (IQR): IQR identifies outliers as
data points falling outside the range defined by
Q1 - k*(Q3-Q1) and Q3 + k*(Q3-Q1), where Q1 and Q3 are
the first and third quartiles, and k is a factor
(typically 1.5).
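Both statistical methods can be sketched in a few lines of NumPy; the sample data and its single planted outlier are illustrative.

```python
# Z-score and IQR outlier checks, using the thresholds from the text
# (|z| > 3, k = 1.5). The sample and its planted outlier are illustrative.
import numpy as np

data = np.array([10, 12, 11, 13] * 5 + [95.0])   # 95 is the planted outlier

z = (data - data.mean()) / data.std()            # standard scores
z_outliers = data[np.abs(z) > 3]

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # the IQR "fences"
iqr_outliers = data[(data < lo) | (data > hi)]
```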
2. Distance-Based Methods:
 K-Nearest Neighbors (KNN): KNN identifies outliers as
data points whose K nearest neighbors are far away
from them.
 Local Outlier Factor (LOF): This method calculates the
local density of data points and identifies outliers as
those with significantly lower density compared to their
neighbors.
3. Clustering-Based Methods:
 Density-Based Spatial Clustering of
Applications with Noise (DBSCAN):
DBSCAN clusters data points based on their
density and identifies outliers as points not
belonging to any cluster.
 Hierarchical clustering:
Hierarchical clustering involves building a
hierarchy of clusters by iteratively merging or
splitting clusters based on their
similarity. Outliers can be identified as clusters
containing only a single data point or clusters
significantly smaller than others.
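A short DBSCAN sketch using scikit-learn, where points assigned the label -1 are the outliers; the eps and min_samples values are illustrative assumptions.

```python
# DBSCAN sketch: dense points form a cluster, and any point assigned the
# label -1 is treated as an outlier. eps and min_samples are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(0, 0.3, (40, 2))      # one dense blob of points
stray = np.array([[8.0, 8.0]])           # an isolated point far from the blob
pts = np.vstack([dense, stray])

labels = DBSCAN(eps=1.0, min_samples=5).fit(pts).labels_
outliers = pts[labels == -1]             # the isolated point lands here
```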
Techniques for Handling Outliers in Machine Learning
1. Removal:
 This involves identifying and removing outliers
from the dataset before training the
model. Common methods include:
◦ Thresholding: Outliers are identified as data points
exceeding a certain threshold (e.g., Z-score > 3).
◦ Distance-based methods: Outliers are identified
based on their distance from their nearest neighbors.
◦ Clustering: Outliers are identified as points not
belonging to any cluster or belonging to very small
clusters.
2. Transformation:
 This involves transforming the data to reduce
the influence of outliers. Common methods
include:
◦ Scaling: Standardizing or normalizing the data to
have a mean of zero and a standard deviation of one.
◦ Winsorization: Replacing outlier values with the
nearest non-outlier value.
◦ Log transformation: Applying a logarithmic
transformation to compress the data and reduce the
impact of extreme values.
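Winsorization and the log transformation can be sketched with NumPy as follows; the sample values and the 5th/95th-percentile limits are illustrative choices.

```python
# Transformation sketches: winsorization clips extremes to percentile limits,
# and a log transform compresses the long tail. Values here are illustrative.
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 100.0])

lo, hi = np.percentile(data, [5, 95])    # 5th/95th-percentile limits
winsorized = np.clip(data, lo, hi)       # 100.0 is pulled down to `hi`

logged = np.log1p(data)                  # log(1 + x), safe at zero
```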
3. Robust Estimation:
 This involves using algorithms that are less
sensitive to outliers. Some examples include:
◦ Robust regression: Algorithms like L1-regularized
regression or Huber regression are less influenced by
outliers than least squares regression.
◦ M-estimators: These algorithms estimate the model
parameters based on a robust objective function that
down-weights the influence of outliers.
◦ Outlier-insensitive clustering
algorithms: Algorithms like DBSCAN are less
susceptible to the presence of outliers than K-means
clustering.
4. Modeling Outliers:
 This involves explicitly modeling the outliers as
a separate group. This can be done by:
◦ Adding a separate feature: Create a new feature
indicating whether a data point is an outlier or not.
◦ Using a mixture model: Train a model that assumes
the data comes from a mixture of multiple
distributions, where one distribution represents the
outliers.
TECHNIQUES FOR OUTLIER ANALYSIS
1. Visual inspection: using plots to identify
outliers
2. Statistical methods: using metrics like
mean, median, and standard deviation to detect
outliers
3. Machine learning algorithms: using
algorithms like One-Class SVM, Local Outlier
Factor (LOF), and Isolation Forest to detect
outliers.
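As a closing sketch, the Isolation Forest detector from scikit-learn flags points that random splits isolate quickly; the synthetic data and the contamination value are illustrative.

```python
# Isolation Forest sketch: points that random splits isolate quickly score
# as outliers (-1); inliers score +1. Data and contamination are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # inlier cloud
               [[10.0, 10.0]]])              # one extreme point

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = iso.predict(X)                       # -1 = outlier, +1 = inlier
```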
