02 - Diagnostics For Machine Learning Model

This document discusses various techniques for diagnosing and preprocessing machine learning models, including identifying outliers, handling missing values, removing duplicate data, and reducing data dimensionality. Dimensionality reduction techniques aim to reduce overfitting by removing irrelevant or redundant features. Regularization is also discussed as a technique to reduce overfitting by adding a penalty term on model complexity. The differences between L1 and L2 regularization are summarized: both aim to prevent overfitting, but L1 regularization results in sparser solutions while L2 regularization results in a single non-sparse solution.


Diagnostics for Machine Learning Model
PhD. Msc. David C. Baldears S.
PhD(s). Msc. Diego Lopez Bernal

Outliers
An outlier is an observed value that is notably distinct from the other values in
a dataset.

Outliers are data objects with characteristics that are considerably different
from most of the other data objects in the data set.

● An outlier is often thought of as a mistake (normally it is not)


● Decide whether the outlier is
part of your population of
interest or not
● See how much the outlier(s)
are affecting the results
Outlier
● Outliers are useful for detecting significant deviations from normal behavior
● Applications:
○ Credit Card Fraud Detection  
○ Network Intrusion Detection
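
As a rough illustration of how such deviations can be flagged, here is a minimal sketch using Tukey's IQR rule on a one-dimensional array of transaction amounts; the data values and the factor k = 1.5 are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative transaction amounts: one value is far from the rest
amounts = np.array([12.0, 15.5, 14.2, 13.8, 950.0, 16.1])
print(iqr_outliers(amounts))  # [False False False False  True False]
```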

Missing Values
● Reasons for missing values
○ Information is not collected (e.g., people decline to give their age and weight)
○ Attributes may not be applicable to all cases (e.g., annual income is not applicable to
children)  
● Handling missing values
○ Eliminate Data Objects
○ Estimate Missing Values
○ Ignore the Missing Value During Analysis
○ Replace with all possible values (weighted by their probabilities)
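
A minimal pandas sketch of two of the strategies above (eliminating data objects and estimating missing values); the toy DataFrame and the choice of mean imputation are illustrative assumptions.

```python
import pandas as pd

# Toy survey data with gaps (illustrative values only)
df = pd.DataFrame({"age": [25, None, 41, 33], "weight": [70, 82, None, 68]})

dropped = df.dropna()                             # eliminate data objects with missing values
imputed = df.fillna(df.mean(numeric_only=True))   # estimate missing values with the column mean
print(dropped.shape)                              # (2, 2): two complete rows remain
print(imputed.isna().sum().sum())                 # 0: no gaps left after imputation
```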

Duplicate Data
● Data set may include data objects that are duplicates, or almost duplicates
of one another
○ Major issue when merging data from heterogeneous sources  
● Examples:
○ Same person with multiple email addresses  
● Data cleaning
○ Process of dealing with duplicate data issues
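
A minimal sketch of the cleaning step for the email example above, assuming pandas and using the person's name as the deduplication key (an illustrative choice; real data cleaning usually needs a more careful matching rule).

```python
import pandas as pd

# Same person recorded under two email addresses (illustrative data)
people = pd.DataFrame({
    "name":  ["Ana Ruiz", "Ana Ruiz", "Luis Mora"],
    "email": ["ana@mail.com", "ana.ruiz@work.com", "luis@mail.com"],
})

# Near-duplicates are not identical rows, so deduplicate on a chosen key column
deduplicated = people.drop_duplicates(subset=["name"], keep="first")
print(len(deduplicated))  # 2
```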

Data Preprocessing
● Aggregation  
● Sampling
●  Dimensionality Reduction  
● Feature subset selection  
● Feature creation  
● Discretization and Binarization  
● Attribute Transformation

Aggregation & Sampling
Aggregation

●  Combining two or more attributes (or objects) into a single attribute (or object)  
● Purpose
○ Data reduction: reduce the number of attributes or objects
○ Change of scale: cities aggregated into regions, states, countries, etc.
○ More “stable” data: aggregated data tends to have less variability

Sampling

● Sampling is the main technique employed for data selection. It is often used for both the
preliminary investigation of the data and the final data analysis.
● Statisticians sample because obtaining the entire set of data of interest is too expensive or
time consuming.  
● Sampling is used in data mining because processing the entire set of data of interest is too
expensive or time consuming.
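
A minimal pandas sketch of both ideas, with an illustrative sales table; the city/state columns and the 50% sample fraction are assumptions made only for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "city":   ["Monterrey", "Monterrey", "Guadalajara", "Guadalajara"],
    "state":  ["NL", "NL", "JAL", "JAL"],
    "amount": [120.0, 80.0, 200.0, 150.0],
})

# Aggregation: change of scale from cities to states; fewer, more stable rows
by_state = sales.groupby("state", as_index=False)["amount"].sum()

# Sampling: analyze a fraction of the rows instead of the whole data set
subset = sales.sample(frac=0.5, random_state=0)
print(by_state)
print(len(subset))  # 2
```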
Curse Of Dimensionality
● Dimensionality: the number of features that describe our dataset
● When dimensionality increases, data becomes increasingly sparse in the
space that it occupies

● The amount of training data needed to cover 20% of the feature space grows
exponentially with the number of dimensions. In other words: more features need more data.
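
One standard way to make this concrete (an illustration consistent with the claim above, not the slide's original figure) is to compute the edge length of a sub-cube needed to capture 20% of a unit feature space: it approaches the full range as the number of dimensions grows.

```python
# Edge length e of a sub-cube covering 20% of a unit hypercube in d dimensions: e = 0.2 ** (1/d)
for d in (1, 2, 10, 100):
    print(d, round(0.2 ** (1 / d), 3))
# 1 -> 0.2, 2 -> 0.447, 10 -> 0.851, 100 -> 0.984: "local" neighborhoods stop being local
```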
Curse Of Dimensionality
● Another way to observe this phenomenon:

● As you add new dimensions, you create new space that is not filled with
data. Therefore, you need more data for the model to work well.
● Definitions of density and distance between points, which are critical for
clustering and outlier detection, become less meaningful.
Curse Of Dimensionality
● It is also one of the main causes of overfitting:

● Can we separate them?

Curse Of Dimensionality
It is also one of the main causes of overfitting:

When projected into 2D:

Overfitting!

Better option:

Curse Of Dimensionality
● How to solve it?
● Dimensionality Reduction  
○ Avoid curse of dimensionality
○ Reduce amount of time and memory required by data mining algorithms
○ Allow data to be more easily visualized
○ May help to eliminate irrelevant features or reduce noise
○ Avoid overfitting
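
The slide does not name a specific technique, so as an illustrative sketch here is PCA, one common dimensionality-reduction method, applied to the classic Iris data set with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)           # 150 samples, 4 features
X2 = PCA(n_components=2).fit_transform(X)   # project onto 2 directions of maximal variance
print(X.shape, "->", X2.shape)              # (150, 4) -> (150, 2): easier to visualize
```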

Feature Subset Selection
Reduce dimensionality of data

Remove:  

● Redundant features
○ duplicate much or all of the information contained in one or more other attributes
○ Example: purchase price of a product and the amount of sales tax paid
●  Irrelevant features
○ contain no information that is useful for the data mining task at hand
○ Example: students' ID is often irrelevant to the task of predicting students' GPA
●  Techniques
○ Brute-force approach:  Try all possible feature subsets as input to data mining algorithm
○ Embedded approaches:  Feature selection occurs naturally as part of the data mining algorithm
○ Filter approaches:  Features are selected before data mining algorithm is run
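
A minimal sketch of the filter approach, assuming scikit-learn: each feature is scored against the target before any model is trained, and only the top-scoring features are kept (the Iris data and k = 2 are illustrative choices).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter approach: score features against the target before running the learning algorithm
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.get_support())  # boolean mask of the 2 retained features
```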
Bias and Variance
● Bias:
○ Assumptions made by a model to make the target function easier to learn.
○ In practice it shows up as the error rate on the training data: when the training
error is high the model has high bias, and when it is low the model has low bias.
● Variance:
○ The error rate on the testing data is called variance: when the test error is high
the model has high variance, and when it is low the model has low variance.
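
A minimal sketch of reading bias and variance off the training and testing error, assuming scikit-learn; the synthetic regression data and the fully grown decision tree are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A fully grown tree memorizes the training set: training error near zero,
# test error clearly worse -> low bias, high variance
model = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
print("train R^2:", round(model.score(X_tr, y_tr), 2))
print("test  R^2:", round(model.score(X_te, y_te), 2))
```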
Bias and Variance

(Figure: example fits labeled high bias (underfit) and high variance (overfit).)

Model Complexity
Underfitting and Overfitting

Underfitting
1. Reasons for Underfitting:
a. High bias and low variance
b. The training dataset used is not large enough.
c. The model is too simple.
d. The training data is not cleaned and contains noise.
2. Techniques to reduce underfitting (see the sketch below):
a. Increase model complexity
b. Increase the number of features by performing feature engineering
c. Remove noise from the data.
d. Increase the number of epochs or the duration of training to get better results.
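
A minimal sketch of points (2a) and (2b), increasing model complexity through feature engineering; the quadratic toy data and the scikit-learn pipeline are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)   # quadratic target with a little noise

linear = LinearRegression().fit(X, y)                                  # too simple: underfits
richer = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print(round(linear.score(X, y), 2), round(richer.score(X, y), 2))      # low R^2 vs high R^2
```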

Overfitting
1. Reasons for Overfitting are as follows:
a. High variance and low bias
b. The model is too complex
c. The training data set is too small
2. Techniques to reduce overfitting (see the sketch below):
a. Increase training data.
b. Reduce model complexity.
c. Early stopping during the training phase (monitor the loss during training and stop
as soon as the loss on held-out data begins to increase).
d. Ridge Regularization and Lasso Regularization
e. Use dropout for neural networks to tackle overfitting.
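
A minimal sketch of early stopping (technique 2c), assuming scikit-learn's MLPClassifier, which can hold out part of the training data and stop once the validation score stops improving; the synthetic data and network size are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 20% of the training data and stop once its score stops improving
model = MLPClassifier(hidden_layer_sizes=(64,), early_stopping=True,
                      validation_fraction=0.2, n_iter_no_change=5,
                      max_iter=500, random_state=0)
model.fit(X, y)
print("epochs actually run:", model.n_iter_)  # far fewer than max_iter if stopping kicks in
```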

Regularization
Regularization is a very important technique in machine learning to prevent
overfitting. Mathematically speaking, it adds a regularization term to the loss function so
that the coefficients cannot fit the training data so perfectly that the model overfits.

L1 regularization penalizes the sum of the absolute values of the weights, while L2
regularization penalizes the sum of their squares. The difference between their properties
can be promptly summarized as follows:

Sparsity: under L1, some parameters become exactly 0
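
A minimal sketch contrasting the two penalties with scikit-learn's Lasso (L1) and Ridge (L2); the synthetic data set and alpha = 1.0 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: alpha * sum(|w|)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: alpha * sum(w^2)

print("L1 zero weights:", int(np.sum(lasso.coef_ == 0)))   # many exact zeros -> sparse
print("L2 zero weights:", int(np.sum(ridge.coef_ == 0)))   # typically none -> non-sparse
```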


Differences
The following points summarize the differences between L1 and L2 regularization:

1. L1 penalizes the sum of the absolute values of the weights; L2 penalizes the sum of the squared weights.
2. L1 has a sparse solution; L2 has a non-sparse solution.
3. L1 can give multiple solutions; L2 has only one solution.
4. L1 has built-in feature selection; L2 performs no feature selection.
5. L1 is robust to outliers; L2 is not robust to outliers.
6. L1 generates simple and interpretable models; L2 gives more accurate predictions when the output variable is a function of all the input variables.
7. L1 is unable to learn complex data patterns; L2 is able to learn complex data patterns.
8. L1 is computationally inefficient under non-sparse conditions; L2 is computationally efficient because it has an analytical solution.
