Unit 1
Supervised learning is a machine learning method in which models are trained on
labeled data. In supervised learning, the model must find the mapping function that maps the
input variable (X) to the output variable (Y).
Supervised learning needs supervision to train the model, much as a student
learns in the presence of a teacher. Supervised learning can be used for two types of
problems: classification and regression. The table below contrasts it with unsupervised learning.
| Supervised Learning | Unsupervised Learning |
| --- | --- |
| The model takes direct feedback to check whether it is predicting the correct output. | The model does not take any feedback. |
| The model predicts the output. | The model finds hidden patterns in the data. |
| Needs supervision to train the model. | Does not need any supervision to train the model. |
| Used in cases where we know the inputs as well as the corresponding outputs. | Used in cases where we have only input data and no corresponding output data. |
| Not close to true Artificial Intelligence, since the model must first be trained on the data before it can predict the correct output. | Closer to true Artificial Intelligence, since it learns much as a child learns daily routines from experience. |
PCA
Principal component analysis (PCA) is a dimensionality-reduction method often used
to reduce the dimensionality of large data sets by transforming a large set of variables
into a smaller one that still contains most of the information in the large set. It proceeds
in five steps:
Step 1: Standardization
Step 2: Covariance Matrix Computation
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify
the principal components
Step 4: Create a Feature Vector
Step 5: Recast the Data Along the Principal Components Axes
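As a rough illustration, the five steps above can be carried out with plain NumPy. The toy data and the choice of keeping k = 2 components are assumptions made for this sketch, not part of the method:

```python
import numpy as np

# Toy data: 10 samples, 3 features (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Step 1: standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvectors/eigenvalues of the covariance matrix
# (eigh is appropriate because the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: feature vector = eigenvectors of the top-k principal components
k = 2
feature_vector = eigvecs[:, :k]

# Step 5: recast the data along the principal component axes
X_pca = X_std @ feature_vector
print(X_pca.shape)  # (10, 2)
```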
Types of cross-validation
1. K-fold cross-validation
2. Hold-out cross-validation
3. Stratified k-fold cross-validation
4. Leave-p-out cross-validation
5. Leave-one-out cross-validation
6. Monte Carlo (shuffle-split)
7. Time series (rolling cross-validation)
K-fold cross-validation
In this technique, the whole dataset is partitioned into k parts of equal size, and each part is
called a fold. It is known as k-fold because there are k parts, where k can be any integer: 3, 4,
5, etc.
One fold is used for validation and the other k-1 folds are used for training the model. The
process is repeated k times, so that each fold serves as the validation set exactly once while
the remaining folds form the training set.
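A minimal sketch with scikit-learn's KFold; the synthetic dataset, the logistic-regression model, and the choice of k = 5 are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=100, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])               # train on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))  # validate on the held-out fold

print(np.mean(scores))  # average accuracy over the k folds
```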
Holdout cross-validation
Also called a train-test split, holdout cross-validation partitions the entire dataset
randomly into a training set and a validation set. A common rule of thumb is to use
about 70% of the dataset for training and the remaining 30% for validation. Since the
dataset is split into only two sets, the model is built just once on the training set, making
this the fastest technique to run.
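A minimal sketch using scikit-learn's train_test_split with the 70/30 rule of thumb; the dataset and model are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)

# One random 70/30 partition: train once, evaluate once
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_val, y_val))  # single score on the held-out 30%
```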
Stratified k-fold cross-validation
As seen above, plain k-fold validation is unreliable for imbalanced datasets, because the data
is split into k folds with a uniform probability distribution. Not so with stratified k-fold,
an enhanced version of the k-fold technique: although it too splits the dataset into k equal
folds, each fold preserves the same ratio of target-class instances as the complete dataset.
This makes it work well for imbalanced datasets, but not for time-series data.
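A minimal sketch with scikit-learn's StratifiedKFold; the roughly 90/10 class imbalance is an assumed toy setup, chosen to show that each fold preserves the class ratio:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy data: about 90% of samples in class 0
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # every validation fold keeps roughly the same 90/10 class ratio
    print(np.bincount(y[val_idx]))
```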
Leave-p-out cross-validation
An exhaustive cross-validation technique in which, for a dataset of n samples, p samples are
used as the validation set and the remaining n-p samples as the training set. The process is
repeated until every possible subset of p samples has served as the validation set once.
Leave-one-out cross-validation
Leave-one-out is the special case of leave-p-out with p = 1: in each of the n iterations, a
single sample is held out as the validation set and the remaining n-1 samples form the
training set, so every sample is used for validation exactly once.
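A minimal sketch using scikit-learn's LeavePOut and LeaveOneOut on an assumed tiny dataset of n = 4 samples:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.arange(8).reshape(4, 2)  # tiny dataset: n = 4 samples

# Leave-p-out with p = 2: every size-2 subset is the validation set once
lpo = LeavePOut(p=2)
print(lpo.get_n_splits(X))  # C(4, 2) = 6 splits

# Leave-one-out is the p = 1 special case: n splits in total
loo = LeaveOneOut()
for train_idx, val_idx in loo.split(X):
    print("train:", train_idx, "validate:", val_idx)
```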
Monte Carlo cross-validation
Also known as shuffle-split cross-validation and repeated random subsampling cross-
validation, the Monte Carlo technique repeatedly splits the whole dataset at random into
training data and test data. The split can be 70-30%, 60-40%, or whatever ratio you prefer;
the ratio stays fixed across iterations, while the random draw of which samples land on
each side changes every time.
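A minimal sketch with scikit-learn's ShuffleSplit; the dataset, model, number of repetitions, and 70/30 ratio are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=100, random_state=0)

# 10 independent random 70/30 splits: the ratio stays fixed, but each
# iteration reshuffles which samples land on each side of the split
ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

scores = []
for train_idx, test_idx in ss.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(scores))  # average over the random splits
```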
Time series (rolling cross-validation / forward chaining method)
Time series data is collected at different points in time and lets one understand which
factors influence certain variables from period to period; weather records and economic
indicators are typical examples. Because the observations are ordered, the folds cannot be
shuffled: rolling (forward-chaining) cross-validation trains on past observations and
validates on the window that immediately follows, moving the boundary forward at each step.
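A minimal sketch with scikit-learn's TimeSeriesSplit; the 12 ordered time steps and 4 splits are assumed for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(12, 1)  # 12 consecutive time steps

# Forward chaining: each training window ends before its validation window
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    print("train:", train_idx, "-> validate:", val_idx)
# e.g. train [0..3] validates on [4 5], train [0..5] on [6 7], and so on
```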
Overfitting
Overfitting occurs when a machine learning model tries to cover all the data points, or more
data points than necessary, in the given dataset. As a result, the model starts capturing the
noise and inaccurate values present in the data, which reduces its efficiency and accuracy.
An overfitted model has low bias and high variance.
Underfitting
Underfitting occurs when a machine learning model is unable to capture the underlying
trend of the data. To avoid overfitting, the feeding of training data can be stopped at an
early stage, but then the model may not learn enough from the training data and may fail
to fit the dominant trend in the data.
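To make these two failure modes concrete, here is a small sketch, with an assumed noisy sine dataset and assumed polynomial degrees, that compares training error against error on clean test data:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy samples

x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)  # clean underlying trend

for degree in (1, 3, 15):  # underfit, reasonable fit, overfit
    p = Polynomial.fit(x, y, degree)
    train_mse = np.mean((p(x) - y) ** 2)
    test_mse = np.mean((p(x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# degree 1 misses the trend entirely (underfitting: high bias), while
# degree 15 chases the noise (overfitting: low train error, high test error)
```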
Training Set
This is the actual dataset from which the model learns, i.e., the model sees and learns from
this data in order to predict outcomes or make the right decisions. Training data is usually
collected from several sources and then preprocessed and organized so that the model
performs properly.
Testing Set
This dataset is independent of the training set but follows a similar class probability
distribution. It is used as a benchmark to evaluate the model, and only after training is
complete. The testing set is usually a well-organized dataset containing data for all the
scenarios the model is likely to face when used in the real world.
Validation Set
The validation set is used to fine-tune the hyperparameters of the model and is considered
part of the training process. The model sees this data during evaluation but does not learn
from it, which provides an objective, unbiased evaluation of the model. The validation set
can also serve as a regularizer via early stopping: training is interrupted when the loss on
the validation set becomes greater than the loss on the training set, balancing bias and
variance.
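A minimal sketch of carving one dataset into training, validation, and test sets with two successive train_test_split calls; the dataset and the 60/20/20 ratios are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First hold back the test set, touched only after training is complete ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# ... then split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 * 80% = 20% overall

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```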
Parameters vs. Hyperparameters