Unit 1

Supervised Machine Learning:

Supervised learning is a machine learning method in which models are trained using
labeled data. In supervised learning, models need to find the mapping function that maps the
input variable (X) to the output variable (Y).

Supervised learning needs supervision to train the model, similar to how a student
learns in the presence of a teacher. Supervised learning can be used for two types of
problems:

Classification and Regression.


Example: Suppose we have images of different types of fruits. The task of our supervised
learning model is to identify the fruits and classify them accordingly. To identify the images,
we give the model the input data as well as the corresponding output, which means we
train the model on the shape, size, color, and taste of each fruit. Once training is
complete, we test the model with a new set of fruits. The model identifies the
fruit and predicts the output using a suitable algorithm.

Unsupervised Machine Learning:


Unsupervised learning is another machine learning method, in which patterns are inferred from
unlabeled input data. The goal of unsupervised learning is to find the structure and
patterns in the input data. Unsupervised learning does not need any supervision; instead, it
finds patterns in the data on its own.
Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
A supervised learning model takes direct feedback to check whether it is predicting the correct output or not. | An unsupervised learning model does not take any feedback.
A supervised learning model predicts the output. | An unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights in an unknown dataset.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be classified into Clustering and Association problems.
Supervised learning can be used in cases where we know the inputs as well as the corresponding outputs. | Unsupervised learning can be used in cases where we have only input data and no corresponding output data.
A supervised learning model produces accurate results. | An unsupervised learning model may give less accurate results compared to supervised learning.
Supervised learning is not close to true Artificial Intelligence, as we first train the model on each example and only then can it predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence, as it learns in much the same way a child learns daily routine things from experience.
It includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc. | It includes algorithms such as Clustering, KNN, and the Apriori algorithm.

Types of ML Classification Algorithms:


Classification algorithms can be divided into two main categories:
o Linear Models
   o Logistic Regression
   o Support Vector Machines
o Non-linear Models
   o K-Nearest Neighbours
   o Kernel SVM
   o Naïve Bayes
   o Decision Tree Classification
   o Random Forest Classification
Logistic Regression
o logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
o Logistic Regression is much similar to the Linear Regression except that how they are
used. Linear Regression is used for solving Regression problems, whereas Logistic
regression is used for solving the classification problems.
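
A minimal sketch of binary logistic regression with scikit-learn (assuming scikit-learn is installed; the built-in breast cancer dataset is used purely for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary labels (0 or 1), as logistic regression expects for classification
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)  # raise max_iter so the solver converges
model.fit(X_train, y_train)

print(model.predict(X_test[:5]))        # discrete class labels
print(model.predict_proba(X_test[:5]))  # probabilities between 0 and 1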
K-Nearest Neighbor (KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on
the Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available
cases, and puts the new case into the category most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as Classification, but it is
mostly used for classification problems.
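
A minimal K-NN sketch with scikit-learn (assumed installed; the Iris dataset and k = 5 are illustrative choices, not prescribed by the text):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 5: a new point gets the majority class of its 5 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)              # K-NN essentially just stores the data
print(knn.score(X_test, y_test))       # accuracy on unseen points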

Support Vector Machine Algorithm


Support Vector Machine, or SVM, is one of the most popular Supervised Learning algorithms,
used for Classification as well as Regression problems. However, it is primarily
used for classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes, so that we can easily put a new data point in the
correct category in the future. This best decision boundary is called a hyperplane.
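
A minimal SVM sketch fitting a hyperplane between two classes (scikit-learn assumed; the synthetic 2-D data is purely illustrative):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# 200 points in 2 dimensions, two classes
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, random_state=0)

clf = SVC(kernel="linear")         # a linear kernel learns a straight-line boundary
clf.fit(X, y)
print(clf.support_vectors_.shape)  # the support vectors define the hyperplane
print(clf.predict(X[:5]))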
Naïve Bayes Classifier Algorithm
o The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes'
theorem and used for solving classification problems.
o It is mainly used in text classification, which involves high-dimensional training
datasets.
o The Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms; it helps in building fast machine learning models that can make
quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
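
A minimal Naïve Bayes text-classification sketch (scikit-learn assumed; the tiny corpus and spam labels below are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["free prize money now", "meeting agenda attached",
         "win cash prize today", "project status report"]
labels = [1, 0, 1, 0]                  # 1 = spam, 0 = not spam

vec = CountVectorizer()                # high-dimensional bag-of-words features
X = vec.fit_transform(texts)

nb = MultinomialNB()
nb.fit(X, labels)
# predicts on the basis of class probabilities
print(nb.predict_proba(vec.transform(["free cash now"])))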
Unsupervised learning:
Unsupervised learning cannot be directly applied to a regression or classification problem
because, unlike supervised learning, we have the input data but no corresponding output data.
The goal of unsupervised learning is to find the underlying structure of the dataset, group
the data according to similarities, and represent the dataset in a compressed format.
o Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities
with the objects of another group. Cluster analysis finds the commonalities between the
data objects and categorizes them according to the presence or absence of those
commonalities.
o Association: An association rule is an unsupervised learning method used
for finding relationships between variables in a large database. It determines the
sets of items that occur together in the dataset. Association rules make marketing
strategy more effective; for example, people who buy item X (say, bread) also
tend to purchase item Y (butter or jam). A typical example of association rules is Market
Basket Analysis.
o K-means clustering
o KNN (k-nearest neighbors)
K-Means Clustering is an unsupervised learning algorithm used to solve
clustering problems in machine learning and data science. In this topic, we will learn
what the K-means clustering algorithm is and how it works, along with a
Python implementation of k-means clustering.
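
A minimal Python sketch of k-means clustering (scikit-learn assumed; the six 2-D points are illustrative):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two learned centroids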
Apriori Algorithm in Machine Learning
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed
to work on databases that contain transactions. With the help of these association rules, it
determines how strongly or how weakly two objects are connected. The algorithm uses
a breadth-first search and a Hash Tree to calculate itemset associations efficiently. It is
an iterative process for finding frequent itemsets in a large dataset.
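
A minimal Apriori sketch using the third-party mlxtend library (an assumption: install it with pip install mlxtend); the bread/butter/jam transactions are illustrative:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "butter"], ["bread", "jam"],
                ["bread", "butter", "jam"], ["milk", "bread"]]

# One-hot encode the transaction database
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets (support >= 0.5), then rules measuring connection strength
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])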

PCA
Principal component analysis, or PCA, is a dimensionality-reduction method that is
often used to reduce the dimensionality of large datasets by transforming a large set
of variables into a smaller one that still contains most of the information in the large
set. It proceeds in five steps:
Step 1: Standardization
Step 2: Covariance Matrix Computation
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify
the principal components
Step 4: Create a Feature Vector
Step 5: Recast the Data Along the Principal Components Axes
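
A sketch of the five steps in NumPy (the random data and the choice of k = 2 components are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))                # 100 samples, 5 variables

# Step 1: standardization (zero mean, unit variance per variable)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix computation
cov = np.cov(Z, rowvar=False)

# Step 3: eigenvectors and eigenvalues identify the principal components
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: the covariance matrix is symmetric

# Step 4: feature vector = eigenvectors of the k largest eigenvalues
order = np.argsort(eigvals)[::-1]
feature_vector = eigvecs[:, order[:2]]  # keep k = 2 components

# Step 5: recast the data along the principal component axes
X_reduced = Z @ feature_vector
print(X_reduced.shape)                  # (100, 2)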

Types of cross-validation
1. K-fold cross-validation
2. Hold-out cross-validation
3. Stratified k-fold cross-validation
4. Leave-p-out cross-validation
5. Leave-one-out cross-validation
6. Monte Carlo (shuffle-split)
7. Time series (rolling cross-validation)

K-fold cross-validation
In this technique, the whole dataset is partitioned into k parts of equal size, and each
partition is called a fold. It's known as k-fold since there are k parts, where k can be any
integer: 3, 4, 5, etc.
One fold is used for validation and the other k-1 folds are used for training the model. The
technique is repeated k times, so that every fold is used once as the validation set and the
remaining folds as the training set.
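
A minimal k-fold sketch with scikit-learn (assumed installed; k = 5 and the Iris data are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5: five folds, each used exactly once as the validation set
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())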
Holdout cross-validation
Also called a train-test split, holdout cross-validation partitions the entire dataset
randomly into a training set and a validation set. A common rule of thumb is that
roughly 70% of the dataset is used as the training set and the remaining 30% as the
validation set. Since the dataset is split into only two sets, the model is built just once
on the training set, so this technique executes quickly.
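
A minimal holdout sketch; the 70/30 ratio follows the rule of thumb above (scikit-learn assumed):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.3: 70% of the data for training, 30% for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=42)
print(len(X_train), len(X_val))  # 105 and 45 of the 150 samples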
Stratified k-fold cross-validation
As seen above, plain k-fold validation can't be relied on for imbalanced datasets because data
is split into k folds with a uniform probability distribution. Not so with stratified k-fold,
an enhanced version of the k-fold cross-validation technique. Although it too splits the
dataset into k equal folds, each fold has the same ratio of instances of the target variable
as the complete dataset. This makes it work well for imbalanced datasets, but not for
time-series data.
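
A minimal stratified k-fold sketch (scikit-learn assumed; the 80/20 class imbalance is an illustrative example):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)       # imbalanced: 80% class 0, 20% class 1

skf = StratifiedKFold(n_splits=4)
for train_idx, val_idx in skf.split(X, y):
    # every validation fold keeps the 4:1 class ratio of the full dataset
    print(np.bincount(y[val_idx]))     # prints [4 1] each time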

Leave-p-out cross-validation
An exhaustive cross-validation technique: if a dataset has n samples, p samples are used as
the validation set and the remaining n-p samples as the training set. The process is repeated
until every possible subset of p samples has served once as the validation set, with the
remaining n-p samples as the training set.
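
A minimal leave-p-out sketch with scikit-learn (p = 2 and the tiny n = 4 dataset are illustrative, keeping the exhaustive enumeration short):

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(8).reshape(4, 2)  # n = 4 samples
lpo = LeavePOut(p=2)            # every 2-sample subset becomes a validation set

print(lpo.get_n_splits(X))      # C(4, 2) = 6 splits
for train_idx, val_idx in lpo.split(X):
    print("train:", train_idx, "validate:", val_idx)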
Leave-one-out cross-validation
Leave-one-out cross-validation is the special case of leave-p-out with p = 1: a single sample
is used as the validation set and the remaining n-1 samples are used as the training set. The
process is repeated n times, so that every sample serves exactly once as the validation set.
Monte Carlo cross-validation
Also known as shuffle-split cross-validation and repeated random subsampling cross-
validation, the Monte Carlo technique repeatedly splits the whole dataset into training data
and test data. The split can be 70-30%, 60-40%, or any other proportion you prefer; a fresh
random split is drawn on each iteration, so the training and test samples differ every round.
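
A minimal Monte Carlo (shuffle-split) sketch with scikit-learn; the 70-30 split and five repetitions are illustrative choices:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)

# Five independent random 70-30 splits of the same data
ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
for train_idx, test_idx in ss.split(X):
    print("train:", train_idx, "test:", test_idx)  # samples differ each round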
Time series (rolling cross-validation / forward chaining method)
Time series data is collected at different points in time. This kind of data allows
one to understand what factors influence certain variables from period to period; some
examples of time series data are weather records, economic indicators, etc. In rolling
cross-validation, the model is always trained on past observations and validated on the
observations that follow, so the temporal order of the data is never violated.
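
A minimal rolling (forward-chaining) sketch using scikit-learn's TimeSeriesSplit; the six time-ordered observations are illustrative:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(6, 2)  # 6 observations in time order

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X):
    # the training window always precedes the validation point
    print("train:", train_idx, "validate:", val_idx)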

Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or
more data points than required, in the given dataset. Because of this, the model starts
capturing the noise and inaccurate values present in the dataset, and these factors reduce
the efficiency and accuracy of the model. An overfitted model has low bias and high variance.

Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying
trend of the data. To avoid overfitting, the feeding of training data can be stopped at an
early stage, as a result of which the model may not learn enough from the training data and
may fail to find the best fit of the dominant trend in the data. An underfitted model has
high bias and low variance.

Training Set
This is the actual dataset from which the model learns, i.e., the model sees and learns from
this data in order to predict the outcome or make the right decisions. Most training data is
collected from several sources and then preprocessed and organized to ensure proper
performance of the model.

Testing Set
This dataset is independent of the training set but has a somewhat similar probability
distribution of classes. It is used as a benchmark to evaluate the model, and only after
training of the model is complete. The testing set is usually a properly organized dataset
containing all kinds of data for the scenarios the model will probably face when used in the
real world.

Validation Set
The validation set is used to fine-tune the hyperparameters of the model and is considered a
part of the training of the model. The model only sees this data for evaluation but does not
learn from it, which provides an objective, unbiased evaluation of the model. The validation
set can also be used for regularization via early stopping: training is interrupted when the
loss on the validation set becomes greater than the loss on the training set, i.e., balancing
bias and variance.

Challenges Motivating Deep Learning


In Machine Learning, there is a process of analyzing data in order to build or train
models. Machine learning is everywhere: from Amazon product recommendations to self-driving
cars, it holds great value throughout. As per recent research, the global machine learning
market is expected to grow by 43% by 2024. This revolution has greatly increased the demand
for machine learning professionals. AI and machine learning jobs have seen a significant
growth rate of 75% over the past four years, and the industry continues to grow. Common
challenges include:
 Poor Quality of Data
 Underfitting of Training Data
 Overfitting of Training Data
 Machine Learning is a Complex Process
 Lack of Training Data
 Slow Implementation
 Imperfections in the Algorithm When Data Grows
Estimators, Bias, Variance
 Bias is the inability of a model to capture the true relationship in the data, because of
which there is some difference, or error, between the model's predicted value and the
actual value.
Variance
 Variance is the measure of spread in data from its mean position. In machine
learning, variance is the amount by which the performance of a predictive
model changes when it is trained on different subsets of the training data.
Ways to reduce variance in machine learning:
 Cross-validation: By splitting the data into training and testing sets multiple
times, cross-validation can help identify if a model is overfitting or underfitting
and can be used to tune hyperparameters to reduce variance.
 Feature selection: Choosing only the relevant features decreases the model's
complexity and can reduce the variance error.
 Regularization: We can use L1 or L2 regularization to reduce variance in
machine learning models (a minimal sketch follows this list).
 Ensemble methods: Ensemble methods combine multiple models to improve
generalization performance. Bagging, boosting, and stacking are common ensemble
methods that can help reduce variance.
 Simplifying the model: Reducing the complexity of the model, such as
decreasing the number of parameters or layers in a neural network, can also help
reduce variance and improve generalization performance.
 Early stopping: Early stopping is a technique used to prevent overfitting by
stopping the training of the deep learning model when the performance on the
validation set stops improving.
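
A minimal sketch combining two of the ideas above, L2 regularization (via scikit-learn's Ridge) and cross-validation; the synthetic data and alpha values are illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=0)

# alpha sets the strength of the L2 penalty: a larger alpha shrinks the
# coefficients harder, lowering variance at the cost of some extra bias
for alpha in (0.01, 1.0, 100.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)
    print(alpha, scores.mean())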

Comparison table between Parameters and Hyperparameters

Parameters | Hyperparameters
Parameters are the configuration variables that are internal to the model. | Hyperparameters are the explicitly specified parameters that control the training process.
Parameters are essential for making predictions. | Hyperparameters are essential for optimizing the model.
These are specified or estimated while training the model. | These are set before the beginning of the training of the model.
These are internal to the model. | These are external to the model.
These are learned and set by the model itself. | These are set manually by a machine learning engineer/practitioner.
These are dependent on the dataset used for training. | These are independent of the dataset.
The values of parameters can be estimated by optimization algorithms such as Gradient Descent. | The values of hyperparameters can be estimated by hyperparameter tuning.
The final parameters estimated after training decide the model's performance on unseen data. | The selected or fine-tuned hyperparameters decide the quality of the model.
Some examples of model parameters are weights in an ANN, support vectors in SVM, and coefficients in Linear or Logistic Regression. | Some examples of model hyperparameters are the learning rate for training a neural network, K in the KNN algorithm, etc.
