Module 1

Introduction to ML
What is Machine Learning?

• Computer algorithms that learn without being explicitly coded by a programmer.
• The machine receives data as input and uses an algorithm to formulate answers.
• Using statistical methods, algorithms are trained to make classifications or predictions and to uncover key insights in data.
Machine Learning vs. Traditional Programming

[Diagram: in traditional programming, data and rules go into the computer and the output comes out; in machine learning, data and outputs go into the computer and the rules (the model) come out.]
How does ML work?

• Humans learn from experience.
• By analogy, when we face an unknown situation, the likelihood of success is lower than in a known situation. Machines are trained the same way.
• The core concepts of ML are:
• Learning: discovering patterns in data.
• Feature vector: the list of attributes used to solve a problem is called a feature vector. You can think of a feature vector as the subset of the data that is used to tackle a problem.
• Model: the machine uses an algorithm to turn the discovered patterns into a model.
• Inferring: the previously trained model can be used to make inferences on new data.
Types of ML

• Supervised Learning: learning in the presence of a supervisor or teacher.
• The data has labels (outcomes or annotations).
• Supervised learning is classified into two categories of algorithms:
• Classification: a classification problem is when the output variable is a category.
• Regression: a regression problem is when the output variable is a real value.

• Key point: supervised learning deals with, or learns from, "labeled" data. This implies that some data is already tagged with the correct answer.
Algorithms

• Linear Regression
• Logistic Regression
• Decision Tree
• SVM
• Random Forest
• AdaBoost
Decision Tree
• Decision tree algorithms are essentially a series of if-else statements that can be used to predict a result based on a dataset. This flowchart-like structure helps us in decision making.
• What is entropy? Entropy is a measure of impurity, disorder, or uncertainty in a set of examples. Entropy controls how a decision tree decides to split the data.
• Example: a sample of 14 balls, 4 red and 10 green.

• The entropy of a group in which all examples belong to the same class is always 0.
• The entropy of a group with 50% in either class is always 1.
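Entropy is computed as H = −Σ pᵢ log₂(pᵢ) over the class proportions. A minimal Python sketch for the examples above:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)

print(entropy([4, 10]))  # mixed group of 4 red / 10 green -> ~0.863
print(entropy([14, 0]))  # all examples in one class      -> 0.0
print(entropy([7, 7]))   # 50/50 split                    -> 1.0
```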
Decision Tree

• Information gain (IG) measures how much "information" a feature gives us about the class. It tells us how important a given attribute of the feature vector is. Information gain is used to decide the ordering of attributes in the nodes of a decision tree.
• Information Gain (ID3): tends to favor features with more unique values.
• Gain Ratio (C4.5): information gain normalized by split information; tends to prefer unbalanced splits (one partition much smaller than the other).
• Gini Index (CART): an impurity measure, used in a binary-split setup.
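A short scikit-learn sketch (toy dataset for illustration) showing how the split criterion is chosen: "entropy" corresponds to information-gain-style splitting and "gini" to the Gini index.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="entropy" splits using entropy/information gain;
# criterion="gini" uses the Gini impurity index (CART's default).
for criterion in ("entropy", "gini"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    print(criterion, tree.score(X, y))
```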
Bagging

• Bagging draws random samples from the data set: each model is built from a bootstrap sample, drawn from the original data with replacement, known as row sampling.
• This step of row sampling with replacement is called bootstrapping.
• Each model is trained independently and produces its own result.
• The final output is based on majority voting after combining the results of all models. This step, which involves combining all the results and generating the output by majority voting, is known as aggregation.
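A minimal scikit-learn sketch of bagging decision trees (the `estimator` parameter name assumes a recent scikit-learn version; toy data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each tree is fit on a bootstrap sample (row sampling with replacement);
# predictions are combined by majority voting (aggregation).
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,  # sample rows with replacement
    random_state=0,
).fit(X, y)
print(bagging.score(X, y))
```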
XGBoost – Extreme Gradient Boosting

• Boosting: builds a model from individual weak learners in an iterative way.
• Unlike random forest, it is not built on random subsets of data/features.
• Instead, it puts more weight on instances with wrong predictions, learning from its mistakes.
• Gradient boosting uses gradient descent to minimize the loss function.
• XGBoost:
• Developed at the University of Washington (2016).
• Credited with winning many Kaggle competitions.
• Uses many tricks to optimize accuracy and speed.
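A minimal sketch assuming the xgboost Python package is installed; the hyperparameter values are illustrative choices, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added sequentially; each fits the gradient of the loss,
# with regularization (reg_lambda) and learning-rate shrinkage.
model = XGBClassifier(n_estimators=200, learning_rate=0.1,
                      max_depth=3, reg_lambda=1.0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```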
Decision Tree to XGBoost

• Decision Tree: a flowchart of decisions based on certain conditions.
• Bagging: combines the predictions of multiple decision trees via majority voting (democracy).
• Random Forest: similar to bagging, but only a random subset of features is selected to build each tree in the collection (addresses overfitting).
• Boosting: builds models sequentially, minimizing the errors of previous models and boosting the influence of high-performing models.
• Gradient Boosting: uses gradient descent to minimize errors.
• XGBoost: optimized gradient boosting (parallelization, regularization, pruning, etc.).
KNN

• Step 1 − Every algorithm needs a dataset, so during the first step of KNN we must load the training as well as the test data.
• Step 2 − Next, choose the value of K, i.e., the number of nearest data points to consider. K can be any integer.
• Step 3 − For each point in the test data, do the following:
• 3.1 − Calculate the distance between the test point and each row of the training data using any one method, namely Euclidean, Manhattan, or Hamming distance. The most commonly used distance is Euclidean.
• 3.2 − Based on the distance values, sort the training rows in ascending order.
• 3.3 − Choose the top K rows from the sorted array.
• 3.4 − Assign the test point the most frequent class among these rows (see the sketch below).
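A from-scratch Python sketch of these steps, using Euclidean distance; the toy data points are purely illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test point following steps 3.1-3.4 above."""
    # Step 3.1: Euclidean distance from the test point to every training row
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Steps 3.2-3.3: sort by distance and take the top K rows
    nearest = np.argsort(distances)[:k]
    # Step 3.4: majority class among the K neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # -> 0
```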
Unsupervised Learning

• In unsupervised learning, an algorithm explores input data without being given an explicit output variable (e.g., it explores customer demographic data to identify patterns).
• You want the algorithm to find patterns and group the data.
• Algorithms:
• K-means
• GMM
K-Means

• Initialize K, i.e., the number of clusters to be created.
• Randomly assign K centroid points.
• Assign each data point to its nearest centroid to create K clusters.
• Re-calculate the centroids using the newly created clusters.
• Repeat steps 3 and 4 until the centroids stop changing.
• WCSS: the sum of squared distances between the points in a cluster and the cluster centroid.
Elbow Method

• Plot WCSS against the number of clusters K; the point where the curve bends (the "elbow") suggests a good value of K.
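A minimal scikit-learn sketch (toy blob data assumed): `inertia_` is the WCSS, and printing it for several values of K is the elbow method in code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# inertia_ is the WCSS; plot it against K and look for the "elbow".
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```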
DBSCAN

• Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the base algorithm for density-based clustering. It can discover clusters of different shapes and sizes in large amounts of data containing noise and outliers.
• The DBSCAN algorithm uses two parameters:
• minPts: the minimum number of points (a threshold) clustered together for a region to be considered dense.
• eps (ε): a distance measure used to locate the points in the neighborhood of any point.
DBSCAN

• There are three types of points after DBSCAN clustering is complete:
• Core — a point that has at least minPts points within distance eps of itself.
• Border — a point that has at least one core point within distance eps.
• Noise — a point that is neither a core nor a border point; it has fewer than minPts points within distance eps of itself.
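A minimal scikit-learn sketch (the two-moons dataset and the eps/min_samples values are illustrative); label -1 marks noise points:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius; min_samples is the density threshold (minPts).
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks noise points; core points are listed in core_sample_indices_.
print("clusters:", set(db.labels_) - {-1})
print("noise points:", (db.labels_ == -1).sum())
print("core points:", len(db.core_sample_indices_))
```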

Association Rules

• Market basket analysis

• Investing time and resources in deliberate product placements like this not only reduces a customer's shopping time, but also reminds the customer of relevant items he or she might be interested in buying, thus helping stores cross-sell in the process.

• Association rules help uncover all such relationships between items in huge databases.

• Example: out of 100 total transactions, 70 contain only milk, 4 contain only toothbrush, and 10 contain both (common).
• For the rule toothbrush → milk, calculate the support and confidence (worked out below).
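A worked calculation for this exercise. Reading the slide's counts as milk-only, toothbrush-only, and both is an interpretation (the slide lists bare numbers): support(X → Y) = freq(X and Y) / total, and confidence(X → Y) = freq(X and Y) / freq(X).

```python
# Counts read from the example above: milk-only, toothbrush-only, both, total.
milk_only, toothbrush_only, both, total = 70, 4, 10, 100

toothbrush_total = toothbrush_only + both       # 14 transactions contain toothbrush

# Support of {toothbrush, milk}: fraction of all transactions with both items.
support = both / total                          # 10/100 = 0.10
# Confidence of {toothbrush} -> {milk}: P(milk | toothbrush).
confidence = both / toothbrush_total            # 10/14 ≈ 0.714

print(f"support = {support:.2f}, confidence = {confidence:.3f}")
```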

Supervised vs. Unsupervised

• Input data: in supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided.
• Output: supervised learning predicts an output; unsupervised learning finds hidden patterns in the data.
• Accuracy: supervised learning models produce accurate results; the accuracy of unsupervised learning models is lower.
• Objective: the objective of supervised learning is to train the model to predict the output when new data is provided; the objective of unsupervised learning is to find useful insights and hidden patterns in an unknown dataset.
• Computational complexity: supervised learning is computationally very complex; unsupervised learning has less computational complexity.
• Applications: supervised learning is used for spam detection, handwriting recognition, pattern recognition, speech recognition, etc.; unsupervised learning is used for detecting fraudulent transactions, data preprocessing, etc.
Steps in an ML Application

• Collect data
• Prepare the input
• Analyze the input
• Train the algorithm
• Validate the algorithm
• Test it
• Use it
Issues with ML
• Inadequate training data / poor data quality: data plays a vital role in machine learning.
• Overfitting: when a machine learning model is trained on a huge amount of data, it starts capturing the noise and inaccuracies in the training data set. This negatively affects the performance of the model.
• Methods to reduce overfitting:
• Increase the training data in the dataset.
• Reduce model complexity by simplifying the model, i.e., selecting one with fewer parameters.
• Ridge regularization and Lasso regularization (see the sketch after this list).
• Early stopping during the training phase.
• Reduce the noise.
• Reduce the number of attributes in the training data.
• Constrain the model.
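A minimal scikit-learn sketch of the regularization bullet above (toy data; the alpha values are illustrative): Ridge and Lasso add a penalty that constrains the model's coefficients, trading a little bias for lower variance.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=50, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha controls regularization strength; a larger alpha constrains the
# coefficients more, which can reduce overfitting on unseen data.
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_test, y_test), 3))
```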
Issues with ML

• Underfitting: the model is too simple to capture the structure of the data.
• When a machine learning model is trained with too little data, the accuracy of the model suffers.
• Methods to reduce underfitting:
• Increase model complexity.
• Remove noise from the data.
• Train on more and better features.
• Reduce the constraints.
• Increase the number of epochs to get better results.
Issues with ML

• Data bias: biased data leads to inaccurate results, skewed outcomes, and other analytical errors.
• Irrelevant features.
• Slow implementation.
• Lack of explainability.
AI / ML / Deep Learning

• AI: the capability of a computer system to mimic human cognitive functions such as learning and problem-solving.
• ML: an application of AI. It is the process of using mathematical models of data to help a computer learn without direct instruction. This enables a computer system to continue learning and improving on its own, based on experience.
• Deep Learning: neural networks help the computer system achieve AI through deep learning.
Applications of ML

• Machine learning is used for a variety of tasks such as fraud detection, predictive maintenance, portfolio optimization, task automation, and so on.
• A typical machine learning task is to provide recommendations. For those who have a Netflix account, all recommendations of movies or series are based on the user's historical data. Tech companies use unsupervised learning to improve the user experience with personalized recommendations.
Applications of ML

• Finance industry
• Machine learning is growing in popularity in the finance industry. Banks mainly use ML to find patterns in the data, but also to prevent fraud.

• Automation
• Machine learning can work autonomously in many fields without the need for human intervention, for example, robots performing the essential process steps in manufacturing plants.

• Healthcare industry
Applications of ML

• Marketing
• With the boom of data, marketing departments rely on AI to optimize customer relationships and marketing campaigns.
Training, Test and Validation Sets

• In most supervised machine learning tasks, best practice recommends splitting your data into three independent sets: a training set, a testing set, and a validation set.
• Training dataset: the sample of data used to fit the model.
• Validation dataset: used to evaluate the model and fine-tune the model's hyperparameters. The validation set is also known as the dev set or the development set.
• Test dataset: the sample of data used to provide an unbiased evaluation of the final model fit on the training dataset.
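A minimal scikit-learn sketch of a three-way split; the 60/20/20 proportions are an illustrative choice, not a rule:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off the test set, then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest,
                                                  test_size=0.25,  # 0.25 * 0.8 = 0.2
                                                  random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```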
Cross-Validation Techniques
Why is it needed?
K-Fold Cross-Validation
• The whole dataset is divided into k subsets of almost equal size. The first subset is selected as the test set and the model is trained on the remaining k−1 subsets.
• In the second iteration, the second subset is selected as the test set and the remaining k−1 subsets are used for training, and the error is calculated again.
• The mean of the errors from all iterations is taken as the CV test error estimate.



K-Fold Cross-Validation

• The best part about this method is that each data point gets to be in the test set exactly once and in the training set k−1 times. As the number of folds k increases, the variance of the estimate decreases.
• Typically, k-fold cross-validation is performed with k=5 or k=10.
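A minimal scikit-learn sketch of 5-fold cross-validation (toy dataset and model for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Each fold serves as the test set exactly once; the mean of the fold
# scores is the cross-validated estimate.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())
```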
Stratified K-Fold Cross-Validation

• Do not use purely random sampling.
• Make sure that each train/test split represents all the classes in the population.
• E.g., 1000 customers, of which 60% are female and 40% male.
• With an 80:20 split:
• 800 training samples — 480 representing the female population and 320 representing the male population.
• The same proportions hold for the test set.
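A minimal scikit-learn sketch mirroring the 60/40 example above (the dummy feature matrix is an illustrative placeholder); every fold preserves the class proportions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 60/40 class mix, as in the customer example above.
y = np.array([0] * 600 + [1] * 400)
X = np.zeros((1000, 1))  # features don't matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Every test fold of 200 keeps the 60:40 class proportions.
    print(np.bincount(y[test_idx]))  # -> [120  80] in each fold
```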
Bias, Variance, Overfitting and Underfitting
Bias

• Bias is the difference (error) between the model's predictions and the ground truth.
• It is evaluated on the training data.
• High bias means a basic model that cannot capture important features of the data.
• Can we deploy such a model and use it on unseen data? No.
• A model that cannot find patterns in our training set, and hence fails on both seen and unseen data, is said to be underfitting.
Variance

• The opposite of bias.
• Variance is the average variability in the model's predictions for the given dataset.
• A high-variance model does not generalize well to unseen data (overfitting).
• Such a model captures all the features of the data given to it, including the noise; when given new data, it cannot predict well because it is too specific to the training data.
• The model performs really well on the training data and gets high accuracy, but fails to perform on new, unseen data.
Overfitting and Underfitting

The situation where a model performs too well on the training data but its performance drops significantly on the test set is called overfitting.

On the other hand, if the model performs poorly on both the test and the train set, we call that underfitting.
Overfitting vs. Underfitting
Performance Analysis
Classification
Confusion Matrix

• A table that summarizes the performance of a classification model by counting true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
Precision and Recall

• Precision gives the proportion of positive predictions that are actually correct: Precision = TP / (TP + FP).
• Recall measures the proportion of actual positives that were predicted correctly: Recall = TP / (TP + FN).
• Recall is a useful metric in disease prediction (we don't want false negatives).
• Precision is used when we want to reduce false positives (in email spam detection, with spam as the positive class, we don't want the model to predict a non-spam email as spam).
Accuracy

• Accuracy is the ratio of the number of correct predictions to the total number of predictions.
• It is not useful if we have imbalanced data.
• E.g., 1000 patients (900 not COVID and 100 COVID):
• Even if all are predicted not COVID, the accuracy is still 90%.
F-Score

• We may know that we want to maximize either recall or precision at the expense of the other metric.
• Use case: preliminary disease screening of patients (we want recall close to 1.0 and can accept low precision).
• However, in cases where we want an optimal blend of precision and recall, we can combine the two metrics using the F1 score: F1 = 2 × Precision × Recall / (Precision + Recall).
• The harmonic mean is used instead of the arithmetic average to punish extreme values.
• E.g., if precision is 1 and recall is 0:
• The average would be 0.5.
• The F1 score would be zero.
Specificity

• Specificity = TN / (TN + FP)
• Specificity is the proportion of people not suffering from the disease who are correctly predicted as not suffering from it.
• In other words, the proportion of healthy people who are actually predicted as healthy is the specificity.
Example

For the confusion matrix shown on the original slide, calculate:
• Precision
• Recall
• Accuracy
• F-score
• Specificity
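The slide's confusion matrix is an image that is not reproduced here, so as a sketch the counts below are hypothetical (TP=40, FP=10, FN=5, TN=45); the formulas are the ones defined above.

```python
# Hypothetical confusion-matrix counts (the slide's actual numbers are
# in an image that is not reproduced here).
TP, FP, FN, TN = 40, 10, 5, 45

precision   = TP / (TP + FP)                        # 40/50  = 0.800
recall      = TP / (TP + FN)                        # 40/45  ≈ 0.889
accuracy    = (TP + TN) / (TP + TN + FP + FN)       # 85/100 = 0.850
f1          = 2 * precision * recall / (precision + recall)  # ≈ 0.842
specificity = TN / (TN + FP)                        # 45/55  ≈ 0.818

print(precision, recall, accuracy, f1, specificity)
```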
Performance Measure
Regression
MAE/MSE/RMSE

MAE is the mean absolute difference between the target values and the values predicted by the model. MAE is more robust to outliers and does not heavily penalize large errors.

MSE squares the differences, so it penalizes even small errors, which can lead to over-estimating how bad the model is. It is preferred over other metrics because it is differentiable and hence can be optimized more easily.

RMSE, the square root of MSE, is the most widely used metric for regression tasks. RMSE is useful when large errors are undesired.
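A minimal sketch of the three metrics with scikit-learn (the target and prediction values are toy numbers for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # mean |y - y_hat|
mse = mean_squared_error(y_true, y_pred)    # mean (y - y_hat)^2
rmse = np.sqrt(mse)                         # same units as the target
print(mae, mse, rmse)                       # 0.75 0.875 ≈0.935
```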
