Module 1 ML Mumbai University
Introduction to ML
What is Machine Learning ?
[Diagram: in traditional programming, Data and Rules go into the computer to produce Output; in machine learning, Data and Output go into the computer, which produces the Rules.]
How does ML work?
• Key point: supervised learning learns from “labeled” data, i.e. data that is already tagged with the correct answer.
Algorithms
• Linear Regression
• Logistic Regression
• Decision Tree
• SVM
• Random Forest
• AdaBoost
Decision Tree
• Decision tree algorithms are nothing but a series of if-else statements that can be used to predict a result
based on a dataset. This flowchart-like structure helps us in decision making.
• What is entropy: entropy is a measure of impurity, disorder or uncertainty in a set of examples. Entropy controls how a decision tree decides to split the data.
• Example: a sample of 14 balls (4 red and 10 green).
• The entropy of a group in which all examples belong to the same class will always be 0.
• The entropy of a group with 50% in either class will always be 1.
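The entropy values above can be checked with a short sketch; the 4-red/10-green sample is the slide's example:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([7, 7]))              # 50/50 split: entropy is 1.0
print(entropy([14]))                # pure group: entropy is 0 (printed as -0.0)
print(round(entropy([4, 10]), 3))   # the slide's 4-red/10-green sample: 0.863
```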
Decision Tree
• Information gain (IG) measures how much “information” a feature gives us about the class. It
tells us how important a given attribute of the feature vectors is. Information gain (IG) is used
to decide the ordering of attributes in the nodes of a decision tree.
• Information gain (ID3): biased toward features that have many unique values.
• Gain ratio (C4.5): information gain normalized by split information; tends to prefer unbalanced splits (one partition much smaller than the other).
• Gini index (CART): an impurity measure used for binary splits.
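A minimal sketch of information gain and gain ratio, using a hypothetical binary split of the 4-red/10-green node (the split itself is made up for illustration):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """IG = entropy of the parent node minus the weighted entropy of its children."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

def split_info(children):
    """Split information (C4.5): entropy of the partition sizes themselves."""
    return entropy([sum(ch) for ch in children])

parent = [4, 10]                # 4 red, 10 green
children = [[4, 2], [0, 8]]     # hypothetical split into two partitions
ig = information_gain(parent, children)
gain_ratio = ig / split_info(children)  # normalizing penalizes many-valued features
print(round(ig, 3), round(gain_ratio, 3))
```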
Bagging
• Bagging draws random samples from the dataset: each model is trained on a bootstrap sample, i.e. rows drawn from the original data with replacement (row sampling).
• This step of row sampling with replacement is called the bootstrap.
• Each model is trained independently which generates results.
• The final output is based on majority voting after combining the results of all models.
This step which involves combining all the results and generating output based on
majority voting is known as aggregation.
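The bootstrap-and-aggregate steps above can be sketched in plain Python; the 1-D dataset and the 1-nearest-neighbour base learner below are hypothetical stand-ins for any model:

```python
import random
from collections import Counter

def bootstrap_sample(data, labels, rng):
    """Row sampling with replacement: the 'bootstrap' step."""
    idx = [rng.randrange(len(data)) for _ in range(len(data))]
    return [data[i] for i in idx], [labels[i] for i in idx]

def one_nn_predict(train_x, train_y, x):
    """Tiny 1-nearest-neighbour base learner (stand-in for any model)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def bagging_predict(data, labels, x, n_models=5, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):                  # each model trained independently
        bx, by = bootstrap_sample(data, labels, rng)
        votes.append(one_nn_predict(bx, by, x))
    return Counter(votes).most_common(1)[0][0] # aggregation: majority vote

data   = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]       # toy 1-D feature
labels = ["A", "A", "A", "B", "B", "B"]
print(bagging_predict(data, labels, 1.1))
```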
K-Nearest Neighbors (KNN)
• Step 1 − For implementing any algorithm, we need a dataset. So during the first step of KNN, we must load the training as well as the test data.
• Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider. K can be any integer.
• Step 3 − For each point in the test data do the following −
• 3.1 − Calculate the distance between the test point and each row of training data using a distance measure such as Euclidean, Manhattan or Hamming distance. Euclidean distance is the most commonly used.
• 3.2 − Now, based on the distance value, sort them in ascending order.
• 3.3 − Next, it will choose the top K rows from the sorted array.
• 3.4 − Now, it will assign a class to the test point based on most frequent class of these rows.
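The four steps can be sketched directly; the 2-D points and labels below are hypothetical:

```python
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Steps 3.1-3.4: distance, sort, take top-k, majority class."""
    # 3.1 Euclidean distance from the query to every training row
    dists = [(sum((a - b) ** 2 for a, b in zip(row, query)) ** 0.5, y)
             for row, y in zip(train, labels)]
    # 3.2-3.3 sort ascending and keep the k nearest
    k_nearest = sorted(dists)[:k]
    # 3.4 assign the most frequent class among them
    return Counter(y for _, y in k_nearest).most_common(1)[0][0]

train  = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["red", "red", "red", "green", "green", "green"]
print(knn_predict(train, labels, (2, 2), k=3))   # → red
```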
Unsupervised Learning
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes in a large amount of data that contains noise and outliers.
• The DBSCAN algorithm uses two parameters:
• minPts: The minimum number of points (a threshold) clustered together for a region to be
considered dense.
• eps (ε): A distance measure that will be used to locate the points in the neighborhood of any
point.
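A minimal from-scratch sketch of the algorithm (in practice one would normally use a library implementation such as scikit-learn's DBSCAN); the point coordinates and the eps/minPts values are made up:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns a cluster id per point (-1 = noise)."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) ** 0.5 <= eps]

    labels = [None] * len(points)       # None = unvisited, -1 = noise
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:         # not dense enough: mark as noise for now
            labels[i] = -1
            continue
        cluster += 1                    # start a new cluster from this core point
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:         # noise reachable from a core point: border
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:  # j is also a core point: expand further
                seeds.extend(j_nbrs)
    return labels

pts = [(1, 1), (1.2, 1.1), (0.9, 1), (8, 8), (8.1, 8), (8, 8.2), (50, 50)]
print(dbscan(pts, eps=0.5, min_pts=2))  # two clusters plus one noise point
```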
Association Rules
• Investing time and resources on deliberate product placements like this not only reduces a customer’s
shopping time, but also reminds the customer of what relevant items (s)he might be interested in buying, thus
helping stores cross-sell in the process.
• Association rules help uncover all such relationships between items from huge databases.
• Example: Milk: 70, Toothbrush: 4, Common (both): 10, Total transactions: 100.
• Exercise: calculate the support and confidence for the rule Toothbrush => Milk.
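Support and confidence for a rule A => B follow directly from the counts. The sketch below applies the formulas to the slide's numbers for the rule Milk => Toothbrush (note the direction: the antecedent count must be at least the joint count):

```python
def support(count_both, total):
    """support(A => B) = transactions containing both A and B / total transactions."""
    return count_both / total

def confidence(count_both, count_antecedent):
    """confidence(A => B) = transactions containing both / transactions containing A."""
    return count_both / count_antecedent

# Slide's counts: 100 transactions, 70 with milk, 10 with both items.
print(support(10, 100))              # 0.1
print(round(confidence(10, 70), 3))  # 0.143 for the rule Milk => Toothbrush
```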
Supervised vs Unsupervised
• Input data: supervised learning is given input data along with the output; unsupervised learning is given only input data.
• Output: supervised learning predicts an output; unsupervised learning finds hidden patterns in the data.
• Accuracy: supervised learning models generally produce more accurate results; unsupervised learning models are less accurate.
• Objective: supervised learning trains the model to predict the output for new data; unsupervised learning finds useful insights and hidden patterns in an unknown dataset.
• Computational complexity: supervised learning is computationally more complex; unsupervised learning is less complex.
• Applications: supervised learning is used for spam detection, handwriting recognition, pattern recognition, speech recognition, etc.; unsupervised learning for detecting fraudulent transactions, data preprocessing, etc.
Steps in ML APPLICATION
• Collect Data
• Prepare Input
• Analyse the input
• Train the algorithm
• Validate the algorithm
• Test
• Use it
Issues with ML
• Inadequate Training Data/ Poor Quality of data : data plays a vital role in the processing of machine
learning.
• Overfitting: when a machine learning model fits the training data too closely, it starts capturing noise and inaccuracies in the training set. This negatively affects the model's performance on new data.
• Methods to reduce overfitting:
• Increase training data in a dataset.
• Reduce model complexity by simplifying the model by selecting one with fewer parameters
• Ridge Regularization and Lasso Regularization
• Early stopping during the training phase
• Reduce the noise
• Reduce the number of attributes in training data.
• Constraining the model.
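Ridge regularization shrinks model coefficients toward zero, which constrains the model. A tiny closed-form sketch for a 1-D, zero-intercept ridge fit (the data points are made up):

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge slope for a 1-D, zero-intercept model:
    minimises sum((y - w*x)^2) + lam * w^2  =>  w = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]               # roughly y = x, with noise
print(ridge_slope(xs, ys, lam=0.0))     # ordinary least-squares fit
print(ridge_slope(xs, ys, lam=10.0))    # coefficient shrunk toward zero
```

Increasing `lam` trades a little training-set fit for a simpler, less overfit model.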
Issues with ML
• Data Bias : Biased data leads to inaccurate results, skewed outcomes, and other analytical
errors.
• Irrelevant features
• Slow implementation
• Lack of explainability
AI/ML/DEEP LEARNING
• AI is the capability of a computer system to mimic human cognitive functions such as learning and problem-solving.
• ML is an application of AI: the process of using mathematical models of data to help a computer learn without direct instruction. This enables a computer system to continue learning and improving on its own, based on experience.
• Deep learning: multi-layer neural networks help the computer system achieve AI.
Application of ML
• Machine learning is also used for a variety of tasks such as fraud detection, predictive maintenance, portfolio optimization, task automation and so on.
• A typical machine learning task is to provide recommendations. For those who have a Netflix account, all recommendations of movies or series are based on the user's historical data. Tech companies use unsupervised learning to improve the user experience with personalized recommendations.
Application of ML
• Finance Industry
• Machine learning is growing in popularity in the finance industry. Banks are mainly
using ML to find patterns inside the data but also to prevent fraud.
• Automation:
• Machine learning can work autonomously in many fields without the need for human intervention; for example, robots performing the essential process steps in manufacturing plants.
• Healthcare industry
Application of ML
• Marketing
• With the boom of data, marketing departments rely on AI to optimize customer relationships and marketing campaigns.
Training Test and Validation set
• In most supervised machine learning tasks, best practice recommends splitting your data into three independent sets: a training set, a validation set, and a test set.
• Training Dataset: The sample of data used to fit the model.
• Validation dataset: the sample of data used to evaluate the model and fine-tune its hyperparameters. The validation set is also known as the dev set or the development set.
• Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on
the training dataset.
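A minimal sketch of a three-way split; the 70/15/15 fractions and the seed are arbitrary choices:

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle, then carve the data into three independent sets."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))   # 70 15 15
```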
Cross Validation Techniques
Need ?
K fold cross validation
• The whole data is divided into k sets of almost equal sizes. The first set is selected as the
test set and the model is trained on the remaining k-1 sets.
• In the second iteration, the 2nd set is selected as the test set, the remaining k-1 sets are used to train the model, and the error is calculated.
• The best part about this method is each data point gets to be in the test set exactly once
and gets to be part of the training set k-1 times. As the number of folds k increases, the
variance also decreases (low variance).
• Typically, K-fold Cross Validation is performed using k=5 or k=10
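The fold bookkeeping can be sketched without any library (in practice scikit-learn's KFold does the same job):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each point is in a test set exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]          # this fold is held out
        train = indices[:start] + indices[start + size:]  # remaining k-1 folds
        yield train, test
        start += size

for train_idx, test_idx in k_fold_indices(10, k=5):
    print(test_idx)   # [0, 1], [2, 3], [4, 5], [6, 7], [8, 9]
```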
Variance
• Opposite of bias.
• Variance is the average variability in the model prediction for the given dataset.
• High variance model does not generalize well on unseen data (Overfitting)
• A high-variance model captures all the features of the data given to it, including the noise; when given new data, it cannot predict well because it is too specific to the training data.
• Such a model will perform really well on the training data and get high accuracy, but will fail to perform on new, unseen data.
Precision and Recall
• Precision gives the proportion of positive predictions that are actually correct.
• Recall measures the proportion of actual positives that were predicted correctly.
• Recall is a useful metric in disease prediction (we want to avoid false negatives).
• Precision is used when we want to reduce false positives (in email spam detection, if spam is the positive class, we don't want the model to predict a non-spam email as spam).
Accuracy
• Accuracy is the ratio of the number of correct predictions to the total number of predictions.
• Not useful if we have imbalanced data
• 1000 Patients (900 Not COVID and 100 COVID)
• Even if every patient is predicted “not COVID”, accuracy is still 90%.
F score
• we might know we want to maximize either recall or precision at the expense of the other metric.
• Use case : preliminary disease screening of patients ( we want recall to be 1.0 we can accept low precision )
• However, in cases where we want to find an optimal blend of precision and recall, we can combine the two metrics using the F1 score.
• The harmonic mean is used instead of the arithmetic mean to punish extreme values.
• E.g. if precision is 1 and recall is 0:
• the average will be 0.5,
• but the F1 score will be zero.
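All four metrics fall out of the confusion-matrix counts. The counts below are a hypothetical classifier on the slide's 1000-patient scenario (the all-negative classifier has tp = 0, which would make F1 undefined):

```python
def scores(tp, fp, fn, tn):
    """Precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, accuracy, f1

# Hypothetical classifier: 100 positives (80 caught), 900 negatives (870 correct).
p, r, a, f1 = scores(tp=80, fp=30, fn=20, tn=870)
print(round(p, 3), round(r, 3), round(a, 3), round(f1, 3))
```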
Specificity
• Specificity measures the proportion of actual negatives that are correctly identified: TN / (TN + FP).
• Exercise: for a given confusion matrix, calculate precision, recall, accuracy, F-score and specificity.
Performance Measure
Regression
MAE/MSE/RMSE
MAE is the mean absolute difference between the target value and the value predicted by the model. MAE is more robust to outliers and does not heavily penalize large errors.
MSE squares the differences, so it penalizes even small errors, which can lead to over-estimating how bad the model is. It is often preferred over other metrics because it is differentiable and hence easier to optimize.
RMSE, the square root of MSE, is the most widely used metric for regression tasks. RMSE is useful when large errors are undesired.
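The three metrics can be sketched side by side (the target/prediction values are made up):

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of MSE."""
    return mse(y_true, y_pred) ** 0.5

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(mae(y_true, y_pred))               # 0.875
print(mse(y_true, y_pred))               # 1.3125
print(round(rmse(y_true, y_pred), 3))    # 1.146
```

Note how the single large error (|2.0 - 4.0| = 2) dominates MSE and RMSE but not MAE.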