
Unit 1

Fundamentals of Machine Learning


Dr. Sonakshi Vij
VSE&T
Course Outcomes
After studying this course, the students will be able to:
CO1: Understand machine learning tools and techniques with their applications.
CO2: Apply machine learning techniques for classification and regression.
CO3: Perform feature engineering techniques.
CO4: Design supervised and unsupervised machine learning based solutions for real-world problems.
Syllabus
UNIT I:

Introduction to machine learning- Basic concepts, developing a learning system, Learning Issues, and challenges. Types of machine learning:
Learning associations, supervised, unsupervised, semi-supervised and reinforcement learning, Feature selection Mechanisms, Imbalanced data,
Outlier detection, Applications of machine learning like medical diagnostics, fraud detection, email spam detection

UNIT II:

Supervised Learning- Linear Regression, Multiple Regression, Logistic Regression, Classification; classifier models, K Nearest Neighbour
(KNN), Naive Bayes, Decision Trees, Support Vector Machine (SVM), Random Forest

UNIT III:

Unsupervised Learning- Dimensionality reduction; Clustering; K-Means clustering; C-means clustering; Fuzzy C means clustering, EM Algorithm,
Association Analysis- Association Rules in Large Databases, Apriori algorithm, Markov models: Hidden Markov models (HMMs).

UNIT IV:

Reinforcement learning- Introduction to reinforcement learning, Methods and elements of reinforcement learning, Bellman equation, Markov
decision process (MDP), Q learning, Value function approximation, Temporal difference learning, Concept of neural networks, Deep Q Neural
Network (DQN), Applications of Reinforcement learning.
Targeted Book

Book Title:
Machine Learning
Publication House:
Oxford Publications
Authors:
Dr. S. Sridhar & Dr. M. Vijayalakshmi
What is Machine Learning?
• With the help of Machine Learning, we can develop intelligent
systems that are capable of taking decisions on an autonomous basis.
• These algorithms learn from past instances of data
through statistical analysis and pattern matching.
• Then, based on the learned patterns, they provide us with predicted
results.
• Data is the core backbone of machine learning algorithms.
• By training machine learning algorithms on historical data, we enable
them to generalize and make predictions on new, unseen data.
In Real Life
Data Scientist: Job Description
Challenges & Issues in Machine Learning
1. Poor Quality of Data
• Data plays a significant role in the machine learning process.
• One of the significant issues that machine learning professionals face is the
absence of good-quality data.
• Unclean and noisy data can make the whole process extremely exhausting.
We don’t want our algorithm to make inaccurate or faulty predictions.
• Hence the quality of data is essential to enhance the output.
• Therefore, we need to ensure that data preprocessing, which includes
removing outliers, imputing missing values, and removing unwanted
features, is carried out as carefully as possible, as sketched below.
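
As a hedged illustration of these preprocessing steps, here is a minimal Python sketch; the data, column names, and the 3-sigma outlier rule are invented for illustration and are not part of the slides.

```python
# Minimal preprocessing sketch: impute missing values, remove outliers,
# drop unwanted features. All data and thresholds are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = np.append(rng.normal(40_000, 3_000, 19), 900_000)  # one extreme value
df = pd.DataFrame({"row_id": range(20), "income": income})
df.loc[3, "income"] = np.nan                        # inject a missing value

# 1. Handle missing values (here: impute with the median).
df["income"] = df["income"].fillna(df["income"].median())

# 2. Remove outliers with a simple 3-sigma rule.
mean, std = df["income"].mean(), df["income"].std()
df = df[(df["income"] - mean).abs() <= 3 * std]

# 3. Drop unwanted features.
df = df.drop(columns=["row_id"])
print(df.describe())
```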
Fitting in Data
Challenges & Issues in Machine Learning
2. Underfitting of Training Data
• Underfitting occurs when a model is unable to establish an accurate
relationship between the input and output variables.
• It is like trying to fit into undersized jeans.
• It signifies the model is too simple to establish a precise relationship.
• To overcome this issue (see the sketch after this list):
a) Increase the training time of the model
b) Enhance the complexity of the model
c) Add more features to the data
d) Reduce regularization parameters
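
A minimal sketch of remedies (b) and (c): the hypothetical data below follows a quadratic curve, so a plain linear model underfits, while adding polynomial features lets the same learner capture the relationship. All data and the degree choice are assumptions for illustration.

```python
# Underfitting demo: a linear model on quadratic data, then a richer model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=200)   # quadratic relationship

underfit = LinearRegression().fit(X, y)           # too simple for this data
richer = make_pipeline(PolynomialFeatures(2),     # remedy (c): more features
                       LinearRegression()).fit(X, y)

print("linear R^2:    ", round(underfit.score(X, y), 3))   # near 0
print("polynomial R^2:", round(richer.score(X, y), 3))     # near 1
```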
Challenges & Issues in Machine Learning
3. Overfitting of Training Data
• Overfitting occurs when a machine learning model fits its training data too
closely, including its noise and bias, which negatively affects its performance
on new data.
• It is like trying to fit into oversized jeans.
• Unfortunately, this is one of the significant issues faced by machine learning
professionals.
• It often means that the algorithm was trained with noisy or biased data, which
affects its overall performance.
• Let’s understand this with the help of an example. Let’s consider a model trained to
differentiate between a cat, a rabbit, a dog, and a tiger. The training data contains 1000
cats, 1000 dogs, 1000 tigers, and 4000 rabbits. Then there is a considerable probability
that it will identify a cat as a rabbit. In this example, we had a vast amount of data,
but it was biased; hence the prediction was negatively affected.
Fitting in Data
Challenges & Issues in Machine Learning
We can tackle this issue by (see the sketch after this list):
a) Analyzing the data carefully
b) Using data augmentation techniques
c) Removing outliers from the training set
d) Selecting a model with fewer features
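
To make remedy (d) concrete in spirit, the sketch below contrasts a deliberately over-flexible polynomial model with the same model constrained by ridge (L2) regularization, a standard overfitting control that effectively shrinks feature weights; all data and parameter values are invented for illustration.

```python
# Overfitting demo: a high-degree polynomial fit with and without L2 shrinkage.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.2, size=30)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X_tr, y_tr)
shrunk = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0)).fit(X_tr, y_tr)

print("no regularization, test R^2:", round(overfit.score(X_te, y_te), 3))
print("ridge alpha=1.0,   test R^2:", round(shrunk.score(X_te, y_te), 3))
```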
Challenges & Issues in Machine Learning
4. Machine Learning is a Complex Process
• The machine learning industry is young and is continuously
changing.
• Rapid trial-and-error experiments are constantly being carried out.
• The process is still transforming, and hence there are high chances of
error, which makes the learning complex.
• It includes analyzing the data, removing data bias, training data,
applying complex mathematical calculations, and a lot more.
• Hence it is a really complicated process, which is another big
challenge for machine learning professionals.
Challenges & Issues in Machine Learning
5. Lack of Training Data
• The most important task you need to do in the machine learning
process is to train the data to achieve an accurate output.
• Too little training data will produce inaccurate or overly biased
predictions. Let us understand this with the help of an example.
• Training a machine learning algorithm is similar to teaching a child.
• One day you decided to explain to a child how to distinguish
between an apple and a watermelon.
Challenges & Issues in Machine Learning
5. Lack of Training Data
• You will take an apple and a watermelon and show him the
difference between both based on their color, shape, and taste.
• In this way, soon, he will attain perfection in differentiating between
the two. But on the other hand, a machine-learning algorithm needs a
lot of data to distinguish.
• For complex problems, it may even require millions of data points to
be trained. Therefore, we need to ensure that machine learning
algorithms are trained with sufficient amounts of data.
Challenges & Issues in Machine Learning
6. Slow Implementation
• This is one of the common issues faced by machine learning
professionals.
• Machine learning models can be highly efficient in providing
accurate results, but training them takes a tremendous amount of time.
• Slow programs, data overload, and excessive requirements mean it
usually takes a long time to produce accurate results.
• Further, it requires constant monitoring and maintenance to deliver
the best output.
Challenges & Issues in Machine Learning
7. Imperfections in the Algorithm When Data Grows
• The model may become useless in the future as data grows.
• The best model of the present may become inaccurate in the
future and require further adjustment.
• So you need regular monitoring and maintenance to keep the
algorithm working.
• This is one of the most exhausting issues faced by machine learning
professionals.
In a Nutshell: Challenges & Issues in Machine Learning

• Less Training Data
• Poor Data Quality
• Irrelevant Features
• Fitting (Over/Under)
• Complexity
• Maintenance
• Biased Data
• Deciding how much data is sufficient
Steps in Machine Learning

Data Collection → Data Preparation → Model/Algorithm Selection → Model Training → Model Testing/Evaluation → Hyperparameter Tuning/Model Improvement → Prediction/Using in Real Life
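
The pipeline can be made concrete with a compact scikit-learn sketch; the dataset, model, and parameter grid below are illustrative assumptions rather than anything prescribed by the slides.

```python
# The ML steps end to end on a toy dataset (all choices illustrative).
from sklearn.datasets import load_iris                  # 1. data collection
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier     # 3. model selection
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler       # 2. data preparation

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
# 4. model training, with 6. hyperparameter tuning wrapped around it
search = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7]})
search.fit(X_tr, y_tr)

print("test accuracy:", search.score(X_te, y_te))      # 5. testing/evaluation
print("prediction:", search.predict([[5.0, 3.4, 1.6, 0.4]]))  # 7. real-life use
```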
Data Bias
• Bias in data is an error that occurs when certain elements of a dataset are
overweighted or overrepresented.
• Biased datasets don't accurately represent an ML model's use case, which leads
to skewed outcomes, systematic prejudice, and low accuracy.
• Often, the erroneous result discriminates against a specific group or groups
of people.
• For example, data bias reflects prejudice against age, race, culture, or sexual
orientation.
• In a world where AI systems are increasingly used everywhere, the danger
of bias lies in amplifying discrimination.
Data Bias
• It takes a lot of training data for machine learning models to produce viable
results.
• If you want to perform advanced operations (such as text, image, or video
recognition), you need millions of data points.
• Poor or incomplete data as well as biased data collection & analysis
methods will result in inaccurate predictions because the quality of the
outputs is determined by the quality of the inputs.
Data Bias
Reporting Bias
• Reporting bias occurs when the frequency of events, properties, and/or outcomes
captured in a data set does not accurately reflect their real-world frequency.
• This bias can arise because people tend to focus on documenting circumstances that are
unusual or especially memorable, assuming that the ordinary can "go without saying”.
EXAMPLE:
• A sentiment-analysis model is trained to predict whether book reviews are positive or
negative based on a corpus of user submissions to a popular website.
• The majority of reviews in the training data set reflect extreme opinions (reviewers who
either loved or hated a book), because people were less likely to submit a review of a
book if they did not respond to it strongly.
• As a result, the model is less able to correctly predict sentiment of reviews that use
more subtle language to describe a book.
Data Bias
• Automation bias is a tendency to favor results generated
by automated systems over those generated by non-
automated systems, irrespective of the error rates of each.
EXAMPLE:
• Software engineers working for a sprocket manufacturer
were eager to deploy the new "groundbreaking" model
they trained to identify tooth defects, until the factory
supervisor pointed out that the model's precision and
recall rates were both 15% lower than those of human
inspectors.
Data Bias
Selection Bias
Selection bias occurs if a data set's examples are chosen in a way that is
not reflective of their real-world distribution. Selection bias can take
many different forms:
• Coverage bias: Data is not selected in a representative fashion.
EXAMPLE: A model is trained to predict future sales of a new product based
on phone surveys conducted with a sample of consumers who bought the
product. Consumers who instead opted to buy a competing product were not
surveyed, and as a result, this group of people was not represented in the
training data.
Data Bias
• Non-response bias (or participation bias): Data ends up being unrepresentative due to
participation gaps in the data-collection process.
EXAMPLE: A model is trained to predict future sales of a new product based on phone
surveys conducted with a sample of consumers who bought the product and with a sample of
consumers who bought a competing product. Consumers who bought the competing product
were 80% more likely to refuse to complete the survey, and their data was underrepresented in
the sample.

• Sampling bias: Proper randomization is not used during data collection.
EXAMPLE: A model is trained to predict future sales of a new product based on phone
surveys conducted with a sample of consumers who bought the product and with a sample of
consumers who bought a competing product. Instead of randomly targeting consumers, the
surveyor chose the first 200 consumers that responded to an email, who might have been more
enthusiastic about the product than average purchasers.
Data Bias
Implicit Bias
Implicit bias occurs when assumptions are made based on one's own mental models and personal
experiences that do not necessarily apply more generally.
EXAMPLE: An engineer training a gesture-recognition model uses a head shake as a feature to
indicate a person is communicating the word "no." However, in some regions of the world, a head
shake actually signifies "yes."
A common form of implicit bias is confirmation bias, where model builders unconsciously process
data in ways that affirm preexisting beliefs and hypotheses. In some cases, a model builder may
actually keep training a model until it produces a result that aligns with their original hypothesis; this
is called experimenter's bias.
EXAMPLE: An engineer is building a model that predicts aggressiveness in dogs based on a variety
of features (height, weight, breed, environment). The engineer had an unpleasant encounter with a
hyperactive toy poodle as a child, and ever since has associated the breed with aggression. When the
trained model predicted most toy poodles to be relatively docile, the engineer retrained the model
several more times until it produced a result showing smaller poodles to be more violent.
Feature Selection Mechanism in Machine Learning

• The goal of feature selection techniques in machine learning is to find the best set of
features that allows one to build optimized models of studied phenomena.
• The techniques for feature selection in machine learning can be broadly classified
into the following categories:
• Supervised Techniques: These techniques can be used for labeled data and to
identify the relevant features for increasing the efficiency of supervised models like
classification and regression. For Example- linear regression, decision tree, SVM,
etc.
• Unsupervised Techniques: These techniques can be used for unlabeled data. For
Example- K-Means Clustering, Principal Component Analysis, Hierarchical
Clustering, etc.
• From a taxonomic point of view, these techniques are classified into filter, wrapper,
embedded, and hybrid methods (a wrapper-method sketch follows below; filter
methods are illustrated in the Appendix).
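
To make the wrapper category concrete, here is a minimal sketch using recursive feature elimination (RFE), which repeatedly fits a model and drops the weakest features; the synthetic dataset and all parameter values are assumptions for illustration.

```python
# Wrapper-method sketch: recursive feature elimination around a classifier.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 10 features, of which only 4 are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)
print("selected feature mask:", selector.support_)
print("feature ranking:      ", selector.ranking_)
```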
Feature Engineering Taxonomy
Feature selection, as a dimensionality
reduction technique, aims to choose a small
subset of the relevant features from the
original features by removing irrelevant,
redundant, or noisy features. Feature
selection usually can lead to better learning
performance, higher learning accuracy,
lower computational cost, and better model
interpretability. This section focuses on the
feature selection process and provides a
comprehensive and structured overview of
feature selection types, methodologies, and
techniques, both from the data and
algorithm perspectives.

The aim of feature selection is to maximize relevance and minimize redundancy.
Characteristics of Feature Selection Algorithms

• The purpose of a feature selection algorithm is to identify relevant features
according to a definition of relevance. However, the notion of relevance in machine
learning has not yet been rigorously defined by common agreement. A primary
definition of relevance is the notion of being relevant with respect to an objective.
• There are several considerations in the literature to characterize feature selection
algorithms. In view of these, it is possible to describe this characterization as a search
problem in the hypothesis space as follows:
• Search Organization: general strategy with which the space of hypothesis is
explored.
• Generation of Successors: mechanism by which possible variants (successor
candidates) of the current hypothesis are proposed.
• Evaluation Measure: function by which successor candidates are evaluated,
allowing different hypotheses to be compared to guide the search process.
Imbalanced Data
• A classification data set with skewed class proportions is
called imbalanced.
• Classes that make up a large proportion of the data set are
called majority classes. Those that make up a smaller
proportion are minority classes.
• If you have an imbalanced data set, first try training on the
true distribution.
• If the model works well and generalizes, you're done! If not,
try the following downsampling and upweighting technique.
Imbalanced Data: Downsampling & Upweighting

• An effective way to handle imbalanced data is to downsample
and upweight the majority class.
• Downsampling (in this context) means training on a
disproportionately low subset of the majority class examples.
• Upweighting means adding an example weight to the
downsampled class equal to the factor by which you downsampled:

example weight = original example weight × downsampling factor
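
A hedged sketch of this downsample-and-upweight recipe in Python; the synthetic dataset and the downsampling factor of 10 are illustrative assumptions.

```python
# Downsample the majority class, then upweight it by the same factor.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, weights=[0.95], random_state=0)

factor = 10                                        # downsampling factor
majority, minority = np.where(y == 0)[0], np.where(y == 1)[0]
rng = np.random.default_rng(0)
keep = rng.choice(majority, size=len(majority) // factor, replace=False)
idx = np.concatenate([keep, minority])

# example weight = original example weight (1.0) x downsampling factor
weights = np.where(y[idx] == 0, 1.0 * factor, 1.0)

clf = LogisticRegression().fit(X[idx], y[idx], sample_weight=weights)
print("training set size after downsampling:", len(idx))
```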


Techniques to work with Imbalanced Data in Machine Learning

• Upsampling Minority Class
• Downsampling Majority Class
• Generate Synthetic Data
• Combine Upsampling & Downsampling Techniques
• Balanced Class Weight
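
Two of these techniques are sketched below: a balanced class weight (the model reweights classes inversely to their frequency) and upsampling the minority class with replacement. The synthetic data is an illustrative assumption.

```python
# Balanced class weight and minority-class upsampling (illustrative data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=2_000, weights=[0.9], random_state=1)

# Technique: balanced class weight.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Technique: upsample the minority class with replacement.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, n_samples=int((y == 0).sum()),
                      random_state=1)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print("class counts after upsampling:", np.bincount(y_bal))
```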


Outliers

Outlier detection is the process of identifying data points that lie far
away from the average and, depending on what you are trying to accomplish,
potentially removing or resolving them from the analysis to prevent any
potential skewing.

In descriptive statistics, the interquartile range (IQR) is a measure of
statistical dispersion, i.e., the spread of the data.
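
A minimal IQR-based detection sketch; the sample values and the conventional 1.5x Tukey fences are illustrative choices, not something prescribed by the slides.

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers.
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(f"IQR={iqr}, fences=[{lower}, {upper}], outliers={outliers}")
```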
Natural & Non-Natural Outliers
• The non-natural outliers are those which are caused
by measurement errors, wrong data collection, or wrong
data entry whereas natural outliers could be the use case
of fraudulent transactions in banking data, etc.
• No matter how careful you are during data collection, every data
analyst has felt the frustration of finding outliers.
• Outliers are one of those problems that we come across
almost every time while doing machine learning modeling.
Think

Are noise & outliers the same?
Noise vs Outliers
• Noise is considered to be random error or the variance
in a measured variable.
• The process of noise removal
should be done before outlier
detection.
How to treat Outliers?
• There are several ways to treat outliers in a dataset, depending on the
nature of the outliers and the problem being solved.
• Trimming: It excludes the outlier values from our analysis. By
applying this technique, our data becomes thin when more outliers
are present in the dataset. Its main advantage is its fastest nature.
• Capping: In this technique, we cap the outlier values at a limit, i.e.,
all values above an upper threshold (or below a lower threshold) are
replaced with that threshold value.
How to treat Outliers?
• For example, if you’re working on the income feature, you might find that people
above a certain income level behave similarly to those with a lower income. In this
case, you can cap the income value at a level that keeps that intact and accordingly
treat the outliers.
• Treating outliers as a missing value: By assuming outliers as the missing
observations, treat them accordingly, i.e., same as missing values imputation.
• Discretization: In this technique, by making the groups, we include the outliers in a
particular group and force them to behave in the same manner as those of other points
in that group. This technique is also known as Binning.

Imputation is a technique used for replacing the missing data with some
substitute value to retain most of the data/information of the dataset.
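
The capping and missing-value treatments above can be sketched in a few lines of pandas; the percentile limits and sample values are illustrative assumptions.

```python
# Capping (winsorization) and treating outliers as missing values.
import pandas as pd

s = pd.Series([23, 25, 21, 24, 26, 22, 25, 250])   # 250 is an outlier

# Capping: clip everything outside the 5th-95th percentile to those limits.
lo, hi = s.quantile([0.05, 0.95])
capped = s.clip(lower=lo, upper=hi)

# Treat outliers as missing, then impute with the median of the rest.
as_missing = s.mask((s < lo) | (s > hi))
imputed = as_missing.fillna(as_missing.median())

print(pd.DataFrame({"raw": s, "capped": capped, "imputed": imputed}))
```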
Binning method for data smoothing

To smooth a data set is to create an approximating function that attempts
to capture important patterns in the data, while leaving out noise or
other fine-scale structures/rapid phenomena.
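
A short sketch of equal-frequency (equal-depth) binning with smoothing by bin means; the sample values and the choice of three bins are illustrative.

```python
# Equal-frequency binning, then replace each value with its bin mean.
import numpy as np

data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.array_split(data, 3)             # three equal-depth bins

smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print("bins:    ", [b.tolist() for b in bins])
print("smoothed:", smoothed.round(1))      # each value -> its bin's mean
```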
Numerical for Practice
Question: Consider the sales record of an
organization for the first quarter in various
outlets (lakhs):
89, 32, 21, 8, 12, 13, 11, 65, 96, 54
How can this data be smoothed and reorganized
for better retrieval? How can the outliers be
detected and dealt with?
Types of Machine Learning
Difference between different types of learning
Supervised Learning
• Supervised Learning is the most popular paradigm for performing machine
learning operations. It is widely used for data where there is a
precise mapping between input-output data.
• The dataset, in this case, is labeled, meaning that the
algorithm identifies the features explicitly and carries out
predictions or classification accordingly.
• As the training period progresses, the algorithm is able
to identify the relationships between the two variables such that we
can predict a new outcome.
• The resulting supervised learning algorithms are task-oriented.
• As we provide it with more and more examples, it is able to learn
better, so that it can undertake the task and yield output more
accurately.
Unsupervised Learning
• In the case of an unsupervised learning algorithm, the data is not
explicitly labeled into different classes, that is, there are no labels.
• The model is able to learn from the data by finding implicit patterns.
• Unsupervised Learning algorithms identify the data based on
their densities, structures, similar segments, and other similar
features.
• Unsupervised Learning Algorithms are based on Hebbian Learning.
• Cluster analysis is one of the most widely used techniques in
unsupervised learning.
Reinforcement Learning
• Reinforcement Learning covers a broader area of Artificial Intelligence,
allowing machines to interact with their dynamic environment in
order to reach their goals.
• With this, machines and software agents are able to evaluate the ideal
behavior in a specific context.
• With the help of this reward feedback, agents are able to learn the
behavior and improve it in the longer run. This simple feedback reward
is known as a reinforcement signal.
Reinforcement Learning
• The agent in the environment is required to take actions that are
based on the current state.
• This type of learning is different from Supervised Learning in the
sense that the training data in supervised learning has the output
mapping provided, such that the model is capable of learning
the correct answer.
• Whereas, in the case of reinforcement learning, there is no answer
key provided to the agent when they have to perform a particular
task.
• When there is no training dataset, it learns from its own experience.
Applications of Machine Learning
Fraud Detection using Machine Learning
The code deploys the following infrastructure:
a) An Amazon Simple Storage Service (Amazon S3) bucket containing an example dataset of credit card
transactions.
b) An Amazon SageMaker notebook instance with different ML models that will be trained on the dataset.
c) An AWS Lambda function that processes transactions from the example dataset and invokes the two
Amazon SageMaker endpoints that assign anomaly scores and classification scores to incoming data
points.
d) An Amazon API Gateway REST API that invokes predictions using signed HTTP requests.
e) An Amazon Kinesis Data Firehose delivery stream that loads the processed transactions into another Amazon
S3 bucket for storage.
• The Guidance also provides an example of how to invoke the prediction REST API as part of the Amazon
SageMaker notebook.
• When the transactions have been loaded into Amazon S3, you can use analytics tools and services,
including Amazon QuickSight, for visualization, reporting, ad-hoc queries, and more detailed analysis.
Email Filtering for Spam Detection using ML
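
Since the slides here are figures, a minimal text-classification sketch may help fix the idea: a bag-of-words representation feeding a Naive Bayes classifier. The tiny corpus is invented for illustration; a real filter would train on thousands of labeled emails.

```python
# Spam-filter sketch: bag-of-words features + Multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now",          # spam
    "limited offer click here",      # spam
    "meeting agenda for monday",     # ham
    "lunch tomorrow with the team",  # ham
]
labels = [1, 1, 0, 0]                # 1 = spam, 0 = ham

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)
print(model.predict(["free prize offer", "agenda for the team lunch"]))
```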
Case Study: Supervised Learning: Beyond the Curriculum

• Facial Recognition is one of the most popular applications of
Supervised Learning and, more specifically, Artificial Neural Networks.
• Convolutional Neural Networks (CNNs) are a type of ANN used for
identifying the faces of people.
• These models are able to draw features from the image through
various filters.
• Finally, if there is a high similarity score between the input image and
the image in the database, a positive match is provided.
Case Study: Supervised Learning: Beyond the Curriculum

• Baidu, China’s premier search engine company, has been investing
in facial recognition.
• While it has already installed facial recognition systems in its security
systems, it is now extending this technology to the major airports of
China.
• Baidu will provide the airports with facial recognition technology that
will provide access to the ground crew and the staff.
• Therefore, the passengers do not have to wait in long queues for flight
check-in when they can simply board their flight by scanning their
faces.
Case Study: Unsupervised Learning: Beyond the Curriculum

• One of the most popular unsupervised learning techniques is
clustering. Using clustering, businesses are able to capture
potential customer segments for selling their products.
• Sales companies are able to identify customer
segments that are most likely to use their services.
Companies can evaluate the customer segments and
then decide to sell their product to maximize the
profits.
Case Study: Unsupervised Learning: Beyond the Curriculum

• One such company that is performing brand marketing analytics
using Machine Learning is an Israel-based startup, Optimove.
• The goal of this company is to ingest
and process the customer data in order to make
it accessible to the marketers.
• They take it one step further by providing smart
insights to the marketing team, allowing them to reap
the maximum profit out of their product marketing.
Case Study: Reinforcement Learning: Beyond the Curriculum

• Google’s Active Query Answering (AQA) system makes use of
reinforcement learning.
• It reformulates the questions asked by the user.
• For example, if you ask the AQA bot the question – “What is the birth
date of Nikola Tesla” then the bot would reformulate it into different
questions like “What is the birth year of Nikola Tesla”, “When was Tesla
born?” and “When is Tesla’s birthday”.
• This process of reformulation utilized the traditional
sequence-to-sequence (seq2seq) model, but Google has integrated
Reinforcement Learning into its system to better interact with the
query-based environment.
Case Study: Reinforcement Learning: Beyond the Curriculum

• This is a deviation from the traditional seq2seq model such that
all the tasks are carried out using reinforcement learning and
policy gradient methods.
• That is, for a given question q0, we want to obtain the best
possible answer a*.
• The goal is to maximize the reward: a* = argmax_a R(a | q0).
References: For further reading

1) https://data-flair.training/blogs/machine-learning-tutorial/
2) https://data-flair.training/blogs/types-of-machine-learning-algorithms/
3) https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
4) https://www.statice.ai/post/data-bias-types
5) https://www.section.io/engineering-education/imbalanced-data-in-ml/
6) https://www.analyticsvidhya.com/blog/2021/05/feature-engineering-how-to-detect-and-remove-outliers-with-python-code/
7) https://aws.amazon.com/solutions/implementations/fraud-detection-using-machine-learning/
Appendix
Filter Methods
• Filter methods pick up the intrinsic properties of the features
measured via univariate statistics instead of cross-validation
performance. These methods are faster and less computationally
expensive than wrapper methods. When dealing with high-
dimensional data, it is computationally cheaper to use filter methods.
Example: Information Gain Technique
• Information gain calculates the reduction in entropy from the
transformation of a dataset. It can be used for feature selection by
evaluating the Information gain of each variable in the context of the
target variable.
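
A hedged sketch of this idea with scikit-learn, whose mutual_info_classif estimates the information gain of each feature with respect to the target; the dataset and k are illustrative choices.

```python
# Filter-method sketch: rank features by estimated mutual information.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print("MI scores per feature:", selector.scores_.round(3))
print("selected feature mask:", selector.get_support())
```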
Assignment 1: Date of Submission: 6/4/2023

Consider a real-life dataset (source can be Kaggle, UCI-MLR, etc.) and perform the following operations: (2*5=10 points)
a. Data imputation/ Data cleaning for pre-processing
(including handling of missing values)
b. Outlier detection
c. Identifying & handling data bias
d. Identifying & handling imbalanced data
e. Feature selection (avoiding overfitting & underfitting)
