Unit 1 Notes_FML
UNIT I:
Introduction to machine learning- Basic concepts, developing a learning system, Learning Issues, and challenges. Types of machine learning:
Learning associations, supervised, unsupervised, semi-supervised and reinforcement learning, Feature selection Mechanisms, Imbalanced data,
Outlier detection, Applications of machine learning like medical diagnostics, fraud detection, email spam detection
UNIT II:
Supervised Learning- Linear Regression, Multiple Regression, Logistic Regression, Classification; classifier models, K Nearest Neighbour
(KNN), Naive Bayes, Decision Trees, Support Vector Machine (SVM), Random Forest
UNIT III:
Unsupervised Learning- Dimensionality reduction; Clustering; K-Means clustering; C-means clustering; Fuzzy C means clustering, EM Algorithm,
Association Analysis- Association Rules in Large Databases, Apriori algorithm, Markov models: Hidden Markov models (HMMs).
UNIT IV:
Reinforcement learning- Introduction to reinforcement learning, Methods and elements of reinforcement learning, Bellman equation, Markov
decision process (MDP), Q learning, Value function approximation, Temporal difference learning, Concept of neural networks, Deep Q Neural
Network (DQN), Applications of Reinforcement learning.
Targeted Book
Book Title:
Machine Learning
Publication House:
Oxford Publications
Authors:
Dr. S. Sridhar & Dr. M. Vijayalakshmi
What is Machine Learning?
• With the help of Machine Learning, we can develop intelligent
systems that are capable of taking decisions on an autonomous basis.
• These algorithms learn from the past instances of data
through statistical analysis and pattern matching.
• Then, based on what it has learned, the model provides predicted
results for new data.
• Data is the core backbone of machine learning algorithms.
• With the help of historical data, we train these machine learning
algorithms to make predictions on new, unseen data.
In Real Life
Data Scientist: Job Description
Challenges & Issues in Machine Learning
1. Poor Quality of Data
• Data plays a significant role in the machine learning process.
• One of the significant issues that machine learning professionals face is the
absence of good-quality data.
• Unclean and noisy data can make the whole process extremely exhausting.
We don’t want our algorithm to make inaccurate or faulty predictions.
• Hence the quality of data is essential to enhance the output.
• Therefore, we need to ensure that data preprocessing, which includes
removing outliers, handling missing values, and removing unwanted
features, is done carefully and thoroughly (see the sketch below).
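For instance, a minimal preprocessing sketch in pandas might look like the following; the tiny DataFrame and its column names are invented purely for illustration.

# A minimal preprocessing sketch using pandas; the frame and its
# column names are made up purely for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [25, 31, 31, np.nan, 45],
    "target": [0, 1, 1, 0, np.nan],
})

df = df.drop_duplicates()                         # remove noisy duplicate rows
df = df.dropna(subset=["target"])                 # drop rows missing the label
df["age"] = df["age"].fillna(df["age"].median())  # fill a missing feature value
df = df.drop(columns=["customer_id"])             # remove an unwanted feature
print(df)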
2. Underfitting of Training Data
• Underfitting occurs when the model is unable to establish an accurate relationship
between the input and output variables.
• It is like trying to fit into undersized jeans.
• It signifies that the model is too simple to establish a precise relationship.
• To overcome this issue (see the sketch below):
a) Increase the training time of the model
b) Enhance the complexity of the model
c) Add more features to the data
d) Reduce the regularization parameters
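As a rough sketch of points (b) and (c), the example below fits a plain linear model to synthetic nonlinear data (it underfits) and then adds polynomial features to increase model complexity; the data and parameter choices are illustrative only.

# Illustrating underfitting: a linear model on nonlinear data vs. the
# same model after adding polynomial features (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.2, size=200)   # quadratic relationship

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("Underfit linear model R^2:", round(linear.score(X, y), 3))   # low
print("With polynomial features R^2:", round(poly.score(X, y), 3))  # close to 1.0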
3. Overfitting of Training Data
• Overfitting occurs when a model fits its training data too closely, including its noise
and bias, which negatively affects its performance on new data.
• It is like trying to fit into oversized jeans.
• Unfortunately, this is one of the significant issues faced by machine learning
professionals.
• Often this happens when the algorithm is trained on noisy or biased data, which
affects its overall performance.
• Let’s understand this with the help of an example. Let’s consider a model trained to
differentiate between a cat, a rabbit, a dog, and a tiger. The training data contains 1000
cats, 1000 dogs, 1000 tigers, and 4000 Rabbits. Then there is a considerable probability
that it will identify the cat as a rabbit. In this example, we had a vast amount of data,
but it was biased; hence the prediction was negatively affected.
We can tackle this issue by (see the sketch below):
a) Analyzing the data thoroughly before training
b) Using data augmentation techniques
c) Removing outliers from the training set
d) Selecting a model with fewer features
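For point (c), one common (though not the only) way to remove outliers from a training set is the interquartile-range rule; the helper function and the tiny frame below are hypothetical examples.

# IQR-based outlier removal for one numeric training-set column.
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

# Example with a hypothetical training frame.
train = pd.DataFrame({"income": [32, 35, 29, 41, 38, 400]})  # 400 is an outlier
clean = remove_iqr_outliers(train, "income")
print(clean)  # the row with income=400 is dropped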
4. Machine Learning is a Complex Process
• The machine learning industry is young and is continuously
changing.
• Rapid trial-and-error experiments are constantly being carried out.
• The process keeps changing, so there is a high chance of error, which
makes learning complex.
• It includes analyzing the data, removing data bias, training data,
applying complex mathematical calculations, and a lot more.
• Hence it is a really complicated process which is another big
challenge for Machine learning professionals.
5. Lack of Training Data
• The most important task in the machine learning process is to train the
model on enough data to achieve accurate output.
• Too little training data will produce inaccurate or overly biased
predictions. Let us understand this with the help of an example.
• Training a machine learning algorithm is similar to teaching a child.
• One day you decided to explain to a child how to distinguish
between an apple and a watermelon.
• You will take an apple and a watermelon and show him the
difference between both based on their color, shape, and taste.
• In this way, the child will soon learn to differentiate between the two.
A machine-learning algorithm, on the other hand, needs a lot of data to
make the same distinction.
• For complex problems, it may even require millions of examples for
training. Therefore, we need to ensure that machine learning algorithms
are trained with sufficient amounts of data.
6. Slow Implementation
• This is one of the common issues faced by machine learning
professionals.
• Machine learning models can provide highly accurate results, but training
them takes a tremendous amount of time.
• Slow programs, data overload, and excessive requirements usually
take a lot of time to provide accurate results.
• Further, it requires constant monitoring and maintenance to deliver
the best output.
7. Imperfections in the Algorithm When Data Grows
• The model may become useless in the future as data grows.
• The best model of the present may become inaccurate in the future and
require retraining or adjustment.
• So you need regular monitoring and maintenance to keep the
algorithm working.
• This is one of the most exhausting issues faced by machine learning
professionals.
In a Nutshell: Challenges & Issues in Machine Learning
• Complexity
• Maintenance
• Biased Data
• Deciding how much data is sufficient
Steps in Machine Learning
Data Collection → Data Preparation → Model/Algorithm Selection → Model Training →
Model Testing/Evaluation → Hyperparameter Tuning/Model Improvement →
Prediction/Using the Model in Real Life (Apply Model)
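A compact, illustrative sketch of these steps using scikit-learn and its built-in Iris dataset; the choice of KNN and the small parameter grid are examples, not recommendations.

# The steps in miniature: collect, prepare, select a model, train,
# evaluate, tune hyperparameters, then predict on new data.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Data collection
X, y = load_iris(return_X_y=True)

# 2. Data preparation (train/test split; scaling happens inside the pipeline)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Model/algorithm selection
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# 4-6. Training, evaluation, and hyperparameter tuning via cross-validation
grid = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)
print("Test accuracy:", grid.score(X_test, y_test))

# 7. Prediction / using the model in real life
print("Prediction for a new flower:", grid.predict([[5.1, 3.5, 1.4, 0.2]]))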
Data Bias
• Bias in data is an error that occurs when certain elements of a dataset are
overweighted or overrepresented.
• Biased datasets do not accurately represent a model's use case, which leads
to skewed outcomes, systematic prejudice, and low accuracy.
• Often, the erroneous result discriminates against a specific group or groups
of people.
• For example, data bias reflects prejudice against age, race, culture, or sexual
orientation.
• In a world where AI systems are increasingly used everywhere, the danger
of bias lies in amplifying discrimination.
• It takes a lot of training data for machine learning models to produce viable
results.
• If you want to perform advanced operations (such as text, image, or video
recognition), you need millions of data points.
• Poor or incomplete data as well as biased data collection & analysis
methods will result in inaccurate predictions because the quality of the
outputs is determined by the quality of the inputs.
Reporting Bias
• Reporting bias occurs when the frequency of events, properties, and/or outcomes
captured in a data set does not accurately reflect their real-world frequency.
• This bias can arise because people tend to focus on documenting circumstances that are
unusual or especially memorable, assuming that the ordinary can "go without saying".
EXAMPLE:
• A sentiment-analysis model is trained to predict whether book reviews are positive or
negative based on a corpus of user submissions to a popular website.
• The majority of reviews in the training data set reflect extreme opinions (reviewers who
either loved or hated a book), because people were less likely to submit a review of a
book if they did not respond to it strongly.
• As a result, the model is less able to correctly predict sentiment of reviews that use
more subtle language to describe a book.
Automation Bias
• Automation bias is a tendency to favor results generated
by automated systems over those generated by non-
automated systems, irrespective of the error rates of each.
EXAMPLE:
• Software engineers working for a sprocket manufacturer
were eager to deploy the new "groundbreaking" model
they trained to identify tooth defects, until the factory
supervisor pointed out that the model's precision and
recall rates were both 15% lower than those of human
inspectors.
Selection Bias
Selection bias occurs if a data set's examples are chosen in a
way that is not reflective of their real-world distribution.
Selection bias can take many different forms:
Coverage bias: Data is not selected in a representative
fashion.
Feature Selection
• The goal of feature selection techniques in machine learning is to find the best set of
features that allows one to build optimized models of the studied phenomena.
• The techniques for feature selection in machine learning can be broadly classified
into the following categories:
• Supervised Techniques: These techniques can be used for labeled data and to
identify the relevant features for increasing the efficiency of supervised models like
classification and regression. For Example- linear regression, decision tree, SVM,
etc.
• Unsupervised Techniques: These techniques can be used for unlabeled data. For
Example- K-Means Clustering, Principal Component Analysis, Hierarchical
Clustering, etc.
• From a taxonomic point of view, these techniques are classified into filter, wrapper,
embedded, and hybrid methods.
Feature Engineering Taxonomy
Feature selection, as a dimensionality reduction technique, aims to choose a small
subset of relevant features from the original features by removing irrelevant,
redundant, or noisy features. Feature selection usually leads to better learning
performance, higher learning accuracy, lower computational cost, and better model
interpretability. This section focuses on the feature selection process and provides
a structured overview of feature selection types, methodologies, and techniques from
both the data and algorithm perspectives.
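To make the taxonomy concrete, the sketch below contrasts a supervised wrapper method (RFE) with an unsupervised technique (PCA) on scikit-learn's built-in breast-cancer data; the dataset and parameter values are for illustration only.

# A supervised wrapper method (RFE) next to an unsupervised technique (PCA).
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# Wrapper method: repeatedly fit a model and drop the weakest features.
rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=5).fit(X, y)
print("RFE kept:", list(data.feature_names[rfe.support_]))

# Unsupervised technique: project onto directions of maximal variance (no labels used).
X_2d = PCA(n_components=2).fit_transform(X)
print("PCA output shape:", X_2d.shape)  # (569, 2)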
Imputation is a technique used for replacing the missing data with some
substitute value to retain most of the data/information of the dataset.
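A minimal sketch of mean imputation with scikit-learn's SimpleImputer; the small array is made up for illustration.

# Replace missing values (NaN) with the column mean so rows are not lost.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50000.0],
              [np.nan, 62000.0],
              [31.0, np.nan]])

imputer = SimpleImputer(strategy="mean")   # "median" or "most_frequent" also common
X_filled = imputer.fit_transform(X)
print(X_filled)   # each NaN is replaced by its column's mean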
Binning method for data smoothing: sorted values are grouped into bins, and each value
is replaced by a bin statistic (such as the bin mean, median, or boundary) to smooth
out noise.
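A short sketch of smoothing by bin means using pandas; the values and the number of bins are illustrative.

# Smoothing by bin means: partition the sorted values into equal-frequency
# bins, then replace every value with the mean of its bin.
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

bins = pd.qcut(prices, q=3)                        # 3 equal-frequency bins
smoothed = prices.groupby(bins).transform("mean")  # replace values by bin means
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]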
Appendix
Filter Methods
• Filter methods pick up the intrinsic properties of the features, measured
via univariate statistics, instead of cross-validation performance. These
methods are faster and less computationally expensive than wrapper methods.
When dealing with high-dimensional data, it is computationally cheaper to
use filter methods.
Example: Information Gain Technique
• Information gain calculates the reduction in entropy from the
transformation of a dataset. It can be used for feature selection by
evaluating the Information gain of each variable in the context of the
target variable.
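As an illustrative sketch, scikit-learn's mutual_info_classif scores features by mutual information, an information-gain-style criterion, and can drive a simple filter selection on the built-in Iris data:

# Filter-method feature selection: score each feature by mutual information
# with the target, then keep the top k features.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

data = load_iris()
X, y = data.data, data.target

selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
for name, score in zip(data.feature_names, selector.scores_):
    print(f"{name}: {score:.3f}")
print("Selected:", [data.feature_names[i] for i in selector.get_support(indices=True)])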
Assignment 1: Date of Submission: 6/4/2023