1 - AML - Manish
Other examples of AI
Machine Translation such as Google Translate
Self Driving Vehicles such as Google’s Waymo
AI Robots such as Sophia and Aibo
Speech Recognition applications like Apple’s Siri or OK Google
Definition
What is Machine Learning?
However, if the problem is complex, we'll likely end up with a long list of rules that
are hard to maintain and scale to other similar problems.
An ML system would be much shorter, easier to maintain, and in many cases, more
accurate.
Traditional approach vs ML
Steps for Spam Detection Using Pattern Recognition
1. Analyze Spam Characteristics:
• Identify common words/phrases (e.g., "4U," "credit card," "free," "amazing").
• Observe patterns in sender names, email bodies, etc.
2. Develop Detection Algorithms:
• Create algorithms to detect identified patterns.
• Flag emails as spam based on pattern matches.
3. Iterate and Refine:
• Test the detection algorithm.
• Continuously refine the algorithm for better accuracy.
spam filter using traditional programming techniques
Since the problem is not trivial, your program will likely become a long list of complex rules that is pretty hard to
maintain.
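The traditional approach can be sketched as a hand-written rule list. This is only an illustration: the function name `is_spam_rule_based` and the `threshold` parameter are assumptions made for the example, and the phrases are the ones listed in the steps above.

```python
# Hand-written rules: every new spam trick needs another entry here.
SPAM_PHRASES = ["4u", "credit card", "free", "amazing"]  # phrases taken from the notes

def is_spam_rule_based(email_text, threshold=2):
    """Flag an email when it contains at least `threshold` known spam phrases.

    `threshold` is a hypothetical tuning knob chosen for this sketch.
    """
    text = email_text.lower()
    hits = sum(1 for phrase in SPAM_PHRASES if phrase in text)
    return hits >= threshold

print(is_spam_rule_based("Amazing offer: free credit card 4U"))
print(is_spam_rule_based("Meeting moved to 3 pm, see agenda attached"))
print(is_spam_rule_based("Amazing deal For U"))  # rewording "4U" slips past the rules
```

The last call shows the maintenance problem: once spammers write "For U" instead of "4U", someone has to notice and add a new rule by hand.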
Traditional approach vs ML
• Machine Learning-Based Spam Filtering
• Automatic Learning:
• Detects frequent patterns in spam versus ham examples.
• Learns which words/phrases are strong spam indicators.
• Results in shorter, easier-to-maintain, and more accurate
programs.
• Adaptability:
• Machine learning models can adapt to new spam tactics
(e.g., "4U" to "For U").
• Reduces the need for manual updates to detection rules.
spam filter using ML
• Ability to remain sensitive (Resilience Against Evolving Tactics):
• Continually learns from new data, making it harder for
spammers to bypass filters.
• Higher Accuracy:
• More likely to identify spam correctly, reducing false
positives and negatives
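A minimal sketch of the learning-based idea: instead of writing rules, count how often each word appears in spam versus ham and derive a score per word. The scoring scheme (smoothed log-odds) and the tiny example emails are assumptions chosen for illustration, not a production spam filter.

```python
from collections import Counter
import math

def train_word_scores(spam_emails, ham_emails):
    """Learn a log-odds score per word from labeled examples (add-one smoothing)."""
    spam_counts = Counter(w for e in spam_emails for w in e.lower().split())
    ham_counts = Counter(w for e in ham_emails for w in e.lower().split())
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    return {
        w: math.log((spam_counts[w] + 1) / spam_total)
        - math.log((ham_counts[w] + 1) / ham_total)
        for w in vocab
    }

def is_spam_learned(email_text, scores):
    """Sum the learned scores; a positive total leans towards spam."""
    return sum(scores.get(w, 0.0) for w in email_text.lower().split()) > 0

# Hypothetical training examples, made up for this sketch.
spam = ["free credit card 4u", "amazing free offer for u", "win free prize"]
ham = ["meeting agenda for monday", "project report attached", "lunch on monday"]
scores = train_word_scores(spam, ham)

print(is_spam_learned("free offer for u", scores))
print(is_spam_learned("project meeting on monday", scores))
```

Because "u" appeared in a labeled spam example, the filter picks up the "For U" variant without any manual rule change; retraining on fresh examples is what keeps it current.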
Structured data is easy to search and analyze because it exists in a predefined
format. A relational database consisting of tables with rows and columns is one of the best
examples of structured data. Structured data generally lives in tables such as Excel files and
Google Docs spreadsheets. SQL (Structured Query Language) is the programming language
used for managing structured data. SQL was developed at IBM in the 1970s and is
mainly used to handle relational databases and data warehouses.
Structured data is highly organized and easy for machines to process. Common
applications of relational databases with structured data include sales transactions, airline
reservation systems, inventory control, and others.
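As a small illustration of why structured data is easy to query, the sketch below stores a few sales transactions in an in-memory SQLite table and aggregates them with SQL. The table name `sales`, its columns, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (product, amount) VALUES (?, ?)",
    [("laptop", 900.0), ("mouse", 25.0), ("laptop", 1100.0)],  # made-up rows
)
# Because every row follows the same schema, one query summarizes the data.
rows = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(rows)
conn.close()
```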
Unstructured Data
• Unstructured files, log files, audio files, and image files all count as
unstructured data. Some organizations have a lot of data available
but do not know how to derive value from it because
the data is raw.
Unstructured data is data that lacks any predefined model or format. It
requires a lot of storage space, and it is hard to keep secure. It
cannot be presented in a data model or schema, which is why managing,
analyzing, or searching unstructured data is hard. It resides in various
formats such as text, images, audio, and video files. It is qualitative
in nature and is sometimes stored in a non-relational (NoSQL) database.
•By Approach:
• Instance-based Learning: Compares new data points to known ones.
• Model-based Learning: Detects patterns and builds a predictive model.
Types of Machine Learning
1. Supervised Learning
In supervised learning, the data is already labeled, which means you know the target variable. Using this method of
learning, systems can predict future outcomes based on past data. It requires that at least an input and output
variable be given to the model for it to be trained.
Below is an example of a supervised learning method. The algorithm is trained using labeled data of dogs and cats.
The trained model predicts whether the new image is that of a cat or a dog.
Some examples of supervised learning include linear regression, logistic regression, support vector machines, Naive Bayes,
and decision tree.
Labelled and unlabeled data
https://www.geeksforgeeks.org/supervised-unsupervised-learning/
Supervised Learning
•Definition:
• Training data includes desired solutions, known as labels.
• The algorithm learns from labeled examples provided by a
"supervisor."
•Process:
• Training: Machine is trained using well-labeled data with correct
answers.
• Prediction: New examples are provided, and the algorithm uses the
learned patterns to predict the correct outcomes.
•Goal:
• To enable the algorithm to generalize from the labeled data and make
accurate predictions on new, unseen data.
Type of prediction
Image Source:https://www.mathworks.com/help/stats/machine-learning-in-matlab.html
Classification
•Purpose:
• Predict outcomes based on data features.
• Example: A bank predicting if a customer will default on a loan.
•Key Components:
• Features: Characteristics of the data (e.g., credit history, number of loans).
• Target: The outcome to be predicted (e.g., loan repayment status).
•Types:
• Binary Classification: Two possible outcomes (e.g., Yes/No, 1/0).
• Multiclass Classification: More than two possible outcomes.
•Common Algorithms:
• Logistic Regression
• Decision Tree Classifier
• K Nearest Neighbor Classifier
• Random Forest Classifier
• Neural Networks
https://medium.com/@betulsamancii/supervised-vs-unsupervised-learning-630e024093bd
Regression
•Purpose:
•Predict continuous values rather than categories.
•Example: Predicting the price of a house based on its features.
•Key Components:
•Features: Attributes of the data (e.g., lot size, number of bedrooms, neighborhood).
•Target: The continuous value to be predicted (e.g., house price).
•Process:
•Train an algorithm to understand how features relate to the target value.
•Use the trained model to predict values for new data.
•Common Algorithms:
•Linear Regression
•Decision Tree Regressor
•K Nearest Neighbor Regressor
•Random Forest Regressor
•Neural Networks
Classification and Regression
• Some algorithms can be used for
both regression and classification
tasks
https://static.javatpoint.com/tutorial/machine-learning/images/regression-vs-classification-in-machine-learning.png
1. Clustering:
• Algorithms:
• K-Means
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Hierarchical Cluster Analysis (HCA)
• Gaussian Mixture Models (GMMs)
• Principal Component Analysis (PCA), a dimensionality reduction method often applied before clustering
• Applications:
• Customer segmentation, pattern recognition, anomaly detection
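To make the K-Means entry concrete, here is a minimal one-dimensional sketch of its two alternating steps (assign points to the nearest centroid, then move each centroid to the mean of its points). The function name `kmeans_1d`, the starting centroids, and the customer-spend numbers are assumptions for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer; no labels are provided.
spend = [10, 12, 11, 95, 100, 105]
centroids, clusters = kmeans_1d(spend, centroids=[0, 50])
print(centroids, clusters)
```

The algorithm separates low spenders from high spenders on its own, which is the customer segmentation use case listed above.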
2. Association Rule Learning:
• Identifies relationships or associations between variables in datasets.
• Useful for uncovering patterns like "if X is purchased, Y is also likely."
• Algorithms:
• Apriori
• Eclat
• FP-Growth Algorithm
• Applications:
• Market basket analysis, recommendation systems, product placement.
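Algorithms such as Apriori rank candidate rules using two standard measures, support and confidence. The sketch below computes both directly for one rule; the helper names and the four baskets are hypothetical, and this is not an implementation of Apriori itself.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transaction list."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Made-up market baskets for illustration.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here the rule "if bread is purchased, butter is also likely" holds in half of all baskets and in two out of the three baskets that contain bread.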
Others category
1. Anomaly Detection:
•Applications:
•Detecting fraudulent credit card transactions.
•Catching manufacturing defects.
•Removing outliers from datasets.
•Training: Mostly normal instances are shown; the system learns to recognize normal behavior and flag
anomalies.
•Tolerance: Can often handle a small percentage of outliers in the training set.
2. Novelty Detection:
•Purpose: Identify new or previously unseen data that deviates from known normal patterns.
•Applications: Often used for detecting new types of anomalies or novel instances that were not present
during training.
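One simple way to "learn normal behavior" is to summarize the normal instances by their mean and standard deviation and flag anything far away. This z-score approach is an assumption picked for brevity; the threshold `max_z=3.0` and the transaction amounts are hypothetical.

```python
import statistics

def fit_normal_profile(values):
    """Learn what 'normal' looks like: the mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def is_anomaly(value, profile, max_z=3.0):
    """Flag values more than `max_z` standard deviations from the mean."""
    mean, std = profile
    return abs(value - mean) / std > max_z

# Made-up card transaction amounts that are considered normal.
normal_amounts = [20, 22, 19, 21, 20, 23, 18, 21]
profile = fit_normal_profile(normal_amounts)

print(is_anomaly(500, profile))
print(is_anomaly(21, profile))
```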
Others
• Dimensionality reduction
• Types: PCA, Kernel PCA etc…
3. Semi-Supervised Learning:
•Definition:
• Combines elements of supervised and unsupervised learning.
• Utilizes both a limited amount of labeled data and a large
amount of unlabeled data.
•Training Process:
• Labeled Data:
• Provides supervision and guidance for the model.
• Helps the algorithm learn the relationships between inputs
and outputs.
• Unlabeled Data:
• Allows the model to identify additional patterns and
structures in the data.
• Enhances the learning process by providing more context
beyond the labeled data.
•Goal:
• Improve model accuracy by leveraging both labeled and unlabeled data.
• Develop a better representation of the data, leading to more accurate predictions.
• Applications:
• Natural Language Processing (NLP): Handling large text corpora where labeling is costly.
• Image Classification: Utilizing vast amounts of unlabeled images to improve classification
performance.
• Medical Diagnosis: Combining limited labeled cases with abundant unlabeled data to enhance
diagnostic models.
• Advantages:
• Reduces the need for extensive labeled datasets, which are often expensive or time-consuming to
obtain.
• Can lead to better performance compared to using only labeled data due to the richer information
from unlabeled data.
• Examples of Techniques:
• Self-Training: Using the model’s own predictions on unlabeled data to iteratively improve its accuracy.
• Co-Training: Using multiple models trained on different features or views of the data to label the
unlabeled data and improve each other's performance.
• Graph-Based Methods: Constructing a graph where nodes represent data points and edges represent
similarities, using this structure to propagate labels from labeled to unlabeled data.
Reinforcement Learning
•Definition:
•A type of machine learning where an algorithm learns through interaction with an environment.
•Receives feedback in the form of rewards or penalties to learn a policy that maximizes cumulative
rewards over time.
•Process:
•Reward Signal: The environment provides feedback based on the quality of actions (rewards for good
actions, penalties for bad actions).
•Policy: The algorithm learns a mapping from states to actions (policy) that maximizes the total reward.
•Goal:
•To develop an optimal policy that leads to the highest possible cumulative reward.
•Applications:
•Recommendation Systems: Algorithms personalize recommendations by learning from user interactions and
feedback.
•Key Concepts: LBH
• Trial-and-Error Learning: The algorithm learns from its experiences, improving its performance over time
through experimentation.
• Exploration vs. Exploitation: Balancing between trying new actions (exploration) and using known actions
that give high rewards (exploitation).
• Reward Function: The design of the reward function influences the behavior and learning efficiency of the
algorithm.
• Value Function: Estimates the expected reward for a given state or action to guide decision-making.
•Challenges:
• Designing the Reward Function: Crafting an effective reward function that aligns with the desired outcome
can be complex.
• Scalability: Reinforcement learning can be computationally intensive and may require extensive training time.
• Exploration: Ensuring adequate exploration of the action space to discover optimal strategies can be
challenging.
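The exploration vs. exploitation trade-off can be sketched with an epsilon-greedy agent. This is a deliberately simplified setting: there is a single state, each action returns a fixed reward, and the function name, `epsilon`, and the reward values are all assumptions for illustration.

```python
import random

def epsilon_greedy(rewards, steps=200, epsilon=0.1, seed=0):
    """Learn which action pays best by trial and error."""
    rng = random.Random(seed)
    n = len(rewards)
    values = [0.0] * n  # value estimate per action
    counts = [0] * n
    total = 0.0
    for step in range(steps):
        if step < n:
            action = step  # try every action once first
        elif rng.random() < epsilon:
            action = rng.randrange(n)  # exploration: pick a random action
        else:
            action = max(range(n), key=lambda a: values[a])  # exploitation
        reward = rewards[action]  # reward signal from the (toy) environment
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]  # running mean
        total += reward
    return values, total

# Hypothetical fixed reward for each of three actions.
values, total = epsilon_greedy([0.2, 0.8, 0.5])
best_action = values.index(max(values))
print(best_action, values)
```

In a real problem the rewards are noisy and depend on the state, so the value estimates take much longer to settle, which is where the scalability and exploration challenges listed above come from.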
Type of ML
Another criterion used to classify Machine
Learning systems is whether or not the
system can learn incrementally from a
stream of incoming data.
• Online vs Batch ML
• Batch Learning
• Definition:
• Machine learning where the model is trained offline on the entire dataset in one
go.
• Model updates occur in batch mode after processing all training data at once.
• Process:
• Training: Model is trained once on a fixed dataset.
• Prediction: After training, the model is used to make predictions without further
updates.
• Suitability:
• Best for static datasets where data does not change over time.
• Useful when training on large datasets is too computationally expensive to process
in real-time.
•Advantages: OT,CR
•Efficient for large datasets when offline training is feasible.
•Training can be done in parallel, utilizing significant computational resources.
•Disadvantages: A,RT
•Adaptability: Model does not adapt to changes in the data distribution over time.
•Re-training: New models must be trained from scratch if data distribution changes,
which can be time-consuming and computationally expensive.
•Applications:
•Natural Language Processing (NLP)
•Computer Vision
•Recommendation Systems
•Quality:
•Depends on the quality and quantity of training data and choice of algorithms.
https://www.linkedin.com/pulse/types-machine-learning-techniques-training-method-based-sharma
• Online Learning
• Definition:
• Machine learning where the model is updated incrementally as new data arrives in real-time.
• Training occurs on small subsets of data, with continuous updates as new data is received.
• Process:
• Training: Model is trained on a small subset of data, updated continuously with new data.
• Adaptation: Model adapts to changes in data distribution over time.
• Suitability:
• Ideal for streaming data and scenarios where data distribution changes frequently.
• Useful for applications with evolving data, such as financial market analysis, real-time
recommendations, and customer behavior analysis.
•Advantages: A,E
•Adaptability: Model can learn and adapt to changes (dynamic ) in data distribution over time.
•Efficiency: Computationally efficient as only a subset of data is processed at any time.
•Disadvantages: AC,DM
•Algorithm Choice: Performance depends on the choice of online learning algorithms and
their update rules.
•Data Management: Requires efficient handling of data streams and updates.
•Applications:
•Financial Market Analysis
•Customer Behavior Analysis
•Real-time Recommendation Systems
•Quality:
•Depends on the size of the data subset used for training, the rate of new data arrival, and the
algorithm's effectiveness.
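The contrast with batch learning can be sketched with a model that is updated one example at a time and never stores the data it has seen. The one-parameter model, the learning rate, and the simulated stream (where y is always 3 times x) are assumptions made for this example.

```python
def online_update(w, x, y, learning_rate=0.5):
    """One incremental step for the model y = w * x, using a single example."""
    error = y - w * x
    return w + learning_rate * error * x

w = 0.0  # initial guess for the single model parameter
for _ in range(10):  # the "stream" is replayed here only for brevity
    for i in range(1, 11):
        x = i / 10
        y = 3 * x  # hypothetical relation hidden in the stream
        w = online_update(w, x, y)  # the example can be discarded afterwards

print(round(w, 3))
```

If the relation in the stream drifted later on, the same update rule would keep pulling `w` towards the new value, which is the adaptability advantage described above.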
Instance-Based Versus Model-Based Learning
• One more way to categorize Machine Learning systems is by how they
generalize.
• Most Machine Learning tasks are about making predictions.
• This means that given a number of training examples, the system
needs to be able to generalize to examples it has never seen before.
• Having a good performance measure on the training data is good, but
insufficient; the true goal is to perform well on new instances.
Instance-Based
•Definition:
•A machine learning approach where the model learns from
examples by storing them and then making predictions
based on the similarity of new examples to these stored
instances.
•Process:
•Basic Approach: The filter flags emails that are identical to known spam emails.
•Improved Approach: The filter flags emails that are similar to known spam emails using a
similarity measure.
•Similarity Measure: One simple measure is to count the number of common words between the new email and
known spam emails.
•Classification Rule: If the new email shares many words with known spam, it is flagged as spam.
•Characteristics:
•Lazy Learning: The model does not learn a general rule but instead stores and compares
individual examples.
•Computational Cost: The cost is incurred at prediction time, as similarity comparisons are made
on-the-fly.
•No Explicit Model: The system relies on stored instances and does not create a general model
during training.
•Advantages: (SF)
• Simplicity: Easy to implement and understand.
• Flexibility: Can handle noisy or irrelevant features as the prediction relies on similar instances
rather than a fixed model.
•Disadvantages: (S,G)
• Scalability: May become inefficient with large datasets due to the need to compare each new
instance to many stored instances.
• Prediction for Outliers: May struggle with instances significantly different from stored
examples.
• No Generalization: The system only generalizes based on the similarity of stored instances
and does not create a generalized model.
•Figure Example: A new instance is classified based on the majority class of the most similar
instances (e.g., a new shape classified as a triangle because most similar shapes are triangles).
• Applications:
• Classification: Assigning class labels based on the majority class of similar
instances (e.g., k-Nearest Neighbors).
• Regression: Predicting values based on the average or weighted average of
similar instances.
• Anomaly Detection: Identifying outliers by comparing new instances to
stored normal instances.
• Examples:
• k-Nearest Neighbors (k-NN): Classifies new instances based on the majority
vote of the k closest stored instances.
• Case-Based Reasoning: Uses past cases or experiences to solve new
problems.
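A minimal k-Nearest Neighbors sketch shows the instance-based idea: nothing is learned up front, and all the work happens at prediction time by comparing the query with stored instances. The function name `knn_predict` and the 2-D points are hypothetical.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest stored instances."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    neighbours = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Stored instances (made-up feature vectors with their labels).
train = [
    ((1.0, 1.0), "triangle"), ((1.2, 0.8), "triangle"), ((0.9, 1.1), "triangle"),
    ((5.0, 5.0), "square"), ((5.2, 4.9), "square"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (4.8, 5.1)))
```

Every prediction scans all stored instances, which illustrates the scalability concern noted in the disadvantages.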
Model Based
•Definition:
•Machine learning approach where a mathematical model is explicitly learned from training data to map inputs to
outputs.
•The learned model is then used to make predictions on new data.
•Process:
•Characteristics:
•Generalization: The model generalizes from the training data, allowing it to make predictions for inputs that
differ from those in the training set.
•Model Definition: Involves defining and learning a model structure (e.g., linear regression, decision trees,
neural networks)
•Advantages: A,E,I,(ou)
•Adaptability: Can make predictions for new inputs that are not present in the training set.
•Interpretability: Some models (e.g., linear regression, decision trees) offer clear insights into the relationship
between inputs and outputs.
•Efficiency: Once trained, predictions can be made quickly without needing to reference the entire training
dataset.
•Disadvantages: COT
•Complexity: Models may not always capture complex, non-linear relationships accurately.
•Overfitting: Risk of overfitting to the training data, especially with complex models and small datasets.
•Training Time: Some models (e.g., neural networks) can be computationally expensive to train.
Applications:
•Regression: Predicting continuous values (e.g., house prices, temperature).
•Classification: Assigning categorical labels (e.g., spam detection, image classification).
•Time Series Forecasting: Predicting future values based on past data (e.g., stock prices, weather forecasting).
• Types of Models:
• Decision Trees: Partition the input space into regions based on feature
values, making predictions based on majority class or average value in each
region.
• Support Vector Machines: Classify data by finding the hyperplane that best
separates different classes.
Model-based learning
Model based
• Example-
• Suppose you want to know if money makes people happy, so you download the Better Life Index data from the OECD’s
website and stats about gross domestic product (GDP) per capita from the IMF’s website. Then you join the tables and sort
by GDP per capita.
• Although the data is noisy (i.e., partly random), it
looks like life satisfaction goes up more or less
linearly as the country’s GDP per capita increases.
• So you decide to model life satisfaction as a linear
function of GDP per capita. A few possible linear models
• This step is called model selection:
• linear model of life satisfaction with just one
attribute, GDP per capita.
• life_satisfaction = θ0 + θ1 × GDP_per_capita
• This model has two model parameters, θ0 and θ1.
By tweaking these parameters, you can make your
model represent any linear function, as shown
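Once the linear model is selected, training means finding the θ0 and θ1 that fit the data best. The sketch below uses the closed-form least-squares solution for a single attribute; the numbers are made up for illustration and are not the OECD or IMF figures.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    theta1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

# Hypothetical data: GDP per capita in thousands of USD, life satisfaction score.
gdp_per_capita = [10, 20, 30, 40, 50]
life_satisfaction = [5.0, 5.5, 6.0, 6.5, 7.0]

theta0, theta1 = fit_line(gdp_per_capita, life_satisfaction)
predicted = theta0 + theta1 * 35  # prediction for a country not in the data
print(theta0, theta1, predicted)
```

The last line is the payoff of model-based learning: after training, a prediction needs only the two parameters, not the training data.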
• With enough data, even a mediocre algorithm can still solve the problem.
• This is termed the unreasonable effectiveness of data.
• With a huge dataset, the choice of algorithm matters much less.
• However, in real practice we rarely have a huge dataset.
• Labelling data is also very time-consuming.
Challenges……
• Non-representative Training Data
• Initial model (i.e. wit
• Now you have more data: A new model
• Example: collecting data on who will win the cricket World Cup.
• Asking only people in India is not a good representation.
• Gather sufficient data from various countries.
• Sampling noise: a sample that is too small can be non-representative purely by chance.
• Sampling bias: even a huge dataset is non-representative if it is not properly sampled, e.g., the
survey is meant to cover 20 countries but every respondent is Indian.
• Poor-Quality Data
• Data wrangling: quality of data, missing values, unstructured data
• About 60% of the effort goes into data cleaning
• Garbage in, garbage out
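Irrelevant Features
• Some columns contribute nothing to the prediction.
• They increase the cost of computation.
• Example: predicting whether a person will come or not from (age, weight, height, location).
• Location makes no contribution.
• Weight and height can be combined into a single feature: BMI.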
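The feature clean-up in this example can be sketched as a small transformation: drop the column that does not contribute and merge weight and height into BMI (weight in kg divided by height in metres squared). The field names and the sample record are hypothetical.

```python
def to_features(person):
    """Drop the irrelevant column and merge weight and height into BMI."""
    bmi = person["weight_kg"] / person["height_m"] ** 2
    return {"age": person["age"], "bmi": round(bmi, 1)}

# Made-up record; "location" is assumed to carry no signal for this task.
person = {"age": 30, "weight_kg": 72.0, "height_m": 1.80, "location": "Pune"}
features = to_features(person)
print(features)
```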
Overfitting the Training Data
• Very important: overgeneralization in humans and machines. Just as humans
may overgeneralize based on limited experiences (e.g., assuming all taxi
drivers in a foreign country are thieves after one bad experience),
machines can also overgeneralize if not carefully trained.
It is common to use 80% of the data for training and hold out 20% for testing. However, this depends on the size of the
dataset: if it contains 10 million instances, then holding out 1% means your test set will contain 100,000 instances: that’s
probably more than enough to get a good estimate of the generalization error.
Testing and Validation
• The only way to know how well a model will generalize to new cases is to actually try it out on new cases.
• One way to do that is to put your model in production and monitor how well it performs.
• This works well, but if your model is horribly bad, your users will complain—not the best idea.
• A better option is to split your data into two sets: the training set and the test set.
• As these names imply, you train your model using the training set, and you test it using the test set.
• The error rate on new cases is called the generalization error (or out-of-sample error),
and by evaluating your model on the test set, you get an estimate of this error.
• This value tells you how well your model will perform on instances it has never seen before.
• If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization
error is high, it means that your model is overfitting the training data
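The train/test procedure can be sketched as follows: shuffle the data, hold out 20% as a test set, fit on the rest, and compare the two error rates. The helper names, the seed, the toy dataset, and the threshold "model" are all assumptions for illustration.

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and hold out a `test_ratio` share for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(predict, examples):
    """Share of examples the model gets wrong."""
    return sum(1 for x, y in examples if predict(x) != y) / len(examples)

# Toy dataset: the label is True when x is at least 50 (hypothetical rule).
data = [(x, x >= 50) for x in range(100)]
train, test = train_test_split(data)

# "Training": pick the smallest positive example seen in the training set.
threshold = min(x for x, label in train if label)

def model(x):
    return x >= threshold

training_error = error_rate(model, train)
test_error = error_rate(model, test)  # estimate of the generalization error
print(training_error, test_error)
```

The training error is zero by construction, so only the test error says anything about unseen cases; a large gap between the two would point to overfitting.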
Hyperparameters Tuning and Model Selection
Machine Learning Development Lifecycle
(MLDLC)
• https://medium.com/@pp1222001/decoding-the-machine-learning-development-lifecycle-mldlc-4133e05ab8c7
How are AI and ML connected?
• While AI and ML are not quite the same thing, they are closely connected. The
simplest way to understand how AI and ML relate to each other is:
• While artificial intelligence encompasses the idea of a machine that can mimic
human intelligence, machine learning does not. Machine learning aims to teach a
machine how to perform a specific task and provide accurate results by
identifying patterns.
• Let’s say you ask your Google Nest device, “How long is my commute today?” In
this case, you ask a machine a question and receive an answer about the estimated
time it will take you to drive to your office. Here, the overall goal is for the device
to perform a task successfully—a task that you would generally have to do
yourself in a real-world environment (for example, research your commute time).
• In the context of this example, the goal of using ML in the overall system is not to
enable it to perform a task. For instance, you might train algorithms to analyze
live transit and traffic data to forecast the volume and density of traffic flow.
However, the scope is limited to identifying patterns, how accurate the prediction
was, and learning from the data to maximize performance for that specific task.
Benefits of using AI and ML together
• AI and ML bring powerful benefits to organizations of all shapes and
sizes, with new possibilities constantly emerging. In particular, as the
amount of data grows in size and complexity, automated and
intelligent systems are becoming vital to helping companies automate
tasks, unlock value, and generate actionable insights to achieve better
outcomes.
• Here are some of the business benefits of using artificial intelligence
and machine learning:
Applications of AI and ML
• Manufacturing
Production machine monitoring, predictive maintenance, IoT analytics, and operational efficiency.
• Financial services
Risk assessment and analysis, fraud detection, automated trading, and service processing optimization.
• Telecommunications
Intelligent networks and network optimization, predictive maintenance, business process automation,
upgrade planning, and capacity forecasting.