AIMl TA2
Splitting your dataset only once into training and testing sets can give biased or unstable results, depending on how the split happens to fall.
This is where Cross-Validation (CV) comes in: it is a technique that gives a more robust and reliable estimate of how well your model generalizes.
What is Cross-Validation?
Cross-validation is a resampling technique that systematically splits your dataset into multiple training and testing sets.
The most common type is K-Fold Cross-Validation.
Here's how it works:
- Split the dataset into k (roughly) equal-sized folds (subsets).
- For each of the k iterations:
  - Use k-1 folds for training.
  - Use the remaining 1 fold for testing.
- Repeat this process k times, with a different fold used for testing each time.
- Average the results (accuracy, precision, etc.) to get a final performance estimate.
Why It Helps (Especially with Small Datasets)
- Less bias: Every data point is used for testing exactly once and for training k-1 times.
- More reliable evaluation: Reduces the effect of randomness from a single train-test split.
- Better generalization: Helps choose a model that performs well across different splits.
Example:
If you have 100 data points and use 5-Fold CV:
- Each fold will have 20 data points.
- The model trains on 80 and tests on 20, repeated 5 times.
- You get 5 performance scores; average them for the final evaluation.
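The 100-point, 5-fold example can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names `k_fold_indices` and `cross_validate` are made up for this sketch, and the scorer passed in at the end is only a placeholder standing in for "train a model on the training indices and score it on the test indices".

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds take one extra sample when k does not divide n.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop

def cross_validate(n_samples, k, score_fn):
    """Average score_fn(train_idx, test_idx) over the k folds."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores

# 100 data points, 5 folds: every split is 80 train / 20 test.
splits = list(k_fold_indices(100, 5))
fold_sizes = [(len(train), len(test)) for train, test in splits]
print(fold_sizes)  # [(80, 20), (80, 20), (80, 20), (80, 20), (80, 20)]

# Placeholder scorer (an assumption): replace with real training and evaluation.
mean_score, scores = cross_validate(100, 5, lambda train, test: len(test) / 100)
```

Each index lands in exactly one test fold, which is why every data point is tested once and trained on k-1 times.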
What is Machine Learning (ML)?
ML is a method where computers learn patterns from data and make predictions or decisions without being explicitly programmed for the task.
High-Level Steps in Machine Learning
Step, Description
Data Collection, Gather raw data
Preprocessing, Clean and format data
Splitting, Divide into training/testing sets
Choose Model, Select an appropriate algorithm
Training, Fit the model to training data
Evaluation, Test performance on unseen data
Prediction, Use model to predict new data
Types of Machine Learning
- Supervised Learning → Learn from labeled data (e.g., classification, regression)
- Unsupervised Learning → Find patterns in unlabeled data (e.g., clustering)
- Semi-supervised Learning → Mix of labeled and unlabeled data
- Reinforcement Learning → Learn by interacting with an environment (e.g., games, robotics)
Applications of Machine Learning
- Email: Spam detection
- Social Media: Personalized recommendations, content ranking
- E-commerce: Product recommendations, dynamic pricing
- Banking/Finance: Fraud detection, credit scoring
- Healthcare: Disease prediction, medical imaging analysis
- Automotive: Self-driving cars, traffic prediction
- Entertainment: Movie/music recommendations (Netflix, Spotify)
- NLP: Chatbots, language translation, sentiment analysis
- Manufacturing: Predictive maintenance, quality control
- Weather: Forecasting, disaster prediction
- Computer Vision: Face recognition, object detection
Key Machine Learning Terminology
1. Algorithm
- A set of rules or mathematical procedures that a machine follows to learn patterns from data.
- It is the engine that powers model training.
- Examples: Linear Regression, Decision Tree, Support Vector Machine (SVM), Neural Networks.
2. Model
- The output generated after the machine learning algorithm trains on data.
- It represents the learned relationship between inputs (features) and outputs (labels).
- Once trained, the model is used to make predictions on new data.
3. Feature Set (Input Variables / Independent Variables)
- The input data used to make predictions.
- Features are the measurable properties or characteristics of the data.
- Example: For a house price model, features could be:
  - Size (sq ft)
  - Number of bedrooms
  - Location
4. Predictor Variable
- Another name for a feature: a variable used to predict the outcome.
- All predictor variables together form the feature set.
5. Response Variable (Target Variable / Output Variable / Dependent Variable)
- The outcome you are trying to predict or classify.
- Example: In predicting house prices, the price is the response variable.
6. Training Data
- The portion of the dataset used to train the model.
- The model learns patterns from this data, using both inputs (features) and known outputs (targets).
- Typically around 70–80% of the dataset.
7. Testing Data
- The portion of the dataset used to evaluate the model's performance.
- The model has not seen this data during training, so it gives an estimate of real-world performance.
- Typically around 20–30% of the dataset.
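To tie the terms together, here is a small sketch of an 80/20 split on a house-price dataset. The rows and prices are invented for illustration, and `train_test_split` here is a hand-written helper for this sketch, not a library function.

```python
import random

# Hypothetical toy dataset: (size_sqft, bedrooms) are the features, price is the target.
houses = [
    ((1400, 3), 240_000), ((1600, 3), 275_000), ((1700, 4), 290_000),
    ((1875, 4), 310_000), ((1100, 2), 199_000), ((1550, 3), 260_000),
    ((2350, 5), 405_000), ((2450, 5), 420_000), ((1425, 3), 245_000),
    ((1700, 3), 285_000),
]

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows, then hold out test_fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(houses)

X_train = [features for features, _ in train]  # feature set (predictor variables)
y_train = [price for _, price in train]        # response variable (target)
X_test = [features for features, _ in test]    # inputs the model never saw in training
y_test = [price for _, price in test]          # known answers used only for evaluation

print(len(train), len(test))  # 8 2
```

Shuffling before splitting matters when the rows are stored in some order (for example, sorted by price), because an unshuffled split would then test on a different kind of house than it trained on.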
i) Precision
Definition: Precision tells us what fraction of the predicted positives are actually positive.
Use Case: When false positives are costly (e.g., spam detection: don't mark important emails as spam).
ii) Recall (Sensitivity or True Positive Rate)
Definition: Recall tells us what fraction of the actual positives were correctly predicted.
Use Case: When missing positives is risky (e.g., disease detection: don't miss people who are actually sick).
iii) F1-Score
Definition: F1-score is the harmonic mean of Precision and Recall. It balances both when you need a single performance measure.
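The three definitions can be checked by counting true positives (TP), false positives (FP), and false negatives (FN) by hand. The sketch below assumes binary labels where 1 is the positive class; the label lists are made up, and `classification_metrics` is a name chosen for this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision = TP / (TP + FP); Recall = TP / (TP + FN).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean: 2 * P * R / (P + R).
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Made-up example: 5 actual positives, 4 predicted positives, 3 of them correct.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

precision, recall, f1 = classification_metrics(y_true, y_pred)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.75 0.60 0.67
```

Here TP = 3, FP = 1, and FN = 2, so precision is 3/4 and recall is 3/5. The F1-score sits between the two but closer to the lower one, which is the point of using a harmonic mean.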
Significance of Errors (Residuals)
A residual is the difference between an actual value and the value the model predicted. Residuals are critical in evaluating and improving your model. Here's why:
Model Accuracy
- Smaller residuals (in absolute value) mean better predictions.
- A model with low average absolute residuals is usually more accurate.
Model Diagnosis
Plotting residuals helps detect:
- Non-linearity in data
- Outliers or noise
- Bias in predictions (e.g., consistently over- or under-predicting)
Error Metrics Are Based on Residuals
Common regression metrics such as MSE are derived from residuals.
Model Optimization
- During training, algorithms try to minimize residuals by adjusting internal parameters (e.g., weights in linear regression or neural nets).
- This is done using loss functions like MSE (in regression) or cross-entropy (in classification).
Helps in Feature Selection
High residuals could mean:
- Missing important features
- Irrelevant or noisy data
- Need for transformation (e.g., polynomial features)
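As a small worked example of residual-based metrics, the sketch below computes residuals as actual minus predicted and derives MAE, MSE, and RMSE from them. The numbers are invented. MAE and RMSE are not named in these notes; they are included here as two other standard metrics built from the same residuals.

```python
import math

# Made-up actual and predicted values (e.g., house prices in thousands).
actual = [200, 250, 300, 350]
predicted = [210, 240, 310, 330]

# Residual = actual - predicted, one per data point.
residuals = [a - p for a, p in zip(actual, predicted)]
n = len(residuals)

mae = sum(abs(r) for r in residuals) / n  # Mean Absolute Error
mse = sum(r * r for r in residuals) / n   # Mean Squared Error
rmse = math.sqrt(mse)                     # Root Mean Squared Error (same units as the target)
mean_residual = sum(residuals) / n        # far from 0 suggests systematic over- or under-prediction

print(residuals)          # [-10, 10, -10, 20]
print(mae, mse)           # 12.5 175.0
print(round(rmse, 2))     # 13.23
print(mean_residual)      # 2.5
```

MSE squares each residual, so the single residual of 20 contributes more than the other three combined. That sensitivity to large errors is why outliers show up so clearly in MSE and RMSE. With this sign convention, a positive mean residual means the model under-predicts on average.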
What is a Tradeoff in Machine Learning?
A tradeoff in machine learning means that improving one aspect of a model’s performance often comes at the cost of another. You can’t have it all: increasing one thing might hurt another, so finding the right balance is key.
Common Tradeoffs in Machine Learning
Bias-Variance Tradeoff
- Bias: Error due to simplified assumptions in the model (e.g., using a linear model for non-linear data). High bias → Underfitting
- Variance: Error due to too much complexity, causing the model to fit the noise in the training data. High variance → Overfitting
- Goal: Find the sweet spot where the combined error from bias and variance is lowest, i.e., the model generalizes well.
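One way to see the two extremes is a k-nearest-neighbour regressor, where k acts as a complexity dial. This is a sketch under stated assumptions: the data are synthetic (x squared plus Gaussian noise), and k-NN is chosen only because it is short to write, not because the notes prescribe it. With k = 1 the model memorizes the training set (high variance); with k equal to the training-set size it predicts the same average everywhere (high bias).

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model (fit on train) evaluated on data."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

rng = random.Random(0)

def sample(n):
    # Hypothetical data: a non-linear signal (x squared) plus Gaussian noise.
    points = []
    for _ in range(n):
        x = rng.uniform(-3, 3)
        points.append((x, x * x + rng.gauss(0, 1)))
    return points

train, test = sample(40), sample(40)

for k in (1, 5, 40):
    train_error = mse(train, train, k)
    test_error = mse(train, test, k)
    print(f"k={k:2d}  train MSE={train_error:.2f}  test MSE={test_error:.2f}")
```

The k = 1 model has zero training error because each training point is its own nearest neighbour, yet its test error is not zero: that gap is the signature of overfitting. The k = 40 model has a large error on both sets, which is underfitting. An intermediate k typically lands closer to the sweet spot, though the exact numbers depend on the random data.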
Precision vs. Recall Tradeoff
In classification tasks:
- Precision: How many predicted positives are correct
- Recall: How many actual positives were correctly predicted
- Increasing precision may reduce recall, and vice versa.
Example: In medical diagnosis:
- You might want high recall (catch all actual sick patients),
- Even if that means lower precision (some false alarms).
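The tradeoff usually appears when a classifier outputs a score and a threshold turns that score into a yes/no decision. The sketch below uses made-up scores and labels; `precision_recall_at` is a helper written for this example, and returning precision 1.0 when nothing is predicted positive is a convention chosen here, not a universal rule.

```python
# Made-up model scores (higher = more likely positive) and the true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

def precision_recall_at(threshold):
    """Predict positive when score >= threshold, then compute precision and recall."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention: no positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return round(precision, 2), round(recall, 2)

print(precision_recall_at(0.80))  # (1.0, 0.5)
print(precision_recall_at(0.50))  # (0.75, 0.75)
print(precision_recall_at(0.25))  # (0.67, 1.0)
```

Lowering the threshold from 0.80 to 0.25 raises recall from 0.5 to 1.0 while precision falls from 1.0 to 0.67. In the medical-diagnosis case, that corresponds to choosing the low threshold and accepting the extra false alarms.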
Model Complexity vs. Interpretability
- Complex models (e.g., deep neural networks) may give better accuracy,
- But they are harder to interpret ("black boxes").
On the other hand:
- Simple models like decision trees or linear regression are easy to understand,
- But may perform worse on complex problems.
Training Time vs. Accuracy
- More training time (e.g., more epochs, larger datasets) usually improves performance.
- But after a point, returns diminish, and it may not be worth the extra cost.
- Sometimes you have to trade speed for quality, especially in real-time systems.
Amount of Data vs. Model Performance
- More data typically improves performance.
- But collecting and processing data is costly and time-consuming.
- Sometimes it’s better to improve the features or algorithm instead of just gathering more data.
What is Feature Extraction in Machine Learning?
Feature Extraction is a technique used to transform raw data into a set of meaningful
features that can be used by a machine learning algorithm.
It involves creating new features from the existing data, often by reducing dimensionality or extracting hidden patterns.
Example:
Suppose you have an image. Raw pixels (say 100x100 = 10,000 features) are hard to use directly. With feature extraction, you might transform the image into:
- Color histograms
- Edge maps
- Shape descriptors
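As a simplified stand-in for a color histogram, the sketch below reduces a 100x100 grayscale image (10,000 raw pixel values) to an 8-bin intensity histogram. The image is random noise generated only so the code runs, and `intensity_histogram` is a name invented for this example; a real color histogram would do the same per color channel.

```python
import random

def intensity_histogram(image, bins=8, max_value=256):
    """Summarise a grayscale image (rows of 0-255 ints) as a normalised histogram."""
    counts = [0] * bins
    bin_width = max_value / bins
    n_pixels = 0
    for row in image:
        for pixel in row:
            counts[int(pixel // bin_width)] += 1  # which intensity range the pixel falls in
            n_pixels += 1
    return [count / n_pixels for count in counts]  # fractions that sum to 1

# Hypothetical 100x100 image filled with random intensities.
rng = random.Random(0)
image = [[rng.randrange(256) for _ in range(100)] for _ in range(100)]

raw_feature_count = sum(len(row) for row in image)
features = intensity_histogram(image)

print(raw_feature_count, len(features))  # 10000 8
```

The histogram discards where each pixel is and keeps only how bright the image is overall, so two images of different sizes still produce feature vectors of the same length.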
Common Feature Extraction Techniques:
Data Type, Techniques
Text, TF-IDF, Word2Vec, BERT embeddings
Images, SIFT, HOG, CNN feature maps
Audio, MFCC (Mel-frequency cepstral coefficients)
General, PCA (Principal Component Analysis), Autoencoders
How is Feature Extraction Different from Feature Selection?
Aspect | Feature Extraction | Feature Selection
What it does | Creates new features from raw data | Chooses the best features from existing ones
Output | Transformed data (possibly lower dimensions) | Subset of original features
Techniques | PCA, LDA, Autoencoders | Chi-Square, Mutual Info, RFE
Example | Combine multiple sensor signals into 1 summary | Pick top 10 features from 100
Goal | Compress, reduce noise, find hidden patterns | Improve performance, reduce overfitting
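The difference can be shown on a tiny table of sensor readings. This is an illustrative sketch with invented numbers. Extraction is done here by averaging three sensors into one summary value. Selection is done by keeping the columns with the highest variance, which is a deliberately simple criterion swapped in for the techniques listed in the table (Chi-Square, Mutual Info, RFE).

```python
from statistics import mean, pvariance

# Hypothetical readings: each row is one sample, each column one sensor.
readings = [
    [20.1, 19.8, 20.3, 5.0],
    [22.4, 22.0, 22.9, 5.0],
    [25.2, 24.7, 25.5, 5.1],
    [21.0, 20.6, 21.4, 5.0],
]

# Feature extraction: build a NEW feature, the mean of the first three sensors.
extracted = [[mean(row[:3])] for row in readings]

# Feature selection: KEEP a subset of the original columns unchanged.
def select_top_k_by_variance(rows, k):
    columns = list(zip(*rows))
    ranked = sorted(range(len(columns)), key=lambda j: pvariance(columns[j]), reverse=True)
    keep = sorted(ranked[:k])  # restore original column order
    return keep, [[row[j] for j in keep] for row in rows]

kept_columns, selected = select_top_k_by_variance(readings, k=2)

print(kept_columns)   # [0, 2]
print(selected[0])    # [20.1, 20.3]
print(len(extracted[0]), len(selected[0]))  # 1 2
```

The selected values are original readings copied as they are, while the extracted value is a new number that appears in no original column. The nearly constant fourth sensor is dropped by selection because it carries almost no variation.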
Advantages of Feature Extraction
Improves Model Performance
- Removes noise and irrelevant data
- Highlights patterns, trends, and important signals
Reduces Dimensionality
- Helps tackle the “curse of dimensionality”
- Speeds up training and reduces overfitting
Makes Data More Manageable
- Easier to visualize and interpret
- Enables use of algorithms that require fewer inputs
Boosts Generalization
- Helps models perform better on unseen data
Real-World Example
- Sentiment Analysis on movie reviews: Raw text → Cleaned text → TF-IDF or Word2Vec vectors → Used for classification
- Face Recognition: Raw pixels → PCA → Eigenfaces → Classifier
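The first two arrows of the sentiment-analysis pipeline (raw text → cleaned text → TF-IDF vectors) can be sketched without any library. The reviews are made up, and the weighting used here is one common variant, assumed for this sketch: term frequency is count divided by document length, and inverse document frequency is log(N / document frequency). Libraries often apply extra smoothing, so their numbers will differ.

```python
import math
import re
from collections import Counter

def clean(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return (vocabulary, vectors); assumes every document has at least one token."""
    tokenized = [clean(doc) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

# Made-up movie reviews.
reviews = ["A great movie, great acting!", "A boring movie.", "Great fun movie."]
vocabulary, vectors = tf_idf(reviews)

print(vocabulary)  # ['a', 'acting', 'boring', 'fun', 'great', 'movie']
for review, vector in zip(reviews, vectors):
    print(review, [round(weight, 2) for weight in vector])
```

The word "movie" appears in every review, so its weight is 0 in all three vectors: it does nothing to tell the reviews apart. Words such as "boring" and "fun" appear in only one review each and get the largest weights, which is what a downstream classifier would pick up on.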