0% found this document useful (0 votes)

38 views22 pages

Machine Learning Fundamentals Exam Guide

The document is an examination paper for a B.Tech course on Machine Learning Fundamentals, detailing various topics such as the differences between Gradient Descent and Stochastic Gradient Descent, definitions of Artificial Intelligence, Machine Learning, and Deep Learning, and concepts like bias, variance, and their trade-offs. It includes questions on model evaluation metrics, handling high variance in datasets, and the importance of feature selection and regularization techniques. Additionally, it discusses the iterative process of model selection and tuning in machine learning.

Uploaded by

stonly098

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

38 views22 pages

Machine Learning Fundamentals Exam Guide

Uploaded by

stonly098

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

[JUNE-2024]

IP. University-([Link] ]-Akash Books 2024-13

(SGD)
Q.1. () What is the difference between stochastic gradient descent (15)
EXAMINATION
TERM SEMESTER [[Link])
ENDFOURTH
and gradient descent (GD)? (SGD) are
Ans. Both Grad1ent Descent (GD) and Stochastic Gradient Descent
[AIDSIAIML/IOT-2101

OF MACHINE LEARNING to find the min1mum of a function, typically a

FUNDAMENTALS Max. Marks: 75 iterat1ve optim1zation algor1thms used
gradient of the functon
loss function in machine learn1ng They work by calculat1ng the then tak1ng a step in the
Time: 3 Hrs. including Q. No. 1
which is compulsory. (which points in the d1rection of the steepest ascent) and function's value The key
questions in all (towards the steepest descent) to reduce the
Note: Attempt five opposite direction
question from each
unit. d1fference l1es in how they calculate the gradient
Select one Learnine
(Al), Machine Gradient Descent (GD)
Artificial Intelligence
Q.1. (a) Explain the
terms (1.5)
" Gradient Calculation: GD calculates the gradientdata of the loss function using
in short? capable of for each point before updating
(ML) and Deep Learning the entire train1ng dataset. It sums up the errors
mach1nes
Artihcial Intelligence
(AI): Al aims to create as problem-solving
Ans. human intell1gence, such the model's parameters.
typically require once per epoch (one
perform1ng tasks that " Update Frequency: The model's parameters are updated
dec1sion- mak1ng, and learning. machines
enabl1ng
ML 1s a subset of AI that focuses onpatterns and make pass through the entire dataset).
Machine Learning (ML): min1mum for convex loss
Convergence: GD 1s guaranteed to converge to a local
programm1ng. They identify
exphc1t
to learn from data w1thout functions. However, it can be slow, especially for large datasets, as it needs to process
pred1ctions or decC1sIons. neural networks
1s a subfeld of ML that uses artiic1al the entire dataset for each update.
"Deep Learning (DL): DL"deep") to analyze data and learn complex patterns.
with multiple layers (hence models in Advantages:
of Learning/Training min1mum for convex functions (u) Stable
Q.1. (b) What are the
different types
(1.5) (1) Guaranteed convergence to a local
ML? convergence.
Examination 2018. (Pg. No. 1-2018).
Ans. Refer to Q. 1.(b)Frst Term Disadvantages:
2019. (Pg. No. 14-2019) min1ma for non-convex functions
Refer to Q3 (b) End Term Exam1nat1on
(i)Slow for large datasets. (in)Can get stuck in local
umptions for data to be met before starting with linear Stochastic Gradient Descent (SGD)
Q.1. (c) List all aSS (1.5)
loss function using
regression?
relationsh1p exists between the independent and " Gradient Calculation: SGD calculates the grad1ent of the
Ans. Linearity: A linear randomly chosen data point from the training dataset.
only one
dependent varnables. updated after each data point
each other. " Update Frequency: The model's parameters are
" Independence: Errors are independent of 1s processed. This means multiple updates per epoch
constant across all levels of the
" Homnoscedasticity: The variance of errors is " Convergence: SGD's path to the min1mum 1s
more noisy and erratic due to
independent varnables.
in selecting data points. However. this "noise" can sometimes help the
the randomness it might not converge
" Normality: Errors are normally distributed. algorithm escape local minima and find a better solution. While
correlated with
" No Multicollinearity: Independent variables are not highly as precisely as GD, it often gets close enough for practical purposes.
each other
another? Advantages:
Q.1. (d) How are covariance and correlation different from one
(1.5) (1) Much faster than GD, especially for large datasets.
of updates.
Ans. Covariance: Measures the degree to which two variables change together. (11) Can escape local min1ma due to the stochastic nature
Itsvalue is affected by the scales of the variables. Disadvantages:
" Correlation: Measures the strength and direction of the linear relationship (1) Noisy convergence.
between two variables. It is standard1zed (between -1 and 1), making it easier to (1u) May not converge to the exact minimum.
compare relationships across different scales. (iu) Requires careful tuning of the learning rate.
Q.1. (e) If your dataset is suffering from high variance, how would you Mini-Batch Gradient Descent: A compromise between GD and SGD 1s Mn1
handle it? (1.5)
Batch Gradient Descent.
Ans. "m1n1-batch")
" Regularization (LI or L2): Adds a penalty to the loss function to prevent It calculates the gradient using a small, randomly chosen subset (a
SGD and the stab1l1ty
overfitting of the training data. This offers a balance between the speed of
of GD
Feature Selection: Choose only the most relevant features.
Q.1 (g)What doyou understand by DQN? (1.5)
Cross-Validation: Use techniques lhke k-fold cross-validation toevaluate model
performance and avoid overftting. Ans: Refer to Q3 of New topics added UNIT-IV (Pg. No. 19-2021).
More Data: If possible,obta1n more data to train the model. Q.1. (h) What do you mean by Fuzzy C-Means clustering? (1.5)
" Reduce Model Complexity: Use a simpler model with fewer parameters. Ans. Refer Q 1 of New topics added UNIT-III (Pg. No. 11-2021).
Machne
Learning (AlMI JAIDS)
of IP University |B Tech -Akash Boks
be used to
u n d a m e n t a l

ourth
ester,
metrics
that can estimate the b Business 0bjective: What are you trying to achieve? Mazimizng acuraty'
NA
the
perfornnance

n the dependent
(1.3) . . n g costs? Dilferent algonthns m1ght be better suited for different oyertyvea
ba are r v g r e s s i o n
model?

of
varance
vAnable
ah n e a r Measures
the
proprtiou
3. Algorithm Characteristics:
eteaY
number of predCtors in
based on the . Aceuracy: How accurate does the model need to be?Somealgurithna are knows
Lzxsquared

Austs Rsquared r hugher acuraCy (hanothers, but this often comes at the cost of ornplezity
"dsvd
Rsguret

squared
d1tlerence
between predicted . Interpretability: How imnportant is it to understand whythe model s makIng
Average
Brror
(MSE): rtanpredictions"Snpler models (e g, linear regression, decison trees) are zenerally
Mva
SyArd
Square rot of MSE easler to interpret
oore nterpretable than complex models (e g.. deep learn1ng)
(RMSE):
arvd Error . Computational Cost: How much proces9ing power and time are available?
"RNeun
algorithms tiill now. If given Sone algornthms are computationally more expens1ve than others
learning a
many
machine
a l g o r i t h mto be
ussed for that? (L.5) . Scalability: How well does the algor1thm scale to large dataseta?
here are d e t e r m i n e which
l one factors . Robustness: How well does the algor1thm perform on noisy or incomplete data
bow can variety of
tta se depends on a
Ans T3e best
algorntàm . Assumptions: Each algor1thm makes certa1n assumptions about the data Its
lnderstanding
the Data: mvortant to choose an algorithm whose assumptions are met by your data
2 (e.g., number of
Data: height., weight)
or discrete 4. The I terative Process:
(a) Tvpe of Contnuous (e g, here.
Numeical: work well " Start Simple: Beg1n with simpler, more interpretable alzorthms (e 2 inear
algorithms often
Regression
chuldren
(unordered
eg, colors) or categories, ordinal (ordered regress10n, log1stic regression, decision trees)
Categorical: Nomunal levels).
education
Classification algorithms are frequently " Experiment: Try d1fferent algor1thms and compare their performance using
eg, AUC-RC for
categores,
ues. appropriate evaluation metrics (eg. accuracy. precis1on recall F1-score
tokenization, TF-IDF)
before being
Lsed class1fication: mean squared error. R-squared for regression)
techniques are essential
preprocessingeg.
Text: Requires language processing (NLP) . Cross-Validation: Use cross-validation techniques (e g k-fold cross-valbdatuon
Dost algorithms Naturalconvolutional neural networks (CNNs) for featues to ensure that the model general1zes well to unseen data
Image: Often requires
extraction
Recurrent neural networbe
" Tune Hyperparameters: Opt1mize the hyperparameters of each algonthm to
collected over time. ach1eve the best possible performance
Time Series: Data points appropriate.
are
RNNS or time sernes-specifhic models " Iterate: Refne your approach based on the results You might need to revnst
(b) Size of Data: your feature engineering. data preprocess1ng or algor1thm selection
linear regression, decision trees)
Small Datasets: Simpler algor:thms le.g, 5. Tools and Libraries:
to avo.d overfitting
might be preferable
(e.g., deep learning) can be used. "Seikit-learn (Python): Provades a w1de range of mach1ne learn1ng algor1thms
Large Datasets: More complex algorithms factor. and tools for model selection, evaluation. and tun1ng
and computational efhciency becomes a key
(c) Structure of Data:
" TensorFlow/Keras (Python): Popular librarnes for deep learn1ng
- Labeled Data: Supervised learning algorithms (classification or
regression) " PyTorch (Python):Another widely used deep learn1ng framework
¿re used UNIT-I
Unlabeled Data: Unsupervised learning algorithms (clustering,
Q2. (a) What is Bias, Variance, and their trade-off? When does
Cimens1onal1ty reduction) are necessary regularization (Li, L2) come into play in Machine Learning and how is it
(d) Missing Values:How are m1ss1ng values handled? Some algornthms are more different from normalization? (7.5)
robust to niss1ng data than others Cons1der imputation techniques.
(e) Outliers: Are there outlhers present? Some algor1thms are more sensitive to
Ans. Bias: Bias refers to the error introduced by s1mphfy1ng assumptions made
by amodel to nake the target function easier to learn A hugh bias model underits the
outhers than others Cons1der outlher removal or robust algor1thms data. meaning it misses 1mportant patterns and performs poorly on both train1ng and
2. Deining the Problem: unseen data. Th1nk of it hke try1ng to fit a straight l1ne to highly curved data - it won't
(a) Tpe of Problem: capture the complex1ty.
(lassification: Predicting acategory (eg, spam or not spam) Variance: Variance refers to the model's sens1t1v1ty to iuctuations in the
Legression: Predicting a continuous value (e g. house price) tran1ng data A high varance model overfits the data. memorizing noise and specitic
(lustering: Grouping simlar data points together Customer
xamples rather than general1zng to the underly1ng patterns Th1s leads to excellent
segmentation leg
performance on tran1ng data but poor perfornmance on unseen data lmag1ne trving to
Iimensionality Reductiop: Reduing the number of features while it a very complex curve that goes through everv single tranng po1nt. even the no1sy
¡reserving important information ODes It won t generalhze well
Learning (AIMLAIDS)
f
Machine LP. University-([Link] ]-Akash Books 2024-17
F u n d a m e n t a l

madel
wrh low blas and low
Semster, is a . Methods to Deal with Outliers:
:NA
Fourth The ndeal
snano

capturrs
the underly1ng patterns
Iradeot:

wher
the
model
deceasng
buas
often inereases (a) Renmoval: Remove outler data points if they are due to errors or are truly
st that
wtreme values that are not representative of the population
" B i a s - l a r i a n e

the
sweet challenge
1s
This s The
balane.
arav
to noas and the
optamal
are used to combat (b) Transformation: Apply transformations (eg. logar1thm1c, square root) to
Vertang
a:m to techntques
dout
and vy vesa le
the loss tunction. Regularnzation

discouraging sompress the scale of the data and reduce the induence of outlhers
(Ll. L2: term to
Anan
a penalty lc)Imputation:Ifoutlhers are due to m1ssing data. impute them us1ng appropnate
Thev add
R e g u l a r i z a t i o n

"
high
varnane)
onplex
elatonshps

absolute values of the weights methods (e g.. mean. median, or more sophisticated imputation techniques)
over sum ofthe Q.3. (a) How would you handle an imbalanced dataset? How do you deal
rOttung

t§e hodel
trvm
leanng
(Lasso):
Adds the
weghts
beom1ng
zero.
ettect:vel
perform1ng with the class imbalance in a classification problem? (7.5)
leadto some
R e g u l a r i z a t i o n

"L1
functaon Thiscan
of the weights to the Ans. The best approach depends on the speciâc dataset and problem It's often
to the loss squares performance us1ng
teature selecten
Addsthe
sum ot the
elmnates them entirely. necessary to experiment wth d1tterent techn1ques and compare their
imbalance, the
but rarely approprate metrics. Cons1der the size of the dataset. the severity of the
(Ridge):
towards zero
Regularization

" L2
Th:sshrnks
the werghts
prepress1ng
step that scales the features computational cost, and the interpretabil1ty requirements when making your dec1sion
loss functaon is
done to
prevent features
Normalzattonlarger a with Start with sumpler methods like SMOTE or random undersampl1ng and then move on
to 1). This1s
Normalization:

range (e g. 0to l or -l
proess and to mprove the convergence speed of to more complex techniques if necessary
to a standarddon1nating the learn1ng tran1ng andiis independent of 1. Understanding the Problem:
applied before
values from N o r m a l z a t l o uis varnance but it can hely the of
algonthms address bias or " What is Class Imbalance? Class imbalance occurs when the number
optimization
doesntdreetly classes of a dataset is sign1ficantly different. For example, in a
models compley
It 1nstances in d1fferent
will far outweigh the
the eecavely inmportant in
model buildine fraud detection dataset. the number of legitimate transactions
model learn more engineering is ering. What number of fraudulent transactions.
why feature feature engine
Q.2. (b)Mention the used for
techniques (7.5) " Why is it a Problem? Standard classificat1on algorithms aim to
maxim1ze
out some of
outliers.
and list three methods
to deal with In imbalanced datasets, a model can achieve high accuracy by s1mply
mention overall accuracy.
and poorly on the minority
predicting the majority class all the time, even if it performs
outliers
Engineering:
of Feature predictions A
features lead to better
Importance
Ans.
Performance: Better represented well. the
class.
Model imbalanced
1sn't
" Which Metrics to Use: Accuracy is often a mislea d1ng metric in
Improved
(a)
from the data it's given.
so if the data
model can only learn on
suffer class1ñcation. Instead, focus on metrics that are more sensitive to the performance
wll easier
model's performane
Interpretability: Relevant
features make the model the m1nor1ty class:
(b) Increased Model meanings. it's easier to understand why the (i) Precision: Out of all the instances predicted as positive, how
many were
to If the features have clear
understand actually positive?
predictions.
model is mak1ng certan sometimes mitigate the correctly
Missing Data: Featre engineering can (ii) Recall: Out of all the actual positive instances, how many were
(c) Handles
features that capture the informat1on that predicted?
creat1ng new
impact of miss1ng values by both metrics.
m1ght be lost due to missing data.
(iii) F1-Score: The harmonic mean of precision and recall, balancing
Feature eng1neering can help smooth
out
(iv) AUC-ROC (Area Under the Receiver Operating Characteristic curve):
(d) Addresses Data Inconsistencies: various thresholds.
transform1ng or combining features. A measure of the classifier's ability todistinguish between classes at
noisy or incons1stent data by to
Techniques for Feature Engineering: 2. Data-Level Approaches (Resampling): These methods modify the dataset
variables (e.g.. colors. address the imbalance:
Creating Dummy Variables: Convert1ng categorical
(a)
cities) nto numerncal representations that a model can understand. " Oversampling the Minority Class: Creating synthetic samples for the
ranges m1nor1ty class.
(b) Binning: Group1ng continuous varnables into d1screte intervals (e.g.. age
instead of exact ages)
(i) SMOTE (Synthetic Minority Over-sampling Technique): Creates synthetic
instances by interpolating between existing minority class instances.
(c) Feature Scaling: Scal1ng features to a standard range (like normalization). (ii)ADASYN (Adaptive Synthetic Sampling Approach): Focuses on generating
(d) Polynomial Features: Creating polynomial combinations of existing features syntheticsamples in regionsof the feature space where the minority class is harder to
(eg.x r^2, r^3). learn
(e) Domain Specific Features: Using domain expertise to create new features " Undersampling the Majority Class: Reducing the number of instances in the
that are relevant to the specific problem. majority class.
" Outliers: Outhers are data points that s1gniñcantly devnate from other (i) Random Undersampling: Randomly removes instances from the majority
cbservations They can be due to errors in data collection, genunely extreme values, or class. Can lead to loss of information if not done carefully.
other reasons.
L e a r n i n g
(AIML/AIDS) LP. University-([Link].J-Akash Books 2024-19
Machine

Fundamental
of
that
are "hnked" to minority UNIT-II
i n s t a n c e s

Fourth
Semester,

majority
class o4.(a) Explain One-hot encoding and Label Encoding. How do they affect
18-2024 Removes

Removes
majority
class instances .dimensionality of the given dataset? Does Principle Component Analysis
Links:

(n)
Tomek

bours
(ENN):
class. cA) do the same and how is it different from Linear Discriminant Analysis
class instances.
Nearest
Neigh
to the
minority

Learning): These methods (LDA)? (9)

belong( C o s t - S e n s i t i v e
(iii)
Edited
nejghbors
mostlv
i m b a l a n c e :
Ans: One-Hot Encoding: Converts categorical variables (nom1nal or unordered)
misclassifications of .nto a numerical representation suitable for machine learn1ng algor1thms.
A p p r o n c h e s

for the
k-nearest

Whose Algorithm-Level account costs to

to d1fferent
of the minority class,.
3. learning
aleonthm

Ass1gns
misclass1fications . How it Works: For each category in a feature, a new binary (0 or 1) feature
modity
the Learning:
assigned
to
Many
a l g o r i t h m s
bave
built-in cost- s ereated. If a dat: point belongs to a particular category, the correspond1ng binary
costs are feature get a
C o s t - S e n s i t i v e

classes.
Higher attention to it. foature gets a value of 1, and all other binary features for that or1ginal
more
d1fferent
model to
pay
but
instead of costs, weights value of0.
forc1ng the learning,

" Example: Suppose you have a "Color" feature with values "Red," "Green," and
sens1tive learn1ng capabil1t1es.
weight.
cost-sensitive

to higher
"Color_Green,"
class gets a "Blue." One-hot encoding would create three new features: "Color Red,"
S1milar
Class
Weights:
minority multiple models to improve
"
to classes.
The methods
combine
and "Color_Blue." A data point with "Color = Red" would have "Color Red = 1," "Color_
assigned These
are
Green = 0," and "Color Blue = 0."
Methods:

4.
Ensemble
of the algorithm Random Forest
" Impact on Dimensionality: Increases dimensional1ty. If a categorical feature
modification
performance
the data
for training each tree.
Forest: A
Random subsets of
"
Balanced
to create
balanced
models by
training
each model on a
has n unique categories, one-hot encoding creates n new features.
ensemble of
2. Label Encoding: Converts categorical variables into numerical representations
bootstrapping
that uses Creates an
majority class.
by assigning a unique integer to each category.
EasyEnsemble:

" subset of the

undersampled
d1fferent
5. Other Considerations:
that are more informative about " How it Works: Each category is assigned an integer. For example. "Red" m1ght
Creating new ffeatures be encoded as 0, "Green" as 1, and "Blue" as 2
Engineering: performance.

" Impact on Dimensionality: Does not change dimensionality. It replaces the

Feature troot
" sign1ficantly improve class can be
class can the minority
the m1nornty Detection: In
some cases,
original categorical feature with a single numerical feature.
used.
" Anomaly techniques can be
anomalies, and
anomaly detection can provide valuablo 3. Dimensionality and Encoding:
with domain experts approach.
" One-Hot Encoding: Increases the number of features, potentially leading to
Expertise: Consulting
" Domain the most
appropriate

ins1ghts into the data

and help in choosing a fraud detection higher computational costs and the curse of dimensionality if not
handled carefully.
Detection): Let's say you're buildung
Example Scenario (Fraud
minor1ty class.
However, it avoids imposing an artificial order on the categories.
transactions are the
model. Fraudulent
data and find that it's highly imbalanced (e. g " Label Encoding: Maintains the number of features but introduces an artific1al
the more
1. DataAnalysis: You
analyze
order to the categories, which might be misle ading for nominal variables. It's
1% fraudulent transactions).
and Fl-score as your
suitable for ordinal categorical variables where the order has meaning.
decide to use preCision, recall,
2. Metric Selection: You 4. Principal Component Analysis (PCA): Adimensionality reduction technique.
evaluation metrics.
It aims to find the principal components of the data, which are the directions of
oversample the minority class.
3. Resampling: You try SMOTE to maximum variance. By projecting the data onto a lower-dimensional space defined by
Forest model on the balanced data. these principal components, PCA reduces the number of features while preserving as
4. Model Training: You tran a Random
model's performance using the chosen metrics. much variance (information) as possible.
5. Evaluation: You evaluate the
You might need to adyust the amount of oversampl1ng, try d1fferent " How it Works:
6. Iteration:
to optimize the model's performance.
algorithms, or use cost-sensitive learning 1. Calculates the covariance matrix of the data.
Q.3. (b) Differentiate between K-Me ans and KNN algorith ms? Are both 2. Finds the eigenvectors and eigenvalues of the covariance matrix.
same types of Learning? Explain in detail with examples. (7.5)
3. Sorts the eigenvectors based on their corresponding eigenvalues (largest to
Ans. Refer to Q5la) End Term Examination 2017 (Pg. No. 17-2017). smallest).
Refer to [Link]) End Term Examination 2018 (Pg. No. 9-2018). 4. Selects the top k eigenvectors (corresponding to the largest eigenvalues) to form
No, K- Means and KNN are fundamentally d1fferent types of learning algor1thms. a new basis for the lower-dimensional space.
"K-Means: Unsuperv1sed learning It deals with unlabeled data and aims to 5. Projects the data onto this new space.
d1scover underly1ng patterns or structures in the data by grouping similar data po
together " Impact on Dimensionality: Reduces dimensionality. You choose the number
" KNN:Supervised learning It works w1th labeled data and uses the knownlabels of principal components to keep, which will be less than or equal to the original number
of the train1ng data to predict the labels of features.
or values of new, unseen data
points.
(AIML/AIDS)

of Machine
Learning I.P. University-([Link].|-Akash Books 2024-21
Fundamental

20-2024 Fourth
Semester,
Analysis (LDA):
Purpose:
Another
learn1ng g
d1mensional1ty Mnemonic to Remember:
Discriminant
des1gned for
superv1sed (classification)
classes
" Type I: "I" saw a fire when there wasn't one (False Pos1t1ve)
5. Linear specifically separate the
but it's that best " Type II: "II didn't see the fre when there was one (False Negative)
reduction technique, combinations of features with1n-class
variance
the bnear minimizes the ROC and AUC Curves
LDA aims to fndbetween-class
variance and
max1mizes the Now, let's move on to Receiver Operating Character1stic (ROC) curves and
" How it Works: the with1n-class scatter matrx. Area Under the Curve (AUC). These are tools used to evaluate the performance of
matr1x and
between-class scatter scattos classification models, particularly in situations with imbalanced classes (where one
1. Calculates the matrix formed by these
eigenvalues of the
eigenvectors and class has many more examples than the other).
2. Finds the " ROC Curve:
matrices. eigenvalues.
eigenvectors based on their correspond1ng lower-d1mensionl
o It's a graphical representation of a model's performance across diferent
[Link] the eigenvectors to form a new
basis for the
classification thresholds.
4. Selects the top k
space
o It plots the True Positive Rate (Sensitivity or Recall) on the y·axisagainst the
space False Positive Rate (1 - Specificity) on the x-axis. 2
5. Projects the dataonto this new d1mensional1ty. The maximum
number of
oIdeally, you want the ROC curve to rise steeply towards the top left corner,
Dimensionality: Reduces
Impact on is min(n_classes - 1, n_jeatures). indicating a model that can achieve a high true positive rate with a low false
you can get
linear discrim1nants (new features)
Differen ces between PCA
and LDA: positive rate.
LDA " AUC (Area Under the Curve):
PCA
Feature o It's a single value that summarizes the overall performance of a classification
Learning Unsuperv1sed
Supervised model.
Type oIt represents the area under the ROC curve.
Maxim1ze class separab1l1ty
Max1mize var1ance
Goal oAUC values range from 0 to 1:
Dimensional1ty reduction before Feature extraction for class1fication,. " AUC = 1: Perfect classification
d1mensional1ty reduction when class
Use Case applying other algorithms, noise separatIon is important AUC = 0.5: Random guessing (no better than chance).
reduction
Linear discr1m1nants " AUC >0.5: Indicates some level of performance better than random.
Output Pr1nc1pal components
o The higher the AUC, the better the model's abl1ty to distinguish between classes
Class Requred
Labels
Not used Examplc: Think of predicting whether a patient has a disease (positive case) or
not (negative case).
error? What do
Q.4. (b) What's the difference between Type land Type II (6)
" Type I Error: Telling a healthy person they have the dise ase (False Pos1tive)
you mean by the ROC and AUC curves. " Type II Error: Telling asick person they're healthy (False Negative)
Ans. Type I and Type II Errors: These errors occur in hypothe sis testing, a " ROCCurve: Shows you how well your test performs at different thresholds for
fundamental concept in statistics used to make decisions based on evidence (data). dec1ding someone has the dise ase.
Imagine you're conducting a test, trying to determine if a new drug is effective.
" AUC: Gives you a single number summariz1ng how good your test 1s overall at
Null Hypothesis (H0): This is a statement of no effect or no difference. In our
distinguishing between sick and healthy people
drug example, it might be "the new drug has no effect."
"Alternative Hypothesis (HI): This is the statement youre try1ng to find Q.5. (a) What is the difference between the normal soft margin SVM and
SVM with a linear kernel? What is Kernel Trick in an SVM Algorithm? What is
evidence for In our example, it could be "the new drug has a positive effect. the signiicance of Gamma and Regularization in SVM? List popular kernels
HOis True (Drug is Ine ffective) HO is False (Drug is Efective) used in SVM along with a scenario of their applications. (8)
Reject HO Type I Error (False Positive) Correct Decision (Power) Ans: Soft Margin SVM between SVM with a Linear Kernel:
oLinear Kernel: This is a specifc type of kernel function used in SVM. It maps
Failto Reject H0 Correct Decision (Specifcity) Type II Error (False Negative) the data points to a higher-dimensional space where the decision boundary is
"Type I Error (False Positive): You reject the null hypothesis when it'sactually a hyperplane (a line in 2D,a plane in 3D, etc). It's called linear" because the
true In our example,this means conclud1ng the drug is effect1ve when it actually 1sn't. decision boundary is linear in this higher-dimensional space.
o Analogy: Asmoke alarm goes off when there's no fire oSoft Margin SVM: This addresses the issue of non-separable data. In real
Type II Error (False Negative): You fa1l to reject the null hypothesis when it's world datasets, it's often impossible to ind a perfectly clean separating
actually false. In our example, this means fa1l1ng to recognize that the drug 1s effective hyperplane Soft margin SVM allows for some data points to be misclassifed
when it actually is. or fall within the margin (the "soft" part). It introduces a penalty for these
Analogy: A smoke alarm fails to go off when there's a fire violations, controlled by a parameter (usually called C).
(AIMLJAIDe

Machine Learning IP. University-([Link].J-Akash Books 2024-23

Fundamental of
0A Fourth Semester,
can be a soft margin SVM. The "soft data points. These "left-out" data points are called "'out-of-bag" samples The 00B error
kernel SVM non--separability, while the
o Key
Difference: Alinear algorithm handles
the data. is calculated by averaging the predictions of the models on these out-of-bag samples
margin" aspect
is about how used to map the
function It provndes an estimate of the model's performance on unseen data, s1m1lar to cross
kernel" is about the specific mathematical technique that allows val1dation.
"inear clever
kernel tr1ck is a al)
without How it Occurs: During the training of each tree in a Random Forest (or Bagg1ng
infinite-d1mension

Trick: The spaces (even

" hernel Instead of ma
ensemble), a random subset of the data is selected (with replacement). The data pounts
high-dimensional

n very thal space.

SVMs to operate the data points in comn Predictions on these samples
explictlycalculating
the coordinates of product, the kernel function directly not selected for a particular tree are its out-of. bag samples.
calculating the dot pertorming the mappin. are aggregated across all trees to calculate the 0OB
the ponts and then space, implicitly are made by the tree, and the errors
the higher-d1mensional
the dot product in computation. error rate.
This saves a lot of Regularization: UNIT-III
Gamma and RBF). It contral,
" Significance of non-linear kernels (like Q.6. (a) What is Bayes' Theorem? State at least
1 use case with respect to the
parameter is specifc to
This means a wider kernel, capturing terms prior probability and marginal
machine learning context. What do the
o Gamma:
A smaller gamma narrower kernel, focusine Bayes Theorem? Are Gaussian Naive
the width of the kernel. means a likelihood mean in the context of Naive
while a larger gamma (7)
more global patterns, Bayes the same as binomial Naive Bayes?
on local patterns. term Examination(May-June-2018)
the trade-off between Ans. For Bayes' Theorem: Refer Q.4 (a)-End
This parameter controls
o Regularization (C): and minimizing classification errors. A smaller C (Pg. No. 14-2018).
maximizing the margin while a larger C penalizes classifiers use Bayes' theorem
misclassifcations (wider margin), " Use Case in Machine Learning: Naive Bayes
allows for more a data point belonging to a particular class, given 1t8
margin). to calculate the probability of
misclassifications more heavily (narrower features.
Applications:
" Popular Kernels and separable). " Prior Probability and Marginal Likelihood:
(high-dimensional, often linearly probab1l1ty of a
(a) Prior Probability (P(A)): In a classification context, it's the For example,
o Linear: Text classification
polynomial relationships between features.
(captures before considering its
o Polynomial: Image recognition data point belonging to a particular class
features). the probability of a customer churning before looking at their usage patterns.
purpose, works well for many observing the features across
oRBF (Radial Basis Function): General (b) Marginal Likelihood (P(B)):The probability of
especially when non-linear relationships are present. all classes. It acts as a normalizing constant.
datasets,
network, can be used for binary
o Sigmoid: Similar to a two-layer neural " Gaussian between Binomial Naive Bayes:
classiication problems. the features are continuous and follow
16-2019). (a) Gaussian Naive Bayes: Assumes that
Refer to Q. 5(a) End Term Examination Feb-2019. (Pg. No. a Gaussian (normal) distribution.
is a more stable algorithm as compared to other features are binary (e.g.. presence/
Q.5. (b) Why boosting
Boost perform better than SVM? What is (b) Binomial Naive Bayes: Assumes that the
ensemble algorithms? Why does XG absence of a word in a document).
0OB error and how does it occur? (7)
different types of data, depending on the
Gradient Boosting, They are not the same. They are used for
Ans. Boosting Stability: Boosting algorithms (like AdaBoost, distribution of the features.
generally more stable than other ensemble methods (like Bagging with
XGBoost) are clusters in a clustering
Random Forest) because they learn sequentially. Each new model focuses on
correcting Q.6. (b) How would you define the number of
prone to algorithm? How does the Gaussian Mixture Model work using
the Expectation
the mistakes of the previous models. This adaptive nature makes them less Maximization a'gorithm?
(7)
overfitting and more robust.
algorithm:
XGBoost between SVM: XGBoost often outp erformsSVMs due to several factors: Ans. To define the number of clusters in a clustering
optimal number of
naturally
(a) Handling Non-linearity: XGBoost, being a tree-based ensemble, cannon-linear " Defining the Number of Clusters: Determining the
methods include:
with clusters is a crucial step in clustering. Common
capture complex non-l1near relationships in the data, while SVMs within-cluster sum of squares (WCSS) for different
kernels can be computationally expensive. (i) Elbow Method: Plot the
numbers of clusters. Look for the "elbow" point where the rate of decrease in WCSS
(b) Feature Interactions: XGBoost can automatically learn interactions between slows down.
features, which can be crucial for many real-world datasets.
to its own cluster
(c) Regularization: XGBoost has built-in regularization mechanisms that prevent (ii) Silhouette Score: Measures how similar a data point is
cluster1ng.
overitt1ng which can be a challenge for SVMs, especially with non-linear kernels compared to other clusters. Higher silhouette scores ind1cate better
within-cluster dispersion of the data to that of
(d) Speed: XGBoost is often faster to tran than SVMs, especially on large datasets. (iii) Gap Statistic: Compares the
randomly generated data.
" 00B Error (Out-of-Bag Error): In ensemble methods like Bagging (and
Random Forest) each model 1s trained on a random subset of the data, leaving out some
LP.
24-2024 Fourth Semester, Fundamental of Machine Learning (AIMLAIDS) University-|[Link].]-Akash
F1-Score: These metrics Books
Precision, Recall, and
Gaussian Mixture Model (GMM) and EM Algorithm: matr1x and prov1de a 2024-25
more are
() GMM: Aprobabilhstic model that assumes the data points are generated from a performance, especially when comprehens1ve evaluation der1ved from the
mxture of Gaussian distributions. each representing a cluster. dealing with of a
class1ficat1onconfus1on
(0)EM Algorithm (Expectation-Maximization): An iterative algorithm used to
" Precision:. Measures how
many
imbalanced
of the pos1t1ve datasets. models
It focuses on the accuracy of the
otmate the parameters (means, variances, and mixing probabilities) of the Gaussian positive
Rormula: Precision =TP/ (TP + predict1ions. predictions were actually correct
components in the GMM.
Recall(Sensitivity or FP)
How it Works:
positive cases were
True Positive Rate):
1. Initialization: Intialhze the parameters of the Gaussian components (e.g.., correctly predicted. Measures how many of the
It focuses on the
using k-means). positive cases.
model's abl1ty to find allactual
o Formula:
Recall=TP/ (TP + FN) the
2. Expectation (E)Step: Caleulate the responsibility of each data point to each . F1-Score: The
cluster. The responsibility s the probability that a data point belongs to a particular
cluster given the current parameters. oasure that considers harmonic mean of prec1sion
balance the trade-off both precision and recall It's and recall. It provides a balanced
3. Maximization (M) Step: Update the parameters of the Gaussian
components between precision and recall. particularly useful when you want
based on the responsibilities. This involves recalculating the means, variances, and o Formula: Fl-score =
2 * (Precision *
mixing probab1l1ties.
Model Accuracy between Model Recall) /(Precision +Recall)
4. Iteration: Repeat steps 2 and 3 until the
parameters converge. " Model Accuracy: Performance
The
Q.7. (a) What is a confusion matrix and why do you need it? Define precision, the total number of instances. overall proportion of correctly class1fied instances out of
recall, and Fl-score? Model accuracy or Model performance? Which one will
you prefer and why? o Formula: Accuracy = (TP +
(8) TN) /(TP+ FP +FN +TN)
Ans. Confusion Matrix: A confusion matrix is a table used " Model Performance: A
to evaluate the broader term that encompasses varnous
performance of a classifñcation model. It provides a detailed model's effectiveness, including accuracy, aspects of a
precision, recall, F1l-score, and other metncs
predictions compared to the actual outcomes. It's particularly breakdown of the model's depending on the specific problem.
classification problems where you want to understand not justuseful when dealing with
how many predictions Which to Prefer and Why?
are correct, but also what types of errors the model is
making. In most cases, model performance 1s a more
informative measure than just
The confusion matrix for a binary classification problem (two accuracy. Here's why:
positive and negative): classes, often called
" Imbalanced Datasets: Accuracy can be misleading in imbalanced datasets A
Actual Positive model might have high accuracy but perform poorly on the minor1ty class, which 19
Actual Negative often the class of greater interest.
Predicted Positive True Positive (TP) False Positive (FP) " Business Context: The choice of metric depends on the specific business
Predicted Negative False Negative (FN) True Negative (TN) objective. For example, in a medical diagnosis scenario, recall might be more important
True Positive (TP): The model correctly predicts the than precision because it's crucial to identify all actual positive cases (even if it means
" False Positive (FP): The model positive class. having some false positives).
known as a Type I error). incorrectly predicts the positive class (also " Trade-off between Precision and Recall: Often, there's a trade-off between
False Negative (FN): The model incorrectly precision and recall. You might need to choose a model that optimizes one metric over
known as a Type II error). predicts the negative class (also the other depending on the specific application. The Fl-score belps in balanc1ng th1s
trade-of.
" True Negative (TN): The model
correctly predicts the negative class. Q.7. (b) List the most popular distribution curves along with scenarios
Why You Need a Confusion Matrix
where you will use them in an algorithm. (7)
" Beyond Accuracy: A confusion matrix
gives
you a more nuanced view of model Ans. Some popular distribution curves and ther use cases in algorithms are:
performance than simple accuracy. It helps you understand where the model is strong
and where it struggles. 1. Normal Distribution (Gaussian):
" Error Analysis: By examining (a) Shape: Bell-shaped, symmetrical.
the
gain insights into the types of errors yourfalse positives and false negatives, you can (b) Use Cases:
model is making. This can guide you in
improv1ng the model (eg.. by add1ng more features or adjusting thresholds). " Many machine learning algorithms assume data is normally distributed (at least
Performance Evaluation for Imbalan ced approximately).
one class has many more examples Datasets: In situations where
can be m1sleadng. A confusion than the other (imbalanced datasets), " Used in linear regression, Gaussian Naive Bayes, and as a building block in
matrix allows you to use more accuracy Gaussian Mixture Models.
precision and recall. appropriate metrics hke
(AIMLIAIDS)

of Machine
Learning
IP. University-([Link].J-Akash Books 2024-27
hypothesis testing.
Fundamental

26-2024 Fourth
Semester. for
normal
distribution
UNIT-IV
relies on the
" Central Limit Theorem Q.8. (a) List the advantages and limitations of the Temporal Difference
2. Uniform Distribution:
values within a
range. Learning Method? (8)
probability for all
equal Ans. For Temporal Difference Learning Method: Refer to Q8 (b)-End Term
(a) Shape: Flat,
(b) Use Cases: Examination (May-June-2017).
numbers for simulations. classification.
Advantages of TemporalDifference (TD) Learning:
Generating random undersampling in
imbalanced
" random
Randomized
algorithms, like " Model-Free: TD learning can learn d1rectly from exper1ence without requ1rng
" in k-means
custering.
a model of the environment (transition probabilities).
cluster centers
" Assigning initial
success/ " Can Learn from Incomplete Episodes: Unlike Monte Carlo methods, TD
the probability of
3. Bernoulli Distribution:

possible outcomes (0 or 1), represents learning can update its estimates based on incomplete episodes, mak1ng it suitable for
(a) Shape: Two continuing tasks or tasks where episodes can be very long.
failure of a single trial. " Computationally Efficient: TD updates are computationally less expens1ve
(b) Use Cases: than Monte Carlo updates.
variables.
" Modeling binary random " Handles Non-stationarity: Can adapt to changes in the env1ronment more
distribution. easily than some other reinforcement learning methods.
" Foundation for binomial classification algorithms.
regression and other binary Limitations of TD Learning:
" Used in logistic
4. Binomial Distribution: " Bias: TD learning estimates can be biased, especially with funct1on
independent Bernoulli
probability of k successes in n approximation, unlike Monte Carlo methods which are unbiased.
(a) Shape: Represents the " Variance: While less variance than Monte Carlo, TD estimates can still have
trials.
(b) Use Cases:
sign1ficant variance, especially with off-policy learning.
trials.
successes in a fixed number of " Local Optima: Like other iterative optimization methods, TD learn1ng can get
Modeling the number of
some classification scenarios. stuck in local optima.
" Used in binomial tests and in
Poisson Distribution:
" Parameter Tuning: Requires careful tuning of learning rates and other
parameters.
given number of events occurring in
(a) Shape: Represents the probability of a Q.8. (b) Explain Markov Decision Process and use of Bellman equations in
interval of time or space if these events occur with a known average rate and
a fixed learning? (7
independently of the time since the last event.
Ans. For Markov Decision Process (MDP): Refer to Q9 (a)-July 2023 (End
(b) Use Cases: Term Examination)
customer arrivals).
" Modeling rare events (e.g., website visits, For Bellman Equations: Refer to Q7 (a) of End term Exam1nation 2017
" Used in queuing theory and risk management. (Pg. No. 20-2017).
6. Exponential Distribution: Q.9. (a) What do you mean by Associative Rule Mining (ARM)? Explain
(a) Shape: Describes the p.obabil1ty of the time between events in a Poisson process. Apriori algorithm with real life example of some dataset. (8)
Ans. Associative Rule Mining (ARM): ARM is a data mining technique used
(b) Use Cases: to discover relationships between variables in large datasets. It identifies sets of items
" Modeling waiting times. that frequently occur together. These relationships are expressed as association rules.
" Used in survival analysis. " Example: {Bread, Milk} > {Butter) (Customers who buy bread and milk also
7. Gamma Distribution: tend to buy butter).
(a) Shape: Aflexible distribution that can take various shapes depending on its Apriori Algorithm: Apriori is a classic algorithm for ARM. It uses a level-wise
parameters. approach to find frequent itemsets (sets of items that appear frequently in the dataset)
(b)Use Cases: and generate association rules.
" Modeling waiting times (more general than exponential). " Steps:
" Used in Bayesian statistics. 1. Support Calculation: Calculate the support (frequency) of each itemset.
2. Candidate Generation: Generate candidate itemsets of length k+1 from
frequent itemsets of length k.
(AlMI/AIDS)

Learning
Fundamental of Machine subsets
LP. University-(B Tech J-Akash Bsks
Semester, ntrequent
Ourth have 3. Embedded Methodw:
itemsets that itemsets are found
Canddae frequent
3 Pruning: Remve no more itemsets " These methods incorporate feature seletin int the
steps 2and 3 unt1l the frequent mtel tran4 pr
4 lteration: Repeat: rules tirom
" Regularization (L1 and L2): L! regula rizats la
s arnk te
asseaton

Generate
5. Rule Generation: coefficents of les5 Important features to er effectively perfrnz batre
" Reaife xample: transactions L2 regular1zat1on (Ridge) shr1nk the eefficents warda er but elets
rarely n
Mlk | | 2 | | Bread,
customer
supermarket
dataset wth them ent1rely
||1|Bread.
aer & l*.fo. |5| M
Bought | M1lk, lDiapers | . 'Tree-based Models (Random Forest, XGBst): These miels
Trarutuva ID Items
Dapers Beer
4 | Bread, prrde
apers Reer 3 Mk feature mportance scores based on how often each feature s used for ptting es
Dhapers Reer D1apers, Beer} and generate n the trees.
like (Milk,
Aprnn would dentan
quent iteemsets 4. Other Considerations:
b..
rules hke diapers also tend to
Beer) Customers who buy milk and " Domain Expertise: Consult wath domain experte to dentify features that are
Dapers -> lhkely to be relevant
" Mk
working on a data
beer variables while " Feature Engineering: Creat1ng new features from existung cnes can sometunes
do vou select important logistic regression
9. (b) How evaluate a improve model performance
and How would you (7)
set? What is log likelibood
model? Choosing the nght
Log-Likelihood
(Feature Selection):
Important V'ariables Including
learning models. " In statistics, the lhkel1hood function measures bow well a statistical model fts
Ans Selecting effective and ethcient machine
features s Tucal for buulding increased computational
cost a set of observed data. It quant1fes the probabilaty of observing the data ven the
overfitting.
features can lead to common feature selection
model's parameters
TElevant or redun dant breakdown of
nterpretabil1tv Here's a
aDd reduced model " The log-lhkehhood 1s simply the natural iogarnthm of the likelubood funton Its
technaques often used because.
1 Filter Methods:
the relevance of features " It's mathematically easier to work wth eg,*hen marminng the lhkeibood
statistical measures to evaluate
" These metbods use " The logarithm is a monotonic function, s0 manmizng the log-likelihood is
independent of the chosen model numerical equivalent to maximizing the likelihood
the l1near relationship between
" Correlation An aly sis: Measures correlation
High suggests a strong relationship. " It can prevent underfiow 1ssues when dealing with very small probabltes
features and the target varnable relationships.
causation, and it only captures linear Evaluating a Logistic Regression Model: Lognst1c regression 1s a class1ñcataon
However, correlataon doesntimply
between categorical features and algor1thm, so we use spec1fic evaluation metrcs
" Chi-Square Test: Measures the dependence
the target varable It's useful for classification problems 1. Confusion Matrix: A table summarnz1ng the counts of true pos1tives true
means of a numercal feature negat1ves, false positives, and false negat1ves
" ANOVA(Analvsis of Variance): Compares the
d1fferent categor1es of the target vanable. It's used when the target var1able is 2. Accuracy: The overall proportion of correctly class1âed instances However.
across
categorical and the features are numerical accuracy can be m1sleading for imbalanced datasets
the
" Information Gain: Measures the reduct1on in entropy (uncertainty) of 3. Precision: The proportion of true pos1t1ves among the pred1cted pos1tIves High
problems.
target varnable when a particular feature 1s used It's used for classification prec1sion means the model is good at not label1ng negat1ves as pos1tives.
2. Wrapper Methods: Precision = TP /(TP + FP)
" These methods use a spec1fic model to evaluate the relevance of features They 4. Recall (Sensitivity or True Positive Rate): The proportion of true pos1tives
search through different subsets of features and evaluate the performance of the model among the actual pos1tives High recall means the model is good at ind1ng all the actual
on each subset positives

" Recursive Feature Elimination (RFE): Recurs1vely removes the least " Recal) = TP/(TP +FN)
important features based on the models performance It starts with all features and 5. Fl-score: The harmonic mean of prec1sion and recall. It balances the trade-off
iterat1vely elim1nates features until the des1red number of features 18 reached between precision and recall
" Forward Selection: Starts wath no features and iteratively adds the most [Link] 2* (Prec1sion *Recall) (Precision + Recall)
important feature until the desired number of features 1s reached
" Backward Elimination: Starts with all features and iterat1vely removes the
LAUC-ROC (Area Under the Receiver Operating Characteristiccurve): A
least imn portant feature until the desired number of features 18 reached
measur of the nodels ability to d1stinguush between classes at different class1ication
threshol An AlC of lis perfect, and 05 1s random guessing
Machine Learning (AIML/AIDS)
30-2024 Fourth Semester, Fundamental of
models. Higher log-likelihood
. Log-Likelihood: Can be used to compare different criterion for model
nerally suggests abetter ft to the data. However, it's not the sole
select1on
æ. Hosmer-Lemeshow Test: A statistical test to assess the goodness of fit of the
logistic regression model. A non-significant p-value indicates a good fit. However. this
test has some limitations and is not always used.
4. Cross-Validation: Tech nigues like k-fold cross-validation are used to
how well the model generalizes to unseen data and to avoid overfitting.
Which Metrics to Use?
"The choice of evaluation metrics depends on the specific problem and business
context.

For balanced datasets, accuracy might be a reasonable starting point.

For imbalanced datasets, precision,recall, and Fl-score are more informative.
If the cost of false positives and false negatives is different, you might need to
optimize for a specific metric (e.g., maximizing recall in medical diagnosis).
AUC-ROC provides a good overall measure of
need to consider different classification thresholds. performance, especially when you
LP. University- ([Link])-Akash Books 2023-1]

Q.1. (c) Write down the major difference hetween K-means clustering and
END TERM EXAMINATION[JULY-2023] hierarchical clustering. (2.5)
FOURTH SEMESTER ([Link]] Ans. The major differences between K-means clustering and hierarchical clustering
are as follows:
FUNDAMENTAL OF MACHINE LEARNING [AIMLIAIDS-210]
Time: 3 Hrs. Max. Marks: 75
1. Algorithm Type:
" K-means Clustering: K-means is a partitioning-based clustering algorithn It
Note: Attempt five questions in all including Q. No. 1 which is compulsory. aims to partition a dataset into K distinct, non-overlapping clusters by mirnimizing the
Select one question from each unit. Only scientiic calculator are allowed. within-cluster variance.
Q.1. Answer all the following with precise justification: " Hierarchical Clustering: Hierarchical clustering is an agglomerative or div1sive
clustering algorithm. It builds a hierarchy of clusters by either merging individual data
Q1. (a) State any five examples of machine learning application. (2.5)
points into clusters (agglomerative) or splitting clusters iteratively (divisive).
Ans. 1. Image Recognition and Classification: Machine learning algorithms 2. Number of Clusters:
are used extensively in image recognition and classification tasks, such as identifying " K-means Clustering: The number of clusters (K) needs to be specified heforehand
objects in photographs, medical image analysis, facial recognition systems, and in K-means clustering. It partitions the data into exactly K clusters.
autonomous vehicle navigation.
2. Natural Language Processing (NLP): NLP applications utilize machine " Hierarchical Clustering: Hierarchical clustering does not require specifying
learning techniques to understand, interpret, and generate human language. Examples the number of clusters beforehand. It produces a hierarchical tree-like structure
(dendrogram), and the number of clusters can be determined based on the structure of
include sentiment analysis, language translation, chatbots, and text summarization.
the dendrogram or hy using methods like the elbow method or silhouette sCore.
3. Recommendation Systems: Machine learning powers recommendation 3. Result Interpretation:
systems used by companies like Amazon, Netflix, and Spotify to suggest products, " K-means Clustering: The result of K-means clustering is a hard assignment of
movies, music, and other content based on users' nast behavior and preferences.
data points to clusters. Each data point belongs to exactly one cluster.
4. Predictive Analytics: Machine learning is used in predictive analytics to Hierarchical Clustering: Hierarchical clustering provides a hierarchical
forecast future trends and behaviors based on historical data. This includes applications representation of clusters, allowing for both hard and soft assignments of data points
in fnance for stock market prediction,in healthcare for disease diagnosis and prognosis,
and in manufacturing for predictive maintenance. to clusters. It offers more lexibility in interpreting the relationships between clusters.
Q.1. (d) What is Reinforcement Learning. (2.5)
5. Fraud Detection: Machine learning algorithms are employed in fraud detection
systems to identify anomalous patterns and detect fraudulent activities in various Ans. Reinforcement learning (RL) 0s a type of machine learning paradigm where an
domains, such as banking transactions, insurance claims, and cybersecurity. agent learns to make decisions by interacting with an environment. The agent learns to
achieve a goal or maximize a cumulative reward by taking actions in the environment.
Q.1. (b) State any two model combination schemne to improve the accuracy In reinforcement learning, the agent receives feedback from the environment in the
of a classifier. (2.5)
form of rewards or penalties based on the actions it takes. The goal of the agent is to
Ans. Two model combination schemes commonly used to improve the accuracy of learn a policy, which is a mapping from states toactions, that maximizes the cumulative
a classifier are: reward over time.
" Ensemble Learning: Ensemble learning involves combining multiple base The key components of reinforcement learning are:
classifers to build a stronger and more robust model. There are several ensemble environment. It
1. Agent: The learner or decision-maker that interacts with the
methods, including: makes decisions based on the current state of the environment and the
learned policy.
" Bagging (Bootstrap Aggregating): In bagging, multiple instances of the 2. Environment: The external system with which the agent interacts. It provides
base classifier are trained on different subsets of the training data (sampled with on the actions taken by
feedback to the agent in the form of rewards or penalties based
replacement), and their predictions are aggregated, often by averaging or voting. the agent.
" Boosting: Buosting sequentially trains multiple weak learners, with each of the environment that the agent
3. State: The current situation or configuration
subsequent model focusing on the mistakes of the previous ones. Examples of boosting the agent to make decisions about which action to take.
perceives. It helps
algorithms include AdaBoost, Gradient Boosting Machines (GBM), and XGBoost. agent at each state. The agent selects actions
4. Action: Thechojces available to the
" Random Forest: Random Forest is an ensemble learning method that constructs based on the current state and the learned policy.
multiple decision trees during training and combines their predictions through averaging feedback provided by the environment to
the agent
or voting. Each tree is trained on a random subset of the features and training data. 5. Reward: The immediate the action was in the given state.
how good or bad
leading to diverse models that reduce overfitting and improve accuracy. ater taking an action. It indicates
12-2023 Fourth Semester, Fundamentals of Machine Learning (AIMLAIDS) LP. University-[BTech)-Akash Books 2023-13
Q.1. (e) Define (a) Decision trees (b) Imbalance Data. (2.5) color, and texture, along with labels indicating whether they are apples or
oranges.
Ans. (a) Decision Trees: Decision trees are a popular supervised machine learning Using this labeled dataset, a supervised learning algorithm can learn to classify new
algorithm used for classification and regression tasks. They recursively split the data fruits based on their features, distinguishing between apples and oranges.
based on features to create a tree-like structure where each internal node represents Unsupervised Learning: Unsupervised learning is a type of machine learning
a decision based on a feature, eauh branch represents an outcome of that decision, where the algorithm learns patterns and structures from unlabeled data, meaning the
and each leaf node represents a class label (in classification) or a predicted value (in training data does not have any associated labels or outcomes. The goal of unsupervised
regression). The goal of a decision tree algorithm is to create a model that predicts the learning is to explore and discover hidden patterns, relationships, or structures in the
target variable by making decisions based on the features. Decision trees are easy to data without guidance or supervision.
interpret and understand, and they can handle both numerical and categorical data. Example: Clustering is a common task in unsupervised learning For example,
(b) Imbalanced Data: Imbalanced data refers to a situation in a classification consider a dataset containing various customer attributes such as age, income, and
problem where the distribution of class labels in the training data is skewed, meaning spending habits, but without any labels indicating customer segments. Using clustering
one class has significantly more instances than the others. This can lead to biased models algorithms,unsupervised learning can group similar customers together based on their
that perform poorly on predicting the minority class, as the model may prioritize the attributes, revealing distinct segments or clusters in the data, such as high-income
majority class due to its higher representation in the data. Imbalanced data is common spenders, budget-conscious individuals, and middle-aged savers.
in real-world scenarios, such as fraud detection, rare disease diagnosis, or anomaly
detection. Addressing imbalanced data requires special handling techniques such as Distinguishing Factors:
resampling methods (e.g., oversampling, undersampling), algorithmic approaches (e.g., 1. Data Labels: In supervised learning, the training data has labeled examples,
cost-sensitive learning, ensemble methods), or using evaluation metrics that are robust while in unsupervised learning, the data is unlabeled.
toimbalanced classes (e.g., precision, recall, 2. Goal: Supervised learning aims to learn a mapping from input features to output
Fl-score).
Q.1. () Define (a) Classification (b) Q Learning. labels or predictions, whileunsupervised learning aims to explore and discover hidden
(2.5) patterns or structures in the data without explicit guidance.
Ans. (a) Classification: Classification is a type of supervised machine
learning
task where the goal is to predict the category or class label of a given input based 3. Tasks: Supervised learning tasks include classification, regression, and
features. In classification, the training data consists of labeled examples, where oneachits prediction, where the algorithm learns to make predictions based on labeled data.
example is associated with a class label. The objective is to learn a model or classifier Unsupervised learning tasks include clustering, dimensionality reduction, and anomaly
that can accurately assign class labels to unseen instances based on their features. detection, where the algorithm discovers patterns or structures in unlabeled data.
Common types of classification tasks include binary classification (where there are only Q.2. (b) List and explain the steps to design a learning system in detail.
two classes), multi-class classification (where there are more (7.5)
than two classes), and
multi-label classification. Ans. Designing a learning system involves several steps to ensure the development
(b)Q-Learning: Q-learning isa model-free reinforcement learning algorithm used of an effective and effcient system that meets the desired objectives. Here are the steps
to learna policy for making decisions in an environment with along with detailed explanations:
discrete states and actions.
It is based on the concept of learning the quality (Q-value) of taking a 1. Define Objectives and Scope:
in a given state, and it aims to find the optimal policy that maximizesparticular action
the
reward over time. Q-learning operates by iteratively updating the Q-valuescumulative " Explanation: Begin by clearly defining the objectives of the learning system
Bellman equation, which represents the expected future reward for taking anusing the and its scope. Understand what problem the system aims to solve, what tasks it needs
a state and then following the optimal policy thereafter. Q-learning is action in to perform, and what the desired outcomes are. Define the boundaries of the system to
environments with discrete state and action spaces and has been successfullywell-suited for determine its limitations and constraints.
to various tasks, including robotic control, game playing, and applied 2. Data Collection and Preparation:
optimization problems. " Explanation: Gather relevant data required for training and evaluation of the
UNIT - I
Q.2. (a) Distinguish between supervised learning and learning system. This includes identifying sources of data, collecting or generating
a
learning. Illustrate with example. unsupervised datasets, and preprocessing the data to clean, normalize, and transform it into
(7.5) suitable format for learning algorithms. Ensure the data is representative, diverse, and
Ans. Supervised Learning: Supervised learning is a type of
where the algorithm learns from labeled data, meaning each machine learning adequately covers the problem domain.
of input features along with their corresponding target labels
training example consists [Link] Selection and Engineering:
or outcomes. The goal features from the collected data
of supervised learning is to learn a mapping from input " Explanation: Select or engineer appropriate
features to output labels or
predictions, based on the patterns present in the labeled data. The algorithm that are relevant and informative for the learning task. This involves analyzing the
generalizes relevant attributes, and transforming or creating new features to
from the training data to make predictions on unseen data. data, identifying
algorithms. Feature selection and engineering
Example: Classification is a typical task in supervised learning. improve the performance of the learning the learning system.
For instance, effectiveness of
consider a dataset containing information about various fruits such as their weight. play a crucial role in determining the
Design:
4. Algorithm Selection and Model
14-2023 Fourth Semester, Fundamentals of Machine Learning (AIMLLAIDS) LP. University-[[Link])-Akash Books 2023-15

" Explanation: Choose suitable learning algorithms and design the model 3. Algorithmic Perspective: From an algorithmic perspective, machine learning
architecture based on the nature ofthe learning task, data characteristics, and objectives involves designing and implementing algorithms that can effciently learn patterns
of the system. Consider factors such as scalability, interpretability, complexity, and and relationships from data. This includes developing various learning paradigms
computationalresources. Experiment with different algorithms and model architectures (e.g.,supervised learning,unsupervised learning, reinforcement learning), algorithmic
toidentify the most suitable approach. techniques (e.g., decision trees, neural networks, support vector machines), and
5. Training and Evaluation: optimization strategies (e.g., gradient descent, evolutionary algorithms).
Explanation: Train the learning model using the prepared data and evaluate Issues in Machine Learning:
its performance using appropriate evaluation metrics and validation techniques. Split 1. Data Quality and Quantity: Machine learning heavily relies on data, and
the dataset into training, validation, and test sets to assess the model's generalization the quality and quantity of data can significantly impact the performance of learning
ability and avoid overfitting. Fine-tune the model parameters and hyperparameters to algorithms. Issues such as missing values, noise, outliers, and imbalanced data can
optimize its performance. affect the accuracy and reliability of machine learning models.
6. Validation and Testing: 2. Overfitting and Underfitting: Overfitting occurs when a model learns the
Explanation: Validate the trained model using separate validation data to training data too well and performs poorly on unseen data, while underfitting occurs
ensure its robustness and generalization ability. Perform thorough testing of the model when a model is too simple to capture the underlying patterns in the data. Balancing
on unseen test data to evaluate its performance in real-world scenarios. Validate the between overfitting and underfitting is a critical challenge in machine learning,
model's outputs against ground truth or human judgments to identify any discrepancies requiring techniques such as regularization, cross-validation, and model selection.
or errors.
3. Bias and Fairness: Machine learning models may exhibit biases or discrimination
7. Deployment and Integration: based on the characteristics of the data used for training. Biases in data or algorithms
" Explanation: Deploy the trained model into the target environment or can lead tounfair or discriminatory outcomes, posing ethical and social implications.
system Addressing bias and fairness issues in machine learning is an ongoing challenge that
and integrate it with the existing infrastructure or applications. Ensure
compatibility, requires careful consideration and mitigation strategies.
scalability, and reliability of the deployed model. Monitor the performance of the
deployed model in production and update it as needed to adapt to changes in data or 4. Interpretability and Explainability: Many machine learning models,
requirements. especially complex ones like deep neural networks, are often considered black
8. Monitoring and Maintenance: boxes, making it challenging to interpret their decisions and predictions. Ensuring
" Explanation: Continuously monitor the performance of the deployed model,
interpretability and explainability of machine learning models is crucial, especially
collect feedback, and gather additional data for retraining or fine-tuning as needed. in high-stakes applications such as healthcare and finance, where transparency and
accountability are essential.
Implement monitoring mechanisms to detect anomalies, drifts, or degradation in
model performance. Regularly update and maintain the learning system to ensure its Q.3. (b) Discuss role of Machine Learning in Fraud detection , medical
diagnosis and email spam detection. (7.5)
effectiveness and relevance over time.
Q.3. (a) What is Machine Learning? Explain different perspectives Ans. Machine learning plays a crucial role in fraud detection, medical diagnosis,
issues in Machine Learning. and and email spam detection by leveraging data-driven algorithms to identify patterns,
(7.5)
Ans. Machine learning is a subfield of artificial intelligence (AI) that focuses anomalies, and predictive indicators. Here's how machine learning is applied in each
on of these domains:
developing algorithms and techniques that enable computers to learn from data and 1. Fraud Detection:
improve their performance on aspecific task without being explicitly programmed.
In machine learning, algorithms learn patterns and relationships
from data " Role: Machine learning algorithms are used to analyze transaction data, user
experience, allowing them to make predictions, decisions, or recommendationsthrough
based
behavior, and other relevant information to detect fraudulent activities or suspicious
on neW or unseen data. patterns.
where models
Different Perspectives in Machine Learning: " Approaches: Supervised learning techniques are commonly used,
labeled examples of both fraudulent and legitimate transactions to learn
1. Statistical Perspective: From a statistical
perspective, machine learning are trained on
unsupervised learning or
algorithms are viewed as tools for estimating and modeling the underlying the patterns of fraud. Anomaly detection methods, such as
probability
distributions of data. These algorithms aim to minimize errors or discrepancies semi-supervised learning, are also employed to identify unusual or
irregular activities
observed data and predicted outcomes by learning from sample data. between that deviate from normal behavior.
analyze transaction history,
2. Computational Perspective: From a
computational perspective, machine " Examples: Credit card fraud detection systems
potentially fraudulent transactions.
learning algorithms are considered optimization problems where location, spending patterns, and other factors to flag
the goal is to fnd learning to detect fraudulent claims by
the optimal parameters or configurations that minimize a certain
objective function, Similarly, insurance companies use machine identify suspicious
historical data to
such as error or loss function. These algorithms often
involve iterative optimization analyzing claim details,customer information, and
techniques to converge to the best solution. patterns.
16-2023 Fourth Semester, Fundamentals of Machine
Learning (AIMLAIDS) I.P. University-([Link])-Akash Books
2. Medical Diagnosis: 2023-17
" Role: Machine learning aids healthcare " Examples of regression tasks include predicting house
professionals in diagnosing diseases, ke size, location, and number of bedrooms, prices based on features
predicting patient outcomes, and recommending treatment plans by analyzing medica| ales revenue. forecasting stock prices, and estimating
data such as patient records, diagnostic tests,
medical images, and genomic data. " Popular algorithms for regression include linear
"Approaches: Supervised learning algorithms are often used regression, polynomial regression,
where mnodels are trained on labeled datasets of for medical diagnosis, lecision trees, random forests, and neural
patient records and corresponding Now, let's discuss the networks.
diagnoses. Deep learning techniques, particularly convolutional methods used to learn multiple classes for a K-class
(CNNS) and recurrent neural networks (RNNS), are effective neural networks Jassification problem:
images (e.g., X-rays, MRIs, CT scans) and time-series data for analyzing medical 1. One-vs-Rest (OvR) or 0ne-vs-All (OvA):
vital signs). (e.g., electrocardiograms,
" In the OvR approach, a separate binary classifier is
" Examples: Machine learning algorithms assist in t as the positive class and the rest as the negative trained for each class, treating
cancer, diabetes, cardiovascular disorders, and neurologicaldiagnosing diseases such as class.
conditions. For instance. " During prediction, the class with the highest
image recognition algorithms can detect abnormalities in medical images. confidence score among all binary
whil classifierS 1S Selected as the final prediction.
predictive models can assess the risk of developing certain diseases " For example, in a 3-class classification problem, three
factors and lifestyle habits. based on genetic binary
trained: one for class 1 vs. not class 1, one for class 2 vs, not class 2, and classifiers are
one for class 3
3. Email Spam Detection: Vs. not class 3.
" Role: Machine learning is used to classify
incoming emails as either spam or [Link]-vs-One (Ov0):
legitimate (ham) by analyzing the content, metadata, sender information, end "In the Ov0 approach, a binary classifier is trained for each pair of classes.
behavior patterns. user
" Approaches: Supervised learning algorithms, such as Naive Bayes, support yotes isDuring
" prediction, each classifier votes for one class, and the class with the most
vector machines (SVM), and logistic regression, are commonly employed selected as the final prediction.
for email spam
detection. These algorithms learn from labeled datasets of spam and non-spam " For example, in a 3-class classification problem, three binary classifhers are
emails kvoined: one for class 1 [Link] 2, one for class 1 vs. class 3, and one for
to identify common characteristics and features associated with class 2 vs. class
spam messages.
" Examples: Email service providers use machine learning-based spam filters to
automatically route suspicious emails to the spam folder, protecting users from phishing [Link] Multiclass Classification:
attempts, malware distribution, and unwanted advertising. These filters continuously "Some algorithms, such as decision trees, support multiclass classification directly
adapt and evolve based on user feedback and new spam patterns. without the need for binary classifiers.
UNIT-II " These algorithms can directly handle multiple classes and partition the feature
Q4. (a) Using example of your own Compare Classification with space intoregions correspondng to each class.
regression? Also Explain the methods used to learn multiple classes for a K Q.4. (b) Describe the random forest algorithm to improve classifier
class Classification Problem.
(7.5) accuracy. For the following set of training samples, find which attribute can
Classification: be chosen as root for decision tree classification. (7.5)
" Classification is a type of supervised learning where the goal to predict the
categorical class labels of new instances based on the features present in the data. Instance Classification al a2
T
"In classification, the output variable is discrete and represents different classes 1

or categories. 2 T T
"Examples of classification tasks include spam detection (classifying emails as 3
F

spam or non-spam), image recognition (classifying images into different categories), and 4 F F
sentiment analysis (classifying text as positive, negative, or neutral). T
5
" Popular algorithms for classifcation include decision trees, logistic regression, T
support vector machines (SVM), k-nearest neighbors (KNN), and neural networks. 6
Regression:
powerful ensemble learning
algorithm used for classification
" Regression is also a type of supervised learning, but instead of predicting discrete Ans. Random Forest is a constructing a multitude of decision trees during
works by
class labels, it predicts continuous numeric values. and regression tasks. It the mean
classes (classification) oralgorithm
uraining and outputs the
class that is the mode of the Random Forest
Here's how the
" In regression, the output variable is continuous and represents a quantity or Prediction (regression) of
the individual trees.
value.
Works to improve classifier accuracy:
LP. University-|B Tech)-Akash Books 2023 19
18-2023 Fourth Semester, Fundamentals of Machine Learning (AIMIJAIDS)
1. Random Sampling: Random Forest starts by creating multiple decision tres
Each tree is trained on a different subset of the training data, selected randomly with
replacement. This process is known as bootstrapping. This sampling ensures diversjty =1- (0.5 (-0.667) (-0.5849) - 0.3333 (- 15849)
among the trees, as each tree sees a slightly different portion of the data. +0.5((-0.333) (-1.5849) (0.6667X-0.551
2. Decision Tree Construction: Each decision tree is constructed by recursivelv =1-|(0.1949 + 0.2641) +(0.2641 +0.1949))
partitioning the feature space into smaller regions, where each region corresponds to =1-2(0.1949 + 0.2641)
a specific class label. The splits are made based on the feature that provides the best =1-0.9164
separation of classes at each node, considering only the random subset of features.
= 0.836
3. Ensemble Learning: By combining the predictions from multiple decision
trees, Random Forest reduces the variance and tends to generalize well to unseen data Information Gain (a,) = 0.836
This ensemble learning approach helps toimprovethe overallaccuracy of the classifer
4. Hyperparameter Tuning: Random Forest offers various hyperparameters Information Gain (a,) = 1
=1-1=0
that can be tuned to optimize performance, such as the number of trees in the forest
the maximum depth of each tree, and the size of the random feature subsets. Tuning Since information Gain (a,) > Gain (a,)
these hyperparameters using techniques like cros-validation can further enhance the So that a, is selected root node.
accuracy of the classifier. Q.5.(a) Write Bayes theorem. What is relationship between Bayes theorem
5. Out-of-Bag Error Estimation: Random Forest provides an estimate of and problem of concept learning. (7.5)
generalization error during training through out-of-bag (0OB) samples. Since each tree Ans. Bayes' Theorem is a fundamental concept in probability theory and statistics.
is trained on a bootstrapped sample, there will be data points not used in training for It provides a way to update our beliefs about the probability of an event occurnng
each tree. These 0OB samples can be used to estimate the model's performance without based on new evidence or information. Bayes' Theorem is expressed mathematically as
the need for a separate validation set. follows:
From the given dataset P(A|B) = PB)P(B|A) x P(A)
Entropy (S) = E-pi - log, pi Where:
Here (3+, 3-) " P(A|B) is the probability of event A occurring given that event B has occurred
Since the number of position and negative instances are same. This is called the posterior probability.
A has occurred.
SoEntropy of entire system (S) = 1 " P(B|A) is the probability of event B occurring given that event
Now, Thisis called the likelihood.
occurring independently of
" P(A) and P(B)are the probabilities of events A and B
|ST Entropy(S,) probabilities.
Information Gain (S, a) = Entropy(S) each other. These are called the prior
problem of concept learning.
The relationship between Bayes' Theorem and the
Now Entropy for a, Ve values (a)

Relationship with Concept Learning:

Information Gain (S, a,)= E(S) SulEntropy (a) of concept learning, Bayes' Theorem can
1. Bayesian Learning: In the context hypotheses (or concepts) given observed
Ve (T,F) of different
be used to update the probabilities hypothesis being true.
about the probability of each
data. Initially, we have prior beliefs beliefs using Bayes Theorem to obtain posterior
Information Gain (S, a,) = E(S) LEntropyla,)
Ve (T.F)
As we observe data, we update
these
probabilities.
Integration: Just as in Bayes'
Theorenm, wherePA)represents
For a, For a, [Link] Knowledge concepts can be integrated
learning, prior knowledge about
(3T, 3F) [4T, 2F] prior knowledge, in concept knowledge can be in the form of assumptions about
This prior relationships between
True ’ (2+, 1-] True ’ (2+, 2-1 into the learning process. complexity concept, or the
of the
the
the distribution of data,
False ’ [1+, 2-1 False - (1+, 1-J different features. termPB| A)
In concept learning, the likelihood
Entropy of y (T) = 1 Likelihood Estimation: particular hypothesis
3.
probability of observing
the data given a
corresponds to the training data and the assumed
Entropy of y (F) = 1 based on the
likelihood is estimated
because no of value are same. or concept. This
distribution of the data given the concept.
Gain (a,) = 1-Entropy
6 (True)- 3 Entropy(False)
LP. University-([Link]-Akash Books
20-2023 Fourth Semester, Fundamentals of Machine Learning 2023-21
(AIMAIDS)
UNIT-III
4. Posterior Probability Update: Baves' Theorem allows us to Q.6. (a) Compare K means clustering with Hierarchical
about hypotheses based on observed data. Similarly, in concept update our beliefs Clustering
our beliefs about the probability of different concepts being learning, we update Techniques. Explain the basic elements of a Hidden Markov model(HMM) List
data. This helps in refining the learned concepts and true based on the observed any two application of HMM. (7.5)
on unseen data. making more accuratepredictions Ans. K"means Clustering vs. Hierarchical Clustering:
5. Model Selection:Baves'Theorem can be used K-means Clustering:
for model selection by comparing
the posterior probabilities of different models given the " Objective: To partition data into k clusters where each data point belongs to
concept learning,different hypothesis spaces or models canobserved data. Similarly, in cluster with the nearest mean, serving as the cluster's centroid. the
be compared based on their
posterior probabilities, allowing us to select the most probable concept given the data. " Algorithm:
Q.5. (b) Consider the training data in the following 1. Initialize k centroids randomly.
class attribute. In the table, the Humidity attribute table where Play is a
2. Assign each data point to the nearest centroid to form clusters.
H" (for high), Sunny has values y (for yes) or has values "L" (for low) or
"S" (for strong) or "W" (for weak), and Play has N" (for no), Wind has values Recalculate the centroids as the mean of points in each cluster.
values "Yes" or "No". (7.5) 4. Repeat steps 2 and 3 until convergence.
Humidity Sunny Wind May " Scalability: Better suited for large datasets due to its computational efficiency.
N S No Number of Clusters: Requires the number of clusters (k) to be specifed
H
H
H
L
What is class label for the following day
z W

W
Yes
Yes
Yes
No
beforehand.
Hierarchical Clustering:
" Objective: To create a hierarchy of clusters by recursively merging or splitting
clusters based on their proximity.
" Types:
according to naive Bayesian classification?(Humidity =L, Sunny =N, Wind =W), " Agglomerative: Start with individual data points as clusters and merge them
Ans. First, we need to calculate the prior probabilities iteratively.
for each class: " Divisive: Start with one cluster containing all data points and split them
P(Play = yes) = 3/5 iteratively.
P(Play = no) = 2/5
Dendrogram: Graphical representation showing the merging or splitting of
Next, we need to calculate the likelihoods of each clusters.
attribute value given each class:
P(Humidity = 1| Play =yes) =0 " Flexibility: No need to specify the number of clusters beforehand, suitable for
PHumidity = l | Play = no) =1 exploring hierarchical structures in the data.
P(Sunny = N | Play = yes) = 2/3 " Interpretability: Provides insights into the relationships between clusters at
P(Sunny =N | Play =no) = /3 different levels of granularity.
P(Wind = W|Play = yes) = 2/2 = 1 Basic Elements of a Hidden Markov Model (HMM):
P(Wind =W | Play =no) = 0/2 = 0 A
Hidden Markov Model (HMM) is' a statistical model used to model sequences of
Using the Naive Bayes formula, we can observations by assuming the existence of an underlying sequence of hidden states.
each class: calculate the posterior probabilities for Basic elements of an HMM include:
PPlay = yes | Humidity = L, Sunny = N, 1. States (Hidden): A set of hidden states S = (S1, S2, ..., SN), representing the
=L|Play = yes) * P(Sunny = N| Play = yes)Wind = W) = P(Play = yes) * P(Humidity
* P(Wind = W| Play = yes) = (3/5) * 0* underlying system that generates the observed data. These states are not directly
(2/3) * 1 =0 observable.
.. OM),
PPlay = no | Humidity = L, Sunny = 2. Observations (Emissions): A set of observations 0 = {01, 02,
L|Play = no) * P(Sunny =N|Play = no) *N, Wind = W)= P(Play = no) * P(Humidity = representing the observable outputs associated with each
hidden state.
P(Wind = W| Play = no) = (2/5) * 1 * (/3)
*0= 0 where Aij represents the
3. Transition Probabilities: A matrix A of size N x N,
Therefore. If both "yes' and "no` cases These probabilities define the
probability of transitioning from state Si to state Sj.
that the model hasn't seen any training have zero posterior probabilities, it suggests dynamics of the hidden states over time.
to either class. This might indicate a examples with the given features that belong of size N x M, where Bij represents the
problem with the training data or the model's
assumptions not holding true for the given 4. Emission Probabilities: A matrix B
data. In such cases, it's essential to revisit probability of emitting observation Oj
from state Si.
the data preprocessing steps, feature
selection, and the Naive Baves model itself to
improve its performance.
LP. University-[[Link])-Akash Books 2023-23
22-2023 Fourth Semester, Fundamentals of Machine Learning (AIMLAIDS)
5. Initial State Distribution: A probability distribution over the initial hidden Step 2: (Repeat): Assign Points to Clusters
states, denoted by I = (n1, I2, ..., nN), where zi represents the probability of Recalculate the distances from each point to each centroid:
in state Si. starting "For ml = 2.5:
Applications of Hidden Markov Models (HMM): Distance from 2.5:0.5 (|2.5 - 2|)
Hidden Markoy Models find applications in various fields, including: Distance from 62.0: 59.5 (|62.0 - 2|)
"For m2 = 62.0:
1. Speech Recognition: HMMs are widely used in speech
recognition systems to
model the temporal dependencies in speech signals. Each phoneme Distance from 2.5: 59.5 (|62.0- 2.5|)
can be modeled as a Distance from 62.0: 0 (it self)
hidden state, and the observed acoustic features correspond to the emitted
observations. Assign points to the nearest centroid:
2. Bioinformatics: HMMs are used in bioinformatics for
tasks such as gene "Cluster 1: (2, 3)
prediction, sequence alignment, and protein structure prediction. HMMs can model the
underlying biological processes and dependencies in sequences of DNA, RNA, or amino " Cluster 2: (4, 10, 12, 20, 311, 25)
acids. Step 3: (Repeat): Update the Centroids
Q.6. (b) Use K Means 1 clustering to Cluster the following data Calculate the mean of points in each cluster (no change in points, so centroids
group" Assume cluster centroid are ml =2 and m2 = 4. The distance into two remain the same).
used is Euclidean distance. (2, 4, 10, 12, 3, 20,311, function
25). (7.5) Since there's no change in the assignment of points toclusters and centroids, the
Ans. To perform K-means clustering with two clusters using the algorithm has converged. The final clusters are:
given initial
centroids ml = 2 and m2 = 4, we'll iterate through the steps of " Cluster 1: (2, 3)
clusters and updating the centroids until convergence. We'll use the assigning points to
Euclidean distance " Cluster 2: (4, 10, 12, 20, 311, 25)
function.
Here's the step-by-step process: Q.7.(a) Explain the EM Algorithmand Fuzzy CMeans Clustering in details.
(7.5)
1. Initialize the centroids: ml =2 and m2 = 4.
Ans. The Expectation-Maximization (EM) algorithm is a powerful iterative
2. Assign each data point to the nearest centroid based on Euclidean distance.
technique used to estimate parameters of statistical models, particularly when dealing
3. Update the centroids based on the mean of the points assigned to each with latent variables or incomplete data. It is widely applied in various felds such
cluster.
4. Repeat steps 2 and 3 until convergence (centroids no longer change significantly). as machine learning, computer vision, and natural language processing. The EM
Let's go through these steps: algorithm alternates between two main steps: the E-step (Expectation) and the M-step
Step 1: Initialization (Maximization).
Centroids: ml = 2 and m2 = 4 1. Initialization: Begin by initializing the parameters ofthe model, often randomly
Step 2: Assign Points to Clusters or using some heuristic.
Calculate the Euclidean distance from each point to each centroid: 2. E-step (Expectation):
or missing data
" For mnl = 2: " In this step, compute the expected value of the latent variables
Distance from 2:0 (itself) given the observed data and the current parameter estimates.
using the current
Distance from 4: 2 (|2-4|) "Calculate the probability distribution of the latent variables
parameters (using Bayes' rule or other methods).
" For m2 = 4:
3. M-step (Maximization):
Distance from 2: 2 (|2 -4|D maximize the likelihood function.
Distance from 4: 0 (it self) " Update the parameters of the model to
incorporating the information obtained from the
E-step.
Assign points to the nearest centroid: the expected complete data log-likelihood
" Cluster 1: (2, 3} " Estimate the parameters by maximizing
" Cluster 2: (4. 10, 12, 20, 311, 25) computed in the E-step.
4. Iteration: longer
Step 3: Update the Centroids convergence, where the parameters no
Calculate the mean of points in each cluster: "Repeat the E-step and Mstep untilpredefined number of iterations.
"Cluster 1: 2 + 3/2 = 2.5 change significantly or after reaching a
5. Convergence Criteria: iterations,
"Cluster 2: 4 + 10 + 12 + 20 + 311 + 25/6 = 62.0 maximum number of
criteria include reaching a
Update centroids: " Common convergence log-likelihood function, or observing small changes in
" ml = 2.5 achieving a small change in the
" m2 = 62.0
the parameter values.
24-2023 Fourth Semester, Fundamentals of Machine IP. University-[[Link]]-Akash Books 2023-25
Learning (AIMUAIDS)
Fuzzy C-MeansClustering: " Support count is the number of transactions
containing the itemset.
Fuzzy C-Means (FCM) clustering is an 4. Pruning:
algorithm that allows data points to belongextension of the classic K-means clustering
to multiple clusters with varying degrees " Eliminate candidate itemsets that do not meet the minimum support
of membership. Unlike K-means, which threshold.
assigns each data point to exactly one cluster, "Only frequent itemsets are retained for further iterations.
FCM assigns each data point a membership value for each cluster, indicating the degree
to which the point belongs to that cluster. 5. Repeat:
1. Initialization: Start by initializing the " Repeat steps 2 to 4 until no more frequent itemsets can be generated.
cluster centers (centroids) anda fuzziness
parameter m, which controls the degree of fuzziness " The algorithm terminates when no new frequent itemsets can be found.
in the clustering.
2. Membership Assignment Association Analysis:
(E-step):
"Compute the membership degree of each data Association analysis is a data mining technique used to discover interesting
membership function. point toeach cluster using a fuzzy patterns, correlations, or relationships among variables in large datasets. It is widely
"The menmbership degree represents the likelihood or degree of belongingness applied in various domains, including market basket analysis, customer behavior
data point to a particular cluster. of a analysis, and recommendation systems.
3. Centroid Update (M-step): Key Concepts in Association Analysis:
"Update the cluster centers (centroids) based on the 1. Support:
in the E-step. membership degrees computed
" Support measures the frequency of occurrence of an itemset in the dataset.
" Each cluster center is updated as a wveighted
weights are the membership degrees raised average of all data points, where the " It indicates how frequently the itemset appears in transactions and is calculated
4. Iteration:
to the power of m. as the ratio of transactions containing the itemset to the total number of transactions.
2. Confidence:
" Repeat the membership assignment and
centroid update steps until convergence,
where the cluster centers and membership degrees " Confidence measures the reliability of the association rule.
no longer change significantly. " It indicates the likelihood that an item Yis purchased when item X is purchased
5. Convergence Criteria: Y to*
" Similar to K-means, conmmon convergence and is calculated as the ratio of the support of the itemset containing both X and
number of iterations or criteria include reaching a maximum the support of the itemset containing only X.
observing small changes in the cluster centers and membership 3. Association Rule:
degrees. where X and Y are
Q.7. (b) Explain Apriori Algorithm in Machine " An association rule is an implication of the form X ’ Y,
analysis in details. Learning and association itemsets.
be present
Ans. The Apriori algorithm is a classic algorithm used
(7.5) It indicates that if X is present in a transaction, then Y is likely to
in machine learning for
association rule mining in transactional databases or datasets. It is particularly
as well.
useful for identifying frequent itemsets and generating 4. Lift:
association rules that reveal two itemsets, X and Y, beyond
relationships between different items in a dataset. Association " Lift measures the strength of association between
interesting patterns, associations, or relationships among items inanalysis aims to find what would be expected by chance.
large datasets. the rule to the expected confidence
Steps of the Apriori Algorithm: " It is calculated as the ratio of the confidence of
1. Frequent Itemset Generation: if XandY were independent.
Association Analysis:
" Begin by identifying all frequent itemsets, i.e.,sets Application of Apriori Algorithm and
of items that appear together
frequently in the dataset. 1. Market Basket Analysis:
analysis are widely used in retail for market
" A frequent itemset is a set of items that
meets a minimum support threshold " Apriori algorithm and association
specified by the user. basket analysis.
2. Candidate Generation: customer purchasing behavior by identifying
" It helps retailers understand recommendations or optimizing product
" Generate candidate itemsets by joining frequently co-occurring items and generating
itemsets of size.
frequent itemsets of size k to create placement strategies.
"Prune the candidate itemsets that contain
subsets [Link] Behavior Analysis: management (CRM) to
cannot be frequent themselves (the Apriori principle). that are not frequent, as these " Association analysis is applied
in customer relationship
preferences.
3. Support Counting: analyze customer behavior and opportunities, customer
patterns such as cross-selling
" Count the support (frequency of occurrence) of
each candidate itemset in the " It helps businesses identify
affinity.
dataset. segmentation, and product
26-2023 Fourth Semester, Fundamentals of Machine ILearning (AIMLAIDS) LP. University-([Link])-Akash Books 2023-27
3. Web Usage Mining: APerform Action and 0bserve Reward: The agent takes the selected
"Association analysis is used in web usage mining to
action
analyze user nav1gation A observes the reward it receives from the environment, as well as the next state it
patterns on websites. transitions into.
" It helps website owners improve website design, 6. Update Q-value: Using the observed reward and the Q-value of the next state,
content
personalized recommendations based on users' browsing behavior. organ1zation, and
[Link] and Bioinformatics: update the Q-value of the current state-action pair using the Q-learning update rule:
O's, a)- Qs, a)+a<r + ymax, Qs', a) - Q(s, a)l
" Association analysis is applied in healthcare and bioinformatics to Where:
patterns in patient data and genomic data. discover
Q(s, a) is the current Q-value of the state-action pair.
" It helps identify associations between medical conditions, treatments, and
markers, leading to insights for personalized medicine and disease diagnosis. genetic " ais the learning rate (0< as 1),determining to what extent the newly acquired
information will override the old information.
Q.8. (a) Explain the Q function and Q learning Algorithm
deterministic rewards and actions with example. assuming " ris the reward received after taking action a from state s.
(7.5) yis the discount factor (0<ys 1), representing theimportance of future rewards.
Day Outlook Temperature HumidityWind Play Tennis 's' is the next state after taking action a from state s.
D Sunny Hot High Weak No 6. Repeat: Continue taking actions, observing rewards, and updating Qvalues
D2 Sunny Hot until convergence or for a fixed number of iterations.
High Strong No
D3 Overcast Hot High Weak Yes
Let's illustrate this with an example using the given data and hypothetical @-values
in the table:
D4 Rain Mild High Weak Yes
D5 Rain State Action Q-Value
Cool Normal Weak Yes
D6 Sunny, Hot, High, Weak Play 0.5
Rain Cool Normal Strong No
D7 Sunny, Hot, High, Strong Play 0.3
Overcast Cool Normal Strong Yes
D8 Overcast, Hot, High, Weak Play 0.9
Sunny Mild High Weak Yes
D9 Rain, Mild, High, Weak Play 0.7
Sunny Cool Normal Weak Yes
D10 Rain, Cool, Normal, Weak Play 0.8
Rain Mild Normal Strong Yes
0.4
D11 Sunny Mild Rain, Cool, Normal, Strong Play
Normal Strong Yes
D12 Overcast, Cool, Normal, Strong Play 0.6
Overcast Hot Normal Weak Yes
D13 Sunny, Mild, High, Weak Play 0.2
Overcast Hot Normal Weak Yes
D14 Sunny, Cool, Normal, Weak Play 0.7
Rain Mild High No
Strong 0.6
Rain, Mild, Normal, Weak Play
Ans. A. In reinforcement learning, the Q-function (Quality
the expected cumulative reward an agent can obtain by taking function) represents Sunny, Mild, Normal, Strong Play 0.3
a specific action from
a certain state and then following a particular policy. The Overcast, Mild, High, Strong Play 0.8
model-free, off-policy reinforcement learning algorithm used Q-learning algorithm is a
to find the optimal action Overcast, Hot, Normal, Weak Play 0.9
selection policy for a given environment. 0.1
Rain, Mild, High, Strong Play
The Q-learning algorithm proceeds as follows:
1. Initialization: Initialize the Q-table with the expected future reward for taking a
arbitrary values. The Q-table has In this example, the Q-values represent valuesare updatediteratively as the agent
rows for each possible state and columns for each possible action. Initially, particular action from a certain state. These
could be zero or randomly assigned. these values time, the Q-values converge towards the
optimal
interacts with the environment. Over
2. Exploration vs. Exploitation: The agent the expected cumulative reward.
selects an action to take from policy that maximizes Explain
the current state based on a trade-off between and its role in machine learning?
exploration (trying new actions) and Q.8. (b)Discuss Bellman equation (7.5)
exploitation (using known actions with the highest Q-values). value function Approximation.
concept in
3. Action Selection: The agent selects an Bellman equation is a fundamental definition
action Ans. Bellman Equation: The
This action can be chosen using various strategies like based on the current state. reinforcement learning. It provides a recursive
epsilon-greedy, where there's a dynamic programming and utility of being in a
balance between exploration and exploitation. represents the expected return or
of the value function, which certain policy thereafter. The Bellman equation can
particular state and following a
28-2023 Fourth Semester:, Fundamentals of
Machine Learning (AIMIJAIDS) LP. University-|BTech)-Akah Books
be expressed in two main forms: the
Optimality Equation. Bellman Expectation Equation and the Bellman
" Byiteratively updating Q-values towards the 2023-29
1. Bellman Expectation action selections over time. optimal
make better
Equation: learn
to
Q-funtion., these algorithms
The Bellman Expectation Equation
expresses the value of a state s under a given Function
Value F Approximation:
policy n as the expected immediate reward
Vn(s) = Ez|Rt + 1+ yV(St + 1)|St =
plus the expected value of the next state: Value function approximation is a technique used to
s] large or continuous state spaces where estimate the value function
Where: feasible.
Instead. of storing explicit enuneration
values for every state, of all states is not
" Vr(s) is the value of the value function using value function approximation
states under approximating

involves
"Rt + 1 is the immediate reward policy .
aparameterized
received after transitioning from state s to the Common methods for value function function.
next state.
1. Linear Approximation: approximation include:
" St + 1is the next state reached after Donresent the value function as a linear
taking an action according to policy 1.
" yis the discount factor that balances
immediate rewards against future rewards. Hse techniques like linear combination of features.
" (] Er (:} denotes the expected regression or gradient descent to learn the
value under policy n. of the linear function. coefhciernts
[Link] Optimality Equation: 2. Non-linear Approximation:
The Bellman Optimality Equation expresses the ITse non-linear function approximators
expected return achievable by following the value of a state s as the maximum such as neural networks or decision trees
V*(s)= max E(Rt + 1+ yV*(St + 1)| St =optimal policy:
s, At = al
to approximate the value function
Where: ,These models have the flexibility to capture complex
and values. relationships between states
" V*(s) is the value of states
under the optimal policy. 3. Kernel Methods:
" Max, denotes maximizing over all
possible actions a in state s. . IUse kernel methods such as kernel
" The rest of the terms have the same
meaning as in the Bellman Expectation
regression or kernel ridge regression to
Equation. approximate the value function
Role of Bellman Equation in Machine "Kernel methods implicitly map the input space into a high-dimensional
Learning: feature
The Bellman equation plays a crucial role in space, where linear methods can be applied.
various machine learning algorithms, Value function approximation enables reinforcement learning algorithms to scale
particularly in reinforcement learning and dynamic
Some key roles include: programming-based methods. to larger and more complex problems by generalizing from observed states to unseen
1. Value Iteration: states. However, it introduces approximation errors, and careful selection of function
" The Bellman Optimality Equation serves as approximators and features is erucial for achieving good performance.
the basis for value iteration Q.9. (a) Write Short note on the following
algorithms, where the value function is iteratively updated towards its optimal values.
(7.5)

" Value iteration converges to the optimal value () Markov Decision Process
the Bellman backup operation. function by repeatedly applying (ii) Application of neural Network.
2. Policy Evaluation: Ans. (i) Markov Decision Process (MDP): AMarkov Decision Process (MDP) is
" The Bellman Expectation Equation is used a mathematical framework used to model decision-making problems in which an agent
for policy evaluation in reinforcement
learning. interacts with an environment over a series of discerete time steps. It is widely applied
" It calculates the expected return of a state in the field of reinforcement learning and sequential decision-making. The MDP model
under a
considering the expected rewards and future state values. given policy by recursively consists of the following components:
3. Policy Improvement: 1. States (S):
"The Bellman Optimality Equation guides " A set of all possible situations or configurations that the system can be in.
means to evaluate the expected return of differentpolicy improvement by providing the "At each time step, the environment is in a certain state,
which may be observable
actions in a state.
"It enables the selection of the action that or hidden.
to policy improvement.
maximizes the expected return, leading
2. Actions (A):
4. Q-Learning and SARSA: can take in each state.
"A set of all possible actions that the agent have associated
" Q-learning and SARSA, two popular from one state to another and may
Bellman Optimality Equation to update action reinforcement learning algorithms, use the " Actions influence the transition
transitions and rewards.
values (Q-values) based on observed costs or rewards.
Transition Probabilities (T):
3.
to another given a particular
" The probabilities of transitioning from one state
action.
LP.
University-(B Tech)-Akash Books
30-2023 Fourth Semester. Fundamentals of Machine Learning (AIMVAIDS) 2023-31
" Applications include credit scoring,
and customer relationship management. algorithmic trading, market
ana
"Represented as T(s, a, s), where s is the current state, a is the action taken. trend analysis.
s is the next state.
6. Autonomous Systems and Robotics:
4. Rewards (R):
.Neural networks play a crucial role in the development of
" Immediate rewards or costs received by the agent for taking an action in a systems. autonomous vehicles,
particular state. drones, and robotic
. Applications include navigation, object recognition, path planning. and robotic
" Can be deterministic or stochastic and may depend on the current state and control.
action.
5. Policy (n): Q.9. (b) Discuss the significance of Deep QNeural Network in machine
laarning. Alsowrite down application of Reinforcement Learning. (7.5)
" Astrategy or decision-making rule that maps states to actions.
" Determines the agent's behavior in the environment and infiuences the sequence Ans. Significance of Deep Q-Network (DQN) in Machine Learning:
of actions taken over time. Deep Q-Network (DQN)is aground breaking algorithm that combines reinforcement
JRorning with deep neural networks to solve complex decision-making problems. It was
The goal in a Markov Decision Process is to find an optimal policy that maximizes otroduced by researchers at Deep Mind in 2013 and has since revolutionized the field of
the expected cumulative reward over time. The optimal policy can be found using
dynamic programming methods, such as value iteration or policy iteration, or by using inforcement learning. The significance of DQN in machine learning can be understood
through several key aspects:
reinforcement learning algorithms, such as Q-learning or policy gradient methods.
(ii) Application of Neural Networks: 1. Deep Learning Representation:
Neural networks, inspired by the structure and function of the human brain " DQN leverages deep neural networks to approximate the Q-function. hich
estimates the expected cumulative reward of taking an action in a given state
are versatile machine learning models capable of learning complex patterns and state
relationships from data. They consist of interconnected layers of artificial neurons " Deep neural networks enable DQN to learn complex and high-dimensional
representations directly from raw input data, such as images or
sensory inputs.
that process input data and produce output predictions. Neural networks have found
numerous applications across various domains, including: 2. End-to-End Learning:
actions, enabling end
1. Image Recognition and Computer Vision: " DQN learns directly from raw sensory inputs and output
engineering.
" Convolutional Neural Networks (CNNs) are widely used for tasks such as image to-end learning without the need for manual feature
and allows DQN to automatically
classification, object detection,and facial recognition. " This approach simplifies the learning process the data.
representations from
" Applications include autonomous driving, medical imaging, surveillance systems, discover relevant features and
and image-based quality control in manufacturing. 3. Experience Replay:
technique that stores past experiences (state,
2. Natural Language Processing (NLP): " DQN utilizes experience replay, a buffer.
in a replay memory
"Recurrent Neural Networks (RNNS) and Transformers are employed for tasks action, reward, next state)
consecutive samples
such as language translation, sentiment analysis, and text generation. Experience replay breaks the temporal correlation between
" sampling experiences from the replay
by randomly
" Applications include machine translation, chatbots, document summarization, and enables more efficient learning
and speech recognition. buffer during training.
3. Recommendation Systems: 4. Target Network:
training by decoupling the
"Neural networks are used to build personalized recommendation systems that target network to stabilize
" DQN employs a separateBellman update from the Q-network being updated
suggest products, movies, music, or content to users based on their preferences and target Q-values used in the for training, mitigating the
behavior. provides stable target values
" The target network Q-value estimates.
"Applications include e-commerce platforms, streaming services, and social media bias and oscillations in
issues of overestimation Breakthroughs:
platform. Learning
5. Deep Reinforcement breakthroughs in deep
reinforcement
4. Healthcare and Biomedical Applications: the foundation for numerous
" DON laid OpenAl>'s Dota 2 bot.
" Neural networks are used for medical image analysis, disease diagnosis, drug AlphaGo, AlphaZero, and reinforcement
learning, including combining deep learning with
discovery, and patient outcome prediction. potential of
It demonstrated the
decision-making tasks in
diverse domains.
" Applications include medical imaging interpretation, personalized medicine, and learning tosolve complex Learning:
genomic analysis. Applicationsof
Reinforcement
applications across various
5. Finance and Business Analytics: learning (RL) has a wide range of environment to achieve
Reinforcement interactions with an
learn from
" Neural networks are employed for fraud detection, risk assessment., stock market domains due to its ability toapplications include:
prediction,and customer churn prediction. specific goals.
Somenotable
00z3 Fourth Semester, Fundamentals of Machine Learning (AIMLUAIDS)
1. Game Playing:
KL algorithms have achieved remarkable success in playing
games, video games, and multiplayer online complex board
games.
Examples include AlphaGo, which
and OpenA>'s agents trained to plav Dotadefeated world champions n the game of Go
2 and StarCraft l1.
2. Robotics:
"Reinforcement learning is used to train robotic
agents to perform tasks such as
grasping objects, navigating environments, and interacting with humans.
" Applications include industrial
robots. automation, autonomous vehicles, and household
3. Autonomous Systems:
"RL is applied in autonomous
control in dynamic and uncertain systems to learn policies for
environments. decision-making and
" Examples include
management systems.
autonomous drones, self-driving cars, and
intelligent trafic
4. Finance and
Trading:
" RL algorithms are used in
risk management, and pricing finance for portfolio optimization, algorithmic trading.
derivatives.
" RL can learn optimal trading
maximizing cumulative rewards. strategies by interacting with financial markets and
5. Healthcare:
" RL is employed in
healthcare for personalized treatment
monitoring, and medical diagnosis. planning, patient
" Applications include
image analysis. adaptive treatment strategies, drug
discovery, ánd medical
6. Recommendation Systems:
" RL techniques are
and personalize user used in recommendation
experiences. systems to optimize content
" RL can learn t0 delivery
recommnend
preferences and feedback. products, movies, music, and
content based on user
7. Energy Management:
" RL algorithms
are
consumption, grid stability,applied
and
in energy
management systems to optimize
renewable energy
"RL can learn to integration. energy
to maximize efficiency control smart grids, HVACsystems, and
and sustainability. energy storage devices

Common questions

Cost-sensitive learning assigns different misclassification costs to different classes, focusing more on the minority class by assigning higher costs to its misclassifications. This encourages the model to pay greater attention to the minority class. In contrast, class weights involve assigning higher weights to the minority class rather than explicit costs. Both methods aim to address class imbalance but do so through slightly different mechanisms, with cost-sensitive learning being more about penalization and class weights adjusting the importance during training .

Precision might be prioritized over recall in scenarios where false positives carry a high cost. For example, in a spam detection system, mistakenly classifying a legitimate email as spam (a false positive) is less acceptable than missing a few spam emails. Here, focusing on precision ensures that when the model predicts a positive class, it is more likely to be correct, minimizing false positives. The trade-off between precision and recall depends on business needs and the cost associated with different types of errors .

In a Markov Decision Process (MDP), 'states' represent all possible configurations or situations the system can be in. 'Actions' are decisions or moves the agent can take in each state. 'Transition probabilities' determine the likelihood of moving from one state to another given a specific action. 'Rewards' quantify the immediate benefit received after taking an action in a state. The goal in an MDP is to develop a policy that maximizes the expected cumulative reward. These components interact in a way that guides decision-making processes over time, facilitating optimal decisions based on expected outcomes .

Understanding the confusion matrix is essential because it provides a detailed view of a model's performance beyond overall accuracy, which can be misleading in imbalanced datasets. It breaks down predictions into true positives, false positives, true negatives, and false negatives, offering insights into specific error types. For imbalanced datasets, metrics derived from the confusion matrix, like precision, recall, and F1-score, are more informative. These metrics focus on the model's ability to correctly identify the minority class, often the more important class in practical applications .

The Gaussian Mixture Model (GMM) assumes that data points are generated from a mixture of Gaussian distributions. The EM algorithm is used to estimate the parameters of these Gaussian distributions iteratively. It begins with initialization, calculates the expectation step to assign responsibility of data points to clusters, and then updates parameters during the maximization step. This process is crucial for finding the maximum likelihood estimates of the parameters, enabling the GMM to effectively model complex data structures where clusters overlap .

Value function approximation allows reinforcement learning algorithms to generalize from observed states to unseen states, enabling them to scale to larger problems where storing every state-action value is infeasible. Approximation methods, such as linear combinations, neural networks, and kernel methods, represent the value function with a parameterized form. However, challenges include managing approximation errors and selecting appropriate features and models to ensure good performance. Careful tuning and validation are crucial to balance complexity and accuracy in approximations .

The Deep Q-Network (DQN) has revolutionized reinforcement learning by combining deep learning with Q-learning to handle complex decision-making tasks. Key reasons for using DQN include its ability to approximate the Q-function using deep neural networks, enabling end-to-end learning without manual feature engineering. DQN can learn directly from high-dimensional sensory inputs, such as raw images, by leveraging experience replay to stabilize training. This allows DQN to automatically discover relevant features, significantly improving the scalability and applicability of reinforcement learning to real-world problems like robotics and gaming .

In Q-learning, a balance between exploration and exploitation is maintained to optimize learning. Exploration involves trying new actions to discover their effects, which is crucial for building a complete model of the environment. Exploitation uses the current knowledge to maximize immediate rewards. Strategies like epsilon-greedy achieve this balance by selecting actions randomly with a low probability and choosing the best-known actions otherwise. This balance is significant as it ensures the agent doesn't get stuck in local optima and can adapt to dynamic environments .

Neural networks transform healthcare by enabling applications like medical image analysis, disease diagnosis, drug discovery, and patient outcome prediction. They can analyze medical imaging for interpreting conditions, offering more precise and faster diagnostics. In personalized medicine, neural networks identify patterns in patient data that inform tailored treatment plans. For drug discovery, they model complex biological processes to identify potential drug candidates. These applications enhance accuracy, efficiency, and personalization in healthcare, significantly impacting treatment and research .

One-hot encoding increases the dataset's dimensionality by converting categorical variables into a set of binary variables, each representing a category level. Label encoding, on the other hand, assigns an integer value to each category, not significantly increasing dimensionality. PCA, unlike these encodings, reduces dimensionality by transforming features into a set of linearly uncorrelated variables, called principal components. Unlike PCA, encoding techniques do not reduce dimensionality, but PCA can capture variance in data, which encoding cannot directly address .

Key Factors in MLP Learning
No ratings yet
Key Factors in MLP Learning
17 pages
Deep Learning: XOR, Gradient Descent, and Architecture
No ratings yet
Deep Learning: XOR, Gradient Descent, and Architecture
43 pages
Gradient Descent Optimization Techniques
No ratings yet
Gradient Descent Optimization Techniques
22 pages
Gradient Descent: Learning Rate & Types
No ratings yet
Gradient Descent: Learning Rate & Types
10 pages
DNN Training and Optimization Techniques
No ratings yet
DNN Training and Optimization Techniques
114 pages
Optimization Techniques in Machine Learning
No ratings yet
Optimization Techniques in Machine Learning
19 pages
Understanding Neural Networks & Optimization
No ratings yet
Understanding Neural Networks & Optimization
37 pages
Deep Learning vs. Machine Learning Guide
No ratings yet
Deep Learning vs. Machine Learning Guide
37 pages
Deep Learning Fundamentals Explained
No ratings yet
Deep Learning Fundamentals Explained
48 pages
Machine Learning Model Fitting Techniques
No ratings yet
Machine Learning Model Fitting Techniques
67 pages
5 Optimizer
No ratings yet
5 Optimizer
28 pages
Lecture5 Gradient Descent Fitting
No ratings yet
Lecture5 Gradient Descent Fitting
64 pages
Deep Learning Optimizers Explained
No ratings yet
Deep Learning Optimizers Explained
12 pages
CNN Batch Size and Optimization Techniques
100% (1)
CNN Batch Size and Optimization Techniques
59 pages
Linear Classifiers and SVM Optimization
No ratings yet
Linear Classifiers and SVM Optimization
22 pages
Understanding Gradient Descent Algorithms
No ratings yet
Understanding Gradient Descent Algorithms
13 pages
Deep Learning
No ratings yet
Deep Learning
35 pages
Types of Artificial Neural Networks
No ratings yet
Types of Artificial Neural Networks
23 pages
Neural Network Optimization Challenges
No ratings yet
Neural Network Optimization Challenges
14 pages
Adam Optimizer in Neural Networks
No ratings yet
Adam Optimizer in Neural Networks
24 pages
Neural Network Training Essentials
No ratings yet
Neural Network Training Essentials
8 pages
Perceptron and Gradient Descent Overview
No ratings yet
Perceptron and Gradient Descent Overview
25 pages
Understanding Gradient Descent in ML
No ratings yet
Understanding Gradient Descent in ML
11 pages
Deep Learning Optimization Techniques
No ratings yet
Deep Learning Optimization Techniques
67 pages
Understanding Subgradients in Optimization
No ratings yet
Understanding Subgradients in Optimization
25 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
19 pages
Learning XOR with Deep Networks
No ratings yet
Learning XOR with Deep Networks
25 pages
Understanding Deep Learning Activation Functions
No ratings yet
Understanding Deep Learning Activation Functions
70 pages
AI600 Solutions
No ratings yet
AI600 Solutions
19 pages
Gradient Descent in Neural Networks
No ratings yet
Gradient Descent in Neural Networks
3 pages
Gradient Descent in Neural Network Optimization
No ratings yet
Gradient Descent in Neural Network Optimization
33 pages
Ensemble Learning and Model Training Insights
No ratings yet
Ensemble Learning and Model Training Insights
9 pages
Ensemble Learning and Model Training Insights
No ratings yet
Ensemble Learning and Model Training Insights
23 pages
Applications and Metrics of AI
No ratings yet
Applications and Metrics of AI
39 pages
AI & ML: Classification and Regression Techniques
No ratings yet
AI & ML: Classification and Regression Techniques
35 pages
Understanding Gradient Descent Techniques
No ratings yet
Understanding Gradient Descent Techniques
8 pages
K-Means Clustering and KNN Classification Guide
No ratings yet
K-Means Clustering and KNN Classification Guide
10 pages
Deep Learning Fundamentals and Techniques
No ratings yet
Deep Learning Fundamentals and Techniques
212 pages
Minimizing Gradient Problems in ANN
No ratings yet
Minimizing Gradient Problems in ANN
56 pages
Deep Learning vs Machine Learning Guide
No ratings yet
Deep Learning vs Machine Learning Guide
20 pages
Deep Learning Unit 1 Overview
No ratings yet
Deep Learning Unit 1 Overview
32 pages
Understanding Deep Feedforward Networks
No ratings yet
Understanding Deep Feedforward Networks
113 pages
Deep Learning Overview for Presentations
No ratings yet
Deep Learning Overview for Presentations
64 pages
Neural Network Training Optimization
No ratings yet
Neural Network Training Optimization
62 pages
Gradient Descent Optimization Techniques
No ratings yet
Gradient Descent Optimization Techniques
54 pages
Machine Learning Fundamentals and Techniques
No ratings yet
Machine Learning Fundamentals and Techniques
20 pages
Deep Learning Concepts and Techniques
No ratings yet
Deep Learning Concepts and Techniques
9 pages
Supervised Deep Learning Training Techniques
No ratings yet
Supervised Deep Learning Training Techniques
23 pages
Neural Network Training Techniques
No ratings yet
Neural Network Training Techniques
36 pages
Deep Learning Optimization Challenges
No ratings yet
Deep Learning Optimization Challenges
5 pages
Backpropagation in Deep Learning Explained
No ratings yet
Backpropagation in Deep Learning Explained
48 pages
Machine Learning Concepts and Techniques
No ratings yet
Machine Learning Concepts and Techniques
10 pages
Gradient Descent in Neural Networks
No ratings yet
Gradient Descent in Neural Networks
14 pages
Machine Learning Concepts and Techniques
No ratings yet
Machine Learning Concepts and Techniques
14 pages
Supervised Deep Learning Training Techniques
No ratings yet
Supervised Deep Learning Training Techniques
36 pages
Machine Learning Techniques Overview
No ratings yet
Machine Learning Techniques Overview
63 pages
Deep Learning Optimization Concepts Explained
No ratings yet
Deep Learning Optimization Concepts Explained
13 pages
NISM Admissions Helpline Information
No ratings yet
NISM Admissions Helpline Information
3 pages
Working Capital Management at CCL
No ratings yet
Working Capital Management at CCL
13 pages
Nitro Pro 13.9.1.155 Crack 2020 Guide
No ratings yet
Nitro Pro 13.9.1.155 Crack 2020 Guide
4 pages
Power in Close Relationships
100% (1)
Power in Close Relationships
4 pages
Medication Disposal SOP
100% (3)
Medication Disposal SOP
3 pages
Registered Dietitian Resume for Athletes
No ratings yet
Registered Dietitian Resume for Athletes
1 page
NCP1117 Voltage Regulator Overview
No ratings yet
NCP1117 Voltage Regulator Overview
19 pages
2015-05-07 Calvert County Times
No ratings yet
2015-05-07 Calvert County Times
24 pages
Pacing for Growth: Intelligent Restraint
No ratings yet
Pacing for Growth: Intelligent Restraint
12 pages
Neural Network Fault Diagnosis for PV Arrays
No ratings yet
Neural Network Fault Diagnosis for PV Arrays
27 pages
Product Manual 36600 (Revision G) : PG Governor Basic Elements
100% (1)
Product Manual 36600 (Revision G) : PG Governor Basic Elements
34 pages
Joining Time Extension Guidelines
No ratings yet
Joining Time Extension Guidelines
4 pages
7SL86 Protection Device Overview
No ratings yet
7SL86 Protection Device Overview
10 pages
BNU Re-evaluation Application Form
No ratings yet
BNU Re-evaluation Application Form
2 pages
Construction Material Requirements List
No ratings yet
Construction Material Requirements List
1 page
Organizational Role Stress at NTPC
No ratings yet
Organizational Role Stress at NTPC
4 pages
Water Bill Payment Slip Details
No ratings yet
Water Bill Payment Slip Details
1 page
R-422A Safety Data Sheet Overview
No ratings yet
R-422A Safety Data Sheet Overview
7 pages
Probability Distributions Assignment
No ratings yet
Probability Distributions Assignment
15 pages
SSC CGL Jargon Guide for Daily Life
No ratings yet
SSC CGL Jargon Guide for Daily Life
3 pages
Sound Attenuated Enclosures for C27/C32
No ratings yet
Sound Attenuated Enclosures for C27/C32
2 pages
Rubber Rheometer and Rheograph Overview
100% (2)
Rubber Rheometer and Rheograph Overview
11 pages
Measure AC Frequency with Sonometer
No ratings yet
Measure AC Frequency with Sonometer
4 pages
Human Capital Theory Explained
No ratings yet
Human Capital Theory Explained
14 pages
Boomi Dataintegration
No ratings yet
Boomi Dataintegration
10 pages
YETI Hopper M20 Backpack Cooler
No ratings yet
YETI Hopper M20 Backpack Cooler
1 page
Operator Guide - Fa-Tech Vega - English
No ratings yet
Operator Guide - Fa-Tech Vega - English
32 pages
Aarogyam C Pro Test Report for Nirmala Tiwari
No ratings yet
Aarogyam C Pro Test Report for Nirmala Tiwari
15 pages
Ankit Singh: Software Engineer Resume
No ratings yet
Ankit Singh: Software Engineer Resume
2 pages
IT Management Training and Trends
No ratings yet
IT Management Training and Trends
16 pages

Machine Learning Fundamentals Exam Guide

Uploaded by

Machine Learning Fundamentals Exam Guide

Uploaded by

[JUNE-2024]

IP. University-([Link] ]-Akash Books 2024-13

OF MACHINE LEARNING to find the min1mum of a function, typically a

Learning): These methods (LDA)? (9)

Whose Algorithm-Level account costs to

" subset of the

" Impact on Dimensionality: Does not change dimensionality. It replaces the

ins1ghts into the data

Machine Learning IP. University-([Link].J-Akash Books 2024-23

Trick: The spaces (even

n very thal space.

For balanced datasets, accuracy might be a reasonable starting point.

Relationship with Concept Learning:

Common questions

What are the differences between cost-sensitive learning and class weights in addressing imbalanced datasets?

In what scenarios might precision be prioritized over recall in a model evaluation, and why?

Explain the roles and interactions between states, actions, transition probabilities, and rewards in a Markov Decision Process (MDP).

Why is understanding the confusion matrix essential in evaluating a classification model's performance, particularly in imbalanced datasets?

How does the Gaussian Mixture Model (GMM) use the Expectation-Maximization (EM) algorithm, and why is this process crucial?

How can value function approximation scale reinforcement learning algorithms to larger problems, and what are the challenges associated?

What are the key reasons for using the Deep Q-Network (DQN) in reinforcement learning, and how has it revolutionized this field?

Describe how exploration and exploitation are balanced in Q-learning and the significance of this balance.

Discuss the potential applications of neural networks in healthcare and how they transform this domain.

How do one-hot encoding and label encoding affect the dimensionality of a dataset, and how do they compare with Principal Component Analysis (PCA)?

You might also like