ML Assignment

1) Consider a PLA with d = 2 and a threshold with the sign() function.

If the two weights are w1 = 1 and w2 = 1 and the bias is w0 = -1.5, then what would be the output for input vector (0, 0)? What about for inputs (1, 0), (0, 1), and (1, 1)? Draw the discriminant function for this function, and write down its equation. {note: the input does not include x0; we add x0, which is always 1}

ANS:
In a Perceptron Learning Algorithm (PLA) with d=2, you have two input features, x1 and x2.
The bias term is typically represented as w0, which is also equal to -θ (the threshold),
where θ is the value at which the perceptron activates. The equation for the
perceptron's output is given by:
y = sign(w0 + w1 * x1 + w2 * x2)
In this case, w1 = 1, w2 = 1, and w0 = -1.5. Now, let's evaluate the output for the given input
vectors:
1. Input (0, 0): y = sign(-1.5 + 1 * 0 + 1 * 0) = sign(-1.5) = -1
2. Input (1, 0): y = sign(-1.5 + 1 * 1 + 1 * 0) = sign(-0.5) = -1
3. Input (0, 1): y = sign(-1.5 + 1 * 0 + 1 * 1) = sign(-0.5) = -1
4. Input (1, 1): y = sign(-1.5 + 1 * 1 + 1 * 1) = sign(0.5) = +1
So the output is -1 for (0, 0), (1, 0), and (0, 1), and +1 for (1, 1). (This perceptron computes a logical AND.)
Now, let's draw the discriminant function for this perceptron. The discriminant function is
essentially the decision boundary that separates the two classes (in this case, -1 and
+1).
The equation of the discriminant function is given by: w0 + w1 * x1 + w2 * x2 = 0
Plugging in the values, we get: -1.5 + 1 * x1 + 1 * x2 = 0
Simplifying it: x1 + x2 = 1.5
This is the equation of the discriminant function. It represents a line in the two-dimensional
input space (x1, x2). It divides the space into two regions: one where the perceptron outputs
-1 and the other where it outputs +1. In this case, the line x1 + x2 = 1.5 acts as the decision
boundary: any point below this line (x1 + x2 < 1.5) corresponds to an output of -1, while any
point above it (x1 + x2 > 1.5) corresponds to an output of +1. Of the four inputs, only (1, 1)
lies above the line, consistent with the outputs computed above.
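
As a quick check, here is a minimal NumPy sketch (not part of the assignment's required answer) that evaluates the perceptron on all four inputs:

import numpy as np

w = np.array([-1.5, 1.0, 1.0])  # [w0 (bias), w1, w2]

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    x = np.array([1.0, x1, x2])  # prepend x0 = 1, as the question notes
    print(f"input ({x1}, {x2}) -> {int(np.sign(w @ x))}")
# Expected output: -1, -1, -1, +1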

2) Hoeffding inequality: what will be the probability of being correct if we claim μ is within 0.04 of ν and N = 1500? Explain briefly. {note: ν is for in-sample and μ is for out-of-sample}
ANS:
The Hoeffding Inequality bounds the probability that the in-sample frequency ν deviates from the
true (out-of-sample) value μ by more than a tolerance ε, when N independent samples of a bounded
random variable are drawn. Here we estimate μ with ν, and we want the probability that ν is within
0.04 of μ. The Hoeffding Inequality can be expressed as follows:

P(|ν - μ| ≥ ε) ≤ 2e^(-2ε²N)

Where:

 P(|ν - μ| ≥ ε) is the probability that ν differs from μ by more than ε.
 ε is the allowed error or deviation between ν and μ.
 N is the number of samples drawn.

In this case, ε = 0.04 and N = 1500. We want the probability that ν is within 0.04 of μ, so we are interested in the complement of the deviation event:

P(|ν - μ| < 0.04) = 1 - P(|ν - μ| ≥ 0.04)

Now, plug the values into the inequality:

P(|ν - μ| ≥ 0.04) ≤ 2e^(-2 × 0.04² × 1500)

P(|ν - μ| ≥ 0.04) ≤ 2e^(-2 × 0.0016 × 1500)

P(|ν - μ| ≥ 0.04) ≤ 2e^(-4.8)

Now, calculate the value of this expression:

P(|ν - μ| ≥ 0.04) ≤ 2 × 0.0082297 ≈ 0.0164595

So, the probability that the estimate ν is within 0.04 of μ is at least 1 - 0.0164595 = 0.9835405, or approximately 98.35%. This means that with a sample size of 1500, there is at least a 98.35% chance that the in-sample frequency ν is within 0.04 of the out-of-sample value μ.
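
A short sketch (using only the standard library) that reproduces the bound:

import math

eps, N = 0.04, 1500
bound = 2 * math.exp(-2 * eps**2 * N)   # Hoeffding bound on P(|nu - mu| >= eps)
print(bound, 1 - bound)                  # ~0.0165 and ~0.9835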

3) Explain briefly Reinforcement Learning (RL) (2-3 lines) and the difference between RL and
supervised and unsupervised learning in 2-3 lines.
ANS: Reinforcement learning (RL) is an area of machine learning concerned with
how intelligent agents ought to take actions in an environment in order to maximize the
notion of cumulative reward. Reinforcement learning is one of three basic machine
learning paradigms, alongside supervised learning and unsupervised learning.

Difference Between RL, Supervised and Unsupervised Learning


In supervised learning, the AI model is trained based on the given input and its expected
output, i.e., the label of the input.

In unsupervised learning, the AI model is trained only on the inputs, without their labels.
The model groups the input data into clusters of items that share similar features.
In reinforcement learning, the AI model tries to take the best possible action in a given
situation to maximize the cumulative reward. The model learns by getting feedback on its past
actions.

4 (a) A fair coin is tossed five times; what is the probability of not all heads (at least one
tail)? Explain in detail, show all steps. (b) Write the final answer only (no details) for the
probability of: 70 coins each tossed 9 times; what is the probability that all 70 are not
all-heads (meaning every one of the 70 coins gets at least one tail; none of the 70 gets all
9 heads)? ***No steps, no explanation for part (b), but you need to explain with steps for (a).

ANS:
(a) To find the probability that not all five coin tosses result in heads (i.e., at least one tail), we
use the complementary probability approach. The only excluded outcome is "all five tosses are
heads." Since the tosses are independent and each head has probability 1/2:

P(all heads) = (1/2) * (1/2) * (1/2) * (1/2) * (1/2) = (1/2)^5 = 1/32

Therefore:

P(not all heads) = 1 - P(all heads) = 1 - 1/32 = 31/32

So, the probability of at least one tail in five coin tosses is 31/32.

(b) Final answer for the probability that every one of the 70 coins is not all heads: (1 - (1/2)^9)^70 = (511/512)^70 ≈ 0.872
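
A quick sketch (not asked for in the assignment) confirming both answers exactly and numerically:

from fractions import Fraction

p_a = 1 - Fraction(1, 2)**5           # (a) at least one tail in 5 tosses
p_b = (1 - Fraction(1, 2)**9)**70     # (b) every one of 70 coins avoids all-heads
print(p_a, float(p_a))                # 31/32 = 0.96875
print(float(p_b))                     # ~0.872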
Q.5 a) PLA algorithm: One of the following is correct:

(a) h(x) = sign(w x^T)   (b) h(x) = sign(w^T x) = 1   (c) h(x) = sign(w x)
(d) sign(w^T x) = h(x)   (e) sign(w0 x) = h(x)   (f) None of these is correct

ANS: (d) sign(w^T x) = h(x)

Option (d) is the correct one.

5b) PLA algorithm: one of the following is correct:

(a) h(x) = sign(w^T x)   (b) h(x) ≤ sign(w x^T)   (c) h(x) ≥ sign(w^T x)
(d) sign(w^T x) < 0   (e) h(x) > 0 and sign(w^T x) > 0   (f) none of these is correct

ANS: (a) h(x) = sign(w^T x)

Option (a) is the correct one.

5c) Write one weight vector for 2-class (2C) classification with a dataset having 4 attributes
{that is: write one sample or example weight vector}

ANS : here's an example of a weight vector for a 2-class (2C) classification problem with a
dataset having 4 attributes:

w = [0.5, -0.2, 0.7, -0.1]

In this example, we have 4 attributes, and each weight (w1, w2, w3, w4) corresponds to the
contribution of the respective attribute in making classification decisions. These values are
arbitrary and would be learned through a training process using a specific machine learning
algorithm.
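
For illustration, here is a hypothetical point classified with this weight vector (assumption: no separate bias term, matching the 4-component vector above):

import numpy as np

w = np.array([0.5, -0.2, 0.7, -0.1])  # the example weight vector
x = np.array([1.0, 2.0, -0.5, 3.0])   # a hypothetical 4-attribute input
print(np.sign(w @ x))                 # -1.0, since w.x = -0.55 < 0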

6) Write the code for generating synthetic data with multiple datasets for
classification and for clustering in any machine learning platform (R,
Jupyter, sklearn, Colab, Keras, etc.); for example, you can use
make_classification(), make_blobs(). Provide sample output. Submit three
datasets: two for classification and one for regression (you can have d
between 2 and 4 and N (number of rows) from 10 to 20).

ANS:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_regression, make_blobs

# Function to generate synthetic classification data
def generate_classification_data(N, d):
    X, y = make_classification(
        n_samples=N,
        n_features=d,
        n_informative=d // 2,
        n_redundant=0,          # keep informative + redundant <= d for small d
        n_clusters_per_class=1,
        random_state=42
    )
    return X, y

# Function to generate synthetic regression data
def generate_regression_data(N, d):
    X, y = make_regression(
        n_samples=N,
        n_features=d,
        noise=0.1,
        random_state=42
    )
    return X, y

# Function to generate synthetic clustering data
def generate_clustering_data(N, d):
    X, _ = make_blobs(
        n_samples=N,
        n_features=d,
        centers=3,
        random_state=42
    )
    return X

# Generating synthetic classification datasets
X_classification1, y_classification1 = generate_classification_data(N=15, d=2)
X_classification2, y_classification2 = generate_classification_data(N=20, d=3)

# Generating synthetic regression dataset
X_regression, y_regression = generate_regression_data(N=10, d=4)

# Generating synthetic clustering dataset
X_clustering = generate_clustering_data(N=12, d=2)

# Sample outputs
print("Classification Dataset 1:")
print("Features (X):")
print(X_classification1)
print("Labels (y):")
print(y_classification1)

print("\nClassification Dataset 2:")
print("Features (X):")
print(X_classification2)
print("Labels (y):")
print(y_classification2)

print("\nRegression Dataset:")
print("Features (X):")
print(X_regression)
print("Labels (y):")
print(y_regression)

print("\nClustering Dataset:")
print("Features (X):")
print(X_clustering)

# Plotting the clustering dataset
plt.scatter(X_clustering[:, 0], X_clustering[:, 1])
plt.title("Synthetic Clustering Dataset")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

This code generates two classification datasets (X_classification1/y_classification1 and X_classification2/y_classification2), one regression dataset (X_regression, y_regression), and one clustering dataset (X_clustering). You can adjust the values of N and d to control the number of rows and features in each dataset.

7) Write the pocket algorithm

ANS:
import numpy as np

class PocketAlgorithm:
    def __init__(self, max_iterations=1000):
        self.max_iterations = max_iterations

    def fit(self, X, y):
        num_samples, num_features = X.shape
        self.w = np.zeros(num_features)   # Initialize weight vector
        self.best_w = self.w.copy()       # Initialize best ("pocket") weight vector
        self.best_accuracy = 0

        for _ in range(self.max_iterations):
            misclassified = 0

            for i in range(num_samples):
                xi, yi = X[i], y[i]
                prediction = np.sign(np.dot(self.w, xi))

                if prediction != yi:
                    self.w += yi * xi     # standard PLA update on a misclassified point
                    misclassified += 1

            # Keep the current weights in the pocket if they improve in-sample accuracy
            accuracy = self._calculate_accuracy(X, y, self.w)

            if accuracy > self.best_accuracy:
                self.best_w = self.w.copy()
                self.best_accuracy = accuracy

            if misclassified == 0:
                break

        self.w = self.best_w  # Set the final weight vector to the best found

    def predict(self, X):
        return np.sign(np.dot(X, self.w))

    def _calculate_accuracy(self, X, y, w):
        predictions = np.sign(np.dot(X, w))  # evaluate the given w, not self.w
        return np.mean(predictions == y)
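
A minimal usage sketch (assumptions: labels in {-1, +1} and an x0 = 1 column prepended for the bias, neither of which the class handles itself):

import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=20, n_features=2, centers=2, random_state=1)
y = np.where(y == 0, -1, 1)                    # relabel {0, 1} -> {-1, +1}
X = np.hstack([np.ones((len(X), 1)), X])       # prepend x0 = 1 for the bias

model = PocketAlgorithm(max_iterations=100)
model.fit(X, y)
print("in-sample accuracy:", np.mean(model.predict(X) == y))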

8) Problem 1.1: We have 2 opaque bags, each containing
2 balls. One bag has 2 black balls and the other
has a black and a white ball. You pick a bag at
random and then pick one of the balls in that bag
at random. When you look at the ball, it is black.
You now pick the second ball from that same bag.
What is the probability that this ball is also
black? (Hint: use Bayes' Theorem; P[A and B] = P[A | B] P[B] = P[B | A] P[A])

To solve this problem using Bayes' Theorem, we can define two events:

A: You initially picked the bag with 2 black balls. B: You picked a black ball from the selected
bag.

The second ball is black exactly when the selected bag is the one with 2 black balls, so the
probability we want is P(A | B). Using Bayes' Theorem, we can express this as:

P(A | B) = P(B | A) * P(A) / P(B)

Now, let's break down each probability:

1. P(A): The probability of initially picking the bag with 2 black balls. Since there are 2 bags and
you pick one randomly, P(A) = 1/2.
2. P(B | A): The probability of picking a black ball from a bag with 2 black balls. This is 1 because
if you pick this bag, you are guaranteed to pick a black ball.
3. P(B): The probability of picking a black ball, which can happen in two ways:
 You pick the bag with 2 black balls and then pick a black ball, which has probability
P(A) * P(B | A) = (1/2) * 1 = 1/2.
 You pick the bag with one black and one white ball and then pick a black ball, which has
probability (1/2) * (1/2) = 1/4.

So, P(B) = 1/2 + 1/4 = 3/4.

Now, we can calculate P(A | B) using Bayes' Theorem:

P(A | B) = P(B | A) * P(A) / P(B) = 1 * (1/2) / (3/4)

Dividing by 3/4 is the same as multiplying by 4/3:

P(A | B) = (1/2) * (4/3) = 2/3

So, the probability that the second ball from the same bag is also black is 2/3.
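
A short Monte Carlo sketch (an illustration, not part of the assignment) whose output should land near 2/3:

import random

bags = [["black", "black"], ["black", "white"]]
first_black = second_black = 0

for _ in range(200000):
    bag = random.choice(bags)[:]          # pick a bag at random
    random.shuffle(bag)
    if bag[0] == "black":                 # condition on the first ball being black
        first_black += 1
        second_black += bag[1] == "black"

print(second_black / first_black)         # ~0.667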

Q.9 Exercise 1.1

Express each of the following tasks in the framework
of learning from data by specifying the input space
X, output space Y, target function f: X → Y, and the
specifics of the data set that we will learn from.
(a) Medical diagnosis: A patient walks in with a
medical history and some symptoms, and you want to
identify the problem.

In the context of learning from data, let's frame the task of medical diagnosis as follows:
Input Space (X): The input space consists of patient data, which includes various
features such as medical history, symptoms, demographic information, lab results, and
any other relevant data that can be collected about the patient.

Output Space (Y): The output space is binary, where Y = {0, 1}. Here, 0 represents the
absence of a specific medical problem or condition, and 1 represents the presence of that
medical problem or condition.

Target Function (f: X → Y): The target function is a mapping from the input space to the
output space. It takes patient data as input and predicts whether the patient has the
specific medical problem or condition.

Data Set: The data set for this task consists of historical patient records, each record
containing a set of features (X) and the corresponding diagnosis label (Y). The data set
should include examples of patients with and without the specific medical problem of
interest, allowing the algorithm to learn patterns and make predictions.

Specifics of the Data Set:

1. Features (X): These features may include the patient's age, gender, medical history (e.g.,
previous diagnoses and treatments), symptoms (e.g., fever, cough, pain), results of
medical tests (e.g., blood tests, imaging scans), and any other relevant information.
2. Labels (Y): The labels indicate whether the patient was diagnosed with the specific
medical problem (Y = 1) or not (Y = 0).
3. Size of the Data Set: The data set should ideally be large and diverse, containing a
sufficient number of positive (Y = 1) and negative (Y = 0) cases to enable the algorithm
to learn meaningful patterns.
4. Data Preprocessing: Data preprocessing steps may include handling missing values,
normalizing or standardizing features, and encoding categorical variables.
5. Model Training: Machine learning algorithms, such as classification algorithms (e.g.,
logistic regression, decision trees, or neural networks), are trained on this data set to
learn the mapping from patient data to diagnosis.
6. Evaluation: The model's performance is evaluated using metrics like accuracy, precision,
recall, F1-score, and ROC AUC to assess how well it can accurately diagnose medical
conditions.

The goal is to build a predictive model that can effectively diagnose medical problems
based on the patient's data, thus assisting healthcare professionals in making informed
decisions about patient care.
(b) Handwritten digit recognition (for example postal zip code
recognition for mail sorting).
In the context of learning from data, let's frame the task of handwritten digit recognition
for postal zip code recognition as follows:

Input Space (X): The input space consists of images of handwritten digits. Each image is
a 2D array of pixel values, where each pixel represents the intensity or color at a
particular location. These images are typically represented as feature vectors after
flattening the 2D arrays.

Output Space (Y): The output space is the set of possible digit classes, which is Y = {0, 1,
2, 3, 4, 5, 6, 7, 8, 9}. Each digit class corresponds to a specific digit from 0 to 9.

Target Function (f: X → Y): The target function is a mapping from the input space to the
output space. It takes an image of a handwritten digit as input and predicts the digit
class it represents.

Data Set: The data set for this task consists of a collection of handwritten digit images,
each associated with a label indicating the correct digit class. This data set is used to
train a machine learning model to recognize handwritten digits.

Specifics of the Data Set:

1. Features (X): Each data point in the data set is an image of a handwritten digit. The
features consist of pixel values in the image, which may be preprocessed by techniques
such as normalization or resizing to ensure uniformity.
2. Labels (Y): The labels indicate the correct digit class for each image, ranging from 0 to 9.
3. Size of the Data Set: The data set should contain a substantial number of examples,
including images of all digit classes. It should be sufficiently diverse to capture
variations in handwriting styles.
4. Data Preprocessing: Preprocessing steps may include converting images to grayscale,
resizing them to a consistent dimension, and normalizing pixel values to a certain range.
5. Model Training: Machine learning algorithms, particularly image classification models
such as convolutional neural networks (CNNs), are trained on this data set to learn the
mapping from images to digit classes.
6. Evaluation: The model's performance is evaluated using metrics such as accuracy,
confusion matrix, and loss function to assess how well it can correctly classify
handwritten digits.

The goal of this task is to develop a model that can automatically recognize handwritten
digits from postal codes on envelopes or packages, assisting in the automation of mail
sorting processes. The model's ability to accurately identify digits is critical for efficient
and error-free mail sorting.
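
As a small illustration of this framework (assumptions: scikit-learn's bundled 8x8 digits data stands in for real zip-code scans, and a simple linear model stands in for a CNN):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)       # X: flattened 8x8 images; y: digit labels 0-9
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=2000)   # a stand-in for the learned f: X -> Y
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))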

(c) Determining if an email is spam or not


In the context of learning from data, let's frame the task of determining if an email is
spam or not as follows:

Input Space (X): The input space consists of email messages. Each email message can be
represented as a text document or a feature vector that encodes the content of the email.

Output Space (Y): The output space is binary, where Y = {0, 1}. Here, 0 represents a
non-spam or legitimate email, and 1 represents a spam email.

Target Function (f: X → Y): The target function is a mapping from the input space to the
output space. It takes an email message as input and predicts whether the email is spam
(Y = 1) or not (Y = 0).

Data Set: The data set for this task consists of a collection of email messages, each
labeled as either spam or non-spam. This data set is used to train a machine learning
model to classify emails.

Specifics of the Data Set:

1. Features (X): Each data point in the data set is an email message. The features consist of
the content of the email, which may include the subject line, sender's address, and the
body of the email. Text data is often preprocessed by techniques such as tokenization,
stemming, and vectorization (e.g., TF-IDF or word embeddings).
2. Labels (Y): The labels indicate whether each email is spam (Y = 1) or non-spam (Y = 0).
3. Size of the Data Set: The data set should contain a significant number of examples,
including both spam and non-spam emails. It should be representative of the types of
emails encountered in the real world.
4. Data Preprocessing: Preprocessing steps may include removing stop words, special
characters, and performing text cleaning to improve the quality of the text data.
5. Model Training: Various machine learning algorithms, such as Naive Bayes, Support
Vector Machines (SVMs), or deep learning models like recurrent neural networks
(RNNs) or transformer-based models, can be trained on this data set to learn to classify
emails as spam or non-spam.
6. Evaluation: The model's performance is evaluated using metrics like accuracy, precision,
recall, F1-score, and receiver operating characteristic (ROC) curve to assess how well it
can accurately classify emails.

The goal of this task is to develop an effective spam filter that can automatically identify
and categorize incoming emails as spam or legitimate, helping users manage their email
inboxes more efficiently and reduce exposure to unwanted or potentially harmful
content.
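
A minimal sketch of such a pipeline (with a tiny hypothetical corpus; a real filter would need a large labeled data set):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting agenda for monday",
          "free money click here", "lunch tomorrow?"]
labels = [1, 0, 1, 0]                    # 1 = spam, 0 = non-spam (hypothetical labels)

vec = TfidfVectorizer()                  # text -> TF-IDF feature vectors (the input space X)
X = vec.fit_transform(emails)

clf = MultinomialNB().fit(X, labels)     # learns the mapping f: X -> Y
print(clf.predict(vec.transform(["free prize money"])))  # likely [1] (spam)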

Exercise 1.3
The weight update rule in (1.3) has the nice interpretation that it moves in
the direction of classifying x(t) correctly.
(a) Show that y(t) w^T(t) x(t) < 0 [hint: x(t) is misclassified by w(t)]

ANS: To show that y(t) w^T(t) x(t) < 0, we use the fact that the current weight
vector w(t) misclassifies x(t), i.e., sign(w^T(t) x(t)) ≠ y(t). Since y(t) ∈ {-1, +1},
misclassification means y(t) and w^T(t) x(t) have opposite signs, so their product is
negative.

Consider the two cases:

1. If y(t) = +1 (indicating a positive class) and x(t) is misclassified, then
sign(w^T(t) x(t)) = -1, so w^T(t) x(t) < 0. Hence
y(t) w^T(t) x(t) = (+1) · w^T(t) x(t) < 0.

2. If y(t) = -1 (indicating a negative class) and x(t) is misclassified, then
sign(w^T(t) x(t)) = +1, so w^T(t) x(t) > 0. Hence
y(t) w^T(t) x(t) = (-1) · w^T(t) x(t) < 0.

In both cases, a misclassified x(t) gives y(t) w^T(t) x(t) < 0, which is what we
wanted to show.

This aligns with the update rule of the perceptron learning algorithm: when x(t)
is misclassified, the algorithm updates the weight vector w(t) so as to move in
the direction of classifying x(t) correctly, which increases the value of
y(t) w^T(t) x(t).
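
A quick numeric check (hypothetical values) that a misclassified point yields a negative product:

import numpy as np

w = np.array([-1.5, 1.0, 1.0])   # current weight vector (borrowed from Q1)
x = np.array([1.0, 1.0, 0.0])    # input with x0 = 1 prepended
y = 1                            # suppose the true label is +1

print(np.sign(w @ x))            # -1.0: x is misclassified by w
print(y * (w @ x))               # -0.5 < 0, as shown above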

(c) As far as classifying x(t) is concerned, argue that the move from w(t) to
w(t+1) is a move 'in the right direction'.

Let's argue that the move from w(t) to w(t+1) in the perceptron learning
algorithm is a move in the right direction concerning the classification of x(t).
The perceptron learning algorithm aims to find a decision boundary that
correctly classifies data points, and the update step ensures progress towards
this goal:

1. Motivation for Perceptron Learning: The perceptron learning algorithm is
designed to find a hyperplane (decision boundary) that can separate two
classes of data points. Its primary motivation is to correctly classify data points
based on their features.
2. Update Rule: The update rule for the perceptron is as follows:
If x(t) is misclassified (i.e., y(t) w^T(t) x(t) < 0), then the perceptron updates the
weight vector as:
w(t+1) = w(t) + η y(t) x(t)
Where:
 w(t+1) is the updated weight vector.
 w(t) is the current weight vector.
 η is the learning rate, a positive constant.
 y(t) is the true label of x(t).
 x(t) is the misclassified input vector.
3. Effect of the Update: The update term ηy(t)x(t) is added to the current weight
vector. This term is proportional to the misclassified input x(t) and the true
label y(t).
4. Correcting Misclassifications: When x(t) is misclassified, the update moves
w(t+1) in the direction that corrects this misclassification. Concretely,
y(t) w^T(t+1) x(t) = y(t) w^T(t) x(t) + η y(t)² ||x(t)||² = y(t) w^T(t) x(t) + η ||x(t)||²,
which is strictly larger than y(t) w^T(t) x(t). If y(t) is positive (indicating a
positive class), the update increases the dot product w^T(t) x(t), and if y(t) is
negative (indicating a negative class), it decreases it. Either way, the quantity
y(t) w^T(t) x(t) moves towards being positive, i.e., towards correct classification
(see the numeric sketch at the end of this answer).
5. Iterative Improvement: The perceptron learning algorithm iteratively updates
the weight vector for each misclassified data point until all data points are
correctly classified or a stopping criterion is met. This iterative process ensures
that the decision boundary is adjusted to minimize classification errors.
6. Convergence: If the data is linearly separable, the perceptron learning
algorithm is guaranteed to converge to a solution where all data points are
correctly classified. In this sense, each update takes the algorithm closer to
this optimal solution.

In summary, the move from w(t) to w(t+1) in the perceptron update rule is
indeed a move in the right direction concerning the classification of x(t) because
it aims to correct misclassifications and adjust the decision boundary to
improve classification accuracy. The algorithm's design ensures that it learns
from mistakes and converges towards a solution that correctly classifies data
points based on their features.
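
A small numeric sketch (hypothetical values, with η = 1) showing that one update strictly increases y(t) w^T(t) x(t):

import numpy as np

w = np.array([-1.5, 1.0, 1.0])   # current weights
x = np.array([1.0, 1.0, 0.0])    # misclassified input (x0 = 1 prepended)
y = 1

print(y * (w @ x))               # -0.5: misclassified
w_next = w + y * x               # perceptron update with eta = 1
print(y * (w_next @ x))          # 1.5 = -0.5 + ||x||^2: moved in the right direction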
