DSA Module 3
Syllabus: Modeling, What Is Machine Learning?, Over fitting and Under fitting,
Correctness, The Bias-Variance Tradeoff, Feature Extraction and Selection, k-Nearest
Neighbors, The Model, Example: The Iris Dataset, The Curse of Dimensionality, Naive Bayes,
A Really Dumb Spam Filter, A More Sophisticated Spam Filter, Implementation, Testing Our
Model, Using Our Model, Simple Linear Regression, The Model, Using Gradient Descent,
Maximum Likelihood Estimation, Multiple Regression, The Model, Further
Assumptions of the Least Squares Model, Fitting the Model, Interpreting the Model, Goodness
of Fit, Digression: The Bootstrap, Standard Errors of Regression Coefficients, Regularization,
Logistic Regression, The Problem, The Logistic Function, Applying the Model, Goodness of
Fit, Support Vector Machines.
Modeling
A model is essentially a simplified representation of reality that helps us to understand, predict, or
control some aspect of the world. It captures the key features and relationships of the phenomena.
The primary goal of a machine learning model is to make predictions or decisions based on input
data.
It’s simply a specification of a mathematical (or probabilistic) relationship that exists between
different variables.
For example, if you want to raise money for a social networking site, you might build a business
model that takes inputs like “number of users,” “ad revenue per user,” and “number of
employees” and outputs your annual profit for the next several years. A cookbook recipe implies a
model that relates inputs like “number of eaters” and “hungriness” to the quantities of ingredients
needed.
The business model is probably based on simple mathematical relationships: profit is revenue minus
expenses, revenue is units sold times average price, and so on. The recipe model is probably based on
trial and error — someone went in a kitchen and tried different combinations of ingredients until they
found one they liked. And a model of how a poker hand plays out would be based on probability theory, the rules of
poker, and some reasonably innocuous assumptions about the random process by which cards are dealt.
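As an illustrative sketch, such a business model can be written as a tiny function; all of the input values and the cost-per-employee figure below are made up for illustration.
def annual_profit(num_users, ad_revenue_per_user, num_employees, cost_per_employee=50000):
    # Toy model: profit = revenue - expenses
    revenue = num_users * ad_revenue_per_user
    expenses = num_employees * cost_per_employee
    return revenue - expenses

print(annual_profit(1000000, 2.5, 30))  # 1000000.0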
Underfitting means producing a model that doesn't perform well even on the training data; the model
fails to capture the relationships between the input features and the outcome.
Consider fitting polynomials of different degrees to the same training data. The horizontal line showing the
best fit degree 0 polynomial severely underfits the training data. The best fit degree 9 polynomial goes
through every training data point exactly, but it very severely overfits; if we were to pick a few more data
points, it would quite likely miss them by a lot. The degree 1 line strikes a nice balance: it is pretty close to
every point, and it will likely be close to new data points as well. Clearly, models that are too complex lead
to overfitting and don't generalize well beyond the data they were trained on. The most fundamental
approach to dealing with this involves using different data to train the model and to test the model.
Overfitting and Underfitting
Overfitting
Causes:
Too many parameters relative to the number of observations.
Model complexity is too high.
Insufficient training data.
Symptoms:
High accuracy on training data.
Low accuracy on validation/test data.
Example: Consider a polynomial regression problem where we are trying to fit a polynomial to
data that has a quadratic relationship.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Noisy data with an underlying quadratic relationship
np.random.seed(0)
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * X ** 2 + X + 2 + np.random.randn(30, 1)
# A degree-10 polynomial is far more complex than the data requires
X_poly = PolynomialFeatures(degree=10).fit_transform(X)
y_pred = LinearRegression().fit(X_poly, y).predict(X_poly)
print("Training MSE:", mean_squared_error(y, y_pred))
plt.scatter(X, y, label='Data')
plt.plot(X, y_pred, color='red', label='Polynomial fit')
plt.legend()
plt.show()
In this example, using a polynomial of degree 10 leads to overfitting. The model fits the training data
very well, capturing noise, but generalizes poorly to new (test) data.
Underfitting
Causes:
Model complexity is too low.
Not enough features.
Excessive regularization.
Symptoms:
Low accuracy on training data.
Low accuracy on validation/test data.
Example: Continuing with the same data, consider fitting a linear regression model:
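A minimal sketch, assuming the same X and y arrays generated in the overfitting example above:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# A plain straight line is too simple for quadratic data
linear_model = LinearRegression().fit(X, y)
print("Training MSE (linear):", mean_squared_error(y, linear_model.predict(X)))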
Here, using a linear regression model leads to underfitting. The model is too simple to capture the
quadratic relationship in the data.
Correctness
Correctness refers to how accurately a model's predictions align with the actual outcomes. It can
be quantified using various evaluation metrics depending on the type of problem and the specific
goals of the model.
In binary classification problems, such as determining whether an email is spam or not, the
performance of the model can be evaluated using a confusion matrix. This matrix summarizes the
outcomes of the predictions made by the model compared to the actual outcomes.
A confusion matrix is a table that is used to describe the performance of a classification
model.
Let's consider, as an example, labeled data for a model that judges whether an email is spam.
Given a set of labeled data and such a predictive model, every data point lies in one of four
categories:
• True Positive (TP): An email is actually spam, and the model correctly identifies it as spam.
• False Positive (FP): An email is not spam, but the model incorrectly identifies it as spam.
• False Negative (FN): An email is spam, but the model incorrectly identifies it as not spam.
• True Negative (TN): An email is not spam, and the model correctly identifies it as not spam.
Correctness can be measured by
Accuracy: The proportion of total predictions that are correct. It is a primary metric for many
classification problems, giving a straightforward measure of how often the model is correct.
Code:
def accuracy(tp, fp, fn, tn):
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

print(accuracy(70, 4930, 13930, 981070))  # 0.98114
Precision: The proportion of positive predictions that are actually correct. It measures how accurate
the model's positive predictions are.
Code:
def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

print(precision(70, 4930, 13930, 981070))  # 0.014
Recall (Sensitivity or True Positive Rate): The proportion of actual positives that are
correctly identified. Precision and recall are crucial in situations where the cost of false positives
and false negatives is high. For example, in medical diagnosis or spam detection, balancing
precision and recall is more important than overall accuracy.
Code:
def recall(tp, fp, fn, tn):
    return tp / (tp + fn)

print(recall(70, 4930, 13930, 981070))  # 0.005
F1 Score: The harmonic mean of precision and recall, providing a balance between the two. It
provides a single metric that balances precision and recall, useful when there is an uneven class
distribution or when one type of error is more significant than the other.
Code:
def f1_score(tp, fp, fn, tn):
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)
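For the same confusion-matrix counts used above, both precision and recall are very low, so the F1 score is low as well:
print(f1_score(70, 4930, 13930, 981070))  # approximately 0.00736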
Usually the choice of a model involves a trade-off between precision and recall.
A model that predicts “yes” when it’s even a little bit confident will probably have a high
recall but a low precision;
A model that predicts “yes” only when it’s extremely confident is likely to have a low recall
and a high precision.
The Bias-Variance Tradeoff
If the model has high bias, it means the model is too simple to capture the underlying patterns in the
data. In such cases, adding more features can help improve the model by providing it with more
information. For example, in the context of polynomial regression, adding higher-degree terms gives a
previously underfitting model more capacity, as sketched below.
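A brief sketch, reusing the quadratic X and y arrays from the earlier overfitting example: adding a squared term gives the high-bias linear model enough capacity.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Degree 2 matches the true relationship, reducing bias without overfitting
X_quad = PolynomialFeatures(degree=2).fit_transform(X)
quad_model = LinearRegression().fit(X_quad, y)
print("Training MSE (degree 2):", mean_squared_error(y, quad_model.predict(X_quad)))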
If the model has high variance, it means the model is too complex and is overfitting the
training data. Removing some features can help by simplifying the model, thus reducing the
variance.
Another effective way to reduce or prevent high variance is to gather more data.
More data can help the model generalize better because it provides more examples for the
model to learn from, reducing the risk of overfitting.
Figure 11-2 shows the result of fitting a degree 9 polynomial to samples of different sizes. If the model is
trained on 100 data points, there is much less overfitting. And the model trained on 1,000 data points looks
very similar to the degree 1 model. Holding model complexity constant, the more data you have, the harder
it is to overfit.
Feature Extraction and Selection
Features are the inputs we provide to our model, and they typically fall into one of three categories.
The first is a yes-or-no (Boolean) value. The second is a number. And the third is a choice from a
discrete set of options. We extract features from our data so that each one falls into one of these three
categories. The features are chosen through a combination of experience and domain expertise.
k-Nearest Neighbors
Working of KNN:
In a classification problem, the class label is determined by majority voting: the class with the most
occurrences among the K nearest neighbors becomes the predicted class for the target data point.
In a regression problem, the predicted value is calculated by taking the average of the target values of
the K nearest neighbors. This average becomes the predicted output for the target data point.
Let X be the training dataset with n data points, where each data point is represented by a d-dimensional
feature vector, and let Y be the corresponding labels or values for each data point in X. Given a new data
point x, the algorithm calculates the distance between x and each data point xi in X using a distance
metric, such as Euclidean distance:
d(x, xi) = √( (x1 − xi1)² + (x2 − xi2)² + … + (xd − xid)² )
The algorithm selects the K data points from X that have the shortest distances to x. For classification
tasks, the algorithm assigns the label y that is most frequent among the K nearest neighbors of x. For
regression tasks, the algorithm calculates the average or weighted average of the values y of the K
nearest neighbors and assigns it as the predicted value for x.
Python program to build a nearest neighbor model that can predict the class
from the IRIS dataset
import numpy as np
from collections import Counter

# Sample dataset (Iris data: sepal length, sepal width, petal length, petal width, species)
data = [
    (5.1, 3.5, 1.4, 0.2, 'setosa'),
    (4.9, 3.0, 1.4, 0.2, 'setosa'),
    (5.0, 3.6, 1.4, 0.2, 'setosa'),
    (6.7, 3.0, 5.0, 1.7, 'versicolor'),
    (6.3, 3.3, 6.0, 2.5, 'virginica'),
    (5.8, 2.7, 5.1, 1.9, 'virginica'),
]

# New data point (sepal length, sepal width, petal length, petal width)
new_point = (5.5, 3.4, 1.5, 0.2)

# Euclidean distance from new_point to every training point, paired with its label
distances = [(np.linalg.norm(np.array(row[:4]) - np.array(new_point)), row[4])
             for row in data]

# Majority vote among the k nearest neighbors
k = 3
nearest_labels = [label for _, label in sorted(distances)[:k]]
print("Predicted class:", Counter(nearest_labels).most_common(1)[0][0])  # setosa
Dimensionality Reduction
Dimensionality reduction is a process used to reduce the number of features (dimensions) in a
dataset while retaining as much information as possible. This technique helps in simplifying
models, reducing computational costs, and mitigating issues related to the curse of dimensionality.
Principal Component Analysis (PCA)
PCA is a widely used technique for dimensionality reduction that transforms the original features
into a new set of uncorrelated features called principal components. The first principal component
captures the most variance in the data, and each subsequent component captures the remaining
variance under the constraint of being orthogonal to the previous components.
Example: Dimensionality Reduction Using PCA
1. Standardize the Data: Standardization ensures that each feature contributes equally to the
analysis by scaling the data to have a mean of 0 and a standard deviation of 1.
2. Compute the Covariance Matrix: The covariance matrix describes the variance and the
covariance between the features.
3. Compute the Eigenvalues and Eigenvectors: The eigenvectors determine the directions of the
new feature space, while the eigenvalues determine their magnitude (i.e., the amount of variance
captured by each principal component).
4. Sort Eigenvalues and Select Principal Components: The eigenvalues are sorted in descending
order, and the top k eigenvalues are selected. The corresponding eigenvectors form the new feature
space.
5. Transform the Data: The original data is projected onto the new feature space to obtain the
reduced dataset.
CODE:
Example using Python to illustrate PCA for dimensionality reduction:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Generate a synthetic dataset
np.random.seed(42)
X = np.random.rand(100, 3)  # 100 samples, 3 features

# Standardize the data
X_scaled = StandardScaler().fit_transform(X)

# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot the first two original (standardized) features next to the reduced data
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c='blue', label='Original Data (first 2 features)')
ax[0].set_xlabel('Feature 1')
ax[0].set_ylabel('Feature 2')
ax[0].legend()
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c='red', label='PCA Reduced Data')
ax[1].set_xlabel('Principal Component 1')
ax[1].set_ylabel('Principal Component 2')
ax[1].legend()
plt.show()
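As a quick follow-up using the pca object fitted above, the fraction of variance captured by each retained component can be inspected:
# Proportion of the total variance explained by each principal component
print(pca.explained_variance_ratio_)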
Naive Bayes
Naive Bayes is a probabilistic classification algorithm based on Bayes' Theorem, with the
assumption that the features are independent given the class label.
This model predicts the probability that an instance belongs to a class given a set of feature values.
It is a probabilistic classifier, and it is called naive because it assumes that each feature in the model
is independent of the other features given the class. In other words, each feature contributes to the
prediction with no relation to the others. It uses Bayes' theorem for training and prediction.
The core assumption of Naive Bayes is conditional independence.
Mathematically, if Xi and Xj represent the events that the ith and jth words are present in the message,
then the assumption says:
P(Xi and Xj | spam) = P(Xi | spam) ⋅ P(Xj | spam)
In reality this assumption is unrealistic (words like "lottery" and "rolex" may well tend to occur
together in spam). Despite this, the model often performs well and is used in actual spam filters.
The same Bayes's theorem reasoning used for the single-word ("lottery"-only) spam filter tells us that we can
calculate the probability that a message is spam using the equation:
P(spam | X1, …, Xn) = P(X1, …, Xn | spam) P(spam) / [ P(X1, …, Xn | spam) P(spam) + P(X1, …, Xn | not spam) P(not spam) ]
where, under the independence assumption, P(X1, …, Xn | spam) = P(X1 | spam) ⋅ … ⋅ P(Xn | spam).
Let us consider three words, "rolex," "lottery," and "meeting," with the following probabilities based on
historical data.
Given that a message contains the words "rolex" and "lottery," there is a 98% chance it is spam and a
2% chance it is not spam (ham).
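As a worked check: the original probability table is not reproduced here, so the conditional probabilities below are assumed purely for illustration; with equal priors they reproduce the 98% figure.
# Assumed (illustrative) word probabilities, not taken from the original table
p_rolex_spam, p_lottery_spam = 0.7, 0.7
p_rolex_ham, p_lottery_ham = 0.1, 0.1
p_spam = p_ham = 0.5  # equal prior probabilities

spam_score = p_rolex_spam * p_lottery_spam * p_spam
ham_score = p_rolex_ham * p_lottery_ham * p_ham
print(spam_score / (spam_score + ham_score))  # 0.98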
Python Code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample data
data = {
    'message': [
        'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)',
        'Nah I don\'t think he goes to usf, he lives around here though',
        'WINNER!! As a valued network customer you have been selected to receive a £900 prize reward!',
        'I HAVE A work ON SUNDAY !!',
        'Had your mobile 11 months or more? U R entitled to update to the latest colour mobiles with camera for free! Call The Mobile Update Co FREE on 08002986030'
    ],
    'label': ['spam', 'ham', 'spam', 'ham', 'spam']
}

# Convert data to DataFrame
df = pd.DataFrame(data)

# Feature extraction
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message'])
y = df['label']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Train a multinomial Naive Bayes classifier
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Make predictions
y_pred = nb.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print('Classification Report:')
print(report)
Simple Linear Regression
• Regression is a statistical technique used to model and analyze the relationships between
variables.
• It helps in understanding how the dependent variable (Y) changes when any one of the
independent variables (X) is varied.
• The primary goal of regression is to predict or estimate the value of the dependent variable based
on the values of one or more independent variables.
• Simple Linear Regression is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables:
• Independent Variable (X): Also known as the predictor or explanatory variable.
• Dependent Variable (Y): Also known as the response or outcome variable.
• The goal of Simple Linear Regression is to model the relationship between these two variables
by fitting a linear equation to the observed data.
• The linear equation for a Simple Linear Regression model is:
Yi = β0 + β1Xi + ϵi
Y is the dependent variable.
X is the independent variable.
β0 is the intercept, the expected value of Y when X is zero.
β1 is the slope of the regression line. It represents the change in Y for a one-unit change in X.
ϵ is the error term, which accounts for the variability in Y that cannot be explained by the linear
relationship with X.
R-squared (R²): The proportion of the variance in the dependent variable that is predictable from
the independent variable.
Code
import numpy as np
import matplotlib.pyplot as plt

# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 7, 11])

# Calculate means
x_mean = np.mean(x)
y_mean = np.mean(y)

# Least squares estimates of the slope and intercept
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

# Make predictions
y_pred = beta_0 + beta_1 * x
print(f'Intercept: {beta_0}, Slope: {beta_1}')

plt.scatter(x, y, label='Data')
plt.plot(x, y_pred, color='red', label='Fitted line')
plt.legend()
plt.show()
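As a follow-up sketch using the arrays computed above, R-squared can be obtained from the residuals:
# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y_mean) ** 2)
print(f'R-squared: {1 - ss_res / ss_tot}')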
Using Gradient Descent
The coefficients can also be estimated with gradient descent by minimizing the mean squared error cost function:
J(θ) = (1/m) Σi ( hθ(x(i)) − y(i) )²
where:
m is the number of training examples.
hθ(x(i)) is the predicted value for the ith training example.
y(i) is the actual value for the ith training example.
The gradient descent algorithm for linear regression involves computing the partial derivative of J(θ)
with respect to each parameter θj,
∂J(θ)/∂θj = (2/m) Σi ( hθ(x(i)) − y(i) ) xj(i),
and repeatedly updating θj ← θj − α ∂J(θ)/∂θj, where α is the learning rate.
import numpy as np

# Synthetic data: y = 4 + 3x plus noise
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]  # add a bias column of 1s

# Parameters
learning_rate = 0.1
n_iterations = 1000
m = len(X_b)
theta = np.random.randn(2, 1)  # random initialization

# Gradient Descent
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - learning_rate * gradients
print(theta)  # should be close to [[4.], [3.]]
Multiple Regression
Multiple regression is a statistical technique used to understand the relationship between one
dependent variable and two or more independent variables. It extends simple linear regression, which
involves only one independent variable, by incorporating multiple predictors to better capture the
complexity of real-world phenomena.
Y = β0 + β1X1 + β2X2 + … + βnXn + ϵ
Y: Dependent variable.
β0: Intercept, the expected value of Y when all Xs are zero.
β1,β2,…,βn : Coefficients representing the change in Y for a one-unit change in the
corresponding X, holding other variables constant.
X1,X2,…,Xn : Independent variables.
ϵ: Error term, representing the deviation of observed values from the predicted values.
1. Model Specification: Define the dependent variable and select the independent variables
based on theoretical understanding or empirical evidence.
2. Data Collection: Gather data for the dependent and independent variables. Ensure the data is
clean and suitable for analysis.
3. Estimation of Coefficients: Use statistical software to estimate the coefficients (β) of
the regression equation. This is typically done using the Ordinary Least Squares (OLS)
method, which minimizes the sum of the squared differences between observed and predicted
values (see the sketch after this list).
4. Model Evaluation: Assess the model's performance using various metrics:
o R-squared (R²): Proportion of variance in the dependent variable explained by the
independent variables.
o Adjusted R-squared: Adjusts R² for the number of predictors in the model.
o F-test: Tests the overall significance of the model.
o t-tests: Assess the significance of individual coefficients.
5. Assumption Checking: Ensure that the model meets the assumptions of multiple regression:
o Linearity: The relationship between the dependent and independent variables is linear.
o Independence: Observations are independent of each other.
o Homoscedasticity: Constant variance of errors across all levels of the independent
variables.
o Normality: Errors are normally distributed.
6. Diagnostics and Refinement: Perform residual analysis to check for any patterns in the
residuals that might indicate model misspecification. Address issues like multicollinearity
(high correlation among predictors) if they arise.
7. Interpretation: Interpret the coefficients to understand the impact of each independent
variable on the dependent variable. For example, a coefficient of 2 for X1 means that a one-unit
increase in X1 is associated with a 2-unit increase in Y, holding the other variables constant.
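As a sketch of steps 3 and 4 above, assuming the statsmodels library and a small made-up dataset, the coefficients and the usual fit statistics can be obtained as follows:
import numpy as np
import statsmodels.api as sm

# Hypothetical data: two predictors and one response
np.random.seed(0)
X = np.random.rand(50, 2)
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * np.random.randn(50)

# Ordinary Least Squares with an intercept term
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)        # estimated beta_0, beta_1, beta_2
print(model.rsquared)      # R-squared
print(model.rsquared_adj)  # adjusted R-squared
print(model.fvalue)        # F-statistic
print(model.pvalues)       # t-test p-values for each coefficient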
Goodness of Fit
The goodness of fit in a multiple regression model refers to how well the model's predicted values
match the actual observed values. It helps in determining the model's accuracy and reliability.
1. R-squared (R²):R² represents the proportion of the variance in the dependent variable that is
predictable from the independent variables. R² values range from 0 to 1. An R² value closer to 1
indicates a better fit, meaning a higher proportion of variance is explained by the model. For instance,
an R² of 0.8 means 80% of the variance in the dependent variable is explained by the independent
variables.
2. Adjusted R-squared: Adjusted R² adjusts the R² value based on the number of predictors in the
model. It accounts for the fact that adding more variables to a model will inherently increase R²,
regardless of whether those variables are meaningful. Adjusted R² is useful for comparing models with
a different number of predictors. It can decrease if the added variable doesn't improve the model more
than would be expected by chance.
3. F-statistic and p-value: The F-statistic tests the overall significance of the model. It checks
whether at least one of the predictors is significantly related to the dependent variable. A higher F-
statistic value and a lower p-value (typically less than 0.05) indicate that the model is statistically
significant.
4. Root Mean Squared Error (RMSE): RMSE is the square root of the average of the squared
differences between the observed and predicted values. RMSE provides a measure of the average
magnitude of the errors in prediction. Lower RMSE values indicate a better fit.
5. Mean Absolute Error (MAE):MAE is the average of the absolute differences between the
observed and predicted values. Like RMSE, lower MAE values indicate a better fit. It is less sensitive
to large errors compared to RMSE.
6. Residual Plots: Residual plots show the difference between the observed and predicted values
(residuals) on the y-axis and the predicted values on the x-axis. A good fit is indicated by residuals
that are randomly scattered around zero, with no clear patterns. Patterns or systematic deviations
suggest model inadequacies.
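A brief sketch of computing RMSE and MAE with scikit-learn; the observed and predicted values here are hypothetical.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.5, 9.0])   # hypothetical observed values
y_hat = np.array([2.8, 5.3, 7.0, 9.4])    # hypothetical predicted values
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
mae = mean_absolute_error(y_true, y_hat)
print(f'RMSE: {rmse:.3f}, MAE: {mae:.3f}')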
Goodness of fit matters for:
Model Validation: Ensuring the model reliably predicts outcomes on new data.
Decision Making: Better-fitting models provide more accurate and reliable information for
decision making.
Model Comparison: Goodness of fit metrics allow for comparison between different models
to choose the best one.
Multiple regression allows for the examination of the relationship between a dependent variable and
multiple independent variables simultaneously. This helps in understanding how various factors
collectively influence the outcome.
In many real-world scenarios, the effect of one independent variable on the dependent variable might
be influenced by the presence of other variables. Multiple regression helps to isolate the effect of each
independent variable by controlling for others, reducing potential confounding effects.
By incorporating multiple predictors, the model can capture more information about the dependent
variable, leading to better predictive accuracy compared to simple regression models with a single
predictor.
Multiple regression helps in identifying which independent variables have a significant impact on the
dependent variable. This is particularly useful in fields like economics, medicine, and social sciences,
where understanding the importance of various factors is crucial.
The coefficients in a multiple regression model quantify the impact of each independent variable on
the dependent variable, providing valuable insights into the strength and direction of these
relationships.
6. Handling Multicollinearity
In multiple regression, it's important to assess and handle multicollinearity (when independent
variables are highly correlated). Properly fitting the model involves diagnosing and addressing
multicollinearity to ensure reliable and interpretable results.
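One common diagnostic is the variance inflation factor (VIF); a sketch assuming statsmodels and deliberately correlated, made-up predictors:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; X3 is deliberately almost a copy of X1
np.random.seed(0)
X = pd.DataFrame({'X1': np.random.rand(50), 'X2': np.random.rand(50)})
X['X3'] = 0.9 * X['X1'] + 0.1 * np.random.rand(50)

X_const = sm.add_constant(X)
for i, col in enumerate(X_const.columns):
    # A large VIF for a predictor flags multicollinearity
    print(col, variance_inflation_factor(X_const.values, i))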
7. Generalizability of Findings
A well-fitted multiple regression model that accounts for multiple factors is more likely to generalize
to new data, making the findings more robust and applicable in various contexts.
Fitting the model involves checking for assumptions (linearity, independence, homoscedasticity,
normality) and performing diagnostics (residual analysis, influence analysis) to ensure the validity of
the model. This step is crucial for the reliability of the regression results.
Multiple regression serves as a foundation for more complex analyses like interaction effects,
polynomial regression, and hierarchical regression, expanding the analytical capabilities for research
and decision-making.
In applied fields, such as business, public policy, and healthcare, multiple regression models provide
evidence-based insights that inform strategic decisions and policy-making by highlighting key factors
and their relative importance.
Standard Errors of Regression Coefficients
The standard error of a regression coefficient measures how precisely that coefficient has been estimated.
Standard errors matter for several reasons:
Precision of Estimates: Smaller standard errors indicate more precise estimates of the
regression coefficients.
Confidence Intervals: They are used to construct confidence intervals around the regression
coefficients, providing a range within which the true population parameter is likely to lie.
Hypothesis Testing: Standard errors are crucial for conducting hypothesis tests, such as
determining whether a regression coefficient is significantly different from zero.
Calculation
For a simple linear regression with one predictor, the standard error of the slope coefficient can
be calculated as follows:
SE(β1) = s / √( Σ(xi − x̄)² ), where s = √( Σ(yi − ŷi)² / (n − 2) )
Where:
s is the residual standard error, x̄ is the mean of the predictor, ŷi are the fitted values, and n is the number of observations.
In a multiple regression model with several predictors, the calculation involves more complex matrix
operations, but the concept remains the same.
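A minimal sketch, reusing x, y, x_mean, and y_pred from the simple linear regression code earlier:
import numpy as np

# Residual standard error with n - 2 degrees of freedom
n = len(x)
s = np.sqrt(np.sum((y - y_pred) ** 2) / (n - 2))

# Standard error of the slope coefficient
se_beta_1 = s / np.sqrt(np.sum((x - x_mean) ** 2))
print(f'SE(beta_1): {se_beta_1}')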
Smaller Standard Errors: Indicate that the coefficient estimate is more reliable and that the
predictor variable has a more stable relationship with the response variable.
Larger Standard Errors: Suggest that the coefficient estimate is less reliable, indicating more
variability in the estimate and a less stable relationship with the response variable.
Confidence Intervals
A confidence interval for a coefficient βj is constructed as βj ± t · SE(βj), where t is the critical value
from the t-distribution for the desired confidence level and SE(βj) is the standard error of the coefficient.
Interpretation: A 95% confidence interval means that we are 95% confident that the true
population parameter lies within this interval.
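Continuing the standard-error sketch above, and assuming SciPy is available for the t critical value:
from scipy import stats

# 95% confidence interval for the slope
t_crit = stats.t.ppf(0.975, df=n - 2)
print(f'95% CI for beta_1: ({beta_1 - t_crit * se_beta_1}, {beta_1 + t_crit * se_beta_1})')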
Hypothesis Testing
Null Hypothesis (H₀): The null hypothesis usually states that the coefficient is equal to zero
(βj = 0), meaning the predictor has no effect on the response variable.
Test Statistic: The t-statistic for testing this hypothesis is calculated as t = βj / SE(βj).
p-value: The p-value associated with this t-statistic helps determine whether to reject the null
hypothesis. A smaller p-value (typically < 0.05) indicates that the coefficient is significantly
different from zero.
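And the corresponding t-test, using the same assumed variables as the sketches above:
# t-statistic and two-sided p-value for H0: beta_1 = 0
t_stat = beta_1 / se_beta_1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))
print(f't = {t_stat}, p-value = {p_value}')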
Several factors influence the size of the standard errors:
Sample Size: Larger sample sizes tend to produce smaller standard errors, indicating more
reliable estimates.
Variance of Errors: Higher variance in the residuals leads to larger standard errors, indicating
less precise estimates.
Multicollinearity: When predictor variables are highly correlated, it inflates the standard
errors of the coefficients, making it harder to determine the individual effect of each predictor.
Logistic Regression
Logistic regression is a statistical method used for binary classification problems, where the outcome
variable is categorical and has two possible outcomes (0 or 1). It models the probability of a binary
response based on one or more predictor variables.
Logistic Function
The logistic function (or sigmoid function) is used to model the probability of the positive (default) class:
f(x) = 1 / (1 + e^(−x))
In logistic regression, the linear relationship between the input features and the log-odds of the
probability is modeled as:
log( p / (1 − p) ) = β0 + β1x1 + β2x2 + … + βnxn
where:
p is the predicted probability that the outcome is 1, x1, x2, …, xn are the input features, and β0, β1, …, βn are the model coefficients.
CODE:
import math

def logistic(x):
    return 1.0 / (1 + math.exp(-x))

The derivative of the logistic function, which is useful for gradient descent, is:

def logistic_prime(x):
    return logistic(x) * (1 - logistic(x))
For a single data point (xi, yi), the log-likelihood of the parameters β is
log L(β | xi, yi) = yi log f(xi β) + (1 − yi) log(1 − f(xi β)),
where f is the logistic function. For the entire dataset, assuming independence of data points, the
overall log-likelihood is the sum of the individual log-likelihoods:
log L(β) = Σi [ yi log f(xi β) + (1 − yi) log(1 − f(xi β)) ]
The gradient of the log-likelihood function with respect to a parameter βj for a single data
point (xi, yi) is:
∂ log L / ∂βj = ( yi − f(xi β) ) xij
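A compact sketch of fitting the coefficients by gradient ascent on this log-likelihood; the data below is synthetic, and the learning rate and iteration count are arbitrary choices.
import numpy as np

def logistic(z):
    return 1.0 / (1 + np.exp(-z))

# Synthetic data: X includes a leading column of 1s for the intercept
np.random.seed(1)
X = np.c_[np.ones(100), np.random.randn(100, 2)]
true_beta = np.array([0.5, 2.0, -1.0])
y = (np.random.rand(100) < logistic(X @ true_beta)).astype(float)

beta = np.zeros(3)
learning_rate = 0.1
for _ in range(1000):
    gradient = X.T @ (y - logistic(X @ beta))  # gradient of the log-likelihood
    beta += learning_rate * gradient / len(y)
print(beta)  # should move toward the data-generating coefficients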
Logistic regression is used for binary classification problems, where the outcome is either 0 or 1. The
logistic function maps any real-valued number into the range (0, 1), making it suitable for predicting
probabilities of binary outcomes.
Predicting Probabilities: The output of the logistic function can be interpreted as the
probability that a given input belongs to the positive class.
The logistic function maps any real-valued number into the range (0, 1). This property
is particularly useful in classification problems, as it allows the model to output probabilities of
the input data belonging to a certain class. The output of the logistic function can be interpreted
as the probability P that a given input x belongs to the positive class (e.g., y = 1).
Non-linearity: The logistic function introduces non-linearity into the model, making it capable
of capturing complex relationships between the input features and the output.
Bounded Output: The output of the logistic function is always between 0 and 1, making it
suitable for probability prediction.
Differentiability: The logistic function is differentiable, which allows the use of gradient-
based optimization techniques for training the model.
Setting the predicted probability equal to 0.5 defines a decision boundary. This boundary is a hyperplane
in the feature space that separates the data into two classes. Points on one side of the hyperplane are
classified as one class, while points on the other side are classified as the other class. The hyperplane
represents the threshold where the predicted probability of the positive class is 0.5.
Support Vector Machines
An alternative approach to classification is to look for the hyperplane that "best" separates the
classes in the training data. This is the idea behind the support vector machine, which finds the
hyperplane that maximizes the distance to the nearest point in each class.
Finding such a hyperplane is an optimization problem that involves techniques too advanced for this discussion.
A different problem is that a separating hyperplane might not exist at all. In this "who pays?" data set
there simply is no line that perfectly separates the paid users from the unpaid users.
One way around this is to transform the data into a higher-dimensional space. For
example, consider the simple one-dimensional data set shown in the figure:
it is clear that there is no hyperplane that separates the positive examples from the negative ones.
However, if we map this data set to two dimensions by sending each point x to (x, x**2), it becomes
possible to find a hyperplane that splits the data, as shown in the figure below. This is usually called the
kernel trick, because rather than actually mapping the points into the higher-dimensional space, we
use a "kernel" function to compute dot products in the higher-dimensional space and use those to
find a hyperplane.
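As an illustrative sketch, assuming scikit-learn and a made-up one-dimensional dataset: first mapping x to (x, x**2) explicitly, then letting a polynomial kernel do the same thing implicitly.
import numpy as np
from sklearn.svm import SVC

# One-dimensional data: the positive class sits between the negatives,
# so no single threshold separates them in one dimension
X = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Explicit mapping x -> (x, x**2); in this space a separating line exists
X_mapped = np.hstack([X, X ** 2])
linear_svm = SVC(kernel='linear').fit(X_mapped, y)
print(linear_svm.predict([[0.5, 0.25], [2.5, 6.25]]))  # expected: [ 1 -1]

# The kernel trick does the same implicitly via a degree-2 polynomial kernel
poly_svm = SVC(kernel='poly', degree=2).fit(X, y)
print(poly_svm.predict([[0.5], [2.5]]))  # expected: [ 1 -1]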
Support Vector Machines aim to find the best decision boundary (hyperplane) that separates the
classes in the feature space. The goal is to maximize the margin, which is the distance between the
hyperplane and the nearest data points from each class (the support vectors).
Mathematical Formulation
Given a dataset {(xi, yi)} of n data points, where xi is a feature vector and yi is the label (either -1 or
1 for binary classification), the hyperplane can be defined as:
w⋅x−b=0
where w is the weight vector and b is the bias term.
The objective of SVM is to find w and b that maximize the margin M, which is:
M=2/∥w∥
The constraints for correctly classifying all data points are:
yi(w⋅xi − b) ≥ 1 for all i
Optimization Problem
The optimization problem can be formulated as:
Minimize (1/2)∥w∥²
subject to
yi(w⋅xi − b) ≥ 1
This is a quadratic programming problem that can be solved using various optimization techniques.
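As a sketch with scikit-learn (which parameterizes the hyperplane as w⋅x + b = 0) on made-up, linearly separable data, the fitted w, b, margin width, and support vectors can be read off directly:
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (made-up data)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)  # a large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print('w =', w, 'b =', b)
print('margin width =', 2 / np.linalg.norm(w))
print('support vectors:', clf.support_vectors_)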