
SUBJECT: Data Science and Its Applications (21AD62)

MODULE-1 MACHINE LEARNING

Syllabus: Modeling, What Is Machine Learning?, Over fitting and Under fitting,
Correctness, The Bias-Variance Tradeoff, Feature Extraction and Selection, k-Nearest
Neighbors, The Model, Example: The Iris Dataset, The Curse of Dimensionality, Naive Bayes,
A Really Dumb Spam Filter, A More Sophisticated Spam Filter, Implementation, Testing Our
Model, Using Our Model, Simple Linear Regression, The Model, Using Gradient Descent,
Maximum Likelihood Estimation, Multiple Regression, The Model, Further
Assumptions of the Least Squares Model, Fitting the Model, Interpreting the Model, Goodness
of Fit, Digression: The Bootstrap, Standard Errors of Regression Coefficients, Regularization,
Logistic Regression, The Problem, The Logistic Function, Applying the Model, Goodness of
Fit, Support Vector Machines.

Modeling
A model is essentially a simplified representation of reality that helps us to understand, predict, or
control some aspect of the world. It captures the key features and relationships of the phenomena.
The primary goal of a machine learning model is to make predictions or decisions based on input
data.
It’s simply a specification of a mathematical (or probabilistic) relationship that exists between
different variables.
For example, if we want to raise money for a social networking site, we might build a business model that takes inputs like "number of users," "ad revenue per user," and "number of employees" and outputs the annual profit for the next several years. A cookbook recipe entails a model that relates inputs like "number of eaters" and "hungriness" to the quantities of ingredients needed. And a poker-playing program might use a model that estimates each player's chance of winning based on the cards revealed so far.
The business model is probably based on simple mathematical relationships: profit is revenue minus expenses, revenue is units sold times average price, and so on. The recipe model is probably based on trial and error: someone went into a kitchen and tried different combinations of ingredients until they found one they liked. And the poker model is based on probability theory, the rules of poker, and some reasonably innocuous assumptions about the random process by which cards are dealt.
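As a rough illustration, the business model described above can be written as a few lines of Python; the numbers and the per-employee cost used here are purely illustrative assumptions, not part of the notes:

def annual_profit(num_users, ad_revenue_per_user, num_employees, cost_per_employee=50000):
    # Toy business model: profit = revenue - expenses (all inputs are illustrative)
    revenue = num_users * ad_revenue_per_user
    expenses = num_employees * cost_per_employee
    return revenue - expenses

print(annual_profit(1000000, 2.5, 30))  # 1,000,000 users at $2.5 each, minus 30 employees at $50,000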

What Is Machine Learning?


Machine learning refers to creating and using models that are learned from data. In other
contexts this might be called predictive modeling or data mining. Typically, the goal is to use
existing data to develop models that we can use to predict various outcomes for new data, such as:
 Predicting whether an email message is spam or not
 Predicting whether a credit card transaction is fraudulent
 Predicting which advertisement a shopper is most likely to click on
 Predicting which football team is going to win the Super Bowl

Overfitting and Underfitting


Overfitting means producing a model that performs well on the training data but generalizes poorly to any new data. This could involve learning noise in the data, or it could involve learning to identify specific inputs rather than whatever factors are actually predictive for the desired output.

Underfitting means producing a model that doesn't perform well even on the training data, because the model fails to capture the relationships between the input features and the outcome.

Consider the figure below, in which three polynomials are fit to a sample of data.

The horizontal line shows the best-fit degree 0 polynomial. It severely underfits the training data. The best-fit degree 9 polynomial goes through every training data point exactly, but it very severely overfits: if we were to pick a few more data points, it would quite likely miss them by a lot. The degree 1 line strikes a nice balance: it's pretty close to every point, and it will likely be close to new data points as well. Clearly, models that are too complex lead to overfitting and don't generalize well beyond the data they were trained on. The most fundamental approach to addressing this involves using different data to train the model and to test the model.
Overfitting
Causes:
 Too many parameters relative to the number of observations.
 Model complexity is too high.
 Insufficient training data.
Symptoms:
 High accuracy on training data.
 Low accuracy on validation/test data.

Example: Consider a polynomial regression problem where we are trying to fit a polynomial to
data that has a quadratic relationship.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate some data


np.random.seed(0)
X = np.random.normal(0, 1, 100)
y = 2 * X ** 2 + 3 + np.random.normal(0, 0.5, 100)

# Split the data into training and test sets


X_train = X[:80]
y_train = y[:80]
X_test = X[80:]
y_test = y[80:]

# Reshape data for sklearn


X_train = X_train[:, np.newaxis]
X_test = X_test[:, np.newaxis]

# Fit polynomial regression with a high degree


poly = PolynomialFeatures(degree=10)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)
model = LinearRegression()
model.fit(X_poly_train, y_train)
y_poly_pred_train = model.predict(X_poly_train)
y_poly_pred_test = model.predict(X_poly_test)

# Plot the data and the polynomial regression line


plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Test data')
plt.plot(np.sort(X_train[:, 0]), y_poly_pred_train[np.argsort(X_train[:, 0])],
         color='green', label='Polynomial fit')
plt.legend()
plt.show()

# Calculate and print training and test errors


train_error = mean_squared_error(y_train, y_poly_pred_train)
test_error = mean_squared_error(y_test, y_poly_pred_test)
print(f'Training error: {train_error}')
print(f'Test error: {test_error}')

In this example, using a polynomial of degree 10 leads to overfitting. The model fits the training data
very well, capturing noise, but generalizes poorly to the test data.

Underfitting
Causes:
 Model complexity is too low.
 Not enough features.
 Excessive regularization.
Symptoms:
 Low accuracy on training data.
 Low accuracy on validation/test data.
Example: Continuing with the same data, consider fitting a linear regression model:

# Fit a linear regression model


model = LinearRegression()
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Plot the data and the linear regression line


plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Test data')
plt.plot(np.sort(X_train[:, 0]), y_pred_train[np.argsort(X_train[:, 0])], color='green', label='Linear fit')
plt.legend()
plt.show()

# Calculate and print training and test errors


train_error = mean_squared_error(y_train, y_pred_train)
test_error = mean_squared_error(y_test, y_pred_test)
print(f'Training error: {train_error}')
print(f'Test error: {test_error}')
Here, using a linear regression model leads to underfitting. The model is too simple to capture the
quadratic relationship in the data.


Correctness
Correctness refers to how accurately a model's predictions align with the actual outcomes. It can
be quantified using various evaluation metrics depending on the type of problem and the specific
goals of the model.
In binary classification problems, such as determining whether an email is spam or not, the
performance of the model can be evaluated using a confusion matrix. This matrix summarizes the
outcomes of the predictions made by the model compared to the actual outcomes.
A confusion matrix is a table that is used to describe the performance of a classification
model.
Let's consider labeled data for building a model to make such a judgment.

                    Predict "Spam"          Predict "Not Spam"

Actual Spam         True Positive (TP)      False Negative (FN)

Actual Not Spam     False Positive (FP)     True Negative (TN)

Given a set of labeled data and such a predictive model, every data point lies in one of four
categories:
• True Positive (TP): An email is actually spam, and the model correctly identifies it as spam.
• False Positive (FP) An email is not spam, but the model incorrectly identifies it as spam
• False Negative (FN) : An email is spam, but the model incorrectly identifies it as not spam.
• True Negative (TN): An email is not spam, and the model correctly identifies it as not spam.
Correctness can be measured by

Accuracy: The proportion of total predictions that are correct. It is a primary metric for many
classification problems, giving a straightforward measure of how often the model is correct.

Code:
def accuracy(tp, fp, fn, tn):
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

print(accuracy(70, 4930, 13930, 981070))  # 0.98114

Precision: The proportion of positive predictions that are actually correct



Code:
def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

print(precision(70, 4930, 13930, 981070))  # 0.014
Recall (Sensitivity or True Positive Rate): The proportion of actual positives that are
correctly identified. Precision and recall are crucial in situations where the cost of false positives
and false negatives is high. For example, in medical diagnosis, spam detection, etc., balancing
precision and recall is more important than overall accuracy.

Code:
def recall(tp, fp, fn, tn):
    return tp / (tp + fn)

print(recall(70, 4930, 13930, 981070))  # 0.005

F1 Score: The harmonic mean of precision and recall, providing a balance between the two. It
provides a single metric that balances precision and recall, useful when there is an uneven class
distribution or when one type of error is more significant than the other.

Code:
def f1_score(tp, fp, fn, tn):
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)

Usually the choice of a model involves a trade-off between precision and recall.
 A model that predicts “yes” when it’s even a little bit confident will probably have a high
recall but a low precision;
 A model that predicts “yes” only when it’s extremely confident is likely to have a low recall
and a high precision.

The Bias-Variance Trade-off


• The bias-variance tradeoff is a key concept in understanding the performance of machine
learning models.
• Bias is the error due to overly simplistic assumptions in the learning algorithm.
• High bias can cause an algorithm to miss relevant relations between features and target outputs
(under fitting).
• Variance is the error due to too much complexity in the learning algorithm.
• High variance can cause an algorithm to model the random noise in the training data rather
than the expected outputs (over fitting).
• The tradeoff is about finding a balance between bias and variance to minimize total error.
• Typically, increasing model complexity will decrease bias but increase variance, while
decreasing complexity will increase bias but decrease variance.
• The goal is to find the spot where the model generalizes well to new data.
Both bias and variance measure what would happen if the model were retrained many times on
different sets of training data.
For example, the degree 0 model in “Over fitting and Under fitting” will make a lot of mistakes for
pretty much any training set (drawn from the same population), which means that it has a high bias.
However, any two randomly chosen training sets should give pretty similar models (since any two
randomly chosen training sets should have pretty similar average values). So we say that it has a low
variance. High bias and low variance typically correspond to underfitting.
On the other hand, the degree 9 model fit the training set perfectly. It has very low bias but very high
variance (since any two training sets would likely give rise to very different models). This
corresponds to overfitting.
Thinking about model problems this way can help you figure out what to do when your model doesn't
work so well.
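A minimal sketch of this idea, assuming the same quadratic-plus-noise population used in the earlier overfitting example: we repeatedly draw fresh training sets, fit polynomials of degree 0, 1, and 9, and look at the mean and variance of their predictions at a single point. A mean far from the true value indicates bias; a large spread across training sets indicates variance.

import numpy as np

rng = np.random.default_rng(0)

def sample_training_set(n=20):
    # Draw a fresh training set from the same quadratic-plus-noise population
    x = rng.uniform(-1, 1, n)
    y = 2 * x ** 2 + 3 + rng.normal(0, 0.5, n)
    return x, y

def prediction_at_zero(degree, n_trials=200):
    # Fit a polynomial of the given degree to many training sets; record its prediction at x = 0
    preds = []
    for _ in range(n_trials):
        x, y = sample_training_set()
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, 0.0))
    preds = np.array(preds)
    return preds.mean(), preds.var()

# The true value at x = 0 is 3; degree 0 is biased, degree 9 has high variance
for degree in (0, 1, 9):
    mean_pred, var_pred = prediction_at_zero(degree)
    print(f'degree {degree}: mean prediction = {mean_pred:.2f}, variance = {var_pred:.3f}')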

Adding More Features

If the model has high bias, it means the model is too simple to capture the underlying patterns in the
data. In such cases, adding more features can help improve the model by providing it with more
information. For example, in the context of polynomial regression:

o A degree 0 model (just a constant) is too simple.


o A degree 1 model (a straight line) is better because it can capture linear relationships.
o A higher-degree model can capture more complex relationships.

Reducing Features or Adding More Data

 If the model has high variance, it means the model is too complex and is overfitting the
training data. Removing some features can help by simplifying the model, thus reducing the
variance.
 Another effective way to reduce or prevent high variance is to gather more data. More data
can help the model generalize better because it provides more examples for the model to learn
from, reducing the risk of overfitting.


Figure 11-2 shows the result of fitting a degree 9 polynomial to samples of different sizes. If the model
is trained on 100 data points, there is much less overfitting, and the model trained on 1,000 data points
looks very similar to the degree 1 model. Holding model complexity constant, the more data you have,
the harder it is to overfit.

Feature Extraction and Selection


Feature selection involves selecting a subset of the most important features for use in model
construction. This can improve the model's performance by reducing over fitting, speeding up the
training process, and improving the model's interpretability.
When the data doesn't have enough features, the model is likely to underfit. And when the data has
too many features, it's easy to overfit.
Features are whatever inputs we provide to our model.
In the simplest case, for example, if we want to predict someone's salary based on her years of experience,
then years of experience is the only feature we have.
In a more complicated case, imagine trying to build a spam filter to predict whether an email is junk or not.
To most models, an email is just a collection of text, so we have to extract features.
For example:
 Does the email contain the word "lottery"?
 How many times does the letter d appear?
 What was the domain of the sender?

The first is simply a yes or no, which we typically encode as a 1 or 0. The second is a number. And the
third is a choice from a discrete set of options. The features we extract from our data fall into one of
these three categories.
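A minimal sketch of extracting one feature of each kind from an email; the function and field names here are illustrative, not part of any library:

def extract_features(email_text, sender_address):
    # One yes/no feature (encoded as 1 or 0), one numeric feature, one categorical feature
    contains_lottery = 1 if 'lottery' in email_text.lower() else 0
    count_d = email_text.lower().count('d')
    sender_domain = sender_address.split('@')[-1]
    return {'contains_lottery': contains_lottery,
            'count_d': count_d,
            'sender_domain': sender_domain}

print(extract_features('Win the lottery today!', 'promo@deals.example.com'))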


• The Naive Bayes classifier is suited to yes-or-no features.
• Regression models require numeric features.
• Decision trees can deal with numeric or categorical data.

Features are chosen through a combination of experience and domain expertise.

K-Nearest Neighbors (KNN) Algorithm


The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method employed to
tackle classification and regression problems.
The KNN algorithm operates on the principle of similarity: it predicts the label or value of a new
data point by considering the labels or values of its K nearest neighbors in the training dataset.

Working of KNN:

Step 1: Selecting the optimal value of K


 K represents the number of nearest neighbors that need to be considered while making a prediction.
Step 2: Calculating distance
 To measure the similarity between the target and the training data points, the Euclidean distance is used.
The distance is calculated between each data point in the dataset and the target point.

Step 3: Finding Nearest Neighbors


 The k data points with the smallest distances to the target point are the nearest neighbors.

Step 4: Voting for Classification or Taking Average for Regression

 In a classification problem, the class label is determined by majority voting. The class with the most
occurrences among the neighbors becomes the predicted class for the target data point.
 In a regression problem, the predicted value is calculated by taking the average of the target values of
the K nearest neighbors. The calculated average becomes the predicted output for the target data point.
X is the training dataset with n data points, where each data point is represented by a d-dimensional
feature vector, and Y is the corresponding set of labels or values for the data points in X. Given a new
data point x, the algorithm calculates the distance between x and each data point xi in X using a distance
metric, such as the Euclidean distance:

d(x, xi) = sqrt( Σj (xj − xij)² )


The algorithm selects the K data points from X that have the shortest distances to x. For classification
tasks, the algorithm assigns the label y that is most frequent among the K nearest neighbors to x. For
regression tasks, the algorithm calculates the average or weighted average of the values y of the K
nearest neighbors and assigns it as the predicted value for x.

Python program to build a nearest neighbor model that can predict the class
from the IRIS dataset
import numpy as np
from collections import Counter
# Sample dataset (Iris data: sepal length, sepal width, petal length, petal width, species)
data = [
    (5.1, 3.5, 1.4, 0.2, 'setosa'),
    (4.9, 3.0, 1.4, 0.2, 'setosa'),
    (5.0, 3.6, 1.4, 0.2, 'setosa'),
    (6.7, 3.0, 5.0, 1.7, 'versicolor'),
    (6.3, 3.3, 6.0, 2.5, 'virginica'),
    (5.8, 2.7, 5.1, 1.9, 'virginica')
]
# New data point (sepal length, sepal width, petal length, petal width)
new_point = (5.5, 3.4, 1.5, 0.2)

# Function to calculate Euclidean distance


def euclidean_distance(point1, point2):
    return np.sqrt(sum((x - y) ** 2 for x, y in zip(point1, point2)))

# Function to get the nearest neighbors


def get_nearest_neighbors(data, new_point, k):
    distances = [(euclidean_distance(point[:-1], new_point), point[-1]) for point in data]
    distances.sort(key=lambda x: x[0])
    return [label for _, label in distances[:k]]

# Function to predict the class


def predict(data, new_point, k):
    nearest_neighbors = get_nearest_neighbors(data, new_point, k)
    most_common = Counter(nearest_neighbors).most_common(1)
    return most_common[0][0]

# Predict the class for the new data point
k = 3
predicted_class = predict(data, new_point, k)
print(f'The predicted class for the new point is: {predicted_class}')

The Curse of Dimensionality


The Curse of Dimensionality is a concept that describes the challenges and issues that arise when
working with high-dimensional data. As the number of dimensions increases, the volume of the
space increases exponentially. This means that data points become sparse, and the distances between
them grow, making it difficult to find meaningful patterns.
The Curse of Dimensionality impacts various aspects of data analysis, including distance
calculations, data sparsity, and overfitting.
1. Distance Measures Become Less Meaningful:
In high-dimensional spaces, the distances between points tend to become similar, making it harder
to distinguish between near and far points. This is problematic for algorithms that rely on distance
measures, such as k-Nearest Neighbors (k-NN) and clustering algorithms.
2. Data Sparsity:
With more dimensions, the data points spread out more, leading to sparsity. In a high-dimensional
space, even a large dataset may have very few data points in any given region. This sparsity makes
it hard to find reliable patterns and can reduce the effectiveness of algorithms.
3. Overfitting:
High-dimensional datasets often contain many irrelevant or noisy features, which can lead to
overfitting. The model may capture noise instead of the underlying pattern, performing well on
training data but poorly on unseen data.
Example
Consider a simple example using a dataset with points uniformly distributed in a unit cube. We can
observe how the volume and distances change as the number of dimensions increases.
1. Volume of a Hypercube:
In a 1-dimensional space, the unit hypercube is simply a line segment of length 1.
In a 2-dimensional space (a square), the unit hypercube has an area of 1.
In a 3-dimensional space (a cube), the unit hypercube has a volume of 1.
The volume stays fixed in higher dimensions as well:
 In a 10-dimensional space, the unit hypercube has a volume of 1^10 = 1.
 In a 100-dimensional space, the unit hypercube still has a volume of 1^100 = 1.
However, that fixed volume now extends along vastly more directions: the number of points needed to
sample the cube at a given density grows exponentially with the dimension, so any realistic dataset
occupies only a negligible fraction of the space.
2. Distance Calculations:
In a 1-dimensional space, consider two points at 0 and 1. The distance between them is 1.
In a 2-dimensional space, consider the two points (0, 0) and (1, 1). The Euclidean distance is √2.
In a 3-dimensional space, consider the two points (0, 0, 0) and (1, 1, 1). The Euclidean distance is √3.
As the number of dimensions increases, the distances between points grow as well. More importantly, the
difference between the nearest and farthest points shrinks relative to the average distance, making
distances less discriminative.
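A small simulation (a sketch, not from the notes) makes this concrete: for random points in the unit hypercube, the ratio of the minimum to the mean pairwise distance approaches 1 as the dimension grows, so "nearest" stops meaning much.

import numpy as np

def distance_stats(dim, n_points=100, seed=0):
    # Minimum and mean pairwise Euclidean distances between random points in [0, 1]^dim
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    diffs = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = dists[np.triu_indices(n_points, k=1)]
    return upper.min(), upper.mean()

for dim in (1, 10, 100):
    d_min, d_mean = distance_stats(dim)
    print(f'dim={dim:4d}  min={d_min:.3f}  mean={d_mean:.3f}  min/mean={d_min / d_mean:.3f}')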

Dimensionality Reduction
Dimensionality reduction is a process used to reduce the number of features (dimensions) in a
dataset while retaining as much information as possible. This technique helps in simplifying
models, reducing computational costs, and mitigating issues related to the curse of dimensionality.
Principal Component Analysis (PCA)
PCA is a widely used technique for dimensionality reduction that transforms the original features
into a new set of uncorrelated features called principal components. The first principal component
captures the most variance in the data, and each subsequent component captures the remaining
variance under the constraint of being orthogonal to the previous components.
Example: Dimensionality Reduction Using PCA
1. Standardize the Data: Standardization ensures that each feature contributes equally to the
analysis by scaling the data to have a mean of 0 and a standard deviation of 1.
2. Compute the Covariance Matrix: The covariance matrix describes the variance and the
covariance between the features.
3. Compute the Eigenvalues and Eigenvectors: The eigenvectors determine the directions of the
new feature space, while the eigenvalues determine their magnitude (i.e., the amount of variance
captured by each principal component).
4. Sort Eigenvalues and Select Principal Components: The eigenvalues are sorted in descending
order, and the top k eigenvalues are selected. The corresponding eigenvectors form the new feature
space.
5. Transform the Data: The original data is projected onto the new feature space to obtain the
reduced dataset.

CODE:
Example using Python to illustrate PCA for dimensionality reduction:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Generate a synthetic dataset
np.random.seed(42)
X = np.random.rand(100, 3) # 100 samples, 3 features

# Standardize the data


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Print the explained variance ratios


print('Explained variance ratios:', pca.explained_variance_ratio_)

# Visualize the original and reduced data


fig, ax = plt.subplots(1, 2, figsize=(12, 6))

# Original data (first two features)
ax[0].scatter(X[:, 0], X[:, 1], c='blue', label='Original Data')
ax[0].set_xlabel('Feature 1')
ax[0].set_ylabel('Feature 2')
ax[0].set_title('Original Data')

# Reduced data
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c='red', label='PCA Reduced Data')
ax[1].set_xlabel('Principal Component 1')
ax[1].set_ylabel('Principal Component 2')
ax[1].set_title('PCA Reduced Data')

ax[0].legend()
ax[1].legend()
plt.show()


Naive Bayes
Naive Bayes is a probabilistic classification algorithm based on Bayes' Theorem, with the
assumption that the features are independent given the class label.
This model predicts the probability that an instance belongs to a class given a set of feature values. It is
a probabilistic classifier, and it is called "naive" because it assumes that each feature in the model is
independent of the existence of any other feature; in other words, each feature contributes to the
prediction with no relation to the others. It uses Bayes' theorem for training and prediction.
The core assumption of Naive Bayes is conditional independence.
Mathematically, if Xi and Xj represent the events that the ith and jth words are present in the message,
then the assumption says:
P(Xi and Xj / spam) = P(Xi / spam) ⋅ P(Xj / spam)

Sophisticated Spam Filter


Imagine now that we have a vocabulary of many words W1, W2, ..., WN. To move this into the realm of
probability theory, let Xi be the event "a message contains the word Wi." Also imagine that we have an
estimate P(Xi/S) for the probability that a spam message contains the ith word, and a similar estimate
P(Xi/N) for the probability that a nonspam message contains the ith word.
The key to Naive Bayes is the assumption that the presences of each word are independent of one
another, conditional on a message being spam or not.
Intuitively, this assumption means that knowing whether a certain spam message contains the word
"lottery" gives us no information about whether that same message contains the word "rolex." In math
terms, this means that:
P(X1=x1, ..., Xn=xn / S) = P(X1=x1 / S) × … × P(Xn=xn / S)
Imagine that our vocabulary consists only of the words "lottery" and "rolex," and that half of all spam
messages are for "cheap rolex" and the other half are for "authentic lottery." In this case, the Naive Bayes
estimate of the probability that a spam message contains both "lottery" and "rolex" is:

P(lottery and rolex / S) = P(lottery / S) ⋅ P(rolex / S) = 0.5 × 0.5 = 0.25

even though the words "lottery" and "rolex" actually never occur together. Despite how unrealistic this
assumption is, this model often performs well and is used in actual spam filters.
The same Bayes's theorem reasoning used for the "lottery-only" spam filter tells us that we can
calculate the probability that a message is spam using the equation (assuming spam and nonspam
messages are equally likely):

P(S / X=x) = P(X=x / S) / [P(X=x / S) + P(X=x / N)]


The Naive Bayes assumption allows us to compute each of the probabilities on the right simply by
multiplying together the individual probability estimates for each vocabulary word.
Let us consider the words "rolex," "lottery," "meeting," and "project," with the following probabilities
estimated from historical data.


For the training data, assume the likelihoods are:
• Rolex : P(R/N) = 0.1 & P(R/S) = 0.8
• Lottery : P(L/N)=0.05 & P(L/S) = 0.7
• Meeting : P(M/N)=0.9 & P(M/S) = 0.2
• Project: P(P/N)=0.7 & P(P/S) = 0.25

Prior Probabilities:
• P(spam) = 0.3 (30% of messages are spam)
• P(Normal) = 0.7 (70% of messages are non-spam)

New Message containing the words “rolex" and "lottery."
Let us classify it as spam or not spam using Naive Bayes.
• Calculate the (unnormalized) probability of the message being spam:
P(spam/message) ∝ P(S) ⋅ P(R/S) ⋅ P(L/S) = 0.3 × 0.8 × 0.7 = 0.168
• Calculate the (unnormalized) probability of the message being normal (not spam):
P(normal/message) ∝ P(N) ⋅ P(R/N) ⋅ P(L/N) = 0.7 × 0.1 × 0.05 = 0.0035

• Normalize the Probabilities
• To get the actual probabilities, we need to normalize these values so they sum to 1.
• P(message) = P(S/message) + P(N/message) = 0.168 + 0.0035 = 0.1715
• So, the normalized probabilities are:
• P(S/message) = 0.168 / 0.1715 ≈ 0.98
• P(N/message) = 0.0035 / 0.1715 ≈ 0.02

Given the message contains the words “rolex" and "lottery," there is a 98% chance it is spam and a
2% chance it is not spam (normal).
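The hand calculation above can be checked with a few lines of Python (a sketch that simply reuses the assumed likelihoods and priors):

# Likelihoods and priors from the example above
p_spam, p_normal = 0.3, 0.7
p_rolex = {'spam': 0.8, 'normal': 0.1}
p_lottery = {'spam': 0.7, 'normal': 0.05}

# Unnormalized scores for a message containing "rolex" and "lottery"
score_spam = p_spam * p_rolex['spam'] * p_lottery['spam']           # 0.168
score_normal = p_normal * p_rolex['normal'] * p_lottery['normal']   # 0.0035

# Normalize so the two probabilities sum to 1
total = score_spam + score_normal
print(f'P(spam | message)   = {score_spam / total:.2f}')    # ~0.98
print(f'P(normal | message) = {score_normal / total:.2f}')  # ~0.02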

Python Code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample data


data = {
    'message': [
        'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)',
        'Nah I don\'t think he goes to usf, he lives around here though',
        'WINNER!! As a valued network customer you have been selected to receive a £900 prize reward!',
        'I HAVE A work ON SUNDAY !!',
        'Had your mobile 11 months or more? U R entitled to update to the latest colour mobiles with camera for free! Call The Mobile Update Co FREE on 08002986030'
    ],
    'label': ['spam', 'ham', 'spam', 'ham', 'spam']
}
# Convert data to DataFrame
df = pd.DataFrame(data)

# Feature extraction
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message'])
y = df['label']

# Split the data into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Naive Bayes classifier


nb = MultinomialNB()
nb.fit(X_train, y_train)

# Make predictions
y_pred = nb.predict(X_test)
# Evaluate the model

accuracy = accuracy_score(y_test, y_pred)


report = classification_report(y_test, y_pred)


print(f'Accuracy: {accuracy}')
print('Classification Report:')
print(report)

# Function to classify a new message


def classify_message(message):
    message_transformed = vectorizer.transform([message])
    prediction = nb.predict(message_transformed)
    return prediction[0]

# Test the classifier


new_message = 'Congratulations! You have won a free ticket to Bahamas. Call now!'
print(f'The message: "{new_message}" is classified as {classify_message(new_message)}')

Simple Linear Regression

• Regression is a statistical technique used to model and analyze the relationships between
variables.
• It helps in understanding how the dependent variable (Y) changes when any one of the
independent variables (X) is varied.
• The primary goal of regression is to predict or estimate the value of the dependent variable based
on the values of one or more independent variables.
• Simple Linear Regression is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables:
• Independent Variable (X): Also known as the predictor or explanatory variable.
• Dependent Variable (Y): Also known as the response or outcome variable.
• The goal of Simple Linear Regression is to model the relationship between these two variables
by fitting a linear equation to the observed data.
• The linear equation for a Simple Linear Regression model is:

Yi = α + βXi + ϵi
Y is the dependent variable.
X is the independent variable.
α is the intercept of the regression line. It is the value of Y when X = 0.
β is the slope of the regression line. It represents the change in Y for a one-unit change in X.
ϵ is the error term, which accounts for the variability in Y that cannot be explained by the linear
relationship with X.


• To fit the Simple Linear Regression model, estimate the parameters α and β using Ordinary Least Squares (OLS).
• OLS minimizes the sum of the squared differences between the observed values and the values
predicted by the linear model.

After fitting the model, evaluate it using:


• Mean Squared Error (MSE): The average of the squared differences between the observed and
predicted values.

• R-squared (R²): The proportion of the variance in the dependent variable that is predictable from
the independent variable.

Code
import numpy as np
import matplotlib.pyplot as plt
# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 7, 11])

# Calculate means
x_mean = np.mean(x)
y_mean = np.mean(y)


# Calculate the parameters


beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean
print(f'Estimated parameters: beta_0 = {beta_0}, beta_1 = {beta_1}')

# Make predictions
y_pred = beta_0 + beta_1 * x

# Plot the data and the regression line


plt.scatter(x, y, color='blue', label='Data points')
plt.plot(x, y_pred, color='red', label='Regression line')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
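To evaluate the fit using the MSE and R² described above, the code can be extended with a few lines (a sketch that reuses the x, y, y_pred, and y_mean variables from the block above):

# Mean Squared Error
mse = np.mean((y - y_pred) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y_mean) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f'MSE = {mse:.3f}, R^2 = {r_squared:.3f}')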

Using Gradient Descent


Gradient descent is a fundamental optimization algorithm used in machine learning to minimize a
cost function. It aims to find the parameters (weights) of a model that minimize the cost function,
which measures how well the model fits the data.
The steps are as follows:
1. Initialize Parameters: Start with random initial values for the parameters (weights).
2. Compute the Gradient: Calculate the gradient of the cost function with respect to each parameter.
The gradient is a vector of partial derivatives, indicating the direction and rate of the steepest
increase in the cost function.
3. Update Parameters: Adjust the parameters in the opposite direction of the gradient by a small
amount, which is determined by the learning rate. This step is repeated iteratively:
θi = θi − α ∂J(θ)/∂θi
where θi is the i-th parameter, α is the learning rate, and ∂J(θ)/∂θi is the partial derivative of the cost
function J(θ) with respect to θi.
4. Convergence Check: Repeat steps 2 and 3 until the change in the cost function is smaller than a
predefined threshold or a maximum number of iterations is reached.
Gradient Descent for Parameterized Models
When fitting parameterized models, the cost function depends on the difference between the
predicted and actual values. For example, in linear regression, the cost function J(θ) is typically the
mean squared error:

J(θ) = (1/2m) Σ i=1..m ( hθ(x(i)) − y(i) )²

where:
 m is the number of training examples.
 hθ(x(i)) is the predicted value for the ith training example.
 y(i) is the actual value for the ith training example.
The gradient descent algorithm for linear regression involves computing the partial derivative of J(θ)
with respect to each parameter θj:

∂J(θ)/∂θj = (1/m) Σ i=1..m ( hθ(x(i)) − y(i) ) xj(i)
Python program to illustrate gradient descent for a simple linear regression model

import numpy as np

# Example data: y = 4 + 3x with some noise


np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add x0 = 1 to each instance


X_b = np.c_[np.ones((100, 1)), X]

# Parameters
learning_rate = 0.1
n_iterations = 1000
m = len(X_b)
theta = np.random.randn(2, 1)

# Gradient Descent
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - learning_rate * gradients

print('Fitted parameters:', theta)


Multiple Regression
Multiple regression is a statistical technique used to understand the relationship between one
dependent variable and two or more independent variables. It extends simple linear regression, which
involves only one independent variable, by incorporating multiple predictors to better capture the
complexity of real-world phenomena.

The Multiple Regression Equation

The general form of the multiple regression equation is:

Y = β0 + β1X1 + β2X2 + … + βnXn + ϵ
 Y: Dependent variable.
 β0: Intercept, the expected value of Y when all Xs are zero.
 β1,β2,…,βn : Coefficients representing the change in Y for a one-unit change in the
corresponding X, holding other variables constant.
 X1,X2,…,Xn : Independent variables.
 ϵ: Error term, representing the deviation of observed values from the predicted values.

Steps in Multiple Regression Analysis

1. Model Specification: Define the dependent variable and select the independent variables
based on theoretical understanding or empirical evidence.
2. Data Collection: Gather data for the dependent and independent variables. Ensure the data is
clean and suitable for analysis.
3. Estimation of Coefficients: Use statistical software to estimate the coefficients (β) of
the regression equation. This is typically done using the Ordinary Least Squares (OLS)
method, which minimizes the sum of the squared differences between observed and predicted
values.
4. Model Evaluation: Assess the model's performance using various metrics:
o R-squared (R²): Proportion of variance in the dependent variable explained by the
independent variables.
o Adjusted R-squared: Adjusts R² for the number of predictors in the model.
o F-test: Tests the overall significance of the model.
o t-tests: Assess the significance of individual coefficients.
5. Assumption Checking: Ensure that the model meets the assumptions of multiple regression:
o Linearity: The relationship between the dependent and independent variables is linear.
o Independence: Observations are independent of each other.
o Homoscedasticity: Constant variance of errors across all levels of the independent
variables.
o Normality: Errors are normally distributed.
6. Diagnostics and Refinement: Perform residual analysis to check for any patterns in the
residuals that might indicate model misspecification. Address issues like multicollinearity
(high correlation among predictors) if they arise.
7. Interpretation: Interpret the coefficients to understand the impact of each independent
variable on the dependent variable. For example, a coefficient of 2 for X1 means that a one-unit
increase in X1 results in an average increase of 2 units in Y, holding other variables constant.
8. Prediction: Use the fitted model to make predictions for new data points by plugging values
for the independent variables into the regression equation (see the sketch below).
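A minimal sketch of these steps using scikit-learn; the synthetic data and the coefficient values are illustrative assumptions, not from the notes:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: Y = 3 + 2*X1 - 1.5*X2 + noise (illustrative values)
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.1, 100)

# Estimate the coefficients by OLS and evaluate the fit
model = LinearRegression()
model.fit(X, y)

print('Intercept (beta_0):', model.intercept_)
print('Coefficients (beta_1, beta_2):', model.coef_)
print('R-squared:', r2_score(y, model.predict(X)))

# Predict for a new data point by plugging values into the fitted equation
print('Prediction for X1=0.5, X2=0.2:', model.predict([[0.5, 0.2]]))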


The goodness of fit in a multiple regression model refers to how well the model's predicted values
match the actual observed values. It helps in determining the model's accuracy and reliability.

The concepts used to assess the goodness of fit are:

1. R-squared (R²):R² represents the proportion of the variance in the dependent variable that is
predictable from the independent variables. R² values range from 0 to 1. An R² value closer to 1
indicates a better fit, meaning a higher proportion of variance is explained by the model. For instance,
an R² of 0.8 means 80% of the variance in the dependent variable is explained by the independent
variables.

2. Adjusted R-squared: Adjusted R² adjusts the R² value based on the number of predictors in the
model. It accounts for the fact that adding more variables to a model will inherently increase R²,
regardless of whether those variables are meaningful. Adjusted R² is useful for comparing models with
a different number of predictors. It can decrease if the added variable doesn't improve the model more
than would be expected by chance.

3. F-statistic and p-value: The F-statistic tests the overall significance of the model. It checks
whether at least one of the predictors is significantly related to the dependent variable. A higher F-
statistic value and a lower p-value (typically less than 0.05) indicate that the model is statistically
significant.

4. Root Mean Squared Error (RMSE): RMSE is the square root of the average of the squared
differences between the observed and predicted values. RMSE provides a measure of the average
magnitude of the errors in prediction. Lower RMSE values indicate a better fit.

5. Mean Absolute Error (MAE):MAE is the average of the absolute differences between the
observed and predicted values. Like RMSE, lower MAE values indicate a better fit. It is less sensitive
to large errors compared to RMSE.

6. Residual Plots: Residual plots show the difference between the observed and predicted values
(residuals) on the y-axis and the predicted values on the x-axis. A good fit is indicated by residuals
that are randomly scattered around zero, with no clear patterns. Patterns or systematic deviations
suggest model inadequacies.
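A short sketch of computing these metrics with scikit-learn; the observed and predicted values below are made-up numbers used only to show the calls:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical observed and predicted values
y_true = np.array([3.0, 4.5, 6.1, 7.9, 10.2])
y_pred = np.array([3.2, 4.4, 6.0, 8.3, 9.8])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

# Adjusted R^2 for n observations and k predictors
n, k = len(y_true), 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f'R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}, RMSE = {rmse:.3f}, MAE = {mae:.3f}')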

Importance of Goodness of Fit

 Model Validation: Ensuring the model reliably predicts outcomes on new data.
 Decision Making: Better fitting models provide more accurate and reliable information for
decision making.
 Model Comparison: Goodness of fit metrics allow for comparison between different models
to choose the best one.

Need for fitting the model in multiple regressions


Fitting a model in multiple regression is essential for several reasons, each contributing to the
robustness, accuracy, and interpretability of the analysis. The reasons are:

1. Understanding Relationships between Variables

Multiple regression allows for the examination of the relationship between a dependent variable and
multiple independent variables simultaneously. This helps in understanding how various factors
collectively influence the outcome.

2. Controlling for Confounding Variables

In many real-world scenarios, the effect of one independent variable on the dependent variable might
be influenced by the presence of other variables. Multiple regression helps to isolate the effect of each
independent variable by controlling for others, reducing potential confounding effects.

3. Improved Prediction Accuracy

By incorporating multiple predictors, the model can capture more information about the dependent
variable, leading to better predictive accuracy compared to simple regression models with a single
predictor.

4. Identifying Significant Predictors

Multiple regression helps in identifying which independent variables have a significant impact on the
dependent variable. This is particularly useful in fields like economics, medicine, and social sciences,
where understanding the importance of various factors is crucial.

5. Quantifying the Impact of Variables

The coefficients in a multiple regression model quantify the impact of each independent variable on
the dependent variable, providing valuable insights into the strength and direction of these
relationships.

6. Handling Multicollinearity

In multiple regression, it's important to assess and handle multicollinearity (when independent
variables are highly correlated). Properly fitting the model involves diagnosing and addressing
multicollinearity to ensure reliable and interpretable results.

7. Generalizability of Findings

A well-fitted multiple regression model that accounts for multiple factors is more likely to generalize
to new data, making the findings more robust and applicable in various contexts.

8. Model Diagnostics and Validation


Fitting the model involves checking for assumptions (linearity, independence, homoscedasticity,
normality) and performing diagnostics (residual analysis, influence analysis) to ensure the validity of
the model. This step is crucial for the reliability of the regression results.


9. Enabling Complex Analyses

Multiple regression serves as a foundation for more complex analyses like interaction effects,
polynomial regression, and hierarchical regression, expanding the analytical capabilities for research
and decision-making.

10. Policy and Decision-Making

In applied fields, such as business, public policy, and healthcare, multiple regression models provide
evidence-based insights that inform strategic decisions and policy-making by highlighting key factors
and their relative importance

Standard Error of a Regression Coefficient (SE)

It is a measure of the variability or dispersion of the sampling distribution of a regression
coefficient. It quantifies the precision of the estimated coefficient.

 Precision of Estimates: Smaller standard errors indicate more precise estimates of the
regression coefficients.
 Confidence Intervals: They are used to construct confidence intervals around the regression
coefficients, providing a range within which the true population parameter is likely to lie.
 Hypothesis Testing: Standard errors are crucial for conducting hypothesis tests, such as
determining whether a regression coefficient is significantly different from zero.

Calculation

For a simple linear regression with one predictor, the standard error of the slope coefficient β1 can
be calculated as follows:

SE(β1) = s / sqrt( Σ (xi − x̄)² )

where:

 s is the standard deviation of the residuals (errors).


 xi represents the individual values of the predictor variable.
 xˉ is the mean of the predictor variable.

In a multiple regression model with several predictors, the calculation involves more complex matrix
operations, but the concept remains the same.
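A small sketch of the simple-regression case, reusing the x and y values from the Simple Linear Regression code earlier; it applies the formula above and also forms the t-statistic discussed below:

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 5, 7, 11], dtype=float)

# Closed-form OLS estimates of slope and intercept
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()
residuals = y - (beta_0 + beta_1 * x)

# s: standard deviation of the residuals (n - 2 degrees of freedom, two parameters estimated)
n = len(x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Standard error of the slope, as in the formula above, and the corresponding t-statistic
se_beta_1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))
t_stat = beta_1 / se_beta_1
print(f'beta_1 = {beta_1:.3f}, SE(beta_1) = {se_beta_1:.3f}, t = {t_stat:.2f}')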

Interpreting Standard Errors

 Smaller Standard Errors: Indicate that the coefficient estimate is more reliable and that the
predictor variable has a more stable relationship with the response variable.
 Larger Standard Errors: Suggest that the coefficient estimate is less reliable, indicating more
variability in the estimate and a less stable relationship with the response variable.


Confidence Intervals

 Construction: A confidence interval for a regression coefficient βj is typically constructed as:

βj ± t × SE(βj)
 t is the critical value from the t-distribution for the desired confidence level, and SE(βj) is the
standard error of the coefficient.
 Interpretation: A 95% confidence interval means that we are 95% confident that the true
population parameter lies within this interval.

Hypothesis Testing

 Null Hypothesis (H₀): The null hypothesis usually states that the coefficient is equal to zero
(βj = 0), meaning the predictor has no effect on the response variable.
 Test Statistic: The t-statistic for testing this hypothesis is calculated as:

t = βj / SE(βj)
 p-value: The p-value associated with this t-statistic helps determine whether to reject the null
hypothesis. A smaller p-value (typically < 0.05) indicates that the coefficient is significantly
different from zero.

Factors Affecting Standard Errors

 Sample Size: Larger sample sizes tend to produce smaller standard errors, indicating more
reliable estimates.
 Variance of Errors: Higher variance in the residuals leads to larger standard errors, indicating
less precise estimates.
 Multicollinearity: When predictor variables are highly correlated, it inflates the standard
errors of the coefficients, making it harder to determine the individual effect of each predictor.

Logistic regression

Logistic regression is a statistical method used for binary classification problems, where the outcome
variable is categorical and has two possible outcomes (0 or 1). It models the probability of a binary
response based on one or more predictor variables.

Logistic Function

The logistic function (or sigmoid function) is used to model the probability of the default class:

σ(z) = 1 / (1 + e^(−z))

where z is the linear combination of input features.


Logistic Regression Model

In logistic regression, the linear relationship between the input features and the log-odds of the
probability is modeled as:

log( P / (1 − P) ) = β0 + β1x1 + β2x2 + … + βnxn
where:

 P is the probability of the positive class (outcome 1).


 β0 is the intercept.
 β1,β2,…,βn are the coefficients for the predictor variables x1,x2,…,xn.

The logistic function is defined as:

logistic(x) = 1 / (1 + e^(−x))
CODE:
import math

def logistic(x):
    return 1.0 / (1 + math.exp(-x))

The derivative of the logistic function, which is useful for gradient descent, is:

def logistic_prime(x):
    return logistic(x) * (1 - logistic(x))
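The log-likelihood snippets below rely on a dot helper for the vector dot product; it is not shown in these notes, so here is a minimal version (an assumption about how it is defined):

def dot(v, w):
    # Vector dot product: sum of elementwise products
    return sum(v_i * w_i for v_i, w_i in zip(v, w))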

Log-Likelihood and Its Gradient


The log-likelihood function for logistic regression is used to measure how well the model parameters
fit the data. For a single data point (xi,yi) the log-likelihood is:

def logistic_log_likelihood_i(x_i, y_i, beta):
    if y_i == 1:
        return math.log(logistic(dot(x_i, beta)))
    else:
        return math.log(1 - logistic(dot(x_i, beta)))

For the entire dataset, assuming independence of data points, the overall log-likelihood is the
sum of individual log-likelihoods:


def logistic_log_likelihood(x, y, beta):
    return sum(logistic_log_likelihood_i(x_i, y_i, beta) for x_i, y_i in zip(x, y))

The gradient of the log-likelihood function with respect to a parameter βj for a single data
point (xi,yi) is:

def logistic_log_partial_ij(x_i, y_i, beta, j):
    """i is the index of the data point, j the index of the derivative"""
    return (y_i - logistic(dot(x_i, beta))) * x_i[j]

Use of Logistic Function in Logistic Regression

Logistic regression is used for binary classification problems, where the outcome is either 0 or 1. The
logistic function maps any real-valued number into the range (0, 1), making it suitable for predicting
probabilities of binary outcomes.

 Predicting Probabilities: The output of the logistic function can be interpreted as the
probability P that a given input x belongs to the positive class (e.g., y = 1). Because the logistic
function maps any real-valued number into the range (0, 1), the model's outputs are always valid
probabilities.

 Decision Boundary: A threshold (commonly 0.5) is applied to the output probability to


classify inputs into binary outcomes. If σ(z) ≥ 0.5, the prediction is 1; otherwise, it's 0.
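Putting the two points above together, here is a tiny sketch of prediction using the logistic and dot functions defined earlier; the beta values and the input are hypothetical:

def predict_probability(x_i, beta):
    # Probability that x_i belongs to the positive class under the logistic model
    return logistic(dot(x_i, beta))

def predict_class(x_i, beta, threshold=0.5):
    # Apply the 0.5 decision threshold described above
    return 1 if predict_probability(x_i, beta) >= threshold else 0

beta = [-1.0, 2.0]                 # hypothetical fitted coefficients [intercept, weight]
x_new = [1.0, 0.8]                 # [1 for the intercept term, feature value]
print(predict_class(x_new, beta))  # logistic(0.6) is about 0.65, so the prediction is 1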

Advantages of Using the Logistic Function

 Non-linearity: The logistic function introduces non-linearity into the model, making it capable
of capturing complex relationships between the input features and the output.
 Bounded Output: The output of the logistic function is always between 0 and 1, making it
suitable for probability prediction.
 Differentiability: The logistic function is differentiable, which allows the use of gradient-
based optimization techniques for training the model.

Support Vector Machines


In logistic regression, the decision boundary is determined by the set of points where the linear
combination of input features, weighted by the model coefficients, equals zero:

β0 + β1x1 + β2x2 + … + βnxn = 0
This boundary is a hyperplane in the feature space that separates the data into two classes. Points on
one side of the hyperplane are classified as one class while points on the other side are classified as the
other class. The hyperplane represents the threshold where the predicted probability of the positive
class is 0.5.

An alternative approach to classification is to just look for the hyperplane that “best” separates the
classes in the training data. This is the idea behind the support vector machine, which finds the
hyperplane that maximizes the distance to the nearest point in each class.

Finding such a hyperplane is an optimization problem that involves fairly advanced techniques.
A different problem is that a separating hyperplane might not exist at all. In the "who pays?" data set,
for example, there is simply no line that perfectly separates the paid users from the unpaid users.

This can be worked around by transforming the data into a higher-dimensional space. For example,
consider the simple one-dimensional data set shown in the figure below.

It's clear that there is no hyperplane that separates the positive examples from the negative ones.
However, if we map this data set to two dimensions by sending each point x to (x, x**2), it suddenly
becomes possible to find a hyperplane that splits the data, as shown in the next figure. This is usually
called the kernel trick because, rather than actually mapping the points into the higher-dimensional
space, we can use a "kernel" function to compute dot products in the higher-dimensional space and use
those to find a hyperplane.
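A minimal scikit-learn sketch of the (x, x**2) mapping idea; the data values are illustrative, and a real SVM library would apply the kernel trick implicitly (e.g., SVC with a polynomial or RBF kernel) rather than constructing the mapped features by hand:

import numpy as np
from sklearn.svm import LinearSVC

# 1-D data that is not linearly separable: negatives in the middle, positives on both sides
X = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float).reshape(-1, 1)
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Map each point x to (x, x**2); in the new space a separating line exists
X_mapped = np.hstack([X, X ** 2])

clf = LinearSVC(C=10.0, max_iter=10000)
clf.fit(X_mapped, y)
print('Training accuracy after the (x, x**2) mapping:', clf.score(X_mapped, y))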

Support Vector Machines aim to find the best decision boundary (hyperplane) that separates the
classes in the feature space. The goal is to maximize the margin, which is the distance between the
hyperplane and the nearest data points from each class (the support vectors).


Mathematical Formulation
Given a dataset of n data points {(xi, yi)}, where xi is a feature vector and yi is the label (either -1 or
1 for binary classification), the hyperplane can be defined as:
w⋅x−b=0
where w is the weight vector and b is the bias term.
The objective of SVM is to find w and b that maximize the margin M, which is:
M=2/∥w∥
The constraints for correctly classifying all data points are:
yi(w⋅xi−b)≥1

Optimization Problem
The optimization problem can be formulated as:
Minimize (1/2)∥w∥²
subject to
yi(w⋅xi−b)≥1
This is a quadratic programming problem that can be solved using various optimization techniques.
