
dHINESH STATISTICAL REC

This document is a record notebook for the Statistical Inference Lab (EBDS22L01) for the academic year 2024-2025, detailing the work of a student named Dhinesh R from the Computer Science and Engineering department. It includes a bonafide certificate, an index of experiments, and outlines various statistical methods and algorithms such as calculating mean, median, mode, quartiles, standard deviation, and implementing Bayes' theorem. The document serves as a practical examination record for the student's lab work in statistical inference.

Uploaded by

Dhanaraj

FORM NO.-F/TL/021
REV.00 Date 20.03.2020

RECORD NOTEBOOK

STATISTICAL INFERENCE LAB–(EBDS22L01)

2024-2025(ODD SEMESTER)

DEPARTMENT

OF

COMPUTER SCIENCE AND ENGINEERING

NAME : DHINESH R

REGISTER NO : 221191101038

COURSE : B.TECH , CSE-DS(AI)

YEAR/SEM/SEC : 3/5/A

BONAFIDE CERTIFICATE

Register No : 221191101038

Name of Lab : STATISTICAL INFERENCE LAB–(EBDS22L01)

Department : COMPUTER SCIENCE AND ENGINEERING-DS(AI)

Certified that this is the bonafide record of work done by DHINESH R, 221191101038 of
III Year B.Tech (CSE-DS(AI)), Sec- A in the “STATISTICAL INFERENCE LAB–(EBDS22L01)”
during the year 2023-2024.

Signature of Lab-in-Charge Signature of Head of Dept

Submitted for the Practical Examination held on -------------------------

Internal Examiner External Examiner



INDEX

Ex. No   Date   Name of the Experiment                                          Page No   Marks   Signature of the Staff

1               For a given set of numbers find Mean, Median, Mode.             1
2               Calculate Quartile and Standard Deviation to a population.      4
3               Write a program for Probability Distribution.                   9
4               Write a program to implement Bayes’ Theorem.                    12
5               Estimate the Variability of data using programming language.    16
6               Calculate the Correlation of a population.                      23
7               Find the Normal and Binomial Distribution of a sample data set  27
8               Perform a Hypothesis testing in a data set.                     31
9               Calculate Simple and Multiple Linear Regressions.               35
10              Perform a Chi-Square Test of a sample data.                     40
EX.01: FOR A GIVEN SET OF NUMBERS FIND MEAN, MEDIAN,
MODE

Aim:

To find mean, median, mode for a given set of numbers.

Algorithm:

1. Create the student marks Excel sheet and save it in .CSV format.
2. Go to the File menu and select Options.
3. Select Add-ins, click the Go button, tick Analysis ToolPak, then click OK.
4. On the Excel menu bar, click the Data menu and select the Data Analysis tool.
5. In the Data Analysis dialog box, select Descriptive Statistics, then click OK.
6. Select the Input Range, select an output option, then click OK.

Mean:
In statistics, the mean is the average of a data set. It is calculated by adding all the
numbers in the data set and dividing by the number of values in the set. The mean is
also known as the average. It is sensitive to skewed data and extreme values.
DHINESH R 1 221191101038
Mean Formula:
Mean formula in statistics is defined as the sum of all observations in the given
dataset divided by the total number of observations:
\[\bar{x} = \dfrac{x_1 + x_2 + \dots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i\]

Median:
Median is the middle value of the dataset when arranged in ascending or descending
order. If the dataset has an odd number of values, the median is the middle value. If
the dataset has an even number of values, the median is the average of the two middle
values.
Median Formula When n is Odd:
The median formula of a given set of numbers, say having 'n' odd number of
observations, can be expressed as:
Median = [(n + 1)/2]th term
Median Formula When n is Even:
The median formula of a given set of numbers say having 'n' even number of
observations, can be expressed as:
Median = [(n/2)th term + ((n/2) + 1)th term]/2
Mode:
Mode in statistics is the value or data point that appears most frequently in a data set.
It is a measure of central tendency and can be calculated for both numerical and
categorical data.
Mode is less widely used than mean and median. A data set can have more than one
mode when several values share the highest frequency.

Problem: Student marks Information:

Name      Gender  Maths  Physics  Chemistry  English  Biology  Economics  History  Civics
John      M       55     45       56         87       21       52         89       65
Suresh    M       75     96       78         64       90       61         58       2
Ramesh    M       25     54       89         76       95       87         56       74
Jessica   F       78     96       86         63       54       89         75       45
Jennifer  F       58     96       78         46       96       77         83       53
Annu      F       45     87       52         89       55       89         87       52
Pooja     F       55     64       61         58       75       58         64       61
Ritesh    M       54     76       87         56       25       56         76       87
Farha     F       55     63       89         75       78       75         63       89
Mukesh    M       96     46       77         83       58       83         46       77

Solution: The Mean, Median and Mode were obtained using the Data Analysis
(Descriptive Statistics) tool in Excel.

Statistic            Maths          Physics        Chemistry

Mean                 59.6           72.3           75.3
Standard Error       6.153950854    6.53375849     4.42731421
Median               55             70             78
Mode                 55             96             78
Standard Deviation   19.46050131    20.66155851    14.00039682
Sample Variance      378.7111111    426.9          196.0111111
Kurtosis             0.84588074     -1.762494195   -1.057490727
Skewness             0.246963732    -0.045336091   -0.746462278
Range                71             51             37
Minimum              25             45             52
Maximum              96             96             89
Sum                  596            723            753
Count                10             10             10
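
The Maths column of the Excel output can be cross-checked in Python. This is only a minimal sketch using the standard statistics module, not part of the Excel procedure recorded here; the variable names are assumptions, and the values are copied from the Maths column of the student marks table.

```python
import statistics

# Maths marks copied from the student marks table (John ... Mukesh)
maths_marks = [55, 75, 25, 78, 58, 45, 55, 54, 55, 96]

mean_value = statistics.mean(maths_marks)      # sum of values / number of values
median_value = statistics.median(maths_marks)  # n is even: average of the two middle values
mode_value = statistics.mode(maths_marks)      # most frequent value
sample_sd = statistics.stdev(maths_marks)      # sample standard deviation (divides by n - 1)

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
print("Standard Deviation:", round(sample_sd, 4))
```

The mean, median and mode should agree with the Maths values reported by Excel (59.6, 55 and 55).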

Result:
Hence the program was successfully executed and output was verified.
EX.02: CALCULATE QUARTILE AND STANDARD DEVIATION
TO A POPULATION

Aim:

To calculate quartile and standard deviation to a population.

Quartile Deviation:

Quartile deviation is a statistic that measures the spread of the middle half of the
data. It is also referred to as the semi-interquartile range and is half of the
difference between the third quartile and the first quartile. The formula for
quartile deviation of the data is Q.D = (Q3 - Q1)/2.
Where,
Q3 = Upper Quartile (Size of 3[(N+1)/4]th item)
Q1 = Lower Quartile (Size of [(N+1)/4]th item)
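
As an illustration of the formula, here is a minimal sketch that locates Q1 and Q3 with the k(N+1)/4-th item rule and then computes the quartile deviation. The function names and the sample values are hypothetical; they are not the data set used in this record. When the position is not a whole number, the sketch interpolates between neighbouring items, which is one common convention.

```python
import math

def quartile(data, k):
    """Return the k-th quartile (k = 1, 2, 3) using the k(N + 1)/4-th item rule."""
    values = sorted(data)                # quartiles are defined on sorted data
    n = len(values)
    position = k * (n + 1) / 4           # 1-based position in the sorted data
    if position <= 1:
        return values[0]
    if position >= n:
        return values[-1]
    lower = math.floor(position)
    fraction = position - lower
    # Interpolate between neighbouring items when the position is fractional
    return values[lower - 1] + fraction * (values[lower] - values[lower - 1])

def quartile_deviation(data):
    """Q.D = (Q3 - Q1) / 2, the semi-interquartile range."""
    return (quartile(data, 3) - quartile(data, 1)) / 2

sample = [13, 7, 1, 9, 3, 11, 5]         # hypothetical data, N = 7
print("Q1:", quartile(sample, 1))
print("Q3:", quartile(sample, 3))
print("Interquartile range:", quartile(sample, 3) - quartile(sample, 1))
print("Quartile deviation:", quartile_deviation(sample))
```

For this sample the sorted data is 1, 3, 5, 7, 9, 11, 13, so Q1 is the 2nd item and Q3 is the 6th item.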

Problem:
With the help of the data given below, find the interquartile range, quartile
deviation.

Solution:
First arrange the data into ascending order, Then

Population and sample standard deviation:
Standard deviation measures the spread of a data distribution. It measures the typical
distance between each data point and the mean.
The formula we use for standard deviation depends on whether the data is being
considered a population of its own, or the data is a sample representing a larger
population.
• If the data is being considered a population on its own, we divide by the
number of data points, \[N\].
• If the data is a sample from a larger population, we divide by one fewer than
the number of data points in the sample, \[n-1\].

Population standard deviation:


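The formula for population standard deviation is:
\[\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}\]
where \[x_i\] is each data point, \[\mu\] is the population mean and \[N\] is the
number of data points.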
Here's how to calculate population standard deviation:

Step 1: Calculate the mean of the data.


Step 2: Subtract the mean from each data point. These differences are called
deviations. Data points below the mean will have negative deviations, and data points
above the mean will have positive deviations.
Step 3: Square each deviation to make it positive.
Step 4: Add the squared deviations together.
Step 5: Divide the sum by the number of data points in the population. The result is
called the variance.
Step 6: Take the square root of the variance to get the standard deviation.

Example: Population standard deviation

Four friends were comparing their scores on a recent essay.


Calculate the standard deviation of their scores: 6, 2, 3, 1
Step 1: Find the mean.
(6 + 2 + 3 + 1)/4 = 12/4 = 3
The mean is 3.
Step 2: Subtract the mean from each score.

Score: \[x_i\]    Deviation: \[(x_i-\mu)\]

6                 6 - 3 = 3
2                 2 - 3 = -1
3                 3 - 3 = 0
1                 1 - 3 = -2

Step 3: Square each deviation.

Score: \[x_i\]    Deviation: \[(x_i-\mu)\]    Squared deviation: \[(x_i-\mu)^2\]

6                 6 - 3 = 3                   (3)^2 = 9
2                 2 - 3 = -1                  (-1)^2 = 1
3                 3 - 3 = 0                   (0)^2 = 0
1                 1 - 3 = -2                  (-2)^2 = 4

Step 4: Add the squared deviations.


9+1+0+4=14
Step 5: Divide the sum by the number of scores.
14/4 =3.5
Step 6: Take the square root of the result from Step 5.
\[\sqrt{3.5} \approx 1.87\]
The standard deviation is approximately 1.87.
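
The six steps can be sketched directly in Python. This is a minimal illustration using the four scores from the example; the variable names are assumptions, and the sample version (dividing by n - 1) is included only to show the difference between the two formulas.

```python
import math

scores = [6, 2, 3, 1]
n = len(scores)

mean = sum(scores) / n                              # Step 1: mean of the data
deviations = [x - mean for x in scores]             # Step 2: deviation of each point
squared_deviations = [d ** 2 for d in deviations]   # Step 3: square each deviation
total = sum(squared_deviations)                     # Step 4: add the squared deviations
variance = total / n                                # Step 5: population variance (divide by N)
std_dev = math.sqrt(variance)                       # Step 6: square root of the variance

# For comparison: treating the same data as a sample divides by n - 1 instead
sample_std_dev = math.sqrt(total / (n - 1))

print("Population standard deviation:", round(std_dev, 2))
print("Sample standard deviation:", round(sample_std_dev, 2))
```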
Program:
Quartile Deviation

Output:

Standard Deviation:

Output:

Result:

Hence the program was successfully executed and output was verified.

EX.03: WRITE A PROGRAM FOR PROBABILITY
DISTRIBUTION

Aim:

To write a program for probability distribution.

Probability Distribution:

A probability distribution is an idealized frequency distribution. In statistics, a
frequency distribution records the number of occurrences of each outcome in a
dataset, i.e. how often each value appears. A frequency distribution describes one
particular sample or dataset, whereas a probability distribution is the abstract
model behind it: how often each value is expected to occur in a sample is governed
by its probability.

Types of Probability Distribution:

1. Discrete Probability Distributions

2. Continuous Probability Distributions
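
The link between a frequency distribution and a discrete probability distribution can be sketched as follows. The die rolls are hypothetical data made up for this illustration, and the fair-die model is an assumption; the sketch only shows that relative frequencies sum to 1 and can be compared with the idealized probabilities.

```python
from collections import Counter

# Hypothetical sample: 12 rolls of a six-sided die
rolls = [1, 3, 3, 6, 2, 3, 5, 6, 1, 4, 6, 3]

# Frequency distribution: how often each value appears in this sample
frequency = Counter(rolls)

# Relative frequencies: the sample's estimate of each value's probability
n = len(rolls)
relative_frequency = {face: frequency[face] / n for face in sorted(frequency)}

# Idealized (theoretical) discrete probability distribution of a fair die
fair_die = {face: 1 / 6 for face in range(1, 7)}

for face in range(1, 7):
    print(face, frequency[face], round(relative_frequency[face], 3), round(fair_die[face], 3))
```

With more rolls, the relative frequencies of a fair die would be expected to move closer to 1/6 each.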

Formula:

Program:

Using JASP Statistical Software:

Result:

Hence the program was successfully executed and output was verified.

EX.04: WRITE A PROGRAM TO IMPLEMENT
BAYES THEOREM

Aim:

To write a program to implement Bayes’ theorem.

Bayes Theorem:

• Bayes’ Theorem is used to determine the conditional probability of an event.

• It is named after the English statistician Thomas Bayes, whose formula was
published posthumously in 1763.

• Bayes’ Theorem is a very important theorem in mathematics that laid the
foundation of a distinct approach to statistical inference called Bayesian
inference.

• It is used to find the probability of an event, based on prior knowledge of
conditions that might be related to that event.

Bayes’ Theorem states the following for any two events A and B with P(B) > 0:
P(A|B) = P(A) * P(B|A) / P(B)
where:
• P(A|B): The probability of event A, given event B has occurred.
• P(B|A): The probability of event B, given event A has occurred.
• P(A): The probability of event A.
• P(B): The probability of event B.
Benefits of Bayes theorem:

• Continuous Learning: By incorporating new data and evidence, models
using Bayes’ theorem can constantly improve their performance.
• Handling Uncertainty: Bayes’ theorem explicitly addresses uncertainty in
data, making it suitable for real-world scenarios with incomplete information.
• Interpretability: The underlying logic of Bayes’ theorem is relatively easy to
understand, allowing for better interpretation of model predictions.

Example:
Suppose the probability of the weather being cloudy is 40%.
Also suppose the probability of rain on a given day is 20%.
Also suppose the probability of clouds on a rainy day is 85%.
If it’s cloudy outside on a given day, what is the probability that it will rain that day?

Solution:
• P(cloudy) = 0.40
• P(rain) = 0.20
• P(cloudy | rain) = 0.85
Thus, we can calculate:
• P(rain | cloudy) = P(rain) * P(cloudy | rain) / P(cloudy)
• P(rain | cloudy) = 0.20 * 0.85 / 0.40
• P(rain | cloudy) = 0.425
If it’s cloudy outside on a given day, the probability that it will rain that day
is 42.5%.
Applications of Bayes theorem in Artificial Intelligence:

• Spam Filtering: Spam filters leverage Bayes’ theorem to categorize emails
as spam or not spam.
• Image Classification: Image recognition systems can use Bayes’ theorem to
assign probabilities to different object categories in an image.
• Recommendation Systems: Recommendation engines can utilize Bayes’
theorem to personalize suggestions based on a user’s past behavior and
preferences.
• Anomaly Detection: Identifying unusual patterns in data (e.g., fraudulent
credit card transactions) often involves Bayes’ theorem to calculate the
likelihood of an event being anomalous.
• Sentiment Analysis: Analysing the sentiment of text data (positive, negative,
or neutral) can be enhanced with Bayes’ theorem by considering the context
and prior knowledge about sentiment-related words.
• Natural Language Processing (NLP): Beyond sentiment analysis, NLP tasks
like machine translation and part-of-speech tagging can benefit from Bayes’
theorem. It can help determine the most likely translation for a sentence or the
most probable part of speech for a word based on surrounding words and
context.
• Medical Diagnosis: While not a replacement for medical expertise, Bayes’
theorem can be used in conjunction with patient data and medical history to
calculate the probability of a specific disease. This can aid doctors in making
informed decisions and prioritizing further tests.
• Robot Navigation: Robots navigating complex environments can leverage
Bayes’ theorem to update their understanding of the surroundings based on
sensor data. This helps them adapt to changes and avoid obstacles more
effectively.
• Self-Driving Cars: Similar to robot navigation, self-driving cars utilize Bayes’
theorem to interpret sensor data (like LiDAR or cameras) and make real-time
decisions about steering, braking, and lane changes while considering
uncertainties in the environment.
• Financial Modelling: Financial institutions can employ Bayes’ theorem to
assess creditworthiness of loan applicants or predict market trends by
incorporating historical data and economic indicators to calculate the
probability of different financial outcomes.
Program:

def BayesTheorem(pA, pB, pBA):
    # P(A|B) = P(A) * P(B|A) / P(B)
    return pA * pBA / pB

# define probabilities
pRain = 0.2
pCloudy = 0.4
pCloudyRain = 0.85

# use function to calculate conditional probability
print(BayesTheorem(pRain, pCloudy, pCloudyRain))

Output:
0.425

Result:

Hence the program was successfully executed and output was verified.

EX.05: ESTIMATE THE VARIABILITY OF DATA USING
PROGRAMMING LANGUAGE
Aim:

To estimate the variability of data using programming language.

Variability:

Variability is an important dimension that measures how much the data varies, i.e.
whether the data is spread out or tightly clustered. It is also known as dispersion.
Working with data sets in Machine Learning or Data Science involves several steps
related to it: measuring variance, reducing it, and distinguishing random
variability from real variability. Identifying the sources of real variability helps
in making decisions about pre-processing and model selection.

Terms related to Variability Metrics:


-> Deviation
-> Variance
-> Standard Deviation
-> Mean Absolute Deviation
-> Median Absolute Deviation
-> Order Statistics
-> Range
-> Percentile
-> Inter-quartile Range

1. Deviation:
Deviations are also called errors or residuals. A deviation measures how far a
value lies from a central (or observed) value.
Example:

Sequence: [2, 3, 5, 6, 7, 9]
Central/Observed Value = 7
Deviation = [-5, -4, -2, -1, 0, 2]
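
A minimal sketch of the same calculation in Python, using the sequence and central value from the example (the variable names are assumptions):

```python
sequence = [2, 3, 5, 6, 7, 9]
central_value = 7   # the central/observed value chosen in the example

# Deviation of each value from the central value
deviations = [x - central_value for x in sequence]
print("Deviation =", deviations)
```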

2. Variance (s^2):
It is the best-known measure of variability and is based on squared deviations.
It is also called the mean squared deviation, as it is the average of the squared
deviations from the mean.

Example:

Sequence: [2, 3, 5, 6, 7, 9]
Mean = 5.33
Total Terms, n = 6
Squared Deviation = (2 - 5.33)^2 + (3 - 5.33)^2 + (5 - 5.33)^2 + (6 - 5.33)^2 + (7 - 5.33)^2
+ (9 - 5.33)^2
Variance = Squared Deviation / n
Program:
import numpy as np
Sequence = [2, 3, 5, 6, 7, 9]
var = np.var(Sequence)
print("Variance:", var)

Output:

Variance: 5.5555555555555545

3. Standard Deviation:
It is the square root of the variance. It is closely related to the Euclidean norm:
it equals the Euclidean norm of the deviations from the mean divided by the square
root of n.

Example:
Sequence: [2, 3, 5, 6, 7, 9]
Mean = 5.33
Total Terms, n = 6
Squared Deviation = (2 - 5.33)^2 + (3 - 5.33)^2 + (5 - 5.33)^2 + (6 - 5.33)^2 + (7 - 5.33)^2
+ (9 - 5.33)^2
Variance = Squared Deviation / n
Standard Deviation = (Variance)^(1/2)

Program:

import numpy as np
Sequence = [2, 3, 5, 6, 7, 9]
std = np.std(Sequence)
print("Standard Deviation:", std)

Output:

Standard Deviation: 2.357022603955158

4. Mean Absolute Deviation:

One way to summarise the deviations is to find a typical value for them. If we
simply average the deviations, the negative ones offset the positive ones; in fact,
the sum of deviations from the mean is always zero. So a simple approach is to take
the average of the absolute values of the deviations.

Example:
Sequence: [2, 4, 6, 8]
Mean = 5
Deviation around mean = [-3, -1, 1, 3]
Mean Absolute Deviation = (3 + 1 + 1 + 3)/ 4

Program:
import numpy as np

def mad(data):
    # mean of the absolute deviations from the mean
    data = np.asarray(data)
    return np.mean(np.absolute(data - np.mean(data)))

Sequence = [2, 4, 6, 8]
print("Mean Absolute Deviation:", mad(Sequence))

Output:

Mean Absolute Deviation: 2.0

5. Median Absolute Deviation:

Mean Absolute Deviation, Variance, and Standard Deviation (discussed in the
previous sections) are not robust to extreme values and outliers. A more robust
measure is the median of the absolute deviations from the median.

Example:

Sequence: [2, 4, 6, 8]
Median = 5
Deviation around median = [-3, -1, 1, 3]
Median Absolute Deviation = median of [3, 1, 1, 3] = 2

Program:

import numpy as np

def mad(data):
    # median of the absolute deviations from the median
    data = np.asarray(data)
    return np.median(np.absolute(data - np.median(data)))

Sequence = [2, 4, 10, 6, 8, 11]
print("Median Absolute Deviation:", mad(Sequence))

Output:

Median Absolute Deviation: 3.0

6. Order Statistics:

• This variability measurement approach is based on the spread of ranked
(sorted) data.

• To better understand this concept, we’ll take 5 random variables X1, X2, X3, X4,
X5. We’ll observe a random realization/outcome from the distribution of each
of these random variables. Suppose we get the values 4, 2, 7, 11 and 5.

• The kth order statistic for this experiment is the kth smallest value from the set
{4, 2, 7, 11, 5}. So, the 1st order statistic is 2 (smallest value), the 2nd order
statistic is 4 (next smallest), and so on. The 5th order statistic is the fifth
smallest value (the largest value), which is 11.
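
The kth order statistic can be obtained by sorting the observed values and picking the kth item. The helper name order_statistic below is hypothetical; the values are the set {4, 2, 7, 11, 5} from the explanation.

```python
def order_statistic(values, k):
    """Return the k-th smallest value (k starts at 1)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    return sorted(values)[k - 1]

observed = [4, 2, 7, 11, 5]

# Print every order statistic from the smallest to the largest value
for k in range(1, len(observed) + 1):
    print(k, order_statistic(observed, k))
```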

7. Range:
• It is the most basic measurement belonging to Order Statistics: the difference
between the largest and the smallest value of the dataset. It gives a quick idea
of the spread of the data, but it is very sensitive to outliers. It can be made
more robust by dropping the extreme values.

Example:

Sequence: [2, 30, 50, 46, 37, 91]


Here, 2 and 91 are outliers
Range = 91 - 2 = 89
Range without outliers = 50 - 30 = 20

Program:

import numpy
values = [13, 21, 21, 40, 42, 48, 55, 72]
# numpy.ptp ("peak to peak") returns max - min
x = numpy.ptp(values)
print("Range =", x)

Output:

Range = 59
8. Percentile:
• It is a very good measure of the variability in data that is not dominated by
outliers. The Pth percentile of the data is a value such that at least P% of the
values are less than or equal to it and at least (100 – P)% of the values are
greater than or equal to it.

• The Median is the 50th percentile of the data.

Example:

Sequence: [2, 30, 50, 46, 37, 91]


Sorted: [2, 30, 37, 46, 50, 91]
50th percentile = (37 + 46) / 2 = 41.5

Program:
import numpy as np
Sequence = [2, 30, 50, 46, 37, 91]
print ("50th Percentile : ", np.percentile(Sequence, 50))
print ("60th Percentile : ", np.percentile(Sequence, 60))

Output:
50th Percentile: 41.5
60th Percentile: 46.0

9. Inter-Quartile Range (IQR):


It works on ranked (sorted) data. Three quartiles divide the data: Q1 (25th
percentile), Q2 (50th percentile), and Q3 (75th percentile). The Inter-quartile
Range is the difference between Q3 and Q1.

Example:

Sequence: [2, 30, 50, 46, 37, 91]


Q1 (25th percentile): 31.75
Q2 (50th percentile): 41.5
Q3 (75th percentile): 49
IQR = Q3 - Q1 = 17.25

Program:

import numpy as np
from scipy.stats import iqr
Sequence = [2, 30, 50, 46, 37, 91]
print ("IQR : ", iqr(Sequence))

Output:
IQR: 17.25

Program:
import numpy as np
Sequence = [2, 30, 50, 46, 37, 91]
# Inter-Quartile Range
iqr = np.subtract(*np.percentile(Sequence, [75, 25]))
print ("\n IQR : ", iqr)

Output:
IQR: 17.25

Result:
Hence the program was successfully executed and output was verified.

EX.06: CALCULATE THE CORRELATION OF POPULATION

Aim:

To calculate the correlation of population.

Correlation:

A correlation exists between two variables when one is related to the other such that
there is co-movement. Positive co-movement means as one variable increases, the
other variable also increases. Negative co-movement means as one variable
increases, the other variable decreases.

What are population and sample in statistics?

The population in statistics is a collection of items that are being studied. It is usually
a large group of people or things, but it can also be a small group. A sample is a
subset of data taken from a larger population. A statistic is any numerical value that
describes or summarizes information about a sample or a population.

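The equation for Correlation in relation to Covariance and Standard Deviation:
\[r_{xy} = \dfrac{\mathrm{cov}(x, y)}{s_x \, s_y}\]

The equation for Sample Covariance:
\[\mathrm{cov}(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}\]

Now for the sample standard deviation, the equation will be:
\[s_x = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}\]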
Example for Sample Correlation Coefficient:

X: 17, 13, 15, 16, 6, 11, 14, 9, 7, 12


Y: 36, 46, 35, 24, 12, 18, 27, 22, 2, 8

Let's calculate the Sample Covariance of x and y:

Sample Standard Deviation for x:

Sample Standard Deviation for y:

Equation of Correlation:
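
The worked example for X and Y can be reproduced with a short Python sketch. It assumes the usual sample formulas (sums of deviation products and squared deviations divided by n - 1, and correlation equal to covariance divided by the product of the two standard deviations); the variable names are assumptions.

```python
import math

x = [17, 13, 15, 16, 6, 11, 14, 9, 7, 12]
y = [36, 46, 35, 24, 12, 18, 27, 22, 2, 8]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance: sum of products of deviations, divided by n - 1
covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations: square root of squared deviations divided by n - 1
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

# Correlation = covariance / (sd_x * sd_y)
correlation = covariance / (sd_x * sd_y)

print("Sample covariance:", covariance)
print("Sample standard deviation of x:", round(sd_x, 4))
print("Sample standard deviation of y:", round(sd_y, 4))
print("Correlation:", round(correlation, 4))
```

For this data the means are 12 and 23, the sample covariance is 35, and the correlation is roughly 0.69, indicating a moderate positive relationship.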

Program:

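import pandas as pd
# load the data set (the path is specific to the machine used for the experiment)
df = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
# Pearson correlation between the two columns
print(df['experience'].corr(df['salary']))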

Output:

0.9929845761480398

Result:
Hence the program was successfully executed and output was verified.

EX.07: FIND THE NORMAL AND BINOMIAL DISTRIBUTION OF
A SAMPLE DATA SET

Aim:
To find the normal and binomial distribution of a sample data set.

Normal Distribution in Statistics:


Normal distribution, also known as the Gaussian distribution, is a continuous
probability distribution that is symmetric about the mean, depicting that data near the
mean are more frequent in occurrence than data far from the mean.

Key Features of Normal Distribution:

• Symmetry: The normal distribution is symmetric around its mean. This means
the left side of the distribution mirrors the right side.
• Mean, Median, and Mode: In a normal distribution, the mean, median, and
mode are all equal and located at the center of the distribution.
• Bell-shaped Curve: The curve is bell-shaped, indicating that most of the
observations cluster around the central peak, and the probabilities for values
further away from the mean taper off equally in both directions.
• Standard Deviation: The spread of the distribution is determined by the
standard deviation. About 68% of the data falls within one standard deviation
of the mean, 95% within two standard deviations, and 99.7% within three
standard deviations.

Normal Distribution Examples:

We can draw Normal Distribution for various types of data that include,
• Distribution of Height of People.
• Distribution of Errors in any Measurement.
• Distribution of Blood Pressure of any Patient, etc.
Normal Distribution Formula – Probability Density Function (PDF):

The formula for the probability density function of the Normal Distribution
(Gaussian Distribution) is:

\[f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

where,
• x is the random variable
• μ is the mean
• σ is the standard deviation
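
As a sketch, the density of a normal distribution with mean μ and standard deviation σ can be evaluated with the standard math module, and the "about 68% within one standard deviation" property can be checked through the cumulative distribution function. The function names and the values μ = 50, σ = 5 are assumptions chosen for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution, written with the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 50, 5   # hypothetical mean and standard deviation

# Probability of falling within one standard deviation of the mean
within_one_sd = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)

print("Density at the mean:", round(normal_pdf(mu, mu, sigma), 4))
print("P(within one standard deviation):", round(within_one_sd, 4))
```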

Binomial distribution:

Binomial distribution is a fundamental probability distribution in statistics, used to
model the number of successes in a fixed number of independent trials, where each
trial has only two possible outcomes: success or failure. This distribution is
particularly useful when you want to calculate the probability of a specific number of
successes, for example in coin flips, quality control in manufacturing, or survey
outcomes.

The Binomial Distribution Formula, which is used to calculate the probability for a
random variable X = 0, 1, 2, 3, …, n, is given as

\[P(X = r) = \binom{n}{r} p^r q^{n-r}, \quad r = 0, 1, 2, \dots, n\]

where,
• n is the number of trials
• p is the probability of success
• q is the probability of failure and q = 1 – p
• p, q > 0 such that p + q = 1
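
The binomial probability formula can be sketched with math.comb (available in Python 3.8 and later), which computes the nCr term. The function name binomial_pmf and the choice n = 10, p = 0.5 are assumptions for illustration.

```python
import math

def binomial_pmf(r, n, p):
    """P(X = r) = nCr * p^r * q^(n - r), with q = 1 - p."""
    q = 1 - p
    return math.comb(n, r) * p ** r * q ** (n - r)

n, p = 10, 0.5   # e.g. 10 tosses of a fair coin

# Probability of every possible number of successes, r = 0 ... n
probabilities = [binomial_pmf(r, n, p) for r in range(n + 1)]

for r, prob in enumerate(probabilities):
    print(r, round(prob, 4))

print("Total probability:", round(sum(probabilities), 4))
```

The probabilities over r = 0, 1, …, n add up to 1, and the most likely outcome for a fair coin is r = 5.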

Binomial Distribution:
• Binomial Distribution is a Discrete Distribution.
• It describes the outcome of binary scenarios, e.g. a toss of a coin will be either
heads or tails.
• In NumPy's random.binomial it has three parameters:
   - n - number of trials.
   - p - probability of success in each trial (e.g. 0.5 for a toss of a coin).
   - size - the shape of the returned array.
Program:

from numpy import random


x = random.binomial (n=10, p=0.5, size=10)
print(x)

Output:
[3 2 6 4 3 7 5 4 5 4]

Program:
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
# histplot replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(random.binomial(n=10, p=0.5, size=1000), kde=False)
plt.show()

Output:

Difference Between Normal and Binomial Distribution:

The main difference is that the normal distribution is continuous whereas the
binomial distribution is discrete. However, when the number of trials n is large
enough, the binomial distribution looks very similar to a normal distribution with
loc = np and scale = sqrt(npq); for n = 100 and p = 0.5 this gives loc = 50 and
scale = 5.

Program:

from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
# kdeplot draws the density curves that distplot(hist=False) used to draw
sns.kdeplot(random.normal(loc=50, scale=5, size=1000), label='normal')
sns.kdeplot(random.binomial(n=100, p=0.5, size=1000), label='binomial')
plt.legend()
plt.show()

Output:

Result:
Hence the program was successfully executed and output was verified.

EX.08: PERFORM A HYPOTHESIS TESTING IN A DATA SET

Aim:

To perform a hypothesis testing in a data set.

Hypothesis Testing:

Hypothesis testing is the process by which statisticians check a hypothesis about a
population parameter. The method used depends on the type of data and the goals of
the study. Hypothesis testing uses sample data to decide whether a hypothesis is
supported. This data could come from a larger population or from a data-generating
process.

Steps in Hypothesis Testing:


Step 1: The initial phase involves identifying the research question and stating the
null and alternative hypotheses. Remember, these hypotheses are mutually exclusive:
if one claims something is true, the other must contradict it.

Step 2: Consider the statistical assumptions, such as independence of observations,


data normality, random errors and their probability distribution, randomization during
sampling, and similar factors.

Step 3: The third step is about choosing the test to verify the hypothesis.
Simultaneously, determine the method for testing the null hypothesis using sample
data.

Step 4: In the fourth stage, the data from a sample is examined. This is when
assessments such as mean values, normal distributions, t distributions, and z-scores
are sought.

Step 5: The final stage involves making a decision on whether to reject the null
hypothesis in favor of the alternative or to retain it.

Hypothesis testing is used to determine whether the evidence within a sample dataset
is strong enough to support or refute a claim about the entire population. The
Z-test assesses such a claim about the population mean using a sample, when the
population standard deviation is known. Typically, the sample statistic is compared
against the value expected under an idealized model (the null hypothesis).

Hypothesis Testing Formula:

\[Z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}\]

• x̄ symbolises the sample mean,
• μ denotes the population mean,
• σ stands for the standard deviation, and
• n represents the sample size.

Example:
Given: x̄ = 25.5, μ = 24, σ = 4, n = 36

Substituting the given values:

Z = (25.5 − 24) / (4/√36)
Z = 1.5 / (4/6)
Z = 1.5 / 0.6667
Z ≈ 2.25
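
The same Z value can be computed in Python, together with a two-sided p-value for the decision step. This is a minimal sketch: the function names are hypothetical, and the significance level of 0.05 is an assumption, since the example does not state one.

```python
import math

def z_statistic(sample_mean, population_mean, sigma, n):
    """Z = (sample mean - population mean) / (sigma / sqrt(n))."""
    return (sample_mean - population_mean) / (sigma / math.sqrt(n))

def two_sided_p_value(z):
    """P(|Z| >= |z|) under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

z = z_statistic(sample_mean=25.5, population_mean=24, sigma=4, n=36)
p_value = two_sided_p_value(z)

alpha = 0.05   # assumed significance level (not given in the example)
print("Z =", round(z, 2))
print("p-value =", round(p_value, 4))
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

Under the assumed 5% level, Z ≈ 2.25 lies beyond the two-sided critical value of 1.96, so the null hypothesis would be rejected.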

Using JASP calculate Hypothesis Testing:

1. First, import the student data set into the JASP software.

2. Next, select T-Tests and, under the Classical option, choose One Sample T-Test.

3. Click One Sample T-Test.

4. Move the required variables from the list of all variables into the Variables box.

5. The test result appears in the Results pane.

Bar Chart:

Result:
Hence the program was successfully executed and output was verified.

EX.09: CALCULATE SIMPLE AND MULTIPLE LINEAR
REGRESSIONS

Aim:

To calculate simple and multiple linear regressions.

Linear Regression:
Linear regression is a statistical method that is used to predict a continuous dependent
variable (target variable) based on one or more independent variables (predictor
variables). This technique assumes a linear relationship between the dependent and
independent variables, which implies that the dependent variable changes
proportionally with changes in the independent variables. In other words, linear
regression is used to determine the extent to which one or more variables can predict
the value of the dependent variable.

Types of Linear Regression:

There are two main types of linear regression:

• Simple linear regression: This involves predicting a dependent variable based
on a single independent variable.
• Multiple linear regression: This involves predicting a dependent variable
based on multiple independent variables.

Simple Linear Regression:


• Simple linear regression (SLR) is a method to predict a response using one
feature. It is assumed that the two variables are linearly related. Thus, we try to
find a linear equation that predicts the response value (y) as precisely as
possible from the feature, or independent variable (x).
• Let's consider a dataset in which we have a response y for each value of the
feature x:
For simplification, we define:

x as feature vector, i.e., x = [x1, x2, x3, ….,xn],

y as response vector, i.e., y = [y1, y2, y3 ….,yn]

Program:
import numpy as nmp
import matplotlib.pyplot as mtplt

def estimate_coeff(p, q):
    # Here, we will estimate the total number of points or observations
    n1 = nmp.size(p)
    # Now, we will calculate the mean of the p and q vectors
    m_p = nmp.mean(p)
    m_q = nmp.mean(q)
    # here, we will calculate the cross deviation and deviation about p
    SS_pq = nmp.sum(q * p) - n1 * m_q * m_p
    SS_pp = nmp.sum(p * p) - n1 * m_p * m_p
    # here, we will calculate the regression coefficients
    b_1 = SS_pq / SS_pp
    b_0 = m_q - b_1 * m_p
    return (b_0, b_1)

def plot_regression_line(p, q, b):
    # Now, we will plot the actual points or observations as a scatter plot
    mtplt.scatter(p, q, color = "m", marker = "o", s = 30)
    # here, we will calculate the predicted response vector
    q_pred = b[0] + b[1] * p
    # here, we will plot the regression line
    mtplt.plot(p, q_pred, color = "g")
    # here, we will put the labels
    mtplt.xlabel('p')
    mtplt.ylabel('q')
    # here, we will call the function to show the plot
    mtplt.show()

def main():
    # entering the observation points or data
    p = nmp.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
    q = nmp.array([11, 13, 12, 15, 17, 18, 18, 19, 20, 22])
    # now, we will estimate the coefficients
    b = estimate_coeff(p, q)
    print("Estimated coefficients are :\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
    # Now, we will plot the regression line
    plot_regression_line(p, q, b)

if __name__ == "__main__":
    main()

Output:
Estimated coefficients are:
b_0 = -0.4606060606060609
b_1 = 1.1696969696969697
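
As a rough cross-check of these coefficients, the same line can be fitted with NumPy's built-in polynomial fit. This is only a sketch and not part of the recorded program; the variable names slope and intercept are chosen here for illustration, and the data are the same ten observations.

```python
import numpy as np

# Same observations as in the program above
p = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype=float)
q = np.array([11, 13, 12, 15, 17, 18, 18, 19, 20, 22], dtype=float)

# A degree-1 polynomial fit returns the coefficients highest power first,
# i.e. [slope, intercept]
slope, intercept = np.polyfit(p, q, deg=1)

print("b_0 (intercept) =", intercept)
print("b_1 (slope)     =", slope)
```

Both values should agree with b_0 and b_1 above up to floating-point rounding.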

Multiple linear regression:

Multiple linear regression attempts to model the relationship between several
features and a response by fitting a linear equation to the observed data. It is
simply an extension of simple linear regression.

Consider a dataset that has several features (or independent variables) as well
as one response (or dependent variable).
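
The idea can be sketched with a small least-squares fit of y = b_0 + b_1*x1 + b_2*x2. The data below are made up for illustration (they are not from this record), and the names X, y, A and coef are hypothetical. The sketch uses numpy.linalg.lstsq rather than scikit-learn, so it is a swapped-in technique and not the method of the program that follows.

```python
import numpy as np

# Hypothetical data: 5 observations of two features (x1, x2)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])

# Hypothetical response built exactly from y = 1 + 2*x1 + 3*x2
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept b_0
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of A @ coef ~= y
coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)

print("b_0, b_1, b_2 =", coef)
```

Because the response was generated without noise, the fit recovers the generating coefficients (1, 2, 3) up to rounding; with real data the coefficients are only estimates.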

Program:
import matplotlib.pyplot as mtpplt
from sklearn import datasets as DS
from sklearn import linear_model as LM
from sklearn.model_selection import train_test_split as tts

# Load the Boston housing dataset.
# Note: load_boston was removed in scikit-learn 1.2, so this call needs an
# older release of scikit-learn.
boston1 = DS.load_boston(return_X_y = False)
# Define the feature matrix (H) and response vector (f)
H = boston1.data
f = boston1.target
# Split H and f into training and testing sets
H_train, H_test, f_train, f_test = tts(H, f, test_size = 0.4,
                                       random_state = 1)
# Create the linear regression object
reg1 = LM.LinearRegression()
# Train the model using the training sets
reg1.fit(H_train, f_train)
# Print the regression coefficients
print('Regression Coefficients are: ', reg1.coef_)
# Print the variance score (R^2 on the test set): 1 means perfect prediction
print('Variance score is: {}'.format(reg1.score(H_test, f_test)))
# Set the plot style
mtpplt.style.use('fivethirtyeight')
# Plot the residual errors in the training data
mtpplt.scatter(reg1.predict(H_train), reg1.predict(H_train) - f_train,
               color = "green", s = 10, label = 'Train data')
# Plot the residual errors in the test data
mtpplt.scatter(reg1.predict(H_test), reg1.predict(H_test) - f_test,
               color = "blue", s = 10, label = 'Test data')
# Plot the line for zero residual error
mtpplt.hlines(y = 0, xmin = 0, xmax = 50, linewidth = 2)
# Add the legend
mtpplt.legend(loc = 'upper right')
# Add the title
mtpplt.title("Residual errors")
# Show the plot
mtpplt.show()

Output:

Regression Coefficients are:
[-8.95714048e-02 6.73132853e-02 5.04649248e-02 2.18579583e+00
 1.72053975e+01 3.63606995e+00 2.05579939e-03 1.36602886e+00
 2.89576718e-01 -1.22700072e-02 -8.34881849e-01 9.40360790e-03
 -5.04008320e-01]
Variance score is: 0.7209056672661751
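
The "variance score" printed by the score method of LinearRegression is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A minimal sketch of that calculation follows; the helper r2_score_manual and the four sample values are hypothetical and chosen only to illustrate the formula, not taken from the housing data.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Residual sum of squares: variation left unexplained by the predictions
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares: variation of the responses about their mean
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted responses
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

r2 = r2_score_manual(y_true, y_pred)
print("R^2 =", round(r2, 4))
```

A score of about 0.72, as reported above, therefore means that the fitted model accounts for roughly 72% of the variation in the test responses.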

Result:
Hence the program was successfully executed and output was verified.

EX.10: PERFORM A CHI-SQUARE TEST OF SAMPLE DATA

Aim:

To perform a chi-square test of sample data.

Chi-Square Test:

The Chi-Square test is a statistical procedure for assessing the difference between
observed and expected data. It can also be used to decide whether two categorical
variables in the data are associated, that is, whether an observed difference between
them is due to chance or to a relationship between them.

A chi-square test or a comparable nonparametric test is required to test a hypothesis
regarding the distribution of a categorical variable. Categorical variables, which
indicate categories such as animals or countries, can be nominal or ordinal. They
cannot have a normal distribution since they take only a few particular values.

Chi-Square Test Formula:

\[ \chi^2_c = \sum_i \frac{(O_i - E_i)^2}{E_i} \]

Where

c = Degrees of freedom

O = Observed Value

E = Expected Value
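
The statistic sums (O - E)^2 / E over all cells of the table. A minimal sketch of that arithmetic for a 2 x 2 table, using the same counts as the program in this exercise, is given below. The expected counts are computed under the usual independence assumption (row total times column total divided by the grand total), and the variable names are chosen here for illustration.

```python
import numpy as np

# Observed counts (rows: groups, columns: products)
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

# Expected count of each cell under independence:
# (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected = row_totals * col_totals / grand_total

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = np.sum((observed - expected) ** 2 / expected)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print("Expected counts:\n", expected)
print("Chi-square statistic =", chi2)
print("Degrees of freedom   =", dof)
```

For these counts the expected table is [[20, 20], [30, 30]], the uncorrected statistic is about 16.67, and there is 1 degree of freedom.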

Program:

# Build the 2 x 2 contingency table of observed counts
data <- matrix(c(30, 10, 20, 40), nrow = 2, byrow = TRUE)
# Assign row and column names for clarity
rownames(data) <- c("Group 1", "Group 2")
colnames(data) <- c("Product A", "Product B")
# Print the dataset
print("Contingency Table:")
print(data)
# Perform the Chi-Square test
# (for a 2 x 2 table, chisq.test applies Yates' continuity correction by default)
chi_square_result <- chisq.test(data)
# Display the result
print("Chi-Square Test Result:")
print(chi_square_result)
# Interpret the result
if (chi_square_result$p.value < 0.05) {
  print("Reject the null hypothesis: There is a significant association between the groups and product preferences.")
} else {
  print("Fail to reject the null hypothesis: There is no significant association between the groups and product preferences.")
}

Output:

Result:

Hence the program was successfully executed and output was verified.
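
As a rough cross-check of the test on the same 2 x 2 counts, the statistic can also be worked out in Python. This sketch assumes Yates' continuity correction, which R's chisq.test applies to 2 x 2 tables by default, and it uses the fact that for 1 degree of freedom the upper-tail chi-square probability equals erfc(sqrt(x / 2)). The variable names are chosen here for illustration, and the sketch is not the recorded output of the R program.

```python
import math
import numpy as np

# Observed counts from the contingency table
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

# Expected counts under independence
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Yates' continuity correction: shrink each |O - E| by 0.5, never below zero
corrected = np.maximum(np.abs(observed - expected) - 0.5, 0.0)
chi2_yates = np.sum(corrected ** 2 / expected)

# Upper-tail p-value; this closed form is valid only for 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2_yates / 2.0))

print("Corrected chi-square =", chi2_yates)
print("p-value              =", p_value)
print("Reject H0 at 0.05?   ", p_value < 0.05)
```

Under these assumptions the corrected statistic is about 15.04 and the p-value is far below 0.05, so the null hypothesis of no association would be rejected.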
