dHINESH STATISTICAL REC
FORM NO.-F/TL/021
REV.00 Date 20.03.2020
RECORD NOTEBOOK
2024-2025(ODD SEMESTER)
DEPARTMENT
OF
NAME : DHINESH R
REGISTER NO : 221191101038
YEAR/SEM/SEC : 3/5/A
BONAFIDE CERTIFICATE
Register No : 221191101038
Certified that this is the bonafide record of work done by DHINESH R, 221191101038 of
INDEX
Aim:
Algorithm:
Mean:
In statistics, the mean (also called the average) of a data set is calculated by
adding all the values in the set and dividing by the number of values. It is
sensitive to skewed data and extreme values.
Mean Formula:
The mean is defined as the sum of all observations in the given dataset divided
by the total number of observations:
Mean = (Sum of all observations) / (Total number of observations)
Median:
Median is the middle value of the dataset when arranged in ascending or descending
order. If the dataset has an odd number of values, the median is the middle value. If
the dataset has an even number of values, the median is the average of the two middle
values.
Median Formula When n is Odd:
For a sorted set of n observations, where n is odd, the median is:
Median = [(n + 1)/2]th term
Median Formula When n is Even:
For a sorted set of n observations, where n is even, the median is:
Median = [(n/2)th term + ((n/2) + 1)th term]/2
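The two positional formulas above can be sketched in Python as follows. This is only an illustration: the function name median_by_position and the sample lists are assumptions, not part of the record.

```python
def median_by_position(values):
    """Median via the positional formulas:
    odd n  -> the [(n + 1)/2]th term of the sorted data,
    even n -> the mean of the (n/2)th and ((n/2) + 1)th terms."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # 1-based position (n + 1)/2 corresponds to 0-based index (n + 1)//2 - 1
        return ordered[(n + 1) // 2 - 1]
    # 1-based positions n/2 and n/2 + 1 correspond to indices n//2 - 1 and n//2
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_sample = [7, 1, 5]       # hypothetical data, n = 3
even_sample = [7, 1, 5, 3]   # hypothetical data, n = 4
print(median_by_position(odd_sample))   # 5
print(median_by_position(even_sample))  # 4.0
```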
Mode:
Mode in statistics is the value or data point that appears most frequently in a data set.
It is a measure of central tendency and can be calculated for both numerical and
categorical data.
Mode is less widely used than mean and median. A data set can have more than
one mode.
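A short sketch of how a data set can have several modes, and how the mode also works for categorical data. It assumes Python 3.8 or later for statistics.multimode; both lists are hypothetical.

```python
from statistics import multimode

marks = [55, 75, 25, 78, 55, 75]            # hypothetical numerical data
colours = ["red", "blue", "red", "green"]   # hypothetical categorical data

# multimode returns every value that shares the highest frequency,
# in the order the values are first encountered
print(multimode(marks))    # [55, 75]
print(multimode(colours))  # ['red']
```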
Problem: Student marks information:
Name      Gender  Maths  Physics  Chemistry  English  Biology  Economics  History  Civics
John      M       55     45       56         87       21       52         89       65
Suresh    M       75     96       78         64       90       61         58       2
Ramesh    M       25     54       89         76       95       87         56       74
Jessica   F       78     96       86         63       54       89         75       45
Jennifer  F       58     96       78         46       96       77         83       53
Annu      F       45     87       52         89       55       89         87       52
Pooja     F       55     64       61         58       75       58         64       61
Ritesh    M       54     76       87         56       25       56         76       87
Farha     F       55     63       89         75       78       75         63       89
Mukesh    M       96     46       77         83       58       83         46       77
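As an illustration of the three measures on the marks data, the sketch below uses only the Maths column of the table and Python's standard statistics module. It is one possible approach, not necessarily the program used for this exercise; the variable names are assumptions.

```python
from statistics import mean, median, mode

# Maths marks of the ten students, in table order
maths = [55, 75, 25, 78, 58, 45, 55, 54, 55, 96]

maths_mean = mean(maths)      # sum of marks / number of students
maths_median = median(maths)  # average of the two middle values (n is even)
maths_mode = mode(maths)      # most frequent mark

print("Mean:", maths_mean)      # Mean: 59.6
print("Median:", maths_median)  # Median: 55.0
print("Mode:", maths_mode)      # Mode: 55
```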
Result:
Hence the program was successfully executed and output was verified.
EX.02: CALCULATE QUARTILE AND STANDARD DEVIATION
TO A POPULATION
Aim:
Quartile Deviation:
Quartile deviation is a statistic that measures the spread of the middle half of the
data. It is also referred to as the semi-interquartile range and is half of the
difference between the third quartile and the first quartile. The formula for the
quartile deviation of the data is Q.D = (Q3 - Q1)/2.
Where,
Q3 = Upper Quartile (size of the 3[(N + 1)/4]th item)
Q1 = Lower Quartile (size of the [(N + 1)/4]th item)
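The positional formulas for Q1 and Q3 can be sketched as below. The data set is hypothetical, the helper names are assumptions, and linear interpolation for fractional positions is an assumed convention that the formulas above do not specify.

```python
def item_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data.
    Fractional positions are linearly interpolated (assumed convention)."""
    whole = int(position)
    fraction = position - whole
    if whole < 1:
        return ordered[0]
    if whole >= len(ordered):
        return ordered[-1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


def quartile_deviation(values):
    """Return (Q1, Q3, Q.D) using the (N + 1)/4 and 3(N + 1)/4 positions."""
    ordered = sorted(values)          # arrange the data in ascending order
    n = len(ordered)
    q1 = item_at(ordered, (n + 1) / 4)
    q3 = item_at(ordered, 3 * (n + 1) / 4)
    return q1, q3, (q3 - q1) / 2


sample = [12, 4, 8, 20, 16, 28, 24]   # hypothetical data, N = 7
q1, q3, qd = quartile_deviation(sample)
print("Q1 =", q1)                         # Q1 = 8.0
print("Q3 =", q3)                         # Q3 = 24.0
print("Interquartile range =", q3 - q1)   # Interquartile range = 16.0
print("Quartile deviation =", qd)         # Quartile deviation = 8.0
```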
Problem:
With the help of the data given below, find the interquartile range, quartile
deviation.
Solution:
First arrange the data into ascending order, Then
Population and sample standard deviation:
Standard deviation measures the spread of a data distribution. It measures the typical
distance between each data point and the mean.
The formula we use for standard deviation depends on whether the data is being
considered a population of its own, or the data is a sample representing a larger
population.
If the data is being considered a population on its own, we divide by the
number of data points, N.
If the data is a sample from a larger population, we divide by one fewer than
the number of data points in the sample, n - 1.
Here's how to calculate population standard deviation:
6 6-3=3
2 2-3=-1
3 3-3=0
1 1-3=-2
Step 3: Square each deviation.
6 6-3=3 (3)^2=9
2 2-3=-1 (-1)^2=1
3 3-3=0 (0)^2=0
1 1-3=-2 (-2)^2=4
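The remaining steps (sum the squared deviations, divide, take the square root) can be sketched for the same four data points 6, 2, 3, 1. The sample version is included only to show the n - 1 divisor; the variable names are assumptions.

```python
import math

data = [6, 2, 3, 1]

mean = sum(data) / len(data)                      # 3.0
squared_deviations = [(x - mean) ** 2 for x in data]
total = sum(squared_deviations)                   # 9 + 1 + 0 + 4 = 14

# Population: divide by N. Sample: divide by n - 1.
population_sd = math.sqrt(total / len(data))
sample_sd = math.sqrt(total / (len(data) - 1))

print("Population standard deviation:", round(population_sd, 4))  # 1.8708
print("Sample standard deviation:", round(sample_sd, 4))          # 2.1602
```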
Output:
Standard Deviation:
Output:
Result:
Hence the program was successfully executed and output was verified.
EX.03: WRITE A PROGRAM FOR PROBABILITY
DISTRIBUTION
Aim:
Probability Distribution:
Formula:
Program:
Using JASP Statistical Software:
Result:
Hence the program was successfully executed and output was verified.
EX.04: WRITE A PROGRAM TO IMPLEMENT
BAYES THEOREM
Aim:
Bayes Theorem:
It is named after the English statistician Thomas Bayes, whose formula was
published posthumously in 1763.
Bayes’ Theorem states the following for any two events A and B, with P(B) > 0:
P(A|B) = P(A) * P(B|A) / P(B)
where:
P(A|B): The probability of event A, given event B has occurred.
P(B|A): The probability of event B, given event A has occurred.
P(A): The probability of event A.
P(B): The probability of event B.
Benefits of Bayes theorem:
Example:
Suppose the probability of the weather being cloudy is 40%.
Also suppose the probability of rain on a given day is 20%.
Also suppose the probability of clouds on a rainy day is 85%.
If it’s cloudy outside on a given day, what is the probability that it will rain that day?
Solution:
P(cloudy) = 0.40
P(rain) = 0.20
P(cloudy | rain) = 0.85
Thus, we can calculate:
P(rain | cloudy) = P(rain) * P(cloudy | rain) / P(cloudy)
P(rain | cloudy) = 0.20 * 0.85 / 0.40
P(rain | cloudy) = 0.425
If it’s cloudy outside on a given day, the probability that it will rain that day
is 42.5%.
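The worked example above can be checked with a few lines of Python. This is a minimal sketch; the function name bayes_posterior and its parameter names are assumptions chosen for illustration.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, prob_b):
    """Return P(A|B) = P(A) * P(B|A) / P(B)."""
    if prob_b <= 0:
        raise ValueError("P(B) must be greater than zero")
    return prior_a * likelihood_b_given_a / prob_b


p_rain = 0.20               # P(rain)
p_cloudy = 0.40             # P(cloudy)
p_cloudy_given_rain = 0.85  # P(cloudy | rain)

p_rain_given_cloudy = bayes_posterior(p_rain, p_cloudy_given_rain, p_cloudy)
print(round(p_rain_given_cloudy, 3))  # 0.425
```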
Applications of Bayes theorem in Artificial Intelligence:
Spam Filtering: Spam filters leverage Bayes’ theorem to categorize emails as
spam or not spam based on the words they contain.
Image Classification: Image recognition systems can use Bayes’ theorem to
assign probabilities to different object categories in an image.
Recommendation Systems: Recommendation engines can utilize Bayes’
theorem to personalize suggestions based on a user’s past behavior and
preferences.
Anomaly Detection: Identifying unusual patterns in data (e.g., fraudulent
credit card transactions) often involves Bayes’ theorem to calculate the
likelihood of an event being anomalous.
Sentiment Analysis: Analysing the sentiment of text data (positive, negative,
or neutral) can be enhanced with Bayes’ theorem by considering the context
and prior knowledge about sentiment-related words.
Natural Language Processing (NLP): Beyond sentiment analysis, NLP tasks
like machine translation and part-of-speech tagging can benefit from Bayes’
theorem. It can help determine the most likely translation for a sentence or the
most probable part of speech for a word based on surrounding words and
context.
Medical Diagnosis: While not a replacement for medical expertise, Bayes’
theorem can be used in conjunction with patient data and medical history to
calculate the probability of a specific disease. This can aid doctors in making
informed decisions and prioritizing further tests.
Robot Navigation: Robots navigating complex environments can leverage
Bayes’ theorem to update their understanding of the surroundings based on
sensor data. This helps them adapt to changes and avoid obstacles more
effectively.
Self-Driving Cars: Similar to robot navigation, self-driving cars utilize Bayes’
theorem to interpret sensor data (like LiDAR or cameras) and make real-time
decisions about steering, braking, and lane changes while considering
uncertainties in the environment.
Financial Modelling: Financial institutions can employ Bayes’ theorem to
assess creditworthiness of loan applicants or predict market trends by
incorporating historical data and economic indicators to calculate the
probability of different financial outcomes.
Program:
Output:
0.425
Result:
Hence the program was successfully executed and output was verified.
EX.5: ESTIMATE THE VARIABILITY OF DATA USING
PROGRAMMING LANGUAGE
Aim:
Variability:
Variability is an important dimension that measures how much the data vary, i.e.
whether the values are spread out or tightly clustered. It is also known as
dispersion. Working with variability in Machine Learning or Data Science involves
several steps: measuring variance, reducing it, distinguishing random variability
from real variability, identifying the sources of real variability, and making
pre-processing or model-selection decisions based on it.
1. Deviation:
Deviations are also called errors or residuals. A deviation measures how far a
value lies from the central/observed value.
Example:
Sequence: [2, 3, 5, 6, 7, 9]
Central/Observed Value = 7
Deviation = [-5, -4, -2, -1, 0, 2]
2. Variance (s^2):
It is the best-known measure of variability. It is the average of the squared
deviations from the mean, so it can also be called the mean squared deviation.
Example:
Sequence: [2, 3, 5, 6, 7, 9]
Mean = 5.33
Total Terms, n = 6
Squared Deviation = (2 - 5.33)^2 + (3 - 5.33)^2 + (5 - 5.33)^2 + (6 - 5.33)^2
+ (7 - 5.33)^2 + (9 - 5.33)^2 = 33.33
Variance = Squared Deviation / n = 33.33 / 6 = 5.56
Program:
import numpy as np
Sequence = [2, 3, 5, 6, 7, 9]
var = np.var(Sequence)
print("Variance:", var)
Output:
Variance: 5.5555555555555545
3. Standard Deviation:
It is the square root of the variance. It is related to the Euclidean norm: it
equals the Euclidean norm of the deviations from the mean divided by the square
root of n.
Example:
Sequence: [2, 3, 5, 6, 7, 9]
Mean = 5.33
Total Terms, n = 6
Squared Deviation = (2 - 5.33)^2 + (3 - 5.33)^2 + (5 - 5.33)^2 + (6 - 5.33)^2
+ (7 - 5.33)^2 + (9 - 5.33)^2
Variance = Squared Deviation / n
Standard Deviation = (Variance)^(1/2)
Program:
import numpy as np
Sequence = [2, 3, 5, 6, 7, 9]
std = np.std(Sequence)
print("Standard Deviation:", std)
Output:
4. Mean Absolute Deviation:
Averaging the raw deviations does not give a typical deviation, because the
negative deviations offset the positive ones; the sum of deviations from the mean
is always zero. A simple alternative is to take the average of the absolute
values of the deviations.
Example:
Sequence: [2, 4, 6, 8]
Mean = 5
Deviation around mean = [-3, -1, 1, 3]
Mean Absolute Deviation = (3 + 1 + 1 + 3)/ 4
Program:
import numpy as np

def mad(data):
    # Mean of the absolute deviations from the mean
    data = np.asarray(data)
    return np.mean(np.absolute(data - np.mean(data)))

Sequence = [2, 4, 6, 8]
print("Mean Absolute Deviation:", mad(Sequence))
Output:
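5. Median Absolute Deviation:
It is the median of the absolute deviations from the median, which makes it less
sensitive to outliers than the mean absolute deviation.
Example:
Sequence: [2, 4, 6, 8]
Median = 5
Deviation around median = [-3, -1, 1, 3]
Median Absolute Deviation = median of [3, 1, 1, 3] = 2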
Program:
import numpy as np

def mad(data):
    # Median of the absolute deviations from the median
    data = np.asarray(data)
    return np.median(np.absolute(data - np.median(data)))

Sequence = [2, 4, 10, 6, 8, 11]
print("Median Absolute Deviation:", mad(Sequence))
Output:
6. Order Statistics:
To better understand this concept, we’ll take 5 random variables X1, X2, X3, X4,
X5. We’ll observe a random realization/outcome from the distribution of each
of these random variables. Suppose we get the following values:
The kth order statistic for this experiment is the kth smallest value from the set
{4, 2, 7, 11, 5}. So, the 1st order statistic is 2 (smallest value), the 2nd order
statistic is 4 (next smallest), and so on. The 5th order statistic is the fifth
smallest value (the largest value), which is 11.
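The kth order statistic of the set {4, 2, 7, 11, 5} can be read off after sorting, as in this small sketch (the function name order_statistic is an assumption):

```python
def order_statistic(values, k):
    """Return the kth smallest value (k is 1-based)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    return sorted(values)[k - 1]


observations = [4, 2, 7, 11, 5]
print(order_statistic(observations, 1))  # 2
print(order_statistic(observations, 2))  # 4
print(order_statistic(observations, 5))  # 11
```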
7. Range:
It is the most basic measurement belonging to Order Statistics. It is the
difference between the largest and the smallest value of the dataset. It gives a
quick idea of the spread of the data but is very sensitive to outliers. It can be
made more robust by dropping the extreme values before computing it.
Example:
Program:
import numpy
values = [13,21,21,40,42,48,55,72]
# numpy has no range() function; ptp ("peak to peak") returns max - min
x = numpy.ptp(values)
print("Range=" + str(x))
Output:
Range=59
8. Percentile:
It is a good measure of the variability in data that is not dominated by outliers.
The Pth percentile of the data is a value such that at least P% of the values are
less than or equal to it and at least (100 - P)% of the values are greater than or
equal to it.
Example:
Program:
import numpy as np
Sequence = [2, 30, 50, 46, 37, 91]
print ("50th Percentile : ", np.percentile(Sequence, 50))
print ("60th Percentile : ", np.percentile(Sequence, 60))
Output:
50th Percentile: 41.5
60th Percentile: 46.0
Example:
Program:
import numpy as np
from scipy.stats import iqr
Sequence = [2, 30, 50, 46, 37, 91]
print ("IQR : ", iqr(Sequence))
Output:
IQR: 17.25
Program:
import numpy as np
Sequence = [2, 30, 50, 46, 37, 91]
# Inter-Quartile Range: 75th percentile minus 25th percentile
iqr = np.subtract(*np.percentile(Sequence, [75, 25]))
print ("\n IQR : ", iqr)
Output:
IQR: 17.25
Result:
Hence the program was successfully executed and output was verified.
EX.06: CALCULATE THE CORRELATION OF POPULATION
Aim:
Correlation:
A correlation exists between two variables when one is related to the other such that
there is co-movement. Positive co-movement means as one variable increases, the
other variable also increases. Negative co-movement means as one variable
increases, the other variable decreases.
The population in statistics is a collection of items that are being studied. It is usually
a large group of people or things, but it can also be a small group. A sample is a
subset of data taken from a larger population. A statistic is any numerical value that
describes or summarizes information about a sample or a population.
Now for the sample standard deviation, the equation will be:
Example for Sample Correlation Coefficient:
Sample Standard Deviation for x:
Equation of Correlation:
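A minimal sketch of the sample (Pearson) correlation coefficient, computed as the sample covariance divided by the product of the two sample standard deviations, with n - 1 in each divisor. The experience and salary figures are hypothetical and the function names are assumptions.

```python
import math


def sample_std(values):
    """Sample standard deviation: divide by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_correlation(x, y):
    """Pearson r = sample covariance / (s_x * s_y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return covariance / (sample_std(x) * sample_std(y))


experience = [1, 2, 3, 4, 5]    # hypothetical years of experience
salary = [30, 35, 45, 50, 60]   # hypothetical salaries
print("r =", round(sample_correlation(experience, salary), 4))
```

A value of r close to +1 indicates strong positive co-movement, and a value close to -1 indicates strong negative co-movement.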
Program:
import pandas as pd
# Load the data into the same DataFrame name that is used below
df = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
print(df['experience'].corr(df['salary']))
Output:
0.9929845761480398
Result:
Hence the program was successfully executed and output was verified.
EX.07: FIND THE NORMAL AND BINOMIAL DISTRIBUTION OF
A SAMPLE DATA SET
Aim:
To find the normal and binomial distribution of a sample data set.
Symmetry: The normal distribution is symmetric around its mean. This means
the left side of the distribution mirrors the right side.
Mean, Median, and Mode: In a normal distribution, the mean, median, and
mode are all equal and located at the center of the distribution.
Bell-shaped Curve: The curve is bell-shaped, indicating that most of the
observations cluster around the central peak, and the probabilities for values
further away from the mean taper off equally in both directions.
Standard Deviation: The spread of the distribution is determined by the
standard deviation. About 68% of the data falls within one standard deviation
of the mean, 95% within two standard deviations, and 99.7% within three
standard deviations.
We can draw Normal Distribution for various types of data that include,
Distribution of Height of People.
Distribution of Errors in any Measurement.
Distribution of Blood Pressure of any Patient, etc.
Normal Distribution Formula – Probability Density Function (PDF):
f(x) = (1 / (σ√(2π))) * e^(-(x - μ)^2 / (2σ^2))
where,
x is Random Variable
μ is Mean
σ is Standard Deviation
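The normal probability density function, with x, μ and σ as defined above, can be sketched as follows. The function name normal_pdf is an assumption; statistics.NormalDist from the standard library is used only to check the "about 68% within one standard deviation" property.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Height of the standard normal curve at its mean
print(round(normal_pdf(0, 0, 1), 4))   # 0.3989

# Probability of falling within one standard deviation of the mean
standard = NormalDist(mu=0, sigma=1)
within_one_sd = standard.cdf(1) - standard.cdf(-1)
print(round(within_one_sd, 4))         # 0.6827
```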
Binomial Distribution:
Binomial Distribution is a Discrete Distribution.
It describes the outcome of binary scenarios, e.g. a coin toss, which is either
heads or tails.
It has three parameters:
n - number of trials.
p - probability of success on each trial (e.g. 0.5 for a fair coin toss).
size - the shape of the returned array.
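A standard-library sketch of the binomial distribution with the same three parameters. The function names are assumptions, and the sampler simply counts successes in n simulated trials; it is meant as an illustration of what a binomial draw is, not a replacement for a library generator.

```python
import random
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_sample(n, p, size, seed=0):
    """Draw `size` values, each the number of successes in n trials."""
    rng = random.Random(seed)   # seeded so the draws are repeatable
    return [sum(rng.random() < p for _ in range(n)) for _ in range(size)]


# Probability of exactly 5 heads in 10 fair coin tosses
print(binomial_pmf(5, 10, 0.5))   # 0.24609375

# Ten simulated experiments of 10 tosses each
draws = binomial_sample(n=10, p=0.5, size=10)
print(draws)
```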
Program:
Output:
[3 2 6 4 3 7 5 4 5 4]
Program:
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(random.binomial(n=10, p=0.5, size=1000), kde=False)
plt.show()
Output:
Difference Between Normal and Binomial Distribution:
Program:
Output:
Result:
Hence the program was successfully executed and output was verified.
EX.08: PERFORM A HYPOTHESIS TESTING IN A DATA SET
Aim:
Hypothesis Testing:
Step 3: The third step is about choosing the test to verify the hypothesis.
Simultaneously, determine the method for testing the null hypothesis using sample
data.
Step 4: In the fourth stage, the data from a sample is examined. This is when
assessments such as mean values, normal distributions, t distributions, and z-scores
are sought.
Step 5: The final stage involves making a decision on whether to reject the null
hypothesis in favor of the alternative or to retain it.
Hypothesis testing is used to determine whether the evidence in a sample dataset
is strong enough to support or reject a claim about the entire population. The
Z-test is one such test: it assesses a claim about a population mean using a sample.
Typically, hypothesis testing compares two data sets, or compares a sample against
a synthetic data set drawn from an idealized model.
Hypothesis Testing Formula:
Z = (x̄ - μ) / (σ / √n)
Example:
Given: x̄ = 25.5, μ = 24, σ = 4, n = 36
Z = (25.5 - 24) / (4 / √36)
Z = 1.5 / (4 / 6)
Z = 1.5 / 0.6667
Z ≈ 2.25
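The Z value in this example can be reproduced, and turned into a decision, with the sketch below. The two-sided test and the 0.05 significance level are assumptions made for illustration; the function name one_sample_z is also hypothetical.

```python
import math
from statistics import NormalDist


def one_sample_z(sample_mean, population_mean, sigma, n):
    """Z = (sample mean - population mean) / (sigma / sqrt(n))."""
    return (sample_mean - population_mean) / (sigma / math.sqrt(n))


z = one_sample_z(sample_mean=25.5, population_mean=24, sigma=4, n=36)
print("Z =", round(z, 2))   # Z = 2.25

# Two-sided p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print("p-value =", round(p_value, 4))

alpha = 0.05   # assumed significance level
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Retain the null hypothesis")
```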
2. Next, select T-Tests and, under the Classical option, select One Sample T-Test.
4. Move the required variables from the list of all variables into the Variables box.
Bar Chart:
Result:
Hence the program was successfully executed and output was verified.
EX.09: CALCULATE SIMPLE AND MULTIPLE LINEAR
REGRESSIONS
Aim:
Linear Regression:
Linear regression is a statistical method that is used to predict a continuous dependent
variable (target variable) based on one or more independent variables (predictor
variables). This technique assumes a linear relationship between the dependent and
independent variables, which implies that the dependent variable changes
proportionally with changes in the independent variables. In other words, linear
regression is used to determine the extent to which one or more variables can predict
the value of the dependent variable.
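A minimal least-squares sketch for simple linear regression, fitting y = b_0 + b_1 * x. The data points are hypothetical and chosen so the fit is exact; the function name is an assumption, and this is not necessarily the program used in this exercise.

```python
def fit_simple_linear_regression(x, y):
    """Return (b_0, b_1) minimising the squared error of y = b_0 + b_1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    b_1 = ss_xy / ss_xx            # slope
    b_0 = mean_y - b_1 * mean_x    # intercept
    return b_0, b_1


x = [1, 2, 3, 4]   # hypothetical predictor values
y = [3, 5, 7, 9]   # hypothetical responses (exactly y = 1 + 2x)
b_0, b_1 = fit_simple_linear_regression(x, y)
print("b_0 =", b_0)   # b_0 = 1.0
print("b_1 =", b_1)   # b_1 = 2.0
```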
For simplification, we define:
Program:
import numpy as nmp
import matplotlib.pyplot as mtplt
def main():
Output:
Estimated coefficients are:
b_0 = -0.4606060606060609
b_1 = 1.1696969696969697
Multiple linear regression attempts to model the relationship between two or more
features and a response by fitting a linear equation to the observed data. It is
simply an extension of simple linear regression.
Imagine a set of data that has two or more features (or independent variables) as
well as one response (or dependent variable).
Program:
import matplotlib.pyplot as mtpplt
import numpy as nmp
from sklearn import datasets as DS
from sklearn import linear_model as LM
from sklearn import metrics as mts
# First, we will load the boston dataset
# Note: load_boston was removed in scikit-learn 1.2, so this call needs an older release
boston1 = DS.load_boston(return_X_y = False)
# Here, we will define the feature matrix(H) and response vector(f)
H = boston1.data
f = boston1.target
# Now, we will split the H and f datasets into training and testing sets
from sklearn.model_selection import train_test_split as tts
H_train, H_test, f_train, f_test = tts(H, f, test_size = 0.4,
random_state = 1)
# Here, we will create linear regression object
reg1 = LM.LinearRegression()
# Now, we will train the model by using the training sets
reg1.fit(H_train, f_train)
# here, we will print the regression coefficients
print('Regression Coefficients are: ', reg1.coef_)
# Here, we will print the variance score: 1 means perfect prediction
print('Variance score is: {}'.format(reg1.score(H_test, f_test)))
# here, we will set the plot style
mtpplt.style.use('fivethirtyeight')
# here we will plot the residual errors in training data
mtpplt.scatter(reg1.predict(H_train), reg1.predict(H_train) - f_train,
color = "green", s = 10, label = 'Train data')
# Here, we will plot the residual errors in test data
mtpplt.scatter(reg1.predict(H_test), reg1.predict(H_test) - f_test,
color = "blue", s = 10, label = 'Test data')
# Here, we will plot the line for zero residual error
mtpplt.hlines(y = 0, xmin = 0, xmax = 50, linewidth = 2)
# here, we will plot the legend
mtpplt.legend(loc = 'upper right')
# now, we will plot the title
mtpplt.title("Residual errors")
# here, we will define the method call for showing the plot
mtpplt.show()
Output:
Result:
Hence the program was successfully executed and output was verified.
EX.10: PERFORM A CHI-SQUARE TEST OF SAMPLE DATA
Aim:
Chi-Square Test:
The Chi-Square test is a statistical procedure for determining the difference between
observed and expected data. It can also be used to decide whether two categorical
variables in the data are related. It helps to determine whether an observed
association between two categorical variables is due to chance or to a real
relationship between them.
χ²_c = Σ (O_i - E_i)^2 / E_i
Where
c = Degrees of freedom
O = Observed Value
E = Expected Value
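The chi-square statistic, the sum over all categories of (O - E)^2 / E, can be sketched as below. The observed and expected counts are hypothetical, the function name is an assumption, and the degrees of freedom shown are for a goodness-of-fit test (number of categories minus one).

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [10, 20, 30]   # hypothetical observed counts
expected = [20, 20, 20]   # hypothetical expected counts

chi_square = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1

print("Chi-square =", chi_square)                   # Chi-square = 10.0
print("Degrees of freedom =", degrees_of_freedom)   # Degrees of freedom = 2
```

The statistic is then compared with the critical value of the chi-square distribution for the chosen significance level and degrees of freedom.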
Program:
Output:
Result:
Hence the program was successfully executed and output was verified.