
AD3411 - DATA SCIENCE AND ANALYTICS LABORATORY

Tools: Python, NumPy, SciPy, Matplotlib, Pandas, Statsmodels, Seaborn, Plotly, Bokeh; working with NumPy arrays

LIST OF EXPERIMENTS

1. Working with Pandas data frames
2. Basic plots using Matplotlib
3. Frequency distributions, averages, variability
4. Normal curves, correlation and scatter plots, correlation coefficient
5. Regression
6. Z-test
7. T-test
8. ANOVA
9. Building and validating linear models
10. Building and validating logistic models
11. Time series analysis


EXP 1: WORKING WITH PANDAS DATA FRAMES

Aim:
To work with Pandas data frames.

Algorithm
Step 1: Start

Step 2: Import Pandas

Step 3: Create a dictionary of data and load it into a DataFrame

Step 4: Display the first row using df.loc[0]

Step 5: Stop

Program:
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# load the data into a DataFrame object
df = pd.DataFrame(data)
print(df.loc[0])

Output:
calories 420
duration 50
Name: 0, dtype: int64
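
A few other common DataFrame operations can be explored in the same way; a minimal sketch (the selections below are illustrative additions, not part of the original listing):

import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

# print the whole frame and basic summary statistics
print(df)
print(df.describe())

# select a single column, and filter rows where calories exceed 385
print(df["duration"])
print(df[df["calories"] > 385])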
EXP 2: BASIC PLOTS USING MATPLOTLIB
Aim:
To draw basic plots using Matplotlib.
Algorithm
Step 1: Import the library

Step 2: Plot the points using Matplotlib

Step 3: Display the output

Step 4: Stop

Program:
import matplotlib.pyplot as plt

a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)
# 'o' is for circles and 'r' is for red
plt.plot(b, "or")
plt.plot(list(range(0, 22, 3)))
# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')
c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label='4th Rep')
# get the current axes
ax = plt.gca()
# get command over the individual boundary lines of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# set the range (bounds) of the left boundary line to a fixed range
ax.spines['left'].set_bounds(-3, 40)
# set the interval at which the x-axis places its marks
plt.xticks(list(range(-3, 10)))
# set the interval at which the y-axis places its marks
plt.yticks(list(range(-3, 20, 3)))
# the legend denotes what each colour signifies
ax.legend(['1st Rep', '2nd Rep', '3rd Rep', '4th Rep'])
# annotate writes text ON THE GRAPH; xy denotes the position on the graph
plt.annotate('Temperature V / s Days', xy=(1.01, -2.15))
# gives a title to the graph
plt.title('All Features Discussed')
plt.show()

Output:


EXP 3: FREQUENCY DISTRIBUTIONS, AVERAGES, VARIABILITY
Aim:
To compute frequency distributions, averages and variability.
Algorithm
Step 1: Start

Step 2: Import Pandas, Numpy And Nltk

Step 3: List the items, with 'F' for fruits and 'V' for vegetables

Step 4: Display the frequency of each item in the list

Step 5: Stop
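
A minimal sketch of a program matching the frequency-distribution steps above, assuming nltk is installed (the item list is illustrative):

import nltk

# 'F' stands for fruit, 'V' for vegetable (illustrative list)
items = ['F', 'V', 'F', 'F', 'V', 'F', 'V', 'V', 'F']
freq = nltk.FreqDist(items)
# display the frequency of each item in the list
for item, count in freq.items():
    print(item, '->', count)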

Program:
# Python program to get average of a list
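Only the comment line of this listing survives here; a minimal sketch of the averaging step (the list values are placeholders, so the output shown below corresponds to the original data, not this list):

import numpy as np

# Taking a list of elements (placeholder values)
numbers = [10, 20, 30, 40, 50, 60, 70]
# Calculating the average using mean()
print(np.mean(numbers))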
Output:
105.57142857142857
Algorithm
Step 1: Start

Step 2: Import NumPy

Step 3: Define a list

Step 4: Print the variance using np.var()

Step 5: Stop

# Python program to get the variance of a list

# Importing the NumPy module
import numpy as np
# Taking a list of elements
numbers = [2, 4, 4, 4, 5, 5, 7, 9]
# Calculating the variance using var()
print(np.var(numbers))
Output:
4.0
Algorithm
Step 1: Start

Step 2: Import NumPy

Step 3: Define a list

Step 4: Print the standard deviation using np.std()

Step 5: Stop
# Python program to get the standard deviation of a list
# Importing the NumPy module
import numpy as np
# Taking a list of elements
numbers = [290, 124, 127, 899]
# Calculating the standard deviation using std()
print(np.std(numbers))
Output:
318.35750344541907

EXP 4: NORMAL CURVES, CORRELATION AND SCATTER PLOTS,
CORRELATION COEFFICIENT
Aim:
To plot normal curves, correlation and scatter plots, and compute the correlation coefficient.

Algorithm

Step 1:Import Necessary Libraries

Step 2: Initialize Parameters

Step 3: Generate Random Data

Step 4:Create a Histogram

Step 5: Compute the Probability Density Function

Step 6: Plot the Normal Curve

Step 7:Add Labels and Title

Step 8:Display the Plot

Program:
# Normal curves
import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 0.5, 0.1
s = np.random.normal(mu, sigma, 1000)
# Create the bins and histogram (density=True normalises it; 'normed' is removed in newer Matplotlib)
count, bins, ignored = plt.hist(s, 20, density=True)
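
The remaining algorithm steps (computing the probability density function, overlaying the normal curve, and labelling the plot) are not in the listing above; a minimal sketch of those steps, continuing from mu, sigma and bins:

# Overlay the normal probability density function on the histogram
plt.plot(bins,
         1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)),
         linewidth=2, color='r')
plt.xlabel('Value')
plt.ylabel('Probability density')
plt.title('Normal curve')
plt.show()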

Output:
Algorithm

Step 1:Import Necessary Libraries

Step 2:Define two pandas Series

Step 3::Calculate the correlation coefficient

Step 4:Print the correlation coefficient

Step 5:Create a scatter plot

Step 6:Label the axes and add a title

Step 7:Show the plot

Program:

# Correlation and scatter plots
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

y = pd.Series([1, 2, 3, 4, 3, 5, 4])
x = pd.Series([1, 2, 3, 4, 5, 6, 7])
correlation = y.corr(x)
print(correlation)

Output:

0.8603090020146067
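
The algorithm for this part also calls for a scatter plot with labelled axes; a minimal sketch of those steps, reusing x and y from the listing above:

# Scatter plot of the two series
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot of x vs y')
plt.show()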
Algorithm

Step 1:Initialize Variables


Step 2:Iterate Through the Data
Step 3:Compute the Pearson Correlation Coefficient
Step 4:Return the Result

Program:

# Correlation coefficient
import math

# function that returns the correlation coefficient
def correlationCoefficient(X, Y, n):
    sum_X = 0
    sum_Y = 0
    sum_XY = 0
    squareSum_X = 0
    squareSum_Y = 0
    i = 0
    while i < n:
        # sum of elements of array X
        sum_X = sum_X + X[i]
        # sum of elements of array Y
        sum_Y = sum_Y + Y[i]
        # sum of X[i] * Y[i]
        sum_XY = sum_XY + X[i] * Y[i]
        # sum of squares of the array elements
        squareSum_X = squareSum_X + X[i] * X[i]
        squareSum_Y = squareSum_Y + Y[i] * Y[i]
        i = i + 1
    # use the formula for the correlation coefficient
    corr = (n * sum_XY - sum_X * sum_Y) / math.sqrt(
        (n * squareSum_X - sum_X * sum_X) * (n * squareSum_Y - sum_Y * sum_Y))
    return corr

# Driver code
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
# Find the size of the arrays
n = len(X)
# Function call to correlationCoefficient
print('{0:.6f}'.format(correlationCoefficient(X, Y, n)))

Output:
0.953463


EXP 5: REGRESSION

Aim:

To write a program to perform linear regression.

Algorithm

Step 1: Import Required Libraries

Step 2: Define Function to Estimate Regression Coefficients

Step 3: Define Function to Plot Regression Line

Step 4: Main Execution Function

Step 5: Execute the Program in the Main Block

Program:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show the plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting the regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output:

Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
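
The estimated slope and intercept can be cross-checked with NumPy's polyfit (an optional verification, not part of the original listing):

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
# polyfit returns the coefficients from highest degree down: slope first, then intercept
b_1, b_0 = np.polyfit(x, y, 1)
print(b_0, b_1)   # should agree with the estimated coefficients above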



EXP 6: Z-TEST

Aim:
To perform a Z-test.

Algorithm

Step 1:Import Necessary Libraries

Step 2:Set Parameters

Step 3:Generate or Collect Sample Data:

Step 4:Calculate Sample Statistics:

Step 5:Perform the One-Sample Z-Test

Step 6:Make a Decision

Program:
# imports
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

# Generate a random array of 50 numbers having mean 110 and sd 15,
# similar to the IQ scores data assumed above
mean_iq = 110
sd_iq = 15 / math.sqrt(50)
alpha = 0.05
null_mean = 100
data = sd_iq * randn(50) + mean_iq
# print mean and sd
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))
# Now we perform the test. We pass the data, the mean under the null
# hypothesis in the value parameter, and an alternative hypothesis that
# checks whether the mean is larger.
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')
# The function returns a z-score and the corresponding p-value. We compare
# the p-value with alpha: if it is smaller than alpha we reject the null
# hypothesis, otherwise we fail to reject it.
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")

Output:
Reject Null Hypothesis



EXP 7: T-TEST

Aim:
To perform a T-test.

Algorithm

Step 1:Import Necessary Libraries

Step 2:Define Sample Size

Step 3:Generate Data for Two Independent Groups

Step 4:Calculate Variance for Each Group

Step 5:Compute Pooled Standard Deviation (SD)

Step 6:Calculate t-Statistic (tval)

Step 7:Determine Degrees of Freedom (dof)

Step 8:Compute p-Value (pval)

Step 9:Output Results

Program:
# Importing the required libraries and packages
import numpy as np
from scipy import stats

# Defining two random distributions
# Sample Size
N = 10
# Gaussian distributed data with mean = 2 and var = 1
x = np.random.randn(N) + 2
# Gaussian distributed data with mean = 0 and var = 1
y = np.random.randn(N)

# Calculating the Standard Deviation
# Calculating the variance to get the standard deviation
var_x = x.var(ddof=1)
var_y = y.var(ddof=1)
# Pooled Standard Deviation
SD = np.sqrt((var_x + var_y) / 2)
print("Standard Deviation =", SD)

# Calculating the T-Statistic
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))
# Comparing with the critical T-Value
# Degrees of freedom
dof = 2 * N - 2
# p-value after comparison with the T-Statistic
pval = 1 - stats.t.cdf(tval, df=dof)
print("t = " + str(tval))
print("p = " + str(2 * pval))

# Cross-checking using the built-in function from the SciPy package
tval2, pval2 = stats.ttest_ind(x, y)
print("t = " + str(tval2))
print("p = " + str(pval2))

Output:
Standard Deviation = 0.7642398582227466 t =
4.87688162540348
p = 0.0001212767169695983
t = 4.876881625403479
p = 0.00012127671696957205



EXP 8: ANOVA

Aim:

To write a program to perform ANOVA.

Algorithm

Step 1:Install and Load Necessary Packages:

Step 2:Visualize Data with Boxplot:

Step 3:Set Up Hypotheses:

Step 4:Perform One-Way ANOVA:

Step 5:Interpret Results:

Step 6:Post-Hoc Analysis (if applicable):

Step 7:Check ANOVA Assumptions:

Program:
# Installing the package
install.packages("dplyr")
# Loading the package
library(dplyr)
# Variance in mean within group and between group
boxplot(mtcars$disp ~ factor(mtcars$gear),
        xlab = "gear", ylab = "disp")
# Step 1: Set up the null and alternate hypotheses
# H0: mu = mu01 = mu02 (there is no difference between the average
#     displacement for different gears)
# H1: not all means are equal
# Step 2: Calculate the test statistic using the aov function
mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear))
summary(mtcars_aov)
# Step 3: Calculate the F-critical value
# For a 0.05 significance level, alpha = 0.05
# Step 4: Compare the test statistic with the F-critical value and
#         conclude: if p < alpha, reject the null hypothesis
Output:
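The same one-way ANOVA can also be carried out in Python with SciPy, in keeping with the rest of this manual; a minimal sketch (the group values below are illustrative, not the mtcars data):

from scipy import stats

# Displacement values grouped by gear (illustrative numbers)
gear3 = [275.8, 318.0, 304.0, 350.0, 400.0]
gear4 = [160.0, 140.8, 167.6, 121.0, 146.7]
gear5 = [120.3, 95.1, 351.0, 145.0, 301.0]

f_stat, p_value = stats.f_oneway(gear3, gear4, gear5)
print("F =", f_stat, "p =", p_value)
# If p < 0.05, reject the null hypothesis that all group means are equal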

EXP 9: BUILDING AND VALIDATING LINEAR MODELS


Aim:
To build and validate linear models.

Algorithm

Step 1: Start

Step 2: Import NumPy, Pandas, Seaborn, Matplotlib and scikit-learn

Step 3: Calculate the linear regression using the appropriate functions

Step 4: Display the result

Step 5: Stop

Program
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston

sns.set(style="ticks", color_codes=True)
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 150

# loading the data
boston = load_boston()

You can check the keys with the following code.

print(boston.keys())

The output will be as follows:

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

print(boston.DESCR)

You will find these details in the output:


Attribute Information (in order):
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in $1000's
Missing Attribute Values: None
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df.head()
# print the columns present in the dataset
print(df.columns)
# print the top 5 rows in the dataset
print(df.head())

First five records from the data set

# plotting a heatmap for the overall data set
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')

Heat map of overall data set


So let's plot a regression plot to see the correlation between RM and MEDV.
df['MEDV'] = boston.target   # MEDV is the target variable, added as a column so it can be plotted
sns.lmplot(x='RM', y='MEDV', data=df)

Regression plot with RM and MEDV
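
The aim of this experiment also covers validating the model; a minimal sketch using scikit-learn's LinearRegression with a train/test split and an R^2 check (these validation steps are an assumption, not part of the original listing):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = df[['RM']]            # predictor chosen from the heatmap/regression plot above
y_target = df['MEDV']     # response variable
X_train, X_test, y_train, y_test = train_test_split(X, y_target, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))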


EXP 10: BUILDING AND VALIDATING LOGISTIC MODELS

Aim:
To build and validate logistic models.

Algorithm:

Step 1:Import Necessary Libraries

Step 2:Load the Dataset

Step 3:Define Independent and Dependent Variables

Step 4:Add a Constant Term

Step 5:Fit the Logistic Regression Model

Step 6:Review Model Summary

Program

Building the Logistic Regression model:


# importing libraries
import statsmodels.api as sm
import pandas as pd

# loading the training dataset
df = pd.read_csv('logit_train1.csv', index_col=0)
# defining the dependent and independent variables
Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df[['admitted']]
# building the model and fitting the data
log_reg = sm.Logit(ytrain, Xtrain).fit()

Output :
Optimization terminated successfully.
Current function value: 0.352707
Iterations 8
Program
# printing the summary table
print(log_reg.summary())

Output :
                           Logit Regression Results
==============================================================================
Dep. Variable:               admitted   No. Observations:                  30
Model:                          Logit   Df Residuals:                      27
Method:                           MLE   Df Model:                           2
Date:                Wed, 15 Jul 2020   Pseudo R-squ.:                 0.4912
Time:                        16:09:17   Log-Likelihood:               -10.581
converged:                       True   LL-Null:                      -20.794
Covariance Type:            nonrobust   LLR p-value:                3.668e-05
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
gmat               -0.0262      0.011     -2.383      0.017      -0.048      -0.005
gpa                 3.9422      1.964      2.007      0.045       0.092       7.792
work_experience     1.1983      0.482      2.487      0.013       0.254       2.143
===================================================================================

Algorithm:
Step 1:Import Necessary Libraries:
Step 2:Load the Testing Dataset:
Step 3:Define Independent and Dependent Variables:
Step 4:Add a Constant to the Independent Variables:
Step 5:Load the Pre-trained Logistic Regression Model:
Step 6:Make Predictions on the Test Dataset:
Step 7:Convert Probabilities to Binary Predictions:
Step 8:Compare Actual and Predicted Values:

Program

Predicting on New Data :

# loading the testing dataset
df = pd.read_csv('logit_test1.csv', index_col=0)
# defining the dependent and independent variables
Xtest = df[['gmat', 'gpa', 'work_experience']]
ytest = df['admitted']
# performing predictions on the test dataset
yhat = log_reg.predict(Xtest)
prediction = list(map(round, yhat))
# comparing original and predicted values of y
print('Actual values :', list(ytest.values))
print('Predictions :', prediction)

Output :
Optimization terminated successfully.
Current function value: 0.352707
Iterations 8
Actual values [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
Predictions : [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
Algorithm:

Step 1:Import Necessary Libraries

Step 2:Compute the Confusion Matrix

Step 3:Display the Confusion Matrix

Step 4:Calculate the Accuracy Score

Step 5:Display the Accuracy Score

Program:

Testing the accuracy of the model :

from sklearn.metrics import confusion_matrix, accuracy_score

# confusion matrix
cm = confusion_matrix(ytest, prediction)
print("Confusion Matrix : \n", cm)
# accuracy score of the model
print('Test accuracy = ', accuracy_score(ytest, prediction))

Output :
Confusion Matrix :
[[6 0]
[2 2]]
Test accuracy = 0.8



EXP 11: TIME SERIES ANALYSIS

Aim:
To perform time series analysis.

Algorithm:
Step 1: Start
Step 2: Import NumPy, Pandas, Matplotlib and Seaborn
Step 3: Draw the plot
Step 4: Display the plot
Step 5: Stop
Program

We are using Superstore sales data .


import warnings
import itertools
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

We start from time series analysis and forecasting for furniture sales.

df = pd.read_excel("Superstore.xls")
furniture = df.loc[df['Category'] == 'Furniture']

We have a good four years of furniture sales data.

furniture['Order Date'].min(), furniture['Order Date'].max()
Timestamp('2014-01-06 00:00:00'), Timestamp('2017-12-30 00:00:00')

Data Preprocessing
This step includes removing columns we do not need, checking for missing values, aggregating sales by date and so on.

cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Quantity', 'Discount', 'Profit']
furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')
furniture.isnull().sum()
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()
Figure 1

Order Date    0
Sales         0
dtype: int64
Indexing with Time Series Data
furniture = furniture.set_index('Order Date')
furniture.index

Figure 2
We will use the average daily sales value for each month instead, and we are using the start of each month as the timestamp.

y = furniture['Sales'].resample('MS').mean()

Have a quick peek at the 2017 furniture sales data.

y['2017':]

Figure 3

Visualizing Furniture Sales Time Series Data


y.plot(figsize=(15, 6))
plt.show()
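
A common next step in time series analysis is decomposing the series into trend, seasonal and residual components; a minimal sketch using statsmodels (this decomposition is an addition, not part of the original listing):

# Decompose the monthly furniture sales series (y) defined above
decomposition = sm.tsa.seasonal_decompose(y, model='additive')
decomposition.plot()
plt.show()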


