dsa-lab-manual (1)
SYLLABUS
Tools: Python, NumPy, SciPy, Matplotlib, Pandas, statsmodels, seaborn, plotly, bokeh
Suggested Exercises:
TOTAL : 60 PERIODS
HARDWARE:
Standalone Desktops with Windows OS
SOFTWARE:
Python with statistical Packages
NUMPY:
NumPy is a Python library used for working with arrays. It also has functions for working
in the domains of linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by Travis
Oliphant. It is an open-source project and you can use it freely. NumPy stands for Numerical
Python.
It is the fundamental package for scientific computing with Python. It contains various
features including these important ones:
AIM
Write a Python program to demonstrate basic array characteristics.
ALGORITHM
Step1: Start
Step2: Import numpy module
PROGRAM
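import numpy as np
# Creating array object
arr = np.array([[1, 2, 3],
                [4, 2, 5]])
# Printing the array characteristics listed in the OUTPUT section
print("Array is of type:", type(arr))
print("No. of dimensions:", arr.ndim)
print("Shape of array:", arr.shape)
print("Size of array:", arr.size)
print("Array stores elements of type:", arr.dtype)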
Downloaded by DURGALAKSHMI B
OUTPUT
Array is of type: <class 'numpy.ndarray'>
No. of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int32
RESULT
Thus the python program working with NumPy array has been implemented and executed
successfully.
SLICING:
Similar to Python lists, NumPy arrays can be sliced. Since arrays may be
multidimensional, we must specify a slice for each dimension of the array.
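As a minimal sketch of what "a slice for each dimension" means, the snippet below slices rows and columns of a small 2-D array separately. The array values match the ones used in the slicing programs; the variable names are only illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])

# One slice per dimension: rows first, then columns.
top_left = a[:2, :2]     # first two rows, first two columns
second_row = a[1, :]     # a single index drops that dimension
last_column = a[:, -1]   # every row, last column only

print(top_left)
print(second_row)
print(last_column)
```

A plain index (as in `a[1, :]`) returns a 1-D array, whereas a slice (as in `a[1:2, :]`) would keep the result two-dimensional.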
AIM
Write a Python Program to Perform Array Slicing.
ALGORITHM
Step1: Start
PROGRAM
import numpy as np
a = np.array([[1,2,3],[3,4,5],[4,5,6]])
print(a)
print("After slicing")
print(a[1:])
OUTPUT
[[1 2 3]
[3 4 5]
[4 5 6]]
After slicing
[[3 4 5]
[4 5 6]]
RESULT
Thus the python program to perform array slicing has been implemented and
executed successfully.
AIM
Write a Python Program to Perform Array Slicing.
ALGORITHM
Step1: Start
Step2: import numpy module
Step3: Create an array and apply the slicing operator
Step4: Print the output
Step5: Stop
PROGRAM
# array to begin with
import numpy as np
a = np.array([[1,2,3],[3,4,5],[4,5,6]])
print('Our array is:')
print(a)
# this returns array of items in the second column
print('The items in the second column are:')
print(a[...,1])
print('\n')
OUTPUT:
Our array is:
[[1 2 3]
[3 4 5]
[4 5 6]]
[3 4 5]
The items column 1 onwards are:
[[2 3]
[4 5]
[5 6]]
RESULT
Thus the python program to perform array slicing has been implemented and executed
successfully.
# ... list
df = pd.DataFrame(lst)
print(df)
OUTPUT
0
0 A
1 B
2 C
3 D
4 E
5 F
6 G
RESULT
Thus the Python program for creating a dataframe using a list of elements has been implemented
and executed successfully.
EX NO: 2B CREATE A DATAFRAME USING THE DICTIONARY
DATAFRAME:
To create a DataFrame from a dict of ndarrays/lists, all the ndarrays must be of the same length. If an index is
passed, then the length of the index should be equal to the length of the arrays. If no index is passed, then by
default the index will be range(n), where n is the array length.
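The rules above can be sketched as follows. The names and ages are the ones shown in the OUTPUT of this exercise; the index labels (`rank1` to `rank4`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

data = {"Name": ["Tom", "nick", "krish", "jack"], "Age": [20, 21, 19, 18]}

# No index passed: pandas falls back to range(n), i.e. 0, 1, 2, 3.
df_default = pd.DataFrame(data)

# Explicit index: its length must equal the length of the lists.
df_labelled = pd.DataFrame(data, index=["rank1", "rank2", "rank3", "rank4"])

# Lists of unequal length are rejected with a ValueError.
try:
    pd.DataFrame({"Name": ["Tom", "nick"], "Age": [20]})
    unequal_ok = True
except ValueError:
    unequal_ok = False

print(df_default)
print(df_labelled)
print("Unequal lengths accepted?", unequal_ok)
```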
AIM
Write a program to create a dataframe using dictionary of elements.
ALGORITHM
Step1: Start
Step5: Stop
PROGRAM
import pandas as pd
OUTPUT:
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
RESULT
Thus the Python program to create a dataframe using a dictionary has been implemented
and executed successfully.
EX NO: 2C COLUMN SELECTION
Column Selection
A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and
columns. We can perform basic operations on rows/columns like selecting, deleting, adding, and
renaming.
Column Selection: In order to select a column in a Pandas DataFrame, we can access the
columns by calling them by their column names.
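A minimal sketch of selecting columns by name, using the same employee data as the program in this exercise. Which columns to pick (`Name` and `Qualification`) is an assumption made for illustration.

```python
import pandas as pd

data = {
    "Name": ["Jai", "Princi", "Gaurav", "Anuj"],
    "Age": [27, 24, 22, 32],
    "Address": ["Delhi", "Kanpur", "Allahabad", "Kannauj"],
    "Qualification": ["Msc", "MA", "MCA", "Phd"],
}
df = pd.DataFrame(data)

# A single column name returns a Series.
names = df["Name"]

# A list of column names returns a DataFrame with only those columns.
subset = df[["Name", "Qualification"]]

print(names)
print(subset)
```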
AIM
Write a program to select a column from dataframe.
ALGORITHM
Step1: Start
Step2: import pandas module
Step3: Create a dataframe using the dictionary
PROGRAM
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32], 'Address':['Delhi',
'Kanpur', 'Allahabad', 'Kannauj'], 'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
print(df)
RESULT
Thus the Python program for column selection has been implemented and executed
successfully.
EX NO: 2D CHECKING FOR MISSING VALUES USING ISNULL() AND NOTNULL()
In order to check for missing values in a Pandas DataFrame, we use the functions isnull() and
notnull(). Both functions help in checking whether a value is NaN or not. These functions can also be
used on a Pandas Series in order to find null values in a series.
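As a short sketch of how the two functions relate, the snippet below builds the score table used in this exercise and shows that `isnull()` and `notnull()` return complementary boolean masks. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, 45, 56, np.nan],
    "Third Score": [np.nan, 40, 80, 98],
})

missing_mask = scores.isnull()    # True where a value is NaN
present_mask = scores.notnull()   # True where a value is present

# Summing a boolean mask counts the True entries per column.
missing_per_column = missing_mask.sum()

print(missing_mask)
print(missing_per_column)
```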
AIM
Write a program to check the missing values from the dataframe.
ALGORITHM
Step1: Start
Step2: import pandas module
Step3: Create a dataframe using the dictionary
Step4: Check the missing values using isnull() function
Step5: print the output
Step6: Stop
PROGRAM
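# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(data)
# using isnull() to flag the missing values
print(df.isnull())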
RESULT
Thus the Python program for checking missing values using isnull() and notnull() has
been implemented and executed successfully.
EX NO: 2E DROPPING MISSING VALUES USING DROPNA()
In order to drop null values from a dataframe, we use the dropna() function. This function
drops rows/columns of datasets with null values in different ways.
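The "different ways" can be sketched with the `how` and `axis` parameters of `dropna()`. The data is the same score table used in the program of this exercise; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
    "Third Score": [52, 40, 80, 98],
    "Fourth Score": [np.nan, np.nan, np.nan, 65],
})

rows_complete = df.dropna()             # drop rows containing any NaN
rows_not_empty = df.dropna(how="all")   # drop rows only if every value is NaN
cols_complete = df.dropna(axis=1)       # drop columns containing any NaN

print(rows_complete)
print(rows_not_empty)
print(cols_complete)
```

Note that `dropna()` returns a new DataFrame; the original `df` is left unchanged unless the result is assigned back.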
AIM
Write a program to drop rows with at least one Nan value (Null value)
ALGORITHM
Step1: Start
Step2: import pandas module
Step3: Create a dataframe using the dictionary
Step4: Drop the null values using dropna() function
Step5: print the output
Step6: Stop
PROGRAM
Drop rows with at least one NaN value (Null value)
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score': [52, 40, 80, 98],
        'Fourth Score': [np.nan, np.nan, np.nan, 65]}
# creating a dataframe from dictionary
df = pd.DataFrame(data)
# using dropna() function
df.dropna()
OUTPUT:
RESULT
Thus the python program for Drop missing values has been implemented and executed successfully.
EX NO: 3A BASIC PLOTS USING MATPLOTLIB
MATPLOTLIB:
Matplotlib is a Python library for visualizing and analyzing data. Graphical, pictorial
visualizations created with it help in better understanding the data. Matplotlib is a
comprehensive library for static, animated and interactive visualizations.
AIM
Write a python program to create a simple plot using plot() function.
ALGORITHM
Step1: Define the x-axis and corresponding y-axis values as lists.
Step2: Plot them on canvas using .plot() function.
Step3: Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.
Step4: Give a title to your plot using .title() function.
Step5: Finally, to view your plot, we use .show() function.
Step6: Stop
PROGRAM
import matplotlib.pyplot as plt
# x axis values
x = [1,2,3]
# corresponding y axis values
y = [2,4,1]
# plotting the points
plt.plot(x, y)
# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first graph!')
# function to show the plot
plt.show()
OUTPUT:
RESULT
Thus the python program for basic Matplotlib has been implemented and executed successfully.
EX NO: 3B COMPUTE THE X AND Y COORDINATES AND CREATE A PLOT
AIM
Write a python program to create a plot by computing the x and y coordinates.
ALGORITHM
Step1: Compute the x and y coordinates for points on a sine curve
Step2: Plot the points using matplotlib
Step3: Display the output
Step4: Stop
PROGRAM
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 3*np.pi, 0.1)
y = np.sin(x)
plt.plot(x, y)
plt.show()
OUTPUT
RESULT
Thus the Python program to compute X and Y coordinates has been implemented
and executed successfully.
EX NO: 3C DRAWING MULTIPLE LINES USING PLOT FUNCTION
AIM
Write a python program to draw multiple lines using plot() function.
ALGORITHM
Step1: Compute the x and y coordinates for points on a sine and cosine curve
Step2: Plot the points using matplotlib
Step3: Display the output
Step4: Stop
PROGRAM
import numpy as np
import matplotlib.pyplot as plt
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)
# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()
OUTPUT
RESULT
Thus the Python program to draw multiple lines using the plot function has been implemented
and executed successfully.
Ex No: 3D BASIC PLOT USING MATPLOTLIB
AIM
Write a python program for basic plot using matplotlib
ALGORITHM
Step1: import the library
Step2: Plot the points using matplotlib
Step3: Display the output
PROGRAM
Line plot :
plt.plot(x,y)
plt.show()
Bar plot :
plt.bar(x,y)
plt.show()
Histogram :
Scatter Plot :
RESULT
Thus the Python program for basic plot using Matplotlib has been implemented
and executed successfully.
EX NO:4A CONDITIONAL FREQUENCY DISTRIBUTION
Conditional Frequency:
In the previous topic, you studied frequency distributions. The FreqDist function
computes the frequency of each item in a list. While computing a frequency distribution, you observe
the occurrence count of an event.
A conditional frequency distribution is a collection of frequency distributions, each computed for a
different condition. For computing a conditional frequency, you have to attach a condition to every
occurrence of an event. Let's consider the following list for computing Conditional Frequency.
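To make the idea concrete without depending on NLTK, the sketch below builds a conditional frequency table with only the standard library. It is an illustration of the concept, not the NLTK implementation; the (condition, event) pairs are the ones used in the program of this exercise, where 'F' marks fruits and 'V' marks vegetables.

```python
from collections import Counter, defaultdict

# Each occurrence of an event carries its condition: (condition, event).
pairs = [("F", "apple"), ("F", "apple"), ("F", "kiwi"),
         ("V", "cabbage"), ("V", "cabbage"), ("V", "potato")]

# One frequency distribution (a Counter) per condition.
cfd = defaultdict(Counter)
for condition, event in pairs:
    cfd[condition][event] += 1

for condition in sorted(cfd):
    print(condition, dict(cfd[condition]))
```

Indexing by a condition, for example `cfd["V"]`, gives the ordinary frequency distribution for that condition only.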
AIM
To write a python program to show the conditional Frequency distribution
ALGORITHM
Step 1: Start
Step 2: Import Pandas, Numpy And Nltk
Step 3: List The Items As 'F' For Fruits And 'V' For Vegetables
Step 4: Display The Frequency Of Each Items In The List
Step 5: Stop
PROGRAM:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
items = ['apple', 'apple', 'kiwi', 'cabbage', 'cabbage', 'potato']
nltk.FreqDist(items)
c_items = [('F','apple'), ('F','apple'), ('F','kiwi'), ('V','cabbage'), ('V','cabbage'), ('V','potato')]
cfd = nltk.ConditionalFreqDist(c_items)
cfd.conditions()
cfd.plot()
cfd['V']
OUTPUT
RESULT
Thus the Python program for conditional frequency distribution has been implemented
and executed successfully.
EX NO: 4B FREQUENCY OF WORDS, OF A PARTICULAR GENRE, IN
BROWN CORPUS.
AIM
To write a Python program to determine the frequency of words of a particular genre in the
Brown corpus.
ALGORITHM
Step 1: Start
Step 2: Import All Necessary Libraries
Step 3: Display The Frequency Of Each Items In The List
Step 4: Setting Cumulative Argument Value To True.
Step 5: Stop
PROGRAM
import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist([(genre, word) for genre in brown.categories()
                                for word in brown.words(categories=genre)])
cfd
cfd.conditions()
cfd.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership', 'worship', 'hardship'])
cfd.plot(conditions=['government', 'humor', 'reviews'], samples=['leadership', 'worship', 'hardship'])
cfd.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership',
'worship', 'hardship'], cumulative=True)
news_fd = cfd['news']
news_fd.most_common(3)
news_fd['the']
OUTPUT
           leadership worship hardship
government         12       3        2
humor               1       0        0
reviews            14       1        2
RESULT
Thus the python program frequency of words, of a particular genre, in brown corpus has
been implemented and executed successfully.
EX NO: 4C FREQUENCY OF LAST CHARACTER APPEARING IN ALL
NAMES ASSOCIATED WITH MALES AND FEMALES RESPECTIVELY
AND COMPARES THEM
AIM
To write a Python program to find the frequency of the last character appearing in all names associated
with males and females respectively and compare them.
ALGORITHM
Step 1: Start
Step 2: Import All Necessary Libraries
Step 3: Display The Frequency Of Each Items In The List
Step 4: Plot
Step 5: Stop
PROGRAM
import nltk
from nltk.corpus import names
nt = [(fid.split('.')[0], name[-1]) for fid in names.fileids() for name in names.words(fid)]
cfd2 = nltk.ConditionalFreqDist(nt)
cfd2['female']['a']
cfd2['male']['a']
cfd2['female'] > cfd2['male']
cfd2.tabulate(samples=['a', 'e'])
cfd2.plot()
OUTPUT
          a    e
female 1773 1432
male     29  468
RESULT
Thus the Python program to find the frequency of the last character appearing in all names
associated with males and females respectively and compare them has been implemented and
executed successfully.
EX NO: 4D AVERAGE OF LIST USING LOOP
AIM
To write a Python program for finding the average of a list using a loop.
ALGORITHM
Step 1: Start
Step 2: Define A Function Cal_Average
Step 3: Sum_Num = Sum_Num + T
Step 4: Avg = Sum_Num / Len(Num)
Step 5: Stop
PROGRAM:
def cal_average(num):
    sum_num = 0
    for t in num:
        sum_num = sum_num + t
    avg = sum_num / len(num)
    return avg
print("The average is", cal_average([18,25,3,41,5]))
OUTPUT:
The average is 18.4
RESULT
Thus the Python program for finding the average of a list using a loop has been implemented and
executed successfully.
EX NO: 4E AVERAGE OF LIST USING BUILT IN FUNCTIONS
AIM
To write a python program to find the average of list using built in functions.
ALGORITHM
STEP 1: Start
STEP 2: define a list
STEP 3: avg = sum(number_list)/len(number_list)
STEP 4: print avg
STEP 5: Stop
PROGRAM
number_list = [45, 34, 10, 36, 12, 6, 80]
avg = sum(number_list)/len(number_list)
print("The average is ", round(avg,2))
OUTPUT:
The average is 31.86
RESULT
Thus the Python program for finding the average of a list using built-in functions has
been implemented and executed successfully.
Ex No: 4F AVERAGE OF LIST USING MEAN FUNCTION
AIM
To write a python program to find the average of list using mean function.
ALGORITHM
Step 1: Start
Step 2: Define A List
Step 3: Import Mean From Statistics
Step 4: Avg = Mean(Number_List)
Step 5: Print avg
Step 6: Stop
PROGRAM
from statistics import mean
number_list = [45, 34, 10, 36, 12, 6, 80]
avg = mean(number_list)
print("The average is ", round(avg,2))
OUTPUT:
The average is 31.86
RESULT
Thus the Python program for finding the average of a list using the mean function has been implemented and
executed successfully.
EX NO: 4G AVERAGE OF LIST USING NUMPY LIBRARY
AIM
To write a python program to find the average of list using numpy library.
ALGORITHM
Step 1: Start
Step 2: Import Mean From Numpy
Step 3: Define A List
Step 4: Avg = Mean(Number_List)
Step 5: Print avg
Step 6: Stop
PROGRAM
from numpy import mean
number_list = [45, 34, 10, 36, 12, 6, 80]
avg = mean(number_list)
print ("The average is ", round(avg,2))
OUTPUT:
The average is 31.86
RESULT
Thus the Python program for finding the average of a list using the numpy library has been implemented and
executed successfully.
EX NO: 4H VARIANCE OF SAMPLE SET
AIM
To write a python program to show variance of sample set.
ALGORITHM
Step 1: Start
Step 2: Import Statistics
Step 3: Define A List
Step 4: Print Statistics.Variance(Sample)
Step 5: Stop
PROGRAM
import statistics
sample = [2.74, 1.23, 2.63, 2.22, 3, 1.98]
print("Variance of sample set is %s" % statistics.variance(sample))
OUTPUT :
Variance of sample set is 0.40924
RESULT
Thus the python program to show variance of sample set has been implemented and executed
successfully.
EX NO: 4I VARIANCE ON A RANGE OF DATA-TYPES
AIM
To write a python program to show variance on a range of data-types.
ALGORITHM
Step 1: Start
Step 2: Import All Necessary Libraries
Step 3: Define Samples
Step 4: Print Variance Of Sample
Step 5: Stop
PROGRAM
from statistics import variance
from fractions import Fraction as fr
sample1 = (1, 2, 5, 4, 8, 9, 12)
sample2 = (-2, -4, -3, -1, -5, -6)
sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)
sample4 = (fr(1, 2), fr(2, 3), fr(3, 4),fr(5, 6), fr(7, 8))
sample5 = (1.23, 1.45, 2.1, 2.2, 1.9)
print("Variance of Sample1 is ",variance(sample1))
print("Variance of Sample2 is ",variance(sample2))
print("Variance of Sample3 is ",variance(sample3))
print("Variance of Sample4 is ", variance(sample4))
print("Variance of Sample5 is ",variance(sample5))
OUTPUT
Variance of Sample1 is 15.80952380952381
Variance of Sample2 is 3.5
Variance of Sample3 is 61.125
Variance of Sample4 is 1/45
Variance of Sample5 is 0.17613000000000006
RESULT
Thus the python program to show variance on a range of data-types has been implemented
and executed successfully.
EX NO: 4J STATISTICS
AIM
To write a python program to show statistics.
ALGORITHM
Step 1: Start
Step 2: Import Statistics
Step 3: Define A List
Step 4: M=Statistics.Mean(Sample)
Step 5: Stop
PROGRAM
import statistics
sample = (1, 1.3, 1.2, 1.9, 2.5, 2.2)
m = statistics.mean(sample)
print("Variance of Sample set is ",statistics.variance(sample, xbar = m))
OUTPUT
Variance of Sample set is 0.3656666666666667
RESULT
Thus the python program to show statistics has been implemented and executed successfully.
EX NO: 5A CREATE NORMAL CURVE
AIM
ALGORITHM
STEP 1: Start
STEP 2: import all necessary packages
STEP 3: create distribution
STEP 4: visualize the distribution
STEP 5: Stop
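The steps above can be sketched as follows. This is a minimal illustration rather than the manual's own program, which is not reproduced in the source: the mean `mu = 0`, the standard deviation `sigma = 1`, the plotting range of four standard deviations, and all variable names are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Assumed parameters of the normal distribution
mu, sigma = 0.0, 1.0

# Evenly spaced x values covering four standard deviations on each side
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 401)

# Normal probability density function written out explicitly
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Rough numerical check: the area under the curve should be close to 1
area = pdf.sum() * (x[1] - x[0])

plt.plot(x, pdf)
plt.title("Normal curve")
plt.xlabel("x")
plt.ylabel("Probability density")
plt.show()
```

Changing `mu` shifts the curve left or right, while a larger `sigma` makes it wider and flatter.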
PROGRAM
RESULT
Thus the python program to create a normal curve has been implemented and executed
successfully.
EX NO: 5B CORRELATION AND SCATTER PLOTS
CORRELATION:
Correlation means an association. It is a measure of the extent to which two variables are related.
AIM:
To write a Python program for correlation and scatter plots.
ALGORITHM:
Step 1: Importing the libraries.
Step 2: Finding the Correlation between two variables.
Step 3: Plotting the graph. Here we are using scatter plots. A scatter plot is a diagram where each
value in the data set is represented by a dot. Also, it shows a relationship between two variables.
PROGRAM:
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
y = pd.Series([1, 2, 3, 4, 3, 5, 4])
x = pd.Series([1, 2, 3, 4, 5, 6, 7])
correlation = y.corr(x)
print(correlation)
plt.scatter(x, y)
# This will fit the best line into the graph
RESULT:
Thus the Python program for correlation and scatter plots has been implemented and executed
successfully.
SCATTER PLOT:
A scatter plot is a graph of two sets of data along the two axes. It is used to visualize
the relationship between the two variables.
In Python matplotlib, a scatter plot can be created using pyplot.plot() or
pyplot.scatter(). Using these functions, you can add more features to your scatter plot, like changing
the size, color or shape of the points.
AIM:
PROGRAM:
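import numpy as np
import matplotlib.pyplot as plt
x = range(50)
y = range(50) + np.random.randint(0,30,50)
# set the figure size before plotting so that it applies to this figure
plt.rcParams.update({'figure.figsize':(10,8), 'figure.dpi':100})
plt.scatter(x, y)
plt.title('Simple Scatter plot')
plt.xlabel('X - value')
plt.ylabel('Y - value')
plt.show()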
OUTPUT:
RESULT
Thus the Python program for a simple scatter plot has been implemented and executed
successfully.
ii) SIMPLE SCATTER PLOT WITH COLORED POINTS
AIM:
To write a Python program for a simple scatter plot with colored points.
ALGORITHM:
Step 1: Importing the libraries.
Step 2: Finding the Correlation between two variables.
Step 3: Plotting the graph. Here we are using scatter plots. A scatter plot is a diagram where each
value in the data set is represented by a dot. Also, it shows a relationship between two variables.
PROGRAM:
x = range(50)
y = range(50) + np.random.randint(0,30,50)
plt.rcParams.update({'figure.figsize':(10,8), 'figure.dpi':100})
plt.scatter(x, y, c=y, cmap='Spectral')
plt.colorbar()
plt.title('Simple Scatter plot')
plt.xlabel('X - value')
plt.ylabel('Y - value')
plt.show()
OUTPUT:
RESULT:
Thus the Python program for a simple scatter plot with colored points has been implemented
and executed successfully.
EX NO: 5C CORRELATION COEFFICIENT
Variables within a dataset can be related for lots of reasons.
For example:
One variable could cause or depend on the values of another variable.
One variable could be lightly associated with another variable.
Two variables could depend on a third unknown variable.
It can be useful in data analysis and modelling to better understand the relationships
between variables. The statistical relationship between two variables is referred to as their
correlation.
A correlation could be positive, meaning both variables move in the same direction, or
negative, meaning that when one variable’s value increases, the other variable’s value decreases.
Correlation can also be neutral or zero, meaning that the variables are unrelated.
Positive Correlation: both variables change in the same direction.
Neutral Correlation: no relationship in the change of the variables.
Negative Correlation: variables change in opposite directions.
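The three cases can be illustrated with small contrived arrays. The data below is hypothetical and chosen so that the coefficients come out at the extremes; in particular, `y_neutral` is a symmetric sequence picked only because its linear correlation with `x` is zero.

```python
import numpy as np

x = np.arange(10)

y_positive = 2 * x + 1     # increases as x increases
y_negative = -3 * x + 5    # decreases as x increases
y_neutral = np.array([3, 1, 4, 1, 5, 5, 1, 4, 1, 3])  # no linear trend in x

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient of x with y.
r_positive = np.corrcoef(x, y_positive)[0, 1]
r_negative = np.corrcoef(x, y_negative)[0, 1]
r_neutral = np.corrcoef(x, y_neutral)[0, 1]

print(r_positive, r_negative, r_neutral)
```

Real data rarely reaches exactly +1, -1 or 0; values in between indicate weaker or stronger linear association.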
AIM:
To write a program to calculate the correlation coefficient.
ALGORITHM:
STEP 1: Import the numpy packages.
STEP 2: Define two NumPy arrays. Call them x and y
STEP3: Call np.corrcoef() with both arrays as arguments
STEP 4: corrcoef() returns the correlation matrix, which is a two-dimensional array with
the correlation coefficients.
PROGRAM:
import numpy as np
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
r = np.corrcoef(x, y)
print(r)
OUTPUT:
RESULT:
Thus the Python program to calculate the correlation coefficient has been implemented
and executed successfully.
PEARSON’S CORRELATION
The Pearson correlation coefficient can be used to summarize the strength of the linear
relationship between two data samples. The Pearson’s correlation coefficient is calculated as the
covariance of the two variables divided by the product of the standard deviation of each data sample.
It is the normalization of the covariance between the two variables to give an interpretable score.
Pearson's correlation coefficient = covariance(X, Y) / (stdv(X) * stdv(Y))
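As a worked check of the formula, the sketch below computes the coefficient directly from the covariance and the standard deviations and compares it with `np.corrcoef`. The data is the pair of series from the correlation and scatter plot exercise; the variable names are illustrative. Sample statistics (`ddof=1`) are used for both the covariance and the standard deviations, which is what keeps the ratio consistent.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([1, 2, 3, 4, 3, 5, 4], dtype=float)

# covariance(X, Y): off-diagonal entry of the 2x2 covariance matrix
cov_xy = np.cov(x, y, ddof=1)[0, 1]

# covariance(X, Y) / (stdv(X) * stdv(Y))
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Reference value from NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Both values agree at roughly 0.86, indicating a strong positive linear relationship for this data.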
AIM:
To write a program to calculate the Pearson correlation coefficient between two variables.
ALGORITHM:
Step 1: Import The Needed Packages.
Step 2: Provide The Data.
Step 3: The pearsonr() Scipy Function Can Be Used To Calculate The Pearson’s Correlation
Coefficient Between Two Data Samples With The Same Length.
Step 4: Display The Correlation Coefficient.
PROGRAM:
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
seed(1)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
corr,_ = pearsonr(data1, data2)
print('Pearsons correlation:', corr)
OUTPUT:
Pearsons correlation: 0.887611908579531
RESULT:
Thus the Python program to calculate the Pearson correlation coefficient between
two variables has been implemented and executed successfully.
6.REGRESSION
AIM:
To write a program for simple linear regression with scikit-learn.
ALGORITHM:
Step 1: Import The Packages And Classes.
Step 2: Provide Data To Work With And Eventually Do Appropriate Transformations.
Step 3: Create A Regression Model And Fit It With Existing Data.
Step 4: Check The Results Of Model Fitting To Know Whether The Model Is Satisfactory.
Step 5: Apply The Model For Predictions.
PROGRAM:
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)
r_sq = model.score(x, y)
print('coefficient of determination:', r_sq)
y_pred = model.predict(x)
print('predicted response:', y_pred)
OUTPUT:
coefficient of determination: 0.715875613747954
predicted response: [ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333
35.33333333]
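The result above can be cross-checked without scikit-learn. The sketch below is an alternative route, not the program of this exercise: it fits the same straight line with `np.polyfit` and computes the coefficient of determination from the residual and total sums of squares. The variable names are illustrative.

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

# Least-squares fit of a degree-1 polynomial: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = intercept + slope * x

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_sq = 1 - ss_res / ss_tot

print('slope:', slope)
print('intercept:', intercept)
print('coefficient of determination:', r_sq)
```

The slope comes out at about 0.54 and the intercept at about 5.63, which reproduces the predicted responses listed in the OUTPUT.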
RESULT:
Thus the Python program for simple linear regression with scikit-learn has been implemented
and executed successfully.
EX NO: 6B MULTIPLE LINEAR REGRESSION WITH SCIKIT-LEARN
AIM
To write a program for multiple linear regression with scikit-learn.
ALGORITHM:
Step 1: Import Packages And Classes
Step 2: Provide Data
Step 3: Create A Model And Fit It
Step 4: Get Results
Step 5: Predict Response
PROGRAM:
import numpy as np
from sklearn.linear_model import LinearRegression
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]
x, y = np.array(x), np.array(y)
model = LinearRegression().fit(x, y)
r_sq = model.score(x, y)
print('coefficient of determination:', r_sq)
print('intercept:', model.intercept_)
print('slope:', model.coef_)
y_pred = model.predict(x)
print('predicted response:', y_pred)
OUTPUT:
coefficient of determination: 0.8615939258756775
intercept: 5.52257927519819
slope: [0.44706965 0.25502548]
predicted response: [ 5.77760476 8.012953 12.73867497 17.9744479 23.97529728 29.4660957
38.78227633 41.27265006]
RESULT:
Thus the Python program for multiple linear regression with scikit-learn has been
implemented and executed successfully.
AIM
To Perform Z-test
ALGORITHM
Step1: Start
Step4: Stop
PROGRAM
# imports
import math
import numpy as np
from numpy.random import randn
mean_iq = 110
sd_iq = 15/math.sqrt(50)
alpha = 0.05
null_mean = 100
data = sd_iq*randn(50)+mean_iq
# now we perform the test. In this function, we pass the data; in the value parameter
# we pass the mean value under the null hypothesis; in the alternative hypothesis we check whether the
# mean is larger
# the function outputs a p-value and the z-score corresponding to that value. We compare the
# p-value with alpha: if it is greater than alpha, then we do not reject the null hypothesis
if p_value < alpha:
else:
OUTPUT
mean=110.17 stdv=2.34
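The test described in the program comments can be sketched by hand with NumPy and the standard library. This is an illustration under stated assumptions, not the manual's own call: the seed, the variable names and the decision strings are hypothetical, while the sample parameters (mean 110, standard deviation 15/sqrt(50), 50 observations), the null mean of 100 and alpha of 0.05 come from the program above.

```python
import math
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(0)   # assumed seed, only for repeatability
null_mean, alpha = 100, 0.05
sample = rng.normal(loc=110, scale=15 / math.sqrt(50), size=50)

# z = (sample mean - hypothesised mean) / standard error of the mean
std_error = sample.std(ddof=1) / math.sqrt(len(sample))
z_score = float((sample.mean() - null_mean) / std_error)

# One-sided p-value: the alternative hypothesis says the mean is larger
p_value = 1 - NormalDist().cdf(z_score)

if p_value < alpha:
    decision = "Reject the null hypothesis"
else:
    decision = "Fail to reject the null hypothesis"

print(z_score, p_value, decision)
```

Because the sample mean lies many standard errors above 100, the p-value is effectively zero and the null hypothesis is rejected.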
RESULT
Thus the program for Z-Test case studies has been executed and verified successfully.
AIM
ALGORITHM
Step1: Start
Step4: Stop
PROGRAM
OUTPUT:
t = 4.87688162540348
p = 0.0001212767169695983
t = 4.876881625403479
p = 0.00012127671696957205
RESULT
Thus the program for T-test case studies has been executed and verified successfully.
AIM
ALGORITHM
Step1: Start
Step2: Import scipy
Step3: import statsmodels
Step4: calculate ANOVA F and p value
Step 5: Stop
PROGRAM
import scipy.stats as stats
# stats f_oneway function takes the groups as input and returns ANOVA F and p value
fvalue, pvalue = stats.f_oneway(df['A'], df['B'], df['C'], df['D'])
print(fvalue, pvalue)
# 17.492810457516338 2.639241146210922e-05
# get ANOVA table as R like output
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Ordinary Least Squares (OLS) model
model = ols('value ~ C(treatments)', data=df_melt).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table
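To show what the ANOVA F value measures, the sketch below computes it by hand with NumPy: the variance between group means divided by the variance within groups. The three groups are hypothetical data chosen for easy arithmetic, not the dataset used in the program above.

```python
import numpy as np

# Hypothetical measurements for three treatment groups
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_value = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f_value)
```

For this data the between-group sum of squares is 26 and the within-group sum of squares is 6, giving F = 13. A large F suggests that at least one group mean differs from the others; the p value is then read from the F distribution with (k - 1, n - k) degrees of freedom.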
OUTPUT
# sum_sq df F PR(>F)
# ANOVA table using bioinfokit v1.0.3 or later (it uses wrapper script for anova_lm)
res = stat()
RESULT
Thus the program for ANOVA case studies has been executed and verified successfully.
AIM :
ALGORITHM
Step1: Start
Step 5: Stop
PROGRAM
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import scale
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
housing.head()
# number of observations
len(housing.index)
# filter only area and price
df = housing.loc[:, ['area', 'price']]
df.head()
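# rescaling the variables (both)
df_columns = df.columns
scaler = MinMaxScaler()
df = scaler.fit_transform(df)
# fit_transform returns a NumPy array, so rebuild the DataFrame and restore the column names
df = pd.DataFrame(df, columns=df_columns)
df.head()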
# visualise area-price relationship
sns.regplot(x="area", y="price", data=df, fit_reg=False)
# split into train and test
df_train, df_test = train_test_split(df,
train_size = 0.7,
test_size = 0.3,
random_state = 10)
print(len(df_train))
print(len(df_test))
# split into X and y for both train and test sets
# reshaping is required since sklearn requires the data to be in shape
# (n, 1), not as a series of shape (n, )
X_train = df_train['area']
X_train = X_train.values.reshape(-1, 1)
y_train = df_train['price']
X_test = df_test['area']
X_test = X_test.values.reshape(-1, 1)
y_test = df_test['price']
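The reshaping step mentioned in the comments can be seen in isolation in the short sketch below. The four area values are taken from the first rows of the OUTPUT table and stand in for the full column; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A few 'area' values standing in for the full column
area = pd.Series([7420, 8960, 9960, 7500], name="area")

flat = area.values             # 1-D array of shape (4,)
column = flat.reshape(-1, 1)   # 2-D array of shape (4, 1); -1 lets NumPy infer the row count

print(flat.shape, column.shape)
```

scikit-learn estimators expect the feature matrix to be two-dimensional (samples by features), which is why a single feature is reshaped into one column, while the target can stay one-dimensional.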
OUTPUT:
0 13300000 7420 4 2 3 yes no no no yes 2 yes furnished
1 12250000 8960 4 4 4 yes no no no yes 3 no furnished
2 12250000 9960 3 2 2 yes no yes no no 2 yes semi-furnished
3 12215000 7500 4 2 2 yes no yes no yes 3 yes furnished
4 11410000 7420 4 1 2 yes yes yes
545
area price
0 0.396564 1.000000
1 0.502405 0.909091
2 0.571134 0.909091
3 0.402062 0.906061
4 0.396564 0.836364
<matplotlib.axes._subplots.AxesSubplot at 0x7fe94d630160>
381
164
RESULT
Thus the program for Linear Regression has been executed and verified successfully.
AIM :
ALGORITHM
Step1: Start
Step 5: Stop
PROGRAM
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import sklearn
from pandas import Series, DataFrame
from pylab import rcParams
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report
'exec(%matplotlib inline)'
rcParams['figure.figsize'] = 10, 8
sb.set_style('whitegrid')
url = 'https://raw.githubusercontent.com/BigDataGal/Python-for-Data-Science/master/titanic-train.csv'
titanic = pd.read_csv(url)
titanic.columns = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']
titanic.head()
sb.countplot(x='Survived',data=titanic, palette='hls')
titanic.isnull().sum()
titanic.info()
OUTPUT
   PassengerId  Survived  Pclass  Name                     Sex     Age   SibSp  Parch  Ticket            Fare    Cabin  Embarked
0  1            0                 Braund, Mr. Owen Harris  male    22.0  1      0      A/5 21171         7.2500  NaN    S
2  3            1                 Heikkinen, Miss. Laina   female  26.0  0      0      STON/O2. 3101282  7.9250  NaN    S
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
RESULT
Thus the program for Logistic Regression has been executed and verified successfully.
AIM :
ALGORITHM
Step1: Start
Step 5: Stop
PROGRAM
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# Import as Dataframe
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
parse_dates=['date'])
df.head()
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/MarketArrivals.csv')
df = df.loc[df.market=='MUMBAI', :]
df.head()
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
parse_dates=['date'], index_col='date')
# Draw Plot
def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()
plot_df(df, x=df.index, y=df.value, title='Monthly anti-diabetic drug sales in Australia from 1992 to 2008.')
OUTPUT
ser = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
parse_dates=['date'], index_col='date')
ser.head()
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/MarketArrivals.csv')
df = df.loc[df.market=='MUMBAI', :]
df.head()
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
parse_dates=['date'], index_col='date')
# Draw Plot
def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()
plot_df(df, x=df.index, y=df.value, title='Monthly anti-diabetic drug sales in Australia from 1992 to 2008.')
# Import data
df = pd.read_csv('datasets/AirPassengers.csv', parse_dates=['date'])
x = df['date'].values
y1 = df['value'].values
# Plot
fig, ax = plt.subplots(1, 1, figsize=(16,5), dpi= 120)
plt.fill_between(x, y1=y1, y2=-y1, alpha=0.5, linewidth=2, color='seagreen')
plt.ylim(-800, 800)
plt.title('Air Passengers (Two Side View)', fontsize=16)
plt.hlines(y=0, xmin=np.min(df.date), xmax=np.max(df.date), linewidth=.5)
plt.show()
# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
parse_dates=['date'], index_col='date')
df.reset_index(inplace=True)
# Prepare data
df['year'] = [d.year for d in df.date]
df['month'] = [d.strftime('%b') for d in df.date]
years = df['year'].unique()
# Prep Colors
np.random.seed(100)
mycolors = np.random.choice(list(mpl.colors.XKCD_COLORS.keys()), len(years), replace=False)
# Draw Plot
plt.figure(figsize=(16,12), dpi= 80)
for i, y in enumerate(years):
    if i > 0:
        plt.plot('month', 'value', data=df.loc[df.year==y, :], color=mycolors[i], label=y)
        plt.text(df.loc[df.year==y, :].shape[0]-.9, df.loc[df.year==y, 'value'][-1:].values[0], y, fontsize=12,
                 color=mycolors[i])
# Decoration
plt.gca().set(xlim=(-0.3, 11), ylim=(2, 30), ylabel='$Drug Sales$', xlabel='$Month$')
plt.yticks(fontsize=12, alpha=.7)
plt.title("Seasonal Plot of Drug Sales Time Series", fontsize=20)
plt.show()
# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
parse_dates=['date'], index_col='date')
df.reset_index(inplace=True)
# Prepare data
df['year'] = [d.year for d in df.date]
df['month'] = [d.strftime('%b') for d in df.date]
years = df['year'].unique()
# Draw Plot
fig, axes = plt.subplots(1, 2, figsize=(20,7), dpi= 80)
sns.boxplot(x='year', y='value', data=df, ax=axes[0])
sns.boxplot(x='month', y='value', data=df.loc[~df.year.isin([1991, 2008]), :], ax=axes[1])
# Set Title
axes[0].set_title('Year-wise Box Plot\n(The Trend)', fontsize=18)
axes[1].set_title('Month-wise Box Plot\n(The Seasonality)', fontsize=18)
plt.show()
RESULT
Thus the program for Time series analysis has been executed and verified successfully.