CS3361 Data Science Lab Manual
AIM:
To download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and
Pandas packages.
INSTALLATION OF PACKAGES:
PRE-REQUISITES:
Operating System : Windows 7 Professional (Service Pack 1)
Software : Python 3.8.7
NUMPY:
Numpy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the fundamental
package for scientific computing with Python. Besides its obvious scientific uses, Numpy can
also be used as an efficient multi-dimensional container of generic data.
Features:
High-performance N-dimensional array object.
It contains tools for integrating code from C/C++ and FORTRAN.
It contains a multidimensional container for generic data.
Additional linear algebra, Fourier transforms, and random number capabilities.
It provides broadcasting functions.
It has data type definition capability to work with varied databases.
Sample Program:
import numpy as np
a=np.array([1,2,3])
print(a)
Output:
[1 2 3]
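A few of the features listed above can be tried in a short program. The following is a minimal sketch, not part of the original manual; the array values are made up for illustration.

```python
import numpy as np

# A 2-D array with an explicit data type
m = np.array([[1, 2], [3, 4]], dtype=np.float64)

# Broadcasting: the 1-D row is stretched across both rows of m
row = np.array([10, 20])
print(m + row)

# Linear algebra: inverse of m, and a check that m @ inv is the identity
inv = np.linalg.inv(m)
print(np.allclose(m @ inv, np.eye(2)))

# Random numbers from a seeded generator (reproducible between runs)
rng = np.random.default_rng(0)
print(rng.random(3).shape)
```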
SCIPY:
SciPy is a Python library that is useful for solving many mathematical equations and
algorithms. It is built on top of the NumPy library and extends it with routines for
scientific computations such as matrix rank, inverse, polynomial equations and LU
decomposition. Using its high-level functions significantly reduces the complexity of the
code and helps in analyzing the data better. Used from an interactive Python session, SciPy
serves as a data-processing environment that competes with systems such as MATLAB, Octave
and R-Lab. It has many user-friendly, efficient and easy-to-use functions that help to solve
problems like numerical integration, interpolation, optimization, linear algebra and statistics.
Sample Program:
from scipy import constants
print(constants.pi)
Output:
3.141592653589793
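The tasks named in the description above (numerical integration, LU decomposition, inverse and rank) can be sketched as follows. This is an illustrative example rather than part of the original manual; the matrix `A` and the integrand are arbitrary choices.

```python
import numpy as np
from scipy import integrate, linalg

# Numerical integration of x**2 from 0 to 1 (the exact value is 1/3)
area, abs_error = integrate.quad(lambda x: x ** 2, 0, 1)
print(area)

# LU decomposition: A = P @ L @ U
A = np.array([[4.0, 3.0], [6.0, 3.0]])
P, L, U = linalg.lu(A)
print(np.allclose(A, P @ L @ U))

# Inverse and rank of the same matrix
print(linalg.inv(A))
print(np.linalg.matrix_rank(A))
```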
JUPYTER:
The IPython Notebook concept was expanded upon to allow for additional programming
languages and was therefore renamed "Jupyter". "Jupyter" is a loose acronym meaning Julia,
Python and R, but today, the notebook technology supports many programming languages. An
IDE normally consists of at least a source code editor, build automation tools and a
debugger. Jupyter Notebook is an IDE for Python that allows its users to create documents
containing both rich text and code. It also supports the programming languages Julia and R.
Jupyter Notebook allows users to compile all aspects of a data project in one place,
making it easier to show the entire process of a project to the intended audience. Through the
web-based application, users can create data visualizations and other components of a project and
share them with others via the platform.
To open jupyter-lab:
Open command prompt and type jupyter-lab.
Then after initializing all the necessary packages, it will open as follows:
Click on new notebook, then the new file will be opened with .ipynb file extension.
Then type python code and execute the code using Shift+Enter.
STATSMODELS:
statsmodels is a Python module that provides classes and functions for the estimation of
many different statistical models, as well as for conducting statistical tests, and statistical data
exploration. An extensive list of result statistics is available for each estimator. The results are
tested against existing statistical packages to ensure that they are correct. statsmodels supports
specifying models using R-style formulas and pandas DataFrames.
statsmodels is a Python package that provides a complement to scipy for statistical
computations including descriptive statistics and estimation and inference for statistical models.
Sample Program:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
df = pd.read_csv(r"C:\Users\UGCS\Desktop\headbrain11.csv")
print(df.head())
# fitting the model
df.columns = ['Head_size', 'Brain_weight']
model = smf.ols(formula='Head_size ~ Brain_weight', data=df).fit()
# model summary
print(model.summary())
Output:
PANDAS:
Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring, and manipulating data. Pandas allows us to analyze big data and make
conclusions based on statistical theories. Pandas can clean messy data sets, and make them
readable and relevant. Relevant data is very important in data science.
Sample Program:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Output:
EX.NO:2 WORKING WITH NUMPY ARRAYS
AIM:
To write a python code to work with numpy arrays.
ALGORITHM:
PROGRAM:
#Create a 0-D array:
import numpy as np
arr = np.array(42)
print(arr)
OUTPUT:
42
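The program that produced the next output is missing from this copy. A minimal sketch, assuming it created a 1-D array from a list (the values are taken from the output that follows):

```python
# Create a 1-D array (reconstructed; not in the original listing)
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(arr)
```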
OUTPUT:
[1 2 3 4 5]
#Create a 2-D array:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
OUTPUT:
[[1 2 3]
 [4 5 6]]
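The program for the next output is also missing from this copy. A minimal sketch, assuming it created a 3-D array from two identical 2-D blocks (the values are taken from the output that follows):

```python
# Create a 3-D array (reconstructed; not in the original listing)
import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
```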
OUTPUT:
[[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]]
#Check how many dimensions the arrays have:
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
OUTPUT:
0
1
2
3
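The program for the next output is missing from this copy. A minimal sketch that prints the same line, assuming a 2-D array like the one used in the slicing example later in this exercise:

```python
# Access an element of a 2-D array (reconstructed; not in the original listing)
import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
# Row index 0 is the 1st row; column index 1 is the 2nd element
print('2nd element on 1st row:', arr[0, 1])
```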
OUTPUT:
2nd element on 1st row: 2
#Array Slicing:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[4:])
OUTPUT:
[5 6 7]
#Slicing 2-D Arrays:
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[1, 1:4])
OUTPUT:
[7 8 9]
OUTPUT:
<U6
OUTPUT:
1
2
3
OUTPUT:
[1 2 3]
[4 5 6]
OUTPUT:
[1 2 3 4 5 6]
OUTPUT:
[array([1, 2]), array([3, 4]), array([5, 6])]
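The five outputs above (from `<U6` to the list of three arrays) appear without their programs in this copy. The sketch below shows calls that produce them; the array contents are assumptions chosen to match the outputs, not the original listings.

```python
import numpy as np

# Data type of a string array: <U6 means Unicode strings of up to 6 characters
fruits = np.array(['apple', 'banana', 'cherry'])
print(fruits.dtype)

# Iterating over a 1-D array yields its elements
for x in np.array([1, 2, 3]):
    print(x)

# Iterating over a 2-D array yields its rows
for row in np.array([[1, 2, 3], [4, 5, 6]]):
    print(row)

# Joining two arrays end to end
joined = np.concatenate((np.array([1, 2, 3]), np.array([4, 5, 6])))
print(joined)

# Splitting an array into three equal parts
parts = np.array_split(joined, 3)
print(parts)
```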
#Searching Arrays:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
OUTPUT:
(array([3, 5, 6], dtype=int32),)
#Sorting Arrays:
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
OUTPUT:
[0 1 2 3]
#Filtering Arrays:
import numpy as np
arr = np.array([41, 42, 43, 44])
x = [True, False, True, False]
newarr = arr[x]
print(newarr)
OUTPUT:
[41 43]
RESULT:
Thus the python code to work with numpy arrays has been implemented and executed
successfully.
EX.NO:3 WORKING WITH PANDAS DATA FRAMES
AIM:
To write a python program to work with pandas data frames.
ALGORITHM:
PROGRAM:
#Creating a dataframe using List:
import pandas as pd
lst = ['Anna', 'University', 'Chennai', 'Sri', 'Ramakrishna', 'College', 'of', 'Engineering']
df = pd.DataFrame(lst)
print(df)
OUTPUT:
OUTPUT:
#Column Selection:
import pandas as pd
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data)
print(df[['Name', 'Qualification']])
OUTPUT:
OUTPUT:
# Viewing the Data
import pandas as pd
df = pd.read_csv(r"C:\Users\UGCS\Desktop\data.csv")
print(df.head(10))
print(df.tail(5))
OUTPUT:
#Replacing Null values:
import pandas as pd
df = pd.read_csv(r"C:\Users\UGCS\Desktop\data.csv")
df.fillna(130, inplace = True)
print(df)
OUTPUT:
#Checking for missing values using isnull() and notnull() :
import pandas as pd
df = pd.read_csv(r"C:\Users\UGCS\Desktop\data.csv")
bool_series = pd.isnull(df["Pulse"])
print(df[bool_series])
bool_series = pd.notnull(df["Pulse"])
print(df[bool_series])
OUTPUT:
RESULT:
Thus the python program to work with pandas data frames has been implemented and
executed successfully.
EX.NO:4 READING DATA FROM TEXT FILES, EXCEL AND THE WEB AND
EXPLORING VARIOUS COMMANDS
AIM:
To read data from text files, Excel and the web, and to explore various commands for
doing descriptive analytics on the Iris data set.
PRE-REQUISITES:
pip install xlrd
pip install openpyxl
pip install requests
pip install beautifulsoup4
ALGORITHM:
PROGRAM:
#Reading data from text file:
# Program to show various ways to read and write data in a file.
file1 = open("myfile.txt","w")
L = ["This is Python \n","This is datascience \n","This is jupyter \n"]
# \n is placed to indicate EOL (End of Line)
file1.write("Hello \n")
file1.writelines(L)
file1.close() #to change file access modes
file1 = open("myfile.txt","r+")
print("Output of Read function is ")
print(file1.read())
print()
# seek(n) takes the file handle to the nth byte from the beginning.
file1.seek(0)
print( "Output of Readline function is ")
print(file1.readline())
print()
file1.seek(0)
# To show difference between read and readline
print("Output of Read(9) function is ")
print(file1.read(9))
print()
file1.seek(0)
print("Output of Readline(9) function is ")
print(file1.readline(9))
file1.seek(0)
# readlines function
print("Output of Readlines function is ")
print(file1.readlines())
print()
file1.close()
OUTPUT:
OUTPUT:
OUTPUT:
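#Count plot of the Species column (df is the Iris DataFrame)
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='Species', data=df)
plt.show()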
OUTPUT:
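The descriptive-analytics commands named in the aim do not appear in this copy. The sketch below shows a few common pandas commands; it is an illustration, not the original program. The five rows are typed inline so that the block runs on its own, and the column names (including `Species`, which matches the `countplot` call above) are assumptions about the layout of the Iris CSV file.

```python
import io
import pandas as pd

# A handful of rows in the Iris layout, typed inline for illustration only
csv_text = """SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                      # number of rows and columns
print(df.head())                     # first rows
print(df.describe())                 # count, mean, std, min, quartiles, max
print(df.isnull().sum())             # missing values per column
print(df['Species'].value_counts())  # rows per species
```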
RESULT:
Thus the python code to read the data from text files, Excel and the web and to explore
various commands for doing descriptive analytics on the Iris data set has been implemented
and executed successfully.
EX.NO:5A UNIVARIATE ANALYSIS USING DIABETES DATASET
AIM:
To perform Univariate analysis such as Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis on the diabetes dataset.
ALGORITHM:
1. Install pandas.
2. To find the frequency of a single variable on a dataset, use the value_counts() function.
3. To find the mean of a single variable on a dataset, use the mean() function.
4. To find the median of a single variable on a dataset, use the median() function.
5. To find the mode of a single variable on a dataset, use the mode() function.
6. To find the variance of a single variable on a dataset, import the statistics module (part of
the Python standard library, so no installation is needed) and use the statistics.variance() function.
7. To find the standard deviation of a single variable on a dataset, use the std() function.
8. To find the skewness of a single variable on a dataset, install and import scipy and use the
scipy.stats.skew() function.
9. To find the kurtosis of a single variable on a dataset, install and import scipy and use the
scipy.stats.kurtosis() function.
PROGRAM:
#Reading dataset
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
df.info()
df.describe()
#Finding Frequency
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
#create frequency table for 'Glucose' variable
f1=df['Glucose'].value_counts()
print('frequency table for Glucose variable\n',f1)
#Finding Mean
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
m1=df['Pregnancies'].mean()
print('Mean of Pregnancies',m1)
m2=df['Glucose'].mean()
print('Mean of Glucose',m2)
m3=df['BloodPressure'].mean()
print('Mean of BloodPressure',m3)
m4=df['SkinThickness'].mean()
print('Mean of SkinThickness',m4)
m5=df['Insulin'].mean()
print('Mean of Insulin',m5)
m6=df['BMI'].mean()
print('Mean of BMI',m6)
m7=df['DiabetesPedigreeFunction'].mean()
print('Mean of DiabetesPedigreeFunction',m7)
m8=df['Age'].mean()
print('Mean of Age',m8)
#Finding Median
import pandas as pd
df = pd.read_csv("diabetes.csv")
m1=df['Pregnancies'].median()
print('median of Pregnancies',m1)
m2=df['Glucose'].median()
print('median of Glucose',m2)
m3=df['BloodPressure'].median()
print('median of BloodPressure',m3)
m4=df['SkinThickness'].median()
print('median of SkinThickness',m4)
m5=df['Insulin'].median()
print('median of Insulin',m5)
m6=df['BMI'].median()
print('median of BMI',m6)
m7=df['DiabetesPedigreeFunction'].median()
print('median of DiabetesPedigreeFunction',m7)
m8=df['Age'].median()
print('median of Age',m8)
#Finding Mode
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
m1=df['Pregnancies'].mode()
print('mode of Pregnancies',m1)
m2=df['Glucose'].mode()
print('mode of Glucose',m2)
m3=df['BloodPressure'].mode()
print('mode of BloodPressure',m3)
m4=df['SkinThickness'].mode()
print('mode of SkinThickness',m4)
m5=df['Insulin'].mode()
print('mode of Insulin',m5)
m6=df['BMI'].mode()
print('mode of BMI',m6)
m7=df['DiabetesPedigreeFunction'].mode()
print('mode of DiabetesPedigreeFunction',m7)
m8=df['Age'].mode()
print('mode of Age',m8)
#Finding Variance
import pandas as pd
import statistics
#create DataFrame
df = pd.read_csv("diabetes.csv")
print("Variance of Glucose set is % s"%(statistics.variance(df.Glucose)))
print("Variance of Pregnancies set is % s"%(statistics.variance(df.Pregnancies)))
print("Variance of Age set is % s"%(statistics.variance(df.Age)))
#Finding Kurtosis
import scipy.stats
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
k1=scipy.stats.kurtosis(df.Age, axis=0, bias=True)
print('the kurtosis of Age is',k1)
k2=scipy.stats.kurtosis(df.Glucose, axis=0, bias=True)
print('the kurtosis of Glucose is',k2)
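Steps 7 and 8 of the algorithm (standard deviation and skewness) have no matching code in the program above. A minimal sketch follows; the values in `ages` are made up so that the block runs without diabetes.csv. With the real data, `df['Age']` would take the place of `ages`.

```python
import pandas as pd
from scipy import stats

# Hypothetical sample values, used only to make the example self-contained
ages = pd.Series([21, 25, 25, 30, 33, 41, 50, 62])

# Sample standard deviation (pandas divides by n - 1 by default)
sd = ages.std()
print('standard deviation of Age', sd)

# Skewness: positive when the right tail is longer
sk = stats.skew(ages, axis=0, bias=True)
print('skewness of Age', sk)
```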
RESULT:
Thus the Univariate analysis such as Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis on the diabetes dataset has been performed
successfully.
EX.NO:5B BIVARIATE ANALYSIS USING DIABETES DATASET
AIM:
To perform Bivariate analysis such as Linear and logistic regression modeling on the
diabetes dataset.
ALGORITHM:
1. Linear regression uses the relationship between the data-points to draw the straight line
that best fits them.
2. This line can be used to predict future values.
3. Import scipy and draw the line of Linear Regression
4. Define response and explanatory variable.
5. Add constant to predictor variables.
6. Create the model using, sm.OLS(y, x).fit().
7. View the model using summary().
8. To construct the correlation matrix, use corr().
9. To model the logistic regression, Install scikit-learn of version 0.24.2.
10. Read and explore the data.
11. Split the Dataset as Train and Test dataset
12. Train the model using, LogisticRegression()
13. Visualize the performance of logistic regression model.
PROGRAM:
#creating scatterplots
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("diabetes.csv")
plt.scatter(df.BMI, df.Age)
plt.title('BMI vs. Age')
plt.xlabel('BMI')
plt.ylabel('Age')
plt.show()
#simple linear regression
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
df = pd.read_csv("diabetes.csv")
#define response variable
y = df['Insulin']
#define explanatory variable
x = df[['BloodPressure']]
#add constant to predictor variables
x = sm.add_constant(x)
#fit linear regression model
model = sm.OLS(y, x).fit()
#view model summary
print(model.summary())
#creating histogram
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
df = pd.read_csv("diabetes.csv")
sns.histplot(df.Age,kde=True)
plt.show()
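Step 8 of the algorithm (the correlation matrix) is not shown in the program. A minimal sketch with `corr()` follows; the numbers are hypothetical values standing in for three columns of diabetes.csv so that the block runs on its own.

```python
import pandas as pd

# Hypothetical values standing in for three columns of diabetes.csv
df = pd.DataFrame({
    'Glucose': [85, 89, 116, 137, 148, 183],
    'BMI': [26.6, 28.1, 25.6, 43.1, 33.6, 23.3],
    'Age': [31, 21, 30, 33, 50, 32],
})

# Pearson correlation between every pair of numeric columns
corr_matrix = df.corr()
print(corr_matrix)
```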
#Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
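The listing stops after the imports, so steps 10 to 13 have no code in this copy. The following is a sketch of those steps using scikit-learn, under stated assumptions: the data are synthetic (generated with a seeded random generator) rather than read from diabetes.csv, and the column names `Glucose` and `Outcome` are borrowed from the dataset only for readability.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: higher Glucose makes Outcome = 1 more likely
rng = np.random.default_rng(0)
glucose = rng.normal(120, 30, size=200)
prob = 1 / (1 + np.exp(-(glucose - 120) / 15))
outcome = (rng.random(200) < prob).astype(int)
df = pd.DataFrame({'Glucose': glucose, 'Outcome': outcome})

# Split the dataset into train and test sets
X = df[['Glucose']]
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train the model
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Evaluate on the held-out data
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print('accuracy', acc)
print(confusion_matrix(y_test, y_pred))
```

The confusion matrix gives the counts of correct and incorrect predictions per class, which is one simple way to inspect the performance mentioned in step 13.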
RESULT:
Thus the Bivariate analysis such as Linear and logistic regression modeling on the
diabetes dataset has been performed and analyzed successfully.
EX.NO:5C MULTIPLE REGRESSION ANALYSIS USING DIABETES DATASET
AIM:
To perform multiple regression analysis using diabetes dataset.
ALGORITHM:
1. Multiple regression is like linear regression, but with more than one independent variable,
meaning that we try to predict a value based on two or more variables.
2. Import pandas, numpy and matplotlib packages.
3. Install and import sklearn(scikit-learn) package.
4. Import linear_model from scikit-learn.
5. Plot the graph using scatter()
6. Generate training and testing data from the dataset.
7. Model the dataset using, regr.fit()
8. Analyze the coefficients and intercepts.
PROGRAM:
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
np.random.seed(19680801)
data=pd.read_csv("diabetes.csv")
data.head(210)
data = data[["Glucose","Age","Pregnancies"]]
fig=plt.figure()
ax=fig.add_subplot(111,projection='3d')
n=100
ax.scatter(data["Glucose"],data["Age"],data["Pregnancies"],color="red")
ax.set_xlabel("Glucose")
ax.set_ylabel("Age")
ax.set_zlabel("Pregnancies")
plt.show()
#Generating training and testing data from our data:
train = data[:(int((len(data)*0.8)))]
test = data[(int((len(data)*0.8))):]
# Modeling:Using sklearn package to model data :
regr = linear_model.LinearRegression()
train_x = np.array(train[["Glucose"]])
train_y = np.array(train[["Age"]])
regr.fit(train_x,train_y)
ax.scatter(data["Glucose"],data["Age"],data["Pregnancies"],color="red")
plt.plot(train_x, regr.coef_*train_x + regr.intercept_, '-r')
ax.set_xlabel("Glucose")
ax.set_ylabel("Age")
ax.set_zlabel("Pregnancies")
print ("coefficients : ",regr.coef_)
#Slope
print ("Intercept : ",regr.intercept_)
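The program above fits Age on Glucose alone, which is simple linear regression. A sketch of a fit with two predictors, as described in step 1, is given below. The data are synthetic, and the coefficients 0.1 and 2.0 used to generate them are arbitrary assumptions chosen so that the fitted values can be checked.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in for three columns of diabetes.csv
rng = np.random.default_rng(19680801)
n = 200
data = pd.DataFrame({
    'Glucose': rng.normal(120, 30, n),
    'Pregnancies': rng.integers(0, 10, n),
})
# Age is built from both predictors plus noise
noise = rng.normal(0, 1, n)
data['Age'] = 15 + 0.1 * data['Glucose'] + 2.0 * data['Pregnancies'] + noise

# First 80% of the rows for training, the rest for testing
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]

# Two independent variables, one dependent variable
predictors = ['Glucose', 'Pregnancies']
regr = linear_model.LinearRegression()
regr.fit(train[predictors], train['Age'])

print('coefficients : ', regr.coef_)    # one slope per predictor
print('Intercept : ', regr.intercept_)
print('R^2 on test data : ', regr.score(test[predictors], test['Age']))
```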
RESULT:
Thus the multiple regression analysis using diabetes dataset has been implemented and
executed successfully.
EX.NO:6 EXPLORING VARIOUS PLOTTING FUNCTIONS USING ANY
DATASET
AIM:
To apply and explore various plotting functions such as Normal curves, Density and
contour plots, Correlation and scatter plots, Histograms and Three dimensional plotting on UCI
data sets.
ALGORITHM:
PROGRAM:
#NORMAL CURVES:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
mu=df['Pregnancies'].mean()
std=df['Pregnancies'].std()
snd = stats.norm(mu, std)
# Generate 1000 evenly spaced values between -100 and 100
x = np.linspace(-100, 100, 1000)
plt.figure(figsize=(7.5,7.5))
plt.plot(x, snd.pdf(x))
plt.xlim(-60, 60)
plt.title('Normal Distribution', fontsize='15')
plt.xlabel('Values of Random Variable X', fontsize='15')
plt.ylabel('Probability Density', fontsize='15')
plt.show()
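Only the normal curve is listed in this copy. The other plot types named in the aim (histogram, scatter plot with correlation, density contours) can be sketched with matplotlib as below. This is an illustration under stated assumptions: the two variables are synthetic, and the contour panel uses a 2-D histogram as a rough density estimate.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: two correlated variables drawn from a seeded generator
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 500)
y = 0.6 * x + rng.normal(0, 0.8, 500)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Histogram of one variable
axes[0].hist(x, bins=20)
axes[0].set_title('Histogram')

# Scatter plot with the correlation coefficient in the title
r = np.corrcoef(x, y)[0, 1]
axes[1].scatter(x, y, s=5)
axes[1].set_title('Scatter plot, r = %.2f' % r)

# Contour plot of a 2-D histogram as a rough density estimate
counts, xedges, yedges = np.histogram2d(x, y, bins=20)
xc = (xedges[:-1] + xedges[1:]) / 2   # bin centres along x
yc = (yedges[:-1] + yedges[1:]) / 2   # bin centres along y
axes[2].contourf(xc, yc, counts.T)    # transpose so rows follow y
axes[2].set_title('Density contours')

plt.show()
```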
RESULT:
Thus the various plotting functions such as Normal curves, Density and contour plots,
Correlation and scatter plots, Histograms and three dimensional plotting have been explored
successfully on UCI data sets.
EX.NO:7 VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
AIM:
To implement visualization of geographic data with basemap.
PRE-REQUISITES:
Install folium.
ALGORITHM:
PROGRAM:
Installation of Folium: