CS3362 Data Science Laboratory Alok Kumar
Problem Statement:
Download, Install and Explore the features of Python Libraries.
Aim:
To download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels
and Pandas packages.
Description:
Python Libraries
● One reason Python is popular among developers is its very large collection of
libraries. This exercise covers five of them: NumPy, SciPy, Jupyter, Statsmodels and
Pandas.
● A module is a file containing Python code, and a package is a directory of
sub-packages and modules. A Python library is a reusable collection of code that you can
include in your programs or projects.
For example:
import numpy as np
x = np.array([[1,2,3], [4,5,6], [7,8,9]]) # 3x3 matrix
print(x.ndim) # Prints 2
print(x.shape) # Prints (3, 3)
print(x.size) # Prints 9
● Linear Algebra with SciPy: A common problem in linear algebra is finding the eigenvalues
and eigenvectors of a matrix, which can be computed with the eig() function.
For example:
from scipy import linalg
import numpy as np
arr = np.array([[5,4],[6,3]])
eg_val, eg_vect = linalg.eig(arr)
print(eg_val)
print(eg_vect)
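A quick way to check an eigen-decomposition is to confirm that A v = λ v holds for every returned pair. The sketch below does this for the same matrix. It uses numpy.linalg.eig, NumPy's counterpart of the SciPy function, so that it runs with NumPy alone; this is a swapped-in call, not the one from the listing above.

```python
import numpy as np

arr = np.array([[5, 4], [6, 3]])
eg_val, eg_vect = np.linalg.eig(arr)

# Column i of eg_vect is the eigenvector that belongs to eg_val[i].
for i in range(len(eg_val)):
    lhs = arr @ eg_vect[:, i]          # A v
    rhs = eg_val[i] * eg_vect[:, i]    # lambda v
    print(np.allclose(lhs, rhs))       # True when the pair satisfies A v = lambda v

# The characteristic polynomial is x**2 - 8x - 9, so the eigenvalues are -1 and 9.
print(np.sort(eg_val.real))
```

The eigenvectors are returned as columns, which is why the loop indexes eg_vect[:, i] rather than eg_vect[i].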
For example:
#Series Example
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
#DataFrame Example
import pandas as pd
data = { "calories" : [420, 380, 390], "duration" : [50, 40, 45] }
myvar = pd.DataFrame(data)
print(myvar)
● The IPython (Interactive Python) Notebook is now known as the Jupyter Notebook. It is
an interactive computational environment, in which you can combine code execution,
mathematics, plots etc.
● The Jupyter Notebook App is a server-client application that allows editing and running
notebook documents via a web browser. It can run on a local desktop with no internet
access, or be installed on a remote server and accessed through the internet.
● Steps to Download and Install Jupyter on Windows 10:
Step 1: Press the Windows key, type Command Prompt, and click Run as administrator.
Step 2: Type the command pip install notebook and press Enter to start the Jupyter
Notebook installation.
Step 3: The Jupyter Notebook package is downloaded and installed automatically. To start
the notebook, type the command jupyter notebook and press Enter.
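After installation, it can help to confirm which packages Python can actually find. The following is a minimal sketch using only the standard library; the list of import names is an assumption based on the packages named in the Aim, and the helper is_installed is a hypothetical name.

```python
import importlib.util

# Import names assumed from the Aim; "notebook" is what `pip install notebook` provides.
PACKAGES = ["numpy", "scipy", "pandas", "statsmodels", "notebook"]

def is_installed(name):
    """Return True if a top-level package with this import name can be found."""
    return importlib.util.find_spec(name) is not None

status = {name: is_installed(name) for name in PACKAGES}
for name, found in status.items():
    print(f"{name:12s} {'installed' if found else 'missing'}")
```

Any package reported as missing can be installed with pip install followed by the package name, in the same way as in Step 2.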
Problem Statement:
Various operations on NumPy Arrays.
Aim:
To write a Python program to perform various operations on NumPy arrays.
Algorithm:
Step 1: Start the program.
Program:
1. Creation of Array:
# Creation of 1D Array
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
# Creation of 2D Array
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
# Creation of 3D Array
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
# Creation of N-Dimensional Array
import numpy as np
a = np.array([1,2,3,4,5,6,7,8,9,10,11,12], ndmin = 3)
print(a)
# Using 1D Array
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])
# Using 2D Array
import numpy as np
# Using 3D Array
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
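The "Using 2D Array" case in the listing above has no indexing statement, while the Output section records "5th element on 2nd row: 10" and "Last element from 2nd dim: 10". A sketch consistent with that output could look as follows; the array itself is an assumption, chosen so that both expressions give 10.

```python
import numpy as np

# Assumed array: two rows of five values, so that row 2, element 5 is 10.
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print('5th element on 2nd row:', arr[1, 4])      # row index 1, column index 4
print('Last element from 2nd dim:', arr[1, -1])  # -1 counts from the end of the row
```

Indices start at 0, so the 2nd row is index 1 and the 5th element is index 4.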
3. Array Slicing:
# First Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[4:])
# Second Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
# Third Example
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 2])
# First Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr.ndim)
print(arr.shape)
print(arr.dtype)
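The Output section also lists a second example with 3 dimensions and shape (2, 2, 3). One array that matches those values is the 3D array from the creation examples; using it here is an assumption, shown as a sketch.

```python
import numpy as np

# Assumed input: 2 blocks, each holding 2 rows of 3 values.
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(arr.ndim)   # number of dimensions
print(arr.shape)  # length of each dimension
print(arr.dtype)  # integer dtype; the width (int32 or int64) depends on the platform
```

The shape tuple reads from the outermost dimension inwards, so (2, 2, 3) means 2 blocks of 2 rows with 3 values each.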
5. Array Reshaping
# First Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5,6])
print(arr.reshape(3,2))
# Second Example
import numpy as np
arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
print(arr.reshape(2,3,2))
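The Output section mentions a flatten function example whose code is not part of the listing. As a minimal sketch of what flattening does, assuming a small 2x3 input array, ndarray.flatten() collapses any array to one dimension and returns a copy:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])   # assumed 2x3 input
flat = arr.flatten()                     # 1D copy of the data, row by row

print("Original array:")
print(arr)
print("Flattened array:")
print(flat)

# Changing the copy leaves the original array untouched.
flat[0] = 99
print(arr[0, 0])
```

Because flatten() copies the data, it is safe to modify the result; reshape(-1) may instead return a view that shares memory with the original.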
Output:
1. Creation of Array:
# Creation of 1D Array
[1 2 3 4 5]
# Creation of 2D Array
# Using 1D Array
1
# Using 2D Array
5th element on 2nd row: 10
Last element from 2nd dim: 10
# Using 3D Array
6
# First Example
[5 6 7]
# Second Example
[2 4]
# Third Example
[3 8]
# First Example
1
(5,)
int32
# Second Example
3
(2, 2, 3)
int32
5. Array Reshaping
# First Example
[[1 2]
[3 4]
[5 6]]
# Second Example
[[[ 1 2]
[ 3 4]
[ 5 6]]
[[ 7 8]
[ 9 10]
[11 12]]]
# Flatten Function Example
Original array:
Result:
Thus the above program has been implemented and verified successfully.
Problem Statement:
Various operations on Pandas DataFrames.
Aim:
To write a Python program to perform various operations on Pandas DataFrames.
Algorithm:
Step 1: Start the program.
Step 3: View the Pandas DataFrame properties using various attributes and functions.
Step 4: Access and modify the DataFrame data using pandas accessors.
Program:
The Pandas DataFrame is a structure that contains two-dimensional data and its
corresponding labels. DataFrames are widely used in data science, machine learning,
scientific computing, and many other data-intensive fields.
Because DataFrames are built on NumPy and integrate with the rest of the Python
ecosystem, they are often faster and more flexible for programmatic analysis than plain
tables or spreadsheets.
import numpy as np
import pandas as pd
d = {'x': [1, 2, 3], 'y': np.array([2, 4, 8]), 'z': 100}
print(pd.DataFrame(d))
2. Viewing/Inspecting Data:
import numpy as np
import pandas as pd
l = [{'x': 1, 'y': 2, 'z': 100},{'x': 2, 'y': 4, 'z': 100},{'x': 3, 'y': 8, 'z': 100}]
a=pd.DataFrame(l)
print(a)
print(a.ndim)
print(a.shape)
print(a.size)
print(a.head(2))
print(a.tail(1))
import numpy as np
import pandas as pd
l = [[1, 2, 13],[2, 4, 10],[3, 8, 14]]
a=pd.DataFrame(l, index=[100, 200, 300],columns=['x','y','z'])
print(a)
print(a['y'])
print(a[['x','z']])
print(a.loc[100])
print(a.loc[:,'y'])
print(a.loc[100:300,['y','z']])
print(a.iloc[0])
print(a.iloc[1,1])
print(a.iloc[1,:])
print(a.iloc[0:2,[0,2]])
print(a.at[100,'y'])
print(a.iat[1,2])
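One detail in the accessor calls above is easy to miss: a slice in .loc works on labels and includes the end label, while a slice in .iloc works on positions and excludes the end position. The sketch below illustrates this on the same DataFrame; the variable names by_label and by_position are illustrative only.

```python
import pandas as pd

a = pd.DataFrame([[1, 2, 13], [2, 4, 10], [3, 8, 14]],
                 index=[100, 200, 300], columns=['x', 'y', 'z'])

by_label = a.loc[100:200, 'y']    # label slice: rows 100 and 200
by_position = a.iloc[0:2, 1]      # position slice: rows 0 and 1 (2 is excluded)

print(by_label.equals(by_position))  # both select the same two values
print(len(a.loc[100:300]))           # the end label 300 is included
print(len(a.iloc[0:2]))              # the end position 2 is excluded
```

.at and .iat follow the same label/position split but return a single value only.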
import numpy as np
import pandas as pd
l = [[1, 2, 13],[2, 4, 10],[3, 8, 14]]
a=pd.DataFrame(l, index=[100, 200, 300],columns=['x','y','z'])
a['y']=[50,60,70]
print(a)
a['w']=0
print(a)
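The Output section also shows a column v placed between y and z, and a later table without w. The statements that produced those tables are not in the listing; one way to obtain the same tables, given here as a sketch and not necessarily the original code, is DataFrame.insert followed by del:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1, 2, 13], [2, 4, 10], [3, 8, 14]],
                 index=[100, 200, 300], columns=['x', 'y', 'z'])
a['y'] = [50, 60, 70]
a['w'] = 0

# Insert column 'v' at position 2, that is, between 'y' and 'z'.
a.insert(loc=2, column='v', value=np.array([74.0, 70.0, 81.0]))
print(a)

# Remove column 'w' again.
del a['w']
print(a)
```

Assigning with a['name'] always appends the new column at the end, so insert is the option to use when the position matters.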
Output:
x y z
0 1 2 100
1 2 4 100
2 3 8 100
(3, 3)
9
x y z
0 1 2 100
1 2 4 100
x y z
2 3 8 100
x y z
100 1 2 13
200 2 4 10
300 3 8 14
100 2
200 4
300 8
Name: y, dtype: int64
x z
100 1 13
200 2 10
300 3 14
x 1
y 2
z 13
Name: 100, dtype: int64
100 2
200 4
300 8
y z
100 2 13
200 4 10
300 8 14
x 1
y 2
z 13
Name: 100, dtype: int64
x 2
y 4
z 10
Name: 200, dtype: int64
x z
100 1 13
200 2 10
10
x y z
100 1 2 13
200 2 4 1000
300 3 8 14
x y z
100 1 2 13
200 2 2000 2000
300 3 8 14
x y z
100 1 50 13
200 2 60 10
300 3 70 14
x y z w
100 1 50 13 0
200 2 60 10 0
300 3 70 14 0
x y v z w
100 1 50 74.0 13 0
200 2 60 70.0 10 0
300 3 70 81.0 14 0
x y v z
100 1 50 74.0 13
200 2 60 70.0 10
300 3 70 81.0 14
Result:
Thus the above program has been implemented and verified successfully.
Problem Statement:
Descriptive analytics on the Iris data set.
Aim:
To write a Python program to read data from text files, Excel and the web, and to
explore various commands for descriptive analytics on the Iris data set.
Algorithm:
Step 1: Start the program.
Step 2: Download the publicly available Iris dataset and convert it to the required format.
Step 5: Read web file (html) of Iris dataset using pandas library
Step 6: Read csv file of Iris dataset using pandas library and perform descriptive analytics on
the Iris data set using various commands.
Step 7: Stop the program.
Program:
The Pandas library can also read datasets stored in various formats. First, collect the
dataset by downloading a publicly available copy.
After converting the dataset to the required format (csv, text, Excel or web (html)),
keep it in a directory from which it can be read for descriptive analytics on the Iris
data set.
If the publicly available dataset is in csv format, convert it to text, Excel or web
format as required.
The Iris Flower Dataset contains three flower species with 50 samples each, along with
five properties of each flower: 1. sepal length in cm, 2. sepal width in cm, 3. petal
length in cm, 4. petal width in cm, 5. class (Iris Setosa, Iris Versicolour, Iris
Virginica).
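As a minimal sketch of reading the csv form and running a few descriptive commands, the block below loads six rows copied from the Output section through io.StringIO, which stands in for a real file. The file name iris.csv mentioned in the comment is hypothetical, and the choice of commands is an assumption, not the original listing.

```python
import io
import pandas as pd

# A few rows copied from the Output section; the full file has 150 rows.
csv_text = """sepal.length,sepal.width,petal.length,petal.width,variety
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
4.7,3.2,1.3,0.2,Setosa
6.5,3.0,5.2,2.0,Virginica
6.2,3.4,5.4,2.3,Virginica
5.9,3.0,5.1,1.8,Virginica
"""

# io.StringIO stands in for a path such as "iris.csv" (hypothetical name).
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())                      # first rows of the table
print(data.describe())                  # count, mean, std, min, quartiles, max
print(data["variety"].value_counts())   # number of samples per species
print(data.groupby("variety")["petal.length"].mean())  # mean petal length per species
```

With a real file, the same calls apply after data = pd.read_csv(path); pd.read_excel and pd.read_html are the corresponding readers for Excel and web (html) files.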
# Shape of dataset
print(data.shape)
# For Finding the count of null value (missing value) in the dataset
print(data.isnull().sum())
Output:
0 1 2 3 4 5
0 sepal.length sepal.width petal.length petal.width variety NaN
1 5.1 3.5 1.4 0.2 Setosa NaN
2 4.9 3 1.4 0.2 Setosa NaN
3 4.7 3.2 1.3 0.2 Setosa NaN
4 4.6 3.1 1.5 0.2 Setosa NaN
.. ... ... ... ... ... ...
146 6.7 3 5.2 2.3 Virginica NaN
147 6.3 2.5 5 1.9 Virginica NaN
148 6.5 3 5.2 2 Virginica NaN
149 6.2 3.4 5.4 2.3 Virginica NaN
150 5.9 3 5.1 1.8 Virginica NaN
[151 rows x 6 columns]]
# Shape of dataset
(150, 5)
# Finding subset of dataset
sepal.length sepal.width petal.length petal.width variety
10 5.4 3.7 1.5 0.2 Setosa
11 4.8 3.4 1.6 0.2 Setosa
12 4.8 3.0 1.4 0.1 Setosa
13 4.3 3.0 1.1 0.1 Setosa
14 5.8 4.0 1.2 0.2 Setosa
15 5.7 4.4 1.5 0.4 Setosa
16 5.4 3.9 1.3 0.4 Setosa
17 5.1 3.5 1.4 0.3 Setosa
18 5.7 3.8 1.7 0.3 Setosa
19 5.1 3.8 1.5 0.3 Setosa
20 5.4 3.4 1.7 0.2 Setosa
# For Finding the count of null value (missing value) in the dataset
sepal.length 0
sepal.width 0
petal.length 0
petal.width 0
variety 0
dtype: int64
Result:
Thus the above program has been implemented and verified successfully.
Step 3: Read csv file of Pima Indians Diabetes dataset using pandas library and perform
Univariate and Bivariate analysis using various commands.
Step 4: Stop the program.
Program:
The Pandas library is used to read datasets in various formats. First, collect the
dataset by downloading a publicly available copy.
The Pima Indians Diabetes dataset contains diabetes data for 768 patients, with 9
attributes describing each patient's health.
7. For finding correlation coefficient and plotting heatmap (for Bivariate Analysis)
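The correlation step can be sketched with DataFrame.corr(), which returns the pairwise Pearson correlation coefficients. The table below is made up for illustration only: its values are not taken from the Pima dataset and its column names are hypothetical.

```python
import pandas as pd

# Small made-up table for illustration only; not Pima dataset values.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],   # b = 2 * a, perfectly positively correlated with a
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],    # c = 6 - a, perfectly negatively correlated with a
})

corr = df.corr()   # pairwise Pearson correlation between the numeric columns
print(corr)
```

The resulting matrix is symmetric with ones on the diagonal. Passing it to seaborn's heatmap function, for example sns.heatmap(corr, annot=True), draws it as a colour-coded grid.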
Result:
Thus the above program has been implemented and verified successfully.
Problem Statement:
Exploration of various plotting functions on UCI dataset.
A. Normal curves
B. Density and contour plots
C. Correlation and scatter plots
D. Histograms
E. Three dimensional plotting
Aim:
To apply various plotting functions on UCI dataset using Python Programming.
Algorithm:
Program:
A. Normal curves
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
df=pd.read_csv("F:\\diabetic_data.csv")
mean =df['time_in_hospital'].mean()
std =df['time_in_hospital'].std()
x_axis = np.arange(1, 10, 0.01)
plt.plot(x_axis, norm.pdf(x_axis, mean, std))
plt.show()
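The curve drawn by norm.pdf is the normal density f(x) = exp(-(x - mean)^2 / (2 std^2)) / (std sqrt(2 pi)). The sketch below evaluates that formula directly with NumPy; the mean and standard deviation are hypothetical values chosen for illustration, not statistics of the dataset, and normal_pdf is an illustrative helper name.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std**2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical mean and standard deviation, for illustration only.
mean, std = 4.0, 3.0
x_axis = np.arange(1, 10, 0.01)
y = normal_pdf(x_axis, mean, std)

print(x_axis[np.argmax(y)])  # the curve peaks at (approximately) the mean
```

A larger standard deviation gives a wider and lower curve, since the area under the density always equals 1.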
B. Density and contour plots
#1
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("F:\\diabetic_data.csv")
df.time_in_hospital.plot.density(color='green')
plt.title('Density plot for time_in_hospital')
plt.show()
#2
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("F:\\diabetic_data.csv")
df.num_lab_procedures.plot.density(color='green')
plt.title('Density Plot for num_lab_procedures')
plt.show()
#3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("F:\\diabetic_data.csv")
def func(x, y):
    return np.sin(x) ** 2 + np.cos(y) ** 2
mean =df['time_in_hospital'].mean()
std =df['time_in_hospital'].std()
x = np.linspace(0, mean)
y = np.linspace(0, std)
# Generate combination of grids
X, Y = np.meshgrid(x, y)
Z = func(X, Y)
# Draw rectangular contour plot
plt.contour(X, Y, Z, cmap='gist_rainbow_r')
plt.show()
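C. Correlation and scatter plots
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df=pd.read_csv("F:\\diabetic_data.csv")
plt.figure(figsize = (9,9))
# numeric_only=True (pandas 1.5 or later) skips any text columns in the file
sns.heatmap(df.corr(numeric_only=True),annot=True)
plt.show()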
D. Histograms
#1
import pandas as pd
df=pd.read_csv("F:\\diabetic_data.csv")
df.hist(figsize=(12,12),layout=(5,3))
#2
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df=pd.read_csv("F:\\diabetic_data.csv")
sns.histplot(df["num_lab_procedures"], kde=True)
plt.show()
E. Three dimensional plotting
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("F:\\diabetic_data.csv")
ax = plt.axes(projection = '3d')
x = df['number_emergency']
x = pd.Series(x, name= '')
y = df['number_inpatient']
y = pd.Series(y, name= '')
z = df['number_outpatient']
z = pd.Series(z, name= '')
# Plot type assumed: scatter the three columns against each other
ax.scatter(x, y, z)
plt.show()
Output:
A. Normal curves
#1
#3
D. Histograms
#1
Result:
Thus the above program has been implemented and verified successfully.
Problem Statement:
Visualization of Geographic Data with Basemap
Aim:
To visualize Geographic Data using the Basemap module in Python Programming.
Algorithm:
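Program:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize = (12,12))
# Map object assumed: default Basemap(), as in the other programs of this exercise
m = Basemap()
m.drawcoastlines()
# Map (longitude, latitude) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12)
plt.show()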
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='dashed', color='red')
plt.title("Coastlines", fontsize=20)
plt.show()
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries()
plt.title("Country boundaries", fontsize=20)
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' US', fontsize=12)
plt.show()
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='ortho', lon_0 = 25, lat_0 = 10)
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='robin', llcrnrlat=-80, urcrnrlat=80,
            llcrnrlon=-180, urcrnrlon=180, lon_0=0, lat_0=0)
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
plt.title(" Robinson Projection", fontsize=20)
Output:
A. Basemap along with latitude and longitude
Result:
Thus the above program has been implemented and verified successfully.