CS3362 Data Science Laboratory Alok Kumar

The document outlines the features and installation processes of various Python libraries including NumPy, SciPy, Pandas, Statsmodels, and Jupyter. It provides detailed descriptions of each library's functionalities, such as NumPy's capabilities for mathematical operations and Pandas' data structures. Additionally, it includes examples of array creation, manipulation, and operations using NumPy.

Ex. No. 1 Features of Python Libraries

Problem Statement:
Download, Install and Explore the features of Python Libraries.
Aim:
To download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels
and Pandas packages.

Description:

Python Libraries

● There are many reasons why Python is popular among developers, and one of them is its amazingly large collection of libraries that users can work with. In this exercise, we will discuss some widely used Python libraries: NumPy, SciPy, Jupyter, Statsmodels and Pandas.
● We know that a module is a file with some Python code, and a package is a directory for sub-packages and modules. A Python library is a reusable chunk of code that you may want to include in your programs/projects.

Python Numpy Library

● NumPy is an open source library available in Python that aids in mathematical, scientific, engineering, and data science programming.
● NumPy is an incredible library to perform mathematical and statistical operations. It
works perfectly well for multi-dimensional arrays and matrix multiplication for any
scientific project.
● It has been built to work with N-dimensional arrays, linear algebra, random number generation, Fourier transforms, etc. On top of the arrays and matrices, NumPy supports a large number of mathematical operations.
● NumPy is memory efficient and can handle large amounts of data more easily than plain Python data structures. Besides, NumPy is very convenient to work with, especially for matrix multiplication and reshaping, as the short sketch below shows.
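For example, a short hedged sketch (illustrative values only) of matrix multiplication and reshaping:

import numpy as np
a = np.arange(6).reshape(2, 3)   # 2x3 matrix
b = np.arange(6).reshape(3, 2)   # 3x2 matrix
print(np.matmul(a, b))           # matrix product, shape (2, 2)
print(a.reshape(3, 2))           # the same data viewed as a 3x2 matrix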



● NumPy Arrays: NumPy’s main object is the homogeneous multidimensional array. We use a NumPy array instead of a Python list for three main reasons (a small comparison illustrating the first of these appears after the example below):
1. Less Memory
2. Fast
3. Convenient
Numpy arrays carry attributes around with them. The most important ones are:
ndim: The number of axes or rank of the array
shape: A tuple containing the length in each dimension
size: The total number of elements

For example:
import numpy as np
x = np.array([[1,2,3], [4,5,6], [7,8,9]]) # 3x3 matrix
print(x.ndim) # Prints 2
print(x.shape) # Prints (3, 3)
print(x.size) # Prints 9
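To make the "Less Memory" claim above concrete, a small illustrative comparison follows (exact byte counts vary with the platform and Python version, so treat them only as indicative):

import sys
import numpy as np
lst = list(range(1000))
arr = np.arange(1000)
# approximate memory of the Python list: the container plus the int objects it references
print(sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst))
# memory of the NumPy array's data buffer
print(arr.nbytes)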

● Steps to Download and Install NumPy on Windows 10:


Step 1: Download and Install Python
Step 2: Hit the Windows key, type Command Prompt, and click on Run as administrator.
Step 3: Type pip install numpy command and press Enter key to start the NumPy
installation.
Step 4: The NumPy package download and installation will automatically get started
and finished. You will see the message: Successfully installed numpy(-version).

Python SciPy Library

● SciPy is an open-source Python-based library, which is used in mathematics, scientific computing, engineering, and technical computing.
● SciPy is pronounced "Sigh Pie." SciPy contains a variety of sub-packages which help to solve the most common problems related to scientific computation. It can operate on NumPy arrays.



● NumPy vs SciPy
NumPy:
1. NumPy is written in C and is used for mathematical and numeric calculation.
2. It is faster than many other Python libraries for numerical work.
3. NumPy is one of the most useful libraries in Data Science for performing basic calculations.
4. NumPy provides the core array data type along with basic operations such as sorting, reshaping, indexing, etc.
SciPy:
1. SciPy is built on top of NumPy.
2. SciPy offers a fully featured set of linear algebra routines, while NumPy contains only a few.
3. Most newer data science features are available in SciPy rather than NumPy.

● Linear Algebra with SciPy: One of the most common problems in linear algebra is finding eigenvalues and eigenvectors, which can be solved easily using the eig() function.
For example:
from scipy import linalg
import numpy as np
arr = np.array([[5,4],[6,3]])
eg_val, eg_vect = linalg.eig(arr)
print(eg_val)
print(eg_vect)

● Steps to Download and Install SciPy on Windows 10:


Step 1: Download and Install Python
Step 2: Hit the Windows key, type Command Prompt, and click on Run as administrator.
Step 3: Type pip install scipy command and press Enter key to start the SciPy
installation.
Step 4: The SciPy package download and installation will automatically get started and
finished. You will see the message: Successfully installed scipy(-version).



Python Pandas Library

● Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
● Pandas makes importing, analyzing, and visualizing data much easier. It builds on packages like NumPy and matplotlib to give you a single, convenient place to do most of your data analysis and visualization work.
● The primary two components of pandas are the Series and DataFrame.
● A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series. DataFrames and Series are quite similar in that many operations you can do with one can also be done with the other.
● While a NumPy array has an implicitly defined integer index used to access its values, a Pandas Series has an explicitly defined index associated with its values.

For example:

#Series Example
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)

#DataFrame Example
import pandas as pd
data = { "calories" : [420, 380, 390], "duration" : [50, 40, 45] }
myvar = pd.DataFrame(data)
print(myvar)

● Steps to Download and Install Pandas on Windows 10:


Step 1: Download and Install Python
Step 2: Hit the Windows key, type Command Prompt, and click on Run as administrator.
Step 3: Type pip install pandas command and press Enter key to start the Pandas
installation.
Step 4: The Pandas package download and installation will automatically get started and
finished. You will see the message: Successfully installed pandas(-version).
Python Statsmodels Library

● Statsmodels is a popular library in Python that enables us to estimate and analyze various statistical models. It is built on numeric and scientific libraries like NumPy and SciPy.
● It includes various linear regression models such as ordinary least squares, generalized least squares, and weighted least squares, and it provides efficient functions for time series analysis. A minimal example is sketched below.
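For example, a minimal ordinary least squares (OLS) sketch; the x and y values below are made up purely for illustration:

import numpy as np
import statsmodels.api as sm
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])
X = sm.add_constant(x)        # add an intercept column
model = sm.OLS(y, X).fit()    # fit an ordinary least squares regression
print(model.params)           # estimated intercept and slope
print(model.summary())        # full regression report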
● Steps to Download and Install Statsmodels on Windows 10:
Step 1: Download and Install Python
Step 2: Hit the Windows key, type Command Prompt, and click on Run as administrator.
Step 3: Type pip install statsmodels command and press Enter key to start the
Statsmodels installation.
Step 4: The Statsmodels package download and installation will automatically get
started and finished. You will see the message: Successfully installed
Statsmodels(-version).

Python Jupyter Library

● The IPython (Interactive Python) Notebook is now known as the Jupyter Notebook. It is an interactive computational environment in which you can combine code execution, rich text, mathematics, plots, etc.
● The Jupyter Notebook App is a server-client application that allows editing and running notebook documents via a web browser. The Jupyter Notebook App can be executed on a local desktop requiring no internet access, or it can be installed on a remote server and accessed through the internet.
● Steps to Download and Install Jupyter on Windows 10:
Step 1: Hit the Windows key, type Command Prompt, and click on Run as administrator.
Step 2: Type pip install notebook command and press Enter key to start the Jupyter
Notebook installation.
Step 3: The Jupyter Notebook package download and installation will automatically get
started and finished. To run the notebook type jupyter notebook command and press
Enter key to start the Jupyter Notebook.
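Once the packages above are installed, a quick sanity check (for example, in a new Jupyter Notebook cell) is to import them and print their versions; if every import succeeds, the installations are working:

import numpy, scipy, pandas, statsmodels
print(numpy.__version__, scipy.__version__, pandas.__version__, statsmodels.__version__)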



Ex. No. 2 Numpy Arrays

Problem Statement:
Various operations on Numpy Arrays.
Aim:
To write a python program to perform various operations on numpy arrays.

Algorithm:
Step 1: Start the program.

Step 2: Create 1D, 2D, 3D, N Dimensional arrays using numpy.

Step 3: Perform various basic operations on numpy arrays.

Step 4: Stop the program.

Program:

1. Creation of Array: An array is a collection of items of the same type.

# Creation of 1D Array
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
# Creation of 2D Array
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
# Creation of 3D Array
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
# Creation of N-Dimensional Array
import numpy as np
a = np.array([1,2,3,4,5,6,7,8,9,10,11,12], ndmin = 3)
print(a)



# Creation of 1D Array using arange function
import numpy as np
a=np.arange(0,10)
print(a)
print(a.ndim)

# Other way to create 2D Array


import numpy as np
a=np.arange(0,10).reshape(2,5)
print(a)
print(a.ndim)

# Create 1D array with all elements as 0


import numpy as np
a=np.zeros(5, dtype=int)
print(a)
print(a[1])
print(a.ndim)

# Create 1D array with all elements as 1


import numpy as np
a=np.ones((2,5),dtype=int)
print(a)

2. Accessing Array Elements:

# Using 1D Array
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])

# Using 2D Array
import numpy as np

arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('5th element on 2nd row: ', arr[1, 4])
print('Last element from 2nd dim: ', arr[1, -1])

# Using 3D Array
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])

3. Array Slicing:

# First Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[4:])

# Second Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
# Third Example
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 2])

4. Dimension, Shape and Data Type of an Array

# First Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr.ndim)
print(arr.shape)
print(arr.dtype)



# Second Example
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr.ndim)
print(arr.shape)
print(arr.dtype)

5. Array Reshaping

# First Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5,6])
print(arr.reshape(3,2))

# Second Example
import numpy as np
arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
print(arr.reshape(2,3,2))

# Flatten Function Example (flatten() returns a copy, so changes to the copy do not affect the original array)


import numpy as np
a = np.array([[40, 32, 67], [61, 79, 15]])
print('Original array:\n', a)
b = a.flatten()
print('Flattened array:\n', b)
b[1] = 1000
print('Original array after making changes in Flattened array:\n', a)
print('Modified Flattened array:\n', b)

# Ravel Function Example (ravel() returns a view of the original array where possible, so changes to it also appear in the original)


import numpy as np
x = np.array([[40, 32, 67], [61, 79, 15]])
print('Original array:\n', x)

y = x.ravel()
print('Flattened array:\n', y)
y[1] = 1000
print('Original array after making changes in Flattened array:\n', x)
print('Modified Flattened array:\n', y)

6. Random Value Generation using NumPy

# Random Number Generation using Numpy


import numpy as np
a=np.random.randint(0,10)
print(a)

# 1D Array using Random Number Generation


import numpy as np
a=np.random.randint(0,100,10)
print(a)
print(a.max())
print(a.min())
print(a.mean())
print(a.argmax())
print(a.argmin())

# 2D Array using Random Number Generation


import numpy as np
a=np.random.randint(0,100,(3,3))
print(a)

Output:

1. Creation of Array:

# Creation of 1D Array
[1 2 3 4 5]
# Creation of 2D Array

[[1 2 3]
[4 5 6]]
# Creation of 3D Array
[[[1 2 3]
[4 5 6]]
[[1 2 3]
[4 5 6]]]
# Creation of N-Dimensional Array
[[[ 1 2 3 4 5 6 7 8 9 10 11 12]]]
# Creation of 1D Array using arange function
[0 1 2 3 4 5 6 7 8 9]
1
# Other way to create 2D Array
[[0 1 2 3 4]
[5 6 7 8 9]]
2
# Create 1D array with all elements as 0
[0 0 0 0 0]
0
1
# Create 1D array with all elements as 1
[[1 1 1 1 1]
[1 1 1 1 1]]

2. Accessing Array Elements:

# Using 1D Array
1
# Using 2D Array
5th element on 2nd row: 10
Last element from 2nd dim: 10
# Using 3D Array
6



3. Array Slicing:

# First Example
[5 6 7]
# Second Example
[2 4]
# Third Example
[3 8]

4. Dimension, Shape and Data Type of an Array

# First Example
1
(5,)
int32
# Second Example
3
(2, 2, 3)
int32

5. Array Reshaping

# First Example
[[1 2]
[3 4]
[5 6]]
# Second Example
[[[ 1 2]
[ 3 4]
[ 5 6]]
[[ 7 8]
[ 9 10]
[11 12]]]
# Flatten Function Example
Original array:

[[40 32 67]
[61 79 15]]
Flattened array:
[40 32 67 61 79 15]
Original array after making changes in Flattened array:
[[40 32 67]
[61 79 15]]
Modified Flattened array:
[ 40 1000 67 61 79 15]
# Ravel Function Example
Original array:
[[40 32 67]
[61 79 15]]
Flattened array:
[40 32 67 61 79 15]
Original array after making changes in Flattened array:
[[ 40 1000 67]
[ 61 79 15]]
Modified Flattened array:
[ 40 1000 67 61 79 15]

6. Random Value Generation using NumPy

# Random Number Generation using Numpy


4
# 1D Array using Random Number Generation
[93 36 51 64 24 67 38 86 70 5]
93
5
53.4
0
9



# 2D Array using Random Number Generation
[[54 56 5]
[89 28 9]
[80 81 88]]

Result:

Thus the above program has been implemented and verified successfully.



Ex. No. 3 Pandas DataFrames

Problem Statement:
Various operations on Pandas DataFrames.
Aim:
To write a python program to perform various operations on Pandas DataFrames.

Algorithm:
Step 1: Start the program.

Step 2: Create Pandas DataFrames using different ways.

Step 3: View the Pandas DataFrame properties using various parameters and functions.

Step 4: Access and Modify the DataFrames data using pandas accessors.

Step 5: Insert and Delete DataFrames data using different ways.

Step 6: Stop the program.

Program:

The Pandas DataFrame is a structure that contains two-dimensional data and its
corresponding labels. DataFrames are widely used in data science, machine learning,
scientific computing, and many other data-intensive fields.
DataFrames are faster, easier to use, and more powerful than tables or spreadsheets
because they’re an integral part of the Python and NumPy ecosystems.

1. Creation of Pandas DataFrame:

# Using Python dictionary


# The keys of the dictionary are the DataFrame’s column labels, and the dictionary values are
# the data values in the corresponding DataFrame columns. The values can be contained in a
# tuple, list, one-dimensional NumPy array, Pandas Series object, or one of several other
# data types. You can also provide a single value that will be copied along the entire column.

import numpy as np
import pandas as pd
d = {'x': [1, 2, 3], 'y': np.array([2, 4, 8]), 'z': 100}
print(pd.DataFrame(d))



# Change the column order with the columns parameter and set the row labels with index
a=pd.DataFrame(d, index=[100, 200, 300], columns=['z', 'y', 'x'])
print(a)
# Using List (using a list of dictionaries)
import numpy as np
import pandas as pd
l = [{'x': 1, 'y': 2, 'z': 100},{'x': 2, 'y': 4, 'z': 100},{'x': 3, 'y': 8, 'z': 100}]
print(pd.DataFrame(l))
# Using List (using nested list)
import numpy as np
import pandas as pd
l = [[1, 2, 13],[2, 4, 10],[3, 8, 14]]
print(pd.DataFrame(l, columns=['x', 'y', 'z']))
# Using List (using list of tuples)
import numpy as np
import pandas as pd
l = [(1, 2, 'a'),(2, 'b', 10),('c', 8, 14)]
print(pd.DataFrame(l, columns=['x', 'y', 'z']))
# Using NumPy Arrays
import numpy as np
import pandas as pd
arr = np.array([[1, 2, 100],[2, 4, 100],[3, 8, 100]])
d=pd.DataFrame(arr, columns=['x', 'y', 'z'])
print(d)

2. Viewing/Inspecting Data:

import numpy as np
import pandas as pd
l = [{'x': 1, 'y': 2, 'z': 100},{'x': 2, 'y': 4, 'z': 100},{'x': 3, 'y': 8, 'z': 100}]
a=pd.DataFrame(l)
print(a)
print(a.ndim)

print(a.shape)
print(a.size)
print(a.columns)
print(a.columns[2])
print(a.head(2))
print(a.tail(1))

3. Accessing and Modifying Data:

# Pandas has four accessors in total:
# .loc[] accepts the labels of rows and columns and returns Series or DataFrames.
# .iloc[] accepts the zero-based indices of rows and columns and returns Series or DataFrames.
# .at[] accepts the labels of rows and columns and returns a single data value.
# .iat[] accepts the zero-based indices of rows and columns and returns a single data value.

import numpy as np
import pandas as pd
l = [[1, 2, 13],[2, 4, 10],[3, 8, 14]]
a=pd.DataFrame(l, index=[100, 200, 300],columns=['x','y','z'])
print(a)
print(a['y'])
print(a[['x','z']])
print(a.loc[100])
print(a.loc[:,'y'])
print(a.loc[100:300,['y','z']])
print(a.iloc[0])
print(a.iloc[1,1])
print(a.iloc[1,:])
print(a.iloc[0:2,[0,2]])
print(a.at[100,'y'])
print(a.iat[1,2])

# We can use accessors to modify parts of a Pandas DataFrame by passing a Python
# sequence, NumPy array, or single value.
a.loc[200,'z']=1000
print(a)
a.iloc[1,1:]=2000
print(a)

4. Inserting and Deleting Data

import numpy as np
import pandas as pd
l = [[1, 2, 13],[2, 4, 10],[3, 8, 14]]
a=pd.DataFrame(l, index=[100, 200, 300],columns=['x','y','z'])
a['y']=[50,60,70]
print(a)
a['w']=0
print(a)

# insert the new column at a given index


a.insert(loc=2, column='v',value=np.array([74.0, 70.0, 81.0]))
print(a)

# delete the column using del


del a['w']
print(a)

Output:

1. Creation of Pandas DataFrame:

# Using Python dictionary


x y z
0 1 2 100
1 2 4 100
2 3 8 100



z y x
100 100 2 1
200 100 4 2
300 100 8 3
# Using List (using a list of dictionaries)
x y z
0 1 2 100
1 2 4 100
2 3 8 100
# Using List (using nested list)
x y z
0 1 2 13
1 2 4 10
2 3 8 14
# Using List (using a list of tuples)
x y z
0 1 2 a
1 2 b 10
2 c 8 14
# Using NumPy Arrays
x y z
0 1 2 100
1 2 4 100
2 3 8 100

2. Viewing/Inspecting Data:

x y z
0 1 2 100
1 2 4 100
2 3 8 100

2
(3, 3)
9
Index(['x', 'y', 'z'], dtype='object')
z

x y z
0 1 2 100
1 2 4 100

x y z
2 3 8 100

3. Accessing and Modifying Data:

x y z
100 1 2 13
200 2 4 10
300 3 8 14

100 2
200 4
300 8
Name: y, dtype: int64

x z
100 1 13
200 2 10
300 3 14
x 1
y 2
z 13
Name: 100, dtype: int64

100 2
200 4
300 8
Name: y, dtype: int64

y z
100 2 13
200 4 10
300 8 14

x 1
y 2
z 13
Name: 100, dtype: int64

4

x 2
y 4
z 10
Name: 200, dtype: int64

x z
100 1 13
200 2 10

2
10
x y z
100 1 2 13
200 2 4 1000
300 3 8 14

x y z
100 1 2 13
200 2 2000 2000
300 3 8 14



4. Inserting and Deleting Data

x y z
100 1 50 13
200 2 60 10
300 3 70 14

x y z w
100 1 50 13 0
200 2 60 10 0
300 3 70 14 0

x y v z w
100 1 50 74.0 13 0
200 2 60 70.0 10 0
300 3 70 81.0 14 0

x y v z
100 1 50 74.0 13
200 2 60 70.0 10
300 3 70 81.0 14

Result:

Thus the above program has been implemented and verified successfully.



Ex. No. 4 Descriptive analytics on the Iris dataset

Problem Statement:
Descriptive analytics on the Iris data set.
Aim:
To write a Python program to read data from text files, Excel and the web, and to explore various commands for doing descriptive analytics on the Iris data set.

Algorithm:
Step 1: Start the program.

Step 2: Download the publicly available Iris dataset and convert it to the required format.

Step 3: Read text file of Iris dataset using pandas library.

Step 4: Read excel file of Iris dataset using pandas library.

Step 5: Read web file (html) of Iris dataset using pandas library

Step 6: Read csv file of Iris dataset using pandas library and perform descriptive analytics on
the Iris data set using various commands.
Step 7: Stop the program.

Program:

The Pandas library is also used for reading datasets in various formats. First, we have to collect the data by downloading a publicly available historical dataset.
After collecting the dataset and converting it to the required formats (csv, text, Excel, and web/html), we keep the dataset in some directory for access and for doing descriptive analytics on the Iris data set.
If a publicly available dataset is in csv format, convert it to the other formats you need, such as text, Excel, or web (html); a conversion sketch is given after this description.
The Iris Flower Dataset contains three flower species with 50 samples per species, along with four measured properties (1. sepal length in cm, 2. sepal width in cm, 3. petal length in cm, 4. petal width in cm) and a class label (5. class: Iris Setosa, Iris Versicolour, Iris Virginica) for each flower.
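As a hedged sketch of the conversion step described above (the file paths are placeholders on the same F: drive used elsewhere in this record, and writing Excel files assumes an engine such as openpyxl is installed):

import pandas as pd
df = pd.read_csv('F:\\iris.csv')
df.to_csv('F:\\iris.txt', sep='\t', index=False)    # tab-separated text file
df.to_excel('F:\\iris.xlsx', index=False)           # Excel file (needs openpyxl)
df.to_html('F:\\iris.html', index=False)            # web (html) file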



1. Reading text file of Iris dataset:

# Reading text file


import pandas as pd
data=pd.read_csv('F:\\iris.txt', sep="\t")
print(data)

2. Reading excel file of Iris dataset:

# Reading excel file


import pandas as pd
data=pd.read_excel('F:\\iris.xls')
print(data)

3. Reading web (html) file of Iris dataset:

# Reading web file


import pandas as pd
data=pd.read_html('F:\\iris_files\\sheet001.htm')
print(data)

4. Reading csv file of Iris dataset and Descriptive Analytics on Dataset

# Reading csv file


import numpy as np
import pandas as pd
data=pd.read_csv('F:\\iris.csv')
print(data)

# First five rows of dataset


print(data.head())

# Last five rows of dataset


print(data.tail())

# 10 Random rows of dataset


print(data.sample(10))



# All columns names of dataset
print(data.columns)

# Shape of dataset
print(data.shape)

# Finding subset of dataset


print(data[10:21])

# Specific data of first 10 rows of dataset


specific_data=data[["sepal.length","petal.length"]]
print(specific_data.head(10))

# Retrieving data using index


print(data.iloc[5])

# Retrieving data using attribute


print(data.loc[data["variety"] == "Virginica"])

# Data rows count using attribute


print(data["variety"].value_counts())

# For Checking the null value (missing value) in the dataset


print(data.isnull())

# For Finding the count of null value (missing value) in the dataset
print(data.isnull().sum())
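In addition to the commands above, pandas' describe() gives a one-shot summary (count, mean, standard deviation, min, quartiles, max) of the numeric columns; this is an optional addition whose output is not reproduced in the Output section below:

# Summary statistics of the numeric attributes
print(data.describe())

# Summary that also includes the categorical 'variety' column
print(data.describe(include='all'))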

Output:

1. Reading text file of Iris dataset:

sepal.length sepal.width petal.length petal.width variety


0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Virginica
146 6.3 2.5 5.0 1.9 Virginica
147 6.5 3.0 5.2 2.0 Virginica
148 6.2 3.4 5.4 2.3 Virginica
149 5.9 3.0 5.1 1.8 Virginica
[150 rows x 5 columns]

2. Reading excel file of Iris dataset:

sepal.length sepal.width petal.length petal.width variety


0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Virginica
146 6.3 2.5 5.0 1.9 Virginica
147 6.5 3.0 5.2 2.0 Virginica
148 6.2 3.4 5.4 2.3 Virginica
149 5.9 3.0 5.1 1.8 Virginica
[150 rows x 5 columns]

3. Reading web (html) file of Iris dataset:

0 1 2 3 4 5
0 sepal.length sepal.width petal.length petal.width variety NaN
1 5.1 3.5 1.4 0.2 Setosa NaN
2 4.9 3 1.4 0.2 Setosa NaN
3 4.7 3.2 1.3 0.2 Setosa NaN
4 4.6 3.1 1.5 0.2 Setosa NaN
.. ... ... ... ... ... ...
146 6.7 3 5.2 2.3 Virginica NaN
147 6.3 2.5 5 1.9 Virginica NaN
148 6.5 3 5.2 2 Virginica NaN
149 6.2 3.4 5.4 2.3 Virginica NaN
150 5.9 3 5.1 1.8 Virginica NaN
[151 rows x 6 columns]]



4. Reading csv file of Iris dataset and Descriptive Analytics on Dataset:

sepal.length sepal.width petal.length petal.width variety


0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Virginica
146 6.3 2.5 5.0 1.9 Virginica
147 6.5 3.0 5.2 2.0 Virginica
148 6.2 3.4 5.4 2.3 Virginica
149 5.9 3.0 5.1 1.8 Virginica

[150 rows x 5 columns]


# First five rows of dataset
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa

# Last five rows of dataset


sepal.length sepal.width petal.length petal.width variety
145 6.7 3.0 5.2 2.3 Virginica
146 6.3 2.5 5.0 1.9 Virginica
147 6.5 3.0 5.2 2.0 Virginica
148 6.2 3.4 5.4 2.3 Virginica
149 5.9 3.0 5.1 1.8 Virginica

# 10 Random rows of dataset


sepal.length sepal.width petal.length petal.width variety
57 4.9 2.4 3.3 1.0 Versicolor
103 6.3 2.9 5.6 1.8 Virginica
128 6.4 2.8 5.6 2.1 Virginica
86 6.7 3.1 4.7 1.5 Versicolor
84 5.4 3.0 4.5 1.5 Versicolor
80 5.5 2.4 3.8 1.1 Versicolor
112 6.8 3.0 5.5 2.1 Virginica
89 5.5 2.5 4.0 1.3 Versicolor
24 4.8 3.4 1.9 0.2 Setosa
129 7.2 3.0 5.8 1.6 Virginica

# All columns names of dataset


Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width', 'variety'], dtype='object')

# Shape of dataset
(150, 5)
# Finding subset of dataset
sepal.length sepal.width petal.length petal.width variety
10 5.4 3.7 1.5 0.2 Setosa
11 4.8 3.4 1.6 0.2 Setosa
12 4.8 3.0 1.4 0.1 Setosa
13 4.3 3.0 1.1 0.1 Setosa
14 5.8 4.0 1.2 0.2 Setosa
15 5.7 4.4 1.5 0.4 Setosa
16 5.4 3.9 1.3 0.4 Setosa
17 5.1 3.5 1.4 0.3 Setosa
18 5.7 3.8 1.7 0.3 Setosa
19 5.1 3.8 1.5 0.3 Setosa
20 5.4 3.4 1.7 0.2 Setosa

# Specific data of first 10 rows of dataset


sepal.length petal.length
0 5.1 1.4
1 4.9 1.4
2 4.7 1.3
3 4.6 1.5
4 5.0 1.4
5 5.4 1.7
6 4.6 1.4
7 5.0 1.5
8 4.4 1.4
9 4.9 1.5

# Retrieving data using index


sepal.length 5.4
sepal.width 3.9
petal.length 1.7
petal.width 0.4
variety Setosa
Name: 5, dtype: object

# Retrieving data using attribute


sepal.length sepal.width petal.length petal.width variety
100 6.3 3.3 6.0 2.5 Virginica
101 5.8 2.7 5.1 1.9 Virginica
102 7.1 3.0 5.9 2.1 Virginica
103 6.3 2.9 5.6 1.8 Virginica
104 6.5 3.0 5.8 2.2 Virginica
105 7.6 3.0 6.6 2.1 Virginica
106 4.9 2.5 4.5 1.7 Virginica
107 7.3 2.9 6.3 1.8 Virginica
108 6.7 2.5 5.8 1.8 Virginica
109 7.2 3.6 6.1 2.5 Virginica
110 6.5 3.2 5.1 2.0 Virginica
111 6.4 2.7 5.3 1.9 Virginica
112 6.8 3.0 5.5 2.1 Virginica
113 5.7 2.5 5.0 2.0 Virginica
114 5.8 2.8 5.1 2.4 Virginica
115 6.4 3.2 5.3 2.3 Virginica
116 6.5 3.0 5.5 1.8 Virginica
117 7.7 3.8 6.7 2.2 Virginica
118 7.7 2.6 6.9 2.3 Virginica
119 6.0 2.2 5.0 1.5 Virginica
120 6.9 3.2 5.7 2.3 Virginica
121 5.6 2.8 4.9 2.0 Virginica
122 7.7 2.8 6.7 2.0 Virginica
123 6.3 2.7 4.9 1.8 Virginica
124 6.7 3.3 5.7 2.1 Virginica
125 7.2 3.2 6.0 1.8 Virginica
126 6.2 2.8 4.8 1.8 Virginica
127 6.1 3.0 4.9 1.8 Virginica
128 6.4 2.8 5.6 2.1 Virginica
129 7.2 3.0 5.8 1.6 Virginica
130 7.4 2.8 6.1 1.9 Virginica
131 7.9 3.8 6.4 2.0 Virginica
132 6.4 2.8 5.6 2.2 Virginica
133 6.3 2.8 5.1 1.5 Virginica
134 6.1 2.6 5.6 1.4 Virginica
135 7.7 3.0 6.1 2.3 Virginica
136 6.3 3.4 5.6 2.4 Virginica
137 6.4 3.1 5.5 1.8 Virginica
138 6.0 3.0 4.8 1.8 Virginica
139 6.9 3.1 5.4 2.1 Virginica
140 6.7 3.1 5.6 2.4 Virginica
141 6.9 3.1 5.1 2.3 Virginica
142 5.8 2.7 5.1 1.9 Virginica
143 6.8 3.2 5.9 2.3 Virginica
144 6.7 3.3 5.7 2.5 Virginica
145 6.7 3.0 5.2 2.3 Virginica
146 6.3 2.5 5.0 1.9 Virginica
147 6.5 3.0 5.2 2.0 Virginica
148 6.2 3.4 5.4 2.3 Virginica
149 5.9 3.0 5.1 1.8 Virginica

# Data rows count using attribute


Setosa 50
Versicolor 50
Virginica 50
Name: variety, dtype: int64

# For Checking the null value (missing value) in the dataset


sepal.length sepal.width petal.length petal.width variety
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
.. ... ... ... ... ...
145 False False False False False
146 False False False False False
147 False False False False False
148 False False False False False
149 False False False False False

[150 rows x 5 columns]

# For Finding the count of null value (missing value) in the dataset
sepal.length 0
sepal.width 0
petal.length 0
petal.width 0
variety 0
dtype: int64

Result:

Thus the above program has been implemented and verified successfully.



Ex. No. 5 Univariate and Bivariate analysis on the Pima Indians Diabetes dataset
Problem Statement:
Univariate and Bivariate analysis on Pima Indians Diabetes dataset.
Aim:
To write a python program to use Pima Indians Diabetes dataset for performing
Univariate and Bivariate analysis.
Algorithm:

Step 1: Start the program.

Step 2: Download the publicly available Pima Indians Diabetes dataset.

Step 3: Read csv file of Pima Indians Diabetes dataset using pandas library and perform
Univariate and Bivariate analysis using various commands.
Step 4: Stop the program.

Program:

The Pandas library is used for reading datasets in various formats. First, we have to collect the data by downloading a publicly available historical dataset.
The Pima Indians Diabetes dataset contains diabetes-related data for 768 patients. Each record has 9 attributes describing the patient’s health, the last of which (Outcome) indicates whether the patient is diabetic. A quick check of the shape and attribute names is sketched below.
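A quick hedged check of the shape and attribute names after loading (using the same file path as the programs below):

import pandas as pd
data = pd.read_csv('F:\\pima-indians-diabetes.csv')
print(data.shape)            # expected (768, 9)
print(list(data.columns))    # the 9 attribute names, ending with 'Outcome'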

1. Reading csv file of Pima Indians Diabetes dataset

# Reading csv file


import pandas as pd
data=pd.read_csv('F:\\pima-indians-diabetes.csv')
data

2. Data after removing rows with missing (zero-valued) entries from the dataset

# Drop rows where any feature column (all columns except the first and last) is 0,
# since zeros in these measurement columns represent missing values


import pandas as pd
data=pd.read_csv('F:\\pima-indians-diabetes.csv')
data=data[~(data[data.columns[1:-1]]==0).any(axis=1)]
data



3. Histogram for all attributes of the dataset

# Histogram for all attributes of the dataset


import pandas as pd
data=pd.read_csv('F:\\pima-indians-diabetes.csv')
data=data[~(data[data.columns[1:-1]]==0).any(axis=1)]
data.hist(figsize = (10,10))

4. Finding min,max,IQR,median using describe function

# Finding min,max,IQR,median using describe function


import pandas as pd
data=pd.read_csv('F:\\pima-indians-diabetes.csv')
data=data[~(data[data.columns[1:-1]]==0).any(axis=1)]
data.describe()

5. For finding the frequencies of any attribute

# For finding the frequencies of each attribute


import pandas as pd
data=pd.read_csv('F:\\pima-indians-diabetes.csv')
data=data[~(data[data.columns[1:-1]]==0).any(axis=1)]
print(data['Pregnancies'].value_counts())
print(data['Outcome'].value_counts())

6. For finding the mean, median, mode, variance, std

# For finding the mean, median, mode, variance, and standard deviation


import pandas as pd
data=pd.read_csv('F:\\pima-indians-diabetes.csv')
data=data[~(data[data.columns[1:-1]]==0).any(axis=1)]
print(data['Pregnancies'].mean())
print(data['Pregnancies'].median())
print(data['Pregnancies'].mode())
print(data['Pregnancies'].var())
print(data['Pregnancies'].std())



7. For finding correlation coefficient and plotting heatmap (for Bivariate Analysis)

# For finding correlation coefficient and plotting heatmap


import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
data=pd.read_csv('F:\\pima-indians-diabetes.csv')
data=data[~(data[data.columns[1:-1]]==0).any(axis=1)]
plt.figure(figsize = (9,9))
sns.heatmap(np.abs(data.corr()),annot=True)
data.corr()

8. For plotting boxplot between two attributes (for Bivariate Analysis)

# For plotting boxplot between two attributes


import pandas as pd
import seaborn as sns
data=pd.read_csv('F:\\pima-indians-diabetes.csv')
data=data[~(data[data.columns[1:-1]]==0).any(axis=1)]
sns.boxplot(x='Outcome',y='BMI',data=data)
data.describe()

9. For plotting data using scatterplot/pairplot (for Bivariate Analysis)

# For plotting the data using a pairplot (pairwise scatter plots colored by Outcome)


import pandas as pd
import seaborn as sns
data=pd.read_csv('F:\\pima-indians-diabetes.csv')
data=data[~(data[data.columns[1:-1]]==0).any(axis=1)]
sns.pairplot(data, hue="Outcome")



Output:
1. Reading csv file of Pima Indians Diabetes dataset

2. Data after removing rows with missing (zero-valued) entries from the dataset



3. Histogram for all attributes of the dataset

4. Finding min,max,IQR,median using describe function

5. For finding the frequencies of any attribute



6. For finding the mean, median, mode, variance, std

7. For finding correlation coefficient and plotting heatmap (for Bivariate Analysis)



8. For plotting boxplot between two attributes (for Bivariate Analysis)

9. For plotting data using scatterplot/pairplot (for Bivariate Analysis)

Result:

Thus the above program has been implemented and verified successfully.



Ex. No. 6 Exploration of various plotting functions on UCI dataset

Problem Statement:
Exploration of various plotting functions on UCI dataset.
A. Normal curves
B. Density and contour plots
C. Correlation and scatter plots
D. Histograms
E. Three dimensional plotting

Aim:
To apply various plotting functions on UCI dataset using Python Programming.
Algorithm:

Step 1: Start the program


Step 2: Import the required packages
Step 3: Load dataset file from UCI dataset
Step 4: Visualize the dataset using various plotting functions on the UCI dataset
Step 5: Analyze the sample data and do the required operations
Step 6: Stop the program

Program:
A. Normal curves

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
df=pd.read_csv("F:\\diabetic_data.csv")
mean =df['time_in_hospital'].mean()
std =df['time_in_hospital'].std()
x_axis = np.arange(1, 10, 0.01)
plt.plot(x_axis, norm.pdf(x_axis, mean, std))
plt.show()



B. Density and contour plots

#1
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("F:\\diabetic_data.csv")
df.time_in_hospital.plot.density(color='green')
plt.title('Density plot for time_in_hospital')
plt.show()

#2
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("F:\\diabetic_data.csv")
df.num_lab_procedures.plot.density(color='green')
plt.title('Density Plot for num_lab_procedures')
plt.show()

#3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.read_csv("F:\\diabetic_data.csv")
def func(x, y):
    return np.sin(x) ** 2 + np.cos(y) ** 2
mean =df['time_in_hospital'].mean()
std =df['time_in_hospital'].std()
x = np.linspace(0, mean)
y = np.linspace(0, std)
# Generate combination of grids
X, Y = np.meshgrid(x, y)
Z = func(X, Y)
# Draw rectangular contour plot
plt.contour(X, Y, Z, cmap='gist_rainbow_r')



C. Correlation and scatter plots

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df=pd.read_csv("F:\\diabetic_data.csv")
plt.figure(figsize = (9,9))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True)  # correlations over numeric columns only

D. Histograms

#1
import pandas as pd
df=pd.read_csv("F:\\diabetic_data.csv")
df.hist(figsize=(12,12),layout=(5,3))

#2
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df=pd.read_csv("F:\\diabetic_data.csv")
sns.histplot(df["num_lab_procedures"], kde=True)
plt.show()

E. Three dimensional plotting

import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("F:\\diabetic_data.csv")
ax = plt.axes(projection = '3d')
x = df['number_emergency']
x = pd.Series(x, name='')
y = df['number_inpatient']
y = pd.Series(y, name='')   # use y here (the original reused x by mistake)
z = df['number_outpatient']
z = pd.Series(z, name='')   # use z here (the original reused x by mistake)

ax.plot3D(x, y, z, 'green')
ax.set_title('3D line plot diabetes dataset')
plt.show()

Output:
A. Normal curves

B. Density and contour plots

#1



#2

#3



C. Correlation and scatter plots

D. Histograms

#1



#2

E. Three dimensional plotting

Result:

Thus the above program has been implemented and verified successfully.



Ex. No. 7 Visualization of Geographic Data with Basemap

Problem Statement:
Visualization of Geographic Data with Basemap
Aim:
To visualize Geographic Data using the BaseMap module in Python Programming.
Algorithm:

Step 1: Start the program


Step 2: Import the required packages
Step 3: Display the base map using the built-in Basemap constructor along with latitude
and longitude parameters
Step 4: Display the coastlines and country boundaries using built-in methods and fill
them with suitable colors
Step 5: Create global maps with the Orthographic and Robinson projections
Step 6: Stop the program
Program:

A. Basemap along with latitude and longitude

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

fig = plt.figure(figsize=(8, 8))


m = Basemap(projection='lcc', resolution=None, width=8E6, height=8E6,
            lat_0=45, lon_0=-100)
m.etopo(scale=0.5, alpha=0.5)

x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12)



B. Global map with a Coastlines

%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='dashed', color='red')
plt.title("Coastlines", fontsize=20)
plt.show()

C. Global map with a Country boundaries

%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries()
plt.title("Country boundaries", fontsize=20)
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' US', fontsize=12)
plt.show()

D. Global map with Orthographic Projection

%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='ortho', lon_0 = 25, lat_0 = 10)
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')

m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
plt.title("Orthographic Projection", fontsize=18)

E. Global map with a Robinson Projection

%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='robin', llcrnrlat=-80, urcrnrlat=80,
            llcrnrlon=-180, urcrnrlon=180, lon_0=0, lat_0=0)
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
plt.title(" Robinson Projection", fontsize=20)

Output:
A. Basemap along with latitude and longitude



B. Global map with a Coastlines

C. Global map with a Country boundaries

D. Global map with Orthographic Projection



E. Global map with a Robinson Projection

Result:

Thus the above program has been implemented and verified successfully.
