
Ex. No: 1
Download, Install and Explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas Packages

AIM:
To download, install and explore the features of the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.

ALGORITHMS:

(i) To Install Python and PIP:


Step 1: Select version of python to install.
Step 2: Download python Executable Installer
Step 3: Run Executable Installer
Step 4: Verify Python is installed on windows
 Open the command prompt.
 Type “python” and press enter.
Step 5: Verify pip was installed
 Type “pip -V” and press enter.

(ii) To Install NumPy:

 Hit the windows key, type command prompt, and click on run as administrator
 Type pip install numpy command and press enter key to start the numpy installation.
 The numpy package download and installation will automatically get started and finished.
Verify numpy installation
 Launch command prompt and type pip show numpy command and hit the enter key to
verify if numpy is part of python packages.
 The output will show you the numpy version with the location at which it is stored in the
system.

(iii) To Install SciPY:

 Hit the windows key, type command prompt, and click on run as administrator
 pip install scipy command and press enter key to start the scipy installation.
 The scipy package download and installation will automatically get started and finished.

Verify scipy installation

 Launch command prompt and type pip show scipy command and hit the enter key to verify
if scipy is part of python packages.
 The output will show you the scipy version with the location at which it is stored in the
system.

(iv) To Install Jupyter:

 Hit the windows key, type command prompt, and click on run as administrator
 Type pip install jupyter command and press enter key to start the jupyter installation.
 The jupyter package download and installation will automatically get started and finished.

Verify jupyter installation

 Launch command prompt and type pip show jupyter command and hit the enter key to
verify if jupyter is part of python packages.
 The output will show you the jupyter version with the location at which it is stored in the
system.

(v) To Install Statsmodels:

 Hit the windows key, type command prompt, and click on run as administrator
 Type pip install statsmodels command and press enter key to start the statsmodels
installation.
 statsmodels package download and installation will automatically get started and finished.

Verify Statsmodels installation

 Launch command prompt and type pip show statsmodels command and hit the enter key
to verify if statsmodels is part of python packages.
 The output will show you the statsmodels version with the location at which it is stored in
the system.

(vi) To Install Pandas:

 Hit the windows key, type command prompt, and click on run as administrator
 Type pip install pandas command and press enter key to start the pandas installation.
 The pandas package download and installation will automatically get started and finished.

Verify Pandas installation

 Launch command prompt and type pip show pandas command and hit the enter key to
verify if pandas is part of python packages.
 The output will show you the pandas version with the location at which it is stored in the
system.

Source Code:

(i) Download PIP get-pip.py

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

(ii) Installing PIP on Windows:

python get-pip.py
(iii) Verify Installation
pip help

(iv) Upgrading PIP:

To check the current version of PIP: pip --version

To upgrade PIP on Windows: python -m pip install --upgrade pip

(v) To Install Numpy:

Installing numpy : pip install numpy


Verify numpy installation : pip show numpy

(vi) To Install SciPY:

Installing Scipy : pip install scipy


Verify Scipy installation : pip show scipy

(vii) To Install Jupyter:

Installing Jupyter : pip install jupyter


Verify Jupyter installation : pip show jupyter

(viii) To Install Statsmodels:

Installing Statsmodels : pip install statsmodels


Verify Statsmodels installation : pip show statsmodels

(ix) To Install Pandas:

Installing Pandas : pip install pandas


Verify Pandas installation : pip show pandas

OUTPUT
Download PIP get-pip.py

Installing PIP on Windows

Installing numpy : pip install numpy

Installing scipy : pip install scipy

Installing jupyter: pip install jupyter

Installing pandas : pip install pandas

Result:

Thus the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages were downloaded, installed and their features explored.

VIVA QUESTIONS :

1. What is meant by data science?


Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary
approach that combines principles and practices from the fields of mathematics, statistics, artificial
intelligence, and computer engineering to analyze large amounts of data.

2. List the libraries in Python used for Data Analysis.


Python's most popular libraries for data analytics include NumPy, SciPy, Pandas, Matplotlib, Seaborn, Plotly, Scikit-learn, Statsmodels, and Apache Superset.

3. Why we are using numpy?


NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful
data structures to Python that guarantee efficient calculations with arrays and matrices and it supplies
an enormous library of high-level mathematical functions that operate on these arrays and matrices.
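
For quick illustration, a minimal sketch of such array operations (the sample values are made up) might look like this:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(a + b)                  # element-wise addition -> [5. 7. 9.]
print(a * b)                  # element-wise multiplication -> [ 4. 10. 18.]
print(np.dot(a, b))           # dot product -> 32.0
m = np.array([[1, 2], [3, 4]])
print(m @ m)                  # matrix multiplication
print(np.mean(a), np.std(a))  # high-level mathematical functions on arrays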

4. What is meant by pip?


PIP stands for "Preferred Installer Program" and is also read as the recursive acronym "Pip Installs Packages". It is a command-line utility that installs, reinstalls, or uninstalls PyPI packages with one simple command: pip.

5. which library is used to create dataframe?


Pandas is a data analysis and manipulation library for Python. It provides numerous functions and
methods for efficient data analysis. The core Pandas object for storing data is called dataframe which
consists of labelled rows and columns. Pandas is quite flexible in terms of the ways to create a
dataframe.

ASSIGNMENT QUESTIONS:

SL.NO  ASSIGNMENT                                                        CO MAPPING  BT LEVEL  COMPLEXITY

1.     Write a Pandas program to select the specified columns and        CO1         Create    High
       rows from a given data frame. Sample Python dictionary data
       and list labels: Select 'name' and 'score' columns in rows
       1, 3, 5, 6 from the following data frame.
       exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
                             'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
                    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
                    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
                    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
       labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
       Expected Output:
       Select specific columns and rows:
          score qualify
       b    9.0      no
       d    NaN      no
       f   20.0     yes
       g   14.5     yes

2.     Using NumPy, create an array with 5 dimensions and verify         CO1         Create    High
       that it has 5 dimensions.

3.     Using NumPy, sort a Boolean array.                                CO1         Create    High

4.     How to remove rows in a NumPy array that contain                  CO1         Apply     High
       non-numeric values?

5.     How to get all 2D diagonals of a 3D NumPy array?                  CO1         Apply     High
Ex No: 2
Working with Numpy Arrays
Date:

Aim:
To Explore the features of NumPy with Arrays using NumPy package.

Definition:
(i) Numpy
Numpy is the core library for scientific computing in Python. It provides a high-performance
multidimensional array object, and tools for working with these arrays.

(ii) Arrays
A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative
integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers
giving the size of the array along each dimension.

Algorithm:
Step 1: Verify Numpy package is installed in windows by using command ( pip show numpy).
Step 2: Create array using numpy packages.
Step 3: Perform operation such as Indexing, Slicing, Check Dimension, Check Data Types, Perform
Iterating, Joining, splitting, Searching, Sorting and Filter.
Step 4: Finally print the output of each operation.

Source Code:

(i) Numpy Creating and Check Dimensions:


import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
(ii) NumPy Array Indexing:
One Dimension:
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])
print(arr[2] + arr[3])

Two Dimension:
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
print('5th element on 2nd row: ', arr[1, 4])

Three Dimension:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])

(iii) NumPy Array Slicing


One Dimension:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])
print(arr[4:])
print(arr[:4])
print(arr[-3:-1])
print(arr[1:5:2])
print(arr[::2])

Two Dimension:
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[1, 1:4])
print(arr[0:2, 2])
print(arr[0:2, 1:4])

(iv) NumPy Data Types
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr.dtype)

import numpy as np
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')
print(newarr)

import numpy as np
arr = np.array([1, 0, 3])
newarr = arr.astype(bool)
print(newarr)
print(newarr.dtype)

(v) NumPy Array Iterating


One Dimension:
import numpy as np
arr = np.array([1, 2, 3])
for x in arr:
    print(x)

Two Dimension:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
    print(x)

Three Dimension:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    print(x)
(vi) NumPy Joining Array
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)

import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
arr = np.concatenate((arr1, arr2), axis=1)
print(arr)

(vii) NumPy Splitting Array


import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)

import numpy as np
arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
newarr = np.array_split(arr, 3)
print(newarr)
(viii) NumPy Searching Arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.where(arr%2 == 0)
print(x)

(ix) NumPy Sorting Arrays
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))

import numpy as np
arr = np.array(['banana', 'cherry', 'apple'])
print(np.sort(arr))

import numpy as np
arr = np.array([True, False, True])
print(np.sort(arr))

import numpy as np
arr = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr))

(x) NumPy Filter Array


import numpy as np
arr = np.array([41, 42, 43, 44])
x = arr[[True, False, True, False]]
print(x)

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
filter_arr = []
for element in arr:
    if element % 2 == 0:
        filter_arr.append(True)
    else:
        filter_arr.append(False)
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)

OUTPUT:
(i) Numpy Creating and Check Dimensions:

(ii) NumPy Array Indexing:

(iii) NumPy Array Slicing

(iv) NumPy Data Types

(v) NumPy Array Iterating

(vi) NumPy Joining Array

(vii) NumPy Splitting Array

(viii) NumPy Searching Arrays

(ix) NumPy Sorting Arrays

(x) NumPy Filter Array

Result:
Thus the given operations are executed and verified successfully using numpy packages.

VIVA QUESTIONS :

1. Which command is used to list library functions?


All the functions present in a Python module can be listed by simply using the dir() method in the
Python shell or in the command prompt shell.
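
A small sketch of this (assuming NumPy is installed, as in Ex. No: 1):

import numpy as np
print(dir(np))          # lists all names defined in the numpy module
print(dir(np.ndarray))  # lists the methods available on numpy array objects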

2. What are the operations performed by numpy library functions?


NumPy arrays are mutable. NumPy is useful to perform basic operations like finding the dimensions,
the byte-size, and also the data types of the elements of the array.
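
For example, a brief sketch of these basic inspection operations on a small array:

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.ndim)      # number of dimensions -> 2
print(arr.shape)     # size along each dimension -> (2, 3)
print(arr.itemsize)  # byte-size of each element
print(arr.dtype)     # data type of the elements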

3. How do you create a NumPy array?


To create a NumPy array, you can use the function np.array() .

4. How do you use NumPy for data science tasks?


NumPy is commonly used within data science in order to work through numerical analyses and
functions, such as creating and working with arrays, returning descriptive statistics, and a variety of
machine learning models and mathematical formulas.

5. Differentiate between Data Analytics and Data Science.


While Data Science focuses on finding meaningful correlations between large datasets, Data Analytics
is designed to uncover the specifics of extracted insights. In other words, Data Analytics is a branch
of Data Science that focuses on more specific answers to the questions that Data Science brings forth.

ASSIGNMENT QUESTIONS :

SL.NO  ASSIGNMENT                                                        CO MAPPING  BT LEVEL  COMPLEXITY

1.     (a) Write a NumPy program to convert an array to a float type.    CO1         Create    High
       (b) Write a NumPy program to create an empty and a full array.
       (c) Write a NumPy program to convert a list and tuple into
           arrays.
       (d) Write a NumPy program to find the real and imaginary parts
           of an array of complex numbers.

2.     Write a NumPy program to convert a list and tuple into arrays.    CO1         Create    High

3.     Write a NumPy program to create a 3x3 matrix with values          CO1         Create    High
       ranging from 2 to 10.

4.     How to count the frequency of unique values in a NumPy array?     CO1         Apply     High

5.     How to calculate the determinant of a matrix using NumPy?         CO1         Apply     High
Ex.No:3 Working with Pandas DataFrames

Aim:
To implement the basic concepts of Pandas Dataframe.

Pandas DataFrame:
Pandas DataFrame is two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). A Data frame is a two-dimensional data structure, i.e.,
data is aligned in a tabular fashion in rows and columns. Pandas DataFrame consists of three principal
components, the data, rows, and columns.

The basic operation which can be performed on Pandas DataFrame :

 Creating a DataFrame
 Dealing with Rows and Columns
 Indexing and Selecting Data
 Working with Missing Data
 Iterating over rows and columns

Creating a DataFrame:
In the real world, a Pandas DataFrame will be created by loading the datasets from existing
storage, storage can be SQL Database, CSV file, and Excel file. Pandas DataFrame can be created from
the lists, dictionary, and from a list of dictionary etc.
Dealing with Rows and Columns
A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows
and columns. We can perform basic operations on rows/columns like selecting, deleting, adding, and
renaming.
Column Selection:
In Order to select a column in Pandas DataFrame, we can either access the columns by calling them by
their columns name.
Row Selection:
Pandas provide a unique method to retrieve rows from a Data frame. DataFrame.loc[] method is used to
retrieve rows from Pandas DataFrame. Rows can also be selected by passing integer location to
an iloc[] function.
Indexing and Selecting Data:
Indexing in pandas means simply selecting particular rows and columns of data from a
DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and

all of the columns, or some of each of the rows and columns. Indexing can also be known as Subset
Selection.
Indexing a Dataframe using indexing operator []:
The indexing operator refers to the square brackets following an object. The .loc and .iloc indexers
also use the indexing operator to make selections. In this section, the indexing operator refers to df[].
Selecting a single column
In order to select a single column, we simply put the name of the column between the brackets.

Indexing a DataFrame using .loc[ ] :

This function selects data by the label of the rows and columns. The df.loc indexer selects data in a
different way than just the indexing operator. It can select subsets of rows or columns. It can also
simultaneously select subsets of rows and columns.

Indexing a DataFrame using .iloc[ ] :


This function allows us to retrieve rows and columns by position. In order to do that, we’ll need to specify
the positions of the rows that we want, and the positions of the columns that we want as well.
The df.iloc indexer is very similar to df.loc but only uses integer locations to make its selections.

Working with Missing Data


Missing Data can occur when no information is provided for one or more items or for a whole unit.
Missing Data is a very big problem in real life scenario. Missing Data can also refer to as NA(Not
Available) values in pandas.

Checking for missing values using isnull() and notnull() :


In order to check missing values in a Pandas DataFrame, we use the functions isnull() and notnull(). Both
functions help in checking whether a value is NaN or not. These functions can also be used on a Pandas
Series in order to find null values in a series.
Filling missing values using fillna(), replace() and interpolate() :
In order to fill null values in a dataset, we use the fillna(), replace() and interpolate() functions; these functions
replace NaN values with some value of their own. All these functions help in filling null values in the
datasets of a DataFrame. The interpolate() function is basically used to fill NA values in the dataframe, but it
uses various interpolation techniques to fill the missing values rather than hard-coding the value.
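
A brief sketch of these three functions on a small frame (the values are chosen only for illustration):

import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [100, np.nan, 80, np.nan, 60]})
print(df.fillna(0))                              # fill NaN with a constant value
print(df.replace(to_replace=np.nan, value=-1))   # replace NaN with -1
print(df.interpolate(method='linear'))           # estimate NaN from neighbouring values
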
Dropping missing values using dropna() :

In order to drop null values from a dataframe, we use the dropna() function; this function drops
rows/columns of the dataset with null values.

Iterating over rows and columns:
Iteration is a general term for taking each item of something, one after another. Pandas DataFrame
consists of rows and columns so, in order to iterate over dataframe, we have to iterate a dataframe like a
dictionary.

Iterating over rows :


In order to iterate over rows, we can use the three functions iteritems(), iterrows() and itertuples(). These three
functions help in iteration over rows.
Iterating over Columns :

In order to iterate over columns, we need to create a list of dataframe columns and then iterating through
that list to pull out the dataframe columns.

Algorithm:
Step 1: Verify Pandas is installed in Windows by using the command pip show pandas.

Step 2: Create data frames from a list, a list of lists, and a dict of ndarrays/lists, and by providing index labels explicitly using pandas.

Step 3: Perform operations such as column selection, row selection, indexing, and checking for missing values.

Step 4: Print the output of each operation.

i) Creating Dataframe from Lists:


import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)

ii) Creating Pandas DataFrame from lists of lists.


import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

df

iii) Creating DataFrame from dict of narray/lists

import pandas as pd
data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
'Age': [20, 21, 19, 18]}

df = pd.DataFrame(data)
df

iv) Creating a DataFrame by proving index label explicitly:


import pandas as pd
data = {'Name': ['Tom', 'Jack', 'nick', 'juli'],
'marks': [99, 98, 95, 90]}
df = pd.DataFrame(data, index=['rank1',
'rank2',
'rank3',
'rank4'])
df

v) Creating Dataframe from list of dicts:


import pandas as pd

data = [{'a': 1, 'b': 2},
        {'a': 5, 'b': 10, 'c': 20}]

df1 = pd.DataFrame(data, index=['first', 'second'],
                   columns=['a', 'b'])

df2 = pd.DataFrame(data, index=['first', 'second'],
                   columns=['a', 'b1'])

print(df1, "\n")
print(df2)

vi) Column Selection:

import pandas as pd
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data)
print(df[['Name', 'Qualification']])
vii) Row Selection:

1. DataFrame.loc[]

import pandas as pd
import xlrd
read_file=pd.read_excel ("Test.xlsx")
read_file.to_csv ("Test.csv", index = None, header=True)

data = pd.read_csv("Test.csv", index_col ="Name")

first = data.loc["Ronald"]
second = data.loc["Ben"]
print(first, "\n\n\n", second)

import pandas as pd
import xlrd

read_file=pd.read_excel ("Test.xlsx")

read_file.to_csv ("Test.csv",
index = None,
header=True)

data = pd.read_csv("Test.csv", index_col ="Name")

first = data["Cost"]
print(first)

2.Indexing a DataFrame using .iloc[ ] :


import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving rows by iloc method
row2 = data.iloc[3]

print(row2)

viii) Working with Missing Data


import pandas as pd
import numpy as np
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
df = pd.DataFrame(dict)
df.isnull()

1. Filling missing values using fillna(), replace() and interpolate() :


import pandas as pd
import numpy as np
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}

df = pd.DataFrame(dict)
df.fillna(0)

2.Dropping missing values using dropna() :


import pandas as pd
import numpy as np
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(dict)

df

3.Drop rows with at least one Nan value (Null value)

import pandas as pd
import numpy as np
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(dict)
df.dropna()

ix) Iterating over rows : iterrows()


import pandas as pd
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}

df = pd.DataFrame(dict)

for i, j in df.iterrows():
    print(i, j)
    print()

x) Iterating over Columns :


import pandas as pd
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
df = pd.DataFrame(dict)
print(df)
columns = list(df)

for i in columns:
    print(df[i][2])
OUTPUT:
1. Creating a DataFrame
i) Creating Dataframe from Lists:

ii) Creating Pandas DataFrame from lists of lists.

iii) Creating DataFrame from dict of narray/lists

iv) Creating a DataFrame by proving index label explicitly:

2. Dealing with Rows and Columns

i) Column Selection:

ii) Row Selection: DataFrame.loc[]


3. Indexing and Selecting Data

i) Indexing a Dataframe using indexing operator [] :

ii) Indexing a DataFrame using .iloc[ ] :

4. Working with Missing Data

i) Checking for missing values using isnull() and notnull() :

ii) Filling missing values using fillna(), replace() and interpolate() :

iii) Dropping missing values using dropna() :

iv) Drop rows with at least one Nan value (Null value)

Iterating over rows


i) iterrows()

ii) Iterating over Columns :

Result:
Thus all the basic concepts of Pandas DataFrame are implemented Successfully.

VIVA QUESTIONS :

1. What is Pandas in Python?


Pandas is an open-source Python package that is most commonly used for data science, data analysis,
and machine learning tasks. It is built on top of another library named Numpy. It provides various data
structures and operations for manipulating numerical data and time series and is very efficient in
performing various functions like data visualization, data manipulation, data analysis, etc.

2. What are the significant features of the pandas Library?


Pandas library is known for its efficient data analysis and state-of-the-art data visualization.
The key features of the Pandas library are as follows:
• Fast and efficient DataFrame object with default and customized indexing.
• High-performance merging and joining of data.
• Data alignment and integrated handling of missing data.
• Label-based slicing, indexing, and subsetting of large data sets.
• Reshaping and pivoting of data sets.

3.Define Series in Pandas?
It is a one-dimensional array-like structure with homogeneous data which means data of different data
types cannot be a part of the same series. It can hold any data type such as integers, floats, and strings
and its values are mutable i.e. it can be changed but the size of the series is immutable i.e. it cannot be
changed. By using a ‘series’ method, we can easily convert the list, tuple, and dictionary into a series.
A Series cannot contain multiple columns.
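
For illustration, a minimal sketch of creating a Series from a list and from a dictionary:

import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])   # from a list
s2 = pd.Series({'x': 1, 'y': 2, 'z': 3})              # from a dictionary
print(s1)
print(s2)
s1['a'] = 99   # values are mutable, the size is not
print(s1)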

4. Define DataFrame in Pandas?


It is a two-dimensional array-like structure with heterogeneous data. It can contain data of different
data types and the data is aligned in a tabular manner i.e. in rows and columns and the indexes with
respect to these are called row index and column index respectively. Both size and values of DataFrame
are mutable. The columns can be heterogeneous types like int and bool.

5. Explain Categorical data in Pandas?


Categorical data is a discrete set of values for a particular outcome and has a fixed range. Also, the
data in the category need not be numerical, it can be textual in nature. Examples are gender, social
class, blood type, country affiliation, observation time, etc. There is no hard and fast rule for how many
values a categorical value should have.
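
A small sketch of categorical data in Pandas (blood types used only as sample values):

import pandas as pd

blood = pd.Series(['A', 'B', 'O', 'A', 'AB'], dtype='category')
print(blood.cat.categories)   # the fixed set of possible values
print(blood.value_counts())   # frequency of each category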

ASSIGNMENT QUESTIONS :

SL.NO  ASSIGNMENT                                                        CO MAPPING  BT LEVEL  COMPLEXITY

1.     Write a Pandas program to count the number of rows and            CO1         Create    High
       columns of a DataFrame. Sample Python dictionary data and
       list labels:
       exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
                             'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
                    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
                    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
                    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
       labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
       Expected Output:
       Number of Rows : 10
       Number of Columns : 4

2.     Write a NumPy program to convert a Python dictionary to a         CO1         Create    High
       NumPy ndarray.
       Sample Output:
       Original dictionary:
       {'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0},
        'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0},
        'column2': {'a': 4, 'b': 1, 'c': 5.0, 'd': -1.0},
        'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0}}

3.     Write a Pandas program to select the rows where the number of     CO1         Create    High
       attempts in the examination is greater than 2.
       Sample Python dictionary data and list labels:
       exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
                             'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
                    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
                    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
                    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
       labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

4.     How to iterate over rows in a Pandas DataFrame?                   CO1         Apply     High

5.     How to randomly select rows from a Pandas DataFrame?              CO1         Apply     High
Ex.No:4
Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set

Aim:
i) To read data from text files, Excel file and web
ii) To explore various commands for doing descriptive analytics on the iris data set.

Algorithm:
Step 1: Download the Iris test data text file and CSV file from
https://sourceforge.net/projects/irisdss/files/
Step 2: Find the path location of the text file and the Excel/CSV file.
Step 3: Read the text file using read() and view the text data.
Step 4: Read the CSV file using read_csv() and view the data.
Step 5: Read the Iris dataset directly from the URL 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data' and view the dataset.
Step 6: Perform descriptive analytics on the Iris dataset using various commands such as describe(), index, columns and head().

Reading data from a text file :


T = open(r'Data.txt')
print(T.read())

Read Excel File:


import pandas as pd
df = pd.read_csv(r'C:\Users\user\Desktop\iris_csv.csv')
print(df)

Descriptive analytics on the Iris data set:


print(df.head())
print(df.shape)
print(df.info())
print(df.describe())

Checking Missing Values


print(df.isnull().sum())

Checking Duplicates

data = df.drop_duplicates(subset ="Species",)


print(data)

counts of unique values


df.value_counts("Species")

Read data from web:


import pandas as pd
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = pd.read_csv(csv_url, header = None)
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
iris = pd.read_csv(csv_url, names = col_names)
print(iris)
print(iris.dtypes)

Viewing Data:
print(iris.head())

View the index of the DataFrame


print(iris.index)

View the columns of the DataFrame


print(iris.columns)

sorting by an axis:
iris.sort_index(axis=1, ascending=False).head(10)

sorting by values:
iris.sort_values(by='Petal_Width').head(10)

Selection Getting:
iris['Sepal_Length'].head()

Selecting via []-slices the first 5 rows
iris[0:5]

Selection by Label
1. Selecting on a multi-axis by label
iris.loc[0:10, ['Sepal_Length', 'Petal_Length']]

2. Reduction in the dimensions of the returned object


iris.loc[0, ['Sepal_Length', 'Petal_Length']]

3.A scalar value


iris.loc[0, 'Petal_Length']

4. Retrieve a column of the dataframe using a dict-like notation


iris['Petal_Length'].head()

5. Retrieve a column of data by attribute


iris.Sepal_Length.head()

6. Series using the loc attribute


iris.loc[0]

7. Selection by position
iris.iloc[0:3, 0:4]

8. Getting a scalar value by position using iat


iris.iat[0,0]

OUTPUT:
Reading data from a text file :

Read Excel File:
To display Data:

To display 5 rows:

To view Total rows and columns:

To view dataset information:

To view the description of data:

Checking Missing Values

Checking Duplicates

Counts of unique values

Reading data from web:

sorting by an axis:

Sorting by values:

Selection Getting:

Selecting via []-slices the first 5 rows

Selection by Label

1. Selecting on a multi-axis by label

2. Reduction in the dimensions of the returned object

3. A scalar value

4. Retrieve a column of the dataframe using a dict-like notation

5. Retrieve a column of data by attribute

6. Series using the loc attribute

7. Selection by position

8. Getting a scalar value by position using iat

Result :

Thus reading data from text files, Excel and the web and exploring various commands for doing
descriptive analytics on the Iris data set is performed.

VIVA QUESTIONS :

1. What is the basic descriptive analysis?


Descriptive analysis is one of the most crucial phases of statistical data analysis. It provides you with
a conclusion about the distribution of your data and aids in detecting errors and outliers. It lets you
spot patterns between variables, preparing you for future statistical analysis.

2. Why is descriptive analysis used?


Descriptive analysis identifies patterns in data to answer questions about who, what, where, when, and
to what extent. This guide describes how to more effectively approach, conduct, and communicate
quantitative descriptive analysis.

3. What are the advantages of descriptive analysis in research?


One of the main benefits of using descriptive statistics is that they can simplify and organize large
amounts of data into a few numbers or graphs. This can make it easier to grasp the main features and
patterns of your data, as well as identify any outliers or errors.

4. Who uses descriptive analytics?


Subscription streaming services like Spotify and Netflix, and e-commerce sites like Amazon and eBay
all use descriptive analytics to identify trends

5. What are the limitations of descriptive analytics?


Descriptive analytics helps businesses identify trends, provides a baseline for performance, enables
data-driven decision-making, and facilitates resource allocation. However, it is limited to historical
data, lacks context, requires accurate data, and may not always lead to actionable recommendations.

ASSIGNMENT QUESTIONS :

SL.NO  ASSIGNMENT                                                        CO MAPPING  BT LEVEL  COMPLEXITY

1.     Explore various commands for doing descriptive analytics on       CO3         Apply     High
       Crop yield production.

2.     Examine the mean, median, standard deviation on the UCI Heart     CO2         Analyze   High
       Disease dataset.

3.     Using descriptive analysis, explore various data visualization    CO3         Apply     High
       on the Lung Cancer dataset.

4.     Explore various commands for doing descriptive analytics on       CO3         Apply     High
       Vehicle Sales data.

5.     Using descriptive analysis, explore various data visualization    CO3         Apply     High
       on Bank Marketing.
Ex.No.5a Univariate Analysis on diabetes data set from UCI and Pima
Indians Diabetes data set.

Aim:
To perform Univariate Analysis on diabetes data set from UCI and Pima Indians
Diabetes data set.

Algorithm:
Step1: Download diabetes data set from UCI and Pima Indians Diabetes data set
Step2: Import necessary Modules and functions
Step3: Read the Dataset path using read_csv().
Step4:Perform univariate analysis such as frequency, mean, median, mode, variance, standard
deviation, skewness and kurtosis using corresponding functions.

I UCI DIABETIC DATASET:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
#from sklearn import linear_model
DataPath = r'C:\Users\asus\Desktop\diabetes.csv'
df = pd.read_csv(DataPath)

1. To find Frequency:
print("Frequency of values in column ")
count = df['Pregnancies'].value_counts()
print(count)
count = df['Glucose'].value_counts()
print(count)
count = df['BloodPressure'].value_counts()
print(count)
count = df['SkinThickness'].value_counts()
print(count)
count = df['Insulin'].value_counts()
print(count)

count = df['BMI'].value_counts()
print(count)
count = df['DiabetesPedigreeFunction'].value_counts()
print(count)
count = df['Age'].value_counts()
print(count)
count = df['Outcome'].value_counts()
print(count)

2. To find Mean:
print("mean:")
print(df.mean())

3. To find Median:
print("median:")
print(df.median())

4. To find Mode:
print("mode:")
print(df.mode().T)

5. To find Standard Deviation:


print("Standard Deviation:\n",df.std())

6. To find Variance:
print("Variance:\n",df.var())

7. To find Skewness:
print("Skewness:\n",df.skew())

8. To find Kurtosis:
print("Kurtosis:\n",df.kurtosis())

II PIMA INDIANS DIABETIC DATASET:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
path= r'C:\Users\asus\Desktop\pima-indians-diabetes.csv'
print(path)
df=pd.read_csv(path)
df.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
              'DiabetesPedigreeFunction', 'Age', 'Outcome']
print(df)

1. To find Frequency:
print("Frequency of values in column ")
count = df['Pregnancies'].value_counts()
print(count)
count = df['Glucose'].value_counts()
print(count)
count = df['BloodPressure'].value_counts()
print(count)
count = df['SkinThickness'].value_counts()
print(count)
count = df['Insulin'].value_counts()
print(count)
count = df['BMI'].value_counts()
print(count)
count = df['DiabetesPedigreeFunction'].value_counts()
print(count)
count = df['Age'].value_counts()
print(count)
count = df['Outcome'].value_counts()
print(count)

2. To find Mean:
print("mean:")
print(df.mean())

3. To find Median:

print("median:")
print(df.median())

4. To find Mode:
print("mode:")
print(df.mode().T)

5. To find Standard Deviation:


print("Standard Deviation:\n",df.std())

6. To find Variance:
print("Variance:\n",df.var())

7. To find Skewness:
print("Skewness:\n",df.skew())

8. To find Kurtosis:
print("Kurtosis:\n",df.kurtosis())

OUTPUT:

============= RESTART: C:/Users/Admin/Desktop/test.py ===================


1.UCI dataset:
Frequency of values in column
1 135
0 111
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
13 10
12 9
14 2
15 1
17 1
Name: Pregnancies, dtype: int64

To find mean:
mean:
Pregnancies 3.845052
Glucose 120.894531
BloodPressure 69.105469
SkinThickness 20.536458
Insulin 79.799479
BMI 31.992578
DiabetesPedigreeFunction 0.471876
Age 33.240885
Outcome 0.348958
dtype: float64

To find median

median:
Pregnancies 3.0000
Glucose 117.0000
BloodPressure 72.0000
SkinThickness 23.0000
Insulin 30.5000
BMI 32.0000
DiabetesPedigreeFunction 0.3725
Age 29.0000
Outcome 0.0000
dtype: float64

To find mode:
mode:
0 1
Pregnancies 1.000 NaN
Glucose 99.000 100.000
BloodPressure 70.000 NaN
SkinThickness 0.000 NaN
Insulin 0.000 NaN
BMI 32.000 NaN
DiabetesPedigreeFunction 0.254 0.258
Age 22.000 NaN
Outcome 0.000 NaN

Standard Deviation:

Pregnancies 3.369578
Glucose 31.972618
BloodPressure 19.355807
SkinThickness 15.952218
Insulin 115.244002
BMI 7.884160
DiabetesPedigreeFunction 0.331329
Age 11.760232
Outcome 0.476951
dtype: float64

Variance:

Pregnancies 11.354056
Glucose 1022.248314
BloodPressure 374.647271
SkinThickness 254.473245
Insulin 13281.180078
BMI 62.159984
DiabetesPedigreeFunction 0.109779
Age 138.303046
Outcome 0.227483
dtype: float64

Skewness:

Pregnancies 0.901674
Glucose 0.173754
BloodPressure -1.843608
SkinThickness 0.109372
Insulin 2.272251
BMI -0.428982
DiabetesPedigreeFunction 1.919911
Age 1.129597
Outcome 0.635017
dtype: float64

Kurtosis:

Pregnancies 0.159220
Glucose 0.640780
BloodPressure 5.180157
SkinThickness -0.520072
Insulin 7.214260
BMI 3.290443
DiabetesPedigreeFunction 5.594954
Age 0.643159
Outcome -1.600930
dtype: float64

2. PIMA INDIANS DIABETES

Frequency of values in column


1 135
0 111
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
13 10
12 9
14 2
15 1
17 1
Name: Pregnancies, dtype: int64

mean:

Pregnancies 3.845052
Glucose 120.894531
BloodPressure 69.105469
SkinThickness 20.536458
Insulin 79.799479
BMI 31.992578
DiabetesPedigreeFunction 0.471876
Age 33.240885
Outcome 0.348958
dtype: float64

median:

Pregnancies 3.0000
Glucose 117.0000
BloodPressure 72.0000
SkinThickness 23.0000
Insulin 30.5000
BMI 32.0000
DiabetesPedigreeFunction 0.3725
Age 29.0000
Outcome 0.0000
dtype: float64

mode:
0 1
Pregnancies 1.000 NaN
Glucose 99.000 100.000
BloodPressure 70.000 NaN
SkinThickness 0.000 NaN
Insulin 0.000 NaN
BMI 32.000 NaN
DiabetesPedigreeFunction 0.254 0.258
Age 22.000 NaN
Outcome 0.000 NaN

Standard Deviation:

Pregnancies 3.369578
Glucose 31.972618
BloodPressure 19.355807
SkinThickness 15.952218
Insulin 115.244002
BMI 7.884160
DiabetesPedigreeFunction 0.331329
Age 11.760232
Outcome 0.476951
dtype: float64
Variance:
Pregnancies 11.354056
Glucose 1022.248314
BloodPressure 374.647271
SkinThickness 254.473245
Insulin 13281.180078
BMI 62.159984
DiabetesPedigreeFunction 0.109779
Age 138.303046
Outcome 0.227483
dtype: float64

Skewness:

Pregnancies 0.901674
Glucose 0.173754
BloodPressure -1.843608
SkinThickness 0.109372
Insulin 2.272251
BMI -0.428982
DiabetesPedigreeFunction 1.919911
Age 1.129597
Outcome 0.635017
dtype: float64

Kurtosis:
Pregnancies 0.159220
Glucose 0.640780
BloodPressure 5.180157
SkinThickness -0.520072
Insulin 7.214260
BMI 3.290443
DiabetesPedigreeFunction 5.594954
Age 0.643159
Outcome -1.600930
dtype: float64

Result:
Thus univariate analysis is performed on diabetes data set from UCI and Pima Indians
Diabetes data set and executed successfully.

VIVA QUESTIONS :

1. What is Univariate analysis?


Univariate analysis explores each variable in a data set, separately. It looks at the range of values, as
well as the central tendency of the values. It describes the pattern of response to the variable. It
describes each variable on its own. Descriptive statistics describe and summarize data.
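
As a small sketch, the same univariate measures used in this exercise can be computed on any single column or Series (the sample numbers are made up):

import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])   # one variable, examined on its own
print(values.mean(), values.median(), values.mode().tolist())
print(values.var(), values.std())
print(values.skew(), values.kurtosis())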

2. What is univariate and bivariate analysis?


Univariate analysis focuses on understanding individual variables. - Bivariate analysis examines
relationships between two variables.

3. What is an example of a univariate test?


Examples include t-tests of means, analysis of variance (ANOVA), analysis of covariance, linear
regression, and generalized linear models such as binary logistic regression. In all of these cases, there
is only one dependent variable.

4. Why is univariate analysis used?


Univariate analyses are conducted for the purpose of making data easier to interpret and to understand
how data is distributed within a sample or population being studied.

5. What are the disadvantages of univariate analysis?


While univariate analysis is useful for understanding the distribution and central tendencies of a single
variable, it has limitations: It doesn't provide any insight into relationships between variables. It can't
identify causality or correlation.

ASSIGNMENT QUESTIONS :

SL.NO  ASSIGNMENT                                                        CO MAPPING  BT LEVEL  COMPLEXITY

1.     Determine whether the following statement refers to univariate    CO4         Evaluate  High
       (single-variable) or bivariate (two-variable) data:
       Jen measured the height and number of leaves of each plant in
       her laboratory.

2.     Detect the mode(s), if any, for the following set of data. It     CO4         Evaluate  High
       may be helpful to order the data first.
       4, 5, 2, 8, 2, 1, 0, 0, 9, 5, 0

3.     For the following list, detect the quartiles:                     CO2         Evaluate  High
       0, 0, 0, 1, 2, 4, 10, 20, 30, 47.

4.     Detect the range of the following list of test scores. You may    CO4         Evaluate  High
       want to order the list first.
       98, 32, 60, 54, 78, 80, 54, 78, 77, 89.

5.     Detect the mean, median, mode and standard deviation of the       CO4         Evaluate  High
       following list of test scores. You may want to order the list
       first.
       98, 32, 60, 54, 78, 80, 54, 78, 77, 89.
Ex.No.5b
Bivariate Analysis such as linear regression modeling and logistic regression modeling on diabetes data set from UCI and Pima Indians Diabetes data set.

Aim:
To perform Bivariate Analysis on diabetes data set from UCI and Pima Indians
Diabetes data set.
Algorithm:
Step1: Download diabetes data set from UCI and Pima Indians Diabetes data set
Step2: Import necessary Modules and functions
Step3: Read the Dataset path using read_csv().
Step4: Perform Bivariate analysis such as Linear regression and Logistic regression on both the
datasets.

UCI DIABETICS DATASET:

I. Linear Regression

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

path= r'C:\Users\asus\Desktop\diabetes.csv'
print(path)
data=pd.read_csv(path)

print(data)
x = data['Age']
y = data['BMI']

def linear_regression(x, y):
    N = len(x)
    x_mean = x.mean()
    y_mean = y.mean()

    B1_num = ((x - x_mean) * (y - y_mean)).sum()
    B1_den = ((x - x_mean)**2).sum()
    B1 = B1_num / B1_den

    B0 = y_mean - (B1*x_mean)

    reg_line = 'y = {} + {}β'.format(B0, round(B1, 3))

    return (B0, B1, reg_line)

def corr_coef(x, y):
    N = len(x)

    num = (N * (x*y).sum()) - (x.sum() * y.sum())
    den = np.sqrt((N * (x**2).sum() - x.sum()**2) * (N * (y**2).sum() - y.sum()**2))
    R = num / den
    return R

def predict(B0, B1, new_x):
    y = B0 + B1 * new_x
    return y

B0, B1, reg_line = linear_regression(x, y)


print('Regression Line: ', reg_line)
R = corr_coef(x, y)
print('Correlation Coef.: ', R)
print('"Goodness of Fit": ', R**2)
p=predict(B0, B1,100)
print(p)
plt.figure(figsize=(12,5))
plt.scatter(x, y, s=300, linewidths=1, edgecolor='black')
text = '''X Mean: {} Years
Y Mean: ${}
R: {}
R^2: {}
y = {} + {}X'''.format(round(x.mean(), 2),
                       round(y.mean(), 2),
                       round(R, 4),
                       round(R**2, 4),
                       round(B0, 3),
                       round(B1, 3))
plt.text(x=1, y=100000, s=text, fontsize=12, bbox={'facecolor': 'grey', 'alpha': 0.2, 'pad': 10})
plt.title('Impact of AGE and BMI')
plt.xlabel('AGE', fontsize=15)
plt.ylabel('BMI', fontsize=15)
plt.plot(x, B0 + B1*x, c = 'r', linewidth=5, alpha=.5, solid_capstyle='round')
plt.scatter(x=x.mean(), y=y.mean(), marker='*', s=10**2.5, c='r') # average point
plt.show()

II. Logistic Regression


import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
DataPath = (r'C:\Users\8316\Downloads\diabetes.csv')
data = pd.read_csv(DataPath)
x=data.drop("Outcome",axis=1)
y=data[["Outcome"]]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=0)
model=LogisticRegression()
model.fit(x_train,y_train)
y_predict=model.predict(x_test)
model_score=model.score(x_test,y_test)
#Logistic Regression Model Score
print("Logistic Regression Model Score = ",model_score)
#confusion matrix
print("Confusion Matrix : \n",metrics.confusion_matrix(y_test,y_predict))
sns.heatmap(metrics.confusion_matrix(y_test,y_predict), annot=True, fmt='d',
            cmap='Blues')
plt.title("LogisticRegression Confusion Matrix")
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.savefig('confusion_matrix.png')
plt.show()

PIMA INDIANS DIABETICS DATASET:


I.Linear Regression:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

path= r'C:\Users\asus\Desktop\pima-indians-diabetes.csv'
print(path)
data=pd.read_csv(path)
data.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
                'DiabetesPedigreeFunction', 'Age', 'Outcome']
print(data)

x = data['Age']
y = data['BMI']

def linear_regression(x, y):
    N = len(x)
    x_mean = x.mean()
    y_mean = y.mean()

    B1_num = ((x - x_mean) * (y - y_mean)).sum()
    B1_den = ((x - x_mean)**2).sum()
    B1 = B1_num / B1_den
    B0 = y_mean - (B1*x_mean)

    reg_line = 'y = {} + {}β'.format(B0, round(B1, 3))
    return (B0, B1, reg_line)

def corr_coef(x, y):
    N = len(x)

    num = (N * (x*y).sum()) - (x.sum() * y.sum())
    den = np.sqrt((N * (x**2).sum() - x.sum()**2) * (N * (y**2).sum() - y.sum()**2))
    R = num / den
    return R

def predict(B0, B1, new_x):
    y = B0 + B1 * new_x
    return y

B0, B1, reg_line = linear_regression(x, y)


print('Regression Line: ', reg_line)
R = corr_coef(x, y)
print('Correlation Coef.: ', R)
print('"Goodness of Fit": ', R**2)
p=predict(B0, B1,100)
print(p)
plt.figure(figsize=(12,5))
plt.scatter(x, y, s=300, linewidths=1, edgecolor='black')
text = '''X Mean: {} Years
Y Mean: ${}
R: {}
R^2: {}
y = {} + {}X'''.format(round(x.mean(), 2),
round(y.mean(), 2),
round(R, 4),
round(R**2, 4),
round(B0, 3),
round(B1, 3))
plt.text(x=1, y=100000, s=text, fontsize=12, bbox={'facecolor': 'grey', 'alpha': 0.2, 'pad': 10})
plt.title('Impact of AGE and BMI')
plt.xlabel('AGE', fontsize=15)
plt.ylabel('BMI', fontsize=15)
plt.plot(x, B0 + B1*x, c = 'r', linewidth=5, alpha=.5, solid_capstyle='round')
plt.scatter(x=x.mean(), y=y.mean(), marker='*', s=10**2.5, c='r') # average point
plt.show()

II Logistic Regression:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
DataPath = r'C:\Users\asus\Desktop\pima-indians-diabetes.csv'
data = pd.read_csv(DataPath)
x=data.drop("Outcome",axis=1)
y=data[["Outcome"]]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=0)
model=LogisticRegression()
model.fit(x_train,y_train)
y_predict=model.predict(x_test)
model_score=model.score(x_test,y_test)
#Logistic Regression Model Score
print("Logistic Regression Model Score = ",model_score)
#confusion matrix
print("Confusion Matrix : \n",metrics.confusion_matrix(y_test,y_predict))
sns.heatmap(metrics.confusion_matrix(y_test,y_predict), annot=True, fmt='d',
cmap='Blues')
plt.title("LogisticRegression Confusion Matrix")
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.savefig('confusion_matrix.png')
plt.show()
OUTPUT:

1. UCI Diabetes dataset

Linear Regression.

Logistic Regression.

2. Pima Indians Diabetes dataset
Linear Regression.

Logistic Regression.

Result:
Thus Bivariate analysis such as linear regression modeling and logistic regression modeling
is performed on diabetes data set from UCI and Pima Indians Diabetes data set and executed
successfully.

VIVA QUESTIONS :

1. What is Linear Regression?


Linear regression is a data analysis technique that predicts the value of unknown data by using another
related and known data value. It mathematically models the unknown or dependent variable and the
known or independent variable as a linear equation.
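
A minimal sketch of fitting such a linear equation with scikit-learn (the sample numbers are made up; the lab programs above compute the same quantities manually):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])    # known (independent) variable
y = np.array([2.1, 4.2, 6.1, 8.0, 9.9])    # unknown (dependent) variable to be modelled
model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)       # the fitted linear equation y = b0 + b1*x
print(model.predict([[6]]))                # predict an unseen value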

2. What are types of linear regression?


Simple Linear Regression.
Multiple Linear Regression.

3. Why is it linear regression?


It is called linear regression because it models a straight-line (linear) relationship between the independent (predictor) variable on the X-axis and the dependent (output) variable on the Y-axis.

4. What is Logistic Regression?


Logistic regression is a data analysis technique that uses mathematics to find the relationships
between two data factors. It then uses this relationship to predict the value of one of those factors
based on the other. The prediction usually has a finite number of outcomes, like yes or no.

5. What is the main purpose of logistic regression?


Logistic regression is used to predict the categorical dependent variable. It's used when the
prediction is categorical, for example, yes or no, true or false, 0 or 1. For instance, insurance
companies decide whether or not to approve a new policy based on a driver's history, credit history
and other such factors.

ASSIGNMENT QUESTIONS:

SL.NO  ASSIGNMENT                                                        CO MAPPING  BT LEVEL    COMPLEXITY

1.     For the following scatter plot, determine if the dots are         CO4         Apply       High
       trying to form a line. If so, approximate the line of best fit.

2.     If a linear regression model achieves zero training error, can    CO4         Apply       High
       we say that all the data points lie on a hyperplane in the
       (d+1)-dimensional space? Here, d is the number of features.

3.     Which of the following are regression problems? Assume that       CO4         Create      High
       appropriate data is given.
       Predicting the house price.
       Predicting whether it will rain or not on a given day.
       Predicting the maximum temperature on a given day.
       Predicting the sales of the ice-creams.

4.     A regression model with the function y = 60 + 5.2x was built      CO4         Create      High
       to understand the impact of humidity (x) on rainfall (y). The
       humidity this week is 30 more than the previous week. What is
       the predicted difference in rainfall?
       156 mm
       15.6 mm
       -156 mm
       None of the above

5.     X and Y are two variables that have a strong linear               CO4         Understand  High
       relationship. Which of the following statements are incorrect?
       There cannot be a negative relationship between the two
       variables.
       The relationship between the two variables is purely causal.
       One variable may or may not cause a change in the other
       variable.
       The variables can be positively or negatively correlated with
       each other.

Ex.No.5C
Multiple Regression Analysis on diabetes data set from UCI and Pima Indians Diabetes data set.

Aim:
To perform Multiple Regression Analysis on diabetes data set from UCI and Pima
Indians Diabetes data set.

Algorithm:

Step1:Download diabetes data set from UCI and Pima Indians Diabetes data set
Step2:Import necessary Modules and functions
Step3:Read the Dataset path using read_csv().
Step4:Perform Multiple regression analysis on both the datasets.

I- UCI DIABETIC DATASET:


import pandas as pd
from sklearn import linear_model
DataPath = (r'C:\Users\8316\Downloads\diabetes.csv')
df = pd.read_csv(DataPath)
df.head()
x=df[['Insulin','Glucose']]
y=df[['Outcome']]
regr=linear_model.LinearRegression()
regr.fit(x,y)
predicted=regr.predict([[500,200]])
print("Predicted Outcome = ", predicted)

II- PIMA INDIANS DIABETICS DATASET:


import pandas as pd
from sklearn import linear_model
DataPath = (r'C:\Users\8316\Downloads\pima-indians-diabetes.csv')
df = pd.read_csv(DataPath)
df.head()
x=df[['Insulin','Glucose']]
y=df[['Outcome']]
regr=linear_model.LinearRegression()

regr.fit(x,y)
predicted=regr.predict([[500,200]])
print("Predicted Outcome = ", predicted)

OUTPUT:
1. UCI DATASET:
OUTPUT:
Predicted Outcome = [[0.86312063]]

2. PIMA INDIANS DIABETICS DATASET:


OUTPUT:
Predicted Outcome = [[0.86312063]]

Result:
Thus Multiple Regression Analysis on diabetes data set from UCI and Pima Indians Diabetes
data set is performed and executed successfully.
VIVA QUESTIONS :

1. What is Multiple Regression Analysis?


Multiple regression is a statistical technique that can be used to analyze the relationship between a
single dependent variable and several independent variables. The objective of multiple regression
analysis is to use the independent variables whose values are known to predict the value of the single
dependent value.

2. What is the principle of multiple regression?


Model parameters in a multiple regression model are usually estimated using ordinary least squares
minimizing the sum of squared deviations between each observed value and predicted values. It
involves solving a set of simultaneous normal equations, one for each parameter in the model.
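
A brief sketch of this ordinary least squares estimation with NumPy (synthetic numbers, for illustration only):

import numpy as np

# design matrix: an intercept column plus two independent variables
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 4.0],
              [1.0, 4.0, 1.0],
              [1.0, 3.0, 2.0]])
y = np.array([10.0, 11.0, 7.0, 8.5])

# least-squares solution of X b ≈ y, i.e. the normal equations (X^T X) b = X^T y
beta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # intercept and the two slope coefficients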

3. What are the applications of multiple regression?


Employee performance is one example. You can use multiple regression to analyze the determinants of employee
performance, such as productivity, satisfaction, or turnover. For example, you can create a model that
predicts productivity based on variables such as education, experience, skills, motivation, feedback,
and incentives.

4. What are the 2 types of multiple regression?


Multiple regressions can be linear and nonlinear. Multiple regressions are based on the assumption that
there is a linear relationship between both the dependent and independent variables. It also assumes no
major correlation between the independent variables.

5. What are the advantages of multiple regression?


One of the main advantages of multiple regression is that it can capture the complex and multifaceted
nature of real-world phenomena. By including multiple independent variables, you can account for
more factors that influence the dependent variable, and reduce the error and bias in your estimates.

ASSIGNMENT QUESTIONS :

SL.NO  ASSIGNMENT                                                        CO MAPPING  BT LEVEL    COMPLEXITY

1.     Given σx = 3 and the regression equations 8X - 10Y + 66 = 0       CO4         Create      High
       and 40X - 18Y = 214, find out (i) the mean values of X and Y,
       (ii) the coefficient of correlation between X and Y, and
       (iii) the standard deviation of Y.

2.     How will you perform Multiple Regression Analysis on the          CO4         Create      High
       Breast Cancer dataset?

3.     Use the Lung Cancer dataset from UCI for performing the           CO4         Create      High
       following: Multiple Regression Analysis.

4.     In order to determine the correlation coefficient, we will        CO4         Apply       High
       have to find out the regression coefficients. Here, we do not
       know which of the two regression equations is the regression
       equation of X on Y.

5.     Explain the difference between simple linear regression and       CO4         Understand  High
       multiple regression. Identify the assumptions of multiple
       regression.
Ex.No.5D
Result Comparison of diabetes data set from UCI and Pima Indians Diabetes data set.

Aim:
To perform result comparison analysis on diabetes data set from UCI and Pima Indians
Diabetes data set.

Algorithm:

Step1: Download diabetes data set from UCI and Pima Indians Diabetes data set
Step2: Import necessary Modules and functions
Step3: Read the Dataset path using read_csv().
Step4: Perform result comparison analysis on both the datasets using the statsmodels functions
GLM.from_formula(), fit() and summary()
Step5: Print the result using print()

Source Code:

1. UCI DIABETES DATASETS:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
import statsmodels.api as sm
df2 = pd.read_csv(r"C:\Users\asus\Desktop\diabetes.csv")
print(df2.shape)
df2.head(5)
model = sm.GLM.from_formula(
    "Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction + Age",
    family=sm.families.Binomial(), data=df2)
result = model.fit()
result.summary()
print(result.summary())

2. PIMA INDIANS DIABETES DATASETS:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
import statsmodels.api as sm
df2 = pd.read_csv(r"C:\Users\asus\Desktop\pima-indians-diabetes.csv")
print(df2.shape)
df2.head(5)
model = sm.GLM.from_formula(
    "Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction + Age",
    family=sm.families.Binomial(), data=df2)
result = model.fit()
result.summary()
print(result.summary())
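A hedged sketch of an explicit side-by-side comparison, assuming both scripts above have been run in one session and their fitted results were saved under the illustrative names result_uci and result_pima:

import pandas as pd

# result_uci and result_pima are assumed names for the two fitted GLM results above
comparison = pd.DataFrame({
    "UCI coefficient": result_uci.params,
    "Pima coefficient": result_pima.params,
})
print(comparison)
print("UCI AIC:", result_uci.aic, "  Pima AIC:", result_pima.aic)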

OUTPUT:

1. UCI Diabetes Dataset

2. Pima Indians Diabetes Dataset:

Result:

Thus result comparison is performed on diabetes data set from UCI and Pima Indians
Diabetes data set and executed successfully.

VIVA QUESTIONS :

1. What is Statsmodels?
Statsmodels is a Python package that allows users to explore data, estimate statistical models, and
perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions,
and result statistics are available for different types of data and each estimator.
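A minimal illustrative sketch of this workflow on synthetic data (not part of the exercise):

import numpy as np
import statsmodels.api as sm

x = np.arange(20)
y = 3 * x + 5 + np.random.normal(size=20)   # synthetic response
X = sm.add_constant(x)                       # add an intercept column
model = sm.OLS(y, X).fit()                   # ordinary least squares fit
print(model.summary())                       # estimates, tests and diagnostics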

2. What is the difference between Sklearn and Statsmodel API?


The differences between them highlight what each has to offer: scikit-learn is oriented toward machine
learning and data science, while statsmodels is oriented toward econometrics, generalized linear models,
time-series analysis, and regression models.

3. What are the results of a regression analysis?


Regression analysis generates an equation to describe the statistical relationship between one or more
predictor variables and the response variable. After you use Minitab Statistical Software to fit a
regression model, and verify the fit by checking the residual plots, you'll want to interpret the results.

4. How do you present regression results?


The report of the regression analysis should include the estimated effect of each explanatory variable
– the regression slope or regression coefficient – with a 95% confidence interval, and a P-value. The
P-value is for a test of the null hypothesis that the true regression coefficient is zero.
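Assuming `result` is a fitted statsmodels result object (such as the GLM fits earlier in this manual), these quantities can be collected in one table; the sketch below is illustrative rather than part of the exercise:

import pandas as pd

report = pd.DataFrame({
    "coefficient": result.params,
    "95% CI lower": result.conf_int()[0],
    "95% CI upper": result.conf_int()[1],
    "p-value": result.pvalues,
})
print(report)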

5. What is the predictor and outcome of a regression?


The outcome variable is also called the response or dependent variable, and the risk factors and
confounders are called the predictors, or explanatory or independent variables. In regression analysis,
the dependent variable is denoted "Y" and the independent variables are denoted by "X".

ASSIGNMENT QUESTIONS :

1. Compare the results of the univariate and bivariate analysis for the UCI diabetes dataset. (CO4, Understand, High)
2. How do you interpret a regression formula? (CO4, Apply, High)
3. From the following information, determine the correlation coefficient between advertisement expenses and sales volume using Karl Pearson's coefficient of correlation method. (CO4, Evaluate, High)
4. A computer, while calculating the correlation coefficient between the variables X and Y, obtained the following results: N = 30; ∑X = 120, ∑X² = 600, ∑Y = 90, ∑Y² = 250, ∑XY = 335. It was, however, later discovered at the time of checking that it had copied down two pairs of observations as (X, Y): (8, 10), (12, 7), while the correct values were (X, Y): (8, 12), (10, 8). Determine the correct value of the correlation coefficient between X and Y. (CO4, Evaluate, High)
5. The coefficient of correlation between X and Y is 0.3, their covariance is 9, and the variance of X is 16. Find the standard deviation of the Y series. (CO4, Evaluate, High)

Ex.No.6 Apply and explore various plotting functions on UCI data sets.

Aim:
To apply and explore various plotting functions on diabetes data set from UCI

Algorithm:
Step 1: Download the diabetes data set from UCI.
Step 2: Import the necessary modules and functions.
Step 3: Read the dataset path using read_csv().
Step 4: Using norm(), plot the normal curve for the UCI diabetes dataset variables.
Step 5: Using lineplot(), plot the line plot for variables in the UCI diabetes dataset.
Step 6: Using scatterplot(), plot the scatter plot for the UCI diabetes dataset variables.
Step 7: Using distplot(), plot the density plot for the variables in the UCI diabetes dataset.
Step 8: Using contour(), plot the contour plot for the UCI diabetes dataset variables.
Step 9: Using hist(), plot the histogram for the UCI diabetes dataset variables.

1. Constructing Normal Curve


import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
import statistics
x = np.arange(-20, 20, 0.01)
mean = statistics.mean(x)
sd = statistics.stdev(x)
plt.plot(x, norm.pdf(x, mean, sd))
plt.title("Normal Curve")
plt.show()

2. Constructing lineplot,scatterplot,density plot and Contour plot


import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# read the csv data
DataPath = (r'C:\Users\8316\Downloads\diabetes.csv')

df = pd.read_csv(DataPath)
df.head()
#Line Plot for Diabetes Dataset
sns.lineplot(x=df['BloodPressure'], y=df['Age'], hue=df['Outcome'])  # keyword arguments are required in recent seaborn versions
plt.title("Lineplot for Diabetes Dataset")
plt.show()
#Scatter Plot for Diabetes Dataset
sns.scatterplot(x=df['BloodPressure'], y=df['Age'], hue=df['Outcome'])  # keyword arguments are required in recent seaborn versions
plt.title("Scatterplot for Diabetes Dataset")
plt.show()

#Density Plot for Diabetes Dataset


x=df["Insulin"]
sns.distplot(x, hist=False)
plt.title("Density plot for Diabetes Dataset")
plt.show()

#Contour Plot for Diabetes Dataset


def f(x, y): return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x1=df["Age"]
x2=df["Outcome"]
X, Y = np.meshgrid(x1, x2)
Z = f(X, Y)
#plt.contour(X, Y, Z, colors='black');
plt.contour(X, Y, Z, 20, cmap='RdGy');
plt.title("Contour plot for Diabetes Dataset")
plt.xlabel("Age")
plt.ylabel("Outcome")
plt.show()

Histogram:
1. histogram of all columns:
import pandas as pd
import matplotlib.pyplot as plt

df=pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
df.hist(figsize=(10,10),color='red')

2. histogram of Pregnancies column


import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
df['Pregnancies'].hist(figsize=(10,8),color='red')
plt.title('Records of pregnancies')
plt.xlabel('Number')
plt.ylabel('Limit')

3. histogram of Glucose column


import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
df['Glucose'].hist(figsize=(10,8),color='yellow')
plt.title('Records of Glucose')
plt.xlabel('Number')
plt.ylabel('Limit')

4. histogram of Blood Pressure column


import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
df['BloodPressure'].hist(figsize=(8,8),color='blue')
plt.title('Records of Blood Pressure')
plt.xlabel('Number')
plt.ylabel('Limit')

5. histogram of Skin Thickness column


import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
df['SkinThickness'].hist(figsize=(8,8),color='green')
plt.title('Records of Skin Thickness')
plt.xlabel('Number')
plt.ylabel('Limit')

6. histogram of Insulin column


import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
df['Insulin'].hist(figsize=(10,8),color='grey')
plt.title('Records of Insulin')
plt.xlabel('Number')
plt.ylabel('Limit')

7. histogram of Body Mass Index column


import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
df['BMI'].hist(figsize=(10,8),color='black')
plt.title('Records of Body Mass Index')
plt.xlabel('Number')
plt.ylabel('Limit')

8. histogram of Diabetes Pedigree Function column


import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
df['DiabetesPedigreeFunction'].hist(figsize=(10,8))
plt.title('Records of Pedigree Function of Diabetic Patients')
plt.xlabel('Number')
plt.ylabel('Limit')

9. histogram of Age column


import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
df['Age'].hist(figsize=(10,8),color='pink')
plt.title('Records of Age Of Patients')
plt.xlabel('Number')
plt.ylabel('Limit')

10. histogram of Outcome(results) column


import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
df['Outcome'].hist(figsize=(10,8),color='yellow')
plt.title('Records of Results')
plt.xlabel('Number')
plt.ylabel('Limit')

Three Dimensional Plotting:


import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
print("success")
df = pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
#print(df.head())

#3D ScatterPlot:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['Pregnancies'], df['Glucose'], df['BloodPressure'], c='skyblue', s=60)
ax.view_init(30, 185)
plt.show()

#3D Line Plots


fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(df['Pregnancies'], df['Glucose'], df['BloodPressure'], c='skyblue')
ax.view_init(40, 180)
plt.show()
#3D Surface Plot:
import pandas as pd
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection="3d")
def z_function(x, y):
    return np.sin(x)**10 + np.cos(10 + y*x)*np.cos(x)

# Read data from a csv


df = pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
x = df['Age']
y = df['Outcome']
X, Y = np.meshgrid(x, y)
Z = z_function(X, Y)
ax.plot_surface(X, Y, Z, cmap='RdGy')
ax.set_xlabel('AGE')
ax.set_ylabel('OUTCOME')
plt.show()


OUTPUT:
1. histogram of all columns:

2. histogram of Pregnancies column

3. histogram of Glucose column

4. histogram of Blood Pressure column

5. histogram of Skin Thickness column

6. histogram of Insulin column

7. histogram of Body Mass Index column

8. histogram of Diabetes Pedigree Function column

9. histogram of Age column

10. histogram of Outcome(results) column

Three Dimensional Plotting:


#3D ScatterPlot:

#3D Line Plots

#3D Surface Plot:

Result:

Thus various plotting functions are applied and explored on the UCI diabetes data set and executed
successfully.

VIVA QUESTIONS :

1. What are plotting functions?


Plotting points is used to graph each function. Keep in mind that f(x)=y, so f(x) and y
can be used interchangeably. A constant function is any function of the form f(x)=c,
where c can be any real number. Constant functions are linear and have the form
f(x)=0x+c.

2. What are scatter plots?
Scatter plots are graphs that present the relationship between two variables in a dataset. They
represent data points on a two-dimensional plane (a Cartesian system). The independent variable or
attribute is plotted on the X-axis, while the dependent variable is plotted on the Y-axis.

3. Which library is used to create scatter plots?


To create a simple scatter plot in Matplotlib, we can use the `scatter` function provided
by the library. This function takes two arrays of data points – one for the x-axis and one
for the y-axis – and plots them as individual points on the graph.
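A minimal sketch with made-up values (not taken from the diabetes dataset):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]        # values for the x-axis
y = [10, 20, 15, 30, 25]   # values for the y-axis
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Simple scatter plot")
plt.show()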

4. What is Three dimensional Plotting?


The most basic three-dimensional plot is a line or a collection of scatter points created from
sets of (x, y, z) triples. In analogy with the more common two-dimensional plots discussed earlier,
these can be created using the ax.plot3D and ax.scatter3D functions.
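A minimal sketch of ax.plot3D on a synthetic curve (not taken from the dataset):

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d   # registers the 3D projection

t = np.linspace(0, 10, 200)
ax = plt.axes(projection='3d')
ax.plot3D(np.sin(t), np.cos(t), t)  # (x, y, z) triples along a spiral
plt.show()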

5. Which Python library is used for data visualization?


Matplotlib is a data visualization and 2-D plotting library for Python. It was initially released in
2003 and is the most popular and widely used plotting library in the Python community.

ASSIGNMENT QUESTIONS :

1. The Gapminder dataset provides population data from 1952 to 2007 (at 5-year intervals) for several countries around the world. Compare the populations of the European countries France, United Kingdom, Italy, Germany and Spain over this period using a line chart. Make appropriate modifications to the chart title, axis titles, legend, figure size, font size, colors etc. to make the chart readable and visually appealing. Hints (not all of these may be useful): you can use either Matplotlib or Plotly to create this chart; to select the data for the given countries, you may find the isin method of a Pandas series useful. (CO5, Evaluate, High)
2. diamonds_url points to a CSV file containing various attributes like carat, cut, color, clarity, price etc. for over 53,000 diamonds. Visualize the relationship between the carat (size of diamond) and price using a scatter plot. Instead of using the entire dataset for this visualization, just pick the diamonds with a clarity of "SI2" and color "E". Use the values of the "cut" column to color the dots in the scatter plot. Make appropriate modifications to the chart title, axis titles, legend, figure size, font size, colors etc. to make the chart readable and visually appealing. Hints (not all of these may be useful): you can use Seaborn or Plotly to create the scatter plot for this dataset; check the Stack Overflow answer on selecting data frame rows using multiple conditions. (CO5, Create, High)
3. The Planets dataset contains details about the 1,000+ extrasolar planets discovered up to 2014. Visualize the distribution of the masses of the planets (expressed as a multiple of the mass of Jupiter) using a histogram and a box plot. Make appropriate modifications to the chart title, axis titles, legend, figure size, font size, colors etc. to make the chart readable and visually appealing. Hints: you can use Matplotlib, Seaborn or Plotly to create these plots; if you're using Plotly, you can show both charts together (use the marginal argument of px.histogram). (CO5, Create, High)
4. The Job Automation Probability dataset, created during a Future of Employment study from 2013, estimates the probability of different jobs being automated in the 21st century due to computerization. Create a bar chart to show the 25 jobs requiring a "Bachelor's degree" (and no higher qualification) that are most likely to be automated. Make appropriate modifications to the chart title, axis titles, legend, figure size, font size, colors etc. to make the chart readable and visually appealing. (CO5, Create, High)

Ex.No:7 Visualizing Geographic Data with Basemap

Aim:
To Visualize Geographic Data using Basemap

Algorithm:

Step 1: Install the Basemap module using pip.
Step 2: Import the necessary modules and functions.
Step 3: Read the dataset path using read_csv().
Step 4: Draw coastal regions using drawcoastlines() and use show() to visualize them.
Step 5: Using drawcountries(), draw the country boundaries and use show() to visualize them.
Step 6: Draw latitude and longitude lines using drawparallels() and drawmeridians().

#Coastlines

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines()
plt.title("Coastlines", fontsize=20)
plt.show()

#Country_Boundaries

fig = plt.figure(figsize = (12,12))


m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries()
plt.title("Country boundaries", fontsize=20)
plt.show()

#latitude and longitudes

fig = plt.figure(figsize = (12,12))


m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents(color='coral',lake_color='aqua')
m.drawparallels(range(-90, 100, 10), color='k', linewidth=1.0, dashes=[4, 4], labels=[1, 0, 0, 0])
# meridians, as described in Step 6 of the algorithm
m.drawmeridians(range(-180, 200, 20), color='k', linewidth=1.0, dashes=[4, 4], labels=[0, 0, 0, 1])
plt.ylabel("Latitude", fontsize=15, labelpad=35)
plt.xlabel("Longitude", fontsize=15, labelpad=20)
plt.show()

#locating a region

fig = plt.figure(figsize = (10,8))


m=Basemap(projection='cyl',llcrnrlon=32.5,llcrnrlat=3,urcrnrlon=49,urcrnrlat=15, resolution = 'h')
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
m.drawcoastlines()
plt.show()
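As a small hedged extension (the coordinates below are chosen only for illustration), a single location can be marked by converting its longitude and latitude to map coordinates with the Basemap instance itself:

fig = plt.figure(figsize=(10, 8))
m = Basemap(projection='cyl')
m.drawcoastlines()
lon, lat = 80.27, 13.08   # illustrative longitude/latitude
x, y = m(lon, lat)        # convert lon/lat to map coordinates
m.plot(x, y, 'ro', markersize=6)
plt.title("Marking a location")
plt.show()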

Result:

Thus Visualization of Geographic Data is done and executed successfully using Basemap.

VIVA QUESTIONS :

1. What is meant by visualization?


Data visualization is the process of using visual elements like charts, graphs, or maps to
represent data. It translates complex, high-volume, or numerical data into a visual
representation that is easier to process.

2. List the plot and visualize geographical data.


Visual variables are distinctions that we can use to create and differentiate symbols on
a map. There are 10 visual distinctions available for symbolization: location, size, shape,
orientation, focus, arrangement, texture, saturation, hue, and value.

3. Why we are using Geographic Data in basemap?


These datasets can be used to plot coastlines, rivers and political boundaries on a map
at several different resolutions. Basemap uses the Geometry Engine-Open Source
(GEOS) library at the bottom to clip coastline and boundary features to the desired map
projection area.

4. What is a basemap to represent geographical data?


A map contains different layers of data with geographic information that serves as a
background, called a basemap. A basemap provides context for additional layers
overlaid on the reference map. It usually provides location references for features like
boundaries, rivers, lakes, roads, and highways.

5. How do you represent geographic data?


Two methods of representing geographic data in digital form are raster and vector. In
principle, both can be used to code fields and discrete objects, but in practice there is a
strong association between raster and fields and between vector and discrete objects.

ASSIGNMENT QUESTIONS:

1. How to plot GIS databases in Python with Basemap? (CO5, Apply, High)
2. How will you plot and visualize geographical data with the help of Basemap? State the procedure for it with an example. (CO5, Apply, High)
3. Using the plt.contour(), plt.contourf(), plt.imshow(), plt.colorbar() and plt.clabel() functions, visualize a contour plot. (CO5, Create, High)
4. Determine surface temperature data using Basemap. (CO5, Evaluate, High)
5. Customizing plot legends: determine the use of size and color in a scatter plot to convey information about the location, size, and population of California cities. (CO5, Evaluate, High)

Ex.No.8 Analyzing Selling Price of used Cars
Aim:
To Analyse the selling price of used cars
Algorithm:
Step 1:Download the dataset from
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
Step 2: Find the path location of downloaded dataset and convert it into csv format
Step 3: Import the packages
Step4:Set the path to the data file(.csv file)
Step5:Find if there are any null data or NaN data in our file. If any, remove them
Step6:Perform various data cleaning and data visualisation operations on your data.
Step7:Obtain the result
Source Code:
Import the modules:
# importing section
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
Check the first five entries of dataset:
# using the Csv file
df = pd.read_csv('output.csv')
# Checking the first 5 entries of dataset
df.head()
Defining headers for our dataset.
headers = ["symboling", "normalized-losses", "make",
"fuel-type", "aspiration","num-of-doors",
"body-style","drive-wheels", "engine-location",
"wheel-base","length", "width","height", "curb-weight",
"engine-type","num-of-cylinders", "engine-size",
"fuel-system","bore","stroke", "compression-ratio",
"horsepower", "peak-rpm","city-mpg","highway-mpg","price"]
df.columns=headers
df.head()
Finding the missing value if any.
data = df
# Finding the missing values
data.isna().any()
# Finding if missing values
data.isnull().any()
Converting mpg to L/100km and checking the data type of each column.
# converting mpg to L / 100km
data['city-mpg'] = 235 / df['city-mpg']
# the column is named 'city-mpg', so the rename key must match it
data.rename(columns={'city-mpg': "city-L / 100km"}, inplace=True)
print(data.columns)
# checking the data type of each column
data.dtypes
Price is of object type(string), it should be int or float :
data.price.unique()
# Here it contains '?', so we Drop it
data = data[data.price != '?']
# convert price from string to integer so that numeric operations (e.g. binning) work
data['price'] = data['price'].astype(int)
# checking it again
data.dtypes
Normalizing values by using simple feature scaling method examples(do for the
rest) and binning- grouping values
data['length'] = data['length']/data['length'].max()
data['width'] = data['width']/data['width'].max()
data['height'] = data['height']/data['height'].max()

# binning- grouping values


bins = np.linspace(min(data['price']), max(data['price']), 4)
group_names = ['Low', 'Medium', 'High']
data['price-binned'] = pd.cut(data['price'], bins,
labels = group_names,
include_lowest = True)

print(data['price-binned'])
plt.hist(data['price-binned'])
plt.show()

Doing descriptive analysis of data categorical to numerical values.

# categorical to numerical variables


pd.get_dummies(data['fuel-type']).head()

# descriptive analysis
# NaN are skipped
data.describe()
Plotting the data according to the price based on engine size.
# examples of box plot
plt.boxplot(data['price'])

# by using seaborn
sns.boxplot(x ='drive-wheels', y ='price', data = data)

# Predicting price based on engine size


# Known on x and predictable on y
plt.scatter(data['engine-size'], data['price'])
plt.title('Scatterplot of Enginesize vs Price')
plt.xlabel('Engine size')
plt.ylabel('Price')
plt.grid()
plt.show()
Grouping the data according to wheel, body-style and price.
# Grouping Data
test = data[['drive-wheels', 'body-style', 'price']]

data_grp = test.groupby(['drive-wheels', 'body-style'],
as_index = False).mean()
data_grp
Using the pivot method and plotting the heatmap according to the data obtained
by pivot method

# pivot method
data_pivot = data_grp.pivot(index = 'drive-wheels',
columns = 'body-style')
data_pivot

# heatmap for visualizing data


plt.pcolor(data_pivot, cmap ='RdBu')
plt.colorbar()
plt.show()
Obtaining the final result and showing it in the form of a graph
# Analysis of Variance- ANOVA
# returns f-test and p-value
# f-test = variance between sample group means divided by
# variation within sample group
# p-value = confidence degree
data_annova = data[['make', 'price']]
grouped_annova = data_annova.groupby(['make'])
annova_results_l = sp.stats.f_oneway(
grouped_annova.get_group('honda')['price'],
grouped_annova.get_group('subaru')['price']
)
print(annova_results_l)

# a strong correlation between a categorical variable and price is indicated
# if the ANOVA test gives a large F-statistic and a small p-value

# Correlation- measures dependency, not causation


sns.regplot(x ='engine-size', y ='price', data = data)
plt.ylim(0, )

Output:

First five entries of dataset:

headers for our dataset:

Finding the missing value:

Converting mpg to L/100km and checking the data type of each column.

Converting int or float:

Normalizing values by using simple feature scaling method

Descriptive analysis of data categorical to numerical values:

Plotting the data according to the price based on engine size:

Grouping the data according to wheel, body-style and price:

Plotting the heatmap according to the data obtained by pivot method

Final result:

Result:
Thus the analysis of the selling price of used cars is implemented and executed successfully.

Ex.No 9 Loan Approval Prediction
Aim:
To create a loan approval prediction model that checks whether an applicant's profile is suitable for
being granted a loan.
Algorithm:
Step 1: Download the Loan Approval Prediction dataset from
https://drive.google.com/file/d/1LIvIdqdHDFEGnfzIgEh4L6GFirzsE3US/view
Step 2: Find the path location of the downloaded dataset.
Step 3: Import the libraries Pandas, Seaborn and Matplotlib.
Step 4: Perform data preprocessing by getting the number of columns of object datatype.
Step 5: Visualize all the unique values in the columns using a barplot.
Step 6: The heatmap shows the correlation between Loan Amount and Applicant Income. It also
shows that Credit_History has a high impact on Loan_Status.
Step 7: Find out if there are any missing values in the dataset.
Step 8: If there are no missing values, proceed to model training using KNeighborsClassifier,
RandomForestClassifier, Support Vector Classifier (SVC) and Logistic Regression.
Step 9: Finally, the best model classifier is found to predict the loan approval process.
Source Code:
Importing Libraries and Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("LoanApprovalPrediction.csv")
data.head(5)
Data Preprocessing and Visualization
obj = (data.dtypes == 'object')
print("Categorical variables:",len(list(obj[obj].index)))
# Dropping Loan_ID column
data.drop(['Loan_ID'],axis=1,inplace=True)
obj = (data.dtypes == 'object')
object_cols = list(obj[obj].index)
plt.figure(figsize=(18,36))
index = 1
for col in object_cols:
    y = data[col].value_counts()
    plt.subplot(11, 4, index)
    plt.xticks(rotation=90)
    sns.barplot(x=list(y.index), y=y)
    index += 1
# Import label encoder
from sklearn import preprocessing

# label_encoder object knows how


# to understand word labels.
label_encoder = preprocessing.LabelEncoder()
obj = (data.dtypes == 'object')
for col in list(obj[obj].index):
    data[col] = label_encoder.fit_transform(data[col])
obj = (data.dtypes == 'object')
print("Categorical variables:",len(list(obj[obj].index)))
plt.figure(figsize=(12,6))

sns.heatmap(data.corr(),cmap='BrBG',fmt='.2f',
linewidths=2,annot=True)
sns.catplot(x="Gender", y="Married",
hue="Loan_Status",
kind="bar",
data=data)
for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())

data.isna().sum()
Splitting Dataset:
from sklearn.model_selection import train_test_split

X = data.drop(['Loan_Status'],axis=1)
Y = data['Loan_Status']
X.shape,Y.shape
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=1)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
Model Training and Evaluation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

from sklearn import metrics


knn = KNeighborsClassifier(n_neighbors=3)
rfc = RandomForestClassifier(n_estimators = 7,
criterion = 'entropy',
random_state =7)
svc = SVC()
lc = LogisticRegression()
# making predictions on the training set
for clf in (rfc, knn, svc, lc):
    clf.fit(X_train, Y_train)
    Y_pred = clf.predict(X_train)
    print("Accuracy score of ", clf.__class__.__name__, "=", 100 * metrics.accuracy_score(Y_train, Y_pred))
Prediction on the test set:
# making predictions on the testing set
for clf in (rfc, knn, svc, lc):
    clf.fit(X_train, Y_train)
    Y_pred = clf.predict(X_test)
    print("Accuracy score of ", clf.__class__.__name__, "=", 100 * metrics.accuracy_score(Y_test, Y_pred))
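A hedged extension for a fuller look at one classifier on the test set (it assumes rfc, X_train, X_test, Y_train and Y_test from the code above):

from sklearn.metrics import confusion_matrix, classification_report

rfc.fit(X_train, Y_train)
Y_pred = rfc.predict(X_test)
print(confusion_matrix(Y_test, Y_pred))       # counts of correct and incorrect predictions
print(classification_report(Y_test, Y_pred))  # precision, recall and F1 per class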

Output:

Importing Libraries and Dataset

Data Preprocessing and Visualization

Finding Missing Values:

Splitting Dataset:

Model Training and Evaluation

Prediction on the test set:

Result:
Thus the loan approval prediction, which checks whether an applicant's profile is suitable for being
granted a loan, is implemented successfully.

Eye Colour Detection

Aim:

To detect the eye colour of a person using OpenCV-Python.

Algorithm:

Step 1: Install the required libraries using the commands

pip install opencv-python

pip install numpy

Step 2: Import the libraries as required

import cv2

import numpy as np

Step 3: Load and preprocess the images needed, or load the dataset.

Step 4: Use an eye detection algorithm to locate the eyes in the image.

OpenCV provides a Haar cascade classifier for face and eye detection.

Step 5: Extract the eye regions to be analyzed.

Step 6: Analyze the color distribution within the eye regions to determine the predominant eye color.

Step 7: Display the original image with bounding boxes around the detected eyes and visualize the
determined eye color.

Source Code:

Eye Color detection.py

# face color analysis given eye center position

import tensorflow as tf

import tensorflow_probability as tfp

import sys

import os

import numpy as np

import cv2

import argparse

import time

from mtcnn.mtcnn import MTCNN

detector = MTCNN()

parser = argparse.ArgumentParser()

parser.add_argument('--input_path', default="./images/Ranveer-singh.jpg")

parser.add_argument('--input_type', default='image')

opt = parser.parse_args()

# define HSV color ranges for eyes colors

class_name = ("Blue", "Blue Gray", "Brown", "Brown Gray", "Brown Black", "Green",
"Green Gray", "Other")

EyeColor = {
    class_name[0]: ((166, 21, 50), (240, 100, 85)),
    class_name[1]: ((166, 2, 25), (300, 20, 75)),
    class_name[2]: ((2, 20, 20), (40, 100, 60)),
    class_name[3]: ((20, 3, 30), (65, 60, 60)),
    class_name[4]: ((0, 10, 5), (40, 40, 25)),
    class_name[5]: ((60, 21, 50), (165, 100, 85)),
    class_name[6]: ((60, 2, 25), (165, 20, 65))
}

def check_color(hsv, color):
    if (hsv[0] >= color[0][0]) and (hsv[0] <= color[1][0]) and (hsv[1] >= color[0][1]) and \
       (hsv[1] <= color[1][1]) and (hsv[2] >= color[0][2]) and (hsv[2] <= color[1][2]):
        return True
    else:
        return False

# define eye color category rules in HSV space
def find_class(hsv):
    color_id = 7
    for i in range(len(class_name) - 1):
        if check_color(hsv, EyeColor[class_name[i]]) == True:
            color_id = i
    return color_id

def eye_color(image):
    imgHSV = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    h, w = image.shape[0:2]
    imgMask = np.zeros((image.shape[0], image.shape[1], 1))

    result = detector.detect_faces(image)
    if result == []:
        print('Warning: Can not detect any face in the input image!')
        return

    bounding_box = result[0]['box']
    left_eye = result[0]['keypoints']['left_eye']
    right_eye = result[0]['keypoints']['right_eye']

    eye_distance = np.linalg.norm(np.array(left_eye) - np.array(right_eye))
    eye_radius = eye_distance / 15  # approximate

    cv2.circle(imgMask, left_eye, int(eye_radius), (255, 255, 255), -1)
    cv2.circle(imgMask, right_eye, int(eye_radius), (255, 255, 255), -1)

    cv2.rectangle(image,
                  (bounding_box[0], bounding_box[1]),
                  (bounding_box[0] + bounding_box[2], bounding_box[1] + bounding_box[3]),
                  (255, 155, 255),
                  2)

    cv2.circle(image, left_eye, int(eye_radius), (0, 155, 255), 1)
    cv2.circle(image, right_eye, int(eye_radius), (0, 155, 255), 1)

    eye_class = np.zeros(len(class_name), np.float64)
    for y in range(0, h):
        for x in range(0, w):
            if imgMask[y, x] != 0:
                eye_class[find_class(imgHSV[y, x])] += 1

    main_color_index = np.argmax(eye_class[:len(eye_class) - 1])
    total_vote = eye_class.sum()

    print("\n\nDominant Eye Color: ", class_name[main_color_index])
    print("\n **Eyes Color Percentage **")
    for i in range(len(class_name)):
        print(class_name[i], ": ", round(eye_class[i] / total_vote * 100, 2), "%")

    label = 'Dominant Eye Color: %s' % class_name[main_color_index]
    cv2.putText(image, label, (left_eye[0] - 10, left_eye[1] - 40),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (155, 255, 0))
    cv2.imshow('EYE-COLOR-DETECTION', image)

if __name__ == '__main__':
    # image
    if opt.input_type == 'image':
        image = cv2.imread(opt.input_path, cv2.IMREAD_COLOR)
        # detect color percentage
        eye_color(image)
        cv2.imwrite('sample/result.jpg', image)
        cv2.waitKey(0)
    else:
        print("Image is not detected")

Eyecolor-1.py:

import tkinter as tk

from tkinter import filedialog

import cv2

import numpy as np

def detect_eye_color(image_path):
    # Load the input image using OpenCV
    image = cv2.imread(image_path)

    # Convert the image to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Apply a thresholding technique to segment the eye region
    _, threshold = cv2.threshold(gray, 50, 255, cv2.THRESH_BINARY)

    # Use contour detection to find the contours of the segmented eyes
    contours, _ = cv2.findContours(threshold, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    # Iterate over the contours and filter out non-eye shapes
    eyes = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        aspect_ratio = w / float(h)
        if 0.5 <= aspect_ratio <= 1.5:
            eyes.append((x, y, w, h))

    # Determine the color of the eyes
    for (x, y, w, h) in eyes:
        eye = image[y:y+h, x:x+w]
        mean_color = cv2.mean(eye)
        if mean_color[2] / mean_color[1] >= 0.5:    # Reddish/Brownish
            print("Eye color: Reddish/Brownish")
        elif mean_color[2] / mean_color[0] >= 0.5:  # Greenish/Hazel
            print("Eye color: Greenish/Hazel")
        else:                                       # Blue or Gray
            print("Eye color: Blue/Gray")

# Create a Tkinter window

root = tk.Tk()

# Ask the user to select an image file

file_path = filedialog.askopenfilename(title='Select Image', filetypes=[('Image Files', '*.jpg *.png')])

# Call the detect_eye_color function on the selected image

detect_eye_color(file_path)

# Close the Tkinter window

root.destroy()

result.py:

import cv2

# Load the pre-trained Haar Cascade classifier for eye detection

eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_eye.xml')

# Function to detect eyes in an image

def detect_eyes(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Detect eyes in the image
    eyes = eye_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))

    for (x, y, w, h) in eyes:
        cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 2)

    return img

# Replace 'path_to_your_image.jpg' with the path of the image you want to test

image_path = "./images/Ranveer-singh.jpg"
# Detect eyes in the image

result_image = detect_eyes(image_path)

# Display the result

cv2.imshow('Eye Detection', result_image)

cv2.waitKey(0)

cv2.destroyAllWindows()

Output:

Result:
Thus eye colour detection of a person using OpenCV-Python is implemented and executed
successfully.
