
EX: NO: 1 DOWNLOAD, INSTALL AND EXPLORE THE FEATURES OF PYTHON PACKAGES

DATE :

AIM:
To download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels
and Pandas packages.
FEATURES:
Python is an open-source, object-oriented, interpreted language. Among its many
features, one important feature that makes Python a strong programming language is its
package ecosystem. A large number of external packages are written in Python and can be
installed and used depending on our requirements.
A Python package is essentially a directory of Python scripts. Each script is a module,
which can contain functions, methods, or new Python types created for a particular functionality.
NUMPY:
NumPy is a library that is integral to Python programming. Its features include:
• A high-performance N-dimensional array object.
• Tools for integrating code from C/C++ and FORTRAN.
• A multi-dimensional container for generic data.
• Support for complex operations such as linear algebra, Fourier transforms and random
number generation.
• Broadcasting functions.
• Data type definition capabilities for working with varied databases.
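Broadcasting, mentioned in the list above, lets NumPy combine arrays of different shapes without writing explicit loops. A minimal sketch (the values are made up for illustration):

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)
b = np.array([10, 20, 30])   # shape (3,) - broadcast across each row of a
print(a + b)
# [[11 22 33]
#  [14 25 36]]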

SCIPY:
• SciPy is an open-source scientific library for Python, distributed under a BSD
license. It is used to solve complex scientific and mathematical problems.
• It is built on top of NumPy, so importing SciPy also gives access to the underlying
NumPy functionality.
• SciPy is pronounced "Sigh Pie", and it depends on NumPy for appropriate and fast
N-dimensional array manipulation.
• It provides many user-friendly and efficient numerical routines for numerical
integration and optimization.
• The SciPy library supports integration, gradient optimization, special functions, ordinary
differential equation solvers, parallel programming tools, and much more; some SciPy
implementation exists in almost every complex numerical computation.
• SciPy is a data-processing and system-prototyping environment similar to
MATLAB. It is easy to use and provides great flexibility to scientists and engineers.
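As a small illustrative sketch of the numerical integration support mentioned above (the function and limits are chosen only for illustration), scipy.integrate.quad integrates sin(x) over [0, pi], whose exact value is 2:

import numpy as np
from scipy.integrate import quad

# quad returns (value, estimated absolute error)
result, error = quad(np.sin, 0, np.pi)
print(result)   # 2.0, up to floating-point error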

JUPYTER:
• Jupyter Notebooks are locally run web applications that contain live code, equations,
figures, interactive apps, and Markdown text, in which the default programming language is
Python.
• A Notebook will assume we are writing Python unless we tell it otherwise.
• Jupyter Notebooks support many programming languages through the use of kernels,
which act as bridges between the Notebook and the language. These include R, C++, and
JavaScript, among many others.
STATSMODELS:
• StatsModels is a Python module that provides classes and functions for estimating
many different statistical models, as well as for conducting statistical tests and statistical
data exploration. An extensive list of result statistics is available for each estimator.
• It is built on numeric and scientific libraries such as NumPy and SciPy.
• It includes various linear regression models such as ordinary least squares, generalized
least squares and weighted least squares.

PANDAS:
Pandas is used for data manipulation, analysis and cleaning. It is well suited
for many kinds of data, such as:
• Tabular data with heterogeneously-typed columns.
• Ordered and unordered time series data.
• Arbitrary matrix data with row and column labels.
• Unlabelled data.
• Any other form of observational or statistical data set.

PROCEDURE: (DOWNLOAD / INSTALLATION - windows)


1. Python version 3.7 should be installed beforehand.
2. To check whether Python exists, go to Search → type cmd → the Command Prompt
appears; then type the commands given below:
python -V        (or python --version)
Python 3.7.8rc1
python -m pip install numpy
(If already installed, the message "Requirement already satisfied" will be shown; otherwise
the installation continues and completes with a success message.)
python -m pip install scipy
python -m pip install statsmodels
python -m pip install jupyter
python -m pip install pandas
Note:
pip - Package Installer for Python is the de facto and recommended package-
management system written in Python and is used to install and manage software
packages. It connects to an online repository of public packages, called the Python Package
Index.
Package installation: NumPy, SciPy, Jupyter, StatsModel, Pandas
RESULT:
Thus the procedure to download, install and explore the features of various Python
packages was successfully carried out in a Windows environment.
EX: NO: 2 WORKING WITH NUMPY ARRAYS
DATE :

AIM:
To download, install NumPy package and perform various NumPy array
manipulations in Python.

DESCRIPTION:
NumPy is a Python package whose name stands for 'Numerical Python'. It is the core library
for scientific computing and contains a powerful N-dimensional array object. NumPy
arrays provide tools for integrating C, C++, etc. It is also useful for linear algebra,
random number generation and more. A NumPy array can be used as an efficient multi-
dimensional container for generic data, organized in rows and columns. We can initialize
NumPy arrays from nested Python lists and access their elements.

PROCEDURE:
If we have Python and PIP already installed on a system, then installation of NumPy is very
easy.
Installation – NumPy package:
C:\Users\User>pip install numpy

Once NumPy is installed, import it in your applications by adding the import keyword:
>>> import numpy as np

ARRAY CREATION:
Single-dimensional NumPy Array:
>>> import numpy as np
>>> a=np.array([1,2,3])
>>> print(a)
[1 2 3]

Multi-dimensional Numpy Array:


>>> a=np.array([(1,2,3),(4,5,6)])
>>> print(a)
[[1 2 3]
 [4 5 6]]

>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5])
>>>print(arr)
[1 2 3 4 5]

>>>print(type(arr))
<class 'numpy.ndarray'>

>>>a = np.array(42)
>>>b = np.array([1, 2, 3, 4, 5])
>>>c = np.array([[1, 2, 3], [4, 5, 6]])
>>>d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
>>>print(a.ndim)
0
>>>print(b.ndim)
1
>>>print(c.ndim)
2
>>>print(d.ndim)
3

ARRAY INDEXING:
Array indexing is the same as accessing an array element: we access an element by
referring to its index number. Indexes in NumPy arrays start at 0, meaning that
the first element has index 0, the second has index 1, and so on.

>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4])
>>>print(arr[0])
1
>>>print(arr[2])
3
>>>print(arr[4])
Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
print(arr[4])
IndexError: index 4 is out of bounds for axis 0 with size 4

>>>arr = np.array([1, 2, 3, 4])


>>>print(arr[2] + arr[3])
7

>>>arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])


>>>print('2nd element on 1st row: ', arr[0, 1])
2nd element on 1st row: 2
>>>print('5th element on 2nd row: ', arr[1, 4])
5th element on 2nd row: 10
>>>print('Last element from 2nd dim: ', arr[1, -1])
Last element from 2nd dim: 10
>>>arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
>>>print(arr[0, 1, 2])
6

ARRAY SLICING:
Slicing in Python means retrieving elements from one given index to another given index.
We pass a slice instead of an index, like this: [start:end].
We can also define the step, like this: [start:end:step].
If we don't pass start, it is considered 0. If we don't pass end, it is considered the length
of the array in that dimension. If we don't pass step, it is considered 1.
>>>arr = np.array([1, 2, 3, 4, 5, 6, 7])
>>>print(arr[1:5])
[2 3 4 5]
>>>print(arr[4:])
[5 6 7]
>>>print(arr[:4])
[1 2 3 4]
>>>print(arr[-3:-1])
[5 6]
>>>print(arr[1:5:2])
[2 4]
>>>print(arr[::2])
[1 3 5 7]
>>>print(arr[1, 1:4])
Traceback (most recent call last):
File "<pyshell#38>", line 1, in <module>
print(arr[1, 1:4])
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
>>>arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
>>>print(arr[0:2, 2])
[3 8]
>>>print(arr[0:2, 1:4])
[[2 3 4]
[7 8 9]]

ARRAY SHAPE / RESHAPE:


Array Shape - NumPy arrays have an attribute called shape that returns a tuple with each
index having the number of corresponding elements.
import numpy as np
>>>arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>>print(arr.shape)
(2, 4)

Array Reshape - By reshaping we can add or remove dimensions or change the number of
elements in each dimension.
#Converting a 1d array to 2d
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
>>>newarr = arr.reshape(4, 3)
>>>print(newarr)
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]

ARRAY ITERATION:
Iterating means looping through the elements of an array one by one.
>>>import numpy as np
>>> arr = np.array([1, 2, 3])
>>> for x in arr:
...     print(x)
1
2
3
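The loop above walks a 1-D array; on a 2-D array a plain for loop yields whole rows, so a nested loop (or np.nditer) is needed to reach the scalar elements. A minimal sketch:

import numpy as np

arr = np.array([[1, 2], [3, 4]])

# a plain loop over a 2-D array yields one row at a time
for row in arr:
    for x in row:
        print(x)             # prints 1, 2, 3, 4

# np.nditer visits every scalar element directly
for x in np.nditer(arr):
    print(x)                 # prints 1, 2, 3, 4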
ARRAY JOINING:
Joining is the process of combining the contents of two or more arrays into a single array.
>>>import numpy as np
>>>arr1 = np.array([1, 2, 3])
>>>arr2 = np.array([4, 5, 6])
>>>arr = np.concatenate((arr1, arr2))
>>>print(arr)
[1 2 3 4 5 6]

ARRAY SPLITTING:
Splitting is the reverse operation of joining: it breaks one array into multiple
subarrays.
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5, 6])
>>>newarr = np.array_split(arr,3)
>>>print(newarr)
[array([1, 2]), array([3, 4]), array([5, 6])]
>>>print(np.array_split(arr,5))
[array([1, 2]), array([3]), array([4]), array([5]), array([6])]

ARRAY SORTING:
Sorting is the process of arranging elements in an ordered sequence, either ascending
or descending.
>>>import numpy as np
#sorting numbers in ascending order
>>>arr = np.array([3, 2, 0, 1])
>>>print(np.sort(arr))
[0 1 2 3]
#sorting in alphabetical order
>>>arr = np.array(['banana', 'cherry', 'apple'])
>>>print(np.sort(arr))
['apple' 'banana' 'cherry']

SEARCHING ARRAYS:
Searching an array for a certain value returns the indexes where a match is found. To
search an array, use the where() method.

Find the indexes where the value is 4:


>>>arr = np.array([1, 2, 3, 4, 5, 4, 4])
>>>x = np.where(arr == 4)
>>>print(x)
(array([3, 5, 6], dtype=int32),)
>>>arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
>>>x = np.where(arr%2 == 0)
>>>print(x)
(array([1, 3, 5, 7], dtype=int32),)
>>>x = np.where(arr%2 == 1)
>>>print(x)
(array([0, 2, 4, 6], dtype=int32),)
DATA TYPES:
NumPy has some extra data types and refers to each data type with a one-character code,
like i for integers, u for unsigned integers, etc. Below is a list of all data types in NumPy
and the characters used to represent them.
• i - integer
• b - boolean
• u - unsigned integer
• f - float
• c - complex float
• m - timedelta
• M - datetime
• O - object
• S - string
• U - unicode string
• V - fixed chunk of memory for other type (void)

>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4])
>>>print(arr.dtype)
int32
>>>arr = np.array(['apple', 'banana', 'cherry'])
>>>print(arr.dtype)
<U6
>>>arr = np.array([1, 2, 3, 4], dtype='S')
>>>print(arr)
[b'1' b'2' b'3' b'4']
>>>print(arr.dtype)
|S1
>>>arr = np.array([1, 2, 3, 4], dtype='i4')
>>>print(arr)
[1 2 3 4]
>>>print(arr.dtype)
int32
>>>arr = np.array(['a', '2', '3'], dtype='i')
Traceback (most recent call last):
File "<pyshell#83>", line 1, in <module>
arr = np.array(['a', '2', '3'], dtype='i')
ValueError: invalid literal for int() with base 10: 'a'

>>>arr = np.array([1, 0, 3])


>>>newarr = arr.astype(bool)
>>>print(newarr)
[ True False True]
>>>print(newarr.dtype)
bool

RESULT:
Thus, working with NumPy arrays and the possible manipulations using arrays were
successfully executed.
EX: NO: 3 WORKING WITH PANDAS DATAFRAMES
DATE :

AIM:
To download, install pandas package and perform various manipulations in Python.
DESCRIPTION:
Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring, and manipulating data. The name "Pandas" refers to both
"Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008. It
allows us to analyze big data and draw conclusions based on statistical theories. Pandas
can clean messy data sets and make them readable and relevant. Relevant data is very
important in data science.
Pandas helps answer questions such as:
• Is there a correlation between two or more columns?
• What is the average value?
• What is the maximum value?
• What is the minimum value?
Pandas is also able to delete rows that are not relevant or that contain wrong values, such as
empty or NULL values. This is called cleaning the data.
PROCEDURE:
A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array or a table
with rows and columns. The methods used on pandas DataFrames are listed below.
FUNCTION - DESCRIPTION
index() - Returns the index (row labels) of the DataFrame
insert() - Inserts a column into a DataFrame
add() - Returns addition of DataFrame and other, element-wise (binary operator add)
sub() - Returns subtraction of DataFrame and other, element-wise (binary operator sub)
mul() - Returns multiplication of DataFrame and other, element-wise (binary operator mul)
div() - Returns floating division of DataFrame and other, element-wise (binary operator truediv)
unique() - Extracts the unique values in the DataFrame
nunique() - Returns the count of unique values in the DataFrame
value_counts() - Counts the number of times each unique value occurs within the Series
columns() - Returns the column labels of the DataFrame
axes() - Returns a list representing the axes of the DataFrame
isnull() - Creates a Boolean Series for extracting rows with null values
notnull() - Creates a Boolean Series for extracting rows with non-null values
between() - Extracts rows where a column value falls within a predefined range
isin() - Extracts rows from a DataFrame where a column value exists in a predefined collection
dtypes() - Returns a Series with the data type of each column; the result's index is the original DataFrame's columns
astype() - Converts the data types in a Series
values() - Returns a NumPy representation of the DataFrame, i.e. only the values in the DataFrame are returned; the axes labels are removed
sort_values() - Sorts a DataFrame in ascending or descending order of the passed column
sort_index() - Sorts the values in a DataFrame based on their index positions or labels instead of their values; useful when a DataFrame is built from two or more DataFrames and the index needs to be rebuilt
loc[] - Retrieves rows based on index label
iloc[] - Retrieves rows based on index position
ix[] - Retrieves DataFrame rows based on either index label or index position, combining features of .loc[] and .iloc[] (deprecated and removed in recent pandas versions)
rename() - Called on a DataFrame to change the names of the index labels or column names
columns - Attribute that can alternatively be assigned to change the column names
drop() - Deletes rows or columns from a DataFrame
pop() - Deletes rows or columns from a DataFrame
sample() - Pulls out a random sample of rows or columns from a DataFrame
nsmallest() - Pulls out the rows with the smallest values in a column
nlargest() - Pulls out the rows with the largest values in a column
shape() - Returns a tuple representing the dimensionality of the DataFrame
ndim() - Returns an int representing the number of axes / array dimensions: 1 for a Series, 2 for a DataFrame
dropna() - Allows the user to analyze and drop rows/columns with null values in different ways
fillna() - Lets the user replace NaN values with a value of their own
rank() - Ranks the values in a Series in order
query() - An alternate, string-based syntax for extracting a subset from a DataFrame
copy() - Creates an independent copy of a pandas object
duplicated() - Creates a Boolean Series and uses it to extract rows that have duplicate values
drop_duplicates() - An alternative option for identifying duplicate rows and removing them through filtering
set_index() - Sets the DataFrame index (row labels) using one or more existing columns
reset_index() - Resets the index of a DataFrame; sets a list of integers ranging from 0 to the length of the data as the index
where() - Checks a DataFrame for one or more conditions and returns the result accordingly; by default, rows not satisfying the condition are filled with NaN
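A brief hedged sketch exercising a few of the methods listed above on a toy DataFrame (the data is invented for illustration):

import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"], "score": [88, 95, 70]})

print(df.sort_values("score"))    # rows ordered by score, ascending
print(df.nlargest(2, "score"))    # the two rows with the highest scores
print(df.drop(columns=["name"]))  # drop() removes a column
print(df["score"].astype(float))  # astype() converts the data type of a Series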

A Pandas DataFrame can be created by loading datasets from existing storage, such as a
SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, a
dictionary, a list of dictionaries, etc.
• Create a simple Pandas DataFrame
import pandas as pd
data = {"calories": [420, 380, 390],"duration": [50, 40, 45]}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
calories duration
0 420 50
1 380 40
2 390 45
• Pandas uses the loc attribute to return one or more specified rows.
Return row 0:
#refer to the row index:
print(df.loc[0])
calories 420
duration 50
Name: 0, dtype: int64
Return row 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
calories duration
0 420 50
1 380 40

Named Indexes:
With the index argument, we can name our own indexes.
import pandas as pd
data = {"calories": [420, 380, 390],"duration": [50, 40, 45]}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
calories duration
day1 420 50
day2 380 40
day3 390 45
• Load Files into a DataFrame
If your data sets are stored in a file, Pandas can load them into a DataFrame. Load a
comma-separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
iso_code ... excess_mortality_cumulative_per_million
0 AFG ... NaN
1 AFG ... NaN
2 AFG ... NaN
3 AFG ... NaN
4 AFG ... NaN
... ... ... ...
166321 ZWE ... NaN
166322 ZWE ... NaN
166323 ZWE ... NaN
166324 ZWE ... NaN
166325 ZWE ... NaN

[166326 rows x 67 columns]

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
print(df)
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
Name Age
rank1 Tom 28
rank2 Jack 34
rank3 Steve 29
rank4 Ricky 42
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
a b c
first 1 2 NaN
second 5 10 20.0

Creating a DataFrame using List:


A DataFrame can be created using a single list or a list of lists.
import pandas as pd
# list of strings
lst = ['Pandas', 'SciPy', 'DataFrames', 'NumPy', 'Analytics']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
0 Pandas
1 SciPy
2 DataFrames
3 NumPy
4 Analytics

Creating a DataFrame from a dict of ndarrays/lists: To create a DataFrame from a dict of
ndarrays/lists, all the arrays must be of the same length. If an index is passed, then the
length of the index should be equal to the length of the arrays. If no index is passed,
then by default the index will be range(n), where n is the array length.
import pandas as pd
# intialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],'Age':[20, 21, 19, 18]}
df = pd.DataFrame(data)
print(df)
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18

Column Selection: In order to select a column in a Pandas DataFrame, we can access
the columns by calling them by their column names.
import pandas as pd
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age': [27, 24, 22, 32],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data)
# select two columns
print(df[['Name', 'Qualification']])
Name Qualification
0 Jai Msc
1 Princi MA
2 Gaurav MCA
3 Anuj Phd
Row Selection: Pandas provides methods to retrieve rows from a DataFrame. The
DataFrame.loc[] method retrieves rows by index label, and rows can also be selected by
passing an integer location to the iloc[] function; an iloc[] sketch follows the loc output below.

File used: country.csv


import pandas as pd
data = pd.read_csv("country.csv", index_col ="iso_code")
first = data.loc["AFG"]
second = data.loc["NOR"]
print(first, "\n\n\n", second)

iso_code continent location date total_cases


AFG Asia Afghanistan 2/24/2020 5
AFG Asia Afghanistan 2/25/2020 5
AFG Asia Afghanistan 2/26/2020 5
AFG Asia Afghanistan 2/27/2020 5
AFG Asia Afghanistan 2/28/2020 5
AFG Asia Afghanistan 2/29/2020 5

iso_code continent location date total_cases


NOR Europe Norway 10/1/2021 189915
NOR Europe Norway 10/2/2021 190224
NOR Europe Norway 10/3/2021 190533
NOR Europe Norway 10/4/2021 191017
NOR Europe Norway 10/5/2021 191599
NOR Europe Norway 10/6/2021 192079
NOR Europe Norway 10/7/2021 192587
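iloc[] works the same way with integer positions instead of labels; a small hedged sketch on the same file (row positions chosen for illustration):

import pandas as pd

data = pd.read_csv("country.csv")
print(data.iloc[0])     # first row, by integer position
print(data.iloc[0:3])   # first three rows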

Indexing a DataFrame using indexing operator []:


The indexing operator refers to the square brackets following an object. The .loc and .iloc
indexers also use the indexing operator to make selections. With the plain indexing operator,
we refer to columns as df[...].

Working with Missing Data:


Missing data can occur when no information is provided for one or more items or for a
whole unit. Missing data is a very big problem in real-life scenarios. Missing data is also
referred to as NA (Not Available) values in pandas.

isnull() and notnull():


Both functions help in checking whether a value is NaN or not. They can also be used on a
Pandas Series in order to find null values in a series.
import pandas as pd
import numpy as np
dict = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(dict)
print(df.isnull())
First Score Second Score Third Score
0 False False True
1 False False False
2 True False False
3 False True False

fillna(), replace() and interpolate():


All these functions help in filling null values in the datasets of a DataFrame. The interpolate()
function is used to fill NA values in the DataFrame, but it uses various interpolation
techniques to fill the missing values rather than hard-coding a value. A sketch of replace()
and interpolate() follows the fillna() example below.
import pandas as pd
import numpy as np
dict = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(dict)
print(df.fillna(0))
First Score Second Score Third Score
0 100.0 30.0 0.0
1 90.0 45.0 40.0
2 0.0 56.0 80.0
3 95.0 0.0 98.0
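As promised above, a hedged sketch of replace() and interpolate() on the same dictionary of scores:

import pandas as pd
import numpy as np

dict = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(dict)

# replace every NaN with -99
print(df.replace(to_replace=np.nan, value=-99))

# fill NaN by linear interpolation down each column
print(df.interpolate(method='linear', limit_direction='forward'))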

Iterating over rows and columns:


Pandas DataFrames consist of rows and columns, so in order to iterate over a DataFrame we
iterate it much as we would a dictionary. To iterate over rows we can use three functions:
iteritems(), iterrows() and itertuples(); a sketch using iterrows() follows the output below.
import pandas as pd
dict = {'name': ["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score': [90, 40, 80, 98]}
df = pd.DataFrame(dict)
print(df)
name degree score
0 aparna MBA 90
1 pankaj BCA 40
2 sudhir M.Tech 80
3 Geeku MBA 98
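A minimal sketch actually iterating the DataFrame above with iterrows():

import pandas as pd

dict = {'name': ["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score': [90, 40, 80, 98]}
df = pd.DataFrame(dict)

# iterrows() yields (index, row) pairs, one per DataFrame row
for index, row in df.iterrows():
    print(index, row['name'], row['score'])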

RESULT:
Thus, working with Pandas DataFrames and the possible manipulations were successfully
verified and executed.
EX: No: 4 DESCRIPTIVE ANALYTICS
DATE :

AIM:
To write a python code for reading data from text files, Excel and the web and exploring
various commands for doing descriptive analytics on the Iris data set.

DESCRIPTION:
Exploratory Data Analysis (EDA) is a technique to analyze data using visual
techniques. With this technique, we can get detailed information about the statistical
summary of the data, deal with duplicate values and outliers, and also see trends or
patterns present in the dataset.

Iris Dataset
The Iris dataset is considered the "Hello World" of data science. It contains five columns,
namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a
flowering plant; researchers have measured various features of different iris
flowers and recorded them digitally.

PROCEDURE:
CASE 1: READING DATA FROM EXCEL/CSV FILE
We will use the Pandas library to load the Iris data set from its CSV file and convert it into a
DataFrame, using the read_csv() method to read the CSV file.

import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
print(df)
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
[150 rows x 5 columns]

# Printing top 5 rows


print(df.head())
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
#Use the shape parameter to get the shape of the dataset.
print(df.shape)
(150, 5)
#To view the columns and their data types
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepallength 150 non-null float64
1 sepalwidth 150 non-null float64
2 petallength 150 non-null float64
3 petalwidth 150 non-null float64
4 class 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
The describe() function applies basic statistical computations to the dataset, such as extreme
values, count of data points, standard deviation, etc. Any missing or NaN value is
automatically skipped. describe() gives a good picture of the distribution of the data.
print(df.describe())
sepallength sepalwidth petallength petalwidth
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

CASE 2: READING DATA FROM TEXT FILE

file1 = open("/content/sample_data/Basics-Python.txt","r+")
print("Output of Read function is ")
print(file1.read())
print()
Output of Read function is
Python is a very popular general-purpose interpreted, interactive, object-
oriented, and high-level programming language. Python is dynamically-typed
and garbage-collected programming language. It was created by Guido van
Rossum during 1985- 1990. Like Perl, Python source code is also available
under the GNU General Public License (GPL).

Python is consistently rated as one of the world's most popular programming


languages. Python is fairly easy to learn, so if you are starting to learn
any programming language then Python could be your great choice. Today
various Schools, Colleges and Universities are teaching Python as their
primary programming language. There are many other good reasons which makes
Python as the top choice of any programmer:

Python is Open Source which means its available free of cost.


Python is simple and so easy to learn
Python is versatile and can be used to create many different things.
Python has powerful development libraries include AI, ML etc.
Python is much in demand and ensures high salary
CASE 3: READING DATA FROM WEB
# To download a file from the web using the wget module
# wget module to be installed
pip install wget
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-
wheels/public/simple/
Collecting wget
Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
Building wheel for wget (setup.py) ... done
Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9674
sha256=c0e498fded138e8bf764bbcda6a413bfac3d6338f40f4be9b5ce9384baa4c957
Stored in directory:
/root/.cache/pip/wheels/a1/b6/7c/0e63e34eb06634181c63adacca38b79ff8f35c37e3c1
3e3c02
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2

Then follow these two steps to download a file:


1. Import the wget module into your project.
2. Use wget.download() to download a file from a specific URL and save it on your machine.
For example, let’s get the Instagram icon using wget:
import wget
URL = "https://instagram.com/favicon.ico"
response = wget.download(URL, "instagram.ico")
As a result, we can see an Instagram icon appear in the folder of the program.
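pandas can also read a CSV straight from a URL without downloading it first; a hedged sketch (the UCI iris URL and the column names below are assumptions for illustration):

import pandas as pd

# read_csv accepts a URL directly (assumed location of the iris data)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
cols = ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"]
df = pd.read_csv(url, header=None, names=cols)
print(df.head())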

Descriptive statistics is used to understand data by calculating various statistical
values for the given numeric variables. For any given data, our approach is to understand it and
calculate various statistical values. This helps us identify the statistical tests that
can be done on the provided data.
Under descriptive statistics we can calculate the following values:
1. Central tendency - mean, median, mode
2. Dispersion - variance, standard deviation, range, interquartile range (IQR)
3. Skewness - symmetry of the data about the mean value
4. Kurtosis - peakedness of the data at the mean value
We have built-in functions to get these values for any given dataset.
# Changing the column headers in the Iris dataset
import pandas as pd
import numpy as np
df = pd.read_csv("/content/sample_data/Iris.csv")
# Note: constructing a new DataFrame from df with brand-new column labels
# reindexes df on those labels; since none of "A".."E" exist in df, every
# value comes out as NaN, as the output below shows.
data=pd.DataFrame(df,columns=list("ABCDE"))
print(data)
A B C D E
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
.. .. .. .. .. ..
145 NaN NaN NaN NaN NaN
146 NaN NaN NaN NaN NaN
147 NaN NaN NaN NaN NaN
148 NaN NaN NaN NaN NaN
149 NaN NaN NaN NaN NaN
[150 rows x 5 columns]
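To genuinely relabel the columns rather than reindex against non-existent labels, one hedged approach is to assign new labels or rebuild the frame from the raw values:

import pandas as pd

df = pd.read_csv("/content/sample_data/Iris.csv")

# assign new labels in place ...
df.columns = list("ABCDE")

# ... or rebuild the frame from the underlying values
data = pd.DataFrame(df.values, columns=list("ABCDE"))
print(data.head())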

1. Calculating Central Tendency


data['A'].mean()
data['A'].median()
data['A'].mode()
# mean - the average of the given numeric values
# median - the middle-most of the given values
# mode - the most frequently occurring value of the given numeric variable
# Mean, Median, Mode on Iris dataset
print(df)
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
[150 rows x 5 columns]

df['sepallength'].mean()
5.843333333333334

df['sepalwidth'].median()
3.0

df['petalwidth'].mode()
0 0.2
dtype: float64

df['class'].mode()
0 Iris-setosa
1 Iris-versicolor
2 Iris-virginica
dtype: object
2. Dispersion
Dispersion describes the variation present in a given variable, i.e. how close to or far from
the mean value the observations lie.
Variance - the average squared deviation from the mean value
Standard deviation - the square root of the variance
Range - the difference between the max and min value
Interquartile range (IQR) - the difference between Q3 and Q1, where Q3 is the 3rd
quartile value and Q1 is the 1st quartile value.
data['A'].var()
data['A'].std()
data['A'].max()-data['A'].min()
data['A'].quantile([.25,.5,.75])

df["sepalwidth"].var()
0.1880040268456376
df["sepallength"].std()
0.4335943113621737
df["sepallength"].max()-df["sepalwidth"].min()
5.9
df["petalwidth"].quantile([.25,.5,.75])
0.25 0.3
0.50 1.3
0.75 1.8
Name: petalwidth, dtype: float64

3. Skewness
Skewness measures the symmetry of data about the mean value. Symmetry
means an equal distribution of observations above and below the mean.
skewness = 0: the data is symmetric about the mean.
skewness negative: the data is not symmetric and the left-side tail of the density plot is
longer than the right-side tail.
skewness positive: the data is not symmetric and the right-side tail of the density plot is
longer than the left-side tail.
We can find the skewness of a given variable as below.
data['A'].skew()
df["sepallength"].skew()
0.3149109566369728
df["sepalwidth"].skew()
0.3340526621720866
df["class"].skew()
ValueError: could not convert string to float: 'Iris-setosa'
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/nanops.py in _f(*args,
**kwargs)
99 # object arrays that contain strings
100 if is_object_dtype(args[0]):
--> 101 raise TypeError(e) from e
102 raise
103
TypeError: could not convert string to float: 'Iris-setosa'
4. Kurtosis
Kurtosis describes the peakedness (or flatness) of a density plot relative to the normal
distribution. Dr. Wheeler defines kurtosis as: "The kurtosis parameter is a measure
of the combined weight of the tails relative to the rest of the distribution." In other words,
it measures the tail heaviness of a given distribution.
kurtosis = 0: the peakedness of the graph equals that of the normal distribution.
kurtosis negative: the peakedness of the graph is less than the normal distribution (a flatter plot).
kurtosis positive: the peakedness of the graph is more than the normal distribution (a more
peaked plot).
We can find the kurtosis of a given variable as below.
data['A'].kurt()

df["sepalwidth"].kurt()
0.2907810623654279

df["sepallength"].kurt()
-0.5520640413156395

Let us see the graph of a given variable and interpret the skewness and
peakedness of its distribution.
import seaborn as sns
sns.distplot(df["sepallength"], hist=True, kde=True)
/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2619:
FutureWarning: `distplot` is a deprecated function and will be removed in a
future version. Please adapt your code to use either `displot` (a figure-
level function with similar flexibility) or `histplot` (an axes-level
function for histograms).
warnings.warn(msg, FutureWarning)
<matplotlib.axes._subplots.AxesSubplot at 0x7fa94e2957d0>

Density plot of variable 'sepallength'

In the above graph, we can see that the left and right sides of the plot are roughly equally
distributed, and the histogram is relatively flat around the mean; the kurtosis of this
distribution is close to normal.
Checking Missing Values
Missing values can occur when no information is provided for one or more items or for a
whole unit. We will use the isnull() method.
df.isnull().sum()
sepallength 0
sepalwidth 0
petallength 0
petalwidth 0
class 0
dtype: int64
Checking Duplicates
Let’s see if our dataset contains any duplicates or not. Pandas drop_duplicates() method
helps in removing duplicates from the data frame.

#interactive table view


data = df.drop_duplicates(subset ="class",)
data
     sepallength  sepalwidth  petallength  petalwidth            class
0            5.1         3.5          1.4         0.2      Iris-setosa
50           7.0         3.2          4.7         1.4  Iris-versicolor
100          6.3         3.3          6.0         2.5   Iris-virginica

Let's see whether the dataset is balanced or not, i.e. whether the classes contain equal
numbers of rows. We will use the Series.value_counts() function, which returns a
Series containing counts of unique values.

df.value_counts("sepalwidth")
sepalwidth
3.0 26
2.8 14
3.2 13
3.4 12
3.1 12
2.9 10
2.7 9
2.5 8
3.3 6
3.5 6
3.8 6
2.6 5
2.3 4
2.4 3
2.2 3
3.6 3
3.7 3
3.9 2
4.1 1
4.2 1
2.0 1
4.0 1
4.4 1
dtype: int64

Data Visualization
Visualizing the target column - our target column will be the sepalwidth column because, at
the end, we need the result according to the sepalwidth only. Let's see a countplot for it.
(We will use the Matplotlib and Seaborn libraries for data visualization.)

import seaborn as sns


import matplotlib.pyplot as plt

sns.countplot(x="sepalwidth", data=df, )
plt.show()
Comparing Sepal Length and Sepal Width
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x="sepallength", y="sepalwidth",hue="class", data=df, )

# Placing Legend outside the Figure


plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()

Histograms
Histograms allow us to see the distribution of data for various columns. They can be used for
univariate as well as bivariate analysis.

import seaborn as sns


import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(10,10))

axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df["sepallength"], bins=7)

axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df["sepalwidth"], bins=5);

axes[1,0].set_title("Petal Length")
axes[1,0].hist(df["petallength"], bins=6);
axes[1,1].set_title("Petal Width")
axes[1,1].hist(df["petalwidth"], bins=6);

Output:
• The highest frequency of the sepal length is between 30 and 35, which corresponds to values between 5.5 and 6.
• The highest frequency of the sepal width is around 70, which corresponds to values between 3.0 and 3.5.
• The highest frequency of the petal length is around 50, which corresponds to values between 1 and 2.
• The highest frequency of the petal width is between 40 and 50, which corresponds to values between 0.0 and 0.5.

RESULT:
Thus the python code for reading data from text files, Excel and the web and exploring
various commands for doing descriptive analytics on the Iris data set have been applied
and results retrieved successfully.
EX: No: 5 EXPLORATORY DATA ANALYSIS
DATE:

AIM:
To write a python code for performing Univariate analysis, Bivariate analysis,
Multiple Regression analysis and compare the results obtained on diabetes data set from
UCI and Pima Indians Diabetes data sets.

DESCRIPTION:
Exploratory Data Analysis is majorly performed using the following methods:
• Univariate analysis: provides summary statistics for each field in the raw data set, i.e. a
summary of one variable at a time. Ex: CDF, PDF, box plot, violin plot.
• Bivariate analysis: is performed to find the relationship between each variable in the
dataset and the target variable of interest, i.e. using two variables and finding the
relationship between them. Ex: box plot, violin plot.
• Multivariate analysis: is performed to understand interactions between different fields
in the dataset, i.e. finding interactions among more than two variables. Ex: pair plot and 3D
scatter plot (a pair plot sketch follows this list).
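A hedged sketch of the pair plot mentioned above, using the pima-diabetes file and its Outcome column as they appear later in this exercise:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("pima-diabetes.csv")

# one scatter panel per pair of numeric columns, colored by diabetes outcome
sns.pairplot(df, hue="Outcome")
plt.show()
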
UCI Diabetes Dataset:
This dataset contains the distribution for 70 sets of data recorded on diabetes patients
(several weeks' to months' worth of glucose, insulin, and lifestyle data per patient + a
description of the problem domain).

Pima Indians Diabetes Dataset:


This dataset is originally from the National Institute of Diabetes and Digestive and Kidney
Diseases. The objective of the dataset is to diagnostically predict whether or not a patient
has diabetes, based on certain diagnostic measurements included in the dataset. Several
constraints were placed on the selection of these instances from a larger database. In
particular, all patients here are females at least 21 years old of Pima Indian heritage.

PROCEDURE:
(5a) Univariate Analysis - Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis

import pandas as pd
import numpy as np
df = pd.read_csv("diabetes.csv")
print(df)
Age Gender Polyuria ... Alopecia Obesity class
0 40 Male No ... Yes Yes Positive
1 58 Male No ... Yes No Positive
2 41 Male Yes ... Yes No Positive
3 45 Male No ... No No Positive
4 60 Male Yes ... Yes Yes Positive
.. ... ... ... ... ... ... ...
515 39 Female Yes ... No No Positive
516 48 Female Yes ... No No Positive
517 58 Female Yes ... No Yes Positive
518 32 Female No ... Yes No Negative
519 42 Male No ... No No Negative
[520 rows x 17 columns]
>>>print(df['Age'].mean())
48.02884615384615
>>>print(df['Age'].median())
47.5
>>>print(df['Age'].mode())
0 35
dtype: int64
>>>print(df["Age"].var())
147.65812583370388
>>>print(df["Age"].std())
12.151465995249458
>>>print(df["Age"].skew())
0.3293593578272701
>>>print(df["Age"].kurt())
-0.19170941407070163

Data-Visualization:(pima-diabetes.csv)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
#Load the data
filepath='pima-diabetes.csv';
df = pd.read_csv(filepath)
Data_X= df.copy(deep=True)
Data_X= Data_X.drop(['Outcome'],axis=1)
plt.rcParams['figure.figsize']=[40,40]
#Plotting Histogram of Data
Data_X.hist(bins=40)
plt.show()
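The description above lists box and violin plots as univariate/bivariate tools; a hedged sketch on the same file (column choices are for illustration):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("pima-diabetes.csv")

# spread of Glucose within each Outcome group
sns.boxplot(x="Outcome", y="Glucose", data=df)
plt.show()

# a violin plot adds the estimated density to the same comparison
sns.violinplot(x="Outcome", y="Glucose", data=df)
plt.show()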
(5b) Bivariate Analysis – Linear and Logistic Regression
Simple Linear Regression - This is an approach for predicting a response using a single
feature, under the assumption that the two variables are linearly related. So we try to find
a linear function that predicts the response value (y) as accurately as possible as a function
of the feature, or independent variable, (x). Consider a dataset where we have a value of
the response y for every feature x (the n = 10 observations used in the code below).

For generality, we define:
x as the feature vector, i.e. x = [x_1, x_2, ..., x_n],
y as the response vector, i.e. y = [y_1, y_2, ..., y_n] for n observations (n = 10).

[Scatter plot of the dataset]

Now, the task is to find the line that best fits the above scatter plot, so that we can
predict the response for any new feature value (i.e. a value of x not present in the dataset).
This line is called the regression line, and its equation is represented as:
h(x_i) = b_0 + b_1 * x_i
Here,
• h(x_i) represents the predicted response value for the i-th observation.
• b_0 and b_1 are regression coefficients and represent the y-intercept and slope of the
regression line respectively.
Least squares estimates them as b_1 = SS_xy / SS_xx and b_0 = mean(y) - b_1 * mean(x),
where SS_xy is the cross-deviation of x and y and SS_xx is the deviation about x; this is
exactly what estimate_coef() below computes.

SOURCE CODE:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m", marker = "o", s = 30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color = "g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output:
Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
And the graph obtained looks like this: [scatter plot with the fitted regression line]
Logistic Regression:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
filepath='pima-diabetes.csv'
df = pd.read_csv(filepath)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
LR = LogisticRegression()
LR.fit(X_train, y_train)
y_pred = LR.predict(X_test)
print("Accuracy ", LR.score(X_test, y_test)*100)
sns.set(font_scale=1.5)
cm = confusion_matrix(y_pred, y_test)
sns.heatmap(cm, annot=True, fmt='g')
plt.show()

Output:
Accuracy 79.16666666666666

(5c) Multiple Regression Analysis


import pandas as pd
from sklearn import linear_model
df = pd.read_csv("pima-diabetes.csv")
X = df[['Glucose', 'BloodPressure']]
y = df['Age']
regr = linear_model.LinearRegression()
regr.fit(X, y)
#Predict age based on Glucose and BloodPressure
predictedage = regr.predict([[185, 145]])
print(predictedage)

Output:
[48.13025197]
(5d) Comparative Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
#Load the data
filepath='pima-diabetes.csv';
df = pd.read_csv(filepath)
plt.style.use("classic")
plt.figure(figsize=(10,10))
sns.distplot(df[df['Outcome'] == 0]["Pregnancies"], color='green') # Healthy - green
sns.distplot(df[df['Outcome'] == 1]["Pregnancies"], color='red') # Diabetic - Red
plt.title('Healthy vs Diabetic by Pregnancy', fontsize=15)
plt.xlim([-5,20])
plt.grid(linewidth = 0.7)
plt.show()

Output:

From the above graph, we can infer that pregnancy is not a likely cause of diabetes, as the
distributions for the Healthy and Diabetic groups are almost the same.
# diabetes.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
#Load the data
filepath='diabetes.csv';
df = pd.read_csv(filepath)
plt.style.use("classic")
plt.figure(figsize=(10,10))
sns.distplot(df[df['Gender'] == 'Male']["Age"], color='green')
sns.distplot(df[df['Polyuria'] == 'No']["Age"], color='red')
plt.title('Male vs Polyuria by Age', fontsize=15)
plt.xlim([-5,20])
plt.grid(linewidth = 0.7)
plt.show()

Output:

RESULT:
Thus the python code for performing Univariate, Bivariate, multiple regression analysis
and comparison between dependent and independent variables on the diabetes or pima-
diabetes data set have been applied and results retrieved successfully.
EX: NO: 6 PLOTTING FUNCTIONS
DATE :

AIM:
To write a python code for performing various plotting functions on the UCI data sets.
DESCRIPTION:
The plotting methods are used to represent data in a visual format to enhance
data analytics. The various methods are listed below.
a. Normal curves - The normal distribution is a probability function used in statistics that
tells how data values are distributed. It is the most important probability
distribution function in statistics because of its advantages in real case scenarios.
b. Density and contour plots - These represent events with a density or contour
gradient depending on the number of events. In a density plot, the color of
an area reflects how many events are in that position of the plot.
c. Correlation and scatter plots - A scatter plot displays the strength, direction, and form
of the relationship between two quantitative variables. A correlation coefficient measures
the strength of that relationship.
d. Histograms - A histogram is a graphical representation of data points organized into
user-specified ranges. Similar in appearance to a bar graph, the histogram condenses a data
series into an easily interpreted visual by taking many data points and grouping them into
logical ranges or bins.
e. Three-dimensional plotting - 3D plots are enabled by importing the mplot3d toolkit,
included with the main Matplotlib installation:
from mpl_toolkits import mplot3d
Once this submodule is imported, three-dimensional axes can be created by passing the
keyword projection='3d' to any of the normal axes creation routines.

SOURCE CODE:
# Normal Curve
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
x=data.Glucose[0:50]
mean=st.mean(x)
sd=st.stdev(x)
pyplot.plot(x,norm.pdf(x,mean,sd))
pyplot.title("Normal plot")
pyplot.show()
OUTPUT:

#density plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()

OUTPUT:

#contour plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
x=data.BloodPressure[0:2]
y=data.Glucose[0:2]
z=((data.BMI[0:2],data.Age[0:2]))
pyplot.figure(figsize=(7,5))
pyplot.title("Contour plot")
contours=pyplot.contour(x,y,z)
pyplot.show()

OUTPUT:

#correlation plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
names=["Pregnancies", "Glucose","BloodPressure","SkinThickness","Insulin",
"BMI","DiabetesPedigreeFunction", "Age"]
# restrict the matrix to the eight named columns so the tick labels line up
correlation = data[names].corr()
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlation, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,8,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.title("Correlation")
pyplot.show()
OUTPUT:

#scatter plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
scatter_matrix(data)
pyplot.show()

OUTPUT:
#Histograms
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
data.hist()
pyplot.show()

OUTPUT:

#three dimensional plotting


import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d

data = pd.read_csv('diabetes.csv')
fig = pyplot.figure()

ax = pyplot.axes(projection='3d')
zline = np.array(data.BMI)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Blues')
pyplot.show()

OUTPUT:

RESULT:
Thus the python code for performing various plotting functions like scatter plot, normal
plot, histogram, density plot, contour plot, 3D plot were applied on the UCI diabetes data
sets and visualized successfully.
EX: NO: 7 GEOGRAPHIC DATA WITH BASEMAP
DATE :

AIM:
To write a python code for visualizing the geographic data with Basemap package and
methods by applying various projection techniques.
DESCRIPTION:
One common type of visualization in data science is that of geographic data. Matplotlib's
main tool for this type of visualization is the Basemap toolkit, one of several
Matplotlib toolkits that live under the mpl_toolkits namespace. More modern solutions
such as Leaflet or the Google Maps API may be a better choice for more intensive map
visualizations, but Basemap is still a useful tool for Python users to have in their virtual
toolbelts.

PROCEDURE:
#Basemap and other packages installation
!pip install basemap
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting basemap
  Downloading basemap-1.3.6-cp38-cp38-manylinux1_x86_64.whl (863 kB)
     |████████████████████████████████| 863 kB 14.5 MB/s
Collecting basemap-data<1.4,>=1.3.2
  Downloading basemap_data-1.3.2-py2.py3-none-any.whl (30.5 MB)
     |████████████████████████████████| 30.5 MB 1.4 MB/s
Requirement already satisfied: matplotlib<3.7,>=1.5 in /usr/local/lib/python3.8/dist-packages (from basemap) (3.2.2)
Collecting pyproj<3.5.0,>=1.9.3
  Downloading pyproj-3.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     |████████████████████████████████| 7.8 MB 55.4 MB/s
Collecting pyshp<2.4,>=1.2
  Downloading pyshp-2.3.1-py2.py3-none-any.whl (46 kB)
     |████████████████████████████████| 46 kB 3.6 MB/s
Collecting numpy<1.24,>=1.22
  Downloading numpy-1.23.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
     |████████████████████████████████| 17.1 MB 46.7 MB/s
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib<3.7,>=1.5->basemap) (2.8.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib<3.7,>=1.5->basemap) (3.0.9)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.8/dist-packages (from matplotlib<3.7,>=1.5->basemap) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib<3.7,>=1.5->basemap) (1.4.4)
Requirement already satisfied: certifi in /usr/local/lib/python3.8/dist-packages (from pyproj<3.5.0,>=1.9.3->basemap) (2022.9.24)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/dist-packages (from python-dateutil>=2.1->matplotlib<3.7,>=1.5->basemap) (1.15.0)
Installing collected packages: numpy, pyshp, pyproj, basemap-data, basemap
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.6
    Uninstalling numpy-1.21.6:
      Successfully uninstalled numpy-1.21.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scipy 1.7.3 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.23.5 which is incompatible.
Successfully installed basemap-1.3.6 basemap-data-1.3.2 numpy-1.23.5 pyproj-3.4.0 pyshp-2.3.1

!pip install basemap-data


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-
wheels/public/simple/
Requirement already satisfied: basemap-data in /usr/local/lib/python3.8/dist-packages (1.3.2)

!pip install basemap-data-hires


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-
wheels/public/simple/
Collecting basemap-data-hires
  Downloading basemap_data_hires-1.3.2-py2.py3-none-any.whl (91.1 MB)
     |████████████████████████████████| 91.1 MB 57 kB/s
Installing collected packages: basemap-data-hires
Successfully installed basemap-data-hires-1.3.2

!pip install chain

Requirement already satisfied: chain in /usr/local/lib/python3.8/dist-packages (1.0)
(Note: the chain used below is itertools.chain from the Python standard library, so this install step is not strictly required.)

SOURCE CODE:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5)
plt.show()
The useful thing is that the globe shown here is not a mere image; it is a fully-functioning
Matplotlib axes that understands spherical coordinates and allows us to easily overplot
data on the map.

fig = plt.figure(figsize=(8, 8))


m=Basemap(projection='lcc', resolution=None,width=8E6, height=8E6,lat_0=45, lon_0=-100)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting


x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);

Map Projections:
The Basemap package implements several dozen such projections, all referenced by a short
format code.

from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)

    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))

    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)

    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')
Cylindrical projections
The simplest of map projections are cylindrical projections, in which lines of constant
latitude and longitude are mapped to horizontal and vertical lines, respectively. This type
of mapping represents equatorial regions quite well, but results in extreme distortions near
the poles. The spacing of latitude lines varies between different cylindrical projections,
leading to different conservation properties, and different distortion near the poles.

fig = plt.figure(figsize=(8, 6), edgecolor='w')


m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)

Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of constant
longitude) remain vertical; The Mollweide projection (projection='moll') is one common
example of this, in which all meridians are elliptical arcs. It is constructed so as to preserve
area across the map: though there are distortions near the poles, the area of small patches
reflects the true area. Other pseudo-cylindrical projections are the sinusoidal
(projection='sinu') and Robinson (projection='robin') projections.

fig = plt.figure(figsize=(8, 6), edgecolor='w')


m = Basemap(projection='moll', resolution=None,lat_0=0, lon_0=0)
draw_map(m)
Perspective projections
Perspective projections are constructed using a particular choice of perspective point,
similar to if you photographed the Earth from a particular point in space (a point which, for
some projections, technically lies within the Earth!). One common example is the
orthographic projection (projection='ortho'), which shows one side of the globe as seen
from a viewer at a very long distance.
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=0)
draw_map(m)

Conic projections
A Conic projection projects the map onto a single cone, which is then unrolled. This can
lead to very good local properties, but regions far from the focus point of the cone may
become much distorted.

fig = plt.figure(figsize=(8, 8))


m = Basemap(projection='lcc', resolution=None, lon_0=0, lat_0=50, lat_1=45, lat_2=55,
width=1.6E7, height=1.2E7)
draw_map(m)
Example – Dataset (California_cities.csv)
import pandas as pd
cities = pd.read_csv('/content/sample_data/california_cities.csv')
# Extract the data we're interested in
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values
# 1. Draw the map background
fig = plt.figure(figsize=(8, 8))
m=Basemap(projection='lcc', resolution='h', lat_0=37.5, lon_0=-119, width=1E6, height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')
# 2. scatter city data, with color reflecting population
# and size reflecting area
m.scatter(lon, lat, latlon=True,c=np.log10(population), s=area, cmap='Reds', alpha=0.5)
# 3. create colorbar and legend
plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7)
# make legend with dummy points
for a in [100, 300, 500]:
    plt.scatter([], [], c='k', alpha=0.5, s=a, label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False, labelspacing=1, loc='lower left');

This shows us where larger populations of people have settled in California: they are clustered near
the coast in the Los Angeles and San Francisco areas, stretched along the highways in the flat
Central Valley, and almost completely absent from the mountainous regions along the borders of the
state.

RESULT:
Thus the python code for visualizing the geographic data with various types of
projections has been applied and results retrieved successfully.
