DATE :
AIM:
To download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels
and Pandas packages.
FEATURES:
Python is an open-source, object-oriented, interpreted language. Among its many features, one of the most important ones that make Python a strong programming language is its packages. A large number of external packages are written in Python and can be installed and used depending on our requirements.
A Python package is essentially a directory of Python scripts. Each script is a module, which can contain functions, methods, or new Python types created for a particular functionality.
NUMPY:
NumPy is a library that is integral to Python programming. Its features are as follows:
A high-performance N-dimensional array object.
Tools for integrating code written in C/C++ and FORTRAN.
A multi-dimensional container for generic data.
Complex operations such as linear algebra, Fourier transforms, and random number generation.
Broadcasting functions (see the short sketch after this list).
Data type definition capability to work with varied databases.
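As a brief illustration of broadcasting (an added sketch, not part of the original manual), NumPy can combine arrays of different shapes without explicit loops:
import numpy as np
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
# a scalar is broadcast across every element of the 2-D array
print(matrix + 10)              # [[11 12 13] [14 15 16]]
# a 1-D array of shape (3,) is broadcast across each row of shape (2, 3)
row = np.array([100, 200, 300])
print(matrix + row)             # [[101 202 303] [104 205 306]]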
SCIPY:
SciPy is an open-source scientific library for Python distributed under a BSD license. It is used to solve complex scientific and mathematical problems.
It is built on top of the NumPy extension, which means that if we import SciPy, there is no need to import NumPy separately.
SciPy is pronounced "Sigh Pie", and it depends on NumPy for appropriate and fast N-dimensional array manipulation.
It provides many user-friendly and efficient routines for numerical integration and optimization.
The SciPy library supports integration, gradient optimization, special functions, ordinary differential equation solvers, parallel programming tools, and much more. SciPy implementations can be found in almost every complex numerical computation.
SciPy is a data-processing and system-prototyping environment similar to MATLAB. It is easy to use and gives great flexibility to scientists and engineers.
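For instance, numerical integration with scipy.integrate takes only a couple of lines (an illustrative sketch, not part of the original manual):
import numpy as np
from scipy import integrate
# integrate sin(x) from 0 to pi; the exact answer is 2
result, error = integrate.quad(np.sin, 0, np.pi)
print(result)   # approximately 2.0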
JUPYTER:
Jupyter Notebooks are locally run web applications that contain live code, equations, figures, interactive apps, and Markdown text, with Python as the default programming language.
A Notebook assumes we are writing Python unless we tell it otherwise.
Jupyter Notebooks support many programming languages through the use of kernels,
which act as bridges between the Notebook and the language. These include R, C++, and
JavaScript, among many others.
STATSMODELS:
StatsModels is a Python module that provides classes and functions for the estimation of
many different statistical models, as well as for conducting statistical tests, and statistical
data exploration. An extensive list of result statistics is available for each estimator.
It is built on numeric and scientific libraries like NumPy and SciPy.
It includes various models of linear regression like ordinary least squares, generalized
least squares, weighted least squares, etc.
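A minimal ordinary least squares sketch with statsmodels (illustrative; the small x/y data set below is made up for demonstration):
import numpy as np
import statsmodels.api as sm
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])
X = sm.add_constant(x)        # add the intercept column
model = sm.OLS(y, X).fit()    # fit by ordinary least squares
print(model.params)           # estimated intercept and slope
print(model.summary())        # extensive table of result statistics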
PANDAS:
Pandas is used for data manipulation, analysis, and cleaning. Pandas is well suited for many different kinds of data, such as:
Tabular data with heterogeneously-typed columns.
Ordered and unordered time series data.
Arbitrary matrix data with row & column labels.
Unlabelled data.
Any other form of observational or statistical data sets.
EX: NO: 2 WORKING WITH NUMPY ARRAYS
DATE :
AIM:
To download, install NumPy package and perform various NumPy array
manipulations in Python.
DESCRIPTION:
NumPy is a Python package whose name stands for 'Numerical Python'. It is the core library for scientific computing and contains a powerful N-dimensional array object. NumPy arrays provide tools for integrating C, C++, etc. NumPy is also useful for linear algebra, random number generation, etc. A NumPy array can be used as an efficient multi-dimensional container for generic data, arranged in rows and columns. We can initialize NumPy arrays from nested Python lists and access their elements.
PROCEDURE:
If we have Python and PIP already installed on a system, then installation of NumPy is very
easy.
Installation – NumPy package:
C:\Users\User>pip install numpy
Once NumPy is installed, import it in your applications by adding the import keyword:
>>> import numpy as np
ARRAY CREATION:
Single-dimensional NumPy Array:
>>> import numpy as np
>>> a=np.array([1,2,3])
>>> print(a)
[1 2 3]
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5])
>>>print(arr)
[1 2 3 4 5]
>>>print(type(arr))
<class 'numpy.ndarray'>
>>>a = np.array(42)
>>>b = np.array([1, 2, 3, 4, 5])
>>>c = np.array([[1, 2, 3], [4, 5, 6]])
>>>d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
>>>print(a.ndim)
0
>>>print(b.ndim)
1
>>>print(c.ndim)
2
>>>print(d.ndim)
3
ARRAY INDEXING:
Array indexing is the same as accessing an array element. We can access an array element
by referring to its index number. The indexes in NumPy arrays start with 0, meaning that
the first element has index 0, and the second has index 1 etc.
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4])
>>>print(arr[0])
1
>>>print(arr[2])
3
>>>print(arr[4])
Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
print(arr[4])
IndexError: index 4 is out of bounds for axis 0 with size 4
ARRAY SLICING:
Slicing in Python means retrieving elements from one given index to another given index.
We pass a slice instead of an index, like this: [start:end].
We can also define the step, like this: [start:end:step].
If we don't pass start, it is considered 0. If we don't pass end, it is considered the length of the array in that dimension. If we don't pass step, it is considered 1.
>>>arr = np.array([1, 2, 3, 4, 5, 6, 7])
>>>print(arr[1:5])
[2 3 4 5]
>>>print(arr[4:])
[5 6 7]
>>>print(arr[:4])
[1 2 3 4]
>>>print(arr[-3:-1])
[5 6]
>>>print(arr[1:5:2])
[2 4]
>>>print(arr[::2])
[1 3 5 7]
>>>print(arr[1, 1:4])
Traceback (most recent call last):
File "<pyshell#38>", line 1, in <module>
print(arr[1, 1:4])
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
>>>arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
>>>print(arr[0:2, 2])
[3 8]
>>>print(arr[0:2, 1:4])
[[2 3 4]
[7 8 9]]
ARRAY RESHAPING:
By reshaping we can add or remove dimensions or change the number of elements in each dimension.
#Converting a 1d array to 2d
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
>>>newarr = arr.reshape(4, 3)
>>>print(newarr)
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
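Reshaping can also remove dimensions; reshape(-1) flattens an array back to one dimension (a short added example, continuing from newarr above):
#Converting the 2d array back to 1d
>>>flat = newarr.reshape(-1)
>>>print(flat)
[ 1  2  3  4  5  6  7  8  9 10 11 12]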
ARRAY ITERATION:
Iterating means going through the elements of an array one by one.
>>>import numpy as np
>>> arr = np.array([1, 2, 3])
>>> for x in arr:
	print(x)
1
2
3
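For arrays with more than one dimension, every scalar element can be visited with np.nditer() (a short added example):
>>> arr2 = np.array([[1, 2], [3, 4]])
>>> for x in np.nditer(arr2):
	print(x)
1
2
3
4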
ARRAY JOINING:
Joining is the process of combining the contents of two or more arrays into a single array.
>>>import numpy as np
>>>arr1 = np.array([1, 2, 3])
>>>arr2 = np.array([4, 5, 6])
>>>arr = np.concatenate((arr1, arr2))
>>>print(arr)
[1 2 3 4 5 6]
ARRAY SPLITTING:
Splitting is the reverse operation of joining: it breaks one array into multiple
sub-arrays.
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5, 6])
>>>newarr = np.array_split(arr,3)
>>>print(newarr)
[array([1, 2]), array([3, 4]), array([5, 6])]
>>>print(np.array_split(arr,5))
[array([1, 2]), array([3]), array([4]), array([5]), array([6])]
ARRAY SORTING:
Sorting is the process of arranging elements in an ordered sequence, either ascending
or descending.
>>>import numpy as np
#sorting numbers in ascending order
>>>arr = np.array([3, 2, 0, 1])
>>>print(np.sort(arr))
[0 1 2 3]
#sorting in alphabetical order
>>>arr = np.array(['banana', 'cherry', 'apple'])
>>>print(np.sort(arr))
['apple' 'banana' 'cherry']
SEARCHING ARRAYS:
Searching an array for a certain value returns the indexes where a match is found. To search an array, use the where() method.
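A minimal where() example (an added sketch; it returns a tuple of index arrays where the condition holds):
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4, 5, 4, 4])
>>>print(np.where(arr == 4))
(array([3, 5, 6]),)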
ARRAY DATA TYPES:
The dtype attribute reports the data type of an array's elements, and a specific type can be requested at creation time with the dtype argument.
>>>import numpy as np
>>>arr = np.array([1, 2, 3, 4])
>>>print(arr.dtype)
int32
>>>arr = np.array(['apple', 'banana', 'cherry'])
>>>print(arr.dtype)
<U6
>>>arr = np.array([1, 2, 3, 4], dtype='S')
>>>print(arr)
[b'1' b'2' b'3' b'4']
>>>print(arr.dtype)
|S1
>>>arr = np.array([1, 2, 3, 4], dtype='i4')
>>>print(arr)
[1 2 3 4]
>>>print(arr.dtype)
int32
>>>arr = np.array(['a', '2', '3'], dtype='i')
Traceback (most recent call last):
File "<pyshell#83>", line 1, in <module>
arr = np.array(['a', '2', '3'], dtype='i')
ValueError: invalid literal for int() with base 10: 'a'
RESULT:
Thus, working with NumPy arrays and possible array manipulations were executed
successfully.
EX: NO: 3 WORKING WITH PANDAS DATAFRAMES
DATE :
AIM:
To download, install pandas package and perform various manipulations in Python.
DESCRIPTION:
Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008. It allows us to analyze big data and draw conclusions based on statistical theories. Pandas can clean messy data sets and make them readable and relevant. Relevant data is very important in data science.
Pandas helps answer data-analysis questions such as:
Is there a correlation between two or more columns?
What is the average value?
What is the maximum value?
What is the minimum value?
Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
PROCEDURE:
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table
with rows and columns. The methods used in pandas DataFrames are listed below.
FUNCTION DESCRIPTION
index Attribute that returns the index (row labels) of the DataFrame
insert() Inserts a column into a DataFrame
add() Returns the addition of the DataFrame and other, element-wise (binary operator add)
sub() Returns the subtraction of the DataFrame and other, element-wise (binary operator sub)
mul() Returns the multiplication of the DataFrame and other, element-wise (binary operator mul)
div() Returns the floating division of the DataFrame and other, element-wise (binary operator truediv)
unique() Extracts the unique values in a column
nunique() Returns the count of unique values
value_counts() Counts the number of times each unique value occurs within the Series
columns Attribute that returns the column labels of the DataFrame
axes Attribute that returns a list representing the axes of the DataFrame
isnull() Creates a Boolean Series for extracting rows with null values
notnull() Creates a Boolean Series for extracting rows with non-null values
between() Extracts rows where a column value falls within a predefined range
isin() Extracts rows from a DataFrame where a column value exists in a predefined collection
dtypes Attribute that returns a Series with the data type of each column; the result's index is the original DataFrame's columns
astype() Converts the data types in a Series
values Attribute that returns a NumPy representation of the DataFrame, i.e. only the values in the DataFrame are returned and the axes labels are removed
sort_values() Sorts a DataFrame in ascending or descending order of the passed column
sort_index() Sorts a DataFrame based on index positions or labels instead of values; useful when a DataFrame is built from two or more DataFrames and the index needs to be changed later
loc[] Retrieves rows based on index label
iloc[] Retrieves rows based on index position
ix[] Retrieved rows based on either index label or index position, combining the features of .loc[] and .iloc[] (deprecated and removed in recent pandas versions)
rename() Called on a DataFrame to change the names of the index labels or column names
columns Attribute that can alternatively be assigned to change the column names
drop() Deletes rows or columns from a DataFrame
pop() Deletes a column from a DataFrame and returns it
sample() Pulls out a random sample of rows or columns from a DataFrame
nsmallest() Pulls out the rows with the smallest values in a column
nlargest() Pulls out the rows with the largest values in a column
shape Attribute that returns a tuple representing the dimensionality of the DataFrame
ndim Attribute that returns an int representing the number of axes / array dimensions: 1 for a Series, 2 for a DataFrame
dropna() Lets the user analyze and drop rows/columns with null values in different ways
fillna() Lets the user replace NaN values with a value of their own
rank() Ranks the values in a Series in order
query() An alternative string-based syntax for extracting a subset from a DataFrame
copy() Creates an independent copy of a pandas object
duplicated() Creates a Boolean Series and uses it to extract rows that have duplicate values
drop_duplicates() An alternative option for identifying duplicate rows and removing them through filtering
set_index() Sets the DataFrame index (row labels) using one or more existing columns
reset_index() Resets the index of a DataFrame; sets a list of integers ranging from 0 to the length of the data as the index
where() Checks a DataFrame for one or more conditions and returns the result accordingly; by default, rows not satisfying the condition are filled with NaN
A Pandas DataFrame can be created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, from a dictionary, from a list of dictionaries, etc.
Create a simple Pandas DataFrame
import pandas as pd
data = {"calories": [420, 380, 390],"duration": [50, 40, 45]}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
calories duration
0 420 50
1 380 40
2 390 45
Pandas uses the loc attribute to return one or more specified rows.
Return row 0:
#refer to the row index:
print(df.loc[0])
calories 420
duration 50
Name: 0, dtype: int64
Return row 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
calories duration
0 420 50
1 380 40
Named Indexes:
With the index argument, we can name our own indexes.
import pandas as pd
data = {"calories": [420, 380, 390],"duration": [50, 40, 45]}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
calories duration
day1 420 50
day2 380 40
day3 390 45
Load Files into a DataFrame
If your data sets are stored in a file, Pandas can load them into a DataFrame. Load a comma
separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
iso_code ... excess_mortality_cumulative_per_million
0 AFG ... NaN
1 AFG ... NaN
2 AFG ... NaN
3 AFG ... NaN
4 AFG ... NaN
... ... ... ...
166321 ZWE ... NaN
166322 ZWE ... NaN
166323 ZWE ... NaN
166324 ZWE ... NaN
166325 ZWE ... NaN
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
print(df)
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
Name Age
rank1 Tom 28
rank2 Jack 34
rank3 Steve 29
rank4 Ricky 42
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
a b c
first 1 2 NaN
second 5 10 20.0
Column Selection: To select a column in a Pandas DataFrame, we can access the columns by calling them by their column names.
import pandas as pd
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],'Age':[27, 24, 22, 32],'Address':['Delhi',
'Kanpur', 'Allahabad', 'Kannauj'],'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data)
# select two columns
print(df[['Name', 'Qualification']])
Name Qualification
0 Jai Msc
1 Princi MA
2 Gaurav MCA
3 Anuj Phd
Row Selection: Pandas provides methods to retrieve rows from a DataFrame. DataFrame.loc[] is used to retrieve rows by index label. Rows can also be selected by passing an integer location to iloc[], as shown in the sketch below.
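A short illustrative sketch of both (reusing the df with Name, Age, Address, and Qualification defined above):
# select a single row by its index label
print(df.loc[0])
# select the first two rows by integer position
print(df.iloc[0:2])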
RESULT:
Thus, working with Pandas DataFrames and possible manipulations were verified and
executed successfully.
EX: No: 4 DESCRIPTIVE ANALYTICS
DATE :
AIM:
To write a python code for reading data from text files, Excel and the web and exploring
various commands for doing descriptive analytics on the Iris data set.
DESCRIPTION:
Exploratory Data Analysis (EDA) is a technique for analyzing data using visual techniques. With this technique, we can get detailed information about the statistical summary of the data, deal with duplicate values and outliers, and spot trends or patterns present in the dataset.
Iris Dataset
The Iris dataset is considered the "Hello World" of data science. It contains five columns, namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a flowering plant; researchers have measured various features of different iris flowers and recorded them digitally.
PROCEDURE:
CASE 1: READING DATA FROM EXCEL/CSV FILE
We will use the Pandas library to load the Iris dataset CSV file and convert it into a DataFrame using the read_csv() method.
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
print(df)
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
[150 rows x 5 columns]
CASE 2: READING DATA FROM A TEXT FILE
file1 = open("/content/sample_data/Basics-Python.txt","r+")
print("Output of Read function is ")
print(file1.read())
print()
Output of Read function is
Python is a very popular general-purpose interpreted, interactive, object-
oriented, and high-level programming language. Python is dynamically-typed
and garbage-collected programming language. It was created by Guido van
Rossum during 1985- 1990. Like Perl, Python source code is also available
under the GNU General Public License (GPL).
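CASE 3: READING DATA FROM THE WEB
The aim also mentions reading data from the web; no listing for this case survives in the manual, so the sketch below is illustrative (the URL is a placeholder, and any reachable raw CSV link would do):
import pandas as pd
# read_csv accepts a URL as well as a local path (placeholder URL)
url = "https://example.com/Iris.csv"
df_web = pd.read_csv(url)
print(df_web.head())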
1. Central Tendency
df['sepallength'].mean()
5.843333333333334
df['sepalwidth'].median()
3.0
df['petalwidth'].mode()
0 0.2
dtype: float64
df['class'].mode()
0 Iris-setosa
1 Iris-versicolor
2 Iris-virginica
dtype: object
2. Dispersion
Dispersion is used to describe the variation present in a given variable, i.e. how close to or far from the mean the values lie.
Variance — gives the average squared deviation from the mean value
Standard Deviation — the square root of the variance
Range — the difference between the max and min values
InterQuartile Range (IQR) — the difference between Q3 and Q1, where Q3 is the 3rd quartile value and Q1 is the 1st quartile value
data['A'].var()
data['A'].std()
data['A'].max()-data['A'].min()
data['A'].quantile([.25,.5,.75])
df["sepalwidth"].var()
0.1880040268456376
df["sepallength"].std()
0.4335943113621737
df["sepallength"].max()-df["sepalwidth"].min()
5.9
df["petalwidth"].quantile([.25,.5,.75])
0.25 0.3
0.50 1.3
0.75 1.8
Name: petalwidth, dtype: float64
3. Skewness
Skewness measures the symmetry of data about the mean value. Symmetry means an equal distribution of observations above and below the mean.
skewness = 0: the data is symmetric about the mean.
skewness = negative: the data is not symmetric, and the left-side tail of the density plot is longer than the right-side tail.
skewness = positive: the data is not symmetric, and the right-side tail of the density plot is longer than the left-side tail.
We can find the skewness of a given variable with the method below.
data['A'].skew()
df["sepallength"].skew()
0.3149109566369728
df["sepalwidth"].skew()
0.3340526621720866
df["class"].skew()
ValueError: could not convert string to float: 'Iris-setosa'
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/nanops.py in _f(*args,
**kwargs)
99 # object arrays that contain strings
100 if is_object_dtype(args[0]):
--> 101 raise TypeError(e) from e
102 raise
103
TypeError: could not convert string to float: 'Iris-setosa'
4. Kurtosis
Kurtosis is used to describe the peakedness (or flatness) of a density plot relative to the normal distribution. Dr. Wheeler defines kurtosis as: "The kurtosis parameter is a measure of the combined weight of the tails relative to the rest of the distribution." That is, it measures the tail heaviness of a given distribution.
kurtosis = 0: the peakedness of the graph is equal to that of the normal distribution.
kurtosis = negative: the peakedness of the graph is less than that of the normal distribution (a flatter plot).
kurtosis = positive: the peakedness of the graph is more than that of the normal distribution (a more peaked plot).
We can find the kurtosis of a given variable with the method below.
data['A'].kurt()
df["sepalwidth"].kurt()
0.2907810623654279
df["sepallength"].kurt()
-0.5520640413156395
Let us see the graph representation of a given variable and interpret the skewness and
peakedness of its distribution.
import seaborn as sns
sns.distplot(df["sepallength"],hist=True,kde=True)
/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2619:
FutureWarning: `distplot` is a deprecated function and will be removed in a
future version. Please adapt your code to use either `displot` (a figure-
level function with similar flexibility) or `histplot` (an axes-level
function for histograms).
warnings.warn(msg, FutureWarning)
<matplotlib.axes._subplots.AxesSubplot at 0x7fa94e2957d0>
Let's see whether the dataset is balanced, i.e. whether the values of a column occur in equal numbers of rows. We will use the Series.value_counts() function, which returns a Series containing counts of unique values.
df.value_counts("sepalwidth")
sepalwidth
3.0 26
2.8 14
3.2 13
3.4 12
3.1 12
2.9 10
2.7 9
2.5 8
3.3 6
3.5 6
3.8 6
2.6 5
2.3 4
2.4 3
2.2 3
3.6 3
3.7 3
3.9 2
4.1 1
4.2 1
2.0 1
4.0 1
4.4 1
dtype: int64
Data Visualization
Visualizing the target column - Our target column will be the sepalwidth column because, in the end, we need the result according to sepalwidth only. Let's see a countplot for it. (We will use the Matplotlib and Seaborn libraries for data visualization.)
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x="sepalwidth", data=df)
plt.show()
Comparing Sepal Length and Sepal Width
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x="sepallength", y="sepalwidth",hue="class", data=df, )
plt.show()
Histograms
Histograms show the distribution of data for various columns. They can be used for univariate as well as bivariate analysis.
# a 2x2 grid of axes is required before plotting the four histograms
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df["sepallength"], bins=7)
axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df["sepalwidth"], bins=5)
axes[1,0].set_title("Petal Length")
axes[1,0].hist(df["petallength"], bins=6)
axes[1,1].set_title("Petal Width")
axes[1,1].hist(df["petalwidth"], bins=6)
Output:
The highest frequency of sepal length, between 30 and 35, occurs for values between 5.5 and 6.
The highest frequency of sepal width, around 70, occurs for values between 3.0 and 3.5.
The highest frequency of petal length, around 50, occurs for values between 1 and 2.
The highest frequency of petal width, between 40 and 50, occurs for values between 0.0 and 0.5.
RESULT:
Thus the Python code for reading data from text files, Excel, and the web, and for exploring various commands for descriptive analytics on the Iris data set, has been applied and results retrieved successfully.
EX: No: 5 EXPLORATORY DATA ANALYSIS
DATE:
AIM:
To write a python code for performing Univariate analysis, Bivariate analysis,
Multiple Regression analysis and compare the results obtained on diabetes data set from
UCI and Pima Indians Diabetes data sets.
DESCRIPTION:
Exploratory Data Analysis is majorly performed using the following methods:
Univariate analysis: provides summary statistics for each field in the raw data set, i.e. a summary of one variable at a time. Ex: CDF, PDF, box plot, violin plot.
Bivariate analysis: is performed to find the relationship between each variable in the dataset and the target variable of interest, i.e. using two variables and finding the relationship between them. Ex: box plot, violin plot.
Multivariate analysis: is performed to understand interactions between different fields in the dataset, i.e. finding interactions among more than two variables. Ex: pair plot and 3D scatter plot.
UCI Diabetes Dataset:
This dataset contains the distribution for 70 sets of data recorded on diabetes patients
(several weeks' to months' worth of glucose, insulin, and lifestyle data per patient + a
description of the problem domain).
PROCEDURE:
(5a) Univariate Analysis - Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis
import pandas as pd
import numpy as np
df = pd.read_csv("diabetes.csv")
print(df)
Age Gender Polyuria ... Alopecia Obesity class
0 40 Male No ... Yes Yes Positive
1 58 Male No ... Yes No Positive
2 41 Male Yes ... Yes No Positive
3 45 Male No ... No No Positive
4 60 Male Yes ... Yes Yes Positive
.. ... ... ... ... ... ... ...
515 39 Female Yes ... No No Positive
516 48 Female Yes ... No No Positive
517 58 Female Yes ... No Yes Positive
518 32 Female No ... Yes No Negative
519 42 Male No ... No No Negative
[520 rows x 17 columns]
>>>print(df['Age'].mean())
48.02884615384615
>>>print(df['Age'].median())
47.5
>>>print(df['Age'].mode())
0 35
dtype: int64
>>>print(df["Age"].var())
147.65812583370388
>>>print(df["Age"].std())
12.151465995249458
>>>print(df["Age"].skew())
0.3293593578272701
>>>print(df["Age"].kurt())
-0.19170941407070163
Data Visualization: (pima-diabetes.csv)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
#Load the data
filepath='pima-diabetes.csv'
df = pd.read_csv(filepath)
Data_X= df.copy(deep=True)
Data_X= Data_X.drop(['Outcome'],axis=1)
plt.rcParams['figure.figsize']=[40,40]
#Plotting Histogram of Data
Data_X.hist(bins=40)
plt.show()
(5b) Bivariate Analysis – Linear and Logistic Regression
Simple Linear Regression - An approach for predicting a response using a single feature, assuming the two variables are linearly related. We try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature, or independent variable (x). Consider a dataset with a value of the response y for every feature x (example data is given in the code below).
Now, the task is to find the line that best fits this scatter of points, so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset). This line is called a regression line. The equation of the regression line is represented as:
h(xi) = b0+b1xi
Here,
h(xi) represents the predicted response value for the ith observation.
b0 and b1 are regression coefficients and represent the y-intercept and slope of the regression line respectively.
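In closed form, the least-squares estimates computed by estimate_coef() in the listing below are:
\[
SS_{xy} = \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}, \qquad
SS_{xx} = \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2, \qquad
b_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad
b_0 = \bar{y} - b_1\,\bar{x}
\]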
SOURCE CODE:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
Output:
Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
The graph obtained shows the scatter of the data points with the fitted regression line.
Logistic Regression:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
filepath='pima-diabetes.csv'
df = pd.read_csv(filepath)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
LR = LogisticRegression()
LR.fit(X_train, y_train)
y_pred = LR.predict(X_test)
print("Accuracy ", LR.score(X_test, y_test)*100)
sns.set(font_scale=1.5)
cm = confusion_matrix(y_pred, y_test)
sns.heatmap(cm, annot=True, fmt='g')
plt.show()
Output:
Accuracy 79.16666666666666
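(5c) Multiple Regression Analysis
The code listing for this sub-experiment appears to be missing from the manual; the sketch below is a reconstruction under assumptions (it predicts Age from the remaining pima-diabetes columns with scikit-learn's LinearRegression, and printing a single prediction would produce a one-element array like the output shown below).
import pandas as pd
from sklearn.linear_model import LinearRegression
# load the data (assumed file, as in the other sub-experiments)
df = pd.read_csv('pima-diabetes.csv')
# assumption: Age is taken as the response; all other columns are predictors
X = df.drop(['Age'], axis=1)
y = df['Age']
model = LinearRegression()
model.fit(X, y)
# predict the age of the first record from the other measurements
print(model.predict(X.iloc[[0]]))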
Output:
[48.13025197]
(5d) Comparative Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
#Load the data
filepath='pima-diabetes.csv'
df = pd.read_csv(filepath)
plt.style.use("classic")
plt.figure(figsize=(10,10))
sns.distplot(df[df['Outcome'] == 0]["Pregnancies"], color='green') # Healthy - green
sns.distplot(df[df['Outcome'] == 1]["Pregnancies"], color='red') # Diabetic - Red
plt.title('Healthy vs Diabetic by Pregnancy', fontsize=15)
plt.xlim([-5,20])
plt.grid(linewidth = 0.7)
plt.show()
Output:
From the above graph, we can infer that pregnancy is not a likely cause of diabetes, as the distributions for the healthy and diabetic groups are almost the same.
# diabetes.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
#Load the data
filepath='diabetes.csv'
df = pd.read_csv(filepath)
plt.style.use("classic")
plt.figure(figsize=(10,10))
sns.distplot(df[df['Gender'] == 'Male']["Age"], color='green')
sns.distplot(df[df['Polyuria'] == 'No']["Age"], color='red')
plt.title('Male vs Polyuria by Age', fontsize=15)
plt.xlim([0,100])
plt.grid(linewidth = 0.7)
plt.show()
Output:
RESULT:
Thus the Python code for performing univariate, bivariate, and multiple regression analysis, and for comparing dependent and independent variables on the diabetes and pima-diabetes data sets, was applied and results were retrieved successfully.
EX: NO: 6 PLOTTING FUNCTIONS
DATE :
AIM:
To write a python code for performing various plotting functions on the UCI data sets.
DESCRIPTION:
The plotting methods are used to represent data visually, enhancing data analytics. The various methods are listed below.
a. Normal curves - Normal Distribution is a probability function used in statistics that
tells about how the data values are distributed. It is the most important probability
distribution function used in statistics because of its advantages in real case scenarios.
b. Density and contour plots - They represent events with a density gradient or contour gradient depending on the number of events. In a density plot, the color of an area reflects how many events are in that position of the plot.
c. Correlation and scatter plots - A scatter plot displays the strength, direction, and form
of the relationship between two quantitative variables. A correlation coefficient measures
the strength of that relationship.
d. Histograms - A histogram is a graphical representation of data points organized into
user-specified ranges. Similar in appearance to a bar graph, the histogram condenses a data
series into an easily interpreted visual by taking many data points and grouping them into
logical ranges or bins.
e. Three dimensional plotting - plots are enabled by importing the mplot3d toolkit,
included with the main Matplotlib installation.
from mpl_toolkits import mplot3d
Once this submodule is imported, a three-dimensional axes can be created by passing the
keyword projection='3d' to any of the normal axes creation routines.
SOURCE CODE:
# Normal Curve
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
x=np.sort(data.Glucose[0:50])   # sort so the curve is drawn from left to right
mean=st.mean(x)
sd=st.stdev(x)
pyplot.plot(x,norm.pdf(x,mean,sd))
pyplot.title("Normal plot")
pyplot.show()
OUTPUT:
#density plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()
OUTPUT:
#contour plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
x=data.BloodPressure[0:2]
y=data.Glucose[0:2]
z=np.array([data.BMI[0:2],data.Age[0:2]])
pyplot.figure(figsize=(7,5))
pyplot.title("Contour plot")
contours=pyplot.contour(x,y,z)
pyplot.show()
OUTPUT:
#correlation plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
names=["Pregnancies", "Glucose","BloodPressure","SkinThickness","Insulin",
"BMI","DiabetesPedigreeFunction", "Age"]
correlation = data.corr()
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlation, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,8,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.title("Correlation")
pyplot.show()
OUTPUT:
#scatter plot
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
scatter_matrix(data)
pyplot.show()
OUTPUT:
#Histograms
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot
import pandas as pd
from pandas.plotting import scatter_matrix
import plotly.express as px
from scipy.stats import norm
import statistics as st
from mpl_toolkits import mplot3d
data = pd.read_csv('diabetes.csv')
data.hist()
pyplot.show()
OUTPUT:
#3D plot
data = pd.read_csv('diabetes.csv')
fig = pyplot.figure()
ax = pyplot.axes(projection='3d')
zline = np.array(data.BMI)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Blues')
pyplot.show()
OUTPUT:
RESULT:
Thus the Python code for performing various plotting functions, such as scatter plot, normal plot, histogram, density plot, contour plot, and 3D plot, was applied to the UCI diabetes data set and visualized successfully.
EX: NO: 7 GEOGRAPHIC DATA WITH BASEMAP
DATE :
AIM:
To write a python code for visualizing the geographic data with Basemap package and
methods by applying various projection techniques.
DESCRIPTION:
One common type of visualization in data science is that of geographic data. Matplotlib's main tool for this type of visualization is the Basemap toolkit, one of several Matplotlib toolkits that live under the mpl_toolkits namespace. More modern solutions, such as Leaflet or the Google Maps API, may be a better choice for more intensive map visualizations. Still, Basemap is a useful tool for Python users to have in their virtual toolbelts.
PROCEDURE:
#Basemap and other packages installation
!pip install basemap
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting basemap
  Downloading basemap-1.3.6-cp38-cp38-manylinux1_x86_64.whl (863 kB)
Collecting basemap-data<1.4,>=1.3.2
  Downloading basemap_data-1.3.2-py2.py3-none-any.whl (30.5 MB)
Requirement already satisfied: matplotlib<3.7,>=1.5 in /usr/local/lib/python3.8/dist-packages (from basemap) (3.2.2)
Collecting pyproj<3.5.0,>=1.9.3
  Downloading pyproj-3.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting pyshp<2.4,>=1.2
  Downloading pyshp-2.3.1-py2.py3-none-any.whl (46 kB)
Collecting numpy<1.24,>=1.22
  Downloading numpy-1.23.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib<3.7,>=1.5->basemap) (2.8.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib<3.7,>=1.5->basemap) (3.0.9)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.8/dist-packages (from matplotlib<3.7,>=1.5->basemap) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib<3.7,>=1.5->basemap) (1.4.4)
Requirement already satisfied: certifi in /usr/local/lib/python3.8/dist-packages (from pyproj<3.5.0,>=1.9.3->basemap) (2022.9.24)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/dist-packages (from python-dateutil>=2.1->matplotlib<3.7,>=1.5->basemap) (1.15.0)
Installing collected packages: numpy, pyshp, pyproj, basemap-data, basemap
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.6
    Uninstalling numpy-1.21.6:
      Successfully uninstalled numpy-1.21.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scipy 1.7.3 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.23.5 which is incompatible.
Successfully installed basemap-1.3.6 basemap-data-1.3.2 numpy-1.23.5 pyproj-3.4.0 pyshp-2.3.1
SOURCE CODE:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5)
plt.show()
The useful thing is that the globe shown here is not a mere image; it is a fully functioning Matplotlib axes that understands spherical coordinates and allows us to easily overplot data on the map.
Map Projections:
The Basemap package implements several dozen such projections, all referenced by a short
format code.
Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of constant
longitude) remain vertical; The Mollweide projection (projection='moll') is one common
example of this, in which all meridians are elliptical arcs. It is constructed so as to preserve
area across the map: though there are distortions near the poles, the area of small patches
reflects the true area. Other pseudo-cylindrical projections are the sinusoidal
(projection='sinu') and Robinson (projection='robin') projections.
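A minimal Mollweide sketch (illustrative, following the same pattern as the orthographic example above):
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 6))
m = Basemap(projection='moll', resolution='c', lon_0=0)
m.drawcoastlines()                         # coarse coastlines
m.drawparallels(np.arange(-90, 91, 30))    # lines of constant latitude
m.drawmeridians(np.arange(-180, 181, 60))  # elliptical meridian arcs
plt.show()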
Conic projections
A Conic projection projects the map onto a single cone, which is then unrolled. This can
lead to very good local properties, but regions far from the focus point of the cone may
become much distorted.
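A minimal Lambert conformal conic sketch (illustrative; lat_1 and lat_2 are the two standard parallels, and width/height give the map extent in metres):
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 6))
m = Basemap(projection='lcc', resolution='c',
            lon_0=0, lat_0=50, lat_1=45, lat_2=55,
            width=1.6E7, height=1.2E7)
m.drawcoastlines()
plt.show()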
For example, plotting the locations and populations of California cities on such a map shows where larger populations have settled: they cluster near the coast in the Los Angeles and San Francisco areas, stretch along the highways in the flat Central Valley, and almost completely avoid the mountainous regions along the borders of the state.
RESULT:
Thus the python code for visualizing the geographic data with various types of
projections has been applied and results retrieved successfully.