DS LAB MANUAL
Ex. No: 1
Download, Install and Explore the Features of NumPy, SciPy, Jupyter, Statsmodels and Pandas Packages
AIM:
To download, install and explore the features of the NumPy, SciPy, Jupyter, Statsmodels and
Pandas packages.
ALGORITHM:
Hit the Windows key, type "command prompt", and click Run as administrator.
Type the pip install numpy command and press the Enter key to start the NumPy installation.
The numpy package download and installation will start and finish automatically.
Verify the NumPy installation: launch the command prompt, type the pip show numpy command, and hit the Enter key to verify that numpy is among the installed Python packages.
The output shows the numpy version along with the location at which it is stored on the system.
Repeat the same procedure for the remaining packages, running each install command as administrator:
Type pip install scipy to install SciPy, then verify it with pip show scipy.
Type pip install jupyter to install Jupyter, then verify it with pip show jupyter.
Type pip install statsmodels to install Statsmodels, then verify it with pip show statsmodels.
Type pip install pandas to install Pandas, then verify it with pip show pandas.
In each case the output shows the package version along with the location at which it is stored on the system.
Source Code:
(i) Download PIP: get-pip.py
(ii) Install PIP:
python get-pip.py
(iii) Verify Installation:
pip help
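As a quick check that everything installed correctly, a minimal verification sketch in Python (jupyter is a metapackage, so it is easiest to check with pip show jupyter):
import numpy
import scipy
import pandas
import statsmodels
# print each package's version to confirm the installation
print("numpy", numpy.__version__)
print("scipy", scipy.__version__)
print("pandas", pandas.__version__)
print("statsmodels", statsmodels.__version__)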
OUTPUT
Download PIP: get-pip.py
Installing numpy: pip install numpy
Result:
Thus the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages are downloaded, installed and
their features explored successfully.
VIVA QUESTIONS :
ASSIGNMENT QUESTIONS:
SL.NO   ASSIGNMENT   CO MAPPING   BT LEVEL   COMPLEXITY
1. Write a Pandas program to select the specified columns and rows from a given data frame. Sample Python dictionary data and list labels:
Select 'name' and 'score' columns in rows 1, 3, 5, 6 from the following data frame.
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Expected Output:
Select specific columns and rows:
   score qualify
b    9.0      no
d    NaN      no
f   20.0     yes
g   14.5     yes
(CO MAPPING: CO1; BT LEVEL: Create; COMPLEXITY: High)
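A sketch of one possible solution (not part of the original manual; rows 1, 3, 5, 6 correspond to labels b, d, f, g):
import numpy as np
import pandas as pd
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
                      'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
             'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
             'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
             'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
print("Select specific columns and rows:")
print(df.iloc[[1, 3, 5, 6]][['score', 'qualify']])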
Ex No: 2
Working with Numpy Arrays
Date:
Aim:
To explore the features of NumPy arrays using the NumPy package.
Definition:
(i) Numpy
Numpy is the core library for scientific computing in Python. It provides a high-performance
multidimensional array object, and tools for working with these arrays.
(ii) Arrays
A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative
integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers
giving the size of the array along each dimension.
Algorithm:
Step 1: Verify the NumPy package is installed in Windows by using the command pip show numpy.
Step 2: Create arrays using the NumPy package.
Step 3: Perform operations such as indexing, slicing, checking dimensions, checking data types, iterating, joining, splitting, searching, sorting and filtering.
Step 4: Finally, print the output of each operation.
Source Code:
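The output section below includes an "(i) Numpy Creating and Check Dimensions" step whose listing is missing here; a minimal sketch consistent with that step (array values are illustrative):
(i) NumPy Creating Arrays and Checking Dimensions:
import numpy as np
a0 = np.array(42)                        # 0-D array
a1 = np.array([1, 2, 3])                 # 1-D array
a2 = np.array([[1, 2, 3], [4, 5, 6]])    # 2-D array
print(a0.ndim, a1.ndim, a2.ndim)         # prints: 0 1 2
print(a2.shape)                          # prints: (2, 3)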
(ii) NumPy Array Indexing
Two Dimension:
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
print('5th element on 2nd row: ', arr[1, 4])
Three Dimension:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
(iii) NumPy Array Slicing
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])
print(arr[4:])
print(arr[:4])
print(arr[-3:-1])
print(arr[1:5:2])
print(arr[::2])
Two Dimension:
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[1, 1:4])
print(arr[0:2, 2])
print(arr[0:2, 1:4])
(iv) NumPy Data Types
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr.dtype)
import numpy as np
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')
print(newarr)
import numpy as np
arr = np.array([1, 0, 3])
newarr = arr.astype(bool)
print(newarr)
print(newarr.dtype)
(v) NumPy Array Iterating
Two Dimension:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
    print(x)
Three Dimension:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    print(x)
(vi) NumPy Joining Array
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
arr = np.concatenate((arr1, arr2), axis=1)
print(arr)
(vii) NumPy Splitting Array
import numpy as np
arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
newarr = np.array_split(arr, 3)
print(newarr)
(viii) NumPy Searching Arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.where(arr%2 == 0)
print(x)
(ix) NumPy Sorting Arrays
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
import numpy as np
arr = np.array(['banana', 'cherry', 'apple'])
print(np.sort(arr))
import numpy as np
arr = np.array([True, False, True])
print(np.sort(arr))
import numpy as np
arr = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr))
(x) NumPy Filter Array
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
filter_arr = []
for element in arr:
    if element % 2 == 0:
        filter_arr.append(True)
    else:
        filter_arr.append(False)
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
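For reference, the same filter can be written more idiomatically with a boolean-mask expression; a minimal sketch:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
newarr = arr[arr % 2 == 0]   # the boolean mask is built in one step
print(newarr)                # [2 4 6]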
OUTPUT:
(i) Numpy Creating and Check Dimensions:
(v) NumPy Array Iterating
(x) NumPy Filter Array
Result:
Thus the given operations are executed and verified successfully using the NumPy package.
VIVA QUESTIONS :
ASSIGNMENT QUESTIONS :
SL.NO   ASSIGNMENT   CO MAPPING   BT LEVEL   COMPLEXITY
1. (a) Write a NumPy program to convert an array to a float type.
Ex.No:3 Working with Pandas DataFrames
Aim:
To implement the basic concepts of Pandas Dataframe.
Pandas DataFrame:
Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). A DataFrame is a two-dimensional data structure, i.e.,
data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal
components: the data, the rows, and the columns.
Creating a DataFrame
Dealing with Rows and Columns
Indexing and Selecting Data
Working with Missing Data
Iterating over rows and columns
Creating a DataFrame:
In the real world, a Pandas DataFrame is created by loading a dataset from existing
storage; the storage can be a SQL database, a CSV file, or an Excel file. A Pandas DataFrame can also be created
from lists, from a dictionary, from a list of dictionaries, etc.
Dealing with Rows and Columns
A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows
and columns. We can perform basic operations on rows/columns like selecting, deleting, adding, and
renaming.
Column Selection:
In order to select a column in a Pandas DataFrame, we can access the columns by calling them by
their column names.
Row Selection:
Pandas provides a unique method to retrieve rows from a DataFrame. The DataFrame.loc[] method is used to
retrieve rows from a Pandas DataFrame by label. Rows can also be selected by passing their integer location to
the iloc[] indexer, as shown in the sketch below.
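A minimal sketch contrasting the two selectors (the data values are illustrative, not from the manual):
import pandas as pd
data = pd.DataFrame({'Cost': [20, 35]}, index=['Ronald', 'Ben'])
print(data.loc['Ronald'])   # selects the row by its index label
print(data.iloc[1])         # selects the row by its integer position (here: 'Ben')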
Indexing and Selecting Data:
Indexing in pandas means simply selecting particular rows and columns of data from a
DataFrame. Indexing could mean selecting all of the rows and some of the columns, some of the rows and
all of the columns, or some of each of the rows and columns. Indexing is also known as subset
selection.
Indexing a Dataframe using the indexing operator []:
The indexing operator refers to the square brackets following an object. The .loc and .iloc indexers
also use the indexing operator to make selections. Here, the indexing operator means df[].
Selecting a single column:
In order to select a single column, we simply put the name of the column in between the brackets.
Indexing a DataFrame using df.loc[]:
This indexer selects data by the labels of the rows and columns. The df.loc indexer selects data in a
different way than the plain indexing operator: it can select subsets of rows or columns, and it can also
select subsets of rows and columns simultaneously.
In order to drop null values from a dataframe, we use the dropna() function; this function drops
rows/columns of the dataset with null values.
Iterating over rows and columns:
Iteration is a general term for taking each item of something, one after another. Pandas DataFrame
consists of rows and columns so, in order to iterate over dataframe, we have to iterate a dataframe like a
dictionary.
In order to iterate over columns, we first create a list of the dataframe columns and then iterate through
that list to pull out each dataframe column.
Algorithm:
Step 1: Verify Pandas is installed in Windows by using the command pip show pandas.
Step 2: Create data frames from a list, a list of lists, and a dict of ndarrays/lists, providing index labels explicitly,
using pandas.
import pandas as pd
data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)
df
# df1 and df2 were not defined in the original listing; DataFrames built from
# a list are assumed here, with df2 given explicit index labels
df1 = pd.DataFrame(['Tom', 'nick', 'krish', 'jack'])
df2 = pd.DataFrame(['Tom', 'nick', 'krish', 'jack'], index=['a', 'b', 'c', 'd'])
print(df1, "\n")
print(df2)
vi) Column Selection:
import pandas as pd
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age': [27, 24, 22, 32],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data)
print(df[['Name', 'Qualification']])
vii) Row Selection:
1. DataFrame.loc[]
import pandas as pd
import xlrd
read_file = pd.read_excel("Test.xlsx")
read_file.to_csv("Test.csv", index=None, header=True)
# this read step was omitted in the original listing; it is assumed from the
# matching example below
data = pd.read_csv("Test.csv", index_col="Name")
first = data.loc["Ronald"]
second = data.loc["Ben"]
print(first, "\n\n\n", second)
import pandas as pd
import xlrd
read_file = pd.read_excel("Test.xlsx")
read_file.to_csv("Test.csv", index=None, header=True)
data = pd.read_csv("Test.csv", index_col="Name")
first = data["Cost"]
print(first)
2. DataFrame.iloc[]
# row2's definition was missing in the original listing; integer-position
# selection of the row at position 2 is assumed
row2 = data.iloc[2]
print(row2)
Working with Missing Data:
import pandas as pd
import numpy as np
dict = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score': [52, 40, 80, 98],
        'Fourth Score': [np.nan, np.nan, np.nan, 65]}
# filling missing values using fillna()
df = pd.DataFrame(dict)
df.fillna(0)
df
# dropping rows with missing values using dropna()
df = pd.DataFrame(dict)
df.dropna()
# iterating over rows using iterrows()
df = pd.DataFrame(dict)
for i, j in df.iterrows():
    print(i, j)
    print()
# iterating over columns: first build the list of column names
# (this line was omitted in the original listing)
columns = list(df)
for i in columns:
    print(df[i][2])
OUTPUT:
1. Creating a DataFrame
i) Creating Dataframe from Lists:
i) Column Selection:
iii) Dropping missing values using dropna() :
iv) Dropping rows with at least one NaN (null) value:
Result:
Thus all the basic concepts of Pandas DataFrame are implemented successfully.
VIVA QUESTIONS :
ASSIGNMENT QUESTIONS :
SL.NO   ASSIGNMENT   CO MAPPING   BT LEVEL   COMPLEXITY
1. Write a Pandas program to count the number of rows and columns of a DataFrame. Sample Python dictionary data and list labels:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Expected Output:
Number of Rows : 10
Number of Columns : 4
(CO MAPPING: CO1; BT LEVEL: Create; COMPLEXITY: High)
3. Write a Pandas program to select the rows where the number of attempts in the examination is greater than 2. Sample Python dictionary data and list labels:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
(CO MAPPING: CO1; BT LEVEL: Create; COMPLEXITY: High)
Ex.No:4 Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set
Aim:
i) To read data from text files, Excel file and web
ii) To explore various commands for doing descriptive analytics on the iris data set.
Algorithm:
Step 1: Download the Iris test data text file and CSV file from
https://sourceforge.net/projects/irisdss/files/
Step 2: Find the path location of the text file and the CSV file.
Step 3: Read the text file using read() and view the text data.
Step 4: Read the CSV file using read_csv() and view the data.
Step 5: Read the Iris dataset directly from the URL 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data' and view the dataset.
Step 6: Perform descriptive analytics on the iris dataset using various commands such as describe(), index, columns, head().
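The listings below begin at the viewing step; a minimal sketch of the reading steps themselves (the local file names are assumptions):
import pandas as pd

# Step 3: read a text file with read()
with open("iris.txt") as f:
    print(f.read())

# Step 4: read the CSV file with read_csv()
iris = pd.read_csv("Iris.csv")

# Step 5: read the Iris dataset directly from the UCI URL
url = ('https://archive.ics.uci.edu/ml/machine-learning-'
       'databases/iris/iris.data')
cols = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
iris_web = pd.read_csv(url, header=None, names=cols)

# Step 6: descriptive analytics
print(iris.describe())
print(iris.index, iris.columns)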
Checking Duplicates:
iris.duplicated().sum()   # command assumed; the original shows only the heading
Viewing Data:
print(iris.head())
sorting by an axis:
iris.sort_index(axis=1, ascending=False).head(10)
sorting by values:
iris.sort_values(by='Petal_Width').head(10)
Selection Getting:
iris['Sepal_Length'].head()
Selecting via [], which slices the first 5 rows:
iris[0:5]
Selection by Label
1. Selecting on a multi-axis by label
iris.loc[0:10, ['Sepal_Length', 'Petal_Length']]
7. Selection by position
iris.iloc[0:3, 0:4]
OUTPUT:
Reading data from a text file :
Read Excel File:
To display Data:
To display 5 rows:
To view the description of data:
Checking Duplicates
Reading data from web:
sorting by an axis:
Sorting by values:
Selection Getting:
Selection by Label
5. Retrieve a column of data by attribute
7. Selection by position
Result :
Thus reading data from text files, Excel and the web and exploring various commands for doing
descriptive analytics on the Iris data set is performed.
VIVA QUESTIONS :
ASSIGNMENT QUESTIONS :
SL.NO   ASSIGNMENT   CO MAPPING   BT LEVEL   COMPLEXITY
1. Explore various commands for doing descriptive analytics on crop yield production. (CO MAPPING: CO3; BT LEVEL: Apply; COMPLEXITY: High)
Ex.No.5a Univariate Analysis on diabetes data set from UCI and Pima
Indians Diabetes data set.
Aim:
To perform Univariate Analysis on diabetes data set from UCI and Pima Indians
Diabetes data set.
Algorithm:
Step 1: Download the diabetes data set from UCI and the Pima Indians Diabetes data set.
Step 2: Import the necessary modules and functions.
Step 3: Read the dataset path using read_csv().
Step 4: Perform univariate analysis such as frequency, mean, median, mode, variance, standard
deviation, skewness and kurtosis using the corresponding functions.
Source Code:
1. UCI Diabetes Dataset:
import pandas as pd
#from sklearn import linear_model
DataPath = r'C:\Users\asus\Desktop\diabetes.csv'
df = pd.read_csv(DataPath)
1. To find Frequency:
print("Frequency of values in column ")
count = df['Pregnancies'].value_counts()
print(count)
count = df['Glucose'].value_counts()
print(count)
count = df['BloodPressure'].value_counts()
print(count)
count = df['SkinThickness'].value_counts()
print(count)
count = df['Insulin'].value_counts()
print(count)
count = df['BMI'].value_counts()
print(count)
count = df['DiabetesPedigreeFunction'].value_counts()
print(count)
count = df['Age'].value_counts()
print(count)
count = df['Outcome'].value_counts()
print(count)
2. To find Mean:
print("mean:")
print(df.mean())
3. To find Median:
print("median:")
print(df.median())
4. To find Mode:
print("mode:")
print(df.mode().T)
5. To find Standard Deviation:
print("Standard Deviation:\n",df.std())
6. To find Variance:
print("Variance:\n",df.var())
7. To find Skewness:
print("Skewness:\n",df.skew())
8. To find Kurtosis:
print("Kurtosis:\n",df.kurtosis())
2. Pima Indians Diabetes Dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
path = r'C:\Users\asus\Desktop\pima-indians-diabetes.csv'
print(path)
# the file has no header row, so the column names are supplied via names=
# (assigning df.columns after a plain read_csv would drop the first record)
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
        'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(path, header=None, names=cols)
print(df)
1. To find Frequency:
print("Frequency of values in column ")
count = df['Pregnancies'].value_counts()
print(count)
count = df['Glucose'].value_counts()
print(count)
count = df['BloodPressure'].value_counts()
print(count)
count = df['SkinThickness'].value_counts()
print(count)
count = df['Insulin'].value_counts()
print(count)
count = df['BMI'].value_counts()
print(count)
count = df['DiabetesPedigreeFunction'].value_counts()
print(count)
count = df['Age'].value_counts()
print(count)
count = df['Outcome'].value_counts()
print(count)
2. To find Mean:
print("mean:")
print(df.mean())
3. To find Median:
print("median:")
print(df.median())
4. To find Mode:
print("mode:")
print(df.mode().T)
5. To find Standard Deviation:
print("Standard Deviation:\n",df.std())
6. To find Variance:
print("Variance:\n",df.var())
7. To find Skewness:
print("Skewness:\n",df.skew())
8. To find Kurtosis:
print("Kurtosis:\n",df.kurtosis())
OUTPUT:
To find mean:
mean:
Pregnancies 3.845052
Glucose 120.894531
BloodPressure 69.105469
SkinThickness 20.536458
Insulin 79.799479
BMI 31.992578
DiabetesPedigreeFunction 0.471876
Age 33.240885
Outcome 0.348958
dtype: float64
To find median
median:
Pregnancies 3.0000
Glucose 117.0000
BloodPressure 72.0000
SkinThickness 23.0000
Insulin 30.5000
BMI 32.0000
DiabetesPedigreeFunction 0.3725
Age 29.0000
Outcome 0.0000
dtype: float64
To find mode:
mode:
0 1
Pregnancies 1.000 NaN
Glucose 99.000 100.000
BloodPressure 70.000 NaN
SkinThickness 0.000 NaN
Insulin 0.000 NaN
BMI 32.000 NaN
DiabetesPedigreeFunction 0.254 0.258
Age 22.000 NaN
Outcome 0.000 NaN
Standard Deviation:
Pregnancies 3.369578
Glucose 31.972618
BloodPressure 19.355807
SkinThickness 15.952218
Insulin 115.244002
BMI 7.884160
DiabetesPedigreeFunction 0.331329
Age 11.760232
Outcome 0.476951
dtype: float64
Variance:
Pregnancies 11.354056
Glucose 1022.248314
BloodPressure 374.647271
SkinThickness 254.473245
Insulin 13281.180078
BMI 62.159984
DiabetesPedigreeFunction 0.109779
Age 138.303046
Outcome 0.227483
dtype: float64
Skewness:
Pregnancies 0.901674
Glucose 0.173754
BloodPressure -1.843608
SkinThickness 0.109372
Insulin 2.272251
BMI -0.428982
DiabetesPedigreeFunction 1.919911
Age 1.129597
Outcome 0.635017
dtype: float64
Kurtosis:
Pregnancies 0.159220
Glucose 0.640780
BloodPressure 5.180157
SkinThickness -0.520072
Insulin 7.214260
BMI 3.290443
DiabetesPedigreeFunction 5.594954
Age 0.643159
Outcome -1.600930
dtype: float64
2. Pima Indians Diabetes Dataset:
mean:
Pregnancies 3.845052
Glucose 120.894531
BloodPressure 69.105469
SkinThickness 20.536458
Insulin 79.799479
BMI 31.992578
DiabetesPedigreeFunction 0.471876
Age 33.240885
Outcome 0.348958
dtype: float64
median:
Pregnancies 3.0000
Glucose 117.0000
BloodPressure 72.0000
SkinThickness 23.0000
Insulin 30.5000
BMI 32.0000
DiabetesPedigreeFunction 0.3725
Age 29.0000
Outcome 0.0000
dtype: float64
mode:
0 1
Pregnancies 1.000 NaN
Glucose 99.000 100.000
BloodPressure 70.000 NaN
SkinThickness 0.000 NaN
Insulin 0.000 NaN
BMI 32.000 NaN
DiabetesPedigreeFunction 0.254 0.258
Age 22.000 NaN
Outcome 0.000 NaN
Standard Deviation:
Pregnancies 3.369578
Glucose 31.972618
BloodPressure 19.355807
SkinThickness 15.952218
Insulin 115.244002
BMI 7.884160
DiabetesPedigreeFunction 0.331329
Age 11.760232
Outcome 0.476951
dtype: float64
Variance:
Pregnancies 11.354056
Glucose 1022.248314
BloodPressure 374.647271
SkinThickness 254.473245
Insulin 13281.180078
BMI 62.159984
DiabetesPedigreeFunction 0.109779
Age 138.303046
Outcome 0.227483
dtype: float64
Skewness:
Pregnancies 0.901674
Glucose 0.173754
BloodPressure -1.843608
SkinThickness 0.109372
Insulin 2.272251
BMI -0.428982
DiabetesPedigreeFunction 1.919911
Age 1.129597
Outcome 0.635017
dtype: float64
Kurtosis:
Pregnancies 0.159220
Glucose 0.640780
BloodPressure 5.180157
SkinThickness -0.520072
Insulin 7.214260
BMI 3.290443
DiabetesPedigreeFunction 5.594954
Age 0.643159
Outcome -1.600930
dtype: float64
Result:
Thus univariate analysis is performed on diabetes data set from UCI and Pima Indians
Diabetes data set and executed successfully.
VIVA QUESTIONS :
ASSIGNMENT QUESTIONS :
SL.NO   ASSIGNMENT   CO MAPPING   BT LEVEL   COMPLEXITY
1. Determine whether the following statement refers to univariate (single-variable) or bivariate (two-variable) data: Jen measured the height and number of leaves of each plant in her laboratory. (CO MAPPING: CO4; BT LEVEL: Evaluate; COMPLEXITY: High)
2. Detect the mode(s), if any, for the following sets of data. It may be helpful to order the data first. (CO MAPPING: CO4; BT LEVEL: Evaluate; COMPLEXITY: High)
4, 5, 2, 8, 2, 1, 0, 0, 9, 5, 0
98, 32, 60, 54, 78, 80, 54, 78, 77, 89.
5. Detect the mean, median, mode, and standard deviation of the following list of test scores. You may want to order the list first. (CO MAPPING: CO4; BT LEVEL: Evaluate; COMPLEXITY: High)
98, 32, 60, 54, 78, 80, 54, 78, 77, 89.
Ex.No.5b Bivariate Analysis such as linear regression modeling and logistic regression modeling on diabetes data set from UCI and Pima Indians Diabetes data set.
Aim:
To perform Bivariate Analysis on diabetes data set from UCI and Pima Indians
Diabetes data set.
Algorithm:
Step 1: Download the diabetes data set from UCI and the Pima Indians Diabetes data set.
Step 2: Import the necessary modules and functions.
Step 3: Read the dataset path using read_csv().
Step 4: Perform bivariate analysis such as linear regression and logistic regression on both
datasets.
I. Linear Regression
1. UCI Diabetes Dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
path = r'C:\Users\asus\Desktop\diabetes.csv'
print(path)
data = pd.read_csv(path)
print(data)
x = data['Age']
y = data['BMI']
# the intermediate estimates were omitted in the original listing; the
# standard least-squares formulas are assumed
x_mean = x.mean()
y_mean = y.mean()
B1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
B0 = y_mean - (B1 * x_mean)
print("Slope B1 =", B1, "Intercept B0 =", B0)
2. Pima Indians Diabetes Dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
path = r'C:\Users\asus\Desktop\pima-indians-diabetes.csv'
print(path)
# the file has no header row, so the column names are supplied via names=
# (assigning data.columns after a plain read_csv would drop the first record)
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
        'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(path, header=None, names=cols)
print(data)
x = data['Age']
y = data['BMI']
# B0 and B1 are then computed exactly as in the UCI block above
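To visualize the fit, a short sketch (it assumes B0 and B1 have been computed as above):
plt.scatter(x, y, label='data')
plt.plot(x, B0 + B1 * x, color='red', label='regression line')
plt.xlabel('Age')
plt.ylabel('BMI')
plt.legend()
plt.show()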
II. Logistic Regression:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
DataPath = r'C:\Users\asus\Desktop\pima-indians-diabetes.csv'
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
        'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(DataPath, header=None, names=cols)   # names= supplies the missing header row
x=data.drop("Outcome",axis=1)
y=data[["Outcome"]]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=0)
model=LogisticRegression()
model.fit(x_train,y_train)
y_predict=model.predict(x_test)
model_score=model.score(x_test,y_test)
#Logistic Regression Model Score
print("Logistic Regression Model Score = ",model_score)
#confusion matrix
print("Confusion Matrix : \n",metrics.confusion_matrix(y_test,y_predict))
sns.heatmap(metrics.confusion_matrix(y_test,y_predict), annot=True, fmt='d',
cmap='Blues')
plt.title("LogisticRegression Confusion Matrix")
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.savefig('confusion_matrix.png')
plt.show()
OUTPUT:
Linear Regression.
Logistic Regression.
2. Pima Indians Diabetes dataset
Linear Regression.
Logistic Regression.
Result:
Thus Bivariate analysis such as linear regression modeling and logistic regression modeling
is performed on diabetes data set from UCI and Pima Indians Diabetes data set and executed
successfully.
VIVA QUESTIONS :
ASSIGNMENT QUESTIONS:
SL.NO   ASSIGNMENT   CO MAPPING   BT LEVEL   COMPLEXITY
1. For the following scatter plot, determine if the dots are trying to form a line. If so, approximate the line of best fit. (CO MAPPING: CO4; BT LEVEL: Apply; COMPLEXITY: High)
[Scatter plot figure]
15.6 mm
-156 mm
None of the above
5. X and Y are two variables that have a strong linear relationship. Which of the following statements are incorrect? (CO MAPPING: CO4; BT LEVEL: Understand; COMPLEXITY: High)
There cannot be a negative relationship between the two variables.
The relationship between the two variables is purely causal.
One variable may or may not cause a change in the other variable.
The variables can be positively or negatively correlated with each other.
Ex.No.5C Multiple Regression Analysis on diabetes data set from UCI and Pima Indians Diabetes data set.
Aim:
To perform Multiple Regression Analysis on diabetes data set from UCI and Pima
Indians Diabetes data set.
Algorithm:
Step 1: Download the diabetes data set from UCI and the Pima Indians Diabetes data set.
Step 2: Import the necessary modules and functions.
Step 3: Read the dataset path using read_csv().
Step 4: Perform multiple regression analysis on both datasets.
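Source Code:
The listing below begins at regr.fit(x, y); the setup it relies on was not shown in the original. A minimal sketch, assuming Glucose and Insulin as the two predictors (the predicted point [[500, 200]] matches that choice):
import pandas as pd
from sklearn import linear_model

df = pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
x = df[['Glucose', 'Insulin']]   # predictor columns are an assumption
y = df[['Outcome']]              # 2-D target, matching the [[...]] output shape
regr = linear_model.LinearRegression()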
regr.fit(x, y)
predicted = regr.predict([[500, 200]])
print("Predicted Outcome = ", predicted)
OUTPUT:
1. UCI DATASET:
Predicted Outcome = [[0.86312063]]
Result:
Thus Multiple Regression Analysis on diabetes data set from UCI and Pima Indians Diabetes
data set is performed and executed successfully.
VIVA QUESTIONS :
ASSIGNMENT QUESTIONS :
SL.NO   ASSIGNMENT   CO MAPPING   BT LEVEL   COMPLEXITY
1. Given σx = 3 and the regression equations 8X – 10Y + 66 = 0; 40X – 18Y = 214, find out (i) the mean values of X and Y, (ii) the coefficient of correlation between X and Y, and (iii) the standard deviation of Y. (CO MAPPING: CO4; BT LEVEL: Create; COMPLEXITY: High)
3. Use the Lung Cancer dataset from UCI for performing the following: … regression? (CO MAPPING: CO4; BT LEVEL: Create; COMPLEXITY: High)
Ex.No.5D Result Comparison of diabetes data set from UCI and Pima Indians Diabetes data set.
Aim:
To perform result comparison analysis on diabetes data set from UCI and Pima
Indians Diabetes data set.
Algorithm:
Step 1: Download the diabetes data set from UCI and the Pima Indians Diabetes data set.
Step 2: Import the necessary modules and functions.
Step 3: Read the dataset path using read_csv().
Step 4: Perform result comparison analysis on both datasets using the statsmodels
GLM.from_formula(), fit() and summary() functions.
Step 5: Print the result using print().
Source Code:
1. UCI DIABETES DATASET:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import statsmodels.api as sm
df2 = pd.read_csv(r"C:\Users\asus\Desktop\diabetes.csv")
print(df2.shape)
df2.head(5)
model = sm.GLM.from_formula(
    "Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction + Age",
    family=sm.families.Binomial(), data=df2)
result = model.fit()
result.summary()
print(result.summary())
2. PIMA INDIANS DIABETES DATASETS:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import statsmodels.api as sm
# the Pima file has no header row; the column names are supplied via names=
# (this avoids losing the first data row)
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
        'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df2 = pd.read_csv(r"C:\Users\asus\Desktop\pima-indians-diabetes.csv", header=None, names=cols)
print(df2.shape)
df2.head(5)
model = sm.GLM.from_formula(
    "Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction + Age",
    family=sm.families.Binomial(), data=df2)
result = model.fit()
result.summary()
print(result.summary())
OUTPUT:
2. Pima Indians Diabetes Dataset:
Result:
Thus result comparison is performed on diabetes data set from UCI and Pima Indians
Diabetes data set and executed successfully.
VIVA QUESTIONS :
1. What is Statsmodels?
Statsmodels is a Python package that allows users to explore data, estimate statistical models, and
perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions,
and result statistics are available for different types of data and each estimator.
ASSIGNMENT QUESTIONS :
SL.NO   ASSIGNMENT   CO MAPPING   BT LEVEL   COMPLEXITY
1. Compare the results of the univariate and bivariate analysis for the UCI diabetes dataset. (CO MAPPING: CO4; BT LEVEL: Understand; COMPLEXITY: High)
2. How do you interpret a regression formula? (CO MAPPING: CO4; BT LEVEL: Apply; COMPLEXITY: High)
Ex.No.6 Apply and explore various plotting functions on UCI data sets.
Aim:
To apply and explore various plotting functions on diabetes data set from UCI
Algorithm:
Step 1: Download the diabetes data set from UCI.
Step 2: Import the necessary modules and functions.
Step 3: Read the dataset path using read_csv().
Step 4: Using norm(), plot the normal curve for the UCI diabetes dataset variables.
Step 5: Using lineplot(), plot the line plot for variables in the UCI diabetes dataset.
Step 6: Using scatterplot(), plot the scatter plot for the UCI diabetes dataset variables.
Step 7: Using distplot(), plot the density plot for the variables in the UCI diabetes dataset.
Step 8: Using contour(), plot the contour plot for the UCI diabetes dataset variables.
Step 9: Using hist(), plot the histogram for the UCI diabetes dataset variables.
Source Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
DataPath = r'C:\Users\asus\Desktop\diabetes.csv'   # imports and path assumed, matching the earlier exercises
df = pd.read_csv(DataPath)
df.head()
#Line Plot for Diabetes Dataset
sns.lineplot(df['BloodPressure'],df['Age'], hue =df["Outcome"])
plt.title("Lineplot for Diabetes Dataset")
plt.show()
#Scatter Plot for Diabetes Dataset
sns.scatterplot(df['BloodPressure'],df['Age'], hue =df["Outcome"])
plt.title("Scatterplot for Diabetes Dataset")
plt.show()
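Steps 4 and 7 of the algorithm call for a normal curve and a density plot, but no listing is given for them; a minimal sketch (the column choice is an assumption, and distplot is the older seaborn API used throughout this exercise):
#Density Plot with fitted normal curve for Diabetes Dataset
from scipy.stats import norm
sns.distplot(df['Glucose'], fit=norm)
plt.title("Density Plot for Glucose")
plt.show()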
Histogram:
1. histogram of all columns:
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r'C:\Users\asus\Desktop\diabetes.csv')
df.hist(figsize=(10,10),color='red')
#3D ScatterPlot:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['Pregnancies'], df['Glucose'], df['BloodPressure'], c='skyblue', s=60)
ax.view_init(30, 185)
plt.show()
import pandas as pd
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection="3d")
def z_function(x, y):
    return np.sin(x)**10 + np.cos(10 + y*x) * np.cos(x)
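The listing defines z_function but stops before plotting it; a sketch of the remaining steps (the grid ranges are assumptions):
# evaluate z_function on a grid and draw a 3-D contour plot
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)
X, Y = np.meshgrid(x, y)
Z = z_function(X, Y)
ax.contour3D(X, Y, Z, 50, cmap='viridis')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()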
OUTPUT:
1. histogram of all columns:
3. histogram of Glucose column
5. histogram of Skin Thickness column
8. histogram of Diabetes Pedigree Function column
10. histogram of Outcome(results) column
#3D Line Plots
Result:
Thus various plotting functions are applied and explored on the UCI diabetes data set successfully.
VIVA QUESTIONS :
2. What are scatter plots?
Scatter plots are graphs that present the relationship between two variables in a dataset.
A scatter plot represents data points on a two-dimensional plane or on a Cartesian system. The
independent variable or attribute is plotted on the X-axis, while the dependent variable
is plotted on the Y-axis.
ASSIGNMENT QUESTIONS :
SL.NO   ASSIGNMENT   CO MAPPING   BT LEVEL   COMPLEXITY
1. The Gapminder dataset provides population data from 1952 to 2007 (at 5 year intervals) for several countries around the world. Compare the populations of the European countries France, United Kingdom, Italy, Germany and Spain over this period using a line chart. Make appropriate modifications to the chart title, axis titles, legend, figure size, font size, colors etc. to make the chart readable and visually appealing. (CO MAPPING: CO5; BT LEVEL: Evaluate; COMPLEXITY: High)
Hints (not all of these may be useful):
You can use either Matplotlib or Plotly to create this chart.
To select the data for the given countries, you may find the isin method of a Pandas series useful.
2. diamonds_url points to a CSV file containing various attributes like carat, cut, color, clarity, price etc. for over 53,000 diamonds. Visualize the relationship between the carat (size of diamond) and price using a scatter plot. Instead of using the entire dataset for this visualization, just pick the diamonds with a clarity "SI2" and color "E". Use the values of the "cut" column to color the dots in the scatter plot. Make appropriate modifications to the chart title, axis titles, legend, figure size, font size, colors etc. to make the chart readable and visually appealing. (CO MAPPING: CO5; BT LEVEL: Create; COMPLEXITY: High)
Hints (not all of these may be useful):
You can use Seaborn or Plotly to create the scatter plot for this dataset.
Check this stackoverflow answer for selecting data frame rows using multiple conditions.
3. The Planets dataset contains details about the 1,000+ extrasolar planets discovered up to 2014. Visualize the distribution of the masses of the planets (expressed as a multiple of the mass of Jupiter), using a histogram and a box plot. Make appropriate modifications to the chart title, axis titles, legend, figure size, font size, colors etc. to make the chart readable and visually appealing. (CO MAPPING: CO5; BT LEVEL: Create; COMPLEXITY: High)
Hints:
You can use Matplotlib, Seaborn or Plotly to create these plots.
If you're using Plotly, you can show both charts together (use the marginal argument of px.histogram).
4. The Job Automation Probability dataset, created during a Future of Employment study from 2013, estimates the probability of different jobs being automated in the 21st century due to computerization. Create a bar chart to show the 25 jobs requiring a "Bachelor's degree" (and no higher qualification) that are most likely to be automated. Make appropriate modifications to the chart title, axis titles, legend, figure size, font size, colors etc. to make the chart readable and visually appealing. (CO MAPPING: CO5; BT LEVEL: Create; COMPLEXITY: High)
Ex.No:7 Visualizing Geographic Data with Basemap
Aim:
To Visualize Geographic Data using Basemap
Algorithm:
Step 1: Install the basemap toolkit and import numpy, matplotlib and Basemap (mpl_toolkits.basemap).
Step 2: Create a Basemap object and draw the coastlines.
Step 3: Draw the country boundaries.
Step 4: Draw the latitude and longitude lines.
Step 5: Locate a specific region by giving its bounding-box coordinates.
Source Code:
#Coastlines
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines()
plt.title("Coastlines", fontsize=20)
plt.show()
#Country_Boundaries
#latitude and longitudes
#locating a region
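The three headings above have no listings in the original; minimal sketches for each (the coordinate values are assumptions):
# country boundaries
fig = plt.figure(figsize=(12, 12))
m = Basemap()
m.drawcoastlines()
m.drawcountries()
plt.title("Country Boundaries", fontsize=20)
plt.show()

# latitude and longitude lines, drawn every 30 and 60 degrees
fig = plt.figure(figsize=(12, 12))
m = Basemap()
m.drawcoastlines()
m.drawparallels(np.arange(-90, 91, 30), labels=[1, 0, 0, 0])
m.drawmeridians(np.arange(-180, 181, 60), labels=[0, 0, 0, 1])
plt.show()

# locating a region: a bounding box around India (coordinates assumed)
fig = plt.figure(figsize=(12, 12))
m = Basemap(llcrnrlon=65, llcrnrlat=5, urcrnrlon=100, urcrnrlat=40)
m.drawcoastlines()
m.drawcountries()
plt.show()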
Result:
Thus Visualization of Geographic Data is done and executed successfully using Basemap.
VIVA QUESTIONS :
ASSIGNMENT QUESTIONS:
SL.NO   ASSIGNMENT   CO MAPPING   BT LEVEL   COMPLEXITY
1. How to plot GIS databases in Python with Basemap? (CO MAPPING: CO5; BT LEVEL: Apply; COMPLEXITY: High)
Ex.No.8 Analyzing Selling Price of used Cars
Aim:
To Analyse the selling price of used cars
Algorithm:
Step 1: Download the dataset from
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
Step 2: Find the path location of the downloaded dataset and convert it into CSV format.
Step 3: Import the packages.
Step 4: Set the path to the data file (.csv file).
Step 5: Find if there are any null or NaN values in the file. If any, remove them.
Step 6: Perform various data cleaning and data visualization operations on the data.
Step 7: Obtain the result.
Source Code:
Import the modules:
# importing section
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
Check the first five entries of the dataset:
# using the Csv file
df = pd.read_csv('output.csv')
# Checking the first 5 entries of dataset
df.head()
Defining headers for our dataset.
headers = ["symboling", "normalized-losses", "make",
"fuel-type", "aspiration","num-of-doors",
"body-style","drive-wheels", "engine-location",
"wheel-base","length", "width","height", "curb-weight",
"engine-type","num-of-cylinders", "engine-size",
"fuel-system","bore","stroke", "compression-ratio",
"horsepower", "peak-rpm","city-mpg","highway-mpg","price"]
df.columns=headers
df.head()
Finding the missing values, if any.
data = df
# Finding the missing values
data.isna().any()
# Finding if missing values
data.isnull().any()
Converting mpg to L/100km and checking the data type of each column.
# converting mpg to L / 100km
data['city-mpg'] = 235 / df['city-mpg']
# note: the rename key must match the existing 'city-mpg' column name
# (the original listing used 'city_mpg', which would silently do nothing)
data.rename(columns={'city-mpg': "city-L / 100km"}, inplace=True)
print(data.columns)
# checking the data type of each column
data.dtypes
Price is of object type (string); it should be int or float:
data.price.unique()
# Here it contains '?', so we drop those rows
data = data[data.price != '?']
# convert price to an integer type (this conversion was omitted in the
# original listing but is needed for the binning below)
data['price'] = data['price'].astype(int)
# checking it again
data.dtypes
Normalizing values by using the simple feature scaling method (examples; do the same for the rest) and binning (grouping values):
data['length'] = data['length'] / data['length'].max()
data['width'] = data['width'] / data['width'].max()
data['height'] = data['height'] / data['height'].max()
# binning prices into three groups (these lines were omitted in the original
# listing; the usual pd.cut approach is assumed)
bins = np.linspace(min(data['price']), max(data['price']), 4)
group_names = ['Low', 'Medium', 'High']
data['price-binned'] = pd.cut(data['price'], bins, labels=group_names, include_lowest=True)
print(data['price-binned'])
plt.hist(data['price-binned'])
plt.show()
Doing a descriptive analysis of the data.
# descriptive analysis
# NaN are skipped
data.describe()
Converting categorical variables to numerical values.
# categorical to numerical variables
pd.get_dummies(data['fuel-type']).head()
Plotting box plots of the price (overall, and grouped by drive wheels).
# examples of box plot
plt.boxplot(data['price'])
# by using seaborn
sns.boxplot(x ='drive-wheels', y ='price', data = data)
# grouping by drive-wheels and body-style; the 'test' subset was not defined
# in the original listing, so it is assumed here
test = data[['drive-wheels', 'body-style', 'price']]
data_grp = test.groupby(['drive-wheels', 'body-style'],
                        as_index=False).mean()
data_grp
Using the pivot method and plotting the heatmap according to the data obtained by the pivot method.
# pivot method
data_pivot = data_grp.pivot(index='drive-wheels',
                            columns='body-style')
data_pivot
# heatmap of the pivoted data (these plotting lines were omitted in the
# original listing and are assumed)
plt.pcolor(data_pivot, cmap='RdBu')
plt.colorbar()
plt.show()
Output:
Converting mpg to L/100km and checking the data type of each column.
Normalizing values by using simple feature scaling method
Grouping the data according to wheel, body-style and price:
Final result:
Result:
Thus the selling price of used cars is analyzed successfully.
Ex.No 9 Loan Approval Prediction
Aim:
To create a loan approval prediction model that checks whether the applicant's profile is relevant
for the loan to be granted or not.
Algorithm:
Step 1: Download the Loan Approval Prediction dataset from
https://drive.google.com/file/d/1LIvIdqdHDFEGnfzIgEh4L6GFirzsE3US/view
Step 2: Find the path location of the downloaded dataset.
Step 3: Import the Pandas, Seaborn and Matplotlib libraries.
Step 4: Perform data preprocessing by getting the number of columns of object datatype.
Step 5: Visualize all the unique values in the columns using a barplot.
Step 6: Draw a heatmap showing the correlation between Loan Amount and Applicant Income; it also
shows that Credit_History has a high impact on Loan Status.
Step 7: Find out whether there are any missing values in the dataset.
Step 8: If there are no missing values, proceed to model training using the KNeighbors Classifier,
Random Forest Classifier, Support Vector Classifier (SVC) and Logistic Regression.
Step 9: Finally, the best model classifier is found to predict the loan approval process.
Source Code:
Importing Libraries and Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("LoanApprovalPrediction.csv")
data.head(5)
Data Preprocessing and Visualization
obj = (data.dtypes == 'object')
print("Categorical variables:",len(list(obj[obj].index)))
# Dropping Loan_ID column
data.drop(['Loan_ID'],axis=1,inplace=True)
obj = (data.dtypes == 'object')
object_cols = list(obj[obj].index)
plt.figure(figsize=(18, 36))
index = 1
for col in object_cols:
    y = data[col].value_counts()
    plt.subplot(11, 4, index)
    plt.xticks(rotation=90)
    sns.barplot(x=list(y.index), y=y)
    index += 1
# Import label encoder and convert the categorical columns to numeric values
# (the encoding loop was omitted in the original listing and is assumed here)
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
for col in object_cols:
    data[col] = label_encoder.fit_transform(data[col])
sns.heatmap(data.corr(), cmap='BrBG', fmt='.2f',
            linewidths=2, annot=True)
sns.catplot(x="Gender", y="Married",
hue="Loan_Status",
kind="bar",
data=data)
for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())
data.isna().sum()
Splitting Dataset:
from sklearn.model_selection import train_test_split
X = data.drop(['Loan_Status'],axis=1)
Y = data['Loan_Status']
X.shape,Y.shape
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=1)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
Model Training and Evaluation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
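The listing stops after the imports; a minimal training-and-evaluation sketch (the hyperparameter values are assumptions):
from sklearn import metrics

knn = KNeighborsClassifier(n_neighbors=3)
rfc = RandomForestClassifier(n_estimators=7, criterion='entropy', random_state=7)
svc = SVC()
lc = LogisticRegression()

# fit each classifier on the training split and report its test accuracy
for clf in (rfc, knn, svc, lc):
    clf.fit(X_train, Y_train)
    Y_pred = clf.predict(X_test)
    print("Accuracy score of", clf.__class__.__name__, "=",
          100 * metrics.accuracy_score(Y_test, Y_pred))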
Output:
Finding Missing Values:
Splitting Dataset:
Prediction on the test set:
Result:
Thus to create loan Approval Prediction to check whether the applicant’s profile is relevant to be
granted with loan or not is implemented successfully.
Ex.No.10 Eye Colour Detection
Aim:
To detect the eye colour of a person in an image using OpenCV-Python.
Algorithm:
Step 1: Import the required libraries:
import cv2
import numpy as np
Step 2: Read the input image.
Step 3: Detect the face in the image.
Step 4: Use an eye detection algorithm to locate the eyes in the image.
Step 5: Extract the eye regions from the image.
Step 6: Analyze the color distribution within the eye regions to determine the predominant eye color.
Step 7: Display the original image with bounding boxes around detected eyes and visualize the
determined eye color.
Source Code:
import tensorflow as tf
import sys
import os
import numpy as np
import cv2
import argparse
import time
from mtcnn.mtcnn import MTCNN   # this import was missing in the original listing
detector = MTCNN()
parser = argparse.ArgumentParser()
parser.add_argument('--input_path', default="./images/Ranveer-singh.jpg")
parser.add_argument('--input_type', default='image')
opt = parser.parse_args()
class_name = ("Blue", "Blue Gray", "Brown", "Brown Gray", "Brown Black", "Green",
"Green Gray", "Other")
EyeColor = {
    # (the table of HSV ranges per class was omitted in the original listing)
}
def check_color(hsv, color):   # the function header was missing in the original listing and is assumed
    if (hsv[0] >= color[0][0]) and (hsv[0] <= color[1][0]) and (hsv[1] >= color[0][1]) and \
            (hsv[1] <= color[1][1]) and (hsv[2] >= color[0][2]) and (hsv[2] <= color[1][2]):
        return True
    else:
        return False
def find_class(hsv):
    color_id = 7
    for i in range(len(class_name) - 1):
        if check_color(hsv, EyeColor[class_name[i]]):   # this test was missing in the original listing and is assumed
            color_id = i
    return color_id
def eye_color(image):
    h, w = image.shape[0:2]
    result = detector.detect_faces(image)
    if result == []:
        return
    bounding_box = result[0]['box']
    left_eye = result[0]['keypoints']['left_eye']
    right_eye = result[0]['keypoints']['right_eye']
    eye_distance = np.linalg.norm(np.array(left_eye) - np.array(right_eye))
    # the second rectangle corner was missing in the original listing; the
    # usual (x + width, y + height) corner is assumed
    cv2.rectangle(image,
                  (bounding_box[0], bounding_box[1]),
                  (bounding_box[0] + bounding_box[2], bounding_box[1] + bounding_box[3]),
                  (255, 155, 255),
                  2)
    # (the construction of imgHSV, the HSV image, and imgMask, the circular
    # eye masks, was omitted in the original listing)
    eye_class = np.zeros(len(class_name), np.float32)
    for y in range(h):
        for x in range(w):
            if imgMask[y, x] != 0:
                eye_class[find_class(imgHSV[y, x])] += 1
    main_color_index = np.argmax(eye_class[:len(eye_class) - 1])
    total_vote = eye_class.sum()
    for i in range(len(class_name)):
        print(class_name[i], ":", round(eye_class[i] / total_vote * 100, 2), "%")   # print body assumed
    cv2.imshow('EYE-COLOR-DETECTION', image)
# image input
if opt.input_type == 'image':
    image = cv2.imread(opt.input_path)   # read step assumed; it was missing in the original listing
    eye_color(image)
    cv2.imwrite('sample/result.jpg', image)
    cv2.waitKey(0)
else:
    pass   # (the video-capture branch was omitted in the original listing)
Eyecolor-1.py:
import tkinter as tk
import cv2
import numpy as np
def detect_eye_color(image_path):
    image = cv2.imread(image_path)
    eyes = []
    # (the eye-contour detection steps were omitted in the original listing;
    # each detected contour is reduced to its bounding rectangle)
    x, y, w, h = cv2.boundingRect(contour)
    aspect_ratio = w / float(h)
    eyes.append((x, y, w, h))
    mean_color = cv2.mean(eye)
    print("Eye color: Blue/Gray")

root = tk.Tk()
# (the file-dialog code that sets file_path was omitted in the original listing)
detect_eye_color(file_path)
root.destroy()
result.py:
import cv2
def detect_eyes(image_path):
    img = cv2.imread(image_path)
    # (the eye-detection steps were omitted in the original listing)
    return img
# Replace 'path_to_your_image.jpg' with the path of the image you want to test
image_path = "./images/Ranveer-singh.jpg"
# Detect eyes in the image
result_image = detect_eyes(image_path)
cv2.imshow('Detected Eyes', result_image)   # display step assumed
cv2.waitKey(0)
cv2.destroyAllWindows()
Output:
Result:
Thus eye colour detection using OpenCV-Python is implemented and the eye colour of the person is
detected successfully.