Data Science Using Python Lab Manual
LIST OF EXPERIMENTS
1. Creating a NumPy Array
   a) Basic nd-array
   b) Array of zeros
   c) Array of ones
   d) Random numbers in nd-array
   e) An array of your choice
   f) Identity matrix in NumPy
   g) Evenly spaced nd-array
2. The Shape and Reshaping of NumPy Array
3. Expanding and Squeezing a NumPy Array
4. Indexing and Slicing of NumPy Array
5. Stacking and Concatenating NumPy Arrays
6. Operations with Pandas
7. Pandas Operations: Handling NaN, Sorting, and Grouping Data
8. Read the following file formats using pandas
   a) Text files
   b) CSV files
   c) Excel files
   d) JSON files
9. Reading and Importing Data from Various File Formats and Databases using Python
10. Web Scraping using Python
11. Preprocessing Techniques for Loan Prediction Dataset
12. Data Visualizations using Matplotlib
13. Getting started with NLTK, install NLTK using PIP
14. Python program to implement with Python Sci Kit-Learn & NLTK
15. Python program to implement text processing using NLTK/Spacy/PyNLPI
1. CREATING A NUMPY ARRAY
AIM: To create a NumPy array and perform various operations such as creating basic nd-
arrays, arrays of zeros and ones, generating random numbers, and implementing specific
array types and transformations.
DESCRIPTION:
NumPy stands for Numerical Python and is one of the most useful scientific libraries in
Python programming. It provides support for large multidimensional array objects and
various tools to work with them. Various other libraries like Pandas, Matplotlib, and Scikit-learn are built on top of this amazing library.
Arrays are collections of elements or values that can have one or more dimensions. An array with one dimension is called a vector, while an array with two dimensions is called a matrix.
NumPy arrays are called ndarray or N-dimensional arrays and they store elements of the
same type and size. It is known for its high-performance and provides efficient storage and
data operations as arrays grow in size.
NumPy comes pre-installed when you download Anaconda. But if you want to install NumPy
separately on your machine, just type the below command on your terminal:
pip install numpy
Now you need to import the library:
import numpy as np
(np is the de facto abbreviation for NumPy used by the data science community.)
PROGRAM:
a) Basic nd-array
np.array([1,2,0,3,4])
o/p:
array([1, 2, 0, 3, 4])
np.array([[1,2,3,4],[5,6,7,8]])
o/p:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
b) Array of zeros
np.zeros((2,3))
o/p:
array([[0., 0., 0.],
[0., 0., 0.]])
c) Array of ones
np.ones(5,dtype=np.int32)
o/p: array([1, 1, 1, 1, 1])
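d) Random numbers in nd-array
(Note: parts d and e appear in the experiment list but are missing from the program; a minimal sketch follows. np.random.random draws uniform values from [0, 1), so its output changes on every run.)
np.random.random((2,3))
e) An array of your choice
np.full((2,2),7)
o/p:
array([[7, 7],
       [7, 7]])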
f) Identity matrix in NumPy
np.eye(3)
o/p:
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
np.eye(3,k=1)
o/p:
array([[0., 1., 0.],
[0., 0., 1.],
[0., 0., 0.]])
g) Evenly spaced nd-array
np.arange(2,10,2)
o/p: array([2, 4, 6, 8])
RESULT: Hence, a NumPy array has been successfully created, and various operations such
as creating basic nd-arrays, arrays of zeros and ones, generating random numbers, and
implementing specific array types and transformations have been performed.
2. THE SHAPE AND RESHAPING OF NUMPY ARRAY
AIM: To understand and demonstrate the concepts of shape and reshaping in NumPy arrays,
including dimensions, shape, size, reshaping, flattening, and transposing.
DESCRIPTION:
NumPy, a fundamental library for numerical computing in Python, provides powerful tools
for handling multi-dimensional arrays. One of its key features is the ability to manipulate the
shape and structure of arrays efficiently. This exercise explores various aspects of handling
and modifying NumPy arrays:
1. Dimensions of a NumPy Array:
The number of dimensions (axes) of an array is called its rank. It can be determined
using the ndim attribute.
2. Shape of a NumPy Array:
The shape of an array is a tuple that indicates the size of the array along each
dimension. For example, a 2x3 array has a shape of (2, 3).
3. Size of a NumPy Array:
The total number of elements in an array can be found using the size attribute.
4. Reshaping a NumPy Array:
Reshaping allows us to change the structure of an array without altering its data. For
instance, a 1D array of size 6 can be reshaped into a 2x3 2D array using the reshape()
method.
5. Flattening a NumPy Array:
Flattening converts a multi-dimensional array into a 1D array. This is useful for linear
data processing and can be achieved using the flatten() method.
6. Transpose of a NumPy Array:
The transpose of an array swaps its axes, such as converting rows to columns in a 2D
array. This can be done using the T attribute.
These operations are fundamental for data manipulation in machine learning, data
preprocessing, and other numerical computation tasks.
PROGRAM:
a) Dimensions of NumPy array
import numpy as np
a = np.array([[5,10,15],[20,25,20]])
print('Array :','\n',a)
print('Dimensions :','\n',a.ndim)
o/p:
Array :
[[ 5 10 15]
[20 25 20]]
Dimensions :
2
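b) Shape and size of a NumPy array
(Note: parts b and c are described above but missing from the program; a minimal sketch using the same array:)
print('Shape :','\n',a.shape)
print('Size :','\n',a.size)
o/p:
Shape :
(2, 3)
Size :
6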
d) Reshaping a NumPy array
a = np.array([3,6,9,12])
np.reshape(a,(2,2))
o/p:
array([[ 3, 6],
[ 9, 12]])
a = np.array([3,6,9,12,18,24])
print('Three rows :','\n',np.reshape(a,(3,-1)))
print('Three columns :','\n',np.reshape(a,(-1,3)))
o/p:
Three rows :
[[ 3 6]
[ 9 12]
[18 24]]
Three columns :
[[ 3 6 9]
[12 18 24]]
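e) Flattening a NumPy array
(Note: the code for this part is missing in the source; a minimal reconstruction consistent with the output shown below, assuming a 2x2 array of ones:)
a = np.ones((2,2))
b = a.flatten()
c = a.ravel()
print('Original shape :', a.shape)
print('Array :','\n',a)
print('Shape after flatten :', b.shape)
print('Array :','\n',b)
print('Shape after ravel :', c.shape)
print('Array :','\n',c)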
o/p:
Original shape : (2, 2)
Array :
[[1. 1.]
[1. 1.]]
Shape after flatten : (4,)
Array :
[1. 1. 1. 1.]
Shape after ravel : (4,)
Array :
[1. 1. 1. 1.]
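f) Transpose of a NumPy array
(Note: this part is described above but missing from the program; a minimal sketch:)
a = np.array([[1,2,3],[4,5,6]])
print('Original :','\n',a)
print('Transpose :','\n',a.T)
o/p:
Original :
[[1 2 3]
 [4 5 6]]
Transpose :
[[1 4]
 [2 5]
 [3 6]]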
RESULT: Hence, the concepts of shape and reshaping in NumPy arrays, including
dimensions, shape, size, reshaping, flattening, and transposing, have been successfully
understood and demonstrated.
3. EXPANDING AND SQUEEZING A NUMPY ARRAY
AIM: To demonstrate the concepts of expanding and squeezing NumPy arrays and sorting
elements within a NumPy array.
DESCRIPTION:
NumPy provides efficient ways to manipulate the shape and structure of arrays, enabling
versatile data processing. This task involves the following operations:
1. Expanding a NumPy Array:
Expanding refers to adding a new axis to an array. This is achieved using the
np.expand_dims() function or slicing with np.newaxis. It is particularly useful for
reshaping data to match specific dimensions required for computations.
2. Squeezing a NumPy Array:
Squeezing removes dimensions of size 1 from an array. This is done using the
np.squeeze() function. It is often used to reduce unnecessary dimensions and simplify
data structures.
3. Sorting in NumPy Arrays:
Sorting rearranges the elements of an array in ascending or descending order. NumPy
offers the np.sort() function for sorting along a specified axis or the flattened array.
These operations are essential for preparing data for analysis, ensuring compatibility with
machine learning models, and enhancing the clarity of data structures.
PROGRAM:
a) Expanding a NumPy array
import numpy as np
a = np.array([1,2,3])
b = np.expand_dims(a,axis=0)
c = np.expand_dims(a,axis=1)
print('Original:','\n','Shape',a.shape,'\n',a)
print('Expand along columns:','\n','Shape',b.shape,'\n',b)
print('Expand along rows:','\n','Shape',c.shape,'\n',c)
o/p:
Original:
Shape (3,)
[1 2 3]
Expand along columns:
Shape (1, 3)
[[1 2 3]]
Expand along rows:
Shape (3, 1)
[[1]
[2]
[3]]
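b) Squeezing a NumPy array
(Note: this part is described above but missing from the program; a minimal sketch:)
a = np.array([[[1,2,3]]])
b = np.squeeze(a)
print('Original:','\n','Shape',a.shape,'\n',a)
print('Squeezed:','\n','Shape',b.shape,'\n',b)
o/p:
Original:
Shape (1, 1, 3)
[[[1 2 3]]]
Squeezed:
Shape (3,)
[1 2 3]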
c) Sorting a NumPy array
a = np.array([[5,6,7,4],
              [9,2,3,7]])
print('Sort along column :','\n',np.sort(a, kind='mergesort',axis=1))
print('Sort along row :','\n',np.sort(a, kind='mergesort',axis=0))
o/p:
Sort along column :
[[4 5 6 7]
[2 3 7 9]]
Sort along row :
[[5 2 3 4]
[9 6 7 7]]
RESULT: Hence, the concepts of expanding and squeezing NumPy arrays, as well as sorting
elements within a NumPy array, have been successfully demonstrated.
4. INDEXING AND SLICING OF NUMPY ARRAY
AIM: To understand and demonstrate the concepts of indexing and slicing in NumPy arrays,
including slicing 1D, 2D, and 3D arrays, as well as negative slicing techniques.
DESCRIPTION:
Indexing and slicing are fundamental operations for accessing and manipulating elements
within NumPy arrays. This task focuses on the following operations:
1. Slicing 1-D NumPy Arrays:
In 1D arrays, slicing allows us to extract specific ranges of elements. The slicing
syntax [start:stop:step] is used to access subsets of the array.
2. Slicing 2-D NumPy Arrays:
For 2D arrays, slicing can be done in both row and column directions. Using the
syntax [start_row:end_row, start_col:end_col], we can extract subarrays by specifying
the rows and columns to include.
3. Slicing 3-D NumPy Arrays:
For 3D arrays, slicing works similarly, but it adds an additional dimension. We can
slice the array along all three axes, using the syntax [start_layer:end_layer,
start_row:end_row, start_col:end_col].
4. Negative Slicing of NumPy Arrays:
Negative indexing allows us to access elements from the end of the array. Negative slicing uses negative indices inside the [start:stop:step] syntax, such as a[-3:-1] or a step of -1, to refer to elements counting from the end of the array, which is particularly useful for reversing arrays or accessing elements from the back.
These operations are essential for extracting and modifying subsets of data in NumPy, and
they are widely used in data manipulation, machine learning, and scientific computing tasks.
PROGRAM:
a) Slicing 1-D NumPy arrays
import numpy as np
a = np.array([1,2,3,4,5,6])
print(a[1:6:2])
o/p: [2 4 6]
a = np.array([1,2,3,4,5,6])
print(a[:6:2])
print(a[1::2])
print(a[1:6:])
o/p:
[1 3 5]
[2 4 6]
[2 3 4 5 6]
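b) Slicing 2-D NumPy arrays
(Note: this part is missing from the source; a minimal sketch:)
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(a[0:2, 1:3])
o/p:
[[2 3]
 [5 6]]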
c) Slicing 3-D NumPy arrays
(Note: the code and the first line of the output for this part are lost in the source; reconstructed to match the remaining printout, assuming an array of the values 1-18 with shape (3, 3, 2):)
a = np.arange(1,19).reshape(3,3,2)
print(a)
o/p:
[[[ 1  2]
  [ 3  4]
  [ 5  6]]

 [[ 7  8]
  [ 9 10]
  [11 12]]

 [[13 14]
  [15 16]
  [17 18]]]
d) Negative slicing of NumPy arrays
(Note: the array definition is missing in the source; reconstructed to match the output:)
a = np.array([[1,2,3,4,5],[6,7,8,9,10]])
print(a[:,-1:-3:-1])
o/p:
[[ 5  4]
 [10  9]]
RESULT: Hence, the concepts of indexing and slicing in NumPy arrays, including slicing
1D, 2D, and 3D arrays, as well as negative slicing techniques, have been successfully
understood and demonstrated.
5. STACKING AND CONCATENATING NUMPY ARRAYS
AIM: To understand and demonstrate the concepts of stacking and concatenating NumPy
arrays, along with broadcasting operations in NumPy.
DESCRIPTION:
Stacking and concatenating are operations used to combine multiple arrays into a single array,
while broadcasting allows for operations between arrays of different shapes. This task focuses
on the following operations:
1. Stacking ndarrays:
Stacking is the process of combining multiple arrays along a new axis. This can be
done using functions like np.vstack() (vertical stacking), np.hstack() (horizontal
stacking), or np.dstack() (depth stacking). It creates higher-dimensional arrays by
stacking lower-dimensional arrays.
2. Concatenating ndarrays:
Concatenation involves joining multiple arrays along an existing axis. The
np.concatenate() function is used to combine arrays along a specified axis (0 for rows,
1 for columns, etc.), resulting in a larger array with the same number of dimensions.
3. Broadcasting in NumPy Arrays:
Broadcasting refers to the ability of NumPy to perform arithmetic operations on
arrays of different shapes by automatically expanding the smaller array to match the
larger one. This allows element-wise operations between arrays without explicit
reshaping.
These operations are essential for manipulating and combining data in NumPy arrays,
allowing for more complex computations in data analysis, machine learning, and scientific
computing.
PROGRAM:
a) Stacking nd-arrays
import numpy as np
a = np.arange(0,5)
b = np.arange(5,10)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Vertical stacking :','\n',np.vstack((a,b)))
print('Horizontal stacking :','\n',np.hstack((a,b)))
o/p:
Array 1 :
[0 1 2 3 4]
Array 2 :
[5 6 7 8 9]
Vertical stacking :
[[0 1 2 3 4]
[5 6 7 8 9]]
Horizontal stacking :
[0 1 2 3 4 5 6 7 8 9]
a = [[1,2],[3,4]]
b = [[5,6],[7,8]]
c = np.dstack((a,b))
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Dstack :','\n',c)
print(c.shape)
o/p:
Array 1 :
[[1, 2], [3, 4]]
Array 2 :
[[5, 6], [7, 8]]
Dstack :
[[[1 5]
[2 6]]
[[3 7]
[4 8]]]
(2, 2, 2)
b) Concatenating nd-arrays
a = np.arange(0,5).reshape(1,5)
b = np.arange(5,10).reshape(1,5)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Concatenate along rows :','\n',np.concatenate((a,b),axis=0))
print('Concatenate along columns :','\n',np.concatenate((a,b),axis=1))
o/p:
Array 1 :
[[0 1 2 3 4]]
Array 2 :
[[5 6 7 8 9]]
Concatenate along rows :
[[0 1 2 3 4]
[5 6 7 8 9]]
Concatenate along columns :
[[0 1 2 3 4 5 6 7 8 9]]
a = np.array([[1,2],
[3,4]])
np.append(a,[[5,6]], axis=0)
o/p:
array([[1, 2],
[3, 4],
[5, 6]])
c) Broadcasting in NumPy arrays
(Note: the code preceding this output is missing in the source; a reconstruction consistent with the output, assuming a 1-D array of five values broadcast against a 2x5 array of twos:)
a = np.arange(10,20,2)
b = np.full((2,5),2)
print('Adding two different size arrays :','\n',a+b)
print('Multiplying an ndarray and a number :',a*2)
o/p:
Adding two different size arrays :
[[12 14 16 18 20]
 [12 14 16 18 20]]
Multiplying an ndarray and a number : [20 24 28 32 36]
RESULT: Hence, the concepts of stacking and concatenating NumPy arrays, along with
broadcasting operations, have been successfully understood and demonstrated.
6. OPERATIONS WITH PANDAS
AIM: To understand and demonstrate how to create a DataFrame, perform concatenation, set
conditions, and add new columns in a Pandas DataFrame.
DESCRIPTION:
Pandas is one of the most popular and powerful data science libraries in Python. It can be
considered as the stepping stone for any aspiring data scientist who prefers to code in Python.
Even though the library is easy to get started, it can certainly do a wide variety of data
manipulation. This makes Pandas one of the handiest data science libraries in the developer’s
community. Pandas basically allow the manipulation of large datasets and data frames. It can
also be considered as one of the most efficient statistical tools for mathematical computations
of tabular data. Pandas is a powerful library for data manipulation and analysis. This task
focuses on the following operations:
1. Creating a DataFrame:
A DataFrame is the core data structure in Pandas. It is a 2-dimensional labeled data
structure with columns of potentially different types. This operation will cover how to
create a DataFrame from lists, dictionaries, or other data structures.
2. Using concat() for DataFrame Concatenation:
The concat() function allows for combining multiple DataFrames along a particular
axis (either rows or columns). This operation demonstrates how to concatenate
DataFrames to combine datasets efficiently.
3. Setting Conditions in a DataFrame:
Conditional statements can be used to filter or modify data within a DataFrame. This
operation will explore how to apply conditions to create boolean masks for selecting
data or modifying values based on certain criteria.
4. Adding a New Column:
Adding a new column to a DataFrame is an essential operation when transforming or
enriching the data. This task will show how to add new columns, either by applying a
function to existing data or by assigning static or calculated values.
PROGRAM:
a) Creating Dataframe
import pandas as pd
data_england = {'Name': ['Kane', 'Sterling', 'Saka', 'Maguire'], 'Age': [27, 26, 19, 28]}
data_italy = {'Name': ['Immobile', 'Insigne', 'Chiellini', 'Chiesa'], 'Age': [31, 30, 36, 23]}
df_england = pd.DataFrame(data_england)
df_italy = pd.DataFrame(data_italy)
print("England Players:")
print(df_england)
print("\nItaly Players:")
print(df_italy)
o/p:
England Players:
       Name  Age
0      Kane   27
1  Sterling   26
2      Saka   19
3   Maguire   28

Italy Players:
        Name  Age
0   Immobile   31
1    Insigne   30
2  Chiellini   36
3     Chiesa   23
b) concat()
frames = [df_england, df_italy]
both_teams = pd.concat(frames)
both_teams
o/p:
        Name  Age
0       Kane   27
1   Sterling   26
2       Saka   19
3    Maguire   28
0   Immobile   31
1    Insigne   30
2  Chiellini   36
3     Chiesa   23
pd.concat(frames, keys=["England", "Italy"])
o/p:
                Name  Age
England 0       Kane   27
        1   Sterling   26
        2       Saka   19
        3    Maguire   28
Italy   0   Immobile   31
        1    Insigne   30
        2  Chiellini   36
        3     Chiesa   23
c) Setting Conditions
both_teams[both_teams["Age"] >= 30]
o/p:
        Name  Age
0   Immobile   31
1    Insigne   30
2  Chiellini   36
both_teams[both_teams["Name"].str.startswith('S')]
o/p:
       Name  Age
1  Sterling   26
2      Saka   19
d) Adding a new column
(Note: the code for this part is lost in the source; a reconstruction consistent with the residual output. The club names for Sterling and Maguire are assumptions.)
df_england['Associated Clubs'] = ['Tottenham', 'Man City', 'Arsenal', 'Man United']
both_teams = pd.concat([df_england, df_italy])
both_teams
o/p:
        Name  Age Associated Clubs
0       Kane   27        Tottenham
1   Sterling   26         Man City
2       Saka   19          Arsenal
3    Maguire   28       Man United
0   Immobile   31              NaN
1    Insigne   30              NaN
2  Chiellini   36              NaN
3     Chiesa   23              NaN
RESULT: Hence, creating a DataFrame, performing concatenation, setting conditions, and adding a new column in a Pandas DataFrame have been successfully demonstrated.
7. PANDAS OPERATIONS: HANDLING NaN, SORTING, AND GROUPING DATA
AIM: To understand and demonstrate how to handle NaN values, sort data based on column values,
and group data using the groupby() function in Pandas.
DESCRIPTION:
This task covers essential operations in Pandas for data cleaning, organization, and analysis.
The following operations will be demonstrated:
1. Filling NaN with String:
NaN (Not a Number) values are common in real-world datasets. This operation
demonstrates how to fill NaN values in a DataFrame with a specified string (or any
other placeholder), ensuring the data is complete and usable for analysis.
2. Sorting Based on Column Values:
Sorting is an essential operation to organize data in a specific order. This operation
will show how to sort a DataFrame based on values in one or more columns, either in
ascending or descending order.
3. Using groupby() for Grouping Data:
The groupby() function in Pandas is used to group data based on column values,
which is useful for performing aggregate functions like summing, averaging, or
counting within each group. This operation will demonstrate how to group data and
apply aggregation functions.
These operations are fundamental for data preprocessing, cleaning, and analysis, and are
frequently used in data science, machine learning, and data manipulation tasks.
PROGRAM:
a) Filling NaN with string
both_teams['Associated Clubs'] = both_teams['Associated Clubs'].fillna('No Data Found')
both_teams
o/p:
        Name  Age Associated Clubs
0       Kane   27        Tottenham
1   Sterling   26         Man City
2       Saka   19          Arsenal
3    Maguire   28       Man United
0   Immobile   31    No Data Found
1    Insigne   30    No Data Found
2  Chiellini   36    No Data Found
3     Chiesa   23    No Data Found
b) Sorting based on column values
both_teams.sort_values('Age')
o/p:
        Name  Age Associated Clubs
2       Saka   19          Arsenal
3     Chiesa   23    No Data Found
1   Sterling   26         Man City
0       Kane   27        Tottenham
3    Maguire   28       Man United
1    Insigne   30    No Data Found
0   Immobile   31    No Data Found
2  Chiellini   36    No Data Found
c) groupby()
a = {
    'UserID': ['U1001', 'U1002', 'U1001', 'U1001', 'U1003'],
    'Transaction': [500, 300, 200, 300, 700]
}
df_a = pd.DataFrame(a)
df_a
o/p:
  UserID  Transaction
0  U1001          500
1  U1002          300
2  U1001          200
3  U1001          300
4  U1003          700
df_a.groupby('UserID').sum()
o/p:
        Transaction
UserID
U1001          1000
U1002           300
U1003           700
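The description also mentions averaging and counting within each group; a minimal sketch using agg() on the same DataFrame:
df_a.groupby('UserID')['Transaction'].agg(['count', 'mean'])
o/p:
        count        mean
UserID
U1001       3  333.333333
U1002       1  300.000000
U1003       1  700.000000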
df_a.groupby('UserID').get_group('U1001')
o/p:
  UserID  Transaction
0  U1001          500
2  U1001          200
3  U1001          300
RESULT: Hence, the concepts of handling NaN values, sorting data based on column values,
and grouping data using the groupby() function in Pandas have been successfully understood
and demonstrated.
8. READING VARIOUS FILE FORMATS USING PANDAS
AIM: To read and load different types of data files (Text files, CSV files, Excel files, and
JSON files) into Python using the Pandas library for data manipulation and analysis.
DESCRIPTION:
This task focuses on using Pandas, a powerful data analysis library in Python, to read data
from various common file formats. These include:
• Text Files: Raw data files containing information, which can be loaded into a Pandas
DataFrame for analysis.
• CSV Files: Comma-separated values files that represent tabular data, which can be
easily read into a DataFrame using the read_csv() function.
• Excel Files: Files in .xls or .xlsx format, typically containing spreadsheets that can be
read using Pandas' read_excel() function.
• JSON Files: JavaScript Object Notation files, commonly used for structured data
exchange, can be read into a DataFrame with read_json().
By using these file formats with Pandas, you can efficiently load data into Python and
perform various data manipulation tasks like filtering, transforming, and analyzing the data.
PROGRAM:
a) Text files
with open(r'E:\pythonfiles\datascience.txt', 'r') as f:
    print(f.read())
o/p:
Data Science with Python refers to the use of Python programming language and its rich
ecosystem of libraries to extract insights from data. Python has become one of the most
popular languages for data science due to its simplicity, readability, and extensive set of tools
and libraries designed specifically for data analysis, machine learning, and visualization.
(Note: This output will only be displayed if the file exists and contains text; provide the path of the file as stored on your machine.)
b) CSV files
import pandas as pd
df = pd.read_csv(r'./Importing files/Employee.txt',delimiter='\t')
df
o/p:
(Note: This output will only be displayed if the file exists and contains data; provide the path of the file as stored on your machine.)
c) Excel files
df = pd.read_excel(r'./Importing files/World_city.xlsx')
df
o/p:
df = pd.read_excel(r'./Importing files/World_city.xlsx',sheet_name='Europe')
df
o/p:
(Note: This output will only be displayed if the file exists and contains data; provide the path of the file as stored on your machine.)
d) JSON files
import json
with open('./Importing files/sample_json.json','r') as file:
    data = json.load(file)
print(type(data))
df_json = pd.DataFrame(data)
df_json
o/p:
(Note: This output will only be displayed if the file exists and contains data; provide the path of the file as stored on your machine.)
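The description also mentions Pandas' read_json(); the same file can be loaded directly in a single step (a minimal sketch, assuming the same file path):
df_json = pd.read_json('./Importing files/sample_json.json')
df_json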
RESULT: Hence, reading and loading different types of data files (Text files, CSV files,
Excel files, and JSON files) into Python using the Pandas library for data manipulation and
analysis has been successfully executed.
9. READING AND IMPORTING DATA FROM VARIOUS FILE FORMATS
AND DATABASES USING PYTHON
AIM: To demonstrate how to read and import data from various file formats, including
Pickle files, image files using PIL, multiple files using Glob, and importing data from
databases into Python for further analysis and processing.
DESCRIPTION:
This task covers the different ways to load and handle files in Python, using specialized
libraries and techniques for various data types. The aim is to provide a practical
understanding of how to work with different file formats and efficiently import data for
processing.
1. Pickle Files:
The Pickle module in Python is used for serializing and deserializing Python object
structures. This is helpful when you need to store Python objects for later use. It
allows saving complex data structures such as lists, dictionaries, and more, and
retrieving them later.
2. Image Files using PIL (Pillow):
PIL (Pillow) is a Python Imaging Library that provides easy-to-use methods for
opening, manipulating, and saving various image formats. This section will cover how
to open image files, perform transformations, and process image data using Pillow.
3. Multiple Files using Glob:
The Glob module in Python allows you to find all the pathnames matching a specified
pattern. It is particularly useful when you want to read multiple files (e.g., all text files
in a directory) without manually specifying their names. This enables dynamic file
handling.
4. Importing Data from Databases:
Python provides various libraries to connect to databases, such as sqlite3, MySQL
(using libraries like PyMySQL or SQLAlchemy), and PostgreSQL. This section will
demonstrate how to query data from relational databases and load it into Python for
further processing.
PROGRAM:
a) Pickle files
import pickle
import pandas as pd
with open('./Importing files/sample_pickle.pkl','rb') as file:
    data = pickle.load(file)
print(type(data))
df_pkl = pd.DataFrame(data)
df_pkl
o/p:
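(For completeness, the Pickle module also handles the writing side; a minimal sketch of serializing an object for later use. The object and file name here are illustrative:)
import pickle
obj = {'scores': [1, 2, 3]}
with open('sample_out.pkl','wb') as file:
    pickle.dump(obj, file)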
b) Image files using PIL & c) Multiple files using Glob
import glob
from PIL import Image
import matplotlib.pyplot as plt
filepath = r'./Importing files/Delhi'
images = glob.glob(filepath + '/*.jpg')
for i in images[:3]:
    im = Image.open(i)
    plt.imshow(im)
    plt.show()
o/p:
d) Importing data from Database
import pandas as pd
import sqlite3
con = sqlite3.connect('./Importing files/sample_test.db')
cur = con.cursor()
rs = cur.execute('SELECT * FROM TEST')
df = pd.DataFrame(rs.fetchall())
con.close()
df
o/p:
RESULT: Hence, reading and importing data from various file formats, including Pickle
files, image files using PIL, multiple files using Glob, and data from databases into Python
for further analysis and processing has been successfully demonstrated.
10. WEB SCRAPING USING PYTHON
AIM: To perform web scraping using Python with the Requests and BeautifulSoup libraries, extracting data from a webpage and downloading images from it.
DESCRIPTION:
Fetching a webpage with the requests library returns its raw HTML text. But as you can see when printing it, this data is not very readable: the tree-like structure of the HTML content retrieved by our request is not very comprehensible. To improve this readability, Python has another wonderful library called BeautifulSoup.
BeautifulSoup is a Python library for parsing the tree-like structure of HTML and extracting data from the HTML document.
To make this work, we need to pass the text response from the request object to BeautifulSoup(), which creates its own object – "soup" in this case. Calling prettify() on the BeautifulSoup object renders the parsed tree-like structure of the HTML document in a readable, indented form:
import requests
from bs4 import BeautifulSoup
url = "https://weather.com/en-IN/weather/tenday/l/aff9460b9160c73ff01769fd83ae82cf37cb27fb7eb73c70b91257d413147b69"
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
o/p:
You must have noticed the difference in the output. We have a more structured output in this
case!
Now, we can extract the title of the webpage by accessing the title attribute of our soup object:
title = soup.title
title
The webpage has a lot of pictures of the famous monuments in Delhi and other things related
to Delhi. Let’s try and store these in a local folder.
We will need the Python urllib library to retrieve the URLs of the images that we want to store. Its urllib.request module is used for opening and reading URLs; calling its urlretrieve() function downloads the object denoted by a URL to a local file:
import urllib.request
def download_img(url, i):
    folder = 'C:\\Users\\Dell\\Desktop\\Delhi\\'
    filepath = folder + str(i) + '.jpg'
    urllib.request.urlretrieve(url, filepath)
The images are stored in "img" tags in the HTML. These can be found by calling find_all() on the soup object. After this, we can iterate over the images and get each one's source by calling the get() function on the image object. The rest is handled by our download function:
images = soup.find_all('img')
i = 1
for image in images[2:10]:
    try:
        download_img('https:' + image.get('src'), i)
        i = i + 1
    except Exception:
        continue
o/p:
RESULT: Hence, web scraping using Python with libraries like BeautifulSoup and Requests
has been successfully executed.
11. PREPROCESSING TECHNIQUES FOR LOAN PREDICTION DATASET
AIM: To apply and evaluate various preprocessing techniques on the Loan Prediction dataset
to prepare it for effective model training and prediction.
DESCRIPTION:
In this task, several preprocessing techniques will be applied to a Loan Prediction dataset to
enhance the data's quality and suitability for machine learning models. The steps include:
1. Feature Scaling: Scaling numerical features to a common scale to improve the
convergence rate of machine learning algorithms.
2. Feature Standardization: Standardizing features to have a mean of zero and a
standard deviation of one, ensuring that no feature dominates others due to differences
in scale.
3. Label Encoding: Converting categorical labels into numeric values, allowing
machine learning algorithms to interpret them more efficiently.
4. One Hot Encoding: Converting categorical variables into a series of binary columns,
each representing a different category, enabling algorithms to work with categorical
data without imposing any ordinal relationship.
These techniques are essential for improving model performance and ensuring that the data is
in the right format for machine learning algorithms.
(Before applying feature scaling, we need to import packages and datasets):
import pandas as pd
X_train=pd.read_csv('X_train.csv')
Y_train=pd.read_csv('Y_train.csv')
X_test=pd.read_csv('X_test.csv')
Y_test=pd.read_csv('Y_test.csv')
print (X_train.head()) # Provide respective csv files for training and testing
o/p:
      Loan_ID Gender Married Dependents Education Self_Employed
15   LP001032   Male      No          0  Graduate            No
248  LP001824   Male     Yes          1  Graduate            No
590  LP002928   Male     Yes          0  Graduate            No
246  LP001814   Male     Yes          2  Graduate            No
388  LP002244   Male     Yes          0  Graduate            No

     ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term
15              4950                0.0       125.0             360.0
248             2882             1843.0       123.0             480.0
590             3000             3416.0        56.0             180.0
246             9703                0.0       112.0             360.0
388             2333             2417.0       136.0             360.0

     Credit_History Property_Area
15              1.0         Urban
248             1.0     Semiurban
590             1.0     Semiurban
246             1.0         Urban
388             1.0         Urban
PROGRAM:
a) Feature Scaling
import matplotlib.pyplot as plt
numeric_cols = X_train.dtypes[(X_train.dtypes == "float64") | (X_train.dtypes == "int64")].index.values
X_train[numeric_cols].hist(figsize=[11, 11])
o/p:
from sklearn.preprocessing import MinMaxScaler
min_max=MinMaxScaler()
X_train_minmax=min_max.fit_transform(X_train[['ApplicantIncome', 'CoapplicantIncome',
'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
X_test_minmax=min_max.fit_transform(X_test[['ApplicantIncome', 'CoapplicantIncome',
'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_minmax, Y_train)
accuracy_score(Y_test, knn.predict(X_test_minmax))
o/p: 0.75
b) Feature Standardization
from sklearn.preprocessing import scale
X_train_scale = scale(X_train[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
X_test_scale = scale(X_test[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
from sklearn.linear_model import LogisticRegression
log=LogisticRegression(penalty='l2',C=.01)
log.fit(X_train_scale,Y_train)
accuracy_score(Y_test,log.predict(X_test_scale))
o/p: 0.75
c) Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in X_test.columns.values:
    if X_test[col].dtypes == 'object':
        data = X_train[col].append(X_test[col])
        le.fit(data.values)
        X_train[col] = le.transform(X_train[col])
        X_test[col] = le.transform(X_test[col])
o/p:
Before : Male 318
Female 66
Name: Gender, dtype: int64
After : 1 318
0 66
Name: Gender, dtype: int64
(Note: the two scale() calls below are reconstructed; the source shows only the fitting and scoring lines for the label-encoded data.)
X_train_scale = scale(X_train)
X_test_scale = scale(X_test)
log.fit(X_train_scale, Y_train)
accuracy_score(Y_test, log.predict(X_test_scale))
o/p: 0.73958333333333337
d) One Hot Encoding
(Note: the opening lines of this part are missing in the source; reconstructed so the surviving fragment runs, assuming a OneHotEncoder fitted on the combined train and test values of each categorical column:)
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse=False)
X_train_1 = X_train
X_test_1 = X_test
columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']
for col in columns:
    # fit on the union of train and test values for this column
    data = X_train[[col]].append(X_test[[col]])
    enc.fit(data)
    # transform the train column and append the encoded columns
    temp = enc.transform(X_train[[col]])
    temp = pd.DataFrame(temp, columns=[(col + "_" + str(i)) for i in data[col].value_counts().index])
    temp = temp.set_index(X_train.index.values)
    X_train_1 = pd.concat([X_train_1, temp], axis=1)
    # transform the test column likewise
    temp = enc.transform(X_test[[col]])
    temp = pd.DataFrame(temp, columns=[(col + "_" + str(i)) for i in data[col].value_counts().index])
    temp = temp.set_index(X_test.index.values)
    X_test_1 = pd.concat([X_test_1, temp], axis=1)
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X_train_scale = scale(X_train_1)
X_test_scale = scale(X_test_1)
log = LogisticRegression(penalty='l2', C=0.01)
log.fit(X_train_scale, Y_train)
accuracy_score(Y_test, log.predict(X_test_scale))
o/p: 0.75
RESULT: Hence, various preprocessing techniques have been successfully applied and
evaluated on the Loan Prediction dataset to prepare it for effective model training and
prediction.
12. DATA VISUALIZATIONS USING MATPLOTLIB
AIM: To perform various types of visualizations using Matplotlib to explore and understand
the dataset, and to identify patterns, trends, and relationships in the data.
DESCRIPTION:
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and graphical user interface toolkits. In this task, the following visualizations will be performed using Matplotlib:
1. Bar Graph: To visualize the distribution of categorical data by showing the frequency
of each category.
2. Pie Chart: To display proportions or percentages of different categories in the dataset
for easy comparison.
3. Box Plot: To show the spread of numerical data and identify outliers by visualizing
quartiles and data range.
4. Histogram: To represent the distribution of a continuous numerical variable, showing
frequency counts for different ranges.
5. Line Chart and Subplots: To track changes over time or across categories, including
the use of subplots for multiple comparisons in a single view.
6. Scatter Plot: To explore the relationship between two continuous variables by
plotting data points on a two-dimensional axis.
These visualizations will help in understanding the dataset's structure and characteristics,
making it easier to analyze the data and make informed decisions.
PROGRAM:
Let us first import the relevant libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn')
df_meal = pd.read_csv('C:\\Users\\Dell\\Desktop\\train_food\\meal_info.csv')
df_meal.head()
df_center = pd.read_csv('C:\\Users\\Dell\\Desktop\\train_food\\fulfilment_center_info.csv')
df_center.head()
df_food = pd.read_csv('C:\\Users\\Dell\\Desktop\\train_food\\train_food.csv')
df_food.head()
df = pd.merge(df_food,df_center,on='center_id')
df = pd.merge(df,df_meal,on='meal_id')
table = pd.pivot_table(data=df,index='category',values='num_orders',aggfunc=np.sum)
table
o/p:
a) Bar Graph
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn')
df_meal = pd.read_csv('meal_info.csv')
df_center = pd.read_csv('fulfilment_center_info.csv')
df_food = pd.read_csv('train_food.csv')
df = pd.merge(df_food,df_center,on='center_id')
df = pd.merge(df,df_meal,on='meal_id')
table = pd.pivot_table(data=df,index='category',values='num_orders',aggfunc=np.sum)
plt.bar(table.index,table['num_orders'])
plt.xticks(rotation=70)
plt.xlabel('Food item')
plt.ylabel('Quantity sold')
plt.title('Most popular food')
plt.show()
o/p:
item_count = {}
for i in range(table.index.nunique()):
    item_count[table.index[i]] = table.num_orders[i] / df_meal[df_meal['category']==table.index[i]].shape[0]

plt.bar([x for x in item_count.keys()], [x for x in item_count.values()], color='orange')
plt.xticks(rotation=70)
plt.xlabel('Food item')
plt.ylabel('No. of meals')
plt.title('Meals per food item')
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_7.png', dpi=300, bbox_inches='tight')
plt.show()
o/p:
b) Pie Chart
d_cuisine = {}
total = df['num_orders'].sum()
for i in range(df['cuisine'].nunique()):
    c = df['cuisine'].unique()[i]
    c_order = df[df['cuisine']==c]['num_orders'].sum()
    d_cuisine[c] = c_order / total
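(Note: the plotting call for this section is missing in the source; a minimal sketch that renders the computed cuisine shares as a pie chart:)
plt.pie([x for x in d_cuisine.values()], labels=[x for x in d_cuisine.keys()], autopct='%.1f%%')
plt.title('Cuisine share')
plt.show()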
o/p:
c) Box Plot
c_price = {}
for i in df['cuisine'].unique():
    c_price[i] = df[df['cuisine']==i].base_price
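(Note: the plotting call is likewise missing here; a minimal sketch that draws the price spread per cuisine:)
plt.boxplot([x for x in c_price.values()], labels=[x for x in c_price.keys()])
plt.xlabel('Cuisine')
plt.ylabel('Base price')
plt.title('Analysing cuisine price')
plt.show()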
d) Histogram
plt.hist(df['base_price'],rwidth=0.9,alpha=0.3,color='blue',bins=15,edgecolor='red')
plt.xlabel('Base price range')
plt.ylabel('Distinct order')
plt.title('Inspecting price effect')
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_10.png', dpi=300, bbox_inches='tight')
plt.show()
o/p:
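e) Line Chart and Subplots
(Note: the aggregation that builds the weekly and monthly revenue series is missing in the source; a minimal sketch, assuming the dataset's week, checkout_price, and num_orders columns and treating four weeks as one month:)
df['revenue'] = df['checkout_price'] * df['num_orders']
week_rev = df.groupby('week')['revenue'].sum()
week, week_order = week_rev.index, week_rev.values
df['month'] = df['week'] // 4
month_rev = df.groupby('month')['revenue'].sum()
month, month_order = month_rev.index, month_rev.values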
fig,ax=plt.subplots(nrows=1,ncols=2,figsize=(20,5))
ax[0].plot(week,week_order)
ax[0].set_xlabel('Week')
ax[0].set_ylabel('Revenue')
ax[0].set_title('Weekly income')
ax[1].plot(month,month_order)
ax[1].set_xlabel('Month')
ax[1].set_ylabel('Revenue')
ax[1].set_title('Monthly income')
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_11.png', dpi=300, bbox_inches='tight')
plt.show()
o/p:
f) Scatter plot
center_type_name = ['TYPE_A','TYPE_B','TYPE_C']
op_table = pd.pivot_table(df, index='op_area', values='num_orders', aggfunc=np.sum)
c_type = {}
for i in center_type_name:
    c_type[i] = df[df['center_type'] == i].op_area
center_table = pd.pivot_table(df, index='center_type', values='num_orders', aggfunc=np.sum)
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(8, 12))
ax[0].scatter(op_table.index, op_table['num_orders'], color='pink')
ax[0].set_xlabel('Operation area')
ax[0].set_ylabel('Number of orders')
ax[0].set_title('Does operation area affect num of orders?')
ax[0].annotate('optimum operation area of 4 km^2', xy=(4.2, 1.1*10**7), xytext=(7, 1.1*10**7),
               arrowprops=dict(facecolor='black', shrink=0.05), fontsize=12)
ax[1].boxplot([x for x in c_type.values()], labels=[x for x in c_type.keys()])
ax[1].set_xlabel('Center type')
ax[1].set_ylabel('Operation area')
ax[1].set_title('Which center type had the optimum operation area?')
ax[2].bar(center_table.index, center_table['num_orders'], alpha=0.7, color='orange',
width=0.5)
ax[2].set_xlabel('Center type')
ax[2].set_ylabel('Number of orders')
ax[2].set_title('Orders per center type')
plt.tight_layout()
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_12.png', dpi=300, bbox_inches='tight')
plt.show()
o/p:
RESULT: Hence, various types of visualizations using Matplotlib have been performed
successfully to explore and understand the dataset, and to identify patterns, trends, and
relationships in the data.
13. GETTING STARTED WITH NLTK
AIM: To install and set up the Natural Language Toolkit (NLTK) library using PIP, laying the
foundation for natural language processing (NLP) tasks.
DESCRIPTION:
NLTK stands for Natural Language Toolkit. It is a Python library that helps us work with human language data and provides a very easy-to-use interface. It supports tasks such as classification, stemming, tagging, and tokenization.
Installing NLTK on Windows using PIP:
On Windows, we first have to install the current version of Python, which ships with pip. Without pip, NLTK cannot be installed.
Step 1: Open the official Python website (python.org) in a web browser and go to the Downloads section.
Step 2: Move the cursor to the Download button & then click on the latest Python version.
Step 3: Open the downloaded file, tick the 'Add python.exe to PATH' checkbox, & click on Customize installation.
Step 4: Click on Next.
Step 5: Choose the advanced options you need & click on Install.
Step 6: Wait till the installation finishes.
Step 7: Once the installation is complete, click on Close.
Step 8: Open Command Prompt & execute the following commands:
python --version
pip --version
pip install nltk
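To verify the setup, NLTK can be imported and a sample resource downloaded (a minimal check, not part of the original steps):
import nltk
nltk.download('punkt')
RESULT: Hence, the NLTK library has been successfully installed and set up using PIP, laying the foundation for natural language processing tasks.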
14. IMPLEMENTATION OF PYTHON PROGRAM WITH SCIKIT-LEARN
AND NLTK
AIM: To utilize Scikit-Learn and NLTK libraries in Python for machine learning and natural
language processing tasks.
DESCRIPTION:
This Python program implements a text classification model using Natural Language Processing (NLP) and
machine learning techniques. The program classifies a set of text documents into predefined categories
(Tech, Sport, Fitness) using a Naive Bayes classifier, which is trained on text features extracted using
the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization method.
Key Steps:
1. Text Preprocessing:
o The program tokenizes the text documents and removes stopwords (commonly
used words like 'the', 'and', etc.) to improve the quality of the features used for
classification.
o It also ensures that only alphanumeric tokens are considered for classification.
2. Feature Extraction:
o The TF-IDF Vectorizer is used to convert the preprocessed text into numerical
feature vectors, which represent the importance of words in the documents
relative to the entire corpus.
3. Model Training:
o The Multinomial Naive Bayes classifier is used to classify the documents into
categories based on the extracted features. The model is trained using an 80%
subset of the dataset.
4. Model Evaluation:
o The model is evaluated using the test set (20% of the data) to predict the
category labels for unseen documents. The program calculates the accuracy of
the model and generates a detailed classification report, including precision,
recall, and F1-score for each category.
5. Cross-Validation:
o To further assess the performance of the model, 5-fold cross-validation is used
to evaluate the model's generalizability and robustness.
The program prints the accuracy of the model on the test set, a classification report showing
detailed metrics (precision, recall, F1-score) for each class, and the cross-validated accuracy of
the model across multiple splits of the dataset.
PROGRAM:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report
nltk.download('punkt')
nltk.download('stopwords')
documents = [
'I love programming in Python',
'Machine learning is great',
'Natural Language Processing is fascinating',
'I enjoy solving problems using code',
'Python is a great programming language',
'Python is widely used in AI and data science',
'Football is a popular sport around the world',
'The Olympics are held every four years',
'The football match was exciting to watch',
'I love watching basketball games',
'Running is a great way to stay fit',
'Yoga helps in reducing stress and improving flexibility',
'Swimming is a great full-body exercise',
'Cycling is a popular outdoor activity',
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
stop_words = set(stopwords.words('english'))
def preprocess(text):
tokens = word_tokenize(text.lower())
return ' '.join([word for word in tokens if word not in stop_words and word.isalnum()])
# (The lines applying the preprocessing, splitting the data, and building the
# pipeline are missing in the source; reconstructed from the description above.)
processed_docs = [preprocess(doc) for doc in documents]
X_train, X_test, y_train, y_test = train_test_split(
    processed_docs, labels, test_size=0.2, random_state=42, stratify=labels)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Tech", "Sport", "Fitness"]))
# Cross-validation for a more robust estimate; cv=3 is used here because the
# smallest class in this toy dataset has only 4 samples.
scores = cross_val_score(model, processed_docs, labels, cv=3)
print("\nCross-validated Accuracy:", scores.mean())
o/p:
RESULT: Hence, the Scikit-Learn and NLTK libraries in Python have been successfully
utilized and executed for machine learning and natural language processing tasks.
15. PYTHON PROGRAM TO IMPLEMENT TEXT PROCESSING USING NLTK/SPACY/PYNLPI
AIM: To develop a Python program that utilizes NLTK for text processing tasks such as
tokenization, sentence segmentation, and word tokenization to prepare text data for further
analysis or NLP applications.
DESCRIPTION:
This Python program uses NLTK's sent_tokenize and word_tokenize functions to break down
a given text into sentences and words. The program first downloads the necessary punkt
tokenizer models, which are pre-trained models for sentence and word segmentation. The text
is then processed to identify and separate sentences using the sent_tokenize function, and each
sentence is further split into words using the word_tokenize function. The program outputs the
list of sentences and words, which can be used for tasks such as text analysis, machine learning
preprocessing, or natural language processing (NLP) applications. This program serves as an
introduction to basic text processing techniques in NLP using Python.
PROGRAM:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
text = ("Natural language processing (NLP) is a field of computer science, artificial intelligence "
        "and computational linguistics concerned with the interactions between computers and human "
        "(natural) languages, and, in particular, concerned with programming computers to fruitfully "
        "process large natural language corpora. Challenges in natural language processing frequently "
        "involve natural language understanding, natural language generation (frequently from formal, "
        "machine-readable logical forms), connecting language and machine perception, managing "
        "human-computer dialog systems, or some combination thereof.")
print("Sentences Tokenized:")
print(sent_tokenize(text))
print("\nWords Tokenized:")
print(word_tokenize(text))
o/p:
Sentences Tokenized:
['Natural language processing (NLP) is a field of computer science, artificial intelligence and
computational linguistics concerned with the interactions between computers and human
(natural) languages, and, in particular, concerned with programming computers to fruitfully
process large natural language corpora.', 'Challenges in natural language processing frequently
involve natural language understanding, natural language generation (frequently from formal,
machine-readable logical forms), connecting language and machine perception, managing
human-computer dialog systems, or some combination thereof.']
Words Tokenized:
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',',
'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the',
'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'and', ',', 'in',
'particular', ',', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large',
'natural', 'language', 'corpora', '.', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently',
'involve', 'natural', 'language', 'understanding', ',', 'natural', 'language', 'generation', '(',
'frequently', 'from', 'formal', ',', 'machine-readable', 'logical', 'forms', ')', ',', 'connecting',
'language', 'and', 'machine', 'perception', ',', 'managing', 'human-computer', 'dialog', 'systems', ',',
'or', 'some', 'combination', 'thereof', '.']
RESULT: Hence, a Python program that utilizes NLTK for text processing tasks such as
tokenization, sentence segmentation, and word tokenization has been successfully developed
to prepare text data for further analysis or NLP applications.