
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

DATA SCIENCE USING PYTHON LAB (23HPC3004)

II B.TECH II Sem Regulation – R23

ANNAMACHARYA INSTITUTE OF TECHNOLOGY & SCIENCES :: KADAPA
(AUTONOMOUS)
B.Tech: EEE, ECE & CSE courses accredited by NBA, New Delhi; accredited by NAAC with 'A' Grade
Utukur (P), C.K. Dinne (V&M), Kadapa, Y.S.R. Dist., A.P.
LIST OF EXPERIMENTS

1. Creating a NumPy Array
   a) Basic nd-array
   b) Array of zeros
   c) Array of ones
   d) Random numbers in nd-array
   e) An array of your choice
   f) Identity matrix in NumPy
   g) Evenly spaced nd-array

2. The Shape and Reshaping of NumPy Array
   a) Dimensions of NumPy array
   b) Shape of NumPy array
   c) Size of NumPy array
   d) Reshaping a NumPy array
   e) Flattening a NumPy array
   f) Transpose of a NumPy array

3. Expanding and Squeezing a NumPy Array
   a) Expanding a NumPy array
   b) Squeezing a NumPy array
   c) Sorting in NumPy arrays

4. Indexing and Slicing of NumPy Array
   a) Slicing 1-D NumPy arrays
   b) Slicing 2-D NumPy arrays
   c) Slicing 3-D NumPy arrays
   d) Negative slicing of NumPy arrays

5. Stacking and Concatenating NumPy Arrays
   a) Stacking nd-arrays
   b) Concatenating nd-arrays
   c) Broadcasting in NumPy arrays

6. Perform the following operations using Pandas
   a) Creating a DataFrame
   b) concat()
   c) Setting conditions
   d) Adding a new column

7. Perform the following operations using Pandas
   a) Filling NaN with a string
   b) Sorting based on column values
   c) groupby()

8. Read the following file formats using Pandas
   a) Text files
   b) CSV files
   c) Excel files
   d) JSON files

9. Read the following file formats
   a) Pickle files
   b) Image files using PIL
   c) Multiple files using Glob
   d) Importing data from a database

10. Demonstrate web scraping using Python

11. Perform the following preprocessing techniques on a loan prediction dataset
    a) Feature Scaling
    b) Feature Standardization
    c) Label Encoding
    d) One Hot Encoding

12. Perform the following visualizations using Matplotlib
    a) Bar Graph
    b) Pie Chart
    c) Box Plot
    d) Histogram
    e) Line Chart and Subplots
    f) Scatter Plot

13. Getting started with NLTK; install NLTK using PIP

14. Python program to implement text classification with Scikit-Learn and NLTK

15. Python program to implement text processing with NLTK/spaCy/PyNLPl
1. CREATING A NUMPY ARRAY

AIM: To create a NumPy array and perform various operations such as creating basic nd-
arrays, arrays of zeros and ones, generating random numbers, and implementing specific
array types and transformations.
DESCRIPTION:
NumPy stands for Numerical Python and is one of the most useful scientific libraries in
Python programming. It provides support for large multidimensional array objects and
various tools to work with them. Various other libraries like Pandas, Matplotlib, and Scikit-
learn are built on top of this amazing library.
Arrays are a collection of elements/values that can have one or more dimensions. A one-dimensional array is called a vector, while a two-dimensional array is called a matrix.
NumPy arrays are called ndarray or N-dimensional arrays and they store elements of the
same type and size. It is known for its high-performance and provides efficient storage and
data operations as arrays grow in size.
NumPy comes pre-installed when you download Anaconda. But if you want to install NumPy
separately on your machine, just type the below command on your terminal:
pip install numpy
Now you need to import the library:
import numpy as np
(np is the de facto abbreviation for NumPy used by the data science community.)

PROGRAM:
a) Basic nd-array
np.array([1,2,0,3,4])
o/p:
array([1, 2, 0, 3, 4])

np.array([[1,2,3,4],[5,6,7,8]])
o/p:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])

b) Array of zeros
np.zeros((2,3))
o/p:
array([[0., 0., 0.],
[0., 0., 0.]])

c) Array of ones
np.ones(5,dtype=np.int32)
o/p: array([1, 1, 1, 1, 1])

d) Random numbers in nd-array


np.random.rand(2,3)
o/p:
array([[0.8691082 , 0.124569 , 0.5545699 ],
[0.18495189, 0.08205121, 0.51772858]])

e) An array of your choice


np.full((2,2),7)
o/p:
array([[7, 7],
[7, 7]])

f) Identity matrix in NumPy
np.eye(3)
o/p:
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])

np.eye(3,k=1)
o/p:
array([[0., 1., 0.],
[0., 0., 1.],
[0., 0., 0.]])

g) Evenly spaced nd-array


np.arange(5)
o/p: array([0, 1, 2, 3, 4])

np.arange(2,10,2)
o/p: array([2, 4, 6, 8])
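
Besides np.arange, np.linspace is the usual way to get a fixed number of evenly spaced values; a minimal sketch (not part of the original listing):

np.linspace(0, 1, 5)   # 5 values from 0 to 1, both endpoints included
o/p: array([0.  , 0.25, 0.5 , 0.75, 1.  ])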

RESULT: Hence, a NumPy array has been successfully created, and various operations such
as creating basic nd-arrays, arrays of zeros and ones, generating random numbers, and
implementing specific array types and transformations have been performed.

2. THE SHAPE AND RESHAPING OF NUMPY ARRAY

AIM: To understand and demonstrate the concepts of shape and reshaping in NumPy arrays,
including dimensions, shape, size, reshaping, flattening, and transposing.

DESCRIPTION:
NumPy, a fundamental library for numerical computing in Python, provides powerful tools
for handling multi-dimensional arrays. One of its key features is the ability to manipulate the
shape and structure of arrays efficiently. This exercise explores various aspects of handling
and modifying NumPy arrays:
1. Dimensions of a NumPy Array:
The number of dimensions (axes) of an array is called its rank. It can be determined
using the ndim attribute.
2. Shape of a NumPy Array:
The shape of an array is a tuple that indicates the size of the array along each
dimension. For example, a 2x3 array has a shape of (2, 3).
3. Size of a NumPy Array:
The total number of elements in an array can be found using the size attribute.
4. Reshaping a NumPy Array:
Reshaping allows us to change the structure of an array without altering its data. For
instance, a 1D array of size 6 can be reshaped into a 2x3 2D array using the reshape()
method.
5. Flattening a NumPy Array:
Flattening converts a multi-dimensional array into a 1D array. This is useful for linear
data processing and can be achieved using the flatten() method.
6. Transpose of a NumPy Array:
The transpose of an array swaps its axes, such as converting rows to columns in a 2D
array. This can be done using the T attribute.
These operations are fundamental for data manipulation in machine learning, data
preprocessing, and other numerical computation tasks.

PROGRAM:
a) Dimensions of NumPy array
import numpy as np
a = np.array([[5,10,15],[20,25,20]])
print('Array :','\n',a)
print('Dimensions :','\n',a.ndim)

o/p:
Array :
[[ 5 10 15]
[20 25 20]]
Dimensions :
2

b) Shape of NumPy array


a = np.array([[1,2,3],[4,5,6]])
print('Array :','\n',a)
print('Shape :','\n',a.shape)
print('Rows = ',a.shape[0])
print('Columns = ',a.shape[1])
o/p:
Array :
[[1 2 3]
[4 5 6]]
Shape :
(2, 3)
Rows = 2
Columns = 3

c) Size of NumPy array


a = np.array([[5,10,15],[20,25,20]])
print('Size of array :',a.size)
print('Manual determination of size of array :',a.shape[0]*a.shape[1])
o/p:
Size of array : 6
Manual determination of size of array : 6

d) Reshaping a NumPy array
a = np.array([3,6,9,12])
np.reshape(a,(2,2))
o/p:
array([[ 3, 6],
[ 9, 12]])

a = np.array([3,6,9,12,18,24])
print('Three rows :','\n',np.reshape(a,(3,-1)))
print('Three columns :','\n',np.reshape(a,(-1,3)))
o/p:
Three rows :
[[ 3 6]
[ 9 12]
[18 24]]
Three columns :
[[ 3 6 9]
[12 18 24]]

e) Flattening a NumPy array


a = np.ones((2,2))
b = a.flatten()
c = a.ravel()
print('Original shape :', a.shape)
print('Array :','\n', a)
print('Shape after flatten :',b.shape)
print('Array :','\n', b)
print('Shape after ravel :',c.shape)
print('Array :','\n', c)

o/p:
Original shape : (2, 2)
Array :
[[1. 1.]
[1. 1.]]
Shape after flatten : (4,)
Array :
[1. 1. 1. 1.]
Shape after ravel : (4,)
Array :
[1. 1. 1. 1.]
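
The listing shows identical results for flatten() and ravel(), but they differ: flatten() always returns a copy, while ravel() returns a view of the original data whenever possible. A minimal sketch of the difference:

a = np.ones((2,2))
b = a.flatten()   # copy: independent of a
c = a.ravel()     # view: shares memory with a
c[0] = 99
print(a)          # a[0, 0] is now 99.0
print(b)          # b is unchanged: [1. 1. 1. 1.]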

f) Transpose of a NumPy array


a = np.array([[1,2,3],
[4,5,6]])
b = np.transpose(a)
print('Original','\n','Shape',a.shape,'\n',a)
print('Transpose :','\n','Shape',b.shape,'\n',b)
o/p:
Original
Shape (2, 3)
[[1 2 3]
[4 5 6]]
Transpose :
Shape (3, 2)
[[1 4]
[2 5]
[3 6]]

RESULT: Hence, the concepts of shape and reshaping in NumPy arrays, including
dimensions, shape, size, reshaping, flattening, and transposing, have been successfully
understood and demonstrated.

3. EXPANDING AND SQUEEZING A NUMPY ARRAY

AIM: To demonstrate the concepts of expanding and squeezing NumPy arrays and sorting
elements within a NumPy array.
DESCRIPTION:
NumPy provides efficient ways to manipulate the shape and structure of arrays, enabling
versatile data processing. This task involves the following operations:
1. Expanding a NumPy Array:
Expanding refers to adding a new axis to an array. This is achieved using the
np.expand_dims() function or slicing with np.newaxis. It is particularly useful for
reshaping data to match specific dimensions required for computations.
2. Squeezing a NumPy Array:
Squeezing removes dimensions of size 1 from an array. This is done using the
np.squeeze() function. It is often used to reduce unnecessary dimensions and simplify
data structures.
3. Sorting in NumPy Arrays:
Sorting rearranges the elements of an array in ascending or descending order. NumPy
offers the np.sort() function for sorting along a specified axis or the flattened array.
These operations are essential for preparing data for analysis, ensuring compatibility with
machine learning models, and enhancing the clarity of data structures.

PROGRAM:
a) Expanding a NumPy array
import numpy as np
a = np.array([1,2,3])
b = np.expand_dims(a,axis=0)
c = np.expand_dims(a,axis=1)
print('Original:','\n','Shape',a.shape,'\n',a)
print('Expand along columns:','\n','Shape',b.shape,'\n',b)
print('Expand along rows:','\n','Shape',c.shape,'\n',c)
o/p:
Original:
Shape (3,)
[1 2 3]

Expand along columns:
Shape (1, 3)
[[1 2 3]]
Expand along rows:
Shape (3, 1)
[[1]
[2]
[3]]
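
As noted in the description, slicing with np.newaxis is an equivalent way to add an axis; a minimal sketch using the same array a:

print(a[np.newaxis, :].shape)   # (1, 3), same as np.expand_dims(a, axis=0)
print(a[:, np.newaxis].shape)   # (3, 1), same as np.expand_dims(a, axis=1)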

b) Squeezing a NumPy array


a = np.array([[[1,2,3],
[4,5,6]]])
b = np.squeeze(a, axis=0)
print('Original','\n','Shape',a.shape,'\n',a)
print('Squeeze array:','\n','Shape',b.shape,'\n',b)
o/p:
Original
Shape (1, 2, 3)
[[[1 2 3]
[4 5 6]]]
Squeeze array:
Shape (2, 3)
[[1 2 3]
[4 5 6]]

c) Sorting in NumPy arrays


a = np.array([1,4,2,5,3,6,8,7,9])
np.sort(a, kind='quicksort')
o/p:
array([1, 2, 3, 4, 5, 6, 7, 8, 9])

a = np.array([[5,6,7,4],
[9,2,3,7]])
print('Sort along column :','\n',np.sort(a, kind='mergesort',axis=1))
print('Sort along row :','\n',np.sort(a, kind='mergesort',axis=0))
o/p:
Sort along column :
[[4 5 6 7]
[2 3 7 9]]
Sort along row :
[[5 2 3 4]
[9 6 7 7]]

RESULT: Hence, the concepts of expanding and squeezing NumPy arrays, as well as sorting
elements within a NumPy array, have been successfully demonstrated.

4. INDEXING AND SLICING OF NUMPY ARRAY

AIM: To understand and demonstrate the concepts of indexing and slicing in NumPy arrays,
including slicing 1D, 2D, and 3D arrays, as well as negative slicing techniques.
DESCRIPTION:
Indexing and slicing are fundamental operations for accessing and manipulating elements
within NumPy arrays. This task focuses on the following operations:
1. Slicing 1-D NumPy Arrays:
In 1D arrays, slicing allows us to extract specific ranges of elements. The slicing
syntax [start:stop:step] is used to access subsets of the array.
2. Slicing 2-D NumPy Arrays:
For 2D arrays, slicing can be done in both row and column directions. Using the
syntax [start_row:end_row, start_col:end_col], we can extract subarrays by specifying
the rows and columns to include.
3. Slicing 3-D NumPy Arrays:
For 3D arrays, slicing works similarly, but it adds an additional dimension. We can
slice the array along all three axes, using the syntax [start_layer:end_layer,
start_row:end_row, start_col:end_col].
4. Negative Slicing of NumPy Arrays:
Negative indexing allows us to access elements counted from the end of the array. Negative
slicing, such as a[-3:-1] or a slice with a negative step, refers to elements starting from the
back of the array, which is particularly useful for reversing arrays or accessing trailing
elements.
These operations are essential for extracting and modifying subsets of data in NumPy, and
they are widely used in data manipulation, machine learning, and scientific computing tasks.

PROGRAM:
a) Slicing 1-D NumPy arrays
import numpy as np
a = np.array([1,2,3,4,5,6])
print(a[1:6:2])
o/p: [2 4 6]

a = np.array([1,2,3,4,5,6])
print(a[:6:2])
print(a[1::2])

print(a[1:6:])
o/p:
[1 3 5]
[2 4 6]
[2 3 4 5 6]

b) Slicing 2-D NumPy arrays


a = np.array([[1,2,3],[4,5,6]])
print('First row values :','\n',a[0:1,:])
print('Alternate values from first row:','\n',a[0:1,::2])
print('Second column values :','\n',a[:,1::2])
print('Arbitrary values :','\n',a[0:1,1:3])
o/p:
First row values :
[[1 2 3]]
Alternate values from first row:
[[1 3]]
Second column values :
[[2]
[5]]
Arbitrary values :
[[2 3]]

c) Slicing 3-D NumPy arrays


a = np.array([[[1,2],[3,4],[5,6]],
[[7,8],[9,10],[11,12]],
[[13,14],[15,16],[17,18]]])
print(a)
o/p:
[[[ 1 2]

[ 3 4]
[ 5 6]]

[[ 7 8]
[ 9 10]
[11 12]]

[[13 14]
[15 16]
[17 18]]]
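
The listing above only prints the full 3-D array; a minimal sketch of actual 3-D slices on the same array a:

print(a[0])         # first 2-D block: [[1 2] [3 4] [5 6]]
print(a[:, 0, :])   # first row of every block: [[ 1  2] [ 7  8] [13 14]]
print(a[0, 1, 1])   # a single element: 4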

d) Negative slicing of NumPy arrays


a = np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print(a[:,-1])
o/p: [ 5 10]

print(a[:,-1:-3:-1])
o/p:
[[ 5 4]
[10 9]]
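
A negative step is what makes negative slicing useful for reversing an array; a minimal sketch on the same array a:

print(a[::-1])      # rows reversed:    [[ 6  7  8  9 10] [ 1  2  3  4  5]]
print(a[:, ::-1])   # columns reversed: [[ 5  4  3  2  1] [10  9  8  7  6]]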

RESULT: Hence, the concepts of indexing and slicing in NumPy arrays, including slicing
1D, 2D, and 3D arrays, as well as negative slicing techniques, have been successfully
understood and demonstrated.

5. STACKING AND CONCATENATING NUMPY ARRAYS

AIM: To understand and demonstrate the concepts of stacking and concatenating NumPy
arrays, along with broadcasting operations in NumPy.
DESCRIPTION:
Stacking and concatenating are operations used to combine multiple arrays into a single array,
while broadcasting allows for operations between arrays of different shapes. This task focuses
on the following operations:
1. Stacking ndarrays:
Stacking is the process of combining multiple arrays along a new axis. This can be
done using functions like np.vstack() (vertical stacking), np.hstack() (horizontal
stacking), or np.dstack() (depth stacking). It creates higher-dimensional arrays by
stacking lower-dimensional arrays.
2. Concatenating ndarrays:
Concatenation involves joining multiple arrays along an existing axis. The
np.concatenate() function is used to combine arrays along a specified axis (0 for rows,
1 for columns, etc.), resulting in a larger array with the same number of dimensions.
3. Broadcasting in NumPy Arrays:
Broadcasting refers to the ability of NumPy to perform arithmetic operations on
arrays of different shapes by automatically expanding the smaller array to match the
larger one. This allows element-wise operations between arrays without explicit
reshaping.
These operations are essential for manipulating and combining data in NumPy arrays,
allowing for more complex computations in data analysis, machine learning, and scientific
computing.
PROGRAM:
a) Stacking nd-arrays
import numpy as np
a = np.arange(0,5)
b = np.arange(5,10)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Vertical stacking :','\n',np.vstack((a,b)))
print('Horizontal stacking :','\n',np.hstack((a,b)))
o/p:
Array 1 :

[0 1 2 3 4]
Array 2 :
[5 6 7 8 9]
Vertical stacking :
[[0 1 2 3 4]
[5 6 7 8 9]]
Horizontal stacking :
[0 1 2 3 4 5 6 7 8 9]

a = [[1,2],[3,4]]
b = [[5,6],[7,8]]
c = np.dstack((a,b))
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Dstack :','\n',c)
print(c.shape)
o/p:
Array 1 :
[[1, 2], [3, 4]]
Array 2 :
[[5, 6], [7, 8]]
Dstack :
[[[1 5]
[2 6]]

[[3 7]
[4 8]]]
(2, 2, 2)

b) Concatenating nd-arrays
a = np.arange(0,5).reshape(1,5)
b = np.arange(5,10).reshape(1,5)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Concatenate along rows :','\n',np.concatenate((a,b),axis=0))
print('Concatenate along columns :','\n',np.concatenate((a,b),axis=1))
o/p:
Array 1 :
[[0 1 2 3 4]]
Array 2 :
[[5 6 7 8 9]]
Concatenate along rows :
[[0 1 2 3 4]
[5 6 7 8 9]]
Concatenate along columns :
[[0 1 2 3 4 5 6 7 8 9]]

a = np.array([[1,2],
[3,4]])
np.append(a,[[5,6]], axis=0)
o/p:
array([[1, 2],
[3, 4],
[5, 6]])

c) Broadcasting in NumPy arrays


a = np.arange(10,20,2)
b = np.array([[2],[2]])
print('Adding two different size arrays :','\n',a+b)

print('Multiplying an ndarray and a number :',a*2)
o/p:
Adding two different size arrays :
[[12 14 16 18 20]
[12 14 16 18 20]]
Multiplying an ndarray and a number : [20 24 28 32 36]

RESULT: Hence, the concepts of stacking and concatenating NumPy arrays, along with
broadcasting operations, have been successfully understood and demonstrated.

6. OPERATIONS WITH PANDAS

AIM: To understand and demonstrate how to create a DataFrame, perform concatenation, set
conditions, and add new columns in a Pandas DataFrame.
DESCRIPTION:
Pandas is one of the most popular and powerful data science libraries in Python, and a natural
stepping stone for any aspiring data scientist who prefers to code in Python. Although the
library is easy to get started with, it supports a wide variety of data manipulation, making it
one of the handiest data science libraries in the developer community. Pandas enables the
manipulation of large datasets and data frames, and serves as an efficient statistical tool for
mathematical computations on tabular data. This task focuses on the following operations:
1. Creating a DataFrame:
A DataFrame is the core data structure in Pandas. It is a 2-dimensional labeled data
structure with columns of potentially different types. This operation will cover how to
create a DataFrame from lists, dictionaries, or other data structures.
2. Using concat() for DataFrame Concatenation:
The concat() function allows for combining multiple DataFrames along a particular
axis (either rows or columns). This operation demonstrates how to concatenate
DataFrames to combine datasets efficiently.
3. Setting Conditions in a DataFrame:
Conditional statements can be used to filter or modify data within a DataFrame. This
operation will explore how to apply conditions to create boolean masks for selecting
data or modifying values based on certain criteria.
4. Adding a New Column:
Adding a new column to a DataFrame is an essential operation when transforming or
enriching the data. This task will show how to add new columns, either by applying a
function to existing data or by assigning static or calculated values.
PROGRAM:
a) Creating Dataframe
import pandas as pd
data_england = {'Name': ['Kane', 'Sterling', 'Saka', 'Maguire'], 'Age': [27, 26, 19, 28]}
data_italy = {'Name': ['Immobile', 'Insigne', 'Chiellini', 'Chiesa'], 'Age': [31, 30, 36, 23]}
df_england = pd.DataFrame(data_england)
df_italy = pd.DataFrame(data_italy)
print("England Players:")

print(df_england)
print("\nItaly Players:")
print(df_italy)
o/p:
England Players:
       Name  Age
0      Kane   27
1  Sterling   26
2      Saka   19
3   Maguire   28

Italy Players:
        Name  Age
0   Immobile   31
1    Insigne   30
2  Chiellini   36
3     Chiesa   23

b) concat()
frames = [df_england, df_italy]
both_teams = pd.concat(frames)
both_teams
o/p:
        Name  Age
0       Kane   27
1   Sterling   26
2       Saka   19
3    Maguire   28
0   Immobile   31
1    Insigne   30
2  Chiellini   36
3     Chiesa   23

pd.concat(frames, keys=["England", "Italy"])
o/p:
                Name  Age
England 0       Kane   27
        1   Sterling   26
        2       Saka   19
        3    Maguire   28
Italy   0   Immobile   31
        1    Insigne   30
        2  Chiellini   36
        3     Chiesa   23

c) Setting Conditions
both_teams[both_teams["Age"] >= 30]
o/p:
        Name  Age
0   Immobile   31
1    Insigne   30
2  Chiellini   36

both_teams[both_teams["Name"].str.startswith('S')]
o/p:
       Name  Age
1  Sterling   26
2      Saka   19

d) Adding a new column


club = ['Tottenham', 'Man City', 'Arsenal', 'Man Utd']
df_england['Associated Clubs'] = club
df_england

o/p:
       Name  Age Associated Clubs
0      Kane   27        Tottenham
1  Sterling   26         Man City
2      Saka   19          Arsenal
3   Maguire   28          Man Utd

frames = [df_england, df_italy]
both_teams = pd.concat(frames)
both_teams
o/p:
        Name  Age Associated Clubs
0       Kane   27        Tottenham
1   Sterling   26         Man City
2       Saka   19          Arsenal
3    Maguire   28          Man Utd
0   Immobile   31              NaN
1    Insigne   30              NaN
2  Chiellini   36              NaN
3     Chiesa   23              NaN

RESULT: Hence, the concepts of creating a DataFrame, performing concatenation, setting
conditions, and adding new columns in a Pandas DataFrame have been successfully
understood and demonstrated.

7. PANDAS OPERATIONS: HANDLING NaN, SORTING, AND GROUPING DATA

AIM: To understand and demonstrate how to handle NaN values, sort data based on column values,
and group data using the groupby() function in Pandas.

DESCRIPTION:
This task covers essential operations in Pandas for data cleaning, organization, and analysis.
The following operations will be demonstrated:
1. Filling NaN with String:
NaN (Not a Number) values are common in real-world datasets. This operation
demonstrates how to fill NaN values in a DataFrame with a specified string (or any
other placeholder), ensuring the data is complete and usable for analysis.
2. Sorting Based on Column Values:
Sorting is an essential operation to organize data in a specific order. This operation
will show how to sort a DataFrame based on values in one or more columns, either in
ascending or descending order.
3. Using groupby() for Grouping Data:
The groupby() function in Pandas is used to group data based on column values,
which is useful for performing aggregate functions like summing, averaging, or
counting within each group. This operation will demonstrate how to group data and
apply aggregation functions.
These operations are fundamental for data preprocessing, cleaning, and analysis, and are
frequently used in data science, machine learning, and data manipulation tasks.
PROGRAM:
a) Filling NaN with string
# assign the result back rather than using inplace=True on a column slice
both_teams['Associated Clubs'] = both_teams['Associated Clubs'].fillna('No Data Found')
both_teams
o/p:
        Name  Age Associated Clubs
0       Kane   27        Tottenham
1   Sterling   26         Man City
2       Saka   19          Arsenal
3    Maguire   28          Man Utd
0   Immobile   31    No Data Found
1    Insigne   30    No Data Found
2  Chiellini   36    No Data Found
3     Chiesa   23    No Data Found


b) Sorting based on column values


both_teams.sort_values('Name')
o/p:
        Name  Age Associated Clubs
2  Chiellini   36    No Data Found
3     Chiesa   23    No Data Found
0   Immobile   31    No Data Found
1    Insigne   30    No Data Found
0       Kane   27        Tottenham
3    Maguire   28          Man Utd
2       Saka   19          Arsenal
1   Sterling   26         Man City

both_teams.sort_values('Age')
o/p:
        Name  Age Associated Clubs
2       Saka   19          Arsenal
3     Chiesa   23    No Data Found
1   Sterling   26         Man City
0       Kane   27        Tottenham
3    Maguire   28          Man Utd
1    Insigne   30    No Data Found
0   Immobile   31    No Data Found
2  Chiellini   36    No Data Found


c) groupby()

a = {
    'UserID': ['U1001', 'U1002', 'U1001', 'U1001', 'U1003'],
    'Transaction': [500, 300, 200, 300, 700]
}
df_a = pd.DataFrame(a)
df_a
o/p:
  UserID  Transaction
0  U1001          500
1  U1002          300
2  U1001          200
3  U1001          300
4  U1003          700


df_a.groupby('UserID').sum()
o/p:
        Transaction
UserID
U1001          1000
U1002           300
U1003           700


df_a.groupby('UserID').get_group('U1001')
o/p:
  UserID  Transaction
0  U1001          500
2  U1001          200
3  U1001          300
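
The description also mentions averaging and counting; these follow the same groupby() pattern, and agg() applies several aggregations at once. A minimal sketch using the df_a frame defined above:

print(df_a.groupby('UserID')['Transaction'].mean())
print(df_a.groupby('UserID')['Transaction'].count())
print(df_a.groupby('UserID')['Transaction'].agg(['sum', 'mean', 'count']))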

RESULT: Hence, the concepts of handling NaN values, sorting data based on column values,
and grouping data using the groupby() function in Pandas have been successfully understood
and demonstrated.

8. READING VARIOUS FILE FORMATS USING PANDAS

AIM: To read and load different types of data files (Text files, CSV files, Excel files, and
JSON files) into Python using the Pandas library for data manipulation and analysis.
DESCRIPTION:
This task focuses on using Pandas, a powerful data analysis library in Python, to read data
from various common file formats. These include:
• Text Files: Raw data files containing information, which can be loaded into a Pandas
DataFrame for analysis.
• CSV Files: Comma-separated values files that represent tabular data, which can be
easily read into a DataFrame using the read_csv() function.
• Excel Files: Files in .xls or .xlsx format, typically containing spreadsheets that can be
read using Pandas' read_excel() function.
• JSON Files: JavaScript Object Notation files, commonly used for structured data
exchange, can be read into a DataFrame with read_json().
By using these file formats with Pandas, you can efficiently load data into Python and
perform various data manipulation tasks like filtering, transforming, and analyzing the data.
PROGRAM:
a) Text files
with open(r'E:\pythonfiles\datascience.txt', 'r') as f:
    print(f.read())
o/p:
Data Science with Python refers to the use of Python programming language and its rich
ecosystem of libraries to extract insights from data. Python has become one of the most
popular languages for data science due to its simplicity, readability, and extensive set of tools
and libraries designed specifically for data analysis, machine learning, and visualization.
(Note: This output appears only if the file exists and contains text. Adjust the path to
wherever your file is stored.)

b) CSV files
import pandas as pd
df = pd.read_csv(r'./Importing files/Employee.txt',delimiter='\t')
df

o/p: (a DataFrame built from the tab-delimited Employee.txt file)

(Note: This output appears only if the file exists. Adjust the path to wherever your file is stored.)

c) Excel files
df = pd.read_excel(r'./Importing files/World_city.xlsx')
df
o/p: (a DataFrame built from the first sheet of World_city.xlsx)

df = pd.read_excel(r'./Importing files/World_city.xlsx', sheet_name='Europe')
df
o/p: (a DataFrame built from the 'Europe' sheet)

(Note: This output appears only if the file exists. Adjust the path to wherever your file is stored.)

d) JSON files
import json

with open('./Importing files/sample_json.json', 'r') as file:
    data = json.load(file)
print(type(data))
df_json = pd.DataFrame(data)
df_json
o/p: the type of the parsed object (e.g. <class 'dict'>) followed by a DataFrame preview

path = './Importing files/sample_json.json'
df = pd.read_json(path)
df
o/p: (the same DataFrame, built directly with read_json)

(Note: This output appears only if the file exists. Adjust the path to wherever your file is stored.)

RESULT: Hence, reading and loading different types of data files (Text files, CSV files,
Excel files, and JSON files) into Python using the Pandas library for data manipulation and
analysis has been successfully executed.

9. READING AND IMPORTING DATA FROM VARIOUS FILE FORMATS
AND DATABASES USING PYTHON

AIM: To demonstrate how to read and import data from various file formats, including
Pickle files, image files using PIL, multiple files using Glob, and importing data from
databases into Python for further analysis and processing.
DESCRIPTION:
This task covers the different ways to load and handle files in Python, using specialized
libraries and techniques for various data types. The aim is to provide a practical
understanding of how to work with different file formats and efficiently import data for
processing.
1. Pickle Files:
The Pickle module in Python is used for serializing and deserializing Python object
structures. This is helpful when you need to store Python objects for later use. It
allows saving complex data structures such as lists, dictionaries, and more, and
retrieving them later.
2. Image Files using PIL (Pillow):
PIL (Pillow) is a Python Imaging Library that provides easy-to-use methods for
opening, manipulating, and saving various image formats. This section will cover how
to open image files, perform transformations, and process image data using Pillow.
3. Multiple Files using Glob:
The Glob module in Python allows you to find all the pathnames matching a specified
pattern. It is particularly useful when you want to read multiple files (e.g., all text files
in a directory) without manually specifying their names. This enables dynamic file
handling.
4. Importing Data from Databases:
Python provides various libraries to connect to databases, such as sqlite3, MySQL
(using libraries like PyMySQL or SQLAlchemy), and PostgreSQL. This section will
demonstrate how to query data from relational databases and load it into Python for
further processing.
PROGRAM:
a) Pickle files
import pickle
import pandas as pd

with open('./Importing files/sample_pickle.pkl', 'rb') as file:
    data = pickle.load(file)
print(type(data))
df_pkl = pd.DataFrame(data)
df_pkl

o/p: (the type of the unpickled object and a DataFrame preview of the data)
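
If sample_pickle.pkl is not available, it can be created first with pickle.dump(); the dictionary below is an assumed example, not part of the original manual:

import pickle

sample = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
with open('./Importing files/sample_pickle.pkl', 'wb') as file:
    pickle.dump(sample, file)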
b) Image files using PIL


from PIL import Image
filename = r'C:\Users\Dell\Desktop\Delhi\1.jpg'
Image.open(filename)
o/p: (the image is displayed)
c) Multiple files using Glob


import glob

for i in glob.glob('./Importing files/*.py'):
    print(i)
o/p: (the paths of all matching .py files are printed)
import matplotlib.pyplot as plt

filepath = r'./Importing files/Delhi'
images = glob.glob(filepath + '/*.jpg')
for i in images[:3]:
    im = Image.open(i)
    plt.imshow(im)
    plt.show()
o/p: (the first three images are displayed)
d) Importing data from Database
import pandas as pd
import sqlite3

con = sqlite3.connect('./Importing files/sample_test.db')
cur = con.cursor()
rs = cur.execute('select * from TEST')
# cursor.description holds the column names of the last query
df = pd.DataFrame(rs.fetchall(), columns=[d[0] for d in cur.description])
con.close()
df
o/p: (a DataFrame containing the rows of the TEST table)
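
If sample_test.db does not exist, a TEST table can be created first; the schema and rows below are assumptions for illustration only:

import sqlite3

con = sqlite3.connect('./Importing files/sample_test.db')
cur = con.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS TEST (id INTEGER, name TEXT)')
cur.executemany('INSERT INTO TEST VALUES (?, ?)', [(1, 'alpha'), (2, 'beta')])
con.commit()
con.close()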
RESULT: Hence, reading and importing data from various file formats, including Pickle
files, image files using PIL, multiple files using Glob, and data from databases into Python
for further analysis and processing has been successfully demonstrated.

10. WEB SCRAPING USING PYTHON

AIM: To perform web scraping using Python.


DESCRIPTION:
Web Scraping refers to extracting large amounts of data from the web. This is important for a
data scientist who has to analyze large amounts of data.
Python provides a very handy module called requests to retrieve data from any website.
The requests.get() function takes in a URL as its parameter and returns the HTML response
as its output. The way it works is summarized in the following steps:
1. It packages a GET request to retrieve data from the webpage
2. Sends the request to the server
3. Receives the HTML response and stores it in a response object
PROGRAM:
import requests

url = "https://weather.com/en-IN/weather/tenday/l/aff9460b9160c73ff01769fd83ae82cf37cb27fb7eb73c70b91257d413147b69"
resp = requests.get(url)
text = resp.text
print(text)
o/p: (the raw HTML of the page is printed)
But as you can see, the data is not very readable. The tree-like structure of the HTML content
retrieved by our request is not very comprehensible. To improve this readability, Python has
another wonderful library called BeautifulSoup.
BeautifulSoup is a Python library for parsing the tree-like structure of HTML and extracting
data from the HTML document.

To make it work, we need to pass the text response from the request object
to BeautifulSoup(), which creates its own object – "soup" in this case. Calling prettify() on
the BeautifulSoup object pretty-prints the tree-like structure of the HTML document:

import requests
from bs4 import BeautifulSoup

url = "https://weather.com/en-IN/weather/tenday/l/aff9460b9160c73ff01769fd83ae82cf37cb27fb7eb73c70b91257d413147b69"
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')   # specify the parser explicitly
print(soup.prettify())
o/p: (the indented, prettified HTML is printed)
You must have noticed the difference in the output. We have a more structured output in this
case!
Now, we can extract the title of the webpage by accessing the title attribute of our soup object:

title = soup.title
title

The webpage has a lot of pictures of the famous monuments in Delhi and other things related
to Delhi. Let’s try and store these in a local folder.
We will need Python's urllib library to retrieve the URLs of the images that we want to
store. Its urllib.request module is used for opening and reading URLs; calling its
urlretrieve() function downloads the object denoted by a URL to a local file:

import urllib.request

def download_img(url, i):
    folder = 'C:\\Users\\Dell\\Desktop\\Delhi\\'
    filepath = folder + str(i) + '.jpg'
    urllib.request.urlretrieve(url, filepath)

The images are stored in the “img” tag in HTML. These can be found by calling find_all() on
the soup object. After this, we can iterate over the image and get its source by calling
the get() function on the image object. The rest is handled by our download function:

images = soup.find_all('img')
i = 1
for image in images[2:10]:
    try:
        download_img('https:' + image.get('src'), i)
        i = i + 1
    except:
        continue

from PIL import Image

filename = r'C:\Users\Dell\Desktop\Delhi\1.jpg'
Image.open(filename)

o/p: (the first downloaded image is displayed)
RESULT: Hence, web scraping using Python with libraries like BeautifulSoup and Requests
has been successfully executed.

11. PREPROCESSING TECHNIQUES FOR LOAN PREDICTION DATASET

AIM: To apply and evaluate various preprocessing techniques on the Loan Prediction dataset
to prepare it for effective model training and prediction.
DESCRIPTION:
In this task, several preprocessing techniques will be applied to a Loan Prediction dataset to
enhance the data's quality and suitability for machine learning models. The steps include:
1. Feature Scaling: Scaling numerical features to a common scale to improve the
convergence rate of machine learning algorithms.
2. Feature Standardization: Standardizing features to have a mean of zero and a
standard deviation of one, ensuring that no feature dominates others due to differences
in scale.
3. Label Encoding: Converting categorical labels into numeric values, allowing
machine learning algorithms to interpret them more efficiently.
4. One Hot Encoding: Converting categorical variables into a series of binary columns,
each representing a different category, enabling algorithms to work with categorical
data without imposing any ordinal relationship.
These techniques are essential for improving model performance and ensuring that the data is
in the right format for machine learning algorithms.
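
Before the dataset-level program, a toy sketch of what the first two techniques compute (the array x is an assumed example):

import numpy as np

x = np.array([10., 20., 30., 40.])
print((x - x.min()) / (x.max() - x.min()))   # min-max scaling -> values in [0, 1]
print((x - x.mean()) / x.std())              # standardization -> mean 0, std 1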
(Before applying feature scaling, we need to import packages and datasets):
import pandas as pd
X_train=pd.read_csv('X_train.csv')
Y_train=pd.read_csv('Y_train.csv')
X_test=pd.read_csv('X_test.csv')
Y_test=pd.read_csv('Y_test.csv')
print (X_train.head()) # Provide respective csv files for training and testing
o/p:
Loan_ID Gender Married Dependents Education Self_Employed
15 LP001032 Male No 0 Graduate No
248 LP001824 Male Yes 1 Graduate No
590 LP002928 Male Yes 0 Graduate No
246 LP001814 Male Yes 2 Graduate No
388 LP002244 Male Yes 0 Graduate No

ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
15 4950 0.0 125.0 360.0
248 2882 1843.0 123.0 480.0
590 3000 3416.0 56.0 180.0
246 9703 0.0 112.0 360.0
388 2333 2417.0 136.0 360.0

Credit_History Property_Area
15 1.0 Urban
248 1.0 Semiurban
590 1.0 Semiurban
246 1.0 Urban
388 1.0 Urban

PROGRAM:
a) Feature Scaling
import matplotlib.pyplot as plt
X_train[X_train.dtypes[(X_train.dtypes == "float64") | (X_train.dtypes == "int64")].index.values].hist(figsize=[11, 11])
o/p: (histograms of all numeric columns)
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

min_max = MinMaxScaler()
X_train_minmax = min_max.fit_transform(X_train[['ApplicantIncome', 'CoapplicantIncome',
                                                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
X_test_minmax = min_max.fit_transform(X_test[['ApplicantIncome', 'CoapplicantIncome',
                                              'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_minmax, Y_train.values.ravel())   # ravel() turns the single-column frame into a 1-D target
accuracy_score(Y_test, knn.predict(X_test_minmax))
o/p: 0.75

b) Feature Standardization
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression

X_train_scale = scale(X_train[['ApplicantIncome', 'CoapplicantIncome',
                               'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
X_test_scale = scale(X_test[['ApplicantIncome', 'CoapplicantIncome',
                             'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
log = LogisticRegression(penalty='l2', C=.01)
log.fit(X_train_scale, Y_train.values.ravel())
accuracy_score(Y_test, log.predict(X_test_scale))
o/p: 0.75

c) Label Encoding
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print('Before :', X_train['Gender'].value_counts())
for col in X_test.columns.values:
    if X_test[col].dtypes == 'object':
        data = pd.concat([X_train[col], X_test[col]])   # fit on the union of both splits
        le.fit(data.values)
        X_train[col] = le.transform(X_train[col])
        X_test[col] = le.transform(X_test[col])
print('After :', X_train['Gender'].value_counts())
o/p:
Before : Male 318
Female 66
Name: Gender, dtype: int64
After : 1 318
0 66
Name: Gender, dtype: int64

d) One Hot Encoding


X_train_scale = scale(X_train)
X_test_scale = scale(X_test)
log = LogisticRegression(penalty='l2', C=1)
log.fit(X_train_scale, Y_train.values.ravel())
accuracy_score(Y_test, log.predict(X_test_scale))
o/p: 0.73958333333333337

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# On scikit-learn >= 1.2 the argument is sparse_output=False instead of sparse=False.
enc = OneHotEncoder(sparse=False)
X_train_1 = X_train
X_test_1 = X_test

columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
           'Credit_History', 'Property_Area']

for col in columns:
    # fit the encoder on the union of both splits so every category is seen
    data = pd.concat([X_train[[col]], X_test[[col]]])
    enc.fit(data)

    temp = enc.transform(X_train[[col]])
    temp = pd.DataFrame(temp, columns=[(col + "_" + str(i)) for i in data[col].value_counts().index])
    temp = temp.set_index(X_train.index.values)
    X_train_1 = pd.concat([X_train_1, temp], axis=1)

    temp = enc.transform(X_test[[col]])
    temp = pd.DataFrame(temp, columns=[(col + "_" + str(i)) for i in data[col].value_counts().index])
    temp = temp.set_index(X_test.index.values)
    X_test_1 = pd.concat([X_test_1, temp], axis=1)

from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train_scale = scale(X_train_1)
X_test_scale = scale(X_test_1)

log = LogisticRegression(penalty='l2', C=1)
log.fit(X_train_scale, Y_train.values.ravel())
accuracy_score(Y_test, log.predict(X_test_scale))

o/p: 0.75

RESULT: Hence, various preprocessing techniques have been successfully applied and
evaluated on the Loan Prediction dataset to prepare it for effective model training and
prediction.

12. DATA VISUALIZATIONS USING MATPLOTLIB

AIM: To perform various types of visualizations using Matplotlib to explore and understand
the dataset, and to identify patterns, trends, and relationships in the data.
DESCRIPTION:
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a
variety of hardcopy formats and interactive environments across platforms. Matplotlib can be
used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application
servers, and several graphical user interface toolkits. In this task, the following visualizations
will be performed using Matplotlib:
1. Bar Graph: To visualize the distribution of categorical data by showing the frequency
of each category.
2. Pie Chart: To display proportions or percentages of different categories in the dataset
for easy comparison.
3. Box Plot: To show the spread of numerical data and identify outliers by visualizing
quartiles and data range.
4. Histogram: To represent the distribution of a continuous numerical variable, showing
frequency counts for different ranges.
5. Line Chart and Subplots: To track changes over time or across categories, including
the use of subplots for multiple comparisons in a single view.
6. Scatter Plot: To explore the relationship between two continuous variables by
plotting data points on a two-dimensional axis.
These visualizations will help in understanding the dataset's structure and characteristics,
making it easier to analyze the data and make informed decisions.
PROGRAM:
Let us first import the relevant libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn')   # on newer Matplotlib versions use 'seaborn-v0_8'

df_meal = pd.read_csv('C:\\Users\Dell\\Desktop\\train_food\\meal_info.csv')
df_meal.head()

df_center = pd.read_csv('C:\\Users\Dell\\Desktop\\train_food\\fulfilment_center_info.csv')
df_center.head()

df_food = pd.read_csv('C:\\Users\Dell\\Desktop\\train_food\\train_food.csv')
df_food.head()

df = pd.merge(df_food,df_center,on='center_id')
df = pd.merge(df,df_meal,on='meal_id')

table = pd.pivot_table(data=df,index='category',values='num_orders',aggfunc=np.sum)
table

o/p: (total num_orders per food category)
a) Bar Graph
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn')
df_meal = pd.read_csv('meal_info.csv')
df_center = pd.read_csv('fulfilment_center_info.csv')
df_food = pd.read_csv('train_food.csv')
df = pd.merge(df_food,df_center,on='center_id')
df = pd.merge(df,df_meal,on='meal_id')
table = pd.pivot_table(data=df,index='category',values='num_orders',aggfunc=np.sum)

plt.bar(table.index,table['num_orders'])
plt.xticks(rotation=70)
plt.xlabel('Food item')
plt.ylabel('Quantity sold')
plt.title('Most popular food')
plt.show()
o/p: (bar chart 'Most popular food')

item_count = {}
for i in range(table.index.nunique()):
    item_count[table.index[i]] = table.num_orders[i] / df_meal[df_meal['category'] == table.index[i]].shape[0]

plt.bar([x for x in item_count.keys()], [x for x in item_count.values()], color='orange')
plt.xticks(rotation=70)
plt.xlabel('Food item')
plt.ylabel('No. of meals')
plt.title('Meals per food item')
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_7.png', dpi=300, bbox_inches='tight')
plt.show()

o/p: (bar chart 'Meals per food item')

b) Pie Chart
d_cuisine = {}
total = df['num_orders'].sum()
for i in range(df['cuisine'].nunique()):
    c = df['cuisine'].unique()[i]
    c_order = df[df['cuisine'] == c]['num_orders'].sum()
    d_cuisine[c] = c_order / total

plt.pie([x * 100 for x in d_cuisine.values()], labels=[x for x in d_cuisine.keys()],
        autopct='%0.1f', explode=[0, 0, 0.1, 0])
plt.title('Cuisine share %')
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_8.png', dpi=300, bbox_inches='tight')
plt.show()

o/p: (pie chart 'Cuisine share %')

c) Box Plot
c_price = {}
for i in df['cuisine'].unique():
    c_price[i] = df[df['cuisine'] == i].base_price

plt.boxplot([x for x in c_price.values()], labels=[x for x in c_price.keys()])
plt.xlabel('Cuisine')
plt.ylabel('Price')
plt.title('Analysing cuisine price')
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_9.png', dpi=300, bbox_inches='tight')
plt.show()
o/p: (box plot 'Analysing cuisine price')

d) Histogram
plt.hist(df['base_price'], rwidth=0.9, alpha=0.3, color='blue', bins=15, edgecolor='red')
plt.xlabel('Base price range')
plt.ylabel('Distinct order')
plt.title('Inspecting price effect')
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_10.png', dpi=300, bbox_inches='tight')
plt.show()

o/p: (histogram 'Inspecting price effect')

e) Line chart and Subplots


df['revenue'] = df.apply(lambda x: x.checkout_price * x.num_orders, axis=1)
df['month'] = df['week'].apply(lambda x: x // 4)

month = []
month_order = []
for i in range(max(df['month'])):
    month.append(i)
    month_order.append(df[df['month'] == i].revenue.sum())

week = []
week_order = []
for i in range(max(df['week'])):
    week.append(i)
    week_order.append(df[df['week'] == i].revenue.sum())

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
ax[0].plot(week, week_order)
ax[0].set_xlabel('Week')
ax[0].set_ylabel('Revenue')
ax[0].set_title('Weekly income')
ax[1].plot(month, month_order)
ax[1].set_xlabel('Month')
ax[1].set_ylabel('Revenue')
ax[1].set_title('Monthly income')
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_11.png', dpi=300, bbox_inches='tight')
plt.show()

o/p: (two line charts: 'Weekly income' and 'Monthly income')

f) Scatter plot
center_type_name = ['TYPE_A', 'TYPE_B', 'TYPE_C']
op_table = pd.pivot_table(df, index='op_area', values='num_orders', aggfunc=np.sum)
c_type = {}
for i in center_type_name:
    c_type[i] = df[df['center_type'] == i].op_area
center_table = pd.pivot_table(df, index='center_type', values='num_orders', aggfunc=np.sum)

fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(8, 12))
ax[0].scatter(op_table.index, op_table['num_orders'], color='pink')
ax[0].set_xlabel('Operation area')
ax[0].set_ylabel('Number of orders')
ax[0].set_title('Does operation area affect num of orders?')
ax[0].annotate('optimum operation area of 4 km^2', xy=(4.2, 1.1*10**7), xytext=(7, 1.1*10**7),
               arrowprops=dict(facecolor='black', shrink=0.05), fontsize=12)
ax[1].boxplot([x for x in c_type.values()], labels=[x for x in c_type.keys()])
ax[1].set_xlabel('Center type')
ax[1].set_ylabel('Operation area')
ax[1].set_title('Which center type had the optimum operation area?')
ax[2].bar(center_table.index, center_table['num_orders'], alpha=0.7, color='orange', width=0.5)
ax[2].set_xlabel('Center type')
ax[2].set_ylabel('Number of orders')
ax[2].set_title('Orders per center type')
plt.tight_layout()
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_12.png', dpi=300, bbox_inches='tight')
plt.show()

o/p: (scatter plot, box plot, and bar chart on operation area and center type)

RESULT: Hence, various types of visualizations using Matplotlib have been performed
successfully to explore and understand the dataset, and to identify patterns, trends, and
relationships in the data.

13. GETTING STARTED WITH NLTK

AIM: To install and set up the Natural Language Toolkit (NLTK) library using PIP, laying the
foundation for natural language processing (NLP) tasks.
DESCRIPTION:
NLTK is the Natural Language Toolkit, a library for building Python programs that work with
human language data. It offers an easy-to-use interface and supports tasks such as
classification, stemming, tagging, and more.
Installing NLTK on Windows using PIP:
On Windows, we first have to install the current version of Python, which ships with pip.
Without pip, NLTK cannot be installed.

Step 1: Browse to the official Python website, python.org.

Step 2: Move the cursor to the Download button & then click on the latest Python version.

Step 3: Open the downloaded file. Tick the checkbox and click on Customize installation.

Step 4: Click on Next.

Step 5: Click on Install.

Step 6: Wait till the installation finishes.

Step 7: Click on Close.

Step 8: Open Command Prompt & execute the following commands:
python --version
pip --version
pip install nltk

Hence, NLTK installation will start.

Step 9: A "Successfully installed" message confirms the installation.

Hence, the NLTK installation is successful.
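
To verify the installation from Python, a minimal sketch (it also downloads the 'punkt' models used in the next experiments):

import nltk
nltk.download('punkt')
print(nltk.word_tokenize('NLTK is installed and working.'))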

RESULT: Hence, NLTK has been successfully installed using PIP.

14. IMPLEMENTATION OF PYTHON PROGRAM WITH SCIKIT-LEARN
AND NLTK

AIM: To utilize Scikit-Learn and NLTK libraries in Python for machine learning and natural
language processing tasks.
DESCRIPTION:
This Python program implements a text classification model using Natural Language Processing (NLP) and
machine learning techniques. It classifies a set of text documents into predefined categories
(Tech, Sport, Fitness) using a Naive Bayes classifier, which is trained on text features extracted using
the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization method.

Key Steps:
1. Text Preprocessing:
o The program tokenizes the text documents and removes stopwords (commonly
used words like 'the', 'and', etc.) to improve the quality of the features used for
classification.
o It also ensures that only alphanumeric tokens are considered for classification.
2. Feature Extraction:
o The TF-IDF Vectorizer is used to convert the preprocessed text into numerical
feature vectors, which represent the importance of words in the documents
relative to the entire corpus.
3. Model Training:
o The Multinomial Naive Bayes classifier is used to classify the documents into
categories based on the extracted features. The model is trained using an 80%
subset of the dataset.
4. Model Evaluation:
o The model is evaluated using the test set (20% of the data) to predict the
category labels for unseen documents. The program calculates the accuracy of
the model and generates a detailed classification report, including precision,
recall, and F1-score for each category.
5. Cross-Validation:
o To further assess the performance of the model, 5-fold cross-validation is used
to evaluate the model's generalizability and robustness.
The program prints the accuracy of the model on the test set, a classification report showing
detailed metrics (precision, recall, F1-score) for each class, and the cross-validated accuracy of
the model across multiple splits of the dataset.

PROGRAM:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report

nltk.download('punkt')
nltk.download('stopwords')

documents = [
'I love programming in Python',
'Machine learning is great',
'Natural Language Processing is fascinating',
'I enjoy solving problems using code',
'Python is a great programming language',
'Python is widely used in AI and data science',
'Football is a popular sport around the world',
'The Olympics are held every four years',
'The football match was exciting to watch',
'I love watching basketball games',
'Running is a great way to stay fit',
'Yoga helps in reducing stress and improving flexibility',
'Swimming is a great full-body exercise',
'Cycling is a popular outdoor activity',
]

labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = word_tokenize(text.lower())
    return ' '.join([word for word in tokens if word not in stop_words and word.isalnum()])

processed_docs = [preprocess(doc) for doc in documents]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

X_train, X_test, y_train, y_test = train_test_split(processed_docs, labels,
                                                    test_size=0.2, random_state=42)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Tech", "Sport", "Fitness"]))

cv_scores = cross_val_score(model, processed_docs, labels, cv=5)
print(f"\nCross-validated accuracy: {cv_scores.mean() * 100:.2f}%")

o/p: (the accuracy, classification report, and cross-validated accuracy are printed)
RESULT: Hence, the Scikit-Learn and NLTK libraries in Python have been successfully
utilized and executed for machine learning and natural language processing tasks.

15. PYTHON PROGRAM TO IMPLEMENT TEXT PROCESSING USING
NLTK/SPACY/PYNLPL

AIM: To develop a Python program that utilizes NLTK for text processing tasks such as
tokenization, sentence segmentation, and word tokenization to prepare text data for further
analysis or NLP applications.
DESCRIPTION:
This Python program uses NLTK's sent_tokenize and word_tokenize functions to break down
a given text into sentences and words. The program first downloads the necessary punkt
tokenizer models, which are pre-trained models for sentence and word segmentation. The text
is then processed to identify and separate sentences using the sent_tokenize function, and each
sentence is further split into words using the word_tokenize function. The program outputs the
list of sentences and words, which can be used for tasks such as text analysis, machine learning
preprocessing, or natural language processing (NLP) applications. This program serves as an
introduction to basic text processing techniques in NLP using Python.
PROGRAM:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
text = "Natural language processing (NLP) is a field of computer science, artificial intelligence
and computational linguistics concerned with the interactions between computers and human
(natural) languages, and, in particular, concerned with programming computers to fruitfully
process large natural language corpora. Challenges in natural language processing frequently
involve natural language understanding, natural language generation (frequently from formal,
machine-readable logical forms), connecting language and machine perception, managing
human-computer dialog systems, or some combination thereof."
print("Sentences Tokenized:")
print(sent_tokenize(text))
print("\nWords Tokenized:")
print(word_tokenize(text))
o/p:
Sentences Tokenized:
['Natural language processing (NLP) is a field of computer science, artificial intelligence and
computational linguistics concerned with the interactions between computers and human
(natural) languages, and, in particular, concerned with programming computers to fruitfully
process large natural language corpora.', 'Challenges in natural language processing frequently
involve natural language understanding, natural language generation (frequently from formal,
machine-readable logical forms), connecting language and machine perception, managing
human-computer dialog systems, or some combination thereof.']

Words Tokenized:
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',',
'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the',
'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'and', ',', 'in',
'particular', ',', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large',
'natural', 'language', 'corpora', '.', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently',
'involve', 'natural', 'language', 'understanding', ',', 'natural', 'language', 'generation', '(',
'frequently', 'from', 'formal', ',', 'machine-readable', 'logical', 'forms', ')', ',', 'connecting',
'language', 'and', 'machine', 'perception', ',', 'managing', 'human-computer', 'dialog', 'systems', ',',
'or', 'some', 'combination', 'thereof', '.']

RESULT: Hence, a Python program that utilizes NLTK for text processing tasks such as
tokenization, sentence segmentation, and word tokenization has been successfully developed
to prepare text data for further analysis or NLP applications.
