
II B.Tech II Semester
INTRODUCTION TO DATA SCIENCE USING PYTHON

DEPARTMENT OF
ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


Check list for Lab Manual

S. No.  Particulars
1.  Mission and Vision
2.  Course Outcomes
3.  Guidelines for the student
4.  List of Programs as per University
Artificial Intelligence And Data Science
Vision and Mission of the Department
Vision

To be a Model in Quality Education for producing highly talented and globally recognized students with sound ethics, the latest knowledge, and innovative ideas in Computer Science & Engineering.

Mission

To be a Model in Quality Education by

M1: Imparting a sound theoretical basis and wide-ranging practical experience to the Students for fulfilling the upcoming needs of the Society in the various fields of Computer Science & Engineering.

M2: Offering the Students an overall background suitable for making a Successful career in
Industry/Research/Higher Education in India and abroad.

M3: Providing opportunity to the Students for Learning beyond Curriculum and improving
Communication Skills.

M4: Engaging Students in Learning, Understanding and Applying Novel Ideas.

Course: Artificial Intelligence Lab using Python Course Code: 23A30403

CO (Course Outcomes)                                          RBT* (Revised Bloom's Taxonomy)

CO1  To use control structures and operators to write basic Python programs.                              L3 (Apply)
CO2  To analyze object-oriented concepts in Python.                                                       L4 (Analyze)
CO3  To evaluate AI models pre-processed through various feature engineering algorithms in Python.        L5 (Evaluate)
CO4  To develop the code for a recommender system using Natural Language Processing.                      L6 (Create)
CO5  To design various reinforcement algorithms to solve real-time complex problems.                      L6 (Create)
CO PO-PSO Articulation Matrix

Course Outcomes (COs) mapped to Program Outcomes (PO1-PO12) and Program Specific Outcomes (PSO1, PSO2); entries give the correlation level of each mapped outcome:

CO1: 2 2 1 2 2
CO2: 2 2 1 2 1 2 2
CO3: 2 1 2 2 1 2 2
CO4: 2 2 2 1 2 1 1 2
CO5: 2 2 2 1 2 1 2 1
Guidelines for the Students:
1. Students should be regular and come prepared for the lab practice.
2. In case a student misses a class, it is his/her responsibility to complete the missed experiment(s).
3. Students should bring the observation book, lab journal and lab manual. The prescribed textbook and class notes can be kept ready for reference if required.
4. They should implement the given program individually.
5. While conducting the experiments, students should see that their programs meet the following criteria:
 Programs should be interactive, with appropriate prompt messages, error messages if any, and descriptive messages for outputs.
 Programs should perform input validation (data type, range error, etc.), give appropriate error messages and suggest corrective actions.
 Comments should be used to give the statement of the problem, and every function should indicate the purpose of the function, its inputs and outputs.
 Statements within the program should be properly indented.
 Use meaningful names for variables and functions.
 Make use of constants and type definitions wherever needed.
6. Once the experiment(s) are executed, students should show the program and results to the instructors and copy the same into their observation book.
7. Questions for lab tests and exams need not be limited to the questions in the manual, but could involve variations and/or combinations of those questions.
List of Experiments
1. Creating a NumPy Array
a. Basic ndarray
b. Array of zeros
c. Array of ones
d. Random numbers in ndarray
e. An array of your choice
f. Identity matrix in NumPy
g. Evenly spaced ndarray
2. The Shape and Reshaping of NumPy Array
a. Dimensions of NumPy array
b. Shape of NumPy array
c. Size of NumPy array
d. Reshaping a NumPy array
e. Flattening a NumPy array
f. Transpose of a NumPy array
3. Expanding and Squeezing a NumPy Array
a. Expanding a NumPy array
b. Squeezing a NumPy array
c. Sorting in NumPy Arrays
4. Indexing and Slicing of NumPy Array
a. Slicing 1-D NumPy arrays
b. Slicing 2-D NumPy arrays
c. Slicing 3-D NumPy arrays
d. Negative slicing of NumPy arrays
5. Stacking and Concatenating Numpy Arrays
a. Stacking ndarrays
b. Concatenating ndarrays
c. Broadcasting in NumPy Arrays
6. Perform following operations using pandas
a. Creating data frame
b. concat()
c. Setting conditions
d. Adding a new column
7. Perform following operations using pandas
a. Filling NaN with string
b. Sorting based on column values
c. groupby()
8. Read the following file formats using pandas
a. Text files
b. CSV files
c. Excel files
d. JSON files
9. Read the following file formats
a. Pickle files
b. Image files using PIL
c. Multiple files using Glob
d. Importing data from database
10. Demonstrate web scraping using python
11. Perform following preprocessing techniques on loan prediction dataset
a. Feature Scaling
b. Feature Standardization
c. Label Encoding
d. One Hot Encoding
12. Perform following visualizations using matplotlib
a. Bar Graph
b. Pie Chart
c. Box Plot
d. Histogram
e. Line Chart and Subplots
f. Scatter Plot
13. Getting started with NLTK, install NLTK using PIP
14. Python program to implement with Python Scikit-Learn & NLTK
15. Python program to implement with Python NLTK/spaCy/PyNLPI.

Web References:
1. https://www.analyticsvidhya.com/blog/2020/04/the-ultimate-numpy-tutorial-for-data-science-beginners/
2. https://www.analyticsvidhya.com/blog/2021/07/data-science-with-pandas-2-minutes-guide-to-key-concepts/
3. https://www.analyticsvidhya.com/blog/2020/04/how-to-read-common-file-formats-python/
4. https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/
5. https://www.analyticsvidhya.com/blog/2020/02/beginner-guide-matplotlib-data-visualization-exploration-python/
6. https://www.nltk.org/book/ch01.html
Experiment-1
1.Creating a NumPy Array
a. Basic ndarray
b. Array of zeros
c. Array of ones
d. Random numbers in ndarray
e. An array of your choice
f. Identity matrix in NumPy
g. Evenly spaced ndarray

a. Basic ndarray
Create a NumPy array using a Python list or tuple.

import numpy as np

# Creating a basic ndarray

basic_array = np.array([1, 2, 3, 4, 5])

print("Basic ndarray:\n", basic_array)

OUTPUT

Basic ndarray:

[1 2 3 4 5]
b. Array of Zeros

Create an array filled with zeros.

import numpy as np

# Array of zeros

zeros_array = np.zeros((3, 4)) # 3 rows, 4 columns

print("Array of zeros:\n", zeros_array)

OUTPUT

Array of zeros:

[[0. 0. 0. 0.]

[0. 0. 0. 0.]

[0. 0. 0. 0.]]

c. Array of Ones

Create an array filled with ones.

# Array of ones

ones_array = np.ones((2, 5)) # 2 rows, 5 columns

print("Array of ones:\n", ones_array)

OUTPUT

Array of ones:

[[1. 1. 1. 1. 1.]

[1. 1. 1. 1. 1.]]
d. Random Numbers in ndarray

Create an array with random numbers.

import numpy as np

# Random numbers array

random_array = np.random.random((4, 4)) # 4x4 array of random numbers between 0 and 1

print("Random numbers array:\n", random_array)

OUTPUT

Random numbers array:

[[0.27224358 0.60158252 0.53894698 0.97854983]

[0.07764887 0.93821304 0.23114664 0.1588078 ]

[0.39133845 0.35115649 0.10656164 0.46716248]

[0.95782011 0.13896551 0.09214915 0.01784185]]


e. An Array of Your Choice

Create an array with specific elements.

import numpy as np

# Array of your choice

custom_array = np.array([[10, 20, 30], [40, 50, 60]])

print("Custom array:\n", custom_array)

OUTPUT:

Custom array:

[[10 20 30]

[40 50 60]]

f. Identity Matrix in NumPy

Create an identity matrix.

import numpy as np

# Identity matrix

identity_matrix = np.eye(3) # 3x3 identity matrix

print("Identity matrix:\n", identity_matrix)

OUTPUT

Identity matrix:

[[1. 0. 0.]

[0. 1. 0.]

[0. 0. 1.]]
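
g. Evenly Spaced ndarray

Create arrays of evenly spaced values; a minimal sketch using np.arange() (fixed step) and np.linspace() (fixed count of points):

import numpy as np

# Evenly spaced values from 0 up to (but not including) 10, step 2
arange_array = np.arange(0, 10, 2)

print("Evenly spaced (arange):\n", arange_array)

# 5 evenly spaced values from 0 to 1, endpoints included
linspace_array = np.linspace(0, 1, 5)

print("Evenly spaced (linspace):\n", linspace_array)

OUTPUT

Evenly spaced (arange):
 [0 2 4 6 8]

Evenly spaced (linspace):
 [0.   0.25 0.5  0.75 1.  ]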
Experiment-2
2. The Shape and Reshaping of a NumPy Array

a. Dimensions of NumPy array


b. Shape of NumPy array
c. Size of NumPy array
d. Reshaping a NumPy array
e. Flattening a NumPy array
f. Transpose of a NumPy array

a. Dimensions of NumPy Array

The dimension of a NumPy array refers to the number of axes (or levels) the array has.

 1D Array: A single line of elements (e.g., [1, 2, 3]).


 2D Array: A table of elements with rows and columns (e.g., [[1, 2], [3, 4]]).
 3D Array and Beyond: Higher-dimensional structures (e.g., matrices stacked in a cube).

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim)

OUTPUT

2

b. Shape of NumPy Array

The shape of a NumPy array is a tuple indicating the size along each dimension.

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)

OUTPUT

(2, 3)
c. Size of NumPy Array

The size of a NumPy array is the total number of elements in the array. It’s equivalent to
multiplying all the dimensions together.

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.size)

OUTPUT

6

d. Reshaping a NumPy Array

Reshaping means changing the shape of the array without altering its data.

Rules for reshaping:

1. The new shape must be compatible with the total number of elements.
2. Use -1 to infer one dimension automatically:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

reshaped = arr.reshape(2, 3) # Reshape to 2 rows, 3 columns

print(reshaped)

OUTPUT

[[1 2 3]

[4 5 6]]
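
Rule 2 above mentions passing -1 to let NumPy infer one dimension; a short sketch of the same array reshaped that way:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# -1 tells NumPy to infer the second dimension from the total size (6 / 2 = 3)
reshaped = arr.reshape(2, -1)

print(reshaped)

OUTPUT

[[1 2 3]

[4 5 6]]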
e. Flattening a NumPy Array

Flattening converts a multidimensional array into a 1D array.

 Methods to flatten:
o .flatten() (returns a copy):

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
flat = arr.flatten()
print(flat)

OUTPUT

[1 2 3 4 5 6]
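
Besides .flatten(), the related method .ravel() returns a view of the original data where possible, so modifying the result can change the source array; a brief sketch:

import numpy as np

arr = np.array([[1, 2], [3, 4]])

rav = arr.ravel()

rav[0] = 99  # propagates to arr, since ravel() returned a view here

print(arr)

OUTPUT

[[99  2]

 [ 3  4]]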

f. Transpose of a NumPy Array

The transpose of an array swaps its rows and columns (for 2D) or reverses the axes (for higher
dimensions). Use .T or np.transpose().

import numpy as np

# Example: Transpose

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

transposed = arr_2d.T

print("Original array:\n", arr_2d)

print("Transposed array:\n", transposed)

OUTPUT

Original array:

[[1 2 3]

[4 5 6]]
Transposed array:

[[1 4]

[2 5]

[3 6]]
Experiment-3

3. Expanding and Squeezing a NumPy Array

a. Expanding a NumPy array

b. Squeezing a NumPy array

c. Sorting in NumPy Arrays

a. Expanding a NumPy Array

Expanding a NumPy array means adding new axes to its dimensions. This can be done using
functions like numpy.expand_dims() or slicing with numpy.newaxis.

Methods to Expand Dimensions:

Using numpy.expand_dims():

import numpy as np

arr = np.array([1, 2, 3])

expanded = np.expand_dims(arr, axis=0) # Adds a new axis at the 0th position

print(expanded.shape) # Output: (1, 3)

OUTPUT

(1, 3)
b. Squeezing a NumPy Array

Squeezing a NumPy array removes axes with size 1. This is achieved using numpy.squeeze().

Using numpy.squeeze():

import numpy as np

arr = np.array([[[1, 2, 3]]]) # Shape: (1, 1, 3)

squeezed = np.squeeze(arr) # Removes axes with size 1

print(squeezed.shape)

OUTPUT

(3,)

c. Sorting in NumPy Arrays

Sorting in NumPy can be done along any axis using the numpy.sort() function. It does not
modify the original array (returns a sorted copy).

Basic Sorting:

import numpy as np

arr = np.array([3, 1, 2])

sorted_arr = np.sort(arr)

print(sorted_arr) # Output: [1, 2, 3]

OUTPUT

[1 2 3]
Sorting Along an Axis:

import numpy as np

arr = np.array([[3, 1, 2], [6, 5, 4]])

sorted_arr = np.sort(arr, axis=0) # Sorts each column independently (along axis 0)

print(sorted_arr)

OUTPUT

[[3 1 2]

[6 5 4]]

In-place Sorting:

Use the .sort() method for in-place sorting.

import numpy as np

arr = np.array([3, 1, 2])

arr.sort()

print(arr) # Output: [1, 2, 3]

OUTPUT

[1 2 3]

Advanced Sorting:

For indices of sorted elements, use numpy.argsort():

import numpy as np

arr = np.array([3, 1, 2])

indices = np.argsort(arr)

print(indices) # Output: [1, 2, 0]

OUTPUT

[1 2 0]
Experiment-4
4. Indexing and Slicing of a NumPy Array

a. Slicing 1-D NumPy arrays
b. Slicing 2-D NumPy arrays
c. Slicing 3-D NumPy arrays
d. Negative slicing of NumPy arrays

a. Slicing 1-D NumPy Arrays

For one-dimensional arrays, slicing works similarly to Python lists.

import numpy as np

# Create a 1-D array

arr = np.array([10, 20, 30, 40, 50, 60])

# Slice from index 1 to 4 (exclusive)

print(arr[1:4]) # Output: [20 30 40]

# Slice with a step of 2

print(arr[::2]) # Output: [10 30 50]

# Slice from index 2 to the end

print(arr[2:]) # Output: [30 40 50 60]

# Slice up to index 3 (exclusive)

print(arr[:3]) # Output: [10 20 30]

OUTPUT

[20 30 40]

[10 30 50]

[30 40 50 60]

[10 20 30]
b. Slicing 2-D NumPy Arrays

For two-dimensional arrays, slicing can be done along both rows and columns.

import numpy as np

# Create a 2-D array

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Slice rows 0 to 1 (exclusive) and columns 1 to 3 (exclusive)

print(arr[0:2, 1:3])

# Output:

# [[2 3]

# [5 6]]

# Slice all rows and column 1

print(arr[:, 1]) # Output: [2 5 8]

# Slice row 1 and all columns

print(arr[1, :]) # Output: [4 5 6]

# Slice every other row and column

print(arr[::2, ::2])

# Output:

# [[1 3]

# [7 9]]
OUTPUT

[[2 3]

[5 6]]

[2 5 8]

[4 5 6]

[[1 3]

[7 9]]
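
c. Slicing 3-D NumPy Arrays

For three-dimensional arrays, one index or slice is supplied per axis; a minimal sketch:

import numpy as np

# Create a 3-D array of shape (2, 2, 3)
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

# First 2-D block along axis 0
print(arr[0])

# All blocks, row 1, first two columns
print(arr[:, 1, :2])

OUTPUT

[[1 2 3]

[4 5 6]]

[[ 4  5]

[10 11]]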
d. Negative Slicing of NumPy Arrays

Negative slicing allows you to access elements from the end of an array.

import numpy as np

# Create a 1-D array

arr = np.array([10, 20, 30, 40, 50])

# Slice the last three elements

print(arr[-3:]) # Output: [30 40 50]

# Slice from the start to the second-last element

print(arr[:-1]) # Output: [10 20 30 40]

# Reverse the array using negative step

print(arr[::-1]) # Output: [50 40 30 20 10]

# 2-D array negative slicing

arr2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Last two rows and last two columns

print(arr2[-2:, -2:])

# Output:

# [[5 6]

# [8 9]]

# Reverse rows

print(arr2[::-1])

OUTPUT

[30 40 50]

[10 20 30 40]

[50 40 30 20 10]

[[5 6]

[8 9]]

[[7 8 9]

[4 5 6]

[1 2 3]]
Experiment-5

5. Stacking and Concatenating NumPy Arrays

Stacking and concatenating are methods to combine arrays in different ways. Stacking involves
combining along new axes, while concatenating merges along existing axes.

a. Stacking ndarrays

b. Concatenating ndarrays

c. Broadcasting in NumPy Arrays

a. Stacking ndarrays

1. Vertical Stacking (np.vstack)

Combines arrays vertically (row-wise); 1-D inputs become the rows of a 2-D array.

import numpy as np

# Create arrays

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

# Vertical stack

result = np.vstack((arr1, arr2))

print(result)

OUTPUT

[[1 2 3]

[4 5 6]]
2. Horizontal Stacking (np.hstack)

Combines arrays horizontally (column-wise); 1-D inputs are simply joined end to end.

import numpy as np

# Create arrays

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

result = np.hstack((arr1, arr2))

print(result)

OUTPUT

[1 2 3 4 5 6]

3. Depth Stacking (np.dstack)

Combines arrays along the third axis (depth).

import numpy as np

# Create arrays

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

# Depth stack

result = np.dstack((arr1, arr2))

print(result)

OUTPUT

[[[1 4]

[2 5]

[3 6]]]
4. Stacking with np.stack

Specifies the axis along which stacking occurs.

import numpy as np

# Create arrays

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

# Stack along a new axis

result = np.stack((arr1, arr2), axis=1)

print(result)

OUTPUT

[[1 4]

[2 5]

[3 6]]
b. Concatenating NumPy Arrays

Concatenation merges arrays along an existing axis. Unlike stacking, it does not add a new axis.

1. Concatenating Along Rows or Columns


import numpy as np

# Create 2-D arrays

arr1 = np.array([[1, 2], [3, 4]])

arr2 = np.array([[5, 6]])

# Concatenate along rows (axis 0)

result = np.concatenate((arr1, arr2), axis=0)

print(result)

# Output:

# [[1 2]

# [3 4]

# [5 6]]

# Concatenate along columns (axis 1)

arr3 = np.array([[7], [8]])

result = np.concatenate((arr1, arr3), axis=1)

print(result)

# Output:

# [[1 2 7]

# [3 4 8]]

OUTPUT

[[1 2]
[3 4]
[5 6]]
[[1 2 7]
[3 4 8]]
c. Broadcasting in NumPy Arrays

Broadcasting is a powerful mechanism in NumPy that allows operations on arrays of different


shapes. Smaller arrays are "broadcasted" to match the shape of larger arrays during operations.

1. Rules for Broadcasting

1. If the dimensions of the two arrays are not the same, NumPy pads the smaller array with 1 from
the left.
2. If the sizes of the dimensions don’t match, the size of one must be 1, or the operation will fail.
3. The resulting shape is determined by taking the maximum of each dimension.

2. Examples of Broadcasting
import numpy as np

# Array addition

arr1 = np.array([[1, 2, 3], [4, 5, 6]])

arr2 = np.array([1, 2, 3]) # Shape (3,)

# Broadcasting occurs

result = arr1 + arr2

print(result)

# Output:

# [[ 2 4 6]

# [ 5 7 9]]

OUTPUT

[[2 4 6]

[5 7 9]]
Broadcasting with Scalar Values

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Add scalar to each element

result = arr + 10

print(result)

OUTPUT

[[11 12 13]

[14 15 16]]

Broadcasting with Different Dimensions


import numpy as np

arr1 = np.array([[1], [2], [3]]) # Shape (3, 1)

arr2 = np.array([10, 20, 30]) # Shape (3,)

# Broadcasting occurs

result = arr1 + arr2

print(result)

OUTPUT

[[11 21 31]

[12 22 32]

[13 23 33]]
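
When rule 2 is violated (two dimension sizes differ and neither is 1), the operation fails; a small sketch of the error case:

import numpy as np

arr1 = np.ones((2, 3))

arr2 = np.ones(2)  # Shape (2,) pads to (1, 2); 3 vs 2 cannot broadcast

try:
    result = arr1 + arr2
except ValueError as e:
    print("Broadcasting failed:", e)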
Experiment-6
6. Perform the following operations using pandas

a. Creating data frame


b. concat()
c. Setting conditions
d. Adding a new column

1. Perform Operations Using Pandas

Pandas is a powerful Python library for data manipulation. Below are examples for each
operation:

a. Creating a DataFrame

A DataFrame is a two-dimensional, tabular data structure with labeled rows and columns.

import pandas as pd

# Creating a DataFrame from a dictionary

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

print(df)

OUTPUT

Name Age City

0 Alice 25 New York

1 Bob 30 Los Angeles

2 Charlie 35 Chicago
b. Using concat()

The concat() function is used to concatenate multiple DataFrames along a specified axis (rows
or columns).

import pandas as pd

# Create two DataFrames

df1 = pd.DataFrame({

'ID': [1, 2],

'Name': ['Alice', 'Bob']

})

df2 = pd.DataFrame({

'ID': [3, 4],

'Name': ['Charlie', 'David']

})

# Concatenate along rows (default axis=0)

result = pd.concat([df1, df2])

print(result)

# Concatenate along columns (axis=1)

result_col = pd.concat([df1, df2], axis=1)

print(result_col)
OUTPUT

ID Name

0 1 Alice

1 2 Bob

0 3 Charlie

1 4 David

ID Name ID Name

0 1 Alice 3 Charlie

1 2 Bob 4 David
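
Note that the row labels 0 and 1 repeat in the concatenated result. Passing ignore_index=True renumbers the rows; a short sketch:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})

df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'David']})

# Concatenate along rows and rebuild a fresh 0..n-1 index
result = pd.concat([df1, df2], ignore_index=True)

print(result)

OUTPUT

   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie
3   4    David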
c. Setting Conditions

Conditions are used to filter or modify rows/columns based on logical expressions.

import pandas as pd

# Creating a DataFrame from a dictionary

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

print(df)

# Filter rows where Age > 25

filtered_df = df[df['Age'] > 25]

print(filtered_df)

# Output:

# Name Age City

# 1 Bob 30 Los Angeles

# 2 Charlie 35 Chicago

# Set a condition to modify a column

df.loc[df['Age'] > 30, 'City'] = 'San Francisco'

print(df)
OUTPUT

Name Age City

0 Alice 25 New York

1 Bob 30 Los Angeles

2 Charlie 35 Chicago

Name Age City

1 Bob 30 Los Angeles

2 Charlie 35 Chicago

Name Age City

0 Alice 25 New York

1 Bob 30 Los Angeles

2 Charlie 35 San Francisco


d. Adding a New Column

A new column can be added directly by assigning values to a new column name.

import pandas as pd

# Creating a DataFrame from a dictionary

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

print(df)

# Add a new column 'Salary'

df['Salary'] = [50000, 60000, 70000]

print(df)

OUTPUT

Name Age City

0 Alice 25 New York

1 Bob 30 Los Angeles

2 Charlie 35 Chicago

Name Age City Salary

0 Alice 25 New York 50000

1 Bob 30 Los Angeles 60000

2 Charlie 35 Chicago 70000


Adding a New Column Based on Existing Data:

import pandas as pd

# Creating a DataFrame from a dictionary

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

print(df)

# Add a new column 'Salary'

df['Salary'] = [50000, 60000, 70000]

print(df)

# Add a column 'Tax' based on a percentage of 'Salary'

df['Tax'] = df['Salary'] * 0.1

print(df)
OUTPUT

Name Age City

0 Alice 25 New York

1 Bob 30 Los Angeles

2 Charlie 35 Chicago

Name Age City Salary

0 Alice 25 New York 50000

1 Bob 30 Los Angeles 60000

2 Charlie 35 Chicago 70000

Name Age City Salary Tax

0 Alice 25 New York 50000 5000.0

1 Bob 30 Los Angeles 60000 6000.0

2 Charlie 35 Chicago 70000 7000.0


Experiment-7
7. Perform the following operations using pandas

a. Filling NaN with string


b. Sorting based on column values
c. group by()
a. Filling NaN with a String

The fillna() function is used to replace missing values (NaN) with a specified value.

import pandas as pd

import numpy as np

# Create a DataFrame with NaN values

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
    'City': [np.nan, 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Fill NaN values with a string

df_filled = df.fillna('Unknown')

print(df_filled)

OUTPUT

Name Age City

0 Alice 25.0 Unknown

1 Bob Unknown Los Angeles

2 Charlie 35.0 Chicago
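
fillna() also works per column, so a numeric column can be filled with a statistic instead of a string; a brief sketch on the same data:

import pandas as pd

import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
    'City': [np.nan, 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Fill the numeric column with its mean (30.0) and the text column with a string
df['Age'] = df['Age'].fillna(df['Age'].mean())

df['City'] = df['City'].fillna('Unknown')

print(df)

OUTPUT

      Name   Age         City
0    Alice  25.0      Unknown
1      Bob  30.0  Los Angeles
2  Charlie  35.0      Chicago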


b. Sorting Based on Column Values

The sort_values() function is used to sort a DataFrame by column values. Sorting can be
ascending or descending.

import pandas as pd

# Create a DataFrame

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}

df = pd.DataFrame(data)

# Sort by 'Age' in ascending order

df_sorted = df.sort_values(by='Age')

print(df_sorted)

# Sort by 'Salary' in descending order

df_sorted_desc = df.sort_values(by='Salary', ascending=False)

print(df_sorted_desc)
OUTPUT

Name Age Salary

0 Alice 25 50000

1 Bob 30 60000

2 Charlie 35 70000

Name Age Salary

2 Charlie 35 70000

1 Bob 30 60000

0 Alice 25 50000
c. groupby()

The groupby() function is used to group data based on column values, and aggregation
functions (e.g., sum, mean, count) can be applied to each group.

import pandas as pd

# Create a DataFrame

data = {
    'Department': ['HR', 'Finance', 'HR', 'IT', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [50000, 60000, 55000, 70000, 65000]
}

df = pd.DataFrame(data)

# Group by 'Department' and calculate total salary

grouped = df.groupby('Department')['Salary'].sum()

print(grouped)

# Group by 'Department' and count employees

grouped_count = df.groupby('Department')['Employee'].count()

print(grouped_count)
OUTPUT

Department

Finance 125000

HR 105000

IT 70000

Name: Salary, dtype: int64

Department

Finance 2

HR 2

IT 1

Name: Employee, dtype: int64
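
Several aggregations can also be applied at once with .agg(); a short sketch on the same data:

import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'Finance', 'HR', 'IT', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [50000, 60000, 55000, 70000, 65000]
})

# Total, average and head-count per department in one call
print(df.groupby('Department')['Salary'].agg(['sum', 'mean', 'count']))

OUTPUT

               sum     mean  count
Department
Finance     125000  62500.0      2
HR          105000  52500.0      2
IT           70000  70000.0      1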


Experiment-8
8. Read the following file formats using pandas

a. Text files
b. CSV files
c. Excel files
d. JSON files

Pandas provides functions for loading data from different file formats into a DataFrame.

a. Reading Text Files

Delimited text files (e.g., separated by spaces or tabs) can be loaded with pandas.read_csv(); plain Python file handling also works, as shown below.

# Open "myfile.txt" in write mode and store its reference in file1

file1 = open("myfile.txt", "w")

L = ["This is Delhi \n", "This is Paris \n", "This is London \n"]

# \n is placed to indicate EOL (End of Line)

file1.write("Hello \n")
file1.writelines(L)

file1.close() # to change file access modes

file1 = open("myfile.txt", "r+")

print("Output of Read function is ")

print(file1.read())

print()

# seek(n) takes the file handle to the nth

# byte from the beginning.

file1.seek(0)

print("Output of Readline function is ")

print(file1.readline())

print()

file1.seek(0)

# To show difference between read and readline

print("Output of Read(9) function is ")

print(file1.read(9))

print()
file1.seek(0)

print("Output of Readline(9) function is ")

print(file1.readline(9))

file1.seek(0)

# readlines function

print("Output of Readlines function is ")

print(file1.readlines())

print()

file1.close()

OUTPUT

Output of Read function is


Hello
This is Delhi
This is Paris
This is London
Output of Readline function is
Hello
Output of Read(9) function is
Hello
Th
Output of Readline(9) function is
Hello
Output of Readlines function is
['Hello \n', 'This is Delhi \n', 'This is Paris \n', 'This is London
\n']
b. CSV Files

CSV (Comma-Separated Values) files can be read directly with pandas.read_csv(); the example below uses Python's built-in csv module instead.

# importing csv module

import csv

# csv file name

filename = "aapl.csv"

# initializing the titles and rows list

fields = []

rows = []

# reading csv file

with open(filename, 'r') as csvfile:

    # creating a csv reader object
    csvreader = csv.reader(csvfile)

    # extracting field names through first row
    fields = next(csvreader)

    # extracting each data row one by one
    for row in csvreader:
        rows.append(row)

    # get total number of rows
    print("Total no. of rows: %d" % (csvreader.line_num))

# printing the field names

print('Field names are:' + ', '.join(field for field in fields))

# printing first 5 rows

print('\nFirst 5 rows are:\n')

for row in rows[:5]:
    # parsing each column of a row
    for col in row:
        print("%10s" % col, end=" ")
    print('\n')

OUTPUT

(The printed field names and rows depend on the contents of aapl.csv.)
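
As the section introduction notes, the same file can be loaded in a single call with pandas; a minimal sketch (assuming aapl.csv is present in the working directory):

import pandas as pd

# The first row is used as the column header by default
df = pd.read_csv("aapl.csv")

print("Total no. of rows:", len(df))

print(df.head())  # first 5 rows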
c. Excel Files

Excel files can be read using pandas.read_excel(). You'll need the openpyxl library installed
to read .xlsx files or xlrd for .xls files.

# import openpyxl module

import openpyxl

# Give the location of the file

path = "gfg.xlsx"

# To open the workbook

# workbook object is created

wb_obj = openpyxl.load_workbook(path)

# Get workbook active sheet object

# from the active attribute

sheet_obj = wb_obj.active

cell_obj = sheet_obj.cell(row=1, column=1)

print(cell_obj.value)

OUTPUT

Name
import openpyxl

# Give the location of the file

path = "gfg.xlsx"

wb_obj = openpyxl.load_workbook(path)

sheet_obj = wb_obj.active

row = sheet_obj.max_row

column = sheet_obj.max_column

print("Total Rows:", row)

print("Total Columns:", column)

print("\nValue of first column")

for i in range(1, row + 1):
    cell_obj = sheet_obj.cell(row=i, column=1)
    print(cell_obj.value)

print("\nValue of first row")

for i in range(1, column + 1):
    cell_obj = sheet_obj.cell(row=2, column=i)
    print(cell_obj.value, end=" ")

OUTPUT

Total Rows: 6
Total Columns: 4
Value of first column
Name
Ankit
Rahul
Priya
Nikhil
Nisha
Value of first row
Ankit B.Tech CSE 4
d. JSON Files

JSON (JavaScript Object Notation) files can be read using pandas.read_json() when the file is in a compatible layout, or with Python's json module, as below.

import json

# Opening JSON file

f = open('data.json')

# returns JSON object as a dictionary

data = json.load(f)

# Iterating through the json list

for i in data['emp_details']:
    print(i)

# Closing file

f.close()

OUTPUT

{'emp_name': 'Shubham', 'email': '[email protected]', 'job_profile': 'intern'}
{'emp_name': 'Gaurav', 'email': '[email protected]', 'job_profile': 'developer'}
{'emp_name': 'Nikhil', 'email': '[email protected]', 'job_profile': 'Full Time'}
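
For record-oriented JSON, pandas can load the file directly into a DataFrame; a minimal sketch (assuming a hypothetical records.json holding a top-level list of records such as [{"emp_name": ..., "email": ..., "job_profile": ...}, ...]):

import pandas as pd

# Each JSON record becomes one row of the DataFrame
df = pd.read_json("records.json")

print(df)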
Experiment-9
9. Read the following file formats

a. Pickle files
b. Image files using PIL
c. Multiple files using Glob
d. Importing data from database
a. Reading Pickle Files

Pickle files store serialized Python objects. Pandas provides read_pickle() for DataFrames stored in pickle format; the example below uses the pickle module directly.

import pickle

# initializing data to be stored in db

Omkar = {'key' : 'Omkar', 'name' : 'Omkar Pathak',

'age' : 21, 'pay' : 40000}

Jagdish = {'key' : 'Jagdish', 'name' : 'Jagdish Pathak',

'age' : 50, 'pay' : 50000}

# database

db = {}

db['Omkar'] = Omkar

db['Jagdish'] = Jagdish

# For storing

# type(b) gives <class 'bytes'>;

b = pickle.dumps(db)

# For loading

myEntry = pickle.loads(b)

print(myEntry)
OUTPUT

{'Omkar': {'key': 'Omkar', 'name': 'Omkar Pathak', 'age': 21, 'pay': 40000}, 'Jagdish': {'key':
'Jagdish', 'name': 'Jagdish Pathak', 'age': 50, 'pay': 50000}}
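
Pandas provides the same round trip for DataFrames through to_pickle() and read_pickle(); a short sketch (the file name employees.pkl is illustrative):

import pandas as pd

df = pd.DataFrame({'name': ['Omkar', 'Jagdish'], 'pay': [40000, 50000]})

# Serialize the DataFrame to disk, then load it back
df.to_pickle("employees.pkl")

restored = pd.read_pickle("employees.pkl")

print(restored)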

b. Reading Image Files Using PIL

The Python Imaging Library (PIL) allows you to open, process, and display image files. The
library is available as Pillow.

# Imports PIL module

from PIL import Image

# open method used to open different extension image file

im = Image.open(r"C:\Users\System-Pc\Desktop\lion.png")

# This method will show image in any image viewer

im.show()

OUTPUT

(The image opens in the system's default image viewer.)
c. Reading Multiple Files Using Glob

The glob module retrieves file paths matching a specified pattern. This is useful for reading
multiple files from a directory.

import pandas as pd

import glob

# Define the file path pattern (e.g., all CSV files in a folder)

file_pattern = 'data_folder/*.csv'

# Use glob to get file paths

file_list = glob.glob(file_pattern)

# Read all files and concatenate them into one DataFrame

dataframes = [pd.read_csv(file) for file in file_list]

combined_df = pd.concat(dataframes, ignore_index=True)

print(combined_df)
d. Importing Data from a Database

You can use pandas with libraries like sqlite3 for SQLite databases or sqlalchemy for other
databases to read data into a DataFrame.

import sqlite3

import pandas as pd

# Connect to the SQLite database (or create one if it doesn't exist)

conn = sqlite3.connect('example.db')

# Create a table and insert data (for demonstration)

query = """

CREATE TABLE IF NOT EXISTS users (

id INTEGER PRIMARY KEY,

name TEXT,

age INTEGER

);

INSERT INTO users (name, age) VALUES ('Alice', 25), ('Bob', 30);

"""

conn.executescript(query)

# Query data from the database

df = pd.read_sql_query("SELECT * FROM users", conn)

print(df)

# Close the connection

conn.close()
Output:

   id   name  age
0   1  Alice   25
1   2    Bob   30
Experiment-10

10. Web Scraping Using Python


Steps for Web Scraping

1. Import necessary libraries.


2. Send an HTTP request to the target URL.
3. Parse the HTML content of the webpage.
4. Extract specific data based on tags or attributes.

Example: Scraping Titles and Links from a Website


Prerequisites:

Install required libraries:

pip install requests beautifulsoup4

import requests

from bs4 import BeautifulSoup

# Step 1: Send an HTTP request to the website

url = 'https://example.com' # Replace with the target website

response = requests.get(url)

# Check if the request was successful

if response.status_code == 200:

    # Step 2: Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Step 3: Extract data (e.g., titles and links)
    titles = soup.find_all('h2')  # Assuming titles are in <h2> tags
    links = soup.find_all('a', href=True)  # Find all anchor tags with 'href' attribute

    # Display the extracted data
    print("Titles:")
    for title in titles:
        print(title.text.strip())

    print("\nLinks:")
    for link in links:
        print(link['href'])

else:

    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

OUTPUT

Titles:
Welcome to Example
Our Services

Links:
https://example.com/about
https://example.com/services
https://example.com/contact
Experiment-11
11. Perform the following preprocessing techniques on the loan prediction dataset

a. Feature Scaling
b. Feature Standardization
c. Label Encoding
d. One Hot Encoding

To perform the following preprocessing techniques on a loan prediction dataset, here is a step-by-step breakdown using popular Python libraries such as pandas, scikit-learn, and NumPy. Code snippets are provided for each technique.

a. Feature Scaling

Feature scaling typically involves either normalization (min-max scaling) or standardization (z-
score normalization). Let's use Min-Max Scaling for scaling the features.

from sklearn.preprocessing import MinMaxScaler

import pandas as pd

# Assuming your dataset is loaded into a pandas DataFrame 'df'

# Load the dataset

df = pd.read_csv('loan_data.csv')

# Selecting numerical columns for scaling

numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Initialize MinMaxScaler

scaler = MinMaxScaler()

# Apply scaling to numerical columns

df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

# Check the scaled data

print(df.head())
b. Feature Standardization

Feature standardization involves scaling the features to have a mean of 0 and a standard
deviation of 1.

from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler

scaler = StandardScaler()

# Apply standardization to numerical columns

df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

# Check the standardized data

print(df.head())
c. Label Encoding

Label encoding is used to convert categorical labels into numeric form. Typically, this is done
for the target variable (e.g., "Loan_Status").

from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder

label_encoder = LabelEncoder()

# Apply label encoding to the target variable (e.g., 'Loan_Status')

df['Loan_Status'] = label_encoder.fit_transform(df['Loan_Status'])

# Check the encoded target variable

print(df['Loan_Status'].head())
d. One-Hot Encoding

One-Hot Encoding is used to convert categorical features into binary columns (0 or 1). This is often used for non-ordinal categorical variables like "Gender", "Marital_Status", etc.

import pandas as pd

# Apply One-Hot Encoding to categorical columns

df = pd.get_dummies(df, drop_first=True) # drop_first=True to avoid multicollinearity

# Check the dataset with one-hot encoded variables

print(df.head())
Experiment-12
12. Perform the following visualizations using matplotlib

a. Bar Graph
b. Pie Chart
c. Box Plot
d. Histogram
e. Line Chart and Subplots
f. Scatter Plot

import matplotlib.pyplot as plt

import numpy as np

# Sample data

categories = ['A', 'B', 'C', 'D', 'E']

values = [3, 7, 9, 5, 4]

data = np.random.randn(1000)

# a. Bar Graph

plt.figure(figsize=(6, 4))

plt.bar(categories, values, color='skyblue')

plt.title('Bar Graph')

plt.xlabel('Categories')

plt.ylabel('Values')

plt.show()

# b. Pie Chart

plt.figure(figsize=(6, 6))

plt.pie(values, labels=categories, autopct='%1.1f%%', startangle=90,
        colors=['#ff9999', '#66b3ff', '#99ff99', '#ffcc99', '#c2c2f0'])

plt.title('Pie Chart')

plt.axis('equal') # Equal aspect ratio ensures that pie chart is drawn as a circle.
plt.show()

# c. Box Plot

plt.figure(figsize=(6, 4))

plt.boxplot(data)

plt.title('Box Plot')

plt.ylabel('Values')

plt.show()

# d. Histogram

plt.figure(figsize=(6, 4))

plt.hist(data, bins=30, color='green', edgecolor='black')

plt.title('Histogram')

plt.xlabel('Data Values')

plt.ylabel('Frequency')

plt.show()

# e. Line Chart and Subplots

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Line Chart

axes[0].plot(values, marker='o', color='b')

axes[0].set_title('Line Chart')

axes[0].set_xlabel('Index')

axes[0].set_ylabel('Values')
# Line Chart with more points

x = np.linspace(0, 10, 100)

y = np.sin(x)

axes[1].plot(x, y, color='r')

axes[1].set_title('Sine Wave')

axes[1].set_xlabel('X')

axes[1].set_ylabel('sin(X)')

plt.tight_layout()

plt.show()

# f. Scatter Plot

x = np.random.randn(100)

y = np.random.randn(100)

plt.figure(figsize=(6, 4))

plt.scatter(x, y, color='purple')

plt.title('Scatter Plot')

plt.xlabel('X')

plt.ylabel('Y')

plt.show()
Experiment-13
13. Getting started with NLTK: install NLTK using pip

To get started with NLTK (Natural Language Toolkit), you first need to install it. You can
install it using pip, Python's package installer.

Here’s how you can do it:

Steps to Install NLTK:

1.Open a terminal or command prompt (depending on your operating system).

2.Install NLTK using pip: Run the following command:

pip install nltk

3.Verify Installation: After installation, you can verify if NLTK is successfully installed by
opening a Python interpreter or a script and importing it:

import nltk

print(nltk.__version__)

If NLTK is installed correctly, it should print the version number.

After installation, you can also download some NLTK data:

To use various resources (like corpora, tokenizers, and more) from NLTK, you need to download
the data.

Here’s how you can do it:

import nltk

nltk.download('punkt') # For tokenizing text into words and sentences

nltk.download('stopwords') # For common stopwords

nltk.download('wordnet') # For WordNet corpus (used for word definitions)

This will download necessary data files to work with NLTK.
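
A quick tokenization run confirms that the installation and data files work; a minimal sketch:

import nltk

from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models, needed once

print(word_tokenize("NLTK is ready to use."))

# Expected: ['NLTK', 'is', 'ready', 'to', 'use', '.']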


Experiment-14

14. Python program to implement with Python Scikit-Learn & NLTK

import nltk

from nltk.corpus import movie_reviews

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import accuracy_score

# Download required NLTK datasets

nltk.download('movie_reviews')

nltk.download('punkt')

nltk.download('stopwords')

# Step 1: Prepare the dataset

# Load the movie review dataset (positive and negative reviews)

reviews = [(list(movie_reviews.words(fileid)), category)
           for category in movie_reviews.categories()
           for fileid in movie_reviews.fileids(category)]

# Step 2: Preprocess the data


# Remove stopwords and tokenize text

stop_words = set(stopwords.words('english'))

def preprocess_text(words):

    # Keep alphabetic tokens, lowercase them, and drop stopwords
    return [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]

# Apply preprocessing to all reviews

processed_reviews = [(preprocess_text(words), category) for words, category in reviews]

# Step 3: Split data into features and labels

texts = [' '.join(words) for words, _ in processed_reviews]

labels = [category for _, category in processed_reviews]

# Step 4: Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42)

# Step 5: Convert text data to feature vectors using CountVectorizer

vectorizer = CountVectorizer()

X_train_vec = vectorizer.fit_transform(X_train)

X_test_vec = vectorizer.transform(X_test)

# Step 6: Train a Naive Bayes classifier


classifier = MultinomialNB()

classifier.fit(X_train_vec, y_train)

# Step 7: Make predictions on the test set

y_pred = classifier.predict(X_test_vec)

# Step 8: Evaluate the model's performance

accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy * 100:.2f}%')

# Example: Make a prediction on a new review

new_review = "This movie was amazing! The plot and acting were fantastic."

new_review_processed = ' '.join(preprocess_text(word_tokenize(new_review)))

new_review_vec = vectorizer.transform([new_review_processed])

prediction = classifier.predict(new_review_vec)

print(f'Prediction for the review: {prediction[0]}')

OUTPUT

Accuracy: 80.67%

Prediction for the review: pos


Experiment-15
15. Python program to implement with Python NLTK/spaCy/PyNLPI.

Install the required packages:

First, make sure you have installed the necessary libraries:

pip install nltk spacy pynlpl   # the PyNLPI library is published on PyPI as "pynlpl"

python -m spacy download en_core_web_sm # Download a spaCy model for English

Python Program:

This program will perform the following tasks:

 Tokenize text using NLTK.


 Perform Named Entity Recognition (NER) using spaCy.
 Use PyNLPI to extract useful entities or phrases from text.

import nltk
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import pynlpi

# Download necessary NLTK resources


nltk.download('punkt')
nltk.download('stopwords')

# Load the spaCy model for English


nlp = spacy.load("en_core_web_sm")

# Sample text for processing


text = """
Apple is looking at buying U.K. startup for $1 billion.
The quick brown fox jumps over the lazy dog.
Barack Obama was the 44th president of the United States.
"""

# 1. Tokenizing text using NLTK


print("NLTK Tokenization:")
words = word_tokenize(text)
sentences = sent_tokenize(text)
print("Words:", words)
print("Sentences:", sentences)

# 2. Remove stopwords using NLTK


stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print("\nFiltered Words (after removing stopwords):", filtered_words)

# 3. Named Entity Recognition (NER) with spaCy


print("\nNamed Entity Recognition using spaCy:")
doc = nlp(text)
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

# 4. Using PyNLPI for phrase extraction

print("\nExtracting phrases using PyNLPI:")
# PyNLPI focuses on extracting and understanding linguistic patterns, like named
# entities or chunk extraction. For illustration, it is used here to process and
# extract basic chunks/phrases from the text.
# NOTE: the PyNLPI()/extract_phrases() interface below is illustrative pseudocode;
# the actual PyNLPl library (package "pynlpl") does not expose this exact API.
nlpi = pynlpi.PyNLPI()
nlpi.load_text(text)
phrases = nlpi.extract_phrases()

print("Extracted Phrases:")
for phrase in phrases:
    print(phrase)

Explanation of Code:

1. NLTK:
o Tokenization: We use word_tokenize() and sent_tokenize() from NLTK to
tokenize the text into words and sentences.
o Stopword Removal: We filter out common words (e.g., "the", "is", etc.) using the
stopwords corpus from NLTK.
2. spaCy:
o We use spaCy to perform Named Entity Recognition (NER), where spaCy
identifies named entities such as persons, organizations, dates, and locations. In
this example, it identifies "Apple", "U.K.", and "Barack Obama" as named
entities.
3. PyNLPI:
o PyNLPI focuses on NLP tasks such as chunking and phrase extraction. It can be
used for deeper linguistic analysis, like extracting noun phrases or more advanced
information from a text. Here, we extract and print the basic chunks or phrases
that PyNLPI identifies from the text.
OUTPUT

NLTK Tokenization:
Words: ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.', 'The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
Sentences: ['Apple is looking at buying U.K. startup for $1 billion.', 'The quick brown fox jumps over the lazy dog.', 'Barack Obama was the 44th president of the United States.']

Filtered Words (after removing stopwords): ['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1',
'billion', '.', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.', 'Barack', 'Obama', '44th', 'president',
'United', 'States', '.']

Named Entity Recognition using spaCy:

Entity: Apple, Label: ORG

Entity: U.K., Label: GPE

Entity: $1 billion, Label: MONEY

Entity: Barack Obama, Label: PERSON

Entity: 44th, Label: ORDINAL

Entity: United States, Label: GPE

Extracting phrases using PyNLPI:

Extracted Phrases:

Apple

U.K.

startup

$1 billion

quick brown fox

lazy dog

Barack Obama

44th president

United States
