De Lab Manual New

The document provides a comprehensive guide on creating and manipulating NumPy arrays and performing operations using pandas. It covers topics such as creating different types of arrays, reshaping, indexing, and performing mathematical operations, as well as data manipulation techniques in pandas like concatenation, filtering, and reading various file formats. Additionally, it includes examples of web scraping using Python with libraries like requests and BeautifulSoup.

1. Creating a NumPy Array

a. Basic ndarray

b. Array of zeros

c. Array of ones

d. Random numbers in ndarray

e. An array of your choice

f. Identity matrix (I matrix) in NumPy

g. Evenly spaced ndarray

Here’s how you can create different types of NumPy arrays in Python using the numpy library:

import numpy as np

# a. Basic ndarray

basic_array = np.array([1, 2, 3, 4, 5])

print("a. Basic ndarray:\n", basic_array)

# b. Array of zeros

zeros_array = np.zeros((3, 4)) # 3 rows, 4 columns

print("\nb. Array of zeros:\n", zeros_array)

# c. Array of ones

ones_array = np.ones((2, 5)) # 2 rows, 5 columns

print("\nc. Array of ones:\n", ones_array)

# d. Random numbers in ndarray

random_array = np.random.rand(3, 3) # 3x3 array with random floats between 0 and 1

print("\nd. Random numbers in ndarray:\n", random_array)

# e. An array of your choice

custom_array = np.array([[10, 20], [30, 40]])

print("\ne. An array of your choice:\n", custom_array)

# f. Identity matrix in NumPy (often called "Imatrix")

identity_matrix = np.eye(4) # 4x4 identity matrix


print("\nf. Identity matrix (Imatrix) in NumPy:\n", identity_matrix)

# g. Evenly spaced ndarray

evenly_spaced_array = np.linspace(0, 10, 6) # 6 values evenly spaced from 0 to 10

print("\ng. Evenly spaced ndarray:\n", evenly_spaced_array)

2. The Shape and Reshaping of NumPy Array

a. Dimensions of NumPy array

b. Shape of NumPy array

c. Size of NumPy array

d. Reshaping a NumPy array

e. Flattening a NumPy array

f. Transpose of a NumPy array

2. The Shape and Reshaping of NumPy Array

a. Dimensions of NumPy Array

The number of axes (also called rank) of an array.

Use .ndim to get the number of dimensions.

Example:

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim) # Output: 2 (2D array)

b. Shape of NumPy Array

A tuple showing the size of the array along each dimension.

Use .shape to get or set the shape.

Example:

print(arr.shape) # Output: (2, 3)

c. Size of NumPy Array

The total number of elements in the array.


Use .size to get it.

Example:

print(arr.size) # Output: 6 (2*3)

d. Reshaping a NumPy Array

Changing the shape without changing the data.

Use .reshape(new_shape) method.

Example:

reshaped = arr.reshape(3, 2)

print(reshaped)

# Output:

# [[1 2]

# [3 4]

# [5 6]]

e. Flattening a NumPy Array

Converts a multi-dimensional array into a 1D array.

Use .flatten() or .ravel().

Example:

flat = arr.flatten()

print(flat) # Output: [1 2 3 4 5 6]

f. Transpose of a NumPy Array

Swaps the axes of the array (rows become columns and vice versa).

Use .T or .transpose().

Example:

transposed = arr.T

print(transposed)

# Output:

# [[1 4]
# [2 5]

# [3 6]]

3. Expanding and Squeezing a NumPy Array

a. Expanding a NumPy array

b. Squeezing a NumPy array

c. Sorting in NumPy Arrays

3a. Expanding a NumPy Array

Expanding an array typically means adding dimensions, often to make it compatible with operations
like broadcasting.

Using np.expand_dims()


import numpy as np

arr = np.array([1, 2, 3]) # Shape: (3,)

expanded = np.expand_dims(arr, axis=0) # Shape: (1, 3)

Using None or np.newaxis

expanded = arr[np.newaxis, :] # Shape: (1, 3)

expanded = arr[:, np.newaxis] # Shape: (3, 1)

3b. Squeezing a NumPy Array

Squeezing means removing dimensions of size 1.

Using np.squeeze()

arr = np.array([[[1], [2], [3]]]) # Shape: (1, 3, 1)

squeezed = np.squeeze(arr) # Shape: (3,)

You can also specify an axis:

squeezed = np.squeeze(arr, axis=0) # Only squeeze axis 0

3c. Sorting in NumPy Arrays

You can sort an array either in-place or return a sorted copy.


Using np.sort() (returns a sorted copy)

arr = np.array([3, 1, 2])

sorted_arr = np.sort(arr) # [1, 2, 3]

For 2D arrays:

arr2d = np.array([[3, 1], [2, 4]])

np.sort(arr2d, axis=0) # Sorts each column

np.sort(arr2d, axis=1) # Sorts each row

Using .sort() method (in-place)

arr.sort()

Getting sorted indices with np.argsort()

arr = np.array([3, 1, 2])

indices = np.argsort(arr) # [1, 2, 0]
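The indices returned by np.argsort() can be used to reorder the original array (or to sort one array by the values of another):

print(arr[indices]) # [1 2 3] -- same result as np.sort(arr)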

4. Indexing and Slicing of NumPy Array

a. Slicing 1-D NumPy arrays

b. Slicing 2-D NumPy arrays

c. Slicing 3-D NumPy arrays

d. Negative slicing of NumPy arrays

4. Indexing and Slicing of NumPy Array

a. Slicing 1-D NumPy Arrays

A 1D NumPy array is similar to a list.

import numpy as np

arr = np.array([10, 20, 30, 40, 50])

print(arr[1:4]) # Output: [20 30 40]

print(arr[:3]) # Output: [10 20 30]

print(arr[2:]) # Output: [30 40 50]


b. Slicing 2-D NumPy Arrays

2D arrays require slicing both rows and columns: array[row_start:row_end, col_start:col_end]

arr2d = np.array([[1, 2, 3],

[4, 5, 6],

[7, 8, 9]])

print(arr2d[0:2, 1:3]) # Output: [[2 3], [5 6]]

print(arr2d[:, 0]) # Output: [1 4 7] (all rows, first column)

print(arr2d[1, :]) # Output: [4 5 6] (second row, all columns)

c. Slicing 3-D NumPy Arrays

3D arrays are sliced using three indices: array[depth, row, column]

arr3d = np.array([[[1, 2], [3, 4]],

[[5, 6], [7, 8]]])

print(arr3d[0, :, :]) # Output: [[1 2], [3 4]] (1st matrix)

print(arr3d[:, 1, :]) # Output: [[3 4], [7 8]] (2nd row from each matrix)

d. Negative Slicing of NumPy Arrays

Negative slicing allows reverse indexing.

arr = np.array([10, 20, 30, 40, 50])

print(arr[-3:]) # Output: [30 40 50]

print(arr[::-1]) # Output: [50 40 30 20 10] (reverse array)

2D example:

arr2d = np.array([[1, 2, 3],

[4, 5, 6],

[7, 8, 9]])

print(arr2d[::-1, ::-1])

# Output:

# [[9 8 7]

# [6 5 4]
# [3 2 1]]

5. Stacking and Concatenating Numpy Arrays

a. Stacking ndarrays

b. Concatenating ndarrays

c. Broadcasting in Numpy Arrays

5. Stacking and Concatenating Numpy Arrays

a. Stacking ndarrays

Stacking refers to joining arrays along a new axis. There are a few key functions for stacking in
NumPy:

np.stack()

Combines arrays along a new axis.

import numpy as np

a = np.array([1, 2, 3])

b = np.array([4, 5, 6])

stacked = np.stack((a, b)) # default is axis=0

print(stacked)

# Output:

# [[1 2 3]

# [4 5 6]]

np.vstack()

Stacks arrays vertically (row-wise).

np.vstack((a, b))

# Output:

# [[1 2 3]

# [4 5 6]]

np.hstack()

Stacks arrays horizontally (column-wise).


np.hstack((a, b))

# Output:

# [1 2 3 4 5 6]

np.dstack()

Stacks arrays depth-wise (3rd dimension).

np.dstack((a, b))

# Output:

# [[[1 4]

# [2 5]

# [3 6]]]

b. Concatenating ndarrays

Concatenation joins arrays along an existing axis (unlike stacking which creates a new one).

np.concatenate()

a = np.array([[1, 2], [3, 4]])

b = np.array([[5, 6]])

# Concatenate along axis 0 (row-wise)

np.concatenate((a, b), axis=0)

# Output:

# [[1 2]

# [3 4]

# [5 6]]

You can also concatenate along other axes if dimensions match.
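For example, concatenating along axis 1 (column-wise) works when the arrays have the same number of rows:

a = np.array([[1, 2], [3, 4]])

c = np.array([[7], [8]])

np.concatenate((a, c), axis=1)

# Output:

# [[1 2 7]

# [3 4 8]]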

c. Broadcasting in NumPy Arrays

Broadcasting allows NumPy to perform operations on arrays of different shapes in a memory-efficient way.

Rules of Broadcasting:

If the arrays have a different number of dimensions, the shape of the smaller one is padded with 1s on the left.
Along each dimension, the sizes must either be equal, or one of them must be 1.

Example 1: Adding a scalar to an array

a = np.array([1, 2, 3])

b = 10

a + b # b is "broadcast" to [10, 10, 10]

# Output: [11 12 13]

Example 2: Adding a column vector to a matrix

a = np.array([[1, 2, 3],

[4, 5, 6]])

b = np.array([[10], [20]])

# b is broadcast to match a's shape

a + b

# Output:

# [[11 12 13]

# [24 25 26]]

6. Perform following operations using pandas

a. Creating dataframe

b. concat()

c. Setting conditions

d. Adding a new column

Step 1: Import Pandas

import pandas as pd

a. Creating a DataFrame

Let's create two sample DataFrames:

data1 = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}

data2 = {'Name': ['Charlie', 'David'], 'Age': [35, 40]}


df1 = pd.DataFrame(data1)

df2 = pd.DataFrame(data2)

print(df1)

print(df2)

b. Using concat() to Combine DataFrames

Concatenate df1 and df2 vertically:

df_combined = pd.concat([df1, df2], ignore_index=True)

print(df_combined)

c. Setting Conditions

Filter rows where age is greater than 30:

age_above_30 = df_combined[df_combined['Age'] > 30]

print(age_above_30)

d. Adding a New Column

Add a new column called "Is_Adult" based on age:

df_combined['Is_Adult'] = df_combined['Age'] >= 18

print(df_combined)

7. Perform following operations using pandas

a. Filling NaN with string

b. Sorting based on column values

c. groupby()

a. Filling NaN with a String

Use the .fillna() method to replace missing (NaN) values with a string.

import pandas as pd

# Sample DataFrame

data = {

'Name': ['Alice', 'Bob', None, 'David'],

'Age': [25, None, 30, 22]

}
df = pd.DataFrame(data)

# Fill NaN with a string

df_filled = df.fillna('Unknown')

print(df_filled)

b. Sorting Based on Column Values

Use the .sort_values() method to sort by a specific column.

# Sort DataFrame by 'Age' column

df_sorted = df.sort_values(by='Age')

print(df_sorted)

c. Using groupby()

Group the DataFrame by a column and apply aggregate functions like .sum(), .mean(), etc.

# Example DataFrame

data = {

'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],

'Salary': [40000, 60000, 45000, 62000, 50000]

}

df2 = pd.DataFrame(data)

# Group by 'Department' and get average salary

grouped = df2.groupby('Department')['Salary'].mean()

print(grouped)

8. Read the following file formats using pandas

a. Text files

b. CSV files

c. Excel files

d. JSON files

a. Text Files

If the text file is delimited (e.g., tab, space), use read_csv() with a delimiter:
import pandas as pd

# Example: space-delimited text file

df_text = pd.read_csv('file.txt', delimiter=' ')
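If the text file is tab-delimited instead, pass the tab character as the separator (the file name here is only an example):

df_tab = pd.read_csv('file.txt', sep='\t')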

b. CSV Files

CSV (Comma-Separated Values) files are read with:

df_csv = pd.read_csv('file.csv')

c. Excel Files

You can read Excel files using:

df_excel = pd.read_excel('file.xlsx') # Default is the first sheet

# You can specify a sheet name:

# df_excel = pd.read_excel('file.xlsx', sheet_name='Sheet1')

d. JSON Files

JSON files are read with:

df_json = pd.read_json('file.json')

9. Read the following file formats

a. Pickle files

b. Image files using PIL

c. Multiple files using Glob

d. Importing data from database

a. Pickle Files

Python's pickle module is used to serialize and deserialize Python objects.

import pickle

# Reading a pickle file

with open('data.pkl', 'rb') as file:
    data = pickle.load(file)

print(data)
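Since pickle also handles serialization, writing an object to a pickle file follows the same pattern (the file name is just an example):

# Writing (serializing) a Python object to a pickle file
with open('data_out.pkl', 'wb') as file:
    pickle.dump(data, file)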
b. Image Files using PIL (Pillow)

PIL (now maintained as Pillow) is used for opening, manipulating, and saving image files.

from PIL import Image

# Open an image file

img = Image.open('image.jpg')

img.show() # To display the image
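Pillow can also manipulate and save images, as mentioned above; a minimal sketch (the output file name is just an example):

# Resize the image and save the result under a new name
resized = img.resize((200, 200))
resized.save('image_resized.jpg')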

c. Multiple Files using Glob

The glob module finds all the pathnames matching a specified pattern.

import glob

# Get all .txt files in a directory

file_list = glob.glob('path/to/directory/*.txt')

# Read them

for filename in file_list:
    with open(filename, 'r') as file:
        content = file.read()
        print(content)

d. Importing Data from a Database

Using sqlite3 (for SQLite) or other database connectors like psycopg2 (PostgreSQL), pyodbc (SQL
Server), or mysql.connector.

Example with SQLite:

import sqlite3

# Connect to database

conn = sqlite3.connect('example.db')

cursor = conn.cursor()

# Execute a query

cursor.execute('SELECT * FROM users')

# Fetch data

rows = cursor.fetchall()
for row in rows:
    print(row)

# Close connection

conn.close()
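Since the rest of this manual works with pandas, the same query result can also be loaded directly into a DataFrame using pandas.read_sql_query(); a minimal sketch assuming the same example.db and users table as above:

import sqlite3

import pandas as pd

# Read the query result straight into a DataFrame
conn = sqlite3.connect('example.db')

df_db = pd.read_sql_query('SELECT * FROM users', conn)

conn.close()

print(df_db.head())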

10. Demonstrate web scraping using python

Step-by-step Web Scraping Example in Python

1. Install Required Libraries

pip install requests beautifulsoup4

2. Python Script: Scrape Quotes

import requests

from bs4 import BeautifulSoup

# Target URL

url = 'http://quotes.toscrape.com/'

# Send GET request

response = requests.get(url)

# Parse HTML content using BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

# Find all quote containers

quotes = soup.find_all('div', class_='quote')

# Loop through and extract quote text and author

for quote in quotes:
    text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    print(f'"{text}" - {author}')

Sample Output:

“The world as we have created it is a process of our thinking. It cannot be changed without changing
our thinking.” - Albert Einstein

“It is our choices, Harry, that show what we truly are, far more than our abilities.” - J.K. Rowling
11. Perform following preprocessing techniques on loan prediction dataset

a. Feature Scaling

b. Feature Standardization

c. Label Encoding

d. One Hot Encoding

Let's go through each preprocessing technique using Python and the pandas, scikit-learn, and numpy libraries. These techniques are often applied to prepare data for machine learning models.

Here’s how we can perform each technique on the "loan prediction dataset":

1. Feature Scaling: This involves scaling features so they are within a similar range.
Typically, this is done using MinMaxScaler or StandardScaler.
2. Feature Standardization: Standardization scales the features so that they have a
mean of 0 and a standard deviation of 1, using StandardScaler.
3. Label Encoding: For categorical labels (target variable), we encode them as numeric
labels using LabelEncoder.
4. One-Hot Encoding: For categorical features, we create dummy variables (binary
columns) for each unique value in the categorical feature using OneHotEncoder or
pd.get_dummies().

Code Example:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler,
LabelEncoder
from sklearn.model_selection import train_test_split

# Load the dataset (assuming you have it in a .csv file)


# df = pd.read_csv('loan_prediction.csv')

# For demonstration, let's assume we have a small dataset like below:


data = {
'LoanAmount': [200, 150, 300, 250, 500],
'ApplicantIncome': [5000, 4000, 6000, 7000, 8000],
'Credit_History': ['Good', 'Bad', 'Good', 'Good', 'Bad'],
'Loan_Status': ['Y', 'N', 'Y', 'Y', 'N']
}

df = pd.DataFrame(data)

# ----------------------
# a. Feature Scaling (Min-Max Scaling)
# ----------------------

scaler = MinMaxScaler()

# Scale features - we will scale numerical columns


df[['LoanAmount', 'ApplicantIncome']] = scaler.fit_transform(df[['LoanAmount', 'ApplicantIncome']])

# ----------------------
# b. Feature Standardization
# ----------------------
standard_scaler = StandardScaler()

# Standardize features - we standardize the same numerical columns


df[['LoanAmount', 'ApplicantIncome']] = standard_scaler.fit_transform(df[['LoanAmount', 'ApplicantIncome']])

# ----------------------
# c. Label Encoding
# ----------------------

label_encoder = LabelEncoder()

# Apply Label Encoding on the target variable (Loan_Status)


df['Loan_Status'] = label_encoder.fit_transform(df['Loan_Status'])

# ----------------------
# d. One Hot Encoding
# ----------------------

# Apply One Hot Encoding on the 'Credit_History' categorical feature


df = pd.get_dummies(df, columns=['Credit_History'], drop_first=True)

# ----------------------
# Final DataFrame
# ----------------------

print(df)

Explanation:

1. Feature Scaling (Min-Max Scaling):


o The MinMaxScaler() scales the values of LoanAmount and ApplicantIncome
to a range of 0 to 1.
o The formula is:

X_{\text{scaled}} = \frac{X - \min(X)}{\max(X) - \min(X)}

2. Feature Standardization:
o The StandardScaler() standardizes the features by transforming them to
have a mean of 0 and a standard deviation of 1.
o The formula is:

X_{\text{standardized}} = \frac{X - \mu}{\sigma}

where μ is the mean and σ is the standard deviation of the feature.

3. Label Encoding:
o The target variable Loan_Status (which is categorical: 'Y' for Yes and 'N' for
No) is encoded into 1 for 'Y' and 0 for 'N'.
4. One-Hot Encoding:
o The categorical column Credit_History is converted into dummy variables; with
drop_first=True the first category (Credit_History_Bad) is dropped to avoid
multicollinearity, leaving a single binary column Credit_History_Good.

Output:
Because standardization is applied after Min-Max scaling on the same columns, the final numeric values reflect the standardized data (mean 0, standard deviation 1). With the small example dataset above, the result is (recent pandas versions show the Credit_History_Good column as True/False instead of 1/0):

   LoanAmount  ApplicantIncome  Loan_Status  Credit_History_Good
0   -0.662085        -0.707107            1                    1
1   -1.075888        -1.414214            0                    0
2    0.165521         0.000000            1                    1
3   -0.248282         0.707107            1                    1
4    1.820734         1.414214            0                    0
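As noted earlier, scikit-learn's OneHotEncoder is an alternative to pd.get_dummies(); a minimal sketch on the same Credit_History values (the sparse_output parameter assumes scikit-learn 1.2 or newer; older versions use sparse=False instead):

from sklearn.preprocessing import OneHotEncoder

# Rebuild a small frame with the original Credit_History values from the example
credit = pd.DataFrame({'Credit_History': ['Good', 'Bad', 'Good', 'Good', 'Bad']})

encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded = encoder.fit_transform(credit)

print(encoder.get_feature_names_out()) # ['Credit_History_Good']
print(encoded) # one column of 1.0 / 0.0 values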

12. Perform following visualizations using matplotlib

a. Bar Graph

b. Pie Chart

c. Box Plot

d. Histogram

e. Line Chart and Subplots

f. Scatter Plot

To perform the following visualizations using matplotlib, each plot is shown below with a code example. You can run the code in any Python environment where matplotlib is installed; install it first if needed:

pip install matplotlib

Now, let's proceed with the visualizations:

a. Bar Graph

A bar graph is used to represent categorical data with rectangular bars.

import matplotlib.pyplot as plt

# Data for bar graph


categories = ['A', 'B', 'C', 'D']
values = [10, 20, 15, 30]

plt.bar(categories, values)
plt.title('Bar Graph Example')
plt.xlabel('Category')
plt.ylabel('Value')
plt.show()

b. Pie Chart

A pie chart is used to represent proportions of a whole.

# Data for pie chart
labels = ['Apples', 'Bananas', 'Cherries', 'Grapes']
sizes = [40, 30, 20, 10]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title('Pie Chart Example')
plt.show()

c. Box Plot

A box plot is used to visualize the distribution and outliers in a dataset.

import numpy as np

# Data for box plot


data = np.random.rand(10, 5) * 100 # Random data for illustration

plt.boxplot(data)
plt.title('Box Plot Example')
plt.ylabel('Value')
plt.show()

d. Histogram

A histogram is used to represent the frequency distribution of a dataset.

# Data for histogram
data = np.random.randn(1000) # Random normal data

plt.hist(data, bins=30, edgecolor='black')


plt.title('Histogram Example')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

e. Line Chart and Subplots

A line chart is used to represent data over a continuous range. Subplots are useful to display
multiple charts together.

# Data for line chart
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Creating subplots
fig, axes = plt.subplots(2, 1, figsize=(8, 6))

# First subplot (sine wave)


axes[0].plot(x, y1, label='sin(x)', color='blue')
axes[0].set_title('Sine Wave')
axes[0].set_xlabel('x')
axes[0].set_ylabel('sin(x)')
axes[0].legend()
# Second subplot (cosine wave)
axes[1].plot(x, y2, label='cos(x)', color='red')
axes[1].set_title('Cosine Wave')
axes[1].set_xlabel('x')
axes[1].set_ylabel('cos(x)')
axes[1].legend()

plt.tight_layout() # To prevent overlap


plt.show()

f. Scatter Plot

A scatter plot is used to represent the relationship between two continuous variables.

# Data for scatter plot
x = np.random.rand(100)
y = np.random.rand(100)

plt.scatter(x, y, color='green')
plt.title('Scatter Plot Example')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

Summary of Plots:

1. Bar Graph: Categorical data comparison.


2. Pie Chart: Proportions of a whole.
3. Box Plot: Distribution of data with quartiles.
4. Histogram: Frequency distribution of a dataset.
5. Line Chart and Subplots: Continuous data visualization with multiple subplots.
6. Scatter Plot: Relationship between two continuous variables.

13. Getting started with NLTK, install NLTK using PIP

To get started with the Natural Language Toolkit (NLTK), you first need to install it. You can
do this using pip, which is the Python package manager. Here's how you can install NLTK:

1. Open your command-line interface (Terminal, Command Prompt, or PowerShell).


2. Run the following command to install NLTK:

pip install nltk

After the installation is complete, you can start using NLTK in your Python scripts.

Example: Importing NLTK and Downloading Resources

Once NLTK is installed, you can import it and start using it. Here's a simple example of
importing NLTK and downloading some resources that you'll need for common NLP tasks:
import nltk
nltk.download('punkt') # Downloads the tokenizer models
nltk.download('stopwords') # Downloads a list of common stop words
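A quick check that the downloaded resources work; a minimal sketch using a made-up sentence:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "NLTK makes it easy to work with text in Python."

tokens = word_tokenize(text) # uses the 'punkt' tokenizer models
filtered = [w for w in tokens if w.lower() not in stopwords.words('english')] # uses 'stopwords'

print("Tokens:", tokens)
print("Without stop words:", filtered)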

14. Python program to implement with Python Scikit-Learn & NLTK

Below is an example of how to use Python's Scikit-Learn (sklearn) and NLTK (Natural Language Toolkit) together to implement a simple text classification model.

We will use NLTK for text preprocessing and Sci-Kit Learn to train a machine learning
model (such as Naive Bayes classifier) to classify text.

In this example, we’ll classify movie reviews as positive or negative.

Steps:

1. Load dataset (movie reviews dataset in this case).


2. Preprocess text using NLTK.
3. Train a Naive Bayes classifier using Sci-Kit Learn.
4. Evaluate the model.

Full Python Code:


import nltk
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Download necessary NLTK data


nltk.download('movie_reviews')
nltk.download('stopwords')

# Step 1: Prepare the dataset (movie reviews)


# Load movie reviews from NLTK
reviews = [(movie_reviews.raw(fileid), category) for fileid in movie_reviews.fileids() for category in movie_reviews.categories(fileid)]

# Step 2: Preprocessing
# Split data into features (text) and labels (categories)
texts, labels = zip(*reviews)

# Step 3: Text Vectorization


# We will use CountVectorizer to convert text data to a bag-of-words model
vectorizer = CountVectorizer(stop_words=nltk.corpus.stopwords.words('english'))
X = vectorizer.fit_transform(texts)

# Step 4: Train-test split


X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

# Step 5: Train Naive Bayes classifier


classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Step 6: Make Predictions


y_pred = classifier.predict(X_test)

# Step 7: Evaluate the model


accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy*100:.2f}%")

# Step 8: Show Classification Report


print(metrics.classification_report(y_test, y_pred))

# Example: Predict new review


new_review = ["This movie was amazing, I loved it!"]
new_review_vectorized = vectorizer.transform(new_review)
prediction = classifier.predict(new_review_vectorized)
print(f"Prediction for new review: {prediction[0]}")

Explanation:

1. Dataset (Movie Reviews):


o NLTK provides a collection of movie reviews categorized as 'pos' (positive)
and 'neg' (negative). These reviews are used as the dataset.
2. Preprocessing:
o The raw text is converted to features by CountVectorizer, which tokenizes it internally;
NLTK's stopwords list (common words like 'the', 'and', etc.) is passed to it so those words are removed.
3. Text Vectorization:
o We use CountVectorizer from sklearn to convert the raw text into a
numerical feature set (bag of words model).
4. Train-Test Split:
o We split the dataset into a training set and a testing set (75% for training, 25%
for testing).
5. Training the Model:
o We use MultinomialNB (Naive Bayes classifier) to train the model on the text
data.
6. Prediction and Evaluation:
o We make predictions using the trained model and evaluate its performance
using metrics such as accuracy and classification report (precision, recall, and
F1-score).
7. Prediction for New Reviews:
o Finally, we show an example of how to predict the sentiment of a new movie
review.

Libraries used:

 NLTK (Natural Language Toolkit): For text preprocessing, tokenization, and stopwords.
 Scikit-Learn: For machine learning, vectorization, and model evaluation.

Output Example:
Accuracy: 79.25%
precision recall f1-score support

neg 0.79 0.79 0.79 247


pos 0.79 0.79 0.79 253

accuracy 0.79 500


macro avg 0.79 0.79 0.79 500
weighted avg 0.79 0.79 0.79 500

Prediction for new review: pos

15. Python program to implement with Python NLTK / spaCy / PyNLPl.

To implement a Natural Language Processing (NLP) program in Python, we can use libraries
like NLTK (Natural Language Toolkit), spaCy, or PyNLPl. These libraries provide a wide
range of functions to process text, including tokenization, named entity recognition (NER),
part-of-speech (POS) tagging, etc.

For this example, I'll walk you through a simple Python NLP program using NLTK and
spaCy.

1. Using NLTK:

NLTK is one of the most widely used Python libraries for text processing and analysis.
Below is an example Python program using NLTK to perform basic text processing tasks,
such as tokenization and POS tagging.

First, install NLTK:

pip install nltk

Then, here's a simple program that uses NLTK to process text:

import nltk

from nltk.tokenize import word_tokenize, sent_tokenize


from nltk import pos_tag

# Download necessary NLTK resources (only need to run once)


nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text
text = "Hello! My name is John. I work at OpenAI, and I love programming in
Python."

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)

# Word Tokenization
words = word_tokenize(text)
print("Words:", words)
# Part-of-Speech Tagging
pos_tags = pos_tag(words)
print("POS Tags:", pos_tags)

Explanation of the Code:

 sent_tokenize(): Splits the text into sentences.


 word_tokenize(): Splits the text into words.
 pos_tag(): Assigns part-of-speech tags to each word (e.g., noun, verb).

2. Using spaCy:

spaCy is another powerful NLP library that is more efficient and suited for production
environments. It provides various pre-trained models for tasks like tokenization, part-of-
speech tagging, named entity recognition, and more.

First, install spaCy:

pip install spacy

Next, download a pre-trained model for English:

python -m spacy download en_core_web_sm

Now, here's a simple program using spaCy for NLP tasks:

import spacy

# Load the spaCy model


nlp = spacy.load('en_core_web_sm')

# Sample text
text = "Hello! My name is John. I work at OpenAI, and I love programming in
Python."

# Process the text


doc = nlp(text)

# Tokenization
print("Tokens:")
for token in doc:
    print(token.text)

# Part-of-Speech Tagging
print("\nPOS Tags:")
for token in doc:
    print(f"{token.text}: {token.pos_}")

# Named Entity Recognition


print("\nNamed Entities:")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
Explanation of the Code:

 The text is processed using nlp(text).


 Tokenization is done automatically when we process the text using spaCy.
 Part-of-speech tagging is done via token.pos_.
 Named entity recognition is done via doc.ents which detects entities like people,
organizations, dates, etc.

3. Using PyNLPl:

PyNLPl is a lesser-known NLP library, primarily designed for linguistic processing tasks. If
you'd like to use PyNLPl, you can install it via:

pip install pynlpl

However, PyNLPl is not as popular as NLTK and spaCy, so many people prefer NLTK or
spaCy for most NLP tasks. Here's a sample program using PyNLPl for tokenization:

# Note: this sketch assumes PyNLPl's textprocessors module; the exact
# tokenization API may differ between PyNLPl versions.
from pynlpl.textprocessors import tokenize

# Sample text
text = "Hello! My name is John. I work at OpenAI, and I love programming in Python."

# Tokenize the text
tokens = tokenize(text)
print("Tokens:", tokens)

Conclusion:

 NLTK is great for educational purposes and prototyping.


 spaCy is faster and more suitable for production-grade NLP applications.
 PyNLPl can be useful for certain linguistic processing tasks but is not as widely used
as NLTK or spaCy.
