B. Tech II Semester
INTRODUCTION TO DATA SCIENCE USING PYTHON
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
VISION
To be a Model in Quality Education for producing highly talented and globally recognizable
students with sound ethics, latest knowledge, and innovative ideas in Computer Science &
Engineering.
MISSION
M1: Imparting a sound theoretical basis and wide-ranging practical experience to the
students for fulfilling the upcoming needs of society in the various fields of Computer
Science & Engineering.
M2: Offering the students an overall background suitable for making a successful career in
industry, research, or higher education in India and abroad.
M3: Providing opportunities to the students for learning beyond the curriculum and
improving communication skills.
Course Outcomes (COs) vs Programme Outcomes (POs) and Programme Specific Outcomes (PSOs)
     PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
CO1 2 2 1 2 2
CO2 2 2 1 2 1 2 2
CO3 2 1 2 2 1 2 2
CO4 2 2 2 1 2 1 1 2
CO5 2 2 2 1 2 1 2 1
Guidelines for the Students:
1. Students should be regular and come prepared for the lab practice.
2. In case a student misses a class, it is his/her responsibility to complete
the missed experiment(s).
3. Students should bring the observation book, lab journal and
lab manual. Prescribed textbook and class notes can be kept ready
for reference if required.
4. They should implement the given program individually.
5. While conducting the experiments, students should ensure that their
programs meet the following criteria:
Programs should be interactive with appropriate prompt
messages, error messages if any, and descriptive messages
for outputs.
Programs should perform input validation (data type, range
error, etc.) and give appropriate error messages and suggest
corrective actions.
Comments should be used to give the statement of the problem, and
every function should indicate its purpose, inputs, and outputs.
Statements within the program should be properly indented
Use meaningful names for variables and functions.
Make use of Constants and type definitions wherever needed.
6. Once the experiment(s) are executed, students should show the program
and results to the instructors and copy the same in their observation
book.
7. Questions for lab tests and exams need not be limited to the
questions in the manual; they could involve variations and/or
combinations of the questions.
List of Experiments
1. Creating a NumPy Array
a. Basic ndarray
b. Array of zeros
c. Array of ones
d. Random numbers in ndarray
e. An array of your choice
f. Identity matrix in NumPy
g. Evenly spaced ndarray
2. The Shape and Reshaping of NumPy Array
a. Dimensions of NumPy array
b. Shape of NumPy array
c. Size of NumPy array
d. Reshaping a NumPy array
e. Flattening a NumPy array
f. Transpose of a NumPy array
3. Expanding and Squeezing a NumPy Array
a. Expanding a NumPy array
b. Squeezing a NumPy array
c. Sorting in NumPy Arrays
4. Indexing and Slicing of NumPy Array
a. Slicing 1-D NumPy arrays
b. Slicing 2-D NumPy arrays
c. Slicing 3-D NumPy arrays
d. Negative slicing of NumPy arrays
5. Stacking and Concatenating Numpy Arrays
a. Stacking ndarrays
b. Concatenating ndarrays
c. Broadcasting in NumPy arrays
6. Perform the following operations using pandas
a. Creating data frame
b. concat()
c. Setting conditions
d. Adding a new column
7. Perform the following operations using pandas
a. Filling NaN with string
b. Sorting based on column values
c. groupby()
8. Read the following file formats using pandas
a. Text files
b. CSV files
c. Excel files
d. JSON files
9. Read the following file formats
a. Pickle files
b. Image files using PIL
c. Multiple files using Glob
d. Importing data from database
10. Demonstrate web scraping using Python
11. Perform the following preprocessing techniques on a loan prediction dataset
a. Feature Scaling
b. Feature Standardization
c. Label Encoding
d. One Hot Encoding
12. Perform following visualizations using matplotlib
a. Bar Graph
b. Pie Chart
c. Box Plot
d. Histogram
e. Line Chart and Subplots
f. Scatter Plot
13. Getting started with NLTK; install NLTK using pip
14. Python program to implement with Python scikit-learn & NLTK
15. Python program to implement with Python NLTK/spaCy/PyNLPI.
Web References:
1. https://www.analyticsvidhya.com/blog/2020/04/the-ultimate-numpy-
tutorial-for-data-science-beginners/
2. https://www.analyticsvidhya.com/blog/2021/07/data-science-with-pandas-
2-minutes-guide-to-key-concepts/
3. https://www.analyticsvidhya.com/blog/2020/04/how-to-read-common-
file-formats-python/
4. https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-
preprocessing-python-scikit-learn/
5. https://www.analyticsvidhya.com/blog/2020/02/beginner-guide-
matplotlib-data-visualization-exploration-python/
6. https://www.nltk.org/book/ch01.html
Experiment-1
1. Creating a NumPy Array
a. Basic ndarray
b. Array of zeros
c. Array of ones
d. Random numbers in ndarray
e. An array of your choice
f. Identity matrix in NumPy
g. Evenly spaced ndarray
a.Basic ndarray:
Create a NumPy array using a Python list or tuple.
import numpy as np
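# Create a basic ndarray from a Python list (values match the output below)
arr = np.array([1, 2, 3, 4, 5])
print("Basic ndarray:")
print(arr)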
OUTPUT
Basic ndarray:
[1 2 3 4 5]
b.Array of Zeros
import numpy as np
# Array of zeros
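# A 3x4 array of zeros (shape matches the output below)
zeros_arr = np.zeros((3, 4))
print("Array of zeros:")
print(zeros_arr)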
OUTPUT
Array of zeros:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
c. Array of Ones
# Array of ones
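import numpy as np

# A 2x5 array of ones (shape matches the output below)
ones_arr = np.ones((2, 5))
print("Array of ones:")
print(ones_arr)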
OUTPUT
Array of ones:
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
d. Random Numbers in ndarray
import numpy as np
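# A 2x3 array of random floats in [0, 1); the shape is an illustrative choice,
# and the values differ on every run, so no fixed output is shown
rand_arr = np.random.random((2, 3))
print("Random array:")
print(rand_arr)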
OUTPUT
e. An Array of Your Choice
import numpy as np
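# An array with custom values (matches the output below)
custom = np.array([[10, 20, 30], [40, 50, 60]])
print("Custom array:")
print(custom)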
OUTPUT:
Custom array:
[[10 20 30]
[40 50 60]]
f. Identity Matrix in NumPy
import numpy as np
# Identity matrix
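# A 3x3 identity matrix (np.identity is used here; np.eye works too)
identity = np.identity(3)
print("Identity matrix:")
print(identity)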
OUTPUT
Identity matrix:
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
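g. Evenly Spaced ndarray
Evenly spaced values can be generated with np.arange() (fixed step) or np.linspace() (fixed number of points); a minimal sketch, with illustrative ranges:
import numpy as np
print(np.arange(0, 10, 2))     # [0 2 4 6 8]
print(np.linspace(0, 1, 5))    # [0.   0.25 0.5  0.75 1.  ]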
Experiment-2
2. The Shape and Reshaping of a NumPy Array
a. Dimensions of a NumPy Array
The dimension of a NumPy array refers to the number of axes (or levels) the array has.
import numpy as np
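# Sample 2-D array (illustrative; any array works)
arr = np.array([[1, 2, 3], [4, 5, 6]])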
print(arr.ndim)
OUTPUT
2
b. Shape of a NumPy Array
The shape of a NumPy array is a tuple indicating the size along each dimension.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)
OUTPUT
(2, 3)
c. Size of NumPy Array
The size of a NumPy array is the total number of elements in the array. It’s equivalent to
multiplying all the dimensions together.
import numpy as np
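# Sample 2-D array (2 x 3 = 6 elements)
arr = np.array([[1, 2, 3], [4, 5, 6]])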
print(arr.size)
OUTPUT
6
d. Reshaping a NumPy Array
Reshaping means changing the shape of the array without altering its data.
1. The new shape must be compatible with the total number of elements.
2. Use -1 to infer one dimension automatically:
import numpy as np
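# Six elements reshaped into 2 rows; -1 lets NumPy infer the 3 columns
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped = arr.reshape(2, -1)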
print(reshaped)
OUTPUT
[[1 2 3]
[4 5 6]]
e. Flattening a NumPy Array
Methods to flatten:
o .flatten() (returns a copy):
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
flat = arr.flatten()
print(flat)
OUTPUT
[1 2 3 4 5 6]
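o .ravel() (returns a view where possible):
import numpy as np
arr = np.array([[1, 2], [3, 4]])
print(arr.ravel())   # [1 2 3 4]; unlike flatten(), this avoids a copy when it can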
f. Transpose of a NumPy Array
The transpose of an array swaps its rows and columns (for 2-D) or reverses the axes (for
higher dimensions). Use .T or np.transpose().
import numpy as np
# Example: Transpose
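arr_2d = np.array([[1, 2, 3], [4, 5, 6]])  # sample array (values match the output below)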
transposed = arr_2d.T
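print("Original array:")
print(arr_2d)
print("Transposed array:")
print(transposed)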
OUTPUT
Original array:
[[1 2 3]
[4 5 6]]
Transposed array:
[[1 4]
[2 5]
[3 6]]
Experiment-3
3. Expanding and Squeezing a NumPy Array
a. Expanding a NumPy Array
Expanding a NumPy array means adding new axes to its dimensions. This can be done using
functions like numpy.expand_dims() or slicing with numpy.newaxis.
Using numpy.expand_dims():
import numpy as np
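# A 1-D array becomes 2-D after inserting a new axis at position 0
arr = np.array([1, 2, 3])                # shape (3,)
expanded = np.expand_dims(arr, axis=0)   # shape (1, 3)
print(expanded.shape)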
OUTPUT
(1, 3)
b. Squeezing a NumPy Array
Squeezing a NumPy array removes axes with size 1. This is achieved using numpy.squeeze().
Using numpy.squeeze():
import numpy as np
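# A (1, 3) array loses its size-1 axis after squeezing
arr = np.array([[1, 2, 3]])      # shape (1, 3)
squeezed = np.squeeze(arr)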
print(squeezed.shape)
OUTPUT
(3,)
c. Sorting in NumPy Arrays
Sorting in NumPy can be done along any axis using the numpy.sort() function. It does not
modify the original array (it returns a sorted copy).
Basic Sorting:
import numpy as np
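arr = np.array([3, 1, 2])  # sample unsorted array (illustrative)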
sorted_arr = np.sort(arr)
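print(sorted_arr)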
OUTPUT
[1 2 3]
Sorting Along an Axis:
import numpy as np
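# Sort each column independently (axis=0); sample values are illustrative
arr = np.array([[6, 5, 4], [3, 1, 2]])
sorted_arr = np.sort(arr, axis=0)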
print(sorted_arr)
OUTPUT
[[3 1 2]
[6 5 4]]
In-place Sorting:
import numpy as np
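arr = np.array([3, 1, 2])  # sample array (illustrative)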
arr.sort()
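print(arr)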
OUTPUT
[1 2 3]
Advanced Sorting:
import numpy as np
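arr = np.array([3, 1, 2])  # sample array (illustrative)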
indices = np.argsort(arr)
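print(indices)  # positions that would sort arr: arr[1]=1, arr[2]=2, arr[0]=3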
OUTPUT
[1 2 0]
Experiment-4
4. Indexing and Slicing of NumPy Array
a. Slicing 1-D NumPy arrays
b. Slicing 2-D NumPy arrays
c. Slicing 3-D NumPy arrays
d. Negative slicing of NumPy arrays
a. Slicing 1-D NumPy Arrays
import numpy as np
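# Sample 1-D array; each slice below matches one output line
arr = np.array([10, 20, 30, 40, 50, 60])
print(arr[1:4])    # elements at indices 1-3
print(arr[::2])    # every second element
print(arr[2:])     # from index 2 to the end
print(arr[:3])     # first three elements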
OUTPUT
[20 30 40]
[10 30 50]
[30 40 50 60]
[10 20 30]
b. Slicing 2-D NumPy Arrays
For two-dimensional arrays, slicing can be done along both rows and columns.
import numpy as np
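# Sample 3x3 array (values match the outputs below)
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])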
print(arr[0:2, 1:3])
# Output:
# [[2 3]
# [5 6]]
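print(arr[:, 1])   # middle column -> [2 5 8]
print(arr[1, :])   # middle row -> [4 5 6]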
print(arr[::2, ::2])
# Output:
# [[1 3]
# [7 9]]
OUTPUT
[[2 3]
[5 6]]
[2 5 8]
[4 5 6]
[[1 3]
[7 9]]
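c. Slicing 3-D NumPy Arrays
For three-dimensional arrays, one index or slice is supplied per axis (block, row, column); a minimal sketch with illustrative values:
import numpy as np
arr3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(arr3[0, 1, :])   # second row of the first block -> [3 4]
print(arr3[:, 0, 0])   # first element of every block -> [1 5]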
d. Negative Slicing of NumPy Arrays
Negative slicing allows you to access elements from the end of an array.
import numpy as np
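# Sample arrays (illustrative; outputs below show arr first, then arr2)
arr = np.array([10, 20, 30, 40, 50])
arr2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr[-3:])    # last three elements
print(arr[:-1])    # all but the last element
print(arr[::-1])   # fully reversed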
print(arr2[-2:, -2:])
# Output:
# [[5 6]
# [8 9]]
# Reverse rows
print(arr2[::-1])
OUTPUT
[30 40 50]
[10 20 30 40]
[50 40 30 20 10]
[[5 6]
[8 9]]
[[7 8 9]
[4 5 6]
[1 2 3]]
Experiment-5
5. Stacking and Concatenating NumPy Arrays
Stacking and concatenating are methods to combine arrays in different ways. Stacking involves
combining along new axes, while concatenating merges along existing axes.
a. Stacking ndarrays
b. Concatenating ndarrays
c. Broadcasting in NumPy arrays
Vertical Stacking (np.vstack)
import numpy as np
# Create arrays
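a = np.array([1, 2, 3])   # sample arrays (names are illustrative)
b = np.array([4, 5, 6])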
# Vertical stack
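result = np.vstack((a, b))   # arrays stacked as rows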
print(result)
OUTPUT
[[1 2 3]
[4 5 6]]
Horizontal Stacking (np.hstack)
import numpy as np
# Create arrays
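# Sample arrays joined end-to-end (names are illustrative)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = np.hstack((a, b))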
print(result)
OUTPUT
[1 2 3 4 5 6]
Depth Stacking (np.dstack)
import numpy as np
# Create arrays
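a = np.array([1, 2, 3])   # sample arrays (names are illustrative)
b = np.array([4, 5, 6])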
# Depth stack
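result = np.dstack((a, b))   # shape (1, 3, 2): elements paired along a third axis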
print(result)
OUTPUT
[[[1 4]
[2 5]
[3 6]]]
Stacking with np.stack
import numpy as np
# Create arrays
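# Sample arrays; axis=1 pairs corresponding elements (matches the output below)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = np.stack((a, b), axis=1)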
print(result)
OUTPUT
[[1 4]
[2 5]
[3 6]]
b. Concatenating NumPy Arrays
Concatenation merges arrays along an existing axis. Unlike stacking, it does not add a new axis.
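A sketch consistent with the outputs below (array names are illustrative):
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
result = np.concatenate((a, b), axis=0)  # join along rows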
print(result)
# Output:
# [[1 2]
# [3 4]
# [5 6]]
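c = np.array([[7], [8]])
result = np.concatenate((a, c), axis=1)  # join along columns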
print(result)
# Output:
# [[1 2 7]
# [3 4 8]]
OUTPUT
[[1 2]
[3 4]
[5 6]]
[[1 2 7]
[3 4 8]]
c. Broadcasting in NumPy Arrays
1. If the dimensions of the two arrays are not the same, NumPy pads the shape of the smaller
array with 1s on the left.
2. If the sizes of the dimensions don’t match, the size of one must be 1, or the operation will fail.
3. The resulting shape is determined by taking the maximum of each dimension.
Examples of Broadcasting
import numpy as np
# Array addition
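a = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
b = np.array([1, 2, 3])                # shape (3,), stretched across both rows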
# Broadcasting occurs
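result = a + b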
print(result)
# Output:
# [[ 2 4 6]
# [ 5 7 9]]
OUTPUT
[[2 4 6]
[5 7 9]]
Broadcasting with Scalar Values
import numpy as np
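arr = np.array([[1, 2, 3], [4, 5, 6]])   # sample array (matches the output below)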
result = arr + 10
print(result)
OUTPUT
[[11 12 13]
[14 15 16]]
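A column vector and a row vector broadcast into a full matrix; a sketch consistent with the output below:
import numpy as np
col = np.array([[1], [2], [3]])    # shape (3, 1)
row = np.array([10, 20, 30])       # shape (3,)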
# Broadcasting occurs
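result = col + row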
print(result)
OUTPUT
[[11 21 31]
[12 22 32]
[13 23 33]]
Experiment-6
6. Perform the following operations using pandas
Pandas is a powerful Python library for data manipulation. Below are examples for each
operation:
a. Creating a DataFrame
A DataFrame is a two-dimensional, tabular data structure with labeled rows and columns.
import pandas as pd
data = {
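    # Values are partly reconstructed; Charlie's row matches the output shown,
    # the other rows are illustrative
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}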
df = pd.DataFrame(data)
print(df)
OUTPUT
2 Charlie 35 Chicago
b. Using concat()
The concat() function is used to concatenate multiple DataFrames along a specified axis (rows
or columns).
import pandas as pd
df1 = pd.DataFrame({
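    'ID': [1, 2],
    'Name': ['Alice', 'Bob']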
})
df2 = pd.DataFrame({
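    'ID': [3, 4],
    'Name': ['Charlie', 'David']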
})
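# Concatenate along rows (default) and along columns
result = pd.concat([df1, df2])
result_col = pd.concat([df1, df2], axis=1)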
print(result)
print(result_col)
OUTPUT
ID Name
0 1 Alice
1 2 Bob
0 3 Charlie
1 4 David
ID Name ID Name
0 1 Alice 3 Charlie
1 2 Bob 4 David
c. Setting Conditions
import pandas as pd
data = {
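    # Same illustrative data as above
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}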
df = pd.DataFrame(data)
print(df)
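# Keep only rows meeting a condition (the threshold is an illustrative choice)
filtered_df = df[df['Age'] > 30]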
print(filtered_df)
# Output:
# 2 Charlie 35 Chicago
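# A condition can also be used to overwrite df itself (illustrative)
df = df[df['Age'] > 30]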
print(df)
OUTPUT
2 Charlie 35 Chicago
2 Charlie 35 Chicago
d. Adding a New Column
A new column can be added directly by assigning values to a new column name.
import pandas as pd
data = {
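    # Same illustrative data as above
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}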
df = pd.DataFrame(data)
print(df)
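# Add a new column by direct assignment (salary values are illustrative)
df['Salary'] = [50000, 60000, 70000]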
print(df)
OUTPUT
2 Charlie 35 Chicago
Experiment-7
7. Perform the following operations using pandas
a. Filling NaN with a String
The fillna() function is used to replace missing values (NaN) with a specified value.
import pandas as pd
import numpy as np
data = {
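    # A missing value to demonstrate fillna (values are illustrative)
    'Name': ['Alice', 'Bob', np.nan],
    'Age': [25, 30, 35]
}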
df = pd.DataFrame(data)
df_filled = df.fillna('Unknown')
print(df_filled)
OUTPUT
b. Sorting Based on Column Values
The sort_values() function is used to sort a DataFrame by column values. Sorting can be
ascending or descending.
import pandas as pd

# Create a DataFrame (values match the output below)
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)

df_sorted = df.sort_values(by='Age')                        # ascending (default)
df_sorted_desc = df.sort_values(by='Age', ascending=False)  # descending
print(df_sorted)
print(df_sorted_desc)
OUTPUT
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
      Name  Age  Salary
2  Charlie   35   70000
1      Bob   30   60000
0    Alice   25   50000
c. groupby()
The groupby() function is used to group data based on column values, and aggregation
functions (e.g., sum, mean, count) can be applied to each group.
import pandas as pd
import numpy as np
# Create a DataFrame
data = {
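    # Rows chosen to match the grouped sums and counts below
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance'],
    'Salary': [50000, 60000, 70000, 55000, 65000]
}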
df = pd.DataFrame(data)
grouped = df.groupby('Department')['Salary'].sum()
print(grouped)
grouped_count = df.groupby('Department')['Employee'].count()
print(grouped_count)
OUTPUT
Department
Finance 125000
HR 105000
IT 70000
Department
Finance 2
HR 2
IT 1
Experiment-8
8. Read the following file formats using pandas
a. Text files
b. CSV files
c. Excel files
d. JSON files
Pandas provides functions for loading data from different file formats into a DataFrame.
a. Text Files
Text files are often read using the read_csv() function, assuming the file is delimited (e.g., by
spaces or tabs). The example below instead uses Python's built-in file handling to write and
read a text file.
file1 = open("MyFile1.txt","a")
file2 = open(r"D:\Text\MyFi
file1.write("Hello \n")
file1.writelines(L)
print(file1.read())
print()
file1.seek(0)
print(file1.readline())
print()
file1.seek(0)
print(file1.read(9))
print()
file1.seek(0)
print(file1.readline(9))
file1.seek(0)
# readlines function
print(file1.readlines())
print()
file1.close()
OUTPUT
b. CSV Files

import csv

filename = "aapl.csv"
fields = []
rows = []

# Open the file, read the header row, then collect the data rows
with open(filename, 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    fields = next(csvreader)
    for row in csvreader:
        rows.append(row)

# get total number of rows
print("Total no. of rows: %d" % (csvreader.line_num))
OUTPUT
c. Excel Files
Excel files can be read using pandas.read_excel(). You'll need the openpyxl library installed
to read .xlsx files or xlrd for .xls files.
import openpyxl
path = "gfg.xlsx"
wb_obj = openpyxl.load_workbook(path)
sheet_obj = wb_obj.active
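# Read the first cell (row 1, column 1)
cell_obj = sheet_obj.cell(row=1, column=1)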
print(cell_obj.value)
OUTPUT
Name
import openpyxl
path = "gfg.xlsx"
wb_obj = openpyxl.load_workbook(path)
sheet_obj = wb_obj.active
row = sheet_obj.max_row
column = sheet_obj.max_column
print("Total Rows:", row)
print("Total Columns:", column)
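# Loop over the first column, then the first row
# (a reconstruction consistent with the output below)
print("Value of first column")
for i in range(1, row + 1):
    cell_obj = sheet_obj.cell(row=i, column=1)
    print(cell_obj.value)

print("Value of first row")
for i in range(1, column + 1):
    cell_obj = sheet_obj.cell(row=1, column=i)
    print(cell_obj.value, end=" ")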
OUTPUT
Total Rows: 6
Total Columns: 4
Value of first column
Name
Ankit
Rahul
Priya
Nikhil
Nisha
Value of first row
Ankit B.Tech CSE 4
d. JSON Files
JSON (JavaScript Object Notation) files can be read using pandas.read_json() or, as below,
the built-in json module. The file must be in a compatible JSON format.
import json
f = open('data.json')
data = json.load(f)
for i in data['emp_details']:
print(i)
# Closing file
f.close()
OUTPUT
Experiment-9
9. Read the following file formats
a. Pickle files
b. Image files using PIL
c. Multiple files using Glob
d. Importing data from database
a. Reading Pickle Files
Pickle files store serialized Python objects. Pandas provides read_pickle() for reading
DataFrames stored in pickle format; the example below uses the pickle module directly to
serialize and restore a dictionary.
import pickle
# database
db = {}
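# Records reconstructed from the output below
Omkar = {'key': 'Omkar', 'name': 'Omkar Pathak', 'age': 21, 'pay': 40000}
Jagdish = {'key': 'Jagdish', 'name': 'Jagdish Pathak', 'age': 50, 'pay': 50000}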
db['Omkar'] = Omkar
db['Jagdish'] = Jagdish
# For storing
b = pickle.dumps(db)
# For loading
myEntry = pickle.loads(b)
print(myEntry)
OUTPUT
{'Omkar': {'key': 'Omkar', 'name': 'Omkar Pathak', 'age': 21, 'pay': 40000}, 'Jagdish': {'key':
'Jagdish', 'name': 'Jagdish Pathak', 'age': 50, 'pay': 50000}}
b. Reading Image Files Using PIL
The Python Imaging Library (PIL) allows you to open, process, and display image files. The
library is available as Pillow (install with pip install Pillow).
from PIL import Image

im = Image.open(r"C:\Users\System-Pc\Desktop\lion.png")
im.show()
OUTPUT
c. Reading Multiple Files Using Glob
The glob module retrieves file paths matching a specified pattern. This is useful for reading
multiple files from a directory.
import pandas as pd
import glob
# Define the file path pattern (e.g., all CSV files in a folder)
file_pattern = 'data_folder/*.csv'
file_list = glob.glob(file_pattern)
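# Read every matched CSV and combine into one DataFrame
# (assumes the files share the same columns)
combined_df = pd.concat((pd.read_csv(f) for f in file_list), ignore_index=True)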
print(combined_df)
d. Importing Data from a Database
You can use pandas with libraries like sqlite3 for SQLite databases or sqlalchemy for other
databases to read data into a DataFrame.
import sqlite3
import pandas as pd
conn = sqlite3.connect('example.db')
query = """
name TEXT,
age INTEGER
);
INSERT INTO users (name, age) VALUES ('Alice', 25), ('Bob', 30);
"""
conn.executescript(query)
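# Load the table into a DataFrame
df = pd.read_sql_query("SELECT * FROM users", conn)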
print(df)
conn.close()
Output:
# id name age
# 0 1 Alice 25
# 1 2 Bob 30
Experiment-10
10. Demonstrate web scraping using Python
import requests
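from bs4 import BeautifulSoup

# Target URL (illustrative; any page you are permitted to scrape works)
url = 'https://example.com'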
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')   # parse the HTML
    titles = soup.find_all(['h1', 'h2', 'h3'])           # heading tags (an illustrative choice)
    links = soup.find_all('a', href=True)  # Find all anchor tags with 'href' attribute
    print("Titles:")
    for title in titles:
        print(title.text.strip())
    print("\nLinks:")
    for link in links:
        print(link['href'])
else:
    print("Failed to retrieve the page:", response.status_code)
OUTPUT
Experiment-11
11. Perform the following preprocessing techniques on a loan prediction dataset
a. Feature Scaling
b. Feature Standardization
c. Label Encoding
d. One Hot Encoding
To perform the following preprocessing techniques on a loan prediction dataset, here is a
step-by-step breakdown using popular Python libraries: pandas, scikit-learn, and numpy. For
illustration, a code snippet is provided for each technique.
a. Feature Scaling
Feature scaling typically involves either normalization (min-max scaling) or standardization (z-
score normalization). Let's use Min-Max Scaling for scaling the features.
import pandas as pd
df = pd.read_csv('loan_data.csv')
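from sklearn.preprocessing import MinMaxScaler

# Numeric columns to scale (column names are illustrative; adjust to the dataset)
numerical_columns = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']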
# Initialize MinMaxScaler
scaler = MinMaxScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
print(df.head())
b. Feature Standardization
Feature standardization involves scaling the features to have a mean of 0 and a standard
deviation of 1.
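from sklearn.preprocessing import StandardScaler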
# Initialize StandardScaler
scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
print(df.head())
c. Label Encoding
Label encoding is used to convert categorical labels into numeric form. Typically, this is done
for the target variable (e.g., "Loan_Status").
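from sklearn.preprocessing import LabelEncoder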
# Initialize LabelEncoder
label_encoder = LabelEncoder()
df['Loan_Status'] = label_encoder.fit_transform(df['Loan_Status'])
print(df['Loan_Status'].head())
d. One-Hot Encoding
One-Hot Encoding is used to convert categorical features into binary columns (0 or 1). This is
often used for non-ordinal categorical variables like "Gender", "Marital_Status", etc.
# One-hot encode non-ordinal categorical columns with pandas
# (column names are illustrative; adjust to the dataset)
df = pd.get_dummies(df, columns=['Gender', 'Marital_Status'])
print(df.head())
Experiment-12
12. Perform the following visualizations using matplotlib
a. Bar Graph
b. Pie Chart
c. Box Plot
d. Histogram
e. Line Chart and Subplots
f. Scatter Plot
import matplotlib.pyplot as plt
import numpy as np
# Sample data
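categories = ['A', 'B', 'C', 'D', 'E']  # illustrative category labels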
values = [3, 7, 9, 5, 4]
data = np.random.randn(1000)
# a. Bar Graph
plt.figure(figsize=(6, 4))
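plt.bar(categories, values, color='skyblue')  # the color is an illustrative choice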
plt.title('Bar Graph')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
# b. Pie Chart
plt.figure(figsize=(6, 6))
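plt.pie(values, labels=categories, autopct='%1.1f%%')  # show percentage labels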
plt.title('Pie Chart')
plt.axis('equal') # Equal aspect ratio ensures that pie chart is drawn as a circle.
plt.show()
# c. Box Plot
plt.figure(figsize=(6, 4))
plt.boxplot(data)
plt.title('Box Plot')
plt.ylabel('Values')
plt.show()
# d. Histogram
plt.figure(figsize=(6, 4))
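plt.hist(data, bins=30, color='green', edgecolor='black')  # bin count is illustrative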
plt.title('Histogram')
plt.xlabel('Data Values')
plt.ylabel('Frequency')
plt.show()
# Line Chart
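fig, axes = plt.subplots(1, 2, figsize=(10, 4))  # one row, two subplots (size is illustrative)
axes[0].plot(values, marker='o')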
axes[0].set_title('Line Chart')
axes[0].set_xlabel('Index')
axes[0].set_ylabel('Values')
# Line Chart with more points
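x = np.linspace(0, 2 * np.pi, 100)  # one full period (an illustrative range)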
y = np.sin(x)
axes[1].plot(x, y, color='r')
axes[1].set_title('Sine Wave')
axes[1].set_xlabel('X')
axes[1].set_ylabel('sin(X)')
plt.tight_layout()
plt.show()
# f. Scatter Plot
x = np.random.randn(100)
y = np.random.randn(100)
plt.figure(figsize=(6, 4))
plt.scatter(x, y, color='purple')
plt.title('Scatter Plot')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
Experiment-13
13. Getting started with NLTK; install NLTK using pip
To get started with NLTK (Natural Language Toolkit), you first need to install it. You can
install it using pip, Python's package installer.
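1. Open a terminal or command prompt.
2. Run the pip install command:
pip install nltk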
3. Verify Installation: After installation, you can verify that NLTK is installed correctly by
opening a Python interpreter or a script and importing it:
import nltk
print(nltk.__version__)
To use various resources (like corpora, tokenizers, and more) from NLTK, you need to download
the data.
import nltk
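# Download a specific resource, e.g. the Punkt tokenizer models;
# calling nltk.download() with no argument opens an interactive downloader
nltk.download('punkt')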
Experiment-14
14. Python program to implement with Python scikit-learn & NLTK
The program below trains a simple sentiment classifier on the NLTK movie_reviews corpus
using scikit-learn; it is a minimal working sketch, and the classifier and split settings are
illustrative choices.
import nltk
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')
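# Corpus and scikit-learn imports (an assumed but standard setup)
from nltk.corpus import movie_reviews, stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score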
stop_words = set(stopwords.words('english'))
def preprocess_text(words):
return [word.lower() for word in words if word.isalpha() and word not in stop_words]
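# Build one string per review with its 'pos'/'neg' label, then split
# (test size and random_state are illustrative choices)
documents = [" ".join(preprocess_text(movie_reviews.words(fid)))
             for fid in movie_reviews.fileids()]
labels = [movie_reviews.categories(fid)[0] for fid in movie_reviews.fileids()]
X_train, X_test, y_train, y_test = train_test_split(
    documents, labels, test_size=0.25, random_state=42)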
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
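classifier = MultinomialNB()  # Naive Bayes is an illustrative classifier choice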
classifier.fit(X_train_vec, y_train)
y_pred = classifier.predict(X_test_vec)
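print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")  # exact value depends on the split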
new_review = "This movie was amazing! The plot and acting were fantastic."
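new_review_processed = " ".join(preprocess_text(word_tokenize(new_review)))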
new_review_vec = vectorizer.transform([new_review_processed])
prediction = classifier.predict(new_review_vec)
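print("Prediction:", prediction[0])  # 'pos' or 'neg'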
OUTPUT
Accuracy: 80.67%
Experiment-15
15. Python program to implement with Python NLTK/spaCy/PyNLPI
Python Program:
import nltk
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import pynlpl  # PyNLPI (installed and imported as "pynlpl")
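# Sample text (recovered from the tokenization output below)
text = ("Apple is looking at buying U.K. startup for $1 billion. "
        "The quick brown fox jumps over the lazy dog. "
        "Barack Obama was the 44th president of the United States.")

# 1. NLTK: tokenize into words and sentences
words = word_tokenize(text)
sentences = sent_tokenize(text)
print("NLTK Tokenization:", "Words:", words, "Sentences:", sentences)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w.lower() not in stop_words]
print("Filtered Words (after removing stopwords):", filtered_words)

# 2. spaCy: named entity recognition
# (the model must be installed first: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

# 3. Phrase extraction: spaCy's noun chunks serve here as an illustrative
# stand-in for the chunking step described in the explanation below
phrases = [chunk.text for chunk in doc.noun_chunks]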
print("Extracted Phrases:")
for phrase in phrases:
print(phrase)
Explanation of Code:
1. NLTK:
o Tokenization: We use word_tokenize() and sent_tokenize() from NLTK to
tokenize the text into words and sentences.
o Stopword Removal: We filter out common words (e.g., "the", "is", etc.) using the
stopwords corpus from NLTK.
2. spaCy:
o We use spaCy to perform Named Entity Recognition (NER), where spaCy
identifies named entities such as persons, organizations, dates, and locations. In
this example, it identifies "Apple", "U.K.", and "Barack Obama" as named
entities.
3. PyNLPI:
o PyNLPI focuses on NLP tasks such as chunking and phrase extraction. It can be
used for deeper linguistic analysis, like extracting noun phrases or more advanced
information from a text. Here, the basic chunks or phrases identified in the text
(with spaCy's noun chunks standing in for this step) are extracted and printed.
OUTPUT
NLTK Tokenization: Words: ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1',
'billion', '.', 'The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.'] Sentences: ['Apple is
looking at buying U.K. startup for $1 billion.', 'The quick brown fox jumps over the lazy dog.',
'Barack Obama was the 44th president of the United States.']
Filtered Words (after removing stopwords): ['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1',
'billion', '.', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.', 'Barack', 'Obama', '44th', 'president',
'United', 'States', '.']
Extracted Phrases:
Apple
U.K.
startup
$1 billion
lazy dog
Barack Obama
44th president
United States