
DATA SCIENCE LAB MANUAL

TYCS SEMESTER-VI

“Special thanks to Nikhil Singh, Ashi Chauhan and Dinesh Chaudhary for their co-operation in compiling this document.”

Compiled by:
Asst. Prof. Megha Sharma
YouTube channel: http://www.youtube.com/@omega_teched

PRACTICAL 1
Introduction to Excel
A. Perform conditional formatting on a dataset using various criteria.

Steps:
Step 1: Select the data range and go to Home > Conditional Formatting > Highlight Cells Rules > Greater Than.
Step 2: Enter the threshold value for the rule, for example 2000.
Step 3: Go to Conditional Formatting > Data Bars > Solid Fill to add data bars to the values.
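These steps can also be reproduced programmatically. Below is a minimal sketch using the openpyxl library, assuming a workbook named sales.xlsx with numeric values in column B; the file name and cell range are placeholders, not part of the practical's dataset.

# Hedged sketch: apply a "greater than 2000" highlight rule with openpyxl
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
from openpyxl.formatting.rule import CellIsRule

wb = load_workbook('sales.xlsx')        # placeholder workbook name
ws = wb.active

# Light-red fill for cells whose value is greater than 2000
red_fill = PatternFill(start_color='FFC7CE', end_color='FFC7CE', fill_type='solid')
ws.conditional_formatting.add(
    'B2:B100',                          # placeholder cell range
    CellIsRule(operator='greaterThan', formula=['2000'], fill=red_fill),
)

wb.save('sales_formatted.xlsx')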

B. Create a pivot table to analyse and summarize data.

Steps:
Step 1: Select the entire table and go to the Insert tab > PivotChart > PivotChart.
Step 2: Select “New Worksheet” in the Create PivotChart window.
Step 3: Drag the required attributes into the field boxes below (Filters, Legend, Axis and Values).
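For comparison, the same kind of summary can be built in pandas with pivot_table. A minimal sketch on a made-up sales table (the column names and values are placeholders, not the worksheet used above):

import pandas as pd

# Made-up sales table standing in for the Excel data
sales = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West'],
    'Product': ['Pen', 'Book', 'Pen', 'Book'],
    'Sales': [120, 340, 90, 410],
})

# Total sales per region and product, like an Excel pivot table
pivot = pd.pivot_table(sales, values='Sales', index='Region',
                       columns='Product', aggfunc='sum', fill_value=0)
print(pivot)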

C. Use the VLOOKUP function to retrieve information from a different worksheet or table.

Steps:
Step 1: Click on an empty cell and type a formula of the following form:
=VLOOKUP(B3, B3:D3, 1, TRUE)
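The pandas counterpart of VLOOKUP is a left merge on a key column. A minimal sketch with made-up lookup and order tables (all names here are illustrative, not from the worksheet used above):

import pandas as pd

# Lookup table (plays the role of the second worksheet in a VLOOKUP)
products = pd.DataFrame({'ProductID': [1, 2, 3],
                         'Price': [10.0, 25.5, 7.25]})

# Main table that needs the Price column filled in
orders = pd.DataFrame({'OrderID': [101, 102, 103],
                       'ProductID': [2, 1, 3]})

# A left merge behaves like an exact-match VLOOKUP on ProductID
merged = orders.merge(products, on='ProductID', how='left')
print(merged)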

D. Perform what-if analysis using Goal Seek to determine input values for a desired output.

Steps:
Step 1: In the Data tab, go to What-If Analysis > Goal Seek.
Step 2: Fill in “Set cell”, “To value” and “By changing cell” in the Goal Seek window and click OK.
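Goal Seek is essentially numerical root finding: Excel varies one input cell until a formula reaches the target value. The sketch below shows the same idea in Python with scipy.optimize.brentq, using a made-up revenue formula; the function, bounds and target are assumptions for illustration only.

from scipy.optimize import brentq

# Made-up formula: revenue as a function of unit price
def revenue(price):
    units_sold = 500 - 10 * price   # simple assumed demand curve
    return price * units_sold

target = 5000   # desired output, as typed into "To value"

# Find the price at which revenue(price) - target = 0 ("By changing cell")
price_needed = brentq(lambda p: revenue(p) - target, 0, 25)
print(f"Price needed to reach revenue of {target}: {price_needed:.2f}")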

For a video demonstration of the practical, click the link below:
Data Science Practical Playlist

PRACTICAL 2
Data Frames and Basic Data Pre-processing
A. Read data from CSV and JSON files into a data frame.

(1)
# Read data from a CSV file
import pandas as pd

df = pd.read_csv('Student_Marks.csv')
print("Our dataset:")
print(df)

(2)
# Reading data from a JSON file
import pandas as pd

data = pd.read_json('dataset.json')
print(data)


B. Perform basic data pre-processing tasks such as handling missing values and outliers.

Code:
(1)
# Replacing NA values using fillna()
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df)
df.head(10)

print("Dataset after filling NA values with 0:")
df2 = df.fillna(value=0)
print(df2)


(2)
# Dropping NA values using dropna()
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df)
df.head(10)

print("Dataset after dropping NA values:")
df.dropna(inplace=True)
print(df)
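The code above handles missing values; for the outlier part of the task, one common approach is the IQR rule. A minimal sketch on the same titanic.csv, assuming it has a numeric 'Age' column:

import pandas as pd

df = pd.read_csv('titanic.csv')

# IQR fences on the 'Age' column (column name assumed to exist in titanic.csv)
q1 = df['Age'].quantile(0.25)
q3 = df['Age'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows whose Age lies inside the fences
no_outliers = df[(df['Age'] >= lower) & (df['Age'] <= upper)]
print("Rows before:", len(df), "- after removing Age outliers:", len(no_outliers))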


C. Manipulate and transform data using functions like filtering, sorting, and grouping.

Code:


import pandas as pd

# Load the iris dataset
iris = pd.read_csv('Iris.csv')

# Filtering data based on a condition
setosa = iris[iris['Species'] == 'setosa']
print("Setosa samples:")
print(setosa.head())

# Sorting data
sorted_iris = iris.sort_values(by='SepalLengthCm', ascending=False)
print("\nSorted iris dataset:")


print(sorted_iris.head())

# Grouping data
grouped_species = iris.groupby('Species').mean()
print("\nMean measurements for each species:")
print(grouped_species)

For a video demonstration of the practical, click the link below:
Data Science Practical Playlist

PRACTICAL 3
Feature Scaling and Dummification
A. Apply feature-scaling techniques like standardization and normalization to numerical features.

Code:

# Standardization and normalization
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv('wine.csv', header=None, usecols=[0, 1, 2], skiprows=1)
df.columns = ['classlabel', 'Alcohol', 'Malic Acid']
print("Original DataFrame:")
print(df)

scaling = MinMaxScaler()
scaled_value = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_value
print("\nDataFrame after MinMax scaling:")
print(df)

scaling = StandardScaler()
scaled_standardvalue = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_standardvalue
print("\nDataFrame after Standard scaling:")
print(df)


B. Perform feature dummification to convert categorical variables into numerical representations.

Code:

import pandas as pd

iris = pd.read_csv("Iris.csv")
print(iris)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
iris['code'] = le.fit_transform(iris.Species)
print(iris)
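LabelEncoder assigns a single integer code per category. Dummification in the one-hot sense can instead be done with pd.get_dummies; a minimal sketch on the same Iris.csv, assuming the 'Species' column as above:

import pandas as pd

iris = pd.read_csv("Iris.csv")

# One-hot encode Species: one 0/1 indicator column per species
dummies = pd.get_dummies(iris['Species'], prefix='Species')
iris_dummified = pd.concat([iris, dummies], axis=1)
print(iris_dummified.head())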

For a video demonstration of the practical, click the link below:
Data Science Practical Playlist

Practical 4
Hypothesis Testing
Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).

# t-test
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Generate two samples for demonstration purposes
np.random.seed(42)
sample1 = np.random.normal(loc=10, scale=2, size=30)
sample2 = np.random.normal(loc=12, scale=2, size=30)

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(sample1, sample2)

# Set the significance level
alpha = 0.05

print("Results of Two-Sample t-test:")
print(f'T-statistic: {t_statistic}')
print(f'P-value: {p_value}')
print(f"Degrees of Freedom: {len(sample1) + len(sample2) - 2}")

# Plot the distributions
plt.figure(figsize=(10, 6))
plt.hist(sample1, alpha=0.5, label='Sample 1', color='blue')
plt.hist(sample2, alpha=0.5, label='Sample 2', color='orange')
plt.axvline(np.mean(sample1), color='blue', linestyle='dashed', linewidth=2)
plt.axvline(np.mean(sample2), color='orange', linestyle='dashed', linewidth=2)
plt.title('Distributions of Sample 1 and Sample 2')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()

# Highlight the critical region if the null hypothesis is rejected
if p_value < alpha:
    critical_region = np.linspace(min(sample1.min(), sample2.min()),
                                  max(sample1.max(), sample2.max()), 1000)
    plt.fill_between(critical_region, 0, 5, color='red', alpha=0.3, label='Critical Region')
    plt.text(11, 5, f'T-statistic: {t_statistic:.2f}', ha='center', va='center',
             color='black', backgroundcolor='white')


# Show the plot
plt.show()

# Draw conclusions
if p_value < alpha:
    if np.mean(sample1) > np.mean(sample2):
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean of Sample 1 is significantly higher than that of Sample 2.")
    else:
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean of Sample 2 is significantly higher than that of Sample 1.")
else:
    print("Conclusion: Fail to reject the null hypothesis.")
    print("Interpretation: There is not enough evidence to claim a significant difference between the means.")


Output:

# Chi-square test
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import warnings


from scipy import stats

warnings.filterwarnings('ignore')

df = sb.load_dataset('mpg')
print(df)
print(df['horsepower'].describe())
print(df['model_year'].describe())

bins = [0, 75, 150, 240]
df['horsepower_new'] = pd.cut(df['horsepower'], bins=bins, labels=['l', 'm', 'h'])
c = df['horsepower_new']
print(c)

ybins = [69, 72, 74, 84]
label = ['t1', 't2', 't3']
df['modelyear_new'] = pd.cut(df['model_year'], bins=ybins, labels=label)
newyear = df['modelyear_new']
print(newyear)

df_chi = pd.crosstab(df['horsepower_new'], df['modelyear_new'])
print(df_chi)
print(stats.chi2_contingency(df_chi))

Output:


Conclusion:
There is sufficient evidence to reject the null hypothesis, indicating that
there is a significant association between 'horsepower_new' and
'modelyear_new' categories.
For a video demonstration of the practical, click the link below:
Data Science Practical Playlist

Practical 5
ANOVA (Analysis of Variance)
Perform one-way ANOVA to compare means across multiple groups.
Conduct post-hoc tests to identify significant differences between group means.

import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

group1 = [23, 25, 29, 34, 30]
group2 = [19, 20, 22, 24, 25]
group3 = [15, 18, 20, 21, 17]
group4 = [28, 24, 26, 30, 29]

# Combine data into a DataFrame
data = pd.DataFrame({'value': group1 + group2 + group3 + group4,
                     'group': ['Group1'] * len(group1) + ['Group2'] * len(group2) +
                              ['Group3'] * len(group3) + ['Group4'] * len(group4)})

# Perform one-way ANOVA
f_statistics, p_value = stats.f_oneway(group1, group2, group3, group4)
print("One-way ANOVA:")
print("F-statistic:", f_statistics)
print("p-value:", p_value)

# Perform Tukey-Kramer post-hoc test
tukey_results = pairwise_tukeyhsd(data['value'], data['group'])
print("\nTukey-Kramer post-hoc test:")
print(tukey_results)
Output:

Conclusion

• F-statistic: This value indicates the ratio of the variance between groups to
the variance within groups. A larger F-statistic suggests that the means of the
groups are more different from each other compared to the variability within
each group.
• If the p-value is less than the chosen significance level (e.g., 0.05), it
suggests that there are significant differences among the group means.
• If the p-value is greater than the significance level, it suggests that there
is insufficient evidence to reject the null hypothesis, meaning there are no
significant differences among the group means.
• There are significant differences among the means of the groups.

For a video demonstration of the practical, click the link below:
Data Science Practical Playlist
Practical 6
Regression and its Types.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
print(housing_df)

housing_df['PRICE'] = housing.target

X = housing_df[['AveRooms']]
y = housing_df['PRICE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()

model.fit(X_train, y_train)

mse = mean_squared_error(y_test, model.predict(X_test))
r2 = r2_score(y_test, model.predict(X_test))

print("Mean Squared Error:", mse)
print("R-squared:", r2)
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

Output:

#########################################

# Multiple Linear Regression

X = housing_df.drop('PRICE', axis=1)
y = housing_df['PRICE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

Output:

Conclusion

The Mean Squared Error (MSE) is a commonly used metric to evaluate the
performance of regression models. It measures the average squared difference
between the predicted values and the actual values of the target variable. A lower
MSE value indicates that the model's predictions are closer to the actual values on
average, suggesting better performance.
R-squared (R²) tells us how well the independent variables explain the variability of the dependent variable. It ranges from 0 to 1, where 0 indicates that the model does not explain any of the variability of the dependent variable around its mean, and 1 indicates that the model explains all of that variability.

The intercept represents the point where the regression line intersects the y-axis
on a graph. It provides information about the baseline value of the dependent
variable when all predictors are zero.
Coefficients represent the impact of changes in the independent variables on the
dependent variable in a linear regression model.
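As a quick check of these definitions, both metrics can be recomputed directly from the residuals. The sketch below is illustrative and assumes y_test and y_pred from the multiple regression code above:

import numpy as np

# Illustrative re-computation of the two metrics from their definitions
residuals = np.asarray(y_test) - np.asarray(y_pred)

mse_manual = np.mean(residuals ** 2)                          # mean of squared errors
ss_res = np.sum(residuals ** 2)                               # residual sum of squares
ss_tot = np.sum((np.asarray(y_test) - np.mean(y_test)) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print("MSE (manual):", mse_manual)
print("R-squared (manual):", r2_manual)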

For a video demonstration of the practical, click the link below:
Data Science Practical Playlist
Practical 7
Logistic Regression and Decision Tree

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# Load the Iris dataset and create a binary classification problem
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
binary_df = iris_df[iris_df['target'] != 2]
X = binary_df.drop('target', axis=1)
y = binary_df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model and evaluate its performance
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred_logistic = logistic_model.predict(X_test)

print("Logistic Regression Metrics")
print("Accuracy: ", accuracy_score(y_test,
y_pred_logistic)) print("Precision:",
precision_score(y_test, y_pred_logistic))
print("Recall: ", recall_score(y_test,
y_pred_logistic))

print("\nClassification Report") print(classification_report(y_test, y_pred_logistic))


# Train a decision tree model and evaluate its
performancedecision_tree_model =
DecisionTreeClassifier()
decision_tree_model.fit(X_train, y_train)
y_pred_tree =
decision_tree_model.predict(X_test) print("\
nDecision Tree Metrics") print("Accuracy: ",
accuracy_score(y_test,
y_pred_tree))print("Precision:",
precision_score(y_test, y_pred_tree))
print("Recall: ", recall_score(y_test,
y_pred_tree)) print("\nClassification Report")
print(classification_report(y_test,
y_pred_tree))


Output:

Conclusion:

Precision: Precision measures the ratio of correctly predicted positive observations to the total predicted positives. It indicates the accuracy of the positive predictions made by the model; a higher precision means fewer false positives. Ideally, precision should be as close to 1.0 as possible: a precision of 1.0 indicates that all positive predictions made by the model are correct, with no false positives.

Recall (also called Sensitivity or True Positive Rate): Recall measures the ratio of correctly predicted positive observations to all actual positives in the dataset. It indicates the model's ability to identify all positive instances; a higher recall means fewer false negatives. An ideal recall score is also 1.0, indicating that the model correctly identifies every positive instance in the dataset, with no false negatives.

F1-score: The F1-score is the harmonic mean of precision and recall, so it balances the two. It reaches its best value at 1 and its worst at 0. A higher F1-score indicates better overall performance, especially in scenarios where we want to balance false positives and false negatives.

Support: Support is the number of actual occurrences of the class in the specified dataset, i.e., the number of samples in each class. The support values ideally reflect a well-distributed dataset, with enough samples for each class.
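To make these definitions concrete, the sketch below computes them from a small confusion matrix; the counts are made up for illustration and are not taken from the practical's output:

# Hypothetical binary confusion-matrix counts (illustrative only)
tp, fp, fn, tn = 40, 5, 10, 45

precision = tp / (tp + fp)                            # 40 / 45 ~ 0.889
recall = tp / (tp + fn)                               # 40 / 50 = 0.800
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean ~ 0.842
support_positive = tp + fn                            # actual positives = 50

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
print(f"Support (positive class): {support_positive}")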

For a video demonstration of the practical, click the link below:
Data Science Practical Playlist
Practical 8
K-Means clustering
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = pd.read_csv("C:\\Users\\Reape\\Downloads\\wholesale\\wholesale.csv")
data.head()

categorical_features = ['Channel', 'Region']
continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
data[continuous_features].describe()

for col in categorical_features:
    dummies = pd.get_dummies(data[col], prefix=col)
    data = pd.concat([data, dummies], axis=1)
    data.drop(col, axis=1, inplace=True)
data.head()

mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)


sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)


    sum_of_squared_distances.append(km.inertia_)

plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method for optimal k')
plt.show()

Output:


Conclusion:

• The elbow method helps in determining the optimal number of clusters for
the dataset. The point where the rate of decrease in the sum of squared
distances significantly slows down suggests a suitable number of clusters.
• We conclude that the optimal number of clusters for the data is 5.
• The optimal number of clusters identified using this method can be used
for further analysis or segmentation of customers based on their purchasing
behavior.
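If the elbow is to be picked programmatically rather than by eye, one simple heuristic is to take the k whose point lies farthest from the straight line joining the first and last points of the curve. The sketch below applies that heuristic, reusing K and sum_of_squared_distances from the code above; it is one possible heuristic, not the method used in the manual:

import numpy as np

ks = np.array(list(K), dtype=float)
ssd = np.array(sum_of_squared_distances, dtype=float)

# Straight line from the first to the last point of the elbow curve
p1 = np.array([ks[0], ssd[0]])
p2 = np.array([ks[-1], ssd[-1]])
line_vec = (p2 - p1) / np.linalg.norm(p2 - p1)

# Perpendicular distance of every curve point from that line
points = np.column_stack([ks, ssd]) - p1
proj = np.outer(points @ line_vec, line_vec)
distances = np.linalg.norm(points - proj, axis=1)

elbow_k = int(ks[np.argmax(distances)])
print("Heuristic elbow at k =", elbow_k)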

For a video demonstration of the practical, click the link below:
Data Science Practical Playlist

Practical 9
Principal Component Analysis (PCA)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
X = iris_df.drop('target', axis=1)
y = iris_df['target']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance_ratio = pca.explained_variance_ratio_


plt.figure(figsize=(8, 6))
plt.plot(np.cumsum(explained_variance_ratio), marker='o', linestyle='--')
plt.title('Explained Variance Ratio')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(True)
plt.show()

cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
n_components = np.argmax(cumulative_variance_ratio >= 0.95) + 1
print(f"Number of principal components to explain 95% variance: {n_components}")

pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', s=50, alpha=0.5)
plt.title('Data in Reduced-dimensional Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Target')
plt.show()

Output:

Conclusion:
In conclusion, the code demonstrates the effectiveness of PCA in reducing the
dimensionality of high-dimensional datasets while preserving essential information.
It provides a systematic approach to exploring and visualizing complex datasets,
thereby aiding in data analysis and interpretation.
For a video demonstration of the practical, click the link below:
Data Science Practical Playlist

Practical 10
Data Visualization and Storytelling

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Generate random data
np.random.seed(42) # Set a seed for reproducibility
# Create a DataFrame with random data
data = pd.DataFrame({
'variable1': np.random.normal(0, 1, 1000),
'variable2': np.random.normal(2, 2, 1000) + 0.5 * np.random.normal(0, 1, 1000),
'variable3': np.random.normal(-1, 1.5, 1000),
'category': pd.Series(np.random.choice(['A', 'B', 'C', 'D'], size=1000, p=[0.4, 0.3, 0.2,
0.1]),
dtype='category')
})
# Create a scatter plot to visualize the relationship between two variables
plt.figure(figsize=(10, 6))
plt.scatter(data['variable1'], data['variable2'], alpha=0.5)
plt.title('Relationship between Variable 1 and Variable 2', fontsize=16)
plt.xlabel('Variable 1', fontsize=14)
plt.ylabel('Variable 2', fontsize=14)
plt.show()
# Create a bar chart to visualize the distribution of a categorical variable
plt.figure(figsize=(10, 6))
sns.countplot(x='category', data=data)
plt.title('Distribution of Categories', fontsize=16)
plt.xlabel('Category', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(rotation=45)
plt.show()
# Create a heatmap to visualize the correlation between numerical variables
plt.figure(figsize=(10, 8))
numerical_cols = ['variable1', 'variable2', 'variable3']
sns.heatmap(data[numerical_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap', fontsize=16)
plt.show()
# Data Storytelling
print("Title: Exploring the Relationship between Variable 1 and Variable 2")
print("\nThe scatter plot (Figure 1) shows the relationship between Variable 1 and
Variable 2. ")
print("\nScatter Plot")
print("Figure 1: Scatter Plot of Variable 1 and Variable 2")
print("\nTo better understand the distribution of the categorical variable 'category',
we created a ")
print("\nBar Chart")
print("Figure 2: Distribution of Categories")
print("\nAdditionally, we explored the correlation between numerical variables using
a heatmap ")
print("\nHeatmap")
print("Figure 3: Correlation Heatmap")

print("\nIn summary, the visualizations and analysis provide insights into the
relationships ")

Output:

For a video demonstration of the practical, click the link below:
Data Science Practical Playlist