TYCS Data Science Manual
TYCS SEMESTER-VI
Compiled by:
Asst. Prof. Megha Sharma
TYCS Sem-VI Data Science Lab Manual By: Megha Sharma
http://www.youtube.com/@omega_teched
PRACTICAL 1
Introduction to Excel
A. Perform conditional formatting on a dataset using various criteria.
Steps
Step 1: Select the data range and go to Conditional Formatting > Highlight Cells Rules > Greater Than.
Step 2: Enter the threshold value for the Greater Than rule, for example 2000.
Step 3: Fill in the information in the Greater Than dialog accordingly and click OK.
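The same rule can also be applied programmatically. The sketch below uses the openpyxl library and is purely illustrative: the workbook name sales.xlsx, the active sheet, and the cell range B2:B100 are assumptions, not part of the original practical.

from openpyxl import load_workbook
from openpyxl.styles import PatternFill
from openpyxl.formatting.rule import CellIsRule

wb = load_workbook('sales.xlsx')   # assumed workbook name
ws = wb.active

# Highlight cells greater than 2000 with a light red fill
red_fill = PatternFill(start_color='FFC7CE', end_color='FFC7CE', fill_type='solid')
rule = CellIsRule(operator='greaterThan', formula=['2000'], fill=red_fill)
ws.conditional_formatting.add('B2:B100', rule)   # assumed data range

wb.save('sales_formatted.xlsx')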
PRACTICAL 2
Data Frames and Basic Data Pre-processing
A. Read data from CSV and JSON files into a data frame.
(1)
# Read data from a CSV file
import pandas as pd
df = pd.read_csv('Student_Marks.csv')
print("Our dataset ")
print(df)
(2)
# Reading data from a JSON file
import pandas as pd
data = pd.read_json('dataset.json')
print(data)
# Filling NA values using fillna()
import pandas as pd
df = pd.read_csv('titanic.csv')
print(df)
df.head(10)
print("Dataset after filling NA values with 0 : ")
df2 = df.fillna(value=0)
print(df2)
(2)
# Dropping NA values using dropna()
import pandas as pd
df = pd.read_csv('titanic.csv')
print(df)
df.head(10)
df2 = df.dropna()
print("Dataset after dropping NA values:")
print(df2)
# Sorting and grouping the Iris dataset
import pandas as pd
iris = pd.read_csv('Iris.csv')

# Sorting data
sorted_iris = iris.sort_values(by='SepalLengthCm', ascending=False)
print("\nSorted iris dataset:")
print(sorted_iris.head())

# Grouping data
grouped_species = iris.groupby('Species').mean()
print("\nMean measurements for each species:")
print(grouped_species)
PRACTICAL 3
Feature Scaling and Dummification
A. Apply feature-scaling techniques like standardization and normalization to numerical features.
Code:
# Standardization and normalization
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv('wine.csv', header=None, usecols=[0, 1, 2], skiprows=1)
df.columns = ['classlabel', 'Alcohol', 'Malic Acid']
print("Original DataFrame:")
print(df)

scaling = MinMaxScaler()
scaled_value = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_value
print("\n Dataframe after MinMax Scaling")
print(df)

scaling = StandardScaler()
scaled_standardvalue = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_standardvalue
print("\n Dataframe after Standard Scaling")
print(df)
Code:
import pandas as pd
iris = pd.read_csv("Iris.csv")
print(iris)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
iris['code'] = le.fit_transform(iris.Species)
print(iris)
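The LabelEncoder above replaces each species name with a single integer code. Dummification in the strict sense creates one 0/1 column per category; a minimal sketch using pandas get_dummies on the same Iris.csv file is shown below.

import pandas as pd

iris = pd.read_csv("Iris.csv")

# One-hot encode the Species column; each species becomes a separate 0/1 indicator column
iris_dummies = pd.get_dummies(iris, columns=['Species'], prefix='Species')
print(iris_dummies.head())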
Practical 4
Hypothesis Testing
Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
# t-test
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Illustrative samples (the original data is not reproduced in the manual)
np.random.seed(42)
sample1 = np.random.normal(loc=50, scale=5, size=30)
sample2 = np.random.normal(loc=52, scale=5, size=30)

# Two-sample (independent) t-test
t_statistic, p_value = stats.ttest_ind(sample1, sample2)

print("Results of Two-Sample t-test:")
print(f'T-statistic: {t_statistic}')
print(f'P-value: {p_value}')
print(f"Degrees of Freedom: {len(sample1) + len(sample2) - 2}")
# Plot the distributions
plt.figure(figsize=(10, 6))
plt.hist(sample1, alpha=0.5, label='Sample 1', color='blue')
plt.hist(sample2, alpha=0.5, label='Sample 2', color='green')
plt.legend()
plt.show()
Output:
# chi-square test
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
Conclusion:
There is sufficient evidence to reject the null hypothesis, indicating that
there is a significant association between 'horsepower_new' and
'modelyear_new' categories.
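The chi-square code is not reproduced in full above, so the following is a minimal sketch of the kind of test the conclusion refers to, using scipy.stats.chi2_contingency on a contingency table of two categorical columns. The column names horsepower_new and modelyear_new are taken from the conclusion, but the data here is synthetic and purely illustrative, not the practical's original dataset.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic categorical data standing in for the binned horsepower / model-year columns
np.random.seed(0)
df = pd.DataFrame({
    'horsepower_new': np.random.choice(['low', 'medium', 'high'], size=200),
    'modelyear_new': np.random.choice(['old', 'recent'], size=200),
})

# Build the contingency table and run the chi-square test of independence
contingency = pd.crosstab(df['horsepower_new'], df['modelyear_new'])
chi2, p_value, dof, expected = chi2_contingency(contingency)

print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)
if p_value < 0.05:
    print("Reject the null hypothesis: the variables appear to be associated.")
else:
    print("Fail to reject the null hypothesis: no significant association detected.")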
Practical 5
ANOVA (Analysis of Variance)
Perform one-way ANOVA to compare means across multiple groups.
Conduct post-hoc tests to identify significant differences between group means.
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
Conclusion
• F-statistic: This value indicates the ratio of the variance between groups to
the variance within groups. A larger F-statistic suggests that the means of the
groups are more different from each other compared to the variability within
each group.
• If the p-value is less than the chosen significance level (e.g., 0.05), it
suggests that there are significant differences among the group means.
• If the p-value is greater than the significance level, it suggests that there
is insufficient evidence to reject the null hypothesis, meaning there are no
significant differences among the group means.
• There are significant differences among the means of the groups.
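The ANOVA code itself appears above only as imports, so the following is a minimal sketch of a one-way ANOVA followed by Tukey's HSD post-hoc test. The three groups are synthetic illustrative data, not the dataset used in the original practical.

import numpy as np
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Three synthetic groups with slightly different means
np.random.seed(1)
group_a = np.random.normal(10, 2, 30)
group_b = np.random.normal(12, 2, 30)
group_c = np.random.normal(11, 2, 30)

# One-way ANOVA: are all the group means equal?
f_statistic, p_value = stats.f_oneway(group_a, group_b, group_c)
print("F-statistic:", f_statistic)
print("P-value:", p_value)

# Post-hoc Tukey HSD: which pairs of groups differ?
values = np.concatenate([group_a, group_b, group_c])
labels = ['A'] * 30 + ['B'] * 30 + ['C'] * 30
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey)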
Practical 6
Regression and Its Types
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
print(housing_df)
housing_df['PRICE'] = housing.target

# Simple linear regression on a single feature (average number of rooms)
X = housing_df[['AveRooms']]
y = housing_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
r2 = r2_score(y_test, model.predict(X_test))
print("Mean Squared Error (AveRooms only):", mse)
print("R-squared (AveRooms only):", r2)
#########################################
# Multiple linear regression using all features
X = housing_df.drop('PRICE', axis=1)
y = housing_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)
Conclusion
The Mean Squared Error (MSE) is a commonly used metric to evaluate the
performance of regression models. It measures the average squared difference
between the predicted values and the actual values of the target variable. A lower
MSE value indicates that the model's predictions are closer to the actual values on
average, suggesting better performance.
R2 tells us how well the independent variables explain the variability of the dependent variable. It ranges from 0 to 1: 0 indicates that the model explains none of the variability of the dependent variable around its mean, and 1 indicates that the model explains all of the variability of the dependent variable around its mean.
The intercept represents the point where the regression line intersects the y-axis
on a graph. It provides information about the baseline value of the dependent
variable when all predictors are zero.
Coefficients represent the impact of changes in the independent variables on the
dependent variable in a linear regression model.
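To make the two metrics concrete, the following small sketch computes MSE and R2 by hand with NumPy and checks the results against scikit-learn. The four values in y_true and y_pred are made-up illustrative numbers, not output from the housing model above.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.0])
y_pred = np.array([2.8, 2.7, 3.6, 5.2])

# MSE: the average of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))  # both values should match
print(r2_manual, r2_score(y_true, y_pred))             # both values should match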
Practical 7
Logistic Regression and Decision Tree
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
# Load the Iris dataset and create a binary classification problem
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
binary_df = iris_df[iris_df['target'] != 2]
X = binary_df.drop('target', axis=1)
y = binary_df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model and evaluate its performance
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred_logistic = logistic_model.predict(X_test)
print("Logistic Regression accuracy:", accuracy_score(y_test, y_pred_logistic))
print("Logistic Regression precision:", precision_score(y_test, y_pred_logistic))
print("Logistic Regression recall:", recall_score(y_test, y_pred_logistic))
print(classification_report(y_test, y_pred_logistic))
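The imports above also bring in DecisionTreeClassifier, although that part of the code is not reproduced in the manual. A minimal sketch of training a decision tree on the same split for comparison, reusing the variables and imports from the block above, might look like this:

# Decision tree on the same train/test split as the logistic regression above
decision_tree_model = DecisionTreeClassifier(random_state=42)
decision_tree_model.fit(X_train, y_train)
y_pred_tree = decision_tree_model.predict(X_test)

print("Decision Tree accuracy:", accuracy_score(y_test, y_pred_tree))
print("Decision Tree precision:", precision_score(y_test, y_pred_tree))
print("Decision Tree recall:", recall_score(y_test, y_pred_tree))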
Output:-
Conclusion:
Practical 8
K-Means Clustering
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

data = pd.read_csv(r"C:\Users\Reape\Downloads\wholesale\wholesale.csv")
data.head()

mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)

# Elbow method: fit K-Means for k = 1..14 and record the inertia (sum of squared distances)
sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)

plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('sum_of_squared_distances')
plt.title('Elbow Method for optimal k')
plt.show()
Output:
Conclusion:
• The elbow method helps in determining the optimal number of clusters for
the dataset. The point where the rate of decrease in the sum of squared
distances significantly slows down suggests a suitable number of clusters.
• We conclude that the optimal number of clusters for the data is 5.
• The optimal number of clusters identified using this method can be used
for further analysis or segmentation of customers based on their purchasing
behavior.
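Once an elbow point has been chosen, the clustering can be applied with that value of k. The short sketch below is a continuation of the code above (it reuses data, data_transformed, and the KMeans import) and uses 5 clusters only because that is the number stated in the conclusion.

# Fit K-Means with the chosen number of clusters and attach a cluster label to each customer
km_final = KMeans(n_clusters=5, random_state=42)
data['cluster'] = km_final.fit_predict(data_transformed)

# Size of each customer segment
print(data['cluster'].value_counts())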
Practical 9
Principal Component Analysis (PCA)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
X = iris_df.drop('target', axis=1)
y = iris_df['target']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance_ratio = pca.explained_variance_ratio_
plt.figure(figsize=(8, 6))
plt.plot(np.cumsum(explained_variance_ratio), marker='o', linestyle='--')
plt.title('Explained Variance Ratio')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(True)
plt.show()
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
n_components = np.argmax(cumulative_variance_ratio >= 0.95) + 1
print(f"Number of principal components to explain 95% variance:
{n_components}")
pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_scaled)
plt.figure(figsize=(8, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', s=50, alpha=0.5)
plt.title('Data in Reduced-dimensional Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Target')
plt.show()
Output:
Conclusion:
In conclusion, the code demonstrates the effectiveness of PCA in reducing the
dimensionality of high-dimensional datasets while preserving essential information.
It provides a systematic approach to exploring and visualizing complex datasets,
thereby aiding in data analysis and interpretation.
Practical 10
Data Visualization and Storytelling
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Generate random data
np.random.seed(42) # Set a seed for reproducibility
# Create a DataFrame with random data
data = pd.DataFrame({
'variable1': np.random.normal(0, 1, 1000),
'variable2': np.random.normal(2, 2, 1000) + 0.5 * np.random.normal(0, 1, 1000),
'variable3': np.random.normal(-1, 1.5, 1000),
'category': pd.Series(np.random.choice(['A', 'B', 'C', 'D'], size=1000, p=[0.4, 0.3, 0.2, 0.1]), dtype='category')
})
# Create a scatter plot to visualize the relationship between two variables
plt.figure(figsize=(10, 6))
plt.scatter(data['variable1'], data['variable2'], alpha=0.5)
plt.title('Relationship between Variable 1 and Variable 2', fontsize=16)
plt.xlabel('Variable 1', fontsize=14)
plt.ylabel('Variable 2', fontsize=14)
plt.show()
# Create a bar chart to visualize the distribution of a categorical variable
plt.figure(figsize=(10, 6))
sns.countplot(x='category', data=data)
plt.title('Distribution of Categories', fontsize=16)
plt.xlabel('Category', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(rotation=45)
plt.show()
# Create a heatmap to visualize the correlation between numerical variables
plt.figure(figsize=(10, 8))
numerical_cols = ['variable1', 'variable2', 'variable3']
sns.heatmap(data[numerical_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap', fontsize=16)
plt.show()
# Data Storytelling
print("Title: Exploring the Relationship between Variable 1 and Variable 2")
print("\nThe scatter plot (Figure 1) shows the relationship between Variable 1 and
Variable 2. ")
print("\nScatter Plot")
print("Figure 1: Scatter Plot of Variable 1 and Variable 2")
print("\nTo better understand the distribution of the categorical variable 'category',
we created a ")
print("\nBar Chart")
print("Figure 2: Distribution of Categories")
print("\nAdditionally, we explored the correlation between numerical variables using
a heatmap ")
print("\nHeatmap")
print("Figure 3: Correlation Heatmap")
print("\nIn summary, the visualizations and analysis provide insights into the
relationships ")
Output: