Regarding Lab Test

The document provides sample questions and code for a lab exam on data analysis and machine learning algorithms. It includes two tasks - the first involves generating synthetic data to study the relationship between study hours and exam scores. The second uses the Iris dataset to answer 10 questions covering descriptive statistics, data visualization, correlation analysis, preprocessing, hypothesis testing, decision trees, SVM, clustering, PCA, and model evaluation. The questions are designed to assess proficiency with Python libraries like Pandas, NumPy, Matplotlib, Scikit-Learn, and basic machine learning algorithms. Sample code is provided to demonstrate solutions to the questions.

Uploaded by

aman raj

Sample Questions For Lab Test

*** You must use Python and Jupyter Notebook. No other platform
will be entertained. There will be two tasks in the exam: a) a task on a
synthetic dataset and b) a task on an original dataset.
In the exam, the tasks, especially the type of synthetic data generation and
the dataset, may be different, so be careful. Marks will be given only for
fully executable code.
** Learn all the questions and go through the code properly for the exam.
** Execute the same code for practice.
** There will be two theory questions as well.

Task 1: Statistical Analysis on synthetic data

Suppose you have a dataset that represents the relationship between
the number of hours students spend studying ('X') and their exam
scores ('Y'). You need to generate random samples and explore
statistical observations (mean, median, mode, etc.).

Related questions:
Question 1: Data Generation and Visualization

Explain the purpose of the numpy seed (np.random.seed(42)) in the
code. How does it affect the reproducibility of the dataset?

Describe the process of generating the hypothetical dataset. What are
the variables in the dataset, and how are they related?

Explain the significance of the scatter plot generated using
plt.scatter. What insights can you gain from the visualization?

Question 2: Descriptive Statistics

Calculate the mean, median, and standard deviation of the 'Study
Hours' variable in the dataset. Provide the numerical values.

Question 3: Correlation Analysis

Calculate the Pearson correlation coefficient between 'Study Hours'
and 'Exam Scores'. What does the correlation coefficient indicate about
the relationship between these variables?

What is the role of the p-value in correlation analysis? How is it
interpreted in the context of this dataset?

Question 4: Linear Regression

Explain the purpose of using linear regression in this analysis. How is
the linear regression model fitted to the data?

Interpret the coefficients of the linear regression model. What do they
signify about the relationship between 'Study Hours' and 'Exam
Scores'?
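For Question 1, the effect of the seed can be checked directly. The sketch below reuses the data-generation formula from the sample code and wraps it in a helper called make_dataset, which is a hypothetical name introduced only for this illustration. It shows that the same seed reproduces the same arrays, while a different seed (7 is an arbitrary choice) does not.

```python
import numpy as np


def make_dataset(seed, n=50):
    """Generate study hours and exam scores after fixing the NumPy seed.

    Hypothetical helper for illustration; the formula matches the sample code.
    """
    np.random.seed(seed)
    study_hours = np.random.uniform(1, 10, n)
    exam_scores = 50 + 10 * study_hours + np.random.normal(0, 5, n)
    return study_hours, exam_scores


# Two runs with the same seed and one run with a different seed
hours_a, scores_a = make_dataset(42)
hours_b, scores_b = make_dataset(42)
hours_c, scores_c = make_dataset(7)

same_seed_identical = (np.array_equal(hours_a, hours_b)
                       and np.array_equal(scores_a, scores_b))
different_seed_identical = np.array_equal(hours_a, hours_c)

print(f'Same seed gives identical data: {same_seed_identical}')
print(f'Different seed gives identical data: {different_seed_identical}')
```

Because the random draws are fixed by the seed, every statistic computed from the dataset (mean, correlation, regression coefficients) is also the same on each run.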

Sample Code:

Solution: Run in Jupyter Notebook

# Import necessary libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Create a hypothetical dataset


np.random.seed(42)  # Setting seed for reproducibility
# Generating 50 random study hours, uniformly distributed between 1 and 10
study_hours = np.random.uniform(1, 10, 50)
# Generating exam scores as a linear function of study hours plus Gaussian noise
exam_scores = 50 + 10 * study_hours + np.random.normal(0, 5, 50)

# Create a DataFrame
data = pd.DataFrame({'Study Hours': study_hours, 'Exam Scores':
exam_scores})

# Visualize the data


plt.scatter(data['Study Hours'], data['Exam Scores'])
plt.title('Relationship between Study Hours and Exam Scores')
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.show()

# Calculate and print descriptive statistics


# Question 2 asks for all three statistics of the 'Study Hours' variable
mean_study_hours = data['Study Hours'].mean()
median_study_hours = data['Study Hours'].median()
std_study_hours = data['Study Hours'].std()
print(f'Mean Study Hours: {mean_study_hours}')
print(f'Median Study Hours: {median_study_hours}')
print(f'Standard Deviation Study Hours: {std_study_hours}')

# Calculate Pearson correlation coefficient


correlation_coefficient, p_value = pearsonr(data['Study Hours'],
data['Exam Scores'])
print(f'Pearson Correlation Coefficient: {correlation_coefficient}')
print(f'P-value: {p_value}')

# Perform a linear regression


from sklearn.linear_model import LinearRegression

X = data[['Study Hours']]
y = data['Exam Scores']

regressor = LinearRegression()
regressor.fit(X, y)

# Visualize the linear regression line


plt.scatter(data['Study Hours'], data['Exam Scores'])
plt.plot(data['Study Hours'], regressor.predict(X), color='red',
linewidth=3)
plt.title('Linear Regression: Study Hours vs Exam Scores')
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.show()
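Question 4 asks for an interpretation of the coefficients, which the sample code fits but does not print. A minimal sketch, assuming the same synthetic data as above, reads them from the fitted model. The variable names slope, intercept, and r_squared are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate the synthetic dataset from the sample code
np.random.seed(42)
study_hours = np.random.uniform(1, 10, 50)
exam_scores = 50 + 10 * study_hours + np.random.normal(0, 5, 50)

# scikit-learn expects a 2D feature matrix, so reshape to (n_samples, 1)
X = study_hours.reshape(-1, 1)
regressor = LinearRegression()
regressor.fit(X, exam_scores)

slope = regressor.coef_[0]            # change in exam score per extra study hour
intercept = regressor.intercept_      # predicted score at zero study hours
r_squared = regressor.score(X, exam_scores)  # share of variance explained

print(f'Slope: {slope}')
print(f'Intercept: {intercept}')
print(f'R^2: {r_squared}')
```

Since the data were generated as 50 + 10 * study_hours plus noise, the fitted slope and intercept should land near 10 and 50. The slope is read as the expected gain in exam score for each additional hour of study.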

Task 2: Questions on Given Dataset

*** For the exam, the dataset may be changed to MNIST, Glass, Wine, etc.
Use the Iris dataset and perform the following (in the exam the dataset may differ
and a few more questions could be added):
** For PCA, you are supposed to use SVD directly, not the PCA class used in this code.

Question 1: Descriptive Statistics


Describe the dataset: number of features, number of samples, etc.
You must print the attribute names and the classes associated with the dataset.

Like this:
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset


iris = load_iris()

# Create a DataFrame to better visualize the dataset


iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Display basic information about the dataset


print("Number of Features:", iris_df.shape[1])
print("Number of Samples:", iris_df.shape[0])
print("\nAttribute Names (Features):")
print(iris_df.columns.tolist())
print("\nClasses (Target Names):")
print(iris.target_names)

Provide the mean, median, and standard deviation for the sepal length
of the Iris dataset.

Question 2: Data Visualization


Create a box plot to compare the distribution of petal width for each
species in the Iris dataset.

Question 3: Correlation Analysis


Calculate the Pearson correlation coefficient between sepal length and
petal length.

Question 4: Data Preprocessing


Normalize the sepal width values in the Iris dataset using the Min-Max
scaling method.

Question 5: Hypothesis Testing


Perform a t-test to check if there is a significant difference in the mean
sepal length between the setosa and versicolor species.

Question 6: Decision Trees


Build a decision tree classifier for the Iris dataset using the Gini index
as the splitting criterion. Evaluate its accuracy on a test set.

Question 7: Support Vector Machines (SVM)


Implement an RBF SVM classifier on the Iris dataset and report the
accuracy on a test set.

Question 8: K-Means Clustering


Apply the K-Means clustering algorithm to group the data points into
three clusters based on sepal length and sepal width. Visualize the
resulting clusters.

Question 9: Principal Component Analysis (PCA)


Perform PCA using SVD on the Iris dataset and determine the
proportion of variance explained by each principal component. Print
the eigenvectors.

Question 10: Model Evaluation


Compare the performance of a logistic regression classifier, a support
vector machine, and a k-nearest neighbors (KNN) classifier on the Iris
dataset using a suitable evaluation metric (e.g., accuracy, precision,
recall).
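For Question 9, the note above says PCA should be done with SVD. One possible sketch, using only np.linalg.svd on the mean-centered data, is shown below. The variable names are chosen for this illustration, and the signs of the eigenvectors are arbitrary, so they may be flipped compared with other implementations.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the four Iris features and center each column at zero mean
X = load_iris().data
X_centered = X - X.mean(axis=0)
n_samples = X_centered.shape[0]

# Thin SVD: X_centered = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions (eigenvectors of the covariance matrix)
eigen_vectors = Vt

# Eigenvalues of the covariance matrix follow from the singular values
explained_variance = S ** 2 / (n_samples - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Project the data onto the principal components
X_projected = X_centered @ Vt.T

print('Explained variance ratio:', explained_variance_ratio)
print('Eigenvectors (one per row):')
print(eigen_vectors)
```

The link to the covariance matrix is that C = X_centered.T @ X_centered / (n - 1), so the right singular vectors of the centered data are the eigenvectors of C and the squared singular values divided by (n - 1) are its eigenvalues.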

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, ttest_ind
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.datasets import load_iris

# Load Iris dataset


iris = load_iris()
iris_data = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
columns=iris['feature_names'] + ['species'])
# Question 1: Descriptive Statistics
mean_sepal_length = iris_data['sepal length (cm)'].mean()
median_sepal_length = iris_data['sepal length (cm)'].median()
std_sepal_length = iris_data['sepal length (cm)'].std()

# Question 2: Data Visualization


plt.figure(figsize=(10, 6))
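species_ids = sorted(iris_data['species'].unique())
# Collect one array of petal widths per species and draw them in a single call
petal_widths = [iris_data.loc[iris_data['species'] == s, 'petal width (cm)'].to_numpy()
                for s in species_ids]
plt.boxplot(petal_widths)
plt.xticks(range(1, len(species_ids) + 1), [iris.target_names[int(s)] for s in species_ids])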

plt.title('Box Plot - Petal Width Distribution for Each Species')


plt.xlabel('Species')
plt.ylabel('Petal Width (cm)')
plt.show()

# Question 3: Correlation Analysis


pearson_corr, _ = pearsonr(iris_data['sepal length (cm)'],
                           iris_data['petal length (cm)'])

# Question 4: Data Preprocessing


scaler = MinMaxScaler()
# fit_transform returns a 2D array, so flatten it before storing it as a column
iris_data['sepal width (normalized)'] = scaler.fit_transform(iris_data[['sepal width (cm)']]).ravel()

# Question 5: Hypothesis Testing


# Species codes: 0 = setosa, 1 = versicolor
setosa_sepal_length = iris_data[iris_data['species'] == 0]['sepal length (cm)']
versicolor_sepal_length = iris_data[iris_data['species'] == 1]['sepal length (cm)']
t_stat, p_value = ttest_ind(setosa_sepal_length, versicolor_sepal_length)

# Question 6: Decision Trees


# Use only the four original features so that columns added in earlier questions are excluded
X_train, X_test, y_train, y_test = train_test_split(
    iris_data[iris['feature_names']], iris_data['species'], test_size=0.2, random_state=42)
dt_classifier = DecisionTreeClassifier(criterion='gini')
dt_classifier.fit(X_train, y_train)
dt_accuracy = accuracy_score(y_test, dt_classifier.predict(X_test))

# Question 7: Support Vector Machines (SVM)


svm_classifier = SVC(kernel='rbf')
svm_classifier.fit(X_train, y_train)
svm_accuracy = accuracy_score(y_test, svm_classifier.predict(X_test))

# Question 8: K-Means Clustering


kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # fixed seed for repeatable clusters
iris_data['cluster'] = kmeans.fit_predict(iris_data[['sepal length (cm)',
'sepal width (cm)']])
plt.scatter(iris_data['sepal length (cm)'], iris_data['sepal width (cm)'],
c=iris_data['cluster'], cmap='viridis')
plt.title('K-Means Clustering - Sepal Length vs Sepal Width')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.show()

# Question 9: Principal Component Analysis (PCA)


pca = PCA()
# Run PCA on the four original features only (not on derived or label columns)
iris_pca = pca.fit_transform(iris_data[iris['feature_names']])
explained_variance_ratio = pca.explained_variance_ratio_
eigen_vectors = pca.components_

# Question 10: Model Evaluation


logreg_classifier = LogisticRegression(max_iter=1000)  # extra iterations so the solver converges
knn_classifier = KNeighborsClassifier()

logreg_classifier.fit(X_train, y_train)
knn_classifier.fit(X_train, y_train)

logreg_accuracy = accuracy_score(y_test,
logreg_classifier.predict(X_test))
knn_accuracy = accuracy_score(y_test, knn_classifier.predict(X_test))

print(f'Mean Sepal Length: {mean_sepal_length}')


print(f'Median Sepal Length: {median_sepal_length}')
print(f'Standard Deviation Sepal Length: {std_sepal_length}')
print(f'Pearson Correlation Coefficient: {pearson_corr}')
print(f'Sepal Width Min-Max Scaled Values:\n{iris_data["sepal width (normalized)"]}')
print(f'T-Test p-value: {p_value}')
print(f'Decision Tree Accuracy: {dt_accuracy}')
print(f'SVM Classifier Accuracy: {svm_accuracy}')
print(f'Principal Components Explained Variance Ratio: {explained_variance_ratio}')
print(f'Eigen Vectors:\n{eigen_vectors}')
print(f'Logistic Regression Accuracy: {logreg_accuracy}')
print(f'KNN Classifier Accuracy: {knn_accuracy}')
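Question 10 also mentions precision and recall and includes the SVM in the comparison, while the code above reports accuracy only. A possible sketch that evaluates all three classifiers with the same train/test split is given below. Macro averaging over the three classes and max_iter=1000 are choices made for this illustration, not requirements from the handout.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Same split settings as the sample code: 20% test set, fixed random_state
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM (RBF)': SVC(kernel='rbf'),
    'KNN': KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Macro average: compute the metric per class, then take the unweighted mean
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, average='macro'),
        'recall': recall_score(y_test, y_pred, average='macro'),
    }
    print(name, results[name])
```

For a multi-class target such as Iris, precision_score and recall_score need an explicit average argument, because their default setting is meant for binary labels.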

Task 3: Theory questions

Explain PCA with a full mathematical description. How does it work?

What do you mean by the covariance matrix of a given dataset?
List all the statistical attributes, such as mean, mode, and variance,
with a suitable example.
Write the differences between Hadoop and Spark.
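For the covariance matrix and statistical attributes questions, a small worked example may help when preparing an answer. The sketch below uses a made-up toy dataset of five students (hours studied and exam score), which is an assumption for illustration only. It computes the sample covariance matrix from its definition and compares it with np.cov.

```python
import numpy as np
from statistics import mode

# Hypothetical toy data: each row is one student (hours studied, exam score)
X = np.array([
    [2.0, 65.0],
    [4.0, 70.0],
    [4.0, 80.0],
    [6.0, 85.0],
    [9.0, 95.0],
])
n = X.shape[0]

# Covariance matrix from the definition: center each column, then
# C = Xc^T Xc / (n - 1); C[i, j] is the covariance of features i and j
X_centered = X - X.mean(axis=0)
cov_manual = X_centered.T @ X_centered / (n - 1)
cov_numpy = np.cov(X, rowvar=False)  # same result from NumPy

# Basic statistical attributes of the first feature
hours = X[:, 0]
mean_hours = hours.mean()
median_hours = np.median(hours)
mode_hours = mode(hours.tolist())  # most frequent value
variance_hours = hours.var(ddof=1)  # sample variance
std_hours = hours.std(ddof=1)       # sample standard deviation

print('Covariance matrix:')
print(cov_manual)
print(mean_hours, median_hours, mode_hours, variance_hours, std_hours)
```

The diagonal entries of the covariance matrix are the variances of the individual features, and the off-diagonal entries show how two features vary together.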
