TC2-Lab Manual
TC- 1
Principle of Data Science & Engineering (DS)
Year - 3rd
Semester - 5th
Lab Manual
Aim: -
Programming problems on Probability basics (Without Libraries)
Theory :
In this experiment, we learn how to read and manipulate CSV (Comma Separated Values) files in
Python using the built-in csv module. A CSV file is a simple text file where each line represents
a row of data, and each value within that row is separated by commas.
The process involves opening the CSV file, reading its contents, and then processing the data as
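needed. Python provides several ways to interact with CSV files, including `csv.reader`, the file
methods `readline()` and `readlines()`, and manual splitting of each line on commas.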
Code:
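A minimal sketch of the reading approaches described in this experiment, using only the standard library. The sample data, the column names, and the temporary file are assumptions made for illustration; only the list name `li` is taken from this manual.

```python
import csv
import os
import tempfile

# Hypothetical sample data; the manual's actual CSV file is not shown.
SAMPLE = "name,score\nAsha,82\nRavi,91\nMeera,77\n"

# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(SAMPLE)
    path = tmp.name

# 1. csv.reader: parses each line into a list of column values.
with open(path, newline="") as f:
    rows = list(csv.reader(f))

# 2. readline(): reads one line at a time (here, only the header).
with open(path) as f:
    header_line = f.readline().strip()

# 3. readlines(): reads all lines into a list of strings.
with open(path) as f:
    all_lines = f.readlines()

# 4. Manual reading: split each line on commas and collect the pieces.
li = []
with open(path) as f:
    for line in f:
        li.append(line.strip().split(","))

os.remove(path)  # clean up the temporary file

print(rows[0])
print(header_line)
print(len(all_lines))
print(len(li))
```

Manual splitting matches `csv.reader` here only because no field contains a quoted comma; for real data `csv.reader` is the safer choice.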
Output:
Conclusion:
The experiment demonstrated how to read and process a CSV file using Python. `csv.reader`
parses the file row by row, while `readline()` reads a single line and `readlines()` reads all lines at
once. Manually reading and splitting lines was also shown, but `csv.reader` is recommended for
structured data because it handles quoted fields that contain commas. The length of the list `li` is
the number of rows read manually from the file.
Lab-2
Aim: -
Performing an Exploratory Data Analysis (EDA) on the Iris Dataset and calculating Probability
Mass Functions (PMF), Probability Density Functions (PDF), and Cumulative Distribution
Functions (CDF).
Objective:
- Explore and understand the Iris Dataset.
- Calculate and visualize the Probability Mass Function (PMF) for a specific variable.
- Calculate and visualize the Probability Density Function (PDF) for a specific variable.
- Calculate and visualize the Cumulative Distribution Function (CDF) for a specific variable.
Theory:
● We started by loading the Iris Dataset using Python's data manipulation libraries (e.g.,
Pandas) and explored its basic characteristics. This included checking for missing data,
summary statistics, and the structure of the dataset.
● For this lab report, we focused on the "sepal length" variable from the Iris Dataset. This
variable represents the sepal length of iris flowers.
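● We calculated the Probability Mass Function (PMF) for the "sepal length" variable. The
PMF gives the probability of each discrete value; sepal length is recorded to one decimal
place, so each distinct recorded value is treated as a discrete outcome. We used Python
libraries like NumPy and Matplotlib to create a PMF plot.
● We estimated the Probability Density Function (PDF) for the "sepal length" variable. The
PDF describes how probability is spread over a continuous range of values.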
● Finally, we computed the Cumulative Distribution Function (CDF) for the "sepal length"
variable. The CDF gives the probability that a random variable takes on a value less
than or equal to a specific value. We used Python libraries to create a CDF plot.
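The CDF definition above can be sketched without any plotting library. The ten values below are a small illustrative sample, not the full Iris dataset, and the function name `empirical_cdf` is hypothetical.

```python
from bisect import bisect_right

# Illustrative sepal lengths in cm (an assumption, not the full dataset).
sepal_length = [5.1, 4.9, 4.7, 5.0, 5.4, 6.3, 5.8, 7.1, 6.3, 5.0]

def empirical_cdf(values):
    """Return (x, P(X <= x)) pairs for each distinct value, in ascending order."""
    xs = sorted(values)
    n = len(xs)
    # bisect_right counts how many observations are <= x, including ties.
    return [(x, bisect_right(xs, x) / n) for x in sorted(set(xs))]

cdf = empirical_cdf(sepal_length)
for x, p in cdf:
    print(f"P(X <= {x}) = {p:.1f}")
```

Plotting these pairs as a step function would give the CDF plot described in the theory; the final pair always has probability 1.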
Code:
2.1 Probability Mass Function (PMF):
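A minimal sketch of a PMF computation, assuming each distinct recorded sepal length is treated as a discrete outcome. The sample values and the function name `pmf` are illustrative assumptions; the manual's own listing used NumPy and Matplotlib.

```python
from collections import Counter

# Illustrative sepal lengths in cm (an assumption, not the full dataset).
sepal_length = [5.1, 4.9, 4.7, 5.0, 5.4, 6.3, 5.8, 7.1, 6.3, 5.0]

def pmf(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    n = len(values)
    return {x: c / n for x, c in sorted(counts.items())}

p = pmf(sepal_length)
for x, prob in p.items():
    print(f"P(X = {x}) = {prob:.1f}")
```

The probabilities sum to 1, and a bar chart of `p` would serve as the PMF plot.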
Conclusion:
After performing the above steps, we can analyze the plots and results to draw insights into the
"sepal length" variable.
● PMF: The PMF plot will show the probability of different discrete values of "sepal
length."
● PDF: The PDF plot will show the smooth probability distribution of "sepal length,"
indicating how the values are spread.
● CDF: The CDF plot will indicate the cumulative probability of the "sepal length" values
up to a certain point.
Lab-3
Aim: -
To generate samples from different types of probability distributions and visualize them as
histograms using the Matplotlib library.
Theory:
5 Types of Distribution:
1. Uniform Distribution:
The uniform distribution assigns equal probability to every value within a given range.
2. Normal Distribution:
The normal distribution is a continuous probability distribution characterized by its
bell-shaped curve. It is fully defined by its mean (μ) and standard deviation (σ).
3. Exponential Distribution:
The exponential distribution describes the time between events in a Poisson process. It
is characterized by a rate parameter (λ), which determines the average rate of events
per unit time.
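4. Binomial Distribution:
The binomial distribution describes the number of successes in a fixed number of
independent trials (n), each with the same probability of success (p).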
5. Poisson Distribution:
The Poisson distribution describes the number of events occurring in a fixed interval of
time or space. It is characterized by a rate parameter (λ), which represents the average
rate of events.
Code:
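A minimal sketch of how the five distributions might be sampled with NumPy and drawn with Matplotlib. The seed, sample size, and every distribution parameter are assumptions chosen for illustration, and `plot_histograms` is a hypothetical helper; it is defined but not called, so the sampling part runs without a display.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility
num_samples = 10_000                  # assumed sample size

samples = {
    "Uniform(0, 1)": rng.uniform(low=0.0, high=1.0, size=num_samples),
    "Normal(mu=0, sigma=1)": rng.normal(loc=0.0, scale=1.0, size=num_samples),
    # NumPy parameterizes the exponential by scale = 1 / lambda.
    "Exponential(lambda=1)": rng.exponential(scale=1.0, size=num_samples),
    "Binomial(n=10, p=0.5)": rng.binomial(n=10, p=0.5, size=num_samples),
    "Poisson(lambda=3)": rng.poisson(lam=3.0, size=num_samples),
}

def plot_histograms(samples, bins=30):
    """Draw one histogram per distribution (requires Matplotlib)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, len(samples), figsize=(4 * len(samples), 3))
    for ax, (name, data) in zip(axes, samples.items()):
        ax.hist(data, bins=bins, edgecolor="black")
        ax.set_title(name)
        ax.set_xlabel("value")
        ax.set_ylabel("frequency")
    fig.tight_layout()
    plt.show()

# Summary statistics give a quick check before plotting.
for name, data in samples.items():
    print(f"{name}: mean={data.mean():.2f}, std={data.std():.2f}")
```

Calling `plot_histograms(samples)` would draw the five histograms side by side.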
Output:
Conclusion:
In this experiment, we explored and visualized five different types of probability distributions
using histograms generated from random data. Each distribution was generated using NumPy's
random number generation functions, and we used Matplotlib to create clear visualizations for
each distribution.
Overall Observations:
Each distribution has unique characteristics that are well-represented through the histograms.
The normal distribution displayed the familiar bell curve, with values concentrated symmetrically
around the mean.
The exponential distribution was right-skewed: short intervals between events are most common,
and long intervals are rare.
The binomial distribution showed discrete counts, reflecting the number of successes across
success/failure trials.
The Poisson distribution also produced discrete counts; events occur at a constant average rate,
but the count in any single interval varies, and the histogram is right-skewed when that rate is small.
These visualizations were crucial in understanding the fundamental characteristics and behavior
of different probability distributions, which are widely used in data science and statistics for
modeling real-world phenomena.
Lab-4
Aim: -
The purpose of this lab is to introduce hypothesis testing using statistical methods in Python,
focusing on hypothesis tests like the t-test, ANOVA, and chi-square test. By applying these
techniques to the well-known Iris dataset, you will learn how to test assumptions about
population means and relationships between categorical variables.
Theory: -
Hypothesis testing is a statistical method used to make inferences or draw conclusions about a
population based on a sample of data. It helps in determining whether there is enough evidence
in a sample of data to infer that a certain condition is true for the entire population.
● Null Hypothesis (H₀): The statement that there is no effect or no difference. It is what you
try to disprove or reject.
● Alternative Hypothesis (H₁): The statement that there is an effect or a difference. It is
what you want to prove.
● p-value: The probability of observing results at least as extreme as those in the sample,
assuming the null hypothesis is true. A small p-value (conventionally < 0.05) indicates strong
evidence against the null hypothesis.
● Significance Level (α): A threshold (commonly 0.05) used to decide whether to reject the
null hypothesis.
● Test Statistic: A value calculated from the data used to determine whether to reject the
null hypothesis.
Dataset: The Iris Dataset
The Iris dataset is one of the most famous datasets in the field of machine learning. It consists of
150 observations with the following features: sepal length (cm), sepal width (cm), petal length
(cm), petal width (cm), and species (Iris-setosa, Iris-versicolor, and Iris-virginica).
Each observation represents a different iris flower from one of the three species, and the dataset
contains measurements for each flower's sepals and petals.
Objective (t-test):
To test if there is a significant difference in the sepal lengths between the species Iris-setosa and
Iris-versicolor.
Hypotheses:
● Null Hypothesis (H₀): There is no significant difference between the mean sepal lengths
of setosa and versicolor species. (μ₁ = μ₂)
● Alternative Hypothesis (H₁): There is a significant difference between the mean sepal
lengths of setosa and versicolor species. (μ₁ ≠ μ₂)
Interpretation:
If the p-value is less than 0.05, reject the null hypothesis, meaning there is a statistically
significant difference in sepal lengths between setosa and versicolor.
If the p-value is greater than 0.05, fail to reject the null hypothesis, meaning there is no
significant difference in sepal lengths.
Objective (ANOVA):
To test if there is a significant difference in the sepal lengths across all three species (setosa,
versicolor, and virginica).
Hypotheses:
Null Hypothesis (H₀): The means of sepal lengths are equal for all species. (μ₁ = μ₂ = μ₃)
Alternative Hypothesis (H₁): At least one species has a different mean sepal length. (μ₁ ≠ μ₂ or μ₁
≠ μ₃, etc.)
Interpretation:
If the p-value is less than 0.05, reject the null hypothesis, indicating that at least one species has a
significantly different mean sepal length.
If the p-value is greater than 0.05, fail to reject the null hypothesis, suggesting that the means are
not significantly different across species.
Objective (chi-square test):
To test whether there is a relationship between species and different categories of sepal width
(e.g., narrow, medium, wide).
Hypotheses:
- Null Hypothesis (H₀): There is no relationship between species and sepal width categories (i.e.,
the two variables are independent).
- Alternative Hypothesis (H₁): There is a relationship between species and sepal width categories
(i.e., the two variables are dependent).
Interpretation:
If the p-value is less than 0.05, reject the null hypothesis, indicating that sepal width and species
are related (dependent).
If the p-value is greater than 0.05, fail to reject the null hypothesis, meaning the data do not
provide evidence that sepal width and species are related.
Code:
4.2 Boxplot
4.3 T-test
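A minimal sketch of the two-sample t-test described in the theory, assuming SciPy is available. The two lists are small illustrative samples of sepal length, not the full 50 observations per species, so the printed numbers will differ from the lab's actual output.

```python
from scipy import stats

# Illustrative sepal lengths in cm (assumed samples, ten per species).
setosa = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
versicolor = [7.0, 6.4, 6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2]

# Independent two-sample t-test; the default assumes equal variances.
t_stat, p_value = stats.ttest_ind(setosa, versicolor)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
if p_value < alpha:
    print("Reject H0: the mean sepal lengths differ.")
else:
    print("Fail to reject H0.")
```

Passing `equal_var=False` to `ttest_ind` would run Welch's t-test instead, which does not assume equal variances in the two groups.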
4.4 Chi-square Test
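A minimal sketch of the chi-square test of independence, assuming SciPy is available. The contingency table holds hypothetical counts of flowers per species and sepal width category; in the lab these counts would come from binning the real sepal width values.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts. Rows: setosa, versicolor, virginica.
# Columns: narrow, medium, wide sepal width.
observed = [
    [2, 15, 33],
    [25, 20, 5],
    [15, 28, 7],
]

chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.2e}")
if p_value < alpha:
    print("Reject H0: species and sepal width category are related.")
else:
    print("Fail to reject H0.")
```

The degrees of freedom equal (rows - 1) x (columns - 1), and `expected` holds the counts that would be expected if the two variables were independent.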
4.5 ANOVA
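A minimal sketch of one-way ANOVA across the three species, assuming SciPy is available. The lists are small illustrative samples of sepal length rather than the full dataset.

```python
from scipy import stats

# Illustrative sepal lengths in cm (assumed samples, ten per species).
setosa = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
versicolor = [7.0, 6.4, 6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2]
virginica = [6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 4.9, 7.3, 6.7, 7.2]

# One-way ANOVA: compares between-group to within-group variance.
f_stat, p_value = stats.f_oneway(setosa, versicolor, virginica)

alpha = 0.05
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")
if p_value < alpha:
    print("Reject H0: at least one species mean differs.")
else:
    print("Fail to reject H0.")
```

A significant result says only that some mean differs; identifying which pairs differ would need a follow-up (post hoc) comparison.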
Output:
4.3 T-test
4.5 ANOVA
Conclusion:
In this lab on Hypothesis Testing using the Iris dataset, we applied three statistical tests: the
t-test, ANOVA, and the chi-square test.
These tests allowed us to draw conclusions about the differences in sepal lengths and the
relationship between species and sepal width categories, highlighting the importance of
hypothesis testing in data analysis.
Lab – 5
Aim: -
● To understand the importance of feature engineering in preparing data for machine learning
models.
● To learn how to preprocess data by handling missing values, encoding categorical variables,
and scaling numerical features.
● To perform feature selection using statistical methods and machine learning models.
Theory: -
● Feature Engineering: The process of creating new input features from your existing data.
● Feature Selection: The process of selecting the most important and relevant features for
the model.
Why is it important: It improves model accuracy, reduces model complexity and training time, and
helps prevent overfitting by removing irrelevant features.
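A minimal sketch of the preprocessing and feature creation steps, written with plain Python so each step is visible. The records are hypothetical, and the column names `SibSp`, `Parch`, `Sex`, and `Age` are assumptions that follow the common Titanic naming; only `FamilySize` and `IsAlone` are named in this manual.

```python
from statistics import median

# Hypothetical passenger records (assumed columns and values).
rows = [
    {"Sex": "male", "Age": 22.0, "SibSp": 1, "Parch": 0},
    {"Sex": "female", "Age": None, "SibSp": 0, "Parch": 0},
    {"Sex": "female", "Age": 38.0, "SibSp": 1, "Parch": 2},
    {"Sex": "male", "Age": 30.0, "SibSp": 0, "Parch": 0},
]

# 1. Handle missing values: fill missing ages with the median known age.
age_median = median(r["Age"] for r in rows if r["Age"] is not None)
for r in rows:
    if r["Age"] is None:
        r["Age"] = age_median

# 2. Encode the categorical variable as an integer.
sex_codes = {"male": 0, "female": 1}
for r in rows:
    r["Sex"] = sex_codes[r["Sex"]]

# 3. Create new features from existing ones.
for r in rows:
    r["FamilySize"] = r["SibSp"] + r["Parch"] + 1  # +1 counts the passenger
    r["IsAlone"] = int(r["FamilySize"] == 1)

# 4. Min-max scale a numerical feature to the range [0, 1].
ages = [r["Age"] for r in rows]
lo, hi = min(ages), max(ages)
for r in rows:
    r["AgeScaled"] = (r["Age"] - lo) / (hi - lo)

for r in rows:
    print(r)
```

Median imputation and min-max scaling are one choice among several; mean imputation or standardization would follow the same pattern.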
Code:
1. Import Libraries.
2. Load the dataset.
6. Feature Scaling.
7. Feature Selection Mechanisms.
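One simple statistical selection mechanism is to rank features by the absolute Pearson correlation with the target and keep the top few. The sketch below assumes that approach; the feature values, the target labels, and the helper `pearson` are hypothetical, and the lab itself may have used other methods.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical feature columns and binary target.
features = {
    "FamilySize": [2, 1, 4, 1, 3, 1],
    "AgeScaled": [0.0, 0.5, 1.0, 0.5, 0.2, 0.8],
    "Sex": [0, 1, 1, 0, 1, 0],
}
target = [0, 1, 1, 0, 1, 0]

# Score each feature, then rank from most to least correlated.
scores = {name: abs(pearson(col, target)) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
top_k = ranked[:2]  # keep the two highest-scoring features

for name in ranked:
    print(f"{name}: |r| = {scores[name]:.3f}")
print("Selected:", top_k)
```

Correlation captures only linear, one-feature-at-a-time relationships, so it is a quick filter rather than a full selection procedure.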
Conclusion:
In this lab, we focused on Feature Engineering and Feature Selection techniques to prepare data
for machine learning models. Key activities included:
1. Data Preprocessing: Handling missing values and encoding categorical variables to make the
data suitable for machine learning.
2. Feature Creation: We generated new features (e.g., FamilySize, IsAlone) to improve the
model's predictive power.
Overall, this lab demonstrated how effective feature engineering and selection can enhance
model performance, improve efficiency, and prevent overfitting in machine learning tasks.
Lab – 6
Aim: -
● To identify and handle common problems in data entities, such as duplicates, missing
values, outliers, and inconsistencies.
● To understand the impact of data quality issues on machine learning model performance.
● To apply preprocessing techniques to improve data quality and model readiness.
Theory:
● Duplicates: Rows or entries that repeat without adding new information, leading to
redundancy and inflated data size.
● Missing Values: Absences in the data that can lead to biased analysis or prevent algorithms
from functioning properly.
● Outliers: Extreme values that diverge from the main distribution, potentially skewing analysis
and model results.
● Inconsistencies: Variations in formatting, categorization, or units (e.g., “M” vs. “Male” for
gender) that cause the same information to be recorded in different ways.
● Data Quality Directly Affects Model Performance: Clean data allows models to capture
patterns more accurately.
● Reduces Bias and Overfitting: Correcting issues like duplicates and outliers produces a more
balanced dataset.
● Efficient Data Processing: Reducing redundancy and handling missing values optimize
computational resources, speeding up analysis.
Code:
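A minimal sketch of the four cleaning steps from the theory, using only the standard library. The records, the 1.5 x IQR outlier rule, and median imputation are assumptions chosen for illustration, not necessarily the methods used in the lab.

```python
from statistics import median, quantiles

# Hypothetical records containing each kind of problem.
records = [
    {"id": 1, "gender": "M", "age": 25},
    {"id": 2, "gender": "Male", "age": 31},
    {"id": 2, "gender": "Male", "age": 31},      # exact duplicate
    {"id": 3, "gender": "female", "age": None},  # missing value
    {"id": 4, "gender": "F", "age": 29},
    {"id": 5, "gender": "M", "age": 27},
    {"id": 6, "gender": "F", "age": 240},        # implausible outlier
    {"id": 7, "gender": "male", "age": 33},
    {"id": 8, "gender": "Female", "age": 26},
    {"id": 9, "gender": "M", "age": 30},
]

# 1. Duplicates: keep the first occurrence of each identical record.
seen, unique = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        unique.append(dict(r))

# 2. Inconsistencies: map spelling variants to one label.
gender_map = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
for r in unique:
    r["gender"] = gender_map[r["gender"].strip().lower()]

# 3. Outliers: drop ages outside 1.5 * IQR of the known ages.
known = [r["age"] for r in unique if r["age"] is not None]
q1, _, q3 = quantiles(known, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [r for r in unique if r["age"] is None or low <= r["age"] <= high]

# 4. Missing values: impute with the median of the remaining ages.
age_median = median(r["age"] for r in cleaned if r["age"] is not None)
for r in cleaned:
    if r["age"] is None:
        r["age"] = age_median

print(len(records), "->", len(cleaned), "records")
```

The outlier step runs before imputation so that the extreme value does not distort the median used to fill the gap.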
Output: