
DY Patil International University

TC-1
Principles of Data Science & Engineering (DS)
Year: 3rd
Semester: 5th

Lab Manual

Name: Bhavisha Chauhan
PRN: 20220802171
Batch: DS1

Dr Maheshwari Biradar (Faculty)
Dr Sulaxan Jadhav (Teaching Assistant)
INDEX

Sr. No.  Name

1. Programming problems on Probability basics (Without Libraries)
2. Distribution problem statements (Matplotlib)
3. Information Extraction from Data Sets through designing appropriate programming modules (NumPy, Pandas and Matplotlib)
4. Hypothesis testing problems (Use of NumPy, Pandas, SciPy, etc.)
5. Problems on Basic Feature Engineering & Selection Mechanisms (Use of NumPy, Pandas, SciPy, etc.)
6. Common Problem on data entities (Using all required Python Libraries) - Taking all together
Lab – 1

Aim:
Programming problems on Probability basics (Without Libraries)

Theory :

In this experiment, we learn how to read and manipulate CSV (Comma Separated Values) files in
Python using the built-in csv module. A CSV file is a simple text file where each line represents
a row of data, and each value within that row is separated by commas.

The process involves opening the CSV file, reading its contents, and then processing the data as
needed. Python provides several ways to interact with CSV files:

Code:
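A minimal sketch of the approaches described above, assuming a comma-separated file named `data.csv` in the working directory:

```python
import csv

# Approach 1: csv.reader parses each row into a list of values.
with open("data.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)

# Approach 2: readline() returns one line; readlines() returns all remaining lines.
with open("data.csv") as f:
    first_line = f.readline()
    remaining = f.readlines()

# Approach 3: manual reading and splitting on commas.
li = []
with open("data.csv") as f:
    for line in f:
        li.append(line.strip().split(","))

print(len(li))   # number of rows read manually from the file
```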
Output:

Conclusion:
The experiment demonstrated how to read and process a CSV file using Python. Using `csv.reader`
efficiently parses the file row-by-row, while `readline()` and `readlines()` are used for reading specific or
all lines. Manual reading and splitting of lines were also shown, but using `csv.reader` is recommended
for structured data. The length of the list `li` represents the number of rows read manually from the file.
Lab – 2

Aim:
Performing an Exploratory Data Analysis (EDA) on the Iris Dataset and calculating Probability
Mass Functions (PMF), Probability Density Functions (PDF), and Cumulative Distribution
Functions (CDF).

Objective:
- Explore and understand the Iris Dataset.
- Calculate and visualize the Probability Mass Function (PMF) for a specific variable.
- Calculate and visualize the Probability Density Function (PDF) for a specific variable.
- Calculate and visualize the Cumulative Distribution Function (CDF) for a specific variable.

Theory:

1. Data Loading and Exploration

● We started by loading the Iris Dataset using Python's data manipulation libraries (e.g.,
Pandas) and explored its basic characteristics. This included checking for missing data,
summary statistics, and the structure of the dataset.

2. Selecting a Variable of Interest

● For this lab report, we focused on the "sepal length" variable from the Iris Dataset, which
records the sepal length of each iris flower in centimeters.

3. Probability Mass Function (PMF)

● We calculated the Probability Mass Function (PMF) for the "sepal length" variable. The
PMF provides the probability distribution of discrete values. We used Python libraries
like NumPy and Matplotlib to create a PMF plot.

4. Probability Density Function (PDF)


● Next, we calculated the Probability Density Function (PDF) for the "sepal length"
variable. The PDF provides the probability distribution of continuous values. We used the
Seaborn library for creating a PDF plot.

5. Cumulative Distribution Function (CDF)

● Finally, we computed the Cumulative Distribution Function (CDF) for the "sepal length"
variable. The CDF represents the probability that a random variable takes on a value less
than or equal to a specific value. We used Python libraries to create a CDF plot.

Code:
2.1 Probability Mass Function (PMF):
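A minimal sketch, assuming the copy of the Iris dataset bundled with scikit-learn; the relative frequency of each distinct sepal-length value serves as the PMF:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame
sepal_length = df["sepal length (cm)"]

# PMF: probability of each distinct observed value.
pmf = sepal_length.value_counts(normalize=True).sort_index()

plt.bar(pmf.index, pmf.values, width=0.05)
plt.xlabel("Sepal length (cm)")
plt.ylabel("Probability")
plt.title("PMF of Sepal Length")
plt.show()
```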

2.2 Probability Density Function (PDF):
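A sketch of the kernel-density PDF plot with Seaborn, under the same loading assumption:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame

# PDF: smooth kernel density estimate of the continuous variable.
sns.kdeplot(df["sepal length (cm)"], fill=True)
plt.xlabel("Sepal length (cm)")
plt.title("PDF of Sepal Length")
plt.show()
```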

2.3 Cumulative Distribution Function (CDF):
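A sketch of the empirical CDF, under the same loading assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame
x = np.sort(df["sepal length (cm)"].to_numpy())

# Empirical CDF: fraction of observations <= each sorted value.
y = np.arange(1, len(x) + 1) / len(x)

plt.step(x, y, where="post")
plt.xlabel("Sepal length (cm)")
plt.ylabel("Cumulative probability")
plt.title("Empirical CDF of Sepal Length")
plt.show()
```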


Output:
2.1 Probability Mass Function (PMF):

2.2 Probability Density Function (PDF):


2.3 Cumulative Distribution Function (CDF):

Conclusion:
After performing the above steps, we can analyze the plots and results to draw insights into the
"sepal length" variable.

● PMF: The PMF plot will show the probability of different discrete values of "sepal
length."
● PDF: The PDF plot will show the smooth probability distribution of "sepal length,"
indicating how the values are spread.
● CDF: The CDF plot will indicate the cumulative probability of the "sepal length" values
up to a certain point.
Lab – 3

Aim:
To perform different types of distributions and visualize each by creating a histogram using the
Matplotlib library.

Theory:

Five Types of Distributions:

1. Uniform Distribution:

A uniform distribution is a probability distribution where all outcomes have equal
chances of occurring. In this case, the uniform distribution is defined between 0 and 1,
meaning any value between 0 and 1 is equally likely.

2. Normal Distribution:

A normal distribution (also known as Gaussian distribution) is a continuous probability
distribution characterized by its bell-shaped curve. It is fully defined by its mean (μ) and
standard deviation (σ).

3. Exponential Distribution:

The exponential distribution describes the time between events in a Poisson process. It
is characterized by a rate parameter (λ), which determines the average rate of events
per unit time.

4. Binomial Distribution:

The binomial distribution describes the number of successes (binary outcomes) in a
fixed number of independent Bernoulli trials. It is characterized by two parameters: the
number of trials (n) and the probability of success (p).

5. Poisson Distribution:
The Poisson distribution describes the number of events occurring in a fixed interval of
time or space. It is characterized by a rate parameter (λ), which represents the average
rate of events.

Code:

3.1 Uniform Distribution:
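A minimal sketch; the sample size (1,000) and bin count are assumed:

```python
import numpy as np
import matplotlib.pyplot as plt

# 1,000 samples drawn uniformly from [0, 1).
data = np.random.uniform(0, 1, 1000)
plt.hist(data, bins=30, edgecolor="black")
plt.title("Uniform Distribution")
plt.show()
```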

3.2 Normal Distribution:
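A sketch with assumed parameters μ = 0 and σ = 1:

```python
import numpy as np
import matplotlib.pyplot as plt

# 1,000 samples from N(0, 1).
data = np.random.normal(loc=0, scale=1, size=1000)
plt.hist(data, bins=30, edgecolor="black")
plt.title("Normal Distribution")
plt.show()
```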

3.3 Exponential Distribution:
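A sketch assuming λ = 1 (NumPy takes the scale parameter, which is 1/λ):

```python
import numpy as np
import matplotlib.pyplot as plt

# 1,000 samples; scale = 1/lambda, here 1.0.
data = np.random.exponential(scale=1.0, size=1000)
plt.hist(data, bins=30, edgecolor="black")
plt.title("Exponential Distribution")
plt.show()
```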

3.4 Binomial Distribution:
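A sketch with assumed parameters n = 10 and p = 0.5:

```python
import numpy as np
import matplotlib.pyplot as plt

# 1,000 experiments of 10 trials each with success probability 0.5.
data = np.random.binomial(n=10, p=0.5, size=1000)
plt.hist(data, bins=range(12), edgecolor="black", align="left")
plt.title("Binomial Distribution")
plt.show()
```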


3.5 Poisson Distribution:
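A sketch with an assumed rate λ = 3:

```python
import numpy as np
import matplotlib.pyplot as plt

# 1,000 samples with an average of 3 events per interval.
data = np.random.poisson(lam=3, size=1000)
plt.hist(data, bins=range(data.max() + 2), edgecolor="black", align="left")
plt.title("Poisson Distribution")
plt.show()
```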

Output:

3.1 Uniform Distribution:

3.2 Normal Distribution:


3.3 Exponential Distribution:

3.4 Binomial Distribution:

3.5 Poisson Distribution:


Conclusion:

In this experiment, we explored and visualized five different types of probability distributions
using histograms generated from random data. Each distribution was generated using NumPy's
random number generation functions, and we used Matplotlib to create clear visualizations for
each distribution.

Overall Observations:

Each distribution has unique characteristics that are well-represented through the histograms.

The uniform distribution had an even, flat spread of values.

The normal distribution displayed the familiar bell curve, showing central tendency.

The exponential distribution had a right-skew, demonstrating the occurrence of rare events.

The binomial distribution showed discrete counts, reflecting the success/failure trials.

The Poisson distribution illustrated a right-skew due to events occurring at a constant average
rate but with variable frequency in smaller intervals.

These visualizations were crucial in understanding the fundamental characteristics and behavior
of different probability distributions, which are widely used in data science and statistics for
modeling real-world phenomena.
Lab-4

Aim:

The purpose of this lab is to introduce hypothesis testing using statistical methods in Python,
focusing on hypothesis tests like the t-test, ANOVA, and chi-square test. By applying these
techniques to the well-known Iris dataset, you will learn how to test assumptions about
population means and relationships between categorical variables.

Theory:

Introduction to Hypothesis Testing

Hypothesis testing is a statistical method used to make inferences or draw conclusions about a
population based on a sample of data. It helps in determining whether there is enough evidence
in a sample of data to infer that a certain condition is true for the entire population.

Key concepts in hypothesis testing:

● Null Hypothesis (H₀): The statement that there is no effect or no difference. It is what you
try to disprove or reject.
● Alternative Hypothesis (H₁): The statement that there is an effect or a difference. It is
what you want to prove.
● p-value: The probability of observing results at least as extreme as those measured, assuming
the null hypothesis is true. A small p-value (< 0.05) indicates strong evidence against the null
hypothesis.
● Significance Level (α): A threshold (commonly 0.05) used to decide whether to reject the
null hypothesis.
● Test Statistic: A value calculated from the data used to determine whether to reject the
null hypothesis.
Dataset: The Iris Dataset

The Iris dataset is one of the most famous datasets in the field of machine learning. It consists of
150 observations with the following features: Sepal length (cm), Sepal width (cm), Petal length
(cm), Petal width (cm), and Species (Iris-setosa, Iris-versicolor, and Iris-virginica).

Each observation represents a different iris flower from one of the three species, and the dataset
contains measurements for each flower's sepals and petals.

Problem 1: Two-Sample t-test

Objective:

To test if there is a significant difference in the sepal lengths between the species Iris-setosa and
Iris-versicolor.

Hypotheses:

● Null Hypothesis (H₀): There is no significant difference between the mean sepal lengths
of setosa and versicolor species. (μ₁ = μ₂)
● Alternative Hypothesis (H₁): There is a significant difference between the mean sepal
lengths of setosa and versicolor species. (μ₁ ≠ μ₂)

Interpretation:

If the p-value is less than 0.05, reject the null hypothesis, meaning there is a statistically
significant difference in sepal lengths between setosa and versicolor.

If the p-value is greater than 0.05, fail to reject the null hypothesis, meaning there is no
significant difference in sepal lengths.

Problem 2: One-Way ANOVA (Analysis of Variance)

Objective:

To test if there is a significant difference in the sepal lengths across all three species (setosa,
versicolor, and virginica).

Hypotheses:

● Null Hypothesis (H₀): The means of sepal lengths are equal for all species. (μ₁ = μ₂ = μ₃)
● Alternative Hypothesis (H₁): At least one species has a different mean sepal length. (μ₁ ≠ μ₂ or
μ₁ ≠ μ₃, etc.)

Interpretation:

If the p-value is less than 0.05, reject the null hypothesis, indicating that at least one species has a
significantly different mean sepal length.

If the p-value is greater than 0.05, fail to reject the null hypothesis, suggesting that the means are
not significantly different across species.

Problem 3: Chi-Square Test for Independence

Objective:

To test whether there is a relationship between species and different categories of sepal width
(e.g., narrow, medium, wide).

Hypotheses:

● Null Hypothesis (H₀): There is no relationship between species and sepal width categories (i.e.,
the two variables are independent).

● Alternative Hypothesis (H₁): There is a relationship between species and sepal width categories
(i.e., the two variables are dependent).

Interpretation:

If the p-value is less than 0.05, reject the null hypothesis, indicating that sepal width and species
are related (dependent).

If the p-value is greater than 0.05, fail to reject the null hypothesis, suggesting that sepal width
and species are independent.
Code:

4.1 Import libraries and load dataset
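A minimal sketch, assuming the copy of the Iris dataset bundled with scikit-learn (there the species names appear without the "Iris-" prefix):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset as a DataFrame and attach the species names.
iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
print(df.head())
```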

4.2 Boxplot
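A possible boxplot cell (the loading lines are repeated so it runs on its own):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Sepal length grouped by species.
sns.boxplot(x="species", y="sepal length (cm)", data=df)
plt.title("Sepal Length by Species")
plt.show()
```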

4.3 T-test
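A sketch of the two-sample t-test from Problem 1:

```python
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

setosa = df.loc[df["species"] == "setosa", "sepal length (cm)"]
versicolor = df.loc[df["species"] == "versicolor", "sepal length (cm)"]

t_stat, p_value = stats.ttest_ind(setosa, versicolor)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
if p_value < 0.05:
    print("Reject H0: mean sepal lengths differ significantly.")
else:
    print("Fail to reject H0: no significant difference.")
```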
4.4 Chi-square Test
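A sketch of the chi-square test from Problem 3; the three sepal-width categories are an assumption, produced here with equal-width bins:

```python
import pandas as pd
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Bin sepal width into narrow/medium/wide, then cross-tabulate with species.
df["width_category"] = pd.cut(df["sepal width (cm)"], bins=3,
                              labels=["narrow", "medium", "wide"])
table = pd.crosstab(df["species"], df["width_category"])

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3g}")
```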

4.5 ANOVA
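A sketch of the one-way ANOVA from Problem 2:

```python
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# One group of sepal lengths per species.
groups = [g["sepal length (cm)"].values for _, g in df.groupby("species")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3g}")
```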

Output:

4.1 Import libraries and load dataset


4.2 Boxplot and histogram

4.3 T-test

4.4 Chi-square Test

4.5 ANOVA
Conclusion:

In this lab on Hypothesis Testing using the Iris dataset, we applied three statistical tests:

1. Two-Sample t-test: Compared sepal lengths between Iris-setosa and Iris-versicolor. We
determined whether the means of the two species were significantly different based on
the p-value.
2. One-Way ANOVA: Tested if the mean sepal lengths of Iris-setosa, Iris-versicolor, and
Iris-virginica were the same. A significant p-value indicated that at least one species had
a different mean sepal length.
3. Chi-Square Test: Examined the relationship between species and sepal width categories.
A significant p-value showed dependence between the variables, while a larger p-value
suggested independence.

These tests allowed us to draw conclusions about the differences in sepal lengths and the
relationship between species and sepal width categories, highlighting the importance of
hypothesis testing in data analysis.
Lab – 5

Aim:

● To understand the importance of feature engineering in preparing data for machine learning
models.

● To learn how to preprocess data by handling missing values, encoding categorical variables,
and scaling numerical features.

● To perform feature selection using statistical methods and machine learning models.

● To apply dimensionality reduction techniques like PCA.

Theory:

● Feature Engineering: The process of creating new input features from your existing data
to improve model performance.

● Feature Selection: The process of selecting the most important and relevant features for
your machine learning model.

Why is it important? It improves model accuracy, reduces model complexity and training time,
and prevents overfitting by removing irrelevant features.

Code:

1. Import Libraries.
2. Load the dataset.
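A minimal sketch covering steps 1 and 2. The Titanic dataset is an assumption here, suggested by the FamilySize and IsAlone features created in step 5; seaborn's bundled copy is used for convenience:

```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset (assumed) and keep a core set of columns.
df = sns.load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "sibsp", "parch",
         "fare", "embarked"]]
print(df.head())
```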

3. Data processing and Handling missing values.
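A possible step 3, continuing with `df` from step 2:

```python
# Inspect missing values, then impute: median for numeric, mode for categorical.
print(df.isnull().sum())
df["age"] = df["age"].fillna(df["age"].median())
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])
```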

4. Feature Engineering: Encoding Categorical Variables.
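A possible step 4, continuing from step 3:

```python
# Map the binary column directly; one-hot encode the multi-valued one.
df["sex"] = df["sex"].map({"male": 0, "female": 1})
df = pd.get_dummies(df, columns=["embarked"], drop_first=True)
```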

5. Feature Engineering: Creating New Features.
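A possible step 5, deriving the two features named in the conclusion:

```python
# FamilySize counts the passenger plus siblings/spouses and parents/children.
df["FamilySize"] = df["sibsp"] + df["parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
```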

6. Feature Scaling.
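A possible step 6:

```python
from sklearn.preprocessing import StandardScaler

# Standardize the continuous columns to zero mean and unit variance.
scaler = StandardScaler()
df[["age", "fare"]] = scaler.fit_transform(df[["age", "fare"]])
```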
7. Feature Selection Mechanisms.

7.1 Separate Features and Target.
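A possible step 7.1 (treating `survived` as the target is part of the Titanic assumption):

```python
X = df.drop(columns=["survived"])   # features
y = df["survived"]                  # target
```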

7.2 Recursive Feature Elimination.
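A sketch of RFE; the logistic-regression estimator and the choice of five features are assumptions:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively drop the weakest feature until five remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print("Selected:", X.columns[rfe.support_].tolist())
```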

7.3 Feature Importance using Random Forest.
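A sketch of step 7.3:

```python
from sklearn.ensemble import RandomForestClassifier

# Impurity-based importances from a fitted random forest.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```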

8. Dimensionality reduction using PCA
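A sketch of step 8:

```python
from sklearn.decomposition import PCA

# Project the features onto the two directions of greatest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```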


Output:

1. Import Libraries.
2. Load the dataset.

3. Data processing and Handling missing values.

4. Feature Engineering: Encoding Categorical Variables.

5. Feature Engineering: Creating New Features.

6. Feature Scaling.
7. Feature Selection Mechanisms.

7.2 Recursive Feature Elimination.

7.3 Feature Importance using Random Forest.


8. Dimensionality reduction using PCA

Conclusion:

In this lab, we focused on Feature Engineering and Feature Selection techniques to prepare data
for machine learning models. Key activities included:

1. Data Preprocessing: Handling missing values and encoding categorical variables to make the
data suitable for machine learning.

2. Feature Creation: We generated new features (e.g., FamilySize, IsAlone) to improve the
model's predictive power.

3. Feature Scaling: We standardized and normalized numerical features to ensure algorithms
perform optimally.

4. Feature Selection: Using methods like Univariate Selection, Recursive Feature Elimination
(RFE), and Random Forest to choose the most important features, reducing dimensionality and
preventing overfitting.

5. Dimensionality Reduction with PCA: We applied Principal Component Analysis (PCA) to
reduce feature space while retaining important variance in the data.

Overall, this lab demonstrated how effective feature engineering and selection can enhance
model performance, improve efficiency, and prevent overfitting in machine learning tasks.
Lab – 6

Aim:

● To identify and handle common problems in data entities, such as duplicates, missing
values, outliers, and inconsistencies.
● To understand the impact of data quality issues on machine learning model performance.
● To apply preprocessing techniques to improve data quality and model readiness

Theory:

Common Issues in Data:

● Duplicates: Rows or entries that repeat without adding new information, leading to
redundancy and inflated data size.

● Missing Values: Absences in the data that can lead to biased analysis or prevent algorithms
from functioning properly.

● Outliers: Extreme values that diverge from the main distribution, potentially skewing analysis
and model results.

● Inconsistencies: Variations in formatting, categorization, or units (e.g., “M” vs. “Male” for
gender) that make the same information appear in different forms.

Why It’s Important:

● Data Quality Directly Affects Model Performance: Clean data allows models to capture
patterns more accurately.

● Reduces Bias and Overfitting: Correcting issues like duplicates and outliers produces a more
balanced dataset.
● Efficient Data Processing: Reducing redundancy and handling missing values optimize
computational resources, speeding up analysis.

Code:

6.1 Exploratory Data Analysis (EDA)
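A minimal sketch; `dataset.csv` is a placeholder for whatever file the lab uses:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")      # file name assumed for illustration

print(df.shape)                      # rows and columns
df.info()                            # dtypes and non-null counts
print(df.describe())                 # summary statistics
print("Duplicate rows:", df.duplicated().sum())

df = df.drop_duplicates()            # drop exact duplicate rows
```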

6.2 Handling Missing Values
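A possible 6.2, continuing with `df` from 6.1:

```python
# Impute column by column: mode for text columns, median for numeric ones.
print(df.isnull().sum())
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())
```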

6.3 Handling Categorical Variables
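A possible 6.3; generic string normalization stands in for dataset-specific fixes such as mapping “M” to “Male”:

```python
# Normalize formatting so variants like "Male " and "male" collapse together,
# then one-hot encode the categorical columns.
cat_cols = df.select_dtypes(include="object").columns
for col in cat_cols:
    df[col] = df[col].str.strip().str.lower()
df = pd.get_dummies(df, columns=list(cat_cols), drop_first=True)
```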


6.4 Feature Scaling
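A possible 6.4:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale every numeric column to the [0, 1] range.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
```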

6.5 Outlier Detection and Removal
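A possible 6.5, using the 1.5 × IQR rule:

```python
# Keep only rows inside 1.5 * IQR on every numeric column.
num_cols = df.select_dtypes(include="number").columns
q1 = df[num_cols].quantile(0.25)
q3 = df[num_cols].quantile(0.75)
iqr = q3 - q1
outlier = ((df[num_cols] < q1 - 1.5 * iqr) |
           (df[num_cols] > q3 + 1.5 * iqr)).any(axis=1)
df = df[~outlier]
```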

6.6 Data Summary and Saving the Cleaned Dataset
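A possible 6.6:

```python
# Final check, then persist the cleaned dataset (output name assumed).
print(df.describe())
df.to_csv("dataset_cleaned.csv", index=False)
```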


6.7 Visual Exploration
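A possible 6.7:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histograms of every numeric column.
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

# Pairwise correlations of the numeric columns.
sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
```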

Output:

6.1 Exploratory Data Analysis (EDA)


6.2 Handling Missing Values
6.3 Handling Categorical Variables

6.4 Feature Scaling

6.5 Outlier Detection and Removal

6.6 Data Summary and Saving the Cleaned Dataset


6.7 Visual Exploration
Conclusion:
This lab focused on essential data preprocessing tasks to prepare datasets for
analysis. Key steps included removing duplicates, handling missing values,
addressing outliers, correcting data inconsistencies, and encoding categorical
variables. We also explored scaling numerical features, selecting relevant features,
and applying PCA for dimensionality reduction. By the end, the dataset was
cleaned, standardized, and transformed, ensuring it was ready for effective analysis
and model training.
