DAP writeups_merged
Practical No 01
Title: Data Cleaning, Processing, and Analysis Using Pandas and NumPy
Objective:
To clean, process, and analyze a sample dataset using Pandas and NumPy, extracting
meaningful insights.
Problem Statement:
Data is often incomplete and contains missing values. The challenge is to clean, process, and
analyze data efficiently using Pandas functionalities to extract insights.
Outcomes:
Theory
1. Introduction to Data Cleaning and Analysis
Data cleaning and analysis are essential steps in data preprocessing to ensure accurate insights
and reliable results. In real-world scenarios, datasets often contain missing, inconsistent, or
incorrect data. Using NumPy and Pandas, we can efficiently handle, clean, and analyze data.
• The sample dataset contains the following columns:
o Name (Categorical)
o Age (Numerical)
o Salary (Numerical)
o Department (Categorical)
• Removing Missing Data: Rows with missing names are dropped, as the name is a
critical identifier.
• Handling Missing Age Values: The missing age values are filled with the mean age
to ensure consistency.
• Handling Missing Salary Values: The missing salary values are filled with the
median salary to reduce the impact of outliers.
• Analyzing the Data: df.describe() provides statistical measures like mean, min, max, and quartiles for numerical columns.
• Prepares Data for Machine Learning: Cleaned data improves model performance.
▪ Incomplete Data: Data that is missing certain values or attributes, which can occur due
to errors in data collection, transmission, or storage.
▪ Missing Values: These are gaps in a dataset where information is absent. In Pandas, they
are often represented as NaN (Not a Number).
▪ Data Cleaning: The process of identifying and correcting (or removing) errors,
inconsistencies, and missing values in a dataset to ensure data quality.
▪ Data Processing: Transforming raw data into a structured format suitable for analysis.
This includes handling missing values, normalizing data, and feature engineering.
▪ Pandas: A powerful Python library used for data manipulation and analysis. It provides
data structures like DataFrame and Series for handling structured data efficiently.
1. .head(n)
Displays the first n rows of the DataFrame (default is 5).
2. .tail(n)
Displays the last n rows of the DataFrame (default is 5).
3. .info()
Provides a summary of the dataset, including column names, data types, and missing values.
4. .describe()
Generates summary statistics for numerical columns (mean, std, min, max, etc.).
5. .shape
Returns the number of rows and columns in the dataset.
6. .columns
Lists the column names of the DataFrame.
7. .dropna()
Removes rows with missing (NaN) values.
8. .fillna(value)
Replaces missing values with a specified value.
9. .isnull().sum()
Counts the number of missing values in each column.
10. .duplicated().sum()
Checks for duplicate rows.
11. .drop_duplicates()
Removes duplicate rows from the DataFrame.
12. .astype(data_type)
Converts the data type of a column.
13. .apply(function)
Applies a function to a column.
14. .sort_values(by='column')
Sorts the DataFrame by a specific column.
15. .value_counts()
Counts unique values in a column.
16. .corr()
Computes the correlation between numerical columns.
17. .groupby('column').mean()
Groups the DataFrame by a column and calculates the mean.
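A minimal sketch applying several of the methods above to a small hypothetical DataFrame (all names and values are illustrative):

import pandas as pd
import numpy as np

# Small hypothetical DataFrame for illustration
df = pd.DataFrame({'Name': ['Asha', 'Ravi', None, 'Meera'],
                   'Age': [25, np.nan, 30, 28]})

print(df.head())                 # first rows
df.info()                        # column types and non-null counts
print(df.isnull().sum())         # missing values per column
df = df.dropna(subset=['Name'])  # drop rows with a missing Name
df['Age'] = df['Age'].fillna(df['Age'].mean())  # fill missing ages with the mean
print(df.describe())             # summary statistics
print(df.sort_values(by='Age'))  # sort by a column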
Conclusion
In this practical, we successfully demonstrated the use of NumPy and Pandas for data
cleaning, processing, and analysis. The dataset initially contained missing values, which were
handled systematically by removing incomplete records where names were missing, filling
missing numerical values such as age and salary using statistical measures like the mean and
median, and replacing missing categorical values in the department column with a default value
("Unknown"). After cleaning the data, we performed analysis by extracting summary statistics,
calculating the average salary per department, counting employees in each department, and
identifying the maximum and minimum salary values.
import numpy as np
import pandas as pd

# Sample dataset (illustrative values; NaN marks missing entries)
raw_data = {
    'Name': ['Amit', 'Priya', None, 'Rahul', 'Sneha', 'Vikram'],
    'Age': [25, np.nan, 30, 28, np.nan, 35],
    'Salary': [50000, 60000, np.nan, 55000, 65000, np.nan],
    'Department': ['HR', 'IT', 'IT', None, 'Finance', 'HR']
}
df = pd.DataFrame(raw_data)
print("Original Data:")
print(df)

# Data Cleaning
df.dropna(subset=['Name'], inplace=True)                   # names are critical identifiers
df['Age'] = df['Age'].fillna(df['Age'].mean())             # mean keeps ages consistent
df['Salary'] = df['Salary'].fillna(df['Salary'].median())  # median resists outliers
df['Department'] = df['Department'].fillna('Unknown')
print("\nCleaned Data:")
print(df)

# Data Analysis
# 1. Summary Statistics
print("\nSummary Statistics:")
print(df.describe())

# 2. Average salary per department
print("\nAverage Salary per Department:")
print(df.groupby('Department')['Salary'].mean())

# 3. Employee count per department
print("\nEmployees per Department:")
print(df['Department'].value_counts())

# 4. Maximum and minimum salary
print("\nMax Salary:", df['Salary'].max(), "| Min Salary:", df['Salary'].min())
OUTPUT:
Practical No 02
Title: Data Analysis of Company Sales, Expenses, and Profit Using Python
Objective:
Outcomes:
Problem Statement:
Businesses need to track sales, expenses, and profit for effective decision-making. However,
raw data may have missing values and lack clear insights. This practical focuses on cleaning
the data, performing statistical analysis, and visualizing trends to better understand business
performance.
Theory :
Conclusion
This program provides insights into a company's financial performance by cleaning data,
computing key statistics, and visualizing important trends. The results can help in making
strategic business decisions.
Program:
#https://github.com/YBI-Foundation/Dataset/blob/main/Product%20Sales%20Data.csv
import pandas as pd
import matplotlib.pyplot as plt

# Assumed raw-file URL derived from the GitHub link above
url = 'https://raw.githubusercontent.com/YBI-Foundation/Dataset/main/Product%20Sales%20Data.csv'
df = pd.read_csv(url)

print(df.head())
print(df.isna().sum())  # count missing values per column
df = df.dropna()        # dropna() returns a copy, so reassign it

month = df['Month']
cream = df['Cream']
detergent = df['Detergent']
moisturizer = df['Moisturizer']
sanitizer = df['Sanitizer']
shampoo = df['Shampoo']
soap = df['Soap']
total_units = df['Total Units']
total_profit = df['Total Profit']

# Line chart of total profit per month (the plot call was missing in the source)
plt.plot(month, total_profit, marker='o', label='Total Profit')
plt.xlabel('Month')
plt.ylabel('Profit in Rupees')
plt.legend()
plt.grid()
plt.show()

# Pie chart of total sales per product
labels = ['Face Cream', 'Detergent', 'Moisturizer', 'Sanitizer', 'Shampoo', 'Soap']
sales_sum = [df['Cream'].sum(), df['Detergent'].sum(), df['Moisturizer'].sum(),
             df['Sanitizer'].sum(), df['Shampoo'].sum(), df['Soap'].sum()]
plt.pie(sales_sum, labels=labels, startangle=90, autopct='%.2f')
plt.legend(loc='lower left')
plt.show()
Practical No 03
Objective:
Problem Statement:
In data analysis and machine learning, working with large datasets can be computationally
expensive. Random sampling helps in selecting a subset of data that represents the entire
dataset while reducing computational load. This experiment aims to extract a representative
sample from a dataset of student performance and analyze its distribution.
Outcomes:
Theory :
1. Introduction to Sampling
In data science, it is often impractical to work with entire datasets due to time and
computational constraints. Sampling allows us to select a subset of data that represents the
entire dataset while maintaining statistical accuracy. Sampling is widely used in research,
machine learning, and big data analysis to make predictions and gain insights without processing the full dataset.
Random sampling ensures that every data point has an equal chance of being selected, reducing
bias and increasing the accuracy of statistical analysis. Key benefits include:
• Efficiency: Processing a smaller subset reduces computation time and memory usage.
A. Simple Random Sampling
Each data point has an equal probability of being chosen. This method is unbiased and widely used.
• Example: Selecting 100 students randomly from a university with 10,000 students.
B. Stratified Sampling
The dataset is divided into different groups (strata) based on specific characteristics, and
samples are taken from each group proportionally.
• Example: If a dataset has 60% male and 40% female students, the sample should maintain the same 60:40 proportion.
C. Systematic Sampling
Every k-th data point is selected from an ordered dataset after a random starting point.
• Example: Selecting every 10th student from an enrollment list.
D. Cluster Sampling
Instead of selecting individuals, entire groups (clusters) are randomly chosen, making the
process more efficient.
• Example: Choosing 5 random schools from a district and analyzing all students in those
schools.
5. Implementation in Python
By applying these techniques, we can efficiently analyze large datasets while maintaining statistical accuracy.
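A minimal sketch of the four sampling techniques in pandas (the student data below is entirely hypothetical):

import pandas as pd

# Hypothetical student-performance data
df = pd.DataFrame({'student_id': range(1, 101),
                   'gender': ['M', 'F'] * 50,
                   'score': [50 + (i * 7) % 50 for i in range(100)]})

# A. Simple random sampling: each row equally likely
simple = df.sample(n=10, random_state=42)

# B. Stratified sampling: 10% drawn from each gender group
stratified = df.groupby('gender').sample(frac=0.1, random_state=42)

# C. Systematic sampling: every 10th row
systematic = df.iloc[::10]

# D. Cluster sampling: randomly pick 2 of 5 blocks of 20 students
df['cluster'] = (df['student_id'] - 1) // 20
chosen = pd.Series(df['cluster'].unique()).sample(n=2, random_state=42)
cluster = df[df['cluster'].isin(chosen)]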
Conclusion
The experiment successfully demonstrated how to extract a random sample from a dataset. Random sampling helps in efficient data analysis by reducing computational cost while preserving the statistical properties of the full dataset.
Code :
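A minimal sketch of a random-sampling program for the stated problem (the file name and the score column are hypothetical):

import pandas as pd

# Hypothetical file and column names
df = pd.read_csv('student_performance.csv')

# Draw a 10% simple random sample; random_state makes it reproducible
sample = df.sample(frac=0.1, random_state=1)

# Compare the sample's distribution against the full dataset
print("Population mean score:", df['score'].mean())
print("Sample mean score:", sample['score'].mean())
print(sample['score'].describe())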
Output:
Practical No 04
Objective:
Problem Statement:
In statistical analysis, understanding the variability in sample means drawn from a population
is essential. This task involves simulating the process of generating multiple random samples
from a population and calculating the mean of each sample. The goal is to investigate the
distribution of sample means and compare it with the population mean.
Outcomes:
• The histogram will display the distribution of the means of the 30 samples, showing how they cluster around the population mean.
• The red dashed line represents the population mean, which serves as a reference to compare how close the sample means are to the true population mean.
• Even though the population is normally distributed in this case, if the population were not normal, the histogram of the sample means would still show an approximately normal distribution due to the Central Limit Theorem.
Theory
1. Sampling:
Sampling is a process of selecting a subset of individuals from a larger population. The goal of sampling is often to estimate population parameters (like the mean) without needing to collect data from the entire population.
Sampling: The process of selecting items from the population to form a sample.
When drawing a sample, it's important to ensure that the sample is random. A random sample
is one where each member of the population has an equal chance of being selected. This helps
to ensure that the sample is representative of the population.
2. Sampling Distribution:
A sampling distribution is the distribution of a statistic (e.g., the sample mean) obtained by repeatedly drawing samples of the same size from the same population.
Each sample you draw will give a sample statistic (e.g., the sample mean).
The collection of all possible sample means (for a fixed sample size) forms the sampling distribution of the mean.
The key question is: how does the sample mean relate to the population mean? To understand this, we use the Central Limit Theorem.
3. The Central Limit Theorem (CLT):
The Central Limit Theorem is one of the most important results in statistics. It states that,
regardless of the distribution of the population, if you take sufficiently large random samples
from the population, the distribution of the sample means will be approximately normal
(bell shaped). This is true even if the population distribution is not normal.
Given a population with mean μ and standard deviation σ, if you take random samples of size n:
o The mean of the sample means will be equal to the population mean μ.
o The standard deviation of the sample means (also called the standard error) will be σ/√n.
o As the sample size n increases, the sampling distribution of the sample mean becomes more tightly concentrated around the population mean, forming a normal (bell-shaped) curve.
Mean of the Sampling Distribution: The mean of the sampling distribution of the sample mean is the same as the population mean, i.e., μ_x̄ = μ.
Standard Deviation of the Sampling Distribution (Standard Error): The standard error SE is the standard deviation of the sampling distribution. It is calculated as SE = σ/√n, where σ is the population standard deviation and n is the sample size. As the sample size increases, the standard error decreases.
Shape of the Sampling Distribution: As the sample size increases, the distribution of the sample means becomes approximately normal (bell-shaped) even if the population distribution is not normal.
The sample size plays a crucial role in the precision of your estimates. Larger sample sizes
lead to smaller standard errors, meaning the sample means will be closer to the population
mean.
According to the CLT, the more samples you take, the closer the distribution of sample
means will be to a normal distribution, even if the population distribution is not normal. In
practice, a sample size of n ≥ 30 is considered large enough for the CLT to apply,
though this can vary based on the population distribution.
The Law of Large Numbers (LLN) is another important concept that states:
As the sample size increases, the sample mean will get closer to the population mean.
For example, if you randomly sample from a population and calculate the mean of each
sample, as the number of samples grows, the average of those sample means will tend to get
closer to the true population mean.
This law supports the idea that with larger samples, your estimates will be more accurate and
less variable.
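A brief numerical illustration of the LLN (the population parameters here are chosen arbitrarily for the sketch):

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=70, scale=10, size=100_000)  # true mean ≈ 70

# The sample mean drifts toward the population mean as n grows
for n in [10, 100, 1_000, 10_000]:
    sample = rng.choice(population, size=n, replace=False)
    print(f"n = {n:>6}: sample mean = {sample.mean():.3f}")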
7. Real-World Application:
Sampling is used extensively in fields like market research, quality control, medical testing,
and opinion polling. For example:
Polling: When conducting political polls, survey organizations often select a small group (sample) of people from a large population. The poll results (sample mean) are then used to estimate the opinions of the entire population.
Quality Control: In manufacturing, a company may randomly sample a few products from a
production line to estimate the overall quality of the products being produced.
By using statistical techniques like sampling and the CLT, we can make informed decisions
based on a subset of data rather than needing to assess every possible data point in the
population.
Conclusion
Understanding the concepts of sampling, sample means, and the Central Limit Theorem allows
statisticians to make reliable inferences about populations from sample data. The Central Limit
Theorem is particularly powerful because it enables the use of normal distribution
approximations for the sampling distribution of the mean, regardless of the original
population's distribution. This is a cornerstone of inferential statistics and is essential for
hypothesis testing, confidence intervals, and many other statistical methods.
Practical No 04
Code :
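A minimal simulation matching the Outcomes above (30 samples, a histogram of the sample means, and a red dashed line at the population mean; the population parameters are assumed for illustration):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Assumed population: normal with mean 100 and standard deviation 15
population = rng.normal(loc=100, scale=15, size=10_000)
population_mean = population.mean()

# Draw 30 random samples of size 50 and record each sample mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(30)]

plt.hist(sample_means, bins=10, edgecolor='black')
plt.axvline(population_mean, color='red', linestyle='--', label='Population mean')
plt.xlabel('Sample mean')
plt.ylabel('Frequency')
plt.title('Distribution of 30 Sample Means')
plt.legend()
plt.show()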
Output:
Practical No 05
Objective:
To enable users to input a dataset and specify independent and dependent variables.
To perform linear regression to model the relationship between the specified variables.
To visualize the results and interpret the regression output.
Problem Statement: Develop a Python program for linear regression analysis. Enable the user to input a dataset, specify the independent and dependent variables, and perform linear regression to model the relationship between them.
Outcomes:
Theory:
1. Linear Regression:
o Linear regression is a method used to model the relationship between a dependent
variable (Y) and one or more independent variables (X).
o The equation of simple linear regression is Y = β₀ + β₁X + ε, where:
Y is the dependent variable,
X is the independent variable,
β₀ is the intercept,
β₁ is the slope (coefficient),
ε is the error term.
o Linear regression can be classified into two types:
Simple Linear Regression: Involves one independent variable.
Multiple Linear Regression: Involves two or more independent variables.
Conclusion:
This practical provides a fundamental understanding of linear regression and its implementation in
Python. Users can analyse relationships between variables, interpret key statistical outputs, and
visualize results using regression plots. This technique is widely applicable in fields such as finance,
economics, and machine learning for predictive modelling.
Code :
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load dataset
file_path = r"C:\Users\Uday Pagar\OneDrive\Documents\Salary_Data.csv"
df = pd.read_csv(file_path)

# Prepare data (column names assumed to match Salary_Data.csv)
independent_var = 'YearsExperience'
dependent_var = 'Salary'
X = df[[independent_var]]  # Selecting the independent variable
y = df[dependent_var]      # Selecting the dependent variable

# Split into train/test sets and fit the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Plot results
plt.figure(figsize=(8, 5))
plt.scatter(X_test, y_test, color='blue', label='Actual Data')  # Scatter plot of actual values
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression Line')  # Regression line
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Linear Regression: Experience vs Salary')
plt.legend()
plt.grid(True)
plt.show()
Output :
Practical No 06
Title:
Statistical Analysis of a Dataset Using Python
Objective:
To compute and analyze key statistical measures such as mean, median,
mode, variance, and standard deviation for a given dataset using Python
libraries (numpy and statistics).
Problem Statement:
Calculate mean, median, mode, variance and standard deviation of a
dataset.
Outcomes:
By the end of this analysis, you will be able to:
1. Understand and calculate the mean, median, and mode of a dataset.
2. Compute variance and standard deviation to measure data
dispersion.
3. Differentiate between sample and population-based statistical
calculations.
4. Implement statistical computations using Python efficiently.
Theory:
Statistical measures provide insights into data distribution and variability. Consider the sample dataset:
data = [99,86,87,88,111,86,103,87,94,78,77,85,86]
1. Mean: The arithmetic average of all values in a dataset. It represents
the central tendency.
Mean = (99+86+87+88+111+86+103+87+94+78+77+85+86)
/ 13 = 89.77
2. Median: The middle value when the dataset is sorted. If the number of elements is even, it is the average of the two middle values.
Sorted data: 77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
With 13 values, the median is the 7th value: 87.
3. Mode: The most frequently occurring value. In the dataset above, 86 appears three times, so the mode is 86.
4. Variance: The average of the squared deviations from the mean. For example, for the dataset [32, 111, 138, 28, 59, 77, 97]:
Mean = (32+111+138+28+59+77+97) / 7 = 77.4
Deviations from the mean:
32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = -0.4
97 - 77.4 = 19.6
Squared deviations:
(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16
Variance = (2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 = 1432.2
5. Standard Deviation: The square root of variance, representing how
much data deviates from the mean.
Standard Deviation = √Variance = √1432.2 ≈ 37.8
Python provides built-in functions in numpy and statistics to efficiently
compute these measures.
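As a quick check, the worked example above can be reproduced with numpy (np.var and np.std default to the population formulas used in the hand calculation):

import numpy as np

data = [32, 111, 138, 28, 59, 77, 97]
print(np.mean(data))  # 77.428..., rounded to 77.4 above
print(np.var(data))   # 1432.24..., matching the 1432.2 hand calculation
print(np.std(data))   # 37.84..., the standard deviation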
Conclusion:
This analysis successfully computed fundamental statistical measures using
Python. The results provide insights into the dataset’s distribution and
variability. The mean indicates the central value, the median gives the
midpoint, and the mode highlights the most frequent value. The variance and
standard deviation quantify data dispersion. Understanding these measures is
essential for statistical analysis and data science applications.
CODE:
import statistics
import numpy

data = [10,15,82,33,82,55,44,91,30,90]
mean = numpy.mean(data)       # arithmetic average
mode = statistics.mode(data)  # most frequent value (82 appears twice)
median = numpy.median(data)   # middle of the sorted data
std = numpy.std(data)         # population standard deviation
var = numpy.var(data)         # population variance
print("Mean:", mean, "Mode:", mode, "Median:", median)
print("Std Dev:", std, "Variance:", var)
Practical No 07
Objective
To perform multiple regression analysis by allowing the user to input a dataset with multiple
independent variables, specify a dependent variable, and analyze the complex relationship
between variables using Python libraries such as pandas, sklearn, and matplotlib.
Problem Statement
Outcomes
Theory
• Independent Variables:
o Square footage
o Number of bedrooms
o Location score
o Age of the house
• Dependent Variable:
o House price
• Use Case: Predicting house prices from these features, as in the example below.
Example
House  Square Footage  Bedrooms  Location Score  Age (years)  Price
1      1500            3         8               10           300000
2      2000            4         9               5            450000
3      1200            2         6               20           200000
4      1800            3         7               15           350000
Using multiple regression, we can predict the house price based on square footage,
bedrooms, location score, and age. The model will help estimate how much each factor
contributes to the price, enabling better pricing strategies and investment decisions.
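A short sketch fitting a regression model to the four example rows (column names are illustrative; four observations make this a demonstration, not a reliable fit):

import pandas as pd
from sklearn.linear_model import LinearRegression

houses = pd.DataFrame({
    'sqft':     [1500, 2000, 1200, 1800],
    'bedrooms': [3, 4, 2, 3],
    'location': [8, 9, 6, 7],
    'age':      [10, 5, 20, 15],
    'price':    [300000, 450000, 200000, 350000],
})

model = LinearRegression()
model.fit(houses[['sqft', 'bedrooms', 'location', 'age']], houses['price'])

# Each coefficient estimates how much that feature contributes to the price
print(dict(zip(['sqft', 'bedrooms', 'location', 'age'], model.coef_.round(2))))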
Algorithm
1. Load the dataset using pandas.
2. Accept the names of the independent variables and the dependent variable from the user.
3. Separate the feature matrix (X) and the target (y).
4. Fit a multiple linear regression model on the data.
5. Display the model coefficients and performance metrics.
6. Predict values for the training set and visualize using a scatter plot.
7. Accept a new data sample from the user, predict the dependent variable, and display the result.
Conclusion
The developed program will help in understanding complex data relationships, predicting
outcomes, and visualizing model performance effectively. Mastering multiple regression
analysis is essential for advanced statistical analysis and data science applications.
Code with Graphical Representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load dataset
file_path = input("Enter the file path of the dataset (CSV format): ")
data = pd.read_csv(file_path)
print(data.head())

# Accept variable names from the user (comma-separated for the independents)
independent_vars = [v.strip() for v in input("Independent variables: ").split(',')]
dependent_var = input("Dependent variable: ").strip()
X = data[independent_vars]
y = data[dependent_var]

model = LinearRegression()
model.fit(X, y)

# Coefficient of each independent variable
coefficients = pd.DataFrame({'Variable': independent_vars, 'Coefficient': model.coef_})
print(coefficients)
print("\nModel Performance:")
y_pred = model.predict(X)
print("R-squared:", r2_score(y, y_pred))

# Actual vs predicted values
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y, y=y_pred)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.show()
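The final algorithm step, predicting for a new user-supplied sample, can be added as a continuation of the code above (a sketch reusing model, independent_vars, and dependent_var):

# Accept a new data sample from the user and predict the dependent variable
new_values = [float(input(f"Enter value for {var}: ")) for var in independent_vars]
new_sample = pd.DataFrame([new_values], columns=independent_vars)
print("Predicted", dependent_var, ":", model.predict(new_sample)[0])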