DAP writeups_merged

The document outlines practical exercises focused on data cleaning, analysis, and sampling using Python libraries like NumPy and Pandas. It covers techniques for handling missing values, generating descriptive statistics, and visualizing data trends, as well as the importance of random sampling in data analysis. Each practical emphasizes the significance of data preprocessing and statistical methods in deriving meaningful insights from datasets.

Practical No 01

Title: Data Cleaning and Analysis using NumPy and Pandas

Objective:

To clean, process, and analyze a sample dataset using Pandas and NumPy, extracting
meaningful insights.

Problem Statement:

Data is often incomplete and contains missing values. The challenge is to clean, process, and
analyze data efficiently using Pandas functionalities to extract insights.

Outcomes:

o Removing rows with missing names.
o Filling missing age values with the column mean.
o Filling missing salary values with the column median.
o Assigning 'Unknown' to missing department values.

Theory
1. Introduction to Data Cleaning and Analysis

Data cleaning and analysis are essential steps in data preprocessing to ensure accurate insights
and reliable results. In real-world scenarios, datasets often contain missing, inconsistent, or
incorrect data. Using NumPy and Pandas, we can efficiently handle, clean, and analyze data.

2. Overview of NumPy and Pandas

• NumPy (Numerical Python):

o Provides support for large multidimensional arrays and matrices.

o Contains mathematical functions for efficient numerical computations.

o Useful for handling missing values and performing numerical operations.

• Pandas:

o A powerful library for data manipulation and analysis.

o Provides DataFrame, a 2D tabular structure similar to spreadsheets and SQL tables.

o Offers functionalities for handling missing values, grouping data, and computing statistics.

3. Steps Involved in the Program


Step 1: Creating the Dataset

• The dataset is created using a dictionary with sample employee details:

o Name (Categorical)

o Age (Numerical)

o Salary (Numerical)

o Department (Categorical)

• Some values are missing (NaN) to simulate real-world data issues.

Step 2: Displaying the Original Data

• The dataset is converted into a Pandas DataFrame and displayed.

• Missing values are clearly identified for further processing.

Step 3: Data Cleaning

• Removing Missing Data: Rows with missing names are dropped, as the name is a
critical identifier.

• Handling Missing Age Values: The missing age values are filled with the mean age
to ensure consistency.

• Handling Missing Salary Values: The missing salary values are filled with the
median salary to reduce the impact of outliers.

• Handling Missing Department Values: Any missing department is replaced with 'Unknown' to maintain categorical consistency.

Step 4: Data Analysis

1. Generating Summary Statistics:

o df.describe() provides statistical measures like mean, min, max, and quartiles for numerical columns.

2. Computing Average Salary per Department:

o df.groupby('Department')['Salary'].mean() calculates the mean salary for each department.

3. Counting Employees per Department:

o df['Department'].value_counts() counts the number of employees in each department.

4. Finding Maximum and Minimum Salary:

o df['Salary'].max() and df['Salary'].min() identify the highest and lowest salaries in the dataset.

4. Importance of Data Cleaning and Analysis

• Ensures Data Accuracy: Handling missing values prevents misinterpretation of results.

• Prepares Data for Machine Learning: Cleaned data improves model performance.

• Extracts Meaningful Insights: Helps in understanding trends, distributions, and relationships in the data.

▪ Incomplete Data: Data that is missing certain values or attributes, which can occur due
to errors in data collection, transmission, or storage.

▪ Missing Values: These are gaps in a dataset where information is absent. In Pandas, they
are often represented as NaN (Not a Number).

▪ Data Cleaning: The process of identifying and correcting (or removing) errors,
inconsistencies, and missing values in a dataset to ensure data quality.

▪ Data Processing: Transforming raw data into a structured format suitable for analysis.
This includes handling missing values, normalizing data, and feature engineering.

▪ Data Analysis: Examining, transforming, and modeling data to extract meaningful insights, often using statistical and visualization techniques.

▪ Pandas: A powerful Python library used for data manipulation and analysis. It provides
data structures like DataFrame and Series for handling structured data efficiently.

5. Data preprocessing and cleaning commands (a combined example follows this list)

1. .head(n)
Displays the first n rows of the DataFrame (default is 5).

2. .tail(n)
Displays the last n rows of the DataFrame (default is 5).

3. .info()
Provides a summary of the dataset, including column names, data types, and missing values.

4. .describe()
Generates summary statistics for numerical columns (mean, std, min, max, etc.).

5. .shape
Returns the number of rows and columns in the dataset.

6. .columns
Lists the column names of the DataFrame.

7. .dropna()
Removes rows with missing (NaN) values.

8. .fillna(value)
Replaces missing values with a specified value.

9. .isnull().sum()
Counts the number of missing values in each column.

10. .duplicated().sum()
Checks for duplicate rows.

11. .drop_duplicates()
Removes duplicate rows from the DataFrame.

12. .astype(data_type)
Converts the data type of a column.

13. .apply(function)
Applies a function to a column.

14. .replace(old_value, new_value)
Replaces specific values in a column.

15. .rename(columns={'old': 'new'})
Renames columns in the DataFrame.

16. .sort_values(by='column')
Sorts the DataFrame by a specific column.

17. .value_counts()
Counts unique values in a column.

18. .corr()
Computes the correlation between numerical columns.

19. .groupby('column').mean()
Groups the DataFrame by a column and calculates the mean.
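
A minimal sketch tying several of these commands together, using a small made-up DataFrame (all column names and values here are illustrative):

import numpy as np
import pandas as pd

# Small made-up dataset with one missing value and one duplicate row
df = pd.DataFrame({
    'Name': ['Asha', 'Ravi', 'Ravi', 'Meera'],
    'Age': [24, 31, 31, np.nan],
    'City': ['Pune', 'Mumbai', 'Mumbai', 'Nashik']
})

print(df.shape)               # (4, 3) -> rows, columns
print(df.isnull().sum())      # 'Age' has 1 missing value
print(df.duplicated().sum())  # 1 duplicate row

df = df.drop_duplicates()                       # remove the duplicate row
df['Age'] = df['Age'].fillna(df['Age'].mean())  # fill missing Age with the mean
df = df.rename(columns={'City': 'Location'})    # rename a column
print(df.sort_values(by='Age'))                 # sort rows by Age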

Conclusion

In this practical, we successfully demonstrated the use of NumPy and Pandas for data
cleaning, processing, and analysis. The dataset initially contained missing values, which were
handled systematically by removing incomplete records where names were missing, filling
missing numerical values such as age and salary using statistical measures like the mean and
median, and replacing missing categorical values in the department column with a default value
("Unknown"). After cleaning the data, we performed analysis by extracting summary statistics,
calculating the average salary per department, counting employees in each department, and
identifying the maximum and minimum salary values.

import numpy as np
import pandas as pd

# Sample dataset

raw_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', np.nan],
    'Age': [25, 30, 35, np.nan, 28, 40],
    'Salary': [50000, 60000, np.nan, 70000, 55000, 65000],
    'Department': ['HR', 'IT', 'IT', 'Finance', 'HR', np.nan]
}

df = pd.DataFrame(raw_data)

# Display original dataset

print("Original Data:")

print(df)

# Data Cleaning

# 1. Remove rows where 'Name' is missing

df.dropna(subset=['Name'], inplace=True)

# 2. Fill missing values in 'Age' with the column mean

df['Age'] = df['Age'].fillna(df['Age'].mean())

# 3. Fill missing values in 'Salary' with the column median

df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# 4. Fill missing values in 'Department' with 'Unknown'

df['Department'] = df['Department'].fillna('Unknown')

# Display cleaned dataset

print("\nCleaned Data:")

print(df)

# Data Analysis

# 1. Summary Statistics

print("\nSummary Statistics:")

print(df.describe())

# 2. Average Salary per Department


print("\nAverage Salary per Department:")

print(df.groupby('Department')['Salary'].mean())

# 3. Count of Employees per Department

print("\nEmployee Count per Department:")

print(df['Department'].value_counts())

# 4. Maximum and Minimum Salary

print("\nMaximum Salary:", df['Salary'].max())

print("Minimum Salary:", df['Salary'].min())

OUTPUT:
Practical No 02

Title: Data Analysis of Company Sales, Expenses, and Profit Using Python

Objective:

The objective of this practical is to analyze a dataset containing information about a company's sales, expenses, and profit. This involves performing data cleaning to handle missing values, calculating descriptive statistics to summarize key metrics, and visualizing trends using graphs to gain meaningful insights.

Outcomes:

1. Successfully handled missing values by filling them with appropriate statistical measures (mean).

2. Generated descriptive statistics to summarize the dataset, including mean, minimum, maximum, and standard deviation.

3. Created visualizations using Matplotlib and Seaborn to analyze trends in sales, expenses, and profit over time.

4. Gained insights into business performance, helping in data-driven decision-making.

5. Demonstrated the importance of data preprocessing and visualization in financial analysis.

Problem Statement:

Businesses need to track sales, expenses, and profit for effective decision-making. However,
raw data may have missing values and lack clear insights. This practical focuses on cleaning
the data, performing statistical analysis, and visualizing trends to better understand business
performance.

Theory :

1. Loading the Dataset


The dataset is loaded into a Pandas DataFrame from a CSV file.
2. Data Cleaning
o Duplicates are removed to avoid redundancy.
o Missing values are handled by dropping incomplete rows.
o Sales and Expenses are converted to numeric values for calculations.
o Profit is calculated as: Profit = Sales − Expenses
3. Descriptive Statistics
o Summary statistics (count, mean, min, max, etc.) are generated to understand
data distribution.
4. Data Visualization (a sketch follows this list)
o Histogram: Shows the distribution of Profit.
o Scatter Plot: Analyzes the relationship between Sales and Profit.
o Box Plot: Compares Sales, Expenses, and Profit distributions.
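
The program further below uses the Product Sales dataset; as a minimal sketch of the three plots just described, assuming a DataFrame with numeric 'Sales' and 'Expenses' columns (these names and values are illustrative, not from that dataset):

import pandas as pd
import matplotlib.pyplot as plt

# Assumed toy data; Profit is derived as Sales - Expenses
df = pd.DataFrame({'Sales': [120, 150, 90, 200, 170, 130],
                   'Expenses': [80, 100, 70, 140, 110, 95]})
df['Profit'] = df['Sales'] - df['Expenses']

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(df['Profit'], bins=5)          # histogram: distribution of Profit
axes[0].set_title('Profit Distribution')
axes[1].scatter(df['Sales'], df['Profit'])  # scatter: Sales vs Profit
axes[1].set_title('Sales vs Profit')
axes[2].boxplot([df['Sales'], df['Expenses'], df['Profit']],
                labels=['Sales', 'Expenses', 'Profit'])  # box plot comparison
axes[2].set_title('Spread Comparison')
plt.tight_layout()
plt.show()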

Conclusion

This program provides insights into a company's financial performance by cleaning data,
computing key statistics, and visualizing important trends. The results can help in making
strategic business decisions.

Program:

#https://github.com/YBI-Foundation/Dataset/blob/main/Product%20Sales%20Data.csv

import pandas as pd

from matplotlib import pyplot as plt

df = pd.read_csv(r"C:\Users\ABHISHEK\Downloads\Product Sales Data.csv")

print(df.head())

print(df.isna().sum())  # count missing values per column

df = df.dropna()  # dropna() returns a new DataFrame, so keep the result

month=df['Month']

cream=df['Cream']

detergent=df['Detergent']

moisturizer=df['Moisturizer']

sanitizer=df['Sanitizer']

shampoo=df['Shampoo']

soap=df['Soap']

total_units=df['Total Units']

total_profit=df['Total Profit']

plt.plot(month,total_profit,label='Total Profits', marker='o',linewidth=1.5)

plt.title('Profit Throughout Months')

plt.xlabel('Month')

plt.ylabel('Profit in Rupees')

plt.legend()

plt.grid()

plt.show()
labels=['Face Cream','Detergent','Moisturizer','Sanitizer','Shampoo','Soap']

sales_sum = [df['Cream'].sum(), df['Detergent'].sum(), df['Moisturizer'].sum(),
             df['Sanitizer'].sum(), df['Shampoo'].sum(), df['Soap'].sum()]

plt.pie(sales_sum,labels=labels,startangle=90,autopct='%.2f')

plt.legend(loc='lower left')

plt.show()
Practical No 03

Title: Random Sampling for Representative Data Extraction

Objective:

• To implement a random sampling process to extract a representative sample from a dataset using Python and analyze its effectiveness.

Problem Statement:

In data analysis and machine learning, working with large datasets can be computationally
expensive. Random sampling helps in selecting a subset of data that represents the entire
dataset while reducing computational load. This experiment aims to extract a representative
sample from a dataset of student performance and analyze its distribution.

Outcomes:

• Successfully extract a random sample from the dataset.

• Ensure the sample is representative of the whole dataset.

• Understand the importance of random sampling in data analysis.

Theory :

1. Introduction to Sampling

In data science, it is often impractical to work with entire datasets due to time and computational constraints. Sampling allows us to select a subset of data that represents the entire dataset while maintaining statistical accuracy. Sampling is widely used in research, machine learning, and big data analysis to make predictions and insights without processing the full dataset.

2. Importance of Random Sampling

Random sampling ensures that every data point has an equal chance of being selected, reducing bias and increasing the accuracy of statistical analysis. Key benefits include:

• Efficiency: Processing a smaller subset reduces computation time and memory usage.

• Unbiased Representation: Proper sampling prevents skewed results by ensuring the sample represents the entire dataset.

• Improved Generalization: In machine learning, random sampling helps create training and testing datasets that reflect real-world scenarios.


3. Types of Random Sampling Methods

There are several types of random sampling techniques:

A. Simple Random Sampling (SRS)

Each data point has an equal probability of being chosen. This method is unbiased and widely used in statistical analysis.

• Example: Selecting 100 students randomly from a university with 10,000 students.

• Implementation: pandas.sample() or numpy.random.choice().

B. Stratified Sampling

The dataset is divided into different groups (strata) based on specific characteristics, and
samples are taken from each group proportionally.

• Example: If a dataset has 60% male and 40% female students, the sample should maintain the same proportion.

• Use case: When data is imbalanced (e.g., fraud detection).

• Implementation: Using groupby() in Pandas or StratifiedShuffleSplit in Scikit-learn (a short sketch follows).
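
A minimal sketch of proportional stratified sampling with groupby(); the 'Gender' column, the 60/40 split, and the 20% fraction are assumptions for illustration:

import pandas as pd

# Assumed dataset with an imbalanced 'Gender' column (60% M, 40% F)
df = pd.DataFrame({'Gender': ['M'] * 60 + ['F'] * 40,
                   'Score': range(100)})

# Sample 20% from each stratum so the sample keeps the 60/40 proportion
sample = pd.concat(g.sample(frac=0.2, random_state=42)
                   for _, g in df.groupby('Gender'))
print(sample['Gender'].value_counts())  # M: 12, F: 8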

C. Systematic Sampling

A fixed interval (k) is used to select samples from an ordered dataset.

• Example: Selecting every 5th student from a class list.

• Use case: When the dataset is already randomized.

D. Cluster Sampling

Instead of selecting individuals, entire groups (clusters) are randomly chosen, making the
process more efficient.

• Example: Choosing 5 random schools from a district and analyzing all students in those schools.

• Use case: Large-scale surveys.

4. Applications of Random Sampling in Data Science

• Data Preprocessing: Selecting a representative training dataset for machine learning.

• Statistical Analysis: Making inferences about a population from a small sample.

• Market Research: Conducting surveys on a subset of consumers to predict preferences.


• Medical Studies: Testing new treatments on a smaller group before large-scale implementation.

5. Implementation in Python

Random sampling can be implemented using different Python libraries:

• Pandas: data.sample(n=sample_size) for simple random sampling.

• NumPy: np.random.choice(data.index, size=sample_size, replace=False) for random index selection.

• Scikit-learn: StratifiedShuffleSplit for stratified sampling.

By applying these techniques, we can efficiently analyze large datasets while maintaining accuracy and reliability.

Conclusion

The experiment successfully demonstrated how to extract a random sample from a dataset. Random sampling helps in efficient data analysis by reducing computational cost while maintaining representativeness. This technique is crucial in machine learning, statistical analysis, and decision-making processes.
Practical No.3

Code :
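
The original code listing is not reproduced in this copy. Below is a minimal sketch consistent with the problem statement: a made-up student performance dataset, a simple random sample drawn with pandas.sample(), and a check of how representative the sample is (all names, sizes, and parameters are assumptions):

import numpy as np
import pandas as pd

# Made-up population of 1,000 students with normally distributed marks
np.random.seed(42)
population = pd.DataFrame({
    'student_id': range(1, 1001),
    'marks': np.random.normal(loc=70, scale=10, size=1000).round(1)
})

# Simple random sample of 100 students (10% of the population)
sample = population.sample(n=100, random_state=42)

# Compare sample statistics against population statistics
print(f"Population mean marks: {population['marks'].mean():.2f}")
print(f"Sample mean marks:     {sample['marks'].mean():.2f}")
print(f"Population std:        {population['marks'].std():.2f}")
print(f"Sample std:            {sample['marks'].std():.2f}")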

Output:
Practical No 04

Title: Sampling and Calculation of Mean from a Population

Objective:

To simulate the process of sampling from a population.

To calculate the mean of each of the generated samples.

Problem Statement:

In statistical analysis, understanding the variability in sample means drawn from a population
is essential. This task involves simulating the process of generating multiple random samples
from a population and calculating the mean of each sample. The goal is to investigate the
distribution of sample means and compare it with the population mean.

Outcomes:

The Distribution of Sample Means:

The histogram will display the distribution of the means of the 30 samples, showing how the sample means vary. The red dashed line represents the population mean, which serves as a reference to compare how close the sample means are to the true population mean.

Central Limit Theorem Observation:

Even though the population is normally distributed in this case, if the population were not normal, the histogram of the sample means would still show a normal distribution due to the Central Limit Theorem (CLT) as the sample size increases.

Theory

1. Basic Sampling Concepts:

Sampling is a process of selecting a subset of individuals from a larger population. The goal of sampling is often to estimate population parameters (like the mean) without needing to collect data from the entire population, which is often impractical or costly.

Population: The entire set of items or individuals you want to study.

Sample: A subset of the population.

Sampling: The process of selecting items from the population to form a sample.
When drawing a sample, it's important to ensure that the sample is random. A random sample is one where each member of the population has an equal chance of being selected. This helps to ensure that the sample is representative of the population.

2. Sampling Distribution:

A sampling distribution is the distribution of a statistic (e.g., the sample mean) obtained by repeatedly drawing random samples of a given size from a population.

Each sample you draw will give a sample statistic (e.g., the sample mean). The collection of all possible sample means (for a fixed sample size) forms the sampling distribution of the sample mean.

The key question is: how does the sample mean relate to the population mean? To understand this, we refer to the Central Limit Theorem (CLT).

3. Central Limit Theorem (CLT):

The Central Limit Theorem is one of the most important results in statistics. It states that, regardless of the distribution of the population, if you take sufficiently large random samples from the population, the distribution of the sample means will be approximately normal (bell-shaped). This is true even if the population distribution is not normal.

Formal Statement of CLT:

Given a population with mean μ and standard deviation σ, if you take random samples of size n from that population:

o The mean of the sample means will be equal to the population mean μ.

o The standard deviation of the sample means (also called the standard error) will be σ/√n, where n is the sample size.

o As the sample size n increases, the sampling distribution of the sample mean becomes more tightly concentrated around the population mean, forming a normal distribution even if the population distribution is not normal.

4. Key Components of the CLT:

Mean of the Sampling Distribution: The mean of the sampling distribution of the sample mean is the same as the population mean, i.e., μ_x̄ = μ.

Standard Deviation of the Sampling Distribution (Standard Error): The standard error SE is the standard deviation of the sampling distribution. It is calculated as SE = σ/√n, where σ is the population standard deviation and n is the sample size. As the sample size increases, the standard error decreases.

Shape of the Sampling Distribution: As the sample size increases, the distribution of the sample means becomes approximately normal (bell-shaped) even if the population distribution is not normal.

5. Importance of Sample Size:

The sample size plays a crucial role in the precision of your estimates. Larger sample sizes lead to smaller standard errors, meaning the sample means will be closer to the population mean.

According to the CLT, the more samples you take, the closer the distribution of sample means will be to a normal distribution, even if the population distribution is not normal. In practice, a sample size of n ≥ 30 is considered large enough for the CLT to apply, though this can vary based on the population distribution.
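
A small sketch of how the standard error σ/√n shrinks as the sample size grows (the population standard deviation of 15 is a made-up value):

import numpy as np

sigma = 15  # assumed population standard deviation
for n in [10, 30, 100, 1000]:
    se = sigma / np.sqrt(n)  # standard error of the sample mean
    print(f"n = {n:4d}  ->  SE = {se:.2f}")
# SE falls from 4.74 at n=10 to 0.47 at n=1000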

6. Law of Large Numbers:

The Law of Large Numbers (LLN) is another important concept that states:

As the sample size increases, the sample mean will get closer to the population mean.

For example, if you randomly sample from a population and calculate the mean of each
sample, as the number of samples grows, the average of those sample means will tend to get
closer to the true population mean.

This law supports the idea that with larger samples, your estimates will be more accurate and
less variable.

7. Real-World Application:

Sampling is used extensively in fields like market research, quality control, medical testing,
and opinion polling. For example:

Polling: When conducting political polls, survey organizations often select a small group (sample) of people from a large population. The poll results (sample mean) are then used to estimate the opinions of the larger population.

Quality Control: In manufacturing, a company may randomly sample a few products from a production line to estimate the overall quality of the products being produced.

By using statistical techniques like sampling and the CLT, we can make informed decisions based on a subset of data rather than needing to assess every possible data point in the population.

Conclusion

Understanding the concepts of sampling, sample means, and the Central Limit Theorem allows
statisticians to make reliable inferences about populations from sample data. The Central Limit
Theorem is particularly powerful because it enables the use of normal distribution
approximations for the sampling distribution of the mean, regardless of the original
population's distribution. This is a cornerstone of inferential statistics and is essential for
hypothesis testing, confidence intervals, and many other statistical methods.
Practical No.4

Code :
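
The code listing is missing from this copy. Below is a minimal sketch matching the outcomes described above: 30 random samples drawn from a normal population, a histogram of the sample means, and a red dashed line at the population mean (all parameters are assumptions):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

# Assumed normal population (mean 100, standard deviation 20)
population = np.random.normal(loc=100, scale=20, size=10000)
population_mean = population.mean()

# Draw 30 random samples of size 50 and record each sample mean
sample_means = [np.random.choice(population, size=50, replace=False).mean()
                for _ in range(30)]

# Histogram of sample means with the population mean as a reference line
plt.hist(sample_means, bins=10, edgecolor='black')
plt.axvline(population_mean, color='red', linestyle='--',
            label=f'Population mean = {population_mean:.2f}')
plt.xlabel('Sample Mean')
plt.ylabel('Frequency')
plt.title('Distribution of Sample Means (30 samples, n = 50)')
plt.legend()
plt.show()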

Output:
Practical No 05

Title: Linear Regression Analysis

Objective:

• To enable users to input a dataset and specify independent and dependent variables.
• To perform linear regression to model the relationship between the specified variables.
• To visualize the results and interpret the regression output.

Problem Statement: Develop a Python program for linear regression analysis. Enable the user to input a dataset, specify independent and dependent variables, and perform linear regression to model the relationship between them.

Outcomes:

• Understanding the linear relationship between variables.
• Ability to compute regression coefficients and evaluate model performance.
• Visualization of regression results through plots.

Theory:

1. Linear Regression:
o Linear regression is a method used to model the relationship between a dependent
variable (Y) and one or more independent variables (X).
o The equation of simple linear regression is: Y = β0 + β1X + ε, where:
▪ β0 is the intercept,
▪ β1 is the slope (coefficient),
▪ ε is the error term.
o Linear regression can be classified into two types:
▪ Simple Linear Regression: Involves one independent variable.
▪ Multiple Linear Regression: Involves two or more independent variables.

2. Application of Linear Regression:


o Predictive Modelling: Used in forecasting trends, e.g., predicting sales based on
advertising expenditure.
o Medical Research: Analysing the impact of diet on cholesterol levels.
o Finance: Predicting stock prices based on historical data.
o Social Sciences: Understanding the impact of education levels on income.
Algorithm :
1. Accept user input for dataset (CSV file).
2. Allow user to select independent and dependent variables.
3. Perform linear regression using Python's sklearn library.
4. Display regression coefficients, R-squared value, and significance metrics.
5. Plot the regression line along with the dataset.

Conclusion:

This practical provides a fundamental understanding of linear regression and its implementation in
Python. Users can analyse relationships between variables, interpret key statistical outputs, and
visualize results using regression plots. This technique is widely applicable in fields such as finance,
economics, and machine learning for predictive modelling.
Code :

# Import necessary libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
file_path = r"C:\Users\Uday Pagar\OneDrive\Documents\Salary_Data.csv"
df = pd.read_csv(file_path)

# Define independent and dependent variables


independent_var = "YearsExperience" # Chosen independent variable
dependent_var = "Salary" # Chosen dependent variable

# Prepare data
X = df[[independent_var]] # Selecting the independent variable
y = df[dependent_var] # Selecting the dependent variable

# Split data into training (80%) and testing (20%) sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Perform Linear Regression


model = LinearRegression()
model.fit(X_train, y_train)

# Predict values for test set


y_pred = model.predict(X_test)

# Display model parameters


print(f"Coefficient: {model.coef_[0]}")
print(f"Intercept: {model.intercept_}")
print(f"R-squared: {r2_score(y_test, y_pred)}")
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")

# Plot results
plt.figure(figsize=(8, 5))
plt.scatter(X_test, y_test, color='blue', label='Actual Data') # Scatter plot of actual values
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression Line') # Regression line
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Linear Regression: Experience vs Salary')
plt.legend()
plt.grid(True)
plt.show()

Output :
Title: Statistical Analysis of a Dataset Using Python

Objective:
To compute and analyze key statistical measures such as mean, median,
mode, variance, and standard deviation for a given dataset using Python
libraries (numpy and statistics).

Problem Statement:
Calculate mean, median, mode, variance and standard deviation of a
dataset.

Outcomes:
By the end of this analysis, you will be able to:
1. Understand and calculate the mean, median, and mode of a dataset.
2. Compute variance and standard deviation to measure data
dispersion.
3. Differentiate between sample and population-based statistical
calculations.
4. Implement statistical computations using Python efficiently.

Theory:
data = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Statistical measures provide insights into data distribution and variability.
1. Mean: The arithmetic average of all values in a dataset. It represents
the central tendency.
Mean = (99+86+87+88+111+86+103+87+94+78+77+85+86)
/ 13 = 89.77
2. Median: The middle value when the dataset is sorted. If the number of elements is even, it is the average of the two middle values.

77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111 → median = 87 (the 7th of 13 values)

3. Mode: The most frequently occurring value in a dataset. A dataset may have one or multiple modes.

99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 → mode = 86
4. Variance: Measures the spread of data points around the mean. A
higher variance indicates greater variability.
data = [32,111,138,28,59,77,97]

1. Find the mean:

(32+111+138+28+59+77+97) / 7 = 77.4

2. For each value: find the difference from the mean:

32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = - 0.4
97 - 77.4 = 19.6

3. For each difference: find the square value:

(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16

4. The variance is the average of these squared differences:

(2061.16 + 1128.96 + 3672.36 + 2440.36 + 338.56 + 0.16 + 384.16) / 7 = 1432.2
5. Standard Deviation: The square root of variance, representing how
much data deviates from the mean.
Standard Deviation = Sqrt(Variance) = sqrt(1432.2) = 37.8
Python provides built-in functions in numpy and statistics to efficiently
compute these measures.
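
As a quick check of the variance worked out above, note that numpy's default is the population variance (ddof=0), which matches the hand calculation:

import numpy as np

data = [32, 111, 138, 28, 59, 77, 97]
print(np.var(data))          # population variance (ddof=0) -> ~1432.24
print(np.std(data))          # population standard deviation -> ~37.85
print(np.var(data, ddof=1))  # sample variance (divides by n-1) -> ~1670.95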

Conclusion:
This analysis successfully computed fundamental statistical measures using
Python. The results provide insights into the dataset’s distribution and
variability. The mean indicates the central value, the median gives the
midpoint, and the mode highlights the most frequent value. The variance and
standard deviation quantify data dispersion. Understanding these measures is
essential for statistical analysis and data science applications.

CODE:
import statistics

import numpy

data = [10,15,82,33,82,55,44,91,30,90]

mean = numpy.mean(data)

mode = statistics.mode(data)

median = numpy.median(data)

std = numpy.std(data)  # population standard deviation (ddof=0)

var = numpy.var(data)  # population variance (ddof=0); pass ddof=1 for the sample variance

print("Mean: ", mean)

print("Mode: ", mode)

print("Median: ", median)

print("Variance: ", var)

print("Standard Deviation: ", std)


OUTPUT:
Title: Multiple Regression Analysis Using Python

Objective

To perform multiple regression analysis by allowing the user to input a dataset with multiple
independent variables, specify a dependent variable, and analyze the complex relationship
between variables using Python libraries such as pandas, sklearn, and matplotlib.

Problem Statement

Develop a Python program that:

1. Accepts a CSV dataset from the user.

2. Allows specification of a dependent variable and multiple independent variables.

3. Performs multiple regression analysis.

4. Displays regression coefficients and model performance (R-squared value).

5. Visualizes actual vs. predicted values using a scatter plot.

Outcomes

By the end of this practical, you will be able to:

1. Implement multiple regression analysis using Python.

2. Understand regression coefficients and model performance metrics.

3. Visualize relationships between dependent and independent variables.

Theory

Multiple regression is a statistical technique that models the relationship between a dependent variable and two or more independent variables. It extends simple linear regression by handling multidimensional data and providing insights into how variables influence each other.

Application and Real-Life Example

A common application of multiple regression is in the real estate industry:

• Dependent Variable: House prices

• Independent Variables:

o Square footage

o Number of bedrooms

o Location (e.g., proximity to schools, public transport)

o Age of the property

• Use Case:

o Identifies key factors impacting property prices

o Supports market analysis and pricing strategies

o Helps investors make informed decisions

Example

Suppose we have a dataset of houses with the following information:

House ID  Square Footage  Bedrooms  Location Score  Age  Price
1         1500            3         8               10   300000
2         2000            4         9               5    450000
3         1200            2         6               20   200000
4         1800            3         7               15   350000

Using multiple regression, we can predict the house price based on square footage,
bedrooms, location score, and age. The model will help estimate how much each factor
contributes to the price, enabling better pricing strategies and investment decisions.

Algorithm
1. Load the dataset using pandas.

2. Get user input for the dependent and independent variables.

3. Prepare the data for regression.

4. Train the multiple regression model using sklearn's LinearRegression.

5. Display regression coefficients and model performance (R-squared).

6. Predict values for the training set and visualize using a scatter plot.

7. Accept a new data sample from the user, predict the dependent variable, and display the
result.

Conclusion
The developed program will help in understanding complex data relationships, predicting
outcomes, and visualizing model performance effectively. Mastering multiple regression
analysis is essential for advanced statistical analysis and data science applications.
Code with Graphical Representation

import pandas as pd

import numpy as np

from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt

import seaborn as sns

# Load dataset

file_path = input("Enter the file path of the dataset (CSV format): ")

data = pd.read_csv(file_path)

print("Dataset loaded successfully!")

print(data.head())

# Specify the dependent and independent variables

dependent_var = input("Enter the dependent variable: ")

independent_vars = input("Enter the independent variables (comma-separated): ").split(',')

independent_vars = [var.strip() for var in independent_vars]

# Prepare the data for regression

X = data[independent_vars]

y = data[dependent_var]

# Perform multiple regression

model = LinearRegression()

model.fit(X, y)

# Display the regression coefficients

coefficients = pd.DataFrame({'Variable': independent_vars, 'Coefficient': model.coef_})


print("\nRegression Coefficients:")

print(coefficients)

# Display the model performance

print("\nModel Performance:")

print("R-squared:", model.score(X, y))

# Plot the predicted vs actual values

y_pred = model.predict(X)

plt.figure(figsize=(8, 6))

sns.scatterplot(x=y, y=y_pred)

plt.xlabel('Actual Values')

plt.ylabel('Predicted Values')

plt.title('Actual vs Predicted Values')

plt.axline([0, 0], [1, 1], color='red', linestyle='--')

plt.show()
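
Step 7 of the algorithm (accepting a new data sample from the user and predicting the dependent variable) is not shown above; a minimal sketch that reuses the fitted model, independent_vars, and dependent_var from the program:

# Accept one value per independent variable and predict the dependent variable
new_values = [float(input(f"Enter value for {var}: ")) for var in independent_vars]

new_sample = pd.DataFrame([new_values], columns=independent_vars)
prediction = model.predict(new_sample)[0]
print(f"Predicted {dependent_var}: {prediction:.2f}")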
