DSBDA_prac2

The document outlines a data analysis process using a dataset of student performance, including loading the data, checking for missing values, and filling them with mean values. It also describes the creation of boxplots to visualize the data, the calculation of z-scores to identify outliers, and the removal of these outliers from the dataset. Additionally, it includes steps for installing necessary libraries and applying statistical methods to clean and analyze the data.

Uploaded by

Manasi Deshmukh

In [70]: import pandas as pd

In [71]: df = pd.read_csv("D:\\Jupyter notebook\\datasets_74977_169835_StudentsPerformance.csv")

In [72]: df.head()

Out[72]:
   gender  race/ethnicity  parental level of education         lunch  test preparation course  math score  reading score  writing score
0  female         group B            bachelor's degree      standard                     none        72.0           72.0           74.0
1  female         group C                 some college      standard                completed        69.0           90.0           88.0
2  female         group B              master's degree      standard                     none        90.0           95.0           93.0
3    male         group A           associate's degree  free/reduced                     none        47.0           57.0           44.0
4    male         group C                 some college      standard                     none        76.0           78.0           75.0

In [73]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 1000 non-null object
1 race/ethnicity 1000 non-null object
2 parental level of education 1000 non-null object
3 lunch 1000 non-null object
4 test preparation course 1000 non-null object
5 math score 997 non-null float64
6 reading score 997 non-null float64
7 writing score 998 non-null float64
dtypes: float64(3), object(5)
memory usage: 62.6+ KB

In [74]: df.isnull()

Out[74]:
     gender  race/ethnicity  parental level of education         lunch  test preparation course  math score  reading score  writing score
  0   False           False                        False         False                    False       False          False          False
  1   False           False                        False         False                    False       False          False          False
  2   False           False                        False         False                    False       False          False          False
  3   False           False                        False         False                    False       False          False          False
  4   False           False                        False         False                    False       False          False          False
...     ...             ...                          ...           ...                      ...         ...            ...            ...
995   False           False                        False         False                    False       False          False          False
996   False           False                        False         False                    False       False          False          False
997   False           False                        False         False                    False       False          False          False
998   False           False                        False         False                    False       False          False          False
999   False           False                        False         False                    False       False          False          False

1000 rows × 8 columns

In [75]: df.isnull().sum()

Out[75]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 3
reading score 3
writing score 2
dtype: int64

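In [76]: # Assign the filled column back; fillna(..., inplace=True) on a selected column is chained assignment, which recent pandas versions warn about
df['reading score'] = df['reading score'].fillna(df['reading score'].mean())
df['math score'] = df['math score'].fillna(df['math score'].mean())
df['writing score'] = df['writing score'].fillna(df['writing score'].mean())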

In [77]: df.isnull().sum()

Out[77]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64
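
The three per-column fills above can also be written as a single step. The following is a minimal sketch on a small made-up frame (`toy`, `score_cols`, `col_means`, and `filled` are illustrative names, not part of the notebook), assuming the goal is the same mean imputation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the three score columns.
toy = pd.DataFrame({
    "math score": [72.0, np.nan, 90.0, 47.0],
    "reading score": [72.0, 90.0, np.nan, 57.0],
    "writing score": [74.0, 88.0, 93.0, np.nan],
})

score_cols = ["math score", "reading score", "writing score"]

# mean() skips NaN by default, so each mean uses only the observed values.
col_means = toy[score_cols].mean()

# Passing a Series to DataFrame.fillna fills each column with the value
# stored under the matching column label.
filled = toy.copy()
filled[score_cols] = toy[score_cols].fillna(col_means)

print(filled)
print(filled.isnull().sum())
```

Working on a copy keeps the original frame available for comparison. Note that mean imputation leaves the column mean unchanged but shrinks its variance slightly, which matters little here with only 2 or 3 missing values per column.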

In [78]: df.boxplot()

Out[78]: <Axes: >

In [79]: newdf = df[df["math score"]>20]

In [80]: !pip install matplotlib

Defaulting to user installation because normal site-packages is not writeable


Requirement already satisfied: matplotlib in c:\users\manasi deshmukh\appdata\roaming\python\python312\site-packages (3.8.2)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\manasi deshmukh\appdata\roaming\python\python312\site-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in c:\users\manasi deshmukh\appdata\roaming\python\python312\site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\manasi deshmukh\appdata\roaming\python\python312\site-packages (from matplotlib) (4.47.2)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\manasi deshmukh\appdata\roaming\python\python312\site-packages (from matplotlib) (1.4.5)
Requirement already satisfied: numpy<2,>=1.21 in c:\users\manasi deshmukh\appdata\roaming\python\python312\site-packages (from matplotlib) (1.26.3)
Requirement already satisfied: packaging>=20.0 in c:\users\manasi deshmukh\appdata\roaming\python\python312\site-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=8 in c:\users\manasi deshmukh\appdata\roaming\python\python312\site-packages (from matplotlib) (10.1.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\manasi deshmukh\appdata\roaming\python\python312\site-packages (from matplotlib) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\manasi deshmukh\appdata\roaming\python\python312\site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\users\manasi deshmukh\appdata\roaming\python\python312\site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
[notice] A new release of pip is available: 23.2.1 -> 23.3.2
[notice] To update, run: python.exe -m pip install --upgrade pip

In [81]: import matplotlib.pyplot as plt

In [82]: newdf.boxplot()
plt.show()

In [83]: newdf = df[df["writing score"]>20]

In [84]: newdf.boxplot()
plt.show()
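
Note that In [83] filters `df` again rather than `newdf`, so the math-score cut-off from In [79] is not carried over into the second plot. If the intent was to apply both cut-offs together, one possible way is to combine the two boolean masks. This is a sketch on made-up data; `toy`, `mask`, and `both_filtered` are illustrative names:

```python
import pandas as pd

# Hypothetical toy data; the threshold of 20 is the one used in the notebook.
toy = pd.DataFrame({
    "math score": [72.0, 8.0, 90.0, 47.0],
    "writing score": [74.0, 30.0, 15.0, 44.0],
})

# Each comparison yields a boolean Series. The parentheses are required
# because & binds more tightly than > in Python.
mask = (toy["math score"] > 20) & (toy["writing score"] > 20)

# Only rows passing both conditions are kept.
both_filtered = toy[mask]
print(both_filtered)
```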

In [85]: %pip install scipy

Requirement already satisfied: scipy in c:\users\manasi deshmukh\appdata\local\programs\python\python39\lib\site-packages (1.12.0)


Requirement already satisfied: numpy<1.29.0,>=1.22.4 in c:\users\manasi deshmukh\appdata\local\programs\python\python39\lib\site-packages (from scipy) (1.26.1)
Note: you may need to restart the kernel to use updated packages.
WARNING: Ignoring invalid distribution -illow (c:\users\manasi deshmukh\appdata\local\programs\python\python39\lib\site-packages)
WARNING: Ignoring invalid distribution -illow (c:\users\manasi deshmukh\appdata\local\programs\python\python39\lib\site-packages)
WARNING: Ignoring invalid distribution -illow (c:\users\manasi deshmukh\appdata\local\programs\python\python39\lib\site-packages)
WARNING: Ignoring invalid distribution -illow (c:\users\manasi deshmukh\appdata\local\programs\python\python39\lib\site-packages)
WARNING: Ignoring invalid distribution -illow (c:\users\manasi deshmukh\appdata\local\programs\python\python39\lib\site-packages)
WARNING: Ignoring invalid distribution -illow (c:\users\manasi deshmukh\appdata\local\programs\python\python39\lib\site-packages)
WARNING: You are using pip version 22.0.4; however, version 23.3.2 is available.
You should consider upgrading via the 'C:\Users\Manasi Deshmukh\AppData\Local\Programs\Python\Python39\python.exe -m pip install --upgrade pip' command.

In [86]: from scipy.stats import zscore

In [87]: df['z_scores_math'] = zscore(df['math score'])

In [88]: df

Out[88]:
     gender  race/ethnicity  parental level of education         lunch  test preparation course  math score  reading score  writing score  z_scores_math
  0  female         group B            bachelor's degree      standard                     none        72.0           72.0           74.0       0.390843
  1  female         group C                 some college      standard                completed        69.0           90.0           88.0       0.192706
  2  female         group B              master's degree      standard                     none        90.0           95.0           93.0       1.579670
  3    male         group A           associate's degree  free/reduced                     none        47.0           57.0           44.0      -1.260305
  4    male         group C                 some college      standard                     none        76.0           78.0           75.0       0.655027
...     ...             ...                          ...           ...                      ...         ...            ...            ...            ...
995  female         group E              master's degree      standard                completed        88.0           99.0           95.0       1.447578
996    male         group C                  high school  free/reduced                     none        62.0           55.0           55.0      -0.269616
997  female         group C                  high school  free/reduced                completed        59.0           71.0           65.0      -0.467754
998  female         group D                 some college      standard                completed        68.0           78.0           77.0       0.126660
999  female         group D                 some college  free/reduced                     none        77.0           86.0           86.0       0.721073

1000 rows × 9 columns

In [89]: outliers = (df["z_scores_math"]> 1) | (df["z_scores_math"] < -1)

In [90]: outliers

Out[90]: 0 False
1 False
2 True
3 True
4 False
...
995 True
996 False
997 False
998 False
999 False
Name: z_scores_math, Length: 1000, dtype: bool

In [91]: df_no_math_score_outiler = df[(df.z_scores_math > -1) & (df.z_scores_math < 1)]
df_no_math_score_outiler

Out[91]:
     gender  race/ethnicity  parental level of education         lunch  test preparation course  math score  reading score  writing score  z_scores_math
  0  female         group B            bachelor's degree      standard                     none        72.0           72.0           74.0       0.390843
  1  female         group C                 some college      standard                completed        69.0           90.0           88.0       0.192706
  4    male         group C                 some college      standard                     none        76.0           78.0           75.0       0.655027
  5  female         group B           associate's degree      standard                     none        71.0           83.0           78.0       0.324798
  8    male         group D                  high school  free/reduced                completed        64.0           64.0           67.0      -0.137524
...     ...             ...                          ...           ...                      ...         ...            ...            ...            ...
994    male         group A                  high school      standard                     none        63.0           63.0           62.0      -0.203570
996    male         group C                  high school  free/reduced                     none        62.0           55.0           55.0      -0.269616
997  female         group C                  high school  free/reduced                completed        59.0           71.0           65.0      -0.467754
998  female         group D                 some college      standard                completed        68.0           78.0           77.0       0.126660
999  female         group D                 some college  free/reduced                     none        77.0           86.0           86.0       0.721073

697 rows × 9 columns

In [92]: df_no_math_score_outiler.boxplot()
plt.show()
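
The z-score filter above keeps only rows within one standard deviation of the mean, which is why just 697 of the 1000 rows remain. A cut-off of 3 is a more common convention when the aim is to drop only extreme values. The sketch below computes z-scores by hand with pandas, assuming the default behaviour of `scipy.stats.zscore` (population standard deviation, `ddof=0`). It uses only the five math scores shown by `df.head()`, so its z-values differ from the notebook's, which are computed over all 1000 rows. The helper `keep_within` is a hypothetical name introduced for illustration.

```python
import pandas as pd

# First five math scores from df.head(); a stand-in for the full column.
scores = pd.Series([72.0, 69.0, 90.0, 47.0, 76.0], name="math score")

# z = (x - mean) / std, using the population standard deviation (ddof=0).
z = (scores - scores.mean()) / scores.std(ddof=0)

def keep_within(series, z_values, threshold):
    """Return the entries whose absolute z-score is below the threshold."""
    return series[z_values.abs() < threshold]

print(z.round(3))
print(keep_within(scores, z, 1.0))  # strict cut-off, as used in the notebook
print(keep_within(scores, z, 3.0))  # looser cut-off, a common convention
```

On this small sample the cut-off of 1 drops two of the five values, while the cut-off of 3 drops none.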

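In [93]: def RemoveOutlier(df, var):
    # Tukey's rule: the fences sit 1.5 * IQR beyond the first and third quartiles
    Q1 = df[var].quantile(0.25)
    Q3 = df[var].quantile(0.75)
    IQR = Q3 - Q1
    high = Q3 + 1.5 * IQR
    low = Q1 - 1.5 * IQR
    # Keep rows strictly above the lower fence and at or below the upper fence
    df = df[(df[var] > low) & (df[var] <= high)]
    print('Outliers removed in', var)
    return df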

In [94]: data = RemoveOutlier(df,'math score')

Outliers removed in math score
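
To make the IQR rule concrete, here is a small worked example on made-up numbers (the series `values` and the names `kept` and `removed` are illustrative, not taken from the dataset). It follows the same quartile and fence arithmetic as `RemoveOutlier`, with both fences treated as inclusive for simplicity:

```python
import pandas as pd

# Hypothetical values with one obvious outlier (60).
values = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 60.0])

q1 = values.quantile(0.25)   # 12.75 with pandas' default linear interpolation
q3 = values.quantile(0.75)   # 16.5
iqr = q3 - q1                # 3.75

low = q1 - 1.5 * iqr         # 7.125
high = q3 + 1.5 * iqr        # 22.125

kept = values[(values >= low) & (values <= high)]
removed = values[(values < low) | (values > high)]

print("fences:", low, high)
print("removed:", removed.tolist())
```

Because dropping rows changes the quartiles, running the same filter a second time on the result can flag points that passed the first time. That is presumably why the notebook calls `RemoveOutlier` on `math score` again later on.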

In [95]: col = "math score"
data.boxplot(column=col)
plt.show()

In [96]: data.boxplot()
plt.show()

In [99]: data = RemoveOutlier(data,'reading score')

Outliers removed in reading score

In [100… data.boxplot()
plt.show()

In [101… data = RemoveOutlier(data,'math score')

Outliers removed in math score

In [102… data.boxplot()
plt.show()