Data Preprocessing & Visualization1

Uploaded by

Zeha 1

Step 1: Load the Data

In [4]: import pandas as pd

# Load the data
df = pd.read_csv('data.csv')
print("Initial Data:\n", df)

Initial Data:
   ID   Age   Salary   Department  Experience
0   1  25.0  50000.0        Sales           2
1   2  30.0  60000.0  Engineering           5
2   3  22.0  45000.0        Sales           1
3   4  35.0      NaN           HR          10
4   5  28.0  70000.0  Engineering           4
5   6  40.0  80000.0           HR          15
6   7  38.0  75000.0        Sales          12
7   8   NaN  62000.0  Engineering           7
8   9  45.0  90000.0           HR          20
9  10  32.0  54000.0        Sales           6
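
Since `data.csv` itself is not included here, the frame can be rebuilt from the printed output above. The following is a minimal sketch under that assumption: the values are copied from the printout, and the variable name `missing` is only illustrative. It also counts the missing entries per column, which is a useful check before choosing how to fill them.

```python
import numpy as np
import pandas as pd

# Values copied from the "Initial Data" printout; NaN marks the two missing cells.
df = pd.DataFrame({
    "ID": list(range(1, 11)),
    "Age": [25, 30, 22, 35, 28, 40, 38, np.nan, 45, 32],
    "Salary": [50000, 60000, 45000, np.nan, 70000,
               80000, 75000, 62000, 90000, 54000],
    "Department": ["Sales", "Engineering", "Sales", "HR", "Engineering",
                   "HR", "Sales", "Engineering", "HR", "Sales"],
    "Experience": [2, 5, 1, 10, 4, 15, 12, 7, 20, 6],
})

# Count missing values per column before deciding how to fill them.
missing = df.isna().sum()
print(missing)
```

Only `Age` and `Salary` contain gaps (one each), which matches the two `NaN` cells in the printout.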

Step 2: Data Preprocessing

1) Handling Missing Values


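In [5]: # Fill missing values in 'Age' and 'Salary' with the mean of the respective columns.
# Assign the result back rather than calling fillna(inplace=True) on a column
# selection: recent pandas versions warn about that pattern, and under
# Copy-on-Write it leaves df unchanged.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print("\nData after handling missing values:\n", df)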

Data after handling missing values:
   ID        Age        Salary   Department  Experience
0   1  25.000000  50000.000000        Sales           2
1   2  30.000000  60000.000000  Engineering           5
2   3  22.000000  45000.000000        Sales           1
3   4  35.000000  65111.111111           HR          10
4   5  28.000000  70000.000000  Engineering           4
5   6  40.000000  80000.000000           HR          15
6   7  38.000000  75000.000000        Sales          12
7   8  32.777778  62000.000000  Engineering           7
8   9  45.000000  90000.000000           HR          20
9  10  32.000000  54000.000000        Sales           6
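
The two filled-in cells (Age 32.777778 in row 7, Salary 65111.111111 in row 3) can be checked by hand: `Series.mean()` skips `NaN` by default, so each mean is taken over the nine observed values. A small sketch of that arithmetic, with the column values copied from the initial printout (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

age = pd.Series([25, 30, 22, 35, 28, 40, 38, np.nan, 45, 32], name="Age")
salary = pd.Series([50000, 60000, 45000, np.nan, 70000,
                    80000, 75000, 62000, 90000, 54000], name="Salary")

# mean() ignores NaN, so each mean is the sum of 9 values divided by 9.
age_mean = age.mean()        # 295 / 9
salary_mean = salary.mean()  # 586000 / 9

age_filled = age.fillna(age_mean)
salary_filled = salary.fillna(salary_mean)

print(round(age_mean, 6), round(salary_mean, 6))
```

Mean imputation keeps each column's mean unchanged but shrinks its variance slightly, which is worth keeping in mind for a data set this small.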

2) Handling Outliers
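In [6]: # Remove outliers from 'Salary' using the IQR method
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
# Keep rows whose Salary lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# .copy() makes the result an independent frame, so column assignments
# in later cells do not trigger SettingWithCopyWarning.
df = df[~((df['Salary'] < (Q1 - 1.5 * IQR)) | (df['Salary'] > (Q3 + 1.5 * IQR)))].copy()
print("\nData after removing outliers:\n", df)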

Data after removing outliers:
   ID        Age        Salary   Department  Experience
0   1  25.000000  50000.000000        Sales           2
1   2  30.000000  60000.000000  Engineering           5
2   3  22.000000  45000.000000        Sales           1
3   4  35.000000  65111.111111           HR          10
4   5  28.000000  70000.000000  Engineering           4
5   6  40.000000  80000.000000           HR          15
6   7  38.000000  75000.000000        Sales          12
7   8  32.777778  62000.000000  Engineering           7
8   9  45.000000  90000.000000           HR          20
9  10  32.000000  54000.000000        Sales           6
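
All ten rows survive the filter, and computing the fences explicitly shows why. The sketch below assumes the imputed `Salary` column from the previous step (values copied from the printout, with the filled cell written as 586000 / 9) and pandas' default linear interpolation for quantiles. The names `lower`, `upper` and `outliers` are illustrative.

```python
import pandas as pd

salary = pd.Series([50000, 60000, 45000, 586000 / 9, 70000,
                    80000, 75000, 62000, 90000, 54000], name="Salary")

q1 = salary.quantile(0.25)   # 55500.0
q3 = salary.quantile(0.75)   # 73750.0
iqr = q3 - q1                # 18250.0

# Tukey fences: values outside [lower, upper] count as outliers.
lower = q1 - 1.5 * iqr       # 28125.0
upper = q3 + 1.5 * iqr       # 101125.0

outliers = salary[(salary < lower) | (salary > upper)]
print(f"fences: [{lower}, {upper}], outliers found: {len(outliers)}")
```

The smallest salary (45000) and the largest (90000) both fall inside the fences, so nothing is removed.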

3) Encoding Categorical Variables


In [7]: from sklearn.preprocessing import LabelEncoder

# Encode 'Department' column
label_encoder = LabelEncoder()
df['Department'] = label_encoder.fit_transform(df['Department'])
print("\nData after encoding categorical variables:\n", df)

Data after encoding categorical variables:
   ID        Age        Salary  Department  Experience
0   1  25.000000  50000.000000           2           2
1   2  30.000000  60000.000000           0           5
2   3  22.000000  45000.000000           2           1
3   4  35.000000  65111.111111           1          10
4   5  28.000000  70000.000000           0           4
5   6  40.000000  80000.000000           1          15
6   7  38.000000  75000.000000           2          12
7   8  32.777778  62000.000000           0           7
8   9  45.000000  90000.000000           1          20
9  10  32.000000  54000.000000           2           6
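
`LabelEncoder` numbers the classes in sorted order, which is why Engineering becomes 0, HR becomes 1 and Sales becomes 2 in the output above. The sketch below reproduces that mapping with pandas only (`pd.factorize(sort=True)` is used here as a stand-in, not as the notebook's method) and shows one-hot encoding as a possible alternative when the integer codes should not imply an order between departments.

```python
import pandas as pd

dept = pd.Series(["Sales", "Engineering", "Sales", "HR", "Engineering",
                  "HR", "Sales", "Engineering", "HR", "Sales"],
                 name="Department")

# factorize(sort=True) assigns codes in sorted label order,
# giving the same codes as LabelEncoder for this column.
codes, labels = pd.factorize(dept, sort=True)
mapping = {label: code for code, label in enumerate(labels)}
print(mapping)

# One-hot alternative: one indicator column per department, no implied order.
one_hot = pd.get_dummies(dept, prefix="Department")
print(one_hot.columns.tolist())
```

Label codes are compact and fine for tree-based models or plotting; distance-based or linear models may read 0 < 1 < 2 as a ranking, which is where the one-hot form tends to be the safer choice.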

4) Scaling and Normalization


In [9]: from sklearn.preprocessing import StandardScaler

# Scale numeric columns
scaler = StandardScaler()
df[['Age', 'Salary', 'Experience']] = scaler.fit_transform(df[['Age', 'Salary', 'Experience']])
print("\nData after scaling and normalization:\n", df)

Data after scaling and normalization:
   ID        Age     Salary  Department  Experience
0   1  -1.170477  -1.140700           2   -1.083228
1   2  -0.418027  -0.385825           0   -0.559085
2   3  -1.621947  -1.518138           2   -1.257942
3   4   0.334422   0.000000           1    0.314485
4   5  -0.719007   0.369050           0   -0.733799
5   6   1.086871   1.123925           1    1.188056
6   7   0.785892   0.746488           2    0.663914
7   8   0.000000  -0.234850           0   -0.209657
8   9   1.839321   1.878801           1    2.061627
9  10  -0.117048  -0.838750           2   -0.384371
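
`StandardScaler` replaces each value x with z = (x - mean) / std, where std is the population standard deviation (`ddof=0`), not the sample one that pandas uses by default. The sketch below recomputes the `Age` column by hand under that assumption, using the imputed values from Step 2 (the filled cell is written as 295 / 9).

```python
import pandas as pd

age = pd.Series([25, 30, 22, 35, 28, 40, 38, 295 / 9, 45, 32], name="Age")

# Population standard deviation (ddof=0) matches StandardScaler;
# pandas' default std() uses ddof=1 and would give slightly smaller z-scores.
z = (age - age.mean()) / age.std(ddof=0)
print(z.round(6).tolist())
```

The imputed row sits exactly at the mean, so its z-score is 0, which explains the 0.000000 entries for Age in row 7 and Salary in row 3.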

Step 3: Data Visualization


In [10]: import matplotlib.pyplot as plt
import seaborn as sns

# All four panels share one 2x2 figure. Age, Salary and Experience are
# standardized at this point, so the axes show z-scores, not raw units.
plt.figure(figsize=(12, 8))

# Scatter plot of Age vs Salary
plt.subplot(2, 2, 1)
sns.scatterplot(x='Age', y='Salary', data=df)
plt.title('Scatter plot of Age vs Salary')

# Histogram of Age
plt.subplot(2, 2, 2)
sns.histplot(df['Age'], kde=True)
plt.title('Histogram of Age')

# Heatmap of Correlation Matrix
plt.subplot(2, 2, 3)
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Heatmap of Correlation Matrix')

# Boxplot of Salary by Department
# (Department holds the encoded labels: 0 = Engineering, 1 = HR, 2 = Sales)
plt.subplot(2, 2, 4)
sns.boxplot(x='Department', y='Salary', data=df)
plt.title('Boxplot of Salary by Department')

plt.tight_layout()
plt.show()

C:\Users\Reena\anaconda3\Lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Reena\anaconda3\Lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
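
The heatmap panel is drawn from `df.corr()`, which at this stage also includes `ID` (a row identifier) and the integer-coded `Department`. As a sketch, assuming only the three measured columns are of interest, the matrix can be restricted before plotting. The values below are the imputed ones from Step 2; Pearson correlation is unchanged by standard scaling, so the unscaled numbers give the same matrix.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 22, 35, 28, 40, 38, 295 / 9, 45, 32],
    "Salary": [50000, 60000, 45000, 586000 / 9, 70000,
               80000, 75000, 62000, 90000, 54000],
    "Experience": [2, 5, 1, 10, 4, 15, 12, 7, 20, 6],
})

# Pearson correlation between the measured columns only.
corr = df[["Age", "Salary", "Experience"]].corr()
print(corr.round(3))
```

The resulting `corr` can be passed to `sns.heatmap(corr, annot=True, cmap='coolwarm')` in place of `df.corr()`.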