
Week 6: Data Cleaning and Preprocessing

---

Day 1: Handling Missing Data

1. Why Handle Missing Data?

Missing data can interfere with analyses, skew results, or cause errors. Handling these
gaps effectively is crucial for reliable insights.

2. Identifying Missing Data

To identify missing values, use the .isnull() method, which returns a DataFrame of the same
shape with True in each cell that contains missing data.

Example Code:

import pandas as pd

data = {'Name': ['Alice', 'Bob', None, 'Charlie', 'Dave'],
        'Age': [25, None, 30, None, 45]}
df = pd.DataFrame(data)

# Identify missing data
print("Missing values:\n", df.isnull())

# Count missing values in each column
print("Missing values count:\n", df.isnull().sum())

Explanation:

df.isnull() checks each value in the DataFrame and returns True for missing values.

df.isnull().sum() counts the missing values in each column, making it easy to see how many
values each column is missing.

3. Dropping Missing Data

Use .dropna() to remove rows or columns with missing values.


Example Code:

# Drop rows with any missing values
df_no_missing = df.dropna()
print("DataFrame without missing values:\n", df_no_missing)

Explanation:

df.dropna() removes any row that contains at least one missing value.

This can also be done for columns by using df.dropna(axis=1) to drop columns with missing
data.
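
As a minimal sketch of the column-wise option mentioned above (reusing the same df), df.dropna(axis=1) drops columns that contain missing values, and the optional thresh parameter keeps rows with at least a given number of non-missing values:

# Drop columns that contain any missing values
# (here both columns have missing values, so the result has no columns)
df_no_missing_cols = df.dropna(axis=1)
print("Columns without missing values:\n", df_no_missing_cols)

# Keep only rows with at least 2 non-missing values
df_thresh = df.dropna(thresh=2)
print("Rows with at least 2 non-missing values:\n", df_thresh)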

4. Filling Missing Data

Use .fillna() to fill missing data with a specific value, like the mean, median, or forward/backward
fill.

Example Code:

# Fill missing values in the 'Age' column with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
print("DataFrame with missing values filled:\n", df)

Explanation:

df['Age'].fillna(df['Age'].mean()) returns a copy of the Age column with its missing values
replaced by the column's mean.

Assigning the result back to df['Age'] updates the DataFrame. This is preferred over calling
.fillna(..., inplace=True) on a selected column, which may not modify the original DataFrame in
recent versions of pandas.

5. Context and Considerations

When handling missing data, consider the type and source of missingness. For example,
demographic studies might justify filling missing age data with the mean, while time series
analysis could require interpolation instead of filling with a simple value.
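
As a minimal sketch of the interpolation alternative mentioned above (the dates and values below are invented for illustration), pandas can estimate gaps from neighbouring points:

import pandas as pd
import numpy as np

# Hypothetical daily time series with gaps
ts = pd.Series([10.0, np.nan, 14.0, np.nan, 20.0],
               index=pd.date_range('2024-01-01', periods=5, freq='D'))

# Linear interpolation fills each gap using its neighbouring values
print(ts.interpolate(method='linear'))
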
---

Day 2: Data Normalization and Scaling

1. Why Normalize and Scale Data?

Data normalization and scaling put variables on a comparable scale so that features with large
numeric ranges do not dominate simply because of their units. This is especially critical for
models like K-nearest neighbors, linear regression, and neural networks.

2. Normalization (Min-Max Scaling)

Use MinMaxScaler to scale data between a specific range, such as 0 to 1.

Example Code:

from sklearn.preprocessing import MinMaxScaler

# Sample DataFrame
data = {'Score': [200, 300, 400, 500]}
df = pd.DataFrame(data)

# Apply Min-Max scaling
scaler = MinMaxScaler()
df['Score_scaled'] = scaler.fit_transform(df[['Score']])
print("Normalized DataFrame:\n", df)

Explanation:

MinMaxScaler transforms each feature to fit a range, in this case, from 0 to 1.

scaler.fit_transform(df[['Score']]) scales the Score column, which will now have values between
0 and 1.

3. Standardization (Z-score Scaling)

Use StandardScaler to scale data with a mean of 0 and a standard deviation of 1.

Example Code:

from sklearn.preprocessing import StandardScaler

# Apply Standard scaling
scaler = StandardScaler()
df['Score_standardized'] = scaler.fit_transform(df[['Score']])
print("Standardized DataFrame:\n", df)

Explanation:

StandardScaler standardizes the data so it has a mean of 0 and standard deviation of 1.

scaler.fit_transform(df[['Score']]) computes the mean and standard deviation, and then scales
the data accordingly.

4. Choosing Normalization vs. Standardization

Normalization is ideal when your data has a known fixed range (e.g., percentages, scores).

Standardization works better when your data has varying scales, and you want to focus on
statistical properties like mean and standard deviation.
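
As a quick side-by-side illustration of the two approaches (a minimal sketch reusing the Score column from above):

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import pandas as pd

df = pd.DataFrame({'Score': [200, 300, 400, 500]})

# Same column, two scalings: values in [0, 1] vs. mean 0 / standard deviation 1
df['minmax'] = MinMaxScaler().fit_transform(df[['Score']])
df['zscore'] = StandardScaler().fit_transform(df[['Score']])
print(df)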


---

Day 3: Data Transformation

1. Why Transform Data?

Data transformations help adjust data for more effective analysis. This can involve stabilizing
variance, dealing with skewed distributions, or making data more normally distributed, which is
often a prerequisite for many statistical tests or models.

2. Log Transformation
When data is heavily skewed, a log transformation can help normalize the distribution. This is
especially useful for variables with a long tail (e.g., income or sales).

Example Code:

import numpy as np

# Sample DataFrame with skewed data
data = {'Income': [30000, 45000, 60000, 80000, 150000]}
df = pd.DataFrame(data)

# Apply log transformation
df['Income_log'] = np.log(df['Income'])
print("Log-transformed DataFrame:\n", df)

3. Power Transformation (Box-Cox Transformation)

The Box-Cox transformation is often used to stabilize variance and make data more normally
distributed. It is most effective when data is positive and exhibits skewness.

Example Code:

from sklearn.preprocessing import power_transform

# Apply power transformation (Box-Cox)
df['Income_boxcox'] = power_transform(df[['Income']], method='box-cox')
print("Power-transformed DataFrame:\n", df)

4. Context and Choosing Transformation Methods

Log Transformation is useful when dealing with positively skewed data or data with extreme
values (e.g., income, sales figures).

Box-Cox Transformation can handle a wider range of distribution shapes, but requires strictly
positive data.

Understanding the nature of your data and its distribution is key to selecting the right
transformation.
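
If the data contains zeros or negative values, a commonly used alternative (not covered above) is the Yeo-Johnson transform, which is also available through power_transform; a minimal sketch with invented profit/loss figures:

from sklearn.preprocessing import power_transform
import pandas as pd

# Hypothetical profit/loss figures, including a negative value
df_profit = pd.DataFrame({'Profit': [-2000, 500, 1500, 4000, 12000]})

# Yeo-Johnson accepts zero and negative inputs, unlike Box-Cox
df_profit['Profit_yeojohnson'] = power_transform(df_profit[['Profit']], method='yeo-johnson')
print(df_profit)
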
---

Day 4: Data Quality Checks

1. Importance of Data Quality Checks

Data quality checks ensure that your data is accurate, consistent, and ready for analysis.
Poor-quality data can lead to misleading conclusions and inaccurate models.

2. Removing Duplicates

Duplicates in data can distort analysis, especially when the same information is counted multiple
times. It's essential to identify and remove them.

Example Code:

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30]}
df = pd.DataFrame(data)

# Identify duplicates
print("Duplicate Rows:\n", df[df.duplicated()])

# Remove duplicates
df_no_duplicates = df.drop_duplicates()
print("DataFrame without duplicates:\n", df_no_duplicates)

3. Validating Data Types

Sometimes data types may be misrepresented (e.g., numerical data stored as strings).
Validating and correcting data types ensures that operations and analyses are performed
correctly.

Example Code:

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': ['25', '30', '35']}  # Age stored as strings
df = pd.DataFrame(data)

# Convert 'Age' to integer
df['Age'] = df['Age'].astype(int)
print("Data types after conversion:\n", df.dtypes)

4. Identifying Outliers

Outliers can skew statistical results, so it's important to detect them. Z-scores and visualization
techniques such as boxplots are common ways to do so.

Example Code:

from scipy import stats

data = {'Sales': [200, 250, 300, 500, 10000]}  # Outlier in Sales
df = pd.DataFrame(data)

# Identify outliers using Z-scores
df['z_score'] = stats.zscore(df['Sales'])

# |z| > 3 is the usual rule of thumb, but with only five observations no
# point can exceed |z| = 2, so a lower cutoff is used here for illustration
outliers = df[df['z_score'].abs() > 1.5]
print("Outliers:\n", outliers)

5. Checking for Logical Consistency

It's essential to check that the data makes sense according to real-world expectations. For
example, negative ages or dates in the future may be logically inconsistent.

Example Code:

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, -5, 30]}  # Invalid negative age
df = pd.DataFrame(data)

# Identify invalid ages
invalid_ages = df[df['Age'] < 0]
print("Rows with invalid ages:\n", invalid_ages)
Summary

Day 1: Handling Missing Data (identify, drop, fill missing values).

Day 2: Data Normalization and Scaling (Min-Max, Standardization).

Day 3: Data Transformation (log transformation, Box-Cox).

Day 4: Data Quality Checks (duplicates, data types, outliers, logical consistency).

---

Additional Practice

Take a dataset of your choice and perform the following tasks:

Handle missing values (drop or fill).

Normalize or scale numerical features.

Transform any skewed features.

Remove duplicates and correct any data type inconsistencies.

Apply any logical consistency checks and ensure the data is ready for analysis or machine
learning.

---

This completes Week 6: Data Cleaning and Preprocessing, which prepares you to take raw data
and make it clean, consistent, and ready for deeper analysis or machine learning.
