Week 6 - Data Cleaning
---
Missing data can interfere with analyses, skew results, or cause errors. Handling these
gaps effectively is crucial for reliable insights.
To identify missing values, use the .isnull() method, which returns a DataFrame with True for
each cell with missing data.
Example Code:
import pandas as pd
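import numpy as np

# A small illustrative DataFrame (assumed for this example; not from the original notes):
data = {'Name': ['Alice', 'Bob', 'Charlie', None],
        'Age': [25, np.nan, 35, 40]}
df = pd.DataFrame(data)

print(df.isnull())        # True wherever a value is missing
print(df.isnull().sum())  # count of missing values per column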
Explanation:
df.isnull() checks each value in the DataFrame and returns True for missing values.
df.isnull().sum() counts the missing values in each column, making it easy to see how many
gaps each column contains.
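To drop rows or columns that contain missing values, use .dropna(). A minimal sketch, reusing the DataFrame created above:
Example Code:
# Drop rows that contain at least one missing value
df_rows_dropped = df.dropna()
# Drop columns that contain at least one missing value
df_cols_dropped = df.dropna(axis=1)
print(df_rows_dropped)
print(df_cols_dropped)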
Explanation:
df.dropna() removes any row that contains at least one missing value.
This can also be done for columns by using df.dropna(axis=1) to drop columns with missing
data.
Use .fillna() to replace missing data with a specific value, such as the column mean or median,
or to propagate neighboring values with a forward or backward fill.
Example Code:
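# A minimal sketch, reusing the df with an 'Age' column from the earlier example (assumed):
df['Age'] = df['Age'].fillna(df['Age'].mean())   # fill missing ages with the column mean
# Alternatives: df['Age'].fillna(df['Age'].median()), df['Age'].ffill(), df['Age'].bfill()
print(df)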
Explanation:
df['Age'] = df['Age'].fillna(df['Age'].mean()) fills missing values in the Age column with the
column's mean. (The older inplace=True form also works but is discouraged in recent pandas versions.)
When handling missing data, consider the type and source of missingness. For example,
demographic studies might justify filling missing age data with the mean, while time series
analysis could require interpolation instead of filling with a simple value.
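As a rough illustration of the time-series case (a sketch; the sample series and dates below are assumed):
Example Code:
import numpy as np
import pandas as pd

ts = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0],
               index=pd.date_range('2024-01-01', periods=5, freq='D'))
print(ts.interpolate(method='time'))  # fill gaps based on the time distance between points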
---
Data normalization and scaling put variables on a common scale so that features with large
numeric ranges do not dominate those with small ones. This is especially important for models
such as K-nearest neighbors, linear regression trained with gradient descent or regularization,
and neural networks. The first example below applies min-max normalization; the second applies
standardization (z-scores).
Example Code:
from sklearn.preprocessing import MinMaxScaler

# Sample DataFrame
data = {'Score': [200, 300, 400, 500]}
df = pd.DataFrame(data)

scaler = MinMaxScaler()
df['Score_scaled'] = scaler.fit_transform(df[['Score']])
Explanation:
scaler.fit_transform(df[['Score']]) rescales the Score values to the range 0 to 1 using
(x - min) / (max - min); the result is stored in a new Score_scaled column.
Example Code:
from sklearn.preprocessing import StandardScaler
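# A minimal sketch, reusing the Score DataFrame from the previous example (assumed):
scaler = StandardScaler()
df['Score_standardized'] = scaler.fit_transform(df[['Score']])
print(df)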
Explanation:
scaler.fit_transform(df[['Score']]) computes the column's mean and standard deviation, then
rescales the values so they have mean 0 and standard deviation 1.
Normalization is ideal when your data has a known fixed range (e.g., percentages, scores).
Standardization works better when your data has varying scales, and you want to focus on
statistical properties like mean and standard deviation.
---
Data transformations help adjust data for more effective analysis. This can involve stabilizing
variance, dealing with skewed distributions, or making data more normally distributed, which is
often a prerequisite for many statistical tests or models.
2. Log Transformation
When data is heavily skewed, a log transformation can help normalize the distribution. This is
especially useful for variables with a long tail (e.g., income or sales).
Example Code:
import numpy as np
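import pandas as pd

# A small sketch (the right-skewed sample data below is assumed):
skewed = pd.Series([1, 2, 2, 3, 5, 8, 13, 100, 1000], name='Sales')
log_transformed = np.log1p(skewed)   # log(1 + x); use np.log(skewed) when all values are > 0
print(log_transformed)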
The Box-Cox transformation is often used to stabilize variance and make data more normally
distributed. It is most effective when data is positive and exhibits skewness.
Example Code:
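# A minimal sketch using scipy (the positive sample data below is assumed):
from scipy.stats import boxcox
positive_data = np.array([1.0, 2.0, 2.5, 3.0, 50.0, 120.0])
transformed, fitted_lambda = boxcox(positive_data)   # data must be strictly positive
print(transformed)
print("Fitted lambda:", fitted_lambda)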
Log Transformation is useful when dealing with positively skewed data or data with extreme
values (e.g., income, sales figures).
Box-Cox Transformation can handle a wider range of distribution shapes, but requires positive
data.
Understanding the nature of your data and its distribution is key to selecting the right
transformation.
---
Data quality checks ensure that your data is accurate, consistent, and ready for analysis.
Poor-quality data can lead to misleading conclusions and inaccurate models.
2. Removing Duplicates
Duplicates in data can distort analysis, especially when the same information is counted multiple
times. It's essential to identify and remove them.
Example Code:
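# A small sample DataFrame with one repeated row (assumed for illustration):
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Alice'],
        'Age': [25, 30, 25]}
df = pd.DataFrame(data)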
# Identify duplicates
print("Duplicate Rows:\n", df[df.duplicated()])
# Remove duplicates
df_no_duplicates = df.drop_duplicates()
print("DataFrame without duplicates:\n", df_no_duplicates)
3. Validating Data Types
Sometimes data types may be misrepresented (e.g., numerical data stored as strings).
Validating and correcting data types ensures that operations and analyses are performed
correctly.
Example Code:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': ['25', '30', '35']}       # Age stored as strings
df = pd.DataFrame(data)
print(df.dtypes)                         # Age shows up as object (string)
df['Age'] = pd.to_numeric(df['Age'])     # convert Age to a numeric dtype
print(df.dtypes)                         # Age is now int64
4. Identifying Outliers
Outliers can skew statistical results, so it’s important to identify them. Z-scores and visualization
techniques like boxplots can help identify outliers.
Example Code:
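# A sketch using z-scores (the sample data below is assumed):
import pandas as pd
import numpy as np
df = pd.DataFrame({'Value': [10, 12, 11, 13, 12, 300]})
z_scores = (df['Value'] - df['Value'].mean()) / df['Value'].std()
print(df[np.abs(z_scores) > 2])   # flag points more than 2 standard deviations from the mean
# A boxplot is a quick visual check: df['Value'].plot(kind='box')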
5. Logical Consistency Checks
It's essential to check that the data makes sense according to real-world expectations. For
example, negative ages or dates in the future may be logically inconsistent.
Example Code:
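# A sketch of simple sanity checks (the sample data and column names below are assumed):
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob'],
                   'Age': [25, -3],
                   'SignupDate': pd.to_datetime(['2023-05-01', '2030-01-01'])})
print("Negative ages:\n", df[df['Age'] < 0])                           # ages below zero are impossible
print("Future dates:\n", df[df['SignupDate'] > pd.Timestamp.today()])  # dates in the future are suspect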
This covers Day 4: data quality checks for duplicates, data types, outliers, and logical consistency.
---
Additional Practice
Apply any logical consistency checks and ensure the data is ready for analysis or machine
learning.
---
This completes Week 6: Data Cleaning and Preprocessing. You should now be able to take raw
data and make it clean, consistent, and ready for deeper analysis or machine learning.