
Chapter 3. Pre-processing Data
Content

1. Handle Missing Values
2. Handle Outliers
3. Convert Data Types
4. Normalize or Scale Data
5. Feature Engineering
Identify Missing Values
# Load your dataset
data <- read.csv("DataCoSupplyChainDataset.csv")
# Check for missing values in the entire dataset
sum(is.na(data))
# Check for missing values in each column
colSums(is.na(data))
# Display rows with any missing values
data[!complete.cases(data), ]
Remove Missing Values
# Remove rows with any missing values
data_cleaned <- na.omit(data)
# Remove columns with any missing values
data_cleaned <- data[, colSums(is.na(data)) == 0]
Impute Missing Values
# Replace NA with 0
data[is.na(data)] <- 0
# Replace NA in a specific column with the mean
data$ColumnName[is.na(data$ColumnName)] <- mean(data$ColumnName, na.rm = TRUE)
# Replace NA in a specific column with the median
data$ColumnName[is.na(data$ColumnName)] <- median(data$ColumnName, na.rm = TRUE)
# Replace NA in a specific column with the mode
mode_value <- as.numeric(names(sort(table(data$ColumnName), decreasing = TRUE)[1]))
data$ColumnName[is.na(data$ColumnName)] <- mode_value
Content

1. Handle Missing Values
2. Handle Outliers
3. Convert Data Types
4. Normalize or Scale Data
5. Feature Engineering
Identify Outliers Using IQR
• Q1 (First Quartile): The 25th percentile of the data.
• Q3 (Third Quartile): The 75th percentile of the data.
• IQR (Interquartile Range): The range between Q1 and Q3 (the middle 50% of the data).
• Calculation: IQR = Q3 - Q1.
• Identifying Outliers:
• Outliers are data points that fall significantly outside the range of the majority of the data.
• The IQR method is a common technique for identifying outliers:
• Lower Bound: Q1 - 1.5 * IQR
• Upper Bound: Q3 + 1.5 * IQR
• Data points outside these bounds are considered outliers.
# Calculate Q1, Q3, and IQR
Q1 <- quantile(data$value, 0.25)
Q3 <- quantile(data$value, 0.75)
IQR <- Q3 - Q1
# Identify outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- data$value < lower_bound | data$value > upper_bound
outliers_indices <- which(outliers)
outliers_values <- data$value[outliers]
print(outliers_indices) # Indices of outliers
print(outliers_values)  # Values of outliers
Visualize Outliers Using Boxplot
# Load ggplot2 for plotting
library(ggplot2)
# Create a boxplot
ggplot(data, aes(y = value)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot to Identify Outliers", y = "Value") +
  theme_minimal()
Process Outliers
Remove Outliers
# Remove outliers from data
data_no_outliers <- data[!outliers, ]
# Verify by plotting
ggplot(data_no_outliers, aes(y = value)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot After Removing Outliers", y = "Value") +
  theme_minimal()
Transform Data
# Apply log transformation to reduce the effect of outliers
# (assumes all values are positive; log() of zero or negative values is undefined)
data$log_value <- log(data$value)
# Plot transformed data
ggplot(data, aes(y = log_value)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Log-Transformed Data", y = "Log(Value)") +
  theme_minimal()
Impute Outliers
# Replace outliers with the median
median_value <- median(data$value, na.rm = TRUE)
data$value[outliers] <- median_value

# Verify by plotting
ggplot(data, aes(y = value)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot After Imputing Outliers", y = "Value") +
  theme_minimal()
Summary
• Identify Outliers: Use summary statistics and visualizations like boxplots.
• Process Outliers: Choose a method based on your analysis goals:
• Remove Outliers: Exclude them from your dataset.
• Transform Data: Apply transformations to mitigate their impact.
• Impute Outliers: Replace them with a more representative value.
Content

1. Handle Missing Values
2. Handle Outliers
3. Convert Data Types
4. Normalize or Scale Data
5. Feature Engineering
Common Data Type Conversions in R
1. Convert to Numeric
2. Convert to Factor
3. Convert to Character
4. Convert to Date
5. Convert to POSIXct (Date-Time)
Convert to Numeric
• From Character/String to Numeric
data$num_str <- as.numeric(data$num_str)
str(data) # Check the structure to verify the conversion
Convert to Factor
• From Character/String to Factor
data$cat_str <- as.factor(data$cat_str)
str(data) # Check the structure to verify the conversion
Convert to Character
• From Factor to Character
data$cat_str <- as.character(data$cat_str)
str(data) # Check the structure to verify the conversion
Convert to Date
• From Character/String to Date
data$date_str <- as.Date(data$date_str, format = "%Y-%m-%d")
str(data) # Check the structure to verify the conversion
Convert to POSIXct (Date-Time)
• From Character/String to POSIXct
# Example date-time string
datetime_str <- "2020-01-01 12:34:56"
datetime <- as.POSIXct(datetime_str, format = "%Y-%m-%d %H:%M:%S")
print(datetime) # Verify the conversion

# Convert a column in a data frame
data$datetime_str <- c("2020-01-01 12:34:56", "2020-02-01 08:22:10", "2020-03-01 14:55:00")
data$datetime_str <- as.POSIXct(data$datetime_str, format = "%Y-%m-%d %H:%M:%S")
str(data) # Check the structure to verify the conversion
Convert Categorical Values to Numeric
# Convert gender to numeric: 1 for male, 0 for female
data$gender_numeric <- ifelse(data$gender == "male", 1,
                       ifelse(data$gender == "female", 0, NA))
# View the modified data frame
print(data)

# Convert categorical to numeric: 1 for one, 2 for two, 3 for three
data$category_numeric <- ifelse(data$category == "one", 1,
                         ifelse(data$category == "two", 2,
                         ifelse(data$category == "three", 3, NA)))
# View the modified data frame
print(data)
Content

1. Handle Missing Values
2. Handle Outliers
3. Convert Data Types
4. Normalize or Scale Data
5. Feature Engineering
Normalize or Scale Data
• Normalization or scaling of data is an important preprocessing step in data analysis.
1. Min-Max Normalization
2. Z-Score Standardization
Min-Max Normalization
Min-Max normalization scales the data to a fixed range, usually [0, 1].
# Min-Max Normalization function
min_max_normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
# Apply Min-Max Normalization to all numeric columns (requires dplyr)
library(dplyr)
data_normalized <- data %>% mutate_if(is.numeric, min_max_normalize)

# Apply Min-Max Normalization to specific columns
data$Sales <- min_max_normalize(data$Sales)
Z-Score Standardization
Z-Score standardization rescales the data to have a mean of 0 and a standard deviation of 1.
# Z-Score Standardization function
z_score_standardize <- function(x) {
  return ((x - mean(x)) / sd(x))
}
# Apply Z-Score Standardization to all numeric columns (requires dplyr)
data_standardized <- data %>% mutate_if(is.numeric, z_score_standardize)

# Apply Z-Score Standardization to specific columns
data$Sales <- z_score_standardize(data$Sales)
Feature Engineering
Imputation of Missing Values:
• Replace missing values with the mean, median, or mode of the column.
• Use more advanced techniques like K-nearest neighbors (KNN) imputation or regression imputation.
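The KNN idea can be sketched in base R (the column names here are made up for illustration; a real analysis would usually rely on a dedicated package): a missing value is filled with the mean of the target column over the k rows whose predictor values are closest.

```r
# KNN-style imputation sketch in base R (illustrative, not optimized).
# Fills NAs in `target` with the mean of that column over the k rows
# whose `predictor` values are closest to the row being imputed.
knn_impute <- function(predictor, target, k = 3) {
  for (i in which(is.na(target))) {
    complete <- which(!is.na(target))
    # Distances from row i to all rows with an observed target
    d <- abs(predictor[complete] - predictor[i])
    nearest <- complete[order(d)[seq_len(min(k, length(complete)))]]
    target[i] <- mean(target[nearest])
  }
  target
}

# Example with made-up data: row 3 has a missing y
df <- data.frame(x = c(1, 2, 3, 10, 11, 12),
                 y = c(2, 4, NA, 20, 22, 24))
df$y <- knn_impute(df$x, df$y, k = 2)
```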
Encoding Categorical Variables:
• Convert categorical variables into numeric form suitable for modeling.
• Techniques include one-hot encoding, label encoding, and target encoding.
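One-hot encoding can be done in base R with model.matrix() (the `color` column here is a hypothetical example): each factor level becomes its own 0/1 column.

```r
# One-hot encoding sketch using base R's model.matrix()
df <- data.frame(color = factor(c("red", "green", "blue", "red")))
# "+ 0" drops the intercept so every level gets its own 0/1 column
onehot <- model.matrix(~ color + 0, data = df)
colnames(onehot)  # one column per level of `color`
```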
Scaling and Normalization:
• Scale numeric variables to a consistent range to prevent features with larger values from dominating the model.
• Techniques include Min-Max scaling and Z-score standardization.
Binning or Discretization:
• Group numerical variables into bins or categories to capture non-linear relationships.
• Useful when the exact numerical values are less important than the general range.
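Binning can be sketched with base R's cut() (the age values and bin labels below are made up for illustration):

```r
# Binning sketch with cut(): group a numeric vector into labelled ranges.
ages <- c(5, 17, 25, 42, 67, 80)
age_group <- cut(ages,
                 breaks = c(0, 18, 40, 65, Inf),
                 labels = c("child", "young", "middle", "senior"))
table(age_group)  # count of observations per bin
```

By default cut() uses right-closed intervals, so 18 would fall in "child" and 40 in "young".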
Feature Interaction:
• Create new features by combining existing features to capture interactions between them.
• Examples include product features, ratios, or polynomial features.
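A minimal sketch of interaction features, using hypothetical sales columns:

```r
# Feature interaction sketch: derive ratio and polynomial features
# from existing columns (hypothetical data).
df <- data.frame(revenue = c(100, 250, 80),
                 units   = c(10, 25, 4))
df$price_per_unit <- df$revenue / df$units   # ratio feature
df$revenue_sq     <- df$revenue^2            # polynomial feature
```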
Feature Selection:
• Identify and select the most important features for modeling to improve model performance and reduce overfitting.
• Techniques include correlation analysis, feature importance from tree-based models, and recursive feature elimination.
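Correlation analysis, the simplest of these techniques, can be sketched in base R on simulated data (x1, x2, and y below are made up so the "right answer" is known):

```r
# Correlation-based feature selection sketch: rank features by the
# absolute value of their correlation with the target.
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 * x1 + rnorm(n, sd = 0.1)   # y depends strongly on x1 only
cors <- abs(c(x1 = cor(x1, y), x2 = cor(x2, y)))
# Features sorted from most to least correlated with the target
ranked <- names(sort(cors, decreasing = TRUE))
```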
Dimensionality Reduction:
• Reduce the number of input variables in a dataset by merging correlated features or using techniques like Principal Component Analysis (PCA).
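PCA is available in base R via prcomp(); a minimal sketch on the built-in mtcars dataset (chosen here only because it ships with R):

```r
# PCA sketch with base R's prcomp(): project correlated numeric features
# onto a smaller set of principal components.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
summary(pca)             # variance explained per component
reduced <- pca$x[, 1:2]  # keep only the first two components
dim(reduced)             # 32 rows, 2 columns
```

Centering and scaling before PCA (as above) keeps features with large units from dominating the components, echoing the scaling advice earlier in this chapter.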
