Flight Delay Prediction Using R

Last Updated : 25 Jul, 2024

Predicting flight delays matters to both airlines and passengers. Accurate estimates support better time management and customer satisfaction, while frequent delays frustrate passengers and can lead them to choose other airlines for future flights. Using machine learning and data analysis, we can estimate flight delays in R, a popular statistical programming language. This article shows how to predict flight delays with the R programming language.

The objective of Flight Delay Prediction

Flight delay prediction involves forecasting whether a flight will be delayed and by how much, based on various factors such as weather conditions, flight schedule, aircraft specifics, and air traffic control constraints. The goal is to build a predictive model that can assist stakeholders in making informed decisions.

Predicting flight delays in R involves the following steps:

  1. Load Libraries
  2. Load and Inspect the Data
  3. Data Preprocessing
  4. Exploratory Data Analysis (EDA)
  5. Model Building
  6. Prediction
  7. Model Evaluation

1. Load the Required Libraries

Installing and loading the following packages up front simplifies the data processing, plotting, and modeling steps that follow.

R
# Install the required packages (only needed once)
install.packages(c("tidyverse", "caret", "randomForest", "lubridate", "corrplot", "pROC"))

# Load libraries
library(tidyverse)     # dplyr, ggplot2 and related packages
library(caret)         # data partitioning and confusion matrix
library(lubridate)     # date-time parsing
library(corrplot)      # correlation plot
library(randomForest)  # random forest model
library(pROC)          # ROC curve and AUC

Make sure you have R and RStudio installed on your PC.
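The install.packages() call above reinstalls every listed package each time it runs. A minimal sketch of installing only what is missing is shown below. The helper name missing_packages and the interactive() guard are illustrative assumptions, and pROC is included on the assumption that it supplies the roc() and auc() functions used in the evaluation step.

```r
# Packages used in this article (pROC is an assumption for roc()/auc())
required_pkgs <- c("tidyverse", "caret", "randomForest",
                   "lubridate", "corrplot", "pROC")

# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_pkgs)

# Install only when something is missing and the session is interactive
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

Running the script a second time then skips the installation step entirely, because to_install is empty once all packages are present.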

2. Load and Inspect the Data

Here, we use an external dataset from Kaggle that covers US airline flights departing from New York City (NYC) airports in 2013.

Dataset Link: NYC Flight Data

R
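# Load the dataset; read text columns as factors so that summary() reports level counts
data <- read.csv("flights.csv", stringsAsFactors = TRUE)

# The dataset is large, so work with a 0.1% random sample
set.seed(123)
sampled_data <- data %>% sample_frac(0.001)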

# Inspect the sampled data
str(sampled_data)
summary(sampled_data)

Output:

     year           month             day           dep_time     sched_dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   2.0   Min.   : 600
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 856.8   1st Qu.: 902
 Median :2013   Median : 7.000   Median :16.00   Median :1401.5   Median :1359
 Mean   :2013   Mean   : 6.715   Mean   :16.06   Mean   :1340.3   Mean   :1350
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1745.0   3rd Qu.:1732
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2344.0   Max.   :2359
                                                 NA's   :9

   dep_delay        arr_time    sched_arr_time     arr_delay       carrier
 Min.   :-14.00   Min.   :   1   Min.   :   4   Min.   :-57.000   B6     :59
 1st Qu.: -5.00   1st Qu.:1107   1st Qu.:1114   1st Qu.:-17.000   UA     :59
 Median : -1.00   Median :1541   Median :1550   Median : -6.000   DL     :52
 Mean   : 12.58   Mean   :1511   Mean   :1541   Mean   :  7.443   EV     :51
 3rd Qu.: 13.25   3rd Qu.:1960   3rd Qu.:2000   3rd Qu.: 16.000   AA     :30
 Max.   :253.00   Max.   :2352   Max.   :2359   Max.   :330.000   MQ     :27
 NA's   :9        NA's   :9                     NA's   :10        (Other):59

    flight        tailnum     origin       dest         air_time        distance
 Min.   :  11   N738MQ :  4   EWR:125   ATL    : 18   Min.   : 29.0   Min.   :  94
 1st Qu.: 623   N723TW :  3   JFK:111   BOS    : 16   1st Qu.: 90.0   1st Qu.: 544
 Median :1444   N0EGMQ :  2   LGA:101   CLT    : 16   Median :139.0   Median :1005
 Mean   :1943   N11551 :  2             SFO    : 15   Mean   :157.3   Mean   :1091
 3rd Qu.:3361   N12238 :  2             FLL    : 14   3rd Qu.:194.0   3rd Qu.:1400
 Max.   :5978   (Other):320             LAX    : 14   Max.   :661.0   Max.   :4963
                NA's   :  4             (Other):244   NA's   :10

      hour            minute              time_hour
 Min.   : 6.00   Min.   : 0.00   13-08-2013 08:00:  3
 1st Qu.: 9.00   1st Qu.: 8.00   21-10-2013 20:00:  2
 Median :13.00   Median :30.00   22-02-2013 18:00:  2
 Mean   :13.23   Mean   :26.74   22-04-2013 06:00:  2
 3rd Qu.:17.00   3rd Qu.:41.00   22-11-2013 17:00:  2
 Max.   :23.00   Max.   :59.00   28-02-2013 17:00:  2
                                 (Other)         :324
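The NA's rows in the summary above show that several columns contain missing values. Before cleaning, it can help to count them per column and to check how many rows would survive a call to na.omit(). The sketch below uses a small made-up data frame named toy in place of sampled_data.

```r
# Toy data standing in for a few columns of sampled_data (values are made up)
toy <- data.frame(
  dep_delay = c(5, NA, -3, 12),
  arr_delay = c(NA, NA, 0, 20),
  carrier   = c("UA", "DL", "AA", "B6")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
# dep_delay arr_delay   carrier
#         1         2         0

# Number of rows without any missing value (the rows na.omit() would keep)
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
# [1] 2
```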

3. Data Preprocessing

Data preprocessing is one of the most important steps in machine learning because it cleans the data and makes it ready for training. Missing values can distort the predictions, so they must be handled before modeling. Data preprocessing includes data cleaning, handling missing values, and feature engineering.

R
# Handle missing values by dropping incomplete rows
sampled_data <- na.omit(sampled_data)

# Convert variables to appropriate types
sampled_data <- sampled_data %>%
  mutate(
    # Times are stored as HHMM integers; pad them to four digits (517 -> "0517")
    dep_time = sprintf("%04d", dep_time),
    sched_dep_time = sprintf("%04d", sched_dep_time),
    arr_time = sprintf("%04d", arr_time),
    sched_arr_time = sprintf("%04d", sched_arr_time),
    dep_time = ymd_hm(paste(year, month, day, substr(dep_time, 1, 2), 
                            substr(dep_time, 3, 4))),
    sched_dep_time = ymd_hm(paste(year, month, day, substr(sched_dep_time, 1, 2), 
                                  substr(sched_dep_time, 3, 4))),
    arr_time = ymd_hm(paste(year, month, day, substr(arr_time, 1, 2), 
                            substr(arr_time, 3, 4))),
    sched_arr_time = ymd_hm(paste(year, month, day, substr(sched_arr_time, 1, 2), 
                                  substr(sched_arr_time, 3, 4))),
    carrier = as.factor(carrier),
    origin = as.factor(origin),
    dest = as.factor(dest),
    flight = as.integer(flight),
    tailnum = as.factor(tailnum)
  )

# Create new features
sampled_data <- sampled_data %>%
  mutate(
    dep_hour = hour(dep_time),
    dep_minute = minute(dep_time),
    sched_dep_hour = hour(sched_dep_time),
    sched_dep_minute = minute(sched_dep_time),
    arr_hour = hour(arr_time),
    arr_minute = minute(arr_time),
    sched_arr_hour = hour(sched_arr_time),
    sched_arr_minute = minute(sched_arr_time)
  )
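The preprocessing above extracts hours and minutes by padding the HHMM values to strings and parsing them as date-times. As a lighter alternative, the same parts can be obtained with integer arithmetic. This is a sketch on made-up values, and the variable names are illustrative rather than part of the article's pipeline.

```r
# Example HHMM-encoded clock times (made-up values): 5:17, 23:59, 0:05, 12:00
hhmm <- c(517L, 2359L, 5L, 1200L)

hour_part   <- hhmm %/% 100L  # integer division keeps the hour digits
minute_part <- hhmm %% 100L   # remainder keeps the minute digits

print(hour_part)
# [1]  5 23  0 12
print(minute_part)
# [1] 17 59  5  0

# Minutes since midnight, usable as a single numeric feature
minutes_since_midnight <- hour_part * 60L + minute_part
print(minutes_since_midnight)
# [1]  317 1439    5  720
```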

4. EDA (Exploratory Data Analysis)

EDA helps us understand the structure of a dataset and the patterns in it. The plots below summarize the delays so that we can make informed modeling decisions.

Distribution of Departure Delays

Here, we plot the distribution of departure delays in minutes to see how the delays are spread.

R
ggplot(sampled_data, aes(x = dep_delay)) +
  geom_histogram(bins = 50, fill = "pink", color = "black") +
  labs(title = "Distribution of Departure Delays",
       x = "Departure Delay (minutes)", y = "Frequency")

Output:

[Figure: histogram of departure delays]

Departure Delay by Airline

This box plot shows the distribution of departure delays for each airline (median, quartiles, and outliers), which lets us compare which airlines tend to have longer delays.

R
ggplot(sampled_data, aes(x = carrier, y = dep_delay, fill = carrier)) +
    geom_boxplot() +
    labs(title = "Departure Delay by Airline", x = "Airline",
         y = "Departure Delay (minutes)")

Output:

[Figure: box plot of departure delay by airline]
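A box plot draws the median and quartiles of each group rather than the mean. To compare both per airline as numbers, the two statistics can be computed side by side. The sketch below uses base R on a made-up data frame named toy; with the real data, sampled_data would take its place.

```r
# Made-up delays for two carriers
toy <- data.frame(
  carrier   = c("UA", "UA", "DL", "DL", "DL"),
  dep_delay = c(10, -2, 30, 0, 3)
)

# Mean and median departure delay per carrier
mean_delay   <- aggregate(dep_delay ~ carrier, data = toy, FUN = mean)
median_delay <- aggregate(dep_delay ~ carrier, data = toy, FUN = median)

# Combine both summaries into one table
delay_summary <- merge(mean_delay, median_delay, by = "carrier",
                       suffixes = c("_mean", "_median"))
print(delay_summary)
#   carrier dep_delay_mean dep_delay_median
# 1      DL             11                3
# 2      UA              4                4
```

In this toy example a single 30-minute delay pulls the DL mean up to 11 minutes while its median stays at 3, which is why the two statistics can rank airlines differently.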

Heatmap of Average Departure Delay

The heatmap shows which days of the month and which months have higher or lower average departure delays. Light blue tiles represent lower average delays, and red tiles represent higher average delays.

R
delay_heatmap <- sampled_data %>%
  group_by(month, day) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop")

ggplot(delay_heatmap, aes(x = factor(month), y = factor(day), fill = mean_dep_delay))+
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "red") +
  labs(title = "Heatmap of Average Departure Delay by Day and Month", 
       x = "Month", y = "Day")

Output:

[Figure: heatmap of average departure delay by day and month]

Correlation Matrix

We plot a correlation matrix to see how strongly departure delay is related to the other numeric variables, such as arrival delay, air time, and distance.

R
corr_matrix <- cor(sampled_data %>% select(dep_delay, arr_delay, air_time, 
                                           distance, dep_hour, dep_minute, arr_hour, 
                                           arr_minute), use = "complete.obs")
corrplot(corr_matrix, method = "color", type = "upper", tl.col = "black")

Output:

[Figure: correlation matrix of delay, air time, distance, and time features]
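The use = "complete.obs" argument in the cor() call tells R to drop every row that has a missing value in any of the selected columns before computing the correlations. The sketch below illustrates the effect on a made-up data frame named toy.

```r
# Made-up data with one missing departure delay
toy <- data.frame(
  dep_delay = c(0, 10, 20, NA),
  arr_delay = c(5, 15, 25, 40),
  distance  = c(300, 200, 100, 500)
)

# With the default use = "everything", a pair involving an NA yields NA
default_cor <- cor(toy)
print(is.na(default_cor["dep_delay", "arr_delay"]))
# [1] TRUE

# With use = "complete.obs", only the three complete rows are used
complete_cor <- cor(toy, use = "complete.obs")
print(round(complete_cor, 2))
```

On the three complete rows, dep_delay and arr_delay rise together and distance falls, so the correlations come out as 1 and -1 respectively.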

5. Model Training using RandomForest

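Random forest is an ensemble learning method that combines many decision trees to improve classification or regression performance. Averaging over many trees typically reduces overfitting compared with a single decision tree. Here it is used as a regression model, with dep_delay in minutes as the target.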

R
# Drop tailnum (a high-cardinality categorical variable) and replace the
# remaining factors with their integer level codes
model_data <- sampled_data %>%
    select(-tailnum) %>%
    mutate(across(where(is.factor), as.numeric))

# Split data into training and test sets
set.seed(123)
train_index <- createDataPartition(model_data$dep_delay, p = 0.7, list = FALSE)
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]

# Train random forest model
model_rf <- randomForest(dep_delay ~ ., data = train_data, ntree = 100)
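The formula dep_delay ~ . uses every remaining column as a predictor, including dep_time, arr_time, arr_delay, and the features derived from them, which are only known once the flight has departed or landed. If the goal is to predict delays before departure, one option is to restrict the predictors to schedule information. The sketch below shows only the column selection on made-up data; the choice of columns in pre_departure_cols is an assumption for illustration, not the method used in this article.

```r
# Made-up rows with a subset of the dataset's columns
toy <- data.frame(
  month = c(1L, 6L), day = c(15L, 3L),
  sched_dep_time = c(600L, 1745L), dep_time = c(612L, 1740L),
  dep_delay = c(12, -5), arr_delay = c(20, -11),
  carrier = c("UA", "DL"), origin = c("EWR", "JFK"),
  distance = c(1400, 760)
)

# Assumed set of predictors that are known before the flight departs
pre_departure_cols <- c("month", "day", "sched_dep_time",
                        "carrier", "origin", "distance")

# Keep the target plus the pre-departure predictors
model_cols <- c("dep_delay", pre_departure_cols)
leak_free <- toy[, model_cols]
print(names(leak_free))

# Columns left out because they are only known after departure
leaked <- setdiff(names(toy), model_cols)
print(leaked)
# [1] "dep_time"  "arr_delay"
```

A model could then be trained on the reduced data in the same way, for example with randomForest(dep_delay ~ ., data = leak_free), and its metrics would be expected to differ from the ones reported in this article.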

6. Prediction using RandomForest

Here, we use the trained random forest model to predict the departure delay in minutes for each flight in the test set. A flight is then labeled as delayed when its predicted delay is greater than zero.

R
# Make predictions on the test data
predictions <- predict(model_rf, test_data)

# Create a results dataframe for comparison
results <- data.frame(
    Actual = test_data$dep_delay,
    Predicted = predictions,
    Residual = test_data$dep_delay - predictions
)

# Add a column indicating whether the flight is predicted to be delayed
results <- results %>%
    mutate(
        Delayed = ifelse(Predicted > 0, "Yes", "No")
    )

# Print the first few rows of the results
head(results)

Output:

   Actual  Predicted   Residual Delayed
2      -6 -0.4863333  -5.513667      No
3      -3 -1.9815000  -1.018500      No
4      -1  1.9500000  -2.950000     Yes
5      -2  5.0131667  -7.013167     Yes
9     -10 -3.5148333  -6.485167      No
14     -9 28.9585000 -37.958500     Yes

Visualize the Prediction for Delays

To understand the predictions better, we plot the predicted delays of the flights that the model labels as delayed.

R
# Keep the flights that the model predicts to be delayed
delayed_flights <- results %>% filter(Delayed == "Yes")

# Histogram of predicted delays for those flights
ggplot(delayed_flights, aes(x = Predicted)) +
    geom_histogram(bins = 30, fill = "orange", color = "black", alpha = 0.7) +
    labs(title = "Distribution of Predicted Delays for Delayed Flights",
         x = "Predicted Delay (minutes)",
         y = "Frequency") +
    theme_minimal()

Output:

[Figure: histogram of predicted delays for flights predicted as delayed]

7. Model Evaluation

The metrics below evaluate how well the trained model performs. RMSE measures the error of the predicted delay in minutes, while the classification metrics are computed after converting both the actual and the predicted delays into a binary delayed / not delayed label.

For the binary labels, a confusion matrix provides metrics such as accuracy, precision, recall, and F1-score. The AUC is computed from the ROC curve of the predicted delay.

R
library(pROC)  # provides roc() and auc()

# Calculate RMSE
rmse <- sqrt(mean((results$Actual - results$Predicted)^2))
print(paste("RMSE: ", rmse))

# Store the predictions next to the actual values
test_data$predicted_delay <- predict(model_rf, newdata = test_data)

# Convert actual and predicted delays to binary outcomes (1 = delayed, 0 = not delayed)
test_data$delay_binary <- ifelse(test_data$dep_delay > 0, 1, 0)
test_data$predicted_binary <- ifelse(test_data$predicted_delay > 0, 1, 0)

# Confusion matrix; explicit levels keep both classes even if one is never predicted.
# Note: caret treats the first level ("0", not delayed) as the positive class by default.
conf_matrix <- confusionMatrix(factor(test_data$predicted_binary, levels = c(0, 1)),
                               factor(test_data$delay_binary, levels = c(0, 1)))

# Accuracy
accuracy <- conf_matrix$overall['Accuracy']

# Precision, Recall, F1-score (for the positive class)
precision <- conf_matrix$byClass['Pos Pred Value']
recall <- conf_matrix$byClass['Sensitivity']
f1_score <- 2 * (precision * recall) / (precision + recall)

# AUC, using the predicted delay in minutes as the score
roc_obj <- roc(test_data$delay_binary, test_data$predicted_delay)
auc_value <- as.numeric(auc(roc_obj))

# Print the metrics
cat("Accuracy: ", accuracy, "\n")
cat("Precision: ", precision, "\n")
cat("Recall (Sensitivity): ", recall, "\n")
cat("F1-score: ", f1_score, "\n")
cat("AUC: ", auc_value, "\n")

Output:

[1] "RMSE:  13.0528251991879"
Accuracy: 0.6145833
Precision: 0.8461538
Recall (Sensitivity): 0.4
F1-score: 0.5432099
AUC: 0.7933481

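RMSE measures how far the model's predictions are from the actual values, in the same unit as the target (here, minutes), with lower values indicating better performance. An RMSE of about 13.05 means that the predicted delays are typically off by roughly 13 minutes.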
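Because RMSE squares each error before averaging, a few large misses dominate the result. The sketch below computes RMSE on made-up values and then again without the largest error; the function name rmse and the numbers are illustrative only.

```r
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up delays in minutes; the last prediction is off by 10 minutes
actual    <- c(10, 0, -5, 30)
predicted <- c(12, -2, -5, 20)

print(rmse(actual, predicted))
# [1] 5.196152

# Without the single large miss, the RMSE drops sharply
print(rmse(actual[-4], predicted[-4]))
# [1] 1.632993
```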

  • Accuracy: Accuracy measures the proportion of correct predictions among all predictions made. Here the accuracy is 0.61, which means that about 61% of the flights were correctly classified as either delayed or not delayed.
  • Precision: This measures how often a positive prediction is correct. Because confusionMatrix() treats the first factor level as the positive class by default, the positive class here is "0" (not delayed): about 84.6% of the flights predicted as not delayed were actually not delayed.
  • Recall: A recall of 40% means that the model correctly identifies only 40% of the flights in the positive class, that is, flights that were actually not delayed. This is relatively low and suggests that the model flags many on-time flights as delayed.
  • F1-score: This is the harmonic mean of precision and recall (sensitivity). A value of 0.54 is moderate; it is pulled down by the low recall.
  • AUC (Area Under the ROC Curve): This is the area under the ROC curve. A value of 0.79 means that a randomly chosen delayed flight receives a higher predicted delay than a randomly chosen non-delayed flight about 79% of the time. An AUC of 0.5 corresponds to random guessing and 1 to perfect separation.
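To make the definitions above concrete, the sketch below computes precision, recall, F1-score, and AUC by hand in base R on made-up data. The function names classification_metrics and auc_by_rank are hypothetical helpers, and the rank-based AUC formula is a standard equivalent of the area under the ROC curve when there are no tied scores, not the routine used earlier in this article.

```r
# Made-up data: 1 = delayed, 0 = not delayed; scores are predicted delays in minutes
actual          <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted_delay <- c(25, -3, 10, -8, 4, 2, -1, 40)
predicted       <- as.integer(predicted_delay > 0)

# Precision, recall and F1 for whichever class is treated as positive
classification_metrics <- function(actual, predicted, positive = 1) {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# Positive class = delayed: precision 0.600, recall 0.750, f1 0.667
delayed_metrics <- classification_metrics(actual, predicted, positive = 1)
print(round(delayed_metrics, 3))

# Positive class = not delayed: precision 0.667, recall 0.500, f1 0.571
on_time_metrics <- classification_metrics(actual, predicted, positive = 0)
print(round(on_time_metrics, 3))

# AUC as the share of (delayed, not delayed) pairs ranked correctly by the score
auc_by_rank <- function(actual, score) {
  r     <- rank(score)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
print(auc_by_rank(actual, predicted_delay))
# [1] 0.8125

# The reported F1-score follows from the reported precision and recall
reported_f1 <- 2 * 0.8461538 * 0.4 / (0.8461538 + 0.4)
print(round(reported_f1, 7))
# [1] 0.5432099
```

The two calls to classification_metrics show that precision and recall change with the choice of positive class, so it is worth stating that choice whenever these metrics are reported.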

Conclusion

This article discussed flight delay prediction using the R programming language and how machine learning can support travel planning. We used an external dataset to explore the relationships between the variables through visualization, trained a random forest model to predict departure delays, and evaluated its predictions.
