Flight Delay Prediction Using R

Last Updated : 25 Jul, 2024

Flight delays frustrate passengers and, when they recur, can push customers to choose other airlines for future trips. Predicting delays helps airlines and travelers manage time better and improves customer satisfaction. With machine learning and data analysis, we can estimate flight delays using R, a popular statistical programming language. This article walks through predicting flight delays with the R programming language.

The objective of Flight Delay Prediction

Flight delay prediction involves forecasting whether a flight will be delayed and by how much, based on various factors such as weather conditions, flight schedule, aircraft specifics, and air traffic control constraints. The goal is to build a predictive model that can assist stakeholders in making informed decisions.

There are certain steps to be followed to predict flight delay in R.

Load Libraries
Load and Inspect the Data
Data Preprocessing
Exploratory Data Analysis (EDA)
Model Building
Prediction
Model Evaluation

1. Load the Required Libraries

Installing and loading the required packages first simplifies the data processing, plotting, and modeling steps that follow.

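# Install required packages if not already installed
install.packages(c("tidyverse", "caret", "randomForest", "lubridate", "corrplot", "pROC"))

# Load libraries
library(tidyverse)     # dplyr, ggplot2 and related packages
library(caret)         # createDataPartition(), confusionMatrix()
library(lubridate)     # date-time parsing
library(ggplot2)
library(corrplot)      # correlation plot
library(randomForest)  # random forest model
library(pROC)          # roc() and auc() used in model evaluation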

Make sure you have R and RStudio installed on your PC.
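The install step is meant to run only for packages that are not already present. A minimal sketch of that check in base R is shown here; the helper name missing_packages is hypothetical, and the package list is the set loaded in this article plus pROC, which provides the roc() function used in step 7.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

required <- c("tidyverse", "caret", "lubridate", "corrplot", "randomForest", "pROC")
to_install <- missing_packages(required)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that running the sketch only reports what is missing.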

2. Load and Inspect the Data

Here, we will be using an external dataset from the Kaggle website based on the Flight Delay Analysis of US Airlines from NYC.

Dataset Link: NYC Flight Data

# Load the dataset
data <- read.csv("flights.csv")

# Sample 0.1% of the rows, since the full dataset is large
set.seed(123)
sampled_data <- data %>% sample_frac(0.001)

# Inspect the sampled data
str(sampled_data)
summary(sampled_data)

Output:

      year          month             day           dep_time      sched_dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   2.0   Min.   : 600  
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 856.8   1st Qu.: 902  
 Median :2013   Median : 7.000   Median :16.00   Median :1401.5   Median :1359  
 Mean   :2013   Mean   : 6.715   Mean   :16.06   Mean   :1340.3   Mean   :1350  
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1745.0   3rd Qu.:1732  
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2344.0   Max.   :2359  
                                                 NA's   :9                      
   dep_delay         arr_time    sched_arr_time   arr_delay          carrier  
 Min.   :-14.00   Min.   :   1   Min.   :   4   Min.   :-57.000   B6     :59  
 1st Qu.: -5.00   1st Qu.:1107   1st Qu.:1114   1st Qu.:-17.000   UA     :59  
 Median : -1.00   Median :1541   Median :1550   Median : -6.000   DL     :52  
 Mean   : 12.58   Mean   :1511   Mean   :1541   Mean   :  7.443   EV     :51  
 3rd Qu.: 13.25   3rd Qu.:1960   3rd Qu.:2000   3rd Qu.: 16.000   AA     :30  
 Max.   :253.00   Max.   :2352   Max.   :2359   Max.   :330.000   MQ     :27  
 NA's   :9        NA's   :9                     NA's   :10        (Other):59  
     flight        tailnum    origin         dest        air_time        distance   
 Min.   :  11   N738MQ :  4   EWR:125   ATL    : 18   Min.   : 29.0   Min.   :  94  
 1st Qu.: 623   N723TW :  3   JFK:111   BOS    : 16   1st Qu.: 90.0   1st Qu.: 544  
 Median :1444   N0EGMQ :  2   LGA:101   CLT    : 16   Median :139.0   Median :1005  
 Mean   :1943   N11551 :  2             SFO    : 15   Mean   :157.3   Mean   :1091  
 3rd Qu.:3361   N12238 :  2             FLL    : 14   3rd Qu.:194.0   3rd Qu.:1400  
 Max.   :5978   (Other):320             LAX    : 14   Max.   :661.0   Max.   :4963  
                NA's   :  4             (Other):244   NA's   :10                    
      hour           minute                 time_hour  
 Min.   : 6.00   Min.   : 0.00   13-08-2013 08:00:  3  
 1st Qu.: 9.00   1st Qu.: 8.00   21-10-2013 20:00:  2  
 Median :13.00   Median :30.00   22-02-2013 18:00:  2  
 Mean   :13.23   Mean   :26.74   22-04-2013 06:00:  2  
 3rd Qu.:17.00   3rd Qu.:41.00   22-11-2013 17:00:  2  
 Max.   :23.00   Max.   :59.00   28-02-2013 17:00:  2  
                                 (Other)         :324
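The NA's rows in the summary show that some columns contain missing values. A quick way to count them per column, and to see how many rows would survive na.omit(), might look like the sketch below. The data frame toy is made up for illustration and only stands in for sampled_data.

```r
# Small made-up data frame standing in for sampled_data
toy <- data.frame(
  dep_time  = c(517L, NA, 542L, 1830L),
  dep_delay = c(2, NA, -3, 45),
  arr_delay = c(11, NA, NA, 30),
  carrier   = c("UA", "AA", "B6", "DL")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows without any missing value (the rows na.omit() keeps)
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```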

3. Data Preprocessing

Data preprocessing is one of the most important steps in machine learning because it gets the data ready for training. Missing values can distort the predictions, so they need to be handled first. Preprocessing here includes data cleaning, handling missing values, and feature engineering.

# Handle missing values
sampled_data <- na.omit(sampled_data)

# Convert variables to appropriate types
sampled_data <- sampled_data %>%
  mutate(
    dep_time = sprintf("%04d", dep_time),
    sched_dep_time = sprintf("%04d", sched_dep_time),
    arr_time = sprintf("%04d", arr_time),
    sched_arr_time = sprintf("%04d", sched_arr_time),
    dep_time = ymd_hm(paste(year, month, day, substr(dep_time, 1, 2), 
                            substr(dep_time, 3, 4))),
    sched_dep_time = ymd_hm(paste(year, month, day, substr(sched_dep_time, 1, 2), 
                                  substr(sched_dep_time, 3, 4))),
    arr_time = ymd_hm(paste(year, month, day, substr(arr_time, 1, 2), 
                            substr(arr_time, 3, 4))),
    sched_arr_time = ymd_hm(paste(year, month, day, substr(sched_arr_time, 1, 2), 
                                  substr(sched_arr_time, 3, 4))),
    carrier = as.factor(carrier),
    origin = as.factor(origin),
    dest = as.factor(dest),
    flight = as.integer(flight),
    tailnum = as.factor(tailnum)
  )

# Create new features
sampled_data <- sampled_data %>%
  mutate(
    dep_hour = hour(dep_time),
    dep_minute = minute(dep_time),
    sched_dep_hour = hour(sched_dep_time),
    sched_dep_minute = minute(sched_dep_time),
    arr_hour = hour(arr_time),
    arr_minute = minute(arr_time),
    sched_arr_hour = hour(sched_arr_time),
    sched_arr_minute = minute(sched_arr_time)
  )
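The code above builds date-times by zero-padding the HHMM values and slicing the resulting strings. When only the hour and minute are needed, integer arithmetic gives the same parts. This is a minimal sketch, assuming the times are stored as integers in HHMM form (as the summary output suggests); the vector hhmm and the derived names are made up for illustration.

```r
# Example clock times in HHMM form (517 means 05:17)
hhmm <- c(517L, 5L, 1830L, 2359L)

# Integer division and remainder split the value into hour and minute
hour_part   <- hhmm %/% 100L
minute_part <- hhmm %% 100L

# A single numeric feature: minutes elapsed since midnight
minutes_since_midnight <- hour_part * 60L + minute_part

print(data.frame(hhmm, hour_part, minute_part, minutes_since_midnight))
```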

4. EDA (Exploratory Data Analysis)

EDA helps us understand the structure of a dataset before modeling. The plots below give a clearer picture of the data and support informed modeling decisions.

Distribution of Departure Delays

Here, we plot the distribution of departure delays in minutes to see how the delays are spread.

ggplot(sampled_data, aes(x = dep_delay)) +
  geom_histogram(bins = 50, fill = "pink", color = "black") +
  labs(title = "Distribution of Departure Delays",
       x = "Departure Delay (minutes)",
       y = "Frequency")

Output:

Figure: histogram of departure delays

Average Departure Delay by Airline

This box plot shows the spread of departure delays for each airline, so we can compare the median delays and see which airlines tend to be delayed more.

ggplot(sampled_data, aes(x = carrier, y = dep_delay, fill = carrier)) +
    geom_boxplot() +
    labs(title = "Average Departure Delay by Airline", x = "Airline",
         y = "Departure Delay (minutes)")

Output:
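A box plot marks the median of each group rather than the mean, and the two can differ a lot when a few flights are very late. As a numeric complement to the plot, a per-airline table of both statistics could be computed as sketched below. The data frame toy and its values are invented for illustration.

```r
# Made-up delays for two carriers; AA has one very late flight
toy <- data.frame(
  carrier   = c("AA", "AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(-5, 0, 65, 10, 12, 14)
)

# Mean and median departure delay per carrier
mean_delay   <- tapply(toy$dep_delay, toy$carrier, mean)
median_delay <- tapply(toy$dep_delay, toy$carrier, median)

summary_tbl <- data.frame(
  carrier = names(mean_delay),
  mean    = as.numeric(mean_delay),
  median  = as.numeric(median_delay)
)
print(summary_tbl)
```

In this toy data the single 65-minute delay pulls the AA mean to 20 minutes while its median stays at 0, which is why the two summaries are worth reading together.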

Heatmap of Average Departure Delay

The heatmap shows which days of the month and which months have higher or lower average departure delays. Light blue tiles represent lower delays, and red tiles represent higher delays.

delay_heatmap <- sampled_data %>%
  group_by(month, day) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop")

ggplot(delay_heatmap, aes(x = factor(month), y = factor(day), fill = mean_dep_delay))+
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "red") +
  labs(title = "Heatmap of Average Departure Delay by Day and Month", 
       x = "Month", y = "Day")

Output:

Figure: heatmap of average departure delay by day and month

Correlation Matrix

We plot a correlation matrix to see how strongly the delay variables are related to the other numeric features.

corr_matrix <- cor(sampled_data %>% select(dep_delay, arr_delay, air_time, 
                                           distance, dep_hour, dep_minute, arr_hour, 
                                           arr_minute), use = "complete.obs")
corrplot(corr_matrix, method = "color", type = "upper", tl.col = "black")

Output:

Figure: correlation matrix plot
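Besides reading the plot, the correlations with dep_delay can be ranked numerically. The sketch below does this with base R on a small invented data frame; toy and its values are assumptions for illustration, not rows from the dataset.

```r
# Made-up numeric columns standing in for the selected features
toy <- data.frame(
  dep_delay = c(-3, 0, 5, 20, 60, 120),
  arr_delay = c(-10, -2, 1, 15, 55, 110),
  distance  = c(500, 2400, 760, 1000, 200, 1600)
)

# Pairwise Pearson correlations
corr <- cor(toy)

# Correlations with the target, excluding the target itself
with_target <- corr["dep_delay", colnames(corr) != "dep_delay"]

# Rank features by absolute correlation with dep_delay
ranked <- sort(abs(with_target), decreasing = TRUE)
print(ranked)
```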

5. Model Training using RandomForest

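Random Forest is an ensemble learning method that combines many decision trees for classification or regression. Averaging over the trees usually reduces overfitting compared with a single tree and tends to improve accuracy. Here it is used as a regression model that predicts dep_delay in minutes. Note that the formula dep_delay ~ . uses every remaining column as a predictor, including the actual departure and arrival times and arr_delay, which are only known after the flight, so the results below are optimistic compared with a forecast made before departure.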

# Exclude high-cardinality categorical variables
model_data <- sampled_data %>%
    select(-tailnum) %>%
    mutate(across(where(is.factor), as.numeric))  # Convert factors to numeric

# Split data into training and test sets
set.seed(123)
train_index <- createDataPartition(model_data$dep_delay, p = 0.7, list = FALSE)
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]

# Train random forest model
model_rf <- randomForest(dep_delay ~ ., data = train_data, ntree = 100)
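Because dep_delay ~ . includes columns that are only known once the flight has departed or landed, a forecast made in advance would need a narrower feature set. The sketch below shows one way to restrict a data frame to pre-departure columns before training. The column list pre_departure is an assumption chosen for illustration, and toy is a made-up stand-in for model_data.

```r
# Made-up stand-in for model_data
toy <- data.frame(
  dep_delay      = c(5, -2, 30),
  arr_delay      = c(10, -8, 41),   # known only after landing
  dep_hour       = c(6, 13, 18),    # actual departure hour, known only after departure
  sched_dep_hour = c(6, 13, 17),
  month          = c(1, 7, 12),
  distance       = c(500, 1005, 2400)
)

# Columns assumed to be known before departure (illustrative choice)
pre_departure <- c("month", "sched_dep_hour", "distance")

# Keep the target plus the pre-departure columns that exist in the data
model_cols <- c("dep_delay", intersect(pre_departure, names(toy)))
leak_free  <- toy[, model_cols]
print(names(leak_free))
```

A data frame like leak_free could then be passed to createDataPartition() and randomForest() in place of model_data; the metrics would differ from the ones reported in this article.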

6. Prediction using RandomForest

Here, we predict flight delays with the trained random forest model. The model predicts the departure delay in minutes, and a flight is then flagged as delayed when the predicted delay is greater than 0.

# Make predictions on the test data
predictions <- predict(model_rf, test_data)

# Create a results dataframe for comparison
results <- data.frame(
    Actual = test_data$dep_delay,
    Predicted = predictions,
    Residual = test_data$dep_delay - predictions
)

# Add a column indicating whether the flight is delayed
results <- results %>%
    mutate(
        Delayed = ifelse(Predicted > 0, "Yes", "No")
    ) 

# Print a summary of the results
head(results)

Output:

   Actual  Predicted   Residual Delayed
2      -6 -0.4863333  -5.513667      No
3      -3 -1.9815000  -1.018500      No
4      -1  1.9500000  -2.950000     Yes
5      -2  5.0131667  -7.013167     Yes
9     -10 -3.5148333  -6.485167      No
14     -9 28.9585000 -37.958500     Yes

Visualize the Prediction for Delays

To understand the predictions better, we plot the predicted delays for the flights flagged as delayed.

# Filter for delayed flights
delayed_flights <- results %>% filter(Delayed == "Yes")

# Histogram of predicted delays for delayed flights
ggplot(delayed_flights, aes(x = Predicted)) +
    geom_histogram(bins = 30, fill = "orange", color = "black", alpha = 0.7) +
    labs(title = "Distribution of Predicted Delays for Delayed Flights",
         x = "Predicted Delay (minutes)",
         y = "Frequency") +
    theme_minimal()

Output:

7. Model Evaluation

These metrics show how well the model performs. RMSE measures the error of the predicted delay in minutes. To get classification metrics as well, both the actual and the predicted delays are converted to a binary delayed / not delayed label, using a delay greater than 0 as the cutoff.

A confusion matrix on these binary labels yields accuracy, precision, recall, and F1-score, and the ROC curve yields the AUC.

# Calculate RMSE
rmse <- sqrt(mean((results$Actual - results$Predicted)^2))
print(paste("RMSE: ", rmse))

# Ensure that predictions and actual values are aligned
test_data$predicted_delay <- predict(model_rf, newdata = test_data)

# Convert actual delays and predicted delays to binary outcomes
test_data$delay_binary <- ifelse(test_data$dep_delay > 0, 1, 0)  # 1 for delayed, 0 for not delayed
test_data$predicted_binary <- ifelse(test_data$predicted_delay > 0, 1, 0)  # Binary predictions

# Confusion matrix
conf_matrix <- confusionMatrix(as.factor(test_data$predicted_binary), 
                               as.factor(test_data$delay_binary))

# Accuracy
accuracy <- conf_matrix$overall['Accuracy']

# Precision, Recall, F1-score
precision <- conf_matrix$byClass['Pos Pred Value']
recall <- conf_matrix$byClass['Sensitivity']
f1_score <- 2 * (precision * recall) / (precision + recall)

# AUC
roc_obj <- pROC::roc(test_data$delay_binary, as.numeric(test_data$predicted_delay))  # roc() comes from the pROC package
auc <- pROC::auc(roc_obj)

# Print the metrics
cat("Accuracy: ", accuracy, "\n")
cat("Precision: ", precision, "\n")
cat("Recall (Sensitivity): ", recall, "\n")
cat("F1-score: ", f1_score, "\n")
cat("AUC: ", auc, "\n")

Output:

[1] "RMSE:  13.0528251991879"
Accuracy:  0.6145833 
Precision:  0.8461538 
Recall (Sensitivity):  0.4 
F1-score:  0.5432099 
AUC:  0.7933481

RMSE gives an idea of how close the model's predictions are to the actual values, in the same units as the target (minutes here), with lower values indicating better performance.
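An RMSE value is easier to judge next to a naive baseline. The sketch below compares the model against a constant prediction on the six rows printed by head(results) earlier. The baseline value 12.58 is the sample mean of dep_delay from the summary output, used here only as an approximate stand-in for a training-set mean, and six rows are far too few to draw conclusions about the model.

```r
# Root mean squared error between two numeric vectors
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# The six rows shown by head(results)
actual    <- c(-6, -3, -1, -2, -10, -9)
predicted <- c(-0.4863333, -1.9815000, 1.9500000, 5.0131667, -3.5148333, 28.9585000)

# Naive baseline: always predict the sample mean delay (approximate stand-in)
baseline <- rep(12.58, length(actual))

model_rmse    <- rmse(actual, predicted)
baseline_rmse <- rmse(actual, baseline)

cat("Model RMSE:   ", model_rmse, "\n")
cat("Baseline RMSE:", baseline_rmse, "\n")
```

Running the same comparison on the full test set, with the mean taken from train_data, would show how much the model improves on the constant prediction.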

Accuracy: Accuracy measures the proportion of correct predictions among all predictions made. Here the accuracy is 0.61, which means about 61% of the test flights were correctly classified as delayed or not delayed.
Precision: This is the share of positive predictions that are correct. Note that confusionMatrix() treats the first factor level as the positive class unless the positive argument is set, and here that level is "0" (not delayed). So about 84.6% of the flights predicted as not delayed were actually not delayed.
Recall: For the same reason, a recall of 40% means that the model correctly identifies only 40% of the flights that were actually not delayed. The remaining 60% of those flights are predicted as delayed, so the model over-predicts delays.
F1-score: This is the harmonic mean of precision and recall. A value of 0.54 reflects the gap between the high precision and the low recall, so the classification quality is only moderate.
AUC (Area Under the ROC Curve): This is the area under the ROC curve. A value of 0.79 means there is roughly a 79% chance that a randomly chosen delayed flight receives a higher predicted delay than a randomly chosen non-delayed flight. AUC ranges from 0 to 1, where 0.5 corresponds to random guessing and 1 to perfect separation.
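Precision and recall depend on which class is treated as positive. The sketch below computes them by hand for both classes from a 2x2 table. The counts are hypothetical: the article does not print its confusion matrix, and these values were chosen only because they reproduce the reported accuracy, precision, recall, and F1-score to rounding. The helper class_metrics is also made up for illustration.

```r
# Hypothetical counts (assumption); rows = predicted class, columns = actual class
conf <- matrix(c(22, 33, 4, 37), nrow = 2,
               dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

# Precision, recall and F1 for a chosen positive class
class_metrics <- function(conf, positive) {
  tp        <- conf[positive, positive]
  precision <- tp / sum(conf[positive, ])   # correct among predicted positives
  recall    <- tp / sum(conf[, positive])   # found among actual positives
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

accuracy <- sum(diag(conf)) / sum(conf)
m0 <- class_metrics(conf, "0")  # positive = not delayed
m1 <- class_metrics(conf, "1")  # positive = delayed

cat("Accuracy:", accuracy, "\n")
print(round(rbind(not_delayed = m0, delayed = m1), 4))
```

With caret, the same switch can be made by passing positive = "1" to confusionMatrix(), so that the precision and recall it reports describe the delayed class.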

Conclusion

This article discussed flight delay prediction using the R programming language and showed how machine learning can support travel planning. We used an external dataset to predict flight delays and used visualizations to understand the relationships between the variables.

algorhythm