Flight Delay Prediction Using R
Last Updated: 25 Jul, 2024
Predicting flight delays matters for both airlines and passengers. Reliable estimates support better time management and customer satisfaction, while unexpected delays cause significant dissatisfaction and can lead passengers to switch carriers on future trips. Using machine learning and data analysis, we can estimate flight delays in R, a popular statistical programming language. This article shows how to predict flight delays with the R Programming Language.
The objective of Flight Delay Prediction
Flight delay prediction involves forecasting whether a flight will be delayed and by how much, based on various factors such as weather conditions, flight schedule, aircraft specifics, and air traffic control constraints. The goal is to build a predictive model that can assist stakeholders in making informed decisions.
Predicting flight delays in R involves the following steps:
- Load Libraries
- Load and Inspect the Data
- Data Preprocessing
- Exploratory Data Analysis (EDA)
- Model Building
- Prediction
- Model Evaluation
1. Load the Required Libraries
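Install and load the packages used in the rest of the article up front; they simplify data handling, plotting, modeling, and evaluation.
R
# Install required packages if not already installed
install.packages(c("tidyverse", "caret", "lubridate", "corrplot", "randomForest", "pROC"))
# Load libraries
library(tidyverse)     # dplyr and ggplot2, among others
library(caret)         # createDataPartition(), confusionMatrix()
library(lubridate)     # ymd_hm(), hour(), minute()
library(corrplot)      # correlation plot
library(randomForest)  # random forest model
library(pROC)          # roc(), auc()
Make sure you have R and RStudio installed on your PC.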
2. Load and Inspect the Data
Here, we use an external dataset from Kaggle that records flights departing from New York City airports (EWR, JFK, and LGA) in 2013, including their departure and arrival delays.
Dataset Link: NYC Flight Data
R
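# Load the dataset
# stringsAsFactors = TRUE reads text columns as factors, which matches the
# summary shown below (since R 4.0 the default is FALSE)
data <- read.csv("flights.csv", stringsAsFactors = TRUE)
# Sample 0.1% of the rows, since the dataset is large
set.seed(123)
sampled_data <- data %>% sample_frac(0.001)
# Inspect the sampled data
str(sampled_data)
summary(sampled_data)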
Output:
      year          month             day           dep_time      sched_dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   2.0   Min.   : 600
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 856.8   1st Qu.: 902
 Median :2013   Median : 7.000   Median :16.00   Median :1401.5   Median :1359
 Mean   :2013   Mean   : 6.715   Mean   :16.06   Mean   :1340.3   Mean   :1350
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1745.0   3rd Qu.:1732
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2344.0   Max.   :2359
                                                 NA's   :9

   dep_delay         arr_time    sched_arr_time    arr_delay          carrier
 Min.   :-14.00   Min.   :   1   Min.   :   4   Min.   :-57.000   B6     :59
 1st Qu.: -5.00   1st Qu.:1107   1st Qu.:1114   1st Qu.:-17.000   UA     :59
 Median : -1.00   Median :1541   Median :1550   Median : -6.000   DL     :52
 Mean   : 12.58   Mean   :1511   Mean   :1541   Mean   :  7.443   EV     :51
 3rd Qu.: 13.25   3rd Qu.:1960   3rd Qu.:2000   3rd Qu.: 16.000   AA     :30
 Max.   :253.00   Max.   :2352   Max.   :2359   Max.   :330.000   MQ     :27
 NA's   :9        NA's   :9                     NA's   :10        (Other):59

     flight        tailnum      origin         dest        air_time        distance
 Min.   :  11   N738MQ :  4   EWR:125   ATL    : 18   Min.   : 29.0   Min.   :  94
 1st Qu.: 623   N723TW :  3   JFK:111   BOS    : 16   1st Qu.: 90.0   1st Qu.: 544
 Median :1444   N0EGMQ :  2   LGA:101   CLT    : 16   Median :139.0   Median :1005
 Mean   :1943   N11551 :  2             SFO    : 15   Mean   :157.3   Mean   :1091
 3rd Qu.:3361   N12238 :  2             FLL    : 14   3rd Qu.:194.0   3rd Qu.:1400
 Max.   :5978   (Other):320             LAX    : 14   Max.   :661.0   Max.   :4963
                NA's   :  4             (Other):244   NA's   :10

      hour           minute              time_hour
 Min.   : 6.00   Min.   : 0.00   13-08-2013 08:00:  3
 1st Qu.: 9.00   1st Qu.: 8.00   21-10-2013 20:00:  2
 Median :13.00   Median :30.00   22-02-2013 18:00:  2
 Mean   :13.23   Mean   :26.74   22-04-2013 06:00:  2
 3rd Qu.:17.00   3rd Qu.:41.00   22-11-2013 17:00:  2
 Max.   :23.00   Max.   :59.00   28-02-2013 17:00:  2
                                 (Other)         :324
3. Data Preprocessing
Data preprocessing is one of the most important steps in machine learning because it cleans the data and gets it ready for training. Missing values can distort the predictions, so they must be handled before modeling. Preprocessing here covers data cleaning, handling missing values, type conversion, and feature engineering.
R
# Handle missing values
sampled_data <- na.omit(sampled_data)
# Convert variables to appropriate types
sampled_data <- sampled_data %>%
  mutate(
    # Pad the HHMM clock times to four digits, e.g. 517 -> "0517"
    dep_time = sprintf("%04d", dep_time),
    sched_dep_time = sprintf("%04d", sched_dep_time),
    arr_time = sprintf("%04d", arr_time),
    sched_arr_time = sprintf("%04d", sched_arr_time),
    # Combine the date columns with hour and minute into date-times
    dep_time = ymd_hm(paste(year, month, day, substr(dep_time, 1, 2),
                            substr(dep_time, 3, 4))),
    sched_dep_time = ymd_hm(paste(year, month, day, substr(sched_dep_time, 1, 2),
                                  substr(sched_dep_time, 3, 4))),
    arr_time = ymd_hm(paste(year, month, day, substr(arr_time, 1, 2),
                            substr(arr_time, 3, 4))),
    sched_arr_time = ymd_hm(paste(year, month, day, substr(sched_arr_time, 1, 2),
                                  substr(sched_arr_time, 3, 4))),
    carrier = as.factor(carrier),
    origin = as.factor(origin),
    dest = as.factor(dest),
    flight = as.integer(flight),
    tailnum = as.factor(tailnum)
  )
# Create new features
sampled_data <- sampled_data %>%
  mutate(
    dep_hour = hour(dep_time),
    dep_minute = minute(dep_time),
    sched_dep_hour = hour(sched_dep_time),
    sched_dep_minute = minute(sched_dep_time),
    arr_hour = hour(arr_time),
    arr_minute = minute(arr_time),
    sched_arr_hour = hour(sched_arr_time),
    sched_arr_minute = minute(sched_arr_time)
  )
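The time conversion above relies on the clock columns being stored as integers in HHMM form (for example, 517 means 05:17). The following is a minimal base R sketch of that idea on three made-up values; the helper name `hhmm_parts`, the example date, and the UTC time zone are assumptions for illustration and are not part of the pipeline above.

```r
# Hypothetical helper: split an integer HHMM clock time into hour and minute.
hhmm_parts <- function(hhmm) {
  padded <- sprintf("%04d", as.integer(hhmm))   # 517 -> "0517", 5 -> "0005"
  list(
    hour   = as.integer(substr(padded, 1, 2)),  # first two characters
    minute = as.integer(substr(padded, 3, 4))   # last two characters
  )
}

parts <- hhmm_parts(c(517, 5, 2359))
parts$hour    # 5 0 23
parts$minute  # 17 5 59

# Combine with calendar fields into date-times (date and time zone assumed).
dep_dt <- ISOdatetime(2013, 1, 1, parts$hour, parts$minute, 0, tz = "UTC")
format(dep_dt, "%Y-%m-%d %H:%M")
```

Zero-padding first is what makes the fixed `substr()` positions safe: without it, a value such as 5 (00:05) would have no hour digits to extract.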
4. EDA (Exploratory Data Analysis)
EDA helps us understand the structure of a dataset before modeling. The plots below show how delays are distributed and how they vary by airline and date, which informs the modeling decisions that follow.
Distribution of Departure Delays
Here, we plot the distribution of departure delays in minutes to see how often flights leave early, on time, or late.
R
ggplot(sampled_data, aes(x = dep_delay)) +
geom_histogram(bins = 50, fill = "pink", color = "black") +
labs(title = "Distribution of Departure Delays", x = "Departure Delay (minutes)",
y = "Frequency")
Output:
Average Departure Delay by Airline
This box plot compares the distribution of departure delays across airlines, showing which carriers tend to have longer delays.
R
ggplot(sampled_data, aes(x = carrier, y = dep_delay, fill = carrier)) +
geom_boxplot() +
labs(title = "Average Departure Delay by Airline", x = "Airline",
y = "Departure Delay (minutes)")
Output:
Heatmap of Average Departure Delay
The heatmap shows which days of the month and which months have higher or lower average departure delays. Light blue tiles represent lower average delays, and red tiles represent higher ones.
R
# Average departure delay for each (month, day) combination
delay_heatmap <- sampled_data %>%
  group_by(month, day) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop")
ggplot(delay_heatmap, aes(x = factor(month), y = factor(day), fill = mean_dep_delay)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "red") +
  labs(title = "Heatmap of Average Departure Delay by Day and Month",
       x = "Month", y = "Day")
Output:
Correlation Matrix
We plot a correlation matrix of the numeric variables to see which of them move together with the departure delay.
R
corr_matrix <- cor(sampled_data %>% select(dep_delay, arr_delay, air_time,
distance, dep_hour, dep_minute, arr_hour,
arr_minute), use = "complete.obs")
corrplot(corr_matrix, method = "color", type = "upper", tl.col = "black")
Output:
5. Model Training using RandomForest
Random Forest is an ensemble learning method that combines many decision trees for classification or regression. Averaging the predictions of many trees typically reduces overfitting compared with a single decision tree. Because the target here, dep_delay, is numeric, the forest is trained as a regression model.
R
# Drop tailnum (too many levels) and convert the remaining factors
# to their integer level codes
model_data <- sampled_data %>%
  select(-tailnum) %>%
  mutate(across(where(is.factor), as.numeric))
# Split data into training (70%) and test (30%) sets
set.seed(123)
train_index <- createDataPartition(model_data$dep_delay, p = 0.7, list = FALSE)
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]
# Train random forest model on all remaining columns
model_rf <- randomForest(dep_delay ~ ., data = train_data, ntree = 100)
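One point worth checking in the formula `dep_delay ~ .` is that it uses every remaining column as a predictor, including the actual departure and arrival times and `arr_delay`, which are only known once the flight has flown. If the goal is a forecast made before departure, a minimal sketch (under that assumption) is to build the formula from an explicit list of schedule-time columns. The column list and the toy data frame below are illustrative assumptions, not the article's method.

```r
# Toy stand-in for model_data; the values are made up for illustration only.
toy <- data.frame(
  month          = c(1, 1, 2, 2),
  day            = c(3, 4, 5, 6),
  sched_dep_hour = c(6, 9, 17, 20),
  carrier        = c(1, 2, 1, 3),
  origin         = c(1, 2, 3, 1),
  distance       = c(544, 1005, 1400, 94),
  dep_hour       = c(6, 9, 18, 20),   # known only after departure
  arr_delay      = c(-5, 2, 40, -1),  # known only after arrival
  dep_delay      = c(-2, 0, 35, 1)    # target
)

# Assumed set of columns that are available before the flight departs.
pre_departure <- c("month", "day", "sched_dep_hour",
                   "carrier", "origin", "distance")

# Build the model formula from that list instead of using dep_delay ~ .
rf_formula <- reformulate(pre_departure, response = "dep_delay")
print(rf_formula)

# Keep only the chosen predictors and the target.
toy_model <- toy[, c(pre_departure, "dep_delay")]
names(toy_model)
```

With the objects from this article, the same formula could be passed as `randomForest(rf_formula, data = train_data, ntree = 100)`. The metrics reported later in this article were not produced this way, so they would change.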
6. Prediction using RandomForest
Here, we use the trained random forest to predict the departure delay, in minutes, of each flight in the test set. A flight is then labeled as delayed when its predicted delay is greater than zero.
R
# Make predictions on the test data
predictions <- predict(model_rf, test_data)
# Create a results dataframe for comparison
results <- data.frame(
Actual = test_data$dep_delay,
Predicted = predictions,
Residual = test_data$dep_delay - predictions
)
# Add a column indicating whether the flight is predicted to be delayed
results <- results %>%
  mutate(
    Delayed = ifelse(Predicted > 0, "Yes", "No")
  )
# Show the first rows of the results
head(results)
Output:
Actual Predicted Residual Delayed
2 -6 -0.4863333 -5.513667 No
3 -3 -1.9815000 -1.018500 No
4 -1 1.9500000 -2.950000 Yes
5 -2 5.0131667 -7.013167 Yes
9 -10 -3.5148333 -6.485167 No
14 -9 28.9585000 -37.958500 Yes
Visualize the Prediction for Delays
To get a better picture of the predictions, we plot the predicted delays for the flights that the model labels as delayed.
R
# Filter for delayed flights
delayed_flights <- results %>% filter(Delayed == "Yes")
# Histogram of predicted delays for delayed flights
ggplot(delayed_flights, aes(x = Predicted)) +
geom_histogram(bins = 30, fill = "orange", color = "black", alpha = 0.7) +
labs(title = "Distribution of Predicted Delays for Delayed Flights",
x = "Predicted Delay (minutes)",
y = "Frequency") +
theme_minimal()
Output:
7. Model Evaluation
Evaluation metrics tell us how accurate the model is and where its predictions go wrong. We evaluate the model in two ways. RMSE measures the error of the predicted delay in minutes. For the delayed versus not-delayed view, we threshold both the actual and the predicted delay at zero, build a confusion matrix, and compute accuracy, precision, recall, and F1-score from it. AUC is computed from the predicted delay used as a ranking score.
R
# Calculate RMSE
rmse <- sqrt(mean((results$Actual - results$Predicted)^2))
print(paste("RMSE: ", rmse))
# Store the predictions next to the actual values in the test set
test_data$predicted_delay <- predict(model_rf, newdata = test_data)
# Convert actual and predicted delays to binary outcomes (1 = delayed, 0 = not delayed)
test_data$delay_binary <- ifelse(test_data$dep_delay > 0, 1, 0)
test_data$predicted_binary <- ifelse(test_data$predicted_delay > 0, 1, 0)
# Confusion matrix. caret treats the first factor level ("0", not delayed)
# as the positive class unless the `positive` argument is set.
conf_matrix <- confusionMatrix(factor(test_data$predicted_binary, levels = c(0, 1)),
                               factor(test_data$delay_binary, levels = c(0, 1)))
# Accuracy
accuracy <- conf_matrix$overall['Accuracy']
# Precision, Recall, F1-score (all reported for the positive class)
precision <- conf_matrix$byClass['Pos Pred Value']
recall <- conf_matrix$byClass['Sensitivity']
f1_score <- 2 * (precision * recall) / (precision + recall)
# AUC: roc() and auc() come from the pROC package
roc_obj <- pROC::roc(test_data$delay_binary, as.numeric(test_data$predicted_delay))
auc_value <- as.numeric(pROC::auc(roc_obj))
# Print the metrics
cat("Accuracy: ", accuracy, "\n")
cat("Precision: ", precision, "\n")
cat("Recall (Sensitivity): ", recall, "\n")
cat("F1-score: ", f1_score, "\n")
cat("AUC: ", auc_value, "\n")
Output:
[1] "RMSE:  13.0528251991879"
Accuracy: 0.6145833
Precision: 0.8461538
Recall (Sensitivity): 0.4
F1-score: 0.5432099
AUC: 0.7933481
RMSE measures how close the model's predictions are to the actual values, in the same units as the target. Here the typical error is about 13 minutes; lower values indicate better performance.
- Accuracy: Accuracy is the proportion of correct predictions among all predictions made. An accuracy of 0.61 means that about 61% of the flights were correctly classified as either delayed or not delayed.
- Precision: This is the share of positive predictions that are correct. confusionMatrix() uses the first factor level as the positive class, which here is 0 (not delayed), so a precision of 0.846 means that 84.6% of the flights predicted as not delayed were in fact not delayed.
- Recall: By the same convention, a recall of 40% means that the model correctly identifies only 40% of the flights that were actually not delayed. The remaining 60% are flagged as delayed, so the model over-predicts delays.
- F1-score: This is the harmonic mean of precision and recall (sensitivity). At 0.54, it is pulled down by the low recall.
- AUC (Area Under the ROC Curve): This is the area under the ROC curve. A value of 0.79 means that a randomly chosen delayed flight receives a higher predicted delay than a randomly chosen non-delayed flight about 79% of the time. A value of 0.5 corresponds to random ranking and 1 to perfect separation.
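Precision and recall depend on which class is treated as positive, and caret's confusionMatrix() takes the first factor level (here "0") unless its `positive` argument is set. The sketch below recomputes the metrics by hand in base R on made-up labels to show how the choice changes the numbers. The helper `class_metrics` and the toy vectors are hypothetical and are not the article's test data.

```r
# Toy labels (1 = delayed, 0 = not delayed); made up for illustration.
actual    <- c(1, 1, 1, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 1, 1, 1, 0, 0)  # a model that over-predicts delays

# Hypothetical helper: precision, recall and F1 for a chosen positive class.
class_metrics <- function(actual, predicted, positive) {
  tp <- sum(predicted == positive & actual == positive)  # true positives
  fp <- sum(predicted == positive & actual != positive)  # false positives
  fn <- sum(predicted != positive & actual == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# "Delayed" as the positive class: precision 0.5, recall 1
class_metrics(actual, predicted, positive = 1)
# "Not delayed" as the positive class: precision 1, recall 0.4
class_metrics(actual, predicted, positive = 0)

# Check: the F1-score printed in the output follows from the printed
# precision and recall.
f1_reported <- 2 * 0.8461538 * 0.4 / (0.8461538 + 0.4)
round(f1_reported, 4)  # 0.5432
```

To report the metrics for delayed flights with caret, `positive = "1"` can be passed to confusionMatrix(); the precision and recall values would then differ from the ones printed above.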
Conclusion
This article walked through flight delay prediction in R: loading and preprocessing an external flight dataset, exploring the relationships between variables through visualization, training a random forest model, and evaluating its predictions. Models like this show how machine learning can support travel planning.