Transport Demand Prediction using Regression
Last Updated: 21 Aug, 2024
Transport demand prediction is a crucial aspect of transportation planning and management. Accurate demand forecasts enable efficient resource allocation, improved service planning, and enhanced customer satisfaction. Regression analysis, a statistical method for modeling relationships between variables, is widely used for predicting transport demand. This article delves into the methodologies, challenges, and applications of regression models in transport demand prediction.
Understanding Transport Demand Prediction
Transport demand prediction involves estimating future demand for transportation services based on historical data and various influencing factors. It is essential for optimizing routes, scheduling, and resource allocation in public transport systems. Accurate predictions can lead to cost savings, improved service quality, and better infrastructure planning.
Key Factors Influencing Transport Demand
Several factors influence transport demand, including:
- Economic Indicators: Economic growth, measured by GDP or regional output, significantly impacts transport demand. Economic downturns can reduce demand, while growth can increase it.
- Demographic Factors: Population size, age distribution, and urbanization levels affect transport demand. Urban areas with high population densities typically have higher public transport usage.
- Transport Infrastructure: The availability and quality of transport infrastructure, such as roads, railways, and public transit systems, influence demand. Improved infrastructure can induce demand by offering better service quality.
- Technological Advancements: Innovations in transport technology, such as intelligent transport systems (ITS), can affect demand by improving service efficiency and reliability.
- Social and Cultural Factors: Cultural events, holidays, and social behaviors impact transport patterns. For instance, peak travel times often coincide with holidays or major events.
Role of Regression Analysis in Predicting Transport Demand
Regression analysis is a statistical tool for modeling the relationship between a dependent variable (here, transport demand measured by the number of seats sold) and one or more independent variables (such as travel date, travel time, origin, destination, and vehicle type).
It enables transportation planners and data scientists to identify patterns in historical data and use these patterns to make predictions about future demand.
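As a minimal illustration of this idea, the sketch below fits a one-variable linear model by ordinary least squares with NumPy. The toy numbers and the variable names (hours, seats) are made up for illustration and are not taken from the Mobiticket data.

```python
import numpy as np

# Hypothetical toy data (illustrative only): departure hour and seats sold,
# constructed so that seats = 30 - 1 * hour exactly
hours = np.array([6.0, 9.0, 12.0, 15.0, 18.0])
seats = 30.0 - 1.0 * hours

# Design matrix with an intercept column and one feature column
A = np.column_stack([np.ones_like(hours), hours])

# Ordinary least squares: find the coefficients minimizing ||A @ coef - seats||^2
coef, *_ = np.linalg.lstsq(A, seats, rcond=None)
intercept, slope = coef

# Use the fitted line to predict demand for a departure at 10:00
predicted_seats_at_10 = intercept + slope * 10.0
print(f"intercept={intercept:.2f}, slope={slope:.2f}, "
      f"prediction at 10:00={predicted_seats_at_10:.2f}")
```

The dependent variable is seats and the single independent variable is hours; with more features, the design matrix simply gains one column per feature.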
Why Regression Analysis?
The need for regression analysis in transport demand prediction stems from its ability to:
- Quantify Relationships: Regression helps quantify the relationship between transport demand and various influencing factors like time of travel, routes, payment methods, and vehicle types.
- Capture Trends: It can identify trends and patterns in historical data, such as peak travel times, popular routes, or the impact of holidays and weekends on demand.
- Provide Predictive Power: By establishing a mathematical model that connects demand to key variables, regression analysis allows future transportation needs to be forecast, with an accuracy that depends on the quality of the data and the fit of the model.
- Model Complexity: In simple cases, linear regression can be sufficient. However, transport demand is often affected by non-linear relationships between variables (e.g., the impact of peak hours on demand might not increase linearly). This is where more advanced regression models like Random Forests, Gradient Boosting, or XGBoost become useful. These models capture more complex interactions between the features.
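The point about non-linear relationships can be sketched as follows. The demand curve below, with peaks around 08:00 and 18:00, is a made-up assumption for illustration; the comparison only shows that a straight line cannot follow such a shape while a tree ensemble can.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical demand curve (an assumption, not real data):
# a baseline of 20 seats plus two rush-hour peaks and Gaussian noise
hour = rng.integers(0, 24, size=2000)
demand = (20
          + 15 * np.exp(-((hour - 8) ** 2) / 4)
          + 15 * np.exp(-((hour - 18) ** 2) / 4)
          + rng.normal(0, 2, size=2000))

X = hour.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, demand, test_size=0.2, random_state=0
)

# Fit a straight line and a tree ensemble on the same data
linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

r2_linear = r2_score(y_test, linear.predict(X_test))
r2_forest = r2_score(y_test, forest.predict(X_test))
print(f"Linear regression R²: {r2_linear:.3f}")
print(f"Random forest R²:     {r2_forest:.3f}")
```

On data shaped like this, the linear model scores far lower than the forest because the two peaks cancel out any overall slope.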
Implementing Regression Model for Transport Demand Prediction
To build a practical model for transport demand prediction, we will follow a structured approach.
Problem Statement
The goal is to predict the number of seats sold for each ride on specific routes, dates, and times using historical data from Mobiticket. This prediction will help optimize resource allocation and improve service planning for public transport in Nairobi.
Approach to the Model
The dataset includes variables such as ride_id, seat_number, payment_method, travel_date, travel_time, travel_from, travel_to, car_type, and max_capacity. Our objective is to use these features to predict the number of seats sold (seat_number).
Steps to Build the Model:
- Data Preprocessing:
  - Feature Engineering: Create new features such as day of the week, hour of the day, or whether the travel date falls on a weekend or holiday.
  - Handling Categorical Variables: Convert categorical variables such as payment_method, travel_from, travel_to, and car_type into numerical representations using one-hot encoding or label encoding.
  - Handling Dates: Extract useful information from travel_date and travel_time (e.g., day, month, hour).
  - Normalization/Standardization: Standardize features like max_capacity. This mainly benefits linear models; tree-based models are largely insensitive to feature scaling.
- Modeling:
  - Train-Test Split: Split the data into training and test sets to evaluate model performance.
  - Model Selection: Start with a simple regression model like Linear Regression, and then explore more complex models like Random Forest, Gradient Boosting, or XGBoost to capture non-linear relationships.
  - Model Evaluation: Evaluate the model's performance using appropriate metrics.
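The preprocessing steps listed above can be sketched on a tiny example. The three rides and the one-entry holiday calendar below are placeholders invented for illustration; a real project would load an official holiday list.

```python
import pandas as pd

# Hypothetical rides (illustrative values only)
rides = pd.DataFrame({
    "travel_date": ["2024-01-01", "2024-01-06", "2024-01-10"],
    "travel_time": ["07:15:00", "18:40:00", "12:05:00"],
    "car_type": ["bus", "van", "bus"],
})

# Placeholder holiday calendar (an assumption for this sketch)
holidays = pd.to_datetime(["2024-01-01"])

# Date features: day of week (Monday = 0), weekend flag, holiday flag
rides["travel_date"] = pd.to_datetime(rides["travel_date"])
rides["day_of_week"] = rides["travel_date"].dt.dayofweek
rides["is_weekend"] = (rides["day_of_week"] >= 5).astype(int)
rides["is_holiday"] = rides["travel_date"].isin(holidays).astype(int)

# Time feature: hour of departure parsed from an HH:MM:SS string
rides["hour"] = pd.to_datetime(rides["travel_time"], format="%H:%M:%S").dt.hour

# One-hot encode the categorical column into 0/1 indicator columns
encoded = pd.get_dummies(rides, columns=["car_type"], dtype=int)
print(encoded)
```

Label encoding would instead map each category to a single integer, which is more compact but imposes an arbitrary ordering that linear models may misread.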
Step 1: Transport Demand: Data Generation and Loading
Let's create a synthetic dataset that resembles the structure described for predicting the number of seats sold for each ride. This dataset will include features like ride_id, seat_number, payment_method, travel_date, travel_time, travel_from, travel_to, car_type, and max_capacity.
Python
import numpy as np
import pandas as pd
from datetime import timedelta, datetime
# Set random seed for reproducibility
np.random.seed(42)
# Parameters
n_samples = 10000 # Number of rides
locations = ['Location_A', 'Location_B', 'Location_C', 'Location_D', 'Location_E']
car_types = ['bus', 'minibus', 'van']
payment_methods = ['cash', 'mobile_payment', 'card']
start_date = datetime(2024, 1, 1)
# Generate data
ride_ids = np.arange(1, n_samples + 1)
travel_dates = [start_date + timedelta(days=np.random.randint(0, 365)) for _ in range(n_samples)]
travel_times = [datetime(2024, 1, 1, np.random.randint(0, 24), np.random.randint(0, 60)).time() for _ in range(n_samples)]
travel_from = np.random.choice(locations, n_samples)
travel_to = np.random.choice(locations, n_samples)
car_type = np.random.choice(car_types, n_samples)
max_capacity = np.random.choice([14, 30, 50], n_samples)
payment_method = np.random.choice(payment_methods, n_samples)
# Calculate seat_number with some illustrative logic:
# capacity, time of day, and payment method affect seat_number
# (car_type is sampled independently and has no effect here)
seat_number = (np.random.poisson(lam=10, size=n_samples)
               + (max_capacity / 2).astype(int)
               + np.random.randint(0, 5, n_samples)
               - (np.array([t.hour for t in travel_times]) // 3)
               + (payment_method == 'mobile_payment').astype(int) * 5
               ).clip(1, max_capacity)  # Ensure seat_number is between 1 and max_capacity
# Create the DataFrame
data = pd.DataFrame({
    'ride_id': ride_ids,
    'travel_date': travel_dates,
    'travel_time': travel_times,
    'travel_from': travel_from,
    'travel_to': travel_to,
    'car_type': car_type,
    'max_capacity': max_capacity,
    'payment_method': payment_method,
    'seat_number': seat_number
})
data.to_csv("train_revised.csv", index=False)
data.head()
Output:
ride_id travel_date travel_time travel_from travel_to car_type max_capacity payment_method seat_number
0 1 2024-04-12 17:29:00 Location_E Location_E van 50 cash 38
1 2 2024-12-14 11:47:00 Location_E Location_D bus 30 mobile_payment 27
2 3 2024-09-27 04:19:00 Location_B Location_B van 14 card 14
3 4 2024-04-16 11:20:00 Location_E Location_D bus 30 cash 28
4 5 2024-03-12 10:08:00 Location_A Location_E minibus 50 cash 36
Step 2: Data Preprocessing
Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Feature Engineering
data['travel_date'] = pd.to_datetime(data['travel_date'])
data['day_of_week'] = data['travel_date'].dt.dayofweek
data['month'] = data['travel_date'].dt.month
# travel_time holds datetime.time objects (or HH:MM:SS strings if reloaded from the CSV),
# so convert to string before parsing; pd.to_datetime cannot parse time objects directly
data['hour'] = pd.to_datetime(data['travel_time'].astype(str), format='%H:%M:%S').dt.hour
# Drop the target and columns that are not used as model inputs
X = data.drop(columns=['ride_id', 'seat_number', 'travel_date', 'travel_time'])
y = data['seat_number']
# Handling Categorical Variables and Scaling
categorical_features = ['payment_method', 'travel_from', 'travel_to', 'car_type']
numerical_features = ['max_capacity', 'day_of_week', 'month', 'hour']
categorical_transformer = OneHotEncoder(drop='first')
numerical_transformer = StandardScaler()
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features),
    ]
)
Step 3: Build the Model
Python
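from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Pipeline that applies the preprocessing and then fits the model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(random_state=42))
])
pipeline.fit(X_train, y_train)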
Step 4: Predictions and Evaluation of the Model
Python
from sklearn.metrics import mean_squared_error, r2_score
# Predict on the test set
y_pred = pipeline.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")
Output:
Mean Squared Error (MSE): 8.306493985761055
R-squared (R²): 0.9101232043264215
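To make these two metrics concrete, the sketch below computes them by hand with NumPy on four made-up values (y_true and y_hat are illustrative, not model output). Taking the square root of the MSE reported above, about 8.31, gives a root mean squared error of roughly 2.9 seats, which is easier to interpret because it is in the same unit as the target.

```python
import numpy as np

# Hypothetical true and predicted seat counts (illustrative only)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true - y_hat

# Mean squared error: average of the squared residuals
mse = np.mean(residuals ** 2)

# Root mean squared error: same unit as the target (seats)
rmse = np.sqrt(mse)

# R-squared: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, R²={r2:.3f}")
```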
Application of Regression Analysis in Your Model
In the model built above, regression analysis is applied to a synthetic dataset structured like the Mobiticket data to predict the number of seats sold for specific rides. Here is how regression analysis plays a central role:
- Identifying Key Variables: Regression analysis helps determine which features (e.g., time of travel, route, vehicle type) most strongly influence the number of seats sold. For instance, routes from high-demand locations or specific travel times (e.g., morning rush hour) might be more likely to have a higher number of seats sold.
- Modeling Demand Patterns: Using regression techniques, your model can learn the demand patterns from historical data, which may include daily, weekly, or seasonal trends. For example, demand may rise on weekends or holidays, which the regression model can capture by incorporating time-based features.
- Predicting Future Demand: Once the model is trained, it can predict the number of seats sold for future rides based on known factors like the date, time, and route. These predictions enable transport companies to allocate resources efficiently by scheduling additional vehicles on high-demand routes or adjusting schedules to match predicted demand.
- Evaluating Model Performance: Regression models can be evaluated using metrics such as Mean Squared Error (MSE) and R-squared (R²) scores. These metrics help assess how well the model fits the data and how accurately it predicts demand. In your model, an R-squared of 0.91 indicates that the regression model explains about 91% of the variance in the number of seats sold on the held-out test set. Because the data here are synthetic, this shows that the pipeline works rather than how well it would perform on real ticketing data.
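The first and third points can be sketched with a random forest's impurity-based feature importances and a prediction for a new ride. Everything below is a hedged illustration: the data are generated with a made-up rule, and random_noise is a deliberately irrelevant column added for contrast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 1000

# Hypothetical features; random_noise has no relationship to the target
features = pd.DataFrame({
    "max_capacity": rng.choice([14, 30, 50], size=n),
    "hour": rng.integers(0, 24, size=n),
    "random_noise": rng.normal(size=n),
})

# Made-up demand rule: half the capacity, fewer seats later in the day, plus noise
seats = (features["max_capacity"] / 2
         - features["hour"] // 3
         + rng.normal(0, 1, size=n))

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(features, seats)

# Impurity-based importances, sorted from most to least influential
importances = pd.Series(
    model.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importances)

# Predict seats for a hypothetical new ride: a 50-seat vehicle leaving at 08:00
new_ride = pd.DataFrame({"max_capacity": [50], "hour": [8], "random_noise": [0.0]})
predicted = model.predict(new_ride)[0]
print(f"Predicted seats: {predicted:.1f}")
```

Impurity-based importances are quick to obtain but can be biased toward features with many distinct values, so they are best treated as a first look rather than a final ranking.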
Conclusion
Transport demand prediction is a critical tool for optimizing transportation planning and resource management. Through regression analysis, predictive models can leverage historical data and key influencing factors to forecast future demand accurately. By adopting a structured approach to data preprocessing, feature engineering, and model selection, transportation planners can enhance service quality, improve operational efficiency, and better cater to passenger needs.
The presented methodology for predicting the number of seats sold in Nairobi's public transport highlights how data-driven decisions can shape the future of urban mobility, benefiting both service providers and passengers alike.