Assignment 2 09 10

The document discusses the development of multiple regression models to predict tip amounts based on various predictors such as passenger count, fare amount, and time-related variables. It outlines the creation of models (model5, model6, model7) and the evaluation of their performance using RMSE on test samples. Additionally, it addresses the significance of predictors and the exploration of non-linear relationships through polynomial terms and smoothing functions.

Uploaded by

Vaishali Mishra

Generally speaking, including a larger number of meaningful predictors will improve
the quality of predictions. It is reasonable to expect the following predictors to
influence the tip paid: number of passengers (passenger_count), fare amount
(fare_amount), hour of the day of the ride (pickup_hour), whether the trip occurred at
the beginning, middle or end of the month (period_of_month), and day of the week
for the trip (pickup_day). Use these variables in a multiple regression to predict
tip_amount. Call this model5.
Which of the following variables are significant predictors of tip_amount? Please
note, a categorical predictor variable is statistically significant if even one of the
dummy variables representing it is statistically significant. Select one or more correct
answers.
Group of answer choices

passenger_count

fare_amount

pickup_hour

period_of_month

pickup_day
Answer: fare_amount
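
A minimal sketch of how the significance check might look in R. The `trips` data frame is simulated purely for illustration (it is not the assignment's data), and the 0.05 cutoff is an assumed significance level.

```r
# Hypothetical stand-in for the assignment data: simulated, not real taxi trips
set.seed(42)
n <- 500
trips <- data.frame(
  passenger_count = sample(1:4, n, replace = TRUE),
  fare_amount     = runif(n, min = 3, max = 60),
  pickup_hour     = sample(0:23, n, replace = TRUE),
  period_of_month = factor(sample(c("beginning", "middle", "end"), n, replace = TRUE)),
  pickup_day      = factor(sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                                  n, replace = TRUE))
)
# In this simulation the tip depends only on the fare, plus noise
trips$tip_amount <- 0.5 + 0.15 * trips$fare_amount + rnorm(n, sd = 1)

model5 <- lm(tip_amount ~ passenger_count + fare_amount + pickup_hour +
               period_of_month + pickup_day, data = trips)

# Coefficient table: one row per numeric predictor or dummy variable
coef_table <- summary(model5)$coefficients
p_values <- coef_table[-1, "Pr(>|t|)"]   # drop the intercept row

# Rows below the assumed 0.05 cutoff; a factor counts as significant
# if any one of its dummy rows appears here
significant_terms <- names(p_values)[p_values < 0.05]
print(significant_terms)
```

On real data, the same table is what `summary(model5)` prints; the dummy rows carry the factor name followed by the level name.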

Question 17 (2 pts)


In model5, which is the strongest predictor of tip_amount?
Group of answer choices

passenger_count

fare_amount

pickup_hour

period_of_month

pickup_day

Answer: fare_amount
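
One common way to compare predictor strength is to look at standardized coefficients. Whether this is the method the assignment intends is an assumption. The sketch below uses simulated data and compares only the numeric predictors, since a factor's dummy coefficients are not directly comparable in this way.

```r
# Simulated stand-in data (hypothetical, for illustration only)
set.seed(7)
n <- 500
trips <- data.frame(
  passenger_count = sample(1:4, n, replace = TRUE),
  fare_amount     = runif(n, min = 3, max = 60),
  pickup_hour     = sample(0:23, n, replace = TRUE),
  period_of_month = factor(sample(c("beginning", "middle", "end"), n, replace = TRUE)),
  pickup_day      = factor(sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                                  n, replace = TRUE))
)
trips$tip_amount <- 0.5 + 0.15 * trips$fare_amount + rnorm(n, sd = 1)

# Put the outcome and numeric predictors on a common scale (mean 0, sd 1)
num_vars <- c("tip_amount", "passenger_count", "fare_amount", "pickup_hour")
scaled <- trips
scaled[num_vars] <- lapply(trips[num_vars], function(x) as.numeric(scale(x)))

model5_std <- lm(tip_amount ~ passenger_count + fare_amount + pickup_hour +
                   period_of_month + pickup_day, data = scaled)

# Standardized slopes of the numeric predictors; largest absolute value wins
std_coefs <- coef(model5_std)[c("passenger_count", "fare_amount", "pickup_hour")]
strongest <- names(which.max(abs(std_coefs)))
print(strongest)
```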

Question 18 (2 pts)


What is the RMSE for model5?
Root Mean Squared Error
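
RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch, using a hypothetical helper named `compute_rmse` (not a base R function) and a small simulated regression:

```r
# Hypothetical helper: root mean squared error of a set of predictions
compute_rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy check: the errors are 0, 0 and 2, so RMSE = sqrt(4 / 3)
toy_rmse <- compute_rmse(c(1, 2, 3), c(1, 2, 5))

# Simulated example (illustrative data, not the assignment's)
set.seed(1)
x <- runif(200, min = 3, max = 60)
y <- 0.5 + 0.15 * x + rnorm(200)
fit <- lm(y ~ x)

# In-sample RMSE: fitted values come from the same data used to estimate the model
train_rmse <- compute_rmse(y, fitted(fit))
print(train_rmse)
```

For a fitted `lm` object this equals `sqrt(mean(residuals(fit)^2))`.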

Question 19 (2 pts)


Now, let us explore non-linear relationships of fare_amount and pickup_hour by
including polynomial terms. Modify model5 by replacing fare_amount with
poly(fare_amount, 2) and pickup_hour with poly(pickup_hour, 2). Keep the rest of the
model the same. Call this model6.
In model6, which of the following variables are significant predictors of tip_amount?
Select one or more correct answers.
Group of answer choices

passenger_count

poly(fare_amount,2)1

poly(fare_amount,2)2

poly(pickup_hour,2)1

poly(pickup_hour,2)2

period_of_month

pickup_day
Answer: poly(fare_amount,2)
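
A minimal sketch of the polynomial model on simulated data (the curvature in the simulated tips is an assumption made for illustration). By default, `poly(x, 2)` builds orthogonal polynomial terms, so the coefficients ending in 1 and 2 are the linear and quadratic components; `raw = TRUE` would use plain `x` and `x^2` instead.

```r
# Simulated stand-in data with a mildly curved fare effect (hypothetical)
set.seed(11)
n <- 500
trips <- data.frame(
  passenger_count = sample(1:4, n, replace = TRUE),
  fare_amount     = runif(n, min = 3, max = 60),
  pickup_hour     = sample(0:23, n, replace = TRUE),
  period_of_month = factor(sample(c("beginning", "middle", "end"), n, replace = TRUE)),
  pickup_day      = factor(sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                                  n, replace = TRUE))
)
trips$tip_amount <- 0.5 + 0.25 * trips$fare_amount -
  0.002 * trips$fare_amount^2 + rnorm(n, sd = 1)

model6 <- lm(tip_amount ~ passenger_count + poly(fare_amount, 2) +
               poly(pickup_hour, 2) + period_of_month + pickup_day, data = trips)

# Pull out the rows for the polynomial terms only
coef_table <- summary(model6)$coefficients
poly_rows <- grep("^poly\\(", rownames(coef_table), value = TRUE)
print(coef_table[poly_rows, ])
```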

Question 20 (2 pts)


What is the RMSE for model6?

Question 21 (2 pts)


Use the variables in model5 to fit a Generalized Additive Model using
method="REML". Use smoothing functions for fare_amount [i.e., s(fare_amount)] and
pickup_hour [i.e., s(pickup_hour)]. Leave the other variables unchanged. Call
this model7.
In model7, which of the following variables are significant predictors of tip_amount?
Select one or more correct answers.
Group of answer choices

passenger_count

s(fare_amount)

s(pickup_hour)

period_of_month

pickup_day

Answer: s(fare_amount), s(pickup_hour)
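
A minimal sketch of where to look in a GAM summary, assuming the `mgcv` package and simulated data (the square-root fare effect is invented for illustration). Parametric terms and smooth terms are reported in separate tables.

```r
library(mgcv)

# Simulated stand-in data with a non-linear fare effect (hypothetical)
set.seed(21)
n <- 500
trips <- data.frame(
  passenger_count = sample(1:4, n, replace = TRUE),
  fare_amount     = runif(n, min = 3, max = 60),
  pickup_hour     = sample(0:23, n, replace = TRUE),
  period_of_month = factor(sample(c("beginning", "middle", "end"), n, replace = TRUE)),
  pickup_day      = factor(sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                                  n, replace = TRUE))
)
trips$tip_amount <- 0.5 + 2 * sqrt(trips$fare_amount) + rnorm(n, sd = 1)

model7 <- gam(tip_amount ~ passenger_count + s(fare_amount) + s(pickup_hour) +
                period_of_month + pickup_day, data = trips, method = "REML")

model7_summary <- summary(model7)
print(model7_summary$p.table)   # parametric terms: passenger_count and factor dummies
print(model7_summary$s.table)   # smooth terms: s(fare_amount), s(pickup_hour)

# Approximate p-values for the smooth terms
smooth_p <- model7_summary$s.table[, "p-value"]
```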

Question 22 (2 pts)


What is the RMSE for model7?

Question 23 (2 pts)


The litmus test for model performance is how it performs on data that was not used
to estimate it. Model5 is the multiple regression model with linear terms. Compute
the RMSE for model5 on the test sample.

# Assuming data is already split into train and test datasets

# Fit Model 5 on training data
model5 <- lm(tip_amount ~ passenger_count + fare_amount +
             pickup_hour + period_of_month + pickup_day, data = train_data)

# Predict on the test set
predictions <- predict(model5, newdata = test_data)

# Calculate RMSE
actuals <- test_data$tip_amount
rmse <- sqrt(mean((predictions - actuals)^2))
print(rmse)
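
The code above assumes `train_data` and `test_data` already exist. A minimal sketch of one way to create them, assuming a random 70/30 split and a fixed seed (both are illustrative choices, not taken from the assignment) on simulated data:

```r
# Simulated stand-in data (hypothetical)
set.seed(123)
n <- 1000
trips <- data.frame(fare_amount = runif(n, min = 3, max = 60))
trips$tip_amount <- 0.5 + 0.15 * trips$fare_amount + rnorm(n, sd = 1)

# Randomly pick 70% of the row indices for training
train_idx <- sample(seq_len(nrow(trips)), size = floor(0.7 * nrow(trips)))

train_data <- trips[train_idx, ]    # used to estimate the model
test_data  <- trips[-train_idx, ]   # held out for evaluation only

print(c(train = nrow(train_data), test = nrow(test_data)))
```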

Question 24 (2 pts)


GAM (model7) did better than the linear model (model5) on the train sample. Let us
see if the flexible GAM model outperforms the linear model in the test sample.
Compute the RMSE for model7 on the test sample.

# Load necessary libraries
library(mgcv)

# Fit Model 7 using training data
model7 <- gam(tip_amount ~ s(fare_amount) + s(pickup_hour) +
              passenger_count + period_of_month + pickup_day,
              data = train_data, method = "REML")

# Predict on the test data
predictions_model7 <- predict(model7, newdata = test_data)

# Calculate RMSE
actuals <- test_data$tip_amount
rmse_model7 <- sqrt(mean((actuals - predictions_model7)^2))

# Print RMSE
print(rmse_model7)
