Kaggle playground to predict the total ride duration of taxi trips in New York City.
The first part analyzes the dataframe and looks at the correlations between variables.
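A minimal sketch of the correlation check, assuming pandas. The column names (`distance_km`, `trip_duration`, `passenger_count`) and the values are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy dataframe; columns and values are illustrative assumptions only.
trips = pd.DataFrame({
    "distance_km": [1.0, 2.0, 3.0, 4.0],
    "trip_duration": [300, 600, 900, 1200],  # seconds
    "passenger_count": [1, 2, 1, 2],
})

# Pairwise Pearson correlation between all numeric columns.
corr = trips.corr()

# Rank the other variables by their correlation with the target.
duration_corr = (
    corr["trip_duration"].drop("trip_duration").sort_values(ascending=False)
)
print(duration_corr)
```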
The goal of this playground is to predict the trip duration for the test set. Some neighborhoods are more congested than others, so I used K-Means to compute geo-clusters for the pickup and drop-off locations.
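The geo-clustering step could look roughly like this. It is a sketch assuming scikit-learn's `KMeans`; the helper name `fit_geo_clusters`, the number of clusters, and the synthetic coordinates are all assumptions, not the settings used in the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_geo_clusters(pickup, dropoff, n_clusters=8, seed=0):
    """Fit one K-Means model on all pickup and drop-off points.

    pickup, dropoff: arrays of shape (n, 2) holding (latitude, longitude).
    Returns the fitted model and the cluster label of each pickup and drop-off.
    """
    coords = np.vstack([pickup, dropoff])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(coords)
    return model, model.predict(pickup), model.predict(dropoff)

# Synthetic points around two made-up centers, for illustration only.
rng = np.random.default_rng(0)
pickup = rng.normal([40.75, -73.98], 0.01, size=(200, 2))
dropoff = rng.normal([40.65, -73.78], 0.01, size=(200, 2))

model, pickup_cluster, dropoff_cluster = fit_geo_clusters(pickup, dropoff, n_clusters=2)
```

Fitting a single model on pickups and drop-offs together means both ends of a trip share the same cluster ids, so the labels can be used as categorical features or as keys for per-cluster statistics.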
I found some oddly long trips, e.g. a day-long trip with a mean speed < 1 km/h.
I removed these outliers.
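One way to filter such trips is a threshold on the mean speed. This is a sketch assuming pandas; the column names `distance_km` and `trip_duration` (in seconds) and the 1 km/h cutoff as the removal rule are assumptions for illustration.

```python
import pandas as pd

def drop_slow_trips(df, min_speed_kmh=1.0):
    """Keep only trips whose mean speed is at least min_speed_kmh."""
    # Mean speed in km/h: distance divided by duration in hours.
    speed_kmh = df["distance_km"] / (df["trip_duration"] / 3600.0)
    return df[speed_kmh >= min_speed_kmh].reset_index(drop=True)

# Toy data: the second trip lasts a full day for only 5 km.
trips = pd.DataFrame({
    "distance_km": [2.0, 5.0, 10.0],
    "trip_duration": [600, 86400, 1800],  # seconds
})
cleaned = drop_slow_trips(trips)
```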
I also added features derived from the available data: Haversine distance, Manhattan distance, cluster means, and PCA to rotate the coordinates.
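The distance and rotation features can be sketched as follows, assuming NumPy and scikit-learn. The function names are hypothetical, and the "Manhattan" distance here is one common approximation (the sum of the north-south and east-west Haversine legs), which may differ from the exact definition used in the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def manhattan_km(lat1, lon1, lat2, lon2):
    """Approximate grid distance: latitude leg plus longitude leg."""
    lat_leg = haversine_km(lat1, lon1, lat2, lon1)
    lon_leg = haversine_km(lat1, lon1, lat1, lon2)
    return lat_leg + lon_leg

def pca_rotate(coords):
    """Center the (n, 2) coordinates and rotate them onto their principal axes."""
    return PCA(n_components=2).fit_transform(coords)

# Example call on two illustrative points (degrees).
d = haversine_km(40.7580, -73.9855, 40.6413, -73.7781)
```

The PCA step does not reduce dimensionality here: keeping both components only recenters and rotates the coordinate system, which can help tree-based models split along directions that follow the street grid.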
I compared Random Forest and XGBoost.
Current Root Mean Squared Logarithmic Error (RMSLE): 0.391
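For reference, the metric can be computed as below. This is a minimal sketch of the standard RMSLE definition using NumPy; the function name `rmsle` is my own.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error between two non-negative arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which keeps zero durations well defined.
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Identical predictions give an error of zero.
perfect = rmsle([100, 200], [100, 200])
```

Because the error is taken on log-transformed values, a common trick is to train the model on `log1p(trip_duration)` with a plain RMSE objective and apply `expm1` to the predictions.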