3.1 Business Understanding
This project followed the Cross Industry Standard Process for Data Mining (CRISP-DM) framework (Shearer, 2000), a general framework that emphasizes the iterative nature of data mining problems. Figure 7 shows the framework; its application to this project is detailed in the following sections. The Evaluation steps are discussed in parts throughout this section and in the Results section. Deployment was not part of the scope of this project, but several recommendations for deployment are given in Section 6, Insights and Management Recommendations.
Figure 7: Cross Industry Standard Process for Data Mining CRISP-DM (Shearer, 2000)
The Business Understanding phase consisted of reading academic documents, industry reviews, and annual reports shared by organizations such as the Intergovernmental Panel on Climate Change (IPCC) and the American Transportation Research Institute (ATRI) to obtain a high-level overview of the industry.
Coca-Cola FEMSA also shared documentation regarding their telematics approach and objectives, which helped focus the project and validate the company's approach to using telematics data by comparing it with the insights gathered from the literature and industry reports.
A series of weekly interviews and discussions were held with Coca-Cola FEMSA's secondary distribution stakeholders. In these discussions, insights from Power BI visualizations were shared, and feedback and interpretation were received from Coca-Cola FEMSA's experts. During these sessions we interviewed the Director of Distribution, the Telematics Managers, and the Digital Analytics teams.
To gain a better understanding of the day-to-day work of Coca-Cola FEMSA's truck drivers, a field visit was arranged. By accompanying truck drivers during their daily beverage delivery routes, interesting insights beyond the telematics data emerged.
3.2 Data Understanding
Coca-Cola FEMSA's telematics system collects data from over 3,000 trucks to generate more than 40 different tables that are updated daily, or in some cases every minute. Each table has different parameters at different aggregation levels, so a clean data set is fundamental for the project. A data dictionary was created to better understand each of the parameters shown in the reports, as well as the aggregation level of each report. Weekly calls with the telematics team helped to clarify questions from the team.
With the data dictionary ready, different visualizations in Microsoft Power BI helped us get an initial feel for the data. These visualizations were iteratively validated and discussed with Coca-Cola FEMSA's stakeholders to clarify the expected ranges for important parameters and the expected relations between them.
3.3 Data Preparation
For outlier treatment we followed two approaches, depending on the attribute. For most attributes, we trimmed the values to what the users found to be realistic minimums or maximums; as an example, a truck cannot operate for more than 24 hours in one day. For attributes for which there was no knowledge of the limits, a conservative approach was followed, trimming only the values that went beyond 3 times the interquartile range.
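As a rough illustration of the interquartile-range rule (not the project's actual code, and with a hypothetical column name), the conservative trimming could be applied along these lines:

```python
# Conservative outlier trimming: clip values lying beyond 3 times the IQR.
import pandas as pd

def trim_iqr(series: pd.Series, k: float = 3.0) -> pd.Series:
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Example usage on a hypothetical idling-time column:
# df["IdlingTime"] = trim_iqr(df["IdlingTime"])
```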
Trucks that did not report fuel usage or full telematics data were removed. This excluded a large share of the company's fleet but still left us with 360 trucks and data from 325 days of delivery.
One-hot encoding was used to transform categorical variables (e.g., truck type) into binary indicator features.
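A minimal sketch of this encoding step, assuming a hypothetical truck-type column and sample values, could look as follows:

```python
import pandas as pd

# Hypothetical sample with a categorical truck type column.
df = pd.DataFrame({"TruckType": ["Rabon", "Torton", "Rabon"], "FuelEff": [2.8, 2.5, 3.0]})

# One-hot encoding turns the categorical column into binary indicator columns.
df = pd.get_dummies(df, columns=["TruckType"], prefix="TruckType")
print(df)
```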
Several features had to be engineered to be used in the system, mainly features expressing the time an activity took as a ratio of the total operating time, for example, the proportion of the operating time a truck spent idling.
Feature scaling was used to normalize the range of the independent variables. The method used was min-max normalization, which maps the minimum value of an attribute to 0 and the maximum value to 1, with all other values scaled linearly within that 0 to 1 range.
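The ratio features and the min-max scaling could be sketched as below; column names and values are illustrative only, not the project's data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical daily truck-level records.
df = pd.DataFrame({
    "IdlingTime": [1.5, 0.8, 2.2],      # hours idling
    "OperatingTime": [8.0, 7.5, 9.0],   # total operating hours
    "AvgSpeed": [34.0, 41.5, 29.0],     # km/h
})

# Ratio feature: share of operating time spent idling.
df["IdlingShare"] = df["IdlingTime"] / df["OperatingTime"]

# Min-max normalization: minimum maps to 0, maximum to 1.
cols = ["IdlingShare", "AvgSpeed"]
df[cols] = MinMaxScaler().fit_transform(df[cols])
print(df)
```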
3.4 Modeling
Our approach to modeling involved three different machine learning models. The first was a regression model to understand how different driving behaviors impact fuel efficiency; the second was a machine learning model to understand how different driving behaviors impact safety; and the third was a clustering analysis of driving behaviors to create driving style clusters and analyze how these styles affect safety and fuel efficiency simultaneously.
3.4.1 Fuel Efficiency
Fuel efficiency is measured in kilometers per liter, where a higher fuel efficiency is better for the economics of the company and produces less carbon dioxide per kilometer driven. To quantify the monetary impact of any change, the average price per liter of diesel in Mexican pesos is used. This average was obtained from a dataset of all refuels in 2020 and was 18.16 MXN per liter. To obtain the average carbon dioxide produced per liter of diesel, we used a constant of 2.68 kilograms of carbon dioxide per liter.
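Under these two constants, a fuel saving translates into cost and emissions as in the following sketch; the savings figure is purely hypothetical:

```python
# Illustrative conversion from liters of diesel saved to cost and CO2.
PRICE_MXN_PER_LITER = 18.16   # average 2020 refuel price
CO2_KG_PER_LITER = 2.68       # kg of CO2 per liter of diesel

liters_saved = 10_000         # hypothetical annual savings
cost_saved_mxn = liters_saved * PRICE_MXN_PER_LITER
co2_avoided_tons = liters_saved * CO2_KG_PER_LITER / 1000

print(f"Cost saved: {cost_saved_mxn:,.0f} MXN, CO2 avoided: {co2_avoided_tons:.1f} t")
```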
Different regression algorithms were tested to explain the main factors impacting fuel consumption. Some of the models considered were multiple linear regression (as well as polynomial regression), support vector regression, and simple decision trees. Ensemble methods were also tested, such as boosting (e.g., AdaBoost, XGBoost, and LightGBM), bagging (e.g., random forests), and stacking of various ensemble and simple methods. Although the bagging and boosting models explained the variance in the observations better (as a reference, the adjusted R2 for the AdaBoost regressor was 0.72 while for the linear regression it was 0.67), we decided to use multiple linear regression because it is fully explainable. These types of models allow us to explain how the model interprets the inputs to produce outputs, as opposed to a black-box model that only produces outputs without explanation. To make sure the results of the linear regression are reproducible, the four main assumptions behind linear regression were tested using residual plots. Here is a list of the four main assumptions:
1. Independence of observations
2. Linearity of Response
3. Normality of Residuals
4. Homogeneity of Variance (Homoscedasticity)
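These diagnostics can be produced along the following lines; this is a sketch with synthetic data, not the project's actual code:

```python
# Residual diagnostics for an OLS fit: residuals-vs-fitted and a Q-Q plot.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.3, size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()
resid, fitted = model.resid, model.fittedvalues

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fitted, resid, s=10)               # linearity / equal-variance check
axes[0].axhline(0, color="red")
axes[0].set(xlabel="Fitted values", ylabel="Residuals")
sm.qqplot(resid, line="45", fit=True, ax=axes[1])  # normality of residuals
plt.tight_layout()
plt.show()
```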
Feature selection was carried out by understanding the meaning behind each of the attributes and through two statistical checks.
A Pearson correlation map was created, and for any pair of attributes with a Pearson correlation coefficient greater than 0.6, at least one of the attributes was discarded, since a high correlation with another feature increases the effect of multicollinearity. A correlation analysis, however, only checks for a correlation problem between two attributes at a time.
The Variance Inflation Factor (VIF) was obtained for each of the attributes, and we discarded attributes with a factor greater than 5, which would indicate highly correlated attributes. A Pearson correlation map helps with identifying pairs of attributes that are correlated, while the VIF approach helps to identify multicollinearity among the interactions between the variables, not only between pairs.
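The two checks could be implemented roughly as follows; the data here is synthetic and stands in for the telematics features:

```python
# Pearson correlation filter at 0.6 and Variance Inflation Factor cutoff at 5.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f1", "f2", "f3", "f4"])
X["f4"] = X["f1"] * 0.9 + rng.normal(scale=0.1, size=300)  # deliberately correlated

# Pairwise Pearson correlations above 0.6 flag candidates for removal.
corr = X.corr().abs()
high_corr = (corr > 0.6) & (corr < 1.0)
print(high_corr)

# VIF > 5 flags attributes involved in broader multicollinearity.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```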
Once the regression was validated, we developed a Microsoft Power BI-based simulation tool that allows users to simulate what the fuel efficiency gains would be if any of the independent variables were modified. Afterwards, we used the results of our regression and validated them against samples of data. The samples came from drivers for whom we had detected abrupt changes in fuel efficiency. To detect these abrupt changes for each driver, we calculated a rolling 7-day average of fuel efficiency to smooth out the daily noise and only kept those drivers for whom we observed an abrupt change in that average.
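A sketch of the rolling 7-day smoothing is shown below; the column names, sample values, and the change threshold are illustrative, not those used in the project:

```python
# 7-day rolling average of fuel efficiency per driver, used to spot abrupt shifts.
import pandas as pd

daily = pd.DataFrame({
    "driver_id": ["D1"] * 10,
    "date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "fuel_eff_kml": [2.7, 2.8, 2.6, 2.7, 2.9, 2.8, 2.2, 2.1, 2.0, 2.1],
})

daily["rolling_7d"] = (
    daily.sort_values("date")
         .groupby("driver_id")["fuel_eff_kml"]
         .transform(lambda s: s.rolling(7, min_periods=7).mean())
)

# Flag drivers whose smoothed efficiency shifts by more than an assumed cutoff.
shift = daily.groupby("driver_id")["rolling_7d"].agg(lambda s: s.max() - s.min())
print(shift[shift > 0.3])   # 0.3 km/l is a hypothetical cutoff
```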
3.4.2 Safety
Safety is not a straightforward concept to measure. Some of the proxies for measuring safety include the number of accidents or proprietary Safety Scores given by telematics data providers. Our initial approach to understanding which driving behaviors affect safety was to use accidents as the dependent variable in a regression model.
Predicting crash rates is difficult because accidents are stochastic events that do not always follow the same pattern; they depend on directly controllable factors, like driving style, as well as on external factors, like the weather, surrounding traffic, and highly uncertain events such as people crossing streets or other unexpected events. An econometric model was used to analyze the driving behavior of the drivers that had accidents. The econometric model was based on logistic regressions to predict the probability of an accident occurring. Econometric models allow the model to incorporate past events (by using lag features) that may have led to an accident, such as a driver engaging in unsafe practices for several days in a row. Another option is to use several proxies for safety: events such as reducing the velocity of the vehicle too quickly or making hard turns at considerable speed. Events like these can be used as proxies for safety, as they are considered unsafe behaviors.
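A minimal sketch of a logistic regression with lag features is given below; the frame is synthetic and the feature names (e.g., harsh_braking) are assumptions for illustration:

```python
# Logistic regression for accident probability using lagged unsafe-behavior counts.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "driver_id": np.repeat(["D1", "D2"], 100),
    "harsh_braking": rng.poisson(2, 200),
    "accident": rng.binomial(1, 0.05, 200),
})

# Lag features: unsafe behavior on previous days as predictors of today's risk.
for lag in (1, 2, 3):
    df[f"harsh_braking_lag{lag}"] = df.groupby("driver_id")["harsh_braking"].shift(lag)
df = df.dropna()

X = df[[c for c in df.columns if c.startswith("harsh_braking")]]
model = LogisticRegression(max_iter=1000).fit(X, df["accident"])
print(model.coef_)
```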
Our econometric model to predict accidents did not produce statistically significant results, as seen in an adjusted R2 of less than 0.2 and attributes with p-values greater than 0.05. Therefore, this part of the process was not integrated into the results. Our hypothesis for how to make this model work would be to restructure how the data is collected and to analyze more data related to the status of the driver.
Given that our first proxy for safety failed to work, we decided to use the Safety Score from Coca-Cola FEMSA's telematics provider. This score uses various events to calculate a proxy for the probability of having an accident. The calculations used are intellectual property of the supplier, but they are based on a micro-modeling approach similar to the method of Toledo et al. (2008) mentioned in the literature review (i.e., real-time analytics of telematics data). For example, a sudden longitudinal and lateral acceleration change measured by the telematics device may indicate an abrupt turn. This way, the previously mentioned independent variables were generated (e.g., abrupt lane changes, abrupt turns, and acceleration or braking events while turning).
To understand the relative importance of our independent variables with regard to the Safety Score, we discretized the Safety Score variable into 6 equal-frequency categories. The reason for discretizing the variable instead of treating it as a continuous numerical feature is that the Safety Score is bounded by an upper limit of 100, so a linear regression model would produce results with heteroscedasticity problems. Therefore, we ran a classification model using 20 independent variables to predict one of the 6 Safety Score classes. Figure 15 shows the ranges and number of observations for each of the 6 classes.
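The equal-frequency binning can be sketched as follows; the scores here are random stand-ins for the provider's actual values:

```python
# Equal-frequency discretization of the Safety Score into 6 classes.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
safety_score = pd.Series(rng.uniform(40, 100, size=1000), name="SafetyScore")

# qcut puts roughly the same number of observations in each of the 6 bins.
safety_class = pd.qcut(safety_score, q=6, labels=False)
print(safety_class.value_counts().sort_index())
```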
We tested various machine learning models to decide which would best fit the data. Among the options we tried were Random Forest, AdaBoost, Naïve Bayes, and Support Vector Machines (SVM). Table 1 shows the comparison between the different machine learning models. The algorithm that produced the best results in terms of Area Under the Curve (AUC) was the Random Forest. The AUC is a common metric to evaluate classification results across decision thresholds; a common way to interpret it is as the probability that the model ranks a randomly chosen positive example above a randomly chosen negative one.
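A comparison of classifiers by multiclass AUC could look like the sketch below; synthetic data stands in for the 20 driving-behavior features and the 6 Safety Score classes:

```python
# Multiclass (one-vs-rest) AUC for a Random Forest classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# One-vs-rest AUC averaged over the 6 classes.
print(roc_auc_score(y_te, proba, multi_class="ovr"))
```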
3.4.3 Driving Styles Clustering
The third model was based on unsupervised learning. The intention of using clustering was to group the different driving behavior characteristics, gathered at the daily and truck level, to identify clusters of driving styles.
The clustering approach we decided to use was a Bayesian Gaussian Mixture Model. The reason for using a probabilistic Gaussian Mixture Model is that it allowed us to better understand the properties of the input examples. Many clustering algorithms, like K-Means, simply give a cluster representative that shows nothing about how the points are spread. The Gaussian properties of this approach give us not only the mean of each cluster but also its variance, which can be used to estimate the likelihood that a point belongs to a certain cluster. The reason for choosing a Bayesian Gaussian Mixture Model instead of the traditional Gaussian Mixture Model was to take a probabilistic approach to choosing the number of clusters. With a traditional Gaussian Mixture Model, the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) must be used to select an optimal number of clusters, whereas the Bayesian version treats the cluster parameters as latent random variables rather than fixed model parameters. In other words, with this algorithm you can set an initial maximum number of clusters and the algorithm will decide the optimal number, rewarding models that fit the data well while minimizing a theoretical information criterion. The possible number of clusters ranges between 1 and the maximum that was set. For our problem, we chose a maximum of 10 clusters, as this would allow us to separate the driving styles into business-relatable information, but the algorithm suggested 5 clusters as the optimal number.
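A minimal sketch of this clustering step, with random features standing in for the scaled driving behaviors, could look as follows:

```python
# Bayesian Gaussian Mixture with an upper bound of 10 components.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 6))   # hypothetical scaled behavior features

bgm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

labels = bgm.predict(X)
# Components with non-negligible weight indicate the effective number of clusters.
print(np.round(bgm.weights_, 3))
```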
The input variables included independent variables related to fuel efficiency, for example idling times and excessive acceleration events, as well as independent variables related to safety, for example abrupt lane changes, abrupt turns, and acceleration or braking events while turning.
Each cluster was generated based on the independent variables that represent driving behavior, to explain which patterns each driver follows. Then we used the created clusters to see how they compare in terms of fuel efficiency and Safety Score, with the purpose of explaining the tradeoffs between the different types of driving styles.
To relate the clusters to the business's daily practices, a persona was defined for each cluster. A persona is a term borrowed from the marketing industry that describes a fictional character. Our intention in using these personas was to create fictitious but relatable characters so that the driving style of any driver could be identified and easily recognized. Our Gaussian Mixture Model approach also allows a driver's style to be, probabilistically speaking, part of more than one persona.
3.5 Conclusions
To answer our research question of which driving styles can help fleet owners increase safety and fuel efficiency, we used multiple machine learning and analytics techniques. We used data from over 3,000 trucks to build a fuel efficiency regression model; after an unsuccessful attempt at predicting crash rates with an econometric regression, we used a proprietary Safety Score from the telematics provider as a proxy for safety; and we developed a clustering analysis to drive the business recommendations and actionable insights. Given the amount of data we were dealing with, we also followed the CRISP-DM methodology to guide us through the iterative process of data mining. In the next section we describe the results of each model.
4 RESULTS
The fuel efficiency regression model (with the dependent variable measured in kilometers per liter of diesel) contained 13 independent variables plus the bias term. The linear regression model had an R2 of 0.67, a MAPE of 8.7%, an MAE of 0.23, and an RMSE of 0.28. For context, the mean fuel efficiency was 2.72 kilometers per liter with a standard deviation of 0.52. Figure 8 shows a histogram of the distribution of fuel efficiency.
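For reference, the reported error metrics relate to the predictions as in the following sketch; the numbers here are toy values, not the model's actual outputs:

```python
# MAE, MAPE, and RMSE for a small set of hypothetical predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error

y_true = np.array([2.7, 2.5, 3.0, 2.2])   # observed km per liter
y_pred = np.array([2.6, 2.7, 2.8, 2.4])   # hypothetical model predictions

mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(mae, mape, rmse)
```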
Figure 8: Histogram of Fuel Efficiency
As expected, not all variables have the same impact in estimating fuel efficiency. Figure 9 shows the relative standardized effects of each variable. The top three include AvgSpeed.
Figure 10 and Figure 11 show the residuals plots and the prediction error for the linear regression model. These plots were used to test the four key assumptions mentioned in the Methodology.
Note: the two plots show that the residuals are normally distributed (Gaussian bell at the top right), which visually confirms the normality of the residuals. The points also appear to be evenly distributed around the center, which shows that the residuals have linearity of response and homogeneity of variance.
Figure 12 shows a matrix of all features and how each correlates with the others. A value close to 1 indicates a strong positive correlation between two variables, a value close to 0 indicates no correlation, and a value close to negative 1 indicates a strong negative correlation. This plot served as the initial analysis for feature selection to avoid multicollinearity issues.
Table 3 shows the summarized results of the linear regression with the names of the features, the coefficients, the standard errors of the coefficients, the t-values, the p-values, and the VIF. The VIF values were used to check for multicollinearity. The regression results were also used in the Power BI simulation tool to perform scenario analysis with the different variables that impact fuel efficiency. Figure 13 displays an example of a possible scenario. The objective of this simulation tool is to allow the company to estimate the potential gain of experimenting with changes to any of the independent variables and to see how this would impact fuel efficiency.
Figure 13: Scenario Analysis
Note: as an example, the figure shows the results of a reduction from 15% to 10% in the time spent accelerating, which yields a gain in fuel efficiency, a decrease of 253 tons of CO2, and a reduction in costs of 1.72 million MXN.