
Throughout the project, we followed the Cross Industry Standard Process for Data Mining (CRISP-DM) framework (Shearer, 2000). This is a general framework that emphasizes the iterative nature of data mining problems. Figure 7 shows the general framework; its application to this project is detailed in the following sections. The Evaluation steps are discussed in parts throughout this section and in the Results section. Deployment was not part of the scope of this project, but several recommendations for deployment are given in section 6, Insights and Management Recommendations.

Figure 7: Cross Industry Standard Process for Data Mining CRISP-DM (Shearer, 2000)

3.1 Business understanding


Our initial research helped us understand Coca-Cola FEMSA's business needs. This research consisted of reading academic documents, industry reviews, and annual reports shared by organizations such as the Intergovernmental Panel on Climate Change (IPCC) and the American Transportation Research Institute (ATRI) for a high-level overview.

Coca-Cola FEMSA also shared documentation regarding their telematics approach and objectives, which helped to focus the project and to validate the company's approach to using telematics data against this high-level overview. The company's priorities are safety and fuel efficiency (which equally affect cost savings and CO2 emissions).

A series of weekly interviews and discussions were held with Coca-Cola FEMSA's secondary distribution stakeholders. In these discussions, we shared insights from visualizations in Power BI and received feedback and interpretation from Coca-Cola FEMSA's experts. During these sessions we interviewed the Director of Distribution, the Telematics Managers, and the Digital Analytics teams.

To gain a better understanding of the day-to-day work of Coca-Cola FEMSA's truck drivers, a field visit was arranged. By accompanying truck drivers during their daily routes to deliver beverages, interesting insights beyond the telematics data emerged. An important insight from the field trip is that some factors affecting the drivers are not captured in the telematics data, for example, traffic conditions.

3.2 Data understanding


The amount of available data in this project was an important challenge. A telematics supplier integrates with over 3,000 trucks to generate over 40 different tables, some updated daily and others every minute, each with different parameters and aggregation levels; a clean data set is therefore fundamental for the project. A data dictionary was created to better understand each of the parameters shown in the reports, as well as the aggregation level of each report. Weekly calls with the telematics team helped to clarify the team's questions.

With the data dictionary ready, different visualizations in Microsoft Power BI helped us to get an initial feel for the data. These visualizations were iteratively validated and discussed with Coca-Cola FEMSA's stakeholders to clarify the expected ranges for important parameters and the expected relations between them.

3.3 Data preparation


The following criteria were followed to clean the data:

- For outlier treatment we followed two approaches depending on the attribute. For most attributes, we trimmed the values to what the users found to be realistic minimums or maximums. For example, there is no way a driver could have driven for more than 24 hours in one day. For attributes whose limits were not known, a conservative approach was followed, trimming only the values that went beyond 3 times the interquartile range.

- Trucks that do not report fuel usage or full telematics data were removed. This removed a large part of the trucks that the company uses but still left us with 360 trucks with data from 325 days of delivery.

Data was also further processed in the following manner:

- One-hot encoding was used to transform categorical variables (e.g., truck type and model) into dummy variables to be used in the regression models.

- Several features had to be engineered to be used in the system, mainly features that are the ratio of the time an activity took with respect to the total operating time, for example, the total time a truck spent accelerating with respect to the total time of operation.

- Feature scaling was used to normalize the range of the independent variables. The method used was min-max normalization, which rescales each attribute so that its minimum value becomes 0, its maximum value becomes 1, and all other values are adjusted onto that 0 to 1 scale.

A minimal code sketch of these preprocessing steps is shown below.
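The following sketch illustrates the cleaning and preprocessing steps in Python with pandas and scikit-learn. The file and column names (telematics_daily.csv, driving_hours, fuel_liters, truck_type, accel_time, operating_time, etc.) are hypothetical placeholders, not the actual telematics field names.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# df is assumed to be the daily, per-truck telematics table (hypothetical file name).
df = pd.read_csv("telematics_daily.csv")

# 1) Outlier treatment.
# Known physical limit: a driver cannot drive more than 24 hours in a day.
df = df[df["driving_hours"] <= 24]

# Unknown limits: conservative trim at 3 times the interquartile range.
q1, q3 = df["fuel_liters"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["fuel_liters"].between(q1 - 3 * iqr, q3 + 3 * iqr)]

# 2) One-hot encoding of categorical variables (e.g., truck type and model).
df = pd.get_dummies(df, columns=["truck_type", "truck_model"])

# 3) Ratio features: share of total operating time spent on an activity.
df["accel_time_pct"] = df["accel_time"] / df["operating_time"]

# 4) Min-max scaling: each feature's minimum maps to 0 and its maximum to 1.
numeric_cols = ["avg_speed", "max_rpm", "accel_time_pct"]
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
```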

3.4 Modeling
Our approach for modeling involved three different machine learning models. The first was a regression model to understand how different driving behaviors impact fuel efficiency; the second was a machine learning model to understand how different driving behaviors impact safety; and the third was a clustering analysis of driving behaviors to create clusters of driving styles and analyze how these styles affect safety and fuel efficiency simultaneously.

3.4.1 Fuel Efficiency


For our first regression model, the dependent variable is fuel efficiency measured in kilometers per liter, where higher fuel efficiency is better for the economy of the company and produces less carbon dioxide per kilometer driven. To quantify the monetary impact of any change, the average diesel price per liter in Mexican pesos is used; it was obtained from a dataset of all refuels in 2020 and amounts to 18.16 MXN per liter. For the average carbon dioxide produced per liter of diesel we used a constant of 2.68 kilograms of carbon dioxide per liter.
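As an illustration of how these constants translate an efficiency change into money and emissions, the following is a minimal sketch. Only the 18.16 MXN per liter and 2.68 kg CO2 per liter constants come from the text; the distance and efficiency figures in the example are hypothetical.

```python
PRICE_PER_LITER_MXN = 18.16  # average diesel price from the 2020 refuel dataset
CO2_KG_PER_LITER = 2.68      # kg of CO2 produced per liter of diesel

def impact_of_efficiency_gain(km_driven, kmpl_before, kmpl_after):
    """Return the cost savings (MXN) and CO2 reduction (kg) of an efficiency gain."""
    liters_saved = km_driven / kmpl_before - km_driven / kmpl_after
    return liters_saved * PRICE_PER_LITER_MXN, liters_saved * CO2_KG_PER_LITER

# Hypothetical example: 100,000 km driven, efficiency improves from 2.72 to 2.80 km/l.
savings_mxn, co2_kg = impact_of_efficiency_gain(100_000, 2.72, 2.80)
```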

Different regression algorithms were tested to explain the main forces impacting fuel consumption. Some of the models considered were multiple linear regression (i.e., polynomial regression), support vector regression, and simple decision trees. Ensemble methods were also tested, including boosting (e.g., AdaBoost, XGBoost, and LightGBM), bagging (e.g., random forests), and stacking of various ensemble and simple methods. Although the bagging and boosting models explained the variance in the observations better (as a reference, the adjusted R2 for the AdaBoost regressor was 0.72 while for the linear regression it was 0.67), we decided to use multiple linear regression because it is fully explainable. This type of model allows us to explain how the inputs are interpreted to produce the outputs, as opposed to a black-box model that only produces outputs without an explanation. To make sure the results of the linear regression are reproducible, the four main assumptions behind linear regression were tested using residual plots:

1. Independence of observations

2. Linearity of Response

3. Normality of Residuals

4. Homogeneity of Variance (i.e., homoscedasticity)
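The following is a minimal sketch of how the model comparison and a basic residual check can be reproduced with scikit-learn. The names `X` and `y` stand for the prepared feature matrix and the fuel-efficiency target and are assumptions; the report's actual plots may have been produced with other tooling.

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# X, y: prepared feature matrix and fuel-efficiency target (km per liter).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def adjusted_r2(model, X, y):
    """Adjusted R2 penalizes plain R2 for the number of predictors used."""
    n, p = X.shape
    r2 = model.score(X, y)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

linear = LinearRegression().fit(X_train, y_train)
boosted = AdaBoostRegressor(random_state=0).fit(X_train, y_train)
print(adjusted_r2(linear, X_test, y_test), adjusted_r2(boosted, X_test, y_test))

# Residual plot used to eyeball linearity, normality, and homoscedasticity.
residuals = y_test - linear.predict(X_test)
plt.scatter(linear.predict(X_test), residuals, s=8)
plt.axhline(0, color="grey")
plt.xlabel("Predicted km per liter")
plt.ylabel("Residual")
plt.show()
```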

Multicollinearity issues were addressed in three different ways:

- Feature selection was carried out by understanding the meaning behind each of the attributes in order to discard metrics that were proxies of each other.

- A Pearson correlation map was created to discard attributes with a correlation coefficient greater than 0.6 with at least one other attribute, since a high correlation with another feature increases the effect of multicollinearity. A pairwise correlation analysis, however, only checks for a correlation problem between two attributes at a time.

- The Variance Inflation Factor (VIF) was obtained for each of the attributes, and we discarded attributes with a factor greater than 5, which indicates highly correlated attributes. A Pearson correlation map helps with identifying pairs of correlated attributes, while the VIF approach helps to identify multicollinearity arising from interactions among several variables, not just between two of them. A minimal sketch of both screening steps is shown after this list.
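The following sketch shows one way to run both screenings, assuming statsmodels is available; `X` is the DataFrame of candidate independent variables and is an assumed name.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X: DataFrame of candidate independent variables (one column per feature).

# Pairwise screening: flag any feature whose absolute Pearson correlation
# with another feature exceeds 0.6.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
too_correlated = [col for col in upper.columns if (upper[col] > 0.6).any()]

# Multicollinearity screening: drop features with VIF greater than 5, which
# captures interactions among several variables, not just pairs.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
high_vif = vif[vif > 5].index.tolist()
```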

Once the regression was validated, we developed a Microsoft Power BI-based simulation tool that allows users to simulate the fuel efficiency gains obtained if any of the independent variables are modified. Afterwards, we took the results of our regression and validated them against samples of data. The samples came from drivers for whom we had detected abrupt changes in fuel efficiency. To detect abrupt changes in fuel efficiency behavior for each driver, we calculated a rolling 7-day average of fuel efficiency to smooth out the daily noise and only kept those drivers for whom we saw a change that remained constant; an example is shown in Figure 13.
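A minimal sketch of the smoothing step is shown below, assuming a daily per-driver fuel-efficiency table; the column names (`driver_id`, `date`, `kmpl`) and the change threshold are hypothetical placeholders.

```python
import pandas as pd

# df: one row per driver per day with columns "driver_id", "date", "kmpl".
df = df.sort_values(["driver_id", "date"])

# 7-day rolling average of fuel efficiency to smooth out daily noise.
df["kmpl_smooth"] = (
    df.groupby("driver_id")["kmpl"]
      .transform(lambda s: s.rolling(window=7, min_periods=7).mean())
)

# Flag drivers whose smoothed efficiency shifts by a sustained amount
# (threshold chosen for illustration only).
shift = df.groupby("driver_id")["kmpl_smooth"].agg(lambda s: s.max() - s.min())
candidates = shift[shift > 0.5].index
```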

3.4.2 Safety
Safety is not a straightforward concept to measure. Some of the proxies for measuring safety include the number of accidents or proprietary Safety Scores given by telematics data providers. Our initial approach to understanding which driving behaviors affect safety was to use accidents as the dependent variable in a regression model.

Predicting crash rates is difficult given that accidents are stochastic events that do not always follow the same pattern and depend on a wide variety of factors: directly controllable factors like driving style, external factors like the weather and surrounding traffic, and highly uncertain events like people crossing streets. An econometric model was used to analyze the driving behavior of the drivers that had accidents. The model was based on logistic regressions to predict the probability of an accident occurring. Econometric models make it possible to incorporate past events (by using lag features) that have led to an accident, such as a driver engaging in unsafe practices for several days in a row. Another option is to use proxies for safety: events such as reducing the speed of the vehicle too quickly or making hard turns at considerable speed. Events like these can be used as proxies for safety as they are considered unsafe behaviors.
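A minimal sketch of the logistic-regression setup with lag features is shown below, under the assumption of a daily per-driver table; the column names (`driver_id`, `date`, `accident`, `harsh_braking_events`) are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# df: one row per driver per day, with an "accident" indicator (0/1) and a
# daily count of unsafe events such as harsh braking.
df = df.sort_values(["driver_id", "date"])

# Lag features: unsafe behavior on previous days may precede an accident.
for lag in (1, 2, 3):
    df[f"harsh_braking_lag{lag}"] = (
        df.groupby("driver_id")["harsh_braking_events"].shift(lag)
    )
df = df.dropna()

features = [c for c in df.columns if c.startswith("harsh_braking_lag")]
model = LogisticRegression(max_iter=1000).fit(df[features], df["accident"])
accident_probability = model.predict_proba(df[features])[:, 1]
```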

Our econometric model to predict accidents did not produce statistically significant results, as seen in an adjusted R2 below 0.2 and attributes with p-values greater than 0.05. Therefore, this part of the process was not integrated into the results. Our hypothesis for making this model work would be to restructure how the data is collected and to analyze more data related to the status of the driver, as suggested by Houston (2003) and Harris (2014).

Given that our first proxy for safety failed to work, we decided to use the Safety Score provided by Coca-Cola FEMSA's telematics provider. This score uses various events to calculate a proxy for the probability of having an accident. The calculations used are intellectual property of the supplier, but they are based on a micro-modeling approach similar to the method of Toledo et al. (2008) mentioned in the literature review (i.e., real-time analytics of telematics data). For example, a sudden longitudinal and lateral acceleration change measured by the telematics sensors may indicate an abrupt turn. In this way, the previously mentioned independent variables were generated (e.g., abrupt lane changes, abrupt turns, acceleration or braking events while turning, etc.).

To understand the relative importance of our independent variables with respect to the Safety Score, we discretized the Safety Score variable into 6 equal-frequency categories. The reason for discretizing the variable instead of treating it as a continuous numerical feature is that the Safety Score is bounded by an upper limit of 100, so a linear regression model would produce results with heteroscedasticity problems. Therefore, we ran a classification model using 20 independent variables to predict one of the 6 Safety Score classes. Figure 15 shows the ranges and number of observations for each of the 6 classes. The independent variables used are listed in Table 2.
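A minimal sketch of the discretization and classification step is shown below; `X` is assumed to hold the 20 independent variables from Table 2 and `df["safety_score"]` the continuous score (both names are assumptions).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Equal-frequency binning: each of the 6 classes contains roughly the same
# number of observations.
safety_class = pd.qcut(df["safety_score"], q=6, labels=False)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X, safety_class)

# Relative importance of each independent variable for the Safety Score class.
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
```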

We tested various machine learning models to decide which one would best fit the data. Among the options we tried were Random Forest, AdaBoost, Naïve Bayes, and Support Vector Machines (SVM). Table 1 shows the comparison between the different machine learning models. The algorithm that produced the best results in terms of Area Under the Curve (AUC) was the Random Forest classifier. The AUC is a common metric for evaluating multiclass classification problems, as it provides an aggregate measure of performance across all possible classification thresholds. A common way to interpret this metric is as the probability that the model ranks a random positive example higher than a random negative example.
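Continuing the previous sketch, the one-vs-rest AUC for the six Safety Score classes can be computed as follows (a sketch under the same assumed variable names, not the exact evaluation pipeline used in the project).

```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, safety_class, random_state=0)

clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)

# One-vs-rest AUC averaged over the 6 Safety Score classes.
auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
```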

3.4.3 Cluster Analysis


Our third and final model was clustering. Clustering is a family of machine learning methods for unsupervised learning. The intention of using clustering was to group the different driving behavior characteristics gathered at the daily and truck level to identify clusters of driving styles. As input variables we used the twenty variables shown in Table 2.

The clustering approach that we decided to use was a Bayesian Gaussian Mixture Model. The reason for using a probabilistic Gaussian mixture model is that it allowed us to better understand the properties of the input examples. Many clustering algorithms, like K-Means, simply give a cluster representative that shows nothing about how the points are spread. The Gaussian properties of this approach give us not only the mean of each cluster but also its variance, which can be used to estimate the likelihood that a point belongs to a certain cluster. The reason for choosing a Bayesian Gaussian Mixture Model instead of a traditional Gaussian mixture was to take a probabilistic approach to choosing the number of clusters. With a traditional Gaussian Mixture Model, the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) must be used to select an optimal number of clusters. With a Bayesian model, the algorithm treats the cluster parameters as latent random variables rather than fixed model parameters. In other words, with this algorithm you can set an initial maximum number of clusters and the algorithm will decide the optimal number, rewarding models that fit the data well while minimizing a theoretical information criterion. The possible number of clusters ranges between 1 and the maximum that was set. For our problem, we chose a maximum of 10 clusters, as this would allow us to separate the driving styles into business-relatable information; the algorithm suggested 5 clusters as the optimal number.
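A minimal sketch of this clustering step with scikit-learn's BayesianGaussianMixture is shown below, assuming `X` holds the twenty scaled driving-behavior features from Table 2; the 0.01 weight cutoff is an illustrative choice.

```python
from sklearn.mixture import BayesianGaussianMixture

# Upper bound of 10 components; the variational inference effectively
# switches off components that are not needed.
bgmm = BayesianGaussianMixture(
    n_components=10,
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
)
labels = bgmm.fit_predict(X)

# Soft assignment: probability that each vehicle-day belongs to each cluster.
responsibilities = bgmm.predict_proba(X)

# Components with non-negligible weight indicate the effective number of clusters.
effective_clusters = (bgmm.weights_ > 0.01).sum()
```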

Table 2: Features used for Cluster Analysis

Feature ID Variable name


1 Life mileage
2 Max. engine t(°C)
3 Max. RPM
4 Top Speed
5 Operation time
6 Time in DC
7 Avg. Speed
8 % route under min. t(°C)
9 Over Revolution Time %
10 Idling time %
11 Acceleration Route time %
12 Overspeed events (%)
13 Number of stops (%)
14 Abrupt Acceleration
15 Abrupt Braking
16 Abrupt turns
17 Abrupt Lane Changes
18 Acceleration while turning
19 Braking while turning
20 OverAcceleration events

Note: All variables were measured at a vehicle-day disaggregation level.

3.4.4 Individual Cluster Analysis


Each cluster generated by our model had a weight from independent variables that impact fuel efficiency, for example idling times and excessive acceleration events, as well as independent variables related to safety, for example abrupt lane changes, abrupt turns, and acceleration or braking events while turning. Each cluster was generated based on the independent variables that represent driving behavior, to explain which patterns each driver follows. We then used the resulting clusters to see how they perform in terms of fuel efficiency and Safety Score, with the purpose of explaining the tradeoffs between the different types of driving styles.

To relate the clusters to daily business practices, a persona was defined for each cluster. A persona is a term borrowed from the marketing industry that describes a fictitious character. Our intention in using these personas was to create fictitious but relatable characters so that the driving style of any driver could be identified and easily recognized. Our Gaussian Mixture Model approach also allows a driver's behavior to be, probabilistically speaking, part of several driving styles.

3.5 Conclusions
To answer our research question of which driving styles can help fleet owners increase safety and fuel efficiency, we used multiple machine learning and analytics techniques. We used data from over 3,000 trucks to build a fuel efficiency regression model; after an unsuccessful attempt at predicting crash rates with an econometric regression, we used a proprietary Safety Score from the telematics provider as a proxy for safety; and we developed a clustering analysis to drive the business recommendations and actionable insights. Given the amount of data we were dealing with, we followed the CRISP-DM methodology to guide us through the iterative process of data mining. The next section describes the results of each model.

4 RESULTS

4.1 Fuel Efficiency

4.1.1 Regression Model


Our polynomial regression model to understand the main drivers behind fuel efficiency (kilometers per liter of diesel) contained 13 independent variables plus the bias term. The linear regression model had an R2 of 0.67, a MAPE of 8.7%, an MAE of 0.23, and an RMSE of 0.28. As context, the mean fuel efficiency was 2.72 kilometers per liter with a standard deviation of 0.52. Figure 8 shows a histogram of the distribution of fuel efficiency.

Figure 8: Histogram of Fuel Efficiency
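These error metrics can be reproduced with scikit-learn as in the minimal sketch below, assuming the train/test split and fitted linear model (`linear`, `X_test`, `y_test`) from the earlier sketches.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

y_pred = linear.predict(X_test)

r2 = r2_score(y_test, y_pred)                          # 0.67 reported
mape = mean_absolute_percentage_error(y_test, y_pred)  # returned as a fraction; 8.7% reported
mae = mean_absolute_error(y_test, y_pred)              # 0.23 km per liter reported
rmse = np.sqrt(mean_squared_error(y_test, y_pred))     # 0.28 km per liter reported
```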

As expected, not all variables have the same impact on the fuel efficiency estimate. Figure 9 shows the relative standardized effects of each variable. The top three are AvgSpeed, RatioAceleradorDuracionEventos (the percentage of time a truck spends accelerating), and RPMMaxima.

Figure 9: Pareto Chart of the Standardized Effects

Figure 10 and Figure 11 show the Residuals Plots and the Prediction Error for the Linear Regression Model.

These plots were used to test the four key assumptions that were mentioned in the Methodology.

Figure 10: Residuals for Linear Regression Model

Figure 11: Prediction Error for Linear Regression

Note: these two plots show that the residuals are normally distributed (the Gaussian bell on the top right), which visually supports the normality of the residuals. The points also appear to be evenly distributed around the center, which indicates linearity of response and homogeneity of variance.

Figure 12 shows a matrix of all features and how each correlates with the others. A value of 1 indicates a perfect positive correlation between two variables, a value of 0 indicates no correlation, and a value of -1 indicates a perfect negative correlation. This plot served as an initial analysis for feature selection to avoid multicollinearity issues.

Figure 12: Telematics Parameters Correlation Matrix

Table 3 shows the summarized results of the linear regression, with the feature names, coefficients, standard errors of the coefficients, T-values, P-values, and VIFs. The VIF values were used for feature selection to avoid multicollinearity problems.

Table 3: Fuel Efficiency Linear Regression Results

Note: Adjusted R2: 0.67

4.1.2 Fuel Efficiency Scenario Analysis


After developing the model, we built a simulator that allows Coca-Cola FEMSA to perform what-if analysis with the different variables that impact fuel efficiency. Figure 13 displays an example of a possible scenario. The objective of this simulation tool is to allow the company to estimate the potential gain of experimenting with changes to any of the independent variables and to see how this would impact both costs and CO2 emissions.
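As an illustration of the kind of what-if computation behind such a simulator, the following is a minimal sketch using the fitted linear model from the earlier sketches. The feature name `accel_time_pct` and the baseline row are hypothetical, and any scaling applied during preprocessing would have to be applied to the scenario values as well.

```python
# Baseline: one representative vehicle-day taken from the test data.
baseline = X_test.iloc[[0]].copy()
scenario = baseline.copy()

# Hypothetical change: reduce the share of operating time spent accelerating.
scenario["accel_time_pct"] = scenario["accel_time_pct"] * 0.8

kmpl_baseline = linear.predict(baseline)[0]
kmpl_scenario = linear.predict(scenario)[0]
gain_pct = (kmpl_scenario - kmpl_baseline) / kmpl_baseline * 100
```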

Figure 13: Scenario Analysis

Note: As an example, the figure shows the results of reducing the time spent accelerating from 15% to 10%, and how this is correlated with a 3% increase in fuel efficiency, a decrease of 253 tons of CO2, and a reduction in costs of 1.72 million MXN.
