3.1 Business Understanding
This project followed the Cross Industry Standard Process for Data Mining (CRISP-DM) framework (Shearer, 2000), a general framework that emphasizes the iterative nature of data mining problems. Figure 7 shows the framework; its application to this project is detailed in the following sections. The Evaluation steps are discussed in parts throughout this section and in the Results section. Deployment was not part of the scope of this project, but several recommendations for deployment are given in Section 6, Insights and Management Recommendations.
Figure 7: Cross Industry Standard Process for Data Mining CRISP-DM (Shearer, 2000)
The Business Understanding phase consisted of reading academic documents, industry reviews, and annual reports shared by organizations such as the Intergovernmental Panel on Climate Change (IPCC) and the American Transportation Research Institute (ATRI) to obtain a high-level overview of the industry.
Coca-Cola FEMSA also shared documentation regarding their telematics approach and objectives, which helped focus the project and validate the company's approach to using telematics data by comparing it with the insights gathered from the literature and industry reports.
A series of weekly interviews and discussions were held with Coca-Cola FEMSA's secondary distribution stakeholders. In these discussions, insights from Power BI visualizations were shared, and feedback and interpretation were received from Coca-Cola FEMSA's experts. During these sessions we interviewed the Director of Distribution, the Telematics Managers, and the Digital Analytics teams.
To gain a better understanding of the day-to-day work of Coca-Cola FEMSA's truck drivers, a field visit was arranged. By accompanying truck drivers during their daily beverage delivery routes, interesting insights beyond the telematics data emerged.
3.2 Data Understanding
Coca-Cola FEMSA's telematics system collects data from over 3,000 trucks to generate more than 40 different tables that are updated daily, or in some cases every minute. Each table has different parameters at different aggregation levels, so a clean data set is fundamental for the project. A data dictionary was created to better understand each of the parameters shown in the reports, as well as the aggregation level of each report. Weekly calls with the telematics team helped to clarify questions from the team.
With the data dictionary ready, different visualizations in Microsoft Power BI helped us get an initial feel for the data. These visualizations were iteratively validated and discussed with Coca-Cola FEMSA's stakeholders to clarify the expected ranges for important parameters and the expected relations between them.
3.3 Data Preparation
For outlier treatment we followed two approaches, depending on the attribute. For most attributes, we trimmed the values to what the users found to be realistic minimums or maximums; as an example, a truck cannot operate for more than 24 hours in one day. For attributes for which there was no knowledge of the limits, a conservative approach was followed, trimming only the values that went beyond 3 times the interquartile range.
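As a rough illustration of the interquartile-range rule (not the project's actual code, and with a hypothetical column name), the conservative trimming could be applied along these lines:

```python
# Conservative outlier trimming: clip values lying beyond 3 times the IQR.
import pandas as pd

def trim_iqr(series: pd.Series, k: float = 3.0) -> pd.Series:
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Example usage on a hypothetical idling-time column:
# df["IdlingTime"] = trim_iqr(df["IdlingTime"])
```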
Trucks that did not report fuel usage or full telematics data were removed. This excluded a large share of the company's fleet but still left us with 360 trucks and data from 325 days of delivery.
One-hot encoding was used to transform categorical variables (e.g., truck type) into binary indicator features.
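A minimal sketch of this encoding step, assuming a hypothetical truck-type column and sample values, could look as follows:

```python
import pandas as pd

# Hypothetical sample with a categorical truck type column.
df = pd.DataFrame({"TruckType": ["Rabon", "Torton", "Rabon"], "FuelEff": [2.8, 2.5, 3.0]})

# One-hot encoding turns the categorical column into binary indicator columns.
df = pd.get_dummies(df, columns=["TruckType"], prefix="TruckType")
print(df)
```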
Several features had to be engineered to be used in the system, mainly features expressing the time an activity took as a ratio of the total operating time, for example, the proportion of the operating time a truck spent idling.
Feature scaling was used to normalize the range of the independent variables. The method used was min-max normalization, which maps the minimum value of an attribute to 0 and the maximum value to 1, with all other values scaled linearly within that 0 to 1 range.
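The ratio features and the min-max scaling could be sketched as below; column names and values are illustrative only, not the project's data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical daily truck-level records.
df = pd.DataFrame({
    "IdlingTime": [1.5, 0.8, 2.2],      # hours idling
    "OperatingTime": [8.0, 7.5, 9.0],   # total operating hours
    "AvgSpeed": [34.0, 41.5, 29.0],     # km/h
})

# Ratio feature: share of operating time spent idling.
df["IdlingShare"] = df["IdlingTime"] / df["OperatingTime"]

# Min-max normalization: minimum maps to 0, maximum to 1.
cols = ["IdlingShare", "AvgSpeed"]
df[cols] = MinMaxScaler().fit_transform(df[cols])
print(df)
```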
3.4 Modeling
Our approach to modeling involved three different machine learning models. The first was a regression model to understand how different driving behaviors impact fuel efficiency; the second was a machine learning model to understand how different driving behaviors impact safety; and the third was a clustering analysis of driving behaviors to create driving style clusters and analyze how these styles affect safety and fuel efficiency simultaneously.
3.4.1 Fuel Efficiency
Fuel efficiency is measured in kilometers per liter, where a higher fuel efficiency is better for the economics of the company and produces less carbon dioxide per kilometer driven. To quantify the monetary impact of any change, the average price per liter of diesel in Mexican pesos is used. This average was obtained from a dataset of all refuels in 2020 and was 18.16 MXN per liter. To obtain the average carbon dioxide produced per liter of diesel, we used a constant of 2.68 kilograms of carbon dioxide per liter.
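Under these two constants, a fuel saving translates into cost and emissions as in the following sketch; the savings figure is purely hypothetical:

```python
# Illustrative conversion from liters of diesel saved to cost and CO2.
PRICE_MXN_PER_LITER = 18.16   # average 2020 refuel price
CO2_KG_PER_LITER = 2.68       # kg of CO2 per liter of diesel

liters_saved = 10_000         # hypothetical annual savings
cost_saved_mxn = liters_saved * PRICE_MXN_PER_LITER
co2_avoided_tons = liters_saved * CO2_KG_PER_LITER / 1000

print(f"Cost saved: {cost_saved_mxn:,.0f} MXN, CO2 avoided: {co2_avoided_tons:.1f} t")
```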
Different regression algorithms were tested to explain the main factors impacting fuel consumption. Some of the models considered were multiple linear regression (as well as polynomial regression), support vector regression, and simple decision trees. Ensemble methods were also tested, such as boosting (e.g., AdaBoost, XGBoost, and LightGBM), bagging (e.g., random forests), and stacking of various ensemble and simple methods. Although the bagging and boosting models explained the variance in the observations better (as a reference, the adjusted R2 for the AdaBoost regressor was 0.72 while for the linear regression it was 0.67), we decided to use multiple linear regression because it is fully explainable. These types of models allow us to explain how the model interprets the inputs to produce outputs, as opposed to a black-box model that only produces outputs without explanation. To make sure the results of the linear regression are reproducible, the four main assumptions behind linear regression were tested using residual plots. Here is a list of the four main assumptions:
1. Independence of observations
2. Linearity of Response
3. Normality of Residuals
4. Homogeneity of Variance (Homoscedasticity)
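These diagnostics can be produced along the following lines; this is a sketch with synthetic data, not the project's actual code:

```python
# Residual diagnostics for an OLS fit: residuals-vs-fitted and a Q-Q plot.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.3, size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()
resid, fitted = model.resid, model.fittedvalues

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fitted, resid, s=10)               # linearity / equal-variance check
axes[0].axhline(0, color="red")
axes[0].set(xlabel="Fitted values", ylabel="Residuals")
sm.qqplot(resid, line="45", fit=True, ax=axes[1])  # normality of residuals
plt.tight_layout()
plt.show()
```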
Feature selection was carried out by understanding the meaning behind each of the attributes and through two statistical checks.
A Pearson correlation map was created, and for any pair of attributes with a Pearson correlation coefficient greater than 0.6, at least one of the attributes was discarded, since a high correlation with another feature increases the effect of multicollinearity. A correlation analysis, however, only checks for a correlation problem between two attributes at a time.
The Variance Inflation Factor (VIF) was obtained for each of the attributes, and we discarded attributes with a factor greater than 5, which would indicate highly correlated attributes. A Pearson correlation map helps with identifying pairs of attributes that are correlated, while the VIF approach helps to identify multicollinearity among the interactions between the variables, not only between pairs.
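The two checks could be implemented roughly as follows; the data here is synthetic and stands in for the telematics features:

```python
# Pearson correlation filter at 0.6 and Variance Inflation Factor cutoff at 5.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f1", "f2", "f3", "f4"])
X["f4"] = X["f1"] * 0.9 + rng.normal(scale=0.1, size=300)  # deliberately correlated

# Pairwise Pearson correlations above 0.6 flag candidates for removal.
corr = X.corr().abs()
high_corr = (corr > 0.6) & (corr < 1.0)
print(high_corr)

# VIF > 5 flags attributes involved in broader multicollinearity.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```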
Once the regression was validated, we developed a Microsoft Power BI-based simulation tool that allows users to simulate what the fuel efficiency gains would be if any of the independent variables were modified. Afterwards, we used the results of our regression and validated them against samples of data. The samples came from drivers for whom we had detected abrupt changes in fuel efficiency. To detect these abrupt changes for each driver, we calculated a rolling 7-day average of fuel efficiency to smooth out the daily noise and only kept those drivers for whom we observed an abrupt change in that average.
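A sketch of the rolling 7-day smoothing is shown below; the column names, sample values, and the change threshold are illustrative, not those used in the project:

```python
# 7-day rolling average of fuel efficiency per driver, used to spot abrupt shifts.
import pandas as pd

daily = pd.DataFrame({
    "driver_id": ["D1"] * 10,
    "date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "fuel_eff_kml": [2.7, 2.8, 2.6, 2.7, 2.9, 2.8, 2.2, 2.1, 2.0, 2.1],
})

daily["rolling_7d"] = (
    daily.sort_values("date")
         .groupby("driver_id")["fuel_eff_kml"]
         .transform(lambda s: s.rolling(7, min_periods=7).mean())
)

# Flag drivers whose smoothed efficiency shifts by more than an assumed cutoff.
shift = daily.groupby("driver_id")["rolling_7d"].agg(lambda s: s.max() - s.min())
print(shift[shift > 0.3])   # 0.3 km/l is a hypothetical cutoff
```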
3.4.2 Safety
Safety is not a straightforward concept to measure. Some of the proxies for measuring safety include the number of accidents or proprietary Safety Scores given by telematics data providers. Our initial approach to understanding which driving behaviors affect safety was to use accidents as the dependent variable in a regression model.
Predicting crash rates is difficult because accidents are stochastic events that do not always follow the same pattern; they depend on directly controllable factors, like driving style, as well as on external factors, like the weather, surrounding traffic, and highly uncertain events such as people crossing streets or other unexpected events. An econometric model was used to analyze the driving behavior of the drivers that had accidents. The econometric model was based on logistic regressions to predict the probability of an accident occurring. Econometric models allow the model to incorporate past events (by using lag features) that may have led to an accident, such as a driver engaging in unsafe practices for several days in a row. Another option is to use several proxies for safety: events such as reducing the velocity of the vehicle too quickly or making hard turns at considerable speed. Events like these can be used as proxies for safety, as they are considered unsafe behaviors.
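A minimal sketch of a logistic regression with lag features is given below; the frame is synthetic and the feature names (e.g., harsh_braking) are assumptions for illustration:

```python
# Logistic regression for accident probability using lagged unsafe-behavior counts.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "driver_id": np.repeat(["D1", "D2"], 100),
    "harsh_braking": rng.poisson(2, 200),
    "accident": rng.binomial(1, 0.05, 200),
})

# Lag features: unsafe behavior on previous days as predictors of today's risk.
for lag in (1, 2, 3):
    df[f"harsh_braking_lag{lag}"] = df.groupby("driver_id")["harsh_braking"].shift(lag)
df = df.dropna()

X = df[[c for c in df.columns if c.startswith("harsh_braking")]]
model = LogisticRegression(max_iter=1000).fit(X, df["accident"])
print(model.coef_)
```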
Our econometric model to predict accidents did not produce statistically significant results, as seen in an adjusted R2 of less than 0.2 and attributes with p-values greater than 0.05. Therefore, this part of the process was not integrated into the results. Our hypothesis for how to make this model work would be to restructure how the data is collected and to analyze more data related to the status of the driver.
Given that our first proxy for safety failed to work, we decided to use the Safety Score from Coca-Cola FEMSA's telematics provider. This score uses various events to calculate a proxy for the probability of having an accident. The calculations used are intellectual property of the supplier, but they are based on a micro-modeling approach similar to the method of Toledo et al. (2008) mentioned in the literature review (i.e., real-time analytics of telematics data). For example, a sudden longitudinal and lateral acceleration change measured by the telematics device may indicate an abrupt turn. This way, the previously mentioned independent variables were generated (e.g., abrupt lane changes, abrupt turns, and acceleration or braking events while turning).
To understand the relative importance of our independent variables with regard to the Safety Score, we discretized the Safety Score variable into 6 equal-frequency categories. The reason for discretizing the variable instead of treating it as a continuous numerical feature is that the Safety Score is bounded by an upper limit of 100, so a linear regression model would produce results with heteroscedasticity problems. Therefore, we ran a classification model using 20 independent variables to predict one of the 6 Safety Score classes. Figure 15 shows the ranges and number of observations for each of the 6 classes.
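The equal-frequency binning can be sketched as follows; the scores here are random stand-ins for the provider's actual values:

```python
# Equal-frequency discretization of the Safety Score into 6 classes.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
safety_score = pd.Series(rng.uniform(40, 100, size=1000), name="SafetyScore")

# qcut puts roughly the same number of observations in each of the 6 bins.
safety_class = pd.qcut(safety_score, q=6, labels=False)
print(safety_class.value_counts().sort_index())
```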
We tested various machine learning models to decide which would best fit the data. Among the options we tried were Random Forest, AdaBoost, Naïve Bayes, and Support Vector Machines (SVM). Table 1 shows the comparison between the different machine learning models. The algorithm that produced the best results in terms of Area Under the Curve (AUC) was the Random Forest. The AUC is a common metric to evaluate classification results across decision thresholds; a common way to interpret it is as the probability that the model ranks a randomly chosen positive example above a randomly chosen negative one.
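A comparison of classifiers by multiclass AUC could look like the sketch below; synthetic data stands in for the 20 driving-behavior features and the 6 Safety Score classes:

```python
# Multiclass (one-vs-rest) AUC for a Random Forest classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# One-vs-rest AUC averaged over the 6 classes.
print(roc_auc_score(y_te, proba, multi_class="ovr"))
```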
3.4.3 Driving Styles Clustering
The third model was based on unsupervised learning. The intention of using clustering was to group the different driving behavior characteristics, gathered at the daily and truck level, to identify clusters of driving styles.
The clustering approach we decided to use was a Bayesian Gaussian Mixture Model. The reason for using a probabilistic Gaussian Mixture Model is that it allowed us to better understand the properties of the input examples. Many clustering algorithms, like K-Means, simply give a cluster representative that shows nothing about how the points are spread. The Gaussian properties of this approach give us not only the mean of each cluster but also its variance, which can be used to estimate the likelihood that a point belongs to a certain cluster. The reason for choosing a Bayesian Gaussian Mixture Model instead of the traditional Gaussian Mixture Model was to take a probabilistic approach to choosing the number of clusters. With a traditional Gaussian Mixture Model, the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) must be used to select an optimal number of clusters, whereas the Bayesian version treats the cluster parameters as latent random variables rather than fixed model parameters. In other words, with this algorithm you can set an initial maximum number of clusters and the algorithm will decide the optimal number, rewarding models that fit the data well while minimizing a theoretical information criterion. The possible number of clusters ranges between 1 and the maximum that was set. For our problem, we chose a maximum of 10 clusters, as this would allow us to separate the driving styles into business-relatable information, but the algorithm suggested 5 clusters as the optimal number.
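A minimal sketch of this clustering step, with random features standing in for the scaled driving behaviors, could look as follows:

```python
# Bayesian Gaussian Mixture with an upper bound of 10 components.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 6))   # hypothetical scaled behavior features

bgm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

labels = bgm.predict(X)
# Components with non-negligible weight indicate the effective number of clusters.
print(np.round(bgm.weights_, 3))
```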
The input variables included independent variables related to fuel efficiency, for example idling times and excessive acceleration events, as well as independent variables related to safety, for example abrupt lane changes, abrupt turns, and acceleration or braking events while turning.
Each cluster was generated based on the independent variables that represent driving behavior, to explain which patterns each driver follows. Then we used the created clusters to see how they compare in terms of fuel efficiency and Safety Score, with the purpose of explaining the tradeoffs between the different types of driving styles.
To relate the clusters to the business's daily practices, a persona was defined for each cluster. A persona is a term borrowed from the marketing industry that describes a fictional character. Our intention in using these personas was to create fictitious but relatable characters so that the driving style of any driver could be identified and easily recognized. Our Gaussian Mixture Model approach also allows a driver's style to be, probabilistically speaking, part of more than one persona.
3.5 Conclusions
To answer our research question of which driving styles can help fleet owners increase safety and fuel efficiency, we used multiple machine learning and analytics techniques. We used data from over 3,000 trucks to build a fuel efficiency regression model; after an unsuccessful attempt at predicting crash rates with an econometric regression, we used a proprietary Safety Score from the telematics provider as a proxy for safety; and we developed a clustering analysis to drive the business recommendations and actionable insights. Given the amount of data we were dealing with, we also followed the CRISP-DM methodology to guide us through the iterative process of data mining. In the next section we describe the results of each model.
4 RESULTS
The fuel efficiency regression model (with the dependent variable measured in kilometers per liter of diesel) contained 13 independent variables plus the bias term. The linear regression model had an R2 of 0.67, a MAPE of 8.7%, an MAE of 0.23, and an RMSE of 0.28. For context, the mean fuel efficiency was 2.72 kilometers per liter with a standard deviation of 0.52. Figure 8 shows a histogram of the distribution of fuel efficiency.
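For reference, the reported error metrics relate to the predictions as in the following sketch; the numbers here are toy values, not the model's actual outputs:

```python
# MAE, MAPE, and RMSE for a small set of hypothetical predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error

y_true = np.array([2.7, 2.5, 3.0, 2.2])   # observed km per liter
y_pred = np.array([2.6, 2.7, 2.8, 2.4])   # hypothetical model predictions

mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(mae, mape, rmse)
```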
Figure 8: Histogram of Fuel Efficiency
As expected, not all variables have the same impact in estimating fuel efficiency. Figure 9 shows the relative standardized effects of each variable. The top three include AvgSpeed.
Figure 10 and Figure 11 show the residuals plots and the prediction error for the linear regression model. These plots were used to test the four key assumptions mentioned in the Methodology.
Note: the two plots show that the residuals are normally distributed (Gaussian bell at the top right), which visually confirms the normality of the residuals. The points also appear to be evenly distributed around the center, which shows that the residuals have linearity of response and homogeneity of variance.
Figure 12 shows a matrix of all features and how each correlates with the others. A value close to 1 indicates a strong positive correlation between two variables, a value close to 0 indicates no correlation, and a value close to negative 1 indicates a strong negative correlation. This plot served as the initial analysis for feature selection to avoid multicollinearity issues.
Table 3 shows the summarized results of the linear regression with the names of the features, the coefficients, the standard errors of the coefficients, the t-values, the p-values, and the VIF. The VIF values were used to check for multicollinearity. The regression results were also used in the Power BI simulation tool to perform scenario analysis with the different variables that impact fuel efficiency. Figure 13 displays an example of a possible scenario. The objective of this simulation tool is to allow the company to estimate the potential gain of experimenting with changes to any of the independent variables and to see how this would impact fuel efficiency.
Figure 13: Scenario Analysis
Note: as an example, the figure shows the results of a reduction from 15% to 10% in the time spent accelerating, which yields a gain in fuel efficiency, a decrease of 253 tons of CO2, and a reduction in costs of 1.72 million MXN.