Research on prediction of multi-class theft crimes by an optimized decomposition and fusion method based on XGBoost
A R T I C L E  I N F O

Keywords: Theft prediction; Multi-classification; Decomposition method; XGBoost; Crime prediction

A B S T R A C T

The number of theft cases is much higher than that of other criminal cases; theft occurs frequently in daily life and is seriously disruptive to social order. Studying the patterns of theft cases therefore has a positive impact on social governance and on optimizing police deployment. Based on theft-case data from H city, this study proposes an optimized decomposition and fusion method based on XGBoost and establishes two multi-classification prediction models, OVR-XGBoost and OVO-XGBoost. Because the theft data form a dataset with an unbalanced class distribution, this paper uses the SMOTENN algorithm to transform it into a dataset with a balanced distribution, which effectively improves model performance. Experiments show that the prediction accuracy of the OVR-XGBoost and OVO-XGBoost models is higher than that of the baseline XGBoost model. For categories with few samples, the classification effect of OVO-XGBoost is better than that of the baseline XGBoost and OVR-XGBoost models. Compared with the baseline XGBoost model, the average overall classification accuracy of the OVO-XGBoost model is improved by more than 7%, and its MacroR is improved by more than 15%. The model proposed in this study classifies and predicts theft types well and is of great significance for the prevention of theft cases.
above studies do not consider the influence of imbalanced datasets on their proposed models, or whether there are solutions to these negative influences, so this paper conducts research on this basis.

The contributions of this paper are as follows.

● Based on the XGBoost algorithm, an innovative optimized decomposition and fusion XGBoost method is proposed, and two optimized models, OVO-XGBoost and OVR-XGBoost, are constructed. The multi-classification problem is decomposed into multiple binary classification problems, which solves the problem that the baseline multi-classification model has a poor classification effect on this data set. The classification effect is verified on a real data set.

● The SMOTENN algorithm is used to turn the unbalanced data set into a balanced data set, which improves the performance of both the baseline model and the optimized models.

● The proposed models are applied to the field of theft crime prediction, and the prediction effect of the baseline model and the optimized models in this field is improved.

The rest of this paper is organized as follows. The second part introduces the problem of decomposing a multi-classification task into multiple binary classifications and the related applications of the XGBoost model in classification prediction and crime prediction. The third part introduces the model design and experimental design. The fourth part shows the effect of a single experiment and the comparison of several experiments. The fifth part summarizes the full text, reflects on the shortcomings, and looks forward to future work.

2. Related work

In reality, we often encounter multi-classification learning tasks. Some binary classification methods can be directly extended to multi-classification, but more often binary classifiers are used to solve multi-classification problems; that is, the multi-classification task is decomposed into several binary classification tasks and solved. Each split binary classification task needs to be trained with a binary classifier. When testing, the prediction results of these binary classifiers are integrated, and the final multi-classification result is obtained according to certain strategies. Sun (Jie et al, 2021) combined the multi-classification prediction of corporate financial distress with three decomposition and fusion methods and the SVM algorithm, put forward three prediction models combined with SVM, compared them with multiple discriminant analysis (MDA) and multinomial logit (MNLogit), and expounded the effects and competitiveness of the three prediction models.

The XGBoost algorithm is widely used in multi-classification problems. Xu (Guotian and Shen, 2021) proposed a multi-classification detection method for malicious programs based on an XGBoost and Stacking fusion model, which realized the integration of XGBoost and Stacking and greatly improved the accuracy of multi-classification of malicious programs. Qian (Hq et al, 2021), considering the effect of feature importance on feature selection, proposed a heuristic algorithm based on permutation importance (PIMP) called PIMP-XGBoost, thus correcting the biased feature importance measure and improving prediction accuracy and interpretability. Mateo (Mateo et al, 2021) proposed a multi-label model, which combined machine learning with MS recognition in a vacuum system and obtained better test results than random forest, XGBoost and six other commonly used multi-classification machine learning models. Ivașcu (Ivacu, 2020) looked at data-driven approaches based on nonparametric models; the historical and implied parameter comparison between SVM, genetic algorithms, random forest, XGBoost and LightGBM and classical methods of economic statistical hypothesis, such as Black-Scholes and Corrado-Su, was analyzed, and it was found that the machine learning algorithms perform better. Huang (XiaoLing et al, 2022) proposed a fair-AdaBoost method based on AdaBoost, which also extended the non-dominated sorting genetic algorithm II; it not only retained the interpretability, scalability and accuracy advantages of basic AdaBoost, but also improved performance.

The dataset determines the upper limit of prediction model performance, which can be approached through algorithm optimization and feature engineering (Ding et al, 2021). In reality, the data of most datasets are not evenly distributed, so pre-processing measures can greatly improve the performance of classification models. Xu (LingLing & DongXiang, 2020) sums up the research on class-unbalanced datasets from two aspects. One is to reconstruct the dataset from the data itself, with the purpose of changing the distribution of sample numbers so that the numbers of different classes in the unbalanced dataset become relatively balanced. On the other hand, aiming at the characteristic that traditional classification models have high overall classification accuracy but low recognition ability for minority classes, a series of targeted improvement strategies have been put forward at the level of classification algorithms and classification thinking, which pay more attention to minority classes and improve their classification accuracy. Qin (XiWen et al, 2021) combines a new feature selection algorithm (F-SCORE) with a variety of machine
Fig. 1. Overall framework of multi-classification modeling based on decomposition and fusion XGBoost.
Table 3
LabelEncoder coding and OneHotEncoder coding.

No.    Case category    Commercial places_l0    …    Time of crime_h0    …    Weather condition_w0    …    Stolen object_obj0    …
1      3                0                       …    1                   …    0                       …    0                     …
2      0                1                       …    0                   …    0                       …    0                     …
3      2                0                       …    0                   …    1                       …    0                     …
…      …                …                       …    …                   …    …                       …    …                     …
1461   3                0                       …    0                   …    0                       …    1                     …
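Table 3 combines a label-encoded target (the case category) with one-hot-encoded categorical features. A minimal sketch of how such a coding can be produced with scikit-learn and pandas; the column names and values below are illustrative placeholders, not the authors' exact schema:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative records; the real dataset has 1461 cases and more feature columns.
df = pd.DataFrame({
    "case_category": ["t3", "t0", "t2", "t1"],
    "place":         ["commercial", "residential", "street", "market"],
    "time_of_crime": ["h0", "h1", "h0", "h3"],
})

# Label-encode the target: t0..t3 -> 0..3.
y = LabelEncoder().fit_transform(df["case_category"])   # array([3, 0, 2, 1])

# One-hot encode the categorical features; pd.get_dummies behaves like
# OneHotEncoder here and keeps readable column names such as place_commercial.
X = pd.get_dummies(df[["place", "time_of_crime"]], dtype=int)
print(X.columns.tolist())
```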
Table 5
Number of samples of the training set after processing.
Train set t0 t1 t2 t3 sum
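As discussed in the following paragraphs, the balanced training-set counts summarised in Table 5 are obtained by resampling only the training split with SMOTE and SMOTENN. A minimal sketch with the imbalanced-learn package, assuming the paper's SMOTENN corresponds to the SMOTE + Edited Nearest Neighbours combination (SMOTEENN in imbalanced-learn) and that X_train, y_train are placeholders for the encoded features and theft-type labels:

```python
from collections import Counter

from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE


def rebalance(X_train, y_train):
    """Resample the training split only; the test split stays untouched."""
    X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
    X_senn, y_senn = SMOTEENN(random_state=42).fit_resample(X_train, y_train)
    print("original :", Counter(y_train))
    print("SMOTE    :", Counter(y_smote))
    print("SMOTEENN :", Counter(y_senn))
    return (X_smote, y_smote), (X_senn, y_senn)
```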
As can be seen from Fig. 2 above, in the original dataset the proportions of t0, t1, t2 and t3 were 38.44%, 6.16%, 7.52% and 47.88%, respectively. It is obvious that the numbers of samples of t0 and t3 are far larger than those of the other types.

In the meantime, as shown in Table 4, due to the unbalanced class distribution in the dataset, the SMOTE and SMOTENN algorithms are used to process the dataset respectively, and the proportions of the test set and training set produced by each treatment are given. SMOTE and SMOTENN reduce the imbalance of the class sample distribution in the training set.

As shown in Table 5, SMOTE and SMOTENN have the best effect among the many kinds of set balancing that were compared. Therefore, the authors choose these two methods to deal with the class-imbalanced dataset.

3.3. Baseline XGBoost

Bagging and Boosting are both model fusion methods that combine weak classifiers to form a strong classifier with better results than the best weak classifier. Bagging and Boosting differ in sample selection, sample weighting, prediction function and parallel computation. In general, Bagging reduces variance, while Boosting reduces bias.

The XGBoost algorithm is a Boosting ensemble algorithm, usually based on decision trees. Each newly generated tree continuously learns the residuals between the predicted values and the true values of the current ensemble, and the learning results of multiple trees are finally accumulated as the prediction result. XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible and portable. It implements machine learning algorithms in the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that can solve many data science problems quickly and accurately. XGBoost is an improvement of the gradient boosting algorithm: Newton's method is used to solve for the extremum of the loss function, the loss function is Taylor-expanded to second order, and regularization terms are added to the loss function. The objective function during training consists of two parts: the first part is the loss of the gradient boosting algorithm, and the second part is the regularization term. The loss function is defined as Eq. (1).

L(\Phi) = \sum_{a=1}^{n} l\left(y'_a, y_a\right) + \sum_{b} \Omega(f_b)    (1)

where n is the number of training samples, l is the loss of a single sample and is assumed to be convex, y_a is the model's predicted value for training sample a, y'_a is the true label of training sample a, and b indexes the trees whose complexity is regularized. The regularization term defines the complexity of the model as Eq. (2).

\Omega(f_b) = \alpha T + \frac{1}{2}\beta \lVert \omega \rVert^{2}    (2)

where α and β are manually set parameters, ω is the vector formed by the values of all leaf nodes of the decision tree, and T is the number of leaf nodes.

3.4. Decomposition and fusion combined with XGBoost

For dichotomous problems, the result is 0 or 1, which clearly divides the samples into two categories. However, the tasks encountered in practice are often not
dichotomous. The most important problems are therefore how to split the multi-classification task and how to integrate the multiple classifiers. Common decomposition and fusion methods are one-versus-one (OVO) and one-versus-rest (OVR).

3.4.1. OVR-XGBoost
In one-versus-rest (OVR), during training, the samples of a certain category are grouped into one class and the remaining samples are grouped into another class, so that k XGBoosts can be constructed from the samples of k categories. In classification, an unknown sample is assigned to the category with the maximum classification function value. For example, in the theft dataset there are four theft types to be divided, namely t0, t1, t2 and t3; when extracting the training set, each is extracted separately.

The four training sets are used for training respectively, and four training result files are obtained. During testing, the corresponding test vectors are tested with the four training result files respectively. Finally, each test yields results f1, f2, f3, f4, and the final classification result of the OVR-XGBoost model can be abstracted into a formula as Eq. (3); the OVR-XGBoost model is shown in Fig. 3.

\mathrm{classification}_{OVR} = \max\{f_i \mid i = 1, 2, 3, 4 \cdots n\}    (3)

Fig. 3. OVR-XGBoost.

3.4.2. OVO-XGBoost
One-versus-one (OVO) designs an XGBoost between any two types of samples, so t(t−1)/2 XGBoosts are required for samples of t classes. When classifying an unknown sample, the category with the most votes is the category of the unknown sample. For example, in the theft dataset there are four theft types to be divided, namely t0, t1, t2 and t3. During training, the vectors corresponding to the pairs t0, t1; t0, t2; t0, t3; t1, t2; t1, t3; and t2, t3 are used as training sets, and six training results are obtained. During testing, the corresponding vectors are tested on the six results respectively, the voting form is adopted, and finally a group of results is obtained.

The end result, the largest of the four vote counts, is abstracted into a formula as the classification result, as Eq. (4). The OVO-XGBoost model is shown in Fig. 4.

\mathrm{classification}_{OVO} = \max\{N_i \mid i = 1, 2, 3, 4 \cdots n\}    (4)

Fig. 4. OVO-XGBoost.

4. Results and discussion

4.1. Experimental configuration

The experiment is based on the Windows 10 Professional operating system, and the Python integrated environment Anaconda3 is used; the Python version is 3.8.8. The configuration of the experimental machine is given in Table 6.

Table 6
Experimental machine configuration.

CPU core      8 cores and 16 threads
Memory        16 GB
GPU           NVIDIA GeForce RTX 3060
GPU memory    6 GB

4.2. Experimental results

4.2.1. Effect of the baseline XGBoost algorithm
According to the dataset processed as in Table 3, the baseline XGBoost algorithm was used to classify and predict the theft cases. The parameter settings of the baseline XGBoost algorithm are as follows: n_estimators = 30, learning_rate = 0.3, max_depth = 7, the minimum leaf node sample weight min_child_weight = 1, colsample_bytree = 0.8, objective = 'multi:softmax', and the number of categories num_class = 4. The effects of a single baseline XGBoost classification are shown in Table 7.

Table 7
Classification effect table of the baseline XGBoost algorithm.

              precision  recall  f1-score  support   Train_set  Test_set
3             0.68       0.92    0.78      210
2             0.80       0.12    0.21      33
1             0.00       0.00    0.00      27
0             0.86       0.75    0.80      169
Accuracy                                   439       76.34%     81.80%
macro avg     0.58       0.45    0.45      439
weighted avg  0.71       0.74    0.70      439

Table 8
Classification effect table of t0, t1, t2 and t3 based on OVR-XGBoost.

              precision  recall  f1-score  support   Train_set  Test_set  AUC

4.2.2. Comparison of the XGBoost algorithm with other Boosting and Bagging algorithms
In this paper, XGBoost is compared with Boosting algorithms such as AdaBoost and with Bagging algorithms such as random forest, and K-fold (K = 10) cross-validation is used to show the cross-validation results and average performance of these three ensemble algorithms with the same hyperparameters, as shown in Fig. 5. It is clear that after 10-fold cross-validation, the XGBoost model performs better than AdaBoost and random forest on this theft data set in most folds, and the XGBoost model has the best average performance, so this paper chooses XGBoost as the baseline model.
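A minimal sketch of this comparison, using the baseline hyperparameters listed in Section 4.2.1 for XGBoost; the synthetic data and the AdaBoost/random-forest settings are illustrative stand-ins, not the authors' data or exact configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Stand-in for the encoded theft dataset: 4 classes with proportions
# roughly matching those reported for t0-t3.
X, y = make_classification(
    n_samples=1461, n_features=20, n_informative=10, n_classes=4,
    weights=[0.38, 0.06, 0.08, 0.48], random_state=0,
)

# Baseline XGBoost with the parameters reported in Section 4.2.1
# (num_class = 4 is inferred from the labels by the scikit-learn wrapper).
xgb = XGBClassifier(
    n_estimators=30, learning_rate=0.3, max_depth=7,
    min_child_weight=1, colsample_bytree=0.8, objective="multi:softmax",
)

models = {
    "XGBoost": xgb,
    "AdaBoost": AdaBoostClassifier(n_estimators=30),            # assumed settings
    "Random forest": RandomForestClassifier(n_estimators=30),   # assumed settings
}

# 10-fold cross-validation, as in Fig. 5.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.4f}")
```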
Table 9
Classification effect table of t0, t1, t2 and t3 based on OVO-XGBoost.

              precision  recall  f1-score  support   Train_set  Test_set  AUC
Fig. 6. Comparison of the classification accuracy curves of 20 experiments based on the original dataset.
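Before the per-strategy results, here is a minimal sketch of the two decomposition-and-fusion strategies from Section 3.4, using scikit-learn's one-vs-rest and one-vs-one wrappers around a binary XGBoost learner. It mirrors the fusion rules of Eqs. (3) and (4) (maximum score and majority vote) but is not necessarily the authors' exact implementation; X_train, y_train, X_test, y_test are placeholders:

```python
from sklearn.metrics import accuracy_score, recall_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from xgboost import XGBClassifier


def binary_xgb():
    # Binary base learner; each wrapper trains several copies of it.
    return XGBClassifier(n_estimators=30, learning_rate=0.3, max_depth=7,
                         min_child_weight=1, colsample_bytree=0.8)


def evaluate_ovr_ovo(X_train, y_train, X_test, y_test):
    # OVR: one classifier per class (4 models); the class with the
    # largest score wins, as in Eq. (3).
    ovr = OneVsRestClassifier(binary_xgb()).fit(X_train, y_train)
    # OVO: one classifier per pair of classes (4*3/2 = 6 models);
    # the class with the most votes wins, as in Eq. (4).
    ovo = OneVsOneClassifier(binary_xgb()).fit(X_train, y_train)
    for name, model in (("OVR-XGBoost", ovr), ("OVO-XGBoost", ovo)):
        pred = model.predict(X_test)
        print(name,
              "accuracy:", round(accuracy_score(y_test, pred), 4),
              "MacroR:", round(recall_score(y_test, pred, average="macro"), 4))
```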
4.2.3. Effect of the OVR-XGBoost algorithm
According to the requirements of the OVR method, the dataset was processed and the theft dataset was divided into k categories (k = 4). Four XGBoost models were established for the four dichotomous datasets constructed according to the OVR method. The results of a single experiment are shown in Table 8.

4.2.4. Effect of the OVO-XGBoost algorithm
According to the design of the OVO method, the dataset was processed and t(t−1)/2 XGBoost models were designed for samples of t categories, i.e., six XGBoost models were designed for the samples of four categories. The results of a single experiment are shown in Table 9.

4.2.5. Comparison of model effects
After 20 experiments, the comparison of the effects of the proposed fusion models under the three data processing conditions is obtained. The comparison is shown in Figs. 6, 7 and 8.

The average performance comparison for the above 20 experiments is given in Table 10. In Table 10, the overall accuracy is obtained by dividing the number of correctly classified test samples by the total number of test samples, while the accuracy of each category is obtained by dividing the number of correctly classified test samples in that category by the total number of test samples in that category. MacroR refers to the macro recall rate. When training and testing different models multiple times on the same dataset, or evaluating the same model on multiple datasets, or performing multi-class tasks, many confusion matrices are generated; at this point, MacroR is needed for evaluation. The higher the value of MacroR, the better the effect of the model and the more positive samples are correctly classified.

Table 10 shows the average values of the relevant variables obtained from 20 experiments, where the maximum value in each row is highlighted and the minimum value in each row is underlined.
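The quantities reported in Table 10 can be computed directly from the test labels and predictions; a minimal sketch (y_true and y_pred are placeholders for a single experiment's test labels and predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score


def summarise(y_true, y_pred, n_classes=4):
    # Overall accuracy: correctly classified test samples / total test samples.
    overall = accuracy_score(y_true, y_pred)
    # Per-class accuracy (recall): correct samples of a class / samples of that class.
    per_class = recall_score(y_true, y_pred, average=None,
                             labels=np.arange(n_classes))
    # MacroR: unweighted mean of the per-class recalls.
    macro_r = per_class.mean()
    return overall, dict(enumerate(per_class)), macro_r
```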
Fig. 7. Comparison of the classification accuracy curves of 20 experiments based on the dataset processed with SMOTE.
Fig. 8. Comparison of the classification accuracy curves of 20 experiments based on the dataset processed with SMOTENN.
As can be seen from the table, both the OVR-XGBoost and OVO-XGBoost models performed better than the baseline XGBoost model on both the raw data and the datasets processed by the two imbalance-handling algorithms, with little difference in overall accuracy between the two models. In the original dataset, class t1 was a small sample, and the classification accuracy of the baseline XGBoost model was 0, while that of the OVO-XGBoost model was improved to 58.44%. After SMOTE and SMOTENN, the baseline model reached only about 25% accuracy on the minority classes, whereas OVO-XGBoost still exceeded 50% accuracy. Because of the uneven class distribution of this dataset, after SMOTE and SMOTENN the class sample distribution becomes basically balanced, and the overall classification accuracy and all per-class accuracies of the XGBoost, OVR-XGBoost and OVO-XGBoost models decrease, but MacroR improves.

Among them, the classification accuracy of the XGBoost model is significantly reduced, indicating that the model is not suitable for the datasets processed by these two algorithms. However, the classification accuracy of the OVR-XGBoost and OVO-XGBoost models is not significantly reduced, while MacroR is greatly improved. Overall, the OVO-XGBoost model preprocessed by the SMOTENN algorithm fits the dataset better and performs better than the baseline XGBoost model.

5. Conclusions

This study adopts the 2019 crime data of H city provided by the Public Security Bureau of H city and makes multi-classification predictions of the crimes. A multi-class theft case prediction model based on decomposition-fusion XGBoost is proposed. By combining XGBoost with two decomposition-fusion methods, one-versus-one and one-versus-rest, two multi-class models are established. The multi-classification problem is decomposed into multiple dichotomies, and the two multi-class models are compared with the baseline XGBoost model. It is found