
Expert Systems With Applications 207 (2022) 117943

Research on prediction of multi-class theft crimes by an optimized decomposition and fusion method based on XGBoost

Zhongzhen Yan, Hao Chen, Xinhua Dong, Kewei Zhou, Zhigang Xu*

School of Computer Science and Technology, Hubei University of Technology, Wuhan 430000, China

Keywords: Theft prediction; Multi-classification; Decomposition method; XGBoost; Crime prediction

Abstract: The number of theft cases is much higher than that of other criminal cases; theft occurs frequently in daily life and is seriously destructive to social order. Studying the patterns of theft cases has a positive impact on social governance and on optimizing police deployment. Therefore, based on the data of theft cases in H city, this study proposes an optimized decomposition and fusion method based on XGBoost and establishes two multi-classification prediction models, OVR-XGBoost and OVO-XGBoost. As the theft data form a dataset with an unbalanced class distribution, this paper uses the SMOTENN algorithm to process it into a dataset with a balanced distribution, which effectively improves the performance of the models. Experiments show that the prediction accuracy of the OVR-XGBoost and OVO-XGBoost models is higher than that of the baseline XGBoost model. For categories with few samples, the classification performance of OVO-XGBoost is better than that of the baseline XGBoost and OVR-XGBoost models. Compared with the baseline XGBoost model, the average overall classification accuracy of the OVO-XGBoost model is improved by more than 7%, and the MacroR accuracy is improved by more than 15%. The model proposed in this study classifies and predicts theft types well and is of great significance for the prevention of theft cases.

1. Introduction

Crime analysis is an important part of social security management, and the occurrence of crimes in time and space follows certain rules. Studying the temporal and spatial correlation of crime can provide more useful clues for crime analysis and help to discover potential crime patterns (Zhou et al, 2021). Theft, as one of the crimes against property, occurs frequently in daily life and has a seriously destructive effect on the social order (LanLan, 2021).

The history of theft crime can be traced back to the period when private property came into being. Among all kinds of criminal cases in the socialist modernization stage, the number of theft cases is much higher than that of other criminal cases. The overall characteristics are the specificity of the criminal subject, the regularity of the crime time and the diversity of the property involved (Xiao, 2021). At present, there is some research on theft prediction at home and abroad. B. Sivanagaleela (Ali et al, 2019) used the fuzzy C-means algorithm to cluster cognizable crime data and then, based on clustering and preprocessing, extracted the crime area and the causes of crime from structured data such as the details of people's crimes and other factors. Esquivel (Nicolis & Marquez, 2020) put forward the application of a GCLSTM (graph convolution long short-term memory) neural network, which combines a graph convolution model with LSTM to predict vehicle theft in Chile's metropolitan areas. Song (Song et al, 2020) put forward a new data-driven method to build a prediction model for the detection of port cargo theft risk, which is helpful not only for understanding the dependency relationships among related risk factors but also for optimizing risk management strategies. Han (XinGe, 2021) established a prediction model of urban theft crime based on the integration of LSTM and ST-GCN, which performs well in predicting the spatial and temporal distribution of urban theft crime; at the same time, a prediction model of the number and grade classification of community cases based on the combination of GCN and LSTM also performed well in verifying the prediction results of the above model. Kwon (Kwon et al, 2021), based on a K-modes clustering ANN model, explained the influence of environmental factors on theft type prediction. Zhou (Zhou et al, 2021) used multi-source crowd-sensing data to study urban crime in time and space, found a strong correlation between taxi travel, temperature and crime, and found that, compared with other weather conditions, cloudy days are more prone to crime. However, most of the

* Corresponding author.
E-mail address: [email protected] (Z. Xu).

https://doi.org/10.1016/j.eswa.2022.117943
Received 19 December 2021; Received in revised form 5 June 2022; Accepted 20 June 2022
Available online 26 June 2022

above studies do not consider the influence of imbalanced datasets on their proposed models, or whether there are solutions to these adverse effects, so this paper conducts research on this basis.

The contributions of this paper are as follows.

● Based on the XGBoost algorithm, an innovative optimized decomposition and fusion XGBoost method is proposed, and two optimized models, OVO-XGBoost and OVR-XGBoost, are constructed. The multi-classification problem is decomposed into multiple binary classification problems, which solves the problem that the baseline multi-classification model has a poor classification effect on this data set. The classification effect is verified on a real data set.

● The SMOTENN algorithm is used to turn the unbalanced data set into a balanced data set, which improves the performance of both the baseline model and the optimized models.

● The proposed models are applied to the field of theft crime prediction, and the prediction performance of the baseline model and the optimized models in this field is improved.

The rest of this paper is organized as follows: The second part introduces the problem of decomposing multi-class classification into multiple binary classifications and the related applications of the XGBoost model in classification prediction and crime prediction. The third part introduces the model design and the experimental design. The fourth part shows the results of a single experiment and the comparison of several experiments. The fifth part summarizes the full text, reflects on the shortcomings, and looks forward to future work.

2. Related work

In reality, we often encounter multi-class learning tasks. Some binary classification methods can be directly extended to multi-class classification, but more often binary learners are used to solve multi-class problems; that is, the multi-class task is decomposed into several binary tasks, which are solved separately. Each split binary task is trained with a binary classifier. At test time, after the predictions of these binary classifiers are integrated, the final multi-class result is obtained according to certain strategies. Sun (Jie et al, 2021) combined the multi-class prediction of corporate financial distress with three decomposition and fusion methods and the SVM algorithm, put forward three prediction models combined with SVM, compared them with multiple discriminant analysis (MDA) and multinomial logit (MNLogit), and expounded the effects and competitiveness of the three prediction models.

The XGBoost algorithm is widely used in multi-classification problems. Xu (Guotian and Shen, 2021) proposed a multi-classification detection method for malicious programs based on an XGBoost and Stacking fusion model, which realized the integration of XGBoost and Stacking and greatly improved the accuracy of multi-classification for malicious programs. Qian (Hq et al, 2021), considering the effect of feature importance on feature selection, proposed a heuristic algorithm based on permutation importance (PIMP) called PIMP-XGBoost, thus correcting the biased feature importance measure and improving prediction accuracy and interpretability. Mateo (Mateo et al, 2021) proposed a multi-label model that combined machine learning with mass-spectrum recognition in vacuum systems and obtained better test results than random forest, XGBoost and six other commonly used multi-classification models. Ivașcu (Ivacu, 2020) looked at data-driven approaches based on nonparametric models, analyzed the comparison, on historical and implied parameters, between SVM, genetic algorithms, random forest, XGBoost and LightGBM on the one hand and classical economic-statistical methods such as Black-Scholes and Corrado-Su on the other, and found that the machine learning algorithms perform better. Huang (XiaoLing et al, 2022) proposed a fair-AdaBoost method based on AdaBoost, which also extended the non-dominated sorting genetic algorithm II; it not only retained the interpretability, scalability and accuracy of basic AdaBoost but also improved performance.

The dataset determines the upper limit of prediction model performance, which can be approached through algorithm optimization and feature engineering (Ding et al, 2021). In reality, the data of most datasets are not evenly distributed, so pre-processing measures can greatly improve the performance of classification models. Xu (LingLing & DongXiang, 2020) sums up the research on class-unbalanced datasets from two aspects. One is to reconstruct the dataset from the data itself, with the purpose of changing the distribution of sample numbers so that the numbers of different classes in the unbalanced dataset become relatively balanced. The other, aimed at the characteristic that traditional classification models have high overall classification accuracy but low recognition ability for minority classes, puts forward a series of targeted improvement strategies at the level of classification algorithms and classification thinking, which pay more attention to minority classes and improve their classification accuracy. Qin (XiWen et al, 2021) combines a new feature selection algorithm (F-SCORE) with a variety of machine learning algorithms to solve multi-classification problems; experiments show that this combination works well, and compared with the original dataset, the combined algorithm not only uses fewer features but also produces good classification results.

Fig. 1. Overall framework of multi-classification modeling based on decomposition and fusion XGBoost.


Wang (Wang et al, 2021) used SMOTE + ENN combined with several machine learning methods, which effectively processed medical data with an unbalanced category distribution and improved the efficiency of identifying adverse outcomes in patients with chronic heart failure. Mushava (Mushava & Murray, 2022) proposed using the quantile function of the generalized extreme value (GEV) distribution as the link function in XGBoost and applied an improved focal loss function to severely class-imbalanced credit scoring data sets; the modified method predicts class-unbalanced data sets better than other models. Seng (Zs et al, 2020) proposed a neighborhood undersampled stacked ensemble (NUS-SE) method, which improves the classification performance of stacked ensembles on unbalanced data sets and performs better than stacked ensembles without resampling. Chen (Amrit & Poel, 2021) developed a machine learning framework based on historical behavior data and verified on real data sets that the framework is feasible and effective in handling the sparsity and imbalance of customer purchase data. Zelenkov (Zelenkov & Volodarskiy, 2021) proposed a multi-objective classifier selection algorithm (MOCS) for corporate bankruptcy prediction based on unbalanced data and regarded the classification problem as a multi-objective optimization problem; the test results show that the algorithm significantly improves the predicted results in terms of standard metrics and the visual representation of the FNR/FPR space.

3. Research methods and research design

3.1. Overall framework

To build the decomposition and fusion XGBoost model, firstly, the original dataset is divided into a training set and a test set in proportion; secondly, the class samples with an unbalanced distribution in the training set are processed into samples with a balanced distribution by the SMOTENN algorithm; then the decomposition and fusion XGBoost model, i.e., the two multi-classification models OVR-XGBoost and OVO-XGBoost, is trained on the sample data. Finally, the held-out test set is used to test the two multi-classification models, and the final classification results are obtained. The overall framework of multi-classification modeling based on decomposition and fusion XGBoost is shown in Fig. 1.

3.2. Data preprocessing

This paper studies the criminal registration data obtained from the Public Security Bureau of H City: the data include 3629 criminal cases recorded in H City in 2019. Because the numbers of other case types are too scattered, the theft case data with the largest proportion, about 1521 records, are selected. There are 19 features in total in the theft case data, from which features such as occurrence place, timing, stolen object and case category are selected for analysis.

3.2.1. Data feature processing

In the original data set, the timing characteristics are stored in a scattered way, mixing the weather and the time at which the crime was committed; the timing field is therefore split into two columns, stored respectively as the two features time of crime and weather condition. The final selection thus contains five characteristic variables: the input variables are commercial place, time of crime, weather condition and stolen object, and the output variable is case category.

Weather conditions are divided into four categories: sunny, cloudy, rainy and others. The time of crime is divided into five categories: 0-6 o'clock in the wee hours (excluding 6 o'clock), 6-11 o'clock in the morning (excluding 11 o'clock), 11-13 o'clock at noon (excluding 13 o'clock), 13-18 o'clock in the afternoon (excluding 18 o'clock), and 18-24 o'clock in the evening. Case categories are classified into vehicle-related theft, burglary, pickpocketing and other types of theft. The stolen objects are divided into 11 categories according to people and objects: old women, old men, middle-aged women, middle-aged men, young women, young men, girls, boys, vehicles, personal belongings and other objects. This classification absorbs data that cannot stand as a feature value of its own because of its small quantity; for example, vehicle-related theft contains detailed classifications such as theft of electric vehicles and theft of motorcycles, and items such as batteries and mobile phones are included in personal belongings. Therefore, the data of the 1521 theft cases in H city are divided into five items; the data table is shown in Table 1, and all variables in the table are categorical.

Table 1
Summary of theft datasets.

No.  | Case category         | Commercial places    | Time of crime | Weather condition | Stolen object
1    | burglary              | hospital             | noon          | sunny             | old women
2    | vehicle-related theft | entertainment places | evening       | rainy             | young women
…    | …                     | …                    | …             | …                 | …
1521 | others                | others               | others        | others            | others

Partitioning and populating the data in Table 1 yields Table 2 below; all variables in the table are categorical.

Table 2
Code table corresponding to the characteristics of the stolen datasets.

Case category: t0 vehicle-related theft; t1 burglary; t2 pickpocketing; t3 other types of theft.
Commercial places: L0 primary and secondary schools; L1 transportation tools; L2 party and government organs; L3 buses; L4 highways; L5 other places; L6 hospitals; L7 commercial places; L8 subway and light rail; L9 colleges and universities; L10 entertainment places; L11 hotels and restaurants; L12 residential buildings; L13 industrial enterprises; L14 squares and streets; L15 cultural and sports places; L16 bathing places; L17 internet cafes; L18 stations and docks; L19 banks.
Time of crime: h0 wee hours; h1 morning; h2 noon; h3 afternoon; h4 evening; h5 others.
Weather condition: w0 sunny; w1 cloudy; w2 rainy; w3 others.
Stolen object: obj0 old women; obj1 old men; obj2 middle-aged women; obj3 middle-aged men; obj4 young women; obj5 young men; obj6 girls; obj7 boys; obj8 vehicles; obj9 personal belongings; obj10 other objects.

3.2.2. Missing value handling

In all kinds of practical databases, missing attribute values are often unavoidable. Therefore, in most cases, information systems are incomplete, or incomplete to some extent. Commonly used missing value completion methods include manual filling, special value filling, average value filling and mode filling.
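As a concrete illustration of the mode-filling strategy just mentioned, the following minimal sketch (assuming pandas is available; the column names and values are illustrative, following Table 2, and this is not the authors' published code) fills each missing entry of a categorical column with that column's most frequent value:

    import pandas as pd

    # Toy records with gaps like those produced when the timing field is split.
    df = pd.DataFrame({
        "case_category": ["burglary", None, "pickpocketing", "burglary"],
        "weather_condition": ["sunny", "rainy", None, "sunny"],
    })

    # Fill each missing value with the mode (most frequent value) of its column.
    for col in df.columns:
        df[col] = df[col].fillna(df[col].mode().iloc[0])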


As the feature columns selected in the stolen dataset are all categorical data, most of the missing values are caused by complex and scattered cases, a lack of relevant information, and the inability to classify in detail during police registration. For example, when registering the time chosen for the theft, only the weather or only the time may be recorded in the input information, so missing values appear when this field is split into two columns; and when filling in the classification of theft cases, cases that are too tedious or too regional are left empty at registration. Therefore, the missing values in the dataset are filled with special values, taking the most frequent value in each feature column as the filling value. For example, a vacancy in the case classification is filled with other types of theft, a vacancy in the weather is filled with others, and so on.

3.2.3. LabelEncoder and OneHotEncoder

In data processing we sometimes need to digitize discontinuous numbers or text. In machine learning, most algorithms, such as logistic regression, the support vector machine (SVM) and the k-nearest neighbor algorithm, can only deal with numerical data, not text. However, in reality many labels and features are not represented by numbers after data collection. In this case, in order to feed the data to the algorithm, the data must be encoded, that is, converted from literal to numeric.

In the theft case dataset, the five selected feature columns are all label data, so they should be converted into numerical data. LabelEncoder coding is used to convert the output variable case category from [vehicle-related theft, burglary, pickpocketing, other types of theft] into the numerical values [0, 1, 2, 3]. OneHotEncoder coding is used to convert the input variables commercial place, time of crime, weather condition and stolen object into binary form and then into a sparse 0/1 matrix, after which the processed data are fed into the model. The coded dataset is summarized in Table 3; all variables in the table are categorical.

Table 3
LabelEncoder coding and OneHotEncoder coding.

No.  | Case category | Commercial places_l0 | … | Time of crime_h0 | … | Weather condition_w0 | … | Stolen object_obj0 | …
1    | 3             | 0                    |   | 1                |   | 0                    |   | 0                  |
2    | 0             | 1                    |   | 0                |   | 0                    |   | 0                  |
3    | 2             | 0                    |   | 0                |   | 1                    |   | 0                  |
…    | …             | …                    |   | …                |   | …                    |   | …                  |
1461 | 3             | 0                    |   | 0                |   | 0                    |   | 1                  |

Fig. 2. Distribution of the number of theft cases by category before applying the SMOTE technique.

3.2.4. SMOTENN preprocessing

Class data in datasets are often unbalanced, and a model built by traditional methods then tends to classify the categories with few samples poorly, with low accuracy and a high misjudgment rate. To solve the problem of unbalanced class distributions, methods such as over-sampling and under-sampling are usually adopted. Over-sampling replicates samples from the minority categories to increase their number and thus balance the data. Under-sampling randomly selects from the majority categories the same number of samples as in the minority categories, balancing the dataset by reducing the number of majority samples.

However, both simple over-sampling and simple under-sampling cause problems. Simple over-sampling can lead to an unreasonable data distribution, or even one inconsistent with the real situation, because over-sampling may cause over-fitting. Simple random under-sampling, on the other hand, is highly random and may delete important sample information from the majority categories. Therefore, this experiment combines over-sampling and under-sampling: the SMOTENN algorithm, which combines SMOTE over-sampling with the edited nearest neighbor (ENN) under-sampling rule. SMOTE does not merely replicate existing data; it expands the dataset by creating new samples on top of the original ones, obtaining extra minority samples by interpolation. The under-sampling method ENN is then used to clean the overlap introduced by the SMOTE sample increase: a majority sample is deleted if half of its K nearest neighbors are not in the majority class. The combination of the two overcomes their individual shortcomings and brings the dataset to a balanced distribution, and the balanced dataset improves the performance of the model.

3.2.5. Datasets partitioning

In this study, 20 experiments were conducted based on the OVR-XGBoost and OVO-XGBoost multi-classification models. In each experiment, 70% of the data from each class of the original dataset were randomly selected as the training set and the remaining 30% as the test set. Twenty experiments were also conducted based on the baseline XGBoost model, for which 70% of the total original data were randomly selected as the training set and the remaining 30% as the test set. In the meantime, due to the unbalanced class distribution in the dataset, the SMOTE and SMOTENN algorithms were used to process the dataset respectively, and the class proportions of the processed training and test sets are reported.

Table 4
Number of training samples and test samples of each category before applying the SMOTE technique.

Data type    | t0  | t1 | t2  | t3  | sum
Train set    | 393 | 63 | 77  | 490 | 1023
Test set     | 169 | 27 | 33  | 210 | 439
Validate set |     |    |     |     |
sum          | 562 | 90 | 110 | 700 | 1462
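The preprocessing chain of Sections 3.2.3-3.2.5 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes pandas, scikit-learn and imbalanced-learn, whose SMOTEENN class implements the SMOTE + ENN combination that this paper calls SMOTENN; the file name and column names are hypothetical.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from imblearn.combine import SMOTEENN

    df = pd.read_csv("theft_cases.csv")  # hypothetical file with the five columns

    # LabelEncoder turns the output variable (case category) into integers 0..3.
    y = LabelEncoder().fit_transform(df["case_category"])

    # OneHotEncoder turns the four categorical inputs into a sparse 0/1 matrix.
    X = OneHotEncoder().fit_transform(
        df[["commercial_place", "time_of_crime", "weather_condition", "stolen_object"]]
    )

    # 70%/30% split, stratified so every class appears in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )

    # SMOTE over-sampling followed by ENN cleaning, on the training set only.
    X_train_bal, y_train_bal = SMOTEENN(random_state=0).fit_resample(X_train, y_train)

Note that rebalancing is applied only to the training set, so the test set keeps the real class distribution, as in Table 4.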


Table 5
Number of samples in the training set after processing.

Method                      |            | t0     | t1     | t2     | t3     | sum
Original                    | number     | 393    | 63     | 77     | 490    | 1023
                            | proportion | 38.42% | 6.16%  | 7.53%  | 47.89% | 100%
SMOTE                       | number     | 490    | 490    | 490    | 490    | 1960
                            | proportion | 25%    | 25%    | 25%    | 25%    | 100%
SMOTENN                     | number     | 235    | 160    | 220    | 129    | 744
                            | proportion | 31.57% | 21.51% | 29.57% | 17.35% | 100%
Condensed Nearest Neighbour | number     | 119    | 63     | 47     | 205    | 434
                            | proportion | 27.42% | 14.51% | 10.83% | 47.24% | 100%
Neighbourhood Cleaning Rule | number     | 330    | 63     | 12     | 350    | 755
                            | proportion | 43.71% | 8.34%  | 1.59%  | 46.36% | 100%
Tomek Links                 | number     | 393    | 63     | 77     | 490    | 1023
                            | proportion | 38.42% | 6.16%  | 7.53%  | 47.89% | 100%
Random Undersampling        | number     | 63     | 63     | 63     | 63     | 252
                            | proportion | 25.0%  | 25.0%  | 25.0%  | 25.0%  | 100%

As can be seen from Fig. 2 above, in the original dataset the proportions of t0, t1, t2 and t3 were 38.44%, 6.16%, 7.52% and 47.88%, respectively. Clearly, the numbers of t0 and t3 samples far exceed those of the other types.

In the meantime, as shown in Table 4, due to the unbalanced class distribution in the dataset, the SMOTE and SMOTENN algorithms were used to process the dataset respectively, giving the class proportions of the processed training and test sets. SMOTE and SMOTENN improve the imbalance of the class sample distribution in the training set.

As shown in Table 5, after comparing many kinds of set-balancing methods, SMOTE and SMOTENN have the best effect. Therefore, the authors chose these two methods to deal with the class-imbalanced dataset.

3.3. Baseline XGBoost

Bagging and Boosting are both model fusion methods that combine weak classifiers to form a strong classifier with better results than the best weak classifier. Bagging and Boosting differ in sample selection, sample weighting, prediction function and parallel computation. In general, Bagging reduces variance, while Boosting reduces bias.

The XGBoost algorithm is a Boosting ensemble algorithm, usually based on decision trees. Each newly generated tree continuously learns the residuals between the predicted values and the true values of the current ensemble, and the learning results of multiple trees are finally accumulated as the prediction result. XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible and portable; it implements machine learning algorithms in the Gradient Boosting framework and provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. XGBoost is an improvement of the gradient boosting algorithm: Newton's method is used to solve for the extremum of the loss function, the loss function is Taylor-expanded to second order, and regularization terms are added to the loss function. The objective function during training consists of two parts: the first part is the loss of the gradient boosting algorithm, and the second part is the regularization term. The loss function is defined as Eq. (1):

$L(\Phi) = \sum_{a=1}^{n} l\left(y_a', y_a\right) + \sum_{b} \Omega(f_b)$   (1)

where n is the number of training samples, l is the loss of a single sample (assumed to be a convex function), $y_a'$ is the model's predicted value for training sample a, $y_a$ is the true label of training sample a, and b runs over the trees contributing regularization terms. The regularization term defines the complexity of the model as Eq. (2):

$\Omega(f_b) = \alpha T + \frac{1}{2}\beta \lVert \omega \rVert^2$   (2)

where α and β are manually set parameters, ω is the vector formed by the values of all leaf nodes in the decision tree, and T is the number of leaf nodes.
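In the XGBoost library, the two regularization constants of Eq. (2) map directly onto tunable hyperparameters: the per-leaf penalty α corresponds to gamma (the minimum loss reduction a new leaf must bring) and the L2 coefficient β corresponds to lambda. A minimal sketch with the native API (the numeric values here are illustrative, not the paper's tuned settings):

    import xgboost as xgb

    params = {
        "objective": "multi:softmax",  # multi-class prediction with the Eq. (1) loss
        "num_class": 4,                # t0..t3
        "eta": 0.3,
        "max_depth": 7,
        "gamma": 0.1,    # alpha in Eq. (2): cost per leaf (the alpha*T term)
        "lambda": 1.0,   # beta in Eq. (2): L2 penalty on the leaf-value vector omega
    }
    # With X_train_bal, y_train_bal from the preprocessing sketch above:
    # booster = xgb.train(params, xgb.DMatrix(X_train_bal, label=y_train_bal),
    #                     num_boost_round=30)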

Fig. 3. OVR-XGBoost.


Fig. 4. OVO-XGBoost.

so simple, and most of the problems encountered are multi-classification problems. Although some binary learning methods can be applied to multi-classification problems, in more cases decomposition and fusion is used to divide a multi-classification task into multiple binary tasks and solve those. The most important questions are how to split the multi-classification task and how to integrate the multiple classifiers. Common decomposition and fusion methods are one-versus-one (OVO) and one-versus-rest (OVR).

3.4.1. OVR-XGBoost

In one-versus-rest (OVR) training, the samples of one category are grouped into one class and all remaining samples into another, so that k XGBoosts can be constructed from the samples of k categories. In classification, an unknown sample is assigned to the category with the maximum classification function value. For example, in the theft dataset there are four theft types to be separated, namely t0, t1, t2 and t3, whose training sets are extracted separately.

The four training sets were used for training respectively, and four training result files were obtained. During testing, the corresponding test vectors were scored with the four training result files respectively, so each test produced the results f1, f2, f3 and f4. The final classification result of the OVR-XGBoost model can be abstracted into the formula of Eq. (3), and the OVR-XGBoost model is shown in Fig. 3:

$\text{classification}_{\mathrm{OVR}} = \max\{ f_i \mid i = 1, 2, 3, 4, \cdots, n \}$   (3)

3.4.2. OVO-XGBoost

One-versus-one (OVO) designs an XGBoost between every two types of samples, so t(t-1)/2 XGBoosts are required for samples of t classes. When classifying an unknown sample, the category with the most votes is taken as its category. For example, in the theft dataset there are four theft types to be separated, namely t0, t1, t2 and t3. During training, the vectors corresponding to the pairs t0, t1; t0, t2; t0, t3; t1, t2; t1, t3; and t2, t3 are used as training sets, and six training results are obtained. During testing, the corresponding vectors are tested against the six results respectively, voting is applied, and finally one set of results is obtained.

The final result, the category with the largest vote count, is abstracted into the classification formula of Eq. (4); the OVO-XGBoost model is shown in Fig. 4:

$\text{classification}_{\mathrm{OVO}} = \max\{ N_i \mid i = 1, 2, 3, 4, \cdots, n \}$   (4)

4. Results and discussion

4.1. Experimental configuration

The experiments are based on the Windows 10 Professional operating system; the Python integrated environment Anaconda3 is used, with Python version 3.8.8. The configuration of the experimental machine is given in Table 6.

Table 6
Experimental machine configuration.

CPU        | AMD Ryzen 7 5800H
CPU cores  | 8 cores and 16 threads
Memory     | 16 GB
GPU        | NVIDIA GeForce RTX 3060
GPU memory | 6 GB

4.2. Experimental results

4.2.1. Effect of the baseline XGBoost algorithm

According to the dataset processed in Table 3, the baseline XGBoost algorithm was used to classify and predict the theft cases. The parameter settings of the baseline XGBoost algorithm are as follows: n_estimators = 30, learning_rate = 0.3, max_depth = 7, minimum leaf node sample weight min_child_weight = 1, colsample_bytree = 0.8, objective = 'multi:softmax', and number of categories num_class = 4. The effects of a single baseline XGBoost classification are shown in Table 7.

Table 7
Classification effect of the baseline XGBoost algorithm.

class        | precision | recall | f1-score | support | Train_set | Test_set
3            | 0.68      | 0.92   | 0.78     | 210     |           |
2            | 0.80      | 0.12   | 0.21     | 33      |           |
1            | 0.00      | 0.00   | 0.00     | 27      |           |
0            | 0.86      | 0.75   | 0.80     | 169     |           |
accuracy     |           |        |          | 439     | 76.34%    | 81.80%
macro avg    | 0.58      | 0.45   | 0.45     | 439     |           |
weighted avg | 0.71      | 0.74   | 0.70     | 439     |           |
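The two decomposition and fusion schemes of Section 3.4 can be emulated with scikit-learn's generic wrappers around an XGBoost base learner. The authors' own OVR-XGBoost/OVO-XGBoost implementations are not published, so the sketch below is only an illustration of Eqs. (3) and (4), reusing the baseline hyperparameters of Section 4.2.1:

    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from xgboost import XGBClassifier

    base = XGBClassifier(
        n_estimators=30, learning_rate=0.3, max_depth=7,
        min_child_weight=1, colsample_bytree=0.8,
    )

    # OVR: k binary XGBoosts; Eq. (3) keeps the class with the largest score f_i.
    ovr_xgb = OneVsRestClassifier(base)

    # OVO: t(t-1)/2 binary XGBoosts; Eq. (4) keeps the class with the most votes N_i.
    ovo_xgb = OneVsOneClassifier(base)

    # Usage, with the balanced training data from Section 3.2:
    # ovr_xgb.fit(X_train_bal, y_train_bal); y_pred = ovr_xgb.predict(X_test)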


Fig. 5. Comparison of the K-fold cross-validation effects.

4.2.2. Comparison of the XGBoost algorithm with other Boosting and Bagging algorithms

In this paper, XGBoost is compared with Boosting algorithms such as AdaBoost and with Bagging algorithms such as random forest. K-fold cross-validation is used to show the K-fold (K = 10) cross-validation results and average performance of these three ensemble algorithms with the same hyperparameters, as shown in Fig. 5. It is clear that after 10-fold cross-validation, the XGBoost model performs better than AdaBoost and random forest on most folds of this theft data set, and the XGBoost model has the best average performance, so this paper chooses XGBoost as the baseline model.

Table 8
Classification effect of the four binary models for t0, t1, t2 and t3 based on OVR-XGBoost.

Model   | class        | precision | recall | f1-score | support | Train_set | Test_set | AUC
Model 1 | 1            | 0.86      | 0.74   | 0.79     | 169     |           |          | 0.83
        | 0            | 0.85      | 0.92   | 0.88     | 270     |           |          |
        | accuracy     |           |        |          | 439     | 86.61%    | 85.19%   |
        | macro avg    | 0.85      | 0.83   | 0.84     | 439     |           |          |
        | weighted avg | 0.85      | 0.85   | 0.85     | 439     |           |          |
Model 2 | 1            | 0.20      | 0.04   | 0.06     | 27      |           |          | 0.51
        | 0            | 0.94      | 0.99   | 0.96     | 412     |           |          |
        | accuracy     |           |        |          | 439     | 94.13%    | 93.17%   |
        | macro avg    | 0.57      | 0.51   | 0.51     | 439     |           |          |
        | weighted avg | 0.89      | 0.93   | 0.91     | 439     |           |          |
Model 3 | 1            | 0.38      | 0.15   | 0.22     | 33      |           |          | 0.57
        | 0            | 0.93      | 0.98   | 0.96     | 406     |           |          |
        | accuracy     |           |        |          | 439     | 94.04%    | 91.80%   |
        | macro avg    | 0.66      | 0.57   | 0.59     | 439     |           |          |
        | weighted avg | 0.89      | 0.92   | 0.90     | 439     |           |          |
Model 4 | 1            | 0.71      | 0.81   | 0.76     | 210     |           |          | 0.75
        | 0            | 0.80      | 0.69   | 0.74     | 229     |           |          |
        | accuracy     |           |        |          | 439     | 82.11%    | 74.94%   |
        | macro avg    | 0.85      | 0.75   | 0.77     | 439     |           |          |
        | weighted avg | 0.83      | 0.81   | 0.80     | 439     |           |          |
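The comparison behind Fig. 5 can be reproduced in outline with scikit-learn's cross_val_score. A sketch assuming the encoded features X and labels y from Section 3.2 (the shared hyperparameters are illustrative, and the resulting scores will not exactly match the paper's):

    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    models = {
        "AdaBoost": AdaBoostClassifier(n_estimators=30, learning_rate=0.3),
        "RandomForest": RandomForestClassifier(n_estimators=30, max_depth=7),
        "XGBoost": XGBClassifier(n_estimators=30, learning_rate=0.3, max_depth=7),
    }

    def compare_models(X, y, k=10):
        # K-fold cross-validated accuracy for each model on the same data.
        for name, model in models.items():
            scores = cross_val_score(model, X, y, cv=k, scoring="accuracy")
            print(f"{name}: mean accuracy {scores.mean():.4f}")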


Table 9
Classification effect of the six binary models for t0, t1, t2 and t3 based on OVO-XGBoost.

Model   | class        | precision | recall | f1-score | support | Train_set | Test_set | AUC
Model 1 | 1            | 0.91      | 0.95   | 0.93     | 169     |           |          | 0.68
        | 0            | 0.55      | 0.41   | 0.47     | 27      |           |          |
        | accuracy     |           |        |          | 196     | 91.23%    | 87.24%   |
        | macro avg    | 0.73      | 0.68   | 0.70     | 196     |           |          |
        | weighted avg | 0.86      | 0.87   | 0.86     | 196     |           |          |
Model 2 | 1            | 0.90      | 0.98   | 0.94     | 169     |           |          | 0.72
        | 0            | 0.94      | 0.99   | 0.96     | 33      |           |          |
        | accuracy     |           |        |          | 202     | 89.57%    | 89.11%   |
        | macro avg    | 0.85      | 0.72   | 0.76     | 202     |           |          |
        | weighted avg | 0.88      | 0.89   | 0.88     | 202     |           |          |
Model 3 | 1            | 0.87      | 0.76   | 0.71     | 169     |           |          | 0.84
        | 0            | 0.83      | 0.91   | 0.87     | 210     |           |          |
        | accuracy     |           |        |          | 379     | 86.18%    | 84.43%   |
        | macro avg    | 0.85      | 0.84   | 0.84     | 379     |           |          |
        | weighted avg | 0.85      | 0.84   | 0.84     | 379     |           |          |
Model 4 | 1            | 0.96      | 0.81   | 0.88     | 27      |           |          | 0.89
        | 0            | 0.86      | 0.97   | 0.91     | 33      |           |          |
        | accuracy     |           |        |          | 60      | 86.43%    | 90%      |
        | macro avg    | 0.91      | 0.89   | 0.90     | 60      |           |          |
        | weighted avg | 0.91      | 0.90   | 0.90     | 60      |           |          |
Model 5 | 1            | 0.40      | 0.07   | 0.12     | 27      |           |          | 0.53
        | 0            | 0.89      | 0.99   | 0.94     | 210     |           |          |
        | accuracy     |           |        |          | 237     | 90.24%    | 88.19%   |
        | macro avg    | 0.65      | 0.53   | 0.53     | 237     |           |          |
        | weighted avg | 0.84      | 0.88   | 0.84     | 237     |           |          |
Model 6 | 1            | 0.80      | 0.12   | 0.21     | 33      |           |          | 0.56
        | 0            | 0.88      | 1.00   | 0.93     | 210     |           |          |
        | accuracy     |           |        |          | 243     | 90.30%    | 87.65%   |
        | macro avg    | 0.84      | 0.56   | 0.57     | 243     |           |          |
        | weighted avg | 0.87      | 0.88   | 0.83     | 243     |           |          |

Fig. 6. Comparison of the classification accuracy curves of 20 experiments based on the original dataset.

4.2.3. Effect of the OVR-XGBoost algorithm

According to the requirements of the OVR method, the dataset was processed and the stolen dataset was divided into k categories (k = 4). Four XGBoost models were established for the four binary datasets constructed according to the OVR method. The results of a single experiment are shown in Table 8.

4.2.4. Effect of the OVO-XGBoost algorithm

According to the design of the OVO method, the dataset was processed and t(t-1)/2 XGBoost models were designed for samples of t categories, i.e., six XGBoost models for samples of four categories. The results of a single experiment are shown in Table 9.

4.2.5. Comparison of model effects

After 20 experiments, the comparison of the proposed fusion models under the three data processing conditions was obtained; the comparison is shown in Figs. 6, 7 and 8.

The average performance comparison over the above 20 experiments is given in Table 10. In Table 10, the overall accuracy is the number of correctly classified test samples divided by the total number of test samples, while the accuracy of each category is the number of correctly classified test samples of that category divided by the total number of test samples of that category. MacroR refers to the macro recall rate. When training and testing different models multiple times on the same data set, evaluating the same model on multiple data sets, or performing multi-class tasks, many confusion matrices are generated; MacroR is then needed for evaluation. The higher the value of MacroR, the better the model, and the more positive-class samples are classified correctly.

Table 10 shows the average values of the evaluation metrics obtained from the 20 experiments, where the maximum value in each row is highlighted and the minimum value is underlined.
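The three quantities reported in Table 10 can be computed from a model's predictions as follows (a sketch: per-category accuracy as defined above coincides with per-class recall, and recall_score with average='macro' is the macro recall that this paper calls MacroR):

    import numpy as np
    from sklearn.metrics import accuracy_score, recall_score

    def report(y_true, y_pred, labels=(0, 1, 2, 3)):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        # Overall accuracy: correctly classified test samples / all test samples.
        print("overall accuracy:", accuracy_score(y_true, y_pred))
        # Per-category accuracy: correct samples of class c / samples of class c.
        for c in labels:
            mask = y_true == c
            print(f"t{c} accuracy:", (y_pred[mask] == c).mean())
        # MacroR: unweighted mean of the per-class recalls.
        print("MacroR:", recall_score(y_true, y_pred, average="macro"))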


Fig. 7. Comparison of the classification accuracy curves of 20 experiments based on the dataset processed with SMOTE.

Fig. 8. Comparison of the classification accuracy curves of 20 experiments based on the dataset processed with SMOTENN.

Table 10
Average performance of the models built on the class-unbalanced datasets.

Model type  | Grading          | Original dataset | Dataset processed with SMOTE | Dataset processed with SMOTENN
XGBoost     | overall accuracy | 74.23%           | 66.10%                       | 59.51%
            | t0 accuracy      | 84.45%           | 85.45%                       | 84.15%
            | t1 accuracy      | 0.00%            | 24.60%                       | 24.55%
            | t2 accuracy      | 56.85%           | 39.45%                       | 25.25%
            | t3 accuracy      | 69.30%           | 80.85%                       | 78.75%
            | macroR           | 45.20%           | 69.30%                       | 61.40%
OVR-XGBoost | overall accuracy | 89.58%           | 81.86%                       | 80.73%
            | t0 accuracy      | 85.60%           | 82.75%                       | 77.70%
            | t1 accuracy      | 1.00%            | 25.15%                       | 21.60%
            | t2 accuracy      | 55.70%           | 24.90%                       | 24.75%
            | t3 accuracy      | 70.40%           | 70.05%                       | 68.75%
            | macroR           | 65.68%           | 75.39%                       | 75.43%
OVO-XGBoost | overall accuracy | 86.69%           | 83.18%                       | 81.24%
            | t0 accuracy      | 90.54%           | 92.98%                       | 91.50%
            | t1 accuracy      | 58.44%           | 54.93%                       | 52.70%
            | t2 accuracy      | 69.80%           | 59.27%                       | 56.15%
            | t3 accuracy      | 86.62%           | 89.33%                       | 89.67%
            | macroR           | 69.84%           | 77.77%                       | 78.08%

As can be seen from the table, both the OVR-XGBoost and OVO-XGBoost models performed better than the baseline XGBoost model on both the raw data and the data sets processed by the two rebalancing algorithms, with little difference in overall accuracy between the two models. In the original data set, class t1 was a small sample: the classification accuracy of the baseline XGBoost model was 0, while that of the OVO-XGBoost model improved to 58.44%. After SMOTE and SMOTENN, the baseline model reached only about 25% accuracy on the minority classes, whereas OVO-XGBoost still had over 50% accuracy. Because the class distribution of this data set is uneven, after SMOTE and SMOTENN the class sample distribution becomes basically balanced; the overall classification accuracy and the per-class accuracies of the XGBoost, OVR-XGBoost and OVO-XGBoost models decrease, but the MacroR accuracy improves.

Among them, the classification accuracy of the XGBoost model is significantly reduced, indicating that the model is not suitable for the data sets processed by these two algorithms. However, the classification accuracy of the OVR-XGBoost and OVO-XGBoost models is not significantly reduced, while the MacroR accuracy is greatly improved. Overall, the OVO-XGBoost model preprocessed by the SMOTENN algorithm fits the dataset best and performs better than the baseline XGBoost model.

5. Conclusions

This study adopts the 2019 crime data of H city provided by the Public Security Bureau of H City and performs multi-class prediction of theft crime. A multi-class theft case prediction model based on decomposition and fusion XGBoost is proposed: by combining XGBoost with the two decomposition and fusion methods one-versus-one and one-versus-rest, two multi-class models are established. The multi-classification problem is decomposed into multiple binary classifications, and the two multi-class models are compared with the baseline XGBoost model. It is found that the prediction accuracy of the OVR-XGBoost and OVO-XGBoost models is higher than that of the baseline XGBoost model. Because the class distribution of the data set is unbalanced, the SMOTE and SMOTENN algorithms are adopted to preprocess the data set and balance the class distribution, and comparison and verification are carried out based on the above models. This kind of multi-classification model can classify and predict theft types, predicting the likely type of theft from factors such as the weather, the time and the likely targets, which can greatly improve theft prevention and control work and has a positive influence on optimizing the use of police forces.

There are shortcomings in this study and directions for future research: 1. Due to the limitations of current police data collection means, the number of samples in the data set is limited; after data collection is improved in the future, the available data sets will grow considerably. 2. The classification effect of the baseline model can be improved by turning the multi-classification problem into binary classification problems, and the performance may differ when the decomposition and fusion method is combined with other machine learning models to predict public security big data.

CRediT authorship contribution statement

Zhongzhen Yan: Methodology, Software, Writing – original draft. Hao Chen: Data curation. Xinhua Dong: Visualization, Investigation. Kewei Zhou: Software, Validation. Zhigang Xu: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

Acknowledgment

This research is supported by the Key-Area Research and Development Program of Guangdong Province, project number 2020B1111420002, project title: Research on Key Technologies of Collaborative Research and Judgment of Anti-Drug Intelligence and Its Application Demonstration.

References

Ali, M., Abdullah, S., Raizal, C. S., Rohith, K. F., & Menon, V. G. (2019). A novel and efficient real time driver fatigue and yawn detection-alert system. In 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI) (pp. 687–691). https://doi.org/10.1109/ICOEI.2019.8862691

Amrit, C., & Poel, M. (2021). Customer purchase prediction from the perspective of imbalanced data: A machine learning framework based on factorization machine. Expert Systems with Applications, 173, 114756. https://doi.org/10.1016/j.eswa.2021.114756

Ding, Y., Fan, L., & Liu, X. (2021). Analysis of feature matrix in machine learning algorithms to predict energy consumption of public buildings. Energy and Buildings, 249(9), 111208. https://doi.org/10.1016/j.enbuild.2021.111208

Guotian, X., & Shen, Y. (2021). Multiple classification detection method for malware based on XGBoost and Stacking fusion model. Netinfo Security, 21(6), 52–62. https://doi.org/10.3969/j.issn.1671-1122.2021.06.007

Hq, A., Bw, B., My, C., Sg, C., & You, S. (2021). Financial distress prediction using a corrected feature selection measure and gradient boosted decision tree. Expert Systems with Applications, 190, 116202. https://doi.org/10.1016/j.eswa.2021.116202

Ivacu, C. F. (2020). Option pricing using machine learning. Expert Systems with Applications, 163, 113799. https://doi.org/10.1016/j.eswa.2020.113799

Jie, S., Hamido, F., Yujiao, Z., & Wenguo, A. (2021). Multi-class financial distress prediction based on support vector machines integrated with the decomposition and fusion methods. Information Sciences, 559, 153–170. https://doi.org/10.1016/j.ins.2021.01.059

Kwon, E., Jung, S., & Lee, J. (2021). Artificial neural network model development to predict theft types in consideration of environmental factors. ISPRS International Journal of Geo-Information, 10(2), 99. https://doi.org/10.3390/ijgi10020099

LanLan, H. (2021). On the legislative perfection of larceny. Guangxi Quality Supervision Guide Periodical, 297–298. https://kns.cnki.net/kcms/detail/detail.aspx?FileName=GXZL202103140&DbName=CJFQ2021

LingLing, X., & DongXiang, C. (2020). Machine learning classification strategies for unbalanced data sets. Computer Engineering and Applications, 56(24), 12–27. https://doi.org/10.3778/j.issn.1002-8331.2007-0120

Mateo, C., Iniesta, J., Jenninger, B., Gómez-Sanchís, J., & Chiggiato, P. (2021). Automatic mass spectra recognition for ultra high vacuum systems using multilabel classification. Expert Systems with Applications, 114959. https://doi.org/10.1016/j.eswa.2021.114959

Mushava, J., & Murray, M. (2022). A novel XGBoost extension for credit scoring class-imbalanced data combining a generalized extreme value link and a modified focal loss function. Expert Systems with Applications, 202, 117233. https://doi.org/10.1016/j.eswa.2022.117233

Nicolis, O., & Marquez, B. (2020). Predicting motor vehicle theft in Santiago de Chile using graph-convolutional LSTM (pp. 1–7). https://doi.org/10.1109/SCCC51225.2020.9281174

Song, R., Huang, L., Cui, W., Óskarsdóttir, M., & Vanthienen, J. (2020). Fraud detection of bulk cargo theft in port using Bayesian network models. Applied Sciences, 10, 1056. https://doi.org/10.3390/app10031056

Wang, K., Tian, J., Zheng, C., Yang, H., & Zhang, Y. (2021). Improving risk identification of adverse outcomes in chronic heart failure using SMOTE+ENN and machine learning. Risk Management and Healthcare Policy, 14, 2453–2463. https://doi.org/10.2147/RMHP.S310295

XiaoLing, H., Zhenghui, L., Yilun, J., & Wenyu, Z. (2022). Fair-AdaBoost: Extending AdaBoost method to achieve fair classification. Expert Systems with Applications, 117240. https://doi.org/10.1016/j.eswa.2022.117240

Xiao, R. (2021). Spatial-temporal analysis and prevention and control countermeasures of theft cases in R City. Computer and Information Technology, 29(4), 62–66. https://kns.cnki.net/kcms/detail/detail.aspx?FileName=DNXJ202104017&DbName=CJFQ2021

XinGe, H. (2021). Prediction method of theft crimes in urban areas: An integrated model of LSTM and ST-GCN. People's Public Security University of China. https://kns.cnki.net/KCMS/detail/detail.aspx?dbname=CMFD202102&filename=1021603154.nh

XiWen, Q., Rui, W., AiJun, Y., XiaoGang, D., & SiQi, Z. (2021). Application of feature selection algorithm based on F-Score in multi-classification problem. Journal of Changchun University of Technology, 42(02), 128–134. https://doi.org/10.15923/j.cnki.cn22-1382/t.2021.2.06

Zelenkov, Y., & Volodarskiy, N. (2021). Bankruptcy prediction on the base of the unbalanced data using multi-objective selection of classifiers. Expert Systems with Applications, 185, 115559. https://doi.org/10.1016/j.eswa.2021.115559

Zhou, B., Chen, L., Zhao, S., Zhou, F., & Pan, G. (2021). Spatio-temporal analysis of urban crime leveraging multisource crowdsensed data. Personal and Ubiquitous Computing, (8), 1–14. https://doi.org/10.1007/s00779-020-01456-6

Zs, A., Sak, A., & Kdv, B. (2020). A neighborhood undersampling stacked ensemble (NUS-SE) in imbalanced classification. Expert Systems with Applications, 168, 114246. https://doi.org/10.1016/j.eswa.2020.114246
