
ORIGINAL RESEARCH

published: 29 April 2021


doi: 10.3389/fped.2021.662183

Using Machine Learning to Predict the Diagnosis, Management and Severity of Pediatric Appendicitis

Ricards Marcinkevics 1†, Patricia Reis Wolfertstetter 2*†, Sven Wellmann 3, Christian Knorr 2‡ and Julia E. Vogt 1‡

1 Department of Computer Science, ETH Zurich, Zurich, Switzerland; 2 Department of Pediatric Surgery and Pediatric Orthopedics, Hospital St. Hedwig of the Order of St. John of God, University Children's Hospital Regensburg (KUNO), Regensburg, Germany; 3 Division of Neonatology, Hospital St. Hedwig of the Order of St. John of God, University Children's Hospital Regensburg (KUNO), University of Regensburg, Regensburg, Germany

Edited by: Francesco Morini, Bambino Gesù Children Hospital (IRCCS), Italy
Reviewed by: José Estevão-Costa, Centro Hospitalar Universitário de São João (CHUSJ), Portugal; Sherif Mohamed Shehata, Tanta University, Egypt
*Correspondence: Patricia Reis Wolfertstetter, patricia.reiswolfertstetter@barmherzige-regensburg.de
†These authors have contributed equally to this work and share first authorship
‡These authors have contributed equally to this work and share last authorship
Specialty section: This article was submitted to Pediatric Surgery, a section of the journal Frontiers in Pediatrics
Received: 31 January 2021; Accepted: 01 April 2021; Published: 29 April 2021
Citation: Marcinkevics R, Reis Wolfertstetter P, Wellmann S, Knorr C and Vogt JE (2021) Using Machine Learning to Predict the Diagnosis, Management and Severity of Pediatric Appendicitis. Front. Pediatr. 9:662183. doi: 10.3389/fped.2021.662183

Background: Given the absence of consolidated and standardized international guidelines for managing pediatric appendicitis and the few strictly data-driven studies in this specific area, we investigated the use of machine learning (ML) classifiers for predicting the diagnosis, management and severity of appendicitis in children.

Materials and Methods: Predictive models were developed and validated on a dataset acquired from 430 children and adolescents aged 0-18 years, based on a range of information encompassing history, clinical examination, laboratory parameters, and abdominal ultrasonography. Logistic regression, random forests, and gradient boosting machines were used for predicting the three target variables.

Results: A random forest classifier achieved areas under the precision-recall curve of 0.94, 0.92, and 0.70, respectively, for the diagnosis, management, and severity of appendicitis. We identified smaller subsets of 6, 17, and 18 predictors for each of the targets that sufficed to achieve the same performance as the model based on the full set of 38 variables. We used these findings to develop the user-friendly online Appendicitis Prediction Tool for children with suspected appendicitis.

Discussion: This pilot study considered the most extensive set of predictor and target variables to date and is the first to simultaneously predict all three targets in children: diagnosis, management, and severity. Moreover, this study presents the first ML model for appendicitis that was deployed as an open-access, easy-to-use online tool.

Conclusion: ML algorithms help to overcome the diagnostic and management challenges posed by appendicitis in children and pave the way toward a more personalized approach to medical decision-making. Further validation studies are needed to develop a finished clinical decision support system.

Keywords: appendicitis, pediatrics, predictive medicine, machine learning, classification


INTRODUCTION

Appendicitis is among the commonest childhood diseases, accounting for a third of admissions for abdominal pain (1). Life-time risk ranges from 6 to 9%, and incidence is highest between 10 and 19 years of age (2). Perforation rates are significantly higher in preschool children than in older children or adults (3).

Diagnosis remains essentially clinical, backed by laboratory data and imaging. In a pooled analysis of serum biomarkers for diagnosing acute appendicitis and perforation, Acharya et al. reported areas under the receiver operating characteristic (AUROC) of 0.75 and 0.69, respectively, for the white blood cell (WBC) count and 0.80 and 0.78 for C-reactive protein (CRP) (4). Despite increasing research there remains no specific biomarker for predicting acute appendicitis in clinical practice (4, 5). Abdominal and, specifically, appendix ultrasonography (US) is the standard imaging modality in children, being low-cost, non-invasive and repeatable, but it remains operator-dependent. Reported sensitivities and specificities for US-based diagnosis range widely: from 87 to 100%, and from 15 to 95% (6). The scores most frequently used to assist physicians in risk-stratifying children with abdominal pain are the Alvarado Score (AS) and Pediatric Appendicitis Score (PAS) (Supplementary Table 1) (7, 8). They may help to exclude appendicitis in an emergency setting (AUROC 0.84 for AS ≤ 3 and PAS ≤ 2) (9), but neither is in widespread routine use.

There are still no consistent and widely used international guidelines for managing acute appendicitis in children. Minimally invasive appendectomy remains the standard treatment of acute appendicitis despite increasing evidence of similar results being achieved by conservative therapy with antibiotics (10, 11), not to mention the reports of spontaneous resolution in uncomplicated cases suggesting that an antibiotic-free approach might be effective in selected school-age children (1, 12).

Machine learning (ML) enhances the early detection and monitoring of multiple medical conditions (13). Supervised learning models leverage large amounts of labeled data to extract complex statistical patterns predictive of a target variable, often achieving superhuman performance levels (14). In this study we applied ML to achieve three outcomes: diagnosing appendicitis in children with abdominal pain; guiding management (conservative without antibiotics vs. operative); and risk stratifying severity (gangrene and perforation). Our aim was to develop and validate a pilot ML tool to support physicians in diagnosing appendicitis at presentation, assessing severity, and deciding management. The purpose of this paper is not to develop a finished clinical decision support system, but rather to present a pilot study for a promising research prototype based on machine learning. To the best of our knowledge, this is the first study using ML to simultaneously predict diagnosis, conservative vs. operative management, and severity in children with suspected appendicitis.

MATERIALS AND METHODS

Data Acquisition
The cohort study included all children and adolescents aged 0-18 years admitted with abdominal pain and suspected appendicitis to the Department of Pediatric Surgery at the tertiary Children's Hospital St. Hedwig in Regensburg, Germany, over the 3-year period from January 1, 2016 to December 31, 2018. Non-inclusion criteria were prior appendectomy, abdominal conditions such as chronic inflammatory bowel disease or intestinal duplication, simultaneous appendectomy, and treatment with antibiotics for concurrent disease such as pneumonia, resulting in a final total of 430 patients (Table 1). The study was approved by the University of Regensburg institutional review board (no. 18-1063-101), which also waived informed consent to routine data analysis. For patients followed up after discharge, informed consent was obtained from parents or legal representatives. All methods were performed in accordance with the relevant guidelines and regulations.

Conservative management was defined as intravenous fluids, enemas, analgesics, and clinical/US monitoring without antibiotics in an inpatient setting. For patients with criteria for simple appendicitis presenting clinical and sonographic improvement, non-operative therapy was maintained; otherwise, they underwent operation. Appendectomy was laparoscopic in 88% of cases and traditional in 12%. Histological and intra-operative findings were assessed. The routine procedure for children and adolescents with suspected appendicitis is summarized in Supplementary Figure 1.

Data Description
Our analysis considered predictive models for three binary response variables:

• diagnosis: appendicitis (n = 247, 57.21%) and no appendicitis (n = 183, 42.79%)
• management: surgical (n = 165, 38.37%) and conservative (n = 265, 61.63%)
• severity: complicated (n = 51, 11.86%) and uncomplicated appendicitis or no appendicitis (n = 379, 88.14%).

TABLE 1 | Counts of patients in different diagnosis, management, and severity categories.

                        | Appendicitis: Uncomplicated/Complicated | No appendicitis: Uncomplicated/Complicated | Total: Uncomplicated/Complicated
Surgical management     | 114/51                                  | 0/0                                        | 114/51
Conservative management | 82/0                                    | 183/0                                      | 265/0
Total                   | 196/51                                  | 183/0                                      | 379/51

Rows correspond to different management categories; columns correspond to different diagnoses. Each cell contains counts of patients with uncomplicated appendicitis or without appendicitis and with complicated appendicitis (separated by "/") in the corresponding subgroup.


The "appendicitis" category included both acute and subacute cases, while "surgical" comprised primary and secondary surgical treatment. It is important to note that we could not confirm the diagnosis in every patient: histology was only possible in patients who underwent surgery. Conservatively treated patients were retrospectively assigned the "appendicitis" label only if they had AS and/or PAS values ≥ 4 and an appendix diameter ≥ 6 mm. Diagnosis was a proxy for confirmed disease status. Patients with the above criteria for appendicitis who were first treated conservatively (n = 86) were contacted at least 6 months after discharge (mean 28 months). We reached 61 individuals, five of whom had since undergone appendectomy and were therefore included in the surgical group. Appendicitis was classified as "uncomplicated" in all conservatively treated cases. The "uncomplicated" category also included patients without appendicitis since none had complications during treatment; it was almost 8 times larger than the "complicated" category. To address this major imbalance, we investigated the use of cost-sensitive classification models, e.g., by introducing prior category probabilities in random forest models (15), but performance was not markedly improved. The other two category pairs were reasonably balanced. Table 1 contains detailed counts of patients within different diagnosis, management, and severity categories.
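As an illustration of the cost-sensitive option mentioned above, the following minimal R sketch passes prior class probabilities to the randomForest package (15) through its classwt argument. The data frame dat, the column name severity, and the chosen priors are placeholders, not the study's actual settings.

library(randomForest)

# Hypothetical data frame `dat` with a binary severity label; the rare
# "complicated" class is the second factor level.
dat$severity <- factor(dat$severity, levels = c("uncomplicated", "complicated"))

set.seed(1)
rf_weighted <- randomForest(
  severity ~ .,
  data    = dat,
  ntree   = 500,
  classwt = c(0.5, 0.5)  # prior class probabilities, in the order of the factor levels
)
print(rf_weighted)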
Our analysis considered 38 predictor variables including patient and US data. Variables were continuous, binary, and categorical. All were measured before treatment was assigned and none represent intraoperative findings. Supplementary Table 2 contains explanations of all 38 predictor variables included in the model development and validation.

We computed summary statistics for patient subgroups, based on the three responses. Statistical tests for differences between subgroups were performed in the R programming language (version 3.6.2) (16). Summary and test statistics were based on non-missing data only. Chi-squared tests of independence were used for discrete variables and unpaired two-sided Mann-Whitney U-tests for continuous variables; p-values were adjusted for multiple comparisons using Hommel's method (17). A level of α = 0.05 was chosen for statistical significance. Predictors with several categories were binarized prior to the chi-squared test.
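These subgroup comparisons can be reproduced with base R functions, as in the minimal sketch below; dat, sex, age, and diagnosis are placeholder names for one discrete predictor, one continuous predictor, and a binary response, not the study's actual variables.

# Chi-squared test of independence for a discrete predictor
p_sex <- chisq.test(table(dat$sex, dat$diagnosis))$p.value

# Unpaired two-sided Mann-Whitney U-test for a continuous predictor
p_age <- wilcox.test(age ~ diagnosis, data = dat)$p.value

# Adjustment for multiple comparisons with Hommel's method (17)
p_adjusted <- p.adjust(c(sex = p_sex, age = p_age), method = "hommel")
p_adjusted <= 0.05  # significance at level alpha = 0.05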
Preprocessing
The dataset contained missing values. As a preprocessing step, we performed missing data imputation using the k-nearest neighbors (k-NN) (with k = 5) method based on Gower distance (18), as implemented in the R VIM package (19). This method imputes missing variables in every instance based on values occurring within the proximity given by Gower distance for continuous, categorical, and ordered variables (19). To avoid data leakage and the introduction of spurious associations between predictor and response variables, we performed the imputation without response variables and separately for train and test sets.
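A minimal sketch of this step with VIM::kNN (19) is shown below; the data frames train_set and test_set and the response name diagnosis are placeholders, and the authors' exact preprocessing is available in their code repository.

library(VIM)

impute_without_response <- function(df, response) {
  predictors <- setdiff(colnames(df), response)
  # Gower-distance-based k-NN imputation with k = 5; imp_var = FALSE suppresses
  # the indicator columns that kNN() would otherwise append.
  imputed <- kNN(df[, predictors], k = 5, imp_var = FALSE)
  cbind(imputed, df[, response, drop = FALSE])
}

# Imputation is performed without the response and separately per split,
# so no information leaks between the train and test sets.
train_imp <- impute_without_response(train_set, "diagnosis")
test_imp  <- impute_without_response(test_set,  "diagnosis")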
Machine Learning
To predict the above response variables, we trained and validated three different ML models for classification in the R programming language (version 3.6.2) (16):

• logistic regression (LR), as implemented in the R glmnet package (20);
• random forest (RF) (21), as implemented in the R randomForest package (15);
• generalized boosted regression model (GBM) (22), as implemented in the R gbm package (23).

LR is only capable of learning a linear decision boundary to differentiate between classes, whereas the RF and GBM models are non-linear ensemble classification methods and can thus potentially learn more complex patterns. Both RF and GBM achieve this by training many simple classifiers and consequently aggregating their predictions into a single estimate.
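Under assumed names (imputed data frames train_imp and test_imp and a binary response diagnosis with levels "no"/"yes"), fitting the three classifiers with the cited packages might look as follows; the hyperparameters shown are illustrative defaults rather than the study's settings.

library(glmnet)
library(randomForest)
library(gbm)

y_train <- train_imp$diagnosis                           # factor with levels "no"/"yes"
X_train <- model.matrix(diagnosis ~ ., train_imp)[, -1]
X_test  <- model.matrix(diagnosis ~ ., test_imp)[, -1]

# Logistic regression via glmnet; the penalty is chosen by internal cross-validation
lr_fit  <- cv.glmnet(X_train, y_train, family = "binomial")
lr_prob <- predict(lr_fit, newx = X_test, s = "lambda.min", type = "response")[, 1]

# Random forest with variable importance enabled
rf_fit  <- randomForest(diagnosis ~ ., data = train_imp, ntree = 500, importance = TRUE)
rf_prob <- predict(rf_fit, newdata = test_imp, type = "prob")[, "yes"]

# Generalized boosted regression model; gbm's Bernoulli loss expects a 0/1 response
train_gbm <- transform(train_imp, diagnosis = as.integer(diagnosis == "yes"))
gbm_fit   <- gbm(diagnosis ~ ., data = train_gbm, distribution = "bernoulli",
                 n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)
gbm_prob  <- predict(gbm_fit, newdata = test_imp, n.trees = 1000, type = "response")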

To identify which variables were crucial for predictive performance, we compared classifiers trained on the following predictor subsets:

• full set of 38 predictor variables
• without US data ("US-free")
• without the "peritonitis/abdominal guarding" variable
• without US data or the "peritonitis/abdominal guarding" variable.

It was interesting to investigate whether responses could be predicted without including the US variables that might be operator-dependent or unavailable in emergencies (24, 25). We singled out the "peritonitis/abdominal guarding" variable because detection can be unreliable, requiring an experienced examiner; our analysis considered it under three subcategories: (i) no peritonitis/abdominal guarding, (ii) localized, and (iii) generalized.

Evaluation Metrics
To evaluate and compare predictive models, we performed 10-fold cross-validation (CV) (26), using the k-NN method for imputing missing values separately for train and test sets. Ten-fold cross-validation is a standard procedure for the evaluation of ML models, wherein the model is repeatedly trained on 90% of the data and tested on 10% of withheld data for 10 disjoint test folds. Predictive performance was assessed using AUROC and the area under the precision-recall (AUPR) curve (Figure 1) (27). AUPR is particularly informative for classification problems with extreme class imbalance (27). It was therefore more appropriate for comparing models predicting appendicitis severity. We compared model performance using two-sided 10-fold cross-validated paired t-tests at a significance level of α = 0.05 (28). In addition to AUROC and AUPR, the sensitivity, specificity, and negative and positive predictive values of the classifiers were evaluated.

FIGURE 1 | Machine learning analysis schematic. Machine learning models, namely logistic regression (LR), random forest (RF), and generalized boosted regression model (GBM), based on various sets of predictor variables, are evaluated using areas under receiver operating characteristic (AUROC) and precision-recall (AUPR) curves in the 10-fold cross-validation procedure. Ten-fold cross-validation is a standard procedure for evaluating the performance of predictive ML models wherein the model is trained on 90% of the data and tested on the remaining 10% repeatedly for 10 disjoint test folds. In our analysis, missing value imputation was performed separately for train and test sets using the k-nearest neighbors (k-NN) method.
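A minimal sketch of this cross-validated evaluation is given below. It uses caret::createFolds for fold creation and the PRROC package for AUROC and AUPR; these are illustrative implementation choices (the paper does not state which utilities were used), and fit_rf and fit_lr stand for wrappers around the model fitting sketched earlier.

library(caret)
library(PRROC)

set.seed(1)
folds <- createFolds(dat$diagnosis, k = 10)      # 10 disjoint sets of test indices

evaluate_fold <- function(test_idx, fit_fun) {
  train_imp <- impute_without_response(dat[-test_idx, ], "diagnosis")
  test_imp  <- impute_without_response(dat[test_idx, ],  "diagnosis")
  prob <- fit_fun(train_imp, test_imp)           # predicted P(appendicitis) on the fold
  pos  <- prob[test_imp$diagnosis == "yes"]
  neg  <- prob[test_imp$diagnosis == "no"]
  c(auroc = roc.curve(scores.class0 = pos, scores.class1 = neg)$auc,
    aupr  = pr.curve(scores.class0 = pos, scores.class1 = neg)$auc.integral)
}

rf_cv <- sapply(folds, evaluate_fold, fit_fun = fit_rf)   # 2 x 10 matrix of fold scores
lr_cv <- sapply(folds, evaluate_fold, fit_fun = fit_lr)

# Two-sided paired t-test on the fold-wise AUROCs (RF vs. LR), alpha = 0.05
t.test(rf_cv["auroc", ], lr_cv["auroc", ], paired = TRUE)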
Variable Selection
In a clinical setting, variables can be systematically missing at test time. We therefore also examined the importance of predictor variables in case the number of predictors used by classifiers could be reduced without compromising their performance. Both RF and GBM provide measures of variable importance (15, 21, 23). We examined the averages of class-specific measures of variable importance given by the mean decrease in RF accuracy (15). We trained random forests on 300 bootstrap resamples of the data and used boxplots to visualize the distributions of the importance values obtained (29).

In addition, we cross-validated a variable selection procedure based on the RF importance measure to determine the minimal number of variables that could be used without compromising predictive performance. The procedure can be summarized as follows. For the number of predictors q from 1 to 38, repeat:

1. Train the full RF model M_full (all predictor variables included) on the train set. Retrieve the variable importance values.
2. Train the RF model M_q based on the q predictors with the highest importance values, on the train set.
3. Evaluate the AUROC and AUPR of M_q on the test set.
4. Repeat steps 1-3 for all 10 folds in CV.

This procedure evaluates the performance of random forest classifiers that use varying numbers of predictors chosen on the basis of importance values.
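For a single train/test split and a single value of q, steps 1-3 might be implemented as in the sketch below; train_imp, test_imp, and diagnosis are placeholders, and PRROC is again used only for illustration.

library(randomForest)
library(PRROC)

top_q_performance <- function(train_imp, test_imp, q) {
  # Step 1: full model on the train set; mean decrease in accuracy as importance
  rf_full <- randomForest(diagnosis ~ ., data = train_imp, ntree = 500, importance = TRUE)
  imp     <- importance(rf_full, type = 1)[, 1]
  top_q   <- names(sort(imp, decreasing = TRUE))[seq_len(q)]

  # Step 2: reduced model M_q using only the q most important predictors
  rf_q <- randomForest(x = train_imp[, top_q, drop = FALSE], y = train_imp$diagnosis,
                       ntree = 500)

  # Step 3: evaluate M_q on the withheld test fold
  prob <- predict(rf_q, newdata = test_imp[, top_q, drop = FALSE], type = "prob")[, "yes"]
  pos  <- prob[test_imp$diagnosis == "yes"]
  neg  <- prob[test_imp$diagnosis == "no"]
  c(auroc = roc.curve(scores.class0 = pos, scores.class1 = neg)$auc,
    aupr  = pr.curve(scores.class0 = pos, scores.class1 = neg)$auc.integral)
}

# Step 4 corresponds to repeating this over all 10 CV folds and over q = 1, ..., 38.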

Finally, we examined which variable subsets were chosen consistently, for each q. For q from 1 through 38, we trained random forest classifiers on 300 bootstrap resamples of the data and counted how many times each predictor was among the q most important variables. In this way, we could assess the variability of the set of q most important predictors, rather than provide a single selection, which could be unstable because it is based on only one replication of the experiment.
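A sketch of this bootstrap analysis for one value of q is given below; dat and diagnosis are placeholders, and the 300 resamples match the setting described above.

library(randomForest)

set.seed(1)
B <- 300   # number of bootstrap resamples
q <- 3     # size of the examined predictor subset
selected <- character(0)

for (b in seq_len(B)) {
  boot  <- dat[sample(nrow(dat), replace = TRUE), ]
  rf_b  <- randomForest(diagnosis ~ ., data = boot, ntree = 500, importance = TRUE)
  imp_b <- importance(rf_b, type = 1)[, 1]        # mean decrease in accuracy
  selected <- c(selected, names(sort(imp_b, decreasing = TRUE))[seq_len(q)])
}

# How often each predictor appeared among the q most important variables;
# boxplot() over the collected importance values yields plots like Figure 2.
selection_freq <- sort(table(selected), decreasing = TRUE)
selection_freq[selection_freq >= 0.05 * B]        # selected in at least 5% of resamples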
RESULTS

Distributions of several predictors differed significantly (at level α = 0.05) for all three responses, namely, AS, PAS, appendix diameter, body temperature, WBC count, neutrophil percentage, CRP, and peritonitis/abdominal guarding. These variables had previously been identified as useful in predicting appendicitis (8, 30–32). Table 2 and Supplementary Tables 3, 4 show the summary statistics and statistical test results for patient subgroups based on response variables. In general, the descriptive statistics suggested that the data featured strong associations between some predictors and responses.

TABLE 2 | Dataset description for patients with and without appendicitis.

Variable | Appendicitis (n = 247) | No appendicitis (n = 183) | P-value
Age, years | 11.48 [9.18, 13.29] | 12.10 [9.57, 14.46] | 0.6
Male sex, % | 58.13 | 47.83 | 0.5
Height, cm | 149.1 [137.5, 162.0] | 152.2 [139.6, 164.0] | 0.8
Weight, kg | 39.75 [31.00, 52.75] | 47.10 [32.48, 57.08] | 0.4
Body mass index (BMI), kg/m² | 17.84 [15.72, 20.55] | 18.90 [15.95, 22.39] | 0.3
Alvarado score, pts | 7 [5, 8] | 4 [3, 6] | ≤0.001
Pediatric appendicitis score, pts | 5 [4, 7] | 4 [3, 5] | ≤0.001
Peritonitis/abdominal guarding, % | 61.38 | 7.61 | ≤0.001
Migration of pain, % | 30.89 | 18.48 | 0.09
Tenderness in right lower quadrant (RLQ), % | 97.97 | 95.63 | 1.0
Rebound tenderness, % | 40.98 | 25.68 | ≤0.05
Cough tenderness, % | 32.65 | 19.57 | 0.06
Psoas sign, % | 27.85 | 33.91 | 1.0
Nauseous/vomiting, % | 62.20 | 48.37 | 0.1
Anorexia, % | 31.71 | 25.68 | 1.0
Body temperature, °C | 37.75 [37.20, 38.20] | 37.20 [36.80, 37.85] | ≤0.001
Dysuria, % | 3.45 | 7.82 | 0.7
Abnormal stool, % | 28.40 | 27.07 | 1.0
White blood cell count, 10³/µl | 13.80 [10.68, 17.40] | 8.80 [7.00, 11.90] | ≤0.001
Neutrophils, % | 78.95 [70.40, 84.17] | 61.50 [52.35, 77.55] | ≤0.001
C-reactive protein, mg/l | 15.00 [4.00, 46.00] | 1.00 [0.00, 13.00] | ≤0.001
Ketones in urine, % | 44.94 | 31.54 | 0.5
Erythrocytes in urine, % | 23.42 | 20.81 | 1.0
White blood cells in urine, % | 12.03 | 12.75 | 1.0
Visibility of appendix, % | 86.53 | 34.97 | ≤0.001
Appendix diameter, mm | 8.00 [7.00, 10.00] | 5.00 [4.05, 5.28] | ≤0.001
Free intraperitoneal fluid, % | 52.56 | 31.84 | ≤0.01
Irregular appendix layers, % | 41.74 | 11.11 | 0.1
Target sign, % | 67.37 | 9.10 | ≤0.001
Appendix perfusion, % | 74.47 | 12.50 | ≤0.05
Surrounding tissue reaction, % | 86.01 | 16.22 | ≤0.001
Pathological lymph nodes, % | 62.20 | 74.70 | 0.8
Mesenteric lymphadenitis, % | 79.69 | 81.08 | 1.0
Thickening of the bowel wall, % | 55.77 | 19.44 | ≤0.05
Ileus, % | 25.00 | 0.00 | 0.17
Coprostasis, % | 34.15 | 42.42 | 1.0
Meteorism, % | 59.18 | 84.48 | 0.1
Enteritis, % | 16.67 | 69.57 | ≤0.05

Distributions of variables are presented as either medians with interquartile ranges (in square brackets) or percentages. For significant differences, p-values are reported in bold as "≤0.001," "≤0.01," or "≤0.05" (at significance level α = 0.05).

Table 3 shows the 10-fold CV results for the different ML classifiers for predicting diagnosis, management, and severity. For diagnosis classification, full RF (average AUROC: 0.96, average AUPR: 0.94) and GBM (average AUROC: 0.96, average AUPR: 0.94) models significantly outperformed logistic regression (average AUROC: 0.91, average AUPR: 0.88). AUROC and AUPR p-values were 0.002 and 0.006 for RF, and 0.007 and 0.03 for GBM. This suggests benefit from using non-linear classification methods for predicting a diagnosis of appendicitis. The full GBM and RF classifiers performed equally with respect to both evaluation metrics. All ML models performed considerably better than the random classifier, that is, a random guess. On average, classifiers that used the full set of predictors had higher AUROCs and AUPRs than the clinical baselines, such as AS, PAS, and suspected diagnosis, given by hospital specialists. Based on the CV results, US input is crucial for accurately diagnosing appendicitis because average AUROC and AUPR degraded in all models when it was absent. Peritonitis had less influence on prediction quality.

For predicting management, the full RF and GBM models had the highest average AUROC (0.94), while the full GBM had the highest average AUPR (0.93). Both non-linear methods significantly outperformed logistic regression (average AUROC: 0.90, average AUPR: 0.88). AUROC and AUPR p-values were 0.01 and 0.06 (non-significant) for RF, and 0.02 and 0.03 for GBM. All models had considerably better average AUROCs and AUPRs than the random classifier. Based on the CV results, peritonitis is a very important variable for predicting management. Average model performance dropped considerably when removing this variable. US findings did not affect prediction quality as much as when diagnosing appendicitis.

As for appendicitis severity, US-free logistic regression achieved the highest average AUROC (0.91) alongside US-free GBM, while full-set RF achieved the highest average AUPR (0.70) (Table 3). Although all models performed considerably better than the random classifier, complicated appendicitis appeared harder to predict than either diagnosis or management. The AUPRs were much lower, and all models had high variances across the folds. This could be due to the very low prevalence of complicated appendicitis (12% of all patients). There was little gain in performance from using non-linear classification methods. The differences in AUROC and AUPR between RF, GBM, and (US-free) logistic regression were non-significant. AUROC and AUPR p-values were 0.94 and 0.97 for RF, and 0.76 and 0.58 for GBM. US input had almost no effect on average classifier performance, whereas peritonitis was important and its exclusion markedly decreased AUROC and AUPR values in all models.

We also evaluated model sensitivities, specificities, and negative and positive predictive values (NPV/PPV). Tables 4, 5 contain results of the 10-fold CV for all three responses. In this analysis, a threshold of 0.5 was used to predict labels. When incorporating any of these models into clinical decision-making, the threshold will have to be chosen based on the desired sensitivity and specificity. For diagnosis, full non-linear classifiers achieved better combinations of sensitivity, specificity, NPV, and PPV than the clinical baseline (AS or PAS ≥ 4 and appendix diameter ≥ 6 mm). Similar to the evaluation in Table 3, on average, non-linear classifiers performed noticeably better than logistic regression in predicting diagnosis.
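These threshold-based metrics follow from a simple confusion matrix; the sketch below (with placeholder vectors prob for predicted probabilities and truth for the true labels) shows the computation and makes explicit that raising the threshold trades sensitivity for specificity.

threshold_metrics <- function(prob, truth, threshold = 0.5) {
  pred <- factor(ifelse(prob >= threshold, "yes", "no"), levels = c("no", "yes"))
  tp <- sum(pred == "yes" & truth == "yes")
  tn <- sum(pred == "no"  & truth == "no")
  fp <- sum(pred == "yes" & truth == "no")
  fn <- sum(pred == "no"  & truth == "yes")
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn))
}

# Example: metrics for the random forest probabilities on a test fold
threshold_metrics(rf_prob, test_imp$diagnosis, threshold = 0.5)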
TABLE 3 | Ten-fold cross-validation results for logistic regression (LR), random forest (RF), and generalized boosted regression (GBM) models for predicting diagnosis, management, and severity.

Classifier | Diagnosis AUROC (±SD) | Diagnosis AUPR (±SD) | Management AUROC (±SD) | Management AUPR (±SD) | Severity AUROC (±SD) | Severity AUPR (±SD)
Random | 0.50 | 0.43 | 0.50 | 0.38 | 0.50 | 0.12
AS | 0.75 | 0.71 | — | — | — | —
PAS | 0.71 | 0.67 | — | — | — | —
AS or PAS ≥ 4 and appendix diameter ≥ 6 mm | 0.79 | 0.83 | — | — | — | —
Suspected diagnosis | 0.73 | 0.85 | — | — | — | —
LR (full) | 0.91 (±0.04) | 0.88 (±0.07) | 0.90 (±0.04) | 0.88 (±0.06) | 0.82 (±0.13) | 0.53 (±0.26)
LR (w/o US) | 0.82 (±0.06) | 0.71 (±0.12) | 0.91 (±0.04) | 0.90 (±0.05) | 0.91 (±0.09) | 0.69 (±0.26)
LR (w/o peritonitis/abdominal guarding) | 0.90 (±0.04) | 0.87 (±0.06) | 0.83 (±0.04) | 0.79 (±0.06) | 0.82 (±0.15) | 0.58 (±0.28)
LR (w/o US and peritonitis/abdominal guarding) | 0.77 (±0.06) | 0.67 (±0.14) | 0.80 (±0.04) | 0.77 (±0.06) | 0.81 (±0.16) | 0.62 (±0.26)
RF (full) | 0.96 (±0.01) | 0.94 (±0.03) | 0.94 (±0.02) | 0.92 (±0.05) | 0.90 (±0.08) | 0.70 (±0.17)
RF (w/o US) | 0.85 (±0.05) | 0.77 (±0.11) | 0.93 (±0.03) | 0.90 (±0.07) | 0.90 (±0.08) | 0.67 (±0.18)
RF (w/o peritonitis/abdominal guarding) | 0.95 (±0.01) | 0.93 (±0.05) | 0.85 (±0.07) | 0.79 (±0.11) | 0.88 (±0.10) | 0.65 (±0.23)
RF (w/o US and peritonitis/abdominal guarding) | 0.80 (±0.06) | 0.73 (±0.11) | 0.78 (±0.05) | 0.70 (±0.10) | 0.86 (±0.10) | 0.58 (±0.21)
GBM (full) | 0.96 (±0.02) | 0.94 (±0.03) | 0.94 (±0.02) | 0.93 (±0.04) | 0.90 (±0.07) | 0.64 (±0.21)
GBM (w/o US) | 0.85 (±0.06) | 0.75 (±0.10) | 0.92 (±0.04) | 0.90 (±0.05) | 0.91 (±0.07) | 0.60 (±0.25)
GBM (w/o peritonitis/abdominal guarding) | 0.95 (±0.02) | 0.92 (±0.05) | 0.87 (±0.05) | 0.82 (±0.08) | 0.84 (±0.13) | 0.58 (±0.25)
GBM (w/o US and peritonitis/abdominal guarding) | 0.79 (±0.06) | 0.71 (±0.11) | 0.79 (±0.07) | 0.72 (±0.08) | 0.84 (±0.12) | 0.55 (±0.27)

Results are given by average areas under receiver operating characteristic (AUROC) and precision-recall (AUPR) curves and standard deviations across 10 folds. "Full" models use all predictors; models "w/o US" were trained without ultrasonographic findings; models "w/o peritonitis/abdominal guarding" were trained without the "peritonitis/abdominal guarding" predictor; and models "w/o US and peritonitis/abdominal guarding" were trained without ultrasonographic findings and the "peritonitis/abdominal guarding" predictor. For fixed classification rules, such as the Alvarado (AS) and pediatric appendicitis scores (PAS), AUROC and AUPR on the whole dataset are reported without standard deviations. For random classifiers, we report expected AUROC and AUPR. "Random" corresponds to a random guess and serves as a naïve baseline. Bold values correspond to the best average performances achieved across all models.

TABLE 4 | Ten-fold cross-validation results for logistic regression (LR), random forest (RF), and generalized boosted regression (GBM) models for predicting diagnosis, management, and severity.

Classifier | Diagnosis Sens. (±SD) | Diagnosis Spec. (±SD) | Management Sens. (±SD) | Management Spec. (±SD) | Severity Sens. (±SD) | Severity Spec. (±SD)
Random | 0.57 | 0.43 | 0.62 | 0.38 | 0.88 | 0.12
AS or PAS ≥ 4 and appendix diameter ≥ 6 mm | 0.91 | 0.73 | — | — | — | —
Suspected diagnosis | 1.00 | 0.46 | — | — | — | —
LR (full) | 0.88 (±0.06) | 0.76 (±0.11) | 0.85 (±0.09) | 0.82 (±0.09) | 0.93 (±0.05) | 0.42 (±0.32)
LR (w/o US) | 0.75 (±0.06) | 0.72 (±0.09) | 0.92 (±0.07) | 0.85 (±0.05) | 0.95 (±0.04) | 0.52 (±0.29)
LR (w/o peritonitis/abdominal guarding) | 0.87 (±0.07) | 0.76 (±0.12) | 0.84 (±0.10) | 0.68 (±0.15) | 0.94 (±0.05) | 0.40 (±0.36)
LR (w/o US and peritonitis/abdominal guarding) | 0.77 (±0.06) | 0.67 (±0.11) | 0.82 (±0.06) | 0.63 (±0.07) | 0.97 (±0.05) | 0.44 (±0.34)
RF (full) | 0.91 (±0.03) | 0.86 (±0.08) | 0.94 (±0.07) | 0.80 (±0.09) | 0.98 (±0.02) | 0.45 (±0.16)
RF (w/o US) | 0.81 (±0.07) | 0.71 (±0.07) | 0.93 (±0.07) | 0.82 (±0.07) | 0.97 (±0.02) | 0.44 (±0.13)
RF (w/o peritonitis/abdominal guarding) | 0.91 (±0.04) | 0.90 (±0.06) | 0.86 (±0.07) | 0.65 (±0.18) | 0.98 (±0.02) | 0.37 (±0.17)
RF (w/o US and peritonitis/abdominal guarding) | 0.79 (±0.06) | 0.64 (±0.11) | 0.81 (±0.06) | 0.56 (±0.06) | 0.98 (±0.02) | 0.40 (±0.15)
GBM (full) | 0.93 (±0.02) | 0.86 (±0.07) | 0.93 (±0.07) | 0.86 (±0.07) | 0.97 (±0.02) | 0.46 (±0.18)
GBM (w/o US) | 0.80 (±0.07) | 0.74 (±0.11) | 0.91 (±0.08) | 0.85 (±0.05) | 0.97 (±0.03) | 0.44 (±0.16)
GBM (w/o peritonitis/abdominal guarding) | 0.92 (±0.04) | 0.83 (±0.09) | 0.88 (±0.04) | 0.66 (±0.11) | 0.97 (±0.03) | 0.47 (±0.20)
GBM (w/o US and peritonitis/abdominal guarding) | 0.80 (±0.06) | 0.61 (±0.10) | 0.82 (±0.07) | 0.59 (±0.09) | 0.97 (±0.03) | 0.47 (±0.19)

Results are given by average sensitivities (sens.) and specificities (spec.) with standard deviations across 10 folds. "Full" models use all predictors; models "w/o US" were trained without ultrasonographic findings; models "w/o peritonitis/abdominal guarding" were trained without the "peritonitis/abdominal guarding" predictor; and models "w/o US and peritonitis/abdominal guarding" were trained without ultrasonographic findings or the "peritonitis/abdominal guarding" predictor. For all classifiers, a probability threshold of 0.5 was used to differentiate between classes. "Random" corresponds to a random guess and serves as a naïve baseline. Bold values correspond to the best average performances achieved across all models.

TABLE 5 | Ten-fold cross-validation results for logistic regression (LR), random forest (RF), and generalized boosted regression (GBM) models for predicting diagnosis, management, and severity.

Classifier | Diagnosis PPV (±SD) | Diagnosis NPV (±SD) | Management PPV (±SD) | Management NPV (±SD) | Severity PPV (±SD) | Severity NPV (±SD)
Random | 0.57 | 0.43 | 0.62 | 0.38 | 0.88 | 0.12
AS or PAS ≥ 4 and appendix diameter ≥ 6 mm | 0.82 | 0.85 | — | — | — | —
Suspected diagnosis | 0.71 | 1.00 | — | — | — | —
LR (full) | 0.83 (±0.07) | 0.83 (±0.09) | 0.89 (±0.06) | 0.79 (±0.09) | 0.92 (±0.04) | 0.51 (±0.28)
LR (w/o US) | 0.78 (±0.08) | 0.68 (±0.10) | 0.91 (±0.03) | 0.88 (±0.10) | 0.94 (±0.04) | 0.61 (±0.34)
LR (w/o peritonitis/abdominal guarding) | 0.83 (±0.09) | 0.82 (±0.11) | 0.82 (±0.05) | 0.74 (±0.09) | 0.92 (±0.04) | 0.45 (±0.29)
LR (w/o US and peritonitis/abdominal guarding) | 0.76 (±0.09) | 0.68 (±0.10) | 0.78 (±0.04) | 0.68 (±0.09) | 0.93 (±0.04) | 0.69 (±0.33)
RF (full) | 0.89 (±0.08) | 0.88 (±0.05) | 0.88 (±0.04) | 0.90 (±0.12) | 0.93 (±0.03) | 0.80 (±0.26)
RF (w/o US) | 0.78 (±0.07) | 0.74 (±0.10) | 0.89 (±0.04) | 0.88 (±0.10) | 0.93 (±0.03) | 0.72 (±0.24)
RF (w/o peritonitis/abdominal guarding) | 0.92 (±0.05) | 0.88 (±0.07) | 0.81 (±0.09) | 0.74 (±0.13) | 0.92 (±0.04) | 0.77 (±0.24)
RF (w/o US and peritonitis/abdominal guarding) | 0.74 (±0.11) | 0.69 (±0.09) | 0.75 (±0.05) | 0.65 (±0.10) | 0.92 (±0.03) | 0.72 (±0.23)
GBM (full) | 0.89 (±0.07) | 0.90 (±0.04) | 0.91 (±0.04) | 0.88 (±0.10) | 0.93 (±0.02) | 0.67 (±0.21)
GBM (w/o US) | 0.81 (±0.09) | 0.73 (±0.11) | 0.91 (±0.03) | 0.87 (±0.11) | 0.93 (±0.02) | 0.70 (±0.25)
GBM (w/o peritonitis/abdominal guarding) | 0.87 (±0.08) | 0.89 (±0.06) | 0.81 (±0.04) | 0.77 (±0.08) | 0.93 (±0.03) | 0.72 (±0.24)
GBM (w/o US and peritonitis/abdominal guarding) | 0.73 (±0.09) | 0.70 (±0.10) | 0.76 (±0.06) | 0.67 (±0.11) | 0.93 (±0.03) | 0.68 (±0.23)

Results are given by average positive and negative predictive values (PPV/NPV) with standard deviations across 10 folds. "Full" models use all predictors; models "w/o US" were trained without ultrasonographic findings; models "w/o peritonitis/abdominal guarding" were trained without the "peritonitis/abdominal guarding" predictor; and models "w/o US and peritonitis/abdominal guarding" were trained without ultrasonographic findings or the "peritonitis/abdominal guarding" predictor. For all classifiers, a probability threshold of 0.5 was used to differentiate between classes. "Random" corresponds to a random guess and serves as a naïve baseline. Bold values correspond to the best average performances achieved across all models.

To identify the most crucial predictive variables, we trained RF classifiers on 300 bootstrap resamples of the dataset and obtained a distribution of importance values for every predictor. The RF variable importance quantifies how important each variable is for predicting the outcome in the random forest model. For diagnosing appendicitis, on average, the most important predictors were appendix diameter, appendix visibility on US, and peritonitis. For management, they were peritonitis, appendix diameter, and WBC count. For severity, they were CRP, peritonitis, and body temperature (details in Figure 2). Plots of importance values for the full set of predictors are shown in Supplementary Figure 2. Overall, these findings agreed with the statistical results in Table 2 and Supplementary Tables 3, 4. Predictor variables that differ significantly across patient subgroups are often among the most important features used by random forests for predictions.

In addition, we performed variable selection using RF importance. Figure 3 contains AUROC and AUPR plots for RF models based on varying numbers of predictors. For predicting diagnosis, classifier AUROC and AUPR values saturated at q = 3 (Figures 3A,B). Thus, a few variables suffice for accurate appendicitis risk stratification. For management, there was a steady increase in average AUROC (Figure 3C) with an increase in the number of predictor variables selected. For AUPR, classifiers with <14 predictors (Figure 3D) had higher variances in 10-fold CV. Predictive performance stabilized at q = 14. Similarly, for predicting severity, average AUROC and AUPR increased steadily with model complexity (Figures 3E,F). AUROC saturated at q = 5, and AUPR at q = 11. For all three prediction tasks, we observed that the full set of predictors is far from necessary because full-model performance levels can be achieved with a smaller number of variables.

We used bootstrapping to determine how frequently variables were selected based on their RF importance. For predicting diagnosis, we looked at choosing the q = 3 most important variables. The variables chosen in >5% of bootstrap resamples included appendix diameter, appendix visibility on US, peritonitis, target sign, WBC count, and neutrophil percentage. For management, we examined a subset of size q = 14. The variables selected in ≥5% of bootstrap resamples included peritonitis, CRP, neutrophil percentage, WBC count, appendix diameter, enteritis, target sign, appendix perfusion, AS, body temperature, age, surrounding tissue reaction, appendix layer structure, weight, body mass index (BMI), height, and PAS. For severity, we chose a subset of q = 11 variables. The following predictors were selected in >5% of bootstrap resamples: peritonitis, CRP, body temperature, WBC count, neutrophil percentage, appendix diameter, appendix perfusion, weight, age, bowel wall thickening, height, AS, BMI, ileus, appendix layer structure, PAS, erythrocytes in urine, and target sign. Supplementary Table 5 summarizes these variable selection results.

ONLINE TOOL

We provide an easy-to-use online tool for the three response variables at http://papt.inf.ethz.ch/ (33). The RF models implemented in this tool use limited sets of predictors chosen based on variable importance and 10-fold CV. We chose random forests because they outperformed logistic regression and were, in general, on a par with GBM. We included the variables selected into subsets in ≥5% of bootstrap resamples of the dataset. The tool has pilot status and was developed for educational use only. Even in further steps after prospective validation, practical clinical considerations must be incorporated into decision-making.

DISCUSSION

This observational study of children referred with abdominal pain to the pediatric surgical department used different ML models to predict the diagnosis, management and severity of appendicitis. Starting with a granular dataset including demographic, clinical, laboratory, and US variables, we identified a minimal subset of key predictors and trained classifiers that far outperformed conventional scores such as the AS and PAS. Since all the variables we used in this study are standardized and widely available for evaluating patients with abdominal pain, our findings are broadly relevant. We also developed the Appendicitis Prediction Tool (APT) to predict the diagnosis, management and severity of appendicitis with unlimited online access.


FIGURE 2 | Boxplots of random forest (RF) importance values for a few of the most important predictors. RF variable importance quantifies how important each variable is for predicting the considered outcome. Appendix diameter, peritonitis/abdominal guarding, white blood cell (WBC) count, neutrophil percentage, and C-reactive protein (CRP) are among the 10 most important variables for predicting diagnosis, management and severity. Distributions were obtained by training random forest classifiers on 300 bootstrap resamples of the dataset. The bootstrapping was performed to provide uncertainty estimates for the variable importance values, rather than mere point estimates.

A basic challenge with ML models is that their performance depends largely on the quality and representativity of the training data, and their applicability in real life depends on the accessibility of required features (34). For example, assessing abdominal guarding as a sign of peritonitis can be challenging during the initial presentation of small children with abdominal pain. If this finding is unclear, it is recommended that the assessment be repeated during the clinical observation period, if necessary under analgesia (35, 36). Based on RF variable importance and CV results, we found that "peritonitis/abdominal guarding" had the highest importance for predicting management, but not appendicitis or appendicitis severity, for which other predictors were more important (Figure 2). The AS and PAS can be easily calculated after clinical examination and hemogram. Although abdominal and appendix US is the most suitable and cost-effective imaging modality for suspected appendicitis, it is highly operator-dependent, requiring years of training, particularly for children, and is not always on hand in every ED. That is why we also trained models without "peritonitis/abdominal guarding", without US, and without either "peritonitis/abdominal guarding" or US. These variables are not mandatory in the prediction tool, making it easier to deploy. The predictors are imputed using the k-NN method if the user decides to omit them. Nevertheless, based on the CV results (Table 3), the models incorporating US variables performed considerably better in predicting diagnosis and management and hence are preferred, to avoid complications and misdiagnosis. Most children with missed appendicitis on presenting to the ED of a tertiary care hospital did not undergo US (67 vs. 13% of correctly diagnosed cases, p < 0.05) (37).

Several studies have used ML to support the diagnosis of appendicitis (30, 38, 39). Four recent studies have focused exclusively on the pediatric population (40–43). Reismann et al. performed feature selection and trained a logistic regression to diagnose appendicitis and differentiate between uncomplicated and complicated cases of pediatric acute appendicitis (40). They analyzed laboratory variables and appendix diameter in US and achieved AUROCs of 0.91 and 0.80 for diagnosing appendicitis and differentiating complicated appendicitis, respectively. Akmese et al. analyzed demographic and laboratory data and used a range of ML methods to predict whether pediatric patients with suspected acute appendicitis underwent surgery (41). In their analysis, gradient boosting attained the highest accuracy (95%). Similar to Akmese et al. (41), Aydin et al. detected pediatric appendicitis based on demographic and pre-operative laboratory data (42). In addition, they differentiated between complicated and uncomplicated appendicitis. Their decision tree model achieved AUROCs of 0.94 and 0.79 for predicting appendicitis and uncomplicated appendicitis, respectively. Stiel et al. applied different appendicitis scores (AS, PAS, Heidelberg, and Tzanakis Score) to a dataset of pediatric patients presenting with abdominal pain to predict diagnosis and perforated appendicitis (43). The Heidelberg Score was modified and a data-driven score was developed using decision trees and random forests, achieving AUROCs of, respectively, 0.92 and 0.86 for appendicitis diagnosis, and both 0.71 for perforation.

FIGURE 3 | Results of 10-fold cross-validation for random forest classifiers based on different numbers of predictor variables selected based on variable importance. (A,B) show areas under receiver operating characteristic (AUROC) and precision-recall (AUPR) curves, respectively, for predicting diagnosis. (C,D) show AUROCs and AUPRs, respectively, for predicting management. (E,F) show AUROCs and AUPRs, respectively, for predicting severity. Black-colored bars correspond to 95% confidence intervals, constructed using the t-distribution; red-colored dots correspond to averages. Recall that random classifier AUROCs are 0.50 for all three targets and random classifier AUPRs are 0.43, 0.38, and 0.12 for diagnosis, treatment, and complicated appendicitis, respectively.

Our own analysis focused exclusively on the pediatric population given the particularities of appendicitis in this age range: atypical clinical course and elevated perforation rates in preschool-aged children, high prevalence, and multiple differential diagnoses (44, 45). In addition to demographic, laboratory, and ultrasonographic data, we considered clinical predictors, such as peritonitis/abdominal guarding, and appendicitis scores (AS and PAS). Moreover, we targeted the prediction of all three targets simultaneously: diagnosis, management, and severity. None of the machine learning models mentioned above were deployed as an open access online tool (40–43), whereas our models are available as an easy-to-use APT.

Our 10-fold CV results (Table 3) are overall comparable to the performance levels reported by Reismann et al. (40), Akmese et al. (41), Aydin et al. (42), and Stiel et al. (43), whose studies are similar to ours. Compared to the previous work on using ML to predict pediatric appendicitis (40–43), our analysis considers the most extensive set of variables and, to the best of our knowledge, is the first to simultaneously predict diagnosis, management, and severity of appendicitis in pediatric patients. In a retrospective study, Cohen et al. found that children with a normal WBC count and an appendix non-visualized on US could initially be kept under observation (46). According to our data, appendix visibility on US is one of the most important predictors for diagnosis (Figure 2).

In the presented collective, pediatric patients with suspected simple appendicitis and persistent symptoms after initial treatment and evaluation at the ED were admitted for further observation and therapy, as shown in Supplementary Figure 1. They received initial clinical support, e.g., intravenous fluids and enemas, without antibiotics. Eighty-two patients with clinical and US signs of uncomplicated appendicitis showed clinical improvement, including appendicitis regression signs in US. Therefore, they were discharged after a period of observation. Several studies indicate that simple and complicated appendicitis might have a different pathophysiology, suggesting that some forms of uncomplicated appendicitis may be reversible and, as an alternative to operation, could be treated with or even without antibiotics (1, 47–50). Ohba et al. (12) conducted a prospective study of pediatric appendicitis based on US findings such as appendix diameter, wall structure, and perfusion. Their results support the possibility of treating pediatric patients conservatively without antibiotics if abundant blood flow in the appendix submucosal layer is still detectable.

The APT is an academic instrument whose sensitivity and specificity require further clinical testing. This prototype was developed based on our first dataset as a pilot trial with a promising application of ML as a basis for further prospective studies. It needs a larger training dataset and external blinded validation before it can be integrated into clinical decision-making. The model could be extended to differentiate patients requiring primary surgery from those suitable for conservative management with or without antibiotics by identifying the characteristics supporting spontaneous regression of acute appendicitis. Furthermore, predictive models could be used to support the decision on which surgical approach is best suited for the patient. Certain minimally invasive approaches such as TULAA (trans-umbilical laparoscopic-assisted appendectomy) may benefit from preoperative patient stratification, guiding the decision between single-incision vs. 2-trocar techniques (51).

STRENGTHS AND LIMITATIONS

The current dataset was acquired from patients admitted to a pediatric surgical unit with suspected appendicitis. Those with mild symptoms and/or rapid improvement had already been discharged by the emergency department. This can be assumed to have increased the probability of appendicitis among surgical admissions. The predictors for all three outcomes include clinical, laboratory, and US parameters that are readily and cost-effectively available during a patient's work-up. Limitations include certain missing variables, a limited number of patients, especially with complicated appendicitis, the lack of a definitive histological diagnosis in conservatively managed patients (we provide a more detailed discussion of this limitation in the Supplementary Material), and the current absence of external validation. Due to these limitations, the APT is merely a research prototype and must not be relied on for health or personal advice.

CONCLUSION

Pediatric appendicitis remains an important disease with a heterogeneous presentation. The APT should help clinicians identify and manage patients with potential appendicitis. It could become an important tool for clinical observation in the near future. The goal of further research should be the expanded application of ML models for the early differential diagnosis of children with abdominal pain. We see it as a valuable tool for recognizing appendicitis severity and facilitating a personalized management approach.

DATA AVAILABILITY STATEMENT

The dataset analyzed is available in anonymized form alongside the code in a GitHub repository: https://github.com/i6092467/pediatric-appendicitis-ml.

ETHICS STATEMENT

The study involving human participants was reviewed and approved by the University of Regensburg institutional review board (Ethikkommission der Universität Regensburg, no. 18-1063-101), which also waived informed consent to routine data analysis. For patients followed up after discharge, written informed consent was obtained from parents or legal representatives.

AUTHOR CONTRIBUTIONS

All authors made substantial contributions to conception and design, analyses and interpretation of data, and revising the article. PRW and CK performed clinical data acquisition, coordination and check. PRW performed literature review and contributed to the manuscript. RM performed statistical and machine learning analysis and contributed to the manuscript. CK and SW supervised the clinical part of the project. JV supervised the machine learning part of the project. All authors have read the manuscript and approved its submission.

ACKNOWLEDGMENTS

The authors would like to express their gratitude to Imant Daunhawer and Kieran Chin-Cheong from the Medical Data Science research group at the Department of Computer Science, ETH Zurich, for their help with the development of the online prediction tool. They also would like to thank Vivien Rimili and the physicians of the department of pediatric surgery of the Clinic Hedwig for helping with data acquisition. The authors sincerely thank Ian Young, St Bartholomew's Hospital, London, UK, and Lingua Medica for proofreading the article.

SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fped.2021.662183/full#supplementary-material

REFERENCES

1. Andersson RE. The natural history and traditional management of appendicitis revisited: spontaneous resolution and predominance of prehospital perforations imply that a correct diagnosis is more important than an early diagnosis. World J Surg. (2006) 31:86–92. doi: 10.1007/s00268-006-0056-y
2. Addiss DG, Shaffer N, Fowler BS, Tauxe RV. The epidemiology of appendicitis and appendectomy in the United States. Am J Epidemiol. (1990) 132:910–25. doi: 10.1093/oxfordjournals.aje.a115734
3. Bonadio W, Peloquin P, Brazg J, Scheinbach I, Saunders J, Okpalaji C, et al. Appendicitis in preschool aged children: regression analysis of factors associated with perforation outcome. J Pediatr Surg. (2015) 50:1569–73. doi: 10.1016/j.jpedsurg.2015.02.050
4. Acharya A, Markar SR, Ni M, Hanna GB. Biomarkers of acute appendicitis: systematic review and cost–benefit trade-off analysis. Surg Endosc. (2016) 31:1022–31. doi: 10.1007/s00464-016-5109-1
5. Shommu NS, Jenne CN, Blackwood J, Martin DA, Joffe AR, Eccles R, et al. The use of metabolomics and inflammatory mediator profiling provides a novel approach to identifying pediatric appendicitis in the emergency department. Sci Rep. (2018) 8:4083. doi: 10.1038/s41598-018-22338-1
6. Dingemann J, Ure B. Imaging and the use of scores for the diagnosis of appendicitis in children. Eur J Pediatr Surg. (2012) 22:195–200. doi: 10.1055/s-0032-1320017
7. Alvarado A. A practical score for the early diagnosis of acute appendicitis. Ann Emerg Med. (1986) 15:557–64. doi: 10.1016/S0196-0644(86)80993-3
8. Samuel M. Pediatric appendicitis score. J Pediatr Surg. (2002) 37:877–81. doi: 10.1053/jpsu.2002.32893
9. Nepogodiev D, Wilkin RJ, Bradshaw CJ, Skerritt C, Ball A, Moni-Nwinia W, et al. Appendicitis risk prediction models in children presenting with right iliac fossa pain (RIFT study): a prospective, multicentre validation study. Lancet Child Adolesc Health. (2020) 4:271–80. doi: 10.1016/S2352-4642(20)30006-7
10. Svensson JF, Patkova B, Almström M, Naji H, Hall NJ, Eaton S, et al. Nonoperative treatment with antibiotics versus surgery for acute nonperforated appendicitis in children. Ann Surg. (2015) 261:67–71. doi: 10.1097/SLA.0000000000000835
11. Svensson J, Hall N, Eaton S, Pierro A, Wester T. A review of conservative treatment of acute appendicitis. Eur J Pediatr Surg. (2012) 22:185–94. doi: 10.1055/s-0032-1320014
12. Ohba G, Hirobe S, Komori K. The usefulness of combined B mode and doppler ultrasonography to guide treatment of appendicitis. Eur J Pediatr Surg. (2016) 26:533–6. doi: 10.1055/s-0035-1570756
13. Daunhawer I, Kasser S, Koch G, Sieber L, Cakal H, Tütsch J, et al. Enhanced early prediction of clinically relevant neonatal hyperbilirubinemia with machine learning. Pediatr Res. (2019) 86:122–7. doi: 10.1038/s41390-019-0384-x
14. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med. (2019) 380:1347–58. doi: 10.1056/NEJMra1814259
15. Liaw A, Wiener M. Classification and regression by randomForest. R News. (2007) 2:18–22. Available online at: https://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf
16. R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing (2019). Available online at: https://www.R-project.org/ (accessed January 5, 2021).
17. Hommel G. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika. (1988) 75:383–6. doi: 10.1093/biomet/75.2.383
18. Gower JC. A general coefficient of similarity and some of its properties. Biometrics. (1971) 27:857–71. doi: 10.2307/2528823
19. Kowarik A, Templ M. Imputation with the R package VIM. J Stat Softw. (2016) 74:1–16. doi: 10.18637/jss.v074.i07
20. Friedman JH, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. (2010) 33:1–22. doi: 10.18637/jss.v033.i01
21. Breiman L. Random forests. Mach Learn. (2001) 45:5–32. doi: 10.1023/A:1010933404324
22. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. (2001) 29:1189–232. doi: 10.1214/aos/1013203451
23. Greenwell B, Boehmke B, Cunningham J. gbm: Generalized Boosted Regression Models. R package version 2.1.5. (2019). Available online at: https://CRAN.R-project.org/package=gbm (accessed January 5, 2021).
24. Sola R, Wormer BA, Anderson WE, Schmelzer TM, Cosper GH. Predictors and outcomes of nondiagnostic ultrasound for acute appendicitis in children. Am J Surg. (2017) 83:1357–62. doi: 10.1177/000313481708301218
25. Soundappan SS, Karpelowsky J, Lam A, Lam L, Cass D. Diagnostic accuracy of surgeon performed ultrasound (SPU) for appendicitis in children. J Pediatr Surg. (2018) 53:2023–7. doi: 10.1016/j.jpedsurg.2018.05.014
26. Stone M. Cross-validatory choice and assessment of statistical predictions. J R Stat Soc B. (1974) 36:111–33. doi: 10.1111/j.2517-6161.1974.tb00994.x
27. Davis J, Goadrich M. The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning—ICML'06. ACM Press (2006). p. 233–40.
28. Dietterich TG. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. (1998) 10:1895–923. doi: 10.1162/089976698300017197
29. Efron B, Tibshirani R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci. (1986) 1:54–75. doi: 10.1214/ss/1177013817
30. Hsieh CH, Lu RH, Lee NH, Chiu WT, Hsu MH, Li YCJ. Novel solutions for an old disease: diagnosis of acute appendicitis with random forest, support vector machines, and artificial neural networks. Surgery. (2011) 149:87–93. doi: 10.1016/j.surg.2010.03.023
31. Owen TD, Williams H, Stiff G, Jenkinson LR, Rees BI. Evaluation of the Alvarado score in acute appendicitis. J R Soc Med. (1992) 85:87–8.
32. Wu HP, Lin CY, Chang CF, Chang YJ, Huang CY. Predictive value of C-reactive protein at different cutoff levels in acute appendicitis. Am J Emerg Med. (2005) 23:449–53. doi: 10.1016/j.ajem.2004.10.013
33. Marcinkevics R. Pediatric Appendicitis Prediction Tool. (2020). Available online at: http://papt.inf.ethz.ch/ (accessed January 5, 2021).
34. Koch G, Pfister M, Daunhawer I, Wilbaux M, Wellmann S, Vogt JE. Pharmacometrics and machine learning partner to advance clinical data analysis. Clin Pharmacol Ther. (2020) 107:926–33. doi: 10.1002/cpt.1774
35. Kim MK, Strait RT, Sato TT, Hennes HM. A randomized clinical trial of analgesia in children with acute abdominal pain. Acad Emerg Med. (2002) 9:281–7. doi: 10.1111/j.1553-2712.2002.tb01319.x
36. Green R, Bulloch B, Kabani A, Hancock BJ, Tenenbein M. Early analgesia for children with acute abdominal pain. Pediatrics. (2005) 116:978–83. doi: 10.1542/peds.2005-0273
37. Galai T, Beloosesky O, Scolnik D, Rimon A, Glatstein M. Misdiagnosis of acute appendicitis in children attending the emergency department: the experience of a large, tertiary care pediatric hospital. Eur J Pediatr Surg. (2016) 27:138–41. doi: 10.1055/s-0035-1570757
38. Deleger L, Brodzinski H, Zhai H, Li Q, Lingren T, Kirkendall ES, et al. Developing and evaluating an automated appendicitis risk stratification algorithm for pediatric patients in the emergency department. J Am Med Inform Assoc. (2013) 20:e212–20. doi: 10.1136/amiajnl-2013-001962
39. Rajpurkar P, Park A, Irvin J, Chute C, Bereket M, Mastrodicasa D, et al. AppendiXNet: deep learning for diagnosis of appendicitis from a small dataset of CT exams using video pretraining. Sci Rep. (2020) 10:1–7. doi: 10.1038/s41598-020-61055-6
40. Reismann J, Romualdi A, Kiss N, Minderjahn MI, Kallarackal J, Schad M, et al. Diagnosis and classification of pediatric acute appendicitis by artificial intelligence methods: an investigator-independent approach. PLoS ONE. (2019) 14:e0222030. doi: 10.1371/journal.pone.0222030
41. Akmese OF, Dogan G, Kor H, Erbay H, Demir E. The use of machine learning approaches for the diagnosis of acute appendicitis. Emerg Med Int. (2020) 2020:1–8. doi: 10.1155/2020/7306435
42. Aydin E, Türkmen IU, Namli G, Öztürk Ç, Esen AB, Eray YN, et al. A novel and simple machine learning algorithm for preoperative diagnosis of acute appendicitis in children. Pediatr Surg Int. (2020) 36:735–42. doi: 10.1007/s00383-020-04655-7
43. Stiel C, Elrod J, Klinke M, Herrmann J, Junge CM, Ghadban T, et al. The modified Heidelberg and the AI appendicitis score are superior to current scores in predicting appendicitis in children: a two-center cohort study. Front Pediatr. (2020) 8:592892. doi: 10.3389/fped.2020.592892
44. Nwokoma NJ. Appendicitis in children. In: Lander A, editor, Appendicitis—A Collection of Essays from Around the World. Rijeka: InTech (2011). p. 133–69.
45. Zachariou Z. Appendizitis. In: von Schweinitz D, Ure B, editors, Kinderchirurgie. Viszerale und Allgemeine Chirurgie des Kindesalters. Berlin Heidelberg: Springer (2013). p. 465–74.
46. Cohen B, Bowling J, Midulla P, Shlasko E, Lester N, Rosenberg H, et al. The non-diagnostic ultrasound in appendicitis: is a non-visualized appendix the same as a negative study? J Pediatr Surg. (2015) 50:923–7. doi: 10.1016/j.jpedsurg.2015.03.012
47. Bhangu A, Søreide K, Saverio SD, Assarsson JH, Drake FT. Acute appendicitis: modern understanding of pathogenesis, diagnosis, and management. Lancet. (2015) 386:1278–87. doi: 10.1016/S0140-6736(15)00275-5
48. Andersson R, Hugander A, Thulin A, Nystrom PO, Olaison G. Indications for operation in suspected appendicitis and incidence of perforation. BMJ. (1994) 308:107–10. doi: 10.1136/bmj.308.6921.107
49. Kiss N, Minderjahn M, Reismann J, Svensson J, Wester T, Hauptmann K, et al. Use of gene expression profiling to identify candidate genes for pretherapeutic patient classification in acute appendicitis. BJS Open. (2021) 5:zraa045. doi: 10.1093/bjsopen/zraa045
50. Migraine S, Atri M, Bret PM, Lough JO, Hinchey JE. Spontaneously resolving acute appendicitis: clinical and sonographic documentation. Radiology. (1997) 205:55–8. doi: 10.1148/radiology.205.1.9314962
51. Borges-Dias M, Carmo L, Lamas-Pinheiro R, Henriques-Coelho T, Estevão-Costa J. Trans-umbilical laparoscopic-assisted appendectomy in the pediatric population: comparing single-incision and 2-trocar techniques. Minim Invasive Ther Allied Technol. (2017) 27:160–3. doi: 10.1080/13645706.2017.1399279

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2021 Marcinkevics, Reis Wolfertstetter, Wellmann, Knorr and Vogt. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
