Early Detection of Breast Cancer
Rationale and Objectives: To develop and evaluate an AI algorithm that detects breast cancer in MRI scans up to one year before
radiologists typically identify it, potentially enhancing early detection in high-risk women.
Materials and Methods: A convolutional neural network (CNN) AI model, pre-trained on breast MRI data, was fine-tuned using a
retrospective dataset of 3029 MRI scans from 910 patients. These contained 115 cancers that were diagnosed within one year of a
negative MRI. The model aimed to identify these cancers, with the goal of predicting cancer development up to one year in advance. The
network was fine-tuned and tested with 10-fold cross-validation. Mean age of patients was 52 years (range, 18–88 years), with average
follow-up of 4.3 years (range 1–12 years).
Results: The AI detected cancers one year earlier with an area under the ROC curve of 0.72 (CI: 0.67–0.76). Retrospective analysis by a radiologist of the top 10% highest-risk MRIs, as ranked by the AI, could have increased early detection by up to 30% (35/115, CI: 22.2–39.7%; 30% sensitivity). A radiologist identified a visual correlate to biopsy-proven cancers in 83 of the prior-year MRIs (83/115, CI: 62.1–79.4%). The AI algorithm identified the anatomic region where cancer would be detected in 66 cases (66/115, CI: 47.8–66.5%), with both agreeing in 54 cases (54/115, CI: 37.5–56.4%).
Conclusion: This novel AI-aided re-evaluation of "benign" breasts shows promise for improving early breast cancer detection with MRI.
As datasets grow and image quality improves, this approach is expected to become even more impactful.
Key Words: Breast cancer; Magnetic resonance imaging; Early detection; Deep learning.
© 2024 The Association of University Radiologists. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND
license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Academic Radiology, Vol 32, No 3, March 2025 EARLY DETECTION OF BREAST CANCER IN MRI
high-risk screening population. By analyzing the current MRI, the AI identifies higher-probability cases, which a radiologist can then re-evaluate. This study determines, in a retrospective analysis, how many cancers could potentially have been detected early through this re-evaluation. Additionally, the network identified regions of concern, and we evaluate whether these match the locations of future cancers.

MATERIALS AND METHODS

Patient Sample

The evaluation used retrospective data from 910 women who underwent breast MRI for screening purposes at a tertiary Cancer Center in the United States, with consecutive screening exams spanning up to 12 years (Fig. S1). We selected patients with sagittal-plane MRIs acquired between 2002 and 2014. The use of this retrospective data was approved by the institutional review board with a waiver of informed consent, and all procedures were HIPAA compliant. Patient information was removed, and MRIs were saved with anonymized identifiers before analysis.

Inclusion criteria for all MRIs were a BI-RADS assessment ≤ 3 and a follow-up exam within 15 months (referred to as "1 year"). We included all individuals with screen-detected cancer at the follow-up exam, alongside a group of randomly selected screening patients with no cancer detected during that timeframe. This resulted in 3029 sagittal MRIs, which were separated into left- and right-breast images, with the exception of unilateral studies (there were 930 unilateral exams). This yielded 5128 individual breast images (Fig 1). Breasts were labeled "benign" (n = 4965) if two years of imaging with BI-RADS ≤ 3 or a negative biopsy were documented. Breasts were labeled (future) "malignant" if the follow-up MRI led to a malignant pathology finding (n = 163), noting that all were initially deemed benign.

A breast radiologist with 10 years of clinical experience reviewed all 163 MRIs of breasts that developed cancer within a year and excluded 48 meeting the following criteria: MRI at diagnosis unavailable (n = 30), post-lumpectomy change (n = 6), axillary recurrence (n = 2), or biopsy change obscuring visualization (n = 10). This left 115 breasts (from 112 patients) for cancer location analysis and BI-RADS feature evaluation (Fig 1). The radiologist identified the cancer's anatomical location as the slice number containing the largest lesion ("index slice").

Lesion Sizes

To obtain an unbiased estimate of lesion size at both time points, we used an automatic
Figure 1. Patient sample. Partitions for training and testing of the AI algorithm were done per patient. Results were evaluated per breast and
per exam.
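The breast-labeling rules described in the Patient Sample section can be sketched as a small function. This is purely illustrative: the field names and data schema below are assumptions made for the sketch, not the study's actual records, and the radiologist-driven exclusions (e.g., post-lumpectomy change) are not modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BreastRecord:
    # Hypothetical fields; the study's actual data schema is not published.
    birads: int                     # BI-RADS at the current exam (inclusion requires <= 3)
    months_to_followup: float       # time to follow-up exam ("1 year" = within 15 months)
    followup_malignant: bool        # malignant pathology finding at the follow-up MRI
    benign_years_documented: float  # years of imaging documented with BI-RADS <= 3
    negative_biopsy: bool           # documented negative biopsy

def label(r: BreastRecord) -> Optional[str]:
    """Return 'malignant' (future cancer), 'benign', or None if excluded."""
    if r.birads > 3 or r.months_to_followup > 15:
        return None  # fails the inclusion criteria
    if r.followup_malignant:
        return "malignant"  # cancer found at the next screening exam
    if r.benign_years_documented >= 2 or r.negative_biopsy:
        return "benign"
    return None  # benign status not sufficiently documented
```

Note that "benign" here is a label assigned in hindsight from follow-up imaging or biopsy, which is why a breast can be labeled (future) "malignant" even though it was read as benign at the current exam.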
segmentation tool that has been previously validated (17). We selected a contiguous segmented area at the location of the index lesion and measured its length along the principal axis in 2D. This metric was confirmed in four cases measured by the radiologist using a conventional clinical approach (see Fig. S2).

Development of a Cancer Detection and Localization Network

The low prevalence of cancer in the screening population poses challenges for both detection (18,19) and network training, due to the large number of parameters. To address this, we leveraged an existing network that had been trained on a large dataset of 11,000 patients from the same clinical site (20). For details and hyperparameters of the pre-trained model, see the supplementary material. The patients from this earlier work did not overlap with the current patient cohort. This pre-trained 2D convolutional neural network is designed to detect breast cancer in current MRIs by assigning a probability of containing cancer to each 2D sagittal slice, with the overall output being the maximum probability across all slices of a breast. We fine-tuned this network to predict cancer in subsequent MRIs using the data described previously. We employed 10-fold cross-validation for fine-tuning and testing. Patients do not overlap between training and testing, to prevent bias when estimating generalization to unseen patients. However, in the training set there can be multiple images from the same patients, including previous exams and the contralateral breast. This means that similar images may have both benign and malignant labels. This is a form of within-patient control that may help identify subtle differences without overtraining on accidental differences between patients. Each fold used 90% of patient MRIs for training and 10% for testing. Due to the small data size resulting from low prevalence, no validation set was used. We fine-tuned only the last two layers (610 parameters) and used a fixed 50 epochs of training, as we did not expect significant overtraining with this parameter count. This was confirmed in a post-hoc analysis by re-training the same folds, reserving 10% of the data in each fold to track loss during training. We observed no increase in the validation loss during training (Fig. S3). Positive examples were taken from the index slice where cancer eventually developed (we were certain that it contained a tumor for this slice only), while negative examples used a slice from the center of benign breasts and another randomly selected slice (this sufficed as we had a large number of benign exams) (see Methods, Localization of Cancer Lesions, and Fig. S4). To compensate for strong class imbalance, we enriched malignant cases and used a "focal loss" for training (21). A high gamma parameter in the focal loss (we used gamma = 5) reduces the loss for samples classified with high certainty for the correct class, while increasing it for those with low certainty. This mitigates class imbalance by focusing on the more challenging rare examples rather than the abundant easy examples. Details on data acquisition, preprocessing, harmonization, and demographics are provided in the Supplement.

Localization of Cancer Lesions

The 2D CNN estimates the probability of future cancer detection for each slice in an MRI volume, thus predicting the location of cancer as the slice with the highest probability. We used the segmentation of the tumor in the follow-up MRI (see Lesion Sizes previously) and coregistered it to the current MRI using the "NiftyReg" software (22) (see Supplement). We considered the machine's prediction "correct" if the highest predicted probability fell within one slice of the segmented lesion location; otherwise it was a "miss". We also used the resulting segmentation to compute lesion sizes, which were incorporated as features into the BI-RADS assessment process (see Table S1).

Cost and Benefit in Early Detection

A variety of measures are used to quantify performance in binary classification problems. All measures are defined based on four possible outcomes, namely the number of classifications that are true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Here, "positive" means that cancer will be present in the next exam and "negative" means that cancer will not be present. The traditional approach in the context of diagnosis is to quantify the tradeoff between sensitivity and specificity:

Sensitivity = TP/(TP + FN) (1)

Specificity = TN/(TN + FP) (2)

This is captured by the conventional receiver operating characteristic (ROC) curve. In the context of early detection, sensitivity can be seen as the "fraction of re-evaluated cancers", following our proposal to re-evaluate the highest-risk exams. These re-evaluations may lead to early detections, and therefore we take sensitivity as a measure of the 'benefit' of re-evaluations. On the flip side, re-evaluating exams comes at a 'cost' in the sense of additional work for radiologists, which we quantify with the re-evaluation rate:

Re-evaluation rate = (TP + FP)/(TP + FP + TN + FN) (3)

Due to the low prevalence of cancers, we note that the re-evaluation rate is approximately equal to 1-specificity (Fig. S5). Therefore, this cost-benefit tradeoff is approximately captured by the ROC curve.

An additional 'cost' is the potential recalls for a biopsy that will yield a benign pathology. The worst-case cost would be incurred if all re-evaluations resulted in a recall. This worst-case cost is captured by the false discovery rate (FDR):

False discovery rate = FP/(TP + FP) (4)

In practice, however, we expect fewer recalls upon re-evaluation. FDR can be selected based on the traditional performance metric for recalls, namely the positive predictive value (PPV) (23), since FDR = 1 - PPV. In statistics, sensitivity is
Figure 2. AI-estimated probability of developing breast cancer one year in advance from the current cancer-free breast MRI in a clinical screening population, and cost-benefit analysis. (a) Cross-validation ROC curve for 12-month cancer prediction (cross-validation performance). The suggested operating point for sensitivity is selected at 30% (circle), resulting in a specificity of 90%. (b) Distribution of future screening outcomes for all breasts based on AI-derived probability from the current cancer-free MRI. The histogram is in logarithmic scale to better visualize the low prevalence of screen-detected cancers (n = 115). (c) Trade-off between the false discovery rate (FDR) and sensitivity (effectively, a "precision-recall" curve). One can select the operating point based on the desired benefit (sensitivity, vertical arrow) or, alternatively, the acceptable cost (FDR, horizontal arrow). (d) Relation between FDR and AI probability to determine the decision threshold for re-evaluations. The operating point in panel (a) (circle) corresponds to a decision threshold of 0.64 in panel (b) (dotted line).
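The definitions in Equations (1)–(4) can be written out directly as code. This is a minimal sketch: the counts below are synthetic illustrations chosen at roughly the operating point discussed in the text (115 future cancers among 5080 breasts), not the study's actual confusion matrix. Note that the re-evaluation rate counts all flagged exams (TP + FP).

```python
# Plain restatement of Equations (1)-(4); counts are synthetic illustrations.

def sensitivity(tp, fn):
    return tp / (tp + fn)                   # Eq. (1)

def specificity(tn, fp):
    return tn / (tn + fp)                   # Eq. (2)

def re_evaluation_rate(tp, fp, tn, fn):
    return (tp + fp) / (tp + fp + tn + fn)  # Eq. (3): fraction of exams flagged

def false_discovery_rate(tp, fp):
    return fp / (tp + fp)                   # Eq. (4): equals 1 - PPV

# Low-prevalence illustration: 115 future cancers among 5080 breasts,
# with ~10% of exams flagged for re-evaluation.
tp, fn, fp = 35, 80, 473
tn = 4965 - fp

# At low prevalence the re-evaluation rate closely tracks 1 - specificity.
assert abs(re_evaluation_rate(tp, fp, tn, fn) - (1 - specificity(tn, fp))) < 0.01
```

The final assertion is the approximation invoked in the text: because benign breasts vastly outnumber future cancers, the flagged fraction of all exams is nearly the flagged fraction of benign exams.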
also known as "recall", while 1-FDR is known as precision (beware that "recall" carries a different meaning in statistics and radiology). Therefore, this cost-benefit tradeoff, captured by sensitivity and FDR, represents a precision-recall relationship, which is recommended in low-prevalence scenarios where ROC analysis is less appropriate (24,25).

Confidence Intervals

All confidence intervals represent 95% confidence. For ratios, they were computed using the Clopper-Pearson exact method. AUC-ROC values were computed using bootstrapping with replacement.

RESULTS

We trained a network to predict the outcome of the next scheduled screening from the current MRI, distinguishing 115 future screen-detected cancers from 4965 breasts that remained benign. The distribution of AI-predicted cancer probability in this cohort is shown in Figure 2b. On this test set, the network
Figure 3. Localization and predicted probability of future cancers. Each of the four panels shows the healthy breast in the current MRI (left)
and the cancer in the subsequent MRI (right), with the cancer highlighted in yellow. In the current MRI, the slice is selected by AI, while in the
subsequent MRI, it is selected by the radiologist. The numeric value in the top-left corner indicates the predicted one-year cancer risk for this
breast. N indicates the number of screen-detected cancers in each category, totaling 115. Panels on top show true positive predictions with
matching (a) or non-matching localizations (b), and panels on the left show matching localization for successful (a) or missed (c) early
detections. (Color version of figure is available online.)
achieved an area under the receiver operating characteristic curve (ROC–AUC) of 0.72 (CI: 0.67–0.77, N = 115, standard deviation across the 10 folds = 0.07) when evaluating individual breast MRIs (Fig 2a). When evaluating at the exam level (n = 112, with 3 exams having malignancy in both breasts), the AUC was 0.66 (CI: 0.61–0.71). This prediction task is considerably more challenging than diagnosis on the current exam, as it focuses on breasts deemed cancer-free by radiologists (92% of MRIs had BI-RADS ≤ 3), and cancers only become apparent within a year in 2% of cases (Fig 4), consistent with reported cancer rates (26).

We propose to re-evaluate high-risk cases that exceed a given probability of developing cancer as predicted by the AI (Fig 2b, dotted line). To determine this decision threshold, we consider the tradeoff between benefit and costs as defined in the Methods. If we select the desired benefit (sensitivity), we incur a cost in terms of radiologist time (re-evaluation rate) and a worst-case cost of recalls (false discovery rate, FDR). As an example, we selected a sensitivity of 30% (Fig 2c, vertical arrow), which means potentially detecting one-third of all cancers early. This results in an FDR of 96% (Fig 2c, horizontal arrow), which corresponds to a positive predictive value (PPV) of 6%. The cost in terms of radiologist time can be read from the ROC curve (Fig 2a, see Methods). At a sensitivity of 30%, we obtain a specificity of 90%, which corresponds to re-evaluating approximately the top 10% of all cases. At this sensitivity level, there were 35 true positives (early detections) and 80 false negatives (N = 115 total). When evaluating at the level of exams, a 10% re-evaluation rate could potentially detect cancer earlier in 23% of cancers in this cohort (circle in Fig. S6b).

Cancer Localization

The network assigns a probability of future cancer presence to each MRI slice, potentially guiding radiologists during re-evaluation for decision referral. We evaluated the accuracy of this localization using the index slice as ground truth (see Methods, Localization of Cancer Lesions).

Examples of correct vs. incorrect AI localization are shown in Figure 3 (left vs. right). First, we evaluated localization independently of classification. Overall, the AI selected the correct location for future cancer in 57% of cases (66/115, CI: 47.8–66.5%). Next, we analyzed localization separately in true positive and false negative early detections. Of the 35 true positives, 25 had correct localization (Fig 3a), indicating the AI correctly localized 71% of the cases recommended for re-evaluation (25/35, CI: 53.7–85.4%; reported in Fig 4: "Correct location and detection"). These may be easier for radiologists to detect upon re-evaluation. In 10/35 true positive cases, the network did not select the correct slice, focusing instead on a different mass (Fig 3b). Closer examination of these 10 cases revealed that in all instances, a visually more suspicious mass
Figure 4. Summary of early detection and localization results. Each circle represents the total number of breasts examined for screening. Areas are scaled to the fraction of cases. Left: of all benign exams, most will remain benign at the next screening exam (green area) and a small fraction will have a cancer diagnosis (2%, in orange). The AI tool suggests re-evaluating 10% of breasts (blue circle). Center: of all cancers that will be detected in the subsequent screening exam, the AI tool recommends re-evaluating 30% (blue overlap: "AI: Correct detection upon re-evaluation"). The AI tool also correctly flagged the location where cancers would be found the next year in 57% of all cancers (red circle: "AI: Correct location"). Right: of the correct detections recommended for re-evaluation (blue circle), a large portion were also correctly localized (71% overlap between the red and blue circles: "AI: Correct location and detection upon re-evaluation"). (Color version of figure is available online.)
influenced the model's decision. Notably, in four of these 10 cases, the model located the correct lesion with a probability above the high-risk threshold of 0.64 (Fig. S7).

Among the 80 false negatives, the AI selected the correct slice in over half (41/80, Fig 3c). While these cases had a probability below the detection threshold (Fig 2a), they likely had evidence of future malignancy at the correct location. For some false negatives, the AI did not correctly locate the future cancer (Fig 3d), representing genuinely challenging cases with no obvious evidence of future malignancy reported by the AI.

Figure 4 summarizes the proposed approach and results. After the initial reading by the radiologist, BI-RADS 1–3 cases are evaluated by AI (Fig 4, green). Of these, 2% develop cancer by the next year's exam (Fig 4, orange; 27.3% of these were BI-RADS 3). The AI ranks cases by the probability of developing cancer. If the top 10% were selected for re-evaluation, a radiologist would see 30% of all cancers detected in the subsequent exams (Fig 4: "AI: Correct detection"). AI-flagged regions of concern contained the correct cancer location in 57% of cases.

BI-RADS features were evaluated on pre-diagnosis and diagnostic MRIs (Table S1), separated by AI-determined probability. The AI model was more likely to assign low probability to images without a visual correlate of cancer (Table S1, note 1, "No visual correlate": 38% vs. 6% of cases, z = 3.5, p = 0.00046). A greater percentage of mass-type lesions was classified as high-probability by the model (Table S1, note 2, 21% vs. 6%, z = 2.3, p = 0.02). There were no notable differences between high- and low-probability AI predictions in terms of shape, margin, internal enhancement, T2 signal, or distribution between focus- and mass-type lesions. Similarly, non-mass enhancement lesions exhibited no notable differences in their characteristics across the two probability groups. We did not find any significant difference in cancer pathology between high- vs. low-probability AI predictions (Fig. S9).

Finally, we noted that neither age nor family history of cancer was predictive of cancer in this patient sample (age: rank-sum test, p = 0.09, W = 1.71; family history: Chi-square test statistic = 0.43, p = 0.51, df = 1). AUC-ROC was not significantly different when demographic information was held constant (Fig. S10).
detection is diagnosing an interval change. Such an interval change is considered suspicious and guides management for all these cases.

It is important to note that the sensitivity of 30% reported here is for cancers that would otherwise have remained undetected until the next exam. These are additional cancers above and beyond those that have already been detected with high sensitivity by the radiologist. We have selected an operating point for the AI based on a desired sensitivity and reported the resulting cost in terms of re-evaluation rate. Alternatively, we could have selected an acceptable cost and noted the resulting benefit. For instance, the recommended benchmark for radiologists currently is a PPV of 15% for tissue diagnosis and a PPV of 4.4% for abnormal interpretation (27). At a sensitivity of 30%, the PPV for re-evaluation based on the AI would have been 6%. If radiologists recalled only half of these re-evaluated cases, they would approximate the recommended PPV for tissue diagnosis and detect at least an additional 15% of tumors, which is a clinically meaningful improvement.

Within a high-risk population, we observed an ROC–AUC of 0.72 in predicting cancer at a 1-year follow-up (0–15 months) using MRI. It is worth noting that recent studies utilizing AI for mammography have reported AUC values ranging from 0.66 to 0.84 in predicting outcomes 1–2 years in advance (28–30). An independent validation reported a 1-year prediction of 0.62–0.71 for various AI models (31). Similar AUC results (0.68–0.73) were obtained in a multi-institutional validation study employing MIRAI (32). However, it is crucial to recognize that all these studies were conducted within a general mammographic screening population. In contrast, predicting cancer development within the high-risk population may be more challenging. This population undergoes highly sensitive yearly MRI screening (33), and cancers have already been removed earlier through mammographic screening. We are not aware of short-term prediction studies based on MRI in this population. The only available study reported an AUC of 0.63 at a 5-year follow-up (34), highlighting the complexity of this task compared to the broader mammography population. We have used deep-learning methods here, but it is worth noting that traditional machine-learning techniques may still be beneficial for lesion characterization, diagnosis, and segmentation, especially when only smaller datasets are available (35–37).

Artificial intelligence research often focuses on algorithm performance rather than clinically relevant outcomes (38). For example, most cancer diagnosis studies report only ROC–AUC, and MRI risk prediction studies often focus on AUC-ROC (34) without considering clinical workflow impact. More recent studies report the concordance index (c-index) (28). These metrics quantify AI performance but do not directly address the effort-benefit tradeoff or the implications for early detection. For breast cancer, early detection is crucial for improving treatment outcomes, making even small increases impactful. However, re-evaluating MRIs takes effort. Therefore, the benefit of early detection (added sensitivity) must be balanced against the effort of the re-evaluation rate, approximately 1 minus specificity at low prevalence. We suggest that a 10% re-evaluation rate might be acceptable if the benefit is 30% earlier cancer detection. The ROC curve, approximating an effort-benefit curve, allows selecting a different tradeoff if desired.

Limitations of this study include a relatively small number of screen-detected cancers and the inclusion of only sagittal scans from one clinical site. Performance was reported primarily for individual breasts, as we suggest direct re-evaluation of the relevant breast. However, it is more difficult to declare a patient cancer-free in both breasts, which explains the drop in AUC for exam-level performance. AI performance and robustness will likely improve with the higher-resolution axial MRIs now routine in clinical practice, with the use of prior-year MRIs to assess lesion changes, and with the growing multi-site datasets necessary for robust deep learning. Nevertheless, this study provides proof-of-principle and baseline performance for early detection.

DATA AVAILABILITY

Datasets analyzed in the current study are not public due to patient confidentiality. However, risk prediction and outcome information for statistical evaluation of the results are available together with the code.

DECLARATION OF COMPETING INTEREST

All authors declare no financial or non-financial competing interests.

ACKNOWLEDGMENT

We want to thank Joanne Chin for extensive and thorough proofreading of earlier versions of this manuscript.

AUTHOR CONTRIBUTIONS

LH designed the computational methods, analyzed the data, programmed the network, generated figures, and wrote the manuscript. YH performed all the image preprocessing. HAM contributed to the design of the study and models, as well as writing the manuscript. MH segmented images. DM provided imaging and clinical data. LCP designed the overall approach and analysis methods and wrote the manuscript. EJS provided imaging and clinical data, evaluated the predictions of the network, evaluated BI-RADS features on all cancers, and edited the manuscript. SEW and KP provided extensive input to the manuscript. EM contributions include formulating the overall study design, data anonymization and curation, result interpretation, and manuscript review.