
SCIENCE ADVANCES | RESEARCH ARTICLE

COMPUTER SCIENCE

Breaking medical data sharing boundaries by using synthesized radiographs

Tianyu Han1, Sven Nebelung2, Christoph Haarburger3, Nicolas Horst4, Sebastian Reinartz1,5, Dorit Merhof4,6,7, Fabian Kiessling6,7,8, Volkmar Schulz1,6,7*†, Daniel Truhn3,5*

Copyright © 2020 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).

Computer vision (CV) has the potential to change medicine fundamentally. Expert knowledge provided by CV can enhance diagnosis. Unfortunately, existing algorithms often remain below expectations, as databases used for training are usually too small, incomplete, and heterogeneous in quality. Moreover, data protection is a serious obstacle to the exchange of data. To overcome this limitation, we propose to use generative models (GMs) to produce high-resolution synthetic radiographs that do not contain any personal identification information. Blinded analyses by CV and radiology experts confirmed the high similarity of synthesized and real radiographs. The combination of pooled GM improves the performance of CV algorithms trained on smaller datasets, and the integration of synthesized data into patient data repositories can compensate for underrepresented disease entities. By integrating federated learning strategies, even hospitals with few datasets can contribute to and benefit from GM training.



INTRODUCTION

The application of computer vision (CV) in medicine promises to personalize diagnosis, decision management, and therapy based on the combination of patient information with the knowledge of thousands of experts and the outcomes of billions of patients. In recent years, scientific effort has focused on applications of CV in medicine, in particular in radiology (1). Where there has been progress toward this vision of an omniscient radiological CV, this has mostly been anticipated by corresponding technical advances in the field of CV on natural images. A prominent example is convolutional neural networks (CNNs), which had their breakthrough when the performance of AlexNet surpassed more conventional CV algorithms in 2012 (2). Since then, CNNs have matched and even surpassed human performance on natural image recognition tasks (3). Similar developments took place in medicine, where CNNs performed comparably to experts in computed tomography (CT) screening for lung cancer (4) and retinal disease detection (5). However, human performance in CV on medical images has so far only been achieved but not surpassed. Whenever human performance in CV on medical images was achieved, large datasets were used, often pooled from many sites, containing thousands of images. Going a step further and surpassing human performance in CV on natural images, however, always required even larger databases containing up to billions of natural images (6).

Unfortunately, collecting and sharing such large quantities of medical images seem inconceivable, caused, in part, by their insufficient public availability. Even if the combined data worldwide reach billions of images, as in the case of thoracic radiographs, patient privacy issues prohibit combining data from multiple sites. This is even more conspicuous given that the majority of patients are willing to share their data for research purposes if adequate measures have been taken to protect their privacy (7). Secure ways to share and merge medical images are essential for the development of future CV algorithms (8).

Federated learning has gathered attention and is suitable where data sharing is hindered by privacy considerations. In this paradigm, a central model is updated by exchanging encrypted gradients or weights between global and selected models (9). To further improve privacy in medical applications, a fraction of weights or gradients within local models can be blurred by injecting random noise, i.e., differential privacy. Such a random module has been successfully integrated into a federated brain segmentor (10). However, in conventional federated learning settings, the central instance cannot inspect the raw training data due to privacy concerns, and hence, modeling tasks become challenging.

Another promising solution to overcome data sharing limitations is the use of generative adversarial networks (GANs), which enable the generation of an anonymous and potentially infinite dataset of images based on a limited database of radiographs. GANs are a special class of neural networks that were first introduced by Goodfellow et al. (11) in 2014 and have since been advanced to generate high-resolution, photorealistic synthetic images (12). While the first implementations of GANs made it possible to synthesize unconditioned images, the development and usage of informative priors to drive generators that output conditional samples are desired in medical applications. A common choice for such a conditional prior is an existing image, as used in pix2pix (13) and Cycle-GAN (14). Recently, Cycle-GAN–based networks have gained attention in the medical imaging community due to their capability of achieving intermodality image transitions. On the basis of Cycle-GAN frameworks, researchers such as Wolterink et al. (15) and Chartsias et al. (16) successfully demonstrated bidirectional CT–magnetic resonance imaging (MRI) transitions in both brain and heart imaging. Furthermore, Zhang et al. (17) introduced a segmentor-based shape consistency term to the Cycle-GAN loss and achieved realistically looking volumetric CT-MRI data transitions. The performance of

1Physics of Molecular Imaging Systems, Experimental Molecular Imaging, RWTH Aachen University, Aachen, Germany. 2Department of Diagnostic and Interventional Radiology, University Hospital Düsseldorf, Düsseldorf, Germany. 3Aristra GmbH, Berlin, Germany. 4Institute of Imaging and Computer Vision, RWTH Aachen University, Aachen, Germany. 5Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany. 6Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Germany. 7Comprehensive Diagnostic Center Aachen (CDCA), University Hospital RWTH Aachen, Aachen, Germany. 8Institute for Experimental Molecular Imaging, RWTH Aachen University, Aachen, Germany.
*These authors contributed equally to this work.
†Corresponding author. Email: [email protected]
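The differential-privacy idea mentioned above, blurring a fraction of the locally computed update with random noise before it is shared, can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation; the function name, clipping bound, and noise scale are hypothetical parameters.

```python
import random

def privatize_update(gradients, clip_norm=1.0, noise_std=0.5, seed=None):
    """Clip a local gradient vector to a fixed L2 norm, then add
    Gaussian noise, so that the shared update leaks less information
    about any single training example (differential-privacy-style
    blurring of local model updates)."""
    rng = random.Random(seed)
    norm = sum(g * g for g in gradients) ** 0.5
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in gradients]
    # Noise is added AFTER clipping so that its scale can be calibrated
    # to the known sensitivity (the clip bound).
    return [g + rng.gauss(0.0, noise_std) for g in clipped]

# Only the noisy, clipped update would leave the hospital:
noisy = privatize_update([3.0, 4.0], clip_norm=1.0, noise_std=0.1, seed=0)
```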

Han et al., Sci. Adv. 2020; 6 : eabb7973 2 December 2020 1 of 11



segmentors and GANs was boosted by shape consistency and online augmentation, respectively. Nevertheless, image-based conditioning always carries the risk of leaking patient-sensitive data to the generator during the training process.

Here, we propose to use generative models (GMs) on the basis of convolutional GANs (18) to break the boundary of sharing medical images and to enable merging of disparate databases without the limitations that are now restricting the collection of radiographs in a public database (see Fig. 1). To demonstrate the performance of our concept, we show that fully synthetic and thus anonymous images can be generated, which look deceivingly real—even to the expert's eye—and that these images can be used in the medical data sharing process. Our concept proposes how medical images or data can be shared in the future.

RESULTS

Generation of synthesized radiographs
Generating synthesized two-dimensional images in high resolution is a nontrivial task and has only recently been made feasible by using progressive growing during training (12) or by using large-scale networks that demand massive amounts of computing power. As the computing power required for the latter approach is, in general, not accessible to most hospitals, we used progressive spatial resolution growing during training of our networks. Thus, the GAN was trained by starting with a spatial resolution of 4 × 4 and stepping up in powers of 2 (8 × 8, 16 × 16, 32 × 32, 64 × 64, 128 × 128, 256 × 256, 512 × 512) to a spatial resolution of 1024 × 1024.

We measured the time needed to train a GAN on a dataset of 112,120 radiographs with a hardware setup that is accessible to any small hospital: We used a desktop computer with an Intel Xeon(R) E5-2650 v4 processor (Intel, Santa Clara, CA) and an Nvidia Tesla P100 16 GB GPU (Nvidia, Santa Clara, CA). Completely training the GAN with this setup to generate synthesized x-rays with a spatial resolution of 256 × 256 took 60 hours. Continuing the training to generate synthesized x-rays of spatial resolutions 512 × 512 and 1024 × 1024 took 114 and 272 hours of computational time, respectively. Once the training had finished, inference, i.e., the generation of synthesized radiographs, was much faster, with rates of 67,925, 41,379, and 4511 generated radiographs per hour at the three spatial resolution stages. Sample images are shown in fig. S4A for a spatial resolution of 256 × 256. Further images for spatial resolutions of 512 × 512 and 1024 × 1024 are given in the Supplementary Materials.

We have chosen the multiscale structural similarity (MS-SSIM) as a metric (19) to detect a possible mode collapse of our GAN (i.e., missing diversity in the images). The MS-SSIM has been successfully used in predicting perceptual similarity judgments of humans. A lower MS-SSIM reflects perceptually distinct samples and proves the high diversity of a dataset. In fig. S2, we depict the MS-SSIM of 1000 randomly selected pairs of samples within a given pathology class. As can be seen, the overall MS-SSIM among synthesized pairs is comparable to that of real sample pairs.

Ability of human readers to distinguish synthesized radiographs from real x-ray images
To test the quality of the synthesized radiographs (i.e., radiographs synthesized by the generator), six readers were each presented 50 synthesized radiographs and 50 radiographs of real patients in randomized order, and the readers were separately tasked with deciding whether the presented radiograph was real or synthesized. The tests were repeated with spatial resolutions of 256 × 256, 512 × 512, and 1024 × 1024, resulting in a total of 18 tests with 100 radiographs each.

To assess whether experience with machine learning or radiological expertise was necessary to identify synthesized radiographs, the readers were grouped and chosen as follows: group 1 consisted of three readers with a background in CV (readers 1, 2, and 3, who had 4, 2, and 5 years of experience in CV, respectively), while group 2 consisted of experienced radiologists (readers 4, 5, and 6, who had 4, 19, and 6 years of experience in general radiology, with no dedicated specialization in thoracic radiology).

Accuracies in differentiating the synthesized images from the real images at a spatial resolution of 256 × 256 were 60 ± 5% for group 1 and 51 ± 5% for group 2. Generating convincing radiographs at higher spatial resolutions proved more difficult, and experts were able to distinguish real from synthesized radiographs
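The MS-SSIM check described above compares randomly drawn image pairs; if generated samples are too similar to one another, the average similarity rises, signaling mode collapse. The paper uses the windowed, multiscale MS-SSIM (19); the sketch below only illustrates the underlying single-scale SSIM formula computed globally over a whole image, which is a simplification, not the authors' implementation.

```python
def ssim_global(x, y, dynamic_range=255.0):
    """Single-scale, global SSIM between two equally sized grayscale
    images given as flat lists of pixel values. MS-SSIM additionally
    applies SSIM in local windows and across several scales; this
    global variant only illustrates the core formula."""
    assert len(x) == len(y) and len(x) > 1
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2  # standard stabilization constants
    c2 = (0.03 * dynamic_range) ** 2
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)
    vy = sum((b - my) ** 2 for b in y) / (n - 1)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def mean_pairwise_similarity(images, pairs):
    """Average similarity over index pairs; a value close to 1 across
    many random pairs would hint at mode collapse (low diversity)."""
    return sum(ssim_global(images[i], images[j]) for i, j in pairs) / len(pairs)

# Tiny toy "images" (flat 2x2 pixel lists), purely illustrative:
imgs = [[10, 20, 30, 40], [12, 18, 33, 41], [200, 10, 5, 90]]
diversity_proxy = mean_pairwise_similarity(imgs, [(0, 1), (0, 2), (1, 2)])
```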

Fig. 1. Concept of constructing a public database without disclosing patient-sensitive data. The GAN in each hospital consists of a generator G and a discriminator D. During training, patient-sensitive data (shown in red) are never exhibited to the generator G directly. Patient-sensitive data are only exhibited to the discriminator D while it is trying to differentiate between real and synthesized radiographs. After training is completed, only the generators G are transferred to a public database and can be used to generate synthesized radiographs.
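The privacy argument in Fig. 1 rests on the data flow of adversarial training: real radiographs enter only the discriminator's update, while the generator learns solely from the discriminator's feedback on its own outputs. A minimal mock of this loop makes the flow explicit; the stub classes and method names are illustrative (no actual learning happens), not the authors' code.

```python
import random

class Generator:
    """Stub generator: maps a latent vector to a fake 'image'."""
    def __init__(self):
        self.seen_real_images = []  # stays empty by construction
    def sample(self, rng):
        return [rng.random() for _ in range(4)]
    def update(self, feedback):
        pass  # would ascend the discriminator's score on fakes

class Discriminator:
    """Stub discriminator: the only component shown real data."""
    def __init__(self):
        self.seen_real_images = []
    def update(self, real_batch, fake_batch):
        self.seen_real_images.extend(real_batch)
    def score(self, batch):
        return [0.5 for _ in batch]

def train(real_images, steps=3, seed=0):
    rng = random.Random(seed)
    g, d = Generator(), Discriminator()
    for _ in range(steps):
        fakes = [g.sample(rng) for _ in range(2)]
        d.update(real_images, fakes)  # D touches patient data
        g.update(d.score(fakes))      # G only sees D's feedback
    return g, d

g, d = train([[0.1] * 4, [0.2] * 4])
# Only the generator g would be uploaded to the public database.
```

The design point is that after training, shipping `g` alone transfers no component that ever had direct access to patient images.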




more easily at spatial resolutions of (512 × 512)/(1024 × 1024), with accuracies of 67 ± 17%/82 ± 4% for group 1 and 65 ± 5%/77 ± 13% for group 2. Thus, radiologists and CV experts performed similarly when identifying synthesized radiographs at high resolutions when judged by their accuracy. As shown in table S2, the sensitivity, i.e., the correct identification of a synthesized radiograph, was higher than the specificity, i.e., the correct identification of a real radiograph. This is probably attributable to the fact that some synthesized radiographs show telltale signs of synthesization (see fig. S4E) and thus allow for a more reliable identification. While radiologists predominantly detected errors in anatomical details such as bone shape or rib cage morphology, CV experts tended to focus more on tiny details such as wave-like patterns (see fig. S4E). There was no interreader agreement between the readers at the spatial resolution of 256 × 256, underlining the fact that identification of synthesized radiographs at this spatial resolution stage is hardly possible (see Table 1). At higher spatial resolutions, the interreader agreement was consistently higher, in line with the higher accuracy in identifying synthesized radiographs. These results were observed under restrictions: The radiographs were assessed on conventional 24-inch computer monitors without zooming into the images. The radiographs were presented in a given order: first the low-resolution radiographs, followed by the mid- and then the high-resolution radiographs. The readers were not allowed to go back and change previous decisions. When these restrictions were lifted, accuracy in determining whether a radiograph is real or synthesized increased significantly. A radiologist with 9 years of experience who was given unlimited amounts of time and who first examined the high-resolution radiographs on specialized radiological monitors to identify typical GAN-related artifacts before going back to the 256 × 256 radiographs was able to identify synthesized radiographs in 86% of cases.

The difficulty of generating convincing radiographs at high resolutions was understandable, as the task becomes harder with the growing number of pixels: even for low-resolution grayscale images of 100 × 100 pixels and 8-bit grayscale depth, the number of possible different images amounts to 256^(100 × 100). The GAN was tasked with identifying the subset of real-looking images out of this set, which grows exponentially in size with increasing spatial resolution. Not unexpectedly, this process was not perfect, and although the GAN managed to capture the general appearance of a real radiograph at high resolutions, small details revealed the synthesized origin. After having performed the tests and with knowledge of the ground truth, the readers conferred to identify the typical patterns that allowed for the differentiation of real from synthesized images at high resolution. Among these were unphysiological configurations of the pulmonary vessels, aberrant bone structures, and subtle periodic, wave-like patterns superimposed on the lung parenchyma, which reflect the network's difficulty in generating fine details (see fig. S4E).

Ensuring non-transference of private information
To exclude the possibility that the GAN memorizes and subsequently merely reproduces the given training examples, 1000 randomly synthesized radiographs were generated, and their nearest neighbors in the database of real radiographs were sought according to the structural similarity index (SSIM). All 1000 radiographs along with their respective three nearest neighbors were then plotted, and a board-certified radiologist assessed whether an entity from the database of real radiographs had been duplicated.

In this set of 1000 randomly drawn GAN images, we did not find a single instance in which the synthesized radiograph looked identical to its closest neighbor in the real dataset (fig. S4B). When assessing similarity in terms of the SSIM, we did not find a single case in a set of 10^5 randomly drawn synthesized radiographs in which a digital twin was found among the real radiographs. In addition, the reader was asked to examine the synthesized radiographs for local information that might lead to the identification of a specific patient, e.g., an anatomic variant unique to a patient or a necklace with a name on it. No such information was found in this set of 1000 images.

We reason that the duplication of images from the database of real radiographs is unlikely. The GAN consists of a generator and a discriminator network. Only the discriminator network is ever in direct contact with patient images. The generator is never directly presented a patient image in the training process. Thus, only the part of the architecture (the generator) that has never been presented with real patient images is transferred to the central database.

Performance of classifiers trained on synthesized radiographs
To demonstrate the feasibility of our approach in a clinical setting, as shown in Fig. 1, we decided to apply our concept to the detection of pneumonia. In the United States alone, pneumonia accounted for over 500,000 visits to emergency departments and over 50,000 deaths in 2015 (20). The Radiological Society of North America (RSNA) has recently hosted a challenge to automatically detect pneumonia in x-rays using machine learning algorithms. Often, local hospitals can only gather medical datasets with limited diversity due to a specific patient population with associated pathologies. However, the diversity of the datasets is crucial to the performance of deep learning algorithms due to the complex features of a specific pathology. By using our approach of pooled GANs, different patients from different locations can be jointly considered and thus boost the diversity of the local dataset without violating any privacy protection

Table 1. Real/synthesized radiographs test. Accuracy and interreader agreement for the group of three CV experts, three radiologists, and all readers when differentiating whether the presented radiograph is real or synthesized.

                 256 × 256                  512 × 512                  1024 × 1024
                 Accuracy, %  Fleiss' kappa  Accuracy, %  Fleiss' kappa  Accuracy, %  Fleiss' kappa
CV experts       60 ± 5       −0.03          67 ± 17      0.07           82 ± 4       0.46
Radiologists     51 ± 5       0.10           65 ± 5       0.18           77 ± 13      0.39
All readers      55 ± 7       0.00           67 ± 14      0.07           80 ± 10      0.37
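The interreader agreement in Table 1 is measured with Fleiss' kappa. Under its standard definition (this is a textbook implementation, not the authors' analysis code), it can be computed from a ratings matrix as follows:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a ratings matrix: ratings[i][j] is the number
    of raters who assigned item i to category j. Every row must sum
    to the same number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Observed agreement for each item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(n_categories)]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: three raters, two categories (real/synthesized),
# four radiographs; raters disagree on the last item only.
kappa = fleiss_kappa([[3, 0], [0, 3], [3, 0], [2, 1]])
# kappa == 0.625
```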




laws. We simulated a local dataset with limited diversity by using a subset of the RSNA dataset with 1000 real x-rays for training, of which only 5% exhibited signs of pneumonia. The resulting classifier achieved an area under the curve (AUC) of 0.74 on a test set of 6000 previously unseen x-rays from the RSNA dataset (see Fig. 2B). To alleviate the limiting scarcity of pathological images and improve the classifier performance, we used our public database of generated images [trained on the National Institutes of Health (NIH) and the Stanford datasets]. From this database, we randomly sampled a total of 500 synthesized x-rays: 100 that exhibited signs of pneumonia and 400 that exhibited no signs of pneumonia. These were then joined with 500 real x-rays from the RSNA subset (450 healthy and 50 pneumonia), which resulted in a set of 1000 x-rays for training of the classifier (healthy: 450 real and 400 synthesized; pneumonia: 50 real and 100 synthesized). When trained on this artificially enriched set of x-rays, the performance of the classifier increased to an overall AUC of 0.81. We hypothesize that the improvement in performance was due to the greater diversity of pathological cases produced by the generator: As reflected by the lower MS-SSIM in Fig. 2A, the GAN-augmented dataset (MS-SSIM, 0.18 ± 0.09) achieved a higher level of diversity than the local RSNA subset (MS-SSIM, 0.24 ± 0.12). Note that for both cases, we chose to train the classifiers on the same number of x-rays to exclude any potential influence that the size of the training set might have had on the performance of the classifiers. Similarly, improved performance measures were found for sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and F1 score (see Fig. 2C). This experiment thus demonstrated that our pooled dataset approach is capable of improving deep learning classifiers by enriching scarce datasets.

To simulate the data merging scenario outlined in Fig. 1, we compared the results of a comprehensive pathology classification, i.e., cardiomegaly, effusion, pneumothorax, atelectasis, edema, consolidation, and pneumonia, with a classifier solely trained on the NIH-GAN versus a classifier trained on merged synthesized images from different sources. Generated samples of our Stanford-GAN can be found in fig. S5. The average values of the AUC, accuracy, sensitivity, and specificity all increased significantly after integration of the synthesized external dataset (see Fig. 3). This demonstrated that the merging of multiple databases of generated radiographs can boost the performance of classifier networks and can alleviate the performance bottleneck due to insufficient amounts of training data.

The performance improvements were achieved without any techniques of domain adaptation, i.e., without any efforts to homogenize the appearance of the radiographs from different databases. Adopting these techniques not only offers an opportunity for


Fig. 2. Pooled GAN training can improve pneumonia detection by enriching the diversity of the dataset. (A) Distributions of MS-SSIM of 2450 randomly selected pneumonia-positive pairs. Higher diversity of pneumonia cases in the GAN-augmented dataset is confirmed by a lower MS-SSIM (GAN-augmented MS-SSIM: 0.18 ± 0.09 versus RSNA subset MS-SSIM: 0.24 ± 0.12). (B) The performance of the classifier when trained on 1000 x-rays from the GAN-enriched dataset (healthy: 450 real and 400 synthesized; pneumonia: 50 real and 100 synthesized) reaches an AUC of 0.81 in pneumonia detection, outperforming that of a classifier trained on 1000 real x-rays (healthy, 950; pneumonia, 50), which reaches an AUC of 0.74. (C) Similarly, improved performance measures were found for sensitivity (Sens), specificity (Spec), accuracy (Accu), PPV, NPV, and F1 score. We used a test set of 6000 x-rays randomly sampled from the RSNA dataset to calculate those scores. The GANs used to generate the synthesized x-rays were trained on the NIH and Stanford datasets.
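The scores in Fig. 2C follow their standard definitions from a binary confusion matrix (pneumonia-positive taken as the positive class). The sketch below uses hypothetical counts, not the study's actual predictions; note that AUC cannot be derived from counts alone, as it needs the full score distribution.

```python
def classifier_metrics(tp, fp, tn, fn):
    """Standard threshold-dependent measures from confusion counts:
    tp/fp/tn/fn = true/false positives and negatives."""
    sens = tp / (tp + fn)               # sensitivity (recall on diseased)
    spec = tn / (tn + fp)               # specificity (recall on healthy)
    ppv = tp / (tp + fp)                # positive predictive value
    npv = tn / (tn + fn)                # negative predictive value
    accu = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sens / (ppv + sens)  # harmonic mean of PPV and recall
    return {"sensitivity": sens, "specificity": spec, "ppv": ppv,
            "npv": npv, "accuracy": accu, "f1": f1}

# Hypothetical counts on a 6000-image test set:
m = classifier_metrics(tp=40, fp=60, tn=5840, fn=60)
```

Note how accuracy alone would look excellent here purely because healthy cases dominate, which is why the paper reports the full panel of measures.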






Fig. 3. Using pooled synthesized data from different sites, classification performance can be increased. To simulate the scenario in Fig. 1, two classifiers were trained and compared: a classifier solely trained on anonymous radiographs generated with the NIH-GAN (blue) and a classifier trained on the pooled anonymized dataset generated with the NIH-GAN and the Stanford-GAN (red). The schematic of the data selection process is shown in (A). AUC, sensitivity, and specificity for the seven diseases are given in (B). In particular, the classification performances of formerly problematic cases such as edema, consolidation, and pneumonia were boosted by merging data from multiple sites (red arrows).

further performance improvements through domain adaptation—now an active area of research (21)—but would also most likely make the classifier network more robust to deployment in different environments. This is an important aspect in the translation of CV algorithms from the workbench to the clinic.

Federated averaging facilitates the training of local GANs
Large amounts of data are required to obtain robust results from GANs that are trained locally. This potentially limits the sites at which a GAN can be trained to large hospitals. Federated learning algorithms (22) offer a remedy to this limitation, as the GAN can be trained without the original images leaving the protected space of the hospital. One possible reservation is that, because of the uncontrollable gradient/model updates, it is difficult to detect adversarial attacks and protect against them (23). However, GAN-based federated learning has the advantage of offering an additional degree of freedom for the screening of databases by using confidence-calibrated checking (24) or manual inspection (25). We therefore investigated the use of federated learning in training one central GAN as an alternative to the pooled GAN approach.

To simulate hospitals with a limited amount of training data, we randomly sampled 20,000 patient radiographs from the Stanford CheXpert dataset and then partitioned them into 20 local clients, each receiving 1000 patient radiographs. We trained and compared the following models: a centralized "20k model," which was trained on 20,000 patient radiographs; a centralized "1k model," which was




solely trained on 1000 local radiographs; and a federated "20 × 1k model," which was trained federally (22) on 20 distributed datasets consisting of 1000 radiographs each (see Fig. 4A).

An important property of Wasserstein GANs is that their discriminator loss directly reflects the quality of generated samples (26). We therefore visualized the negative discriminator loss in Fig. 4B. As can be seen in Fig. 4B, because of insufficient training images, the centralized 1k GAN overfitted quickly, which led to unstable training. However, as indicated by a lower loss and Fréchet inception distance (FID) in Fig. 4 (B and C), the federally trained GAN (federated 20 × 1k) overcame this local overfitting issue and significantly outperformed the locally trained GAN. The loss curve of the federated GAN was smoother because it represented an average over local iteration losses.

Generated images as a visualization of what neural networks see
The images generated by the generator could be specifically controlled: By changing the part of the input vector signifying the disease, radiographs with specific pathologies could be generated. We used two techniques to visualize the disease-specific hallmark changes. First, the disease-specific entry in the input vector was gradually changed from 0 to 1, while all other entries were kept at 0. The generated images then showed the transition from healthy to diseased states and were stitched together to form an animation. Exemplary frames visualizing the transition are given in fig. S4C for cardiomegaly and effusion. With cardiomegaly, we observed an enlargement of the projected heart shape, reflecting the expected radiological change. Similarly, effusion showed the typical opacification of the lower lung field, mirroring the collection of fluid there. Animations for all of the 14 disease states are given in the Supplementary Materials.

Second, the pixel-wise difference image between the fully diseased and the healthy radiograph was calculated and superimposed on the healthy radiograph as a colormap (see Fig. 5A for a schematic of the process). Examples of such visualizations are given in Fig. 5B for all 14 pathologies.

One advantage of having full control over the disease state of the GAN radiographs is that any combination of diseases in a single radiograph can be generated by changing the corresponding entries in the input vector simultaneously. We found that the disease state as represented by the GAN transition reflected the underlying disease and was in good agreement with radiological expertise if many marked examples of this disease were present in the original dataset and if the disease-related changes occurred on a large scale of the radiograph (e.g., cardiomegaly or effusion) rather than on small patches at different sites (nodules).

To uncover correlations between disparate pathologies, we let the classifier rate the score of a specific pathology when the GAN was tasked with generating a synthesized radiograph of another disease


Fig. 4. Federated learning facilitates GAN training when facing insufficient amounts of local data. Hospitals can use federated learning algorithms to train a global GAN, and the central GAN deposit can serve as a hub. (A) Illustration of the GAN-related federated learning system. After local model initialization, local hospitals B and C (in red frames) were selected to update their local models. The global generator and discriminator were updated by the weights (w) transferred to the aggregation server (red arrows). All local models were subsequently redefined by the updated global GAN (blue arrows). The exchange of local and global weights continued until the global GAN converged. (B) Discriminator loss curves for the three trained Wasserstein GANs. The Wasserstein GAN trained by the federated averaging algorithm (federated 20 × 1k) outperformed the centralized GAN trained on only 1000 x-rays (centralized 1k) and performed comparably to the centralized 20k GAN. (C) FID evaluations of the GAN training process.
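The aggregation step sketched in Fig. 4A follows federated averaging (22): the server replaces the global weights with a dataset-size-weighted mean of the client weights. A minimal sketch of one aggregation round, with plain Python lists standing in for network tensors (not the authors' implementation):

```python
def federated_average(client_weights, client_sizes):
    """One FedAvg aggregation round: average each weight coordinate
    across clients, weighting client k by n_k / sum(n), where n_k is
    the number of local training images."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * size for w, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two clients with equal data (1000 radiographs each, as in the 20 x 1k setup):
global_w = federated_average([[0.0, 2.0], [1.0, 4.0]], [1000, 1000])
# global_w == [0.5, 3.0]
```

With equal client sizes, as in the paper's 20 × 1k partition, this reduces to a plain mean; the size weighting matters only when hospitals contribute unequal amounts of data.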




[Fig. 5, panels A to D (schematic): a generator receives a 512-dimensional random patient vector concatenated with a one-hot disease encoding; the generated healthy (null-vector) and diseased radiographs are subtracted, and the difference is overlaid on the healthy image. Disease labels shown: healthy, atelectasis, cardiomegaly, consolidation, edema, effusion, emphysema, fibrosis, hernia, infiltration, mass, nodule, pleural thickening, pneumonia, and pneumothorax.]

Fig. 5. Learned pathological features. (A) Generation of the disease-specific pixel map. A randomly chosen vector with 512 Gaussian distributed entries characterizes one specific patient. The GAN was tasked with generating a healthy and a diseased radiograph of that patient (cardiomegaly in this example). A subtraction map was generated to denote the changes brought about by the disease and was superimposed as a colormap over the generated healthy radiograph. (B) Disease-specific patterns generated by the generator for an exemplary randomly drawn pseudopatient. Red denotes higher signal intensity in the pathological radiograph, while blue denotes lower signal intensity. Note that for some diseases such as cardiomegaly and edema, the pattern looks realistic, while the GAN struggled with diseases that have a variable appearance and where ground truth data are limited, e.g., pneumonia. (C) Revealing correlations within generated pathological radiographs by the classifier trained on the real dataset. For each pathology, 5000 random synthesized radiographs with a pathology label drawn from a uniform distribution between 0.0 and 1.0 were generated. The images were then rated by the classifier network, and Pearson's correlation coefficient was calculated for each pairing of pathologies [shown in (C) with the GAN cardiomegaly label on the x axis and the cardiomegaly and fibrosis classifier output on the y axis in red and blue, respectively]. (D) Resulting correlation coefficients for all 14 × 14 pairings are displayed and color coded. Clustering on the x axis (i.e., the GAN label axis) was performed to group related diseases. The obtained clustered blocks are marked with white-bordered boxes.
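The analysis described in (C) reduces to computing Pearson's correlation coefficient between the label handed to the GAN and the score returned by the classifier. Below is a minimal, self-contained sketch with simulated scores; the toy score model and variable names are illustrative assumptions, not the study's data or code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the analysis in (C)/(D): for one pathology, draw the
# conditioning label given to the GAN from a uniform [0, 1] distribution,
# and simulate classifier scores that partly track that label.
n = 5000
gan_label = rng.uniform(0.0, 1.0, n)
score_same = 0.8 * gan_label + 0.2 * rng.normal(0.0, 0.25, n)   # related pathology
score_other = rng.normal(0.5, 0.25, n)                          # unrelated pathology

def pearson(a, b):
    """Pearson's correlation coefficient between two score vectors."""
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

print(round(pearson(gan_label, score_same), 2))   # high: the label is recovered
print(round(pearson(gan_label, score_other), 2))  # near zero: no relation
```

In the study's real pipeline, `gan_label` would be the conditioning value for one pathology and the score vectors would come from the trained DenseNet classifier; the resulting 14 × 14 coefficient matrix is what (D) displays.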

and calculated the Pearson's correlation coefficient (Fig. 5C). We found that the clustering of related pathologies based on the correlation coefficient agreed well with clinical intuition: infiltration, pneumonia, consolidation, effusion, and edema (all pathologies where lung opacity increases) were related to each other while being distant from, e.g., emphysema or pneumothorax (diseases that are associated with increased radiolucency). The magnitude of the diagonal elements in the 14 × 14 matrix in Fig. 5D directly reflects the quality of the generation of pathological images with our method. Diseases such as cardiomegaly (corr = 0.8) and effusion (corr = 0.7) could be reliably generated owing to their localized and predictable pathological features. However, the GAN trained on these particular datasets performed less reliably in generating infiltration (corr = 0.5), emphysema (corr = 0.5), and pneumonia (corr = 0.5). This might be due to the limited number of those cases in the datasets. For example, for pneumonia-labeled cases in the NIH dataset, only 1431 cases are positive and 110,689 are negative. As can be seen in Fig. 5D, the NIH generator cannot reliably generate pneumonia-related features. The nature of pneumonia was better captured by the Stanford-GAN in Fig. 5D, which shows the challenging cases from this particular dataset. This GAN was trained with a much higher number of 20,656 pneumonia cases in the Stanford CheXpert dataset.

DISCUSSION
In this study, we demonstrated that GANs can be used to generate deceivingly real-looking radiographs and to merge databases of




radiological images without disclosing patient-sensitive information. This helps to build large radiological image databases for the training of CV algorithms. While radiographs may, in principle, be abundantly available, universal access is, in general, severely restricted due to data protection laws: privacy concerns restrict the export of sensitive patient information to extramural institutions, and often, only a small fraction of the available data can be used (e.g., a patient consent form may not be universally available). In these cases, GANs that have been trained in-house may serve as a means to distribute the information contained within the database without actually providing a real snapshot of patient-sensitive data: only the weight distribution of the GAN needs to be transferred, and a representative synthesized dataset of millions of radiographs may be generated in reasonable computational time at a peripheral site. This is in contrast to the previous work of Shin and co-workers, in which lower-spatial-resolution synthesized images could be produced but always required recourse to the original patient images as inputs to an image-to-image translational network (13). Another group has previously demonstrated that synthesized chest radiographs can be used to augment training; see, e.g., Salehinejad et al. (27). However, they used a less advanced GM and were only capable of synthesizing radiographs of limited quality, which are of less clinical value. We demonstrated the feasibility of using GANs as a tool for effective oversampling when the pathology distribution within a medical dataset is highly imbalanced. In particular, we demonstrated how a deep learning classifier can benefit from using synthesized x-rays from a publicly accessible database in cases where only few instances of a particular disease are present. Moreover, our developed GANs could be used to visualize what the generator neural network sees and to reveal correlations between diseases. The synthesis of pathological radiographs and the subsequent analysis by classification networks reveal correlations between diseases. For example, there seemed to be a block of diseases (lower right corner of Fig. 5D) characterized by lung opacification, namely, effusion, consolidation, pneumonia, infiltration, and edema. This makes sense from a clinical viewpoint, as these diseases are clinically correlated, and similar observations hold for the remaining clusters in this figure. Cardiomegaly was an outlier in the sense that it was associated with seemingly only one disease from that block (effusion), but not the others. This again makes sense, considering that effusion can be a direct consequence of congestive heart failure. The correlation to edema was quite low for this case, which may indicate that edema was not described consistently in the radiological reports and was therefore difficult for the neural network to synthesize and classify.

A potential problem in training a GAN on-site exists when the set of training images is limited, as is the case in small community hospitals. In those cases, the GAN might not converge, and realistic-looking radiological images might not be produced. To overcome this problem, we proposed to use federated learning for training of the GAN: radiological images remained on-site and never left the premises, while only the weight updates were transferred to the central repository. We simulated this approach by splitting a set of 20,000 patients from the CheXpert dataset into 20 smaller datasets, and we found that the federated model significantly outperformed the locally constrained model.

One caveat when dealing with many smaller distributed databases is the potentially low quality of locally trained GANs. To prevent the central repository from being contaminated by inferior synthetic x-rays, we propose two possible remedies. One way of pursuing the approach of pooling locally trained GANs is to apply a quality criterion such as the FID or the inception score (28); locally trained generators failing such a criterion can be rejected from inclusion in the central GAN repository. Second, federated learning allows for training of a single global GAN with several smaller distributed databases, as demonstrated by Fig. 4A. In this way, several smaller databases can be combined and act as one large database without actually sharing the underlying patient information.

Attention needs to be paid to adversarial attacks on distributed learning systems. Models might be affected by poisoning attacks (29). Local gradients can be easily manipulated and distorted before being transmitted to central servers, and adversarial attacks might not be detected in the federated learning approach. Our GAN-based distributed learning approach offers passive and active robustness against adversarial attacks. The posterior distribution could be estimated (30), and the confidence threshold (24) of any given example in the local training set could be deduced. Such confidence thresholds could be used to detect and filter suspicious training examples to secure GAN training against dataset poisoning (29). In addition, adversarial training (31) is an efficient method to increase model robustness against adversarial perturbations. We demonstrated (in fig. S3) that the robustness of our radiograph classifier was significantly improved by adversarial training.

The concepts demonstrated in this work rely on two-dimensional images, but there is no principal restriction on the number of dimensions that the real and synthesized images are allowed to have, or even that the data have to consist of images only. Thus, the same concept could be translated to volumetric CT or magnetic resonance images, to fluoroscopy, to time series of volumetric data (e.g., contrast-enhanced CT), or even to imaging data in conjunction with clinical data (e.g., an MRI with associated expression profiles of laboratory tumor markers). However, because of the exponentially increasing size of the data, we expect that the problem of generating synthesized data of very high dimensionality is much more difficult and that a far greater number of real cases would be needed for the GAN to converge.

Diagnosis in the clinical setting usually relies on more than just imaging and comprises the patients' demographics, their medical history, and previous and ongoing treatments. Future work will investigate how to include these important parameters into our approach by letting the GANs generate not only radiographs but also accompanying clinical data such as laboratory values. However, to realize this, more training data are probably needed, as the data to be synthesized will have higher dimensions/degrees of freedom. Federated learning as presented here can help overcome those difficulties by providing the means to combine several distinct databases.

MATERIALS AND METHODS
Dataset and preprocessing
Three datasets were used in this study: first, the ChestX-ray dataset released by the NIH in 2017, containing 112,120 frontal radiographs of 30,805 unique patients. At the time of its publication, this dataset comprised 8 disease entities and was later updated to contain 14 pathologies (32). To ensure that no information leaked into the test set used for the evaluation of the algorithms, patient-wise stratification into training (21,528 patients, 78,468 radiographs, 70%), validation (3,090 patients, 11,219 radiographs, 10%), and test set (6,187 patients, 22,433 radiographs, 20%) was performed. The test




set was kept separately until the final testing of the algorithms. Detailed label statistics for the ChestX-ray14 dataset can be found in the "Preprocessing steps in CheXpert dataset" section in the Supplementary Materials and in table S3.

The second dataset used in this study is the CheXpert dataset, which was released by Irvin et al. (33) in January 2019. It contains 224,316 chest radiographs of 65,240 patients. This dataset was used to train a second GAN to demonstrate the feasibility of the proposed data sharing approach (see Fig. 1). A detailed explanation of the label preparation and statistics for the CheXpert dataset is given in table S3. Classification algorithms were tested on the NIH test set. Therefore, no subdivision of the CheXpert dataset into test, training, and validation sets was needed, and all available frontal radiographs of the CheXpert dataset (n = 191,027) were used for training of the GAN.

The third dataset used in this study is a dataset of x-rays released by the RSNA to host a challenge on pneumonia detection. We used this dataset to train a classifier for pneumonia detection and to test whether the inclusion of synthesized x-rays could improve the performance of said classifier.

Before training, the radiograph datasets (NIH and CheXpert) were downsampled to dedicated spatial resolutions, i.e., ranging from 4 × 4, 8 × 8, …, up to 2¹⁰ × 2¹⁰, and converted into separate files. Thus, each of those files contained all training radiographs at a fixed spatial resolution. The radiographs' intensity values were normalized to the range of [−1, 1] (12).

Model architecture and implementation
Two neural network architectures were used here. First, GANs as introduced by Goodfellow et al. (11) were adapted to incorporate an input condition (19) to selectively generate synthesized radiographs with a certain pathology. We used two different inputs to the networks: the conditional vector, which controls the type of disease present in the synthetic image, and the random noise vector, which determines which item from the set of possible x-rays is generated. Both vectors are concatenated and fed to the network as an input (19). As depicted in fig. S1C, such concatenation-based conditioning is equivalent to adding a bias to the hidden activations based on the conditional input (34). In addition, we also added an auxiliary classifier at the end of the discriminator and additional classification loss terms in the objective

L_C^G = 𝔼_{x̃∼ℙ_g}[−log P(C = c ∣ x̃)], L_C^D = 𝔼_{x∼ℙ_r}[−log P(C = c ∣ x)]  (1)

where c is the pathological class label.

To generate high-spatial-resolution images, we used progressive growing, a technique in which the GAN is trained in progressively higher-spatial-resolution stages (12). The network architecture resulting in a final spatial resolution of 1024 × 1024 is shown in table S5. We picked the leaky rectified linear unit (ReLU) (α = 0.2) and pixel norm (12) as the major activation function and normalization layer. Note that, instead of using the common tanh activation function, Karras et al. (12) suggested using a linear activation at the end of the generator. During training, we used a mini-batch size of 128 for spatial resolutions 4² to 32² and then decreased the batch size by a factor of 2 whenever the spatial resolution doubled, to account for the limited memory size: 64 × 64 → 64, 128 × 128 → 32, 256 × 256 → 16, 512 × 512 → 8, and 1024 × 1024 → 4.

Dedicated explanations of the techniques used in our GANs can be found in the "Network training details" section in the Supplementary Materials.

Second, a densely connected CNN with 121 layers (DenseNet-121) was used as a classifier. It was pretrained on 14 million natural images [ImageNet database (2)] and subsequently trained on the radiographs in this study. The architecture has been shown to achieve state-of-the-art performance on the ChestX-ray dataset (35) before. Implementations were done using TensorFlow 1.9.0 and PyTorch 1.1.0.

Training of the GANs
We trained two GANs on the basis of two separate datasets in a progressive growing strategy: on the NIH ChestX-ray14 dataset and the Stanford CheXpert dataset. Note that weights were initialized randomly. Training proceeded in repetitive stages: once training at one spatial resolution stage had stabilized after being presented a total of 600,000 real radiographs (with repetitions), the layers responsible for the next spatial resolution stage were gradually faded in, and training continued with another 600,000 radiographs during this fade-in stage (again with repetitions). In total, the discriminators of the GANs were each presented 12 million radiographs. The training scheme was chosen so that the GANs learned to first explore the large-scale pattern and overall contrast before focusing their attention on finer details.

To measure whether the images generated by the generator converged to real-looking images, we used the FID between a set of 10,000 real x-rays and 10,000 generated x-rays at each training epoch (28). We ensured an equal contribution from each pathological class by using a uniform distribution among the 14 classes, i.e., roughly 700 radiographs per class. To compute the FID, we extracted features of radiographs from the third pooling layer of the inception network (28). The FID score between real and synthesized radiographs was then computed according to

FID(x, g) = ‖μ_x − μ_g‖₂² + Tr(Σ_x + Σ_g − 2(Σ_x Σ_g)^(1/2))  (2)

where μ and Σ are the means and covariances of multivariate Gaussian distributions that model the feature distributions. We found that the FID decreased nearly monotonically, indicating that the general appearance of the generated images approaches that of the real x-rays. The corresponding figures depicting the evolution of the FID are given in fig. S1D.

Training of the classifier with real and synthesized data
All classifier models used validation-based early stopping with sigmoid binary cross-entropy loss as the criterion. No oversampling of underrepresented classes was used except for the experiment in which we specifically tested for the effect of oversampling. Training of the classifier network was done for a variety of different settings.

In the experiment depicted in Fig. 2, we first trained a classifier on a set of 1000 real x-rays (950 healthy, 50 exhibiting signs of pneumonia) provided by the RSNA. Subsequently, we trained a classifier on a set of 500 real and 500 synthesized x-rays (450 healthy real, 50 pneumonia real, 400 healthy synthesized, and 100 pneumonia synthesized), whereby the synthesized radiographs were generated by generators that had previously been trained on the NIH and Stanford datasets. As a test set, we used a random subset of the dataset published by the RSNA, comprising 6000 real x-rays with a healthy:pneumonia ratio of 2:3. In addition, we tested whether this concept could also be used in a more challenging task of differentiating




between a variety of diseases. As not all of the 14 pathologies labeled in the NIH dataset had been labeled in the Stanford dataset, we only classified those pathologies that were present in both datasets' labels, namely, cardiomegaly, effusion, pneumothorax, atelectasis, edema, consolidation, and pneumonia. We trained a classifier to differentiate between these classes with three different training sets: (i) synthesized x-rays generated by the NIH-GAN, (ii) synthesized x-rays generated by both the NIH-GAN and the Stanford-GAN, and (iii) real x-rays from the NIH dataset.

In addition, an experiment was carried out in which the generated images were evaluated by the trained DenseNet to discover correlations between different pathologies. For each pairing of pathology as generated by the generator and pathology as classified by the classifier, we calculated Pearson's correlation coefficient and performed clustering on the resulting correlation matrix.

Federated averaging GAN
The pseudocode of our federated averaging GAN is given in algorithm S1. Specifically, we controlled our federated learning experiment by setting the fraction of local clients that ran local GAN updates per round to 10% (C = 10%), the number of local generator iterations on each round to 10 (E = 10), and the local batch size to 32 (b = 32). Following Gulrajani et al. (26), the parameters of local Wasserstein GAN training were set to λ = 10 and n_discriminator = 5. All local models were initialized identically. During one global update round, as shown in Fig. 4A, a subset of clients (10% here) was picked to run local GAN updates on isolated datasets. Local clients were asked to transmit updated weights (red arrows in Fig. 4A) to the aggregation server once local updates were finished. The global model was updated by the weighted average over the collected weights (22). To finish the global round, all local models were updated by the weights from the global model (blue arrows in Fig. 4A).

Reader study
Six readers were tasked with identifying whether a radiograph was real or synthesized. The tests were performed as follows. Each reader was given 30 s within which she or he had to decide whether the presented radiograph was real or synthesized. To prevent readers from identifying GAN-related features on the high-spatial-resolution radiographs first (which are harder to produce and thus presumably more prone to artifacts) and transferring that knowledge to the low-spatial-resolution images, the radiographs were presented in the following order: 100 radiographs of 256 × 256, 100 radiographs of 512 × 512, and, lastly, 100 radiographs of 1024 × 1024. All presented radiographs were different, i.e., the 256 × 256 radiographs were different from the 512 × 512 and 1024 × 1024 radiographs. Reading tests were done on a 24-inch computer monitor. To exclude the possibility that readers investigated the metal markers or pixel-hardcoded letters (e.g., denoting the patient side, L or R) as potential artifacts to differentiate between real and synthesized images, these were covered by an independent investigator before handing out the x-ray to the testers.

Statistical analysis
For each of the experiments, we calculated the following parameters on the test set: AUC, accuracy, sensitivity, and specificity. To assess the errors due to sampling of the specific test set, we used bootstrapping with 10,000 redraws. The SE of the accuracy in the real versus synthesized tests for each human reader was calculated among the reader performances, and Fleiss' kappa was used to assess interreader agreement between readers.

To determine the number of needed samples for the performed experiments, we used power analyses according to (36). In general, all of our performed experiments followed a binomial distribution, because each decision for a radiograph was binary: either yes (e.g., "was real" for the case of deciding between real and synthesized radiographs, or "disease was present" for the case of the classifiers) or no ("was not real" or "disease not present"). We could thus use the binomial formula for the SD of absolute numbers, SD = √(n × p × q), or equally well for the SD of percentages, SD = √(p × q/n). The difference of metrics, such as AUC, sensitivity, and specificity, was defined as Δmetric (see table S4). For a total number of n = 1000 bootstrapping rounds, models were built after randomly permuting the predictions of two classifiers, and metric differences Δmetric_i were computed from their respective scores. We obtained the P value of individual metrics by counting all Δmetric_i above the threshold Δmetric. Statistical significance was defined as P < 0.001.

SUPPLEMENTARY MATERIALS
Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/6/49/eabb7973/DC1
View/request a protocol for this paper from Bio-protocol.

REFERENCES AND NOTES
1. S. Bickelhaupt, P. F. Jaeger, F. B. Laun, W. Lederer, H. Daniel, T. A. Kuder, L. Wuesthof, D. Paech, D. Bonekamp, A. Radbruch, S. Delorme, H.-P. Schlemmer, F. H. Steudle, K. H. Maier-Hein, Radiomics based on adapted diffusion kurtosis imaging helps to clarify most mammographic findings suspicious for cancer. Radiology 287, 761–770 (2018).
2. A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2017).
3. K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7 to 13 December 2015.
4. A. Hosny, C. Parmar, T. P. Coroller, P. Grossmann, R. Zeleznik, A. Kumar, J. Bussink, R. J. Gillies, R. H. Mak, H. J. W. L. Aerts, Deep learning for lung cancer prognostication: A retrospective multi-cohort radiomics study. PLOS Med. 15, e1002711 (2018).
5. V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 2402–2410 (2016).
6. D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, L. van der Maaten, Exploring the limits of weakly supervised pretraining, in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8 to 14 September 2018.
7. M. M. Mello, V. Lieou, S. N. Goodman, Clinical trial participants' views of the risks and benefits of data sharing. N. Engl. J. Med. 378, 2202–2211 (2018).
8. L. M. Prevedello, S. S. Halabi, G. Shih, C. C. Wu, M. D. Kohli, F. H. Chokshi, B. J. Erickson, J. Kalpathy-Cramer, K. P. Andriole, A. E. Flanders, Challenges related to artificial intelligence research in medical imaging and the importance of image analysis competitions. Radiol. Artif. Intell. 1, e180031 (2019).
9. J. Xu, F. Wang, Federated learning for healthcare informatics. arXiv:1911.06270 (2019).
10. W. Li, F. Milletarì, D. Xu, N. Rieke, J. Hancox, W. Zhu, M. Baust, Y. Cheng, S. Ourselin, M. J. Cardoso, A. Feng, Privacy-preserving federated brain tumour segmentation, in International Workshop on Machine Learning in Medical Imaging (Springer, 2019), pp. 133–141.
11. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems (Neural Information Processing Systems Foundation, Inc., 2014), pp. 2672–2680.
12. T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of GANs for improved quality, stability, and variation. arXiv:1710.10196 (2017).
13. H.-C. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L. Senjem, J. L. Gunter, K. P. Andriole, M. Michalski, Medical image synthesis for data augmentation and anonymization using generative adversarial networks, in International Workshop on Simulation and Synthesis in Medical Imaging (Springer, 2018), pp. 1–11.
14. J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22 to 29 October 2017.




15. J. M. Wolterink, A. M. Dinkla, M. H. F. Savenije, P. R. Seevinck, C. A. T. van den Berg, I. Išgum, Deep MR to CT synthesis using unpaired data, in Simulation and Synthesis in Medical Imaging (Springer, 2017), pp. 14–23.
16. A. Chartsias, T. Joyce, R. Dharmakumar, S. A. Tsaftaris, Adversarial image synthesis for unpaired multi-modal cardiac data, in Simulation and Synthesis in Medical Imaging (Springer, 2017), pp. 3–13.
17. Z. Zhang, L. Yang, Y. Zheng, Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, USA, 18 to 22 June 2018.
18. A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434 (2015).
19. A. Odena, C. Olah, J. Shlens, Conditional image synthesis with auxiliary classifier GANs, in Proceedings of the 34th International Conference on Machine Learning, Volume 70 (JMLR.org, 2017), pp. 2642–2651.
20. P. Rui, K. Kang, National Hospital Ambulatory Medical Care Survey: 2015 Emergency Department Summary Tables (2015); https://www.cdc.gov/nchs/data/nhamcs/web_tables/2015_ed_web_tables.pdf [accessed 16 January 2020].
21. S. Conjeti, A. Katouzian, A. G. Roy, L. Peter, D. Sheet, S. Carlier, A. Laine, N. Navab, Supervised domain adaptation of decision forests: Transfer of models trained in vitro for in vivo intravascular ultrasound tissue characterization. Med. Image Anal. 32, 1–17 (2016).
22. B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20 to 22 April 2017.
23. A. N. Bhagoji, S. Chakraborty, P. Mittal, S. Calo, Analyzing federated learning through an adversarial lens. arXiv:1811.12470 (2018).
24. D. Stutz, M. Hein, B. Schiele, Confidence-calibrated adversarial training: Generalizing to unseen attacks. arXiv:1910.06259 (2019).
25. S. Augenstein, H. B. McMahan, D. Ramage, S. Ramaswamy, P. Kairouz, M. Chen, R. Mathews, B. A. y Arcas, Generative models for effective ML on private, decentralized datasets. arXiv:1911.06679 (2019).
26. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. C. Courville, Improved training of Wasserstein GANs, in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett, Eds. (Curran Associates, Inc., 2017), pp. 5767–5777.
27. H. Salehinejad, S. Valaee, T. Dowdell, E. Colak, J. Barfett, Generalization of deep neural networks for chest pathology classification in X-rays using generative adversarial networks, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2018), pp. 990–994.
28. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2017), pp. 6626–6637.
29. S. G. Finlayson, J. D. Bowers, J. Ito, J. L. Zittrain, A. L. Beam, I. S. Kohane, Adversarial attacks on medical machine learning. Science 363, 1287–1289 (2019).
32. X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, R. M. Summers, ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 2097–2106.
33. J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, J. Seekins, D. A. Mong, S. S. Halabi, J. K. Sandberg, R. Jones, D. B. Larson, C. P. Langlotz, B. N. Patel, M. P. Lungren, A. Y. Ng, CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. arXiv:1901.07031 (2019).
34. E. Perez, F. Strub, H. de Vries, V. Dumoulin, A. Courville, FiLM: Visual reasoning with a general conditioning layer. arXiv:1709.07871 (2017).
35. P. Rajpurkar, J. Irvin, R. L. Ball, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. P. Langlotz, B. N. Patel, K. W. Yeom, K. Shpanskaya, F. G. Blankenberg, J. Seekins, T. J. Amrhein, D. A. Mong, S. S. Halabi, E. J. Zucker, A. Y. Ng, M. P. Lungren, Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLOS Med. 15, e1002686 (2018).
36. B. Rosner, Fundamentals of Biostatistics (Nelson Education, 2015).
37. M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN. arXiv:1701.07875 (2017).
38. D. P. Kingma, P. Dhariwal, Glow: Generative flow with invertible 1x1 convolutions, in Advances in Neural Information Processing Systems (2018), pp. 10215–10224.
39. T. Salimans, A. Karpathy, X. Chen, D. P. Kingma, PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv:1701.05517 (2017).
40. J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, T. Darrell, CyCADA: Cycle-consistent adversarial domain adaptation. arXiv:1711.03213 (2017).

Acknowledgments
Funding: This research project was supported by the START program of the Faculty of Medicine, RWTH Aachen, Germany, through the START rotation program granted to D.T. and by the DFG, Germany, through a grant given to S.N. Author contributions: T.H., D.T., V.S., and F.K. conceived the idea and approach. F.K., V.S., S.R., S.N., C.H., N.H., D.M., and D.T. contributed to the experiments. T.H., D.T., C.H., and N.H. developed the code infrastructure and GAN training setup. T.H., D.T., F.K., and V.S. wrote the manuscript. Competing interests: The authors declare that they have no competing interests. Data and materials availability: This study used three publicly available datasets: the NIH ChestX-ray14 dataset (https://nihcc.app.box.com/v/ChestXray-NIHCC), the Stanford CheXpert dataset (https://stanfordmlgroup.github.io/competitions/chexpert), and the RSNA pneumonia dataset (https://kaggle.com/c/rsna-pneumonia-detection-challenge). The full images used in our real/synthesized radiograph test are available at https://drive.google.com/open?id=1_snb7hQ47WYxJEYK95G3cYlWSqckvRDW. Details of the implementation as well as the weights of the neural networks after training and the full code producing the results of this paper are made publicly available at https://github.com/peterhan91/Thorax_GAN.git. All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Additional data related to this paper may be requested from the authors.

Submitted 19 March 2020
30. Y. Li, Y. Gal, Dropout inference in Bayesian neural networks with alpha-divergences, in Accepted 14 October 2020
Proceedings of the 34th International Conference on Machine Learning-Volume 70 Published 2 December 2020
(JMLR.org, 2017), pp. 2052–2061. 10.1126/sciadv.abb7973
31. A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models
resistant to adversarial attacks. arXiv:1706.06083 (2017). Citation: T. Han, S. Nebelung, C. Haarburger, N. Horst, S. Reinartz, D. Merhof, F. Kiessling, V. Schulz,
32. X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, R. M. Summers, ChestX-ray8: Hospital-scale D. Truhn, Breaking medical data sharing boundaries by using synthesized radiographs. Sci. Adv.
chest X-ray database and benchmarks on weakly-supervised classification and 6, eabb7973 (2020).

Han et al., Sci. Adv. 2020; 6 : eabb7973 2 December 2020 11 of 11


Breaking medical data sharing boundaries by using synthesized radiographs
Tianyu Han, Sven Nebelung, Christoph Haarburger, Nicolas Horst, Sebastian Reinartz, Dorit Merhof, Fabian Kiessling,
Volkmar Schulz, and Daniel Truhn

Sci. Adv. 6 (49), eabb7973. DOI: 10.1126/sciadv.abb7973



Copyright © 2020 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim
to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).
