Developing image analysis pipelines of whole-slide images: Pre- and post-processing

Journal of Clinical and Translational Science (www.cambridge.org/cts)
Review Article – Research Methods and Technology

Byron Smith1, Meyke Hermsen2, Elizabeth Lesser3, Deepak Ravichandar4 and Walter Kremers1,5

1Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA; 2Department of Pathology, Radboud University Medical Center, Nijmegen, The Netherlands; 3Department of Health Sciences Research, Mayo Clinic, Jacksonville, FL, USA; 4Department of Management Engineering and Consulting, Mayo Clinic, Rochester, MN, USA and 5William J. von Liebig Center for Transplantation and Clinical Regeneration, Mayo Clinic, Rochester, MN, USA

Cite this article: Smith B, Hermsen M, Lesser E, Ravichandar D, and Kremers W. Developing image analysis pipelines of whole-slide images: Pre- and post-processing. Journal of Clinical and Translational Science 5: e38, 1–11. doi: 10.1017/cts.2020.531

Received: 17 March 2020; Revised: 11 August 2020; Accepted: 19 August 2020

Keywords: Image analysis; data science; analysis pipeline; deep learning; pathology; computer vision

Address for correspondence: B. Smith, PhD, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA. Email: [email protected]

© The Association for Clinical and Translational Science 2020. This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Deep learning has pushed the scope of digital pathology beyond simple digitization and telemedicine. The incorporation of these algorithms into routine workflow is on the horizon and may be a disruptive technology, reducing processing time and increasing the detection of anomalies. While the newest computational methods enjoy much of the press, incorporating deep learning into standard laboratory workflow requires many more steps than simply training and testing a model. Image analysis using deep learning methods often requires substantial pre- and post-processing in order to improve interpretation and prediction. Similar to any data processing pipeline, images must be prepared for modeling, and the resultant predictions need further processing for interpretation. Examples include artifact detection, color normalization, image subsampling or tiling, and removal of errant predictions. Once processed, predictions are complicated by image file size – typically several gigabytes when unpacked. This forces images to be tiled, meaning that a series of subsamples from the whole-slide image (WSI) are used in modeling. Herein, we review many of these methods as they pertain to the analysis of biopsy slides and discuss the multitude of unique issues that are part of the analysis of very large images.
Introduction

Recent developments in hardware and software have expanded the opportunities for modeling and analysis of whole-slide images (WSIs) in pathology. The possibility to train multi-layered (deep) neural networks, combined with the generation of multiresolution images (i.e., WSIs), has created entirely new opportunities for the analysis of histopathologic tissue. Unfortunately, many of the resulting models lack a generalizable workflow from whole-slide preparation to results. This makes translation to clinical practice or application in large clinical trials a major challenge [1, 2]. Construction of end-to-end WSI analysis pipelines requires many more steps than model building in a controlled digital environment (Fig. 1). Tissue needs to be fixed, cut, stained, and digitized, and the resultant images need to be "cleaned" so that clinically relevant features can be selected and used in modeling. Because of the size of the images, subsamples of images, or tiles, are used for training deep learning algorithms. The raw output of the algorithm is post-processed and used to calculate possibly predictive data.

It is often said that 80% of an analyst's time is spent on data cleaning rather than data analysis, and image data is no exception. First, digitized images in pathology represent an overabundance of data (in terms of computational memory), and many of the pre- and post-processing steps focus on data reduction. Second, tissue preparation and digitization can cause substantial variability in a dataset. Analyzing images generated from a single center's lab on an individual scanner may mitigate some of these issues, but variations may still exist across time as staining protocols and hardware change. To combat these sources of variability, preprocessing and post-processing are used in tandem with complex modeling to create a reliable diagnostic tool (Fig. 1).

Once pre- and post-processed, aggregating predictions from the subsample (tile) level to the slide or patient level is another challenge, as there may be multiple slides or tiles per subject. Standard statistical tools, such as mixed-effects models that use random intercepts and slopes to account for multiple observations per patient, are not used in the context of deep learning, although some alternatives have been considered [3–6]; these alternatives are described in more detail in the prediction section below.

Although many methods described in this article can and have been used in standard image analysis, we focus on how they augment the deep learning model development process. Specifically, we use the term deep learning to refer to the application of convolutional neural networks (CNNs) to images, as opposed to other machine learning algorithms such as support vector machines or random forests. To better understand this context, we briefly describe how deep learning is applied to images before moving on to the pre- and post-processing steps that improve output.

Fig. 1. Typical flow of a WSI processing pipeline. In several instances, one may have to step backwards or start a stage over due to suboptimal outputs.
The exact steps used in pre- and post-processing rely heavily on the end goal of any image analysis pipeline. Herein, we review a series of pre- and post-processing methods used in tandem with standard modeling techniques while motivating why some methods may be preferable to others. Available functions in R and Python are described throughout and summarized in Table 1 as a quick reference.

Note that a general recommendation is to start with the simplest solution and expand on this solution with increasing complexity; examples of this will be included throughout.
A Brief Overview of Deep Learning for Image Analysis

Deep learning is defined as the use of deep neural networks for fully automatic discovery of patterns in (image) datasets. A specific type of network, the CNN, is commonly used for the analysis of medical images, and in this article we focus on these types of networks. CNNs are "deep" in the sense that they follow a hierarchical analysis in which the output of one "layer" of the network feeds into another layer, and so on. In addition, these networks are differentiated from other hierarchical modeling methods (e.g., hierarchical regression models and multilayer perceptrons) in that they analyze an image by applying a filter to the image and feeding that filtered output to the next layer. These filters are used to identify local patterns. Filters are applied to an image by performing a convolution operation of the filter with the image. This effectively slides the filter across the pixels of the image, evaluating the product of the filter parameters with the image. This technique, known as parameter sharing, saves memory because a model is not applied to every pixel (such as a slope or parameter for a logistic fit); rather, the parameters of each filter are shared across a neighborhood of pixels. Despite this, deep learning models used in image analysis can often range from 1 to 100 million parameters.

Another consequence of filtering is that it reduces the size of an image. One example is that a filter cannot sit on the edge of an image: either the image has to be modified (i.e., padded) or the output will be smaller than the input. A filtering method known as pooling also reduces the image dimensions by taking windows of pixels and reducing them to a single-pixel output. As a result, output tile size post-filtering is almost always smaller than the input size. Depending on the model in use, this fact may necessitate the use of overlapping tiles when sampling from an image. More details are given below in the preprocessing section.
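As a minimal illustration of these size effects, the sketch below (using NumPy and SciPy; the image and filter are random stand-ins, not learned parameters) shows how a "valid" convolution shrinks its input, how zero padding preserves size, and how 2 × 2 pooling halves each dimension:

```python
import numpy as np
from scipy.signal import convolve2d

# A toy grayscale "tile" and a single 3 x 3 filter (one channel of a
# convolutional layer; real networks learn many such filters).
tile = np.random.rand(100, 100)
kernel = np.random.rand(3, 3)

# A "valid" convolution cannot sit on the edge of the image, so the
# output shrinks from 100 x 100 to 98 x 98.
feature_map = convolve2d(tile, kernel, mode="valid")
print(feature_map.shape)  # (98, 98)

# Zero padding ("same") keeps the output the same size as the input.
padded_map = convolve2d(tile, kernel, mode="same")
print(padded_map.shape)  # (100, 100)

# 2 x 2 max pooling halves each spatial dimension.
h, w = feature_map.shape[0] // 2 * 2, feature_map.shape[1] // 2 * 2
pooled = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled.shape)  # (49, 49)
```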
The main elements of a deep learning model are as follows:

(1) The input of the model – Many images are too large to be analyzed directly, and so sub-images are fed into the network. Because filtering an image can change the image size, model architecture can depend on input size.
(2) The architecture of the model – The number of filters, the number of layers, and the overall connections from input to output.
(3) The output of the model – Most commonly, a classification of the input into one or more classes or a segmentation of the image into separate classes (pixel-wise classification). Almost always the output will be of dimensions smaller than or equal to the input image.

As with any modeling process, once these choices have been made, the problem is converted to an optimization problem by defining a loss function that quantifies how good or bad the fit is. A loss function is identified based on the endpoint being used, and it is then optimized using an iterative approach. For example, the cross-entropy (binomial loss) is frequently used in classification, just as it would be for any logistic model.
Just as with many machine learning methods, the hierarchical nature of deep learning models allows for complex relationships between input and output; however, this creates a sort of "black-box" model that does not carry the same rigorous statistical background that a logistic model might. More specifically, this means that it can be challenging to understand how a single filter in a single layer may play a role in the overall model; there is no direct measure of association such as an odds ratio or p-value.
For the development and testing of deep learning algorithms, the dataset is split into training, validation, and test sets. The network is optimized using the training and validation sets, while the test set remains untouched. The optimization procedure in most deep learning environments is unique to its field (Fig. 2). The parameter settings are optimized iteratively as the network aims to achieve better performance on the training set. Given that many CNN architectures have millions of parameters, modeling is prone to overfitting. To overcome this issue, it is common practice to stop the training procedure when the performance of the model on the training and validation sets is optimized. This procedure is similar to the optimization of hyperparameters in other machine learning algorithms, where models are trained on the training set, hyperparameter choices are evaluated on the validation set, and the final model performance is evaluated on the test set [7].

Another critical issue is that models must be robust to color shifts and geometric changes (translations and rotations). One way to build a model that is robust to these sources of variation is to apply the corresponding transformations during optimization. This is known as data augmentation.
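A minimal sketch of how early stopping and augmentation enter a training run, using the Keras package discussed below (assuming TensorFlow ≥ 2.6, where the augmentation layers live in tf.keras.layers; the tile arrays here are random stand-ins for a curated training set):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical stand-ins for tiles and labels; in practice these come
# from the tiling and normalization stages described later.
x_train = np.random.rand(64, 128, 128, 3).astype("float32")
y_train = np.random.randint(0, 2, 64)
x_val = np.random.rand(16, 128, 128, 3).astype("float32")
y_val = np.random.randint(0, 2, 16)

model = models.Sequential([
    # Data augmentation: random flips/rotations are active only during
    # training, encouraging robustness to geometric changes.
    layers.RandomFlip("horizontal_and_vertical", input_shape=(128, 128, 3)),
    layers.RandomRotation(0.1),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),  # binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Early stopping: halt when validation loss stops improving and restore
# the best weights, guarding the many-parameter model against overfitting.
stopper = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, callbacks=[stopper])
```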
Results of the deep learning algorithm may depend on each choice made during the model setup. Therefore, in order to find the best solution for a problem, many models are often fit with a series of different choices. This "guess and check" style is not unique to deep learning – statistical analysis is frequently iterative, based on model assumptions, checking interactions, model performance, etc. However, a standard model may run in seconds while deep learning models can take hours for a single optimization. Given that pre- and post-processing of output results may impact model performance, these steps act as additional choices that must be made in order to find an optimal solution. The typical flow of a project involving image analysis with deep learning can be seen in Fig. 1.

The most common packages for deep learning include TensorFlow [8] and Keras [9], which are accessible in Python with analogues in R (although these require an installation of Python to run). Additionally, Caffe [10] and PyTorch [11] provide alternative frameworks for deep learning in Python.

Fig. 2. A flowchart of the training process in deep learning. Note that data augmentation is a step that is carried out during the training process.
Data Acquisition

Before the development of an analysis pipeline is considered, one should explore the plethora of hardware that may be involved. The consequences of these choices can drive downstream methodology. First, there are several commercially available digital slide scanners, and many of these scanners use unique image file formats [12]. While many efforts exist to format images as Digital Imaging and Communications in Medicine (DICOM) or another standard, similar to radiological images, no standard has been adopted [13, 14]. Just as with any image file type, there is a trade-off between image file size and resolution; image compression may save space but ultimately impact final model performance by blurring features of interest. In the authors' experience, JPEG80 compression reduces space significantly while having minimal effect on image quality or network performance. Note that downstream modeling is based solely on pixel values stored in the form of arrays and, therefore, different file formats are more relevant to storage than analysis if the resolution is similar. Proper control of these sources of error can result in robust deep learning models that can process data across scanners regardless of file format [15].
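A minimal sketch of this compression trade-off using Pillow (the tile here is a random stand-in for a region exported from a WSI, and the file names are hypothetical):

```python
import os
import numpy as np
from PIL import Image

# Hypothetical tile; in practice this is a region read from a WSI.
tile = Image.fromarray(
    np.random.randint(0, 255, (1024, 1024, 3), dtype=np.uint8))

tile.save("tile.png")                                 # lossless baseline
tile.save("tile_q80.jpg", format="JPEG", quality=80)  # JPEG quality 80

# Compare on-disk sizes of the two encodings of the same pixel array.
print(os.path.getsize("tile.png"), os.path.getsize("tile_q80.jpg"))
```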
The incorporation of graphical processing units (GPUs) into any analysis pipeline may be critical. While deep learning models necessitate the use of GPUs during the training process, predictions from these models can be performed in near real time on a standard workstation or central processing unit (CPU). However, the repetitive calculations involved in most image analysis processes can be dramatically improved by parallelization and the inclusion of GPUs in any system used for high-throughput pipelines. To facilitate these hardware dependencies, Griffin and Treanor recommend the inclusion of information technology (IT) teams in the construction process [2].

In addition to file size differences, color profile differences arise across laboratories and scanners. It is well known that differences in core facility staining procedures can give rise to high variability in image color profile, but so too can the digital scanning process [16]. The settings of a whole-slide scanner may determine the performance and/or generalizability of any downstream modeling, especially when no color augmentation was applied during the training of the model. One solution for this issue can be preprocessing to a standard color distribution, i.e., color normalization, which is discussed below (preprocessing; color management) [17, 18].
Preprocessing

The general purpose of preprocessing image data is threefold: (1) to discard data that is not informative or useful, such as slide background or artifacts, (2) to create a consistent dataset, and (3) to enable processing of large WSIs during downstream modeling. The first stage is necessary to utilize computational resources in the most efficient way and to prevent the generation of noise by, e.g., pen markings or dust particles. The second stage of preprocessing is used to reduce disturbing sources of image variability. The third stage is achieved by tissue sampling (dividing the image into tiles and feeding these batchwise to the network), because most deep learning models cannot process gigapixel images at once. Below, we provide more detail for each of these steps as well as suggestions for those beginning to develop image analysis pipelines.

An important note is that many of the methods mentioned within the post-processing section below may be useful in preprocessing, although the motivations may differ. In preprocessing, morphological transformations are most often used to identify artifacts on the slide, whereas during post-processing they are used to increase prediction interpretability.

Artifact Detection and Tissue Segmentation

An important element in training and prediction is to limit the amount of non-informative data being analyzed. Non-informative data lies not just in the background of the slide, but also in artifacts on the slide that are not relevant to the histopathology of the individual patient. Artifact detection can be considered a quality control step in the image analysis pipeline, a way to prevent non-informative regions from propagating into training or prediction (Fig. 3).
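As a minimal sketch of the tissue-identification approach illustrated in Fig. 3 (thresholding followed by filling hulls), assuming scikit-image and SciPy are available; the Otsu threshold is data-driven, but the minimum object size is an assumption to be tuned per dataset:

```python
import numpy as np
from scipy import ndimage
from skimage import color, filters, morphology

def tissue_mask(rgb):
    """Separate tissue from slide background on a low-resolution WSI level."""
    gray = color.rgb2gray(rgb)                # glass background is near-white
    threshold = filters.threshold_otsu(gray)  # data-driven cutoff (Otsu)
    mask = gray < threshold                   # tissue is darker than glass
    mask = morphology.remove_small_objects(mask, min_size=500)  # drop dust
    mask = ndimage.binary_fill_holes(mask)    # fill hulls inside the tissue
    return mask

# Hypothetical low-resolution level of a WSI.
mask = tissue_mask(np.random.rand(256, 256, 3))
```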
Fig. 3. (A) Initial WSI, (B) tissue identified by thresholding (green), and (C) tissue identified by thresholding followed by filling hulls. By identifying tissue prior to analysis, slide background can be removed from the analysis.

Fig. 4. Color differences between (A) an Aperio ScanScope v1 and (B) an Aperio AT2 slide scanner based on default color profiles using the same slide.

Fig. 5. Tissue tiling based on (A) a regular grid, (B) an overlapping grid, and (C) random tiles.
Color Management

Macenko et al. [24] introduced the concept of color deconvolution, i.e., separating hematoxylin from eosin. They use a singular value decomposition (SVD) of a color space (hue, saturation, value) to establish color vectors that correspond to particular dyes. Unfortunately, a similar method may not generalize to a stain with three or more dye components because this no longer corresponds to a reduction in dimensionality relative to the color space. In order to overcome this limitation, methods have focused on color decomposition stratified by tissue type [25–27]. Clustering has also been proposed to identify the tissue type and perform color normalization within each cluster [28].

Extending the concept of tissue-specific normalization, deep learning methods have been adapted to recognize latent histologic structures and normalize them correspondingly using generative adversarial networks (GANs) [29, 30]. In a GAN, a generator network is trained to create "fakes" of input data (i.e., normalized slides) while an adversarial network identifies which images are true and which have been normalized. In this way, the two networks train each other, getting incrementally better at stain normalization until the opposing network cannot distinguish between the true stains and normalized stains. This application allows for normalization that can be performed at the tile level and integrated seamlessly with downstream networks.

Many of the color normalization algorithms have been hand coded in languages such as Python and MATLAB, but could be adapted for use in R. K-means clustering and SVD are part of the base R package. K-means clustering is also a standard clustering function contained within the scikit-learn library [31] for Python. Histogram matching from one image's color distribution to another's can be performed with scikit-image in Python.
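A minimal histogram-matching sketch along these lines (the tiles below are random stand-ins; the channel_axis argument assumes scikit-image ≥ 0.19, with older versions using multichannel=True instead):

```python
import numpy as np
from skimage import exposure

# Hypothetical tiles: `reference` defines the target color distribution
# (e.g., a tile from the scanner or stain batch the model was trained on).
reference = np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)
image = np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)

# Match each channel of `image` to the color distribution of `reference`.
matched = exposure.match_histograms(image, reference, channel_axis=-1)
```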
During the training phase, exposing a model to as much color variation as possible can benefit the generalizability of the model. The easiest way to accomplish this is through data augmentation, which should be the first step. However, for studies with larger sources of error, i.e., multicenter studies, further methods may be of value. Macenko's method has been implemented in several programming languages and provides an overall color normalization that can be adopted quickly. Finally, if the color of a slide is tissue-specific, methods that normalize a slide uniformly rather than based on tissue composition may be of less utility. Here we recommend using the more complex methods summarized above.
Tissue Sampling

The size of WSIs not only leads to computational limitations but also necessitates additional steps in any analysis pipeline [12]. Most WSIs are on the gigapixel scale (on the order of 10,000 × 10,000 pixels), while the input to most deep learning algorithms is on the scale of 100-pixel squares (Fig. 5). This discrepancy typically means that training or prediction from a deep learning model requires tiling the image and reshaping these tiles into an array usable by the algorithms.

For the purpose of training, regions of interest (ROIs) are often annotated to curate a representative set of learning examples for the network. Unfortunately, these ROIs are often chosen to classify or segment ideal images, avoiding digital image flaws such as fingerprints, markers, digital striations, or background. Doing so may limit the generalizability of trained models to real-world datasets and clinical applications. Nevertheless, sampling in this way undersamples slide background, sparing computational expense, and can be used in multistep analysis pipelines where an initial model is used to identify large-scale ROIs followed by analysis within those regions [22, 32].

Sampling tiles from the ROI can be done in several ways (Fig. 5). First, a strict grid can be used, although this may result in systematic sampling errors such as structure cleaving, where certain patterns are cut piecemeal. Overlapping grids can be used, but the choice of overlap amount is critical to balance redundancy and computational efficiency. Finally, random sampling to cover the region of interest can be used, although this too can result in high levels of redundancy. Image tiling can be performed automatically with the "py_wsi" package in Python.
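Alternatively, a minimal grid-tiling sketch in plain NumPy (the tile and stride sizes are hypothetical; a stride smaller than the tile size produces an overlapping grid, and a stride equal to the tile size produces a strict grid):

```python
import numpy as np

def tile_image(img, tile=512, stride=256):
    """Sample tiles on a regular grid over a region of interest."""
    tiles, coords = [], []
    h, w = img.shape[:2]
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            tiles.append(img[y:y + tile, x:x + tile])
            coords.append((y, x))  # keep origins to stitch predictions later
    return np.stack(tiles), coords

# Hypothetical region of interest read from a WSI at a chosen level.
roi = np.random.randint(0, 255, (2048, 2048, 3), dtype=np.uint8)
batch, coords = tile_image(roi)  # batch shape: (49, 512, 512, 3)
```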
For the purposes of prediction, random sampling is impractical for covering the full region of interest. Overlapping, regular grids can be useful in the prediction of nominal class results. In the case of segmentation, overlap is often necessary because the output classification mask can be smaller than the input image; this is a direct result of the choice of whether to use zero padding, which adds zeros around the border of an image to assure that the convolution of a function within a neural network (filter) with the image results in an output of the same size.

Each decision during the training process determines the construct of the ground truth used to evaluate the loss function (likelihood) during optimization. Built into the loss function is the level in an image size hierarchy that is of interest for prediction or inference. In the case of classification, every tile can be given a label, and this label may be chosen after the sampling process has been declared. Alternatively, the class can be declared at the slide level using the multiple instance learning (MIL) framework [33]. We discuss these choices more below. For segmentation, an image "mask" defines the class of each pixel and is sampled in accordance with the input image.
Fig. 6. Hypothetical predictions of the presence of a glomerulus on a renal transplant biopsy from a deep learning algorithm either (A) by pixel (segmentation) prior to post-processing, (B) by pixel after post-processing, or (C) by tile. Note that raw output from a segmentation model may leave errant pixel predictions as well as hulls within objects of interest.
The final issue in sampling arises from the imbalance in class labels between good and bad pathology. This age-old statistical problem means that training sets are frequently constructed using a more balanced design to give networks exposure to bad pathology, in sharp contrast to a general population. Alternatively, an unbalanced dataset may result in a lower sensitivity for identifying ROIs because these regions may be confused with tissue that is not of interest due to nuances in the texture or color of the tissue. To overcome this, training can be followed by hard negative mining, wherein false positive predictions are oversampled or added to the training set post hoc to prevent errors associated with balancing an unbalanced class.
Post-Processing

Many post-processing steps are more in line with the problem of segmentation than of classification. When predictions are at the pixel level rather than at the patch or WSI level, interpretability can be diminished by errant pixel predictions (Fig. 6A). This can occur because small areas of rare tissue types are often contained in large swaths of common tissue. Alternatively, several tissue subtypes may look similar and, therefore, an algorithm may be able to identify the larger structure but fail to further characterize that structure. Post-processing, therefore, provides a set of tools to fix the smaller errors that a model makes.

In addition to errant pixel predictions, algorithms may predict gaps in an object (Fig. 6A). In order to improve interpretability, these gaps can be filled in using an operation known as "hull filling".

Most post-processing steps are morphological in nature and have been reviewed in detail previously [34]. Available functions in R come chiefly through the "EBImage" package developed by Bioconductor [35], although the imager package offers very similar functionality (https://dahtah.github.io/imager/imager.html). A detailed set of examples is provided by Bioconductor as a vignette with code and sample data for many of the methods described below (https://www.bioconductor.org/packages/release/bioc/vignettes/EBImage/inst/doc/EBImage-introduction.html). In Python, tools are available through the OpenCV [36] and SciPy [37] packages.
Hull Filling

A common occurrence in segmentation output is objects for which the circumferential predictions are of one class but the interior predictions are of a different class. Intuitively, identification of the boundary of an object may be easier than identification of the interior. For this reason, the closed object prediction may form a "hull" (Fig. 6A). In order to improve the interpretability of segmentation output, hull filling is a quickly applied method of assigning pixels in the interior of a hull to the class of the bounding area. The hull-filling operation converts all pixels within the closed object to the pixel classification that defines the closure (Fig. 6B). In order to perform this operation, pixels must be defined as a binary classification, such as one versus all.
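A minimal hull-filling sketch using SciPy's binary_fill_holes (the mask below is a toy stand-in for raw segmentation output):

```python
import numpy as np
from scipy import ndimage

# Hypothetical segmentation output: integer class labels per pixel.
# Class 1 forms a closed boundary ("hull") whose interior was
# misclassified as background, as in Fig. 6A.
pred = np.zeros((64, 64), dtype=np.uint8)
pred[16:48, 16:48] = 1
pred[20:44, 20:44] = 0  # interior predicted as background

# One-vs-all: hull filling operates on a binary mask for one class.
binary = pred == 1

# Assign every pixel inside the closed boundary to the boundary's class.
filled = ndimage.binary_fill_holes(binary)
pred[filled] = 1
```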
Morphological Operations

Although hull filling can increase the interpretability of bounded objects, it does not overcome the fact that objects may be "open", with boundary pixels predicted less consistently than internal pixels of an object. To combat this issue, the morphological operations known as dilation and erosion may be used. These methods slide windows across the image, assigning the maximum (dilation) or minimum (erosion) pixel value within each window to the pixel at the center of the window. The resultant image can join disconnected boundaries because they fall within the same neighborhood of a pixel that defines the object. Using a dilation followed by an erosion results in a closing, while the inverse results in an opening. This allows an analyst to programmatically join pixels of a similar type without expanding overall regions. Although these operations can be abstract, the difference in their results can be dramatic; a juxtaposition is provided in Fig. 7.
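These four operations can be reproduced with OpenCV in a few lines; the sketch below uses the same 11 × 11 window as Fig. 7 on a toy mask with two fragments of one object:

```python
import cv2
import numpy as np

# Hypothetical binary mask (0/255) from a segmentation model, with two
# fragments of one object separated by a small gap.
mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:40, 10:30] = 255
mask[20:40, 34:54] = 255  # 4-pixel gap to the first fragment

kernel = np.ones((11, 11), dtype=np.uint8)  # 11 x 11 window, as in Fig. 7

dilated = cv2.dilate(mask, kernel)                        # max over window
eroded = cv2.erode(mask, kernel)                          # min over window
closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # dilate, then erode
opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # erode, then dilate

# The closing joins the two fragments without expanding the overall region.
print((closed[20:40, 30:34] == 255).all())  # True: gap bridged
```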
Instance Segmentation

In medical images, structures often lie close together, such as glands in prostate tissue and tubuli in kidney tissue. It can be useful to separate these structures after segmentation by a CNN to perform better post-processing or more accurate quantification. A way to identify individual objects is instance segmentation. With instance segmentation, the boundaries of an object are detected at the pixel level by including an additional "boundary" class during training. This class can also be used to train a separate, binary network that is applied prior to the network aimed at the true structures of interest. The boundary class can subsequently be used to identify individual objects to which post-processing rules can then be applied. This method is more computationally expensive than simple segmentation and may provide more benefit for tightly packed structures.
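A minimal sketch of this boundary-class trick using SciPy's connected-component labeling (the three-class mask below is a toy stand-in for network output):

```python
import numpy as np
from scipy import ndimage

# Hypothetical three-class output: 0 = background, 1 = structure, and
# 2 = the additional "boundary" class learned during training. Two
# touching structures are separated only by their predicted boundary.
pred = np.zeros((64, 64), dtype=np.uint8)
pred[10:30, 10:54] = 1
pred[10:30, 30:34] = 2  # boundary pixels between the two structures

# Dropping the boundary class disconnects the touching structures...
structures = pred == 1

# ...so connected-component labeling yields one label per instance.
instances, n = ndimage.label(structures)
print(n)  # 2 individual objects

# Instance-level rules can now be applied, e.g., measuring object sizes.
sizes = ndimage.sum(structures, instances, index=range(1, n + 1))
```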
Fig. 7. Morphological operations using an 11 × 11 filter size followed by hull filling: (A) dilation, (B) erosion, (C) opening, and (D) closing. Note that in this case, the dilation and closing appear similar, with small differences in how close the segmentation is to the tissue boundaries.
Prediction

The mixed-effects models that would traditionally account for multiple tiles or slides per patient [40] are, at the scale of WSI data, unusable. Instead, a close analogue known as MIL is frequently applied [33]. Below, we provide some examples of prediction at the tile level and then extensions to slide-level prediction, similar to Dimitriou et al. [38].
Tile-Level Predictions

The simplest level of prediction is to treat each tile of the WSI as an independent entity (Fig. 6C). Tile-level predictions can be motivated by evidence of severity from even a single positive prediction. For example, a small region of cancer is still cancer at the slide level and patient level. However, some predictions may require a broader context, which means that prediction at the tile level still requires information from other tiles or slides [41, 42]. An example is the CAMELYON17 challenge, where participants were asked not only to detect metastases in lymph nodes but also to report whether this concerned isolated tumor cells, micro-metastases, or macro-metastases on a patient level [43]. Only two participants actually measured the size of the largest region. More typically used measures were the number of detected metastases, mean detection size and standard deviation, mean detection likelihood, and standard deviation. Alternatively, the use of gradient boosting may provide a way to oversample borderline cases by weighting. This technique is more common in machine learning and utilizes sequential fits of a model followed by up-weighting of observations that were misclassified [17].
Slide-Level Predictions

Because the information contained within slides is dense, annotation of WSIs on a large scale has been a challenge. Data reduction is often required as a preprocessing step to focus on areas of interest. Popular strategies often involve one of two methods: (1) MIL [3, 44, 45] or (2) unsupervised methods with the goal of reducing the amount of input data prior to analysis.

MIL is chiefly used in classification problems by using "bags" (slides) of instances (tiles) to predict the class of a slide, iteratively selecting the features from these tiles that improve the classification. More specifically, a slide may be labeled as "positive" for some condition although the tiles composing the slide may be a mixture of positive and negative. An example would be declaring a full slide cancer positive while only local regions of that slide (and therefore only a few of the tiles belonging to the slide) are positive. On the other hand, a slide is only considered cancer negative if all tiles are considered negative. At each iteration of the modeling process, features of those tiles that result in a slide being declared positive are selected to improve the model. More details on these methods are provided in Carbonneau et al. [33].
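Under the standard MIL assumption described above (a bag is positive if any instance is positive), slide-level aggregation can be as simple as a maximum over tile scores. A toy sketch follows (the probabilities and the 0.5 cutoff are illustrative, not recommended settings):

```python
import numpy as np

# Hypothetical tile-level probabilities for one slide (one "bag").
tile_probs = np.array([0.02, 0.10, 0.95, 0.07])

# MIL assumption: a slide is positive if any tile is positive, so the
# slide score is the maximum over its instances.
slide_prob = tile_probs.max()
slide_label = int(slide_prob > 0.5)

# The top-scoring tiles can be carried forward as the "selected"
# instances for the next training iteration.
top_tiles = np.argsort(tile_probs)[::-1][:2]
```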
Unsupervised methods focus on one of two objectives: (1) using a subset of the tiles to make a prediction over the full slide or (2) reducing the information contained within a single tile to a single value or a small set of values that can be used in downstream analysis. The methods include standard machine learning approaches such as k-means clustering and principal component analysis (PCA) [5, 6]. Alternatively, deep learning has been applied to perform tile-level prediction followed by more standard analysis. A criticism of these methods is that the two processes (tile compression followed by prediction) are performed independently, and therefore improvements are iterative rather than simultaneous [38].

In an attempt to marry data reduction and prediction, several strategies have been proposed. The EM algorithm has been implemented to give more weight to slides that are more discriminatory when generating prediction estimates from these slides [44]. Noting that weighting is equivalent to differential sampling, Monte Carlo sampling has been implemented as well [4]. Simplifying tiles to single values as part of the modeling process has also been proposed [46].

Discussion

The last decade's evolutions in machine learning and whole-slide tissue scanners have changed the conversation around digital pathology and provided opportunities for increased accuracy and efficiency through the incorporation of computational pathology. While the promise of robust, novel models that are capable of assisting physicians in the decision-making process is exciting, the barrier to translating models to clinical practice cannot be overstated. Unlike conventional statistical models, image analysis frequently requires pre- and post-processing in order to achieve optimal solutions.

Digital pathology distinguishes itself from other medical image analysis scenarios in large part due to image size. Standard images, unpacked for the purpose of analysis, prevent full-scale modeling. This requires subsampling small tiles from a very large WSI, whether systematically using a grid or selecting at random.

The application of these systems can be complicated if the intent is to use modeling predictions in the framework of clinical trials. In order to meet regulatory demands, end-to-end system validation must be performed, as each step creates a source of variation that could destabilize results [47, 48]. While the Food and Drug Administration has taken a stance on this [49], the Digital Pathology Association (DPA) has provided detailed information on what steps need to be taken for validation [47].

Many of the methods in image pre- and post-processing may be familiar to statisticians, such as data and dimension reduction methods as well as linear and nonlinear transforms. However, some aspects of the process are unique to image processing. For example, the fact that image data is unstructured allows us to generate a nearly infinite number of predictors based on the intensity of a pixel as well as the intensities of that pixel's neighbors. Furthermore, most linear statistical models fit through the mean of the data; in that case, adding noise in an augmentation step does not impact predictions but will inflate the model error. This may not be the case when implementing deep learning models, which are highly nonlinear and hierarchical.

Throughout this manuscript, we provide a variety of options for processing images along with recommendations of where to begin. It is important to note that many image analysis projects are unique, and our recommendations may not always be valid or ideal. Trying different combinations of these tools will often be of benefit to any project. Also, note that the perspective taken for many of these processing steps is the model training stage, but pre- and post-processing are just as useful in prediction.

Regardless of the methodology, analysis of WSIs is distinguished by unusually large datasets which are often analyzed piecemeal, and clinically relevant predictions at the slide or patient level require some degree of reconstruction. While the standard methodology of mixed-effects modeling is unavailable, alternatives such as MIL provide a method to enforce similar conditions during the training process. By focusing on image cleaning, pre- and post-processing, and predictive modeling, improvements in image classification and prediction are likely to aid the pathologist in detecting important pathology and in reducing pathologist burden by automating many tasks.
Disclosures. WK received grant funding from AstraZeneca, Biogen, and Roche. There are no additional disclosures.

Data Deposit. Images from the manuscript, as well as the code used to create the images, have been uploaded here: https://github.com/BHSmith3/Image-Cleaning.
References

1. Pell R, et al. The use of digital pathology and image analysis in clinical trials. The Journal of Pathology: Clinical Research 2019; 5(2): 81–90.
2. Griffin J, Treanor D. Digital pathology in clinical use: where are we now and what is holding us back? Histopathology 2017; 70(1): 134–145.
3. Zhou Z-H. A brief introduction to weakly supervised learning. National Science Review 2018; 5(1): 44–53.
4. Combalia M, Vilaplana V. Monte-Carlo sampling applied to multiple instance learning for histological image classification. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Granada; 2018. 274–281.
5. Yue X, et al. Colorectal cancer outcome prediction from H&E whole slide images using machine learning and automatically inferred phenotype profiles. In Conference on Bioinformatics and Computational Biology. Honolulu, HI; 2019.
6. Zhu X, et al. WSISA: making survival prediction from whole slide histopathological images. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI; 2017. 6855–6863.
7. Chicco D. Ten quick tips for machine learning in computational biology. BioData Mining 2017; 10: 35.
8. Abadi M, et al. TensorFlow: large-scale machine learning on heterogeneous systems. 2015. Software available from tensorflow.org.
9. Chollet F. Keras. San Francisco, CA: GitHub, 2015.
10. Jia Y, et al. Caffe: convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
11. Paszke A, et al. PyTorch: an imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems. Vol. 32. Red Hook, NY: Curran Associates, Inc.; 2019. 8024–8035.
12. Webster JD, Dunstan RW. Whole-slide imaging and automated image analysis: considerations and opportunities in the practice of pathology. Veterinary Pathology 2014; 51(1): 211–223.
13. Herrmann MD, et al. Implementing the DICOM standard for digital pathology. Journal of Pathology Informatics 2018; 9: 37.
14. Clunie DA. Dual-personality DICOM-TIFF for whole slide images: a migration technique for legacy software. Journal of Pathology Informatics 2019; 10: 12.
15. Hermsen M, et al. Deep learning-based histopathologic assessment of kidney tissue. Journal of the American Society of Nephrology 2019; 30(10): 1968–1979.
16. Ciompi F, et al. The importance of stain normalization in colorectal tissue classification with convolutional networks. In IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017); 2017.
17. Komura D, Ishikawa S. Machine learning methods for histopathological image analysis. Computational and Structural Biotechnology Journal 2018; 16: 34–42.
18. Tellez D, et al. Neural image compression for gigapixel histopathology image analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 2019. doi: 10.1109/TPAMI.2019.2936841.
19. Janowczyk A, et al. HistoQC: an open-source quality control tool for digital pathology slides. JCO Clinical Cancer Informatics 2019; 3: 1–7.
20. Shrestha P, et al. A quantitative approach to evaluate image quality of whole slide imaging scanners. Journal of Pathology Informatics 2016; 7: 56.
21. Otsu N. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 1979; 9(1): 62–66.
22. Bandi P, et al. Resolution-agnostic tissue segmentation in whole-slide histopathology images with convolutional neural networks. PeerJ 2019; 7: e8242.
23. Fernandez-Carrobles MM, et al. Automatic quantification of IHC stain in breast TMA using colour analysis. Computerized Medical Imaging and Graphics 2017; 61: 14–27.
24. Macenko M, et al. A method for normalizing histology slides for quantitative analysis. In IEEE International Symposium on Biomedical Imaging. Boston, MA; 2009. 1107–1110.
25. Kothari S, et al. Removing batch effects from histopathological images for enhanced cancer diagnosis. IEEE Journal of Biomedical and Health Informatics 2014; 18(3): 765–772.
26. Khan AM, et al. A nonlinear mapping approach to stain normalization in digital histopathology images using image-specific color deconvolution. IEEE Transactions on Biomedical Engineering 2014; 61(6): 1729–1738.
27. Bejnordi BE, et al. Quantitative analysis of stain variability in histology slides and an algorithm for standardization. Bellingham, WA: SPIE; 2014.
28. Vicory J, et al. Appearance normalization of histology slides. Computerized Medical Imaging and Graphics 2015; 43: 89–98.
29. Janowczyk A, Basavanhally A, Madabhushi A. Stain normalization using sparse autoencoders (StaNoSA): application to digital pathology. Computerized Medical Imaging and Graphics 2017; 57: 50–61.
30. Zanjani FG, et al. Stain normalization of histopathology images using generative adversarial networks. In 2018 IEEE International Symposium on Biomedical Imaging (ISBI 2018). Washington, DC: IEEE; 2018. 573–577.
31. Pedregosa F, et al. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 2011; 12: 2825–2830.
32. Khoshdeli M, Parvin B. Feature-based representation improves color decomposition and nuclear detection using a convolutional neural network. IEEE Transactions on Biomedical Engineering 2018; 65(3): 625–634.
33. Carbonneau M-A, et al. Multiple instance learning: a survey of problem characteristics and applications. arXiv 2016.
34. Arganda-Carreras I, Andrey P. Designing image analysis pipelines in light microscopy: a rational approach. Methods in Molecular Biology 2017; 1563: 185–207.
35. Pau G, et al. EBImage – an R package for image processing with applications to cellular phenotypes. Bioinformatics 2010; 26(7): 979–981.
36. Bradski G. The OpenCV Library. 2000. Software available from https://github.com/itseez/opencv.
37. Virtanen P, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 2020; 17(3): 261–272.
38. Dimitriou N, Arandjelovic O, Caie PD. Deep learning for whole slide image analysis: an overview. Frontiers in Medicine 2019; 6: 264.
39. Zhu X, Yao J, Huang J. Deep convolutional neural network for survival analysis with pathological images. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Shenzhen, China; 2016. 544–547.
40. Lindstrom MJ, Bates DM. Newton–Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association 1988; 83(404): 1014–1022.
41. Bejnordi BE, et al. Deep learning-based assessment of tumor-associated stroma for diagnosing breast cancer in histopathology images. In IEEE 14th International Symposium on Biomedical Imaging; 2017. 929–932.
42. Bejnordi BE, et al. Context-aware stacked convolutional neural networks for classification of breast carcinomas in whole-slide histopathology images. Journal of Medical Imaging 2017; 4(4): 044504.
43. Bandi P, et al. From detection of individual metastases to classification of lymph node status at the patient level: the CAMELYON17 challenge. IEEE Transactions on Medical Imaging 2019; 38(2): 550–560.
44. Hou L, et al. Patch-based convolutional neural network for whole slide tissue image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016; 2016: 2424–2433.
45. Jia Z, et al. Constrained deep weak supervision for histopathology image segmentation. IEEE Transactions on Medical Imaging 2017; 36(11): 2376–2388.
46. Courtiol P, et al. Classification and disease localization in histopathology using only global labels: a weakly-supervised approach. arXiv:1802.02212v2 [cs.CV], 2018.
47. Lowe A, et al. Validation of Digital Pathology in a Healthcare Environment. Digital Pathology Association, 2011. https://digitalpathologyassociation.org/_data/cms_files/files/DPA-Healthcare-White-Paper–FINAL_v1.0.pdf.
48. Pantanowitz L, et al. Validating whole slide imaging for diagnostic purposes in pathology: guideline from the College of American Pathologists Pathology and Laboratory Quality Center. Archives of Pathology & Laboratory Medicine 2013; 137(12): 1710–1722.
49. United States Food and Drug Administration. Technical performance assessment of digital pathology whole slide imaging devices. Silver Spring, MD: Center for Devices and Radiological Health; 2016.