Use of Whole Genome Shotgun Metagenomics: A Practical Guide for the Microbiome-Minded Physician Scientist
1 Division of Maternal-Fetal Medicine, Department of Obstetrics and Gynecology, Baylor College of Medicine
2 Department of Molecular and Human Genetics, Bioinformatics Research Lab, Baylor College of Medicine
3 Department of Molecular and Cell Biology, Baylor College of Medicine
4 Alkek Center for Metagenomics and Microbiome Research, Baylor College of Medicine, Houston, Texas
Semin Reprod Med 2014;32:5–13
Address for correspondence Kjersti M. Aagaard, MD, PhD, Department of Obstetrics & Gynecology, Division of Maternal-Fetal Medicine and Department of Molecular and Cell Biology, Department of Molecular Physiology and Biophysics, National School for Tropical Medicine, Center for Metagenomics and Microbiome Research, Center for Reproductive Medicine, John M. Eisenberg Center for Health Outcomes Research, Bioinformatics Research Lab at the HGSC, Translational Biology and Molecular Medicine and Co-Director, Baylor College of Medicine MSTP program, 1 Baylor Plaza, Houston, TX 77030 (e-mail: [email protected]).
This document was downloaded for personal use only. Unauthorized distribution is strictly prohibited.
Abstract Whole genome shotgun sequencing (WGS) has been increasingly recognized as the most comprehensive and robust approach for metagenomics research. When compared with 16S-based metagenomics, it offers the advantage of identification of species level taxonomy and the estimation of metabolic pathway activities from human and environmental samples. Several large-scale metagenomic projects have been recently conducted or are currently underway utilizing WGS. With the generation of vast amounts of data, the bioinformatics and computational analysis of WGS results becomes vital for the success of a metagenomics study. However, each step in the WGS data analysis, including metagenome assembly, gene prediction, taxonomy identification, function annotation, and pathway analysis, is complicated by the sheer amount of data. Algorithms and tools have been developed specifically to handle WGS-generated metagenomics data with the hope of reducing the requirement on computational time and storage space. Here, we present an overview of the current state of metagenomics through WGS sequencing, challenges frequently encountered, and up-to-date solutions. Several applications that are uniquely applicable to microbiome studies in reproductive and perinatal medicine are also discussed.
Keywords
► whole genome sequencing
► microbiome
► assembly
► metabolic pathways
The relationship of humans with our environmental microbes is documented throughout history. The discovery of the smallpox vaccine by Edward Jenner and the great pandemics of the Bubonic Plague and the 1918 influenza have demonstrated the volatile and parasitic side of microbes. However, we also have mutualistic and commensal relationships with the microbes in our environment. Recently, the Human Microbiome Project (HMP) consortium took on the task of documenting what constitutes a healthy microbiome.1–4 This question has helped to highlight studies demonstrating that dysbiosis of the microbiome is associated with type 2 diabetes mellitus, obesity, inflammatory bowel disease, and colorectal cancer.5–8 Further, dysbiosis of the microbiome has been implicated as a cause for preterm birth. Gravidae that undergo preterm birth often have an intrauterine infection with increases in inflammatory cytokines,9,10 such as IL-6 and IL-1β. Thus, these studies have raised the question of how we establish and maintain a healthy microbiome. With the exponentially expanding interest in human microbiome research, a working knowledge of the methodology and tools used in this field is fundamental to translational research. This is notably true in reproductive and perinatal research initiatives, where there is a tremendous potential need for investigators well versed in both the technology and biology of the
Issue Theme The Microbiome and Reproduction; Guest Editors, James H. Segars, MD, and Kjersti M. Aagaard, MD, PhD
Copyright © 2014 by Thieme Medical Publishers, Inc., 333 Seventh Avenue, New York, NY 10001, USA. Tel: +1(212) 584-4662.
DOI http://dx.doi.org/10.1055/s-0033-1361817. ISSN 1526-8004.
6 Use of Whole Genome Shotgun Metagenomics Ma et al.
expanding field of research. While an increasing number of investigators are familiar with and employing 16S-based metagenomic approaches, there are far fewer investigators who have a working knowledge of alternative metagenomic approaches.
Before the era of massively parallel NextGen sequencing, the clone-based metagenome approach in combination with Sanger sequencing was used for early metagenomics research. First, the DNA content of a genomic clone is sheared into random fragments before cloning fragments into plasmid vectors that are grown to produce monoclonal libraries containing enough genomic material for sequencing. Although Sanger sequencing produces long reads (100–2,000 bp), usually only a few selected inserts could be obtained. Thus, this process is low throughput and suffers from the limitation of assembling regions with large repeats and from cloning bias. Hence, it is not surprising that NextGen sequencing (NGS) techniques have quickly replaced Sanger sequencing because of their collectively unique advantages. In addition to economical low per base cost and higher throughput, the cloning step and its inherent problems seen in Sanger sequencing methods are no longer issues for NGS techniques. Environmental samples can be sequenced directly by NGS techniques, which allows for the investigation of unculturable and low abundance species. Therefore, the comprehensive characterization of more complex and diverse microbial communities, such as microbial communities related to the human reproductive system, becomes feasible. NGS techniques used in metagenomics research mainly include 454 Genome Sequencer Pyrosequencing (454 Life Sciences; Roche Company, Branford, CT) for 16S rDNA sequencing, Solexa/Illumina (Illumina Inc., San Diego, CA) for whole genome shotgun sequencing (WGS) studies, and the most recent Helicos (Helicos Bio Sciences, Cambridge, MA) single-molecule sequencing technology also for WGS studies.11
The majority of recent studies examining the bacterial flora communities residing within humans have utilized 16S rDNA sequencing techniques. The nine variable regions of the 16S rRNA gene are flanked by conserved stretches in the majority of bacteria. These conserved stretches can be used as targets for PCR primers with near-universal bacterial specificity. Therefore, through 454 Pyrosequencing, sequence reads are obtained from one region of the 16S rRNA gene, which are then quantified and subsequently assigned a taxonomy. Thus, when compared with WGS techniques of the full length 16S rRNA gene, the coverage of each sample is much higher and many more samples are able to be run in parallel using a bar-coding system. However, the downside to WGS techniques is that a small proportion of reads could be assigned to lower level taxonomy due to the shorter read length. Overall, the resolution of the community composition obtained with 16S Pyrosequencing techniques is orders of magnitude larger than Sanger sequencing with a lower per base cost.
The Illumina technology was introduced around the same time as 454 Pyrosequencing technology. The Illumina instruments produce more than 10 times the number of reads per run as the 454 GS FLX machines, albeit of much shorter lengths (less than 100 bp compared with 400–500 bp of 454 reads). The advantages of WGS sequencing on the Illumina platform over 16S rDNA sequencing on the 454 platform are the ability to provide information on genome assembly, species level taxonomy abundance, gene prediction, and metabolic pathway reconstruction.12 However, each stage of the analysis is complicated by incomplete coverage, the high volume of data, the short length of reads, and intrinsic errors caused by parallel sequencing.13,14 In this review, we will primarily focus on the bioinformatics procedure to transform Illumina-generated short reads into biologically meaningful taxonomic and functional entities (►Fig. 1). Recently developed tools specific for metagenomic data analysis and their application to human reproductive medicine will be discussed as well (►Table 1).
Figure 1 Flowchart of metagenomics whole genome shotgun sequencing data analysis.
Genome Assembly
Genome assembly is essential for the study of gene arrangements and gene function. For assembly in a single organism, all the DNA fragments come from the same genome. However, this is not the case when it comes to metagenome assembly, and several obstacles make metagenome assembly especially challenging. For samples from an environment with low microbial abundance, the coverage on the genome is usually incomplete. Although longer gene sequences could be achieved, there is still a risk of making chimeric contigs from different operational taxonomic units (OTUs). This risk of chimeric contigs is further complicated by genomic repeats.
Table 1 Tools for metagenomics analysis mentioned in this review
Process                             Tools                Reference
Metagenome assembly                 MetaVelvet           20
Metagenome assembly                 Meta-IDBA            21
Metagenome assembly                 GeneStitch           22
Metagenome assembly                 Ray Meta             23
Metagenome assembly                 Bambus 2             25
Gene prediction                     MetaGene Annotator   26
Gene prediction                     MetaGeneMark         28
Gene prediction                     Orphelia             29
Taxonomic identification            MEGAN                31
Taxonomic identification            WebCARMA             33
Taxonomic identification            Phymm, PhymmBL       35
Taxonomic identification            MetaPhyler           36
Taxonomic identification            MetaPhlAn            37
Functional and pathway annotation   BLASTX
Functional and pathway annotation   UBLAST               40
Functional and pathway annotation   HMMER                41
Functional and pathway annotation   THINK-Back           42
Functional and pathway annotation   HUMAnN               45
Data analysis pipeline              MetAMOS              58
Data analysis pipeline              MG-RAST              55
Data analysis pipeline              Genboree tool kit    54
For the same reason, the assembly process could distort the species abundance as well.15
Reconstructing genomes without referencing a previously sequenced genome is called de novo assembly, which is proven to be hard to solve computationally (NP-hard). The traditional method used for assembly of Sanger-based sequences is Overlap Layout Consensus. An overlap graph is first constructed with each read as a node and each edge representing the overlap identified between reads. The graph is thereafter analyzed to determine the paths connecting reads together to construct the genome. However, this method is not suitable to be used on the assembly of short reads generated on an NGS platform because in the worst case each read must be compared with all other reads. NGS methods usually generate an order of magnitude more reads compared with Sanger sequencing, which significantly increases the computational complexity. Most of the recently developed metagenomics assembly algorithms are based on Eulerian tour of de Bruijn graphs. In de Bruijn graphs, reads are first decomposed into fixed length k-mers. Nodes are represented by k-mers, with the reads themselves being the edges connecting the nodes (►Fig. 2). The overlaps are implicitly represented in the graph by paths that traverse from one read to its neighbor. The output is usually a simple path of contigs. In this way, the high number of reads does not affect the number of nodes and because repeats only appear once, the problem of high redundancy in reads is also solved. Moreover, the solution to a de Bruijn graph is an Eulerian path, and a linear-time algorithm to solve an Eulerian path does exist.
Traditional assemblers designed for the assembly of single organism genomes were initially applied to assemble metagenomes (e.g., Velvet,16 Celera,17 and SOAPdenovo18) with limited success. Recently, various Eulerian strategy-based assemblers have been developed specifically for the assembly of metagenomes.19
One assembler in particular is MetaVelvet,20 which was developed based on Velvet,16 a popular assembler for single genomes. The basic idea of MetaVelvet is to take the de Bruijn graph constructed from sequences obtained from multiple species as a mixture of multiple de Bruijn subgraphs, where each subgraph represents an individual species. The mixed de Bruijn graph is then decomposed into individual subgraphs based on coverage difference and graph connectivity, and the subgraphs are subsequently used for building scaffolds.
The program Meta-IDBA21 is used to address the issue of identifying the branches in the de Bruijn graph caused by polymorphism in similar subspecies (sp-branches) or caused by similar genomic regions shared by different species (cr-branches). Meta-IDBA first identifies and removes cr-branches in the graph, which leaves connected components corresponding to a set of subgraphs of the same species. Finally, each component is transformed into a multiple alignment of consensus sequences to represent the contigs of different subspecies.
GeneStitch22 uses the prior knowledge of the species composition and gene contents to guide the assembly process. The idea is that the assembled contigs are similar to given reference genes. Alternatively, the contigs could be inferred from the tangled de Bruijn graph using a network matching algorithm, and if no prior sequence knowledge is available, a general dataset of genes could be used to reference the gene sets as well. With the ever increasing number of samples and reads, scalability is becoming important for metagenome assembly.
Ray Meta23 is a method that was developed for scalable distributed de novo metagenome assembly on Ray.24 Ray Meta does not modify the de Bruijn subgraphs as MetaVelvet and Meta-IDBA do. It applies heuristics-guided graph traversals on k-mers in parallel, which is more amenable to distributed
Figure 2 A sample de Bruijn graph with k = 4. The edges of the graph are unique subsequences of reads with length of k. The nodes of the graph represent common subsequences of length k-1. If the suffix of one node matches with the prefix of the other node with length of k-2, the two nodes are connected. This graph consists of short reads for the consensus sequence “GTTTGGTTGT.”
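The graph of Figure 2 can be approximated with a minimal Python sketch, given here only as an illustration under stated assumptions. It follows the figure's convention (edges are unique k-mers, nodes are their (k-1)-mer prefixes and suffixes) and walks an Eulerian path with Hierholzer's algorithm. The three reads and all function names are hypothetical; real assemblers additionally handle sequencing errors, reverse complements, and coverage, none of which is modeled here.

```python
from collections import defaultdict
from typing import Dict, List


def kmers(seq: str, k: int) -> List[str]:
    """Return every substring of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, List[str]]:
    """Each unique k-mer becomes one edge: (k-1)-prefix -> (k-1)-suffix."""
    graph: Dict[str, List[str]] = defaultdict(list)
    unique_kmers = sorted({km for read in reads for km in kmers(read, k)})
    for km in unique_kmers:
        graph[km[:-1]].append(km[1:])
    return dict(graph)


def eulerian_path(graph: Dict[str, List[str]]) -> List[str]:
    """Hierholzer's algorithm; assumes an Eulerian path or cycle exists."""
    if not graph:
        return []
    in_deg: Dict[str, int] = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_deg[node] += 1
    nodes = sorted(set(graph) | set(in_deg))
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in nodes if len(graph.get(n, [])) - in_deg.get(n, 0) == 1),
        min(graph),
    )
    remaining = {n: list(s) for n, s in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path: List[str]) -> str:
    """Spell the sequence represented by a path of overlapping nodes."""
    return path[0] + "".join(node[-1] for node in path[1:]) if path else ""


# Hypothetical short reads sampled from the consensus of Figure 2.
reads = ["GTTTGG", "TTGGTT", "GGTTGT"]
graph = build_de_bruijn(reads, k=4)
contig = spell(eulerian_path(graph))
print(contig)
```

Because the (k-1)-mers GTT and TTG each occur twice in the consensus, this graph admits two Eulerian paths that use exactly the same k-mers (GTTTGGTTGT and GTTGGTTTGT). Which one is returned depends on edge order, a small-scale view of how repeats complicate assembly.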
computing. All of these methods are claimed to yield longer contigs and more representative taxonomic representations on simulated and real data compared with assemblers designed for single genome assembly.
After the reads are assembled into contigs, the relative positions of the contigs along a genome are determined by scaffolding, a process that depends on mate-pair information. If two ends of the mate pair are in differing contigs, the two contigs are inferred to be adjacent to each other on the genome. Most of the assemblers contain a scaffolding module. This scaffolding algorithm starts with the most reliable information and gradually adds more data as long as the new information agrees with the constructed scaffold. Tools, such as Bambus 2,25 have been developed for metagenome scaffolding. Bambus 2 can be applied to virtually all existing sequencing technologies and the output from popular assemblers.
Gene Prediction
A fundamental purpose of genome assembly is to enable gene predictions using scaffolds, such that genes can be classified into correct functions. There are two classically described methods for gene prediction: (1) train model parameters on known annotations to predict unknown annotation or (2) train models based on homology search, which aligns sequences to a gene database to find homologous sequences. However, it is not possible to apply these traditional methods directly to metagenomics data. The incomplete open reading frame (ORF) acquired from metagenome assembly often lacks start or stop codons; therefore, ab-initio programs do not work in this scenario. In addition, there is not yet a sufficient metagenomics sequence database to build a statistical model to distinguish coding from noncoding ORFs. The obvious drawback for these homology-based methods is that they only provide information for known genes.
Recent tools have been developed to address this core issue of metagenome gene prediction. MetaGene Annotator (MGA)26 is upgraded from MetaGene.27 First, all ORFs in MGA are extracted and scored on a model estimated from annotated genomes. Then, an optimal combination of ORFs is calculated using the scores of orientations and distances of neighboring ORFs. MGA also uses the logistic regression models of the GC content and di-codon frequencies from MetaGene. In addition, it has an adaptable ribosomal binding site model based on complementary sequences of 16S ribosomal RNA, which helps MGA to precisely predict translation start sites.
MetaGeneMark28 uses a heuristic approach originally developed for finding genes within small fragments of anonymous prokaryotic genomes and/or highly inhomogeneous genomes. The training dataset consists of 357 bacteria and Archaea species. Linear regression is applied to the relevant information in the training set, such as the relationship between positional nucleotide frequencies and the global nucleotide frequencies and the relationship between the amino acid frequencies and the global GC content. The initial frequency values of the occurrence of 61 codons are calculated based on the above information and subsequently modified by the frequency of each amino acid determined by the GC content. Finally, the Markov model of a protein coding region is constructed based on the usage of all 61 codons.
Orphelia is the third recently developed tool for gene prediction,29 and is unique in that it adopts a neural network-based method. The neural network is trained on randomly excised DNA fragments from the genomes that were used for discriminate training. The artificial neural network combines sequence features, such as monocodon usage, dicodon usage, and translational initiation sites, with ORF length and GC content to compute a posterior probability of an ORF to encode a protein.
One recent study has benchmarked these three gene prediction methods and demonstrated variable performance at different read lengths and fragment types.26 As might be logically predicted, the authors found that longer reads result in better gene prediction. In addition, while MGA had the best sensitivity, it was the worst in specificity for most read lengths. MetaGeneMark had average sensitivity but much better specificity than MGA, and Orphelia had the lowest annotation error for longer read lengths. Therefore, the combination of several methods, screening intergenic regions for overlooked genes, and using dedicated frameshift detectors may result in better prediction accuracy.26 Decisions as to which will perform optimally will also be dependent upon the number of reads per sample and the ratio of bacterial to human reads. This is similarly related to the human body niche of sample origin.
Taxonomy Identification
One important question to answer in metagenomics analysis is “What microbes are present?” which leads to the identification of taxonomy distribution in metagenomic samples. 16S rDNA-based surveys produce on average 10,000 sequences that range from 400 to 700 bp in length per sample. Rapid taxonomic classifiers, such as the Ribosomal Database Project (RDP) classifier,30 use these sequences to generate taxonomic distribution down to the genus level. Despite their popularity, 16S rDNA-based methods suffer from the biased estimation of microbial diversity due to the variability in copy number of the 16S gene and the PCR.
There are two key pathways enabling taxonomic identification using WGS reads. The first employs homology search against a reference gene database. For example, MEGAN31 first performs a BLASTX search against the NCBI-NR database. Taxonomic analysis is then conducted by placing each read onto a node of the NCBI taxonomy according to the lowest common ancestor of the top hits, and the NCBI taxonomy is based on a hierarchically structured classification of all species represented in the NCBI. Instead, CARMA,32 and the refined version WebCARMA,33 search all Pfam domain and protein families as phylogenetic markers to identify the source organisms of unassembled reads using hidden Markov models. Then a phylogenetic tree is reconstructed for each matching Pfam family and the corresponding query reads. Finally, the reads are classified into a higher-order taxonomy depending on their phylogenetic relationships to family members with known taxonomic affiliations. It is worth noting that only a small portion of reads have matches by BLAST against the microbial database.
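The lowest-common-ancestor placement described for MEGAN can be illustrated with a minimal Python sketch. The toy taxonomy, the read names, and the helper functions below are hypothetical and are not MEGAN's data structures or interface. A real analysis would use the full NCBI taxonomy and would first filter the BLASTX hits by score, a step omitted here.

```python
from typing import Dict, List

# Hypothetical toy taxonomy (child -> parent); not real NCBI taxon identifiers.
PARENT: Dict[str, str] = {
    "Lactobacillus crispatus": "Lactobacillus",
    "Lactobacillus iners": "Lactobacillus",
    "Lactobacillus": "Lactobacillaceae",
    "Lactobacillaceae": "Bacteria",
    "Gardnerella vaginalis": "Gardnerella",
    "Gardnerella": "Bifidobacteriaceae",
    "Bifidobacteriaceae": "Bacteria",
    "Bacteria": "root",
}


def lineage(taxon: str) -> List[str]:
    """Return the path from the top of the tree down to `taxon`."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def lowest_common_ancestor(taxa: List[str]) -> str:
    """Deepest node shared by the lineages of all hit taxa."""
    if not taxa:
        return "unassigned"  # read had no database hit
    lca = "root"
    for level in zip(*(lineage(t) for t in taxa)):
        if len(set(level)) != 1:
            break  # lineages diverge below the current node
        lca = level[0]
    return lca


def assign_reads(hits_per_read: Dict[str, List[str]]) -> Dict[str, str]:
    """Place each read on the LCA of the taxa of its hits."""
    return {read: lowest_common_ancestor(hits)
            for read, hits in hits_per_read.items()}


# Hypothetical hit lists, one per read.
example_hits = {
    "read1": ["Lactobacillus crispatus", "Lactobacillus iners"],
    "read2": ["Lactobacillus crispatus", "Gardnerella vaginalis"],
    "read3": ["Gardnerella vaginalis"],
    "read4": [],
}
assignments = assign_reads(example_hits)
```

In this sketch, a read whose hits span two Lactobacillus species is placed at the genus node, while a read hitting both Lactobacillus and Gardnerella can only be placed at the shared kingdom-level node. This shows why conserved sequences tend to receive coarse assignments under this scheme.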
An alternative strategy to homology-based approaches is to use machine learning and statistical methods to classify reads based on the composition of the DNA base signatures. The interpolated Markov models (IMMs) have been employed with success in bacterial gene classification using the GLIMMER system.34 Compared with other methods, IMMs utilize information from sequences of different lengths and integrate the results. The program Phymm35 demonstrates the use of IMMs in classification of metagenomic reads. In Phymm, a classifier is trained on a large amount of curated genomes. This classifier constructs probability distributions that represent the observed patterns of nucleotides characterizing each chromosome or plasmid. PhymmBL35 demonstrates that the combination of machine learning and BLAST produces higher accuracy than either method alone.
Given the complexity of metagenomic assemblies, the taxonomic classification can also be achieved by directly using reads before assembly. Large-scale studies, for example, the HMP,1–3 likely include hundreds of samples. In these large-scale studies, the computational efficiency of BLAST becomes the bottleneck for the analysis process if all the reads are used for classification without assembly. Therefore, reference marker gene sets are constructed to reduce the size of the database. MetaPhyler36 is one of these methods, which relies on 31 phylogenetic marker genes derived from existing genomes and the NCBI-NR database. Furthermore, instead of using a universal classification threshold for all genes at all taxonomic levels, MetaPhyler uses different thresholds for classifiers to the reference gene and to the taxonomic level, which results in much faster analysis. MetaPhlAn37 first identified more than 2 million potential markers using 2,887 genomes from the Integrated Microbial Genomes (IMG) system,38 which was further refined to a catalog of 1,221 species with 231 markers per species and > 115,000 markers at higher taxonomic levels. The relative abundance of each taxonomic level is made by the alignment of reads to clade-specific marker sequences in this catalog. Microbial clade abundance is then estimated by normalizing read-based counts with the average genome size of each clade. MetaPhlAn has been applied to the analysis of the vaginal microbiome (posterior fornix) of asymptomatic women enrolled in the HMP. As Lactobacillus is the dominant genus in the vaginal microbiome, it is important to further classify this genus down to the species level to reflect the detailed microbial profile. Using this strategy, all five of the signature Lactobacillus species could be identified by MetaPhlAn. Despite the technical difference between 16S sequencing and WGS, the estimated relative abundance is quite similar.
Functional and Pathway Annotation
After discovering the microbial composition, the next question to be answered from WGS data is “What are these microbes able to do?”. There are two issues involved in this process. The first issue is to assign functional annotation to the assembled ORF or to the reads directly. The other issue is to place the genes in the context of biological pathways, especially metabolic pathways.39 The most straightforward way for functional prediction is by aligning query sequences to an existing reference protein database, but then one must determine which database to use. The size and contents of the databases are different, which will in turn affect the efficiency and accuracy. If one is interested in annotating as many sequences as possible, the NCBI RefSeq database would be a good choice because it has the most comprehensive collections of genomes. For this purpose, various versions of BLAST, including BLASTX and BLASTP, could be applied. However, this approach suffers from the long computation time required to search through all the homologs in reference to the database for each sequence in the dataset. To speed up the process, BLAST can be done in parallel, like the MBLASTX (MulticoreWare, St. Louis, MO) used by the HMP. Tools, such as UBLAST,40 have been developed for high-throughput sequence classifications that are often an order of magnitude faster than BLAST in real applications, but these applications lose sensitivity. The raw quantification obtained from alignments needs to be normalized by the size of the reference coding sequences. The results from a homology search are often affected by sequence conservation due to the functional homology in different organisms. When sequences are mapped to structurally or functionally conserved regions, they can easily be assigned to different species if only a similarity score is used.
A possible solution to misclassification is to adopt the more sensitive profile-based search method. This method uses databases with profiles generated from alignments of protein families that share similar functions, such as COG, Pfam, or TIGRfam. Hidden Markov–based HMMER41 was designed to perform a fast search against profiles generated from multiple sequence alignments. Although more sensitivity is achieved this way, fewer sequences get annotated. For partial proteins generated on short contigs or unassembled sequences, a repository with patterns or motifs (e.g., PROSITE) might be used for a functionality search. If gene prediction is successful, genomic neighborhood, phylogenetic profiling, and gene coexpression analysis may provide useful information for functional prediction as well.
Pathway-based analysis was developed to interpret the results from microarray experiments before being applied to metagenomics data. Pathway here indicates a series of connected sets of genes with nodes representing genes and lines representing their relationships (►Fig. 3). The significance of these pathways is decided by functional enrichment statistics (Fisher exact test) or by scoring based on the pool of genes in the sample (gene set enrichment analysis [GSEA]). One major drawback of these count-based methods is disregarding the topology of the pathways. The order of the genes in the pathway could help with the interpretations of the results. Fortunately, more complex methods have been developed to address this problem. THINK-Back,42 which stands for Knowledge-based Interpretation of High Throughput data, is a suite of tools trying to generate biologically meaningful hypotheses by using knowledge in pathway databases, such as KEGG, PANTHER, Reactome, and Biocarta. One method in THINK-Back adjusts the score generated by GSEA43 by incorporating the appearance frequency of the genes in a KEGG database.44 Another method takes into account the topology
of the pathways to calculate a density score, which is subsequently used for adjusting GSEA scores.42 The pathway reconstruction using WGS data is essentially using the number of gene copies to indicate the activity of a pathway, which is quite different from RNA-based microarray and RNA-seq data analysis. But the idea of integrating pathway topology information into pathway analysis could still be applied to metagenomics pathway reconstruction.
Figure 3 Mock KEGG pathway map shows the concept of pathway analysis. The figure on the top contains all the KEGG pathways involved in metabolic process. Blue and red colors indicate the enrichment of genes in either case or control group. Two examples of KEGG pathways are shown with nodes representing proteins or molecules and lines representing their biological relationship. The gradient of red indicates the average relative ratio of gene abundance between case and control samples.
As described above, most of the ORFs assembled from metagenomics reads are partial and likely contain errors caused by frame shifts; therefore, another way to perform functional annotation is to skip the gene calling altogether and use the protein coding sequences identified from the reads. In HUMAnN (the HMP Unified Metabolic Analysis Network),45 the reconstruction of a network is accomplished by mapping the protein coding genes onto reference pathway collections, such as eggNOG and KEGG orthology groups, based on their homology to the reference genes previously characterized. MinPath46 adopts an integer programming algorithm to reconstruct “minimal pathway,” which is defined as a list of functions annotated for a set of genes in a minimal pathway that includes all the gene functions. This approach avoids the problem of identification of spurious pathways and overestimation of microbial abundance. After normalization and smoothing, pathway coverage (relative confidence of each pathway being present in the sample) and pathway abundance (relative “copy number” of each pathway in the sample) are generated and organized into a matrix-like format for postprocessing.
Comparative Metagenomics
Despite all the challenges with WGS as covered in this review, important environmental and biological questions have been investigated through comparisons of taxonomic abundance and metabolic pathway activity. Because of the dynamic nature of the microbiome, there is large variation in microbiome profiles even from samples obtained from a similar environment. Therefore, a higher number of samples need to be collected to get an accurate measure of the microbiome.
Thus, in addition to a large number of sequences, there are also many samples with associated metadata. The taxonomy profile is often organized into a matrix, with rows representing taxa (either relative abundance at each taxonomic level or OTU counts) and columns representing samples. Depending on the complexity of the microbiome, there could be thousands of rows and columns in the matrix.

A matrix of this size raises the problem of high dimensionality, and dimensionality reduction becomes important to decrease the computational cost. If the taxa table is large, it is helpful to first filter the table to remove singletons or OTUs/species that appear in only a small number of samples. Because singletons or rare species may be generated by sequencing error, they are not helpful for comparative metagenomics. Principal coordinates analysis (PCoA) is the most popular technique for dimensionality reduction. PCoA takes the results of β diversity comparisons, generated using a phylogenetic (UniFrac47) or nonphylogenetic (such as Canberra) distance on the taxon table, and produces a new matrix with fewer dimensions by eigendecomposition. The direction of each axis is chosen to maximize the variation in the data. Normally, the first three coordinates are chosen to visualize the samples in three dimensions. Points that cluster together indicate samples with similar taxonomy profiles. An alternative, nonparametric rank-based method is nonmetric multidimensional scaling, which could avoid the arch effect caused by the sparsity of the matrix.48

If clusters are observed in a 3D PCoA plot, most biologists will want to know which taxa cause the differences in the microbial community with respect to the metadata. Machine learning techniques can be used to answer this question. Not every statistical test should be used for every analysis, but the combination of several analyses can produce a more accurate result. Random Forest generates a large ensemble of decision trees, each built from a random subset of the data and a random selection of the variables. The resulting ensemble of trees is then used with a majority-voting approach to decide which sample belongs to which group. One advantage of Random Forest is that there is no need for cross validation to get an unbiased estimate of the test set error. An out-of-bag error estimate is generated internally, because each tree is built on a bootstrap sample of the original data and can be tested on the samples left out of that bootstrap. This is very useful when the sample size is small. Boruta is an all-relevant feature selection wrapper algorithm around Random Forest. It finds important features by iterative learning of the Random Forest classifier. In the end, a list of features confirmed to differentiate groups is generated.4 LEfSe (linear discriminant analysis effect size)49 is a recently developed tool to identify genomic features (genes, pathways, or taxa) specific to each group. LEfSe first uses the Kruskal–Wallis rank-sum test and the Wilcoxon rank-sum test to identify features with significantly different abundance with respect to the class of interest. Linear discriminant analysis is then applied to estimate the effect size of each differentially abundant feature. LEfSe also provides bar plots and cladogram plots to represent the discovered biomarkers.

Random Forest and linear discriminant analysis are both supervised machine learning methods, which means that the samples have been assigned to a group before the learning task. Unsupervised learning has also been applied to metagenomics data to discover the hidden structure of the microbiome. One such effort is the introduction of the “enterotype” by Arumugam et al50 using human gut samples from the MetaHIT consortium.50 Enterotypes are generated by performing clustering analysis on the gut microbial communities. The differences among the three enterotypes are driven by key bacterial genera and are not related to age, gender, or body weight. There is also a report of enterotype-like clusters in the vaginal microbiome community based on the abundance of bacterial species, mainly species in the Lactobacillus genus.51 However, recent research, including our own, indicates that one should take precautions when performing enterotyping.52,53 Regardless of how the taxa table is generated, clustering is a statistical approach whose performance is affected by many factors. One recent study tried to identify the influence of various factors on enterotyping, including clustering methodology, distance metrics, OTU-picking methods, sequencing depth, and sequencing methods.52 Using the HMP data, instead of discrete enterotypes, a smooth gradient distribution of Bacteroides abundances was observed in the gut microbiome. For the vaginal microbiome, depending on the taxonomy level, distance metrics, and scoring methods, two to five clusters are found using the HMP vaginal data. These results suggest that distance metrics and clustering methods have the largest effect on enterotyping. At least one absolute scoring method combined with two to three distance metrics should be used to verify the existence of enterotypes.52

WGS Data Analysis Pipelines

As described above, metagenomics data analysis includes assembly, gene prediction and annotation, taxonomic classification, and so on, but each of these tasks is performed by specific software that requires installation, configuration, and integration. This is a daunting task even for bioinformatics experts. Most research groups construct an analysis pipeline by picking tools for each task based on their own experience. For a laboratory without bioinformatics support, it may be difficult to perform meaningful analysis with a large amount of data. With this in mind, we have recently worked to produce single-site, publicly available tool sets.54 Specifically designed for 16S analysis, our Genboree Microbiome tool set was deployed using the web-based Genboree workbench, which has an easy-to-use graphical interface. Users upload the sequencing file and metadata and choose the desired analysis by clicking. Similar web tools for WGS data analysis have also been developed. MG-RAST55 is a comprehensive web tool for both phylogenetic and functional summaries. MG-RAST is based on a modified version of the RAST (rapid annotation using subsystem technology) server56 built upon the SEED framework, which provides automated sequence assignment by comparison with both protein and nucleotide databases. Users can upload the sequence file to the server and keep the data private or make them public.

Like QIIME for 16S-based analysis, similar standardized frameworks for WGS data analysis have been created. SmashCommunity57 is one of the early pipelines designed for 454
and Sanger data, with limited capability for follow-up analysis. MetAMOS58 is a modular and customizable framework for metagenomic assembly and analysis, which is also user friendly. It is built upon the AMOS open-source genome assembly framework. A collection of publicly available tools is tied together by the lightweight workflow system Ruffus, including Meta-IDBA,21 MetaVelvet,20 SOAPdenovo,18 Bowtie,59 MetaGeneMark,28 MetaPhyler,36 and more. The modular design enables users to check the output of each step and facilitates the integration of data generated by other tools or in different formats.

WGS Application in Human Reproductive Medicine

Despite these advances in WGS analysis, the prevailing technique used for microbiome research in human reproductive medicine is still 16S rDNA sequencing. However, WGS techniques have been adopted in several recent studies in addition to 16S sequencing. A subset of samples from the HMP, including samples from the posterior fornix, was subjected to Illumina sequencing.2 The results of this study demonstrated that although each body site is characterized by signature clades, most of the metabolic pathways are evenly distributed and prevalent across both individuals and body habitats. However, this analysis revealed that the pathways related to the oligosaccharide and polyol transport systems are more active in posterior fornix samples. One recent study on dynamic changes of the gut microbiota from the first to the third trimester also used the Illumina HiSeq 2000 to examine the enrichment of specific metabolic pathways.60 The analysis did not find a difference in the mean relative abundance of gene categories or metabolic pathways between trimesters. Therefore, the shifts of the gut microbiome during pregnancy may not be associated with metabolic changes. However, a network analysis of correlations between COG (cluster of orthologous groups) abundances across samples indicated a loss of network modularity in the third trimester, which indicates a reduction in phylogenetic diversity and a more uneven distribution of taxa.60 The results of this study are in agreement with studies of phylogenetic diversity from our laboratory.4 As WGS techniques continue to improve and become more user friendly, they will be a powerful tool in future studies with a focus on human reproduction. For instance, in studying preterm birth, it can be challenging to detect bacteria in the amniotic fluid of patients.61 Yet, recent studies have detected bacteria deep within human fetal membranes.62 In addition, an independent study found that bacteria were harbored in the basal plate of the placenta.63 Remarkably, there was no statistically significant difference in the presence of bacteria between preterm and term patients.63 Thus, while previous sequencing techniques have failed to detect bacteria in the placenta, the advent of NGS techniques may help to advance our understanding of the role of the microbiome in promoting healthy, term pregnancies.

Conclusion

The rapid advancement of sequencing technology has brought both promise and challenges to the metagenomics field. We can now explore unknown environments as community genomic, ecologic niches in previously unparalleled and dynamic fashions. However, downstream analysis currently lags behind the sequencing technology. Compared with 16S-based metagenomic sequencing, WGS generates far more sequences, which require large amounts of storage, and produces large numbers of unknown species that demand more computational resources. In this review, we have introduced several recently developed tools dedicated to metagenomics assembly, gene prediction, and pathway reconstruction. There is still a high demand for more efficient and more sensitive tools to perform standardized analysis. In addition to WGS, RNA-based metatranscriptomics is also under development to provide more details on the dynamic changes in the community, which may alleviate the limitations of DNA-based methods. Metabolomics attempts to measure the complete set of molecules in the community, which could provide important information for the study of host–microbe interactions. In our era of “omics-based discovery science,” physician scientists are increasingly called upon to work side by side with computational scientists. It is our hope that this review will provide our fellow microbiome-minded reproductive and perinatal biologists with a working knowledge of the current state of the science.

References
1 Human Microbiome Project Consortium. A framework for human microbiome research. Nature 2012;486(7402):215–221
2 Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 2012;486(7402):207–214
3 Aagaard K, Petrosino J, Keitel W, et al. The Human Microbiome Project strategy for comprehensive sampling of the human microbiome and why it matters. FASEB J 2013;27(3):1012–1022
4 Aagaard K, Riehle K, Ma J, et al. A metagenomic approach to characterization of the vaginal microbiome signature in pregnancy. PLoS ONE 2012;7(6):e36466
5 Qin J, Li Y, Cai Z, et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 2012;490(7418):55–60
6 Joossens M, Huys G, Cnockaert M, et al. Dysbiosis of the faecal microbiota in patients with Crohn’s disease and their unaffected relatives. Gut 2011;60(5):631–637
7 Turnbaugh PJ, Bäckhed F, Fulton L, Gordon JI. Diet-induced obesity is linked to marked but reversible alterations in the mouse distal gut microbiome. Cell Host Microbe 2008;3(4):213–223
8 Sobhani I, Tap J, Roudot-Thoraval F, et al. Microbial dysbiosis in colorectal cancer (CRC) patients. PLoS ONE 2011;6(1):e16393
9 Goldenberg RL, Hauth JC, Andrews WW. Intrauterine infection and preterm delivery. N Engl J Med 2000;342(20):1500–1507
10 Nold C, Anton L, Brown A, Elovitz M. Inflammation promotes a cytokine response and disrupts the cervical epithelial barrier: a possible mechanism of premature cervical remodeling and preterm birth. Am J Obstet Gynecol 2012;208:e201–e207
11 Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol 2010;6(2):e1000667
12 Prakash T, Taylor TD. Functional assignment of metagenomic data: challenges and applications. Brief Bioinform 2012;13(6):711–727
13 Gonzalez A, Knight R. Advancing analytical algorithms and pipelines for billions of microbial sequences. Curr Opin Biotechnol 2012;23(1):64–71
14 Teeling H, Glöckner FO. Current opportunities and challenges in microbial metagenome analysis—a bioinformatic perspective. Brief Bioinform 2012;13(6):728–742
15 Pop M. Genome assembly reborn: recent computational challenges. Brief Bioinform 2009;10(4):354–366
16 Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008;18(5):821–829
17 Myers EW, Sutton GG, Delcher AL, et al. A whole-genome assembly of Drosophila. Science 2000;287(5461):2196–2204
18 Li R, Zhu H, Ruan J, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 2010;20(2):265–272
19 Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci U S A 2001;98(17):9748–9753
20 Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res 2012;40(20):e155
21 Peng Y, Leung HC, Yiu SM, Chin FY. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 2012;28(11):1420–1428
22 Wu YW, Rho M, Doak TG, Ye Y. Stitching gene fragments with a network matching algorithm improves gene assembly for metagenomics. Bioinformatics 2012;28(18):i363–i369
23 Boisvert S, Raymond F, Godzaridis E, Laviolette F, Corbeil J. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol 2012;13(12):R122
24 Boisvert S, Laviolette F, Corbeil J. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol 2010;17(11):1519–1533
25 Koren S, Treangen TJ, Pop M. Bambus 2: scaffolding metagenomes. Bioinformatics 2011;27(21):2964–2971
26 Yok NG, Rosen GL. Combining gene prediction methods to improve metagenomic gene annotation. BMC Bioinformatics 2011;12:20
27 Noguchi H, Park J, Takagi T. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res 2006;34(19):5623–5630
28 Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res 2010;38(12):e132
29 Hoff KJ, Lingner T, Meinicke P, Tech M. Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res 2009;37(Web Server issue):W101–105
30 Cole JR, Wang Q, Cardenas E, et al. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 2009;37(Database issue):D141–145
31 Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res 2007;17(3):377–386
32 Krause L, Diaz NN, Goesmann A, et al. Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res 2008;36(7):2230–2239
33 Gerlach W, Junemann S, Tille F, Goesmann A, Stoye J. WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads. BMC Bioinformatics 2009;10:430
34 Delcher AL, Harmon D, Kasif S, White O, Salzberg SL. Improved microbial gene identification with GLIMMER. Nucleic Acids Res 1999;27(23):4636–4641
35 Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods 2009;6(9):673–676
36 Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics 2011;12 Suppl 2:S4
37 Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods 2012;9(8):811–814
38 Markowitz VM, Chen IM, Palaniappan K, et al. IMG: the Integrated Microbial Genomes database and comparative analysis system. Nucleic Acids Res 2012;40(Database issue):D115–122
39 De Filippo C, Ramazzotti M, Fontana P, Cavalieri D. Bioinformatic approaches for functional annotation and pathway inference in metagenomics data. Brief Bioinform 2012;13(6):696–710
40 Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010;26(19):2460–2461
41 Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 2011;39(Web Server issue):W29–37
42 Farfan F, Ma J, Sartor MA, Michailidis G, Jagadish HV. THINK Back: Knowledge-based interpretation of high throughput data. BMC Bioinformatics 2012;13 Suppl 2:S4
43 Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005;102(43):15545–15550
44 Ma J, Sartor MA, Jagadish HV. Appearance frequency modulated gene set enrichment testing. BMC Bioinformatics 2011;12:81
45 Abubucker S, Segata N, Goll J, et al. Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput Biol 2012;8(6):e1002358
46 Ye Y, Doak TG. A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes. PLoS Comput Biol 2009;5(8):e1000465
47 Lozupone C, Knight R. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 2005;71(12):8228–8235
48 Dinsdale EA, Edwards RA, Bailey BA, et al. Multivariate analysis of functional metagenomes. Front Genet 2013;4:41
49 Segata N, Izard J, Waldron L, et al. Metagenomic biomarker discovery and explanation. Genome Biol 2011;12(6):R60
50 Arumugam M, Raes J, Pelletier E, et al. Enterotypes of the human gut microbiome. Nature 2011;473(7346):174–180
51 Ravel J, Gajer P, Abdo Z, et al. Vaginal microbiome of reproductive-age women. Proc Natl Acad Sci U S A 2011;108(Suppl 1):4680–4687
52 Spor A, Koren O, Ley R. Unravelling the effects of the environment and host genotype on the gut microbiome. Nat Rev Microbiol 2011;9(4):279–290
53 Wu GD, Chen J, Hoffmann C, et al. Linking long-term dietary patterns with gut microbial enterotypes. Science 2011;334(6052):105–108
54 Riehle K, Coarfa C, Jackson A, et al. The Genboree Microbiome Toolset and the analysis of 16S rRNA microbial sequences. BMC Bioinformatics 2012;13 Suppl 13:S11
55 Glass EM, Wilkening J, Wilke A, Antonopoulos D, Meyer F. Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harb Protoc 2010;2010(1):pdb.prot5368
56 Aziz RK, Bartels D, Best AA, et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 2008;9:75
57 Arumugam M, Harrington ED, Foerstner KU, Raes J, Bork P. SmashCommunity: a metagenomic annotation and analysis tool. Bioinformatics 2010;26(23):2977–2978
58 Treangen TJ, Koren S, Sommer DD, et al. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol 2013;14(1):R2
59 Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25
60 Koren O, Goodrich JK, Cullender TC, et al. Host remodeling of the gut microbiome and metabolic changes during pregnancy. Cell 2012;150(3):470–480
61 Han YW, Shen T, Chung P, Buhimschi IA, Buhimschi CS. Uncultivated bacteria as etiologic agents of intra-amniotic inflammation leading to preterm birth. J Clin Microbiol 2009;47(1):38–47
62 Steel JH, Malatos S, Kennea N, et al. Bacteria and inflammatory cells in fetal membranes do not always cause preterm labor. Pediatr Res 2005;57(3):404–411
63 Stout MJ, Conlon B, Landeau M, et al. Identification of intracellular bacteria in the basal plate of the human placenta in term and preterm gestations. Am J Obstet Gynecol 2013;208(3):226 e221–227
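
Supplementary sketch. The taxa-table preparation described in the comparative metagenomics discussion (filtering out taxa seen in very few samples, then computing a nonphylogenetic β diversity distance between samples) could look roughly like the following. This is a minimal illustration under stated assumptions, not the method of any tool cited in this review: the function names, the prevalence threshold of two samples, and the toy counts are all hypothetical, and the Canberra distance is written out by hand with the convention that positions where both samples have zero counts contribute nothing.

```python
from math import fsum


def filter_rare_taxa(table, min_samples=2):
    """Keep only taxa observed (count > 0) in at least `min_samples` samples.

    `table` maps taxon name -> list of counts, one entry per sample.
    The threshold is an illustrative assumption, not a recommended value.
    """
    return {
        taxon: counts
        for taxon, counts in table.items()
        if sum(1 for c in counts if c > 0) >= min_samples
    }


def canberra(x, y):
    """Canberra distance; positions where both values are zero contribute 0."""
    return fsum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y) if a or b)


def sample_distance_matrix(table):
    """Pairwise Canberra distances between samples (the columns of the table)."""
    taxa = sorted(table)
    n_samples = len(table[taxa[0]]) if taxa else 0
    # Turn the taxon-by-sample table into one abundance profile per sample.
    profiles = [[table[t][j] for t in taxa] for j in range(n_samples)]
    return [[canberra(p, q) for q in profiles] for p in profiles]


# Toy taxa table with made-up counts: rows are taxa, columns are samples S1..S4.
taxa_table = {
    "Bacteroides":   [40, 35, 5, 0],
    "Lactobacillus": [0, 5, 50, 60],
    "Prevotella":    [10, 0, 0, 10],
    "singleton_otu": [1, 0, 0, 0],   # seen in one sample only, so it is dropped
    "rare_otu":      [0, 0, 2, 0],   # seen in one sample only, so it is dropped
}

filtered = filter_rare_taxa(taxa_table)
dist = sample_distance_matrix(filtered)
```

The resulting square, symmetric matrix `dist` is the kind of input a PCoA or nonmetric multidimensional scaling routine would take; the eigendecomposition itself is left out here and would normally be delegated to an established numerical library.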