Recommendations For The Introduction of Metagenomic Next-Generation Sequencing in Clinical Virology, Part II: Bioinformatic Analysis and Reporting
8
Bhupender Singh and Ayan Roy
Abstract
Keywords
8.1 Introduction
Recent advances in microbiology have led researchers to explore microbes in new and
more insightful ways (Arnold et al. 2016). The finding that most microbes cannot be
cultured acted as a catalyst for changing the approaches previously used to study
microbial populations (Stewart 2012). Awareness of the impact of microbes on humans
and the environment has led microbiologists to develop strategies for examining
uncultured microorganisms (Handelsman 2004). The drive to study the evolutionary and
functional characteristics of uncultured microbial diversity gave rise to the
multidisciplinary field called metagenomics (Imhoff 2016). It involves isolating
genomic DNA from environmental samples, then cloning and expressing it in a
culturable organism for further analysis (Handelsman 2004). The term metagenomics
was coined to denote the meta-examination of relatively similar microbial
populations residing in various niches (Neelakanta and Sultana 2013). The field was
brought into the spotlight by the studies of DeLong and his colleagues, who
generated a metagenomic library of prokaryotes from sea water. 16S rRNA sequencing
confirmed that the library belonged to an archaeon that had not been cultured until
that time (Stein et al. 1996).
Sequencing analysis plays a unique role in the identification and functional
annotation of metagenomic samples (Österlund et al. 2017). The very first
metagenomic analysis coupled with shotgun sequencing was carried out to analyse the
viral and microbial diversity residing at the surface of sea water, and more than
65% of the sequences it identified were previously unknown (Osunmakinde et al.
2018). The majority of metagenome analyses have been performed on marine samples,
first because two-thirds of the earth is covered by water and second because the sea
provides a niche for diverse microbial communities that regulate the ecosystem
(Aguiar-Pulido et al. 2016). Apart from this, the marine microbial population serves
as a hidden treasure trove of novel pharmaceutically and industrially relevant
metabolites (Hug et al. 2018). Progressively, the combination of metagenomics with
high-throughput sequencing (HTS) technologies has opened the floodgates for finding
novel microbial metabolites with clinical implications. Together these approaches
have identified biomass-degrading enzymes from cow rumen, recognised novel CRISPR
(Clustered Regularly Interspaced Short Palindromic Repeats) systems and set up a
gene index of the human microbiome (Seshadri et al. 2018; Stewart et al. 2018).
8 Metagenomics and Drug-Discovery 135
HTS technology was first introduced in 2005, and since then it has been steadily
improved to provide greater accuracy and precision in genome analysis (Reuter et al.
2015). Advances in HTS have laid the foundation for extensive exploration of
microbial communities by producing high-throughput genomic data in a cost- and
time-effective manner (Zhou et al. 2015). The array of HTS approaches includes
amplicon sequencing, whole-genome sequencing and shotgun metagenome sequencing
(Petrosino et al. 2009). The advantages of HTS over conventional sequencing include
its high-throughput data generation, the absence of cloning and lower cost (Ari and
Arikan 2016). The critical step in this technology is to draw statistically
significant inferences from the generated data. In the following discussion, various
HTS platforms that can be used to analyse metagenomes are highlighted (and
summarized in Fig. 8.1).
Several bioactive metabolites with immense therapeutic potential have been obtained
from metagenomic samples. Some examples include malacidin, fluoroquinolone, minimide
and erdacin.
GS20, introduced in 2005, was the first HTS variant. This sequencer implements a
sequencing-by-synthesis approach in a picotitre plate, which gives 20 megabases of
output in a single run with a mean read size of 100 base pairs (Pareek et al. 2011).
The sequencer works by pyrosequencing, which involves NTPs (nucleotide
triphosphates): the addition of a nucleotide complementary to the sequencing strand
is detected through the release of pyrophosphate (Harrington et al. 2013). Its
high-end variant, GS-FLX Titanium+, generates around 850 megabases in a single run
with a mean read size of 700–750 base pairs. The system was well suited to 16S rRNA
sequencing because its reads can span the highly variable regions of 16S rRNA. The
GS-FLX variants were discontinued in December 2016 because their cost, error rate
and required quantity of sample DNA were higher than those of other HTS platforms,
but they produced an enormous amount of data that has not yet been fully absorbed
into scientific knowledge (Liu et al. 2012).
Illumina variants include GA I, GA II, HiSeq, MiSeq, NextSeq 500, HiSeq 2500 and
HiSeq X Ten. The Illumina sequencer was introduced in 2006 and became widely used by
scientists because of its lower cost. Its major downside was the short read size of
the initial variants, which was addressed in later variants. The MiSeq variant gives
2 × 300 base pairs of read length (Quail et al. 2008). These features led scientists
to switch from the Roche 454 platform to the Illumina sequencer. It uses sequencing
by synthesis with reversible dye terminators. The HiSeq 2500 variant produces up to
four billion reads of 125 bases each in paired-end mode (Schirmer et al. 2016). A
more recent advancement, HiSeq X Ten, couples ten HiSeq machines, as the name
suggests, to obtain high-throughput data (Levy and Myers 2016). Illumina has also
introduced its first small-sized sequencer, the NextSeq 500 (Buermans and den Dunnen
2014).
Ion Torrent was the first organisation to introduce small-scale sequencers to
researchers, in the form of the PGM in 2010. It received positive feedback and
became popular among researchers for performing sequencing analysis at comparatively
low cost. Sequencing is carried out in a microtiter plate in which DNA stretches are
attached to beads; when a nucleotide is added to the sequencing strand, a proton is
released, and the resulting change in pH is sensed by the detector. Ion Proton, the
high-throughput variant of Ion Torrent, generates ten gigabases of data with almost
50 million reads of 200 base pairs in a single run (Lahens et al. 2017). The most
significant advantage of the PGM is that it can generate a large read size of about
400 base pairs (Henson et al. 2012). The latest variant is the Ion S5, which can
generate 15 gigabases of output with 60 to 80 million reads of around 200 base pairs
in a single run (Mehrotra et al. 2017).
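The per-run figures quoted for these platforms are tied together by simple arithmetic: total output is roughly the number of reads multiplied by the mean read length. The following minimal Python sketch checks the Ion Proton and GS20 figures given in the text. The function names are illustrative only, and the calculation ignores read-length variation and filtered reads.

```python
def expected_yield_bases(read_count: int, read_length_bp: int) -> int:
    """Approximate run output in bases, assuming every read has the mean length."""
    return read_count * read_length_bp


def reads_for_output(output_bases: int, read_length_bp: int) -> int:
    """Approximate number of reads needed to reach a given output."""
    return output_bases // read_length_bp


# Ion Proton figures from the text: about 50 million reads of 200 bp.
ion_proton_bases = expected_yield_bases(50_000_000, 200)
print(ion_proton_bases / 1e9, "gigabases")  # 10.0 gigabases

# GS20 figures from the text: 20 megabases of output, 100 bp mean read size.
print(reads_for_output(20_000_000, 100), "reads")  # 200000 reads
```

The same relation gives a quick sanity check on any quoted specification: when two of the three quantities are known, the third should be of the expected order of magnitude.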
The nanopore sequencer from Oxford uses state-of-the-art strand sequencing, which
can sequence an entire DNA fragment by detecting the change in electric current as
it passes through minute nanopores made of protein. The MinION Mk1B is a compact
sequencer that can be coupled with any sort of computer for immediate data analysis.
The PromethION sequencer offers 144,000 channels (48 flow cells of 3000 nanopore
channels each) for sequencing. The SmidgION variant can be operated through a
smartphone for instant analysis. The VolTRAX variant can be controlled over
Universal Serial Bus (USB) after sample loading. The comparatively larger read size
removes the necessity of shotgun sequencing, thus bringing a revolution to the field
(Wanunu 2012).
138 B. Singh and A. Roy
Poor-quality reads are initially processed by utilities that support the sequencer
variant used. One such utility is the FASTX-Toolkit (a command-line utility for
pre-processing FASTA/FASTQ data). FastQC (a quality-check utility for raw HTS data)
is used for the same purpose and also reports overall statistics for the FASTQ data.
Tools such as Galaxy (a multivariate genome analysis platform), SolexaQA (for
viewing a graphical representation of sequence quality) and Lucy 2 (a command-line
sequence cleaner and visualiser) are also used to process FASTQ data. These tools
rely on Q quality or Phred scores (a measure of sequencing quality), whose threshold
depends on the sequencer variant used.
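To make the Phred score concrete: a score Q corresponds to an estimated base-calling error probability P = 10^(-Q/10), so Q20 means one expected error in 100 bases. The sketch below decodes a FASTQ quality string and applies a mean-quality filter. It assumes the Phred+33 ASCII encoding and a cut-off of Q20. Both are illustrative assumptions, and the sketch is not how any of the tools named above implements its filtering.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 ASCII encoding of quality characters


def phred_scores(quality_string: str, offset: int = PHRED_OFFSET) -> list:
    """Decode a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q: float) -> float:
    """Convert a Phred score into the estimated probability of a wrong base call."""
    return 10 ** (-q / 10)


def passes_quality(quality_string: str, min_mean_q: float = 20.0) -> bool:
    """Keep a read only if its mean Phred score reaches the chosen threshold."""
    scores = phred_scores(quality_string)
    if not scores:
        return False
    return sum(scores) / len(scores) >= min_mean_q


print(phred_scores("I5!"))           # [40, 20, 0]
print(error_probability(20))         # 0.01
print(passes_quality("IIII5555"))    # True  (mean Q = 30)
print(passes_quality("!!!!5555"))    # False (mean Q = 10)
```

In practice the threshold would be chosen per platform, as the text notes.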
Low-complexity regions are masked with the help of utilities like DUST. After this,
reads/sequences sharing more than 95% identity are eliminated as duplicates. Some
tools, like MG-RAST (MetaGenomic Rapid Annotation using Subsystems Technology),
allow the user to eliminate reads that closely match the genome of a model organism
such as human, fly, cow or mouse. This screening is mediated by the Bowtie 2 utility
(a fast and efficient tool for aligning sequencing reads against a reference
sequence).
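The 95% identity rule for removing near-duplicate reads can be illustrated with a toy example. The sketch below is a simplification under stated assumptions: identity is computed position by position and only for reads of equal length, with no alignment, and reads are compared greedily against those already kept. Production tools use alignment or k-mer based clustering instead, and the function names here are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions; defined here only for equal-length reads."""
    if len(a) != len(b) or not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def remove_near_duplicates(reads, threshold: float = 0.95):
    """Greedily keep a read unless it shares more than `threshold` identity
    with a read that has already been kept."""
    kept = []
    for read in reads:
        if all(identity(read, representative) <= threshold for representative in kept):
            kept.append(read)
    return kept


reads = [
    "A" * 40,          # first read, always kept
    "A" * 39 + "C",    # 39/40 = 97.5% identical to the first read, so dropped
    "ACGT" * 10,       # only 25% identical to the first read, so kept
]
print(remove_near_duplicates(reads))
```

The greedy strategy makes the result depend on read order, which is acceptable for a sketch but is one reason real pipelines sort reads (for example by length or quality) before clustering.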
In this step, “gene calling” is used to recognise genes present in reads/contigs.
Targets include CDS (coding DNA sequences) and non-coding RNA genes, and some tools
also allow the user to recognise CRISPR (Clustered Regularly Interspaced Short
Palindromic Repeats). Metagene, FragGeneScan, Prodigal, MetaGeneMark and Orphelia
help to recognise CDS genes by ab initio gene prediction. These utilities use codon
information to classify regions of reads/contigs as coding or non-coding. They can
be trained on user-supplied datasets. FragGeneScan is used for recognising
prokaryotic genes and is implemented by IMG/M (Integrated Microbial Genomes with
Microbiome samples; a platform for comparative metagenome data analysis), EBI
(European Bioinformatics Institute) Metagenomics and MG-RAST.
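As a deliberately simplified stand-in for the ab initio gene callers named above, the following sketch scans the three forward reading frames of a contig for open reading frames that run from an ATG start codon to the first in-frame stop codon. This is only a naive ORF scan: real gene callers use trained codon-usage models, handle both strands and tolerate the fragmentary genes found in short reads. The function name and the minimum-length default are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(sequence: str, min_codons: int = 3):
    """Return (start, end, orf_sequence) tuples for forward-strand ORFs.

    Coordinates are 0-based and end-exclusive; the ORF includes its stop codon,
    and min_codons counts every codon from the start through the stop.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3                    # close it at the first stop
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, seq[start:end]))
                start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
# [(2, 14, 'ATGAAATTTTAG')]
```

An ORF that is still open when the contig ends is discarded here, which is exactly the case that fragment-aware tools are designed to recover.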
The next step in metagenomic data elucidation is the assignment of functions to
genes. This is accomplished by the similarity-search method, in which the query
sequence is compared with database sequences carrying annotated gene information.
The basic steps involved in metagenome annotation are shown in Fig. 8.2. The large
size of metagenomic data has made this process automated and computationally
expensive. The BLAST (Basic Local Alignment Search Tool) utility is run on high-end
computing servers. Multithreading is used, in which a process is divided across
numerous CPU (Central Processing Unit)/GPU (Graphics Processing Unit) cores in order
to obtain results in a short time span. Metagenomic data is annotated with the help
of various databases such as KEGG (Kyoto Encyclopedia of Genes and Genomes), eggNOG
(evolutionary genealogy of genes: Non-supervised Orthologous Groups), COG (Clusters
of Orthologous Groups)/KOG (EuKaryotic Orthologous Groups) and protein databases
such as PFAM (Protein FAMily), InterPro and TIGRFAM (The Institute for Genomic
Research’s database of protein families). Annotation pipelines typically combine
several of these databases. IMG/MER uses Hidden Markov Model (HMM) profiles to link
the query genes with PFAM, after which ortholog clustering is carried out with the
help of COG. The PSSM (Position-Specific Scoring Matrix) dataset is retrieved from
the NCBI (National Center for Biotechnology Information) for functional assignment
of proteins. In addition, genes are identified with the help of KEGG and EC (Enzyme
Commission) numbers, and evolutionary analysis of the metagenome data is carried out
by homology exploration.
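The idea of splitting a similarity search across several workers can be sketched in a few lines. The example below is a toy under stated assumptions and is not BLAST: similarity is a Jaccard score over shared 3-mers, the annotated database is a small in-memory dictionary, and all names are hypothetical. It only illustrates how a query set is divided into chunks, each chunk is annotated independently, and the partial results are merged.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq: str, k: int = 3) -> set:
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_hit(query: str, database: dict, k: int = 3):
    """Return (label, score) of the database entry sharing the most k-mers."""
    query_kmers = kmers(query, k)
    best_label, best_score = None, 0.0
    for label, reference in database.items():
        reference_kmers = kmers(reference, k)
        union = query_kmers | reference_kmers
        score = len(query_kmers & reference_kmers) / len(union) if union else 0.0
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


def annotate_chunk(chunk, database):
    """Annotate one chunk of (name, sequence) pairs."""
    return {name: best_hit(seq, database) for name, seq in chunk}


def annotate_parallel(queries: dict, database: dict, workers: int = 4) -> dict:
    """Split the queries into chunks and annotate the chunks concurrently."""
    items = list(queries.items())
    size = max(1, -(-len(items) // workers))          # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda c: annotate_chunk(c, database), chunks):
            results.update(partial)
    return results


database = {"kinase": "MKTAYIAKQR", "transporter": "GGLLVVAAGG"}
queries = {"q1": "MKTAYIAKQR", "q2": "GGLLVVAAGA"}
print(annotate_parallel(queries, database, workers=2))
```

Threads are used here only to keep the sketch short. For CPU-bound scoring in Python, a process pool or a natively multithreaded search tool would be the more realistic choice.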
IMG/MER contains a huge amount of genomic data, which it uses to retrieve extra
annotation information. The first step in its workflow is the prediction of genes
from the metagenome; it subsequently uses other resources to annotate those genes
further. This leads to the recognition of PFAM families, which are not determined by
MG-RAST, and results in comprehensive annotation alongside COG, which is the only
protein family identification resource used by MG-RAST. One drawback of IMG/MER is
the rapid increase in gene counts, which is not the case with MG-RAST; however,
because the query metagenome in IMG/MER is also subjected to PFAM analysis, the
result is more extensive annotation and reporting of the metagenome.
MG-RAST initially predicts the genes present in the metagenome and then searches for
homologs of those predicted genes in isolate genomes. The search is carried out by a
utility called BLAT (BLAST-Like Alignment Tool). It considers only those homologs
whose identity is more than 70% and thus omits a considerable number of hits. The
best homologs from the isolate genomes, rather than the metagenome genes themselves,
are then subjected to annotation. This becomes a drawback, as annotation is carried
out on the substituted genes of the isolate genomes while the metagenome itself is
ignored. Nevertheless, the advantage of this method is the shorter time span needed
for complete annotation. Apart from this, the database does not grow, whereas that
of IMG/MER does.
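The effect of the 70% identity cut-off followed by best-homolog selection can be illustrated on a small hit table. The sketch below assumes the hits have been exported in the 12-column, tab-separated layout used by BLAST's tabular output (`-outfmt 6`). That layout, the use of bit score to rank hits and the function name are assumptions made for illustration, since the text does not describe MG-RAST's internal formats.

```python
import csv
import io

# Column order of the 12-column tabular hit format assumed in this sketch.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text: str, min_identity: float = 70.0) -> dict:
    """Keep, per query, the highest-scoring hit whose identity exceeds the cut-off."""
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue                                   # skip blanks and comments
        hit = dict(zip(COLUMNS, row))
        pident = float(hit["pident"])
        bitscore = float(hit["bitscore"])
        if pident <= min_identity:
            continue                                   # hit is discarded entirely
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[2]:
            best[hit["qseqid"]] = (hit["sseqid"], pident, bitscore)
    return best


rows = [
    ["gene1", "refA", "92.5", "200", "15", "0", "1", "200", "5", "204", "1e-80", "310"],
    ["gene1", "refB", "75.0", "180", "45", "0", "1", "180", "9", "188", "1e-40", "150"],
    ["gene2", "refC", "65.0", "150", "52", "1", "1", "150", "3", "152", "1e-20", "90"],
]
table = "\n".join("\t".join(r) for r in rows)
print(best_hits(table))
# {'gene1': ('refA', 92.5, 310.0)}
```

In this example `gene2` ends up with no annotation at all because its only hit falls below the cut-off, which is the loss of hits the paragraph above describes.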
Standalone bioinformatics and HTS approaches can predict only a limited number of
gene clusters, but improved tooling has allowed researchers to identify novel,
pharmaceutically significant active metabolites.
Nowadays, instead of functionally annotating metagenome gene clusters, research has
shifted to targeted screening, which considers the background of the metagenome
under examination. In the upcoming part, we highlight various pharmaceutically
active metabolites obtained through metagenome examination.
Irrespective of the pipeline followed for functional and structural
characterisation, the metagenome is a reservoir of several metabolite-synthesizing
genes.
In 1969, examination of the sea squirt Ecteinascidia turbinata led to the
identification of its anti-cancer properties. The structure of Ecteinascidin
(ET-743) was elucidated in 1984, and it is now a pharmaceutically validated
anti-cancer metabolite. Attempts to grow the sea squirt to meet pharmaceutical
demand were not very successful; instead, extensive artificial synthesis approaches
were adopted to meet pharma needs. The identification of homology between ET-743 and
the bacteria-derived metabolites saframycin A (Streptomyces lavendulae), safracin B
(Pseudomonas fluorescens) and saframycin Mx1 (Myxococcus xanthus) led to the
understanding that symbiotic bacterial communities synthesize ET-743. Metagenome
sequence analysis of the tunicate indicates that the non-ribosomal peptide
synthetase pathway is governed by the expression of 25 genes. The extensive sequence
annotation workflow identified Candidatus Endoecteinascidia frumentensis as the
producer of the bioactive molecule. The whole-genome size of this organism was found
to be approximately 631 kb. Determining the pathway for metabolite synthesis opens
the gates for pharma industries to synthesize the metabolite and its analogues at
massive scale.
8.4.2 Bryostatins
In 1968, Bryostatin was found in Bugula neritina; it later caught the limelight
because of its toxic action on cancerous cells, explicitly targeting Protein Kinase
C. The activity of Bryostatin has been assessed in more than 80 clinical trials, and
the compound has also been investigated for Alzheimer's disease. Initially, the
existence of numerous forms of the compound suggested that Bryostatin is produced
through a symbiotic relationship. Later, a cosmid library of B. neritina was
constructed and numerous corresponding clones were sequenced, which led to the
identification of the 65 kb bry gene cluster. Further, hybridisation studies were
carried out on two E. sertula strains from different hosts. In one strain the genes
were found adjoining, while in the other the gene cluster was separated from the
auxiliary genes. As E. sertula is not culturable, the bry gene cluster can be
expressed in various host organisms to meet the needs of the pharma industry.
8.4.3 Psymberin
Psymberin is a polyketide with cytotoxic activity against tumour cells. It was
obtained from numerous sea sponges. The compound is notable for its intricate
structure and bioactivity; its structure was elucidated over 11 years using more
than 600 samples. The biosynthesis of the compound was determined from the
metagenome of Psammocinia aff. bulbosa. Samples of Psammocinia aff. bulbosa were
collected by scuba diving at Milne Bay, New Guinea. Protein sequence-based
alignments of psymberin and other related groups were then generated, and amplicons
were produced by primer-based amplification. Total sponge DNA was isolated and two
libraries (320,000 and 900,000 clones) of Psammocinia aff. bulbosa were generated,
followed by PCR-based screening with the psymEAD-Yyspez2-forward and
psymEAD-Yyspez1-reverse primers to obtain the PKS gene clusters (Haas 2009). Genomic
composition analysis of the gene cluster suggests that the compound is derived from
bacteria.
8.4.4 Onnamides
The tumour-targeting compounds mycalamide and pederin inhibit replication and
translation even at concentrations as low as 1 ng/ml. Studies suggest that these
compounds increase the life span of cancer-induced mice. Metagenome analysis has
found that these compounds are derived from a non-culturable Pseudomonas associated
with Paederus fuscipes. Homology studies of pederin led to the identification of
structurally and functionally similar compounds in Lithistida. Using pederin
polyketide synthase sequences as an amplification template, related genes were
obtained from the Theonella swinhoei metagenome, and the biosynthesis pathway for
onnamide was subsequently determined. Further analysis of the T. swinhoei metagenome
indicates that these polyketide synthases can only be obtained from sponges that
contain pederin homologs. Finally, it was established that onnamides are derived
from non-culturable Candidatus Entotheonella spp.
8.4.5 Patellazoles
Patellazoles were extracted from the tunicate during 1980 and have gained importance
due to their pharmacological significance. The metabolite showed antifungal activity
and cytotoxicity towards human cell lines.
The chief symbiont for this metabolite was L. patella and C. albicans. Because the
structural units of patellazole contain acetate and a thiazole ring, it was assumed
to be synthesized through polyketide synthase and non-ribosomal peptide synthetase
pathways, respectively. The assumption was tested by sequence-based metagenome
analysis of the tunic-cloaca niche, which yielded negative findings. PCR (Polymerase
Chain Reaction) analysis revealed that patellazole synthesis is mediated by the
trans-acyltransferase family, detected in the miniature zooids. Shotgun sequence
analysis of isolated zooid DNA revealed the 86 kb genome corresponding to
trans-acyltransferase polyketide synthase pathways. The genome was attributed to
Candidatus Endolissoclinum faulkneri.
8.4.6 Calyculin A
The compound was extracted in 1986 from the sea sponge Discodermia calyx and
possesses high cytotoxicity. Its biosynthesis is mediated by a combination of
non-ribosomal peptide and polyketide pathways. Homologs of Calyculin were also
identified, which suggests that its production depends on a symbiotic interaction.
The biosynthetic gene cluster for Calyculin was determined with the aid of
metagenome analysis: metagenome examination of D. calyx revealed a 150 kb gene
cluster. Taking this gene cluster as a template, molecular biology analysis showed
that the Calyculin synthesis pathway belongs to filamentous bacteria. Further, 16S
rRNA sequence analysis of these bacteria showed 97% identity with Candidatus
Entotheonella factor obtained from the T. swinhoei sponge.
8.4.7 Polytheonamides
8.5 Conclusion
the metagenome data but also perform the annotation effectively. We have discussed
various HTS platforms that can be used for metagenome examination, along with their
pros and cons. The strategies that can be employed to perform metagenome analysis
have also been discussed, along with some useful resources. The main objective of
this chapter was to draw the attention of researchers to the usefulness of HTS in
metagenome examination, so that this exponentially growing field not only receives
appreciation but also encourages wet-lab researchers to design their workflows in a
way that combines the distinctive strengths of wet-lab and dry-lab analysis to best
serve human society.
References
Aguiar-Pulido V, Huang W, Suarez-Ulloa V et al (2016) Metagenomics, metatranscriptomics, and
metabolomics approaches for microbiome analysis. Evol Bioinforma 12:5–16
Ardui S, Ameur A, Vermeesch JR, Hestand MS (2018) Single molecule real-time (SMRT)
sequencing comes of age: applications and utilities for medical diagnostics. Nucleic Acids
Res 46:2159–2168
Ari Ş, Arikan M (2016) Next-generation sequencing: advantages, disadvantages, and future. In:
Hakeem KR, Tombuloğlu H, Tombuloğlu G (eds) Plant omics: trends and applications.
Springer, Berlin, pp 109–135
Arnold JW, Roach J, Azcarate-Peril MA (2016) Emerging technologies for gut microbiome
research. Trends Microbiol 24:887–901
Buermans HPJ, den Dunnen JT (2014) Next generation sequencing technology: advances and
applications. Biochim Biophys Acta Mol Basis Dis 1842:1932–1941
Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of next-generation
sequencing technologies. Nat Rev Genet 17:333–351. https://doi.org/10.1038/nrg.2016.49
Haas MJ (2009) Polyketide pas de deux. Sci Exch 2:898–898. https://doi.org/10.1038/scibx.2009.898
Handelsman J (2004) Metagenomics: application of genomics to uncultured microorganisms.
Microbiol Mol Biol Rev 68:669–685. https://doi.org/10.1128/MMBR.68.4.669-685.2004
Harrington CT, Lin EI, Olson MT, Eshleman JR (2013) Fundamentals of pyrosequencing. Arch
Pathol Lab Med 137:1296–1303. https://doi.org/10.5858/arpa.2012-0463-RA
Henson J, Tischler G, Ning Z (2012) Next-generation sequencing and large genome assemblies.
Pharmacogenomics 13:901–915
Hug JJ, Bader CD, Remškar M et al (2018) Concepts and methods to access novel antibiotics from
actinomycetes. Antibiotics 7:44
Imhoff J (2016) New dimensions in microbial ecology—functional genes in studies to unravel the
biodiversity and role of functional microbial groups in the environment. Microorganisms 4:19.
https://doi.org/10.3390/microorganisms4020019
Lahens NF, Ricciotti E, Smirnova O et al (2017) A comparison of Illumina and Ion Torrent
sequencing platforms in the context of differential gene expression. BMC Genomics 18:602.
https://doi.org/10.1186/s12864-017-4011-0
Levy SE, Myers RM (2016) Advancements in next-generation sequencing. Annu Rev Genomics
Hum Genet 17:95–115. https://doi.org/10.1146/annurev-genom-083115-022413
Liu L, Li Y, Li S et al (2012) Comparison of next-generation sequencing systems. J Biomed
Biotechnol 2012:1–11. https://doi.org/10.1155/2012/251364
Mehrotra M, Duose DY, Singh RR et al (2017) Versatile ion S5XL sequencer for targeted next
generation sequencing of solid tumors in a clinical laboratory. PLoS One 12:e0181968.
https://doi.org/10.1371/journal.pone.0181968