Mastriani2018 Protocol Microarray-BasedMicroRNAExpres
Abstract
MicroRNAs (miRNAs) are small, noncoding RNAs that are able to regulate the expression of targeted
mRNAs. Thousands of miRNAs have been identified; however, only a few of them have been functionally
annotated. Microarray-based expression analysis represents a cost-effective way to identify candidate
miRNAs that correlate with specific biological pathways, and to detect disease-associated molecular signatures.
Generally, microarray-based miRNA data analysis comprises four major steps: (1) quality control and
normalization, (2) differential expression analysis, (3) target gene prediction, and (4) functional annotation.
For each step, a large number of software tools and packages have been developed. In this chapter, we
present a standard analysis pipeline for miRNA microarray data, assembled from packages mainly developed
in R and hosted in the Bioconductor project.
Key words MicroRNA (miRNA), Bioconductor, R package, Gene expression analysis, Microarray data analysis
1 Introduction
Yejun Wang and Ming-an Sun (eds.), Transcriptome Data Analysis: Methods and Protocols, Methods in Molecular Biology,
vol. 1751, https://doi.org/10.1007/978-1-4939-7710-9_9, © Springer Science+Business Media, LLC 2018
128 Emilio Mastriani et al.
2 Materials
2.1 Software Tools
2.1.1 R/Bioconductor
The most recent version of R was downloaded and installed. For this chapter, the Linux platform is used.
For R installation and administration, refer to the FAQs and documentation at https://www.r-project.org/.
Bioconductor can be installed by entering the following commands after starting R:
> source("https://bioconductor.org/biocLite.R")
> biocLite()
2.1.2 Installation of R/Bioconductor Packages
Install the R/Bioconductor packages for miRNA microarray data analysis with biocLite(). The packages
are summarized in Table 1 [5–16].
> biocLite(c("Biobase", "GEOquery", "limma", "mclust", "devtools",
+ "GOstats","gplots","networkD3","miRNAtap","miRNAtap.db",
+ "visNetwork","SpidermiR"))
> library("GEOquery")
> gset <- getGEO("GSE54578",GSEMatrix=TRUE,AnnotGPL=FALSE)
MicroRNA Analysis Pipeline 129
Table 1
R packages used in the chapter for miRNA data analysis
3 Methods
3.1 Preprocessing and Normalization
3.1.1 Preprocessing
The original miRNA expression data may contain some “NA” values, and the columns are named with GSM
accessions by default. The data structure and content can be shown with the “head(exprs(gset))” command
(Fig. 1a). In the preprocessing step, we may wish to remove all the “NA” records and rename the columns
in a user-readable format (Fig. 1b).
> head(exprs(gset))
> rmv <- which(apply(exprs(gset),1,function(x) any(is.na(x))))
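The filtering and renaming step can be tried on a small matrix before it is applied to real data. The following is a minimal sketch in base R only; the matrix ex_toy, its values, and the column labels are invented for illustration and stand in for exprs(gset).

```r
# Toy expression matrix standing in for exprs(gset); all values are invented.
ex_toy <- matrix(
  c(5.1, 6.2, NA,
    7.0, 7.3, 6.9,
    NA,  4.4, 4.8,
    8.2, 8.0, 8.5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("probe", 1:4), c("GSM_a", "GSM_b", "GSM_c"))
)

# Row indices of probes with at least one missing value.
rmv <- which(apply(ex_toy, 1, function(x) any(is.na(x))))

# Drop those rows. The guard matters: with an empty rmv,
# ex_toy[-rmv, ] would select no rows instead of all rows.
ex_clean <- if (length(rmv) > 0) ex_toy[-rmv, , drop = FALSE] else ex_toy

# Rename the columns with readable labels (hypothetical group labels).
colnames(ex_clean) <- c("CTRL1", "CTRL2", "SCHIZO1")
print(ex_clean)
```

The explicit length check is a design choice for safety: a data set without any “NA” values would otherwise lose all of its rows.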
Fig. 1 Preprocessing of miRNA microarray data. (a) Raw expression data containing “NA” values. (b) “NA”
filtered expression data. (c) Variance among samples before normalization. (d) Variance among samples after
normalization
3.1.2 Normalization
After preprocessing, the microarray data must be normalized to remove variation from nonbiological
sources. A large number of methods have been proposed to normalize microarray-based transcriptome data.
The methods suit different platforms and are integrated in the packages for the corresponding data
analysis, e.g., the “NormiR” function in the “ExiMiR” package for two-color microarray experiments using
a common reference, similar methods in the “affy” package for single-channel Affymetrix arrays, and the
“normalizeBetweenArrays” function in the “limma” package. In the example, “normalizeBetweenArrays” is
applied with a quantile normalization procedure. The remaining commands inspect the value range of the
data and apply a log2 transformation if the values are not yet on a log scale.
> library("limma")
> ex_norm <- normalizeBetweenArrays(ex)
> qu <- as.numeric(quantile(ex,c(0.,0.25,0.5,0.75,0.99,1.0),na.rm=T))
> filt <- (qu[5]>100 || (qu[6]-qu[1]>50 && qu[2]>0) ||
+ (qu[2]>0 && qu[2]<1 && qu[4]>1 && qu[4]<2))
> if(filt){ex_norm[which(ex<=0)] <- NaN; exprs(gset) <- log2(ex_norm)}
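To make the two ideas in this step concrete, the sketch below implements a simplified quantile normalization and the same log-scale heuristic in base R. It is an illustration under stated assumptions, not the limma implementation: quantile_normalize and needs_log2 are hypothetical helper names, ties are broken by order of appearance, missing values are not handled, and the toy intensities are simulated.

```r
# Simplified quantile normalization: every column is forced onto the same
# reference distribution (the mean of the sorted columns).
quantile_normalize <- function(m) {
  stopifnot(is.matrix(m), nrow(m) > 1, !anyNA(m))
  ranks  <- apply(m, 2, rank, ties.method = "first")  # rank within each sample
  sorted <- apply(m, 2, sort)                         # sorted values per sample
  ref    <- rowMeans(sorted)                          # reference distribution
  out    <- apply(ranks, 2, function(r) ref[r])       # map ranks to reference
  dimnames(out) <- dimnames(m)
  out
}

# Heuristic from the commands above: TRUE suggests that the values are on a
# raw intensity scale and still need a log2 transformation.
needs_log2 <- function(m) {
  qu <- as.numeric(quantile(m, c(0, 0.25, 0.5, 0.75, 0.99, 1), na.rm = TRUE))
  (qu[5] > 100) ||
    (qu[6] - qu[1] > 50 && qu[2] > 0) ||
    (qu[2] > 0 && qu[2] < 1 && qu[4] > 1 && qu[4] < 2)
}

# Simulated raw intensities for three samples with different overall scales.
set.seed(1)
toy <- cbind(s1 = rexp(50, 1 / 200), s2 = rexp(50, 1 / 400), s3 = rexp(50, 1 / 100))

toy_norm <- quantile_normalize(toy)
print(apply(toy_norm, 2, quantile))  # identical quantiles in every column
print(needs_log2(toy))
```

After normalization the samples share one distribution while the ordering of probes within each sample is preserved, which is the property that makes between-array comparison meaningful.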
3.2 Expression Difference and Clustering Analysis
The normalized expression data can be compared directly between groups. The t test is the most
straightforward statistical method for comparing two groups; it measures the significance of a difference
with a p value, the probability of observing a difference at least as large when there is in fact none
(the lower the p value, the more significant the difference). For microarray data, tens of thousands of
genes are compared between groups simultaneously, which creates a massive multiple testing problem.
Matters are further complicated because the measured expression levels do not always follow normal
distributions, and their distributions are neither identical nor independent between genes. To address
this problem and identify the differentially expressed genes more precisely, Smyth proposed an empirical
Bayes moderated t test, which has been incorporated into the “limma” package [10]. An example is shown
as follows, and more details about the usage of “eBayes” can be found in the documentation:
http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/ebayes.html.
The comparison results are stored in objects fit2 and tT, which
will be used for further analysis.
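The multiple testing problem described in this section can be illustrated without any Bioconductor package. The sketch below is an assumption-laden stand-in, not the limma workflow: it uses an ordinary Welch t test per probe (not the empirical Bayes moderated t test), simulated data, and Benjamini-Hochberg adjustment through p.adjust. The column names of the result table are chosen only to resemble a typical differential expression table.

```r
# Simulated log2 expression: 200 probes, 6 control and 6 case samples.
set.seed(42)
n_probe <- 200
ctrl <- matrix(rnorm(n_probe * 6, mean = 8), nrow = n_probe)
case <- matrix(rnorm(n_probe * 6, mean = 8), nrow = n_probe)
case[1:10, ] <- case[1:10, ] + 3          # spike in 10 truly changed probes
rownames(ctrl) <- rownames(case) <- paste0("probe", seq_len(n_probe))

# One ordinary (Welch) t test per probe.
p_raw <- sapply(seq_len(n_probe),
                function(i) t.test(case[i, ], ctrl[i, ])$p.value)

# Difference of group means on the log2 scale, i.e., the log2 fold change.
logFC <- rowMeans(case) - rowMeans(ctrl)

# Benjamini-Hochberg adjustment controls the false discovery rate across
# all probes tested simultaneously.
p_adj <- p.adjust(p_raw, method = "BH")

res <- data.frame(ID = rownames(ctrl), logFC = logFC,
                  P.Value = p_raw, adj.P.Val = p_adj)
res <- res[order(res$P.Value), ]
print(head(res))
```

With only a few samples per group, the per-probe variance estimates in such a test are unstable, which is the motivation for borrowing information across probes in the moderated test.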
Besides the significance measured by the p values, the magnitude of the fold change in miRNA expression
levels is also important to biologists. A volcano plot shows statistical significance and magnitude of
change simultaneously in a two-dimensional plane, plotting the fold change and the log-transformed
p values on the x- and y-axis, respectively (Fig. 2a). The “volcanoplot” function in the “limma” package
can be applied conveniently. Note that the ‘highlight’ argument sets the number of top probe sets to be
highlighted. Other packages such as “ggplot2” can also be used to draw volcano plots.
> volcanoplot(fit2,coef=1,highlight=10)
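For readers who want to see what such a plot contains, the following base R sketch draws a volcano plot from two plain vectors. It is only an illustration: the fold changes and p values are simulated, the variable names are hypothetical, and the figure is written to a temporary PDF file rather than produced by limma.

```r
# Simulated statistics for 300 probes (invented values).
set.seed(7)
logFC <- rnorm(300, sd = 0.8)      # x-axis: log2 fold change
p_val <- runif(300)^2              # raw p values
neg_log10_p <- -log10(p_val)       # y-axis: larger means more significant

# Indices of the 10 probes with the smallest p values.
top <- order(p_val)[1:10]

out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
plot(logFC, neg_log10_p, pch = 16, cex = 0.5, col = "grey40",
     xlab = "log2 fold change", ylab = "-log10(p value)")
points(logFC[top], neg_log10_p[top], pch = 16, col = "red")
text(logFC[top], neg_log10_p[top], labels = top, pos = 3, cex = 0.6)
invisible(dev.off())
print(out_file)
```

Probes of interest sit in the upper left and upper right corners, where both the change and its significance are large.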
[Fig. 2 panels: (a) volcano plot, y-axis -log10p; (b) heat map of miRNA probes against samples CTRL1–CTRL15 and SCHIZO1–SCHIZO15]
Fig. 2 Volcano plot and heat map of miRNA expression data. (a) Volcano plot showing the differentially
expressed miRNAs between disease and control samples. (b) Clustering the samples and genes with
expression patterns of significantly differential miRNAs
3.3 miRNA Target Analysis
3.3.1 Target Identification
miRNA data analysis differs from general transcriptome data analysis mainly in that it requires target
gene analysis. The major activity of miRNAs is to regulate the expression of target genes
posttranscriptionally or translationally, and therefore annotating the target genes of the miRNAs of
interest is important.
There are multiple options to identify the target genes of miRNAs. For example, Brock et al. proposed a
pipeline for miRNA target analysis with the R packages “targetscan.Mm.eg.db”, “microRNA” and
“org.Mm.eg.db”. In the example shown below, the integrated package “SpidermiR” is adopted, which provides
both validated and predicted target genes from multiple databases or software tools including miRWalk
[18], miR2Disease [19], miRTar [20], miRTarBase [21], miRandola [22], Pharmaco-miR [23], DIANA [24],
Miranda [25], PicTar [26], and TargetScan [27]. It can also retrieve and visualize gene networks. The
following commands give an example of target gene determination for some miRNAs of interest, e.g., the
five miRNAs with the most significant expression difference between groups (see Note 1). The potential
targets of these miRNAs are predicted with SpidermiRdownload_miRNAprediction and stored in mirnaTar.
> tT[selected,]$Name[1:5]
> mirna <- c('hsa-miR-4429','hsa-miR-1827','hsa-miR-5002-5p',
+ 'hsa-miR-5187-3p','hsa-miR-4455')
> mirnaTar <- SpidermiRdownload_miRNAprediction(mirna_list=mirna)
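Once a table of predicted miRNA-target pairs is available, a quick summary helps to judge it before network construction. The sketch below uses base R on an invented table. The layout is an assumption: column V2 is taken to hold the target gene, as in the later commands of this chapter, and column V1 is assumed to hold the miRNA. All miRNA and gene names are placeholders.

```r
# Hypothetical prediction table: V1 = miRNA (assumed), V2 = target gene.
mirnaTar_toy <- data.frame(
  V1 = c("miR-A", "miR-A", "miR-A", "miR-B", "miR-B", "miR-C", "miR-C"),
  V2 = c("GENE1", "GENE2", "GENE3", "GENE2", "GENE4", "GENE2", "GENE2"),
  stringsAsFactors = FALSE
)

# Remove duplicated pairs, which can arise when several sources agree.
pairs <- unique(mirnaTar_toy)

# Number of predicted targets per miRNA.
targets_per_mirna <- table(pairs$V1)

# Number of miRNAs predicted to regulate each gene.
n_regulators <- table(pairs$V2)

# Genes targeted by more than one miRNA.
shared <- names(n_regulators)[n_regulators > 1]

print(targets_per_mirna)
print(shared)
```

Genes shared by several miRNAs are natural candidates to inspect first in the downstream network and enrichment analyses.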
3.3.2 Network and Gene Set Enrichment Analysis
Network analysis and visualization can show not only the shared targets of multiple miRNAs, but also the
interactions and pathways among the target genes. Many tools have been developed for network building
and visualization, e.g., Cytoscape, a tool with a user-friendly interface [28], and the R package
SpidermiR [15]. Here, we use Cytoscape to construct the regulatory network between the miRNAs (the 5 most
significant) and their predicted targets (50 for each miRNA), since Cytoscape is straightforward and
particularly useful for network construction with user-customized interactions (Fig. 3a) (see Note 2).
GeneMANIA curates validated and predicted networks between genes from a variety of species [29]. The
network types include coexpression, colocalization, genetic interactions, pathway, physical interactions,
shared protein domains, and predicted interactions. GeneMANIA also provides a web server for network
construction. SpidermiR can download the interaction data from GeneMANIA and visualize the networks
among user-customized genes, but these functions are still being debugged and updated. Here, we directly
use the GeneMANIA prediction server (http://genemania.org/) to construct the pathway network of miRNA
target genes (Fig. 3b) (see Note 3).
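One way to hand miRNA-target pairs to Cytoscape is a simple interaction file (SIF), in which each line holds a source node, an interaction type, and a target node. The sketch below is a hedged illustration in base R: the input table, the interaction label “predicted_target”, and the cap of 2 targets per miRNA are assumptions chosen for the toy data, whereas the text above uses 50 targets per miRNA.

```r
# Hypothetical prediction table: V1 = miRNA (assumed), V2 = target gene.
mirnaTar_toy <- data.frame(
  V1 = rep(c("miR-A", "miR-B"), each = 3),
  V2 = c("GENE1", "GENE2", "GENE3", "GENE2", "GENE4", "GENE5"),
  stringsAsFactors = FALSE
)

# Keep at most max_targets rows per miRNA (the chapter keeps 50).
max_targets <- 2
kept <- do.call(rbind,
                lapply(split(mirnaTar_toy, mirnaTar_toy$V1), head, n = max_targets))

# Three-column edge list: source node, interaction type, target node.
edges <- data.frame(source = kept$V1,
                    interaction = "predicted_target",
                    target = kept$V2,
                    stringsAsFactors = FALSE)

# Write a tab-delimited file without header or quotes.
sif_file <- tempfile(fileext = ".sif")
write.table(edges, file = sif_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
print(readLines(sif_file))
```

The resulting file can then be loaded through the network import function of Cytoscape, where node and edge styles are set interactively.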
Besides network analysis, statistics-based gene set enrichment analysis (GSEA) should be performed for
the miRNAs and miRNA targets, to reveal biological meaning and to increase statistical power by
aggregating the signal across groups of related genes. GOstats and a number of other R/Bioconductor
packages (e.g., GeneAnswers [30]) can perform the enrichment analysis with hypergeometric tests (the
hyperGTest function in GOstats). As an example, we use GOstats to perform GO enrichment analysis
(Biological Process) on the predicted target genes of the top 5 miRNAs (see Note 4).
> library("org.Hs.eg.db")
> library("GSEABase")
> library("GOstats")
> mirTarget <- mirnaTar$V2
> goAnn <- get("org.Hs.egGO")
> universe <- Lkeys(goAnn)
> entrezIDs <- mget(mirTarget, org.Hs.egSYMBOL2EG, ifnotfound=NA)
> entrezIDs <- unique(as.character(na.omit(unlist(entrezIDs))))
Fig. 3 Interaction networks among miRNAs and their targets. (a) Regulatory network between miRNAs and
target genes. (b) Pathway sub-network among the miRNA target genes
> params <- new("KEGGHyperGParams", geneIds=entrezIDs,
+ universeGeneIds=universe,
+ annotation="org.Hs.eg.db",
+ categoryName="KEGG",
+ pvalueCutoff=0.01,
+ testDirection="over")
> keggET <- hyperGTest(params)
> kegg <- summary(keggET)
> library(Category)
> genelist <- geneIdsByCategory(keggET)
> genelist <- sapply(genelist, function(.ids) {
+ .sym <- mget(.ids, envir=org.Hs.egSYMBOL, ifnotfound=NA)
+ .sym[is.na(.sym)] <- .ids[is.na(.sym)]
+ paste(.sym, collapse=";")
+ })
> kegg$Symbols <- genelist[as.character(kegg$KEGGID)]
> head(kegg)
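The hypergeometric over-representation test that underlies this kind of enrichment analysis can be reproduced for a single category with base R. The sketch below is a worked example with invented counts; the variable names are hypothetical, and the comparison with the one-sided Fisher exact test is included only as a consistency check.

```r
# Invented counts for one category.
N <- 20000   # genes in the universe
K <- 150     # universe genes annotated to the category
n <- 400     # predicted target genes (the selected set)
k <- 12      # target genes annotated to the category

# Number of category genes expected among the targets by chance.
expected <- n * K / N

# P(X >= k) when n genes are drawn without replacement from the universe.
p_over <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same test written as a one-sided Fisher exact test on a 2 x 2 table.
tab <- matrix(c(k, n - k, K - k, N - K - (n - k)), nrow = 2,
              dimnames = list(c("in_category", "not_in_category"),
                              c("target", "non_target")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(expected)
print(c(hypergeometric = p_over, fisher = p_fisher))
```

Here 12 annotated genes are observed where 3 are expected, so the category is over-represented at the 0.01 cutoff used in the commands above. In a real analysis one such test is run per category, so the resulting p values also call for multiple testing correction.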
4 Notes
References
1. Kozomara A, Griffiths-Jones S (2014) miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res 42(Database issue):D68–D73. https://doi.org/10.1093/nar/gkt1181
2. McCall MN, Kim MS, Adil M, Patil AH, Lu Y, Mitchell CJ, Leal-Rojas P, Xu J, Kumar M, Dawson VL, Dawson TM, Baras AS, Rosenberg AZ, Arking DE, Burns KH, Pandey A, Halushka MK (2017) Toward the human cellular microRNAome. Genome Res. https://doi.org/10.1101/gr.222067.117
3. Otto T, Candido SV, Pilarz MS, Sicinska E, Bronson RT, Bowden M, Lachowicz IA, Mulry K, Fassl A, Han RC, Jecrois ES, Sicinski P (2017) Cell cycle-targeting microRNAs promote differentiation by enforcing cell-cycle exit. Proc Natl Acad Sci U S A 114(40):10660–10665. https://doi.org/10.1073/pnas.1702914114
4. Gao L, Jiang F (2016) MicroRNA (miRNA) profiling. Methods Mol Biol 1381:151–161
5. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L, Morgan M (2015) Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods 12(2):115–121. https://doi.org/10.1038/nmeth.3252
6. Wickham H, Chang W (2017) devtools: tools to make developing R packages easier. R package version 1.13.3. https://CRAN.R-project.org/package=devtools
7. Falcon S, Gentleman R (2007) Using GOstats to test gene lists for GO term association. Bioinformatics 23(2):257–258
8. Davis S, Meltzer PS (2007) GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics 23(14):1846–1847
9. Warnes GR, Bolker B, Bonebakker L, et al. (2016) gplots: various R programming tools for plotting data. R package version 3.0.1. https://CRAN.R-project.org/package=gplots
10. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43(7):e47
11. Scrucca L, Fop M, Murphy TB, Raftery AE (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 8(1):289–317
12. Pajak M, Simpson TI (2016) miRNAtap: microRNA targets – aggregated predictions. R package version 1.8.0
13. Pajak M, Simpson TI (2016) miRNAtap.db: data for miRNAtap. R package version 0.99.10
14. Allaire JJ, Gandrud C, Russell K, Yetman CJ (2017) networkD3: D3 JavaScript network graphs from R. R package version 0.4. https://CRAN.R-project.org/package=networkD3
15. Cava C, Colaprico A, Bertoli G, Graudenzi A, Silva TC, Olsen C, Noushmehr H, Bontempi G, Mauri G, Castiglioni I (2017) SpidermiR: an R/bioconductor package for integrative analysis with miRNA data. Int J Mol Sci 18(2):E274. https://doi.org/10.3390/ijms18020274
16. Almende BV, Thieurmel B, Robert T (2017) visNetwork: network visualization using ‘vis.js’ library. R package version 2.0.1. https://CRAN.R-project.org/package=visNetwork
17. Zhang F, Xu Y, Shugart YY, Yue W et al (2015) Converging evidence implicates the abnormal microRNA system in schizophrenia. Schizophr Bull 41(3):728–735
18. Dweep H, Sticht C, Pandey P, Gretz N (2011) miRWalk--database: prediction of possible miRNA binding sites by “walking” the genes of three genomes. J Biomed Inform 44(5):839–847
19. Jiang Q, Wang Y, Hao Y, Juan L, Teng M, Zhang X, Li M, Wang G, Liu Y (2009) miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res 37(Database issue):D98–104
20. Hsu JB, Chiu CM, Hsu SD, Huang WY, Chien CH, Lee TY, Huang HD (2011) miRTar: an integrated system for identifying miRNA-target interactions in human. BMC Bioinformatics 12:300. https://doi.org/10.1186/1471-2105-12-300
21. Hsu SD, Lin FM, Wu WY, Liang C, Huang WC, Chan WL, Tsai WT, Chen GZ, Lee CJ, Chiu CM, Chien CH, Wu MC, Huang CY, Tsou AP, Huang HD (2011) miRTarBase: a database curates experimentally validated microRNA-target interactions. Nucleic Acids Res 39(Database issue):D163–D169. https://doi.org/10.1093/nar/gkq1107
22. Russo F, Di Bella S, Nigita G, Macca V, Laganà A, Giugno R, Pulvirenti A, Ferro A (2012) miRandola: extracellular circulating microRNAs database. PLoS One 7(10):e47786. https://doi.org/10.1371/journal.pone.0047786
23. Rukov JL, Wilentzik R, Jaffe I, Vinther J, Shomron N (2014) Pharmaco-miR: linking microRNAs and drug effects. Brief Bioinform 15(4):648–659. https://doi.org/10.1093/bib/bbs082
24. Maragkakis M, Reczko M, Simossis VA, Alexiou P, Papadopoulos GL, Dalamagas T, Giannopoulos G, Goumas G, Koukis E, Kourtis K, Vergoulis T, Koziris N, Sellis T, Tsanakas P, Hatzigeorgiou AG (2009) DIANA-microT web server: elucidating microRNA functions through target prediction. Nucleic Acids Res 37(Web Server issue):W273–W276
25. John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS (2004) Human MicroRNA targets. PLoS Biol 2(11):e363
26. Krek A, Grün D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, MacMenamin P, da Piedade I, Gunsalus KC, Stoffel M, Rajewsky N (2005) Combinatorial microRNA target predictions. Nat Genet 37(5):495–500
27. Agarwal V, Bell GW, Nam J, Bartel DP (2015) Predicting effective microRNA target sites in mammalian mRNAs. eLife 4:e05005
28. Saito R, Smoot ME, Ono K, Ruscheinski J, Wang PL, Lotia S, Pico AR, Bader GD, Ideker T (2012) A travel guide to Cytoscape plugins. Nat Methods 9(11):1069–1076. https://doi.org/10.1038/nmeth.2212
29. Montojo J, Zuberi K, Rodriguez H, Bader GD, Morris Q (2014) GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 3(153). https://doi.org/10.12688/f1000research.4572.1. eCollection 2014
30. Feng G, Shaw P, Rosen ST, Lin SM, Kibbe WA (2012) Using the bioconductor GeneAnswers package to interpret gene lists. Methods Mol Biol 802:101–112. https://doi.org/10.1007/978-1-61779-400-1_7