R for NGS Analysis and Bioconductor

The document discusses the application of R and Bioconductor in next-generation sequencing (NGS) analysis, highlighting various tools and packages for RNA-seq, ChIP-seq, and SNP-seq. It emphasizes the importance of R as a programming language in bioinformatics and outlines the capabilities of Bioconductor for genomic data analysis. Key topics include data import/export, differential expression analysis, and the integration of biological metadata.

Uploaded by

azhagar_ss

R & NGS

Dr. G. Ramesh Kumar, PhD,
AU-KBC Research Centre,
MIT, Anna University,
Chromepet, Chennai-44.
UNIT III
• Application of R in NGS analysis:
• 5 TOPICS
• Introduction to Bioconductor GR
• Reading RNA-seq data (ShortRead,
Rsamtools, GenomicRanges),
• Annotation (biomaRt, genomeIntervals),
• Read coverage and count assignment (IRanges,
GenomicFeatures),
• Differential expression (DESeq).
REF

• [Link]
home/ht-seq#R_BACK
Application of R in NGS analysis
• R and Bioconductor packages are central to many applications in:
• Genome annotation and
• NGS analysis areas, such as
• RNA-Seq,
• ChIP-Seq and
• SNP-Seq.
Application of R in NGS analysis

• Seq2pathway: an R/Bioconductor package for
pathway analysis of next-generation
sequencing data
R
• In recent years the R language has become the
lingua franca of data-intensive research, and is
now by far the most widely used data analysis
programming language in bioinformatics.
• One of the outstanding strengths of the R
language is the ease of programming extensions
to automate the analysis and mining of almost
any data type.
R

• The following topics will be introduced:
• (1) conditional executions,
• (2) loops,
• (3) writing custom functions,
• (4) calling external software,
• (5) running and debugging R programs, and
• (6) building custom R packages.
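The first four topics can be sketched in a few lines of base R. This is only an illustration: the function names `classify_read` and `gc_content`, the read lengths and the 50-base cutoff are invented for the example, and the "external software" called here is simply the Rscript program that ships with R.

```r
# (1) Conditional execution: label a read by its length (the cutoff is an assumption).
classify_read <- function(len, cutoff = 50) {
  if (len >= cutoff) "long" else "short"
}

# (2) Loop: apply the classifier to several hypothetical read lengths.
read_lengths <- c(36, 50, 101)
labels <- character(length(read_lengths))
for (i in seq_along(read_lengths)) {
  labels[i] <- classify_read(read_lengths[i])
}

# (3) Custom function: fraction of G and C bases in a DNA string.
gc_content <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  sum(bases %in% c("G", "C")) / length(bases)
}

# (4) Calling external software: run Rscript as a separate process
#     and capture what it prints.
rscript <- file.path(R.home("bin"), "Rscript")
rscript_out <- system2(rscript, c("-e", shQuote("cat(1 + 1)")), stdout = TRUE)

print(labels)
print(gc_content("ATGCGC"))
print(rscript_out)
```

In a real pipeline the `system2()` call would typically launch an aligner or another command-line tool instead of Rscript.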
R
• R ([Link]) is a versatile data
analysis environment with a broad
application spectrum across the experimental and
quantitative sciences.
• The associated Bioconductor project provides
access to over 700 R extension packages for
the analysis of modern biological and
biomedical data sets, covering areas such as next-generation
sequencing, comparative genomics, network
modeling and statistical analysis.
R
• The R software is free and runs on all common operating
systems.

• The following topics will be covered:
• (1) command syntax,
• (2) basic functions,
• (3) data import/export,
• (4) data/object types,
• (5) graphical display,
• (6) usage of R packages/libraries (e.g. Bioconductor) and
• (7) using R for basic data analysis operations.
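Several of these topics (data import/export, object types, graphical display and a basic analysis step) can be tried out with base R alone. The sketch below uses a small made-up expression table; the gene names, counts and the fold-change threshold are assumptions for illustration, not data from the course.

```r
# A small, invented expression table (data/object types: a data frame).
expr <- data.frame(
  gene    = c("geneA", "geneB", "geneC"),
  control = c(10, 200, 35),
  treated = c(40, 210, 5),
  stringsAsFactors = FALSE
)

# Data export and import through a temporary CSV file.
csv_path <- tempfile(fileext = ".csv")
write.csv(expr, csv_path, row.names = FALSE)
expr_in <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect object types.
str(expr_in)

# Basic analysis: log2 fold change and the genes that more than double.
expr_in$log2fc <- log2(expr_in$treated / expr_in$control)
up <- expr_in$gene[expr_in$log2fc > 1]

# Graphical display: write a bar plot to a temporary PDF file.
plot_path <- tempfile(fileext = ".pdf")
pdf(plot_path)
barplot(expr_in$log2fc, names.arg = expr_in$gene, ylab = "log2 fold change")
invisible(dev.off())

print(up)
```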
Bioconductor
• [Link]
• Bioconductor is a free, open source and open
development software project for the analysis and
comprehension of genomic data generated by wet
lab experiments in molecular biology.
• [Link]
• Bioconductor provides tools for the analysis and
comprehension of high-throughput genomic data.
• Bioconductor uses the R statistical programming
language, and is open source and open
development.
Why Open Source
• so that you can find out what algorithm is being
used, and how it is being used
• so that you can modify these algorithms to try
out new ideas or to accommodate local
conditions or needs
• so you can read the code, find bugs, suggest
improvements etc.
• so that they can be used as components
(potentially modified) in other people's software
Overview
• biology is a computational science
• problems of data analysis, data generation,
reproducibility require computational support and
computational solutions
• we value code reuse
– many of the tasks have already been solved
– if we use those solutions we can put effort into new
research
• well designed, self-describing data structures help us
deal with complex data
Goals
• Provide access to powerful statistical and graphical methods
for the analysis of genomic data.
• Facilitate the integration of biological metadata (GenBank,
GO, Entrez Gene, PubMed) in the analysis of experimental
data.
• Allow the rapid development of extensible, interoperable, and
scalable software.
• Promote high-quality documentation and reproducible
research.
• Provide training in computational and statistical methods.
Bioconductor packages
Release 2.10, 554 Software Packages!
• General infrastructure
Biobase, Biostrings, biocViews
• Annotation:
annotate, annaffy, biomaRt, AnnotationDbi, data packages.
• Graphics/GUIs:
geneplotter, hexbin, limmaGUI, exploRase
• Pre-processing:
affy, affycomp, oligo, makecdfenv, vsn, gcrma, limma
• Differential gene expression:
genefilter, limma, ROC, siggenes, EBarrays, factDesign
• GSEA/Hypergeometric Testing
GSEABase, Category, GOstats, topGO
• Graphs and networks:
graph, RBGL, Rgraphviz
• Flow Cytometry:
flowCore, flowViz, flowUtils
• Protein Interactions:
ppiData, ppiStats, ScISI, Rintact
• Sequence Data:
Biostrings, ShortRead, rtracklayer, IRanges, GenomicFeatures,
VariantAnnotation
• Other data:
xcms, DNAcopy, PROcess, aCGH, rsbml, SBMLR, Rdisop
Component software

• interesting problems will require the
coordinated application of many
different techniques
• thus we need integrated interoperable
software
• of primary importance is well designed
and shared data structures
Data complexity
• Dimensionality.
• Dynamic/evolving data: e.g., gene annotation, sequence,
literature.
• Multiple data sources and locations: in-house, WWW.
• Multiple data types: numeric, textual, graphical.
No longer just a single n × p data matrix X!
We distinguish between biological metadata and
experimental metadata.
Experimental metadata

• when were the samples processed
and how
• what arrays were used/what kits
• if size selection of some sort (e.g.
fractionation for proteomics
experiments) was used
• date the samples were run
• lane or chip information
• treatments
Biological metadata
• Biological attributes that can be applied to the
experimental data.
• E.g. for genes
– chromosomal location;
– gene annotation (Entrez Gene, GO);
– gene models
– relevant literature (PubMed)
• Biological metadata sets are large, evolving rapidly, and
typically distributed via the WWW.
• Tools: the annotate, biomaRt,
AnnotationDbi, and GenomicFeatures packages,
plus annotation data packages.
Annotation packages
annotate, annaffy, biomaRt, and AnnotationDbi
• Assemble and process genomic
annotation data from public
repositories.
• Build annotation data packages.
• Associate experimental data in
real time to biological metadata
from web databases such as
GenBank, GO, KEGG, Entrez
Gene, and PubMed.
• Process and store query results:
e.g., search PubMed abstracts.
• Generate HTML reports of
analyses.
Metadata package hgu95av2: mappings
between different gene IDs for this chip.
Example mappings for AffyID 41046_s_at:
– GENENAME: zinc finger protein 261
– ENTREZID: 9203
– ACCNUM: X95808
– MAP: Xq13.1
– SYMBOL: ZNF261
– PMID: 10486218, 9205841, 8817323
– GO: GO:0003677, GO:0007275, GO:0016021
+ many other mappings
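The idea of mapping one identifier to another can be illustrated with a plain data frame holding the values from the slide. This is a minimal sketch in base R, not the interface of the hgu95av2 metadata package or of AnnotationDbi; the table `annotation` and the helper `lookup` are hypothetical names.

```r
# One-row lookup table built from the example mappings on the slide.
annotation <- data.frame(
  AffyID   = "41046_s_at",
  SYMBOL   = "ZNF261",
  ENTREZID = "9203",
  ACCNUM   = "X95808",
  MAP      = "Xq13.1",
  stringsAsFactors = FALSE
)

# Hypothetical helper: return one annotation field for a vector of probe IDs.
# Probe IDs that are not in the table yield NA.
lookup <- function(probe_ids, field) {
  annotation[match(probe_ids, annotation$AffyID), field]
}

print(lookup("41046_s_at", "SYMBOL"))
print(lookup(c("41046_s_at", "unknown_probe"), "MAP"))
```

Returning NA for unknown probes keeps the result aligned with the input vector, which matters when the IDs come from the rows of an expression matrix.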
Sequence Annotation
• for a given gene:
– gene models
– sequence
– exon/intron boundaries
– location
– conservation
• often in the form of tracks
• it is important to keep track of the reference
genome being used
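Exon/intron boundaries and read locations are interval data, so questions such as "where are the introns?" or "how many reads fall on exons?" reduce to interval arithmetic. The sketch below does this in base R with invented coordinates (1-based, closed intervals on one strand of one chromosome); in practice packages such as IRanges and GenomicFeatures handle these operations.

```r
# Hypothetical gene model: three exons.
exons <- data.frame(start = c(100, 300, 600), end = c(200, 400, 700))

# Introns are the gaps between consecutive exons.
introns <- data.frame(
  start = head(exons$end, -1) + 1,
  end   = tail(exons$start, -1) - 1
)

# Hypothetical aligned reads of width 36.
read_starts <- c(90, 150, 250, 390, 650)
read_ends   <- read_starts + 36 - 1

# A read overlaps an exon if the two closed intervals intersect.
overlaps_exon <- function(s, e) any(s <= exons$end & e >= exons$start)
hits <- mapply(overlaps_exon, read_starts, read_ends)

# Number of reads assigned to the gene's exons.
exon_read_count <- sum(hits)

print(introns)
print(exon_read_count)
```

The coordinates only make sense relative to one reference genome build, which is why the build has to be recorded alongside the annotation.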
Vignettes
• Bioconductor developed a new documentation
paradigm, the vignette.
• A vignette is an executable document consisting of a
collection of documentation text and code chunks.
• Vignettes form dynamic, integrated, and reproducible
statistical documents that can be automatically
updated if either data or analyses are changed.
• Vignettes can be generated using the Sweave
function, which in current R versions is part of the utils package.
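As a small illustration of text mixed with code chunks, the sketch below writes a tiny Sweave source file to a temporary location and weaves it into LaTeX. The file contents and the chunk label `mean-chunk` are made up for the example; only the weaving step is shown, so no LaTeX installation is assumed.

```r
# Write a minimal Sweave (.Rnw) document: LaTeX text plus one R code chunk.
rnw <- tempfile(fileext = ".Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of the sample is computed below.",
  "<<mean-chunk, echo=TRUE>>=",
  "x <- c(2, 4, 6)",
  "mean(x)",
  "@",
  "\\end{document}"
), rnw)

# Sweave writes its .tex output into the working directory,
# so switch to a temporary directory for the duration of the call.
out_dir <- tempdir()
old_wd <- setwd(out_dir)
utils::Sweave(rnw, quiet = TRUE)
setwd(old_wd)

# The woven file contains both the code and its evaluated output.
tex_path <- file.path(out_dir, sub("\\.Rnw$", ".tex", basename(rnw)))
tex_lines <- readLines(tex_path)
print(tex_lines)
```

If the data or the code in the chunk changes, running Sweave again regenerates the document with the updated results.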
Bioconductor Software

• concentrate development resources on a few
important aspects
• Biobase: core classes and definitions that allow for
succinct description and handling of the data
• annotate: generic functions for annotation that can be
specialized
• genefilter/limma/DESeq/DEXSeq: differential
expression
• ShortRead/IRanges/GenomicFeatures/
VariantAnnotation: string manipulations, sequence
analysis
Quality Assessment
• ensuring that the data are of sufficient quality
is an essential first step
• arrayQualityMetrics: comprehensive quality
assessment of microarrays (one color or two
color)
– modifications are coming to make it more suitable
for sequence data
• ShortRead: tools for QA of short reads,
primarily Illumina
Biobase: ExpressionSet
• software should help organize and manipulate your
data
• the data need to be assembled correctly once, and
then they can be processed, subset, etc., without
worrying about them
• we developed the ExpressionSet class
• SummarizedExperiment class is the next iteration in
this process (introduced in the GenomicRanges package,
now provided by the SummarizedExperiment package)
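The point of such a class is that expression values and sample metadata are checked once and then always subset together. The sketch below imitates that idea with a plain list in base R; it is not the ExpressionSet class itself, and `make_eset`, `subset_samples` and the toy data are hypothetical.

```r
# Build a container after checking that the expression columns and the
# sample metadata rows describe the same samples in the same order.
make_eset <- function(exprs, pheno) {
  stopifnot(identical(colnames(exprs), rownames(pheno)))
  list(exprs = exprs, pheno = pheno)
}

# Subset samples in both components at once so they cannot drift apart.
subset_samples <- function(eset, keep) {
  make_eset(
    eset$exprs[, keep, drop = FALSE],
    eset$pheno[keep, , drop = FALSE]
  )
}

# Toy data: two genes measured in three samples.
exprs <- matrix(1:6, nrow = 2,
                dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))
pheno <- data.frame(treatment = c("ctrl", "drug", "drug"),
                    row.names = c("s1", "s2", "s3"))

eset <- make_eset(exprs, pheno)
drug <- subset_samples(eset, eset$pheno$treatment == "drug")

print(colnames(drug$exprs))
print(drug$pheno)
```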
Microarray data analysis
[Figure: workflow of Bioconductor and CRAN packages for microarray data analysis]
• Pre-processing: affy, vsn (from CEL, CDF files); marray, limma, vsn (from .gpr, .Spot files)
• Central data structure: ExpressionSet
• Differential expression: edd, genefilter, limma, multtest, ROC, + CRAN
• Graphs & networks: graph, RBGL, Rgraphviz
• Cluster analysis: CRAN packages class, cluster, MASS, mva
• Prediction: CRAN packages class, e1071, ipred, LogitBoost, MASS, nnet, randomForest, rpart
• Annotation: annotate, annaffy, biomaRt, + metadata packages
• Graphics: geneplotter, hexbin, + CRAN
Differential Expression
• limma: provides a linear models interface for
DE
– uses a moderated variance
– a variety of p-value correction methods are
provided
• DESeq and edgeR: for sequence data
– similar approach to limma
– make use of count data (negative binomial distribution)
• DEXSeq for exon level differential expression
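Two ideas from this slide can be tried in base R without any of the packages named above: p-value correction for multiple testing, and the extra variance of negative binomial counts. The p-values, mean and size parameter below are invented for illustration; this is not how limma, DESeq or edgeR are called.

```r
# Multiple-testing correction with stats::p.adjust.
p_raw  <- c(0.001, 0.01, 0.02, 0.04, 0.5)
p_bh   <- p.adjust(p_raw, method = "BH")          # Benjamini-Hochberg
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # Bonferroni

# Negative binomial counts are overdispersed relative to Poisson counts:
# variance = mu + mu^2 / size, compared with variance = mu for Poisson.
set.seed(1)
mu   <- 100
size <- 5
counts <- rnbinom(10000, mu = mu, size = size)
theoretical_var <- mu + mu^2 / size

print(rbind(raw = p_raw, BH = p_bh, bonferroni = p_bonf))
print(c(mean = mean(counts), variance = var(counts),
        theoretical_variance = theoretical_var))
```

The sample variance comes out far above the sample mean, which is the feature of sequencing counts that a Poisson model would miss.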
Machine Learning
• Software for machine learning has been written by many
different people
– the calling sequences and return values are unique to each
method
• MLInterfaces provides uniform calling sequences and return values for
all machine learning algorithms
• MLearn is the main wrapper function
– methods, e.g. knnI, are passed to the wrapper
• return values are of class MLOutput
• see the MLInterfaces vignette for more details
Publications
• Bioconductor: Open software development for
computational biology and bioinformatics, Genome
Biology 2004, 5:R80,
[Link]
• Bioinformatics and Computational Biology Solutions
using R and Bioconductor, Springer, 2005, R.
Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit
eds.
• Bioconductor Case Studies, Springer
• R Programming for Bioinformatics, Chapman Hall
Comprehensive R Archive Network
• CRAN is a network of FTP and web servers
around the world that store identical, up-to-date
versions of code and documentation for
R.
• [Link]
