Bioinformatics Toolbox™ User's Guide PDF
User’s Guide
R2014a
How to Contact MathWorks
www.mathworks.com Web
comp.soft-sys.matlab Newsgroup
www.mathworks.com/contact_TS.html Technical Support
508-647-7000 (Phone)
508-647-7001 (Fax)
Trademarks
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See
www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand
names may be trademarks or registered trademarks of their respective holders.
Patents
MathWorks products are protected by one or more U.S. patents. Please see
www.mathworks.com/patents for more information.
Revision History
September 2003 Online only New for Version 1.0 (Release 13SP1+)
June 2004 Online only Revised for Version 1.1 (Release 14)
November 2004 Online only Revised for Version 2.0 (Release 14SP1+)
March 2005 Online only Revised for Version 2.0.1 (Release 14SP2)
May 2005 Online only Revised for Version 2.1 (Release 14SP2+)
September 2005 Online only Revised for Version 2.1.1 (Release 14SP3)
November 2005 Online only Revised for Version 2.2 (Release 14SP3+)
March 2006 Online only Revised for Version 2.2.1 (Release 2006a)
May 2006 Online only Revised for Version 2.3 (Release 2006a+)
September 2006 Online only Revised for Version 2.4 (Release 2006b)
March 2007 Online only Revised for Version 2.5 (Release 2007a)
April 2007 Online only Revised for Version 2.6 (Release 2007a+)
September 2007 Online only Revised for Version 3.0 (Release 2007b)
March 2008 Online only Revised for Version 3.1 (Release 2008a)
October 2008 Online only Revised for Version 3.2 (Release 2008b)
March 2009 Online only Revised for Version 3.3 (Release 2009a)
September 2009 Online only Revised for Version 3.4 (Release 2009b)
March 2010 Online only Revised for Version 3.5 (Release 2010a)
September 2010 Online only Revised for Version 3.6 (Release 2010b)
April 2011 Online only Revised for Version 3.7 (Release 2011a)
September 2011 Online only Revised for Version 4.0 (Release 2011b)
March 2012 Online only Revised for Version 4.1 (Release 2012a)
September 2012 Online only Revised for Version 4.2 (Release 2012b)
March 2013 Online only Revised for Version 4.3 (Release 2013a)
September 2013 Online only Revised for Version 4.3.1 (Release 2013b)
March 2014 Online only Revised for Version 4.4 (Release 2014a)
Contents
1 Getting Started
Bioinformatics Toolbox Product Description
Key Features
Installation
Installing
Required Software
Optional Software
Editing Formulas to Run the Example on a Subset of the Data
Using the Spreadsheet Link EX Interface to Interact With the Data in MATLAB
Construct an Annotation Object
Retrieve General Information from an Annotation Object
Access Data in an Annotation Object
Use Feature Annotations with Short-Read Sequence Data
3 Sequence Analysis
Exploring a Nucleotide Sequence Using Command Line
Overview of Example
Searching the Web for Sequence Information
Reading Sequence Information from the Web
Determining Nucleotide Composition
Determining Codon Composition
Open Reading Frames
Amino Acid Conversion and Composition
4 Microarray Analysis
Managing Gene Expression Data in Objects
Visualizing Microarray Images
Overview of the Mouse Example
Exploring the Microarray Data Set
Spatial Images of Microarray Data
Statistics of the Microarrays
Scatter Plots of Microarray Data
5 Phylogenetic Analysis
Overview of Phylogenetic Analysis
1
Getting Started
Key Features
• Next Generation Sequencing analysis and browser
• Sequence analysis and visualization, including pairwise and multiple
sequence alignment and peak detection
• Microarray data analysis, including reading, filtering, normalizing, and
visualization
• Mass spectrometry analysis, including preprocessing, classification, and
marker identification
• Phylogenetic tree analysis
• Graph theory functions, including interaction maps, hierarchy plots, and
pathways
• Data import from genomic, proteomic, and gene expression files, including
SAM, FASTA, CEL, and CDF, and from databases such as NCBI and
GenBank
Product Overview
In this section...
“Features”
“Expected Users”
Features
The Bioinformatics Toolbox product extends the MATLAB® environment
to provide an integrated software environment for genome and proteome
analysis. Scientists and engineers can answer questions, solve problems,
prototype new algorithms, and build applications for drug discovery and
design, genetic engineering, and biological research. An introduction to these
features will help you to develop a conceptual model for working with the
toolbox and your biological data.
You can use the basic bioinformatic functions provided with this toolbox
to create more complex algorithms and applications. These robust,
well-tested functions are ones that you would otherwise have to
create yourself.
Expected Users
The Bioinformatics Toolbox product is intended for computational biologists
and research scientists who need to develop new algorithms or implement
published ones, visualize results, and create standalone applications.
Installation
In this section...
“Installing”
“Required Software”
“Optional Software”
Installing
Install the Bioinformatics Toolbox software from a DVD or Web release
using the MathWorks® Installer. For more information, see the installation
documentation.
Required Software
The Bioinformatics Toolbox software requires the following MathWorks
products to be installed on your computer.
Optional Software
The MATLAB and Bioinformatics Toolbox software environment is open and
extensible. In this environment you can interactively explore ideas, prototype
new algorithms, and develop complete solutions to problems in bioinformatics.
MATLAB facilitates computation, visualization, prototyping, and deployment.
Using the Bioinformatics Toolbox software with other MATLAB toolboxes and
products lets you do advanced algorithm development and solve
multidisciplinary problems.
Features and Functions
from the NCBI Gene Expression Omnibus (GEO) Web site by using a single
function (getgeodata).
Gene Ontology database — Load the database from the Web into
a gene ontology object (geneont.geneont). Select sections of the
ontology with methods for the geneont object (geneont.getancestors,
geneont.getdescendants, geneont.getmatrix, geneont.getrelatives),
and manipulate data with utility functions (goannotread, num2goid).
Writing data formats — The functions for getting data from the Web
include the option to save the data to a file. In addition, there is a function to
write data to a file using the FASTA format (fastawrite).
Sequence Alignments
You can select from a list of analysis methods to compare nucleotide or amino
acid sequences using pairwise or multiple sequence alignment functions.
Phylogenetic Analysis
You can use functions for phylogenetic tree building and analysis. There is
also a GUI to draw phylograms (trees).
GPR files (gprread) and GAL files (galread). Get Gene Expression Omnibus
(GEO) data from the Web (getgeodata) and read GEO data from files
(geosoftread).
Note For more information on creating and using DataMatrix objects, see
“Representing Expression Data Values in DataMatrix Objects” on page 4-5.
Reading raw data — Load raw mass/charge and ion intensity data from
comma-separated-value (CSV) files, or read a JCAMP-DX-formatted file with
mass spectrometry data (jcampread) into the MATLAB environment.
If your data is in TXT files, use the importdata function instead.
Spectrum analysis — Load spectra into a GUI (msviewer) for selecting mass
peaks and further analysis.
The following graphic illustrates the roles of the various mass spectrometry
functions in the toolbox.
[Figure: roles of the mass spectrometry functions. Labels recovered from the graphic: mzXML File, mzxmlread, mzXML Structure, mzxml2peaks, mspeaks, msppresample, msresample, msheatmap, msviewer, Raw Data, Reconstructed Data, Semicontinuous Signal, Mass Spectra Viewer, Plot.]
Graph algorithms do not pretest for graph properties because such tests
can introduce a time penalty. For example, there is an efficient shortest
path algorithm for directed acyclic graphs (DAGs); however, testing whether
a graph is acyclic is expensive compared to running the algorithm. Therefore,
it is important to select a graph theory function and properties appropriate
for the type of graph represented by your input matrix. If the algorithm
receives a graph type that differs from what it expects, it will either:
Graph Visualization
The toolbox includes functions, objects, and methods for creating, viewing,
and manipulating graphs such as interaction maps, hierarchy plots, and
pathways. These tools let you view relationships between data.
The object constructor function (biograph) lets you create a biograph object to
hold graph data. Methods of the biograph object let you calculate the position
of nodes (dolayout), draw the graph (view), get handles to the nodes and
edges (getnodesbyid and getedgesbynodeid) to further query information,
and find relations between the nodes (getancestors, getdescendants,
and getrelatives). There are also methods that apply basic graph theory
algorithms to the biograph object.
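As a rough illustration of the constructor and methods named above, the following sketch builds a small directed graph. The connection matrix and node IDs are made up for illustration, and the call to view is left commented out because it opens a figure window.

```matlab
% Toy connection matrix (hypothetical data): A->B, A->C, B->D
cm = sparse([1 1 2], [2 3 4], 1, 4, 4);
ids = {'A', 'B', 'C', 'D'};

bg = biograph(cm, ids);                 % create the biograph object
dolayout(bg);                           % calculate node positions

nodeA = getnodesbyid(bg, 'A');          % handle to the node with ID 'A'
descendants = getdescendants(nodeA);    % node A and its direct descendants

% view(bg)                              % uncomment to draw the graph
```

Storing the connections in a sparse matrix keeps the example close to how larger graphs are usually represented, where most node pairs are not connected.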
The toolbox provides functions that build on the classification and statistical
learning tools in the Statistics Toolbox software (classify, kmeans, and
treefit).
Data Visualization
You can visually compare pairwise sequence alignments, multiply aligned
sequences, and gene expression data from microarrays, and you can plot
nucleic acid and protein characteristics. The 2-D and volume visualization
features let you create custom graphical representations of multidimensional
data sets. You can also create montages and overlays, and export finished
graphics to an Adobe® PostScript® image file or copy them directly into
Microsoft PowerPoint®.
• Share algorithms with other users — You can share data analysis
algorithms created in the MATLAB language across all supported
platforms by giving files to other users. You can also create GUIs within the
MATLAB environment using the Graphical User Interface Development
Environment (GUIDE).
• Deploy MATLAB GUIs — Create a GUI within the MATLAB
environment using GUIDE, and then use MATLAB Compiler software
to create a standalone GUI application that runs separately from the
MATLAB environment.
• Create dynamic link libraries (DLLs) — Use MATLAB Compiler
software to create DLLs for your functions, and then link these libraries to
other programming environments such as C and C++.
• Create COM objects — Use MATLAB Builder NE software to create
COM objects, and then use a COM-compatible programming environment
(Visual Basic®) to create a standalone application.
• Create Excel add-ins — Use MATLAB Builder EX software to
create Excel add-in functions, and then use these functions with Excel
spreadsheets.
• Create Java classes — Use MATLAB Builder JA software to
automatically generate Java classes from algorithms written in the
MATLAB programming language. You can run these classes outside the
MATLAB environment.
The Excel file used in the following example contains data from DeRisi, J.L.,
Iyer, V.R., and Brown, P.O. (Oct. 24, 1997). Exploring the metabolic and
genetic control of gene expression on a genomic scale. Science 278(5338),
680–686. PMID: 9381177. The data was filtered using the steps described
in Gene Expression Profile Analysis.
Exchange Bioinformatic Data Between Excel® and MATLAB®
5 From Excel, open the following file provided with the Bioinformatics
Toolbox software:
matlabroot\toolbox\bioinfo\biodemos\Filtered_Yeastdata.xlsm
6 In the Excel software, enable macros. Click the Developer tab, and then
select Macro Security from the Code group. (If the Developer tab is not
displayed on the Excel ribbon, consult Excel Help to display it.)
Tip To view a cell’s formula, select the cell, and then view the formula in
the formula bar at the top of the Excel window.
2 Execute the formulas in cells J5, J6, J7, and J12 by selecting each cell,
pressing F2, and then pressing Enter.
Each of the first three cells contains a formula using the Spreadsheet Link
EX function MLPutMatrix, which creates a MATLAB variable from the
data in the spreadsheet. Cell J12 contains a formula using the Spreadsheet
Link EX function MLEvalString, which runs the Bioinformatics Toolbox
clustergram function using the three variables as input. For more
information on adding formulas using Spreadsheet Link EX functions,
see “Enter Functions into Worksheet Cells” in the Spreadsheet Link EX
documentation.
3 Note that cell J17 contains a formula using a macro function Clustergram,
which was created in the Visual Basic Editor. Running this macro does the
same as the formulas in cells J5, J6, J7, and J12. Optionally, view the
Clustergram macro function by clicking the Developer tab, and then
clicking the Visual Basic button. (If the Developer tab is not on the
Excel ribbon, consult Excel Help to display it.)
For more information on creating macros using Visual Basic Editor, see
“Use Spreadsheet Link EX Functions in Macros” in the Spreadsheet Link
EX documentation.
4 Execute the formula in cell J17 to analyze and visualize the data:
b Press F2.
c Press Enter.
b Select cell J6, then press F2 to display the formula for editing. Change
A617 to A33, and then press Enter.
2 Run the formulas in cells J5, J6, J7, and J12 to analyze and visualize
a subset of the data:
a Select cell J5, press F2, and then press Enter.
3 Type YAGenes for the variable name, and then click OK.
Note Make sure you use the ' (transpose) symbol when plotting the data
in this step. You need to transpose the data in YAGenes so that it plots as
three genes over seven time intervals.
6 Select cell J20, and then, from the MATLAB group, select Get
MATLAB figure.
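The transpose requirement in the note above can be checked with a small sketch. The matrix below is a made-up stand-in for YAGenes (three genes in rows, seven time intervals in columns); the real variable comes from the spreadsheet.

```matlab
% Hypothetical stand-in for YAGenes: 3 genes (rows) x 7 time intervals (columns)
YAGenes = [0.1 0.4 0.9 1.3  1.1  0.7  0.2;
           0.0 0.2 0.5 0.8  1.2  1.5  1.6;
           1.0 0.8 0.5 0.1 -0.2 -0.4 -0.5];

% plot draws one line per column of its input, so transposing the matrix
% gives one line per gene across the seven time intervals
h = plot(YAGenes');
xlabel('Time interval');
ylabel('Expression value');
```

Without the transpose, plot would draw seven lines of three points each, one per time interval.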
Get Information from Web Database
Specifically, this function will take one or more search terms, submit them
to the PubMed database for a search, then return a MATLAB structure or
structure array, with each structure containing information for an article
found by the search. The returned information will include a PubMed
identifier, publication date, title, abstract, authors, and citation.
The function will also include property name/property value pairs that let
the user of the function limit the search by publication date and limit the
number of records returned.
1 From MATLAB, open the MATLAB Editor by selecting File > New >
Function.
2 Define the getpubmed function, its input arguments, and return values by
typing:
3 Add code to do some basic error checking for the required input SEARCHTERM.
4 Create variables for the two property name/property value pairs, and set
their default values.
5 Add code to parse the two property name/property value pairs if provided
as input.
end
end
6 You access the PubMed database through a search URL, which submits
a search term and options, and then returns the search results in a
specified format. This search URL consists of a base URL and defined
parameters. Create a variable containing the base URL of the PubMed
database on the NCBI Web site.
8 Create a variable containing the search URL from the variables created
in the previous steps.
9 Use the urlread function to submit the search URL, retrieve the search
results, and return the results (as text in the MEDLINE report type) in
medlineText, a character array.
medlineText = urlread(searchURL);
10 Use the MATLAB regexp function and regular expressions to parse and
extract the information in medlineText into hits, a cell array, where each
cell contains the MEDLINE-formatted text for one article. The first input
is the character array to search, the second input is a search expression,
which tells the regexp function to find all records that start with PMID-,
while the third input, 'match', tells the regexp function to return the
actual records, rather than the positions of the records.
hits = regexp(medlineText,'PMID-.*?(?=PMID|</pre>$)','match');
pmstruct = struct('PubMedID','','PublicationDate','','Title','',...
'Abstract','','Authors','','Citation','');
12 Use the MATLAB regexp function and regular expressions to loop through
each article in hits and extract the PubMed ID, publication date, title,
abstract, authors, and citation into pmstruct.
% MEDLINE field tags are padded to four characters, hence the two spaces
% before the hyphen in tags such as 'DP  - '.
for n = 1:numel(hits)
    pmstruct(n).PubMedID = regexp(hits{n},'(?<=PMID- ).*?(?=\n)','match', 'once');
    pmstruct(n).PublicationDate = regexp(hits{n},'(?<=DP  - ).*?(?=\n)','match', 'once');
    pmstruct(n).Title = regexp(hits{n},'(?<=TI  - ).*?(?=PG  -|AB  -)','match', 'once');
    pmstruct(n).Abstract = regexp(hits{n},'(?<=AB  - ).*?(?=AD  -)','match', 'once');
    pmstruct(n).Authors = regexp(hits{n},'(?<=AU  - ).*?(?=\n)','match');
    pmstruct(n).Citation = regexp(hits{n},'(?<=SO  - ).*?(?=\n)','match', 'once');
end
When you are done, your file should look similar to the getpubmed.m
file included with the Bioinformatics Toolbox software. The sample
getpubmed.m file, including help, is located at:
matlabroot\toolbox\bioinfo\biodemos\getpubmed.m
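To try the logic of these steps without contacting PubMed, the following sketch runs the same kind of parsing offline. It is an illustration under stated assumptions, not the shipped getpubmed.m: the property names 'NumberOfRecords' and 'DateOfPublication', their defaults, the placeholder base URL, and the URL parameter names are all hypothetical, and the two sample records are made up. The sample text pads each field tag to four characters, as MEDLINE-format output does.

```matlab
% Steps 3-5: parse optional name/value pairs (names and defaults are assumed)
args = {'NumberOfRecords', 5};      % stand-in for varargin inside the function
maxnum = 50;                        % assumed default limit on records
pubdate = '';                       % assumed default: no date filter
if mod(numel(args), 2) ~= 0
    error('Property names and values must be supplied in pairs.');
end
for k = 1:2:numel(args)
    switch lower(args{k})
        case 'numberofrecords'
            maxnum = args{k+1};
        case 'dateofpublication'
            pubdate = args{k+1};
        otherwise
            error('Unknown property name: %s', args{k});
    end
end

% Steps 6-8: assemble a search URL from a base URL and parameters
searchterm = 'yeast+cell+cycle';
baseURL = 'https://example.org/search?db=pubmed';   % placeholder, not the NCBI URL
searchURL = [baseURL, '&term=', searchterm, '&max=', num2str(maxnum)];
if ~isempty(pubdate)
    searchURL = [searchURL, '&date=', pubdate];     % hypothetical parameter name
end

% Steps 10-12: split MEDLINE-style text into records and extract fields
medlineText = sprintf([ ...
    'PMID- 111\nDP  - 2009 Jan\nTI  - First title\nAB  - First abstract\n' ...
    'AD  - Lab A\nAU  - Smith J\nAU  - Doe A\nSO  - J Test 2009\n\n' ...
    'PMID- 222\nDP  - 2010 Feb\nTI  - Second title\nAB  - Second abstract\n' ...
    'AD  - Lab B\nAU  - Roe R\nSO  - J Test 2010\n</pre>']);
hits = regexp(medlineText, 'PMID-.*?(?=PMID|</pre>$)', 'match');

pmstruct = struct('PubMedID', '', 'Title', '', 'Authors', '');
for n = 1:numel(hits)
    pmstruct(n).PubMedID = regexp(hits{n}, '(?<=PMID- ).*?(?=\n)', 'match', 'once');
    % The title match runs up to the next tag, so trim the trailing newline
    pmstruct(n).Title = strtrim(regexp(hits{n}, ...
        '(?<=TI  - ).*?(?=PG  -|AB  -)', 'match', 'once'));
    % No 'once' option here, so every AU line is returned in a cell array
    pmstruct(n).Authors = regexp(hits{n}, '(?<=AU  - ).*?(?=\n)', 'match');
end
```

Looping over the name/value pairs in steps of two keeps the parser independent of the order in which the caller supplies the properties.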
2
High-Throughput Sequence Analysis
Overview
Many biological experiments produce huge data files that are difficult to
access due to their size, which can cause memory issues when reading the
file into the MATLAB Workspace. You can construct a BioIndexedFile
object to access the contents of a large text file containing nonuniform size
entries, such as sequences, annotations, and cross-references to data sets.
The BioIndexedFile object lets you quickly and efficiently access this data
without loading the source file into memory.
Use the BioIndexedFile object in conjunction with your large source file to:
Work with Large Multi-Entry Text Files
• FASTA
• FASTQ
• SAM
When you construct a BioIndexedFile object from your source file for the
first time, you also create an auxiliary index file, which by default is saved
to the same location as your source file. However, if your source file is in a
read-only location, you can specify a different location to save the index file.
Tip If memory is not a constraint when accessing your source file,
you may want to try an appropriate read function, such as genbankread for
importing data from GenBank files.
1 Create a variable containing the full absolute path of your source file. For
your source file, use the yeastgenes.sgd file, which is included with the
Bioinformatics Toolbox software.
sourcefile = which('yeastgenes.sgd');
Caution Do not modify the index file. If you modify it, you can get invalid
results. Also, the constructor function cannot use a modified index file to
construct future objects from the associated source file.
gene2goObj.NumEntries
ans =
6476
There are two ways to set the Interpreter property of the BioIndexedFile
object:
Example
Because the entry keys are gene names, you can quickly find all the gene
ontology (GO) terms associated with a particular gene:
2 Read only the entries that have a key of YAT2, and return their GO terms.
GO_YAT2_entries =
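The workflow described in this section can be sketched end to end as follows. This is a minimal sketch under assumptions: the 'mrtab' format, the key column number, the '!' header prefix, and the regular expression used as the interpreter are assumptions about the layout of yeastgenes.sgd rather than values stated above, and the index file is written to tempdir in case the source folder is read-only.

```matlab
% Locate the sample source file shipped with the toolbox
sourcefile = which('yeastgenes.sgd');

% Construct the object; the index file is saved to a writable temp folder.
% Format 'mrtab', KeyColumn 3, and HeaderPrefix '!' are assumed settings.
gene2goObj = BioIndexedFile('mrtab', sourcefile, tempdir, ...
    'KeyColumn', 3, 'HeaderPrefix', '!');

% Set the Interpreter property so that reads return only GO identifiers
gene2goObj.Interpreter = @(x) regexp(x, 'GO:\d+', 'match');

% Read only the entries that have a key of YAT2
GO_YAT2_entries = read(gene2goObj, 'YAT2');
```

Because the interpreter runs only on the entries that are read, the source file itself is never loaded into memory as a whole.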
Overview
High-throughput sequencing instruments produce large amounts of short-read
sequence data that can be challenging to store and manage. Using objects to
contain this data lets you easily access, manipulate, and filter the data.
Manage Short-Read Sequence Data in Objects
quality information (created using the fastqread function)

BioMap
Contains this information:
• Sequence headers
• Read sequences
• Sequence qualities (base calling)
• Sequence alignment and mapping information (relative to a single
reference sequence), including mapping quality
Construct from one of these:
• SAM file
• BAM file
• SAM structure (created using the samread function)
• BAM structure (created using the bamread function)
• Cell arrays containing header, sequence, quality, and mapping/alignment
information (created using the samread or bamread function)
Prerequisites
A BioRead object represents a collection of short-read sequences. Each
element in the object is associated with a sequence, sequence header, and
sequence quality information.
• Indexed — The data remains in the source file. Constructing the object
and accessing its contents is memory efficient. However, you cannot modify
object properties, other than the Name property. This is the default method
if you construct a BioRead object from a FASTQ- or SAM-formatted file.
• In Memory — The data is read into memory. Constructing the object
and accessing its contents is limited by the amount of available memory.
However, you can modify object properties. When you construct a BioRead
object from a FASTQ structure or cell arrays, the data is read into memory.
When you construct a BioRead object from a FASTQ- or SAM-formatted file,
use the InMemory name-value pair argument to read the data into memory.
BRObj1 =
The constructor function constructs a BioRead object and, if an index file does
not already exist, it also creates an index file with the same file name, but
with an .IDX extension. This index file, by default, is stored in the same
location as the source file.
Caution Your source file and index file must always be in sync.
• After constructing a BioRead object, do not modify the index file, or you
can get invalid results when using the existing object or constructing new
objects.
• If you modify the source file, delete the index file, so the object constructor
creates a new index file when constructing new objects.
Note Because you constructed this BioRead object from a source file, you
cannot modify the properties (except for Name) of the BioRead object.
Prerequisites
A BioMap object represents a collection of short-read sequences that map
against a single reference sequence. Each element in the object is associated
with a read sequence, sequence header, sequence quality information, and
alignment/mapping information.
When constructing a BioMap object from a BAM file, the maximum size of the
file is limited by your operating system and available memory.
• Indexed — The data remains in the source file. Constructing the object
and accessing its contents is memory efficient. However, you cannot modify
object properties, other than the Name property. This is the default method
if you construct a BioMap object from a SAM- or BAM-formatted file.
• In Memory — The data is read into memory. Constructing the object
and accessing its contents is limited by the amount of available memory.
However, you can modify object properties. When you construct a BioMap
object from a structure, the data stays in memory. When you construct
1 If you do not know the number and names of the reference sequences in
your source file, determine them using the saminfo or baminfo function
and the ScanDictionary name-value pair argument.
ans =
'seq1'
'seq2'
Tip The previous syntax scans the entire SAM file, which is time
consuming. If you are confident that the Header information of the SAM
file is correct, omit the ScanDictionary name-value pair argument, and
inspect the SequenceDictionary field instead.
BMObj2 =
SequenceDictionary: 'seq1'
Reference: [1501x1 File indexed property]
Signature: [1501x1 File indexed property]
Start: [1501x1 File indexed property]
MappingQuality: [1501x1 File indexed property]
Flag: [1501x1 File indexed property]
MatePosition: [1501x1 File indexed property]
Quality: [1501x1 File indexed property]
Sequence: [1501x1 File indexed property]
Header: [1501x1 File indexed property]
NSeqs: 1501
Name: 'MyObject'
The constructor function constructs a BioMap object and, if index files do not
already exist, it also creates one or two index files:
• If constructing from a SAM-formatted file, it creates one index file that has
the same file name as the source file, but with an .IDX extension. This
index file, by default, is stored in the same location as the source file.
• If constructing from a BAM-formatted file, it creates two index files that
have the same file name as the source file, but one with a .BAI extension
and one with a .LINEARINDEX extension. These index files, by default,
are stored in the same location as the source file.
Caution Your source file and index files must always be in sync.
• After constructing a BioMap object, do not modify the index files, or you
can get invalid results when using the existing object or constructing new
objects.
• If you modify the source file, delete the index files, so the object constructor
creates new index files when constructing new objects.
Note Because you constructed this BioMap object from a source file, you
cannot modify the properties (except for Name and Reference) of the BioMap
object.
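The file-based construction described in this section can be sketched as follows. The sketch assumes that ex2.sam is on the MATLAB path, that the scanned names are returned in a field called ScannedDictionary, and that the reference is chosen with a 'SelectReference' name-value pair; treat the last two names as assumptions to check against the saminfo and BioMap reference pages.

```matlab
% Scan every record of the SAM file to list its reference sequence names
info = saminfo('ex2.sam', 'ScanDictionary', true);
refNames = info.ScannedDictionary;     % assumed field name for the scanned names

% Construct an indexed BioMap object for the first reference sequence only
BMObj2 = BioMap('ex2.sam', 'SelectReference', refNames{1}, ...
    'Name', 'MyObject');
```

Selecting a single reference at construction time is what keeps the object valid when the source file contains reads mapped to more than one reference.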
Note This example constructs a BioMap object from a SAM structure using
samread. Use similar steps to construct a BioMap object from a BAM structure
using bamread.
SAMStruct = samread('ex2.sam');
2 To construct a valid BioMap object from a SAM-formatted file, the file must
contain only one reference sequence. Determine the number and names
of the reference sequences in your SAM-formatted file using the unique
function to find unique names in the ReferenceName field of the structure:
unique({SAMStruct.ReferenceName})
ans =
'seq1' 'seq2'
BMObj1 =
SequenceDictionary: {'seq1'}
Reference: {1501x1 cell}
Signature: {1501x1 cell}
Start: [1501x1 uint32]
MappingQuality: [1501x1 uint8]
Flag: [1501x1 uint16]
MatePosition: [1501x1 uint32]
Quality: {1501x1 cell}
Sequence: {1501x1 cell}
Header: {1501x1 cell}
NSeqs: 1501
Name: ''
For example, to retrieve all headers from a BioRead object, use the Header
property as follows:
allHeaders = BRObj1.Header;
This syntax returns a cell array containing the headers for all elements in the
BioRead object.
allStarts = BMObj1.Start;
This syntax returns a vector containing the start positions of aligned read
sequences with respect to the position numbers in the reference sequence in
a BioMap object.
This syntax returns a cell array containing the start positions and header
information of a BioMap object.
For a list and description of all properties of a BioRead object, see BioRead
class. For a list and description of all properties of a BioMap object, see BioMap
class.
This syntax returns a new BioRead object containing the first 10 elements in
the original BioRead object.
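A minimal sketch of creating such a subset, assuming the sample file SRR005164_1_50.fastq is on the MATLAB path and using the getSubset method with a vector of indices:

```matlab
% Construct a BioRead object from the sample FASTQ file
BRObj1 = BioRead('SRR005164_1_50.fastq');

% Create a new object containing only elements 1 through 10
first10 = getSubset(BRObj1, 1:10);

% Retrieve the headers of the retained elements
firstHeaders = getHeader(first10);
```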
subSeqs =
'TGGCTTTAAAGC'
'CCCGAAAGCTAG'
'AATTTTGCGGCT'
• Sequence header
• SAM flags for the sequence
• Start position of the aligned read sequence with respect to the reference
sequence
• Mapping quality score for the sequence
• Signature (CIGAR-formatted string) for the sequence
• Sequence
• Quality scores for sequence positions
Prerequisites
To modify properties (other than Name and Reference) of a BioRead or BioMap
object, the data must be in memory, and not indexed. To ensure the data is
in memory, do one of the following:
BRObj1 = BioRead('SRR005164_1_50.fastq','InMemory',true);
To provide custom headers for sequences of interest (in this case sequences 1
to 5), do the following:
Several other specialized set methods let you set the properties of a subset of
elements in a BioRead or BioMap object.
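The header update described above can be sketched as follows. The header strings are placeholders chosen for illustration, and the object is constructed with InMemory set to true so that its properties can be modified.

```matlab
% Read the data into memory so that properties can be changed
BRObj1 = BioRead('SRR005164_1_50.fastq', 'InMemory', true);

% Placeholder headers for sequences 1 to 5
newHeaders = {'H1', 'H2', 'H3', 'H4', 'H5'};

% setHeader returns the updated object; the third input selects the elements
BRObj1 = setHeader(BRObj1, newHeaders, 1:5);

% Read the headers back to confirm the change
updated = getHeader(BRObj1, 1:5);
```

Assigning the result back to BRObj1 matters here, because the set methods return a modified copy rather than changing the object in place.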
For example, you can compute the number, indices, and start positions of
the read sequences that align within the first 25 positions of the reference
sequence. To do so, use the getCounts, getIndex, and getStart methods:
Cov =
12
Indices =
1
2
3
4
5
6
7
8
9
10
11
12
startPos =
1
3
5
6
9
13
13
15
18
22
22
24
The first two syntaxes return the number and indices of the read sequences
that align within the specified region of the reference sequence. The last
syntax returns a vector containing the start position of each aligned read
sequence, corresponding to the position numbers of the reference sequence.
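The three calls described above can be sketched as follows. The sketch constructs the object directly from the ex1.sam sample file, which is an assumption made to keep the example self-contained; the counts it returns may differ from the output listed above if your object was built from another source.

```matlab
% Construct a BioMap object from a sample SAM file
BMObj1 = BioMap('ex1.sam');

% Number of read sequences that align within positions 1 through 25
Cov = getCounts(BMObj1, 1, 25);

% Indices of those read sequences
Indices = getIndex(BMObj1, 1, 25);

% Start position of each of those read sequences on the reference
startPos = getStart(BMObj1, Indices);
```

Passing the indices from getIndex to getStart restricts the start positions to the reads found in the region, instead of returning them for every read in the object.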
You can also compute the number of read sequences that align to each of
the first 10 positions of the reference sequence. For this
computation, use the getBaseCoverage method:
Cov =
1 1 2 2 3 4 4 4 5 5
Alignment_1_12 =
CACTAGTGGCTC
CTAGTGGCTC
AGTGGCTC
GTGGCTC
GCTC
Indices =
1
2
3
4
5
Return the headers of the read sequences that align to a specific region of
the reference sequence:
alignedHeaders =
'B7_591:4:96:693:509'
'EAS54_65:7:152:368:113'
'EAS51_64:8:5:734:57'
'B7_591:1:289:587:906'
'EAS56_59:8:38:671:758'
BMObj2 = BioMap('ex1.sam');
2 Use the filterByFlag method to create a logical vector indicating the read
sequences in a BioMap object that are mapped.
3 Use this logical vector and the getSubset method to create a new BioMap
object containing only the mapped read sequences.
BMObj2 = BioMap('ex1.sam');
2 Use the filterByFlag method to create a logical vector indicating the read
sequences in a BioMap object that are mapped in a proper pair, that is, both
the read sequence and its mate are mapped to the reference sequence.
3 Use this logical vector and the getSubset method to create a new BioMap
object containing only the read sequences that are mapped in a proper pair.
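Both filtering procedures above can be sketched together. The flag names 'unmappedQuery' and 'pairedInMap' are the ones this sketch assumes for "read is unmapped" and "read is mapped in a proper pair"; check them against the filterByFlag reference page before relying on them.

```matlab
% Construct a BioMap object from the sample SAM file
BMObj2 = BioMap('ex1.sam');

% Logical vector that is true for read sequences that are mapped
mapped = filterByFlag(BMObj2, 'unmappedQuery', false);
mappedObj = getSubset(BMObj2, mapped);

% Logical vector that is true for read sequences mapped in a proper pair
properPair = filterByFlag(BMObj2, 'pairedInMap', true);
pairedObj = getSubset(BMObj2, properPair);
```

Keeping the logical vectors as separate variables makes it easy to combine criteria later, for example with mapped & properPair.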
Store and Manage Feature Annotations in Objects
GFFAnnotObj = GFFAnnotation('tair8_1.gff')
GFFAnnotObj =
GTFAnnotObj = GTFAnnotation('hum37_2_1M.gtf')
GTFAnnotObj =
GFFAnnotObj.FieldNames
ans =
Columns 1 through 6
Columns 7 through 9
GTFAnnotObj.FieldNames
ans =
Columns 1 through 6
Columns 7 through 11
Determine the range of the reference sequences that are covered by feature
annotations by using the getRange method with the annotation object
constructed in the previous section:
range = getRange(GFFAnnotObj)
range =
3631 498516
AnnotStruct =
Starts = AnnotStruct.Start;
Extract the start positions for annotations 12 through 17. Notice that you
must use square brackets when indexing a range of positions:
Starts_12_17 = [AnnotStruct(12:17).Start]
Starts_12_17 =
Extract the start position and the feature for the 12th annotation:
Start_12 = AnnotStruct(12).Start
Start_12 =
4706
Feature_12 = AnnotStruct(12).Feature
Feature_12 =
CDS
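The need for square brackets comes from how MATLAB expands a field reference over a structure array into a comma-separated list. A minimal sketch with a small, made-up structure array follows; the field values are illustrative and are not taken from tair8_1.gff.

```matlab
% Hypothetical annotation structure array with three entries.
S = struct('Start', {3631, 3760, 4706}, 'Feature', {'gene', 'mRNA', 'CDS'});

% S(2:3).Start expands to a comma-separated list of two values.
% Square brackets concatenate that list into a numeric vector.
starts = [S(2:3).Start];

% For text fields, curly braces collect the list into a cell array.
features = {S.Feature};

disp(starts)
disp(features)
```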
GTFAnnotObj = GTFAnnotation('hum37_2_1M.gtf');
2 Use the getReferenceNames method to return the names for the reference
sequences for the annotation object:
refNames = getReferenceNames(GTFAnnotObj)
refNames =
'chr2'
3 Use the getFeatureNames method to retrieve the feature names from the
annotation object:
featureNames = getFeatureNames(GTFAnnotObj)
featureNames =
'CDS'
'exon'
'start_codon'
'stop_codon'
4 Use the getGeneNames method to retrieve a list of the unique gene names
from the annotation object:
geneNames = getGeneNames(GTFAnnotObj)
geneNames =
'uc002qvu.2'
'uc002qvv.2'
'uc002qvw.2'
'uc002qvx.2'
'uc002qvy.2'
'uc002qvz.2'
'uc002qwa.2'
'uc002qwb.2'
'uc002qwc.1'
'uc002qwd.2'
'uc002qwe.3'
'uc002qwf.2'
'uc002qwg.2'
'uc002qwh.2'
'uc002qwi.3'
'uc002qwk.2'
'uc002qwl.2'
'uc002qwm.1'
'uc002qwn.1'
'uc002qwo.1'
'uc002qwp.2'
'uc002qwq.2'
'uc010ewe.2'
'uc010ewf.1'
'uc010ewg.2'
'uc010ewh.1'
'uc010ewi.2'
'uc010yim.1'
Filter Annotations
Use the getData method to filter the annotations and create a structure
containing only the annotations of interest: exons associated with the
uc002qvv.2 gene on chromosome 2.
AnnotStruct = getData(GTFAnnotObj,'Reference','chr2',...
'Feature','exon','Gene','uc002qvv.2')
AnnotStruct =
Gene
Transcript
Source
Score
Strand
Frame
Attributes
StartPos = [AnnotStruct.Start];
EndPos = [AnnotStruct.Stop];
BMObj3 = BioMap('ex3.bam');
Then use the range for the annotations of interest as input to the getCounts
method of a BioMap object. This returns the counts of short reads aligned to
the annotations of interest.
counts =
1399
1
54
221
97
125
0
1
0
65
9
12
Visualize and Investigate Short-Read Alignments
You can visualize and investigate the aligned data before, during, or after any
preprocessing (filtering, quality recalibration) or analysis steps you perform
on the aligned data.
ngsbrowser
Browser Displaying Reference Track, One Alignment Track, and One Annotation Track
2 In the Open dialog box, select a FASTA file, and then click Open.
Tip You can use the getgenbank function with the ToFile and SequenceOnly
name-value pair arguments to retrieve a reference sequence from the
GenBank database and save it to a FASTA-formatted file.
• BioMap object
Tip If you do not have index files (IDX or BAI and LINEARINDEX) stored
in the same location as your source file, and your source file is stored in a
location to which you do not have write access, you cannot import data from
the source file directly into the browser. Instead, construct a BioMap object
from the source file using the IndexDir name-value pair argument, and
then import the BioMap object into the browser.
1 Select File > Add Data from File or File > Import Alignment Data
from MATLAB Workspace.
2 In the Open dialog box, select a GFF- or GTF-formatted file, and then
click Open.
Tip Use the left and right arrow keys to pan in one base pair (bp) increments.
Note The browser computes coverage at the base pair resolution, instead
of binning, even when zoomed out.
Tip Set Max to a value greater than 100, if needed, when comparing the
coverage of multiple tracks of reads.
Limit the depth of the reads displayed in the pileup view by setting the
Maximum display read depth in the Alignment Pileup settings.
Tip Limiting the depth of short reads in the pileup view does not change the
counts displayed in the coverage view.
Flag Reads
Click anywhere in an alignment track to display the Alignment Pileup
settings.
In addition to the base Phred quality information that displays in the tooltip,
you can visualize quality differences by using the Shade mismatch bases
by Phred quality settings.
• Light shade — Mismatch bases with Phred scores below the minimum
• Graduation of medium shades — Mismatch bases with Phred scores within
the minimum to maximum range
• Dark shade — Mismatch bases with Phred scores above the maximum
Identifying Differentially Expressed Genes from RNA-Seq Data
Introduction
In the prostate cancer study, the prostate cancer cell line LNCaP was treated
with androgen/DHT. Mock-treated and androgen-stimulated LNCaP cells
were sequenced using the Illumina® 1G Genome Analyzer [1]. For the
mock-treated cells, there were four lanes totaling ~10 million reads. For the
DHT-treated cells, there were three lanes totaling ~7 million reads. All
replicates were technical replicates. Samples labeled s1 through s4 are from
mock-treated cells. Samples labeled s5, s6, and s8 are from DHT-treated
cells. The read sequences are stored in FASTA files. The sequence IDs break
down as follows: seq_(unique sequence id)_(number of times this sequence
was seen in this lane).
This example assumes that you:
(1) Downloaded and uncompressed the seven FASTA files (s1.fa, s2.fa,
s3.fa, s4.fa, s5.fa, s6.fa, and s8.fa) containing the raw, 35 bp, unmapped
short reads from the authors’ website,
(2) Produced a SAM-formatted file for each of the seven FASTA files by
mapping the short reads to NCBI version 37 of the human genome using a
mapper such as Bowtie [2], and
(3) Ordered the SAM-formatted files by reference name first, then by genomic
position.
For the published version of this example, 4,388,997 short reads were mapped
using the Bowtie aligner [2]. The aligner was instructed to report one best
valid alignment. No more than two mismatches were allowed for alignment.
Reads with more than one reportable alignment were suppressed, i.e. any
read that mapped to multiple locations was discarded. The alignment was
output to seven SAM files (s1.sam, s2.sam, s3.sam, s4.sam, s5.sam, s6.sam
and s8.sam). Because the input files were FASTA files, all quality values
were assumed to be 40 on the Phred quality scale [2]. We then used SAMtools
[3] to sort the mapped reads in the seven SAM files, one for each replicate.
GFFfilename = ensemblmart2gff('ensemblmart_genes_hum37.txt');
genes = GFFAnnotation(GFFfilename)
genes =
Create a subset with the genes present in chromosomes only (without contigs).
The GFFAnnotation object contains 20012 annotated protein-coding genes in
the Ensembl database.
chrs = {'1','2','3','4','5','6','7','8','9','10','11','12','13','14',...
'15','16','17','18','19','20','21','22','X','Y','MT'};
genes = getSubset(genes,'reference',chrs)
genes =
Copy the gene information into a structure and display the first entry.
getData(genes,1)
ans =
Reference: '1'
Start: 205111632
Stop: 205180727
Feature: 'DSTYK'
Source: 'protein_coding'
Score: '0.0'
Strand: '-'
Frame: '.'
Attributes: ''
The sorted SAM files in this data set are on the order of 250-360 MB in size.
You can access the mapped reads in s1.sam by creating a BioMap. BioMap has
an interface that provides direct access to the mapped short reads stored in
the SAM-formatted file, thus minimizing the amount of data that is actually
loaded into memory.
bm = BioMap('s1.sam')
bm =
Use the getSummary method to obtain a list of the existing references and the
actual number of short reads mapped to each one. Observe that the order of
the references matches that of the previously created cell array of strings, chrs.
getSummary(bm)
BioMap summary:
Name: ''
Container_Type: 'Data is file indexed.'
Total_Number_of_Sequences: 458367
Number_of_References_in_Dictionary: 25
Number_of_Sequences Genomic_Range
gi|224589800|ref|NC_000001.10| 39037 564571 2492
You can access the alignments, and perform operations like getting counts
and coverage from bm. For more examples of getting read coverage at the
chromosome level, see Exploring Protein-DNA Binding Sites from Paired-End
ChIP-Seq Data.
Next, determine the mapped reads associated with each Ensembl
gene. Because the strings used in the SAM files to denote the reference names
differ from those provided in the annotations, find a vector with the
reference index for each gene:
geneReference = seqmatch(genes.Reference,chrs,'exact',true);
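The mapping computed here can be pictured with a base MATLAB equivalent. The sketch below uses ismember on a few made-up reference labels. It illustrates the idea under the assumption of exact, case-sensitive matching and is not a description of how seqmatch is implemented.

```matlab
% Chromosome labels in the same order as the references in the SAM files.
chrs = {'1','2','3','4','5','6','7','8','9','10','11','12','13','14',...
        '15','16','17','18','19','20','21','22','X','Y','MT'};

% Hypothetical reference labels for four annotated genes.
geneRefs = {'2','X','1','MT'};

% For each gene, find the position of its reference label within chrs.
[found, refIndex] = ismember(geneRefs, chrs);

disp(refIndex)   % index into chrs for each gene
```

Each index can then be used to select the matching reference in the BioMap object, regardless of how the reference is named in the SAM file.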
For each gene, count the mapped reads that overlap any part of the gene.
The read counts for each gene are the digital gene expression of that gene.
Use the getCounts method of a BioMap to compute the read count within a
specified range.
counts = getCounts(bm,genes.Start,genes.Stop,1:genes.NumEntries,geneReferen
Gene expression levels are best represented by a table, with each row
representing a gene. Create a table with two columns: set the first column to
the gene symbols and the second column to the counts of the first sample.
filenames = {'s1.sam','s2.sam','s3.sam','s4.sam','s5.sam','s6.sam','s8.sam'
samples = {'Mock_1','Mock_2','Mock_3','Mock_4','DHT_1','DHT_2','DHT_3'};
lncap = table(genes.Feature,counts,'VariableNames',{'Gene',samples{1}});
lncap(1:10,:)
ans =
Gene Mock_1
____________ ______
'DSTYK' 21
'KCNJ2' 1
'DPF3' 2
'KRT78' 0
'GPR19' 1
'SOX9' 8
'C17orf63' 13
'AL929472.1' 0
'INPP5B' 19
'NME4' 10
Determine the number of genes that have counts greater than or equal to
50 in chromosome 1.
ans =
188
Repeat this step for the other six samples (SAM files) in the data set to get
their gene counts and copy the information to the previously created table.
for i = 2:7
bm = BioMap(filenames{i});
counts = getCounts(bm,genes.Start,genes.Stop,1:genes.NumEntries,geneRef
lncap.(samples{i}) = counts;
end
Inspect the first 10 rows in the table with the counts for all seven samples.
lncap(1:10, :)
ans =
Gene            Mock_1    Mock_2    Mock_3    Mock_4    DHT_1    DHT_2    DHT_3
____________    ______    ______    ______    ______    _____    _____    _____
'DSTYK'         21        15        15        24        24       24       15
'KCNJ2'         1         0         2         0         0        2        2
'DPF3'          2         2         2         2         2        1        1
'KRT78'         0         0         0         0         0        0        0
'GPR19'         1         2         1         1         0        0        0
'SOX9'          8         13        19        15        27       22       11
'C17orf63'      13        12        16        24        19       12       9
'AL929472.1'    0         0         0         1         0        0        0
'INPP5B'        19        23        27        24        35       32       9
'NME4'          10        11        14        22        11       20       8
The table lncap contains counts for samples from two biological conditions:
mock-treated (Aidx) and DHT-treated (Bidx).
You can plot the counts for a chromosome along its genomic
coordinates. For example, plot the counts for chromosome 1 for mock-treated
sample Mock_1 and DHT-treated sample DHT_1. Add the ideogram for
chromosome 1 to the plot using the chromosomeplot function.
figure
plot(genes.Start(ichr1), lncap{ichr1,'Mock_1'}, '.-r',...
genes.Start(ichr1), lncap{ichr1,'DHT_1'}, '.-b');
ylabel('Gene Counts')
title('Gene Counts on Chromosome 1')
fixGenomicPositionLabels(gca) % formats tick labels and adds datacursors
chromosomeplot('hs_cytoBand.txt', 1, 'AddToPlot', gca)
For RNA-seq experiments, the read counts have been found to be linearly
related to the abundance of the target transcripts [4]. The interest lies
in comparing the read counts between different biological conditions.
Current observations suggest that typical RNA-seq experiments have low
background noise, and the gene counts are discrete and could follow the
Poisson distribution. However, it has been noted that the assumption of the
Poisson distribution often predicts smaller variation in count data than is
observed, because it ignores the extra variation due to the actual differences
between replicate samples [5]. Anders et al. (2010) proposed an error model
for statistical inference of differential signal in RNA-seq expression data
that could address the overdispersion problem [6]. Their approach uses the
negative binomial distribution to model the null distribution of the read
counts. The mean and variance of the negative binomial distribution are
linked by local regression, and these two parameters can be reliably
estimated even when the number of replicates is small [6].
In this example, you will apply this statistical model to process the count
data and test for differential expression. The details of the algorithm can be
found in reference [6]. The model of Anders et al. (2010) has three sets of
parameters that need to be estimated from the data:
3. The smooth functions that model the dependence of the raw variance on
the expected mean.
The expectation values of all gene counts from a sample are proportional to
the sample’s library size. The effective library size can be estimated from the
count data.
Compute the geometric mean of the gene counts (rows in lncap) across all
samples in the experiment as a pseudo-reference sample.
pseudo_ref_sample = geomean(lncap{:,samples},2);
Each library size parameter is computed as the median of the ratio of the
sample’s counts to those of the pseudo-reference sample.
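A minimal sketch of this size factor estimate on a small, made-up count matrix follows. It uses only base MATLAB (the geometric mean is written out with exp, mean, and log), and the choice to skip genes whose pseudo-reference count is zero is an assumption made here so that the ratios stay finite. The variable names are hypothetical.

```matlab
% Hypothetical counts: 4 genes (rows) by 3 samples (columns). Sample 2 is
% sequenced twice as deeply as sample 1, and sample 3 four times as deeply.
baseCounts = [10; 40; 0; 200];
counts = baseCounts * [1 2 4];

% Pseudo-reference sample: geometric mean of each gene across the samples.
pseudoRef = exp(mean(log(counts), 2));

% Ignore genes with a zero pseudo-reference (assumption for this sketch).
nz = pseudoRef > 0;

% Ratio of each sample's counts to the pseudo-reference, gene by gene.
ratios = bsxfun(@rdivide, counts(nz,:), pseudoRef(nz));

% Size factor of each sample: median ratio over the retained genes.
sizeFactors = median(ratios, 1);

% Bring the counts to a common scale.
normCounts = bsxfun(@rdivide, counts, sizeFactors);
disp(sizeFactors)
```

In this toy case the size factors come out close to 0.5, 1, and 2, and dividing by them gives every sample the same adjusted counts, which is the intended effect of the adjustment.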
The counts can be transformed to a common scale using size factor adjustment.
base_lncap = lncap;
base_lncap{:,samples} = bsxfun(@rdivide,lncap{:,samples},sizeFactors);
Use the maboxplot function to inspect the count distribution of the mock-treated
and DHT-treated samples and the effect of the size factor adjustment.
figure
subplot(2,1,1)
maboxplot(log2(lncap{:,samples}), 'title','Raw Read Counts',...
'orientation', 'horizontal')
subplot(2,1,2)
maboxplot(log2(base_lncap{:,samples}), 'title','Size Factor Adjusted Read C
'orientation', 'horizontal')
Plot the log2 fold changes against the base means using the mairplot
function. A quick exploration reveals ~15 differentially expressed genes
(20-fold change or more), though not all of these are significant due to the low
number of counts compared to the sample variance.
mairplot(mean_A(nzi),mean_B(nzi),'Labels',lncap.Gene,'Factor',20)
In the model, the variance of the counts of a gene is considered to be the sum
of a shot noise term and a raw variance term. The shot noise term is the
mean count of the gene, while the raw variance can be predicted from the
mean, i.e., genes with a similar expression level have similar variance across
the replicates (samples of the same biological condition). A smooth function
that models the dependence of the raw variance on the mean is obtained by
fitting the sample mean and variance within replicates for each gene using
a local regression function.
z = mean_A * mean(1./sizeFactors(Aidx));
raw_var_func_A = estimateNBVarFunc(mean_A,var_A,sizeFactors(Aidx))
raw_var_func_A =
@(meanEstimate)calculateUnbiasedRawVariance(meanEstimate)
var_fit_A = raw_var_func_A(mean_A) + z;
Plot the sample variance against its regressed value to check the fit of the
variance function.
figure
loglog(mean_A, var_A, '*')
hold on
loglog(mean_A, var_fit_A, '.r')
ylabel('Base Variances')
xlabel('Base Means')
title('Dependence of the Variance on the Mean for Mock-Treated Samples')
The fit (red line) follows the single-gene estimates well, even though the
spread of the latter is considerable, as one would expect, given that each
raw variance value is estimated from only four values (four mock-treated
replicates).
degrees_of_freedom = sum(Aidx) - 1;
var_ratio = var_A ./ var_fit_A;
pchisq = chi2cdf(degrees_of_freedom * var_ratio, degrees_of_freedom);
figure;
hold on
cm = jet(7);
for i = 1:7
[Y1,X1] = ecdf(pchisq(grps==i));
plot(X1,Y1,'LineWidth',2,'color',cm(i,:))
end
plot([0,1],[0,1] ,'k', 'linewidth', 2)
set(gca, 'Box', 'on')
legend(labels,'Location','NorthWest')
xlabel('Chi-squared probability of residual')
ylabel('ECDF')
title('Residuals ECDF plot for mock-treated samples')
The ECDF curves of count levels greater than 3 and below 130 follow the
diagonal well (black line). If the ECDF curves are below the black line,
variance is underestimated. If the ECDF curves are above the black line,
variance is overestimated [6]. For very low counts (below 3), the deviations
become stronger, but at these levels, shot noise dominates. For the high
count cases, the variance is overestimated. The reason might be that there are
not enough genes with high counts. Get the number of genes in each of the
count levels.
array2table(accumarray(grps,1),'VariableNames',{'Counts'},'RowNames',labels
ans =
Counts
______
0-3 8984
4-12 3405
13-30 3481
31-65 2418
66-130 1173
131-310 428
> 311 123
Increasing the sequencing depth, which in turn increases the number of genes
with higher counts, improves the variance estimation.
Having estimated and verified the mean-variance dependence, you can test
for differentially expressed genes between the samples from the mock-treated
and DHT-treated conditions. Define, as the test statistic, the total counts in
each condition, k_A and k_B:
Parameters of the new negative binomial distributions for count sums k_A can
be calculated by Eqs. 12-14 in [6]:
Compute the p-values for the statistical significance of the change from
DHT-treated condition to mock-treated condition. The helper function
computePVal implements the numerical computation of the p-values
presented in the reference [6].
res = table(genes.Feature,'VariableNames',{'Gene'});
res.pvals = computePVal(k_B, mean_k_B, var_k_B, k_A, mean_k_A, var_k_A);
You can empirically adjust the p-values from the multiple tests for false
discovery rate (FDR) with the Benjamini-Hochberg procedure [7] using the
mafdr function.
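For readers who want to see what the Benjamini-Hochberg adjustment does, here is a small base MATLAB sketch applied to made-up p-values. It is a textbook version of the step-up procedure written for illustration. It does not reproduce the options or internals of mafdr, and the variable names are hypothetical.

```matlab
% Hypothetical raw p-values from four tests.
p = [0.01 0.04 0.03 0.005];
n = numel(p);

% Sort ascending and scale each p-value by n/rank.
[pSorted, order] = sort(p);
adjSorted = pSorted .* n ./ (1:n);

% Enforce monotonicity from the largest p-value downward, capping at 1.
adjSorted(n) = min(adjSorted(n), 1);
for k = n-1:-1:1
    adjSorted(k) = min(adjSorted(k), adjSorted(k+1));
end

% Return the adjusted values to the original order of the tests.
pAdjusted = zeros(size(p));
pAdjusted(order) = adjSorted;
disp(pAdjusted)
```

Each adjusted value is at least as large as the corresponding raw p-value, which is why fewer genes pass a given threshold after the adjustment.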
res.log2_fold_change = log2(fold_change);
Plot the log2 fold changes against the base means, and color each gene
by its FDR-adjusted p-value.
figure
scatter(log2(pooled_mean), res.log2_fold_change,3,(res.p_fdr).^(.02),'o')
xlabel('log2 Mean')
ylabel('log2 Fold Change')
colormap(flipud(cool(256)))
hc = colorbar;
set(hc,'YTickLabel',num2str((get(hc,'Ytick').^50)','%6.1g'))
title('Fold Change colored by False Discovery Rate (FDR)')
You can identify up- or down-regulated genes for mean base count levels
over 3.
up_idx = find(res.p_fdr < 0.01 & res.log2_fold_change >= 2 & pooled_mean >
numel(up_idx)
ans =
185
ans =
190
This analysis identified 375 statistically significant genes (out of 20,012) that
were differentially up- or down-regulated by hormone treatment. You can
sort table res by statistical significance and display the top of the list.
[~,h] = sort(res.p_fdr);
res(h(1:20),:)
ans =
Gene         pvals          p_fdr          log2_fold_change
_________    ___________    ___________    ________________
'FKBP5'      0              0              5.0449
'NCAPD3'     0              0              5.4914
'CENPN'      6.6707e-300    4.4498e-296    4.8519
'LIFR'       2.4939e-284    1.2477e-280    4.0734
'DHCR24'     2.0847e-249    8.3437e-246    3.1845
'ERRFI1'     9.2602e-246    3.0886e-242    4.0914
'GLYATL2'    8.5613e-244    2.4475e-240    3.4522
'ACSL3'      2.6073e-225    6.5221e-222    3.6953
'ATF3'       1.2368e-193    2.75e-190      3.368
'MLPH'       2.0119e-185    4.0263e-182    2.5466
'STEAP4'     1.7537e-182    3.1905e-179    9.9479
'DBI'        3.787e-173     6.3155e-170    2.7759
'ABCC4'      8.5321e-166    1.3134e-162    2.8211
'KLK2'       2.7911e-163    3.9897e-160    2.9506
'SAT1'       1.2922e-161    1.724e-158     2.6687
'CAMK2N1'    8.8046e-161    1.1012e-157    -4.2901
'JAM3'       4.7333e-151    5.5719e-148    5.7235
'MBOAT2'     1.556e-140     1.7299e-137    3.285
'RHOU'       1.4157e-138    1.4911e-135    4.0932
'NNMT'       5.6484e-138    5.6517e-135    4.3572
References
[1] Li, H., Lovci, M.T., Kwon, Y-S., Rosenfeld, M.G., Fu, X-D., and Yeo, G.W.
"Determination of Tag Density Required for Digital Transcriptome Analysis:
Application to an Androgen-Sensitive Prostate Cancer Model", PNAS, 105(51),
pp 20179-20184, 2008.
[2] Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. "Ultrafast and
Memory-efficient Alignment of Short DNA Sequences to the Human Genome",
Genome Biology, 10:R25, pp 1-10, 2009.
[3] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N.,
Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing
Subgroup, "The Sequence Alignment/map (SAM) Format and SAMtools",
Bioinformatics, 25, pp 2078-2079, 2009.
[4] Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B.
"Mapping and quantifying mammalian transcriptomes by RNA-Seq", Nature
Methods, 5, pp 621-628, 2008.
[5] Robinson, M.D., and Oshlack, A. "A Scaling Normalization Method for
Differential Expression Analysis of RNA-seq Data", Genome Biology, 11:R25,
pp 1-9, 2010.
[6] Anders, S., and Huber, W. "Differential Expression Analysis for Sequence
Count Data", Genome Biology, 11:R106, 2010.
[7] Benjamini, Y., and Hochberg, Y. "Controlling the False Discovery Rate: A
Practical and Powerful Approach to Multiple Testing", J. Royal Stat. Soc.,
B 57, pp 289-300, 1995.
Exploring Protein-DNA Binding Sites from Paired-End ChIP-Seq Data
Introduction
Data Set
This example assumes that you:
(1) downloaded the file SRR054715.sra containing the unmapped short reads
and converted it to FASTQ-formatted files using the NCBI SRA Toolkit,
(2) produced a SAM-formatted file by mapping the short reads to the Thale
Cress reference genome, using a mapper such as BWA [2], Bowtie, or SSAHA2
(which is the mapper used by the authors of [1]), and
(3) ordered the SAM-formatted file by reference name first, then by genomic
position.
For the published version of this example, 8,655,859 paired-end short reads
were mapped using the BWA mapper [2]. BWA produced a SAM-formatted
file (aratha.sam) with 17,311,718 records (8,655,859 x 2). Repetitive hits
were randomly chosen, and only one hit was reported, but with lower mapping
quality. The SAM file was ordered and converted to a BAM-formatted file
using SAMtools [3] before being loaded into MATLAB.
The last part of the example also assumes that you downloaded the
reference genome for the Thale Cress model organism (which includes five
chromosomes). Uncomment the following lines of code to download the
reference from the NCBI repository:
% getgenbank('NC_003070','FileFormat','fasta','tofile','ach1.fasta');
% getgenbank('NC_003071','FileFormat','fasta','tofile','ach2.fasta');
% getgenbank('NC_003074','FileFormat','fasta','tofile','ach3.fasta');
% getgenbank('NC_003075','FileFormat','fasta','tofile','ach4.fasta');
% getgenbank('NC_003076','FileFormat','fasta','tofile','ach5.fasta');
bm = BioMap('aratha.bam')
bm =
BioMap
Properties:
SequenceDictionary: {5x1 cell}
Reference: [14637324x1 File indexed property]
Signature: [14637324x1 File indexed property]
Start: [14637324x1 File indexed property]
MappingQuality: [14637324x1 File indexed property]
Flag: [14637324x1 File indexed property]
MatePosition: [14637324x1 File indexed property]
Quality: [14637324x1 File indexed property]
Sequence: [14637324x1 File indexed property]
Header: [14637324x1 File indexed property]
NSeqs: 14637324
Name: ''
Use the getSummary method to obtain a list of the existing references and the
actual number of short reads mapped to each one.
getSummary(bm)
BioMap summary:
Name: ''
Container_Type: 'Data is file indexed.'
Total_Number_of_Sequences: 14637324
Number_of_References_in_Dictionary: 5
Number_of_Sequences Genomic_Range
Chr1 3151847 1 30427671
Chr2 3080417 1000 19698292
Chr3 3062917 94 23459782
Chr4 2218868 1029 18585050
Chr5 3123275 11 26975502
The remainder of this example focuses on the analysis of one of the five
chromosomes, Chr1. Create a new BioMap to access the short reads mapped to
the first chromosome by subsetting the original BioMap.
bm1 = getSubset(bm,'SelectReference','Chr1')
bm1 =
BioMap
Properties:
SequenceDictionary: {'Chr1'}
Reference: [3151847x1 File indexed property]
Signature: [3151847x1 File indexed property]
Start: [3151847x1 File indexed property]
MappingQuality: [3151847x1 File indexed property]
Flag: [3151847x1 File indexed property]
MatePosition: [3151847x1 File indexed property]
Quality: [3151847x1 File indexed property]
Sequence: [3151847x1 File indexed property]
Header: [3151847x1 File indexed property]
NSeqs: 3151847
Name: ''
By accessing the Start and Stop positions of the mapped short reads, you can
obtain the genomic range.
x1 = min(getStart(bm1))
x2 = max(getStop(bm1))
x1 =
1
x2 =
30427671
To explore the coverage for the whole range of the chromosome, a binning
algorithm is required. The getBaseCoverage method produces a coverage
signal based on effective alignments. It also allows you to specify a bin width
to control the size (or resolution) of the output signal. However, internal
computations are still performed at the base pair (bp) resolution. This means
that despite setting a large bin size, narrow peaks in the coverage signal can
still be observed. Once the coverage signal is plotted, you can program the
figure’s data cursor to display the genomic position when using the tooltip.
You can zoom and pan the figure to determine the position and height of
the ChIP-Seq peaks.
[cov,bin] = getBaseCoverage(bm1,x1,x2,'binWidth',1000,'binType','max');
figure
plot(bin,cov)
axis([x1,x2,0,100]) % sets the axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage in Chromosome 1')
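The effect of the bin type can be seen on a toy coverage vector. The sketch below, in base MATLAB with invented numbers, summarizes a per-base signal in fixed-width bins, once by maximum and once by mean. It assumes the signal length is a multiple of the bin width and is not meant to mirror how getBaseCoverage is implemented.

```matlab
% Hypothetical per-base coverage with one narrow peak at position 4.
baseCov = [0 0 1 9 1 0 0 0 2 2 2 2];
binWidth = 4;   % assumption: numel(baseCov) is a multiple of binWidth

% Put each bin in its own column, then summarize column by column.
binned = reshape(baseCov, binWidth, []);
covMax = max(binned, [], 1);    % keeps the height of narrow peaks
covMean = mean(binned, 1);      % flattens narrow peaks

% First reference position of each bin.
binStart = 1:binWidth:numel(baseCov);

disp([binStart; covMax; covMean])
```

With the maximum, the peak of height 9 survives binning, whereas the mean reduces it to 2.5. This is why a narrow peak can still be visible at a coarse bin width when the signal is computed at base pair resolution first.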
p1 = 4598837-1000;
p2 = 4598837+1000;
figure
plot(p1:p2,getBaseCoverage(bm1,p1,p2))
xlim([p1,p2]) % sets the x-axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage in Chromosome 1')
Observe the large peak with coverage depth of 800+ between positions
4599029 and 4599145. Investigate how these reads are aligning to the
reference chromosome. You can retrieve a subset of these reads that is just
large enough to satisfy a coverage depth of 25, since this is sufficient to
understand what is happening in this region. Use getIndex to obtain indices to
this subset. Then use getCompactAlignment to display the corresponding
multiple alignment of the short reads.
i = getIndex(bm1,4599029,4599145,'depth',25);
bmx = getSubset(bm1,i,'inmemory',false)
getCompactAlignment(bmx,4599029,4599145)
bmx =
BioMap
Properties:
SequenceDictionary: {'Chr1'}
Reference: [62x1 File indexed property]
Signature: [62x1 File indexed property]
Start: [62x1 File indexed property]
MappingQuality: [62x1 File indexed property]
Flag: [62x1 File indexed property]
MatePosition: [62x1 File indexed property]
Quality: [62x1 File indexed property]
Sequence: [62x1 File indexed property]
Header: [62x1 File indexed property]
NSeqs: 62
Name: ''
ans =
In addition to visually confirming the alignment, you can also explore the
mapping quality for all the short reads in this region, as this may hint at a
potential problem. In this case, less than one percent of the short reads have
a Phred quality of 60, indicating that the mapper most likely found multiple
hits within the reference genome, hence assigning a lower mapping quality.
figure
i = getIndex(bm1,4599029,4599145);
hist(double(getMappingQuality(bm1,i)))
title('Mapping Quality of the reads between 4599029 and 4599145')
xlabel('Phred Quality Score')
ylabel('Number of Reads')
Most of the large peaks in this data set occur due to satellite repeat regions or
due to their closeness to the centromere [4], and show characteristics similar to
the example just explored. You may explore other regions with large peaks
using the same procedure.
To avoid these problematic regions, two techniques are used. First, given
that the provided data set uses paired-end sequencing, removing the reads
that are not aligned in a proper pair reduces the number of potential aligner
errors or ambiguities. You can achieve this by exploring the flag field of the
SAM-formatted file, in which the second least significant bit is used to indicate
if the short read is mapped in a proper pair.
i = find(bitget(getFlag(bm1),2));
bm1_filtered = getSubset(bm1,i)
bm1_filtered =
BioMap
Properties:
SequenceDictionary: {'Chr1'}
Reference: [3040724x1 File indexed property]
Signature: [3040724x1 File indexed property]
Start: [3040724x1 File indexed property]
Second, consider only uniquely mapped reads. You can detect reads that are
equally mapped to different regions of the reference sequence by looking at
the mapping quality, because BWA assigns a lower mapping quality (less
than 60) to this type of short read.
i = find(getMappingQuality(bm1_filtered)==60);
bm1_filtered = getSubset(bm1_filtered,i)
bm1_filtered =
BioMap
Properties:
SequenceDictionary: {'Chr1'}
Reference: [2313252x1 File indexed property]
Signature: [2313252x1 File indexed property]
Start: [2313252x1 File indexed property]
MappingQuality: [2313252x1 File indexed property]
Flag: [2313252x1 File indexed property]
MatePosition: [2313252x1 File indexed property]
Quality: [2313252x1 File indexed property]
Sequence: [2313252x1 File indexed property]
Header: [2313252x1 File indexed property]
NSeqs: 2313252
Name: ''
Visualize the filtered data set again using both a coarse resolution with 1000
bp bins for the whole chromosome and a fine resolution for a small region of
20,000 bp. Most of the large peaks due to artifacts have been removed.
[cov,bin] = getBaseCoverage(bm1_filtered,x1,x2,'binWidth',1000,'binType','m
figure
plot(bin,cov)
axis([x1,x2,0,100]) % sets the axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
xlabel('Base Position')
ylabel('Depth')
title('Coverage in Chromosome 1 after Filtering')
p1 = 24275801-10000;
p2 = 24275801+10000;
figure
plot(p1:p2,getBaseCoverage(bm1_filtered,p1,p2))
xlim([p1,p2]) % sets the x-axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
xlabel('Base Position')
ylabel('Depth')
title('Coverage in Chromosome 1 after Filtering')
The read strand (forward or reverse) is captured in the fifth bit of the flag
field, according to the SAM file format.
fow_idx = find(~bitget(getFlag(bm1_filtered),5));
rev_idx = find(bitget(getFlag(bm1_filtered),5));
SAM-formatted files use the same header strings to identify pair mates. By
pairing the header strings, you can determine how the short reads in BioMap
are paired. To pair the header strings, sort them in ascending order
and use the sorting indices (hf and hr) to link the unsorted header strings.
[~,hf] = sort(getHeader(bm1_filtered,fow_idx));
[~,hr] = sort(getHeader(bm1_filtered,rev_idx));
mate_idx = zeros(numel(fow_idx),1);
mate_idx(hf) = rev_idx(hr);
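The indexing in the last two lines is compact, so a toy example may help. The sketch below pairs three forward and three reverse reads by header in base MATLAB. The headers and indices are invented, and it assumes, as the code above does, that every forward read has exactly one reverse mate with an identical header.

```matlab
% Hypothetical positions of forward and reverse reads within one object.
fow_idx = [2; 5; 6];
rev_idx = [1; 3; 4];

% Headers of those reads, in the same order as the index vectors.
fowHeaders = {'r3'; 'r1'; 'r2'};
revHeaders = {'r2'; 'r3'; 'r1'};

% Sorting both header lists puts mates at the same sorted rank.
[~, hf] = sort(fowHeaders);
[~, hr] = sort(revHeaders);

% The forward read at sorted rank k is mated with the reverse read at rank k.
mate_idx = zeros(numel(fow_idx), 1);
mate_idx(hf) = rev_idx(hr);

disp([fow_idx mate_idx])   % each row: forward read index, mate index
```

Here the forward read at index 2 (header r3) is paired with the reverse read at index 3, which carries the same header.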
Use the resulting fow_idx and mate_idx variables to retrieve pair mates. For
example, retrieve the paired-end reads for the first 10 fragments.
for j = 1:10
disp(getInfo(bm1_filtered, fow_idx(j)))
disp(getInfo(bm1_filtered, mate_idx(j)))
end
Use the paired-end indices to construct a new BioMap with the minimal
information needed to represent the sequencing fragments. First, calculate
the insert sizes.
J = getStop(bm1_filtered, fow_idx);
K = getStart(bm1_filtered, mate_idx);
L = K - J - 1;
Obtain the new signature (or CIGAR string) for each fragment by using the
short read original signatures separated by the appropriate number of skip
CIGAR symbols (N).
n = numel(L);
cigars = cell(n,1);
for i = 1:n
cigars{i} = sprintf('%dN' ,L(i));
end
cigars = strcat( getSignature(bm1_filtered, fow_idx),...
cigars,...
getSignature(bm1_filtered, mate_idx));
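To make the construction concrete, here is a small sketch with made-up alignments in place of the BioMap queries. It assumes that the mates do not overlap, so that the gap length is positive; all values are hypothetical.

```matlab
% Hypothetical alignments for two read pairs (values are illustrative only).
fwdSignature  = {'36M'; '36M'};   % CIGAR strings of the forward reads
mateSignature = {'36M'; '36M'};   % CIGAR strings of their mates
J = [136; 250];                   % stop positions of the forward reads
K = [200; 300];                   % start positions of the mates

L = K - J - 1;                    % unsequenced bases between the two reads

% Build the 'N' (skip) segment for each pair.
skips = cell(numel(L),1);
for i = 1:numel(L)
    skips{i} = sprintf('%dN',L(i));
end

% Join forward read, gap, and mate into one fragment signature.
cigars = strcat(fwdSignature,skips,mateSignature);
```

For these inputs the fragment signatures are '36M63N36M' and '36M49N36M'.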
J = getStart(bm1_filtered,fow_idx);
K = getStop(bm1_filtered,mate_idx);
L = K - J + 1;
figure
hist(double(L),100)
title(sprintf('Fragment Size Distribution\n %d Paired-end Fragments Mapped
xlabel('Fragment Size')
ylabel('Count')
bm1_fragments = BioMap('Sequence',seqs,'Signature',cigars,'Start',J)
bm1_fragments =
BioMap
Properties:
SequenceDictionary: {0x1 cell}
Reference: {0x1 cell}
Signature: {1156626x1 cell}
Start: [1156626x1 uint32]
MappingQuality: [0x1 uint8]
Flag: [0x1 uint16]
MatePosition: [0x1 uint32]
Quality: {0x1 cell}
Sequence: {1156626x1 cell}
Header: {0x1 cell}
NSeqs: 1156626
Name: ''
cov_reads = getBaseCoverage(bm1_filtered,x1,x2,'binWidth',1000,'binType','m
[cov_fragments,bin] = getBaseCoverage(bm1_fragments,x1,x2,'binWidth',1000,'
figure
plot(bin,cov_reads,bin,cov_fragments)
xlim([x1,x2]) % sets the x-axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage Comparison')
legend('Short Reads','Fragments')
p1 = 1;
p2 = 200000;
cov_reads = getBaseCoverage(bm1_filtered,p1,p2);
[cov_fragments,bin] = getBaseCoverage(bm1_fragments,p1,p2);
chr1 = fastaread('ach1.fasta');
mp1 = regexp(chr1.Sequence(p1:p2),'CA..TG')+3+p1;
mp2 = regexp(chr1.Sequence(p1:p2),'GT..AC')+3+p1;
motifs = [mp1 mp2];
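The motif search relies on regexp returning the start index of each match within the substring it is given. The sketch below shows that behavior on a short made-up sequence; the variable offset is a hypothetical coordinate for the first base of the substring, and the conversion shown maps match starts to reference coordinates (the code above applies its own offset to the match positions).

```matlab
% Hypothetical 19-base sequence containing two CANNTG (E-box) sites.
seq = 'AACACGTGTTTCAGCTGAA';

% '.' matches any single character, so 'CA..TG' matches CANNTG.
hits = regexp(seq,'CA..TG');      % start index of each match within seq

% Map local indices to reference coordinates (offset is an assumption).
offset = 5000;                    % coordinate of seq(1) in the reference
genomicStart = hits + offset - 1;
```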
figure
plot(bin,cov_reads,bin,cov_fragments)
hold on
plot([1;1;1]*motifs,[0;max(ylim);NaN],'r')
xlim([111000 114000]) % sets the x-axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage Comparison')
legend('Short Reads','Fragments','E-box motif')
Observe that it is not possible to associate each peak in the coverage signals
with an E-box motif. This is because the length of the sequencing fragments is
comparable to the average motif distance, blurring peaks that are close. Plot
the distribution of the distances between the E-box motif sites.
motif_sep = diff(sort(motifs));
figure
hist(motif_sep(motif_sep<500),50)
title('Distance (bp) between adjacent E-box motifs')
xlabel('Distance (bp)')
ylabel('Counts')
Use the function mspeaks to perform peak detection with wavelet denoising
on the coverage signal of the fragment alignments. Filter putative ChIP peaks
using a height filter to remove peaks that are not enriched by the binding
process under consideration.
putative_peaks = mspeaks(bin,cov_fragments,'noiseestimator',20,...
'heightfilter',10,'showplot',true);
hold on
plot([1;1;1]*motifs(motifs>p1 & motifs<p2),[0;max(ylim);NaN],'r')
xlim([111000 114000]) % sets the x-axis limits
fixGenomicPositionLabels % formats tick labels and adds datacursors
legend('Coverage from Fragments','Wavelet Denoised Coverage','Putative ChIP
xlabel('Base position')
ylabel('Depth')
title('ChIP-Seq Peak Detection')
Use the knnsearch function to find the closest motif to each one of the
putative peaks. As expected, most of the enriched ChIP peaks are close to an
E-box motif [1]. This reinforces the importance of performing peak detection
at the finest resolution possible (bp resolution) when the expected density of
binding sites is high, as it is in the case of the E-box motif. This example also
illustrates that for this type of analysis, paired-end sequencing should be
considered over single-end sequencing [1].
h = knnsearch(motifs',putative_peaks(:,1));
distance = putative_peaks(:,1)-motifs(h(:))';
figure
hist(distance(abs(distance)<200),50)
title('Distance to Closest E-box Motif for Each Detected Peak')
xlabel('Distance (bp)')
ylabel('Counts')
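For readers who want to see what the nearest-motif lookup computes, the following sketch reproduces it with base MATLAB functions on made-up positions. It is not how knnsearch is implemented; it builds the full distance matrix, which is practical only for small inputs.

```matlab
% Hypothetical motif positions (row) and detected peak locations (column).
motifs = [100 250 400];
peaks  = [110; 390; 260];

% D(i,j) is the absolute distance from peak i to motif j.
D = abs(bsxfun(@minus,peaks,motifs));

% For each peak, h holds the index of its closest motif.
[~,h] = min(D,[],2);

% Signed distance from each peak to its closest motif.
distance = peaks - motifs(h)';
```

Here the peaks at 110, 390, and 260 are matched to the motifs at 100, 400, and 250, giving signed distances of 10, -10, and 10 bp.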
References
[1] Wang C., Xu J., Zhang D., Wilson Z.A., and Zhang D. "An effective
approach for identification of in vivo protein-DNA binding sites from
paired-end ChIP-Seq data", BMC Bioinformatics, 11:81, Feb 9, 2010.
[2] Li H. and Durbin R. "Fast and accurate short read alignment with
Burrows-Wheeler transform", Bioinformatics, 25, pp 1754-60, 2009.
[3] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N.,
Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing
Subgroup "The Sequence Alignment/map (SAM) Format and SAMtools",
Bioinformatics, 25, pp 2078-2079, 2009.
[6] Ramsey SA, Knijnenburg TA, Kennedy KA, Zak DE, Gilchrist M, Gold
ES, Johnson CD, Lampano AE, Litvak V, Navarro G, Stolyar T, Aderem A,
Shmulevich I. "Genome-wide histone acetylation data improve prediction of
mammalian transcription factor binding sites", Bioinformatics, 26(17), pp
2071-5, Sep 1, 2010.
Note: For enhanced performance, MathWorks recommends that you run this
example on a 64-bit platform, because the memory footprint is close to 2 GB.
On a 32-bit platform, if you receive "Out of memory" errors when running
this example, try increasing the virtual memory (or swap space) of your
operating system or try setting the 3GB switch (32-bit Windows® XP only).
These techniques are described in this document.
Introduction
In this example you will explore the DNA methylation profiles of two
human cancer cells: parental HCT116 colon cancer cells and DICERex5
cells. DICERex5 cells are derived from HCT116 cells after the truncation of
the DICER1 alleles. Serre et al. in [1] proposed to study DNA methylation
profiles by using the MBD2 protein as a methyl CpG binding domain and
subsequently used high-throughput sequencing (HTseq). This technique
is commonly known as MBD-Seq. Short reads for two replicates of the two
samples have been submitted to NCBI’s SRA archive by the authors of [1].
There are other technologies available to interrogate DNA methylation status
of CpG sites in combination with HTseq, for example MeDIP-seq or the use of
restriction enzymes. You can also analyze these types of data sets by following the
approach presented in this example.
Data Sets
You can obtain the unmapped single-end reads for four sequencing
experiments from the NCBI FTP site. Short reads were produced using
Illumina®’s Genome Analyzer II. Average insert size is 120 bp, and the length
of short reads is 36 bp.
Exploring Genome-wide Differences in DNA Methylation Profiles
(2) produced SAM-formatted files by mapping the short reads to the reference
human genome (NCBI Build 37.5) using the Bowtie [2] algorithm. Only
uniquely mapped reads are reported.
(3) compressed the SAM-formatted files to BAM and ordered them by
reference name first, then by genomic position by using SAMtools [3].
This example also assumes that you downloaded the reference human genome
(GRCh37.p5). You can use the bowtie-inspect command to reconstruct the
human reference directly from the bowtie indices. Or you may download the
reference from the NCBI repository by uncommenting the following line:
% getgenbank('NC_000009','FileFormat','fasta','tofile','hsch9.fasta');
To explore the signal coverage of the HCT116 samples you need to construct a
BioMap. BioMap has an interface that provides direct access to the mapped
short reads stored in the BAM-formatted file, thus minimizing the amount
of data that is actually loaded into memory. Use the function baminfo to
obtain a list of the existing references and the actual number of short reads
mapped to each one.
info = baminfo('SRR030224.bam','ScanDictionary',true);
fprintf('%-35s%s\n','Reference','Number of Reads');
for i = 1:numel(info.ScannedDictionary)
fprintf('%-35s%d\n',info.ScannedDictionary{i},...
info.ScannedDictionaryCount(i));
end
gi|224589811|ref|NC_000002.11| 187019
gi|224589815|ref|NC_000003.11| 73986
gi|224589816|ref|NC_000004.11| 84033
gi|224589817|ref|NC_000005.9| 96898
gi|224589818|ref|NC_000006.11| 87990
gi|224589819|ref|NC_000007.13| 120816
gi|224589820|ref|NC_000008.10| 111229
gi|224589821|ref|NC_000009.11| 106189
gi|224589801|ref|NC_000010.10| 112279
gi|224589802|ref|NC_000011.9| 104466
gi|224589803|ref|NC_000012.11| 87091
gi|224589804|ref|NC_000013.10| 53638
gi|224589805|ref|NC_000014.8| 64049
gi|224589806|ref|NC_000015.9| 60183
gi|224589807|ref|NC_000016.9| 146868
gi|224589808|ref|NC_000017.10| 195893
gi|224589809|ref|NC_000018.9| 60344
gi|224589810|ref|NC_000019.9| 166420
gi|224589812|ref|NC_000020.10| 148950
gi|224589813|ref|NC_000021.8| 310048
gi|224589814|ref|NC_000022.10| 76037
gi|224589822|ref|NC_000023.10| 32421
gi|224589823|ref|NC_000024.9| 18870
gi|17981852|ref|NC_001807.4| 1015
Unmapped 6805842
bm_hct116_1 = BioMap('SRR030224.bam','SelectRef','gi|224589821|ref|NC_00000
bm_hct116_2 = BioMap('SRR030225.bam','SelectRef','gi|224589821|ref|NC_00000
bm_hct116_1 =
SequenceDictionary: 'gi|224589821|ref|NC_000009.11|'
Reference: [106189x1 File indexed property]
Signature: [106189x1 File indexed property]
bm_hct116_2 =
SequenceDictionary: 'gi|224589821|ref|NC_000009.11|'
Reference: [107586x1 File indexed property]
Signature: [107586x1 File indexed property]
Start: [107586x1 File indexed property]
MappingQuality: [107586x1 File indexed property]
Flag: [107586x1 File indexed property]
MatePosition: [107586x1 File indexed property]
Quality: [107586x1 File indexed property]
Sequence: [107586x1 File indexed property]
Header: [107586x1 File indexed property]
NSeqs: 107586
Name:
figure
ha = gca;
hold on
n = 141213431; % length of chromosome 9
[cov,bin] = getBaseCoverage(bm_hct116_1,1,n,'binWidth',100);
h1 = plot(bin,cov,'b'); % plots the binned coverage of bm_hct116_1
[cov,bin] = getBaseCoverage(bm_hct116_2,1,n,'binWidth',100);
h2 = plot(bin,cov,'g'); % plots the binned coverage of bm_hct116_2
Because short reads represent the methylated regions of the DNA, there
is a correlation between aligned coverage and DNA methylation. Observe
the increased DNA methylation close to the chromosome telomeres; it is
known that there is an association between DNA methylation and the role of
telomeres for maintaining the integrity of the chromosomes. In the coverage
plot you can also see a long gap over the chromosome centromere. This is due
to the repetitive sequences present in the centromere, which prevent us from
aligning short reads to a unique position in this region. For the data sets used
in this example, only about 30% of the short reads were uniquely mapped to
the reference genome.
Load the human chromosome 9 from the reference file hs37.fasta. For this
example, it is assumed that you recovered the reference from the Bowtie
indices using the bowtie-inspect command; therefore hs37.fasta contains
all the human chromosomes. To load only chromosome 9, you can use the
BLOCKREAD name-value pair with the fastaread function.
chr9 = fastaread('hs37.fasta','blockread',9)
chr9 =
Use the cpgisland function to find the CpG clusters. Using the standard
definition for CpG islands [4] (islands of 200 bp or more with a
CpGobserved/CpGexpected ratio of 60% or greater) leads to 1682 CpG islands
found in chromosome 9.
cpgi = cpgisland(chr9.Sequence)
cpgi =
Use the getCounts method to calculate the ratio of aligned bases that are
inside CpG islands. For the first replicate of the sample HCT116, the ratio
is close to 45%.
aligned_bases_in_CpG_islands = getCounts(bm_hct116_1,cpgi.Starts,cpgi.Stops
aligned_bases_total = getCounts(bm_hct116_1,1,n,'method','sum')
ratio = aligned_bases_in_CpG_islands ./ aligned_bases_total
aligned_bases_in_CpG_islands =
1724363
aligned_bases_total =
3822804
ratio =
0.4511
You can explore high resolution coverage plots of the two sample replicates
and observe how the signal correlates with the CpG islands. For example,
explore the region between 23,820,000 and 23,830,000 bp. This is the 5’ region
of the human gene ELAVL2.
To find regions that contain more mapped reads than would be expected by
chance, you can follow a similar approach to the one described by Serre et al.
First, use the getCounts method to count the number of mapped reads
that start at each window. In this example you use a binning approach
that considers only the start position of every mapped read, following the
approach of Serre et al. However, you may also use the OVERLAP and METHOD
name-value pairs in getCounts to compute more accurate statistics. For
instance, to obtain the maximum coverage for each window considering base
pair resolution, set OVERLAP to 1 and METHOD to MAX.
counts_1 = getCounts(bm_hct116_1,w,w+99,'independent',true,'overlap','start
counts_2 = getCounts(bm_hct116_2,w,w+99,'independent',true,'overlap','start
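The binning idea, counting each read once in the window that contains its start position, can be sketched with base MATLAB functions. The read positions, the reference length n, and the window vector w below are all made-up values for illustration; this is not a substitute for getCounts on a BioMap.

```matlab
% Hypothetical read start positions on a 500-bp reference.
starts = [5; 50; 120; 199; 201; 450];
n = 500;                        % hypothetical reference length
w = 1:100:n;                    % start position of each 100-bp window

% Window index of each read: positions 1-100 -> 1, 101-200 -> 2, and so on.
bin = floor((starts-1)/100) + 1;

% Count how many reads start in each window (empty windows get 0).
counts = accumarray(bin,1,[numel(w) 1]);
```

Because every read contributes to exactly one window, the window counts sum to the number of reads.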
First, try to model the counts assuming that all the windows with counts are
biologically significant and therefore from the same distribution. Use the
negative binomial distribution to fit a model to the count data.
nbp = nbinfit(counts_1);
figure
hold on
emphist = histc(counts_1,0:100); % calculate the empirical distribution
bar(0:100,emphist./sum(emphist),'c','grouped') % plot histogram
plot(0:100,nbinpdf(0:100,nbp(1),nbp(2)),'b','linewidth',2); % plot fitted m
axis([0 50 0 .001])
legend('Empirical Distribution','Negative Binomial Fit')
ylabel('Frequency')
xlabel('Counts')
title('Frequency of counts for 100 bp windows (HCT116-1)')
The poor fit indicates that the observed distribution may be due to the
mixture of two models, one that represents the background and one that
represents the count data in methylated DNA windows.
Before fitting the real data, let us assess the fitting procedure with some
sampled data from a known distribution.
figure
hold on
emphist = histc(x,0:100); % Calculate the empirical distribution
bar(0:100,emphist./sum(emphist),'c','grouped') % plot histogram
h1 = plot(0:100,nbinpdf(0:100,nbphat1(1),nbphat1(2)),'b-o','linewidth',2);
h2 = plot(0:100,nbinpdf(0:100,nbphat2(1),nbphat2(2)),'r','linewidth',2);
h3 = plot(0:100,nbinpdf(0:100,nbphat3(1),nbphat3(2)),'g','linewidth',2);
axis([0 25 0 .2])
legend([h1 h2 h3],'Neg-binomial fitted to all data',...
'Neg-binomial fitted to truncated data',...
'Truncated neg-binomial fitted to truncated data')
ylabel('Frequency')
xlabel('Counts')
For the two replicates of the HCT116 sample, fit a right-truncated negative
binomial distribution to the observed null model using the rtnbinfit
anonymous function previously defined.
pval1 = 1 - nbincdf(counts_1,pn1(1),pn1(2));
pval2 = 1 - nbincdf(counts_2,pn2(1),pn2(2));
Calculate the false discovery rate using the mafdr function. Use the
name-value pair BHFDR to use the linear step-up (LSU) procedure ([6]) to
calculate the FDR adjusted p-values. Setting the FDR < 0.01 permits you to
identify the 100-bp windows that are significantly methylated.
fdr1 = mafdr(pval1,'bhfdr',true);
fdr2 = mafdr(pval2,'bhfdr',true);
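To illustrate what the linear step-up adjustment of [6] does, here is a minimal sketch written with base MATLAB functions on four made-up p-values. It is an illustration of the procedure, not the mafdr source code, and the 0.03 cutoff is chosen only to suit the toy data.

```matlab
% Hypothetical p-values for four windows.
p = [0.01; 0.04; 0.03; 0.005];
m = numel(p);

% Sort ascending and scale the k-th smallest p-value by m/k.
[ps,idx] = sort(p);
adj = ps .* m ./ (1:m)';

% Enforce monotonicity, working down from the largest p-value.
for k = m-1:-1:1
    adj(k) = min(adj(k),adj(k+1));
end
adj = min(adj,1);               % adjusted values cannot exceed 1

% Return the adjusted values to the original window order.
fdr = zeros(m,1);
fdr(idx) = adj;

significant = fdr < 0.03;       % logical vector of significant windows
```

For these inputs the adjusted values are approximately 0.02, 0.04, 0.04, and 0.02, so only the first and last windows pass the cutoff.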
Number_of_sig_windows_HCT116_1 = sum(w1)
Number_of_sig_windows_HCT116_2 = sum(w2)
Number_of_sig_windows_HCT116 = sum(w12)
Number_of_sig_windows_HCT116_1 =
1662
Number_of_sig_windows_HCT116_2 =
1674
Number_of_sig_windows_HCT116 =
1346
Overall, you identified 1662 and 1674 non-overlapping 100-bp windows in the
two replicates of the HCT116 samples that show significant evidence of DNA
methylation. There are 1346 windows that are significant in
both replicates.
For example, looking again in the promoter region of the ELAVL2 human
gene you can observe that in both sample replicates, multiple 100-bp windows
have been marked significant.
GFFfilename = ensemblmart2gff('ensemblmart_genes_hum37.txt');
a = GFFAnnotation(GFFfilename)
a9 = getSubset(a,'reference','9')
numGenes = a9.NumEntries
a =
a9 =
numGenes =
800
Find the promoter regions for each gene. In this example we consider the
proximal promoter as the -500/100 upstream region.
downstream = 500;
upstream = 100;
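One way to turn such a window into genomic coordinates is sketched below for two made-up genes. The sketch assumes that the -500/100 window means 500 bp before and 100 bp after the transcription start, measured along the gene's own strand; the names bpBefore and bpAfter and all coordinates are hypothetical.

```matlab
% Hypothetical genes: one on the forward strand, one on the reverse strand.
geneStart = [10000; 50000];
geneStop  = [12000; 53000];
strand    = ['+'; '-'];

bpBefore = 500;                 % bases before the transcription start
bpAfter  = 100;                 % bases after the transcription start

isFwd = strand == '+';
promoterStart = zeros(size(geneStart));
promoterStop  = zeros(size(geneStart));

% Forward strand: transcription starts at the gene's Start coordinate.
promoterStart(isFwd) = geneStart(isFwd) - bpBefore;
promoterStop(isFwd)  = geneStart(isFwd) + bpAfter;

% Reverse strand: transcription starts at the gene's Stop coordinate.
promoterStart(~isFwd) = geneStop(~isFwd) - bpAfter;
promoterStop(~isFwd)  = geneStop(~isFwd) + bpBefore;
```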
promoters = dataset({a9.Feature,'Gene'});
promoters.Strand = char(a9.Strand);
promoters.Start = promoterStart';
promoters.Stop = promoterStop';
promoters.Counts_1 = getCounts(bm_hct116_1,promoters.Start,promoters.Stop,.
'overlap',1,'independent',true);
promoters.Counts_2 = getCounts(bm_hct116_2,promoters.Start,promoters.Stop,.
'overlap',1,'independent',true);
Fit a null distribution for each sample replicate and compute the p-values:
Ratio_of_sig_methylated_promoters = Number_of_sig_promoters./numGenes
Number_of_sig_promoters =
74
Ratio_of_sig_methylated_promoters =
0.0925
[~,order] = sort(promoters.pval_1.*promoters.pval_2);
promoters(order(1:30),[1 2 3 4 5 7 6 8])
ans =
2.0943e-06 47 3.0746e-05
1.7771e-06 42 6.8037e-05
4.7762e-06 46 3.6016e-05
Serre et al. [1] reported that, in these data sets, approximately 90% of the
uniquely mapped reads fall outside the 5’ gene promoter regions. Using
a similar approach as before, you can find genes that have intergenic
methylated regions. To compensate for the varying lengths of the genes, you
can use the maximum coverage, computed base-by-base, instead of the raw
number of mapped short reads. Alternatively, to normalize the counts by
gene length, set the METHOD name-value pair to rpkm
in the getCounts function.
intergenic = dataset({a9.Feature,'Gene'});
intergenic.Strand = char(a9.Strand);
intergenic.Start = a9.Start;
intergenic.Stop = a9.Stop;
intergenic.Counts_1 = getCounts(bm_hct116_1,intergenic.Start,intergenic.Sto
'overlap','full','method','max','independent',true);
intergenic.Counts_2 = getCounts(bm_hct116_2,intergenic.Start,intergenic.Sto
'overlap','full','method','max','independent',true);
trun = 10; % Set a truncation threshold
pn1 = rtnbinfit(intergenic.Counts_1(intergenic.Counts_1<trun),trun); % Fit
pn2 = rtnbinfit(intergenic.Counts_2(intergenic.Counts_2<trun),trun); % Fit
intergenic.pval_1 = 1 - nbincdf(intergenic.Counts_1,pn1(1),pn1(2)); % p-val
intergenic.pval_2 = 1 - nbincdf(intergenic.Counts_2,pn2(1),pn2(2)); % p-val
Ratio_of_sig_methylated_genes = Number_of_sig_genes./numGenes
[~,order] = sort(intergenic.pval_1.*intergenic.pval_2);
intergenic(order(1:30),[1 2 3 4 5 7 6 8])
Number_of_sig_genes =
62
Ratio_of_sig_methylated_genes =
0.0775
ans =
For instance, explore the methylation profile of the BARX1 gene, the sixth
significant gene with intergenic methylation in the previous list. The GTF
barx1 = GTFAnnotation('ensemblmart_barx1.gtf')
transcripts = getTranscriptNames(barx1)
barx1 =
transcripts =
'ENST00000253968'
'ENST00000401724'
Plot the DNA methylation profile for both HCT116 sample replicates with
base-pair resolution. Overlay the CpG islands and plot the exons for each of
the two transcripts along the bottom of the plot.
range = barx1.getRange;
r1 = range(1)-1000; % set the region limits
r2 = range(2)+1000;
figure
hold on
% plot high-resolution coverage of bm_hct116_1
h1 = plot(r1:r2,getBaseCoverage(bm_hct116_1,r1,r2,'binWidth',1),'b');
% plot high-resolution coverage of bm_hct116_2
h2 = plot(r1:r2,getBaseCoverage(bm_hct116_2,r1,r2,'binWidth',1),'g');
In the study by Serre et al. another cell line is also analyzed. New cells
(DICERex5) are derived from the same HCT116 colon cancer cells after
truncating the DICER1 alleles. It has been reported in the literature [5] that
there is a localized change of DNA methylation at a small number of gene
promoters. In this example, you will find significant 100-bp windows in two
sample replicates of the DICERex5 cells following the same approach as for the
parental HCT116 cells, and then you will search for statistically significant
differences between the two cell lines.
bm_dicer_1 = BioMap('SRR030222.bam','SelectRef','gi|224589821|ref|NC_000009
bm_dicer_2 = BioMap('SRR030223.bam','SelectRef','gi|224589821|ref|NC_000009
[counts_3,pval3,fdr3] = getWindowCounts(bm_dicer_1,4,w,100);
[counts_4,pval4,fdr4] = getWindowCounts(bm_dicer_2,4,w,100);
w3 = fdr3<.01; % logical vector indicating significant windows in DICERex5_
w4 = fdr4<.01; % logical vector indicating significant windows in DICERex5-
w34 = w3 & w4; % logical vector indicating significant windows in both repl
Number_of_sig_windows_DICERex5_1 = sum(w3)
Number_of_sig_windows_DICERex5_2 = sum(w4)
Number_of_sig_windows_DICERex5 = sum(w34)
Number_of_sig_windows_DICERex5_1 =
908
Number_of_sig_windows_DICERex5_2 =
1041
Number_of_sig_windows_DICERex5 =
759
To perform a differential analysis you use the 100-bp windows that are
significant in at least one of the samples (either HCT116 or DICERex5).
Use the function manorm to normalize the data. The PERCENTILE name-value
pair lets you filter out windows with a very large number of counts while
normalizing, since these windows are mainly due to artifacts, such as
repetitive regions in the reference chromosome.
counts_norm = round(manorm(counts,'percentile',90).*100);
[~,ord] = sort(pval);
fprintf('Window Pos Type p-value HCT116 DICERe
for i = 1:25
j = ord(i);
[~,msg] = findClosestGene(a9,[ws(j) ws(j)+99]);
fprintf('%10d %-25s %7.6f%5d%5d %5d%5d\n', ...
ws(j),msg,pval(j),counts_norm(j,:));
end
Plot the DNA methylation profile for the promoter region of gene FAM189A2,
the most significant differentially covered promoter region from the previous
list. Overlay the CpG islands and the FAM189A2 gene.
range = getRange(getSubset(a9,'Feature','FAM189A2'));
r1 = range(1)-1000;
r2 = range(2)+1000;
figure
hold on
% plot high-resolution coverage of all replicates
h1 = plot(r1:r2,getBaseCoverage(bm_hct116_1,r1,r2,'binWidth',1),'b');
h2 = plot(r1:r2,getBaseCoverage(bm_hct116_2,r1,r2,'binWidth',1),'g');
h3 = plot(r1:r2,getBaseCoverage(bm_dicer_1,r1,r2,'binWidth',1),'r');
h4 = plot(r1:r2,getBaseCoverage(bm_dicer_2,r1,r2,'binWidth',1),'m');
Observe that the CpG islands are clearly unmethylated for both of the
DICERex5 replicates.
References
[1] Serre, D., Lee, B.H., and Ting A.H. "MBD-isolated Genome Sequencing
provides a high-throughput and comprehensive survey of DNA methylation in
the human genome", Nucleic Acids Research, 38(2), pp 391-399, 2010.
[2] Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. "Ultrafast and
Memory-efficient Alignment of Short DNA Sequences to the Human Genome",
Genome Biology, 10:R25, pp 1-10, 2009.
[3] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N.,
Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing
Subgroup "The Sequence Alignment/map (SAM) Format and SAMtools",
Bioinformatics, 25, pp 2078-2079, 2009.
[5] Ting, A.H., Suzuki, H., Cope, L., Schuebel, K.E., Lee, B.H., Toyota, M.,
Imai, K., Shinomura, Y., Tokino, T. and Baylin, S.B. "A Requirement for
DICER to Maintain Full Promoter CpG Island Hypermethylation in Human
Cancer Cells", Cancer Research, 68, 2570, April 15, 2008.
[6] Benjamini, Y., Hochberg, Y., "Controlling the false discovery rate: a
practical and powerful approach to multiple testing", Journal of the Royal
Statistical Society, 57, pp 289-300, 1995.
3
Sequence Analysis
Sequence analysis is the process you use to find information about a nucleotide
or amino acid sequence using computational methods. Common tasks in
sequence analysis are identifying genes, determining the similarity of two
genes, determining the protein coded by a gene, and determining the function
of a gene by finding a similar gene in another organism with a known function.
Overview of Example
After sequencing a piece of DNA, one of the first tasks is to investigate the
nucleotide content in the sequence. Starting with a DNA sequence, this
example uses sequence statistics functions to determine mono-, di-, and
trinucleotide content, and to locate open reading frames.
First, research information about the human mitochondria and find the
nucleotide sequence for the genome. Next, look at the nucleotide content for
the entire sequence. And finally, determine open reading frames and extract
specific gene sequences.
1 Use the MATLAB Help browser to explore the Web. In the MATLAB
Command Window, type
web('http://www.ncbi.nlm.nih.gov/')
Exploring a Nucleotide Sequence Using Command Line
A separate browser window opens with the home page for the NCBI Web
site.
2 Search the NCBI Web site for information. For example, to search for the
human mitochondrion genome, from the Search list, select Genome, and in
the Search list, enter mitochondrion homo sapiens.
3 Select a result page. For example, click the link labeled NC_012920.
The MATLAB Help browser displays the NCBI page for the human
mitochondrial genome.
The consensus sequence for the human mitochondrial genome has the
GenBank accession number NC_012920. Since the whole GenBank entry is
quite large and you might only be interested in the sequence, you can get
just the sequence information.
mitochondria = getgenbank('NC_012920','SequenceOnly',true)
mitochondria =
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
AAGT . . .
2 If you don’t have a Web connection, you can load the data from a MAT file
included with the Bioinformatics Toolbox software, using the command
load mitochondria
The load function loads the sequence mitochondria into the MATLAB
Workspace.
whos mitochondria
After you read a sequence into the MATLAB environment, you can use
the sequence statistics functions to determine if your sequence has the
characteristics of a protein-coding region. This procedure uses the human
mitochondrial genome as an example. See “Reading Sequence Information
from the Web” on page 3-5.
ntdensity(mitochondria)
basecount(mitochondria)
ans =
A: 5124
C: 5181
G: 2169
T: 4094
basecount(seqrcomplement(mitochondria))
ans =
A: 4094
C: 2169
G: 5181
T: 5124
4 Use the function basecount with the chart option to visualize the
nucleotide distribution.
figure
basecount(mitochondria,'chart','pie');
5 Count the dimers in a sequence and display the information in a bar chart.
figure
dimercount(mitochondria,'chart','bar')
ans =
AA: 1604
AC: 1495
AG: 795
AT: 1230
CA: 1534
CC: 1771
CG: 435
CT: 1440
GA: 613
GC: 711
GG: 425
GT: 419
TA: 1373
TC: 1204
TG: 513
TT: 1004
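The base counts reported earlier in this procedure show that the reverse complement swaps the A and T counts and the C and G counts. The following sketch checks that relationship on a short made-up sequence using only base MATLAB functions, as a stand-in for basecount and seqrcomplement when the toolbox is not at hand.

```matlab
% Hypothetical short DNA sequence.
seq = 'GATCACAGG';

% Count each base in the order A, C, G, T.
counts = [sum(seq=='A') sum(seq=='C') sum(seq=='G') sum(seq=='T')];

% Complement every base (masks are taken from the original sequence),
% then reverse the result to obtain the reverse complement.
comp = seq;
comp(seq=='A') = 'T';
comp(seq=='T') = 'A';
comp(seq=='C') = 'G';
comp(seq=='G') = 'C';
rc = fliplr(comp);

rcCounts = [sum(rc=='A') sum(rc=='C') sum(rc=='G') sum(rc=='T')];
```

With the counts ordered A, C, G, T, the counts of the reverse complement are the original counts in reverse order.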
After you read a sequence into the MATLAB environment, you can analyze
the sequence for codon composition. This procedure uses the human
mitochondrial genome as an example.
codoncount(mitochondria)
2 Count the codons in all six reading frames and plot the results in heat maps.
After you read a sequence into the MATLAB environment, you can analyze
the sequence for open reading frames. This procedure uses the human
mitochondria genome as an example. See “Reading Sequence Information
from the Web” on page 3-5.
seqshoworfs(mitochondria);
If you compare this output to the genes shown on the NCBI page for
NC_012920, there are fewer genes than expected. This is because vertebrate
mitochondria use a genetic code slightly different from the standard genetic
code. For a list of genetic codes, see the Genetic Code table in the aa2nt
reference page.
orfs= seqshoworfs(mitochondria,...
'GeneticCode','Vertebrate Mitochondrial',...
'alternativestart',true);
Notice that there are now two large ORFs on the third reading frame. One
starts at position 4470 and the other starts at 5904. These correspond to
the ND2 (NADH dehydrogenase subunit 2) and COX1 (cytochrome c oxidase
subunit I) genes.
3 Find the corresponding stop codon. The start and stop positions of each ORF
share the same index in the fields Start and Stop.
ND2Start = 4470;
StartIndex = find(orfs(3).Start == ND2Start)
ND2Stop = orfs(3).Stop(StartIndex)
ND2Stop =
5511
4 Using the sequence indices for the start and stop of the gene, extract the
subsequence from the sequence.
ND2Seq = mitochondria(ND2Start:ND2Stop)
attaatcccctggcccaacccgtcatctactctaccatctttgcaggcac
actcatcacagcgctaagctcgcactgattttttacctgagtaggcctag
aaataaacatgctagcttttattccagttctaaccaaaaaaataaaccct
cgttccacagaagctgccatcaagtatttcctcacgcaagcaaccgcatc
cataatccttc . . .
codoncount(ND2Seq)
The codon count shows a high amount of ACC, ATA, CTA, and ATC.
6 Look up the amino acids for codons ATA, CTA, ACC, and ATC.
aminolookup('code',nt2aa('ATA'))
aminolookup('code',nt2aa('CTA'))
aminolookup('code',nt2aa('ACC'))
aminolookup('code',nt2aa('ATC'))
Ile isoleucine
Leu leucine
Thr threonine
Ile isoleucine
After you locate an open reading frame (ORF) in a gene, you can convert it to
an amino acid sequence and determine its amino acid composition. This procedure
uses the human mitochondria genome as an example. See “Open Reading
Frames” on page 3-15.
ND2AASeq = nt2aa(ND2Seq,'geneticcode',...
'Vertebrate Mitochondrial')
MNPLAQPVIYSTIFAGTLITALSSHWFFTWVGLEMNMLAFIPVLTKKMNP
RSTEAAIKYFLTQATASMILLMAILFNNMLSGQWTMTNTTNQYSSLMIMM
AMAMKLGMAPFHFWVPEVTQGTPLTSGLLLLTWQKLAPISIMYQISPSLN
VSLLLTLSILSIMAGSWGGLNQTQLRKILAYSSITHMGWMMAVLPYNPNM
TILNLTIYIILTTTAFLLLNLNSSTTTLLLSRTWNKLTWLTPLIPSTLLS
LGGLPPLTGFLPKWAIIEEFTKNNSLIIPTIMATITLLNLYFYLRLIYST
SITLLPMSNNVKMKWQFEHTKPTPFLPTLIALTTLLLPISPFMLMIL
ND2protein = getgenpept('YP_003024027','sequenceonly',true)
The getgenpept function retrieves the published conversion from the NCBI
database and reads it into the MATLAB Workspace.
aacount(ND2AASeq, 'chart','bar')
A bar graph displays. Notice the high content for leucine, threonine and
isoleucine, and also notice the lack of cysteine and aspartic acid.
atomiccomp(ND2AASeq)
molweight(ND2AASeq)
ans =
C: 1818
H: 2882
N: 420
O: 471
S: 25
ans =
3.8960e+004
If this sequence were unknown, you could use this information to identify
the protein by comparing it with the atomic composition of other proteins
in a database.
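As a consistency check, the molecular weight can be estimated from the atomic composition alone. The sketch below uses the atom counts reported above and rounded standard atomic masses (the mass values are an assumption of this sketch, not toolbox output):

```matlab
% Atom counts reported by atomiccomp for the ND2 protein
comp = struct('C', 1818, 'H', 2882, 'N', 420, 'O', 471, 'S', 25);

% Rounded average atomic masses in g/mol (assumed values)
mass = struct('C', 12.011, 'H', 1.008, 'N', 14.007, 'O', 15.999, 'S', 32.06);

elements = fieldnames(comp);
mw = 0;
for k = 1:numel(elements)
    e  = elements{k};
    mw = mw + comp.(e) * mass.(e);   % contribution of each element
end
fprintf('Estimated molecular weight: %.0f Da\n', mw);
```

The estimate lands within a few daltons of the 3.8960e+004 value returned by molweight; the small difference comes from the rounded atomic masses.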
seqviewer
The Sequence Viewer opens without a sequence loaded. Notice that the
panes to the right and bottom are blank.
Exploring a Nucleotide Sequence Using the Sequence Viewer App
2 To retrieve a sequence from the NCBI database, select File > Download
Sequence from > NCBI.
The MATLAB software accesses the NCBI database on the Web, loads
nucleotide sequence information for the accession number you entered,
and calculates some basic statistics.
1 In the left pane tree, click Comments. The right pane displays general
information about the sequence.
2 Now click Features. The right pane displays NCBI feature information,
including index numbers for a gene and any CDS sequences.
3 Click ORF to show the search results for ORFs in the six reading frames.
2 In the Find Word dialog box, type a sequence word or pattern, for example,
atg, and then click Find.
The Sequence Viewer searches and displays the location of the selected
word.
The Sequence Viewer displays the ORFs for the six reading frames in
the lower-right pane. Hover the cursor over a frame to display information
about it.
The ORF is highlighted to indicate the part of the sequence that is selected.
4 Select File > Import from Workspace. Type the name of a variable
with an exported ORF, for example, NM_000520_ORF_2, and then click
Import.
The Sequence Viewer adds a tab at the bottom for the new sequence
while leaving the original sequence open.
5 In the left pane, click Full Translation. Select Display > Amino Acid
Residue Display > One Letter Code.
The Sequence Viewer displays the amino acid sequence below the
nucleotide sequence.
seqviewer('close')
Explore a Protein Sequence Using the Sequence Viewer App
The Sequence Viewer accesses the NCBI database on the Web and loads
amino acid sequence information for the accession number you entered.
3 Select Display > Amino Acid Color Scheme, and then select Charge,
Function, Hydrophobicity, Structure, or Taylor. For example, select
Function.
The display colors change to highlight the selected property of the amino
acid residues (in this example, their function). The following table shows
color legends for the amino acid color schemes.
seqviewer('close')
References
[1] Taylor, W.R. (1997). Residual colours: a proposal for aminochromography.
Protein Engineering 10, 7, 743–746.
Sequence Alignment
In this section...
“Overview of Example”
“Find a Model Organism to Study”
“Retrieve Sequence Information from a Public Database”
“Search a Public Database for Related Genes”
“Locate Protein Coding Sequences”
“Compare Amino Acid Sequences”
Overview of Example
Determining the similarity between two sequences is a common task in
computational biology. Starting with a nucleotide sequence for a human gene,
this example uses alignment algorithms to locate and verify a corresponding
gene in a model organism.
First, research information about Tay-Sachs and the enzyme that is associated
with this disease, then find the nucleotide sequence for the human gene
that codes for the enzyme, and finally find a corresponding gene in another
organism to use as a model for study.
1 Use the MATLAB Help browser to explore the Web. In the MATLAB
Command Window, type
web('http://www.ncbi.nlm.nih.gov/books/NBK22250/')
The MATLAB Help browser opens with the Tay-Sachs disease page in the
Genes and Diseases section of the NCBI web site. This section provides a
comprehensive introduction to medical genetics. In particular, this page
The gene HEXA codes for the alpha subunit of the dimer enzyme
hexosaminidase A (Hex A), while the gene HEXB codes for the beta subunit
of the enzyme. A third gene, GM2A, codes for the activator protein GM2.
However, it is a mutation in the gene HEXA that causes Tay-Sachs.
After you locate a sequence, you need to move the sequence data into the
MATLAB Workspace.
1 Open the MATLAB Help browser to the NCBI Web site. In the MATLAB
Command Window, type
web('http://www.ncbi.nlm.nih.gov/')
The MATLAB Help browser window opens with the NCBI home page.
2 Search for the gene you are interested in studying. For example, from the
Search list, select Nucleotide, and in the for box enter Tay-Sachs.
The search returns entries for the genes that code for the alpha and beta
subunits of the enzyme hexosaminidase A (Hex A), and the gene that codes
for the activator protein. The NCBI reference for the human gene HEXA has
accession number NM_000520.
3 Get sequence data into the MATLAB environment. For example, to get
sequence information for the human gene HEXA, type
humanHEXA = getgenbank('NM_000520')
humanHEXA =
LocusName: 'NM_000520'
LocusSequenceLength: '2255'
LocusNumberofStrands: ''
LocusTopology: 'linear'
LocusMoleculeType: 'mRNA'
LocusGenBankDivision: 'PRI'
LocusModificationDate: '13-AUG-2006'
Definition: 'Homo sapiens hexosaminidase A (alpha polypeptide) (HEXA), mRNA.'
Accession: 'NM_000520'
Version: 'NM_000520.2'
GI: '13128865'
Project: []
Keywords: []
Segment: []
Source: 'Homo sapiens (human)'
SourceOrganism: [4x65 char]
Reference: {1x58 cell}
Comment: [15x67 char]
Features: [74x74 char]
CDS: [1x1 struct]
Sequence: [1x2255 char]
SearchURL: [1x108 char]
RetrieveURL: [1x97 char]
Homologous genes are genes that have a common ancestor and similar
sequences. One goal of searching a public database is to find similar genes.
If you are able to locate a sequence in a database that is similar to your
unknown gene or protein, it is likely that the function and characteristics of
the known and unknown genes are similar.
After finding the nucleotide sequence for a human gene, you can do a BLAST
search or search in the genome of another organism for the corresponding
gene. This procedure uses the mouse genome as an example.
1 Open the MATLAB Help browser to the NCBI Web site. In the MATLAB
Command Window, type
web('http://www.ncbi.nlm.nih.gov')
2 Search the nucleotide database for the gene or protein you are interested in
studying. For example, from the Search list, select Nucleotide, and in the
for box enter hexosaminidase A.
The search returns entries for the mouse and human genomes. The NCBI
reference for the mouse gene HEXA has accession number AK080777.
3 Get sequence information for the mouse gene into the MATLAB
environment. Type
mouseHEXA = getgenbank('AK080777')
mouseHEXA =
LocusName: 'AK080777'
LocusSequenceLength: '1839'
LocusNumberofStrands: ''
LocusTopology: 'linear'
LocusMoleculeType: 'mRNA'
LocusGenBankDivision: 'HTC'
LocusModificationDate: '02-SEP-2005'
Definition: [1x150 char]
Accession: 'AK080777'
Version: 'AK080777.1'
GI: '26348756'
Project: []
Keywords: 'HTC; CAP trapper.'
Segment: []
Source: 'Mus musculus (house mouse)'
SourceOrganism: [4x65 char]
Reference: {1x8 cell}
Comment: [8x66 char]
Features: [33x74 char]
CDS: [1x1 struct]
Sequence: [1x1839 char]
SearchURL: [1x107 char]
RetrieveURL: [1x97 char]
After you have a list of genes you are interested in studying, you can
determine the protein coding sequences. This procedure uses the human gene
HEXA and mouse gene HEXA as an example.
1 If you did not retrieve gene data from the Web, you can load example data
from a MAT-file included with the Bioinformatics Toolbox software. In the
MATLAB Command Window, type
load hexosaminidase
2 Locate open reading frames (ORFs) in the human gene. For example, for
the human gene HEXA, type
humanORFs = seqshoworfs(humanHEXA.Sequence)
humanORFs =
The Help browser opens displaying the three reading frames with the
ORFs colored blue, red, and green. Notice that the longest ORF is in the
first reading frame.
mouseORFs = seqshoworfs(mouseHEXA.Sequence)
mouseORFs =
The mouse gene also shows its longest ORF in the first reading frame.
After you have located the open reading frames on your nucleotide sequences,
you can convert the protein coding sections of the nucleotide sequences to
their corresponding amino acid sequences, and then you can compare them
for similarities.
1 Using the open reading frames identified previously, convert the human
and mouse DNA sequences to the amino acid sequences. Because both the
human and mouse HEXA genes were in the first reading frames (default),
you do not need to indicate which frame. Type
humanProtein = nt2aa(humanHEXA.Sequence);
mouseProtein = nt2aa(mouseHEXA.Sequence);
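If the coding region were not in the first reading frame, the 'Frame' option of nt2aa selects the frame to translate. A minimal sketch with a hypothetical 10-nucleotide sequence whose coding region starts at position 2:

```matlab
% Hypothetical sequence: one leading nucleotide, then ATG GCC TAA
ntSeq = 'AATGGCCTAA';

% Translate reading frame 2 (start at the second nucleotide)
aaFrame2 = nt2aa(ntSeq, 'Frame', 2);
disp(aaFrame2)
```

In frame 2 the codons are ATG, GCC, and TAA, so the translation is a methionine, an alanine, and a stop (*).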
2 Draw a dot plot comparing the human and mouse amino acid sequences.
Type
seqdotplot(mouseProtein,humanProtein,4,3)
ylabel('Mouse hexosaminidase A (alpha subunit)')
xlabel('Human hexosaminidase A (alpha subunit)')
Dot plots are one of the easiest ways to look for similarity between
sequences. A diagonal line in the plot indicates that there may be a
good alignment between the two sequences.
3 Globally align the two amino acid sequences, using the Needleman-Wunsch
algorithm. Type
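The toolbox function for Needleman-Wunsch alignment is nwalign. The following sketch shows the general call pattern on two short, hypothetical peptides rather than on the HEXA proteins (the variable names are illustrative and are not taken from this procedure):

```matlab
% Two short hypothetical peptide sequences
seqA = 'VSPAGMASGYD';
seqB = 'IPGKASYD';

% Global alignment with the default scoring matrix and gap penalties
[globalScore, globalAln] = nwalign(seqA, seqB);

% globalAln is a 3-row character array:
% row 1 = seqA with gaps, row 2 = match symbols, row 3 = seqB with gaps
disp(globalAln)

% Because the alignment is global, removing the gaps recovers each full input
recoveredA = strrep(globalAln(1,:), '-', '');
recoveredB = strrep(globalAln(3,:), '-', '');
```

The alignment array can be passed to showalignment for a formatted view.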
The alignment is very good between amino acid position 69 and 599, after
which the two sequences appear to be unrelated. Notice that there is a
stop (*) in the sequence at this point. If you shorten the sequences to
include only the amino acids that are in the protein you might get a better
alignment. Include the amino acid positions from the first methionine (M) to
the first stop (*) that occurs after the first methionine.
4 Trim the sequence from the first start amino acid (usually M) to the first
stop (*) and then try alignment again. Find the indices for the stops in
the sequences.
humanStops =
mouseStops =
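One way to locate stop symbols is a logical comparison followed by find. The sketch below illustrates the idea on a short made-up translation; it is an assumption about how such indices can be obtained, not necessarily the command used in this procedure:

```matlab
% Hypothetical translated sequence with two stop symbols
protein = 'LKRMAGTW*PQMS*R';

stops  = find(protein == '*');       % indices of all stops
firstM = find(protein == 'M', 1);    % index of the first methionine

% First stop that occurs after the first methionine
firstStopAfterM = stops(find(stops > firstM, 1));

% Keep the residues from the first M through that stop
proteinORF = protein(firstM:firstStopAfterM);
```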
5 Truncate the sequences to include only amino acids in the protein and
the stop.
humanProteinORF = humanProtein(70:humanStops(2))
humanProteinORF =
MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDV
SSAAQPGCSVLDEAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVV
TPGCNQLPTLESVENYTLTINDDQCLLLSETVWGALRGLETFSQLVWKSA
EGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLSSILDTLDVMAYNKLNV
FHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEYARLRG
IRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEF
MSTFFLEVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQ
LESFYIQTLLDIVSSYGKGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNY
MKELELVTKAGFRALLSAPWYLNRISYGPDWKDFYIVEPLAFEGTPEQKA
LVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKLTSDLTFAYERL
SHFRCELLRRGVQAQPLNVGFCEQEFEQT*
mouseProteinORF = mouseProtein(11:mouseStops(1))
mouseProteinORF =
MAGCRLWVSLLLAAALACLATALWPWPQYIQTYHRRYTLYPNNFQFRYHV
SSAAQAGCVVLDEAFRRYRNLLFGSGSWPRPSFSNKQQTLGKNILVVSVV
TAECNEFPNLESVENYTLTINDDQCLLASETVWGALRGLETFSQLVWKSA
EGTFFINKTKIKDFPRFPHRGVLLDTSRHYLPLSSILDTLDVMAYNKFNV
FHWHLVDDSSFPYESFTFPELTRKGSFNPVTHIYTAQDVKEVIEYARLRG
IRVLAEFDTPGHTLSWGPGAPGLLTPCYSGSHLSGTFGPVNPSLNSTYDF
MSTLFLEISSVFPDFYLHLGGDEVDFTCWKSNPNIQAFMKKKGFTDFKQL
ESFYIQTLLDIVSDYDKGYVVWQEVFDNKVKVRPDTIIQVWREEMPVEYM
LEMQDITRAGFRALLSAPWYLNRVKYGPDWKDMYKVEPLAFHGTPEQKAL
VIGGEACMWGEYVDSTNLVPRLWPRAGAVAERLWSSNLTTNIDFAFKRLS
HFRCELVRRGIQAQPISVGCCEQEFEQT*
showalignment displays the results for the second global alignment. Notice
that the percent identity is 60% for the untrimmed sequences and 84% for
the trimmed sequences.
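Percent identity can also be computed directly from the rows of an alignment. This sketch assumes identity is measured as identical positions divided by the full alignment length, gaps included, and uses two made-up aligned rows:

```matlab
% Rows 1 and 3 of a hypothetical alignment (sequences with gap characters)
row1 = 'MTSSRLW-SLL';
row3 = 'MAGCRLWVSLL';

isGap       = (row1 == '-') | (row3 == '-');   % positions with a gap in either row
isIdentical = (row1 == row3) & ~isGap;         % identical residues only

percentIdentity = 100 * sum(isIdentical) / numel(row1);
fprintf('Identity: %d/%d (%.1f%%)\n', sum(isIdentical), numel(row1), percentIdentity);
```

Trimming unrelated flanking residues shortens the alignment length in the denominator, which is one reason the trimmed sequences score higher.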
7 Another way to truncate an amino acid sequence to only those amino acids
in the protein is to first truncate the nucleotide sequence with indices from
the seqshoworfs function. Remember that the ORF for the human HEXA
gene and the ORF for the mouse HEXA were both on the first reading
frame.
humanORFs = seqshoworfs(humanHEXA.Sequence)
humanORFs =
mouseORFs = seqshoworfs(mouseHEXA.Sequence)
mouseORFs =
humanPORF = nt2aa(humanHEXA.Sequence(humanORFs(1).Start(1):...
humanORFs(1).Stop(1)));
mousePORF = nt2aa(mouseHEXA.Sequence(mouseORFs(1).Start(1):...
mouseORFs(1).Stop(1)));
showalignment(GlobalAlignment2)
LocalScore =
1057
LocalAlignment =
RGDQR-AMTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYV . . .
|| | ||:: ||| |||||||:| ||||||||| :|| :||: . . .
RGAGRWAMAGCRLWVSLLLAAALACLATALWPWPQYIQTYHRRYT . . .
showalignment(LocalAlignment)
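A score and alignment of this form are what the toolbox function swalign (Smith-Waterman local alignment) returns. The following sketch shows the call on two short hypothetical peptides; it assumes swalign is the function behind the LocalScore and LocalAlignment values above:

```matlab
% Two short hypothetical peptide sequences
seqA = 'VSPAGMASGYD';
seqB = 'IPGKASYD';

% Local alignment with the default scoring matrix and gap penalties
[localScore, localAln, startAt] = swalign(seqA, seqB);

disp(localAln)   % 3 rows: aligned part of seqA, match symbols, aligned part of seqB
% startAt gives the position in each sequence where the local alignment begins
```

Unlike a global alignment, the local alignment may cover only part of each sequence, so unrelated ends do not lower the score.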
The Phylogenetic Tree app allows you to view, edit, and explore
phylogenetic tree data. It also supports branch pruning, reordering, renaming,
and distance exploration, and it can open or save tree files in Newick or
ClustalW format.
load primates.mat
View and Align Multiple Sequences
phytreeviewer(tree)
2 Click the branches to prune (remove) from the tree. For this example, click
the branch nodes for gorillas, orangutans, and Neanderthals.
3 Export the selected branches to a second tree. Select File > Export to
Workspace, and then select Only Displayed.
4 In the Export to dialog box, enter the name of a variable. For example,
enter tree2, and then click OK.
ma = multialign(primates2);
seqalignviewer(ma);
2 Click a letter to select it, and then move the cursor over the red direction
bar. The cursor changes to a hand.
3 Click and drag the sequence to the right to insert a gap. If there is a gap to
the left, you can also move the sequence to the left and eliminate the gap.
Alternatively, to insert a gap, select a character, and then click the Insert
Gap icon on the toolbar or press the spacebar.
Note You cannot delete or add letters to a sequence, but you can add or
delete gaps. If all of the sequences at one alignment position have gaps,
you can delete that column of gaps.
seqalignviewer('close')
4
Microarray Analysis
In MATLAB, you can represent all the previous data and information in an
ExpressionSet object, which typically contains the following objects:
Managing Gene Expression Data in Objects
An ExpressionSet object lets you store, manage, and subset the data from a
microarray gene expression experiment. An ExpressionSet object includes
properties and methods that let you access, retrieve, and change data,
metadata, and other information about the microarray experiment. These
properties and methods are useful to view and analyze the data. For a list of
the properties and methods, see ExpressionSet class.
To learn more about constructing and using objects for microarray gene
expression data and information, see:
Representing Expression Data Values in DataMatrix Objects
load filteredyeastdata
2 Create variables to contain a subset of the data, specifically the first five
rows and first four columns of the yeastvalues matrix, the genes cell
array, and the times vector.
yeastvalues = yeastvalues(1:5,1:4);
genes = genes(1:5,:);
times = times(1:4);
import bioma.data.*
dmo = DataMatrix(yeastvalues,genes,times)
dmo =
1 Use the get method to display the properties of the DataMatrix object, dmo.
get(dmo)
Name: ''
RowNames: {5x1 cell}
ColNames: {' 0' ' 9.5' '11.5' '13.5'}
NRows: 5
NCols: 4
NDims: 2
ElementClass: 'double'
2 Use the set method to specify a name for the DataMatrix object, dmo.
dmo = set(dmo,'Name','MyDMObject');
3 Use the get method again to display the properties of the DataMatrix
object, dmo.
get(dmo)
Name: 'MyDMObject'
RowNames: {5x1 cell}
ColNames: {' 0' ' 9.5' '11.5' '13.5'}
NRows: 5
NCols: 4
NDims: 2
ElementClass: 'double'
• Parenthesis ( ) indexing
• Dot . indexing
Parentheses () Indexing
Use parenthesis indexing to extract a subset of the data in dmo and assign
it to a new DataMatrix object dmo2:
dmo2 = dmo(1:5,2:3)
dmo2 =
9.5 11.5
SS DNA 1.699 -0.026
YAL003W 0.146 -0.129
YAL012W 0.175 0.467
YAL026C 0.796 0.384
YAL034C 0.487 -0.184
Use parenthesis indexing to extract a subset of the data using row names and
column names, and assign it to a new DataMatrix object dmo3:
dmo3 =
11.5
SS DNA -0.026
YAL012W 0.467
YAL034C -0.184
Note If you use a cell array of row names or column names to index into a
DataMatrix object, the names must be unique, even though the row names or
column names within the DataMatrix object need not be unique.
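Indexing by name can be sketched as follows. The example rebuilds a small DataMatrix from the values displayed for dmo2 above; the variable names are illustrative, and the column names here are written without the leading spaces that appear in the yeast data:

```matlab
% Values and names taken from the dmo2 display above
vals = [1.699 -0.026; 0.146 -0.129; 0.175 0.467; 0.796 0.384; 0.487 -0.184];
rowNames = {'SS DNA', 'YAL003W', 'YAL012W', 'YAL026C', 'YAL034C'};
colNames = {'9.5', '11.5'};
dm = bioma.data.DataMatrix(vals, rowNames, colNames);

% Select three rows by name and one column by name
sub = dm({'SS DNA', 'YAL012W', 'YAL034C'}, '11.5');

% Convert the resulting DataMatrix to a plain numeric array
subValues = double(sub);
```

The selection contains the same three values as the dmo3 display above.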
9.5 11.5
SS DNA 1.7 -0.03
YAL003W 0.15 -0.13
YAL012W 0.175 0.467
YAL026C 0.796 0.384
YAL034C 0.487 -0.184
9.5 11.5
YAL012W 0.175 0.467
YAL026C 0.796 0.384
YAL034C 0.487 -0.184
Dot . Indexing
Note In the following examples, notice that when using dot indexing with
DataMatrix objects, you specify all rows or all columns using a colon within
single quotation marks, (':').
Use dot indexing to extract the data from the 11.5 column only of dmo:
timeValues = dmo.(':')('11.5')
timeValues =
-0.0260
-0.1290
0.4670
0.3840
-0.1840
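Dot indexing returns the underlying numeric values rather than a new DataMatrix object. A small sketch with a hypothetical two-row object:

```matlab
% Hypothetical two-row DataMatrix
dm = bioma.data.DataMatrix([1.699 -0.026; 0.146 -0.129], ...
    {'SS DNA', 'YAL003W'}, {'9.5', '11.5'});

col = dm.(':')('11.5');   % all rows, column named '11.5'
row = dm.(2)(':');        % second row, all columns
```

Both results are ordinary numeric arrays, so they can be passed straight to functions such as mean or plot.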
Use dot indexing to assign new data to a subset of the elements in dmo:
dmo.(1:2)(':') = 7
dmo =
dmo.YAL034C = []
dmo =
dmo.(':')(2:3)=[]
dmo =
0 13.5
SS DNA 7 7
YAL003W 7 7
YAL012W 0.157 -0.379
YAL026C 0.246 0.981
Representing Expression Data Values in ExptData Objects
A B C
100001_at 2.26 20.14 31.66
100002_at 158.86 236.25 206.27
100003_at 68.11 105.45 82.92
100004_at 74.32 96.68 84.87
100005_at 75.05 53.17 57.94
100006_at 80.36 42.89 77.21
100007_at 216.64 191.32 219.48
An ExptData object lets you store, manage, and subset the data values from a
microarray experiment. An ExptData object includes properties and methods
that let you access, retrieve, and change data values from a microarray
experiment. These properties and methods are useful to view and analyze the
data. For a list of the properties and methods, see ExptData class.
import bioma.data.*
EDObj = ExptData(dmObj);
EDObj
Experiment Data:
500 features, 26 samples
1 elements
Element names: Elmt1
objectname.propertyname
EDObj.NElements
ans =
objectname.propertyname = propertyvalue
EDObj.Name = 'MyExptDataObject'
Note Property names are case sensitive. For a list and description of all
properties of an ExptData object, see ExptData class.
objectname.methodname
or
methodname(objectname)
EDObj.sampleNames
Columns 1 through 9
'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' ...
size(EDObj)
ans =
500 26
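The property and method access patterns above can be tried on a small synthetic object. This sketch assumes a 5-by-3 DataMatrix with made-up feature and sample names in place of the 500-by-26 mouse data:

```matlab
% Hypothetical 5 features x 3 samples
dm = bioma.data.DataMatrix(reshape(1:15, 5, 3), ...
    {'f1', 'f2', 'f3', 'f4', 'f5'}, {'A', 'B', 'C'});

ed = bioma.data.ExptData(dm);   % wrap the DataMatrix in an ExptData object

nElem = ed.NElements;           % property access: number of elements
names = sampleNames(ed);        % method call: sample (column) names
dims  = size(ed);               % [number of features, number of samples]
```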
References
[1] Hovatta, I., Tennant, R.S., Helton, R., et al. (2005). Glyoxalase 1 and
glutathione reductase 1 regulate anxiety in mice. Nature 438, 662–666.
Representing Sample and Feature Metadata in MetaData Objects
VariableDescription
id 'Sample identifier'
Gender 'Gender of the mouse in study'
Age 'The number of weeks since mouse birth'
Type 'Genetic characters'
Strain 'The mouse strain'
Source 'The tissue source for RNA collection'
A MetaData object lets you store, manage, and subset the metadata from a
microarray experiment. A MetaData object includes properties and methods
that let you access, retrieve, and change metadata from a microarray
experiment. These properties and methods are useful to view and analyze the
metadata. For a list of the properties and methods, see MetaData class.
import bioma.data.*
2 Load some sample data, which includes Fisher’s iris data of 5 measurements
on a sample of 150 irises.
load fisheriris
3 Create a dataset array from some of Fisher’s iris data. The dataset
array will contain 750 measured values, one for each of 150 samples (iris
replicates) at five variables (species, SL, SW, PL, PW). In this dataset array,
the rows correspond to samples, and the columns correspond to variables.
irisVarDesc =
VariableDescription
species 'Iris species'
SL 'Sepal Length'
SW 'Sepal Width'
PL 'Petal Length'
PW 'Petal Width'
import bioma.data.*
Note that this text file contains two tables. One table contains 130
measured values, one for each of 26 samples (A through Z) at five variables
(Gender, Age, Type, Strain, and Source). In this table, the rows correspond
to samples, and the columns correspond to variables. The second table has
lines prefaced by the # symbol. It contains five rows, one for each of
the five variables: Gender, Age, Type, Strain, and Source. The first column
contains the variable name. The second column has a column header of
VariableDescription and contains a description of the variable.
Sample Names:
A, B, ...,Z (26 total)
Variable Names and Meta Information:
VariableDescription
Gender ' Gender of the mouse in study'
Age ' The number of weeks since mouse birth'
Type ' Genetic characters'
Strain ' The mouse strain'
Source ' The tissue source for RNA collection'
objectname.propertyname
MDObj2.NVariables
ans =
objectname.propertyname = propertyvalue
Note Property names are case sensitive. For a list and description of all
properties of a MetaData object, see MetaData class.
objectname.methodname
or
methodname(objectname)
For example, to access the dataset array in a MetaData object that contains
the variable values:
MDObj2.variableValues;
To access the dataset array of a MetaData object that contains the variable
descriptions:
variableDesc(MDObj2)
ans =
VariableDescription
Gender ' Gender of the mouse in study'
Age ' The number of weeks since mouse birth'
• Experiment design
• Microarrays used
• Samples used
• Sample preparation and labeling
• Hybridization procedures and parameters
• Normalization controls
• Preprocessing information
• Data processing specifications
A MIAME object includes properties and methods that let you access, retrieve,
and change experiment information related to a microarray experiment.
These properties and methods are useful to view and analyze the information.
For a list of the properties and methods, see MIAME class.
Representing Experiment Information in a MIAME Object
import bioma.data.*
geoStruct = getgeodata('GSE4616')
geoStruct =
3 Use the MIAME constructor function to create a MIAME object from the
structure.
MIAMEObj1 = MIAME(geoStruct);
MIAMEObj1
MIAMEObj1 =
Experiment Description:
Author name: Mika,,Silvennoinen
Riikka,,Kivelä
Maarit,,Lehti
Anna-Maria,,Touvras
Jyrki,,Komulainen
Veikko,,Vihko
Heikki,,Kainulainen
Laboratory: LIKES - Research Center
Contact information: Mika,,Silvennoinen
URL:
PubMedIDs: 17003243
import bioma.data.*
MIAMEObj2
MIAMEObj2 =
Experiment Description:
Author name: Jane Researcher
Laboratory: One Bioinformatics Laboratory
Contact information: [email protected]
URL: www.lab.not.exist
PubMedIDs:
Abstract: A 4 word abstract is available. Use the Abstract property.
No experiment design summary available.
Other notes:
'Notes:Created from a text file.'
objectname.propertyname
MIAMEObj1.PubMedID
ans =
17003243
objectname.propertyname = propertyvalue
Note Property names are case sensitive. For a list and description of all
properties of a MIAME object, see MIAME class.
objectname.methodname
or
methodname(objectname)
MIAMEObj1.isempty
ans =
Note For a complete list of methods of a MIAME object, see MIAME class.
Representing All Data in an ExpressionSet Object
An ExpressionSet object lets you store, manage, and subset the data from a
microarray gene expression experiment. An ExpressionSet object includes
properties and methods that let you access, retrieve, and change data,
metadata, and other information about the microarray experiment. These
properties and methods are useful to view and analyze the data. For a list of
the properties and methods, see ExpressionSet class.
Note The following procedure assumes you have executed the example code
in the previous sections:
import bioma.*
ESObj
ExpressionSet
Experiment Data: 500 features, 26 samples
Element names: Expressions
Sample Data:
Sample names: A, B, ...,Z (26 total)
Sample variable names and meta information:
Gender: Gender of the mouse in study
Age: The number of weeks since mouse birth
Type: Genetic characters
Strain: The mouse strain
Source: The tissue source for RNA collection
Feature Data: none
Experiment Information: use 'exptInfo(obj)'
objectname.propertyname
ESObj.NSamples
ans =
26
Note Property names are case sensitive. For a list and description of all
properties of an ExpressionSet object, see ExpressionSet class.
objectname.methodname
or
methodname(objectname)
ESObj.sampleVarNames
ans =
exptInfo(ESObj)
ans =
Experiment description
Author name: Mika,,Silvennoinen
Riikka,,Kivelä
Maarit,,Lehti
Anna-Maria,,Touvras
Jyrki,,Komulainen
Veikko,,Vihko
Heikki,,Kainulainen
Laboratory: XYZ Lab
Contact information: Mika,,Silvennoinen
URL:
PubMedIDs: 17003243
Abstract: A 90 word abstract is available Use the Abstract property.
Experiment Design: A 234 word summary is available Use the ExptDesign property.
Other notes:
[1x80 char]
Visualizing Microarray Images
http://labs.pharmacology.ucla.edu/smithlab/genome_multiplex/
The microarray data is also available on the Gene Expression Omnibus Web
site at
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30
The GenePix GPR-formatted file mouse_a1pd.gpr contains the data for one
of the microarrays used in the study. This is data from voxel A1 of the brain
of a mouse in which a pharmacological model of Parkinson’s disease (PD)
was induced using methamphetamine. The voxel sample was labeled with
Cy3 (green) and the control, RNA from a total (not voxelated) normal mouse
brain, was labeled with Cy5 (red). GPR formatted files provide a large amount
of information about the array, including the mean, median, and standard
1 Read data from a file into a MATLAB structure. For example, in the
MATLAB Command Window, type
pd = gprread('mouse_a1pd.gpr')
pd =
Header: [1x1 struct]
Data: [9504x38 double]
Blocks: [9504x1 double]
Columns: [9504x1 double]
Rows: [9504x1 double]
Names: {9504x1 cell}
IDs: {9504x1 cell}
ColumnNames: {38x1 cell}
Indices: [132x72 double]
Shape: [1x1 struct]
pd.ColumnNames
ans =
'X'
'Y'
'Dia.'
'F635 Median'
'F635 Mean'
'F635 SD'
'B635 Median'
'B635 Mean'
'B635 SD'
'% > B635+1SD'
'% > B635+2SD'
'F635 % Sat.'
'F532 Median'
'F532 Mean'
'F532 SD'
'B532 Median'
'B532 Mean'
'B532 SD'
'% > B532+1SD'
'% > B532+2SD'
'F532 % Sat.'
'Ratio of Medians'
'Ratio of Means'
'Median of Ratios'
'Mean of Ratios'
'Ratios SD'
'Rgn Ratio'
'Rgn R²'
'F Pixels'
'B Pixels'
'Sum of Medians'
'Sum of Means'
'Log Ratio'
'F635 Median - B635'
'F532 Median - B532'
'F635 Mean - B635'
'F532 Mean - B532'
'Flags'
3 Access the names of the genes. For example, to list the first 20 gene names,
type
pd.Names(1:20)
ans =
'AA467053'
'AA388323'
'AA387625'
'AA474342'
'Myo1b'
'AA473123'
'AA387579'
'AA387314'
'AA467571'
''
'Spop'
'AA547022'
'AI508784'
'AA413555'
'AA414733'
''
'Snta1'
'AI414419'
'W14393'
'W10596'
This procedure uses data from a study of gene expression in mouse brains.
For a list of field names in the MATLAB structure pd, see “Exploring the
Microarray Data Set” on page 4-34.
1 Plot the median values for the red channel. For example, to plot data from
the field F635 Median, type
figure
maimage(pd,'F635 Median')
The MATLAB software plots an image showing the median pixel values for
the foreground of the red (Cy5) channel.
2 Plot the median values for the green channel. For example, to plot data
from the field F532 Median, type
figure
maimage(pd,'F532 Median')
The MATLAB software plots an image showing the median pixel values of
the foreground of the green (Cy3) channel.
3 Plot the median values for the red background. The field B635 Median
shows the median values for the background of the red channel.
figure
maimage(pd,'B635 Median')
The MATLAB software plots an image for the background of the red
channel. Notice the very high background levels down the right side of
the array.
4 Plot the median values for the green background. The field B532 Median
shows the median values for the background of the green channel.
figure
maimage(pd,'B532 Median')
The MATLAB software plots an image for the background of the green
channel.
5 The first array was for the Parkinson’s disease model mouse. Now read in
the data for the same brain voxel but for the untreated control mouse. In
this case, the voxel sample was labeled with Cy3 and the control, total
brain (not voxelated), was labeled with Cy5.
wt = gprread('mouse_a1wt.gpr')
wt =
Header: [1x1 struct]
Data: [9504x38 double]
Blocks: [9504x1 double]
Columns: [9504x1 double]
Rows: [9504x1 double]
Names: {9504x1 cell}
IDs: {9504x1 cell}
ColumnNames: {38x1 cell}
Indices: [132x72 double]
Shape: [1x1 struct]
figure
subplot(2,2,1);
maimage(wt,'F635 Median')
subplot(2,2,2);
maimage(wt,'F532 Median')
subplot(2,2,3);
maimage(wt,'B635 Median')
subplot(2,2,4);
maimage(wt,'B532 Median')
7 If you look at the scale for the background images, you will notice that the
background levels are much higher than those for the PD mouse and there
appears to be something nonrandom affecting the background of the Cy3
channel of this slide. Changing the colormap can sometimes provide more
insight into what is going on in pseudocolor plots. For more control over the
color, try the colormapeditor function.
colormap hot
b532MedCol =
16
b532Data = wt.Data(:,b532MedCol);
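A column index such as this one can be found by matching the column name against the ColumnNames field. The sketch below reproduces the lookup with the first 16 names from the pd.ColumnNames listing above, so it runs without the GPR file; the use of strcmp and find is an assumption about how the index was obtained:

```matlab
% First 16 entries of the ColumnNames listing shown earlier
columnNames = {'X'; 'Y'; 'Dia.'; 'F635 Median'; 'F635 Mean'; 'F635 SD'; ...
    'B635 Median'; 'B635 Mean'; 'B635 SD'; '% > B635+1SD'; '% > B635+2SD'; ...
    'F635 % Sat.'; 'F532 Median'; 'F532 Mean'; 'F532 SD'; 'B532 Median'};

% Position of the exact name in the list
b532MedCol = find(strcmp(columnNames, 'B532 Median'));
```

With the real structure, the same expression applied to wt.ColumnNames gives the index into the columns of wt.Data.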
figure
subplot(1,2,1);
imagesc(b532Data(wt.Indices))
axis image
colorbar
title('B532 Median')
11 Bound the intensities of the background plot to give more contrast in the
image.
maskedData = b532Data;
maskedData(b532Data<500) = 500;
maskedData(b532Data>2000) = 2000;
subplot(1,2,2);
imagesc(maskedData(wt.Indices))
axis image
colorbar
title('Enhanced B532 Median')
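The two masking assignments clamp every value to the range 500 to 2000. An equivalent one-line form uses min and max, shown here on a few made-up intensity values:

```matlab
% Hypothetical background intensities
b532 = [120 650 2400 1800 500];
lowerBound = 500;
upperBound = 2000;

% Clamp with logical masks, as in the step above
masked = b532;
masked(b532 < lowerBound) = lowerBound;
masked(b532 > upperBound) = upperBound;

% Equivalent clamp with min and max
clamped = min(max(b532, lowerBound), upperBound);
```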
figure
subplot(2,1,1)
maboxplot(pd,'F532 Median','title','Parkinson''s Disease Model Mouse')
subplot(2,1,2)
maboxplot(pd,'B532 Median','title','Parkinson''s Disease Model Mouse')
figure
subplot(2,1,1)
maboxplot(wt,'F532 Median','title','Untreated Mouse')
subplot(2,1,2)
maboxplot(wt,'B532 Median','title','Untreated Mouse')
From the box plots you can clearly see the spatial effects in the background
intensities. Block numbers 1, 3, 5, and 7 are on the left side of the
arrays, and block numbers 2, 4, 6, and 8 are on the right side. The data must be
normalized to remove this spatial bias.
cy5DataCol =
34
cy3DataCol =
35
2 A simple way to compare the two channels is with a loglog plot, which
you can create with the maloglog function. Points that are above the diagonal in
this plot correspond to genes that have higher expression levels in the A1
voxel than in the brain as a whole.
figure
maloglog(cy5Data,cy3Data)
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel A1)');
The MATLAB software displays the following messages and plots the
images.
Notice that this function gives some warnings about negative and zero
elements. This is because some of the values in the 'F635 Median - B635'
and 'F532 Median - B532' columns are zero or even less than zero. Spots
where this happened might be bad spots or spots that failed to hybridize.
Points with positive, but very small, differences between foreground and
background should also be considered to be bad spots.
figure
maloglog(cy5Data,cy3Data) % Create the loglog plot
warning(warnState); % Reset the warning state.
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel A1)');
threshold = 10;
badPoints = (cy5Data <= threshold) | (cy3Data <= threshold);
5 You can then remove these points and redraw the loglog plot.
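A minimal sketch of the removal, using made-up intensities rather than the example data (cy5Demo, cy3Demo, and badDemo are illustrative names):

```matlab
% Hypothetical background-subtracted intensities for six spots.
cy5Demo = [250; -3; 40; 8; 1200; 75];
cy3Demo = [300; 15; 0; 90; 900; 60];
threshold = 10;

% A spot is flagged when either channel is at or below the threshold.
badDemo = (cy5Demo <= threshold) | (cy3Demo <= threshold);

% Deleting with the same logical mask keeps the two channels aligned.
cy5Demo(badDemo) = [];
cy3Demo(badDemo) = [];
disp([cy5Demo cy3Demo])
```

In the example itself, the same logical indexing would be applied to cy5Data and cy3Data with the badPoints mask before calling maloglog again.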
This plot shows the distribution of points but does not give any indication
about which genes correspond to which points.
6 Add gene labels to the plot. Because some of the data points have
been removed, the corresponding gene IDs must also be removed from
the data set before you can use them. The simplest way to do that is
wt.IDs(~badPoints).
maloglog(cy5Data,cy3Data,'labels',wt.IDs(~badPoints),...
'factorlines',2)
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel A1)');
You will see the gene ID associated with the point. Most of the outliers are
below the y = x line. In fact, most of the points are below this line. Ideally
the points should be evenly distributed on either side of this line.
8 Normalize the points to evenly distribute them on either side of the line.
Use the function mameannorm to perform global mean normalization.
normcy5 = mameannorm(cy5Data);
normcy3 = mameannorm(cy3Data);
If you plot the normalized data you will see that the points are more evenly
distributed about the y = x line.
figure
maloglog(normcy5,normcy3,'labels',wt.IDs(~badPoints),...
'factorlines',2)
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel A1)');
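To see why global mean normalization centers the points around the y = x line, here is a minimal sketch on made-up values. It assumes that the normalization amounts to dividing each channel by its own mean; mameannorm may differ in details such as the handling of NaN values.

```matlab
% Hypothetical intensities for two channels with different overall levels.
x = [200; 400; 600; 800];   % mean is 500
y = [100; 150; 250; 300];   % mean is 200

% Assumed form of global mean normalization: divide by the channel mean.
xNorm = x ./ mean(x);
yNorm = y ./ mean(y);

% Both channels now have mean 1, so a systematic offset between them
% no longer pushes most points to one side of the y = x line.
disp([mean(xNorm) mean(yNorm)])
```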
9 The function mairplot is used to create an Intensity vs. Ratio plot for the
normalized data. This function works in the same way as the function
maloglog.
figure
mairplot(normcy5,normcy3,'labels',wt.IDs(~badPoints),...
'factorlines',2)
10 You can click the points in this plot to see the name of the gene associated
with the plot.
Analyzing Gene Expression Profiles
The microarray data for this example is from DeRisi, J.L., Iyer, V.R., and
Brown, P.O. (Oct 24, 1997). Exploring the metabolic and genetic control of
gene expression on a genomic scale. Science, 278 (5338), 680–686. PMID:
9381177.
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE28
load yeastdata.mat
numel(genes)
The number of genes in the data set displays in the MATLAB Command
Window. The MATLAB variable genes is a cell array of the gene names.
ans =
6400
genes{15}
This displays the name of the 15th gene, the open reading frame (ORF)
YAL054C, whose expression levels are stored in the 15th row of the
variable yeastvalues.
ans =
YAL054C
4 Use the function web to access information about this ORF in the
Saccharomyces Genome Database (SGD).
url = sprintf(...
'http://genome-www4.stanford.edu/cgi-bin/SGD/locus.pl?locus=%s',...
genes{15});
web(url);
5 A simple plot can be used to show the expression profile for this ORF.
plot(times, yeastvalues(15,:))
xlabel('Time (Hours)');
ylabel('Log2 Relative Expression Level');
The MATLAB software plots the figure. The values are log2 ratios.
plot(times, 2.^yeastvalues(15,:))
xlabel('Time (Hours)');
ylabel('Relative Expression Level');
The MATLAB software plots the figure. The gene associated with this
ORF, ACS1, appears to be strongly up-regulated during the diauxic shift.
hold on
plot(times, 2.^yeastvalues(16:26,:)')
xlabel('Time (Hours)');
ylabel('Relative Expression Level');
title('Profile Expression Levels');
Filtering Genes
This procedure illustrates how to filter the data by removing genes that are
not expressed or do not change. The data set is quite large and a lot of the
information corresponds to genes that do not show any interesting changes
during the experiment. To make it easier to find the interesting genes, reduce
the size of the data set by removing genes with expression profiles that do not
show anything of interest. There are 6400 expression profiles. You can use
a number of techniques to reduce the number of expression profiles to some
subset that contains the most significant genes.
1 If you look through the gene list you will see several spots marked as
'EMPTY'. These are empty spots on the array, and while they might have
data associated with them, for the purposes of this example, you can
consider these points to be noise. These points can be found using the
strcmp function and removed from the data set with indexing commands.
emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];
genes(emptySpots) = [];
numel(genes)
ans =
6314
In the yeastvalues data you will also see several places where the
expression level is marked as NaN. This indicates that no data was collected
for this spot at the particular time step. One approach to dealing with
these missing values would be to impute them using the mean or median of
data for the particular gene over time. This example uses a less rigorous
approach of simply throwing away the data for any genes where one or
more expression levels were not measured.
2 Use the isnan function to identify the genes with missing data and then
use indexing commands to remove the genes.
nanIndices = any(isnan(yeastvalues),2);
yeastvalues(nanIndices,:) = [];
genes(nanIndices) = [];
numel(genes)
ans =
6276
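The imputation alternative mentioned before step 2 is not used in this example, but it can be sketched as follows. The matrix vals is made up for illustration, and the choice of the per-gene median is one of the two options the text names.

```matlab
% Hypothetical expression matrix: rows are genes, columns are time points.
vals = [1.0 NaN 3.0 2.0;
        0.5 0.7 0.9 1.1;
        NaN NaN 4.0 2.0];

imputed = vals;
for r = 1:size(vals,1)
    missing = isnan(vals(r,:));
    % Fill gaps only when the gene has at least one measured value.
    if any(missing) && ~all(missing)
        imputed(r,missing) = median(vals(r,~missing));
    end
end
disp(imputed)
```

Imputation keeps every gene in the data set, at the cost of inventing values; discarding incomplete genes, as the example does, avoids that assumption.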
If you were to plot the expression profiles of all the remaining profiles,
you would see that most profiles are flat and not significantly different
from the others. This flat data is still useful because it indicates that the
genes associated with these profiles are not significantly affected by the
diauxic shift. However, in this example, you are interested in the genes
with large changes in expression accompanying the diauxic shift. You can
use filtering functions in the toolbox to remove genes with various types
3 Use the function genevarfilter to filter out genes with small variance
over time. The function returns a logical array of the same size as the
variable genes with ones corresponding to rows of yeastvalues with
variance greater than the 10th percentile and zeros corresponding to those
below the threshold.
mask = genevarfilter(yeastvalues);
% Use the mask as an index into the values to remove the
% filtered genes.
yeastvalues = yeastvalues(mask,:);
genes = genes(mask);
numel(genes)
ans =
5648
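The idea behind the variance filter can be sketched with core MATLAB functions. This is an illustration on made-up profiles, assuming a crude rank-based 10th percentile with no interpolation; genevarfilter may compute its cutoff differently.

```matlab
% Hypothetical profiles: row k is [-k 0 k], so its variance is k^2.
profiles = (1:10)' * [-1 0 1];

rowVar = var(profiles, 0, 2);        % variance of each profile over time
sortedVar = sort(rowVar);

% Rank-based 10th percentile (assumption: no interpolation between ranks).
pct = 10;
cutoff = sortedVar(ceil(numel(sortedVar) * pct / 100));

% Logical mask: true for profiles whose variance exceeds the cutoff.
keep = rowVar > cutoff;
filtered = profiles(keep,:);
disp(size(filtered,1))
```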
ans =
423
ans = 310
Clustering Genes
Now that you have a manageable list of genes, you can look for relationships
between the profiles using some different clustering techniques from the
Statistics Toolbox software.
3 The profiles of the genes in these clusters can be plotted together using a
simple loop and the function subplot.
figure
for c = 1:16
subplot(4,4,c);
plot(times,yeastvalues((clusters == c),:)');
axis tight
end
suptitle('Hierarchical Clustering of Profiles');
5 Instead of plotting all of the profiles, you can plot just the centroids.
figure
for c = 1:16
subplot(4,4,c);
plot(times,ctrs(c,:)');
axis tight
axis off % turn off the axis
end
suptitle('K-Means Clustering of Profiles');
6 You can use the function clustergram to create a heat map and
dendrogram from the output of the hierarchical clustering.
figure
clustergram(yeastvalues(:,2:end),'RowLabels',genes,...
'ColumnLabels',times(2:end))
1 Use the pca function in the Statistics Toolbox software to calculate the
principal components of a data set.
pc =
Columns 1 through 4
Columns 5 through 7
2 You can use the function cumsum to see the cumulative sum of the variances.
cumsum(pcvars./sum(pcvars) * 100)
ans =
78.3719
89.2140
93.4357
96.0831
98.3283
99.3203
100.0000
This shows that almost 90% of the variance is accounted for by the first
two principal components.
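To make the percentages concrete, the following sketch derives component variances from a singular value decomposition of a small made-up matrix and then forms the same cumulative percentage. The names X, Xc, and pcvarsDemo are illustrative; this is not how the example obtains pcvars.

```matlab
% Hypothetical data: 5 observations of 3 variables.
X = [2 0 1; 4 1 0; 6 0 1; 8 1 0; 10 0 1];

% Center each column (repmat keeps this compatible with older releases).
Xc = X - repmat(mean(X,1), size(X,1), 1);

% Squared singular values, scaled by n-1, are the component variances.
s = svd(Xc);
pcvarsDemo = s.^2 / (size(X,1) - 1);

% Cumulative percentage of variance explained, as in the step above.
cumPct = cumsum(pcvarsDemo ./ sum(pcvarsDemo) * 100);
disp(cumPct)
```

The component variances always sum to the total variance of the original columns, so the last cumulative value is 100.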
3 A scatter plot of the scores of the first two principal components shows that
there are two distinct regions. This is not unexpected, because the filtering
process removed many of the genes with low variance or low information.
These genes would have appeared in the middle of the scatter plot.
figure
scatter(zscores(:,1),zscores(:,2));
xlabel('First Principal Component');
ylabel('Second Principal Component');
title('Principal Component Scatter Plot');
4 The gname function from the Statistics Toolbox software can be used to
identify genes on a scatter plot. You can select as many points as you like
on the scatter plot.
gname(genes);
plot where points from each group have a different color or marker. You
can use clusterdata, or any other clustering function, to group the points.
figure
pcclusters = clusterdata(zscores(:,1:2),6);
gscatter(zscores(:,1),zscores(:,2),pcclusters)
xlabel('First Principal Component');
ylabel('Second Principal Component');
title('Principal Component Scatter Plot with Colored Clusters');
gname(genes) % Press enter when you finish selecting genes.
Introduction
The data in this example is the Coriell cell line BAC array CGH data analyzed
by Snijders et al. (2001). The Coriell cell line data is widely regarded as a "gold
standard" data set. You can download this data of normalized log2-based
intensity ratios and the supplemental table of known karyotypes from
http://www.nature.com/ng/journal/v29/n3/suppinfo/ng754_S1.html. You will
Detecting DNA Copy Number Alteration in Array-Based CGH Data
For this example, the Coriell cell line data are provided in a MAT file. The
data file coriell_baccgh.mat contains coriell_data, a structure containing
the normalized average of the log2-based test-to-reference intensity ratios
of 15 fibroblast cell lines and their genomic positions. The BAC targets are
ordered by genome position beginning at 1p and ending at Xq.
load coriell_baccgh
coriell_data
coriell_data =
You can plot the genome wide log2-based test/reference intensity ratios of
DNA clones. In this example, you will display the log2 intensity ratios for cell
line GM03576 for chromosomes 1 through 23.
sample =
To label chromosomes and draw the chromosome borders, you need to find
the number of data points in each chromosome.
% Label the autosomes with their chromosome numbers, and the sex chromosome
% with X.
x_label = chr_nums - ceil(chr_data_len/2);
y_label = zeros(1, length(x_label)) - 1.6;
chr_labels=num2str((1:1:23)');
chr_labels = cellstr(chr_labels);
chr_labels{end} = 'X';
figure;hold on
h_ratio = plot(coriell_data.Log2Ratio(:,sample), '.');
h_vbar = line(x_vbar, y_vbar, 'color', [0.8 0.8 0.8]);
h_text = text(x_label, y_label, chr_labels,...
'fontsize', 8, 'HorizontalAlignment', 'center');
title(coriell_data.Sample{sample})
xlabel({'', 'Chromosome'})
ylabel('Log2(T/R)')
hold off
In the plot, borders between chromosomes are indicated by grey vertical bars.
The plot indicates that the GM03576 cell line is trisomic for chromosomes
2 and 21 [3].
You can also plot the profile of each chromosome in a genome. In this
example, you will display the log2 intensity ratios for each chromosome in cell
line GM05296 individually.
hp = plot(chr_y, '.');
line([0, chr_data_len(c)], [0,0], 'color', 'r');
The plot indicates the GM05296 cell line has a partial trisomy at chromosome
10 and a partial monosomy at chromosome 11.
Observe that the gains and losses of copy number are discrete. These
alterations occur in contiguous regions of a chromosome that cover several
clones to an entire chromosome.
% Smoother
GM05296_Data(iloop).SmoothedRatio = ...
mslowess(GM05296_Data(iloop).GenomicPosition,...
GM05296_Data(iloop).Log2Ratio,...
'SPAN',15);
To better visualize and later validate the locations of copy number changes,
you need cytoband information. Read the human cytoband information from
the hs_cytoBand.txt data file using the cytobandread function. It returns a
structure of human cytoband information [4].
hs_cytobands = cytobandread('hs_cytoBand.txt')
hs_cytobands =
You can inspect the data by plotting the log2-based ratios, the smoothed ratios
and the derivative of the smoothed ratios together. You can also display the
centromere position of a chromosome in the data plots. The magenta vertical
bar marks the centromere of the chromosome.
hold off
end
Detecting Change-Points
thrd = 0.1;
You can set the length for the set of adjacent positions distributed around the
change-point indices. For this example, you will select a length of 5. You can
also inspect each change-point by plotting its GM clusters. In this example,
you will plot the GM clusters for the Chromosome 10 data.
len = 5;
for iloop = 1:length(GM05296_Data)
seg_num = numel(GM05296_Data(iloop).SegIndex) - 1;
if seg_num > 1
% Plot the data points in chromosome 10 data
if GM05296_Data(iloop).Chromosome == 10
figure; hold on;
plot(GM05296_Data(iloop).GenomicPosition,...
GM05296_Data(iloop).Log2Ratio, '.')
ylim([-0.5, 1])
xlabel('Genomic Position')
ylabel('Log2(T/R)')
title(sprintf('Chromosome %d - GM05296', ...
GM05296_Data(iloop).Chromosome))
end
segidx = GM05296_Data(iloop).SegIndex;
segidx_emadj = GM05296_Data(iloop).SegIndex;
% Select an initial guess for the cluster index of each point.
gmpart = (gmy > (min(gmy) + range(gmy)/2)) + 1;
Once you determine the optimal change-point indices, you also need to
determine if each segment represents a statistically significant change
in DNA copy number. You will perform permutation t-tests to assess the
significance of the segments identified. A segment includes all the data points
from one change-point to the next change-point or the chromosome end. In
this example, you will perform 10,000 permutations of the data points on two
consecutive segments along the chromosome at the significance level of 0.01.
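The core of the permutation test can be sketched on two made-up segments before reading the full loop. The segment values, the reduced iteration count, and the definition of t_obs as the difference of the segment means are assumptions made for this illustration.

```matlab
rng(0);   % fix the random stream so reruns give the same answer

% Hypothetical smoothed ratios for two consecutive segments.
seg1 = [0.02; -0.05; 0.01; 0.04; -0.03; 0.00; 0.03; -0.01];
seg2 = [0.48; 0.55; 0.51; 0.46; 0.53; 0.50; 0.49; 0.52];
n1 = numel(seg1);
n2 = numel(seg2);
N = n1 + n2;
segs = [seg1; seg2];

% Observed statistic (assumed form): difference of the segment means.
t_obs = mean(seg1) - mean(seg2);

% Shuffle the pooled points and recompute the statistic each time.
iter = 2000;   % smaller than the 10,000 used in the example, for speed
t_perm = zeros(iter,1);
for i = 1:iter
    randseg = segs(randperm(N));
    t_perm(i) = abs(mean(randseg(1:n1)) - mean(randseg(n1+1:N)));
end

% Permutation p-value: fraction of shuffles at least as extreme.
pval = sum(t_perm >= abs(t_obs)) / iter;
alpha = 0.01;
isSignificant = pval < alpha;
```

Because the two made-up segments do not overlap, almost no shuffle reproduces a difference as large as the observed one, and the change-point between them is reported as significant.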
alpha = 0.01;
for iloop = 1:length(GM05296_Data)
seg_num = numel(GM05296_Data(iloop).SegIndex) - 1;
seg_index = GM05296_Data(iloop).SegIndex;
if seg_num > 1
ppvals = zeros(seg_num+1, 1);
if sloop== seg_num-1
seg2idx = seg_index(sloop+1):(seg_index(sloop+2));
else
seg2idx = seg_index(sloop+1):(seg_index(sloop+2)-1);
end
seg1 = GM05296_Data(iloop).SmoothedRatio(seg1idx);
seg2 = GM05296_Data(iloop).SmoothedRatio(seg2idx);
n1 = numel(seg1);
n2 = numel(seg2);
N = n1+n2;
segs = [seg1;seg2];
% Permutation test
iter = 10000;
t_perm = zeros(iter,1);
for i = 1:iter
randseg = segs(randperm(N));
t_perm(i) = abs(mean(randseg(1:n1))-mean(randseg(n1+1:N)));
end
ppvals(sloop+1) = sum(t_perm >= abs(t_obs))/iter;
end
numel(GM05296_Data(iloop).SegIndex) - 1, GM05296_Data(iloop).Chromos
end
ylabel('Log2(T/R)')
set(gca, 'Box', 'on', 'ylim', [-1, 1])
title(sprintf('Chromosome %d - GM05296', chr_num));
chromosomeplot(hs_cytobands, chr_num, 'addtoplot', gca, 'unit', 2)
end
You can also display the CNAs of the GM05296 cell line aligned to the
chromosome ideogram summary view using the chromosomeplot function.
Determine the genomic positions for the CNAs on chromosomes 10 and 11.
chr10_idx = GM05296_Data(2).SegIndex(2):GM05296_Data(2).SegIndex(3)-1;
chr10_cna_start = GM05296_Data(2).GenomicPosition(chr10_idx(1))*1000;
chr10_cna_end = GM05296_Data(2).GenomicPosition(chr10_idx(end))*1000;
chr11_idx = GM05296_Data(3).SegIndex(2):GM05296_Data(3).SegIndex(3)-1;
chr11_cna_start = GM05296_Data(3).GenomicPosition(chr11_idx(1))*1000;
chr11_cna_end = GM05296_Data(3).GenomicPosition(chr11_idx(end))*1000;
cna_struct =
This example shows how MATLAB and its toolboxes provide tools for the
analysis and visualization of copy-number alterations in array-based CGH
data.
References
[1] Redon, R., Ishikawa, S., Fitch, K.R., et al. (2006). Global variation in copy
number in the human genome. Nature 444, 444-454.
[2] Pinkel, D., Segraves, R., Sudar, D., Clark, S., Poole, I., Kowbel, D., Collins,
C., Kuo, W.L., Chen, C., Zhai, Y., et al. (1998). High resolution analysis of
DNA copy number variations using comparative genomic hybridization to
microarrays. Nat. Genet. 20, 207-211.
[3] Snijders, A.M., Nowak, N., Segraves, R., Blackwood, S., Brown, N., Conroy,
J., Hamilton, G., Hindle, A.K., Huey, B., Kimura, K., et al. (2001). Assembly
of microarrays for genome-wide measurement of DNA copy number. Nat.
Genet. 29, 263-264.
[5] Myers, C.L., Dunham, M.J., Kung, S.Y., and Troyanskaya, O.G. (2004).
Accurate detection of aneuploidies in array CGH and gene expression
microarray data. Bioinformatics 20, 18, 3533-3543.
Exploring Gene Expression Data
Introduction
The CNS dataset (CEL files) is available at the CNS experiment web site.
The 42 tumor tissue samples include 10 medulloblastomas, 10 rhabdoid, 10
malignant glioma, 8 supratentorial PNETS, and 4 normal human cerebella.
The CNS raw dataset was preprocessed with the Robust Multi-array Average
(RMA) and GC Robust Multi-array Average (GCRMA) procedures. For
further information on Affymetrix oligonucleotide microarray preprocessing,
see Preprocessing Affymetrix Microarray Data at the Probe Level.
You will use the t-test and false discovery rate to detect differentially
expressed genes between two of the tumor types. Additionally, you will look
at Gene Ontology terms related to the significantly up-regulated genes.
load cnsexpressiondata
get(expr_cns_gcrma_eb)
Name: ''
RowNames: {7129x1 cell}
ColNames: {1x42 cell}
NRows: 7129
NCols: 42
NDims: 2
ElementClass: 'single'
nGenes =
7129
nSamples =
42
You can use gene symbols instead of the probe set IDs to label the expression
values. The gene symbols for the HuGeneFL array are provided in a MAT
file containing a Map object.
load HuGeneFL_GeneSymbol_Map;
Create a cell array of gene symbols for the expression values from the
hu6800GeneSymbolMap object.
Set the row names of expr_cns_gcrma_eb to the gene symbols using the
rownames method of the DataMatrix object.
Remove gene expression data with empty gene symbols. In the example, the
empty symbols are labeled as '---'.
expr_cns_gcrma_eb('---', :) = [];
Many of the genes in this study are not expressed, or have only small
variability across the samples. Remove these genes using non-specific
filtering.
Use genelowvalfilter to filter out genes with very low absolute expression
values.
Use genevarfilter to filter out genes with a small variance across samples.
nGenes = expr_cns_gcrma_eb.NRows
nGenes =
5669
You can now compare the gene expression values between two groups of data:
CNS medulloblastomas (MD) and non-neuronal origin malignant gliomas
(Mglio) tumor.
From the expression data of all 42 samples, extract the data of the 10 MD
samples and the 10 Mglio samples.
Name: ''
RowNames: {5669x1 cell}
ColNames: {1x10 cell}
NRows: 5669
NCols: 10
NDims: 2
ElementClass: 'single'
Name: ''
RowNames: {5669x1 cell}
ColNames: {1x10 cell}
NRows: 5669
NCols: 10
NDims: 2
ElementClass: 'single'
In any test situation, two types of errors can occur: a false positive by
declaring that a gene is differentially expressed when it is not, and a false
negative when the test fails to identify a truly differentially expressed gene.
In multiple hypothesis testing, which simultaneously tests the null hypothesis
of thousands of genes using microarray expression data, each test has a
specific false positive rate, or a false discovery rate (FDR). False discovery
rate is defined as the expected ratio of the number of false positives to the
total number of positive calls in a differential expression analysis between
two groups of samples (Storey et al., 2003).
In this example, you will compute the FDR using the Storey-Tibshirani
procedure (Storey et al., 2003). The procedure also computes the q-value of
a test, which measures the minimum FDR that occurs when calling the test
significant. The estimation of FDR depends on the truly null distribution of
the multiple tests, which is unknown. Permutation methods can be used to
estimate the truly null distribution of the test statistics by permuting the
columns of the gene expression data matrix (Storey et al., 2003, Dudoit et
al., 2003). Depending on the sample size, it may not be feasible to consider
all possible permutations. Usually a random subset of permutations is
considered in the case of large sample size. Use the nchoosek function to
find out the number of all possible permutations of the samples in this
example.
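A minimal sketch of that count, assuming it is the number of ways to assign the 20 pooled samples (10 MD and 10 Mglio) to two groups of 10; the variable names are illustrative:

```matlab
nMD = 10;      % number of MD samples
nMglio = 10;   % number of Mglio samples

% Number of distinct ways to choose which samples form the first group.
nPerms = nchoosek(nMD + nMglio, nMD);
disp(nPerms)   % 184756
```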
ans =
184756
cutoff = 0.05;
sum(pvaluesCorr < cutoff)
ans =
2121
Estimate the FDR and q-values for each test using mafdr. The quantity pi0 is
the overall proportion of true null hypotheses in the study. It is estimated
from the simulated null distribution via bootstrap or the cubic polynomial fit.
Note: You can also manually set the value of lambda for estimating pi0.
figure;
[pFDR, qvalues] = mafdr(pvaluesCorr, 'showplot', true);
Determine the number of genes that have q-values less than the cutoff value.
Note: You may get a different number of genes due to the permutation test
and the bootstrap outcomes.
ans =
2173
The large number of genes with low FDR implies that the two groups, MD and Mglio, are
biologically distinct.
You can also empirically estimate the FDR adjusted p-values using the
Benjamini-Hochberg (BH) procedure (Benjamini et al., 1995) by setting the
mafdr input parameter BHFDR to true.
ans =
1374
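For readers unfamiliar with the BH adjustment, the following is a minimal sketch of the step-up procedure on ten made-up p-values, written with core MATLAB functions only. It illustrates the method itself; the internals of mafdr may differ.

```matlab
% Hypothetical p-values, in arbitrary order.
p = [0.042 0.001 0.216 0.008 0.074 0.039 0.205 0.041 0.060 0.212];
m = numel(p);

% Rank the p-values and scale each one by m/rank.
[ps, order] = sort(p);
adj = ps .* m ./ (1:m);

% Enforce monotonicity from the largest p-value downward.
for k = m-1:-1:1
    adj(k) = min(adj(k), adj(k+1));
end
adj = min(adj, 1);   % adjusted values cannot exceed 1

% Put the adjusted values back in the original order.
padj = zeros(size(p));
padj(order) = adj;

nSig = sum(padj < 0.05);
disp(nSig)   % 2
```

With these inputs only the two smallest p-values remain below 0.05 after adjustment, although six of the raw values are below 0.05.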
You can store the t-scores, p-values, pFDRs, q-values and BH FDR corrected
p-values together as a DataMatrix object.
Update the column name for BH FDR corrected p-values using the colnames
method of DataMatrix object.
testResults(1:23, :)
ans =
Plot the -log10 of p-values against the biological effect in a volcano plot. Note:
From the volcano plot UI, you can interactively change the p-value cutoff and
fold change limit, and export differentially expressed genes.
diffStruct =
Ctrl-click genes in the gene lists to label the genes in the plot. As seen in the
volcano plot, genes specific for neuronal based cerebella granule cells, such
as ZIC and NEUROD, are found in the up-regulated gene list, while genes
typical of the astrocytic and oligodendrocytic lineage and cell differentiation,
such as SOX2, PEA15, and ID2B, are found in the down-regulated list.
nDiffGenes = diffStruct.PValues.NRows
nDiffGenes =
327
nUpGenes =
225
nDownGenes =
102
Use Gene Ontology (GO) to annotate the differentially expressed genes. You
can look at the up-regulated genes from the analysis above. Download the
Homo sapiens annotations (gene_association.goa_human.gz file) from
Gene Ontology Current Annotations, unzip, and store it in your current
directory.
Find the indices of the up-regulated genes for Gene Ontology analysis.
huGenes = rownames(expr_cns_gcrma_eb);
for i = 1:nUpGenes
up_geneidx(i) = find(strncmpi(huGenes, up_genes{i}, length(up_genes{i})), 1);
end
Load the Gene Ontology database into a MATLAB object using the geneont
function.
GO = geneont('live',true);
Read the Homo sapiens gene annotation file. For this example, you will look
only at genes that are related to molecular function, so you only need to read
the information where the Aspect field is set to 'F'. The fields that are of
interest are the gene symbol and associated ID. In GO Annotation files these
have field names DB_Object_Symbol and GOid respectively.
HGann = goannotread('gene_association.goa_human',...
'Aspect','F','Fields',{'DB_Object_Symbol','GOid'});
HGmap = containers.Map();
for i=1:numel(HGann)
key = HGann(i).DB_Object_Symbol;
if isKey(HGmap,key)
HGmap(key) = [HGmap(key) HGann(i).GOid];
else
HGmap(key) = HGann(i).GOid;
end
end
HGmap.Count
ans =
16006
Not all of the 5758 genes on the HuGeneFL chip are annotated. For every
gene on the chip, see if it is annotated by comparing its gene symbol to the
list of gene symbols from GO. Track the number of annotated genes and the
number of up-regulated genes associated with each GO term. Note that data
in public repositories is frequently curated and updated; therefore the results
of this example might be slightly different when you use up-to-date datasets.
It is also possible that you get warnings about invalid or obsolete IDs due to
an updated Homo sapiens gene annotation file.
chipgenesCount(goid) = chipgenesCount(goid) + 1;
if (any(i == up_geneidx))
upgenesCount(goid) = upgenesCount(goid) +1;
end
end
end
gopvalues = hygepdf(upgenesCount,max(chipgenesCount),...
max(upgenesCount),chipgenesCount);
[dummy, idx] = sort(gopvalues);
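The hypergeometric probability returned by hygepdf(x,M,K,N) can be checked by hand with binomial coefficients. The sketch below uses made-up counts (M, K, N, and x are illustrative, not values from the example) and core MATLAB only:

```matlab
% Hypothetical counts, following the hygepdf(x,M,K,N) argument order.
M = 50;   % population size (for example, annotated genes on the chip)
K = 10;   % items of interest in the population (up-regulated genes)
N = 5;    % number of items drawn (genes annotated with one GO term)
x = 3;    % items of interest among those drawn

% Probability of exactly x items of interest in N draws
% without replacement.
p = nchoosek(K,x) * nchoosek(M-K,N-x) / nchoosek(M,N);
disp(p)
```

A small probability indicates that the observed number of up-regulated genes for a term would be unlikely if up-regulated genes were spread over GO terms at random.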
Inspect the significant GO terms and select the terms related to specific
molecular functions to build a sub-ontology that includes the ancestors of the
terms. Visualize this ontology using the biograph function. You can also color
the graph nodes. In this example, the red nodes are the most significant,
while the blue nodes are the least significant gene ontology terms. Note: The
GO terms returned may differ from those shown due to frequent updates to
the Homo sapiens gene annotation file.
fcnAncestors = GO(getancestors(GO,idx(1:5)))
[cm acc rels] = getmatrix(fcnAncestors);
BG = biograph(cm,get(fcnAncestors.Terms,'name'))
for i=1:numel(acc)
pval = gopvalues(acc(i));
color = [(1-pval).^(1) pval.^(1/8) pval.^(1/8)];
set(BG.Nodes(i),'Color',color);
end
view(BG)
You can query the pathway information of the differentially expressed genes
from the KEGG pathway database through KEGG’s Web Service.
Following are a few pathway maps with the genes in the up-regulated gene
list highlighted:
Cell Cycle
References
[1] Pomeroy, S.L., Tamayo, P., Gaasenbeek, M., Sturla, L.M., Angelo, M.,
McLaughlin, M.E., Kim, J.Y., Goumnerova, L.C., Black, P.M., Lau, C., Allen,
J.C., Zagzag, D., Olson, J.M., Curran, T., Wetmore, C., Biegel, J.A., Poggio, T.,
Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D.N., Mesirov,
J.P., Lander, E.S., and Golub, T.R. (2002). Prediction of central nervous
system embryonal tumour outcome based on gene expression. Nature,
415(6870), 436-442.
[3] Dudoit, S., Shaffer, J.P., and Boldrick, J.C. (2003). Multiple hypothesis
testing in microarray experiments. Statistical Science, 18, 71-103.
[4] Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery
rate: a practical and powerful approach to multiple testing. J. Royal Stat.
Soc., B 57, 289-300.
5
Phylogenetic Analysis
Building a Phylogenetic Tree
The origin of modern humans is a heavily debated issue that scientists have
recently tackled by using mitochondrial DNA (mtDNA) sequences. One
hypothesis explains the limited genetic variation of human mtDNA in terms
of a recent common genetic ancestry, implying that all modern population
mtDNA originated from a single woman who lived in Africa less than 200,000
years ago.
allows sequences to be traced through one genetic line and all polymorphisms
assumed to be caused by mutations.
Neanderthal DNA
The ability to isolate mitochondrial DNA (mtDNA) from palaeontological
samples has allowed genetic comparisons between extinct species and closely
related nonextinct species. The reasons for isolating mtDNA instead of
nuclear DNA in fossil samples have to do with the fact that:
References
Ovchinnikov I., et al. (2000). Molecular analysis of Neanderthal DNA from
the northern Caucasus. Nature 404(6777), 490–493.
Krings M., et al. (1997). Neanderthal DNA sequences and the origin of
modern humans. Cell 90 (1), 19–30.
1 Use the MATLAB Help browser to search for data on the Web. In the
MATLAB Command Window, type
web('http://www.ncbi.nlm.nih.gov')
A separate browser window opens with the home page for the NCBI Web
site.
2 Search the NCBI Web site for information. For example, to search for the
human taxonomy, from the Search list, select Taxonomy, and in the for
box, enter hominidae.
3 Select the taxonomy link for the family Hominidae. A page with the
taxonomy for the family is shown.
you can select a method for calculating the hierarchical clustering distances
used to build a tree.
After locating the GenBank accession codes for the sequences you are
interested in studying, you can create a phylogenetic tree with the data. For
information on locating accession codes, see “Searching NCBI for Phylogenetic
Data” on page 5-5.
In the following example, you will use the Jukes-Cantor method to calculate
distances between sequences, and the Unweighted Pair Group Method
Average (UPGMA) method for linking the tree nodes.
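As background, the Jukes-Cantor distance corrects the observed fraction of differing sites, p, for multiple substitutions at the same site: d = -(3/4) ln(1 - (4/3) p). The sketch below applies the formula to two short made-up, pre-aligned sequences with no gaps. It illustrates the formula only; seqpdist handles alignment, gaps, and ambiguous symbols, which this sketch does not.

```matlab
% Hypothetical aligned DNA sequences of equal length, no gaps.
s1 = 'ACGTACGTAC';
s2 = 'ACGTTCGTAA';

% Observed proportion of sites that differ.
pdiff = sum(s1 ~= s2) / numel(s1);

% Jukes-Cantor correction for multiple substitutions per site.
dJC = -3/4 * log(1 - 4/3 * pdiff);
disp([pdiff dJC])
```

The corrected distance is always at least as large as the raw proportion of differences, and the gap between the two grows as the sequences diverge.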
2 Retrieve sequence data from the GenBank database and copy into the
MATLAB environment.
distances = seqpdist(seqs,'Method','Jukes-Cantor','Alphabet','DNA');
tree = seqlinkage(distances,'UPGMA',seqs)
h = plot(tree,'orient','top');
ylabel('Evolutionary distance')
set(h.terminalNodeLabels,'Rotation',65)
'Chimp_Schweinfurthii' 'AF176722';
'Chimp_Vellerosus' 'AF315498';
'Chimp_Verus' 'AF176731';
};
2 Get additional sequence data from the GenBank database, and copy the
data into the next indices of a MATLAB structure.
distances = seqpdist(seqs,'Method','Jukes-Cantor','Alphabet','DNA');
tree = seqlinkage(distances,'UPGMA',seqs);
h = plot(tree,'orient','top');
ylabel('Evolutionary distance')
set(h.terminalNodeLabels,'Rotation',65)
names = get(tree,'LeafNames')
names =
'German_Neanderthal'
'Russian_Neanderthal'
'European_Human'
'Chimp_Troglodytes'
'Chimp_Schweinfurthii'
'Chimp_Verus'
'Chimp_Vellerosus'
'Puti_Orangutan'
'Jari_Orangutan'
'Mountain_Gorilla_Rwanda'
'Eastern_Lowland_Gorilla'
'Western_Lowland_Gorilla'
From the list, you can determine the indices for its members. For example,
the European Human leaf is the third entry.
2 Find the closest species to a selected species in a tree. For example, find
the species closest to the European human.
[h_all,h_leaves] = select(tree,'reference',3,...
'criteria','distance',...
'threshold',0.6);
h_all is a logical vector that flags every node (branch or leaf) within a
patristic distance of 0.6 of the European human leaf, while h_leaves is a
logical vector that flags only the leaf nodes within the same distance.
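The selection can be tried without downloading any sequences. The following is a minimal sketch, assuming a four-leaf toy tree built from hypothetical distances (all toy_ and sel_ names are illustrative). It looks up the reference leaf by name, because the tree can list its leaves in a different order than the input.

```matlab
% Toy pairwise distances for four leaves, in the vector order
% (A,B) (A,C) (A,D) (B,C) (B,D) (C,D). Hypothetical values.
toy_dist = [0.2 1.0 1.0 1.0 1.0 0.4];
toy_tree = seqlinkage(toy_dist,'average',{'A','B','C','D'});

% Look up the reference leaf by name instead of reading its position
% from the displayed list.
toy_names = get(toy_tree,'LeafNames');
ref = find(strcmp(toy_names,'A'));

% Select the nodes within a patristic distance of 0.6 of leaf A.
[sel_all,sel_leaves] = select(toy_tree,'reference',ref,...
                              'criteria','distance',...
                              'threshold',0.6);

% Both outputs are logical vectors, so sel_leaves indexes the
% leaf names directly.
near_names = toy_names(sel_leaves);
```

Because sel_leaves is logical, negating it with ~ gives the complementary set of leaves, which is the form the pruning step uses.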
subtree_names = names(h_leaves)
subtree_names =
'German_Neanderthal'
'Russian_Neanderthal'
'European_Human'
'Chimp_Schweinfurthii'
'Chimp_Verus'
'Chimp_Troglodytes'
4 Extract a subtree from the whole tree by removing unwanted leaves. For
example, prune the tree to species within 0.6 of the European human
species.
leaves_to_prune = ~h_leaves;
pruned_tree = prune(tree,leaves_to_prune)
h = plot(pruned_tree,'orient','top');
ylabel('Evolutionary distance')
set(h.terminalNodeLabels,'Rotation',65)
The MATLAB software returns information about the new subtree and
plots the pruned phylogenetic tree in a Figure window.
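Pruning can be checked on a small tree before applying it to real data. The following is a minimal sketch, assuming the same kind of four-leaf toy tree built from hypothetical distances; the variable names are illustrative. It passes leaf indices to prune and confirms that the original tree object is left unchanged.

```matlab
% Toy tree from hypothetical pairwise distances, in the vector order
% (A,B) (A,C) (A,D) (B,C) (B,D) (C,D).
toy_dist  = [0.2 1.0 1.0 1.0 1.0 0.4];
toy_tree  = seqlinkage(toy_dist,'average',{'A','B','C','D'});
toy_names = get(toy_tree,'LeafNames');

% Keep leaves A and B; convert the logical mask of unwanted leaves
% to leaf indices before pruning.
keep = ismember(toy_names,{'A','B'});
toy_pruned = prune(toy_tree,find(~keep));

% prune returns a new object, so the original keeps all its leaves.
n_before     = get(toy_tree,'NumLeaves');
n_after      = get(toy_pruned,'NumLeaves');
pruned_names = get(toy_pruned,'LeafNames');
```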
phytreeviewer(pruned_tree)
You can interactively change the appearance of the tree using the app.
For information on using this app, see “Phylogenetic Tree App Reference”
on page 5-16.
The Phylogenetic Tree app can read data from Newick-formatted and
ClustalW-formatted tree files.
This procedure uses the phylogenetic tree data stored in the file pf00002.tree
as an example. The data was retrieved from the protein family (PFAM) Web
database and saved to a file using the accession number PF00002 and the
function gethmmtree.
1 Create a phytree object. For example, to create a phytree object from tree
data in the file pf00002.tree, type
tr = phytreeread('pf00002.tree')
Phylogenetic Tree App Reference
phytreeviewer(tr)
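Before or after opening the app, you can query the phytree object from the command line. The following is a minimal sketch, assuming the example file pf00002.tree is on the MATLAB path; the variable names n_leaves, n_branches, n_nodes, and leaf_names are illustrative.

```matlab
% Read the example Newick file named above.
tr = phytreeread('pf00002.tree');

% Query basic properties of the phytree object.
n_leaves   = get(tr,'NumLeaves');     % number of leaf nodes
n_branches = get(tr,'NumBranches');   % number of branch nodes
n_nodes    = get(tr,'NumNodes');      % leaves plus branches
leaf_names = get(tr,'LeafNames');     % cell array of leaf labels
```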
File Menu
The File menu includes the standard commands for opening and closing a
file, and it includes commands to use phytree object data from the MATLAB
Workspace. The File menu commands are shown below.
A second Phylogenetic Tree viewer opens with tree data from the selected
file.
Open Command
Use the Open command to read tree data from a Newick-formatted file and
display that data in the app.
2 Select a directory, select a Newick-formatted file, and then click Open. The
app uses the file extension .tree for Newick-formatted files, but you can
use any Newick-formatted file with any extension.
The app replaces the current tree data with data from the selected file.
The app replaces the current tree data with data from the selected object.
Save As Command
After you create a phytree object or prune a tree from existing data, you can
save the resulting tree in a Newick-formatted file. The sequence data used to
create the phytree object is not saved with the tree.
2 In the Filename box, enter the name of a file. The toolbox uses the file
extension .tree for Newick-formatted files, but you can use any file
extension.
3 Click Save.
The app saves tree data without the deleted branches, and it saves changes
to branch and leaf names. Formatting changes such as branch rotations,
collapsed branches, and zoom settings are not saved in the file.
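A command-line counterpart to saving from the app is to write a phytree object with phytreewrite and read it back with phytreeread. The following is a minimal sketch, assuming a toy tree built from hypothetical distances and a temporary file name; it is not a description of what the app does internally.

```matlab
% Toy tree from hypothetical pairwise distances, in the vector order
% (A,B) (A,C) (A,D) (B,C) (B,D) (C,D).
toy_dist = [0.2 1.0 1.0 1.0 1.0 0.4];
toy_tree = seqlinkage(toy_dist,'average',{'A','B','C','D'});

% Write the tree to a temporary Newick-formatted file.
toy_file = [tempname '.tree'];
phytreewrite(toy_file,toy_tree);

% Read the file back and compare the leaf names of both trees.
toy_back = phytreeread(toy_file);
names_in  = sort(get(toy_tree,'LeafNames'));
names_out = sort(get(toy_back,'LeafNames'));
same_leaves = isequal(names_in(:),names_out(:));

% Remove the temporary file.
delete(toy_file);
```

The file holds only names, topology, and branch lengths, which is consistent with the note above that the sequence data is not saved with the tree.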
1 Select File > Export to New Viewer, and then select either With
Hidden Nodes or Only Displayed.
1 Select File > Export to Workspace, and then select either With Hidden
Nodes or Only Displayed.
2 In the Workspace variable name box, enter the name for your
phylogenetic tree data. For example, enter MyTree.
3 Click OK.
1 From the File menu, select Print to Figure, and then select either With
Hidden Nodes or Only Displayed.
'angular'
'equalangle'
'equaldaylight'
3 Select the Display Labels you want on your figure. You can select any
combination of the options, from all to none.
The Print Preview window opens, which you can use to select page
formatting options.
2 Select the page formatting options and values you want, and then click
Print.
Print Command
Use the Print command to print your phylogenetic tree after you select
formatting options with the Print Preview command.
2 From the Name list, select a printer, and then click OK.
Tools Menu
Use the Tools menu to:
The Tools menu and toolbar contain most of the commands specific to trees
and phylogenetic analysis. Use these commands and modes to edit and format
your tree interactively. The Tools menu commands are:
Inspect Mode
Viewing a phylogenetic tree in the Phylogenetic Tree app provides a rough
idea of how closely related two sequences are. However, to see exactly how
closely related two sequences are, measure the distance of the path between
them. Use the Inspect command to display and measure the path between
two sequences.
1 Select Tools > Inspect, or from the toolbar, click the Inspect Tool Mode
icon.
2 Click a branch or leaf node (selected node), and then hover your cursor over
another branch or leaf node (current node).
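The app highlights the path between the two nodes and displays the path
length in the pop-up window. The path length is the patristic distance,
that is, the sum of the branch lengths along the path. For a tree built
from seqpdist output, these branch lengths derive from the pairwise
distances that the seqpdist function calculates.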
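The same path lengths are available from the command line through the pdist method of a phytree object. The following is a minimal sketch, assuming a four-leaf toy tree built from hypothetical distances; the variable names are illustrative, and leaf positions are looked up by name because the tree can reorder its leaves.

```matlab
% Toy tree from hypothetical pairwise distances, in the vector order
% (A,B) (A,C) (A,D) (B,C) (B,D) (C,D).
toy_dist  = [0.2 1.0 1.0 1.0 1.0 0.4];
toy_tree  = seqlinkage(toy_dist,'average',{'A','B','C','D'});
toy_names = get(toy_tree,'LeafNames');

% Square matrix of pairwise patristic distances between the leaves.
pat = pdist(toy_tree,'Squareform',true);

% Locate leaves by name, then read the distances of interest.
iA = find(strcmp(toy_names,'A'));
iB = find(strcmp(toy_names,'B'));
iC = find(strcmp(toy_names,'C'));
d_AB = pat(iA,iB);   % path length between close relatives
d_AC = pat(iA,iC);   % path length across the root
```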
2 Point to a branch.
The paths, branch nodes, and leaf nodes below the selected branch appear
in gray, indicating you selected them to collapse (hide from view).
The app hides the display of paths, branch nodes, and leaf nodes below the
selected branch. However, it does not remove the data.
Tip After collapsing nodes, you can redraw the tree by selecting Tools >
Fit to Window.
1 Select Tools > Rotate Branch, or from the toolbar, click the Rotate
Branch Mode icon.
The branch and leaf nodes below the selected branch node rotate 180
degrees around the branch node.
1 Select Tools > Rename, or from the toolbar, click the Rename
4 To accept your changes and close the text box, click outside of the text box.
To save your changes, select File > Save As.
1 Select Tools > Prune, or from the toolbar, click the Prune (delete)
For a leaf node, the branch line connected to the leaf appears in gray. For a
branch node, the branch lines below the node appear in gray.
Note If you delete nodes (branches or leaves), you cannot undo the
changes. The Phylogenetic Tree app does not have an Undo command.
The tool removes the branch from the figure and rearranges the other
nodes to balance the tree structure. It does not recalculate the phylogeny.
Tip After pruning nodes, you can redraw the tree by selecting Tools > Fit
to Window.
1 Select Tools > Zoom In, or from the toolbar, click the Zoom In icon.
The app activates zoom in mode and changes the cursor to a magnifying
glass.
2 Place the cursor over the section of the tree diagram you want to enlarge
and then click.
4 Move the cursor over the tree diagram, left-click, and drag the diagram to
the location you want to view.
Tip After zooming and panning, you can reset the tree to its original view
by selecting Tools > Reset View.
Select Submenu
Select a single branch or leaf node by clicking it. Select multiple branch or
leaf nodes by Shift-clicking the nodes, or click-dragging to draw a box around
nodes.
Use the Select submenu to select specific branch and leaf nodes based on
different criteria.
After selecting nodes using one of the previous commands, hide and show the
nodes using the following commands:
• Collapse Selected
• Expand Selected
• Expand All
Clear all selected nodes by clicking anywhere else in the Phylogenetic Tree
app.
3 Click OK.
The branch or leaf nodes that match the expression appear in red.
After selecting nodes using the Find Leaf/Branch command, you can hide
and show the nodes using the following commands:
• Collapse Selected
• Expand Selected
• Expand All
The data for branches and leaves that you hide using the Collapse/Expand
or Collapse Selected command are not removed from the tree. You can
display selected or all hidden data using the Expand Selected or Expand
All command.
to Window command to redraw the tree diagram to fill the entire Figure
window.
Options Submenu
Use the Options command to select the behavior for the zoom and pan modes.
Window Menu
Use the Window menu to switch to any open window.
Help Menu
Use the Help menu to select quick links to the Bioinformatics Toolbox
documentation for phylogenetic analysis functions, tutorials, and the
Phylogenetic Tree app (phytreeviewer) reference.