
5.7. Data Retrieval

Uploaded by

samimohmand80


FIGURE 5.1 Partial view of the NCBI home page (http://www.ncbi.nlm.nih.gov/; as of June, 2013). A specific database can be selected
from the drop-down menu and then the search term can be entered in the space shown. Hitting the “search” button returns the entries.

variety of high-quality resources, such as databases and tools, are made accessible to the public by the NCBI through a common retrieval system [53,54]. The databases are visible in the drop-down menu on the NCBI home page. Some of the common databases are named below. Additionally, the link “Resource List (A-Z)” located at the top left-hand corner of the NCBI home page can be clicked to obtain links to all resources, including all the databases, browsers, etc., organized alphabetically. Below the “Resource List (A-Z)” there is the link “All Resources.” This link lists a specific class of resources under one tab; hence the “databases” tab lists all databases, the “tools” tab lists all analysis tools, etc. (Figure 5.1).

Some of the widely used databases are PubMed (bibliographic database); OMIM (Online Mendelian Inheritance in Man; described above); the Entrez Nucleotide database (described above); the Gene Expression Omnibus (GEO) database (described above); the Protein database (curated sequences are in RefSeq); the Genome database (contains information on sequence, annotation, maps, chromosomes, and assemblies of all organisms whose genomes have been sequenced so far, and provides graphic display through the genomic browser Map Viewer); the Structure database (contains three-dimensional images of proteins); the Gene database (footnote i; contains information about individual genes from among the genomes represented in the RefSeq); the Taxonomy database (contains the names of all organisms that are represented by nucleotide or protein sequences); the UniGene database (contains non-redundant information on computationally identified transcripts from the same locus across species; described above); and the Epigenomics database (a relatively new database that provides epigenomic data in the context of biological sample information).

5.7 DATA RETRIEVAL

Data retrieval from different databases requires a search capability using a data retrieval system (tool). Some common data retrieval systems are Entrez/GQuery, DBGET/LinkDB, Sequence Retrieval System (SRS), and the retrieval system from EMBL-EBI. Retrieval systems are capable of simultaneously searching multiple linked databases in response to a single search query and retrieving related data from multiple databases. It is worth emphasizing at the outset that the appearance and functionality of various web-based resources are subject to frequent change. Therefore, various screenshots displayed here may change by the time this book is published. Nevertheless, knowing how to use the tools by following the screenshots presented in the book should still help the readers to understand and cope with the changes.

i
Gene is described as a searchable database of genes in the NCBI “Resource” section. However, Gene is also described as a portal
that integrates gene-specific connections in the nexus of map, sequence, expression, structure, function, citation, and homology data,
using information from a wide range of resources, such as RefSeq maps, pathways, and genome- and locus-specific resources. From a
user’s perspective, Gene acts as a single-source specialized database containing information on specific genes across different species.

BIOINFORMATICS FOR BEGINNERS


102 5. DATA, DATABASES, DATA FORMAT, DATABASE SEARCH, DATA RETRIEVAL SYSTEMS, AND GENOME BROWSERS

5.7.1 Search and Retrieval Using Entrez/GQuery

Entrez (GQuery, or global query; http://www.ncbi.nlm.nih.gov/sites/gquery) is a user-friendly, versatile, text-based search and retrieval system developed by the NCBI. It searches linked databases using a single word or combination of words entered as the search term. Thus, Entrez provides a global query system and forms a web of connections with the databases (nodes in the web of connections). The search at the NCBI can be performed either using a specific database, or using Entrez across databases simultaneously.

Figure 5.1 shows the databases (partial list) that can be selected from the drop-down menu on the NCBI home page, and then the search term can be entered in the space shown. Hitting the “search” button will usually return a number of entries. Depending on the database selected for search and retrieval, the primary source of some of the retrieved entries may be other related but specialized databases. For example, the Nucleotide, RefSeq, EST, GSS, and Gene databases all have entries on the same nucleotide sequence or part thereof, under database-specific accession numbers and descriptors. Because all these databases are linked, selecting the Nucleotide database for searching a sequence will retrieve all entries related to the sequence from other related and specialized databases as well. However, selecting a specialized database will retrieve a smaller number of entries.

Alternatively, the user can access the Entrez home page and perform a search across all databases simultaneously by entering the search term in the space shown. Hitting “Search” will return the number of entries available in each database, which is displayed next to the database name. The Entrez home page has recently undergone a change in appearance. Figures 5.2A and 5.2B show a partial view of the Entrez home page. A screenshot of the Entrez home page captured in March 2013 is shown in Figure 5.2A, whereas a screenshot captured in June 2013 is shown in Figure 5.2B. These two screenshots are shown to underscore the fact that the appearance or versions of bioinformatic tools and database home pages are subject to change, although the utility pretty much remains the same and is mostly improved. The Entrez home page states GQuery (global query) now, and the order of database display has been reorganized in the new version. Both Figures 5.2A and 5.2B show only the top portion of the retrieved information that was obtained by performing a search using the search term “Mus musculus Slco1a6.” The figures show the number of hits in various databases; PubMed has 2 and PubMed Central has 10 entries (as of June 2013), and the Nucleotide database has 10 entries (visible in Figure 5.2A but not in Figure 5.2B). Other databases not shown in the figure also have different numbers of entries. Clicking on the number or on the database name will return all the entries from that database. Without the data retrieval system, such simultaneous searching across multiple databases by entering the search term only once is not possible and individual databases have to be searched separately.

The simultaneous search capability and all-in-one display of results from multiple databases make the NCBI Entrez (GQuery) a user-friendly search and retrieval system for general users.

5.7.2 Search and Retrieval Using DBGET/LinkDB

DBGET/LinkDB (http://www.genome.jp/dbget/dbget_manual.html) is an integrated text-based search and retrieval system for major biological databases at GenomeNet. GenomeNet is the Japanese network of database and computational services for genome research and related biomedical research; it is operated by the Kyoto University Bioinformatics Center (http://www.bic.kyoto-u.ac.jp/). DBGET searches and extracts entries from a wide range of molecular biology databases, and LinkDB searches and computes links between entries in divergent databases. Databases being searched can exist in different servers, but from the user’s point of view, they all exist in a single DBGET server [55].

DBGET/LinkDB uses three basic commands for performing search and retrieval of database entries: bfind, bget, and blink. bget retrieves database entries based on a search combination (name:identifier), bfind retrieves database entries by keywords, whereas blink retrieves related entries in a given database as well as all databases.

5.7.3 Search and Retrieval Using Sequence Retrieval System

Examples of some publicly available Sequence Retrieval System (SRS) servers are http://www.embnet.sk:8080/srs81/; http://www.dkfz.de/srs/; http://iubio.bio.indiana.edu/srs/. There are many other such web-based servers, too. Figure 5.3 shows various services available from EMBL-EBI (http://www.ebi.ac.uk/services) that include sequence retrieval functions as well. These can be accessed by clicking the “DNA & RNA” as well as “Proteins” links. A search in dbfetch (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/) requires the accession number, as shown in Figure 5.4. A search for multiple sequences can also be made by using multiple search terms and separating them using a comma.
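The comma-separated, multi-accession dbfetch query described in Section 5.7.3 can also be written as a URL. The following is a minimal sketch under stated assumptions, not EMBL-EBI's own client: the endpoint path, the parameter names (db, id, format, style), and the database name ena_sequence are assumptions that should be checked against the current dbfetch documentation, and the helper name build_dbfetch_url is hypothetical. The code only builds the URL; it does not contact any server.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the current dbfetch documentation.
DBFETCH_BASE = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def build_dbfetch_url(accessions, db="ena_sequence", fmt="fasta"):
    """Return a dbfetch-style URL for one or more accession numbers.

    Several accessions are joined with commas, mirroring the
    comma-separated search terms accepted by the dbfetch web form.
    The default db and fmt values are assumptions for illustration.
    """
    ids = [a.strip() for a in accessions if a.strip()]
    if not ids:
        raise ValueError("at least one accession number is required")
    params = {"db": db, "id": ",".join(ids), "format": fmt, "style": "raw"}
    # safe="," keeps the comma separator readable instead of encoding it as %2C.
    return DBFETCH_BASE + "?" + urlencode(params, safe=",")


# AF213260 is the accession of the original mouse Oatp-5 submission.
print(build_dbfetch_url(["AF213260"]))
```

Passing a list with more than one accession produces a single request for all of them, which is the scripted counterpart of typing several accession numbers separated by commas into the search box.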



5.8. AN EXAMPLE OF RETRIEVAL OF MRNA/GENE INFORMATION 103

FIGURE 5.2 Partial view of the Entrez home page at two different dates. (A) A screenshot of the Entrez home page captured in
March 2013. (B) A screenshot of the Entrez home page captured in June 2013. These two screenshots are shown to underscore the fact that the
home page is subject to change, although the utility pretty much remains the same and is mostly improved. The Entrez home page states
GQuery now. A user can perform a search across all the databases simultaneously by entering the search term in the space shown. Hitting
“Search” will return the number of entries available in each database, displayed next to the database name. This may change with time as
new information is added to various databases.
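The "one search term, many databases" idea behind Entrez/GQuery can be sketched programmatically. This example assumes NCBI's E-utilities web service (its ESearch endpoint), which is not covered in this section; the function names are hypothetical, and the database identifiers pubmed, pmc, and nuccore are assumed to correspond to PubMed, PubMed Central, and Nucleotide. The code only constructs the request URLs and makes no network call.

```python
from urllib.parse import urlencode

# Assumed E-utilities ESearch endpoint (not described in the text).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db):
    """Build an ESearch URL that would query one database for `term`."""
    params = {"db": db, "term": term, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


def per_database_urls(term, databases):
    """Reuse a single search term across several databases."""
    return {db: build_esearch_url(term, db) for db in databases}


# Same search term as in Figure 5.2; database identifiers are assumptions.
urls = per_database_urls("Mus musculus Slco1a6", ["pubmed", "pmc", "nuccore"])
for db, url in urls.items():
    print(db, url)
```

Each URL corresponds to one row of the GQuery result page: the search term is entered once and applied to every listed database.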

5.8 AN EXAMPLE OF RETRIEVAL OF MRNA/GENE INFORMATION

Information about an mRNA or gene (footnote j) can be retrieved by selecting the “Nucleotide” (database) from the drop-down menu on the NCBI home page (Figure 5.1). The Nucleotide (database) provides a link to the grand collection of all nucleotide sequences from the primary as well as the specialized databases. A search using the mRNA or gene name in the Nucleotide databases retrieves many records, and depending on the search term the number of records may sometimes be too many to go through individually. The Nucleotide database can be searched in different ways to focus the search more

j
The display of information output associated with any database is subject to change from time to time. This is because there is
continuing effort to improve the information output and display features. Therefore, the graphic displays shown in the figures are
not expected to remain the same all the time. Nevertheless, knowing how to harness and use the information should prepare readers
to deal with any such changes.


FIGURE 5.3 Data Retrieval at EMBL-EBI. Nucleotide sequence data can be retrieved by clicking the “DNA & RNA” link and accessing the
ENA resource. Protein sequence data can be retrieved by clicking the “Protein” link and accessing the protein resource, such as UniProt.
(Source: EMBL-EBI, http://www.ebi.ac.uk/services).

FIGURE 5.4 Search and retrieval using dbfetch, ENA, and EB-eye. Specific sequence information from the EMBL-Bank can be retrieved
using dbfetch (upper panel), ENA (middle panel), and EB-eye (lower panel). These are partial screenshots.


FIGURE 5.5 GenBank information on mouse Oatp-5. The upper panel shows the top portion of the GenBank record of the original
submission of mouse Oatp-5 mRNA along with its accession number and the version. Below the accession number is the link to the graphics
(circled). Clicking the graphics link will return the graphics of the mRNA and the protein shown in the lower panel. The lower panel also
shows various links and tools in the Graphics page that can help visualize different aspects of the sequence as described in the text. (Source:
http://www.ncbi.nlm.nih.gov/, Nucleotide; information as of June 2013)

narrowly, such as by utilizing the accession or GI number or even using the names of the authors of a submission. Of course, the user has to know this type of information. If the accession number or GI number of a sequence is known, the exact record can be directly retrieved. Currently, the GenBank nucleotide record provides a link to graphics of the sequence.

For example, Figure 5.5 (upper panel) shows the top portion of the GenBank record of the original submission of mouse Oatp-5 mRNA [56]. Mouse Oatp-5 was later given other names, such as Slc21a13 and Slco1a6, of which Slco1a6 is the name used in all databases. Slco1a6 stands for “solute carrier organic anion transporter (Slco) member 1a6.” In the text that follows, both the terms Oatp-5 and Slco1a6 will be used. The flatfile of this original submission (accession: AF213260) has been shown before. Figure 5.5 upper panel shows the link to the graphics (circled). Clicking the graphics link will return the graphics of the mRNA and the protein and other relevant information shown in Figure 5.5 lower panel, Figure 5.6, and Figure 5.7, along with various links and tools that can help visualize different aspects of the sequence. The same graphical representation (and more) can also be retrieved by using the Gene database (discussed later). The red-colored track represents the mouse Oatp-5 protein. If the cursor is brought onto the track, a drop-down box appears that contains information about the red track; for example, the Oatp-5 coding sequence spans from base 179 to 2191, and the Oatp-5 protein contains 670 amino acids (Figure 5.5, lower panel). The figure shows a sliding zoom-in/out button; moving the button to the right first zooms in the figure and ultimately reveals the nucleotide sequence on the black track at the top, along with the corresponding amino-acid sequence on the red track. Alternatively the “zoom-to-sequence” link can be clicked to reveal the sequence. This automatically moves the sliding zoom-in/out button all the way to the right.
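The coding-sequence span and the protein length quoted above are consistent with each other, as a short calculation shows. The sketch below assumes that the span 179 to 2191 is 1-based and inclusive and that it ends with a stop codon, which is not translated; the function names are illustrative only.

```python
def span_length(start, end):
    """Length of a 1-based, inclusive sequence span."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


def protein_length_from_cds(cds_start, cds_end):
    """Number of amino acids encoded by a CDS that includes its stop codon."""
    n_bases = span_length(cds_start, cds_end)
    if n_bases % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    return n_bases // 3 - 1  # the final codon is the stop codon


# Oatp-5 coding sequence in the original submission: base 179 to 2191.
print(span_length(179, 2191))              # 2013 bases
print(protein_length_from_cds(179, 2191))  # 670 amino acids
```

The 2013 bases correspond to 671 codons; removing the stop codon leaves the 670 amino acids reported for the Oatp-5 protein.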


FIGURE 5.6 The zoom-in state of the record shown in Figure 5.5 (lower panel), showing the sequence. The figure shows the nucleotide
sequence of Oatp-5 cDNA at the top, associated with the black track; and the amino-acid sequence of the Oatp-5 protein along with the codons
for each amino acid, associated with the red track. The coding sequence begins from base 179, which is the “A” of “ATG.” (Source:
http://www.ncbi.nlm.nih.gov/, Nucleotide; information as of June 2013)

FIGURE 5.7 A modified composite screenshot of the record shown in Figure 5.5 (lower panel). The information on all the tracks in
Figure 5.5 (lower panel) was separately captured and pasted to artificially create this figure. The figure shows the individual drop-down
information boxes associated with each track. Note that it is not possible to obtain all the information drop-down boxes at the same time. This
is because the cursor can be held only on one track at a time to obtain the drop-down information box.

The zoom-in state showing the sequence is shown in Figure 5.6 (partial sequence shown). It shows the nucleotide sequence of Oatp-5 cDNA at the top associated with the black track, and the amino-acid sequence of the Oatp-5 protein along with the codons for each amino acid associated with the red track. It is clear from Figure 5.6 that the coding sequence begins from base 179, which is the “A” of “ATG.” Figure 5.7 is a modified composite figure (see the legend for Figure 5.7). Compared to the original submission (AF213260.1), the RefSeq record of Oatp-5 (called Slco1a6, with an accession number NM_023718 version 3) has more


FIGURE 5.8 The graphics of the RefSeq record for Oatp-5. In the RefSeq record, Oatp-5 is identified as Slco1a6. The graphics of the
RefSeq record show additional information that was not present in the original submission, such as information on the length and span of
exons in mRNA, and the transmembrane regions in the protein. (Source: http://www.ncbi.nlm.nih.gov/, Nucleotide; information as of June 2013)

graphics available. Figure 5.8 shows the graphics of the RefSeq record, which identifies Oatp-5 as Slco1a6. The graphics of the RefSeq record show additional information that was not present in the original submission (Figures 5.5 and 5.6), such as information on the length and span of exons in mRNA and on transmembrane regions in the protein.

Figure 5.9 was created by first zooming in Figure 5.8 to reveal the sequence and then separately capturing and pasting the information about all the tracks to the screenshot; hence Figure 5.9 is an artificially created screenshot. As mentioned above, all the drop-down information boxes cannot be obtained at the same time; the cursor can be held on one track at a time so that the information about that track appears in the drop-down box. In these graphics, the green track represents the entire length (1...2804) of the Slco1a6 (Oatp5) mRNA, and is associated with an information box. The red track represents the Slco1a6 protein along with the amino-acid codons; hence the red track also shows the coding sequence (base 175...2187). The graphics of the RefSeq record also display information about all the exons. Figure 5.9 shows that exon 3, for example, is 142 bp long (235...376). Thus, base 235 through 376 of the Slco1a6 mRNA is derived from exon 3 of the Slco1a6 gene. Slco1a6 is a membrane transporter with more than 10 transmembrane regions (transmembrane domains or TMDs). Figure 5.9 shows that the first TMD of Slco1a6 is 20 amino acids long and spans from amino acid 21 to 40 (21...40). The UniProtKB/Swiss-Prot accession number of mouse Slco1a6 is Q99J94, and this is a curated entry; hence, the information has been validated.

Note that the original submission (AF213260.1) shows the coding sequence spanning from base 179 to 2191, but the RefSeq record (NM_023718.3) shows the coding sequence spanning from base 175 to 2187. This difference reflects an adjustment of four bases in the 5′-UTR of the RefSeq record compared to the original record. This was done during the creation and validation of the RefSeq record, which involved comparison with the Slco1a6 gene sequence record from the mouse reference genome [57]. Therefore, the information in the RefSeq record should be regarded as more accurate and up to date.

At the top left-hand corner of Figure 5.9, there is a link to “Display Settings”; next to it is “Graphics” (circled). The “Display Settings” is a drop-down menu that provides many options for viewing the sequence information. When the “Graphics” option is chosen, the information is displayed as graphics as in Figure 5.9 and other similar figures. Figure 5.10 shows information about the sequence in a different (“Revision History”) format. Choosing the “Revision History” option from the “Display Settings” drop-down menu displays the entire history of revision of the sequence. Figure 5.10


FIGURE 5.9 A modified composite screenshot of the record shown in Figure 5.8 showing the individual drop-down information boxes
associated with each track. See text for details.

FIGURE 5.10 The “Revision History” of Slco1a6. The upper panel shows the upper part of the list and the lower panel shows the lower
part of the list. By selecting two specific entries a comparison can be made to find out the revisions made in the sequence. The
figure shows that the first and the last entry of the Slco1a6 mRNA sequence have been selected for comparison. (Source:
http://www.ncbi.nlm.nih.gov/, Nucleotide; information as of June 2013)


FIGURE 5.11 Results of the comparison of the two versions of Slco1a6 mRNAs selected in Figure 5.10. The upper panel shows that the
comparison format of the revision history from Figure 5.10 is BLAST pairwise alignment. The lower panel shows only the first 60 bases from
the pairwise alignment. Base 1 of the Sbjct sequence starts aligning with base 5 of the Query sequence; this suggests that the original sequence
entry (Query) with the GI number 12963796 had four extra bases at the beginning of the sequence that are not present in the latest entry
(Sbjct) with the GI number 194440679. (Source: http://www.ncbi.nlm.nih.gov/, Nucleotide; information as of June 2013)
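The effect of extra leading bases on alignment positions and on coding-sequence coordinates can be illustrated with a toy example. The sequences below are made up for illustration and are not the real Slco1a6 sequence; only the four-base prefix "atcc" and the coordinates 179 and 2191 come from the text. The simple substring search stands in for the BLAST pairwise alignment and is not how BLAST works.

```python
def leading_offset(query, subject, probe_len=12):
    """Return how many extra bases `query` carries before `subject` begins.

    A plain substring search on the first `probe_len` bases of `subject`;
    a stand-in for a real pairwise alignment.
    """
    idx = query.find(subject[:probe_len])
    if idx < 0:
        raise ValueError("start of subject not found in query")
    return idx


def shift_span(start, end, offset):
    """Re-express a 1-based span after removing `offset` leading bases."""
    return start - offset, end - offset


subject = "ggcatctagaatgcc"   # hypothetical revised sequence
query = "atcc" + subject      # same sequence with four extra leading bases

offset = leading_offset(query, subject)
print(offset)                        # 4 extra bases
print(offset + 1)                    # base 1 of Sbjct aligns with base 5 of Query
print(shift_span(179, 2191, offset)) # (175, 2187)
```

Removing the four leading bases moves every downstream coordinate by four, which is why a coding sequence at 179 to 2191 in the original submission appears at 175 to 2187 in the revised record.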

upper panel shows the upper part of the list and the lower panel shows the lower part of the list (the whole list is too long to display in one page). By selecting two specific entries, a comparison can be made to find out the revisions made in the sequence. Figure 5.10 shows that the first and the last entry of the Slco1a6 mRNA sequence have been selected for comparison. Figure 5.11 shows the result of that comparison. Figure 5.11 upper panel shows that the comparison format chosen from the drop-down menu is BLAST pairwise alignment. The lower panel shows only the first 60 bases from the pairwise alignment. It shows that the alignment starts from base 5 of the original sequence entry (Query; GI number 12963796), indicating that the


FIGURE 5.12 The expanded “Tools” drop-down menu, showing its options. See text for explanation. (Source: http://www.ncbi.nlm.nih.gov/,
Nucleotide; information as of June 2013)

original sequence entry had four extra bases (atcc) at the beginning of the sequence that are not present in the latest entry (Sbjct; GI number 194440679). Hence, base 1 of the Sbjct sequence starts aligning with base 5 of the Query sequence; the rest of the Query and Sbjct sequences are identical. These extra four bases (atcc) could have been a cloning/sequencing artifact in the original submission. This is why the original submission (AF213260.1) shows the coding sequence spanning from base 179 to 2191, but the RefSeq record (NM_023718.3) shows the coding sequence spanning from base 175 to 2187, reflecting an adjustment of four bases.

In the screenshots shown in Figures 5.5 to 5.9, there is a link to a “Tools” drop-down menu, which is shown expanded in Figure 5.12 to show the available options. Three such options are circled. The “Go To” option allows the user to go to a specific position in the sequence; the “Flip Strands” option allows the user to flip the polarity of the sequence; the “Sequence Text View” option allows the user to view the entire nucleotide sequence as well as the amino-acid sequence.

A search for Oatp-5/Slco1a6 can also be performed using the Gene database. Figure 5.13 shows the results of a query in the Gene database using the search term “Oatp-5” (circled in the figure) performed in June 2013. The search retrieved just two records, one for mouse and one for rat. As indicated before, Oatp-5 is also known by two other names, Slco1a6 and Slc21a13. Each entry shows the official symbol, name, other aliases, other designations, chromosomal location, map position, and the RefSeq annotation information. For example, the second entry is mouse Oatp-5. Its official symbol is Slco1a6, its other alias is Slc21a13, and it is located on chromosome 6, spanning from nucleotide (nt) 142085768 to nt 142186149 on the reverse strand. Therefore, the mouse Oatp5 gene is 100,382 bp long, and the Gene database ID is 28254, which can be used to retrieve the record directly from the Gene database.

If the mouse Slco1a6 result is clicked to open the detailed record, this record contains 10 information fields. These fields, shown in Figure 5.14, have been collapsed to fit the screen. Three fields will be discussed here: the “Summary” field, the “Genomic context” field, and the “Genomic regions, transcripts, and products” field. Other fields can be likewise expanded and explored for their information content.

The “Summary” field with its detailed information content is shown in Figure 5.15; the figure also shows the detailed information content of the “Genomic context” field. The “Summary” field shows that the official symbol Slco1a6 is provided by the Mouse Genome Informatics (MGI) group (footnote k) [58]. The Slco1a6 gene has an ID MGI:1351906, which can be used to search for it in MGI databases. The link to MGI:1351906 can be clicked to obtain the Slco1a6 page of MGI (Figure 5.16). The inset in Figure 5.16 is actually located to the far right on the Slco1a6 page; it has been moved to fit the screenshot. The MGI Slco1a6 page shows its map

k
MGI (http://www.informatics.jax.org/) is the international database resource that provides integrated genetic, genomic, and
biological data for the laboratory mouse.


FIGURE 5.13 The result of a query in the Gene database using the search term “Oatp-5” (circled). See text for explanation. (Source:
http://www.ncbi.nlm.nih.gov/, Gene; information as of June 2013)

FIGURE 5.14 The detailed record for the mouse Slco1a6 entry in Figure 5.13. The detailed record shows 10 information fields. Each field
can be clicked to expand. (Source: http://www.ncbi.nlm.nih.gov/, Gene; information as of June 2013)


FIGURE 5.15 The detailed information content of the “Summary” and “Genomic context” fields from the mouse Slco1a6 detailed
record in Figure 5.14 after the fields are expanded. The “Summary” field (upper panel) shows that the official symbol Slco1a6 is provided by
the Mouse Genome Informatics (MGI) group. The Slco1a6 gene has an ID MGI:1351906, which can be used to search for it in the MGI database.
The “Genomic context” field (lower panel) shows the chromosomal and genomic location of the Slco1a6 gene. (Source:
http://www.ncbi.nlm.nih.gov/, Gene; information as of June 2013)
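The “Sequence” line of the Genomic context field, which the text quotes as “NC_000072.6 (142085768...142186149, complement)”, packs an accession, a version, a span, and a strand into one string. The sketch below shows one way to take such a line apart; the regular expression and the function name are illustrative assumptions, not an NCBI format specification.

```python
import re

# Accession.version, then "(start..end[, complement])". Dots and optional
# spaces between the two coordinates are both tolerated.
LOCATION_RE = re.compile(
    r"^(?P<acc>[A-Z]{2}_\d+)\.(?P<ver>\d+)\s*"
    r"\((?P<start>\d+)[.\s]+(?P<end>\d+)(?:,\s*(?P<comp>complement))?\)$"
)


def parse_genomic_location(text):
    """Split a 'Sequence' line into accession, version, span, and strand."""
    match = LOCATION_RE.match(text.strip())
    if match is None:
        raise ValueError("unrecognized location string: " + repr(text))
    start, end = int(match["start"]), int(match["end"])
    return {
        "accession": match["acc"],
        "version": int(match["ver"]),
        "start": start,
        "end": end,
        "strand": "-" if match["comp"] else "+",
        "length": end - start + 1,  # 1-based, inclusive span
    }


location = parse_genomic_location("NC_000072.6 (142085768..142186149, complement)")
print(location["accession"], location["version"])  # NC_000072 6
print(location["strand"], location["length"])      # - 100382
```

The computed length of 100382 bp matches the gene length stated in the text, and the minus strand corresponds to the “complement” keyword.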

FIGURE 5.16 Truncated screenshot of the MGI Slco1a6 page. The figure in the inset is located to the far right on the actual Slco1a6 page.
Because of the truncation of the Slco1a6 page to fit the figure, the inset has been copied and pasted close to the rest of the information. The
page shows the genetic map position of the Slco1a6 gene. The Slco1a6 page provides a lot of information and links to other information
resources (see text). (Source: http://www.informatics.jax.org/, MGI Slco1a6 page; information as of March 2013)


FIGURE 5.17 Figure created by pasting three partial screenshots from the MGI pages on Slco1a6. The upper panel was obtained by
clicking the “Detailed Genetic Map ± 1 cM” link from Figure 5.16. It shows the chromosomal location of the Slco1a6 locus in greater resolution
with respect to the surrounding loci. The middle and lower panels were obtained by clicking the “Mouse Genome Browser” link shown in the
inset in Figure 5.16. (Source: http://www.informatics.jax.org/, MGI Slco1a6 page; information as of March 2013)

position as 73.42 cM (footnote l), which is with respect to position 0 at one end of the chromosome. Mouse chromosome 6 is an acrocentric chromosome; that is, the centromere is located almost at one end, creating an extremely short p arm and a very long q arm. The 0 position in the genetic map starts at one end of the chromosome near the centromere; so the Slco1a6 gene with its genetic map position at 73.42 cM lies very close to the other end of chromosome 6 (Figure 5.17, upper panel). The MGI Slco1a6 page provides links to sequence map display on four genome browsers: VEGA, Ensembl, UCSC, and NCBI Map Viewer (Figure 5.16). However, the “Summary” field of the Gene database search record itself also provides links to the Ensembl and VEGA genome browsers (Figure 5.15). The “Sequence Map” field of the MGI Slco1a6 page also provides a “Get FASTA” link to the entire gene sequence in FASTA format from VEGA annotation of mouse genome build 38 (GRCm38; footnote m). Note that the total number of nucleotides is 122,761 bp (higher than the 100,382 bp mentioned earlier; Figure 5.16, “Sequence Map” field, link circled). The Slco1a6 page has much more information (not shown here) that can be clicked and explored. Figure 5.17 is a composite figure that has been created by pasting three partial screenshots. The upper panel was obtained by clicking the “Detailed Genetic Map ± 1 cM” link from Figure 5.16. It shows the chromosomal location of the Slco1a6 locus in greater resolution with respect to the surrounding loci. The middle and the lower panels were obtained by clicking the “Mouse Genome Browser” link (shown in the inset in Figure 5.16). Viewing sequence maps on genome browsers will be discussed later. Other links on the Slco1a6 page can be clicked to explore more information.

l
1 centiMorgan (1 cM) = 1 map unit distance between two genes or genetic markers.
m
GRC is an acronym for Genome Reference Consortium and m38 means the 38th version (build 38) of mouse genome sequence
assembly. The GRC is responsible for assembling the human and mouse reference genomes, and in that process corrects
misrepresented loci and closes remaining assembly gaps. The members of GRC include The Genome Center at Washington
University, the Wellcome Trust Sanger Institute, the EBI, and the NCBI. The GRC website (http://www.genomereference.org) is
available to view the progress of various projects, and communicate with the scientific community in general.


The “Genomic context” field with its detailed information content is shown in Figure 5.15, lower
panel. The “Location” line on the left of the Genomic context field (Figure 5.15, lower panel)
shows 6G2. This means that the Oatp5/Slco1a6 gene maps to region G, band 2 of chromosome 6. Because
mouse chromosomes are acrocentric (centromere almost at the end of the chromosome), creating an
extremely short p arm and a very long q arm, sometimes the q arm is not mentioned. Therefore, the
location can be expressed as both 6G2 and 6qG2. Below the location line is the “Sequence” line that
shows “Chromosome: 6; NC_000072.6 (142085768..142186149, complement).” The NC_000072.6 is the RefSeq
ID (accession number) for Mus musculus chromosome 6 (see Table 5.3), version 6; the
“142085768..142186149” means that the Oatp5/Slco1a6 gene spans from nt 142085768 to 142186149;
hence, the gene is 100382 bp long. The “complement” means that the gene is located on the reverse
strand of the chromosome (see footnote n). Note that this nucleotide location span of the gene is
based on the build 38 (GRCm38), which is the latest version of mouse genome sequence assembly as
this section is being written. Below the location field, there is a diagram showing the chromosomal
location of Oatp5/Slco1a6 in relation to other closely linked genes, such as Slco1a1 and Slco1a5.
The direction of the arrow is from right to left, indicating that the Oatp5/Slco1a6 gene is on the
reverse (minus) strand of the chromosome. In other words, the direction of transcription is from
right to left.

Another direct way of obtaining the gene, mRNA, and protein sequences through the Gene database is
the “NCBI Reference Sequence (RefSeq)” field. Figure 5.14 shows this field circled towards the
bottom. Expanding this field provides links to the Slco1a6 gene sequence in chromosome 6, Slco1a6
mRNA, and Slco1a6 protein (with their respective RefSeq accession numbers). By clicking these links
one can directly obtain the gene, mRNA, and protein sequences.

TABLE 5.3 RefSeq IDs (Accession Numbers) of Various Chromosomes in Human, Rat, and Mouse

         RefSeq ID of Chromosomes
Chr #    Homo sapiens    Rattus norvegicus    Mus musculus
1        NC_000001       NC_005100            NC_000067
2        NC_000002       NC_005101            NC_000068
3        NC_000003       NC_005102            NC_000069
4        NC_000004       NC_005103            NC_000070
5        NC_000005       NC_005104            NC_000071
6        NC_000006       NC_005105            NC_000072
7        NC_000007       NC_005106            NC_000073
8        NC_000008       NC_005107            NC_000074
9        NC_000009       NC_005108            NC_000075
10       NC_000010       NC_005109            NC_000076
11       NC_000011       NC_005110            NC_000077
12       NC_000012       NC_005111            NC_000078
13       NC_000013       NC_005112            NC_000079
14       NC_000014       NC_005113            NC_000080
15       NC_000015       NC_005114            NC_000081
16       NC_000016       NC_005115            NC_000082
17       NC_000017       NC_005116            NC_000083
18       NC_000018       NC_005117            NC_000084
19       NC_000019       NC_005118            NC_000085
20       NC_000020       NC_005119            -
21       NC_000021       -                    -
22       NC_000022       -                    -
X        NC_000023       NC_005120            NC_000086
Y        NC_000024       -                    NC_000087

The version numbers are not shown here because they may change when a new assembly is reported.
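The arithmetic behind the “Sequence” line can be checked with a short script. The following Python sketch is an illustration only and is not something the Gene pages provide: the helper names `parse_location` and `efetch_url` are hypothetical, and the second helper assumes the NCBI E-utilities `efetch` service with its `seq_start`, `seq_stop`, and `strand` parameters, which this section does not discuss. The sketch splits the location string into accession, version, coordinates, and strand, counts both end positions to obtain the gene length, and builds (without sending) a URL that should request the same span in FASTA format.

```python
import re
from urllib.parse import urlencode

# Matches a Gene "Sequence" entry such as
# "NC_000072.6 (142085768..142186149, complement)".
LOCATION_RE = re.compile(
    r"(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)\s*"
    r"\((?P<start>\d+)\.\.(?P<stop>\d+)(?P<comp>,\s*complement)?\)"
)

# Assumption: the public E-utilities efetch endpoint (not named in the text).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_location(text):
    """Split a RefSeq location string into its parts (hypothetical helper)."""
    match = LOCATION_RE.search(text)
    if match is None:
        raise ValueError(f"unrecognized location: {text!r}")
    start, stop = int(match["start"]), int(match["stop"])
    return {
        "accession": match["accession"],
        "version": int(match["version"]),
        "start": start,
        "stop": stop,
        # Both end positions belong to the gene, hence the +1.
        "length": stop - start + 1,
        "strand": "minus" if match["comp"] else "plus",
    }


def efetch_url(loc):
    """Build an efetch URL for the parsed span (hypothetical helper)."""
    params = {
        "db": "nuccore",
        "id": f"{loc['accession']}.{loc['version']}",
        "seq_start": loc["start"],
        "seq_stop": loc["stop"],
        # E-utilities convention: 1 = plus strand, 2 = minus strand.
        "strand": 2 if loc["strand"] == "minus" else 1,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"
```

Calling `parse_location("NC_000072.6 (142085768..142186149, complement)")` gives a length of 100382 and a strand of `"minus"`, matching the values worked out in the text. Keeping the version number in the accession matters because the coordinates are only meaningful for the sequence version on which they were reported.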
The “Genomic regions, transcripts, and products” field with its detailed information content is
shown in Figure 5.18. The upper panel shows the gene (as a horizontal green line) with all the exons
and introns, whereas the lower panel shows the sequence. The gene information is based on build 38
of the mouse genome assembly (GRCm38; circled); the field also shows the chromosome information
(chromosome 6). If the “Graphics” link in the right-hand top corner (circled) is clicked, the
chromosome 6 graphics page appears (Figure 5.19). The mRNA and protein sequences of Slco1a6 can be
directly obtained by clicking the “Go to reference sequence details” link in the right-hand top
corner (circled) (Figure 5.18).

The details of the exon and intron sequence information can be obtained by clicking “Display
Settings” in the left-hand top corner and selecting “Gene Table” from the drop-down menu
(Figure 5.20; circled; this

n
Each chromosome (in an unduplicated state) is composed of one DNA molecule; hence two DNA strands. The DNA strand whose
5′-end is closer to the centromere is called the forward strand of the chromosome; the other strand is the reverse strand (or
complement). Therefore, the direction from the p to the q arm of the chromosome is the same as the 5′→3′ direction of the forward strand.
The sense strand (coding strand) of some genes resides in the forward strand whereas that of others resides in the reverse strand
(complement) of the chromosome.
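The strand convention described in this footnote can be made concrete with a toy example. The Python sketch below is an illustration under stated assumptions: the helper names `reverse_complement` and `gene_sequence` are hypothetical, positions are taken to be 1-based and inclusive (as in the “Sequence” line of the Gene record), and the input is assumed to be the forward-strand sequence written 5′→3′. For a gene annotated as “complement”, the sense strand is obtained by taking the forward-strand span and reverse-complementing it.

```python
# Watson-Crick pairing for unambiguous DNA bases; N is left as N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the opposite strand of seq, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def gene_sequence(forward, start, stop, strand):
    """Return a gene's sense strand from a forward-strand sequence.

    start and stop are 1-based inclusive positions on the forward strand;
    strand is "plus" or "minus" (hypothetical labels for this sketch).
    """
    segment = forward[start - 1:stop]
    if strand == "minus":
        return reverse_complement(segment)
    return segment
```

With the toy forward-strand sequence `"GGATGCAA"`, the span 3 to 6 reads `ATGC` for a plus-strand gene and `GCAT` for a gene on the complement, which is why the same coordinates can describe genes transcribed in opposite directions.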



5.8. AN EXAMPLE OF RETRIEVAL OF MRNA/GENE INFORMATION 115

FIGURE 5.18 The “Genomic regions, transcripts, and products” field from the mouse Slco1a6 detailed record in Figure 5.14 after the
field is expanded. Upper panel showing the gene with its exons and introns; lower panel showing the sequence. The gene information is
based on build 38 of the mouse genome assembly (GRCm38). The RefSeq links to the mRNA and protein sequences of Slco1a6 can be directly
obtained by clicking the “Go to reference sequence details” link in the right-hand top corner (circled). (Source: http://www.ncbi.nlm.nih.gov/-Gene,
information as of June 2013)

FIGURE 5.19 The chromosome 6 graphics page, from the “Graphics” link in Figure 5.18. The span of chromosome 6 shown is approximately
0.9 × 10^6 bp long, and it contains many genes, including many transporter genes. The vertical bars represent the exons.
(Source: http://www.ncbi.nlm.nih.gov/-Gene, information as of June 2013)


FIGURE 5.20 Exon and intron sequence information for mouse Slco1a6. Partial screenshot (upper part) of the details of the exon and
intron sequence information that can be obtained by clicking “Display Settings” in the left-hand top corner and selecting “Gene Table”
from the drop-down menu (circled). (Source: http://www.ncbi.nlm.nih.gov/-Gene, information as of June 2013)

FIGURE 5.21 Partial screenshot (lower part) of the details of the exon and intron sequence information (continuation of Figure 5.20).
Each exon or intron link can be clicked to obtain the exon or intron sequence, respectively.

figure is a partial screenshot showing the upper part of the display). The lower part of the display
shows the details of the exon and intron sequence information (Figure 5.21). Each exon or intron
link can be clicked to obtain the exon or intron sequence, respectively.

Below the “Genomic regions, transcripts, and products” field there is the “Bibliography” field
(Figure 5.14). If this field is expanded by clicking, it shows a field called “GeneRIFs: Gene
References Into Functions.” The GeneRIF contains a link called “Correction,” which provides an
opportunity to the scientific community to update and add more relevant references in relation to
the gene in question. This information can be submitted to the NCBI directly.
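Returning to the “Gene Table” display described earlier, the relationship between its exon and intron rows can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of how NCBI builds the table: the helper names `introns_between` and `interval_length` are hypothetical, and the coordinates used with them are meant to be 1-based inclusive positions on a single chromosome. The idea shown is simply that each intron is the gap between two consecutive exons.

```python
def interval_length(interval):
    """Length of a 1-based inclusive (start, stop) interval."""
    start, stop = interval
    return stop - start + 1


def introns_between(exons):
    """Return the intron intervals lying between consecutive exons.

    exons is a list of 1-based inclusive (start, stop) pairs. The list is
    sorted first, so the result does not depend on the listing order.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_stop), (next_start, _) in zip(ordered, ordered[1:]):
        # Only report a gap if at least one base separates the two exons.
        if next_start > prev_stop + 1:
            introns.append((prev_stop + 1, next_start - 1))
    return introns
```

For three made-up exons at (101, 200), (351, 400), and (901, 1000), the function returns the introns (201, 350) and (401, 900). The exon and intron lengths together add up to the full span from 101 to 1000, which is a useful sanity check when copying coordinates out of a table by hand.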
