4 Bioinformatics databases
Databases
What are Databases?
• A database is a structured collection of
information.
• A database consists of basic units called
records or entries.
• Each record consists of fields, which hold pre-
defined data related to the record.
• For example, a protein database would have
protein sequences as records and protein
properties as fields (e.g., name of protein,
length, amino-acid sequence, …)
• A database can be thought of as a large
table, where the rows represent records and
the columns represent fields.
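The table view can be sketched in a few lines of Python. This is only an illustration: the protein names and sequences below are invented placeholders, not entries from a real database, and the helper name `select` is hypothetical.

```python
# Each dict is one record (row); each key is one field (column).
# All values are invented placeholders, not real database entries.
records = [
    {"name": "example protein A", "sequence": "MKTAYIAKQR"},
    {"name": "example protein B", "sequence": "MSDNE"},
]

# A derived field: the length is computed from the sequence field.
for record in records:
    record["length"] = len(record["sequence"])


def select(rows, field, predicate):
    """Return the records whose value in `field` satisfies `predicate`."""
    return [row for row in rows if predicate(row[field])]


# A query: all records whose sequence is longer than 5 residues.
long_proteins = select(records, "length", lambda n: n > 5)
print([row["name"] for row in long_proteins])
```

A query against a real database does the same thing in principle: it filters records by the values held in one or more fields.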
Databases on the Internet
• Biological databases often have web
interfaces, which allow users to send queries to
the databases.
• Some databases can be accessed by different
web servers, each offering a different interface.
[Figure: request / query]
There are approximately 286,730,369,256 sequence records in the
traditional GenBank divisions as of 2011.
(Benson et al. (2011) Nucleic Acids Res 39:D32-7)
The “perfect” database
1. Comprehensive, but easy to search.
4. Cross-referenced.
5. Minimum redundancy.
Tips for the Practical Session
• We will go over several databases in a very
short time. Don’t expect to remember all the
small details. They are not important. All
respectable databases have a “HELP”
component.
• Try to:
– Learn the common features of biological databases.
– Understand the main features of every database.
– Learn how to use the online HELP.
– Judge and compare databases.
EBI/NCBI/DDBJ
• These 3 databases hold essentially the same information, synchronized
within 2-3 days (with minor differences in format and syntax)
• Serve as archives containing all sequences (single genes,
ESTs, complete genomes, etc.) derived from:
– Genome projects
– Sequencing centers
– Individual scientists
– Literature
– Patent offices
• Non-confidential data exchanged daily
• The database doubles in size approximately every 18 months.
• Heterogeneous: sequence length, genomes, variants,
fragments, …
• Minimum sequence size: 10 bp
• Archive: nothing is ever removed -> highly redundant!
• Full of errors: in sequences, in annotations, in CDS
attribution, …
• No consistency of annotations: most annotations are
done by the submitters, so the quality, completeness,
and currency of the information vary widely
• Unexpected information you can find:
• ACCESSION Z71230
FT source 1..124
FT /db_xref="taxon:4097"
FT /organelle="plastid:chloroplast"
FT /organism="Nicotiana tabacum"
FT /isolate="Cuban Cahibo cigar, gift from President Fidel
FT Castro"
• ACCESSION NC_001610
FT source 1..17084
FT /chromosome="complete mitochondrial genome"
FT /db_xref="taxon:9267"
FT /organelle="mitochondrion"
FT /organism="Didelphis virginiana"
FT /dev_stage="adult"
FT /isolate="fresh road killed individual"
FT /tissue_type="liver"
• There are 126,551,501,141 bases in
135,440,924 sequence records in the traditional
GenBank divisions and 191,401,393,188 bases
in 62,715,288 sequence records in the WGS
division as of April 2011.
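Dividing bases by records gives a feel for how short a typical entry is. The sketch below only redoes the arithmetic on the April 2011 figures quoted above; the function name `mean_length` is ours.

```python
# Figures quoted above (GenBank, April 2011).
traditional_bases = 126_551_501_141
traditional_records = 135_440_924
wgs_bases = 191_401_393_188
wgs_records = 62_715_288


def mean_length(bases, records):
    """Average number of bases per sequence record."""
    return bases / records


print(f"traditional: {mean_length(traditional_bases, traditional_records):.0f} bases/record")
print(f"WGS:         {mean_length(wgs_bases, wgs_records):.0f} bases/record")
print(f"total bases: {traditional_bases + wgs_bases:,}")
```

The averages come out at under 1,000 bases per record for the traditional divisions and a little over 3,000 for WGS, which fits the description of the archive as heterogeneous and full of fragments.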
Sequences
1 The LOCUS field consists of five different subfields:
3 ACCESSION (Z92910) - Unique identifier assigned to a complete
sequence record. This number never changes, even if the record is
modified. An accession number is a combination of letters and
numbers that are usually in the format of one letter followed by five
digits (e.g., M12345) or two letters followed by six digits (e.g.,
AC123456).
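The two usual accession shapes can be checked with a regular expression. This is a sketch of those two shapes only; other accession formats exist (for example NC_001610, seen earlier), and the pattern below rejects them by design. The function name is hypothetical.

```python
import re

# The two "usual" shapes: 1 letter + 5 digits, or 2 letters + 6 digits.
# Other accession shapes exist (e.g. NC_001610) and are not matched here.
ACCESSION_PATTERN = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def has_usual_accession_format(accession):
    """True if `accession` is one letter + five digits or two letters + six digits."""
    return ACCESSION_PATTERN.fullmatch(accession) is not None


for candidate in ["Z92910", "M12345", "AC123456", "NC_001610", "Z9291"]:
    print(candidate, has_usual_accession_format(candidate))
```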
4 VERSION (Z92910.1) - Identification number assigned to a single,
specific sequence in the database. This number is in the format
“accession.version.” If any changes are made to the sequence data,
the version part of the number will increase by one. For example
U12345.1 becomes U12345.2. A version number of Z92910.1 for this
HFE sequence indicates that the sequence data has not been
altered since its original submission.
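The "accession.version" format is easy to take apart and increment. A minimal sketch, with hypothetical helper names `split_version` and `next_version`:

```python
def split_version(identifier):
    """Split 'accession.version' into (accession, version number)."""
    accession, _, version = identifier.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not in accession.version format: {identifier!r}")
    return accession, int(version)


def next_version(identifier):
    """Identifier the record would carry after one change to its sequence data."""
    accession, version = split_version(identifier)
    return f"{accession}.{version + 1}"


print(split_version("Z92910.1"))
print(next_version("U12345.1"))
```

The accession part stays fixed while only the version part increases, which is why a bare accession always leads to the current record and "accession.version" pins one specific sequence.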
5 GI (1890179) - Also a sequence identification number. Whenever a
sequence is changed, the version number is increased and a new GI
is assigned. If a nucleotide sequence record contains a protein
translation of the sequence, the translation will have its own GI
number.
6 KEYWORDS (haemochromatosis; HFE gene) - A keyword can be
any word or phrase used to describe the sequence. Keywords are
not taken from a controlled vocabulary. Notice that in this record the
keyword, "haemochromatosis," employs British spelling, rather than
the American "hemochromatosis." Many records have no keywords.
A period is placed in this field for records without keywords.
7 SOURCE (human) - Usually contains an abbreviated or common
name of the source organism.
The FEATURES table
A feature is simply an annotation that describes a portion of
the sequence.
source - An obligatory feature. The source gives the length of
the entire sequence, the scientific name of the source
organism, and the Taxon ID number.
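The organism name and Taxon ID can be pulled out of a source feature with a small parser. The lines below are copied from the NC_001610 record shown earlier. The parser is a simplified sketch that assumes every qualifier fits on one line; real records use continuation lines, which this does not handle.

```python
import re

# Source feature lines copied from the NC_001610 record shown earlier.
FT_LINES = """\
FT source 1..17084
FT /db_xref="taxon:9267"
FT /organelle="mitochondrion"
FT /organism="Didelphis virginiana"
""".splitlines()

# Matches a single-line qualifier of the form: FT /key="value"
QUALIFIER = re.compile(r'^FT\s+/(\w+)="([^"]*)"$')


def parse_qualifiers(lines):
    """Collect /key="value" qualifiers from single-line FT entries."""
    qualifiers = {}
    for line in lines:
        match = QUALIFIER.match(line.strip())
        if match:
            qualifiers[match.group(1)] = match.group(2)
    return qualifiers


source = parse_qualifiers(FT_LINES)
taxon_id = int(source["db_xref"].split(":")[1])
print(source["organism"], taxon_id)
```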
gene - Sequence portion that delineates the beginning and
end of a gene.
exon - Sequence segment that contains an exon. Exons may
contain portions of 5' and 3’ UTRs (untranslated regions). The
name of the gene to which the exon belongs and exon number
are provided.
CDS - Sequence of nucleotides that code for amino acids of the
protein product (coding sequence).
The CDS begins with the first nucleotide of the start codon and
ends with the third nucleotide of the stop codon.
This feature includes the translation into amino acids and may
also contain gene name, gene product function, link to protein
sequence record, and cross-references to other database
entries.
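The start-to-stop definition of a CDS can be checked, and the translation reproduced, with a short function. This is a sketch under stated assumptions: the standard genetic code, ATG as the only accepted start codon, no ambiguous bases, and an invented toy sequence. Real records may use other translation tables or start codons.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a CDS that runs from the start codon through the stop codon."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3 (start + stop at least)")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG":
        raise ValueError("CDS does not begin with the ATG start codon")
    if CODON_TABLE[codons[-1]] != "*":
        raise ValueError("CDS does not end with a stop codon")
    # The stop codon is part of the CDS but contributes no amino acid.
    return "".join(CODON_TABLE[codon] for codon in codons[:-1])


print(translate_cds("ATGGCCTAA"))  # toy sequence, not from a real record
```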
intron - Transcribed but spliced-out parts. Intron number is
shown.
polyA_signal - Identifies the sequence portion required for
endonuclease cleavage of an mRNA transcript. Consensus
sequence for the polyA signal is AATAAA.
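Locating the consensus in a sequence is a plain substring search. The sketch below reports 1-based positions, as feature tables do. It looks for exact matches only, so variant signals are missed, and the toy sequence is invented.

```python
def find_polya_signals(sequence, signal="AATAAA"):
    """Return the 1-based start positions of every exact match of `signal`."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(signal)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        start = sequence.find(signal, start + 1)
    return positions


toy_sequence = "GGCAATAAAGTTTCAATAAACC"  # invented example
print(find_polya_signals(toy_sequence))
```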
BASE COUNT & ORIGIN
BASE COUNT - Base Count gives the total number of adenine
(A), cytosine (C), guanine (G), and thymine (T) bases in the
sequence.
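The base count is a simple tally. A minimal sketch on an invented sequence; characters other than A, C, G and T are ignored here, which is an assumption of this example.

```python
from collections import Counter


def base_count(sequence):
    """Count A, C, G and T in a sequence, ignoring case and other characters."""
    counts = Counter(sequence.upper())
    return {base: counts[base] for base in "ACGT"}


print(base_count("aattccggat"))  # invented example sequence
```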
Molecule-specific and topic-specific databases
AsDb - Aberrant Splicing db
ACUTS - Ancient conserved untranslated DNA sequences db
Codon Usage Db
EPD - Eukaryotic Promoter db
HOVERGEN - Homologous Vertebrate Genes db
IMGT - ImMunoGeneTics db [Mirror at EBI]
ISIS - Intron Sequence and Information System
RDP - Ribosomal db Project
gRNAs db - Guide RNA db
PLACE - Plant cis-acting regulatory DNA elements db
PlantCARE - Plant cis-acting regulatory DNA elements db
sRNA db - Small RNA db
ssu rRNA - Small ribosomal subunit db
lsu rRNA - Large ribosomal subunit db
5S rRNA - 5S ribosomal RNA db
tmRNA Website
tmRDB - tmRNA db
tRNA - tRNA compilation from the University of Bayreuth
uRNADB - uRNA db
RNA editing - RNA editing site
RNAmod db - RNA modification db
SOS-DGBD - Db of Drosophila DNA sequences annotated with regulatory binding sites
TelDB - Multimedia Telomere Resource
TRADAT - TRAnscription Databases and Analysis Tools
Subviral RNA db - Small circular RNAs db (viroid and viroid-like)
MPDB - Molecular probe db
OPD - Oligonucleotide probe db
VectorDB - Vector sequence db (seems dead!)
Organism-specific databases:
FlyBase (Drosophila)
SGD (yeast)
MaizeDB (maize)
SubtiList (B. subtilis).
Entrez: the search and retrieval system that integrates information from
the National Center for Biotechnology Information (NCBI) databases.
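NCBI's search system can also be queried programmatically through the E-utilities web interface. The sketch below only builds an ESearch URL and sends nothing. The database name, search term and `retmax` value are illustrative choices, and `build_esearch_url` is a hypothetical helper; check the E-utilities documentation before relying on any of them.

```python
from urllib.parse import urlencode

# Base address of the ESearch utility (part of NCBI E-utilities).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=20):
    """Build (but do not send) an ESearch query URL."""
    params = {"db": database, "term": term, "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Illustrative query: the database and field tags are example choices.
url = build_esearch_url("nucleotide", "HFE[Gene Name] AND human[Organism]")
print(url)
```

Opening such a URL in a browser returns the identifiers of matching records, which is what the web interface does behind its search box.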
Databases: protein sequences
• SWISS-PROT: created in 1986 (Amos Bairoch) http://www.expasy.org/sprot/
Once an entry is in SWISS-PROT, it is no longer in TrEMBL, but it remains in EMBL (archive).
CDS: proposed and submitted to EMBL by authors or by genome projects (can be experimentally
proven or derived from gene prediction programs). TrEMBL neither translates DNA sequences nor
uses gene prediction programs: it only takes the CDS proposed by the submitting authors in the EMBL entry.
NCBI - RefSeq
• Main features of the RefSeq collection include:
1. Non-redundancy.
EMBL: The Genome divisions
http://www.ebi.ac.uk/genomes/
>sequence name
[sequence]…
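The two-line layout above, a ">" header followed by sequence lines, is what is commonly called FASTA format. A minimal parser might look like the sketch below. The record names and sequences are invented, and the parser assumes every record name is unique.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header line: start a new record
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {key: "".join(parts) for key, parts in records.items()}


# Invented example input.
example = """>toy_sequence_1
ATGGCC
TAA
>toy_sequence_2
GGGTTT
"""
print(parse_fasta(example))
```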
Protein structure database
http://genome.ucsc.edu/
Summary
After…