4Bioinformaticsdatabases

The document provides an introduction to sequence databases, focusing on the structure and purpose of biological databases that store DNA, RNA, and protein information. It outlines the essential components of database entries, the importance of consistent formatting, and the challenges faced by large databases, including redundancy and inadequate sequences. Additionally, it discusses the use of various biological databases and their accessibility through web interfaces, emphasizing the need for careful evaluation of the data's reliability.

Uploaded by

Mohamed Hasan

Introduction to Sequence Databases

1. DNA & RNA

2. Proteins

What are Databases?
• A database is a structured collection of
information.
• A database consists of basic units called
records or entries.
• Each record consists of fields, which hold predefined
data related to the record.
• For example, a protein database would have
protein sequences as records and protein
properties as fields (e.g., name of protein,
length, amino-acid sequence, …)

• A database can be thought of as a large
table, where the rows represent records and
the columns represent fields.

Fields (columns) and records (rows):

Accession   Name                   Length   Sequence   Enzyme
QA001       MTGA                   243      MYQWI…     yes
QA002       Ribosomal protein L9   267      MAAPV…     no
QA003       Flagellin              374      GSSIL…     no
QA004       GDPMH                  157      MFLRQ…     yes

Accession Numbers: Unique identifiers of the database records.
Ideal minimal content of an entry in a sequence database
• Sequence
• Accession number (AC)
• Taxonomic data
• References
• Annotation/Curation
• Keywords
• Cross-references
• Documentation
Sources of data:
- research groups (direct submission)
- literature supplementary information
- genome sequencing institutes
- patents
Within a database, the format needs to be
kept consistent.
A Swiss-Prot entry, in FASTA format:

>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).

MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
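The header line of the entry above packs several fields into one string. As a minimal sketch, assuming the header always follows the `>db|accession|entry_name description` layout seen in this one example, it can be split like this in Python (the function name `parse_swissprot_fasta` is made up for illustration):

```python
def parse_swissprot_fasta(text):
    """Split one FASTA record into its header fields and its sequence."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header, seq_lines = lines[0], lines[1:]
    if not header.startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    # Assumed header layout: >db|accession|entry_name description
    ident, _, description = header[1:].partition(" ")
    db, accession, entry_name = ident.split("|")
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
        "sequence": "".join(seq_lines),
    }


entry = parse_swissprot_fasta("""
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
""")
print(entry["accession"], entry["entry_name"], len(entry["sequence"]))
```

Headers from other databases may use a different layout, so the split on `|` should be treated as specific to this example.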
Why Databases?
• The purpose of databases is not merely to collect and
organize data, but to allow intelligent data retrieval.
• A query is a method to retrieve information from the
database.
• The organization of each record into predetermined
fields allows us to run queries on individual fields.
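To illustrate what a query on fields means, here is a small Python sketch that uses the four example records from the table earlier in these slides. The `query` helper and the idea of passing one condition per field are assumptions made for illustration; real database servers use their own query languages.

```python
# The four example records from the table, one dictionary per record.
records = [
    {"accession": "QA001", "name": "MTGA", "length": 243, "enzyme": True},
    {"accession": "QA002", "name": "Ribosomal protein L9", "length": 267, "enzyme": False},
    {"accession": "QA003", "name": "Flagellin", "length": 374, "enzyme": False},
    {"accession": "QA004", "name": "GDPMH", "length": 157, "enzyme": True},
]


def query(records, **conditions):
    """Return the records for which every named field passes its condition."""
    return [
        record
        for record in records
        if all(test(record[field]) for field, test in conditions.items())
    ]


# "Enzymes shorter than 250 residues": two fields are tested at once.
short_enzymes = query(records, enzyme=lambda v: v is True, length=lambda n: n < 250)
print([record["accession"] for record in short_enzymes])
```

Because every record has the same fields, the same condition can be applied uniformly to all of them, which is the point the slide makes.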

Databases on the Internet
• Biological databases often have web
interfaces, which allow users to send queries to
the databases.
• Some databases can be accessed by different
web servers, each offering a different interface.

User --(request)--> Web server --(query)--> Database server
User <--(web page)-- Web server <--(result)-- Database server

Database download

• Nearly all biological databases are available for
download as simple text (flat) files.
• A local version of the database allows one
greater freedom in processing the data.
• Processing data in files requires some
computer-programming skills. Perl is an easy
programming language that can be used for
extraction and analysis of data from files.
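The slide names Perl; the sketch below shows the same kind of flat-file processing in Python instead. It assumes the downloaded file is in FASTA format with several records, and it uses an in-memory stand-in (`io.StringIO`) with two made-up records rather than a real download.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a multi-record FASTA flat file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Stand-in for a local flat file; both records are invented for illustration.
flat_file = io.StringIO(">rec1 example\nMYQWI\nAAPV\n\n>rec2 example\nGSSIL\n")
lengths = {header.split()[0]: len(seq) for header, seq in read_fasta(flat_file)}
print(lengths)
```

With a real local copy, `flat_file` would be replaced by `open(path)` for whatever path the download was saved to.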

There are approximately
286,730,369,256
bases in the GenBank
sequence records
as of 2011.
(Benson et al. (2011) Nucleic Acids Res 39:D32-7)
The “perfect” database
1. Comprehensive, but easy to search.

2. Annotated, but not “too annotated”.

3. A simple, easy to understand structure.

4. Cross-referenced.

5. Minimum redundancy.

6. Easy retrieval of data.

Problems with General
Sequence Databases
• Databases that strive for encyclopedic
completeness are now so huge as to be close
to unmanageable.

1. Redundancy (nothing ever goes out).

2. Inadequate sequences.
– old sequences
– partially annotated sequences
– inconsistent & outdated annotations (submitter annotation)
– error sequences, low-quality sequences
– contaminations
– anonymous sequence

Release 57.5 of 07-Jul-09 of UniProtKB/Swiss-Prot
contains 471,472 sequence entries, comprising
167,326,533 amino acids abstracted from 181,042
references.
The RefSeq Accession number format
and molecule types

Accession Molecule type

NC_xxxxxx Complete genomic molecule
NG_xxxxxx Genomic region
NM_xxxxxx mRNA
NP_xxxxxx Protein
NR_xxxxxx RNA
NT_xxxxxx computed Genomic contig
XM_xxxxxx computed mRNA
XP_xxxxxx computed Protein
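The prefix table above can be turned into a simple lookup. This is a minimal Python sketch that covers only the eight prefixes listed on the slide; the function name `refseq_molecule_type` is made up for illustration.

```python
# Prefix-to-molecule-type table, copied from the slide above.
REFSEQ_PREFIXES = {
    "NC": "Complete genomic molecule",
    "NG": "Genomic region",
    "NM": "mRNA",
    "NP": "Protein",
    "NR": "RNA",
    "NT": "computed Genomic contig",
    "XM": "computed mRNA",
    "XP": "computed Protein",
}


def refseq_molecule_type(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        # No underscore: not in the RefSeq format shown in the table.
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


print(refseq_molecule_type("NC_001610"))
print(refseq_molecule_type("Z71230"))
```

The underscore is what separates RefSeq accessions from the archive accessions (such as Z71230) that appear elsewhere in these slides.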
Using Biological Databases

• What databases should I use?

• What kind of information do I expect
to find in this database?
• Is the data in the database of interest
to me?
• How reliable is it?
Practical Session: Outline
• Integrated systems: e.g., NCBI (Protein,
Nucleotide, Gene, OMIM, etc.)
• Protein Databases: e.g., ExPASy
(SwissProt + TrEMBL)
• Protein structures: e.g., PDB and PDBsum
• Pathway databases: e.g., KEGG (Kyoto
Encyclopedia of Genes and Genomes)

Tips for the Practical Session
• We will go over several databases in a very
short time. Don’t expect to remember all the
small details. They are not important. All
respectable databases have a “HELP”
component.
• Try to:
– Learn the common features of biological databases.
– Understand the main features of every database.
– Learn how to use the online HELP.
– Judge and compare databases.
EBI/NCBI/DDBJ
• These 3 databases contain essentially the same information, synchronized
within 2-3 days (with a few differences in format and syntax)
• Serve as archives containing all sequences (single genes,
ESTs, complete genomes, etc.) derived from:
– Genome projects
– Sequencing centers
– Individual scientists
– Literature
– Patent offices
• Non-confidential data exchanged daily
• The database doubles in size approximately every 18 months.
EBI/NCBI/DDBJ
• Heterogeneous: sequence length, genomes, variants,
fragments, …
• Minimum sequence size: 10 bp
• Archive: nothing goes out -> highly redundant!
• full of errors: in sequences, in annotations, in CDS
attribution….
• no consistency of annotations; most annotations are
done by the submitters; heterogeneity of the quality and
the completion and updating of the information
EBI/NCBI/DDBJ
• Unexpected information you can find:
• ACCESSION Z71230
FT source 1..124
FT /db_xref="taxon:4097"
FT /organelle="plastid:chloroplast"
FT /organism="Nicotiana tabacum"
FT /isolate="Cuban Cahibo cigar, gift from President Fidel
FT Castro"
• ACCESSION NC_001610
FT source 1..17084
FT /chromosome="complete mitochondrial genome"
FT /db_xref="taxon:9267"
FT /organelle="mitochondrion"
FT /organism="Didelphis virginiana"
FT /dev_stage="adult"
FT /isolate="fresh road killed individual"
FT /tissue_type="liver"
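Qualifiers such as the ones above follow a regular `/key="value"` pattern, so they can be pulled out with a regular expression. The Python sketch below works on the second example as it is printed on this slide; it assumes each qualifier fits on one line, so a wrapped value like the `/isolate` of the first example would need extra handling. The function name is made up for illustration.

```python
import re

# Matches one qualifier of the form /key="value" on a single line.
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def parse_source_feature(ft_lines):
    """Return (location, qualifiers) for the FT lines of one source feature."""
    location = None
    qualifiers = {}
    for line in ft_lines:
        body = line[2:].strip() if line.startswith("FT") else line.strip()
        if body.startswith("source"):
            # The feature key is followed by its location, e.g. 1..17084.
            location = body.split()[1]
            continue
        match = QUALIFIER.search(body)
        if match:
            qualifiers[match.group(1)] = match.group(2)
    return location, qualifiers


ft_lines = """\
FT source 1..17084
FT /chromosome="complete mitochondrial genome"
FT /db_xref="taxon:9267"
FT /organelle="mitochondrion"
FT /organism="Didelphis virginiana"
FT /dev_stage="adult"
FT /isolate="fresh road killed individual"
FT /tissue_type="liver"
""".splitlines()

location, qualifiers = parse_source_feature(ft_lines)
start, end = (int(position) for position in location.split(".."))
print(qualifiers["organism"], end - start + 1)
```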
• There are 126,551,501,141 bases in
135,440,924 sequence records in the traditional
GenBank divisions and 191,401,393,188 bases
in 62,715,288 sequence records in the WGS
division as of April 2011.

• Most biocomputing sites update their copy of
GenBank every day over the internet.

• Scientists access GenBank directly over the
Web.
Annotation
•These billions of Gs, As, Ts, and Cs would be
useless without the "annotation" in each sequence
record.

Sequences

1 The LOCUS field consists of five different subfields:

1a Locus Name (HSHFE) - The locus name is a tag for grouping
similar sequences. The first two or three letters usually designate
the organism. In this case, HS stands for Homo sapiens. The last
several characters are associated with another group designation,
such as gene product. In this example, the last three characters
represent the gene symbol, HFE. Currently, the only requirement for
assigning a locus name to a record is that it is unique.

1b Sequence Length (12146 bp) - The total number of nucleotide
base pairs (or amino acid residues) in the sequence record.
2 DEFINITION - Brief description of the sequence. The description
may include source organism name, gene or protein name, or
designation as untranscribed or untranslated sequences (e.g., a
promoter region). For sequences containing a coding region (CDS),
the definition field may also contain a “completeness” qualifier such
as "complete CDS" or "exon 1."

3 ACCESSION (Z92910) - Unique identifier assigned to a complete
sequence record. This number never changes, even if the record is
modified. An accession number is a combination of letters and
numbers that are usually in the format of one letter followed by five
digits (e.g., M12345) or two letters followed by six digits (e.g.,
AC123456).
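The two usual accession formats described above can be checked with a regular expression. This Python sketch covers only those two patterns; the slide says "usually", so other valid formats exist that this check would reject, and the function name is made up for illustration.

```python
import re

# One letter + five digits, or two letters + six digits, as described above.
ACCESSION = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def looks_like_accession(text):
    """True if text matches one of the two usual accession number formats."""
    return ACCESSION.fullmatch(text) is not None


for candidate in ["Z92910", "M12345", "AC123456", "Z92910.1", "HSHFE"]:
    print(candidate, looks_like_accession(candidate))
```

Note that `Z92910.1` is rejected: it is a VERSION identifier, not a bare accession number.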

4 VERSION (Z92910.1) - Identification number assigned to a single,
specific sequence in the database. This number is in the format
“accession.version.” If any changes are made to the sequence data,
the version part of the number will increase by one. For example
U12345.1 becomes U12345.2. A version number of Z92910.1 for this
HFE sequence indicates that the sequence data has not been
altered since its original submission.
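The "accession.version" rule can be written down in a few lines. This is a sketch of the naming rule only, in Python, with made-up function names; it does not talk to any database.

```python
def split_version(identifier):
    """Split 'accession.version' into (accession, version number)."""
    accession, _, version = identifier.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not in accession.version format: {identifier!r}")
    return accession, int(version)


def next_version(identifier):
    """Identifier the record receives after a change to its sequence data."""
    accession, version = split_version(identifier)
    return f"{accession}.{version + 1}"


print(split_version("Z92910.1"))
print(next_version("U12345.1"))
```

The accession part stays fixed; only the number after the dot changes when the sequence data changes.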
5 GI (1890179) - Also a sequence identification number. Whenever a
sequence is changed, the version number is increased and a new GI
is assigned. If a nucleotide sequence record contains a protein
translation of the sequence, the translation will have its own GI
number.

6 KEYWORDS (haemochromatosis; HFE gene) - A keyword can be
any word or phrase used to describe the sequence. Keywords are
not taken from a controlled vocabulary. Notice that in this record the
keyword, "haemochromatosis," employs British spelling, rather than
the American "hemochromatosis." Many records have no keywords.
A period is placed in this field for records without keywords.

7 SOURCE (human) - Usually contains an abbreviated or common
name of the source organism.

8 ORGANISM (Homo sapiens) - The scientific name (usually genus
and species) and phylogenetic lineage. See the NCBI Taxonomy
Homepage for more information about the classification scheme
used to construct taxonomic lineages.
9 REFERENCE - Citations of publications by sequence authors that
support information presented in the sequence record. Several
references may be included in one record. References are
automatically sorted from the oldest to the newest. Cited publications
are searchable by author, article or publication title, journal title, or
MEDLINE unique identifier (UID). The UID links the sequence record
to the MEDLINE record.
1c Molecule Type (DNA) - Type of molecule that was sequenced. All
sequence data in an entry must be of the same type.

1d GenBank Division (PRI) - There are different GenBank divisions.
In this example, PRI stands for primate sequences. Some other
divisions include ROD (rodent sequences), MAM (other mammal
sequences), PLN (plant, fungal, and algal sequences), and BCT
(bacterial sequences).

1e Modification Date (23-July-1999) - Date of most recent
modification made to the record. The date of first public release is not
available in the sequence record. This information can be obtained
only by contacting NCBI at [email protected].
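The five LOCUS subfields (1a to 1e) can be read from a single line of text. In the Python sketch below, the example line is assembled from the subfield values quoted in these slides, and splitting on whitespace into exactly seven tokens is an assumption; LOCUS lines in real files may carry additional tokens, which this sketch would reject. The function name is made up for illustration.

```python
from datetime import date

MONTHS = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()


def parse_locus(line):
    """Split a LOCUS line into the five subfields described in 1a to 1e."""
    parts = line.split()
    if len(parts) != 7 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError("unexpected LOCUS layout for this sketch")
    _, name, length, _, molecule, division, modified = parts
    day, month, year = modified.split("-")
    return {
        "name": name,              # 1a locus name
        "length": int(length),     # 1b sequence length
        "molecule": molecule,      # 1c molecule type
        "division": division,      # 1d GenBank division
        "modified": date(int(year), MONTHS.index(month.upper()[:3]) + 1, int(day)),  # 1e
    }


# Illustrative line built from the values on the slides, not copied from a file.
locus = parse_locus("LOCUS       HSHFE      12146 bp    DNA    PRI    23-JUL-1999")
print(locus["name"], locus["length"], locus["division"], locus["modified"].isoformat())
```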
9 REFERENCE - If the REFERENCE TITLE contains the words
"Direct Submission," contact information for the submitter(s) is
provided.

The FEATURES table

A feature is simply an annotation that describes a portion of
the sequence.

• Each feature includes a location (sequence location or
interval) and one or several qualifiers.

• Clicking on the feature name will open a record for the
sequence interval identified in the feature location.

A list of features can be found at
http://www.ncbi.nlm.nih.gov/collab/FT/
source - An obligatory feature. The source gives the length of
the entire sequence, the scientific name of the source
organism, and the Taxon ID number.

Other types of information that the submitter may include in
this field are chromosome number, map location, clone, and
strain identification.
gene - Sequence portion that delineates the beginning and
end of a gene.

exon - Sequence segment that contains an exon. Exons may
contain portions of 5' and 3' UTRs (untranslated regions). The
name of the gene to which the exon belongs and exon number
are provided.
CDS - Sequence of nucleotides that code for amino acids of the
protein product (coding sequence).
The CDS begins with the first nucleotide of the start codon and
ends with the third nucleotide of the stop codon.
This feature includes the translation into amino acids and may
also contain gene name, gene product function, link to protein
sequence record, and cross-references to other database
entries.
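The definition of a CDS above (start codon first, stop codon last) suggests a few simple consistency checks. The Python sketch below assumes ATG as the start codon and the three stop codons of the standard genetic code; neither is stated on the slide, and the test sequence is invented for illustration.

```python
# Stop codons of the standard genetic code (an assumption, not from the slide).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq):
    """Report whether a nucleotide string has the shape of a complete CDS."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return {
        "multiple_of_three": len(seq) % 3 == 0,
        "starts_with_atg": seq.startswith("ATG"),
        "ends_with_stop": len(seq) >= 3 and seq[-3:] in STOP_CODONS,
        # A stop codon before the final codon would end translation early.
        "internal_stop": any(codon in STOP_CODONS for codon in codons[:-1]),
    }


print(check_cds("ATGGGTGTGCACGAATGA"))   # invented 6-codon example
print(check_cds("ATGTAAGGG"))            # invented example with an early stop
```

Checks like these are one way to spot the "error sequences" and inconsistent CDS attributions mentioned earlier in the slides.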

intron - Transcribed but spliced-out parts. Intron number is
shown.

polyA_signal - Identifies the sequence portion required for
endonuclease cleavage of an mRNA transcript. Consensus
sequence for the polyA signal is AATAAA.
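Finding the consensus signal in a sequence is a plain substring search. A minimal Python sketch follows; the test sequence is invented and positions are reported 1-based, which is an assumption chosen to match the way feature locations are written in the records above.

```python
def find_polya_signals(seq, signal="AATAAA"):
    """Return the 1-based start position of every occurrence of the signal."""
    seq = seq.upper()
    positions = []
    start = seq.find(signal)
    while start != -1:
        positions.append(start + 1)
        # Continue one base further on, so overlapping matches are also found.
        start = seq.find(signal, start + 1)
    return positions


print(find_polya_signals("GGCAATAAAGTCTTAATAAATT"))
```

An exact match to the consensus is only a candidate; the annotated polyA_signal feature in a record reflects the submitter's judgment.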

BASE COUNT & ORIGIN
BASE COUNT - Base Count gives the total number of adenine
(A), cytosine (C), guanine (G), and thymine (T) bases in the
sequence.

ORIGIN - Origin contains the sequence data, which begins on
the line immediately below the field title.
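The BASE COUNT values can be recomputed from the sequence data itself. The Python sketch below assumes sequence lines that start with a position number followed by blocks of lowercase bases; that layout and the short sequence are illustrative and are not taken from the HFE record.

```python
from collections import Counter


def base_count(sequence_text):
    """Count A, C, G and T, ignoring position numbers and whitespace."""
    counts = Counter(ch for ch in sequence_text.upper() if ch.isalpha())
    return {base: counts.get(base, 0) for base in "ACGT"}


# One invented sequence line: a position number, then two blocks of ten bases.
print(base_count("        1 gatcctccat atacaacggt"))
```

Comparing a recomputed count with the one stored in the record is a cheap integrity check after downloading or editing a flat file.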

Molecule-specific and topic-specific databases
AsDb - Aberrant Splicing db
ACUTS - Ancient conserved untranslated DNA sequences db
Codon Usage Db
EPD - Eukaryotic Promoter db
HOVERGEN - Homologous Vertebrate Genes db
IMGT - ImMunoGeneTics db [Mirror at EBI]
ISIS - Intron Sequence and Information System
RDP - Ribosomal db Project
gRNAs db - Guide RNA db
PLACE - Plant cis-acting regulatory DNA elements db
PlantCARE - Plant cis-acting regulatory DNA elements db
sRNA db - Small RNA db
ssu rRNA - Small ribosomal subunit db
lsu rRNA - Large ribosomal subunit db
5S rRNA - 5S ribosomal RNA db
tmRNA Website
tmRDB - tmRNA dB
tRNA - tRNA compilation from the University of Bayreuth
uRNADB - uRNA db
RNA editing - RNA editing site
RNAmod db - RNA modification db
SOS-DGBD - Db of Drosophila DNA sequences annotated with regulatory binding sites
TelDB - Multimedia Telomere Resource
TRADAT - TRAnscription Databases and Analysis Tools
Subviral RNA db - Small circular RNAs db (viroid and viroid-like)
MPDB - Molecular probe db
OPD - Oligonucleotide probe db
VectorDB - Vector sequence db (seems dead!)
Organism-specific databases:

FlyBase (Drosophila)
SGD (yeast)
MaizeDB (maize)
SubtiList (B. subtilis).

Entrez is the search and retrieval system
that integrates information from
the National Center for Biotechnology
Information (NCBI) databases.

These databases include nucleotide sequences, protein
sequences, macromolecular
structures, whole genomes, and
MEDLINE, through PubMed.
Input your search keywords or the Boolean expression

Databases: protein sequences
• SWISS-PROT: created in 1986 (Amos Bairoch) http://www.expasy.org/sprot/

• TrEMBL: created in 1996; complement to SWISS-PROT; derived from
EMBL CDS translations (« proteomic » version of EMBL)

• PIR-PSD: Protein Information Resource http://pir.georgetown.edu/

• Genpept: « proteomic » version of GenBank

• Many specialized protein databases for specific families or groups of
proteins.

– Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM
receptors), IMGT (immune system), YPD (Yeast), etc.
SWISS-PROT
• Collaboration between the SIB (CH) and
EMBL/EBI (UK)
• Manually annotated: non-redundant,
cross-referenced, fully documented.
• Weekly releases; available from about 50
servers across the world, the main source
being ExPASy in Geneva
SWISS-PROT - 07/28/09
• 495,880 sequences
• 174,780,353 amino acid residues
• 11,891 species
• 2,000 journals
• 276,903 authors
SWISS-PROT - 07/27/11
• 531,473 sequences
• 188,463,640 amino acid residues
• 12,564 species
• 2,154 journals
• 306,144 authors
TrEMBL (Translation of EMBL)
• It is impossible to cope with the quantity of newly
generated data AND to maintain the high quality of
SWISS-PROT -> TrEMBL, created in 1996.

• TrEMBL is automatically generated (from annotated EMBL
coding sequences (CDS)) and annotated using software
tools.

• Contains all that is not in SWISS-PROT.

SWISS-PROT + TrEMBL = all known protein sequences.
The simplified story of a SWISS-PROT entry

Some data are not submitted to the public databases!! (delayed or cancelled…)

Flow of an entry: cDNAs, genomes, … -> EMBLnew -> EMBL -> CDS -> TrEMBLnew -> TrEMBL -> SWISS-PROT

« Automated » steps (CDS to TrEMBL):
• Redundancy check (merge)
• Family attribution (InterPro)
• Annotation (computer)

« Manual » steps (TrEMBL to SWISS-PROT):
• Redundancy (merge, conflicts)
• Annotation (manual)
• SWISS-PROT tools (macros…)
• SWISS-PROT documentation
• Medline
• Databases (MIM, MGD….)
• Brain storming

Once in SWISS-PROT, the entry is no longer in TrEMBL, but is still in EMBL (archive).
CDS: proposed and submitted to EMBL by authors or by genome projects (can be experimentally
proven or derived from gene prediction programs). TrEMBL neither translates DNA sequences nor
uses gene prediction programs: it only takes the CDS proposed by the submitting authors in the EMBL entry.
NCBI - RefSeq
• Main features of the RefSeq collection include:

1. Non-redundancy.

2. Explicitly linked nucleotide and protein sequences

3. Data validation and format consistency

4. Distinct accession series.

5. Ongoing curation by NCBI staff and collaborators, with
review status indicated on each record
Text-based searching
• Terminology: query, hit, fields, logical/Boolean operator.
• General principles:
1. All main databases provide a convenient tool for text-based
searching.

2. We can search for query words in specific fields.

3. We can search more than one database at a time.

4. We can impose additional limits, such as modification date.
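The principles above (field-specific terms, Boolean operators, and an extra limit on modification date) can be mimicked on a handful of in-memory records. In this Python sketch, only the Z92910 values come from these slides; the other two records and all helper names (`term`, `all_of`, `any_of`, `negate`, `search`) are made up for illustration and do not correspond to the syntax of any real search interface.

```python
from datetime import date

# Only the Z92910 values come from the slides; the other records are invented.
records = [
    {"accession": "Z92910", "organism": "Homo sapiens",
     "keywords": "haemochromatosis; HFE gene", "modified": date(1999, 7, 23)},
    {"accession": "EXAMPLE1", "organism": "Homo sapiens",
     "keywords": ".", "modified": date(2001, 1, 15)},
    {"accession": "EXAMPLE2", "organism": "Mus musculus",
     "keywords": "HFE gene", "modified": date(1998, 3, 2)},
]


def term(field, word):
    """Predicate: word occurs in the given field (case-insensitive)."""
    return lambda record: word.lower() in record[field].lower()


def all_of(*predicates):   # Boolean AND
    return lambda record: all(p(record) for p in predicates)


def any_of(*predicates):   # Boolean OR
    return lambda record: any(p(record) for p in predicates)


def negate(predicate):     # Boolean NOT
    return lambda record: not predicate(record)


def search(records, predicate, modified_since=None):
    """Return accessions of the hits, optionally limited by modification date."""
    return [
        r["accession"] for r in records
        if predicate(r) and (modified_since is None or r["modified"] >= modified_since)
    ]


print(search(records, all_of(term("organism", "homo sapiens"), term("keywords", "hfe"))))
print(search(records, any_of(term("organism", "mus musculus"), term("keywords", "haemochromatosis"))))
print(search(records, negate(term("organism", "mus")), modified_since=date(2000, 1, 1)))
```

Because keywords are free text rather than a controlled vocabulary, a search for "hemochromatosis" would miss the Z92910 record, which uses the British spelling.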

EMBL: The Genome divisions
http://www.ebi.ac.uk/genomes/

Schizosaccharomyces pombe strain 972h- complete genome

Sequence formats: GenBank format
Sequence formats: FASTA format

>sequence name 
[sequence]… 
Protein structure database
http://genome.ucsc.edu/
Summary

• What is the best db for sequence analysis?

• Which contains the highest-quality data?
• Which is the most comprehensive?
• Which is the most up-to-date?
• Which is the least redundant?
• Which is the best indexed (allows complex queries)?
• Which Web server responds most quickly?
Presents new databases and updates of existing databases
Before…

End of the first part…

After…
