Class05- Molecular sequence database - 2022

The document provides an overview of molecular sequence databases, including GenBank, EMBL, and DDBJ, which serve as primary archives for nucleotide sequences from various sources. It highlights the importance of these databases in storing and sharing genetic information, as well as the role of the scientific community in verifying the accuracy of submitted data. Additionally, it discusses protein databases and the integration of resources like UniProt, which combines multiple protein sequence databases into a single platform.


2/11/2022

SIV 2001

Class 5
Molecular sequence database

What is molecular sequence?


• DNA
• RNA
• Protein


Nucleotide sequence databases


• Primary databases: GenBank, EMBL and DDBJ
• These three databases contain essentially the same information (with minor differences in format and syntax)
• Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
  • Genome projects and sequencing centers
  • Individual scientists
  • Patent offices (e.g. USPTO, EPO)
• Non-confidential data are exchanged daily
• Size: more than 2.5 × 10^7 sequences, over 3.2 × 10^10 bp
• Sequences from > 500,000 different species
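
As a rough worked example, the size figures quoted above imply an average record length. The sketch below only divides the two lower bounds from the slide, so the result is an order-of-magnitude estimate, not an official statistic.

```python
# Figures quoted on the slide (both are lower bounds).
n_sequences = 2.5e7    # more than 2.5 x 10^7 sequences
n_base_pairs = 3.2e10  # over 3.2 x 10^10 bp

# Average number of base pairs per archived sequence record.
mean_length_bp = n_base_pairs / n_sequences
print(f"Mean record length: about {mean_length_bp:.0f} bp")
```

An average of roughly a thousand base pairs fits an archive dominated by single genes and ESTs, with a much smaller number of complete genomes.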

GenBank

• Public nucleotide sequence database
• Development of software tools for sequence analysis


Organisms in GenBank

Twenty most sequenced organisms in GenBank


GenBank Flat File

• Header: title, taxonomy, citation
• Features (seq)
• DNA sequence

LOCUS       AF115338      591 bp    DNA     linear   BCT 19-AUG-1999
DEFINITION  Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete
            cds.
ACCESSION   AF115338
VERSION     AF115338.1  GI:4959391
KEYWORDS    .
SOURCE      Pseudomonas fluorescens.
  ORGANISM  Pseudomonas fluorescens
            Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae;
            Pseudomonas.
REFERENCE   1  (bases 1 to 591)
  AUTHORS   Brinkman,F.S., Schoofs,G., Hancock,R.E. and De Mot,R.
  TITLE     Influence of a putative ECF sigma factor on expression of the major
            outer membrane protein, OprF, in Pseudomonas aeruginosa and
            Pseudomonas fluorescens
  JOURNAL   J. Bacteriol. 181 (16), 4746-4754 (1999)
  MEDLINE   99369842
  PUBMED    10438740
REFERENCE   2  (bases 1 to 591)
  AUTHORS   De Mot,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-DEC-1998) F.A. Janssens Laboratory of Genetics,
            Applied Plant Sciences, K. Mercierlaan 92, Heverlee B-3001, Belgium
FEATURES             Location/Qualifiers
     source          1..591
                     /organism="Pseudomonas fluorescens"
                     /strain="M114"
                     /db_xref="taxon:294"
     gene            1..591
                     /gene="sigX"
     CDS             1..591
                     /gene="sigX"
                     /codon_start=1
                     /transl_table=11
                     /product="ECF sigma factor SigX"
                     /protein_id="AAD34329.1"
                     /db_xref="GI:4959392"
                     /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQ
                     RTLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYR
                     KERRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELE
                     FQEIADIMHMGLSATKMRYKRALDKLREKFAGETET"
BASE COUNT      157 a    133 c    170 g    131 t
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
      121 cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac
      181 gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag
      241 gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag
      301 tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag
      361 gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg
      421 gtgtatgtga acccgattga ccgtggaatt ctggtgcttc gatttgtcgc agagctggaa
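
The header of the flat file above can be read with a few lines of code. The following is a minimal sketch, assuming a simplified rule: a line that starts in the first column opens a new keyword, and an indented line continues the previous one. The names `GENBANK_EXCERPT` and `parse_header` are made up for this illustration. Real GenBank files are column-based and also contain indented sub-keywords such as ORGANISM, which this sketch does not treat separately.

```python
# A few header lines copied from the record above.
GENBANK_EXCERPT = """\
LOCUS       AF115338      591 bp    DNA     linear   BCT 19-AUG-1999
DEFINITION  Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete
            cds.
ACCESSION   AF115338
VERSION     AF115338.1  GI:4959391
SOURCE      Pseudomonas fluorescens.
"""


def parse_header(text):
    """Map each top-level keyword to its value, joining continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] != " ":
            # A keyword starts in the first column; the rest is its value.
            current, _, value = line.partition(" ")
            fields[current] = value.strip()
        elif current is not None:
            # Indented lines continue the value of the previous keyword.
            fields[current] += " " + line.strip()
    return fields


fields = parse_header(GENBANK_EXCERPT)
# The LOCUS line packs several attributes into one whitespace-separated row.
locus_name, length, unit, molecule, topology, division, date = fields["LOCUS"].split()
print(locus_name, length, unit, molecule, division)
print(fields["DEFINITION"])
```

For real work an established parser is usually the safer choice. This sketch is only meant to show how the keyword layout maps onto fields.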

Who can put data into GenBank?

Sequence data are submitted to GenBank by scientists from around the world.

Warning: GenBank does not check the validity or accuracy of submitted sequences. As with all published scientific data, verification is left to the scientific community.


European Molecular Biology Laboratory


EMBL Flat File

• Header: title, taxonomy, citation
• Features (seq)
• DNA sequence

ID   AF115338   standard; DNA; PRO; 591 BP.
AC   AF115338;
SV   AF115338.1
DT   03-JUN-1999 (Rel. 59, Created)
DT   23-AUG-1999 (Rel. 60, Last updated, Version 2)
DE   Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete cds.
KW   .
OS   Pseudomonas fluorescens
OC   Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae; Pseudomonas.
RN   [1]
RP   1-591
RX   MEDLINE; 99369842.
RA   Brinkman F.S., Schoofs G., Hancock R.E., De Mot R.;
RT   "Influence of a putative ECF sigma factor on expression of the major outer
RT   membrane protein, OprF, in Pseudomonas aeruginosa and Pseudomonas
RT   fluorescens";
RL   J. Bacteriol. 181(16):4746-4754(1999).
RN   [2]
RP   1-591
RA   De Mot R.;
RT   ;
RL   Submitted (04-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   F.A. Janssens Laboratory of Genetics, Applied Plant Sciences, K.
RL   Mercierlaan 92, Heverlee B-3001, Belgium
DR   SPTREMBL; Q9X4L7; Q9X4L7.
FH   Key             Location/Qualifiers
FH
FT   source          1..591
FT                   /db_xref="taxon:294"
FT                   /organism="Pseudomonas fluorescens"
FT                   /strain="M114"
FT   CDS             1..591
FT                   /codon_start=1
FT                   /db_xref="SPTREMBL:Q9X4L7"
FT                   /transl_table=11
FT                   /gene="sigX"
FT                   /product="ECF sigma factor SigX"
FT                   /protein_id="AAD34329.1"
FT                   /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQR
FT                   TLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYRKE
FT                   RRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELEFQE
FT                   IADIMHMGLSATKMRYKRALDKLREKFAGETET"
SQ   Sequence 591 BP; 157 A; 133 C; 170 G; 131 T; 0 other;
     atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag        60
     ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg       120
     cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac       180
     gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag       240
     gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag       300
     tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag       360
     gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg       420
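
In the EMBL layout every line begins with a two-letter line code (ID, AC, DE, SQ and so on), which makes the record easy to group by code. The sketch below illustrates this on a few lines copied from the record above and checks that the base counts on the SQ line add up to the stated length. The names `EMBL_EXCERPT`, `group_by_line_code` and `composition_from_sq` are hypothetical helpers written for this example.

```python
import re
from collections import defaultdict

# A few lines copied from the EMBL record above.
EMBL_EXCERPT = """\
ID   AF115338   standard; DNA; PRO; 591 BP.
AC   AF115338;
SV   AF115338.1
DE   Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete cds.
OS   Pseudomonas fluorescens
SQ   Sequence 591 BP; 157 A; 133 C; 170 G; 131 T; 0 other;
"""


def group_by_line_code(text):
    """Group line contents under their two-letter EMBL line code."""
    groups = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code.strip():  # sequence lines start with spaces and are skipped
            groups[code].append(value)
    return dict(groups)


def composition_from_sq(sq_value):
    """Read the total length and the per-base counts from an SQ line."""
    total = int(re.search(r"Sequence (\d+) BP", sq_value).group(1))
    counts = {
        base: int(number)
        for number, base in re.findall(r"(\d+) ([ACGT]|other);", sq_value)
    }
    return total, counts


groups = group_by_line_code(EMBL_EXCERPT)
total, counts = composition_from_sq(groups["SQ"][0])
print(groups["AC"], total, counts)
```

The same accession (AF115338) and the same base composition appear in the GenBank record, which reflects that the two archives hold the same data in different syntax.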

DDBJ - DNA Data Bank of Japan


Genome database
• What Genomes Are Available?
• List of completed genomes increases almost every week
• GOLD website: listing of finished and “in progress” genomes
http://www.genomesonline.org/


Genome database in NCBI

https://www.ncbi.nlm.nih.gov/genome/viruses/


Genome database at EMBL-EBI

SRA database


Protein Databases
• Protein sequence databases
• Protein properties
• Protein localization and targeting
• Protein sequence motifs and active sites
• Protein domain databases; protein classification
• Databases of individual protein families
• Protein structure database

Protein Information Resource

• https://pir.georgetown.edu/
• The oldest universal curated protein sequence database.
• Published as the ‘Atlas of Protein Sequence and Structure’ from 1965 to 1978 by the late Margaret O. Dayhoff.
• Established in 1984 as a successor to the original National Biomedical Research Foundation Protein Sequence Database.


SWISS-PROT
• https://web.expasy.org/docs/swiss-prot_guideline.html
• Manually curated, non-redundant protein sequence database.
• Highly integrated with other databases.

TrEMBL
• TrEMBL (Translation from EMBL) database (http://www.ebi.ac.uk/trembl/)
• Automatically annotated and derived from the translation of all coding sequences in the DDBJ/EMBL/GenBank nucleotide sequence databases that are not yet included in Swiss-Prot.
• http://www.bioinfo.pte.hu/more/TrEMBL.htm
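
To illustrate what "translation of coding sequences" means, the sketch below translates the first 30 bases of the sigX gene shown earlier and reproduces the start of its /translation qualifier. It is a simplified illustration, not the TrEMBL pipeline. It uses the standard codon assignments, on the assumption that these are the same as in translation table 11 (the table named in the record), which to the best of my knowledge differs only in the start codons it allows. The names `CODON_TABLE` and `translate` are made up for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    dna = dna.upper().replace(" ", "")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]  # raises KeyError for non-ACGT bases
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 30 bases of the sigX coding sequence from the flat file.
first_bases = "atgaataaag cccaaacgct atccacgcgc"
print(translate(first_bases))
```

The result matches the first ten residues of the protein listed under /translation in the sigX record.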


UniProt: the next generation of protein sequence databases
• Combines the Swiss-Prot, TrEMBL and PIR-PSD databases into a single resource, UniProt (http://www.uniprot.org)
• UniProt Knowledgebase: continues the work of Swiss-Prot, TrEMBL and PIR by providing an expertly curated database.
• UniProt Archive (UniParc): new and updated sequences are loaded on a daily basis.
• UniProt non-redundant reference databases (UniProt NREF, since renamed UniRef): provide non-redundant views on top of the UniProt Knowledgebase and UniParc.

UniProt


NCBI Protein database


• http://www.ncbi.nlm.nih.gov/protein

FASTA

>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4


MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
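
The definition line of the record above packs several identifiers into one pipe-delimited token. The sketch below splits it, assuming exactly the five-field layout seen in this example (gi, GI number, source database, accession, entry name). Other records may use different layouts, so the hypothetical helper `parse_defline` should not be taken as a general parser.

```python
def parse_defline(defline):
    """Split a 'gi|number|db|accession|name description' FASTA definition line."""
    # The identifier is everything up to the first space; the rest is free text.
    identifier, _, description = defline.lstrip(">").partition(" ")
    parts = identifier.split("|")
    return {
        "gi": parts[1],
        "database": parts[2],   # 'sp' marks a Swiss-Prot entry
        "accession": parts[3],
        "entry_name": parts[4],
        "description": description,
    }


info = parse_defline(">gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4")
print(info["accession"], info["entry_name"], info["description"])
```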


FASTA

> Your favourite gene 1 - yfg1


MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
> Your favourite gene 2 - yfg2
MQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIVIV
DTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENWTI
TSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLLEDNSKEW
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIV
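
A file with several records, like the one above, can be read by treating every line that starts with ">" as the beginning of a new record. The following is a minimal sketch of that idea. The function name `read_fasta` and the shortened `EXAMPLE` text (only the first lines of each sequence) are choices made for this illustration.

```python
# Shortened version of the two-record example above.
EXAMPLE = """\
> Your favourite gene 1 - yfg1

MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
> Your favourite gene 2 - yfg2
MQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIVIV
"""


def read_fasta(text):
    """Return a dict mapping each definition line to its joined sequence."""
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # a new record starts here
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)  # sequence lines belong to the open record
    return {name: "".join(parts) for name, parts in chunks.items()}


records = read_fasta(EXAMPLE)
for name, sequence in records.items():
    print(name, len(sequence))
```

Joining the wrapped lines is the important step: line breaks inside a FASTA record carry no meaning, so the sequence is the concatenation of all lines up to the next ">".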

Accession number
• Labels and identifies a sequence record so that its information is accessible.
• A string of 4 to 12 characters associated with a molecular sequence record.
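
The flat files shown earlier list both an accession (AF115338) and a versioned identifier (AF115338.1). The sketch below separates the two parts and applies the 4 to 12 character rule stated above as a simple plausibility check. Both function names are hypothetical, and the length check is only the slide's rule of thumb, not an official validation.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version); version may be None."""
    accession, dot, version = identifier.partition(".")
    return accession, int(version) if dot else None


def plausible_length(accession):
    """Check the 4 to 12 character rule of thumb from the slide."""
    return 4 <= len(accession) <= 12


for identifier in ["AF115338.1", "AAD34329.1", "P03069"]:
    accession, version = split_accession(identifier)
    print(accession, version, plausible_length(accession))
```

The accession stays stable for the life of the record, while the version number after the dot increases when the sequence itself is revised.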


Reference Sequence (RefSeq) Project


• GenBank: archival database (highly redundant).
• A RefSeq entry corresponds to a given gene or gene product (in the case of splice variants or distant loci → several RefSeq entries).
• Non-redundant; one record for each gene, or each splice variant, from
each organism represented
• Each record is intended to present an encapsulation of the current
understanding of a gene or protein, similar to a review article

Formats of Accession Numbers for RefSeq Entries
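
RefSeq accessions carry a two-letter prefix followed by an underscore, and the prefix indicates the molecule type. The sketch below lists a few commonly used prefixes and classifies an identifier by its prefix. The list is not complete, and the identifiers in the usage example are made-up strings used only to show the format.

```python
# A subset of RefSeq accession prefixes and the molecule type they indicate.
REFSEQ_PREFIXES = {
    "NC_": "complete genomic molecule",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA (model)",
    "XR_": "predicted non-coding RNA (model)",
    "XP_": "predicted protein (model)",
}


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    return REFSEQ_PREFIXES.get(accession[:3])


# Hypothetical identifiers, used only to show the format.
for accession in ["NM_123456.1", "NP_123456.1", "AF115338.1"]:
    print(accession, "->", classify_refseq(accession))
```

An archival GenBank accession such as AF115338 has no underscore prefix, so the function returns None for it. That is a quick way to tell the two kinds of identifier apart.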


The J. Craig Venter Institute (JCVI)


• Non-profit genomics research institute founded in 2006.
• Consolidation of four organizations:
• Center for the Advancement of Genomics,
• The Institute for Genomic Research (TIGR),
• Institute for Biological Energy Alternatives, and
• J. Craig Venter Science Foundation Joint Technology Center.
• https://www.jcvi.org/
• Focus: genomics and the societal implications of genomics
• Research areas: genomic medicine; environmental genomic analysis; clean energy; synthetic biology; and ethics, law, and economics
