Database Lecture 1
Sachidanand Singh
Where do the data come from?
[Figure: example databases; recoverable labels: literature, a DNA sequence (ctgccgatagc), a protein sequence (MKLVDDYTR), Information, New knowledge]
What is a Database/Resource?
Databases
• Information system
• Query system
• Storage system
• Data
Databases
• Information system
• Query system
• Storage system
• Data
• Examples shown on the slide: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves
Databases
• Information system
• Query system
• Storage system
• Data
• Examples shown on the slide: the UBC library, Google, Entrez, SRS
Primary biological databases
[Figure: the primary nucleotide databases exchange submissions and updates; recoverable labels: NIG, EMBL, SRS, getentry]
What is GenBank?
GenBank is the NIH genetic sequence database
of all publicly available DNA and derived
protein sequences, with annotations describing
the biological information these records
contain.
GenBank Flat File
Slide callouts: Header (the lines before FEATURES), Title (DEFINITION), Taxonomy (ORGANISM and lineage), Citation (REFERENCE blocks).
LOCUS MUSNGH 1803 bp mRNA ROD 29-AUG-1997
DEFINITION Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
cell TA20 mRNA, complete cds.
ACCESSION D25291
NID g1850791
KEYWORDS neurite extension activity; growth arrest; TA20.
SOURCE Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
cell_line:NG108-15 cDNA to mRNA.
ORGANISM Murinae gen. sp.
Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
Murinae.
REFERENCE 1 (sites)
AUTHORS Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
TITLE A novel factor, TA20, involved in neuronal differentiation: cDNA
cloning and expression
JOURNAL Neurosci. Res. 23 (1), 21-27 (1995)
MEDLINE 96064354
REFERENCE 3 (bases 1 to 1803)
AUTHORS Tohda,C.
TITLE Direct Submission
JOURNAL Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases. Chihiro
Tohda, Toyama Medical and Pharmaceutical University, Research
Institute for Wakan-yaku, Analytical Research Center for
Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
(E-mail:[email protected], Tel:+81-764-34-2281(ex.2841),
Fax:+81-764-34-5057)
COMMENT On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES Location/Qualifiers
source 1..1803
/organism="Murinae gen. sp."
/note="source origin of sequence, either mouse or rat, has
not been identified"
/db_xref="taxon:39108"
/cell_line="NG108-15"
/cell_type="mouse neuroblastma-rat glioma hybridoma"
misc_signal 156..163
/note="AP-2 binding site"
DNA Sequence
661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
1801 cat
//
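The LOCUS line of the record above packs several fields into a single line. The following is a minimal Python sketch of how it could be split, assuming the whitespace-separated layout shown on the slide (real GenBank files use fixed columns, and newer records may carry extra tokens). `Locus` and `parse_locus` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length_bp: int
    molecule: str
    division: str
    date: str


def parse_locus(line: str) -> Locus:
    """Split a LOCUS line laid out like the slide example (hypothetical helper)."""
    tokens = line.split()
    # Assumption: exactly "LOCUS <name> <length> bp <molecule> <division> <date>".
    if len(tokens) != 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    _, name, length, _, molecule, division, date = tokens
    return Locus(name, int(length), molecule, division, date)


locus = parse_locus("LOCUS MUSNGH 1803 bp mRNA ROD 29-AUG-1997")
print(locus.name, locus.length_bp, locus.molecule, locus.division, locus.date)
# prints: MUSNGH 1803 mRNA ROD 29-AUG-1997
```

The length field (1803) agrees with the `source 1..1803` feature and with the last numbered sequence line of the record.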
GenBank file format
EMBL entry: example
Slide callouts: keyword (KW line), taxonomy (OS and OC lines), references (RN through RL lines), cross-references (DR lines).
ID HSERPG standard; DNA; HUM; 3398 BP.
XX
AC X02158;
XX
SV X02158.1
XX
DT 13-JUN-1985 (Rel. 06, Created)
DT 22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE Human gene for erythropoietin
XX
KW erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS Homo sapiens (human)
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN [1]
RP 1-3398
RX MEDLINE; 85137899.
RA Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA Shimizu T., Miyake T.;
RT Isolation and characterization of genomic and cDNA clones of human
RT erythropoietin;
RL Nature 313:806-810(1985).
XX
DR GDB; 119110; EPO.
DR GDB; 119615; TIMP1.
DR SWISS-PROT; P01588; EPO_HUMAN.
XX
…
EMBL entry (cont.)
Slide callouts: annotation (the FT feature table), sequence (the SQ block).
CC Data kindly reviewed (24-FEB-1986) by K. Jacobs
FH Key Location/Qualifiers
FH
FT source 1..3398
FT /db_xref=taxon:9606
FT /organism=Homo sapiens
FT mRNA join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
FT CDS join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
FT /db_xref=SWISS-PROT:P01588
FT /product=erythropoietin
FT /protein_id=CAA26095.1
FT /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2763)
FT /product=erythropoietin
FT sig_peptide join(615..627,1194..1261)
FT exon 397..627
FT /number=1
FT intron 628..1193
FT /number=1
FT exon 1194..1339
FT /number=2
FT intron 1340..1595
FT /number=2
FT exon 1596..1682
FT /number=3
FT intron 1683..2293
FT /number=3
FT exon 2294..2473
FT /number=4
FT intron 2474..2607
FT /number=4
FT exon 2608..3327
FT /note=3' untranslated region
FT /number=5
XX
sequence
SQ Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag 60
tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat 120
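The `join(...)` locations in the feature table above list the exon segments that are spliced together. As a rough Python sketch (the helper names are hypothetical, and only plain `a..b` segments are handled, not `complement(...)` or partial locations), the spliced length of the CDS can be computed and checked against the translation:

```python
import re


def join_segments(location: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs from a simple 'join(a..b,c..d)' or 'a..b' location."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(location: str) -> int:
    """Total number of bases covered by all segments (coordinates are inclusive)."""
    return sum(end - start + 1 for start, end in join_segments(location))


# CDS location copied from the EMBL entry for the erythropoietin gene.
cds = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"

print(spliced_length(cds))           # prints: 582
print(spliced_length(cds) // 3 - 1)  # codons minus the stop codon; prints: 193
```

The 582 coding bases correspond to 194 codons, that is, 193 residues plus a stop codon, which matches the 193-residue /translation shown in the entry.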
Ensembl
• Contains all the human genome DNA sequences
currently available in the public domain.
• Automated annotation: by using different software
tools, features are identified in the DNA
sequences:
– Genes (known or predicted)
– Single nucleotide polymorphisms (SNPs)
– Repeats
– Homologies
• Created and maintained by the EBI and the
Sanger Centre (UK)
• www.ensembl.org
Protein Databases
• SWISS-PROT: Annotated sequence database
• TrEMBL: Database of translated EMBL nucleotide sequences
• InterPro: Integrated resource for protein families, domains and functional sites
• CluSTr: Offers an automatic classification of SWISS-PROT and TrEMBL
• Proteome Analysis: Statistical and comparative analysis of the predicted proteomes of fully sequenced organisms
• Protein Profiles: Tables of SWISS-PROT and TrEMBL entries and alignments for the protein families of the Protein Profile
• IntEnz: The Integrated relational Enzyme database, which will contain enzyme data approved by the Nomenclature Committee
UniProt
• New protein sequence database resulting from the merger of SWISS-PROT and PIR. It is intended to be the annotated, curated protein sequence database.
• Data in UniProt are primarily derived from coding sequence annotations in EMBL (GenBank/DDBJ) nucleic acid sequence data.
• UniProt is a flat-file database, just like EMBL and GenBank
• The flat-file format is Swiss-Prot-like (EMBL-like)
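In EMBL-like flat files every line starts with a two-letter code (ID, AC, DE, OS, OC and so on), with `XX` lines as spacers. The sketch below groups lines by that code, using lines taken from the EMBL entry shown earlier in the lecture. `group_by_line_code` is a hypothetical helper, not part of any library, and it assumes a single space separates the code from the content, as in the slide text.

```python
from collections import defaultdict


def group_by_line_code(lines):
    """Collect line contents under their two-letter code, skipping 'XX' spacers."""
    fields = defaultdict(list)
    for line in lines:
        code, _, rest = line.partition(" ")
        if len(code) != 2 or code == "XX":
            continue  # not a coded line, or a spacer line
        fields[code].append(rest.strip())
    return dict(fields)


entry = [
    "ID HSERPG standard; DNA; HUM; 3398 BP.",
    "XX",
    "AC X02158;",
    "XX",
    "DE Human gene for erythropoietin",
    "XX",
    "OS Homo sapiens (human)",
    "OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;",
    "OC Eutheria; Primates; Catarrhini; Hominidae; Homo.",
]

fields = group_by_line_code(entry)
print(fields["AC"])            # prints: ['X02158;']
print(" ".join(fields["OC"]))  # the full lineage on one line
```

Keeping repeated codes (such as the two OC lines) as a list preserves the original line order, so multi-line fields can be rejoined afterwards.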
Swiss-Prot
• Annotated protein sequence database established in 1986 and maintained collaboratively since 1987 by the Department of Medical Biochemistry of the University of Geneva and the EBI
• Complete, curated, non-redundant and highly cross-referenced (with 34 other databases)
• Available from a variety of servers and through sequence analysis software tools
• More than 8,000 different species
• The first 20 species represent about 42% of all sequences in the database
• More than 129,000 entries with 4.7 × 10^7 amino acids
• More than 622,000 entries in TrEMBL
Swiss-Prot
• SWISS-PROT incorporates:
– Function of the protein
– Post-translational modification
– Domains and sites
– Secondary structure
– Quaternary structure
– Similarities to other proteins
– Diseases associated with deficiencies in the protein
– Sequence conflicts, variants, etc.
SWISS-PROT file format
TrEMBL
• TrEMBL is a computer-annotated protein sequence database
supplementing the SWISS-PROT Protein Sequence Data Bank.
• TrEMBL contains the translations of all coding sequences
(CDS) present in the EMBL Nucleotide Sequence Database not
yet integrated in SWISS-PROT.
• TrEMBL can be considered as a preliminary section of SWISS-
PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT
accession numbers have been assigned.
TrEMBL (Translation of EMBL)
• Computer-annotated supplement to SWISS-PROT, created because it is impossible to keep up with the flow of data…
• Well-structured, SWISS-PROT-like resource
• Derived from automated translation of EMBL CDS features; maintained at the EBI, UK
• Automatically generated and annotated using software tools, so its quality is not comparable to that of SWISS-PROT
• Contains everything that is not yet in SWISS-PROT
Protein DataBank (PDB)
• Important in solving real problems in molecular
biology
• Protein Data Bank
– PDB established in 1971 at Brookhaven National Laboratory (BNL)
– Sole international repository of macromolecular structure data
– Moved to the Research Collaboratory for Structural Bioinformatics (RCSB): http://www.rcsb.org/
Effective use of PDB
• Queries are of three types
– PDBid - As quoted in paper
– Search Lite - one or more keywords
– Search Fields - A detailed query form
• Query results
– Structure Explorer - details of the structure
– Query Result Browser - for multiple structures
• PDB Viewer
PDB
• Protein Data Bank
– Protein and nucleic acid (NA) 3D structures
– Sequence present
– YAFFF (yet another flat-file format)
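Items called out on the slide: HEADER, COMPND, SOURCE, AUTHOR, DATE, JRNL, REMARK, SEQRES, ATOM coordinates.
HEADER LEUCINE ZIPPER 15-JUL-93 1DGC 1DGC 2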
COMPND GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC 1DGC 3
COMPND 2 ATF/CREB SITE DNA 1DGC 4
SOURCE GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC 1DGC 5
AUTHOR T.J.RICHMOND 1DGC 6
REVDAT 1 22-JUN-94 1DGC 0 1DGC 7
JRNL AUTH P.KONIG,T.J.RICHMOND 1DGC 8
JRNL TITL THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO 1DGC 9
JRNL TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA 1DGC 10
JRNL TITL 3 FLEXIBILITY 1DGC 11
JRNL REF J.MOL.BIOL. V. 233 139 1993 1DGC 12
JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070 1DGC 13
REMARK 1 1DGC 14
REMARK 2 1DGC 15
REMARK 2 RESOLUTION. 3.0 ANGSTROMS. 1DGC 16
REMARK 3 1DGC 17
REMARK 3 REFINEMENT. 1DGC 18
REMARK 3 PROGRAM X-PLOR 1DGC 19
REMARK 3 AUTHORS BRUNGER 1DGC 20
REMARK 3 R VALUE 0.216 1DGC 21
REMARK 3 RMSD BOND DISTANCES 0.020 ANGSTROMS 1DGC 22
REMARK 3 RMSD BOND ANGLES 3.86 DEGREES 1DGC 23
REMARK 3 1DGC 24
REMARK 3 NUMBER OF REFLECTIONS 3296 1DGC 25
REMARK 3 RESOLUTION RANGE 10.0 - 3.0 ANGSTROMS 1DGC 26
REMARK 3 DATA CUTOFF 3.0 SIGMA(F) 1DGC 27
REMARK 3 PERCENT COMPLETION 98.2 1DGC 28
REMARK 3 1DGC 29
REMARK 3 NUMBER OF PROTEIN ATOMS 456 1DGC 30
REMARK 3 NUMBER OF NUCLEIC ACID ATOMS [value lost in extraction]
[further REMARK 3 and REMARK 4 records lost in extraction]
REMARK 5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE 1DGC 36
REMARK 5 281 AMINO ACIDS OF INTACT GCN4. 1DGC 37
REMARK 6 1DGC 38
REMARK 6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION. 1DGC 39
REMARK 7 1DGC 40
REMARK 7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 - 1DGC 41
REMARK 7 226 ARE NOT WELL ORDERED. 1DGC 42
REMARK 8 1DGC 43
REMARK 8 RESIDUE NUMBERING OF NUCLEOTIDES: 1DGC 44
REMARK 8 5' T G G A G A T G A C G T C A T C T C C 1DGC 45
REMARK 8 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8 9 1DGC 46
REMARK 9 1DGC 47
REMARK 9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA 1DGC 48
REMARK 9 COMPLEX PER ASYMMETRIC UNIT. 1DGC 49
REMARK 10 1DGC 50
REMARK 10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF 1DGC 51
REMARK 10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD 1DGC 52
REMARK 10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY 1DGC 53
REMARK 10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND 1DGC 54
REMARK 10 TRANSLATION VECTOR TO THE COORDINATES X Y Z: 1DGC 55
REMARK 10 1DGC 56
REMARK 10 0 -1 0 X 117.32 X SYMM 1DGC 57
REMARK 10 -1 0 0 Y + 117.32 = Y SYMM 1DGC 58
REMARK 10 0 0 -1 Z 43.33 Z SYMM 1DGC 59
SEQRES 1 A 62 ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG 1DGC 60
SEQRES 2 A 62 ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG 1DGC 61
SEQRES 3 A 62 LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU 1DGC 62
SEQRES 4 A 62 GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL 1DGC 63
SEQRES 5 A 62 ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG 1DGC 64
SEQRES 1 B 19 T G G A G A T G A C G T C 1DGC 65
SEQRES 2 B 19 A T C T C C 1DGC 66
HELIX 1 A ALA A 228 LYS A 276 1 1DGC 67
CRYST1 58.660 58.660 86.660 90.00 90.00 90.00 P 41 21 2 8 1DGC 68
ORIGX1 1.000000 0.000000 0.000000 0.00000 1DGC 69
ORIGX2 0.000000 1.000000 0.000000 0.00000 1DGC 70
ORIGX3 0.000000 0.000000 1.000000 0.00000 1DGC 71
SCALE1 0.017047 0.000000 0.000000 0.00000 1DGC 72
SCALE2 0.000000 0.017047 0.000000 0.00000 1DGC 73
SCALE3 0.000000 0.000000 0.011539 0.00000 1DGC 74
ATOM 1 N PRO A 227 35.313 108.011 15.140 1.00 38.94 1DGC 75
ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82 1DGC 76
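The ATOM records carry the 3D coordinates that make PDB entries useful. The sketch below reads the two ATOM lines shown above and measures the distance between the two atoms. It is only an illustration: `Atom` and `parse_atom` are hypothetical names, and the code splits on whitespace because the slide text has lost its column alignment, whereas real PDB files are fixed-column and should be parsed by column position.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom(line: str) -> Atom:
    """Parse a whitespace-separated ATOM line as it appears on the slide."""
    tokens = line.split()
    if len(tokens) < 11 or tokens[0] != "ATOM":
        raise ValueError(f"not an ATOM line: {line!r}")
    serial, name, residue, chain, residue_number = tokens[1:6]
    x, y, z, occupancy, b_factor = map(float, tokens[6:11])
    return Atom(int(serial), name, residue, chain, int(residue_number),
                x, y, z, occupancy, b_factor)


n = parse_atom("ATOM 1 N PRO A 227 35.313 108.011 15.140 1.00 38.94 1DGC 75")
ca = parse_atom("ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82 1DGC 76")

# Euclidean distance between the backbone N and CA atoms, in angstroms.
distance = math.dist((n.x, n.y, n.z), (ca.x, ca.y, ca.z))
print(round(distance, 2))  # prints: 1.46
```

A value of about 1.46 Å is what one would expect for an N–CA bond, which is a quick sanity check that the coordinates were read in the right order.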
Abstract Syntax Notation (ASN.1)
FASTA
• Each record starts with a header line beginning with ">", followed by the sequence
>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
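The FASTA layout above can be read with a few lines of code. The following is a minimal Python sketch, not a reference parser: `read_fasta` is a hypothetical helper, and the example text keeps only the header and the first two sequence lines of the GCN4 record shown above.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Truncated example: only the first two sequence lines of the slide record.
example = """>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
"""

records = list(read_fasta(example))
header, sequence = records[0]
header_fields = header.split("|")
print(header_fields[1], header_fields[3])  # prints: 121066 P03069
print(len(sequence))                       # residues read from the truncated example
```

Splitting the header on `|` recovers the gi number and the SWISS-PROT accession, which is how the identifiers in the header line are usually picked apart.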
NBRF format
GCG format
GCG multiple sequence format (MSF)
Identifiers
• Identifiers must be stable through time
• Identifiers must always refer to specific sequences
• Identifiers are needed to track the history of sequence updates
• Features and annotations also need identifiers
LOCUS, Accession, NID and protein_id
LOCUS, Accession, gi and PID
LOCUS HSU40282 1789 bp mRNA PRI 21-MAY-1998
DEFINITION Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION U40282
VERSION U40282.1 GI:3150001
Field          Value         Identifier type
LOCUS:         HSU40282      LOCUS
ACCESSION:     U40282        ACCESSION
VERSION:       U40282.1      Accession.version
GI:            3150001       gi
PID:           g3150002      PID
Protein gi:    3150002       protein gi
protein_id:    AAC16892.1    protein_id
CDS 157..1515
/gene="ILK"
/note="protein serine/threonine kinase"
/codon_start=1
/product="integrin-linked kinase"
/protein_id="AAC16892.1"
/db_xref="PID:g3150002"
/db_xref="GI:3150002"
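The VERSION line ties together the stable accession, its version number and the gi. As an illustrative Python sketch (the regular expression and the `parse_version` helper are assumptions based only on the single line shown above, not a complete grammar for GenBank records):

```python
import re

# Assumed layout: "VERSION <accession>.<version> GI:<number>", as on the slide.
VERSION_RE = re.compile(
    r"^VERSION\s+(?P<accession>[A-Z_]+\d+)\.(?P<version>\d+)\s+GI:(?P<gi>\d+)$"
)


def parse_version(line):
    """Return (accession, version, gi) from a VERSION line (hypothetical helper)."""
    match = VERSION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected VERSION layout: {line!r}")
    return match["accession"], int(match["version"]), int(match["gi"])


print(parse_version("VERSION U40282.1 GI:3150001"))
# prints: ('U40282', 1, 3150001)
```

The accession stays the same when a sequence is updated, while the version suffix and the gi change, which is what makes it possible to track the history of a record.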
EST: Expressed Sequence Tag
Expressed Sequence Tags (ESTs) are short
(300-500 bp) single-pass reads from mRNA (cDNA)
that are produced in large numbers.
They represent a snapshot of what is expressed
in a given tissue at a given developmental stage.
STS
Sequence Tagged Sites (STSs) are operationally
unique sequences, each defined by the pair of
primers used in a PCR assay. The assay yields a
mapping reagent that maps to a single position
within the genome.