
Database Lectur1

The document provides an overview of biological databases, detailing their types, sources, and examples such as GenBank, EMBL, and DDBJ. It explains the structure and function of these databases, including the types of biological information they store, such as nucleotide sequences and protein data. Additionally, it discusses the importance of databases in organizing biological data and facilitating research and discovery.

Uploaded by

svi14.sharma
Copyright
© All Rights Reserved

Biological Databases

Lecture 1
Sachidanand Singh
Where do the data come from?
[Figure: example databases are built from literature, nucleotide sequences (e.g. ctgccgatagc) and protein sequences (e.g. MKLVDDYTR); they yield information and, from it, new knowledge.]
What is a Database/Resource?

• Based on type and content of data
– Sequence or structure
– Nucleic acid or protein
– Other important biological information, such as enzymes and their metabolic pathways, mutations, diseases, drugs, images, etc.
• Based on source of data
– Primary database
– Secondary database
– Knowledge bases
– Integrated database
The reagent: databases
• Organized array of information
• Place where you put things in and (if all is well) can get them out again
• Resource for other databases and tools
• Simplify the information space by specialization
• Bonus: allows you to make discoveries

Databases
A database can be viewed as four layers, each illustrated with examples:
• Information system: the UBC library, Google, Entrez, SRS
• Query system: a list you look at, a catalogue, indexed files, SQL, grep
• Storage system: boxes, bookshelves, Oracle, MySQL, PC binary files, Unix text files
• Data: a GenBank flat file, a PDB file, an interaction record, the title of a book, a book
Primary biological databases

• Nucleic acid
– EMBL
– GenBank
– DDBJ (DNA Data Bank of Japan)
• Protein
– PIR
– MIPS
– SWISS-PROT
– TrEMBL
– NRL-3D
Nucleotide Databases

• EMBL: Nucleotide sequence database
• Ensembl: Automatic annotation of eukaryotic genomes
• Genome Server: Overview of completed genomes at EBI
• EMBL-Align: Multiple sequence alignment database
• Parasites: Parasite genome databases
• Mutations: Sequence variation database project
EMBL/GenBank/DDBJ
• These three databases contain essentially the same information (with a few differences in format and syntax)
• Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
– Genome projects and sequencing centers
– Individual scientists
– Patent offices (e.g. USPTO, EPO)
• Non-confidential data are exchanged daily
• Currently: 2.5 × 10^7 sequences, over 3.2 × 10^10 bp
• Sequences from more than 50,000 different species
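[Figure: data exchange among the three nucleotide archives. GenBank (NCBI, NIH; queried through Entrez), EMBL (EBI; queried through SRS) and DDBJ (CIB, NIG; queried through getentry) each receive submissions and updates and share them with the other two.]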
What is GenBank?
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.
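GenBank Flat File
A record has three parts: the header (title, taxonomy and citation), the features (including the amino acid sequence) and the DNA sequence.
LOCUS MUSNGH 1803 bp mRNA ROD 29-AUG-1997
DEFINITION Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
cell TA20 mRNA, complete cds.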
ACCESSION D25291
NID g1850791
KEYWORDS neurite extension activity; growth arrest; TA20.
SOURCE Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
cell_line:NG108-15 cDNA to mRNA.
ORGANISM Murinae gen. sp.
Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
Murinae.
REFERENCE 1 (sites)
AUTHORS Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
TITLE A novel factor, TA20, involved in neuronal differentiation: cDNA
cloning and expression
JOURNAL Neurosci. Res. 23 (1), 21-27 (1995)
MEDLINE 96064354
REFERENCE 3 (bases 1 to 1803)
AUTHORS Tohda,C.
TITLE Direct Submission
JOURNAL Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases. Chihiro
Tohda, Toyama Medical and Pharmaceutical University, Research
Institute for Wakan-yaku, Analytical Research Center for
Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
(E-mail:[email protected], Tel:+81-764-34-2281(ex.2841),
Fax:+81-764-34-5057)
COMMENT On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES Location/Qualifiers
source 1..1803
/organism="Murinae gen. sp."
/note="source origin of sequence, either mouse or rat, has
not been identified"
/db_xref="taxon:39108"
/cell_line="NG108-15"
/cell_type="mouse neuroblastma-rat glioma hybridoma"
misc_signal 156..163
/note="AP-2 binding site"
GC_signal 647..655
/note="Sp1 binding site"
TATA_signal 694..701
gene 748..1311
/gene="TA20"
CDS 748..1311
/gene="TA20"
/function="neurite extensiion activity and growth arrest
effect"
/codon_start=1
/db_xref="PID:d1005516"
/db_xref="PID:g793765"
/translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
RGPSNRSPPLPPRNRIKQPNRIKLRCR"
polyA_site 1803
BASE COUNT 507 a 458 c 311 g 527 t
ORIGIN
1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt

1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
1801 cat
//
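As a rough illustration of how a program can read the header of a flat file like the one above, the Python sketch below splits a shortened excerpt into keyword/value pairs. It assumes the usual flat-file layout, in which keywords start in column 1 and continuation lines are indented; the helper name parse_header and the excerpt are illustrative only and are not part of any GenBank tool.

```python
# Minimal sketch: read keyword/value pairs from a GenBank flat-file header.
# Assumption: top-level keywords start in column 1 and continuation lines
# are indented. parse_header is a hypothetical helper name.

RECORD = """\
LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
KEYWORDS    neurite extension activity; growth arrest; TA20.
FEATURES             Location/Qualifiers
"""


def parse_header(text):
    """Return {keyword: value} for the lines that precede FEATURES."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            break  # the header ends where the feature table begins
        if line[:1].isspace():
            # Indented line: continuation of the previous keyword's value.
            if current is not None:
                fields[current] += " " + line.strip()
        elif line.strip():
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
    return fields


header = parse_header(RECORD)
print(header["ACCESSION"])   # D25291
print(header["DEFINITION"])
```

Sub-keywords such as ORGANISM or AUTHORS are indented in real records, so this sketch would fold them into the preceding keyword; a fuller parser would need to treat them separately.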
GenBank file format
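EMBL entry: example
Each line starts with a two-letter code: ID (identifier), AC (accession), SV (sequence version), DT (dates), DE (description), KW (keywords), OS/OC (organism and taxonomy), RN/RP/RX/RA/RT/RL (references), DR (cross-references), CC (comments), FH/FT (feature table, i.e. the annotation) and SQ (sequence).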
ID HSERPG standard; DNA; HUM; 3398 BP.
XX
AC X02158;
XX
SV X02158.1
XX
DT 13-JUN-1985 (Rel. 06, Created)
DT 22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE Human gene for erythropoietin
XX
KW erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS Homo sapiens (human)
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN [1]
RP 1-3398
RX MEDLINE; 85137899.
RA Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA Shimizu T., Miyake T.;
RT Isolation and characterization of genomic and cDNA clones of human
RT erythropoietin;
RL Nature 313:806-810(1985).
XX
DR GDB; 119110; EPO.
DR GDB; 119615; TIMP1.
DR SWISS-PROT; P01588; EPO_HUMAN.
XX


EMBL entry (cont.)
CC Data kindly reviewed (24-FEB-1986) by K. Jacobs
FH Key Location/Qualifiers
FH
FT source 1..3398
FT /db_xref=taxon:9606
FT /organism=Homo sapiens
FT mRNA join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
FT CDS join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
FT /db_xref=SWISS-PROT:P01588
FT /product=erythropoietin
FT /protein_id=CAA26095.1
FT /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2763)
FT /product=erythropoietin
FT sig_peptide join(615..627,1194..1261)
FT exon 397..627
FT /number=1
FT intron 628..1193
FT /number=1
FT exon 1194..1339
FT /number=2
FT intron 1340..1595
FT /number=2
FT exon 1596..1682
FT /number=3
FT intron 1683..2293
FT /number=3
FT exon 2294..2473
FT /number=4
FT intron 2474..2607
FT /number=4
FT exon 2608..3327
FT /note=3' untranslated region
FT /number=5
XX
SQ Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag 60
tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat 120
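The join(...) locations in the feature table above list the exon segments that make up a spliced feature. The Python sketch below shows one way to turn such a location into coordinate pairs and a spliced length, using the CDS of this entry. It only handles simple forward-strand start..end ranges; complement(), single-base and partial (< or >) locations are not covered, and the function names are illustrative.

```python
import re

# Minimal sketch: compute the spliced length of a feature from its location.
# Assumption: the location is a plain join of start..end ranges on the
# forward strand. parse_join and spliced_length are hypothetical names.


def parse_join(location):
    """Return a list of (start, end) pairs from 'join(a..b,c..d,...)'."""
    pairs = re.findall(r"(\d+)\.\.(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]


def spliced_length(segments):
    """Sum the segment lengths; coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in segments)


CDS = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"
segments = parse_join(CDS)
length = spliced_length(segments)
print(segments)
print(length, length // 3)   # 582 bases, 194 codons (including the stop)
```

A length that is a multiple of three is a quick sanity check that the segments were read correctly.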
Ensembl
• Contains all the human genome DNA sequences
currently available in the public domain.
• Automated annotation: by using different software
tools, features are identified in the DNA
sequences:
– Genes (known or predicted)
– Single nucleotide polymorphisms (SNPs)
– Repeats
– Homologies
• Created and maintained by the EBI and the Sanger Centre (UK)
• www.ensembl.org
Protein Databases
• SWISS-PROT: Annotated sequence database
• TrEMBL: Database of translated EMBL nucleotide sequences
• InterPro: Integrated resource for protein families, domains and functional sites
• CluSTr: Offers an automatic classification of SWISS-PROT and TrEMBL
• Proteome Analysis: Statistical and comparative analysis of the predicted proteomes of fully sequenced organisms
• Protein Profiles: Tables of SWISS-PROT and TrEMBL entries and alignments for the protein families of the Protein Profile
• IntEnz: The Integrated relational Enzyme database (IntEnz) will contain enzyme data approved by the Nomenclature Committee
UniProt
• New protein sequence database resulting from the merger of SWISS-PROT and PIR. It will be the annotated, curated protein sequence database.
• Data in UniProt are primarily derived from coding sequence annotations in EMBL (GenBank/DDBJ) nucleic acid sequence data.
• UniProt is a flat-file database, just like EMBL and GenBank
• The flat-file format is SwissProt-like, or EMBL-like
Swiss-Prot
• Annotated protein sequence database established in 1986 and maintained collaboratively since 1987 by the Department of Medical Biochemistry of the University of Geneva and the EBI
• Complete, curated, non-redundant and highly cross-referenced (links to 34 other databases)
• Available from a variety of servers and through sequence analysis software tools
• More than 8,000 different species
• The first 20 species represent about 42% of all sequences in the database
• More than 129,000 entries with 4.7 × 10^7 amino acids
• More than 622,000 entries in TrEMBL
Swiss-Prot
• SWISS-PROT incorporates:
– Function of the protein
– Post-translational modifications
– Domains and sites
– Secondary structure
– Quaternary structure
– Similarities to other proteins
– Diseases associated with deficiencies in the protein
– Sequence conflicts, variants, etc.
SWISS-PROT file format
TREMBL
• TrEMBL is a computer-annotated protein sequence database supplementing the SWISS-PROT Protein Sequence Data Bank.
• TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database that are not yet integrated in SWISS-PROT.
• TrEMBL can be considered a preliminary section of SWISS-PROT. SWISS-PROT accession numbers have already been assigned to all TrEMBL entries that should eventually be upgraded to the standard SWISS-PROT quality.
TrEMBL (Translation of EMBL)
• Computer-annotated supplement to SWISS-PROT, needed because manual curation cannot keep up with the flow of data
• Well-structured, SWISS-PROT-like resource
• Derived from automated EMBL CDS translation; maintained at the EBI, UK
• Automatically generated and annotated using software tools (so not comparable to SWISS-PROT in terms of quality)
• Contains everything that is not yet in SWISS-PROT
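The automated CDS translation behind TrEMBL can be pictured with a small Python sketch. It assumes the standard genetic code (real pipelines select the table indicated by each record) and translates the first bases of the TA20 CDS from the GenBank example earlier in the lecture; the function name translate is illustrative.

```python
from itertools import product

# Minimal sketch: translate a coding sequence with the standard genetic code.
# Assumption: the standard table applies; unknown codons are shown as "X".
# The codon table is built from the conventional TCAG ordering.

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a CDS (DNA or RNA letters) up to the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break  # stop codon ends the protein
        protein.append(aa)
    return "".join(protein)


# First 33 bases of the TA20 CDS (positions 748..780 of D25291).
print(translate("atgatgaaactttgggtcccttctaggagtctg"))   # MMKLWVPSRSL
```

The result matches the start of the /translation qualifier in that record.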
Protein DataBank (PDB)
• Important in solving real problems in molecular biology
• Protein Data Bank
– Established in 1971 at Brookhaven National Laboratory (BNL)
– Sole international repository of macromolecular structure data
– Moved to the Research Collaboratory for Structural Bioinformatics (RCSB): http://www.rcsb.org/
Effective use of PDB
• Queries are of three types
– PDBid - As quoted in paper
– Search Lite - one or more keywords
– Search Fields - A detailed query form
• Query results
– Structure Explorer - details of the structure
– Query Result Browser - for multiple structures
• PDB Viewer
PDB
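• Protein Data Bank
– Protein and nucleic acid 3D structures
– Sequence present
– YAFFF (yet another flat file format)
The parts of a PDB entry highlighted here are HEADER, COMPND, SOURCE, AUTHOR, DATE, JRNL, REMARK, SEQRES and ATOM COORDINATES, shown for entry 1DGC:
HEADER LEUCINE ZIPPER 15-JUL-93 1DGC 1DGC 2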
COMPND GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC 1DGC 3
COMPND 2 ATF/CREB SITE DNA 1DGC 4
SOURCE GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC 1DGC 5
AUTHOR T.J.RICHMOND 1DGC 6
REVDAT 1 22-JUN-94 1DGC 0 1DGC 7
JRNL AUTH P.KONIG,T.J.RICHMOND 1DGC 8
JRNL TITL THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO 1DGC 9
JRNL TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA 1DGC 10
JRNL TITL 3 FLEXIBILITY 1DGC 11
JRNL REF J.MOL.BIOL. V. 233 139 1993 1DGC 12
JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070 1DGC 13
REMARK 1 1DGC 14
REMARK 2 1DGC 15
REMARK 2 RESOLUTION. 3.0 ANGSTROMS. 1DGC 16
REMARK 3 1DGC 17
REMARK 3 REFINEMENT. 1DGC 18
REMARK 3 PROGRAM X-PLOR 1DGC 19
REMARK 3 AUTHORS BRUNGER 1DGC 20
REMARK 3 R VALUE 0.216 1DGC 21
REMARK 3 RMSD BOND DISTANCES 0.020 ANGSTROMS 1DGC 22
REMARK 3 RMSD BOND ANGLES 3.86 DEGREES 1DGC 23
REMARK 3 1DGC 24
REMARK 3 NUMBER OF REFLECTIONS 3296 1DGC 25
REMARK 3 RESOLUTION RANGE 10.0 - 3.0 ANGSTROMS 1DGC 26
REMARK 3 DATA CUTOFF 3.0 SIGMA(F) 1DGC 27
REMARK 3 PERCENT COMPLETION 98.2 1DGC 28
REMARK 3 1DGC 29
REMARK 3 NUMBER OF PROTEIN ATOMS 456 1DGC 30
REMARK 3 NUMBER OF NUCLEIC ACID ATOMS 386 1DGC 31
REMARK 4 1DGC 32
REMARK 4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO 1DGC 33
REMARK 4 ACID BIOSYNTHETIC ENZYMES. 1DGC 34
REMARK 5 1DGC 35
REMARK 5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE 1DGC 36
REMARK 5 281 AMINO ACIDS OF INTACT GCN4. 1DGC 37
REMARK 6 1DGC 38
REMARK 6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION. 1DGC 39
REMARK 7 1DGC 40
REMARK 7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 - 1DGC 41
REMARK 7 226 ARE NOT WELL ORDERED. 1DGC 42
REMARK 8 1DGC 43
REMARK 8 RESIDUE NUMBERING OF NUCLEOTIDES: 1DGC 44
REMARK 8 5' T G G A G A T G A C G T C A T C T C C 1DGC 45
REMARK 8 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8 9 1DGC 46
REMARK 9 1DGC 47
REMARK 9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA 1DGC 48
REMARK 9 COMPLEX PER ASYMMETRIC UNIT. 1DGC 49
REMARK 10 1DGC 50
REMARK 10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF 1DGC 51
REMARK 10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD 1DGC 52
REMARK 10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY 1DGC 53
REMARK 10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND 1DGC 54
REMARK 10 TRANSLATION VECTOR TO THE COORDINATES X Y Z: 1DGC 55
REMARK 10 1DGC 56
REMARK 10 0 -1 0 X 117.32 X SYMM 1DGC 57
REMARK 10 -1 0 0 Y + 117.32 = Y SYMM 1DGC 58
REMARK 10 0 0 -1 Z 43.33 Z SYMM 1DGC 59
SEQRES 1 A 62 ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG 1DGC 60
SEQRES 2 A 62 ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG 1DGC 61
SEQRES 3 A 62 LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU 1DGC 62
SEQRES 4 A 62 GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL 1DGC 63
SEQRES 5 A 62 ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG 1DGC 64
SEQRES 1 B 19 T G G A G A T G A C G T C 1DGC 65
SEQRES 2 B 19 A T C T C C 1DGC 66
HELIX 1 A ALA A 228 LYS A 276 1 1DGC 67
CRYST1 58.660 58.660 86.660 90.00 90.00 90.00 P 41 21 2 8 1DGC 68
ORIGX1 1.000000 0.000000 0.000000 0.00000 1DGC 69
ORIGX2 0.000000 1.000000 0.000000 0.00000 1DGC 70
ORIGX3 0.000000 0.000000 1.000000 0.00000 1DGC 71
SCALE1 0.017047 0.000000 0.000000 0.00000 1DGC 72
SCALE2 0.000000 0.017047 0.000000 0.00000 1DGC 73
SCALE3 0.000000 0.000000 0.011539 0.00000 1DGC 74
ATOM 1 N PRO A 227 35.313 108.011 15.140 1.00 38.94 1DGC 75
ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82 1DGC 76
………
ATOM 842 C5 C B 9 57.692 100.286 22.744 1.00 29.82 1DGC 916
ATOM 843 C6 C B 9 58.128 100.193 21.465 1.00 30.63 1DGC 917
TER 844 C B 9 1DGC 918
MASTER 46 0 0 1 0 0 0 6 842 2 0 7 1DGC 919
END 1DGC 920
PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………
PDB (cont.)
SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87
SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
…….
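To give a feel for what the ATOM records hold, the Python sketch below reads the first two ATOM lines of the 12CA example and computes the distance between the backbone N and CA atoms of TRP 5. It splits on whitespace, which works for the lines as printed here (no chain identifier, well-separated columns); real PDB files are fixed-column, so a robust reader would slice by column position instead. The function name parse_atom is illustrative.

```python
import math

# Minimal sketch: parse ATOM lines as shown on the slide and measure a bond.
# Assumption: fields are whitespace-separated and there is no chain ID
# column, as in the 12CA lines above. Real PDB files are fixed-width.

ATOM_LINES = """\
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
"""


def parse_atom(line):
    """Return a dict with the main fields of one ATOM line."""
    f = line.split()
    return {
        "serial": int(f[1]),
        "name": f[2],
        "res_name": f[3],
        "res_seq": int(f[4]),
        "xyz": (float(f[5]), float(f[6]), float(f[7])),
        "occupancy": float(f[8]),
        "b_factor": float(f[9]),
    }


atoms = [parse_atom(l) for l in ATOM_LINES.splitlines() if l.startswith("ATOM")]
n_atom, ca_atom = atoms
distance = math.dist(n_atom["xyz"], ca_atom["xyz"])  # in angstroms
print(round(distance, 2))   # 1.47
```

A value close to 1.47 Å is what one would expect for an N-CA bond, which is a handy check that the coordinates were read in the right order.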
Database Mining Tools
• SRS: Sequence Retrieval System
• Entrez: Search engine at NCBI, US
• Sequence similarity search tools: BLAST and FASTA
– Find sequence homologs to deduce the identity of a query sequence
– Identify potential sequence homologs with known three-dimensional structure
Sequence Retrieval System
SRS is a powerful data integration platform
• Provides rapid, easy and user-friendly access to large volumes of heterogeneous life science data
• Data are stored in more than 400 internal and public-domain databases
• Available at http://srs.ebi.ac.uk/
Entrez at NCBI
Entrez is a retrieval system for searching several linked databases, such as:
• PubMed: The biomedical literature
• Nucleotide: Nucleotide sequence database (GenBank)
• Protein: Protein sequence database
• Structure: Three-dimensional macromolecular structures
• Genome: Complete genome assemblies
• PopSet: Population study data sets
• OMIM: Online Mendelian Inheritance in Man
• Taxonomy: Organisms in GenBank
• Books: Online books
• ProbeSet: Gene expression and microarray datasets
• 3D Domains: Domains from Entrez Structure
• UniSTS: Markers and mapping data
• SNP: Single nucleotide polymorphisms
• CDD: Conserved domains
Entrez: Search fields
• Keyword: searches a set of indexed terms
• Accession: searches accession numbers
• Author Name
• Affiliations of authors
• Journal Title
• E.C. Numbers
• Feature Key: searches for a particular DNA feature
• SeqId: a string identifier
• Title Words
• Text Words
• Organism
• PubMed ID
• Publication and modification date
• Protein Name
File Formats of the sequences
Formats handled by Readseq (http://bimas.dcrt.nih.gov/molbio/readseq/):
1. IG/Stanford
2. GenBank/GB
3. NBRF
4. EMBL
5. GCG
6. DNAStrider
7. Fitch
8. Pearson/Fasta
9. Zuker (in-only)
10. Olsen (in-only)
11. Phylip3.2
12. Phylip
13. Plain/Raw
14. PIR/CODATA
15. MSF
16. ASN.1
17. PAUP
18. Pretty (out-only)
Format
• ASN.1
• Flat Files
– DNA
– Protein
• FASTA
– DNA
– Protein

Abstract Syntax Notation (ASN.1)

FASTA

Each record starts with a header line beginning with ">", followed by the sequence lines.
>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
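Because the FASTA layout is so simple, a reader fits in a few lines. The Python sketch below parses a shortened copy of the GCN4 record above (first two sequence lines only) and splits the header into its "|"-separated identifier fields and the free-text description. The function name read_fasta is illustrative, and the gi|...|sp|... header layout is taken from this example rather than assumed for all FASTA files.

```python
# Minimal sketch: read FASTA records as (header, sequence) pairs.
# Assumption: every record starts with a ">" line; sequence lines that
# follow are concatenated. read_fasta is a hypothetical helper name.


def read_fasta(text):
    """Return a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened excerpt of the record shown above (first two sequence lines).
EXAMPLE = """\
>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
"""

header, sequence = read_fasta(EXAMPLE)[0]
identifier, _, description = header.partition(" ")
db_fields = identifier.split("|")
print(db_fields)      # ['gi', '121066', 'sp', 'P03069', 'GCN4_YEAST']
print(description)    # GENERAL CONTROL PROTEIN GCN4
print(len(sequence))
```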

NBRF format
GCG format
GCG multiple sequence format (MSF)
Identifiers
• You need identifiers that are stable through time
• You need identifiers that will always refer to specific sequences
• You need these identifiers to track the history of sequence updates
• You also need feature and annotation identifiers
LOCUS, Accession, NID and protein_id

LOCUS: Unique string of up to 10 letters and numbers in the database. It is not maintained amongst databases and is therefore a poor sequence identifier.
ACCESSION: A unique identifier for the record and a citable entity; it does not change when the record is updated. A good record identifier, ideal for citation in publications.
VERSION: New system in which the accession and version together play the same role as the accession and gi number.
Nucleotide gi: GenInfo identifier (gi), a unique integer that changes every time the sequence changes.
PID: Protein identifier: a g, e or d prefix added to a gi number. A CDS can have one or two.
Protein gi: GenInfo identifier (gi), a unique integer that changes every time the sequence changes.
protein_id: Identifier with the same structure and function as the nucleotide accession.version number, but in a slightly different format.
LOCUS, Accession, gi and PID
LOCUS HSU40282 1789 bp mRNA PRI 21-MAY-1998
DEFINITION Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION U40282
VERSION U40282.1 GI:3150001

LOCUS: HSU40282
ACCESSION: U40282
VERSION (accession.version): U40282.1
GI (nucleotide gi): 3150001
PID: g3150002
Protein gi: 3150002
protein_id: AAC16892.1
CDS 157..1515
/gene="ILK"
/note="protein serine/threonine kinase"
/codon_start=1
/product="integrin-linked kinase"
/protein_id="AAC16892.1"
/db_xref="PID:g3150002"
/db_xref="GI:3150002"
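The relationship between accession, version and gi can be made concrete by pulling the VERSION line of the record above apart. The Python sketch below does this with a regular expression; the pattern for accessions (letters, optionally an underscore, then digits) is an assumption chosen to fit examples such as U40282, and the function name parse_version_line is illustrative.

```python
import re

# Minimal sketch: split a GenBank VERSION line into its identifiers.
# Assumption: accessions look like letters (optionally with "_") followed
# by digits; the GI part is optional. The helper name is hypothetical.

VERSION_RE = re.compile(r"VERSION\s+([A-Z_]+\d+)\.(\d+)(?:\s+GI:(\d+))?")


def parse_version_line(line):
    """Return accession, version number and gi (or None) from a VERSION line."""
    match = VERSION_RE.match(line)
    if match is None:
        raise ValueError("not a VERSION line: %r" % line)
    accession, version, gi = match.groups()
    return {
        "accession": accession,
        "version": int(version),
        "gi": int(gi) if gi else None,
    }


ids = parse_version_line("VERSION U40282.1 GI:3150001")
print(ids)   # {'accession': 'U40282', 'version': 1, 'gi': 3150001}
```

When the sequence changes, the accession stays the same while the version number is incremented and a new gi is issued, so tracking the accession.version pair is enough to follow a sequence's history.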
EST: Expressed Sequence Tag
Expressed Sequence Tags are short (300-500 bp) single reads from mRNA (cDNA), produced in large numbers. They represent a snapshot of what is expressed in a given tissue and developmental stage.
STS
Sequence Tagged Sites (STSs) are operationally unique sequences, each identified by the pair of primers used in a PCR assay. The assay generates a mapping reagent that maps to a single position within the genome.