Protein Databases
Vivienne Baillie Gerritsen, Swiss Institute of Bioinformatics, Geneva, Switzerland
Amos Bairoch, Swiss Institute of Bioinformatics, Geneva, Switzerland

An abundance of protein databases are available, dealing with fields as diverse as protein sequences, protein domains, posttranslational modifications and protein-protein interactions. Such resources are crucial to proteomics research.

Article contents
Introduction
Protein Sequence Databases
Specialized Protein Sequence Databases
Protein Domain and Family Databases
Three-dimensional Structure Databases
Other Types of Database
Protein Information in Other Types of Database
Conclusions
ENCYCLOPEDIA OF LIFE SCIENCES © 2005, John Wiley & Sons, Ltd. www.els.net
hundreds of thousands of new proteins resulting from the never-ending improvements in DNA sequencing technology. In response, a supplementary database, Translation of EMBL nucleotide sequence database or TrEMBL, was created. TrEMBL (Boeckmann et al., 2003) consists of computer-annotated entries in Swiss-Prot format derived from the translation of coding sequences (CDS) in the EMBL nucleotide sequence database. Hence TrEMBL entries are preliminary Swiss-Prot entries that have not yet been manually annotated.

Specialized Protein Sequence Databases

There is a wide variety of databases dedicated to groups of proteins, or families of proteins. Some have no more than a handful of entries, while others are less modest and provide a far wider scope. It would be quite pointless and near impossible to give a list of all existing specialized protein sequence databases. What is more, the development of many is stunted after a short existence, while new ones sprout almost on a daily basis. A positive side to all this is the appearance of information systems that attempt to collect specific data into a central resource. In a feeble attempt to show the diversity, three specialized protein sequence databases are briefly described.

G-protein-coupled receptor database: GPCRDB (see Web Links)

GPCRDB (Horn et al., 1998) is an information system that collects and disseminates GPCR-related data. It holds sequences, mutant data and ligand-binding constants as primary (experimental) data. Computationally derived data such as multiple sequence alignments, 3D models, phylogenetic trees and 2D visualization tools are added to enhance the database's usefulness.

A protease database: MEROPS (see Web Links)

MEROPS (Rawlings et al., 2002) provides a wealth of information on proteases. There are data on individual proteases, protease families and also clans into which the families are grouped. Hundreds of proteases can be found by name, identifier or the organism in which they occur.

International ImMunoGeneTics database: IMGT (see Web Links)

IMGT (Lefranc, 2001) is a high-quality integrated information system that specializes in immunoglobulins, T cell receptors and major histocompatibility complex molecules of all vertebrate species. It includes sequence databases, Web resources and interactive tools, and the IMGT server provides common access to data relevant to the field of immunogenetics.

Protein Domain and Family Databases

The sequence of a new protein can be so distantly related to any other that trying to detect a resemblance by similarity searches is futile. However, proteins do have their fingerprints, otherwise known as patterns, motifs or signatures. These are particular clusters of residue types in the sequence, which reflect conserved regions important to the function of the protein. A popular way to identify such motifs between proteins is to perform a pairwise alignment. When the sequence identity is higher than 40%, this method gives good results. However, the weakness of the pairwise alignment is that no distinction is made between an amino acid at a crucial position (like an active site) and an amino acid with no critical role. A multiple sequence alignment gives a more general view of a conserved region by giving a better picture of the most conserved residues, which are also usually those essential for the protein's function. Several databases have developed their own methods based on multiple sequence alignment in order to identify conserved regions (Table 1). A search performed on these databases is very often more sensitive than a pairwise alignment and can help to identify very remote homology (less than 20% identity).

InterPro (see Web Links)

InterPro (Mulder et al., 2002) is an integrated documentation resource for protein families, domains and functional sites that was developed to rationalize the complementary efforts of the individual protein signature database projects. PRINTS, PROSITE, Pfam, ProDom, SMART and TIGRFAMs form the InterPro core. Each InterPro entry includes a unique accession number, functional descriptions and literature references, and links are made back to the relevant member databases.

InterPro is a useful resource for whole-genome analysis and has already been used for the proteome analysis of a number of completely sequenced organisms, including preliminary analyses of the human genome. Table 1 gives a list of the InterPro database members as well as a brief description and their URL addresses.
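The contrast drawn in the Protein Domain and Family Databases section, between a pairwise alignment (which weighs every position equally) and a multiple sequence alignment (which exposes the most conserved residues), can be illustrated with a small sketch. It is not the method of any of the databases named here: the function names `percent_identity` and `column_conservation` and the toy alignment are hypothetical, invented for illustration, and the sequences are assumed to be already aligned.

```python
from collections import Counter
from typing import List, Sequence


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned sequences ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Compare only columns in which both sequences carry a residue.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def column_conservation(alignment: Sequence[str]) -> List[float]:
    """Fraction of sequences that share the most common residue, per column."""
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Hypothetical toy alignment (not taken from any real database entry).
toy_alignment = [
    "GDSGGP",
    "GDSAGP",
    "ADSGTP",
    "GDSLGA",
]

if __name__ == "__main__":
    # A single pairwise figure says nothing about which positions matter.
    print(round(percent_identity(toy_alignment[0], toy_alignment[1]), 1))
    # Per-column scores single out the invariant positions (here the "DS" pair).
    print(column_conservation(toy_alignment))
```

The pairwise score collapses the comparison into one number, whereas the per-column scores show which positions never vary across the family; signature databases build on this second kind of information, each with its own, more elaborate method.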
field of proteomics, is given here. One must keep in mind that the essence of databases is to grow and develop constantly, and the only way to get a good idea of the worth of one or the other is to visit a given database and browse through it.

See also
Genetic Databases
Genome Databases
Protein Sequence Databases

References

Attwood TK, Blythe M, Flower DR, et al. (2002) PRINTS and PRINTS-S shed light on protein ancestry. Nucleic Acids Research 30: 239–241.
Auerbach AD (2000) Eighth International HUGO-Mutation Database Initiative Meeting, April 9, Vancouver, Canada. Human Mutation 16: 265–268.
Bader GD, Donaldson I, Wolting C, et al. (2001) BIND – The Biomolecular Interaction Network Database. Nucleic Acids Research 29: 242–245.
Bairoch A (2000) The ENZYME database in 2000. Nucleic Acids Research 28: 304–305.
Bateman A, Birney E, Cerruti L, et al. (2002) The Pfam Protein Families Database. Nucleic Acids Research 30: 276–280.
Berman HM, Westbrook J, Feng Z, et al. (2000) The Protein Data Bank. Nucleic Acids Research 28: 235–242.
Boeckmann B, Bairoch A, Apweiler R, et al. (2003) The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research 31(1): 365–370.
Cooper CA, Harrison MJ, Wilkins MR and Packer NH (2001) GlycoSuiteDB: a new curated relational database of glycoprotein glycan structures and their biological sources. Nucleic Acids Research 29: 332–335.
Corpet F, Servant F, Gouzy J and Kahn D (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Research 28: 267–269.
Falquet L, Pagni M, Bucher P, et al. (2002) The PROSITE database, its status in 2002. Nucleic Acids Research 30: 235–238.
Haft DH, Loftus BJ, Richardson DL, et al. (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Research 29: 41–43.
Hoogland C, Sanchez J-C, Tonella L, et al. (2000) The 1999 SWISS-2DPAGE database update. Nucleic Acids Research 28: 286–288.
Hoogland C, Sanchez J-C, Walther D, et al. (1999) Two-dimensional electrophoresis resources available from ExPASy. Electrophoresis 20: 3568–3571.
Horn F, Weare J, Beukers MW, et al. (1998) GPCRDB: an information system for G protein-coupled receptors. Nucleic Acids Research 26: 277–281.
Kanehisa M, Goto S, Kawashima S and Nakaya A (2002) The KEGG databases at GenomeNet. Nucleic Acids Research 30: 42–46.
Krawczak M and Cooper DN (1997) The Human Gene Mutation Database. Trends in Genetics 13: 121–122.
Lefranc M-P (2001) IMGT, the international ImMunoGeneTics database. Nucleic Acids Research 29: 207–209.
Letunic I, Goodstadt L, Dickens NJ, et al. (2002) Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Research 30: 242–244.
McKusick VA (1998) Mendelian Inheritance in Man. Catalogs of Human Genes and Genetic Disorders, 12th edn. Baltimore, MD: Johns Hopkins University Press.
Mulder NJ, Apweiler R, Attwood TK, et al. (2002) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Briefings in Bioinformatics 3: 225–235.
Murzin AG, Brenner SE, Hubbard T and Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247: 536–540.
Orengo CA, Michie AD, Jones S, et al. (1997) CATH – a hierarchic classification of protein domain structures. Structure 5: 1093–1108.
Rawlings ND, O'Brien EA and Barrett AJ (2002) MEROPS: the protease database. Nucleic Acids Research 30: 343–346.
Schomburg I, Chang A, Hofmann O, et al. (2002) BRENDA, a resource for enzyme data and metabolic information. Trends in Biochemical Sciences 27: 54–56.
Sherry ST, Ward MH, Kholodov M, et al. (2002) dbSNP: the NCBI database of genetic variation. Nucleic Acids Research 29: 308–311.
Xenarios I, Salwinski L, Duan XJ, et al. (2002) DIP: the Database of Interacting Proteins. A research tool for studying cellular networks of protein interactions. Nucleic Acids Research 30: 303–305.
Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M and Cesareni G (2002) MINT: a Molecular INTeraction database. FEBS Letters 513(1): 135–140.

Web Links

The ExPASy list of Biomolecular servers. This site lists the major (over one thousand!) databases of interest relative to proteomic research
http://www.expasy.org/alinks.html

BIND. The Biomolecular Interaction Network Database stores full descriptions of interactions, molecular complexes and pathways, among which are protein-protein interactions
http://bind.mshri.on.ca/

BRENDA. The main collection of enzyme functional data available to the scientific community, maintained and developed at the Institute of Biochemistry at the University of Cologne
http://www.brenda.uni-koeln.de/

CATH. A hierarchical domain classification of protein structures derived from PDB (see below)
http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html

dbSNP. The Single Nucleotide Polymorphism database is a repository of all the genetic variations which are discovered as the human genome is being deciphered
http://www.ncbi.nlm.nih.gov/SNP

DIP. The Database of Interacting Proteins is curated both manually by expert curators and automatically. The DIP database provides a comprehensive and integrated tool for browsing and extracting information on protein-protein interactions
http://dip.doe-mbi.ucla.edu/

ENZYME. An enzyme nomenclature database based on the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB)
http://us.expasy.org/enzyme/

GlycoSuiteDB. A database of glycoprotein glycan structures derived from the scientific literature. Regarding proteins, when the glycan structures are known to be attached to a specific protein, direct links are made to Swiss-Prot and TrEMBL databases (see below)
http://www.glycosuite.com/

GPCRDB. An information system, which collects and disseminates data related to the G-protein-coupled receptor
http://www.gpcr.org/7tm

HGMD. The Human Gene Mutation Database is a comprehensive database of gene lesions underlying human inherited disease
http://www.hgmd.org/

HGVS. The Human Genome Variation Society was created to promote the discovery and free publication of information on the variations in human genes by fostering a central repository for such variations
http://www.hgvs.org/

IMGT. The international ImMunoGeneTics database is a high-quality integrated information system that specializes in immunoglobulins, T cell receptors and major histocompatibility complex molecules of all vertebrate species
http://imgt.cines.fr:8104/

InterPro. An integrated documentation resource for protein families, domains and functional sites, which was developed to rationalize the complementary efforts of the individual protein signature database projects that form the InterPro core (see Table 1)
http://www.ebi.ac.uk/interpro/

KEGG. The Kyoto Encyclopedia of Genes and Genomes strives to computerize the current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes
http://www.genome.ad.jp/kegg/

MEROPS. Provides data on individual proteases, protease families and also clans into which the families are grouped
http://merops.iapc.bbsrc.ac.uk/

MINT. The Molecular Interactions database. Stores functional interactions between biological molecules
http://cbm.bio.uniroma2.it/mint/

OMIM. The Online Mendelian Inheritance in Man database is a collection of human genes and genetic disorders maintained by the McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins University (Baltimore, MD) and the National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD). The database offers a wealth of textual information provided in each entry, some of which can be useful in the context of protein studies
http://www.ncbi.nlm.nih.gov/omim/

PDB. The Protein Data Bank is a collection of 3D structures of proteins, nucleic acids and other biological macromolecules. PDB is a resource of critical importance in the discovery of new pharmacological agents, new catalysts, new biomaterials and possibly nanodevices
http://www.rcsb.org/pdb/

SCOP. Provides a detailed and comprehensive description of the structural and evolutionary relationships between proteins whose structure is known
http://scop.mrc-lmb.cam.ac.uk/scop/

Swiss-2Dpage. Contains 2D-PAGE and SDS-PAGE reference maps and information on identified proteins from a variety of human biological samples. It is maintained collaboratively by the Central Clinical Chemistry Laboratory of the Geneva University Hospital and the Swiss Institute of Bioinformatics
http://www.expasy.org/ch2d/

Swiss-Prot. A non-redundant curated protein knowledge resource that provides a high level of annotation. Besides the stark protein sequence, a Swiss-Prot entry offers the description of the function of a protein, its domain structure, posttranslational modifications, variants and links to other databases
http://www.expasy.org/sprot

TrEMBL. Consists of computer-annotated entries in Swiss-Prot format derived from the translation of coding sequences in the European Molecular Biology Laboratory nucleotide sequence database. Hence TrEMBL entries are preliminary Swiss-Prot entries that have not yet been manually annotated
http://www.ebi.ac.uk/swissprot

World-2Dpage. A complete index of 2D-PAGE databases and services
http://www.expasy.org/ch2d/2d-index.html