Lecture 4 Nucleic Acid Sequence Database

The document discusses several major nucleic acid and protein sequence databases. The three primary nucleotide sequence databases are GenBank, EMBL, and DDBJ, which contain nucleic acid sequences submitted directly by scientists. GenBank is maintained by the National Center for Biotechnology Information (NCBI) and contains annotated publicly available nucleotide sequences. EMBL was created in 1980 and is maintained by EBI, while DDBJ was started in 1984 in Japan. Other databases discussed include UniProt, Pfam, and Prosite.

Uploaded by

Bhawna Rathi

Topic Name – Nucleic Acid Sequence Databases

BASIC NUCLEIC ACID/NUCLEOTIDE SEQUENCE DATABASES

• GenBank, EMBL and DDBJ are the three primary nucleotide sequence databases.
• They include nucleic acid sequences submitted directly by scientists and genome sequencing groups, as well as sequences taken from the literature and patents.
• The entries in the GenBank, EMBL and DDBJ databases are synchronized on a daily basis, and the accession numbers are managed in a consistent manner between these three centers.
GENBANK

• An annotated collection of all publicly available nucleotide sequences.
• The GenBank nucleotide database is maintained by the National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH), a federal agency of the US government.
• Maintained since 1992 by NCBI (Bethesda).
• www.ncbi.nlm.nih.gov/Genbank/
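
As an illustration of how a GenBank entry can be retrieved programmatically, the following minimal sketch builds a request URL for NCBI's E-utilities EFetch service. The slides do not mention E-utilities, so the endpoint, the helper name `genbank_efetch_url`, and the example accession U49845 are assumptions added here for illustration only. The sketch only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# NCBI E-utilities EFetch endpoint (not named in the slides; assumed here).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_efetch_url(accession):
    """Build an EFetch URL asking for one nucleotide record as a GenBank flat file."""
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession,      # accession number of the entry
        "rettype": "gb",      # GenBank flat-file format
        "retmode": "text",    # plain text rather than XML
    }
    return EFETCH_BASE + "?" + urlencode(params)


# U49845 is used purely as an example accession.
print(genbank_efetch_url("U49845"))
```

Opening the printed URL in a browser, or passing it to any HTTP client, should return the annotated record as text, assuming the service is reachable.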
EMBL NUCLEOTIDE SEQUENCE DATABASE

• An annotated collection of all publicly available nucleotide and protein sequences.
• Created in 1980 at the European Molecular Biology Laboratory in Heidelberg.
• Maintained since 1994 by EBI, Cambridge.
• http://www.ebi.ac.uk/embl.html
DDBJ – DNA DATA BANK OF JAPAN

• An annotated collection of all publicly available nucleotide and protein sequences.
• Started in 1984 at the National Institute of Genetics (NIG) in Mishima.
• Still maintained at this institute by a team led by Takashi Gojobori.
• http://www.ddbj.nig.ac.jp
OTHER NUCLEOTIDE DATABASES

• UniGene www.ncbi.nlm.nih.gov/UniGene/
  – The UniGene system attempts to process the GenBank sequence data into a non-redundant set of gene-oriented clusters.
• SGD genome-www.stanford.edu/Saccharomyces/
  – The Saccharomyces Genome Database (SGD) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae.
• EBI Genomes www.ebi.ac.uk/genomes/
  – This web site provides access to and statistics for the completed genomes, and information about ongoing projects.
• Genome Biology www.ncbi.nlm.nih.gov/Genomes/
  – The Genome Biology site at NCBI contains information about the available complete genomes.
• Ensembl www.ensembl.org
  – Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system which produces and maintains automatic annotation of eukaryotic genomes.
PROTEIN/PROTEOMICS DATABASES

• Protein Information Resource (PIR)
• UniProt - Protein Knowledge Database
• Pfam - Protein Family and Domain
• Prosite - Protein Family and Domain
UNIPROT DATABASE

• The Swiss-Prot, TrEMBL, and PIR protein database activities have united to form the Universal Protein Resource (UniProt):
  – UniProt Knowledgebase (UniProtKB): curated sequence information and annotations, linked to other databases.
  – UniProt Reference Clusters (UniRef): removes sequence redundancy by merging sequences that are 100%, 90% and 50% identical; no annotations; linked to Knowledgebase and UniParc records.
  – UniProt Archive (UniParc): history of sequences; no annotation; linked to source records.
UNIPROT SEQUENCE DATABASES

UniProt Archive (UniParc)
• Stable, comprehensive, non-redundant collection of all protein sequences ever published.
• Merged from PIR, SwissProt, TrEMBL, DDBJ/EMBL/GenBank proteins and proteomes, PDB, International Protein Index, RefSeq translations and other organism proteomes not yet in DDBJ/EMBL/GenBank.

UniProt Reference (UniRef)
• Three non-redundant collections based on sequence similarity clusters:
  – UniRef100 merges all identical sequences and identical overlapping subsequences into one entry.
  – UniRef90 merges all protein sequence clusters with 90% sequence identity into a single entry.
  – UniRef50 merges all protein sequence clusters with 50% sequence identity into a single entry.
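
The idea of merging sequences at 100%, 90% and 50% identity can be sketched with a toy greedy clustering. This is not how UniRef is actually built: the function names, the toy sequences, and the position-by-position identity measure (which assumes equal-length, already aligned sequences) are all simplifying assumptions made for illustration.

```python
def identity(a, b):
    """Fraction of matching positions. Assumes a and b are pre-aligned and equal in length."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Put each sequence into the first cluster whose representative it matches at >= threshold."""
    clusters = {}  # representative name -> list of member names
    for name, seq in seqs.items():
        for rep in clusters:
            if identity(seqs[rep], seq) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative was similar enough, so this sequence starts a new cluster.
            clusters[name] = [name]
    return clusters


# Hypothetical toy sequences, not real database entries.
toy = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQR",  # identical to p1
    "p3": "MKTAYIAKQK",  # 9 of 10 positions match p1
    "p4": "MKTAYLLLLL",  # 5 of 10 positions match p1
    "p5": "GGGGGGGGGG",  # no positions match p1
}

for level in (1.0, 0.9, 0.5):
    print(level, greedy_cluster(toy, level))
```

With these toy inputs, lowering the threshold from 1.0 to 0.9 to 0.5 progressively folds p2, then p3, then p4 into the cluster represented by p1, while p5 stays on its own at every level.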
UNIPROT SEQUENCE DATABASES (CONT.)

UniProt Knowledgebase (UniProtKB)
• UniProt/SwissProt
  – Manually curated, highly annotated sequences from SwissProt and PIRSF, including descriptions, taxonomy, citations, GO terms, motifs, functional and structural classifications, and residue-specific annotations including variations.
  – Some automatic rule-based annotations, including InterPro domains and motifs, PROSITE, PRINTS, ProDom, SMART, Pfam, PIRSF, Superfamily and TIGRFAMs classifications.
• UniProt/TrEMBL
  – Automatically translated from genomes, including predicted as well as RefSeq genes.
  – Automated rule-based annotations.
PROTEIN INFORMATION RESOURCE

• PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information.
• The Protein Information Resource (PIR) is an integrated public bioinformatics resource to support genomic, proteomic and systems biology research and scientific studies.
PFAM

• Pfam is a database of curated protein families, each of which is defined by two alignments and a profile hidden Markov model (HMM).
• In Pfam, the profile HMM is searched against a large sequence collection, based on the UniProt Knowledgebase (UniProtKB), to find all instances of the family.
PROSITE DATABASE

• PROSITE is a database of protein families and domains. It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families.
• Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor.
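
One way such sequence similarities are written down in PROSITE is as short patterns, for example N-{P}-[ST]-{P} for the N-glycosylation site. The slides do not describe the pattern syntax, so the following is only a rough sketch under that assumption: the helper name `prosite_to_regex` and the test sequence are hypothetical, and the conversion covers only the basic elements (x, [..], {..}, repeat counts, and terminal anchors written outside brackets).

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern such as 'N-{P}-[ST]-{P}' into a Python regex."""
    pattern = pattern.rstrip(".")  # PROSITE patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}: any residue except P
        element = element.replace("(", "{").replace(")", "}")   # x(2,4): repeat 2 to 4 times
        element = element.replace("x", ".")                     # x: any residue
        element = element.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)

# Hypothetical toy sequence: 'NGSA' fits the pattern, 'NPTL' does not (P in second position).
sequence = "MKNGSAQNPTL"
print([m.start() for m in re.finditer(regex, sequence)])
```

Here the pattern is converted to an ordinary regular expression and scanned along the sequence, which is enough to show how a family signature can pick out matching proteins; real PROSITE entries also include profiles, which this sketch does not cover.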