2 Practical of Basic Bioinformatics Module: 2.1. Uniprotkb/Swiss-Prot
2.1. UniProtKB/Swiss-Prot
The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional
information on proteins, with accurate, consistent and rich annotation. In addition to capturing the
core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name
or description, taxonomic data and citation information), as much annotation information as
possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and clear indications of the quality of annotation in the form of evidence attribution of
experimental and computational data.
The UniProt Knowledgebase consists of two sections: a section containing manually-annotated
records with information extracted from literature and curator-evaluated computational analysis,
and a section with computationally analysed records that await full manual annotation. For the
sake of continuity and name recognition, the two sections are referred to as "UniProtKB/Swiss-Prot" (reviewed, manually annotated) and "UniProtKB/TrEMBL" (unreviewed, automatically
annotated), respectively.
Manual annotation consists of a critical review of experimentally proven or computer-predicted
data about each protein, including the protein sequences. Data are continuously updated by an
expert team of biologists.
We will now see how to search UniProtKB, and what information a
UniProtKB/Swiss-Prot entry contains.
Step 1. Enter http://www.uniprot.org/ in your browser's address bar, or just click on the link.
The UniProt main page shows up (Figure 1). The page is well organized, and several components
on the start page help you find exactly what you need.
For now, we will not deal in detail with the possibilities and tools on the UniProt web page. We will
just search the database with our old example, Pho5, and see what information can be found in
the Swiss-Prot entry.
Step 2. Type Pho5 in the search bar and click on Search.
There are 21 matches to the Pho5 query (possibly more by the time you do this practical). The
results are displayed as in Figure 2. The results are retrieved from both the TrEMBL and Swiss-Prot
sections. You can see which section an entry belongs to by the star symbol (see Figure 2,
red label). Entries with the yellow star belong to the Swiss-Prot section, whereas those
without it belong to the TrEMBL section. There are several ways to retrieve just the reviewed
(Swiss-Prot) entries.
Figure 1. The UniProt main page. The most important components are labelled in red: available
tools, search bar, news feed, video tutorial, "What can be found here" and detailed help.
Figure 2. The UniProt result page. Labelled in red is the section status bar, labelled in green are section filter and
other filters, and labelled in blue is the Download link.
If you need to retrieve just the reviewed entries, perform the search and then click the
Show only reviewed link on the result page (Figure 2, green label). Another way to retrieve just
Swiss-Prot entries is to use the Advanced search settings.
The retrieved entries can be downloaded in several ways. Select the entries you
want to download, press the Download link (Figure 2, labelled blue) and choose the output format
for the selected entries. For example, you can download the list of accession numbers or the
sequences of the chosen entries in FASTA format.
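The same search, filter and download choices can also be expressed as a URL. The sketch below is not part of the practical: it assumes the UniProt REST search service at rest.uniprot.org, the `reviewed:true` query field and the `list` and `fasta` output formats. The function name `build_search_url` is made up for illustration, and the code only builds the URL without contacting the server.

```python
from urllib.parse import urlencode

# Assumed base URL of the UniProt REST search service (not named in this practical).
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(term, reviewed_only=True, fmt="fasta"):
    """Return a UniProtKB search URL for `term`.

    reviewed_only=True mimics the "Show only reviewed" filter (Swiss-Prot only).
    fmt mimics the Download format choice, e.g. "list" (accessions) or "fasta".
    """
    query = term
    if reviewed_only:
        # Assumed query syntax: reviewed:true keeps only Swiss-Prot entries.
        query = f"({term}) AND (reviewed:true)"
    return UNIPROT_SEARCH + "?" + urlencode({"query": query, "format": fmt})


# Accession list of reviewed Pho5 hits, then all hits (Swiss-Prot + TrEMBL) as FASTA.
print(build_search_url("pho5", fmt="list"))
print(build_search_url("pho5", reviewed_only=False, fmt="fasta"))
```

Pasting one of the printed URLs into the browser should return the same kind of file as the Download link, provided the assumptions above still hold for the current UniProt release.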
Step 3. Among the retrieved entries, find the Pho5 protein (repressible acid phosphatase)
from the organism Saccharomyces cerevisiae, and click on the Entry link.
After this step, the chosen entry is displayed. We will now have a detailed look at the
Swiss-Prot entry.
There are ten different sections in the Swiss-Prot entry. Each section has one or more subsections
which contain entry-specific information. Because of the great number of sections and
subsections, one can get lost in the abundance of information. That is why clicking on the name
of a section or subsection opens a help window with its description.
We will now briefly look at the sections present in the Swiss-Prot entry, and get a general
picture of the information found in the database about the protein concerned.
The first three sections are Names and origin, Protein attributes, and General annotation
(comments), shown in Figure 3. They contain basic information about the protein.
Figure 3. The sections Names and origin, Protein attributes, and General annotation (comments) for the Pho5
protein
The Names and origin section, as the name itself says, contains the name of the protein with
alternative names, including the E.C. number (recall the Biochemistry I class for the definition of
the E.C. number). It also includes the name of the gene coding for the protein, as well as the
organism of origin with its taxonomic lineage. Note that each taxonomic class links to the UniProt
Taxonomy database. You can access the taxonomic data in the UniProt Taxonomy database
and the NCBI Taxonomy database using the NCBI unique identifier (taxonomic identifier).
The following section, Protein attributes, contains some useful information about the entry. In
the case of Pho5, you can see that the protein has a length of 467 aa, was completely
sequenced, and that the existence of the protein was confirmed by one or more analytical methods.
You can also see that this protein is subject to post-translational modification.
The third section, General annotation (Comments), generally contains biochemically important
information, for example the type of reaction catalysed by the enzyme, post-translational
modifications, cell compartment etc. Note that there are yellow links at the end of certain
subsections. Those links, which look like Ref. [0-9], lead to the References section, where
the publications in which the information was found can be retrieved.
The next section is the Ontologies section (Figure 4). In information science, an ontology
generally refers to a set of concepts within a domain, and the relationships between those
concepts. It can be used to reason about the entities within that domain and may be used to
describe the domain. An example is Gene Ontology. The Gene Ontology project is a major
bioinformatics initiative with the aim of standardizing the representation of gene and gene
product attributes across species and databases. The project provides a controlled vocabulary of
terms for describing gene product characteristics and gene product annotation data from GO
Consortium members, as well as tools to access and process this data. In this section the
keywords for specific fields can be found, as well as the entries from the Gene Ontology project
database.
The section Sequence annotation (Figure 5) provides a precise but simple means for the
annotation of sequence data. It describes regions or sites of interest in the protein sequence. In
general this section lists post-translational modifications, binding sites, enzyme active sites, local
secondary structure or other characteristics reported in the cited references. Sequence conflicts
between references are also included in this section.
The section Sequence (Figure 6) displays by default the canonical protein sequence and, upon
request, all isoforms described in the entry. It also includes information pertinent to the
sequence(s), including length and molecular weight. The protein sequence displayed by default is
the one to which all positional annotation of the Sequence annotation section
refers. It is called the canonical sequence. Note that this section also includes various tools
which can be used to analyse the sequence (e.g. BLAST, Compute pI/MW, etc.). The sequence
can be easily exported in FASTA format just by clicking on the FASTA link.
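To illustrate what can be done with an exported FASTA file, here is a minimal sketch that reads FASTA text and reports the length and an approximate molecular weight of each sequence. The record used is a made-up toy peptide, not the Pho5 sequence, and all function names are hypothetical. The weight is the sum of average residue masses plus one water molecule, so it ignores signal peptide cleavage and post-translational modifications and may differ from the value shown in the entry. The `sp|` prefix check relies on the usual UniProt FASTA header layout (`sp|` for Swiss-Prot, `tr|` for TrEMBL).

```python
# Average residue masses in daltons (free amino acid minus one water molecule).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # added once for the free N- and C-termini


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_reviewed(header):
    """True if the header carries the Swiss-Prot prefix 'sp|'."""
    return header.startswith("sp|")


def molecular_weight(sequence):
    """Approximate average mass (Da) of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


# Hypothetical toy record for illustration only; NOT a real UniProtKB entry.
EXAMPLE = """>sp|X00000|TOY_EXAMPLE Hypothetical toy peptide
MKTA
GG
"""

for header, seq in parse_fasta(EXAMPLE):
    print(header.split()[0], "reviewed:", is_reviewed(header),
          "length:", len(seq), "MW: %.1f Da" % molecular_weight(seq))
```

To try it on real data, replace `EXAMPLE` with the text saved from the FASTA link of the entry.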
There are a few more sections. The References section, as mentioned before, contains the list of
publications, as well as links to those publications, from which information for the annotation
was retrieved. Each reference has a part which explains which information was taken from that
publication.
The Cross-references section provides links and unique identifiers which point to collections
and/or databases other than UniProtKB.
Finally, there are the sections Entry information and Relevant documents. In the
Entry information section, information such as the entry submission date, the date of last
modification, the accession number etc. can be found. The section Relevant documents contains links to
documents relevant to the entry (for example genome sequence and annotation, protein family,
etc.).
2.2. KEGG
Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database resource that integrates
genomic, chemical and systemic functional information. In particular, gene catalogs from
completely sequenced genomes are linked to higher-level systemic functions of the cell, the
organism and the ecosystem. Major efforts have been undertaken to manually create a knowledge
base for such systemic functions by capturing and organizing experimental knowledge in
computable forms; namely, in the forms of KEGG pathway maps, BRITE functional hierarchies
and KEGG modules. Continuous efforts have also been made to develop and improve the cross-species annotation procedure for linking genomes to the molecular networks through the KEGG
Orthology system.
Step 1. Enter http://www.genome.jp/kegg/ in your browser's address bar, or just click on the
link.
The KEGG main page shows up (Figure 1). KEGG offers a wide variety of options and
information. In this practical we will consider just the portion of KEGG's possibilities that is likely
to be of most interest. KEGG contains a database of metabolic pathways of a wide variety of
organisms which can be explored.
Step 2. Click on the KEGG PATHWAY link on the KEGG main page, then in the section 0.
Global map click the Metabolic pathways link.
A map as in Figure 2 appears. This map represents the reference metabolic pathway. The dots
on the map represent the metabolites, whereas the lines represent metabolic reactions. When you
point your mouse cursor at a dot, the unique identifier and a picture of the compound are
shown (Figure 3a). When you click on the dot, the entry of that metabolite is shown. When you
point at a line, the identifiers related to that reaction are shown. These identifiers refer
to the enzymes catalysing the reaction (EC numbers), the orthologous genes for that reaction, and
the reaction itself (Figure 3b). By clicking on the line representing the reaction, you can see the
entries involved in this reaction, that is, the entries behind the unique identifiers mentioned above.
Step 3. In the drop-down menu find the organism Rickettsia prowazekii, select it and click
on the Go button.
A map slightly different from Figure 2 is displayed (Figure 4). Rickettsia prowazekii is an
obligate intracellular parasitic, aerobic bacterium that is the etiologic agent of epidemic typhus,
transmitted in the feces of lice. Because it is an obligate parasite, it lacks many metabolic
pathways. The grey dots and lines in Figure 4 represent the metabolites and reactions which this
bacterium lacks.
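The greyed-out part of an organism-specific map can be thought of as a set difference: everything on the reference map minus what is annotated for the organism. The sketch below shows this idea with toy identifiers that are purely illustrative and are not real KEGG reaction identifiers. The function name is hypothetical, and how KEGG itself computes the colouring is not described in this practical.

```python
# Toy identifiers for illustration only; these are NOT real KEGG reaction IDs.
reference_reactions = {"rxn_A", "rxn_B", "rxn_C", "rxn_D"}
organism_reactions = {"rxn_A", "rxn_C"}


def missing_from_organism(reference, organism):
    """Return reference reactions absent from the organism (the 'grey' ones)."""
    return sorted(reference - organism)


print(missing_from_organism(reference_reactions, organism_reactions))
```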
Figure 3. a) Picture of the metabolite pointed at (marked in red) and b) unique identifiers of the
entries involved in the reaction (marked in red)
Now that we have seen how metabolic pathways can be retrieved and explored for a specific
organism, it is time to see how genes and/or proteins are represented in KEGG. First we have
to return to the KEGG main page.
Step 4. On the KEGG main page enter Pho5 in the search bar, and then click on the
sce:YBR093C link.
The link you have just clicked on is the link to the Saccharomyces cerevisiae PHO5 gene. There
were also other PHO5 genes from other organisms retrieved, but as you already know, we are
interested particularly in PHO5 from Saccharomyces cerevisiae. The page is displayed as in
Figure 5.
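The same lookups can be written as URLs. This sketch is not part of the practical: it assumes the KEGG REST service at rest.kegg.jp with its `find`, `get` and `link` operations, and the helper names are made up for illustration. The code only builds the URLs; it does not contact KEGG.

```python
from urllib.parse import quote

# Assumed base URL of the KEGG REST service (not named in this practical).
KEGG_REST = "https://rest.kegg.jp"


def kegg_find_genes(keyword):
    """URL for a keyword search in the genes database (like Step 4)."""
    return f"{KEGG_REST}/find/genes/{quote(keyword)}"


def kegg_get(entry_id, option=None):
    """URL for one entry; option may be e.g. 'aaseq' or 'ntseq' for FASTA."""
    url = f"{KEGG_REST}/get/{entry_id}"
    return f"{url}/{option}" if option else url


def kegg_link_pathways(entry_id):
    """URL listing the pathways linked to an entry."""
    return f"{KEGG_REST}/link/pathway/{entry_id}"


print(kegg_find_genes("pho5"))
print(kegg_get("sce:YBR093C"))
print(kegg_get("sce:YBR093C", "ntseq"))
print(kegg_link_pathways("sce:YBR093C"))
```

Opening the second URL in a browser should show a plain-text version of the entry in Figure 5, assuming the service behaves as described above.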
As you can see, the entry looks simple, but it is also quite complex because it contains a lot
of links and references. It contains standard sections, such as gene name, definition with EC number,
organism, references to other databases, protein sequence, nucleotide sequence, position etc. We
will focus on the Pathway section. Another interesting section, although not the focus of the present
exercise, is the Orthology section. The Nucleotide section is also interesting: not only does it
contain the DNA sequence, which can be exported in FASTA format, but it also offers an
option which enables you to retrieve not just the gene sequence itself, but also n nucleotides
downstream or upstream of the gene.
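What "n nucleotides upstream or downstream" means depends on the strand the gene lies on. The following minimal sketch shows one way to cut a gene with flanking regions out of a chromosome sequence. The coordinates, the toy chromosome and the function names are assumptions made for illustration, and KEGG's own option may follow different conventions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gene_with_flanks(chromosome, start, end, strand="+", upstream=0, downstream=0):
    """Return the gene plus flanking nucleotides, read 5'->3' on the gene's strand.

    start and end are 1-based inclusive positions on the forward strand.
    For a '-' strand gene, upstream lies to the right on the forward strand.
    """
    if strand == "+":
        left, right = start - 1 - upstream, end + downstream
    else:
        left, right = start - 1 - downstream, end + upstream
    # Clip to the chromosome ends so short flanks do not wrap around.
    left, right = max(left, 0), min(right, len(chromosome))
    region = chromosome[left:right]
    return region if strand == "+" else reverse_complement(region)


toy_chromosome = "AAACCCGGGTTT"  # made-up sequence, for illustration only
print(gene_with_flanks(toy_chromosome, 4, 6, "+", upstream=2))
print(gene_with_flanks(toy_chromosome, 4, 6, "-", upstream=2))
```

Upstream regions retrieved this way are typically where one would look for promoter elements, which is why the strand has to be taken into account.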
under dots there are actually names of corresponding metabolites, and on the reaction lines there
are the EC numbers of the enzymes which catalyse the reactions. If the enzyme is available in
KEGG, the EC number links to this enzyme from the organism concerned. Available enzymes
are highlighted in green. By clicking on the dot representing a metabolite, it is possible to get to the
entry of that metabolite.
Now we will return to the YBR093C entry to analyse it further.
Step 6. On the YBR093C entry page, click on the sce04111 link located in the Pathway
section.
By doing so, a portion of the yeast cell cycle regulation pathway is shown (Figure 7). This map
shows which proteins are involved in the yeast cell cycle. There are proteins which regulate the
activity of other proteins, transcription factors, etc. Some proteins are clustered together, which
means that those proteins form complexes with each other. As on the metabolic pathway map, the
Pho5 gene is marked in red, and available genes are highlighted in green; their entries can be retrieved
by clicking on them. The lines represent the influence of one protein on another. There are specific
lines indicating repression, activation, phosphorylation, etc. Because the lines do not represent
reactions in the usual sense of the word, they do not link to anything.