Received September 15, 2015; Revised October 04, 2015; Accepted October 05, 2015
* To whom correspondence should be addressed. Tel: +81 774 38 4521; Fax: +81 774 38 3269; Email: [email protected]
© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which
permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact
[email protected]
D458 Nucleic Acids Research, 2016, Vol. 44, Database issue
sometimes contains genes and proteins from organisms whose genome sequences are not known and would never be known. By expanding this addendum category it is now possible to capture all knowledge about gene/protein functions that can be associated with sequence data.

OVERVIEW AND NEW DEVELOPMENTS

Overview of KEGG

KEGG is an integrated database resource consisting of 16 main databases, which are categorized into systems, genomic, chemical and health information as shown in Table 1. The PATHWAY, BRITE and MODULE databases in the systems information category contain KEGG pathway maps, BRITE hierarchy and table files and KEGG modules, respectively, as representations of high-level functions. They are all manually created based on published literature. The BRITE table file is a newly introduced representation, which can be compared with the multi-column BRITE hierarchy file. When the data size is not large it is much easier to capture the overall relationship in a tabular form with a few columns optionally used for representation of hierarchy. BRITE table files are mainly used for drug classifications and for presenting various relationships involving diseases and drugs.

The genomic information category contains the GENOME and GENES databases for collections of organisms with complete genomes and their gene catalogs, which are mostly taken from RefSeq (2) and GenBank (3) databases. As mentioned, the GENES database now contains additional gene sets not related to complete genomes. There are also other databases not listed in Table 1: computationally generated sequence similarity database SSDB and auxiliary gene catalog databases DGENES and MGENES for draft genomes and metagenomes, respectively. The KO database containing ortholog groups associated with molecular functions is a hub for linking genomic information to systems information through the KEGG mapping procedure and also to chemical information through the dual aspect of the metabolic network (4).

The COMPOUND, GLYCAN, REACTION, RPAIR, RCLASS and ENZYME databases in the chemical information category contain chemical substances and reactions and are collectively called KEGG LIGAND for historical reasons. The ENZYME database originates from the database of Enzyme Nomenclature (5). There is also a small data set of reaction modules (1,4), which can be used for annotation of enzyme genes.

The health information category consists of the DISEASE, DRUG, DGROUP and ENVIRON databases for disease and drug information. DGROUP is a newly added database, which is being developed for grouping functionally identical or similar drugs in the drug interaction networks. KEGG MEDICUS is an interface for the general public integrating these internally developed databases with drug labels (package inserts) of all marketed drugs in Japan and the USA. The Japanese version of KEGG MEDICUS is especially advanced in this integration, and heavily accessed mostly through web search engines.

Experimental evidence for each KO

The development of the KO database is tightly coupled with the development of KEGG molecular networks including KEGG pathway maps, BRITE functional hierarchies and KEGG modules. Ideally a KO represents a single sequence similarity group with an appropriate level of similarity. In reality, there are a number of complications. A single KO may consist of multiple sequence similarity groups. A small group with a high similarity threshold is a subset of a larger group with a lower similarity threshold, in which case two KOs are defined as the small group and the large group excluding the small group part. As long as the constituent sequence similarity groups are well defined including these examples, the KOALA (KEGG Orthology and Links Annotation) program (1) to computationally assign K numbers works well. However, there are still a small number of legacy KOs converted from Enzyme Commission (EC) number groups, whose associated sequence data are not well defined.

Internally KO grouping is constantly updated in the manual verification part of the KOALA annotation procedure (1). For outside users the basis of KO grouping and its correspondence to molecular function should be made clear by experimental evidence. Thus, major efforts have been initiated to annotate individual KOs with reference information reporting experiments on functional characterization of genes and proteins and, whenever possible, protein sequence data used in the experiments, such as those submitted to the INSDC (DDBJ/ENA/GenBank) database or those stated in the reference. As of September 2015, references (PubMed links) and sequence data (GENES links) are included in 76% and 45%, respectively, of about 19 000 KO entries. The sequence data listed in the KO entry can now be considered as the core sequence(s) from which an ortholog group has been defined.

New additions to GENES database

For many years the KEGG GENES database was created from NCBI’s RefSeq database. Since mid-2014, newly sequenced prokaryotic genomes are taken from GenBank, and since mid-2015, existing prokaryotic genomes, excluding the NCBI reference genomes, are updated using GenBank, for the current RefSeq entries produced by the NCBI Prokaryotic Genome Annotation Pipeline (6) are very different from previous versions. No changes have been made to eukaryotic genomes. The data source of KEGG GENES is summarized in Table 2.

Eukaryotes and prokaryotes with complete genomes constitute KEGG organisms identified by three- or four-letter organism codes. As shown in this table, there are three additional categories, viruses, plasmids and addendum, with two-letter codes of vg, pg and ag, respectively. The viruses and plasmids categories are taken from RefSeq collections. The annotation (K-number assignment) rate is very low for viruses, about 7% compared to 46% for KEGG organisms, but this category is useful in metagenome annotation. Many plasmids are included in the complete genomes of KEGG organisms, and the remaining ones are selected and stored in the plasmids category.
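The code conventions described for the GENES database (three- or four-letter codes for KEGG organisms, and the two-letter codes vg, pg and ag for viruses, plasmids and addendum) can be illustrated with a minimal Python sketch. The function name and the deliberately loose alphanumeric check are assumptions made for illustration only; they are not part of any KEGG software.

```python
# The three category codes come from the text; the helper itself is hypothetical.
SPECIAL_CATEGORIES = {"vg": "viruses", "pg": "plasmids", "ag": "addendum"}


def classify_genes_code(code: str) -> str:
    """Classify a GENES code as a special category or a KEGG organism.

    Assumption: any three- or four-character alphanumeric code is treated
    as a KEGG organism code; real codes may follow stricter rules.
    """
    code = code.strip().lower()
    if code in SPECIAL_CATEGORIES:
        return SPECIAL_CATEGORIES[code]
    if len(code) in (3, 4) and code.isalnum():
        return "KEGG organism"
    return "unknown"
```

For example, `classify_genes_code("ag")` returns `"addendum"`, while a three-letter code such as `"hsa"` is classified as a KEGG organism.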
The addendum category is a collection of manually created protein sequence entries. In the KEGG pathway maps, there used to be cases where no corresponding genes could be found in KEGG organisms, thus, only links to UniProt (7) were given. In order to associate them with sequence data and K numbers, addendum entries are created using the original sequence data with International Nucleotide Sequence Database Collaboration (INSDC) protein accession numbers. In addition, there are two focused areas where sequence records are being created. One is Enzyme Nomenclature. As we do for each KO entry, we believe that each EC number entry should be linked to the sequence data used in the original experiment, so that the sequence similarity based extension of EC number assignment can safely be done. Thus, we are trying to create a list of protein sequences from the reference list of the Enzyme Nomenclature database (5). Another focused area is antimicrobial resistance (AMR), which will be discussed later.

Taxonomy and pangenomes

The increasing number of sequenced genomes, especially those of closely related bacterial strains, poses problems of how to process and represent them in the database. Attempts are made to define selected sets of genomes, such as reference and representative genomes in RefSeq (6) and reference proteomes in UniProt (7), which include both well-studied organisms and taxonomic diversity. In KEGG, such reference genomes are not explicitly defined, but the ordering of KEGG organisms contains preference of genomes. The KEGG organisms ordering is consistent with that of the NCBI taxonomy (8), but members in each taxonomic rank are manually ordered, not alphabetically ordered. The first genome in each taxonomic rank is considered as a reference genome, which is used for generating the following pangenome data sets.

As an organism’s functional capacity is represented by the set of assigned K numbers (KO content), the functional capacity of an organism group is represented by the combined set of assigned K numbers. A pangenome data set, as we define it here, is created by removing similar organisms, but retaining the KO content, at the species, genus or family level. When multiple members are present in each species/genus/family group, the first genome in the KEGG organisms order is taken as a representative genome. When the other members in the group contain different K numbers that are not present in the representative genome, those genes are added as if they are present in additional chromosomes or plasmids.
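The pangenome construction described in this section (take the first genome of each group as the representative, then add K numbers found only in other members) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the KEGG implementation: the input layout, the function name and the dictionary keys are hypothetical, and the sketch tracks K numbers only, whereas the actual data set retains the corresponding genes.

```python
from typing import Dict, List, Set, Tuple


def build_pangenome(
    ordered_genomes: List[Tuple[str, str, Set[str]]],
) -> Dict[str, dict]:
    """Collapse genomes into one entry per species/genus/family group.

    ordered_genomes holds (genome_id, group, k_numbers) tuples that are
    assumed to be already sorted in the curated KEGG organisms order.
    """
    pangenome: Dict[str, dict] = {}
    for genome_id, group, k_numbers in ordered_genomes:
        entry = pangenome.get(group)
        if entry is None:
            # The first genome seen in a group becomes its representative.
            pangenome[group] = {
                "representative": genome_id,
                "core": set(k_numbers),
                "added": set(),
            }
        else:
            # Later members contribute only K numbers the group still lacks.
            entry["added"] |= set(k_numbers) - entry["core"]
    return pangenome
```

With placeholder input such as `[("g1", "A", {"K00001", "K00002"}), ("g2", "A", {"K00002", "K00003"})]`, group `A` keeps `g1` as its representative and records `K00003` as an added K number, so the union of `core` and `added` preserves the KO content of the whole group.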
BlastKOALA and GhostKOALA

By the genome annotation procedure in KEGG, the GENES database becomes structured in terms of the KO (K number) groups. This facilitates the processing of sequence similarity search results against the GENES database, which is simply to assign the most appropriate K numbers, as implemented in the automatic annotation services of KAAS (9) and newly released BlastKOALA and GhostKOALA. As shown in Table 3, BlastKOALA and GhostKOALA utilize the pangenome data set, which can be viewed as a non-redundant GENES database after removing similar sequences in similar organisms, but retaining the KO content and the taxonomic diversity. The reduced database size was 55% and 24% for prokaryotes at the species and genus levels, respectively, and 81% and 59% for eukaryotes at the genus and family levels, respectively, as of this writing.

BlastKOALA is suitable for annotating fully sequenced genomes, while GhostKOALA, which uses GHOSTX (10) and runs 100 times faster, is suitable for annotating large data sets such as metagenomes. Both assign K numbers to query amino acid sequences and allow KEGG mapping for interpretation of high-level functions. In BlastKOALA most appropriate K numbers are determined by a method similar to the KOALA program internally used for annotation of KEGG organisms (1). In GhostKOALA only the top scores are examined for K number assignment. One additional feature of GhostKOALA is the assignment of taxonomic compositions. For this purpose the pangenome data set for GhostKOALA is supplemented by sequences selected from CD-HIT clusters (11), adding sequences without K numbers in each taxonomic rank and viral sequences, thus representing the sequence diversity of the GENES database.

TRANSLATIONAL BIOINFORMATICS

Antimicrobial resistance (AMR)

AMR is a universal problem in the management of infectious diseases and complications. Traditionally, the KEGG database contains various contents for infectious diseases and antimicrobial drugs, including KEGG disease pathway maps for infectious diseases, KEGG metabolic pathway maps for biosynthesis of antibiotics, KEGG drug structure maps for the history of antimicrobial drug development and KEGG DRUG entries for all drugs currently in use. Knowledge on AMR mechanisms is now organized in KEGG pathway maps and KEGG modules (Table 4). Furthermore, to meet the practical needs for combating AMR, we have started developing signature modules and signature KOs that can be used to characterize AMR from pathogen genome sequences. Signature modules are a class of KEGG modules, which can be used for linking units of genes in the genome, represented by sets of K numbers, to phenotypic features. Signature modules of drug resistance in pathogens are treated separately with annotation of threat levels defined by CDC (Table 4).

There are also cases where mutations of a single gene play direct roles for AMR. Beta-lactams, the major class of antibiotics, have a long history of newly appearing resistant strains, and in Gram-negative bacteria this is mainly due to mutations of beta-lactamase genes. There have been efforts to collect and classify beta-lactamase mutations (12). We examined about 1200 sequences and concluded that they can be represented by finely classified KOs, named tight KOs, because clear phylogenetic relationships exist for groups of mutated genes. Signature KOs are tight KOs that can be linked to phenotypic features, in this case resistant drug groups. The addendum category of the GENES database now contains beta-lactamase sequences, as well as protein sequences of tetracycline, aminoglycoside and macrolide resistance genes. Figure 1 shows taxonomic distributions of signature KOs for beta-lactamases that are linked to carbapenem resistance, according to the current GENES database. A tool called Pathogen Checker is being developed as a specialized version of the BlastKOALA server for comparing a query pathogen genome against a subset of the GENES database containing sequences of signature KOs and signature modules.

Drug interaction network

The KO database is our attempt to make limited experimental evidence applicable to many other data. Genes and proteins in the GENES database are considered as instances of functional orthologs represented by KOs. By organizing knowledge in terms of generalized (KO-based) networks, high-level functions of individual organisms can be inferred from gene sets in the genome. As shown in Table 5, there are two other network types that are organized in a similar way. One is the chemical reaction network. Enzymatic reactions in the REACTION database are grouped into reaction class (RC) in the RCLASS database, representing the same local structure transformation patterns for substrate-product pairs irrespective of overall structures (13). As previously reported (1,13) one-to-many relationships between reaction modules (ordered sets of RCs) and KEGG modules (sets of KOs) may help to annotate enzyme genes.

The other is the drug interaction network, which is generalized using the newly introduced drug groups (DGs) in the DGROUP database. There are multiple levels of drug groups, the lowest level being the chemical group for the same active ingredient with different salts or hydrates. Many drug interactions are caused by overlapping targets and metabolizing enzymes (14), and appropriate drug groups have been defined. The drug interaction data set in the KEGG DRUG database, which is based on known interactions listed in the drug labels of all marketed drugs in Japan, is being expanded with the DG representation. This will allow better detection of drug interactions associated with contraindications and precautions, as well as duplicate administration of drugs with the same or similar efficacy. Currently, the interaction is defined simply by the pair of D numbers (DRUG identifiers) or DG numbers (DGROUP identifiers). Attempts will be made to incorporate additional factors in the human genome, such as polymorphism of cytochrome P450 (CYP) enzymes and mutation of specific genes, for defining interaction units (Table 5), which may be used for interpretation of drug responses and drug interactions from personal genomes.
Figure 1. Eight signature KOs for beta-lactamases that represent carbapenem resistance are shown indicating which organisms at the genus level contain which genes. This table is generated by the Module Table interface in the KEGG Annotation page (http://www.kegg.jp/kegg/annotation/). The K numbers correspond to the following gene groups: K18768 (KPC), K18793 (OXA-23), K18971 (OXA-24), K18976 (OXA-48), K18794 (OXA-51), K18972 (OXA-58), K19211 (OXA-62) and K18780 (NDM).
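As a small illustration of how signature KOs link K numbers to a phenotype, the eight K numbers in the Figure 1 caption can be held in a lookup table and compared with the K numbers assigned to a query genome. This Python sketch is not the Pathogen Checker tool; the function name and the input format are assumptions, and only the K number to gene group mapping is taken from the caption.

```python
from typing import Dict, Iterable

# Mapping taken from the Figure 1 caption.
CARBAPENEM_SIGNATURE_KOS: Dict[str, str] = {
    "K18768": "KPC",
    "K18793": "OXA-23",
    "K18971": "OXA-24",
    "K18976": "OXA-48",
    "K18794": "OXA-51",
    "K18972": "OXA-58",
    "K19211": "OXA-62",
    "K18780": "NDM",
}


def carbapenem_signatures(assigned_k_numbers: Iterable[str]) -> Dict[str, str]:
    """Return the signature KOs, with gene group names, found in a genome.

    assigned_k_numbers is assumed to be the K numbers already assigned to
    the query genome, for example by an annotation service.
    """
    found = set(assigned_k_numbers) & set(CARBAPENEM_SIGNATURE_KOS)
    return {k: CARBAPENEM_SIGNATURE_KOS[k] for k in sorted(found)}
```

An empty result means only that none of these eight signature KOs was assigned; it says nothing about other resistance mechanisms.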