Protein Side Chain Correction
Author Manuscript
Nat Protoc. Author manuscript; available in PMC 2009 May 14.
Published in final edited form as:
NIH-PA Author Manuscript
Abstract
SCWRL and MolIDE are software applications for prediction of protein structures. SCWRL is
designed specifically for the task of prediction of side-chain conformations given a fixed backbone
usually obtained from an experimental structure determined by X-ray crystallography or NMR.
SCWRL is a command-line program that typically runs in a few seconds. MolIDE provides a
graphical interface for basic comparative (homology) modeling using SCWRL and other programs.
MolIDE takes an input target sequence, and uses PSI-BLAST to identify and align templates for
comparative modeling of the target. The sequence alignment to any template can be manually
modified within a graphical window of the target-template alignment and visualization of the
alignment on the template structure. MolIDE builds the model of the target structure based on the
template backbone, predicted side-chain conformations with SCWRL, and a loop-modeling program
for insertion-deletion regions with user-selected sequence segments. SCWRL and MolIDE can be
obtained at http://dunbrack.fccc.edu/Software.php.
Keywords
Computational methods; Protein structure prediction; Comparative (homology) modeling
INTRODUCTION
To understand basic biological processes such as cell division, cell signaling, development,
metabolism, and cell death, detailed knowledge of the three-dimensional structures of the active
participants is necessary. Structures of a large number of proteins have been determined
experimentally using primarily X-ray crystallography and NMR spectroscopy. There are
currently more than 50,000 entries in the Protein Data Bank (PDB)1, which archives
experimentally determined structures of proteins and protein complexes, as well as nucleic
acids and other biological macromolecules. However, a list of sequences of proteins in the
PDB such that no two sequences are more than 20% identical to each other contains only about
4000 sequences2. Many of these proteins can be grouped further into about 1000 folds in 2000
superfamilies3.
Since it was first recognized that proteins can share similar structures4, computational methods
have been developed to build models of proteins of unknown structure based on related proteins
of known structure5. Most such modeling efforts, referred to as homology modeling or
comparative modeling, follow a basic protocol: 1) for a target sequence of unknown structure,
identify a template structure with sequence related to the target and align the target sequence
to the template sequence and structure; 2) for core secondary structures and all well-conserved
parts of the alignment, borrow the backbone coordinates of the template according to the
sequence alignment of the target and template; 3) build side chains onto the backbone model
according to the target sequence; 4) for segments of the target sequence for which coordinates
cannot be borrowed from the template because of insertions and deletions in the alignment
(usually in loop regions of the protein) or because of missing coordinates in the template,
rebuild these regions using loop modeling methods or other ab initio structure prediction
methods; 5) refine the structure, modeling likely differences in the relative positions of α
helices, β sheet strands, and other elements of structure.
The identification step necessarily involves sequence alignment, but even once a template has
been identified and aligned to the target, a number of different methods may be used to improve
the alignment, including fold recognition methods6 and profile-profile alignment7. Manual
editing based on visualization of the template structure is frequently used to improve the
alignment. Steps 3 and 4, side-chain and backbone modeling, may be coupled, since certain
backbone conformations may be unable to accommodate the required side chains in any
low-energy conformation. The refinement step involves moving all parts of the structure, including
the backbone model produced in Step 2, allowing them to adjust to the new sequence. For
instance, two helices packed against each other may move apart to accommodate larger side
chains in the target than in the template. Many methods have been proposed to perform each
of the steps in the homology modeling process. Other procedures based on reconstructing
structures (rather than perturbing a starting structure) by satisfying spatial restraints using
distance geometry8 or molecular dynamics and energy minimization9-12 have also been
developed. The popular program Modeller is one of these9.
Homology modeling of proteins has been of great value in interpreting the relationships of
sequence, structure, and function. In particular, orthologous proteins usually show a pattern of
conserved residues that can be interpreted in terms of three-dimensional models of the proteins.
Orthologues are genes and proteins in two different organisms that have descended from a
single common ancestor without duplication. Conserved residues often form a contiguous
active site or interaction surface of the protein, even if they are distant from each other in the
sequence. With a structural model, a multiple alignment of orthologous proteins can be
interpreted in terms of the constraints that natural selection places on protein folding, stability,
dynamics, and function. Paralogues, on the other hand, arise from gene duplication and
subsequent divergence of sequence and function. For paralogous proteins, three-dimensional
models can be used to interpret the similarities and differences in the sequences in terms of the
related structure but usually different functions of the proteins concerned.13 In many cases,
there are significant insertions and deletions and amino acid changes in the active or binding
site between paralogues. Indeed, homology models can serve to help us identify which protein
belongs to which functional group by the conservation of important residues in the active or
binding site14. A number of groups have used comparative modeling to predict protein
function.15-19
Another important use of homology modeling is to understand functional changes due to point
mutations in protein sequences that arise either by natural processes or by experimental
manipulation. The human genome project has produced significant amounts of data concerning
polymorphisms and other mutations potentially related to differences in susceptibility,
prognosis, and treatment of human disease. There are now many such examples, including the
Factor V/Leiden R506Q mutation20 that causes increased occurrence of thrombosis, mutations
in the serine protease HTRA2 associated with Parkinson's disease21, and BRCA1 for which
many sequence differences are known, some of which may lead to breast cancer22. At the same
time, there are many polymorphisms in important genes that have no discernible effect on those
who carry them. At least for some of these, there may be some effect that has yet to be measured
in a large enough population of patients, and therefore the risk of cancer, heart disease, or other
illness to these patients is unknown. This is yet another important application of homology
modeling, since a good model may indicate readily which mutations pose a likely risk and
which do not23.
Homology models may also be used in computer-aided drug design, especially when a closely
related template structure is available for the target sequence. In such cases, the active site may
be sufficiently conserved that a model of the protein provides a reasonable target for
computer programs that can suggest the most likely compounds that will bind to the active site.
This has been used successfully in the early development of HIV protease inhibitors24,25 and
in the development of anti-malarial compounds that target the cysteine protease of P.
falciparum26. It has been used recently for high-throughput computational screening of alpha
glucosidase inhibitors for treatment of diabetes27.
In this paper, we provide protocols for using two tools for protein structure prediction. The
first is SCWRL, which is a computer program for side-chain prediction28,29. SCWRL is
perhaps the most popular program for modeling of side chains (2,636 licenses in 72 countries
as of 12 August 2008), and offers a good tradeoff between accuracy and speed as described in
a recent independent benchmark30. SCWRL may be used for homology modeling by inputting
a sequence different from the input backbone coordinates. It may also be used to complete a
structure that is missing some or all of its side chains. This may happen if some side chains are
not present in a PDB structure because of lack of electron density, and yet it would be useful
to have at least a predicted position for these residues. SCWRL can be used to build mutations
into proteins by converting one side-chain type into another, although it does not model any
changes in backbone structure that might result.
The second program we describe is MolIDE31. MolIDE provides a graphical interface to the
basic protocol of homology modeling, except for the final refinement step. That is, it provides
for identification and alignment of the target to a template, modeling of the backbone according
to this alignment, building of side chains of the target sequence, and loop modeling of
insertion-deletion regions. For many purposes, such modeling is sufficient, since it provides rough
locations of all amino acids within the protein: whether they are on the surface or buried, in
active sites or not, and proximity in space versus proximity in sequence, etc. The refinement
step is quite difficult and computationally expensive, and may provide little
added benefit for many users of homology models. This is an area of active research32.
Experimental Design
SCWRL itself has been designed to be an easy-to-use command-line program that builds side
chains onto an input backbone structure. The current version, SCWRL3.029, is based on a
graph-theory algorithm that represents the interactions of side chains within a protein as a
graph. It then uses the graph and an energy function to determine the lowest energy
conformations of all of the side chains. Most side-chain types in proteins have a limited number
of discrete conformations referred to as rotamers. These arise from steric repulsions of atoms
separated by three or four covalent bonds. The shortest side chains, serine, threonine, valine,
and cysteine, have only three available conformations. Leucine and isoleucine have nine
conformations, although some of these are high in energy and are rare in proteins. The longer
side chains have many possible conformations, although in practice many of these are also
high in energy and many such residues are exposed to the solvent and sample many
conformations in a dynamically active structure. SCWRL uses a backbone-dependent rotamer
library33,34, which expresses the frequency of rotamers as a function of the backbone dihedral
angles ϕ and Ψ for each of the amino acid types. It also contains information on the average
side-chain dihedral angles in a backbone-dependent manner. SCWRL3.0 uses the 2002 version
of the backbone-dependent rotamer library35. This library was based on 850 chains in the PDB
with resolution better than or equal to 1.7 Å. Residues with high B-factors or atomic clashes
were removed from the data set, as suggested by Lovell et al.36, and the terminal dihedrals of
Asn, Gln, and His residues were flipped if there was a clear hydrogen bond formed by doing
so37.
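As a rough illustration of the rotamer counts quoted above, the following sketch enumerates idealized rotamers under the assumption of three staggered positions (about -60°, 180°, and +60°) for each side-chain dihedral angle. The `CHI_COUNT` table and the function name are hypothetical and are not part of SCWRL or its rotamer library, which stores backbone-dependent frequencies and average angles rather than idealized values.

```python
from itertools import product

# Idealized staggered positions for a side-chain dihedral angle, in degrees.
# This is a simplifying assumption for illustration only.
STAGGERED = (-60.0, 180.0, 60.0)

# Number of side-chain dihedral angles for the residue types named in the text.
CHI_COUNT = {"SER": 1, "THR": 1, "VAL": 1, "CYS": 1, "LEU": 2, "ILE": 2}


def ideal_rotamers(residue):
    """Return every combination of staggered chi angles for one residue type."""
    return list(product(STAGGERED, repeat=CHI_COUNT[residue]))


for name in CHI_COUNT:
    # One dihedral gives 3 rotamers; two dihedrals give 3 x 3 = 9.
    print(name, len(ideal_rotamers(name)))
```

With one dihedral the enumeration yields three rotamers, and with two it yields nine, which matches the counts given for the short side chains and for leucine and isoleucine.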
MolIDE performs homology modeling of single proteins in a visual environment. The basic
protocol, described step-by-step below, is shown in Figure 1 (Please note that the description
for MolIDE starts at step 5 in the PROCEDURES section). Its first step is to perform a database
search with the program PSI-BLAST38. PSI-BLAST is an iterative program that searches a
database for sequences similar to the target or query sequence. Usually the sequence database
for this search is the non-redundant database (“nr”) of protein sequences provided by the
NCBI39; currently this database contains over 6 million sequences. The first iteration of
PSI-BLAST is just like the BLAST program, finding the sequences in the database most closely
related to the query. PSI-BLAST creates a multiple sequence alignment of the target
sequence with these sequences. This multiple sequence alignment is then transformed into a
sequence profile, which expresses the frequency of each of the 20 amino acid types in each
column of the multiple sequence alignment. That is, for each residue of the query, the profile
contains a numerical value for each of the 20 amino acids, depending on how often that residue
type is found aligned to the query residue amongst homologues.
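The idea of a sequence profile can be sketched in a few lines. The example below computes plain per-column amino acid frequencies from a toy alignment; the sequences and function name are hypothetical. It is a simplification: PSI-BLAST itself applies sequence weighting and pseudocounts and stores log-odds scores, none of which is reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment):
    """Return, for each alignment column, the fraction of each amino acid type.

    Gap characters ('-') are skipped, so each column is normalized over the
    residues actually observed in it.
    """
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment (hypothetical sequences): the query first, then three homologues.
toy_alignment = ["GLKV", "GIKV", "GVRL", "GL-I"]
profile = column_frequencies(toy_alignment)

print(profile[0]["G"])  # first column is entirely glycine
print(profile[1]["L"])  # second column is a mixture of hydrophobic residues
```

Each entry of `profile` corresponds to one position of the query and holds a value for each of the 20 amino acid types.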
The second iteration of PSI-BLAST searches the database with the profile instead of just the
query sequence, and scores each sequence in the database by how well it aligns to the profile.
For instance, if the profile shows that a particular position in the multiple sequence alignment is 100%
glycines, then the profile will score glycine in a database sequence very highly, and all other
residue types either neutrally or negatively. If another column is a mixture of hydrophobic
residues, then all hydrophobic residues will be scored positively but charged residues will be
scored quite negatively. Some positions in the multiple sequence alignment contain most or
all of the 20 amino acids, and such positions will score all of the amino acids neutrally. The
second iteration will produce a new list of hits. Many of these will be the same as in the first
round, but the alignments may be different. True hits will usually have longer alignments, and
the expectation values or E-values will be much better. An E-value for a hit is the number of
hits expected with a raw score the same as or better than the hit; thus a very small E-value
(<0.001) indicates a high statistical significance. In the second round, many new hits with good
E-values will be found that were not found in the first round. PSI-BLAST then builds a new
multiple sequence alignment and profile from the hits in the second iteration, and then performs
a search with this profile. The number of iterations is controlled by the user from within
MolIDE. We have added to the most recent version of MolIDE (version 1.6) the ability to open
the PSI-BLAST output file against the nr database. This produces a table that is sortable by
E-value, sequence identity, starting and stopping residues of the alignment, protein name, and
species. As such, MolIDE provides a graphical interface for PSI-BLAST searches of large
Once PSI-BLAST has created these profiles, each profile is used to search a database of
sequences of proteins of known structure, called pdbaa. We provide access to this database
from our website each week as new structures are added to the PDB2. This PSI-BLAST of the
PDB runs automatically after the search of nr to create the profiles. The resulting files, one for
each round of PSI-BLAST search of nr, contain lists of possible template structures for
modeling the target sequence as well as alignments between the target and the template
sequences. The version of PSI-BLAST provided in MolIDE is modified from that provided by
NCBI. In particular, it outputs a profile matrix for each round of PSI-BLAST with a different
filename (e.g. file1, file2, etc.) rather than overwriting a single filename. It also outputs a profile
after the last search round of PSI-BLAST, while NCBI's version does not.
It is helpful at this point to perform a prediction of secondary structure within MolIDE using
PSIPRED40. The PSIPRED predictions can be used to identify folded domains of the proteins
in regions where there are significant amounts of predicted secondary structure, and disordered
regions where little secondary structure is predicted. Also, the PSIPRED predictions will later
be displayed in the alignments of the target sequence to a template. In this situation, they can
be used to help determine whether the template is correct (at least some agreement of predicted
and experimental secondary structure is expected) and to identify regions that may be
misaligned (poor agreement of major secondary structure elements). PSIPRED uses the
sequence profiles produced by PSI-BLAST to predict the positions of α helices and β strands
in the target sequence.
The next step in modeling is to choose a template by opening the list of hits from the PDB.
MolIDE parses each PSI-BLAST output file and creates a table of hits, including their
experimental source (XRAY or NMR), the resolution of X-ray structures, the E-value,
sequence identity, the beginning and ending points in the target sequence, and the alignment
length. This table is sortable by any of these elements, which provides a rapid way of finding
the highest resolution structure or the one with the highest sequence identity. Quite frequently,
a multi-domain protein may have homologues of known structure that cover non-overlapping
regions of the target sequence. The table can be sorted by beginning or ending residue numbers
of the target sequence in order to locate templates that cover the domain of interest among a
list of possibly hundreds of hits. The template for modeling is chosen based on the best
combination of a number of factors. First, it must cover the region or regions of interest in the
target protein. From among those hits, usually the best E-value or highest sequence identity
template is chosen. When there are a number of templates with about the same evolutionary
distance from the target sequence, the one with highest resolution is usually the best choice of
template. Another consideration may be the number and positions of gaps in the sequence
alignment. Often one template or another may contain a ligand or binding partner of interest
while others will not. This ligand may be nucleic acid, small organic molecules, ions, or other
proteins. Our database program ProtBuD41 can be used to search within a protein family for
particular ligands. Structures of such complexes may be used to identify important binding
residues in the model of the target.
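The selection criteria above can be expressed as a simple filter and sort. The sketch below is not MolIDE code; the `Hit` record, the PDB identifiers, and the 5-point identity bins used to approximate "about the same evolutionary distance" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    pdb_id: str                  # hypothetical identifiers in the example below
    method: str                  # "XRAY" or "NMR"
    resolution: Optional[float]  # in angstroms; None for NMR structures
    evalue: float
    identity: float              # percent sequence identity
    start: int                   # first aligned residue of the target
    end: int                     # last aligned residue of the target


def rank_templates(hits, region_start, region_end):
    """Keep hits covering the region of interest, then rank them.

    Hits are grouped into 5-point identity bins (an arbitrary choice); within
    a bin, better resolution comes first, then better E-value.
    """
    covering = [h for h in hits if h.start <= region_start and h.end >= region_end]

    def sort_key(h):
        resolution = h.resolution if h.resolution is not None else float("inf")
        return (-(h.identity // 5), resolution, h.evalue)

    return sorted(covering, key=sort_key)


example_hits = [
    Hit("1aaa", "XRAY", 2.5, 1e-40, 42.0, 10, 200),
    Hit("1bbb", "XRAY", 1.6, 1e-38, 41.0, 5, 210),
    Hit("1ccc", "NMR", None, 1e-50, 60.0, 220, 300),  # does not cover the region
    Hit("1ddd", "XRAY", 1.2, 1e-10, 22.0, 1, 250),
]
ranked = rank_templates(example_hits, region_start=20, region_end=180)
print([h.pdb_id for h in ranked])
```

A ranking of this kind is only a starting point; the presence of a ligand of interest or the placement of gaps may still favor a lower-ranked template.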
With a simple click within the list of templates, MolIDE downloads the template structure from
the PDB and then displays the structure of the template and the sequence alignment of the
target and template. The predicted secondary structure of the target is shown above the target
sequence and the experimental secondary structure of the template is shown below the template
sequence. At this point, the alignment can be manually edited. PSI-BLAST alignments are
reasonably accurate at sequence identities above 30%42, but even then gaps representing
insertions or deletions in the alignment may be placed within regular secondary structures
of the template. The visualization tool within MolIDE allows the user to see the placement of
deletions from the structure (deleted residues marked by red balls) and insertions into the
structure (marked by yellow balls at the point of insertion). The gaps can be moved within the
alignment as described in the protocol and the changes are marked in real time on the structure.
For deletions from the structure, it is best if the end points of the deletion are relatively close
to one another in space on the template structure.
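How close the end points of a deletion are can be checked numerically as well as visually. The following minimal sketch computes the distance between the Cα atoms on either side of a deletion; the coordinates and function name are hypothetical, and no particular distance cutoff is implied by the text.

```python
import math


def endpoint_distance(ca_before, ca_after):
    """Distance in angstroms between the C-alpha atoms flanking a deletion."""
    return math.dist(ca_before, ca_after)


# Hypothetical C-alpha coordinates (x, y, z) of the residues flanking a deletion.
before = (10.0, 4.0, 2.0)
after = (13.0, 8.0, 2.0)
print(endpoint_distance(before, after))
```

A short distance suggests that the remaining residues can be joined without large changes to the surrounding backbone.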
Once the alignment has been edited, the modeling process consists of three steps performed
within the MolIDE graphical interface. The first is simply to copy the backbone coordinates
to a new file, renumbering and renaming the residues to match the target sequence according
to the alignment. Only aligned residues are copied. If the residue types are the same at a given
position in the target and template sequences, the side-chain coordinates are also copied to the
new file. If the template PDB contains modified residues (selenomethionine or phosphorylated
residues), only the backbone is copied. The second step is to run SCWRL to build the side
chains of the target sequence onto the model backbone. SCWRL is able to preserve the
Cartesian coordinates of side chains that are conserved, and this generally results in more
accurate side-chain prediction in homology modeling. The third step is to model each loop in
turn. MolIDE uses the program Loopy43 which is a relatively fast program for loop structure
prediction. It is one of the few stand-alone loop-modeling programs. To model the first loop,
the user selects positions for the left and right anchors of the region to be remodeled. These
are positions in the structure that will be kept fixed while the residues in between will be
modeled by Loopy. The left anchor can be chosen as the last residue of the secondary structure
preceding the gap, while the right anchor is the first residue of the secondary structure following
the gap. Alternatively, longer or shorter regions may be chosen in order to choose the anchor
positions as the closest conserved residues to the gap. Once the anchors are set, the loop can
be modeled with a click. The anchors are then set around the next gap in the same manner and
so on, until all the insertion and deletion regions have been remodeled with Loopy.
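The relationship between the alignment and the side-chain step can be illustrated with a short sketch. It derives a SCWRL-style sequence string from a target-template alignment, using the convention given in the SCWRL procedure that a lower-case letter keeps the input side-chain coordinates and an upper-case letter requests prediction. The function name and the toy alignment are hypothetical, and this is not a description of MolIDE's internal code.

```python
def scwrl_sequence(target_aln, template_aln):
    """Build a SCWRL-style sequence string from two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped, since only
    aligned residues have template backbone coordinates. Conserved residues
    are written in lower case so that their side-chain coordinates are kept.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    letters = []
    for target_res, template_res in zip(target_aln.upper(), template_aln.upper()):
        if target_res == "-" or template_res == "-":
            continue  # insertion or deletion: handled later by loop modeling
        if target_res == template_res:
            letters.append(target_res.lower())  # conserved: keep coordinates
        else:
            letters.append(target_res)          # changed: predict side chain
    return "".join(letters)


# Toy alignment (hypothetical): one substitution, one deletion, one insertion.
print(scwrl_sequence("MKT-LVAG", "MRTALV-G"))
```

The resulting string contains one letter per residue of the copied backbone, which is the length SCWRL expects for its sequence file.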
If a protein is very long (>500 amino acids) and contains multiple domains and long disordered
regions, it is sometimes helpful to use target sequences of the single domains of interest. Many
proteins contain long regions that are intrinsically disordered. Several web servers are available
to predict these regions, including DISOPRED44. Such regions can be removed from the target
sequence before the PSI-BLAST search step.
MolIDE produces models assuming that aligned regions of the target to the template do not change
backbone conformation. While this is not in general true, it is a reasonable approximation when
the sequence identity is above 30%. Even below this value, the model is still useful in
understanding the relative positions of residues in the protein. In any sequence alignment, some
regions are more conserved than others, and these regions usually have functional or structural
significance. Thus, even such a simple modeling procedure will predict whether residues are
on the surface or buried: mutation of buried residues may lead to unfolding, while mutations
on the surface may abolish binding to other molecules. For protein-protein interactions, a patch
of conserved surface residues may be close together on the surface but far apart in the sequence.
Such a conserved patch may be used to locate likely binding surfaces, which can be tested
using site-directed mutagenesis.
It should be noted that there are web servers that will also produce homology models, including
SwissModel45 as well as databases of models, including ModBase46. These are certainly valid
alternatives to performing homology modeling with SCWRL and MolIDE. However, there are
many choices in homology modeling that a user may wish to make with consideration of
particular biological questions in mind, and MolIDE is designed to allow the user to make these
choices while handling the nitty-gritty computational steps with a few clicks. This is especially
true in the choice of template. Some templates may be preferred because they contain ligands
of interest, including other proteins, small molecules, or nucleic acids. Some templates may
have better conservation near residues that the user is interested in, for instance given existing
mutation data. Other templates may have fewer insertions or deletions in regions of interest,
for instance near an interface. Further, MolIDE provides user interaction during the
model-building process that may be highly beneficial. This is especially true of manual editing of the
target-template alignment using additional information that the user may possess, including
multiple sources for target-template alignment and visualization of the template structure.
Choosing the positions where loop modeling begins and ends (the loop anchors) by visualizing
the structure may also lead to better structure predictions.
MATERIALS
Equipment Setup
Software—All of the software and various components described here can be downloaded
from http://dunbrack.fccc.edu as described in the procedures below.
Databases—The sequence databases required can be downloaded from our web site and
publicly available websites as described in the procedures below.
PROCEDURES
SCWRL: Obtaining and installing SCWRL
1 | From the SCWRL webpage, http://dunbrack.fccc.edu/SCWRL3.php, follow the link labeled
“Download” and fill out the license form. SCWRL is free to non-profit institutions.
Commercial institutions should contact [email protected]. Fill out the form and
click the “I agree” button at the bottom of the page. This leads to a verification page for the
input information. Click “Send request.” The request is sent to the Fox Chase Cancer Center
for approval. On approval, the user will receive an e-mail message with the subject heading
“SCWRL3.0 Download.” Click the link in this e-mail message to obtain SCWRL3.0 for various
platforms, including Windows (both XP and Vista), Linux, Mac OS X, SGI Irix, and SunOS.
Click “download” to begin downloading of an archive that contains the SCWRL program and
the binary rotamer library file used by SCWRL.
The installation kits for each operating system have the following names respectively:
scwrl3_win.msi
scwrl3_lin.tar.gz
scwrl3_mac.tar.gz
scwrl3_sgi.tar.gz
scwrl3_sun.tar.gz
2 | The procedure for installing SCWRL is slightly different on Windows (Option A) and the
Unix-related platforms (Option B).
(B) Unix-based operating systems (Linux, MacOS X, SGI Irix, and SunOS)—(i)
Move the archive to a location on your hard drive where you want to keep the SCWRL program.
From a terminal window, give the following commands (for example, for the Linux
distribution):
cd scwrl_path/
gzip -d scwrl3_lin.tar.gz
tar -xvf scwrl3_lin.tar
where “scwrl_path/” is the name of the directory that contains the file scwrl3_lin.tar.gz.
Ordinarily on Linux systems, a typical directory for SCWRL might be /usr/local/bin.
This previous step will create a new directory, scwrl3_lin/, and unpack four files and a folder
into that directory:
BBDep.bin
setup
scwrl3_
README.scwrl3
examples/
On the command line of the terminal window, now type:
cd scwrl3_lin
./setup
This command will modify the executable file scwrl3_ and move it to the filename scwrl3.
This executable now contains within it the location of the rotamer library file, BBDep.bin. The
executable can be moved elsewhere on the computer and can be executed in any directory, as
long as the rotamer library remains in the location where ./setup was run. If you decide to move
the BBDep.bin to a different location, repeat the installation procedure for that directory,
beginning with uncompressing the kit.
(A) Windows—Open a console window by selecting “Command Prompt” from the “Start”
menu. On some systems, the Command Prompt may be found by selecting “All Programs”
from the “Start” menu, and then selecting “Accessories” and then “Command Prompt.”
(B) Unix systems—Unix systems vary on how to open a terminal window. On Mac OS X,
for example, the Terminal program is located in the Utilities folder within the Applications
folder. Consult your system administrator for assistance if necessary.
4 | SCWRL may be used in various ways using some optional flags on the command line.
SCWRL can predict side-chain conformations for an input backbone structure without
modification of the sequence. In this case, SCWRL removes all side-chain atoms from the
input file (if any), and rebuilds all of the side chains according to the residue names of the
backbone atom coordinates (Option A). SCWRL can predict side chain conformation in the
presence of non-protein atoms, such as ions, ligands, and nucleic acids. The ligand atoms are
treated only with a simple steric repulsive energy function, so that the predicted side chains
will not overlap the ligand atoms. SCWRL determines the element (N, C, Zn, Mg, etc.) from
the atom name, and assigns a radius to each atom based on its element type. The procedure for
modeling in the presence of ligands is as described in Option B. SCWRL can change the
sequence of the input file by reading an additional file containing the new sequence. This
sequence file should contain one-letter codes for the new sequence, and must contain exactly
the same number of residues that the input PDB file contains. If it does not, SCWRL will report
an error. The new sequence is placed on the backbone, retaining the input chain identifiers (A,
B, C, etc.) and residue numbering. The input file may contain multiple chains, as long as the
input sequence file contains the new sequence for each chain in the same order as the input
coordinate file. The input sequence file also may contain information to indicate whether the
Cartesian coordinates of the side chain in the input file should be kept. This is useful in
homology modeling, since better predictions will usually be produced when conserved side
chains (same residue type in the target and template sequences) are kept fixed during
side-chain prediction (Option C).
(A) Running SCWRL without modifying the sequence and without ligands
i. Prepare the file containing the backbone coordinates. This is a PDB format file, which
typically looks something like the example shown in Figure 1. The element types on
the end of the line are ignored and are not necessary in the input. The chain ID (“A”)
is also not necessary, as long as this character is replaced by a space. The spacing is
critical and should adhere to the standard PDB format (see
http://www.wwpdb.org/docs.html). For instance, the atom names for the backbone
atoms (N, CA, C, O) should begin in column 14 and are left-justified. The three-letter
residue type of the 20 standard amino acids begins in column 18. All other residue
types are ignored. The residue number is right-justified with the right-most digit in
column 26. The last two numbers on the line are the occupancy (usually 1.0) and the
atomic B-factor or temperature factor. SCWRL ignores these, replacing the
occupancy with 1.0 and the B-factor with 0.0. The file can contain REMARK records
or other record types besides the ATOM records of the x,y,z coordinates of the
backbone. These other record types will also be ignored. This file can have any name
or extension, although typically PDB format files have the extension .pdb or .ent.
ii. To predict the side chains for the input backbone conformation, issue this command
in the terminal window or console window:
(for Windows)
scwrl_path\scwrl3.exe -i inputpdbfile -o outputpdbfile > logfile
(for Unix systems)
scwrl_path/scwrl3 -i inputpdbfile -o outputpdbfile > logfile
The filenames can be any desired names for the input file (“inputpdbfile”), the output
PDB-format file (“outputpdbfile”), and the log file (“logfile”). So for example, the
actual commands on Windows or Unix systems, respectively, might be:
C:\FCCC\scwrl3_win\scwrl3.exe -i myfile.pdb -o mymodel.pdb >
mylog.log
/usr/local/bin/scwrl3_lin/scwrl3 -i myfile.pdb -o mymodel.pdb > mylog.log
The log file contains output from SCWRL about how the prediction problem
was solved, including details of the graph-theory algorithm described
in the paper29. Most users will not need the information in this file.
(C) Running SCWRL with an input sequence different from the input coordinates
i. Prepare the input files. The backbone file is prepared in the same way as in part (A)
(i). The sequence file is a text file containing the new sequence. To indicate that the
input side-chain coordinates should be kept, the residue type is put in lower-case. If
SCWRL is to predict the side-chain coordinates, the residue is in upper-case. SCWRL
will ignore spaces, carriage-returns, and numbers in the sequence file. For instance,
to remodel the side chains in the protein crambin, PDB entry 1CRN, but keeping the
cysteines fixed in their input coordinates, the sequence file would appear as:
TTccPSIVARSNFNVcRLPGTPEAIcATYTGcIIIPGATcPGDYAN
If the lower-case residue type does not agree with the input PDB file residue type in
the same position within the structure, then the input PDB file residue type will be
used and the coordinates will be predicted by SCWRL.
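A sequence file of this kind can also be generated rather than typed by hand. The following Python sketch is an illustration under stated assumptions: the function name make_scwrl_sequence is hypothetical, and it marks residues as fixed by residue type only (every cysteine, for example), not by individual position. The input sequence is the one used in the crambin example above.

```python
def make_scwrl_sequence(sequence, keep_fixed):
    """Return a SCWRL-style sequence string.

    Residue types listed in keep_fixed are written in lower case (input
    side-chain coordinates kept); all others are written in upper case
    (side chains predicted). Spaces, digits and line breaks are dropped.
    """
    keep = {aa.upper() for aa in keep_fixed}
    return "".join(
        aa.lower() if aa.upper() in keep else aa.upper()
        for aa in sequence
        if aa.isalpha()
    )


crambin = "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN"
print(make_scwrl_sequence(crambin, keep_fixed="C"))
```

To fix only selected positions of a residue type, the same idea can be applied per position instead of per type.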
ii. Type the appropriate command into the console or terminal window:
(for Windows)
scwrl_path\scwrl3.exe -i inputpdbfile -o outputpdbfile -s sequencefile > logfile
(for Unix-based systems)
scwrl_path/scwrl3 -i inputpdbfile -o outputpdbfile -s sequencefile > logfile
where sequencefile contains the new sequence.
SCWRL can combine the optional sequencefile and framefile by using both flags on the
command line. It has two further flags, -u and -d. The flag -u tells SCWRL not to
predict disulfides. This is useful for proteins in reducing environments, especially if
they contain cysteines in proximity to one another around a metal ion such as zinc.
The flag -d tells SCWRL to write the dihedral angles of the predicted structure to a file
called outputpdbfile.dihed. The order of flags is not important.
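For users who script their modeling runs, the command lines above can be assembled programmatically. The Python sketch below is one possible wrapper, not part of the SCWRL distribution: the function names build_scwrl_command and run_scwrl and the example file names are assumptions, and only the flags described in this protocol (-i, -o, -s, -f, -u, -d) are used.

```python
import subprocess


def build_scwrl_command(scwrl_exe, input_pdb, output_pdb, sequence_file=None,
                        frame_file=None, no_disulfides=False,
                        write_dihedrals=False):
    """Assemble the SCWRL argument list from the flags described in the text."""
    cmd = [scwrl_exe, "-i", input_pdb, "-o", output_pdb]
    if sequence_file:
        cmd += ["-s", sequence_file]     # new sequence for the backbone
    if frame_file:
        cmd += ["-f", frame_file]        # ligand coordinates (frame file)
    if no_disulfides:
        cmd.append("-u")                 # do not predict disulfides
    if write_dihedrals:
        cmd.append("-d")                 # also write outputpdbfile.dihed
    return cmd


def run_scwrl(cmd, log_path):
    """Run SCWRL, sending its standard output to the log file."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stdout=log, check=False).returncode


# Hypothetical paths and file names, for illustration only.
cmd = build_scwrl_command("/usr/local/bin/scwrl3_lin/scwrl3",
                          "myfile.pdb", "mymodel.pdb",
                          sequence_file="myseq.txt", no_disulfides=True)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces, which is common on Windows.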
The installation kits for each operating system have the following names:
molide1.6_win.msi
molide1.6_lin.tar.gz
6 | Installing and setting up MolIDE on Windows (Option A) and Linux systems (Option B)
follow different procedures. The procedure on Windows is significantly simpler.
i. Move the archive file, molide1.6_lin.tar.gz, to a location on your hard drive where
you want to keep the MolIDE archive. From a terminal window, give the following
commands:
cd molide_path/
gzip -d molide1.6_lin.tar.gz
tar -xvf molide1.6_lin.tar
where “molide_path/” is the name of the directory that contains the file
molide1.6_lin.tar.gz. On Linux systems, a typical directory for MolIDE
might be /usr/local/bin.
These commands create a new directory, molide1.6_lin/, and unpack directories
and files into it.
ii. MolIDE uses the wxWindows cross-platform library, which must be installed. For
convenience we have included in the Linux distribution the rpm files for version
2.4.2-1, located in the subdirectory “molide_path/molide1.6_lin/wx”. To install the
wxWindows library on Linux, the user must be logged into the computer as “root.”
If you do not have root privileges on your machine, ask a system administrator to
perform this step. Type the following in a terminal window:
cd molide_path/molide1.6_lin/wx
rpm -U *.rpm
!Caution. If a later version of wxWindows is already installed, the system may return
error messages with the previous command. To override this, the flag “-f” can also
be given in the rpm command. However, this may compromise other programs on
your system that may use wxWindows. It is unlikely that most users will face this
problem, because wxWindows is not that common on Linux machines.
iii. To set up MolIDE, type the following in a terminal window:
cd molide_path/molide1.6_lin/
./setup
!Caution. If you already have the NCBI package for BLAST installed on your
machine, make a back-up copy of the file .ncbirc in your home directory, if it exists.
It will be overwritten by setup.
8 | MolIDE depends on two sequence databases for producing homology models. The first is
the non-redundant protein sequence database, “nr,” from NCBI, currently about 6 million
sequences. This sequence database is used to produce sequence profiles for the target sequence
based on multiple sequence alignment of many homologues. The second is the PDB protein
sequence database, “pdbaa,” which must be obtained from our website. This version of pdbaa
is different from NCBI's version in a number of ways. It is more up-to-date than NCBI's, and
contains additional information on the header line for each sequence, including experiment
type, resolution, sequence length, and R-factors. It also has distinct names for different
sequences in a single PDB file, based on the gene name for that protein.
The PDB is updated weekly on Wednesdays, and the pdbaa database is updated on our website
within a couple of days. The pdbaa database within MolIDE therefore can be updated as often
as weekly when MolIDE is in use. Over 100 structures are added to the PDB every week. To
update pdbaa, select “Update DB” from the Tools menu. A window appears with the option
to update either “PDBAA” or “NR.” Select “PDBAA,” click “OK”, and then “Download.”
The PSI-BLAST formatted files will be installed in the appropriate location.
After MolIDE is installed for the first time, download the nr database via the “Update DB”
option in the Tools menu. The nr sequence database will be downloaded from the NCBI ftp
site, and automatically formatted for PSI-BLAST using the NCBI program formatdb, included
with MolIDE. Depending on download speed, it may take 30 minutes or more to download nr,
and 10 minutes or more to uncompress it and format it for PSI-BLAST. All of this will be done
automatically by MolIDE. The nr database can be updated periodically. A monthly update is
more than sufficient.
Other databases may be used instead of the nr database from NCBI. For instance, the UniRef
databases from UniProt are suitable. To use the uniref100 database from UniProt:
1. under Tools->Servers, change “NR ftp Server” to ftp.uniprot.org;
2. under Tools->Servers, change “NR ftp server Path (.gz file)” to
/pub/databases/uniprot/current_release/uniref/uniref100/uniref100.fasta.gz;
3. under Tools->Psiblast, change “NR Seq DB File and Path” to
C:\FCCC\MolIDE\db\nr\uniref100 or wherever the uniref100 database is located.
!Caution. The PDB sequence database pdbaa must be obtained from our website for use in
MolIDE rather than NCBI's file of the same name.
? TROUBLESHOOTING
RGRERFEMFRELNEALEL
KDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
A FASTA-formatted sequence file includes “>Name” as the first line in the file with the
sequence following, starting on the next line. The sequence can be spread over as many lines
as necessary and the lines can be of any length. Other text can follow the name on the first line,
and there should only be one sequence in the file. Spaces and numbers within the sequence
will be ignored.
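The FASTA rules just described can be checked before a file is opened in MolIDE. The Python sketch below is a minimal validator written for illustration; the function name read_single_fasta and the header name “target” are assumptions, and MolIDE's own parser may differ in details.

```python
def read_single_fasta(text):
    """Parse a FASTA file that must contain exactly one sequence.

    Returns (name, sequence). Spaces and numbers in the sequence lines
    are ignored, and the sequence may span any number of lines.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("first line must start with '>'")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("file must contain only one sequence")
    fields = lines[0][1:].split()
    name = fields[0] if fields else ""   # other text may follow the name
    sequence = "".join(ch for ln in lines[1:] for ch in ln if ch.isalpha())
    return name, sequence


# Hypothetical header; the residues are split over two lines with numbering.
example = ">target example protein\n  1 RGRERFEMFR\n 11 ELNEALEL\n"
print(read_single_fasta(example))
```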
Within MolIDE, open a sequence file via the File menu, selecting “Open” and then “Sequence.”
In what follows, we describe choosing items under one of the menus with the following
notation, as in this case: “File->Open->Sequence”.
automatically searched with each of the profiles and a separate PDB alignment file is created.
You can change other parameters used during the PSI-BLAST runs using Tools->Options->PSIBLAST.
For instance, for common protein domains such as kinases or immunoglobulins,
it may be desirable to use a stricter E-value cutoff for creating the profiles in PSI-BLAST. If
you know that you have close homologues to your target sequence in the PDB, then rather than
searching nr to create profiles, you can use pdbaa for this step as well.
Once PSI-BLAST is finished, close the PSI-BLAST run window by clicking “Done.” The PSI-
BLAST output of the search against the non-redundant protein database can be opened by
choosing “Open->Seq HITS Alignment”. A window opens with a table of all the hits found in
the non-redundant database. This is shown in Figure 3.
? TROUBLESHOOTING
11 | Run PSIPRED to predict the secondary structure of the target by selecting Tools->PSIPRED.
PSIPRED uses the output PSI-BLAST sequence profiles from the nr database
search. PSIPRED should therefore be run only after the PSI-BLAST run in the previous step.
To view the secondary structure predictions, select File->Open->Sec Struct Pred. After
choosing a “.psipred” file, you are given the option to display all of the predictions based on
the matrices generated by PSI-BLAST after each round. These are displayed in a single window
with each prediction on a separate line. A predicted sheet is colored in green and a predicted
helix is in red. Predicted coil regions (loops and unstructured regions) are depicted in gray. A
screenshot of predicted secondary structure from several rounds of PSI-BLAST is also shown
in Figure 3.
The intensity of the color is proportional to the prediction confidence; the darker the color, the
higher the prediction confidence. This view allows you to see if the secondary structure
prediction changes as more remotely related sequences are added to the profiles. For proteins
with few close relatives, the predictions may be more accurate in later rounds as distantly
related sequences provide information on likely secondary structure patterns. However, for
proteins with a good number of close relatives (>50), the addition of distantly related sequences
with potentially large structural changes (additional secondary structure or missing secondary
structure) may degrade the secondary structure prediction.
Modeling with MolIDE: Choosing a template and editing the alignment—12 | Open
the PSI-BLAST file containing the alignments of your query sequence with sequences of
proteins from PDB. There will be a separate hits file for each round of PSI-BLAST. Open a
file of hits from the PDB by selecting “Open->PDB Hits Alignment”. Only files with the correct
extension, .pdbout, will be shown.
The results are displayed as a table, as shown in Figure 4. To sort the table by some feature,
click on the column header of that feature, such as E-value, sequence identity, or starting and
ending positions of the alignment. Clicking again on the same column header will reverse the
sorting order for that column. In a multi-domain protein, for instance, it is common to have
many available templates for one domain, but only a few templates for a different domain.
These are hard to locate in the raw PSI-BLAST output text file, but the sorting feature in
MolIDE makes them easy to find.
Because there is a file of hits for each round of PSI-BLAST, one may ask which file should be
used. There are no strict rules for determining this. A rough rule of thumb would be the earliest
iteration that produces a complete alignment of the target sequence or domain of the target to
a known structure. For example, if the target is a single-domain protein, then the earliest round
that aligns the entire target domain to a complete domain of known structure should be used.
In many cases, the first or second round may only align a central well-conserved region, but
not the whole domain. In that case, later rounds should be examined. This can be determined
by observing the alignment and the template structure simultaneously, as detailed in the next
step.
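The rule of thumb above can be expressed as a small calculation. The following Python sketch is only an illustration: the function name, the tuple layout of the hits, and the 90% coverage threshold are assumptions made here, and the sketch considers coverage of the target only. Whether the template domain is also complete still has to be judged by viewing the alignment on the structure.

```python
def earliest_complete_round(hits, domain_start, domain_end, min_coverage=0.9):
    """Return the earliest PSI-BLAST round whose hit covers the target domain.

    hits: iterable of (round_number, query_start, query_end), with 1-based
          inclusive residue numbers on the target sequence.
    Returns None if no round reaches the requested coverage.
    """
    domain_len = domain_end - domain_start + 1
    for rnd, q_start, q_end in sorted(hits):
        overlap = min(q_end, domain_end) - max(q_start, domain_start) + 1
        if overlap / domain_len >= min_coverage:
            return rnd
    return None


# Hypothetical hits to one template over four rounds for a 200-residue domain.
hits = [(1, 40, 110), (2, 35, 150), (3, 5, 198), (4, 3, 200)]
print(earliest_complete_round(hits, domain_start=1, domain_end=200))
```

In this made-up example the first two rounds align only a central region, so the third round is the earliest one that spans nearly the whole domain.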
13 | While sorting the table enables the user to locate the best templates by resolution, sequence
identity, and other features, viewing the alignment of the target to a possible template is a key
step in this choice. From the hits table, double-click on the hit number in the first column for
a template of interest. The alignment of the target to that structure in the PDB will be extracted
from the PSI-BLAST output file, and saved in a separate file in the same directory where the
sequence file resides. The extension of this file is .alnonet, which stands for “alignment with
one template.”
At the same time, a window will appear for downloading the coordinates of the template
structure from the PDB. Click “Download.” The default ftp server, ftp.wwpdb.org, should work
well (at times it may be slow), but the user can change the ftp server by selecting
Tools->Options->Servers. MolIDE uses the XML-format files from the PDB
and converts these to PDB format. It also extracts information from the XML file on how the
template sequence (numbered from 1 to N, its length) corresponds to residues in the
coordinates. PDB files are sometimes missing coordinates due to disorder, and the coordinates
may be numbered starting on any number. This information is critical in converting a template
into a target model given an alignment, and MolIDE handles it automatically from information
contained in the XML format. This information is not present in the PDB's PDB-format files.
For a PDB entry with accession code 1abc, for example, the coordinates will appear in file
1abc.pdb and the sequence-coordinate residue correspondence will appear in file 1abc.sc.
Once the XML file is downloaded, read, and converted, a window appears with a view of the
target-template sequence alignment, the secondary structure prediction of the target, and the
experimental secondary structure of the template. Above the alignment, the backbone of the
template structure appears. The whole template protein is displayed in gray, while the part of
the structure used in the alignment is displayed in green. Insertions in the target (target longer
than template) are marked by 2 adjacent yellow spheres on the template structure Cα atoms
surrounding the insertion point. Deletions from the template (target shorter than template) are
represented by red spheres on the Cα atom of residues to be deleted from the template structure.
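The way insertions and deletions follow from a gapped alignment can be sketched in a few lines of Python. This is an illustration of the idea, not MolIDE's code: the function name find_indels, the use of “-” as the gap character, and the example sequences are assumptions.

```python
def find_indels(target_aln, template_aln, gap="-"):
    """Locate insertions and deletions on the template from a pairwise alignment.

    Returns (insertions, deletions) in template sequence numbering (1 to N):
      insertions: template positions p such that target residues are inserted
                  between template residues p and p + 1
      deletions:  template positions with no aligned target residue
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    insertions, deletions = [], []
    template_pos = 0                      # number of the last template residue seen
    for target_char, template_char in zip(target_aln, template_aln):
        if template_char != gap:
            template_pos += 1
        if template_char == gap and target_char != gap:
            # Target is longer than the template at this point.
            if not insertions or insertions[-1] != template_pos:
                insertions.append(template_pos)
        elif target_char == gap and template_char != gap:
            # Target is shorter than the template at this point.
            deletions.append(template_pos)
    return insertions, deletions


# Hypothetical alignment: two residues inserted, one residue deleted.
target = "ACDEFGHIK-LM"
template = "ACD--GHIKWLM"
print(find_indels(target, template))
```

Here the insertion falls between template residues 3 and 4, which corresponds to the pair of flanking spheres, and template residue 8 is the one to be deleted.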
Viewing the alignment and the structure simultaneously allows the user to determine visually
whether the alignment covers an entire template domain and to identify where insertions and
deletions are located on the structure. MolIDE's viewer is quite simple and is not suitable for
making images for publication. Its purpose is to examine and edit the target-template sequence
alignment. The PDB files produced by MolIDE can be read into any molecular viewer, such
as PyMOL (W. L. DeLano, http://pymol.org).
? TROUBLESHOOTING
Insertions are best placed in the middle of loop regions, not immediately next to regular
secondary structure. The correspondence of predicted secondary structure of the target and the
experimental secondary structure of the template can be used to guide the alignment. Often
PSI-BLAST may fail to align some regions correctly, so if there is other information available,
on conserved residues for instance, then the alignment can be edited accordingly. For sequence
identities below 30%, it is advisable to seek alternative alignments from servers that provide
profile-profile alignments, which are generally more accurate than PSI-BLAST. One of the
best and most usable of these is FFAS7. The MolIDE alignment can be edited according to the
alignment provided by FFAS.
Moving the mouse over the alignment will display in the status bar at the bottom of the window
the sequence numbers for query and template sequences, as well as the corresponding PDB
coordinate residue number in the template PDB. The color-coding scheme for the secondary
structure of the template is the same one used for the secondary structure prediction (helix=red;
sheet=green). The third column of the status bar displays the number of identities in the
alignment.
To move a gap over several residues, delete it first, then move to the place of insertion and
These operations can be performed on either the target sequence or the template sequence.
Only gap characters can be inserted or deleted.
Modeling with MolIDE : Building the model—15 | Once the alignment editing is
completed, choose “Copy backbone” from the Tools menu. This step produces a file with
extension .model that contains a model of the target sequence based on the aligned residues in
the current target-template alignment window. Side chains for conserved residues (identical
and aligned in the target and template alignment) are also copied to the model.
16 | Select “Build Side Chains (SCWRL)” from the Tools menu. Click “Run SCWRL” in the
window that appears. The conserved side chains are left in the original conformation from the
template crystal structure. This option can be changed with Tools->Options->Scwrl. SCWRL
should run very quickly (seconds). If it takes a long time (>5 minutes), the run should be
canceled in the window, and another template selected. This occurs when the backbone of the
template will not accommodate the target side chains very easily.
17 | Loop building is done by first selecting residues for the left and right anchors. These are
residues that will be kept fixed while the intervening sequence is modeled using the Loopy
program43. It is usually a good idea to allow at least 2-3 residues on either side of the insertion
or deletion to move during the loop-building process. One option is to make the left and right
anchors the last and first residues of the flanking secondary structures respectively. However,
if part of a long loop is well conserved, it may be better to select a smaller region that contains
less conserved segments. Loopy will sometimes be unable to build a loop if the loop length is
too short and the distance to be spanned by the predicted loop is too large. In this case the
anchors should be reset (cleared) and then selected again further apart and Loopy should be
run again.
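As a rough guide to anchor placement, the arithmetic can be written out as follows. This Python sketch is an assumption-laden illustration: the function name suggest_loop_anchors and the default of three flanking residues are choices made here, and the sketch ignores secondary structure and conservation, which should still guide the final choice.

```python
def suggest_loop_anchors(first_changed, last_changed, target_length, flank=3):
    """Suggest left and right anchor residues for loop building.

    first_changed, last_changed: target residue numbers (1-based) that bound
        the inserted segment, or the residues flanking a deletion.
    flank: number of residues on each side that are allowed to move.
    The anchors themselves stay fixed; residues between them are rebuilt.
    """
    left = max(1, first_changed - flank - 1)
    right = min(target_length, last_changed + flank + 1)
    return left, right


# Hypothetical insertion of residues 45-48 in a 120-residue target.
print(suggest_loop_anchors(45, 48, target_length=120))
```

With these numbers the anchors are residues 41 and 52, so residues 42 to 51 are rebuilt. If Loopy cannot close the loop, calling the function with a larger flank moves the anchors further apart.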
Also note that if residues are missing from the structure due to poor electron density, they will
be marked with blue squares below the template sequence. These regions should also be rebuilt
with Loopy.
To build loops, right-click on a query residue in the sequence alignment to display a pop-up
menu:
• Set Loop Left Anchor
• Set Loop Right Anchor
• Reset Anchors
• Build Loop
!Caution. Click on the query sequence, not the template sequence, to select anchors, to reset
the anchors, and to build the loop.
After choosing the loop's anchor residues, proceed with “Build Loop”. Click “Run Loopy” in
the window that appears. Loopy should take about a minute or less to build loops of
up to 15 residues.
Proceed with loop building of each insertion-deletion region in turn until all the insertions/
deletions/missing residues are modeled. The model is contained in a PDB-format file. The file
name follows this convention: ProteinName_x_TemplatePDBChain_y.pdb where x is the
round number of PSI-BLAST run and y is the fragment number of the query sequence that is
aligned with that particular template PDB. This file is first generated after the side chains are
built with SCWRL3. It is subsequently overwritten by Loopy output after each loop is built.
When all loops are built, this file will contain the final homology model.
Adding Ligands to the Model (Optional)—18 | It may be desirable to remodel the side
chains in the presence of a ligand using SCWRL on the command line. MolIDE creates a second
sequence file when the “Copy backbone” command is given. This sequence file contains the
complete target sequence over the region of the target-template alignment. So once loops are
built, this sequence file can be used as input to SCWRL. To perform this step, first copy and
paste the ligand coordinates of interest from the template PDB file produced by MolIDE into
a new file, called a frame file (also see SCWRL instructions above). The sequence file has
extension .s3seqall. To remodel the side chains with SCWRL, type this command in the console
window (on Windows) or the terminal window (on Linux):
(Windows)
cd modeling_directory\
scwrl_path\scwrl3.exe -i inputpdbfile -o outputfile -f framefile -s file.s3seqall > logfile
(Linux)
cd modeling_directory/
scwrl_path/scwrl3 -i inputpdbfile -o outputfile -f framefile -s file.s3seqall > logfile
where framefile contains the ligand coordinates. An example of this is shown in Figure 6.
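Instead of copying and pasting by hand, the ligand records can be pulled out of the template PDB file with a short script. The Python sketch below assumes that the ligand of interest is stored as HETATM records and is identified by its residue name in columns 18-20; the function names and the example records are hypothetical.

```python
def extract_ligand_records(pdb_text, ligand_name):
    """Return the HETATM lines whose residue name (columns 18-20) matches."""
    wanted = ligand_name.strip().upper()
    return [
        ln for ln in pdb_text.splitlines()
        if ln.startswith("HETATM") and ln[17:20].strip().upper() == wanted
    ]


def hetatm(serial, name, resname, chain, resseq, x, y, z):
    """Format a made-up HETATM record with standard PDB column widths."""
    return (f"HETATM{serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}")


# Hypothetical template excerpt: one zinc ion and one water molecule.
template_text = "\n".join([
    "REMARK example template excerpt",
    hetatm(1001, "ZN", "ZN", "A", 201, 10.0, 12.0, 8.0),
    hetatm(1002, "O", "HOH", "A", 301, 4.5, 6.25, 7.0),
])

frame_records = extract_ligand_records(template_text, "ZN")
print("\n".join(frame_records))
```

Writing frame_records to a text file, one record per line, gives a file that can be passed to SCWRL with the -f flag as shown above.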
Timing—To give a brief idea about the length of the homology modeling procedure with
SCWRL and MolIDE, we list below the number of minutes required in each step. The estimated
time is based on our modeling experience on a machine with an AMD Dual Core Processor
SCWRL3 is able to predict about 83% of side chains with the first dihedral angle of the side
chain (χ1) within 40° of the experimental structure29. At 50% sequence identity between target
and template and keeping conserved side chains fixed according to the template structure,
SCWRL3 predicts about 72% of side chains correctly (unpublished data).
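The accuracy measure used here, the fraction of side chains with χ1 within 40° of the experimental value, requires that angle differences be taken on a circle, so that 178° and -175° differ by 7° and not by 353°. A minimal Python sketch of the calculation follows; the function name chi1_accuracy and the example angles are assumptions for illustration.

```python
def chi1_accuracy(predicted, experimental, tolerance=40.0):
    """Fraction of residues whose predicted chi1 is within tolerance degrees
    of the experimental chi1, using the shortest angular difference."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty angle lists of equal length")
    correct = 0
    for pred, expt in zip(predicted, experimental):
        # Wrap the difference into the range [-180, 180) before comparing.
        diff = abs((pred - expt + 180.0) % 360.0 - 180.0)
        if diff <= tolerance:
            correct += 1
    return correct / len(predicted)


# Made-up chi1 angles in degrees for four residues.
predicted = [-65.0, 178.0, 60.0, -170.0]
experimental = [-60.0, -175.0, 180.0, 175.0]
print(chi1_accuracy(predicted, experimental))
```

In this example three of the four residues fall within the tolerance; the third differs by 120°, which corresponds to a different rotamer.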
MolIDE provides a simple and fast modeling procedure based on sequence alignments with
PSI-BLAST. At sequence identities above 30%, PSI-BLAST alignments are reasonably
accurate42. Below this value, there may be poorly conserved regions that are not accurately
aligned, or the alignment may not be complete. In this case, it may be advisable to use profile-
profile alignment methods to obtain a more accurate alignment. For instance, the FFAS
server7 produces more accurate alignments than PSI-BLAST. The alignment that MolIDE
produces can then be edited to conform to that provided by FFAS or any other server or
program.
Also, MolIDE does not take account of any structural changes, other than side chains and loops,
between the target and template. Parts of the modeled structure produced by
alignments with large or frequent gaps or low sequence conservation are therefore
quite suspect. At lower sequence identities, the backbone model will not be very accurate. In
these cases, SCWRL may predict side-chain conformations with significant steric overlaps
with other side chains or the backbone. An energy minimization with CHARMM48 or other
programs will remove these steric overlaps, although the resulting model is unlikely to be any
closer to the true target structure. However, sequence conservation may vary
substantially in different parts of the alignment. Often a few well-conserved motifs are
noticeable in the alignment, and these are likely to be modeled reasonably well based on the
template structure.
Troubleshooting
1) Trouble downloading database files (Step 8)—It is possible that some users may
have trouble downloading the database and PDB XML files because of local IT security
policies. In the case of the nr database, the FASTA formatted file can be manually downloaded
from NCBI by putting this address into a web browser:
ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
The file must first be uncompressed (ungzipped). On Linux, this is accomplished by typing:
gzip -d nr.gz
On Windows, the user must obtain software for uncompressing files such as FreeZip
(http://www.versiontracker.com/dyn/moreinfo/win/10360) and follow the instructions in the
program.
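If Python is available on the machine, its standard library offers another way to uncompress the file on either operating system. This is a suggestion added for convenience and is not part of MolIDE; the function name gunzip and the file names are assumptions.

```python
import gzip
import shutil


def gunzip(src_path, dest_path):
    """Uncompress a .gz file to dest_path, streaming so that a large
    database file does not have to fit in memory."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)


# Example call (file names as in the text above):
# gunzip("nr.gz", "nr")
```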
Once downloaded and uncompressed, the file should be put in the MolIDE database directory,
C:\FCCC\MolIDE\db on Windows or molide_path/db on Linux. Then the database can be
formatted using the program formatdb distributed with MolIDE from the console window in
Windows and a command-line terminal window on Linux:
(on Windows)
cd C:\FCCC\MolIDE\db
C:\FCCC\MolIDE\bin_aux\NCBI\formatdb.exe -i nr -t nr
(on Linux)
cd molide_path/db
molide_path/bin_aux/NCBI/formatdb -i nr -t nr -o
The PDBAA file can be downloaded using a browser from this address:
http://dunbrack.fccc.edu/Guoli/culledpdb/pdbaa.gz
and the same procedure followed to format the database for PSI-BLAST.
2) Running PSI-BLAST (Step 10)—If running PSI-BLAST fails for some reason, copy
the PSI-BLAST command given in the PSI-BLAST runtime window, and paste it into a console
window (Windows) or terminal window (Linux), and hit “return.” PSI-BLAST will now run,
and any error messages it sends to the window will now be visible. These messages may help
to diagnose the problem.
3) Downloading PDB XML files (Step 13)—For users at some institutions, the IT security
setup may not allow MolIDE to access the PDB's ftp site for the XML files. In this case, the
user can go to http://www.rcsb.org, search for the PDB code, and download the
“PDBML/XML gz” or “PDBML/XML text” file and place it in the working directory. In this case,
clicking on the row number in the PDB Hits Alignment (.pdbout) table will use the manually
downloaded XML file instead of going to the ftp server to get it. The rest of the processing is
the same.
ACKNOWLEDGMENTS
This work was supported by NIH grants R01-HG02302 and R01-GM84453 (to R.L.D.) and P30-CA06927 to Fox
Chase Cancer Center. We thank Mark Andrake and Radka Stoyanova for testing MolIDE 1.6.
REFERENCES
1. Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB):
ensuring a single, uniform archive of PDB data. Nucleic Acids Res 2007;35:D301–3. [PubMed:
17142228]
2. Wang G, Dunbrack RL Jr. PISCES: recent improvements to a PDB sequence culling server. Nucleic
Acids Res 2005;33:W94–8. [PubMed: 15980589]
3. Lo Conte L, et al. SCOP: a structural classification of proteins database. Nucleic Acids Res
2000;28:257–9. [PubMed: 10592240]
4. Perutz MF, Kendrew JC, Watson HC. Structure and function of haemoglobin. Journal of Molecular
Biology 1965;13:669–678.
5. Browne WJ, North AC, Phillips DC. A possible three-dimensional structure of bovine alpha-
lactalbumin based on that of hen's egg-white lysozyme. Journal of Molecular Biology 1969;42:65–86.
[PubMed: 5817651]
6. Jones DT, Taylor WR, Thornton JM. A new approach to protein fold recognition. Nature 1992;358:86–
89. [PubMed: 1614539]
7. Jaroszewski L, Rychlewski L, Li Z, Li W, Godzik A. FFAS03: a server for profile-profile sequence
alignments. Nucleic Acids Res 2005;33:W284–8. [PubMed: 15980471]
8. Havel TF, Snow ME. A new method for building protein conformations from sequence alignments
with homologues of known structure. Journal of Molecular Biology 1991;217:1–7. [PubMed:
1988672]
9. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. Journal of
Molecular Biology 1993;234:779–815. [PubMed: 8254673]
10. Sanchez R, Sali A. Evaluation of comparative protein structure modeling by MODELLER-3. Proteins
1997;Suppl:50–8. [PubMed: 9485495]
11. Li H, et al. Homology modeling using simulated annealing of restrained molecular dynamics and
conformational search calculations with CONGEN: application in predicting the three-dimensional
structure of murine homeodomain Msx-1. Protein Sci 1997;6:956–70. [PubMed: 9144767]
12. Sahasrabudhe PV, Tejero R, Kitao S, Furuichi Y, Montelione GT. Homology modeling of an RNP
17. Najmanovich RJ, Torrance JW, Thornton JM. Prediction of protein function from structure: insights
from methods for the detection of local structural similarities. Biotechniques 2005;38:847, 849, 851.
[PubMed: 16018542]
18. Watson JD, Laskowski RA, Thornton JM. Predicting protein function from sequence and structural
data. Curr Opin Struct Biol 2005;15:275–84. [PubMed: 15963890]
19. Kim SH, et al. Structure-based functional inference in structural genomics. J Struct Funct Genomics
2003;4:129–35. [PubMed: 14649297]
20. Zoller B, Dahlback B. Linkage between inherited resistance to activated protein C and factor V gene
mutation in venous thrombosis. Lancet 1994;343:1536–8. [PubMed: 7911873]
21. Bogaerts V, et al. Genetic variability in the mitochondrial serine protease HTRA2 contributes to risk
for Parkinson disease. Hum Mutat. 2008
22. Couch FJ, Weber BL. Mutations and polymorphisms in the familial early-onset breast cancer
(BRCA1) gene. Breast Cancer Information Core. Hum Mutat 1996;8:8–18. [PubMed: 8807330]
23. Karchin R, et al. LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple
information sources. Bioinformatics 2005;21:2814–20. [PubMed: 15827081]
24. Weber IT, et al. Molecular modeling of the HIV-1 protease and its substrate binding site. Science
1989;243:928–31. [PubMed: 2537531]
25. Weber IT. Evaluation of homology modeling of HIV protease. Proteins 1990;7:172–84. [PubMed:
2158092]
26. Ring CS, et al. Structure-based inhibitor design by using protein models for the development of
antiparasitic agents. Proc. Natl. Acad. Sci. USA 1993;90:3583–7. [PubMed: 8475107]
27. Park H, et al. Discovery of novel alpha-glucosidase inhibitors based on the virtual screening with the
homology-modeled protein structure. Bioorg Med Chem 2008;16:284–92. [PubMed: 17920282]
28. Bower M, Cohen FE, Dunbrack RL. SCWRL: A program for building sidechains onto protein
backbones. University of California; San Francisco: 1997.
http://www.cmpharm.ucsf.edu/~bower/scwrl.html
29. Canutescu AA, Shelenkov AA, Dunbrack RL Jr. A graph-theory algorithm for rapid protein side-
chain prediction. Protein Sci 2003;12:2001–14. [PubMed: 12930999]
30. Wallner B, Elofsson A. All are not equal: a benchmark of different homology modeling programs.
Protein Sci 2005;14:1315–27. [PubMed: 15840834]
31. Canutescu AA, Dunbrack RL Jr. MolIDE: a homology modeling framework you can click with.
Bioinformatics 2005;21:2914–6. [PubMed: 15845657]
32. Qian B, et al. High-resolution structure prediction and the crystallographic phase problem. Nature
2007;450:259–64. [PubMed: 17934447]
33. Dunbrack RL Jr. Ph.D. dissertation. Harvard University; 1993.
34. Dunbrack RL Jr. Cohen FE. Bayesian statistical analysis of protein side-chain rotamer preferences.
Protein Sci 1997;6:1661–81. [PubMed: 9260279]
35. Dunbrack RL Jr. Rotamer libraries in the 21st century. Curr Opin Struct Biol 2002;12:431–40.
[PubMed: 12163064]
36. Lovell SC, Word JM, Richardson JS, Richardson DC. The penultimate rotamer library. Proteins
2000;40:389–408. [PubMed: 10861930]
37. Lovell SC, Word JM, Richardson JS, Richardson DC. Asparagine and glutamine rotamers: B-factor
cutoff and correction of amide flips yield distinct clustering. Proc Natl Acad Sci U S A 1999;96:400–
5. [PubMed: 9892645]
38. Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of database programs. Nucleic
Acids Research 1997;25:3389–3402. [PubMed: 9254694]
39. Wheeler DL, et al. Database resources of the National Center for Biotechnology Information. Nucleic
Acids Res 2008;36:D13–21. [PubMed: 18045790]
40. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. Journal
of Molecular Biology 1999;292:195–202. [PubMed: 10493868]
41. Xu Q, Canutescu A, Obradovic Z, Dunbrack RL Jr. ProtBuD: a database of biological unit structures
of protein families and superfamilies. Bioinformatics 2006;22:2876–82. [PubMed: 17018535]
42. Sauder JM, Arthur JW, Dunbrack RL Jr. Large-scale comparison of protein sequence alignment
algorithms with structure alignments. Proteins 2000;40:6–22. [PubMed: 10813826]
43. Xiang Z, Soto CS, Honig B. Evaluating conformational free energies: the colony energy and its
application to the problem of protein loop prediction. Proc. Natl. Acad. Sci. USA 2002;99:7432–
7437. [PubMed: 12032300]
44. Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT. The DISOPRED server for the prediction
of protein disorder. Bioinformatics 2004;20:2138–9. [PubMed: 15044227]
45. Schwede T, Kopp J, Guex N, Peitsch MC. SWISS-MODEL: An automated protein homology-
modeling server. Nucleic Acids Res 2003;31:3381–5. [PubMed: 12824332]
46. Pieper U, et al. MODBASE: a database of annotated comparative protein structure models and
associated resources. Nucleic Acids Res 2006;34:D291–5. [PubMed: 16381869]
47. Ye Y, Godzik A. FATCAT: a web server for flexible structure comparison and structure similarity
searching. Nucleic Acids Res 2004;32:W582–5. [PubMed: 15215455]
48. MacKerell AD Jr, Bashford D, Bellott M, Dunbrack RL Jr, Evanseck JD, Field MJ, Fischer S, Gao J, Guo
H, Ha S, Joseph-McCarthy D, Kuchnir L, Kuczera K, Lau FTK, Mattos C, Michnick S, Ngo T,
Nguyen DT, Prodhom B, Reiher WE III, Roux B, Schlenkrich M, Smith J, Stote R, Straub J, Watanabe
M, Wiórkiewicz-Kuczera J, Yin D, Karplus M. All-atom empirical potential for molecular modeling
and dynamics studies of proteins. J Phys Chem B 1998;102:3586–3616.
49. Bateman A. The structure of a domain common to archaebacteria and the homocystinuria disease
protein. Trends Biochem Sci 1997;22:12–3. [PubMed: 9020585]
50. Shan X, Dunbrack RL Jr. Christopher SA, Kruger WD. Mutations in the regulatory domain of
cystathionine beta synthase can functionally suppress patient-derived mutations in cis. Hum Mol
Genet 2001;10:635–43. [PubMed: 11230183]
51. Proudfoot M, et al. Biochemical and structural characterization of a novel family of cystathionine
beta-synthase domain proteins fused to a Zn ribbon-like domain. J Mol Biol 2008;375:301–15.
[PubMed: 18021800]
Figure 1.
An excerpt from a PDB file.
Figure 2.
Flowchart for homology modeling with MolIDE. Step numbers to the right of each step
correspond to the protocol described in the text, beginning with Step 5.
Figure 3.
Viewing PSI-BLAST output from the non-redundant sequence database search and secondary
structure predictions with PSIPRED within MolIDE. The target sequence file remains open
(upper left) and the table of PSI-BLAST results from the non-redundant database is shown
(upper right). This table can be sorted by clicking on any of the headers at the top of the table,
once for ascending order (A) and once again for descending order (D). The secondary structure
predictions are shown in the lower window. Predictions are shown in red (helix), green (sheet
strand), and gray (coil). The intensity of the color is in proportion to the confidence values
(from 0 to 9) given by PSIPRED. The region shown is for the C-terminal Bateman
domain49 of the protein cystathionine beta synthase (CBS). Each line contains a secondary
structure prediction from a round of PSI-BLAST. While the longer helices are well predicted,
the beta sheet strands change as more remote homologues are added to the multiple sequence
alignment used by PSI-BLAST (top to bottom). The beta sheet strand around residue 510 is predicted
in the first 3 rounds but fades out at round 4. In fact, the early predictions are more likely to
be accurate in this case than the later ones50.
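The color scheme described in this legend can be sketched in a few lines of Python. This is only an illustration, not MolIDE's rendering code: the function name `ss_color`, the one-letter state codes, the base RGB values, and the linear blend from white are all assumptions. Only the red/green/gray assignment and the 0 to 9 confidence range come from the legend.

```python
# Hypothetical sketch: base RGB values and the linear blend are assumptions.
BASE_COLORS = {
    "H": (255, 0, 0),      # helix -> red
    "E": (0, 160, 0),      # sheet strand -> green
    "C": (128, 128, 128),  # coil -> gray
}


def ss_color(state, confidence):
    """Return an RGB tuple whose intensity scales with a 0-9 confidence value."""
    if state not in BASE_COLORS:
        raise ValueError(f"unknown secondary structure state: {state!r}")
    if not 0 <= confidence <= 9:
        raise ValueError("confidence must be between 0 and 9")
    weight = confidence / 9.0
    # Blend from white (confidence 0) to the full base color (confidence 9).
    return tuple(round(255 + (c - 255) * weight) for c in BASE_COLORS[state])
```

With this mapping, a strand that "fades out" in later PSI-BLAST rounds corresponds to its confidence value, and therefore its color weight, dropping toward zero.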
Figure 4.
Viewing and sorting the list of templates. The hits shown are from a PSI-BLAST search of
pdbaa from profiles built from searching NCBI's nr database. The query is the same as in Figure
3, human CBS. The top window is sorted by hit number (the order in the PSI-BLAST output),
and shows that there is an experimental structure of human CBS that covers residues 1 to 413
of the query (length 551). PDB entry 1JBQ is a longer structure than PDB entry 1M54. The
bottom window is sorted by starting residue in the query (A=ascending order; D=descending
order). The window shows that there are a number of templates for the C-terminal regulatory
Bateman domain. From this window, target-template alignments can be viewed by double-
clicking on the hit number in the first column (“No.”). Some Bateman domains in the PDB are
insertions into inosine monophosphate dehydrogenase (IMPDH). The Bateman domains in
these structures are largely disordered. This is evident once the target-template alignment is
opened up. After this list is reviewed and the candidate templates are examined visually, PDB entry
1PVM51 appears to be the best template for the Bateman domain of CBS. It has only 2 gaps
in the alignment, and is a high-resolution structure (1.9 Å) with no missing residues in the
aligned region.
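The selection criteria named in this legend (few gaps, high resolution, no missing residues in the aligned region) can be expressed as a simple sort key. The sketch below is a rough heuristic under stated assumptions, not a feature of MolIDE: the `Template` class, the `rank_templates` function, the priority order of the criteria, and the two placeholder entries are hypothetical. Only the values for 1PVM are taken from the legend.

```python
from dataclasses import dataclass


@dataclass
class Template:
    pdb_id: str
    gaps: int              # gaps in the target-template alignment
    resolution: float      # in angstroms; lower is better
    missing_residues: int  # unresolved residues in the aligned region


def rank_templates(templates):
    """Sort candidates: complete coordinates first, then fewer gaps, then better resolution.

    The priority order is an assumption made for illustration.
    """
    return sorted(templates, key=lambda t: (t.missing_residues, t.gaps, t.resolution))


candidates = [
    Template("XXXA", gaps=5, resolution=2.4, missing_residues=30),  # hypothetical entry
    Template("1PVM", gaps=2, resolution=1.9, missing_residues=0),   # values from the legend
    Template("XXXB", gaps=2, resolution=2.8, missing_residues=0),   # hypothetical entry
]
best = rank_templates(candidates)[0]
```

A ranking like this can narrow the list, but it does not replace the visual inspection of each target-template alignment described above.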
Figure 5.
Manual editing of a target-template sequence alignment. The unedited alignment is shown at
left, and the partially edited alignment is shown at right. In each image, the alignment shows
the target sequence and its predicted secondary structure above the template sequence and its
experimental structure, with the same color coding as in Figure 3. The original alignment (left)
shows a gap in the middle of an alpha helix. To edit the alignment, a ctrl-click on the gap in
the left figure removes the gap. Shift-click on a position six residues to the right inserts a gap,
and produces the alignment at right. The editing does not reduce the similarity of the aligned
residues, and is more consistent with structural changes in homologous proteins.
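The edit described in this legend, removing a gap and reinserting it a few residues to the right, can be sketched on a gapped sequence string. This is a minimal illustration under stated assumptions, not MolIDE's implementation: the function name `move_gap`, the use of `-` as the gap character, the single-column gap, and the toy sequence are all hypothetical.

```python
def move_gap(aligned, gap_index, shift, gap="-"):
    """Remove the single-column gap at gap_index and reinsert it `shift` residues to the right."""
    if aligned[gap_index] != gap:
        raise ValueError("no gap at the requested position")
    if shift < 0 or gap_index + shift > len(aligned) - 1:
        raise ValueError("shift moves the gap outside the alignment")
    # Equivalent of the ctrl-click: delete the gap character.
    without_gap = aligned[:gap_index] + aligned[gap_index + 1:]
    # Equivalent of the shift-click: insert a gap `shift` residues further right.
    new_index = gap_index + shift
    return without_gap[:new_index] + gap + without_gap[new_index:]


# Toy sequence, not taken from the CBS alignment.
edited = move_gap("ABC-DEFGHIJ", gap_index=3, shift=6)
```

Because one gap is removed and one is added, the row keeps its length. Only the residues between the old and new gap positions change partners in the other sequence, which is why such an edit can move a gap out of a helix without disturbing the rest of the alignment.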
Figure 6.
Image of model of Bateman domain of human cystathionine beta synthase with S-adenosyl
methionine (SAM). Because the template, 1PVM, does not contain SAM or any other
nucleotide, the model of CBS was superimposed on PDB entry 2YZQ (Kanagawa et al.,
unpublished), which does contain SAM. The side chains of the model were then remodeled as
described in Step 14. Side chains near SAM are shown in stick representation. SAM is colored
magenta. The backbone ribbon is colored from blue to red from N to C terminus. The image
was produced with PyMOL (W. L. DeLano, www.pymol.org).