Pattern Recognition 1
: 06 Computational Biology
Principal Investigator: Dr. Vibha Dhawan, Distinguished Fellow and Sr. Director
The Energy and Resources Institute (TERI), New Delhi
Computational Biology
Biotechnology
Pattern Recognition Methods in Sequence Analysis-I
Description of Module
Subject Name Biotechnology
Module Id 05
Pre-requisites
Objectives
Keywords
Pattern Recognition Methods in Sequence Analysis-I
A protein family refers to a set of proteins that share some definite biological property in terms of
their sequence, structure or function. Members of a protein family are generally homologous to
each other, i.e., they share a common evolutionary ancestor. Proteins with very similar structures or
functions often share similar, but not necessarily the same, substrings within their primary
structures. Such substrings can often be represented by using a common expression known as a
regular expression, sequence pattern or sequence motif. Table 1 shows an example of four such
substrings from members of a protein family, all of which can be represented by a single regular
expression or sequence motif.
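Table 1. Four substrings from members of a protein family and the sequence motif that represents
them (x denotes any residue).
Substring 1: AVMPTTHILL
Substring 2: VVQPSRHVIV
Substring 3: VACPSWHLVI
Substring 4: AANPTNHVVV
Sequence motif: [AV][AV]xP[ST]xH[LVI][LVI][LVI]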
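The motif can be checked against the four substrings with an ordinary regular-expression engine. The following is a minimal sketch in Python; the only translation needed is that the wildcard x is written as "." in Python's syntax. The longer sequence at the end is a hypothetical example, not data from this module.

```python
import re

# The motif from Table 1, with the wildcard 'x' written as '.' (any character).
MOTIF = re.compile(r"[AV][AV].P[ST].H[LVI][LVI][LVI]")

substrings = ["AVMPTTHILL", "VVQPSRHVIV", "VACPSWHLVI", "AANPTNHVVV"]

# fullmatch() requires the whole 10-residue substring to fit the motif.
matches = [bool(MOTIF.fullmatch(s)) for s in substrings]
print(matches)

# Hypothetical longer sequence: search() finds the motif anywhere inside it.
hypothetical = "MKKLAVMPTTHILLGG"
hit = MOTIF.search(hypothetical)
print(hit.start(), hit.group())
```

A single expression therefore stands in for all four family members, and the same compiled pattern can be used to scan longer sequences for candidate sites.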
Although the presence of sequence motifs is often very useful in identifying protein domains with
specific functions, many protein families are not sufficiently conserved in sequence to display
easily identifiable sequence motifs. However, the concept of a sequence motif can be generalized
to a position specific weight matrix (PSWM; see note ii). The elements of a PSWM contain the
relative frequencies of amino acids (or nucleic acid bases) in an ungapped sequence alignment.
Weight matrices work well in those cases where sequences are not conserved enough to produce
easily identifiable motifs; however, they suffer the limitation of having a fixed length. The
relative frequency of character α at position i of the alignment is estimated as:
$$f_{\alpha i} = \frac{n_{\alpha i} + 1}{\sum_{\alpha'} (n_{\alpha' i} + 1)}$$
where $n_{\alpha i}$ is the number of instances of the amino acid (or base) $\alpha$ in the $i$th
position of the alignment. Adding 1 to each count (a pseudocount) keeps every frequency non-zero.
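As an illustration, the frequency estimate can be computed column by column from an ungapped alignment. The sketch below uses the four substrings of Table 1 and assumes the standard 20-letter amino acid alphabet; the function name is chosen here for illustration only.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: the 20 standard residues

def column_frequencies(alignment, alphabet=AMINO_ACIDS):
    """Return one dict per column mapping each character a to (n + 1) / sum(n + 1)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("the alignment must be ungapped and of equal length")
    frequencies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)      # n for column i
        total = sum(counts[a] + 1 for a in alphabet)       # denominator
        frequencies.append({a: (counts[a] + 1) / total for a in alphabet})
    return frequencies

alignment = ["AVMPTTHILL", "VVQPSRHVIV", "VACPSWHLVI", "AANPTNHVVV"]
freqs = column_frequencies(alignment)

# P occupies the fourth column in all four sequences: (4 + 1) / (4 + 20).
print(freqs[3]["P"])
# A never occurs in that column, yet its frequency stays non-zero: 1 / 24.
print(freqs[3]["A"])
```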
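The weight assigned to character $\alpha$ at position $i$ is then the log-odds ratio

$$w_{\alpha i} = \log \frac{f_{\alpha i}}{f_{\alpha o}}$$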
where $f_{\alpha o}$ is the background frequency of the character $\alpha$. For example, in the
E. coli genome, $f_A = f_T = 0.3$.
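The two formulas combine into a short routine that turns an ungapped alignment into a weight matrix. This is a sketch under stated assumptions: the aligned DNA sites are hypothetical, the background values for C and G (0.2 each) are chosen here only so that the background sums to 1 alongside fA = fT = 0.3, and the natural logarithm is used because the text does not fix a base.

```python
import math

# fA and fT follow the example in the text; fC and fG are an assumption.
BACKGROUND = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}

def build_pswm(alignment, background=BACKGROUND):
    """Return a list of columns; each column maps a character to its log-odds weight."""
    alphabet = sorted(background)
    length = len(alignment[0])
    matrix = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        total = len(column) + len(alphabet)            # sum over characters of (n + 1)
        matrix.append({
            a: math.log(((column.count(a) + 1) / total) / background[a])
            for a in alphabet
        })
    return matrix

sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]       # hypothetical aligned sites
weights = build_pswm(sites)

# T fills the first column, so its weight is positive; G is absent, so negative.
print(weights[0]["T"], weights[0]["G"])
```

Characters that are enriched relative to the background receive positive weights and depleted characters receive negative ones, which is what makes the summed score informative.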
Given a sequence $S_L = \{s_1, s_2, \ldots, s_L\}$ of length equal to the number of columns in a
PSWM, the score of the sequence is given by:

$$S = \sum_{i=1}^{L} w_{s_i i}$$
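In code, the score is a table lookup per position followed by a sum. The sketch below uses a small hand-written weight matrix whose values are purely illustrative, not derived from any real alignment.

```python
# Hypothetical 3-column PSWM over DNA; WEIGHTS[i][a] is the weight of character a at position i.
WEIGHTS = [
    {"A": 1.2, "C": -0.8, "G": -0.8, "T": -0.5},
    {"A": -0.5, "C": -0.8, "G": -0.8, "T": 1.2},
    {"A": 0.9, "C": -0.8, "G": 0.1, "T": -0.5},
]

def score(sequence, weights):
    """Sum the weight of the observed character at each position."""
    if len(sequence) != len(weights):
        raise ValueError("sequence length must equal the number of PSWM columns")
    return sum(column[ch] for column, ch in zip(weights, sequence))

print(score("ATA", WEIGHTS))   # the best-scoring sequence for this matrix
print(score("CCC", WEIGHTS))   # a poorly matching sequence
```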
When a PSWM is used to score all windows of length L in a sequence of length n > L, the histogram
of the scores approximates a Gaussian distribution.
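Scanning a sequence means scoring each of its n - L + 1 windows in turn. The following sketch does this for a randomly generated DNA sequence and prints a crude text histogram of the scores; the weight matrix, the random sequence and the seed are all hypothetical choices made for illustration.

```python
import random
from collections import Counter

# Hypothetical 4-column PSWM over DNA (illustrative weights only).
WEIGHTS = [
    {"A": 1.0, "C": -1.0, "G": -0.5, "T": 0.0},
    {"A": -0.5, "C": 0.5, "G": -1.0, "T": 1.0},
    {"A": 0.5, "C": -0.5, "G": 1.0, "T": -1.0},
    {"A": 0.0, "C": 1.0, "G": -1.0, "T": -0.5},
]

def window_scores(sequence, weights):
    """Score every window of length L = len(weights) in the sequence."""
    L = len(weights)
    return [
        sum(weights[i][sequence[start + i]] for i in range(L))
        for start in range(len(sequence) - L + 1)
    ]

random.seed(1)  # fixed seed so that the sketch is repeatable
sequence = "".join(random.choice("ACGT") for _ in range(2000))
scores = window_scores(sequence, WEIGHTS)

# Crude text histogram: one row per distinct score, bar length proportional to count.
histogram = Counter(scores)
for value in sorted(histogram):
    print(f"{value:5.1f} {'#' * (histogram[value] // 10)}")
```

With a matrix this short the histogram is only roughly bell-shaped; the approximation would be expected to improve as the number of columns grows.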
As an application of PSWMs, let us consider a set of sequences, all of which are known to bind to a
particular target. As a control, let us take another set of sequences that are known not to bind to
the same target. We then calculate the scores of all L-length substrings from both sequence sets
using a PSWM with L columns trained on the sequences that bind to the target.
Figure 1. Histograms of scores obtained from sequences that are known to bind to a target (blue,
filled) and those that are known not to bind to a target (black, open).
Histograms of scores drawn from the two sets are likely to look similar to those shown in Figure 1.
It is clear from the figure that scores drawn from the sequences in the “binding” set are generally
higher than the scores drawn from the “non-binding” set, as would be expected if the sequences in
the “binding” set are somehow different from those in the “non-binding” set. One can set a
threshold score (indicated by dashed lines in the figure) to distinguish between the two sets.
However, the two histograms overlap; this means that some sequences from the “binding” set lie
to the left of the threshold and would be classified as “non-binding”. These are the “false
negatives”; the more false negatives a prediction algorithm returns, the lower its sensitivity.
Similarly, one may find some sequences from the “non-binding” set that have scores greater than the
threshold; such sequences are classified as false positives. The greater the number of false
positives, the lower the specificity of the prediction method. One can adjust the threshold value to
eliminate either the false negatives or the false positives but, because the histograms overlap, one
cannot eliminate both.
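The trade-off can be made concrete by counting the four outcomes at different thresholds. The sketch below uses the usual definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the two score lists are hypothetical and stand in for the histograms of Figure 1.

```python
def sensitivity_specificity(binding_scores, nonbinding_scores, threshold):
    """Classify scores above the threshold as 'binding' and count the outcomes."""
    tp = sum(s > threshold for s in binding_scores)        # binders correctly called
    fn = len(binding_scores) - tp                          # binders missed
    fp = sum(s > threshold for s in nonbinding_scores)     # non-binders wrongly called
    tn = len(nonbinding_scores) - fp                       # non-binders correctly rejected
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical, overlapping score sets.
binding = [4.1, 3.2, 5.0, 2.4, 3.8, 1.9]
nonbinding = [0.5, 1.1, 2.6, -0.3, 1.7, 0.9]

results = {}
for threshold in (1.5, 2.0, 3.0):
    results[threshold] = sensitivity_specificity(binding, nonbinding, threshold)
    sens, spec = results[threshold]
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With these numbers the lowest threshold removes all false negatives at the cost of two false positives, the highest removes all false positives at the cost of two false negatives, and the middle value leaves one of each.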
An elegant algorithm has been developed that combines some of the advantages of PSWMs with
alignment-based sequence search strategies like BLAST. This algorithm is known as PSI-BLAST
(Position Specific Iterated BLAST). As has been discussed in an earlier module, the BLAST algorithm
finds sequence neighbors based on significant local alignments between a query sequence and a
database of target sequences. The significance of the alignments is determined by a scoring matrix
which is pre-decided in the traditional BLAST algorithm and remains unchanged while the algorithm
is being run. In PSI-BLAST, one begins with a traditional BLAST run, and a position specific weight
matrix is built from the output alignments. A second iteration of BLAST is then run with exactly
the same query and database, but with the difference that the traditional scoring matrix is replaced
with the PSWM built in the previous iteration. The process is then repeated: in each iteration,
the output alignments are used to build a PSWM, which then becomes the scoring matrix for the
next run. The process is repeated until the PSWM tends to converge, which usually happens within
2-3 iterations. The main advantage of PSI-BLAST over traditional BLAST is that the scoring matrix is
tailored to the specific characteristics of the family to which the query belongs. This generally
improves the sensitivity of homologue detection significantly over traditional BLAST.
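The iterate-and-rebuild loop can be illustrated with a toy sketch. It is not PSI-BLAST: the BLAST search is replaced here by ungapped PSWM scoring of equal-length DNA strings, convergence is taken to mean that the list of hits stops changing, and the query, database, uniform background and threshold are all hypothetical.

```python
import math

ALPHABET = "ACGT"

def build_pswm(members, background=0.25):
    """Log-odds PSWM with a pseudocount of 1, built from equal-length sequences."""
    total = len(members) + len(ALPHABET)
    return [
        {a: math.log(((sum(m[i] == a for m in members) + 1) / total) / background)
         for a in ALPHABET}
        for i in range(len(members[0]))
    ]

def score(sequence, pswm):
    return sum(column[ch] for column, ch in zip(pswm, sequence))

def iterated_search(query, database, threshold, max_iterations=10):
    """Rebuild the PSWM from the current hits until the hit list stops changing."""
    hits = []
    for iteration in range(1, max_iterations + 1):
        pswm = build_pswm([query] + hits)          # first pass: the query alone
        new_hits = [s for s in database if score(s, pswm) > threshold]
        if new_hits == hits:                       # same hits, hence same PSWM
            return hits, iteration
        hits = new_hits
    return hits, max_iterations

QUERY = "TATAAT"
DATABASE = ["TATGAT", "TACGAT", "GGCCGC", "CCGGCG"]   # hypothetical sequences
THRESHOLD = 1.8                                        # hypothetical cut-off

single_pass, _ = iterated_search(QUERY, DATABASE, THRESHOLD, max_iterations=1)
hits, passes = iterated_search(QUERY, DATABASE, THRESHOLD)
print(single_pass, hits, passes)
```

With these inputs a single pass reports only TATGAT, whereas the iterated search also recovers the more distant TACGAT once TATGAT has been folded into the matrix, and the hit list stops changing at the third pass.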
i
Now part of the ExPasy research tools portal. https://prosite.expasy.org/
ii
Also known as a Position Specific Scoring Matrix (PSSM).