
Pattern Recognition 1

This document provides an overview of pattern recognition methods in sequence analysis, specifically focusing on sequence motifs and position specific scoring matrices (PSSMs). It defines protein families and sequence motifs. It also describes how PSSMs are constructed from multiple sequence alignments and how they can be used to score sequences to distinguish between binding and non-binding sets. Finally, it introduces the PSI-BLAST algorithm which combines PSSMs with BLAST to iteratively improve the scoring matrix and increase sensitivity of homolog detection.


Paper No.: 06 Computational Biology

Module: 05 Pattern Recognition Methods in Sequence Analysis-I

Principal Investigator: Dr. Vibha Dhawan, Distinguished Fellow and Sr. Director
The Energy and Resources Institute (TERI), New Delhi

Co-Principal Investigator: Prof S K Jain, Professor,
Jamia Hamdard University, New Delhi

Paper Coordinator: Dr. Indira Ghosh, Professor
Jawaharlal Nehru University, New Delhi

Content Writer: Dr. Devapriya Choudhury, Associate Professor
Jawaharlal Nehru University, New Delhi

Paper Reviewer: Dr. Debasisa Mohanty
National Institute of Immunology, New Delhi

Computational Biology
Biotechnology
Pattern Recognition Methods in Sequence Analysis-I
Description of Module
Subject Name: Biotechnology
Paper Name: Computational Biology
Module Name/Title: Pattern Recognition Methods in Sequence Analysis-I
Module Id: 05
Pre-requisites:
Objectives:
Keywords:

Pattern Recognition Methods in Sequence Analysis-I

A protein family refers to a set of proteins that share some definite biological property in terms of
their sequence, structure or function. Members of a protein family are generally homologous to
each other, i.e., they share a common evolutionary ancestor. Proteins with very similar structures or
functions often share similar, but not necessarily the same, substrings within their primary
structures. Such substrings can often be represented by using a common expression known as a
regular expression, sequence pattern or sequence motif. Table 1 shows an example of four such
substrings from members of a protein family, which can be represented by a single regular
expression or sequence motif. In the motif, square brackets list the residues allowed at a position
and x stands for any residue.

AVMPTTHILL
VVQPSRHVIV
VACPSWHLVI
AANPTNHVVV
[AV][AV]xP[ST]xH[LVI][LVI][LVI]

Table 1. Four sequences and their associated sequence motif
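
The motif of Table 1 can be checked mechanically. The following is a minimal sketch, assuming Python's standard `re` module and assuming that x (any residue) is written as `.` in regular-expression syntax. The last sequence tested is a hypothetical counter-example and is not part of Table 1.

```python
import re

# Motif from Table 1; "x" (any residue) is written as "." in regex syntax.
MOTIF = re.compile(r"[AV][AV].P[ST].H[LVI][LVI][LVI]")

# The four sequences listed in Table 1.
sequences = ["AVMPTTHILL", "VVQPSRHVIV", "VACPSWHLVI", "AANPTNHVVV"]

# fullmatch() requires the motif to cover the whole 10-residue string.
matches = [MOTIF.fullmatch(seq) is not None for seq in sequences]
print(matches)

# Hypothetical counter-example: the conserved P at position 4 is replaced by G.
broken = MOTIF.fullmatch("AVMGTTHILL") is not None
print(broken)
```

To locate the motif inside a longer protein sequence, `MOTIF.search()` or `MOTIF.finditer()` would be used instead of `fullmatch()`.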

Another example of a sequence motif is C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H. Any
sequence that matches this pattern must begin with C, followed by between 2 and 4 arbitrary
characters and then another C. This is followed by exactly 3 arbitrary characters and then a
character from the set “LIVMFYWC”. This is again followed by exactly 8 arbitrary characters and then
an H, followed by 3 to 5 arbitrary characters and another H. This pattern is shown by
most classical zinc-finger domain containing proteins. The two cysteines and the two histidines,
which are completely conserved, are responsible for coordinating the zinc ion. A very early
database in the field of Bioinformatics was PROSITE [i], which maintained a collection of sequence
motifs associated with functional domains in proteins. Seed patterns were collected from known
members of a protein family and used to search the SWISSPROT database for additional
members of the same family. Based on the functional annotation of the new hits, the motif was further
improved, and the process was repeated iteratively until a stable motif was found. Information regarding
false negative and false positive hits was also collected, so that the user may have an idea of the
sensitivity and specificity of the association between the pattern and the functional protein domain.
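
The pattern notation described above maps directly onto ordinary regular expressions. The following is a minimal sketch, not PROSITE's own software: the function name `prosite_to_regex` is hypothetical, and it handles only the elements used in this module (single residues, x, bracketed residue sets and repeat counts). The two test sequences are made up for illustration.

```python
import re

# One pattern element: a residue, x, or a [..] set, with an optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = m.groups()
        regex = "." if symbol == "x" else symbol   # x means any residue
        if low is not None:                        # repeat count (n) or (n,m)
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)

ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
zinc_regex = prosite_to_regex(ZINC_FINGER)
print(zinc_regex)

# Hypothetical sequences: the first satisfies the pattern, the second lacks
# the final conserved H.
hit = re.search(zinc_regex, "CAACAAALAAAAAAAAHAAAH") is not None
miss = re.search(zinc_regex, "CAACAAALAAAAAAAAHAAAA") is not None
print(hit, miss)
```

The sketch ignores other elements of the PROSITE syntax, such as excluded residues written in braces and the anchors for the sequence termini.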

Although sequence motifs are often very useful in identifying protein domains with
specific functions, many protein families are not sufficiently conserved in
sequence to display easily identifiable sequence motifs. However, the concept of a
sequence motif can be generalized to a position specific weight matrix (PSWM) [ii]. The elements of a
PSWM are derived from the relative frequencies of amino acids (or nucleic acid bases) at each position
of an ungapped sequence alignment. Weight matrices work well in those cases where sequences are not
conserved enough to produce easily identifiable motifs; however, compared to sequence motifs, they
suffer the limitation of having a fixed length.

Given an alignment $[n_{\alpha i}]$, a frequency matrix $[f_{\alpha i}]$ can be constructed as follows:

$$f_{\alpha i} = \frac{n_{\alpha i} + 1}{\sum_{\alpha} (n_{\alpha i} + 1)}$$

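where $n_{\alpha i}$ is the number of instances of the amino acid (or base) $\alpha$ in the $i$th
position of the alignment. The 1 added to each count acts as a pseudocount, so that a character never
observed in a column still receives a non-zero frequency.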

The elements of the PSWM are now given by:

$$w_{\alpha i} = \log \frac{f_{\alpha i}}{f_{\alpha o}}$$

where $f_{\alpha o}$ is the background frequency of the character $\alpha$. For example, in the
E. coli genome, whose GC content is close to 50%, $f_A = f_T \approx 0.25$.
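
The two formulas above can be turned into a short program. The following is a minimal sketch under stated assumptions: the function names `frequency_matrix` and `weight_matrix` are hypothetical, the background is taken to be uniform over the 20 amino acids, and the natural logarithm is used, since the text does not fix the base. The alignment is the set of sequences from Table 1.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequency_matrix(alignment, alphabet=AMINO_ACIDS):
    """One dict per column: f = (n + 1) / sum over the alphabet of (n + 1)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("an ungapped alignment needs rows of equal length")
    matrix = []
    for i in range(length):
        counts = {a: 1 for a in alphabet}      # the "+1" in the formula
        for row in alignment:
            counts[row[i]] += 1                # n for this character and column
        total = sum(counts.values())           # denominator of the formula
        matrix.append({a: counts[a] / total for a in alphabet})
    return matrix

def weight_matrix(alignment, background=None, alphabet=AMINO_ACIDS):
    """One dict per column: w = log(f / background frequency)."""
    if background is None:
        # Assumption: uniform background; real work would use measured values.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    return [
        {a: math.log(column[a] / background[a]) for a in alphabet}
        for column in frequency_matrix(alignment, alphabet)
    ]

table_1 = ["AVMPTTHILL", "VVQPSRHVIV", "VACPSWHLVI", "AANPTNHVVV"]
freqs = frequency_matrix(table_1)
weights = weight_matrix(table_1)

# Column 4 (index 3) is the fully conserved proline.
print(freqs[3]["P"], weights[3]["P"])
print(freqs[3]["G"], weights[3]["G"])
```

A conserved residue such as the proline receives a positive weight, while a residue never seen in that column receives a negative one.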

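Given a sequence $S_L = \{s_1, s_2, \ldots, s_L\}$ of length equal to the number of columns in a PSWM,
the score of the sequence is given by:

$$S = \sum_{i=1}^{L} w_{s_i i}$$

where $w_{s_i i}$ is the PSWM element for the character $s_i$ at position $i$.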

When a PSWM is used to score all windows of length L in a sequence of length n > L, the histogram of the
scores approximates a Gaussian distribution.
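
Scoring every window of a longer sequence can be sketched as follows. This is a minimal illustration, assuming a DNA alphabet with a uniform background and the natural logarithm. The function names, the four binding sites and the target sequence are all hypothetical.

```python
import math

def build_pswm(sites, alphabet="ACGT"):
    """PSWM with +1 pseudocounts and a uniform background (an assumption)."""
    background = 1.0 / len(alphabet)
    total = len(sites) + len(alphabet)
    matrix = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        matrix.append({
            a: math.log(((column.count(a) + 1) / total) / background)
            for a in alphabet
        })
    return matrix

def score(window, matrix):
    """Sum of the weights of the characters of the window, position by position."""
    return sum(matrix[i][ch] for i, ch in enumerate(window))

def window_scores(sequence, matrix):
    """Scores of all n - L + 1 windows of length L in the sequence."""
    length = len(matrix)
    return [
        score(sequence[k:k + length], matrix)
        for k in range(len(sequence) - length + 1)
    ]

sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]   # hypothetical binding sites
matrix = build_pswm(sites)
sequence = "GGCGTATAATGCCGA"                       # hypothetical target sequence
scores = window_scores(sequence, matrix)

# Report the position and content of the best-scoring window.
best = max(range(len(scores)), key=scores.__getitem__)
print(len(scores), best, sequence[best:best + len(matrix)])
```

The example sequence is far too short to show the shape of the score distribution; a histogram of `scores` for a long sequence would be needed for that.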

As an application of PSWMs let us consider a set of sequences all of which are known to bind with a
particular target. As control, let us take another set of sequences that are known to not bind with
the same target. We then calculate the scores of all L-length substrings from both sequence sets
using a PSWM with L columns trained on the sequences that bind with the target.

Figure 1. Histograms of scores obtained from sequences that are known to bind to a target (blue,
filled) and those that are known not to bind to a target (black, open).

Histograms of scores drawn from the two sets are likely to look similar to those shown in Figure 1. It is
clear from the figure that scores drawn from the sequences in the “binding” set are generally higher
than the scores drawn from the “non-binding” set, as would be expected if the sequences in the
“binding” set are somehow different from those in the “non-binding” set. One can set a threshold score
(indicated by dashed lines in the figure) to distinguish between the two sets. However, the
two histograms overlap, which means that some sequences from the “binding” set lie
to the left of the threshold and would be classified as “non-binding”. These are the “false negatives”;
the more false negatives a prediction algorithm returns, the lower its sensitivity. Similarly,
some sequences from the “non-binding” set may have scores greater than the threshold; such
sequences are classified as “false positives”. The greater the number of false positives, the lower the
specificity of the prediction method. One can adjust the threshold value to eliminate
either the false negatives or the false positives, but not both.
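
The trade-off between the two kinds of error can be made concrete with a small calculation. The following is a minimal sketch using the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The scores are invented for illustration, and treating a score equal to the threshold as “binding” is an assumption.

```python
def sensitivity_specificity(binding_scores, nonbinding_scores, threshold):
    """Classify a score >= threshold as "binding" and count the outcomes."""
    true_pos = sum(s >= threshold for s in binding_scores)
    false_neg = len(binding_scores) - true_pos       # binders scored too low
    false_pos = sum(s >= threshold for s in nonbinding_scores)
    true_neg = len(nonbinding_scores) - false_pos    # non-binders scored low
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity

# Hypothetical scores with overlapping ranges, as in Figure 1.
binding = [2.1, 3.4, 4.0, 5.2, 6.3]
nonbinding = [-1.0, 0.5, 1.8, 2.6, 3.0]

results = {
    t: sensitivity_specificity(binding, nonbinding, t) for t in (1.0, 2.8, 4.5)
}
for threshold, (sens, spec) in results.items():
    print(threshold, sens, spec)
```

With these numbers, the lowest threshold removes all false negatives at the cost of several false positives, the highest threshold does the reverse, and no threshold achieves both because the two score ranges overlap.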

An elegant algorithm has been developed that combines some of the advantages of PSWMs with
alignment based sequence search strategies like BLAST. This algorithm is known as PSI-BLAST
(Position Specific Iterated BLAST). As has been discussed in an earlier module, the BLAST algorithm
finds sequence neighbors based on significant local alignments between a query sequence and a
database of target sequences. The significance of the alignments is determined by a scoring matrix
which is pre-decided in the traditional BLAST algorithm and remains unchanged while the algorithm
is being run. In PSI-BLAST, one begins with a traditional BLAST run, and a
position specific weight matrix is built from the output alignments. A second iteration of BLAST is
then run with exactly the same query and database, but with the difference that the traditional scoring
matrix is replaced with the PSWM built in the previous iteration. The process is then repeated: in each
iteration, the output alignments are used to build a PSWM, which then becomes the scoring matrix for
the next run. The process is repeated until the PSWM tends to converge, which usually happens within
2-3 iterations. The main advantage of PSI-BLAST over traditional BLAST is that the scoring matrix is
now tailored to the specific characteristics of the family to which the query belongs. This generally
improves the sensitivity of homologue detection significantly over traditional BLAST.
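
The iterative idea can be illustrated with a toy program. This sketch is not BLAST and not the PSI-BLAST implementation: it scores ungapped, fixed-length windows, builds even the first matrix from the query alone rather than from a standard substitution matrix, and stops when the set of hits no longer changes. The function names, the DNA sequences and the threshold of 2.0 are all hypothetical.

```python
import math

ALPHABET = "ACGT"

def build_pswm(hits):
    """PSWM with +1 pseudocounts and a uniform background (an assumption)."""
    total = len(hits) + len(ALPHABET)
    background = 1.0 / len(ALPHABET)
    return [
        {a: math.log(((sum(h[i] == a for h in hits) + 1) / total) / background)
         for a in ALPHABET}
        for i in range(len(hits[0]))
    ]

def score(window, matrix):
    return sum(matrix[i][ch] for i, ch in enumerate(window))

def best_window(sequence, matrix):
    """Highest-scoring window of the matrix length within the sequence."""
    length = len(matrix)
    windows = [sequence[k:k + length]
               for k in range(len(sequence) - length + 1)]
    return max(windows, key=lambda w: score(w, matrix))

def iterate_search(query, database, threshold, max_rounds=5):
    """Rebuild the matrix from the current hits until the hit list is stable."""
    hits = [query]
    for round_number in range(1, max_rounds + 1):
        matrix = build_pswm(hits)
        new_hits = [query]
        for sequence in database:
            window = best_window(sequence, matrix)
            if score(window, matrix) >= threshold:
                new_hits.append(window)
        if new_hits == hits:       # nothing changed, so the matrix has converged
            break
        hits = new_hits
    return hits, round_number

database = ["GGTATAATCC", "CCTATGATGG", "GGTACGATCC", "GCGCGCGCGC"]
single_round, _ = iterate_search("TATAAT", database, 2.0, max_rounds=1)
converged, rounds = iterate_search("TATAAT", database, 2.0)
print(single_round)
print(converged, rounds)
```

In this toy example the more distant window TACGAT falls below the threshold in the first round and is picked up only after the matrix has been rebuilt from the first-round hits, which mirrors the gain in sensitivity described above.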

[i] Now part of the ExPASy research tools portal. https://prosite.expasy.org/
[ii] Also known as a Position Specific Scoring Matrix (PSSM).
