7.91 – Lecture #2
More Pairwise Sequence Comparisons
ARDFSHGLLENKLLGCDSMRWE
.::. .:::. .:::: :::.
GRDYKMALLEQWILGCD-MRWD
- and –
Multiple Sequence Alignment
ARDFSHGLLENKLLGCDSMRWE
.::. .:::. .:::: :::.
GRDYKMALLEQWILGCD-MRWD
.::. ::.: .. :. .:::
SRDW--ALIEDCMV-CNFFRWD
Reading:
This lecture: Mount pp. 8-9, 65-89, 96-115, 140-155, 161-170
Michael Yaffe
Outline
? Recursion and dynamic programming
? Applied dynamic programming: global
alignments: Needleman-Wunsch
? Applied dynamic programming: local alignments
– Smith-Waterman
? Substitution matrices: PAM, BLOSUM, Gonnet
? Gaps - linear and affine
? Alignment statistics
? What you need to know to optimize an
alignment
Outline (cont)
? Multiple sequence alignments: MSA, Clustal
? Block analysis
? Position-Specific Scoring Matrices (PSSM)
Examples
O(n
k
) is “polynomial time” as long as
K<3 …..tractable
Consider our un-gapped dot matrix
Global alignment:
1 n
1
12345678….
12345678….
12345678….
12345678….
12345678….
m
12345678….
….essentially an O(mn) problem
O.K. Examples
O(n) better than O(n log(n)), better than
O(n
2
), better than O(n
3
)
Terrible Examples
O(k
n
) = exponential time….horrible!!!!
NP problems- no known polynomial time
Solutions = non-deterministic polynomial
Problems.
Recursion and Dynamic
Programming
Aligning two protein sequences without gaps – roughly an O(mn) problem.
With gaps – becomes computationally astronomical, and cannot be done
by direct comparison methods. (= 2
2L
/√(2πL); L=sequence length)
Alternative is to compare all possible pairs of characters (matches and
mismatches, and also take gaps into account as well, while keeping
the number of comparisons manageable. The approach is called
dynamic programming. Mathematically proven to produce optimal alignment
Need a substitution or similarity matrix and some way to account for gaps.
Example of how to score an alignment: Write down two sequences:
sequence#1 V D S – C Y
sequence#2 V E S L C Y
Score from sub. Matrix 4 2 4 -11 9 7
Score = Σ(AA pair scores) – gap penalty = 15
C 9
S
T
5
P
A
0 1 0
G 6
N 6
D 6
E 2 5
Q
0 2 5
H 0 0 8
R
0 1 0 5
K 1 5
M 0
I
4
L 2 4
V
3 1 4
F 0
Y
2 7
W
CSTPAGNDEQHRKMILVFYW
BLOSUM 62 Scoring Matrix
-1 4
-1 1
-3 -1 -1 7
-1 4
-3 0 -2 -2 0
-3 1 0 -2 -2 0
-3 0 -1-1-2-1 1
-4 0 -1-1-1-2 0
-3 0 -1 -1 -1 -2 0
-3 -1 -2 -2 -2 -2 1 -1
-3 -1 -1 -2 -1 -2 0 -2
-3 0 -1-1-1-2 0 -1 1 -1 2
-1 -1 -1 -2 -1 -3 -2 -3 -2 -2 -1 -1 5
-1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1
-1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2
-1 -2 0 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1
-2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 -1 6
-2 -2 -2 -3 -2 -3 -2 -3 -2 -1 -2 -2 -1 -1 -1 -1 3
-2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11
Scoring system should: favor matching identical or related amino acids
Penalize for poor matches and for gaps.
To get a good scoring system need to know: how often a particular amino acid
Pair is found in related proteins compared with its occurence by chance. This
Is the information contained in the substitution matrix
…..….and when a gap would be a better choice
Deriving realistic substitution matrices:
First need to know frequency of one amino acid substituting for another
In related proteins [=P(ab)] c/w the chance that substituting one for the other
occurred by chance, based on the relative frequencies of each amino acid
in proteins, q(a) and q(b). Call this the “odds ratio”: P(ab)/q(a)q(b)
If we do this for all positions in an alignment, then the total probablilty will
be the product of the odds ratios at each position….but multiplication is
computationally expensive….so….take the log (odds ratio) and add them instead.
Matrices like PAM and BLOSUM matrices are derived from these log odds ratios
And contain positive and negative numbers reflecting likelihood of amino
Acid substitutions in related proteins.
To do Dynamic Programming:
First write one sequence across the top, and one down along the side
Gap V D S C Y
Gap
0
1 gap
2 gaps
V 1 gap
E 2 gaps
S
L
C
Y
Note – linear gap penalty: γ(n)=nA, where A=gap penalty
1
2
3
4
5
6
To do Dynamic Programming:
First write one sequence across the top, and one down along the side
i =0 1 2 3 4 5
j =
Gap V D S C Y
0 Gap
0
s
ij
-8 -16 -24 -32 -40
V
-8
E
-16 So scoring Sij requires that we know
S(i-1, j-1) and S(i, j-1) and S(i-1, j)…
S
-24
Therefore recursive. We use the solutions
Of smaller problems to solve larger ones.
L -32 AND we store how we got to the Sij score,
i.e. the intermediate solutions in a tabular
C -40 matrix. Computer scientists call this dynamic
programming, where “programming” means
Y
-48
the matrix, not some kind of computer code.
1
2
3
4
5
6
To do Dynamic Programming:
First write one sequence across the top, and one down along the side
i =0 1 2 3 4 5
j =
Gap V D S C Y
0 Gap
0
s
ij
-8 -16 -24 -32 -40
V
-8
E
-16
Global alignments: Needleman-Wunsch-Sellers
S
-24
O(n
2
) using linear gap penalty
L -32 S
ij
= max of: S
i-1, j-1
+ σ(x
i
, y
j
) (diagonal)
C -40
S
i-1, j
–A (from left to right)
Y
-48
S
i, j-1
–A (from top to bottom)
1
2
3
4
5
6
To do Dynamic Programming:
First write one sequence across the top, and one down along the side
i =0 1 2 3 4 5
Gap V
0
-8
s
ij
4
4
-8
D S C Y
j =
0 Gap
-16 -24 -32 -40
-8
V
-8
E
-16
Global alignments: Needleman-Wunsch-Sellers
S
-24
O(n
2
) using linear gap penalty
L -32 S
ij
= max of: S
i-1, j-1
+ σ(x
i
, y
j
) (diagonal)
C -40
S
i-1, j
–A (from left to right)
Y
-48
S
i, j-1
–A (from top to bottom)
1
2
3
4
5
6
To do Dynamic Programming:
First write one sequence across the top, and one down along the side
i =0 1 2 3 4 5
Gap V
0
-8
4
4
4
-8
-8
D S C Y
j =
0 Gap
-16 -24 -32 -40
-8
V
E
-16
S
-24
L -32
C -40
Y
-48
1
2
3
4
5
6
To do Dynamic Programming:
First write one sequence across the top, and one down along the side
i =0 1 2 3 4 5
j =
Gap V D S C Y
0 Gap
0
-8
4
4
-3
-8
-16 -24 -32 -40
-8
V
-8
s
ij
E
-16
Global alignments: Needleman-Wunsch-Sellers
S
-24
O(n
2
) using linear gap penalty
L -32 S
ij
= max of: S
i-1, j-1
+ σ(x
i
, y
j
) (diagonal)
C -40
S
i-1, j
–A (from left to right)
Y
-48
S
i, j-1
–A (from top to bottom)
1
2
3
4
5
6
To do Dynamic Programming:
First write one sequence across the top, and one down along the side
i =0 1 2 3 4 5
j =
Gap V D S C Y
0 Gap
0
-8
-16
4
4
-4
-3
-8
-24 -32 -40
-8
V
-8
E
S
L
C
Y
1
2
3
4
5
6
To do Dynamic Programming:
First write one sequence across the top, and one down along the side
i =0 1 2 3 4 5
j =
Gap V D S C Y
0 Gap
0
4
-8
-4
-3
-8
-16 -24 -32 -40
-8
V
-8
4
-12 -20 -28
-1
E
-16
-6
-9
7
-17
2
S
1 -7
-24
-14
3
-6
9
L -32
-22 -14
-30
3
0
C -40
-22
-7
1
3
13
Y
-48
-38
-30
-15 5
23
The Traceback:
After the alignment square is finished, start at the lower right and
work backwards following the arrows to see how you got there…
i =0 1 2 3 4 5
j =
Gap V D S C Y
0 Gap
0
4
-8
-16 -24 -32 -40
1 V -8
2 E -16
3 S
-24
4 L -32
5 C
-40
6 Y
-48
-8
4 -4
-3
-8
-12 -28
-6
-14
-22
-30
-38
7
3
-6
-14
-22
-30
9
2
-1
1
-7
-15
-9
1
3
13
5
-17
-7
0
3
23
-20
Y
-48
1
2
3
4
5
6
-38
-30
-15 5
23
The Traceback
V D S – C Y
V E S L C Y
gives the alignment:
i =0 1 2 3 4 5
j =
Gap V D S C Y
0
0
-8
-16 -24 -32 -40
-8
Gap
V
E
S
L
C
-8
-16
-24
-32
-40
4
4
-4
-3
-8
-12 -28
-6
-14
-22
-30
7
3
-6
-14
-22
9
2
-1
1
-7
-9
1
3
13
-17
-7
0
3
“Life must be lived forwards and understood backwards.”
-20
- S?ren Kierkegaard
Local Alignment
? Temple Smith and Michael Waterman, 1981 – modified
Needleman-Wunsch-Sellers
Local alignment is the best scoring alignment of a substring
in sequence x to a substring in sequence y.
KEY ELEMENT IS NOT TO FORCE THE ALIGNMENT TO GO TO
THE ENDS OF THE SEQUENCES.
For sequence x, residues 1, 2, 3…..N, can pick up to ~N
2
substrings,
i.e. start point a= 1,2….N and end point b= 1, 2….n. Same for
sequence y, ~M
2
substrings. For any two substrings, we have the
old O(mn) alignment problem, so the total number of possible
alignments is ~ N
2
M
2
(NM)=O(M
3
N
3
)- UGLY!!!! Solveable in polynomial
time, but a big polynomial!!!
Local Alignment
? Again, dynamic programming comes to the rescue!
Same basic scheme for dynamic programming as before
except….
Similarity matrix MUST include negative values for mismatches
--AND–
****When the value calculated for a position in the scoring matrix is
Negative, the value is set to zero.
THIS TERMINATES THE ALIGNMENT
Smith-Waterman:
Write one sequence across the top, and one down along the side
i =0 1 2 3 4 5
j =
Gap V D S C Y
0 Gap
0 0
s
ij
0 0 0 0
1 V 0
2 E 0
Local alignments: Smith-Waterman
3 S
0
S
ij
= max of: S
i-1, j-1
+ σ(x
i
, y
j
) (diagonal)
S
4 L 0
i-1, j
–A (from left to right)
5 C
0
6 Y
S
i, j-1
–A (from top to bottom)
0
0
Programs for Global and Local alignments:
Biology workbench
http://workbench.sdsc.edu/
Bill Pearson’s Web Page
http://fasta.bioch.virginia.edu/
NCBI, Expassy
Amino Acid Substitution Matrices
Margaret Dayhoff, 1978, PAM Matrices
**Evolutionary model **based on a small data set.
Assumes symmetry: A → B = B → A
Assumes amino acid substitutions observed over
short periods of time can be extrapolated to long
periods of time
71 groups of protein sequences, 85% similar
1572 amino acid changes.
Functional proteins →”Accepted” mutations by natural
selection
PAM1 matrix means 1% divergence between proteins - i.e.
1 amino acid change per 100 residues. Some texts re-state
this as the probability of each amino acid changing
into another is ~ 1% and probability of not changing is ~99%
Construction of a Dayhoff Matrix: PAM1
Step 1: Measure pairwise substitution frequencies for each
amino acid within families of related proteins
….GDSFHYFVSHG…..
….GDSFHYYVSFG…..
….GDSYHYFVSFG…..
….GDSFHYFVSFG…..
….GDSFHFFVSFG…..
900 Phe (F)….+ another 100 probable Phe but…
100 Phe (F) → 80 Tyr (Y), 3 Trp (W), 2 His (H)….
Gives f
ab
, i.e. f
FY
=80
f
FW
=3
….by evolution!
C
Do this for all 20 amino acids
D
E
F
G
C D E F G H I ……
f
CC
f
CD
f
CE
f
DC
f
EC
Gives f
ab
= pairwise exchange
frequency
Implicit assumption - 1st order Markov chain transition model
Step 2: Calculate the relative probabilities of pairwise
exchange of a→b
Pa = Probability of amino acid a
fab = number of substitutions between a and b
fa = total number of substitutions involving amino acid a
fa= Σ fab
a≠b
f = total number of mutations in the group of related sequences = Σ fa
a
Define relative probability M’ab as:
M’ab=Pr(a→b)=fab/f
Step 3: Scale relative probabilities to obtain
a 1% total chance of any amino acid changing into
a different amino acid
i.s. scale M’ to ensure that:
Σ Pa Mab = .01
a≠b
i.e. ΣPa Maa=.99
Step 4: This involves defining a “relative mutability”
index m
a
for each amino acid
m
Number of mutations
a
= f
a
involving amino acid a
100 f Pa
“exposure of ‘a’ to mutation”
Prob(a)* total number of
mutations weighted per 100 sites
Set adjusted probability M
ab
= M’
ab
m
a
Step 5: Calculate evolutionary distance scale
so that only 1/100 amino acids change
M
aa
reflects amino acid conservation
M
aa
=
1-
∑
M
ab
b
example
= 1-
(adjusted probability of Phe mutations)
M
FF
**Use a scale factor λ so that M
aa
is ~ 0.99
i.e. chance of it mutating is ~ 1%
i.e. this defines a PAM1 matrix….
M
aa
=
1-
λ∑
M
ab
= ~ 0.99
a≠b
λ is our evolutionary scale factor
… and for any particular mutation probability,
λM
ab
reflects the normalized measure of how likely
amino acid b will replace amino acid a over 1 PAM
Real PAM1 values
Amino Acid Change
F→A
F→R
F→N
F→D
F→C
F→Q
F→E
F→G
F→H
F→I
F→L
F→K
F→M
F→F
F→P
F→S
F→T
F→W
F→Y
F→V
PAM 1 Probability Score
0.0002
0.0001
Note – this is really just
0.0001
0.0000
1 column in a much bigger
0.0000
probability matrix
0.0000
0.0000
0.0001
0.0002
0.0007
0.0013
0.0000
0.0001
0.9946
0.0001
0.0003
0.0001
0.0001
0.0021
0.0001
SUM = 1.0
….. E F G ……
A
0.0002
C
0.0000
D
0.0000
E
0.0000
F
0.9946
G 0.0001
Next, assume that mutations at each site are independent of previous
mutations. Therefore, calculate changes predicted for more distantly related
proteins that have undergone N mutations/100 amino acids by multiplying
the PAM1 matrix against itself N times.
Example: PAM2 matrix:
aa1 aa2 aa3
aa1 aa2 aa3
aa1
a b c
aa1
a b c
aa2 d e f
x
aa2 d e f
aa3
g h i aa3
g h i
aa1 aa2 aa3 A=a
2
+bd+cg+…
aa1
A B C
B=ab+be+ch+…
D E Faa2
C=ac+bf+ci+…
G H I
aa3
D=da+ed+fg+…
PAM 250 matrix
? Multiply PAM1 matrix by itself 250 times!
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)(PAM1)
(PAM1)(PAM1)
Amino Acid Change
F→A
F→R
F→N
F→D
F→C
F→Q
F→E
F→G
F→H
F→I
F→L
F→K
F→M
F→F
F→P
F→S
F→T
F→W
F→Y
F→V
PAM 1 Score
0.0002
0.0001
0.0001
0.0000
0.0000
0.0000
0.0000
0.0001
0.0002
0.0007
0.0013
0.0000
0.0001
0.9946
0.0001
0.0003
0.0001
0.0001
0.0021
0.0001
PAM 250 Score
0.04
0.01
0.02
0.01
0.01
0.01
0.01
0.03
0.02
0.05
0.13
0.02
0.02
0.32
0.02
0.03
0.03
0.01
0.15
0.05
These are the M
ab
values!
i.e. the chance that one
amino acid will replace
another at 250 PAMs in
two proteins that are
evolutionarily related
to each other!
SUM = 1.0
PAM 250 matrix – 250% expected change
Sequences still ~ 15-30 % similar, i.e. Phe will match Phe ~ 32% of the time
Ala will match Ala ~ 13% of the time
Expected % similarity
Other PAM matrices: PAM 120 – 40%
PAM 80 – 50%
Use for similar sequences
PAM 60 – 60%
PAM250 – 15-30% similarity.
Where do the numbers in the
PAM250 Matrix table come from?
Step 6: Calculate relatedness odds
Chance that two amino acids in a sequence alignment come from
related proteins via evolution versus the chance that they are from
two unrelated proteins aligned by chance.
M
ab
= prob. that b replaces a in related proteins
-vs –
P
a
ran
= prob. that b replaces a because the proteins are
completely unrelated…i.e. a was there by chance
Now, P
a
ran
= f
a
, the frequency of occurrence of amino acid a
Where do the numbers in the
PAM250 Matrix table come from?
Step 6: Calculate relatedness odds
Relative odds of evolution rather than chance:
f
R
ij
= M
ij
i
Where do the numbers in the
PAM250 Matrix table come from?
Step 7: Calculate log (relatedness odds) and
multiply by 10 to clear fractional values
Example: Phe→Tyr (which must = Tyr→ Phe)
R
ij
= M
ij
f
i
M
FY
= 0.15
f
Phe
=0.04
So R
FY
=0.15/0.04 = 3.75
Log
10
R
FY
=Log
10
(3.75)=0.57
10 x 0.57 = 5.7
Likewise
M
YF
= 0.20
f
Tyr
=0.03
So R
YF
= 6.7
Log
10
(6.7)=0.83
10 x 0.57 = 8.3
So average = (5.7+8.3)/2 = 7….the number in the PAM250 table!
Remember…
Saw last time how to use these
numbers + dynamic
programming in order to “score”
an alignment…
But we have to use the right matrix!!!
PAM 250 matrix – 250% expected change
Sequences still ~ 15-30 % similar, i.e. Phe will match Phe ~ 32% of the time
Ala will match Ala ~ 13% of the time
Expected % similarity
Other PAM matrices: PAM 120 – 40%
PAM 80 – 50%
Use for similar sequences
PAM 60 – 60%
PAM250 – 15-30% similarity.
Use the correct PAM matrix for alignments based on how similar the
sequences to be algned are! But wait…..how do we know that in the
first place? Usually don’t!!!!.
So…… try PAM200, PAM120, PAM60, PAM80, and
PAM30 matrix and use the one that gives the highest ungapped
aligment score
Alternative amino acid matrices
Problems with Dayhoff:
? Based on amino acids, not nucleotides.
? Assumes evolutionary model with explicit phylogenetic relationships, and
circular arguments: alignment → matrices; matrices → new alignments.
? Based on a small set of closely related molecules.
? Gonnett, Cohen & Benner
-All against All database matching using DARWIN
1,700,000 matches
Compile mutation matrices at different PAMs DIRECTLY
? BLOSUM = Blocks Amino Acid Substitution Matrices-Henikoff&Henikoff 1992
-based on a much larger dataset from ~500 Prosite families identified by
Bairoch using conserved amino acid patterns “blocks” that define each family.
Typically used for multiple sequence alignment.
AA substitutions noted, log odds ratios derived.
for example…Block patterns 60% identical give rise to Blosum60 matrix,
etc….i.e. conservation of functional blocks based on un-gapped alignments.
Blosum62 - best match between information content and amount of data
Not based on explicit evolutionary model
GAPS
AKHFRGCVS
AKKF--CVG
? Linear Gap Penalty
W
n
= nγ,
n= # of gaps, γ = gap penalty
? Affine gap penalty
W
n
= g + nγ,
=# of gaps, γ = gap extension penalty,
and g = gap opening penalty
Simplified Alignment Statistics
? How can we tell how good an alignment is based on its score? What
are the chances that two random sequences would give a similar
score when they were aligned?
? Consider an easier problem – what is the longest run of heads I
will get in a random series of coin tosses?
Fair coin p=0.5, Erd?s and Rényi – longest run = log
1/p
(n)
here this is log
2
(n). If n=100, longest run is 6.65
? For two sequences of length n and m, we’re doing nm
comparisons, so the longest length of the predicted match
would be log
1/p
(mn)
? More precisely, the expectation value, or the mean of the
longest match turns out to be E(M)~=log
1/p
(Kmn) where K is a
constant that depends on amino acid composition.
….OK, this is really only true for ungapped local alignments
and I’m neglecting edge effects and mismatches
A few notes…
? E(M)~=log
1/p
(Kmn) means that the match length
gets bigger as the log of the product of the sequence
lengths. Using the amino acid substitution matrices,
we can turn these match lengths into alignment
scores, S.
? More commonly see two parameters used: λ =ln(1/p)
and the parameter K we already talked about.
? We want to know the number of High-Scoring Pairs,
HSP, (i.e. high scoring runs of amino acids).
? This number of HSPs, E, that exceed some score S
-λS
is given by E=Kmne
? So we can evaluate how good a sequence scores,
(i.e. its S) by looking at how many HSPs (i.e. E value)
we would expect for that score.
Notes (cont).
? Where do we get that distribution
function that tells us how E and are
related? Need to look at the scores in
some model of aligned random
sequences….
Notes (cont)
? The random sequence alignment
scores would give rise to an
“extreme value” distribution – like
a skewed gaussian.
? Called Gumbel extreme value
distribution
For a normal distribution with a mean m and a variance σ, the height of the
curve is described by Y=1/(σ√2π) exp[-(x-m)
2
/2σ
2
]
For an extreme value distribution, the height of the curve is described by
Y=exp[-x-e
-x
] …and P(S>x) = 1-exp[-e
-λ(x-u)]
where u=(ln Kmn)/λ
Can show that mean extreme score is ~ log
2
(nm), and the probability of
getting a score that exceeds some number of “standard deviations” x is:
P(S>x)~ Kmne
-λx.
***K and λ are tabulated for different matrices ****
-λS
For the less statistically inclined: E~ Kmne
-2 -1
0.2
Yev
0.4
-4 4
0.4
B.
Yn
Probability values for the extreme value distribution (A) and the
normal distribution (B). The area under each curve is 1.
0 1 2
X X
A.
3 4 5
? Two ways to get the K and λ parameters:
1- For many amino acid substitution matrices,
Altschul and Gish have tabulated their score
distribution for 10,000 random amino acid
sequences using various gap penalties
2- Even better! Calculate the distribution for the
two sequences you are aligning by keeping one
of them fixed and scrambling the other one – this
preserves BOTH sequence length and amino acid
composition!
Example
SEQ1: FWLEVEGNS M TAPTG
SEQ2: FWLDVQGDS M TAPAG
USE BLOSUM62 MATRIX
Raw score = 67
Bit score: S=λR-lnK
ln(2)
S=(.267)(67)-ln(.0410)
= 30.4
0.693
E~ Kmne
-λS
which is
Equivalent to:
E=mn2
-s
E=(24)(15)(2
-30.4
)=2.54e-07
Multiple Sequence Alignments
? Sequences are aligned so as to bring the
greatest number of single characters into
register.
? If we include gaps, mismatches, then even
dynamic programming becomes limited to
~ 3 sequences unless they are very
short….need an alternative approach…
Why?
Consider the 2 sequence comparison
…..an O(mn) problem – order n
2
i =0 1 2 3 4 5
j =
Gap V D S C Y
0
0
4
-8
-4
-3
-8
-16 -24 -32 -40
-8
1 -8
4
-12 -20 -28
-12 -16
-6
-9
7
-17
2
1 -73
-24
-14
3
-6
9
4 -32
-22 -14
-30
3
0
5
-40
-22
-7
1
3
13
6
-48
-38
-30
-15 5
23
For 3 sequences….
ARDFSHGLLENKLLGCDSMRWE
.::. .:::. .:::: :::.
GRDYKMALLEQWILGCD-MRWD
.::. ::.: .. :. .:::
SRDW--ALIEDCMV-CNFFRWD
An O(mnj) problem !
Consider sequences each 300 amino acids
Uh Oh !!!
2 sequences – (300)
2 Our polynomail problem
3 sequences – (300)
3
Just became exponential!
but for v sequences – (300)
v
Consider pairwise alignments
between 3 sequences
Sequence B
S
e
q
u
e
n
c
e
C
A-C
A-B
B-C
A-B-C
Carillo and Lipman – Sum of Pairs method
Do we need to
Score each node?
Sequence A
Get the multiple alignment score within the cubic lattice by
Adding together the scores of the pairwise alignments…
In practice, doesn’t give optimal alignment…But we’re close!
Seems reasonable that the optimal alignment won’t be far from
the diagonal we were on…so we just set bounds on the location
of the msa within the cube based on each pairwise-alignment.
Then just do dynamic programing within the volume defined by the
pre-imposed bounds
Still takes too long for more than three
sequences…need a better way!
? Progressive Methods of Multiple Sequence
Alignment
Concept – simple:
1-Use DP to build pairwise alignments of most closely
related sequences
2- Then progressively add less related sequences or
groups of sequences…
ClustalW
Higgins and Sharp 1988
?1-Do pairwise analysis of all the sequences
(you choose similarity matrix).
? 2- Use the alignment scores to make a
phylogenetic tree.
? 3- Align the sequences to each other guided
by the phylogenetic relationships in the tree.
New features: Clustal ?ClustalW (allows weights) ? ClustalX (GUI-based
Weighting is important to avoid biasing an alignment by many sequence
Members that are closely related to each other evolutionarily!
Pairwise Alignment
Steps in Multiple Alignment
Multiple alignment following the tree from A
S
1
S
1
S
1
6 pairwise comparisons
then cluster analysis
similarity
align most similar pair
Gaps to optimize alignment
New gap to optimize
alignment of (S
2
S
4
) with (S
1
S
3
)
align next most similar pair
align alignments-preserve gaps
Example - 4 sequences S
1
S
2
S
3
S
4
S
2
S
2
S
2
S
3
S
3
S
3
S
4
S
4
S
4
S
1
S
2
S
3
S
4
Progressive Alignments
Note that the final msa is EXTREMELY DEPENDENT on
the initial pairwise sequence alignments!
If the sequences are close in evolution, and you can see
the alignment – GREAT!
If the sequences are NOT close in evolution, and you CANNOT
See the alignment – errors will be propogated to the final msa
Has led to other approaches to do msa’s that aren’t so
Dependent on initial states….i.e. genetic algorithm
Finding patterns (i.e. motifs and
domains) in Multiple Sequence Analysis
Block Analysis, Position Specific Scoring Matrices (PSSM)
BUILD an msa from groups of related proteins
BLOCKS represent a conserved region in that msa
that is LACKING IN GAPS – i.e. no insertions/deletions
The BLOCKS are typically anwhere from 3-60 amino acids long,
based on exact amino acid matches – i.e. alignment will tolerate
mismatches, but doesn’t use any kind of PAM or
BLOSUM matrix…in fact they generate the BLOSUM matrix!
A single proteins contain numerous such BLOCKS
separated by stretches of intervening sequences that can
differ in length and composition.
These blocks may be whole domains, short sequence motifs,
key parts of enzyme active sites etc, etc.
BLOCKS database….so far exploration limited. Lots of stuff to probe!
Can use these conserved BLOCKS
to derive a PSSM
? The dirty secret behind prosite! Scansite! And in a
12345………….11
twised way Psi-BLAST!
….GDSFHYFVSHG…..
….GDAFHYYISFG…..
….GDSYHYFLSFG…..
….SDSFHYFMSFG…..
….GDSFHFFASFG…..
Now build a matrix with 20 amino acids as the columns, and 11 rows
For the positions in the BLOCK
Each matrix entry is the log(frequency of the amino acid occurance)
at that position in the BLOCK.
….GDSFHYFVSHG…..
….GDAFHYYISFG…..
….GDSYHYFLSFG…..
….SDSFHYFMSFG…..
….GDSFHFFASFG…..
A C D E F G H I K….
1
Log(4)
2 Log(5)
3
4
5
We can now use the PSSM to search a database for other proteins that
have the BLOCK (or motif).
Problem 1 – We need to think about what kind of information is
Contained within the PSSM.
→ Leads to concepts of Information Content & Entropy (next time)
….GDSFHYFVSHG…..
….GDAFHYYISFG…..
….GDSYHYFLSFG…..
….SDSFHYFMSFG…..
….GDSFHFFASFG…..
Problem 2 –The PSSM must accurately represent the expected BLOCK
Or motif….and we have only limited amounts of data! Is it a good statistical
Sampling of the BLOCK/motif? Is it too narrow because of small dataset?
Should we broaden it by adding extra amino acids that we choose using
Some type of randomization scheme (called adding pseudocounts). If so,
How many should we add?