7.91 – Lecture #2
More Pairwise Sequence Comparisons – and – Multiple Sequence Alignment
Michael Yaffe

Pairwise alignment:
ARDFSHGLLENKLLGCDSMRWE
.::.  .:::.  .:::: :::.
GRDYKMALLEQWILGCD-MRWD

Multiple sequence alignment:
ARDFSHGLLENKLLGCDSMRWE
.::.  .:::.  .:::: :::.
GRDYKMALLEQWILGCD-MRWD
.::.   ::.:  .. :. .:::
SRDW--ALIEDCMV-CNFFRWD

Reading for this lecture: Mount pp. 8-9, 65-89, 96-115, 140-155, 161-170

Outline
- Recursion and dynamic programming
- Applied dynamic programming: global alignments – Needleman-Wunsch
- Applied dynamic programming: local alignments – Smith-Waterman
- Substitution matrices: PAM, BLOSUM, Gonnet
- Gaps – linear and affine
- Alignment statistics
- What you need to know to optimize an alignment

Outline (cont.)
- Multiple sequence alignments: MSA, Clustal
- Block analysis
- Position-Specific Scoring Matrices (PSSMs)

Examples of running times
O(n^k) is "polynomial time" and tractable as long as k is small (e.g. k < 3). Consider our un-gapped dot-matrix global alignment of a sequence of length m against one of length n: comparing every position of one against every position of the other is essentially an O(mn) problem. O.K.
O(n) is better than O(n log n), which is better than O(n^2), which is better than O(n^3).
Terrible examples: O(k^n) is exponential time ... horrible!!!! NP problems have no known polynomial-time solutions (NP = non-deterministic polynomial).

Recursion and Dynamic Programming
Aligning two protein sequences without gaps is roughly an O(mn) problem. With gaps, the number of possible alignments (≈ 2^(2L)/√(2πL), where L = sequence length) becomes computationally astronomical, and the problem cannot be done by direct comparison methods.
The alternative is to compare all possible pairs of characters (matches and mismatches), taking gaps into account as well, while keeping the number of comparisons manageable. This approach is called dynamic programming. It is mathematically proven to produce the optimal alignment. It requires a substitution (similarity) matrix and some way to account for gaps.

Example of how to score an alignment. Write down two sequences:
sequence #1:  V D S – C Y
sequence #2:  V E S L C Y
Score from the substitution
matrix: 4, 2, 4, (-11 for the gap), 9, 7.
Score = Σ(AA pair scores) – gap penalty = 15

[BLOSUM62 scoring matrix: the slide shows the full 20×20 table (rows/columns C S T P A G N D E Q H R K M I L V F Y W); the individual values are garbled in this extraction.]

A scoring system should favor matching identical or related amino acids, and penalize poor matches and gaps. To get a good scoring system we need to know how often a particular amino acid pair is found in related proteins compared with its occurrence by chance – this is the information contained in the substitution matrix – ... and when a gap would be a better choice.

Deriving realistic substitution matrices
First we need to know the frequency of one amino acid substituting for another in related proteins [= P(ab)], compared with the chance that substituting one for the other occurred by chance, based on the relative frequencies of each amino acid in proteins, q(a) and q(b). Call this the "odds ratio": P(ab)/[q(a)q(b)].
If we do this for all positions in an alignment, the total probability is the product of the odds ratios at each position ... but multiplication is computationally expensive ... so take the log of each odds ratio and add them instead.
Matrices like the PAM and BLOSUM matrices are derived from these log-odds ratios, and contain positive and negative numbers reflecting the likelihood of amino acid substitutions in related proteins.
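The worked score above can be reproduced in a few lines of code. A minimal sketch (the pair scores below are the BLOSUM62 values for these residue pairs, and 11 is the gap penalty used on the slide; the function names are mine, not from the lecture):

```python
# Score a gapped alignment: sum of substitution-matrix pair scores minus a
# penalty for each gap, as in the VDS-CY / VESLCY example above.
BLOSUM62_PAIRS = {
    ("V", "V"): 4, ("D", "E"): 2, ("S", "S"): 4,
    ("C", "C"): 9, ("Y", "Y"): 7,
}

def pair_score(a, b, matrix):
    # Substitution matrices are symmetric, so look up both orderings.
    return matrix.get((a, b), matrix.get((b, a)))

def score_alignment(seq1, seq2, matrix, gap_penalty=11):
    total = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            total -= gap_penalty
        else:
            total += pair_score(a, b, matrix)
    return total

print(score_alignment("VDS-CY", "VESLCY", BLOSUM62_PAIRS))  # 4+2+4-11+9+7 = 15
```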
To do dynamic programming, first write one sequence across the top (i = 0, 1, ..., 5: Gap V D S C Y) and one down along the side (j = 0, 1, ..., 6: Gap V E S L C Y). With a linear gap penalty γ(n) = nA (A = gap penalty, here A = 8), the first row and column are initialized with multiples of –A:

        Gap    V    D    S    C    Y
  Gap     0   -8  -16  -24  -32  -40
  V      -8
  E     -16
  S     -24
  L     -32
  C     -40
  Y     -48

Scoring S_ij requires that we know S(i-1, j-1), S(i, j-1) and S(i-1, j) ... the procedure is therefore recursive: we use the solutions of smaller problems to solve larger ones. AND we store how we got to each S_ij score, i.e. the intermediate solutions, in a tabular matrix. Computer scientists call this dynamic programming, where "programming" means the matrix, not some kind of computer code.

Global alignments: Needleman-Wunsch-Sellers – O(n^2) using a linear gap penalty.
S_ij = max of:
  S_(i-1, j-1) + σ(x_i, y_j)   (diagonal)
  S_(i-1, j) – A               (from left to right)
  S_(i, j-1) – A               (from top to bottom)

The slides then step through the matrix, filling it in cell by cell with this recurrence.
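The recurrence above translates directly into code. A minimal sketch of a Needleman-Wunsch global aligner that returns just the optimal score (no traceback); for brevity it uses a toy match/mismatch σ instead of a full substitution matrix, and the parameter values are illustrative:

```python
def needleman_wunsch(x, y, match=4, mismatch=-2, gap=8):
    """Global alignment score with a linear gap penalty A = gap."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: multiples of -A, as in the table above.
    for i in range(1, n + 1):
        S[i][0] = -gap * i
    for j in range(1, m + 1):
        S[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,  # diagonal
                          S[i - 1][j] - gap,        # gap in y
                          S[i][j - 1] - gap)        # gap in x
    return S[n][m]  # optimal global score sits in the lower-right cell

print(needleman_wunsch("VDSCY", "VESLCY"))  # 6 with these toy parameters
```

With these parameters the best path corresponds to the alignment VDS-CY / VESLCY: four matches (4 each), one mismatch (-2) and one gap (-8) give 16 - 2 - 8 = 6.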
Completing the table fills in every cell, with the optimal global alignment score in the lower-right corner (23 on the slide).

The Traceback: after the alignment matrix is finished, start at the lower-right cell and work backwards, following the stored arrows to see how you got there. The traceback gives the alignment:

V D S – C Y
V E S L C Y

"Life must be lived forwards and understood backwards." – Søren Kierkegaard

Local Alignment
Temple Smith and Michael Waterman, 1981 – a modification of Needleman-Wunsch-Sellers.
A local alignment is the best-scoring alignment of a substring of sequence x to a substring of sequence y. THE KEY ELEMENT IS NOT TO FORCE THE ALIGNMENT TO GO TO THE ENDS OF THE SEQUENCES.
For sequence x, with residues 1, 2, 3, ..., N, we can pick ~N^2 substrings, i.e. start point a = 1, 2, ..., N and end point b = 1, 2, ..., N. Same for sequence y: ~M^2 substrings.
For any two substrings we have the old O(mn) alignment problem, so the total number of possible alignments is ~N^2 M^2 (NM) = O(M^3 N^3) – UGLY!!!! Solvable in polynomial time, but a big polynomial!!!

Local Alignment
Again, dynamic programming comes to the rescue! Same basic scheme as before, except:
- the similarity matrix MUST include negative values for mismatches, AND
- when the value calculated for a position in the scoring matrix is negative, the value is set to zero. THIS TERMINATES THE ALIGNMENT.

Smith-Waterman: write one sequence across the top and one down along the side, but initialize the first row and column to 0 rather than to multiples of the gap penalty.

Local alignments: Smith-Waterman
S_ij = max of:
  0
  S_(i-1, j-1) + σ(x_i, y_j)   (diagonal)
  S_(i-1, j) – A               (from left to right)
  S_(i, j-1) – A               (from top to bottom)

Programs for global and local alignments:
- Biology Workbench: http://workbench.sdsc.edu/
- Bill Pearson's web page: http://fasta.bioch.virginia.edu/
- NCBI, ExPASy

Amino Acid Substitution Matrices
Margaret Dayhoff, 1978: the PAM matrices.
- An evolutionary model, based on a small data set.
- Assumes symmetry: A → B = B → A.
- Assumes amino acid substitutions observed over short periods of time can be extrapolated to long periods of time.
- Built from 71 groups of protein sequences (85% similar) containing 1,572 amino acid changes.
- Functional proteins → mutations "accepted" by natural selection.
A PAM1 matrix means 1% divergence between proteins, i.e. 1 amino acid change per 100 residues. Some texts restate this as: the probability of each amino acid changing into another is ~1%, and the probability of not changing is ~99%.

Construction of a Dayhoff matrix: PAM1
Step 1: Measure pairwise substitution frequencies for each amino acid within families of related proteins:
....GDSFHYFVSHG....
....GDSFHYYVSFG....
....GDSYHYFVSFG....
....GDSFHYFVSFG....
....GDSFHFFVSFG....
e.g. 900 Phe (F) ... + another 100 probable Phe, but:
100 Phe (F) → 80 Tyr (Y), 3 Trp (W), 2 His (H), ....
This gives f_ab, i.e.
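The zero-flooring rule above is the only change needed to turn the global aligner into a local one. A minimal Smith-Waterman sketch (toy match/mismatch scores again; real use would plug in a substitution matrix with negative mismatch values, as the slide requires):

```python
def smith_waterman(x, y, match=4, mismatch=-2, gap=8):
    """Best local alignment score; first row/column stay 0."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0,                          # negative -> 0: terminates the alignment
                          S[i - 1][j - 1] + sigma,
                          S[i - 1][j] - gap,
                          S[i][j - 1] - gap)
            best = max(best, S[i][j])  # best score can sit anywhere in the matrix
    return best

# The shared internal substring VDSCY scores 5 matches x 4 = 20, unaffected
# by the mismatched flanking residues:
print(smith_waterman("AAVDSCYAA", "TTVDSCYTT"))  # 20
```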
f_FY = 80, f_FW = 3, .... by evolution!
Do this for all 20 amino acids to fill a 20×20 table of pairwise exchange frequencies f_ab (f_CC, f_CD, f_CE, ..., f_DC, ..., f_EC, ... across rows/columns C D E F G H I ...).
Implicit assumption: a first-order Markov chain transition model.

Step 2: Calculate the relative probabilities of pairwise exchange a → b.
  P_a = probability of amino acid a
  f_ab = number of substitutions between a and b
  f_a = total number of substitutions involving amino acid a = Σ_(b≠a) f_ab
  f = total number of mutations in the group of related sequences = Σ_a f_a
Define the relative probability M'_ab as: M'_ab = Pr(a → b) = f_ab / f

Step 3: Scale the relative probabilities to obtain a 1% total chance of any amino acid changing into a different amino acid, i.e. scale M' to ensure that
  Σ_(a≠b) P_a M_ab = 0.01, i.e. Σ_a P_a M_aa = 0.99

Step 4: Define a "relative mutability" index m_a for each amino acid:
  m_a = f_a / (100 f P_a)
i.e. the number of mutations involving amino acid a, divided by the "exposure of a to mutation" (the probability of a times the total number of mutations, weighted per 100 sites).
Set the adjusted probability: M_ab = M'_ab m_a

Step 5: Calibrate the evolutionary distance scale so that only 1 in 100 amino acids changes. M_aa reflects amino acid conservation:
  M_aa = 1 – Σ_b M_ab
(for example, M_FF = 1 – the adjusted probability of all Phe mutations).
Use a scale factor λ so that M_aa ≈ 0.99, i.e. the chance of an amino acid mutating is ~1%. This defines the PAM1 matrix:
  M_aa = 1 – λ Σ_(b≠a) M_ab ≈ 0.99
λ is our evolutionary scale factor, and for any particular mutation probability, λM_ab reflects the normalized measure of how likely amino acid b is to replace amino acid a over 1 PAM.

Real PAM1 values – really just one column (Phe) in a much bigger probability matrix:
  F→A 0.0002   F→R 0.0001   F→N 0.0001   F→D 0.0000   F→C 0.0000
  F→Q 0.0000   F→E 0.0000   F→G 0.0001   F→H 0.0002   F→I 0.0007
  F→L 0.0013   F→K 0.0000   F→M 0.0001   F→F 0.9946   F→P 0.0001
  F→S 0.0003   F→T 0.0001   F→W 0.0001   F→Y 0.0021   F→V 0.0001
  SUM = 1.0
(The slide shows this same Phe column embedded in the full matrix, with rows A 0.0002, C 0.0000, D 0.0000, E 0.0000, F 0.9946, G 0.0001, ....)

Next, assume that mutations at each site are independent of previous mutations. Then the changes predicted for more distantly related proteins that have undergone N mutations per 100 amino acids are calculated by multiplying the PAM1 matrix against itself N times.
Example – a PAM2 matrix from a PAM1 matrix with rows (a b c), (d e f), (g h i):
  A = a^2 + bd + cg + ...
  B = ab + be + ch + ...
  C = ac + bf + ci + ...
  D = da + ed + fg + ...
  ...

PAM250 matrix: multiply the PAM1 matrix by itself 250 times, i.e. (PAM1)^250.

Comparing the Phe column at PAM1 and at PAM250 (amino acid changes F→A through F→V):
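Extrapolating PAM1 to PAM-N is literally repeated matrix multiplication. A sketch with a made-up 3×3 transition matrix standing in for the real 20×20 PAM1 (rows sum to 1, diagonal ≈ 0.99 — purely for illustration, NOT Dayhoff's values):

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_n(pam1, n):
    """(PAM1)^n: multiply the PAM1 matrix against itself n-1 more times."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result

TOY_PAM1 = [            # toy stand-in for a 20x20 PAM1 matrix
    [0.990, 0.008, 0.002],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]
pam250 = pam_n(TOY_PAM1, 250)
# Rows remain probability distributions at every PAM distance, while the
# diagonal (the chance of no change) decays well below its PAM1 value of 0.99.
print(all(abs(sum(row) - 1.0) < 1e-9 for row in pam250))  # True
print(pam250[0][0] < 0.99)                                # True
```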
Change   PAM1 score   PAM250 score
F→A      0.0002       0.04
F→R      0.0001       0.01
F→N      0.0001       0.02
F→D      0.0000       0.01
F→C      0.0000       0.01
F→Q      0.0000       0.01
F→E      0.0000       0.01
F→G      0.0001       0.03
F→H      0.0002       0.02
F→I      0.0007       0.05
F→L      0.0013       0.13
F→K      0.0000       0.02
F→M      0.0001       0.02
F→F      0.9946       0.32
F→P      0.0001       0.02
F→S      0.0003       0.03
F→T      0.0001       0.03
F→W      0.0001       0.01
F→Y      0.0021       0.15
F→V      0.0001       0.05
SUM = 1.0 in each column.

These are the M_ab values, i.e. the chance that one amino acid will replace another at 250 PAMs in two proteins that are evolutionarily related to each other!

PAM250 matrix – 250% expected change. The sequences are still ~15-30% similar, i.e. Phe will match Phe ~32% of the time, and Ala will match Ala ~13% of the time.
Expected % similarity for other PAM matrices:
  PAM 120 – 40%
  PAM 80 – 50%    (use for similar sequences)
  PAM 60 – 60%
  PAM 250 – 15-30% similarity

Where do the numbers in the PAM250 matrix table come from?
Step 6: Calculate the relatedness odds – the chance that two amino acids in a sequence alignment come from related proteins via evolution versus the chance that they come from two unrelated proteins aligned by chance:
  M_ab = probability that b replaces a in related proteins
versus
  P_a^ran = probability that b replaces a because the proteins are completely unrelated, i.e. a was there by chance.
Now P_a^ran = f_a, the frequency of occurrence of amino acid a, so the relative odds of evolution rather than chance are:
  R_ij = M_ij / f_i

Step 7: Calculate log(relatedness odds) and multiply by 10 to clear fractional values.
Example: Phe → Tyr (which must equal Tyr → Phe):
  M_FY = 0.15, f_Phe = 0.04, so R_FY = 0.15/0.04 = 3.75; log10(3.75) = 0.57; 10 × 0.57 = 5.7
Likewise:
  M_YF = 0.20, f_Tyr = 0.03, so R_YF = 6.7; log10(6.7) = 0.83; 10 × 0.83 = 8.3
So the average = (5.7 + 8.3)/2 = 7 .... the number in the PAM250 table!

Remember: we saw last time how to use these numbers plus dynamic programming to "score" an alignment ... but we have to use the right matrix!!!
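Step 7's arithmetic is easy to check in code. A sketch using the lecture's M and f values for Phe/Tyr (note the slide rounds R_YF to 6.7 before taking the log, giving 8.3; the unrounded value gives 8.2, and the averaged table entry still rounds to 7):

```python
import math

def log_odds_10(m_ab, f_a):
    """10 * log10 of the relatedness odds R_ij = M_ij / f_i."""
    return 10 * math.log10(m_ab / f_a)

s_fy = log_odds_10(0.15, 0.04)       # Phe -> Tyr: 10*log10(3.75) ~ 5.7
s_yf = log_odds_10(0.20, 0.03)       # Tyr -> Phe: 10*log10(6.67) ~ 8.2
entry = round((s_fy + s_yf) / 2)     # symmetrized PAM250 table entry
print(round(s_fy, 1), round(s_yf, 1), entry)  # 5.7 8.2 7
```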
Use the correct PAM matrix for alignments, based on how similar the sequences to be aligned are! But wait ... how do we know that in the first place? Usually we don't!!! So ... try the PAM200, PAM120, PAM60, PAM80, and PAM30 matrices, and use the one that gives the highest ungapped alignment score.

Alternative amino acid matrices
Problems with Dayhoff:
- Based on amino acids, not nucleotides.
- Assumes an evolutionary model with explicit phylogenetic relationships, and involves circular arguments: alignments → matrices; matrices → new alignments.
- Based on a small set of closely related molecules.

Gonnet, Cohen & Benner: all-against-all database matching using DARWIN (1,700,000 matches); compiled mutation matrices at different PAM distances DIRECTLY.

BLOSUM = BLOcks amino acid SUbstitution Matrices (Henikoff & Henikoff, 1992). Based on a much larger dataset: ~500 PROSITE families identified by Bairoch using conserved amino acid patterns ("blocks") that define each family. Typically used for multiple sequence alignment. Amino acid substitutions are tallied and log-odds ratios derived. For example, block patterns 60% identical give rise to the BLOSUM60 matrix, etc., i.e. conservation of functional blocks based on un-gapped alignments.
BLOSUM62 gives the best match between information content and amount of data. Not based on an explicit evolutionary model.

GAPS
AKHFRGCVS
AKKF--CVG
- Linear gap penalty: W_n = nγ, where n = number of gaps and γ = gap penalty.
- Affine gap penalty: W_n = g + nγ, where n = number of gaps, γ = gap extension penalty, and g = gap opening penalty.

Simplified Alignment Statistics
How can we tell how good an alignment is based on its score? What are the chances that two random sequences would give a similar score when they were aligned?
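The two gap models differ only in whether opening a gap costs extra. A sketch of the slide's formulas (the numeric penalty values here are illustrative, not from the lecture):

```python
def linear_gap(n, gamma=8):
    """W_n = n * gamma: every gap position costs the same."""
    return n * gamma

def affine_gap(n, g=10, gamma=1):
    """W_n = g + n * gamma: pay g once to open, then gamma per extension."""
    return g + n * gamma

# The 2-residue gap in AKKF--CVG:
print(linear_gap(2), affine_gap(2))  # 16 12
```

Affine penalties make one long gap cheaper than several short gaps of the same total length, which matches the biology of single insertion/deletion events.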
Consider an easier problem: what is the longest run of heads I will get in a random series of coin tosses? For a fair coin, p = 0.5. Erdős and Rényi: the expected longest run is log_(1/p)(n), here log_2(n). If n = 100, the longest run is ~6.64.
- For two sequences of length n and m, we are doing nm comparisons, so the longest length of the predicted match would be log_(1/p)(mn).
- More precisely, the expectation value (the mean of the longest match) turns out to be E(M) ≈ log_(1/p)(Kmn), where K is a constant that depends on amino acid composition.
- ... OK, this is really only true for ungapped local alignments, and I'm neglecting edge effects and mismatches.

A few notes ...
- E(M) ≈ log_(1/p)(Kmn) means that the match length gets bigger as the log of the product of the sequence lengths. Using the amino acid substitution matrices, we can turn these match lengths into alignment scores, S.
- More commonly you see two parameters used: λ = ln(1/p) and the constant K we already talked about.
- We want to know the number of High-Scoring Pairs, HSPs (i.e. high-scoring runs of amino acids).
- The number of HSPs, E, that exceed some score S is given by E = Kmn e^(-λS).
- So we can evaluate how good a sequence score S is by looking at how many HSPs (i.e. the E value) we would expect for that score.

Notes (cont.)
- Where do we get the distribution function that tells us how E and S are related? We need to look at the scores in some model of aligned random sequences ....
- The random sequence alignment scores give rise to an "extreme value" distribution – like a skewed Gaussian.
This is called the Gumbel extreme value distribution.
For a normal distribution with mean m and standard deviation σ, the height of the curve is described by
  Y = 1/(σ√(2π)) exp[-(x-m)^2 / (2σ^2)]
For an extreme value distribution, the height of the curve is described by
  Y = exp[-x - e^(-x)]
... and P(S > x) = 1 - exp[-e^(-λ(x-u))], where u = (ln Kmn)/λ.
One can show that the mean extreme score is ~log_2(nm), and the probability of getting a score that exceeds some number of "standard deviations" x is:
  P(S > x) ≈ Kmn e^(-λx)
K and λ are tabulated for different matrices.
For the less statistically inclined: E ≈ Kmn e^(-λS)

[Figure: probability values for the extreme value distribution (A) and the normal distribution (B); the area under each curve is 1.]

Two ways to get the K and λ parameters:
1. For many amino acid substitution matrices, Altschul and Gish have tabulated the score distributions for 10,000 random amino acid sequences using various gap penalties.
2. Even better: calculate the distribution for the two sequences you are aligning by keeping one of them fixed and scrambling the other one – this preserves BOTH sequence length and amino acid composition!

Example:
SEQ1: FWLEVEGNSMTAPTG
SEQ2: FWLDVQGDSMTAPAG
Using the BLOSUM62 matrix, raw score R = 67.
Bit score: S = (λR - ln K)/ln 2 = [(0.267)(67) - ln(0.0410)]/0.693 = 30.4
E ≈ Kmn e^(-λR), which is equivalent to E = mn 2^(-S):
E = (24)(15)(2^(-30.4)) = 2.54e-07

Multiple Sequence Alignments
- Sequences are aligned so as to bring the greatest number of single characters into register.
- If we include gaps and mismatches, then even dynamic programming becomes limited to ~3 sequences unless they are very short ... we need an alternative approach.
Why? The 2-sequence comparison is an O(mn) problem – order n^2 (the filled dynamic-programming table from earlier).

For 3 sequences ...
ARDFSHGLLENKLLGCDSMRWE
.::.  .:::.
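The worked bit-score/E-value example can be checked directly. A sketch using the λ = 0.267 and K = 0.0410 values given above, and the lengths 24 and 15 as on the slide:

```python
import math

def bit_score(raw, lam=0.267, K=0.0410):
    """S = (lambda*R - ln K) / ln 2."""
    return (lam * raw - math.log(K)) / math.log(2)

def e_value(m, n, raw, lam=0.267, K=0.0410):
    # E ~ K*m*n*exp(-lambda*R), equivalently m*n*2^(-bit score)
    return m * n * 2.0 ** -bit_score(raw, lam, K)

S = bit_score(67)
print(round(S, 1))          # 30.4
print(e_value(24, 15, 67))  # ~2.5e-07
```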
.:::: :::.
GRDYKMALLEQWILGCD-MRWD
.::.   ::.:  .. :. .:::
SRDW--ALIEDCMV-CNFFRWD
... an O(mnj) problem!

Consider sequences that are each 300 amino acids long. Uh oh!!!
  2 sequences – (300)^2   (our polynomial problem)
  3 sequences – (300)^3
  but for v sequences – (300)^v   ... it just became exponential!

Consider pairwise alignments between 3 sequences: A-B, A-C, B-C, and the joint A-B-C path through a cubic lattice with sequence A, sequence B, and sequence C along its axes.
Carrillo and Lipman – the Sum of Pairs method. Do we need to score each node? Get the multiple alignment score within the cubic lattice by adding together the scores of the pairwise alignments. In practice this doesn't give the optimal alignment ... but we're close! It seems reasonable that the optimal alignment won't be far from the diagonals we were on, so we just set bounds on the location of the MSA within the cube based on each pairwise alignment, then do dynamic programming within the volume defined by the pre-imposed bounds.
This still takes too long for more than three sequences ... we need a better way!

Progressive Methods of Multiple Sequence Alignment
The concept is simple:
1. Use dynamic programming to build pairwise alignments of the most closely related sequences.
2. Then progressively add less related sequences or groups of sequences.

Clustal (Higgins and Sharp, 1988):
1. Do a pairwise analysis of all the sequences (you choose the similarity matrix).
2. Use the alignment scores to make a phylogenetic tree.
3. Align the sequences to each other, guided by the phylogenetic relationships in the tree.
New features: ClustalW (allows weights), ClustalX (GUI-based). Weighting is important to avoid biasing an alignment by many sequence members that are closely related to each other evolutionarily!
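The first two Clustal steps can be sketched in a few lines: score all pairs, then pick the most similar pair to align first. Percent identity stands in here for a real dynamic-programming alignment score, and the (equal-length) sequences are made up for illustration:

```python
from itertools import combinations

def percent_identity(a, b):
    # Assumes equal-length toy sequences; a real implementation would
    # align first and score with a substitution matrix.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

seqs = {
    "S1": "GDSFHYFVSHG",
    "S2": "GDAFHYYISFG",
    "S3": "GDSYHYFLSFG",
    "S4": "SDSFHYFMSFG",
}
# All 6 pairwise comparisons (step 1); these scores would feed tree building.
scores = {pair: percent_identity(seqs[pair[0]], seqs[pair[1]])
          for pair in combinations(seqs, 2)}
closest = max(scores, key=scores.get)  # the pair to align first
print(closest, round(scores[closest], 2))
```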
Steps in Multiple Alignment (example – 4 sequences S1, S2, S3, S4):
1. Do all 6 pairwise comparisons, then cluster analysis by similarity.
2. Align the most similar pair, adding gaps to optimize the alignment.
3. Align the next most similar pair.
4. Align the alignments, preserving gaps – e.g. following the tree, a new gap may be needed to optimize the alignment of (S2 S4) with (S1 S3).

Progressive Alignments
Note that the final MSA is EXTREMELY DEPENDENT on the initial pairwise sequence alignments!
If the sequences are close in evolution, and you can see the alignment – GREAT!
If the sequences are NOT close in evolution, and you CANNOT see the alignment – errors will be propagated to the final MSA.
This has led to other approaches to MSAs that aren't so dependent on initial states, e.g. genetic algorithms.

Finding patterns (i.e. motifs and domains) in Multiple Sequence Analysis: Block Analysis and Position-Specific Scoring Matrices (PSSMs)
BUILD an MSA from groups of related proteins. BLOCKS represent conserved regions in that MSA that are LACKING IN GAPS – i.e. no insertions/deletions. The BLOCKS are typically anywhere from 3-60 amino acids long, based on exact amino acid matches – the alignment will tolerate mismatches, but doesn't use any kind of PAM or BLOSUM matrix ... in fact the blocks are what generate the BLOSUM matrices!
A single protein contains numerous such BLOCKS separated by stretches of intervening sequence that can differ in length and composition. These blocks may be whole domains, short sequence motifs, key parts of enzyme active sites, etc. Exploration of the BLOCKS database has so far been limited – lots of stuff to probe!
We can use these conserved BLOCKS to derive a PSSM – the dirty secret behind PROSITE! Scansite! And, in a twisted way, PSI-BLAST!

An example block (positions 1-11):
....GDSFHYFVSHG....
....GDAFHYYISFG....
....GDSYHYFLSFG....
....SDSFHYFMSFG....
....GDSFHFFASFG....
Now build a matrix with the 20 amino acids as the columns and 11 rows for the positions in the BLOCK. Each matrix entry is the log(frequency of the amino acid's occurrence) at that position in the BLOCK – e.g. Log(4) for G at position 1 (G appears in 4 of the 5 sequences above) and Log(5) for D at position 2 (D appears in all 5).

We can now use the PSSM to search a database for other proteins that have the BLOCK (or motif).

Problem 1: We need to think about what kind of information is contained within the PSSM. → This leads to the concepts of Information Content & Entropy (next time).

Problem 2: The PSSM must accurately represent the expected BLOCK or motif ... and we have only limited amounts of data! Is it a good statistical sampling of the BLOCK/motif? Is it too narrow because of the small dataset? Should we broaden it by adding extra amino acids chosen using some type of randomization scheme (called adding pseudocounts)? If so, how many should we add?
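The block-to-PSSM step can be sketched directly from the example above (raw log counts, as on the slide, with no background correction or pseudocounts; the helper names are mine):

```python
import math

BLOCK = [
    "GDSFHYFVSHG",
    "GDAFHYYISFG",
    "GDSYHYFLSFG",
    "SDSFHYFMSFG",
    "GDSFHFFASFG",
]

def build_pssm(block):
    """One dict per block position, mapping amino acid -> log(count)."""
    pssm = []
    for pos in range(len(block[0])):
        column = [seq[pos] for seq in block]
        pssm.append({aa: math.log(column.count(aa)) for aa in set(column)})
    return pssm

pssm = build_pssm(BLOCK)
print(round(pssm[0]["G"], 3))  # log(4): G appears 4 times at position 1
print(round(pssm[1]["D"], 3))  # log(5): D appears 5 times at position 2

def score(seq, pssm):
    # Sum positional log counts; an unseen residue kills the match outright,
    # which is exactly the narrowness that pseudocounts (Problem 2) address.
    return sum(pssm[i].get(aa, float("-inf")) for i, aa in enumerate(seq))
```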