7.91 / 7.36 / BE.490 Lecture #4, Mar. 4, 2004
Markov & Hidden Markov Models for DNA Sequence Analysis
Chris Burge

Organization of Topics

  Model                             Dependence Structure    Lecture
  Weight Matrix Model               Independence            3/2
  Hidden Markov Model               Local dependence        3/4
  Energy Model, Covariation Model   Non-local dependence    3/9

Markov & Hidden Markov Models for DNA (see Ch. 4 of Mount)
- Hidden Markov Models: looking under the hood
- Markov models for splice sites
- The Viterbi algorithm
- Real-world HMMs

Review of DNA Motif Modeling & Discovery (see Ch. 4 of Mount)
- Information content of a motif
- WMMs for splice sites
- The motif finding/discovery problem
- The Gibbs sampler
- Motif modeling: beyond weight matrices

Information Content of a DNA/RNA Motif

Positions -3 ... +6; f_k = frequency of nucleotide k at a given position.

  Shannon entropy:       H(f) = -Σ_k f_k log2(f_k)   (bits)
  Information/position:  I(f) = 2 - H(f) = 2 + Σ_k f_k log2(f_k) = Σ_k f_k log2(f_k / (1/4))   (bits)

A motif containing m bits of information will occur approximately once per 2^m bases of random sequence.

Variables Affecting Motif Finding

[Figure: about two dozen ~40-nt DNA sequences, many containing an occurrence of a common motif]

  L = avg. sequence length
  N = no. of sequences
  I = info. content of motif
  W = motif width

How is the 5'ss recognized?

  U1 snRNA   3'-...CCAUUCAUAG-5'
                      ||||||
  Pre-mRNA   5'-....UUCGUGAGU...-3'

RNA Energetics I

Free energy of helix formation derives from:
- base pairing: G:C > A:U > G:U
- base stacking of adjacent pairs: 5'-UpX-3' / 3'-ApY-5'

Doug Turner's energy rules: stacking free energies (kcal/mol) for the pair X:Y stacked on U:A ("." = non-pairing combination):

       Y:  A      C      G      U
  X = A    .      .      .      -1.30
  X = C    .      .      -2.40  .
  X = G    .      -2.10  .      -1.00
  X = U    -0.90  .      -1.30  .

RNA Energetics II

[Figure: three RNA duplexes: (A) seven consecutive base pairs; (B) a duplex interrupted by an internal loop; (C) a duplex with an unpaired terminal base pair]

- Lots of consecutive base pairs: good
- Internal loop: bad
- Terminal base pair not stable: bad

Generally A will be more stable than B or C.

Conditional Frequencies in 5'ss Sequences

Nucleotide frequencies (%) at selected positions, conditioned on the nucleotide at position +5 (data from Burge, 1998, "Computational Methods in Molecular Biology"):

  5'ss which have G at +5         5'ss which lack G at +5
  Pos  -1   +3   +4   +6          Pos  -1   +3   +4   +6
  A     9   44   75   14          A     2   81   51   22
  C     4    3    4   18          C     1    3   28   20
  G    78   51   13   19          G    97   15    9   30
  T     9    3    9   49          T     0    2   12   28

What kind of model could incorporate interactions between positions?
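The information-content formulas above can be made concrete with a short computation. The sketch below (plain Python, function names my own) computes I = 2 - H for a single motif position from its nucleotide frequencies:

```python
import math

def information_content(freqs):
    """Information content (bits) of one motif position.

    freqs: dict mapping nucleotide -> frequency at this position (sums to 1).
    Uses I = 2 - H, where H = -sum_k f_k log2(f_k) is the Shannon entropy.
    """
    h = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return 2.0 - h

# A uniform position carries 0 bits; an invariant position carries 2 bits.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
invariant = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
print(information_content(uniform))    # 0.0
print(information_content(invariant))  # 2.0
```

Summing this quantity over all positions gives the total information m of the motif; per the rule above, such a motif is expected roughly once per 2^m bases of random sequence.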
A Markov Model

[Figure: a 5'ss model over positions -3 ... +6, with arrows linking each position to the next]

Terminology

Random Variable (RV): a quantity which may assume any one of a set of values, each with a definite probability of occurrence.

Example: X = the outcome of rolling a die; P(X=1) = P(X=2) = ... = P(X=6) = 1/6.

Stochastic process: a random process, i.e. a sequence of random variables. Example (the "craps process"): X1, X2, X3, ... = successive dice rolls.

What is a Markov Model (aka Markov Chain)?

Classical definition: a discrete stochastic process X1, X2, X3, ... which has the Markov property:

  P(X_{n+1} = j | X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = P(X_{n+1} = j | X_n = x_n)

(for all x_i, all j, all n)

In words: a random process in which the future (next state) is conditionally independent of the past given the present (current state).

Named for A. A. Markov, a Russian mathematician (ca. 1922).

Inhomogeneous 1st-Order Markov Model

Parameters are estimated by counting, e.g. for the transition into position -2:

  P_{-2}(A | C) = N_{(-3,-2)}(CA) / N_{(-3)}(C)

i.e. the number of training sites with CA at positions (-3,-2), divided by the number with C at position -3.

Scoring a candidate site S = S1 S2 ... S9 (positions -3 ... +6):

  P(S | +) = P_{-3}(S1) P_{-2}(S2|S1) P_{-1}(S3|S2) ... P_{+6}(S9|S8)
  P(S | -) = P_bg(S1) P_bg(S2|S1) P_bg(S3|S2) ... P_bg(S9|S8)

  R = P(S|+) / P(S|-),   score s = log2(R)

What about longer-range dependence? A k-th-order Markov model requires ~4^(k+1) parameters per position.

WMM vs 1st-Order Markov Models of Human 5'ss

[Figure: histograms of WMM 5'ss scores and inhomogeneous 1st-order Markov (I1M) 5'ss scores for true vs decoy 5' splice sites]

Splicing Model I

[Figure: stepwise spliceosome assembly across the intron. Consensus signals: 5' splice site A/CAG|GURAGU, branch point YNYURAY, polypyrimidine tract Y(10-20), 3' splice site YAG. U1 snRNP binds the 5' splice site (commitment complex), U2 snRNP joins at the branch point (ATP-dependent pre-spliceosome complex), then the U5 and U4/U6 snRNPs are recruited to form the spliceosome]

Splicing Model II: A Recent Model of Human Pre-mRNA Splicing

[Figure: (A, B) exon definition complexes: U1 snRNP at the 5'ss, SF1/BBP at the branch point, U2AF65/U2AF35 at the polypyrimidine tract and 3'ss (YAG), SR proteins bound to exonic splicing enhancers (ESEs), cap-binding complex (CBC) at the m7Gppp cap, pol II, poly(A) tail]

ESEs are short motifs that enhance recognition of adjacent splice sites in both constitutive and alternatively spliced exons; their precise sequence requirements are not well characterized.

Markov and Hidden Markov Models (courtesy of M. Yaffe)

Hidden Markov Models (aka HMMs) are a later development, originating in electrical engineering for applications to voice recognition.

Markov model: the observed sequence itself is the chain (e.g. positions -3 ... +6 of a splice site).
Hidden Markov model: a hidden state sequence (e.g. "genome" vs "CpG island") emits the observable sequence (... A C T C G A G T A ...).

CpG Islands

[Figure: %C+G along a genomic region, varying between ~30% and ~60%, with CpG islands standing out as high-%C+G segments]

CpG Island Hidden Markov Model

Transition probabilities:
  P_gg = 0.99999   (genome -> genome)
  P_gi = 0.00001   (genome -> island)
  P_ii = 0.999     (island -> island)
  P_ig = 0.001     (island -> genome)

Emission probabilities:
            C     G     A     T
  Island:   0.3   0.3   0.2   0.2
  Genome:   0.2   0.2   0.3   0.3

We observe the sequence (... A C T C G A G T A ...) and want to infer the hidden states; but the HMM is written in the other direction (the observable depends on the hidden).

Inferring the Hidden from the Observable (Bayes' Rule)

Conditional probability: P(A|B) = P(A,B) / P(B). Therefore

  P(H = h1,...,hn | O = o1,...,on)
    = P(H = h1,...,hn, O = o1,...,on) / P(O = o1,...,on)
    = P(H = h1,...,hn) P(O = o1,...,on | H = h1,...,hn) / P(O = o1,...,on)

P(O = o1,...,on) is somewhat difficult to calculate. But notice that

  P(H = h1,...,hn, O = o1,...,on) > P(H = h'1,...,h'n, O = o1,...,on)

implies

  P(H = h1,...,hn | O = o1,...,on) > P(H = h'1,...,h'n | O = o1,...,on)

so we can treat P(O = o1,...,on) as a constant.

Finding the Optimal "Parse" (Viterbi Algorithm)

We want the sequence of hidden states H_opt = h1_opt, h2_opt, h3_opt, ... which maximizes the joint probability P(H = h1,...,hn, O = o1,...,on) — the optimal "parse" of the sequence.

Solution: define R_i(h) = probability of the optimal parse of the subsequence 1..i ending in state h. Solve recursively, i.e. determine R_2(h) in terms of R_1(h), and so on.

(A. Viterbi: an MIT BS/MEng student in E.E., and a co-founder of Qualcomm.)

"Trellis" Diagram for Viterbi Algorithm

[Figure: trellis with the hidden states on the vertical axis and position in sequence (1 ... L) on the horizontal axis; the observed sequence T A T C G C ... A runs beneath]

Run time for a k-state HMM on a sequence of length L?
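The R_i(h) recursion and the runtime question above can be sketched in code. The transition and emission probabilities below are the ones given in the slides for the two-state CpG-island HMM; the uniform initial state distribution and all function names are my own additions:

```python
import math

# Two-state CpG-island HMM with the slide's transition/emission values;
# the uniform initial distribution is an added assumption.
STATES = ("genome", "island")
TRANS = {("genome", "genome"): 0.99999, ("genome", "island"): 0.00001,
         ("island", "island"): 0.999,   ("island", "genome"): 0.001}
EMIT = {"genome": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        "island": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}
INIT = {"genome": 0.5, "island": 0.5}

def viterbi(obs):
    """Most probable hidden-state path. For k states and sequence length L,
    each of the L-1 columns examines k*k transitions: run time O(k^2 * L)."""
    # V[h] = log-probability of the optimal parse of obs[:i+1] ending in
    # state h -- the R_i(h) of the lecture, computed in log space.
    V = {h: math.log(INIT[h]) + math.log(EMIT[h][obs[0]]) for h in STATES}
    back = []  # back[i][h] = best predecessor of state h at position i+1
    for o in obs[1:]:
        ptr, newV = {}, {}
        for h in STATES:
            prev = max(STATES, key=lambda p: V[p] + math.log(TRANS[(p, h)]))
            ptr[h] = prev
            newV[h] = V[prev] + math.log(TRANS[(prev, h)]) + math.log(EMIT[h][o])
        back.append(ptr)
        V = newV
    path = [max(STATES, key=V.get)]  # best final state, then trace back
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi("ACTCGAGTA"))  # short mixed sequence -> all "genome"
```

With these near-1 self-transition probabilities, entering an island costs about log(0.00001), so only a long C/G-rich run makes the island state worthwhile; a short sequence like the slide's A C T C G A G T A parses entirely as "genome".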
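As a closing example, the inhomogeneous 1st-order Markov scoring of a 5'ss described earlier (s = log2 R) might be sketched as follows. All probability values here are illustrative placeholders, not trained parameters, and the helper names are my own:

```python
import math

NTS = "ACGT"

def uniform_init():
    # P(nt) for the first position of the 9-nt window
    return {nt: 0.25 for nt in NTS}

def uniform_cond():
    # P(cur | prev) for one position-specific transition
    return {(p, c): 0.25 for p in NTS for c in NTS}

def score(seq, init_plus, cond_plus, init_bg, cond_bg):
    """Log-odds score s = log2[P(S|+)/P(S|-)] for S = S1..S9 under an
    inhomogeneous 1st-order Markov (+) model vs a background (-) model."""
    s = math.log2(init_plus[seq[0]] / init_bg[seq[0]])
    for i in range(1, len(seq)):
        pair = (seq[i - 1], seq[i])
        s += math.log2(cond_plus[i - 1][pair] / cond_bg[i - 1][pair])
    return s

# Hypothetical toy parameters: background uniform everywhere; the (+) model
# identical except that the transition into position +1 (string index 3)
# strongly favors G, as at a real 5' splice site.
init_bg, cond_bg = uniform_init(), [uniform_cond() for _ in range(8)]
init_plus, cond_plus = uniform_init(), [uniform_cond() for _ in range(8)]
cond_plus[2] = {(p, c): (0.85 if c == "G" else 0.05) for p in NTS for c in NTS}

print(score("CAGGTAAGT", init_plus, cond_plus, init_bg, cond_bg))  # ~1.77 bits
```

Here the consensus site CAG|GTAAGT gains log2(0.85/0.25) ≈ 1.77 bits at the one informative transition; in a real model every position-specific table would be estimated from aligned true sites, and the background tables from bulk genomic sequence, exactly as in the counting formula above.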