7.91 / 7.36 / BE.490 Lecture #3
Mar. 2, 2004
DNA Motif Modeling & Discovery
Chris Burge

Review of DNA Sequence Comparison/Alignment
• Target frequencies and mismatch penalties
• Eukaryotic gene structure
• Comparative genomics applications:
  - PipMaker (2-species comparison)
  - Phylogenetic shadowing (many-species comparison)
• Intro to DNA sequence motifs

Organization of Topics

  Lecture   Model                             Dependence structure
  3/2       Weight Matrix Model               Independence
  3/4       Hidden Markov Model               Local dependence
  3/9       Energy Model, Covariation Model   Non-local dependence

DNA Motif Modeling & Discovery (see Ch. 4 of Mount)
• Information content of a motif
• Review: WMMs for splice sites
• The motif finding/discovery problem
• The Gibbs Sampler
• Motif modeling: beyond weight matrices

Splicing Model I
[Figure: pre-mRNA with branch site, 5' splice site, and 3' splice site]

Weight Matrix Models II

5' splice signal (consensus C A G | G T ...):

  Pos:  -3   -2   -1  ...  +5   +6
  A    0.3  0.6  0.1  ...  0.1  0.1
  C    0.4  0.1  0.0  ...  0.1  0.2
  G    0.2  0.2  0.8  ...  0.8  0.2
  T    0.1  0.1  0.1  ...  0.0  0.5

Background (generic): A 0.25, C 0.25, G 0.25, T 0.25 at every position.

For a candidate site S = S1 S2 S3 S4 S5 S6 S7 S8 S9, the odds ratio is

  R = P(S|+) / P(S|-)
    = [P_-3(S1) P_-2(S2) P_-1(S3) ··· P_+5(S8) P_+6(S9)] / [P_bg(S1) P_bg(S2) P_bg(S3) ··· P_bg(S8) P_bg(S9)]

The background model is homogeneous and assumes independence between positions.

Weight Matrix Models III

  R = ∏_{k=1..9} P_{-4+k}(S_k) / P_bg(S_k)

  Score s = log2 R = ∑_{k=1..9} log2 [ P_{-4+k}(S_k) / P_bg(S_k) ]

Neyman-Pearson Lemma: optimal decision rules are of the form R > C, or equivalently log2(R) > C', because log is a monotone function.

Weight Matrix Models IV

Slide the WMM along the sequence:

  ttgacctagatgagatgtcgttcacttttactgagctacagaaaa ……

Assign a score to each 9-base window.
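The sliding-window log-odds scoring of Weight Matrix Models III-IV can be sketched as follows. This is a minimal sketch, not the lecture's software: the -3..-1 and +5..+6 columns come from the table above, but the +1..+4 columns (the near-invariant GT dinucleotide and the two positions after it) are illustrative placeholders that the slide elides, and the pseudocount for zero entries is my addition so that log2 is always defined.

```python
import math

# 5'ss weight matrix, positions -3..+6 (9 columns).
# Columns -3..-1 and +5..+6 are from the lecture's table; the +1..+4
# columns are illustrative placeholders, not from the slides.
WMM = {
    "A": [0.3, 0.6, 0.1, 0.00, 0.00, 0.60, 0.70, 0.1, 0.1],
    "C": [0.4, 0.1, 0.0, 0.00, 0.00, 0.02, 0.08, 0.1, 0.2],
    "G": [0.2, 0.2, 0.8, 1.00, 0.00, 0.35, 0.12, 0.8, 0.2],
    "T": [0.1, 0.1, 0.1, 0.00, 1.00, 0.03, 0.10, 0.0, 0.5],
}
BG = 0.25      # homogeneous background, positions independent
W = 9
PSEUDO = 1e-4  # stand-in for zero entries so log2 is defined (my addition)

def score(window):
    """s = log2 R = sum_k log2( P_k(S_k) / P_bg(S_k) )"""
    return sum(math.log2((WMM[b][k] or PSEUDO) / BG)
               for k, b in enumerate(window))

# Slide the WMM along the sequence from the slide, scoring each window.
seq = "ttgacctagatgagatgtcgttcacttttactgagctacagaaaa".upper()
windows = [(i, score(seq[i:i + W])) for i in range(len(seq) - W + 1)]
best = max(windows, key=lambda iw: iw[1])
print(best[0], seq[best[0]:best[0] + W])
```

A consensus-like window such as CAGGTAAGT scores strongly positive under this matrix, while windows that hit a zero-probability column score strongly negative, which is what makes a score cutoff usable as a classifier.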
Use a score cutoff to predict potential 5' splice sites.

Histogram of 5'ss Scores
[Figure: histograms of WMM scores (in 1/10-bit units) for true 5' splice sites vs. "decoy" 5' splice sites]

Measuring accuracy:
  Sensitivity (Sn) = % of true sites with score > cutoff
  Specificity (Sp) = % of sites with score > cutoff that are true sites

At three example cutoffs:
  Sn:  20%  50%  90%
  Sp:  50%  32%   7%

What does this result tell us?
A) The splicing machinery also uses other information besides the 5'ss motif to identify splice sites; OR
B) The WMM does not accurately capture some aspects of the 5'ss that are used in recognition
(or both). This is a pretty common situation in biology.

What is a DNA (RNA) Motif?

A pattern common to a set of DNA (RNA) sequences that share a common biological property, such as being binding sites for a regulatory protein.

Common motif adjectives:
  exact/precise versus degenerate
  strong versus weak (good versus lousy)
  high information content versus low information content

Information Theory (slides courtesy of M. Yaffe)

We end up with Shannon's famous formula:

  H = - ∑_{i=1..20} p_i log2 p_i

where H, the "Shannon entropy", is measured in bits per position in the alignment. H is a measure of entropy, randomness, or disorder: it tells us how much uncertainty there is about the amino acid at one position in a sequence motif.

  ….GDSFHQFVSHG…..
  ….SDAFHQYISFG…..
  ….GDSYWNFLSFG…..
  ….SDSFHQFMSFG…..
  ….LDSYWNYASFG…..

Assuming all 20 amino acids are equally likely a priori, a completely conserved position has H_before = log2(20) ≈ 4.32 and H_after = 0; therefore this position encodes 4.32 - 0 = 4.32 bits of information. Another position in the motif that contains all 20 amino acids equally often has H_before = 4.32 and H_after = 4.32; therefore it encodes 4.32 - 4.32 = 0 bits of information.

Information Content of a DNA Motif

Information at position j: I_j = H_before - H_after
Motif probabilities: p_k (k = A, C, G, T)
Background probabilities: q_k = 1/4 (k = A, C, G, T)

  I_j = ( - ∑_{k=1..4} q_k log2 q_k ) - ( - ∑_{k=1..4} p_k log2 p_k ) = 2 - H_j

  I_motif = ∑_{j=1..w} I_j = 2w - H_motif   (motif of width w bases)

Log base 2 gives entropy/information in 'bits'.

Mean Bit-score of a Motif

Bit-score: log2( p_k / q_k )

Mean bit-score (motif width w, n = 4^w, q_k = 1/4^w):

  ∑_{k=1..n} p_k log2( p_k / q_k ) = 2w - H_motif = I_motif

Rule of thumb*: a motif with m bits of information will occur about once every 2^m bases of random sequence.
* True for regular expressions, approximately true for other motifs.

The Motif Finding Problem

Unaligned:
agggcactagcccatgtgagagggcaaggaccagcggaag
taattcagggccaggatgtatctttctcttaaaaataaca
tatcctacagatgatgaatgcaaatcagcgtcacgagctt
tggcgggcaaggtgcttaaaagataatatcgaccctagcg
attcgggtaccgttcataaaagtacgggaatttcgggtag
gttatgttaggcgagggcaaaagtcatatacttttaggtc
aagagggcaatgcctcctctgccgattcggcgagtgatcg
gatggggaaaatatgagaccaggggagggccacactgcag
ctgccgggctaacagacacacgtctagggctgtgaaatct
gtaggcgccgaggccaacgctgagtgtcgatgttgagaac
attagtccggttccaagagggcaactttgtatgcaccgcc
gcggcccagtgcgcaacgcacagggcaaggtttactgcgg
ccacatgcgagggcaacctccctgtgttgggcggttctga
gcaattgtaaaacgacggcaatgttcggtcgcctaccctg
gataaagaggggggtaggaggtcaactcttccgtattaat
aggagtagagtagtgggtaaactacgaatgcttataacat
gcgagggcaatcgggatctgaaccttctttatgcgaagac
tccaggaggaggtcaacgactctgcatgtctgacaacttg
gtcatagaattccatccgccacgcggggtaatttggacgt
gtgccaacttgtgccggggggctagcagcttcccgtcaaa
cgcgtttggagtgcaaacatacacagcccgggaatataga
aagatacgagttcgatttcaagagttcaaaacgtgacggg
gacgaaacgagggcgatcaatgcccgataggactaataag
tagtacaaacccgctcacccgaaaggagggcaaatacctt
atatacagccaggggagacctataactcagcaaggttcag
cgtatgtactaattgtggagagcaaatcattgtccacgtg
…

Aligned:
gcggaagagggcactagcccatgtgagagggcaaggacca
atctttctcttaaaaataacataattcagggccaggatgt
gtcacgagctttatcctacagatgatgaatgcaaatcagc
taaaagataatatcgaccctagcgtggcgggcaaggtgct
gtagattcgggtaccgttcataaaagtacgggaatttcgg
tatacttttaggtcgttatgttaggcgagggcaaaagtca
ctctgccgattcggcgagtgatcgaagagggcaatgcctc
aggatggggaaaatatgagaccaggggagggccacactgc
acacgtctagggctgtgaaatctctgccgggctaacagac
gtgtcgatgttgagaacgtaggcgccgaggccaacgctga
atgcaccgccattagtccggttccaagagggcaactttgt
ctgcgggcggcccagtgcgcaacgcacagggcaaggttta
tgtgttgggcggttctgaccacatgcgagggcaacctccc
gtcgcctaccctggcaattgtaaaacgacggcaatgttcg
cgtattaatgataaagaggggggtaggaggtcaactcttc
aatgcttataacataggagtagagtagtgggtaaactacg
tctgaaccttctttatgcgaagacgcgagggcaatcggga
tgcatgtctgacaacttgtccaggaggaggtcaacgactc
cgtgtcatagaattccatccgccacgcggggtaatttgga
tcccgtcaaagtgccaacttgtgccggggggctagcagct
acagcccgggaatatagacgcgtttggagtgcaaacatac
acgggaagatacgagttcgatttcaagagttcaaaacgtg
cccgataggactaataaggacgaaacgagggcgatcaatg
ttagtacaaacccgctcacccgaaaggagggcaaatacct
agcaaggttcagatatacagccaggggagacctataactc
gtccacgtgcgtatgtactaattgtggagagcaaatcatt
…

Motif Finding Example: The Gibbs Sampler

The Gibbs sampler is a Monte Carlo method which seeks to maximize a likelihood function over the input sequence data. The likelihood function for a sequence s with a motif at location A, given weight matrix Θ and background frequency vector θ_B:

  s = "actactgtatcgtactgactgattaggccatgactgcat"
(here the motif, of width 8, is the window "tatcgtac" at location A)

  P(s, A | Θ, θ_B) = θ_B,a × ... × θ_B,g × Θ_1,t × Θ_2,a × ... × Θ_8,c × θ_B,t × ... × θ_B,t

The background frequencies θ_B generate the flanking bases, and the weight matrix Θ generates the motif. (Lawrence et al., Science 1993)

Prepare yourself for the Gibbs Sampling Algorithm in:
• Pictures
• Words
• Movies

The Gibbs Sampling Algorithm in Pictures
1. Select a random position (a candidate motif instance) in each sequence of the set.
2. Build a weight matrix Θ from the selected sites.
3. Select a sequence at random.
4. Score the possible sites in that sequence using the weight matrix Θ (likelihood/probability).
5. Sample a new site with probability proportional to its likelihood.
6. Update the weight matrix Θ.
7. Iterate until convergence (no change in sites/Θ).

The Gibbs Sampling Algorithm in Words (Lawrence et al., Science 1993)

Given N sequences of length L and desired motif width W:

Step 1) Choose a starting position in each sequence at random: a_1 in seq 1, a_2 in seq 2, ..., a_N in sequence N.
Step 2) Choose a sequence at random from the set (say, seq 1).
Step 3) Make a weight matrix model of width W from the sites in all sequences except the one chosen in step 2.
Step 4) Assign a probability to each position in seq 1 using the weight matrix model constructed in step 3: p = { p_1, p_2, p_3, ..., p_(L-W+1) }.
Step 5) Sample a starting position in seq 1 based on this probability distribution and set a_1 to this new position.
Step 6) Choose a sequence at random from the set (say, seq 2).
Step 7) Make a weight matrix model of width W from the sites in all sequences except the one chosen in step 6.
Step 8) Assign a probability to each position in seq 2 using the weight matrix model constructed in step 7.
Step 9) Sample a starting position in seq 2 based on this distribution.
Step 10) Repeat until convergence.

What, if anything, does this algorithm accomplish (besides keeping your computer busy)?
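The ten steps above can be sketched directly in Python. This is a minimal sketch, not Lawrence et al.'s implementation: the pseudocounts, the uniform background, and the fixed iteration count are my simplifications (the slides iterate until the sites and Θ stop changing). The input is six of the lecture's own sequences, each of which contains a GAGGGCAA-like site.

```python
import random

BASES = "ACGT"

def build_wmm(seqs, starts, skip, W, pseudo=0.5):
    """Weight matrix from the current sites of all sequences except
    `skip` (steps 3/7); pseudocounts avoid zero probabilities."""
    counts = [dict.fromkeys(BASES, pseudo) for _ in range(W)]
    for i, (s, a) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(W):
            counts[j][s[a + j]] += 1
    total = len(seqs) - 1 + 4 * pseudo
    return [{b: col[b] / total for b in BASES} for col in counts]

def gibbs_motif(seqs, W, iters=2000, seed=0):
    rng = random.Random(seed)
    # Step 1: a random starting position in each sequence
    starts = [rng.randrange(len(s) - W + 1) for s in seqs]
    for _ in range(iters):
        i = rng.randrange(len(seqs))          # steps 2/6: pick a sequence
        wmm = build_wmm(seqs, starts, i, W)   # steps 3/7
        s = seqs[i]
        # Steps 4/8: relative likelihood of each window vs. uniform background
        weights = []
        for a in range(len(s) - W + 1):
            w = 1.0
            for j, base in enumerate(s[a:a + W]):
                w *= wmm[j][base] / 0.25
            weights.append(w)
        # Steps 5/9: sample a new start proportional to likelihood
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Six of the lecture's input sequences
seqs = [s.upper() for s in [
    "gcggaagagggcactagcccatgtgagagggcaaggacca",
    "tatacttttaggtcgttatgttaggcgagggcaaaagtca",
    "ctctgccgattcggcgagtgatcgaagagggcaatgcctc",
    "tgtgttgggcggttctgaccacatgcgagggcaacctccc",
    "tctgaaccttctttatgcgaagacgcgagggcaatcggga",
    "ttagtacaaacccgctcacccgaaaggagggcaaatacct",
]]
starts = gibbs_motif(seqs, W=8)
print([s[a:a + 8] for s, a in zip(seqs, starts)])
```

Because the sampler is stochastic, different seeds can converge to different (sometimes shifted) motifs, which is exactly why the summary below recommends running it several times and comparing results.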
Input Sequences with Strong Motif
[Figure: cumulative base frequencies (A, C, G, T; 0-100%) at each motif position]

Input Sequences (Weak Motif)
[The same 26 sequences shown earlier under "The Motif Finding Problem"]

Gibbs Sampler Summary
• A stochastic (Monte Carlo) algorithm for motif finding
• Works by 'stumbling' onto a few motif instances, which bias the weight matrix, which causes it to sample more motif instances, which biases the weight matrix more, ... until convergence
• Not guaranteed to converge to the same motif every time: run it several times and compare the results
• Works for protein, DNA, and RNA motifs

MEME - Multiple EM for Motif Elicitation
• Another popular motif-finding algorithm; it optimizes a similar likelihood function using an algorithm called 'expectation maximization' (EM)
• Unlike the Gibbs Sampler, MEME is deterministic
Bailey & Elkan, Proc. ISMB, 1994

How is the 5'ss recognized?
(Recall the 5'ss weight matrix and the histogram of true vs. decoy 5'ss scores shown earlier.)

The 5' end of U1 snRNA base-pairs with the 5' splice site:

  U1 snRNA   3' ………CCAUUCAUAG-5'
                       ||||||
  Pre-mRNA   5' …………UUCGUGAGU…… 3'

RNA Energetics I

  …CCAUUCAUAG-5'
       ||||||
  5' …CGUGAGU… 3'

The free energy of helix formation derives from:
• base pairing: G·C > A·U > G·U
• base stacking between adjacent base pairs, e.g. the stack

    5'-G p A-3'
       |   |
    3'-C p U-5'

Doug Turner's energy rules give the stacking free energy (kcal/mol) for the stack 5'-UX-3' / 3'-AY-5', i.e. a U·A pair stacked on an X·Y pair (entries only where X·Y can pair):

  X \ Y     A       C       G       U
  A         .       .       .      -1.30
  C         .       .      -2.40    .
  G         .      -2.10    .      -1.00
  U        -0.90    .      -1.30    .
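The table above can be used additively: the stacking free energy of a helix is roughly the sum of the energies of its adjacent-pair stacks. A minimal sketch; note that the slide shows only the U·A row of the full Turner tables, so the helper below handles just the special case of a helix built from consecutive U·A pairs, and the dictionary and function names are mine.

```python
# Stacking free energies (kcal/mol) for the dinucleotide stack
#   5'-U X-3'
#   3'-A Y-5'
# read from the slide's Turner-rules table (entries exist only where
# X and Y can pair). The full Turner rules tabulate all first pairs;
# only the U·A row appears on the slide.
STACK_ON_UA = {
    ("A", "U"): -1.30, ("C", "G"): -2.40, ("G", "C"): -2.10,
    ("G", "U"): -1.00, ("U", "A"): -0.90, ("U", "G"): -1.30,
}

def poly_UA_helix_dG(n_pairs):
    """Stacking ΔG of n_pairs consecutive U·A pairs: every adjacent
    step is the (U·A)-on-(U·A) stack, so there are n_pairs - 1 terms."""
    return (n_pairs - 1) * STACK_ON_UA[("U", "A")]

print(round(poly_UA_helix_dG(4), 2))   # three stacks of -0.90 kcal/mol
```

More consecutive base pairs means more (negative) stacking terms and a more stable helix, which is the quantitative basis for the qualitative rules on the next slide.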
RNA Energetics II

  A)  N p N p N p N p N p N p N
      x |   |   |   |   |   x
      N p N p N p N p N p N p N     lots of consecutive base pairs: good

  B)  N p N p N p N p N p N p N
      x |   |   x   |   |   x
      N p N p N p N p N p N p N     internal loop: bad

  C)  N p N p N p N p N p N p N
      x |   |   |   x   |   x
      N p N p N p N p N p N p N     terminal base pair not stable: bad

Generally A will be more stable than B or C.

Conditional Frequencies in 5'ss Sequences
(Data from Burge, 1998, "Computational Methods in Molecular Biology"; frequencies in percent)

5'ss which have G at +5:

  Pos   -1   +3   +4   +6
  A      9   44   75   14
  C      4    3    4   18
  G     78   51   13   19
  T      9    3    9   49

5'ss which lack G at +5:

  Pos   -1   +3   +4   +6
  A      2   81   51   22
  C      1    3   28   20
  G     97   15    9   30
  T      0    2   12   28

What kind of model could incorporate interactions between positions?
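One kind of model the tables suggest: make a column's probabilities conditional on the base at another position, here position -1 conditioned on whether +5 is G. A minimal sketch using the numbers above; the pairing of the two tables with their headers follows the slide's left-to-right order, and the function name and pseudocount are mine.

```python
import math

# P(base at -1 | +5 identity), percent / 100, from the tables above.
P_MINUS1 = {
    True:  {"A": 0.09, "C": 0.04, "G": 0.78, "T": 0.09},  # 5'ss with G at +5
    False: {"A": 0.02, "C": 0.01, "G": 0.97, "T": 0.00},  # 5'ss lacking G at +5
}
BG = 0.25

def minus1_bits(base_m1, base_p5, pseudo=1e-3):
    """log2-odds score for position -1, conditioned on whether +5 is G,
    instead of the single marginal column a WMM would use."""
    p = P_MINUS1[base_p5 == "G"][base_m1] or pseudo  # pseudo for 0% entries
    return math.log2(p / BG)

# When +5 lacks G, a G at -1 is nearly obligatory (97%) and scores higher
# than when +5 has G (78%): the two positions are not independent, so a
# conditional model captures information a plain WMM cannot.
print(round(minus1_bits("G", "A"), 2), round(minus1_bits("G", "G"), 2))
```

This is the simplest step beyond position independence; the lecture's later models (energy and covariation models) generalize the same idea to non-adjacent interactions.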