7.91 / 7.36 / BE.490 Lecture #5
Mar. 9, 2004

Markov Models & DNA Sequence Evolution
Chris Burge

Review of Markov & HMM Models for DNA
• Hidden Markov Models - looking under the hood (Ch. 4 of Mount)
• Markov Models for splice sites
• The Viterbi Algorithm
• Real World HMMs

CpG Islands
[Figure: %C+G plotted along a genomic region (y-axis: 30-60%); a CpG island appears as a local peak in C+G content]

CpG Island Hidden Markov Model

Hidden: the state sequence (Island or Genome at each position)
Observable: the base sequence  … A C T C G A G T A …

"Transition probabilities" (hidden state to hidden state):
  P_ii = 0.999      P_ig = 0.001      (Island → Island, Island → Genome)
  P_gg = 0.99999    P_gi = 0.00001    (Genome → Genome, Genome → Island)

"Emission probabilities" (hidden state to observed base):
               C     G     A     T
  CpG Island:  0.3   0.3   0.2   0.2
  Genome:      0.2   0.2   0.3   0.3

We observe the bases and want to infer the hidden states, but the HMM is written in the other direction (the observable depends on the hidden).

Inferring the Hidden from the Observable (Bayes' Rule)

Conditional probability: P(A|B) = P(A,B) / P(B), so

  P(H = h_1,...,h_n | O = o_1,...,o_n)
    = P(H = h_1,...,h_n, O = o_1,...,o_n) / P(O = o_1,...,o_n)
    = P(H = h_1,...,h_n) P(O = o_1,...,o_n | H = h_1,...,h_n) / P(O = o_1,...,o_n)

The denominator P(O = o_1,...,o_n) is somewhat difficult to calculate. But notice that

  P(H = h_1,...,h_n, O = o_1,...,o_n) > P(H = h'_1,...,h'_n, O = o_1,...,o_n)

implies

  P(H = h_1,...,h_n | O = o_1,...,o_n) > P(H = h'_1,...,h'_n | O = o_1,...,o_n)

so for comparing parses we can treat P(O = o_1,...,o_n) as a constant.

Finding the Optimal "Parse" (Viterbi Algorithm)

Want to find the sequence of hidden states H^opt = h_1^opt, h_2^opt, h_3^opt, … (the optimal "parse" of the sequence) which maximizes the joint probability P(H = h_1,...,h_n, O = o_1,...,o_n).

Solution: define R_i(h) = probability of the optimal parse of the subsequence 1..i ending in state h. Solve recursively, i.e. determine R_2(h) in terms of R_1(h), and so on.

(A. Viterbi: an MIT BS/MEng student in E.E., and a founder of Qualcomm)

"Trellis" Diagram for Viterbi Algorithm
[Figure: trellis with position in sequence 1 … i, i+1, i+2, i+3, i+4 … L across the top, the observed bases (T … A T C G C … A) below it, and the hidden states down the side; each column holds R_i(h) for every hidden state h]

Run time for a k-state HMM on a sequence of length L?

Viterbi Algorithm Examples

What is the optimal parse of the sequence (ACGT)^10000?
Of A^1000 C^80 T^1000 C^40 A^1000 G^60 T^1000?

(Hint: each C or G emitted in the Island state gains a factor of 0.3/0.2 = 1.5 over the Genome state, which must outweigh the cost of entering and leaving the island, P_gi × P_ig ≈ 10^-8.)

Powers of 1.5:
  N       20       40       60        80
  1.5^N   3x10^3   1x10^7   3x10^10   1x10^14
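The recursion and the run-time question are left open on the slides; below is a minimal Python sketch of the Viterbi recursion R_i(h) for the two-state CpG island HMM above (an illustration, not code from the lecture). It uses the transition and emission probabilities from the earlier slides; the log-probability arithmetic (to avoid underflow on long sequences) and the assumed initial state distribution INIT are choices of this sketch. The two nested loops over states inside the single loop over positions also answer the run-time question: O(k^2 L) for a k-state HMM on a sequence of length L.

```python
import math

# Two-state CpG island HMM from the slides above.
STATES = ("island", "genome")
TRANS = {
    ("island", "island"): 0.999,   ("island", "genome"): 0.001,
    ("genome", "genome"): 0.99999, ("genome", "island"): 0.00001,
}
EMIT = {
    "island": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "genome": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
INIT = {"island": 0.00001, "genome": 0.99999}  # assumed; not given in the lecture

def viterbi(seq):
    """Return the optimal parse (hidden state path) for an observed sequence.
    R[i][h] = log-probability of the best parse of seq[0..i] ending in state h."""
    R = [{h: math.log(INIT[h] * EMIT[h][seq[0]]) for h in STATES}]
    back = [{}]
    for i in range(1, len(seq)):
        R.append({})
        back.append({})
        for h in STATES:
            # Viterbi recursion: best previous state to have come from.
            g = max(STATES, key=lambda s: R[i - 1][s] + math.log(TRANS[(s, h)]))
            R[i][h] = R[i - 1][g] + math.log(TRANS[(g, h)]) + math.log(EMIT[h][seq[i]])
            back[i][h] = g
    # Trace back from the most probable final state.
    h = max(STATES, key=lambda s: R[-1][s])
    path = [h]
    for i in range(len(seq) - 1, 0, -1):
        h = back[i][h]
        path.append(h)
    return path[::-1]

# A long C+G-rich run flips the parse into the Island state:
parse = viterbi("A" * 50 + "CG" * 40 + "A" * 50)
print(set(parse[55:130]))  # expected: {'island'}
```

Run on the second example from the slide, this would parse the C^80 and G^60 runs as Island and leave the C^40 run as Genome: per the powers-of-1.5 table, gains of roughly 10^14 and 3x10^10 outweigh the ~10^-8 cost of entering and leaving the island, while 10^7 does not.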
What else can you model with HMMs?

Bacterial gene: Open Reading Frame (ORF)
[Diagram: a bacterial gene drawn as an ORF running from a Start codon to a Stop codon]

Parameter Estimation for HMMs

How many parameters does a k-state HMM over an alphabet of size 4 have?
  Initial probabilities:
  Transition probabilities:
  Emission probabilities:

Pseudocounts (courtesy of M. Yaffe)

If the number of sequences in the training set is both large and diverse, then the sequences in the training set represent a good statistical sampling of the motif. If not, then we have a sampling error! Correct for this by adding pseudocounts. How many to add?
→ Too many pseudocounts dominate the frequencies, and the resulting matrix won't work!
→ Too few pseudocounts and we'll miss many amino acid variations; the matrix will find only the sequences that produced the motif!

Add few pseudocounts if sampling is good (robust), and add more pseudocounts if sampling is sparse. One reasonable approach is to add √N pseudocounts, where N is the number of sequences. As N increases, the influence of the pseudocounts decreases, since N increases faster than √N; the drawback is that √N doesn't add enough at low N.

Dealing With Small Training Sets

Training set (10 sequences, positions 1-5):
  ACCTG  AGCTG  ACCCG  ACCTG  ACCCA
  GACTG  ACGTA  ACCTG  CCCCG  ACATC

Counts at position 1:  A 8   C 1   G 1   T 0

If the true frequency of T at position 1 were 10%, what is the probability that we would see no Ts in a sample of 10 sequences?

  P(N=0) = [10! / (0! 10!)] (0.1)^0 (0.9)^10 ≈ 35%

So we should add pseudocounts.

Pseudocounts (Ψcounts)

  Nt    Count   +Ψcount   ML est.   Bayes est.
  A     8       9         0.80      0.64
  C     1       2         0.10      0.14
  G     1       2         0.10      0.14
  T     0       1         0.00      0.07
  Sum   10      14        1.0       1.0

The "add 1 to each observed count" rule can be derived analytically from the Bayesian posterior distribution under a Dirichlet prior - see Appendix A of the statistics primer for details.
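The table's arithmetic is simple enough to check mechanically. Here is a short Python sketch (an addition, not from the slides) that reproduces the ML and Bayes estimates above, plus the binomial calculation from the previous slide showing why the correction is needed.

```python
# Position 1 counts from the 10-sequence training set above.
counts = {"A": 8, "C": 1, "G": 1, "T": 0}

n = sum(counts.values())                          # 10 observations
pseudo = {nt: c + 1 for nt, c in counts.items()}  # the "add 1" Dirichlet rule
total = sum(pseudo.values())                      # 10 + 4 = 14

for nt in "ACGT":
    ml = counts[nt] / n         # maximum-likelihood estimate (e.g. T: 0/10 = 0.00)
    bayes = pseudo[nt] / total  # Bayes estimate (e.g. T: 1/14 = 0.07)
    print(f"{nt}: ML = {ml:.2f}  Bayes = {bayes:.2f}")

# Chance of seeing zero Ts in 10 sequences if the true T frequency is 10%:
print(f"P(N=0) = {0.9 ** 10:.2f}")  # ~0.35, hence the pseudocounts
```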
Real World HMMs

Please see: http://www.cbs.dtu.dk/services/TMHMM/

Reference for TMHMM:
Krogh, A, B Larsson, G von Heijne, and EL Sonnhammer. "Predicting Transmembrane Protein Topology with a Hidden Markov Model: Application to Complete Genomes." J Mol Biol 305, no. 3 (19 January 2001): 567-80.

Architecture of TMHMM
Please see figures 1a and 1c of Krogh et al. (2001), above.

TMHMM Output for Mouse Chloride Channel CLC6
[Figure: optimal parse and posterior probability tracks for the inside, outside, and transmembrane states along the protein]

Genscan Model

A semi-Markov HMM incorporating:
• Transcriptional signals
• Splicing signals
• Translational signals
• Composition of exons
• Composition of introns
• Other gene features
(Burge & Karlin, J Mol Biol, 1997)

Genscan predictions in human CD4 gene region
[Figure: annotated exons vs. Genscan predicted exons across the region]
Overall: ~75% of exons exactly correct (Burge & Karlin, J Mol Biol, 1997)

Genscan, GenomeScan Predictions in Human BRCA1 Region
Please see figure 1 of Yeh, RF, LP Lim, and CB Burge. "Computational Inference of Homologous Gene Structures in the Human Genome." Genome Res 11, no. 5 (May 2001): 803-16.

DNA Sequence Evolution

Generation n-1 (grandparent)
5’ TGGCATGCACCCTGTAAGTCAATATAAATGGCTACGCCTAGCCCATGCGA 3’
   ||||||||||||||||||||||||||||||||||||||||||||||||||
3’ ACCGTACGTGGGACATTCAGTTATATTTACCGATGCGGATCGGGTACGCT 5’

Generation n (parent)
5’ TGGCATGCACCCTGTAAGTCAATATAAATGGCTATGCCTAGCCCATGCGA 3’
   ||||||||||||||||||||||||||||||||||||||||||||||||||
3’ ACCGTACGTGGGACATTCAGTTATATTTACCGATACGGATCGGGTACGCT 5’

Generation n+1 (child)
5’ TGGCATGCACCCTGTAAGTCAATATAAATGGCTATGCCTAGCCCGTGCGA 3’
   ||||||||||||||||||||||||||||||||||||||||||||||||||
3’ ACCGTACGTGGGACATTCAGTTATATTTACCGATACGGATCGGGCACGCT 5’

What is a Markov Model (aka Markov Chain)?

Classical definition: a discrete stochastic process X_1, X_2, X_3, … which has the Markov property:

  P(X_{n+1} = j | X_1 = x_1, X_2 = x_2, …, X_n = x_n) = P(X_{n+1} = j | X_n = x_n)
  (for all x_i, all j, all n)

In words: a random process in which the future (next state) is conditionally independent of the past given the present (current state).

(Markov: a Russian mathematician, ca. 1922)

DNA Sequence Evolution is a Markov Process (no selection case)

S_n = base at generation n,  P_ij = P(S_{n+1} = j | S_n = i)

        | P_AA  P_AC  P_AG  P_AT |
  P  =  | P_CA  P_CC  P_CG  P_CT |
        | P_GA  P_GC  P_GG  P_GT |
        | P_TA  P_TC  P_TG  P_TT |

q_n = (q_A, q_C, q_G, q_T) = vector of probabilities of the four bases at generation n.

Handy relations:  q_{n+1} = q_n P   and   q_{n+k} = q_n P^k

Limit Theorem for Markov Chains

With S_n and P_ij as above: if P_ij > 0 for all i, j (and Σ_j P_ij = 1 for all i), then there is a unique probability vector r such that

  r P = r   and   lim_{n→∞} q P^n = r   (for any probability vector q)

r is called the "stationary" or "limiting" distribution of P.
(See Ch. 4 of Taylor & Karlin, An Introduction to Stochastic Modeling, 1984, for details.)

Stationary Distribution Examples

2-letter alphabet: R = purine, Y = pyrimidine. What are the stationary distributions for:

  I  = | 1  0 |        Q  = | 0  1 |
       | 0  1 |             | 1  0 |

  P  = | 1-p   p  |    (0 < p < 1)
       |  p   1-p |

  P′ = | 1-p   p  |    (0 < p < 1,  0 < q < 1)
       |  q   1-q |

How does entropy change when a Markov transition matrix is applied?
If the limiting distribution is uniform, then entropy increases (analogous to the 2nd Law of Thermodynamics). However, this is not true in general (why not?). How rapidly is the stationary distribution approached?

Jukes-Cantor Model (courtesy of M. Yaffe)

[Diagram: the four nucleotides A, C, G, T, with substitution rate α between every pair]

Assume each nucleotide is equally likely to change into any other nucleotide, with rate of change α, so the overall rate of substitution is 3α. If the base is G at t=0, then at t=1

  P_G(1) = 1 - 3α   and   P_G(2) = (1 - 3α) P_G(1) + α [1 - P_G(1)]

Expanding this gives

  P_G(t) = 1/4 + (3/4) e^{-4αt}

One can show that this gives

  K = -(3/4) ln[1 - (4/3) p]

where K is the true number of substitutions that have occurred and p is the fraction of nucleotides that differ by a simple count. Captures the general behaviour. (A numerical sketch of these formulas follows the reading list below.)

Literature Discussion, Tues. 3/16

Paper #1: Kellis, M, N Patterson, M Endrizzi, B Birren, and ES Lander. "Sequencing and Comparison of Yeast Species to Identify Genes and Regulatory Elements." Nature 423, no. 6937 (15 May 2003): 241-54.
  Part 1 - Finding Genes, etc., pp. 241-247
  Part 2 - Regulatory Elements, pp. 247-254

Paper #2: Rivas, E, RJ Klein, TA Jones, and SR Eddy. "Computational Identification of Noncoding RNAs in E. coli by Comparative Genomics." Curr Biol 11, no. 17 (4 September 2001): 1369-73.
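As a supplement (an addition, not part of the slides), here is a numerical sketch connecting the "handy relations" q_{n+1} = q_n P to the Jukes-Cantor model above: it iterates the one-step Jukes-Cantor matrix, compares P_G(t) against the closed form 1/4 + (3/4)e^{-4αt}, and recovers K from the observed fraction p of changed sites. The rate α = 0.01 is an arbitrary value chosen for illustration.

```python
import math

# One-step Jukes-Cantor matrix over (A, C, G, T): every off-diagonal
# substitution has rate alpha (an assumed value, for illustration only).
alpha = 0.01
P = [[1 - 3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def step(q, P):
    """One generation of the 'handy relation' q_{n+1} = q_n P."""
    return [sum(q[i] * P[i][j] for i in range(4)) for j in range(4)]

q = [0.0, 0.0, 1.0, 0.0]  # the site is G at t = 0
for t in range(1, 101):
    q = step(q, P)
    if t in (1, 10, 100):
        # Closed form from the slide (continuous-time approximation).
        pg = 0.25 + 0.75 * math.exp(-4 * alpha * t)
        print(f"t={t:3d}  P_G(t) = {q[2]:.4f}  closed form = {pg:.4f}")

# Jukes-Cantor correction: the observed fraction of differing sites, p,
# undercounts substitutions once multiple hits accumulate; K corrects it.
p = 1.0 - q[2]                               # fraction differing at t = 100
K = -0.75 * math.log(1.0 - (4.0 / 3.0) * p)  # corrected substitutions per site
print(f"p = {p:.3f}  K = {K:.3f}  (true 3*alpha*t = {3 * alpha * 100:.3f})")
```

P_G(t) decays toward the uniform stationary value 1/4, illustrating the limit theorem, and the corrected estimate K ≈ 3αt ≈ 3.0 even though the naive difference count p saturates near 0.74.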