7.91 / 7.36 / BE.490 Lecture #6 Mar. 11, 2004 Predicting RNA Secondary Structure Chris Burge Review of Markov Models & DNA Evolution Ch. 4 of Mount ? CpG Island HMM ? The Viterbi Algorithm ? Real World HMMs ? Markov Models for DNA Evolution DNA Sequence Evolution Generation n-1 (grandparent) 5’ TGGCATGCACCCTGTAAGTCAATATAAATGGCTACGCCTAGCCCATGCGA 3’ |||||||||||||||||||||||||||||||||||||||||||||||||| 3’ ACCGTACGTGGGACATTCAGTTATATTTACCGATGCGGATCGGGTACGCT 5’ 5’ TGGCATGCACCCTGTAAGTCAATATAAATGGCTATGCCTAGCCCATGCGA 3’ |||||||||||||||||||||||||||||||||||||||||||||||||| 3’ ACCGTACGTGGGACATTCAGTTATATTTACCGATACGGATCGGGTACGCT 5’ Generation n (parent) Generation n+1 (child) 5’ TGGCATGCACCCTGTAAGTCAATATAAATGGCTATGCCTAGCCCGTGCGA 3’ |||||||||||||||||||||||||||||||||||||||||||||||||| 3’ ACCGTACGTGGGACATTCAGTTATATTTACCGATACGGATCGGGCACGCT 5’ What is a Markov Model (aka Markov Chain)? Classical Definition A discrete stochastic process X 1 , X 2 , X 3 , … which has the Markov property: P(X n+1 = j | X 1 =x 1 , X 2 =x 2 , … X n =x n ) = P(X n+1 = j | X n =x ) n (for all x , all j, all n) i In words: A random process which has the property that the future (next state) is conditionally independent of the past given the present (current state) Markov - a Russian mathematician, ca. 1922 DNA Sequence Evolution is a Markov Process No selection case ? P AA P AC P AG P AT ? P CC P CG P CT ? S n = base at generation n P = ? ? P CA ? ? P GA P GC P GG P GT ? ? P ij = P ( S = j | S n = i ) ? ? P TA P TC P TG P TT ? n +1 null q n =( q A ,q C ,q , q T ) = vector of prob’s of bases at gen. n G Handy relations: null q n +1 null P q n = null q n +k = null q n P k Limit Theorem for Markov Chains S n = base at generation n P ij =P ( S n +1 = j | S n =i ) If P ij >0 for all i,j (and ∑ P ij =1 for all i) j null then there is a unique vector P n null Pr null r r such that null q null r null lim = q= and (for any prob. vector ) n →∞ null r is called the “stationary” or “limiting” distribution of P See Ch. 4, Taylor & Karlin, An Introduction to Stochastic Modeling, 1984 for details Stationary Distribution Examples 2-letter alphabet: R = purine, Y = pyrimidine Stationary distributions for: ? 10 ? ? 01 ? I = ? ? Q = ? ? ? 01? ? 10? ? 1 ? p p ? P = ? ? p 1 ? p? ? 0 < p < 1 ? 1 ? p p ? 0 < p < 1, 0 < q < 1 P′ = ? ? q 1 ? q? ? How are mutation rates measured? How does entropy change when a Markov transition matrix is applied? If limiting distribution is uniform, then entropy increases (analogous to 2nd Law of Thermodynamics) However, this is not true in general (why not?) How rapidly is the stationary distribution approached? Jukes-Cantor Model Courtesy of M. Yaffe Assume each nucleotide equally likely to change into any other nt, α α with rate of change=α. Overall rate of substitution = 3α …so if G at t=0, at t=1, P G(1) =1-3α and P G(2) =(1-3α)P G(1) +α [1? P G(1) ] G T A C α α α α Expanding this gives P G(t) =1/4 + (3/4)e -4αt Can show that this gives K = -3/4 ln[1-(4/3)(p)] K = true number of substitutions that have occurred, P = fraction of nt that differ by a simple count. Captures general behaviour… Predicting RNA Secondary Structure o structure by energy minimization o structure by covariation Read Mount, Ch. 5 ? Review of RNA structure - Motivation: ribosome gallery, miRNAs/siRNAs ? Predicting 2 ? Predicting 2 ? Finding non-coding RNA genes People who study proteins People who study RNA Types of Functional RNAs ? tRNAs ? RNaseP ? rRNAs ? SRP RNA ? mRNAs ? tmRNA ? snRNAs ? miRNAs ? snoRNAs ? siRNAs The Good News: Functional RNAs have secondary structure Composition of the Ribosome E. coli 70S ribosome - 2.6 x 10 6 daltons 30S subunit - 0.9 x 10 6 daltons 16S rRNA (1542 nts) 21 proteins 50S subunit - 1.7 x 10 6 daltons 5S rRNA (120 nts) 23S rRNA (2904 nts) 34 proteins The ribosome is a large macromolecular machine composed of RNA and protein components in a ratio of about 2 to 1. For many years,biochemistry and evolutionary considerations have argued for a central role being played by the rRNAs in the function of the ribosome. Now, in the face of atomic resolution data, the answer is clear - the ribosome is an RNA machine - and that is part of the story that I will tell you about today. The microRNA and RNAi Pathways microRNA pathway RNAi pathway Dicer precursor stRNA/microRNA siRNAs Dicer Translational repression Exogenous dsRNA, transposon, etc. target mRNA Drosha RISCmiRNP MicroRNA gene mRNA degradation " Soon after the discovery of let-7, the Mello, Zamore, and Hannon labs reported that CLICK gene inactivation by RNAi and the control of developmental timing by stRNAs, are interconnected processes that share certain molecular components. - -- - The most prominent component was the highly-conserved nuclease Dicer, which cleaves double-stranded precursor molecules into stRNAs and siRNAs. - Essential background on microRNAs - - Family of small non-coding RNAs found in animals and plants - Endogenous precursor RNA foldbacks are processed by the enzyme Dicer to mature single-stranded 21 or 22 nt microRNAs - RNA interference involves longer, perfect duplex RNA (exogenous or endogenous), processed by Dicer to ~21mer siRNAs - Characterized animal microRNAs direct translational inhibition by basepairing to 3’ UTRs of protein coding mRNAs and are often involved in developmental control. Pairing between microRNA and mRNA is always partial/incomplete (usually multiple bulges/loops). There are typically several microRNA complementary sites per regulated mRNA. - Plant microRNAs and siRNAs generally have perfect or near-perfect complementarity to mRNAs and can trigger mRNA degradation. Ways to Predict RNA 2 o Structure ?Dot Plot (+ Dynamic Programming on Helices ) ? Energy Minimization ? Covariation Helices in tRNA All possible helices RNA Energetics I …CCAUUCAUAG-5’ |||||| Free energy of helix formation 5’…CGUGAGU……… 3’ derives from: G A G ? base pairing: > > C U U ? base stacking: 5' --> 3' UX AY | G p A | 3' <-- 5’ C p U X Y A C G U A . . . -1.30 Doug Turner’s Energy Rules: C . . -2.40 . G . -2.10 . -1.00 T -0.90 . -1.30 . RNA Energetics II N p N p N p N p N p N p N A) x | | | | xx N p N p N p N p N p N p N B) N p N p N p N p N p N p N x | | x | | x N p N p N p N p N p N p N N p N p N p N p N p N p N C) x | | | x | x N p N p N p N p N p N p N Lots of consecutive base pairs - good Internal loop - bad Terminal base pair not stable - bad Generally A will be more stable than B or C RNA Energetics III Other Contributions to Folding Free Energy ? Hairpin loop destabilizing energies - a function of loop length ? Interior and bulge loop destabilizing energies - a function of loop length ? Terminal mismatch and base pair energies See Mount, Ch. 5 RNA Energetics IV Folding by Energy Minimization A clever dynamic programming algorithm is used - the Zuker algorithm - see Mount, Ch. 5 for details Gives: ? minimum energy fold ? suboptimal folds (e.g., five lowest ?G folds) ? probabilities of particular base pairs ? full partition function Accuracy: ~70-80% of base pairs correct M. Zuker, a Canadian scientist, now at RPI Practical Stuff The Mfold web server: http://www.bioinfo.rpi.edu/applications/mfold/old/rna/ The Vienna RNAfold package (free for download) http://www.tbi.univie.ac.at/~ivo/RNA/ RNA folding references: M. Zuker, et al. In RNA Biochemistry and Biotechnology (1999) D.H. Mathews et al. J. Mol. Biol. 288, 911-940 (1999) Vienna package by Ivo Hofacker Sample Mfold Output dG = -34.6 http://www.biology.wustl.edu/gcg/mfold.html Other ways to infer RNA 2 o structure Method of Covariation / Compensatory changes Seq1: A C G A A A G U Seq2: U A G T A A U A Seq3: A G G T G A C U Seq4: C G G C A A U G Seq5: G U G G G A A C Mutual information statistic for pair of columns in a multiple alignment ( i , j ) Mij = ∑ f ( i , j ) f x , y ( i ) ( j ) x , y x , y log 2 f f x y ( i , j ) f = fraction of seqs w/ nt. x in col. i, nt. y in col. j x , y ( i ) f = fraction of seqs w/ nt. x in col. i x sum over x, y = A, C, G, U Mij is maximal (2 bits) if x and y individually appear at random (A,C,G,U equally likely), but are perfectly correlated (e.g., always complementary) Inferring 2 o structure from covariation Brown, T. A. Genomes. NY: John Wiley & Sons, 1999. Please see The ncRNA Gene Finding Problem Approach 1: Devise algorithm to find specific family of ncRNAs ? Lowe, T. M. and S. R. Eddy. "A Computational Screen for Methylation Guide snoRNAs in Yeast." Science 283 (1999): 1168. Approach 2: Devise algorithm to find ncRNAs in general ? Rivas, E., et al. "Computational Identification of Noncoding RNAs in E. coli by Comparative Genomics." Curr. Biol. 11 (2001): 1369. Literature Discussion Tues. 3/16 Paper #1: Part 1 - Finding Genes, etc., pp. 241-247 Part 2 - Regulatory Elements, pp. 247-254 Paper #2: Kellis, M, N Patterson, M Endrizzi, B Birren, and ES Lander. "Sequencing and Comparison of Yeast Species to Identify Genes and Regulatory Elements." Nature 423, no. 6937 (15 May 2003): 241-54. Rivas, E, RJ Klein, TA Jones, and SR Eddy. "Computational Identification of Noncoding RNAs in E. coli by Comparative Genomics." Curr Biol. 11, no. 17 (4 September 2001): 1369-73.