7.91 / 7.36 / BE.490
Lecture #6
Mar. 11, 2004
Predicting RNA Secondary Structure
Chris Burge
Review of Markov Models & DNA Evolution
Ch. 4 of Mount
? CpG Island HMM
? The Viterbi Algorithm
? Real World HMMs
? Markov Models for DNA Evolution
DNA Sequence Evolution
Generation n-1 (grandparent)
5’ TGGCATGCACCCTGTAAGTCAATATAAATGGCTACGCCTAGCCCATGCGA 3’
||||||||||||||||||||||||||||||||||||||||||||||||||
3’ ACCGTACGTGGGACATTCAGTTATATTTACCGATGCGGATCGGGTACGCT 5’
5’ TGGCATGCACCCTGTAAGTCAATATAAATGGCTATGCCTAGCCCATGCGA 3’
||||||||||||||||||||||||||||||||||||||||||||||||||
3’ ACCGTACGTGGGACATTCAGTTATATTTACCGATACGGATCGGGTACGCT 5’
Generation n (parent)
Generation n+1 (child)
5’ TGGCATGCACCCTGTAAGTCAATATAAATGGCTATGCCTAGCCCGTGCGA 3’
||||||||||||||||||||||||||||||||||||||||||||||||||
3’ ACCGTACGTGGGACATTCAGTTATATTTACCGATACGGATCGGGCACGCT 5’
What is a Markov Model (aka Markov
Chain)?
Classical Definition
A discrete stochastic process X
1
, X
2
, X
3
, …
which has the Markov property:
P(X
n+1
= j | X
1
=x
1
, X
2
=x
2
, … X
n
=x
n
) = P(X
n+1
= j | X
n
=x )
n
(for all x , all j, all n)
i
In words:
A random process which has the property that the future
(next state) is conditionally independent of the past given
the present (current state)
Markov - a Russian mathematician, ca. 1922
DNA Sequence Evolution is
a Markov Process
No selection case
?
P
AA
P
AC
P
AG
P
AT ?
P
CC
P
CG
P
CT
?
S
n
= base at generation n
P =
?
?
P
CA
?
?
P
GA
P
GC
P
GG
P
GT ?
?
P
ij
= P (
S
= j |
S
n
= i )
?
?
P
TA
P
TC
P
TG
P
TT
?
n +1
null
q
n
=(
q
A
,q
C
,q
,
q
T
) = vector of prob’s of bases at gen. n
G
Handy relations:
null
q
n +1
null
P
q
n
=
null
q
n
+k
=
null
q
n
P
k
Limit Theorem for Markov Chains
S
n
= base at generation n
P
ij
=P (
S
n +1
= j |
S
n
=i )
If P
ij
>0
for all i,j (and
∑
P
ij
=1 for all i)
j
null
then there is a unique vector
P
n
null
Pr
null
r
r
such that
null
q
null
r
null
lim
=
q=
and (for any prob. vector )
n →∞
null
r
is called the “stationary” or “limiting” distribution of P
See Ch. 4, Taylor & Karlin, An Introduction to Stochastic Modeling, 1984 for details
Stationary Distribution
Examples
2-letter alphabet: R = purine, Y = pyrimidine
Stationary distributions for:
?
10
? ?
01
?
I =
? ?
Q =
? ?
? 01? ? 10?
?
1 ? p p
?
P =
?
?
p 1 ? p?
?
0 < p < 1
?
1 ? p p
?
0 < p < 1, 0 < q < 1
P′ =
?
?
q 1 ? q?
?
How are mutation rates measured?
How does entropy change when a
Markov transition matrix is applied?
If limiting distribution is uniform, then entropy increases
(analogous to 2nd Law of Thermodynamics)
However, this is not true in general (why not?)
How rapidly is the stationary
distribution approached?
Jukes-Cantor Model
Courtesy of M. Yaffe
Assume each nucleotide equally likely
to change into any other nt,
α
α
with rate of change=α.
Overall rate of substitution = 3α
…so if G at t=0, at t=1, P
G(1)
=1-3α
and P
G(2)
=(1-3α)P
G(1)
+α [1? P
G(1)
]
G
T
A
C
α
α
α
α
Expanding this gives P
G(t)
=1/4 + (3/4)e
-4αt
Can show that this gives K = -3/4 ln[1-(4/3)(p)]
K = true number of substitutions that have occurred,
P = fraction of nt that differ by a simple count.
Captures general behaviour…
Predicting RNA Secondary Structure
o
structure by energy minimization
o
structure by covariation
Read Mount, Ch. 5
? Review of RNA structure
- Motivation: ribosome gallery, miRNAs/siRNAs
? Predicting 2
? Predicting 2
? Finding non-coding RNA genes
People who
study proteins
People who
study RNA
Types of Functional RNAs
? tRNAs ? RNaseP
? rRNAs ? SRP RNA
? mRNAs ? tmRNA
? snRNAs ? miRNAs
? snoRNAs ? siRNAs
The Good News:
Functional RNAs have secondary structure
Composition of the Ribosome
E. coli 70S ribosome - 2.6 x 10
6
daltons
30S subunit - 0.9 x 10
6
daltons
16S rRNA (1542 nts)
21 proteins
50S subunit - 1.7 x 10
6
daltons
5S rRNA (120 nts)
23S rRNA (2904 nts)
34 proteins
The ribosome is a large macromolecular machine composed of RNA and protein components in a ratio of about 2 to 1. For many years,biochemistry and evolutionary considerations
have argued for a central role being played by the rRNAs in the function of the ribosome. Now, in the face of atomic resolution data, the answer is clear - the ribosome is an RNA
machine - and that is part of the story that I will tell you about today.
The microRNA and RNAi Pathways
microRNA pathway
RNAi pathway
Dicer
precursor
stRNA/microRNA
siRNAs
Dicer
Translational repression
Exogenous dsRNA, transposon, etc.
target mRNA
Drosha
RISCmiRNP
MicroRNA gene
mRNA degradation
" Soon after the discovery of let-7, the Mello, Zamore, and Hannon labs reported
that CLICK gene inactivation by RNAi and the control of developmental timing by
stRNAs, are interconnected processes that share certain molecular components.
- --
- The most prominent component was the highly-conserved nuclease Dicer, which
cleaves double-stranded precursor molecules into stRNAs and siRNAs.
- Essential background on microRNAs
- - Family of small non-coding RNAs found in animals and plants
- Endogenous precursor RNA foldbacks are processed by the enzyme Dicer to
mature single-stranded 21 or 22 nt microRNAs
- RNA interference involves longer, perfect duplex RNA (exogenous or
endogenous), processed by Dicer to ~21mer siRNAs
- Characterized animal microRNAs direct translational inhibition by basepairing to
3’ UTRs of protein coding mRNAs and are often involved in developmental
control. Pairing between microRNA and mRNA is always partial/incomplete
(usually multiple bulges/loops). There are typically several microRNA
complementary sites per regulated mRNA.
- Plant microRNAs and siRNAs generally have perfect or near-perfect
complementarity to mRNAs and can trigger mRNA degradation.
Ways to Predict RNA 2
o
Structure
?Dot Plot
(+ Dynamic Programming on Helices )
? Energy Minimization
? Covariation
Helices in tRNA
All possible helices
RNA Energetics I
…CCAUUCAUAG-5’
||||||
Free energy of helix formation
5’…CGUGAGU……… 3’
derives from:
G A G
? base pairing: > >
C U U
? base stacking:
5' --> 3'
UX
AY
|
G p A
|
3' <-- 5’
C p U
X
Y A C G U
A . . . -1.30
Doug Turner’s Energy Rules:
C . . -2.40 .
G . -2.10 . -1.00
T -0.90 . -1.30 .
RNA Energetics II
N p N p N p N p N p N p N
A) x | | | | xx
N p N p N p N p N p N p N
B)
N p N p N p N p N p N p N
x | | x | | x
N p N p N p N p N p N p N
N p N p N p N p N p N p N
C) x | | | x | x
N p N p N p N p N p N p N
Lots of consecutive
base pairs - good
Internal loop - bad
Terminal base pair
not stable - bad
Generally A will be more stable than B or C
RNA Energetics III
Other Contributions to Folding Free Energy
? Hairpin loop destabilizing energies
- a function of loop length
? Interior and bulge loop destabilizing energies
- a function of loop length
? Terminal mismatch and base pair energies
See Mount, Ch. 5
RNA Energetics IV
Folding by Energy Minimization
A clever dynamic programming algorithm is used
- the Zuker algorithm - see Mount, Ch. 5 for details
Gives:
? minimum energy fold
? suboptimal folds (e.g., five lowest ?G folds)
? probabilities of particular base pairs
? full partition function
Accuracy: ~70-80% of base pairs correct
M. Zuker, a Canadian scientist, now at RPI
Practical Stuff
The Mfold web server:
http://www.bioinfo.rpi.edu/applications/mfold/old/rna/
The Vienna RNAfold package (free for download)
http://www.tbi.univie.ac.at/~ivo/RNA/
RNA folding references:
M. Zuker, et al. In RNA Biochemistry and Biotechnology (1999)
D.H. Mathews et al. J. Mol. Biol. 288, 911-940 (1999)
Vienna package by Ivo Hofacker
Sample Mfold Output
dG = -34.6
http://www.biology.wustl.edu/gcg/mfold.html
Other ways to infer RNA 2
o
structure
Method of Covariation / Compensatory changes
Seq1: A C G A A A G U
Seq2: U A G T A A U A
Seq3: A G G T G A C U
Seq4: C G G C A A U G
Seq5: G U G G G A A C
Mutual information statistic for pair of
columns in a multiple alignment
( i , j )
Mij
=
∑
f
( i , j )
f
x , y
( i ) ( j )
x , y
x , y
log
2
f f
x y
( i , j )
f
= fraction of seqs w/ nt. x in col. i, nt. y in col. j
x , y
( i )
f
= fraction of seqs w/ nt. x in col. i
x
sum over x, y = A, C, G, U
Mij
is maximal (2 bits) if x and y individually appear at
random (A,C,G,U equally likely), but are perfectly
correlated (e.g., always complementary)
Inferring 2
o
structure from covariation
Brown, T. A. Genomes. NY: John Wiley & Sons, 1999.
Please see
The ncRNA Gene Finding Problem
Approach 1:
Devise algorithm to find specific family of ncRNAs
? Lowe, T. M. and S. R. Eddy. "A Computational Screen for
Methylation Guide snoRNAs in Yeast." Science 283 (1999): 1168.
Approach 2:
Devise algorithm to find ncRNAs in general
? Rivas, E., et al. "Computational Identification of Noncoding RNAs in
E. coli by Comparative Genomics." Curr. Biol. 11 (2001): 1369.
Literature Discussion Tues. 3/16
Paper #1:
Part 1 - Finding Genes, etc., pp. 241-247
Part 2 - Regulatory Elements, pp. 247-254
Paper #2:
Kellis, M, N Patterson, M Endrizzi, B Birren, and ES Lander. "Sequencing and Comparison of Yeast Species
to Identify Genes and Regulatory Elements." Nature 423, no. 6937 (15 May 2003): 241-54.
Rivas, E, RJ Klein, TA Jones, and SR Eddy. "Computational Identification of Noncoding RNAs in E. coli by
Comparative Genomics." Curr Biol.
11, no. 17 (4 September 2001): 1369-73.