7.91 / 7.36 / BE.490
Lecture #4
Mar. 4, 2004
Markov & Hidden Markov Models
for DNA Sequence Analysis
Chris Burge
Organization of Topics

Model                 Dependence Structure    Lecture
Weight Matrix Model   Independence            3/2
Hidden Markov Model   Local Dependence        3/4
Energy Model,
  Covariation Model   Non-local Dependence    3/9
Markov & Hidden Markov Models for DNA

• Hidden Markov Models - looking under the hood
• Markov Models for splice sites
• The Viterbi Algorithm
• Real World HMMs

See Ch. 4 of Mount
Review of DNA Motif Modeling & Discovery

• Information Content of a Motif
• WMMs for splice sites
• The Motif Finding/Discovery Problem
• The Gibbs Sampler
• Motif Modeling - Beyond Weight Matrices

See Ch. 4 of Mount
Information Content of a DNA/RNA Motif

(positions -3 … +6 of the motif; f_k = freq. of nt k at a given position)

Shannon Entropy:

  H(f) = -∑_k f_k log2(f_k)   (bits)

Information/position:

  I(f) = 2 - H(f) = 2 + ∑_k f_k log2(f_k) = ∑_k f_k log2( f_k / (1/4) )   (bits)

A motif containing m bits of information will occur approximately once per 2^m bases of random sequence.
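These definitions translate directly into code. A minimal sketch in Python, assuming the uniform 1/4 background over the four nucleotides that the 2-bit maximum implies:

```python
import math

def entropy(freqs):
    """Shannon entropy H(f) = -sum_k f_k log2(f_k), in bits."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)

def information(freqs):
    """Information content per position, I(f) = 2 - H(f), in bits
    (assumes a uniform 1/4 background over the 4 nucleotides)."""
    return 2.0 - entropy(freqs)

# A position with uniform frequencies carries 0 bits;
# an invariant position carries the maximum of 2 bits.
print(information([0.25, 0.25, 0.25, 0.25]))  # 0.0
print(information([1.0, 0.0, 0.0, 0.0]))      # 2.0
```

An invariant position thus contributes a factor of 2^2 = 4 to the expected spacing between chance occurrences of the motif.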
Variables Affecting Motif Finding

  L = avg. sequence length
  N = no. of sequences
  W = motif width
  I = info. content of motif

Example input sequences:

gcggaagagggcactagcccatgtgagagggcaaggacca
atctttctcttaaaaataacataattcagggccaggatgt
gtcacgagctttatcctacagatgatgaatgcaaatcagc
taaaagataatatcgaccctagcgtggcgggcaaggtgct
gtagattcgggtaccgttcataaaagtacgggaatttcgg
tatacttttaggtcgttatgttaggcgagggcaaaagtca
ctctgccgattcggcgagtgatcgaagagggcaatgcctc
aggatggggaaaatatgagaccaggggagggccacactgc
acacgtctagggctgtgaaatctctgccgggctaacagac
gtgtcgatgttgagaacgtaggcgccgaggccaacgctga
atgcaccgccattagtccggttccaagagggcaactttgt
ctgcgggcggcccagtgcgcaacgcacagggcaaggttta
tgtgttgggcggttctgaccacatgcgagggcaacctccc
gtcgcctaccctggcaattgtaaaacgacggcaatgttcg
cgtattaatgataaagaggggggtaggaggtcaactcttc
aatgcttataacataggagtagagtagtgggtaaactacg
tctgaaccttctttatgcgaagacgcgagggcaatcggga
tgcatgtctgacaacttgtccaggaggaggtcaacgactc
cgtgtcatagaattccatccgccacgcggggtaatttgga
tcccgtcaaagtgccaacttgtgccggggggctagcagct
acagcccgggaatatagacgcgtttggagtgcaaacatac
acgggaagatacgagttcgatttcaagagttcaaaacgtg
cccgataggactaataaggacgaaacgagggcgatcaatg
ttagtacaaacccgctcacccgaaaggagggcaaatacct
agcaaggttcagatatacagccaggggagacctataactc
gtccacgtgcgtatgtactaattgtggagagcaaatcatt
…
How is the 5'ss recognized?

U1 snRNA    3'………CCAUUCAUAG-5'
                    ||||||
Pre-mRNA    5'…………UUCGUGAGU…………… 3'
RNA Energetics I

    …CCAUUCAUAG-5'
        ||||||
5'…CGUGAGU……… 3'

Free energy of helix formation derives from:

• base pairing:  G:C > A:U > G:U

• base stacking of adjacent pairs, e.g.

    5' --> 3'
     G p A
     |   |
     C p U
    3' <-- 5'

Doug Turner's Energy Rules - stacking free energies (kcal/mol) for the stack

    5' --> 3'
     U   X
     |   |
     A   Y
    3' <-- 5'

   X \ Y     A        C        G        U
   A         .        .        .       -1.30
   C         .        .       -2.40     .
   G         .       -2.10     .       -1.00
   T        -0.90     .       -1.30     .
RNA Energetics II

A)  N p N p N p N p N p N p N
    x | | | | | x
    N p N p N p N p N p N p N      Lots of consecutive base pairs - good

B)  N p N p N p N p N p N p N
    x | | x | | x
    N p N p N p N p N p N p N      Internal loop - bad

C)  N p N p N p N p N p N p N
    x | | | x | x
    N p N p N p N p N p N p N      Terminal base pair not stable - bad

Generally A will be more stable than B or C
Conditional Frequencies in 5'ss Sequences
(frequencies in percent)

5'ss which have G at +5           5'ss which lack G at +5

Pos   -1   +3   +4   +6           Pos   -1   +3   +4   +6
A      9   44   75   14           A      2   81   51   22
C      4    3    4   18           C      1    3   28   20
G     78   51   13   19           G     97   15    9   30
T      9    3    9   49           T      0    2   12   28

Data from Burge, 1998, "Computational Methods in Molecular Biology"

What kind of model could incorporate interactions between positions?
A Markov Model

(positions -3 … +6, with each position depending on the preceding one)
Terminology

Random Variable (RV): a quantity which may assume any one of a set of values, each with a definite probability of occurrence.

Example: X = the outcome of rolling a die
  P(X=1) = 1/6,  P(X=2) = 1/6,  …,  P(X=6) = 1/6

Stochastic Process: a random process, or a sequence of random variables.

Example - the craps process: X1, X2, X3, … = successive dice rolls
What is a Markov Model (aka Markov Chain)?

Classical definition: a discrete stochastic process X1, X2, X3, … which has the Markov property:

  P(X_{n+1} = j | X_1 = x_1, X_2 = x_2, …, X_n = x_n) = P(X_{n+1} = j | X_n = x_n)

  (for all x_i, all j, all n)

In words: a random process which has the property that the future (next state) is conditionally independent of the past given the present (current state).

Markov - a Russian mathematician, ca. 1922
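The Markov property makes such a chain easy to simulate: each draw looks only at the current state. A small sketch over nucleotides (the transition values below are illustrative placeholders, not real genome statistics):

```python
import random

def sample_chain(init, trans, n, seed=0):
    """Sample X1..Xn from a first-order Markov chain: X1 ~ init, and
    each X_{i+1} depends only on X_i (the Markov property)."""
    rng = random.Random(seed)
    states = list(init)
    x = rng.choices(states, weights=[init[s] for s in states])[0]
    out = [x]
    for _ in range(n - 1):
        # Next state drawn from the row of the transition matrix for x alone
        x = rng.choices(states, weights=[trans[x][s] for s in states])[0]
        out.append(x)
    return "".join(out)

# Illustrative chain in which G is less likely to follow C
init = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
trans = {a: {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25} for a in "ACGT"}
trans["C"] = {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3}
print(sample_chain(init, trans, 50))
```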
Inhomogeneous 1st-Order Markov Model

(positions -3 … +6; each position is conditioned on the preceding one)

Conditional probabilities are estimated from counts, e.g.

  P_-2(A|C) = N_CA(-3,-2) / N_C(-3)

For a candidate site S = S1 S2 … S9:

  P(S|+) = P_-3(S1) P_-2(S2|S1) P_-1(S3|S2) ··· P_+6(S9|S8)

  P(S|-) = P_bg(S1) P_bg(S2|S1) P_bg(S3|S2) ··· P_bg(S9|S8)

  R = P(S|+) / P(S|-)
Estimating Parameters for a Markov Model

Conditional probabilities are estimated from counts in aligned training sequences, e.g.

  P_-2(A|C) = N_CA(-3,-2) / N_C(-3)

where N_CA(-3,-2) is the number of training sequences with C at position -3 and A at position -2, and N_C(-3) is the number with C at position -3.
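A sketch of this count-based estimator in Python; the aligned training sequences below are toy examples for illustration, not real 5'ss data:

```python
from collections import Counter

def conditional_probs(seqs, i, j):
    """Estimate P_j(b | a): probability of base b at position j given
    base a at position i, from aligned training sequences.
    Mirrors the slide's estimator P_-2(A|C) = N_CA(-3,-2) / N_C(-3)."""
    pair_counts = Counter((s[i], s[j]) for s in seqs)    # N_ab(i,j)
    single_counts = Counter(s[i] for s in seqs)          # N_a(i)
    return {(a, b): n / single_counts[a] for (a, b), n in pair_counts.items()}

# Toy aligned 9-mers (positions -3 … +6)
seqs = ["CAGGTAAGT", "CAGGTGAGT", "AAGGTAAGT", "CTGGTAAGT"]
probs = conditional_probs(seqs, 0, 1)   # positions -3 and -2
print(probs[("C", "A")])   # N_CA / N_C = 2/3
```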
What about longer-range dependence?
- a kth-order Markov model conditions each position on the k previous positions
- ~4^(k+1) parameters / position for a Markov model of order k
For a candidate site S = S1 S2 … S9:

  P(S|+) = P_-3(S1) P_-2(S2|S1) P_-1(S3|S2) ··· P_+6(S9|S8)    (inhomogeneous)

  P(S|-) = P_bg(S1) P_bg(S2|S1) P_bg(S3|S2) ··· P_bg(S9|S8)    (homogeneous)

  R = P(S|+) / P(S|-),   score  s = log2 R
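The score s = log2 R can be accumulated term by term in log space. A sketch of the computation; the parameter tables below are hypothetical placeholders (uniform models), not trained 5'ss frequencies:

```python
import math

def score_5ss(seq, plus_init, plus_cond, bg_init, bg_cond):
    """Log-odds score s = log2[P(S|+)/P(S|-)] for a 9-mer S = S1..S9.

    plus_init[b]        -- P_-3(b), initial probability under the site model
    plus_cond[i][(a,b)] -- P(b at step i | a at step i-1), inhomogeneous
    bg_init, bg_cond    -- homogeneous background equivalents
    """
    s = math.log2(plus_init[seq[0]] / bg_init[seq[0]])
    for i in range(1, len(seq)):
        a, b = seq[i - 1], seq[i]
        s += math.log2(plus_cond[i][(a, b)] / bg_cond[(a, b)])
    return s

bases = "ACGT"
uniform_init = {b: 0.25 for b in bases}
uniform_cond = {(a, b): 0.25 for a in bases for b in bases}
plus_cond = [None] + [uniform_cond] * 8   # steps 1..8 (positions -2 … +6)

# Identical site and background models give a log-odds score of 0:
print(score_5ss("CAGGTAAGT", uniform_init, plus_cond,
                uniform_init, uniform_cond))  # 0.0
```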
WMM vs 1st-order Markov Models of Human 5'ss

[Figure: score histograms for true 5'ss vs. decoy 5'ss under the WMM 5'ss score and under the I1M (inhomogeneous 1st-order Markov) 5'ss score]
Splicing Model I

[Figure: stepwise spliceosome assembly. U1 snRNP binds the 5' splice site (commitment complex); U2 snRNP joins at the branch point (ATP-dependent pre-spliceosome complex); U5 and U4/U6 snRNPs then form complexes across the intron, yielding the spliceosome. Consensus signals: 5' splice site GURAGU, branch point YNYURAY, polypyrimidine tract Y(10-20), 3' splice site YAG. In the exon definition complex, SR proteins bound to exonic splicing enhancers (ESEs), SF1/BBP at the branch point, U2AF65/U2AF35 at the polypyrimidine tract and 3' splice site, and the cap-binding complex (CBC) at the m7Gppp cap act together, with pol II, to define exons on the nascent transcript.]
Splicing Model II

[Figure: panels (A) and (B) - a recent model of human pre-mRNA splicing]

ESEs are short motifs that enhance recognition of adjacent splice sites in both constitutive and alternatively spliced exons; their precise sequence requirements are not well characterized.
Markov and Hidden Markov Models

Hidden Markov Models (aka HMMs) are a later development than Markov models, arising in electrical engineering for applications to voice recognition.

Courtesy of M. Yaffe
Markov:  positions -3 … +6 - the states themselves are directly observable.

Hidden Markov:  hidden states (Genome, Island) with transition probabilities P_ii, P_ig, P_gi, P_gg; only the emitted sequence … A C T C G A G T A … is observable.
CpG Islands

[Figure: %C+G (30-60%) along a genomic region; CpG islands appear as stretches of elevated C+G content. Which state (island or not) generated each position is hidden.]
CpG Island Hidden Markov Model

Transition probabilities:
  Genome:  P_gg = 0.99999    P_gi = 0.00001
  Island:  P_ii = 0.999      P_ig = 0.001

Emission probabilities:
                C      G      A      T
  CpG Island:   0.3    0.3    0.2    0.2
  Genome:       0.2    0.2    0.3    0.3

Observable sequence: … A C T C G A G T A …
CpG Island HMM II

"Transition probabilities":
  Genome:  P_gg = 0.99999    P_gi = 0.00001
  Island:  P_ii = 0.999      P_ig = 0.001

"Emission probabilities":
                C      G      A      T
  CpG Island:   0.3    0.3    0.2    0.2
  Genome:       0.2    0.2    0.3    0.3

Observable sequence: … T A C C G A G T A …
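With these parameters fixed, the joint probability of one candidate parse is an initial term followed by alternating transition and emission terms. A sketch in log space (the 50/50 initial state distribution is an assumption; the slide does not specify one):

```python
import math

# Transition and emission probabilities from the slide:
trans = {("g", "g"): 0.99999, ("g", "i"): 0.00001,
         ("i", "i"): 0.999,   ("i", "g"): 0.001}
emit = {"i": {"C": 0.3, "G": 0.3, "A": 0.2, "T": 0.2},
        "g": {"C": 0.2, "G": 0.2, "A": 0.3, "T": 0.3}}
init = {"g": 0.5, "i": 0.5}   # assumed 50/50 start (not given on the slide)

def log_joint(hidden, observed):
    """log2 P(H = h1..hn, O = o1..on) for one candidate parse:
    initial term, then a transition and an emission term per step."""
    lp = math.log2(init[hidden[0]]) + math.log2(emit[hidden[0]][observed[0]])
    for h_prev, h, o in zip(hidden, hidden[1:], observed[1:]):
        lp += math.log2(trans[(h_prev, h)]) + math.log2(emit[h][o])
    return lp

obs = "ACTCGAGTA"
print(log_joint("ggggggggg", obs))   # all-Genome parse
print(log_joint("iiiiiiiii", obs))   # all-Island parse
```

For this short, A/T-rich snippet the all-Genome parse scores higher, as expected.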
CpG Island HMM III

We observe the sequence … A C T C G A G T A … and want to infer the hidden states. But the HMM is written in the other direction: the observable depends on the hidden.
Inferring the Hidden from the Observable (Bayes' Rule)

Conditional probability:  P(A|B) = P(A,B) / P(B).  Therefore

  P(H = h1,…,hn | O = o1,…,on)

    = P(H = h1,…,hn, O = o1,…,on) / P(O = o1,…,on)

    = P(H = h1,…,hn) P(O = o1,…,on | H = h1,…,hn) / P(O = o1,…,on)

P(O = o1,…,on) is somewhat difficult to calculate. But notice:

  P(H = h1,…,hn, O = o1,…,on) > P(H = h'1,…,h'n, O = o1,…,on)

implies

  P(H = h1,…,hn | O = o1,…,on) > P(H = h'1,…,h'n | O = o1,…,on)

so we can treat P(O = o1,…,on) as a constant.
Finding the Optimal "Parse" (Viterbi Algorithm)

Want to find the sequence of hidden states H_opt = h1_opt, h2_opt, h3_opt, … which maximizes the joint probability:

  P(H = h1,…,hn, O = o1,…,on)

(the optimal "parse" of the sequence)

Solution: define R_i(h) = probability of the optimal parse of the subsequence 1..i ending in state h. Solve recursively, i.e. determine R_2(h) in terms of R_1(h), etc.

A. Viterbi, an MIT BS/MEng student in E.E. - founder of Qualcomm
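A sketch of the R_i(h) recursion in log space, reusing the CpG-island parameters from the earlier slides (with an assumed 50/50 initial state distribution); for k hidden states and sequence length L this runs in O(k^2 L) time:

```python
import math

def viterbi(observed, states, init, trans, emit):
    """Optimal parse via the Viterbi recursion, in log space.
    R[i][h] = log2-probability of the best parse of observed[0..i]
    ending in hidden state h."""
    # Initialization: R_1(h) = log init(h) + log emit_h(o1)
    R = [{h: math.log2(init[h]) + math.log2(emit[h][observed[0]])
          for h in states}]
    back = []   # back-pointers for traceback
    # Recursion: R_{i+1}(h) = max_h' [R_i(h') + log trans(h'->h)] + log emit_h(o)
    for o in observed[1:]:
        col, ptr = {}, {}
        for h in states:
            prev = max(states,
                       key=lambda hp: R[-1][hp] + math.log2(trans[(hp, h)]))
            col[h] = (R[-1][prev] + math.log2(trans[(prev, h)])
                      + math.log2(emit[h][o]))
            ptr[h] = prev
        R.append(col)
        back.append(ptr)
    # Traceback from the best final state
    h = max(states, key=lambda s: R[-1][s])
    path = [h]
    for ptr in reversed(back):
        h = ptr[h]
        path.append(h)
    return "".join(reversed(path))

# CpG-island parameters from the earlier slides:
states = ("g", "i")
init = {"g": 0.5, "i": 0.5}
trans = {("g", "g"): 0.99999, ("g", "i"): 0.00001,
         ("i", "i"): 0.999,   ("i", "g"): 0.001}
emit = {"i": {"C": 0.3, "G": 0.3, "A": 0.2, "T": 0.2},
        "g": {"C": 0.2, "G": 0.2, "A": 0.3, "T": 0.3}}
print(viterbi("ACTCGAGTA", states, init, trans, emit))  # "ggggggggg"
```

With the very small transition probabilities above, a short sequence is never worth a state switch; a long C/G-rich stretch is needed before the island state wins.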
"Trellis" Diagram for Viterbi Algorithm

[Figure: trellis with hidden states on the vertical axis and position in sequence (1 … i, i+1, i+2, i+3, i+4, … L; observed bases T … A T C G C … A) on the horizontal axis]

Run time for a k-state HMM on a sequence of length L?