7.91 / 7.36 / BE.490
Lecture #5
Mar. 9, 2004
Markov Models
& 
DNA Sequence Evolution
Chris Burge
Review of Markov & HMM Models for DNA 
• Hidden Markov Models - looking under the hood (Ch. 4 of Mount)
• Markov Models for splice sites
• The Viterbi Algorithm
• Real World HMMs
CpG Islands
[Figure: plot of %C+G along a genomic region (y-axis: 30 to 60%); CpG islands appear as runs of locally elevated C+G content]
CpG Island Hidden Markov Model

Hidden: state sequence (Island or Genome at each position)
Observable: base sequence … A C T C G A G T A

Transition probabilities:
P_gg = 0.99999   P_gi = 0.00001
P_ii = 0.999     P_ig = 0.001

Emission probabilities:
             C    G    A    T
CpG Island: 0.3  0.3  0.2  0.2
Genome:     0.2  0.2  0.3  0.3
CpG Island HMM II

The same model, naming its two parameter sets:

“Transition probabilities” (between the hidden states Island and Genome):
P_gg = 0.99999   P_gi = 0.00001
P_ii = 0.999     P_ig = 0.001

Observable sequence: … T A C C G A G T A

“Emission probabilities” (of each base given the hidden state):
             C    G    A    T
CpG Island: 0.3  0.3  0.2  0.2
Genome:     0.2  0.2  0.3  0.3
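A minimal Python sketch of these parameter tables (the dictionary names TRANS and EMIT are mine; the probabilities are the ones on the slide):

```python
# Transition probabilities between the hidden states.
TRANS = {
    "genome": {"genome": 0.99999, "island": 0.00001},
    "island": {"island": 0.999,   "genome": 0.001},
}
# Emission probabilities of each base given the hidden state.
EMIT = {
    "island": {"C": 0.3, "G": 0.3, "A": 0.2, "T": 0.2},
    "genome": {"C": 0.2, "G": 0.2, "A": 0.3, "T": 0.3},
}
# Sanity check: every probability row sums to 1.
for table in (TRANS, EMIT):
    for state, row in table.items():
        assert abs(sum(row.values()) - 1.0) < 1e-12
```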
CpG Island HMM III

Observe:        … A C T C G A G T A
Want to infer:  the hidden state at each position

But the HMM is written in the other direction
(the observable depends on the hidden).
Inferring the Hidden from the Observable (Bayes’ Rule)

Conditional probability: P(A|B) = P(A,B)/P(B), so

P(H = h_1, h_2, ..., h_n | O = o_1, o_2, ..., o_n)

   = P(H = h_1, ..., h_n, O = o_1, ..., o_n) / P(O = o_1, ..., o_n)

   = P(H = h_1, ..., h_n) P(O = o_1, ..., o_n | H = h_1, ..., h_n) / P(O = o_1, ..., o_n)

P(O = o_1, ..., o_n) is somewhat difficult to calculate. But notice:

P(H = h_1, ..., h_n, O = o_1, ..., o_n) > P(H = h′_1, ..., h′_n, O = o_1, ..., o_n)

implies

P(H = h_1, ..., h_n | O = o_1, ..., o_n) > P(H = h′_1, ..., h′_n | O = o_1, ..., o_n)

so we can treat P(O = o_1, ..., o_n) as a constant.
Finding the Optimal “Parse” 
(Viterbi Algorithm) 
Want to find the sequence of hidden states H_opt = h_1^opt, h_2^opt, h_3^opt, ...
(the optimal “parse” of the sequence) which maximizes the joint probability:

P(H = h_1, ..., h_n, O = o_1, ..., o_n)

Solution:
Define R_i(h) = probability of the optimal parse of the subsequence 1..i ending in state h.
Solve recursively, i.e. determine R_2(h) in terms of R_1(h), etc.

A. Viterbi, an MIT BS/MEng student in E.E. - founder of Qualcomm
“Trellis” Diagram for Viterbi Algorithm
[Trellis figure: positions 1 … i, i+1, i+2, i+3, i+4 … L across the top,
observed bases T … A T C G C … A below them, one row per hidden state,
with arrows from every state at each position to every state at the next.]

Run time for a k-state HMM on a sequence of length L?
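As a sketch of how the R_i(h) recursion becomes code, here is a minimal log-space Viterbi for the two-state CpG model (the initial-state distribution INIT is my assumption; the slides don't give one). Filling the trellis examines k possible predecessors for each of k states at each of L positions, which also suggests the answer to the run-time question: O(k²L).

```python
import math

# Parameters of the two-state CpG-island HMM from the earlier slides;
# INIT is an assumed initial distribution, not from the slides.
STATES = ("island", "genome")
INIT  = {"island": 1e-5, "genome": 1.0 - 1e-5}
TRANS = {"genome": {"genome": 0.99999, "island": 0.00001},
         "island": {"island": 0.999,   "genome": 0.001}}
EMIT  = {"island": {"C": 0.3, "G": 0.3, "A": 0.2, "T": 0.2},
         "genome": {"C": 0.2, "G": 0.2, "A": 0.3, "T": 0.3}}

def viterbi(seq):
    """Most probable hidden-state path for seq.

    R[h] holds the log-probability of the optimal parse of the prefix
    seen so far that ends in state h (the R_i(h) of the slides);
    backpointers record each state's best predecessor for traceback.
    """
    R = {h: math.log(INIT[h]) + math.log(EMIT[h][seq[0]]) for h in STATES}
    back = []
    for x in seq[1:]:
        new_R, ptr = {}, {}
        for h in STATES:
            best = max(STATES, key=lambda g: R[g] + math.log(TRANS[g][h]))
            new_R[h] = R[best] + math.log(TRANS[best][h]) + math.log(EMIT[h][x])
            ptr[h] = best
        R = new_R
        back.append(ptr)
    # Trace back from the best final state.
    h = max(STATES, key=lambda g: R[g])
    path = [h]
    for ptr in reversed(back):
        h = ptr[h]
        path.append(h)
    return path[::-1]
```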
Viterbi Algorithm Examples
What is the optimal parse of the sequence (ACGT)^10000 ?
What about A^1000 C^80 T^1000 C^40 A^1000 G^60 T^1000 ?

(Hint: for a C or G, the Island state is more likely by a factor of
0.3/0.2 = 1.5 per base, so a run of N consecutive C/G bases is (1.5)^N
times more likely under Island; this must be weighed against the
transition-probability cost of entering and leaving the island.)

Powers of 1.5:
N       =   20      40       60       80
(1.5)^N = 3x10^3  1x10^7  3x10^10  1x10^14
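A quick way to try the second example with the viterbi() sketch above (the run-length summary code is mine):

```python
# Second example sequence from the slide:
seq = "A"*1000 + "C"*80 + "T"*1000 + "C"*40 + "A"*1000 + "G"*60 + "T"*1000
path = viterbi(seq)

# Collapse the state path into (state, run length) segments:
segments, start = [], 0
for i in range(1, len(path) + 1):
    if i == len(path) or path[i] != path[start]:
        segments.append((path[start], i - start))
        start = i
print(segments)  # a C/G run parses as Island when (1.5)^N outweighs the transition costs
```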
What else can you model with HMMs? 
Bacterial gene: Open Reading Frame (ORF)

[Figure: HMM of a bacterial gene, with Start and Stop codon states flanking the ORF]
Parameter Estimation for HMMs 
How many parameters for a k-state HMM over an 
alphabet of size 4? 
Initial probabilities: 
Transition probabilities: 
Emission probabilities: 
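For reference, one standard way to fill in these counts (each distribution over m outcomes contributes m − 1 free parameters, since it must sum to 1):

$$\underbrace{k-1}_{\text{initial}} \;+\; \underbrace{k(k-1)}_{\text{transition}} \;+\; \underbrace{k(4-1)}_{\text{emission}} \;=\; k^2 + 3k - 1$$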
Pseudocounts
Courtesy of M. Yaffe
If the number of sequences in the training set is both large and diverse,
then the sequences in the training set represent a good statistical
sampling of the motif… if not, then we have a sampling error!

Correct for this by adding pseudocounts. How many to add?
→ Too many pseudocounts dominate the frequencies…
  and the resulting matrix won’t work!
→ Too few pseudocounts and we’ll miss many amino
  acid variations, and the matrix will only find the sequences
  that produced the motif!

Add few pseudocounts if sampling is good (robust), and add
more pseudocounts if sampling is sparse.

One reasonable approach is to add √N pseudocounts, where
N is the number of sequences…
As N increases, the influence of pseudocounts decreases, since
N increases faster than √N; but this doesn’t add enough at low N.
Dealing With Small Training Sets

Position: 1 2 3 4 5

Training set (10 sequences):
ACCTG, AGCTG, ACCCG, ACCTG, ACCCA, GACTG, ACGTA, ACCTG, CCCCG, ACATC

Counts at position 1: A 8, C 1, G 1, T 0

If the true frequency of T at pos. 1 were 10%, what’s the probability
we wouldn’t see any Ts in a sample of 10 seqs?

P(N=0) = [10!/(0! 10!)] (0.1)^0 (0.9)^10 ≈ 35%

So we should add pseudocounts.
Pseudocounts (Ψcounts)

Nt    Count  +Ψcount  Bayescount  ML est.  Bayes est.
A       8      +1          9        0.80      0.64
C       1      +1          2        0.10      0.14
G       1      +1          2        0.10      0.14
T       0      +1          1        0.00      0.07
Total  10                 14        1.0       1.0

The ‘add 1 to each observed count’ rule can be
derived analytically from the Bayesian posterior
distribution under a Dirichlet prior - see Appendix A
of the statistics primer for details.
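A small sketch reproducing the table’s two estimates (the function name and layout are mine):

```python
def estimate_frequencies(counts, pseudocount=1.0):
    """ML and pseudocount-smoothed frequency estimates for one position.

    counts: dict mapping each nucleotide to its observed count.
    pseudocount=1.0 reproduces the 'add 1' rule from the table; other
    schemes (e.g. sqrt(N) pseudocounts spread over the alphabet) just
    change this value.
    """
    n = sum(counts.values())
    n_smooth = n + pseudocount * len(counts)
    ml = {b: c / n for b, c in counts.items()}
    bayes = {b: (c + pseudocount) / n_smooth for b, c in counts.items()}
    return ml, bayes

# Position-1 counts from the training set above:
ml, bayes = estimate_frequencies({"A": 8, "C": 1, "G": 1, "T": 0})
print(ml)     # {'A': 0.8, 'C': 0.1, 'G': 0.1, 'T': 0.0}
print(bayes)  # ≈ {'A': 0.64, 'C': 0.14, 'G': 0.14, 'T': 0.07}
```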
Real World HMMs
Please see the following Web site: http://www.cbs.dtu.dk/services/TMHMM/
Reference for TMHMM: Krogh, A, B Larsson, G von Heijne, and EL Sonnhammer. "Predicting Transmembrane
Protein Topology with a Hidden Markov Model: Application to Complete Genomes." J Mol Biol. 305, no. 3 
(19 January 2001): 567-80.
Architecture of TMHMM

Please see figures 1a and 1c of:
Krogh, A, B Larsson, G von Heijne, and EL Sonnhammer. "Predicting Transmembrane Protein Topology
with a Hidden Markov Model: Application to Complete Genomes." J Mol Biol 305, no. 3 (19 January 2001): 567-80.
TMHMM Output for Mouse Chloride Channel CLC6

[Figure: TMHMM output showing the optimal parse and the posterior
probability along the protein of the inside, outside, and
transmembrane states.]
Genscan Model
(a semi-Markov HMM)

Incorporates:
Transcriptional signals
Splicing signals
Translational signals
Composition of exons
Composition of introns
Other gene features

Burge & Karlin, J Mol Biol 1997
Genscan predictions in human CD4 gene region

[Figure: annotated exons vs. Genscan predicted exons]

Overall: ~75% of exons exactly correct

Burge and Karlin, J Mol Biol 1997
Genscan, GenomeScan Predictions in Human BRCA1 Region

Please see figure 1 of:
Yeh, RF, LP Lim, and CB Burge. "Computational Inference of Homologous Gene Structures in the Human Genome."
Genome Res 11, no. 5 (May 2001): 803-16.
DNA Sequence Evolution
Generation n-1 (grandparent)
5’ TGGCATGCACCCTGTAAGTCAATATAAATGGCTACGCCTAGCCCATGCGA 3’ 
|||||||||||||||||||||||||||||||||||||||||||||||||| 
3’ ACCGTACGTGGGACATTCAGTTATATTTACCGATGCGGATCGGGTACGCT 5’ 
Generation n (parent)
5’ TGGCATGCACCCTGTAAGTCAATATAAATGGCTATGCCTAGCCCATGCGA 3’
|||||||||||||||||||||||||||||||||||||||||||||||||| 
3’ ACCGTACGTGGGACATTCAGTTATATTTACCGATACGGATCGGGTACGCT 5’

Generation n+1 (child)
5’ TGGCATGCACCCTGTAAGTCAATATAAATGGCTATGCCTAGCCCGTGCGA 3’
||||||||||||||||||||||||||||||||||||||||||||||||||
3’ ACCGTACGTGGGACATTCAGTTATATTTACCGATACGGATCGGGCACGCT 5’
What is a Markov Model (aka Markov Chain)?
Classical Definition
A discrete stochastic process X_1, X_2, X_3, … which has the Markov property:

P(X_{n+1} = j | X_1 = x_1, X_2 = x_2, …, X_n = x_n) = P(X_{n+1} = j | X_n = x_n)

(for all x_i, all j, all n)
In words: 
A random process which has the property that the future 
(next state) is conditionally independent of the past given 
the present (current state) 
Markov - a Russian mathematician, ca. 1922 
DNA Sequence Evolution is a Markov Process

No selection case

S_n = base at generation n

P_ij = P(S_{n+1} = j | S_n = i)

        | P_AA  P_AC  P_AG  P_AT |
P  =    | P_CA  P_CC  P_CG  P_CT |
        | P_GA  P_GC  P_GG  P_GT |
        | P_TA  P_TC  P_TG  P_TT |

q_n = (q_A, q_C, q_G, q_T) = vector of probabilities of the four bases at generation n

Handy relations:  q_{n+1} = q_n P    and    q_{n+k} = q_n P^k
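A minimal numeric sketch of the handy relations (the matrix entries here are illustrative, not from the lecture):

```python
import numpy as np

# Illustrative substitution matrix P; rows are the current base
# (A, C, G, T) and each row sums to 1. Entries are made up.
P = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
])

q = np.array([0.1, 0.4, 0.4, 0.1])  # base composition at generation n

q_next  = q @ P                                # q_{n+1} = q_n P
q_later = q @ np.linalg.matrix_power(P, 100)   # q_{n+k} = q_n P^k
print(q_next)   # [0.106 0.394 0.394 0.106]
print(q_later)  # already close to the uniform stationary distribution (0.25, ...)
```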
Limit Theorem for Markov Chains 
S_n = base at generation n

P_ij = P(S_{n+1} = j | S_n = i)

If P_ij > 0 for all i, j (and Σ_j P_ij = 1 for all i),
then there is a unique probability vector r such that

r P = r

and (for any probability vector q)   lim_{n→∞} q P^n = r

r is called the “stationary” or “limiting” distribution of P.

See Ch. 4 of Taylor & Karlin, An Introduction to Stochastic Modeling, 1984 for details.
Stationary Distribution Examples

2-letter alphabet: R = purine, Y = pyrimidine

Stationary distributions for:

I = | 1  0 |        Q = | 0  1 |
    | 0  1 |            | 1  0 |

P = | 1-p   p  |     0 < p < 1
    |  p   1-p |

P′ = | 1-p   p  |    0 < p < 1, 0 < q < 1
     |  q   1-q |
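A numeric sketch for the last two matrices (for P′ the stationary vector works out to (q/(p+q), p/(p+q)); note that I and Q violate the P_ij > 0 condition, so the limit theorem doesn’t apply to them). The helper below finds a stationary vector as a left eigenvector; the values of p and q are my illustrative choices:

```python
import numpy as np

def stationary(P):
    """Left eigenvector of P with eigenvalue 1, normalized to sum to 1."""
    w, v = np.linalg.eig(P.T)
    vec = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return vec / vec.sum()

p, q = 0.3, 0.1  # illustrative values in (0, 1)
P      = np.array([[1 - p, p], [p, 1 - p]])
Pprime = np.array([[1 - p, p], [q, 1 - q]])
print(stationary(P))       # [0.5 0.5]: symmetric switching gives the uniform distribution
print(stationary(Pprime))  # [0.25 0.75] = (q/(p+q), p/(p+q))
```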
How does entropy change when a Markov transition matrix is applied?
If limiting distribution is uniform, then entropy increases 
(analogous to 2nd Law of Thermodynamics) 
However, this is not true in general (why not?)
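One way to see why not, as a numeric sketch (the matrix is my example): a chain whose stationary distribution is non-uniform can drive a uniform starting distribution to lower entropy.

```python
import numpy as np

def entropy_bits(q):
    """Shannon entropy (in bits) of a probability vector."""
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

# The stationary distribution of this chain is (0.25, 0.75), not uniform:
P = np.array([[0.7, 0.3], [0.1, 0.9]])
q = np.array([0.5, 0.5])       # uniform start: entropy = 1 bit
for _ in range(5):
    q = q @ P
    print(q, entropy_bits(q))  # entropy decreases toward H(0.25, 0.75) ≈ 0.81
```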
How rapidly is the stationary distribution approached?
Jukes-Cantor Model 
Courtesy of M. Yaffe
[Figure: the four bases A, C, G, T with arrows between every pair, each labeled with rate α]

Assume each nucleotide is equally likely to change into any other nt,
with rate of change = α.

Overall rate of substitution = 3α

…so if G at t=0, then at t=1: P_G(1) = 1 - 3α,
and P_G(2) = (1 - 3α) P_G(1) + α [1 - P_G(1)]

Expanding this gives P_G(t) = 1/4 + (3/4) e^{-4αt}

Can show that this gives K = -(3/4) ln[1 - (4/3) p]

K = true number of substitutions that have occurred,
p = fraction of nt that differ by a simple count.

Captures general behaviour…
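As a small sketch of the correction formula in use (the function name is mine):

```python
import math

def jukes_cantor_K(p):
    """Jukes-Cantor estimate K = -(3/4) ln[1 - (4/3) p] of the true
    number of substitutions per site, given the observed fraction p of
    sites that differ (valid for p < 3/4)."""
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

# If 10% of aligned sites differ, the corrected distance slightly
# exceeds the raw count, accounting for multiple hits at one site:
print(jukes_cantor_K(0.10))  # ≈ 0.107
```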
Literature Discussion Tues. 3/16

Paper #1:
Kellis, M, N Patterson, M Endrizzi, B Birren, and ES Lander. "Sequencing and Comparison of Yeast Species
to Identify Genes and Regulatory Elements." Nature 423, no. 6937 (15 May 2003): 241-54.
Part 1 - Finding Genes, etc., pp. 241-247
Part 2 - Regulatory Elements, pp. 247-254

Paper #2:
Rivas, E, RJ Klein, TA Jones, and SR Eddy. "Computational Identification of Noncoding RNAs in E. coli by
Comparative Genomics." Curr Biol 11, no. 17 (4 September 2001): 1369-73.

