7.91 / 7.36 / BE.490
Lecture #3
Mar. 2, 2004
DNA Motif Modeling & Discovery
Chris Burge
Review of DNA Seq. Comparison/Alignment
• Target frequencies and mismatch penalties
• Eukaryotic gene structure
• Comparative genomics applications:
  - Pipmaker (2 species comparison)
  - Phylogenetic Shadowing (many species)
• Intro to DNA sequence motifs
Organization of Topics

Model                             Dependence Structure    Lecture
Weight Matrix Model               Independence            3/2
Hidden Markov Model               Local Dependence        3/4
Energy Model, Covariation Model   Non-local Dependence    3/9
DNA Motif Modeling & Discovery
• Information Content of a Motif (see Ch. 4 of Mount)
• Review - WMMs for splice sites
• The Motif Finding/Discovery Problem
• The Gibbs Sampler
• Motif Modeling - Beyond Weight Matrices
Splicing Model I
[Figure: pre-mRNA diagram marking the 5' splice site, branch site, and 3' splice site]
Weight Matrix Models II

5' splice signal:

Con:    C     A     G    …    G     T
Pos:   -3    -2    -1    …   +5    +6
A      0.3   0.6   0.1   …   0.1   0.1
C      0.4   0.1   0.0   …   0.1   0.2
G      0.2   0.2   0.8   …   0.8   0.2
T      0.1   0.1   0.1   …   0.0   0.5

Background (generic, all positions):
A 0.25, C 0.25, G 0.25, T 0.25
S = S_1 S_2 S_3 S_4 S_5 S_6 S_7 S_8 S_9

Odds Ratio:

R = P(S|+) / P(S|-)
  = [P_{-3}(S_1) P_{-2}(S_2) P_{-1}(S_3) ··· P_{5}(S_8) P_{6}(S_9)] / [P_{bg}(S_1) P_{bg}(S_2) P_{bg}(S_3) ··· P_{bg}(S_8) P_{bg}(S_9)]
Background model is homogeneous and assumes independence between positions.
Weight Matrix Models III

S = S_1 S_2 S_3 S_4 S_5 S_6 S_7 S_8 S_9

Odds Ratio:

R = P(S|+) / P(S|-)
  = [P_{-3}(S_1) P_{-2}(S_2) P_{-1}(S_3) ··· P_{5}(S_8) P_{6}(S_9)] / [P_{bg}(S_1) P_{bg}(S_2) P_{bg}(S_3) ··· P_{bg}(S_8) P_{bg}(S_9)]
  = ∏_{k=1}^{9} P_{-4+k}(S_k) / P_{bg}(S_k)

Score s = log_2(R) = Σ_{k=1}^{9} log_2( P_{-4+k}(S_k) / P_{bg}(S_k) )
Neyman-Pearson Lemma: 
Optimal decision rules are of the form R > C
Equivalently: log_2(R) > C′, because log is a monotone function.
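To make the scoring concrete, here is a minimal Python sketch (my illustration, not from the lecture). It uses only the five 5'ss columns visible in the table above (-3, -2, -1, +5, +6); the middle columns are elided as "…" on the slide, so this toy matrix is 5 positions wide instead of the real 9.

```python
import math

# Toy weight matrix: the five 5'ss columns shown on the slide.
# A real scorer would use all nine positions (-3..+6).
WMM = [
    {'A': 0.3, 'C': 0.4, 'G': 0.2, 'T': 0.1},   # pos -3
    {'A': 0.6, 'C': 0.1, 'G': 0.2, 'T': 0.1},   # pos -2
    {'A': 0.1, 'C': 0.0, 'G': 0.8, 'T': 0.1},   # pos -1
    {'A': 0.1, 'C': 0.1, 'G': 0.8, 'T': 0.0},   # pos +5
    {'A': 0.1, 'C': 0.2, 'G': 0.2, 'T': 0.5},   # pos +6
]
BG = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}  # homogeneous background

def log_odds_score(site: str) -> float:
    """s = log2(R) = sum over positions k of log2(P_k(S_k) / P_bg(S_k))."""
    score = 0.0
    for pos, base in enumerate(site):
        p = WMM[pos][base]
        if p == 0.0:
            return float('-inf')  # base never observed at this position
        score += math.log2(p / BG[base])
    return score

print(log_odds_score('CAGGT'))  # consensus-like site, about 6.3 bits
```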
Weight Matrix Models IV
Slide WMM along sequence: 
ttgacctagatgagatgtcgttcacttttactgagctacagaaaa
…… 
Assign a score to each 9-base window.
Use a score cutoff to predict potential 5' splice sites.
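A sliding scan can be written as below (again my own illustration; it reuses WMM, BG, and log_odds_score from the previous sketch, so the window width here is the toy matrix's 5 rather than the real 9).

```python
def scan(seq: str, width: int, cutoff: float):
    """Slide the matrix along seq and yield (position, score)
    for every width-base window scoring above the cutoff."""
    seq = seq.upper()
    for i in range(len(seq) - width + 1):
        s = log_odds_score(seq[i:i + width])
        if s > cutoff:
            yield i, s

for pos, s in scan("ttgacctagatgagatgtcgttcac", width=5, cutoff=0.0):
    print(pos, round(s, 2))   # e.g. the window 'TAGAT' at position 6
```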
Histogram of 5'ss Scores
[Figure: overlaid score histograms, in 1/10 bit units, for true 5' splice sites vs "decoy" 5' splice sites]
Measuring Accuracy:
Sensitivity (Sn) = % of true sites with score > cutoff
Specificity (Sp) = % of sites with score > cutoff that are true sites

Sn:  20%   50%   90%
Sp:  50%   32%    7%
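In code these two measures are a few lines (an illustrative helper of my own; true_scores and decoy_scores are hypothetical lists of WMM scores for true and decoy sites):

```python
def sn_sp(true_scores, decoy_scores, cutoff):
    """Sensitivity: fraction of true sites above cutoff.
    Specificity (as defined above): fraction of above-cutoff sites
    that are true sites."""
    tp = sum(s > cutoff for s in true_scores)   # true sites called
    fp = sum(s > cutoff for s in decoy_scores)  # decoys called
    sn = tp / len(true_scores)
    sp = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return sn, sp
```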
What does this result tell us?
A) Splicing machinery also uses other information 
besides 5’ss motif to identify splice sites;
OR
B) WMM model does not accurately capture some 
aspects of the 5’ss that are used in recognition 
(or both) 
This is a pretty common situation in biology 
What is a DNA (RNA) Motif ?
A pattern common to a set of DNA (RNA) 
sequences that share a common biological property, 
such as being binding sites for a regulatory protein 
Common motif adjectives:
exact/precise versus degenerate
strong versus weak (good versus lousy)
high information content versus low information content
Information Theory 
So we end up with Shannon’s famous formula:
H = -Σ_{i=1}^{20} p_i log_2(p_i)

where H is the "Shannon entropy", in bits per position in the alignment.
What does this mean???
H is a measure of entropy or randomness or disorder 
….it tells us how much uncertainty there is for the different 
amino acid abundances at one position in a sequence motif 
This slide courtesy of M. Yaffe
Information Theory 
Courtesy of M. Yaffe 
….GDSFHQFVSHG…..
….SDAFHQYISFG…..
….GDSYWNFLSFG…..
….SDSFHQFMSFG…..
….LDSYWNYASFG…..
Assuming all 20 amino acids are equally possible:
H_before = 4.32, H_after = 0 (e.g., for the invariant D column above)
Therefore, this position encodes 4.32 - 0 = 4.32 bits of information!

For another position in the motif that contains all 20 amino acids:
H_before = 4.32, H_after = 4.32
Therefore, this position encodes 4.32 - 4.32 = 0 bits of information!
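These numbers can be checked with a few lines of Python (my illustration; note log_2(20) ≈ 4.32):

```python
import math

def shannon_entropy(probs):
    """H = -sum_i p_i * log2(p_i), in bits; 0 * log(0) treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_before = shannon_entropy([1 / 20] * 20)  # all 20 residues equally likely
h_after = shannon_entropy([1.0])           # invariant position
print(h_before, h_after, h_before - h_after)  # ~4.32, 0.0, ~4.32 bits
```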
Information Content of a DNA Motif

Information at position j:   I_j = H_before - H_after

Motif probabilities:        p_k        (k = A, C, G, T)
Background probabilities:   q_k = 1/4  (k = A, C, G, T)

I_j = ( -Σ_{k=1}^{4} q_k log_2(q_k) ) - ( -Σ_{k=1}^{4} p_k log_2(p_k) ) = 2 - H_j

I_motif = Σ_{j=1}^{w} I_j = 2w - H_motif   (motif of width w bases)

Log base 2 gives entropy/information in 'bits'
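As a sketch (my own illustration), the per-position information of a DNA motif column follows directly from the formula above; the example column is position -1 of the 5'ss matrix shown earlier:

```python
import math

def position_information(p: dict) -> float:
    """I_j = H_before - H_after = 2 - H_j for DNA with uniform background."""
    h_j = -sum(pk * math.log2(pk) for pk in p.values() if pk > 0)
    return 2.0 - h_j

# Position -1 of the 5'ss matrix above (mostly G):
print(position_information({'A': 0.1, 'C': 0.0, 'G': 0.8, 'T': 0.1}))  # ~1.08 bits

# I_motif is simply the sum of position_information over all w columns.
```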
Mean Bit-score of a Motif

Bit-score:  log_2( p_k / q_k )

Mean bit-score (motif width w, n = 4^w, q_k = 1/4^w):

Σ_{k=1}^{n} p_k log_2( p_k / q_k ) = 2w - H_motif = I_motif

Rule of thumb*: a motif with m bits of information will occur
about once every 2^m bases of random sequence.

* True for regular expressions, approx. true for other motifs
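For example (a worked case, not from the slides): a 10-bit motif is expected about once per 2^10 = 1,024 bases of random sequence, i.e. roughly 3 million chance occurrences in a 3 Gb genome, while a motif would need log_2(3×10^9) ≈ 31.5 bits to be expected only about once in a genome of that size.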
The Motif Finding Problem
Unaligned 
agggcactagcccatgtgagagggcaaggaccagcggaag
taattcagggccaggatgtatctttctcttaaaaataaca
tatcctacagatgatgaatgcaaatcagcgtcacgagctt
tggcgggcaaggtgcttaaaagataatatcgaccctagcg
attcgggtaccgttcataaaagtacgggaatttcgggtag
gttatgttaggcgagggcaaaagtcatatacttttaggtc
aagagggcaatgcctcctctgccgattcggcgagtgatcg
gatggggaaaatatgagaccaggggagggccacactgcag
ctgccgggctaacagacacacgtctagggctgtgaaatct
gtaggcgccgaggccaacgctgagtgtcgatgttgagaac
attagtccggttccaagagggcaactttgtatgcaccgcc
gcggcccagtgcgcaacgcacagggcaaggtttactgcgg
ccacatgcgagggcaacctccctgtgttgggcggttctga
gcaattgtaaaacgacggcaatgttcggtcgcctaccctg
gataaagaggggggtaggaggtcaactcttccgtattaat
aggagtagagtagtgggtaaactacgaatgcttataacat
gcgagggcaatcgggatctgaaccttctttatgcgaagac
tccaggaggaggtcaacgactctgcatgtctgacaacttg
gtcatagaattccatccgccacgcggggtaatttggacgt
gtgccaacttgtgccggggggctagcagcttcccgtcaaa
cgcgtttggagtgcaaacatacacagcccgggaatataga
aagatacgagttcgatttcaagagttcaaaacgtgacggg
gacgaaacgagggcgatcaatgcccgataggactaataag
tagtacaaacccgctcacccgaaaggagggcaaatacctt
atatacagccaggggagacctataactcagcaaggttcag
cgtatgtactaattgtggagagcaaatcattgtccacgtg
…
Aligned 
gcggaagagggcactagcccatgtgagagggcaaggacca
atctttctcttaaaaataacataattcagggccaggatgt
gtcacgagctttatcctacagatgatgaatgcaaatcagc
taaaagataatatcgaccctagcgtggcgggcaaggtgct
gtagattcgggtaccgttcataaaagtacgggaatttcgg
tatacttttaggtcgttatgttaggcgagggcaaaagtca
ctctgccgattcggcgagtgatcgaagagggcaatgcctc
aggatggggaaaatatgagaccaggggagggccacactgc
acacgtctagggctgtgaaatctctgccgggctaacagac
gtgtcgatgttgagaacgtaggcgccgaggccaacgctga
atgcaccgccattagtccggttccaagagggcaactttgt
ctgcgggcggcccagtgcgcaacgcacagggcaaggttta
tgtgttgggcggttctgaccacatgcgagggcaacctccc
gtcgcctaccctggcaattgtaaaacgacggcaatgttcg
cgtattaatgataaagaggggggtaggaggtcaactcttc
aatgcttataacataggagtagagtagtgggtaaactacg
tctgaaccttctttatgcgaagacgcgagggcaatcggga
tgcatgtctgacaacttgtccaggaggaggtcaacgactc
cgtgtcatagaattccatccgccacgcggggtaatttgga
tcccgtcaaagtgccaacttgtgccggggggctagcagct
acagcccgggaatatagacgcgtttggagtgcaaacatac
acgggaagatacgagttcgatttcaagagttcaaaacgtg
cccgataggactaataaggacgaaacgagggcgatcaatg
ttagtacaaacccgctcacccgaaaggagggcaaatacct
agcaaggttcagatatacagccaggggagacctataactc
gtccacgtgcgtatgtactaattgtggagagcaaatcatt
…
Motif Finding Example: The Gibbs Sampler

The Gibbs sampler is a Monte Carlo method that seeks to maximize a
likelihood function over the input sequence data.

The likelihood function for a sequence s with a motif at location A,
given the weight matrix Θ and the background frequency vector θ_B:

P(s, A | Θ, θ_B) = θ_{B,a} × ··· × θ_{B,g} × Θ_{1,t} × Θ_{2,a} × ··· × Θ_{8,c} × θ_{B,t} × ··· × θ_{B,t}

s = "actactgtatcgtactgactgattaggccatgactgcat"

(the bases before and after the motif are scored with the background
frequencies θ_B; the bases of the motif, starting at location A, are
scored with the corresponding columns of Θ)
Lawrence et al. Science 1993 
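Transcribed directly into Python (a sketch of my own; the names theta and theta_B follow the slide's notation, and the computation is done in log space to avoid underflow):

```python
import math

def log_likelihood(s: str, A: int, theta: list, theta_B: dict) -> float:
    """log P(s, A | Theta, theta_B): columns of theta score the W motif
    bases starting at A; theta_B scores every other base."""
    W = len(theta)
    total = 0.0
    for i, base in enumerate(s):
        p = theta[i - A][base] if A <= i < A + W else theta_B[base]
        total += math.log(p)
    return total
```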
Prepare Yourself for…
Featuring the Gibbs Sampling Algorithm in:
• Pictures
• Words
• Movies
Gibbs Sampling Algorithm I
1. Select a random position in each sequence 
[Figure: sequence set with one randomly chosen motif instance marked in each sequence]
Gibbs Sampling Algorithm II
2. Build a weight matrix Θ
Gibbs Sampling Algorithm III
3. Select a sequence at random
Gibbs Sampling Algorithm IV
4. Score possible sites in seq using weight matrix
[Figure: likelihood (probability) profile over candidate sites, scored with the weight matrix Θ]
Gibbs Sampling Algorithm V
5. Sample a new site proportional to likelihood
Gibbs Sampling Algorithm VI
6. Update weight matrix
Gibbs Sampling Algorithm VII
7. Iterate until convergence (no change in sites/Θ)
The Gibbs Sampling Algorithm In Words I
Given N sequences of length L and desired motif width W:
Step 1) Choose a starting position in each sequence at random:
a_1 in seq 1, a_2 in seq 2, …, a_N in sequence N
Step 2) Choose a sequence at random from the set (say, seq 1). 
Step 3) Make a weight matrix model of width W from the sites 
in all sequences except the one chosen in step 2. 
Step 4) Assign a probability to each position in seq 1 using the 
weight matrix model constructed in step 3: 
p = { p_1, p_2, p_3, …, p_{L-W+1} }
Lawrence et al., Science 1993
The Gibbs Sampling Algorithm In Words II
Given N sequences of length L and desired motif width W: 
Step 5) Sample a starting position in seq 1 based on this
probability distribution and set a_1 to this new position.
Step 6) Choose a sequence at random from the set (say, seq 2). 
Step 7) Make a weight matrix model of width W from the sites 
in all sequences except the one chosen in step 6. 
Step 8) Assign a probability to each position in seq 2 using the 
weight matrix model constructed in step 7. 
Step 9) Sample a starting position in seq 2 based on this dist. 
Step 10) Repeat until convergence 
Lawrence et al., Science 1993
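Putting steps 1-10 together, here is a minimal Python sketch (my simplification of the Lawrence et al. 1993 algorithm, not their code). It assumes lowercase a/c/g/t sequences, folds the uniform background into the normalization so sites are sampled in proportion to their weight-matrix likelihood, adds pseudocounts to avoid zero frequencies, and runs a fixed number of iterations instead of testing convergence.

```python
import random
from collections import Counter

BASES = "acgt"

def weight_matrix(sites, W, pseudocount=1.0):
    """Column-wise base frequencies of the chosen sites, with pseudocounts."""
    theta = []
    for j in range(W):
        counts = Counter(site[j] for site in sites)
        total = len(sites) + 4 * pseudocount
        theta.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return theta

def site_probs(seq, theta):
    """Relative likelihood of the motif starting at each position of seq."""
    W = len(theta)
    scores = []
    for i in range(len(seq) - W + 1):
        p = 1.0
        for j in range(W):
            p *= theta[j][seq[i + j]]
        scores.append(p)
    total = sum(scores)
    return [s / total for s in scores]

def gibbs_sampler(seqs, W, n_iter=2000):
    # Step 1: choose a random starting position a_i in every sequence
    a = [random.randrange(len(s) - W + 1) for s in seqs]
    for _ in range(n_iter):
        # Steps 2/6: choose a sequence at random
        z = random.randrange(len(seqs))
        # Steps 3/7: weight matrix from the sites of all other sequences
        sites = [seqs[i][a[i]:a[i] + W] for i in range(len(seqs)) if i != z]
        theta = weight_matrix(sites, W)
        # Steps 4/8: probability of each position in the left-out sequence
        probs = site_probs(seqs[z], theta)
        # Steps 5/9: sample a new site proportional to its probability
        a[z] = random.choices(range(len(probs)), weights=probs)[0]
    return a  # motif start positions after the final iteration
```

Because the sampler is stochastic, one would run it several times and keep the solution with the highest information content, as the summary slide below advises.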
What, if anything, does this algorithm 
accomplish (besides keeping your computer 
busy)? 
Input Sequences with Strong Motif
[Figure/movie: cumulative frequency (0%-100%) of bases A, C, G, T at the sampled positions as the run proceeds]
Input Sequences (Weak Motif)
gcggaagagggcactagcccatgtgagagggcaaggacca
atctttctcttaaaaataacataattcagggccaggatgt
gtcacgagctttatcctacagatgatgaatgcaaatcagc
taaaagataatatcgaccctagcgtggcgggcaaggtgct
gtagattcgggtaccgttcataaaagtacgggaatttcgg
tatacttttaggtcgttatgttaggcgagggcaaaagtca
ctctgccgattcggcgagtgatcgaagagggcaatgcctc
aggatggggaaaatatgagaccaggggagggccacactgc
acacgtctagggctgtgaaatctctgccgggctaacagac
gtgtcgatgttgagaacgtaggcgccgaggccaacgctga
atgcaccgccattagtccggttccaagagggcaactttgt
ctgcgggcggcccagtgcgcaacgcacagggcaaggttta
tgtgttgggcggttctgaccacatgcgagggcaacctccc
gtcgcctaccctggcaattgtaaaacgacggcaatgttcg
cgtattaatgataaagaggggggtaggaggtcaactcttc
aatgcttataacataggagtagagtagtgggtaaactacg
tctgaaccttctttatgcgaagacgcgagggcaatcggga
tgcatgtctgacaacttgtccaggaggaggtcaacgactc
cgtgtcatagaattccatccgccacgcggggtaatttgga
tcccgtcaaagtgccaacttgtgccggggggctagcagct
acagcccgggaatatagacgcgtttggagtgcaaacatac
acgggaagatacgagttcgatttcaagagttcaaaacgtg
cccgataggactaataaggacgaaacgagggcgatcaatg
ttagtacaaacccgctcacccgaaaggagggcaaatacct
agcaaggttcagatatacagccaggggagacctataactc
gtccacgtgcgtatgtactaattgtggagagcaaatcatt
…
Gibbs Sampler Summary
• A stochastic (Monte Carlo) algorithm for motif finding
• Works by 'stumbling' onto a few motif instances, which bias the
  weight matrix, which causes it to sample more motif instances,
  which biases the weight matrix more, … until convergence
• Not guaranteed to converge to the same motif every time →
  run several times, compare results
• Works for protein, DNA, RNA motifs
MEME - Multiple EM for Motif Elicitation
• Another popular motif finding algorithm → optimizes a similar
  likelihood function using an algorithm called
  'expectation maximization' (EM)
• Unlike the Gibbs Sampler, MEME is deterministic
Bailey & Elkan, Proc. ISMB, 1994 
How is the 5'ss recognized?

U1 snRNA    3'-………CCAUUCAUAG-5'
                     ||||||
Pre-mRNA    5'-…………UUCGUGAGU……………-3'
RNA Energetics I

   …CCAUUCAUAG-5'
      ||||||
5'…CGUGAGU………3'

Free energy of helix formation derives from:
• base pairing: G:C > A:U > G:U
• base stacking, e.g. the stack 5'-GpA-3' / 3'-CpU-5'

Doug Turner's energy rules give stacking free energies (kcal/mol)
for a stack of the form:

   5'-U X-3'
      | |
   3'-A Y-5'

        Y:    A      C      G      U
   X = A      .      .      .    -1.30
   X = C      .      .    -2.40    .
   X = G      .    -2.10    .    -1.00
   X = U    -0.90    .    -1.30    .

("." = combinations where X·Y is not a base pair)
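As an illustration (my sketch, not part of the lecture), a nearest-neighbor estimate of duplex stability just sums stacking increments over consecutive base pairs. Only the 5'-UX-3'/3'-AY-5' stacks shown in the table above are included here; the full Turner rules tabulate all stacks plus initiation and loop penalties.

```python
# Partial stacking table from the slide, keyed by
# (top dinucleotide 5'->3', bottom dinucleotide 3'->5'), in kcal/mol.
STACKS = {
    ('UA', 'AU'): -1.30, ('UC', 'AG'): -2.40,
    ('UG', 'AC'): -2.10, ('UG', 'AU'): -1.00,  # wobble G.U
    ('UU', 'AA'): -0.90, ('UU', 'AG'): -1.30,  # wobble U.G
}

def duplex_dG(top: str, bottom: str) -> float:
    """Sum stacking terms over aligned strands (top read 5'->3',
    bottom read 3'->5'). Stacks absent from this partial table add 0."""
    dG = 0.0
    for i in range(len(top) - 1):
        dG += STACKS.get((top[i:i + 2], bottom[i:i + 2]), 0.0)
    return dG

print(duplex_dG("UA", "AU"))  # single U.A-on-A.U stack: -1.30 kcal/mol
```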
RNA Energetics II

A)  N p N p N p N p N p N p N
    x | | | | x x
    N p N p N p N p N p N p N
    Lots of consecutive base pairs - good

B)  N p N p N p N p N p N p N
    x | | x | | x
    N p N p N p N p N p N p N
    Internal loop - bad

C)  N p N p N p N p N p N p N
    x | | | x | x
    N p N p N p N p N p N p N
    Terminal base pair not stable - bad

Generally A will be more stable than B or C
Conditional Frequencies in 5'ss Sequences

5'ss which have G at +5 (frequencies, %):

Pos   -1   +3   +4   +6
A      9   44   75   14
C      4    3    4   18
G     78   51   13   19
T      9    3    9   49

5'ss which lack G at +5 (frequencies, %):

Pos   -1   +3   +4   +6
A      2   81   51   22
C      1    3   28   20
G     97   15    9   30
T      0    2   12   28

Note that sites lacking G at +5 match the consensus more strongly
at -1 and +3: the positions are not independent.
Data from Burge, 1998 “Computational Methods in Molecular Biology”
What kind of model could 
incorporate interactions 
between positions? 