7.91 / 7.36 / BE.490
Lecture #3
Mar. 2, 2004
DNA Motif Modeling
&
Discovery
Chris Burge
Review of DNA Seq. Comparison/Alignment
• Target frequencies and mismatch penalties
• Eukaryotic gene structure
• Comparative genomics applications:
  - Pipmaker (2 species comparison)
  - Phylogenetic Shadowing (many species)
• Intro to DNA sequence motifs
Organization of Topics

Model                              Dependence Structure     Lecture
Weight Matrix Model                Independence             3/2
Hidden Markov Model                Local Dependence         3/4
Energy Model, Covariation Model    Non-local Dependence     3/9
DNA Motif Modeling & Discovery
• Information Content of a Motif (see Ch. 4 of Mount)
• Review - WMMs for splice sites
• The Motif Finding/Discovery Problem
• The Gibbs Sampler
• Motif Modeling - Beyond Weight Matrices
Splicing Model I

[Figure: pre-mRNA splicing signals - the 5’ splice site, branch site, and 3’ splice site]
Weight Matrix Models II

5’ splice signal (consensus: C A G … G T):

Pos   -3    -2    -1    …    +5    +6
A    0.3   0.6   0.1    …   0.1   0.1
C    0.4   0.1   0.0    …   0.1   0.2
G    0.2   0.2   0.8    …   0.8   0.2
T    0.1   0.1   0.1    …   0.0   0.5

Background (generic, all positions):

A 0.25   C 0.25   G 0.25   T 0.25

For a candidate site S = S_1 S_2 S_3 S_4 S_5 S_6 S_7 S_8 S_9:

Odds Ratio:  R = P(S|+) / P(S|-)
               = [ P_{-3}(S_1) P_{-2}(S_2) P_{-1}(S_3) ⋯ P_{+5}(S_8) P_{+6}(S_9) ]
                 / [ P_bg(S_1) P_bg(S_2) P_bg(S_3) ⋯ P_bg(S_8) P_bg(S_9) ]

Background model homogeneous, assumes independence
Weight Matrix Models III

S = S_1 S_2 S_3 S_4 S_5 S_6 S_7 S_8 S_9

Odds Ratio:  R = P(S|+) / P(S|-)
               = [ P_{-3}(S_1) P_{-2}(S_2) P_{-1}(S_3) ⋯ P_{+5}(S_8) P_{+6}(S_9) ]
                 / [ P_bg(S_1) P_bg(S_2) P_bg(S_3) ⋯ P_bg(S_8) P_bg(S_9) ]
               = ∏_{k=1}^{9} P_{-4+k}(S_k) / P_bg(S_k)

Score:  s = log_2 R = ∑_{k=1}^{9} log_2( P_{-4+k}(S_k) / P_bg(S_k) )

Neyman-Pearson Lemma:
Optimal decision rules are of the form R > C
Equivalently: log_2(R) > C’ because log is a monotone function
Weight Matrix Models IV
Slide WMM along sequence:
ttgacctagatgagatgtcgttcacttttactgagctacagaaaa
……
Assign score to each 9 base window.
Use score cutoff to predict potential 5’ splice sites
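The sliding-window scoring above can be sketched in a few lines. The matrix columns below come from the Weight Matrix Models II slide; the elided middle columns (shown as "…" there) are filled with uniform 0.25 placeholders, which is an assumption for illustration only.

```python
import math

# Toy 9-position weight matrix for the 5' splice signal (positions -3..+6).
# Only the printed columns are from the slide; the middle ones are
# uniform placeholders.
WMM = {
    'A': [0.3, 0.6, 0.1, 0.25, 0.25, 0.25, 0.25, 0.1, 0.1],
    'C': [0.4, 0.1, 0.0, 0.25, 0.25, 0.25, 0.25, 0.1, 0.2],
    'G': [0.2, 0.2, 0.8, 0.25, 0.25, 0.25, 0.25, 0.8, 0.2],
    'T': [0.1, 0.1, 0.1, 0.25, 0.25, 0.25, 0.25, 0.0, 0.5],
}
BG = 0.25  # homogeneous, independent background

def score(window):
    """log2 odds score s = sum_k log2(P_k(S_k)/P_bg(S_k));
    -inf if any motif probability is zero."""
    s = 0.0
    for k, base in enumerate(window.upper()):
        p = WMM[base][k]
        if p == 0.0:
            return float('-inf')
        s += math.log2(p / BG)
    return s

def scan(seq, cutoff=0.0):
    """Slide the WMM along seq, scoring every 9-base window;
    return (position, window, score) for windows above the cutoff."""
    hits = []
    for i in range(len(seq) - 9 + 1):
        w = seq[i:i + 9]
        s = score(w)
        if s > cutoff:
            hits.append((i, w, s))
    return hits
```

The cutoff trades sensitivity against specificity, exactly as in the Sn/Sp table that follows.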
Histogram of 5’ss Scores

[Figure: score histograms for true 5’ splice sites vs. “decoy” 5’ splice sites; x-axis: score (1/10 bit units)]
Measuring Accuracy:
Sensitivity = % of true sites w/ score > cutoff
Specificity = % of sites w/ score > cutoff
that are true sites
Sn: 20% 50% 90%
Sp: 50% 32% 7%
What does this result tell us?
A) Splicing machinery also uses other information
besides 5’ss motif to identify splice sites;
OR
B) WMM model does not accurately capture some
aspects of the 5’ss that are used in recognition
(or both)
This is a pretty common situation in biology
What is a DNA (RNA) Motif ?
A pattern common to a set of DNA (RNA)
sequences that share a common biological property,
such as being binding sites for a regulatory protein
Common motif adjectives:
exact/precise versus degenerate
strong versus weak (good versus lousy)
high information content versus low information content
Information Theory
So we end up with Shannon’s famous formula:

H = - ∑_{i=1}^{20} p_i log_2(p_i)

where H = the “Shannon Entropy”, in bits per position in the alignment
What does this mean???
H is a measure of entropy or randomness or disorder
….it tells us how much uncertainty there is for the different
amino acid abundances at one position in a sequence motif
This slide courtesy of M. Yaffe
Information Theory
Courtesy of M. Yaffe
….GDSFHQFVSHG…..
….SDAFHQYISFG…..
….GDSYWNFLSFG…..
….SDSFHQFMSFG…..
….LDSYWNYASFG…..
Assuming all 20 amino acids equally possible:
H_before = 4.32, H_after = 0
Therefore, this position encodes 4.32 - 0 = 4.32 bits of information!
Another position in the motif that contains all 20 amino acids…
H_before = 4.32, H_after = 4.32
Therefore, this position encodes 4.32 - 4.32 = 0 bits of information!
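The entropy bookkeeping above can be checked numerically; a minimal sketch (the "DDDDD" column is the invariant D at the second position of the peptides in the alignment above):

```python
import math
from collections import Counter

def shannon_entropy(column):
    """H = -sum_i p_i log2(p_i) over the residues observed in one
    alignment column, in bits per position."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(column).values())

# With all 20 amino acids equally likely, H_before = log2(20) ~ 4.32 bits.
h_before = math.log2(20)

# The invariant D column is perfectly conserved, so H_after = 0 and the
# position encodes 4.32 - 0 = 4.32 bits of information.
h_after = shannon_entropy("DDDDD")
bits = h_before - h_after
```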
Information Content of a DNA Motif

Information at position j:  I_j = H_before - H_after

Motif probabilities:       p_k  (k = A, C, G, T)
Background probabilities:  q_k = 1/4  (k = A, C, G, T)

I_j = - ∑_{k=1}^{4} q_k log_2(q_k) - ( - ∑_{k=1}^{4} p_k log_2(p_k) ) = 2 - H_j

I_motif = ∑_{j=1}^{w} I_j = 2w - H_motif    (motif of width w bases)
Log base 2 gives entropy/information in ‘bits’
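The per-position calculation can be sketched as follows, assuming the uniform background q_k = 1/4 from the slide (so H_before = 2 bits):

```python
import math

def column_information(p):
    """I_j = 2 - H_j (bits) for one DNA motif column, where
    H_j = -sum_k p_k log2(p_k) and the uniform background
    contributes H_before = 2 bits."""
    h = -sum(pk * math.log2(pk) for pk in p.values() if pk > 0)
    return 2.0 - h

# Position -1 of the 5'ss weight matrix shown earlier:
i_minus1 = column_information({'A': 0.1, 'C': 0.0, 'G': 0.8, 'T': 0.1})

# An uninformative (uniform) position contributes 0 bits:
i_flat = column_information({'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25})
```

Summing `column_information` over all w columns gives I_motif = 2w - H_motif.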
Mean Bit-score of a Motif

Bit-score:  log_2( p_k / q_k )

Mean bit-score (motif width w, n = 4^w, q_k = (1/4)^w):

∑_{k=1}^{n} p_k log_2( p_k / q_k ) = 2w - H_motif = I_motif
Rule of thumb*: motif w/ m bits of information will occur
about once every 2^m bases of random sequence
* True for regular expressions, approx. true for other motifs
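The rule of thumb is easy to check empirically for an exact word. An exact 6-mer carries 2 bits/base × 6 bases = 12 bits, so it should appear about once every 2^12 = 4096 bases of random sequence; "GAATTC" below is an arbitrary example 6-mer, not one from the slides.

```python
import random

# Simulate random sequence and count occurrences of an exact 6-mer,
# which a 12-bit motif predicts at a rate of about 1 per 2**12 bases.
random.seed(0)
n = 2_000_000
seq = ''.join(random.choices('ACGT', k=n))
hits = seq.count('GAATTC')
expected = n / 2**12  # roughly 488 occurrences expected in 2 Mb
```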
The Motif Finding Problem
Unaligned
agggcactagcccatgtgagagggcaaggaccagcggaag
taattcagggccaggatgtatctttctcttaaaaataaca
tatcctacagatgatgaatgcaaatcagcgtcacgagctt
tggcgggcaaggtgcttaaaagataatatcgaccctagcg
attcgggtaccgttcataaaagtacgggaatttcgggtag
gttatgttaggcgagggcaaaagtcatatacttttaggtc
aagagggcaatgcctcctctgccgattcggcgagtgatcg
gatggggaaaatatgagaccaggggagggccacactgcag
ctgccgggctaacagacacacgtctagggctgtgaaatct
gtaggcgccgaggccaacgctgagtgtcgatgttgagaac
attagtccggttccaagagggcaactttgtatgcaccgcc
gcggcccagtgcgcaacgcacagggcaaggtttactgcgg
ccacatgcgagggcaacctccctgtgttgggcggttctga
gcaattgtaaaacgacggcaatgttcggtcgcctaccctg
gataaagaggggggtaggaggtcaactcttccgtattaat
aggagtagagtagtgggtaaactacgaatgcttataacat
gcgagggcaatcgggatctgaaccttctttatgcgaagac
tccaggaggaggtcaacgactctgcatgtctgacaacttg
gtcatagaattccatccgccacgcggggtaatttggacgt
gtgccaacttgtgccggggggctagcagcttcccgtcaaa
cgcgtttggagtgcaaacatacacagcccgggaatataga
aagatacgagttcgatttcaagagttcaaaacgtgacggg
gacgaaacgagggcgatcaatgcccgataggactaataag
tagtacaaacccgctcacccgaaaggagggcaaatacctt
atatacagccaggggagacctataactcagcaaggttcag
cgtatgtactaattgtggagagcaaatcattgtccacgtg
…
Aligned
gcggaagagggcactagcccatgtgagagggcaaggacca
atctttctcttaaaaataacataattcagggccaggatgt
gtcacgagctttatcctacagatgatgaatgcaaatcagc
taaaagataatatcgaccctagcgtggcgggcaaggtgct
gtagattcgggtaccgttcataaaagtacgggaatttcgg
tatacttttaggtcgttatgttaggcgagggcaaaagtca
ctctgccgattcggcgagtgatcgaagagggcaatgcctc
aggatggggaaaatatgagaccaggggagggccacactgc
acacgtctagggctgtgaaatctctgccgggctaacagac
gtgtcgatgttgagaacgtaggcgccgaggccaacgctga
atgcaccgccattagtccggttccaagagggcaactttgt
ctgcgggcggcccagtgcgcaacgcacagggcaaggttta
tgtgttgggcggttctgaccacatgcgagggcaacctccc
gtcgcctaccctggcaattgtaaaacgacggcaatgttcg
cgtattaatgataaagaggggggtaggaggtcaactcttc
aatgcttataacataggagtagagtagtgggtaaactacg
tctgaaccttctttatgcgaagacgcgagggcaatcggga
tgcatgtctgacaacttgtccaggaggaggtcaacgactc
cgtgtcatagaattccatccgccacgcggggtaatttgga
tcccgtcaaagtgccaacttgtgccggggggctagcagct
acagcccgggaatatagacgcgtttggagtgcaaacatac
acgggaagatacgagttcgatttcaagagttcaaaacgtg
cccgataggactaataaggacgaaacgagggcgatcaatg
ttagtacaaacccgctcacccgaaaggagggcaaatacct
agcaaggttcagatatacagccaggggagacctataactc
gtccacgtgcgtatgtactaattgtggagagcaaatcatt
…
Motif Finding Example: The Gibbs Sampler

The Gibbs sampler is a Monte Carlo method, which seeks to maximize a
likelihood function over the input sequence data.

The likelihood function for a sequence s with a motif in location A
(Θ = weight matrix, θ_B = background freq. vector):

P(s, A | Θ, θ_B) = θ_{B,a} × … × θ_{B,g} × Θ_{1,t} × Θ_{2,a} × … × Θ_{8,c} × θ_{B,t} × … × θ_{B,t}

s = “actactgtatcgtactgactgattaggccatgactgcat”
            ^^^^^^^^
            motif location A

Lawrence et al. Science 1993
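The likelihood above factors position by position: motif positions use the weight matrix Θ, everything else uses the background θ_B. A minimal sketch; the `Theta` below is a made-up 8-column matrix for illustration, not the one from Lawrence et al.:

```python
import math

W = 8
theta_B = {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}
# Hypothetical motif matrix: odd columns favor 'a', even columns 't'.
Theta = [{'a': 0.7, 'c': 0.1, 'g': 0.1, 't': 0.1} if j % 2 else
         {'a': 0.1, 'c': 0.1, 'g': 0.1, 't': 0.7} for j in range(W)]

def log_likelihood(s, A):
    """log P(s, A | Theta, theta_B): the motif occupies s[A:A+W];
    every other position is emitted by the background."""
    ll = 0.0
    for i, base in enumerate(s):
        if A <= i < A + W:
            ll += math.log(Theta[i - A][base])
        else:
            ll += math.log(theta_B[base])
    return ll

s = "actactgtatcgtactgactgattaggccatgactgcat"
lls = [log_likelihood(s, A) for A in range(len(s) - W + 1)]
```

Evaluating this over every candidate A is exactly the scoring step the sampler performs on one sequence at a time.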
Prepare Yourself for…

Featuring the Gibbs Sampling Algorithm in:
• Pictures
• Words
• Movies
Gibbs Sampling Algorithm I
1. Select a random position in each sequence
[Figure: the sequence set, with one chosen motif instance marked per sequence]

Gibbs Sampling Algorithm II
2. Build a weight matrix Θ

Gibbs Sampling Algorithm III
3. Select a sequence at random

Gibbs Sampling Algorithm IV
4. Score possible sites in seq using weight matrix Θ
[Figure: likelihood (probability) assigned to each possible site]

Gibbs Sampling Algorithm V
5. Sample a new site proportional to likelihood

Gibbs Sampling Algorithm VI
6. Update weight matrix Θ

Gibbs Sampling Algorithm VII
7. Iterate until convergence (no change in sites/Θ)
The Gibbs Sampling Algorithm In Words I

Given N sequences of length L and desired motif width W:
Step 1) Choose a starting position in each sequence at random:
        a_1 in seq 1, a_2 in seq 2, …, a_N in sequence N
Step 2) Choose a sequence at random from the set (say, seq 1).
Step 3) Make a weight matrix model of width W from the sites
        in all sequences except the one chosen in step 2.
Step 4) Assign a probability to each position in seq 1 using the
        weight matrix model constructed in step 3:
        p = { p_1, p_2, p_3, …, p_{L-W+1} }
Lawrence et al., Science 1993
The Gibbs Sampling Algorithm In Words II
Given N sequences of length L and desired motif width W:
Step 5) Sample a starting position in seq 1 based on this
        probability distribution and set a_1 to this new position.
Step 6) Choose a sequence at random from the set (say, seq 2).
Step 7) Make a weight matrix model of width W from the sites
in all sequences except the one chosen in step 6.
Step 8) Assign a probability to each position in seq 2 using the
weight matrix model constructed in step 7.
Step 9) Sample a starting position in seq 2 based on this dist.
Step 10) Repeat until convergence
Lawrence et al., Science 1993
What, if anything, does this algorithm
accomplish (besides keeping your computer
busy)?
Input Sequences with Strong Motif

[Figure: cumulative frequency (0-100%) of bases A, C, G, T at each motif position]
Input Sequences (Weak Motif)
gcggaagagggcactagcccatgtgagagggcaaggacca
atctttctcttaaaaataacataattcagggccaggatgt
gtcacgagctttatcctacagatgatgaatgcaaatcagc
taaaagataatatcgaccctagcgtggcgggcaaggtgct
gtagattcgggtaccgttcataaaagtacgggaatttcgg
tatacttttaggtcgttatgttaggcgagggcaaaagtca
ctctgccgattcggcgagtgatcgaagagggcaatgcctc
aggatggggaaaatatgagaccaggggagggccacactgc
acacgtctagggctgtgaaatctctgccgggctaacagac
gtgtcgatgttgagaacgtaggcgccgaggccaacgctga
atgcaccgccattagtccggttccaagagggcaactttgt
ctgcgggcggcccagtgcgcaacgcacagggcaaggttta
tgtgttgggcggttctgaccacatgcgagggcaacctccc
gtcgcctaccctggcaattgtaaaacgacggcaatgttcg
cgtattaatgataaagaggggggtaggaggtcaactcttc
aatgcttataacataggagtagagtagtgggtaaactacg
tctgaaccttctttatgcgaagacgcgagggcaatcggga
tgcatgtctgacaacttgtccaggaggaggtcaacgactc
cgtgtcatagaattccatccgccacgcggggtaatttgga
tcccgtcaaagtgccaacttgtgccggggggctagcagct
acagcccgggaatatagacgcgtttggagtgcaaacatac
acgggaagatacgagttcgatttcaagagttcaaaacgtg
cccgataggactaataaggacgaaacgagggcgatcaatg
ttagtacaaacccgctcacccgaaaggagggcaaatacct
agcaaggttcagatatacagccaggggagacctataactc
gtccacgtgcgtatgtactaattgtggagagcaaatcatt
…
Gibbs Sampler Summary
• A stochastic (Monte Carlo) algorithm for motif finding
• Works by ‘stumbling’ onto a few motif instances, which
  bias the weight matrix, which causes it to sample more
  motif instances, which biases the weight matrix more,
  … until convergence
• Not guaranteed to converge to same motif every time -
  run several times, compare results
• Works for protein, DNA, RNA motifs
MEME - Multiple EM for Motif Elicitation
• Another popular motif finding algorithm -
  optimizes a similar likelihood function using an
  algorithm called ‘expectation maximization’ (EM)
• Unlike the Gibbs Sampler, MEME is deterministic
Bailey & Elkan, Proc. ISMB, 1994
How is the 5’ss recognized?

        3’-………CCAUUCAUAG-5’    U1 snRNA
               ||||||
     5’-…………UUCGUGAGU…………-3’    Pre-mRNA
RNA Energetics I

     3’-…CCAUUCAUAG-5’
          ||||||
     5’-…CGUGAGU………-3’

Free energy of helix formation derives from:

• base pairing:  G:C > A:U > G:U

• base stacking:

  5’ --> 3’
   U X            G p A
   A Y    e.g.    |   |
  3’ <-- 5’       C p U

Doug Turner’s Energy Rules (kcal/mol):

  X \ Y     A       C       G       U
  A         .       .       .     -1.30
  C         .       .     -2.40     .
  G         .     -2.10     .     -1.00
  T       -0.90     .     -1.30     .
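The tabulated values can be used to compare helix stabilities by summing per-step increments. A minimal sketch: the table is transcribed from the slide, indexed by the base pair X·Y, with the '.' cells (non-pairing combinations) simply absent; treating the entries as generic per-pair increments, rather than full dinucleotide-step energies, is a simplification for illustration.

```python
# Partial stacking table from the slide (Turner rules, kcal/mol),
# keyed by (top, bottom) base pair; non-pairing cells are absent.
STACK = {
    ('A', 'U'): -1.30,
    ('C', 'G'): -2.40,
    ('G', 'C'): -2.10,
    ('G', 'U'): -1.00,
    ('U', 'A'): -0.90,
    ('U', 'G'): -1.30,
}

def helix_delta_g(pairs):
    """Sum stacking increments over a list of (top, bottom) base pairs;
    returns None if a pair is missing from the partial table."""
    total = 0.0
    for xy in pairs:
        if xy not in STACK:
            return None
        total += STACK[xy]
    return total

# G:C-rich helices come out more stable (more negative) than A:U-rich
# ones, and G:U wobbles are weakest, matching the slide's ranking.
gc_rich = helix_delta_g([('C', 'G'), ('G', 'C'), ('C', 'G')])  # -6.9
au_rich = helix_delta_g([('A', 'U'), ('U', 'A'), ('A', 'U')])  # -3.5
```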
RNA Energetics II

A)  N p N p N p N p N p N p N
    x   |   |   |   |   x   x
    N p N p N p N p N p N p N
    Lots of consecutive base pairs - good

B)  N p N p N p N p N p N p N
    x   |   |   x   |   |   x
    N p N p N p N p N p N p N
    Internal loop - bad

C)  N p N p N p N p N p N p N
    x   |   |   |   x   |   x
    N p N p N p N p N p N p N
    Terminal base pair not stable - bad

Generally A will be more stable than B or C
Conditional Frequencies in 5’ss Sequences

5’ss which have G at +5:

Pos   -1   +3   +4   +6
A      9   44   75   14
C      4    3    4   18
G     78   51   13   19
T      9    3    9   49

5’ss which lack G at +5:

Pos   -1   +3   +4   +6
A      2   81   51   22
C      1    3   28   20
G     97   15    9   30
T      0    2   12   28

Data from Burge, 1998 “Computational Methods in Molecular Biology”
What kind of model could
incorporate interactions
between positions?