7.91 / 7.36 / BE.490
Lecture #2 
Feb. 26, 2004 
DNA Sequence Comparison
&
Alignment
Chris Burge 
Review of Lecture 1: “Genome Sequencing 
& DNA Sequence Analysis” 
? The Language of Genomics 
? 
? Flavors of BLAST 
? Statistics of High Scoring Segments 
- cDNAs, ESTs, BACs, Alus, etc. 
Dideoxy Method / Shotgun Sequencing 
- The ‘shotgun coverage equation’ (Poisson) 
- BLAST[PNX], TBLAST[NX] 
Shotgun Sequencing a BAC or a Genome
200 kb (NIH) 
3 Gb (Celera) 
Sequence, Assemble 
Sonicate, Subclone 
Subclones 
Shotgun Contigs 
What 
would 
cause 
problems 
with 
assembly? 
DNA Sequence Alignment IV
Which alignments are significant? 
Q: 1 ttgacctagatgagatgtcgttcacttttactgagctacagaaaa 45
|||| |||||||||||| | |||||||||||||||||||||||||
S: 403 ttgatctagatgagatgccattcacttttactgagctacagaaaa 447
Identify high scoring segments whose score S exceeds 
a cutoff x using dynamic programming. 
Scores follow an extreme value distribution: 
P(S > x) = 1 - exp[-Kmn e
-λx
] 
For sequences of length m, n where K, λ depend on the score 
matrix and the composition of the sequences being compared 
(Same theory as for protein sequence alignments) 
From M. Yaffe 
Notes (cont) 
Lecture #2 
?	 The random sequence alignment scores 
would give rise to an “extreme value” 
distribution – like a skewed gaussian. 
?	 Called Gumbel extreme value 
distribution 
For a normal distribution with a mean m and a variance σ, the height of the 
curve is described by Y=1/(σ√2π) exp[-(x-m)
2
/2σ
2
] 
For an extreme value distribution, the height of the  curve is described by 
Y=exp[-x-e
-x
] …and P(S>x) = 1-exp[-e
-λ(x-u)] 
where u=(ln Kmn)/λ 
Can show that mean extreme score is ~ log
2
(nm), and the probability of 
getting a score that exceeds some number of “standard deviations” x is: 
P(S>x)~ Kmne
-λx. 
***K and λ are tabulated for different matrices **** 
-λS
For the less statistically inclined: E~ Kmne 
-2 -1
0.2
Yev
0.4
-4 4
0.4
B.
Yn
Probability values for the extreme value distribution (A) and the 
normal distribution (B). The area under each curve is 1.
0 1 2
X X
A.
3 4 5
DNA Sequence Comparison & Alignment 
? Target frequencies and mismatch penalties 
? Eukaryotic gene structure 
? Comparative genomics applications: 
? 
See Ch. 7 of Mount 
- Pipmaker (2 species comparison) 
- Phylogenetic Shadowing (many species) 
Intro to DNA sequence motifs 
i 
DNA Sequence Alignment V
How is λ related to the score matrix? 
λ is the unique positive solution to the equation*: 
∑ p p
j
e
λ
s
ij 
= 1
i
i,j 
p = frequency of nt i, s
ij 
= score for aligning an i,j pair 
What kind of an equation is this? 
What would happen to λ ? if we doubled all the scores? 
What does this tell us about the nature of λ? 
*Karlin & Altschul, 1990 
DNA Sequence Alignment VI
What scoring matrix to use for DNA? 
Usually use simple match-mismatch matrices: 
i j: A C G T 
A 1 m m m 
C m 1 m m 
s
i,j 
: 
G 
T 
m 
m 
m 
m 
1 
m 
m 
1 
m = “mismatch penalty” (must be negative) 
DNA Sequence Alignment VII
How to choose the mismatch penalty? 
Use theory of High Scoring Segment composition* 
High scoring alignments will have composition: 
q
ij 
=p
i
p
j
e
λs
ij 
where q
ij 
= frequency of i,j pairs (“target frequencies”) 
p , p = freq of i, j bases in sequences being compared
i j 
What would happen to the target frequencies if we 
doubled all of the scores? 
*Karlin & Altschul, 1990 
DNA Sequence Alignment VIII 
Still figuring out how to choose the mismatch penalty m
Target frequencies: q
ij
=p
i
p
j
e
λs
ij
s
ij
=ln(q
ij
/ p
i
p
j
)/
?
If you want to find regions with R% identities:
r = R /100      q
ii
= r/4      q
ij
= (1-r)/12  (i,j)     Set s
ii
= 1 
Then  m = s
ij
= s
ij
/s
ii
= ln(q
ij
/ p
i
p
j
)/λ) / (ln(q
ii
/ p
i
p
i
)/λ
(i≠j)
?
m = ln(4(1-r)/3)/ln(4r)
λ
DNA Sequence Alignment IX 
The single most useful thing there is to know 
about mismatch penalties: 
m = ln(4(1-r)/3)/ln(4r) 
Examples:
r 0.75 0.95  0.99
m -1 -2 -3
r = desired fraction of identities in BLAST hits 
Nucleotide-
nucleotide 
BLAST Web 
Server 
(BLASTN) 
Other advanced 
DNA Sequence Alignment X 
NCBI BLAST Advanced Options 
-G Cost to open gap [Integer] 
default = 5 for nucleotides 11 proteins 
-E Cost to extend gap [Integer] 
default = 2 nucleotides 1 proteins 
-q Penalty for nucleotide mismatch [Integer] 
default = -3 
-r Reward for nucleotide match [Integer] 
default = 1 
-e Expect value [Real] 
default = 10 
Expression of a Eukaryotic Gene*
DNA 
… …
Transcription (N)
pre-mRNA 
Translation (C) 
AAAAAAA
mRNA 
Splicing (N) 
Protein 
Typical Human Gene Statistics
Length of primary transcript: ~30,000 bp 
Number of exons: ~8-10 
Mean (internal) exon length: ~150 bp 
Mean intron length: ~3,000 bp 
Comparative Genomics - Examples
? PipMaker: applications to 
- human/mouse exon finding 
- human/mouse regulatory region finding 
? “Phylogenetic Shadowing”: applications to 
- multi-genome exon finding 
- multi-genome regulatory region finding 
PipMaker -
Percent Identity Plot (PIP) for two genomic sequences
For an illustration of pips, please see figure 1 of 
Schwartz, Scott, Zheng Zhang, Kelly A. Frazer, Arian Smit, Cathy Riemer, John Bouck, Richard Gibbs,
 Ross Hardison, and Webb Miller. "PipMaker--A Web Server for Aligning Two Genomic DNA Sequences."
Genome Res. 10 (April 2000): 577-586. 
Application of PipMaker    #1- finding human/mouse exons
Please see figure 2 of 
Schwartz, Scott, Zheng Zhang, Kelly A. Frazer, Arian Smit, Cathy Riemer, John Bouck, Richard Gibbs,
 Ross Hardison, and Webb Miller. "PipMaker--A Web Server for Aligning Two Genomic DNA Sequences."
Genome Res. 10 (April 2000): 577-586. 
A Computational Biology Paradigm
for Finding Genomic Features of Interest 
? Identify properties that feature of interest should have 
? Develop an algorithm to find seq’s with these properties 
? Run algorithm on genome to predict features 
? Test a subset of the predicted features experimentally
to determine how well the method works
Application of PipMaker #2 - finding regulatory regions 
Please see figure 2 of
Loots, GG, RM Locksley, CM Blankespoor, ZE Wang, W Miller, EM Rubin, and KA Frazer. "Identification of A
Coordinate Regulator of Interleukins 4, 13, and 5 by Cross-species Sequence Comparisons." Science 288, no. 5463
(7 April 2000): 136-40.
Effects on Transcription of Deleting CNS-1 region
Please see figure 1 of
Loots, GG, RM Locksley, CM Blankespoor, ZE Wang, W Miller, EM Rubin, and KA Frazer. "Identification of A
Coordinate Regulator of Interleukins 4, 13, and 5 by Cross-species Sequence Comparisons." Science 288, no. 5463
(7 April 2000): 136-40.
“Phylogenetic Shadowing”
(the power of many genomes) 
? Sequence orthologous region from 13-17 primates (~90+% identical) 
? Do a multiple sequence alignment (MSA): 
? Measure variability at each position 
? Calculate P(data|fast evol.), P(data|slow evol.) 
What are potential advantages of many close species versus 
fewer more distant organisms? 
Please see figure 1 of
Boffelli, D, J McAuliffe, D Ovcharenko, KD Lewis, I Ovcharenko, L Pachter, and EM Rubin. "Phylogenetic 
Shadowing of Primate Sequences to Find Functional Regions of The Human Genome." Science 299, no. 5611
(28 February 2003): 1391-4.
Phylogenetic Shadowing of Regulatory Elements I
Please see figure 2 of
Boffelli, D, J McAuliffe, D Ovcharenko, KD Lewis, I Ovcharenko, L Pachter, and EM Rubin. "Phylogenetic 
Shadowing of Primate Sequences to Find Functional Regions of The Human Genome." Science 299, no. 5611
(28 February 2003): 1391-4.
Phylogenetic Shadowing of Regulatory Elements II
Please see figure 3 of
Boffelli, D, J McAuliffe, D Ovcharenko, KD Lewis, I Ovcharenko, L Pachter, and EM Rubin. "Phylogenetic 
Shadowing of Primate Sequences to Find Functional Regions of The Human Genome." Science 299, no. 5611
(28 February 2003): 1391-4.
Human Splice Signal Motif “Pictograms”
5' splice signal
3' splice signal
Binding Affinity of the Dog Transcription Factor
Site Fraction bound Yeast Genome 
CAT 1/2 A 1/3 T 1/3 
AAT 1/4 C  1/6 G 1/6 
GAT 1/4 
Others 0 
Search yeast promoters for potential Dog binding sites: 
How should you prioritize the promoters for followup 
experiments? 
Prioritizing Potential Dog Binding Sites
Site Fraction bound Odds Ratio (R) 
CAT 1/2 (1/2) / (1/6)(1/3)(1/3)?  27.0 
AAT 1/4 (1/4) / (1/3)(1/3)(1/3) 6.75 
GAT 1/4 (1/4) / (1/6)(1/3)(1/3) 13.5 
Others 0 (0) /   (   ) ( ) ( ) 0.0 
Yeast Genome 
A 1/3 T 1/3 
C 1/6 G 1/6 
Neyman-Pearson Lemma: 
Optimal decision rules are of the form R > C
(C, a chosen cutoff value)
Therefore: CAT > GAT > AAT > Others 
Weight Matrix Model (WMM)
A 0.3 0.1 0.0 0.4 0.1 
C 0.4 0.1 0.0 0.1 0.1 
G 0.8 0.0 0.4 0.1 0.8 0.2 
T 0.1 0.0 1.0 0.1 0.0 0.5 
Pos -3 -2 -1 +1 +2 +3 +4 +5 +6 
0.6 0.0 0.7 0.1 
0.0 0.0 0.1 0.2 
0.2 0.2 1.0 
0.1 0.1 0.1 
S = S
1 
S
2 
S
3 
S
4 
S
5 
S
6 
S
7 
S
8 
S
9
P(S|+) = P
-3
(S
1
)P
-2
(S
2
)P
-1
(S
3
) ??? P
5
(S
8
)P
6
(S
9
)
Inhomogeneous, assumes independence between positions
Statistical Independence
Two events A and B are said to be independent if: 
A B if and only if P(A,B) = P(A) P(B) 
In terms of conditional probabilities P(B|A) = P(A,B)/P(A) 
A B ???? P(B|A) = P(B) and P(A|B) = P(A) 
Example: E = { die roll an even number } (2,4,6) 
R = { die roll a prime number } (2,3,5) 
Are the events E and R independent? 
P(A,B) indicates probability that both A and B occur 
Weight Matrix Models II
5’ splice signal 
Background
Con: C A G … G T 
Pos -3 -2 -1 … +5 +6 
A 0.3 0.6 0.1 … 0.1 0.1 
C 0.4 0.1 0.0 … 0.1 0.2 
G 0.2 0.2 0.8 … 0.8 0.2 
T 0.1 0.1 0.1 … 0.0 0.5 
Pos Generic 
A 0.25 
C 0.25 
G 0.25 
T 0.25 
S = S
1 
S
2 
S
3 
S
4 
S
5 
S
6 
S
7 
S
8 
S
9 
(S
1
)P
-2
(S
2
)P
-1
(S
3
) ??? P
5
(S
8
)P
6
(S
9
)
Odds Ratio: R = 
P(S|+) = P
-3
P(S|-) = P
bg
(S
1
)P
bg
(S
2
)P
bg
(S
3
) ??? P
bg
(S
8
)P
bg
(S
9
) 
Background model homogenous, assumes independence 
Weight Matrix Models III
S = S
1 
S
2 
S
3 
S
4 
S
5 
S
6 
S
7 
S
8 
S
9 
P(S|+) P
-3
(S
1
)P
-2
(S
2
)P
-1
(S
3
) ??? P
5
(S
8
)P
6
(S
9
)
Odds Ratio: R = = 
P(S|-) P
bg
(S
1
)P
bg
(S
2
)P
bg
(S
3
) ??? P
bg
(S
8
)P
bg
(S
9
) 
k=9 
= ∏ P
-4+k
(S
k
)/ P
bg
(S
k
) 
k=1 
k=9 
Score s = log
2
R = ∑ log
2 
(P
-4+k
(S
k
)/ P
bg
(S
k
)) 
k=1 
Neyman-Pearson Lemma: 
Optimal decision rules are of the form R > C
Equiv.: log
2
(R) > C’ because log is a monotone function
Weight Matrix Models IV
Slide WMM along sequence: 
ttgacctagatgagatgtcgttcacttttactgagctacagaaaa 
…… 
Assign score to each 9 base window.
Use score cutoff to predict potential 5’ splice sites
Histogram of 5’ss Scores
True 
5’ 
Splice 
Sites 
“Decoy” 
5’ 
Splice 
Sites 
Score (1/10 bit units) 
Measuring Accuracy: 
Sensitivity = % of true sites w/ score > cutoff 
Specificity = % of sites w/ score > cutoff 
that are true sites 
Sn: 20% 50% 90% 
Sp: 50% 32% 7% 
What does this result tell us?

