7.91 – Lecture #4
Database Searching &
Molecular Phylogenetics
A
A
B
B
C
D
D
(((A,B)C)D)
C
Michael Yaffe
Outline
? FASTA, Blast searching, Smith-Waterman
? Psi-Blast
? Review of Genomic DNA structure
? Substitution patterns and mutation rates
? Synonymous and non-Synonymous substitutions
? Jukes-Cantor Model
? Kimura’s Two-Parameter Model
? Molecular Clocks
? Phylogenetic Trees – rooted and unrooted
? Distance Matrix Methods
? Neighbor-Joining Method and Related Neighbor
Methods
? Maximum Likelihood
Outline (cont)
? Parsimony
Branch and Bound
Heuristic Seaching
? Consensus Trees
? Software (PHYLIP, PAUP)
? The Tree of Life
Reading: Mount, p. 237-280, 283-286, 291-308
Database Searching
Problem is simple:
I want to find homologues to my protein in the database
How do I do it?
Do the obvious – compare my protein against
every other protein in the database and look
for local alignments by dynamic programming
Uh Oh!
1 n
For k sequences in the
1
12345678…. Database
12345678….
this becomes an O(mnk)
12345678….
problem!
12345678….
12345678….
m
12345678….
….essentially an O(mn) problem
Database Searching
Still, this can be done - ~ 50x slower than Blast/FASTA,
Smith-Waterman algorithm…
SSEARCH (ftp.virginia.edu/pub/fasta) – do it locally!
But in the old days, needed a faster method…
2 approaches – Blast, FASTA – both heuristic
(i.e. tried and true) – almost always finds related
Proteins but cannot guarantee optimal solution
FASTA: Basic Idea
1- Search for matching sequence patterns or words
Called k-tuples, which are exact matches of “k” characters
between the two sequences
i.e. RW = 2-tuple
Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE
Database Searching
FASTA: Basic Idea
2- Repeat for all possible k-tuples
i.e. CV = 2-tuple
Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE
3- Make a Hash Table (Hashing) that has the position of
each k-tuple in each sequence
2-tuple pos. in Seq1
RW 5
CV 10
AH 1
i.e.
pos in Seq 2 Offset (pos1-pos2)
2 3
7 3
---- ---
Database Searching
Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE
3- Make a Hash Table
i.e.
(Hashing) that has the position of
each k-tuple in each sequence
2-tuple pos. in Seq1 pos in Seq 2 Offset (pos1-pos2)
3
3
RW 5 2
CV 10 7
AH 1 ---- ----
4- Look for words (k-tuples) with same offset
These are in-phase and reveal a region of alignment
between the two sequences.
5- Build a local alignment based on these, extend it outwards
Seq 1: AHFYRWNKLCV
Seq 2: DRWNLFCVATYWE
Database Searching
With hashing, number of comparisons is proportional
To the average sequence length (i.e. an O(n) problem),
Not an O(mn) problem as in dynamic programming.
Proteins – ktup = 1-2,
Nucleotides, ktup=4-6
One big problem – low complexity regions.
Seq 1: AHFYPPPPPPPPFSER
Seq 2: DVATPPPPPPPPPPPNLFK
Database Searching
BLAST
Same basic idea as FASTA, but faster and more sensitive!
How?
BLAST searches for common words or k-tuples, but
limits the search for k-tuples that are most significant,
by using the log-odds values in the Blosum62 amino
acid substitution matrix
i.e. look for WHK and might accept WHR but
not HFK as a possible match (note 8000 possibilities)
Repeat for all 3-tuples in the query
Search the database for a match to the top 50 3-tuples that
match the first query position in the sequence, the second
query position, etc.
Use any match to seed an ungapped alignment (old BLAST)
Database Searching
Word length is fixed: 3-tuple for proteins
11-tuple for nucleotides
By default, filters out low complexity regions.
Determine if the alignment is statistically significant.
calculates the probability of observing a score greater
than or equal to your alignment based on extreme
value distribution.
Calculates an E-value = expectation value:
This is the probability of finding an unrelated sequence
that shows this good an alignment just by chance.
Remember if p=.0001 and my database has 500,000
sequences, I will have an E=50! (normal starting E=10)
Psi-BLAST
Position-specific iterative BLAST
Combines BLAST searching with PSSMs!
1- Start with regular BLAST search – look at the results
Psi-BLAST
Position-specific iterative BLAST
Combines BLAST searching with PSSMs!
1- Start with regular BLAST search – look at the results
2- Pick the ones you believe are really homologous
Psi-BLAST
Position-specific iterative BLAST
Combines BLAST searching with PSSMs!
1- Start with regular BLAST search – look at the results
2- Pick the ones you believe are really homologous
3- Now align these sequences to the query sequence
and make up a PSSM that tells how much to weigh
each amino acid in each position in the alignment
4- Use this PSSM to do another BLAST search
5- Add any new sequences that come up to the old ones
if you believe they are really homologous
6- Repeat the alignment to make a new and improved PSSM
that tells how much to weigh each amino acid in each
position in the alignment
Psi-BLAST
7-- Use this PSSM to do another BLAST search
8– Keep iterating until no new sequences are found
Very good for finding weakly related sequences
…on to Molecular Phylogenetics
1 2 3 4 5 6 7
Gene Structure
5’ Flank
AUG
3’ Flank
exon
intron
Poly(A) addition
splicing
3’ UTR
5’ UTR
AUG
UGA…AAAAAAA
Mutation Rates
Mutations:
deleterious
neutral
advantageous
←substantial minority
Consider 2 sequences
K= # of substitutions since they shared a common ancestor
T= divergence time
R = mutation rate = K/(2T)
KEY PREMISE OF PHYLOGENETICS:
If R=constant for all species, then K will provide insight into
evolutionary relatedness for which no other physical evidence
is available.
Mutation Rates
Mutations: refined by process of natural selection…
Often, but not always at the protein level…
→ Functional constraint
AUG
3’ Flank
5’ Flank
exon
intron
Poly(A) addition
splicing
AUG
UGA…AAAAAAA
5’ UTR
3’ UTR
1 2 3 4 5 6 7
Human, mouse, rabbit and cow beta-globin
Substitution rate
Length, bp
# Pairwise
?’s (mean)
(subs/site x 10
9
years)
Noncoding, overall
913 268
1.58
3.39
1.86
3.48
3.33
Coding, overall
441 69
functionally
5’ Flank 300
96
rapid
constrained
5’ UTR 50 9
changes
Intron 1
131
3’ UTR
132
3’ Flank
300
42
33 …generally
3.00
76 true
3.04
Common ancestor 100 million years ago
Synonymous vs Nonsynonymous Substitutions
18 out of 20 amino acids have more than one codon
GGG, GGC, GGU, GGA → Glycine
?‘s here – same aa →synonymous substitution
GGG
GGC
GGU
GGA
nonsynonymous substitution, GCG →Alanine
Human and rabbit beta-globin genes:
but 3x as many
47 substitutions in coding sequence
opportunities!
? 27 synonymous substitutions
? 20 nonsynonymous substitutions
Synonymous vs Nonsynonymous Substitutions
Not all positions in a codon equally likely to give
non-synonymous substitutions
For Gly, 4-fold degenerate
GGG
GGC
GGU
GGA
GAU
GAC
GAA
GAG
Asp
Glu
2-fold degenerate
UUU
CUU
AUU
GUU
Phe
Leu
Ile
Non-degenerate
Val
Synonymous vs Nonsynonymous Substitutions
If natural selection operates at protein level, expect
nucleotide substitutions appear most rapidly at 4-fold
sites and least rapidly at non-degenerate sites
What does data show?
Human vs rabbit beta-globin genes (coding region)
Sub. Rate
Region
# sites (bp) #changes
Subs/site.10
9
years
Non-deg.
302 17
.56
2-fold deg.
60 10 1.67
4-fold deg.
85 20 2.35
Mutation versus Substitutions
Mutation: changes in nucleotide sequences due to errors
in DNA replication or repair
Substitution: mutations that pass through the filter of
natural selection
Synonymous substitution rates, Ks, reflect actual mutation rate
Non-synonymous substitution rates, Ka, do NOT reflect
actual mutation rate, as subject to natural selection
New alleles (versions of a gene) typically begin at low frequencies
q = 1/(2N) where N=# of diploid reproducing organisms
Why are there persistant high levels of variation in populations?
Why not q→0, q→1?
Most mutations are selectively neutral!
Estimating Substitution Numbers
Infrequent substitutions between 2 sequences:
Count ‘em…gives K
More frequent substitutions – counting will significantly
UNDERestimate the number of true substitutions since
they shared a common ancestor
Why?
Time
Scenario 1 Scenario 2
G G
G T
G G
Sampling time
Jukes-Cantor Model
G
α
α
α
α
T
α
Assume each nucleotide equally likely
to change into any other nt,
with rate of change=α.
Overall rate of substitution = 3α
A
α
C …so if G at t=0, at t=1, P
G(1)
=1-3α
and P
G(2)
=(1-3α)P
G(1)
+α [1? P
G(1)
]
Expanding this gives P
G(t)
=1/4 + (3/4)e
-4αt
Can show that this gives K = -3/4 ln[1-(4/3)(p)]
K = true number of substitutions that have occurred,
P = fraction of nt that differ by a simple count.
Captures general behaviour…
Compare J-C with real data…assumption of global uniformity in
α was unrealistic…still provides a useful framework…
Nucleotides: two categories: purines: A, G
pyrimidines: C, T, U
Exchange nucleotides within or between classes happens at
different rates!
Transitions: purine→purine, pyrmidine →pyrimidine
three times as commmon as
Transversions: purine→pyrimidine or pyrimidine→purine
Led to Kimura’s Two Parameter Model
Kimura’s Two Parameter Model
Transitions occur at rate α
β
G
T
A
C
β
α
β
β
Transversions occur at rate β
α
Now P
GG(1)
= 1- α -2 β
P
GG(2)
: 4 possibilities:
Time
Scenario 1 Scenario 2
Scenario 3 Scenario 4
G
G
G
A
G
C
G
T
G G
G G
No ?
1 Transition
2Transversions
Kimura’s Two Parameter Model
β
Transitions occur at rate α
α
Transversions occur at rate β
P
GG(1)
= 1- α -2 β
P
GG(2)
: 4 possibilities:
Time
Scenario 1 Scenario 2
Scenario 3 Scenario 4
G
T
A
C
β
α
β
β
G
G
G
G
A
G
G
C
G
G
T
G
No ?
1 Transition
2 Transversions
P
GG(2)
= (1- α -2 β) P
GG(1)
+ αP
GA(1)
+ βP
GC(1)
+ βP
GT(1)
expanding…
P
GG(t)
= ? + (1/4)e
-4βt
+ (1/2)e
-2(α+β)t
Kimura’s Two Parameter Model
β
G
T
A
C
β
α
β
β
Transitions occur at rate α
Transversions occur at rate β
α
P
GG(2)
= (1- α -2 β) P
GG(1)
+ αP
GA(1)
+ βP
GG(1)
+ βP
GG(1)
expanding…
P
GG(t)
= 1/4 + (1/4)e
-4βt
+ (1/2)e
-2(α+β)t
Manipulating equation gives estimate of true number of
substitutions if only two sequences are available,
K = 1/2 ln[1/(1-2P-Q)] + 1/4 ln[1/(1-2Q)]
Where K = true number of substitutions
P = fraction of nts undergoing transitions by simple count
Q = fraction of nts undergoing tranversions by simple count
More complex Parameter Models Possible
T
G
α
ε
φ
G
α
T
γ
δ
β
γ
C
δ
A
Could even make A→C ≠ C →A
A
ε
φ
β
λ
C
Problem is sampling error – not enough data to get
Good parameters within a single gene family, usually
Why not combine different genes?
Find strikingly different rates of evolution between different
Genes – up to and greater than 200-fold.
RATE DEPENDS ON FUNCTION!
Histones – each aa interacts with DNA – slowest rate of
substitution known
HLA gene locus – involved in immune system recognition of
foreign antigens – needs to adapt rapidly - one of the
Highest substitution rates known
However, rates of molecular evolution for loci with
similar functional constraints often very uniform
over long periods of evolutionary time.
Molecular Clocks
1960s: Emile Zuckerkandl and Linus Pauling
Postulate: Substitution rates so constant within homologous
proteins over long periods of evolutionary time that
accumulation of amino acid changes reflects the steady ticking
of a molecular clock.
Clock may run at different rates for different proteins
WAS CONTROVERSIAL
AS DIVERGENCE TIMES
s
c
t
e
ss
Fibrinopeptides
s
ne
ti
l
/
i
y
s
r
e
hpl
es
# aa substitutions
per 100 residues
500
1000
50
100
150
Hemoglobin
B
i
r
d
s
/
r
e
p
ti
l
e
m
a
m
m
a
l
s
/
r
e
R
e
p
ti
l
e
s
/
f
i
s
C
a
r
p
/
l
a
m
p
m
a
m
m
a
V
e
r
t
e
b
r
a
t
Cytochrome C
NOT ACCURATELY KNOWN
1400
Millions of years since divergence
1
Relative Rate Test of Molecular Clock Hypothesis
1973: Sarich and Wilson
Consider relative rate of substitution in lineage for species
1 and 2
Need to designate a less related species 3 as an outgroup
i.e. 1=humans, 2=gorillas, 3= baboons
Phylogenetic tree (more soon!)
1 and 2 diverged from a common
2 3
ancestor, A
Number of substitutions between
A
any two species = sum of number of
substitutions along branches of the tree
that connect them
1
1 and 2 diverged from a common
2 3
ancestor, A
Number of substitutions between
A Any two species = sum of number of
Substitutions along branches of the tree
That connect them
d
13
, d
23
, d
12
– can measure directly
d
13
= d
A1
+ d
A3
Algebra:
d
23
= d
A2
+ d
A3
d
A1
= (d
12
+ d
13
–d
23
)/2
d
12
= d
A1
+ d
A2
d
A2
= (d
12
+ d
23
–d
13
)/2
Theorum 1
Molecular clock predicts d
A1
=d
A2
Find, for the most part, this is true, but not always, depending on species…
So bottom line when comparing two species need to prove
Theorum 1 before using the molecular clock!
Distance-Based Phylogenetics
Phylogenetic Trees – also called dendrograms
? Made by arranging nodes and branches.
? Graphical representation of evolutionary relatedness
of 3 or more sequences
Nodes – distinct taxonomical unit
? Terminal nodes: gene or organism for which data has
been collected
? Internal node – inferred common ancestor that gave
rise to 2 lineages
3 4
1 2
5
A
B
C
D
For the mathematicians:
Tree - special graph with
n nodes, n-1 links, no circuits
1
Distance-Based Phylogenetics
Newick notation
A
B
C
D
2
3 4
5
(((1,2), (3,4)),5)
Scaled trees
Branch length is α difference between pairs of
neighboring nodes.
Ideally, scaled trees should be additive
Unscaled trees
Only convey relative kinship information without representing
number of changes that separate sequences
1
Distance-Based Phylogenetics
2
3 4
5
A
B
C
D
root
Time
Rooted trees
Make an inference about common ancestor and direction
of evolution. A single node is designated as common
ancestor with unique path from it through evolutionary
time to any other node.
Root is assigned through use of an outgroup –
something that unambiguously separated earlier than
species being considered.
1
Distance-Based Phylogenetics
1
2
A
B
C
D
4
5
2
3
3
5
4
Unrooted trees
Only specifies relationship between nodes. Says nothing
about the direction of evolution
Why not always use rooted trees?
Distance-Based Phylogenetics
Why not always use rooted trees?
1 – Need a clear outgroup
2 – Computational difficulty
Consider 3 sequences 1, 2, and 3:
3 possible rooted trees
1
2 3
1
3 2
3
2 1
Only 1 possible unrooted tree
1 2
3
Distance-Based Phylogenetics
Why not always use rooted trees?
Number of Number of Number of
sequences rooted trees unrooted trees
1
2
1
1
3
3
4
15
3
5
105
15
10
34,459,425
2,027, 025
15
213,458,046,767,875
7,905,853,580,625
N
R
=(2n-3)!/2
n-2
(n-2)!
N
U
=(2n-5)!/2
n-3
(n-3)!
Shortcuts…
UPGMA
Unweighted pair-group method with arithmetic mean
? Oldest distance method, statistically based
? Requires data be condensed to a measure of genetic distance
**Build a distance matrix between taxa (I.e. sequences) **
Consider 4 sequences A, B, C, D
Species
A B C
B
C
d
AB
d
AC
d
BC
d
AB
is distance
Between A and B
D d
AD
d
BD
d
CD
Step 1: Cluster the two closest sequences into composite group,
i.e. if d
AB
is smallest, make new group (AB).
UPGMA
Consider 4 sequences A, B, C, D
Species
A B C
B
C
d
AB
d
AC
d
BC
d
AB
is distance
Between A and B
D d
AD
d
BD
d
CD
Step 1: Cluster the two closest sequences into composite group,
i.e. if d
AB
is smallest, make new group (AB).
Step 2: Create a new distance matrix between (AB) and C and D.
d
(AB)C
=1/2 (d
AC
+ d
BC
); d
(AB)D
=1/2 (d
AD
+ d
BD
)
UPGMA
Consider 4 sequences A, B, C, D
Species
A B C
B
C
d
AB
d
AC
d
BC
d
AB
is distance
Between A and B
D d
AD
d
BD
d
CD
Step 1: Cluster the two closest sequences into composite group,
i.e. if d
AB
is smallest, make new group (AB).
Step 2: Create a new distance matrix between (AB) and C and D.
d
(AB)C
=1/2 (d
AC
+ d
BC
); d
(AB)D
=1/2 (d
AD
+ d
BD
)
Step 3: Using new matrix, cluster the two closest sequences
into composite group. Repeat above until all species
have been grouped.
UPGMA
Species
A B C
d
AB
is distance
B
d
AB
Between A and B
C d
AC
d
BC
D d
AD
d
BD
d
CD
Step 1: Cluster the two closest sequences into composite group,
i.e. if d
AB
is smallest, make new group (AB).
Step 2: Create a new distance matrix between (AB) and C and D.
d
(AB)C
=1/2 (d
AC
+ d
BC
); d
(AB)D
=1/2 (d
AD
+ d
BD
)
Step 3: Using new matrix, cluster the two closest sequences
into composite group. Repeat above until all species
have been grouped.
Step 4: For scaled branch lengths, put node halfway between
grouped species.
UPGMA - example
Species
A B C D
B
9
C 8
11
D 12
15 10
E
15
18
13
5
D E
(D,E)
UPGMA - example
Species
A B C D
B
9
C 8
11
D 12
15 10
E
15
18
13
5
D E
(D,E)
A C
D E
Species
A B C
B
9
C 8
11
(A,C) (D,E)
DE
13.5
16.5 11.5
UPGMA - example
A C
D E
Species
A B C
B
9
C 8
11
(A,C) (D,E)
DE
13.5
16.5 11.5
A C
B D E
Species
B AC
AC
10
DE 16.5
12.5
(((A,C)B)(D,E))
UPGMA – adding distances
Species
A B C D
B
9
C 8
11
D 12
15 10
E
15
18
13
5
D E
(D,E)
2.5
2.5
A
4
2.5
C
D
E
Species
A B C
4
B
9
C 8
11
(A,C) (D,E)
DE
13.5
16.5 11.5
2.5
UPGMA – adding distances
Species
B AC
A
4
6.25
C
D
E
AC
10
4 2.5
2.5
DE 16.5
12.5 6.25
Branch lengths for scaled unrooted tree
= Fitch-Margoliash Algorithm for 3 sequences
A
x
d
AC
= x+y
y
d
AB
= x+z
C d
BC
= y+z
B
z
x=(d
AB
+d
AC
–d
BC
)/2
y=(d
AC
+d
BC
–d
AB
)/2
z=(d
AB
+d
BC
–d
AC
)/2
Note: F-M assumes additivity of branch lengths but DOES NOT
Assume equal rates of evolution along branches.
(Can specify this though : Kitsch-Margoliash)
Fitch-Margoliash Algorithm for >3 sequences
A
B
C
a
b
f
c
D
g
d
e
E
Steps:
1- Find closest 2 sequences (D,E)
2- Treat rest as composite and take average of D to (ABC), E to (ABC)
3- Use these to calculate d, e
4- Make new composite DE
5- Make new distance table
6- Find next most closely related pair and repeat from step 2
Now repeat starting with another pair as the closest starting pair
In the end, calculate all predicted distances for all trees, and choose
What best fits data
Transformed Distance Method
UPGMA assumes constant rate
Of evolution across all lineages
Can allow different rates of evolution across different
lineages if you normalize using an external reference that
diverged early…i.e. use an outgroup
Define d
D
=average distance
A B C
D
Between outgroup and all ingroups
d’
ij
= (d
ij
–d
iD
–d
jD
)/2 + d
D
Now use d’
ij
to do the clustering
..basically just comes from the insight that
ingroups evolved separately from each other
ONLY AFTER they diverged from outgroup
Neighbor’s Relation Method
Variant of UPGMA that pairs species in a way that creates a
tree with minimal overall branch lengths.
Pairs of sequences separated by only 1 node are said to be
neighbors.
A
B
C
D
a
b
c
d
e
terminal branches
single central
branch
For this tree topology
d
AC
+d
BD
= d
AD
+d
BC
= a + b + c + d + 2e =d
AB
+ d
CD
+2e
For neighbor relations, four-point condition will be true:
d
AB
+d
CD
< d
AC
+d
BD
…and…d
AB
+d
CD
< d
AD
+d
BC
So just have to consider all pairwise arrangements and
determine which one satisfies the four-point condition.
Neighbor-Joining Methods
Start with star-like tree. Find neighbors sequentially to minimize
total length of all branches
A
B
D
C
D
Studier & Kepler 1988:
Q
12
=(N-2)d
12
- Σd
1i
- Σd
2i
Where any 2 sequences can be 1 and 2
Try all possible sequence combinations. Whichever
combination of pairs gives the smallest Q
12
is the final tree!
B
A
C
Maximum Likelihood
? A purely statistical method.
? Probablilities for every nucleotide substitution in a set of aligned
sequences is considered.
? Calculation of probabilities is complex since ancestor is unknown
? Test all possible trees and calculate the aggregate probablility.
? Tree with single highest aggregate probablilty is the most likely
to reflect the true phylogenetic tree.
VERY COMPUTATIONALLY INTENSE
Parsimony
Parsimony: a derogatory term from the 1930s and 1940s
To describe someone who was especially careful with
Spending money.
Biologically: Attach preference to an evolutionary pathway
That minimizes the number of mutational events since
(1) Mutations are rare events, and
(2) The more unlikely events a model postulates, the less l
likely the model is to be true.
Parsimony: a character-based method, NOT a distance-based
method.
Parsimony
For parsimony analysis, positions in a sequence alignment
fall into one of two categories: informative and uninformative.
Position
Sequence
1 2 3 4
5
6
1
G G G
G G G
2
G G G
A G T
3
G G A
T A G
4
G A T
C A T
Only 3 possible unrooted trees you can make…
Parsimony
Position
Sequence
1 2 3 4
5
6
1
G G G
G G G
2
G G G
A G T
3
G G A
T A G
4
G A T
C A T
3
1
1
2
2
1
4
3
2
3
4 4
Which tree is the right one?
Parsimony
Position
Sequence
1 2 3 4
5
6
G
G
G
G
G G G
G G
1
G G A
G T
2
G A T
A G
3
A T C
A T
4
G
3
1
G
1
G
1
G G
2
G
2
2
G
G
4
G
3
3
G
G
4 4
G
Invariant positions – contain NO INFORMATION→ uninformative
Parsimony
Position
Sequence
1 2 3 4
5
6
G
G
G
A
G G G G
1
G
G A G T
2
G
A T A G
3
G
T C A T
4
G
G
3
1
G
1
G
1
G G
2
G
2
2
G
A
4
G
3
3
G
A
4 4
A
Equally uninformative – need one mutation in each tree
Parsimony
Position
Sequence
1 2 3 4
5
6
G
G
A
T
G G G
1
G G
A G T
2
G G
T A G
3
G G
C A T
4
G A
1
G
G
2
G
?
A
G G
G G
?
1
G A
3
1
G G
2
2
G
?
T
4
3
A
?
T
4 4
T
? ?
A
3
Also uninformative – need two mutations in each tree
Parsimony
Position
Sequence
1 2 3 4
5
6
G
A
T
C
G G
1
G G G
G T
2
G G G
A G
3
G G A
A T
4
G A T
1
G
A
2
1
G T
3
1
G A
2
G
?
T
G
?
A
G
? A
?
?
2
A
?
C
4
3
T
?
C
4 4
C
? ?
T
3
Also uninformative – need three mutations in each tree
Parsimony
Position
Sequence
1 2 3 4
5
6
G
G
A
A
G
1
G G G
G
T
2
G G G
A
G
3
G G A
T
T
4
G A T
C
1
G
G
2
G
?
A
G G
G G
?
1
G A
3
1
G G
2
2
G
A
4
3
A
?
A
4 4
A
? ?
A
3
Informative! – need only one mutation in one tree but two
In the other trees!
Parsimony
Position
Sequence
1 2 3 4
5
6
1
G G G
G G
2
G G G
A G
3
G G A
T A
4
G A T
C A
G
T
G
T
1
G G
3
1
G T
2
1
G
?
T
2
G
G
G ?T
G G
?
?
2
T
?
T
4
G
3
3
G
T
4 4
T
Informative! – need only one mutation in one tree but two
In the other trees!
Parsimony
Position
Sequence
So to be Informative, need at least 2 different nucleotides
And each has to be present at least twice.
Every tree is considered for every site, maintaining a running
score of the number of mutations required. The tree with the
smallest number of invoked mutations is the most
parsimonious
6
5
432
1
G
T
G
G
G
A
G
G
G
G
G
G
G
ATAGG
3
T
ACTAG
4
1
2
Parsimony
1
G
G
2
G
?
A
G G
G G
?
1
G A
3
1
G G
2
2
G
A
4
3
A
?
A
4 4
A
? ?
A
3
Mathematically, most likely candidate nts at an
Internal node are:
{descendent node 1} ? {descendant node 2}
IF this is null set, then most likely candidate nts are:
{descendent node 1} U {descendant node 2}
Σ U = minimum number of substitutions required to account for
nts at terminal nodes since they last shared common ancestor
Total number of substitutions, informative + uninformative =
tree length
Parsimony
1
G
G
2
G
?
A
G G
G G
?
1
G A
3
1
G G
2
2
G
A
4
3
A
?
A
4 4
A
? ?
A
3
If you use parsimony but weigh the mutations by
some kind of scoring system that accounts for the
likelihood of each mutation →weighted parsimony
By-product of parsimony is inference of nt identity in
the ancestral sequence
Parsimony
10 sequences: > 2 million possible trees…
Need a better way…
Branch and bound (Hardy and Penny, 1982)
Step 1: Determine an upper bound to the length of the
most parsimonious tree = L - either chosen randomly, or else
using a computationally fast way like UPGMA
Step 2: Grow trees incrementally by adding branches to a
smaller tree that describes just some of the sequences.
Step 3: If at any point, the number of required substitutions
is > L, abandon that tree.
Step 4: As soon as you get a tree with fewer substiitutions
than L, use that tree as the new upper bound to make
remainder of the search even more efficient.
Works for <= 20 sequences
Parsimony
For > 20 sequences
Heuristic Searches
Assumption: Alternative trees are not all independent of
each other. Most parsimonious trees have similar
topologies.
Step 1: Construct an initial tree as a good guess: UPGMA,
and use it as a starting point.
Step 2: Branch-swap subtrees and graft them onto the starting
tree, keeping overall topology. See how many are shorter
than the starting tree. The prune and re-graft, and see
if it keeps getting better.
Step 3: Repeat until a round of branch swapping fails to generate
any better trees
Parsimony
Often get tens or hundreds of equally parsimonious
trees
Build a consensus tree – any internal node supported by
At least half the trees becomes a simple bifurcation.
Phylogenetic Software
PHYLIP: Phylogenetics Inference Package
free at http://evolution.genetics.washington.edu
Includes many programs including various distance methods,
maximum likelihood, parsimony, with many of the options
we’ve discussed.
PAUP: Phylogenetic Analysis Using Parsimony
http://www.lms.si.edu/PAUP -NOT FREE
Now includes maximum likelihood and distance methods as well
Tree of Life
Carl Woese and colleagues, 1970s
Used 16S rRNA – all organisms possess.
Found 3 major evolutionary groups:
Bacteria
Eucarya
Archea (including thermophilic bacteria
Human Origins:
mt DNA sequences – huan populations differ by ~ 0.33%
(very small). Greatest differences NOT between
current populations on different continents, but between
human populations residing in Africa – “out of Africa” theory
Mitochondrial “eve” and Y-chromosome “adam”