7.91 – Lecture #5
Database Searching &
Molecular Phylogenetics
A
A
B
B
C
D
D
(((A,B)C)D)
C
Michael Yaffe
Outline
? Distance Matrix Methods
? Neighbor-Joining Method and Related Neighbor
Methods
? Maximum Likelihood
? Parsimony
Branch and Bound
Heuristic Seaching
? Consensus Trees
? Software (PHYLIP, PAUP)
? The Tree of Life
Transformed Distance Method
UPGMA assumes constant rate
Of evolution across all lineages - can lead to wrong tree topologies
Can allow different rates of evolution across different
lineages if you normalize using an external reference that
diverged early…i.e. use an outgroup
Define d
D
=average distance
A B C
D
Between outgroup and all ingroups
d’
ij
= (d
ij
–d
iD
–d
jD
)/2 + d
D
Now use d’
ij
to do the clustering
..basically just comes from the insight that
ingroups evolved separately from each other
ONLY AFTER they diverged from outgroup
Example
Species
A B C
B
9
d
AB
is distance
Between A and B
C 8
11
D 12
15
10
A
B C D
6
Use D as outgroup
3
3
6
Species
A B
2
1
B
10/3
C 16/3
16/3
d
D
= 37/3
Now use UPGMA to build tree
Neighbor’s Relation Method
Variant of UPGMA that pairs species in a way that creates a
tree with minimal overall branch lengths.
Pairs of sequences separated by only 1 node are said to be
neighbors.
A
B
C
D
a
b
c
d
e
terminal branches
single central
branch
For this tree topology
d
AC
+d
BD
= d
AD
+d
BC
= a + b + c + d + 2e =d
AB
+ d
CD
+2e
For neighbor relations, four-point condition will be true:
d
AB
+d
CD
< d
AC
+d
BD
…and…d
AB
+d
CD
< d
AD
+d
BC
So just have to consider all pairwise arrangements and
determine which one satisfies the four-point condition.
Neighbor-Joining Methods
Start with star-like tree. Find neighbors sequentially to minimize
total length of all branches
A
B
D
C
D
Studier & Kepler 1988:
Q
12
=(N-2)d
12
-Σd
1i
-Σd
2i
Where any 2 sequences can be 1 and 2
Try all possible sequence combinations. Whichever
combination of pairs gives the smallest Q
12
is the final tree!
B
A
C
Maximum Likelihood
? A purely statistical method.
? Probablilities for every nucleotide substitution in a set of aligned
sequences is considered.
? Calculation of probabilities is complex since ancestor is unknown
? Test all possible trees and calculate the aggregate probablility.
? Tree with single highest aggregate probablilty is the most likely
to reflect the true phylogenetic tree.
VERY COMPUTATIONALLY INTENSE
Parsimony
Parsimony: a derogatory term from the 1930s and 1940s
To describe someone who was especially careful with
Spending money.
Biologically: Attach preference to an evolutionary pathway
That minimizes the number of mutational events since
(1) Mutations are rare events, and
(2) The more unlikely events a model postulates, the less l
likely the model is to be true.
Parsimony: a character-based method, NOT a distance-based
method.
Parsimony
For parsimony analysis, positions in a sequence alignment
fall into one of two categories: informative and uninformative.
Position
Sequence
1 2 3 4
5
6
1
G G G
G G G
2
G G G
A G T
3
G G A
T A G
4
G A T
C A T
Only 3 possible unrooted trees you can make…
Parsimony
Position
Sequence
1 2 3 4
5
6
1
G G G
G G G
2
G G G
A G T
3
G G A
T A G
4
G A T
C A T
3
1
1
2
2
1
4
3
2
3
4 4
Which tree is the right one?
Parsimony
Position
Sequence
1 2 3 4
5
6
G
G
G
G
G G G
G G
1
G G A
G T
2
G A T
A G
3
A T C
A T
4
G
3
1
G
1
G
1
G G
2
G
2
2
G
G
4
G
3
3
G
G
4 4
G
Invariant positions – contain NO INFORMATION→ uninformative
Parsimony
Position
Sequence
1 2 3 4
5
6
G
G
G
A
G G G G
1
G
G A G T
2
G
A T A G
3
G
T C A T
4
G
G
3
1
G
1
G
1
G G
2
G
2
2
G
A
4
G
3
3
G
A
4 4
A
Equally uninformative – need one mutation in each tree
Parsimony
Position
Sequence
1 2 3 4
5
6
G
G
A
T
G G G
1
G G
A G T
2
G G
T A G
3
G G
C A T
4
G A
1
G
G
2
G
*
A
G G
G G
*
1
G A
3
1
G G
2
* *
2
G
T
4
*
*
A
3
T
4 4
T
3
A
Also uninformative – need two mutations in each tree
Parsimony
Position
Sequence
1 2 3 4
5
6
G
A
T
C
G G
1
G G G
G T
2
G G G
A G
3
G G A
A T
4
G A T
1
G
A
2
1
G T
3
1
G A
2
G
*
T
G
*
A
G
* A
*
*
* *
2
A
C
4
*
*
T
3
C
4 4
C
3
T
Also uninformative – need three mutations in each tree
Parsimony
Position
Sequence
1 2 3 4
5
6
G
G
A
A
G
1
G G G
G
T
2
G G G
A
G
3
G G A
T
T
4
G A T
C
1
G
G
2
G
*
A
G G
G G
* *
1
G A
3
1
G G
2
2
G
A
4
*
*
A
3
A
4 4
A
3
A
Informative! – need only one mutation in one tree but two
In the other trees!
Parsimony
Position
Sequence
1 2 3 4
5
6
1
G G G
G G
2
G G G
A G
3
G G A
T A
4
G A T
C A
G
T
G
T
1
G
T
2
1
G G
3
1
G T
2
*
G
G
G * T
G G
*
*
*
2
T
T
4
T
4
G
34
T
3
G
Informative! – need only one mutation in one tree but two
In the other trees!
Parsimony
Position
Sequence
So to be Informative, need at least 2 different nucleotides
And each has to be present at least twice.
Every tree is considered for every site, maintaining a running
score of the number of mutations required. The tree with the
smallest number of invoked mutations is the most
parsimonious
6
5
432
1
G
T
G
G
G
A
G
G
G
G
G
G
G
ATAGG
3
T
ACTAG
4
1
2
Parsimony
1
G
G
2
G
*
A
G G
G G
* *
1
G A
3
1
G G
2
2
G
A
4
*
*
A
3
A
4 4
A
3
A
Mathematically, most likely candidate nts at an
Internal node are:
{descendent node 1} ? {descendant node 2}
IF this is null set, then most likely candidate nts are:
{descendent node 1} U {descendant node 2}
Σ U = minimum number of substitutions required to account for
nts at terminal nodes since they last shared common ancestor
Total number of substitutions, informative + uninformative =
tree length
Parsimony
1
G
G
2
G
*
A
G G
G G
* *
1
G A
3
1
G G
2
2
G
A
4
*
*
A
3
A
4 4
A
3
A
If you use parsimony but weigh the mutations by
some kind of scoring system that accounts for the
likelihood of each mutation →weighted parsimony
By-product of parsimony is inference of nt identity in
the ancestral sequence
Parsimony
10 sequences: > 2 million possible trees…
Need a better way…
Branch and bound (Hardy and Penny, 1982)
Step 1: Determine an upper bound to the length of the
most parsimonious tree = L - either chosen randomly, or else
using a computationally fast way like UPGMA
Step 2: Starting from a simple tree with just some of the
Sequences, grow trees incrementally by adding branches to a
smaller tree that describes just some of the sequences.
Step 3: If at any point, the number of required substitutions
is > L, abandon that tree.
Step 4: As soon as you get a tree with fewer substitutions
than L, use that tree as the new upper bound to make
remainder of the search even more efficient.
Works for <= 20 sequences
Parsimony
For > 20 sequences
Heuristic Searches
Assumption: Alternative trees are not all independent of
each other. Most parsimonious trees have similar
topologies.
Step 1: Construct an initial tree as a good guess: UPGMA,
and use it as a starting point.
Step 2: Branch-swap subtrees and graft them onto the starting
tree, keeping overall topology. See how many are shorter
than the starting tree. The prune and re-graft, and see
if it keeps getting better.
Step 3: Repeat until a round of branch swapping fails to generate
any better trees
Parsimony
Often get tens or hundreds of equally parsimonious
trees
Build a consensus tree – any internal node supported by
At least half the trees becomes a simple bifurcation.
Phylogenetic Software
PHYLIP: Phylogenetics Inference Package
free at http://evolution.genetics.washington.edu
Includes many programs including various distance methods,
maximum likelihood, parsimony, with many of the options
we’ve discussed.
PAUP: Phylogenetic Analysis Using Parsimony
http://www.lms.si.edu/PAUP -NOT FREE
Now includes maximum likelihood and distance methods as well
Tree of Life
Carl Woese and colleagues, 1970s
Used 16S rRNA – all organisms possess.
Found 3 major evolutionary groups:
Bacteria
Eucarya
Archea (including thermophilic bacteria
Human Origins:
mt DNA sequences – huan populations differ by ~ 0.33%
(very small). Greatest differences NOT between
current populations on different continents, but between
human populations residing in Africa – “out of Africa” theory
Mitochondrial “eve” and Y-chromosome “adam”