7.91 – Lecture #5 Database Searching & Molecular Phylogenetics A A B B C D D (((A,B)C)D) C Michael Yaffe Outline ? Distance Matrix Methods ? Neighbor-Joining Method and Related Neighbor Methods ? Maximum Likelihood ? Parsimony Branch and Bound Heuristic Seaching ? Consensus Trees ? Software (PHYLIP, PAUP) ? The Tree of Life Transformed Distance Method UPGMA assumes constant rate Of evolution across all lineages - can lead to wrong tree topologies Can allow different rates of evolution across different lineages if you normalize using an external reference that diverged early…i.e. use an outgroup Define d D =average distance A B C D Between outgroup and all ingroups d’ ij = (d ij –d iD –d jD )/2 + d D Now use d’ ij to do the clustering ..basically just comes from the insight that ingroups evolved separately from each other ONLY AFTER they diverged from outgroup Example Species A B C B 9 d AB is distance Between A and B C 8 11 D 12 15 10 A B C D 6 Use D as outgroup 3 3 6 Species A B 2 1 B 10/3 C 16/3 16/3 d D = 37/3 Now use UPGMA to build tree Neighbor’s Relation Method Variant of UPGMA that pairs species in a way that creates a tree with minimal overall branch lengths. Pairs of sequences separated by only 1 node are said to be neighbors. A B C D a b c d e terminal branches single central branch For this tree topology d AC +d BD = d AD +d BC = a + b + c + d + 2e =d AB + d CD +2e For neighbor relations, four-point condition will be true: d AB +d CD < d AC +d BD …and…d AB +d CD < d AD +d BC So just have to consider all pairwise arrangements and determine which one satisfies the four-point condition. Neighbor-Joining Methods Start with star-like tree. Find neighbors sequentially to minimize total length of all branches A B D C D Studier & Kepler 1988: Q 12 =(N-2)d 12 -Σd 1i -Σd 2i Where any 2 sequences can be 1 and 2 Try all possible sequence combinations. Whichever combination of pairs gives the smallest Q 12 is the final tree! B A C Maximum Likelihood ? A purely statistical method. ? Probablilities for every nucleotide substitution in a set of aligned sequences is considered. ? Calculation of probabilities is complex since ancestor is unknown ? Test all possible trees and calculate the aggregate probablility. ? Tree with single highest aggregate probablilty is the most likely to reflect the true phylogenetic tree. VERY COMPUTATIONALLY INTENSE Parsimony Parsimony: a derogatory term from the 1930s and 1940s To describe someone who was especially careful with Spending money. Biologically: Attach preference to an evolutionary pathway That minimizes the number of mutational events since (1) Mutations are rare events, and (2) The more unlikely events a model postulates, the less l likely the model is to be true. Parsimony: a character-based method, NOT a distance-based method. Parsimony For parsimony analysis, positions in a sequence alignment fall into one of two categories: informative and uninformative. Position Sequence 1 2 3 4 5 6 1 G G G G G G 2 G G G A G T 3 G G A T A G 4 G A T C A T Only 3 possible unrooted trees you can make… Parsimony Position Sequence 1 2 3 4 5 6 1 G G G G G G 2 G G G A G T 3 G G A T A G 4 G A T C A T 3 1 1 2 2 1 4 3 2 3 4 4 Which tree is the right one? Parsimony Position Sequence 1 2 3 4 5 6 G G G G G G G G G 1 G G A G T 2 G A T A G 3 A T C A T 4 G 3 1 G 1 G 1 G G 2 G 2 2 G G 4 G 3 3 G G 4 4 G Invariant positions – contain NO INFORMATION→ uninformative Parsimony Position Sequence 1 2 3 4 5 6 G G G A G G G G 1 G G A G T 2 G A T A G 3 G T C A T 4 G G 3 1 G 1 G 1 G G 2 G 2 2 G A 4 G 3 3 G A 4 4 A Equally uninformative – need one mutation in each tree Parsimony Position Sequence 1 2 3 4 5 6 G G A T G G G 1 G G A G T 2 G G T A G 3 G G C A T 4 G A 1 G G 2 G * A G G G G * 1 G A 3 1 G G 2 * * 2 G T 4 * * A 3 T 4 4 T 3 A Also uninformative – need two mutations in each tree Parsimony Position Sequence 1 2 3 4 5 6 G A T C G G 1 G G G G T 2 G G G A G 3 G G A A T 4 G A T 1 G A 2 1 G T 3 1 G A 2 G * T G * A G * A * * * * 2 A C 4 * * T 3 C 4 4 C 3 T Also uninformative – need three mutations in each tree Parsimony Position Sequence 1 2 3 4 5 6 G G A A G 1 G G G G T 2 G G G A G 3 G G A T T 4 G A T C 1 G G 2 G * A G G G G * * 1 G A 3 1 G G 2 2 G A 4 * * A 3 A 4 4 A 3 A Informative! – need only one mutation in one tree but two In the other trees! Parsimony Position Sequence 1 2 3 4 5 6 1 G G G G G 2 G G G A G 3 G G A T A 4 G A T C A G T G T 1 G T 2 1 G G 3 1 G T 2 * G G G * T G G * * * 2 T T 4 T 4 G 34 T 3 G Informative! – need only one mutation in one tree but two In the other trees! Parsimony Position Sequence So to be Informative, need at least 2 different nucleotides And each has to be present at least twice. Every tree is considered for every site, maintaining a running score of the number of mutations required. The tree with the smallest number of invoked mutations is the most parsimonious 6 5 432 1 G T G G G A G G G G G G G ATAGG 3 T ACTAG 4 1 2 Parsimony 1 G G 2 G * A G G G G * * 1 G A 3 1 G G 2 2 G A 4 * * A 3 A 4 4 A 3 A Mathematically, most likely candidate nts at an Internal node are: {descendent node 1} ? {descendant node 2} IF this is null set, then most likely candidate nts are: {descendent node 1} U {descendant node 2} Σ U = minimum number of substitutions required to account for nts at terminal nodes since they last shared common ancestor Total number of substitutions, informative + uninformative = tree length Parsimony 1 G G 2 G * A G G G G * * 1 G A 3 1 G G 2 2 G A 4 * * A 3 A 4 4 A 3 A If you use parsimony but weigh the mutations by some kind of scoring system that accounts for the likelihood of each mutation →weighted parsimony By-product of parsimony is inference of nt identity in the ancestral sequence Parsimony 10 sequences: > 2 million possible trees… Need a better way… Branch and bound (Hardy and Penny, 1982) Step 1: Determine an upper bound to the length of the most parsimonious tree = L - either chosen randomly, or else using a computationally fast way like UPGMA Step 2: Starting from a simple tree with just some of the Sequences, grow trees incrementally by adding branches to a smaller tree that describes just some of the sequences. Step 3: If at any point, the number of required substitutions is > L, abandon that tree. Step 4: As soon as you get a tree with fewer substitutions than L, use that tree as the new upper bound to make remainder of the search even more efficient. Works for <= 20 sequences Parsimony For > 20 sequences Heuristic Searches Assumption: Alternative trees are not all independent of each other. Most parsimonious trees have similar topologies. Step 1: Construct an initial tree as a good guess: UPGMA, and use it as a starting point. Step 2: Branch-swap subtrees and graft them onto the starting tree, keeping overall topology. See how many are shorter than the starting tree. The prune and re-graft, and see if it keeps getting better. Step 3: Repeat until a round of branch swapping fails to generate any better trees Parsimony Often get tens or hundreds of equally parsimonious trees Build a consensus tree – any internal node supported by At least half the trees becomes a simple bifurcation. Phylogenetic Software PHYLIP: Phylogenetics Inference Package free at http://evolution.genetics.washington.edu Includes many programs including various distance methods, maximum likelihood, parsimony, with many of the options we’ve discussed. PAUP: Phylogenetic Analysis Using Parsimony http://www.lms.si.edu/PAUP -NOT FREE Now includes maximum likelihood and distance methods as well Tree of Life Carl Woese and colleagues, 1970s Used 16S rRNA – all organisms possess. Found 3 major evolutionary groups: Bacteria Eucarya Archea (including thermophilic bacteria Human Origins: mt DNA sequences – huan populations differ by ~ 0.33% (very small). Greatest differences NOT between current populations on different continents, but between human populations residing in Africa – “out of Africa” theory Mitochondrial “eve” and Y-chromosome “adam”