7.91 – Lecture #4 Database Searching & Molecular Phylogenetics A A B B C D D (((A,B)C)D) C Michael Yaffe Outline ? FASTA, Blast searching, Smith-Waterman ? Psi-Blast ? Review of Genomic DNA structure ? Substitution patterns and mutation rates ? Synonymous and non-Synonymous substitutions ? Jukes-Cantor Model ? Kimura’s Two-Parameter Model ? Molecular Clocks ? Phylogenetic Trees – rooted and unrooted ? Distance Matrix Methods ? Neighbor-Joining Method and Related Neighbor Methods ? Maximum Likelihood Outline (cont) ? Parsimony Branch and Bound Heuristic Seaching ? Consensus Trees ? Software (PHYLIP, PAUP) ? The Tree of Life Reading: Mount, p. 237-280, 283-286, 291-308 Database Searching Problem is simple: I want to find homologues to my protein in the database How do I do it? Do the obvious – compare my protein against every other protein in the database and look for local alignments by dynamic programming Uh Oh! 1 n For k sequences in the 1 12345678…. Database 12345678…. this becomes an O(mnk) 12345678…. problem! 12345678…. 12345678…. m 12345678…. ….essentially an O(mn) problem Database Searching Still, this can be done - ~ 50x slower than Blast/FASTA, Smith-Waterman algorithm… SSEARCH (ftp.virginia.edu/pub/fasta) – do it locally! But in the old days, needed a faster method… 2 approaches – Blast, FASTA – both heuristic (i.e. tried and true) – almost always finds related Proteins but cannot guarantee optimal solution FASTA: Basic Idea 1- Search for matching sequence patterns or words Called k-tuples, which are exact matches of “k” characters between the two sequences i.e. RW = 2-tuple Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE Database Searching FASTA: Basic Idea 2- Repeat for all possible k-tuples i.e. CV = 2-tuple Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE 3- Make a Hash Table (Hashing) that has the position of each k-tuple in each sequence 2-tuple pos. in Seq1 RW 5 CV 10 AH 1 i.e. pos in Seq 2 Offset (pos1-pos2) 2 3 7 3 ---- --- Database Searching Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE 3- Make a Hash Table i.e. (Hashing) that has the position of each k-tuple in each sequence 2-tuple pos. in Seq1 pos in Seq 2 Offset (pos1-pos2) 3 3 RW 5 2 CV 10 7 AH 1 ---- ---- 4- Look for words (k-tuples) with same offset These are in-phase and reveal a region of alignment between the two sequences. 5- Build a local alignment based on these, extend it outwards Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE Database Searching With hashing, number of comparisons is proportional To the average sequence length (i.e. an O(n) problem), Not an O(mn) problem as in dynamic programming. Proteins – ktup = 1-2, Nucleotides, ktup=4-6 One big problem – low complexity regions. Seq 1: AHFYPPPPPPPPFSER Seq 2: DVATPPPPPPPPPPPNLFK Database Searching BLAST Same basic idea as FASTA, but faster and more sensitive! How? BLAST searches for common words or k-tuples, but limits the search for k-tuples that are most significant, by using the log-odds values in the Blosum62 amino acid substitution matrix i.e. look for WHK and might accept WHR but not HFK as a possible match (note 8000 possibilities) Repeat for all 3-tuples in the query Search the database for a match to the top 50 3-tuples that match the first query position in the sequence, the second query position, etc. Use any match to seed an ungapped alignment (old BLAST) Database Searching Word length is fixed: 3-tuple for proteins 11-tuple for nucleotides By default, filters out low complexity regions. Determine if the alignment is statistically significant. calculates the probability of observing a score greater than or equal to your alignment based on extreme value distribution. Calculates an E-value = expectation value: This is the probability of finding an unrelated sequence that shows this good an alignment just by chance. Remember if p=.0001 and my database has 500,000 sequences, I will have an E=50! (normal starting E=10) Psi-BLAST Position-specific iterative BLAST Combines BLAST searching with PSSMs! 1- Start with regular BLAST search – look at the results Psi-BLAST Position-specific iterative BLAST Combines BLAST searching with PSSMs! 1- Start with regular BLAST search – look at the results 2- Pick the ones you believe are really homologous Psi-BLAST Position-specific iterative BLAST Combines BLAST searching with PSSMs! 1- Start with regular BLAST search – look at the results 2- Pick the ones you believe are really homologous 3- Now align these sequences to the query sequence and make up a PSSM that tells how much to weigh each amino acid in each position in the alignment 4- Use this PSSM to do another BLAST search 5- Add any new sequences that come up to the old ones if you believe they are really homologous 6- Repeat the alignment to make a new and improved PSSM that tells how much to weigh each amino acid in each position in the alignment Psi-BLAST 7-- Use this PSSM to do another BLAST search 8– Keep iterating until no new sequences are found Very good for finding weakly related sequences …on to Molecular Phylogenetics 1 2 3 4 5 6 7 Gene Structure 5’ Flank AUG 3’ Flank exon intron Poly(A) addition splicing 3’ UTR 5’ UTR AUG UGA…AAAAAAA Mutation Rates Mutations: deleterious neutral advantageous ←substantial minority Consider 2 sequences K= # of substitutions since they shared a common ancestor T= divergence time R = mutation rate = K/(2T) KEY PREMISE OF PHYLOGENETICS: If R=constant for all species, then K will provide insight into evolutionary relatedness for which no other physical evidence is available. Mutation Rates Mutations: refined by process of natural selection… Often, but not always at the protein level… → Functional constraint AUG 3’ Flank 5’ Flank exon intron Poly(A) addition splicing AUG UGA…AAAAAAA 5’ UTR 3’ UTR 1 2 3 4 5 6 7 Human, mouse, rabbit and cow beta-globin Substitution rate Length, bp # Pairwise ?’s (mean) (subs/site x 10 9 years) Noncoding, overall 913 268 1.58 3.39 1.86 3.48 3.33 Coding, overall 441 69 functionally 5’ Flank 300 96 rapid constrained 5’ UTR 50 9 changes Intron 1 131 3’ UTR 132 3’ Flank 300 42 33 …generally 3.00 76 true 3.04 Common ancestor 100 million years ago Synonymous vs Nonsynonymous Substitutions 18 out of 20 amino acids have more than one codon GGG, GGC, GGU, GGA → Glycine ?‘s here – same aa →synonymous substitution GGG GGC GGU GGA nonsynonymous substitution, GCG →Alanine Human and rabbit beta-globin genes: but 3x as many 47 substitutions in coding sequence opportunities! ? 27 synonymous substitutions ? 20 nonsynonymous substitutions Synonymous vs Nonsynonymous Substitutions Not all positions in a codon equally likely to give non-synonymous substitutions For Gly, 4-fold degenerate GGG GGC GGU GGA GAU GAC GAA GAG Asp Glu 2-fold degenerate UUU CUU AUU GUU Phe Leu Ile Non-degenerate Val Synonymous vs Nonsynonymous Substitutions If natural selection operates at protein level, expect nucleotide substitutions appear most rapidly at 4-fold sites and least rapidly at non-degenerate sites What does data show? Human vs rabbit beta-globin genes (coding region) Sub. Rate Region # sites (bp) #changes Subs/site.10 9 years Non-deg. 302 17 .56 2-fold deg. 60 10 1.67 4-fold deg. 85 20 2.35 Mutation versus Substitutions Mutation: changes in nucleotide sequences due to errors in DNA replication or repair Substitution: mutations that pass through the filter of natural selection Synonymous substitution rates, Ks, reflect actual mutation rate Non-synonymous substitution rates, Ka, do NOT reflect actual mutation rate, as subject to natural selection New alleles (versions of a gene) typically begin at low frequencies q = 1/(2N) where N=# of diploid reproducing organisms Why are there persistant high levels of variation in populations? Why not q→0, q→1? Most mutations are selectively neutral! Estimating Substitution Numbers Infrequent substitutions between 2 sequences: Count ‘em…gives K More frequent substitutions – counting will significantly UNDERestimate the number of true substitutions since they shared a common ancestor Why? Time Scenario 1 Scenario 2 G G G T G G Sampling time Jukes-Cantor Model G α α α α T α Assume each nucleotide equally likely to change into any other nt, with rate of change=α. Overall rate of substitution = 3α A α C …so if G at t=0, at t=1, P G(1) =1-3α and P G(2) =(1-3α)P G(1) +α [1? P G(1) ] Expanding this gives P G(t) =1/4 + (3/4)e -4αt Can show that this gives K = -3/4 ln[1-(4/3)(p)] K = true number of substitutions that have occurred, P = fraction of nt that differ by a simple count. Captures general behaviour… Compare J-C with real data…assumption of global uniformity in α was unrealistic…still provides a useful framework… Nucleotides: two categories: purines: A, G pyrimidines: C, T, U Exchange nucleotides within or between classes happens at different rates! Transitions: purine→purine, pyrmidine →pyrimidine three times as commmon as Transversions: purine→pyrimidine or pyrimidine→purine Led to Kimura’s Two Parameter Model Kimura’s Two Parameter Model Transitions occur at rate α β G T A C β α β β Transversions occur at rate β α Now P GG(1) = 1- α -2 β P GG(2) : 4 possibilities: Time Scenario 1 Scenario 2 Scenario 3 Scenario 4 G G G A G C G T G G G G No ? 1 Transition 2Transversions Kimura’s Two Parameter Model β Transitions occur at rate α α Transversions occur at rate β P GG(1) = 1- α -2 β P GG(2) : 4 possibilities: Time Scenario 1 Scenario 2 Scenario 3 Scenario 4 G T A C β α β β G G G G A G G C G G T G No ? 1 Transition 2 Transversions P GG(2) = (1- α -2 β) P GG(1) + αP GA(1) + βP GC(1) + βP GT(1) expanding… P GG(t) = ? + (1/4)e -4βt + (1/2)e -2(α+β)t Kimura’s Two Parameter Model β G T A C β α β β Transitions occur at rate α Transversions occur at rate β α P GG(2) = (1- α -2 β) P GG(1) + αP GA(1) + βP GG(1) + βP GG(1) expanding… P GG(t) = 1/4 + (1/4)e -4βt + (1/2)e -2(α+β)t Manipulating equation gives estimate of true number of substitutions if only two sequences are available, K = 1/2 ln[1/(1-2P-Q)] + 1/4 ln[1/(1-2Q)] Where K = true number of substitutions P = fraction of nts undergoing transitions by simple count Q = fraction of nts undergoing tranversions by simple count More complex Parameter Models Possible T G α ε φ G α T γ δ β γ C δ A Could even make A→C ≠ C →A A ε φ β λ C Problem is sampling error – not enough data to get Good parameters within a single gene family, usually Why not combine different genes? Find strikingly different rates of evolution between different Genes – up to and greater than 200-fold. RATE DEPENDS ON FUNCTION! Histones – each aa interacts with DNA – slowest rate of substitution known HLA gene locus – involved in immune system recognition of foreign antigens – needs to adapt rapidly - one of the Highest substitution rates known However, rates of molecular evolution for loci with similar functional constraints often very uniform over long periods of evolutionary time. Molecular Clocks 1960s: Emile Zuckerkandl and Linus Pauling Postulate: Substitution rates so constant within homologous proteins over long periods of evolutionary time that accumulation of amino acid changes reflects the steady ticking of a molecular clock. Clock may run at different rates for different proteins WAS CONTROVERSIAL AS DIVERGENCE TIMES s c t e ss Fibrinopeptides s ne ti l / i y s r e hpl es # aa substitutions per 100 residues 500 1000 50 100 150 Hemoglobin B i r d s / r e p ti l e m a m m a l s / r e R e p ti l e s / f i s C a r p / l a m p m a m m a V e r t e b r a t Cytochrome C NOT ACCURATELY KNOWN 1400 Millions of years since divergence 1 Relative Rate Test of Molecular Clock Hypothesis 1973: Sarich and Wilson Consider relative rate of substitution in lineage for species 1 and 2 Need to designate a less related species 3 as an outgroup i.e. 1=humans, 2=gorillas, 3= baboons Phylogenetic tree (more soon!) 1 and 2 diverged from a common 2 3 ancestor, A Number of substitutions between A any two species = sum of number of substitutions along branches of the tree that connect them 1 1 and 2 diverged from a common 2 3 ancestor, A Number of substitutions between A Any two species = sum of number of Substitutions along branches of the tree That connect them d 13 , d 23 , d 12 – can measure directly d 13 = d A1 + d A3 Algebra: d 23 = d A2 + d A3 d A1 = (d 12 + d 13 –d 23 )/2 d 12 = d A1 + d A2 d A2 = (d 12 + d 23 –d 13 )/2 Theorum 1 Molecular clock predicts d A1 =d A2 Find, for the most part, this is true, but not always, depending on species… So bottom line when comparing two species need to prove Theorum 1 before using the molecular clock! Distance-Based Phylogenetics Phylogenetic Trees – also called dendrograms ? Made by arranging nodes and branches. ? Graphical representation of evolutionary relatedness of 3 or more sequences Nodes – distinct taxonomical unit ? Terminal nodes: gene or organism for which data has been collected ? Internal node – inferred common ancestor that gave rise to 2 lineages 3 4 1 2 5 A B C D For the mathematicians: Tree - special graph with n nodes, n-1 links, no circuits 1 Distance-Based Phylogenetics Newick notation A B C D 2 3 4 5 (((1,2), (3,4)),5) Scaled trees Branch length is α difference between pairs of neighboring nodes. Ideally, scaled trees should be additive Unscaled trees Only convey relative kinship information without representing number of changes that separate sequences 1 Distance-Based Phylogenetics 2 3 4 5 A B C D root Time Rooted trees Make an inference about common ancestor and direction of evolution. A single node is designated as common ancestor with unique path from it through evolutionary time to any other node. Root is assigned through use of an outgroup – something that unambiguously separated earlier than species being considered. 1 Distance-Based Phylogenetics 1 2 A B C D 4 5 2 3 3 5 4 Unrooted trees Only specifies relationship between nodes. Says nothing about the direction of evolution Why not always use rooted trees? Distance-Based Phylogenetics Why not always use rooted trees? 1 – Need a clear outgroup 2 – Computational difficulty Consider 3 sequences 1, 2, and 3: 3 possible rooted trees 1 2 3 1 3 2 3 2 1 Only 1 possible unrooted tree 1 2 3 Distance-Based Phylogenetics Why not always use rooted trees? Number of Number of Number of sequences rooted trees unrooted trees 1 2 1 1 3 3 4 15 3 5 105 15 10 34,459,425 2,027, 025 15 213,458,046,767,875 7,905,853,580,625 N R =(2n-3)!/2 n-2 (n-2)! N U =(2n-5)!/2 n-3 (n-3)! Shortcuts… UPGMA Unweighted pair-group method with arithmetic mean ? Oldest distance method, statistically based ? Requires data be condensed to a measure of genetic distance **Build a distance matrix between taxa (I.e. sequences) ** Consider 4 sequences A, B, C, D Species A B C B C d AB d AC d BC d AB is distance Between A and B D d AD d BD d CD Step 1: Cluster the two closest sequences into composite group, i.e. if d AB is smallest, make new group (AB). UPGMA Consider 4 sequences A, B, C, D Species A B C B C d AB d AC d BC d AB is distance Between A and B D d AD d BD d CD Step 1: Cluster the two closest sequences into composite group, i.e. if d AB is smallest, make new group (AB). Step 2: Create a new distance matrix between (AB) and C and D. d (AB)C =1/2 (d AC + d BC ); d (AB)D =1/2 (d AD + d BD ) UPGMA Consider 4 sequences A, B, C, D Species A B C B C d AB d AC d BC d AB is distance Between A and B D d AD d BD d CD Step 1: Cluster the two closest sequences into composite group, i.e. if d AB is smallest, make new group (AB). Step 2: Create a new distance matrix between (AB) and C and D. d (AB)C =1/2 (d AC + d BC ); d (AB)D =1/2 (d AD + d BD ) Step 3: Using new matrix, cluster the two closest sequences into composite group. Repeat above until all species have been grouped. UPGMA Species A B C d AB is distance B d AB Between A and B C d AC d BC D d AD d BD d CD Step 1: Cluster the two closest sequences into composite group, i.e. if d AB is smallest, make new group (AB). Step 2: Create a new distance matrix between (AB) and C and D. d (AB)C =1/2 (d AC + d BC ); d (AB)D =1/2 (d AD + d BD ) Step 3: Using new matrix, cluster the two closest sequences into composite group. Repeat above until all species have been grouped. Step 4: For scaled branch lengths, put node halfway between grouped species. UPGMA - example Species A B C D B 9 C 8 11 D 12 15 10 E 15 18 13 5 D E (D,E) UPGMA - example Species A B C D B 9 C 8 11 D 12 15 10 E 15 18 13 5 D E (D,E) A C D E Species A B C B 9 C 8 11 (A,C) (D,E) DE 13.5 16.5 11.5 UPGMA - example A C D E Species A B C B 9 C 8 11 (A,C) (D,E) DE 13.5 16.5 11.5 A C B D E Species B AC AC 10 DE 16.5 12.5 (((A,C)B)(D,E)) UPGMA – adding distances Species A B C D B 9 C 8 11 D 12 15 10 E 15 18 13 5 D E (D,E) 2.5 2.5 A 4 2.5 C D E Species A B C 4 B 9 C 8 11 (A,C) (D,E) DE 13.5 16.5 11.5 2.5 UPGMA – adding distances Species B AC A 4 6.25 C D E AC 10 4 2.5 2.5 DE 16.5 12.5 6.25 Branch lengths for scaled unrooted tree = Fitch-Margoliash Algorithm for 3 sequences A x d AC = x+y y d AB = x+z C d BC = y+z B z x=(d AB +d AC –d BC )/2 y=(d AC +d BC –d AB )/2 z=(d AB +d BC –d AC )/2 Note: F-M assumes additivity of branch lengths but DOES NOT Assume equal rates of evolution along branches. (Can specify this though : Kitsch-Margoliash) Fitch-Margoliash Algorithm for >3 sequences A B C a b f c D g d e E Steps: 1- Find closest 2 sequences (D,E) 2- Treat rest as composite and take average of D to (ABC), E to (ABC) 3- Use these to calculate d, e 4- Make new composite DE 5- Make new distance table 6- Find next most closely related pair and repeat from step 2 Now repeat starting with another pair as the closest starting pair In the end, calculate all predicted distances for all trees, and choose What best fits data Transformed Distance Method UPGMA assumes constant rate Of evolution across all lineages Can allow different rates of evolution across different lineages if you normalize using an external reference that diverged early…i.e. use an outgroup Define d D =average distance A B C D Between outgroup and all ingroups d’ ij = (d ij –d iD –d jD )/2 + d D Now use d’ ij to do the clustering ..basically just comes from the insight that ingroups evolved separately from each other ONLY AFTER they diverged from outgroup Neighbor’s Relation Method Variant of UPGMA that pairs species in a way that creates a tree with minimal overall branch lengths. Pairs of sequences separated by only 1 node are said to be neighbors. A B C D a b c d e terminal branches single central branch For this tree topology d AC +d BD = d AD +d BC = a + b + c + d + 2e =d AB + d CD +2e For neighbor relations, four-point condition will be true: d AB +d CD < d AC +d BD …and…d AB +d CD < d AD +d BC So just have to consider all pairwise arrangements and determine which one satisfies the four-point condition. Neighbor-Joining Methods Start with star-like tree. Find neighbors sequentially to minimize total length of all branches A B D C D Studier & Kepler 1988: Q 12 =(N-2)d 12 - Σd 1i - Σd 2i Where any 2 sequences can be 1 and 2 Try all possible sequence combinations. Whichever combination of pairs gives the smallest Q 12 is the final tree! B A C Maximum Likelihood ? A purely statistical method. ? Probablilities for every nucleotide substitution in a set of aligned sequences is considered. ? Calculation of probabilities is complex since ancestor is unknown ? Test all possible trees and calculate the aggregate probablility. ? Tree with single highest aggregate probablilty is the most likely to reflect the true phylogenetic tree. VERY COMPUTATIONALLY INTENSE Parsimony Parsimony: a derogatory term from the 1930s and 1940s To describe someone who was especially careful with Spending money. Biologically: Attach preference to an evolutionary pathway That minimizes the number of mutational events since (1) Mutations are rare events, and (2) The more unlikely events a model postulates, the less l likely the model is to be true. Parsimony: a character-based method, NOT a distance-based method. Parsimony For parsimony analysis, positions in a sequence alignment fall into one of two categories: informative and uninformative. Position Sequence 1 2 3 4 5 6 1 G G G G G G 2 G G G A G T 3 G G A T A G 4 G A T C A T Only 3 possible unrooted trees you can make… Parsimony Position Sequence 1 2 3 4 5 6 1 G G G G G G 2 G G G A G T 3 G G A T A G 4 G A T C A T 3 1 1 2 2 1 4 3 2 3 4 4 Which tree is the right one? Parsimony Position Sequence 1 2 3 4 5 6 G G G G G G G G G 1 G G A G T 2 G A T A G 3 A T C A T 4 G 3 1 G 1 G 1 G G 2 G 2 2 G G 4 G 3 3 G G 4 4 G Invariant positions – contain NO INFORMATION→ uninformative Parsimony Position Sequence 1 2 3 4 5 6 G G G A G G G G 1 G G A G T 2 G A T A G 3 G T C A T 4 G G 3 1 G 1 G 1 G G 2 G 2 2 G A 4 G 3 3 G A 4 4 A Equally uninformative – need one mutation in each tree Parsimony Position Sequence 1 2 3 4 5 6 G G A T G G G 1 G G A G T 2 G G T A G 3 G G C A T 4 G A 1 G G 2 G ? A G G G G ? 1 G A 3 1 G G 2 2 G ? T 4 3 A ? T 4 4 T ? ? A 3 Also uninformative – need two mutations in each tree Parsimony Position Sequence 1 2 3 4 5 6 G A T C G G 1 G G G G T 2 G G G A G 3 G G A A T 4 G A T 1 G A 2 1 G T 3 1 G A 2 G ? T G ? A G ? A ? ? 2 A ? C 4 3 T ? C 4 4 C ? ? T 3 Also uninformative – need three mutations in each tree Parsimony Position Sequence 1 2 3 4 5 6 G G A A G 1 G G G G T 2 G G G A G 3 G G A T T 4 G A T C 1 G G 2 G ? A G G G G ? 1 G A 3 1 G G 2 2 G A 4 3 A ? A 4 4 A ? ? A 3 Informative! – need only one mutation in one tree but two In the other trees! Parsimony Position Sequence 1 2 3 4 5 6 1 G G G G G 2 G G G A G 3 G G A T A 4 G A T C A G T G T 1 G G 3 1 G T 2 1 G ? T 2 G G G ?T G G ? ? 2 T ? T 4 G 3 3 G T 4 4 T Informative! – need only one mutation in one tree but two In the other trees! Parsimony Position Sequence So to be Informative, need at least 2 different nucleotides And each has to be present at least twice. Every tree is considered for every site, maintaining a running score of the number of mutations required. The tree with the smallest number of invoked mutations is the most parsimonious 6 5 432 1 G T G G G A G G G G G G G ATAGG 3 T ACTAG 4 1 2 Parsimony 1 G G 2 G ? A G G G G ? 1 G A 3 1 G G 2 2 G A 4 3 A ? A 4 4 A ? ? A 3 Mathematically, most likely candidate nts at an Internal node are: {descendent node 1} ? {descendant node 2} IF this is null set, then most likely candidate nts are: {descendent node 1} U {descendant node 2} Σ U = minimum number of substitutions required to account for nts at terminal nodes since they last shared common ancestor Total number of substitutions, informative + uninformative = tree length Parsimony 1 G G 2 G ? A G G G G ? 1 G A 3 1 G G 2 2 G A 4 3 A ? A 4 4 A ? ? A 3 If you use parsimony but weigh the mutations by some kind of scoring system that accounts for the likelihood of each mutation →weighted parsimony By-product of parsimony is inference of nt identity in the ancestral sequence Parsimony 10 sequences: > 2 million possible trees… Need a better way… Branch and bound (Hardy and Penny, 1982) Step 1: Determine an upper bound to the length of the most parsimonious tree = L - either chosen randomly, or else using a computationally fast way like UPGMA Step 2: Grow trees incrementally by adding branches to a smaller tree that describes just some of the sequences. Step 3: If at any point, the number of required substitutions is > L, abandon that tree. Step 4: As soon as you get a tree with fewer substiitutions than L, use that tree as the new upper bound to make remainder of the search even more efficient. Works for <= 20 sequences Parsimony For > 20 sequences Heuristic Searches Assumption: Alternative trees are not all independent of each other. Most parsimonious trees have similar topologies. Step 1: Construct an initial tree as a good guess: UPGMA, and use it as a starting point. Step 2: Branch-swap subtrees and graft them onto the starting tree, keeping overall topology. See how many are shorter than the starting tree. The prune and re-graft, and see if it keeps getting better. Step 3: Repeat until a round of branch swapping fails to generate any better trees Parsimony Often get tens or hundreds of equally parsimonious trees Build a consensus tree – any internal node supported by At least half the trees becomes a simple bifurcation. Phylogenetic Software PHYLIP: Phylogenetics Inference Package free at http://evolution.genetics.washington.edu Includes many programs including various distance methods, maximum likelihood, parsimony, with many of the options we’ve discussed. PAUP: Phylogenetic Analysis Using Parsimony http://www.lms.si.edu/PAUP -NOT FREE Now includes maximum likelihood and distance methods as well Tree of Life Carl Woese and colleagues, 1970s Used 16S rRNA – all organisms possess. Found 3 major evolutionary groups: Bacteria Eucarya Archea (including thermophilic bacteria Human Origins: mt DNA sequences – huan populations differ by ~ 0.33% (very small). Greatest differences NOT between current populations on different continents, but between human populations residing in Africa – “out of Africa” theory Mitochondrial “eve” and Y-chromosome “adam”