7.91 / 7.36 / BE.490 Lecture #2 Feb. 26, 2004 DNA Sequence Comparison & Alignment Chris Burge Review of Lecture 1: “Genome Sequencing & DNA Sequence Analysis” ? The Language of Genomics ? ? Flavors of BLAST ? Statistics of High Scoring Segments - cDNAs, ESTs, BACs, Alus, etc. Dideoxy Method / Shotgun Sequencing - The ‘shotgun coverage equation’ (Poisson) - BLAST[PNX], TBLAST[NX] Shotgun Sequencing a BAC or a Genome 200 kb (NIH) 3 Gb (Celera) Sequence, Assemble Sonicate, Subclone Subclones Shotgun Contigs What would cause problems with assembly? DNA Sequence Alignment IV Which alignments are significant? Q: 1 ttgacctagatgagatgtcgttcacttttactgagctacagaaaa 45 |||| |||||||||||| | ||||||||||||||||||||||||| S: 403 ttgatctagatgagatgccattcacttttactgagctacagaaaa 447 Identify high scoring segments whose score S exceeds a cutoff x using dynamic programming. Scores follow an extreme value distribution: P(S > x) = 1 - exp[-Kmn e -λx ] For sequences of length m, n where K, λ depend on the score matrix and the composition of the sequences being compared (Same theory as for protein sequence alignments) From M. Yaffe Notes (cont) Lecture #2 ? The random sequence alignment scores would give rise to an “extreme value” distribution – like a skewed gaussian. ? Called Gumbel extreme value distribution For a normal distribution with a mean m and a variance σ, the height of the curve is described by Y=1/(σ√2π) exp[-(x-m) 2 /2σ 2 ] For an extreme value distribution, the height of the curve is described by Y=exp[-x-e -x ] …and P(S>x) = 1-exp[-e -λ(x-u)] where u=(ln Kmn)/λ Can show that mean extreme score is ~ log 2 (nm), and the probability of getting a score that exceeds some number of “standard deviations” x is: P(S>x)~ Kmne -λx. ***K and λ are tabulated for different matrices **** -λS For the less statistically inclined: E~ Kmne -2 -1 0.2 Yev 0.4 -4 4 0.4 B. Yn Probability values for the extreme value distribution (A) and the normal distribution (B). The area under each curve is 1. 0 1 2 X X A. 3 4 5 DNA Sequence Comparison & Alignment ? Target frequencies and mismatch penalties ? Eukaryotic gene structure ? Comparative genomics applications: ? See Ch. 7 of Mount - Pipmaker (2 species comparison) - Phylogenetic Shadowing (many species) Intro to DNA sequence motifs i DNA Sequence Alignment V How is λ related to the score matrix? λ is the unique positive solution to the equation*: ∑ p p j e λ s ij = 1 i i,j p = frequency of nt i, s ij = score for aligning an i,j pair What kind of an equation is this? What would happen to λ ? if we doubled all the scores? What does this tell us about the nature of λ? *Karlin & Altschul, 1990 DNA Sequence Alignment VI What scoring matrix to use for DNA? Usually use simple match-mismatch matrices: i j: A C G T A 1 m m m C m 1 m m s i,j : G T m m m m 1 m m 1 m = “mismatch penalty” (must be negative) DNA Sequence Alignment VII How to choose the mismatch penalty? Use theory of High Scoring Segment composition* High scoring alignments will have composition: q ij =p i p j e λs ij where q ij = frequency of i,j pairs (“target frequencies”) p , p = freq of i, j bases in sequences being compared i j What would happen to the target frequencies if we doubled all of the scores? *Karlin & Altschul, 1990 DNA Sequence Alignment VIII Still figuring out how to choose the mismatch penalty m Target frequencies: q ij =p i p j e λs ij s ij =ln(q ij / p i p j )/ ? If you want to find regions with R% identities: r = R /100 q ii = r/4 q ij = (1-r)/12 (i,j) Set s ii = 1 Then m = s ij = s ij /s ii = ln(q ij / p i p j )/λ) / (ln(q ii / p i p i )/λ (i≠j) ? m = ln(4(1-r)/3)/ln(4r) λ DNA Sequence Alignment IX The single most useful thing there is to know about mismatch penalties: m = ln(4(1-r)/3)/ln(4r) Examples: r 0.75 0.95 0.99 m -1 -2 -3 r = desired fraction of identities in BLAST hits Nucleotide- nucleotide BLAST Web Server (BLASTN) Other advanced DNA Sequence Alignment X NCBI BLAST Advanced Options -G Cost to open gap [Integer] default = 5 for nucleotides 11 proteins -E Cost to extend gap [Integer] default = 2 nucleotides 1 proteins -q Penalty for nucleotide mismatch [Integer] default = -3 -r Reward for nucleotide match [Integer] default = 1 -e Expect value [Real] default = 10 Expression of a Eukaryotic Gene* DNA … … Transcription (N) pre-mRNA Translation (C) AAAAAAA mRNA Splicing (N) Protein Typical Human Gene Statistics Length of primary transcript: ~30,000 bp Number of exons: ~8-10 Mean (internal) exon length: ~150 bp Mean intron length: ~3,000 bp Comparative Genomics - Examples ? PipMaker: applications to - human/mouse exon finding - human/mouse regulatory region finding ? “Phylogenetic Shadowing”: applications to - multi-genome exon finding - multi-genome regulatory region finding PipMaker - Percent Identity Plot (PIP) for two genomic sequences For an illustration of pips, please see figure 1 of Schwartz, Scott, Zheng Zhang, Kelly A. Frazer, Arian Smit, Cathy Riemer, John Bouck, Richard Gibbs, Ross Hardison, and Webb Miller. "PipMaker--A Web Server for Aligning Two Genomic DNA Sequences." Genome Res. 10 (April 2000): 577-586. Application of PipMaker #1- finding human/mouse exons Please see figure 2 of Schwartz, Scott, Zheng Zhang, Kelly A. Frazer, Arian Smit, Cathy Riemer, John Bouck, Richard Gibbs, Ross Hardison, and Webb Miller. "PipMaker--A Web Server for Aligning Two Genomic DNA Sequences." Genome Res. 10 (April 2000): 577-586. A Computational Biology Paradigm for Finding Genomic Features of Interest ? Identify properties that feature of interest should have ? Develop an algorithm to find seq’s with these properties ? Run algorithm on genome to predict features ? Test a subset of the predicted features experimentally to determine how well the method works Application of PipMaker #2 - finding regulatory regions Please see figure 2 of Loots, GG, RM Locksley, CM Blankespoor, ZE Wang, W Miller, EM Rubin, and KA Frazer. "Identification of A Coordinate Regulator of Interleukins 4, 13, and 5 by Cross-species Sequence Comparisons." Science 288, no. 5463 (7 April 2000): 136-40. Effects on Transcription of Deleting CNS-1 region Please see figure 1 of Loots, GG, RM Locksley, CM Blankespoor, ZE Wang, W Miller, EM Rubin, and KA Frazer. "Identification of A Coordinate Regulator of Interleukins 4, 13, and 5 by Cross-species Sequence Comparisons." Science 288, no. 5463 (7 April 2000): 136-40. “Phylogenetic Shadowing” (the power of many genomes) ? Sequence orthologous region from 13-17 primates (~90+% identical) ? Do a multiple sequence alignment (MSA): ? Measure variability at each position ? Calculate P(data|fast evol.), P(data|slow evol.) What are potential advantages of many close species versus fewer more distant organisms? Please see figure 1 of Boffelli, D, J McAuliffe, D Ovcharenko, KD Lewis, I Ovcharenko, L Pachter, and EM Rubin. "Phylogenetic Shadowing of Primate Sequences to Find Functional Regions of The Human Genome." Science 299, no. 5611 (28 February 2003): 1391-4. Phylogenetic Shadowing of Regulatory Elements I Please see figure 2 of Boffelli, D, J McAuliffe, D Ovcharenko, KD Lewis, I Ovcharenko, L Pachter, and EM Rubin. "Phylogenetic Shadowing of Primate Sequences to Find Functional Regions of The Human Genome." Science 299, no. 5611 (28 February 2003): 1391-4. Phylogenetic Shadowing of Regulatory Elements II Please see figure 3 of Boffelli, D, J McAuliffe, D Ovcharenko, KD Lewis, I Ovcharenko, L Pachter, and EM Rubin. "Phylogenetic Shadowing of Primate Sequences to Find Functional Regions of The Human Genome." Science 299, no. 5611 (28 February 2003): 1391-4. Human Splice Signal Motif “Pictograms” 5' splice signal 3' splice signal Binding Affinity of the Dog Transcription Factor Site Fraction bound Yeast Genome CAT 1/2 A 1/3 T 1/3 AAT 1/4 C 1/6 G 1/6 GAT 1/4 Others 0 Search yeast promoters for potential Dog binding sites: How should you prioritize the promoters for followup experiments? Prioritizing Potential Dog Binding Sites Site Fraction bound Odds Ratio (R) CAT 1/2 (1/2) / (1/6)(1/3)(1/3)? 27.0 AAT 1/4 (1/4) / (1/3)(1/3)(1/3) 6.75 GAT 1/4 (1/4) / (1/6)(1/3)(1/3) 13.5 Others 0 (0) / ( ) ( ) ( ) 0.0 Yeast Genome A 1/3 T 1/3 C 1/6 G 1/6 Neyman-Pearson Lemma: Optimal decision rules are of the form R > C (C, a chosen cutoff value) Therefore: CAT > GAT > AAT > Others Weight Matrix Model (WMM) A 0.3 0.1 0.0 0.4 0.1 C 0.4 0.1 0.0 0.1 0.1 G 0.8 0.0 0.4 0.1 0.8 0.2 T 0.1 0.0 1.0 0.1 0.0 0.5 Pos -3 -2 -1 +1 +2 +3 +4 +5 +6 0.6 0.0 0.7 0.1 0.0 0.0 0.1 0.2 0.2 0.2 1.0 0.1 0.1 0.1 S = S 1 S 2 S 3 S 4 S 5 S 6 S 7 S 8 S 9 P(S|+) = P -3 (S 1 )P -2 (S 2 )P -1 (S 3 ) ??? P 5 (S 8 )P 6 (S 9 ) Inhomogeneous, assumes independence between positions Statistical Independence Two events A and B are said to be independent if: A B if and only if P(A,B) = P(A) P(B) In terms of conditional probabilities P(B|A) = P(A,B)/P(A) A B ???? P(B|A) = P(B) and P(A|B) = P(A) Example: E = { die roll an even number } (2,4,6) R = { die roll a prime number } (2,3,5) Are the events E and R independent? P(A,B) indicates probability that both A and B occur Weight Matrix Models II 5’ splice signal Background Con: C A G … G T Pos -3 -2 -1 … +5 +6 A 0.3 0.6 0.1 … 0.1 0.1 C 0.4 0.1 0.0 … 0.1 0.2 G 0.2 0.2 0.8 … 0.8 0.2 T 0.1 0.1 0.1 … 0.0 0.5 Pos Generic A 0.25 C 0.25 G 0.25 T 0.25 S = S 1 S 2 S 3 S 4 S 5 S 6 S 7 S 8 S 9 (S 1 )P -2 (S 2 )P -1 (S 3 ) ??? P 5 (S 8 )P 6 (S 9 ) Odds Ratio: R = P(S|+) = P -3 P(S|-) = P bg (S 1 )P bg (S 2 )P bg (S 3 ) ??? P bg (S 8 )P bg (S 9 ) Background model homogenous, assumes independence Weight Matrix Models III S = S 1 S 2 S 3 S 4 S 5 S 6 S 7 S 8 S 9 P(S|+) P -3 (S 1 )P -2 (S 2 )P -1 (S 3 ) ??? P 5 (S 8 )P 6 (S 9 ) Odds Ratio: R = = P(S|-) P bg (S 1 )P bg (S 2 )P bg (S 3 ) ??? P bg (S 8 )P bg (S 9 ) k=9 = ∏ P -4+k (S k )/ P bg (S k ) k=1 k=9 Score s = log 2 R = ∑ log 2 (P -4+k (S k )/ P bg (S k )) k=1 Neyman-Pearson Lemma: Optimal decision rules are of the form R > C Equiv.: log 2 (R) > C’ because log is a monotone function Weight Matrix Models IV Slide WMM along sequence: ttgacctagatgagatgtcgttcacttttactgagctacagaaaa …… Assign score to each 9 base window. Use score cutoff to predict potential 5’ splice sites Histogram of 5’ss Scores True 5’ Splice Sites “Decoy” 5’ Splice Sites Score (1/10 bit units) Measuring Accuracy: Sensitivity = % of true sites w/ score > cutoff Specificity = % of sites w/ score > cutoff that are true sites Sn: 20% 50% 90% Sp: 50% 32% 7% What does this result tell us?