7.91 – Lecture #1 Introduction to Bioinformatics Focus on Kinases & Pairwise Sequence Comparisons ARDFSHGLLENKLLGCDSMRWE .::. .:::. .:::: :::. GRDYKMALLEQWILGCD-MRWD Reading: This lecture: Mount pp.1-7, 29-35, 45-48, 51-64 Next lecture: Mount pp. 8-9, 65-89, 96-115, +more The Protein Data Bank (PDB - http://www.pdb.org/) is the single biological macromolecular structure data. Research 28 (2000): 235-242. (PDB Advisory Notice on using materials available in the archive: worldwide repository for the processing and distribution of 3-D Research 28 (2000): 235-242. (PDB Advisory Notice on using materials available in the archive: http://www.rcsb.org/pdb/advisory.html) Michael Yaffe Berman, H. M., J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Research 28 (2000): 235-242. http://www.rcsb.org/pdb/advisory.html) Outline ? Review of biological fundamentals ? Genetics example ? Biological databases/NCBI resources ? Simplified sequence analysis – where to start… ? Simple sequence comparisons ? Definitions of related sequences ? Concepts and types of alignments – the good, the bad, and the ugly ? Dot matrix alignments ? Computational efficiency ? Recursion and dynamic programming ? Substitution matrices: PAM, BLOSUM, Gonnet Outline (cont) ?Gaps ? Applied dynamic programming: global alignments: Needleman-Wunsch ? Applied dynamic programming: local alignments – Smith-Waterman ? Basic statistics of sequence alignments Review of biological fundamentals Genetic material ? gene as a concept DNA ? DNA as hereditary material RNA Protein Critical cell functions DNA structure A G C T U Bases DNA structure A G C T Base pairing DNA structure 1 5 4 3 2 1 5 4 3 2 Nucleoside Nucleotide DNA structure 5’ 3’ DNA structure 5’ 3’ DNA structure 5’ 3’ 3’ 5’ RNA structure 1 5 4 3 2 1 5 4 3 2 Transcription 5’ 3’ 3’ 5’ OH OH OH DNA RNA Protein structure Aspartic acid (Asp, D) Arginine (Arg, R) C C O - O H Glycine (Gly, G) Alanine (Ala, A) Valine (Val, V) H H 3 N + C C O - O H CH 3 CH 3 CH 3 H 3 N + C C O - O H CH H 3 N + C C O - O H CH H 3 N + H 3 C OH CH 3 CH 3 CH 3 C C O - O H CH 2 CH 2 CH 3 C C O - O H H 3 N + CH 2 CH 2 C C O - O H H 3 N + CH 2 S CH 3 H 3 N + C C O - O H H 3 N + CH 2 C C O - O O H H 3 N + CH 2 C C O - O H H 3 N + CH 2 CH 2 C NH 2 C C C O - O H H 3 N + CH 2 C C O - O H H 2 N + H 2 C C C O - O O H H 3 N + CH 2 C C O - O H H 3 N + CH 2 CH 2 - O C O C C O - O H H 3 N + CH 2 CH 2 CH 2 CH 2 NH 3 + NH 2 + NH + NH C C O - O H H 3 N + CH 2 C C O - O H H 3 N + CH 2 CH 2 CH 2 C NH NH 2 - O C O NH 2 CH 2 CH 2NH C C O - O H H 3 N + CH 2 C C O - O H H 3 N + CH 2 SH OH C C O - O H H 3 N + CH OH Leucine (Leu, L) Isoleucine (IIe, I) Methionine (Met, M) Serine (Ser, S) Aspartic acid (Asp, D) Glutamic acid (Glu, E) Lysine (Lys, K) Arginine (Arg, R) Histidine (His, H) Threonine (Thr, T) Cysteine (Cys, C) Tyrosine (Tyr, Y) Asparagine (Asn, N) Glutamine (Gln, Q) Acidic Basic Phenylalanine (Phe, F) Tryptophan (Trp, W) Proline (Pro, P) Electrically charged Polar, Hydrophillic R-groups Nonpolar, Hydrophobic R-groups Codon Table phe (UUU) phe leu leu leu leu leu leu ile ile ile met (and initiation) val val val val ser ser ser ser tyr tyr termination termination cys U C (5')....pNpNpN....(3') in mRNA Middle Base of Codon Base at 5' End of Codon Base at 3' End of Codon A G cys termination trp pro pro pro pro thr thr thr thr his his gln gln asn asn lys lys arg arg arg arg ser ser arg arg ala ala ala ala asp asp glu glu gly gly gly gly U C A G U C A G U C A G U C A G U C A G DNA structure 5’ 3’ 3’ 5’ “Sense” strand “Anti-Sense” strand Convention: Sense strand Read 5’? 3’ Reverse Complement Genetics experiment: phenotype Isolate a yeat mutant that has increased chromosome number Can rescue the phenotype with a piece of DNA …corresponds to a site of mutation in the yeasts DNA genotype CGTTTTTCCTGTAAAGCGCTTAATTTGTTTACCATTCTATAAAAACCTTGAGCTAAGGCCAACTGATGCA ATTGCTCAAGTGAATGCATAAACAAAGCAAGATCATTCTTAGCGCAAAAAAAACTGGGATTTTGAAATAC AACAAAAGAAAGAAGTAAAAAGGGAATGCAACGCAATAGTTTAGTAAATATCAAACTAAACGCTAATTCG CCATCGAAAAAGACCACAACAAGACCAAATACGTCCAGGATCAATAAACCATGGAGAATATCCCATTCGC CGCAGCAAAGAAACCCGAATTCAAAAATACCTTCACCTGTAAGAGAAAAATTGAACAGATTACCTGTAAA CAATAAGAAGTTTTTGGATATGGAAAGCTCCAAAATTCCATCACCTATAAGGAAAGCGACTTCTTCCAAA ATGATACACGAAAATAAGAAGCTACCTAAATTTAAATCCCTATCACTCGATGACTTTGAACTGGGGAAGA AATTAGGAAAGGGTAAATTCGGTAAAGTTTATTGCGTTCGGCACAGGAGTACAGGATATATTTGCGCACT GAAAGTAATGGAGAAGGAAGAAATAATAAAGTATAATTTACAGAAACAATTCAGAAGGGAGGTAGAAATA CAAACATCGCTAAATCATCCGAATCTAACTAAATCATACGGCTATTTTCATGATGAAAAAAGAGTGTACC TGCTAATGGAATACTTAGTCAATGGGGAAATGTATAAACTATTGAGGTTACACGGACCCTTCAACGATAT TTTAGCATCAGATTATATTTATCAAATTGCCAATGCCCTAGATTATATGCATAAAAAGAATATTATTCAT AGAGATATTAAACCTGAAAATATACTAATAGGGTTCAATAATGTCATTAAGTTAACGGACTTCGGATGGA GTATAATAAATCCGCCAGAAAATAGAAGGAAAACTGTCTGTGGGACAATTGACTACCTTTCTCCAGAAAT GGTGGAGTCAAGGGAATATGATCACACTATAGATGCATGGGCTCTTGGCGTCCTGGCGTTTGAACTACTG ACCGGTGCCCCTCCGTTCGAAGAAGAAATGAAAGATACTACATATAAAAGGATAGCAGCACTGGATATCA AAATGCCCAGTAACATTTCTCAGGATGCGCAAGATTTAATACTTAAACTACTAAAATACGACCCCAAAGA TAGAATGCGCCTTGGAGACGTAAAAATGCATCCTTGGATACTAAGAAACAAGCCCTTTTGGGAAAATAAG CGGTTATAGAATTAAAGTATGACAGAATCGTTTGAAGGGCACTATTAATCACTCCCGCACATATCACATA ATACTAAGTATCCATTTCTAATATTTCATTACTCTTTTCGGCATCGTATATTGCGATATTTGATTAAATT TTCTTGTTCATTTTTTCCTCTTTTCTTTCGCCTTTGTCGAAAGAAAAGAGGAAAACAAGCTGAAAATTGC TATGCATTAAAGTAGCAGATTTACTTTGTTGAGTTGGTTCTGATCAATAATAAGAGTAATGAAAGAAAGC AAAAAAATGGCTAAAGATAATTTAACTAATTTGCTCTCTCAATTGAACATTCAATTGTCTCAA The National Center for Biotechnology Information ? Created as a part of NLM in 1988 – Establish public databases – Perform research in computational biology – Develop software tools for sequence analysis – Disseminate biomedical information Molecular Databases ? – – information ? ? – ? ? – ? – ? Primary Databases Original submissions by experimentalists Database staff organize but don’t add additional Example: GenBank Derivative Databases Human curated compilation and correction of data Example: SWISS-PROT, NCBI RefSeq mRNA Computationally Derived Example: UniGene Combinations Example: NCBI Genome Assembly What is GenBank? ? Nucleotide only sequence database ? Archival in nature ? – Direct submissions individual records (BankIt, Sequin) – Batch submissions via email (EST, GSS, STS) – ftp accounts sequencing centers ? Data shared three collaborating databases – GenBank – DNA Database of Japan (DDBJ). – (EMBL) at EBI. NCBI’s Primary Sequence Database GenBank Data European Molecular Biology Laboratory Database Relationships of chromosomes to genome sequencing markers. 163 Mb The X chromosome is about 163 Mb in length. In this diagram, there are 16 overlapping BAC clones that span the entire length. In reality, 1,408 BACs were needed to span the X chromosome. Arrows (top) mark STSs scattered throughout the chromosome and on overlapping BACs. BACs STSs X Chromosome GenBank: NCBI’s Primary Sequence Database ? full release every two months ? incremental and cumulative updates daily ftp://ftp.ncbi.nih.gov/genbank/ Release 133 December 2003 19,808,101 Records 28,507,990,166 Nucleotides 110,000 + ? available only through internet Species 107.07 Gigabytes of data GenBank Divisions Bulk Sequence Divisions PAT Patent EST Expressed Sequence Tags STS Sequence Tagged Sites GSS Genome Survey Sequences HTG High Throughput Genome HTC High Throughput cDNA CON Contig Traditional Divisions BCT INV MAM PHG PLN PRI ROD SYN UNA VRL VRT The International Sequence Database Collaboration EBI GenBank DDBJ EMBL EMBL Entrez SRS getentry NIG CIB NCBI NIH ?Submissions ?Updates ?Submissions ?Updates ?Submissions ?Updates Web Access Text BLAST VAST Entrez Sequence Structure http://www.ncbi.nlm.nih.gov Genetics experiment: phenotype Isolate a yeat mutant that has increased chromosome number Can rescue the phenotype with a piece of DNA …corresponds to a site of mutation in the yeasts DNA genotype CGTTTTTCCTGTAAAGCGCTTAATTTGTTTACCATTCTATAAAAACCTTGAGCTAAGGCCAACTGATGCA ATTGCTCAAGTGAATGCATAAACAAAGCAAGATCATTCTTAGCGCAAAAAAAACTGGGATTTTGAAATAC AACAAAAGAAAGAAGTAAAAAGGGAATGCAACGCAATAGTTTAGTAAATATCAAACTAAACGCTAATTCG CCATCGAAAAAGACCACAACAAGACCAAATACGTCCAGGATCAATAAACCATGGAGAATATCCCATTCGC CGCAGCAAAGAAACCCGAATTCAAAAATACCTTCACCTGTAAGAGAAAAATTGAACAGATTACCTGTAAA CAATAAGAAGTTTTTGGATATGGAAAGCTCCAAAATTCCATCACCTATAAGGAAAGCGACTTCTTCCAAA ATGATACACGAAAATAAGAAGCTACCTAAATTTAAATCCCTATCACTCGATGACTTTGAACTGGGGAAGA AATTAGGAAAGGGTAAATTCGGTAAAGTTTATTGCGTTCGGCACAGGAGTACAGGATATATTTGCGCACT GAAAGTAATGGAGAAGGAAGAAATAATAAAGTATAATTTACAGAAACAATTCAGAAGGGAGGTAGAAATA CAAACATCGCTAAATCATCCGAATCTAACTAAATCATACGGCTATTTTCATGATGAAAAAAGAGTGTACC TGCTAATGGAATACTTAGTCAATGGGGAAATGTATAAACTATTGAGGTTACACGGACCCTTCAACGATAT TTTAGCATCAGATTATATTTATCAAATTGCCAATGCCCTAGATTATATGCATAAAAAGAATATTATTCAT AGAGATATTAAACCTGAAAATATACTAATAGGGTTCAATAATGTCATTAAGTTAACGGACTTCGGATGGA GTATAATAAATCCGCCAGAAAATAGAAGGAAAACTGTCTGTGGGACAATTGACTACCTTTCTCCAGAAAT GGTGGAGTCAAGGGAATATGATCACACTATAGATGCATGGGCTCTTGGCGTCCTGGCGTTTGAACTACTG ACCGGTGCCCCTCCGTTCGAAGAAGAAATGAAAGATACTACATATAAAAGGATAGCAGCACTGGATATCA AAATGCCCAGTAACATTTCTCAGGATGCGCAAGATTTAATACTTAAACTACTAAAATACGACCCCAAAGA TAGAATGCGCCTTGGAGACGTAAAAATGCATCCTTGGATACTAAGAAACAAGCCCTTTTGGGAAAATAAG CGGTTATAGAATTAAAGTATGACAGAATCGTTTGAAGGGCACTATTAATCACTCCCGCACATATCACATA ATACTAAGTATCCATTTCTAATATTTCATTACTCTTTTCGGCATCGTATATTGCGATATTTGATTAAATT TTCTTGTTCATTTTTTCCTCTTTTCTTTCGCCTTTGTCGAAAGAAAAGAGGAAAACAAGCTGAAAATTGC TATGCATTAAAGTAGCAGATTTACTTTGTTGAGTTGGTTCTGATCAATAATAAGAGTAATGAAAGAAAGC AAAAAAATGGCTAAAGATAATTTAACTAATTTGCTCTCTCAATTGAACATTCAATTGTCTCAA Using Entrez An integrated database search and retrieval system 3-D Structur e Entrez: Database Integration VAST Phylogeny Genomes Taxonomy PubMed abstracts Nucleotide sequences Protein sequences 3 -D Structur e Word weight BLASTBLAST The (ever) Expanding Entrez Entrez Journals UniGene PubMed Nucleotide Protein SNP Genome Books OMIM Taxonomy 3D Domains UniSTS Structure System ProbeSet CDD PopSet Entrez Databases PubMed Biomedical literature Books Online textbooks Nucleotide GenBank, EMBL, DDBJ, RefSeq, PDB Protein [GenBank, EMBL, DDBJ], RefSeq, SWISS-PROT, PIR, PRF, PDB Genome Complete genomes Taxonomy Organisms in NCBI sequence databases Structure MMDB: experimental 3D structures Domains CDD: conserved protein domains 3D Domains Compact 3D protein domains in MMDB OMIM Online Mendelian Inheritance in Man SNP Single nucleotide polymorphisms UniSTS Sequence Tagged Site markers ProbeSet Gene expression and microarray datasets PopSet Population study datasets UniGene Gene-based expressed sequence clusters A Traditional GenBank Record Definition =Title ACCESSION U07163 VERSION U07163.1 GI:460243 Accession Number Version Number GI Number NCBI’s Taxonomy FASTA Format > FASTA Definition Line >gi|460243|gb|U07163.1|SCU07163 Database Identifiers gb GenBank emb EMBL dbj DDBJ sp SWISS-PROT pdb pir prf ref RefSeq Accession number Locus Name gi number Protein Databank PIR PRF Abstract Syntax Notation: ASN.1 FASTA Nucleotide FASTA Protein GenPept GenBank ASN.1 ***************************************/ {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL}, ,NULL ,TRUE,'e',A * * asn2ff.c * * *********************************** #include <accentr.h> #include "asn2ff.h" #include "asn2ffp.h" #include "ffprint.h" #include <subutil.h> #include <objall.h> #include <objcode.h> #include <lsqfetch.h> #include <explore.h> #include <accid1.h> #endif FILE *fpl; {"Input is a Seq-entry","F", NULL RG_BOOLEAN,0.0,0,NULL}, Toolbox Sources ftp> open ftp.ncbi.nih.gov . . ftp://ftp.ncbi.nlm.gov/toolbox/ncbi_tools NCBI Toolbox /************************************************************************ convert an ASN.1 entry to flat file format, using the FFPrintArray. #ifdef ENABLE_ID1 Args myargs[] = { {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL}, {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL}, {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL}, ftp> cd toolbox ftp> cd ncbi_tools /protein_id="AAA20496.1" /db_xref="GI:460244" GenPept Protein IDS Why align sequences? ? Functional predictions based on identifying homologues. Assumes: conservation of conservation of sequence function BUT: Function carried out at level of proteins, i.e. 3-D structure Sequence conservation carried out at level of DNA 1-D sequence Implicit Assumption of Evolution in Models of Sequence Homology Assume: Sequence conservation Structure conservation Note that the converse is NOT necessarily true! Structure conservation X Sequence conservation Definitions ? Homologue: a relatively non-specific term (meaningless?) that conveys the idea that two sequences are somehow related ? Orthologue: Ortho = (greek) straight….implies direct descent, 1 ancestor Vav human Vav bovine human Vav -or- avian Vav bovine Vav Vav avian dinosaur Vav Vav elegans Ye’ olde Vav Definitions ? Paralogue: Para = (greek) along side of….implies some type of gene duplication event Vav1 human Vav2human Vav3human -or Complex problem Chicken Vav Vav1 human Vav2human Vav3human Orthologue or Paralogue? Need to know evolutionary relationships Can try and get these from alignments as well… Alignments- Good Bad and Ugly HgbA-human GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL :+ +::+:::::+ :+++++::+:++ +++++::+::+ :: GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL HgbB-human Alignments- Good Bad and Ugly GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE HgbA-human GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL ::+ ++: + ++:+: ++ :+ :+ : +:+ : ++::+ Nematode glutathione S-transferase SPURIOUS ALIGNMENT HgbA-human GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++:+ ::+ ++ +:++ + +: :+ +:+ : NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG Leghemoglobin, yellow lupin Alignments ? Types: ? Local ? Global ? Ungapped ? Gapped (2 types- linear, affine) ? Methods: ? Dot matrix ? Dynamic Programming ? Word, k-tup Simplest alignment representation – Dot plot Model: Need a metric of similarity between amino acid pairs Simplest metric – unitary matrix, identity matrix K I H G F E D C A 0 0 0 0 0 0 0 0 1 A C 0 0 0 0 0 0 0 1 D 0 0 0 0 0 0 1 E 0 0 0 0 0 1 F 0 0 0 0 1 G 0 0 0 1 H 0 0 1 I 0 1 K 1 Example – self alignment V E S F E L R K F S D F G 0 0 0 0 0 0 0 0 0 0 0 0 1 G F 0 0 0 0 1 0 0 1 1 0 0 0 D 0 0 0 0 0 0 1 0 0 0 0 S 0 0 0 0 0 1 0 1 0 0 F 0 0 0 0 1 1 0 0 0 K 0 0 0 1 0 0 0 0 R 0 0 1 0 0 0 0 L 0 1 0 0 0 0 E 1 0 0 1 0 F 1 0 0 0 S 1 0 0 Example – self alignment…..with sliding window V E S F E L R K F S D F G 0 0 0 0 0 0 0 0 0 0 0 0 1 G F 0 0 0 0 1 0 0 1 1 0 0 0 D 0 0 0 0 0 0 1 0 0 0 0 S 0 0 0 0 0 1 0 1 0 0 F 0 0 0 0 1 1 0 0 0 K 0 0 0 1 0 0 0 0 R 0 0 1 0 0 0 0 L 0 1 0 0 0 0 E 1 0 0 1 0 F 1 0 0 0 S 1 0 0 Example – self alignment V E S F E L R K F S D F G 0 0 0 0 0 0 0 0 0 0 0 0 1 G F 0 0 0 0 1 0 0 1 0 0 0 0 D 0 0 0 0 0 0 1 0 0 0 0 S 0 0 0 0 0 1 0 0 0 0 F 0 0 0 0 1 0 0 0 0 K 0 0 0 1 0 0 0 0 R 0 0 1 0 0 0 0 L 0 1 0 0 0 0 E 1 0 0 0 0 F 1 0 0 0 S 1 0 0 Simple alignments 1 1 200 1 1 200 200 200 DNA Self alignment – Dot Matrix symmetric Now align two different sequences * Consider other similarity matrices besides identity…. ? Chemical similarity – binary decision ? Amino acid conservation in aligned protein families – min. similarity score (+/- window) ? Average of multiple scoring systems Dot Matrix Alignments Sequence#1 Sequence#1 1 n 1 m Note – NOT symmetric A Local Alignment Dot Matrix Alignments Sequence#1 Sequence#1 Insertion Insertion 1 n 1 m A Global Alignment General Rules for Dot Matrices ? Advantage – let’s your eyes/brain do the work – VERY EFFICIENT!!!! ? DNA comparisons – long windows and high stringencies (11/7, 15/11) ? Proteins – short windows and stringencies (1/1) EXCEPT in looking for domains – then longer windows and smaller stringencies (15/5). ? Note – we can use these types of self-aligned matrices for more than just sequence comparisons…i.e. distance between side Cα atoms in 3d-structures etc….maybe later in the course! Computational Efficiency Measure efficiency in cpu run time and memory O() = “big-oh” notation Both scale as size of the problem, measured in number of units, n, in the problem, i.e. run time is f(n). Analyse the asymptotic worst-case running time…. Sometimes just do the experiment and measure it…. If problem scales as square of the number of units in the problem O(n 2 ) (=order n-squared) Examples O(n k ) is “polynomial time” as long as K<3 …..tractable Consider our un-gapped dot matrix Global alignment: 1 n 1 12345678…. 12345678…. 12345678…. 12345678…. 12345678…. m 12345678…. ….essentially an O(mn) problem O.K. Examples O(n) better than O(n log(n)), better than O(n 2 ), better than O(n 3 ) Terrible Examples O(k n ) = exponential time….horrible!!!! NP problems- no known polynomial time Solutions = non-deterministic polynomial Problems.