Protein Structure

Outline of the next part of the course
4/1 Protein Structure Comparison & Classification
4/6 Principles of Molecular Mechanics
4/8 X-ray crystallography and NMR
4/13 Modeling Mutants and Homologs
4/15 Threading and Ab Initio Structure Prediction
4/22 Computational Protein Design

7.91 April 1, 2004 Amy Keating

Introduction to Protein Structure & Classification

Protein structures
- basics
- where to find them
- how to look at them
- what they can tell you
- structural and evolutionary comparisons

PDB ID: 1HCL
Schulze-Gahmen, U., J. Brandsen, H. D. Jones, D. O. Morgan, L. Meijer, J. Vesely, S. H. Kim. "Multiple Modes of Ligand Recognition: Crystal Structures of Cyclin-dependent Protein Kinase 2 in Complex with ATP and Two Inhibitors, Olomoucine and Isopentenyladenine." Proteins 22 (1995): 378.

The Protein Data Bank (PDB - http://www.pdb.org/) is the single worldwide repository for the processing and distribution of 3-D biological macromolecular structure data.
Berman, H. M., J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. "The Protein Data Bank." Nucleic Acids Research 28 (2000): 235-242.
(PDB Advisory Notice on using materials available in the archive: http://www.rcsb.org/pdb/advisory.html)

Review of protein structure hierarchy
- Primary structure
  MAAAAAAGPEMVRGQVF
- 20 amino acids
  - hydrophobic/hydrophilic
  - acidic/basic
  - large/small
  - specialized (Gly, Pro, Cys)

[Figure: chemical structures of the 20 amino acids, grouped by R-group]
Nonpolar, hydrophobic R-groups: Glycine (Gly, G), Alanine (Ala, A), Valine (Val, V), Leucine (Leu, L), Isoleucine (Ile, I), Methionine (Met, M), Phenylalanine (Phe, F), Tryptophan (Trp, W), Proline (Pro, P)
Polar, hydrophilic R-groups: Serine (Ser, S), Threonine (Thr, T), Cysteine (Cys, C), Tyrosine (Tyr, Y), Asparagine (Asn, N), Glutamine (Gln, Q)
Electrically charged R-groups:
  Acidic: Aspartic acid (Asp, D), Glutamic acid (Glu, E)
  Basic: Lysine (Lys, K), Arginine (Arg, R), Histidine (His, H)

Representations of Protein Structure
[Figure: alternative renderings of a protein structure]

Review of protein structure hierarchy
- Secondary structure - why do you get regular secondary structure?
  α-helices
  β-strands

  SGAYGSVCAA FDTKTGHRVA VKKLSRPFQS IIHAKRTYRE LRLLKHMKHE
  EEEEEE EE EEE EEEE HHHHHHHHHH HHHHHH
  (E = β-strand, H = α-helix)

Review of protein structure hierarchy
- Tertiary structure (example shown: N-terminal domain of kinase)
- Quaternary structure (example shown: hemoglobin)
Why do you get compact/globular tertiary structures?

Other units of protein structure
- Motifs (examples shown: EF hand, coiled coil)
- Domains

Sequence determines structure. How?
- Secondary structure preferences (satisfy H bonds)
- Hydrophobic/polar patterning
- Steric complementarity
- Electrostatics
Interactions are both LOCAL and NONLOCAL in sequence
[Figure: plot of Energy versus X]

Where do protein structures live?
www.rcsb.org/pdb
24,785 structures now in the PDB!
Compare: SwissProt 146,193, TrEMBL 1,070,786

Finding structures in the PDB
[Screenshot callouts: GET MORE INFO, THE PDB CODE, THE TECHNIQUE, THE RESOLUTION]

Exploring structures in the PDB
[Screenshot callouts: LOOK AT THE STRUCTURE, THE RESOLUTION, R-value, GET THE PDB FILE]

Useful information in the PDB header
REMARK 280 CRYSTAL
REMARK 280 SOLVENT CONTENT, VS (%): 58.0
REMARK 280 MATTHEWS COEFFICIENT, VM (ANGSTROMS**3/DA): 2.92
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: THE PROTEIN CRYSTALLIZED IN 18%
REMARK 280 PEG 8000, 0.2M MG(OAC)2, 0.1M HEPES, PH7.0. THE PROTEIN
REMARK 280 CONCENTRATION WAS ~ 10MG/ML IN A BUFFER OF 50MM NACL,
REMARK 280 1MM EDTA, 10MM DTT, 1MM BENZAMIDINE, 1UM PEPSTATIN, 10UG/ML
REMARK 280 LEUPEPTIN, 25MM HEPES,PH7.4.
REMARK 999 SEQUENCE
REMARK 999 1P38 SWS P47811 1 - 3 NOT IN ATOMS LIST
REMARK 999 1P38 SWS P47811 355 - 360 NOT IN ATOMS LIST
DBREF 1P38 4 354 SWS P47811 MP38_MOUSE 4 354
SEQRES 1 379 GLY SER SER HIS HIS HIS HIS HIS HIS SER SER GLY LEU
SEQRES 2 379 VAL PRO ARG GLY SER HIS MET SER GLN GLU ARG PRO THR
SEQRES 3 379 PHE TYR ARG GLN GLU LEU ASN LYS THR ILE TRP GLU VAL
SEQRES 4 379 PRO GLU ARG TYR GLN ASN LEU SER PRO VAL GLY SER GLY

Useful information in the PDB header
REMARK 3 FIT TO DATA USED IN REFINEMENT.
REMARK 3 CROSS-VALIDATION METHOD : NULL
REMARK 3 FREE R VALUE TEST SET SELECTION : RANDOM
REMARK 3 R VALUE (WORKING SET) : 0.212
REMARK 3 FREE R VALUE : 0.244
REMARK 3 FREE R VALUE TEST SET SIZE (%) : 10.
REMARK 3 FREE R VALUE TEST SET COUNT : NULL
REMARK 3 ESTIMATED ERROR OF FREE R VALUE : NULL
REMARK 3 RMS DEVIATIONS FROM IDEAL VALUES.
REMARK 3 BOND LENGTHS (A) : 0.010
REMARK 3 BOND ANGLES (DEGREES) : 1.58
REMARK 3 DIHEDRAL ANGLES (DEGREES) : NULL
REMARK 3 IMPROPER ANGLES (DEGREES) : NULL
REMARK 3 B VALUES.
REMARK 3 FROM WILSON PLOT (A**2) : NULL
REMARK 3 MEAN B VALUE (OVERALL, A**2) : 29.7

Atomic coordinates in the PDB file
                                  X       Y       Z     occ    B
ATOM      1  N   GLU     4      28.492   3.212  23.465  1.00 70.88
ATOM      2  CA  GLU     4      27.552   4.354  23.629  1.00 69.99
ATOM      3  C   GLU     4      26.545   4.432  22.489  0.00 67.56
ATOM      4  O   GLU     4      26.915   4.250  21.328  0.00 68.09
ATOM      5  CB  GLU     4      28.326   5.683  23.680  0.00 72.34
ATOM      6  CG  GLU     4      27.447   6.910  23.973  0.00 75.98
ATOM      7  CD  GLU     4      28.123   8.247  23.659  0.00 78.43
ATOM      8  OE1 GLU     4      29.375   8.299  23.604  0.00 79.32
ATOM      9  OE2 GLU     4      27.393   9.251  23.468  0.00 79.58
ATOM     10  N   ARG     5      25.274   4.610  22.852  1.00 63.77
ATOM     11  CA  ARG     5      24.179   4.807  21.907  1.00 59.83
ATOM     12  C   ARG     5      23.411   3.698  21.219  1.00 56.20
ATOM     13  O   ARG     5      23.987   2.808  20.596  1.00 57.33
ATOM     14  CB  ARG     5      24.604   5.784  20.812  1.00 60.86
ATOM     15  CG  ARG     5      23.926   7.127  20.866  1.00 61.89
ATOM     16  CD  ARG     5      24.295   7.944  19.647  1.00 62.21

Looking at Protein Structures
- Quick and dirty: Rasmol, Chime, Cn3D (NCBI)
- More powerful: Swiss PDB Viewer, PyMol (free! Many platforms); Insight, Quanta ($$$, nice interface, powerful)
- Publication quality graphics, but not easy to manipulate: Molscript/Raster3D

Comparing Protein Structures
Reading: Mount, Chapter 9
Why?
- detect evolutionary relationships
- identify recurring motifs
- detect structure/function relationships
- predict function
- assess predicted structures
- classify structures - used for many purposes

Structure is more conserved than sequence
[Figure: mouse Abl tyrosine kinase and human p38 serine kinase, 28% sequence identity]

Detecting substructures is challenging
Please see figure 1 of Ortiz, Angel R., Charlie E. M. Strauss, and Osvaldo Olmea. "MAMMOTH (Matching Molecular Models Obtained from Theory): An Automated Method for Model Comparison." Protein Sci 11 (2002): 2606-2621.

Recognizing Structural Similarity
GOAL: Of all solved structures, find the structure or substructure most similar to a protein of interest
By eye - tried and true!
requires an expert viewer with a GREAT memory!
Automated detection - good for database searching
How would you do this?

Features of automated structure comparison
1. What representation will you use for the protein?
2. How will you assess structural similarity?
3. How will you search the possible comparisons?
4. How significant is a “hit”?

Example: Superposition to minimize RMSD
1. Define measure of similarity
   $\mathrm{RMSD} = \left\{ \frac{1}{N} \sum |x_i - x_j|^2 \right\}^{1/2}$
   where x_i and x_j are the positions of corresponding atoms in the two proteins and N is the number of matched pairs
2. Determine correspondence between residues of each protein (e.g. by sequence alignment, or a guess)
3. Align centers of mass
4. Use matrix methods to solve for the rotation that gives minimal RMSD (variety of methods available)
5. Evaluate the resulting number
6. Refine the alignment
7. Iterate
Very useful. Commonly used for comparing similar structures. But…
- Not a good choice when proteins are only partially similar. Why?
- Also, points far from center of mass are weighted more heavily.
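Steps 3 and 4 of the superposition recipe can be sketched in a few lines. This is a minimal illustration, not the method used in the lecture: it assumes the residue correspondence (step 2) is already known, uses the unweighted centroid in place of the center of mass, and picks the SVD-based Kabsch procedure as one of the "variety of methods available". The function name superpose_rmsd is hypothetical.

```python
import numpy as np

def superpose_rmsd(P, Q):
    """Minimal RMSD between matched coordinate sets P and Q (both N x 3).

    Row k of P is assumed to correspond to row k of Q.
    Returns (rmsd, R), where R rotates the centered P onto the centered Q.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)

    # Step 3: move both centroids to the origin
    # (assumption: unweighted centroid instead of true center of mass).
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Step 4: Kabsch procedure. SVD of the 3 x 3 covariance matrix.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)

    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Step 5: evaluate the RMSD after applying the optimal rotation.
    diff = Pc @ R.T - Qc
    rmsd = np.sqrt((diff ** 2).sum() / len(P))
    return rmsd, R
```

Because every matched pair enters the sum, a single badly placed loop inflates the result, which is one way to see why this measure behaves poorly for partially similar proteins.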
Algorithms for detecting structure similarity
Dynamic Programming
- works on 1D strings - reduce problem to this
- can’t accommodate topological changes
- example: Sequential Structure Alignment Program (SSAP)
3D Comparison/Clustering
- identify secondary structure elements or fragments
- look for a similar arrangement of these between different structures
- allows for different topology, large insertions
- example: Vector Alignment Search Tool (VAST)
Distance Matrix
- identify contact patterns of groups that are close together
- compare these for different structures
- fast, insensitive to insertions
- example: Distance ALIgnment Tool (DALI)
Unit vector RMS
- map structure to sphere of vectors
- minimize the difference between spheres
- fast, insensitive to outliers
- example: Matching Molecular Models Obtained from Theory (MAMMOTH)

DALI represents proteins at the residue level; look for similarities using a distance matrix
[Figure: residue-by-residue distance matrix, both axes running over residues 1 to 70; a marked entry indicates residues close in space]
Images based on Holm, L., and C. Sander. "Protein Structure Comparison by Alignment of Distance Matrices." J Mol Biol. 233, no. 1 (5 September 1993): 123-38.

Compare contact patterns of different proteins
[Figure: distance matrices of two proteins, one spanning residues 1 to 70 and the other residues 1 to 85]

Break distance matrix into hexapeptide regions
[Figure: distance matrix divided into hexapeptide-hexapeptide blocks, giving a list of contact patterns]
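The distance-matrix representation and the hexapeptide blocks described above can be sketched as follows. This is only an illustration of the data structure, not DALI's implementation: the function names and the 8.0 Å contact cutoff are assumptions chosen for the example, and one coordinate (for instance the Cα atom) per residue is assumed as input.

```python
import math

def distance_matrix(coords):
    """Pairwise distances between residues, one (x, y, z) point per residue."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def contact_map(dmat, cutoff=8.0):
    """Mark residue pairs that are close in space.

    The 8.0 angstrom default is an illustrative assumption, not a value
    taken from the lecture.
    """
    return [[d <= cutoff for d in row] for row in dmat]

def hexapeptide_block(dmat, i, j, size=6):
    """Distances between the fragment starting at residue index i and the
    fragment starting at residue index j (a size x size submatrix)."""
    return [row[j:j + size] for row in dmat[i:i + size]]
```

Because only internal distances are stored, the matrix does not change when the protein is rotated or translated, so two structures can be compared block by block without superimposing them first.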
Compare contact patterns of different proteins
[Figure: distance matrices of two proteins with matching hexapeptide contact patterns highlighted]
40,000 pairs that match, for example:
- 1-6 with 65-70 matches 1-6 with 50-55
- 15-20 with 80-85 matches 15-20 with 65-70
- 1-6 with 55-60 matches 1-6 with 40-45
Images based on Holm, L., and C. Sander. "Protein Structure Comparison by Alignment of Distance Matrices." J Mol Biol. 233, no. 1 (5 September 1993): 123-38.

How do you compare assemblies?
$\phi(i,j) = \left( 0.2 - \frac{|d^A_{ij} - d^B_{ij}|}{\mathrm{avg}(d^A_{ij}, d^B_{ij})} \right) e^{-r^2/a^2}$
- $d^A_{ij}$: distance between i and j in A (get from matrix)
- $d^B_{ij}$: distance between i and j in B
- the exponential factor down-weights pairs that are far
$S = \sum_i \sum_j \phi(i,j)$, where (i, j) is a pair of matched residues

Monte Carlo assembly of fragments
[Figure: matched contact patterns from the two distance matrices combined into a larger alignment]

Example of structural similarity detected by DALI
[Figure: structural match involving chloramphenicol acetyl transferase, 10-18% sequence identity]
Keating et al. Nat. Struct. Biol. (2002) 9, 522-526

Advantages of DALI
3D matrix similarity search
- Can accommodate:
  - gaps/insertions
  - altered connectivity
  - chain reversal
- Fast enough for database comparisons
- Coordinate-frame invariant
- Pre-processing of distance matrices gives fast alignment performance
- Sensitive and accurate, even in presence of distortions
- CONVENIENT WEB INTERFACE!! www.ebi.ac.uk/dali/

http://www.ebi.ac.uk/dali/
- Fold classification based on Structure-Structure Alignment of Proteins
- Pre-computed similarities of proteins in the PDB
[Figure: structure-based sequence alignment; 24% sequence ID, rmsd = 3.0 Å]

unitRMS
[Figure: numbered Cα traces and the corresponding unit vectors; indices i and j = i + offset]
$\mathrm{URMS} = \min_{\text{rotations}} \left( \sum (V_i - V_j)^2 \right)^{1/2}$
Chew et al., RECOMB (1999)
Kedem et al., PROTEINS 37, 554 (1999)

URMS advantages
1. Insensitive to outliers: URMS max = 2
2. Weighs all parts of protein equally
3. URMS min is bounded - not very sensitive to length of protein
4. More compact representation - O(n), compared to O(n^2) for distance matrices
5. Fast to compute: O(n log n) for searching for substructures
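The unit-vector representation can be sketched as follows. This is a minimal illustration under stated assumptions, not the URMS or MAMMOTH implementation: the two chains have equal length with residue i matched to residue i (offset 0), the minimizing rotation is found with an SVD (a swapped-in Kabsch-style step, since the slides do not name a method), and the sum is divided by the number of vectors so that the result stays within the bound of 2 quoted above. The function names are hypothetical.

```python
import numpy as np

def unit_vectors(ca):
    """Unit vectors pointing from each C-alpha to the next one.

    Placing all of them at the origin maps the chain onto the unit sphere.
    """
    ca = np.asarray(ca, dtype=float)
    v = np.diff(ca, axis=0)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def urms(ca_a, ca_b):
    """Unit-vector RMS between two equal-length C-alpha traces.

    Assumption: the sum of squared differences is divided by the number
    of vectors, which keeps the value between 0 and 2.
    """
    U = unit_vectors(ca_a)
    V = unit_vectors(ca_b)

    # Directions have no position, so only a rotation is optimized
    # (no centering step is needed).
    W, _, Zt = np.linalg.svd(U.T @ V)
    d = np.sign(np.linalg.det(Zt.T @ W.T))
    R = Zt.T @ np.diag([1.0, 1.0, d]) @ W.T

    diff = U @ R.T - V
    return np.sqrt((diff ** 2).sum() / len(U))
```

Since every vector has length 1, no single residue pair can contribute more than 2 to the difference, which is the sense in which the measure is insensitive to outliers.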