Solving structures using X-ray crystallography & NMR spectroscopy
7.91
Amy Keating

How are X-ray crystal structures determined?
1. Grow crystals - structure determination by X-ray crystallography relies on the repeating structure of a crystalline lattice.
2. Collect a diffraction pattern - periodically spaced atoms in the crystal give specific “spots” where X-rays interfere constructively.
3. Carry out a Fourier transform to get from “reciprocal space” to a real space description of the electron density.
   THIS STEP REQUIRES KNOWLEDGE OF THE PHASES OF THE INTERFERING WAVES, WHICH CAN’T BE DIRECTLY MEASURED: “THE PHASE PROBLEM”
4. Build a preliminary model of the protein into the envelope of electron density that results from the experiment.
5. Refine the structure through an iterative process of changing the model and comparing how it fits the data.

The Phase Problem: we don’t know what phases to use to add up all of the contributing waves. BIG PROBLEM.

$F_{hkl} = |F_{hkl}| \exp(i\alpha_{hkl}) = \sum_j f_j \exp[2\pi i (h x_j + k y_j + l z_j)]$

• $|F_{hkl}|$ = observable amplitude
• $f_j$ = atomic scattering factor, related to the electron density around atom j
• the phase of F is determined by the x, y and z coordinates of the atoms

What we observe is $I_{hkl} \propto |F_{hkl}|^2$; we can’t measure the phases directly.
Get phases from molecular replacement, or heavy atom methods.

X-Ray Crystal Structure Refinement

$U_{\text{X-ray expt}} = \sum_{h,k,l} [F_{obs}(h,k,l) - F_{calc}(h,k,l)]^2$

• The data: $F_{obs}$, actual intensity of spot
• The model: $F_{calc}$, intensity of spot calculated from trial structure
• Summation runs over spots observed in expt

$U_{hybrid} = U_{\text{Molec Model}} + s\,U_{\text{X-ray expt}}$

• Simulated annealing on hybrid potential rapidly improves correspondence between structure and X-ray observations while maintaining reasonable chemistry (large radius of convergence)
• Previous method effectively used local minimization, which became trapped in local minima (small radius of convergence)

The Free R factor
• 90% of X-ray amplitudes are compared with model-derived amplitudes from the current model and used to change the model:
  $R = \sum ||F_{obs}| - |F_{calc}|| / \sum |F_{obs}|$
• 10% of X-ray amplitudes are set aside and used to assess the model:
  $R_{free} = \sum ||F_{obs}| - |F_{calc}|| / \sum |F_{obs}|$

What parameters do you refine?
• Atomic coordinates X, Y, Z
• The temperature factor of each atom, B
• Can also refine the occupancy

$B = 8\pi^2 \langle u^2 \rangle$
$\langle u^2 \rangle$ = mean square atomic displacement; units of B = Å$^2$
B results from atomic vibrations and disorder
Example: B = 20 --> 0.5 Å displacement
         B = 80 --> 1 Å displacement

Atomic coordinates in the PDB file
                            X       Y       Z    occ     B
ATOM    1  N    GLU   4  28.492   3.212  23.465  1.00  70.88
ATOM    2  CA   GLU   4  27.552   4.354  23.629  1.00  69.99
ATOM    3  C    GLU   4  26.545   4.432  22.489  0.00  67.56
ATOM    4  O    GLU   4  26.915   4.250  21.328  0.00  68.09
ATOM    5  CB   GLU   4  28.326   5.683  23.680  0.00  72.34
ATOM    6  CG   GLU   4  27.447   6.910  23.973  0.00  75.98
ATOM    7  CD   GLU   4  28.123   8.247  23.659  0.00  78.43
ATOM    8  OE1  GLU   4  29.375   8.299  23.604  0.00  79.32
ATOM    9  OE2  GLU   4  27.393   9.251  23.468  0.00  79.58
ATOM   10  N    ARG   5  25.274   4.610  22.852  1.00  63.77
ATOM   11  CA   ARG   5  24.179   4.807  21.907  1.00  59.83
ATOM   12  C    ARG   5  23.411   3.698  21.219  1.00  56.20
ATOM   13  O    ARG   5  23.987   2.808  20.596  1.00  57.33
ATOM   14  CB   ARG   5  24.604   5.784  20.812  1.00  60.86
ATOM   15  CG   ARG   5  23.926   7.127  20.866  1.00  61.89
ATOM   16  CD   ARG   5  24.295   7.944  19.647  1.00  62.21

Is your structure correct?
• How unusual is the structure geometry?
• Does it contain rare conformations?
• Does it make chemical sense?
http://pdb.rutgers.edu/validate/

Backbone geometry
[Figure: backbone geometry] http://pdb.rutgers.edu/

Side chain geometry
[Figure: isoleucine side chain with the χ1 and χ2 angles labeled]
www.fccc.edu/research/labs/dunbrack/confanalysis.html

[Figure: PROCHECK "Residue properties" plot, panel a: absolute deviation from mean Chi-1 value (excl. Pro) along the sequence. The χ1 absolute deviation is measured from values determined for high-resolution X-ray structures; an unusual χ angle here might indicate an error in the structure.]
http://pdb.rutgers.edu/validate/

Laskowski, R A, M W MacArthur, D S Moss, and J M Thornton. "PROCHECK: A Program to Check The Stereochemical Quality of Protein Structures." J. Appl. Cryst. 26 (1993): 283-291.
Morris, A L, M W MacArthur, E G Hutchinson, and J M Thornton. "Stereochemical Quality of Protein Structure Coordinates." Proteins 12 (1992): 345-364.

Summary of Structure Assessment
problem                              diagnostic
structure is incomplete              PDB file header & coordinates, occupancies
residues are disordered              B-factors
model doesn’t match data             R value, Free R value
model has unusual stereochemistry    Ramachandran plots, side chain analysis

How are NMR structures solved?
1. Solution phase technique - protein at mM concentration in a buffer. Currently limited to proteins ≤ 30-50 kDa.
2. Measure resonant frequencies of 1H, 13C, 15N atoms in a magnetic field.
3. Assign peaks observed in the spectrum to individual amino acids.
4. Measure distances between different residues < 6 Å apart to get restraints. Need many restraints per residue.
5. Build structures consistent with the experimental distance restraints and principles of stereochemistry.
6. Yields a set of structures consistent with the data.
• Please refer to http://public-1.cryst.bbk.ac.uk/PPS2/projects/schirra/html/home.htm for an NMR Tutorial.

Types of restraints available from NMR experiments
1. NOEs give rough distances between assigned atoms - given as upper and lower bounds.
2. COSY spectra and J-couplings give dihedral angle restraints
Also have constraints from what you know about the protein:
1. Connectivity due to known aa geometry & sequence
2. Standard bond lengths and angles

Building a structure from NMR data I: Distance Geometry
Given: a set of labeled distance constraints
1. Bounds smoothing using the triangle inequality, given upper bounds u and lower bounds l (e.g. from NOEs and bond lengths):
   if $u_{ij} > u_{ik} + u_{kj}$ then set $u_{ij}$ to $u_{ik} + u_{kj}$
   [Figure: triangle of atoms i, j and k]
2. Specific distances $d_{ij}$ that are compatible with the bounds and the triangle inequalities are chosen (metrization).
3. “Embedding” is used to compute a 3D model from the distances - often the distances are not all compatible with a 3D model but instead with a higher-dimensional one. In this case it is necessary to project into three dimensions (-> error).
4. Initial models contain many errors that must be iteratively corrected by refinement.

Building a structure from NMR data II: Simulated Annealing
$U(R) = E_{empirical} + E_{effective}$
$E_{effective} = E_{NOE} + E_{torsion}$ (derived from NMR experiment)
$E_{empirical} = E_{bond} + E_{angle} + E_{dihedral} + E_{vdW} + E_{elec}$ (as previously discussed)
$E_{NOE} = c \sum |r_{ij} - r'_{ij}|^2$, with $c = kTS / 2\Delta^2$
where $\Delta$ is an error estimate on the experimental constraint $r'_{ij}$
S is chosen to balance the effective energy with the empirical energy

Assessing NMR structure quality
1. Number of restraints used: want ~10-20 per residue
2. Number of restraint violations
3. RMS deviation from restraints
4. RMS differences between models: want main chain atom rmsd < 0.4 Å, side chain < 1.0 Å
5. Stereochemical quality, e.g. use the validation server at the PDB to check for bad backbone and side chain torsions

Methods for Protein Structure Prediction
Homology Modeling
Threading
Ab Initio Prediction

Studying protein structure … without a structure
Comparative modeling - inferring the structure of a protein from a homolog
Fold recognition - an easier problem than fold prediction!
Ab initio prediction - prediction of structure from sequence

Translating structure between members of the same family - Homology Modeling
• Identify a protein with similar sequence for which a structure has been solved (the template)
• Align the target sequence with the template
• Use the alignment to build an approximate structure for the target
• Fill in any missing pieces
• Fine-tune the structure
• Evaluate success
An excellent review: Marti-Renom et al. Annu. Rev. Biophys. Biomol. Struct. 29 (2000): 291-325.

Identifying a good template
• By sequence similarity
  – Use FASTA, BLAST, PSI-BLAST or threading
  – Best performance from high sequence identity, but can try distant homologues and assess performance later
• The closer the evolutionary relationship, the better
  – Consider a phylogenetic tree
• Generally better to have many templates to use as models
• Consider the structure quality (R, resolution, average B)
• Consider particulars of the structure
  – Quaternary structure
  – Any ligands bound?
  – pH
• The probability of finding a template is ~20-50%

You have cloned a new Pombe gene that is a putative protein kinase
Blast against PDB, hit = 1DM2
Score = 250 bits (638), Expect = 6e-67
Identities = 136/302 (45%), Positives = 185/302 (61%), Gaps = 17/302 (5%)

Query: 71  IDDYEILEKIEEGSYGIVYRGLDKSTNTLVALKKIKFDPNGIGFPITSLREIESLSSIRH 130
           +++++ +EKI EG+YG+VY+  +K T  +VALKKI+ D    G P T++REI  L  + H
Sbjct: 1   MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNH 60

Query: 131 DNIVELEKVVVGKDLKDVYLVMEFMEHDLKTLLD-----NMPEDFLQSEVKTLMLQLLAA 185
            NIV+L  V+  ++   +YLV EF+  DLK  +D      +P       +K+ + QLL
Sbjct: 61  PNIVKLLDVIHTEN--KLYLVFEFLHQDLKKFMDASALTGIPLPL----IKSYLFQLLQG 114

Query: 186 TAFMHHHWYLHRDLKPSNLLMNNTGEIKLADFGLARPVSEPKSSLTRLVVTLWYRAPELL 245
            AF H H  LHRDLKP NLL+N  G IKLADFGLAR    P  + T  VVTLWYRAPE+L
Sbjct: 115 LAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEIL 174

Query: 246 LGAPSYGKEIDMWSIGCIFAEMITRTPLFSGKSELDQLYKIFNLLGYPTREEWPQYFLLP 305
           LG   Y   +D+WS+GCIFAEM+TR  LF G SE+DQL++IF  LG P    WP    +P
Sbjct: 175 LGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMP 234

Query: 306 YANKIKHPTVPTHSKIRTS--IPNLTGNAYDLLNRLLSLNPAKRISAKEALEHPYFYESP 363
                  P+ P       S  +P L  +   LL+++L  +P KRISAK AL HP+F +
Sbjct: 235 DYK----PSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVT 290

Query: 364 RP 365
           +P
Sbjct: 291 KP 292

Aligning the target to the template sequences
• A GOOD ALIGNMENT IS ABSOLUTELY ESSENTIAL
• For > 40% sequence identity the alignment is usually clear
• For < 40% sequence identity usually have to deal with gaps
  OBSERVATION: at 30% sequence identity only 20% of residues are correctly aligned!
• How could you try to improve the alignments over those provided by BLAST?
• Try to use structural information
  OBSERVATION: most insertions/deletions occur in loops, not in secondary structure elements
  – Do a structure-based sequence alignment of all possible templates (e.g. with DALI)
  – Add the target sequence to the alignment, using its predicted secondary structure to choose gap placement
  – Do the alignment over the known extent of a single protein domain in the template

To improve the alignment: check secondary structure of 1DM2 (given in the pdb entry)
  1 MENFQKVEKI GEGTYGVVYK ARNKLTGEVV ALKKIRLDTE TEGVPSTAIR
    EEE EE B SSSEEEE EEETTT EE EEEE HHHH
 51 EISLLKELNH PNIVKLLDVI HTENKLYLVF EFLHQDLKKF MDASALTGIP
    HTTTTTT TTB B EEE EETTEEEEEE E SEEHHHH HHTTTTT
101 LPLIKSYLFQ LLQGLAFCHS HRVLHRDLKP QNLLINTEGA IKLADFGLAR
    HHHHHHHHHH HHHHHHHHHH TT S G GGEEE TTS EEE
151 AFGVPVRTYT HEVVTLWYRA PEILLGCKYY STAVDIWSLG CIFAEMVTRR
    TT HHHHTT SS THHHHHHHH HHHHHHHHSS
201 ALFPGDSEID QLFRIFRTLG TPDEVVWPGV TSMPDYKPSF PKWARQDFSK
    SS SSHHH HHHHHHHHH TTTSTTG GGTTTTTTTS GGG
251 VVPPLDEDGR SLLSQMLHYD PNKRISAKAA LAHPFFQDVT KPVPHLRL
    TTTT HHHH HHHHHHS SS TTTS HHHH TTTGGGTT
Compare to the PREDICTED secondary structure of the target (from PHD, PREDATOR, JPRED, etc.)

Build a model from the alignment - I
• Construct a backbone framework
  – If you have only one template (model), copy its backbone coordinates for the aligned part of the target
  – If you have multiple templates, average the Cα positions, then fit a backbone trace to those positions by
    • using the template with highest sequence identity at each site OR
    • selecting a hexapeptide from a database that fits

Build the model - II
• Add the side chains
  – For positions with identical sequence, copy the template structure
  – For positions with different sequence, select the side chain placement from a list of commonly-observed conformers (known as “rotamers”)
  – Side chain positions may need to be iteratively refined so as to be consistent (more on this later!)

Build the model - III
• Build in the loops
  – Often the target differs from the templates in the loop region
  – Local sequence doesn’t uniquely determine loop structure
  – Often loops contain important functional residues!
  – Loops can be built two ways
    • Using a database of loop structures found in the pdb: match the “stem” of the loop with a known segment, then transfer the coordinates to the target structure (“knowledge based” approach)
    • Do a conformational search using a molecular mechanics energy function (physics based approach)
  – These methods work reasonably well for short loops (4-5 residues) and for specialized classes of loops (e.g. IgG hypervariable regions)

Refine the model
• The model as built in steps I - III may have poor stereochemistry (e.g. clashes)
• Can improve severe local errors through molecular mechanics minimization
  OBSERVATION: EXTENSIVE MINIMIZATION GIVES WORSE MODELS
• At this point side chain conformations can be adjusted to be consistent with the entire model

Optimization using constraints
• A. Sali’s MODELLER, G. Montelione’s HOMA
• Uses the template to generate constraints
  – Atom distances, dihedral angles
• Uses molecular mechanics to introduce other constraints
  – Bond lengths, angles, dihedrals, non-bond terms
• Combine constraints into an objective function
• Minimize in Cartesian space
• Advantages: combines model building & refinement, can incorporate many types of data (e.g. NMR constraints)
Sali, A, and TL Blundell. "Comparative Protein Modelling by Satisfaction of Spatial Restraints." J Mol Biol. 234, no. 3 (5 December 1993): 779-815.

There are many places to go wrong…
• Bad template - it doesn’t have the same structure as the target after all
• Bad alignment (a very common problem)
• Good alignment to good template still gives wrong local structure
• Bad loop construction
• Bad side chain positioning

Pitfalls in comparative modeling
[Figure] Marti-Renom et al. Annu. Rev. Biophys. Biomol. Struct. 29 (2000): 291-325. Courtesy of Annual Reviews Nonprofit Publisher of the Annual Review of TM Series. Used with permission.

How do you know if you can trust your model?

Model Assessment
• The sequence identity between target and template
• Structural tests similar to those used for new crystal structures
  – backbone & side chain conformations, H-bonding
• Is the structure “protein-like”?
  – does it have good H/P patterning?
• Does it score better than alternate models according to some energy function?
  $Z\ \text{score} = (S - \langle S \rangle) / \sigma$
[Figure] Marti-Renom et al. Annu. Rev. Biophys. Biomol. Struct. 29 (2000): 291-325.

[Figure] Marti-Renom et al. Annu. Rev. Biophys. Biomol. Struct. 29 (2000): 291-325.
These numbers are from an entirely automated process - can do somewhat better with manual intervention
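
The Z score from the Model Assessment slide can be illustrated with a short calculation. This is a minimal sketch, assuming S is the score of the candidate model under some energy function and that <S> and σ are the mean and standard deviation of the scores of the alternate models. The function name `z_score` and the example energies are hypothetical, not taken from the lecture. The slides do not say whether σ is the population or the sample standard deviation; the sketch assumes the population form.

```python
import statistics


def z_score(model_score, alternate_scores):
    """Return Z = (S - <S>) / sigma for one model against alternate models."""
    mean = statistics.mean(alternate_scores)
    # Population standard deviation (an assumption; the source does not specify).
    sigma = statistics.pstdev(alternate_scores)
    if sigma == 0:
        raise ValueError("alternate scores have zero spread; Z is undefined")
    return (model_score - mean) / sigma


# Hypothetical energies in arbitrary units; lower energy is taken to be better.
alternates = [-90.0, -100.0, -110.0, -95.0, -105.0]
candidate = -130.0

z = z_score(candidate, alternates)
print(f"Z = {z:.2f}")  # prints "Z = -4.24"
```

With an energy-like score, a strongly negative Z means the candidate scores well below the typical alternate model. For a score where higher is better, the sign convention flips.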