Solving structures using X-ray crystallography & NMR spectroscopy
7.91 Amy Keating
How are X-ray crystal structures determined?
1. Grow crystals - structure determination by X-ray crystallography
relies on the repeating structure of a crystalline lattice.
2. Collect a diffraction pattern - periodically spaced atoms in the
crystal give specific “spots” where X-rays interfere constructively.
3. Carry out a Fourier transform to get from “reciprocal space” to a
real space description of the electron density.
NOTE: THIS STEP REQUIRES KNOWLEDGE OF THE PHASES OF THE
INTERFERING WAVES, WHICH CAN’T BE DIRECTLY MEASURED:
“THE PHASE PROBLEM”
4. Build a preliminary model of the protein into the envelope of
electron density that results from the experiment.
5. Refine the structure through an iterative process of changing the
model and comparing how it fits the data.
The Phase Problem: we don’t know what phases to use to
add up all of the contributing waves. BIG PROBLEM.
$F_{hkl} = |F_{hkl}| \exp(i\alpha_{hkl}) = \sum_j f_j \exp[2\pi i (h x_j + k y_j + l z_j)]$
$|F_{hkl}|$: observable amplitude
$f_j$: atomic scattering factor, related to electron density around atom j
$\alpha_{hkl}$: the phase of F, determined by the x, y and z coordinates of the atoms
What we observe is $I_{hkl} \propto |F_{hkl}|^2$
we can’t measure the phases directly
Get phases from molecular replacement, or heavy atom methods
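A minimal sketch of why the lost phase matters, assuming the standard structure-factor sum over atoms at fractional coordinates. The two-atom toy cell, the scattering factors and the function name are hypothetical, chosen only for illustration.

```python
import cmath
import math

def structure_factor(hkl, atoms):
    """Sum f_j * exp(2*pi*i*(h*x_j + k*y_j + l*z_j)) over all atoms.

    atoms: list of (f_j, (x, y, z)) with fractional coordinates (toy input).
    """
    h, k, l = hkl
    return sum(f * cmath.exp(2j * math.pi * (h * x + k * y + l * z))
               for f, (x, y, z) in atoms)

# Hypothetical two-atom cell, and the same atoms shifted by 0.25 along x.
atoms = [(6.0, (0.10, 0.20, 0.30)), (8.0, (0.40, 0.15, 0.70))]
shifted = [(f, (x + 0.25, y, z)) for f, (x, y, z) in atoms]

F1 = structure_factor((1, 2, 0), atoms)
F2 = structure_factor((1, 2, 0), shifted)

# The measurable intensity |F|^2 is the same for both, but the phases differ.
print(abs(F1) ** 2, abs(F2) ** 2)
print(cmath.phase(F1), cmath.phase(F2))
```

The intensities agree while the phases differ by a quarter turn, so the measured spots alone cannot supply the phases needed for the Fourier transform.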
X-Ray Crystal Structure Refinement
The data: actual intensities of spots, $F_{obs}(h,k,l)$
The model: computed intensities of spots, $F_{calc}(h,k,l)$
$U_{X\text{-ray expt}} = \sum_{h,k,l} [F_{obs}(h,k,l) - F_{calc}(h,k,l)]^2$
The summation runs over the spots observed in the experiment: $F_{obs}$ is the actual intensity of a spot, $F_{calc}$ is the intensity of that spot calculated from the trial structure.
$U_{hybrid} = U_{Molec\ Model} + s\,U_{X\text{-ray expt}}$
• Simulated annealing on the hybrid potential rapidly improves
correspondence between structure and X-ray observations while
maintaining reasonable chemistry (large radius of convergence)
• The previous method effectively used local minimization, which became
trapped in local minima (small radius of convergence)
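The hybrid potential can be sketched as follows. This is an illustration only: the function names, the toy spot values and the weight s = 0.5 are assumptions, and the molecular-model energy is passed in as a plain number rather than computed from a force field.

```python
def u_xray(f_obs, f_calc):
    """Sum of squared differences between observed and calculated spot values."""
    return sum((fo - fc) ** 2 for fo, fc in zip(f_obs, f_calc))

def u_hybrid(u_model, f_obs, f_calc, s):
    """Hybrid potential: molecular-model energy plus s times the X-ray term."""
    return u_model + s * u_xray(f_obs, f_calc)

# Hypothetical values for three spots.
f_obs = [10.0, 4.0, 7.5]
f_calc = [9.0, 5.0, 7.5]

print(u_xray(f_obs, f_calc))                 # 2.0
print(u_hybrid(12.0, f_obs, f_calc, s=0.5))  # 13.0
```

The weight s sets how strongly the experimental term pulls on the structure relative to the chemistry term.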
The Free R factor
current model → model-derived amplitudes $|F_{calc}|$
90% of X-ray amplitudes: $R = \sum ||F_{obs}| - |F_{calc}|| / \sum |F_{obs}|$ → change model
10% of X-ray amplitudes: $R_{free} = \sum ||F_{obs}| - |F_{calc}|| / \sum |F_{obs}|$ → assess model
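A minimal sketch of the R and free R calculation, assuming a random 90%/10% split of the reflections. The amplitudes, the seed and the helper names are hypothetical; in practice the free set is chosen once and kept out of refinement throughout.

```python
import random

def r_factor(f_obs, f_calc):
    """R = sum(| |F_obs| - |F_calc| |) / sum(|F_obs|)."""
    num = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    den = sum(abs(fo) for fo in f_obs)
    return num / den

def split_work_free(n_reflections, free_fraction=0.10, seed=0):
    """Randomly set aside a fraction of reflection indices as the free set."""
    indices = list(range(n_reflections))
    random.Random(seed).shuffle(indices)
    n_free = max(1, round(free_fraction * n_reflections))
    return indices[n_free:], indices[:n_free]  # (work set, free set)

# Hypothetical observed and model-derived amplitudes for ten reflections.
f_obs = [10.0, 4.0, 7.5, 3.0, 12.0, 6.0, 8.0, 5.0, 9.0, 2.0]
f_calc = [9.0, 5.0, 7.5, 3.5, 11.0, 6.5, 8.0, 4.0, 9.5, 2.5]

work, free = split_work_free(len(f_obs))
r_work = r_factor([f_obs[i] for i in work], [f_calc[i] for i in work])
r_free = r_factor([f_obs[i] for i in free], [f_calc[i] for i in free])
print(r_work, r_free)
```

Only the work set guides changes to the model, so the free set gives a check that is not biased by the refinement itself.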
What parameters do you refine?
• Atomic coordinates X, Y, Z
• The temperature factor of each atom, B
• Can also refine the occupancy
$B = 8\pi^2 \langle u^2 \rangle$
$\langle u^2 \rangle$ = mean square atomic displacement
B results from atomic vibrations and disorder
units = Å$^2$
Example:
B = 20 --> 0.5 Å displacement
B = 80 --> 1 Å displacement
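The two example values can be checked by inverting B = 8π²⟨u²⟩. The function name below is hypothetical.

```python
import math

def rms_displacement(b_factor):
    """Return sqrt(<u^2>) in Å for a B-factor in Å^2, using B = 8*pi^2*<u^2>."""
    return math.sqrt(b_factor / (8 * math.pi ** 2))

for b in (20, 80):
    print(b, round(rms_displacement(b), 2))
```

Because the displacement scales with the square root of B, a fourfold increase in B only doubles the displacement.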
Atomic coordinates in the PDB file
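Columns: record type, atom serial number, atom name, residue name, residue number, X, Y, Z, occupancy (occ), B-factor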
ATOM 1 N GLU 4 28.492 3.212 23.465 1.00 70.88
ATOM 2 CA GLU 4 27.552 4.354 23.629 1.00 69.99
ATOM 3 C GLU 4 26.545 4.432 22.489 0.00 67.56
ATOM 4 O GLU 4 26.915 4.250 21.328 0.00 68.09
ATOM 5 CB GLU 4 28.326 5.683 23.680 0.00 72.34
ATOM 6 CG GLU 4 27.447 6.910 23.973 0.00 75.98
ATOM 7 CD GLU 4 28.123 8.247 23.659 0.00 78.43
ATOM 8 OE1 GLU 4 29.375 8.299 23.604 0.00 79.32
ATOM 9 OE2 GLU 4 27.393 9.251 23.468 0.00 79.58
ATOM 10 N ARG 5 25.274 4.610 22.852 1.00 63.77
ATOM 11 CA ARG 5 24.179 4.807 21.907 1.00 59.83
ATOM 12 C ARG 5 23.411 3.698 21.219 1.00 56.20
ATOM 13 O ARG 5 23.987 2.808 20.596 1.00 57.33
ATOM 14 CB ARG 5 24.604 5.784 20.812 1.00 60.86
ATOM 15 CG ARG 5 23.926 7.127 20.866 1.00 61.89
ATOM 16 CD ARG 5 24.295 7.944 19.647 1.00 62.21
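A minimal sketch of reading records laid out like the ones above, assuming whitespace-separated fields in exactly this order. Real PDB files use fixed column positions and carry extra fields such as a chain identifier, so this simplified parser and its names are illustrative only.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    res_num: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float

def parse_atom_line(line):
    """Parse one whitespace-separated ATOM record in the simplified layout above."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "ATOM":
        raise ValueError(f"unexpected record: {line!r}")
    return Atom(int(fields[1]), fields[2], fields[3], int(fields[4]),
                *map(float, fields[5:10]))

records = """\
ATOM 1 N GLU 4 28.492 3.212 23.465 1.00 70.88
ATOM 3 C GLU 4 26.545 4.432 22.489 0.00 67.56
ATOM 11 CA ARG 5 24.179 4.807 21.907 1.00 59.83
"""

atoms = [parse_atom_line(line) for line in records.splitlines()]
zero_occupancy = [a.serial for a in atoms if a.occupancy == 0.0]
print(zero_occupancy)  # [3]
```

Filtering on the occupancy and B-factor columns in this way is a quick first look at which atoms are poorly defined.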
Is your structure correct?
• How unusual is the structure geometry?
• Does it contain rare conformations?
• Does it make chemical sense?
http://pdb.rutgers.edu/validate/
Backbone geometry
http://pdb.rutgers.edu/
Side chain geometry
[Figure: isoleucine side chain showing the χ1 and χ2 dihedral angles]
χ angle here might indicate error in structure
www.fccc.edu/research/labs/dunbrack/confanalysis.html
http://pdb.rutgers.edu/validate/
PROCHECK
[Figure: PROCHECK "Residue properties" plot for "new-entry". Panel a: absolute deviation from mean Chi-1 value (excl. Pro), plotted against sequence position]
χ1 absolute deviation from values determined for high-resolution
X-ray structures
Laskowski, R A, M W MacArthur, D S Moss, and J M Thornton. "PROCHECK: A Program to Check The Stereochemical
Quality of Protein Structures." J. Appl. Cryst. 26 (1993): 283-291.
Morris, A L, M W MacArthur, E G Hutchinson, and J M Thornton. "Stereochemical Quality of Protein Structure
Coordinates." Proteins 12 (1992): 345-364.
Summary of Structure Assessment
problem                              diagnostic
structure is incomplete              PDB file header & coordinates, occupancies
residues are disordered              B-factors
model doesn’t match data             R value, Free R value
model has unusual stereochemistry    Ramachandran plots, side chain analysis
How are NMR structures solved?
1. Solution phase technique - protein at mM concentration in
a buffer. Currently limited to proteins ≤ 30-50 kDa.
2. Measure resonant frequencies of ¹H, ¹³C, ¹⁵N atoms in a
magnetic field.
3. Assign peaks observed in the spectrum to individual amino
acids.
4. Measure distances between different residues < 6 Å apart
to get restraints. Need many restraints per residue.
5. Build structures consistent with the experimental distance
restraints and principles of stereochemistry.
6. Yields a set of structures consistent with the data.
• Please refer to http://public-1.cryst.bbk.ac.uk/PPS2/projects/schirra/html/home.htm
for an NMR Tutorial.
Types of restraints available from NMR experiments
1. NOEs give rough distances between assigned atoms - given
as upper and lower bounds.
2. COSY spectra and J-couplings give dihedral angle restraints
Also have constraints from what you know about the protein:
1. Connectivity due to known aa geometry & sequence
2. Standard bond lengths and angles
Building a structure from NMR data I: Distance Geometry
Given: a set of labeled distance constraints
1. Bounds smoothing using the triangle inequality
given upper bounds u and lower bounds l (e.g. from NOEs
and bond lengths), for any three atoms i, j and k:
if $u_{ij} > u_{ik} + u_{kj}$ then set $u_{ij}$ to $u_{ik} + u_{kj}$
2. Specific distances $d_{ij}$ that are compatible with the bounds
and the triangle inequalities are chosen (metrization).
3. “Embedding” is used to compute a 3D model from
the distances - often the distances are not all compatible
with a 3D model but instead with a higher-dimensional one.
In this case it is necessary to project into three dimensions (-> error).
4. Initial models contain many errors that must be iteratively
corrected by refinement.
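Step 1 can be sketched as below for the upper bounds only. The loop ordering is the one used for all-pairs shortest paths, a common way to apply the triangle inequality to every triple; the three-atom matrix and the function name are hypothetical, and lower-bound smoothing is left out.

```python
def smooth_upper_bounds(u):
    """Tighten upper bounds in place so that u[i][j] <= u[i][k] + u[k][j].

    u is a symmetric square matrix (list of lists) of upper distance bounds.
    """
    n = len(u)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if u[i][j] > u[i][k] + u[k][j]:
                    u[i][j] = u[i][k] + u[k][j]
    return u

INF = float("inf")  # no experimental upper bound for this pair

# Hypothetical 3-atom example: bounds are known for pairs (0,1) and (1,2) only.
upper = [
    [0.0, 1.5, INF],
    [1.5, 0.0, 2.5],
    [INF, 2.5, 0.0],
]

smooth_upper_bounds(upper)
print(upper[0][2])  # 4.0
```

The pair with no measured restraint inherits an upper bound from the two pairs that do have one, which is how sparse NOE data constrain the whole molecule.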
Building a structure from NMR data II:
Simulated Annealing
$U(R) = E_{empirical} + E_{effective}$
$E_{effective} = E_{NOE} + E_{torsion}$ (derived from NMR experiment)
$E_{empirical} = E_{bond} + E_{angle} + E_{dihedral} + E_{vdW} + E_{elec}$ (as previously discussed)
$E_{NOE} = c \sum |r_{ij} - r'_{ij}|^2$, with $c = kTS / 2\Delta^2$
where $\Delta$ is an error estimate on the experimental constraint $r'_{ij}$
S is chosen to balance the effective energy with the empirical energy
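A minimal sketch of the NOE term, assuming a harmonic penalty around each experimental distance with the weight kTS/(2·err²) evaluated per restraint. The coordinates, the restraint tuple layout, the units (kcal/mol, Å, K) and the function name are assumptions for illustration.

```python
import math

K_B = 0.0019872  # Boltzmann constant in kcal/(mol K)

def noe_energy(coords, restraints, temperature=300.0, scale=1.0):
    """Sum c * (r_ij - r_target)^2 over restraints, with c = k*T*S / (2*err^2).

    coords: list of (x, y, z) positions in Å.
    restraints: list of (i, j, r_target, err) with distances in Å.
    scale: the factor S that balances this term against the empirical energy.
    """
    total = 0.0
    for i, j, r_target, err in restraints:
        r_ij = math.dist(coords[i], coords[j])
        c = K_B * temperature * scale / (2.0 * err ** 2)
        total += c * (r_ij - r_target) ** 2
    return total

# Hypothetical two-atom example: the atoms are 5 Å apart, the restraint says 4 Å.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
violated = [(0, 1, 4.0, 0.5)]
satisfied = [(0, 1, 5.0, 0.5)]

print(noe_energy(coords, violated))
print(noe_energy(coords, satisfied))
```

A restraint with a small error estimate gets a large weight, so tightly determined distances dominate the effective energy.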
Assessing NMR structure quality
1. Number of restraints used
want ~10-20 per residue
2. Number of restraint violations
3. RMS deviation from restraints
4. RMS differences between models
want main chain atom rmsd < 0.4 Å, side chain < 1.0 Å
5. Stereochemical quality
e.g. use the validation server at the PDB to
check for bad backbone and side chain torsions
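Item 4 can be sketched as an average pairwise RMSD over the set of models. This sketch assumes the models are already superimposed and list the same atoms in the same order; the toy coordinates and function names are hypothetical.

```python
import math

def rmsd(model_a, model_b):
    """RMSD in Å between two equal-length coordinate lists (no superposition)."""
    if len(model_a) != len(model_b):
        raise ValueError("models must contain the same atoms")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(model_a, model_b))
    return math.sqrt(squared / len(model_a))

def mean_pairwise_rmsd(models):
    """Average RMSD over all distinct pairs of models."""
    pairs = [(a, b) for idx, a in enumerate(models) for b in models[idx + 1:]]
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)

# Hypothetical ensemble: three models of the same two atoms.
models = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.3), (1.0, 0.0, 0.3)],
    [(0.0, 0.4, 0.0), (1.0, 0.4, 0.0)],
]

print(mean_pairwise_rmsd(models))
```

A tight ensemble shows that the restraints define the structure precisely, which is not the same as showing that it is accurate.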
Methods for Protein Structure Prediction
Homology Modeling
Threading
Ab Initio Prediction
Studying protein structure
… without a structure
Comparative modeling - inferring the structure
of a protein from a homolog
Fold recognition - an easier problem than fold
prediction!
Ab initio prediction - prediction of structure from
sequence
Translating structure between members of the
same family - Homology Modeling
• Identify a protein with similar sequence for which a structure
has been solved (the template)
• Align the target sequence with the template
• Use the alignment to build an approximate structure for the
target
• Fill in any missing pieces
• Fine-tune the structure
• Evaluate success
An excellent review:
Marti-Renom et al. Annu. Rev. Biophys. Biomol. Struct. 29 (2000): 291-325.
Identifying a good template
• By sequence similarity
– Use FASTA, BLAST, PSI-BLAST or threading
– Best performance from high sequence identity, but can
try distant homologues and assess performance later
• The closer the evolutionary relationship, the better
– Consider a phylogenetic tree
• Generally better to have many templates to use as
models
• Consider the structure quality (R, resolution, average B)
• Consider particulars of the structure
– Quaternary structure
– Any ligands bound?
– pH
• The probability of finding a template is ~20-50%
You have cloned a new Pombe gene that is a putative protein kinase
Blast against PDB, hit = 1DM2
Score = 250 bits(638), Expect= 6e-67
Identities = 136/302 (45%), Positives = 185/302 (61%), Gaps = 17/302(5%)
Query: 71  IDDYEILEKIEEGSYGIVYRGLDKSTNTLVALKKIKFDPNGIGFPITSLREIESLSSIRH 130
           +++++ +EKI EG+YG+VY+  +K T  +VALKKI+ D    G P T++REI  L  + H
Sbjct: 1   MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNH 60

Query: 131 DNIVELEKVVVGKDLKDVYLVMEFMEHDLKTLLD-----NMPEDFLQSEVKTLMLQLLAA 185
            NIV+L  V+  ++   +YLV EF+  DLK  +D      +P       +K+ + QLL
Sbjct: 61  PNIVKLLDVIHTEN--KLYLVFEFLHQDLKKFMDASALTGIPLPL----IKSYLFQLLQG 114

Query: 186 TAFMHHHWYLHRDLKPSNLLMNNTGEIKLADFGLARPVSEPKSSLTRLVVTLWYRAPELL 245
            AF H H  LHRDLKP NLL+N  G IKLADFGLAR    P  + T  VVTLWYRAPE+L
Sbjct: 115 LAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEIL 174

Query: 246 LGAPSYGKEIDMWSIGCIFAEMITRTPLFSGKSELDQLYKIFNLLGYPTREEWPQYFLLP 305
           LG   Y   +D+WS+GCIFAEM+TR  LF G SE+DQL++IF  LG P    WP    +P
Sbjct: 175 LGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMP 234

Query: 306 YANKIKHPTVPTHSKIRTS--IPNLTGNAYDLLNRLLSLNPAKRISAKEALEHPYFYESP 363
                  P+ P       S  +P L  +   LL+++L  +P KRISAK AL HP+F +
Sbjct: 235 DYK----PSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVT 290

Query: 364 RP 365
           +P
Sbjct: 291 KP 292
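The identity and gap counts in a BLAST header can be recomputed from the aligned rows. The sketch below is an illustration on the second block of the alignment above (Query 131-185 against Sbjct 61-114); the function name is hypothetical.

```python
def alignment_stats(query, subject):
    """Count identical columns and gap columns in two equal-length aligned rows."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    identities = sum(q == s and q != "-" for q, s in zip(query, subject))
    gaps = sum(q == "-" or s == "-" for q, s in zip(query, subject))
    return identities, gaps, len(query)

# Second block of the alignment above.
query_row = "DNIVELEKVVVGKDLKDVYLVMEFMEHDLKTLLD-----NMPEDFLQSEVKTLMLQLLAA"
sbjct_row = "PNIVKLLDVIHTEN--KLYLVFEFLHQDLKKFMDASALTGIPLPL----IKSYLFQLLQG"

identities, gaps, length = alignment_stats(query_row, sbjct_row)
print(identities, gaps, length)

# Whole-alignment identity quoted in the BLAST header: 136/302.
print(round(100 * 136 / 302))  # 45
```

Counting block by block makes it easy to see that identity is not uniform along the sequence, and that gapped regions like this one are where the alignment is least certain.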
Aligning the target to the template sequences
• A GOOD ALIGNMENT IS ABSOLUTELY ESSENTIAL
• For > 40% sequence identity the alignment is usually
clear
• For < 40% sequence identity usually have to deal with
gaps
OBSERVATION: at 30% sequence identity only 20% of residues
are correctly aligned!
• How could you try to improve the alignments over
those provided by BLAST?
• Try to use structural information
OBSERVATION: most insertions/deletions occur in loops, not
in secondary structure elements
– Do a structure-based sequence alignment of all possible
templates (e.g. with DALI)
– Add the target sequence to the alignment, using its
predicted secondary structure to choose gap placement
– Do the alignment over the known extent of a single protein
domain in the template
To improve the alignment: check secondary
structure of 1DM2 (given in the pdb entry)
1 MENFQKVEKI GEGTYGVVYK ARNKLTGEVV ALKKIRLDTE TEGVPSTAIR
EEE EE B SSSEEEE EEETTT EE EEEE HHHH
51 EISLLKELNH PNIVKLLDVI HTENKLYLVF EFLHQDLKKF MDASALTGIP
HTTTTTT TTB B EEE EETTEEEEEE E SEEHHHH HHTTTTT
101 LPLIKSYLFQ LLQGLAFCHS HRVLHRDLKP QNLLINTEGA IKLADFGLAR
HHHHHHHHHH HHHHHHHHHH TT S G GGEEE TTS EEE
151 AFGVPVRTYT HEVVTLWYRA PEILLGCKYY STAVDIWSLG CIFAEMVTRR
TT HHHHTT SS THHHHHHHH HHHHHHHHSS
201 ALFPGDSEID QLFRIFRTLG TPDEVVWPGV TSMPDYKPSF PKWARQDFSK
SS SSHHH HHHHHHHHH TTTSTTG GGTTTTTTTS GGG
251 VVPPLDEDGR SLLSQMLHYD PNKRISAKAA LAHPFFQDVT KPVPHLRL
TTTT HHHH HHHHHHS SS TTTS HHHH TTTGGGTT
Compare to the PREDICTED secondary structure
of the target
(from PHD, PREDATOR, JPRED, etc.)
Build a model from the alignment - I
• Construct a backbone framework
– If you have only one model, copy the backbone
coordinates for the aligned part of the target
– If you have multiple models, average the Cα positions,
then fit a backbone trace to those positions by
  • using the template with highest sequence identity at each
  site
  OR
  • selecting a hexapeptide from a database that fits
Build the model - II
• Add the side chains
– For positions with identical sequence, copy the template
structure
– For positions with different sequence select the side chain
placement from a list of commonly-observed conformers
(known as “rotamers”)
– Side chain positions may need to be iteratively refined so
as to be consistent (more on this later!)
Build the model - III
• Build in the loops
– Often the target differs from the templates in the loop
region
– Local sequence doesn’t uniquely determine loop structure
– Often loops contain important functional residues!
– Loops can be built two ways
  • Using a database of loop structures found in the PDB
  – Match the “stem” of the loop with a known
  segment, then transfer the coordinates to the
  target structure (“knowledge based” approach)
  • Do a conformational search using a molecular
  mechanics energy function (physics based approach)
– These methods work reasonably for short loops (4-5
residues) and for specialized classes of loops (e.g. IgG
hypervariable regions)
Refine the model
• The model as built in steps I - III may have poor
stereochemistry (e.g. clashes)
• Can improve severe local errors through molecular
mechanics minimization
OBSERVATION: EXTENSIVE MINIMIZATION GIVES
WORSE MODELS
• At this point side chain conformations can be adjusted
to be consistent with the entire model
Optimization using constraints
• A. Sali’s MODELLER, G. Montelione’s HOMA
• Uses the template to generate constraints
– Atom distances, dihedral angles
• Uses molecular mechanics to introduce other constraints
– Bond lengths, angles, dihedrals, non-bond terms
• Combine constraints into an objective function
• Minimize in Cartesian space
• Advantages: combines model building & refinement,
can incorporate many types of data (e.g. NMR
constraints)
Sali, A, and TL Blundell. "Comparative Protein Modelling by Satisfaction of Spatial Restraints." J Mol Biol.
234, no. 3 (5 December 1993): 779-815.
There are many places to go wrong…
• Bad template - it doesn’t have the same structure as
the target after all
• Bad alignment (a very common problem)
• Good alignment to good template still gives wrong local
structure
• Bad loop construction
• Bad side chain positioning
Pitfalls in comparative modeling
Marti-Renom et al. Annu. Rev. Biophys. Biomol. Struct. 29 (2000): 291-325.
Courtesy of Annual Reviews Nonprofit Publisher of the Annual Review of TM Series. Used with permission.
How do you know if you can trust your model?
Model Assessment
• The sequence identity between target and template
• Structural tests similar to those used for new crystal
structures
– backbone & side chain conformations, H-bonding
• Is the structure “protein-like”?
– does it have good H/P patterning?
• Does it score better than alternate models according to
some energy function?
Z score: $Z = (S - \langle S \rangle) / \sigma$
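A minimal sketch of the Z score comparison, assuming ⟨S⟩ and σ are the mean and population standard deviation of the scores of the alternate models. The energies below are hypothetical and the function name is illustrative.

```python
import statistics

def z_score(score, reference_scores):
    """Z = (S - <S>) / sigma, with <S> and sigma taken from reference_scores."""
    mean = statistics.mean(reference_scores)
    sigma = statistics.pstdev(reference_scores)  # population standard deviation
    return (score - mean) / sigma

# Hypothetical energies (lower is better) for alternate models of one sequence.
alternates = [-80.0, -95.0, -100.0, -105.0, -120.0]

print(z_score(-140.0, alternates))
```

With an energy-like score, a strongly negative Z means the model scores much better than the alternates, measured in units of their spread.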
Marti-Renom et al. Annu. Rev. Biophys. Biomol. Struct. 29 (2000): 291-325.
these numbers from an entirely automated process - can do somewhat better with manual intervention
Marti-Renom et al. Annu. Rev. Biophys. Biomol. Struct. 29 (2000): 291-325.