How do we use computational methods
to analyze, predict, or design
protein sequences and structures?
Theme:
Methods based on physics vs. methods
based on our accumulated empirical
knowledge of protein properties
7.91 Amy Keating
For a molecular simulation or model
you need:
1. A representation of the protein
2. An energy function
3. A search algorithm or optimizer
Covalent Potential Energy Terms
U
Covalent
= U
bond
+ U
angle bond
+ U
dihedral improper
+ U
torsion
2
U
bond
=
∑
1
k
b
(b ? b
0
)
bonds
2
2
U
angle bond
=
∑
1
k
θ
(θ ? θ )
angles bond
2
0
U
dihedral improper
=
∑
1
dihedrals improper
2
k
Φ
( Φ ? Φ
0
)
U
torsion
=
∑
1
k
φ
[1 + cos(nφ ? δ )]
torsions
2
Brooks et al., J. Comput. Chem. 4: 187-217 (1983)
2
Non-Covalent Potential Energy Terms
U
Non -covalent
= U
vdW
+ U
elec
Lennard-Jones potential
12
ij
ij
r
B
6
ij
ij
r
C
?
?
B
ij
U
vdW
=
∑
?
12
?
C
ij
?
?
inull j
?
r
ij
r
ij
6
?
“accurate”
approximate
q
i
q
j
U
elec
=
∑
Coulomb’s law
i null j
εr
ij
The potential energy surface is a 3N-6 dimensional space.
For a protein, we assume a single native-structure minimum.
There are many local minima, and some may be close in energy
to the global minimum.
E
n
e
r
g
y
X
Sampling the Potential Energy Surface
? Energy minimization
– “downhill” search, generally to nearest local minimum
– can be used to relax structures
– might be useful to define local changes due to mutation
? Normal mode analysis
– defines “characteristic motions”, which are distortions about
a local minimum structure
– orders motions “easy” (low frequency) to “hard” (high)
? Molecular dynamics
– movie of motion at given temperature (300 K)
– equivalent to statistical mechanical ensemble
? Monte Carlo/Simulated Annealing
– Describe properties of the landscape and thermodyanmic
parameters without simulating how the molecular actually
moves
Energy Minimization
Conformational Space, R
Potential Energy, U(R)
X-ray structure
X
? Iterative procedures;
terminate when reach
tolerance, such as small
(
F
i
= ?? r U )
i
gradient
? Poor initial structure leads
r
i+ 1
= r
i
+δ F
i
to poor local minimum
? Multiple minimum
problem
ONLY FINDS LOCAL MINIMA!
Uses of simple minimization
1. The “minimum perturbation approach” to modeling a mutation
– Assume structure of single-site mutant is close to known
wild-type structure
? Find stable conformations for mutant side chain in
context of wild-type protein
? Use energy minimization to relax candidate structures
(all degrees of freedom)
Shih, Brady, and Karplus, Proc. Natl. Acad. Sci. USA 82: 1697–1700
(1985); hemagglutinin Gly to Asp mutation modeled accurately
2. Relieving strain before analyzing the energy of an
experimental or predicted structure
3. Structure building and refinement when solving structures
using X-ray crystallography or NMR
Normal Mode Analysis
? Characteristic Motions and their Relative Ease
? Thermodynamic Properties
Mathematical Approximation: Series of Independent Harmonic Oscillators
0
?
2
U
( ( ( ? ?R U ) = R U
0
) ? + R U
0
)( R R
0
) +
1
2
∑∑
? ?
( r r
i 0 ,
)(r
j
? r
j 0 ,
) + null
i
j > i i
r r
ji
Corresponds to Local Expansion of Potential Surface as Parabolic
U(R )
actual
harmonic approximation
R
Normal Modes Locate “Easy” Deformations
U(R)
R
? Low-frequency, energetically easy motions will
dominate the dynamical behavior of macromolecules
Normal modes of proteins sometimes correspond
to biologically relevant motions
? “shearing” or “hinging” conformational changes are common in
enzymes
? Can compare structure changes computed from NMA with
alternate conformations observed experimentally. Frequently a
low-frequency mode describes the change (but it may not be
the lowest energy mode)
? NMA is a way to get an idea about motion from a static
structure
Molecular Dynamics Simulations
? Simulate motions of molecules as a function of time
– Collect short “movie” of protein “swimming” in solution
? Can use the simulation to compute:
– Average structure (at given temperature, T)
– Atomic fluctuations (at T)
– Thermodynamic quantities (temperature dependent)
m
?
2
x
i
F
i
F i
= m
i
a
i
a
i
= =?
?
i
U
i
?t
2
m
i
null null
r (t +δt) = r(t) +δ
null
(tv t +δt /2)
null
v(t +δt /2) =
null
v (t ?δt /2) +δt
null
a(t)
Verlet “leap frog” algorithm
Running a Molecular Dynamics Simulation
? Minimize initial coordinates
? Assign random starting velocities from Maxwell-Boltzmann
distribution
1/2
2
?
p(v
i
)= ?
?
m
i
?
?
exp
?
?
?
?m
i
v
i
?
?
?
2πkT
?
?
2kT
?
?Use small time step to integrate Newton’s equations of motion
(1 fs =1 x 10
-15
sec is typical)
? Equilibration (running dynamics until system equilibrates)
? Production dynamics (useful dynamics at a desired temp)
Sampling conformational space using Monte Carlo
Monte Carlo randomly samples configurations, rather than simulating
how the molecule would actually move in time.
1. Begin with a random (energy-minimized) conformation
2. Evaluate the energy = E
0
3. Make a random conformational change
(e.g. rotation around a bond)
4. Evaluate the new energy = E’
5. DECISION:
if E’ < E
0
ACCEPT the move, set E
0
= E’, go to 3
if E’ > E
0
AND if exp[-(E’-E
0
)/kT] > random #
then ACCEPT the move, set E
0
= E’, go to 3
else reject the move, go to 3
6. Proceed for an arbitrary amount of time
The acceptance criterion is such that the conformations generated
sample the Boltzman distribution.
What are Monte Carlo and Molecular Dynamics Useful For?
1. Exploring the energy landscape
What is the global minimum?
What are other minima?
What are the barriers between minima?
2. Computing thermodynamic quantities
3. “Observing” pathways connecting different states
What are the differences between MD and MC?
Molecular dynamics and Monte Carlo can be used for many of
the same purposes.
If you want to know how a system evolves in time, you must use MD.
MC can be better at sampling broad conformational spaces because it
makes random moves that can get you out of local minima.
A Dynamic Model for the Allosteric Mechanism of GroEL
MD simulation used to probe
the mechanism of conformational
change in the chaperonin
GroEL that occurs upon
binding ATP and GroES.
Ma, J, PB Sigler, Z Xu, and M Karplus. "A Dynamic Model for The Allosteric Mechanism of GroEL."
Please see figure 1a of
J Mol Biol. 302, no. 2 (15 September 2000): 303-13.
Shown are conformational changes
that occur upon binding ATP.
The endpoints of the calculation were
determined from experimental X-ray
structures, but the path in between isn’t
accessible experimentally.
The conformation change is too slow to
simulate normally, but “targeted MD” can
identify such trajectories.
The yellow intermediate domain
moves downwards first, triggering the
upwards movement of the green apical
domain.
Simulation shows details of interactions
responsible for both positive and negative
cooperativity.
Please see figures 2b and 5b of
Ma, J, PB Sigler, Z Xu, and M Karplus. "A Dynamic Model for The Allosteric Mechanism of GroEL."
J Mol Biol. 302, no. 2 (15 September 2000): 303-13.
MD simulations are now run on ENORMOUS systems!
Simulation of the aquaporin channel: permeable to water but not protons. The waters and the
lipid bilayer are both present in the calculation - 101,000 atoms total.
Please see figure 1 of
de Groot, Bert L., and Helmut Grubmüller. "Water Permeation Across Biological Membranes: Mechanism
and Dynamics of Aquaporin-1 and GlpF." Science
294 (14 December 2001): 2353-2357.
MD simulation of water permeation through aquaporin
Please see figure 2 of
de Groot, Bert L., and Helmut Grubmüller. "Water Permeation Across Biological Membranes: Mechanism
and Dynamics of Aquaporin-1 and GlpF." Science
294 (14 December 2001): 2353-2357.
A 10 ns simulation was sufficient to see water move in “real time” because the permeation rate
is 3x109 s-1.
Can look at the mechanism of selectivity and what determines the rate.
Can you fold a protein using Molecular Dynamics?
We don’t really know yet!
Simulate the folding of villin headpiece:
36 residues, with explicit solvent.
1 us simulation with 2 fs timestep, took, 512 Cray processor months!
Results:
The NMR structure is stable in the simulation.
After 60 ns the structure showed 50% helicity, in good agreement
with experimental data for other proteins
There was an initial “burst” phase giving helix formation and native
contacts
A “metastable intermediate” was found with secondary and tertiary
contacts similar to the native state.
Duan & Kollman Science (1998) 282, 740-744
Villin headpiece
A. unfolded
B. 980 ns partly folded
C. native
E. structure from stable cluster
Time evolution of:
A. Native helical content
B. Native contacts
C. Radius of gyration
D. Solvation free energy
Please see figures 1 and 2 of
Duan, Yong, and Peter A. Kollman. "Pathways to a Protein Folding Intermediate Observed in a
1-Microsecond Simulation in Aqueous Solution." Science 282 (23 October 1998): 740-744.
Molecular Modeling - a few final words
? Potential functions used for modeling are approximate (though
quite good for some purposes); calculations are not
replacement for experiment
? Ability to “dissect” computational results leads to insights
unavailable by other approaches
? The need to specify the problem precisely in order to build a
model is itself valuable
Solving structures using
X-ray crystallography
&
NMR spectroscopy
How are X-ray crystal structures determined?
1. Grow crystals - structure determination by X-ray crystallography
relies on the repeating structure of a crystalline lattice.
2. Collect a diffraction pattern - periodically spaced atoms in the
crystal give specific “spots” where X-rays interfere constructively.
3. Carry out a Fourier transform to get from “reciprocal space” to a
real space description of the electron density.
4. THIS STEP REQUIRES KNOWLEDGE OF THE PHASES OF THE
INTERFERING WAVES, WHICH CAN’T BE DIRECTLY MEASURED
“THE PHASE PROBLEM”
4. Build a preliminary model of the protein into the envelope of
electron density that results from the experiment.
5. Refine the structure through an iterative process of changing the
model and comparing how it fits the data.
Growing Protein Crystals
This is often the rate-limiting step in solving a structure!
hanging-drop vapor diffusion
1. choose a sequence that you think
folds into a well-defined structure
2. prepare highly-purified protein
3. concentrate to > 10 mg/ml
buffer, salts,
precipitant
Collecting Diffraction Data
What do the spots in the diffraction pattern mean?
Every spot represents constructive interference between diffraction
from a set of atoms with spacing satisfying Bragg’s Law:
nλ = 2dsinθ
So the INTENSITY of a spot at a given q tells you something
about how much electron density lies in a set of periodic planes
with spacing d.
NOTE: λ = 0.5 to 1.5 ?
each spot results from a
differently-oriented set of
planes with the appropriate
spacing
The electron density in the
protein is a superposition of all
of these periodic functions…
The diffraction pattern represents reciprocal space
large θ -> small d; HIGH resolution
small θ -> large d
LOW resolution
Image of diffraction data removed for reasons of copyright.
Fourier Transform
f(x) = ∫ F(s) exp(-i 2πxs) ds
F(s) = ∫ f(x) exp(i 2πxs) dx
This is the “diffraction pattern”
This is the “electron density”
Courtesy of Kevin Cowtan: http://www.ysbl.york.ac.uk/~cowtan/sfapplet/sfintro.html
The Phase Problem: we don’t know what phases to use to
add up all of the contributing waves. BIG PROBLEM.
| F
hkl
| exp(iα
hkl
) =
observable
amplitude
atomic scattering factor - related
the phase of F is determined by the
to electron density around atom j
x, y and z coordinates of the atoms
What we observe is I
hkl
α |F
khl
|
2
we can’t measure the phases directly
F
hkl
is called a structure factor
Courtesy of Kevin Cowtan: http://www.ysbl.york.ac.uk/~cowtan/sfapplet/sfintro.html
Courtesy of Kevin Cowtan: http://www.ysbl.york.ac.uk/~cowtan/sfapplet/sfintro.html
FT
FT
Contribution of intensities vs. phases to the Fourier Transform
duck intensities
+
cat phases
Courtesy of Kevin Cowtan: http://www.ysbl.york.ac.uk/~cowtan/sfapplet/sfintro.html
So, how do you get the phases?
I. Molecular Replacement
“Borrow” phases from a structur e that you think is similar.
e.g. duck inte nsi t ie s plus goose phases would give a reasonable
estimate of t h e real struct ure.
Beware “model bias” - n eed to run controls to make sure you haven’t
forced the data to assume the structure of the model.
How it works:
Take the model structure and translate/rotate it in the unit cell of the
crystal. Calculate the expected structure factors.
Monitor the R value.
If you find a significantly non-random R value, this might be a good
model. Try to refine it and see if the R
free
value improves.
So, how do you get the phases?
II. Heavy atom methods
If you can make two or more crystals of the same geometry that have heavy atoms
incorporated, then F
protein?heavy
= F
protein
+ F
heavy
and taking the difference of a diffraction
pattern with and without the heavy atoms will give you the diffraction pattern for just the
heavy atoms.
This structure can be solved using a “Patterson map”:
1
2
V
P(u ,v, w) =
∑∑∑
| F
hkl
| exp(?2πi(hu + kv + lw))
h k l
The Patterson map gives you all of the vectors between atoms. This doesn’t
help for a whole protein, but for a sufficiently small structure, you can get the entire structure
from this data (except the chirality).
Once you have the structures of the heavy atom derivatives, you can use this to derive the
phases for the original structure. (Trust me on this - we don’t have time to do the details.)
Map tracing usually requires knowledge of the sequence
2.55? map
X-Ray Crystal Structure Refinement
The model:
Computed
The data: Actual
intensities of spots
intensities of spots
F
obs
(h,k,l) ? F
calc
(h,k,l)
]
2
U
X -ray expt
=
∑[
Summation
h ,k ,l
Actual intensity of spot Intensity of spot calculated
runs over spots observed in expt from trial structure
U
hybrid
= U
Model Molec
+ sU
expt ray -X
? Simulated annealing on hybrid potential rapidly improves
correspondence between structure and X-ray observations while
maintaining reasonable chemistry (large radius of convergence)
? Previous method effectively used local minimization which became
trapped in local minima (small radius of convergence)
Structure refinement and the R factor
current
model
observed X-ray
amplitudes
R = Σ
all measurements
||F
obs
| - |F
calc
||/Σ|F
obs
|
model-derived amplitudes
make changes guided by observed data
and by stereochemical principles
The Free R factor
current
model
90% of X-ray
amplitudes
R = Σ||F
obs calc
||/Σ|F
obs
|
model-derived amplitudes
change model
10% of X-ray
amplitudes
R
free
= Σ||F
obs calc
||/Σ|F
obs
|
assess model
| - |F
| - |F