The Protein Interactome
A critical framework underlying systems biology
1. Overview - the many levels of systems biology
2. Experimental methods for measuring protein-protein
interactions, and their limitations
3. Data sources for information about proteins and their
interactions
4. Computational methods for assessing and predicting
protein-protein interactions.
7.91 Amy Keating
Spectrum of Systems Biology
detailed models - describe rates, concentrations, structure
low-resolution models - describe information flow, logic, mechanism
circuitry logic/control - positive and negative regulation
connectivity/topology - who talks to who? interaction scaffold
parts list - protein and DNA sequences (& structures)
Recommended reading: Ideker & Lauffenburger, TRENDS in Biotechnology (2003) 21, 255-262
Spectrum of Systems Biology
detailed models - differential equations (allow simulation & comparison with data)
low-resolution models - Boolean & Markov models
circuitry logic/control - Bayesian networks
connectivity/topology - graph theory
parts list - databases
Spectrum of Systems Biology
detailed models - rates of individual reactions, protein
concentrations in the cell, extent of phosphorylation, diffusion
rates
low-resolution models - which elements are most crucial?
combinatorial dependencies.
circuitry logic/control - Expression profiling, post-translational
modifications in response to different stimuli. Identify pathways
and clusters; does an interaction activate or repress; are multiple
components required?
connectivity/topology - protein-protein, protein-DNA and
protein-small molecule interactions
parts list - genome sequencing projects, gene finding algorithms,
EST libraries
Spectrum of Systems Biology
detailed models - YAFFE
low-resolution models - not covering this topic much this year
circuitry logic/control - BURGE
connectivity/topology - KEATING (today)
parts list - BURGE
Protein-protein and protein-DNA interactions at the genomic level
Saccharomyces cerevisiae as a model organism.
A very simple eukaryote - “yeast as a model for human”
Genome 12,053 kb sequenced in 1996.
~5800 protein-coding genes.
Easy to do genetics in yeast.
Many regulatory and metabolic pathways are at least partly
conserved between yeast and higher eukaryotes.
Many human disease genes have yeast orthologs.
Saccharomyces cerevisiae image from SGD, provided by Peter Hollenhorst and Catherine Fox. Used with permission.
Small-scale interaction experiments
Protein-protein interactions
pull-down (GST, Ni affinity, co-immunoprecipitation)
cross-linking
more biophysical & quantitative: fluorescence, CD,
calorimetry, surface plasmon resonance
Protein-DNA interactions
mostly by gel shift assay
Many, many thousands of such experiments have been done and
reported in the literature, but how do you get the information out?
This is hard, and an important problem in modern biology.
PreBIND is a machine learning application that can extract information
about whether two proteins interact from the literature automatically.
http://www.blueprint.org/products/prebind/prebind.html
Small-scale experiments are generally the most reliable, though still rife
with false negatives and false positives.
Yeast 2-hybrid assay
Vector with DNA-binding domain--B fusion
Vector with activation domain--ORF fusion
mate yeast
plate on -His media
Images: http://depts.washington.edu/sfields/yp_interactions/YPLM.html
Courtesy of Stanley Fields. Used with permission.
Yeast 2-hybrid assay
Pros
easy/fast
no purification required
in vivo conditions
can be adapted for high-throughput screens
can detect transient interactions
Cons
prone to false negatives:
  protein doesn’t fold
  protein doesn’t localize to nucleus
  interference from endogenous protein
  fusion protein doesn’t interact like native protein
  fusion may be toxic to cell
prone to false positives:
  auto-activation
  indirect interactions
not quantitative
no control over post-translational modification
only tests binary interactions
Yeast 2-hybrid assay for an entire genome
Uetz et al. Nature (2000) 403, 623-627
Two strategies:
1. “array” approach: ~6,000 activation domain hybrid transformants mated to
192 DNA binding domain fusion transformants
only 20% of interactions (281) reproducible (many auto-activate)
3.3 positives per interaction-competent protein
2. “high-throughput screen” approach: 5,345 ORFs cloned separately into
DNA-binding and activation domain plasmids (2 reporter genes); DBD fusions
pooled and mated to AD fusions; 12 clones per pool sequenced, gave 692
unique interactions (472 seen more than once)
1.8 positives per interaction-competent protein
Ito et al. PNAS (2001) 98, 4569-4574
For both DBD and AD, make 62 pools of ~96 proteins. Mate all pools against all.
Gave 4,549 interactions; 841 observed ≥ 3 times (= core data).
The potential number of interactions is huge, and the number of real interactions
is probably very large (>10,000); these studies only characterize a tiny
fraction (low coverage).
Stan Fields’ web site
http://depts.washington.edu/sfields/images/RPC19.html
Courtesy of Stanley Fields. Used with permission.
Additional “cons” when you do a large scale 2-hybrid screen
PCR amplification gives mutations - generally don’t sequence
everything to confirm!
Cloning & transformation inefficiencies
If baits are pooled, slow-growing cells will lose to faster ones,
giving false negatives.
All vs. all assay contains many implausible interactions - proteins
that aren’t co-localized or expressed at the same time.
Can only sequence a small fraction of the positive clones.
High-throughput Y2H screens miss as many as 90% of Y2H
interactions observed in focused, small-scale studies!
Affinity Purification
What do you mean by an “interaction”?
Most proteins interact with several other proteins (estimate 2-10).
Many proteins in the cell are found in complexes. For some
purposes, knowing the identities of the members of the clusters is
as useful, or more useful, than knowing the directly interacting
partners.
Affinity purification is a method for characterizing the clusters
directly, rather than one interaction at a time.
Affinity Purification/Mass spectrometry
1. DNA encodes bait + affinity tag
2. bait expressed in cell forms part of a complex (BAIT with partners a, b, c, d, e)
3. lyse cell, fish for complex with affinity column that binds the tag
4. separate a, b, c, d, e, BAIT by SDS PAGE gel
5. extract bands, digest with trypsin -> PEPTIDES
6. mass spec + database search -> identities of proteins in the complex
Affinity purification/mass spectrometry for an entire genome
Gavin et al. Nature (2002) 415, 141-147; Cellzome
1,167 bait proteins
TAP tag inserted at 3’ end of gene; proteins under endogenous promoter
2 rounds of purification
232 distinct complexes with 2 to 83 proteins per complex
new cellular role proposed for 344 proteins
To assess confidence:
Repeat the experiment - only 70% reproducible using the same bait
Use different proteins in the complex as the bait, see if you recover the
same proteins in the complex.
Ho et al. Nature (2002) 415, 180-183; MDS Proteomics
725 bait proteins; 1,578 interacting proteins
FLAG tag, proteins transiently overexpressed
To assess confidence:
74% of interactions reproducible in small scale co-IP/blot
Affinity/ms assay
Pros
get the whole complex
proteins that purify together are
likely to share a function
very sensitive - can detect ~15
copies per cell
in vivo conditions
can be adapted for
high-throughput screens
Cons
doesn’t determine direct interactions
not reliable for small proteins (< 15 kD)
affinity tag may interfere with interactions or
with the function of essential proteins
prone to false positives, e.g. “sticky” proteins
prone to false negatives
won’t get every protein every time
complex must survive purification
not quantitative
Array Detection of Protein-Protein Interactions
purified peptides or proteins (1 ... m)
print onto aldehyde or Ni surface
label with fluorescent dye
N x N
Highly purified proteins were denatured using GdnHCl and printed onto
aldehyde-derivatized glass slides using a commercial split pin arrayer. GdnHCl was
used to prevent homodimerization on the surface.
49 human proteins plus 3 duplicates plus 10 yeast proteins were printed in quadruplicate
62 times.
The 62 proteins were independently labeled with Cy-3 dye and denatured with GdnHCl.
Peptides were diluted from GdnHCl as they were added to the arrays. Following a brief
incubation, slides were washed, dried and scanned, yielding N x N measurements, in
quadruplicate, of coiled-coil (cc) interactions.
The assay was repeated at concentrations ranging from 160 pM to XXX nM.
Array Detection of Protein-Protein Interactions
MacBeath & Schreiber Science 2000
proof-of-principle for three types of interactions
protein-protein: protein G with IgG, FRAP with FKBP12, p50 with IκBα
protein-small molecule: biotin with streptavidin, Ab with DIG steroid ligand
enzyme-substrate: kinases PKA, Erk2
Zhu et al. Science 2001
assay of 5,800 yeast genes with calmodulin, phospholipids
Newman & Keating Science 2003
assay of ~48 x 48 human bZIP transcription factor coiled coils (plus 10 x 10 yeast)
Protein microarrays
Pros
N x N interactions at once
direct interaction assay
reagents can be well characterized
solution conditions are controlled
can be quantitative
requires very little protein
can be adapted for high-throughput screens
few false positives
Cons
tedious purification required, or else interactions may not be direct
surface may perturb folding or interactions
doesn’t mimic in vivo conditions
not yet a mature technology - possibly not a good general approach
Overlap of high-throughput interaction studies is LOW
                   Ito Y2H   Uetz Y2H   Gavin TAP/ms   Ho FLAG/ms
Ito 2-hybrid        4363       186          54            63
Uetz 2-hybrid                 1403          54            56
Gavin affinity                            3222           198
Ho affinity                                             3596
Small scale          442       415         528           391
data from Salwinski & Eisenberg, Current Opinion in Structural Biology (2003) 13, 377-382
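The overlap counts in a table like this come down to intersecting sets of unordered protein pairs. A minimal sketch of that bookkeeping, assuming each dataset is a list of (protein, protein) tuples; the dataset and protein names below are made up for illustration and are not taken from the studies above:

```python
from itertools import combinations

def normalize(pairs):
    """Treat interactions as unordered pairs, so (A, B) equals (B, A)."""
    return {frozenset(p) for p in pairs}

def pairwise_overlap(datasets):
    """Return {(name1, name2): number of interactions shared by both sets}."""
    sets = {name: normalize(pairs) for name, pairs in datasets.items()}
    return {
        (a, b): len(sets[a] & sets[b])
        for a, b in combinations(sets, 2)
    }

# Hypothetical toy data: two screens reporting a handful of pairs
toy = {
    "screen_1": [("P1", "P2"), ("P2", "P3"), ("P4", "P5")],
    "screen_2": [("P2", "P1"), ("P5", "P6")],
}
print(pairwise_overlap(toy))
```

Storing each pair as a frozenset is one simple way to avoid counting bait-prey and prey-bait reports of the same pair as different interactions.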
Lesson:
Lots of protein-protein interaction data is now available for
yeast, but it is not very reliable and it is not nearly
comprehensive.
Nevertheless, these data have inspired the development of
many computational methods.
To facilitate computational analysis, need to disseminate the
data in a usable form!
This is often a rate limiting step in systems biology.
Databases that store interaction data
Database of Interacting Proteins (DIP)
Biomolecular Interaction Network Database (BIND)
Molecular Interactions Database (MINT)
INTERACT
MIPS contains interaction data (both direct and clusters) for yeast
partners of murine p53
Lab of David Eisenberg
http://dip.doe-mbi.ucla.edu
DIP interaction details
DIP interaction statistics
as of May, 2004
BIND
Designed to hold direct interaction, cluster and pathway data
81,000 interactions
written in ASN.1 (Abstract Syntax Notation One) for computational efficiency
Bader GD, Betel D, Hogue CW. (2003)
BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 31(1):248-50
Gene Ontology (GO) - an organizational framework for storing
interaction and function data
http://www.geneontology.org
What is the function of a protein?
Not an easy question to answer!
There are many aspects to function, and different people might (and
do!) describe the function of one protein many different ways.
In GO gene products (mostly proteins) are described using descriptors
in three categories:
Molecular function - the activity carried out by the gene product at
the molecular level: histidine kinase, alcohol dehydrogenase
Biological process - a multi-step process, such as cell division, DNA
replication, signal transduction
Cellular component - part of a cell that is a part of a larger structure;
ribosome, spindle pole, kinetochore
The hierarchical structure of GO
Molecular function, biological process and the cellular compartment
where a gene product is found can be described at many levels of
detail.
GO uses a hierarchical description in which each protein gets a set of terms
at different levels. The ontology structure is that of a directed acyclic
graph.
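Because the ontology is a graph in which a term can have more than one parent, collecting every less detailed term above a given term is a graph traversal rather than a walk up a single chain. A minimal sketch, assuming the hierarchy is stored as a dictionary from each term to its list of parents; the toy hierarchy below is illustrative only and is not the real GO graph:

```python
def ancestors(term, parents):
    """All terms reachable by following parent links (the term itself excluded)."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical toy hierarchy: term -> list of parent terms.
# "protein kinase activity" has two parents, which a tree could not represent.
toy_parents = {
    "MAP kinase activity": ["protein kinase activity"],
    "protein kinase activity": ["kinase activity", "catalytic activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
}
print(sorted(ancestors("MAP kinase activity", toy_parents)))
```

Two proteins annotated at different levels of detail can then be compared by intersecting their ancestor sets.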
[Figures: GO annotation of p38 (fly) under MOLECULAR FUNCTION, BIOLOGICAL PROCESS and CELLULAR COMPONENT, running from less detailed to more detailed description]
Where do the GO terms come from?
A team of experts is responsible for assigning GO annotation.
Every term comes with an evidence code describing where the
annotation came from.
Sample evidence codes:
IDA - inferred from direct assay (enzyme assay, cell fractionation)
IPI - inferred from physical interaction (2-hybrid)
IGI - inferred from genetic interaction (suppressor, synthetic lethal)
IEP - inferred from expression pattern (microarray)
IMP - inferred from mutant phenotype
ISS - inferred from sequence or structure similarity
TAS - traceable author statement
NAS - non-traceable author statement
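Since every annotation carries an evidence code, annotations can be filtered by how they were derived. A small sketch, assuming annotations are stored as (gene product, GO term, evidence code) tuples; the tuple layout, the example entries, and the choice of which codes to keep are assumptions made here for illustration, not part of GO itself:

```python
# Assumption: codes a user might choose to treat as experimentally supported
EXPERIMENTAL = {"IDA", "IPI", "IGI", "IEP", "IMP"}

def filter_by_evidence(annotations, allowed=EXPERIMENTAL):
    """Keep only annotations whose evidence code is in the allowed set."""
    return [a for a in annotations if a[2] in allowed]

# Hypothetical annotations: (gene product, term, evidence code)
toy_annotations = [
    ("geneA", "term X", "IDA"),
    ("geneA", "term Y", "ISS"),
    ("geneB", "term X", "NAS"),
    ("geneC", "term Z", "IPI"),
]
print(filter_by_evidence(toy_annotations))
```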
Advantages of GO
Controlled vocabulary! Everyone can communicate using the same terms!
Designed to apply across species.
Linked to genomic databases like TIGR, FlyBase, SGD (yeast), WormBase, MGD
(mouse), RGD (rat), TAIR (Arabidopsis), ZFIN (zebrafish) and more…
Makes it possible to compute on protein function and localization. Can define
formal relationships between proteins based on their GO annotation.
Flexible. GO is always growing and changing. New terms can be added to the
hierarchy.
Free and open for the use of the community.
Computational methods for improving the quality of interaction data.
1. Assessment and validation (improve accuracy)
2. Prediction (improve coverage)
Assessing and filtering interaction data
1. Promiscuity criteria
In most high-throughput interaction studies, a few proteins are
observed to interact promiscuously. Generally these are
removed from the analysis. Problem: some interactions may be
real!
Examples:
Affinity purification/ms
Even with no bait, 17 proteins were found in pull-downs by Gavin et al.
49 other proteins found to have a similar frequency of interaction to
these false positives were thrown out.
Yeast 2-hybrid
Proteins observed to make many interactions in many screens are usually
discarded as probable false positives
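A promiscuity filter can be sketched as a degree cutoff on the interaction network. This is only an illustration: the cutoff `max_partners` is a hypothetical parameter, and the published studies used their own criteria (for example, frequency in no-bait controls).

```python
def remove_promiscuous(pairs, max_partners):
    """Drop every interaction involving a protein with more than max_partners partners.

    Returns (kept_pairs, promiscuous_proteins).
    """
    partners = {}
    for a, b in pairs:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    sticky = {p for p, s in partners.items() if len(s) > max_partners}
    kept = [(a, b) for a, b in pairs if a not in sticky and b not in sticky]
    return kept, sticky

# Hypothetical toy network: H binds everything, A-B is an ordinary pair
toy_pairs = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("A", "B")]
kept, sticky = remove_promiscuous(toy_pairs, max_partners=3)
print(kept, sticky)
```

The sketch also shows the problem noted above: every interaction of H is discarded, including any that are real.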
Assessing and filtering interaction data
2. Overlap criteria
A. with other interaction data - intersection is low!
In 2001, ~2,000 high-throughput measurements were
confirmed by small scale experiments.
B. with non-interaction data, e.g.
annotations in YPD = yeast protein
databank
YPD now proprietary at Incyte :-(
Please see figures 1 and 2 of
Deane, Charlotte M., Lukasz Salwinski, Ioannis Xenarios, and David Eisenberg. "Protein Interactions: Two Methods
for Assessment of the Reliability of High Throughput Observations." Mol. Cell. Proteomics 1 (May 2002): 349-356.
Overlap with expression data:
Expression Profile Reliability (EPR)
Note: proteins involved in “true” protein-
protein interactions have more similar mRNA
expression profiles than random pairs. Use
this to assess how good an experimental set
of interactions is.
Please see figure 4 of
Deane, Charlotte M., Lukasz Salwinski, Ioannis Xenarios, and David Eisenberg. "Protein Interactions: Two Methods
for Assessment of the Reliability of High Throughput Observations." Mol. Cell. Proteomics 1 (May 2002): 349-356.
Assessing and filtering interaction data
Expression Profile Reliability (EPR)
Assume the observed distribution results from a mixture of
true interactions and false positive interactions. The
observed distribution is expressed as a weighted sum of
these contributions.
Estimate the distribution for non interactions using all
protein-protein pairs (assume interactions are rare).
Estimate distribution for true interactions using small-scale
experiments.
Fit a parameter $\alpha_{EPR}$ to estimate how many high-throughput
interactions are true positive vs. false positive.
$F_{exp}(d^2) = \alpha_{EPR} \cdot F_{int}(d^2) + (1 - \alpha_{EPR}) \cdot F_{no\_int}(d^2)$
Best fit $\alpha_{EPR}$ = 31% -> ~70% of high-throughput pairs are false positives!
But method doesn’t tell you which interactions these are.
Other methods have estimated that ~50% of yeast 2-hybrid pairs are true positives.
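The weighted-sum fit can be illustrated with binned distributions. The sketch below assumes the three distributions are histograms over the same expression-distance bins and fits the mixing weight by ordinary least squares; least squares and the toy histograms are assumptions made here, and the original paper may fit the parameter differently.

```python
def fit_alpha(f_exp, f_int, f_no_int):
    """Least-squares weight alpha in f_exp ~ alpha*f_int + (1 - alpha)*f_no_int."""
    num = sum((e - n) * (i - n) for e, i, n in zip(f_exp, f_int, f_no_int))
    den = sum((i - n) ** 2 for i, n in zip(f_int, f_no_int))
    alpha = num / den
    return min(1.0, max(0.0, alpha))  # keep the weight in [0, 1]

# Hypothetical histograms over four distance bins (each sums to 1)
f_int = [0.50, 0.30, 0.15, 0.05]      # reference true interactions
f_no_int = [0.10, 0.20, 0.30, 0.40]   # all protein pairs (mostly non-interacting)
# Build a synthetic "experimental" set that is 31% true interactions
f_exp = [0.31 * i + 0.69 * n for i, n in zip(f_int, f_no_int)]

print(round(fit_alpha(f_exp, f_int, f_no_int), 3))
```

As the slide points out, the fitted weight describes the dataset as a whole; it does not say which individual pairs are the true positives.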
Please see figure 4 of
Deane, Charlotte M., Lukasz Salwinski, Ioannis Xenarios,
and David Eisenberg. "Protein Interactions: Two Methods
for Assessment of the Reliability of High Throughput
Observations." Mol. Cell. Proteomics 1 (May 2002): 349-356.
Assessing and filtering interaction data
Homology methods - Paralogous Verification (PVM)
Candidate interaction: Sequence A - Sequence B
PSI-BLAST each sequence within the genome to get a list of paralogs:
A: A’, A’’
B: B’, B’’, B’’’, B’’’’
PVM score = 2 (# non A-B interactions observed between the two paralog lists)
Deane et al. Mol. & Cell. Proteomics (2002) 1.5, 349-356
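The paralog count can be sketched as follows, reading the PVM score as the number of observed interactions between the family of A and the family of B, not counting the candidate pair A-B itself. The data structures and protein names are hypothetical, and the paralog lists are assumed to have been produced already (e.g. by PSI-BLAST).

```python
def pvm_score(a, b, paralogs, observed_pairs):
    """Count observed interactions between the two paralog families, excluding A-B."""
    family_a = {a} | set(paralogs.get(a, ()))
    family_b = {b} | set(paralogs.get(b, ()))
    candidate = frozenset((a, b))
    observed = {frozenset(p) for p in observed_pairs}
    cross_pairs = {frozenset((x, y)) for x in family_a for y in family_b}
    return len((cross_pairs & observed) - {candidate})

# Hypothetical example mirroring the diagram: two paralogs of A, four of B
toy_paralogs = {"A": ["A1", "A2"], "B": ["B1", "B2", "B3", "B4"]}
toy_observed = [("A", "B"), ("A1", "B1"), ("B3", "A2"), ("A1", "C")]
print(pvm_score("A", "B", toy_paralogs, toy_observed))
```

A candidate pair would then be accepted when its score reaches a chosen cutoff.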
PVM is very specific, but not very sensitive
three different high-confidence interaction datasets
WP indicates protein pairs with ≥ 1 paralog for A or B
Points on this plot come from using different PVM
score cutoffs to designate a true interaction. It is an
example of a receiver operating characteristic (ROC)
curve, which is commonly used to illustrate the
tradeoff between sensitivity and specificity.
PVM is very selective - if a pair scores by PVM it is
almost certainly a true positive (x-axis -> low false
positive rate).
However, PVM does not achieve good coverage - it is
not sensitive (y-axis). At most, PVM can confirm
~50% of high-confidence examples. This is at least
partly because examples of paralogous
complexes are sparse.
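How the points of such a curve arise from score cutoffs can be sketched directly. The example assumes each candidate pair has a score and a 0/1 label saying whether it is a known true interaction; the scores and labels below are invented for illustration.

```python
def roc_points(scores, labels):
    """Return one (false positive rate, true positive rate) point per score cutoff.

    A pair is called positive when its score is >= the cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores for six pairs; label 1 = true interaction, 0 = not
toy_scores = [3, 2, 2, 1, 0, 0]
toy_labels = [1, 1, 0, 1, 0, 0]
for fpr, tpr in roc_points(toy_scores, toy_labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A very specific but insensitive method stays near the left edge of the plot (low false positive rate) without ever climbing to a high true positive rate until the cutoff becomes permissive.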
Please see figure 5 of
Deane, Charlotte M., Lukasz Salwinski, Ioannis Xenarios,
and David Eisenberg. "Protein Interactions: Two Methods
for Assessment of the Reliability of High Throughput
Observations." Mol. Cell. Proteomics 1 (May 2002): 349-356.
Assessing and filtering interaction data
DIP_CORE is a set of 3,003 interactions considered higher confidence.
DIP_CORE interactions either:
1. Have been observed in a small-scale experiment (2,246)
2. Have been observed in more than one experiment (1,179)
3. Have been confirmed by PVM (1,428)
verification field indicates that one (1) small-scale experiment supports this interaction
Deane et al. Mol. & Cell. Proteomics (2002) 1.5, 349-356
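The three criteria are combined with "either", so the core set is a union, and the individual counts overlap (2,246 + 1,179 + 1,428 = 4,853, which exceeds 3,003). A toy sketch of that logic, with made-up interaction identifiers:

```python
def core_set(small_scale, multiple_experiments, pvm_confirmed):
    """An interaction is in the core if it satisfies at least one criterion."""
    return set(small_scale) | set(multiple_experiments) | set(pvm_confirmed)

# Hypothetical interaction identifiers
toy_small_scale = {"A-B", "C-D"}
toy_multiple = {"C-D", "E-F"}
toy_pvm = {"A-B"}

core = core_set(toy_small_scale, toy_multiple, toy_pvm)
print(len(core), "core interactions from",
      len(toy_small_scale) + len(toy_multiple) + len(toy_pvm), "criterion hits")
```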
Assessing and filtering interaction data
3. Topology criteria use information about the observed vs. expected
interaction network.
We will discuss the paper:
Bader et al. “Gaining confidence in high-throughput protein interaction
networks”
Nature Biotechnology (2004) 22, 78-85
Predicting protein-protein interactions
1. Sequence methods
How can you predict that an interaction might occur between two
proteins based purely on sequence data?
Review: Valencia & Pazos, Current Opinion in Structural Biology (2002) 12, 368-373
Predicting protein-protein interactions
1. Sequence methods
phylogenetic profiles - based on the joint presence/absence
of a pair of proteins in a large number of genomes; recall the
first literature discussion class (Pellegrini et al.).
co-evolution - as assessed by similarity of phylogenetic trees.
“mirrortree” method compares the distance matrices for
generating trees; requires lots of sequences and a good
alignment!
gene fusions - genes encoding interacting proteins in one
organism are sometimes fused into a single gene in another.
Look for these occurrences.
gene neighborhood - for bacteria, the arrangement of genes
in operons means that interacting proteins are often encoded
in adjacent sites in the genome
Review: Valencia & Pazos, Current Opinion in Structural Biology (2002) 12, 368-373
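Two of these ideas reduce to short calculations. The sketch below scores phylogenetic profiles by the fraction of genomes in which two proteins are both present or both absent, and scores co-evolution mirrortree-style by correlating the corresponding entries of two distance matrices. The similarity measure for profiles, the function names, and the toy data are assumptions for illustration; published methods use more careful statistics.

```python
from math import sqrt

def profile_similarity(profile_a, profile_b):
    """Fraction of genomes where presence/absence (1/0) agrees for both proteins."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    same = sum(1 for x, y in zip(profile_a, profile_b) if x == y)
    return same / len(profile_a)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def mirrortree(dist_a, dist_b):
    """Correlate the upper triangles of two distance matrices (same species order)."""
    n = len(dist_a)
    xs = [dist_a[i][j] for i in range(n) for j in range(i + 1, n)]
    ys = [dist_b[i][j] for i in range(n) for j in range(i + 1, n)]
    return pearson(xs, ys)

# Hypothetical data: presence/absence across six genomes
toy_profile_a = [1, 1, 0, 1, 0, 0]
toy_profile_b = [1, 1, 0, 1, 0, 1]
# Hypothetical distance matrices for the same three species
toy_dist_a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
toy_dist_b = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]

print(profile_similarity(toy_profile_a, toy_profile_b))
print(mirrortree(toy_dist_a, toy_dist_b))
```

Both matrices must list the same species in the same order, which is one reason the method needs many sequences and a good alignment.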
Predicting protein-protein interactions
1. Sequence methods
correlated mutations - the idea is that interacting positions on different proteins should
co-evolve so as to maintain the interface. Look for correlation between sequence
changes at one position and those at another position in a multiple sequence
alignment.
Recall Süel et al. “Evolutionarily conserved networks of residues mediate allosteric
communication in proteins”
$\Delta\Delta G_{i,j} = \Delta G_j - \Delta G_{j|i}$
where $\Delta G = kT \cdot \ln(P(x \text{ at } j)/P_{MSA}(x))$
Pazos & Valencia “In silico two-hybrid systems for the selection of physically interacting
protein pairs.” PROTEINS (2002) 47, 219-227
Pearson coefficient $r_{ij} = \sum_{k,l} (S_{i,k,l} - \langle S_i \rangle)(S_{j,k,l} - \langle S_j \rangle) / \text{normalization}$ describes the correlation
between amino acid positions i and j in two proteins. Here $S_{i,k,l}$ is a measure of the
similarity of the aa at position i in sequences k and l, and $\langle S_i \rangle$ is the average of these
values. k and l are sequences taken from a MSA that has the same number of
sequences, from the same species, for sites i and j.
Problem: need lots of sequences, and the method is very sensitive to the alignment used.
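The correlation between two alignment positions can be sketched in a few lines. This example assumes each position is given as a column of residues, one per species, in the same species order for both proteins. It uses residue identity (1 if the same, 0 otherwise) as the similarity S and the standard Pearson normalization; both are simplifying assumptions made here, whereas the published method uses a substitution-matrix similarity.

```python
from itertools import combinations
from math import sqrt

def similarity_vector(column):
    """S for every sequence pair k < l in one alignment column (identity similarity)."""
    return [1.0 if a == b else 0.0 for a, b in combinations(column, 2)]

def correlated_mutation(column_i, column_j):
    """Pearson correlation between the pairwise similarities at two positions."""
    si = similarity_vector(column_i)
    sj = similarity_vector(column_j)
    mi = sum(si) / len(si)
    mj = sum(sj) / len(sj)
    cov = sum((a - mi) * (b - mj) for a, b in zip(si, sj))
    norm = sqrt(sum((a - mi) ** 2 for a in si) * sum((b - mj) ** 2 for b in sj))
    return cov / norm if norm else 0.0  # invariant columns carry no signal

# Hypothetical columns from four species: position i in protein 1, position j in protein 2
print(correlated_mutation("AAGG", "LLVV"))  # changes occur in the same species
print(correlated_mutation("AAGG", "LLLL"))  # position j never changes
```

With only four sequences the coefficient is not meaningful statistically, which is the practical problem noted above: many sequences and a reliable alignment are needed.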
Review: Valencia & Pazos, Current Opinion in Structural Biology (2002) 12, 368-373
Predicting protein-protein interactions
2. Structure-based methods
Docking is a large field in and of itself, which involves predicting how
two known structures will interact. It even has its own prediction
contest - CAPRI, like CASP.
The main issues in docking are, as always when modeling structure, (1)
sampling the conformational space and (2) selecting the correct solution.
Docking approaches require structures of both interacting components.
Frequently, conformational changes accompany protein interactions.
Docking methods generally require a structure of the bound
conformation to predict interactions correctly. Modeling
conformational flexibility is hard.
We don’t have enough structures or good enough docking methods to make high-
throughput prediction of protein-protein interactions practical at this point.
Predicting protein-protein interactions
2. Structure-based methods
What do you do when you don’t have a structure?
Homology modeling methods (Aloy & Russell PNAS (2002) 99, 5896-5901)
For target proteins that have homologs that form a complex of known
structure:
(1) Identify pairs of positions that form interactions in the known structure
(2) align the target proteins to the template proteins and score the interacting
residue pairs identified in step (1) with a knowledge-based potential.
(3) Normalize using the scores for pairs of random sequences
(4) Z-scores above a certain cutoff indicate that a complex is likely.
~65% accuracy when assessing whether different fibroblast growth factors bind
to various receptors (4 structures available, 252 possible pairings evaluated).
Not practical to apply at the genome level due to lack of homologous
complexes with structures.
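Steps (1) to (4) can be sketched as scoring the aligned interface positions and comparing against a random background. Everything concrete below is an assumption for illustration: the pair potential is a tiny made-up table in which higher means more favorable, the random background is generated by shuffling the two target sequences, and contacts are given as 0-based index pairs already mapped through the alignment.

```python
import random
from statistics import mean, stdev

def interface_score(seq_a, seq_b, contacts, potential):
    """Sum the pair potential over interface positions taken from the template."""
    return sum(
        potential.get(frozenset((seq_a[i], seq_b[j])), 0.0)
        for i, j in contacts
    )

def z_score(seq_a, seq_b, contacts, potential, n_random=200, seed=0):
    """Normalize the real score against scores of shuffled sequence pairs."""
    rng = random.Random(seed)
    real = interface_score(seq_a, seq_b, contacts, potential)
    background = []
    for _ in range(n_random):
        shuffled_a, shuffled_b = list(seq_a), list(seq_b)
        rng.shuffle(shuffled_a)
        rng.shuffle(shuffled_b)
        background.append(interface_score(shuffled_a, shuffled_b, contacts, potential))
    spread = stdev(background)
    return (real - mean(background)) / spread if spread else 0.0

# Hypothetical potential: opposite charges favorable, like charges unfavorable
toy_potential = {
    frozenset(("K", "E")): 1.0,
    frozenset(("K",)): -1.0,   # K-K pair
    frozenset(("E",)): -1.0,   # E-E pair
}
toy_contacts = [(0, 0), (1, 1), (2, 2), (3, 3)]

print(interface_score("KKKKAAAA", "EEEEAAAA", toy_contacts, toy_potential))
print(z_score("KKKKAAAA", "EEEEAAAA", toy_contacts, toy_potential))
```

A pair would be predicted to form a complex when its Z-score exceeds a chosen cutoff; the cutoff itself has to be calibrated on known complexes.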
Predicting protein-protein interactions
2. Structure-based methods
What do you do when you don’t have a structure?
Threading methods (Lu et al., MULTIPROSPECTOR)
Phase I: Thread each target sequence onto a library of folds using a
permissive cutoff
Phase II: Take pairs of fold assignments and thread the targets onto
complexes of these folds (complexes of known structure)
Evaluate an interfacial score to determine how complementary the fit is
S_interface = -log (N_obs_in_PDB(i,j)/N_expect_by_chance(i,j))
Used library of 768 complexes, predicted 7,321 interactions for yeast proteins.
Hard to assess performance. One way is to look at some property that you
believe should correlate with interactions, e.g. co-localization or function.
Lu et al., PROTEINS (2002) 49, 350-364, Genome Research (2003) 13, 1146-1154
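The interfacial score is a log-odds term: residue pairs seen across interfaces more often than expected by chance get a negative (favorable) value. A minimal sketch of the formula, with invented counts; summing the per-pair terms over the contacts of a threaded complex is an assumption about how the terms are combined.

```python
from math import log

def pair_term(n_obs, n_expected):
    """-log(observed / expected) for one residue-pair type; needs n_obs > 0."""
    return -log(n_obs / n_expected)

def interface_total(contact_types, counts):
    """Sum the pair terms over all contacts in a modeled interface."""
    return sum(pair_term(*counts[pair]) for pair in contact_types)

# Hypothetical counts: pair type -> (observed in known complexes, expected by chance)
toy_counts = {
    ("K", "E"): (200, 100),   # enriched at interfaces -> negative term
    ("K", "K"): (50, 100),    # depleted at interfaces -> positive term
    ("A", "L"): (100, 100),   # as expected -> zero
}
toy_interface = [("K", "E"), ("K", "E"), ("A", "L")]

print(pair_term(200, 100))
print(interface_total(toy_interface, toy_counts))
```

Pair types that are never observed would need a pseudocount or similar treatment, since the logarithm of zero is undefined.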
2. Structure-based methods
Threading methods Lu, L, H Lu, and J Skolnick. "MULTIPROSPECTOR: An Algorithm for The
Prediction of Protein-protein Interactions by Multimeric Threading." Proteins 49, no. 3 (15 November
2002): 350-64.
Co-localization:
Are the proteins found
in the same part of the cell?
Please see figure 2 of
Lu, Long, Adrian K. Arakaki, Hui Lu, and Jeffrey Skolnick. "Multimeric Threading-Based Prediction of
Protein–Protein Interactions on a Genomic Scale: Application to the Saccharomyces Cerevisiae Proteome."
Genome Res. 13 (June 2003): 1146-1154.
Predicting protein-protein interactions
3. Methods based on data
Jansen et al. - next class
Next class: literature about protein-protein interaction
assessment and prediction
Bader et al. “Gaining confidence in high-throughput protein interaction
networks” Nature Biotechnology (2004) 22, 78-85
Jansen et al. “A Bayesian networks approach for predicting protein-protein
interactions from genomic data” Science (2003) 302, 449-453.
Focus on:
1. What are they trying to do?
2. What do they use as a set of positive and negative examples?
3. What is their basis for deciding if an interaction is good or not?
4. How well do the methods work? How can you tell?
5. Do they learn anything new or exciting about interactions in the
proteome?