The Protein Interactome
A critical framework underlying systems biology 
1.	 Overview - the many levels of systems biology 
2.	 Experimental methods for measuring protein-protein 
interactions, and their limitations 
3.	 Data sources for information about proteins and their 
interactions 
4.	 Computational methods for assessing and predicting 
protein-protein interactions. 
7.91 Amy Keating
Spectrum of Systems Biology
detailed models - describe rates, concentrations, structure 
low-resolution models - describe information flow, logic, mechanism 
circuitry logic/control - positive and negative regulation 
connectivity/topology - who talks to who? interaction scaffold 
parts list - protein and DNA sequences (& structures) 
Recommended reading:  Ideker & Lauffenburger, TRENDS in Biotechnology (2003) 21, 255-262 
Spectrum of Systems Biology
detailed models - differential equations (allow simulation & comparison with data) 
low-resolution models - Boolean & Markov models 
circuitry logic/control - Bayesian networks 
connectivity/topology - graph theory 
parts list - databases 
Spectrum of Systems Biology
detailed models - rates of individual reactions, protein 
concentrations in the cell, extent of phosphorylation, diffusion 
rates 
low-resolution models - which elements are most crucial? 
combinatorial dependencies. 
circuitry logic/control - Expression profiling, post-translational 
modifications in response to different stimuli. Identify pathways 
and clusters; does an interaction activate or repress; are multiple 
components required? 
connectivity/topology - protein-protein, protein-DNA and 
protein-small molecule interactions 
parts list - genome sequencing projects, gene finding algorithms, 
EST libraries 
Spectrum of Systems Biology
detailed models - YAFFE 
low-resolution models - not covering this topic much this year 
circuitry logic/control - BURGE 
connectivity/topology - KEATING (today) 
parts list - BURGE 
Protein-protein and protein-DNA interactions at the genomic level 
Saccharomyces cerevisiae as a model organism. 
A very simple eukaryote - “yeast as a model for human”
Genome 12,053 kb sequenced in 1996.
~5800 protein-coding genes.
Easy to do genetics in yeast.
Many regulatory and metabolic pathways are at least partly 
conserved between yeast and higher eukaryotes.
Many human disease genes have yeast orthologs.
Saccharomyces cerevisiae image from SGD?, provided by Peter Hollenhorst and Catherine Fox. Used with permission. 
Small-scale interaction experiments 
Protein-protein interactions 
pull-down (GST, Ni affinity, co-immunoprecipitation)
cross-linking
more biophysical & quantitative: fluorescence, CD, 
calorimetry, surface plasmon resonance 
Protein-DNA interactions 
mostly by gel shift assay 
Many, many thousands of such experiments have been done and 
reported in the literature, but how do you get the information out?  
This is hard, and an important problem in modern biology. 
PreBIND is a machine learning application that can extract information 
about whether two proteins interact from the literature automatically. 
http://www.blueprint.org/products/prebind/prebind.html 
Small-scale experiments are generally the most reliable, though still rife 
with false negatives and false positives. 
Yeast 2-hybrid assay
Vector with activation domain--ORF fusion + vector with DNA-binding domain--B fusion 
-> mate yeast -> plate on -His media
Images: http://depts.washington.edu/sfields/yp_interactions/YPLM.html 
Courtesy of Stanley Fields. Used with permission.
Yeast 2-hybrid assay
Pros 
easy/fast 
no purification required 
in vivo conditions 
can be adapted for high-throughput screens 
can detect transient interactions 
Cons 
prone to false negatives: 
  protein doesn’t fold 
  protein doesn’t localize to nucleus 
  interference from endogenous protein 
  fusion protein doesn’t interact like native 
  protein fusion may be toxic to cell 
prone to false positives: 
  auto-activation 
  indirect interactions 
not quantitative 
no control over post-translational modification 
only tests binary interactions 
Yeast 2-hybrid assay for an entire genome
Uetz et al. Nature (2000) 403, 623-627 
Two strategies: 
1.	 “array” approach: ~6,000 activation domain hybrid transformants mated to 
192 DNA binding domain fusion transformants 
only 20% of interactions (281) reproducible (many auto-activate) 
3.3 positives per interaction-competent protein
2.	 “high-throughput screen” approach:  5,345 ORFs cloned separately into 
DNA-binding and activation domain plasmids (2 reporter genes); DBD fusions 
pooled and mated to AD fusions; 12 clones per pool sequenced, gave 692 
unique interactions (472 seen more than once) 
1.8 positives per interaction-competent protein
Ito et al. PNAS (2001) 98, 4569-4574 
For both DBD and AD, make 62 pools of ~96 proteins.  Mate all pools against all. 
Gave 4,549 interactions; 841 observed ≥ 3 times (= core data). 
The potential number of interactions is huge, and the number of real interactions 
is probably very large (>10,000); these studies only characterize a tiny 
fraction (low coverage). 
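To make "huge" concrete, the arithmetic can be sketched as below. This is only an illustration: it assumes roughly 5,800 proteins (the gene count quoted earlier), counts unordered pairs, and ignores self-interactions. The function name `possible_pairs` is made up for this example.

```python
def possible_pairs(n_proteins):
    """Number of unordered protein pairs, excluding self-pairs."""
    return n_proteins * (n_proteins - 1) // 2

n = 5800                       # approximate number of yeast protein-coding genes
total = possible_pairs(n)      # size of the all-vs-all search space
ito_all, ito_core = 4549, 841  # interactions reported by Ito et al.

print(f"candidate pairs: {total:,}")
print(f"Ito (all) as a fraction of candidate pairs: {ito_all / total:.5f}")
# 10,000 is the lower bound on real interactions suggested above
print(f"Ito core relative to 10,000 real interactions: {ito_core / 10000:.1%}")
```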
Stan Fields’s web site 
http://depts.washington.edu/sfields/images/RPC19.html 
Courtesy of Stanley Fields. Used with permission.
Additional “cons” when you do a large scale 2-hybrid screen
PCR amplification gives mutations - generally don’t sequence 
everything to confirm! 
Cloning & transformation inefficiencies
If baits are pooled, slow-growing cells will lose to faster ones, 
giving false negatives.
All vs. all assay contains many implausible interactions - proteins 
that aren’t co-localized or expressed at the same time.
Can only sequence a small fraction of the positive clones.
High-throughput Y2H screens miss as many as 90% of Y2H 
interactions observed in focused, small-scale studies! 
Affinity Purification
What do you mean by an “interaction”? 
Most proteins interact with several other proteins (estimate 2-10). 
Many proteins in the cell are found in complexes. For some 
purposes, knowing the identities of the members of the clusters is 
as useful, or more useful, than knowing the directly interacting 
partners. 
Affinity purification is a method for characterizing the clusters 
directly, rather than one interaction at a time. 
Affinity Purification/Mass spectrometry
1. DNA encodes bait + affinity tag 
2. bait expressed in cell forms part of a complex (BAIT with a, b, c, d, e) 
3. lyse cell, fish for complex with affinity column that binds the tag 
4. separate a, b, c, d, e, BAIT by SDS PAGE gel 
5. extract bands, digest with trypsin -> PEPTIDES 
6. mass spec + database search -> identities of proteins in the complex 
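The digest-and-search step can be sketched in a few lines. This is a simplified illustration, not how a real search engine works: real searches match peptide masses and fragment spectra, whereas this toy version matches peptide sequences directly. The cleavage rule (after K or R, not before P) is the usual simplified trypsin rule; the protein names and sequences are invented.

```python
import re

def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (simplified trypsin rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

def identify(observed_peptides, database):
    """Count how many observed peptides match each protein's in-silico digest."""
    hits = {}
    for name, seq in database.items():
        n = len(set(observed_peptides) & set(tryptic_digest(seq)))
        if n:
            hits[name] = n
    return hits

# Invented toy database and "observed" peptides
toy_db = {"protA": "MKTAYIAKQR", "protB": "GGSRPLLKAAW"}
observed = {"TAYIAK", "QR", "AAW"}

print(tryptic_digest(toy_db["protA"]))
print(identify(observed, toy_db))
```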
Affinity purification/mass spectrometry for an entire genome
Gavin et al. Nature (2002) 415, 141-147; Cellzome
1,167 bait proteins
TAP tag inserted at 3’ end of gene; proteins under endogenous promoter
2 rounds of purification
232 distinct complexes with 2 to 83 proteins per complex
new cellular role proposed for 344 proteins
To assess confidence:
Repeat the experiment - only 70% reproducible using the same bait
Use different proteins in the complex as the bait, see if you recover the 
same proteins in the complex.
Ho et al. Nature (2002) 415, 180-183; MDS Proteomics
725 bait proteins; 1,578 interacting proteins
FLAG tag, proteins transiently overexpressed
To assess confidence:
74% of interactions reproducible in small scale co-IP/blot
Affinity/ms assay
Pros 
get the whole complex 
proteins that purify together are 
likely to share a function 
very sensitive - can detect ~15 
copies per cell 
in vivo conditions 
can be adapted for 
high-throughput screens 
Cons 
doesn’t determine direct interactions 
not reliable for small proteins (< 15 kD) 
affinity tag may interfere with interactions or 
with the function of essential proteins 
prone to false positives, e.g. “sticky” proteins 
prone to false negatives 
won’t get every protein every time 
complex must survive purification 
not quantitative 
Array Detection of Protein-Protein Interactions
[Figure: purified peptides or proteins (1 ... m) are printed onto an aldehyde or Ni surface; each is labeled with fluorescent dye and used to probe the arrays, giving N x N measurements] 
Highly purified proteins were denatured using GdnHCl and printed onto aldehyde-derivatized 
glass slides using a commercial split pin arrayer.  GdnHCl was used to prevent 
homodimerization on the surface.
49 human proteins plus 3 duplicates plus 10 yeast proteins were printed in quadruplicate 
62 times.
The 62 proteins were independently labeled with Cy-3 dye and denatured with GdnHCl.  
Peptides were diluted from GdnHCl as they were added to the arrays.  Following a brief 
incubation, slides were washed, dried and scanned, yielding N x N measurements, in 
quadruplicate, of coiled-coil (cc) interactions.
The assay was repeated at concentrations ranging from 160 pM to XXX nM.
Array Detection of Protein-Protein Interactions
MacBeath & Schreiber Science 2000 
proof-of-principle for three types of interactions 
protein-protein:  protein G with IgG, FRAP with FKBP12, p50 with IκBα 
protein-small molecule: biotin with streptavidin, Ab with DIG steroid ligand 
enzyme-substrate: kinases PKA, Erk2 
Zhu et al. Science 2001 
assay of 5,800 yeast genes with calmodulin, phospholipids 
Newman & Keating Science 2003 
assay of ~48 x 48 human bZIP transcription factor coiled coils (plus 10 x 10 yeast) 
Protein microarrays
Pros 
N x N interactions at once 
direct interaction assay 
reagents can be well characterized 
solution conditions are controlled 
can be quantitative 
requires very little protein 
can be adapted for high-throughput screens 
few false positives 
Cons 
tedious purification required, or else interactions may not be direct 
surface may perturb folding or interactions 
doesn’t mimic in vivo conditions 
not yet a mature technology - possibly not a good general approach 
Overlap of high-throughput interaction studies is LOW
                  Ito Y2H   Uetz Y2H   Gavin TAP/ms   Ho FLAG/ms 
Ito 2-hybrid        4363       186          54            63 
Uetz 2-hybrid                 1403          54            56 
Gavin affinity                            3222           198 
Ho affinity                                             3596 
Small scale          442       415         528           391 
data from Salwinski & Eisenberg, Current Opinion in Structural Biology (2003) 13, 377-382 
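A table like this can be computed by storing each interaction as an unordered pair and intersecting the sets. The sketch below uses invented toy data and made-up dataset names; it only shows the bookkeeping, not the real numbers.

```python
from itertools import combinations

def as_pairs(edges):
    """Store each interaction as an unordered pair so that A-B equals B-A."""
    return {frozenset(e) for e in edges}

# Invented toy datasets, not the published screens
datasets = {
    "screen_1": as_pairs([("A", "B"), ("B", "C"), ("C", "D")]),
    "screen_2": as_pairs([("B", "A"), ("D", "E")]),
    "small_scale": as_pairs([("A", "B"), ("C", "D"), ("E", "F")]),
}

def overlap_table(sets):
    """Number of shared interactions for every pair of datasets."""
    return {(a, b): len(sets[a] & sets[b]) for a, b in combinations(sets, 2)}

for (a, b), n in overlap_table(datasets).items():
    print(f"{a} vs {b}: {n} shared")
```

Using `frozenset` for each pair matters: a two-hybrid result reported as bait-prey and a pull-down reported in the opposite order should still count as the same interaction.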
Lesson: 
Lots of protein-protein interaction data is now available for 
yeast, but it is not very reliable and it is not nearly 
comprehensive. 
Nevertheless, these data have inspired the development of 
many computational methods.
To facilitate computational analysis, need to disseminate the 
data in a usable form!
This is often a rate limiting step in systems biology.
Databases that store interaction data
Database of Interacting Proteins (DIP) 
Biomolecular Interaction Network Database (BIND) 
Molecular Interactions Database (MINT) 
INTERACT 
MIPS contains interaction data (both direct and clusters) for yeast 
DIP - Lab of David Eisenberg 
http://dip.doe-mbi.ucla.edu 
[Screenshots: DIP search results for partners of murine p53; DIP interaction details; DIP interaction statistics as of May, 2004] 
BIND
Designed to hold direct interaction, cluster and pathway data 
81,000 interactions 
written in ASN.1 (Abstract Syntax Notation) for computational efficiency 
Bader GD, Betel D, Hogue CW. (2003) 
BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 31(1):248-50
Gene Ontology (GO) - an organizational framework for storing
interaction and function data
http://www.geneontology.org 
What is the function of a protein? 
Not an easy question to answer!
There are many aspects to function, and different people might (and 
do!) describe the function of one protein many different ways.
In GO gene products (mostly proteins) are described using descriptors 
in three categories: 
Molecular function - the activity carried out by the gene product at 
the molecular level, e.g. histidine kinase, alcohol dehydrogenase 
Biological process - a multi-step process, such as cell division, DNA 
replication, signal transduction 
Cellular component - part of a cell that is a part of a larger structure; 
ribosome, spindle pole, kinetochore 
The hierarchical structure of GO
Molecular function, biological process and the cellular compartment 
where a gene product is found can be described at many levels of 
detail. 
GO uses a hierarchical description where each protein gets a set of terms 
at different levels. The ontology structure is that of a directed acyclic 
graph (DAG): a term can have more than one parent. 
[Figures: GO term hierarchies for p38 (fly), running from less detailed to more detailed description, shown for MOLECULAR FUNCTION, BIOLOGICAL PROCESS and CELLULAR COMPONENT] 
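Because the ontology is a directed graph without cycles, "all the less detailed descriptions of a term" is simply the set of its ancestors. A minimal sketch follows; the term names and parent links are illustrative only and do not reproduce real GO identifiers or the real GO structure.

```python
# Illustrative toy ontology: each term maps to its (possibly multiple) parents
parents = {
    "molecular_function": [],
    "catalytic activity": ["molecular_function"],
    "binding": ["molecular_function"],
    "kinase activity": ["catalytic activity"],
    "ATP binding": ["binding"],
    "toy ATP-dependent kinase": ["kinase activity", "ATP binding"],
}

def ancestors(term, parents):
    """All less-specific terms reachable by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

def shared_terms(term_a, term_b, parents):
    """Terms (including the terms themselves) that describe both annotations."""
    return (ancestors(term_a, parents) | {term_a}) & (ancestors(term_b, parents) | {term_b})

print(sorted(ancestors("toy ATP-dependent kinase", parents)))
print(sorted(shared_terms("kinase activity", "ATP binding", parents)))
```

Comparing ancestor sets is one simple way to "compute on" annotation: two gene products annotated at different levels of detail can still be related through the terms they share.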
Where do the GO terms come from?
A team of experts is responsible for assigning GO annotation.
Every term comes with an evidence code describing where the 
annotation came from.
Sample evidence codes:
IDA - inferred from direct assay (enzyme assay, cell fractionation)
IPI - inferred from physical interaction (2-hybrid)
IGI - inferred from genetic interaction (suppressor, synthetic lethal)
IEP - inferred from expression pattern (microarray)
IMP - inferred from mutant phenotype
ISS - inferred from sequence or structure similarity
TAS - traceable author statement
NAS - non-traceable author statement
Advantages of GO
Controlled vocabulary!  Everyone can communicate using the same terms!
Designed to apply across species.
Linked to genomic databases like TIGR, FlyBase, SGD (yeast), WormBase, MGD 
(mouse), RGD (rat), TAIR (Arabidopsis), ZFIN (zebrafish) and more…
Makes it possible to compute on protein function and localization. Can define 
formal relationships between proteins based on their GO annotation.
Flexible. GO is always growing and changing.  New terms can be added to the 
hierarchy.
Free and open for the use of the community.
Computational methods for improving the quality of interaction data. 
1. Assessment and validation (improve accuracy) 
2. Prediction (improve coverage) 
Assessing and filtering interaction data
1. Promiscuity criteria 
In most high-throughput interaction studies, a few proteins are 
observed to interact promiscuously.  Generally these are 
removed from the analysis. Problem: some interactions may be 
real! 
Examples:
Affinity purification/ms
Even with no bait, 17 proteins were found in pull-downs by Gavin et al. 
49 other proteins found to have a similar frequency of interaction to 
these false positives were thrown out.
Yeast 2-hybrid
Proteins observed to make many interactions in many screens are usually 
discarded as probable false positives
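A promiscuity filter of this kind can be sketched as a degree cutoff on the interaction network. The cutoff value and the toy edges are invented, and the published studies used their own criteria; this only shows the general idea and its cost, namely that every interaction of a removed protein is lost, real or not.

```python
from collections import Counter

def filter_promiscuous(edges, max_partners):
    """Drop every interaction involving a protein with more than max_partners edges.

    Assumes each interaction appears once in `edges`.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        if b != a:
            degree[b] += 1
    sticky = {p for p, d in degree.items() if d > max_partners}
    kept = [(a, b) for a, b in edges if a not in sticky and b not in sticky]
    return kept, sticky

# Invented toy network: "hub" behaves like a sticky protein
edges = [("hub", "A"), ("hub", "B"), ("hub", "C"), ("A", "B"), ("C", "D")]
kept, sticky = filter_promiscuous(edges, max_partners=2)
print("removed proteins:", sticky)
print("kept interactions:", kept)
```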
Assessing and filtering interaction data
2. 	Overlap criteria
A. with other interaction data - intersection is low! 
In 2001, ~2,000 high-throughput measurements were 
confirmed by small scale experiments. 
B.	 with non-interaction data, e.g. annotations in YPD (the Yeast 
Proteome Database)
YPD now proprietary at Incyte :-(
Please see figures 1 and 2 of
Deane, Charlotte M., Lukasz Salwinski, Ioannis Xenarios, and David Eisenberg. "Protein Interactions: Two Methods
for Assessment of the Reliability of High Throughput Observations." Mol. Cell. Proteomics 1 (May 2002): 349-356.
Overlap with expression data:
Expression Profile Reliability (EPR)
Note: proteins involved in “true” protein-
protein interactions have more similar mRNA 
expression profiles than random pairs.  Use 
this to assess how good an experimental set 
of interactions is. 
Please see figure 4 of
Deane, Charlotte M., Lukasz Salwinski, Ioannis Xenarios, and David Eisenberg. "Protein Interactions: Two Methods
for Assessment of the Reliability of High Throughput Observations." Mol. Cell. Proteomics 1 (May 2002): 349-356.
Assessing and filtering interaction data
Expression Profile Reliability (EPR)
Assume the observed distribution of expression-profile distances 
results from a mixture of true interactions and false positive 
interactions.  The observed distribution is expressed as a weighted 
sum of these two contributions. 
Estimate the distribution for non interactions using all 
protein-protein pairs (assume interactions are rare).
 
Estimate distribution for true interactions using small-scale 
experiments. 
Fit a parameter $\alpha_{EPR}$ to estimate how many high-throughput 
interactions are true positive vs. false positive. 
$F_{exp}(d^2) = \alpha_{EPR} \cdot F_{int}(d^2) + (1 - \alpha_{EPR}) \cdot F_{no\_int}(d^2)$ 
Best fit $\alpha_{EPR}$ = 31% -> ~70% of high-throughput pairs are false positives!
But method doesn’t tell you which interactions these are.
Other methods have estimated that ~50% of yeast 2-hybrid pairs are true positives.
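The fit of the mixing weight can be illustrated with binned distributions. The sketch below is not the published procedure: it assumes a plain least-squares fit (which has a closed form for a single weight), and the three histograms are invented so that the answer is known in advance. The name `alpha` stands for the EPR weight in the equation above.

```python
def fit_alpha(f_exp, f_int, f_no_int):
    """Least-squares weight for f_exp ~ alpha*f_int + (1 - alpha)*f_no_int, clipped to [0, 1]."""
    num = sum((e - n) * (i - n) for e, i, n in zip(f_exp, f_int, f_no_int))
    den = sum((i - n) ** 2 for i, n in zip(f_int, f_no_int))
    if den == 0:
        raise ValueError("f_int and f_no_int are identical; alpha cannot be determined")
    return min(1.0, max(0.0, num / den))

# Invented histograms over four distance bins (each sums to 1)
f_int = [0.50, 0.30, 0.15, 0.05]     # reference: true interactions (small-scale data)
f_no_int = [0.10, 0.20, 0.30, 0.40]  # reference: all pairs, mostly non-interacting

# Synthetic "experimental" histogram built with a known weight
true_alpha = 0.31
f_exp = [true_alpha * i + (1 - true_alpha) * n for i, n in zip(f_int, f_no_int)]

alpha = fit_alpha(f_exp, f_int, f_no_int)
print(f"fitted alpha = {alpha:.2f}, implied false positive fraction = {1 - alpha:.2f}")
```

The fit returns one number for the whole dataset, which is why the method estimates the fraction of false positives without saying which pairs they are.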
Please see figure 4 of
Deane, Charlotte M., Lukasz Salwinski, Ioannis Xenarios,
and David Eisenberg. "Protein Interactions: Two Methods
for Assessment of the Reliability of High Throughput 
Observations." Mol. Cell. Proteomics 1 (May 2002): 349-356.
Assessing and filtering interaction data
Homology methods - Paralogous Verification (PVM)
Candidate interaction: Sequence A - Sequence B 
PSI-BLAST w/in genome -> list of paralogs: 
A: A’, A’’ 
B: B’, B’’, B’’’, B’’’’ 
PVM score = # of non A-B interactions observed between the two paralog families 
(in this example, PVM score = 2) 
Deane et al. Mol. & Cell. Proteomics (2002) 1.5, 349-356 
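Following the definition on this slide (count the observed interactions between the two paralog families, other than A-B itself), a PVM-style score can be sketched as below. The paralog lists and interactions are invented, and the published method may differ in details such as how paralogs are selected.

```python
def pvm_score(a, b, paralogs, interactions):
    """Count observed interactions between the families of a and b, excluding a-b itself."""
    family_a = {a} | set(paralogs.get(a, ()))
    family_b = {b} | set(paralogs.get(b, ()))
    target = frozenset((a, b))
    supporting = set()
    for x in family_a:
        for y in family_b:
            pair = frozenset((x, y))
            if pair != target and pair in interactions:
                supporting.add(pair)
    return len(supporting)

# Invented example mirroring the diagram: A has 2 paralogs, B has 4
paralogs = {"A": ["A'", "A''"], "B": ["B'", "B''", "B'''", "B''''"]}
interactions = {
    frozenset(("A", "B")),       # the candidate itself, not counted
    frozenset(("A'", "B'")),     # supporting paralog interaction
    frozenset(("A''", "B'''")),  # supporting paralog interaction
    frozenset(("A", "X")),       # unrelated interaction, not counted
}

print(pvm_score("A", "B", paralogs, interactions))
```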
PVM is very specific, but not very sensitive
Figure legend: three different high-confidence interaction datasets; 
WP indicates protein pairs with ≥ 1 paralog for A or B. 
Points on this plot come from using different PVM 
score cutoffs to designate a true interaction.  It is an 
example of a receiver operating characteristic (ROC) 
curve, which is commonly used to illustrate the 
tradeoff between sensitivity and specificity. 
PVM is very selective - if a pair scores by PVM it is 
almost certainly a true positive (x-axis -> low false 
positive rate). 
However, PVM does not achieve good coverage - it is 
not sensitive (y-axis). At most, PVM can confirm 
~50% of high-confidence examples.  This is at least 
partly because examples of paralogous complexes 
are sparse. 
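The way such a curve is built from score cutoffs can be sketched as follows. The scored pairs are invented; each is a (score, is_true_interaction) tuple, and both classes must be present for the rates to be defined.

```python
def roc_points(scored_pairs, cutoffs):
    """For each cutoff, return (cutoff, false positive rate, true positive rate).

    A pair is called an interaction when its score is >= cutoff.
    """
    pos = sum(1 for _, is_true in scored_pairs if is_true)
    neg = len(scored_pairs) - pos
    points = []
    for c in cutoffs:
        tp = sum(1 for s, is_true in scored_pairs if is_true and s >= c)
        fp = sum(1 for s, is_true in scored_pairs if not is_true and s >= c)
        points.append((c, fp / neg, tp / pos))
    return points

# Invented data: many true pairs score 0, as expected when paralog data are sparse
scored = [(0, True), (0, True), (1, True), (2, True),
          (0, False), (0, False), (0, False), (1, False)]

for cutoff, fpr, tpr in roc_points(scored, cutoffs=[1, 2]):
    print(f"cutoff {cutoff}: false positive rate {fpr:.2f}, sensitivity {tpr:.2f}")
```

In this toy set a stricter cutoff removes the remaining false positive but also halves the sensitivity, which is the tradeoff the plot illustrates.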
Please see figure 5 of
Deane, Charlotte M., Lukasz Salwinski, Ioannis Xenarios,
and David Eisenberg. "Protein Interactions: Two Methods
for Assessment of the Reliability of High Throughput 
Observations." Mol. Cell. Proteomics 1 (May 2002): 349-356.
Assessing and filtering interaction data
DIP_CORE is a set of 3,003 interactions considered higher confidence. 
DIP_CORE interactions either: 
1. Have been observed in a small-scale experiment (2,246) 
2. Have been observed in more than one experiment (1,179) 
3. Have been confirmed by PVM (1,428) 
verification field indicates that one (1) small-scale experiment supports this interaction 
Deane et al. Mol. & Cell. Proteomics (2002) 1.5, 349-356 
Assessing and filtering interaction data
3. Topology criteria use information about the observed vs. expected 
interaction network. 
We will discuss the paper:
Bader et al. “Gaining confidence in high-throughput protein interaction 
networks”
Nature Biotechnology (2004) 22, 78-85
Predicting protein-protein interactions
1. Sequence methods 
How can you predict that an interaction might occur between two 
proteins based purely on sequence data? 
Review:  Valencia & Pazos, Current Opinion in Structural Biology (2002) 12, 368-373 
Predicting protein-protein interactions
1. 	Sequence methods 
phylogenetic profiles - based on the joint presence/absence 
of a pair of proteins in a large number of genomes; recall the 
first literature discussion class (Pellegrini et al.). 
co-evolution - as assessed by similarity of phylogenetic trees. 
“mirrortree” method compares the distance matrices for 
generating trees; requires lots of sequences and a good 
alignment! 
gene fusions - genes encoding interacting proteins in one 
organism are sometimes fused into a single gene in another. 
Look for these occurrences. 
gene neighborhood - for bacteria, the arrangement of genes 
in operons means that interacting proteins are often encoded 
in adjacent sites in the genome 
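The phylogenetic profile idea in the list above can be sketched as a comparison of presence/absence vectors. The profiles below are invented (one position per genome, 1 = a homolog is present), and Hamming distance is only one of several possible ways to compare them.

```python
def profile_distance(p, q):
    """Hamming distance between two presence/absence profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(x != y for x, y in zip(p, q))

def candidate_pairs(profiles, max_distance=0):
    """Protein pairs whose profiles differ in at most max_distance genomes."""
    names = sorted(profiles)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if profile_distance(profiles[a], profiles[b]) <= max_distance
    ]

# Invented profiles across six genomes
profiles = {
    "P1": (1, 0, 1, 1, 0, 1),
    "P2": (1, 0, 1, 1, 0, 1),
    "P3": (0, 1, 1, 0, 1, 1),
    "P4": (1, 0, 1, 0, 0, 1),
}

print(candidate_pairs(profiles))                  # identical profiles only
print(candidate_pairs(profiles, max_distance=1))  # allow one mismatch
```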
Predicting protein-protein interactions
1. Sequence methods 
correlated mutations - the idea is that interacting positions on different proteins should 
co-evolve so as to maintain the interface.  Look for correlation between sequence 
changes at one position and those at another position in a multiple sequence 
alignment (MSA). 
Recall Süel et al. “Evolutionarily conserved networks of residues mediate allosteric 
communication in proteins” 
$\Delta\Delta G_{i,j} = \Delta G_j - \Delta G_{j|i}$, where $\Delta G = kT \cdot \ln(P(x \text{ at } j) / P_{MSA}(x))$ 
Pazos & Valencia “In silico two-hybrid systems for the selection of physically interacting 
protein pairs.” PROTEINS (2002) 47, 219-227 
Pearson coefficient $r_{ij} = \sum_{k,l} (S_{i,k,l} - \langle S_i \rangle)(S_{j,k,l} - \langle S_j \rangle) / \text{normalization}$ describes the correlation 
between amino acid positions i and j in two proteins.  Here $S_{i,k,l}$ is a measure of the 
similarity of the aa at position i in sequences k and l, and $\langle S_i \rangle$ is the average of these 
values. k and l are sequences taken from a MSA that has the same number of 
sequences, from the same species, for sites i and j. 
Problem: need lots of sequences, and the method is very sensitive to the alignment used. 
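The correlation between two alignment columns can be sketched directly from the definition above. One simplification is assumed here: similarity is scored as identity (1 if the two residues match, 0 otherwise), whereas a substitution-matrix similarity would normally be used. The columns are invented, and each must list the residues for the same species in the same order.

```python
from itertools import combinations
from math import sqrt

def pair_similarities(column):
    """S[k,l] for every sequence pair k < l: 1.0 if the residues match, else 0.0."""
    return [1.0 if a == b else 0.0 for a, b in combinations(column, 2)]

def position_correlation(col_i, col_j):
    """Pearson correlation between the pairwise-similarity patterns of two columns."""
    if len(col_i) != len(col_j) or len(col_i) < 2:
        raise ValueError("columns must cover the same species, at least two of them")
    s_i, s_j = pair_similarities(col_i), pair_similarities(col_j)
    m_i, m_j = sum(s_i) / len(s_i), sum(s_j) / len(s_j)
    cov = sum((x - m_i) * (y - m_j) for x, y in zip(s_i, s_j))
    norm = sqrt(sum((x - m_i) ** 2 for x in s_i) * sum((y - m_j) ** 2 for y in s_j))
    return cov / norm if norm else 0.0  # fully conserved columns carry no signal

# Invented columns from two aligned proteins, four species each
print(position_correlation("AAVV", "LLII"))  # changes happen in the same species
print(position_correlation("AAVV", "LILI"))  # changes happen in different species
print(position_correlation("AAAA", "LLII"))  # conserved column
```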
Predicting protein-protein interactions
2. Structure-based methods 
Docking is a large field in and of itself, which involves predicting how 
two known structures will interact. It even has its own prediction 
contest - CAPRI, like CASP. 
The main issues in docking are, as always when modeling structure, (1) 
sampling the conformational space and (2) selecting the correct solution 
Docking approaches require structures of both interacting components. 
Frequently, conformational changes accompany protein interactions. 
Docking methods generally require a structure of the bound
conformation to predict interactions correctly. Modeling 
conformational flexibility is hard.
We don’t have enough structures or good enough docking methods to make high-
throughput prediction of protein-protein interactions practical at this point. 
Predicting protein-protein interactions
2.	 Structure-based methods 
What do you do when you don’t have a structure? 
Homology modeling methods (Aloy & Russell PNAS (2002) 99, 5896-5901) 
For target proteins that have homologs that form a complex of known 
structure: 
(1) Identify pairs of positions that form interactions in the known structure 
(2)	 align the target proteins to the template proteins and score the interacting 
residue pairs identified in step (1) with a knowledge-based potential. 
(3) Normalize using the scores for pairs of random sequences 
(4) Z-scores above a certain cutoff indicate that a complex is likely. 
~65% accuracy when assessing whether different fibroblast growth factors bind 
to various receptors (4 structures available, 252 possible pairings evaluated).  
Not practical to apply at the genome level due to lack of homologous 
complexes with structures. 
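Steps (1) to (4) can be sketched with a toy scoring function. Everything specific here is an assumption for illustration: the pair potential is a made-up rule (not the knowledge-based potential of the paper), the contact list stands in for the positions read from a template structure, and the background is built by shuffling the target sequences. Higher scores are treated as more favourable.

```python
import random
from statistics import mean, stdev

HYDROPHOBIC, POSITIVE, NEGATIVE = set("AVLIFM"), set("KR"), set("DE")

def pair_potential(a, b):
    """Made-up contact score: reward opposite charges and hydrophobic pairs."""
    if (a in POSITIVE and b in NEGATIVE) or (a in NEGATIVE and b in POSITIVE):
        return 1.0
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return 1.0
    if (a in POSITIVE and b in POSITIVE) or (a in NEGATIVE and b in NEGATIVE):
        return -1.0
    return 0.0

def interface_score(seq_a, seq_b, contacts):
    """Sum the potential over the interface position pairs taken from the template."""
    return sum(pair_potential(seq_a[i], seq_b[j]) for i, j in contacts)

def random_background(seq_a, seq_b, contacts, n=200, seed=0):
    """Scores for shuffled versions of both sequences (same composition)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        a, b = list(seq_a), list(seq_b)
        rng.shuffle(a)
        rng.shuffle(b)
        scores.append(interface_score(a, b, contacts))
    return scores

def z_score(score, background):
    """Standard deviations above the random mean (undefined if background is constant)."""
    return (score - mean(background)) / stdev(background)

# Invented aligned targets and template contacts (index in A, index in B)
seq_a, seq_b = "KLAEGSKL", "ELDKGSEL"
contacts = [(0, 0), (1, 1), (3, 3), (6, 6), (7, 7)]
score = interface_score(seq_a, seq_b, contacts)
print(score, round(z_score(score, random_background(seq_a, seq_b, contacts)), 2))
```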
Predicting protein-protein interactions
2. Structure-based methods 
What do you do when you don’t have a structure? 
Threading methods (Lu et al., MULTIPROSPECTOR) 
Phase I: Thread each target sequence onto a library of folds using a 
permissive cutoff 
Phase II: Take pairs of fold assignments and thread the targets onto 
complexes of these folds (complexes of known structure) 
Evaluate an interfacial score to determine how complementary the fit is 
S_interface = -log (N_obs_in_PDB(i,j)/N_expect_by_chance(i,j)) 
Used library of 768 complexes, predicted 7,321 interactions for yeast proteins. 
Hard to assess performance. One way is to look at some property that you 
believe should correlate with interactions, e.g. co-localization or function. 
Lu et al., PROTEINS (2002) 49, 350-364, Genome Research (2003) 13, 1146-1154 
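The interfacial score formula on this slide can be illustrated with a small table of contact counts. The counts are invented, and summing the per-pair terms over the contacts of a threaded complex is an assumption made for this sketch. With this sign convention, residue pairs seen at interfaces more often than expected by chance get negative (favourable) values.

```python
from math import log

# Invented counts: observed interface contacts vs. the number expected by chance
observed = {("L", "L"): 40, ("K", "E"): 30, ("K", "K"): 5}
expected = {("L", "L"): 20.0, ("K", "E"): 15.0, ("K", "K"): 10.0}

def pair_score(i, j):
    """-log(N_obs / N_expected) for residue types i and j, in either order."""
    key = (i, j) if (i, j) in observed else (j, i)
    return -log(observed[key] / expected[key])

def interface_total(contacts):
    """Sum of pair scores over the residue-type pairs in contact (assumed additive)."""
    return sum(pair_score(i, j) for i, j in contacts)

print(round(pair_score("L", "L"), 3))  # over-represented pair, negative
print(round(pair_score("K", "K"), 3))  # under-represented pair, positive
print(round(interface_total([("L", "L"), ("E", "K"), ("K", "K")]), 3))
```

A pair type with zero observed contacts would need special handling (for example a pseudo-count), which this sketch leaves out.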
2. Structure-based methods
Threading methods Lu, L, H Lu, and J Skolnick. "MULTIPROSPECTOR: An Algorithm for The
Prediction of Protein-protein Interactions by Multimeric Threading." Proteins 49, no. 3 (15 November
 2002): 350-64.
Co-localization:
Are the proteins found
in the same part of the cell?
Please see figure 2 of
Lu, Long, Adrian K. Arakaki, Hui Lu, and Jeffrey Skolnick. "Multimeric Threading-Based Prediction of
Protein–Protein Interactions on a Genomic Scale: Application to the Saccharomyces Cerevisiae Proteome."
Genome Res.13 (June 2003): 1146-1154.
Predicting protein-protein interactions
3. 	Methods based on data 
Jansen et al. - next class 
Next class: literature about protein-protein interaction 
assessment and prediction
Bader et al. “Gaining confidence in high-throughput protein interaction 
networks” Nature Biotechnology (2004) 22, 78-85 
Jansen et al. “A Bayesian networks approach for predicting protein-
protein interactions from genomic data” Science (2003) 302, 449-453. 
Focus on: 
1.	 What are they trying to do? 
2.	 What do they use as a set of positive and negative examples? 
3.	 What is their basis for deciding if an interaction is good or not? 
4.	 How well do the methods work?  How can you tell? 
5.	 Do they learn anything new or exciting about interactions in the 
proteome? 

