7.91 – Lecture #6: Protein Secondary Structure Prediction
Michael Yaffe
7.93 – Lecture #9: Protein Secondary Structure Prediction -and- Motif Searching with Scansite

Outline
• Brief review of protein structure
• Chou-Fasman predictions
• Garnier, Osguthorpe and Robson
• Helical wheels and hydrophobic moments
• Neural networks
• Nearest neighbor methods
• Consensus prediction approaches

Hierarchy of protein structure

Resonance of peptide bond implies planarity

Dihedral angles define secondary structure
(Figure: please refer to Branden, Carl, and John Tooze. Introduction to Protein Structure. 2nd ed. Garland Publishing, Inc., 1999. ISBN: 0815323042.)

Structure of α-helices
(Figure: see Branden and Tooze.)

α-helix dipole moment
(Figure: see Branden and Tooze.)

Anti-parallel β-sheets
(Figure: see Branden and Tooze.)

The “pleat”: a function of the tetrahedral Cα carbon
(Figure: see Branden and Tooze.)

The parallel β-sheet
(Figure: see Branden and Tooze.)

Protein Classes – defined by secondary structural elements
• All α-helical
• All β-sheet
• α/β-protein

Chou-Fasman
Biochemistry, 13: 222-245, 1974
• Statistical method
• Based on 15 proteins of known conformation, 2473 total amino acids
• Determined “protein conformational parameters” Pα, Pβ, based on P = f_i^s / (Σ_j f_j^s / 20), i.e. the frequency of residue i in conformation s divided by the average over all 20 amino acids. Values range from about 0.5 to 1.5.

Helical residues
Residue  Pα    Class
Glu      1.53  Hα  strong helix former
Ala      1.45  Hα
Leu      1.34  Hα
His      1.24  hα  helix former
Met      1.20  hα
Gln      1.17  hα
Trp      1.14  hα
Val      1.14  hα
Phe      1.12  hα
Lys      1.07  Iα  weak helix former
Ile      1.00  Iα
Asp      0.98  iα  helix indifferent
Thr      0.82  iα
Ser      0.79  iα
Arg      0.79  iα
Cys      0.77  iα
Asn      0.73  bα  helix breaker
Tyr      0.61  bα
Pro      0.59  Bα  strong helix breaker
Gly      0.53  Bα

β-Sheet residues
Residue  Pβ    Class
Met      1.67  Hβ  strong sheet former
Val      1.65  Hβ
Ile      1.60  Hβ
Cys      1.30  hβ  sheet former
Tyr      1.29  hβ
Phe      1.28  hβ
Gln      1.23  hβ
Leu      1.22  hβ
Thr      1.20  hβ
Trp      1.19  hβ
Ala      0.97  Iβ  weak sheet former
Arg      0.90  iβ  sheet indifferent
Gly      0.81  iβ
Asp      0.80  iβ
Lys      0.74  bβ  sheet breaker
Ser      0.73  bβ
His      0.71  bβ
Asn      0.65  bβ
Pro      0.62  bβ
Glu      0.26  Bβ  strong sheet breaker

Residues ranked from strongest former to strongest breaker
α-helical: Glu, Ala, Leu, His, Met, Gln, Trp, Val, Phe, Lys, Ile, Asp, Thr, Ser, Arg, Cys, Asn, Tyr, Pro, Gly
β-sheet: Met, Val, Ile, Cys, Tyr, Phe, Gln, Leu, Thr, Trp, Ala, Arg, Gly, Asp, Lys, Ser, His, Asn, Pro, Glu

Chou-Fasman
Empirical rule set for secondary structure nucleation using <Pα>, <Pβ>
• Search for helical nuclei: locate clusters of 4 helix formers (Hα or hα) out of 6 residues. Unfavorable if more than 1/3 are helix breakers (bα or Bα).
• Extend helical segments in both directions until terminated by tetrapeptides with <Pα> < 1.0. Helix breakers include b4, b3i, etc. Some of the tetrapeptide residues can be in the helical ends (except Pro).
• Refine boundaries: Pro, Asp, Glu prefer the N-terminal end; His, Lys, Arg prefer the C-terminal end.
• Rule #1 – Any segment > 6 residues with <Pα> > 1.03 and <Pα> > <Pβ>, satisfying the above conditions, is predicted as helical.

Chou-Fasman
Empirical rule set for secondary structure nucleation using <Pα>, <Pβ>
• Search for β-sheet nuclei: locate clusters of 3 β residues (Hβ or hβ) out of 5 residues. Unfavorable if more than 1/3 are β breakers (bβ or Bβ).
• Extend β-sheet segments in both directions until terminated by tetrapeptides with <Pβ> < 1.0. β-sheet breakers include b4, b3i, etc.
• Refine boundaries: Glu occurs rarely in β regions, and Pro is equally uncommon within inner β-sheets.
  Charged residues are rare at either end. Trp occurs most frequently at the N-terminal end.
• Rule #2 – Any segment > 5 residues with <Pβ> > 1.05 and <Pβ> > <Pα>, satisfying the above conditions, is predicted as β-sheet.

Chou-Fasman Results
• ~50-60% accurate in reality, though the paper claimed much higher results (limited data set)
• Seemed to be particularly less accurate for β-sheets.

Chou-Fasman β-Turn potentials
• Typical β-turn is 4 amino acids

Residue  f(i)   f(i+1)  f(i+2)  f(i+3)
Arg      0.051  0.127   0.025   0.101
Asn      0.101  0.086   0.216   0.065
Asp      0.137  0.088   0.069   0.059
Pro      0.074  0.272   0.012   0.062
Trp      0.045  0.000   0.045   0.205

<f_j> = Σj/N = 165/2343 = 0.07
P(t) = f(i) × f(i+1) × f(i+2) × f(i+3)
P(t) > 7.5 × 10^-5 → turn

Garnier, Osguthorpe, Robson
• Alternative approach to Chou-Fasman.
• Original version called “GOR”. Now up to GOR 3. Uses a scanning window of 17 amino acids centered on the residue being examined.
• Based on the assumption that each amino acid individually influences the propensity of the central residue to adopt a particular secondary structure.
• Each flanking position is evaluated independently …like a PSSM!

GOR Scoring Tables (original)
3 states – α-helix, β-sheet, turn
[Figure: three scoring tables (α-helix, β-sheet, turn), each indexed by amino acid (AA) and window position (Pos) from -8 to +8, shown against the example 17-residue window WRQICTVNAFLCEHSYK.]
Note – each table is INDEPENDENT of the central amino acid!

GOR Scoring Tables
• Add the scores and assign secondary structure based on the highest score.
• Problems: Limited data set for the scoring table. 17 amino acids gives 20^17 = 1.3 × 10^22 possible sequences, yet the tables are based on only 200-300 proteins!
• What do the scoring numbers mean? We are treating them as log-odds ratios, representing units of structural information.

GOR Scoring Tables
• Based on the information theory approach of Robson and Pain.
• Step 1 – Consider the joint probability of amino acid R being in conformation S. The information function is I(S,R) = log[P(S,R) / (P(R)P(S))]. This is Chou-Fasman.
• Step 2 – For Garnier, in each conformation, calculate the difference of information functions, I(ΔS,R) = I(S,R) - I(S’,R) = log[P(S,R)/P(S’,R)] + log[P(S’)/P(S)], where S’ = all other conformations except S. These terms are the values in the lookup tables.
• Probability terms are calculated from observed frequencies in the database of known structures as of 1978. Can actually use the net probability sum to calculate absolute probability ratios, and so can estimate likelihoods.

GOR Results
• ~65% accurate.
• Can use information from experiments (circular dichroism) to improve accuracy of predictions.
• Later versions let pairwise combinations of amino acids in the flanking regions + central amino acid (GOR-2), or combinations of two amino acids in the flanking region (GOR-3), influence the final conformation of the central amino acid.

Fred Cohen’s Approach – 1989
• Both Garnier and Chou-Fasman work well for globular proteins
• Cohen: Turns demarcate elements of secondary structure
• Therefore, start by predicting turns first.
• Fill in helices, strands after that.
• Use pattern recognition algorithms (forerunner of neural networks).
• In α/β proteins: ~85% accurate. But how do you know you have an α/β protein to begin with?

Helical wheels and hydrophobic moments
• Amphipathic helices
• Alternating hydrophobic and hydrophilic positions in β-sheets
(Figure: please refer to Branden, Carl, and John Tooze. Introduction to Protein Structure. 2nd ed. Garland Publishing, Inc., 1999. ISBN: 0815323042.)

Eisenberg – Hydrophobic moments
• Standard approach – Kyte and Doolittle – calculate hydrophobicity using a running window and a typical scale of hydrophobicity based on oil-water partition coefficients of free amino acid side chains.
• Eisenberg’s idea – Plot hydrophobicity as a function of sequence number and look for periodic repeats by Fourier transform:
  Period = 2 amino acids – β sheet
  Period = 3-4 amino acids – α helix

Neural network approach
• Look for amino acid patterns in a protein sequence that coincide with known secondary structures.
• Use machine learning approaches and a test set of proteins to decipher the best pattern recognition algorithm.
• Simulate the operation of the brain, where complex synaptic connections underlie function. Some neurons collect data, some process data, some deliver output.

Neural network approach
• Use a sliding window of 13-17 amino acids.
• 3 processing layers in a feed-forward multilayer network: input layer → hidden layer → output layer
• Each input is modified by a weighting factor, and many inputs are fed into the hidden layer. The hidden layer integrates the inputs and outputs a number close to 0 or 1 by feeding the inputs into a sigmoid trigger function that mimics neuronal firing.
• Signals from the hidden units are sent to each of three output units (one each for helix, sheet, or other), weighted again, and all the inputs integrated again. The final output from each output unit is a 1 (predicts that particular secondary structure) or a 0 (not predicted).

Neural network approach
[Figure: feed-forward network. The input sequence window LSFGYCVKDRPSF is encoded as binary inputs (a single 1 among 0s per residue) in the input layer; signals S_j × W_ij feed the hidden units H_j and then the output units O_j (α, β, coil), giving a predicted structure such as 1 0 0. Each unit applies S_out = 1/(1 + e^(-k·S_in)).]

Neural network approach
• Train the network on a training set to optimize the weighting factors W_ij using feedback.
• Usually done by jack-knife testing.
• Can use multiple different network architectures and select the final secondary structure by jury decision.
• Increases predictive accuracy to ~70-72%.
• Best example: PHD (Profile network from HeiDelberg).
• Gives reliability indices for each predicted portion of the protein based on differences between output signals from the network.
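
A minimal sketch of the feed-forward pass described on the neural network slides above, assuming a 13-residue window, one binary input per amino acid per position, and the sigmoid S_out = 1/(1 + e^(-k·S_in)). The layer sizes, the function names, and the random weights are hypothetical placeholders: a real predictor such as PHD learns its weights from training data, so the state printed here carries no biological meaning.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = ("helix", "sheet", "other")


def sigmoid(s_in, k=1.0):
    """Sigmoid trigger function: S_out = 1 / (1 + e^(-k * S_in))."""
    return 1.0 / (1.0 + math.exp(-k * s_in))


def one_hot_window(window):
    """Encode each residue as 20 binary inputs with a single 1 per residue."""
    bits = []
    for residue in window:
        bits.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return bits


def layer(inputs, weights):
    """Weight every input, sum per unit, and pass each sum through the sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def random_weights(n_units, n_inputs, rng):
    """Placeholder weights (assumption); training would replace these."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
            for _ in range(n_units)]


def predict_central_residue(window, w_hidden, w_output):
    """Input layer -> hidden layer -> three output units; pick the strongest."""
    hidden = layer(one_hot_window(window), w_hidden)
    outputs = layer(hidden, w_output)
    best = max(range(len(STATES)), key=lambda j: outputs[j])
    return STATES[best], outputs


rng = random.Random(0)                    # fixed seed so the run is repeatable
window = "LSFGYCVKDRPSF"                  # window from the slide; central residue is V
w_hidden = random_weights(5, len(window) * len(AMINO_ACIDS), rng)
w_output = random_weights(len(STATES), 5, rng)
state, outputs = predict_central_residue(window, w_hidden, w_output)
print(state, [round(o, 3) for o in outputs])
```

Sliding the window one residue at a time along the sequence and repeating the call would give one prediction per position. The gap between the largest and second-largest output is the kind of difference that a reliability index could be built on.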
Nearest-neighbors Methods
• Also machine learning-based
• Identify sequences similar to the query in known structures. The known structures in the training set are divided into ~16 amino acid sequence fragments, and the secondary structure of the central amino acid is recorded.
• Take a similar window in the query sequence and match it to the best ~50 sequences in the training set. Use the frequency of secondary structure of the central amino acid in the training data to infer structure in the query.
• Feed these structural predictions as input into a neural network to obtain the final prediction.
• Very accurate algorithms: >72% correct prediction

Nearest-neighbors Methods
• PREDATOR – another nearest-neighbor (NN) method that also considers amino acid patterns that can form H-bonds between adjacent β-strands and between n and n+4 in α-helices.
• Also considers substitutions found in sequence alignments, and treats gaps as likely to be “coils”
• Accuracy is ~75%: most accurate prediction algorithm to date.

Best overall strategy
• JPRED http://jura.ebi.ac.uk:8888/
• Developed by Geoffrey Barton
• A consensus approach to predicting secondary structure. Utilizes 6 different methods for prediction: PHD, linear discrimination (DSC), NNSSP, PREDATOR, ZPRED (conservation number weighted prediction), MULPRED (consensus single sequence method combination).
• *** Looks in PDB for homologues ***
• Available over the web, Q3 = 72.9%
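
A minimal sketch of the consensus idea and of the Q3 score quoted above. It assumes each method returns a string of per-residue labels (H = helix, E = strand, C = coil) and combines them by simple majority vote. This is an illustration only, not JPRED's actual combination rule. The function names, the tie-break in favor of coil, and the example strings are all hypothetical. Q3 is taken here as the percentage of residues whose three-state label matches the observed structure.

```python
from collections import Counter


def consensus(predictions):
    """Per-residue majority vote over equal-length strings of H/E/C labels.

    Ties are broken in favor of coil ('C'), an arbitrary choice for this sketch.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column)
        top = max(counts.values())
        winners = [label for label in counts if counts[label] == top]
        result.append(winners[0] if len(winners) == 1 else "C")
    return "".join(result)


def q3(predicted, observed):
    """Percentage of residues predicted in the correct one of three states."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Made-up outputs of three hypothetical methods for a 10-residue stretch.
method_a = "HHHHCCEEEC"
method_b = "HHHCCCEEEC"
method_c = "HHHHHCCEEC"
observed = "HHHHHCEEEC"  # made-up "true" structure for the same stretch

combined = consensus([method_a, method_b, method_c])
print(combined, q3(combined, observed))
```

With these made-up inputs the vote yields HHHHCCEEEC, which matches the observed string at 9 of 10 positions, so Q3 is 90.0.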