PROTEINS: Structure, Function, and Bioinformatics 68:48–56 (2007)

Benchmarking of TASSER in the Ab Initio Limit

Jose M. Borreguero and Jeffrey Skolnick*

Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30318

ABSTRACT

A significant number of protein sequences in a given proteome have no obvious evolutionarily related protein in the database of solved protein structures, the PDB. Under these conditions, ab initio or template-free modeling methods are the sole means of predicting protein structure. To assess its expected performance on proteomes, the TASSER structure prediction algorithm is benchmarked in the ab initio limit on a representative set of 1129 nonhomologous sequences ranging from 40 to 200 residues that cover the PDB at 30% sequence identity and which adopt α, α+β, and β secondary structures. For sequences in the 40–100 (100–200) residue range, as assessed by their root mean square deviation from native, RMSD, the best of the top five ranked models of TASSER has a global fold that is significantly close to the native structure for 25% (16%) of the sequences, and with a correct identification of the structure of the protein core for 59% (36%). In the absence of a native structure, the structural similarity among the top five ranked models is a moderately reliable predictor of folding accuracy. If we classify the sequences according to their secondary structure content, then 64% (36%) of α, 43% (24%) of α+β, and 20% (12%) of β sequences in the 40–100 (100–200) residue range have a significant TM-score (TM-score ≥ 0.4). TASSER performs best on helical proteins because there are fewer secondary structural elements to arrange in a helical protein than in a β protein of equal length, since the average length of a helix is longer than that of a strand. In addition, helical proteins have shorter loops and dangling tails.
If we exclude these flexible fragments, then TASSER has similar accuracy for sequences containing the same number of secondary structural elements, irrespective of whether they are helices and/or strands. Thus, it is the effective configurational entropy of the protein that dictates the average likelihood of correctly arranging the secondary structure elements. Proteins 2007;68:48–56. © 2007 Wiley-Liss, Inc.

Key words: ab initio folding; protein folding; protein structure prediction

Grant sponsor: Division of General Medical Sciences, National Institutes of Health; Grant number: GM-37408.

*Correspondence to: Jeffrey Skolnick, Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA 30318. E-mail: skolnick@gatech.edu

Received 22 August 2006; Revised 10 November 2006; Accepted 2 January 2007. Published online 19 April 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21392

INTRODUCTION

For roughly 25% of the sequences in a given proteome, threading fails to identify a structurally related template that can be used in subsequent modeling [1]. Since it will take many years until there is at least one deposited structure for every protein family [2], ab initio or template-free modeling methods are the only tool available for the structure prediction of these hard cases. Among the various realizations of ab initio methods are those that employ either physics-based [3–5] or knowledge-based potentials derived from a statistical analysis of protein structural databases [6,7]. While these approaches are in principle applicable to any sequence, in practice, because no global template information is used, their accuracy has been rather limited, as evidenced by their recent performance in CASP6 [8].

In the past, ab initio methods were validated on a relatively small number of proteins, from which it is difficult to extract general trends, including the expected success rate. One trend that did emerge is that the ab initio folding of helical proteins was more successful than that of proteins containing β sheets [4–6,9]. Often, this reflected a problem with the hydrogen bond term, which did not work well for β sheet structures. Alternatively, for a given chain length, since the mean length of a helix is longer than that of a β strand, the number of secondary structural elements is smaller in helical than in β proteins [10]. This effect might have contributed to the success rate, but to establish this, a large, representative benchmark set is required.

Recently, we developed the TASSER (Threading/ASSEmbly/Refinement) algorithm, which is designed to span the comparative modeling to ab initio folding regimes. We reported the results for the large scale benchmarking in the limit of weakly homologous, single and multiple domain proteins [11,12], where reasonable structural templates can be identified that may or may not be evolutionarily related to the sequence of interest. We also explored its accuracy in the comparative modeling regime, where there is a clear evolutionary relationship between the target and template structures [13]. As expected, the quality of the prediction deteriorates when the templates identified from threading are unrelated to the target structure of interest. We also applied TASSER to the comprehensive structure prediction of all human GPCRs below 500 residues [14], as well as benchmarked TASSER on all families of membrane proteins with solved crystal structures [15]. In all cases, the ab initio component of TASSER was applied to model the loop and tail regions lacking a template alignment. However, there has been no systematic examination of the performance of TASSER in the template-free limit. Here, we address this issue for single domain proteins ranging from 40 to 200 residues in length.

METHODS AND MATERIALS

Construction of the Benchmark Sets

To generate the set of sequences below 100 residues, S100, we retrieve a representative set of α, β, and α+β protein sequences from the PDB that are under 100 residues and whose pairwise sequence identities are no higher than 30% [1]. The resulting set contains 131 α, 60 β, and 102 α+β proteins (293 total), according to SCOP [16]. For the set of sequences between 100 and 200 residues, S200, we similarly retrieve α, β, and α+β sequences from the PDB with an identical sequence identity cut-off (30%). The resulting set contains 230 α, 337 β, and 269 α+β proteins (836 in total). Predicted models and native structures are available on our website at http://cssb.biology.gatech.edu/skolnick/files/abinitio/

Overview of TASSER in the ab initio limit

The protein is described by a reduced protein model, where each residue is represented by the Cα and the side-chain center of mass coordinates. Side-chain center of mass coordinates are determined from the Cα coordinates and a two-rotamer approximation. Initial Cα coordinates are generated by first projecting template coordinates of the Cα atoms onto a high coordinated cubic lattice, then connecting consecutive template fragments with an on-lattice random walk of Cα–Cα bond vectors. Most of the energy potential terms in TASSER have been previously described [17,18]. Here, we outline its essential ingredients. The potential energy includes: (i) generic hydrogen bonding; (ii) side chain contact energies between residues; and (iii) short-range backbone correlations reflecting the propensity to adopt a particular secondary structure. Energy terms containing parameters that take into account the target protein's sequence are: (i) amino acid burial propensity; (ii) short-range backbone correlations and a bias in the hydrogen bond to adopt the PSIPRED [19] predicted secondary structure; and (iii) a contact potential derived from the alignment of pairs of small (11 residues) sequence fragments [11,18]. All templates whose global pairwise sequence identity is higher than 30% are a priori excluded from the calculations.

Protein conformational space is searched with a variant of the replica-exchange Monte Carlo algorithm. For each target protein, 40 different simulations with a total of 8 × 10^7 Monte Carlo moves are attempted. Simulations are performed concurrently and over a broad range of temperatures (replicas). Protein conformations for replicas with similar temperatures are swapped at regular time intervals, with a probability of accepting the swap that depends on the energy difference between the two conformations. Within each replica simulation, Monte Carlo moves, comprising random selection plus coordinate change of a protein fragment ranging from two to six amino acids in size, are performed. Changes in the protein conformation are accepted or rejected based on an evaluation of the energy difference before and after the conformational change [20].

Structural Similarity Measures

We use the root mean square deviation (RMSD) [21], the Z-score of the relative root mean square deviation (Z-rRMSD) [22], and the TM-score [23] as three metrics to assess the structural similarity of the models to the native structure. While RMSD is a more intuitive measure, the same RMSD value represents models of different quality for sequences of varying lengths. Z-rRMSD is independent of target sequence length, and from a practical point of view, we consider a protein as folded if the Z-rRMSD of the model is lower than 4.25 (P-value = 10^-5). In cases when only a fraction, albeit significant, of the residues fold close to native, the low RMSD signal from these residues is lost due to the high RMSD values of the other residues.
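The loss of the core's signal in a global RMSD can be illustrated with a small numerical sketch. The per-residue deviations below are hypothetical, and the score function is only a TM-score-style sum: it assumes the functional form from the published definition of the TM-score [23], which is not restated in this paper, and it evaluates a single fixed superposition rather than searching for the optimal one as the real TM-score does.

```python
import math

def rmsd(distances):
    """Root mean square of per-residue deviations (angstroms)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def tm_like_score(distances):
    """TM-score-style sum for one fixed superposition.

    Assumed form: (1/L) * sum(1 / (1 + (d_i / d0)^2)), with
    d0 = 1.24 * (L - 15)^(1/3) - 1.8 and L > 15. The real TM-score
    additionally maximizes this sum over superpositions.
    """
    length = len(distances)
    d0 = 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / length

# Hypothetical 100-residue model: a 60-residue core within 2 A of native
# and 40 loop or tail residues displaced by 20 A.
deviations = [2.0] * 60 + [20.0] * 40

print(f"global RMSD   = {rmsd(deviations):.2f} A")
print(f"TM-like score = {tm_like_score(deviations):.2f}")
```

For these made-up numbers the global RMSD is about 12.7 Å even though 60% of the residues lie within 2 Å of native, while the TM-like sum is about 0.47.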
The resulting RMSD and Z-rRMSD values do not distinguish these cases from the alignment of a random structure to native. In contrast, the TM-score can report the subset of residues with coordinates close to native, and its value distinguishes these cases from a random global alignment. In addition, the TM-score is sequence-length independent. Again, for practical purposes we consider a protein as folded if it has a TM-score of 0.4 or higher. This value usually indicates that more than half of the residues have coordinates close to native. The average TM-score of a pair of randomly related structures is 0.17 [24], and that of the best structural alignment of a pair of randomly related structures is 0.30, with a standard deviation of 0.01 [24].

Clustering Algorithm

We employ the SPICKER [25] algorithm to cluster the structures generated by TASSER, and obtain an average structure (model) for each of the top five clusters ranked by cluster density. The density of a cluster is the number of cluster members divided by the average RMSD of the members to the average structure. We report results for the average structure having the best RMSD, Z-rRMSD, and TM-score to native, termed the best model, and for the average structure of the densest cluster, termed the first model.

RESULTS

S100 Set

The probability that the best model has an alignment to native better than some particular RMSD value [Fig. 1(a)] shows that models for α proteins are consistently more accurate than for α+β proteins, which in turn are more accurate than for β proteins. Other ab initio methods [6,9] also report that β proteins are the most difficult to fold. In contrast, the accuracy of TASSER when global templates can be successfully identified (which is not the situation described here) is independent of secondary structure class [11].

Fig. 1. (a) Probability of folding to native below a particular RMSD value for α (circle), α+β (triangle), and β (square) classes in the S100 set for the best model in the top five models. (b) Same as in (a), but using the Z-rRMSD measure. Dashed line indicates the Z-rRMSD = 4.25 threshold. (c) Probability of folding to native above a particular TM-score value.

The analogous probability distribution with the Z-rRMSD score [Fig. 1(b)] shows that TASSER predicts 33% of the best α models with a significant global alignment to native (Z-rRMSD threshold of 4.25, P-value threshold of 10^-5), and the average and standard deviation of the RMSD to native of these models is 4.1 ± 0.5 Å. Corresponding success rates for α+β and β proteins are 27% and 16%, respectively. The rank distribution of the best model in these predictions (not shown) is not significantly different from a flat distribution (χ² ≈ 0.5 for the three secondary structure classes). However, we find that best and first models coincide much more often than the randomly expected 20%: 32% for α, 48% for β, and 35% for α+β proteins. The average rank of the best model is independent of secondary structure class (α: 2.9 ± 0.7, α+β: 2.6 ± 0.7, β: 2.6 ± 0.7). The reason why the best and first models do not coincide even more often is that the five top models are structurally similar to each other, and the current ab initio TASSER potential may not, in some cases, single out the one closest to the native state from among a set of similar structures. The average and standard deviation of the TM-score among the top five models is 0.66 ± 0.16 for α proteins, 0.63 ± 0.16 for α+β proteins, and 0.50 ± 0.14 for β proteins.

Since there may be target sequences for which TASSER correctly predicts the coordinates of a significant fraction of the residues, we calculate the TM-score of the models to the native structure to detect such cases. The probability that the best model has an alignment to native better than some particular TM-score [Fig. 1(c)] shows that 78%, 42%, and 28% of α, α+β, and β sequences, respectively, have significant predictions (TM-score ≥ 0.4). These percentages are higher than those we obtain using the Z-rRMSD cut-off because the Z-rRMSD measure can detect only folds with overall global similarity to native. The average and standard deviation of the TM-score among the five models of the same target sequence is 0.70 ± 0.15 (α), 0.63 ± 0.16 (α+β), and 0.57 ± 0.15 (β), respectively. The structural similarity among the top five models, as assessed by their average TM-score, has a 0.5 correlation coefficient to the TM-score of the best model to native [Fig. 2(a)], and the correlation coefficient is independent of the secondary structure class. This structural similarity among the models arises when different initial conformations are driven via TASSER simulations towards conformations that are structurally close. This can happen if the parameterization of the potential energy reproduces some of the features of real proteins. Then, the different initial conformations converge to conformations that are structurally similar to the global minimum (native state), and therefore are similar to each other.

Fig. 2. (a) Scatter plot of the average TM-score between all possible pairs of the five top models versus TM-score-to-native of the best model for sequences in the S100 set. (b) Same as in (a), but for sequences in the S200 set.

S200 Set

Figure 3(a), for 100–200 residue proteins, shows the probability of folding a target sequence below a certain RMSD threshold. Again, α proteins are the easiest secondary structure class to fold. The percentage of sequences with the best model having a significant global fold (Z-rRMSD threshold of 4.25, P-value threshold of 10^-5) is 26%, 17%, and 12% for α, α+β, and β proteins [Fig. 3(b)], and the average and standard deviation of the RMSD to native of these models is 6.4 ± 0.8 Å. An analogous analysis with the TM-score shows that 55%, 38%, and 21% of α, α+β, and β sequences have their best model with a significant portion of the structure acceptably predicted, viz. with a TM-score ≥ 0.4 [Figure 3(c)]. The fraction of amino acids with coordinates close to native, the coverage, shows a strong linear correlation with TM-score (coverage ≈ 0.01 + 1.50 TM, r = 0.92). For instance, a model with a TM-score = 0.4 has 48% to 61% of its residues with an RMSD to native typically between 3.0 and 4.5 Å. The RMSD of this region also shows a linear correlation with TM-score (RMSD ≈ 5.2 Å − 3.8 Å × TM, r = 0.66). Thus, while only a few percent of targets have a global RMSD below 5 Å [see Fig. 3(a)], there are many more targets with at least half of their residues below this RMSD value, usually located in the protein's core.

Using the TM-score measure, the rank distribution of the best models is not significantly different from a flat distribution (χ² ≈ 0.3 for all three secondary structure classes), but as with the previous S100 set, the best and first models coincide more often than the randomly expected 20% (α: 40%, α+β: 28%, β: 46%). The average rank for the best model is 2.5 ± 0.7 (α), 2.6 ± 0.7 (α+β), and 2.3 ± 0.7 (β), respectively. Finally, we find a correlation coefficient of 0.7 between the TM-score of the best model to native and the average TM-score among the top five models [Figure 2(b)], so that the average TM-score among the models is a moderately reliable indicator of a successful prediction in the 100–200 residue range.

Fig. 3. (a) Probability of folding to native below a particular RMSD value for α (circle), α+β (triangle), and β (square) classes in the S200 set. (b) Same as in (a), but using Z-rRMSD. The dashed line indicates the Z-rRMSD = 4.25 threshold. (c) Probability of folding to native above a particular TM-score value.

Fig. 4. Two illustrative examples of successful global superposition for each of the α (a,b), β (c,d), and α+β (e,f) classes. We superimpose the model (colored backbone, from red in the N-terminal to blue in the C-terminal) onto the native structure (thin red backbone). Every case shows its PDB code, TM-score, and global RMSD to native.

We show in Figure 4 some representative target examples, with lengths between 64 and 141 residues, where TASSER provides a significant prediction. Figure 4(a) shows the best of the top five models (rank = 2) for Granulysin from human cytolytic T lymphocytes (PDB code 1l9lA, 74 residues), which adopts the Saposin-like fold (an orthogonal bundle of four helices). For this target, TASSER correctly predicts the positions of all Cα atoms, with a global RMSD of only 1.64 Å. Figure 4(b) shows the best of the top five models (rank = 1) for one monomer of the homo-tetrameric hemoglobin of Urechis caupo (1ithA, 141 residues). TASSER predicts a structure with an RMSD of 5.6 Å and a TM-score of 0.59, with the major errors due to one long loop and the C-terminus. Two reasons, both extrinsic to the protein chain, converge in this target to account for the misoriented residues. First, the absence of an explicit representation of the heme group in our model forces the C-terminal to occupy part of the volume left by the absent heme group in the protein core. Second, extensive interactions with the other monomers of the biological unit produce a tight geometry in the long loop that otherwise may not be the most stable in the monomeric state. Figure 4(c,d) show two representatives of the Immunoglobulin-like sandwich fold of β proteins. Figure 4(c) shows the best of the top five models (rank = 2) for the human soluble tissue factor (1danT, 75 residues). TASSER yields a structure with a global RMSD of 3.34 Å and a TM-score of 0.86.
Figure 4(d) shows the best of the top five models (rank = 1) for a mutant T cell receptor (TCR) V alpha domain (1ac6A, 110 residues) with an RMSD of 4.17 Å. This protein contains 12 strands arranged in two sheets. Figures 4(e,f) show two representative results for α+β proteins. Figure 4(e) shows the best of the top five models (rank = 1) for the cyanobacterial copper metallochaperone, ScAtx1 (1sb6A, 64 residues), with a global RMSD to native of only 1.99 Å. Finally, Figure 4(f) shows the best of the top five models (rank = 1) for one of the protein chains of the Grb10 Src homology 2 domain, a natural dimer (1nrvA, 100 residues). The global RMSD is 2.86 Å.

In addition to these previous examples, we show in Figure 5 two targets for which TASSER predicts a reasonably good substructure but with a high global RMSD. Figure 5(a) shows the best of the top five models (rank = 2) for the human interferon β (PDB code 1au1A, 166 residues), which adopts the 4-helical cytokine fold. For this target, TASSER correctly predicts 66% of the structure with an RMSD = 3.09 Å, corresponding to three helices of the four-helix bundle plus the extra helix characteristic of the cytokine fold. The remaining residues are located in the extra helix at the N-terminal and in two long connecting loops. The model has a global RMSD of 15.2 Å and a TM-score of 0.51. From these examples, we observe that unaligned residues tend to be located in the termini and long loops, resulting from incorrect assignment of secondary structure and/or the inherent disorder of the tails. The presence of other protein chains, prosthetic groups, metals, and binding molecules/peptides in the native state may also force the protein chain to adopt some local geometry that our monomer potential ignores. Figure 5(b) shows the best of the top five models (rank = 1) of the mannose 6-phosphate receptor (1c39A, 152 residues).
TASSER correctly predicts 64% of the structure with an RMSD = 3.08 Å, corresponding to seven of the nine strands. An incorrect assignment of secondary structure in the first 51 residues by the PSIPRED program results in TASSER generating a helix in place of the first strand, forcing the misalignment of the N-terminal and a global RMSD to native of 14.7 Å. A more dramatic example of incorrect secondary structure assignment occurred for target protein 1a30A. PSIPRED assigned helices to four of the nine native strands, resulting in a model with a different fold than native (TM = 0.27, RMSD = 11.7 Å). On the other hand, the JPRED secondary structure predictor server [28] correctly assigned eight of the nine strands. Thus, the use of a secondary structure meta-predictor could aid in improving the accuracy of secondary structure assignments.

Fig. 5. Two examples of significant substructure predictions with a high global RMSD. We superimpose the model (colored backbone, from red in the N-terminal to blue in the C-terminal) onto the native structure (thin red backbone). [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]

One example of correct secondary structure assignment but incorrect assembly into the native fold is target 1m4oA (TM = 0.23, RMSD = 10.1 Å), composed of three helices and eight strands. Both the native structure and the best predicted model have a very similar radius of gyration, but the native structure contains almost double the number of long-range contacts of the best model. Thus structures of lower contact order are predicted. The ratio (q) of model long-range contacts to native long-range contacts decreases from q = 0.9 for a contact between two residues separated by 30 residues to q = 0.6 for a contact separated by 160 residues.
This scenario may be typical of a target protein with a pair contact potential that is not specific enough to the target. Finally, other failed predictions include target proteins with very open structures (1mhlA), or two-domain proteins for which TASSER fails to reproduce the correct domain orientation (1bcpB). We will address these more complex cases after we fine-tune ab initio TASSER to produce low RMSD models for globular, single-domain proteins.

Dangling Termini

The prediction of protein termini is of special difficulty, as it is often observed that termini do not adopt a particular secondary structure or lack interactions with the rest of the protein. There are several scenarios for which the termini may not be correctly predicted: (i) incorrect packing against the protein core; (ii) interactions with another protein; and (iii) the termini point away from the protein core in the native structure for no apparent reason. One can argue that there is information missing in the last two scenarios that may prevent TASSER from predicting the correct orientation of the termini. We examine the disorder present in the termini of targets of the S200 set by counting the number of dangling residues. We define a residue at position i as dangling if it has no contacts with other residues, excluding the [i − 4, i + 4] range of local neighbors. In addition, we define two residues as being in contact if the distance between the centers of mass of their respective side chains is below some cut-off distance, taken from an analysis of PDB structures. By adding all the dangling residues found in the native structures of the 836 targets, we find 2682 dangling residues, which means that on average there are about three dangling residues per target (2682/836 ≈ 3.2). Figure 6(a) shows the probability that a target in the S200 set has less than some particular number of dangling residues, either in the N terminus, the C terminus, or both termini. Only relatively few targets have dangling tails of considerable length.
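The dangling-residue definition above can be sketched in a few lines. This is a minimal illustration, not the procedure used for the benchmark: the 6.5 Å cut-off is a hypothetical placeholder (the text says only that the cut-off comes from an analysis of PDB structures), one point per residue stands in for the side-chain center of mass, and counting consecutive dangling residues inward from each end is an assumed reading of how the terminal counts are obtained.

```python
import math

def dangling_mask(coords, cutoff=6.5, local=4):
    """Flag residues that have no contact outside [i - local, i + local].

    coords: one (x, y, z) point per residue, standing in for the
            side-chain center of mass.
    cutoff: contact distance in angstroms (hypothetical value).
    """
    n = len(coords)
    mask = []
    for i in range(n):
        has_contact = any(
            abs(i - j) > local and math.dist(coords[i], coords[j]) < cutoff
            for j in range(n)
        )
        mask.append(not has_contact)
    return mask

def terminal_dangling(mask):
    """Count consecutive dangling residues from the N and C termini."""
    n_term = 0
    for flag in mask:
        if not flag:
            break
        n_term += 1
    c_term = 0
    for flag in reversed(mask):
        if not flag:
            break
        c_term += 1
    return n_term, c_term
```

Given a list of per-residue coordinates for a native structure, `terminal_dangling(dangling_mask(coords))` returns the number of dangling residues at each terminus; trimming those residues before scoring corresponds to the recalculation described next.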
Thus, if we trim the dangling residues off all the targets and recalculate the percentage of targets in S200 having the best model with a significant TM-score, then the percentage of acceptable predictions shows a gain of 2.1% (TM-score > 0.4) with respect to the calculation including the dangling residues. This percentage gain is higher if we select the targets having dangling tails of considerable length, instead of all targets in S200. Figure 6(b) shows the percentage gain in TM-score if we select targets having a number of dangling residues above a certain cut-off value, trim these residues off, and then recalculate the percentage of targets with a significant TM-score (> 0.4). We find that the gain is exponential with the cut-off value, as shown in the fit of Figure 6(b). This indicates the relevance of dangling tails in the prediction of the native structure of some proteins.

Since TASSER has difficulty predicting the native coordinates of dangling tails (if indeed there are any), we examine the ability of TASSER to predict whether a residue will be dangling in the native state. Models generated by TASSER correctly identify 33% of these residues as dangling. The remaining 67% of the dangling residues make some contact(s) in the TASSER generated models. These results change only slightly if we focus on the N terminus (31%) or on the C terminus (34%). TASSER generated models also predict dangling residues which are not dangling in the native state. About 33% of the TASSER dangling residues are correctly identified. TASSER predicts a larger excess of dangling residues in the N terminus (18% of them are correctly identified) than in the C terminus (45% identified), due to the fact that there are fewer dangling residues in the N terminus of native states than in the C terminus. These results suggest that inclusion of a bias for predicted intrinsically disordered regions [26] in the TASSER potential energy may increase the ability of TASSER to discriminate a dangling tail from a terminus that interacts with the rest of the protein.

Fig. 6. (a) Probability that the number of dangling residues, nd, is bigger than some value L for the N terminus (squares), the C terminus (triangles), and both termini (circles). (b) For a subset of targets in the S200 set having L dangling residues in both termini, we show the percentage gain in the fraction of these targets having best centroid with TM-score ≥ 0.4 after we trim the dangling residues. We show only half of the circles for clarity of presentation. The curve shows an exponential fit with r = 0.99.

Number of Secondary Structure Elements

Independent of the structural similarity measure that we use (either Z-rRMSD or TM-score), the results consistently show that the prediction accuracy decreases with increasing number of strands in the protein. Thus, α+β proteins are harder to predict than α proteins, and β proteins are harder to predict than α+β proteins. Can the observed lower folding probability of β sequences be attributed to their relatively higher number of secondary structure elements (NSS), when compared to α sequences of the same length? As NSS increases, we expect that the potential energy loses its ability to discriminate the unique native structure among the different ways in which the elements of the secondary structure can arrange and produce different topologies. We estimate on the order of $e^{N\ln(z)}$ arrangements, where N is the number of independent elements, here taken to be the secondary structure elements, and z is the partition function of the internal degrees of freedom of a typical secondary structure element. If no energy function is used, then all arrangements are equally likely and the probability of finding the unique native structure among the set of arrangements would be at best $P_F \sim 1/e^{N_{SS}\ln(z)} = e^{-N_{SS}\ln(z)}$, or $\ln(P_F) \sim -N_{SS}$.

In addition to the number of secondary structure elements, the structural similarity measure between the model and native structures will be adversely affected by the presence of residues that are not part of the secondary structure elements, that is, loops and dangling tails. These coil residues may be flexible, and therefore their alignment to native is more difficult to predict. To eliminate the effect of the coil residues from our resulting TM-score values, we only take into account those residues predicted to adopt a helix or strand conformation when calculating the TM-score. The resulting scores assess the significance of the model topology relative to the native topology. Figure 7(a) shows the logarithm of the percentile probability that a sequence of given length and secondary structure class will have a significant TM-score, log(P_F(TM > 0.4 | L, class)). The probabilities are relatively high for sequences below 150 residues, a direct consequence of removing coil residues from the calculation of the TM-score. P_F decreases monotonically with increasing sequence length, and the decrease becomes more acute for sequences above 150 residues. We observe that the probabilities for α proteins are consistently higher than those of β proteins, with an average difference of 18% over the whole range of sequence lengths. Figure 7(b) shows the logarithm of the percentile probability that a sequence with a given number of secondary structure elements will have a significant TM-score, log(P_F(TM > 0.4 | NSS, class)). We observe again a monotonic decrease of the probability with increasing NSS, except for a flattening in the curve corresponding to α+β proteins in the small NSS range (NSS < 5).
The reason for this exception may be an insufficient number of target proteins, since one additional pseudo count for each NSS in this range having a significant TM-score would give the monotonic decrease in this NSS range. α proteins have a slightly higher probability than β proteins in the 1 < NSS < 8 range, with an average difference of 5%. For higher NSS (7 < NSS < 10), β proteins have a 7% higher probability than α proteins. The overall difference between α and β proteins over the whole NSS range is 1.3% in favor of α proteins, much smaller than the 18% when we plot the probabilities against sequence length. We show the exponential fit (dashed line) of all three secondary structure classes in the NSS < 9 range, with a 0.91 correlation coefficient. The fit suggests that the specificity of the TASSER potential decreases exponentially with the number of secondary structures, due to the increasing number of arrangements with a potential energy similar to that of the native structure. The trend is independent of the type of secondary structure, so that in the ab initio limit TASSER has a similar accuracy for sequences of different secondary structure classes and the same number of secondary structures. Above NSS = 8, TASSER more frequently assigns an energy to the native structure that is significantly higher than that of the structure having the minimum energy. Thus, the probability of predicting the native structure in this scenario is lower than in the case when no energy function is used and all structures are equally accessible. Hence the sharp decrease of the folding probability in the high NSS range.

Fig. 7. (a) Logarithm of the percentile probability of obtaining a model with significant TM-score for α (circle), α+β (triangle), and β (square) secondary structure classes versus sequence length and (b) versus number of secondary structure elements. The dashed line represents the pure exponential decay $10^{2 - 0.045 N_{SS}}$.

CONCLUSIONS
CONCLUSIONS

We have assessed the ability of TASSER in the template-free limit to predict the global fold of a comprehensive set of nonhomologous α, α + β, and β sequences below 200 residues. For representative sequences below 100 residues, TASSER places among its top five ranked models a structure whose global fold bears a statistically significant similarity to the native structure (Z-rRMSD cutoff of 4.25) in 43% of α, 33% of α + β, and 19% of β proteins. For sequences in the 100–200 residue range, the corresponding success rates are 33% for α, 24% for α + β, and 15% for β proteins. Even when the entire fold is not correctly predicted, TASSER can in some cases predict the correct structure of the protein core. For sequences below 100 residues, it generates a model with significant TM-score (TM-score ≥ 0.4) among the top five ranked models in 64% of α, 43% of α + β, and 20% of β sequences. For sequences 100 to 200 residues in length, these percentages are 36% for α, 24% for α + β, and 12% for β proteins, with the poorly predicted regions located in loops and at the N- and C-termini. Furthermore, structural similarity among the top five clusters is a moderately reliable predictor of folding success. Finally, for all sequences below 200 residues, the ability of TASSER to predict the structure of the protein core, represented by its secondary structure elements, is very similar for α, α + β, and β sequences with the same number of elements. Thus, the success of TASSER is strongly dictated by the size of the conformational space that must be searched, which is a function of the number of secondary structural elements.

While the results of this comprehensive benchmark are encouraging, improvements in the TASSER force field are clearly required. One possibility is to reparameterize the TASSER force field specifically for the template-free limit (at present, the relative weights of the terms in the potential are the same regardless of whether template information is used or not18).
Alternatively, additional terms might need to be added to the potential to enhance the sequence-structure specificity. These might include a bias towards intrinsically disordered residues,26 distance-dependent pair potentials,7 as well as three-body terms.27 Finally, the use of a secondary structure meta-predictor could improve on the current secondary structure assignments. These issues as well as others will be examined in detail in the near future.

ACKNOWLEDGMENTS

The authors gratefully acknowledge Dr. A. Arakaki for his careful reading of the manuscript and the production of Figures 1–3.

REFERENCES

1. Skolnick J, Kihara D, Zhang Y. Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm. Proteins 2004;56:502–518.
2. Holm L, Sander C. Mapping the protein universe. Science 1996;273:595–603.
3. Snow CD, Sorin EJ, Rhee YM, Pande VS. How well can simulation predict protein folding kinetics and thermodynamics? Annu Rev Biophys Biomol Struct 2005;34:43–69.
4. Simmerling C, Strockbine B, Roitberg AE. All-atom structure prediction and folding simulations of a stable protein. J Am Chem Soc 2002;124:11258–11259.
5. Oldziej S, Czaplewski C, Liwo A, Chinchio M, Nanias M, Vila JA, Khalili M, Arnautova YA, Jagielska A, Makowski M, Schafroth HD, Kazmierkiewicz R, Ripoll DR, Pillardy J, Saunders JA, Kang YK, Gibson KD, Scheraga HA. Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: assessment in two blind tests. Proc Natl Acad Sci USA 2005;102:7547–7552.
6. Kussell E, Shimada J, Shakhnovich EI. A structure-based method for derivation of all-atom potentials for protein folding. Proc Natl Acad Sci USA 2002;99:5343–5348.
7. Zhou H, Zhou Y. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 2002;11:2714–2726.
8. Vincent JJ, Tai CH, Sathyanarayana BK, Lee B. Assessment of CASP6 predictions for new and nearly new fold targets. Proteins 2005;61(Suppl 7):67–83.
9. Simons KT, Kooperberg C, Huang E, Baker D. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 1997;268:209–225.
10. Zhu ZY, Blundell TL. The use of amino acid patterns of classified helices and strands in secondary structure prediction. J Mol Biol 1996;260:261–276.
11. Zhang Y, Skolnick J. Automated structure prediction of weakly homologous proteins on a genomic scale. Proc Natl Acad Sci USA 2004;101:7594–7599.
12. Zhang Y, Skolnick J. Tertiary structure predictions on a comprehensive benchmark of medium to large size proteins. Biophys J 2004;87:2647–2655.
13. Pandit S, Skolnick J. TASSER-Lite: an automated tool for protein comparative modeling. Biophys J, in press.
14. Zhang Y, Devries ME, Skolnick J. Structure modeling of all identified G protein-coupled receptors in the human genome. PLoS Comput Biol 2006;2:e13.
15. Zhang Y, Skolnick J. Tertiary structure predictions on a comprehensive benchmark of medium and large size proteins. Biophys J, in press.
16. Hubbard TJ, Ailey B, Brenner SE, Murzin AG, Chothia C. SCOP: a Structural Classification of Proteins database. Nucleic Acids Res 1999;27:254–256.
17. Zhang Y, Arakaki A, Skolnick J. TASSER: an automated method for the prediction of protein tertiary structures in CASP6. Proteins, in press.
18. Zhang Y, Kolinski A, Skolnick J. TOUCHSTONE II: a new approach to ab initio protein structure prediction. Biophys J 2003;85:1145–1164.
19. McGuffin LJ, Bryson K, Jones DT. The PSIPRED protein structure prediction server. Bioinformatics 2000;16:404–405.
20. Zhang Y, Kihara D, Skolnick J. Local energy landscape flattening: parallel hyperbolic Monte Carlo sampling of protein folding. Proteins 2002;48:192–201.
21. Kabsch W. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Cryst A 1978;34:827–828.
22. Betancourt MR, Skolnick J. Universal similarity measure for comparing protein structures. Biopolymers 2001;59:305–309.
23. Zhang Y, Skolnick J. A scoring function for the automated assessment of protein structure template quality. Proteins 2004;57:702–710.
24. Zhang Y, Hubner IA, Arakaki AK, Shakhnovich E, Skolnick J. On the origin and highly likely completeness of single-domain protein structures. Proc Natl Acad Sci USA 2006;103:2605–2610.
25. Zhang Y, Skolnick J. SPICKER: a clustering approach to identify near-native protein folds. J Comput Chem 2004;25:865–871.
26. Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z. Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics 2006;7:208.
27. Li X, Liang J. Geometric cooperativity and anticooperativity of three-body interactions in native proteins. Proteins 2005;60:46–65.
28. Cuff JA, Clamp ME, Siddiqui AS, Finlay M, Barton GJ. JPred: a consensus secondary structure prediction server. Bioinformatics 1998;14:892–893.