Protein 3D Structure Computed from Evolutionary Sequence Variation

Debora S. Marks1*., Lucy J. Colwell2., Robert Sheridan3, Thomas A. Hopf1, Andrea Pagnani4, Riccardo Zecchina4,5, Chris Sander3

1 Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America, 2 MRC Laboratory of Molecular Biology, Hills Road, Cambridge, United Kingdom, 3 Computational Biology Center, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America, 4 Human Genetics Foundation, Torino, Italy, 5 Politecnico di Torino, Torino, Italy

2011. PLoS ONE 6(12): e28766 Abstract

The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues, including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7–4.8 Å Cα-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org).
This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein structures, new strategies in protein and drug design, and the identification of functional genetic variants in normal and disease genomes.

EVfold

Spencer Bliven Bourne Journal Club. Jan 10, 2012

Citation: Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. (2011) Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE 6(12): e28766. doi:10.1371/journal.pone.0028766
Editor: Andrej Sali, University of California San Francisco, United States of America
Received November 10, 2011; Accepted November 14, 2011; Published December 7, 2011
Copyright: © 2011 Marks et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: CS and RS have support from the Dana Farber Cancer Institute-Memorial Sloan-Kettering Cancer Center Physical Sciences Oncology Center (NIH U54CA143798). LC is supported by an Engineering and Physical Sciences Research Council fellowship (EP/H028064/1). TH has support from the German National Academic Foundation. RZ has support from European Community grant 267915. No other financial support was received for the research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
. These authors contributed equally to this work.

How do we use our abundant sequence information to get protein structures?

Fig. 1

Why is this needed?

“In spite of significant progress in the field of structural genomics over the last decade [20], only about half of all well-characterized protein families (PFAM-A, 12,000 families), have a 3D structure for any of their members [1].”

“As we are about to reach a truly explosive phase of massively parallel sequencing, we anticipate increased coverage of sequence space for protein families by several orders of magnitude, well above the level of 1000–10000 non-redundant sequences for protein family and with rich evolutionary information about protein structure directly from sequence.”

[Figure: the pipeline illustrated for RASH_HUMAN, from the alignment through DI scoring to simulated annealing]

1.  Multiple Alignment
2.  Predicted Contact Map
3.  Predicted Structure

Fig. 1, S2c, 3

How do we accurately predict contact maps from multiple alignments?

Fig. 3

How accurate are the predicted contacts?

Fig. 6a

How accurate are the structures produced?

Fig. 2

How accurate are the structures produced?

Fig. 4

How accurate can we get?

Mostly “significant” TM-scores (> 0.5), 2–5 Å RMSD
Table 1
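RMSD figures like the ones quoted here are measured after the predicted and observed coordinates have been optimally superimposed. The sketch below shows one common way to do that, the Kabsch superposition. It is an assumption for illustration: the slides do not say which superposition routine the authors used, and `kabsch_rmsd` is a hypothetical helper name.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition.

    Illustrative helper (not from the paper): P and Q hold matched atoms,
    e.g. the C-alpha atoms of a predicted and an observed structure.
    """
    P = P - P.mean(axis=0)                   # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                              # 3x3 cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)
    # Force a proper rotation (determinant +1) so mirror images are not matched.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P @ R.T                          # rotate P onto Q
    return float(np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1))))
```

Because only proper rotations are allowed, a mirror-image model keeps a non-zero RMSD, which is one way a prediction with the wrong chirality would show up.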

How much sequence data do we need?

Need at least 1,000 sequences with high variability; 10,000–20,000 is better.
Table 1

What about errors in contact prediction?

Fig. 5

What about errors in contact prediction?

Fig. 5

Is this amount of sequence data reasonable?

Fig. 7b

How many constraints are needed?

Fig. 7b

Aren’t active sites conserved? Can EVfold still work for active sites?

This may reflect strong evolutionary constraints near functional sites and may imply that the configuration of residues around an active site can be predicted more accurately than other detailed aspects of the 3D structure. The ability to predict active site constellations at this level of accuracy would be particularly interesting for the design of drugs on predicted structural templates.

Limitations

Need 1000+ sequence MSA
Also needs sufficient independence (<70% identity)
“Code available upon request”
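To make the independence requirement concrete, the sketch below counts how many aligned sequences remain once no two retained sequences share 70% or more identity. It is a simple greedy filter written for illustration, not the authors' procedure; `percent_identity` and `filter_redundant` are hypothetical names, and treating gap characters like any other symbol is an assumption.

```python
def percent_identity(a, b):
    """Percentage of aligned columns in which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def filter_redundant(sequences, max_identity=70.0):
    """Greedily keep sequences that are below max_identity to every sequence kept so far."""
    kept = []
    for seq in sequences:
        if all(percent_identity(seq, other) < max_identity for other in kept):
            kept.append(seq)
    return kept

# Toy alignment: the second sequence is 90% identical to the first and is dropped.
alignment = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
independent = filter_redundant(alignment)
```

The length of `independent` gives a rough sense of whether a family has enough diverse sequences. The greedy pass is quadratic in the number of sequences, so a large family would need a faster clustering tool.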

Benefits

Fast!
<1 cpu-hour per family

What can we do next with EVfold?

Pre-compute structures for all Pfam families
Use to predict protein-protein complexes
Predict structures for use in phasing X-ray data
Search for novel folds

How exactly does it work?

1.  Calculate DI for all pairs of positions in the MSA
2.  Rank DI pairs. Filter based on secondary structure consistency, sequence proximity, disulfide bonding, and conservation. Keep the top N.
3.  Convert to distance constraints (CA, CB, side chain centers)
4.  Use NMR structure solvers (simulated annealing + MD)
    1.  Remove incorrect chirality and knotted topologies
5.  Score predictions
    1.  Number of constraints satisfied, MD energy, etc.
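The ranking and constraint-conversion steps can be sketched as follows. This is a simplified illustration under stated assumptions: only the sequence-proximity filter is implemented (the secondary structure, disulfide and conservation filters are omitted), and `min_separation`, `upper_bound` and the dictionary layout are hypothetical choices, not values taken from the paper.

```python
import numpy as np

def top_ranked_pairs(di, top_n, min_separation=5):
    """Return the top_n (i, j, score) pairs by DI, skipping pairs close in sequence.

    di: symmetric (L, L) array of direct-information scores.
    min_separation: assumed cutoff; only pairs with j - i >= min_separation are kept.
    """
    length = di.shape[0]
    i_idx, j_idx = np.triu_indices(length, k=min_separation)
    scores = di[i_idx, j_idx]
    order = np.argsort(scores)[::-1][:top_n]          # highest score first
    return [(int(i_idx[k]), int(j_idx[k]), float(scores[k])) for k in order]

def to_distance_constraints(pairs, upper_bound=7.0):
    """Turn ranked pairs into C-alpha upper-bound constraints (hypothetical format)."""
    return [
        {"i": i, "j": j, "atoms": ("CA", "CA"),
         "max_distance": upper_bound, "weight": score}
        for i, j, score in pairs
    ]

# Toy score matrix: the (2, 3) pair scores highest but is too close in sequence.
di = np.zeros((8, 8))
for i, j, s in [(0, 6, 0.9), (1, 7, 0.5), (2, 3, 1.0)]:
    di[i, j] = di[j, i] = s
pairs = top_ranked_pairs(di, top_n=2)
constraints = to_distance_constraints(pairs)
```

A list like `constraints` is the kind of input a distance-geometry or simulated-annealing solver would consume. The actual file format depends on the solver, which the slides do not specify.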

How is direct information calculated?

$f_{ij}(A_i, A_j)$ = co-occurrence frequency of amino acids $A_i$ and $A_j$ at positions $i$ and $j$.

Mutual information:

$$MI_{ij} = \sum_{A_i, A_j = 1}^{q} f_{ij}(A_i, A_j) \ln\left(\frac{f_{ij}(A_i, A_j)}{f_i(A_i)\, f_j(A_j)}\right) \qquad (1)$$

Doesn’t work, since it doesn’t distinguish direct from indirect correlations.

Want $P(A_1, A_2, \ldots, A_L)$ = probability that polypeptide $A_1, A_2, \ldots, A_L$ belongs to the family, such that:
The covariance between two positions in the model matches the observed covariance
The model has maximal entropy (e.g. sequences are equally likely unless the data indicates otherwise)

Calculate effective pair probabilities $P^{Dir}_{ij}(A_i, A_j)$ resulting from a direct interaction between the two amino acids $A_i$, $A_j$ (requires inverting a $(20L)^2$ matrix)

Direct Information:

$$DI_{ij} = \sum_{A_i, A_j = 1}^{q} P^{Dir}_{ij}(A_i, A_j) \ln\left(\frac{P^{Dir}_{ij}(A_i, A_j)}{f_i(A_i)\, f_j(A_j)}\right) \qquad (3)$$

From the paper:

Global better than local model for residue couplings

Mutual information does not sufficiently correlate with residue proximity. We first attempted the prediction of residue-residue proximity relationships using the straightforward local mutual information (MI) measure. MI(i,j) for each residue pair i, j is a difference entropy which compares the experimentally observed co-occurrence frequencies $f_{ij}(A_i, A_j)$ of amino-acid pairs $A_i$, $A_j$ in positions i, j of the alignment to the distribution $f_i(A_i) f_j(A_j)$ that has no residue pair couplings (details in Text S1) (Equation 1).

Contact maps constructed from residue pairs assigned high MI values, and thus interpreted as predicted contacts, differ substantially from the correct contact maps deduced from native structures, consistent with the work of Fodor et al. [9] (Figure S1). Visual inspection of MI-predicted contacts as lines connecting the residue pairs superimposed on the observed crystal structure confirms that contacts predicted from MI are often incorrect and/or unevenly distributed (Figure 3, left, blue lines). Presumably this arises due to the local nature of MI, which is independently calculated for each residue pair i,j. Plausibly, the key confounding factor is the transitivity of pair correlations, where the simplest case involves residue triplets; for example, if residue B co-varies with both A and C, because B is spatially close to both A and C, then A and C may co-vary even without physical proximity (A–C is a transitive pair correlation). Any local measure of correlation, not just mutual information, is limited by this transitivity effect.

Effective residue couplings from a global maximum entropy model. To disentangle such direct and indirect correlation effects, we use a global statistical model to compute a set of direct residue couplings that best explains all pair correlations observed in the multiple sequence alignment (see Methods and Text S1) [15,47]. More precisely, we seek a general model, $P(A_1 \ldots A_L)$, for the probability of a particular amino acid … (1) consistency with observed data (pair and single residue frequencies) and (2) maximum entropy of the global probability over the set of all possible sequences. … are of the form [11,45,49]:

$$P(A_1 \ldots A_L) = \frac{1}{Z} \exp\left\{ \sum_{1 \le i < j \le L} e_{ij}(A_i, A_j) + \sum_{1 \le i \le L} h_i(A_i) \right\} \qquad (2)$$

Here $Z$ is the normalization constant. In the analogy to a multiple particle system, the expression in curly brackets is the Hamiltonian, with particle-particle coupling energies $e_{ij}(A_i, A_j)$ and external fields $h_i(A_i)$; a sequence position i corresponds to a spin and can be in one of 21 states.

In practice, once these parameters are determined by matrix inversion (Equations M4, M5), one can directly compute the effective pair probabilities $P^{Dir}_{ij}(A_i, A_j)$ (Equation M6), and from these the effective residue couplings (‘direct information’, in analogy to the ‘mutual information’) $DI_{ij}$ by summing over all possible amino acid pairs $A_i$, $A_j$ at positions i,j (Equation 3).

The crucial difference between this expression for direct information $DI_{ij}$ (Equation 3) and the equation for mutual information $MI_{ij}$ (Equation 1) is to replace pair probabilities estimated based on local frequency counts $f_{ij}(A_i, A_j)$, by the doubly constrained pair probabilities $P^{Dir}_{ij}(A_i, A_j)$, which are globally …
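The difference between mutual information and direct information can be tried out on a toy alignment. The sketch below computes MI from pair frequencies, then estimates couplings by inverting the covariance matrix and rebuilds pair probabilities that keep the single-site frequencies. It follows a mean-field reading of the matrix-inversion step and is not the authors' code: the pseudocount value, the choice of the last state as reference, the absence of sequence reweighting and all function names are assumptions.

```python
import numpy as np

def frequencies(msa, q, pseudocount=0.5):
    """Single-site f_i(A) and pair f_ij(A, B) frequencies mixed with a uniform pseudocount.

    msa: integer array of shape (n_sequences, L) with entries in 0..q-1.
    """
    n, length = msa.shape
    onehot = np.eye(q)[msa]                                   # (n, L, q)
    fi = (1 - pseudocount) * onehot.mean(axis=0) + pseudocount / q
    fij = np.einsum("nia,njb->ijab", onehot, onehot) / n      # (L, L, q, q)
    fij = (1 - pseudocount) * fij + pseudocount / q**2
    for i in range(length):                                   # a site always agrees with itself
        fij[i, i] = np.diag(fi[i])
    return fi, fij

def pair_score(p, fi, fj):
    """sum_{A,B} p(A,B) ln(p(A,B) / (f_i(A) f_j(B))): the shared form of MI and DI."""
    return float(np.sum(p * np.log(p / np.outer(fi, fj))))

def mutual_information(fi, fij):
    length = fi.shape[0]
    mi = np.zeros((length, length))
    for i in range(length):
        for j in range(i + 1, length):
            mi[i, j] = mi[j, i] = pair_score(fij[i, j], fi[i], fi[j])
    return mi

def direct_information(fi, fij, n_iter=200):
    length, q = fi.shape
    # Covariance over the first q-1 states of each site (last state is the reference).
    C = fij[:, :, :q-1, :q-1] - fi[:, None, :q-1, None] * fi[None, :, None, :q-1]
    C = C.transpose(0, 2, 1, 3).reshape(length * (q - 1), length * (q - 1))
    # Mean-field estimate of the couplings: e = -C^{-1}.
    e = -np.linalg.inv(C).reshape(length, q - 1, length, q - 1).transpose(0, 2, 1, 3)
    di = np.zeros((length, length))
    for i in range(length):
        for j in range(i + 1, length):
            W = np.ones((q, q))
            W[:q-1, :q-1] = np.exp(e[i, j])
            mu_i = np.full(q, 1.0 / q)
            mu_j = np.full(q, 1.0 / q)
            for _ in range(n_iter):          # fit fields so P_dir reproduces f_i and f_j
                mu_i = fi[i] / (W @ mu_j)
                mu_i /= mu_i.sum()
                mu_j = fi[j] / (mu_i @ W)
                mu_j /= mu_j.sum()
            p_dir = W * np.outer(mu_i, mu_j)
            p_dir /= p_dir.sum()
            di[i, j] = di[j, i] = pair_score(p_dir, fi[i], fi[j])
    return di

# Toy chain: column 1 copies column 0, column 2 copies column 1, column 3 is independent.
rng = np.random.default_rng(0)
n, q = 5000, 3
def noisy_copy(col):
    return np.where(rng.random(n) < 0.9, col, rng.integers(0, q, n))
c0 = rng.integers(0, q, n)
c1 = noisy_copy(c0)
c2 = noisy_copy(c1)
msa = np.stack([c0, c1, c2, rng.integers(0, q, n)], axis=1)
fi, fij = frequencies(msa, q, pseudocount=0.1)
mi = mutual_information(fi, fij)
di = direct_information(fi, fij)
```

In this toy chain, columns 0 and 2 are linked only through column 1, so the (0, 2) pair plays the role of the transitive A–C correlation. Comparing `di[0, 2] / di[0, 1]` with `mi[0, 2] / mi[0, 1]` shows how strongly each measure down-weights that indirect pair. The result depends on the pseudocount, so treat the numbers as illustrative only.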
