Protein folding by motion planning

Viewer
Transcript

INSTITUTE OF PHYSICS PUBLISHING

PHYSICAL BIOLOGY

doi:10.1088/1478-3975/2/4/S09

Phys. Biol. 2 (2005) S148–S155

Protein folding by motion planning ∗ Shawna Thomas1, Guang Song2 and Nancy M Amato1,3 1 2

Department of Computer Science, Texas A&M University, College Station, TX 77843-3112, USA Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA 50011, USA

Received 24 June 2005 Accepted for publication 2 September 2005 Published 9 November 2005 Online at stacks.iop.org/PhysBio/2/S148 Abstract We investigate a novel approach for studying protein folding that has evolved from robotics motion planning techniques called probabilistic roadmap methods (PRMs). Our focus is to study issues related to the folding process, such as the formation of secondary and tertiary structures, assuming we know the native fold. A feature of our PRM-based framework is that the large sets of folding pathways in the roadmaps it produces, in just a few hours on a desktop PC, provide global information about the protein’s energy landscape. This is an advantage over other simulation methods such as molecular dynamics or Monte Carlo methods which require more computation and produce only a single trajectory in each run. In our initial studies, we obtained encouraging results for several small proteins. In this paper, we investigate more sophisticated techniques for analyzing the folding pathways in our roadmaps. In addition to more formally revalidating our previous results, we present a case study showing that our technique captures known folding differences between the structurally similar proteins G and L.

1. Introduction There are large and ongoing research efforts whose goal is to determine the native structure of a protein from its amino acid sequence [1, 2]. A protein’s 3D structure is important because it affects the protein’s function. In this work, we assume the native structure is known, and our focus is on the study of protein folding mechanisms. That is, instead of performing fold prediction, we aim to study issues related to the folding process, such as the formation of secondary and tertiary structures, and the dependence of the folding pathway on the initial denatured conformation. Such questions have taken on increased practical significance with the realization that mis-folded or only partially folded proteins are associated with many devastating diseases [3]. Moreover, increased knowledge of folding mechanisms may provide insight for protein structure prediction. Despite intensive efforts by experimentalists and theorists, there are major gaps in our ∗

This research was supported in part by NSF CAREER Award CCR9624315, NSF Grants ACI-9872126, EIA-9975018, EIA-0103742, EIA9805823, ACR-0113971, CCR-0113974, EIA-9810937, EIA-0079874 and the Texas Higher Education Coordinating Board grant ATP-000512-02612001. ST was supported in part by an NSF Graduate Research Fellowship. GS was supported in part by an IBM PhD Fellowship. 3 Author to whom any correspondence should be addressed.

1478-3975/05/040148+08$30.00

understanding of the behavior and mechanism of the folding process. In previous work [4], we proposed a technique for computing protein folding pathways that is based on the successful probabilistic roadmap (PRM) [5] method for robotics motion planning. We were inspired to apply this technique to protein folding based on our success in applying it to folding problems such as carton folding and paper crafts [6]. We obtained promising results for small proteins (∼60 amino acids) and validated our pathways by comparing the secondary structure formation order with known experimental results [7]. A major feature of our PRM-based framework is that in a few hours on a desktop PC it produces roadmaps containing large sets of unrelated folding pathways that provide global information about the protein’s energy landscape. In this paper, we investigate more sophisticated techniques for analyzing the pathways in our roadmaps. In addition to more formally revalidating our previous results, we present a case study showing that our technique captures known folding differences between the structurally similar proteins G and L. 1.1. Comparison to related work Table 1 provides a summary comparison of various models for protein folding. Both Monte Carlo simulation and

© 2005 IOP Publishing Ltd Printed in the UK

S148

Protein folding by motion planning

Table 1. A comparison of protein folding models. Approach

Folding landscape

# Paths produced

Path quality

Computation time

Native state needed

Molecular dynamics Monte Carlo Statistical model PRM-based Lattice model

No No Yes Yes Not used on real proteins

1 1 0 Many

Good Good N/A Approx

Long Long Fast Fast

No No Yes Yes

(a)

(b)

(c)

Figure 1. A PRM roadmap for protein folding shown imposed on a visualization of the potential energy landscape: (a) after node generation (note sampling is denser around N, the known native structure), (b) after the connection phase and (c) using it to extract folding paths to the known native structure.

molecular dynamics provide a single, usually high quality, folding trajectory. Each run is computationally intensive because they attempt to simulate complex kinetics and thermodynamics. Statistical mechanical models, while computationally efficient, assume extremely simplified molecular interactions and are limited to studying global averages of folding kinetics. They also cannot detect multiple kinetics behavior such as the two-state and three-state kinetics exhibited by hen egg-white lysozyme [8, 9]. Lattice models [10] have been well studied and possess great theoretical value but cannot be applied to real proteins. Our PRM approach, by constructing a roadmap that approximates the folding landscape, computes multiple folding pathways in a single run and provides a natural way to study protein folding kinetics at the pathway level. What we sacrifice is path quality, which can be improved through bigger roadmaps, oversampling or other techniques.

2. A probabilistic roadmap method for protein folding Our approach to protein folding is based on the probabilistic roadmap (PRM) approach for motion planning [5]. A detailed description of how the PRM framework can be applied to protein folding is presented in our previous work [4, 11]. The basic idea is illustrated in figure 1. We first sample some points in the protein’s conformation space (figure 1(a)); generally, our sampling is biased to increase density near the known native state. Then, these points are connected to form a graph, or a roadmap (figure 1(b)). Weights are assigned to directed edges to reflect the energetic feasibility

Figure 2. A three-amino-acid segment of a protein. The phi (φ) and psi (ψ) angles in the middle amino acid are labeled.

of transition between the conformations corresponding to the two end points. Finally, folding pathways are extracted from the roadmap using standard graph search techniques (figure 1(c)). 2.1. Modeling proteins (C-space) The amino acid sequence is modeled as a tree-like linkage. Using a standard modeling assumption for proteins [12], the only degrees of freedom (dof) in our model of the protein are the backbone’s phi and psi torsional angles, which we model as revolute (rotational) joints taking values in [0, 2π), see figure 2. Moreover, side chains are modeled as spheres and have zero dof. Since we are not concerned with the absolute position and orientation of the protein, a conformation of an (n + 1)-aminoacid protein can be specified by a vector of 2n phi and psi angles, each in the range [0, 2π), with the angle 2π equated S149

S Thomas et al

(a)

(b)

(c)

(d)

Figure 3. An illustration of our iterative perturbation sampling strategy shown imposed on a visualization of the potential energy landscape. Protein A (60 residues, all alpha) 4000

2.5

x 10

Protein G (56 residues, mixed)

4

2.5

x 10

Protein L (62 residues, mixed)

4

2.5

x 10

4

Protein CTXIII (60 residues, all beta)

3500 2

2

2

2000 1500 1000

1.5

1

0.5

Potential [kcal/mol]

2500

Potential [kcal/mol]

Potential [kcal/mol]

Potential [kcal/mol]

3000 1.5

1

0.5

1.5

1

0.5

500 0

0

0

0 −500 0

5

10

15

20

25

30

−0.5 0

5

10

15

20

25

30

35

−0.5 0

5

10

15

20

25

30

35

−0.5 0

5

10

15

20

RMSD [A]

RMSD [A]

RMSD [A]

RMSD [A]

(a)

(b)

(c)

(d)

25

30

35

Figure 4. The potential versus RMSD distribution of roadmap nodes for proteins (a) A (all-alpha helix), (b) G (mixed alpha/beta), (c) L (mixed alpha/beta) and (d ) CTX III (all-beta).

to 0, which is naturally associated with a unit circle in the plane, denoted by S 1 . That is, the conformation space (C-space) of interest for a protein with n + 1 amino acids can be expressed as C = {q | q ∈ S 2n }.

(1)

Note that C simply denotes the set of all possible conformations. The feasibility of a point in C will be determined by potential energy computations. 2.2. Node generation Recall that we begin with the known native structure and our goal is to map the protein folding landscape, focusing on the native fold. The objective of the node generation phase is to generate a representative sample of conformations of the protein. Due to the high dimensionality of the conformation space, simple uniform sampling would take too long to provide sufficiently dense coverage of the region surrounding the native structure. The results presented in this paper use a biased sampling strategy that focuses sampling around the native state by iteratively applying small perturbations to existing conformations [4]. The process is illustrated in figure 3. Our sampling is based on the contact number of a conformation. A native contact is a pair of Cα atoms from residues whose ˚ apart in the native state. The contact centers are less than 7 A number of a conformation is defined as the number of native contacts present. We partition all conformations into sets, or bins, according to the conformation’s contact number. Two conformations with different sets of native contacts present will appear in the same bin if they contain the same number of native contacts. We continue generating nodes until all bins have enough conformations. We use ten equal-sized bins. S150

A node q is accepted and added to the roadmap based on its potential energy E(q) (see section 3) with the following probability: P (accept q)  1,    E − E(q) max = ,  Emax − Emin   0,

if

E(q) < Emin ,

if

Emin E(q) Emax ,

if

E(q) > Emax .

(2)

We set Emin = 11 905 kcal mol−1 and Emax = 16 667 kcal mol−1 which favors configurations with well-separated side chain spheres. This acceptance test, which retains more nodes in low energy regions, was also used in PRM-based methods for ligand binding [14, 15] and in our previous work on protein folding [16, 11]. 2.2.1. Node distribution. An interesting effect starts to emerge after node generation. For example, consider the differences between the potential versus RMSD distributions for the all-alpha and all-beta proteins shown in figure 4. This distinctive behavior, which was observed for all proteins we have studied, cannot be attributed to our sampling strategy or our potential E(q) since we observe the same qualitative distributions with both the coarse potential energy and the all-atom potential energy (see section 3 for details on energy calculations). It can, however, be related to folding differences. In particular, all-alpha proteins tend to form the helices first, and then the helices pack together to form the final tertiary structure. In the figure, this packing of helices is seen as the narrow ‘tail’ in the distribution where the potential changes very little as the RMSD approaches zero. In contrast, the distribution for the all-beta protein is much smoother,

Protein folding by motion planning

indicating that the secondary and the tertiary structure may be formed simultaneously. For the mixed alpha and beta proteins, the plots share some features of the plots for the all-alpha and all-beta proteins. One explanation for the observed behavior is that proteins tend to maximize the formation of favorable interactions while minimizing conformational entropy loss, as observed by other researchers (e.g. [17]). Here, we capture this behavior in the very early stages of our approach, i.e., after the initial sampling phase. One reason could be that since the formation of helices causes little entropy loss, the corresponding conformation space remains large, while for beta sheets, the conformation space is quickly constrained by the larger entropy losses. Therefore, beta sheets appear later, close to the native state (when the surrounding conformation space is already small and entropy loss is not as significant), while alpha helices form much earlier (since this does not affect the conformation space as much). Interestingly, this is captured and reflected by our sampling. 2.3. Connecting the roadmap Connection is the second phase of roadmap construction. The objective is to obtain a roadmap encoding representative, low energy paths. For each roadmap node, we first find its k nearest neighbors, for some small constant k, and then try to connect it to them using a local planning method. This yields a connectivity roadmap that can be viewed as a net laid down on the energy landscape (see figure 1(b)). When two nodes q1 and q2 are connected, the directed edge (q1 , q2 ) is added to the roadmap. Each edge (q1 , q2 ) is assigned a weight that depends on the sequence of conformations {q1 = c0 , c1 , c2 , . . . , cn−1 , cn = q2 } on the straight line in C connecting q1 and q2 . For each pair of consecutive conformations ci and ci+1 , the probability Pi of moving from ci to ci+1 depends on the difference between their potential energies Ei = E(ci+1 ) − E(ci ). −Ei e kT , if Ei > 0, Pi = (3) 1, if Ei 0. This keeps the detailed balance between two adjacent states and enables the weight of an edge to be computed by summing the logarithms of the probabilities for consecutive pairs of conformations in the sequence w(q1 , q2 ) =

n−1

− log(Pi ).

(4)

i=0

Edge weights are not transition rates, but the logarithm of transition rates. This enables edge weights to follow the summation rule (instead of the multiplication rule for transition rates) and facilitates the use of graph algorithms to extract shortest paths. (Negative logs are used since each 0 Pi 1.) In this way, we encode the energetic feasibility of transiting from one conformation to another in the edge connecting them. A similar weight function, with different probabilities, was used in [15].

2.4. Extracting folding pathways The roadmap is a map of the protein folding landscape of the protein. One way to study this landscape is to inspect and analyze the pathways it contains. An important feature of our approach is that the roadmap contains many folding pathways, which together represent the folding landscape. We can extract many such paths by computing the single-source shortest-path (SSSP) tree from the native structure (see figure 1(c)).

3. Potential energy calculations The way in which a protein folds depends critically on the potential energy. Our PRM framework incorporates this bias by accepting conformations based on their potential energy (section 2.2) and by weighting roadmap edges according to their energetic feasibility (section 2.3). Our framework is flexible enough to use any method for computing potential energies. In this work, we use two different potentials: a very simplistic potential based on [18] and a more detailed, all-atom potential, effective energy function 1 (EEF1) [19]. For our coarse potential based on [18], we use a step function approximation of the van der Waals potential component. Our approximation considers only the contribution from the side chains. Additionally, in our model of each amino acid, we treat the side chain as a single large ‘atom’ R located at the Cβ atom. For a given conformation, we calculate the coordinates of the R ‘atoms’ (our spherical approximation of the side chains) for all residues. If any two R ˚ during node generation ‘atoms’ are too close (less than 2.4 A ˚ and 1.0 A during roadmap connection), a high enough potential is returned to force the node/edge to be rejected. If all the ˚ then we distances between all R ‘atoms’ are larger than 2.4 A, proceed to calculate the potential as follows: 1/2

Kd (di − d0 )2 + dc2 − dc + Ehp , (5) Utot = restraints

˚ as in [18]. where Kd is 100 kcal mol−1 and d0 = dc = 2 A The first term represents constraints which favor the known secondary structure through main-chain hydrogen bonds and disulphide bonds and the second term is the hydrophobic effect. The hydrophobic effect is computed as follows: if two ˚ of each other, then the hydrophobic residues are within 6 A potential is decreased by 20 kcal mol−1 . Finally, we note that in our case, the minimum potential is not necessarily achieved by the native structure, and thus our energy model does not yield true funnel landscapes as are shown in the figures. Also, this coarse potential becomes unrealistic as conformations become increasingly unstructured. However, it is accurate enough near the native state to capture the main properties of the folding process.

4. Timed contact analysis Contact analysis provides us with a formal method of validation and allows for detailed analysis of the folding S151

S Thomas et al

(I: α)

0

3 78 70 81 21 27 1 89 87 93 1 92 13 91 1 76 16 95 67

(V: βturn1 89 91 103 79 94 71 82

10 (II)x20 xx

40

(V)

x

xx xx xx xx xx xx xx xx xx xx xx xx x

(VI: βturn2 ) 96 76 1 56 79 81 33 99

xxx xxx xxx xxx xxx xxx xx x

100 101 100 97 102 86 93 96 96 76 97 79 81

(I: α)

10

xxx xxxx xx

50

Figure 5. Timed contact map for protein G. The full contact matrix (right) and blow-ups (left) showing the time steps when contacts appear on a path. The blow-ups: (I) alpha helix contacts, (II) beta 1–2 contacts, (III) beta 3–4 contacts, (IV) beta 1–4 contacts, (V) turn 1 (beta 1–2) contacts and (VI) turn 2 (beta 3–4) contacts. The full contact matrix (right) shows contact placement with an ‘x’ and the blow-ups (left) indicate the time step for each contact.

pathways. We first identify the native contacts by finding all ˚ apart. pairs of Cα atoms in the native state that are at most 7 A If desired, attention can be restricted to hydrophobic contacts between hydrophobic residues. To analyze a particular pathway, we examine each conformation on the path and determine the time step on the path at which each native contact appears. Although these time steps cannot be associated with any real time, they do give a temporal ordering and produce a timed contact map for the given pathway, see figures 5 and 6. The full contact matrix (on the right) shows contact placement and the blow-ups (left) show contact time steps. The timed contact map provides a formal basis for determining the secondary structure formation order along a pathway. Here, the structure formation order is based on the formation order of the native contacts [17]. We have looked at several metrics to determine when a secondary structure appears: average appearance time of native contacts within the structure, average appearance of the first x% of the contacts, average appearance ignoring outliers, etc. We can also focus

30

50

x

x x xx x

(VI: βturn2 ) 25 22 11 27 26 23 1 25 22 22 4 13

60 0 (IV)

xx xx xxxx xxx xxx xxx xxx xxx

x

23 29 32 38 1 36 1 1 1

xx xx xx xx xx xx xx xx x

30 xx x

x x x

(III)xxxxxx (VI) xxxxxxxxx (IV: β1β4)

(III: β3β4) 29 35 33 27 32 41 25 29 24 41

10

20

(I) x

(II: β1β2) 39 38 38 38 38 38 36 38 29 32 36

40

x

(V: βturn1 )

xxx xxx

(III)xxxx 40 (VI) xxxxxxxxx

20xx (II)xxxxxxxxxxx

xxx xxx xxx xxx xxx xxx

(V)

9 12 1 22 9 24 3 1 1 1 1 32 8 32 7 28 15

30

105105 102103 103102102 104103104 104103104 104 103

10

8

20

(IV: β1β4) (III: β3β4)

0

50(IV)0

(I)

(II: β1β2) 101102101 101103104 102101 97 104 94 97 102 94 91 104 89 91 93 79 94

30

xxxx xxx xxxx xxx xxx xxx xxx xx

37 38 37 38 37 39 39 38 38 40 38 39 39 39 39

xxxx xxx xxxxx xxxx xx

40

50

60

Figure 6. Timed contact map for protein L. The full contact matrix (right) and blow-ups (left) showing the time steps when contacts appear on a path. The blow-ups: (I) alpha helix contacts, (II) beta 1–2 contacts, (III) beta 3–4 contacts, (IV) beta 1–4 contacts, (V) turn 1 (beta 1–2) contacts and (VI) turn 2 (beta 3–4) contacts. The full contact matrix (right) shows contact placement with an ‘x’ and the blow-ups (left) indicate the time step for each contact.

our analysis on smaller pieces of secondary structure such as β-turns (instead of the entire β-sheet). This is especially helpful when looking for fine details in a folding pathway. Because the roadmap contains multiple pathways, we can estimate the probability of a particular secondary structure formation order occurring. If the roadmap approximates the potential energy landscape well, then the percentage of pathways in the roadmap that contain a particular formation order reflects the probability of that order occurring.

5. Experimental validation and discussion In this section, we present results obtained using our PRMbased approach. For each protein studied, we construct a roadmap as outlined in sections 2.2 and 2.3, extract the folding pathways as described in section 2.4 and analyze the pathways as discussed in section 4. Our course potential described in section 3 is used unless otherwise specified.

Table 2. Proteins studied and roadmap statistics. Shown are number of residues (length), number of α helices and β strands (SS), roadmap size (# nodes) and construction time.

S152

pdb

Description

Length

SS

# Nodes

Time (h)

1gb1 2crt 1bdd 1shg 2ptl 1coa 1srl 1nyf 2ait 1ubq 1pks 1pba

Protein G domain B1 Cardiotoxin III Staphylococcus protein A SH3 domain α-spectrin Protein L, B1 domain CI2 SH3 domain src SH3 domain fyn Tendamistat Ubiquitin SH3 domain PI3 kinase Procarboxypeptidase A2

56 60 60 62 62 64 64 67 74 76 79 81

1α + 4β 5β 3α 5β 1α + 4β 1α + 4β 5β 5β 7β 1α + 5β 1α + 5β 3α + 3β

8 000 8 000 10 000 10 000 4 000 10 000 8 000 10 000 10 000 8 000 10 000 8 000

6.400 6.430 10.400 8.344 3.104 9.984 5.990 8.418 13.327 10.381 14.446 10.845

Protein folding by motion planning

Table 3. The secondary structure formation order on dominant pathways in our roadmaps and some validations. The brackets indicate that there was no clear order. The last column compares our results with those from hydrogen-exchange experiments. pdb

Out exchange [7]

Pulse labeling [7]

Our SS formation order

Comp.

1gb1 2crt 1bdd 1shg 2ptl 1coa 1srl 1nyf 2ait 1ubq 1pks 1pba

[α, β1, β3, β4], β2 [β3, β4, β5], [β1, β2] [α2, α3], α1 N/A [α, β1, β2, β4], β3 [α, β2, β3], [β1, β4] N/A N/A [β1, β2], [β3, β4, β5, β6, β7] [α, β1, β2], [β3, β5], β4 N/A N/A

[α, β4], [β1, β2, β3] β5, β3, β4, [β1, β2] [α1, α2, α3] N/A [α, β1], [β2, β3, β4] N/A N/A N/A N/A N/A N/A N/A

α, β3–β4, β1–β2, β1–β4 β1–β2, β3–β4, β3–β5 [α2, α3], α1, α2–α3, α1–α3 β3–β4, β2–β3, β1–β5, β1–β2 α, β1–β2, β3–β4, β1–β4 α, β3–β4, β2–β3, β1–β4 β3–β4, β2–β3, β1–β5, β1–β2 β3–β4, β2–β3, β1–β2, β1–β5 β1–β2, β3–β4, [β2–β5, β3–β6], β3–β5 α, β3–β4, β1–β2, β3–β5, β1–β5 β3–β4, β1–β5, [β1–β2, β2–β3] [α1, α3], [β1–β2, β1–β3]

Agreed Not sure Agreed N/A Agreed Agreed N/A N/A Agreed Agreed N/A N/A

Table 4. Comparison of analysis techniques, helix and hairpins, for proteins G and L. For each combination of contact type (all or hydrophobic) and number of contacts (first x% to form), we show the percentage of pathways with a particular secondary structure formation order. Recall that β-hairpin 2 (β3–β4) forms first in protein G and β-hairpin 1 (β1–β2) forms first in protein L. Only the first three secondary structure formation orders are shown. Analyze first x% contacts

Name

Contacts considered

Energy function

Secondary structure formation order

Protein G

All

Our

α, β3–β4, β1–β2, β1–β4 α, β1–β2, β3–β4, β1–β4

76 23

66 34

77 23

55 45

58 42

All-atom

α, β3–β4, β1–β2, β1–β4 α, β1–β2, β3–β4, β1–β4

75 25

46 54

54 46

54 46

83 17

Our

α, β3–β4, β1–β2, β1–β4 α, β3–β4, β1–β4, β1–β2 α, β1–β2, β3–β4, β1–β4

85 11 4

78 11 10

77 9 14

62 8 29

67 8 24

All-atom

α, β3–β4, β1–β2, β1–β4 α, β1–β2, β3–β4, β1–β4

78 22

79 21

92 8

79 21

92 8

Our

α, β1–β2, β3–β4, β1–β4 α, β1–β2, β1–β4, β3–β4 α, β3–β4, β1–β2, β1–β4

67 15 19

76 4 20

78 4 18

78 4 18

92 4 4

All-atom

α, β1–β2, β3–β4, β1–β4

100

100

100

100

100

Our

α, β1–β2, β3–β4, β1–β4 α, β1–β2, β1–β4, β3–β4 α, β3–β4, β1–β2, β1–β4

54 9 36

65 3 32

74 3 23

73 2 26

86 2 13

All-atom

α, β1–β2, β3–β4, β1–β4 α, β3–β4, β1–β2, β1–β4

99 1

100 0

99 1

99 1

99 1

Hydrophobic

Protein L

All

Hydrophobic

5.1. Proteins studied We study several small proteins in detail, see table 2. The structures for all the proteins were obtained from the Protein Data Bank [20]. We selected proteins that were studied in Munoz and Eaton’s [21] work studying the folding kinetics of small proteins. Our work is greatly motivated by theirs, as well as the work of Baker’s group [22]. For each protein, we build roadmaps ranging from 2000 nodes to 10 000 nodes. For each protein, we show the smallest roadmap that yields a stable secondary structure formation order. As discussed in section 1.1, our PRM-based method sacrifices accuracy in favor of rapid coverage. As can be

20

40

60

80

100

seen from the running time and roadmap statistics shown in table 2, our roadmaps containing thousands of folding paths are computed in just a few hours on a desktop PC. In contrast, traditional methods such as molecular dynamics, compute a single trajectory, have tremendous computational requirements and are subject to local minima. 5.2. Secondary structure formation order and validation Contact analysis was performed on the pathways for all proteins studied. Timed contact maps for proteins G and L are shown in figures 5 and 6, respectively. The dominant formation order found for each protein is shown in table 3. It S153

S Thomas et al

Table 5. β-turn formation: comparison of analysis techniques for proteins G and L using the same roadmaps as in table 4. Recall that turn 2 forms first in protein G and turn 1 forms first in protein L. Only the first three secondary structure formation orders are shown.

Name

Energy function

Secondary structure formation order

20

40

60

80

100

Protein G

All

Our

α, turn 2, turn 1 turn 2, α, turn 1 α, turn 1, turn 2

53 15 25

52 9 33

52 17 26

50 22 23

50 22 24

All-atom

α, turn 2, turn 1 turn 2, α, turn 1 α, turn 1, turn 2 turn 1, α, turn 2

36 3 50 12

37 0 63 0

55 0 45 0

55 0 45 0

57 0 43 0

Our

α, turn 2, turn 1 α, turn 1, turn 2

96 4

96 4

85 12

96 2

87 11

All-atom

α, turn 2, turn 1 α, turn 1, turn 2

76 24

78 22

78 22

92 8

69 31

Our

α, turn 1, turn 2 turn 1, α, turn 2 α, turn 2, turn 1

24 3 73

30 4 63

37 4 60

38 4 48

41 6 39

All-atom

α, turn 1, turn 2 α, turn 2, turn 1

25 75

25 75

48 52

43 57

41 59

Our

α, turn 1, turn 2 turn 1, α, turn 2 α, turn 2, turn 1

72 5 23

68 9 22

72 5 22

70 7 23

69 15 15

All-atom

α, turn 1, turn 2 turn 1, α, turn 2 α, turn 2, turn 1

66 3 31

76 0 24

78 0 22

95 0 5

97 0 3

Hydrophobic

Protein L

All

Hydrophobic

is clearly seen that our results are in good agreement with the hydrogen-exchange experimental results described by Li and Woodward [7]4 . All proteins seem to form local contacts first, and then those with increasing sequence contact order, like a zipper process [23, 17]. 5.3. Protein G and protein L: a detailed study Proteins G and L present a good test case for our technique because they are known to fold differently although they are structurally similar. In particular, although they have only 15% sequence identity [24], they are both composed of an α-helix and a four-stranded β-sheet. β strands 1 and 2 form the N-terminal hairpin (hairpin 1) and β strands 3 and 4 form the C-terminal hairpin (hairpin 2). Experimental results show that β-hairpin 1 forms first in protein L [25] and β-hairpin 2 forms first in protein G [24]. In native state out-exchange experiments for proteins G and L, numerous NHs out exchange very slowly, which makes it difficult to unambiguously identify the slowest out-exchange residues. It is found that the slow exchanging NHs are in β1 , β3 , β4 and the helix for G, and the α helix, β1 , β2 and β4 for L. On the other hand, pulse-labeling experiments identify that the first NHs to gain protection during folding are in α and β4 for G and α and β1 for L (see [7] and references therein). In summary, out-exchange and pulse-labeling experiments 4

We note that our results for CTX III may be affected by the four disulphide bonds which we model as hydrogen bonds.

S154

Analyze first x% contacts

Contacts considered

strongly suggest that the α and β4 form first for G and that the α and β1 form first for L. Furthermore, this is consistent with -value analysis on G [24] and L [25] which indicates that, in the folding transition state, β-hairpin 2 is more formed than the rest of the structure for G and β-hairpin 1 is similarly more formed for L. For both proteins G and L, we use the same definition of the beta strands as is contained in the Protein Data Bank. These definitions include all the observed residues that are found in the slowest exchange core in the native state out-exchange experiments and that are among the first gaining protection in the pulse-labeling experiments, see figures 5 and 6. This enables us to have a fair comparison of our results with those from these experiments. Table 4 shows our results. For each protein, one roadmap was constructed and then its (thousands of) pathways were studied using the different analysis methods described in section 4. When only the specified contacts were considered, the percentage of paths that had the given secondary structure formation order is shown. For example, for all contacts, and limiting our consideration to only the first 60% of the contacts to form, in 77% of the pathways for protein G β-hairpin 2 (β3 – β4 contacts) formed before β-hairpin 1 (β1 –β2 contacts) with our coarse potential, while in 82% of the pathways for protein L β-hairpin 1 formed before β-hairpin 2. Thus, the helix and β-hairpin 2 form first by a significant percentage for protein G, while for protein L, the helix and β-hairpin 1 consistently form first by a significant percentage. Both results agree very

Protein folding by motion planning

well with experimental observations. We also performed the study considering only the hydrophobic contacts and obtained similar results, further confirming our findings. We also study the formation order of β-turns (see figures 5 and 6 for our definition). Remarkably, the results (see table 5) are in good agreement with those obtained using the beta strands. For protein G, the second β-turn forms consistently earlier than the first β-turn, which confirms our results that the second hairpin forms first. For protein L, our results show that the second β-turn forms first when considering all contacts. However, when only hydrophobic contacts are considered, then the first β-turn forms first by a significant percentage. This indicates that some hydrophobic contacts form earlier in the first turn than in the second. We performed the same study using the EEF1 all-atom potential [19]. We obtained similar results to our coarse potential while suffering a 25–50-fold increase in computation time. The significant difference between the two potentials is that the results with the all-atom potential are more pronounced. This is likely due to the increased accuracy of the all-atom potential.

6. Conclusion We have shown that our PRM-based approach for studying protein folding pathways is able to correctly reproduce known folding differences between the structurally similar proteins G and L. This result gives confidence in our approach and implies that it could be valuable for analyzing proteins whose structure is known but for which we lack experimental data on the folding pathway.

Acknowledgments We would like to thank Ken Dill for many useful suggestions and guidance for our work, Jean-Claude Latombe for pointing out the connection between box folding and protein folding, and Michael Levitt and Vijay Pande for useful suggestions.

References [1] Levitt M, Gerstein M, Huang E, Subbiah S and Tsai J 1997 Protein folding: the endgame Annu. Rev. Biochem. 66 549–79 [2] Reeke G N Jr 1988 Protein folding: computational approaches to an exponential-time problem Annu. Rev. Comput. Sci. 3 59–84 [3] Lansbury P T 1999 Evolution of amyloid: what normal protein folding may tell us about fibrillogenesis and disease Proc. Natl Acad. Sci. USA 96 3342–4 [4] Amato N M and Song G 2002 Using motion planning to study protein folding pathways J. Comput. Biol. 9 149–68 (Special issue of Int. Conf. Comput. Mol. Biol. (RECOMB) 2001)

[5] Kavraki L E, Svestka P, Latombe J C and Overmars M H 1996 Probabilistic roadmaps for path planning in highdimensional configuration spaces IEEE Trans. Robot. Autom. 12 566–80 [6] Song G and Amato N M 2001 A motion planning approach to folding: from paper craft to protein folding Proc. IEEE Int. Conf. Robot. Autom. (ICRA) pp 948–53 [7] Li R and Woodward C 1999 The hydrogen exchange core and protein folding Protein Sci. 8 1571–91 [8] Dobson C M, Sali A and Karplus M 1998 Protein folding: a perspective from theory and experiment Angew. Chem. Int. Ed. 37 868–93 [9] Radford S E and Dobson C M 1995 Insights into protein folding using physical techniques: studies of lysozyme and α-lactalbumin Phil. Trans. R. Soc. Lond. B 348 17 [10] Bryngelson J D, Onuchic J N, Socci N D and Wolynes P G 1995 Funnels, pathways, and the energy landscape of protein folding: A synthesis Protein Struct. Funct. Genet. 21 167–95 [11] Song G, Thomas S L, Dill K A, Scholtz J M and Amato N M 2003 A path planning-based study of protein folding with a case study of hairpin formation in proteins G and L Proc. Pacific Symposium of Biocomputing (PSB) pp 240–51 [12] Sternberg M J 1996 Protein Structure Prediction (Oxford: OIRL Press at Oxford University Press) [13] The IMB Jena Image Library of Biological Macromolecules http://www.imb-jena.de [14] Bayazit O B, Song G and Amato N M 2001 Ligand binding with OBPRM and haptic user input: enhancing automatic motion planning with virtual touch Proc. IEEE Int. Conf. Robot. Autom. (ICRA) pp 954–9 (This work was also presented as a poster at RECOMB 2001) [15] Singh A P, Latombe J C and Brutlag D L 1999 A motion planning approach to flexible ligand binding 7th Int. Conf. on Intelligent Systems for Molecular Biology (ISMB) pp 252–61 [16] Song G and Amato N M 2001 Using motion planning to study protein folding pathways Proc. Int. Conf. Comput. Mol. Biol. (RECOMB) pp 287–96 [17] Fiebig K M and Dill K A 1993 Protein core assembly processes J. Chem. Phys. 98 3475–87 [18] Levitt M 1983 Protein folding by restrained energy minimization and molecular dynamics J. Mol. Biol. 170 723–64 [19] Lazaridis T and Karplus M 1999 Effective energy function for proteins in solution Proteins 35 133–52 (http://mingus.sci.ccny.cuny.edu/server/) [20] The Protein Data Bank http://www.rcsb.org/pdb/ [21] Munoz V and Eaton W A 1999 A simple model for calculating the kinetics of protein folding from three dimensional structures Proc. Natl Acad. Sci. USA 96 11311–6 [22] Alm E and Baker D 1999 Prediction of protein-folding mechanisms from free-energy landscapes derived from native structures Proc. Natl Acad. Sci. USA 96 11305–10 [23] Dill K A, Fiebig K M and Chan H S 1993 Cooperativity in protein-folding kinetics Proc. Natl Acad. Sci. USA 90 1942–6 [24] McCallister E L, Alm E and Baker D 2000 Critical role of β-hairpin formation in protein G folding Nat. Struct. Biol. 7 669–73 [25] Kim D E, Fisher C and Baker D 2000 A breakdown of symmetry in the folding transition state of protein l J. Mol. Biol. 298 971–84

S155

protein folding pdf

Protein Folding Paper - Ghosh.pdf

ARE THERE PATHWAYS FOR PROTEIN FOLDING ?

The Protein Folding Problem

Protein Folding and Ligand-Enzyme Binding from Bias ...

Network random walk model of two-state protein folding

Contact Planning for Acyclic Motion with Task ...

Simultaneous Local Motion Planning and Control for ...

Motion planning under bounded uncertainty using ...

A fast heuristic Cartesian space motion planning ...

Modeling and Motion Planning for Mechanisms on a ...

Thesis Problem Definition Proposal: Motion Planning ...

3D Motion Planning Algorithms for Steerable Needles ...

Motion Planning for Human-Robot Collaborative ... - Semantic Scholar

Extracting Protein-Protein Interactions from ... - Semantic Scholar

Motion Planning for Human-Robot Collaborative ... - Semantic Scholar

Motion planning for formations of mobile robots - Semantic Scholar

Optimal Mobile Sensor Motion Planning Under ...

Motion planning and bimanual coordination in ...

Contact Planning for Acyclic Motion with Tasks ...

3D Motion Planning Algorithms for Steerable Needles ... - Ken Goldberg

Screw-Based Motion Planning for Bevel-Tip Flexible ...