M.E. Bock (1), Claudio Garutti (2), Concettina Guerra (2,3)

1: Dept. of Statistics, Purdue University, USA
2: Dept. of Information Engineering, University of Padova, Italy
3: College of Computing, Georgia Institute of Technology, USA
[email protected], [email protected], [email protected]

Abstract

We present a method for detecting and comparing cavities on protein surfaces. It represents each protein structure as a collection of spin-images and their associated spin-image profiles. The method finds a surface region in a cavity of one protein that is geometrically similar to a surface region in a cavity of another protein, which suggests that the two regions may bind the same ligand.

• The horizontal profile h(s) of a spin-image s is a vector whose ith element h(s)(i) is the number of contiguous zero-elements in row i of s starting at column 0 and ending at the first non-zero cell along row i.
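The horizontal profile defined above can be computed directly from a spin-image stored as a list of rows. The following is a minimal sketch, not the authors' implementation: the function name `horizontal_profile` and the list-of-lists layout are assumptions, and so is the choice to report the full row width when a row has no non-zero cell (the definition does not cover that case).

```python
def horizontal_profile(spin_image):
    """Return h(s): for each row, the number of leading zero pixels."""
    profile = []
    for row in spin_image:
        count = 0
        for pixel in row:
            if pixel != 0:
                break          # first non-zero cell ends the run
            count += 1
        # Assumption: an all-zero row contributes its full width.
        profile.append(count)
    return profile


example = [
    [0, 0, 3, 1],
    [0, 1, 0, 0],
    [2, 0, 0, 0],
    [0, 0, 0, 0],
]
print(horizontal_profile(example))  # [2, 1, 0, 4]
```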

Let Nb = # blocked points, Nu = # unblocked points. In general:

• in a protein, Nb << Nu
• in a cavity, Nb >> Nu
• in a binding site, Nb >> Nu

• The data set consists of 464 binding sites on 244 proteins from [3], of which 112 are enzymes (45.9%), 129 are nonenzymes (52.9%), and three are "hypothetical" proteins (1.2%).

Figure 2: Determination of the sphere using spin-image horizontal profile.

2.1 Cavity detection


2. Methods

3.1 Cavity detection

• The sphere S(s) of a spin-image s is the largest semicircle tangent to the lower-left corner of pixel (0, 0), with radius R and center C on the normal n, such that it contains only empty pixels.

1. Surface Characterization

The molecular surface is represented as a collection of spin-images, each associated with a Connolly surface point P and its normal n. Let (P, n) be the coordinate system with origin at the surface point P and axis along its normal n. In this system, every surface point Q is represented by two coordinates: the perpendicular distance α of Q to n, and the signed perpendicular distance β of Q to the plane T through P perpendicular to n. The spin-image s(P) of a point P is a two-dimensional histogram of the quantized coordinates (α, β) of the surface points w.r.t. (P, n). If βr = βmax − βmin, then the number of rows is h = ⌈βr/ε⌉ and the number of columns is k = ⌈αmax/ε⌉, where ε is the pixel size. A surface point is blocked if its spin-image contains a non-zero pixel with positive β; otherwise it is unblocked.
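The spin-image construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, n is assumed to be a unit vector, row 0 is assumed to correspond to βmin, points on the upper bin edge are clamped into the last row or column, and the blocked test is applied to the unquantized β rather than to pixels.

```python
import math


def spin_coordinates(p, n, q):
    """Return (alpha, beta) of point q in the frame (p, n); n must be a unit vector."""
    d = [q[i] - p[i] for i in range(3)]
    beta = sum(d[i] * n[i] for i in range(3))            # signed distance to the plane T
    dist_sq = sum(c * c for c in d)
    alpha = math.sqrt(max(dist_sq - beta * beta, 0.0))   # distance to the line along n
    return alpha, beta


def spin_image(p, n, points, eps):
    """2D histogram of quantized (alpha, beta); rows index beta, columns index alpha."""
    coords = [spin_coordinates(p, n, q) for q in points]
    betas = [b for _, b in coords]
    beta_min, beta_max = min(betas), max(betas)
    alpha_max = max(a for a, _ in coords)
    h = max(1, math.ceil((beta_max - beta_min) / eps))   # number of rows
    k = max(1, math.ceil(alpha_max / eps))               # number of columns
    image = [[0] * k for _ in range(h)]
    for a, b in coords:
        row = min(int((b - beta_min) / eps), h - 1)      # assumption: row 0 is beta_min
        col = min(int(a / eps), k - 1)
        image[row][col] += 1
    return image


def is_blocked(p, n, points, tol=1e-9):
    """Assumption: test positive beta on raw coordinates instead of on pixels."""
    return any(spin_coordinates(p, n, q)[1] > tol for q in points)


P, N = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
surface = [(1.0, 0.0, 0.0), (0.0, 2.0, 1.0), (3.0, 0.0, -1.0)]
print(spin_image(P, N, surface, eps=1.0))  # [[0, 0, 1], [0, 1, 1]]
print(is_blocked(P, N, surface))           # True
```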

3. Data and results

Figure 1: Statistics of blocked points of proteins and binding sites for the non-redundant data set of 244 proteins defined in [3].

BUILDSPHERE(s)
  R ← |h(s)|/2
  for j = 2, ..., |h(s)|
  begin
    i ← h(s)(j)
    if (i ≥ j)
      then R ← min{R, (i² + j²)/2j}
      else R ← min{R, (i² + (j − 1)²)/2(j − 1)}
    C ← (0, (R + 1))
  end
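The BUILDSPHERE pseudocode above translates almost line by line into Python. The sketch below assumes that h(s)(j) is 1-based, so it reads `profile[j - 1]`, and that the profile is passed as a plain list; the function name `build_sphere` is hypothetical. The expression (i² + j²)/2j is the radius of a circle that passes through the origin and the point (i, j) with its center on the vertical axis, which is consistent with the tangency condition in the definition of S(s).

```python
def build_sphere(profile):
    """Return (R, C) following the BUILDSPHERE pseudocode; profile is h(s)."""
    size = len(profile)
    radius = size / 2
    for j in range(2, size + 1):       # j = 2, ..., |h(s)|
        i = profile[j - 1]             # h(s)(j), assuming 1-based indexing
        if i >= j:
            candidate = (i * i + j * j) / (2 * j)
        else:
            candidate = (i * i + (j - 1) ** 2) / (2 * (j - 1))
        radius = min(radius, candidate)
    center = (0, radius + 1)           # C lies on the normal axis
    return radius, center


print(build_sphere([5, 5, 5, 5]))  # (2.0, (0, 3.0))
print(build_sphere([3, 0, 3, 3]))  # (0.5, (0, 1.5))
```

In the second call the empty run in row 2 has length 0, so the pixel next to the axis limits the radius to 0.5.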

• Cavities are ranked by size: the biggest cavity has rank 1, the second biggest has rank 2, and so on.

3.2 Finding similar binding sites on two proteins

1. For a given protein surface, determine the set of blocked points B.
2. Compute h(s(b)) for every b ∈ B.
3. Determine B′ = {b ∈ B : |h(s(b))| < 10}.
4. For every b ∈ B′, run BUILDSPHERE(s(b)).
5. Determine B″ = {b ∈ B′ : R(S(s(b))) < 1}.
6. Build the undirected graph G = (V, E), where v ∈ V ⇔ b ∈ B″, and e = (vi, vj) ∈ E ⇔ dist(Ci, Cj) < Ri + Rj, where Ci = C(S(s(bi))) and Ri = R(S(s(bi))).
7. Find the connected components G1, ..., Gn of G using breadth-first search.
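Steps 6 and 7 above amount to building an overlap graph on the spheres and collecting its connected components. The sketch below is one way to do this, under the assumption that each sphere center has already been mapped to 3D coordinates; the function name `cavity_components` and the `(center, radius)` tuple layout are hypothetical. The pairwise loop is quadratic in the number of spheres, which keeps the example short but may not suit large surfaces.

```python
from collections import deque
import math


def cavity_components(spheres):
    """spheres: list of (center, radius). Returns lists of sphere indices, one per component."""
    n = len(spheres)
    adjacency = [[] for _ in range(n)]
    # Step 6: connect two spheres when dist(Ci, Cj) < Ri + Rj.
    for a in range(n):
        for b in range(a + 1, n):
            (ca, ra), (cb, rb) = spheres[a], spheres[b]
            if math.dist(ca, cb) < ra + rb:
                adjacency[a].append(b)
                adjacency[b].append(a)
    # Step 7: breadth-first search from every unvisited vertex.
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adjacency[v]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
        components.append(component)
    return components


spheres = [
    ((0.0, 0.0, 0.0), 1.0),
    ((1.5, 0.0, 0.0), 1.0),
    ((10.0, 0.0, 0.0), 1.0),
    ((11.0, 0.0, 0.0), 0.5),
]
print(cavity_components(spheres))  # [[0, 1], [2, 3]]
```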

2.2 Finding similar binding sites on two proteins

1. Build the spin-image representation of the surface points of the two proteins.
2. For each protein, find the surface cavities and select the largest one(s).
3. Compare pairs of cavities, one per protein, by identifying and grouping sets of corresponding points based on the correlation of their associated spin-images, using MolLoc [1] [2].
4. Return the regions on the two cavities that are most similar.

Given a protein P, a binding site B on P, and the set of atoms S identified on P in the comparison, we define coverage = |S ∩ B| / |B| and accuracy = |S ∩ B| / |S|.

Using just the cavities instead of the whole surfaces:

• Execution times are reduced from 1–2 hours to a few minutes or even seconds
• Coverage and accuracy improve by up to 21%
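The coverage and accuracy definitions above reduce to two set ratios. The following sketch assumes atoms are identified by hashable IDs (for example, serial numbers) and returns 0.0 when a set is empty, a convention not stated in the text; the function name is hypothetical.

```python
def coverage_and_accuracy(binding_site, identified):
    """Return (coverage, accuracy) = (|S ∩ B| / |B|, |S ∩ B| / |S|)."""
    b, s = set(binding_site), set(identified)
    overlap = len(s & b)
    coverage = overlap / len(b) if b else 0.0   # assumption: 0.0 for an empty B
    accuracy = overlap / len(s) if s else 0.0   # assumption: 0.0 for an empty S
    return coverage, accuracy


cov, acc = coverage_and_accuracy(binding_site={1, 2, 3, 4}, identified={3, 4, 5})
print(cov)  # 0.5
```

Here two of the four binding-site atoms are recovered, so coverage is 0.5, while two of the three identified atoms lie in the binding site.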

Figure 3: Number of atoms in the binding site versus number of atoms in the ligand for all 244 proteins of the data set. The dotted line is the least-squares line.

• The more atoms a binding site has, the better the rank of the corresponding cavity.
• 76% of the binding sites lie in one of the four biggest cavities.

References

[1] M.E. Bock, G. Cortelazzo, C. Ferrari and C. Guerra (2005). Identifying similar surface patches on proteins using a spin-image surface representation. Proc. Combinatorial Pattern Matching CPM 2005, 417–428.
[2] M.E. Bock, C. Garutti and C. Guerra (2007). Discovery of Similar Regions on Protein Surfaces. J. Comp. Biol., 14(3):285–299.
[3] F. Glaser, R.J. Morris, R.J. Najmanovich, R.A. Laskowski and J.M. Thornton (2006). A Method for Localizing Ligand Binding Pockets in Protein Structures. PROTEINS: Structure, Function, and Bioinformatics, 62:479–488.

6th Annual International Conference on Computational Systems Bioinformatics CSB2007, 13-17 August 2007, University of California, San Diego

Figure 4: Distribution of binding sites by cavity rank and # atoms of binding site.

PDB ID   # residues in   Coverage,    Coverage,           Accuracy,
         binding site    MolLoc [2]   cavity comparison   cavity comparison
1atp     23              78%          91%                 80%
1phk     26              69%          90%                 76%
1atp     23              70%          78%                 75%
1csn     26              62%          80%                 91%
1atp     23              26%          34%                 100%
1mjh:B   25              24%          32%                 88%
1atp     23              39%          56%                 92%
1hck     24              42%          58%                 87%
1atp     23              43%          60%                 93%
1nsf     23              35%          43%                 76%

Table 1: Comparison with results obtained with MolLoc [2].