Classification of Hub and Non-hub Proteins Based on Chaos Game Representation

Vrinda V. Nair, Dept. of Electr. & Commn. Engg., Govt. Engg. College, Thrissur. Centre for Bioinformatics, University of Kerala, Thiruvananthapuram. [email protected]

Lissy Anto P., St. Joseph’s College, Irinjalakuda. Centre for Bioinformatics, University of Kerala, Thiruvananthapuram. [email protected]

Achuthsankar S. Nair, Centre for Bioinformatics, University of Kerala, Thiruvananthapuram. [email protected]

Abstract- The connectivity of proteins in protein interaction networks plays a crucial role in determining the function of the proteins and hence that of the network. Proteins with many interaction partners are termed hubs and those with few interaction partners non-hubs. Past studies reveal that hub proteins possess regions of high intrinsic disorder while non-hubs do not. There have been attempts to differentiate hubs from non-hubs by predicting intrinsic disorder. In this paper, we attempt a hub/non-hub classification based on the amino acid sequence, since the features of the sequence virtually encompass the structural peculiarity resulting in protein disorder. We transform the symbol sequence into an image using the Chaos Game Representation (CGR) algorithm. A quantitative measure from the CGR is then used for classification. Using this method, we obtained an accuracy above 70% for both prokaryotes and eukaryotes.

Keywords- Chaos Game Representation; connectivity; intrinsic disorder; protein interaction network

I. INTRODUCTION

Protein-protein interactions play a pivotal role at almost every level of cell function, such as in the structure of subcellular organelles, the transport machinery across the various biological membranes, packaging of chromatin, the network of sub-membrane filaments, muscle contraction, signal transduction and regulation of gene expression [1]. Information about protein-protein interactions improves our understanding of neurological disorders such as Creutzfeldt-Jakob and Alzheimer's disease and can provide the basis for new therapeutic approaches for such diseases. Due to their importance in development and disease, these systems have been the object of intense research for many years [1].

Protein-protein interaction networks are organized as scale-free networks (SFNs), in which a small number of proteins interact with many other proteins while most proteins interact with a small number of partners, conforming to a power-law distribution [2], [3]. The scale-free nature of protein-protein interaction networks gives them the advantages of high connectivity and robustness [4]. Proteins that interact with many partners are designated ‘hubs’ and those with relatively few interactions are termed ‘non-hubs’. The hub nature of proteins is attributed to intrinsic disorder in one or both partners in an interaction [4], [5]. These intrinsically disordered (ID) proteins and regions are known to carry out numerous biological functions including cell signaling, molecular recognition, and various other interactions with proteins and nucleic acids [4].

The disorder content of organism-specific protein interaction networks has been studied in [5]. The prediction of disorder in the interaction networks from four eukaryotic organisms, carried out using PONDR VL-XT, is also reported. The comparison of proteins from these networks shows that while the disorder content varies between organisms, hub proteins are consistently found to be more disordered than non-hub proteins in all organisms [5]. Several studies in the literature predict protein disorder. These predictors are based on the assumption that the absence of rigid structure is encoded in specific features of the amino acid sequence [5]. In fact, statistical analysis shows that amino acid sequences encoding ID proteins or regions are significantly different from those of ordered proteins on the basis of local amino acid composition, flexibility, hydropathy, charge, coordination number and several other factors [5]. A hub protein classifier was developed in [6] based on the available interaction data and Gene Ontology (GO) annotations for proteins in the Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Homo sapiens genomes.

In this paper, we attempt to classify hub and non-hub proteins based on the amino acid sequence, since the features of the sequence virtually encompass the structural peculiarity resulting in protein disorder [5]. We obtain an accuracy of 71.96% for prokaryotes and 71.45% for eukaryotes.
Proteins with connectivity 3 and below were assumed to be non-hubs and those with connectivity 8 and above to be hubs [15].

II. MATERIALS AND METHODS

In this paper, we use a quantitative measure obtained by mapping the sequences to an image using the Chaos Game Representation algorithm. Chaos Game Representation of genome sequences is a scale-independent method of plotting genome sequences, introduced by H. Joel Jeffrey in 1990 [7].

To derive the Chaos Game Representation of a genome sequence, a square is first drawn to any desired scale and its corners are marked A, T, G and C. The first point is plotted halfway between the center of the square and the corner corresponding to the first nucleotide of the sequence, and successive points are plotted halfway between the previous point and the corner corresponding to each successive nucleotide.

Figure 1. CGR for Human Chromosome 16

Mathematically, the coordinates of the successive points in the Chaos Game Representation of a DNA sequence are described by the iterated function system defined in (1):

$$X_i = 0.5\,(X_{i-1} + g_{ix}), \qquad Y_i = 0.5\,(Y_{i-1} + g_{iy}) \tag{1}$$

$g_{ix}$ and $g_{iy}$ are the X and Y coordinates, respectively, of the corner corresponding to the nucleotide at position i in the sequence [8]. Fig. 1 shows the CGR plotted for Human Chromosome 16. The “double scoop” in the top right quadrant is indicative of the relative sparseness of guanine following cytosine in the gene sequence. Other features in CGRs drawn for various organisms include marked diagonals, varying vertical intensities and absence of diagonals, signifying corresponding sequence characteristics indirectly captured by the signature images [7].

In a CGR, the frequency of occurrence of any oligomer of length n can be obtained by dividing the image into a $2^n \times 2^n$ grid and counting the number of points in each subsquare. This count is an important quantitative measure of the n-mers in the sequence. This representation is known as the Frequency Chaos Game Representation (FCGR) [9], [10]. Numerous applications based on FCGR measures have been reported. FCGR was used to develop an algorithm for aligning and comparing whole genomes [8]. Phylogenetic trees were generated using various distance measures derived from FCGR, and it was concluded that FCGRs contain major phylogenetic information [11].

CGRs of amino acid sequences present a very different challenge, as they can be constructed using a 20-sided regular polygon or, alternatively, using smaller polygons, by assigning groups of amino acids to the corners in a variety of ways. For instance, the 20 amino acids can be divided into four classes (non-polar, negative polar, uncharged polar and positive polar) and the corresponding residues assigned to the corners of a square. Literature on protein CGRs is relatively limited, probably due to the cumbersome task of exploring the massive number of combinations possible with n-sided CGRs. It was demonstrated that different protein families exhibit distinct patterns in their CGRs with characteristic grid counts [12]. The grid counts were used as diagnostic features of such protein families for identification of new members of the families. It was shown that CGR could be applied to reveal information relating the primary and 3D structures of proteins [13]. Multifractal and correlation analyses of measures based on the CGR of amino acid sequences from complete genomes have been reported, and attempts have been made to construct a more precise phylogenetic tree of bacteria [14].
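The midpoint rule of (1) and the grid counting that yields the FCGR can be illustrated with a minimal Python sketch. The function names `cgr_points` and `fcgr` are hypothetical, and the assignment of A, C, G and T to particular corners of the unit square is an assumption made for illustration; the text only states that the four corners are labelled with the four nucleotides.

```python
# Corner coordinates on the unit square; this particular layout is an assumption.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Return the CGR points of a DNA string using the midpoint rule of (1)."""
    x, y = 0.5, 0.5  # start at the center of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        gx, gy = CORNERS[base]
        x, y = 0.5 * (x + gx), 0.5 * (y + gy)
        points.append((x, y))
    return points

def fcgr(points, n):
    """Count the points in each cell of a 2^n x 2^n grid over the unit square."""
    size = 2 ** n
    grid = [[0] * size for _ in range(size)]
    for x, y in points:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid

pts = cgr_points("ATGCGCGATA")
counts = fcgr(pts, 1)
```

With n = 1 every point falls in the quadrant of the corner it was drawn towards, so the four counts are simply the base counts of the sequence; finer grids resolve longer oligomers in the same way.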

In this paper, classification of hub and non-hub proteins is attempted using hexagonal CGRs. In order to plot the CGR in a hexagon, the amino acids need to be grouped into 6 groups. They can be grouped based on their physico-chemical properties, for example hydrophobicity. Many hydrophobicity scales are available in the literature, and it is well known that there are significant differences between them [16]; hence, we resort to an alternative method of grouping, using k-means clustering. Each amino acid is replaced by a numerical value corresponding to an amino acid index. We have considered the 14 amino acid indices listed in Table I. For each index, the values corresponding to the 20 amino acids are clustered into 6 groups using k-means clustering. k-means uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The algorithm moves objects between clusters until the sum cannot be decreased further. The result is a set of clusters that are as compact and well separated as possible [17].

TABLE I.

LIST OF AMINO ACID INDICES USED

| Sl. # | AA index |
|---|---|
| 1 | Hydrophobicity |
| 2 | Beta-sheet propensity |
| 3 | Alpha-helix propensity |
| 4 | Normalised frequency of extended structure |
| 5 | Polarizability parameter |
| 6 | NMR chemical shift of Alpha Carbon |
| 7 | Entropy of formation |
| 8 | Alpha Helix index |
| 9 | EIIP |
| 10 | Alpha NH |
| 11 | Average flexibility index |
| 12 | Amino Acid composition |
| 13 | Absolute entropy |
| 14 | Relative frequency of occurrence |
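As an illustration of the grouping step, the following sketch clusters one scalar index over the 20 amino acids into 6 groups with a plain Lloyd-style k-means. It is not the MATLAB routine cited in [17]: the function name `kmeans_1d`, the deterministic initialisation and the handling of empty clusters are assumptions, and the Kyte-Doolittle hydropathy values are used only as a stand-in index, so the resulting groups need not match those reported in this paper.

```python
# Kyte-Doolittle hydropathy values, used here only as a stand-in index
# (an assumption; the text does not name the exact scale it used).
INDEX = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on scalars; returns one cluster label per value."""
    ordered = sorted(values)
    # Deterministic start: centroids spread evenly over the sorted values.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by absolute distance.
        new_labels = [
            min(range(k), key=lambda j: abs(v - centroids[j])) for v in values
        ]
        # Update step: mean of each cluster; an empty cluster keeps its centroid.
        for j in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels

residues = sorted(INDEX)
labels = kmeans_1d([INDEX[r] for r in residues], k=6)

# Collect the residues of each cluster; each cluster would map to one hexagon corner.
groups = {}
for residue, label in zip(residues, labels):
    groups.setdefault(label, []).append(residue)
```

Repeating this for each index in Table I would give one grouping, and hence one hexagonal CGR, per index.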

The 6 clusters are then assigned to the corners of a hexagon, and the CGRs of the amino acid sequences are plotted in a manner similar to that of the square CGRs described earlier, by selecting the corresponding corner and marking a point midway between that corner and the preceding point. Fig. 2 gives the CGR for a hub protein from a eukaryotic organism, the JSN1 protein of Saccharomyces cerevisiae (Baker's yeast), and Fig. 3 gives the CGR for a non-hub protein from a eukaryotic organism, the human Transcription initiation factor TFIID subunit 1. Here, the amino acids are first clustered into 6 groups using k-means clustering, based on the amino acid composition index. The resulting clusters are: Cluster I: A, G; Cluster II: L, S; Cluster III: R, D, P; Cluster IV: V, E, T, K; Cluster V: N, F, Q, Y, I; and Cluster VI: M, C, W, H. These clusters are assigned to the corners of a hexagon in an anticlockwise direction, starting from the vertex at the bottom left corner.

The CGR is then divided into a $2^2 \times 2^2$ grid in order to compute the FCGR matrix. The number of points in each cell is counted, giving a 16-element FCGR matrix for each protein. For the same protein sequence, this is repeated for each of the 14 selected AA indices. We thus get a 16 x 14 = 224 element vector representing each protein.

The classification was applied to prokaryotes and eukaryotes separately. Training sequences were taken from prokaryotes as well as eukaryotes; a 224-element vector that is the average over all training hub sequences was taken as the profile hub vector, and another 224-element vector that is the average over all training non-hub sequences was taken as the profile non-hub vector. Test sequences from both prokaryotes and eukaryotes were chosen. For each test protein, a 224-element vector was computed as explained above. Each test vector was compared individually with the profile vectors for classification, using the cos angle distance measure.

A. Classification using cos angle distance measure

The cosine of the angle θ between two nonzero vectors x and y is given by (2):

$$\cos\theta = \frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert} \tag{2}$$
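The feature extraction and the cosine comparison of (2) can be sketched as follows, under stated assumptions. The cluster grouping is the one listed above for the amino acid composition index; the hexagon geometry (unit circumradius, corners anticlockwise from 240 degrees), the use of the bounding square as the grid area, and all function names (`hex_fcgr`, `cosine`, `mean_vector`, `classify`) are illustrative choices that the text does not specify. The sketch produces the 16 counts for a single index; the full 224-element vector would concatenate the counts obtained for all 14 indices.

```python
import math

# Cluster grouping for the amino acid composition index, as listed in the
# text (Clusters I to VI).
CLUSTERS = ["AG", "LS", "RDP", "VETK", "NFQYI", "MCWH"]

# Assumed geometry: regular hexagon of unit circumradius centered at the
# origin, corners taken anticlockwise from the bottom-left vertex (240 deg).
CORNERS = [
    (math.cos(math.radians(240 + 60 * k)), math.sin(math.radians(240 + 60 * k)))
    for k in range(6)
]
CORNER_OF = {aa: CORNERS[i] for i, group in enumerate(CLUSTERS) for aa in group}

def hex_fcgr(sequence, n=2):
    """Flattened 2^n x 2^n grid counts of the hexagonal CGR of a protein."""
    size = 2 ** n
    counts = [0] * (size * size)
    x, y = 0.0, 0.0  # start at the center of the hexagon
    for aa in sequence.upper():
        if aa not in CORNER_OF:
            continue  # skip non-standard residues
        gx, gy = CORNER_OF[aa]
        x, y = 0.5 * (x + gx), 0.5 * (y + gy)
        # Assumed grid area: the bounding square [-1, 1] x [-1, 1].
        col = min(int((x + 1.0) / 2.0 * size), size - 1)
        row = min(int((y + 1.0) / 2.0 * size), size - 1)
        counts[row * size + col] += 1
    return counts

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors, as in (2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise average, used to build a profile vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def classify(test_vec, hub_profile, nonhub_profile):
    """Label a test vector by the profile with the larger cosine value."""
    if cosine(test_vec, hub_profile) > cosine(test_vec, nonhub_profile):
        return "hub"
    return "non-hub"
```

In use, `mean_vector` would be applied once to the feature vectors of the training hubs and once to those of the training non-hubs, and `classify` would then be called on each test vector with the two resulting profiles.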

The vectors are closest when the angle θ approaches zero, which means cos θ tends to 1. The cos angle measure between each of the test vectors and the profile vectors was obtained using (2). If the cosine value between the test vector and the profile hub vector was greater than the cosine value between the same test vector and the profile non-hub vector, the test vector was classified as hub; otherwise, it was classified as non-hub.

III. RESULTS AND DISCUSSION

A total of 1281 sequences belonging to eukaryotes and 3451 belonging to prokaryotes were taken from the APID database. All sequences with connectivity less than 4 were considered non-hubs and those with connectivity of 8 and above were taken as hubs [15]. Roughly 50% of each set was taken for training and the rest for testing. Tables II and III give the sensitivity, specificity and accuracy obtained for the test sequences of eukaryotes and prokaryotes, respectively. The classification statistics are calculated as follows:

Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
Accuracy = (TP + TN)/(TP + TN + FP + FN)

where TP = True Positive, FP = False Positive, TN = True Negative, and FN = False Negative.

IV. CONCLUSION

It may be inferred that amino acid sequences do in fact conceal information regarding the hubness of proteins to a considerable extent. The connectivity threshold for assigning a protein as hub or non-hub is still open to question, since several views on it are found in the literature. The accuracy may improve if the relationship between hubness and connectivity is well defined. Further, it is also possible to experiment with other polygonal CGRs and with other means of assigning symbols to the corners, which can be explored for improved accuracy. Nevertheless, the method is simple in that the classification is based solely on the structure of the symbol sequence.


Figure 2. CGR - Saccharomyces cerevisiae (Baker's yeast), JSN1 protein (Hub)

Figure 3. CGR - Human, Transcription initiation factor TFIID subunit 1 (Non-hub)

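TABLE II. EUKARYOTES

| Actual class | Classified as Hub | Classified as Nonhub | TP | FN | TN | FP | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|---|---|
| Hub | 277 | 108 | 277 | 108 | - | - | 71.95 | - |
| Nonhub | 75 | 181 | - | - | 181 | 75 | - | 70.7 |

Accuracy (%): 71.45

TABLE III. PROKARYOTES

| Actual class | Classified as Hub | Classified as Nonhub | TP | FN | TN | FP | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|---|---|
| Hub | 974 | 237 | 974 | 237 | - | - | 80.43 | - |
| Nonhub | 247 | 268 | - | - | 268 | 247 | - | 52.04 |

Accuracy (%): 71.96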
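The statistics in Tables II and III follow directly from the confusion counts and the three formulas given in the Results section. A short Python check is sketched below; the function name `classification_stats` is hypothetical, and the counts are those reported in the tables.

```python
def classification_stats(tp, fn, tn, fp):
    """Return sensitivity, specificity and accuracy as percentages."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

# Counts taken from Tables II and III.
eukaryotes = classification_stats(tp=277, fn=108, tn=181, fp=75)
prokaryotes = classification_stats(tp=974, fn=237, tn=268, fp=247)

print([round(v, 2) for v in eukaryotes])   # [71.95, 70.7, 71.45]
print([round(v, 2) for v in prokaryotes])  # [80.43, 52.04, 71.96]
```

The counts also show that the prokaryote test set is dominated by hubs (1211 of 1726 sequences), which is worth keeping in mind when comparing the sensitivity of 80.43% with the specificity of 52.04%.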

REFERENCES

[1] Catherine Royer. (2004, May 14). Biophysics Textbook Online. [Online]. Available: www.biophysics.org/education/croyer.pdf
[2] Miho Higurashi, Takashi Ishida, and Kengo Kinoshita, “Identification of transient hub proteins and the possible structural basis for their multiple interactions,” Protein Sci., vol. 17, pp. 72–78, Feb. 2008.
[3] Rafael Rangel-Aldao, “Developing countries and systems biology,” Nature Biotech., vol. 21, pp. 491–492, May 2003.
[4] A. Keith Dunker, S. Marc Cortese, Pedro Romero, M. Lilia Iakouchev and N. Vladimir Uversky, “Flexible nets: The roles of intrinsic disorder in protein interaction networks,” FEBS Journal, vol. 272, pp. 5129–5148, Aug. 2005.
[5] Chad Haynes et al., “Intrinsic Disorder Is a Common Feature of Hub Proteins from Four Eukaryotic Interactomes,” PLOS Comp. Biol., vol. 2, no. 8, pp. 890–901, Aug. 2006.
[6] Michael Hsing, Kendall Grant Byler and Artem Cherkasov, “The use of Gene Ontology terms for predicting highly-connected 'hub' nodes in protein-protein interaction networks,” BMC Systems Biology, vol. 2:80, Sept. 2008.
[7] H. J. Jeffrey, “Chaos game representation of gene structure,” Nucleic Acids Res., vol. 18, pp. 2163–2170, 1990.
[8] Jijoy Joseph and Roschen Sasikumar, “Chaos Game Representation for comparison of whole genomes,” BMC Bioinformatics, no. 7, p. 243, 2006.
[9] J.S. Almeida, A. Joao Carrico, Antonio Maretzek, A. Peter Noble and Madilyn Fletcher, “Analysis of genomic sequences by Chaos Game Representation,” Bioinformatics, vol. 17, no. 5, pp. 429–437, Jan. 2001.
[10] P. J. Deschavanne, Alain Giron, Joseph Vilain, Guillaume Fagot and Bernard Fertil, “Genomic signature: characterization and classification of species assessed by chaos game representation of sequences,” Mol. Biol. Evol., vol. 16, pp. 1391–1399, 1999.
[11] Yingwei Wang, Kathleen Hill, Shiva Singh and Lila Kari, “The spectrum of genomic signatures: from dinucleotides to chaos game representation,” Gene, vol. 346, pp. 173–185, Jan. 2005.
[12] Soumalee Basu, Archana Pan, Chitra Dutta and Jyotirmoy Das, “Chaos game representation of proteins,” J. Mol. Graph. Model., vol. 15, pp. 279–289, 1997.
[13] Andras Fiser, Gabor E. Tsunady and Istvan Simon, “Chaos Game Representation of Protein Structures,” J. Mol. Graphics, vol. 12, pp. 302–304, Dec. 1994.
[14] Zu-Guo Yu, Vo Anh and Ka-Sing Lau, “Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses,” J. Theor. Biol., vol. 226, no. 3, pp. 341–348, 2004.
[15] Diana Ekman, Sara Light, Åsa K Björklund and Arne Elofsson, “What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae?”, Genome Biology, vol. 7:R45, Apr. 2006.
[16] J. Cornette et al., “Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins,” J. Mol. Biol., vol. 195, pp. 659–685, 1987.
[17] MATLAB Statistics Toolbox, Multivariate Statistics, The MathWorks Inc., June 2004.
