Classification of Hub and Non-hub Proteins Based on Chaos Game Representation

Vrinda V. Nair, Dept. of Electr. & Commn. Engg., Govt. Engg. College, Thrissur. Centre for Bioinformatics, University of Kerala, Thiruvananthapuram.
[email protected]
Lissy Anto P., St. Joseph's College, Irinjalakuda. Centre for Bioinformatics, University of Kerala, Thiruvananthapuram.
[email protected]
Achuthsankar S. Nair, Centre for Bioinformatics, University of Kerala, Thiruvananthapuram.
[email protected]

Abstract- Connectivity of proteins in protein interaction networks plays a crucial role in determining the function of the proteins and hence that of the network. Proteins with many interaction partners are termed hubs and those with few interaction partners non-hubs. Past studies reveal that hub proteins possess regions of high intrinsic disorder while non-hubs do not. There have been attempts to differentiate hubs from non-hubs by predicting intrinsic disorder. In this paper, we attempt a hub/non-hub classification based on the amino acid sequence, since the features of the sequence virtually encompass the structural peculiarity resulting in protein disorder. We transform the symbol sequence into an image using the Chaos Game Representation (CGR) algorithm. A quantitative measure from the CGR is then used for classification. Using this method, we obtained an accuracy above 70% for both prokaryotes and eukaryotes.

Keywords- Chaos Game Representation; connectivity; intrinsic disorder; protein interaction network

I. INTRODUCTION

Protein-protein interactions play a pivotal role at almost every level of cell function, such as in the structure of subcellular organelles, the transport machinery across the various biological membranes, packaging of chromatin, the network of sub-membrane filaments, muscle contraction, signal transduction and regulation of gene expression [1]. Information about protein-protein interactions improves our understanding of neurological disorders such as Creutzfeldt-Jakob disease and Alzheimer's disease and can provide the basis for new therapeutic approaches for such diseases. Due to their importance in development and disease, these systems have been the object of intense research for many years [1].

Protein-protein interaction networks are organized as scale-free networks (SFNs), in which a small number of proteins interact with many other proteins while most proteins interact with a small number of partners, conforming to a power law distribution [2], [3]. The scale-free nature of protein-protein interaction networks gives them the advantages of high connectivity and robustness [4]. Proteins which interact with many partners are designated 'hubs' and those with relatively few interactions are termed 'non-hubs'. The hub nature of proteins is attributed to intrinsic disorder in one or both partners in an interaction [4], [5]. These intrinsically disordered (ID) proteins and regions are known to carry out numerous biological functions, including cell signaling, molecular recognition, and various other interactions with proteins and nucleic acids [4].

The disorder content of organism-specific protein interaction networks was studied in [5], which also reports disorder prediction, using PONDR VL-XT, for the interaction networks of four eukaryotic organisms. The comparison of proteins from these networks shows that while the disorder content varies between organisms, hub proteins are consistently found to be more disordered than non-hub proteins in all organisms [5]. Several studies in the literature predict protein disorder. These predictors are based on the assumption that the absence of rigid structure is encoded in specific features of the amino acid sequence [5]. In fact, statistical analysis shows that amino acid sequences encoding ID proteins or regions are significantly different from those of ordered proteins on the basis of local amino acid composition, flexibility, hydropathy, charge, coordination number and several other factors [5]. In [6], a hub protein classifier was developed based on the available interaction data and Gene Ontology (GO) annotations for proteins in the Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Homo sapiens genomes.

In this paper, we attempt to classify hub and non-hub proteins based on the amino acid sequence, since the features of the sequence virtually encompass the structural peculiarity resulting in protein disorder [5]. We obtain an accuracy of 71.96% for prokaryotes and 71.45% for eukaryotes.
Proteins with a connectivity of 3 or below were taken as non-hubs and those with a connectivity of 8 or above as hubs [15].

II. MATERIALS AND METHODS

In this paper, we use a quantitative measure obtained by mapping the sequences to an image using the Chaos Game Representation algorithm. Chaos Game Representation of genome sequences is a scale-independent method of plotting genome sequences, introduced by H. Joel Jeffrey in 1990 [7].
Figure 1. CGR for Human Chromosome 16

To derive the Chaos Game Representation of a genome sequence, a square is first drawn to any desired scale and its corners are marked A, T, G and C. The first point is plotted halfway between the center of the square and the corner corresponding to the first nucleotide of the sequence, and each successive point is plotted halfway between the previous point and the corner corresponding to the next nucleotide. Mathematically, the coordinates of the successive points in the Chaos Game Representation of a DNA sequence are described by the iterated function system defined in (1):

$$X_i = 0.5\,(X_{i-1} + g_{ix}), \qquad Y_i = 0.5\,(Y_{i-1} + g_{iy}) \tag{1}$$
Here $g_{ix}$ and $g_{iy}$ are the X and Y coordinates, respectively, of the corner corresponding to the nucleotide at position i in the sequence [8]. Fig. 1 shows the CGR plotted for Human Chromosome 16. The "double scoop" in the top right quadrant is indicative of the relative sparseness of Guanine following Cytosine in the gene sequence. CGRs drawn for various organisms display other features, such as marked diagonals, varying vertical intensities and absence of diagonals, which signify corresponding sequence characteristics indirectly captured by the signature images [7].

In a CGR, the frequency of occurrence of any oligomer can be obtained by dividing the image into a $2^n \times 2^n$ grid and counting the number of points in each subsquare. This count is an important quantitative measure of the n-mers in the sequence. This representation is known as the Frequency Chaos Game Representation (FCGR) [9], [10]. Numerous applications based on FCGR measures have been reported. FCGR was used to develop an algorithm for aligning and comparing whole genomes [8]. Phylogenetic trees were generated using various distance measures derived from FCGR, and it was concluded that FCGRs contain major phylogenetic information [11].

CGRs of amino acid sequences present a very different challenge, as they can be constructed using a 20-sided regular polygon or, alternatively, using smaller polygons by assigning groups of amino acids to the corners in a variety of ways. For instance, the 20 amino acids can be divided into four classes (non-polar, negative polar, uncharged polar and positive polar) and the corresponding residues assigned to the corners of a square. Literature on protein CGRs is relatively limited, probably due to the cumbersome task of exploring the massive number of combinations possible with n-sided CGRs. It was demonstrated that different protein families exhibit distinct patterns in their CGRs with characteristic grid counts [12].
The grid counts were used as diagnostic features of such protein families for identification of new members of the families. It was shown that CGR could be applied for revealing information relating the primary and 3D structures of proteins [13]. Multifractal and correlation analyses of the measures based on the CGR of amino acid sequences from complete genomes have been reported and attempts have been made to construct a more precise phylogenetic tree of bacteria [14].
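As a concrete illustration of the iterated function system in (1) and of the FCGR grid count described above, the following minimal Python sketch plots a nucleotide sequence in the unit square and counts points per cell. The corner coordinates in `CORNERS` and the function names `cgr_points` and `fcgr` are assumptions made for this example; the paper does not state which corner each nucleotide occupies.

```python
# Assumed corner layout on the unit square (not specified in the text).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR points of a nucleotide sequence, following (1)."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq:
        gx, gy = CORNERS[base]
        # Move halfway from the previous point towards the corner of this base.
        x, y = 0.5 * (x + gx), 0.5 * (y + gy)
        points.append((x, y))
    return points


def fcgr(points, n):
    """Count the points falling in each cell of a 2^n x 2^n grid."""
    size = 2 ** n
    grid = [[0] * size for _ in range(size)]
    for x, y in points:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


points = cgr_points("ACGT")
counts = fcgr(points, 1)
```

With n = 1 each point falls in the quadrant of the corner it was drawn towards, so the four counts are simply the nucleotide frequencies; larger n refines this towards counts of longer oligomers.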
In this paper, classification of hub and non-hub proteins is attempted using hexagonal CGRs. In order to plot the CGR in a hexagon, the amino acids need to be divided into six groups. They can be grouped based on their physico-chemical properties, for example on hydrophobicity. Many hydrophobicity scales are available in the literature, and it is well known that there are significant differences between these scales [16]. Hence, we resort to an alternative method of grouping, using k-means clustering. Each amino acid is replaced by a numerical value taken from an amino acid index. We have considered the 14 amino acid indices listed in Table I. For each property, the amino acid index values of the 20 amino acids are clustered into six groups using k-means clustering. k-means uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The algorithm moves objects between clusters until the sum cannot be decreased further. The result is a set of clusters that are as compact and well-separated as possible [17].

TABLE I.
LIST OF AMINO ACID INDICES USED

Sl.#   AA index
1.     Hydrophobicity
2.     Beta-sheet propensity
3.     Alpha-helix propensity
4.     Normalised frequency of extended structure
5.     Polarizability parameter
6.     NMR chemical shift of Alpha Carbon
7.     Entropy of formation
8.     Alpha Helix index
9.     EIIP
10.    Alpha NH
11.    Average flexibility index
12.    Amino Acid composition
13.    Absolute entropy
14.    Relative frequency of occurrence
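The grouping step can be illustrated with a short sketch of one-dimensional k-means. This is a plain Lloyd-style iteration written for illustration, not the MATLAB routine cited in [17]. The paper does not say which hydrophobicity scale it used, so the Kyte-Doolittle hydropathy values below are an assumption chosen only to have a concrete index to cluster. The function name `kmeans_1d` and its quantile-based initialisation are likewise hypothetical.

```python
# Assumption: Kyte-Doolittle hydropathy values, used here only as an example index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def kmeans_1d(values, k=6, max_iter=100):
    """Cluster scalar index values into k groups; returns {amino acid: label}."""
    ordered = sorted(values.values())
    n = len(ordered)
    # Deterministic start: one initial centroid per quantile of the sorted values.
    centroids = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    labels = {}
    for _ in range(max_iter):
        # Assignment step: each amino acid goes to its nearest centroid.
        new_labels = {
            aa: min(range(k), key=lambda j: abs(v - centroids[j]))
            for aa, v in values.items()
        }
        if new_labels == labels:
            break  # no object changed cluster, so the sum cannot decrease
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [values[aa] for aa, lab in labels.items() if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels


labels = kmeans_1d(KYTE_DOOLITTLE, k=6)
groups = {}
for aa, lab in labels.items():
    groups.setdefault(lab, []).append(aa)
```

Each of the six resulting groups would then be assigned to one corner of the hexagon. Repeating the procedure with a different index generally yields a different grouping, which is why the method builds one CGR per index.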
The six clusters are then assigned to the corners of a hexagon, and the CGR of the amino acid sequence is plotted in a manner similar to that of the square CGR described earlier: for each residue, the corresponding corner is selected and a point is marked midway between that corner and the preceding point. Fig. 2 shows the CGR of a hub protein from a eukaryotic organism, the JSN1 protein of Saccharomyces cerevisiae (Baker's yeast), and Fig. 3 shows the CGR of a non-hub protein from a eukaryotic organism, the human Transcription initiation factor TFIID subunit 1. Here, the amino acids are first clustered into six groups using k-means clustering on the amino acid composition index. The resulting clusters are: Cluster I: A, G; Cluster II: L, S; Cluster III: R, D, P; Cluster IV: V, E, T, K; Cluster V: N, F, Q, Y, I; and Cluster VI: M, C, W, H. These clusters are assigned to the corners of a hexagon in an anticlockwise direction, starting from the vertex at the bottom left corner.

The CGR is then divided into a $2^2 \times 2^2$ (that is, 4 x 4) grid in order to compute the FCGR matrix. The number of points in each cell is counted, giving a 16-element FCGR matrix for each protein. For the same protein sequence, this is repeated for each of the 14 selected AA indices. We thus get a 16 x 14 = 224 element vector representing each protein.

The classification was applied to prokaryotes and eukaryotes separately. Training sequences were taken from prokaryotes as well as eukaryotes. The 224-element vector obtained by averaging over all training hub sequences was taken as the profile hub vector, and the 224-element vector obtained by averaging over all training non-hub sequences was taken as the profile non-hub vector. Test sequences were chosen from both prokaryotes and eukaryotes. For each test protein, a 224-element vector was computed as explained above. Each test vector was compared individually with the profile vectors for classification, using the cos angle distance measure.

A. Classification using cos angle distance measure

The angle θ between two nonzero vectors x and y is given by (2):

$$\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \tag{2}$$
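The profile averaging and the cosine comparison in (2) can be sketched as follows. A test vector is assigned to whichever profile gives the larger cosine. The three-element vectors are made-up toy values standing in for the 224-element FCGR vectors, and the function names `cosine`, `mean_vector` and `classify` are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two nonzero vectors, as in (2)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def mean_vector(vectors):
    """Element-wise average of equal-length vectors (a profile vector)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(test_vec, hub_profile, nonhub_profile):
    """Label a test vector by the profile with the larger cosine."""
    if cosine(test_vec, hub_profile) > cosine(test_vec, nonhub_profile):
        return "hub"
    return "non-hub"


# Toy training vectors (illustrative values only, not data from the paper).
hub_profile = mean_vector([[4, 1, 0], [2, 1, 0]])
nonhub_profile = mean_vector([[0, 1, 4], [0, 1, 2]])

first = classify([5, 2, 1], hub_profile, nonhub_profile)
second = classify([1, 2, 5], hub_profile, nonhub_profile)
```

Because the cosine depends only on direction, two proteins of different length but similar relative grid occupancy produce similar scores, which suits count vectors whose magnitude grows with sequence length.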
The vectors are closest when the angle θ approaches zero, which means that cos θ tends to 1. The cosine between each test vector and each profile vector was computed using (2). If the cosine between the test vector and the profile hub vector was greater than the cosine between the same test vector and the profile non-hub vector, the test vector was classified as hub; otherwise, it was classified as non-hub.

III. RESULTS AND DISCUSSION

A total of 1281 sequences belonging to eukaryotes and 3451 belonging to prokaryotes were taken from the APID database. All sequences with connectivity less than 4 were considered non-hubs and those with connectivity of 8 or above were taken as hubs [15]. Roughly 50% of each set was used for training and the rest for testing. Tables II and III give the sensitivity, specificity and accuracy obtained for the test sequences of eukaryotes and prokaryotes, respectively. The classification statistics are calculated as follows:

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP = True Positive, FP = False Positive, TN = True Negative, and FN = False Negative.

IV. CONCLUSION
It may be inferred that amino acid sequences do in fact carry, to a considerable extent, information regarding the hubness of proteins. The connectivity threshold for designating a protein as a hub or non-hub remains an open question, since several differing views on it are found in the literature. The accuracy may improve if the relationship between hubness and connectivity is well defined. Further, other polygonal CGRs, with other means of assigning symbols to the corners, remain to be explored for improved accuracy. Nevertheless, the method has the merit of simplicity: the classification is based solely on the structure of the symbol sequence.
Figure 2. CGR: Saccharomyces cerevisiae (Baker's yeast), JSN1 protein (Hub)
Figure 3. CGR: Human, Transcription initiation factor TFIID subunit 1 (Non-hub)
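The hexagonal construction behind Figs. 2 and 3, and the 224-element feature vector built from it, can be sketched as below. The cluster list is the grouping reported in the text for the amino acid composition index. Everything geometric is an assumption made for illustration: a flat-bottomed unit hexagon centred at the origin, a start point at the centre, and a 4 x 4 grid laid over the hexagon's bounding box. All function names are hypothetical and the test sequence is a toy string, not a real protein.

```python
import math

# Grouping reported in the text for the amino acid composition index.
CLUSTERS = ["AG", "LS", "RDP", "VETK", "NFQYI", "MCWH"]

# Assumed geometry: vertices listed anticlockwise from the bottom-left one.
VERTICES = [
    (math.cos(math.radians(240 + 60 * i)), math.sin(math.radians(240 + 60 * i)))
    for i in range(6)
]
HALF_HEIGHT = math.sqrt(3) / 2  # half the height of the bounding box


def corner_map(clusters):
    """Map every amino acid to the hexagon vertex of its cluster."""
    return {aa: VERTICES[i] for i, group in enumerate(clusters) for aa in group}


def hex_cgr(seq, corner_of):
    """Plot each residue midway between the previous point and its corner."""
    x, y = 0.0, 0.0
    points = []
    for aa in seq:
        vx, vy = corner_of[aa]
        x, y = 0.5 * (x + vx), 0.5 * (y + vy)
        points.append((x, y))
    return points


def fcgr16(points):
    """Count points in a 4 x 4 grid over the bounding box (16 values)."""
    counts = [0] * 16
    for x, y in points:
        col = min(max(int((x + 1.0) / 2.0 * 4), 0), 3)
        row = min(max(int((y + HALF_HEIGHT) / (2 * HALF_HEIGHT) * 4), 0), 3)
        counts[4 * row + col] += 1
    return counts


def feature_vector(seq, groupings):
    """Concatenate the 16 counts obtained under each grouping."""
    vec = []
    for clusters in groupings:
        vec.extend(fcgr16(hex_cgr(seq, corner_map(clusters))))
    return vec


toy_seq = "MKTAYIAKQR"  # toy sequence for illustration only
# Placeholder: the same grouping repeated 14 times stands in for 14 indices.
vec = feature_vector(toy_seq, [CLUSTERS] * 14)
```

In practice each of the 14 entries in `groupings` would be the six-way clustering obtained from a different amino acid index, so the 14 blocks of 16 counts would differ.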
TABLE II. EUKARYOTES

Actual class   Classified as Hub   Classified as Nonhub   TP    FN    TN    FP    Sensitivity (%)
Hub            277                 108                    277   108   -     -     71.95
Nonhub         75                  181                    -     -     181   75    -

Specificity (%): 70.7
Accuracy (%): 71.45

TABLE III. PROKARYOTES

Actual class   Classified as Hub   Classified as Nonhub   TP    FN    TN    FP    Sensitivity (%)
Hub            974                 237                    974   237   -     -     80.43
Nonhub         247                 268                    -     -     268   247   -

Specificity (%): 52.04
Accuracy (%): 71.96
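The statistics in Tables II and III follow directly from the tabulated counts. The short check below applies the three formulas from Section III to those counts; the function name `classification_stats` is hypothetical.

```python
def classification_stats(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as percentages."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Counts taken from Table II (eukaryotes) and Table III (prokaryotes).
eukaryotes = classification_stats(tp=277, fn=108, tn=181, fp=75)
prokaryotes = classification_stats(tp=974, fn=237, tn=268, fp=247)

for name, stats in (("Eukaryotes", eukaryotes), ("Prokaryotes", prokaryotes)):
    print(name, ["%.2f" % value for value in stats])
```

The recomputed values agree with the tables to two decimal places. They also make the gap in the prokaryote set visible: sensitivity is about 80% while specificity is about 52%, so the overall accuracy there is carried largely by the hub class.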
REFERENCES

[1] Catherine Royer. (2004, May 14). Biophysics Textbook Online. [Online]. Available: www.biophysics.org/education/croyer.pdf
[2] Miho Higurashi, Takashi Ishida, and Kengo Kinoshita, "Identification of transient hub proteins and the possible structural basis for their multiple interactions," Protein Sci., vol. 17, pp. 72–78, Feb. 2008.
[3] Rafael Rangel-Aldao, "Developing countries and systems biology," Nature biotech., vol. 21, pp. 491–492, May 2003.
[4] A. Keith Dunker, S. Marc Cortese, Pedro Romero, M. Lilia Iakouchev and N. Vladimir Uversky, "Flexible nets: The roles of intrinsic disorder in protein interaction networks," FEBS Journal, vol. 272, pp. 5129–5148, Aug. 2005.
[5] Chad Haynes et al., "Intrinsic Disorder Is a Common Feature of Hub Proteins from Four Eukaryotic Interactomes," PLOS comp. biol., vol. 2, no. 8, pp. 890–901, Aug. 2006.
[6] Michael Hsing, Kendall Grant Byler and Artem Cherkasov, "The use of Gene Ontology terms for predicting highly-connected 'hub' nodes in protein-protein interaction networks," BMC Systems Biology, vol. 2:80, Sept. 2008.
[7] H. J. Jeffrey, "Chaos game representation of gene structure," Nucleic Acids Res., vol. 18, pp. 2163–2170, 1990.
[8] Jijoy Joseph and Roschen Sasikumar, "Chaos Game Representation for comparison of whole genomes," BMC Bioinformatics, vol. 7:243, 2006.
[9] J. S. Almeida, A. Joao Carrico, Antonio Maretzek, A. Peter Noble and Madilyn Fletcher, "Analysis of genomic sequences by Chaos Game Representation," Bioinformatics, vol. 17, no. 5, pp. 429–437, Jan. 2001.
[10] P. J. Deschavanne, Alain Giron, Joseph Vilain, Guillaume Fagot and Bernard Fertil, "Genomic signature: characterization and classification of species assessed by chaos game representation of sequences," Mol. Biol. Evol., vol. 16, pp. 1391–1399, 1999.
[11] Yingwei Wang, Kathleen Hill, Shiva Singh and Lila Kari, "The spectrum of genomic signatures: from dinucleotides to chaos game representation," Gene, vol. 346, pp. 173–185, Jan. 2005.
[12] Soumalee Basu, Archana Pan, Chitra Dutta and Jyotirmoy Das, "Chaos game representation of proteins," J. Mol. Graph. Model., vol. 15, pp. 279–289, 1997.
[13] Andras Fiser, Gabor E. Tsunady and Istvan Simon, "Chaos Game Representation of Protein Structures," J. Mol. Graphics, vol. 12, pp. 302–304, Dec. 1994.
[14] Zu-Guo Yu, Vo Anh and Ka-Sing Lau, "Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses," J. Theor. Biol., vol. 226, no. 3, pp. 341–348, 2004.
[15] Diana Ekman, Sara Light, Åsa K. Björklund and Arne Elofsson, "What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae?," Genome Biology, vol. 7:R45, Apr. 2006.
[16] J. Cornette et al., "Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins," J. Mol. Biol., vol. 195, pp. 659–685, 1987.
[17] MATLAB Statistics Toolbox: Multivariate Statistics, The MathWorks Inc., June 2004.