Biochemical and Biophysical Research Communications 391 (2010) 1670–1674


Biochemical and Biophysical Research Communications journal homepage: www.elsevier.com/locate/ybbrc

Protein location prediction using atomic composition and global features of the amino acid sequence

Betsy Sheena Cherian *, Achuthsankar S. Nair

Centre for Bioinformatics, University of Kerala, Kariyavattom Campus, Thiruvananthapuram, Kerala, India

Article info

Article history: Received 14 December 2009; Available online 28 December 2009

Keywords: Subcellular localization; Amino acid composition; Atomic composition; Physiochemical properties; Sequence similarity

Abstract

The subcellular location of a protein is useful information for determining its function, screening drug candidates, designing vaccines, annotating gene products and selecting relevant proteins for further study. Computational prediction of subcellular localization predicts the location of a protein from its amino acid sequence. To be accurate, a computational localization prediction method should exploit all relevant biological features that contribute to subcellular localization. In this work, we extracted the biological features from the full-length protein sequence to incorporate more biological information. A new biological feature, the distribution of atomic composition, is used together with multiple physiochemical properties, amino acid composition, three-part amino acid composition and sequence similarity to predict the subcellular location of a protein. Support Vector Machines are designed for four modules, and the prediction is made by a weighted voting system. Our system predicts with accuracies of 100%, 82.47% and 88.81% for the self-consistency test, jackknife test and independent data test, respectively. Our results provide evidence that prediction based on biological features derived from the full-length amino acid sequence is more accurate than prediction based on features derived from the N-terminal alone. Treating the features as a distribution over the entire sequence brings out the underlying property distribution in greater detail and enhances prediction accuracy.

© 2009 Elsevier Inc. All rights reserved.

Introduction

The cell is the basic unit of life, and proteins are the workhorses of the cell. For a protein to perform its function, it must be located in its targeted cellular location. Information about a protein's location in the cell gives insight into the function of the protein and is useful in screening candidates for drug discovery and vaccine design, annotating gene products and selecting relevant proteins for further studies. Computational subcellular localization prediction methods predict the location of a protein from its amino acid sequence. The success of computational subcellular localization prediction relies on two important components. The first is the extraction of biological features that are relevant to subcellular localization, and the second is the computational technique employed for making the prediction [1]. The biological features used for prediction include detection of protein sorting signals, amino acid composition, physiochemical properties and homology search [2–8]. Most proteins, which are synthesized in the ribosomes, are translocated to their destinations by an inherent signal, known

* Corresponding author. E-mail address: [email protected] (B.S. Cherian). 0006-291X/$ - see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.bbrc.2009.12.118

as the protein sorting signal, usually present at the N-terminal of the amino acid sequence. Sorting signals of proteins for various locations such as the mitochondria, chloroplast, nucleus and peroxisome have been identified [9–22]. Many computational subcellular prediction methods use the presence of sorting signals for prediction [2,6,23–25]. Amino acid composition, in contrast, deals with the entire protein sequence and has its own advantages. The pseudo-amino acid composition proposed by Chou [26–31] incorporates parameters that reflect the sequence order effect along with the amino acid composition [32–36]. Several newer methods consider the amino acid sequence as three parts (N-terminal, middle region and C-terminal) to enhance the extraction of various biological features [1,33]. The physiochemical properties of the amino acids are of great relevance in subcellular localization prediction [1,4,37,38]. The most widely considered physiochemical parameters include hydrophobicity, accessibility, flexibility and distribution ratio. Sequence similarity is another potential biological feature for inferring subcellular location information. The Needleman–Wunsch and Smith–Waterman algorithms for sequence alignment have been used for subcellular localization prediction [39,40].

In this work, priority is given to the global features of the amino acid sequence rather than features of a single part such as the N-terminal. The biological features for prediction include atomic composition of the full sequence, multiple physiochemical properties for the full

B.S. Cherian, A.S. Nair / Biochemical and Biophysical Research Communications 391 (2010) 1670–1674

sequence, amino acid composition of the full sequence, three-part amino acid composition of the full sequence, and sequence similarity for the entire sequence.

Materials and methods

For this study, we used the dataset compiled by Chou [8,27,41]. The training set contains amino acid sequences of 145 chloroplast proteins, 571 cytoplasmic proteins, 34 cytoskeleton proteins, 49 endoplasmic reticulum proteins, 224 extracellular proteins, 25 Golgi apparatus proteins, 37 lysosome proteins, 84 mitochondria proteins, 272 nucleus proteins, 27 peroxisome proteins, 699 plasma membrane proteins and 24 vacuole proteins, 2191 protein sequences altogether. The independent dataset has a total of 2494 proteins: 112 chloroplast proteins, 761 cytoplasmic proteins, 19 cytoskeleton proteins, 106 endoplasmic reticulum proteins, 95 extracellular proteins, 4 Golgi apparatus proteins, 31 lysosome proteins, 163 mitochondria proteins, 418 nucleus proteins, 23 peroxisome proteins and 762 plasma membrane proteins. None of the proteins in the independent dataset occurs in the training dataset.

The Support Vector Machine (SVM) was proposed by Vapnik [42] as a very effective method for general-purpose supervised pattern recognition. Many subcellular localization prediction tools make use of SVMs [1,3–5,8,43]. SVMs perform well in practical applications and are well founded theoretically. They are popular because of their high performance, their adaptability and their ability to deal with data in a high-dimensional feature space. An SVM can classify nonlinear data using a kernel transformation: the data are mapped into a high-dimensional feature space, and the optimal separating hyperplane is then determined. Since this work deals with proteins of 11 locations, this is a multi-class problem. There are two approaches for SVMs to handle a multi-class problem, "one-against-one" and "one-against-all". We employed the "one-against-all" approach for making predictions.
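The "one-against-all" decomposition mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the binary scorer is a simple nearest-centroid linear rule standing in for a binary SVM, and the function names (`binary_labels`, `train_centroid_scorer`, `train_one_vs_all`, `predict`) are hypothetical, not taken from the paper or from LIBSVM.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def binary_labels(labels: List[str], positive: str) -> List[int]:
    """Relabel a multi-class problem as 'positive class vs. all the rest'."""
    return [1 if y == positive else -1 for y in labels]


def train_centroid_scorer(X: List[Vector], y: List[int]) -> Callable[[Vector], float]:
    """Stand-in for a binary SVM (assumption): a linear nearest-centroid scorer.

    The returned decision function is positive for points closer to the
    positive-class centroid and negative for points closer to the other one.
    """
    dim = len(X[0])
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == -1]
    mu_pos = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mu_neg = [sum(x[d] for x in neg) / len(neg) for d in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    b = -0.5 * (sum(p * p for p in mu_pos) - sum(n * n for n in mu_neg))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def train_one_vs_all(X: List[Vector], labels: List[str]) -> Dict[str, Callable[[Vector], float]]:
    """Train one binary scorer per class, each against all remaining classes."""
    return {c: train_centroid_scorer(X, binary_labels(labels, c))
            for c in sorted(set(labels))}


def predict(models: Dict[str, Callable[[Vector], float]], x: Vector) -> str:
    """Pick the class whose binary scorer returns the largest decision value."""
    return max(models, key=lambda c: models[c](x))
```

With a real SVM, each `train_centroid_scorer` call would be replaced by training one binary SVM and using its decision value; the relabelling and the final arg-max step stay the same.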
We used LIBSVM 2.9 [44] to build the SVM modules. The Radial Basis Function (RBF) kernel is used for all modules. The kernel parameter $\gamma$ and the regularization parameter $C$ were optimized on the training set. The input features used for prediction are atomic composition, multiple physiochemical properties, amino acid composition, three-part amino acid composition and sequence similarity.

Atomic composition is the number of constituent atoms in an amino acid sequence. Because the side chain atoms of an amino acid determine its properties, and amino acid composition is itself a powerful parameter for localization prediction, we hypothesize that atomic composition can serve as a feature for localization prediction. Amino acids are made up of carbon, hydrogen, nitrogen, oxygen and sulphur atoms, and the atomic composition gives the total number of atoms of each type in an amino acid sequence. To reveal the distribution of the atomic composition in greater detail, we logically divided each of the N-terminal, middle and C-terminal regions into 3 subregions, giving 9 segments in total. Let $P$ be a protein sequence, $P = x_1, x_2, x_3, x_4, \ldots, x_N$, where $x_i \in A$, $i = 1, 2, 3, \ldots, N$, and $A$ is the set of 20 amino acids, $A = \{a_1, a_2, a_3, \ldots, a_{20}\}$. The segments are of equal length, calculated from the length of the amino acid sequence: if $L$ is the length of $P$, each segment is of length $L/9$. Let $T = \{C, H, N, O, S\}$ be the set of atoms in the amino acids. The atomic composition of sequence segment $P_i$ is calculated as $ATC(P_i) = \{t_1, t_2, t_3, t_4, t_5\}$, where each $t$ is the count of one type of atom of $T$ in the sequence segment $P_i$. With 5 atom counts for each of the 9 segments, the SVM module for atomic composition has a feature vector of size 45 for each protein. The kernel parameters used are $\gamma = 1$, $C = 2$.

In the physiochemical SVM module, multiple physiochemical parameters from the AAIndex database [45,46] are used for feature extraction.
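As a rough illustration of the atomic composition features described above (the input to the atomic composition SVM module), the following Python sketch counts C, H, N, O and S atoms in each of nine segments, which gives 9 × 5 = 45 values. Several points are assumptions, since the text does not specify them: the atom counts are those of the free amino acids (not of residues in a peptide chain), segment boundaries are obtained by integer division so that segment lengths differ by at most one residue, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

ATOMS = ("C", "H", "N", "O", "S")

# (C, H, N, O, S) counts of the free amino acids. Assumption: the paper does
# not state whether free-molecule or residue formulas were used.
ATOM_COUNTS: Dict[str, Tuple[int, int, int, int, int]] = {
    "A": (3, 7, 1, 2, 0),   "R": (6, 14, 4, 2, 0),  "N": (4, 8, 2, 3, 0),
    "D": (4, 7, 1, 4, 0),   "C": (3, 7, 1, 2, 1),   "Q": (5, 10, 2, 3, 0),
    "E": (5, 9, 1, 4, 0),   "G": (2, 5, 1, 2, 0),   "H": (6, 9, 3, 2, 0),
    "I": (6, 13, 1, 2, 0),  "L": (6, 13, 1, 2, 0),  "K": (6, 14, 2, 2, 0),
    "M": (5, 11, 1, 2, 1),  "F": (9, 11, 1, 2, 0),  "P": (5, 9, 1, 2, 0),
    "S": (3, 7, 1, 3, 0),   "T": (4, 9, 1, 3, 0),   "W": (11, 12, 2, 2, 0),
    "Y": (9, 11, 1, 3, 0),  "V": (5, 11, 1, 2, 0),
}


def split_segments(sequence: str, parts: int) -> List[str]:
    """Split a sequence into `parts` contiguous segments of near-equal length."""
    n = len(sequence)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [sequence[bounds[i]:bounds[i + 1]] for i in range(parts)]


def atomic_composition(segment: str) -> List[int]:
    """Total number of C, H, N, O and S atoms in one sequence segment."""
    totals = [0] * len(ATOMS)
    for residue in segment:
        for k, count in enumerate(ATOM_COUNTS[residue]):
            totals[k] += count
    return totals


def atc_features(sequence: str, parts: int = 9) -> List[int]:
    """Concatenate the atom counts of all segments (9 x 5 = 45 values)."""
    features: List[int] = []
    for segment in split_segments(sequence, parts):
        features.extend(atomic_composition(segment))
    return features
```

Residues outside the 20 standard one-letter codes raise a `KeyError` in this sketch; how such residues were handled in the original work is not stated.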
The use of multiple parameters is based on the observation that no single physiochemical property, but rather a group of them, describes the localization address signal. Each amino acid in the sequence is replaced with the corresponding physiochemical value to get a global representation of the physiochemical values. A full list of these physiochemical parameters is given in the Supplementary data. This SVM module has a feature vector of length 96, as we consider 96 physiochemical parameters. Nearly half of these parameters are related to secondary structure, such as the average relative probability of beta-sheet and the normalized frequency of alpha-helix. Other widely considered parameters such as hydrophobicity and charge are also included in the list. Let $P$ be a protein sequence, $P = x_1, x_2, x_3, x_4, \ldots, x_N$, where $x_i \in A$, $i = 1, 2, 3, \ldots, N$, $N$ is the length of the protein sequence and $A$ is the set of 20 amino acids, $A = \{a_1, a_2, a_3, \ldots, a_{20}\}$. Let $H$ be the amino acid index of a physiochemical parameter $C$, $H = \{h_1, h_2, \ldots, h_{20}\}$, where $h_j$ is the amino acid index value of the amino acid $a_j$.

$$C_i = \log\left(\sum_{i=1}^{N} h_i\right) \tag{1}$$

where $C_i \in \{c_1, c_2, c_3, \ldots, c_{96}\}$. The kernel parameters used were $\gamma = 2$, $C = 64$.

The fraction of each amino acid in the protein is used for prediction in the amino acid composition SVM module. Let $P$ be a protein sequence, $P = x_1, x_2, x_3, x_4, \ldots, x_N$, where $x_i \in A$, $i = 1, 2, 3, \ldots, N$, and $A$ is the set of 20 amino acids, $A = \{a_1, a_2, a_3, \ldots, a_{20}\}$. We calculated the amino acid composition AAC as the fraction of each amino acid $a_i$ in the sequence $P$.

$$\mathrm{AAC}(a_i) = \frac{\text{total number of amino acid } a_i}{N}, \tag{2}$$

where $N$ is the total number of amino acids in the sequence. The feature vector for this SVM module had a length of 20 for each protein, and the kernel parameters used were $\gamma = 4$, $C = 512$.

The three-part amino acid composition SVM module considers the sequence as three parts: N-terminal, middle part and C-terminal. This brings out the compositional differences between the parts of the sequence and reveals the distribution of the amino acid composition in greater detail. The protein sequence $P$ is divided into three segments $P_N$, $P_M$ and $P_C$, where $P_N$ is the N-terminal segment, $P_M$ is the middle segment and $P_C$ is the C-terminal segment. The segments were of equal length, calculated from the length of the amino acid sequence: if $L$ is the length of $P$, each segment is of length $L/3$. The amino acid composition of each segment was calculated separately, as the fraction of each amino acid as in Eq. (2). The input feature vector for this SVM module had 60 elements for each protein (20 fractions for each of the 3 segments), and the kernel parameters were $\gamma = 4$, $C = 512$.

In the sequence similarity module, the whole query sequence was aligned against the sequences in the training set using the Smith–Waterman algorithm [47] to make the prediction. The Smith–Waterman algorithm is a dynamic programming method that finds the optimal local alignment between two sequences, bringing out common patterns and domains within the sequences. Since the address signals for each location share common characteristics, this algorithm is well suited to detecting such signals within the sequences. The scoring matrix BLOSUM50 was used for alignment. The location of the training sequence with the highest similarity to the query protein was predicted as the location of the query protein.
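The sequence similarity module described above amounts to a nearest-neighbour rule on local alignment scores. The sketch below shows the idea in Python under clearly simplified assumptions: a flat match/mismatch score replaces the BLOSUM50 matrix, the gap penalty is linear with an arbitrary value, and the function names are hypothetical. It is not the implementation used in the paper.

```python
from typing import List, Tuple


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty).

    Assumption: simple match/mismatch scoring stands in for BLOSUM50.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0,
                          prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


def predict_location(query: str, training: List[Tuple[str, str]]) -> str:
    """Return the location of the training sequence most similar to the query.

    `training` is a list of (sequence, location) pairs.
    """
    _, location = max(training,
                      key=lambda item: smith_waterman_score(query, item[0]))
    return location
```

Only two rows of the dynamic programming table are kept, because the prediction needs the best score and not the alignment itself.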
Four SVM modules were designed, for atomic composition, multiple physiochemical properties, amino acid composition and three-part amino acid composition. These SVM modules were named ATC-SVM, Phys-SVM, AAC-SVM and 3-AAC-SVM, respectively. A voting system was employed to make the final prediction from these individual SVM modules. If a conflict occurs, for instance when the modules predict different locations, the sequence similarity module is


Fig. 1. Schematic diagram of the subcellular prediction system. [Diagram: Query sequence → ATC-SVM, Phys-SVM, AAC-SVM, 3-AAC-SVM, Sequence Alignment → Voting system → Prediction]

used for making the prediction. The layout of the system is depicted in Fig. 1.

Self-consistency, jackknife and independent data tests were performed to evaluate the system. The self-consistency test measures the self-consistency of the developed method: the same dataset from which the rules of classification are derived is used for making predictions. This gives high accuracy, because the same dataset is used for training and testing, but if the self-consistency of a method is poor, it is not a good classification method. In the jackknife test, each protein in the training set is singled out in turn, and its location is predicted using the rules derived from the rest of the training set. The jackknife test is considered more objective and rigorous than the other tests. In the independent test, the training dataset is used for training the SVMs to derive the support vectors, and the testing dataset is used for measuring the performance. The prediction accuracy for each subcellular location is calculated as

$$\mathrm{Accuracy}(L) = (C_L / T_L) \times 100 \tag{3}$$

where $C_L$ is the number of true predictions and $T_L$ is the total number of proteins for location $L$. The total prediction accuracy is calculated as

$$\mathrm{Accuracy}(\mathrm{Sys}) = \frac{1}{N} \times (\text{total correct predictions}) \tag{4}$$

where N is the total number of proteins:

ð5Þ
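The voting step and the accuracy measures of Eqs. (3) and (4) can be sketched together in Python. The details of the weighting are not given in the text, so this sketch makes explicit assumptions: the vote is an unweighted plurality over the SVM module outputs, any tie counts as a conflict and is resolved by the sequence similarity prediction, and the overall accuracy is multiplied by 100 so that it is a percentage like the values reported in the tables. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def vote(module_predictions: List[str], similarity_prediction: str) -> str:
    """Plurality vote over the SVM modules; on a tie, use sequence similarity.

    Assumption: unweighted votes, and any tie for first place is a conflict.
    """
    if not module_predictions:
        return similarity_prediction
    counts = Counter(module_predictions).most_common()
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]
    return similarity_prediction


def accuracy_by_location(true_labels: List[str],
                         predicted_labels: List[str]) -> Dict[str, float]:
    """Eq. (3): correct predictions per location over proteins per location, x 100."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {loc: 100.0 * correct[loc] / totals[loc] for loc in totals}


def overall_accuracy(true_labels: List[str], predicted_labels: List[str]) -> float:
    """Eq. (4), scaled to percent: total correct predictions over N."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * hits / len(true_labels)
```

A weighted variant would replace the `Counter` over predictions with a sum of per-module weights; the weights used in the original system are not stated here.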

Results and discussion

Global features perform better than N-terminal features alone

We tested each module with both N-terminal sequence information and full sequence information. Our results showed that the full-length sequence information performs better than N-terminal sequence information. The results of the 96 physiochemical values applied to both the N-terminal and the full-length sequence are listed in Table 1. The results of sequence similarity applied to both the N-terminal and the entire sequence are listed in Table 2. These results show that using the full-length sequence gives better accuracy than using the N-terminal alone. This may be because the address signals are dispersed throughout the entire sequence rather than confined to the N-terminal.

Table 1
Comparison of prediction accuracies (%) of the physiochemical module applied to the N-terminal and to the entire sequence.

Location           N-terminal accuracy   Full sequence accuracy
Chloroplast        55.36                 73.21
Cytoplasm          83.05                 90.67
Cytoskeleton       52.63                 94.74
ER                 56.60                 66.98
Extracellular      69.47                 77.89
Golgi apparatus    50.00                 25.00
Lysosome           32.26                 64.52
Mitochondria       41.10                 30.67
Nucleus            70.81                 79.43
Peroxisome         30.43                 30.43
Plasma membrane    86.22                 95.14
Total              74.94                 83.00

Smith–Waterman sequence alignment performs better

We used Smith–Waterman sequence alignment on the whole sequence for subcellular localization prediction. Our experiments showed that Smith–Waterman sequence alignment on the whole sequence performs better than Needleman–Wunsch alignment [48] on the N-terminal alone, Needleman–Wunsch alignment on the whole sequence and Smith–Waterman alignment on the N-terminal. The comparisons are given in Table 2. The higher accuracy of the Smith–Waterman algorithm may be due to its ability to find local alignments within the protein sequences, bringing out common patterns and domains within the sequences. The address signals for each location share common features, and this algorithm is well suited to detecting such signals.

Table 2
Comparison of prediction accuracies (%) of Needleman–Wunsch (NW) and Smith–Waterman (SW) sequence alignment applied to the N-terminal and to the entire sequence.

Location           NW N-terminal   SW N-terminal   NW full   SW full
Chloroplast        86.61           91.96           97.32     97.32
Cytoplasm          84.23           85.81           85.15     83.44
Cytoskeleton       78.95           100.00          100.00    100.00
ER                 90.57           99.06           99.06     100.00
Extracellular      94.74           95.79           94.74     97.89
Golgi apparatus    75.00           75.00           75.00     75.00
Lysosome           90.32           100.00          100.00    100.00
Mitochondria       71.17           91.41           96.32     100.00
Nucleus            65.55           77.03           68.42     86.60
Peroxisome         73.91           95.65           91.30     91.30
Plasma membrane    94.88           97.11           99.34     99.74
Total              84.20           89.74           89.25     92.30

Performance

Table 3
Accuracy (%) of each module for the self-consistency, jackknife and independent tests.

Method               Self-consistency   Jackknife   Independent
Phys-SVM             95.44              72.34       82.72
3-AAC-SVM            100.00             80.92       84.72
AAC-SVM              100.00             77.50       81.07
SW                   100.00             78.27       92.26
ATC-SVM              97.63              71.79       77.39
Hybrid               100.00             81.29       84.60
Prediction system    100.00             82.47       88.81

Table 4
Comparison with other methods (accuracy, %; "-" marks a value that is not reported).

Method                                                                  Self-consistency   Jackknife   Independent
Pseudo-amino acid composition, covariant-discriminant method [27]      85.8               73.0        80.9
Functional domain composition [49]                                      87.3               66.7        81.7
Stochastic signal processing approach [35]                              81.5               67.7        73.9
Cellular automata images [36]                                           86.4               72.6        74.8
Complexity measure factor [50]                                          -                  73.6        79.8
Hydrophobic patterns and average power-spectral density [51]            86.0               72.8        79.9
Lyapunov index, bessel function, and chebyshev filter [32]              82.3               69.9
Multi-scale energy [52]                                                 -                  80.3        87.0
Atomic composition (this paper)                                         97.6               71.8        77.4
Hybrid (this paper)                                                     100.0              81.3        84.6
Prediction system (this paper)                                          100.0              82.5        88.8

We performed the self-consistency test, jackknife test and independent data test on the data. The result of each module and of the entire

system is given in Table 3. The newly introduced feature, atomic composition, alone achieves a prediction accuracy of 77.39% on the independent test. The self-consistency accuracy of the atomic composition module (97.63%) is also higher than that of the physiochemical module (95.44%). Considering the full sequence as three parts and calculating the amino acid composition of each part gives better accuracy than considering the whole sequence together. This may be because the former exposes the distribution of amino acid composition in finer detail. A hybrid module based on Phys-SVM, 3-AAC-SVM, AAC-SVM and ATC-SVM was developed to demonstrate the strength of the method when the sequence alignment is excluded. The prediction accuracies of this module are also reported. In the hybrid approach, the 3-AAC-SVM is given weight in the voting. We also conducted 5-fold cross-validation for the individual SVM modules. The cross-validation accuracies are 71.56% for Phys-SVM, 80.00% for 3-AAC-SVM, 76.54% for AAC-SVM and 70.19% for ATC-SVM. A comparison of our method with other methods is given in Table 4.

Conclusions

We have introduced a new parameter, atomic composition, for subcellular localization prediction and integrated it with other parameters such as amino acid composition, physiochemical parameters and sequence similarity. Our results demonstrate that the global information of the sequence contributes more to the prediction accuracy than N-terminal information alone. This was found to be true for the physiochemical properties and sequence alignment modules. Another observation is that considering the full sequence as a group of three parts (N-terminal, middle region and C-terminal) brings out the underlying property distribution in greater detail and enhances the prediction accuracy. For the sequence alignment module, the Smith–Waterman algorithm on the whole sequence performs better than the Needleman–Wunsch algorithm on the whole sequence. Our work demonstrates that atomic composition can be effectively


used along with other global features of the sequence to enhance the accuracy of subcellular localization prediction.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.bbrc.2009.12.118.

References

[1] E. Tantoso, K.B. Li, AAIndexLoc: predicting subcellular localization of proteins based on a new representation of sequences using amino acid indices, Amino Acids 13 (2008) 345–353.
[2] H. Bannai, Y. Tamada, O. Maruyama, K. Nakai, S. Miyano, Extensive feature detection of n-terminal protein sorting signals, Bioinformatics 18 (2002) 298–305.
[3] M. Bhasin, A. Garg, G.P.S. Raghava, Pslpred: prediction of subcellular localization of bacterial proteins, Bioinformatics 21 (2005) 2522–2524.
[4] M. Bhasin, G.P.S. Raghava, ESLpred: svm-based method for subcellular localization of eukaryotic proteins using dipeptide composition and psi-blast, Nucleic Acids Res. 32 (2004) W414–W419.
[5] T. Blum, S. Briesemeister, O. Kohlbacher, MultiLoc2: integrating phylogeny and gene ontology terms improves subcellular protein localization prediction, BMC Bioinf. 10 (2009), doi:10.1186/1471-2105-10-274.
[6] J.L. Gardy, M.R. Laird, F. Chen, S. Rey, C.J. Walsh, M. Ester, F.S.L. Brinkman, Psortb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis, Bioinformatics 21 (2005) 617–623.
[7] A. Garg, M. Bhasin, G.P.S. Raghava, Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search, J. Biol. Chem. 280 (2005) 14427–14432.
[8] C.S. Yu, C.J. Lin, J.K. Hwang, Predicting subcellular localization of proteins for gram-negative bacteria by support vector machines based on n-peptide compositions, Protein Sci. 13 (2004) 1402–1406.
[9] B.D. Bruce, The paradox of plastid transit peptides: conservation of function despite divergence in primary structure, Biochim. Biophys. Acta 1541 (2001) 2–21.
[10] D. Christophe, C.C. Hobertus, B. Pichon, Nuclear targeting of proteins: how many different signals?, Cell Signal 12 (2000) 337–341.
[11] M. Cokol, R. Nair, B. Rost, Finding nuclear localization signals, EMBO Rep. 1 (2000) 411–415.
[12] R. Dono, D. James, R. Zeller, A GR-motif functions in nuclear accumulation of the large fgf-2 isoforms and interferes with mitogenic signalling, Oncogene 16 (1998) 2151–2158.
[13] O. Emanuelsson, Predicting protein subcellular localisation from amino acid sequence information, Brief. Bioinform. 3 (2002) 361–376.
[14] S.J. Gould, G.A. Keller, N. Hosken, J. Wilkinson, S. Subramani, A conserved tripeptide sorts proteins to peroxisomes, J. Cell Biol. 108 (1989) 1657–1664.
[15] D. Kalderon, B.L. Roberts, W.D. Richardson, A.E. Smith, A short amino acid sequence able to specify nuclear location, Cell 39 (1984) 499–509.
[16] W. Neupert, Protein import into mitochondria, Annu. Rev. Biochem. 66 (1997) 863–917.
[17] N. Pfanner, A. Geissler, Versatility of the mitochondrial protein import machinery, Nat. Rev. Mol. Cell Biol. 2 (2001) 339–349.
[18] V.W. Pollard, W.M. Michael, S. Nakielny, M.C. Siomi, F. Wang, G. Dreyfuss, A novel receptor-mediated nuclear protein import pathway, Cell 86 (1996) 985–994.
[19] T.A. Rapoport, Transport of proteins across the endoplasmic reticulum membrane, Science 258 (1992) 931–936.
[20] J. Robbins, S.M. Dilworth, R.A. Laskey, C. Dingwall, Two interdependent basic domains in nucleoplasmin nuclear targeting sequence: identification of a class of bipartite nuclear targeting sequence, Cell 64 (1991) 615–623.
[21] G. von Heijne, Patterns of amino acids near signal-sequence cleavage sites, Eur. J. Biochem. 133 (1983) 17–21.
[22] G. von Heijne, J. Steppuhn, R.G. Herrmann, Versatility of the mitochondrial protein import machinery, Eur. J. Biochem. 180 (2001) 535–545.
[23] O. Emanuelsson, H. Nielsen, S. Brunak, G. von Heijne, Predicting subcellular localization of proteins based on their n-terminal amino acid sequence, J. Mol. Biol. 300 (2000) 1005–1016.
[24] K. Nakai, M. Kanehisa, Expert system for predicting protein localization sites in gram-negative bacteria, Proteins 11 (1991) 95–110.
[25] K. Nakai, M. Kanehisa, A knowledge base for predicting protein localization sites in eukaryotic cells, Genomics 14 (1992) 897–911.
[26] K.C. Chou, Prediction of protein subcellular locations by incorporating quasi-sequence-order effect, Biochem. Biophys. Res. Commun. 278 (2000) 477–483.
[27] K.C. Chou, Prediction of protein cellular attributes using pseudo-amino acid composition, Proteins: Struct. Funct. Genet. 43 (2001) 246–255.
[28] K.C. Chou, Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes, Bioinformatics 21 (2005) 10–19.
[29] K.C. Chou, Y.D. Cai, Prediction of membrane protein types by incorporating amphipathic effects, J. Chem. Inf. 45 (2005) 407–413.
[30] H.B. Shen, K.C. Chou, Ensemble classifier for protein fold pattern recognition, Bioinformatics 22 (2006) 1717–1722.


[31] H.B. Shen, K.C. Chou, PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition, Anal. Biochem. 373 (2008) 386–388.
[32] Y. Gao, S. Shao, X. Xiao, Y. Ding, Y. Huang, Z. Huang, K.C. Chou, Using pseudo amino acid composition to predict protein subcellular location: approached with lyapunov index, bessel function, and chebyshev filter, Amino Acids 28 (2005) 373–376.
[33] S. Matsuda, J.P. Vert, H. Saigo, N. Ueda, H. Toh, T. Akutsu, A novel representation of protein sequences for prediction of subcellular location using support vector machines, Protein Sci. 14 (2005) 2804–2813.
[34] Y.X. Pan, D.W. Li, Y. Duan, Z.Z. Zhang, M.Q. Xu, G.Y. Feng, L. He, Predicting protein subcellular location using digital signal processing, Acta Biochim. Biophys. Sin. 37 (2005) 88–96.
[35] Y.X. Pan, Z.Z. Zhang, Z.M. Guo, G.Y. Feng, Z.D. Huang, L. He, Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach, J. Protein Chem. 22 (2003) 395–402.
[36] X. Xiao, S. Shao, Y. Ding, Z. Huang, K.C. Chou, Using cellular automata images and pseudo amino acid composition to predict protein subcellular location, Amino Acids 30 (2006) 49–54.
[37] W.L. Huang, C.W. Tung, H.L. Huang, S.F. Hwang, S.Y. Ho, ProLoc: prediction of protein subnuclear localization using svm with automatic selection from physicochemical composition features, Biosystems 90 (2007) 573–581.
[38] K. Imai, N. Asakawa, T. Tsuj, F. Akazawa, A. Ino, M. Sonoyama, S. Mitaku, SOSUIGramN: high performance prediction for sub-cellular localization of proteins in gram-negative bacteria, Bioinformation 2 (2008) 417–421.
[39] J.K. Kim, S.Y. Bang, S. Choi, Sequence-driven features for prediction of subcellular localization of proteins, Pattern Recogn. 39 (2006) 2301–2311.
[40] J.K. Kim, G.P.S. Raghava, S.Y. Bang, S. Choi, Prediction of subcellular localization of proteins using pairwise sequence alignment and support vector machine, Pattern Recogn. Lett. 27 (2006) 996–1001.
[41] K.C. Chou, D.W. Elrod, Protein subcellular location prediction, Protein Eng. 12 (1999) 107–118.
[42] V.N. Vapnik, The Nature of Statistical Learning Theory, Wiley-Interscience, New York, 1998.
[43] R. Nair, B. Rost, Mimicking cellular sorting improves prediction of subcellular localization, J. Mol. Biol. 348 (2005) 85–100.
[44] C.C. Chang, C. Lin, LIBSVM: a library for support vector machines, 2001. www.csie.ntu.edu.tw/~cjlin/libsvm.
[45] S. Kawashima, H. Ogata, M. Kanehisa, AAindex: amino acid index database, Nucleic Acids Res. 27 (1999) 368–369.
[46] S. Kawashima, P. Pokarowski, M. Pokarowska, A. Kolinski, T. Katayama, M. Kanehisa, AAindex: amino acid index database progress report 2008, Nucleic Acids Res. 36 (2008) D202–D205.
[47] T.F. Smith, M.S. Waterman, Identification of common molecular subsequences, J. Mol. Biol. 147 (1981) 195–197.
[48] S.B. Needleman, C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48 (1970) 443–453.
[49] K.C. Chou, Y.D. Cai, Using functional domain composition and support vector machines for prediction of protein subcellular location, J. Biol. Chem. 277 (2002) 45765–45769.
[50] X. Xiao, S. Shao, Y. Ding, Z. Huang, Y. Huang, K.C. Chou, Using complexity measure factor to predict protein subcellular location, Amino Acids 28 (2005) 57–61.
[51] T. Zhang, Y. Ding, K.C. Chou, Prediction of protein subcellular location using hydrophobic patterns of amino acid sequence, Comput. Biol. Chem. 30 (2006) 367–371.
[52] J.Y. Shi, S.W. Zhang, Q. Pan, Y.M. Cheng, J. Xie, Prediction of protein subcellular localization by support vector machines using multi-scale energy and pseudo amino acid composition, Amino Acids 33 (2007) 69–74.
