Improvement in Fold Recognition Accuracy of a Reduced State-Space Hidden Markov Model by Using Secondary Structure Information in Scoring

Christos Lampros, Costas Papaloukas, Kostas Exarchos, Student Member, IEEE, and Dimitrios I. Fotiadis, Senior Member, IEEE

Abstract: Fold recognition is a challenging field, strongly related to function determination, which is of high interest to biologists and the pharmaceutical industry. Hidden Markov Models (HMMs) have been widely applied for this purpose. In this work, the fold recognition accuracy of a recently introduced HMM with a reduced state-space topology is improved. This model employs an efficient architecture and a low-complexity training algorithm based on likelihood maximization. Here we further improve the fold recognition accuracy of the model in two steps. In the first step we adopt a smaller model architecture based on the {E,H,L} alphabet instead of the DSSP secondary structure alphabet. In the second step we additionally use the predicted and the correct secondary structure information in the scoring of the test set sequences. The dataset used for the evaluation of the proposed methodology comes from the SCOP and PDB databases. The results show that the fold recognition performance increases significantly in both steps.

I. INTRODUCTION

The number of identified protein sequences has been increasing in recent years due to extensive research in the field. However, the majority of these sequences are not accompanied by any information concerning their structure or biochemical function. A newly identified protein can be related to proteins in annotated databases whose three-dimensional structure (fold) is known. Determining how amino acid sequences are related to those of proteins with known structure helps in making predictions about their structural, functional and evolutionary attributes.

Manuscript received March 30, 2007. Christos Lampros is with the Department of Medical Physics, Medical School, University of Ioannina, Ioannina, Greece, GR 45110 (e-mail: [email protected]). Costas Papaloukas is with the Department of Biological Applications and Technology, University of Ioannina, Ioannina, Greece, GR 45110 (e-mail: [email protected]). Kostas Exarchos is with the Unit of Medical Technology and Intelligent Information Systems, Dept. of Computer Science, University of Ioannina, Ioannina, Greece, GR 45110 (e-mail: [email protected]). Dimitrios I. Fotiadis is with the Unit of Medical Technology and Intelligent Information Systems, Dept. of Computer Science, University of Ioannina, Ioannina, Greece, GR 45110 (phone: 0030-26510-98803; fax: 0030-26510-970; e-mail: [email protected]).

Various methods have been developed to identify the fold category to which a protein of unknown structure belongs (fold recognition). These methods are divided into two methodological approaches: (a) informatics-based methods, which include sequence-based methods [1]-[5] and structure-based methods [6]-[8], and (b) biophysics-based methods [9]-[10]. Sequence-based methods use primary or predicted secondary sequence information to perform sequence comparison and detect whether two proteins share a fold or not. Structure-based or threading methods create an energy function which evaluates how well the amino acid sequence of a protein fits into one of the known folds. On the other hand, methods based on biophysics perform ab initio structure prediction. They detect a native conformation, or ensemble of conformations, of the protein that is at or near the global free-energy minimum [10]. A review of sequence-based methods reveals that hidden Markov models (HMMs) are commonly used in fold recognition and also demonstrate high performance [2]-[5]. In the current work we introduce changes which significantly improve the performance of a recently proposed reduced state-space HMM in two steps. More specifically, in the first step we decrease the number of states by adopting the simple {E,H,L} alphabet. This change leads to an even smaller number of parameters to be calculated in the training phase and, at the same time, to better results. In the second step, we additionally use the predicted and the correct secondary structure sequences in the scoring of the test set proteins. This enables us to avoid the use of the complex forward algorithm [11] for scoring and also allows us to exploit the secondary sequence information of the proteins whose structure is considered unknown. The predicted secondary sequences are calculated employing PSIPRED [12]. The results show significant improvement in fold recognition accuracy.
When we use the correct secondary sequences the performance is even better, as in that case there are no errors in the secondary structure.

II. MATERIALS AND METHODS

HMMs are widely used in modelling families of biological sequences. They model a series of observations based upon a hypothesized (hidden) process. Each HMM consists of a set of states S and a set of possible transitions T between them. Every state emits a signal based upon a set of emission probabilities and then stochastically passes control to some other state with a probability depending on the previous state. The procedure continues until the whole sequence is emitted. There is a begin state where the process starts, and transition probabilities also exist from the begin state to each possible state. That set of probabilities sums to unity, and so does the set of emissions of possible signals in each state and the set of transitions from each state. The observer does not know which exact state produces each signal, because that state is hidden. This is the first main characteristic of an HMM, which differentiates it from other stochastic models. The second is the Markov property, which means that, given the value of the previous state S_{t-1}, the current state S_t and all future states are independent of all the states prior to t-1 [11]. An HMM is trained using a set of sequences called the training set and can then be used for discrimination. The aim of the learning procedure is to maximize the likelihood of the model given the training data. The main disadvantage of HMMs, namely the employment of large model architectures that demand large datasets and high computational effort for training, is addressed in our recent work [5], where we introduced an HMM with a reduced state-space topology which can be trained and then used for the classification of proteins into fold categories. That model incorporates the secondary structure information in such a way that the states of the model depict the possible different secondary states. This enables us not only to use the secondary structure information, which is necessary for the more accurate classification of proteins among different structural groups, but also to drastically reduce the number of states of the model and thus the number of parameters that have to be trained.
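The generative process described above can be illustrated with a minimal sketch. All states, symbols and probabilities below are toy values chosen for illustration, not the trained model of this paper.

```python
import random

random.seed(0)

states = ["H", "E", "L"]
begin_probs = {"H": 0.4, "E": 0.3, "L": 0.3}               # transitions from the begin state
trans = {s: {t: 1.0 / 3 for t in states} for s in states}  # fully connected, uniform for the sketch
emit = {                                                   # toy emission distributions over 3 symbols
    "H": {"A": 0.5, "G": 0.3, "V": 0.2},
    "E": {"A": 0.2, "G": 0.5, "V": 0.3},
    "L": {"A": 0.3, "G": 0.3, "V": 0.4},
}

def pick(dist):
    """Sample a key from a probability dictionary."""
    r, acc = random.random(), 0.0
    for key, p in dist.items():
        acc += p
        if r < acc:
            return key
    return key  # guard against floating-point round-off

def generate(length):
    """Emit one sequence: at each step the hidden state emits a symbol,
    then stochastically moves to the next state."""
    state, seq = pick(begin_probs), []
    for _ in range(length):
        seq.append(pick(emit[state]))
        state = pick(trans[state])
    return "".join(seq)

print(generate(10))
```

The observer sees only the emitted string; the state path that produced it stays hidden, which is the property the paper's model later exploits by making the states coincide with known secondary structure.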
Moreover, the training of the model becomes much simpler because the state sequence during learning is known, and therefore we can employ a likelihood maximization algorithm with very low complexity [11] and avoid complicated iterative procedures. The reduced state-space HMM has been proven to be effective in fold recognition [5]. In the current work, we introduce specific changes that lead to significant improvement of the fold recognition accuracy of the model. These changes are presented in two steps. First we change the topology of the model, and then we change the way we score the test set proteins against the improved model. In the first step, we change the secondary alphabet that we use to encode the different possible secondary structure formations. In the previous work we had used the DSSP alphabet [13], which consists of the 7 letters {H,B,E,G,I,T,S}. This time, we adopt the basic {E,H,L} alphabet, where each letter corresponds to more than one of the letters of the DSSP alphabet, so the total number of letters in use is reduced. More specifically, H, G and I correspond to H; E and B correspond to E; and T and S correspond to L. Some amino acids that were considered of unknown secondary structure in DSSP are now considered as loop (L). Thus, the topology of the model changes and it currently includes 3 hidden states instead of 7, corresponding to the underlying secondary structure. The topology of the model is shown in Fig. 1.

Fig. 1. Topology of the HMM with 3 states (H, E and L, plus the begin state).
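The alphabet reduction described above can be sketched as a simple lookup. The mapping follows the correspondence stated in the text (H, G, I to H; E, B to E; T, S to L; anything unassigned to L); the function and variable names are illustrative.

```python
# DSSP 7-letter alphabet collapsed to the reduced {E,H,L} alphabet,
# as described in the text; residues with no DSSP assignment become loop (L).
DSSP_TO_EHL = {
    "H": "H", "G": "H", "I": "H",   # helix types -> H
    "E": "E", "B": "E",             # strand and bridge -> E
    "T": "L", "S": "L",             # turn and bend -> L
}

def reduce_secondary(dssp_seq):
    """Map a DSSP string to the reduced {E,H,L} alphabet."""
    return "".join(DSSP_TO_EHL.get(c, "L") for c in dssp_seq)

print(reduce_secondary("HGIEBTS-"))  # -> "HHHEELLL"
```

With this reduction the model needs only 3 hidden states, one per reduced letter, instead of 7.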

In the training set, there is a one-to-one correspondence between the amino acid and the secondary structure residues. The states of the model are fully connected, which means that all possible transitions between them are allowed. In each state a distribution over all possible amino acid residues is found. There are 21 possible residues, which are the variables in each distribution: the 20 different amino acids and one more residue indicating amino acids of unknown origin. Thus, the total number of model parameters is 3x21 for the possible emissions, 3x3 for the possible transitions between states and 1x3 for the transitions from the begin state. This gives 75 parameters in total, far fewer than the 203 parameters of the previous model. The emission and transition parameters of the model are calculated in a single step with the use of maximum likelihood estimators. If a_kl is the transition probability from state k to another state l and e_k(b) the emission probability of the residue b in state k, then the estimators are given by the following equations:

a_kl = A_kl / Σ_{l'} A_{kl'},          (1)

and

e_k(b) = E_k(b) / Σ_{b'} E_k(b'),      (2)

where l' ranges over all the states to which the procedure can go after state k, and b' ranges over all the symbols that can be emitted from state k. A_kl is the number of times the transition from k to l is used, and E_k(b) is the number of times the emission of b from k is used, in the training set of sequences. We should also take precautions to avoid overfitting of the maximum likelihood estimators when there is insufficient data, by adding pseudocounts [5].

We use the posterior probability scores for scoring against the model. These scores are logarithmic forms of the probability of the sequence given the model. According to Bayes' theorem, a test sequence is classified to the group whose model gives the maximum probability score. The posterior probabilities are calculated with the use of the forward algorithm [11], which gives the probability that a sequence has been produced by an HMM by adding the probabilities of all possible paths of the sequence through the model.

In the second step, we keep the same model for training, but we change the way the test proteins are scored against it. Our aim is to exploit the secondary structure information of the test proteins, which is not possible when we use the forward algorithm. So instead of adding the probabilities of all the possible paths of the primary sequence through the model, we calculate the probability along the path of the secondary sequence. The use of the forward algorithm for scoring presupposes that the secondary structure is unknown. If the secondary structure is known, we know the path of the amino acid sequence through the model, so the scoring becomes computationally less expensive [11].

Three new experiments take place in this work. In the first, we train the model with 3 states for 34 different fold categories belonging to the 4 major structural classes. Then, the test set proteins are scored against all models of candidate folds and are assigned to the fold whose model gave the maximum posterior probability score.
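The single-pass counting of Eqs. (1)-(2) and the two scoring schemes just described can be sketched as follows. The sequences, alphabet and pseudocount value are illustrative toy data, not the paper's dataset; a single-path score is always at most the forward score, since the forward algorithm sums over all paths.

```python
import math

STATES = ["H", "E", "L"]

def train(pairs, alphabet, pseudo=1.0):
    """Estimate begin/transition/emission probabilities in one pass by
    counting, as in Eqs. (1)-(2), with pseudocounts against overfitting.
    Each pair is (amino_acid_seq, secondary_seq) with a known state path."""
    begin = {s: pseudo for s in STATES}
    trans = {s: {t: pseudo for t in STATES} for s in STATES}
    emit = {s: {a: pseudo for a in alphabet} for s in STATES}
    for aa_seq, ss_seq in pairs:
        begin[ss_seq[0]] += 1
        for i, (aa, ss) in enumerate(zip(aa_seq, ss_seq)):
            emit[ss][aa] += 1
            if i + 1 < len(ss_seq):
                trans[ss][ss_seq[i + 1]] += 1
    norm = lambda d: {k: v / sum(d.values()) for k, v in d.items()}
    return norm(begin), {s: norm(trans[s]) for s in STATES}, {s: norm(emit[s]) for s in STATES}

def forward_log_prob(seq, begin, trans, emit):
    """Log P(seq | model): sums over all hidden paths (forward algorithm)."""
    f = {s: begin[s] * emit[s][seq[0]] for s in STATES}
    for aa in seq[1:]:
        f = {t: emit[t][aa] * sum(f[s] * trans[s][t] for s in STATES) for t in STATES}
    return math.log(sum(f.values()))

def path_log_prob(seq, path, begin, trans, emit):
    """Log probability along one known secondary-structure path: cheaper,
    and it exploits the secondary information of the test protein."""
    lp = math.log(begin[path[0]]) + math.log(emit[path[0]][seq[0]])
    for i in range(1, len(seq)):
        lp += math.log(trans[path[i - 1]][path[i]]) + math.log(emit[path[i]][seq[i]])
    return lp

# Toy training pairs over a 2-letter amino acid alphabet.
training = [("AG", "HH"), ("GA", "EL"), ("AA", "HL")]
begin, trans, emit = train(training, alphabet="AG")
print(forward_log_prob("AG", begin, trans, emit))
print(path_log_prob("AG", "HH", begin, trans, emit))
```

In classification, each test protein would be scored this way against the model of every candidate fold and assigned to the fold with the maximum score.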
In that case we score only the primary test sequence against the model, with the use of the forward algorithm. In the second experiment the training procedure is exactly the same, but we score each test set protein against the model by taking into account its predicted secondary sequence. Thus, we score the protein sequence through the path that corresponds to its predicted secondary sequence. The predicted secondary sequence of each protein is given by PSIPRED [12]. The assignment to a fold category takes place in the same way as before. In the third experiment everything is the same as in the second experiment; the only difference is that instead of the predicted secondary sequences of the test set proteins we use the correct ones, given by the Protein Data Bank (PDB) [14].

III. DATASET AND RESULTS

The dataset used in the three experiments is shown in Table I. This group of protein sequences, both primary and

secondary, is taken from PDB and is separated into training and test sets. The members of this group correspond to specific folds of the SCOP database [15]. The data from the two databases are combined, so that SCOP provides the correct categorization and PDB provides the relevant sequences, both primary and secondary. The sequence identifiers come from the ASTRAL SCOP 1.69 dataset, where no proteins with more than 40% similarity are included. The most populated SCOP folds of classes A, B, C and D, specifically those that have at least 30 members, are used.

TABLE I
THE DATASET USED (34 SCOP FOLDS FROM 4 SCOP CLASSES)

Fold index   train   test  |  Fold index   train   test
a1             21     11   |  c1            143     71
a3             20     10   |  c2             91     46
a4            103     52   |  c3             22     11
a24            28     15   |  c23            58     29
a39            31     15   |  c26            35     17
a60            25     12   |  c37            91     46
a118           32     16   |  c47            39     20
b1            132     66   |  c55            31     15
b2             20     10   |  c56            20     10
b18            21     10   |  c66            40     20
b29            24     12   |  c67            31     15
b34            44     22   |  c69            34     17
b40            61     31   |  c94            23     12
b47            25     12   |  d15            44     22
b55            24     12   |  d17            20     10
b82            28     14   |  d58           102     51
b121           27     14   |  d144           23     12

We compare the results of the three experiments with the outcome of an experiment from our previous work, where we trained a model with 7 states and 203 parameters. The model with 3 states outperforms the model with 7 states on the same dataset, as the performance increases from 17.9% to 20.5%. When we add predicted secondary structure in the second experiment, fold recognition performance improves further, from 20.5% to 30.6%, and it reaches its maximum (36.8%) when we use the correct secondary structure in the third experiment. The results are shown in Table II. If we perform the same experiment with SAM [4], without the use of a predicted secondary sequence, the overall performance is 23.5% [5]. It has been shown that when a predicted secondary sequence is added to SAM, the fold recognition performance is expected to improve by 10% [16]. Thus, even in that case we still outperform SAM.

TABLE II
COMPARISON OF FOLD RECOGNITION ACCURACY OF THE MODEL IN DIFFERENT IMPLEMENTATIONS

Fold index   7-HMM    3-HMM    3-HMM (predicted secondary)   3-HMM (correct secondary)
a1            5/11     5/11     8/11                         10/11
a3            3/10     6/10     8/10                          8/10
a4            5/52     7/52    11/52                         16/52
a24           1/15     2/15     2/15                          2/15
a39           6/15     6/15     8/15                         10/15
a60           5/12     5/12     7/12                          4/12
a118          6/16     6/16     9/16                          9/16
b1           27/66    29/66    38/66                         41/66
b2            0/10     1/10     1/10                          2/10
b18           2/10     2/10     3/10                          4/10
b29           2/12     2/12     3/12                          5/12
b34           4/22     4/22     5/22                          7/22
b40           4/31     5/31     6/31                          8/31
b47           4/12     3/12     4/12                          9/12
b55           5/12     4/12     6/12                          6/12
b82           1/14     2/14     3/14                          3/14
b121         12/14    11/14    11/14                         13/14
c1            1/71     1/71     3/71                         16/71
c2           10/46    12/46    20/46                         17/46
c3            2/11     3/11     2/11                          6/11
c23           0/29     0/29     1/29                          8/29
c26           1/17     1/17     3/17                          2/17
c37           1/46     0/46    15/46                          8/46
c47           2/20     2/20     6/20                          6/20
c55           4/15     3/15     2/15                          3/15
c56           0/10     1/10     0/10                          2/10
c66           0/20     0/20     3/20                          5/20
c67           1/15     8/15     7/15                          9/15
c69           7/17     5/17    14/17                         10/17
c94           5/12     7/12     6/12                          6/12
d15           4/22     5/22     6/22                         11/22
d17           1/10     2/10     0/10                          1/10
d58           0/51     0/51     3/51                          5/51
d144          5/12     5/12     8/12                          7/12
total        17.9%    20.5%    30.6%                         36.8%

IV. DISCUSSION

In the current work we improved the fold recognition performance of a reduced state-space HMM. First, we reduced the number of parameters to be trained by adopting a different secondary structure alphabet for training. Despite the reduction in the size of the model, the fold recognition accuracy improved. Then, we utilized the secondary structure information in the scoring of the test sequences against the improved model. In this way we further improved the performance, as we exploited the secondary structure information of the test set proteins and avoided the use of the computationally expensive forward algorithm. We tested both the predicted and the correct secondary sequences of the test proteins. In the case of the correct secondary sequences the performance was the best of all experiments, as there are no errors in the secondary structure. It was also shown that SAM was outperformed, even when predicted secondary structure is used.

These improvements led to a significant rise in the fold recognition performance of the reduced state-space HMM, with lower computational cost in scoring. On the other hand, it was necessary to use PSIPRED for predicting the secondary sequences of the target proteins. Furthermore, it would be interesting to assess the performance of the 7-HMM using the predicted secondary structure in scoring; however, there is a lack of reliable secondary structure predictors that utilize a 7-letter encoding. Additional structural features, such as residue solvent accessibility, could also be incorporated in the improved model in the future. Such features would help us improve the performance of the model without a significant increase in its complexity.

REFERENCES

[1] J.Y. Shi, Q. Pan, S.W. Zhang, et al., "Protein fold recognition with support vector machines fusion network", Prog. Biochem. Biophys., vol. 33, no. 2, pp. 155-162, 2006.
[2] J. Hargbo and A. Elofsson, "Hidden Markov Models That Use Predicted Secondary Structures For Fold Recognition", Proteins, vol. 36, pp. 68-87, 1999.
[3] R. Karchin, M. Cline, Y. Mandel-Gutfreund and K. Karplus, "Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry", Proteins, vol. 51, pp. 504-514, 2003.
[4] K. Karplus, R. Karchin, G. Shackelford and R. Hughey, "Calibrating E-values for hidden Markov models using reverse-sequence null models", Bioinformatics, vol. 21, pp. 4107-4115, 2005.
[5] C. Lampros, C. Papaloukas, T.P. Exarchos, Y. Goletsis, and D.I. Fotiadis, "Sequence-based protein structure prediction using a reduced state-space hidden Markov model", Comput. Biol. Med., 2006, article in press.
[6] J.U. Bowie, R. Luethy and D. Eisenberg, "A method to identify protein sequences that fold into a known three-dimensional structure", Science, vol. 253, pp. 164-170, 1991.
[7] J. Xu, "Fold Recognition by Predicted Alignment Accuracy", IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 2, no. 2, pp. 157-165, 2005.
[8] O. Sander, I. Sommer and T. Lengauer, "Local protein structure prediction using discriminative models", BMC Bioinformatics, vol. 7, no. 14, pp. 1-13, 2006.
[9] A.G. Murzin, "Structure classification-based assessment of CASP3 predictions for the fold recognition targets", Proteins, vol. 37, pp. 88-103, 1999.
[10] R. Bonneau and D. Baker, "Ab initio protein structure prediction: Progress and prospects", Annu. Rev. Biophys. Biomol. Struct., vol. 30, pp. 173-189, 2001.
[11] R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, New York, 1998.
[12] D.T. Jones, "Protein secondary structure prediction based on position-specific scoring matrices", J. Mol. Biol., vol. 292, pp. 195-202, 1999.
[13] W. Kabsch and C. Sander, "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features", Biopolymers, vol. 22, pp. 2577-2637, 1983.
[14] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov and P.E. Bourne, "The Protein Data Bank", Nucleic Acids Res., vol. 28, pp. 235-242, 2000.
[15] A.G. Murzin, S.E. Brenner, T. Hubbard and C. Chothia, "SCOP: a structural classification of proteins database for the investigation of sequences and structures", J. Mol. Biol., vol. 247, pp. 536-540, 1995.
[16] A. Tramontano and V. Morea, "Assessment of Homology-Based Predictions in CASP5", Proteins, vol. 53, pp. 352-368, 2003.
