Much as the shape of a tool suggests its intended purpose, knowledge of a protein's structure can provide substantial insight into its function. Computational prediction of protein structure from sequence data alone is therefore a challenge of fundamental importance to biomedical research. An effective solution promises significant advances in computational drug discovery and an improved understanding of complex disease processes such as cancer. We have recently developed a novel approach to determining the secondary structure of proteins from sequence data that makes use of Parallel Cascade Identification (PCI), a powerful method of nonlinear system identification. PCI is used to create two layers of dynamic nonlinear systems that map divergent evolutionary profile input data to secondary structure assignment output data. PCI prediction accuracy compares well with that of eleven top contemporary methods over a dataset of new protein structures. Furthermore, PCI is a highly effective means of combining multiple experts, achieving the highest observed accuracy over two test datasets as well as the lowest rate of occurrence of a particularly detrimental class of errors. One limitation of the PCI classifiers is that approximately 13% of all amino acids cannot readily be assigned predictions because of the settling time introduced by the dynamic linear component in each cascade model. In this paper we describe a number of methods designed to overcome this limitation. While zero-padding of the input sequence data proved to be the most effective solution in terms of prediction accuracy, an analysis of causal, anti-causal, and mixed cascades provides interesting insights into the biological mechanism of protein folding.
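The settling-time limitation and the zero-padding remedy can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single cascade made of a finite-memory linear filter followed by a static polynomial, and it treats the input as a one-dimensional signal rather than an evolutionary profile. The names `cascade_output` and `predict_all_positions`, the filter `h`, and the coefficients `poly` are hypothetical and chosen only for illustration.

```python
import numpy as np

def cascade_output(x, h, poly):
    """One cascade: dynamic linear element (FIR filter h), then a static polynomial.

    Only positions where the filter has fully settled are returned, so the
    result is len(h) - 1 samples shorter than the input.
    """
    settled = np.convolve(x, h, mode="valid")
    return np.polyval(poly, settled)

def predict_all_positions(x, h, poly, mode="causal"):
    """Zero-pad the input so that every position receives an output.

    causal:      output n depends on x[n - m], ..., x[n]  (pad before the sequence)
    anticausal:  output n depends on x[n], ..., x[n + m]  (pad after the sequence)
    where m = len(h) - 1 is the filter memory (an assumed, illustrative value).
    """
    m = len(h) - 1
    pad = np.zeros(m)
    if mode == "causal":
        return cascade_output(np.concatenate([pad, x]), h, poly)
    if mode == "anticausal":
        # Reversing h makes h[i] weight the sample i positions ahead.
        return cascade_output(np.concatenate([x, pad]), h[::-1], poly)
    raise ValueError("mode must be 'causal' or 'anticausal'")

# Illustrative values only: a 10-sample input and a filter with memory 2.
x = np.arange(1.0, 11.0)
h = np.array([0.5, 0.3, 0.2])
poly = np.array([1.0, 0.0, 0.0])  # static nonlinearity y = u**2

unpadded = cascade_output(x, h, poly)                    # 8 outputs
causal = predict_all_positions(x, h, poly)               # 10 outputs
anticausal = predict_all_positions(x, h, poly, "anticausal")
```

In this sketch the unpadded cascade leaves the first `len(h) - 1` positions without an output, which mirrors the unassigned residues described above. Padding before the sequence gives a causal prediction at every position, and padding after it gives an anti-causal one. A mixed cascade would pad on both sides.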

Additional Metadata
Keywords Nonlinear system identification, Parallel cascade identification, Bioinformatics, Protein secondary structure prediction
Conference 2006 Canadian Conference on Electrical and Computer Engineering, CCECE'06
Green, J., & Korenberg, M.J. (Michael J.). (2007). Nonlinear system identification provides insight into protein folding. Presented at the 2006 Canadian Conference on Electrical and Computer Engineering, CCECE'06. doi:10.1109/CCECE.2006.277670