In this paper, we propose pool-based active learning with support vector machine (SVM) classifiers for predicting asparagine/aspartate (N/D) hydroxylation sites on proteins. Verifying hydroxylation sites on human proteins in wet-lab experiments is very costly and can be time-consuming. An active learning procedure can therefore be used to choose which putative hydroxylation sites to submit for future wet-lab validation so that each experiment yields maximal information. Using a dataset of N/D sites with known hydroxylation state, we demonstrate through simulations that active learning query strategies can achieve higher classification performance with fewer labelled training instances than traditional passive learning. The query strategies examined (uncertainty, density-uncertainty, and certainty) are shown to identify the most informative unlabelled instances for oracle annotation at each learning cycle. Our experimental results also show that the active learning strategies are highly robust to class imbalance in the available unlabelled data.
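
The pool-based loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic two-dimensional points standing in for encoded N/D sites, the classifier is a plain linear SVM fitted by subgradient descent on the hinge loss (the abstract does not state the kernel or the solver), and only the uncertainty strategy is shown. All function names and parameters (`make_pool`, `most_uncertain`, `positive_fraction`, and so on) are hypothetical and chosen for illustration.

```python
import random

def decision(w, b, x):
    """Signed score of a linear SVM; the sign gives the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, epochs=100, lr=0.05, lam=0.01, seed=0):
    """Fit a linear SVM by subgradient descent on the regularised hinge loss.
    Labels are expected in {-1, +1}. Hyperparameters are assumptions."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * decision(w, b, X[i]) < 1
            w = [wj * (1 - lr * lam) for wj in w]  # L2 shrinkage
            if violated:  # hinge loss is active for this instance
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def most_uncertain(w, b, X, pool):
    """Uncertainty sampling: the pool index closest to the decision boundary."""
    return min(pool, key=lambda i: abs(decision(w, b, X[i])))

def make_pool(n=200, positive_fraction=0.2, seed=1):
    """Synthetic, class-imbalanced data (an assumption, not the paper's dataset)."""
    rng = random.Random(seed)
    n_pos = round(n * positive_fraction)
    X, y = [], []
    for k in range(n):
        label = 1 if k < n_pos else -1
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y

def active_learning(X, y, n_queries=10):
    """Simulate pool-based active learning; y plays the role of the oracle."""
    # Seed the labelled set with one instance of each class.
    labelled = [y.index(1), y.index(-1)]
    pool = [i for i in range(len(X)) if i not in labelled]
    history = []
    for _ in range(n_queries):
        w, b = train_linear_svm([X[i] for i in labelled], [y[i] for i in labelled])
        query = most_uncertain(w, b, X, pool)
        pool.remove(query)
        labelled.append(query)  # the oracle reveals y[query]
        history.append(query)
    # Refit on everything labelled so far and score on the full pool
    # (labelled instances included).
    w, b = train_linear_svm([X[i] for i in labelled], [y[i] for i in labelled])
    correct = sum((decision(w, b, X[i]) >= 0) == (y[i] == 1) for i in range(len(X)))
    return history, labelled, correct / len(X)

if __name__ == "__main__":
    X, y = make_pool()
    history, labelled, accuracy = active_learning(X, y)
    print("queried indices:", history)
    print("accuracy on the full pool:", round(accuracy, 3))
```

A passive-learning baseline for comparison would replace `most_uncertain` with a uniformly random choice from `pool`; lowering `positive_fraction` gives a rough way to probe behaviour under stronger class imbalance.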

Additional Metadata
Keywords Active learning, Class imbalance, Hydroxylation site prediction, Support vector machines
Persistent URL dx.doi.org/10.2316/P.2011.753-034
Conference 6th IASTED International Conference on Computational Intelligence and Bioinformatics, CIB 2011
Citation
Iyuke, F.O. (Festus O.), Green, J., & Willmore, W. (2011). Active learning for the prediction of asparagine/aspartate hydroxylation sites on proteins. Presented at the 6th IASTED International Conference on Computational Intelligence and Bioinformatics, CIB 2011. doi:10.2316/P.2011.753-034