This paper deals with the problem of language detection and tracking in multilingual online short word-of-mouth (WoM) discussions. This problem is particularly unusual and difficult from a pattern recognition perspective because the participants in these discussions, and the opinions they contribute, come from all over the world. Because these discussions consist of multiple topics in different languages, we face the problem of finding training and classification strategies when the class-conditional distributions are nonstationary. The difficulties in solving the problem are manifold. First, the analyst has no knowledge of when one language stops and the next starts. Further, the features that one uses for any one language (for example, the n-grams) will not be valid for recognizing another. Finally, and most importantly, in most real-life applications, such as WoM, the fragments of text available before a switch are so small that any meaningful classification using traditional estimation methods is almost futile. Earlier, the authors [B. J. Oommen and L. Rueda, Patt. Recogn. 39(1) (2006) 328-341.] had argued that, for a variety of problems, the use of strong estimators (i.e. estimators that converge with probability 1) is sub-optimal. In this vein, we propose to solve the current problem using novel estimators that are pertinent for nonstationary environments. The classification results obtained for various data sets, which involve as many as eight languages, demonstrate that our proposed methodology is both powerful and efficient.
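To illustrate why an estimator that does not converge with probability 1 can suit a nonstationary stream, the following minimal sketch tracks a distribution over symbols with an exponentially forgetting update. It is an illustration only and is not taken from the paper: the multiplicative update rule, the function names `weak_update` and `track`, the forgetting parameter `lam`, and the toy two-symbol stream are all assumptions made for this example.

```python
# Illustrative sketch of a "weak" (forgetting) estimator for a
# multinomial distribution. The update rule and all names are assumptions
# for this example; they are not the paper's stated method.

def weak_update(probs, observed, lam=0.9):
    """Return a new estimate after observing the symbol index `observed`.

    Every probability is shrunk by `lam`, and the freed mass (1 - lam)
    is given to the observed symbol, so the estimate still sums to 1.
    """
    updated = [lam * p for p in probs]
    updated[observed] += 1.0 - lam
    return updated


def track(stream, num_symbols, lam=0.9):
    """Run the update over `stream` and return the estimate after each symbol."""
    probs = [1.0 / num_symbols] * num_symbols  # start from a uniform guess
    history = []
    for symbol in stream:
        probs = weak_update(probs, symbol, lam)
        history.append(probs)
    return history


# Toy stream: the source emits symbol 0 for a while, then switches to symbol 1.
stream = [0] * 30 + [1] * 30
history = track(stream, num_symbols=2, lam=0.8)

print("before the switch:", history[29])
print("after the switch: ", history[-1])
```

Because old observations are discounted geometrically, the estimate follows the source shortly after the switch, whereas a running average over the whole stream would stay near one half. A smaller `lam` reacts faster to a switch but gives noisier estimates on short fragments.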

International Journal of Pattern Recognition and Artificial Intelligence
School of Computer Science

Stensby, A. (Aleksander), Oommen, J., & Granmo, O.-C. (Ole-Christoffer). (2013). The use of weak estimators to achieve language detection and tracking in multilingual documents. International Journal of Pattern Recognition and Artificial Intelligence, 27(4). doi:10.1142/S0218001413500110