Sentiment analysis is a promising branch of natural language processing, but it becomes challenging on Twitter data because of three problems: the large volume of data, the rapidly changing language style, and the lack of training data. These problems make it difficult to apply traditional lexicon-based approaches and supervised learning methods. In this paper, we propose a graph-based label propagation algorithm to address the last two problems, and we apply GraphX, the Spark API for graph-parallel computation, to address the first. The results show that the label propagation algorithm is robust and scalable in our parallel implementation. Moreover, our approach, which uses a lexicon and noisy labels such as emoticons, significantly outperforms the baseline. In future work, we plan to test more algorithms on clusters and to make better use of the social network by adding a community detection step before classification to improve accuracy.
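
As a rough illustration of the general idea behind graph-based label propagation, the following single-machine Python sketch spreads sentiment scores from a few seed nodes to unlabeled neighbours. It is not the authors' GraphX implementation: the function name `propagate_labels`, the tweet/word graph layout, the edge weights, and the toy data are all assumptions made for this example, and the update rule (a weighted neighbour average with clamped seeds) is only one common variant of label propagation.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, iterations=50):
    """Semi-supervised label propagation on an undirected weighted graph.

    edges: iterable of (node_a, node_b, weight) tuples.
    seeds: dict mapping node -> score in [-1.0, 1.0]; these stay fixed.
    Returns a dict mapping node -> score; the sign is the predicted polarity.
    """
    # Build an adjacency list; every edge is usable in both directions.
    neighbors = defaultdict(list)
    for a, b, w in edges:
        neighbors[a].append((b, w))
        neighbors[b].append((a, w))

    # Unlabeled nodes start neutral; seed nodes start at their known score.
    scores = {node: seeds.get(node, 0.0) for node in neighbors}
    for node, value in seeds.items():
        scores.setdefault(node, value)

    for _ in range(iterations):
        updated = {}
        for node in scores:
            if node in seeds:
                # Clamp seeds so noisy labels keep acting as sources.
                updated[node] = seeds[node]
                continue
            nbrs = neighbors.get(node, [])
            total = sum(w for _, w in nbrs)
            # Weighted average of the neighbours' previous scores.
            updated[node] = (
                sum(w * scores[n] for n, w in nbrs) / total if total else 0.0
            )
        scores = updated
    return scores


# Hypothetical toy graph: tweets (t1..t4) linked to the words they contain.
edges = [
    ("t1", "happy", 1.0),
    ("t2", "happy", 1.0),
    ("t2", "great", 1.0),
    ("t3", "awful", 1.0),
    ("t4", "awful", 1.0),
]
# Seeds: t1 and t3 carry emoticon labels, "great" comes from a lexicon.
seeds = {"t1": 1.0, "t3": -1.0, "great": 1.0}

scores = propagate_labels(edges, seeds)
print({node: round(score, 3) for node, score in sorted(scores.items())})
```

In this toy graph the unlabeled tweets `t2` and `t4` end up with positive and negative scores respectively, purely through the words they share with labeled nodes. A distributed version would express the same neighbour-averaging step as message passing over graph partitions, which is the kind of computation GraphX is designed for.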

Additional Metadata
Keywords: Big Data, GraphX, Label Propagation, Parallel Computing, Sentiment analysis, Spark
Persistent URL: dx.doi.org/10.1109/TrustCom/BigDataSE.2018.00270
Conference: 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications and 12th IEEE International Conference on Big Data Science and Engineering, Trustcom/BigDataSE 2018
Citation
Yang, Y. (Yibing), & Shafiq, M.O. (2018). Large Scale and Parallel Sentiment Analysis Based on Label Propagation in Twitter Data. In Proceedings - 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications and 12th IEEE International Conference on Big Data Science and Engineering, Trustcom/BigDataSE 2018 (pp. 1791–1798). doi:10.1109/TrustCom/BigDataSE.2018.00270