The big data software stack based on Apache Spark and Hadoop has become mission-critical in many enterprises. The performance of Spark and Hadoop jobs depends on a large number of configuration settings, and manual tuning is expensive and brittle. There have been efforts to develop online and offline automatic tuning approaches to make the big data stack more autonomic, but many researchers have noted that it is important to tune only when truly necessary, because many parameter searches can reduce rather than enhance performance. Autonomic systems need to be able to accurately detect important changes in workload characteristics, predict future workload characteristics, and use this information to proactively optimise resource allocation and the frequency of parameter searches. This paper presents the first study focusing on workload change detection, change classification, and workload forecasting in big data workloads. We demonstrate 99% accuracy for workload change detection, 90% accuracy for workload and workload transition classification, and up to 96% accuracy for future workload type prediction on Spark and Hadoop job flows simulated using popular big data benchmarks. Our method does not rely on past workload history for workload type prediction.

Additional Metadata
Keywords: Big data autonomic computing, big data performance optimisation, Hadoop, on-line automatic tuning, Spark, workload change detection, workload forecasting, YARN
Persistent URL: dx.doi.org/10.1109/BigData47090.2019.9006149
Conference: 2019 IEEE International Conference on Big Data, Big Data 2019
Citation: Genkin, M., & Dehne, F. (2019). Autonomic Workload Change Classification and Prediction for Big Data Workloads. In Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019 (pp. 2835–2844). doi:10.1109/BigData47090.2019.9006149