Big data Hadoop and Spark applications are deployed on infrastructure managed by resource managers such as Apache YARN, Mesos, and Kubernetes, and run in constructs called containers. These applications often require extensive manual tuning to achieve acceptable levels of performance. While there have been several promising attempts to develop automatic tuning systems, none are currently robust enough to handle realistic workload conditions. Big data workload analysis research performed to date has focused mostly on system-level parameters, such as CPU and memory utilization, rather than higher-level container metrics. In this paper, we present the first detailed experimental analysis of container performance metrics in Hadoop and Spark workloads. We demonstrate that big data workloads show unique patterns of container creation, container completion, response time, and relative standard deviation of response time. Based on these observations, we built a machine-learning-based workload classifier that achieves a workload classification accuracy of 83% and a workload change detection accuracy of 74%. Our experimental results are an important step towards developing automatically tuned, fully autonomous cloud infrastructure for big data analytics.
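
To make the four container metrics named above concrete, the following is a minimal sketch of how they might be computed over one observation window. It is an illustration only, not the method used in the paper: the input format (a list of `(start_time, end_time)` pairs in seconds), the function names `relative_std_dev` and `window_features`, and the definition of response time as end time minus start time are all assumptions made for this example. Relative standard deviation is taken here as the population standard deviation divided by the mean.

```python
from statistics import mean, pstdev


def relative_std_dev(values):
    """Relative standard deviation: standard deviation divided by the mean.

    Returns 0.0 for an empty list or a zero mean to avoid division by zero.
    """
    if not values:
        return 0.0
    m = mean(values)
    if m == 0:
        return 0.0
    return pstdev(values) / m


def window_features(containers, window_start, window_end):
    """Summarize container activity in the half-open window [window_start, window_end).

    `containers` is a hypothetical list of (start_time, end_time) tuples in seconds.
    Response time is assumed to be end_time - start_time.
    """
    # Containers created (started) inside the window.
    created = sum(1 for start, _ in containers if window_start <= start < window_end)
    # Response times of containers that completed inside the window.
    response_times = [
        end - start for start, end in containers if window_start <= end < window_end
    ]
    return {
        "created": created,
        "completed": len(response_times),
        "mean_response_time": mean(response_times) if response_times else 0.0,
        "rsd_response_time": relative_std_dev(response_times),
    }


# Hypothetical sample: four containers, the last one finishing outside the window.
sample = [(0, 10), (2, 12), (5, 35), (40, 50)]
features = window_features(sample, 0, 40)
```

A feature vector of this kind, computed per window, is the sort of input a workload classifier could consume; which features and window sizes the authors actually used is not stated in this abstract.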

Additional Metadata
Keywords: Big data cloud performance, Hadoop, On-line automatic tuning, Spark, YARN
Persistent URL: dx.doi.org/10.1007/978-3-030-32813-9_11
Series: Lecture Notes in Computer Science
Citation
Genkin, M. (Mikhail), Dehne, F., Navarro, P. (Pablo), & Zhou, S. (Siyu). (2019). Machine-Learning Based Spark and Hadoop Workload Classification Using Container Performance Patterns. In Lecture Notes in Computer Science. doi:10.1007/978-3-030-32813-9_11