Big data has become essential for businesses because it enables companies and organizations to gather insights from their data and use them to identify marketing opportunities, support decision-making, or find new business opportunities. Companies spend a great deal of effort collecting large amounts of data, which in some cases must be processed in real time in order to capitalize on those opportunities. Predicting the input load at a given point in time can be very difficult and is sometimes impossible. As a result, considerable effort goes into techniques for handling varying input loads. A widely used approach is dynamic resource provisioning, but a resource provisioner may not react in time to a resource shortage, which can result in increased processing latencies. This paper presents a priority scheduling technique that can be used in conjunction with both dynamic and static resource provisioning. The approach allows users to assign a priority to input data items, and the scheduler ensures that higher priority data items are given precedence over lower priority ones. When resources become constrained, higher priority data items therefore receive a greater share of resources and experience lower queueing delays than low priority items. A prototype of the data-driven priority scheduler is implemented on the Spark Streaming system.
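The idea of giving higher priority data items precedence when capacity is limited can be illustrated with a small sketch. This is not the paper's scheduler and does not use any Spark Streaming API; the class `PriorityBuffer`, its methods `put` and `take_batch`, and the integer priorities are all hypothetical names chosen for illustration, assuming that a larger number means higher priority and that items of equal priority are served in arrival order.

```python
import heapq
import itertools


class PriorityBuffer:
    """Hypothetical buffer that releases higher priority items first."""

    def __init__(self):
        self._heap = []
        # Arrival counter used as a tie-breaker: FIFO within one priority level.
        self._arrival = itertools.count()

    def put(self, item, priority):
        # heapq is a min-heap, so the priority is negated to pop large values first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), item))

    def take_batch(self, capacity):
        """Remove and return up to `capacity` items, highest priority first."""
        batch = []
        while self._heap and len(batch) < capacity:
            _, _, item = heapq.heappop(self._heap)
            batch.append(item)
        return batch

    def __len__(self):
        return len(self._heap)


buf = PriorityBuffer()
for name, prio in [("a", 1), ("b", 3), ("c", 1), ("d", 2), ("e", 3)]:
    buf.put(name, prio)

# Constrained resources: only three items fit in each processing round.
first = buf.take_batch(3)
second = buf.take_batch(3)
print(first, second)
```

In this sketch the low priority items "a" and "c" wait for the second round even though "a" arrived first, which mirrors the longer queueing delay for low priority items described above.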

Additional Metadata
Keywords Priority Scheduling, Spark, Spark Streaming
Persistent URL dx.doi.org/10.1109/CCGRID.2019.00072
Conference 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019
Citation
Ajila, T. (Tobi), & Majumdar, S. (2019). Data driven priority scheduling on a Spark Streaming system. In Proceedings - 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019 (pp. 561–568). doi:10.1109/CCGRID.2019.00072