In high-performance computing (HPC), workflow-based workloads are typically data intensive: they support exploratory analysis of scientific computation problems that may involve a large parameter space. Storage capacity is a pragmatic constraint on their performance, because the scale of the problem space and of the required datasets, especially in big data science, keeps growing faster than storage capacity does. Workflow computation in an HPC environment with finite storage resources therefore remains a practical topic worth studying. To this end, we propose a novel scheduling framework that enhances the Versioned Name Space and Overwrite-Safe Concurrency scheduling policies, introduced in our earlier work, with the ability to handle deadlock in workflow computation under finite storage constraints. We achieve this by using the workflow's data dependency information to integrate a collection of deadlock resolution algorithms into the workflow scheduler. Extensive simulation-based studies show that the enhanced scheduling policies can resolve the deadlock caused by storage constraints when data overflows the available capacity. More interestingly, when storage is highly constrained, the enhanced policies achieve better makespan performance than applying pure deadlock resolution algorithms alone.
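To illustrate the kind of deadlock the abstract refers to, the following minimal sketch simulates a workflow whose tasks compete for a fixed storage budget. It is not the framework or any of the deadlock resolution algorithms from the paper. The `Task` class, the `simulate` function, the greedy first-runnable policy, and the rule that a dataset is reclaimed once no pending task reads it are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Task:
    """Hypothetical task model: reads named datasets, writes sized datasets."""
    name: str
    inputs: Tuple[str, ...]   # dataset names this task reads
    outputs: Dict[str, int]   # dataset name -> size in storage units


def simulate(tasks: List[Task], capacity: int) -> Tuple[List[str], bool]:
    """Run tasks greedily under a storage budget.

    Returns (execution order, deadlocked). A deadlock is reported when
    tasks remain but none can run: every pending task is either missing
    an input or has no room to write its outputs.
    """
    stored: Dict[str, int] = {}   # datasets currently held in storage
    pending = list(tasks)
    order: List[str] = []
    while pending:
        used = sum(stored.values())
        runnable = [
            t for t in pending
            if all(d in stored for d in t.inputs)
            and used + sum(t.outputs.values()) <= capacity
        ]
        if not runnable:
            return order, True    # storage-induced deadlock
        task = runnable[0]        # assumed policy: first runnable task
        pending.remove(task)
        stored.update(task.outputs)
        order.append(task.name)
        # Assumed reclamation rule: drop datasets no pending task reads.
        needed = {d for t in pending for d in t.inputs}
        for d in list(stored):
            if d not in needed:
                del stored[d]
    return order, False


# C needs both a and b to stay resident while it writes c.
workflow = [
    Task("A", (), {"a": 2}),
    Task("B", ("a",), {"b": 2}),
    Task("C", ("a", "b"), {"c": 2}),
]

print(simulate(workflow, capacity=4))  # C cannot fit its output
print(simulate(workflow, capacity=6))  # enough room for all three datasets
```

With a capacity of 4 units, tasks A and B complete but C can never start, because its two inputs already fill the storage and neither can be reclaimed before C runs. A scheduler that knows the data dependencies could detect this state and resolve it, which is the role the abstract assigns to the integrated deadlock resolution algorithms.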

Additional Metadata
Keywords: concurrency and computation, dataflow, deadlock resolution, workflow scheduling
Persistent URL: dx.doi.org/10.1093/comjnl/bxu109
Journal: Computer Journal
Citation
Wang, Y. (Yang), & Shi, W. (2015). Dataflow-Based Scheduling for Scientific Workflows in HPC with Storage Constraints. Computer Journal, 58(7), 1628–1644. doi:10.1093/comjnl/bxu109