Abstract
High performance computing (HPC) workflows are undergoing tumultuous changes, including an explosion in size and complexity. Despite these changes, most batch job systems still use slow, centralized schedulers. Generalized hierarchical scheduling (GHS) solves many of the challenges that face modern workflows, but GHS has not been widely adopted in HPC. A major difficulty that hinders adoption is the lack of a performance model to aid in configuring GHS for optimal performance on a given application. We propose an analytical performance model of GHS, and we validate our proposed model with four different applications on a moderately-sized system. Our validation shows that our model is extremely accurate at predicting the performance of GHS, explaining 98.7% of the variance (i.e., an R2 statistic of 0.987). Our results also support the claim that GHS overcomes scheduling throughput problems; we measured throughput improvements of up to 270× on our moderately-sized system. We then apply our performance model to a pre-exascale system, where our model predicts throughput improvements of four orders of magnitude and provides insight into optimally configuring GHS on next generation systems.
Keywords
Get full access to this article
View all access options for this article.
