As large numbers of heterogeneous processors are deployed in service-oriented cloud computing systems, random hardware failures of processors are becoming increasingly prominent. Replication-based fault-tolerant task assignment is a common approach to satisfying an application’s reliability requirement. However, state-of-the-art algorithms suffer from either high redundancy or low time efficiency. In this work, we propose a fast task assignment for minimizing redundancy (FTAMR) algorithm that satisfies the reliability requirement of a directed acyclic graph-based parallel application on heterogeneous service-oriented cloud computing systems. First, the FTAMR algorithm quickly identifies the tasks that need to be replicated. Second, it quickly maps the selected tasks to their respective most suitable processors. It then repeats these two steps until the application’s reliability satisfies the specified reliability requirement. Experimental results on real and synthetically generated parallel applications with different scales, degrees of parallelism, and degrees of heterogeneity show that the FTAMR algorithm achieves the lowest redundancy and the highest time efficiency among the compared state-of-the-art fault-tolerant algorithms.
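The identify, map, and repeat loop described above can be illustrated with a minimal sketch. This is not the FTAMR algorithm itself, whose selection and mapping rules are not given here. It is a generic greedy replication loop under assumptions that are ours: processor failures follow an exponential model, a task succeeds if any of its replicas succeeds, and the application's reliability is the product of its task reliabilities. DAG precedence and scheduling are ignored. All names (`greedy_replicate`, `exec_time`, `failure_rate`) are hypothetical.

```python
import math


def task_reliability(replica_rels):
    """Reliability of a task that succeeds if at least one replica succeeds."""
    fail = 1.0
    for r in replica_rels:
        fail *= 1.0 - r
    return 1.0 - fail


def greedy_replicate(exec_time, failure_rate, target):
    """Add replicas greedily until the application reliability reaches target.

    exec_time[i][k]  : execution time of task i on processor k (assumed input)
    failure_rate[k]  : constant failure rate of processor k (assumed model)
    Returns (assignment, reliability), where assignment[i] lists the
    processors that hold a replica of task i.
    """
    n, m = len(exec_time), len(failure_rate)
    # Assumed exponential model: R(i, k) = exp(-lambda_k * w_ik).
    rel = [[math.exp(-failure_rate[k] * exec_time[i][k]) for k in range(m)]
           for i in range(n)]
    # Start with one copy of each task on its most reliable processor.
    assignment = [[max(range(m), key=lambda k: rel[i][k])] for i in range(n)]

    def current(i):
        return task_reliability([rel[i][k] for k in assignment[i]])

    def app_reliability():
        return math.prod(current(i) for i in range(n))

    while app_reliability() < target:
        # Step 1: pick the task to replicate (here: the least reliable one).
        candidates = [i for i in range(n) if len(assignment[i]) < m]
        if not candidates:
            raise ValueError("target reliability is unreachable")
        i = min(candidates, key=current)
        # Step 2: map the new replica to the best unused processor.
        free = [k for k in range(m) if k not in assignment[i]]
        assignment[i].append(max(free, key=lambda k: rel[i][k]))
    return assignment, app_reliability()


if __name__ == "__main__":
    plan, reliability = greedy_replicate(
        exec_time=[[10, 20], [15, 5]],
        failure_rate=[0.01, 0.02],
        target=0.95,
    )
    print(plan, round(reliability, 4))
```

Choosing the least reliable task at each step is only one plausible heuristic. It keeps the loop simple but does not by itself guarantee the minimum number of replicas.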
This work was supported in part by the Natural Science Foundation of Hunan Province, China, under Grant 2020JJ6063 and Grant 2019JJ50592, in part by the National Key Research and Development Program of China under Grant 2018YFB1003702, in part by the National Natural Science Foundation of China under Grant 61902336 and Grant 61703157, in part by the Hunan Province Science and Technology Project Funds under Grant 2018TP1036, and in part by the CERNET Innovation Project under Grant NGII20160310.