Real-time and dynamic fault-tolerant scheduling for scientific workflows in clouds

Abstract

Cloud computing has become a popular technology for executing scientific workflows. However, as large numbers of hosts and virtual machines (VMs) are deployed, cloud resource failures, such as the permanent failure of hosts (HPF), the transient failure of hosts (HTF), and the transient failure of VMs (VMTF), undermine service reliability. Fault tolerance is therefore essential for time-consuming scientific workflows in the cloud. Existing fault-tolerant (FT) approaches, however, consider only one or two of these failure types and neglect the others, especially HTF. This paper proposes a Real-time and dynamic Fault-tolerant Scheduling (ReadyFS) algorithm for scientific workflow execution in a cloud, which guarantees deadline constraints and improves resource utilization even in the presence of any of these resource failures. Specifically, we first introduce two FT mechanisms, replication with delay execution (RDE) and checkpointing with delay execution (CDE), to cope with HPF and VMTF simultaneously. Additionally, rescheduling (ReSC) is devised to tackle HTF, which affects the resource availability of the entire cloud datacenter. Then, a resource adjustment (RA) strategy, comprising resource scaling-up (RS-Up) and resource scaling-down (RS-Down), is used to adjust resource demands and improve resource utilization dynamically. Finally, the ReadyFS algorithm is presented to schedule real-time scientific workflows by combining all of the above FT mechanisms with the RA strategy. We evaluate performance with real-world scientific workflows and compare ReadyFS with five vertical comparison algorithms and three horizontal comparison algorithms. Simulation results confirm that ReadyFS guarantees the fault tolerance of scientific workflow execution and improves cloud resource utilization.

Publication DOI: https://doi.org/10.1016/j.ins.2021.03.003
Divisions: College of Business and Social Sciences > Aston Business School
College of Business and Social Sciences > Aston Business School > Operations & Information Management
Additional Information: © 2021, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ Funding Information: This work was supported by the National Natural Science Foundation of China (No. 61802095, 61802167, 61572162), the Zhejiang Provincial Key Science and Technology Project Foundation (No. 2018C01012), the Key Program of Research and Development of China (2016YFC0800803), and the VC Research (No. VCR 0000057).
Uncontrolled Keywords: Checkpointing, Cloud computing, Delay execution, Fault-tolerant workflow scheduling, Replication, Rescheduling, Software, Control and Systems Engineering, Theoretical Computer Science, Computer Science Applications, Information Systems and Management, Artificial Intelligence
Publication ISSN: 1872-6291
Last Modified: 03 Sep 2024 07:14
Date Deposited: 09 Jun 2022 11:02
Related URLs: http://www.scop ... tnerID=8YFLogxK (Scopus URL)
https://www.sci ... 2401?via%3Dihub (Publisher URL)
PURE Output Type: Article
Published Date: 2021-08-01
Published Online Date: 2021-03-09
Accepted Date: 2021-03-02
Authors: Li, Zhongjin
Chang, Victor (ORCID Profile 0000-0002-8012-5852)
Hu, Haiyang
Hu, Hua
Li, Chuanyi
Ge, Jidong
