Scientific research organizations generate several petabytes of data per year through computational science simulations. These data are often shared among geographically distributed data centers for analysis. One of the major challenges in distributed environments is failure: hardware, network, and software might fail at any instant. High-speed, fault-tolerant data transfer frameworks are therefore vital for moving such large data sets efficiently between data centers. In this study, we propose a Bloom filter-based data-aware probabilistic fault tolerance (DAFT) mechanism that can handle such failures. We also propose a data- and layout-aware fault tolerance (DLFT) mechanism to effectively handle the false positive matches of DAFT. We evaluated the data transfer and recovery time overheads that the proposed fault tolerance mechanisms impose on overall data transfer performance. The experimental results demonstrate that the DAFT and DLFT mechanisms exhibit a maximum recovery time overhead of 10% at the 80% fault point and a minimum of 2% at the 20% fault point. In contrast, we observed minimal to negligible overhead with respect to the overall data transfer rate.
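To illustrate why a Bloom filter-based transfer log is probabilistic, the following minimal Python sketch records transferred blocks in a Bloom filter and, after a simulated fault, queries it to decide which blocks to resend. It is not the DAFT implementation: the `BloomFilter` class, the `blocks_to_resend` helper, the block naming scheme, and the filter parameters are all hypothetical choices made for illustration. A Bloom filter never reports a transferred block as missing, but it can report an untransferred block as already transferred (a false positive), which is the case a mechanism such as DLFT would need to handle.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over SHA-256 (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def blocks_to_resend(all_blocks, transferred_log):
    """Return the blocks that the filter does not report as transferred."""
    return [b for b in all_blocks if b not in transferred_log]


# Hypothetical transfer: 1000 blocks, fault after 80% have been sent.
blocks = [f"file.dat:{i}" for i in range(1000)]
log = BloomFilter(num_bits=16384, num_hashes=4)
fault_point = 800
for b in blocks[:fault_point]:
    log.add(b)

# Any block missing from `pending` that was never sent is a false positive.
pending = blocks_to_resend(blocks, log)
```

In this sketch, `pending` can only contain blocks from the untransferred 20%, but it may omit a few of them when the filter returns a false positive; the exact count depends on the filter size and number of hash functions assumed here.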
This work was supported in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) Grant funded by the Korean Government (MSIT) under Grant 2020-0-01592, and in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education under Grant 2019R1F1A1058548.