An Implementation of Job Running Backup Function in User-PC Computing System

Hein Htet, Nobuo Funabiki, Ariel Kamoyedji, Xudong Zhou, Xu Xiang, Shinji Sugawara, Wen Chung Kao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We have studied the User-PC Computing (UPC) system, based on the master-worker model, as a low-cost, high-performance distributed computing platform. It adopts Docker container technology to run various application programs, or jobs, on workers with heterogeneous PC environments. Some jobs, such as physics simulations and neural networks, require long CPU time, which increases the probability that a worker fails while running them. Automatically backing up the running state of a job and migrating it to another worker is essential to reduce the delay in job completion. In this paper, we implement the job running backup function in the UPC system. Checkpoint-Restore in Userspace (CRIU) is applied periodically to capture the state of a job running at a worker. When the master detects a worker failure, it automatically migrates the job to another worker. To evaluate the function, we conducted experiments on a testbed UPC system with 14 jobs and six workers of different specifications, and confirmed that the proposal successfully resumes the job from the interrupted point at another worker.
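The checkpoint-and-migrate cycle described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the `Job` and `Master` classes, the heartbeat-timeout failure detector, and the step-based checkpoint interval are all hypothetical, and the checkpoint is modeled as a saved step counter instead of an actual CRIU image.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Job:
    name: str
    total_steps: int
    done_steps: int = 0          # live progress on the current worker
    checkpoint_step: int = 0     # progress captured by the last checkpoint
    worker: Optional[str] = None


@dataclass
class Master:
    # Both parameters are illustrative assumptions, not values from the paper.
    checkpoint_interval: int     # steps between periodic checkpoints
    heartbeat_timeout: float     # seconds of silence before a worker counts as failed
    last_heartbeat: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, worker: str, now: float) -> None:
        """Record that a worker reported in at time `now`."""
        self.last_heartbeat[worker] = now

    def is_failed(self, worker: str, now: float) -> bool:
        """Assumed failure detector: no heartbeat within the timeout."""
        last = self.last_heartbeat.get(worker, float("-inf"))
        return now - last > self.heartbeat_timeout

    def run_step(self, job: Job) -> None:
        """Advance the job by one step and checkpoint periodically."""
        job.done_steps += 1
        if job.done_steps % self.checkpoint_interval == 0:
            # Stand-in for dumping the container state with CRIU.
            job.checkpoint_step = job.done_steps

    def migrate_if_failed(self, job: Job, spare: str, now: float) -> bool:
        """Move the job to `spare` and resume from the last checkpoint."""
        if job.worker is None or not self.is_failed(job.worker, now):
            return False
        job.worker = spare
        job.done_steps = job.checkpoint_step  # work after the checkpoint is lost
        self.heartbeat(spare, now)
        return True


def demo() -> Job:
    master = Master(checkpoint_interval=10, heartbeat_timeout=5.0)
    job = Job(name="simulation", total_steps=100, worker="worker-A")
    master.heartbeat("worker-A", now=0.0)
    for _ in range(37):                       # worker-A completes 37 steps, then goes silent
        master.run_step(job)
    master.migrate_if_failed(job, spare="worker-B", now=10.0)
    return job
```

In this sketch, the job returned by `demo()` ends up on `worker-B` at step 30, the last checkpoint taken before the simulated failure, so only the seven steps after that checkpoint have to be repeated. A shorter checkpoint interval reduces the repeated work at the cost of more frequent state captures.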

Original language: English
Title of host publication: 2022 4th International Conference on Computer Communication and the Internet, ICCCI 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 156-161
Number of pages: 6
ISBN (Electronic): 9781665469920
DOIs
Publication status: Published - 2022
Event: 4th International Conference on Computer Communication and the Internet, ICCCI 2022 - Chiba, Japan
Duration: Jul 1 2022 to Jul 3 2022

Publication series

Name: 2022 4th International Conference on Computer Communication and the Internet, ICCCI 2022

Conference

Conference: 4th International Conference on Computer Communication and the Internet, ICCCI 2022
Country/Territory: Japan
City: Chiba
Period: 7/1/22 to 7/3/22

Keywords

  • CRIU
  • Docker
  • periodic checkpoint
  • Podman
  • UPC system

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Information Systems and Management
  • Atomic and Molecular Physics, and Optics
