I/O Performance Evaluation of Large-Scale Deep Learning on an HPC System

Journal: 2019 International Conference on High Performance Computing and Simulation, HPCS 2019

Citation: 2019 International Conference on High Performance Computing and Simulation, HPCS 2019, pp.436-439

Keyword: component; distributed deep learning; HPC; Intel-Caffe; large mini-batch; large-scale cluster

Mesh Keyword: Computing resource; Diverse fields; Large-scale clusters; Learning frameworks; Science and Technology; Training process; Training time

All Science Classification Codes (ASJC): Computer Science Applications; Hardware and Architecture; Modeling and Simulation; Computer Networks and Communications

Abstract: Recently, deep learning has become important in diverse fields. Because the training process requires a huge amount of computing resources, many researchers have proposed methods that use large-scale clusters to reduce the training time. Despite many proposals concerning the training process for large-scale clusters, there remain areas to be developed. In this study, we benchmark the performance of Intel-Caffe, a general-purpose distributed deep learning framework, on the Nurion supercomputer of the Korea Institute of Science and Technology Information. We focus in particular on identifying the file I/O factors that affect the performance of Intel-Caffe, and on evaluating its performance in a container-based environment. To the best of our knowledge, we present the first benchmark results for distributed deep learning in a container-based environment on a large-scale cluster.

URI: https://aurora.ajou.ac.kr/handle/2018.oak/36432
https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85092018797&origin=inward

Journal URL: http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=9183768
