Gradient sparsification is widely adopted in distributed training; however, it involves a trade-off between computation and communication. The prevalent Top-k sparsifier achieves the desired gradient compression ratio but imposes substantial computational overhead, whereas the hard-threshold sparsifier eliminates this computational cost but fails to reach the targeted compression ratio. Motivated by this trade-off, we design a novel threshold-based sparsifier called SAGE, which achieves a compression ratio close to that of the Top-k sparsifier with negligible computational overhead. SAGE scales the compression ratio by deriving an adjustable threshold from each iteration's heuristics. Experimental results show that SAGE achieves a compression ratio closer to the desired ratio than the hard-threshold sparsifier without degrading the accuracy of model training. In terms of computation time for gradient selection, SAGE achieves a speedup of up to 23.62× over the Top-k sparsifier.
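To illustrate the general idea of a threshold-based sparsifier whose threshold is adjusted each iteration toward a target compression ratio, the following is a minimal sketch. It assumes a simple multiplicative feedback rule, which is not the heuristic used by SAGE; the function name adaptive_threshold_sparsify and the adjust_rate parameter are illustrative only.

```python
import numpy as np

def adaptive_threshold_sparsify(grad, threshold, target_ratio, adjust_rate=0.1):
    """Keep gradient elements whose magnitude exceeds `threshold`, then nudge
    the threshold toward `target_ratio` for the next iteration.
    (Illustrative feedback rule only, not the heuristic proposed in the paper.)"""
    mask = np.abs(grad) >= threshold
    achieved_ratio = mask.sum() / grad.size          # fraction of elements kept
    # Kept too many elements -> raise the threshold; too few -> lower it.
    new_threshold = threshold * (
        1.0 + adjust_rate * (achieved_ratio - target_ratio) / target_ratio
    )
    sparse_grad = np.where(mask, grad, 0.0)
    return sparse_grad, new_threshold, achieved_ratio

# Usage: the threshold adapts across iterations toward ~1% density.
rng = np.random.default_rng(0)
threshold, target = 2.0, 0.01
for step in range(5):
    grad = rng.standard_normal(1_000_000)            # stand-in for a real gradient
    _, threshold, ratio = adaptive_threshold_sparsify(grad, threshold, target)
    print(f"step {step}: kept {ratio:.4%}, next threshold {threshold:.4f}")
```

Unlike Top-k, this selection is a single elementwise comparison rather than a selection over the full gradient, which is the source of the computational savings that threshold-based sparsifiers offer.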
This work was jointly supported by the BK21 FOUR program (NRF5199991014091), the Basic Science Research Program (2022R1F1A1062779) of the National Research Foundation (NRF) of Korea, and the Korea Institute of Science and Technology Information (KISTI) (TS-2022-RE-0019 and KSC-2022-CRE-0406).