What is the state of memory saving for model training?

Under submission to the Sixth Conference on Machine Learning and Systems (MLSys 2023).

Machine Learning, Systems for ML, Efficient Training


  • Xiaoxuan Liu (UC Berkeley)
  • Chuyan Zhu (UC Berkeley)
  • Jialun Lyu (N/A)
  • Zhuohan Li (UC Berkeley)
  • Xiaoyong Liu (Alibaba Group US Inc.)
  • Daniel Kang (Stanford University)
  • Alvin Cheung (University of California, Berkeley)


Large neural networks can improve accuracy and generalization on tasks across many domains. However, this trend cannot continue indefinitely due to limited hardware memory. As a result, researchers have devised a number of memory optimization methods (MOMs) to alleviate the memory bottleneck, such as gradient checkpointing, quantization, and swapping. In this work, we study MOMs and show that, although these strategies do lower peak memory usage, they can decrease training throughput by up to 9.3×. To provide practical guidelines for practitioners, we propose PAPAYA, a simple but effective performance model that quantitatively explains the trade-off between memory and training time. PAPAYA can be used to determine when to apply the various memory optimization methods when training different models. Based on implications derived from PAPAYA, we outline the circumstances in which memory saving techniques are more advantageous. We assess the accuracy of PAPAYA and its derived implications on a variety of machine learning models, showing that it achieves over 0.97 R score on predicting peak memory and throughput, and accurately predicts the effectiveness of MOMs across five evaluated models on vision and NLP tasks.
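The memory and training time trade-off described above can be illustrated with a toy calculation. The sketch below is not PAPAYA's formulation, which this abstract does not give. It assumes, purely for illustration, that peak memory and per-iteration latency both grow linearly with batch size, and every name and number in it (`LinearCost`, `max_batch`, `max_throughput`, the coefficients, the 16 GB budget) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LinearCost:
    """Hypothetical linear cost: cost(batch) = per_sample * batch + fixed."""
    per_sample: float
    fixed: float

    def at(self, batch: int) -> float:
        return self.per_sample * batch + self.fixed


def max_batch(memory: LinearCost, budget: float) -> int:
    """Largest batch size whose modeled peak memory fits in the budget."""
    if memory.fixed >= budget:
        return 0
    return int((budget - memory.fixed) // memory.per_sample)


def max_throughput(memory: LinearCost, latency: LinearCost, budget: float) -> float:
    """Samples per second at the largest batch size that fits in memory."""
    batch = max_batch(memory, budget)
    return batch / latency.at(batch) if batch > 0 else 0.0


BUDGET_GB = 16.0  # made-up device memory budget

# Made-up coefficients: the MOM halves per-sample memory (GB)
# but adds 30% to per-sample compute time (seconds).
mem_base = LinearCost(per_sample=0.5, fixed=4.0)
mem_mom = LinearCost(per_sample=0.25, fixed=4.0)

# Case A: small fixed per-iteration cost, so larger batches amortize little.
lat_base_a = LinearCost(per_sample=0.010, fixed=0.05)
lat_mom_a = LinearCost(per_sample=0.013, fixed=0.05)

# Case B: large fixed per-iteration cost, so larger batches amortize a lot.
lat_base_b = LinearCost(per_sample=0.010, fixed=0.5)
lat_mom_b = LinearCost(per_sample=0.013, fixed=0.5)

for name, lat_base, lat_mom in [("A", lat_base_a, lat_mom_a),
                                ("B", lat_base_b, lat_mom_b)]:
    base = max_throughput(mem_base, lat_base, BUDGET_GB)
    mom = max_throughput(mem_mom, lat_mom, BUDGET_GB)
    print(f"case {name}: baseline {base:.1f} samples/s, with MOM {mom:.1f} samples/s")
```

Under these made-up numbers the method doubles the largest batch that fits in both cases, yet throughput falls in case A and rises in case B. In this toy model the extra memory headroom pays off only when the larger batch amortizes enough fixed per-iteration cost to outweigh the added per-sample overhead.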