NeuroCache: Budget-Constrained Activation Offloading for Memory-Efficient LLM Training
Aayush Kumar
Abstract
NeuroCache is a budget-controlled activation offloading scheme for large language model training. A single tunable parameter k sets how many transformer layers keep their activations on the GPU; the activations of the remaining layers are offloaded to pinned CPU memory through PyTorch's saved_tensors_hooks. The approach reduces GPU memory use by ~15% with negligible throughput impact. Experiments on an RTX 2050 show the best tradeoff at k ≈ 5.
GPU Memory Reduction: ~15%
Optimal k: ≈ 5
Hardware: RTX 2050 / CUDA
Throughput Impact: Negligible
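
The budget mechanism described in the abstract can be sketched with PyTorch's saved_tensors_hooks context manager. This is a minimal illustration, not the author's implementation. The class name BudgetedStack, the helpers pack_to_cpu and unpack_from_cpu, and the choice to keep the first k layers on the device are all assumptions made for the example; the abstract does not say which k layers retain their activations. Pinned memory is requested only when CUDA is available, so the sketch also runs on a CPU-only machine.

```python
import torch
import torch.nn as nn
from torch.autograd.graph import saved_tensors_hooks


def pack_to_cpu(tensor):
    # Called when autograd saves a tensor for backward: copy it to host memory.
    # Pinned (page-locked) memory is only requested when CUDA is present.
    packed = torch.empty(
        tensor.size(),
        dtype=tensor.dtype,
        pin_memory=torch.cuda.is_available(),
    )
    packed.copy_(tensor)
    return tensor.device, packed


def unpack_from_cpu(packed):
    # Called during backward: move the saved tensor back to its original device.
    device, tensor = packed
    return tensor.to(device, non_blocking=torch.cuda.is_available())


class BudgetedStack(nn.Module):
    """Hypothetical wrapper: the first k layers keep saved tensors on-device,
    the remaining layers route them through the offloading hooks."""

    def __init__(self, layers, k):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.k = k  # budget: number of layers that are NOT offloaded

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if i < self.k:
                # Within budget: saved tensors stay where they are.
                x = layer(x)
            else:
                # Over budget: tensors saved for backward go to host memory.
                with saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
                    x = layer(x)
        return x
```

With k equal to the number of layers nothing is offloaded, and with k = 0 every layer is. Note that the hooks see every tensor saved for backward inside the context, which includes layer weights as well as activations. PyTorch also provides torch.autograd.graph.save_on_cpu(pin_memory=True), which packages a comparable pair of hooks.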
