
Memory-Efficient Batching for Time Series Transformer Training: A Systematic Evaluation

Abstract

Transformer-based models are increasingly employed for time series analysis. However, their training remains memory intensive, especially with high-dimensional data and extended look-back windows. While model-level memory optimizations are well studied, the batch formation process remains an underexplored source of training inefficiency. This paper introduces a memory-efficient batching framework based on view-based sliding windows that operate directly on GPU-resident tensors. The approach eliminates the redundant data materialization caused by tensor stacking and reduces data transfer volumes without modifying model architectures. We present two variants of our solution: (1) per-batch optimization for datasets exceeding GPU memory, and (2) dataset-wise optimization for in-memory workloads. We systematically evaluate the proposed batching framework using peak GPU memory consumption and epoch runtime as efficiency metrics across varying batch sizes, sequence lengths, feature dimensions, and model architectures. Results show consistent memory savings averaging 90% and runtime improvements of up to 33% across multiple transformer-based models (Informer, Autoformer, Transformer, and PatchTST) and a linear baseline (DLinear), without compromising model accuracy. We extensively validate the method on synthetic and standard real-world benchmarks, demonstrating accuracy preservation and practical scalability in distributed GPU environments. These results highlight the batch formation process as a critical component for improving training efficiency.
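
To make the core idea concrete, the following is a minimal sketch (not the paper's implementation) of view-based sliding windows over a GPU-resident tensor, assuming PyTorch; the tensor shapes, seq_len, and stride are illustrative. torch.Tensor.unfold returns views that share storage with the source tensor, whereas stacking per-window slices materializes a copy of every window.

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    data = torch.randn(10_000, 8, device=device)   # (time steps, features), kept on the GPU

    seq_len, stride = 96, 1

    # Baseline batching: stacking copies every look-back window into a new tensor.
    stacked = torch.stack([data[i:i + seq_len]
                           for i in range(0, data.size(0) - seq_len + 1, stride)])

    # View-based batching: unfold yields (num_windows, features, seq_len) views over the
    # same underlying storage; permute restores (num_windows, seq_len, features).
    windows = data.unfold(0, seq_len, stride).permute(0, 2, 1)

    # Identical values; only `stacked` allocates new memory for the windows.
    assert torch.equal(stacked, windows)

In this sketch, `windows` can be indexed per batch and fed to a model without ever materializing the full set of overlapping windows, which is the kind of redundancy the proposed batching framework targets.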