Abstract
In recent years, Transformer-based models have dominated the field of long-term time series forecasting. However, the quadratic complexity of attention mechanisms makes both training and inference computationally expensive. The SOFTS model has emerged as an efficient alternative, replacing attention mechanisms with the STAR module to achieve linear complexity while delivering performance comparable to or better than competing approaches. SOFTS builds on the iTransformer architecture, which marked a significant advance in long-term time series forecasting. Although neither iTransformer nor SOFTS incorporates positional embeddings, our analysis reveals a clear opportunity to improve forecasting accuracy by introducing them. However, the straightforward inclusion of positional embeddings leads to convergence and generalization issues. To address this, we propose a simple yet effective technique: during training, positional embeddings are randomly omitted in certain forward passes, which reduces instability and helps the model generalize better. We call this technique Learnable Stochastic Positional Embedding. Additionally, we incorporate multiple dropout layers to mitigate overfitting and improve accuracy. These modifications result in SOFTS++, a fast and accurate model that achieves the best performance on at least 10 out of 12 standard benchmark datasets. By maintaining linear complexity and requiring minimal computational resources, SOFTS++ stands out as a capable and resource-efficient method for multivariate long-term forecasting tasks.
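For illustration only, the sketch below shows one way the described stochastic omission of a learnable positional embedding could be realized in PyTorch; the class name, the `skip_prob` hyperparameter, and the tensor layout are our assumptions and not the authors' implementation.

```python
import torch
import torch.nn as nn


class StochasticPositionalEmbedding(nn.Module):
    """Learnable positional embedding that is randomly skipped during training.

    Minimal sketch of the idea described in the abstract; names and the
    skip probability are illustrative assumptions.
    """

    def __init__(self, num_positions: int, d_model: int, skip_prob: float = 0.5):
        super().__init__()
        # One learnable embedding vector per position (e.g., per series token).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_positions, d_model))
        nn.init.normal_(self.pos_embed, std=0.02)
        self.skip_prob = skip_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_positions, d_model)
        if self.training and torch.rand(()) < self.skip_prob:
            # Randomly omit the positional embedding in this forward pass.
            return x
        # Otherwise add the learned positional embedding as usual.
        return x + self.pos_embed
```

At evaluation time the embedding is always applied, so the stochastic omission acts only as a training-time regularizer, analogous in spirit to dropout.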
