TeleEval-OS: Performance evaluations of large language models for telecommunications operations scheduling

Abstract

The rapid advancement of large language models (LLMs) has enabled their application across complex professional domains. In the telecommunications industry, operations scheduling integrates network monitoring, ticket management, risk assessment, and workforce coordination, and it requires intelligent support because language understanding, domain knowledge, and decision-making are tightly coupled. However, the lack of standardized benchmarks limits LLM development in this field. To address this gap, we introduce TeleEval-OS, the first dedicated evaluation benchmark for telecommunications operations scheduling. TeleEval-OS covers the four essential stages of dispatch workflows: intelligent ticket creation, ticket resolution, ticket closure, and operational assessment. The benchmark includes 15 high-quality, manually annotated datasets with a total of 10.4K samples, spanning 13 representative real-world sub-tasks, such as similar-ticket recommendation, service intent classification, network fault ticket report generation, and risk indicator interpretation. To capture the spectrum of task complexity, we define a four-level evaluation hierarchy: basic natural language processing (NLP), domain-specific question answering (Q&A), structured report generation, and operational report analysis. We systematically evaluate 14 representative LLMs, including GPT-4o, DeepSeek-V3, and Qwen-2.5-72B-Instruct, under zero-shot and few-shot settings. The results show that DeepSeek-V3 achieves the best performance on basic NLP and structured report generation tasks, while GPT-4o performs best on operational report analysis. These findings highlight the complementary strengths of LLMs across different task types. They also underscore the practical value of TeleEval-OS as a unified framework for both future research and real-world deployment. Code is available at https://github.com/zjsllab/TeleEval-OS.
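
The difference between the zero-shot and few-shot settings can be illustrated with a minimal sketch for a service intent classification sub-task. The function name `build_prompt`, the label set, and the example tickets are hypothetical and are not taken from TeleEval-OS; the prompts actually used by the benchmark may differ.

```python
from typing import Sequence, Tuple

# Hypothetical label set for illustration only; the real TeleEval-OS
# labels are not given in the abstract.
LABELS = ["fault_report", "billing_inquiry", "service_change"]


def build_prompt(query: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a classification prompt.

    With no examples the prompt is zero-shot; with k labeled
    examples it becomes a k-shot (few-shot) prompt.
    """
    lines = ["Classify the ticket into one of: " + ", ".join(LABELS) + ".", ""]
    # Each demonstration is a (ticket text, gold label) pair.
    for text, label in examples:
        lines.append(f"Ticket: {text}")
        lines.append(f"Intent: {label}")
        lines.append("")
    # The query is appended last, with the label left for the model to fill in.
    lines.append(f"Ticket: {query}")
    lines.append("Intent:")
    return "\n".join(lines)


query = "Broadband has been down since this morning."
zero_shot = build_prompt(query)
few_shot = build_prompt(
    query,
    examples=[
        ("I was charged twice this month.", "billing_inquiry"),
        ("Please upgrade my plan to 500 Mbps.", "service_change"),
    ],
)
```

Both prompts end with the same unanswered query, so any difference in model output can be attributed to the in-context demonstrations alone.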

Keywords

Benchmark, large language model, telecommunications operation scheduling domain

