An Empirical Study of Issues in Large Language Model Training Systems
- Yanjie Gao,
- Ruiming Lu,
- Haoxiang Lin,
- Yueguo Chen
FSE 2025 | Published by ACM
The ACM International Conference on the Foundations of Software Engineering, Industry Track
Large language models (LLMs) have gained significant traction in recent years, driving advancements in various applications. The training and evaluation of these models depend heavily on specialized LLM training systems, which are deployed across numerous GPUs to partition the models and process large datasets. However, issues in LLM training systems can lead to program crashes or unexpected behavior, reducing development productivity and wasting valuable resources such as GPUs and storage.
This paper presents the first comprehensive empirical study of issues in LLM training systems. We conducted a manual analysis of 300 high-quality issue reports and their corresponding fix commits from the GitHub repositories of three prominent LLM training systems: Microsoft DeepSpeed, NVIDIA Megatron-LM, and Hugging Face Transformers. Our analysis identified common symptoms, root causes, typical fixes, and debugging and testing practices in LLM training systems. Our major findings include: (1) LLM training systems exhibit issues and trends that are uncommon in traditional deep learning, such as Concurrency Error and Tensor Management Error arising in parallel training, which are particularly difficult to diagnose and resolve. (2) The primary root causes of these issues are API Misuse (19.67%), Configuration Error (18.33%), and General Code Error (16.33%). Such issues often stem from the rapid evolution of the systems, the integration of complex external dependencies, and a configuration-driven development paradigm. (3) Current testing and debugging practices are often insufficient for identifying issues related to parallel training and large-scale numerical computation. Based on our findings, we propose several research topics and tooling improvements that can facilitate the future development of LLMs.
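To illustrate the Tensor Management Error category mentioned above, the sketch below (not taken from the paper; the function names, shapes, and rank handling are hypothetical) shows a device-placement bug typical of multi-GPU training with PyTorch, where a tensor left on the CPU is combined with one resident on the current GPU rank, along with its straightforward fix.

```python
import torch

def buggy_add(local_rank: int) -> torch.Tensor:
    """Combines tensors on mismatched devices, a common Tensor Management Error."""
    device = torch.device(f"cuda:{local_rank}")
    weights = torch.randn(4, 4, device=device)  # lives on this rank's GPU
    bias = torch.zeros(4)                       # bug: defaults to the CPU
    # RuntimeError: expected all tensors to be on the same device
    return weights + bias

def fixed_add(local_rank: int) -> torch.Tensor:
    """Allocates both tensors on the rank's GPU before combining them."""
    device = torch.device(f"cuda:{local_rank}")
    weights = torch.randn(4, 4, device=device)
    bias = torch.zeros(4, device=device)        # fix: place on the same device
    return weights + bias
```

Bugs of this kind are easy to reproduce on a single GPU, but in parallel training they may surface only on certain ranks or only after a particular checkpoint-loading or model-partitioning step, which is part of why the study finds them hard to diagnose.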