dl²: Detecting Communication Deadlocks in Deep Learning Jobs

FSE 2025

Published by ACM

The ACM International Conference on the Foundations of Software Engineering, Industry Track

In recent years, deep learning has seen widespread adoption across various domains, giving rise to large-scale models such as large language models. Training these models, particularly in distributed environments, presents substantial computational and communication challenges. A critical issue is the communication deadlock, a state in which processes stall indefinitely while waiting for network messages from one another, leading to wasted resources and reduced productivity. Current approaches to deadlock handling are either unsuitable for deep learning, due to its hybrid programming paradigm, or limit optimization opportunities. This paper presents dl², a novel dynamic analysis tool designed to detect communication deadlocks in deep learning jobs. dl² models the runtime trace of a job as an execution graph, detects unmatched communications, and constructs a wait-for graph to identify deadlock cycles. dl² can also handle nondeterministic communication behaviors, providing replay and diagnostic support for root cause analysis. We evaluate dl² on PyTorch with a combination of synthetic test cases and real-world deep learning workloads. The experimental results show that dl² detects all communication deadlocks with 100% precision and recall, demonstrating its effectiveness.
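To make the wait-for-graph idea concrete, the minimal sketch below builds a wait-for graph from per-rank pending communication records and searches it for a deadlock cycle. The trace record format, the helper names, and the two-rank send/recv example are illustrative assumptions for this sketch, not dl²'s actual implementation.

```python
# Sketch: build a wait-for graph from per-rank pending communication records
# and report a deadlock cycle if one exists. The (op, peer) record format and
# function names are assumptions made for illustration.

from collections import defaultdict


def build_wait_for_graph(pending):
    """pending: dict mapping rank -> list of (op, peer) ops the rank is blocked on."""
    graph = defaultdict(set)
    for rank, ops in pending.items():
        for op, peer in ops:
            # A rank blocked on a point-to-point op waits for its peer.
            graph[rank].add(peer)
    return graph


def find_cycle(graph):
    """Return one wait-for cycle (list of ranks) if any exists, else None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    stack = []

    def dfs(u):
        color[u] = GRAY
        stack.append(u)
        for v in graph.get(u, ()):
            if color[v] == GRAY:  # back edge: a cycle of mutually waiting ranks
                return stack[stack.index(v):] + [v]
            if color[v] == WHITE:
                cycle = dfs(v)
                if cycle:
                    return cycle
        stack.pop()
        color[u] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # Rank 0 blocks receiving from rank 1 while rank 1 blocks receiving from
    # rank 0: a classic communication deadlock caused by mismatched send/recv order.
    pending = {
        0: [("recv", 1)],
        1: [("recv", 0)],
    }
    cycle = find_cycle(build_wait_for_graph(pending))
    print("Deadlock cycle:", cycle)  # e.g. [0, 1, 0]
```

In this toy example, each blocked rank contributes edges to the ranks it is waiting on, and a cycle in the resulting graph corresponds to a set of processes that can never make progress, which is the condition the paper's analysis reports as a communication deadlock.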