dl²: Detecting Communication Deadlocks in Deep Learning Jobs

FSE 2025

Published by ACM

The ACM International Conference on the Foundations of Software Engineering, Industry Track

In recent years, deep learning has seen widespread adoption across various domains, giving rise to large-scale models such as large language models. Training these models, particularly in distributed environments, presents substantial computational and communication challenges. A critical issue is the communication deadlock, a state in which processes stall indefinitely while waiting for network messages from one another, leading to wasted resources and reduced productivity. Current approaches to deadlock handling are either unsuitable for deep learning, due to its hybrid programming paradigm, or limit optimization opportunities. This paper presents dl², a novel dynamic analysis tool designed to detect communication deadlocks in deep learning jobs. dl² models the runtime trace of a job as an execution graph, detects unmatched communications, and constructs a wait-for graph to identify deadlock cycles. dl² can also handle nondeterministic communication behaviors, providing replay and diagnostic support for root cause analysis. We evaluate dl² on PyTorch with a combination of synthetic test cases and real-world deep learning workloads. The experimental results show that dl² detects all communication deadlocks with 100% precision and recall, demonstrating its effectiveness.
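To make the wait-for-graph idea concrete, the minimal sketch below builds a wait-for graph from per-rank pending communication records and searches it for a deadlock cycle. The trace record format, the helper names, and the two-rank send/recv example are illustrative assumptions for this sketch, not dl²'s actual implementation.

```python
# Sketch: build a wait-for graph from per-rank pending communication records
# and report a deadlock cycle if one exists. The (op, peer) record format and
# function names are assumptions made for illustration.

from collections import defaultdict


def build_wait_for_graph(pending):
    """pending: dict mapping rank -> list of (op, peer) ops the rank is blocked on."""
    graph = defaultdict(set)
    for rank, ops in pending.items():
        for op, peer in ops:
            # A rank blocked on a point-to-point op waits for its peer.
            graph[rank].add(peer)
    return graph


def find_cycle(graph):
    """Return one wait-for cycle (list of ranks) if any exists, else None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    stack = []

    def dfs(u):
        color[u] = GRAY
        stack.append(u)
        for v in graph.get(u, ()):
            if color[v] == GRAY:  # back edge: a cycle of mutually waiting ranks
                return stack[stack.index(v):] + [v]
            if color[v] == WHITE:
                cycle = dfs(v)
                if cycle:
                    return cycle
        stack.pop()
        color[u] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # Rank 0 blocks receiving from rank 1 while rank 1 blocks receiving from
    # rank 0: a classic communication deadlock caused by mismatched send/recv order.
    pending = {
        0: [("recv", 1)],
        1: [("recv", 0)],
    }
    cycle = find_cycle(build_wait_for_graph(pending))
    print("Deadlock cycle:", cycle)  # e.g. [0, 1, 0]
```

In this toy example, each blocked rank contributes edges to the ranks it is waiting on, and a cycle in the resulting graph corresponds to a set of processes that can never make progress, which is the condition the paper's analysis reports as a communication deadlock.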