DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale
- Linghao Zhang ,
- Junhao Wang ,
- Shilin He ,
- Chaoyun Zhang ,
- Yu Kang ,
- Bowen Li ,
- Jiaheng Wen ,
- Chengxing Xie ,
- Maoquan Wang ,
- Yufan Huang ,
- Elsie Nallipogu ,
- Qingwei Lin ,
- Yingnong Dang ,
- Saravan Rajmohan ,
- Dongmei Zhang ,
- Qi Zhang
ArXiv, abs/2501.13699
Large Language Models have advanced automated software development; however, correctly inferring dependencies, namely identifying the internal components and external packages required for a repository to run successfully, remains a challenge. Existing studies highlight that dependency-related issues cause over 40% of the runtime errors observed in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs’ capability in dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 42.9% execution pass rate, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
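To make the task concrete, the sketch below illustrates one way an execution-based check of an inferred dependency list could work for a Python repository: install the packages a model proposed into a fresh virtual environment and run the repository's test suite. This is a hypothetical illustration, not the DI-BENCH harness; the function name `execution_pass`, the repo path, and the example package list are assumptions for demonstration.

```python
# Hypothetical sketch (not the DI-BENCH harness): an execution-based check for an
# inferred dependency list. Given a Python repo whose declared dependencies were
# masked, install the candidate packages into a fresh virtual environment and run
# the repo's test suite; the repo "passes" only if the tests succeed.
import subprocess
import sys
import tempfile
from pathlib import Path


def execution_pass(repo_dir: str, inferred_deps: list[str]) -> bool:
    """Return True if the repo's tests pass after installing inferred_deps."""
    with tempfile.TemporaryDirectory() as tmp:
        venv_dir = Path(tmp) / "venv"
        # Create an isolated environment so missing or extra packages are detectable.
        subprocess.run([sys.executable, "-m", "venv", str(venv_dir)], check=True)
        py = venv_dir / ("Scripts" if sys.platform == "win32" else "bin") / "python"

        # Install only what the model inferred (plus pytest to drive the tests).
        install = subprocess.run(
            [str(py), "-m", "pip", "install", "pytest", *inferred_deps],
            capture_output=True,
        )
        if install.returncode != 0:
            return False  # e.g., a hallucinated package name fails to resolve

        # Run the repository's own test suite inside the fresh environment.
        tests = subprocess.run(
            [str(py), "-m", "pytest", "-q"], cwd=repo_dir, capture_output=True
        )
        return tests.returncode == 0


if __name__ == "__main__":
    # Example: packages an LLM might infer for a hypothetical repository.
    print(execution_pass("path/to/repo", ["requests", "numpy"]))
```

A repository counts toward the execution pass rate only when the test run succeeds, which is why hallucinated or missing packages directly lower the metric.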