MINA: Fine-Grained In-network Aggregation Resource Scheduling for Machine Learning Service

IEEE INFOCOM 2025 - IEEE Conference on Computer Communications, pp. 1-10

In-network aggregation (INA) offloads gradient aggregation onto switches, reducing both aggregation latency and traffic volume. However, INA resources are limited by the high cost of on-chip memory, which poses distinct challenges for scheduling these resources effectively in multi-job MLaaS scenarios. In this paper, we explore the scheduling of INA resources along the spatial and temporal dimensions, focusing on its impact on average job completion time (JCT) and on the efficiency of INA resources. We propose MINA, a co-design of algorithm and system that assigns INA resources to each job and schedules these resources among multiple jobs. Our experiments show that MINA attains an INA efficiency score of 0.9998, implying that almost all jobs run nearly as efficiently as they would with exclusive INA acceleration.
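
To make the resource constraint concrete, the following is a minimal sketch of generic in-network aggregation with a limited number of on-switch aggregator slots. It is not MINA's scheduling algorithm; the names `ToySwitch`, `Slot`, `run`, and the bypass-on-full policy are hypothetical and chosen only for illustration. The sketch assumes each worker sends one gradient chunk per chunk ID and that a chunk which finds no free slot is simply forwarded unaggregated.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Slot:
    """One on-switch aggregator: a running sum plus a contribution counter."""
    total: List[float]
    seen: int = 0


@dataclass
class ToySwitch:
    """Hypothetical switch model; slot count stands in for scarce on-chip memory."""
    num_workers: int
    num_slots: int
    slots: Dict[int, Slot] = field(default_factory=dict)
    bypassed: Set[int] = field(default_factory=set)
    uplink_packets: int = 0  # packets sent on towards the parameter server

    def receive(self, chunk_id: int, grad: List[float]) -> Optional[List[float]]:
        """Handle one gradient chunk; return the full sum once every worker has contributed."""
        if chunk_id in self.bypassed:
            self.uplink_packets += 1  # forwarded as-is, no aggregation
            return None
        slot = self.slots.get(chunk_id)
        if slot is None:
            if len(self.slots) >= self.num_slots:
                # Assumed policy: no free aggregator, so this chunk skips INA entirely.
                self.bypassed.add(chunk_id)
                self.uplink_packets += 1
                return None
            slot = self.slots[chunk_id] = Slot(total=[0.0] * len(grad))
        slot.total = [a + b for a, b in zip(slot.total, grad)]
        slot.seen += 1
        if slot.seen == self.num_workers:
            self.uplink_packets += 1  # one aggregated packet replaces num_workers packets
            return self.slots.pop(chunk_id).total
        return None


def run(num_workers: int, num_chunks: int, num_slots: int) -> Tuple[int, Dict[int, List[float]]]:
    """Each worker sends all of its chunks in turn, so every chunk is in flight at once."""
    switch = ToySwitch(num_workers, num_slots)
    aggregated: Dict[int, List[float]] = {}
    for worker in range(num_workers):
        for chunk_id in range(num_chunks):
            result = switch.receive(chunk_id, [float(worker + 1)] * 2)
            if result is not None:
                aggregated[chunk_id] = result
    return switch.uplink_packets, aggregated


if __name__ == "__main__":
    print(run(3, 4, 4))  # enough slots for every in-flight chunk
    print(run(3, 4, 2))  # only half the chunks can be aggregated on the switch
```

With three workers and four chunks, plain forwarding would send 12 packets upstream. In this toy model, four slots reduce that to 4 packets, while two slots yield 8, which shows why the assignment of scarce aggregator memory across jobs affects how much acceleration each job actually receives.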