Minimum latency training strategies for streaming sequence-to-sequence ASR

ICASSP

Recently, a few novel streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity. However, in these models, the decisions to generate tokens are delayed relative to the actual acoustic boundaries because their unidirectional encoders lack future context. This leads to an inevitable latency during inference. To alleviate this issue and reduce latency, we propose several training strategies that leverage external hard alignments extracted from a hybrid model. We investigate utilizing the alignments in both the encoder and the decoder. On the encoder side, we study (1) multi-task learning and (2) pre-training with a framewise classification task.
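As a rough illustration of the encoder-side idea in (2), the sketch below pre-trains a unidirectional encoder with a framewise cross-entropy objective against per-frame hard alignment labels from a hybrid model. The encoder architecture, label inventory, and tensor shapes are illustrative assumptions, not the paper's implementation; for the multi-task variant in (1), the same loss would instead be added to the S2S objective.

```python
# Hypothetical sketch: pre-training a streaming encoder with a
# framewise classification task. One label per frame is assumed to
# come from a hybrid model's hard alignment; the architecture and
# label set here are illustrative placeholders.
import torch
import torch.nn as nn

NUM_FRAME_LABELS = 9000   # e.g., the hybrid model's senone inventory (assumed)
FEAT_DIM, HIDDEN = 80, 512

class StreamingEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Unidirectional LSTM: no future context, hence the delayed decisions.
        self.lstm = nn.LSTM(FEAT_DIM, HIDDEN, num_layers=2, batch_first=True)

    def forward(self, feats):                      # feats: (B, T, FEAT_DIM)
        out, _ = self.lstm(feats)
        return out                                 # (B, T, HIDDEN)

encoder = StreamingEncoder()
frame_classifier = nn.Linear(HIDDEN, NUM_FRAME_LABELS)
criterion = nn.CrossEntropyLoss()

feats = torch.randn(4, 200, FEAT_DIM)                       # toy batch
frame_labels = torch.randint(NUM_FRAME_LABELS, (4, 200))    # from hybrid alignment

logits = frame_classifier(encoder(feats))          # (B, T, NUM_FRAME_LABELS)
loss = criterion(logits.reshape(-1, NUM_FRAME_LABELS), frame_labels.reshape(-1))
loss.backward()   # pre-training step; the classifier head is discarded afterwards
```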
On the decoder side, we (3) remove inappropriate alignment paths that exceed an acceptable latency during alignment marginalization, and (4) directly minimize a differentiable expected latency loss.
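A minimal sketch of (4) under stated assumptions: it assumes the streaming decoder exposes per-token emission probabilities alpha[i, j] over encoder frames (as in monotonic attention), from which a differentiable expected emission frame is computed and penalized against the hybrid model's reference boundaries. The exact loss form and variable names are assumptions for illustration, not the paper's formulation.

```python
# Hypothetical sketch of a differentiable expected-latency loss.
# alpha[i, j]: probability that output token i is emitted at encoder
# frame j (rows sum to 1), assumed available from the model.
import torch

def expected_latency_loss(alpha: torch.Tensor,
                          ref_boundaries: torch.Tensor) -> torch.Tensor:
    """alpha: (U, T) emission probabilities per token over frames.
    ref_boundaries: (U,) frame indices of token boundaries taken
    from the hybrid model's hard alignment."""
    T = alpha.size(1)
    frames = torch.arange(T, dtype=alpha.dtype)
    # Differentiable expected emission frame for each token.
    expected_boundary = (alpha * frames).sum(dim=1)            # (U,)
    # Penalize emitting later than the acoustic boundary; clamp at
    # zero so early emission is not rewarded indefinitely.
    delay = (expected_boundary - ref_boundaries.to(alpha.dtype)).clamp(min=0)
    return delay.mean()

# Toy usage: 3 tokens over 10 frames.
logits = torch.randn(3, 10, requires_grad=True)
alpha = torch.softmax(logits, dim=1)
ref = torch.tensor([2.0, 5.0, 8.0])
loss = expected_latency_loss(alpha, ref)
loss.backward()   # would be combined with the main S2S training loss
```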
Experiments on the Cortana voice search task demonstrate that the proposed methods significantly reduce latency and, in certain cases on the decoder side, even improve recognition accuracy. We also present analysis to understand the behavior of streaming S2S models.