Challenges and Solutions to Deep Learning Network Acoustic Modeling in Speech Recognition Products at Microsoft
in New Era for Robust Speech Recognition: Exploiting Deep Learning
Published by Springer | 2017
Deep learning (DL) network acoustic modeling has been widely deployed in real-world speech recognition products and services that benefit millions of users. In addition to the general modeling research that academia works on, industry faces special constraints and challenges, e.g., the runtime constraints of system deployment, robustness to variations such as acoustic environment and accent, and the lack of manual transcriptions. For large-scale ASR applications, this chapter briefly describes selected developments and investigations at Microsoft to make deep learning networks more effective in production environments, including:
reducing run-time cost with SVD (singular value decomposition)-based training,
improving the accuracy of small-size DNN with teacher-student training,
using a small number of parameters for speaker adaptation of acoustic models,
improving the robustness to acoustic environment with variable component DNN modeling,
improving the robustness to accent/dialect with model adaptation and accent-dependent modeling,
introducing time and frequency invariance with time-frequency long short-term memory recurrent neural networks,
exploring the generalization capability to unseen data with maximum margin sequence training,
using unsupervised data to improve speech recognition accuracy,
increasing language capability by reusing speech training material across languages.
The outcome has enabled the deployment of DL acoustic models across Microsoft's server and client product lines, including Windows 10 desktop/laptop/phone, Xbox, and Skype speech-to-speech translation.
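As a rough illustration of the first item in the list above, the sketch below shows how a singular value decomposition can replace one large weight matrix with two smaller factors, which reduces the number of multiplications at run time. It is a minimal NumPy sketch under stated assumptions, not the chapter's implementation: the layer sizes, the retained rank, the synthetic weight matrix, and the helper name `svd_factorize` are all hypothetical. In SVD-based training the factorized network is typically fine-tuned afterwards to recover accuracy; that step is not shown here.

```python
import numpy as np


def svd_factorize(weight, rank):
    """Approximate an (m x n) weight matrix by factors of shape (m x rank) and (rank x n)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # fold the kept singular values into the left factor
    b = vt[:rank, :]
    return a, b


# Hypothetical layer sizes and retained rank, chosen only for illustration.
m, n, k = 512, 256, 64

# Synthetic, approximately low-rank weight matrix (an assumption, not real model data).
rng = np.random.default_rng(0)
weight = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
weight += 0.01 * rng.standard_normal((m, n))

a, b = svd_factorize(weight, k)

params_before = m * n        # one dense layer
params_after = k * (m + n)   # two thinner layers in sequence
rel_error = np.linalg.norm(weight - a @ b) / np.linalg.norm(weight)

print(f"parameters: {params_before} -> {params_after}")
print(f"relative reconstruction error: {rel_error:.4f}")
```

The saving comes from the parameter count dropping from m*n to k*(m+n), which is only a gain when k is well below m*n/(m+n). How much rank a real acoustic model layer can give up depends on how quickly its singular values decay.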