Determination of Room Impulse Response for synthetic data acquisition and ASR testing

The Journal of the Acoustical Society of America, Vol. 136(4), p. 2265

Automatic Speech Recognition (ASR) performs best when the input speech signal closely matches the signals used for training. Training, however, may require thousands of hours of speech, and it is impractical to acquire that much data directly in a realistic scenario. Instead, we estimate Room Impulse Responses (RIRs) and convolve speech and noise signals with the estimated RIRs. This produces realistic signals, which can then be processed by the audio pipeline and used for ASR training. In our research, a limited corpus of speech data and noise sources is recorded, and the RIR at 27 positions is determined using a variety of methods (chirp, MLS, impulse, and noise). The “clean speech” convolved with each estimated RIR is then compared to the actual measurements.
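
As an illustration of the workflow described above, the following minimal Python sketch estimates an RIR from a noise excitation, convolves a clean signal with the estimate, and compares the result with a reference "measurement". It is not the authors' implementation: the regularized frequency-domain deconvolution, the function names (estimate_rir, simulate_recording), the regularization constant eps, and all signals (which are synthetic random data standing in for real recordings) are assumptions made for the example.

```python
import numpy as np

def estimate_rir(excitation, recording, rir_len, eps=1e-8):
    """Estimate an RIR by regularized frequency-domain deconvolution.

    Hypothetical helper: divides the spectrum of the recording by the
    spectrum of the known excitation, with a small eps to avoid
    division by near-zero bins.
    """
    n = len(excitation) + len(recording) - 1
    nfft = 1 << (n - 1).bit_length()          # next power of two >= n
    X = np.fft.rfft(excitation, nfft)
    Y = np.fft.rfft(recording, nfft)
    H = Y * np.conj(X) / (np.abs(X) ** 2 + eps)
    return np.fft.irfft(H, nfft)[:rir_len]

def simulate_recording(clean, rir):
    """Convolve a clean signal with an RIR to simulate the room."""
    return np.convolve(clean, rir)

# Synthetic stand-ins for real data (assumption: no real recordings here).
rng = np.random.default_rng(0)
true_rir = rng.standard_normal(256) * np.exp(-np.arange(256) / 40.0)

# "Measurement" step: play a noise excitation through the room.
excitation = rng.standard_normal(8192)
recording = np.convolve(excitation, true_rir)
est_rir = estimate_rir(excitation, recording, rir_len=256)

# Compare the simulated signal with the "actual measurement".
clean = rng.standard_normal(4000)
measured = np.convolve(clean, true_rir)
simulated = simulate_recording(clean, est_rir)
rel_err = np.linalg.norm(simulated - measured) / np.linalg.norm(measured)
print(f"relative error: {rel_err:.2e}")
```

In this noise-free toy setting the relative error is very small; with real recordings, measurement noise and loudspeaker or microphone responses would be expected to limit how closely the simulated and measured signals agree.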