RNNT
RNNT, short for Recurrent Neural Network Transducer, is an end-to-end architecture for automatic speech recognition designed to produce transcriptions in a streaming fashion. It maps sequences of acoustic features to text tokens and generates output incrementally as audio is received, enabling online transcription without waiting for the full utterance.
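The incremental behaviour can be illustrated with a minimal sketch of frame-by-frame greedy decoding. Everything named here is a hypothetical stand-in chosen for illustration, not part of any library: `greedy_stream_decode`, `toy_joint`, the four-symbol vocabulary, and the convention that index 0 is the blank symbol. A real system would replace `toy_joint` with trained networks that score the next symbol from the current acoustic frame and the previously emitted tokens.

```python
from typing import Callable, Iterable, Iterator, List

BLANK = 0        # assumed index of the blank ("emit nothing") symbol
VOCAB_SIZE = 4   # toy vocabulary: blank plus three tokens

# A joint function scores every vocabulary entry given the current
# encoded frame and the history of emitted (non-blank) tokens.
JointFn = Callable[[int, List[int]], List[float]]


def toy_joint(frame: int, history: List[int]) -> List[float]:
    """Hand-written stand-in for trained networks (illustration only).

    Each toy "frame" is simply the token id it sounds like, or BLANK
    for silence. Blank wins when the frame is silent or when its token
    was just emitted; otherwise the frame's token wins.
    """
    scores = [0.0] * VOCAB_SIZE
    last = history[-1] if history else BLANK
    if frame == BLANK or frame == last:
        scores[BLANK] = 1.0
    else:
        scores[frame] = 1.0
    return scores


def greedy_stream_decode(
    frames: Iterable[int],
    joint: JointFn,
    max_symbols_per_frame: int = 3,
) -> Iterator[int]:
    """Yield tokens as frames arrive, without buffering the whole input."""
    history: List[int] = []
    for frame in frames:
        # Several tokens may be emitted on one frame; a blank (or the
        # per-frame cap) moves decoding on to the next frame.
        for _ in range(max_symbols_per_frame):
            scores = joint(frame, history)
            token = max(range(len(scores)), key=scores.__getitem__)
            if token == BLANK:
                break
            history.append(token)
            yield token


if __name__ == "__main__":
    audio = iter([0, 1, 1, 0, 2, 3, 3])  # frames arriving one at a time
    for tok in greedy_stream_decode(audio, toy_joint):
        print("emitted token id:", tok)
```

Because the decoder is a generator over an iterable of frames, each token becomes available as soon as the frame that triggers it has been consumed, which is the property that makes online transcription possible. Greedy search is used here only for brevity; beam search is a common alternative.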
The architecture comprises three main components. The transcription (encoder) network processes the input acoustic features to
During inference, RNNT can operate in a streaming mode, emitting tokens as sufficient evidence accumulates. Decoding
History and context: the RNN Transducer concept originated in early work on neural sequence transduction, with