Speech-to-Text

Speech-to-text (STT) is the technology that converts spoken language into written text. It is a core component of automatic speech recognition (ASR) systems and is used in transcription, voice assistants, captioning, accessibility tools, and real-time communication.

Most STT systems follow a pipeline that includes feature extraction from audio, an acoustic model that maps

Performance is commonly measured by word error rate (WER): the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. WER
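
As a rough illustration of the WER metric, the sketch below computes it in the usual way, as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for this example. Real evaluations typically normalize case and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors while the denominator is the reference length, the value can exceed 1.0 when the hypothesis contains many extra words.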

Historically, STT emerged from work with hidden Markov models and Gaussian mixtures in the 1980s through 2000s,

a

a

a

sequence-to-sequence

considerations,