If you’re looking for an alternative to the Whisper stack, one option worth considering is Moshi. I originally found it mentioned on X:
Summary:
- Moshi is a speech-text foundation model and full-duplex spoken dialogue framework.
- It relies on Mimi, a state-of-the-art streaming neural audio codec that compresses 24 kHz audio down to a bandwidth of 1.1 kbps with low latency.
Key Features:
- Processes two audio streams: one from the user and one from Moshi itself.
- Predicts text tokens corresponding to its own speech (inner monologue) for improved generation quality.
- Combines a large Temporal Transformer, which models dependencies across time steps, with a small Depth Transformer, which models dependencies between the codebooks within a step (see the sketch after this list).
- Achieves a theoretical latency of 160 ms, with a practical overall latency as low as 200 ms.
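
The Depth/Temporal split is easier to see in code. Below is a minimal, toy-scale PyTorch sketch of that two-level layout: a Temporal Transformer attends across the 12.5 Hz frames while a smaller Depth Transformer handles the codebooks within each frame. All names, layer counts, and dimensions here are my own illustrative choices rather than Moshi's actual architecture, and the text stream and conditioning details are omitted.

```python
# Toy illustration of the two-level Temporal/Depth layout described above.
# Sizes are arbitrary; the real model is far larger and also interleaves a
# text-token stream, which is left out here for brevity.
import torch
import torch.nn as nn

class ToyRQTransformer(nn.Module):
    def __init__(self, n_codebooks=8, codebook_size=2048, d_model=256):
        super().__init__()
        self.codebook_size = codebook_size
        self.embed = nn.Embedding(n_codebooks * codebook_size, d_model)
        # Temporal Transformer: causal attention across frames (time steps).
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        # Depth Transformer: causal attention across codebooks within one frame.
        self.depth = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=1)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, codes):
        # codes: (batch, frames, n_codebooks) integer ids produced by the codec.
        b, t, q = codes.shape
        offsets = torch.arange(q, device=codes.device) * self.codebook_size
        x = self.embed(codes + offsets)                     # (b, t, q, d)
        frame_repr = x.sum(dim=2)                           # one vector per 80 ms frame
        t_mask = nn.Transformer.generate_square_subsequent_mask(t).to(codes.device)
        ctx = self.temporal(frame_repr, mask=t_mask)        # (b, t, d) frame context
        # The Depth Transformer predicts each codebook conditioned on the frame
        # context plus the earlier codebooks of the same frame.
        q_mask = nn.Transformer.generate_square_subsequent_mask(q).to(codes.device)
        depth_in = (x + ctx.unsqueeze(2)).reshape(b * t, q, -1)
        depth_out = self.depth(depth_in, mask=q_mask)
        return self.head(depth_out).reshape(b, t, q, -1)    # logits per codebook
```

At 12.5 Hz each frame covers 80 ms of audio, which is presumably where the 160 ms theoretical latency figure above comes from (two 80 ms steps).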
Mimi Codec:
- Builds on previous neural audio codecs such as SoundStream and EnCodec.
- Adds a Transformer in both the encoder and decoder.
- Adapts strides to reach an overall frame rate of 12.5 Hz, much closer to the average frame rate of text tokens (~3-4 Hz); the arithmetic is spelled out after this list.
- Uses a distillation loss so that a single model captures both semantic and acoustic information.
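
To make those numbers concrete, here is the back-of-the-envelope arithmetic connecting the 24 kHz input, the 12.5 Hz frame rate, and the 1.1 kbps bandwidth mentioned earlier. The codebook count and size are my assumptions about Mimi's residual vector quantizer rather than figures stated in this summary, but they do reproduce the reported bitrate.

```python
# Frame-rate and bitrate arithmetic for a Mimi-like codec. The number of
# codebooks and their size are assumptions, not values from the post.
import math

sample_rate = 24_000                        # Hz, input audio
frame_rate = 12.5                           # Hz, latent frame rate
print(sample_rate / frame_rate)             # 1920.0 samples per frame ->
                                            # the encoder strides must multiply to 1920
print(1000 / frame_rate)                    # 80.0 ms of audio per frame

n_codebooks = 8                             # assumed RVQ depth
codebook_size = 2048                        # assumed entries per codebook
bits_per_frame = n_codebooks * math.log2(codebook_size)   # 88 bits
print(frame_rate * bits_per_frame)          # 1100.0 bit/s, i.e. the 1.1 kbps figure
```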
Training and Evaluation:
- Uses only an adversarial training loss together with feature matching; there is no explicit reconstruction loss (a sketch of this objective follows the list).
- Shows strong improvements in subjective quality despite its low bitrate.
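
For context on what that objective usually looks like, here is a hedged sketch of a generator-side adversarial loss plus feature matching, in the style of the SoundStream/EnCodec family that Mimi builds on. This is a generic formulation of the technique, not Moshi's or Mimi's actual training code, and the hinge form of the loss is my assumption.

```python
# Generic sketch of "adversarial loss + feature matching" as used in the
# SoundStream/EnCodec family of codecs; hinge formulation assumed. Not the
# actual Moshi/Mimi training code.
import torch
import torch.nn.functional as F

def generator_losses(disc_logits_fake, feats_real, feats_fake):
    """disc_logits_fake: list of discriminator outputs on reconstructed audio.
    feats_real / feats_fake: per-discriminator lists of intermediate feature
    maps on the real vs. reconstructed audio."""
    # Hinge-style adversarial term, generator side: push the discriminator's
    # score on the reconstruction above the margin.
    adv = sum(torch.relu(1.0 - d).mean() for d in disc_logits_fake)
    # Feature matching: L1 distance between the discriminator's intermediate
    # activations on real vs. reconstructed audio, summed over all layers.
    fm = sum(
        F.l1_loss(f, r.detach())
        for fr, ff in zip(feats_real, feats_fake)
        for r, f in zip(fr, ff)
    )
    return adv, fm
```

What the summary highlights is the absence of any reconstruction term (spectrogram or waveform) in that objective, which makes the reported gains in subjective quality at 1.1 kbps notable.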