If you’re looking for an alternative to the Whisper stack, one option worth considering is Moshi. I originally found it mentioned on X:
Summary:
- Moshi is a speech-text foundation model and full-duplex spoken dialogue framework.
- It relies on Mimi, a state-of-the-art streaming neural audio codec that compresses 24 kHz audio down to a bandwidth of 1.1 kbps with low latency.
Key Features:
- Processes two audio streams: one from the user and one from Moshi itself.
- Predicts text tokens corresponding to its own speech (inner monologue) for improved generation quality.
- Combines a large Temporal Transformer, which models dependencies across time steps, with a small Depth Transformer, which models dependencies between the codebooks within a step (see the sketch after this list).
- Achieves a theoretical latency of 160 ms, with a practical overall latency as low as 200 ms.
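
The Depth/Temporal split is easier to see in code. Below is a minimal, toy-scale PyTorch sketch of that two-level layout: a Temporal Transformer attends across the 12.5 Hz frames while a smaller Depth Transformer handles the codebooks within each frame. All names, layer counts, and dimensions here are my own illustrative choices rather than Moshi's actual architecture, and the text stream and conditioning details are omitted.

```python
# Toy illustration of the two-level Temporal/Depth layout described above.
# Sizes are arbitrary; the real model is far larger and also interleaves a
# text-token stream, which is left out here for brevity.
import torch
import torch.nn as nn

class ToyRQTransformer(nn.Module):
    def __init__(self, n_codebooks=8, codebook_size=2048, d_model=256):
        super().__init__()
        self.codebook_size = codebook_size
        self.embed = nn.Embedding(n_codebooks * codebook_size, d_model)
        # Temporal Transformer: causal attention across frames (time steps).
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        # Depth Transformer: causal attention across codebooks within one frame.
        self.depth = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=1)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, codes):
        # codes: (batch, frames, n_codebooks) integer ids produced by the codec.
        b, t, q = codes.shape
        offsets = torch.arange(q, device=codes.device) * self.codebook_size
        x = self.embed(codes + offsets)                     # (b, t, q, d)
        frame_repr = x.sum(dim=2)                           # one vector per 80 ms frame
        t_mask = nn.Transformer.generate_square_subsequent_mask(t).to(codes.device)
        ctx = self.temporal(frame_repr, mask=t_mask)        # (b, t, d) frame context
        # The Depth Transformer predicts each codebook conditioned on the frame
        # context plus the earlier codebooks of the same frame.
        q_mask = nn.Transformer.generate_square_subsequent_mask(q).to(codes.device)
        depth_in = (x + ctx.unsqueeze(2)).reshape(b * t, q, -1)
        depth_out = self.depth(depth_in, mask=q_mask)
        return self.head(depth_out).reshape(b, t, q, -1)    # logits per codebook
```

At 12.5 Hz each frame covers 80 ms of audio, which is presumably where the 160 ms theoretical latency figure above comes from (two 80 ms steps).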
Mimi Codec:
- Builds on previous neural audio codecs such as SoundStream and EnCodec.
- Adds a Transformer in both the encoder and decoder.
- Adapts strides to reach an overall frame rate of 12.5 Hz, much closer to the average frame rate of text tokens (~3-4 Hz); the arithmetic is spelled out after this list.
- Uses a distillation loss so that a single model captures both semantic and acoustic information.
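
To make those numbers concrete, here is the back-of-the-envelope arithmetic connecting the 24 kHz input, the 12.5 Hz frame rate, and the 1.1 kbps bandwidth mentioned earlier. The codebook count and size are my assumptions about Mimi's residual vector quantizer rather than figures stated in this summary, but they do reproduce the reported bitrate.

```python
# Frame-rate and bitrate arithmetic for a Mimi-like codec. The number of
# codebooks and their size are assumptions, not values from the post.
import math

sample_rate = 24_000                        # Hz, input audio
frame_rate = 12.5                           # Hz, latent frame rate
print(sample_rate / frame_rate)             # 1920.0 samples per frame ->
                                            # the encoder strides must multiply to 1920
print(1000 / frame_rate)                    # 80.0 ms of audio per frame

n_codebooks = 8                             # assumed RVQ depth
codebook_size = 2048                        # assumed entries per codebook
bits_per_frame = n_codebooks * math.log2(codebook_size)   # 88 bits
print(frame_rate * bits_per_frame)          # 1100.0 bit/s, i.e. the 1.1 kbps figure
```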
Training and Evaluation:
- Uses only an adversarial training loss together with feature matching; there is no explicit reconstruction loss (a sketch of this objective follows the list).
- Shows strong improvements in subjective quality despite its low bitrate.
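
For context on what that objective usually looks like, here is a hedged sketch of a generator-side adversarial loss plus feature matching, in the style of the SoundStream/EnCodec family that Mimi builds on. This is a generic formulation of the technique, not Moshi's or Mimi's actual training code, and the hinge form of the loss is my assumption.

```python
# Generic sketch of "adversarial loss + feature matching" as used in the
# SoundStream/EnCodec family of codecs; hinge formulation assumed. Not the
# actual Moshi/Mimi training code.
import torch
import torch.nn.functional as F

def generator_losses(disc_logits_fake, feats_real, feats_fake):
    """disc_logits_fake: list of discriminator outputs on reconstructed audio.
    feats_real / feats_fake: per-discriminator lists of intermediate feature
    maps on the real vs. reconstructed audio."""
    # Hinge-style adversarial term, generator side: push the discriminator's
    # score on the reconstruction above the margin.
    adv = sum(torch.relu(1.0 - d).mean() for d in disc_logits_fake)
    # Feature matching: L1 distance between the discriminator's intermediate
    # activations on real vs. reconstructed audio, summed over all layers.
    fm = sum(
        F.l1_loss(f, r.detach())
        for fr, ff in zip(feats_real, feats_fake)
        for r, f in zip(fr, ff)
    )
    return adv, fm
```

What the summary highlights is the absence of any reconstruction term (spectrogram or waveform) in that objective, which makes the reported gains in subjective quality at 1.1 kbps notable.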