Zonos-v0.1 is an open-weight text-to-speech model that is currently trending on Hugging Face. It was trained on more than 200k hours of diverse multilingual speech, resulting in expressiveness and quality comparable to, or even surpassing, top TTS providers.
- Speech Generation Capability: Generates natural speech from text prompts, given a speaker embedding or an audio prefix.
- Speech Cloning Capability: Accurately clones speech from a short reference clip.
- Speech Control: Allows fine control over speaking rate, pitch, audio quality, and emotions.
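As a rough illustration of the generation and cloning flow in the list above, here is a minimal sketch modeled on the sample in the project's README. The module paths, the `Zyphra/Zonos-v0.1-transformer` checkpoint name, and the call signatures are written from memory of that README and should be treated as assumptions to check against the installed version. The `synthesize` helper and its parameter names are hypothetical, introduced only for this example.

```python
def synthesize(text, reference_path, out_path, device="cuda", language="en-us"):
    """Clone the voice in `reference_path` and speak `text` into `out_path`.

    Hypothetical helper. The Zonos calls below follow the README sample as
    recalled and are assumptions, not a verified API reference.
    """
    # Imported lazily so that merely defining this helper does not need a GPU stack.
    import torchaudio
    from zonos.model import Zonos
    from zonos.conditioning import make_cond_dict

    # Load the pretrained weights (assumed checkpoint name).
    model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device=device)

    # Build a speaker embedding from a short reference clip.
    wav, sampling_rate = torchaudio.load(reference_path)
    speaker = model.make_speaker_embedding(wav, sampling_rate)

    # Bundle the text, speaker, and language into conditioning inputs.
    cond_dict = make_cond_dict(text=text, speaker=speaker, language=language)
    conditioning = model.prepare_conditioning(cond_dict)

    # Generate audio codes, then decode them to a waveform and save it.
    codes = model.generate(conditioning)
    wavs = model.autoencoder.decode(codes).cpu()
    torchaudio.save(out_path, wavs[0], model.autoencoder.sampling_rate)
    return out_path
```

A call would look like `synthesize("Hello, world!", "reference.mp3", "sample.wav")`, where both file names are placeholders.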
This version is meant to run on Ubuntu with NVIDIA GPUs, but there is an open thread about Apple Silicon support here. On Apple hardware, the sample program also runs into a problem with the mamba module.
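One possible way to work around the mamba problem is to check the platform up front and fall back to a checkpoint that does not need that module. The sketch below uses only the standard library. It assumes that the module is importable as `mamba_ssm` and that the two checkpoints are published as `Zyphra/Zonos-v0.1-transformer` and `Zyphra/Zonos-v0.1-hybrid`; both are assumptions to verify. `pick_checkpoint` is a hypothetical helper, not part of Zonos.

```python
import importlib.util
import platform

# Assumed checkpoint names; verify them on Hugging Face before use.
TRANSFORMER_CKPT = "Zyphra/Zonos-v0.1-transformer"
HYBRID_CKPT = "Zyphra/Zonos-v0.1-hybrid"


def has_module(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def pick_checkpoint(system=None, machine=None, mamba_available=None):
    """Choose a checkpoint that should load on the current machine.

    All arguments are optional overrides, which makes the logic easy to test.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if mamba_available is None:
        # Assumption: the mamba dependency is importable as `mamba_ssm`.
        mamba_available = has_module("mamba_ssm")

    on_apple_silicon = system == "Darwin" and machine == "arm64"
    # Without a usable mamba module, fall back to the pure transformer model.
    if on_apple_silicon or not mamba_available:
        return TRANSFORMER_CKPT
    return HYBRID_CKPT


if __name__ == "__main__":
    print(pick_checkpoint())
```

The explicit overrides keep the decision testable on any machine, and the returned name can then be passed to whatever loading call the sample program uses.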