🗣 Text to audio (TTS)
API Compatibility
The LocalAI TTS API is compatible with the OpenAI TTS API and the Elevenlabs API.
LocalAI API
The /tts endpoint can also be used to generate speech from text.
Usage
Input: input, model
For example, to generate an audio file, you can send a POST request to the /tts endpoint with the instruction as the request body:
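A minimal sketch, assuming LocalAI is reachable at localhost:8080 and that a TTS-capable model is installed (the model name below is only a placeholder):

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-tts-model",
    "input": "Hello world!"
  }' > output.wav
```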
Returns an audio/wav file.
Backends
🐸 Coqui
Required: Don’t use LocalAI images ending with the -core tag. Python dependencies are required in order to use this backend.
Coqui works without any configuration. To test it, you can run the following curl command:
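A sketch of such a request; the model identifier below is an example Coqui model and may need to be adapted:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "backend": "coqui",
    "model": "tts_models/en/ljspeech/glow-tts",
    "input": "Hello, this is a test!"
  }' > output.wav
```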
You can use the env variable COQUI_LANGUAGE to set the language used by the coqui backend.
You can also use config files to configure tts models (see section below on how to use config files).
Bark
Bark allows you to generate audio from text prompts.
This is an extra backend: it is already available in the container image, and there is nothing to do for the setup.
Model setup
There is nothing to be done for the model setup. You can start using bark right away; the models will be downloaded the first time you use the backend.
Usage
Use the tts endpoint by specifying the bark backend:
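For example, assuming LocalAI runs on localhost:8080:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "backend": "bark",
    "input": "Hello!"
  }' | aplay
```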
To specify a voice from https://github.com/suno-ai/bark#-voice-presets ( https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c ), use the model parameter:
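For instance, using the v2/en_speaker_4 preset from the list linked above:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "backend": "bark",
    "model": "v2/en_speaker_4",
    "input": "Hello!"
  }' | aplay
```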
Piper
To install the piper audio models manually:
- Download Voices from https://github.com/rhasspy/piper/releases/tag/v0.0.2
- Extract the .tar.tgz files (.onnx, .json) inside models
- Run the following command to test the model is working (see the curl example below)
To use the tts endpoint, run the following command. You can specify a backend with the backend parameter. For example, to use the piper backend:
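A sketch, assuming a Piper voice such as it-riccardo_fasol-x-low.onnx (from the release page above) has been extracted into the models directory:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "backend": "piper",
    "model": "it-riccardo_fasol-x-low.onnx",
    "input": "Ciao, sono Ettore"
  }' | aplay
```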
Note:
- aplay is a Linux command. You can use other tools to play the audio file.
- The model name is the filename with the extension.
- The model name is case sensitive.
- LocalAI must be compiled with the GO_TAGS=tts flag.
Transformers-musicgen
LocalAI also has experimental support for transformers-musicgen for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:
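A sketch of such a request; facebook/musicgen-small is used here as an example musicgen checkpoint:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "backend": "transformers-musicgen",
    "model": "facebook/musicgen-small",
    "input": "Cello Rave"
  }' > output.wav
```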
Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.
VibeVoice
VibeVoice-Realtime is a real-time text-to-speech model that generates natural-sounding speech with voice cloning capabilities.
Setup
Install the vibevoice model from the Model gallery or run local-ai models install vibevoice.
Usage
Use the tts endpoint by specifying the vibevoice backend:
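A sketch of the request, assuming the model installed above is named vibevoice:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "backend": "vibevoice",
    "model": "vibevoice",
    "input": "Hello, this is VibeVoice speaking."
  }' > output.wav
```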
Voice cloning
VibeVoice supports voice cloning through voice preset files. You can configure a model with a specific voice:
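A minimal config sketch; the file path and the exact field names (in particular the tts.voice key) are assumptions and may differ between LocalAI versions:

```yaml
# Hypothetical example: save as vibevoice-cloned.yaml in your models directory
name: vibevoice-cloned
backend: vibevoice
parameters:
  model: vibevoice
tts:
  # Voice preset / reference audio used for cloning (illustrative path)
  voice: /models/voices/my-voice-preset.wav
```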
Then you can use the model:
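For example, assuming the config above was named vibevoice-cloned:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vibevoice-cloned",
    "input": "This should sound like the cloned voice."
  }' > output.wav
```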
Pocket TTS
Pocket TTS is a lightweight text-to-speech model designed to run efficiently on CPUs. It supports voice cloning through HuggingFace voice URLs or local audio files.
Setup
Install the pocket-tts model from the Model gallery or run local-ai models install pocket-tts.
Usage
Use the tts endpoint by specifying the pocket-tts backend:
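A sketch of the request, assuming the model installed above is named pocket-tts:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "backend": "pocket-tts",
    "model": "pocket-tts",
    "input": "Hello from Pocket TTS."
  }' > output.wav
```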
Voice cloning
Pocket TTS supports voice cloning through built-in voice names, HuggingFace URLs, or local audio files. You can configure a model with a specific voice:
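A config sketch; the voice value can be a built-in voice name, a HuggingFace URL, or a local audio file, but the value below and the tts.voice key are illustrative assumptions:

```yaml
# Hypothetical example: save as pocket-tts-cloned.yaml in your models directory
name: pocket-tts-cloned
backend: pocket-tts
parameters:
  model: pocket-tts
tts:
  # Built-in voice name, HuggingFace URL, or local audio file (illustrative value)
  voice: /models/voices/reference-speaker.wav
```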
You can also pre-load a default voice for faster first generation:
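A sketch of such a config; the option name used for pre-loading is an assumption and may differ in your LocalAI version:

```yaml
name: pocket-tts
backend: pocket-tts
parameters:
  model: pocket-tts
options:
  # Hypothetical option: load this voice at startup so the first generation is faster
  - "default_voice:/models/voices/reference-speaker.wav"
```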
Then you can use the model:
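For example, using the cloned-voice config defined above:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "pocket-tts-cloned",
    "input": "This should use the cloned voice."
  }' > output.wav
```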
Qwen3-TTS
Qwen3-TTS is a high-quality text-to-speech model that supports three modes: custom voice (predefined speakers), voice design (natural language instructions), and voice cloning (from reference audio).
Setup
Install the qwen-tts model from the Model gallery or run local-ai models install qwen-tts.
Usage
Use the tts endpoint by specifying the qwen-tts backend:
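A sketch of the request, assuming the model installed above is named qwen-tts:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "backend": "qwen-tts",
    "model": "qwen-tts",
    "input": "Hello, this is Qwen3-TTS."
  }' > output.wav
```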
Custom Voice Mode
Qwen3-TTS supports predefined speakers. You can specify a speaker using the voice parameter:
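For example, to use the Ryan speaker from the list below:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "backend": "qwen-tts",
    "model": "qwen-tts",
    "voice": "Ryan",
    "input": "Hello, this is the Ryan voice."
  }' > output.wav
```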
Available speakers:
- Chinese: Vivian, Serena, Uncle_Fu, Dylan, Eric
- English: Ryan, Aiden
- Japanese: Ono_Anna
- Korean: Sohee
Voice Design Mode
Voice Design allows you to create custom voices using natural language instructions. Configure the model with an instruct option:
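A config sketch; the exact syntax for passing the instruct option is an assumption:

```yaml
# Hypothetical example: save as qwen-tts-design.yaml in your models directory
name: qwen-tts-design
backend: qwen-tts
parameters:
  model: qwen-tts
options:
  # Natural-language description of the desired voice (illustrative option syntax)
  - "instruct:A calm, warm female voice with a slight British accent"
```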
Then use the model:
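Assuming the config above was named qwen-tts-design:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-tts-design",
    "input": "This voice was described with a natural language instruction."
  }' > output.wav
```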
Voice Clone Mode
Voice Clone allows you to clone a voice from reference audio. Configure the model with an AudioPath and optional ref_text:
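A config sketch; the exact spelling and placement of the AudioPath and ref_text settings are assumptions based on the parameter names mentioned in this section:

```yaml
# Hypothetical example: save as qwen-tts-clone.yaml in your models directory
name: qwen-tts-clone
backend: qwen-tts
parameters:
  model: qwen-tts
tts:
  # Reference audio to clone (local path; URLs or base64 also work, see below)
  audio_path: /models/voices/reference.wav
options:
  # Optional transcript of the reference audio (illustrative option syntax)
  - "ref_text:This is the transcript of the reference recording."
```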
You can also use URLs or base64 strings for the reference audio. The backend automatically detects the mode based on available parameters (AudioPath → VoiceClone, instruct option → VoiceDesign, voice parameter → CustomVoice).
Then use the model:
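Assuming the config above was named qwen-tts-clone:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-tts-clone",
    "input": "This should sound like the reference speaker."
  }' > output.wav
```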
Using config files
You can also use a config file to specify TTS models and their parameters.
In the following example we define a custom config to load the xtts_v2 model, and specify a voice and language.
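A sketch of such a config, saved for example as xtts_v2.yaml in the models directory (the voice and language values are illustrative):

```yaml
name: xtts_v2
backend: coqui
parameters:
  language: en
  model: tts_models/multilingual/multi-dataset/xtts_v2
tts:
  # Voice shipped with the xtts_v2 model (illustrative value)
  voice: Ana Florence
```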
With this config, you can now use the following curl command to generate a text-to-speech audio file:
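Assuming the config above is in place:

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "xtts_v2",
    "input": "Hello, this is a test!"
  }' > output.wav
```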
Response format
To provide compatibility with the OpenAI API's response_format parameter, ffmpeg must be installed (or a Docker image that includes ffmpeg must be used) so the generated wav file can be converted before the API returns its response.
Warning: this is a change in behaviour. Before this addition, the parameter was ignored and a wav file was always returned, which could cause codec errors later in the integration (for example, trying to decode as mp3, the default format used by OpenAI, a file that is actually wav).
Supported formats, thanks to ffmpeg, are wav, mp3, aac, flac and opus, defaulting to wav if an unknown format or no format is provided.
If a response_format other than wav is requested and ffmpeg is not available, the call will fail.
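For example, to get an mp3 back instead of the default wav (assuming ffmpeg is available in the image and the xtts_v2 config from the previous section):

```bash
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "xtts_v2",
    "input": "Hello, this is a test!",
    "response_format": "mp3"
  }' > output.mp3
```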