xai/grok-text-to-speech
Convert text to natural-sounding speech with xAI's Grok TTS. 5 voices, 20 languages, expressive speech tags, and high-fidelity MP3 / WAV / telephony audio output.
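As a rough illustration of the expressive speech tags mentioned above, the sketch below assembles a script using the inline tags `[pause]` and `[laugh]` and the wrapping tag `<whisper>...</whisper>`. These are the only tags this page names. The helper names (`whisper`, `join_with_pause`, `check_length`) are hypothetical and not part of any xAI API. The 15000-character limit is taken from the `text` parameter under Input Parameters.

```python
MAX_TEXT_CHARS = 15000  # limit documented for the `text` parameter


def whisper(fragment: str) -> str:
    """Wrap a fragment in the <whisper> wrapping tag (hypothetical helper)."""
    return f"<whisper>{fragment}</whisper>"


def join_with_pause(*parts: str) -> str:
    """Join fragments with the inline [pause] tag between them (hypothetical helper)."""
    return " [pause] ".join(part.strip() for part in parts)


def check_length(text: str) -> str:
    """Raise ValueError if the text exceeds the documented maximum length."""
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(
            f"text is {len(text)} characters; the maximum is {MAX_TEXT_CHARS}"
        )
    return text


script = check_length(
    join_with_pause(
        "Welcome back.",
        whisper("This part is a secret."),
        "Ready? [laugh]",
    )
)
print(script)
```

The resulting string is what would be supplied as `text`. How the model renders each tag is not described here, so it is worth listening to a short sample before relying on a particular effect.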
Cost
Community model (estimated from hardware time)
Input Parameters
| Name | Type | Description | Default | Constraints |
|---|---|---|---|---|
| text (required) | string | Text to synthesize into speech. Maximum 15000 characters. Supports inline speech tags like '[pause]', '[laugh]', and wrapping tags like '<whisper>...</whisper>' for expressive delivery. | — | — |
| bit_rate | integer | MP3 bit rate in bits per second. Only used when 'output_format' is 'mp3'. Higher bit rates produce better quality at the cost of file size. | 128000 | 32000, 64000, 96000, 128000, 192000 |
| language | string | BCP-47 language code for the input text. Set to 'auto' to let the model auto-detect the language. | "auto" | auto, en, ar-EG, ar-SA, ar-AE, bn, zh, fr, de, hi, id, it, ja, ko, pt-BR, pt-PT, ru, es-MX, es-ES, tr, vi |
| output_format | string | Audio codec. 'mp3' is best for general use, 'wav' for lossless audio, 'pcm' for raw audio pipelines, 'mulaw'/'alaw' for telephony. | "mp3" | mp3, wav, pcm, mulaw, alaw |
| sample_rate | integer | Audio sample rate in Hz. Higher rates produce better quality at the cost of file size. | 24000 | 8000, 16000, 22050, 24000, 44100, 48000 |
| text_normalization | boolean | Normalize written-form text (numbers, abbreviations, symbols) into spoken form before generating audio. | false | — |
| voice | string | Voice to use for synthesis. 'eve' is energetic and upbeat (default), 'ara' is warm and friendly, 'rex' is confident and clear, 'sal' is smooth and balanced, 'leo' is authoritative and strong. | "eve" | eve, ara, rex, sal, leo |
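The defaults and constraints in the table can be checked before a request is made. The following is a minimal sketch, assuming the inputs are passed as a plain dictionary keyed by the parameter names above. `build_tts_input` is a hypothetical helper, not part of any official client. Dropping `bit_rate` for non-MP3 formats is a choice made here because the table says it is only used with 'mp3'.

```python
MAX_TEXT_CHARS = 15000  # documented limit for `text`

# Defaults copied from the parameter table.
DEFAULTS = {
    "bit_rate": 128000,
    "language": "auto",
    "output_format": "mp3",
    "sample_rate": 24000,
    "text_normalization": False,
    "voice": "eve",
}

# Allowed values copied from the Constraints column.
ALLOWED = {
    "bit_rate": {32000, 64000, 96000, 128000, 192000},
    "language": {
        "auto", "en", "ar-EG", "ar-SA", "ar-AE", "bn", "zh", "fr", "de", "hi",
        "id", "it", "ja", "ko", "pt-BR", "pt-PT", "ru", "es-MX", "es-ES",
        "tr", "vi",
    },
    "output_format": {"mp3", "wav", "pcm", "mulaw", "alaw"},
    "sample_rate": {8000, 16000, 22050, 24000, 44100, 48000},
    "voice": {"eve", "ara", "rex", "sal", "leo"},
}


def build_tts_input(text: str, **overrides) -> dict:
    """Merge overrides onto the defaults and validate them (hypothetical helper)."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")

    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    params = {**DEFAULTS, **overrides}
    for name, allowed in ALLOWED.items():
        if params[name] not in allowed:
            raise ValueError(f"{name}={params[name]!r} is not one of {sorted(allowed)}")
    if not isinstance(params["text_normalization"], bool):
        raise ValueError("text_normalization must be a boolean")

    # Assumption: since bit_rate only applies to MP3, omit it for other formats.
    if params["output_format"] != "mp3":
        params.pop("bit_rate")

    return {"text": text, **params}


# Example: telephony-style output with written-form text normalized.
payload = build_tts_input(
    "Call 555-0100 before 5 p.m.",
    voice="rex",
    output_format="mulaw",
    sample_rate=8000,
    text_normalization=True,
)
print(payload)
```

The returned dictionary is intended as the input for whichever client runs the model. That call is left out because this page does not specify one.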
443e01eb52b1 Updated: 6/26/2026 115.6K runs