xai/grok-text-to-speech

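Convert text to natural-sounding speech with xAI's Grok TTS. The model offers 5 voices, 20 language codes (16 languages plus regional variants), inline expressive speech tags, and audio output as MP3, WAV, raw PCM, or telephony formats (mu-law and A-law) at sample rates up to 48000 Hz.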

Cost

Community model (estimated from hardware time)

Input Parameters

text required string

Text to synthesize into speech. Maximum 15000 characters. Supports inline speech tags like '[pause]', '[laugh]', and wrapping tags like '<whisper>...</whisper>' for expressive delivery.
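As a rough illustration of the constraints above, the following Python sketch checks the length limit and composes a string with the tags named in the description. The helper names `check_text` and `whisper` are hypothetical and not part of the model's API. The page does not say whether tags count toward the 15000-character limit or how characters are counted, so the sketch conservatively counts everything with Python's `len`.

```python
MAX_TEXT_CHARS = 15000  # limit stated in the parameter description


def whisper(phrase: str) -> str:
    """Wrap a phrase in the <whisper> tag mentioned in the description."""
    return f"<whisper>{phrase}</whisper>"


def check_text(text: str) -> str:
    """Return text unchanged if it is non-blank and within the length limit.

    Assumption: tags count toward the limit (the page does not say).
    """
    if not text.strip():
        raise ValueError("text is required and must not be blank")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(
            f"text has {len(text)} characters; the limit is {MAX_TEXT_CHARS}"
        )
    return text


# Inline tags sit directly in the text; wrapping tags enclose a phrase.
script = check_text(
    "Welcome back. [pause] " + whisper("This part is a secret.") + " [laugh]"
)
```

Validating on the client side like this avoids spending a run on input the model would reject.
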

bit_rate integer

MP3 bit rate in bits per second. Only used when 'output_format' is 'mp3'. Higher bit rates produce better quality at the cost of file size.

Default: 128000
Allowed values: 32000, 64000, 96000, 128000, 192000

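To make the quality versus size trade-off concrete, here is a small Python sketch that estimates MP3 size from bit rate and duration. It assumes constant-bit-rate encoding and ignores headers and metadata, neither of which the page confirms; `estimate_mp3_bytes` is a hypothetical helper name.

```python
def estimate_mp3_bytes(bit_rate: int, seconds: float) -> int:
    """Approximate MP3 payload size: bits per second * seconds / 8 bits per byte.

    Assumption: constant bit rate, no container or metadata overhead.
    """
    return int(bit_rate * seconds / 8)


# Compare the allowed bit rates for one minute of audio.
for rate in (32000, 64000, 96000, 128000, 192000):
    kib = estimate_mp3_bytes(rate, 60) / 1024
    print(f"{rate:>6} bps -> about {kib:.0f} KiB per minute")
```

At the default of 128000 bits per second, one minute of audio comes to roughly 960,000 bytes under these assumptions.
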
language string

BCP-47 language code for the input text. Set to 'auto' to let the model auto-detect the language.

Default: "auto"
Allowed values: auto, en, ar-EG, ar-SA, ar-AE, bn, zh, fr, de, hi, id, it, ja, ko, pt-BR, pt-PT, ru, es-MX, es-ES, tr, vi

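Because only specific codes are accepted, an application that receives arbitrary user locales needs some mapping onto this list. The Python sketch below shows one possible policy: use an exact match if there is one, otherwise the bare language code if it is supported, otherwise fall back to 'auto'. The function `resolve_language` and this fallback policy are assumptions for illustration, not behavior of the model.

```python
SUPPORTED = [
    "auto", "en", "ar-EG", "ar-SA", "ar-AE", "bn", "zh", "fr", "de", "hi",
    "id", "it", "ja", "ko", "pt-BR", "pt-PT", "ru", "es-MX", "es-ES", "tr", "vi",
]
# Case-insensitive lookup that returns the canonical spelling from the list.
_BY_LOWER = {code.lower(): code for code in SUPPORTED}


def resolve_language(locale: str) -> str:
    """Map a user locale onto a supported code, falling back to 'auto'."""
    key = locale.strip().replace("_", "-").lower()
    if key in _BY_LOWER:
        return _BY_LOWER[key]          # exact match, e.g. "pt_br" -> "pt-BR"
    base = key.split("-")[0]
    return _BY_LOWER.get(base, "auto")  # e.g. "fr-CA" -> "fr", "es-AR" -> "auto"
```

Note that Spanish, Portuguese, and Arabic are listed only with regional variants, so a bare or unlisted variant such as "es-AR" falls through to 'auto' in this sketch rather than guessing a region.
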
output_format string

Audio codec. 'mp3' is best for general use, 'wav' for lossless audio, 'pcm' for raw audio pipelines, 'mulaw'/'alaw' for telephony.

Default: "mp3"
Allowed values: mp3, wav, pcm, mulaw, alaw

sample_rate integer

Audio sample rate in Hz. Higher rates produce better quality at the cost of file size.

Default: 24000
Allowed values: 8000, 16000, 22050, 24000, 44100, 48000

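For the uncompressed formats, size scales directly with sample rate. The Python sketch below estimates raw audio size under stated assumptions: the page does not give the bit depth or channel count, so 16-bit mono is assumed for 'pcm', and 8 bits per sample is used for 'mulaw' and 'alaw' as in standard G.711 telephony encoding. `raw_audio_bytes` is a hypothetical helper.

```python
# Assumed bytes per sample; the page does not document bit depth.
BYTES_PER_SAMPLE = {
    "pcm": 2,    # assumption: 16-bit linear PCM
    "mulaw": 1,  # G.711 mu-law uses 8 bits per sample
    "alaw": 1,   # G.711 A-law uses 8 bits per sample
}


def raw_audio_bytes(output_format: str, sample_rate: int,
                    seconds: float, channels: int = 1) -> int:
    """Estimate raw audio size: samples per second * bytes per sample * channels."""
    per_sample = BYTES_PER_SAMPLE[output_format]
    return int(sample_rate * per_sample * channels * seconds)


# One second at the default rate versus a typical telephony setting.
default_pcm = raw_audio_bytes("pcm", 24000, 1)
telephony = raw_audio_bytes("mulaw", 8000, 1)
```

Under these assumptions, one second of 'mulaw' at 8000 Hz is 8000 bytes, while one second of 'pcm' at the default 24000 Hz is 48000 bytes.
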
text_normalization boolean

Normalize written-form text (numbers, abbreviations, symbols) into spoken-form before generating audio.

Default: false

voice string

Voice to use for synthesis. 'eve' is energetic and upbeat (default), 'ara' is warm and friendly, 'rex' is confident and clear, 'sal' is smooth and balanced, 'leo' is authoritative and strong.

Default: "eve"
Allowed values: eve, ara, rex, sal, leo

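Putting the parameters together, the following Python sketch assembles an input dictionary with the documented defaults and rejects values outside the allowed sets listed on this page. `build_input` is a hypothetical helper; how the resulting dictionary is submitted depends on the client you use, which this page does not specify. Dropping 'bit_rate' for non-MP3 formats is a design choice of the sketch, based on the note that the bit rate is only used for 'mp3'.

```python
ALLOWED = {
    "bit_rate": {32000, 64000, 96000, 128000, 192000},
    "language": {
        "auto", "en", "ar-EG", "ar-SA", "ar-AE", "bn", "zh", "fr", "de", "hi",
        "id", "it", "ja", "ko", "pt-BR", "pt-PT", "ru", "es-MX", "es-ES", "tr", "vi",
    },
    "output_format": {"mp3", "wav", "pcm", "mulaw", "alaw"},
    "sample_rate": {8000, 16000, 22050, 24000, 44100, 48000},
    "voice": {"eve", "ara", "rex", "sal", "leo"},
}
DEFAULTS = {
    "bit_rate": 128000,
    "language": "auto",
    "output_format": "mp3",
    "sample_rate": 24000,
    "text_normalization": False,
    "voice": "eve",
}


def build_input(text: str, **options) -> dict:
    """Merge options over the documented defaults and validate each value."""
    if not text or len(text) > 15000:
        raise ValueError("text is required and limited to 15000 characters")
    unknown = set(options) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **options, "text": text}
    for name, allowed in ALLOWED.items():
        if params[name] not in allowed:
            raise ValueError(f"{name}={params[name]!r} is not an allowed value")
    if not isinstance(params["text_normalization"], bool):
        raise ValueError("text_normalization must be a boolean")
    if params["output_format"] != "mp3":
        params.pop("bit_rate")  # described as used only when output_format is 'mp3'
    return params


# Example: a telephony-oriented request with number normalization enabled.
request = build_input(
    "Your code is 4821.",
    voice="rex",
    output_format="mulaw",
    sample_rate=8000,
    text_normalization=True,
)
```
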
Version: 443e01eb52b1
Updated: 6/26/2026
Runs: 115.6K