xai/grok-text-to-speech

Official

Convert text to natural-sounding speech with xAI's Grok TTS. 5 voices, 20 languages, expressive speech tags, and high-fidelity MP3 / WAV / telephony audio output.

Cost

Community model (estimated from hardware time)

Input Parameters

Name	Type	Description	Default	Constraints
`text`*	string	Text to synthesize into speech. Maximum 15000 characters. Supports inline speech tags like '[pause]', '[laugh]', and wrapping tags like '<whisper>...</whisper>' for expressive delivery.	`—`	—
`bit_rate`	integer	MP3 bit rate in bits per second. Only used when 'output_format' is 'mp3'. Higher bit rates produce better quality at the cost of file size.	`128000`	32000, 64000, 96000, 128000, 192000
`language`	string	BCP-47 language code for the input text. Set to 'auto' to let the model auto-detect the language.	`"auto"`	auto, en, ar-EG, ar-SA, ar-AE, bn, zh, fr, de, hi, id, it, ja, ko, pt-BR, pt-PT, ru, es-MX, es-ES, tr, vi
`output_format`	string	Audio codec. 'mp3' is best for general use, 'wav' for lossless audio, 'pcm' for raw audio pipelines, 'mulaw'/'alaw' for telephony.	`"mp3"`	mp3, wav, pcm, mulaw, alaw
`sample_rate`	integer	Audio sample rate in Hz. Higher rates produce better quality at the cost of file size.	`24000`	8000, 16000, 22050, 24000, 44100, 48000
`text_normalization`	boolean	Normalize written-form text (numbers, abbreviations, symbols) into spoken-form before generating audio.	`false`	—
`voice`	string	Voice to use for synthesis. 'eve' is energetic and upbeat (default), 'ara' is warm and friendly, 'rex' is confident and clear, 'sal' is smooth and balanced, 'leo' is authoritative and strong.	`"eve"`	eve, ara, rex, sal, leo
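
As an illustration of how the parameters in the table fit together, here is a minimal Python sketch that assembles an input dictionary and checks it against the listed constraints before any request is made. The helper name `build_input` and the `ALLOWED` mapping are hypothetical, introduced only for this example. The allowed values are copied from the table. The final comment assumes Replicate's Python client (`replicate.run`) and a configured API token, and the shape of the returned audio is not described on this page.

```python
# Allowed values copied from the Constraints column of the parameter table.
ALLOWED = {
    "bit_rate": {32000, 64000, 96000, 128000, 192000},
    "language": {
        "auto", "en", "ar-EG", "ar-SA", "ar-AE", "bn", "zh", "fr", "de",
        "hi", "id", "it", "ja", "ko", "pt-BR", "pt-PT", "ru", "es-MX",
        "es-ES", "tr", "vi",
    },
    "output_format": {"mp3", "wav", "pcm", "mulaw", "alaw"},
    "sample_rate": {8000, 16000, 22050, 24000, 44100, 48000},
    "voice": {"eve", "ara", "rex", "sal", "leo"},
}
MAX_TEXT_CHARS = 15000  # limit stated in the `text` description


def build_input(text, **options):
    """Hypothetical helper: return a validated input dict for the model."""
    if not isinstance(text, str):
        raise ValueError("text is required and must be a string")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")

    payload = {"text": text}
    for name, value in options.items():
        if name == "text_normalization":
            if not isinstance(value, bool):
                raise ValueError("text_normalization must be a boolean")
        elif name in ALLOWED:
            if value not in ALLOWED[name]:
                raise ValueError(f"{name}={value!r} is not an allowed value")
        else:
            raise ValueError(f"unknown parameter: {name}")
        payload[name] = value
    return payload


# Speech tags go inline in the text, as described for the `text` parameter.
payload = build_input(
    "Hello there. [pause] <whisper>This part is quiet.</whisper>",
    voice="ara",
    output_format="wav",
    sample_rate=44100,
)

# Assumption: with the `replicate` package installed and an API token set,
# the payload could be submitted like this (not executed here):
#   import replicate
#   output = replicate.run("xai/grok-text-to-speech", input=payload)
```

Omitted parameters fall back to the defaults in the table, so the sketch only sends what the caller sets. Note that `bit_rate` is only used when `output_format` is 'mp3', so passing it alongside 'wav' as above would have no effect.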

Version: 443e01eb52b1 | Updated: 7/25/2026 | 115.6K runs