
minimax/speech-02-turbo

Text-to-Audio (T2A) model that offers voice synthesis, emotional expression, and multilingual capabilities. Designed for low-latency, real-time applications.

Capabilities

No capability data available

Cost

Community model (estimated from hardware time)

Input Parameters

text required string

Text to narrate (max 10,000 characters). Use markers like <#0.5#> to insert pauses in seconds.
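As a rough illustration of the pause markers, the sketch below joins text segments with a <#seconds#> marker and checks the 10,000-character limit. The helper name with_pauses is hypothetical, and the sketch assumes the marker characters count toward the limit, which the description above does not confirm.

```python
MAX_CHARS = 10_000  # limit stated in the parameter description


def with_pauses(segments, pause_seconds):
    """Join text segments, inserting a <#seconds#> pause marker between each pair."""
    if pause_seconds <= 0:
        raise ValueError("pause_seconds must be positive")
    marker = f"<#{pause_seconds}#>"
    text = marker.join(segments)
    # Assumption: the markers themselves count toward the character limit.
    if len(text) > MAX_CHARS:
        raise ValueError(f"text is {len(text)} characters; the limit is {MAX_CHARS}")
    return text


print(with_pauses(["Hello.", "Welcome back."], 0.5))
# prints: Hello.<#0.5#>Welcome back.
```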

audio_format string

File format for the generated audio. Choose mp3 for general use, wav/flac for lossless, or pcm for raw bytes.

Default: "mp3"
mp3 wav flac pcm
bitrate integer

MP3 bitrate in bits per second. Only used when audio_format is mp3.

Default: 128000
32000 64000 128000 256000
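To get a feel for what the bitrate choice means in practice, the sketch below estimates MP3 size as bitrate / 8 × duration. It assumes constant-bitrate encoding and ignores headers and metadata, so treat the numbers as approximations; the function name estimate_mp3_bytes is hypothetical.

```python
def estimate_mp3_bytes(duration_seconds, bitrate=128_000):
    """Approximate size of constant-bitrate audio: bits per second / 8 * seconds."""
    return int(duration_seconds * bitrate / 8)


# Approximate size of one minute of audio at each supported bitrate.
for bitrate in (32_000, 64_000, 128_000, 256_000):
    print(bitrate, estimate_mp3_bytes(60, bitrate))
# prints:
# 32000 240000
# 64000 480000
# 128000 960000
# 256000 1920000
```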
channel string

Number of audio channels: mono for 1 channel (default) or stereo for 2 channels.

Default: "mono"
mono stereo
emotion string

Desired delivery style. Use auto to let MiniMax choose, or pick a specific emotion.

Default: "auto"
auto happy sad angry fearful disgusted surprised calm fluent neutral
english_normalization boolean

Improve number/date reading for English text (adds a small amount of latency).

Default: false
language_boost string

Optional language hint. Choose Automatic to let MiniMax detect the language, or pick a specific locale.

Default: "None"
None Automatic Chinese Chinese,Yue Cantonese English Arabic Russian Spanish French Portuguese German Turkish Dutch Ukrainian Vietnamese Indonesian Japanese Italian Korean Thai Polish Romanian Greek Czech Finnish Hindi Bulgarian Danish Hebrew Malay Persian Slovak Swedish Croatian Filipino Hungarian Norwegian Slovenian Catalan Nynorsk Tamil Afrikaans
pitch integer

Semitone offset applied to the voice (−12 to +12).

Default: 0 min: -12, max: 12
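For intuition about the semitone scale, the sketch below converts an offset into a frequency ratio using the standard equal-temperament relation, ratio = 2^(semitones / 12). Whether MiniMax applies exactly this mapping internally is an assumption; the function name pitch_ratio is hypothetical.

```python
def pitch_ratio(semitones):
    """Frequency ratio for a semitone offset in 12-tone equal temperament."""
    if not -12 <= semitones <= 12:
        raise ValueError("pitch must be between -12 and 12 semitones")
    return 2 ** (semitones / 12)


print(pitch_ratio(12))   # one octave up: 2.0
print(pitch_ratio(-12))  # one octave down: 0.5
print(round(pitch_ratio(7), 4))  # a perfect fifth up: 1.4983
```

Under this relation, the parameter's full range spans one octave down to one octave up.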
sample_rate integer

Audio sample rate in Hz.

Default: 32000
8000 16000 22050 24000 32000 44100
speed number

Speech speed multiplier (0.5–2.0). Lower is slower, higher is faster.

Default: 1 min: 0.5, max: 2
subtitle_enable boolean

Return MiniMax subtitle metadata with sentence timestamps (non-streaming only).

Default: false
voice_id string

Voice to synthesize. Pick any MiniMax system voice (e.g. English_Wiselady, English_Deep-VoicedGentleman) or a voice_id returned by https://replicate.com/minimax/voice-cloning. See the full list of voices in the README.

Default: "English_Wiselady"
volume number

Relative loudness, from 0 to 10. 1.0 is the default MiniMax gain.

Default: 1 min: 0, max: 10
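The parameters above can be collected into a single input dictionary and checked before a request is sent. The following is a minimal sketch, not an official client: build_input and synthesize are hypothetical helpers, the allowed values are copied from the listing above, and the call assumes the third-party replicate Python package with a REPLICATE_API_TOKEN set in the environment. Depending on how the model is hosted, the model reference may also need a version suffix.

```python
ALLOWED = {
    "audio_format": {"mp3", "wav", "flac", "pcm"},
    "bitrate": {32000, 64000, 128000, 256000},
    "channel": {"mono", "stereo"},
    "emotion": {"auto", "happy", "sad", "angry", "fearful", "disgusted",
                "surprised", "calm", "fluent", "neutral"},
    "sample_rate": {8000, 16000, 22050, 24000, 32000, 44100},
}
RANGES = {"pitch": (-12, 12), "speed": (0.5, 2.0), "volume": (0, 10)}


def build_input(text, **options):
    """Validate options against the documented values and return the input dict."""
    if not text or len(text) > 10_000:
        raise ValueError("text is required and limited to 10,000 characters")
    for name, value in options.items():
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"{name}={value!r} is not one of {sorted(ALLOWED[name])}")
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value!r} is outside {low} to {high}")
    # Options without a rule here (voice_id, language_boost, ...) pass through unchecked.
    return {"text": text, **options}


def synthesize(text, **options):
    """Send the request. Assumes the `replicate` package is installed."""
    import replicate  # imported lazily so build_input works without it

    # Assumption: this model reference resolves without an explicit version.
    return replicate.run("minimax/speech-02-turbo", input=build_input(text, **options))


print(build_input("Hello there.", emotion="calm", speed=1.2, voice_id="English_Wiselady"))
```

Validating locally keeps obvious mistakes, such as an unsupported sample rate, from costing a round trip to the API.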
Version: f39649380c14 Updated: 6/8/2026 12.5M runs