minimax/speech-02-turbo
Text-to-Audio (T2A) model that offers voice synthesis, emotional expression, and multilingual capabilities. Designed for real-time applications with low latency.
Cost
Community model (estimated from hardware time)
Input Parameters
| Name | Type | Description | Default | Constraints |
|---|---|---|---|---|
| text (required) | string | Text to narrate (max 10,000 characters). Use markers like `<#0.5#>` to insert pauses in seconds. | — | — |
| audio_format | string | File format for the generated audio. Choose mp3 for general use, wav/flac for lossless, or pcm for raw bytes. | "mp3" | mp3 wav flac pcm |
| bitrate | integer | MP3 bitrate in bits per second. Only used when audio_format is mp3. | 128000 | 32000 64000 128000 256000 |
| channel | string | mono for 1 channel (default), stereo for 2 channels. | "mono" | mono stereo |
| emotion | string | Desired delivery style. Use auto to let MiniMax choose, or pick a specific emotion. | "auto" | auto happy sad angry fearful disgusted surprised calm fluent neutral |
| english_normalization | boolean | Improve number/date reading for English text (adds a small amount of latency). | false | — |
| language_boost | string | Optional language hint. Choose Automatic to let MiniMax detect the language, or pick a specific locale. | "None" | None Automatic Chinese Chinese,Yue Cantonese English Arabic Russian Spanish French Portuguese German Turkish Dutch Ukrainian Vietnamese Indonesian Japanese Italian Korean Thai Polish Romanian Greek Czech Finnish Hindi Bulgarian Danish Hebrew Malay Persian Slovak Swedish Croatian Filipino Hungarian Norwegian Slovenian Catalan Nynorsk Tamil Afrikaans |
| pitch | integer | Semitone offset applied to the voice (−12 to +12). | 0 | min: -12, max: 12 |
| sample_rate | integer | Audio sample rate in Hz. | 32000 | 8000 16000 22050 24000 32000 44100 |
| speed | number | Speech speed multiplier (0.5–2.0). Lower is slower, higher is faster. | 1 | min: 0.5, max: 2 |
| subtitle_enable | boolean | Return MiniMax subtitle metadata with sentence timestamps (non-streaming only). | false | — |
| voice_id | string | Voice to synthesize. Pick any MiniMax system voice (e.g. English_Wiselady, English_Deep-VoicedGentleman) or a voice_id returned by https://replicate.com/minimax/voice-cloning. See the full list of voices in the README. | "English_Wiselady" | — |
| volume | number | Relative loudness. 1.0 is default MiniMax gain. Range 0–10. | 1 | min: 0, max: 10 |
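
The constraints in the table can be checked on the client before a request is sent. The sketch below is one possible way to do that, not part of the model's own tooling: the helper names `join_with_pauses` and `build_input` are hypothetical, the allowed values are copied from the table, and it assumes the 10,000-character limit counts pause markers as well as spoken text.

```python
from typing import Any

# Allowed values copied from the parameter table above.
ENUM_CHECKS: dict[str, set] = {
    "audio_format": {"mp3", "wav", "flac", "pcm"},
    "bitrate": {32000, 64000, 128000, 256000},
    "channel": {"mono", "stereo"},
    "emotion": {"auto", "happy", "sad", "angry", "fearful", "disgusted",
                "surprised", "calm", "fluent", "neutral"},
    "sample_rate": {8000, 16000, 22050, 24000, 32000, 44100},
}
RANGE_CHECKS: dict[str, tuple[float, float]] = {
    "pitch": (-12, 12),
    "speed": (0.5, 2.0),
    "volume": (0, 10),
}
# Parameters that are accepted here without a value check.
UNCHECKED = {"english_normalization", "language_boost",
             "subtitle_enable", "voice_id"}
MAX_TEXT_CHARS = 10_000  # assumption: pause markers count toward the limit


def join_with_pauses(sentences: list[str], pause_seconds: float = 0.5) -> str:
    """Join sentences with a <#seconds#> pause marker between each pair."""
    marker = f"<#{pause_seconds}#>"
    return marker.join(s.strip() for s in sentences)


def build_input(text: str, **options: Any) -> dict[str, Any]:
    """Return an input payload, raising ValueError on invalid values."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    for name, value in options.items():
        if name in ENUM_CHECKS:
            if value not in ENUM_CHECKS[name]:
                raise ValueError(f"{name}={value!r} is not an allowed value")
        elif name in RANGE_CHECKS:
            low, high = RANGE_CHECKS[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value!r} is outside {low}..{high}")
        elif name not in UNCHECKED:
            raise ValueError(f"unknown parameter: {name}")
    return {"text": text, **options}


payload = build_input(
    join_with_pauses(["Hello there.", "Welcome back."]),
    emotion="calm",
    speed=1.2,
)
# Sending the payload needs the third-party Replicate client and an API
# token, so the call is left as a comment:
#   import replicate
#   output = replicate.run("minimax/speech-02-turbo", input=payload)
```

Parameters that are omitted are left out of the payload, so the defaults listed in the table apply.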
f39649380c14 Updated: 6/8/2026 12.5M runs