inworld/realtime-tts-2
Most expressive text-to-speech model from Inworld, with natural-language steering, real-time latency, and multilingual support across 100+ languages.
Cost
Community model (estimated from hardware time)
Input Parameters
| Name | Type | Description | Default | Constraints |
|---|---|---|---|---|
| text (required) | string | The text to convert to speech. Maximum 2,000 characters. Supports natural-language steering with bracketed instructions placed before the text they apply to (e.g. `[say excitedly]`, `[whisper in a hushed style]`, `[say sadly with deliberate pauses in a low voice]`). Inline non-verbal tags are also supported (e.g. `[laugh]`, `[sigh]`, `[breathe]`, `[clear throat]`, `[cough]`, `[yawn]`). SSML break tags work for pauses (e.g. `<break time="1s" />`). Capitalize words for emphasis (e.g. `I told you NOT to do that`). | — | — |
| audio_format | string | Output audio format. | "mp3" | mp3, wav, ogg_opus, flac |
| language | string | Language of the input text. Use 'auto' to let the model detect the language. Supported production languages: English (en), Chinese (zh), Japanese (ja), Korean (ko), Russian (ru), Italian (it), Spanish (es), Portuguese (pt), French (fr), German (de), Polish (pl), Dutch (nl), Hindi (hi), Hebrew (he), Arabic (ar). | "auto" | auto, en, zh, ja, ko, ru, it, es, pt, fr, de, pl, nl, hi, he, ar |
| sample_rate | integer | Audio sample rate in Hz. | 48000 | 8000, 16000, 22050, 24000, 32000, 44100, 48000 |
| speaking_rate | number | Speaking speed multiplier. Set to 0 for normal speed (1.0). | 0 | min: 0, max: 1.5 |
| temperature | number | Controls randomness when generating audio. Higher values produce more expressive results, lower values are more deterministic. Set to 0 to use the model default (1.1). | 0 | min: 0, max: 2 |
| text_normalization | string | Controls whether numbers, dates, and abbreviations are expanded before synthesis. 'auto' lets the model decide, 'on' always normalizes, 'off' reads text as-is. | "auto" | auto, on, off |
| voice_id | string | The voice to use. Use a preset voice name (e.g. 'Ashley', 'Dennis', 'Alex', 'Darlene') or a custom cloned voice ID. | "Ashley" | — |
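
The parameters in the table can be checked on the client before a request is sent. The following is a minimal sketch, not an official client: the helper names `steer` and `build_input` are hypothetical, no network call is made, and it assumes that the 2,000-character limit applies to the whole `text` string, including any bracketed instructions and tags.

```python
from typing import Any, Dict

# Allowed values copied from the Constraints column of the parameter table.
AUDIO_FORMATS = {"mp3", "wav", "ogg_opus", "flac"}
LANGUAGES = {"auto", "en", "zh", "ja", "ko", "ru", "it", "es",
             "pt", "fr", "de", "pl", "nl", "hi", "he", "ar"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100, 48000}
TEXT_NORMALIZATION = {"auto", "on", "off"}
MAX_TEXT_CHARS = 2000  # assumption: counted over the full string, tags included


def steer(instruction: str, text: str) -> str:
    """Place a bracketed steering instruction before the text it applies to."""
    return f"[{instruction}] {text}"


def build_input(
    text: str,
    *,
    audio_format: str = "mp3",
    language: str = "auto",
    sample_rate: int = 48000,
    speaking_rate: float = 0,   # 0 means normal speed (1.0)
    temperature: float = 0,     # 0 means the model default (1.1)
    text_normalization: str = "auto",
    voice_id: str = "Ashley",
) -> Dict[str, Any]:
    """Validate values against the documented constraints and return an input dict."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if audio_format not in AUDIO_FORMATS:
        raise ValueError(f"unsupported audio_format: {audio_format!r}")
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate!r}")
    if not 0 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0 and 1.5")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if text_normalization not in TEXT_NORMALIZATION:
        raise ValueError(f"unsupported text_normalization: {text_normalization!r}")
    return {
        "text": text,
        "audio_format": audio_format,
        "language": language,
        "sample_rate": sample_rate,
        "speaking_rate": speaking_rate,
        "temperature": temperature,
        "text_normalization": text_normalization,
        "voice_id": voice_id,
    }


if __name__ == "__main__":
    # Steering instruction first, then a non-verbal tag and an SSML pause inline.
    line = steer("say excitedly", 'We shipped it! [laugh] <break time="1s" /> Finally.')
    print(build_input(line, audio_format="wav", sample_rate=24000, voice_id="Dennis"))
```

The returned dictionary uses the parameter names from the table as keys, so it could be passed as the input to whichever client is used to run the model. Rejecting out-of-range values locally avoids spending a run on a request that would fail validation.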
ff2e08e7e058 Updated: 6/26/2026 5.7K runs