inworld/realtime-tts-1.5-max
Highest-quality realtime text-to-speech with <200ms latency, emotion control, and 15-language support
Capabilities
Cost
Community model (estimated from hardware time)
Input Parameters
| Name | Type | Description | Default | Constraints |
|---|---|---|---|---|
| text (required) | string | The text to convert to speech. Maximum 2,000 characters. Supports SSML break tags for pauses (e.g. `<break time="1s" />`), emotion markups (e.g. `[happy]`, `[sad]`), and non-verbal vocalizations (e.g. `[laugh]`, `[sigh]`). | — | — |
| audio_format | string | Output audio format. | "mp3" | mp3, wav, ogg_opus, flac |
| sample_rate | integer | Audio sample rate in Hz. | 48000 | 8000, 16000, 22050, 24000, 32000, 44100, 48000 |
| speaking_rate | number | Speaking speed multiplier. Set to 0 for normal speed (1.0). | 0 | min: 0, max: 1.5 |
| temperature | number | Controls randomness when generating audio. Higher values produce more expressive results; lower values are more deterministic. Set to 0 to use the model default (1.1). | 0 | min: 0, max: 2 |
| text_normalization | string | Controls whether numbers, dates, and abbreviations are expanded before synthesis. 'auto' lets the model decide, 'on' always normalizes, 'off' reads text as-is. | "auto" | auto, on, off |
| voice_id | string | The voice to use. Use a preset voice name (e.g. 'Ashley', 'Dennis', 'Alex') or a custom cloned voice ID. | "Ashley" | — |
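The constraints in the table can be checked on the client before a request is sent. The sketch below is a minimal illustration, not part of the model's API: `build_tts_input` and `effective_settings` are hypothetical helper names, the allowed values are copied from the table, and the assumption that the 2,000-character limit counts the raw string (markup included) is not confirmed by the source. How the resulting dictionary is submitted to the model is left out.

```python
from typing import Any, Dict

# Allowed values and limits, copied from the parameter table.
AUDIO_FORMATS = {"mp3", "wav", "ogg_opus", "flac"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100, 48000}
NORMALIZATION_MODES = {"auto", "on", "off"}
MAX_TEXT_CHARS = 2000


def build_tts_input(
    text: str,
    voice_id: str = "Ashley",
    audio_format: str = "mp3",
    sample_rate: int = 48000,
    speaking_rate: float = 0,
    temperature: float = 0,
    text_normalization: str = "auto",
) -> Dict[str, Any]:
    """Hypothetical helper: validate inputs and return an input dictionary."""
    if not text:
        raise ValueError("text is required")
    # Assumption: the limit applies to the raw string, markup included.
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if audio_format not in AUDIO_FORMATS:
        raise ValueError(f"unsupported audio_format: {audio_format!r}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate!r}")
    if not 0 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0 and 1.5")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if text_normalization not in NORMALIZATION_MODES:
        raise ValueError(f"unsupported text_normalization: {text_normalization!r}")
    return {
        "text": text,
        "voice_id": voice_id,
        "audio_format": audio_format,
        "sample_rate": sample_rate,
        "speaking_rate": speaking_rate,
        "temperature": temperature,
        "text_normalization": text_normalization,
    }


def effective_settings(payload: Dict[str, Any]) -> Dict[str, float]:
    """Resolve the 0 sentinels to the values the table documents for them."""
    return {
        "speaking_rate": payload["speaking_rate"] or 1.0,  # 0 means normal speed
        "temperature": payload["temperature"] or 1.1,      # 0 means model default
    }


if __name__ == "__main__":
    # Text combining an emotion markup, an SSML break, and a vocalization.
    payload = build_tts_input(
        '[happy] Welcome back! <break time="1s" /> [laugh] Shall we begin?',
        sample_rate=24000,
    )
    print(payload)
    print(effective_settings(payload))
```

Because 0 is a sentinel for both `speaking_rate` and `temperature`, leaving either at its default does not mean "zero speed" or "zero randomness"; `effective_settings` makes the documented substitution explicit.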
4a2e51066a48, Updated: 6/26/2026, 140.6K runs