
inworld/realtime-tts-1.5-max

Highest-quality realtime text-to-speech with <200ms latency, emotion control, and 15-language support

Capabilities

No capability data available

Cost

Community model (estimated from hardware time)

Input Parameters

text required string

The text to convert to speech. Maximum 2,000 characters. Supports SSML break tags for pauses (e.g. `<break time="1s" />`), emotion markups (e.g. `[happy]`, `[sad]`), and non-verbal vocalizations (e.g. `[laugh]`, `[sigh]`).
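
Because each request is capped at 2,000 characters, longer scripts have to be divided before synthesis. The following is a minimal sketch of one way to do that, assuming it is acceptable to split at sentence boundaries; `split_text` and `MAX_CHARS` are hypothetical names, not part of the model's interface. A sentence longer than the limit is hard-wrapped, which could cut through a markup tag such as `[happy]` or `<break time="1s" />`, so scripts with very long sentences may need manual splitting instead.

```python
import re

MAX_CHARS = 2000  # per-request limit stated in the `text` description


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    # Break after ., ! or ? followed by whitespace; markup tags stay inside sentences.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is itself longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


script = '[happy] Welcome back! <break time="1s" /> ' + "This is a long narration sentence. " * 100
chunks = split_text(script)
```

Each element of `chunks` can then be sent as its own `text` value. Whether an emotion markup applied in one request carries over to the next is not stated here, so it may be safest to repeat the markup at the start of each chunk.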

audio_format string

Output audio format.

Default: "mp3"
Allowed values: mp3, wav, ogg_opus, flac
sample_rate integer

Audio sample rate in Hz.

Default: 48000
Allowed values: 8000, 16000, 22050, 24000, 32000, 44100, 48000
speaking_rate number

Speaking speed multiplier. A value of 0 selects normal speed (equivalent to a multiplier of 1.0).

Default: 0 (min: 0, max: 1.5)
temperature number

Controls randomness when generating audio. Higher values produce more expressive results; lower values are more deterministic. Set to 0 to use the model default (1.1).

Default: 0 (min: 0, max: 2)
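
For both speaking_rate and temperature, 0 acts as a "use the default" sentinel and not as a literal value. The sketch below shows how a client might resolve the values that are actually in effect, using only the ranges and defaults documented above; `effective_settings` is a hypothetical helper and does not call the model.

```python
NORMAL_SPEAKING_RATE = 1.0  # documented meaning of speaking_rate = 0
DEFAULT_TEMPERATURE = 1.1   # documented meaning of temperature = 0


def effective_settings(speaking_rate: float = 0, temperature: float = 0) -> dict[str, float]:
    """Return the values the model is documented to use for the given inputs."""
    if not 0 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0 and 1.5")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    return {
        # 0 is falsy, so `or` substitutes the documented default for the sentinel.
        "speaking_rate": speaking_rate or NORMAL_SPEAKING_RATE,
        "temperature": temperature or DEFAULT_TEMPERATURE,
    }


defaults = effective_settings()
slower_and_steadier = effective_settings(speaking_rate=0.8, temperature=0.5)
```

One consequence of the sentinel is that a temperature of exactly 0 cannot be requested; for the most deterministic output, a small positive value is presumably the closest option.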
text_normalization string

Controls whether numbers, dates, and abbreviations are expanded before synthesis. 'auto' lets the model decide, 'on' always normalizes, and 'off' reads the text as-is.

Default: "auto"
Allowed values: auto, on, off
voice_id string

The voice to use. Use a preset voice name (e.g. 'Ashley', 'Dennis', 'Alex') or a custom cloned voice ID.

Default: "Ashley"
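
The parameters above can be collected into a single input object and checked before a request is made. The following sketch validates values against the documented defaults, allowed values, and ranges. `build_input` is a hypothetical helper; how the resulting dictionary is submitted to the model depends on the client in use and is not covered here.

```python
AUDIO_FORMATS = {"mp3", "wav", "ogg_opus", "flac"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100, 48000}
TEXT_NORMALIZATION = {"auto", "on", "off"}


def build_input(
    text: str,
    *,
    audio_format: str = "mp3",
    sample_rate: int = 48000,
    speaking_rate: float = 0,
    temperature: float = 0,
    text_normalization: str = "auto",
    voice_id: str = "Ashley",
) -> dict:
    """Validate parameters against the documented constraints and return them as a dict."""
    if not text:
        raise ValueError("text is required")
    if len(text) > 2000:
        raise ValueError("text exceeds the 2,000-character maximum")
    if audio_format not in AUDIO_FORMATS:
        raise ValueError(f"audio_format must be one of {sorted(AUDIO_FORMATS)}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {sorted(SAMPLE_RATES)}")
    if not 0 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0 and 1.5")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if text_normalization not in TEXT_NORMALIZATION:
        raise ValueError(f"text_normalization must be one of {sorted(TEXT_NORMALIZATION)}")
    return {
        "text": text,
        "audio_format": audio_format,
        "sample_rate": sample_rate,
        "speaking_rate": speaking_rate,
        "temperature": temperature,
        "text_normalization": text_normalization,
        "voice_id": voice_id,
    }


payload = build_input(
    '[happy] Your order has shipped! <break time="1s" /> It should arrive on Friday.',
    audio_format="wav",
    sample_rate=24000,
    voice_id="Dennis",
)
```

Only `text` is required; every other key falls back to its documented default when omitted. The voice_id value is not checked locally because custom cloned voice IDs cannot be enumerated in advance.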

Version: 4a2e51066a48 | Updated: 6/26/2026 | 140.6K runs