inworld/realtime-tts-1.5-max

Highest-quality realtime text-to-speech with under 200 ms latency, emotion control, and support for 15 languages

Cost

Community model; cost is estimated from hardware time.

Input Parameters

Name	Type	Description	Default	Constraints
`text` *	string	Required. The text to convert to speech. Maximum 2,000 characters. Supports SSML break tags for pauses (e.g. `<break time="1s" />`), emotion markups (e.g. `[happy]`, `[sad]`), and non-verbal vocalizations (e.g. `[laugh]`, `[sigh]`).	`—`	—
`audio_format`	string	Output audio format.	`"mp3"`	One of: mp3, wav, ogg_opus, flac
`sample_rate`	integer	Audio sample rate in Hz.	`48000`	One of: 8000, 16000, 22050, 24000, 32000, 44100, 48000
`speaking_rate`	number	Speaking speed multiplier. A value of 0 selects normal speed (equivalent to 1.0).	`0`	min: 0, max: 1.5
`temperature`	number	Controls randomness when generating audio. Higher values produce more expressive results; lower values are more deterministic. A value of 0 selects the model default (1.1).	`0`	min: 0, max: 2
`text_normalization`	string	Controls whether numbers, dates, and abbreviations are expanded before synthesis. `auto` lets the model decide, `on` always normalizes, and `off` reads the text as-is.	`"auto"`	One of: auto, on, off
`voice_id`	string	The voice to use. Use a preset voice name (e.g. `Ashley`, `Dennis`, `Alex`) or a custom cloned voice ID.	`"Ashley"`	—
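
The constraints in the table can be checked on the client before a request is sent. The following is a minimal sketch, not the model's official client: `build_input` is a hypothetical helper name, and the page does not describe how the model is invoked, so the example stops at producing a validated input dictionary.

```python
from typing import Any, Dict

# Allowed values and ranges copied from the parameter table.
AUDIO_FORMATS = {"mp3", "wav", "ogg_opus", "flac"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100, 48000}
TEXT_NORMALIZATION = {"auto", "on", "off"}
MAX_TEXT_CHARS = 2000


def build_input(
    text: str,
    audio_format: str = "mp3",
    sample_rate: int = 48000,
    speaking_rate: float = 0,   # 0 selects normal speed (1.0)
    temperature: float = 0,     # 0 selects the model default (1.1)
    text_normalization: str = "auto",
    voice_id: str = "Ashley",
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments and return an input dict."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if audio_format not in AUDIO_FORMATS:
        raise ValueError(f"audio_format must be one of {sorted(AUDIO_FORMATS)}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {sorted(SAMPLE_RATES)}")
    if not 0 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0 and 1.5")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if text_normalization not in TEXT_NORMALIZATION:
        raise ValueError("text_normalization must be 'auto', 'on', or 'off'")
    return {
        "text": text,
        "audio_format": audio_format,
        "sample_rate": sample_rate,
        "speaking_rate": speaking_rate,
        "temperature": temperature,
        "text_normalization": text_normalization,
        "voice_id": voice_id,
    }


payload = build_input("Hello there. [laugh]", audio_format="wav", sample_rate=24000)
```

Omitted arguments fall back to the documented defaults, so `payload` carries `voice_id` set to `"Ashley"` and leaves `speaking_rate` and `temperature` at 0.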

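Because `text` is capped at 2,000 characters, a longer script has to be split across several requests. One possible approach, sketched below, packs whole sentences into chunks that stay under the limit. `split_text` is a hypothetical helper, and the sketch assumes that markup such as `[happy]` and `<break time="1s" />` counts toward the limit, which the page does not confirm.

```python
import re
from typing import List

MAX_TEXT_CHARS = 2000  # documented maximum for the `text` parameter


def split_text(text: str, limit: int = MAX_TEXT_CHARS) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `limit` characters."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Fallback: hard-split a single sentence that exceeds the limit.
        # Note that this may cut through a markup tag.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


script = '[happy] Welcome back! <break time="1s" /> [sigh] It has been a long day.'
for chunk in split_text(script):
    print(len(chunk), chunk)
```

Each chunk can then be sent as the `text` of a separate request. Emotion markups are not carried over between chunks in this sketch, so a tag that should apply to a later chunk would need to be repeated by the caller.
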

Version: 4a2e51066a48 | Updated: 6/26/2026 | 140.6K runs