inworld/realtime-tts-2

Most expressive text-to-speech model from Inworld, with natural-language steering, real-time latency, and multilingual support across 100+ languages.

Cost

Community model (estimated from hardware time)

Input Parameters

Name	Type	Description	Default	Constraints
`text`*	string	The text to convert to speech. Maximum 2,000 characters. Supports natural-language steering with bracketed instructions placed before the text they apply to (e.g. `[say excitedly]`, `[whisper in a hushed style]`, `[say sadly with deliberate pauses in a low voice]`). Inline non-verbal tags are also supported (e.g. `[laugh]`, `[sigh]`, `[breathe]`, `[clear throat]`, `[cough]`, `[yawn]`). SSML break tags work for pauses (e.g. `<break time="1s" />`). Capitalize words for emphasis (e.g. `I told you NOT to do that`).	`—`	—
`audio_format`	string	Output audio format.	`"mp3"`	mp3, wav, ogg_opus, flac
`language`	string	Language of the input text. Use 'auto' to let the model detect the language. Supported production languages: English (en), Chinese (zh), Japanese (ja), Korean (ko), Russian (ru), Italian (it), Spanish (es), Portuguese (pt), French (fr), German (de), Polish (pl), Dutch (nl), Hindi (hi), Hebrew (he), Arabic (ar).	`"auto"`	auto, en, zh, ja, ko, ru, it, es, pt, fr, de, pl, nl, hi, he, ar
`sample_rate`	integer	Audio sample rate in Hz.	`48000`	8000, 16000, 22050, 24000, 32000, 44100, 48000
`speaking_rate`	number	Speaking speed multiplier. Set to 0 for normal speed (1.0).	`0`	min: 0, max: 1.5
`temperature`	number	Controls randomness when generating audio. Higher values produce more expressive results, lower values are more deterministic. Set to 0 to use the model default (1.1).	`0`	min: 0, max: 2
`text_normalization`	string	Controls whether numbers, dates, and abbreviations are expanded before synthesis. 'auto' lets the model decide, 'on' always normalizes, 'off' reads text as-is.	`"auto"`	auto, on, off
`voice_id`	string	The voice to use. Use a preset voice name (e.g. 'Ashley', 'Dennis', 'Alex', 'Darlene') or a custom cloned voice ID.	`"Ashley"`	—
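
The parameters above can be checked on the client before a request is sent. The following is a minimal sketch, not an official client: `build_input` and `steer` are hypothetical helper names, the allowed values are copied from the table, and it assumes the 2,000-character limit counts every character in `text`, including bracketed tags and SSML.

```python
# Allowed values copied from the Constraints column of the parameter table.
AUDIO_FORMATS = {"mp3", "wav", "ogg_opus", "flac"}
LANGUAGES = {"auto", "en", "zh", "ja", "ko", "ru", "it", "es", "pt",
             "fr", "de", "pl", "nl", "hi", "he", "ar"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100, 48000}
TEXT_NORMALIZATION = {"auto", "on", "off"}
MAX_TEXT_CHARS = 2000  # assumption: tags and SSML count toward the limit


def steer(instruction, text):
    """Hypothetical helper: put a bracketed steering instruction before the text."""
    return f"[{instruction}] {text}"


def build_input(text, *, voice_id="Ashley", audio_format="mp3",
                language="auto", sample_rate=48000, speaking_rate=0,
                temperature=0, text_normalization="auto"):
    """Hypothetical helper: validate parameters and return the input payload."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if audio_format not in AUDIO_FORMATS:
        raise ValueError(f"unsupported audio_format: {audio_format!r}")
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate!r}")
    if not 0 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0 and 1.5")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if text_normalization not in TEXT_NORMALIZATION:
        raise ValueError(f"unsupported text_normalization: {text_normalization!r}")
    return {
        "text": text,
        "voice_id": voice_id,
        "audio_format": audio_format,
        "language": language,
        "sample_rate": sample_rate,
        "speaking_rate": speaking_rate,    # 0 means normal speed (1.0)
        "temperature": temperature,        # 0 means the model default (1.1)
        "text_normalization": text_normalization,
    }


# A steering instruction, an inline non-verbal tag, an SSML break,
# and a capitalized word for emphasis, all in one input string.
payload = build_input(
    steer("say excitedly",
          'We shipped it! [laugh] <break time="1s" /> I told you NOT to worry.'),
    audio_format="wav",
    sample_rate=24000,
)
print(payload["text"])
```

With the Replicate Python client installed and an API token configured, the resulting dictionary would presumably be passed as `replicate.run("inworld/realtime-tts-2", input=payload)`; that call is not exercised here.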

Version: ff2e08e7e058 · Updated: 7/25/2026 · 5.7K runs