xai/grok-speech-to-text


Transcribe audio to text with xAI's Grok. Handles 25 languages, word-level timestamps, speaker diarization, multichannel audio, and files up to 500 MB.

Cost

Community model (estimated from hardware time)

Input Parameters

Name	Type	Description	Default	Constraints
`audio`*	string(uri)	Audio file to transcribe. Supports WAV, MP3, WebM, OGG, M4A, FLAC, AAC, MP4, Opus. Max 500 MB.	—	—
`diarize`	boolean	Enable speaker diarization. Each word in the output will include a 'speaker' index identifying who spoke it.	`false`	—
`format_text`	boolean	Enable inverse text normalization, which converts spoken-form numbers, currencies, and units to their written form (e.g. 'one hundred dollars' becomes '$100'). Requires 'language' to be set.	`false`	—
`language`	string	Language code for the audio (e.g. 'en', 'fr', 'de'). Used to enable inverse text normalization when 'format_text' is true. Set to 'auto' to let the model auto-detect the language.	`"auto"`	auto, ar, cs, da, de, en, es, fa, fil, fr, hi, id, it, ja, ko, mk, ms, nl, pl, pt, ro, ru, sv, th, tr, vi
`multichannel`	boolean	Transcribe each audio channel independently. The output will include a per-word 'channel' index.	`false`	—
`timestamps`	boolean	Include word-level start and end timestamps in the output.	`false`	—

`audio` (required, string)

Audio file to transcribe. Supports WAV, MP3, WebM, OGG, M4A, FLAC, AAC, MP4, Opus. Max 500MB.
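
The format and size limits above can be checked locally before uploading. The following is a minimal sketch, not part of the model's API: the helper name `check_audio_file`, the mapping from format names to file extensions, and the reading of "500MB" as 500 * 1024 * 1024 bytes are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: these extensions correspond to the formats listed above.
SUPPORTED_EXTENSIONS = {
    ".wav", ".mp3", ".webm", ".ogg", ".m4a", ".flac", ".aac", ".mp4", ".opus",
}
# Assumption: "500MB" is interpreted as 500 * 1024 * 1024 bytes.
MAX_BYTES = 500 * 1024 * 1024


def check_audio_file(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: not a file"]
    problems = []
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"{p.name}: unsupported extension {p.suffix!r}")
    size = p.stat().st_size
    if size > MAX_BYTES:
        problems.append(f"{p.name}: {size} bytes exceeds the 500 MB limit")
    return problems
```

The check only looks at the file name and size; it does not inspect the actual audio encoding, so the service may still reject a file that passes.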

`diarize` (boolean)

Enable speaker diarization. Each word in the output will include a 'speaker' index identifying who spoke it.

Default: false
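
With diarization enabled, per-word speaker indices can be merged into speaker turns. The sketch below assumes the output has already been parsed into a list of word dictionaries. Only the 'speaker' field is documented above; the `"text"` key and the helper name `speaker_turns` are hypothetical and should be adjusted to the actual output schema.

```python
from itertools import groupby


def speaker_turns(words, text_key="text"):
    """Merge consecutive words that share a 'speaker' index into (speaker, text) turns.

    `text_key` is an assumed field name for the word text.
    """
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w[text_key] for w in group)))
    return turns


# Illustrative data only, not real model output.
example_words = [
    {"text": "Hello", "speaker": 0},
    {"text": "there", "speaker": 0},
    {"text": "Hi", "speaker": 1},
    {"text": "Welcome", "speaker": 0},
]
for speaker, text in speaker_turns(example_words):
    print(f"Speaker {speaker}: {text}")
```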

`format_text` (boolean)

Enable inverse text normalization, which converts spoken-form numbers, currencies, and units to their written form (e.g. 'one hundred dollars' becomes '$100'). Requires 'language' to be set.

Default: false

`language` (string)

Language code for the audio (e.g. 'en', 'fr', 'de'). Used to enable inverse text normalization when 'format_text' is true. Set to 'auto' to let the model auto-detect the language.

Default: "auto"

Allowed values: auto, ar, cs, da, de, en, es, fa, fil, fr, hi, id, it, ja, ko, mk, ms, nl, pl, pt, ro, ru, sv, th, tr, vi
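
The interaction between 'language' and 'format_text' can be enforced on the client before a request is sent. This is a minimal sketch: `build_input` is a hypothetical helper, and treating 'auto' as "language not set" when 'format_text' is true is an assumption drawn from the parameter descriptions, not confirmed behavior.

```python
# Allowed values for 'language', taken from the list above.
LANGUAGES = {
    "auto", "ar", "cs", "da", "de", "en", "es", "fa", "fil", "fr", "hi", "id",
    "it", "ja", "ko", "mk", "ms", "nl", "pl", "pt", "ro", "ru", "sv", "th",
    "tr", "vi",
}


def build_input(audio, language="auto", format_text=False,
                diarize=False, multichannel=False, timestamps=False):
    """Build the input dictionary, rejecting combinations the docs rule out."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    # Assumption: 'auto' does not count as a set language for format_text.
    if format_text and language == "auto":
        raise ValueError("format_text requires an explicit language code")
    return {
        "audio": audio,
        "language": language,
        "format_text": format_text,
        "diarize": diarize,
        "multichannel": multichannel,
        "timestamps": timestamps,
    }
```

With the `replicate` Python client installed and an API token configured, the resulting dictionary could presumably be passed as `replicate.run("xai/grok-speech-to-text", input=payload)`; the shape of the returned value is not described on this page.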

`multichannel` (boolean)

Transcribe each audio channel independently. The output will include a per-word 'channel' index.

Default: false
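
When each channel carries a different party (for example, a two-channel call recording), the per-word 'channel' index can be used to rebuild one transcript per channel. The sketch assumes a list of word dictionaries; the `"text"` key and the helper name `transcripts_by_channel` are hypothetical, and only the 'channel' field comes from the description above.

```python
from collections import defaultdict


def transcripts_by_channel(words, text_key="text"):
    """Collect words per 'channel' index and join them in their original order.

    `text_key` is an assumed field name for the word text.
    """
    channels = defaultdict(list)
    for w in words:
        channels[w["channel"]].append(w[text_key])
    return {ch: " ".join(tokens) for ch, tokens in sorted(channels.items())}


# Illustrative data only, not real model output.
example_words = [
    {"text": "How", "channel": 0},
    {"text": "Fine", "channel": 1},
    {"text": "are", "channel": 0},
    {"text": "you", "channel": 0},
    {"text": "thanks", "channel": 1},
]
print(transcripts_by_channel(example_words))
```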

`timestamps` (boolean)

Include word-level start and end timestamps in the output.

Default: false
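
Word-level timestamps make it possible to split a transcript into segments wherever the speaker pauses. The sketch below is one way to do that, assuming each word is a dictionary with `"start"` and `"end"` times in seconds and a `"text"` field. Those key names, the unit, and the helper name `split_on_pauses` are assumptions, since the output schema is not specified here.

```python
def split_on_pauses(words, max_gap=0.8, text_key="text"):
    """Group words into (start, end, text) segments, breaking on gaps above max_gap.

    Assumes 'start' and 'end' are numeric times in seconds on each word.
    """
    segments = []
    current = []
    for w in words:
        # Start a new segment when the silence since the previous word is too long.
        if current and w["start"] - current[-1]["end"] > max_gap:
            segments.append(current)
            current = []
        current.append(w)
    if current:
        segments.append(current)
    return [
        (seg[0]["start"], seg[-1]["end"], " ".join(x[text_key] for x in seg))
        for seg in segments
    ]


# Illustrative data only, not real model output.
example_words = [
    {"text": "Good", "start": 0.0, "end": 0.4},
    {"text": "morning", "start": 0.5, "end": 1.0},
    {"text": "Next", "start": 2.5, "end": 2.9},
]
for start, end, text in split_on_pauses(example_words):
    print(f"[{start:.1f}s to {end:.1f}s] {text}")
```

The 0.8 second threshold is arbitrary; a shorter gap yields more, shorter segments, which may suit subtitle generation better than paragraph-style transcripts.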

Version: cfeb8cf422e3 | Updated: 7/25/2026 | 2.3K runs