xai/grok-speech-to-text
Transcribe audio to text with xAI's Grok. Handles 25 languages, word-level timestamps, speaker diarization, multichannel audio, and files up to 500 MB.
Capabilities
Cost
Community model (estimated from hardware time)
Input Parameters
| Name | Type | Description | Default | Constraints |
|---|---|---|---|---|
| audio * | string (uri) | Audio file to transcribe. Supports WAV, MP3, WebM, OGG, M4A, FLAC, AAC, MP4, Opus. Max 500MB. | — | — |
| diarize | boolean | Enable speaker diarization. Each word in the output will include a 'speaker' index identifying who spoke it. | false | — |
| format_text | boolean | Enable inverse text normalization, which converts spoken-form numbers, currencies, and units to their written form (e.g. 'one hundred dollars' becomes '$100'). Requires 'language' to be set. | false | — |
| language | string | Language code for the audio (e.g. 'en', 'fr', 'de'). Used to enable inverse text normalization when 'format_text' is true. Set to 'auto' to let the model auto-detect the language. | "auto" | auto ar cs da de en es fa fil fr hi id it ja ko mk ms nl pl pt ro ru sv th tr vi |
| multichannel | boolean | Transcribe each audio channel independently. The output will include a per-word 'channel' index. | false | — |
| timestamps | boolean | Include word-level start and end timestamps in the output. | false | — |
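
The parameters above can be assembled into a request payload before calling the model. The following is a minimal sketch, not an official client: `build_input` and `SUPPORTED_LANGUAGES` are hypothetical names introduced for illustration, and the audio URL is a placeholder. It only mirrors the names, defaults, and language constraints listed in the table.

```python
from typing import Any

# Language codes copied from the Constraints column of the parameter table.
SUPPORTED_LANGUAGES = frozenset(
    "auto ar cs da de en es fa fil fr hi id it ja ko mk ms nl pl pt ro ru sv th tr vi".split()
)


def build_input(
    audio: str,
    *,
    language: str = "auto",
    diarize: bool = False,
    format_text: bool = False,
    multichannel: bool = False,
    timestamps: bool = False,
) -> dict[str, Any]:
    """Hypothetical helper: validate and assemble the documented input parameters."""
    if not audio:
        raise ValueError("'audio' is required")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return {
        "audio": audio,
        "language": language,
        "diarize": diarize,
        "format_text": format_text,
        "multichannel": multichannel,
        "timestamps": timestamps,
    }


# Placeholder URL (assumption); replace with a real audio file URI.
payload = build_input(
    "https://example.com/meeting.wav",
    language="en",
    diarize=True,
    timestamps=True,
)
print(payload)
```

The resulting dictionary can then be passed as the input of whichever client hosts the model. If the hosting platform is Replicate (an assumption; the page does not say so), that would look like `replicate.run("xai/grok-speech-to-text", input=payload)`.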
cfeb8cf422e3 Updated: 6/26/2026 2.3K runs