xai/grok-speech-to-text

Transcribe audio to text with xAI's Grok. Handles 25 languages, word-level timestamps, speaker diarization, multichannel audio, and files up to 500 MB.

Cost

Community model. Cost is estimated from hardware time.

Input Parameters

audio required string

Audio file to transcribe. Supported formats: WAV, MP3, WebM, OGG, M4A, FLAC, AAC, MP4, Opus. Maximum file size: 500 MB.
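A quick local check before uploading can catch files that fall outside the format list and size limit above. The sketch below is only illustrative: the extension set is inferred from the listed formats, `check_audio_file` is a hypothetical helper name, and the page does not say whether the limit means decimal megabytes or mebibytes, so decimal is assumed.

```python
import os

# Extensions inferred from the formats listed on this page (an assumption;
# the service may inspect file contents rather than names).
SUPPORTED_EXTENSIONS = {
    ".wav", ".mp3", ".webm", ".ogg", ".m4a", ".flac", ".aac", ".mp4", ".opus",
}
# Assumes "500 MB" means decimal megabytes.
MAX_BYTES = 500 * 1000 * 1000


def check_audio_file(path: str) -> None:
    """Raise ValueError if the file looks unsuitable for upload."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"file is {size} bytes, over the {MAX_BYTES}-byte limit")
```

Calling `check_audio_file("meeting.wav")` returns nothing when the file passes both checks and raises `ValueError` otherwise.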

diarize boolean

Enable speaker diarization. Each word in the output will include a 'speaker' index identifying who spoke it.

Default: false
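With 'diarize' enabled, the per-word 'speaker' index can be folded into speaker turns. The sketch below assumes the output is a list of word objects with a 'text' key alongside 'speaker'; only the 'speaker' field is named on this page, so the rest of the shape is hypothetical.

```python
from itertools import groupby


def speaker_turns(words):
    """Merge consecutive words that share a speaker index into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["text"] for w in group)))
    return turns


# Hypothetical sample data; the real output schema may differ.
words = [
    {"text": "Hello", "speaker": 0},
    {"text": "there", "speaker": 0},
    {"text": "Hi", "speaker": 1},
]
print(speaker_turns(words))  # [(0, 'Hello there'), (1, 'Hi')]
```
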

format_text boolean

Enable inverse text normalization, which converts spoken-form numbers, currencies, and units to their written form (e.g. 'one hundred dollars' becomes '$100'). Requires 'language' to be set.

Default: false

language string

Language code for the audio (e.g. 'en', 'fr', 'de'). Used to enable inverse text normalization when 'format_text' is true. Set to 'auto' to let the model auto-detect the language.

Default: "auto"
Allowed values: auto, ar, cs, da, de, en, es, fa, fil, fr, hi, id, it, ja, ko, mk, ms, nl, pl, pt, ro, ru, sv, th, tr, vi
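The interaction between 'language' and 'format_text' can be checked on the client before a request is sent. The sketch below builds an input dictionary from the parameter names on this page. `build_input` is a hypothetical helper, the audio URL is a placeholder, and rejecting 'auto' when 'format_text' is true is a conservative reading of "Requires 'language' to be set"; the page does not say whether 'auto' also satisfies that requirement.

```python
# Language codes copied from the allowed values listed on this page.
LANGUAGES = {
    "auto", "ar", "cs", "da", "de", "en", "es", "fa", "fil", "fr", "hi", "id", "it",
    "ja", "ko", "mk", "ms", "nl", "pl", "pt", "ro", "ru", "sv", "th", "tr", "vi",
}


def build_input(audio, language="auto", format_text=False, diarize=False,
                multichannel=False, timestamps=False):
    """Return an input dict, validating the language/format_text pairing."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    # Conservative assumption: ask for an explicit code when format_text is on.
    if format_text and language == "auto":
        raise ValueError("format_text=True: pass an explicit language code")
    return {
        "audio": audio,
        "language": language,
        "format_text": format_text,
        "diarize": diarize,
        "multichannel": multichannel,
        "timestamps": timestamps,
    }


# Placeholder URL, for illustration only.
payload = build_input("https://example.com/meeting.wav", language="en", format_text=True)
print(payload)
```
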

multichannel boolean

Transcribe each audio channel independently. The output will include a per-word 'channel' index.

Default: false
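With 'multichannel' enabled, the per-word 'channel' index makes it possible to rebuild one transcript per channel, for example the two sides of a call recorded in stereo. The sketch below assumes a list of word objects with 'text' and 'channel' keys; only 'channel' is named on this page, so the sample data is hypothetical.

```python
from collections import defaultdict


def transcript_by_channel(words):
    """Join the words of each channel into a single string, keyed by channel."""
    channels = defaultdict(list)
    for w in words:
        channels[w["channel"]].append(w["text"])
    return {ch: " ".join(texts) for ch, texts in sorted(channels.items())}


# Hypothetical sample data; the real output schema may differ.
sample = [
    {"text": "Thanks", "channel": 0},
    {"text": "Hi", "channel": 1},
    {"text": "for", "channel": 0},
    {"text": "calling", "channel": 0},
]
print(transcript_by_channel(sample))  # {0: 'Thanks for calling', 1: 'Hi'}
```
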

timestamps boolean

Include word-level start and end timestamps in the output.

Default: false
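Word-level timestamps are enough to produce simple subtitles. The sketch below groups words into fixed-size cues and formats them in SubRip (SRT) style. It assumes each word carries 'text', 'start', and 'end' keys with times in seconds; the page only states that start and end timestamps are included, so the key names and units are assumptions.

```python
def srt_time(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(words, max_words=7):
    """Group words into cues of at most max_words and render SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0]["start"], chunk[-1]["end"]
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)


# Hypothetical sample data; the real output schema may differ.
timed = [
    {"text": "Hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 0.9},
]
print(to_srt(timed))
```

Splitting on a fixed word count keeps the example short. Breaking cues at pauses or at speaker changes would usually read better.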

Version: cfeb8cf422e3 | Updated: 6/26/2026 | 2.3K runs