elevenlabs/scribe-v2
Transcribe speech with ElevenLabs Scribe v2. Supports 90+ languages, word-level timestamps, speaker diarization for up to 32 speakers, audio event tagging, and keyterm biasing. Accepts files up to 3 GB and 10 hours long.
Capabilities
Cost
Community model (estimated from hardware time)
Input Parameters
| Name | Type | Description | Default | Constraints |
|---|---|---|---|---|
| audio (required) | string (uri) | Audio or video file to transcribe. Supports MP3, WAV, M4A, FLAC, OGG, OPUS, WebM, AAC, MP4, MOV, MKV, AVI, and more. Max 3 GB, up to 10 hours. | — | — |
| diarize | boolean | Identify and label different speakers in the audio. When enabled, each word in the output includes a 'speaker_id'. Supports up to 32 speakers. | false | — |
| keyterms | string | Comma-separated list of words or phrases to bias transcription towards. Useful for product names, technical terms, or proper nouns. Up to 1000 terms, max 50 characters each. | "" | — |
| language_code | string | Language of the audio as an ISO-639-1 (e.g. 'en') or ISO-639-3 (e.g. 'eng') code. Set to 'auto' to detect the language automatically. Setting a specific language can improve accuracy for noisy or unusual audio. | "auto" | — |
| no_verbatim | boolean | Remove filler words ('um', 'uh'), false starts, and disfluencies from the transcript. Produces a cleaner, more readable output. | false | — |
| num_speakers | integer | Maximum number of speakers expected in the audio. Helps the model with diarization. Set to 0 to let the model decide. Only used when 'diarize' is true. | 0 | min: 0, max: 32 |
| seed | integer | Random seed for reproducible outputs. Set to -1 to use a non-deterministic seed. | -1 | min: -1, max: 2147483647 |
| tag_audio_events | boolean | Tag non-speech sounds in the transcription, like (laughter), (footsteps), or (applause). | true | — |
| temperature | number | Sampling temperature. Higher values produce more diverse, less deterministic output. Set to -1 to use the model default (usually 0). | -1 | min: -1, max: 2 |
| timestamps_granularity | string | Granularity of word timestamps in the output. 'word' returns start/end times for each word, 'character' adds per-character timing, 'none' omits timestamps. | "word" | one of: none, word, character |
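
The constraints in the table above can be checked on the client side before a run is submitted. The following is a minimal sketch, not part of the model's own API: `build_scribe_input` is a hypothetical helper name, and the choice to accept `keyterms` as a Python list and join it into the comma-separated string the model expects is an assumption made for convenience. Only the parameter names, defaults, and limits come from the table.

```python
from typing import Any, Optional

# Limits copied from the parameter table; the constant names are illustrative.
GRANULARITIES = {"none", "word", "character"}
MAX_KEYTERMS = 1000
MAX_KEYTERM_CHARS = 50
MAX_SPEAKERS = 32
MAX_SEED = 2147483647


def build_scribe_input(
    audio: str,
    *,
    diarize: bool = False,
    keyterms: Optional[list[str]] = None,
    language_code: str = "auto",
    no_verbatim: bool = False,
    num_speakers: int = 0,
    seed: int = -1,
    tag_audio_events: bool = True,
    temperature: float = -1,
    timestamps_granularity: str = "word",
) -> dict[str, Any]:
    """Hypothetical helper: validate values against the documented
    constraints and return a dict keyed by the documented parameter names."""
    if not audio:
        raise ValueError("audio is required")

    # Drop empty entries and surrounding whitespace before checking limits.
    terms = [t.strip() for t in (keyterms or []) if t.strip()]
    if len(terms) > MAX_KEYTERMS:
        raise ValueError(f"at most {MAX_KEYTERMS} keyterms are allowed")
    for term in terms:
        if len(term) > MAX_KEYTERM_CHARS:
            raise ValueError(f"keyterm longer than {MAX_KEYTERM_CHARS} characters: {term!r}")
        if "," in term:
            # The comma is the list separator, so it cannot appear inside a term.
            raise ValueError(f"keyterm must not contain a comma: {term!r}")

    if not 0 <= num_speakers <= MAX_SPEAKERS:
        raise ValueError(f"num_speakers must be between 0 and {MAX_SPEAKERS}")
    if not -1 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between -1 and {MAX_SEED}")
    if not -1 <= temperature <= 2:
        raise ValueError("temperature must be between -1 and 2")
    if timestamps_granularity not in GRANULARITIES:
        raise ValueError(f"timestamps_granularity must be one of {sorted(GRANULARITIES)}")

    return {
        "audio": audio,
        "diarize": diarize,
        "keyterms": ",".join(terms),
        "language_code": language_code,
        "no_verbatim": no_verbatim,
        # num_speakers is only used when diarize is true, so fall back to the
        # default (0 = let the model decide) otherwise. This is a design choice
        # of this sketch, not documented behavior.
        "num_speakers": num_speakers if diarize else 0,
        "seed": seed,
        "tag_audio_events": tag_audio_events,
        "temperature": temperature,
        "timestamps_granularity": timestamps_granularity,
    }


payload = build_scribe_input(
    "https://example.com/interview.mp3",  # placeholder URL
    diarize=True,
    num_speakers=2,
    keyterms=["Scribe", " ElevenLabs "],
)
print(payload["keyterms"])
```

The returned dict could then be passed as the input payload to whichever client is used to run the model. Validating locally is optional; it mainly saves a round trip when a value is out of range.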
5cd433d181bb Updated: 6/26/2026 8.1K runs