elevenlabs/scribe-v2

Transcribe speech with ElevenLabs Scribe v2. 90+ languages, word-level timestamps, speaker diarization for up to 32 speakers, audio event tagging, and keyterm biasing. Files up to 3 GB and 10 hours.

Capabilities

Seed

Cost

Community model (estimated from hardware time)

Input Parameters

audio required string

Audio or video file to transcribe. Supports MP3, WAV, M4A, FLAC, OGG, OPUS, WebM, AAC, MP4, MOV, MKV, AVI, and more. Max 3 GB, up to 10 hours.

diarize boolean

Identify and label different speakers in the audio. When enabled, each word in the output includes a 'speaker_id'. Supports up to 32 speakers.

Default: false
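With 'diarize' enabled, consecutive words from the same speaker can be merged into readable turns. The sketch below assumes the output exposes a list of word objects, each with a 'text' field alongside the documented 'speaker_id'. The list shape, the 'text' field name, and the sample values are hypothetical and should be checked against an actual response.

```python
from itertools import groupby


def speaker_turns(words):
    """Collapse consecutive words that share a speaker_id into (speaker, text) turns."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker_id"]):
        turns.append((speaker, " ".join(w["text"] for w in group)))
    return turns


# Hypothetical sample data, not real model output.
sample_words = [
    {"text": "Hello", "speaker_id": "speaker_0"},
    {"text": "there.", "speaker_id": "speaker_0"},
    {"text": "Hi!", "speaker_id": "speaker_1"},
]

for speaker, text in speaker_turns(sample_words):
    print(f"{speaker}: {text}")
```

Because groupby only merges adjacent items, a speaker who talks again later gets a separate turn, which preserves the order of the conversation.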
keyterms string

Comma-separated list of words or phrases to bias transcription towards. Useful for product names, technical terms, or proper nouns. Up to 1000 terms, max 50 characters each.

Default: ""
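Since 'keyterms' is a single comma-separated string with limits on count and length, it can help to build it from a list and check the limits before submitting. This is a minimal sketch: the helper name build_keyterms is hypothetical, and rejecting terms that contain a comma is an assumption, since the page does not describe any escaping.

```python
MAX_TERMS = 1000        # documented limit on the number of terms
MAX_TERM_LENGTH = 50    # documented limit on characters per term


def build_keyterms(terms):
    """Join terms into a comma-separated string, enforcing the documented limits."""
    cleaned = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip empty entries
        if "," in term:
            # Assumption: no escaping mechanism is documented, so commas are rejected.
            raise ValueError(f"term contains a comma: {term!r}")
        if len(term) > MAX_TERM_LENGTH:
            raise ValueError(f"term longer than {MAX_TERM_LENGTH} characters: {term!r}")
        cleaned.append(term)
    if len(cleaned) > MAX_TERMS:
        raise ValueError(f"more than {MAX_TERMS} terms")
    return ",".join(cleaned)


print(build_keyterms(["Scribe v2", " ElevenLabs ", ""]))
```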
language_code string

Language of the audio as an ISO-639-1 (e.g. 'en') or ISO-639-3 (e.g. 'eng') code. Set to 'auto' to detect the language automatically. Setting a specific language can improve accuracy for noisy or unusual audio.

Default: "auto"
no_verbatim boolean

Remove filler words ('um', 'uh'), false starts, and disfluencies from the transcript. Produces a cleaner, more readable output.

Default: false
num_speakers integer

Maximum number of speakers expected in the audio. Helps the model with diarization. Set to 0 to let the model decide. Only used when 'diarize' is true.

Default: 0, min: 0, max: 32
seed integer

Random seed for reproducible outputs. Set to -1 to use a non-deterministic seed.

Default: -1, min: -1, max: 2147483647
tag_audio_events boolean

Tag non-speech sounds in the transcription, like (laughter), (footsteps), or (applause).

Default: true
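When event tagging is on, the tagged sounds can be pulled out of the transcript text for counting or filtering. The sketch below assumes tags appear inline as parenthesized text, as in the examples above. The helper name and sample sentence are hypothetical, and any ordinary parenthetical in the transcript would be matched as well.

```python
import re

# Matches a parenthesized span with no nested parentheses, e.g. "(laughter)".
EVENT_PATTERN = re.compile(r"\(([^()]+)\)")


def audio_events(transcript):
    """Return the parenthesized event labels found in a transcript, in order."""
    return EVENT_PATTERN.findall(transcript)


# Hypothetical transcript text, not real model output.
sample = "Thanks everyone (applause) and welcome (laughter)."
print(audio_events(sample))
```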
temperature number

Sampling temperature. Higher values produce more diverse, less deterministic output. Set to -1 to use the model default (usually 0).

Default: -1, min: -1, max: 2
timestamps_granularity string

Granularity of timestamps in the output. 'word' returns start and end times for each word; 'character' adds per-character timing; 'none' omits timestamps.

Default: "word"
Options: none, word, character
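The defaults and ranges listed for the input parameters can be checked locally before a run is submitted. The sketch below only assembles and validates a plain dictionary and does not call any client library. The function name build_input is hypothetical, and passing 'audio' as a URL string is an assumption, since the page only says the parameter is a string.

```python
GRANULARITIES = {"none", "word", "character"}


def build_input(audio, *, diarize=False, keyterms="", language_code="auto",
                no_verbatim=False, num_speakers=0, seed=-1,
                tag_audio_events=True, temperature=-1,
                timestamps_granularity="word"):
    """Return an input dict using the documented defaults, after range checks."""
    if not audio:
        raise ValueError("audio is required")
    if not 0 <= num_speakers <= 32:
        raise ValueError("num_speakers must be between 0 and 32")
    if not -1 <= seed <= 2147483647:
        raise ValueError("seed must be between -1 and 2147483647")
    if not -1 <= temperature <= 2:
        raise ValueError("temperature must be between -1 and 2")
    if timestamps_granularity not in GRANULARITIES:
        raise ValueError("timestamps_granularity must be none, word, or character")
    return {
        "audio": audio,
        "diarize": diarize,
        "keyterms": keyterms,
        "language_code": language_code,
        "no_verbatim": no_verbatim,
        "num_speakers": num_speakers,  # only used when diarize is true
        "seed": seed,
        "tag_audio_events": tag_audio_events,
        "temperature": temperature,
        "timestamps_granularity": timestamps_granularity,
    }


# Assumed example: the audio string is a placeholder URL.
payload = build_input("https://example.com/meeting.mp3", diarize=True, num_speakers=2)
print(payload)
```

Keyword-only arguments keep call sites explicit, and any parameter left out falls back to the default listed on this page.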
Version: 5cd433d181bb, Updated: 6/26/2026, 8.1K runs