elevenlabs/scribe-v2

Official

Transcribe speech with ElevenLabs Scribe v2. 90+ languages, word-level timestamps, speaker diarization for up to 32 speakers, audio event tagging, and keyterm biasing. Files up to 3 GB and 10 hours.
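
As a rough illustration of how a transcription request might be issued, the sketch below assumes the third-party `replicate` Python client is installed and that a `REPLICATE_API_TOKEN` environment variable is set. The helper names `make_input` and `transcribe` are hypothetical and exist only for this example; the shape of the returned output is not documented here, so the sketch returns it untouched.

```python
MODEL = "elevenlabs/scribe-v2"


def make_input(audio_url: str, **options) -> dict:
    """Build the input payload for the model.

    Options passed as None are dropped so that the model's own
    defaults apply to them.
    """
    payload = {"audio": audio_url}
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


def transcribe(audio_url: str, **options):
    """Run the model through the Replicate client (assumed to be installed)."""
    # Imported lazily so that make_input can be used without the package.
    import replicate

    return replicate.run(MODEL, input=make_input(audio_url, **options))
```

A call such as `transcribe("https://example.com/meeting.mp3", diarize=True)` would then send only the `audio` and `diarize` fields and leave every other parameter at its default.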

Capabilities

Seed

Cost

Community model (estimated from hardware time)

Input Parameters

Name	Type	Description	Default	Constraints
`audio`*	string(uri)	Audio or video file to transcribe. Supports MP3, WAV, M4A, FLAC, OGG, OPUS, WebM, AAC, MP4, MOV, MKV, AVI, and more. Max 3 GB, up to 10 hours.	`—`	required
`diarize`	boolean	Identify and label different speakers in the audio. When enabled, each word in the output includes a 'speaker_id'. Supports up to 32 speakers.	`false`	—
`keyterms`	string	Comma-separated list of words or phrases to bias transcription towards. Useful for product names, technical terms, or proper nouns. Up to 1000 terms, max 50 characters each.	`""`	—
`language_code`	string	Language of the audio as an ISO-639-1 (e.g. 'en') or ISO-639-3 (e.g. 'eng') code. Set to 'auto' to detect the language automatically. Setting a specific language can improve accuracy for noisy or unusual audio.	`"auto"`	—
`no_verbatim`	boolean	Remove filler words ('um', 'uh'), false starts, and disfluencies from the transcript. Produces a cleaner, more readable output.	`false`	—
`num_speakers`	integer	Maximum number of speakers expected in the audio. Helps the model with diarization. Set to 0 to let the model decide. Only used when 'diarize' is true.	`0`	min: 0, max: 32
`seed`	integer	Random seed for reproducible outputs. Set to -1 to use a non-deterministic seed.	`-1`	min: -1, max: 2147483647
`tag_audio_events`	boolean	Tag non-speech sounds in the transcription, like (laughter), (footsteps), or (applause).	`true`	—
`temperature`	number	Sampling temperature. Higher values produce more diverse, less deterministic output. Set to -1 to use the model default (usually 0).	`-1`	min: -1, max: 2
`timestamps_granularity`	string	Granularity of word timestamps in the output. 'word' returns start/end times for each word, 'character' adds per-character timing, 'none' omits timestamps.	`"word"`	one of: none, word, character
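
The constraints in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of the model's API: `join_keyterms` and `check_input` are hypothetical helpers, and joining terms with a bare comma (no surrounding spaces) is an assumption, since the page only says "comma-separated".

```python
# Numeric ranges and allowed values copied from the Constraints column.
RANGES = {"num_speakers": (0, 32), "seed": (-1, 2147483647), "temperature": (-1, 2)}
GRANULARITIES = {"none", "word", "character"}
MAX_KEYTERMS = 1000
MAX_KEYTERM_LENGTH = 50


def join_keyterms(terms):
    """Turn a list of terms into the comma-separated `keyterms` string."""
    cleaned = [t.strip() for t in terms if t.strip()]
    if len(cleaned) > MAX_KEYTERMS:
        raise ValueError(f"at most {MAX_KEYTERMS} keyterms are allowed")
    for term in cleaned:
        if len(term) > MAX_KEYTERM_LENGTH:
            raise ValueError(f"keyterm longer than {MAX_KEYTERM_LENGTH} characters: {term!r}")
        if "," in term:
            # A comma inside a term would be read as a separator.
            raise ValueError(f"keyterm must not contain a comma: {term!r}")
    return ",".join(cleaned)


def check_input(payload):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if not payload.get("audio"):
        problems.append("audio is required")
    for name, (low, high) in RANGES.items():
        if name in payload and not low <= payload[name] <= high:
            problems.append(f"{name} must be between {low} and {high}")
    granularity = payload.get("timestamps_granularity", "word")
    if granularity not in GRANULARITIES:
        problems.append("timestamps_granularity must be one of: none, word, character")
    if payload.get("num_speakers", 0) and not payload.get("diarize", False):
        problems.append("num_speakers is only used when diarize is true")
    return problems
```

For example, `check_input({"audio": "https://example.com/a.mp3", "num_speakers": 4})` reports that `num_speakers` has no effect unless `diarize` is also enabled.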

Version: 5cd433d181bb | Updated: 7/25/2026 | 8.1K runs