nicknaskida/whisper-diarization

Capabilities

No capability data available

Cost

Community model (cost estimated from hardware time)

Input Parameters

Name	Type	Description	Default	Constraints
`batch_size`	integer	Batch size for inference. Reduce it if you hit an out-of-memory (OOM) error.	`64`	min: 1
`file`	string(uri)	An audio file (alternative to `file_string` and `file_url`).	`—`	—
`file_string`	string	A Base64-encoded audio file (alternative to `file` and `file_url`).	`—`	—
`file_url`	string	A direct audio file URL (alternative to `file` and `file_string`).	`—`	—
`group_segments`	boolean	Group segments from the same speaker that are less than 2 seconds apart.	`true`	—
`hf_token`	string	A Hugging Face access token (hf.co/settings/token) that Pyannote.audio uses to diarise the audio clips. You must first agree to the terms at 'https://huggingface.co/pyannote/speaker-diarization-3.1' and 'https://huggingface.co/pyannote/segmentation-3.0'.	`—`	—
`language`	string	Language of the spoken words as a language code like 'en'. Leave empty to auto detect language.	`—`	—
`num_speakers`	integer	Number of speakers. Leave empty to autodetect.	`2`	min: 1, max: 50
`offset_seconds`	integer	Offset in seconds, used for chunked inputs	`0`	min: 0
`prompt`	string	Vocabulary: provide names, acronyms and loanwords in a list. Use punctuation for best accuracy.	`—`	—
`transcript_output_format`	string	Format of the transcript output: individual words with timestamps, full text of segments, or a combination of both.	`"both"`	one of: `words_only`, `segments_only`, `both`
`translate`	boolean	Translate the speech into English.	`false`	—
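
The defaults and constraints in the table above can be checked on the client side before a request is sent. The following is a minimal sketch, not part of the model itself: `build_input` and `encode_audio` are hypothetical helper names, and the rule that exactly one of `file`, `file_string`, or `file_url` must be given is an assumption inferred from the "either/or" wording of the descriptions.

```python
import base64
from typing import Any, Dict, Optional

# Allowed values for `transcript_output_format`, as listed in the table.
ALLOWED_FORMATS = {"words_only", "segments_only", "both"}


def encode_audio(data: bytes) -> str:
    """Base64-encode raw audio bytes for use as `file_string`."""
    return base64.b64encode(data).decode("ascii")


def build_input(
    *,
    file: Optional[str] = None,
    file_string: Optional[str] = None,
    file_url: Optional[str] = None,
    batch_size: int = 64,
    group_segments: bool = True,
    num_speakers: Optional[int] = 2,
    offset_seconds: int = 0,
    transcript_output_format: str = "both",
    translate: bool = False,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    hf_token: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate parameters against the table and return an input dict."""
    # Assumption: exactly one audio source should be supplied.
    sources = {"file": file, "file_string": file_string, "file_url": file_url}
    given = {name: value for name, value in sources.items() if value}
    if len(given) != 1:
        raise ValueError("provide exactly one of file, file_string, file_url")

    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    if num_speakers is not None and not 1 <= num_speakers <= 50:
        raise ValueError("num_speakers must be between 1 and 50")
    if offset_seconds < 0:
        raise ValueError("offset_seconds must be >= 0")
    if transcript_output_format not in ALLOWED_FORMATS:
        raise ValueError("transcript_output_format must be one of "
                         + ", ".join(sorted(ALLOWED_FORMATS)))

    payload: Dict[str, Any] = {
        **given,
        "batch_size": batch_size,
        "group_segments": group_segments,
        "offset_seconds": offset_seconds,
        "transcript_output_format": transcript_output_format,
        "translate": translate,
    }
    # Optional values are omitted when unset, so the model can autodetect
    # the language or the number of speakers.
    optional = {
        "num_speakers": num_speakers,
        "language": language,
        "prompt": prompt,
        "hf_token": hf_token,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

The returned dictionary is intended to be passed as the `input` argument of a Replicate client call such as `replicate.run(...)`; the model reference and API token handling are left out here because they depend on your account setup. Passing `num_speakers=None` drops the key entirely, which matches the "leave empty to autodetect" description.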

Version: c643440e783b, Updated: 7/25/2026, 451 runs