nicknaskida/whisper-diarization
⚡️ Insanely Fast audio transcription | whisper large-v3 | speaker diarization | word & sentence level timestamps | prompt | hotwords. Fork of thomasmol/whisper-diarization. Added batched whisper, 3x-4x speedup 🚀
Capabilities
Cost
Community model (estimated from hardware time)
Input Parameters
| Name | Type | Description | Default | Constraints |
|---|---|---|---|---|
| batch_size | integer | Batch size for inference. Reduce it if you hit an out-of-memory (OOM) error. | 64 | min: 1 |
| file | string (uri) | An audio file. Provide one of `file`, `file_string`, or `file_url`. | — | — |
| file_string | string | A Base64-encoded audio file (alternative to `file` or `file_url`). | — | — |
| file_url | string | A direct URL to an audio file (alternative to `file` or `file_string`). | — | — |
| group_segments | boolean | Group segments from the same speaker that are less than 2 seconds apart. | true | — |
| hf_token | string | A Hugging Face access token (create one at hf.co/settings/token) that Pyannote.audio uses to diarise the audio clips. You must first agree to the terms at 'https://huggingface.co/pyannote/speaker-diarization-3.1' and 'https://huggingface.co/pyannote/segmentation-3.0'. | — | — |
| language | string | Language of the spoken words, as a language code such as 'en'. Leave empty to auto-detect the language. | — | — |
| num_speakers | integer | Number of speakers. Leave empty to auto-detect. | 2 | min: 1, max: 50 |
| offset_seconds | integer | Offset in seconds, used for chunked inputs. | 0 | min: 0 |
| prompt | string | Vocabulary: provide names, acronyms, and loanwords as a list. Use punctuation for best accuracy. | — | — |
| transcript_output_format | string | Format of the transcript output: individual words with timestamps, full text of segments, or a combination of both. | "both" | words_only, segments_only, both |
| translate | boolean | Translate the speech into English. | false | — |
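
The constraints in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of the model's own code: `build_input` is a hypothetical helper, it assumes that exactly one audio source should be supplied (based on the "either/or" wording of the descriptions), and it covers only `file_url` and `file_string`, not direct `file` uploads. The table does not say whether omitting `num_speakers` triggers auto-detection or falls back to the listed default of 2, so the helper simply leaves unset optional fields out of the payload.

```python
import base64
from typing import Any, Dict, Optional

# Allowed values for transcript_output_format, taken from the table.
ALLOWED_FORMATS = {"words_only", "segments_only", "both"}


def build_input(
    *,
    file_url: Optional[str] = None,
    file_bytes: Optional[bytes] = None,
    hf_token: Optional[str] = None,
    batch_size: int = 64,
    num_speakers: Optional[int] = None,
    offset_seconds: int = 0,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    group_segments: bool = True,
    transcript_output_format: str = "both",
    translate: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: validate values against the table, return a payload dict."""
    # Assumption: exactly one audio source is expected.
    if (file_url is None) == (file_bytes is None):
        raise ValueError("provide exactly one of file_url or file_bytes")
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    if num_speakers is not None and not 1 <= num_speakers <= 50:
        raise ValueError("num_speakers must be between 1 and 50")
    if offset_seconds < 0:
        raise ValueError("offset_seconds must be >= 0")
    if transcript_output_format not in ALLOWED_FORMATS:
        raise ValueError(f"transcript_output_format must be one of {sorted(ALLOWED_FORMATS)}")

    payload: Dict[str, Any] = {
        "batch_size": batch_size,
        "group_segments": group_segments,
        "offset_seconds": offset_seconds,
        "transcript_output_format": transcript_output_format,
        "translate": translate,
    }
    if file_url is not None:
        payload["file_url"] = file_url
    else:
        # file_string expects the audio bytes encoded as Base64 text.
        payload["file_string"] = base64.b64encode(file_bytes).decode("ascii")

    # Optional fields are left out of the payload when they are not set.
    optional = {
        "hf_token": hf_token,
        "num_speakers": num_speakers,
        "language": language,
        "prompt": prompt,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

For example, `build_input(file_url="https://example.com/talk.wav", language="en", batch_size=16)` returns a dictionary that could be passed as the input of whichever client is used to call the hosted model; the URL here is a placeholder.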
c643440e783b · Updated: 2/26/2026 · 451 runs