victor-upmeet/whisperx
Accelerated transcription, word-level timestamps and diarization with whisperX large-v3
Capabilities
Cost
Community model (estimated from hardware time)
Input Parameters
| Name | Type | Description | Default | Constraints |
|---|---|---|---|---|
| audio_file * | string (uri) | Audio file | — | — |
| align_output | boolean | Aligns whisper output to get accurate word-level timestamps | false | — |
| batch_size | integer | Parallelization of input audio transcription | 64 | — |
| debug | boolean | Print out compute/inference times and memory usage information | false | — |
| diarization | boolean | Assign speaker ID labels | false | — |
| huggingface_access_token | string | To enable diarization, enter your HuggingFace token (read). You need to accept the user agreement for the models specified in the README. | — | — |
| initial_prompt | string | Optional text to provide as a prompt for the first window | — | — |
| language | string | ISO code of the language spoken in the audio; specify None to perform language detection | — | — |
| language_detection_max_tries | integer | If language is not specified, the language is detected following the logic of the language_detection_min_prob parameter, but detection stops after the given maximum number of tries. If that maximum is reached, the most probable language is kept. | 5 | — |
| language_detection_min_prob | number | If language is not specified, the language is detected recursively on different parts of the file until detection reaches the given probability | 0 | — |
| max_speakers | integer | Maximum number of speakers if diarization is activated (leave blank if unknown) | — | — |
| min_speakers | integer | Minimum number of speakers if diarization is activated (leave blank if unknown) | — | — |
| temperature | number | Temperature to use for sampling | 0 | — |
| vad_offset | number | VAD offset | 0.363 | — |
| vad_onset | number | VAD onset | 0.5 | — |
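
As a rough illustration of how the parameters above fit together, the sketch below assembles an input payload using the defaults listed in the table. The helper names `build_input` and `transcribe` are hypothetical and not part of the model. The `min_speakers`/`max_speakers` ordering check is an added sanity check, not documented behavior. The final call assumes the third-party `replicate` Python client and a model reference supplied by the caller; the exact reference format (for example, whether a full version id is required) is not stated here.

```python
from typing import Any, Dict

# Defaults copied from the parameter table; parameters without a
# listed default are left out unless the caller supplies them.
DEFAULTS: Dict[str, Any] = {
    "align_output": False,
    "batch_size": 64,
    "debug": False,
    "diarization": False,
    "language_detection_max_tries": 5,
    "language_detection_min_prob": 0,
    "temperature": 0,
    "vad_offset": 0.363,
    "vad_onset": 0.5,
}

# Parameters that the table lists without a default value.
OPTIONAL_KEYS = {
    "huggingface_access_token",
    "initial_prompt",
    "language",
    "max_speakers",
    "min_speakers",
}


def build_input(audio_file: str, **overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides into the table defaults."""
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    payload: Dict[str, Any] = {"audio_file": audio_file, **DEFAULTS}
    # Skip None so that "leave blank" parameters stay out of the payload.
    payload.update({k: v for k, v in overrides.items() if v is not None})

    # The table says a HuggingFace token is needed to enable diarization.
    if payload["diarization"] and not payload.get("huggingface_access_token"):
        raise ValueError("diarization requires huggingface_access_token")

    # Assumption: a minimum above the maximum is treated as a caller error.
    lo, hi = payload.get("min_speakers"), payload.get("max_speakers")
    if lo is not None and hi is not None and lo > hi:
        raise ValueError("min_speakers must not exceed max_speakers")
    return payload


def transcribe(model_ref: str, payload: Dict[str, Any]) -> Any:
    """Hypothetical wrapper around the `replicate` client (assumed installed)."""
    import replicate  # imported lazily so build_input works without it

    return replicate.run(model_ref, input=payload)
```

A call such as `build_input("https://example.com/audio.wav", language="en", align_output=True)` returns a dictionary with the table defaults plus the two overrides; the URL is a placeholder.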
84d2ad2d6194 Updated: 2/26/2026 6.4M runs