victor-upmeet/whisperx

Accelerated transcription, word-level timestamps and diarization with whisperX large-v3

Cost

Community model (cost estimated from hardware time)

Input Parameters

audio_file required string

Audio file

align_output boolean

Align the Whisper output to get accurate word-level timestamps

Default: false

batch_size integer

Number of audio segments transcribed in parallel in each batch

Default: 64

debug boolean

Print out compute/inference times and memory usage information

Default: false

diarization boolean

Assign speaker ID labels

Default: false

huggingface_access_token string

Hugging Face access token (read scope), required to enable diarization. You must first accept the user agreements for the models specified in the README.

initial_prompt string

Optional text to provide as a prompt for the first window

language string

ISO code of the language spoken in the audio. Specify None to perform language detection.

language_detection_max_tries integer

If language is not specified, detection follows the logic of the language_detection_min_prob parameter but stops after this many tries. If the limit is reached, the most probable language is kept.

Default: 5

language_detection_min_prob number

If language is not specified, the language is detected repeatedly on different parts of the file until the detection probability reaches this value.

Default: 0
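
The interaction between language_detection_min_prob and language_detection_max_tries can be sketched as a small retry loop. This is only an illustration of the behavior described above, not the model's actual implementation: the function name detect_language, the chunks argument, and the detect callback (which returns a language code and a probability for one part of the file) are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def detect_language(
    chunks: Iterable[object],
    detect: Callable[[object], Tuple[str, float]],
    min_prob: float = 0.0,
    max_tries: int = 5,
) -> Tuple[Optional[str], float]:
    """Try successive parts of the file until a detection is confident enough.

    `detect` is a hypothetical callback returning (language, probability).
    """
    best_lang: Optional[str] = None
    best_prob = -1.0
    for tries, chunk in enumerate(chunks, start=1):
        lang, prob = detect(chunk)
        # Remember the most probable language seen so far.
        if prob > best_prob:
            best_lang, best_prob = lang, prob
        # Stop once the threshold is met or the retry budget is spent.
        if prob >= min_prob or tries >= max_tries:
            break
    return best_lang, best_prob


# Stand-in detector: fixed scores for three made-up parts of a file.
fake_scores = {"intro": ("en", 0.40), "middle": ("fr", 0.55), "end": ("en", 0.92)}

confident = detect_language(fake_scores, fake_scores.get, min_prob=0.9)
capped = detect_language(fake_scores, fake_scores.get, min_prob=0.9, max_tries=2)
first_only = detect_language(fake_scores, fake_scores.get)
```

With the default min_prob of 0, the first detection always meets the threshold, so only one part of the file is examined.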

max_speakers integer

Maximum number of speakers if diarization is activated (leave blank if unknown)

min_speakers integer

Minimum number of speakers if diarization is activated (leave blank if unknown)
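
The diarization-related parameters depend on each other, so it can help to assemble them in one place before sending a request. The sketch below builds an input dictionary using the parameter names listed on this page. The helper name build_diarization_input, the client-side checks, the example URL, and the placeholder token are assumptions for illustration; the model may validate its inputs differently.

```python
from typing import Any, Dict, Optional


def build_diarization_input(
    audio_file: str,
    huggingface_access_token: str,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble inputs for a diarized transcription (hypothetical helper)."""
    if not huggingface_access_token:
        raise ValueError("diarization requires huggingface_access_token")
    if (
        min_speakers is not None
        and max_speakers is not None
        and min_speakers > max_speakers
    ):
        raise ValueError("min_speakers must not exceed max_speakers")

    payload: Dict[str, Any] = {
        "audio_file": audio_file,
        "diarization": True,
        "huggingface_access_token": huggingface_access_token,
    }
    # Unknown speaker counts are left out, matching "leave blank if unknown".
    if min_speakers is not None:
        payload["min_speakers"] = min_speakers
    if max_speakers is not None:
        payload["max_speakers"] = max_speakers
    return payload


# Placeholder values only; substitute a real audio URL and token.
payload = build_diarization_input(
    "https://example.com/meeting.wav", "hf_placeholder", min_speakers=2
)
```

The resulting dictionary can then be passed as the input of whichever client is used to run the model.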

temperature number

Temperature to use for sampling

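Default: 0

vad_offset number

Voice activity detection (VAD) offset threshold: a speech region ends when the speech probability falls below this value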

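Default: 0.363

vad_onset number

Voice activity detection (VAD) onset threshold: a speech region starts when the speech probability rises above this value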

Default: 0.5
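
To show how the two VAD thresholds could work together, here is a minimal sketch of hysteresis thresholding over per-frame speech probabilities. It assumes vad_onset and vad_offset act as pyannote-style onset and offset thresholds, which this page does not state explicitly. The function speech_regions and the sample probabilities are made up for illustration and are not the model's own code.

```python
from typing import List, Optional, Tuple


def speech_regions(
    probs: List[float], onset: float = 0.5, offset: float = 0.363
) -> List[Tuple[int, int]]:
    """Return speech regions as (start, end) frame indices, end exclusive."""
    regions: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, p in enumerate(probs):
        if start is None and p > onset:
            # Probability rose above the onset threshold: speech begins.
            start = i
        elif start is not None and p < offset:
            # Probability fell below the offset threshold: speech ends.
            regions.append((start, i))
            start = None
    if start is not None:
        # Speech was still active at the end of the input.
        regions.append((start, len(probs)))
    return regions


probs = [0.1, 0.6, 0.45, 0.4, 0.3, 0.2, 0.55, 0.7]
regions = speech_regions(probs)
```

Because the offset is lower than the onset, frames at 0.45 and 0.4 stay inside the first region instead of splitting it, which is the point of using two thresholds rather than one.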

Version: 84d2ad2d6194 | Updated: 2/26/2026 | Runs: 6.4M