victor-upmeet/whisperx

Accelerated transcription, word-level timestamps and diarization with whisperX large-v3

Capabilities

No capability data available

Cost

Community model (estimated from hardware time)

Input Parameters

Name	Type	Description	Default	Constraints
`audio_file`*	string(uri)	Audio file	`—`	—
`align_output`	boolean	Aligns whisper output to get accurate word-level timestamps	`false`	—
`batch_size`	integer	Parallelization of input audio transcription	`64`	—
`debug`	boolean	Print out compute/inference times and memory usage information	`false`	—
`diarization`	boolean	Assign speaker ID labels	`false`	—
`huggingface_access_token`	string	HuggingFace token (read access), required to enable diarization. You need to accept the user agreement for the models specified in the README.	`—`	—
`initial_prompt`	string	Optional text to provide as a prompt for the first window	`—`	—
`language`	string	ISO code of the language spoken in the audio; specify None to perform language detection	`—`	—
`language_detection_max_tries`	integer	If language is not specified, detection follows the logic of the `language_detection_min_prob` parameter but stops after the given maximum number of tries. If that maximum is reached, the most probable language is kept.	`5`	—
`language_detection_min_prob`	number	If language is not specified, the language is detected recursively on different parts of the file until the detection reaches the given probability	`0`	—
`max_speakers`	integer	Maximum number of speakers if diarization is activated (leave blank if unknown)	`—`	—
`min_speakers`	integer	Minimum number of speakers if diarization is activated (leave blank if unknown)	`—`	—
`task`	string	Task to perform on the audio file. Options are: transcribe, translate (English only)	`"transcribe"`	transcribe, translate
`temperature`	number	Temperature to use for sampling	`0`	—
`user_agent`	string	Override the User-Agent used to download the audio file. Useful when the host blocks the default value.	`—`	—
`vad_offset`	number	VAD offset	`0.363`	—
`vad_onset`	number	VAD onset	`0.5`	—
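
The table above can be turned into a small input builder. The sketch below is illustrative only: `build_input`, `DEFAULTS`, `OPTIONAL_PARAMS`, and `ALLOWED_TASKS` are hypothetical names, not part of the model. The defaults and the `task` options are copied from the table. The check that diarization needs a token follows the `huggingface_access_token` description. The `min_speakers <= max_speakers` check is an added sanity assumption, not documented behavior.

```python
from typing import Any, Dict

# Defaults copied from the parameter table; parameters without a listed
# default are left out of the payload unless the caller supplies them.
DEFAULTS: Dict[str, Any] = {
    "align_output": False,
    "batch_size": 64,
    "debug": False,
    "diarization": False,
    "language_detection_max_tries": 5,
    "language_detection_min_prob": 0,
    "task": "transcribe",
    "temperature": 0,
    "vad_offset": 0.363,
    "vad_onset": 0.5,
}

# Parameters the table lists without a default value.
OPTIONAL_PARAMS = {
    "huggingface_access_token",
    "initial_prompt",
    "language",
    "max_speakers",
    "min_speakers",
    "user_agent",
}

ALLOWED_TASKS = {"transcribe", "translate"}


def build_input(audio_file: str, **overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides onto the table defaults and validate."""
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_PARAMS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    payload: Dict[str, Any] = {"audio_file": audio_file, **DEFAULTS, **overrides}

    if payload["task"] not in ALLOWED_TASKS:
        raise ValueError(f"task must be one of {sorted(ALLOWED_TASKS)}")

    # Per the table, a HuggingFace token is needed to enable diarization.
    if payload["diarization"] and not payload.get("huggingface_access_token"):
        raise ValueError("diarization requires huggingface_access_token")

    # Assumption: a minimum above the maximum is treated as a caller error.
    lo, hi = payload.get("min_speakers"), payload.get("max_speakers")
    if lo is not None and hi is not None and lo > hi:
        raise ValueError("min_speakers must not exceed max_speakers")

    return payload


if __name__ == "__main__":
    example = build_input(
        "https://example.com/meeting.wav",  # placeholder URI
        align_output=True,
        language="en",
    )
    print(example)
```

The resulting dictionary has the shape of an input mapping for this model. The client call that submits it is left out because this page does not document one.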

Version: 655845d6190e

Updated: 7/25/2026

7.8M runs