cjwbw/voicecraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

Capabilities

Seed Top-P

Cost

Community model (estimated from hardware time)

Input Parameters

orig_audio required string

Original audio file

target_transcript required string

Transcript of the target audio file

cut_off_sec number

Only used for the zero-shot text-to-speech task. The number of seconds from the start of the original audio to use as the voice reference. 3 seconds of reference is generally enough for high-quality voice cloning, but longer is generally better; try, for example, 3 to 6 seconds

Default: 3.01
kvcache integer

Set to 0 to use less VRAM, but with slower inference

Default: 1
Options: 0, 1
left_margin number

Margin to the left of the editing segment

Default: 0.08
orig_transcript string

Optionally provide the transcript of the input audio. Leave it blank to have the WhisperX model selected below generate the transcript. An inaccurate transcript may cause errors in TTS or speech editing

Default: ""
right_margin number

Margin to the right of the editing segment

Default: 0.08
sample_batch_size integer

Default value for TTS is 4, and 1 for speech editing. Under the hood, the model generates this many samples and chooses the shortest one, so the higher the number, the faster the output speech tends to be

Default: 4
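The selection rule described for sample_batch_size can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model's actual code: `pick_shortest` is a hypothetical helper, and the integer lists merely stand in for generated audio token sequences.

```python
from typing import List, Sequence


def pick_shortest(candidates: Sequence[List[int]]) -> List[int]:
    """Return the shortest candidate; on a tie, the earliest one wins."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return min(candidates, key=len)


# Hypothetical token sequences standing in for sample_batch_size = 3 generations.
samples = [[7, 7, 2, 9, 4], [7, 2, 9], [7, 2, 2, 9]]
shortest = pick_shortest(samples)
```

Keeping the shortest of several samples is one plausible reason a larger batch helps with long silences or stretched words: those defects make a sample longer, so it is less likely to be the one selected.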
seed integer

Random seed. Leave blank to randomize the seed

stop_repetition integer

Default value for TTS is 3, and -1 for speech editing. -1 means the probability of silence tokens is not adjusted. If the output contains long silences or unnaturally stretched words, increase sample_batch_size to 2, 3, or even 4

Default: 3
task string

Choose a task

Default: "zero-shot text-to-speech"
Options: "speech_editing-substitution", "speech_editing-insertion", "speech_editing-deletion", "zero-shot text-to-speech"
temperature number

Adjusts the randomness of outputs: values greater than 1 are more random, and 0 is deterministic. Changing this value is not recommended

Default: 1
top_p number

Default value for TTS is 0.9, and 0.8 for speech editing

Default: 0.9
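Several parameters above list one recommended value for TTS and another for speech editing, while the form defaults match the TTS values. The sketch below collects those per-task recommendations in one place. `suggested_sampling` is a hypothetical helper, not part of the model's interface; the values are copied from the parameter descriptions on this page.

```python
from typing import Dict, Union

TTS_TASK = "zero-shot text-to-speech"
EDIT_TASKS = (
    "speech_editing-substitution",
    "speech_editing-insertion",
    "speech_editing-deletion",
)


def suggested_sampling(task: str) -> Dict[str, Union[int, float]]:
    """Return the sampling values this page recommends for the given task."""
    if task == TTS_TASK:
        return {"sample_batch_size": 4, "stop_repetition": 3, "top_p": 0.9}
    if task in EDIT_TASKS:
        return {"sample_batch_size": 1, "stop_repetition": -1, "top_p": 0.8}
    raise ValueError(f"unknown task: {task!r}")


# Example: values to pass explicitly when switching to a speech editing task.
edit_values = suggested_sampling("speech_editing-insertion")
```

Because the form defaults correspond to TTS, a speech editing request probably needs these three values set explicitly rather than left at their defaults.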
voicecraft_model string

Choose a model

Default: "giga330M_TTSEnhanced.pth"
Options: "giga830M.pth", "giga330M.pth", "giga330M_TTSEnhanced.pth"
whisperx_model string

If orig_transcript is not provided above, choose a WhisperX model to generate the transcript. An inaccurate transcript may cause errors in TTS or speech editing. You can edit the generated transcript and provide it directly as orig_transcript above

Default: "base.en"
Options: "base.en", "small.en", "medium.en"
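As a rough illustration of how the parameters above fit together, the sketch below assembles an input payload from the documented defaults and checks the enumerated fields. It makes several assumptions: `build_input` is a hypothetical helper, orig_audio is assumed to accept a URL string, and the example URL is a placeholder. How the payload is submitted depends on the hosting platform's client and is not shown.

```python
from typing import Any, Dict

# Defaults copied from the parameter list on this page (seed has none).
DEFAULTS: Dict[str, Any] = {
    "cut_off_sec": 3.01,
    "kvcache": 1,
    "left_margin": 0.08,
    "orig_transcript": "",
    "right_margin": 0.08,
    "sample_batch_size": 4,
    "stop_repetition": 3,
    "task": "zero-shot text-to-speech",
    "temperature": 1,
    "top_p": 0.9,
    "voicecraft_model": "giga330M_TTSEnhanced.pth",
    "whisperx_model": "base.en",
}

# Fields that accept only the listed values.
ALLOWED: Dict[str, set] = {
    "kvcache": {0, 1},
    "task": {
        "speech_editing-substitution",
        "speech_editing-insertion",
        "speech_editing-deletion",
        "zero-shot text-to-speech",
    },
    "voicecraft_model": {"giga830M.pth", "giga330M.pth", "giga330M_TTSEnhanced.pth"},
    "whisperx_model": {"base.en", "small.en", "medium.en"},
}


def build_input(orig_audio: str, target_transcript: str, **overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the documented defaults and validate enumerated fields."""
    unknown = set(overrides) - set(DEFAULTS) - {"seed"}
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {**DEFAULTS, **overrides}
    payload["orig_audio"] = orig_audio
    payload["target_transcript"] = target_transcript
    for name, allowed in ALLOWED.items():
        if payload[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed, key=str)}")
    return payload


# Placeholder URL and transcript, for illustration only.
example = build_input(
    "https://example.com/reference.wav",
    "Text to speak in the cloned voice.",
    cut_off_sec=4.5,
    seed=42,
)
```

Omitting seed from the payload unless it is supplied mirrors the "leave blank to randomize" behavior described for that parameter.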
Version: db97f6312d4c Updated: 6/8/2026 10.8K runs