zsxkib/thinksound

Generate contextual audio from video using step-by-step reasoning🎶

Name	Type	Description	Default	Constraints
`video`*	string(uri)	Input video file (supports various formats)	`—`	—
`caption`	string	Caption/title describing the video content (optional)	`""`	—
`cfg_scale`	number	Classifier-free guidance scale. Higher values follow conditioning more closely but may reduce creativity	`5`	min: 1, max: 20
`cot`	string	Chain-of-Thought description providing detailed reasoning about the desired audio (optional)	`""`	—
`num_inference_steps`	integer	Number of diffusion denoising steps. More steps = higher quality but slower generation	`24`	min: 10, max: 100
`seed`	integer	Random seed for reproducible outputs. Leave empty for random seed	`—`	—
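
The inputs in the table above can be assembled and range-checked before a request is sent. The following is a minimal sketch, not the model author's code: `build_input` and `run_thinksound` are hypothetical helper names, the bounds are copied from the Constraints column, and the call assumes the third-party `replicate` Python client with `REPLICATE_API_TOKEN` set in the environment. Depending on the client version, the model reference may need a full version id appended (`owner/name:version`); the page shows only a shortened hash, so it is left out here.

```python
from typing import Any, Dict, Optional

# Bounds copied from the Constraints column of the input table.
CFG_SCALE_RANGE = (1, 20)
STEPS_RANGE = (10, 100)


def build_input(
    video: str,
    caption: str = "",
    cot: str = "",
    cfg_scale: float = 5,
    num_inference_steps: int = 24,
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Build the input payload, rejecting values outside the documented ranges."""
    if not video:
        raise ValueError("video is required (a URI pointing to the input file)")
    lo, hi = CFG_SCALE_RANGE
    if not lo <= cfg_scale <= hi:
        raise ValueError(f"cfg_scale must be between {lo} and {hi}")
    lo, hi = STEPS_RANGE
    if not lo <= num_inference_steps <= hi:
        raise ValueError(f"num_inference_steps must be between {lo} and {hi}")

    payload: Dict[str, Any] = {
        "video": video,
        "caption": caption,
        "cot": cot,
        "cfg_scale": cfg_scale,
        "num_inference_steps": int(num_inference_steps),
    }
    # Omit the key entirely when no seed is given, so the model picks a random one.
    if seed is not None:
        payload["seed"] = int(seed)
    return payload


def run_thinksound(payload: Dict[str, Any], model_ref: str = "zsxkib/thinksound"):
    """Send the payload to Replicate (assumes `pip install replicate` and an API token)."""
    import replicate  # imported lazily so build_input works without the client installed

    return replicate.run(model_ref, input=payload)
```

A typical call would pass a short `caption` plus a longer `cot` string describing the desired sound, for example `build_input("https://example.com/clip.mp4", caption="Rain on a tin roof", cot="Steady rainfall with occasional distant thunder", seed=42)`, and hand the result to `run_thinksound`. Fixing `seed` is what makes repeated runs reproducible.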

Version: 40d08f9f569e | Updated: 7/25/2026 | 8.3K runs