chenxwh/cogvlm2-video

CogVLM2: Visual Language Models for Image and Video Understanding

Capabilities

Max Tokens, Top-P

Community model (estimated from hardware time)

Name	Type	Description	Default	Constraints
`input_video`*	string (URI)	Input video	—	—
`max_new_tokens`	integer	Maximum number of tokens to generate. A word is generally 2-3 tokens.	`2048`	min: 0
`prompt`	string	Input prompt	`"Describe this video."`	—
`temperature`	number	Adjusts randomness of outputs: 0 is deterministic, and values above 1 are increasingly random.	`0.1`	min: 0
`top_p`	number	When decoding text, samples from the most likely tokens whose cumulative probability reaches p; lower values ignore less likely tokens.	`0.1`	min: 0, max: 1

* Required.
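The defaults and constraints in the table can be checked before a request is sent. The following is a minimal sketch, not part of the model's own code: `build_input` and `DEFAULTS` are hypothetical names introduced here, and the example video URL is a placeholder.

```python
from urllib.parse import urlparse

# Defaults copied from the parameter table; the names DEFAULTS and
# build_input are illustrative assumptions, not part of the model's API.
DEFAULTS = {
    "prompt": "Describe this video.",
    "max_new_tokens": 2048,
    "temperature": 0.1,
    "top_p": 0.1,
}


def build_input(input_video, **overrides):
    """Return an input dict that satisfies the documented constraints."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    # input_video is the only required field and must be a URI.
    if not urlparse(str(input_video)).scheme:
        raise ValueError("input_video must be a URI, e.g. https://...")

    params = {**DEFAULTS, **overrides}
    tokens = params["max_new_tokens"]
    if isinstance(tokens, bool) or not isinstance(tokens, int) or tokens < 0:
        raise ValueError("max_new_tokens must be an integer >= 0")
    if params["temperature"] < 0:
        raise ValueError("temperature must be >= 0")
    if not 0 <= params["top_p"] <= 1:
        raise ValueError("top_p must be between 0 and 1")

    return {"input_video": input_video, **params}


# Placeholder URL; only max_new_tokens is overridden here.
payload = build_input("https://example.com/clip.mp4", max_new_tokens=256)
print(payload)
```

A dict built this way could then be passed as the `input` argument of a client call such as `replicate.run(..., input=payload)` from the Replicate Python client, assuming that client is the one in use. The model reference is left out here because this page shows only a shortened version ID.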


Version: 9da7e9a554d3 | Updated: 7/25/2026 | 672.6K runs