lucataco/qwen2.5-omni-7b
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
Capabilities
- Reference Images
- System Prompt
Cost
Community model (estimated from hardware time)
Input Parameters
| Name | Type | Description | Default | Constraints |
|---|---|---|---|---|
| audio | string (uri) | Optional audio input | — | — |
| generate_audio | boolean | Whether to generate audio output | true | — |
| image | string (uri) | Optional image input | — | — |
| prompt | string | Text prompt for the model | — | — |
| system_prompt | string | System prompt for the model | "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech." | — |
| use_audio_in_video | boolean | Whether to use audio in video | true | — |
| video | string (uri) | Optional video input | — | — |
| voice_type | string | Voice type for audio output | "Chelsie" | Chelsie, Ethan |
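The parameters above map directly to the JSON input the model expects. A minimal sketch in Python, assuming the official `replicate` client library; the `build_input` helper below is hypothetical and simply assembles the payload with the documented defaults before overriding any fields you pass:

```python
def build_input(prompt, **overrides):
    """Assemble an input payload for lucataco/qwen2.5-omni-7b,
    starting from the documented defaults for optional fields."""
    payload = {
        "prompt": prompt,
        "generate_audio": True,        # documented default
        "use_audio_in_video": True,    # documented default
        "voice_type": "Chelsie",       # allowed values: "Chelsie", "Ethan"
        "system_prompt": (
            "You are Qwen, a virtual human developed by the Qwen Team, "
            "Alibaba Group, capable of perceiving auditory and visual "
            "inputs, as well as generating text and speech."
        ),
    }
    # Optional media inputs (audio, image, video) are URIs and are
    # only included when the caller supplies them.
    payload.update(overrides)
    return payload

# Usage (requires a REPLICATE_API_TOKEN and a network call, so shown
# commented out; the image URL is a placeholder):
# import replicate
# output = replicate.run(
#     "lucataco/qwen2.5-omni-7b",
#     input=build_input("Describe this image.",
#                       image="https://example.com/photo.png"),
# )
```

Fields left at their defaults can be omitted from the request entirely; the helper includes them only to make the effective values explicit.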
Version: 0ca8160f7aaf · Updated: 2/26/2026 · 31.7K runs