lucataco/qwen2.5-omni-7b
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
Capabilities
- Reference Images
- System Prompt
Cost
Community model (estimated from hardware time)
Input Parameters
| Name | Type | Description | Default | Constraints |
|---|---|---|---|---|
| audio | string (uri) | Optional audio input | — | — |
| generate_audio | boolean | Whether to generate audio output | true | — |
| image | string (uri) | Optional image input | — | — |
| prompt | string | Text prompt for the model | — | — |
| system_prompt | string | System prompt for the model | "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech." | — |
| use_audio_in_video | boolean | Whether to use audio in video | true | — |
| video | string (uri) | Optional video input | — | — |
| voice_type | string | Voice type for audio output | "Chelsie" | Chelsie, Ethan |
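The parameters above map directly to the JSON input the model expects. A minimal sketch in Python, assuming the official `replicate` client library; the `build_input` helper below is hypothetical and simply assembles the payload with the documented defaults before overriding any fields you pass:

```python
def build_input(prompt, **overrides):
    """Assemble an input payload for lucataco/qwen2.5-omni-7b,
    starting from the documented defaults for optional fields."""
    payload = {
        "prompt": prompt,
        "generate_audio": True,        # documented default
        "use_audio_in_video": True,    # documented default
        "voice_type": "Chelsie",       # allowed values: "Chelsie", "Ethan"
        "system_prompt": (
            "You are Qwen, a virtual human developed by the Qwen Team, "
            "Alibaba Group, capable of perceiving auditory and visual "
            "inputs, as well as generating text and speech."
        ),
    }
    # Optional media inputs (audio, image, video) are URIs and are
    # only included when the caller supplies them.
    payload.update(overrides)
    return payload

# Usage (requires a REPLICATE_API_TOKEN and a network call, so shown
# commented out; the image URL is a placeholder):
# import replicate
# output = replicate.run(
#     "lucataco/qwen2.5-omni-7b",
#     input=build_input("Describe this image.",
#                       image="https://example.com/photo.png"),
# )
```

Fields left at their defaults can be omitted from the request entirely; the helper includes them only to make the effective values explicit.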
Version: 0ca8160f7aaf · Updated: 2/26/2026 · 31.7K runs