
lucataco/qwen2.5-omni-7b

Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.

Capabilities

Reference Images · System Prompt

Cost

Community model: cost is estimated from hardware runtime.

Input Parameters

audio (string, optional): Audio input.

generate_audio (boolean): Whether to generate audio output. Default: true

image (string, optional): Image input.

prompt (string): Text prompt for the model.

system_prompt (string): System prompt for the model. Default: "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."

use_audio_in_video (boolean): Whether to use the audio track of the video input. Default: true

video (string, optional): Video input.

voice_type (string): Voice for audio output. Options: Chelsie, Ethan. Default: "Chelsie"
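The parameters above map directly onto an input payload for a prediction. Below is a minimal sketch using the Replicate Python client; the prompt text and media URL are placeholders, and the actual call (commented out) requires a REPLICATE_API_TOKEN and the `replicate` package:

```python
# Build the input payload from the parameters documented above.
# audio/image/video are optional; only those you pass are used.
model_input = {
    "prompt": "Describe what is happening in this clip.",  # placeholder
    "video": "https://example.com/clip.mp4",  # placeholder URL (assumption)
    "generate_audio": True,       # default: true
    "use_audio_in_video": True,   # default: true
    "voice_type": "Chelsie",      # options: Chelsie, Ethan
}

# Uncomment to run against the Replicate API:
# import replicate
# output = replicate.run(
#     "lucataco/qwen2.5-omni-7b:0ca8160f7aaf",
#     input=model_input,
# )

print(sorted(model_input))
```

With `generate_audio` set to true, the output includes both the text response and a speech waveform rendered in the selected voice.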
Version: 0ca8160f7aaf · Updated: 2/26/2026 · 31.7K runs