Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

SIGGRAPH 2026

1Hong Kong University of Science and Technology
2WeChat Vision, Tencent Inc
3Peking University
Corresponding authors

PART I: Understanding

PART II: Generation

1. Text-to-Audio

A bird chirps briefly, followed by a grunt, and then another grunt.
A clock ticking.
A telephone dials twice, followed by the sound of glass shattering.
A toilet flush occurs from 1.616 to 4.458 seconds, followed by a rumble between 6.044 and 10 seconds.
Footsteps followed by rapid gunshots and people speaking.
Vibrations from a running sewing machine with a woman speaking.

2. Text-to-Music

Compose a bright jazz swing instrumental with walking bass, brushed drums, and a lively horn melody.
Create a cheerful acoustic folk song accompaniment with strummed guitar, light percussion, and a whistling-style lead melody.
Generate a classical-style piano solo in a minor key with expressive dynamics, rubato passages, and a memorable recurring motif.
Generate a Latin-inspired salsa groove with lively percussion, piano montuno, and bright horn riffs.
Generate an energetic rock instrumental with distorted guitars and driving drums.
Write a cinematic orchestral cue that starts quietly and builds to a triumphant climax with brass, strings, and big drums.

3. Video-to-Audio

4. Video-to-Music

5. Text-to-Speech

This wasn't a strange place; it was a new one.
The area was swirling in dust so intense that it hid the moon from view.
The virus spread faster than wildfire.
The lady in the red car is crossing the bridge.
Can you help me find Preaching to the Perverted?
There is no universal definition of intelligence, but everyone agrees that the ability of learning belongs to it.
Grumbling makes the loaf no larger.
They climbed a tall crane without securing themselves and filmed it for their YouTube channel.
Now I will drift gently off to dreamland.
A cloud of dust stung his tender eyes.
Did you write it or didn't you?
The alchemist erased the circle in the sand, and the snake slithered away among the rocks.

6. Voice Conversion (Zero-shot)

Prompt Audio Edit Prompt Edit Audio
The composition and specific functions of the signoria varied from city to city. She was named for Secretary of the Navy Isaac Toucey.
My palms were sweaty and my mouth was dry, it must be a phobia. Nobody ever received the Nobel prize for the invention of chewing tobacco.
Considering the growth of human population, insects might be the food of the future. When I first started using Dragon, I was bummed to be restricted to Windows.

7. Cross-lingual Generation (Emergent Ability)

English: Thunder and rain during a sad piano solo.
Chinese: 伤感的钢琴独奏中,伴随着雷声和雨声。
Japanese: 悲しいピアノソロの中、雷と雨の音
French: Un solo de piano mélancolique accompagné par le tonnerre et la pluie.
German: Ein melancholisches Klaviersolo, während draußen ein Gewitter mit Regen tobt.

8. Reasoning & Knowledge Augmentation Generation (Emergent Ability)

The sound of the instrument that John Bonham played in Led Zeppelin.
Generate music using the instrument Jimi Hendrix played in a happy mood.
The sound of the machine used to dry your hair after a shower.
Generate music with the traditional instrument you would hear in the Highlands of Scotland.

9. In-Context Generation (Emergent Ability)

Input Audio Prompt Edit Result
With the instrument in the input sample, generate a dramatic cinematic chord progression that builds tension.

PART III: Editing

1. Adding

Input Audio Edit Result Ground-Truth
Prompt: Add the sound of 'skateboarding' to the input audio.
Prompt: Add the sound of 'lions roaring' to the input audio.
Prompt: Add the sound of 'yodelling' to the input audio.

2. Removing

Input Audio Edit Result Ground-Truth
Prompt: Remove the sound of 'running electric fan' from the input audio.
Prompt: Remove the sound of 'female singing' from the input audio.
Prompt: Remove the sound of 'airplane' from the input audio.

3. Extracting

Input Audio Edit Result Ground-Truth
Prompt: Extract the sound of 'ambulance siren' from the input audio.
Prompt: Extract the sound of 'horse neighing' from the input audio.
Prompt: Extract the sound of 'wood thrush calling' from the input audio.

4. Style Transfer

Input Audio Edit Result Ground-Truth
Prompt: Change the sound of 'crow cawing' to 'baby crying'.
Prompt: Change the sound of 'dog barking' to 'hammering'.
Prompt: Change the sound of 'playing french horn' to 'playing clarinet'.

5. Speech Editing (Zero-shot)

Input Audio Edit Result
Original Prompt: Television can make you dumb, but it can also be good education.
Edit Prompt: Television can make you dumb, but it can also be bad education.
Original Prompt: Even if you're an expert, reading all of it is often not feasible due to lack of time.
Edit Prompt: Even if you're an expert, reading all of it is often not possible due to lack of time.
Original Prompt: Contrary to common belief, children won't gain computer literacy, if you just hand them tablet computers and smartphones.
Edit Prompt: Contrary to common belief, children won't gain digital skills, if you just hand them tablet computers and smartphones.

Teaser

An overview of the Audio-Omni framework and its capabilities.

Method

The Audio-Omni Framework.

BibTeX

If you find our work useful, please consider citing:

@article{tian2025audioomni,
  title={Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing},
  author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and Guo, Yike},
  journal={arXiv preprint},
  year={2025}
}