Performance comparison of AudioX against baselines. (a) Comprehensive comparison across multiple benchmarks via Inception Score. (b) Results on instruction-following benchmarks.
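Panel (a) reports the Inception Score. For reference, the standard definition is the exponentiated expected KL divergence between a classifier's label posterior for each generated sample and the marginal label distribution; note that the specific audio classifier used for evaluation is not stated in this caption:

$$
\mathrm{IS} = \exp\Big( \mathbb{E}_{x \sim p_g} \big[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big)
$$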
Audio and music generation from flexible multimodal control signals has broad applications but faces two key challenges: 1) the lack of a unified multimodal modeling framework, and 2) the scarcity of large-scale, high-quality training data. We therefore propose AudioX, a unified framework for anything-to-audio generation that integrates diverse multimodal conditions (i.e., text, video, image, and audio signals). Its core design is a Multimodal Adaptive Fusion module, which effectively fuses diverse multimodal inputs, strengthening cross-modal alignment and improving overall generation quality. To train this unified model, we construct IF-caps, a large-scale, high-quality dataset of over 7 million samples curated through a structured data annotation pipeline, providing comprehensive supervision for multimodal-conditioned audio generation. Benchmarked against state-of-the-art methods across a wide range of tasks, AudioX achieves superior performance, particularly in text-to-audio and text-to-music generation. These results demonstrate that our method can generate audio under multimodal control signals and exhibits strong instruction-following ability.
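As a rough illustration of what a multimodal fusion stage of this kind can look like, the sketch below projects text, video, and audio condition embeddings into a shared space, concatenates them into one conditioning sequence, and lets the modalities attend to each other. All module names, dimensions, and design choices here are assumptions for illustration; this is not the actual AudioX implementation.

```python
# Minimal sketch of a multimodal conditioning fusion step, in the spirit of the
# Multimodal Adaptive Fusion module described above. Dimensions, the learned
# type embeddings, and the self-attention fusion are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, text_dim=768, video_dim=1024, audio_dim=512,
                 d_model=1024, n_heads=8):
        super().__init__()
        # Project each modality's embeddings into a shared model dimension.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # Per-modality type embeddings so tokens remain distinguishable
        # after concatenation.
        self.type_emb = nn.Embedding(3, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_tokens, video_tokens, audio_tokens):
        # Each input: (batch, seq_len_modality, modality_dim).
        parts = []
        for i, (tokens, proj) in enumerate(
            [(text_tokens, self.text_proj),
             (video_tokens, self.video_proj),
             (audio_tokens, self.audio_proj)]
        ):
            parts.append(proj(tokens) + self.type_emb.weight[i])
        cond = torch.cat(parts, dim=1)          # unified conditioning sequence
        fused, _ = self.attn(cond, cond, cond)  # modalities attend to each other
        return fused

# Toy usage with random embeddings standing in for encoder outputs.
fusion = MultimodalFusion()
text = torch.randn(2, 16, 768)
video = torch.randn(2, 32, 1024)
audio = torch.randn(2, 64, 512)
print(fusion(text, video, audio).shape)  # torch.Size([2, 112, 1024])
```

In a diffusion-transformer setup, a fused sequence like this would typically serve as cross-attention context for the denoiser.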
Example generation prompts:

- Thunder and rain during a sad piano solo
- Ocean waves crashing
- Typing on a keyboard
- A machine gun fires twice, followed by a period of silence, then the sound of waves and surf.
- Fireworks burst twice, followed by a period of silence before a clock begins ticking
- Screaming occurs from 1.4 to 3.0 seconds, and a waterfall is heard from 5.0 to 10.0 seconds
- A suspenseful scene in a haunted mansion
- Uplifting ukulele tune for a travel vlog
- Smooth urban R&B beat with a mellow groove
Audio inpainting demo. Prompt: "Brief speech followed by loud applause and cheering". Clips: Unprocessed, Inpainting Result, Ground-Truth.
Audio completion demo. Prompt: "Action-packed orchestral music with strings, brass, and percussion." Clips: To-be-Continued, Completed.
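The inpainting and completion demos above condition generation on partially observed audio. The sketch below shows one plausible way such masked-audio conditions could be assembled; the sample rate, mask convention, and tensor shapes are assumptions for illustration, not AudioX's actual preprocessing.

```python
# Illustrative construction of masked-audio conditions for inpainting and
# continuation. All specifics (sample rate, mask layout) are assumptions.
import torch
import torch.nn.functional as F

def make_inpaint_condition(wave: torch.Tensor, sr: int, start_s: float, end_s: float):
    """Zero out the span to be regenerated and return (masked audio, keep-mask)."""
    mask = torch.ones_like(wave)
    start, end = int(start_s * sr), int(end_s * sr)
    mask[..., start:end] = 0.0
    return wave * mask, mask

def make_continuation_condition(wave: torch.Tensor, sr: int, total_s: float):
    """Keep the given prefix and pad with silence up to the target length."""
    target_len = int(total_s * sr)
    pad = max(target_len - wave.shape[-1], 0)
    cond = F.pad(wave, (0, pad))
    mask = torch.zeros(target_len)
    mask[: wave.shape[-1]] = 1.0  # 1 = keep, 0 = to be generated
    return cond, mask

sr = 44_100
clip = torch.randn(1, sr * 10)                         # 10 s of stand-in audio
masked, m = make_inpaint_condition(clip, sr, 3.0, 6.0)  # regenerate seconds 3-6
prefix, pm = make_continuation_condition(clip[..., : sr * 4], sr, 10.0)
```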
Side-by-side listening comparisons pair AudioX (Ours) with the following baselines:

| Comparison | Baselines |
|---|---|
| Text-to-audio | AudioLDM-L-Full, AudioLDM-2-Large, Tango 2, Stable Audio Open, MAGNET, Make-an-Audio-2, MMAudio, AudioGen |
| Text-to-music | AudioLDM-L-Full, AudioLDM-2-Large, TangoMusic, Stable Audio Open, MAGNET-large, MusicGen |
| Video-to-audio | MMAudio, Diff-Foley, Seeing&Hearing, FoleyCrafter, FRIEREN |
| Video-to-music | VidMuse, MuMu-LLaMA, Video2Music, CMT |
The AudioX Framework.
If you find our work useful, please consider citing:
```bibtex
@article{tian2025audiox,
  title={AudioX: Diffusion Transformer for Anything-to-Audio Generation},
  author={Tian, Zeyue and Jin, Yizhu and Liu, Zhaoyang and Yuan, Ruibin and Tan, Xu and Chen, Qifeng and Xue, Wei and Guo, Yike},
  journal={arXiv preprint arXiv:2503.10522},
  year={2025}
}
```