AudioX: A Unified Framework for Anything-to-Audio Generation

1HKUST
2Independent Researcher
Corresponding authors

Abstract

Audio and music generation from flexible multimodal control signals has broad applications but poses two key challenges: 1) building a unified multimodal modeling framework, and 2) obtaining large-scale, high-quality training data. To address them, we propose AudioX, a unified framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, image, and audio signals). The core design of the framework is a Multimodal Adaptive Fusion module, which fuses diverse multimodal inputs effectively, enhancing cross-modal alignment and improving overall generation quality. To train this unified model, we construct a large-scale, high-quality dataset, IF-caps, comprising over 7 million samples curated through a structured data annotation pipeline. This dataset provides comprehensive supervision for multimodal-conditioned audio generation. We benchmark AudioX against state-of-the-art methods across a wide range of tasks and find that our model achieves superior performance, especially in text-to-audio and text-to-music generation. These results demonstrate that our method can generate audio under multimodal control signals and shows strong instruction-following potential.

Demo Video

Text-to-Audio Generation

Prompt:
Thunder and rain during a sad piano solo

Prompt:
Ocean waves crashing

Prompt:
Typing on a keyboard

Prompt:
A machine gun fires twice, followed by a period of silence, then the sound of waves and surf.

Prompt:
Fireworks burst twice, followed by a period of silence before a clock begins ticking

Prompt:
Screaming occurs from 1.4 to 3.0 seconds, and a waterfall is heard from 5.0 to 10.0 seconds

Text-to-Music Generation

Prompt:
A suspenseful scene in a haunted mansion

Prompt:
Uplifting ukulele tune for a travel vlog

Prompt:
Smooth urban R&B beat with a mellow groove

Video-to-Audio Generation

Video-to-Music Generation

Audio Inpainting

Prompt: Brief speech followed by loud applause and cheering

Unprocessed

Inpainting Result

Ground-Truth

Prompt: A fire engine horn blows, followed by a fire engine siren blowing

Unprocessed

Inpainting Result

Ground-Truth

Music Completion

Prompt: Action-packed orchestral music with strings, brass, and percussion.

To-be-Continued

Completed

Comparison

Text-to-Audio

AudioCaps Benchmark
Methods compared: AudioX (Ours), AudioLDM-L-Full, AudioLDM-2-Large, Tango 2, Stable Audio Open, MAGNET, MMAudio, AudioGen
Sample 1: A clock ticking.
Sample 2: Footsteps followed by rapid gunshots and people speaking.
Sample 3: A person whistling while drums play.
T2A-bench Benchmark
Methods compared: AudioX (Ours), AudioLDM-L-Full, AudioLDM-2-Large, Tango 2, Stable Audio Open, Make-an-Audio-2, MMAudio, AudioGen
Sample 1 (Category): A street musician is playing, with the sound of an acoustic guitar and coins being tossed into a case.
Sample 2 (Count): A series of four deep, resonant bell tolls comes from the old church tower.
Sample 3 (Ordering): The sound of a person typing on a laptop, the fan kicking in, and then the person sighing.
Sample 4 (Timestamp): A piece of paper is crumpled into a ball between 1.0 and 2.5 seconds.
AudioTime Benchmark
Methods compared: AudioX (Ours), AudioLDM-L-Full, AudioLDM-2-Large, Tango 2, Stable Audio Open, Make-an-Audio-2, MMAudio, AudioGen
Sample 1 (Duration): A foghorn sounded for 3.71 seconds.
Sample 2 (Frequency): A horse neighs or whinnies three times.
Sample 3 (Ordering): First, a cricket chirps briefly, then later, a mosquito buzzes continuously.
Sample 4 (Timestamp): A toilet flush occurs from 1.616 to 4.458 seconds, followed by a rumble between 6.044 and 10 seconds.

Text-to-Music

MusicCaps Benchmark
Methods compared: AudioX (Ours), AudioLDM-L-Full, AudioLDM-2-Large, TangoMusic, Stable Audio Open, MAGNET-large, MusicGen
Prompt: This is an orchestra playing Arabic music. The melody is played by the conjunction of multiple instruments such as violin, flute and qanun while tambourines create the percussive background of the piece. It has an oriental atmosphere that could be very fitting in a movie or a show taking place in the Middle East. It could also be a good background music for Arab cuisine restaurants.
V2M-bench Benchmark
Methods compared: AudioX (Ours), AudioLDM-L-Full, AudioLDM-2-Large, TangoMusic, Stable Audio Open, MAGNET-large, MusicGen
Sample 1: Instrumental jazz piece with piano, guitar, drums, and bass.
Sample 2: Instrumental folk song with acoustic guitar, violin, accordion, groovy bass line, and fast tempo percussion.
Sample 3: Punk rock track with electric guitar, bass, drums, aggressive and melodic.

Video-to-Audio

Methods compared: AudioX (Ours), MMAudio, Diff-Foley, Seeing&Hearing, FoleyCrafter, FRIEREN
Sample 1
Sample 2

Video-to-Music

Methods compared: AudioX (Ours), VidMuse, MuMu-LLaMA, Video2Music, CMT
Sample 1
Sample 2

Teaser

Figure: Performance comparison of AudioX against baselines. (a) Comprehensive comparison across multiple benchmarks via Inception Score. (b) Results on instruction-following benchmarks.
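
For readers unfamiliar with the metric in panel (a), the Inception Score is commonly defined as exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a classifier's class posterior for a generated sample and p(y) is the average of those posteriors over all samples. The sketch below is a minimal illustration under that standard definition only. It assumes the posteriors have already been produced by some pretrained audio classifier (which classifier is used for the reported numbers is not stated on this page), and the function name `inception_score` and the toy inputs are hypothetical, not taken from the AudioX code.

```python
import math


def inception_score(probs):
    """Inception Score from per-sample class posteriors.

    probs: list of rows, each row a probability distribution p(y|x)
    over classes. Assumed input; the classifier producing these rows
    is outside the scope of this sketch.
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal p(y): average the posteriors over all samples.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    # Mean KL(p(y|x) || p(y)); entries with p(y|x) = 0 contribute 0.
    mean_kl = 0.0
    for row in probs:
        mean_kl += sum(
            p * math.log(p / m) for p, m in zip(row, marginal) if p > 0
        )
    mean_kl /= n
    return math.exp(mean_kl)


# Toy usage: confident and diverse predictions score higher than
# identical, uncertain ones.
diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
uniform = inception_score([[0.5, 0.5], [0.5, 0.5]])
```

A higher score means each sample is classified confidently while the set as a whole covers many classes; the score is bounded above by the number of classes.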

Method

Figure: The AudioX framework.

BibTeX

If you find our work useful, please consider citing:

@article{tian2025audiox,
  title={AudioX: Diffusion Transformer for Anything-to-Audio Generation},
  author={Tian, Zeyue and Jin, Yizhu and Liu, Zhaoyang and Yuan, Ruibin and Tan, Xu and Chen, Qifeng and Xue, Wei and Guo, Yike},
  journal={arXiv preprint arXiv:2503.10522},
  year={2025}
}