MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

[Figure: performance]


MeanAudio is a MeanFlow-based model for fast and faithful text-to-audio generation. It synthesizes realistic audio in a single step, achieving a real-time factor (RTF) of 0.013 on a single NVIDIA 3090 GPU, and it also performs strongly in multi-step generation.

Single-Step Audio Generation


GT denotes real audio samples taken from the AudioCaps test set. NFE is the number of function evaluations used during sampling.

Caption | MeanAudio (Ours), 1 NFE | AudioLCM, 1 NFE | ConsistencyTTA, 1 NFE | GT
A speech and gunfire followed by a gun being loaded
Some humming followed by a toilet flushing
Typing on a keyboard
Rain falling on a hard surface as thunder roars in the distance
Food sizzling and oil popping
Pots and dishes clanking as a man talks followed by liquid pouring into a container
A few seconds of silence then a rasping sound against wood
A man speaks as he gives a speech and then the crowd cheers
A goat bleating repeatedly
Tires squealing followed by an engine revving

Multi-Step Audio Generation


Caption | MeanAudio (Ours), 25 NFE | GenAU, 200 NFE | AudioLDM, 200 NFE | GT
A speech and gunfire followed by a gun being loaded
Some humming followed by a toilet flushing
Typing on a keyboard
Rain falling on a hard surface as thunder roars in the distance
Food sizzling and oil popping
Pots and dishes clanking as a man talks followed by liquid pouring into a container
A few seconds of silence then a rasping sound against wood
A man speaks as he gives a speech and then the crowd cheers
A goat bleating repeatedly
Tires squealing followed by an engine revving


Overall Architecture


MeanAudio is built on a Flux-style latent transformer that combines N1 joint audio/text blocks with N2 audio-only blocks. It encodes the text captions with FLAN-T5 and CLAP. Conditioned on the encoded text and the timestep embeddings, MeanAudio regresses the average velocity field, which is what enables fast and realistic audio generation.
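To make the role of the average velocity field concrete, here is a minimal sampling sketch. It is not the authors' code: it assumes the standard MeanFlow convention in which t = 1 is pure noise, t = 0 is data, and one update moves the latent from time t to an earlier time r via z_r = z_t - (t - r) * u(z_t, r, t). The function `toy_mean_velocity` is a hypothetical stand-in for the trained, text-conditioned network; it is the exact average velocity only for a toy case where all data collapses to a single target vector.

```python
import numpy as np


def toy_mean_velocity(z, r, t, target):
    """Hypothetical stand-in for the trained network u(z_t, r, t).

    For a toy data distribution concentrated on `target`, the velocity is
    constant along each noise-to-data path, so its average is (z - target) / t.
    """
    return (z - target) / t


def sample(mean_velocity, noise, num_steps=1):
    """Integrate from t = 1 (noise) to t = 0 (data) in `num_steps` jumps.

    Each jump evaluates the mean-velocity function once and applies
    z_r = z_t - (t - r) * u(z_t, r, t).
    """
    times = np.linspace(1.0, 0.0, num_steps + 1)
    z = noise
    for t, r in zip(times[:-1], times[1:]):
        z = z - (t - r) * mean_velocity(z, r, t)
    return z


rng = np.random.default_rng(0)
target = np.array([0.5, -1.0, 2.0])    # toy "latent" standing in for audio
noise = rng.standard_normal(target.shape)


def u(z, r, t):
    return toy_mean_velocity(z, r, t, target)


one_step = sample(u, noise, num_steps=1)     # a single jump from t=1 to t=0
multi_step = sample(u, noise, num_steps=25)  # 25 smaller jumps
```

With `num_steps=1` the whole trajectory is covered by one call to the mean-velocity function, which is the single-step regime shown in the demos above. Larger values of `num_steps` correspond to the multi-step regime. In a real system the output would be a latent that still has to be decoded to a waveform, which this sketch leaves out.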


[Figure: architecture]


Main Results


MeanAudio achieves state-of-the-art performance in single-step generation and strong performance in multi-step generation. It delivers a real-time factor (RTF) of 0.013 on a single NVIDIA 3090 GPU, a 100x speedup over existing diffusion-based models.
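As a reading aid for the RTF figure, the short sketch below uses the usual definition of real-time factor: synthesis time divided by the duration of the generated audio. The 10-second clip length is an assumption chosen for illustration, not a value taken from this page, and the helper names are hypothetical.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Invert the definition: how long generation takes at a given RTF."""
    return rtf * audio_seconds


# Assumed clip length of 10 s (illustrative only).
clip_seconds = 10.0
meanaudio_seconds = synthesis_time(0.013, clip_seconds)

# A model that is 100x slower has an RTF 100x larger.
slower_seconds = synthesis_time(0.013 * 100, clip_seconds)
```

Under these assumptions, an RTF of 0.013 means a 10-second clip takes about 0.13 seconds to generate, while a model 100 times slower would need about 13 seconds, which is longer than the clip itself.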


[Figure: main results table]