Giving Voice to AI: A New Approach to Text-to-Audio Synthesis

Author: Denis Avetisyan


Researchers have developed a system that refines AI-generated audio during training using online feedback from large audio language models, dramatically improving quality and naturalness.

Resonate pairs a flow-matching model architecture with a Flow-GRPO training paradigm, acknowledging that even innovative frameworks ultimately contribute to the inevitable accumulation of technical debt as production use cases challenge initial theoretical elegance.

This work introduces Resonate, a text-to-audio generator leveraging online reinforcement learning with large audio language models for superior performance in audio quality and semantic alignment.

While reinforcement learning has proven effective for enhancing language and image generation, its application to text-to-audio synthesis remains comparatively underexplored. This paper introduces ‘Resonate: Reinforcing Text-to-Audio Generation via Online Feedback from Large Audio Language Models’, a novel approach leveraging online group relative policy optimization with rewards derived from large audio language models. Resonate, a 470M parameter model, achieves state-of-the-art performance on TTA-Bench, demonstrably improving both audio quality and semantic alignment. Could this integration of online RL and large audio language models unlock a new era of realistic and contextually relevant audio generation?


The Illusion of Real Sound: Why We’re Still Chasing the Ghost

Current text-to-audio (TTA) systems frequently fall short of truly realistic sound production due to difficulties in capturing the subtleties of human expression and the complexities of musical or narrative structure. These models often struggle to translate textual cues – such as emotional intent, emphasis, or the relationships between different sonic events – into convincing auditory experiences. While capable of generating basic speech or simple melodies, they often produce audio that sounds flat, robotic, or lacks the dynamic variation present in natural soundscapes. The core challenge lies in moving beyond literal interpretations of text and enabling the system to infer the underlying communicative intent and translate that into appropriate acoustic features, a process requiring both a deep understanding of language and an ability to reason about how sounds combine to create a cohesive and engaging auditory scene.

Current text-to-audio systems frequently depend on multi-stage processing pipelines – involving separate steps for acoustic feature prediction, vocoding, and sometimes even explicit duration modeling – which introduces compounding errors and limits adaptability. Critically, these models are often trained on relatively small datasets of curated speech or music, creating a significant bottleneck in their ability to convincingly synthesize sounds outside of those narrow training conditions. This data scarcity hinders generalization to novel prompts – a request for a sound not well-represented in the training data might result in distorted, unnatural, or simply incorrect audio output. Consequently, a substantial challenge remains in developing systems capable of robustly interpreting diverse textual descriptions and translating them into high-fidelity, contextually appropriate audio experiences.

The pursuit of increasingly realistic audio synthesis has frequently centered on scaling transformer-based models, a strategy that, while demonstrating impressive capabilities, quickly encounters significant computational hurdles. These models, inheriting the quadratic complexity of attention mechanisms, demand substantial processing power and memory, limiting both training speed and the length of audio sequences they can effectively handle. Furthermore, simply increasing model size doesn’t inherently resolve the challenge of efficiently representing audio data; raw waveforms or even conventional spectrograms remain information-rich but computationally burdensome. Researchers are actively exploring alternative representations – such as discrete audio codes or sparse feature spaces – alongside architectural innovations like linear attention and state space models, aiming to decouple model performance from prohibitive computational costs and unlock truly scalable, high-fidelity audio generation.
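The scaling pressure described above is easy to see with a back-of-the-envelope calculation. The sequence lengths below are illustrative assumptions, not figures from the paper, but they show why models prefer compressed latent frames over raw samples:

```python
# Back-of-the-envelope cost of self-attention: the score matrix alone
# scales as O(n^2 * d) multiply-adds for sequence length n, head dim d.
def attention_flops(seq_len: int, dim: int) -> int:
    # Q @ K^T and the attention-weighted sum over V each cost n*n*d.
    return 2 * seq_len * seq_len * dim

# Hypothetical numbers: ~10 s of 16 kHz audio as raw samples,
# versus the same clip compressed to ~1,000 latent frames.
raw_cost = attention_flops(160_000, 64)
latent_cost = attention_flops(1_000, 64)
speedup = raw_cost / latent_cost  # quadratic in the length ratio
```

Because attention cost is quadratic in sequence length, a 160x shorter representation buys a 25,600x reduction in attention compute, which is the basic argument for operating in a learned latent space.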

Resonate: A Flow-Based Approach, Because Everything Else Fails Eventually

Resonate utilizes a novel Text-to-Audio (TTA) framework grounded in flow matching, a generative modeling technique distinguished by its capacity to produce high-fidelity audio samples. Flow matching operates by learning a continuous normalizing flow that transforms a simple probability distribution into the complex distribution of the target audio data. This differs from traditional generative adversarial networks (GANs) or variational autoencoders (VAEs) by directly learning a trajectory from noise to data, promoting stable training and improved sample quality. The resulting model effectively maps textual inputs to corresponding audio waveforms by learning this continuous transformation, enabling the synthesis of realistic and nuanced sound based on the provided text prompt.
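The core of flow matching is a plain regression problem: interpolate between noise and data, and train a network to predict the velocity of that path. A minimal sketch, assuming a rectified-flow style linear path and a stand-in `zero_net` in place of the real model:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x0, x1, velocity_net):
    """Conditional flow-matching objective with a linear (rectified-flow)
    path: regress the network onto the constant velocity x1 - x0."""
    batch = x0.shape[0]
    t = rng.uniform(size=(batch, 1))           # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1              # point on the noise->data path
    target_v = x1 - x0                         # velocity along that path
    pred_v = velocity_net(x_t, t)
    return np.mean((pred_v - target_v) ** 2)   # simple regression loss

# Stand-in "network" that always predicts zero velocity.
zero_net = lambda x_t, t: np.zeros_like(x_t)

x0 = rng.standard_normal((8, 4))  # Gaussian noise samples
x1 = rng.standard_normal((8, 4))  # stand-ins for VAE latents of real audio
loss = flow_matching_loss(x0, x1, zero_net)
```

Once trained, sampling amounts to integrating the learned velocity field from noise toward data, which is what makes the training stable relative to adversarial setups.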

Resonate’s architecture utilizes a Flux-style Transformer to process audio data within the Variational Autoencoder (VAE) latent space. This approach offers computational advantages by operating on the reduced-dimensionality latent representation rather than raw audio waveforms, thereby improving processing efficiency. The Transformer architecture, known for its ability to model long-range dependencies, enhances the model’s expressiveness by capturing complex relationships within the audio data. The Flux-style implementation in particular supports stable training and efficient gradient computation, contributing to the overall performance and quality of the generated audio.

The Resonate framework utilizes Reinforcement Learning (RL) to refine audio generation based on human perceptual evaluation. This process involves a reward model informed by Large Audio Language Models (LALMs), which are trained to predict human preferences for audio quality attributes such as naturalness and clarity. The RL agent then adjusts the audio synthesis parameters to maximize this predicted reward, effectively aligning the generated audio with perceived realism. This approach moves beyond traditional loss functions, allowing the model to optimize for subjective qualities that are difficult to define explicitly, resulting in more convincing and human-aligned audio outputs.
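The "group relative" part of GRPO-style training is simple to state: generate several clips for the same prompt, score each with the LALM-based reward, and normalize each reward against its own group. A minimal sketch with hypothetical reward values:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each sample's reward is normalized
    against the group of clips generated for the same prompt,
    so the policy is pushed toward the group's best samples."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical LALM preference scores for 4 clips from one prompt.
rewards = [0.2, 0.9, 0.4, 0.5]
adv = group_relative_advantages(rewards)
```

Because the baseline is the group mean rather than a learned value function, no separate critic is needed; clips scored above their group's average get positive advantage and are reinforced.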

An ablation study demonstrates that each component of Flow-GRPO contributes to overall performance, highlighting their combined importance.

Proof of Concept: Numbers Don’t Lie (Much)

Resonate’s initial training phase leverages a compilation of five distinct audio datasets: AudioSet, a large-scale collection of diverse audio events; Clotho, an audio captioning dataset pairing clips with human-written descriptions; VGGSound, an audio-visual dataset annotated with sound event labels; WavCaps, containing audio clips paired with textual descriptions; and AudioStock, a commercially sourced library of sound effects. This pre-training strategy establishes a robust foundational understanding of audio characteristics and content, enabling the model to generalize effectively across a wide range of auditory environments and subsequently improve performance during fine-tuning and downstream tasks. The diversity in content, labeling methodology, and data source within these datasets contributes to the model’s overall adaptability and reduces potential biases.

Following pre-training, Resonate undergoes fine-tuning specifically utilizing the AudioCaps dataset. This dataset consists of textual captions paired with corresponding audio segments, allowing the model to learn the relationship between language and acoustic features. The fine-tuning process optimizes Resonate’s parameters to more effectively translate textual descriptions into coherent and contextually relevant audio outputs, resulting in improved audio generation quality and adherence to the provided text prompts. This targeted training on AudioCaps is critical for Resonate’s ability to perform text-to-audio synthesis.

Rigorous evaluation of Resonate’s performance utilizes the TTA-Bench benchmark, a standardized measure for audio generation quality. Results demonstrate state-of-the-art performance, achieving an AQAScore of 0.737. This score surpasses that of previously published models, including MeanAudio, which achieved 0.729, and TangoFlux, with a score of 0.677, indicating Resonate’s superior ability to generate high-fidelity audio aligned with textual descriptions as assessed by the TTA-Bench metric.

The Inevitable Plateau: Where Do We Go From Here?

Resonate marks a notable leap forward in Text-to-Audio (TTA) technology, moving beyond flat, literal renderings of a prompt toward audio with greater realism and contextual nuance. This advancement stems from an architecture and training recipe designed to capture what a description implies, not just what it literally states. The system does not merely convert text into sound; it aims to reproduce the timbre, dynamics, and scene structure that convey meaning, resulting in audio that sounds more natural and engaging. Consequently, Resonate unlocks potential for applications demanding expressive audio, such as immersive storytelling, game sound design, and audiobook production. By prioritizing both fidelity and semantic alignment, the system establishes a new benchmark for what is achievable in the field of TTA, paving the way for more convincing and emotionally resonant AI-generated audio experiences.

The innovative synergy between flow matching and reinforcement learning establishes a robust methodology for crafting artificial intelligence-generated audio that resonates with human aesthetic sensibilities. Flow matching initially sculpts a diverse and realistic audio landscape, providing a strong foundation for subsequent refinement. Reinforcement learning then acts as a discerning guide, subtly adjusting the generated content based on feedback that simulates human preference, effectively ‘teaching’ the AI what sounds pleasing. This iterative process, where the AI learns from its ‘mistakes’ and progressively improves its output, results in audio that isn’t just technically proficient but also emotionally engaging and perceptually aligned with human expectations, marking a significant step towards truly expressive and natural-sounding AI-driven audio creation.

Rigorous evaluation confirms that this novel approach to audio AI currently achieves state-of-the-art performance across multiple key metrics. Quantitative analysis, utilizing the CLAP score, yielded a result of 0.476, the highest recorded on the TTA-Bench benchmark, demonstrating superior audio-text alignment. Furthermore, objective assessment of Production Quality reached 6.064, also leading the TTA-Bench rankings, and was reinforced by subjective human evaluations. Listeners awarded an Overall Quality score of 3.86 and a Relevance score of 3.83, consistently surpassing the performance of comparative models and indicating a significant step towards generating audio that is not only technically impressive but also perceptually aligned with human expectations.
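The CLAP score cited above is, at heart, a cosine similarity between an audio clip's embedding and its prompt's embedding from a jointly trained encoder pair. A minimal sketch, with made-up 4-dimensional vectors standing in for real CLAP embeddings (which are far higher-dimensional):

```python
import numpy as np

def clap_score(audio_emb, text_emb):
    """CLAP-style alignment: cosine similarity between an audio
    embedding and a text embedding in a shared space."""
    a = np.asarray(audio_emb, dtype=float)
    t = np.asarray(text_emb, dtype=float)
    return float(a @ t / (np.linalg.norm(a) * np.linalg.norm(t)))

# Hypothetical embeddings, for illustration only.
audio_emb = [0.2, 0.1, 0.9, 0.4]
well_matched = clap_score(audio_emb, [0.2, 0.1, 0.9, 0.4])    # identical
mismatched = clap_score(audio_emb, [-0.2, -0.1, -0.9, -0.4])  # opposite
```

A score near 1.0 indicates the audio and text land close together in the shared embedding space; scores like the reported 0.476 reflect partial alignment averaged over a benchmark's prompts.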

The pursuit of ever more realistic text-to-audio generation, as exemplified by Resonate and its use of online reinforcement learning, feels…familiar. The researchers attempt to align generated audio with semantic meaning using large audio language models as reward signals, a noble effort. However, one suspects that the very metrics used to define ‘state-of-the-art performance’ will soon be revealed as insufficient, or simply shift the goalposts. As David Hilbert famously stated, “We must be able to answer definite questions.” The problem, of course, isn’t the ability to answer them, but defining which questions are truly worth answering before the system inevitably finds new, unforeseen ways to fail. It’s a beautiful theory, until production inevitably exposes its limitations.

What’s Next?

The pursuit of audio fidelity, predictably, will not cease with marginally improved mean opinion scores. This work demonstrates a clever bootstrapping – using one large model to critique the output of another. It feels less like artificial intelligence and more like a very elaborate peer review process, which, anyone who’s shipped code will tell you, is still prone to error. The real test isn’t whether the generated audio sounds good in a research setting, but how it degrades when faced with adversarial inputs, noisy data, or, inevitably, edge cases nobody considered.

One suspects the current enthusiasm for reinforcement learning from language model feedback will eventually run aground on the rocks of reward hacking. The system optimizes for what it can measure, which is rarely what humans actually want. A perfectly aligned waveform, devoid of any subtle human imperfection, may be technically impressive, but also profoundly unsettling. Better one carefully crafted, slightly imperfect sample than a million sterile, algorithmically ‘optimal’ ones.

The claim of ‘online’ learning feels particularly optimistic. Production systems rarely afford the luxury of gradual refinement. More likely, these models will be retrained from scratch every few months, chasing a moving target of acceptable quality. The fundamental problem remains: scaling audio generation isn’t about clever algorithms, it’s about the sheer volume of data required to cover all the linguistic and acoustic nuances of human speech. And that, as anyone who’s tried to build a truly robust system knows, is a bottomless pit.


Original article: https://arxiv.org/pdf/2603.11661.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-15 16:59