Blog · March 23, 2026

ProsodyBench: Why Realistic Voice Still Bottlenecks on Data and Feedback

We designed ProsodyBench to evaluate whether a prosody-sensitive reward model could improve the naturalness of RL-trained speech synthesis. In a matched experiment on the F5R-TTS model, we compared a prosody-aware reward branch against our standard RL baseline, holding the following conditions identical:

  • Pre-trained checkpoint
  • Data
  • 200-update budget
  • Evaluation protocol

The experiment produced two observations: (a) the prosody branch failed to demonstrate a measurable gain, and (b) both RL branches slightly deteriorated in quality relative to the pretrained baseline. This is a preliminary but informative negative result: it suggests that the limitation of natural speech synthesis does not lie in reward specification alone, even when the reward is explicitly grounded in acoustic-prosodic features and trained on human preferences.

In this post, we define prosody as the rhythm, pitch, timing, pauses, stress, etc. that make speech sound human. A reward model is the scoring function used in the training process to inform the model which outputs are superior. Reinforcement Learning (RL) is a computational framework for learning optimal behavior (a policy) where an agent interacts with an environment to maximize a cumulative reward signal over time.

Why this experiment

Recent work in TTS alignment has demonstrated significant improvement on easily verifiable objectives via RL and preference alignment (e.g. voice intelligibility and speaker adherence). F5R-TTS reported gains from GRPO using ASR- and speaker-similarity-based reward functions, and Koel-TTS reports gains from preference alignment guided by automatic speech recognition.

Prosodic naturalness appears to be a different kind of problem. The paper No Verifiable Reward for Prosody makes this evident. GRPO with transcription-oriented rewards yielded a collapse in F0 variability, and while samples improved on intelligibility metrics, outputs sounded flatter to human listeners. Their strongest result came from iterative DPO with ~200 human-annotated preference pairs per rollout, collected specifically for prosodic naturalness.

That made skepticism reasonable, but it did not settle our actual question. Had prior baselines failed because their reward signals were misspecified? If so, could a reward grounded in prosodic features do better?

What we built

ProsodyBench is a stack that performs feature extraction, diagnostics, and reward modeling for voice AI. At the feature layer, it extracts a 22-dimensional acoustic-prosodic representation: log-F0 dispersion, F0 slope variability, duration statistics, nPVI, stress-duration ratio, pause rate, speech rate, jitter, shimmer, harmonics-to-noise ratio, spectral tilt, and F0 smoothness. The feature set is designed to expose specific failure signatures rather than collapse quality into a single scalar value.
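
As a rough illustration of the feature layer, here is a minimal sketch of how a few of these features can be computed with librosa and numpy. The function name, thresholds, and parameter choices are illustrative assumptions, not ProsodyBench's actual implementation (which also covers jitter, shimmer, HNR, spectral tilt, and the other listed dimensions).

```python
# Illustrative sketch only; names and thresholds are assumptions, not the real code.
import numpy as np
import librosa

def prosody_features(path, sr=24000):
    y, sr = librosa.load(path, sr=sr)

    # F0 track via pYIN; unvoiced frames come back as NaN.
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    log_f0 = np.log(f0[voiced & ~np.isnan(f0)])

    # Frame-level energy for a crude pause detector.
    rms = librosa.feature.rms(y=y)[0]
    silence = rms < 0.1 * np.median(rms)   # illustrative threshold, not a tuned value
    frame_sec = 512 / sr                   # librosa's default hop length for rms

    return {
        "log_f0_std": float(np.std(log_f0)),             # log-F0 dispersion
        "f0_slope_var": float(np.std(np.diff(log_f0))),  # F0 slope variability
        "pause_rate": float(silence.mean()),             # fraction of low-energy frames
        "active_speech_sec": float((~silence).sum() * frame_sec),
    }
```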

Above the feature layer, we trained a lightweight reward model on human naturalness preferences using a pairwise formulation. The model is intentionally lightweight to operate inside the RL environment while remaining grounded in human judgments and auditable at the feature level.
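
A minimal sketch of what such a pairwise reward model can look like, assuming each clip has already been reduced to its 22-dimensional feature vector; the architecture, layer sizes, and names below are illustrative, not the exact model we trained.

```python
# Bradley-Terry style pairwise reward model over prosodic feature vectors (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodyRewardModel(nn.Module):
    def __init__(self, n_features=22, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats):                 # feats: (batch, 22) acoustic-prosodic vectors
        return self.net(feats).squeeze(-1)    # scalar naturalness score per clip

def pairwise_loss(model, feats_preferred, feats_rejected):
    # Maximize log P(preferred beats rejected), with P = sigmoid(score_pref - score_rej).
    margin = model(feats_preferred) - model(feats_rejected)
    return -F.logsigmoid(margin).mean()
```

Training then amounts to minimizing pairwise_loss over the human preference pairs with a standard optimizer.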

A substantial fraction of our effort went into eliminating confounds: tokenizer mismatches, incorrect checkpoint initialization, reference-conditioning bugs, transcript drift in canary generation, checkpoint-save cadence errors, and storage faults that could fail silently.

Experimental design

Both branches started from the same pretrained F5R-TTS checkpoint, used the same HiFi-TTS clean-train slice, and trained under the same bounded RL regime: one epoch, max_samples=32, max_updates=200. Both used the same inference path, checkpoint sweep procedure, and canary battery.
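
Written out as a config sketch, the matched setup looks roughly like this; field names are our own shorthand, not F5R-TTS's actual configuration schema.

```python
# Shared setup for both branches (shorthand sketch, not the real config schema).
shared_run_config = {
    "init_checkpoint": "f5r_tts_pretrained.pt",  # same pretrained start for both branches
    "dataset": "hifi_tts_clean_train_slice",     # same HiFi-TTS clean-train slice
    "epochs": 1,
    "max_samples": 32,
    "max_updates": 200,                          # bounded RL budget
    "eval": "shared_canary_battery",             # same inference path, sweep, and canaries
}
```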

The baseline branch used the standard ACC+SIM reward. The prosody branch added our feature-grounded prosody reward to that same objective, so the experimental difference was reward specification alone, as sketched below.
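
Conceptually, the only difference between branches is one extra reward term. The sketch below assumes the component rewards have already been computed per sample; the weighting and names are illustrative, not the actual training code.

```python
def combined_reward(r_acc, r_sim, r_prosody=None, w_prosody=1.0):
    """Baseline branch: ACC + SIM. Prosody branch: the same objective plus a
    weighted prosody term from the feature-grounded reward model (sketch)."""
    reward = r_acc + r_sim
    if r_prosody is not None:
        reward = reward + w_prosody * r_prosody
    return reward
```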

Results

Both RL branches maintained intelligibility and neither collapsed. But both drifted from the pretrained floor along the same perceptual axis. In listening comparisons, both sounded scratchier, thinner, and more static than the pretrained model. The baseline and prosody branches remained close enough that neither established a defensible audible advantage.

The instrumental diagnostics were consistent with those impressions.

| Condition | Diagnostic Warnings | mean_log_f0_std | Mean RMS | Mean Crest | Rel. HF Energy (×10⁴) |
|---|---|---|---|---|---|
| Pretrained floor | 4 | 0.0983 | 0.1575 | 6.35 | 1.85 |
| Baseline best | 2 | 0.1296 | 0.1340 | 7.05 | 4.89 |
| Prosody best | 1 | 0.1071 | 0.1248 | 7.85 | 8.81 |

Lower RMS indicates reduced spectral body. Higher crest factor corresponds to peakier, thinner waveform structure. Higher relative HF energy suggests brighter, more brittle timbre. Both RL branches moved away from the pretrained floor in the same direction.
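
For concreteness, the three waveform-level diagnostics can be computed roughly as below; the 4 kHz cutoff and exact definitions are assumptions for illustration, not necessarily the ones behind the table.

```python
# Illustrative definitions of the waveform diagnostics (per-utterance values;
# the table reports means over utterances).
import numpy as np

def waveform_diagnostics(y, sr, hf_cutoff_hz=4000):
    rms = float(np.sqrt(np.mean(y ** 2)))             # lower -> less spectral body
    crest = float(np.max(np.abs(y)) / (rms + 1e-9))   # higher -> peakier, thinner waveform

    spec = np.abs(np.fft.rfft(y)) ** 2                # power spectrum
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    rel_hf = float(spec[freqs >= hf_cutoff_hz].sum() / (spec.sum() + 1e-12))

    return {"rms": rms, "crest_factor": crest, "rel_hf_energy": rel_hf}
```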

We also ran a matched listening study: 210 pairwise naturalness judgments, 70 per comparison.

| Comparison | Preference Split | Reading |
|---|---|---|
| Pretrained vs Baseline | 68.6% / 31.4% | Clear preference for pretrained |
| Pretrained vs Prosody | 71.4% / 28.6% | Clear preference for pretrained |
| Baseline vs Prosody | 51.4% / 48.6% | No clear separation |

The pretrained model was clearly preferred over both RL branches. The two RL branches were indistinguishable.
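
As a quick check on those splits, a two-sided binomial (sign) test on the raw win counts (70 judgments per comparison) shows which of them are distinguishable from chance. This is our own illustrative check, not necessarily the analysis procedure used for the study.

```python
# Two-sided binomial test on the per-comparison win counts (illustrative check).
from scipy.stats import binomtest

for name, wins, n in [("Pretrained vs Baseline", 48, 70),
                      ("Pretrained vs Prosody", 50, 70),
                      ("Baseline vs Prosody", 36, 70)]:
    p = binomtest(wins, n, p=0.5, alternative="two-sided").pvalue
    print(f"{name}: {wins}/{n} wins, p = {p:.4f}")
```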

[Figure: Pairwise naturalness judgments, 210 total (70 per comparison). Pretrained vs Baseline: 48 vs 22 (68.6% / 31.4%, clear preference for pretrained). Pretrained vs Prosody: 50 vs 20 (71.4% / 28.6%, clear preference for pretrained). Baseline vs Prosody: 36 vs 34 (51.4% / 48.6%, no clear separation).]

[Figures: mean crest factor and relative high-frequency energy (×10⁴) by condition.]

Concluding thoughts

The central finding is that a feature-grounded, human-derived prosody reward did not produce a perceptible naturalness advantage over a matched baseline. This is consistent with the broader direction of recent work, and in hindsight the result is not surprising.

Automatic rewards retain clear value as diagnostic infrastructure: in this experiment, they enabled checkpoint triage, regression detection, and matched branch comparison. What they do not yet provide is a reliable optimization target for the highest-value perceptual dimensions of speech. This is also a single-architecture experiment under a 200-update budget, and we do not claim it generalizes to all reward formulations. But the result is controlled enough to be informative: if a reward grounded in 22 interpretable features and trained on human preferences cannot separate from a standard baseline under matched conditions, the bottleneck is unlikely to be solved by reward engineering alone.

The result fits a broader pattern. WavBench (May 2025) found that audio quality scores collapse in multi-turn dialogue even as text quality improves. The GSRM paper (Meta, February 2026) noted that no prior work had investigated online RL for speech LLMs. CASPER (2025) documented Whisper-large-v3 degrading from 2.5% WER on LibriSpeech clean to 31% on spontaneous speech. The field still trains predominantly on read speech while calibrating rewards on read-speech evals, and RL then pushes outputs further into that distribution. The hardest quality dimensions (prosodic naturalness, conversational timing, emotional fit) are properties of spontaneous voice interaction that are difficult to represent adequately.

We plan to extend this work by running controlled comparisons of training data composition (read speech vs. spontaneous conversational recordings) and by building a recurring human-preference collection loop targeting prosodic and paralinguistic dimensions. We also want to research evaluation infrastructure that could partially automate the collection of that human preference data.