Build a physically-grounded, academically defensible framework for detecting synthetic speech using classical signal-processing features. The core thesis is that TTS/VC systems cannot replicate the coupled biomechanical constraints of the human vocal tract, creating detectable physics violations.
The SSA is a companion to the Human Speech Atlas. Together they provide a 1M+ synthetic / 600k+ bonafide feature corpus.
jitter_local, shimmer_local, hnr_mean, cpps, hnr_c50_ratio, cpps_snr_ratio are strictly valid only when brouhaha_run=1.

| Table | Rows | Bonafide | Spoof | Brouhaha% |
|---|---|---|---|---|
| in_the_wild | 31,779 | 19,963 | 11,816 | 99% bona |
| fake_or_real | 10,684 | 5,286 | 5,398 | 100% bona |
| sonar | 3,699 | 2,274 | 1,425 | 100% bona |
| librisevoc | 92,407 | 13,201 | 79,206 | 100% bona |
| asvspoof5_dev | 142,134 | 31,331 | 109,171 | 100% bona |
| asvspoof5_eval | 681,872 | 138,688 | 542,086 | 100% bona |
| asvspoof5_train | 182,357 | 18,797 | 163,560 | 0% (pending) |
| asvspoof_2021_df | 193,072 | 5,791 | 37,640 | 50% bona |
| dfadd | 207,955 | 44,454 | 163,501 | 0% |
| add2022 | 55,408 | 5,319 | 50,089 | 0% |
| mlaad | 156,966 | 0 | 156,966 | 0% |
| wavefake | 134,266 | 0 | 134,266 | 100% |
| TOTAL | 1,904,318 | 285,104 | 1,455,124 | |
* Usable bonafide (Brouhaha-graded): ~213,484 clips. Ground-truth gender metadata available for ~190,000 clips.
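The brouhaha_run validity rule stated above the table can be applied as a simple mask. The following is a minimal sketch, assuming each clip is a dict keyed by feature name; the helper name `mask_ungraded` and the sample values are hypothetical and are not taken from the SSA codebase.

```python
# Features the document marks as strictly valid only when brouhaha_run == 1.
GATED_FEATURES = (
    "jitter_local", "shimmer_local", "hnr_mean",
    "cpps", "hnr_c50_ratio", "cpps_snr_ratio",
)

def mask_ungraded(row: dict) -> dict:
    """Return a copy of `row`; gated features become None unless brouhaha_run == 1.

    Hypothetical helper: the row layout (a flat dict) is an assumption.
    """
    if row.get("brouhaha_run") == 1:
        return dict(row)
    return {k: (None if k in GATED_FEATURES else v) for k, v in row.items()}

# Illustrative values only, not measurements from the corpus.
graded = {"brouhaha_run": 1, "cpps": 14.2, "f1_mean": 512.0}
ungraded = {"brouhaha_run": 0, "cpps": 14.2, "f1_mean": 512.0}

graded_out = mask_ungraded(graded)
ungraded_out = mask_ungraded(ungraded)
```

Masking to None rather than dropping the row keeps ungated features such as f1_mean available for clips that were never Brouhaha-graded.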
Following a rigorous internal diagnostic review (April 2026), the methodology transitioned from a significance-based (p-value) model to an effect-first architecture. With 1.7M+ rows, p-values merely confirm sample size; the framework must interrogate physical magnitude.
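The reason p-values stop being informative at this scale can be shown with a few lines of arithmetic. This is a sketch under stated assumptions: two equal-sized groups, a normal approximation to the two-sample test, and hypothetical function names; it is not the review's actual procedure.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def two_sided_p_for_d(d, n_per_group):
    """Approximate two-sided p-value for a fixed effect size d.

    Assumes equal group sizes and a normal approximation:
    z = |d| * sqrt(n / 2), p = erfc(z / sqrt(2)).
    """
    z = abs(d) * math.sqrt(n_per_group / 2)
    return math.erfc(z / math.sqrt(2))

# The same negligible effect (d = 0.02) at two sample sizes.
p_small_n = two_sided_p_for_d(0.02, 100)
p_large_n = two_sided_p_for_d(0.02, 850_000)
d_example = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

With the effect size held fixed, the p-value is driven entirely by the sample size, which is why an effect-first design reports the magnitude (here d) and treats significance as secondary.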
ZERO of 41 features are biologically invariant across environments. Every feature undergoes a statistically significant distribution shift when moving from Tier 1 (Studio) to Tier 2 (Near-field). A pure biological baseline is therefore unattainable without tier stratification.
Action Taken: The single global baseline was demoted to exploratory use only. All formal detection claims must use tier-stratified baselines.
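A tier-stratified baseline amounts to fitting one reference distribution per tier and scoring each clip against its own tier. The sketch below assumes rows are dicts with hypothetical `tier` and `label` keys and uses made-up spectral_tilt values; the real schema may differ.

```python
from statistics import mean, stdev

def fit_tier_baselines(rows, feature):
    """Return {tier: (mean, std)} estimated from bonafide rows only."""
    by_tier = {}
    for r in rows:
        if r["label"] == "bonafide":
            by_tier.setdefault(r["tier"], []).append(r[feature])
    return {tier: (mean(v), stdev(v)) for tier, v in by_tier.items()}

def tier_z(row, feature, baselines):
    """z-score of a clip against the baseline of its own tier."""
    mu, sd = baselines[row["tier"]]
    return (row[feature] - mu) / sd

# Illustrative values only.
rows = [
    {"tier": 1, "label": "bonafide", "spectral_tilt": v} for v in (-10.0, -12.0, -14.0)
] + [
    {"tier": 2, "label": "bonafide", "spectral_tilt": v} for v in (-4.0, -6.0, -8.0)
]
baselines = fit_tier_baselines(rows, "spectral_tilt")

probe = {"tier": 2, "label": "unknown", "spectral_tilt": -12.0}
z_own_tier = tier_z(probe, "spectral_tilt", baselines)
z_wrong_tier = tier_z({**probe, "tier": 1}, "spectral_tilt", baselines)
```

In this toy example the same value is three standard deviations from the Tier 2 baseline but sits exactly on the Tier 1 mean, which is the kind of disagreement a single global baseline would hide.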
TEO Artifact Quarantine: Teager Energy Operator features (teo_mean, teo_std) exhibit CV% up to 283%. They act as dataset-identification footprints rather than biological markers. They are strictly disqualified from the global baseline and relegated to conditional Gate 2 use.
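For reference, the coefficient of variation is the standard deviation expressed as a percentage of the mean. This sketch shows only the arithmetic; the document does not state what the CV% was computed across (clips, datasets, or tiers), so the input values here are purely illustrative.

```python
from statistics import mean, stdev

def cv_percent(values):
    """Coefficient of variation in percent: 100 * sample std / |mean|."""
    return 100.0 * stdev(values) / abs(mean(values))

# Illustrative values only: a tightly clustered feature and a heavy-tailed one.
stable_cv = cv_percent([9.0, 10.0, 11.0])
unstable_cv = cv_percent([1.0, 1.0, 1.0, 1.0, 10.0])
```

A CV well above 100% means the spread exceeds the mean itself, so the feature's location says more about where a clip came from than about the speaker.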
Formant Sex-Confounding: Formant velocities (f1_velocity, f2_velocity) show significant sex-based biological gaps (Cohen's d = -0.34 for M vs F). A single global baseline is demographically naive. Action: Implemented a pooled within-sex standard deviation for detection normalisation, sourced from ground-truth datasets.
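A pooled within-group standard deviation measures spread around each group's own mean, so the gap between group means does not inflate the denominator. The sketch below uses the standard pooled-variance formula; the function name and the f1_velocity values are hypothetical, and how the framework applies the result during normalisation is not specified in the source.

```python
import math
from statistics import stdev, variance

def pooled_within_group_sd(groups):
    """Pooled SD over groups: sqrt(sum((n_g - 1) * var_g) / sum(n_g - 1))."""
    num = sum((len(v) - 1) * variance(v) for v in groups.values())
    den = sum(len(v) - 1 for v in groups.values())
    return math.sqrt(num / den)

# Illustrative values only: identical spread, different means.
f1_velocity = {
    "M": [1.0, 2.0, 3.0],
    "F": [11.0, 12.0, 13.0],
}
within_sd = pooled_within_group_sd(f1_velocity)
global_sd = stdev(f1_velocity["M"] + f1_velocity["F"])
```

Here the global SD is several times larger than the within-sex SD purely because of the offset between the two groups, which is the demographic naivety the finding describes.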
nPVI Dataset Blacklist: The 100% instant-kill rate observed on ASVspoof2021 DF was confirmed to be a codec/compression artifact, not a universal biological failure. npvi is formally blacklisted for this specific dataset to prevent metric contamination.
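A per-dataset blacklist can be expressed as a small lookup consulted before any metric is aggregated. This is a minimal sketch: the table key follows the dataset name used in the corpus table, while the structure and the helper name `usable_features` are assumptions.

```python
# Dataset name -> features excluded for that dataset only.
FEATURE_BLACKLIST = {
    "asvspoof_2021_df": {"npvi"},  # codec/compression artifact per the review
}

def usable_features(dataset, features):
    """Return `features` minus anything blacklisted for `dataset` (hypothetical helper)."""
    blocked = FEATURE_BLACKLIST.get(dataset, set())
    return [f for f in features if f not in blocked]

candidate = ["npvi", "spectral_tilt", "bico_f0_f1"]
for_df2021 = usable_features("asvspoof_2021_df", candidate)
for_wild = usable_features("in_the_wild", candidate)
```

Scoping the exclusion to one dataset keeps npvi available everywhere else, matching the finding that the failure is dataset-specific rather than universal.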
Features passing the KS D < 0.1 / KL < 0.02 invariance thresholds, demonstrating consistent directional Cohen's d across both tiers with N ≥ 385:
* bico_f0_f1: Phase-smear decoupling. Architecture-invariant. The strongest biomechanical 0-day signature.
* intensity_velocity_max: Robust measure of articulatory effort across tiers.
* spectral_tilt: Consistent physical separation.
* shimmer_local: Robust, but PRISTINE-gated (Tier 1 claims only).
* f1_mean, bico_f1_f2: Operationally invariant.
* f1_velocity, f2_velocity: Conditionally defensible (requires pooled within-sex variance).

"The system has moved from a model overwhelmed by data to one that interrogates the physics. The baseline is now academically defensible." (Statistician Sign-Off, April 2026)
The framework correctly abandons p-value significance in favor of KS distance (where D > 0.1 flags a distribution shift) and Kullback-Leibler divergence. Feature blacklists act as the strongest defense against academic fluff: they acknowledge when a dataset's internal compression destroys a metric rather than claiming a biological failure.
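The invariance check built on these two measures can be sketched in plain Python. The thresholds (KS D < 0.1, KL < 0.02, N ≥ 385) come from the text; everything else is an assumption, including the histogram-based KL estimate, the bin count, the smoothing constant, the use of natural logarithms, and the function names.

```python
import math
from bisect import bisect_right

def ks_distance(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )

def kl_divergence(a, b, bins=20, eps=1e-9):
    """KL(P_a || P_b) in nats from shared-bin histograms (assumed estimator)."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # eps smoothing avoids log(0) and division by zero in empty bins.
        return [(c + eps) / (len(xs) + eps * bins) for c in counts]

    p, q = hist(a), hist(b)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def passes_invariance(a, b, min_n=385, ks_max=0.1, kl_max=0.02):
    """Apply the document's thresholds to two samples of one feature."""
    if len(a) < min_n or len(b) < min_n:
        return False
    return ks_distance(a, b) < ks_max and kl_divergence(a, b) < kl_max

# Synthetic samples only: an identical pair and a pair shifted by 0.5.
tier1 = [i / 1000 for i in range(1000)]
tier2_shifted = [x + 0.5 for x in tier1]

d_same = ks_distance(tier1, tier1)
d_shifted = ks_distance(tier1, tier2_shifted)
ok_same = passes_invariance(tier1, tier1)
ok_shifted = passes_invariance(tier1, tier2_shifted)
ok_too_small = passes_invariance(tier1[:100], tier1[:100])
```

Unlike a p-value, neither measure shrinks or grows merely because more rows are added, so the same thresholds remain meaningful across datasets of very different sizes. The histogram KL estimate is sensitive to the bin count, so any production use would need that choice fixed and documented.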