SYNTHETIC SPEECH ATLAS

Analysis Methodology & Statistical Framework

Document version: 2026-04-02 (Sign-Off) | Status: Methodology LOCKED

1. Project Objective

Build a physically grounded, academically defensible framework for detecting synthetic speech using classical signal-processing features. The core thesis is that TTS/VC systems cannot replicate the coupled biomechanical constraints of the human vocal tract, and that this failure creates detectable physics violations.

The Synthetic Speech Atlas (SSA) is a companion to the Human Speech Atlas. Together they provide a feature corpus of 1M+ synthetic and 600k+ bonafide clips.

2. Pipeline Architecture

Two-Pass Extraction

Pass 1, Brouhaha (Lavechin et al., Interspeech 2022): Computes SNR, C50, VAD speech ratio, and a quality tier per clip. Runs on a single worker (IO-bound) and writes one JSONL cache per dataset.
Pass 2, Classical Feature Extraction: Uses Parselmouth/Praat, librosa, and custom physics-motivated scripts. Extracts 57 features per clip with at most 5 CPU workers (cooldown-hardened).

Feature Tiers & Gating

Tier 1: Standard DSP features — published as raw float32.
Tier 2: Proprietary biomechanical features — published as Z-scores only. Raw values are permanently dropped so that adversaries cannot tune against them (0-day protection).
PRISTINE-gated: jitter_local, shimmer_local, hnr_mean, cpps, hnr_c50_ratio, cpps_snr_ratio are strictly valid only when brouhaha_run=1.
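
A minimal sketch of how the PRISTINE gate could be applied downstream. The feature names and the brouhaha_run flag come from the list above; the dict-per-clip layout, the function name, and the choice to mask with NaN are assumptions made for illustration, and the sample values are invented.

```python
import math

# Names taken from the PRISTINE-gated list above.
PRISTINE_GATED = (
    "jitter_local", "shimmer_local", "hnr_mean",
    "cpps", "hnr_c50_ratio", "cpps_snr_ratio",
)

def apply_pristine_gate(clip):
    """Return a copy of `clip` with PRISTINE-gated features set to NaN
    unless Brouhaha has run on the clip (brouhaha_run == 1).

    Masking with NaN (rather than dropping the key) is an assumption.
    """
    gated = dict(clip)
    if clip.get("brouhaha_run") != 1:
        for name in PRISTINE_GATED:
            if name in gated:
                gated[name] = math.nan
    return gated

# Illustrative values only, not corpus data.
graded = {"brouhaha_run": 1, "shimmer_local": 0.031, "spectral_tilt": -11.2}
ungraded = {"brouhaha_run": 0, "shimmer_local": 0.031, "spectral_tilt": -11.2}

print(apply_pristine_gate(graded)["shimmer_local"])    # value is kept
print(apply_pristine_gate(ungraded)["shimmer_local"])  # masked to nan
```

Ungated features such as spectral_tilt pass through unchanged in both cases.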

3. Corpus State (as of 2026-04-02)

Table	Rows	Bonafide	Spoof	Brouhaha%
in_the_wild	31,779	19,963	11,816	99% bona
fake_or_real	10,684	5,286	5,398	100% bona
sonar	3,699	2,274	1,425	100% bona
librisevoc	92,407	13,201	79,206	100% bona
asvspoof5_dev	142,134	31,331	109,171	100% bona
asvspoof5_eval	681,872	138,688	542,086	100% bona
asvspoof5_train	182,357	18,797	163,560	0% (pending)
asvspoof_2021_df	193,072	5,791	37,640	50% bona
dfadd	207,955	44,454	163,501	0%
add2022	55,408	5,319	50,089	0%
mlaad	156,966	0	156,966	0%
wavefake	134,266	0	134,266	100%
TOTAL	1,904,318	285,104	1,455,124

* Usable bonafide (Brouhaha-graded): ~213,484 clips. Ground-truth gender metadata available for ~190,000 clips.

4. Statistical Peer Review & Revisions

Following a rigorous internal diagnostic review (April 2026), the methodology moved from a significance-based (p-value) model to an effect-first architecture. With 1.7M+ rows, almost any difference reaches significance, so p-values mostly reflect sample size; the framework must instead interrogate physical magnitude.
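
The point about sample size can be illustrated with a small sketch. It computes the two-sample KS distance D in plain Python and an asymptotic p-value from the Kolmogorov series; the synthetic data, function names, and the 0.01 shift are assumptions chosen for the demonstration, not values from the corpus.

```python
import math

def ks_statistic(a, b):
    """Two-sample KS distance: max gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        # Advance both samples past x so ties are handled together.
        while i < na and a[i] <= x:
            i += 1
        while j < nb and b[j] <= x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d

def ks_pvalue_asymptotic(d, na, nb, terms=100):
    """Asymptotic p-value from the Kolmogorov series (large-sample approximation)."""
    lam = math.sqrt(na * nb / (na + nb)) * d
    if lam < 0.2:
        return 1.0  # the series is not usable for very small lambda
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * s))

# Two large samples that differ by a tiny, fixed shift.
n = 200_000
bona = [i / n for i in range(n)]
shifted = [x + 0.01 for x in bona]

d = ks_statistic(bona, shifted)
p = ks_pvalue_asymptotic(d, n, n)
print(f"D = {d:.4f}, p = {p:.2e}")
```

Here D is about 0.01, far below a 0.1 effect threshold, yet the p-value is vanishingly small. At this sample size the significance test flags a shift that the effect size treats as negligible.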

The "Acoustic Bloodbath" (KS Invariance Failure)

None of the 41 tested features is biologically invariant across environments. Every feature undergoes a statistically significant distribution shift when moving from quality Tier 1 (Studio) to quality Tier 2 (Near-field); these recording-quality tiers are distinct from the feature tiers in Section 2. A pure biological baseline is therefore unattainable without tier stratification.

Action Taken: The single global baseline was demoted to exploratory use only. All formal detection claims must use tier-stratified baselines.
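
A tier-stratified baseline can be sketched as follows, assuming each clip carries a quality-tier label and that detection scores are Z-scores against the bonafide distribution of the same tier. The record layout, function names, and numbers are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev

def fit_tier_baselines(bonafide_clips, feature):
    """Per-tier (mean, sample SD) of one feature over bonafide clips."""
    by_tier = defaultdict(list)
    for clip in bonafide_clips:
        by_tier[clip["tier"]].append(clip[feature])
    return {tier: (mean(vals), stdev(vals)) for tier, vals in by_tier.items()}

def tier_z(clip, feature, baselines):
    """Z-score of a clip against the baseline of its own quality tier."""
    mu, sd = baselines[clip["tier"]]
    return (clip[feature] - mu) / sd

# Invented values: the two tiers have different bonafide distributions.
bonafide = [
    {"tier": 1, "spectral_tilt": -10.0},
    {"tier": 1, "spectral_tilt": -12.0},
    {"tier": 2, "spectral_tilt": -6.0},
    {"tier": 2, "spectral_tilt": -8.0},
]
baselines = fit_tier_baselines(bonafide, "spectral_tilt")

probe = {"tier": 2, "spectral_tilt": -7.0}
print(tier_z(probe, "spectral_tilt", baselines))  # 0.0 against the Tier 2 baseline
```

Scored against the Tier 1 baseline instead, the same probe would sit almost three standard deviations out, which is the kind of environment-driven false alarm that stratification is meant to avoid.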

Feature Quarantines & Corrections

TEO Artifact Quarantine: Teager Energy Operator features (teo_mean, teo_std) exhibit a coefficient of variation (CV) of up to 283%. They behave as dataset-identification fingerprints rather than biological markers, so they are excluded from the global baseline and restricted to conditional Gate 2 use.
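
For reference, CV% is the standard deviation expressed as a percentage of the absolute mean. The sketch below assumes the CV is taken over per-dataset bonafide means of a feature and uses the population standard deviation; the text does not state either choice, and the numbers are invented.

```python
from statistics import mean, pstdev

def cv_percent(values):
    """Coefficient of variation in percent: 100 * SD / |mean|."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * pstdev(values) / abs(m)

# Hypothetical per-dataset means of one feature (not corpus data).
per_dataset_means = [1.0, 3.0]
print(cv_percent(per_dataset_means))  # 50.0
```

A CV well above 100% means the spread across datasets exceeds the typical value itself, so the feature says more about the dataset than about the speaker.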

Formant Sex-Confounding: Formant velocities (f1_velocity, f2_velocity) show a sex-based biological gap (Cohen's d = -0.34 for M vs F), so a single global baseline is demographically naive. Action Taken: Detection normalisation now uses a pooled within-sex standard deviation, sourced from datasets with ground-truth gender metadata.
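
A sketch of the pooled within-sex standard deviation and the corresponding Cohen's d, using the textbook pooled-variance formula. Centring each clip on its own sex-specific mean before dividing by the pooled SD is an assumption about how the normalisation is applied; function names and sample values are illustrative.

```python
import math
from statistics import mean, variance

def pooled_within_sd(male, female):
    """Pooled SD: sqrt(((n_m-1)*s_m^2 + (n_f-1)*s_f^2) / (n_m + n_f - 2))."""
    n_m, n_f = len(male), len(female)
    ss = (n_m - 1) * variance(male) + (n_f - 1) * variance(female)
    return math.sqrt(ss / (n_m + n_f - 2))

def cohens_d(male, female):
    """Standardised M-minus-F mean difference."""
    return (mean(male) - mean(female)) / pooled_within_sd(male, female)

def sex_normalised_z(value, sex, group_means, pooled_sd):
    """Assumed normalisation: centre on the sex-specific mean, scale by pooled SD."""
    return (value - group_means[sex]) / pooled_sd

# Invented f1_velocity values, for illustration only.
male = [1.0, 2.0, 3.0]
female = [2.0, 3.0, 4.0]

sd = pooled_within_sd(male, female)
means = {"M": mean(male), "F": mean(female)}
print(cohens_d(male, female))                 # -1.0
print(sex_normalised_z(3.0, "F", means, sd))  # 0.0
```

Pooling within sex keeps the between-sex gap out of the denominator, so the scale reflects variation among speakers of the same sex rather than the demographic mix of the baseline.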

nPVI Dataset Blacklist: The 100% "instant-kill" detection rate that npvi showed on ASVspoof2021 DF was confirmed to be a codec/compression artifact, not a universal biological failure. npvi is formally blacklisted for this dataset only, to prevent metric contamination.
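
A per-dataset blacklist of this kind can be expressed as a small lookup. The sketch below assumes the dataset key matches the table name asvspoof_2021_df from Section 3; the mapping structure and function name are hypothetical.

```python
# feature name -> set of datasets on which the feature must not be used
FEATURE_BLACKLIST = {
    "npvi": {"asvspoof_2021_df"},
}

def usable_features(features, dataset):
    """Drop features that are blacklisted for the given dataset."""
    return [f for f in features
            if dataset not in FEATURE_BLACKLIST.get(f, set())]

candidates = ["npvi", "spectral_tilt", "bico_f0_f1"]
print(usable_features(candidates, "asvspoof_2021_df"))  # ['spectral_tilt', 'bico_f0_f1']
print(usable_features(candidates, "in_the_wild"))       # ['npvi', 'spectral_tilt', 'bico_f0_f1']
```

Keeping the blacklist scoped to a dataset, rather than removing the feature globally, matches the finding that the failure is a property of the codec and not of npvi itself.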

5. Defensible Detection Framework

Dual-Gated Baseline Architecture

Tier 1 Baseline: (SNR > 50 dB AND C50 > 50 dB). ~130k bonafide clips. Uses all features including PRISTINE-gated. Primary reference for academic claims.
Tier 2 Baseline: ~80k clips. Non-PRISTINE features only. Used strictly for real-world robustness validation.
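
The dual gate can be sketched as a tier assignment followed by a feature filter. The Tier 1 thresholds and the PRISTINE-gated names are taken from this document; treating every other Brouhaha-graded clip as Tier 2 is an assumption, since the Tier 2 criteria are not stated here, and the function names are hypothetical.

```python
PRISTINE_GATED = {
    "jitter_local", "shimmer_local", "hnr_mean",
    "cpps", "hnr_c50_ratio", "cpps_snr_ratio",
}

TIER1_MIN_SNR_DB = 50.0
TIER1_MIN_C50_DB = 50.0

def baseline_tier(snr_db, c50_db):
    """Tier 1 requires SNR > 50 dB AND C50 > 50 dB.

    Assumption: any other Brouhaha-graded clip falls into Tier 2.
    """
    if snr_db > TIER1_MIN_SNR_DB and c50_db > TIER1_MIN_C50_DB:
        return 1
    return 2

def features_for_tier(tier, all_features):
    """Tier 1 uses every feature; Tier 2 drops the PRISTINE-gated ones."""
    if tier == 1:
        return list(all_features)
    return [f for f in all_features if f not in PRISTINE_GATED]

features = ["shimmer_local", "spectral_tilt", "intensity_velocity_max"]
print(baseline_tier(55.0, 52.0), features_for_tier(1, features))
print(baseline_tier(55.0, 40.0), features_for_tier(2, features))
```

Both conditions must hold for Tier 1, so a clip with high SNR but strong reverberation (low C50) is still scored on the reduced Tier 2 feature set.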

Tier A: Academically Defensible Detectors

Features that pass the operational invariance thresholds (KS D < 0.1 and KL < 0.02) and show a consistent direction of Cohen's d across both tiers, with N ≥ 385:

bico_f0_f1: Phase-smear decoupling. Architecture-invariant. The strongest biomechanical 0-day signature.
intensity_velocity_max: Robust measure of articulatory effort across tiers.
spectral_tilt: Consistent physical separation.
shimmer_local: Robust, but PRISTINE-gated (Tier 1 claims only).
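
The Tier A criteria can be written as a single predicate over precomputed statistics. The thresholds come from the text above; the field names, the reading that N ≥ 385 applies to each tier separately, and the example numbers are assumptions. (385 happens to be the usual sample size for a 95% confidence level with a 5% margin of error, which may be where the cutoff comes from.)

```python
from dataclasses import dataclass

@dataclass
class FeatureStats:
    ks_d: float     # KS distance between tiers (assumed: bonafide, Tier 1 vs Tier 2)
    kl: float       # KL divergence between the same two distributions
    d_tier1: float  # Cohen's d, bonafide vs spoof, within Tier 1
    d_tier2: float  # Cohen's d, bonafide vs spoof, within Tier 2
    n_tier1: int
    n_tier2: int

def is_tier_a(s, max_ks=0.1, max_kl=0.02, min_n=385):
    """True when a feature meets all Tier A screening criteria."""
    invariant = s.ks_d < max_ks and s.kl < max_kl
    same_direction = s.d_tier1 * s.d_tier2 > 0  # same non-zero sign in both tiers
    enough_data = min(s.n_tier1, s.n_tier2) >= min_n
    return invariant and same_direction and enough_data

# Invented statistics for two hypothetical features.
stable = FeatureStats(ks_d=0.06, kl=0.01, d_tier1=0.8, d_tier2=0.5,
                      n_tier1=5000, n_tier2=4000)
flips = FeatureStats(ks_d=0.06, kl=0.01, d_tier1=0.8, d_tier2=-0.2,
                     n_tier1=5000, n_tier2=4000)

print(is_tier_a(stable))  # True
print(is_tier_a(flips))   # False: effect direction reverses between tiers
```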

Conditioned Features

f1_mean, bico_f1_f2: Operationally invariant.
f1_velocity, f2_velocity: Conditionally defensible (requires normalisation by the pooled within-sex standard deviation).

Final Verdict: "Statistically Honest"

"The system has moved from a model overwhelmed by data to one that interrogates the physics. The baseline is now academically defensible." — Statistician Sign-Off, April 2026.

The framework correctly abandons p-value significance in favor of effect-size measures: KS distance (D > 0.1 marks a meaningful distribution shift) and Kullback-Leibler divergence. Feature blacklists act as the strongest defense against academic fluff: they acknowledge when a dataset's internal compression destroys a metric rather than claiming biological failure.
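
For completeness, a histogram-based estimate of the Kullback-Leibler divergence is sketched below. The bin count, the shared bin range, the additive smoothing constant, and the use of natural logarithms (nats) are all assumptions; the document does not specify how KL is estimated.

```python
import math

def histogram(values, lo, hi, bins):
    """Counts per equal-width bin over [lo, hi]; edge values are clamped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        k = int((v - lo) / width)
        counts[min(max(k, 0), bins - 1)] += 1
    return counts

def kl_divergence(p_values, q_values, bins=20, eps=1e-9):
    """KL(P || Q) in nats from smoothed histograms on a shared range."""
    lo = min(min(p_values), min(q_values))
    hi = max(max(p_values), max(q_values))
    if hi == lo:
        return 0.0  # both samples are a single identical point
    p = histogram(p_values, lo, hi, bins)
    q = histogram(q_values, lo, hi, bins)
    p_total = sum(p) + eps * bins
    q_total = sum(q) + eps * bins
    kl = 0.0
    for pc, qc in zip(p, q):
        pp = (pc + eps) / p_total  # smoothing avoids log(0) and division by zero
        qq = (qc + eps) / q_total
        kl += pp * math.log(pp / qq)
    return kl

# Synthetic samples for illustration.
base = [i / 100 for i in range(100)]
moved = [x + 0.5 for x in base]

print(kl_divergence(base, base))   # 0.0 for identical samples
print(kl_divergence(base, moved))  # large: half of the mass no longer overlaps
```

Because KL is asymmetric, the direction (which distribution plays the role of the reference) should be fixed and reported alongside any threshold.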
