The Human Speech Atlas (HSA) is a multilingual acoustic telemetry corpus produced by Moonscape Software from two CC0-licensed Mozilla speech datasets: Common Voice CV24.0 (scripted read speech) and Spontaneous Speech SPS2.0 (unscripted conversational speech). It covers 90+ languages across 12 typologically diverse linguistic family trees, yielding approximately 531,000 exportable utterance-level feature rows.
The corpus is a feature-only product. No audio is distributed. Each row represents one spoken clip described by 38 hand-crafted classical signal-processing features plus metadata, extracted through a four-pass pipeline and hardened through a three-stage anonymisation and watermarking protocol before export.
The HSA serves as a companion to the Synthetic Speech Atlas (SSA), which provides parallel acoustic telemetry for deepfake detection research. Together they form a unified acoustic telemetry platform spanning natural and synthetic speech.
Common Voice is a crowd-sourced read-speech corpus distributed by the Mozilla Foundation under CC0-1.0. Contributors record prompted sentences drawn from validated text corpora (Wikipedia, news, civic domains) on personal devices in uncontrolled environments. Each clip comes with a TSV demographics file containing self-reported gender, age bracket, and accent/dialect labels where the contributor chose to provide them.
The HSA ingests CV24.0 as the primary scripted-speech source. Clips are tagged speech_type = scripted.
The Mozilla Spontaneous Speech corpus provides unscripted conversational recordings under CC0-1.0. SPS2.0 lacks validated per-speaker TSV demographics files, so gender and age fields default to unknown unless inferred from acoustic pitch during processing. Clips are tagged speech_type = spontaneous.
All 158 language-corpus tables (language × source corpus pairs) in the internal macroprosody.db database are classified into one of 12 linguistic family trees for HuggingFace distribution. The trees are:
Indo-European, Niger-Congo, Austronesian, Mesoamerican, Americas-Other, Afro-Asiatic, Nilo-Saharan, Trans-New-Guinea, Eurasian-Minor, Turkic, Asian-Minor, and Isolates.
Classification follows established genealogical taxonomy. Contested or uncertain cases (e.g. Lasi/lss as Sindhic Indo-Aryan, Ukuriguma/ukv as TNG-Madang) were resolved through expert linguistic consultation.
Source audio passes through four sequential processing stages before any feature is stored.
Mozilla distributes CV24 audio as 128 kbps MP3. Each file is converted to 16 kHz mono WAV using ffmpeg; the audio arrives already loudness-normalised to −20 dBFS. This normalisation is applied by Mozilla at source for CV24 and is a known characteristic of the corpus that affects intensity-related features (see Section 6, Known Limitations).
Each WAV is graded using Brouhaha (Lavechin et al., Interspeech 2022), a multi-task neural model that produces three per-clip estimates:

- SNR: signal-to-noise ratio (dB)
- C50: room clarity index (dB), the ratio of early to late energy
- Speech ratio: fraction of frames classified as active speech by VAD
Clips are assigned to one of four quality tiers based on simultaneous threshold satisfaction:
| Tier | Label | SNR | C50 | Speech Ratio |
|---|---|---|---|---|
| 1 | PRISTINE | ≥ 35 dB | ≥ 35 dB | ≥ 0.30 |
| 2 | STUDIO | ≥ 25 dB | ≥ 20 dB | ≥ 0.30 |
| 3 | AMBIENT | ≥ 10 dB | any | ≥ 0.10 |
| 4 | TRASH | < 10 dB | any | < 0.10 |
The Tier 2 (STUDIO) gate requires all three conditions simultaneously. This was a deliberate design decision following pipeline bugs BUG-029 and BUG-030, which admitted high-reverb and silence-heavy clips by evaluating SNR and C50 independently.
Tier 1 and 2 clips (PRISTINE and STUDIO) form the cream set — the primary analytical population. Tier 3 and 4 clips are retained in the internal database but excluded from standard exports.
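The tier gates above can be read as a simple cascade, checked from strictest to loosest. The function below is an illustrative sketch of that logic (the function name and signature are hypothetical, not the pipeline's actual code):

```python
def quality_tier(snr_db: float, c50_db: float, speech_ratio: float) -> str:
    """Assign a clip to a quality tier; each gate requires all of its
    conditions simultaneously (see BUG-029/BUG-030 rationale)."""
    if snr_db >= 35 and c50_db >= 35 and speech_ratio >= 0.30:
        return "PRISTINE"   # Tier 1
    if snr_db >= 25 and c50_db >= 20 and speech_ratio >= 0.30:
        return "STUDIO"     # Tier 2
    if snr_db >= 10 and speech_ratio >= 0.10:
        return "AMBIENT"    # Tier 3
    return "TRASH"          # Tier 4
```

Note that a clip with high SNR but poor C50 falls through to AMBIENT rather than STUDIO, which is exactly the behaviour the joint gate was introduced to enforce.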
Acoustic features are extracted from each WAV using Parselmouth (Python bindings for Praat) and librosa. Extraction is bounded by Brouhaha's VAD output: features are computed over voiced/active frames only, not over silence or ambient noise. The 38 features span six domains.
Cream clips (T1+T2) are loaded into macroprosody.db, a SQLite database with one table per language-corpus pair (158 tables total). A cream manifest maintains a demographically balanced 25-minute representative WAV set per language for use in Phase 2 MFA forced alignment.
All exports apply a ten-step hardening sequence. Steps are applied in the order listed below.
Before transcript removal, an approximate syllable count is computed from the transcript using a vowel-count heuristic. This proxy is retained in the export as syllable_count_approx, providing a structural length signal without exposing the actual text.
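A vowel-run heuristic of this kind can be sketched as follows; the exact heuristic used by the pipeline is not published, so the regex and the treatment of "y" here are assumptions:

```python
import re

def syllable_count_approx(transcript: str) -> int:
    """Approximate syllables as the number of consecutive-vowel runs.

    Crude ASCII sketch: counts maximal runs of a-e-i-o-u-y, so
    "hello world" -> 3 ("e", "o", "o"). Non-empty text scores at least 1.
    """
    if not transcript.strip():
        return 0
    runs = re.findall(r"[aeiouy]+", transcript.lower())
    return max(1, len(runs))
```

Only this integer survives into the export; the transcript itself is removed in a later step.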
Original Mozilla file identifiers are replaced with salted sequential clip IDs in the format {lang_corpus_NNNNNN}. A private mapping table is retained locally and is never distributed.
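A minimal sketch of the ID format, assuming the salt acts as a per-table offset on the sequence counter (the actual salting scheme and salt value are private and not published):

```python
import hashlib

def clip_id(lang: str, corpus: str, seq: int, salt: bytes = b"example-salt") -> str:
    """Format a salted sequential clip ID like 'en_cv_000042'.

    Hypothetical scheme: a per-table offset derived from the salt is added
    to the sequence number so exported IDs do not reveal insertion order.
    """
    table = f"{lang}_{corpus}"
    offset = int.from_bytes(hashlib.sha256(salt + table.encode()).digest()[:3], "big")
    return f"{table}_{(seq + offset) % 1_000_000:06d}"
```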
The following columns are removed entirely: client_id, sentence_id, source_file, wav_path, transcript, validated, gender_source, alignment_method, phoneme_count, db_inserted_at. This eliminates all direct speaker linkage, sentence linkage, and internal pipeline metadata.
duration_ms is rounded to the nearest 100ms. This prevents cross-referencing exported rows against known clip durations from the source corpus.
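The bucketing is a plain round-to-nearest-100 operation, e.g.:

```python
def bucket_duration_ms(duration_ms: int) -> int:
    """Round a clip duration to the nearest 100 ms bucket (sketch;
    helper name is illustrative)."""
    return int(round(duration_ms / 100.0)) * 100
```

The same bucketed value is reused as the duration component of the k-anonymity quasi-identifier in a later step.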
All 38 continuous acoustic variables are processed in two sub-steps. First, values are rounded to 2 decimal places. Second, each 2dp-rounded value is reinflated with deterministic seeded noise derived via HMAC-SHA256:
seed = HMAC-SHA256(HSA_EXPORT_SECRET, col_name + "|HSA_v1_2026")
noise ~ Uniform(−0.004, +0.004)
value = float16(round(raw, 2) + noise) → stored as float32
The noise magnitude (±0.004) is chosen to be narrower than the 2dp quantisation bucket (±0.005), ensuring values remain within their original rounding interval. Values are cast through float16 before storage, locking them to FP16-resolution boundaries. This breaks the exact 2dp quantisation pattern that would otherwise allow backward-engineering to raw measurements, and embeds a deterministic cryptographic provenance signature into every row.
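The two sub-steps can be sketched end to end as below. The secret here is a placeholder (the real HSA_EXPORT_SECRET is private), and the conversion of the HMAC digest to an RNG seed integer is an assumption, since the spec only states that the digest seeds the generator:

```python
import hashlib
import hmac

import numpy as np

SECRET = b"example-secret"  # placeholder; real HSA_EXPORT_SECRET is never distributed

def watermark_column(values, col_name: str) -> np.ndarray:
    """Round to 2 dp, add deterministic seeded noise, cast through float16."""
    # Per-column seed: HMAC-SHA256(secret, col_name + "|HSA_v1_2026").
    digest = hmac.new(SECRET, (col_name + "|HSA_v1_2026").encode(), hashlib.sha256).digest()
    rng = np.random.default_rng(int.from_bytes(digest, "big"))
    # Sub-quantum noise: +/-0.004 is narrower than the +/-0.005 rounding bucket.
    noise = rng.uniform(-0.004, 0.004, size=len(values))
    rounded = np.round(np.asarray(values, dtype=np.float64), 2)
    # float16 cast locks values to FP16-resolution boundaries; stored as float32.
    return (rounded + noise).astype(np.float16).astype(np.float32)
```

Because the seed depends only on the column name and secret, re-running the function reproduces the identical noise vector, which is what makes the watermark verifiable later.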
Rows are grouped by the quasi-identifier tuple {gender, age_bucket, duration_bucket}. Any group with fewer than 5 members is suppressed entirely. Age is bucketed into broad brackets. Duration uses the 100ms bucketed value from Step 4.
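The suppression rule is standard k-anonymity filtering over the quasi-identifier tuple; a minimal pure-Python sketch (row format and function name assumed):

```python
from collections import Counter

def k_anonymise(rows, k=5):
    """Drop every row whose {gender, age_bucket, duration_bucket} group
    has fewer than k members."""
    key = lambda r: (r["gender"], r["age_bucket"], r["duration_bucket"])
    counts = Counter(key(r) for r in rows)
    return [r for r in rows if counts[key(r)] >= k]
```

On small language-corpus tables this filter can suppress most or all rows, which is the behaviour noted for Tashlhiyt and Irish in the Known Limitations.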
Three columns derived from the table name are populated: corpus, speech_type, and source_dataset.
The enforce_schema() function normalises column names, adds missing canonical columns as NULL, drops extraneous columns, and reorders all columns to the canonical fixed order.
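A per-row sketch of that behaviour, with a deliberately truncated canonical column list (the real function operates on whole tables and its name normalisation is richer than lower-casing):

```python
CANONICAL = ["clip_id", "corpus", "speech_type", "duration_ms"]  # illustrative subset

def enforce_schema(row: dict) -> dict:
    """Return a row with exactly the canonical columns, in canonical order.

    Missing canonical columns become None (NULL); extraneous columns are dropped.
    """
    normalised = {k.strip().lower(): v for k, v in row.items()}
    return {col: normalised.get(col) for col in CANONICAL}
```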
Files are written as Snappy-compressed Parquet with per-file footer metadata embedded in the schema: anon_standard (moonscape_k5_fp16_v1), watermark data, export_version, and prohibited-use clauses.
Exports are organised into a single gated HuggingFace repository with 12 named subset configurations, one per linguistic family tree. Each configuration maps to a subdirectory of Parquet files under data/:
Human_Speech_Atlas/
├── README.md
├── LICENSE.md
└── data/
├── Indo-European/
├── Niger-Congo/
├── Austronesian/
...
Users load by configuration name. The gating system requires institutional email verification and agreement to the Moonscape Tier 1 Academic License (Non-Commercial) for research access.
Intensity features (intensity_mean, intensity_max): Mozilla normalises all CV24 and SPS2 audio to −20 dBFS at source. These columns therefore measure Mozilla's normalisation algorithm, not speaker acoustic power. Cross-speaker intensity comparison is invalid.
Jitter and shimmer: Both metrics measure sub-millisecond glottal pulse variation — exactly the timescale degraded by MP3 codec quantisation. Values reflect a mixture of genuine vocal fold behaviour and codec artifact. HNR and CPPS are more robust proxies here.
nPVI and articulation_rate: Both return 0.0 in HSA v1. These rhythm features require phoneme-level boundary timing from forced alignment, pending Phase 2.
Tonal language pitch features: For Niger-Congo, Tai-Kadai, and Sino-Tibetan languages, pitch means and ranges capture lexical tone rather than prosodic stress.
Tashlhiyt Berber (shi): Tachelhit allows words composed entirely of consonants. MFA forced alignment output should be treated as experimental until a consonant-syllabic-aware pronunciation dictionary is validated.
SPS2 demographics: The Spontaneous Speech corpus was released without validated per-speaker demographic TSV files. Age, accent, and dialect fields are blank for SPS2 languages.
Small corpus k-anonymity: Several language-corpus tables have clip counts below or near the k=5 threshold, resulting in high suppression rates or complete suppression (e.g., Tashlhiyt and Irish).
Phase 2 will run the Montreal Forced Aligner (MFA) over the cream WAV sets for each language to produce phoneme-level boundary timing, unlocking the rhythm features that currently return 0.0 (nPVI and articulation_rate).
A cream manifest of demographically balanced 25-minute WAV sets per language is maintained internally for this purpose. Human-validated transcripts have been backfilled for approximately 314,000 rows, providing MFA input text without relying on STT. MFA execution is deferred until all language packs are complete and the transcript backfill is fully stable.
Any party holding the export secret can verify provenance of any extracted row:
seed = HMAC-SHA256(secret, col_name + "|HSA_v1_2026")
noise = numpy.random.default_rng(seed).uniform(-0.004, 0.004, n)
exported_value ≈ round(raw_value, 2) + noise[row_index], within FP16 resolution

This verification holds for any subset of any size extracted from the corpus, including single rows. The HSA and SSA watermarks use entirely independent secrets.
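A holder of the secret could re-derive the noise and check a row along these lines. The digest-to-integer seed conversion and the test secret below are illustrative assumptions; the published construction only specifies the HMAC inputs:

```python
import hashlib
import hmac

import numpy as np

def column_noise(secret: bytes, col_name: str, n_rows: int) -> np.ndarray:
    """Regenerate the deterministic per-column noise vector (sketch)."""
    digest = hmac.new(secret, (col_name + "|HSA_v1_2026").encode(), hashlib.sha256).digest()
    rng = np.random.default_rng(int.from_bytes(digest, "big"))
    return rng.uniform(-0.004, 0.004, n_rows)

def verify_value(secret: bytes, col_name: str, raw_value: float,
                 exported_value: float, row_index: int, n_rows: int) -> bool:
    """Check an exported value against the regenerated watermark noise,
    reproducing the float16 cast so the comparison is exact up to FP16."""
    noise = column_noise(secret, col_name, n_rows)
    expected = np.float32(np.float16(round(raw_value, 2) + noise[row_index]))
    return bool(abs(expected - np.float32(exported_value)) <= 1e-3)
```

Because the noise is indexed by row position within the column, verification works for any extracted subset as long as the original row index is known or can be searched.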