The Human Speech Atlas (HSA) is a multilingual acoustic telemetry corpus produced by Moonscape Software from two CC0-licensed Mozilla speech datasets: Common Voice CV24.0 (scripted read speech) and Spontaneous Speech SPS2.0 (unscripted conversational speech). It covers 90+ languages across 12 typologically diverse linguistic family trees, yielding approximately 531,000 exportable utterance-level feature rows.
The corpus is a feature-only product. No audio is distributed. Each row represents one spoken clip described by 38 hand-crafted classical signal-processing features plus metadata, extracted through a four-pass pipeline and hardened through a three-stage anonymisation and watermarking protocol before export.
The HSA serves as a companion to the Synthetic Speech Atlas (SSA), which provides parallel acoustic telemetry for deepfake detection research. Together they form a unified acoustic telemetry platform spanning natural and synthetic speech.
Common Voice is a crowd-sourced read-speech corpus distributed by the Mozilla Foundation under CC0-1.0. Contributors record prompted sentences drawn from validated text corpora (Wikipedia, news, civic domains) on personal devices in uncontrolled environments. Each clip comes with a TSV demographics file containing self-reported gender, age bracket, and accent/dialect labels where the contributor chose to provide them.
The HSA ingests CV24.0 as the primary scripted-speech source. Clips are tagged speech_type = scripted.
The Mozilla Spontaneous Speech corpus provides unscripted conversational recordings under CC0-1.0. SPS2.0 lacks validated per-speaker TSV demographics files, so gender and age fields default to unknown unless inferred from acoustic pitch during processing. Clips are tagged speech_type = spontaneous.
All 158 language-corpus tables (language × source corpus pairs) in the internal macroprosody.db database are classified into one of 12 linguistic family trees for HuggingFace distribution. The trees are:
Indo-European, Niger-Congo, Austronesian, Mesoamerican, Americas-Other, Afro-Asiatic, Nilo-Saharan, Trans-New-Guinea, Eurasian-Minor, Turkic, Asian-Minor, and Isolates.
Classification follows established genealogical taxonomy. Contested or uncertain cases (e.g. Lasi/lss as Sindhic Indo-Aryan, Ukuriguma/ukv as TNG-Madang) were resolved through expert linguistic consultation.
Source audio passes through four sequential processing stages before any feature is stored.
Mozilla distributes CV24 audio as 128 kbps MP3. Each file is converted to 16 kHz mono WAV using ffmpeg; the audio arrives already loudness-normalised to −20 dBFS. This normalisation is applied by Mozilla at source for CV24 and is a known characteristic of the corpus that affects intensity-related features (see Section 6, Known Limitations).
Each WAV is graded using Brouhaha (Lavechin et al., Interspeech 2022), a multi-task neural model that produces three per-clip estimates:

- SNR: signal-to-noise ratio (dB)
- C50: room clarity index (dB), the ratio of early to late energy
- Speech ratio: fraction of frames classified as active speech by VAD
Clips are assigned to one of four quality tiers based on simultaneous threshold satisfaction:
| Tier | Label | SNR | C50 | Speech Ratio |
|---|---|---|---|---|
| 1 | PRISTINE | ≥ 35 dB | ≥ 35 dB | ≥ 0.30 |
| 2 | STUDIO | ≥ 25 dB | ≥ 20 dB | ≥ 0.30 |
| 3 | AMBIENT | ≥ 10 dB | any | ≥ 0.10 |
| 4 | TRASH | < 10 dB | any | < 0.10 |
The Tier 2 (STUDIO) gate requires all three conditions simultaneously. This was a deliberate design decision following pipeline bugs BUG-029 and BUG-030, which admitted high-reverb and silence-heavy clips by evaluating SNR and C50 independently.
Tier 1 and 2 clips (PRISTINE and STUDIO) form the cream set — the primary analytical population. Tier 3 and 4 clips are retained in the internal database but excluded from standard exports.
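The tier gates above can be read as a simple cascade, checked from strictest to loosest. The function below is an illustrative sketch of that logic (the function name and signature are hypothetical, not the pipeline's actual code):

```python
def quality_tier(snr_db: float, c50_db: float, speech_ratio: float) -> str:
    """Assign a clip to a quality tier; each gate requires all of its
    conditions simultaneously (see BUG-029/BUG-030 rationale)."""
    if snr_db >= 35 and c50_db >= 35 and speech_ratio >= 0.30:
        return "PRISTINE"   # Tier 1
    if snr_db >= 25 and c50_db >= 20 and speech_ratio >= 0.30:
        return "STUDIO"     # Tier 2
    if snr_db >= 10 and speech_ratio >= 0.10:
        return "AMBIENT"    # Tier 3
    return "TRASH"          # Tier 4
```

Note that a clip with high SNR but poor C50 falls through to AMBIENT rather than STUDIO, which is exactly the behaviour the joint gate was introduced to enforce.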
Acoustic features are extracted from each WAV using Parselmouth (Python bindings for Praat) and librosa. Extraction is bounded by Brouhaha's VAD output: features are computed over voiced/active frames only, not over silence or ambient noise. The 38 features span six domains.
Cream clips (T1+T2) are loaded into macroprosody.db, a SQLite database with one table per language-corpus pair (158 tables total). A cream manifest maintains a demographically balanced 25-minute representative WAV set per language for use in Phase 2 MFA forced alignment.
All exports apply a ten-step hardening sequence. Steps are applied in the order listed below.
Before transcript removal, an approximate syllable count is computed from the transcript using a vowel-count heuristic. This proxy is retained in the export as syllable_count_approx, providing a structural length signal without exposing the actual text.
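A vowel-run heuristic of this kind can be sketched as follows; the exact heuristic used by the pipeline is not published, so the regex and the treatment of "y" here are assumptions:

```python
import re

def syllable_count_approx(transcript: str) -> int:
    """Approximate syllables as the number of consecutive-vowel runs.

    Crude ASCII sketch: counts maximal runs of a-e-i-o-u-y, so
    "hello world" -> 3 ("e", "o", "o"). Non-empty text scores at least 1.
    """
    if not transcript.strip():
        return 0
    runs = re.findall(r"[aeiouy]+", transcript.lower())
    return max(1, len(runs))
```

Only this integer survives into the export; the transcript itself is removed in a later step.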
Original Mozilla file identifiers are replaced with salted sequential clip IDs in the format {lang_corpus_NNNNNN}. A private mapping table is retained locally and is never distributed.
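A minimal sketch of the ID format, assuming the salt acts as a per-table offset on the sequence counter (the actual salting scheme and salt value are private and not published):

```python
import hashlib

def clip_id(lang: str, corpus: str, seq: int, salt: bytes = b"example-salt") -> str:
    """Format a salted sequential clip ID like 'en_cv_000042'.

    Hypothetical scheme: a per-table offset derived from the salt is added
    to the sequence number so exported IDs do not reveal insertion order.
    """
    table = f"{lang}_{corpus}"
    offset = int.from_bytes(hashlib.sha256(salt + table.encode()).digest()[:3], "big")
    return f"{table}_{(seq + offset) % 1_000_000:06d}"
```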
The following columns are removed entirely: client_id, sentence_id, source_file, wav_path, transcript, validated, gender_source, alignment_method, phoneme_count, db_inserted_at. This eliminates all direct speaker linkage, sentence linkage, and internal pipeline metadata.
duration_ms is rounded to the nearest 100ms. This prevents cross-referencing exported rows against known clip durations from the source corpus.
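The bucketing is a plain round-to-nearest-100 operation, e.g.:

```python
def bucket_duration_ms(duration_ms: int) -> int:
    """Round a clip duration to the nearest 100 ms bucket (sketch;
    helper name is illustrative)."""
    return int(round(duration_ms / 100.0)) * 100
```

The same bucketed value is reused as the duration component of the k-anonymity quasi-identifier in a later step.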
All 38 continuous acoustic variables are processed in two sub-steps. First, values are rounded to 2 decimal places. Second, each 2dp-rounded value is reinflated with deterministic seeded noise derived via HMAC-SHA256:
seed = HMAC-SHA256(HSA_EXPORT_SECRET, col_name + "|HSA_v1_2026")
noise ~ Uniform(−0.004, +0.004)
value = float16(round(raw, 2) + noise) → stored as float32
The noise magnitude (±0.004) is chosen to be narrower than the 2dp quantisation bucket (±0.005), ensuring values remain within their original rounding interval. Values are cast through float16 before storage, locking them to FP16-resolution boundaries. This breaks the exact 2dp quantisation pattern that would otherwise allow backward-engineering to raw measurements, and embeds a deterministic cryptographic provenance signature into every row.
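The two sub-steps can be sketched end to end as below. The secret here is a placeholder (the real HSA_EXPORT_SECRET is private), and the conversion of the HMAC digest to an RNG seed integer is an assumption, since the spec only states that the digest seeds the generator:

```python
import hashlib
import hmac

import numpy as np

SECRET = b"example-secret"  # placeholder; real HSA_EXPORT_SECRET is never distributed

def watermark_column(values, col_name: str) -> np.ndarray:
    """Round to 2 dp, add deterministic seeded noise, cast through float16."""
    # Per-column seed: HMAC-SHA256(secret, col_name + "|HSA_v1_2026").
    digest = hmac.new(SECRET, (col_name + "|HSA_v1_2026").encode(), hashlib.sha256).digest()
    rng = np.random.default_rng(int.from_bytes(digest, "big"))
    # Sub-quantum noise: +/-0.004 is narrower than the +/-0.005 rounding bucket.
    noise = rng.uniform(-0.004, 0.004, size=len(values))
    rounded = np.round(np.asarray(values, dtype=np.float64), 2)
    # float16 cast locks values to FP16-resolution boundaries; stored as float32.
    return (rounded + noise).astype(np.float16).astype(np.float32)
```

Because the seed depends only on the column name and secret, re-running the function reproduces the identical noise vector, which is what makes the watermark verifiable later.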
Rows are grouped by the quasi-identifier tuple {gender, age_bucket, duration_bucket}. Any group with fewer than 5 members is suppressed entirely. Age is bucketed into broad brackets. Duration uses the 100ms bucketed value from Step 4.
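The suppression rule is standard k-anonymity filtering over the quasi-identifier tuple; a minimal pure-Python sketch (row format and function name assumed):

```python
from collections import Counter

def k_anonymise(rows, k=5):
    """Drop every row whose {gender, age_bucket, duration_bucket} group
    has fewer than k members."""
    key = lambda r: (r["gender"], r["age_bucket"], r["duration_bucket"])
    counts = Counter(key(r) for r in rows)
    return [r for r in rows if counts[key(r)] >= k]
```

On small language-corpus tables this filter can suppress most or all rows, which is the behaviour noted for Tashlhiyt and Irish in the Known Limitations.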
Three columns derived from the table name are populated: corpus, speech_type, and source_dataset.
The enforce_schema() function normalises column names, adds missing canonical columns as NULL, drops extraneous columns, and reorders all columns to the canonical fixed order.
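A per-row sketch of that behaviour, with a deliberately truncated canonical column list (the real function operates on whole tables and its name normalisation is richer than lower-casing):

```python
CANONICAL = ["clip_id", "corpus", "speech_type", "duration_ms"]  # illustrative subset

def enforce_schema(row: dict) -> dict:
    """Return a row with exactly the canonical columns, in canonical order.

    Missing canonical columns become None (NULL); extraneous columns are dropped.
    """
    normalised = {k.strip().lower(): v for k, v in row.items()}
    return {col: normalised.get(col) for col in CANONICAL}
```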
Files are written as Snappy-compressed Parquet with per-file footer metadata embedded in the schema: anon_standard (moonscape_k5_fp16_v1), watermark data, export_version, and prohibited-use clauses.
Exports are organised into a single gated HuggingFace repository with 12 named subset configurations, one per linguistic family tree. Each configuration maps to a subdirectory of Parquet files under data/:
Human_Speech_Atlas/
├── README.md
├── LICENSE.md
└── data/
├── Indo-European/
├── Niger-Congo/
├── Austronesian/
...
Users load by configuration name. The gating system requires institutional email verification and agreement to the Moonscape Tier 1 Academic License (Non-Commercial) for research access.
Intensity features (intensity_mean, intensity_max): Mozilla normalises all CV24 and SPS2 audio to −20 dBFS at source. These columns therefore measure Mozilla's normalisation algorithm, not speaker acoustic power. Cross-speaker intensity comparison is invalid.
Jitter and shimmer: Both metrics measure sub-millisecond glottal pulse variation — exactly the timescale degraded by MP3 codec quantisation. Values reflect a mixture of genuine vocal fold behaviour and codec artifact. HNR and CPPS are more robust proxies here.
nPVI and articulation_rate: Both return 0.0 in HSA v1. These rhythm features require phoneme-level boundary timing from forced alignment, pending Phase 2.
Tonal language pitch features: For Niger-Congo, Tai-Kadai, and Sino-Tibetan languages, pitch means and ranges capture lexical tone rather than prosodic stress.
Tashlhiyt Berber (shi): Tachelhit allows words composed entirely of consonants. MFA forced alignment output should be treated as experimental until a consonant-syllabic-aware pronunciation dictionary is validated.
SPS2 demographics: The Spontaneous Speech corpus was released without validated per-speaker demographic TSV files. Age, accent, and dialect fields are blank for SPS2 languages.
Small corpus k-anonymity: Several language-corpus tables have clip counts below or near the k=5 threshold, resulting in high suppression rates or complete suppression (e.g., Tashlhiyt and Irish).
Phase 2 will run the Montreal Forced Aligner (MFA) over the cream WAV sets for each language to produce phoneme-level boundary timing, unlocking the rhythm features that currently return 0.0 (nPVI and articulation_rate).
A cream manifest of demographically balanced 25-minute WAV sets per language is maintained internally for this purpose. Human-validated transcripts have been backfilled for approximately 314,000 rows, providing MFA input text without relying on STT. MFA execution is deferred until all language packs are complete and the transcript backfill is fully stable.
Any party holding the export secret can verify provenance of any extracted row:
seed = HMAC-SHA256(secret, col_name + "|HSA_v1_2026")
noise = numpy.random.default_rng(seed).uniform(-0.004, 0.004, n)
exported_value ≈ round(raw_value, 2) + noise[row_index], within FP16 resolution

This verification holds for any subset of any size extracted from the corpus, including single rows. The HSA and SSA watermarks use entirely independent secrets.
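A holder of the secret could re-derive the noise and check a row along these lines. The digest-to-integer seed conversion and the test secret below are illustrative assumptions; the published construction only specifies the HMAC inputs:

```python
import hashlib
import hmac

import numpy as np

def column_noise(secret: bytes, col_name: str, n_rows: int) -> np.ndarray:
    """Regenerate the deterministic per-column noise vector (sketch)."""
    digest = hmac.new(secret, (col_name + "|HSA_v1_2026").encode(), hashlib.sha256).digest()
    rng = np.random.default_rng(int.from_bytes(digest, "big"))
    return rng.uniform(-0.004, 0.004, n_rows)

def verify_value(secret: bytes, col_name: str, raw_value: float,
                 exported_value: float, row_index: int, n_rows: int) -> bool:
    """Check an exported value against the regenerated watermark noise,
    reproducing the float16 cast so the comparison is exact up to FP16."""
    noise = column_noise(secret, col_name, n_rows)
    expected = np.float32(np.float16(round(raw_value, 2) + noise[row_index]))
    return bool(abs(expected - np.float32(exported_value)) <= 1e-3)
```

Because the noise is indexed by row position within the column, verification works for any extracted subset as long as the original row index is known or can be searched.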