The tables below list the features extracted from each sample in our Human Speech Atlas (HSA), with a brief explanation of each feature. All HSA parquet files share an identical 46-column canonical schema regardless of language, language family, or source corpus.

| Column | Type | Description |
|---|---|---|
| clip_id | string | Anonymous sequential ID. Format: {lang_corpus_NNNNNN} |
| lang_code | string | ISO 639 language code: two-letter ISO 639-1 or three-letter ISO 639-3 (e.g. ko, ha, btv) |
| lang_name | string | Human-readable language name |
| corpus | string | Source corpus: cv24 or sps2 |
| speech_type | string | scripted (CV24) or spontaneous (SPS2) |
| source_dataset | string | Full source name (e.g. Mozilla Common Voice CV24.0) |
| sentence_domain | string | Text domain: wikipedia, news, etc. |

| Column | Type | Description |
|---|---|---|
| gender | string | male, female, other, or unknown |
| age | string | Age bracket (e.g. 20-29) or unknown |
| accent | string | Self-reported accent/dialect label |
| dialect_tag | string | Normalised dialect code |
| sample_type | string | cream_t1, cream_t2, fill_t3, or fill_t4 |

| Column | Type | Description |
|---|---|---|
| duration_ms | Int64 | Clip duration in milliseconds, bucketed to the nearest 100 ms |
| duration_s | float32 | Clip duration in seconds (duration_ms / 1000) |

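The relationship between the two duration columns can be sketched as follows. This is an illustration only: the helper names `bucket_duration_ms` and `duration_seconds` are hypothetical, and the tie-breaking rule (exact halves round up) is an assumption, since the schema only says "nearest 100 ms".

```python
def bucket_duration_ms(raw_ms: float, step_ms: int = 100) -> int:
    """Round a raw duration to the nearest multiple of step_ms.

    Assumption: exact halves (e.g. 3450 ms) round up; the schema does not
    say how ties are resolved.
    """
    if raw_ms < 0:
        raise ValueError("duration must be non-negative")
    return int((raw_ms + step_ms / 2) // step_ms) * step_ms


def duration_seconds(duration_ms: int) -> float:
    """Derive duration_s from duration_ms, as described in the table."""
    return duration_ms / 1000


bucketed = bucket_duration_ms(3449)
print(bucketed, duration_seconds(bucketed))
```

Because of the bucketing, two clips whose true lengths differ by less than 100 ms can share the same duration_ms value.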
| Column | Type | Description |
|---|---|---|
| tier | int | 1 = PRISTINE, 2 = STUDIO, 3 = AMBIENT, 4 = TRASH |
| tier_label | string | PRISTINE, STUDIO, AMBIENT, or TRASH |
| snr_median | float32 | Median signal-to-noise ratio (dB) |
| snr_mean | float32 | Mean SNR (dB) |
| c50_median | float32 | Median room clarity C50 (dB) |
| speech_ratio | float32 | Active speech fraction (0-1) |

Note: Values in the table below are float32, FP16 watermarked.

| Column | Units | Description |
|---|---|---|
| pitch_mean | Hz | Mean F0 (VAD-bounded, voiced frames only) |
| pitch_std | Hz | F0 standard deviation |
| pitch_range | Hz | F0 range: 95th percentile minus 5th percentile |
| pitch_velocity_max | Hz/frame | Max F0 rate-of-change |
| jitter_local | % | Cycle-to-cycle period variation (MP3 fidelity caveat) |
| shimmer_local | % | Cycle-to-cycle amplitude variation (MP3 fidelity caveat) |
| hnr_mean | dB | Harmonics-to-noise ratio |
| cpps | — | Cepstral peak prominence smoothed |
| intensity_mean | dB | Mean intensity (normalised — see limitations) |
| intensity_max | dB | Peak intensity (normalised — see limitations) |
| intensity_range | dB | Dynamic range |
| intensity_velocity_max | dB/frame | Max intensity rate-of-change |
| spectral_centroid_mean | Hz | Mean spectral centroid |
| spectral_tilt | dB/kHz | Log-power spectrum slope |
| mfcc_delta_mean | — | Mean first-order MFCC delta |
| zcr_mean | — | Zero crossing rate |
| teo_mean | — | Mean Teager Energy Operator |
| teo_std | — | TEO standard deviation |
| f1_mean | Hz | Mean first formant |
| f2_mean | Hz | Mean second formant |
| f3_mean | Hz | Mean third formant |
| formant_dispersion | Hz | (F3-F1)/2 — vocal tract length proxy |
| npvi | — | Normalised Pairwise Variability Index (0.0 pending MFA) |
| articulation_rate | syl/s | Syllable rate (0.0 pending MFA) |
| emotion_score | 0-1 | Composite vocal intensity score |
| syllable_count_approx | count | Vowel-count syllable proxy (int) |

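The tables above list 46 columns in total (7 + 5 + 2 + 6 + 26), which matches the stated schema size. A minimal sketch of a schema check is shown below. The names `CANONICAL_COLUMNS` and `check_schema` are hypothetical, and the check compares names as a set because the on-disk column order is not stated here.

```python
# Column names copied from the tables above, grouped as they appear there.
CANONICAL_COLUMNS = [
    "clip_id", "lang_code", "lang_name", "corpus", "speech_type",
    "source_dataset", "sentence_domain",
    "gender", "age", "accent", "dialect_tag", "sample_type",
    "duration_ms", "duration_s",
    "tier", "tier_label", "snr_median", "snr_mean", "c50_median",
    "speech_ratio",
    "pitch_mean", "pitch_std", "pitch_range", "pitch_velocity_max",
    "jitter_local", "shimmer_local", "hnr_mean", "cpps",
    "intensity_mean", "intensity_max", "intensity_range",
    "intensity_velocity_max", "spectral_centroid_mean", "spectral_tilt",
    "mfcc_delta_mean", "zcr_mean", "teo_mean", "teo_std",
    "f1_mean", "f2_mean", "f3_mean", "formant_dispersion",
    "npvi", "articulation_rate", "emotion_score", "syllable_count_approx",
]


def check_schema(columns):
    """Return (missing, unexpected) column names relative to the list above."""
    expected = set(CANONICAL_COLUMNS)
    actual = set(columns)
    return sorted(expected - actual), sorted(actual - expected)


missing, unexpected = check_schema(CANONICAL_COLUMNS)
print(len(CANONICAL_COLUMNS), missing, unexpected)
```

With pyarrow installed, `pyarrow.parquet.read_schema(path).names` is one way to obtain the column list for a given file without loading the data.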

Limitations:

intensity_mean / intensity_max: Mozilla normalises source audio to -20 dBFS, so these columns carry no usable level information. Cross-speaker intensity comparison is invalid.

jitter_local / shimmer_local: MP3 compression degrades sub-millisecond glottal measurements. HNR and CPPS are more robust alternatives for this corpus.

npvi / articulation_rate: Both return 0.0 until Phase 2 MFA phoneme alignment is complete.

Tonal languages: In tonal languages of the Niger-Congo, Tai-Kadai, and Sino-Tibetan families, pitch_mean, pitch_std, and pitch_range primarily reflect lexical tone, not prosodic stress.

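One way to keep these caveats from leaking into downstream analysis is to clean each row before use. The sketch below assumes a row is a plain dict keyed by column name. `clean_row` and the two tuples are hypothetical names, and dropping the intensity columns outright is a design choice, not something the dataset prescribes.

```python
# Columns that currently hold a 0.0 placeholder rather than a measurement.
PLACEHOLDER_COLUMNS = ("npvi", "articulation_rate")
# Columns flattened by source-level loudness normalisation.
NORMALISED_COLUMNS = ("intensity_mean", "intensity_max")


def clean_row(row: dict) -> dict:
    """Return a copy of row with placeholder values masked and
    normalised intensity columns removed. The input is not modified."""
    cleaned = dict(row)
    for name in PLACEHOLDER_COLUMNS:
        if cleaned.get(name) == 0.0:
            cleaned[name] = None  # treat the placeholder as missing
    for name in NORMALISED_COLUMNS:
        cleaned.pop(name, None)
    return cleaned


example = {"npvi": 0.0, "articulation_rate": 0.0,
           "intensity_mean": -20.0, "hnr_mean": 12.5}
print(clean_row(example))
```

Masking with None rather than leaving 0.0 in place prevents the placeholders from pulling down means or being read as genuinely zero rhythm variability.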
| Tier | Label | SNR | C50 | Speech ratio |
|---|---|---|---|---|
| 1 | PRISTINE | >= 35 dB | >= 35 dB | >= 0.30 |
| 2 | STUDIO | >= 25 dB | >= 20 dB | >= 0.30 |
| 3 | AMBIENT | >= 10 dB | any | >= 0.10 |
| 4 | TRASH | < 10 dB | any | < 0.10 |
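The thresholds above can be read as a top-down rule list. The sketch below makes that reading explicit and rests on several assumptions that the table does not confirm: rows are tested in order and the first match wins, tier 4 is the fallback for anything that fails tiers 1 to 3, and the SNR and C50 inputs correspond to snr_median and c50_median. The function name `assign_tier` is hypothetical.

```python
def assign_tier(snr_db: float, c50_db: float, speech_ratio: float):
    """Map quality measurements to (tier, tier_label).

    Assumptions: first matching row wins; TRASH is the fallback.
    """
    if snr_db >= 35 and c50_db >= 35 and speech_ratio >= 0.30:
        return 1, "PRISTINE"
    if snr_db >= 25 and c50_db >= 20 and speech_ratio >= 0.30:
        return 2, "STUDIO"
    if snr_db >= 10 and speech_ratio >= 0.10:  # C50: any
        return 3, "AMBIENT"
    return 4, "TRASH"


print(assign_tier(40.0, 36.0, 0.5))
print(assign_tier(30.0, 30.0, 0.05))
```

Under this reading, a clip with high SNR but very little speech (speech ratio below 0.10) still falls to tier 4, which is one plausible interpretation of the last row.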