DATA DICTIONARY

Human Speech Atlas: Canonical Schema

The tables below list the features extracted from each sample in the Human Speech Atlas (HSA), with a brief explanation of each. All HSA parquet files share an identical 46-column canonical schema regardless of language, family tree, or source corpus.
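
One way to confirm that a file follows this schema is to compare its column names against the 46 names listed in the tables of this dictionary. The sketch below assumes the column names have already been read from the file (for example from its parquet schema). `check_schema` is a hypothetical helper written for illustration, not part of any HSA tooling, and the check ignores column order because this dictionary does not specify one.

```python
# Column names copied from the tables in this data dictionary.
CANONICAL_COLUMNS = [
    # Identity & Provenance
    "clip_id", "lang_code", "lang_name", "corpus", "speech_type",
    "source_dataset", "sentence_domain",
    # Demographics
    "gender", "age", "accent", "dialect_tag", "sample_type",
    # Temporal
    "duration_ms", "duration_s",
    # Quality Gate (Brouhaha)
    "tier", "tier_label", "snr_median", "snr_mean", "c50_median",
    "speech_ratio",
    # Acoustic Features
    "pitch_mean", "pitch_std", "pitch_range", "pitch_velocity_max",
    "jitter_local", "shimmer_local", "hnr_mean", "cpps",
    "intensity_mean", "intensity_max", "intensity_range",
    "intensity_velocity_max", "spectral_centroid_mean", "spectral_tilt",
    "mfcc_delta_mean", "zcr_mean", "teo_mean", "teo_std",
    "f1_mean", "f2_mean", "f3_mean", "formant_dispersion",
    "npvi", "articulation_rate", "emotion_score", "syllable_count_approx",
]
assert len(CANONICAL_COLUMNS) == 46


def check_schema(columns):
    """Return (missing, unexpected) names relative to the canonical set.

    Hypothetical helper: `columns` is any iterable of column-name strings.
    """
    present = set(columns)
    canonical = set(CANONICAL_COLUMNS)
    missing = [c for c in CANONICAL_COLUMNS if c not in present]
    unexpected = sorted(present - canonical)
    return missing, unexpected
```

A file passes when both returned lists are empty.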

Identity & Provenance

Column           Type    Description
clip_id          string  Anonymous sequential ID. Format: {lang_corpus_NNNNNN}
lang_code        string  ISO 639 language code; examples include two-letter 639-1 codes (ko, ha) and three-letter 639-3 codes (btv)
lang_name        string  Human-readable language name
corpus           string  Source corpus: cv24 or sps2
speech_type      string  scripted (CV24) or spontaneous (SPS2)
source_dataset   string  Full source name (e.g. Mozilla Common Voice CV24.0)
sentence_domain  string  Text domain: wikipedia | news | etc.

Demographics

Column       Type    Description
gender       string  male | female | other | unknown
age          string  Age bracket (e.g. 20-29) or unknown
accent       string  Self-reported accent/dialect label
dialect_tag  string  Normalised dialect code
sample_type  string  cream_t1 | cream_t2 | fill_t3 | fill_t4

Temporal

Column       Type     Description
duration_ms  Int64    Clip duration bucketed to nearest 100 ms
duration_s   float32  duration_ms / 1000
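
The relationship between the two temporal columns can be sketched as follows. The function names are hypothetical, and the handling of exact midpoints (for example 1250 ms) is an assumption: the sketch rounds them up, but this dictionary does not say how ties are resolved.

```python
def bucket_duration_ms(raw_ms):
    """Round a raw duration in milliseconds to the nearest 100 ms.

    Assumption: exact midpoints round up (1250 -> 1300).
    """
    return int((raw_ms + 50) // 100) * 100


def duration_seconds(duration_ms):
    """Convert the bucketed millisecond value to seconds."""
    return duration_ms / 1000
```

Because duration_s is stored as float32, comparing it against a value computed in double precision is safer with a small tolerance than with exact equality.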

Quality Gate (Brouhaha)

Column        Type     Description
tier          int      1=PRISTINE | 2=STUDIO | 3=AMBIENT | 4=TRASH
tier_label    string   PRISTINE | STUDIO | AMBIENT | TRASH
snr_median    float32  Median signal-to-noise ratio (dB)
snr_mean      float32  Mean SNR (dB)
c50_median    float32  Median room clarity C50 (dB)
speech_ratio  float32  Active speech fraction (0-1)

Acoustic Features

Note: Values are float32, FP16 watermarked.

Column                  Units     Description
pitch_mean              Hz        Mean F0 (VAD-bounded, voiced frames only)
pitch_std               Hz        F0 standard deviation
pitch_range             Hz        95th-5th percentile F0
pitch_velocity_max      Hz/frame  Max F0 rate-of-change
jitter_local            %         Cycle-to-cycle period variation (MP3 fidelity caveat)
shimmer_local           %         Cycle-to-cycle amplitude variation (MP3 fidelity caveat)
hnr_mean                dB        Harmonics-to-noise ratio
cpps                    -         Cepstral peak prominence smoothed
intensity_mean          dB        Mean intensity (normalised; see Known Limitations)
intensity_max           dB        Peak intensity (normalised; see Known Limitations)
intensity_range         dB        Dynamic range
intensity_velocity_max  dB/frame  Max intensity rate-of-change
spectral_centroid_mean  Hz        Mean spectral centroid
spectral_tilt           dB/kHz    Log-power spectrum slope
mfcc_delta_mean         -         Mean first-order MFCC delta
zcr_mean                -         Zero crossing rate
teo_mean                -         Mean Teager Energy Operator
teo_std                 -         TEO standard deviation
f1_mean                 Hz        Mean first formant
f2_mean                 Hz        Mean second formant
f3_mean                 Hz        Mean third formant
formant_dispersion      Hz        (F3-F1)/2; vocal tract length proxy
npvi                    -         Normalised Pairwise Variability Index (0.0 pending MFA)
articulation_rate       syl/s     Syllable rate (0.0 pending MFA)
emotion_score           0-1       Composite vocal intensity score
syllable_count_approx   count     Vowel-count syllable proxy (stored as int)
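
Two of the derived columns have definitions simple enough to sketch directly: pitch_range (95th minus 5th percentile of F0) and formant_dispersion ((F3-F1)/2). The code below is an illustration under stated assumptions, not the extraction pipeline itself: the function names are hypothetical, and the percentile uses linear interpolation between closest ranks because this dictionary does not say which interpolation method was used.

```python
def percentile(values, q):
    """Percentile q (0-100) with linear interpolation between ranks.

    Assumption: the interpolation method is not specified in the source.
    """
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


def pitch_range(f0_voiced_hz):
    """95th minus 5th percentile of F0 over voiced frames, in Hz."""
    return percentile(f0_voiced_hz, 95) - percentile(f0_voiced_hz, 5)


def formant_dispersion(f1_mean_hz, f3_mean_hz):
    """(F3 - F1) / 2: the average spacing between the first three formants."""
    return (f3_mean_hz - f1_mean_hz) / 2
```

Using percentiles rather than the raw minimum and maximum keeps isolated pitch-tracking errors from dominating the range.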

Known Limitations

intensity_mean / intensity_max: Mozilla normalises source audio to -20 dBFS, so these columns carry no usable loudness information. Cross-speaker intensity comparison is invalid.

jitter_local / shimmer_local: MP3 compression degrades sub-millisecond glottal measurements. HNR and CPPS are more robust alternatives for this corpus.

npvi / articulation_rate: Both return 0.0 until Phase 2 MFA phoneme alignment is complete.

Tonal languages: In Niger-Congo, Tai-Kadai, and Sino-Tibetan languages, pitch_mean, pitch_std, and pitch_range largely reflect lexical tone rather than prosodic stress.
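
Before analysis, it can help to blank out the values these limitations flag so they are not mistaken for real measurements. The sketch below works on a single record held as a plain dict. `clean_row` and the two constant names are hypothetical, and setting values to None is only one reasonable choice.

```python
# 0.0 is a placeholder in these columns until MFA alignment is available.
PLACEHOLDER_COLUMNS = ("npvi", "articulation_rate")
# Source audio is normalised, so these are not comparable across speakers.
NORMALISED_COLUMNS = ("intensity_mean", "intensity_max")


def clean_row(row):
    """Return a copy of one record with unusable values set to None."""
    cleaned = dict(row)
    for name in PLACEHOLDER_COLUMNS:
        if cleaned.get(name) == 0.0:
            cleaned[name] = None
    for name in NORMALISED_COLUMNS:
        if name in cleaned:
            cleaned[name] = None
    return cleaned
```

The original record is left unchanged, so the raw values remain available if a later release fills in the placeholder columns.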

Quality Tiers

Tier  Label     SNR       C50       Speech ratio
1     PRISTINE  >= 35 dB  >= 35 dB  >= 0.30
2     STUDIO    >= 25 dB  >= 20 dB  >= 0.30
3     AMBIENT   >= 10 dB  any       >= 0.10
4     TRASH     < 10 dB   any       < 0.10
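
The thresholds in this table can be read as a cascade, sketched below. Two assumptions are made that the table does not confirm: the inputs are taken to be snr_median, c50_median, and speech_ratio, and any clip that fails the tier 3 conditions on either SNR or speech ratio falls through to tier 4. `assign_tier` is a hypothetical name, not the project's gating code.

```python
TIER_LABELS = {1: "PRISTINE", 2: "STUDIO", 3: "AMBIENT", 4: "TRASH"}


def assign_tier(snr_db, c50_db, speech_ratio):
    """Map quality measurements to a tier number (1 is best, 4 is worst).

    Assumption: tiers are tested in order and the first match wins.
    """
    if snr_db >= 35 and c50_db >= 35 and speech_ratio >= 0.30:
        return 1
    if snr_db >= 25 and c50_db >= 20 and speech_ratio >= 0.30:
        return 2
    if snr_db >= 10 and speech_ratio >= 0.10:  # C50 is unconstrained here
        return 3
    return 4
```

Under this reading, a clean recording with little speech (for example SNR 40 dB, C50 40 dB, speech ratio 0.20) lands in tier 3, and `TIER_LABELS[assign_tier(...)]` gives the matching tier_label string.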