DATA DICTIONARY

Human Speech Atlas: Canonical Schema

The tables below list the features extracted from each sample in the Human Speech Atlas (HSA), with a brief explanation of each feature. All HSA parquet files share an identical 46-column canonical schema regardless of language, language family, or source corpus.

Identity & Provenance

Column	Type	Description
clip_id	string	Anonymous sequential ID. Format: {lang_corpus_NNNNNN}
lang_code	string	ISO 639 language code, either two-letter (639-1) or three-letter (639-3) (e.g. ko, ha, btv)
lang_name	string	Human-readable language name
corpus	string	Source corpus: cv24 or sps2
speech_type	string	scripted (CV24) or spontaneous (SPS2)
source_dataset	string	Full source name (e.g. Mozilla Common Voice CV24.0)
sentence_domain	string	Text domain: wikipedia | news | etc.

Demographics

Column	Type	Description
gender	string	male | female | other | unknown
age	string	Age bracket (e.g. 20-29) or unknown
accent	string	Self-reported accent/dialect label
dialect_tag	string	Normalised dialect code
sample_type	string	cream_t1 | cream_t2 | fill_t3 | fill_t4

Temporal

Column	Type	Description
duration_ms	Int64	Clip duration bucketed to nearest 100ms
duration_s	float32	duration_ms / 1000
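
As a rough illustration of how the two temporal columns relate, the sketch below buckets a raw duration to the nearest 100 ms and derives the seconds value from it. The function names are hypothetical, and the tie-breaking rule (halves round up) is an assumption; the schema does not state how ties are resolved.

```python
def bucket_duration_ms(raw_ms: float, step_ms: int = 100) -> int:
    """Round a raw duration to the nearest multiple of step_ms.

    Assumption: exact halves round up (e.g. 4250 -> 4300).
    """
    return int((raw_ms + step_ms / 2) // step_ms) * step_ms


def duration_seconds(duration_ms: int) -> float:
    """duration_s as defined in the table: duration_ms / 1000."""
    return duration_ms / 1000


raw = 4237.0                      # hypothetical raw clip length in ms
ms = bucket_duration_ms(raw)      # 4200
print(ms, duration_seconds(ms))   # 4200 4.2
```

Because duration_s is derived from the bucketed value, it carries no more precision than duration_ms.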

Quality Gate (Brouhaha)

Column	Type	Description
tier	int	1=PRISTINE | 2=STUDIO | 3=AMBIENT | 4=TRASH
tier_label	string	PRISTINE | STUDIO | AMBIENT | TRASH
snr_median	float32	Median signal-to-noise ratio (dB)
snr_mean	float32	Mean SNR (dB)
c50_median	float32	Median room clarity C50 (dB)
speech_ratio	float32	Active speech fraction (0-1)

Acoustic Features

Note: Values are float32, FP16 watermarked.

Column	Units	Description
pitch_mean	Hz	Mean F0 (VAD-bounded, voiced frames only)
pitch_std	Hz	F0 standard deviation
pitch_range	Hz	F0 spread: 95th percentile minus 5th percentile
pitch_velocity_max	Hz/frame	Max F0 rate-of-change
jitter_local	%	Cycle-to-cycle period variation (MP3 fidelity caveat)
shimmer_local	%	Cycle-to-cycle amplitude variation (MP3 fidelity caveat)
hnr_mean	dB	Harmonics-to-noise ratio
cpps	—	Cepstral peak prominence smoothed
intensity_mean	dB	Mean intensity (normalised — see limitations)
intensity_max	dB	Peak intensity (normalised — see limitations)
intensity_range	dB	Dynamic range
intensity_velocity_max	dB/frame	Max intensity rate-of-change
spectral_centroid_mean	Hz	Mean spectral centroid
spectral_tilt	dB/kHz	Log-power spectrum slope
mfcc_delta_mean	—	Mean first-order MFCC delta
zcr_mean	—	Zero crossing rate
teo_mean	—	Mean Teager Energy Operator
teo_std	—	TEO standard deviation
f1_mean	Hz	Mean first formant
f2_mean	Hz	Mean second formant
f3_mean	Hz	Mean third formant
formant_dispersion	Hz	(F3-F1)/2 — vocal tract length proxy
npvi	—	Normalised Pairwise Variability Index (0.0 pending MFA)
articulation_rate	syl/s	Syllable rate (0.0 pending MFA)
emotion_score	0-1	Composite vocal intensity score
syllable_count_approx	int	Vowel-count syllable proxy
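
Two of the derived columns above are simple reductions, and a small sketch may make their definitions concrete. The code below computes pitch_range as the 95th minus the 5th percentile of voiced F0 values, and formant_dispersion as the mean spacing between adjacent formants, which for three formants reduces to (F3-F1)/2. The function names and the linear-interpolation percentile rule are assumptions for illustration, not the extraction code used for the atlas.

```python
def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100] (assumed rule)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def pitch_range(f0_voiced):
    """95th minus 5th percentile of F0 over voiced frames (Hz)."""
    return percentile(f0_voiced, 95) - percentile(f0_voiced, 5)


def formant_dispersion(f1, f2, f3):
    """Mean spacing between adjacent formants; equals (f3 - f1) / 2."""
    spacings = [f2 - f1, f3 - f2]
    return sum(spacings) / len(spacings)


print(formant_dispersion(500.0, 1500.0, 2500.0))  # 1000.0
```

Using percentiles rather than the raw minimum and maximum keeps pitch_range from being dominated by isolated octave errors in F0 tracking.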

Known Limitations

intensity_mean / intensity_max: Mozilla normalises source audio to -20 dBFS, so these columns carry no usable loudness information. Cross-speaker intensity comparison is invalid.

jitter_local / shimmer_local: MP3 compression degrades the sub-millisecond cycle-to-cycle measurements these features depend on. HNR and CPPS are more robust alternatives for this corpus.

npvi / articulation_rate: Both columns hold a placeholder value of 0.0 until Phase 2 phoneme alignment with MFA (Montreal Forced Aligner) is complete.

Tonal languages: In tonal languages of the Niger-Congo, Tai-Kadai, and Sino-Tibetan families, pitch_mean, pitch_std, and pitch_range largely reflect lexical tone rather than prosodic stress.
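
Because npvi and articulation_rate currently hold 0.0 as a placeholder rather than a measurement, it may be safer to mask them before computing any statistics. The sketch below does this for a single row represented as a dict. The helper name mask_pending and the choice of None as the missing-value marker are assumptions for illustration.

```python
PENDING_COLUMNS = ("npvi", "articulation_rate")


def mask_pending(row: dict) -> dict:
    """Return a copy of row with placeholder 0.0 values replaced by None.

    Only the columns documented as pending MFA alignment are touched;
    a genuine 0.0 in any other column is left alone.
    """
    cleaned = dict(row)
    for col in PENDING_COLUMNS:
        if cleaned.get(col) == 0.0:
            cleaned[col] = None
    return cleaned


# Hypothetical row with only a few of the 46 columns shown.
row = {"clip_id": "example", "npvi": 0.0, "articulation_rate": 0.0, "hnr_mean": 12.5}
print(mask_pending(row))
```

Once Phase 2 alignment populates these columns, the mask becomes a no-op for rows with real values.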

Quality Tiers

Tier	Label	SNR	C50	Speech ratio
1	PRISTINE	>= 35 dB	>= 35 dB	>= 0.30
2	STUDIO	>= 25 dB	>= 20 dB	>= 0.30
3	AMBIENT	>= 10 dB	any	>= 0.10
4	TRASH	< 10 dB	any	< 0.10
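
The thresholds in the table can be read as a top-down cascade, sketched below. This is one interpretation under stated assumptions: tiers are tested from 1 to 4, any clip that fails the AMBIENT thresholds falls through to TRASH, and the SNR and C50 inputs are single summary values per clip (the table does not say whether the median or the mean is used). The function name assign_tier is hypothetical.

```python
TIER_LABELS = {1: "PRISTINE", 2: "STUDIO", 3: "AMBIENT", 4: "TRASH"}


def assign_tier(snr_db: float, c50_db: float, speech_ratio: float) -> int:
    """Map Brouhaha-style quality metrics to a tier number (1 is best)."""
    if snr_db >= 35 and c50_db >= 35 and speech_ratio >= 0.30:
        return 1
    if snr_db >= 25 and c50_db >= 20 and speech_ratio >= 0.30:
        return 2
    # AMBIENT places no constraint on C50.
    if snr_db >= 10 and speech_ratio >= 0.10:
        return 3
    # Assumption: everything else is TRASH, including clips with
    # adequate SNR but too little active speech.
    return 4


tier = assign_tier(snr_db=30.0, c50_db=25.0, speech_ratio=0.4)
print(tier, TIER_LABELS[tier])  # 2 STUDIO
```

Treating tier 4 as the fall-through case avoids leaving clips unclassified, for example one with an SNR of 15 dB and a speech ratio of 0.05, which satisfies neither the AMBIENT row nor a strict reading of the TRASH row.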

Extraction Methodology Project Details