Abstract
Listeners use lexical stress to facilitate word recognition and speech segmentation. However, classical automatic speech recognition (ASR) models did not typically incorporate lexical stress in their recognition process. In contrast, end-to-end ASR models are trained in an unsupervised manner and may use the information carried by lexical stress.
The present study shows that Wav2vec 2.0 (an end-to-end ASR model) is indeed sensitive to lexical stress, and that this sensitivity is not a mere reflection of the acoustic correlates of stress. Diagnostic classifiers trained on the convolutional neural network (CNN) output of the Wav2vec 2.0 model reveal vowel-specific stress representations that perform on par with acoustic features. Stress classifiers trained on the transformer layers of the Wav2vec 2.0 model outperform classifiers based on acoustic correlates but degrade when context is removed, showing that later layers of the model take the relative nature of stress into account.
Results obtained by testing a lexical stress classifier on vowels it was not trained on show that stress processing in the Wav2vec 2.0 model is to some extent abstract, i.e., the classifier does not simply detect a set of stressed vowel representations but rather their common denominator.
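The held-out-vowel test described above can be illustrated with a minimal sketch. It does not reproduce the study's pipeline: the embeddings are synthetic stand-ins generated with NumPy (no Wav2vec 2.0 model or speech data is involved), the probe is a plain logistic regression fitted by gradient descent, and every name, dimension and parameter value below is a hypothetical choice made for illustration. The synthetic data assume that stress shifts all vowels along one shared direction, which is the kind of abstract representation the test is designed to detect.

```python
import numpy as np


def make_synthetic_embeddings(n_per_cell=200, dim=16, n_vowels=4, seed=0):
    """Hypothetical stand-in for per-vowel model embeddings (not real data).

    Each vowel gets its own mean vector; stress adds one direction that is
    shared by all vowels (an assumption made purely for illustration).
    """
    rng = np.random.default_rng(seed)
    vowel_means = rng.normal(0.0, 0.5, size=(n_vowels, dim))
    stress_dir = rng.normal(size=dim)
    stress_dir *= 4.0 / np.linalg.norm(stress_dir)  # shared stress offset
    X, stress, vowel = [], [], []
    for v in range(n_vowels):
        for s in (0, 1):  # 0 = unstressed, 1 = stressed
            noise = rng.normal(0.0, 0.5, size=(n_per_cell, dim))
            X.append(vowel_means[v] + s * stress_dir + noise)
            stress.append(np.full(n_per_cell, s))
            vowel.append(np.full(n_per_cell, v))
    return np.vstack(X), np.concatenate(stress), np.concatenate(vowel)


def fit_logistic_probe(X, y, lr=0.1, steps=500):
    """Fit a linear stress probe with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30.0, 30.0)  # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        err = p - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def accuracy(w, b, X, y):
    """Fraction of items whose predicted stress label matches y."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


def held_out_vowel_accuracy(X, stress, vowel, held_out):
    """Train on all vowels except `held_out`, then test on that vowel only."""
    train = vowel != held_out
    w, b = fit_logistic_probe(X[train], stress[train])
    return (accuracy(w, b, X[train], stress[train]),
            accuracy(w, b, X[~train], stress[~train]))


X, stress, vowel = make_synthetic_embeddings()
train_acc, held_out_acc = held_out_vowel_accuracy(X, stress, vowel, held_out=3)
print(f"train accuracy: {train_acc:.2f}, held-out vowel accuracy: {held_out_acc:.2f}")
```

In this sketch, high accuracy on the held-out vowel follows from the shared stress direction built into the synthetic data. If stress were instead encoded separately for each vowel, a probe of this kind would be expected to transfer poorly to an unseen vowel.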
Publication type
Poster
Presentation
Abstract_DvdF2024_Bentum_etal.pdf
(44.62 KB)
Year of publication
2024
Conference location
Utrecht
Conference name
Dag van de Fonetiek 2024
Publisher
Nederlandse Vereniging voor Fonetische Wetenschappen