2024P The Processing of Stress in End-to-End Automatic Speech Recognition Models

Authors

Martijn Bentum, Louis ten Bosch & Tom Lentz

Abstract

Listeners use lexical stress to facilitate word recognition and speech segmentation. However, classical automatic speech recognition (ASR) models did not typically incorporate lexical stress in their recognition process. In contrast, end-to-end ASR models are trained in an unsupervised manner and may use the information carried by lexical stress.

The present study shows that Wav2vec 2.0 (an end-to-end ASR model) is indeed sensitive to lexical stress, and that this sensitivity is not a mere reflection of acoustic correlates of stress. Diagnostic classifiers trained on the convolutional neural network (CNN) output of the Wav2vec 2.0 model reveal vowel-specific stress representations that perform on par with acoustic features. Stress classifiers trained on the transformer layers of the Wav2vec 2.0 model outperform classifiers based on acoustic correlates but degrade when context is removed, showing that later layers of the model take the relative nature of stress into account.

Testing a lexical stress classifier on vowels it was not trained on shows that stress processing in the Wav2vec 2.0 model is to some extent abstract: the classifier does not simply detect a set of stressed vowel representations but rather their common denominator.
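
The held-out-vowel test described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic vectors rather than Wav2vec 2.0 embeddings, the probe is a simple nearest-centroid linear classifier rather than the classifier used in the study, and the function names and parameters (make_synthetic_vowels, stress_gain, and so on) are hypothetical. Each synthetic vowel gets its own offset, and stress adds one shared direction; if a probe trained on four vowels still classifies the fifth above chance, it is picking up that shared component rather than memorising vowel-specific patterns.

```python
import numpy as np

def make_synthetic_vowels(n_vowels=5, n_per_class=200, dim=16,
                          stress_gain=4.0, seed=0):
    """Synthetic stand-in for vowel embeddings (NOT real Wav2vec 2.0 output).

    Each vowel has its own random offset; stress adds a shared direction.
    """
    rng = np.random.default_rng(seed)
    stress_dir = np.zeros(dim)
    stress_dir[0] = 1.0  # assumed shared "stress" axis
    offsets = rng.normal(0.0, 0.5, size=(n_vowels, dim))
    X, y, vowel = [], [], []
    for v in range(n_vowels):
        for label in (0, 1):  # 0 = unstressed, 1 = stressed
            noise = rng.normal(0.0, 1.0, size=(n_per_class, dim))
            X.append(offsets[v] + label * stress_gain * stress_dir + noise)
            y.append(np.full(n_per_class, label))
            vowel.append(np.full(n_per_class, v))
    return np.concatenate(X), np.concatenate(y), np.concatenate(vowel)

def fit_centroid_probe(X, y):
    """Linear probe: project onto the difference of the two class means."""
    mu1 = X[y == 1].mean(axis=0)
    mu0 = X[y == 0].mean(axis=0)
    w = mu1 - mu0
    b = -w @ (mu1 + mu0) / 2.0  # decision boundary at the midpoint
    return w, b

def probe_accuracy(w, b, X, y):
    pred = (X @ w + b > 0).astype(int)
    return float((pred == y).mean())

def leave_one_vowel_out(X, y, vowel):
    """Train on all vowels but one, test on the held-out vowel."""
    scores = {}
    for held_out in np.unique(vowel):
        train = vowel != held_out
        w, b = fit_centroid_probe(X[train], y[train])
        scores[int(held_out)] = probe_accuracy(w, b, X[~train], y[~train])
    return scores

X, y, vowel = make_synthetic_vowels()
scores = leave_one_vowel_out(X, y, vowel)
for v, acc in scores.items():
    print(f"held-out vowel {v}: accuracy {acc:.2f}")
```

With real data, X would hold one embedding per vowel token from a chosen model layer, y the stress labels, and vowel the vowel identities; accuracy near chance on held-out vowels would instead point to vowel-specific stress representations.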

Publication type

Poster

Presentation

Abstract_DvdF2024_Bentum_etal.pdf (44.62 KB)

Year of publication

2024

Conference location

Utrecht

Conference name

Dag van de Fonetiek 2024

Publisher

Nederlandse Vereniging voor Fonetische Wetenschappen