A Deep Neural Network Model of Audiovisual Speech Recognition Reports the McGurk Effect
Abstract
In the McGurk effect, perception of an auditory syllable changes dramatically when it is paired with an incongruent visual syllable, countering our intuition that speech perception is solely an auditory process. The dominant modeling framework for the study of audiovisual speech perception is that of Bayesian causal inference, but current Bayesian models are unable to predict the wide range of percepts evoked by McGurk syllables. We explored whether a deep neural network (DNN) known as AVHuBERT could provide an alternative modeling framework. AVHuBERT model variants were presented with McGurk syllables consisting of auditory “ba” paired with visual “ga” recorded from 8 different talkers. AVHuBERT identified McGurk syllables as something other than “ba” at a rate of 59%, demonstrating a robust McGurk effect. The rate of the McGurk effect was similar to that observed in humans: 100 participants presented with the same McGurk syllables reported non-“ba” percepts on 56% of trials. AVHuBERT variants and humans produced a wide variety of responses to McGurk syllables, including the canonical McGurk fusion percept of “da”, responses without any initial consonant such as “ah”, and responses with other initial consonants such as “fa”. The ability to predict percepts experienced by humans but not predicted by current Bayesian models suggests that DNNs and Bayesian models may provide complementary windows into the perceptual mechanisms underlying human audiovisual speech perception.
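The McGurk rate reported above is the share of responses that differ from the auditory syllable “ba”. The following is a minimal sketch of that arithmetic, assuming responses are available as plain syllable strings. The function names, the category labels, and the response list are hypothetical and invented for illustration; they are not the study's code or data.

```python
from collections import Counter

VOWELS = "aeiou"


def mcgurk_rate(responses, auditory="ba"):
    """Return the fraction of responses that differ from the auditory syllable."""
    if not responses:
        raise ValueError("need at least one response")
    non_auditory = sum(1 for r in responses if r.strip().lower() != auditory)
    return non_auditory / len(responses)


def categorize(response, auditory="ba", fusion="da"):
    """Sort one response into the coarse categories named in the abstract."""
    r = response.strip().lower()
    if not r:
        raise ValueError("empty response")
    if r == auditory:
        return "auditory"
    if r == fusion:
        return "fusion"
    if r[0] in VOWELS:
        return "no initial consonant"
    return "other initial consonant"


# Hypothetical responses to one McGurk stimulus (illustration only).
responses = ["ba", "da", "da", "ah", "fa", "ba", "da", "ba", "ga", "da"]

rate = mcgurk_rate(responses)
counts = Counter(categorize(r) for r in responses)

print(f"McGurk rate: {rate:.0%}")
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

Applying the same two functions to model outputs and to human reports would give directly comparable rates, which is the kind of comparison the 59% and 56% figures express.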

