A Deep Neural Network Model of Audiovisual Speech Recognition Reports the McGurk Effect

Abstract In the McGurk effect, perception of an auditory syllable changes dramatically when it is paired with an incongruent visual syllable, countering our intuition that speech perception is solely an auditory process. The dominant modeling framework for the study of audiovisual speech perception is that of Bayesian causal inference, but current Bayesian models are unable to predict the wide range of percepts evoked by McGurk syllables. We explored whether a deep neural network (DNN) known as AVHuBERT could provide an alternative modeling framework. AVHuBERT model variants were presented with McGurk syllables consisting of auditory “ba” paired with visual “ga” recorded from 8 different talkers. AVHuBERT identified McGurk syllables as something other than “ba” at a rate of 59%, demonstrating a robust McGurk effect. The rate of the McGurk effect was similar to that observed in humans: 100 participants presented with the same McGurk syllables reported non-“ba” percepts on 56% of trials. AVHuBERT variants and humans produced a wide variety of responses to McGurk syllables, including the canonical McGurk fusion percept of “da”, responses without any initial consonant such as “ah”, and responses with other initial consonants such as “fa”. The ability to predict percepts experienced by humans but not predicted by current Bayesian models suggests that DNNs and Bayesian models may provide complementary windows into the perceptual mechanisms underlying human audiovisual speech perception.
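The abstract reports the McGurk effect as the rate at which responses differ from the auditory syllable “ba”, and groups non-“ba” responses into fusion (“da”), no initial consonant (e.g. “ah”), and other initial consonants (e.g. “fa”). A minimal sketch of that bookkeeping is given below. The function names, the vowel-based rule for detecting a missing initial consonant, and the toy response list are assumptions made for illustration only; they are not the authors' analysis code, and the toy responses are not data from the study.

```python
from collections import Counter


def mcgurk_rate(responses, auditory="ba"):
    """Fraction of responses that differ from the auditory syllable."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(r != auditory for r in responses) / len(responses)


def categorize(response, auditory="ba", fusion="da"):
    """Assign a response to one of the categories named in the abstract.

    The vowel check is a simplifying assumption for romanized syllables.
    """
    if response == auditory:
        return "auditory"
    if response == fusion:
        return "fusion"
    if response[0] in "aeiou":
        return "no initial consonant"
    return "other consonant"


# Made-up responses for illustration only; NOT data from the study.
toy_responses = ["ba", "da", "da", "ah", "fa", "ba", "da", "ba"]

rate = mcgurk_rate(toy_responses)
counts = Counter(categorize(r) for r in toy_responses)

print(f"non-'ba' rate: {rate:.3f}")
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

The same two functions could be applied separately to model outputs and to human responses so that the two rates and response distributions are compared on equal terms.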

openRxiv

Haotian Ma, Zhengjia Wang, Xiang Zhang, John F Magnotti, Michael S Beauchamp

2025

Related Results

Research on visual and audiovisual speech information has profoundly influenced the fields of psycholinguistics, perception psychology, and cognitive neuroscience. Visual speech fi...

Audiovisual translation and media accessibility training in the EMT network

The increase in demand for the localisation of audiovisual media content has led to increased incorporation of audiovisual translation and accessibility modules into university cur...

Audiovisual Speech Perception in Aging Cochlear Implant Users and Age-Matched Non-Implanted Adults

Objectives. Older typical-hearing adults without a cochlear-implant (CI) have been found to exhibit greater multisensory benefits when identifying audiovisual speech than younger n...

Multimodal Emotion Recognition and Human Computer Interaction for AI-Driven Mental Health Support (Preprint)

BACKGROUND Mental health has become one of the most urgent global health issues of the twenty-first century. The World Health Organization (WHO) reports tha...

Les illusions McGurk dans la parole : 25 ans de recherches

Summary: The McGurk illusions in speech: 25 years of research. When presented with an auditory /b/ dubbed onto a visual /g/, listeners sometimes perceive a fused phoneme like ...

McGurk doesn’t work: Individual differences and task demands explain the McGurk illusion

Visual speech cues play an important role in speech recognition, and the McGurk effect is a classic demonstration of this. In the original McGurk and MacDonald (1976) experiment, 9...

Neural Speech-Tracking During Selective Attention: A Spatially Realistic Audiovisual Study

Abstract Paying attention to a target talker in multi-talker scenarios is associated with its more accurate neural-tracking relative to competing non-target speech....
