
Speaker-Aware Simulation Improves Conversational Speech Recognition

Abstract: Automatic speech recognition (ASR) for conversational speech remains challenging because large-scale, well-annotated multi-speaker dialogue data are scarce and natural interactions have complex temporal dynamics. Speaker-aware simulated conversations (SASC) offer an effective data augmentation strategy by transforming single-speaker recordings into realistic multi-speaker dialogues. However, prior work has focused primarily on English data, leaving open the question of whether the approach applies to lower-resource languages. In this paper, we adapt and implement the SASC framework for Hungarian conversational ASR. We further propose C-SASC, an extended variant that incorporates pause modeling conditioned on utterance duration, enabling a more faithful representation of the local temporal dependencies observed in human conversation while retaining the simplicity and efficiency of the original approach. We generate synthetic Hungarian dialogues from the BEA-Large corpus and combine them with real conversational data for ASR training. Both SASC and C-SASC are evaluated extensively under a wide range of simulation configurations, using conversational statistics derived from the CallHome, BEA-Dialogue, and GRASS corpora. Experimental results show that speaker-aware conversational simulation consistently improves recognition performance over naive concatenation-based augmentation. The additional duration conditioning in C-SASC yields modest but systematic gains, most notably in character-level error rates, but its effectiveness depends on the match between the source conversational statistics and the target domain. Overall, our findings confirm the robustness of speaker-aware conversational simulation for Hungarian ASR and highlight the benefits and limitations of increasingly detailed temporal modeling in synthetic dialogue generation.
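To make the idea of duration-conditioned pause modeling concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that each single-speaker utterance is described by a speaker label and a duration in seconds, that gaps between consecutive utterances are drawn from empirical samples grouped by the duration bin of the preceding utterance, and that a negative gap denotes overlap. The names `duration_bin`, `simulate_timeline`, `gaps_by_bin`, and `bin_edges` are hypothetical, and the choice to condition on the preceding utterance is an assumption made here for illustration.

```python
import bisect
import random


def duration_bin(duration, bin_edges):
    """Return the index of the duration bin (bin_edges are ascending, in seconds)."""
    return bisect.bisect_right(bin_edges, duration)


def simulate_timeline(utterances, gaps_by_bin, bin_edges, rng):
    """Place utterances on a shared timeline with duration-conditioned gaps.

    utterances  : list of (speaker, duration_seconds) tuples (assumed input format)
    gaps_by_bin : dict mapping bin index -> list of observed gaps in seconds;
                  negative values represent overlapping speech (assumption)
    bin_edges   : ascending duration boundaries that define the bins
    rng         : random.Random instance, passed in for reproducibility
    """
    timeline = []
    prev_end = 0.0
    prev_duration = None
    for speaker, duration in utterances:
        if prev_duration is None:
            start = 0.0  # the first utterance opens the conversation
        else:
            # Sample a gap from the bin of the preceding utterance's duration.
            candidates = gaps_by_bin[duration_bin(prev_duration, bin_edges)]
            gap = rng.choice(candidates)
            start = max(0.0, prev_end + gap)
        end = start + duration
        timeline.append((speaker, start, end))
        prev_end = end
        prev_duration = duration
    return timeline


# Toy statistics: short utterances are followed by a pause, long ones by overlap.
toy_gaps = {0: [0.5], 1: [-0.2]}
toy_utterances = [("A", 1.0), ("B", 3.0), ("A", 1.0)]
toy_timeline = simulate_timeline(toy_utterances, toy_gaps, [2.0], random.Random(0))
```

In practice the gap samples would be estimated from a real conversational corpus; the toy values above are placeholders chosen only to show the mechanics. An unconditioned variant would correspond to using a single bin (an empty `bin_edges` list).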

Springer Science and Business Media LLC

Máté Gedeon, Péter Mihajlik

2026


