Comparison of Modern Deep Learning Models for Speaker Verification
This study presents a comparative analysis of a selection of popular deep speaker embedding models, namely WavLM, TitaNet, ECAPA, and PyAnnote, applied to speaker verification. The evaluation uses a specially curated dataset designed to mirror the real-world operating conditions of voice models as closely as possible. It consists of short, non-English utterances gathered from interviews on a popular online video platform and covers 50 unique speakers (33 male and 17 female) aged 20 to 70. Each speaker contributes 10 clips, each no longer than 10 s, for 500 recordings in total. The recordings total about 1 h 30 min, or roughly 100 s per speaker, which makes the dataset particularly suited to research on speaker verification with short recordings. Model performance is evaluated using common biometric metrics: false acceptance rate (FAR), false rejection rate (FRR), equal error rate (EER), and detection cost function (DCF). TitaNet and ECAPA stand out with the lowest EER (1.91% and 1.71%, respectively), indicating more discriminative embeddings: the distance between embeddings of the same speaker (intra-class distance) is reduced, while the distance between embeddings of different speakers is maximized. The analysis also highlights the ECAPA model's favorable balance of performance and efficiency, with an inference time of 69.43 ms, slightly longer than that of the PyAnnote models.
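As an illustration of the metrics named above, the following minimal sketch computes FAR, FRR, an EER estimate, and a detection cost from two lists of verification scores. It is not the authors' evaluation code: the function names, the toy scores, and the cost parameters (`p_target`, `c_miss`, `c_fa`) are assumptions made for the example, and the abstract does not state which DCF parameters were used.

```python
from typing import List, Tuple


def far_frr(genuine: List[float], impostor: List[float],
            threshold: float) -> Tuple[float, float]:
    """Return (FAR, FRR) when a trial is accepted if score >= threshold."""
    # FAR: fraction of impostor (different-speaker) trials wrongly accepted.
    far = sum(s >= threshold for s in impostor) / len(impostor)
    # FRR: fraction of genuine (same-speaker) trials wrongly rejected.
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine: List[float],
                     impostor: List[float]) -> Tuple[float, float]:
    """Scan every observed score as a threshold and return (EER, threshold)
    at the point where FAR and FRR are closest to each other."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def detection_cost(far: float, frr: float, p_target: float = 0.01,
                   c_miss: float = 1.0, c_fa: float = 1.0) -> float:
    """Weighted sum of miss and false-alarm costs (parameter values assumed)."""
    return c_miss * frr * p_target + c_fa * far * (1.0 - p_target)


# Toy similarity scores, invented for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]

eer, eer_threshold = equal_error_rate(genuine_scores, impostor_scores)
far, frr = far_frr(genuine_scores, impostor_scores, eer_threshold)
print(f"EER={eer:.2%} at threshold {eer_threshold}")
print(f"DCF={detection_cost(far, frr):.4f}")
```

Scanning only the observed scores gives a coarse EER estimate on small trial lists; with many trials the gap between FAR and FRR at the chosen threshold becomes negligible.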
The study not only compares model performance but also analyzes the embeddings each model produces, offering insight into their strengths and weaknesses. The findings provide a foundation for future research in speaker verification, especially with short audio samples or limited data, and may be particularly relevant to applications that require quick and accurate speaker identification from short voice clips.
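To make the idea of comparing embeddings concrete, the sketch below scores every pair of embeddings with cosine similarity and separates same-speaker (genuine) pairs from different-speaker (impostor) pairs. The abstract does not specify the scoring function or the trial protocol, so cosine similarity and exhaustive pairing are assumptions here, and the two-dimensional vectors are invented stand-ins for real model embeddings.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def trial_scores(
    embeddings: Dict[str, List[Sequence[float]]]
) -> Tuple[List[float], List[float]]:
    """Score all clip pairs; return (genuine_scores, impostor_scores)."""
    labelled = [(spk, emb) for spk, embs in embeddings.items() for emb in embs]
    genuine, impostor = [], []
    for (spk1, emb1), (spk2, emb2) in combinations(labelled, 2):
        score = cosine_similarity(emb1, emb2)
        # Same speaker label -> genuine trial, otherwise impostor trial.
        (genuine if spk1 == spk2 else impostor).append(score)
    return genuine, impostor


# Hypothetical 2-D embeddings for two speakers with two clips each.
toy_embeddings = {
    "spk_a": [[1.0, 0.0], [0.8, 0.6]],
    "spk_b": [[0.0, 1.0], [-0.6, 0.8]],
}

genuine, impostor = trial_scores(toy_embeddings)
print("mean genuine similarity: ", sum(genuine) / len(genuine))
print("mean impostor similarity:", sum(impostor) / len(impostor))
```

A model with more discriminative embeddings yields a larger gap between the two means. If all pairs were scored on a dataset of 50 speakers with 10 clips each, this pairing would give 50 × 45 = 2,250 genuine trials and 124,750 − 2,250 = 122,500 impostor trials.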