
An overview of Microsoft’s Whistler text-to-speech system

The data-driven approach can significantly facilitate the process of creating text-to-speech (TTS) systems for a new language, a new voice, or a new style. The Whistler TTS engine was therefore designed to benefit from automatically constructed model parameters. This overview reviews efforts to improve Whistler through additional training data and better learning algorithms that make full use of these data. Training data have been augmented for a number of speakers. To make better use of these data, the hidden Markov model speech recognition system was used to segment the training corpora and select more representative acoustic units. Classification and regression trees were used for both grapheme-to-phoneme conversion and generalization to unseen triphones. Speech signal reconstruction was based on a mixed-excitation source-filter model, which leads to better compression of the acoustic inventory. Several ways of smoothing the spectral parameters were also studied to minimize concatenation distortion. To improve the automatically extracted prosodic templates, the learning process was refined with an analysis-by-synthesis approach. However, coverage remains a challenge if the data-driven approach is to make Whistler produce synthetic speech that resembles the original speaker. This is especially true for the prosody model.
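The abstract does not say which spectral smoothing methods were studied, so the following is only an illustrative sketch of the general idea, not Whistler's method. It assumes each acoustic unit is a list of spectral parameter frames and spreads the mismatch at the join linearly over a few neighbouring frames. The function name `smooth_join` and the `width` parameter are hypothetical.

```python
from typing import List

Frame = List[float]  # one spectral parameter vector (assumed representation)


def smooth_join(left: List[Frame], right: List[Frame], width: int = 3) -> List[Frame]:
    """Concatenate two units, spreading the boundary mismatch over nearby frames.

    Hypothetical illustration: both boundary frames are moved to their
    midpoint, and the correction fades linearly to zero over `width` frames.
    """
    if not left or not right:
        return [list(f) for f in left + right]

    # Mismatch between the last frame of the left unit and the first of the right.
    gap = [r - l for l, r in zip(left[-1], right[0])]

    out_left = [list(f) for f in left]    # copy so the inputs are not mutated
    out_right = [list(f) for f in right]
    n_left = min(width, len(left))
    n_right = min(width, len(right))

    for k in range(n_left):
        # k = 0 is the boundary frame; the weight falls linearly away from it.
        w = 0.5 * (1 - k / n_left)
        idx = len(left) - 1 - k
        out_left[idx] = [v + w * g for v, g in zip(out_left[idx], gap)]

    for k in range(n_right):
        w = 0.5 * (1 - k / n_right)
        out_right[k] = [v - w * g for v, g in zip(out_right[k], gap)]

    return out_left + out_right


if __name__ == "__main__":
    unit_a = [[0.0], [0.0], [0.0], [0.0]]
    unit_b = [[4.0], [4.0], [4.0], [4.0]]
    print(smooth_join(unit_a, unit_b, width=2))
```

A larger `width` gives a gentler transition but alters more of each unit, so in practice it would be a trade-off between removing the discontinuity and preserving the recorded spectrum.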

Related Results

Perception advantages of foreign directed speech
Foreign directed speech (FDS) is a listener directed speech style used when native speakers interact with non-native listeners of a language. This study considers if native and non...
Developmental Links Between Speech Perception in Noise, Singing, and Cortical Processing of Music in Children with Cochlear Implants
The perception of speech in noise is challenging for children with cochlear implants (CIs). Singing and musical instrument playing have been associated with improved auditory skill...
Surrogate Speech of the Asante Ivory Trumpeters of Ghana
Surrogate speech is a phonological system by which word tones of a spoken language are represented in tones produced on a musical instrument. Ethnomusicologists regard this as a mu...
Speech in “Paradise Lost”
In the sixteenth and seventeenth centuries several treatises (religious, philosophical, and rhetorical) discussed the Fall of Man as involving a corruption ...
Boosting Speech-to-Text software potential
The article focuses on finding ways of boosting efficiency and accuracy of Speech-to-Text (STT)-powered input. The effort is triggered by the growing popularity of the software amo...
In Memoriam: Ralph L. Vanderslice and Gunnar Fant
RALPH L. VANDERSLICE, who contributed to many areas of phonetics, died on 24 August 2008, aged 78, in Portland, Oregon. He was born on 2 January 1930 in South Bend, Indiana. He rec...
Noise Levels on Aircraft Carrier Flight Decks and Their Effects on Humans
Measurements were made of noise levels produced by four aircraft during pilot qualification exercises aboard the flight deck of USS KITTY HAWK. These measurements, on both the A- a...
Illustrations et modèles mentaux dans la compréhension de textes
Summary: Illustrations and mental models in text comprehension. We know that graphics in texts can be effective for learning, but we do not have much knowledge about how text ...
