Ensemble Prosody Prediction for Expressive Speech Synthesis

Authors: Tian Huey Teh*, Vivian Hu*, Devang S Ram Mohan, Zack Hodari, Christopher Wallis, Tomás Gómez Ibarrondo, Alexandra Torresquintero, James Leoni, Mark Gales, Simon King (*: contact)

Abstract: Generating expressive speech with rich and varied prosody continues to be a challenge for Text-to-Speech. Most efforts have focused on sophisticated neural architectures intended to better model the data distribution. Yet, in evaluations it is generally found that no single model is preferred for all input texts. This suggests an approach that has rarely been used before for Text-to-Speech: an ensemble of models.

We apply ensemble learning to prosody prediction. We construct simple ensembles of prosody predictors by varying either model architecture or model parameter values. To automatically select amongst the models in the ensemble when performing Text-to-Speech, we propose a novel, and computationally trivial, variance-based criterion.

We demonstrate that even a small ensemble of prosody predictors yields useful diversity, which, combined with the proposed selection criterion, outperforms any individual model from the ensemble.

Listener preference

We ran an A/B listening test to measure listener preference between the individual models in the ensemble (CONV and RNN). We then constructed a human ORACLE by choosing, for each utterance, the rendition that listeners preferred most.
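The ORACLE construction described above can be sketched as a per-utterance majority vote. This is only an illustration: the `votes` structure, the utterance ids, and the function name `build_oracle` are hypothetical, and the way ties were handled in the actual test is not stated here.

```python
from collections import Counter


def build_oracle(votes):
    """Pick, for each utterance, the model whose rendition got the most votes.

    votes: dict mapping an utterance id to the list of model names
           ("RNN" or "CONV") chosen by individual listeners (assumed format).
    """
    oracle = {}
    for utt_id, choices in votes.items():
        counts = Counter(choices)
        # most_common(1) returns the top (model, count) pair; on a tie it
        # returns the model that was counted first (an arbitrary choice here).
        oracle[utt_id] = counts.most_common(1)[0][0]
    return oracle


# Toy votes, invented for illustration only.
votes = {
    "utt1": ["RNN", "CONV", "RNN"],
    "utt2": ["CONV", "CONV", "RNN"],
}
oracle = build_oracle(votes)
```

With these toy votes, the ORACLE takes the RNN rendition for `utt1` and the CONV rendition for `utt2`.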

Have a listen to some of the samples produced by each model. (We encourage readers to listen to each rendition before revealing the “answer”.) What do you think? Do you agree with the crowd favorite? Was your preference based on intonation or some other factor?

[Audio samples: RNN and CONV renditions for the Male and Female speakers, with an option to show the ORACLE choice.]

Selection using variance of predicted F0 (AFP-F0)

Selecting between the models with the F0 variance-based criterion proposed in our paper matches listener preference more often than always using a single model. However, greater variance does not always correspond to the crowd's preference.
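As a rough illustration of variance-based selection, the sketch below picks, for one utterance, the model whose predicted F0 contour has the greatest variance. It is a minimal sketch under assumptions, not the paper's implementation: the function name `select_by_f0_variance`, the input format, the use of raw F0 in Hz, and the treatment of zero or non-finite values as unvoiced frames are all hypothetical choices made here.

```python
import numpy as np


def select_by_f0_variance(predictions):
    """Return the name of the model with the highest predicted-F0 variance.

    predictions: dict mapping a model name to a sequence of predicted F0
                 values for one utterance (assumed: Hz, with 0 or NaN
                 marking unvoiced frames).
    """
    best_name, best_var = None, -np.inf
    for name, f0 in predictions.items():
        f0 = np.asarray(f0, dtype=float)
        # Keep voiced frames only (assumption: unvoiced frames are 0 or NaN).
        voiced = f0[np.isfinite(f0) & (f0 > 0)]
        variance = float(np.var(voiced)) if voiced.size else 0.0
        if variance > best_var:
            best_name, best_var = name, variance
    return best_name


# Toy contours, invented for illustration only.
preds = {
    "RNN": [120.0, 125.0, 0.0, 118.0],
    "CONV": [100.0, 160.0, 0.0, 90.0],
}
chosen = select_by_f0_variance(preds)
```

Here the CONV contour varies far more than the RNN one, so the sketch selects CONV. The selection costs only one variance computation per model, which is consistent with the criterion being described as computationally trivial.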

Below, we’ve shared some samples where the selection criterion agrees with the ORACLE choice and some where it does not. Which do you agree with?

AFP-F0 agrees with ORACLE

[Audio samples: RNN and CONV renditions for the Male and Female speakers, with an option to show the ORACLE choice.]

AFP-F0 disagrees with ORACLE

[Audio samples: RNN and CONV renditions for the Male and Female speakers, with an option to show the ORACLE choice.]
