Authors: Devang S Ram Mohan, Raphael Lenain, Lorenzo Foglianti, Tian Huey Teh, Marlene Staib, Alexandra Torresquintero, Jiameng Gao
Abstract: Modern approaches to text-to-speech require the entire input character sequence to be processed before any audio is synthesised. This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation. Interleaving the action of reading a character with that of synthesising audio reduces this latency. However, the order of this sequence of interleaved actions varies across sentences, which raises the question of how the actions should be chosen. We propose a reinforcement-learning-based framework to train an agent to make this decision. We compare our performance against that of deterministic, rule-based systems. Our results demonstrate that our agent successfully balances the trade-off between the latency of audio generation and the quality of synthesised audio. More broadly, we show that neural sequence-to-sequence models can be adapted to run in an incremental manner.
Description:
The baseline Tacotron-2 model was modified to make it suitable for our incremental setting: we removed the bi-directional and convolutional layers in the encoder, as well as the post-net module. This resulted in a slight drop in performance of the Tacotron-2 architecture. We provide the corresponding samples here to establish a baseline.
Sample | Text | Comments | |
---|---|---|---|
1 | He read a book. | The word read is pronounced to rhyme with 'need', as opposed to the colour 'red'. | |
2 | She was the secretary of state. | The failure mode of Sample 1 is not consistent. A worst-case scenario would consist of the sub-word secret (of secretary) being pronounced sikrət. However, what we observe is that the word secretary is pronounced sɛkrəˌtɛri, suggesting that the attention (being the only component with access to future timesteps) adjusts to accommodate this. | |
3 | So long and thanks for all the fish. | Despite removing all the bi-directionality of the encoder model, the architecture is still able to capture natural features of speech. Observe in the sample the emphasis on the word all. |
Description:
We present samples from our validation set of LJ Speech, comparing the performance of our learnt Online Agent against the benchmark models. In addition to the audio samples themselves, we provide the attention plots (along with the associated policy path of the underlying agent) to enable the listener to better understand the behaviour of the agent.
Sentence text: this vehicle was occupied by three secret service agents and clifton c. garter, assistant to the vice president.
# | Model Name | Sample | Alignment Plot |
---|---|---|---|
1 | Wait Until End (WUE) | ||
2 | Wait 2 Steps (W2S) | ||
3 | Wait 3 Steps (W3S) | ||
4 | Online Agent |
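
To make the deterministic benchmarks concrete, the sketch below builds the interleaved READ/SPEAK action sequence for each rule. It is only an illustration under stated assumptions: this page does not spell out the exact "Wait k Steps" rule, so the sketch assumes it means "read one character, then synthesise k output frames, and repeat". The function names are hypothetical, and the number of output frames is passed in as a known value, whereas in the real system the decoder's gate output decides when to stop.

```python
from typing import List

READ, SPEAK = "READ", "SPEAK"


def wait_until_end(num_chars: int, num_frames: int) -> List[str]:
    """Read every character first, then synthesise every output frame."""
    return [READ] * num_chars + [SPEAK] * num_frames


def wait_k_steps(num_chars: int, num_frames: int, k: int) -> List[str]:
    """ASSUMED rule: read one character, speak k frames, repeat.

    Once the last character has been read, all remaining frames are spoken.
    """
    actions: List[str] = []
    chars_read = 0
    frames_spoken = 0
    while chars_read < num_chars or frames_spoken < num_frames:
        if chars_read < num_chars:
            actions.append(READ)
            chars_read += 1
        remaining = num_frames - frames_spoken
        # Speak k frames, or everything left once the input is exhausted.
        burst = remaining if chars_read == num_chars else min(k, remaining)
        actions.extend([SPEAK] * burst)
        frames_spoken += burst
    return actions


def reads_before_first_speak(actions: List[str]) -> int:
    """Crude latency proxy: characters consumed before any audio is produced."""
    return actions.index(SPEAK) if SPEAK in actions else len(actions)


if __name__ == "__main__":
    print(wait_until_end(3, 4))
    print(wait_k_steps(3, 10, 2))
```

Under this assumed rule, Wait Until End consumes the whole sentence before producing any audio, while a wait-k policy starts speaking after a single character. The learnt Online Agent replaces these fixed schedules with a decision made at every step.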
Description:
We present samples (not formally evaluated in the paper) from our validation set of SIWIS, comparing the performance of our learnt Online Agent against the benchmark models. In addition to the audio samples themselves, we provide the attention plots (along with the associated policy path of the underlying agent) to enable the listener to better understand the behaviour of the agent.
Compared to the English agent, observe that the learnt policy path appears to be 'further' from the prominent alignment diagonal. We hypothesise that this is related to the presence of liaisons and elisions in French.
Sentence text: Cet inexplicable personnage avait l’air d’un professeur de mathématiques qui fait une démonstration à ses élèves.
# | Model Name | Sample | Alignment Plot |
---|---|---|---|
1 | Wait Until End (WUE) | ||
2 | Wait 2 Steps (W2S) | ||
3 | Wait 3 Steps (W3S) * | ||
4 | Online Agent |
* The kink in the alignment arises because we had to forcibly override the gate output signal and perform a READ action to ensure that the entire sentence is decoded under this policy.
© 2023 Papercup Technologies Ltd.