Phonological Features for 0-shot Multilingual Speech Synthesis

Authors: Marlene Staib, Tian Huey Teh, Alexandra Torresquintero, Devang S Ram Mohan, Lorenzo Foglianti, Raphael Lenain, Jiameng Gao

Abstract: Code-switching, the intra-utterance use of multiple languages, is prevalent across the world. Within text-to-speech (TTS), multilingual models have been found to enable code-switching [1,2,3]. By modifying the linguistic input to sequence-to-sequence TTS, we show that code-switching is possible for languages unseen during training, even within monolingual models. We use a small set of phonological features derived from the International Phonetic Alphabet (IPA), such as vowel height and frontness, and consonant place and manner. This allows the model topology to stay unchanged for different languages, and enables new, previously unseen feature combinations to be interpreted by the model. We show that this allows us to generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
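As a rough illustration of the idea in the abstract, the sketch below encodes a few vowels as binary phonological feature vectors. The feature inventory (`FEATURES`) and the phoneme descriptions (`PHONEMES`) are simplified assumptions made for this example, not the feature set used in the paper. It shows that [y], which occurs in neither English nor Spanish, is composed entirely of feature values that already occur in seen vowels such as [i] and [u].

```python
# Hypothetical, simplified feature inventory (an assumption, not the paper's set).
FEATURES = ["vowel", "close", "mid", "open", "front", "back", "rounded"]

# Illustrative IPA-style descriptions of a few vowels.
PHONEMES = {
    "i": {"vowel", "close", "front"},
    "u": {"vowel", "close", "back", "rounded"},
    "e": {"vowel", "mid", "front"},
    "o": {"vowel", "mid", "back", "rounded"},
    "y": {"vowel", "close", "front", "rounded"},  # treated as unseen in training
}


def encode(phoneme):
    """Return a binary vector with one slot per entry in FEATURES."""
    active = PHONEMES[phoneme]
    return [1 if feature in active else 0 for feature in FEATURES]


def features_covered(unseen, seen):
    """True if every active feature of `unseen` is active in some seen phoneme."""
    seen_active = set().union(*(PHONEMES[p] for p in seen))
    return PHONEMES[unseen] <= seen_active


seen = ["i", "u", "e", "o"]
print(encode("y"))
print(features_covered("y", seen))
```

Because the input dimensionality depends only on the number of features, not on the number of phonemes, the same input layer can in principle be reused when a new language introduces new phonemes.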

0-shot German from multilingual Spanish-English models

Description:

The following samples are produced with the multilingual Spanish-English models, trained on a combination of the CSTR VCTK corpus and our proprietary Mexican Spanish Adrianex corpus (referred to as the MIX data condition in the paper). Sentences are chosen from randomly sampled Wikipedia articles in German. We compare three 0-shot methods for producing unseen phonemes: 1) phonological features (AUTO, our proposed method), and two baselines: 2) a manual, linguistic mapping to previously seen phonemes (MAPPED), and 3) a new, randomly initialized embedding (RANDOM).
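The three ways of representing an unseen phoneme can be contrasted in a small sketch. Everything here is hypothetical: the per-feature vectors stand in for a learned projection of the feature input, `MANUAL_MAP` is an invented example of an expert mapping, and, for brevity, seen phonemes are built from features in all three conditions. It is meant only to show how the three conditions differ, not how the paper's models are implemented.

```python
import random

EMBED_DIM = 3  # hypothetical embedding size

# Hypothetical per-feature vectors, standing in for a learned linear projection.
FEATURE_VECTORS = {
    "close": [1.0, 0.0, 0.0],
    "front": [0.0, 1.0, 0.0],
    "back": [0.0, -1.0, 0.0],
    "rounded": [0.0, 0.0, 1.0],
}

PHONEME_FEATURES = {
    "i": ["close", "front"],
    "u": ["close", "back", "rounded"],
    "y": ["close", "front", "rounded"],  # treated as unseen in training
}

SEEN = {"i", "u"}

# Invented example of an expert mapping from unseen to seen phonemes.
MANUAL_MAP = {"y": "u"}


def from_features(phoneme):
    """Sum the vectors of all active features (a linear map of a binary input)."""
    vec = [0.0] * EMBED_DIM
    for feature in PHONEME_FEATURES[phoneme]:
        vec = [a + b for a, b in zip(vec, FEATURE_VECTORS[feature])]
    return vec


def represent(phoneme, method, rng):
    """Return the model input for `phoneme` under one of the three conditions."""
    if method == "AUTO":
        # Features are used directly, so unseen combinations need no special case.
        return from_features(phoneme)
    if method == "MAPPED":
        target = phoneme if phoneme in SEEN else MANUAL_MAP[phoneme]
        return from_features(target)
    if method == "RANDOM":
        if phoneme in SEEN:
            return from_features(phoneme)
        return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    raise ValueError(f"unknown method: {method}")


rng = random.Random(0)
for method in ("AUTO", "MAPPED", "RANDOM"):
    print(method, represent("y", method, rng))
```

In this toy setup, AUTO places [y] between [i] and [u], MAPPED makes it identical to [u], and RANDOM gives it a vector unrelated to any trained phoneme.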

Audio samples for each sentence: AUTO, MAPPED, RANDOM.

| Sentence | Unseen phonemes |
| --- | --- |
| Demokratie entsteht womöglich erst in diesem Konflikt. | [R], [ø], [ç] |
| Allerdings wurde dieser Beschluss nicht sofort umgesetzt. | [ç] |
| Es wurde über zwölf Millionen Mal verkauft. | [y], [œ] |
| Die Blütezeit reicht von Juni bis September. | [y], [R], [ç] |
| Schwerpunkt seiner Tätigkeit blieb jedoch die Schauspielerei. | [ç], [R] |
| Meistens ist eine Behandlung der Symptome ausreichend. | [y], [R], [ç] |
| Die Maschine wurde durch das Feuer zerstört. | [ç], [ø] |

0-shot German from monolingual English models

Description:

The following samples in German are produced with the monolingual English models, trained on the CSTR VCTK corpus (referred to as the VCTK data condition in the paper). Sentences and 0-shot methods are as above. Note that in AUTO, [R] collapses to [g] because the phoneme set used for English contains no trills; this is audible, for example, in the sentence "Die Blütezeit (r)eicht von Juni bis September".

Audio samples for each sentence: AUTO, MAPPED, RANDOM.

| Sentence | Unseen phonemes |
| --- | --- |
| Demokratie entsteht womöglich erst in diesem Konflikt. | [R], [ø], [ç], [o] |
| Allerdings wurde dieser Beschluss nicht sofort umgesetzt. | [ç], [o] |
| Es wurde über zwölf Millionen Mal verkauft. | [y], [œ], [o] |
| Die Blütezeit reicht von Juni bis September. | [y], [R], [ç] |
| Schwerpunkt seiner Tätigkeit blieb jedoch die Schauspielerei. | [ç], [x], [R] |
| Meistens ist eine Behandlung der Symptome ausreichend. | [y], [o], [R], [ç] |
| Die Maschine wurde durch das Feuer zerstört. | [ø], [ç] |

English samples with code-switching into German

Description:

The following samples in English demonstrate an application of 0-shot modelling of a new language (here: German), where individual words or phrases, such as foreign names of people, institutions or places, are part of the English sentence (code-switching). Code-switching, especially into English, is common in many languages, and these samples are only meant to serve as a small demonstration of what can be achieved with the method we presented. Pronunciations for foreign words can be easily approximated by using a target language dictionary in combination with either phonological features (AUTO) or an expert mapping from unseen to seen phonemes (MAPPED). Words containing unseen phonemes are usually mispronounced or completely skipped when new phonemes are represented with a random embedding (RANDOM).
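The dictionary-based approach described above can be sketched as follows. The dictionaries, the transcription of "München", the set of seen phonemes, and the expert mapping are all invented for illustration and are not taken from the paper. The sketch looks a foreign word up in a target-language dictionary, lists the phonemes the English model has not seen, and then replaces them via a mapping (the MAPPED condition); under AUTO, the unseen phonemes would instead be passed on as feature vectors.

```python
# All data below is hypothetical and only serves the example.
SEEN_PHONEMES = {"m", "n", "ə", "ɪ", "z", "u", "ʃ"}

ENGLISH_DICT = {"is": ["ɪ", "z"]}
GERMAN_DICT = {"münchen": ["m", "y", "n", "ç", "ə", "n"]}

# Invented expert mapping from unseen to seen phonemes.
EXPERT_MAP = {"y": "u", "ç": "ʃ"}


def phonemize(word):
    """Look the word up in the English dictionary, then fall back to German."""
    key = word.lower()
    if key in ENGLISH_DICT:
        return ENGLISH_DICT[key]
    return GERMAN_DICT[key]


def unseen(phonemes):
    """Return the phonemes that were not part of the training inventory."""
    return [p for p in phonemes if p not in SEEN_PHONEMES]


def map_to_seen(phonemes):
    """Replace unseen phonemes using the expert mapping (MAPPED condition)."""
    return [EXPERT_MAP.get(p, p) for p in phonemes]


phonemes = phonemize("München")
print(unseen(phonemes))
print(map_to_seen(phonemes))
```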

Multilingual Spanish-English model

Audio samples for each sentence: AUTO, MAPPED, RANDOM.

| Sentence | Unseen phonemes |
| --- | --- |
| Albrecht Dürer, sometimes spelled in English as Durer, without umlaut, was a German painter. | [R], [ç], [y] |
| The German name of the city Munich is München | [y], [ç] |
| Eugen Berthold Friedrich Brecht was a German theatre practitioner, playwright, and poet. | [R], [ç] |
| Gödel, Escher, Bach: an Eternal Golden Braid, also known as GEB, is a book by Douglas Hofstadter. | [œ] |

© 2020 Papercup Technologies Ltd.