Authors: Marlene Staib, Tian Huey Teh, Alexandra Torresquintero, Devang S Ram Mohan, Lorenzo Foglianti, Raphael Lenain, Jiameng Gao
Abstract: Code-switching, the intra-utterance use of multiple languages, is prevalent across the world. Within text-to-speech (TTS), multilingual models have been found to enable code-switching [1,2,3]. By modifying the linguistic input to sequence-to-sequence TTS, we show that code-switching is possible for languages unseen during training, even within monolingual models. We use a small set of phonological features derived from the International Phonetic Alphabet (IPA), such as vowel height and frontness, consonant place and manner. This allows the model topology to stay unchanged for different languages, and enables new, previously unseen feature combinations to be interpreted by the model. We show that this allows us to generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
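To make the idea of "previously unseen feature combinations" concrete, the following is a minimal sketch of a feature-based phoneme encoding. The vowel inventory, the feature names, and the helper functions are illustrative assumptions, not the feature set or code used in the paper. It shows that the close front rounded vowel [y] (found in German but not in English or Spanish) can be written as a new combination of feature values that each occur in some seen vowel.

```python
# Hypothetical, heavily reduced vowel inventory. Feature names and values
# are illustrative assumptions, not the feature set used in the paper.
VOWELS = {
    "i": {"height": "close", "backness": "front", "rounded": False},
    "u": {"height": "close", "backness": "back", "rounded": True},
    "a": {"height": "open", "backness": "front", "rounded": False},
    "y": {"height": "close", "backness": "front", "rounded": True},
}

# One input dimension per (feature, value) pair, so the input size is
# fixed regardless of which language a phoneme comes from.
DIMENSIONS = [
    ("height", "close"), ("height", "mid"), ("height", "open"),
    ("backness", "front"), ("backness", "central"), ("backness", "back"),
    ("rounded", True),
]


def encode(phoneme):
    """Return a binary feature vector for a phoneme listed in VOWELS."""
    features = VOWELS[phoneme]
    return [1.0 if features[name] == value else 0.0
            for name, value in DIMENSIONS]


def is_new_combination_of_seen_features(phoneme, seen):
    """True if the phoneme's vector is unseen, but each of its active
    dimensions is active in at least one seen phoneme."""
    target = encode(phoneme)
    seen_vectors = [encode(s) for s in seen]
    if target in seen_vectors:
        return False
    return all(
        any(vec[k] == 1.0 for vec in seen_vectors)
        for k, active in enumerate(target)
        if active == 1.0
    )


seen = ["i", "u", "a"]
print(encode("y"))
print(is_new_combination_of_seen_features("y", seen))
```

Because every phoneme is described on the same fixed set of dimensions, adding a language changes which vectors appear at the input, not the size of the input layer.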
Description:
The following samples are produced with the multilingual Spanish-English models, trained on a combination of the CSTR VCTK-corpus and our proprietary Mexican Spanish Adrianex-corpus (referred to as the MIX data condition in the paper). Sentences are chosen from randomly sampled Wikipedia articles in German. We compare three 0-shot methods for producing unseen phonemes: 1) phonological features (AUTO, our proposed method), and two baselines, 2) a manual, linguistic mapping to previously seen phonemes (MAPPED), and 3) a new, randomly initialized embedding (RANDOM).
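The difference between the three methods can be sketched as three ways of obtaining an input vector for an unseen phoneme. This is only an illustration under stated assumptions: the three-dimensional feature set, the linear projection standing in for the model's input layer, the mapping of [y] to [i], and all function names are hypothetical, and the random numbers stand in for trained parameters.

```python
import random

EMB_DIM = 4
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def random_vector(dim=EMB_DIM):
    """Draw a vector of independent standard-normal values."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


# Hypothetical binary features: [close, front, rounded].
FEATURES = {
    "i": [1.0, 1.0, 0.0],
    "u": [1.0, 0.0, 1.0],
    "y": [1.0, 1.0, 1.0],  # unseen in training
}

# Stand-ins for trained parameters (assumption: random values here).
projection = [random_vector() for _ in range(3)]  # one row per feature
seen_embeddings = {"i": random_vector(), "u": random_vector()}

# Hypothetical expert mapping from an unseen phoneme to a seen one.
MANUAL_MAP = {"y": "i"}


def embed_auto(phoneme):
    """AUTO: project the feature vector with weights shared by all phonemes."""
    feats = FEATURES[phoneme]
    return [sum(f * row[d] for f, row in zip(feats, projection))
            for d in range(EMB_DIM)]


def embed_mapped(phoneme):
    """MAPPED: reuse the embedding of a manually chosen seen phoneme."""
    return seen_embeddings[MANUAL_MAP.get(phoneme, phoneme)]


def embed_random(phoneme):
    """RANDOM: give an unseen phoneme a fresh, untrained embedding."""
    if phoneme in seen_embeddings:
        return seen_embeddings[phoneme]
    return random_vector()


for name, fn in [("AUTO", embed_auto), ("MAPPED", embed_mapped),
                 ("RANDOM", embed_random)]:
    print(name, [round(x, 3) for x in fn("y")])
```

In this sketch only AUTO builds the vector for [y] from parameters that were exercised by seen phonemes; MAPPED makes [y] identical to [i], and RANDOM yields a vector unrelated to anything learned.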
Sentence | Unseen phonemes | AUTO | MAPPED | RANDOM |
---|---|---|---|---|
Description:
The following samples in German are produced with the monolingual English models, trained on the CSTR VCTK-corpus (referred to as the VCTK data condition in the paper). Sentences and 0-shot methods are as above. Note the collapse of [R] to [g] in AUTO due to the absence of trills in the phoneme set used for English, for example in the sentence "Die Blütezeit (r)eicht von Juni bis September".
Sentence | Unseen phonemes | AUTO | MAPPED | RANDOM |
---|---|---|---|---|
Description:
The following samples in English demonstrate an application of 0-shot modelling of a new language (here: German), where individual words or phrases, such as foreign names of people, institutions or places, are part of the English sentence (code-switching). Code-switching, especially into English, is common in many languages, and these samples are only meant to serve as a small demonstration of what can be achieved with the method we presented. Pronunciations for foreign words can be easily approximated by using a target language dictionary in combination with either phonological features (AUTO) or an expert mapping from unseen to seen phonemes (MAPPED). Words containing unseen phonemes are usually mispronounced or completely skipped when new phonemes are represented with a random embedding (RANDOM).
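As a rough sketch of the dictionary-based approach described above, the snippet below looks up each word of a code-switched sentence in the lexicon of its own language and then lists the phonemes the model has not seen. The toy lexicons, the seen-phoneme set, the per-word language tags, and the function names are all hypothetical; a real system would use full pronunciation dictionaries and some form of language identification.

```python
# Hypothetical toy lexicons; a real system would load full pronunciation
# dictionaries for the base language and the target language.
LEXICONS = {
    "en": {
        "we": ["w", "i"],
        "met": ["m", "ɛ", "t"],
        "in": ["ɪ", "n"],
    },
    "de": {
        "aachen": ["aː", "x", "ə", "n"],
    },
}

# Hypothetical set of phonemes covered by the training data.
SEEN_PHONEMES = {"w", "i", "m", "ɛ", "t", "ɪ", "n", "ə"}


def phonemize(tagged_words):
    """Concatenate pronunciations, using each word's own language lexicon."""
    phonemes = []
    for word, lang in tagged_words:
        phonemes.extend(LEXICONS[lang][word.lower()])
    return phonemes


def unseen_phonemes(phonemes, seen=SEEN_PHONEMES):
    """List phonemes outside the seen set, in order of first appearance."""
    result = []
    for p in phonemes:
        if p not in seen and p not in result:
            result.append(p)
    return result


# Each word carries a language tag (assumed to be given).
sentence = [("We", "en"), ("met", "en"), ("in", "en"), ("Aachen", "de")]
sequence = phonemize(sentence)
print(sequence)
print(unseen_phonemes(sequence))
```

The phonemes reported as unseen are the ones that would then be represented through phonological features (AUTO), an expert mapping (MAPPED), or a random embedding (RANDOM).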
Multilingual Spanish-English model

Sentence | Unseen phonemes | AUTO | MAPPED | RANDOM |
---|---|---|---|---|
© 2023 Papercup Technologies Ltd.