Module Comparison of Transformer-TTS for Speaker Adaptation Based on Fine-Tuning

Katsuki Inoue, Sunao Hara, Masanobu Abe

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

End-to-end text-to-speech (TTS) models have recently achieved remarkable results. However, such models require a large amount of text and audio data for training. Speaker adaptation methods based on fine-tuning have been proposed for constructing a TTS model from small-scale data. Although these methods can replicate the target speaker's voice quality, the synthesized speech suffers from deletions and/or repetitions. The goal of speaker adaptation is to change the voice quality to match the target speaker's, on the premise that fine-tuning only the necessary modules reduces the amount of data required. In this paper, we clarify the role of each module in Transformer-TTS by leaving it un-updated during fine-tuning. Specifically, we froze the character embedding, the encoder, the layer predicting the stop token, and the loss function estimating sentence endings. The experimental results showed the following: (1) fine-tuning the character embedding did not improve speech deletions and/or repetitions, (2) speech deletions increased when the encoder was not fine-tuned, (3) speech deletions were suppressed when the layer predicting the stop token was not fine-tuned, and (4) speech repetitions at sentence endings occurred frequently when the loss function estimating sentence endings was omitted.
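The abstract describes freezing individual modules of a pretrained Transformer-TTS model during speaker-adaptation fine-tuning. The snippet below is a minimal PyTorch-style sketch of that idea; the submodule names (embedding, encoder, stop_linear) and the TransformerTTS class are hypothetical placeholders for illustration, not the authors' implementation.

```python
import torch.nn as nn


def freeze_modules(model: nn.Module, frozen: set[str]) -> None:
    """Disable gradient updates for the named top-level submodules.

    Frozen modules keep their pretrained weights, so only the
    remaining modules are adapted to the target speaker.
    """
    for name, module in model.named_children():
        if name in frozen:
            for param in module.parameters():
                param.requires_grad = False


# Hypothetical usage: adapt to a target speaker while keeping the
# character embedding and the stop-token layer at pretrained values.
# model = TransformerTTS(...)  # pretrained multi-speaker model (placeholder)
# freeze_modules(model, {"embedding", "stop_linear"})
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```

Passing only the parameters with requires_grad=True to the optimizer ensures the frozen modules are excluded from the fine-tuning updates entirely.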

Original language: English
Title of host publication: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 826-830
Number of pages: 5
ISBN (Electronic): 9789881476883
Publication status: Published - Dec 7 2020
Event: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020 - Virtual, Auckland, New Zealand
Duration: Dec 7 2020 - Dec 10 2020

Publication series

Name: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020 - Proceedings

Conference

Conference: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020
Country: New Zealand
City: Virtual, Auckland
Period: 12/7/20 - 12/10/20

Keywords

  • end-to-end speech synthesis
  • fine-tuning
  • module analysis
  • speaker adaptation
  • transformer

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Hardware and Architecture
  • Signal Processing
  • Decision Sciences (miscellaneous)
  • Instrumentation
