Sub-band text-to-speech combining sample-based spectrum with statistically generated spectrum

Tadashi Inai, Sunao Hara, Masanobu Abe, Yusuke Ijima, Noboru Miyazaki, Hideyuki Mizuno

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

We propose a sub-band speech synthesis approach for building a high-quality text-to-speech (TTS) system: a sample-based spectrum is used in the high-frequency band, and a spectrum generated by HMM-based TTS is used in the low-frequency band. Here, a sample-based spectrum is a spectrum selected from a phoneme database as the one most similar to the spectrum generated by HMM-based speech synthesis. The key idea is to compensate for the over-smoothing caused by statistical procedures by introducing a sample-based spectrum, especially in the high-frequency band. Listening test results show that the proposed method outperforms HMM-based speech synthesis in terms of clarity and is at the same level in terms of smoothness. In addition, preference test results comparing the proposed method, HMM-based speech synthesis, and waveform speech synthesis using 80 min of speech data show that the proposed method is the most preferred.
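The selection-and-combination idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure, the band boundary, or how the two bands are joined, so the Euclidean distance, the `cutoff_bin` parameter, the hard splice between bands, and all function names below are assumptions made only for illustration.

```python
from math import sqrt


def spectral_distance(a, b):
    """Euclidean distance between two equal-length spectra.

    Assumption: the abstract only says "most similar"; the actual
    similarity measure used in the paper is not given there.
    """
    if len(a) != len(b):
        raise ValueError("spectra must have the same number of bins")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_sample(generated, candidates):
    """Pick the database spectrum closest to the HMM-generated spectrum."""
    return min(candidates, key=lambda c: spectral_distance(generated, c))


def combine_sub_bands(generated, sample, cutoff_bin):
    """Low band from the generated spectrum, high band from the sample.

    Assumption: a hard splice at a hypothetical cutoff_bin; the paper may
    use a different boundary or a smoother transition.
    """
    if len(generated) != len(sample):
        raise ValueError("spectra must have the same number of bins")
    if not 0 <= cutoff_bin <= len(generated):
        raise ValueError("cutoff_bin is out of range")
    return list(generated[:cutoff_bin]) + list(sample[cutoff_bin:])


# Toy four-bin spectra (made-up values, not data from the paper).
generated = [1.0, 0.8, 0.5, 0.5]
candidates = [
    [0.9, 0.9, 0.2, 0.7],
    [1.0, 0.7, 0.6, 0.3],
]

sample = select_sample(generated, candidates)
hybrid = combine_sub_bands(generated, sample, cutoff_bin=2)
```

In this toy case the second candidate is closer to the generated spectrum, so `hybrid` keeps the two low-frequency bins of `generated` and takes the two high-frequency bins from that candidate.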

Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association
Pages: 264-268
Number of pages: 5
Volume: 2015-January
Publication status: Published - 2015
Event: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
Duration: Sep 6 2015 to Sep 10 2015

Other

Other: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015
Country: Germany
City: Dresden
Period: 9/6/15 to 9/10/15

Keywords

  • HMM-based speech synthesis
  • Sub-band
  • Waveform-based speech synthesis

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Inai, T., Hara, S., Abe, M., Ijima, Y., Miyazaki, N., & Mizuno, H. (2015). Sub-band text-to-speech combining sample-based spectrum with statistically generated spectrum. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (Vol. 2015-January, pp. 264-268). International Speech Communication Association.