Sub-band text-to-speech combining sample-based spectrum with statistically generated spectrum

Tadashi Inai, Sunao Hara, Masanobu Abe, Yusuke Ijima, Noboru Miyazaki, Hideyuki Mizuno

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

We propose a sub-band speech synthesis approach to develop a high-quality text-to-speech (TTS) system: a sample-based spectrum is used in the high-frequency band, and a spectrum generated by HMM-based TTS is used in the low-frequency band. Here, a sample-based spectrum means a spectrum selected from a phoneme database such that it is the most similar to the spectrum generated by HMM-based speech synthesis. The key idea is to compensate for the over-smoothing caused by statistical procedures by introducing a sample-based spectrum, especially in the high-frequency band. Listening test results show that the proposed method outperforms HMM-based speech synthesis in terms of clarity and is at the same level in terms of smoothness. In addition, preference test results among the proposed method, HMM-based speech synthesis, and waveform-based speech synthesis using 80 min of speech data reveal that the proposed method is the most preferred.
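
The two steps named in the abstract (selecting the database spectrum most similar to the HMM-generated one, then taking the low band from the generated spectrum and the high band from the selected sample) can be illustrated with a minimal sketch. The abstract does not state the similarity measure, the spectral representation, or the crossover frequency, so the squared Euclidean distance, the plain list-of-magnitudes representation, and the `crossover_bin` parameter below are all assumptions made for illustration, as are the function names; this is not the authors' implementation.

```python
from typing import List, Sequence


def select_sample(generated: Sequence[float],
                  database: Sequence[Sequence[float]]) -> Sequence[float]:
    """Return the database spectrum closest to `generated`.

    Assumption: similarity is squared Euclidean distance over all bins;
    the abstract does not specify the actual measure.
    """
    def distance(candidate: Sequence[float]) -> float:
        return sum((g - c) ** 2 for g, c in zip(generated, candidate))

    return min(database, key=distance)


def combine_sub_bands(generated: Sequence[float],
                      sample: Sequence[float],
                      crossover_bin: int) -> List[float]:
    """Take bins below `crossover_bin` from the statistically generated
    spectrum and the remaining (higher) bins from the sample spectrum.

    Assumption: a hard switch at a single bin index; the real system may
    use a different band split or a smooth transition.
    """
    if len(generated) != len(sample):
        raise ValueError("spectra must have the same number of bins")
    if not 0 <= crossover_bin <= len(generated):
        raise ValueError("crossover_bin is out of range")
    return list(generated[:crossover_bin]) + list(sample[crossover_bin:])


# Toy data: four frequency bins ordered from low to high (hypothetical values).
generated = [1.0, 0.8, 0.5, 0.2]
database = [[0.9, 0.9, 0.7, 0.4], [0.1, 0.2, 0.9, 0.9]]

sample = select_sample(generated, database)
hybrid = combine_sub_bands(generated, sample, crossover_bin=2)
```

In this toy case the first database entry is selected, and the hybrid spectrum keeps the two low bins of the generated spectrum and the two high bins of the selected sample.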

Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech and Communication Association
Pages: 264-268
Number of pages: 5
Volume: 2015-January
Publication status: Published - 2015
Event: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
Duration: Sep 6 2015 → Sep 10 2015

Other

Other: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015
Country: Germany
City: Dresden
Period: 9/6/15 → 9/10/15

Fingerprint

Text-to-speech
Speech synthesis
Frequency bands
Waveform
Low Frequency
Smoothing
Hidden Markov Model
Speech

Keywords

  • HMM-based speech synthesis
  • Sub-band
  • Waveform-based speech synthesis

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Inai, T., Hara, S., Abe, M., Ijima, Y., Miyazaki, N., & Mizuno, H. (2015). Sub-band text-to-speech combining sample-based spectrum with statistically generated spectrum. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (Vol. 2015-January, pp. 264-268). International Speech and Communication Association.

@inproceedings{6b5aad76ddae431ebd7453ff28128e9b,
title = "Sub-band text-to-speech combining sample-based spectrum with statistically generated spectrum",
abstract = "We propose a sub-band speech synthesis approach to develop a high-quality text-to-speech (TTS) system: a sample-based spectrum is used in the high-frequency band, and a spectrum generated by HMM-based TTS is used in the low-frequency band. Here, a sample-based spectrum means a spectrum selected from a phoneme database such that it is the most similar to the spectrum generated by HMM-based speech synthesis. The key idea is to compensate for the over-smoothing caused by statistical procedures by introducing a sample-based spectrum, especially in the high-frequency band. Listening test results show that the proposed method outperforms HMM-based speech synthesis in terms of clarity and is at the same level in terms of smoothness. In addition, preference test results among the proposed method, HMM-based speech synthesis, and waveform-based speech synthesis using 80 min of speech data reveal that the proposed method is the most preferred.",
keywords = "HMM-based speech synthesis, Sub-band, Waveform-based speech synthesis",
author = "Tadashi Inai and Sunao Hara and Masanobu Abe and Yusuke Ijima and Noboru Miyazaki and Hideyuki Mizuno",
year = "2015",
language = "English",
volume = "2015-January",
pages = "264--268",
booktitle = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
publisher = "International Speech and Communication Association",

}

TY - GEN

T1 - Sub-band text-to-speech combining sample-based spectrum with statistically generated spectrum

AU - Inai, Tadashi

AU - Hara, Sunao

AU - Abe, Masanobu

AU - Ijima, Yusuke

AU - Miyazaki, Noboru

AU - Mizuno, Hideyuki

PY - 2015

Y1 - 2015

AB - We propose a sub-band speech synthesis approach to develop a high-quality text-to-speech (TTS) system: a sample-based spectrum is used in the high-frequency band, and a spectrum generated by HMM-based TTS is used in the low-frequency band. Here, a sample-based spectrum means a spectrum selected from a phoneme database such that it is the most similar to the spectrum generated by HMM-based speech synthesis. The key idea is to compensate for the over-smoothing caused by statistical procedures by introducing a sample-based spectrum, especially in the high-frequency band. Listening test results show that the proposed method outperforms HMM-based speech synthesis in terms of clarity and is at the same level in terms of smoothness. In addition, preference test results among the proposed method, HMM-based speech synthesis, and waveform-based speech synthesis using 80 min of speech data reveal that the proposed method is the most preferred.

KW - HMM-based speech synthesis

KW - Sub-band

KW - Waveform-based speech synthesis

UR - http://www.scopus.com/inward/record.url?scp=84959169493&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959169493&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84959169493

VL - 2015-January

SP - 264

EP - 268

BT - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

PB - International Speech and Communication Association

ER -