Phonetic and prosodic information estimation from texts for genuine Japanese end-to-end text-to-speech

Naoto Kakegawa, Sunao Hara, Masanobu Abe, Yusuke Ijima

Research output

1 Citation (Scopus)

Abstract

The biggest obstacle to developing end-to-end Japanese text-to-speech (TTS) systems is estimating phonetic and prosodic information (PPI) from Japanese text. There are three reasons: (1) Kanji characters in the Japanese writing system have multiple possible pronunciations, (2) there is no separation mark between words, and (3) accent nuclei must be assigned at appropriate positions. In this paper, we propose to solve these problems with neural machine translation (NMT) based on encoder-decoder models, and we compare NMT models built on recurrent neural networks and on the Transformer architecture. The proposed model handles text on a token (character) basis, whereas conventional systems handle it on a word basis. To verify the potential of the proposed approach, the NMT models are trained on pairs of sentences and their PPIs, generated by a conventional Japanese TTS system from 5 million sentences. Evaluation experiments were performed using manually annotated PPIs for 5,142 sentences. The experimental results showed that the Transformer architecture performs best, with 98.0% accuracy for phonetic information estimation and 95.0% accuracy for PPI estimation. Judging from these results, NMT models are promising for end-to-end Japanese TTS.
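The character-level framing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the PPI label notation (an apostrophe for an accent nucleus, a slash for a phrase boundary), and the exact-match scoring are all assumptions made for illustration, and the abstract does not state how its accuracy figures are computed.

```python
from typing import List, Sequence


def char_tokenize(text: str) -> List[str]:
    """Split text into single-character tokens, skipping whitespace.

    No word segmentation is needed, which is the point of a
    character-based input for a language without word separators.
    """
    return [ch for ch in text if not ch.isspace()]


def exact_match_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Sequence[str]],
) -> float:
    """Fraction of sequences predicted exactly right.

    Assumption: sequence-level exact match is one plausible metric;
    the source does not define its accuracy measure.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        return 0.0
    matches = sum(
        1 for p, r in zip(predictions, references) if list(p) == list(r)
    )
    return matches / len(references)


# One hypothetical training pair: source characters and a made-up PPI
# label string (' = accent nucleus, / = phrase boundary; not the
# paper's notation).
source_tokens = char_tokenize("今日は晴れ")
target_tokens = char_tokenize("キョ'ーワ/ハレ'")
```

An encoder-decoder model would be trained to map `source_tokens` to `target_tokens`; `exact_match_accuracy` could then score its outputs against manually annotated references.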

Original language: English
Title of host publication: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Publisher: International Speech Communication Association
Pages: 3606-3610
Number of pages: 5
ISBN (Electronic): 9781713836902
Publication status: Published - 2021
Event: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno
Duration: Aug 30 2021 → Sep 3 2021

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
5
ISSN (Print): 2308-457X
ISSN (Electronic): 1990-9772

Conference

Conference: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Country/Territory: Czech Republic
City: Brno
Period: 8/30/21 → 9/3/21

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
