An investigation to transplant emotional expressions in DNN-based TTS synthesis

Katsuki Inoue, Sunao Hara, Masanobu Abe, Nobukatsu Hojo, Yusuke Ijima

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Citations (Scopus)

Abstract

In this paper, we investigate deep neural network (DNN) architectures for transplanting emotional expressions to improve the expressiveness of DNN-based text-to-speech (TTS) synthesis. DNNs are expected to be powerful at mapping linguistic information to acoustic features. From multi-speaker and/or multi-language perspectives, several types of DNN architecture have been proposed and have shown good performance. We extend this idea to emotion transplantation by constructing shared emotion-dependent mappings. The following three types of DNN architecture are examined: (1) the parallel model (PM), whose output layer consists of both speaker-dependent layers and emotion-dependent layers; (2) the serial model (SM), whose output layer consists of emotion-dependent layers preceded by speaker-dependent hidden layers; and (3) the auxiliary input model (AIM), whose input layer consists of emotion and speaker IDs as well as linguistic feature vectors. The DNNs were trained using neutral speech uttered by 24 speakers, together with sad and joyful speech uttered by 3 of those 24 speakers. For synthesis of unseen emotions, subjective evaluation tests showed that the PM performs much better than the SM and slightly better than the AIM. The same tests showed that the SM is the best of the three models when the training data includes emotional speech uttered by the target speaker.
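The three architectures compared in the abstract differ only in where the speaker and emotion identities enter the network. The following minimal numpy sketch illustrates that structural difference for a single frame; all layer sizes, weight initializations, and the single shared hidden stack are hypothetical simplifications, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: linguistic input, hidden, acoustic output.
N_LING, N_HID, N_ACOUSTIC = 300, 256, 187
N_SPEAKERS, N_EMOTIONS = 24, 3

def relu(x):
    return np.maximum(x, 0.0)

# Shared hidden layer (common to all three sketches).
W_h = rng.normal(0, 0.01, (N_LING, N_HID))

# (1) Parallel model (PM): shared hidden layers feed separate
#     speaker-dependent and emotion-dependent output layers,
#     whose contributions are combined at the output.
W_spk_out = rng.normal(0, 0.01, (N_SPEAKERS, N_HID, N_ACOUSTIC))
W_emo_out = rng.normal(0, 0.01, (N_EMOTIONS, N_HID, N_ACOUSTIC))

def parallel_model(x, spk, emo):
    h = relu(x @ W_h)
    return h @ W_spk_out[spk] + h @ W_emo_out[emo]

# (2) Serial model (SM): a speaker-dependent hidden layer precedes
#     an emotion-dependent output layer.
W_spk_hid = rng.normal(0, 0.01, (N_SPEAKERS, N_HID, N_HID))
W_emo_ser = rng.normal(0, 0.01, (N_EMOTIONS, N_HID, N_ACOUSTIC))

def serial_model(x, spk, emo):
    h = relu(x @ W_h)
    h = relu(h @ W_spk_hid[spk])
    return h @ W_emo_ser[emo]

# (3) Auxiliary input model (AIM): one-hot speaker and emotion IDs
#     are appended to the linguistic input of one shared network.
W_aux = rng.normal(0, 0.01, (N_LING + N_SPEAKERS + N_EMOTIONS, N_HID))
W_out = rng.normal(0, 0.01, (N_HID, N_ACOUSTIC))

def auxiliary_input_model(x, spk, emo):
    spk_1hot = np.eye(N_SPEAKERS)[spk]
    emo_1hot = np.eye(N_EMOTIONS)[emo]
    x_aug = np.concatenate([x, spk_1hot, emo_1hot])
    return relu(x_aug @ W_aux) @ W_out

# One frame of linguistic features; each model yields one acoustic frame.
x = rng.normal(size=N_LING)
for f in (parallel_model, serial_model, auxiliary_input_model):
    y = f(x, 0, 1)
```

In the PM and SM, transplantation works because the emotion-dependent weights are shared across speakers, so emotion layers trained on the 3 emotional speakers can be combined with the speaker-dependent weights of any of the 24 neutral speakers; in the AIM, the same sharing is implicit in the single network conditioned on both IDs.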

Original language: English
Title of host publication: Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1253-1258
Number of pages: 6
Volume: 2018-February
ISBN (Electronic): 9781538615423
DOI: 10.1109/APSIPA.2017.8282231
Publication status: Published - Feb 5 2018
Event: 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 - Kuala Lumpur, Malaysia
Duration: Dec 12 2017 - Dec 15 2017



ASJC Scopus subject areas

  • Artificial Intelligence
  • Human-Computer Interaction
  • Information Systems
  • Signal Processing

Cite this

Inoue, K., Hara, S., Abe, M., Hojo, N., & Ijima, Y. (2018). An investigation to transplant emotional expressions in DNN-based TTS synthesis. In Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 (Vol. 2018-February, pp. 1253-1258). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/APSIPA.2017.8282231

