In this paper, we propose a new algorithm to generate Speech-like Emotional Sound (SES). Emotional expression may be the most important factor in human communication, and speech is one of the most useful means of expressing emotions. Although speech generally conveys both emotional and linguistic information, we have undertaken the challenge of generating sounds that convey emotional information alone. We call the generated sounds "speech-like" because they do not contain any linguistic information. SES can provide another way to generate emotional responses in human-computer interaction systems. To generate "speech-like" sound, we propose employing WaveNet as a sound generator conditioned only on emotion IDs. This concept is quite different from the WaveNet Vocoder, which synthesizes speech using spectrum information as an auxiliary feature. The biggest advantage of our approach is that it reduces the amount of emotional speech data necessary for training by focusing on non-linguistic information. The proposed algorithm consists of two steps. In the first step, to generate a variety of spectrum patterns that resemble human speech as closely as possible, WaveNet is trained with auxiliary mel-spectrum parameters and the emotion ID using a large amount of neutral speech. In the second step, to generate emotional expressions, WaveNet is retrained with the emotion ID as the only auxiliary feature, using a small amount of emotional speech. Experimental results reveal the following: (1) the two-step training is necessary to generate high-quality SES, and (2) it is important that the first training step use a large neutral speech database and spectrum information to improve the emotional expression and naturalness of SES.
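The difference between the two training steps lies in what the auxiliary (conditioning) input contains. The sketch below illustrates that idea only and is not the authors' implementation: the emotion label set, the mel dimension of 80, the helper names `one_hot` and `build_condition`, and in particular the choice to zero-fill the mel slot in the second step are all assumptions made for illustration, since the abstract does not say how the spectrum input is dropped.

```python
from typing import List, Optional

# Hypothetical values, not taken from the paper.
EMOTIONS = ["neutral", "happy", "sad", "angry"]
N_MELS = 80


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_condition(emotion: str,
                    mel_frame: Optional[List[float]] = None) -> List[float]:
    """Build the auxiliary feature vector for one frame.

    Step 1 (mel_frame given): [mel-spectrum | emotion one-hot].
    Step 2 (mel_frame is None): [zeros | emotion one-hot].
    Zero-filling keeps the vector length fixed across both steps;
    this is an assumption of the sketch, not a detail from the paper.
    """
    emo = one_hot(EMOTIONS.index(emotion), len(EMOTIONS))
    if mel_frame is None:
        mel = [0.0] * N_MELS
    else:
        if len(mel_frame) != N_MELS:
            raise ValueError("mel_frame must have N_MELS values")
        mel = list(mel_frame)
    return mel + emo


# Step 1: neutral speech, conditioned on spectrum and emotion ID.
step1 = build_condition("neutral", mel_frame=[0.5] * N_MELS)
# Step 2: emotional speech, conditioned on emotion ID only.
step2 = build_condition("happy")
```

Keeping the vector length identical in both steps is one way the network trained in the first step could be retrained in the second without changing its input layer; other designs are equally plausible.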

Original language: English
Pages (from-to): 1581-1589
Number of pages: 9
Journal: IEICE Transactions on Information and Systems
Issue number: 9
Publication status: Published - Sept 2022


Keywords

  • deep neural network
  • emotion
  • emotional speech synthesis
  • WaveNet

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
  • Artificial Intelligence


