Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, emotion manipulation remains constrained to a few predefined labels, limiting the ability to reflect nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network that improves the quality of generated speech by reflecting the multi-aspect characteristics of emotion. The experimental results demonstrate the model's ability to control emotional style and intensity while producing high-quality expressive speech.
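As a rough illustration of the Cartesian-spherical transformation mentioned above, the sketch below converts an arousal-valence-dominance (AVD) pseudo-label into a radius and two angles, where the radius can be read as emotion intensity and the angles as emotional style. The neutral reference point, axis ordering, and angle conventions here are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def cartesian_to_spherical(avd, center=None):
    """Map an AVD pseudo-label from Cartesian to spherical coordinates.

    Returns (r, theta, phi): r is the radial distance (a proxy for
    emotion intensity), theta the polar angle, and phi the azimuthal
    angle (together a proxy for emotional style). The optional `center`
    shifts the origin to a neutral reference point (an assumption).
    """
    avd = np.asarray(avd, dtype=float)
    if center is not None:
        avd = avd - np.asarray(center, dtype=float)  # center on neutral emotion
    a, v, d = avd[..., 0], avd[..., 1], avd[..., 2]
    r = np.sqrt(a**2 + v**2 + d**2)                  # radius ~ intensity
    theta = np.arccos(np.clip(d / np.maximum(r, 1e-8), -1.0, 1.0))  # polar angle
    phi = np.arctan2(v, a)                           # azimuthal angle
    return np.stack([r, theta, phi], axis=-1)

# Example: one AVD pseudo-label centered on an assumed neutral point
print(cartesian_to_spherical([0.7, 0.2, 0.5], center=[0.5, 0.5, 0.5]))
```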
This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University); No. 2021-0-02068, Artificial Intelligence Innovation Hub; and AI Technology for Interactive Communication of Language Impaired Individuals).