Ajou University repository

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech
  • Cho, Deok Hyeon ;
  • Oh, Hyung Seok ;
  • Kim, Seung Bin ;
  • Lee, Sang Hoon ;
  • Lee, Seong Whan
Citations (SCOPUS): 12
Publication Year
2024-01-01
Journal
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher
International Speech Communication Association
Citation
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp.1810-1814
Keyword
emotional style and intensity control; expressive emotional speech synthesis; text-to-speech
Mesh Keyword
Complex nature; Emotional speech; Emotional speech synthesis; Emotional style and intensity control; Expressive emotional speech synthesis; Human annotations; Intensity models; Speech emotions; Synthetic speech; Text to speech
All Science Classification Codes (ASJC)
Language and Linguistics; Human-Computer Interaction; Signal Processing; Software; Modeling and Simulation
Abstract
Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model's ability to control emotional style and intensity with high-quality expressive speech.
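The abstract describes mapping arousal, valence, and dominance (AVD) pseudo-labels into a spherical emotion vector via a Cartesian-spherical transformation. The sketch below shows a generic version of such a transformation; the paper's exact parameterization (axis ordering, choice of neutral reference center, angle conventions) is not given in this record, so those details are assumptions for illustration only.

```python
import math

def cartesian_to_spherical(a, v, d, center=(0.0, 0.0, 0.0)):
    """Map an arousal-valence-dominance (AVD) point to spherical
    coordinates (r, theta, phi) relative to a reference center.

    In an EmoSphere-style reading, r can be interpreted as emotion
    intensity and (theta, phi) as the emotional style direction.
    """
    x, y, z = a - center[0], v - center[1], d - center[2]
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # At the reference center the direction is undefined (neutral).
        return 0.0, 0.0, 0.0
    theta = math.acos(z / r)   # polar angle from the dominance axis, in [0, pi]
    phi = math.atan2(y, x)     # azimuth in the arousal-valence plane, in (-pi, pi]
    return r, theta, phi
```

For example, a point displaced only along the arousal axis, `cartesian_to_spherical(1.0, 0.0, 0.0)`, yields unit intensity with `theta = pi/2` and `phi = 0`. Scaling `r` while keeping the angles fixed would then vary intensity without changing the emotional style direction.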
ISSN
1990-9772
Language
eng
URI
https://aurora.ajou.ac.kr/handle/2018.oak/38119
https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85206434807&origin=inward
DOI
https://doi.org/10.21437/interspeech.2024-398
Journal URL
https://www.isca-speech.org/iscaweb/index.php/online-archive
Type
Conference Paper
Funding
This work was partly supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University), No. 2021-0-02068, Artificial Intelligence Innovation Hub, and AI Technology for Interactive Communication of Language Impaired Individuals).

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Lee, Sang-Hoon (이상훈)
Department of Software and Computer Engineering
File Download

  • There are no files associated with this item.