Citation Export
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Cho, Deok Hyeon | - |
| dc.contributor.author | Oh, Hyung Seok | - |
| dc.contributor.author | Kim, Seung Bin | - |
| dc.contributor.author | Lee, Sang Hoon | - |
| dc.contributor.author | Lee, Seong Whan | - |
| dc.date.issued | 2024-01-01 | - |
| dc.identifier.issn | 1990-9772 | - |
| dc.identifier.uri | https://aurora.ajou.ac.kr/handle/2018.oak/38119 | - |
| dc.identifier.uri | https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85206434807&origin=inward | - |
| dc.description.abstract | Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model's ability to control emotional style and intensity with high-quality expressive speech. | - |
| dc.description.sponsorship | This work was partly supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University), No. 2021-0-02068, Artificial Intelligence Innovation Hub, and AI Technology for Interactive Communication of Language Impaired Individuals). | - |
| dc.language.iso | eng | - |
| dc.publisher | International Speech Communication Association | - |
| dc.subject.mesh | Complex nature | - |
| dc.subject.mesh | Emotional speech | - |
| dc.subject.mesh | Emotional speech synthesis | - |
| dc.subject.mesh | Emotional style and intensity control | - |
| dc.subject.mesh | Expressive emotional speech synthesis | - |
| dc.subject.mesh | Human annotations | - |
| dc.subject.mesh | Intensity models | - |
| dc.subject.mesh | Speech emotions | - |
| dc.subject.mesh | Synthetic speech | - |
| dc.subject.mesh | Text to speech | - |
| dc.title | EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech | - |
| dc.type | Conference | - |
| dc.citation.conferenceDate | 2024.09.01.~2024.09.05. | - |
| dc.citation.conferenceName | 25th Interspeech Conference 2024 | - |
| dc.citation.edition | Interspeech 2024 | - |
| dc.citation.endPage | 1814 | - |
| dc.citation.startPage | 1810 | - |
| dc.citation.title | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | - |
| dc.identifier.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp.1810-1814 | - |
| dc.identifier.doi | 10.21437/interspeech.2024-398 | - |
| dc.identifier.scopusid | 2-s2.0-85206434807 | - |
| dc.identifier.url | https://www.isca-speech.org/iscaweb/index.php/online-archive | - |
| dc.subject.keyword | emotional style and intensity control | - |
| dc.subject.keyword | expressive emotional speech synthesis | - |
| dc.subject.keyword | Text-to-speech | - |
| dc.type.other | Conference Paper | - |
| dc.identifier.pissn | 2308-457X | - |
| dc.description.isoa | true | - |
| dc.subject.subarea | Language and Linguistics | - |
| dc.subject.subarea | Human-Computer Interaction | - |
| dc.subject.subarea | Signal Processing | - |
| dc.subject.subarea | Software | - |
| dc.subject.subarea | Modeling and Simulation | - |
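
The abstract above describes modeling emotion by applying a Cartesian-spherical transformation to arousal, valence, and dominance (AVD) pseudo-labels, so that the radial distance captures emotion intensity and the angles capture emotion style. The snippet below is a minimal sketch of such a transformation, not the paper's exact formulation; the neutral-centered reference point, the value ranges, and the function name `avd_to_spherical` are illustrative assumptions.

```python
import numpy as np

def avd_to_spherical(avd, neutral_center):
    """Map an (arousal, valence, dominance) pseudo-label to a spherical
    emotion vector (r, theta, phi).

    r            : distance from the neutral reference point (intensity)
    theta, phi   : polar and azimuthal angles (emotion style)
    """
    # Shift the Cartesian AVD point so the neutral reference sits at the
    # origin (centering on a neutral point is an assumption of this sketch).
    x, y, z = np.asarray(avd, dtype=float) - np.asarray(neutral_center, dtype=float)

    r = np.sqrt(x**2 + y**2 + z**2)              # radial distance -> intensity
    theta = np.arccos(z / r) if r > 0 else 0.0   # polar angle
    phi = np.arctan2(y, x)                       # azimuthal angle
    return r, theta, phi

# Example with a made-up AVD pseudo-label in [0, 1], neutral assumed at 0.5.
r, theta, phi = avd_to_spherical(avd=(0.81, 0.32, 0.65),
                                 neutral_center=(0.5, 0.5, 0.5))
print(f"intensity r={r:.3f}, theta={theta:.3f} rad, phi={phi:.3f} rad")
```

Scaling `r` up or down would correspond to strengthening or weakening the emotion while holding the angular (style) components fixed, which is the kind of intensity control the abstract refers to.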