Ajou University repository

DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations Without Text Alignment
Citations (SCOPUS): 1

DC Field | Value
dc.contributor.author | Oh, Hyung Seok
dc.contributor.author | Lee, Sang Hoon
dc.contributor.author | Cho, Deok Hyeon
dc.contributor.author | Lee, Seong Whan
dc.date.issued | 2025-01-01
dc.identifier.issn | 1949-3045
dc.identifier.uri | https://aurora.ajou.ac.kr/handle/2018.oak/38451
dc.identifier.uri | https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85216028667&origin=inward
dc.description.abstract | Emotional voice conversion (EVC) involves modifying various acoustic characteristics, such as pitch and spectral envelope, to match a desired emotional state while preserving the speaker's identity. Existing EVC methods often rely on text transcriptions or time-alignment information and struggle to handle varying speech durations effectively. In this paper, we propose DurFlex-EVC, a duration-flexible EVC framework that operates without the need for text or alignment information. We introduce a unit aligner that models contextual information by aligning speech with discrete units representing content, eliminating the need for text or speech-text alignment. Additionally, we design a style autoencoder that effectively disentangles content and emotional style, allowing precise manipulation of the emotional characteristics of the speech. We further enhance emotional expressiveness through a hierarchical stylize encoder that applies the target emotional style at multiple hierarchical levels, refining the stylization process to improve the naturalness and expressiveness of the converted speech. Experimental results from subjective and objective evaluations demonstrate that our approach outperforms baseline models, effectively handling duration variability and enhancing emotional expressiveness in the converted speech.
dc.language.iso | eng
dc.publisher | Institute of Electrical and Electronics Engineers Inc.
dc.subject.mesh | Acoustic characteristic
dc.subject.mesh | Duration control
dc.subject.mesh | Emotional state
dc.subject.mesh | Emotional voice conversion
dc.subject.mesh | Emotional voices
dc.subject.mesh | Self-supervised representation
dc.subject.mesh | Spectral envelopes
dc.subject.mesh | Style disentanglement
dc.subject.mesh | Text alignments
dc.subject.mesh | Voice conversion
dc.title | DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations Without Text Alignment
dc.type | Article
dc.citation.title | IEEE Transactions on Affective Computing
dc.identifier.bibliographicCitation | IEEE Transactions on Affective Computing
dc.identifier.doi | 10.1109/taffc.2025.3530920
dc.identifier.scopusid | 2-s2.0-85216028667
dc.identifier.url | http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=5165369
dc.subject.keyword | Duration control
dc.subject.keyword | emotional voice conversion
dc.subject.keyword | self-supervised representation
dc.subject.keyword | style disentanglement
dc.type.other | Article
dc.identifier.pissn | 19493045
dc.description.isoa | true
dc.subject.subarea | Software
dc.subject.subarea | Human-Computer Interaction
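
The abstract above describes DurFlex-EVC in terms of three components: a unit aligner that maps speech to discrete content units, a style autoencoder that disentangles content from emotional style, and a hierarchical stylize encoder that applies the target emotion at multiple levels. The following is a minimal, illustrative PyTorch sketch of how modules of that shape could fit together; it is not the authors' implementation, and every class name, layer choice, and tensor shape here is an assumption made only for illustration.

# Illustrative sketch only -- NOT the authors' implementation of DurFlex-EVC.
# Module names and tensor shapes are assumptions based on the abstract:
# a unit aligner predicts discrete content units from speech frames, a
# style autoencoder separates content from emotional style, and a
# hierarchical stylize encoder injects the target style at several levels.
import torch
import torch.nn as nn


class UnitAligner(nn.Module):
    """Predicts a discrete content unit for each speech frame (hypothetical)."""
    def __init__(self, feat_dim: int = 256, num_units: int = 512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_units)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, feat_dim) -> unit logits per frame
        return self.proj(speech_feats)


class StyleAutoencoder(nn.Module):
    """Disentangles content from emotional style (hypothetical layout)."""
    def __init__(self, feat_dim: int = 256, style_dim: int = 64):
        super().__init__()
        self.content_enc = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.decoder = nn.Linear(feat_dim + style_dim, feat_dim)

    def forward(self, feats: torch.Tensor, target_style: torch.Tensor) -> torch.Tensor:
        content, _ = self.content_enc(feats)                       # (B, T, D)
        style = target_style.unsqueeze(1).expand(-1, feats.size(1), -1)
        return self.decoder(torch.cat([content, style], dim=-1))   # re-stylized features


class HierarchicalStylizeEncoder(nn.Module):
    """Applies the target style repeatedly at multiple levels (hypothetical)."""
    def __init__(self, feat_dim: int = 256, style_dim: int = 64, levels: int = 3):
        super().__init__()
        self.levels = nn.ModuleList(
            nn.Linear(feat_dim + style_dim, feat_dim) for _ in range(levels)
        )

    def forward(self, feats: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        style = style.unsqueeze(1).expand(-1, feats.size(1), -1)
        for layer in self.levels:
            feats = torch.relu(layer(torch.cat([feats, style], dim=-1)))
        return feats


if __name__ == "__main__":
    feats = torch.randn(2, 100, 256)        # dummy frame-level speech features
    style = torch.randn(2, 64)              # dummy target emotion embedding
    units = UnitAligner()(feats).argmax(-1)              # discrete content units
    stylized = StyleAutoencoder()(feats, style)
    out = HierarchicalStylizeEncoder()(stylized, style)
    print(units.shape, out.shape)           # torch.Size([2, 100]) torch.Size([2, 100, 256])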

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Lee, Sang-Hoon (이상훈)
Department of Software and Computer Engineering

File Download

  • There are no files associated with this item.