Ajou University repository

DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations Without Text Alignment
Citations (SCOPUS): 1

DC Field | Value
dc.contributor.author | Oh, Hyung Seok
dc.contributor.author | Lee, Sang Hoon
dc.contributor.author | Cho, Deok Hyeon
dc.contributor.author | Lee, Seong Whan
dc.date.issued | 2025-01-01
dc.identifier.issn | 1949-3045
dc.identifier.uri | https://aurora.ajou.ac.kr/handle/2018.oak/38451
dc.identifier.uri | https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85216028667&origin=inward
dc.description.abstract | Emotional voice conversion (EVC) involves modifying various acoustic characteristics, such as pitch and spectral envelope, to match a desired emotional state while preserving the speaker's identity. Existing EVC methods often rely on text transcriptions or time-alignment information and struggle to handle varying speech durations effectively. In this paper, we propose DurFlex-EVC, a duration-flexible EVC framework that operates without the need for text or alignment information. We introduce a unit aligner that models contextual information by aligning speech with discrete units representing content, eliminating the need for text or speech-text alignment. Additionally, we design a style autoencoder that effectively disentangles content and emotional style, allowing precise manipulation of the emotional characteristics of the speech. We further enhance emotional expressiveness through a hierarchical stylize encoder that applies the target emotional style at multiple hierarchical levels, refining the stylization process to improve the naturalness and expressiveness of the converted speech. Experimental results from subjective and objective evaluations demonstrate that our approach outperforms baseline models, effectively handling duration variability and enhancing emotional expressiveness in the converted speech.
dc.language.iso | eng
dc.publisher | Institute of Electrical and Electronics Engineers Inc.
dc.subject.mesh | Acoustic characteristic
dc.subject.mesh | Duration control
dc.subject.mesh | Emotional state
dc.subject.mesh | Emotional voice conversion
dc.subject.mesh | Emotional voices
dc.subject.mesh | Self-supervised representation
dc.subject.mesh | Spectral envelopes
dc.subject.mesh | Style disentanglement
dc.subject.mesh | Text alignments
dc.subject.mesh | Voice conversion
dc.title | DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations Without Text Alignment
dc.type | Article
dc.citation.title | IEEE Transactions on Affective Computing
dc.identifier.bibliographicCitation | IEEE Transactions on Affective Computing
dc.identifier.doi | 10.1109/taffc.2025.3530920
dc.identifier.scopusid | 2-s2.0-85216028667
dc.identifier.url | http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=5165369
dc.subject.keyword | Duration control
dc.subject.keyword | emotional voice conversion
dc.subject.keyword | self-supervised representation
dc.subject.keyword | style disentanglement
dc.type.other | Article
dc.identifier.pissn | 19493045
dc.description.isoa | true
dc.subject.subarea | Software
dc.subject.subarea | Human-Computer Interaction
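
The abstract above describes DurFlex-EVC in terms of three components: a unit aligner that maps speech to discrete content units, a style autoencoder that disentangles content from emotional style, and a hierarchical stylize encoder that applies the target emotion at multiple levels. The following is a minimal, illustrative PyTorch sketch of how modules of that shape could fit together; it is not the authors' implementation, and every class name, layer choice, and tensor shape here is an assumption made only for illustration.

# Illustrative sketch only -- NOT the authors' implementation of DurFlex-EVC.
# Module names and tensor shapes are assumptions based on the abstract:
# a unit aligner predicts discrete content units from speech frames, a
# style autoencoder separates content from emotional style, and a
# hierarchical stylize encoder injects the target style at several levels.
import torch
import torch.nn as nn


class UnitAligner(nn.Module):
    """Predicts a discrete content unit for each speech frame (hypothetical)."""
    def __init__(self, feat_dim: int = 256, num_units: int = 512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_units)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, feat_dim) -> unit logits per frame
        return self.proj(speech_feats)


class StyleAutoencoder(nn.Module):
    """Disentangles content from emotional style (hypothetical layout)."""
    def __init__(self, feat_dim: int = 256, style_dim: int = 64):
        super().__init__()
        self.content_enc = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.decoder = nn.Linear(feat_dim + style_dim, feat_dim)

    def forward(self, feats: torch.Tensor, target_style: torch.Tensor) -> torch.Tensor:
        content, _ = self.content_enc(feats)                       # (B, T, D)
        style = target_style.unsqueeze(1).expand(-1, feats.size(1), -1)
        return self.decoder(torch.cat([content, style], dim=-1))   # re-stylized features


class HierarchicalStylizeEncoder(nn.Module):
    """Applies the target style repeatedly at multiple levels (hypothetical)."""
    def __init__(self, feat_dim: int = 256, style_dim: int = 64, levels: int = 3):
        super().__init__()
        self.levels = nn.ModuleList(
            nn.Linear(feat_dim + style_dim, feat_dim) for _ in range(levels)
        )

    def forward(self, feats: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        style = style.unsqueeze(1).expand(-1, feats.size(1), -1)
        for layer in self.levels:
            feats = torch.relu(layer(torch.cat([feats, style], dim=-1)))
        return feats


if __name__ == "__main__":
    feats = torch.randn(2, 100, 256)        # dummy frame-level speech features
    style = torch.randn(2, 64)              # dummy target emotion embedding
    units = UnitAligner()(feats).argmax(-1)              # discrete content units
    stylized = StyleAutoencoder()(feats, style)
    out = HierarchicalStylizeEncoder()(stylized, style)
    print(units.shape, out.shape)           # torch.Size([2, 100]) torch.Size([2, 100, 256])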

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Lee, Sang-Hoon (이상훈)
Department of Software and Computer Engineering

File Download

  • There are no files associated with this item.