Ajou University repository

HiddenSinger: High-quality singing voice synthesis via neural audio codec and latent diffusion models
Citations (SCOPUS): 10


DC Field: Value
dc.contributor.author: Hwang, Ji Sang
dc.contributor.author: Lee, Sang Hoon
dc.contributor.author: Lee, Seong Whan
dc.date.issued: 2025-01-01
dc.identifier.issn: 1879-2782
dc.identifier.uri: https://aurora.ajou.ac.kr/handle/2018.oak/38393
dc.identifier.uri: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85205592183&origin=inward
dc.description.abstract: Recently, denoising diffusion models have demonstrated remarkable performance among generative models in various domains. In the speech domain, however, their complexity and limited controllability hinder their application to time-varying audio synthesis. In particular, singing voice synthesis (SVS), which has begun to emerge as a practical application in the game and entertainment industries, requires high-dimensional samples with long-term acoustic features. To alleviate the challenges posed by model complexity in the SVS task, we propose HiddenSinger, a high-quality SVS system using a neural audio codec and latent diffusion models. To ensure high-fidelity audio, we introduce an audio autoencoder that encodes audio into an audio codec as a compressed representation and reconstructs high-fidelity audio from the low-dimensional compressed latent vector. Subsequently, we use the latent diffusion models to sample a latent representation from a musical score. In addition, our proposed model is extended to an unsupervised singing voice learning framework, HiddenSinger-U, which trains the model on an unlabeled singing voice dataset. Experimental results demonstrate that our model outperforms previous models in audio quality. Furthermore, HiddenSinger-U can synthesize high-quality singing voices of speakers trained solely on unlabeled data.
dc.description.sponsorship: This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University), No. 2021-0-02068, Artificial Intelligence Innovation Hub, and IITP-2024-RS-2023-00255968, the Artificial Intelligence Convergence Innovation Human Resources Development) and Netmarble AI Center.
dc.language.iso: eng
dc.publisher: Elsevier Ltd
dc.subject.mesh: Audio codecs
dc.subject.mesh: De-noising
dc.subject.mesh: Diffusion model
dc.subject.mesh: Generative model
dc.subject.mesh: High quality
dc.subject.mesh: High-fidelity
dc.subject.mesh: Latent diffusion model
dc.subject.mesh: Neural audio codec
dc.subject.mesh: Singing voices
dc.subject.mesh: Singing-voice synthesis
dc.subject.mesh: Humans
dc.subject.mesh: Music
dc.subject.mesh: Neural Networks, Computer
dc.subject.mesh: Singing
dc.subject.mesh: Voice
dc.subject.mesh: Voice Quality
dc.title: HiddenSinger: High-quality singing voice synthesis via neural audio codec and latent diffusion models
dc.type: Article
dc.citation.title: Neural Networks
dc.citation.volume: 181
dc.identifier.bibliographicCitation: Neural Networks, Vol.181
dc.identifier.doi: 10.1016/j.neunet.2024.106762
dc.identifier.pmid: 39368276
dc.identifier.scopusid: 2-s2.0-85205592183
dc.identifier.url: https://www.sciencedirect.com/science/journal/08936080
dc.subject.keyword: Generative model
dc.subject.keyword: Latent diffusion model
dc.subject.keyword: Neural audio codec
dc.subject.keyword: Singing voice synthesis
dc.subject.keyword: Unsupervised learning
dc.type.other: Article
dc.identifier.pissn: 08936080
dc.description.isoa: true
dc.subject.subarea: Cognitive Neuroscience
dc.subject.subarea: Artificial Intelligence

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Lee, Sang-Hoon (이상훈)
Department of Software and Computer Engineering

File Download

  • There are no files associated with this item.