Ajou University repository

Deep Learning Methods for Sign Language Production
  • MA XIAOHAN
Citations (SCOPUS)
0

Advisor
Tae-Sun Chung
Affiliation
Graduate School, Ajou University
Department
Department of Artificial Intelligence, Graduate School
Publication Year
2024-02
Publisher
The Graduate School, Ajou University
Keyword
Transformer; sign language production
Description
Doctoral thesis -- Department of Artificial Intelligence, February 2024
Abstract
Sign language serves as the predominant means of communication for individuals who are deaf or hard of hearing. While written language can certainly serve as a communication tool for deaf people, for those with congenital deafness raised in signing communities, sign language naturally becomes the preferred means of communication. Developing advanced technologies for sign language production (SLP) is therefore vital for their societal integration. SLP is the task of translating the textual form of a spoken language into corresponding sign language expressions. Sign language conveys meaning through multiple asynchronous articulators, spanning manual and non-manual information channels. Recent advances in deep learning have produced SLP models that generate the full-articulatory sign sequence directly from text input in an end-to-end manner. However, these models largely down-weight subtle differences in manual articulation owing to regression to the mean.

In our first work, we propose an efficient cascade dual-decoder Transformer (CasDual-Transformer) for SLP that successively learns two mappings, SLP_hand: Text → Hand pose and SLP_sign: Text → Sign pose, using an attention-based alignment module that fuses the hand and sign features from previous time steps to predict a more expressive sign pose at the current time step. In addition, to provide more effective guidance, we introduce a novel spatio-temporal loss that penalizes shape dissimilarity and temporal distortions in the produced sequences. We conduct experiments on two benchmark sign language datasets from distinct cultures to verify the performance of the proposed model. Both quantitative and qualitative results show that our model is competitive with state-of-the-art models and, in some cases, achieves considerable improvements over them.

In our subsequent work, we address the challenge of capturing the spatial structure and temporal dynamics of sign language to enhance the quality of sign production. We introduce the Multi-Channel Spatio-Temporal Transformer (MCST-Transformer) for skeletal sign language production. It employs a dual attention mechanism: multi-channel spatial attention captures correlations across channels within a frame, while multi-channel temporal attention learns sequential dependencies for each channel. In addition, we experiment with multiple fusion methods for combining the spatial and temporal representations to produce more accurate sign sequences. Experimental results demonstrate that our approach not only exceeds existing models in accuracy and realism but also confirms the effectiveness of each component of the proposed model.
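To make the cascade idea concrete, below is a minimal PyTorch sketch of two-stage decoding: one decoder predicts the hand pose, a second predicts the full sign pose, and an attention module aligns the sign stream with the hand stream. This is an illustration under stated assumptions, not the thesis implementation; the class name, layer counts, pose dimensions (63 for hands, 150 for the full skeleton), and residual fusion are all hypothetical, and causal masks are omitted for brevity.

```python
import torch
import torch.nn as nn

class CasDualSketch(nn.Module):
    """Sketch of cascade dual decoding: text -> hand pose, then text -> full
    sign pose, with sign features attending to hand features (assumed scheme)."""
    def __init__(self, vocab_size, d_model=256, hand_dim=63, sign_dim=150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.hand_decoder = nn.TransformerDecoder(   # SLP_hand: Text -> Hand pose
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.sign_decoder = nn.TransformerDecoder(   # SLP_sign: Text -> Sign pose
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.hand_in = nn.Linear(hand_dim, d_model)
        self.sign_in = nn.Linear(sign_dim, d_model)
        # Alignment module: sign queries attend over predicted hand features
        self.align = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.hand_out = nn.Linear(d_model, hand_dim)
        self.sign_out = nn.Linear(d_model, sign_dim)

    def forward(self, text_ids, hand_prev, sign_prev):
        memory = self.encoder(self.embed(text_ids))              # (B, S, d)
        h = self.hand_decoder(self.hand_in(hand_prev), memory)   # (B, T, d)
        s = self.sign_decoder(self.sign_in(sign_prev), memory)   # (B, T, d)
        fused, _ = self.align(s, h, h)  # fuse hand cues into the sign stream
        return self.hand_out(h), self.sign_out(s + fused)

model = CasDualSketch(vocab_size=1000)
text = torch.randint(0, 1000, (2, 10))   # token ids
hand = torch.zeros(2, 30, 63)            # hand poses from previous time steps
sign = torch.zeros(2, 30, 150)           # full sign poses from previous time steps
hand_pred, sign_pred = model(text, hand, sign)
```

The key design point the sketch captures is that the sign decoder does not work from text alone: its output is refined by attending to the hand stream, which is where subtle manual articulation would otherwise be averaged away.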
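Similarly, a hedged sketch of the factorized attention described for the MCST-Transformer: spatial attention mixes articulator channels within each frame, temporal attention mixes frames within each channel, and a fusion layer combines the two streams. The tensor layout, module names, and the concatenation-based fusion shown here are assumptions for illustration (the thesis compares several fusion variants).

```python
import torch
import torch.nn as nn

class MCSTBlockSketch(nn.Module):
    """One assumed block: attention over channels within each frame (spatial),
    then attention over frames within each channel (temporal), then fusion."""
    def __init__(self, d_model=128, nhead=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.temporal = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm_s = nn.LayerNorm(d_model)
        self.norm_t = nn.LayerNorm(d_model)
        # Concatenate-and-project fusion (one of several possible schemes)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, x):
        # x: (B, T, C, d) -- batch, frames, articulator channels, features
        B, T, C, d = x.shape
        # Spatial: attend across the C channels inside each frame
        xs = x.reshape(B * T, C, d)
        s, _ = self.spatial(xs, xs, xs)
        s = self.norm_s(s + xs).reshape(B, T, C, d)
        # Temporal: attend across the T frames for each channel
        xt = x.permute(0, 2, 1, 3).reshape(B * C, T, d)
        t, _ = self.temporal(xt, xt, xt)
        t = self.norm_t(t + xt).reshape(B, C, T, d).permute(0, 2, 1, 3)
        # Fuse the spatial and temporal representations
        return self.fuse(torch.cat([s, t], dim=-1))

x = torch.randn(2, 16, 4, 128)   # e.g. 4 channels: body, face, left/right hand
out = MCSTBlockSketch()(x)       # (2, 16, 4, 128)
```

Factorizing attention this way keeps each attention map small (over C channels or T frames, never C×T positions) while still letting every channel condition on every other channel and on its own history.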
Language
eng
URI
https://aurora.ajou.ac.kr/handle/2018.oak/38834
Journal URL
https://dcoll.ajou.ac.kr/dcollection/common/orgView/000000033671
File Download

  • There are no files associated with this item.