| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Tae-Sun Chung | - |
| dc.contributor.author | MA XIAOHAN | - |
| dc.date.issued | 2024-02 | - |
| dc.identifier.other | 33671 | - |
| dc.identifier.uri | https://aurora.ajou.ac.kr/handle/2018.oak/38834 | - |
| dc.description | Thesis (Doctoral) -- Department of Artificial Intelligence, February 2024 | - |
| dc.description.abstract | Sign language serves as the predominant means of communication for individuals who are deaf or hard of hearing. While written language can certainly be a communication tool for the deaf, for those with congenital deafness raised in signing communities, sign language naturally becomes the preferred communication method. Thus, developing advanced technologies for sign language production (SLP) is vital for their societal integration. SLP refers to the task of translating textual forms of spoken language into corresponding sign language expressions. Sign language conveys meaning by means of multiple asynchronous articulators, including manual and non-manual information channels. Recent advancements in deep learning have led to SLP models that directly generate the full-articulatory sign sequence from the text input in an end-to-end manner. However, these models largely down-weight subtle differences in manual articulation due to the effect of regression to the mean. <br>In our first work, we propose an efficient cascade dual-decoder Transformer (CasDual-Transformer) for SLP that successively learns two mappings, SLP_hand : Text → Hand pose and SLP_sign : Text → Sign pose, using an attention-based alignment module that fuses the hand and sign features from previous time steps to predict a more expressive sign pose at the current time step. In addition, to provide more efficacious guidance, we introduce a novel spatio-temporal loss that penalizes shape dissimilarity and temporal distortions in the produced sequences. We perform experimental studies on two benchmark sign language datasets from distinct cultures to verify the performance of the proposed model. Both quantitative and qualitative results show that our model performs competitively against state-of-the-art models and, in some cases, achieves considerable improvements over them. <br>In our subsequent work, we address the challenge of capturing the spatial structure and temporal dynamics of sign language to enhance the quality of sign production. We introduce the Multi-Channel Spatio-Temporal Transformer (MCST-Transformer) for skeletal sign language production, which employs a dual attention mechanism: multi-channel spatial attention to capture spatial correlations across channels within a frame, and multi-channel temporal attention to learn sequential dependencies for each channel. In addition, we explore and experiment with multiple fusion methods for combining the spatial and temporal representations to produce more accurate sign sequences. Our experimental results demonstrate that our approach not only exceeds existing models in accuracy and realism but also confirms the effectiveness of the individual components of the proposed model. (Illustrative sketches of the cascade decoding scheme, the spatio-temporal loss, and the multi-channel attention appear after the metadata table below.) | - |
| dc.description.tableofcontents | 1 Introduction 1 <br> 1.1 Introduction 1 <br> 1.2 Contributions of This Dissertation 6 <br> 1.3 Outlines 7 <br>2 Background 9 <br> 2.1 Avatar Approaches for Sign Language Production 9 <br> 2.2 Deep Learning Approaches for Sign Language Production 10 <br> 2.2.1 Pro-Transformer 12 <br> 2.3 Summary 16 <br>3 Datasets and Evaluation Metrics 18 <br> 3.1 Datasets 18 <br> 3.2 Evaluation Protocols 22 <br> 3.3 Evaluation Metrics 23 <br> 3.3.1 Back-Translation Model 24 <br> 3.3.2 Bilingual Evaluation Understudy (BLEU) 26 <br> 3.3.3 Recall-Oriented Understudy for Gisting Evaluation (ROUGE) 27 <br> 3.3.4 Word Error Rate (WER) 28 <br>4 Cascade Dual-decoder Transformer for Sign Language Production 29 <br> 4.1 Cascade Dual-decoder Transformer 31 <br> 4.1.1 Text Encoder 32 <br> 4.1.2 Hand Pose Decoder 32 <br> 4.1.3 Sign Pose Decoder 33 <br> 4.2 Spatio-Temporal Loss 35 <br> 4.2.1 Spatial Regression Loss 35 <br> 4.2.2 Temporal Continuity Loss 36 <br> 4.3 Performance Evaluations 38 <br> 4.3.1 Model Configuration 38 <br> 4.3.2 Quantitative Results 39 <br> 4.3.2.1 Baseline Comparison 39 <br> 4.3.3 Ablation Study 41 <br> 4.3.3.1 Impact of Different Numbers of Decoder Layers 42 <br> 4.3.3.2 Effect of Spatio-Temporal Loss 43 <br> 4.3.4 Qualitative Analysis 44 <br> 4.4 Conclusions 48 <br>5 Multi-Channel Spatio-Temporal Transformer for Sign Language Production 49 <br> 5.1 Problem Definition 49 <br> 5.2 Multi-Channel Spatio-Temporal Transformer 51 <br> 5.2.1 Encoder 51 <br> 5.2.2 Multi-Channel Spatio-Temporal Decoder 52 <br> 5.2.2.1 Channel-Specific and Full-Channel Embedding 52 <br> 5.2.2.2 Spatial-Attention Module 53 <br> 5.2.2.3 Temporal-Attention Module 54 <br> 5.2.2.4 Spatio-Temporal Fusion Module 56 <br> 5.3 Performance Evaluations 57 <br> 5.3.1 Model Configuration 58 <br> 5.3.2 Quantitative Results 58 <br> 5.3.2.1 Baseline Comparison 58 <br> 5.3.2.2 Ablation Study 61 <br> 5.3.3 Qualitative Analysis 62 <br> 5.4 Conclusions 63 <br>6 Conclusions and Future Work 66 <br> 6.1 Conclusions 66 <br> 6.2 Possible Future Work 67 <br>Bibliography 69 <br> A List of Research Outputs 75 <br> A.1 SCI/SCIE Journal Papers 75 <br> A.2 International Conference Papers 75 | - |
| dc.language.iso | eng | - |
| dc.publisher | The Graduate School, Ajou University | - |
| dc.rights | Theses of Ajou University are protected by copyright. | - |
| dc.title | Deep Learning Methods for Sign Language Production | - |
| dc.type | Thesis | - |
| dc.contributor.affiliation | Graduate School, Ajou University | - |
| dc.contributor.department | Department of Artificial Intelligence, Graduate School | - |
| dc.date.awarded | 2024-02 | - |
| dc.description.degree | Doctor | - |
| dc.identifier.url | https://dcoll.ajou.ac.kr/dcollection/common/orgView/000000033671 | - |
| dc.subject.keyword | Transformer | - |
| dc.subject.keyword | sign language production | - |
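The abstract describes the CasDual-Transformer only at a high level: one decoder maps text to hand poses, a second decoder produces the full sign pose, and an attention-based alignment step fuses hand features into the sign stream. The PyTorch sketch below is a minimal illustration of that cascade under stated assumptions; the class name, pose dimensions (42 hand and 150 full-pose coordinates), and layer counts are hypothetical, positional encodings and training details are omitted, and the dissertation's actual architecture may differ.

```python
import torch
import torch.nn as nn

class CasDualSketch(nn.Module):
    """Sketch: text -> hand pose (decoder 1), then hand-aware sign pose (decoder 2)."""
    def __init__(self, vocab_size=1000, d_model=256, nhead=4, layers=2,
                 hand_dim=42, sign_dim=150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # positional encoding omitted
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        self.hand_dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
        self.sign_dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
        self.hand_in = nn.Linear(hand_dim, d_model)
        self.sign_in = nn.Linear(sign_dim, d_model)
        # attention-based alignment: sign features query the hand features
        self.align = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.hand_out = nn.Linear(d_model, hand_dim)
        self.sign_out = nn.Linear(d_model, sign_dim)

    def forward(self, text_ids, hand_prev, sign_prev):
        # text_ids: (B, L) tokens; hand_prev/sign_prev: (B, T, dim), teacher-forced
        memory = self.encoder(self.embed(text_ids))
        T = hand_prev.size(1)
        # causal mask so each time step only sees earlier frames
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=text_ids.device), 1)
        h = self.hand_dec(self.hand_in(hand_prev), memory, tgt_mask=causal)
        s = self.sign_dec(self.sign_in(sign_prev), memory, tgt_mask=causal)
        # fuse hand cues from previous steps into the sign representation
        fused, _ = self.align(s, h, h, attn_mask=causal)
        return self.hand_out(h), self.sign_out(s + fused)

model = CasDualSketch()
hand, sign = model(torch.randint(0, 1000, (2, 7)),
                   torch.randn(2, 20, 42), torch.randn(2, 20, 150))
print(hand.shape, sign.shape)  # (2, 20, 42) and (2, 20, 150)
```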
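The abstract also introduces a spatio-temporal loss penalizing shape dissimilarity and temporal distortions, and the table of contents splits it into a spatial regression term and a temporal continuity term. One plausible reading, sketched below, pairs a per-frame regression loss with a penalty on mismatched frame-to-frame motion; the L1 distance and the weights `alpha` and `beta` are assumptions for illustration, not the dissertation's exact formulation.

```python
import torch
import torch.nn.functional as F

def spatio_temporal_loss(pred, target, alpha=1.0, beta=0.5):
    """pred, target: (B, T, J) pose sequences (T frames, J joint coordinates)."""
    # spatial regression term: per-frame shape similarity to the ground truth
    spatial = F.l1_loss(pred, target)
    # temporal continuity term: first-order differences approximate
    # frame-to-frame motion; matching them discourages temporal distortion
    temporal = F.l1_loss(pred[:, 1:] - pred[:, :-1],
                         target[:, 1:] - target[:, :-1])
    return alpha * spatial + beta * temporal

loss = spatio_temporal_loss(torch.randn(2, 20, 150), torch.randn(2, 20, 150))
```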
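For the second contribution, the MCST-Transformer's dual attention can be pictured as one attention pass mixing articulator channels within each frame and a second, causal pass mixing frames within each channel, after which the two views are fused. The block below is a hedged sketch of that idea only: the channel count, the concatenate-and-project fusion (one of several fusion methods the abstract says were compared), and the residual wiring are illustrative choices rather than the published design.

```python
import torch
import torch.nn as nn

class MCSTBlock(nn.Module):
    """Sketch of a multi-channel spatio-temporal attention block."""
    def __init__(self, d_model=128, nhead=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (B, T, C, D) -- batch, frames, articulator channels, features
        B, T, C, D = x.shape
        # spatial attention: attend across the C channels inside each frame
        xs = x.reshape(B * T, C, D)
        s, _ = self.spatial_attn(xs, xs, xs)
        s = s.reshape(B, T, C, D)
        # temporal attention: attend across the T frames per channel, causally
        xt = x.transpose(1, 2).reshape(B * C, T, D)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        t, _ = self.temporal_attn(xt, xt, xt, attn_mask=causal)
        t = t.reshape(B, C, T, D).transpose(1, 2)
        # fusion: concatenate the two views, project back, add residual
        return self.norm(x + self.fuse(torch.cat([s, t], dim=-1)))

block = MCSTBlock()
out = block(torch.randn(2, 16, 4, 128))  # B=2, T=16, C=4 channels, D=128
print(out.shape)  # (2, 16, 4, 128)
```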