Sign language serves as the predominant means of communication for individuals who are deaf or hard of hearing. While written language can be a communication tool for deaf people, those with congenital deafness raised in signing communities naturally adopt sign language as their preferred means of communication. Developing advanced technologies for sign language production (SLP) is therefore vital for their societal integration. SLP refers to the task of translating textual forms of spoken language into corresponding sign language expressions. Sign language conveys meaning through multiple asynchronous articulators, spanning both manual and non-manual information channels. Recent advances in deep learning have produced SLP models that generate the full-articulatory sign sequence directly from the text input in an end-to-end manner. However, these models largely down-weight subtle differences in manual articulation due to the effect of regression to the mean.
In our first work, we propose an efficient cascade dual-decoder Transformer (CasDual-Transformer) for SLP that successively learns two mappings, SLP_hand : Text → Hand pose and SLP_sign : Text → Sign pose, using an attention-based alignment module that fuses the hand and sign features from previous time steps to predict a more expressive sign pose at the current time step. In addition, to provide more effective guidance, we introduce a novel spatio-temporal loss that penalizes shape dissimilarity and temporal distortions in the produced sequences. We conduct experiments on two benchmark sign language datasets from distinct cultures to verify the performance of the proposed model. Both quantitative and qualitative results show that our model performs competitively with state-of-the-art models and, in some cases, achieves considerable improvements over them.
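To make the cascade idea concrete, the following is a minimal PyTorch sketch of a dual-decoder layout: one decoder predicts hand poses from encoded text, and a second decoder predicts the full sign pose while an attention module aligns it with the hand features. All module names, dimensions, and the exact fusion scheme (adding the aligned hand features to the sign embedding) are illustrative assumptions, not the precise architecture of the CasDual-Transformer.

```python
import torch
import torch.nn as nn


class CascadeDualDecoder(nn.Module):
    """Sketch: text -> hand poses (SLP_hand), then text + hand -> sign poses (SLP_sign)."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2,
                 hand_dim=63, sign_dim=150):  # pose dimensions are assumptions
        super().__init__()

        def _decoder():
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            return nn.TransformerDecoder(layer, n_layers)

        self.hand_decoder = _decoder()  # SLP_hand: Text -> Hand pose
        self.sign_decoder = _decoder()  # SLP_sign: Text -> Sign pose
        self.hand_in = nn.Linear(hand_dim, d_model)
        self.sign_in = nn.Linear(sign_dim, d_model)
        # Assumed alignment module: cross-attention that fuses hand features
        # from previous time steps into the sign stream.
        self.align = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.hand_out = nn.Linear(d_model, hand_dim)
        self.sign_out = nn.Linear(d_model, sign_dim)

    def forward(self, text_memory, prev_hand, prev_sign):
        # text_memory: (B, S, d_model) output of a Transformer text encoder.
        # prev_hand: (B, T, hand_dim), prev_sign: (B, T, sign_dim) shifted targets.
        T = prev_hand.size(1)
        # Causal mask: -inf above the diagonal so step t only sees steps <= t.
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        causal = causal.to(prev_hand.device)

        h = self.hand_decoder(self.hand_in(prev_hand), text_memory, tgt_mask=causal)
        hand_pose = self.hand_out(h)  # intermediate hand prediction

        s = self.sign_in(prev_sign)
        fused, _ = self.align(query=s, key=h, value=h, attn_mask=causal)
        s = self.sign_decoder(s + fused, text_memory, tgt_mask=causal)
        sign_pose = self.sign_out(s)  # full-articulatory sign prediction
        return hand_pose, sign_pose
```

In this sketch both predictions would be trained jointly, for example with a pose regression loss on each output; the spatio-temporal loss described above would replace or augment the plain regression term.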
In our subsequent work, we address the challenge of capturing the spatial structure and temporal dynamics of sign language to improve the quality of sign production. We introduce the Multi-Channel Spatio-Temporal Transformer (MCST-Transformer) for skeletal sign language production, which employs a dual attention mechanism: multi-channel spatial attention captures spatial correlations across channels within a frame, while multi-channel temporal attention learns sequential dependencies for each channel. In addition, we explore and experiment with multiple fusion methods for combining the spatial and temporal representations to produce more accurate sign sequences. Our experimental results demonstrate that this approach not only exceeds existing models in accuracy and realism but also confirms the effectiveness of the individual components of the proposed model.
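The dual attention mechanism can be sketched as follows: spatial attention runs across the articulator channels within each frame, temporal attention runs along time within each channel, and the two representations are then fused. The channel grouping, dimensions, and the simple residual-sum fusion are assumptions made for illustration; the work itself compares several fusion strategies.

```python
import torch
import torch.nn as nn


class MultiChannelSpatioTemporalBlock(nn.Module):
    """Sketch of dual attention over (time, channel) structured pose features."""

    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (B, T, C, d_model) -- T frames, C articulator channels
        # (e.g. right hand, left hand, body), each embedded to d_model.
        B, T, C, D = x.shape

        # Spatial attention: within each frame, attend across the C channels.
        xs = x.reshape(B * T, C, D)
        spatial, _ = self.spatial_attn(xs, xs, xs)
        spatial = spatial.reshape(B, T, C, D)

        # Temporal attention: within each channel, attend across the T frames,
        # with a causal mask so a frame only depends on earlier frames.
        xt = x.transpose(1, 2).reshape(B * C, T, D)
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1).to(x.device)
        temporal, _ = self.temporal_attn(xt, xt, xt, attn_mask=causal)
        temporal = temporal.reshape(B, C, T, D).transpose(1, 2)

        # Fusion: a plain residual sum here; this is one of several possible
        # strategies (e.g. concatenation or gating) and is purely illustrative.
        return self.norm(x + spatial + temporal)
```

Stacking several such blocks, with the fusion step swapped among candidate strategies, gives a simple way to compare how spatial and temporal representations are best combined.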