To predict the human condition comprehensively, it is beneficial to analyze the various bio-signals obtained from the human body. Existing multi-modal deep sequence models are often highly complex and involve significantly more parameters than single-modal models. However, because multi-modal signal data are scarce compared to single-modal data, more effort is needed to reduce the complexity of multi-modal models. We introduce a multi-modal sequence classification model based on our cross-attention blocks, which aim to reduce the number of parameters involved in cross-referencing different modes. We compare our method with two baseline deep sequence models, LSTM and Transformer, and with a competing method. We test our method on two public datasets and on a dataset obtained from construction work. We show that our model outperforms the compared methods in both accuracy and number of parameters as the number of modes increases.
This work was supported in part by the National Research Foundation of Korea grant funded by the Korean government (2018R1A5A1060031). (Corresponding author: Lee Sael.)