In recent years, dynamic scene understanding has gained attention from researchers because of its widespread applications. The key to successfully understanding dynamic scenes lies in jointly representing appearance and motion features to obtain an informative description. Numerous methods have been introduced to solve the dynamic scene recognition problem; nevertheless, several concerns still need to be investigated. In this article, we introduce a novel multi-modal network for dynamic scene understanding from video data that effectively captures both spatial appearance and temporal dynamics. Furthermore, two-level joint tuning layers are proposed to integrate global and local spatial features as well as spatial- and temporal-stream deep features. To extract temporal information, we present a novel dynamic descriptor, the Volume Symmetric Gradient Local Graph Structure (VSGLGS), which generates temporal feature maps similar to optical flow maps while avoiding the drawbacks of optical flow. Additionally, a handcrafted spatiotemporal feature descriptor based on the Volume Local Directional Transition Pattern (VLDTP) is introduced, which extracts directional information by exploiting edge responses. Lastly, a stacked Bidirectional Long Short-Term Memory (Bi-LSTM) network, together with a temporal mixed pooling scheme, is designed to capture dynamic information without noise interference. Extensive experimental investigation demonstrates that the proposed multi-modal network outperforms most state-of-the-art approaches to dynamic scene understanding.
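To make the final stage of the pipeline concrete, the following is a minimal PyTorch sketch of one plausible form of a stacked Bi-LSTM over per-frame features with a mixed temporal pooling that blends mean and max pooling. The feature dimension, hidden size, number of layers, class count, and the blend weight `alpha` are all illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch only (not the authors' released code): a stacked Bi-LSTM
# consumes a sequence of per-frame deep features; its outputs are pooled over
# time by mixing average pooling (robust to noise) with max pooling (salient
# dynamics). All sizes and the mixing weight `alpha` are assumed values.
import torch
import torch.nn as nn

class StackedBiLSTMMixedPool(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_layers=2,
                 num_classes=14, alpha=0.5):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                              batch_first=True, bidirectional=True)
        self.alpha = alpha  # mixing weight between mean and max pooling
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, time, feat_dim) sequence of per-frame features
        h, _ = self.bilstm(x)        # (batch, time, 2 * hidden_dim)
        mean_pool = h.mean(dim=1)    # average over the temporal axis
        max_pool, _ = h.max(dim=1)   # max over the temporal axis
        pooled = self.alpha * mean_pool + (1 - self.alpha) * max_pool
        return self.fc(pooled)

# Example: 8 clips, 16 frames each, 2048-D per-frame features
logits = StackedBiLSTMMixedPool()(torch.randn(8, 16, 2048))
```

The mixed pooling is one simple way to temper frame-level noise (via averaging) without discarding strong transient responses (via the max term); the paper's actual scheme may weight or structure the pooling differently.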
This research was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2016-0-00406, SIAT CCTV Cloud Platform), by the National Research Foundation of Korea grant funded by the Korea government (MSIT) (NRF-2019R1A2C1006608), and by the BK21 FOUR program of the Ministry of Education (NRF5199991014091). Authors' addresses: Md. A. Uddin, Department of Artificial Intelligence, Ajou University, 206, World cup-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do, 16499, Republic of Korea; email: azher006@yahoo.com; J. B. Joolee and Y.-K. Lee (corresponding author), Department of Computer Science and Engineering, Kyung Hee University, 1732, Deogyeong-daero, Giheung-gu, Yongin-si, Gyeonggi-do 17104, Republic of Korea; emails: julekhajulie@gmail.com, yklee@khu.ac.kr; K.-A. Sohn (corresponding author), Department of Software and Computer Engineering, and Department of Artificial Intelligence, Ajou University, 206, World cup-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do, 16499, Republic of Korea; email: kasohn@ajou.ac.kr. © 2022 Association for Computing Machinery. https://doi.org/10.1145/3462218