Recent stereo matching networks adopt 4D cost volumes and process them with 3D convolutions. Although these methods achieve high accuracy, they inherently require a great deal of computing resources and memory. These requirements limit their use in mobile environments, where computing hardware is constrained. Both accuracy and computational cost matter, and improving both at the same time is a non-trivial task. To address this problem, we propose a simple yet efficient network, called Sequential Feature Fusion Network (SFFNet), which sequentially generates and processes the cost volume using only 2D convolutions. The main building block of our network is the Sequential Feature Fusion (SFF) module, which generates a 3D cost volume covering part of the disparity range by shifting and concatenating the target features, and processes that cost volume using 2D convolutions. A series of SFF modules in SFFNet is designed to gradually cover the full disparity range. Our method avoids heavy computation and efficiently generates an accurate final disparity map. Various experiments show that our method compares favorably with other networks in the trade-off between accuracy and efficiency.
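The shift-and-concatenate idea can be illustrated with a minimal NumPy sketch. It is not the SFFNet implementation. The function names, the tensor sizes, the number of shifts per module (`M`), the number of modules (`N`), and the use of a random 1x1 convolution are all assumptions made for illustration, and the way consecutive modules exchange features in the actual network is not modeled here.

```python
import numpy as np

def shift_right(feat, d):
    """Shift a (C, H, W) feature map d pixels to the right, zero-filling the left."""
    if d == 0:
        return feat.copy()
    out = np.zeros_like(feat)
    out[:, :, d:] = feat[:, :, :-d]
    return out

def sff_cost_volume(ref_feat, tgt_feat, start, num_shifts):
    """Stack reference features with target features shifted by
    start, start + 1, ..., start + num_shifts - 1 pixels along the channel axis.
    The result is a 3D (channels, H, W) volume, so 2D convolutions can process it."""
    shifted = [shift_right(tgt_feat, start + k) for k in range(num_shifts)]
    return np.concatenate([ref_feat] + shifted, axis=0)

def conv1x1(x, weight):
    """A 1x1 2D convolution: weight is (C_out, C_in), x is (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
C, H, W = 4, 6, 16
ref = rng.standard_normal((C, H, W))   # reference-view features
tgt = rng.standard_normal((C, H, W))   # target-view features

M = 3   # shifts covered by one module (assumption)
N = 4   # number of modules in the series (assumption)

covered = []
for n in range(N):
    # Module n covers the disparities n*M, ..., n*M + M - 1.
    vol = sff_cost_volume(ref, tgt, start=n * M, num_shifts=M)
    weight = rng.standard_normal((C, vol.shape[0]))  # stand-in for learned weights
    fused = conv1x1(vol, weight)                     # 2D processing of the volume
    covered.extend(range(n * M, (n + 1) * M))
```

Under these assumptions, each module sees a volume of `C * (1 + M)` channels, and the `N` modules together cover `N * M` disparities without ever building a 4D volume.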
Funding: This work was supported by the Ministry of Science and ICT (MSIT), South Korea, under the Information Technology Research Center (ITRC) Support Program supervised by the Institute for Information and Communications Technology Promotion (IITP) under Grant IITP-2021-2018-0-01424.