With the aid of one manually annotated frame, One-Shot Video Object Segmentation (OSVOS) uses a CNN architecture to tackle the problem of semi-supervised video object segmentation (VOS). However, annotating a pixel-level segmentation mask is expensive and time-consuming. To alleviate the problem, we explore a language interactive way of initializing semi-supervised VOS and run the semi-supervised methods into a weakly supervised mode. Our contributions are two folds: (i) we propose a variant of OSVOS initialized with referring expressions (REVOS), which locates a target object by maximizing the matching score between all the candidates and the referring expression; (ii) segmentation performance of semi-supervised VOS methods varies dramatically when selecting different frames for annotation. We present a strategy of the best annotation frame selection by using image similarity measurement. Meanwhile, we first to propose a multiple frame annotation selection strategy for initialization of semi-supervised VOS with more than one annotated frames. Finally we evaluate our method on DAVIS-2016 dataset, and experimental results show that REVOS achieves similar performance (79.94% measured by average IoU) compared with OSVOS (80.1%). Although current REVOS implementation is specific to the method of one-shot video object segmentation, it can be more widely applicable to other semi-supervised VOS methods.
This work was supported by The Tianjin Science and Technology Program under grant 19PTZWHZ00020 and the National Natural Science Foundation of China under grant 61902281.This study is funded by The Tianjin Science and Technology Program(19PTZWHZ00020) and National Natural Science Foundation of China (Grant No. 61902281).