Multi-object tracking with hallucinated and unlabeled video
In this paper, we explore learning end-to-end deep neural trackers without tracking annotations. This matters because deep neural trackers need large-scale training data, yet tracking annotations are expensive to acquire. We first hallucinate videos from images with bounding box annotations, applying motion transformations along with simulated video effects to create a diverse tracking dataset. We then use a tracker trained on our hallucinated data to mine hard examples from a pool of unlabeled real videos. We propose an optimization-based connecting process that first identifies and then rectifies hard examples in the unlabeled videos, producing a set of mined hard examples with refined pseudo labels. Training jointly on hallucinated data and mined hard video examples, our tracker achieves state-of-the-art performance on the MOT17 and TAO-person datasets.
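To make the video-hallucination idea concrete, the following is a minimal sketch of turning one annotated image into a short pseudo-video clip. It uses only random translations as the motion transformation; the function name, the `max_shift` parameter, and the translation-only motion model are illustrative assumptions, and a full pipeline would also include zoom, rotation, and the simulated video effects (e.g. blur, compression artifacts) mentioned above.

```python
import numpy as np

def hallucinate_clip(image, box, num_frames=8, max_shift=4, seed=None):
    """Hallucinate a tracking clip from a single annotated image.

    Each frame applies a random integer translation to both the image
    and its bounding box, so the annotation stays consistent across
    frames -- yielding tracking labels for free.  Translation-only
    motion is a deliberate simplification for illustration.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    frames, boxes = [], []
    for _ in range(num_frames):
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        # Shift the image; np.roll wraps pixels around, which is a
        # crude stand-in for proper padding/cropping.
        frame = np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))
        x1, y1, x2, y2 = box
        shifted = (np.clip(x1 + dx, 0, w), np.clip(y1 + dy, 0, h),
                   np.clip(x2 + dx, 0, w), np.clip(y2 + dy, 0, h))
        frames.append(frame)
        boxes.append(tuple(int(v) for v in shifted))
    return frames, boxes
```

A clip generated this way pairs every frame with a box sharing one identity, which is exactly the supervision an end-to-end tracker consumes.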
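The hard-example mining step can likewise be sketched in simplified form. The snippet below links detections between consecutive frames by greedy IoU matching and flags tracks that fail to link confidently as hard examples; this greedy linker and the `iou_thresh` parameter are stand-ins for illustration, not the paper's optimization-based connecting process, which additionally rectifies the flagged examples into refined pseudo labels.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def mine_hard_links(prev_boxes, cur_boxes, iou_thresh=0.5):
    """Greedily link boxes across two frames; unlinked or ambiguous
    tracks are flagged as hard examples (a simplified stand-in for
    the optimization-based connecting process)."""
    matches, hard, used = [], [], set()
    for i, p in enumerate(prev_boxes):
        candidates = [(iou(p, c), j) for j, c in enumerate(cur_boxes)
                      if j not in used]
        if candidates:
            score, j = max(candidates)
            if score >= iou_thresh:
                matches.append((i, j))
                used.add(j)
                continue
        hard.append(i)  # no confident continuation: a hard example
    return matches, hard
```

In a self-training loop, the flagged indices would point at video snippets worth rectifying and feeding back as training data alongside the hallucinated clips.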