CoD: Coherent detection of entities from images with multiple modalities
Object detection is a fundamental problem in computer vision, but research has primarily focused on unimodal models that operate solely on visual data. In many real-world applications, however, data from multiple modalities are available, such as text accompanying the visual data. Applying traditional models to these multi-modal sources can make it difficult to accurately delineate object boundaries. For example, in a document containing a combination of text and images, the model must enclose the images and text pertaining to the same object within a single bounding box. To address this, we propose a model that takes multi-scale image features, text extracted through OCR, and 2D positional embeddings of words as inputs, and returns bounding boxes that incorporate an image and its associated description as a single entity. Furthermore, to address the challenge posed by the irregular arrangement of images and their corresponding textual descriptions, we introduce the concept of a negative "Product Bounding Box" (PBB), which captures instances where the model is prone to confusion and tends to predict incorrect bounding boxes. To enhance the model's performance, we incorporate these negative boxes into the loss functions governing matching and classification. Additionally, we propose a domain adaptation model to handle scenarios involving a domain gap between training and test samples. To assess the effectiveness of our model, we construct a multimodal dataset comprising product descriptions from online retailers' catalogs. On this dataset, our proposed model demonstrates significant improvements of 27.2%, 4.3%, and 1.7% in handling hard negative samples, multi-modal input, and domain shift scenarios, respectively.