Self-supervised incremental learning of object representations from arbitrary image sets
2025
Computing a comprehensive and robust visual representation of an arbitrary object or category of objects is a complex problem. The difficulty increases when one starts from a set of uncalibrated images obtained from different sources. We propose a self-supervised approach, Multi-Image Latent Embedding (MILE), which computes a single representation from such an image set. MILE operates incrementally, considering one image at a time and processing the various depictions of the class through a shared gated cross-attention mechanism. The representation is progressively refined as more images become available, without requiring additional training. Our experiments on Amazon Berkeley Objects (ABO) and iNaturalist demonstrate its effectiveness on two tasks: object- or category-specific image retrieval and unsupervised context-conditioned object segmentation. Moreover, the proposed multi-image input setup opens new frontiers for the task of object retrieval. Our studies indicate that our models capture descriptive representations that better encapsulate the intrinsic characteristics of the objects.
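The incremental refinement described above can be illustrated with a minimal sketch: a latent object embedding attends over each new image's token features through a gated cross-attention step, with the same (shared) weights reused for every image. This is a hypothetical NumPy illustration under assumed shapes and a scalar tanh gate, not the authors' implementation; all names (`GatedCrossAttention`, `update`, `gate`) are invented for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GatedCrossAttention:
    """Hypothetical sketch of a MILE-style incremental update (not the paper's code).
    A latent object embedding z queries per-image token features; a tanh gate
    controls how much each new image modifies z. Weights are shared across images,
    so new images refine z without any additional training."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(dim)
        self.Wq = rng.normal(0.0, s, (dim, dim))  # query projection for the latent
        self.Wk = rng.normal(0.0, s, (dim, dim))  # key projection for image tokens
        self.Wv = rng.normal(0.0, s, (dim, dim))  # value projection for image tokens
        self.gate = 0.0  # scalar gate; tanh(0) = 0 leaves z unchanged
        self.dim = dim

    def update(self, z, tokens):
        # z: (dim,) current object representation; tokens: (n, dim) image features
        q = z @ self.Wq                               # (dim,)
        k = tokens @ self.Wk                          # (n, dim)
        v = tokens @ self.Wv                          # (n, dim)
        attn = softmax(k @ q / np.sqrt(self.dim))     # (n,) attention over tokens
        return z + np.tanh(self.gate) * (attn @ v)    # gated residual update

# Incremental use: one image at a time, shared module, no retraining between images.
dim = 8
mod = GatedCrossAttention(dim)
mod.gate = 1.0  # stand-in for a trained gate so updates are visible
rng = np.random.default_rng(1)
z = np.zeros(dim)
for _ in range(3):  # three depictions of the same object
    tokens = rng.normal(size=(5, dim))  # stand-in per-image features
    z = mod.update(z, tokens)
```

Note the design choice the sketch mirrors: because the update is a gated residual on a single latent, the representation after k images can be refined by a (k+1)-th image with the same forward pass, matching the abstract's claim that refinement needs no further training.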