InDi: Informative and diverse sampling for dense retrieval
2024
Negative sample selection has been shown to have a crucial effect on the training of dense retrieval systems. Nevertheless, most existing negative selection methods ultimately choose at random from some pool of samples, which calls for a better sampling solution. We define desired requirements for negative sample selection: the chosen samples should be informative, to advance the learning process, and diverse, to help the model generalize. We propose a sampling method designed to meet these requirements, and show that using it to enhance the training procedure of a recent prominent dense retrieval solution (coCondenser) improves the resulting model's performance. Specifically, we see a ∼2% improvement in MRR@10 on the MS MARCO dataset (from 38.2 to 38.8) and a ∼1.5% improvement in Recall@5 on the Natural Questions dataset (from 71% to 72.1%), both statistically significant. Unlike other methods, our solution does not require training or running inference with a large model, and adds only a small overhead (∼1% additional training time). Finally, we report ablation studies showing that the objectives we define are indeed important when selecting negative samples for dense retrieval.
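To make the two objectives concrete, the following is a minimal sketch (not the paper's actual algorithm) of how negatives could be selected to be both informative and diverse: informativeness is approximated by a candidate's embedding similarity to the query (harder negatives score higher), and diversity by penalizing similarity to negatives already chosen, in a greedy, maximal-marginal-relevance style. The function name, weights, and scoring are illustrative assumptions.

```python
import numpy as np

def select_negatives(query_emb, cand_embs, k=8, diversity_weight=0.5):
    """Greedy selection of k negatives balancing informativeness and diversity.

    Hypothetical sketch, not the method from the paper:
    - informativeness: cosine similarity of a candidate to the query,
      so harder negatives score higher;
    - diversity: penalty for similarity to negatives already selected.
    """
    # Normalize so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    C = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)

    informativeness = C @ q  # higher = harder negative
    selected = []
    for _ in range(min(k, len(C))):
        if selected:
            # Redundancy: max similarity of each candidate to the chosen set.
            redundancy = (C @ C[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(C))
        score = informativeness - diversity_weight * redundancy
        score[selected] = -np.inf  # never pick the same candidate twice
        selected.append(int(np.argmax(score)))
    return selected
```

Setting `diversity_weight=0` recovers pure hard-negative mining; increasing it trades some hardness for coverage of the candidate space.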