Improving unsupervised sentence-pair comparison

Method that captures advantages of cross-encoding and bi-encoding improves on predecessors by as much as 5%.

By Fangyu Liu, Yunlong Jiao

April 29, 2022

Many tasks in natural-language processing and information retrieval involve pairwise comparisons of sentences — for example, sentence similarity detection, paraphrase identification, question-answer entailment, and textual entailment.

The most accurate method of sentence comparison is so-called cross-encoding, which maps sentences against each other on a pair-by-pair basis. Training cross-encoders, however, requires annotated training data, which is labor intensive to collect.

How can we train completely unsupervised models for sentence-pair tasks, eliminating the need for data annotation?

At this year’s International Conference on Learning Representations (ICLR), we are presenting an unsupervised sentence-pair model we call a trans-encoder (paper, code), which improves on the prior state of the art by up to 5% on sentence similarity benchmarks.

A tale of two encoders

Today, there are basically two paradigms for sentence-pair tasks: cross-encoders and bi-encoders. The choice between the two comes down to the standard trade-off between computational efficiency and performance.

A graphic showing graph representations of the phrases "bank account" and "bank river bank", with a full set of connections between all words of both phrases (the cross-encoder) and connections only between the words of the individual phrases (the bi-encoder). — Cross-encoder *(left)* and bi-encoder *(right)*.

Cross-encoder. In a cross-encoder, two sequences are concatenated and sent in one pass to the sentence-pair model, which is usually built atop a Transformer-based language model like BERT or RoBERTa. The attention heads of a Transformer can directly model which elements of one sequence correlate with which elements of the other, enabling the computation of an accurate classification or relevance score.

However, a cross-encoder needs to compute a new encoding for every pair of input sentences, resulting in high computational overhead. Cross-encoding is thus impractical for tasks like information retrieval and clustering, which involve massive pairwise sentence comparisons. Also, converting pretrained language models (PLMs) into cross-encoders always requires fine-tuning on annotated data.
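For contrast, a bi-encoder encodes each sentence separately into a fixed vector and compares the vectors afterward, so each encoding can be computed once and reused. The rough sketch below illustrates the resulting gap in cost by counting encoder passes only. The function names are illustrative and do not come from the paper.

```python
from itertools import combinations


def bi_encoder_passes(num_sentences: int) -> int:
    # A bi-encoder encodes each sentence once; pair scores are then cheap
    # vector comparisons (e.g., cosine similarity) on the cached vectors.
    return num_sentences


def cross_encoder_passes(num_sentences: int) -> int:
    # A cross-encoder needs one full forward pass per concatenated pair.
    return sum(1 for _ in combinations(range(num_sentences), 2))


for n in (10, 1_000):
    print(n, bi_encoder_passes(n), cross_encoder_passes(n))
```

The number of cross-encoder passes grows quadratically with the number of sentences, which is why exhaustive pairwise comparison becomes impractical for retrieval and clustering.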

Trans-encoder: The best of both worlds

In our ICLR paper, we ask whether we can leverage the advantages of both bi- and cross-encoders to bootstrap an accurate sentence-pair model in an unsupervised manner.

Our answer — the trans-encoder — is built on the following intuition: As a starting point, we can use bi-encoder representations to fine-tune a cross-encoder. With its more powerful inter-sentence modeling, the cross-encoder should extract more knowledge from the PLMs than the bi-encoder can given the same input data. In turn, the more powerful cross-encoder can distill its knowledge back into the bi-encoder, improving the accuracy of the more computationally practical model. We can repeat this cycle to iteratively bootstrap from both the bi- and cross-encoders.

The trans-encoder training process, in which a bi-encoder trained in an unsupervised fashion creates training targets for a cross-encoder, which in turn outputs training targets for the bi-encoder.

Specifically, the process of training a trans-encoder is as follows:

Step 1. Transform PLMs into effective bi-encoders. To transform existing PLMs into bi-encoders, we leverage a simple contrastive tuning procedure. Given a sentence, we encode it twice with the same PLM. Because of dropout (a standard technique in which a fraction of neural-network nodes are randomly dropped during each training pass, to prevent overfitting), the two passes will produce slightly different encodings.

The bi-encoder is then trained to maximize the similarity of the two almost-identical encodings. This step primes the PLMs to be good at embedding sequences. Details can be found in the prior work on Mirror-BERT and SimCSE.
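A minimal sketch of this kind of contrastive objective follows, assuming an InfoNCE-style loss in which the two dropout views of a sentence form the positive pair and the other sentences in the batch serve as negatives. The toy vectors stand in for PLM outputs, dropout is applied directly to those vectors rather than inside a network, and the temperature and dropout rate are illustrative values, not the paper's settings.

```python
import math
import random


def dropout(vec, rate, rng):
    # Zero each coordinate with probability `rate`; rescale the survivors.
    return [0.0 if rng.random() < rate else v / (1.0 - rate) for v in vec]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce(views_a, views_b, temperature=0.05):
    # For sentence i, views_b[i] is the positive; every other views_b[j]
    # in the batch acts as a negative.
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_norm - logits[i]
    return total / len(views_a)


rng = random.Random(0)
# Hypothetical stand-ins for PLM sentence representations.
batch = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(4)]
views_a = [dropout(x, 0.1, rng) for x in batch]
views_b = [dropout(x, 0.1, rng) for x in batch]
loss = info_nce(views_a, views_b)
print(f"contrastive loss: {loss:.4f}")
```

Minimizing this loss pulls the two views of each sentence together while pushing apart the encodings of different sentences.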

Step 2. Self-distillation: bi- to cross-encoder. After obtaining a reasonably good bi-encoder from step one, we use it to create training data for a cross-encoder. Specifically, we label sentence pairs with the pairwise similarity scores computed by the bi-encoder and use them as training targets for a cross-encoder built on top of a new PLM.
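The labeling step can be sketched as follows, assuming cosine similarity between bi-encoder embeddings as the pairwise score. The three-dimensional embeddings and the example sentences are hypothetical placeholders for real bi-encoder outputs.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pseudo_label(embeddings):
    # embeddings maps each sentence to its bi-encoder vector. Every
    # unordered pair gets the bi-encoder's similarity as a soft target.
    return [
        (s1, s2, cosine(e1, e2))
        for (s1, e1), (s2, e2) in combinations(embeddings.items(), 2)
    ]


# Hypothetical toy embeddings standing in for bi-encoder outputs.
embeddings = {
    "A man is playing a guitar.": [0.9, 0.1, 0.0],
    "Someone plays the guitar.": [0.8, 0.2, 0.1],
    "The stock market fell today.": [0.0, 0.1, 0.9],
}
pairs = pseudo_label(embeddings)
for s1, s2, score in pairs:
    print(f"{score:.2f}  {s1} | {s2}")
```

The resulting (sentence, sentence, score) triples play the role of annotated data when the cross-encoder is fine-tuned.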

Step 3. Self-distillation: cross- to bi-encoder. A natural next step is to distill the extra knowledge gained from the cross-encoder back into bi-encoder form, which is more useful for downstream tasks. More important, a better bi-encoder can produce even more self-labeled data for tuning the cross-encoder. In this way we can repeat steps two and three, continually bootstrapping the encoder performance.
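The alternation between steps two and three can be outlined as a loop. This is a control-flow sketch only: `train_cross` and `train_bi` are hypothetical callables that fit a model to soft targets and return a scoring function, and the number of cycles is an assumed hyperparameter.

```python
def trans_encoder_cycles(pairs, bi_encoder, train_cross, train_bi, num_cycles):
    cross_encoder = None
    for _ in range(num_cycles):
        # Step 2: the bi-encoder labels the pairs; a cross-encoder fits them.
        targets = [bi_encoder(pair) for pair in pairs]
        cross_encoder = train_cross(pairs, targets)
        # Step 3: the cross-encoder relabels the pairs; a bi-encoder fits them.
        targets = [cross_encoder(pair) for pair in pairs]
        bi_encoder = train_bi(pairs, targets)
    return bi_encoder, cross_encoder


def memorize(pairs, targets):
    # Toy "training": the returned scorer just looks up the target it saw.
    table = dict(zip(pairs, targets))
    return lambda pair: table[pair]


def initial(pair):
    # Placeholder for the bi-encoder obtained in step one.
    return 0.5


pairs = [("a", "b"), ("a", "c")]
bi, cross = trans_encoder_cycles(pairs, initial, memorize, memorize, num_cycles=2)
print(bi(("a", "b")), cross(("a", "c")))
```

In a real setup each trainer would fine-tune a PLM on the soft labels. The toy trainers here only pass scores through, so the sketch shows the data flow rather than any gain in accuracy.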

Our paper proposes other techniques, such as mutual distillation, to improve our model’s performance. Please refer to Section 2.4 of the paper for more details.

Benchmark: A new state of the art for sentence similarity

We experiment with the trans-encoder on seven semantic textual similarity (STS) benchmarks. We observe significant improvements over previous unsupervised sentence-pair models across all datasets.

A table of results of experiments comparing trans-encoder to existing models. — Trans-encoder performance on the semantic textual similarity (STS) benchmarks STS 2012-2016, STS-B, and SICK-R.
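STS benchmarks are commonly scored by the Spearman rank correlation between model scores and human similarity ratings. Assuming that protocol applies here, the metric can be sketched as below. The numbers in the example are made up for illustration and are not results from the paper.

```python
import math


def average_ranks(values):
    # Assign 1-based ranks; tied values share the average of their ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))


# Made-up model scores and human ratings for four sentence pairs.
model_scores = [0.91, 0.15, 0.62, 0.40]
human_ratings = [4.8, 0.5, 2.0, 3.1]
print(f"Spearman correlation: {spearman(model_scores, human_ratings):.3f}")
```

Because the metric depends only on rank order, a model is rewarded for ordering sentence pairs as humans do, regardless of the absolute scale of its scores.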

We also benchmark the trans-encoder on binary-classification and domain-transfer tasks. Please refer to Section 5 of the paper for more details.

About the Authors

Fangyu Liu

Fangyu Liu, a PhD student in computation, cognition, and language at the University of Cambridge, was an intern at Amazon when the work was done.

Yunlong Jiao

Yunlong Jiao is an applied scientist with Alexa Shopping.
