Conversational AI

Compressing token-embedding matrices for language models

Combining low-rank approximation, a residual binary autoencoder, and a new loss function enables a fivefold increase in compression ratio.

By Haoyu Wang, Ruirui Li

August 9, 2023

Pretrained language models (PLMs) like BERT, RoBERTa, and DeBERTa, when fine-tuned on task-specific data, have demonstrated exceptional performance across a diverse array of natural-language tasks, including natural-language inference, sentiment classification, and question answering.

PLMs typically comprise a matrix for token embeddings, a deep neural network featuring an attention mechanism, and an output layer. Because the vocabulary table is large, the token-embedding matrix often constitutes a substantial portion of the model: for instance, it accounts for more than 21% of BERT’s model size and 31.2% of RoBERTa’s. Moreover, because token frequencies vary widely, the token-embedding matrix contains numerous redundancies. Thus, any technique capable of compressing the token-embedding matrix has the potential to complement other methods for model compression, resulting in a higher overall compression ratio.
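As a rough check on the BERT figure, the arithmetic can be sketched as below. The vocabulary size, hidden size, and total parameter count are commonly quoted values for BERT-base and are assumptions here, not numbers taken from this article.

```python
# Back-of-the-envelope share of parameters held by the token-embedding matrix.
# Assumed figures for BERT-base (not from the article): a 30,522-entry
# vocabulary, hidden size 768, and roughly 110 million parameters in total.
vocab_size = 30_522
hidden_size = 768
total_params = 110_000_000  # approximate, commonly quoted figure

# One embedding vector of length hidden_size per vocabulary entry
embedding_params = vocab_size * hidden_size
embedding_share = embedding_params / total_params

print(f"token-embedding parameters: {embedding_params:,}")
print(f"share of the model: {embedding_share:.1%}")
```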

Desiderata

The ideal approach to compressing token-embedding matrices for PLMs should have the following characteristics: (1) task agnosticity, to ensure effective application across diverse downstream tasks; (2) model agnosticity, allowing seamless integration as a modular component for various backbone models; (3) synergistic compatibility with other model compression techniques; and (4) a substantial compression ratio with little loss of model performance.

LightToken, our approach, meets these criteria, generating compressed token embeddings in a manner that is independent of both specific tasks and specific models.

Rank-k SVD approximation

Figure: Singular values for three transformer-based language models. Particularly for BERT, the first few singular values dominate the rest.

Numerous prior studies have highlighted the effectiveness of singular-value decomposition (SVD) in compressing model weight matrices. SVD decomposes a matrix into three matrices, one of which is a diagonal matrix. The entries in the diagonal matrix, the singular values, indicate how much of the variance in the data is explained along each corresponding direction. By keeping only the largest singular values, it’s possible to project high-dimensional data down to a lower-dimensional subspace.

Token-embedding matrices typically have a relatively small number of large singular values; the rest are comparatively small. Consequently, the first step in our approach involves employing SVD to achieve a rank-k approximation of the token-embedding matrix, for a small k.
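A rank-k approximation of this kind can be sketched with NumPy as follows. This is a minimal illustration on a synthetic matrix, not the LightToken implementation; the helper name `rank_k_approximation`, the matrix sizes, and the noise level are all assumptions chosen for the example.

```python
import numpy as np

def rank_k_approximation(matrix, k):
    """Return factors (A, B) such that A @ B is the truncated-SVD rank-k approximation."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    A = U[:, :k] * s[:k]   # shape (vocab, k): left singular vectors scaled by singular values
    B = Vt[:k, :]          # shape (k, dim): top-k right singular vectors
    return A, B

rng = np.random.default_rng(0)
vocab, dim, true_rank = 1000, 64, 8

# Toy stand-in for an embedding matrix: low-rank structure plus a little noise
E = rng.normal(size=(vocab, true_rank)) @ rng.normal(size=(true_rank, dim))
E += 0.01 * rng.normal(size=(vocab, dim))

k = 8
A, B = rank_k_approximation(E, k)
E_k = A @ B

# Relative Frobenius-norm error of the reconstruction
relative_error = np.linalg.norm(E - E_k) / np.linalg.norm(E)
# Values stored originally versus values stored in the two factors
stored_ratio = (vocab * dim) / (k * (vocab + dim))

print(f"relative reconstruction error: {relative_error:.4f}")
print(f"compression from factoring alone: {stored_ratio:.1f}x")
```

Storing the two factors takes k × (vocab + dim) values instead of vocab × dim, which is where the savings come from when k is small.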

Residual hashing

In experiments, we found that relying solely on the rank-k compression matrix, while it provided substantial compression, compromised performance on downstream tasks too severely. So LightToken also uses a residual binary autoencoder to encode the differences between the full token-embedding matrix and the matrix reconstituted from the rank-k compression matrix.
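The residual that the autoencoder has to encode can be illustrated as below. This is a sketch on random data with hypothetical sizes; it only shows how the residual is formed and how its size relates to the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.normal(size=(200, 32))   # stand-in for a token-embedding matrix

U, s, Vt = np.linalg.svd(E, full_matrices=False)
k = 4
E_k = (U[:, :k] * s[:k]) @ Vt[:k, :]   # matrix rebuilt from the rank-k factors

# Difference between the full matrix and its rank-k reconstruction
residual = E - E_k

# The Frobenius norm of the residual equals the root of the summed squares
# of the singular values that were discarded.
residual_norm = np.linalg.norm(residual)
discarded_energy = np.sqrt(np.sum(s[k:] ** 2))

print(f"residual norm: {residual_norm:.4f}")
print(f"discarded singular-value energy: {discarded_energy:.4f}")
```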

Figure: The architecture of the model that learns to produce hash codes for residual matrix values.

Autoencoders are trained to output the same values they take as inputs, but in between, they produce compressed vector representations of the inputs. We constrain those representations to be binary: they are the hash codes.

Binary codes are non-differentiable, however, so during model training, in order to use the standard gradient descent learning algorithm, we have to approximate the binary values with tempered sigmoid activation functions, which have a steep slope between low and high values.
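The effect of a steep sigmoid can be sketched as follows. The `steepness` multiplier, the random encoder weights `W_enc`, and the thresholding rule are assumptions made for illustration; the article does not specify how the tempering is parameterized, and no training is performed here.

```python
import numpy as np

def tempered_sigmoid(z, steepness):
    """Sigmoid whose slope around zero grows with `steepness`."""
    # Clip the argument to avoid overflow in exp for very large magnitudes
    return 1.0 / (1.0 + np.exp(-np.clip(steepness * z, -60.0, 60.0)))

def hard_codes(z):
    """Binary codes: 1 where the pre-activation is positive, else 0."""
    return (z > 0).astype(np.float64)

rng = np.random.default_rng(2)
residual_rows = rng.normal(size=(5, 16))   # toy residual vectors
W_enc = rng.normal(size=(16, 8))           # hypothetical encoder weights
z = residual_rows @ W_enc                  # pre-activations, shape (5, 8)

soft_gentle = tempered_sigmoid(z, steepness=1.0)
soft_steep = tempered_sigmoid(z, steepness=50.0)
bits = hard_codes(z)

# Average distance between the differentiable surrogate and the binary codes
gap_gentle = np.abs(soft_gentle - bits).mean()
gap_steep = np.abs(soft_steep - bits).mean()

print(f"mean gap to binary codes, steepness 1:  {gap_gentle:.4f}")
print(f"mean gap to binary codes, steepness 50: {gap_steep:.4f}")
```

The steeper function stays differentiable but sits much closer to the 0/1 codes, which is what makes it usable as a stand-in during gradient descent.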

Loss function

Typically, a model like ours would be trained to minimize the Euclidean distance between the reconstructed token-embedding matrix and the uncompressed matrix. But we find that Euclidean distance yields poor performance on some natural-language-processing (NLP) tasks and on tasks with small training sets. We hypothesize that this is because Euclidean distance pays inadequate attention to the angle between vectors in the embedding space, which on NLP tasks can carry semantic information.

So we propose a new reconstruction loss, which serves as an upper bound on Euclidean distance. This loss encourages the model to prioritize directional alignment between the original and compressed embeddings by recalibrating cosine similarity.
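The article does not give the formula for the loss, so the sketch below does not attempt to reproduce it. It only illustrates the underlying point with two hypothetical reconstructions: both lie at the same Euclidean distance from the original vector, yet only one preserves its direction. It also checks the standard identity that ties squared Euclidean distance to vector norms and cosine similarity.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

original = np.array([1.0, 0.0])
# Two hypothetical reconstructions, both at Euclidean distance 0.5
scaled = np.array([1.5, 0.0])    # same direction, wrong length
rotated = np.array([1.0, 0.5])   # direction changed

results = {}
for name, recon in [("scaled", scaled), ("rotated", rotated)]:
    dist = float(np.linalg.norm(original - recon))
    cos = cosine_similarity(original, recon)
    # Identity: ||u - v||^2 = ||u||^2 + ||v||^2 - 2 ||u|| ||v|| cos(u, v)
    nu, nv = np.linalg.norm(original), np.linalg.norm(recon)
    identity_rhs = float(nu**2 + nv**2 - 2 * nu * nv * cos)
    results[name] = (dist, cos, identity_rhs)
    print(f"{name}: distance={dist:.3f}, cosine={cos:.3f}")
```

A loss based on Euclidean distance alone scores the two reconstructions identically, whereas a term involving cosine similarity distinguishes them.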

Figure: The full, four-stage LightToken framework.

We carried out comprehensive experiments on two benchmark datasets: GLUE and SQuAD 1.1. In these experiments, LightToken outperforms the established baselines. Specifically, LightToken achieves a 25-fold compression ratio while maintaining accuracy. Moreover, when the compression ratio rises to 103, the accuracy loss remains within 6% of the original benchmark.

About the Author

Haoyu Wang

Haoyu Wang is a graduate student in electrical and computer engineering at Purdue University. He was an intern at Amazon when the work was done.

Ruirui Li

Ruirui Li is a senior applied scientist at Amazon.
