Factorization-aware training of transformers for natural language understanding on the edge
Fine-tuned transformer-based models have been shown to outperform other methods on many Natural Language Understanding (NLU) tasks. Recent work on reducing the size of transformer models has achieved size reductions of more than 80%, making on-device inference possible on powerful devices. However, more resource-constrained devices, such as those that enable voice assistants (VAs), require far greater reductions. In this work, we propose factorization-aware training (FAT), wherein we factorize the linear mappings of an already compressed transformer model (DistilBERT) and train the factorized model jointly on NLU tasks. We test this method on three different NLU datasets and show that it outperforms naive application of factorization after training by 10% to 440% across various compression rates. Additionally, we introduce a new metric called the factorization gap and use it to analyze the need for FAT across various model components. We also present results for training subsets of factorized components, which enables faster training, re-usability, and maintainability across multiple on-device models. We further demonstrate the trade-off between memory, inference speed, and performance at a given compression rate for an on-device implementation of a factorized model. Our best-performing factorized model achieves a relative size reduction of 84% with ≈ 10% relative degradation in NLU error rate compared to a non-factorized model on our internal dataset.
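To make "factorizing a linear mapping" concrete, the following is a minimal sketch of the naive post-training baseline. It replaces a dense weight matrix with two low-rank factors. The abstract does not state which factorization is used, so the truncated SVD, the helper names `factorize_linear` and `compression_rate`, and the rank of 64 are assumptions chosen for illustration. The 768 × 768 shape matches DistilBERT's hidden size.

```python
import numpy as np

def factorize_linear(W, rank):
    """Approximate W (out_dim x in_dim) by A @ B using a truncated SVD.

    Assumption: SVD is used here only as an illustrative factorization;
    the source does not specify the method.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # out_dim x rank, singular values absorbed
    B = Vt[:rank, :]            # rank x in_dim
    return A, B

def compression_rate(W, rank):
    """Relative parameter reduction for one factorized weight matrix."""
    out_dim, in_dim = W.shape
    original = out_dim * in_dim
    factorized = rank * (out_dim + in_dim)
    return 1.0 - factorized / original

# Hypothetical weight matrix standing in for one transformer linear layer.
rng = np.random.default_rng(0)
W = rng.standard_normal((768, 768))
A, B = factorize_linear(W, rank=64)

x = rng.standard_normal(768)
y_dense = W @ x           # original mapping
y_factored = A @ (B @ x)  # two smaller matrix-vector products

print(A.shape, B.shape, round(compression_rate(W, 64), 3))
```

In a factorization-aware setup, factors such as `A` and `B` would be treated as trainable parameters and fine-tuned on the NLU task rather than left at their post-hoc values. This sketch covers only the decomposition step and the resulting parameter count.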