Pre-trained language models like BERT have reported state-of-the-art performance on several Natural Language Processing (NLP) tasks, but their high computational demands hinder widespread adoption for large-scale NLP tasks. In this work, we propose a novel routing-based early-exit model called BE3R (BERT-based Early-Exit using Expert Routing), which learns to dynamically exit at earlier layers without traversing the entire model. Unlike existing early-exit methods, our approach extends to the batch inference setting. We consider the specific application of search relevance filtering in Amazon India marketplace services (a large e-commerce website). Our experimental results show that BE3R improves batch inference throughput by 46.5% over the BERT-Base model and 35.89% over the DistilBERT-Base model on a large dataset of 50 million samples, without any trade-off on the performance metric. We conduct thorough experimentation with various architectural choices and loss functions, and perform qualitative analysis. We also experiment on the public GLUE Benchmark [28] and demonstrate performance comparable to the corresponding baseline models, with a 23% average throughput improvement across tasks in the batch inference setting.
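To make the batch-friendly early-exit idea concrete, the following is a minimal sketch (not the paper's actual architecture). It assumes a hypothetical per-layer router that scores each sample's exit confidence; confident samples exit with their current representation, shrinking the active batch at each layer, while remaining samples continue through deeper layers:

```python
def batch_early_exit(batch, layers, routers, threshold=0.5):
    """Illustrative sketch of routing-based early exit in a batch setting.

    batch:   list of per-sample inputs (hidden states)
    layers:  list of callables, one transformer layer each (hypothetical)
    routers: list of callables returning an exit-confidence score per sample
    Returns {original_index: (exit_layer, final_hidden_state)}.
    """
    results = {}
    active = list(enumerate(batch))  # (original_index, hidden_state)
    for depth, (layer, router) in enumerate(zip(layers, routers)):
        # Run the current layer only on samples that have not yet exited.
        active = [(i, layer(h)) for i, h in active]
        still_active = []
        for i, h in active:
            # Exit if the router is confident, or at the final layer.
            if router(h) >= threshold or depth == len(layers) - 1:
                results[i] = (depth, h)
            else:
                still_active.append((i, h))
        active = still_active
        if not active:  # whole batch has exited early
            break
    return results
```

Because exited samples are dropped from the active batch rather than masked, later layers process progressively smaller batches, which is what yields the throughput gains in batch inference.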