Magic Pyramid: Accelerating inference with early exiting and token pruning
Pretraining followed by finetuning of large language models is a common approach to achieving strong performance on natural language processing (NLP) tasks. However, most pre-trained models have a large memory footprint and low inference speed, which makes deploying them in applications with latency constraints challenging. In this work, we focus on accelerating inference via conditional computation. To this end, we propose a novel idea, Magic Pyramid (MP), which reduces both width-wise and depth-wise computation via token pruning and early exiting for BERT. The former saves computation by removing non-salient tokens, while the latter reduces computation by terminating inference before the final layer whenever an exiting condition is met. Our empirical studies demonstrate that MP not only achieves speed-adjustable inference, but also surpasses token pruning and early exiting in terms of GFLOPs, with minimal loss in accuracy. Token pruning and early exiting exhibit distinct preferences for sequences of different lengths; MP, however, achieves a drastic speedup regardless of input length.
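The two mechanisms can be illustrated with a minimal sketch: inside the layer-by-layer forward pass, token pruning drops the least salient tokens (width-wise reduction), and early exiting stops once an intermediate classifier is confident enough, here measured by prediction entropy. All names (`magic_pyramid_forward`, `saliency_fn`, `keep_ratio`, `exit_threshold`) and the toy layers are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy; low entropy means a confident prediction.
    return -np.sum(p * np.log(p + 1e-12))

def magic_pyramid_forward(hidden, layers, classifier, saliency_fn,
                          keep_ratio=0.7, exit_threshold=0.3):
    """Hypothetical sketch of combined token pruning + early exiting.

    hidden:      (seq_len, dim) token representations
    layers:      list of per-layer transform functions (stand-ins for BERT layers)
    classifier:  maps hidden states to class logits (stand-in for an exit head)
    saliency_fn: scores each token's importance
    """
    probs = None
    for layer in layers:
        hidden = layer(hidden)                    # (seq_len, dim)
        # Width-wise reduction: keep only the most salient tokens.
        scores = saliency_fn(hidden)              # (seq_len,)
        k = max(1, int(len(scores) * keep_ratio))
        keep = np.sort(np.argsort(scores)[-k:])   # preserve token order
        hidden = hidden[keep]
        # Depth-wise reduction: exit before the final layer if confident.
        probs = softmax(classifier(hidden))
        if entropy(probs) < exit_threshold:
            break
    return probs

# Toy example with stand-in components.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 4))
layers = [np.tanh] * 4                            # toy "transformer layers"
saliency_fn = lambda h: np.abs(h).mean(axis=1)    # toy importance score
classifier = lambda h: h.mean(axis=0)[:2]         # toy 2-class exit head
probs = magic_pyramid_forward(hidden, layers, classifier, saliency_fn)
```

The key design point is that the two savings compound: every pruned token makes all subsequent layers cheaper, and early exiting caps the depth on top of that, which is why the combined scheme can beat either method alone in GFLOPs.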