Pyramid dynamic inference: Encouraging faster inference via early exit boosting
Large transformer-based models have demonstrated state-of-the-art results on several Natural Language Understanding (NLU) tasks. However, their deployment comes at the cost of increased footprint and inference latency, limiting their adoption in real-time applications, especially on resource-constrained devices. To optimize the trade-off between model accuracy, footprint, and inference latency, we propose Pyramid Dynamic Inference (PDI), a scheme that encourages fast inference by introducing early inference routes in a transformer model, with a focus on boosting the performance of early exit heads. Owing to the limited capacity of the earlier transformer layers to extract complex semantics, the exit heads for these layers typically display high confidence only on easy data samples. PDI aims to address this limitation by applying a pyramidal structure to the classification heads, which allows for more confident early inference by injecting stronger classifiers at earlier layers. It also prevents a significant increase in the model footprint by gradually shrinking the classifiers as the semantic capacity of the deeper transformer layers increases. We validate the efficiency of the PDI scheme on the GLUE benchmark, where we show that PDI consistently outperforms FastBERT on both accuracy and latency.
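The two ideas in the abstract, classifier heads that shrink with depth and inference that stops at the first confident head, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the paper's implementation: the linear shrinking schedule, the use of normalized entropy as the confidence measure, the threshold values, and the hypothetical helper names `pyramid_head_widths`, `normalized_entropy`, and `early_exit`.

```python
import math


def pyramid_head_widths(num_layers, widest, narrowest):
    """Hidden width of each exit head, shrinking linearly with depth.

    Assumption: a linear schedule; the abstract only says the classifiers
    shrink gradually toward the deeper layers.
    """
    if num_layers == 1:
        return [widest]
    step = (widest - narrowest) / (num_layers - 1)
    return [round(widest - i * step) for i in range(num_layers)]


def normalized_entropy(probs):
    """Entropy of a class distribution scaled to [0, 1]; lower is more confident."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def early_exit(per_layer_probs, threshold):
    """Return (layer_index, predicted_class) for the first confident exit head.

    per_layer_probs holds one class distribution per exit head, ordered from
    the shallowest layer to the deepest. If no head is confident enough,
    the deepest head's prediction is used.
    """
    for layer, probs in enumerate(per_layer_probs):
        if normalized_entropy(probs) < threshold:
            return layer, max(range(len(probs)), key=probs.__getitem__)
    last = per_layer_probs[-1]
    return len(per_layer_probs) - 1, max(range(len(last)), key=last.__getitem__)


# Hypothetical outputs of three exit heads for one binary-classification sample.
head_outputs = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01]]
widths = pyramid_head_widths(4, 512, 128)
exit_layer, label = early_exit(head_outputs, threshold=0.5)
```

In this sketch a looser threshold lets the sample leave at the second head, while a stricter one pushes it to a deeper head. This is the accuracy versus latency trade-off the abstract refers to: stronger early heads are meant to let more samples clear a given threshold sooner.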