Semantic prefixes as an auxiliary target for code understanding and generation
2023
Code understanding and generation require learning the mapping between human and programming languages. Because human and programming languages differ in vocabulary, semantics, and syntax, it is challenging for an autoregressive model to generate a sequence of tokens that is both semantically correct (i.e., carries the right meaning) and syntactically correct (i.e., in the right order). Motivated by this, we propose a prefix-based learning framework that lessens the burden on an autoregressive generation model by decoupling the learning of semantic and syntactic dependencies. In particular, during training we prepend the target output with a semantic embedding that encodes the output sequence. In this way, the model learns to first predict the semantics of the output before generating the sequence itself. Evaluated on 11 code understanding and generation datasets, our prefix-prepending approach improves the baseline by an average of 8.1% in exact match and 5.5% in CodeBLEU. It also outperforms or is on par with state-of-the-art methods across a variety of code understanding tasks. Our approach is general and can be used as a meta-algorithm on top of any autoregressive language model.
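To make the prefix-prepending idea concrete, the following is a minimal sketch in PyTorch, not the paper's implementation: it uses a toy decoder-only transformer, stands in a mean-pooled projection of the target embeddings for the learned semantic embedding, and all names (SemanticPrefixLM, d_model, etc.) and hyperparameters are illustrative assumptions. The key step is concatenating the semantic prefix in front of the target tokens so the autoregressive model conditions on a summary of the output before predicting its tokens.

```python
# Illustrative sketch only: prepend a "semantic prefix" vector to the target
# sequence during training so the decoder sees a summary of the output first.
import torch
import torch.nn as nn

class SemanticPrefixLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Toy "semantic encoder": mean-pool target token embeddings into a
        # single prefix vector (a stand-in for a learned semantic embedding).
        self.sem_proj = nn.Linear(d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, target_ids):
        tok = self.tok_emb(target_ids)                         # (B, T, D)
        prefix = self.sem_proj(tok.mean(dim=1, keepdim=True))  # (B, 1, D)
        x = torch.cat([prefix, tok], dim=1)                    # prepend the semantic prefix
        T = x.size(1)
        # Causal mask so each position attends only to the prefix and earlier tokens.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.decoder(x, mask=mask)
        return self.lm_head(h)

# Training step: position 0 (the prefix) predicts the first target token,
# and each subsequent position predicts the next token autoregressively.
model = SemanticPrefixLM()
target_ids = torch.randint(0, 1000, (8, 16))                  # toy batch of target sequences
logits = model(target_ids)                                    # (B, T+1, vocab)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)), target_ids.reshape(-1)
)
loss.backward()
```

In this sketch the prefix is computed from the full target sequence, which mirrors the abstract's description of a semantic embedding that encodes the output; at inference time such a prefix would instead have to be predicted from the input.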