Improving generalizability of protein sequence models with data augmentations
Protein sequence modeling typically does not use randomized data augmentation during training, because even simple sequence modifications can introduce unpredictable functional changes. In this paper, however, we empirically explore a set of simple string manipulations for use when fine-tuning semi-supervised protein models. We compare against the Tasks Assessing Protein Embeddings (TAPE) baseline models, using methods that differ from the baselines only in the data augmentations and representation learning procedure, and demonstrate improvements of 1% to 41% over the baseline scores on the TAPE validation tasks, under both linear evaluation and full fine-tuning on downstream tasks. We find the most consistent gains from domain-motivated transformations, such as amino acid replacement and subsampling of the protein sequence. In rarer cases, we find that even information-destroying augmentations, such as random sequence shuffling, can improve performance.
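The augmentations named above can be sketched as simple string operations on an amino acid sequence. This is a hypothetical illustration of the general idea, not the paper's exact implementation; the function names and default rates are assumptions for the sketch:

```python
import random

# The 20 standard amino acids (one-letter codes)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def replace_residues(seq, p=0.05, rng=random):
    """Replace each residue with a uniformly sampled amino acid with probability p."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < p else aa for aa in seq)

def subsample(seq, min_frac=0.5, rng=random):
    """Take a random contiguous subsequence covering at least min_frac of the sequence."""
    length = rng.randint(int(len(seq) * min_frac), len(seq))
    start = rng.randint(0, len(seq) - length)
    return seq[start:start + length]

def shuffle_sequence(seq, rng=random):
    """Randomly permute all residues (an information-destroying augmentation)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)
```

In a fine-tuning loop, one of these transformations would be applied to each sequence before tokenization; replacement and shuffling preserve sequence length, while subsampling preserves only a contiguous fragment.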