Improving generalizability of protein sequence models with data augmentations

Hongyu Shen; Layne C. Price; Mohammad Taha Bahadori; Franziska Seeger

2020

Protein sequence modeling typically does not use randomized data augmentation during training, because even simple sequence modifications can introduce unpredictable functional changes. In this paper, we empirically explore a set of simple string manipulations as data augmentations when fine-tuning semi-supervised protein models. We compare against the Tasks Assessing Protein Embeddings (TAPE) baseline models, using methods that differ from the baselines only in the data augmentations and the representation learning procedure. We demonstrate improvements of between 1% and 41% over the baseline scores on the TAPE validation tasks, with both linear evaluation and full fine-tuning on downstream tasks. The most consistent results come from domain-motivated transformations, such as amino acid replacement, and from subsampling of the protein sequence. In rarer cases, we even find that information-destroying augmentations, such as random sequence shuffling, can improve performance.
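The three kinds of string manipulation named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `SIMILAR` substitution table, the replacement probability `p`, and the choice of a contiguous window for subsampling are all hypothetical and are not taken from the paper.

```python
import random

# Hypothetical substitution pairs grouping residues with broadly similar
# properties. This table is an illustrative assumption, not the mapping
# used in the paper.
SIMILAR = {
    "A": "V", "V": "A", "S": "T", "T": "S", "F": "Y", "Y": "F",
    "K": "R", "R": "K", "D": "E", "E": "D", "N": "Q", "Q": "N",
    "I": "L", "L": "I",
}


def replace_amino_acids(seq, p=0.1, rng=random):
    """Swap each residue for its listed partner with probability p."""
    return "".join(
        SIMILAR[aa] if aa in SIMILAR and rng.random() < p else aa
        for aa in seq
    )


def subsample(seq, length, rng=random):
    """Return a random contiguous window of at most `length` residues.

    Using a contiguous window is an assumption made for this sketch.
    """
    if len(seq) <= length:
        return seq
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]


def shuffle(seq, rng=random):
    """Randomly permute residues: order is destroyed, composition is kept."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)
```

Each function takes a plain amino acid string and returns a new string, so one of them could be applied to each sequence as it is drawn during training. Passing a seeded `random.Random` instance as `rng` makes the augmentations reproducible.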
