Debarun Bhattacharjya, Karthikeyan Shanmugam, et al.
NeurIPS 2020
Life science practitioners are drowning in unlabeled protein sequences. The Natural Language Processing (NLP) community has recently embraced self-supervised learning as a powerful approach to learning representations from unlabeled text, in large part thanks to attention-based, context-aware Transformer models. In a transfer-learning fashion, expensive-to-pretrain universal embeddings can then be rapidly fine-tuned for multiple downstream prediction tasks.
In this work we present a modification of the RoBERTa model in which a mixture of binding and non-binding protein sequences (from the STRING database) is fed to the model during pre-training with the Masked Language Modeling (MLM) objective. Next, we compress protein sequences by 64% with a Byte Pair Encoding (BPE) vocabulary of 10K tokens, each 3-4 amino acids long. Finally, to extend the model's input space to even larger proteins and multi-protein assemblies, we pre-train Longformer models that support inputs of up to 2,048 tokens. Our approach produces excellent fine-tuning results for protein-protein binding prediction, TCR-epitope binding prediction, cellular localization, and remote homology classification tasks. We suggest that the Transformer's attention mechanism contributes to protein binding site discovery. Further work on token-level classification for secondary structure prediction is needed.
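As a rough illustration of the pre-training recipe described above, the sketch below learns a ~10K-token BPE vocabulary over raw amino-acid text and pre-trains a RoBERTa model from scratch with the MLM objective using the HuggingFace tokenizers and transformers libraries. The corpus file name, hyperparameters, and library choice are illustrative assumptions, not the authors' released pipeline; a Longformer variant for 2,048-token inputs would follow the same pattern with LongformerConfig / LongformerForMaskedLM.

```python
# Minimal sketch, assuming a plain-text corpus of STRING-derived sequence pairs.
# File names and hyperparameters below are hypothetical, chosen for illustration.
import os
from tokenizers import ByteLevelBPETokenizer
from transformers import (RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

CORPUS = "string_pairs.txt"  # hypothetical: one binding or non-binding sequence pair per line

# 1. Learn a ~10K-token BPE vocabulary; merged tokens end up spanning a few amino acids.
os.makedirs("protein_bpe", exist_ok=True)
bpe = ByteLevelBPETokenizer()
bpe.train(files=[CORPUS], vocab_size=10_000,
          special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
bpe.save_model("protein_bpe")

tokenizer = RobertaTokenizerFast(vocab_file="protein_bpe/vocab.json",
                                 merges_file="protein_bpe/merges.txt")

# 2. Pre-train RoBERTa from scratch with the standard MLM objective (15% masking).
config = RobertaConfig(vocab_size=10_000, max_position_embeddings=514)
model = RobertaForMaskedLM(config)

dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path=CORPUS, block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="protein_roberta",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```

The resulting checkpoint can then be fine-tuned on the downstream sequence-level tasks mentioned above (e.g., with a classification head over the pooled representation).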
Wang Zhou, Levente Klein
NeurIPS 2020
Etienne Eben Vos, Ashley Daniel Gritzman, et al.
NeurIPS 2020
Georgios Damaskinos, Celestine Mendler-Dünner, et al.
NeurIPS 2020