Active site sequence representation of human kinases outperforms full sequence for affinity prediction
Abstract
Focusing on the human kinome, we challenge a standard practice in proteochemometric, sequence-based affinity prediction models: instead of leveraging the full primary structure of proteins, we represent each target only by a sequence of 29 residues defining the ATP binding site. In kinase-ligand binding prediction, our results show that the reduced active site sequence representation is not only computationally more efficient but also consistently yields significantly higher performance than the full primary structure. This trend persists across different models (a k-NN baseline and a multimodal deep neural network), datasets (BindingDB, IDG-DREAM), and performance metrics (RMSE, Pearson correlation), and it holds when predicting affinity for both unseen ligands and unseen kinases. For example, the RMSE on pIC50 can be reduced by 5% and 9% for unseen kinases and unseen kinase inhibitors, respectively. The trend is robust across kinase families and classes of inhibitors, with a few exceptions where the necessity of the full sequence is explained by the drug’s mechanism of action. Our interpretability analysis further demonstrates that, even without supervision, the full sequence model can learn to focus on the active site residues to a greater extent. Overall, this work challenges the assumption that the full primary structure is indispensable for virtual screening of human kinases.