Predicting protein-ligand binding free energy with recent machine learning and deep learning approaches

Boris TOUZEAU¹*, Jung-Hsin Lin²,³,⁴,⁵

¹TIGP-CBMB, Academia Sinica, Taipei, 11529, Taiwan
²Research Center for Applied Sciences, Academia Sinica, Taipei, 11529, Taiwan
³Institute for Biomedical Sciences, Academia Sinica, Taipei, 11529, Taiwan
⁴School of Pharmacy, College of Medicine, National Taiwan University, Taipei, 11529, Taiwan
⁵College of Engineering, Chang Gung University, Taoyuan, 33302, Taiwan

* Presenter: Boris TOUZEAU, email: boris.touzeau@gmail.com

In this report, we compare the performance of several recent machine learning and deep learning approaches for predicting the binding free energy of protein-ligand complexes. Among the state-of-the-art techniques for predicting protein-ligand binding free energies, ΔVinaRF20 [J. Comput. Chem. 38, 169-177 (2017)], developed by Yingkai Zhang et al. at New York University, has attracted considerable attention for its remarkable performance. This scoring function has two components. The first is the widely recognized AutoDock Vina scoring function, which is built from physico-chemical terms, also known as molecular descriptors, such as hydrophobic interactions and hydrogen bonding. The second is a set of additional parameters, for a total of 20 molecular descriptors: 10 based on the AutoDock Vina source code and 10 based on the buried solvent-accessible surface area (such as atom type and number of heavy atoms). A random forest was used to help select the most critical molecular descriptors, and ΔVinaRF20 ranked first in a benchmark for pose prediction, free energy prediction and screening, outperforming most of its competitors (Glide, X-Score, etc.). Although ΔVinaRF20 performs well, it depends on feature engineering, which can be time-consuming. In addition, its training set is limited in size by technical constraints and covers only a subset of chemical space. One way to address both issues is to use deep learning (DL) methods, which can accommodate a training set more than 100 times larger and let the algorithm identify the critical features of the complex independently of the user.
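One way to read the ΔVinaRF20 design is as a baseline score plus a learned correction fitted to the residual between experiment and baseline. The sketch below illustrates only that idea, under explicit assumptions: the descriptors, baseline scores and affinities are synthetic toy values, and a small ensemble of bootstrapped depth-1 regression trees stands in for a real random forest. The names `fit_stump`, `fit_delta_model` and `predict_delta` are hypothetical and are not part of ΔVinaRF20 or AutoDock Vina.

```python
import random

def fit_stump(X, r):
    """Fit a depth-1 regression tree (one feature, one threshold) to targets r."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r[i] for i, row in enumerate(X) if row[j] <= t]
            right = [r[i] for i, row in enumerate(X) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:
        # No valid split (all rows identical): predict the mean everywhere.
        m = sum(r) / len(r)
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, x):
    j, t, ml, mr = stump
    return ml if x[j] <= t else mr

def fit_delta_model(X, base, y, n_trees=25, seed=0):
    """Learn a correction to the baseline score from bootstrapped stumps."""
    rng = random.Random(seed)
    resid = [yi - bi for yi, bi in zip(y, base)]  # what the baseline misses
    n = len(X)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        stumps.append(fit_stump([X[i] for i in idx], [resid[i] for i in idx]))
    return stumps

def predict_delta(stumps, x, base_score):
    """Final prediction = baseline score + averaged learned correction."""
    correction = sum(predict_stump(s, x) for s in stumps) / len(stumps)
    return base_score + correction

# Synthetic toy data (assumption): two descriptors per complex, a baseline
# score, and an "experimental" value that the baseline misses by +/- 1.
X = [[i / 19, (i * 7 % 20) / 19] for i in range(20)]
base = [5.0 + 0.1 * i for i in range(20)]
y = [b + (1.0 if x[0] > 0.5 else -1.0) for x, b in zip(X, base)]

model = fit_delta_model(X, base, y)
pred = [predict_delta(model, x, b) for x, b in zip(X, base)]
mae_base = sum(abs(yi - bi) for yi, bi in zip(y, base)) / len(y)
mae_delta = sum(abs(yi - pi) for yi, pi in zip(y, pred)) / len(y)
print(f"baseline MAE: {mae_base:.3f}, corrected MAE: {mae_delta:.3f}")
```

The errors here are measured on the training data, so the comparison only shows the mechanics of the correction, not predictive power on unseen complexes.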
A convolutional neural network (CNN) is a powerful DL algorithm that can take a “picture” (or grid) as input. David Ryan Koes et al. [J. Chem. Inf. Model. 57, 942-957 (2017)] developed one such model, Gnina. Working on a grid allows the model to define and learn its own pharmacophores, which, if combined with the knowledge of a chemist or biochemist, could potentially lead to a significant boost in efficacy. Another program, Pafnucy, developed by Pawel Siedlecki et al. [Bioinformatics 34, 3666-3674 (2018)], was trained to predict affinity (the interaction energy of protein-ligand complexes) using experimental data from public databases. In a benchmark study, the predictive power of Pafnucy proved similar to that of ΔVinaRF20, even though it does not use features as complex as those of ΔVinaRF20. We will also report, based on our recent studies, that there is considerable room for improvement and that refining the network architecture can further increase predictive power.
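The “picture” such a network consumes can be illustrated by mapping atoms onto a 3D grid with one channel per atom type. The sketch below is a minimal illustration under stated assumptions: the box size, resolution, channel set and the function name `voxelize` are all hypothetical, the box is assumed to be centered at the origin, and this is not the featurization actually used by Gnina or Pafnucy.

```python
def voxelize(atoms, box_size=8.0, resolution=2.0, channels=("C", "N", "O")):
    """Count atoms per voxel and per element channel in a cubic box.

    atoms: iterable of (element, x, y, z), coordinates in angstroms relative
    to the box center (assumed here to be the origin).
    Returns a nested list indexed as grid[i][j][k][channel].
    """
    n = int(box_size / resolution)  # voxels per edge
    grid = [[[[0.0] * len(channels) for _ in range(n)]
             for _ in range(n)] for _ in range(n)]
    half = box_size / 2
    for element, x, y, z in atoms:
        if element not in channels:
            continue  # elements without a channel are ignored in this sketch
        # Shift so the box spans [0, box_size), then floor to a voxel index.
        idx = [int((c + half) // resolution) for c in (x, y, z)]
        if all(0 <= i < n for i in idx):  # drop atoms outside the box
            i, j, k = idx
            grid[i][j][k][channels.index(element)] += 1.0
    return grid

# Toy input (assumption): four atoms, one outside the box, one untracked.
atoms = [
    ("C", 0.5, 0.5, 0.5),
    ("O", -3.9, 0.0, 3.9),
    ("N", 10.0, 0.0, 0.0),  # outside the 8 angstrom box, skipped
    ("H", 0.0, 0.0, 0.0),   # no hydrogen channel, skipped
]
grid = voxelize(atoms)
total = sum(v for plane in grid for row in plane for cell in row for v in cell)
print(total)
```

A 3D CNN would then slide its filters over this four-dimensional array in the same way a 2D CNN slides over the pixels and color channels of an image.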

Keywords: Convolutional Neural Network, Machine Learning, Binding Free Energy