Research on machine learning-based activity prediction models for KRAS inhibitors
Abstract: The Kirsten rat sarcoma viral oncogene homolog (KRAS) gene is one of the most frequently mutated oncogenes, and KRAS inhibitors have potential therapeutic value for cancer patients who carry such mutations. In this study, machine learning was applied to build quantitative structure-activity relationship (QSAR) models for small-molecule KRAS inhibitors. A total of 1857 records, each pairing an IC50 value with a SMILES (simplified molecular input line entry system) string, were collected from three databases: ChEMBL, BindingDB, and PubChem. Nine classifiers were constructed by combining three feature-screening approaches with three machine learning models: random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost). The SVM model combined with mutual information (MI) feature selection performed best on the test set (AUC = 0.912, ACC = 0.859, F1 = 0.890) and also predicted well on the external validation set (AUC = 0.944, recall = 0.856, FPR = 0.111). This study provides a new technical route for screening natural product databases for KRAS inhibitors using artificial intelligence methods.
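The abstract reports AUC, recall, and FPR without defining them. As a minimal sketch, assuming the usual binary-classification definitions with 1 = active and 0 = inactive, they can be computed as shown below. The labels, scores, and the 0.5 decision cutoff are hypothetical toy values chosen for illustration, not data or settings from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = active."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def recall(tp, fn):
    """Fraction of true actives that the model recovers."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Fraction of true inactives that the model wrongly calls active."""
    return fp / (fp + tn)


def roc_auc(y_true, scores):
    """AUC as the probability that a random active outscores a random
    inactive; tied scores count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four actives followed by four inactives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
toy_recall = recall(tp, fn)
toy_fpr = false_positive_rate(fp, tn)
toy_auc = roc_auc(y_true, scores)
```

Recall and FPR depend on the chosen cutoff, whereas AUC depends only on how the scores rank actives against inactives, which is why the three numbers are usually reported together.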
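Table 1 Sample statistics for training and test sets

| Data type | Positive samples | Negative samples | Sum | Ratio (+/–) |
| --- | --- | --- | --- | --- |
| Training set | 990 | 495 | 1485 | 2.0 |
| Test set | 248 | 124 | 372 | 2.0 |
| Total | 1238 | 619 | 1857 | 2.0 |

Table 2 Training and test set results for 9 models

| Features | Model | CV5 AUC | CV5 ACC | CV5 F1 | Training AUC | Training ACC | Training F1 | Test AUC | Test ACC | Test F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| None | RF | 0.890 | 0.837 | 0.882 | 1.000 | 0.990 | 0.992 | 0.900 | 0.849 | 0.894 |
| None | SVM | 0.890 | 0.850 | 0.888 | 0.999 | 0.974 | 0.981 | 0.900 | 0.847 | 0.888 |
| None | XGBoost | 0.890 | 0.829 | 0.874 | 0.999 | 0.987 | 0.990 | 0.892 | 0.833 | 0.881 |
| MI | RF | 0.892 | 0.833 | 0.879 | 1.000 | 0.995 | 0.996 | 0.903 | 0.852 | 0.895 |
| MI | SVM | 0.891 | 0.835 | 0.877 | 0.990 | 0.950 | 0.962 | 0.912 | 0.859 | 0.890 |
| MI | XGBoost | 0.892 | 0.829 | 0.873 | 0.999 | 0.984 | 0.988 | 0.897 | 0.831 | 0.876 |
| PCA | RF | 0.885 | 0.832 | 0.879 | 1.000 | 0.997 | 0.997 | 0.897 | 0.831 | 0.882 |
| PCA | SVM | 0.886 | 0.839 | 0.880 | 0.993 | 0.953 | 0.965 | 0.908 | 0.852 | 0.893 |
| PCA | XGBoost | 0.883 | 0.833 | 0.877 | 1.000 | 0.995 | 0.996 | 0.895 | 0.823 | 0.872 |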
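The counts in Table 1 are consistent with a stratified split of roughly 80% training and 20% test data, although the split procedure is not stated here, so the 0.8 fraction below is an assumption. The sketch also computes the ACC and F1 that a trivial classifier labelling every test compound as positive would obtain under this 2:1 class ratio. That baseline is purely an illustrative reference point and is not reported in the source.

```python
# Class totals taken from Table 1.
TOTAL_POS, TOTAL_NEG = 1238, 619


def stratified_sizes(n_pos, n_neg, train_fraction=0.8):
    """Split each class separately so both subsets keep the class ratio.

    train_fraction=0.8 is an assumption inferred from Table 1.
    Returns ((train_pos, train_neg), (test_pos, test_neg)).
    """
    train_pos = round(n_pos * train_fraction)
    train_neg = round(n_neg * train_fraction)
    return (train_pos, train_neg), (n_pos - train_pos, n_neg - train_neg)


def all_positive_baseline(n_pos, n_neg):
    """ACC and F1 of a trivial classifier that labels everything positive."""
    tp, fp, fn = n_pos, n_neg, 0
    acc = tp / (n_pos + n_neg)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return acc, f1


train, test = stratified_sizes(TOTAL_POS, TOTAL_NEG)
base_acc, base_f1 = all_positive_baseline(*test)

print("train (pos, neg):", train)
print("test  (pos, neg):", test)
print(f"all-positive baseline: ACC={base_acc:.3f}, F1={base_f1:.3f}")
```

Under these assumptions the split reproduces the 990/495 and 248/124 counts of Table 1, and the all-positive baseline reaches an ACC of about 0.667 and an F1 of 0.800 on the test set, which gives some context for reading the Test columns of Table 2.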