
Research on machine learning-based activity prediction models for KRAS inhibitors

DU Ke, RONG Danqi, LU Rui, ZHANG Xiaoya, ZHAO Hongping

Citation: DU Ke, RONG Danqi, LU Rui, et al. Research on machine learning-based activity prediction models for KRAS inhibitors[J]. J China Pharm Univ, 2024, 55(3): 306 − 315. DOI: 10.11665/j.issn.1000-5048.2024031102

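    About the author:

    ZHAO Hongping holds a PhD in pharmaceutical informatics and is a professor, a former visiting scholar at the National University of Singapore, and a master's supervisor in Medical Big Data and Artificial Intelligence at China Pharmaceutical University. Research interests: discovering and screening targets and lead compounds with deep learning techniques such as GNNs, GANs, and diffusion models. In recent years, Zhao has led and completed the National Health Commission project "National Centralized Drug Procurement Data Comparison and Data Security System Construction for the National Drug Management Platform", taken part as a principal participant in four projects funded by the National Natural Science Foundation of China (General Program) and the Ministry of Education (Key Project of Science and Technology Research), published more than 20 SCI papers, developed two online platforms for medical big data and artificial intelligence, and been granted one patent. Zhao teaches the course "Python and Medical Big Data Processing", served as chief editor of the textbooks "Python Programming: Medical Data Processing as an Example" and "A Course in Pharmaceutical Information Retrieval" (the latter selected as a key textbook for higher education institutions in Jiangsu Province), and has received nine teaching awards at or above the provincial level as first-ranked recipient.

    Corresponding author:

    ZHAO Hongping: Tel: 025-86185163; E-mail: zhaohongping@cpu.edu.cn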

  • CLC number: TP181; R914

Funds: This study was supported by the Key Project of Philosophy and Social Science Research in Colleges and Universities in Jiangsu Province (No. 2023SJZD130)
  • Abstract:

    The Kirsten rat sarcoma viral oncogene homolog (KRAS) gene is one of the most commonly mutated oncogenes, and KRAS inhibitors have potential therapeutic value for cancer patients carrying this mutation. In this study, machine learning was applied to build quantitative structure-activity relationship (QSAR) models for KRAS small-molecule inhibitors. A total of 1857 records of IC50 values and SMILES (simplified molecular input line entry system) strings for KRAS inhibitors were collected from three databases: ChEMBL, BindingDB, and PubChem. Nine classifiers were constructed by combining three feature screening methods with three machine learning models: random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost). The SVM model combined with mutual information (MI) feature selection performed best on the test set (AUCtest = 0.912, ACCtest = 0.859, F1test = 0.890) and also predicted well on the external validation set (AUCExt = 0.944, RecallExt = 0.856, FPRExt = 0.111). This study provides a new technical route for screening KRAS inhibitors in natural product databases using artificial intelligence methods.
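The metrics quoted in the abstract (AUC, ACC, F1, Recall, FPR) can be illustrated with a minimal pure-Python sketch. It uses the standard textbook definitions; the function names and the toy labels and scores are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels where 1 means active."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, recall, false positive rate and F1 from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    return {"ACC": acc, "Recall": recall, "FPR": fpr, "F1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random active outscores a random inactive.

    Ties between an active and an inactive count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one active and one inactive sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical data): four actives, two inactives.
y_true = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Because AUC is computed from the ranking of scores, it does not depend on the 0.5 cut-off, whereas ACC, F1, Recall, and FPR do.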

  • Figure 1. Molecular structural formulae of sotorasib (AMG510) and adagrasib (MRTX849)

    Figure 2. Workflow of KRAS inhibitor activity prediction based on machine learning

    ECFP4: Extended connectivity fingerprints; MACCS: Molecular access system; RF: Random forest; SVM: Support vector machine; XGBoost: Extreme gradient boosting; MI: Mutual information

    Figure 3. Schematic diagram of the QSAR model

    SMILES: Simplified molecular input line entry system; QSAR: Quantitative structure-activity relationship

    Figure 4. Distribution of pIC50 values for KRAS inhibitors

    A: Scatter plot of pIC50 values for all compounds; B: Histogram of the pIC50 distribution

    Figure 5. Spatial distributions of the training set and test set

    A: Two-dimensional presentation; B: Three-dimensional presentation

    Figure 6. Molecular fingerprint similarity between the training and test sets

    Figure 7. Features selected for RF, SVM, and XGBoost by five-fold cross-validation

    A: Optimal number of features selected by MI; B: Optimal number of dimensions retained by PCA

    Figure 8. Probability distributions of different metrics for different models and the corresponding mean line charts

    Figure 9. Receiver operating characteristic (ROC) curves for the 9 models under five-fold cross-validation

    A: No feature screening; B: Feature screening based on MI; C: Dimensionality reduction of features by PCA

    Figure 10. AUC, ACC, and F1-score for the 9 models on the test set

    Figure 11. ROC curves for the external validation set
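Figure 4 reports activities as pIC50, the negative base-10 logarithm of IC50 expressed in mol/L. A minimal sketch of the conversion is given below, assuming IC50 is supplied in nanomolar. The activity cut-off of pIC50 >= 6 (IC50 <= 1 µM) is a hypothetical placeholder, since this page does not state the threshold the authors used to separate positive from negative samples.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm).
    return 9.0 - math.log10(ic50_nm)


def label_active(pic50: float, threshold: float = 6.0) -> int:
    """Return 1 (active) or 0 (inactive); the default threshold is an assumption."""
    return int(pic50 >= threshold)


# Hypothetical IC50 values in nM, for illustration only.
examples_nm = [10.0, 1000.0, 50000.0]
pic50_values = [pic50_from_nm(v) for v in examples_nm]
labels = [label_active(p) for p in pic50_values]
```

A lower IC50 gives a higher pIC50, so more potent compounds sit further to the right in a histogram such as Figure 4B.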

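    Table 1. Sample statistics for training and test sets

    | Data type    | Positive samples | Negative samples | Sum  | Ratio (+/–) |
    |--------------|------------------|------------------|------|-------------|
    | Training set | 990              | 495              | 1485 | 2.0         |
    | Test set     | 248              | 124              | 372  | 2.0         |
    | Total        | 1238             | 619              | 1857 | 2.0         |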
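The counts in Table 1 are consistent with an approximately 80/20 split that preserves the 2:1 ratio of positive to negative samples in both subsets. The sketch below shows one way such a stratified split could be produced. The helper name `stratified_split`, the 0.2 test fraction, and the random seed are assumptions for illustration; the page does not describe the authors' actual splitting procedure.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], test_fraction: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split sample indices into train and test sets, class by class.

    Each class contributes round(n_class * test_fraction) samples to the
    test set, so the class ratio is kept in both subsets.
    """
    rng = random.Random(seed)
    train_idx: List[int] = []
    test_idx: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes taken from the "Total" row of Table 1.
labels = [1] * 1238 + [0] * 619
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)

train_pos = sum(labels[i] for i in train_idx)
test_pos = sum(labels[i] for i in test_idx)
```

With these class sizes the per-class rounding yields 248 positive and 124 negative test samples, which matches the test-set row of Table 1.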

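    Table 2. Training and test set results for 9 models

    | Features | Model   | CV5 AUC | CV5 ACC | CV5 F1 | Training AUC | Training ACC | Training F1 | Test AUC | Test ACC | Test F1 |
    |----------|---------|---------|---------|--------|--------------|--------------|-------------|----------|----------|---------|
    | None     | RF      | 0.890   | 0.837   | 0.882  | 1.000        | 0.990        | 0.992       | 0.900    | 0.849    | 0.894   |
    | None     | SVM     | 0.890   | 0.850   | 0.888  | 0.999        | 0.974        | 0.981       | 0.900    | 0.847    | 0.888   |
    | None     | XGBoost | 0.890   | 0.829   | 0.874  | 0.999        | 0.987        | 0.990       | 0.892    | 0.833    | 0.881   |
    | MI       | RF      | 0.892   | 0.833   | 0.879  | 1.000        | 0.995        | 0.996       | 0.903    | 0.852    | 0.895   |
    | MI       | SVM     | 0.891   | 0.835   | 0.877  | 0.990        | 0.950        | 0.962       | 0.912    | 0.859    | 0.890   |
    | MI       | XGBoost | 0.892   | 0.829   | 0.873  | 0.999        | 0.984        | 0.988       | 0.897    | 0.831    | 0.876   |
    | PCA      | RF      | 0.885   | 0.832   | 0.879  | 1.000        | 0.997        | 0.997       | 0.897    | 0.831    | 0.882   |
    | PCA      | SVM     | 0.886   | 0.839   | 0.880  | 0.993        | 0.953        | 0.965       | 0.908    | 0.852    | 0.893   |
    | PCA      | XGBoost | 0.883   | 0.833   | 0.877  | 1.000        | 0.995        | 0.996       | 0.895    | 0.823    | 0.872   |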
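The MI rows of Table 2 rely on ranking features by their mutual information with the activity label. As a rough illustration of that idea, the sketch below scores binary fingerprint bits against binary labels with the discrete mutual information formula and keeps the top k. The function names, the toy bit columns, and the value of k are assumptions; the authors' implementation and the number of features they retained are not given on this page.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def mutual_information(x: Sequence[int], y: Sequence[int]) -> float:
    """Discrete mutual information I(X;Y) in nats from paired observations."""
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


def select_top_k(columns: Sequence[Sequence[int]], y: Sequence[int], k: int) -> Tuple[List[int], List[float]]:
    """Return indices of the k highest-scoring columns and all MI scores."""
    scores = [mutual_information(col, y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy data (hypothetical): three fingerprint bits observed on four compounds.
y = [1, 1, 0, 0]
bit_columns = [
    [1, 1, 0, 0],  # identical to the label: maximally informative
    [1, 0, 1, 0],  # independent of the label: zero information
    [1, 1, 1, 0],  # partially informative
]
selected, scores = select_top_k(bit_columns, y, k=2)
```

Unlike PCA, this kind of filter keeps original fingerprint bits rather than building new axes, so the selected features remain interpretable as substructure indicators.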

Publication history
  • Received: 2024-03-10
  • Published online: 2024-06-24
  • Issue date: 2024-06-24
