Abstract
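Screening potentially active compounds from molecular libraries is a common approach in drug discovery. However, as chemical space continues to be explored, compound libraries now contain billions of molecules, and molecular docking alone is no longer sufficient to rapidly screen such ultra-large libraries for inhibitors of a specific target. This study proposes a screening method that combines physicochemical property similarity calculations, machine learning prediction models, and molecular docking to filter a candidate library of 5.5 billion molecules, ultimately yielding 51 compounds with potential inhibitory activity against ataxia telangiectasia-mutated and Rad3-related (ATR) kinase. The method provides an effective route for rapidly screening novel, potentially active molecules from ultra-large libraries.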
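Ataxia telangiectasia-mutated and Rad3-related (ATR) kinase belongs to the phosphatidylinositol 3-kinase-like kinase (PIKK) family and is an important member of the serine/threonine protein kinase family. ATR is a key protein in the DNA damage response (DDR): it acts mainly as a sensor of replication stress (RS) and helps mediate DNA replication and mitosis. Because cancer cells depend heavily on the S and G2/M checkpoints of the cell cycle, ATR-targeting inhibitors have become a focus of antitumor drug development.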

Figure 1 Ataxia telangiectasia-mutated and Rad3-related (ATR) inhibitors entering the clinic
This study proposes an artificial intelligence-based method for screening ATR-active molecules from ultra-large libraries.

Figure 2 Workflow of ATR drug screening based on machine learning and molecular docking
Compounds with known IC50 activity were collected from the BindingDB database (http://www.bindingdb.org/bind/index.jsp), the ChEMBL database (https://www.ebi.ac.uk/chembl), and Clarivate's Cortellis Drug Discovery Intelligence database (https://www.cortellis.com/drugdiscovery) to form the ATR activity dataset.
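Enamine's REAL (REadily AccessibLe) database is one of the most widely used virtual compound libraries and a typical enumerated-structure database. The currently released REAL database contains more than 5.5 billion compounds that comply with Lipinski's rule of five.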
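The Protein Data Bank (PDB, http://www1.rcsb.org/) contains four structures relevant to human ATR. One is the structure of the human ATR-ATRIP complex obtained by cryo-electron microscopy (cryo-EM), PDB ID 5YZ0, at a resolution of 4.7 Å. The other three are PI3Kα mutants, homologous to ATR, that were designed to mimic it, with PDB IDs 5UL1, 5UKJ, and 5UK8. Compared with these three crystal structures, 5YZ0 has lower resolution, contains a much longer protein sequence, and has no bound ligand from which the binding site could be reliably located, so this study excluded 5YZ0.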
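The heavy atom count is the number of heavy atoms in a compound, that is, non-hydrogen atoms such as C and N. For a given target protein, the number of heavy atoms that fit in the binding pocket usually falls within a certain range, so the compound library can be narrowed by combining the pocket shape with the heavy atom distribution of known active small molecules.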
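AlogP is the calculated lipid-water partition coefficient, which reflects a compound's water (lipid) solubility. The relative molecular mass is the sum of the relative atomic masses of all atoms in the compound.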
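The property-based narrowing described here can be illustrated with a minimal sketch. It assumes the descriptors (heavy atom count, AlogP, relative molecular mass) have already been computed by a cheminformatics toolkit, and it assumes the acceptance window is simply the range spanned by the known actives; the study's actual cutoffs are not given in this section. The `Descriptors` class and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties for one compound (hypothetical container)."""
    heavy_atoms: int
    alogp: float
    mol_weight: float


PROPERTIES = ("heavy_atoms", "alogp", "mol_weight")


def property_window(actives):
    """Return the (min, max) range of each property over the known actives."""
    return {
        name: (min(getattr(d, name) for d in actives),
               max(getattr(d, name) for d in actives))
        for name in PROPERTIES
    }


def in_window(compound, window):
    """True if every property of the compound lies inside the window."""
    return all(low <= getattr(compound, name) <= high
               for name, (low, high) in window.items())


# Illustrative values only, not data from the study.
actives = [Descriptors(22, 1.8, 310.4), Descriptors(31, 3.6, 442.5)]
window = property_window(actives)
library = [Descriptors(25, 2.4, 355.0), Descriptors(9, 0.2, 130.1)]
kept = [c for c in library if in_window(c, window)]
```

A percentile-based window would be less sensitive to outliers among the actives than the plain min/max used in this sketch.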
This study used five different machine learning algorithms to build regression models for predicting ATR IC50, including gradient boosted decision trees (GBDT).
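For the five machine learning regression models introduced in Section 2.2, this study assessed performance using metrics that include the coefficient of determination.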
To evaluate the models more thoroughly, this study followed Tropsha et al. and applied the following criteria:
1) $q^2 > 0.5$
2) $R^2 > 0.6$
3) $(R^2 - R_0^2)/R^2 < 0.1$ or $(R^2 - R_0'^2)/R^2 < 0.1$
4) $0.85 \le k \le 1.15$ or $0.85 \le k' \le 1.15$
5) or
6)
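Here $q^2$ is the cross-validated coefficient of determination; $R^2$ is the coefficient of determination; and $R_0^2$ and $R_0'^2$ are the coefficients of determination of the least-squares regression lines without an intercept for predicted versus actual values and for actual versus predicted values, respectively.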
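As a rough illustration of the slope and through-origin checks, the sketch below computes the slopes and coefficients of determination for regression lines forced through the origin and applies the thresholds 0.6, 0.1, and 0.85 to 1.15. The symbols `k`, `k_prime`, `r0_sq`, and `r0_prime_sq` and the exact formulas are assumptions: published formulations differ in which variable is regressed on which, and the study's own formulas are not reproduced in this section. The cross-validated statistic is omitted because it requires refitting the model.

```python
def origin_regression_stats(observed, predicted):
    """Statistics for regression lines forced through the origin (one convention)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    pairs = list(zip(observed, predicted))
    sum_xy = sum(o * p for o, p in pairs)
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in pairs)

    k = sum_xy / sum(p * p for p in predicted)       # observed ~ k * predicted
    k_prime = sum_xy / sum(o * o for o in observed)  # predicted ~ k' * observed
    return {
        "r2": cov ** 2 / (ss_obs * ss_pred),         # squared Pearson correlation
        "k": k,
        "k_prime": k_prime,
        "r0_sq": 1 - sum((o - k * p) ** 2 for o, p in pairs) / ss_obs,
        "r0_prime_sq": 1 - sum((p - k_prime * o) ** 2 for o, p in pairs) / ss_pred,
    }


def passes_checks(stats):
    """Apply the R^2, slope, and through-origin thresholds."""
    r2 = stats["r2"]
    slope_ok = 0.85 <= stats["k"] <= 1.15 or 0.85 <= stats["k_prime"] <= 1.15
    origin_ok = ((r2 - stats["r0_sq"]) / r2 < 0.1
                 or (r2 - stats["r0_prime_sq"]) / r2 < 0.1)
    return r2 > 0.6 and slope_ok and origin_ok
```

A perfectly anti-correlated prediction still has a squared correlation of 1, and it is the through-origin term that rejects it, which is why these checks are applied together and not one at a time.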
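The fold error (FE) and average fold error (AFE) can also be used to assess prediction accuracy. For FE, a value below 2 indicates that the model's prediction is reliable.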
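A minimal sketch of these two metrics follows. It assumes the commonly used definitions, in which FE is the ratio of the larger to the smaller of the predicted and observed values and AFE is the geometric mean of the predicted-to-observed ratios; the study's own formulas are not reproduced in this section, so these definitions are an assumption. All values must be positive.

```python
import math


def fold_error(observed, predicted):
    """Ratio of the larger to the smaller value, so the result is always >= 1."""
    return max(observed, predicted) / min(observed, predicted)


def average_fold_error(observed, predicted):
    """Geometric mean of predicted/observed; 1.0 means no systematic bias."""
    logs = [math.log10(p / o) for o, p in zip(observed, predicted)]
    return 10 ** (sum(logs) / len(logs))


def fraction_within_twofold(observed, predicted):
    """Share of predictions whose fold error is below 2."""
    hits = sum(1 for o, p in zip(observed, predicted) if fold_error(o, p) < 2)
    return hits / len(observed)
```

Because AFE averages signed log ratios, over- and under-predictions can cancel, so it is usually read together with the share of predictions within twofold.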
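Building on the library reduction achieved by the machine learning model, this study used the high-throughput virtual screening (HTVS) module of the Schrödinger software to identify molecules with low docking scores, that is, potentially active molecules. After docking, poses ranked in the top 0.1% by docking score were retained and then filtered again by their interactions with key amino acids. Before virtual screening and key-residue filtering, the optimal protein structure and the key amino acids had to be identified, and the protein and ligands had to be prepared.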
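To select the optimal protein structure, this study used cross-docking to choose among the remaining three crystal complexes. In cross-docking, each co-crystallized ligand is prepared and then docked with Glide into each protein receptor in turn, and the root-mean-square deviation (RMSD) is calculated to assess how well each receptor accommodates the ligands.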
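The diversity of a compound library is positively correlated with the diversity of its scaffolds. This study adopted an approach described in the literature.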
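The similarity between compound libraries is positively correlated with the similarity of their scaffolds. This study used the cosine similarity of scaffolds to measure the similarity between compound libraries.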
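One way to compute a scaffold cosine similarity is sketched below. It assumes each library has already been reduced to a list of scaffold strings (for example, canonical scaffold SMILES produced by a cheminformatics toolkit) and that each library is represented by its scaffold frequency vector; whether the study used exactly this representation is an assumption.

```python
import math
from collections import Counter


def scaffold_cosine_similarity(scaffolds_a, scaffolds_b):
    """Cosine similarity between the scaffold frequency vectors of two libraries."""
    counts_a, counts_b = Counter(scaffolds_a), Counter(scaffolds_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[s] * counts_b[s] for s in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty library shares nothing with any other library
    return dot / (norm_a * norm_b)
```

Under this representation the value is 0 when the two libraries share no scaffold, and it stays small when shared scaffolds are frequent in one library but rare in the other.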
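All collected structures and activity data were processed uniformly: records lacking units or whose SMILES could not be canonicalized by RDKit were removed first, and for molecules with multiple IC50 values the mean was taken. To narrow the prediction range and improve model accuracy, inhibitor activities were first converted to molar units and then transformed from IC50 to pIC50.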
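The unit normalization and pIC50 transformation can be sketched as follows, using pIC50 = -log10(IC50 in mol/L). The sketch assumes the SMILES strings have already been canonicalized, converts to molar units before averaging so that mixed units are handled consistently, and uses a hypothetical unit table and record layout.

```python
import math
from collections import defaultdict

# Hypothetical unit labels; real database exports may spell these differently.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def ic50_to_pic50(ic50_molar):
    """pIC50 is the negative base-10 logarithm of IC50 expressed in mol/L."""
    return -math.log10(ic50_molar)


def build_pic50_table(records):
    """records: iterable of (canonical_smiles, ic50_value, unit) tuples."""
    grouped = defaultdict(list)
    for smiles, value, unit in records:
        if unit not in UNIT_TO_MOLAR:
            continue  # drop records with a missing or unrecognized unit
        grouped[smiles].append(value * UNIT_TO_MOLAR[unit])
    # Average repeated IC50 measurements per molecule, then transform.
    return {s: ic50_to_pic50(sum(v) / len(v)) for s, v in grouped.items()}
```

For example, an IC50 of 1 nmol/L corresponds to a pIC50 of 9, and two measurements of 10 and 30 nmol/L for the same molecule average to 20 nmol/L, a pIC50 of about 7.70.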

Figure 3 Frequency distribution of heavy atom count in known active ATR inhibitors
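To train and test the five machine learning models, the 866 records were randomly split 4:1, giving 692 training records and 174 test records. To check that this random 4:1 split was reasonable, this study used PCA dimensionality reduction to compare the chemical space distributions of the training and test sets and ECFP4 molecular fingerprints to compute their similarity.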
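The split and the fingerprint comparison can be sketched with the standard library alone. The sketch assumes the test-set size is rounded up, which reproduces the 692/174 partition of 866 records, and it assumes each fingerprint is available as a collection of on-bit indices; generating ECFP4 fingerprints themselves requires a cheminformatics toolkit and is not shown. The seed is arbitrary.

```python
import math
import random


def random_split(items, test_fraction=0.2, seed=0):
    """Shuffle and split into (train, test); the test size is rounded up."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


train_ids, test_ids = random_split(range(866))
```

Computing `tanimoto` for every training and test pair gives a similarity matrix that can be drawn as a heat map for comparison.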

Figure 4 Comparison between training set and test set data
A: Comparison of chemical spatial distribution of principal component analysis (PCA); B: Comparison of similarity between training set and test set data, with darker colors (range: 0-1) representing higher similarity
First, the library was filtered by heavy atom count; compounds in the REAL database have between 6 and 38 heavy atoms.
The performance of the five regression models on the training and test sets is shown below.
Methods | $q^2$ | $R^2_{\mathrm{Train}}$ | MSETrain | $R^2_{\mathrm{Test}}$ | MSETest |
---|---|---|---|---|---|
XGBoost | 0.557 | 0.971 | 0.021 | 0.611 | 0.301 |
CatBoost | 0.625 | 0.966 | 0.025 | 0.622 | 0.293 |
GBDT | 0.586 | 0.918 | 0.059 | 0.621 | 0.294 |
MLP | 0.072 | 0.062 | 0.678 | 0.117 | 0.683 |
GNN | 0.601 | 0.968 | 0.023 | 0.614 | 0.299 |
a Inspection standards are

Figure 5 Comparison of experimental and predicted values for the five regression models
Methods | $k$ | $k'$ | $(R^2-R_0^2)/R^2$ | $(R^2-R_0'^2)/R^2$ | | | | AFE | <2-fold |
---|---|---|---|---|---|---|---|---|---|
XGBoost | 1.003 | 0.992 | 0.001 | 0.344 | 0.599 | 0.331 | 0.268 | 0.999 | 100% |
CatBoost | 1.002 | 0.992 | 0.000 | 0.332 | 0.614 | 0.339 | 0.275 | 1.001 | 100% |
GBDT | 1.010 | 0.989 | 0.000 | 0.326 | 0.613 | 0.342 | 0.271 | 0.998 | 100% |
MLP | 1.002 | 0.985 | 0.384 | 4.939 | 0.138 | 0.005 | 0.133 | 0.999 | 100% |
GNN | 0.991 | 1.003 | 0.011 | 0.198 | 0.574 | 0.405 | 0.169 | 1.010 | 100% |
a Inspection standards are $0.85 \le k \le 1.15$ or $0.85 \le k' \le 1.15$; $(R^2-R_0^2)/R^2 < 0.1$ or $(R^2-R_0'^2)/R^2 < 0.1$; or ;
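Starting from the approximately 12 million compounds that passed the physicochemical property filter, activity was predicted with the best-performing GNN model. The prediction results are shown below.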

Figure 6 pIC50 activity prediction results of the GNN model
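For the optimal protein, the cross-docking results for 5UKJ gave a mean RMSD of only 2.12 Å, lower than 3.45 Å for 5UK8 and 4.58 Å for 5UL1, indicating that 5UKJ reproduces binding poses better than the other two structures. In addition, the median RMSD for 5UKJ was 3.11 Å, also lower than 3.71 Å for 5UK8 and 4.21 Å for 5UL1, indicating that docking results for structurally diverse ligands are more stable and more reliable with 5UKJ. The crystal complex with PDB ID 5UKJ was therefore selected as the optimal protein structure.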
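The selection logic can be sketched as follows: compute the RMSD between each docked pose and its crystal pose, then summarize the RMSD values per receptor by mean and median. The sketch assumes matched atom ordering and a shared coordinate frame, and it ranks receptors by mean with the median as tie-breaker, which is an assumption. The receptor names and RMSD values in the usage example are placeholders, not results from the study.

```python
import math
from statistics import mean, median


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = [sum((a - b) ** 2 for a, b in zip(pa, pb))
               for pa, pb in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))


def pick_receptor(rmsd_by_receptor):
    """Return the receptor with the lowest (mean, median) RMSD and all summaries."""
    summary = {name: (mean(values), median(values))
               for name, values in rmsd_by_receptor.items()}
    best = min(summary, key=summary.get)
    return best, summary


# Placeholder values for illustration only.
best, summary = pick_receptor({
    "receptor_A": [0.5, 1.0, 1.5],
    "receptor_B": [2.0, 3.0, 4.0],
})
```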

Figure 7 Key amino acid analysis by protein-ligand interaction fingerprint (PLIF, A) and FTMAP (B)
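The approximately 1.2 million molecules that remained after machine learning filtering were put through ligand preparation, generating about 2.92 million conformers. These were then virtually screened against the optimal 5UKJ crystal structure, and the top 0.1% of conformers by docking score were retained, 2,916 in total. The results were further filtered by whether a conformer formed hydrogen bonds with the key amino acids Trp850, Val851, and Thr856, leaving 2,561 conformers. Finally, experience-based manual selection, which compared the overlay of each docked molecule with the original ligand and examined its interaction pattern in the protein pocket, yielded 51 compounds with potential ATR inhibitory activity. The six compounds with the best docking scores are listed below.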
ID | Structure | Predicted binding mode | Docking score/(kcal/mol) | Predicted values/(nmol/L) |
---|---|---|---|---|
Hit-1 | ![]() | ![]() | -10.790 | 80.619 |
Hit-2 | ![]() | ![]() | -10.762 | 19.704 |
Hit-3 | ![]() | ![]() | -10.588 | 12.779 |
Hit-4 | ![]() | ![]() | -10.493 | 50.586 |
Hit-5 | ![]() | ![]() | -10.322 | 22.122 |
Hit-6 | ![]() | ![]() | -10.205 | 9.491 |
a 1 cal = 4.184 J
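The two automated post-docking filters can be sketched as below. The sketch assumes docking output has been exported as (pose identifier, docking score) pairs with lower scores ranking better, and that the hydrogen-bond partners of each pose are available as residue labels. Treating a hydrogen bond to any one of the three key residues as sufficient is an assumption, since the text does not say whether one or all are required.

```python
KEY_RESIDUES = frozenset({"Trp850", "Val851", "Thr856"})


def top_fraction(poses, fraction=0.001):
    """Keep the best-scoring fraction of poses (lower docking score ranks first)."""
    n_keep = max(1, round(len(poses) * fraction))
    return sorted(poses, key=lambda pose: pose[1])[:n_keep]


def has_key_hbond(hbond_residues, key_residues=KEY_RESIDUES):
    """True if the pose hydrogen-bonds to at least one key residue."""
    return bool(set(hbond_residues) & key_residues)


# Synthetic poses for illustration only.
poses = [(f"pose_{i}", -float(i)) for i in range(2000)]
best_poses = top_fraction(poses)
```

Applied to roughly 2.92 million conformers, a 0.1% cutoff keeps on the order of 2,900 poses, which is consistent with the 2,916 reported above.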
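To validate the approach, a further analysis was carried out. The final 51 molecules were named Data1, the 556 molecules with known activity within 100 nmol/L were named Data2, and the two datasets were compared in terms of structural novelty and chemical space distribution.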
Database | Mraw | Nraw | N | N/Nraw | Ns | Ns/Nraw |
---|---|---|---|---|---|---|
Data1 | 51 | 51 | 51 | 1 | 51 | 1 |
Data2 | 556 | 556 | 225 | 0.40 | 151 | 0.27 |
Mraw: number of nonrepetitive molecules; Nraw: number of scaffolds; N: number of nonrepetitive scaffolds; Ns: number of scaffolds that occur only once. Data1: the 51 molecules obtained by screening; Data2: 556 molecules with known activity within 100 nmol/L
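The diversity ratios in the table can be computed from a list holding one scaffold string per molecule, as in the minimal sketch below. Scaffold extraction itself (for example, Bemis-Murcko frameworks from a cheminformatics toolkit) is assumed to have been done beforehand, and the function name is illustrative.

```python
from collections import Counter


def scaffold_diversity(scaffolds):
    """Return Nraw, N, Ns and their ratios for one scaffold per molecule."""
    counts = Counter(scaffolds)
    n_raw = len(scaffolds)                                   # all scaffolds
    n_unique = len(counts)                                   # distinct scaffolds
    n_single = sum(1 for c in counts.values() if c == 1)     # seen exactly once
    return {
        "Nraw": n_raw,
        "N": n_unique,
        "Ns": n_single,
        "N/Nraw": n_unique / n_raw,
        "Ns/Nraw": n_single / n_raw,
    }
```

With the Data2 counts from the table, 225/556 is about 0.40 and 151/556 is about 0.27, matching the reported ratios.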
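The scaffold cosine similarity between Data1 and Data2 is only 0.0058. This indicates that Data1 contains some of the scaffolds found in Data2, but that these shared scaffolds account for very different proportions of the two datasets, which means that Data1 has high scaffold novelty.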
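Taken together, the diversity and similarity analyses show that Data1, obtained by screening the ultra-large library, has a rich set of scaffolds with very low similarity to those of Data2, that is, Data1 is structurally novel.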

Figure 8 Comparison of chemical spatial distribution of principal component analysis (PCA)
In living cells, endogenously induced DNA damage forms at a high rate.
References
Bradbury A, Hall S, Curtin N, et al. Targeting ATR as Cancer Therapy: a new era for synthetic lethality and synergistic combinations?[J]. Pharmacol Ther, 2020, 207: 107450.
Zimmermann A, Dahmen H, Grombacher T, et al. Abstract 2588: M1774, a novel potent and selective ATR inhibitor, shows antitumor effects as monotherapy and in combination[J]. Cancer Res, 2022, 82(12_Suppl): 2588.
Yap Timothy A, Tolcher Anthony W, Ruth PE, et al. A first-in-human phase I study of ATR inhibitor M1774 in patients with solid tumors[J]. J Clin Oncol, 2021, 39(15_suppl): TPS3153.
Zenke FT, Zimmermann A, Dahmen H, et al. Antitumor activity of M4344, a potent and selective ATR inhibitor, in monotherapy and combination therapy[J]. Cancer Res, 2019, 79(13_Suppl): 369.
Fokas E, Prevo R, Pollard JR, et al. Targeting ATR in vivo using the novel inhibitor VE-822 results in selective sensitization of pancreatic tumors to radiation[J]. Cell Death Dis, 2012, 3(12): e441.
Knegtel R, Charrier JD, Durrant S, et al. Rational design of 5-(4-(isopropylsulfonyl) phenyl)-3-(3-(4-((methylamino) methyl) phenyl) isoxazol-5-yl) pyrazin-2-amine (VX-970, M6620): optimization of intra- and intermolecular polar interactions of a new ataxia telangiectasia mutated and Rad3-related (ATR) kinase inhibitor[J]. J Med Chem, 2019, 62(11): 5547-5561.
Foote KM, Nissink JWM, McGuire T, et al. Discovery and characterization of AZD6738, a potent inhibitor of ataxia telangiectasia mutated and Rad3 related (ATR) kinase with application as an anticancer agent[J]. J Med Chem, 2018, 61(22): 9889-9907.
Foote KM, Lau A. Drugging ATR: progress in the development of specific inhibitors for the treatment of cancer[J]. Future Med Chem, 2015, 7(7): 873-891.
Luecking U, Lefranc J, Wengner A, et al. Abstract 983: identification of potent, highly selective and orally available ATR inhibitor BAY 1895344 with favorable PK properties and promising efficacy in monotherapy and combination in preclinical tumor models[J]. Cancer Res, 2017, 77(13_Suppl): 983.
Wengner AM, Siemeister G, Lücking U, et al. The novel ATR inhibitor BAY 1895344 is efficacious as monotherapy and combined with DNA damage-inducing or repair-compromising therapies in preclinical cancer models[J]. Mol Cancer Ther, 2020, 19(1): 26-38.
Roulston A, Zimmermann M, Papp R, et al. RP-3500: a novel, potent, and selective ATR inhibitor that is effective in preclinical models as a monotherapy and in combination with PARP inhibitors[J]. Mol Cancer Ther, 2022, 21(2): 245-256.
Lipinski CA, Lombardo F, Dominy BW, et al. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings[J]. Adv Drug Deliv Rev, 2001, 46(1/2/3): 3-26.
Veber DF, Johnson SR, Cheng HY, et al. Molecular properties that influence the oral bioavailability of drug candidates[J]. J Med Chem, 2002, 45(12): 2615-2623.
Taylor RD, MacCoss M, Lawson AD. Rings in drugs: miniperspective[J]. J Med Chem, 2014, 57(14): 5845-5859.
Friedman JH. Greedy function approximation: a gradient boosting machine[J]. Ann Statist, 2001, 29(5): 1189-1232.
Chen TQ, Guestrin C. XGBoost: a scalable tree boosting system[C]//Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2016: 785-794.
Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support[J]. arXiv, 2018: 1810.11363.
Tropsha A, Gramatica P, Gombar V. The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models[J]. QSAR Comb Sci, 2003, 22(1): 69-77.
Kar S, Roy K. First report on development of quantitative interspecies structure-carcinogenicity relationship models and exploring discriminatory features for rodent carcinogenicity of diverse organic chemicals using OECD guidelines[J]. Chemosphere, 2012, 87(4): 339-355.
Ojha P, Mitra I, Das R, et al. Further exploring $r_m^2$ metrics for validation of QSPR models[J]. Chemom Intell Lab Syst, 2011, 107: 194-205.
Roy PP, Leonard JT, Roy K. Exploring the impact of size of training sets for the development of predictive QSAR models[J]. Chemom Intell Lab Syst, 2008, 90(1): 31-42.
Pratim Roy P, Paul S, Mitra I, et al. On two novel parameters for validation of predictive QSAR models[J]. Molecules, 2009, 14(5): 1660-1701.
Mitra I, Roy PP, Kar S, et al. On further application of $r_m^2$ as a metric for validation of QSAR models[J]. J Chemom, 2010, 24(1): 22-33.
Brian Houston J, Carlile DJ. Prediction of hepatic clearance from microsomes, hepatocytes, and liver slices[J]. Drug Metab Rev, 1997, 29(4): 891-922.
Tang HD, Hussain A, Leal M, et al. Interspecies prediction of human drug clearance based on scaling data from one or two animal species[J]. Drug Metab Dispos, 2007, 35(10): 1886-1893.
Friesner RA, Banks JL, Murphy RB, et al. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy[J]. J Med Chem, 2004, 47(7): 1739-1749.
Friesner RA, Murphy RB, Repasky MP, et al. Extra precision glide: docking and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes[J]. J Med Chem, 2006, 49(21): 6177-6196.
Vilar S, Cozza G, Moro S. Medicinal chemistry and the molecular operating environment (MOE): application of QSAR and molecular docking to drug discovery[J]. Curr Top Med Chem, 2008, 8(18): 1555-1572.
Vass M, Kooistra AJ, Ritschel T, et al. Molecular interaction fingerprint approaches for GPCR drug discovery[J]. Curr Opin Pharmacol, 2016, 30: 59-68.
Kozakov D, Grove LE, Hall DR, et al. The FTMap family of web servers for determining and characterizing ligand-binding hot spots of proteins[J]. Nat Protoc, 2015, 10(5): 733-755.
Bemis GW, Murcko MA. The properties of known drugs. 1. molecular frameworks[J]. J Med Chem, 1996, 39(15): 2887-2893.
Polykovskiy D, Zhebrak A, Sanchez-Lengeling B, et al. Molecular sets (MOSES): a benchmarking platform for molecular generation models[J]. Front Pharmacol, 2020, 11: 565644.
Fearn T. Probabilistic principal component analysis[J]. NIR News, 2014, 25(3): 23.
Lu YP, Knapp M, Crawford K, et al. Rationally designed PI3Kα mutants to mimic ATR and their use to understand binding specificity of ATR inhibitors[J]. J Mol Biol, 2017, 429(11): 1684-1704.