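Abstract
Predicting the plasma protein binding of a drug helps characterize its pharmacokinetics and provides a useful reference in early-stage drug discovery. This study collected plasma protein binding data for 2,452 clinical drugs and computed molecular descriptors with two software packages, Molecular Operating Environment (MOE) and Mordred; the computed descriptors served as the input features of the models. Machine learning models were built with the extreme gradient boosting (XGBoost) and random forest (RF) algorithms. Models that took Mordred descriptors as input outperformed those that took MOE descriptors. Models built with XGBoost and with RF performed similarly; the best model's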
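Keywords
Drug discovery is a lengthy process with four main stages: target selection and validation, compound screening and optimization, preclinical studies, and clinical trials
In recent years, artificial intelligence algorithms have been applied more and more widely to computing human pharmacokinetic properties, helping researchers select suitable compounds during new drug discovery and optimization
From the literature [
Artificial intelligence is an advanced technology that uses high-speed computer processing and distributed computing to analyze and solve problems quickly; it enables computers to simulate the information processing and learning processes of the human brain
From the literature [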

Figure 1 Distribution histograms of the fraction unbound in plasma (fu) (A) and of the logarithm of fu (log2 fu) (B)
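Molecular descriptors were computed with the commercial software MOE (version 2014) and the open-source Python library Mordred (version 1.2.0
Machine learning algorithms need independent training and test sets for model training and performance evaluation. The data were randomly divided into a training set and a test set, with the test set accounting for 10% of the data. In addition, 10-fold cross-validation was used while training the prediction models. This method divides the data evenly into 10 parts and uses each part in turn as the test set, with the remaining parts as the training set. The process is repeated 10 times to give 10 trained models, and the final result is the mean of the predictions of the 10 models. A feature of 10-fold cross-validation is that every sample in the data is used once for testing, which helps the model generalize better. The 10-fold cross-validation procedure is shown in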

Figure 2 Workflow of 10-fold cross-validation. D: the entire dataset; D1-D10: the ten sub-datasets into which it is divided
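The hold-out split and the 10-fold procedure described above can be sketched with the Python standard library alone. This is an illustration under assumptions, not the authors' code: the function names, the fixed random seed, and the choice to run the 10 folds on the 90% that remains after the hold-out split are all hypothetical, since the text does not state these details.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out a fraction of them as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an assumption, for repeatability
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists; each fold is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]  # k near-equal parts (D1..Dk)
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 2,452 is the dataset size reported in the text.
train_idx, test_idx = train_test_split_indices(2452)
splits = list(k_fold_indices(train_idx, k=10))
```

Each element of `splits` would be used to fit one model on `train` and predict on `validation`; averaging the 10 models' predictions then gives the final result described in the text.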
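Model hyperparameters are generally tuned by grid search. Grid search is an exhaustive search over specified parameter values: every parameter combination is evaluated by cross-validation to obtain the best learning algorithm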
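A minimal sketch of exhaustive grid search follows. The grid values and the `toy_score` function are hypothetical stand-ins; in practice the score of each combination would be the mean cross-validated score of a model trained with those parameters.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one.

    param_grid maps a parameter name to a list of candidate values;
    score_fn takes a dict of parameters and returns a score (higher is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid: these candidate values are not taken from the study.
example_grid = {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.3]}


def toy_score(params):
    """Stand-in for a cross-validated score; peaks at max_depth=5, learning_rate=0.1."""
    return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.1)


best_params, best_score = grid_search(example_grid, toy_score)
```

The cost grows as the product of the grid sizes (9 evaluations here), which is why the candidate lists are usually kept short.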
This study used the coefficient of determination (R²) and the mean squared error (MSE) to evaluate the predictive accuracy of the models.
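For reference, both metrics can be computed as below. The sketch assumes the common definition R² = 1 − SS_res/SS_tot; the text does not state which definition of R² was used, and the sample values are invented for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
mse = mean_squared_error(observed, predicted)
r2 = r2_score(observed, predicted)
```

A perfect prediction gives MSE = 0 and R² = 1; a model that always predicts the mean of the observed values gives R² = 0.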
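To improve model performance, the feature input space formed by the molecular descriptors was reconstructed. First, the importance of the input features produced by each of the two descriptor sets was computed with the XGBoost and RF algorithms. Using the F1 score as the criterion, the 200 most important features were selected. Then, with the number of features varied from 2 to 200 in steps of 2, a prediction model was trained for each feature count.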
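The feature-count sweep described above can be sketched as follows. The importance values, feature names, and `toy_evaluate` function are hypothetical; in the study, the importances would come from the fitted XGBoost or RF model and the evaluation would be the cross-validated performance of a model trained on the selected features.

```python
def top_k_features(importances, k):
    """Return the names of the k most important features, most important first."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def sweep_feature_counts(importances, evaluate, max_features=200, step=2):
    """Score models built from the top 2, 4, ..., max_features features.

    evaluate takes a list of feature names and returns a score (higher is better).
    Returns the best feature count and the score for every count tried.
    """
    shortlist = top_k_features(importances, max_features)
    results = {}
    for k in range(step, len(shortlist) + 1, step):
        results[k] = evaluate(shortlist[:k])
    best_k = max(results, key=results.get)
    return best_k, results


# Hypothetical importances for 300 descriptors named f0..f299.
toy_importances = {f"f{i}": 1.0 / (i + 1) for i in range(300)}


def toy_evaluate(features):
    """Stand-in for a cross-validated score; peaks when 100 features are used."""
    return -abs(len(features) - 100)


best_k, results = sweep_feature_counts(toy_importances, toy_evaluate)
```

With the defaults, the sweep trains 100 models (feature counts 2, 4, ..., 200), matching the range and step given in the text.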

Figure 3 Performance comparison of models with different numbers of features. A: MOE was used to extract features and RF was used to build the model; B: MOE was used to extract features and XGBoost was used to build the model; C: Mordred was used to extract features and RF was used to build the model; D: Mordred was used to extract features and XGBoost was used to build the model
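The study then examined the best performance of models built with the different methods when the number of features was between 75 and 160; the results are shown in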

Figure 4 Best prediction results obtained by building models with different methods. A: MOE was used to extract features and RF was used to build the model; B: MOE was used to extract features and XGBoost was used to build the model; C: Mordred was used to extract features and RF was used to build the model; D: Mordred was used to extract features and XGBoost was used to build the model
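For model construction, the main hyperparameters tuned for the XGBoost model were the learning rate (learning_rate), the maximum tree depth (max_depth), the minimum sum of instance weights required in a child node (min_child_weight), the minimum loss reduction required to make a split (gamma), and the row subsampling ratio (subsample). For the RF model, they were the number of trees (n_estimators), the maximum number of features considered per split (max_features), the maximum tree depth (max_depth), and the minimum number of samples required to split a node (min_samples_split). The parameter settings of the XGBoost and RF models in the best results are shown in
Analyzing the main features in the drug dataset helps clarify the relationship between drug properties and fu. This study ranked feature importance with both the XGBoost and RF algorithms. For the features constructed with MOE and with Mordred, the top five features of each were analyzed further; the results are shown in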
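Using 2,452 drug records as the dataset, with molecular descriptors computed by MOE and Mordred, this study built regression models for predicting drug fu with the XGBoost and RF algorithms. The results show that models built with the open-source descriptor calculator performed comparably to those built with the commercial software. The best regression model in this study also outperformed those of previous studies, including studies that used commercial software.
Across most models, the features with the greatest influence on fu prediction were fairly consistent; common ones relate to lipophilicity and conjugated double bonds (e.g., SlogP and AATS1v). In general, highly lipophilic drugs bind more readily to plasma proteins, whereas electronegativity or polarizability affects the acid-base
Dataset quality plays an important role in model construction, and drug information obtained from clinical studies is of greater practical value for building models. All drug information compiled in this study came from actual clinical studies. Although the dataset is already larger than those used in most comparable studies, the method used here is expected to yield regression models with higher predictive performance as more clinical data become available.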
References
Ding BX, Hu J, Wang JF. Progress in the application of artificial intelligence in drug development[J]. Shandong Chem (山东化工), 2019, 48(22): 70-73.
Kola I, Landis J. Can the pharmaceutical industry reduce attrition rates?[J]. Nat Rev Drug Discov, 2004, 3(8): 711-715.
Zhang L, Jiang C, Chen SM, et al. Determination of plasma protein binding of peptide drug candidates by dextran-coated charcoal[J]. J China Pharm Univ (中国药科大学学报), 2020, 51(5): 522-529.
Chen Y, Wu H, Ge WH, et al. Research on entity relation extraction of Chinese adverse drug reaction reports based on deep learning method[J]. J China Pharm Univ (中国药科大学学报), 2019, 50(6): 753-759.
Ghafourian T, Barzegar J, Dastmalchi S, et al. QSPR models for the prediction of apparent volume of distribution[J]. Int J Pharm, 2006, 319(1/2): 82-97.
Gleeson MP, Waters NJ, Paine SW, et al. In silico human and rat Vss quantitative structure-activity relationship models[J]. J Med Chem, 2006, 49(6): 1953-1963.
Lombardo F, Obach RS, DiCapua FM, et al. A hybrid mixture discriminant analysis-random forest computational model for the prediction of volume of distribution in human[J]. J Med Chem, 2006, 49(7): 2262-2267.
Gleeson MP. Plasma protein binding affinity and its relationship to molecular structure: an in-silico analysis[J]. J Med Chem, 2007, 50(1): 101-112.
Gunturi SB, Narayanan R. In silico ADME modeling 3: computational models to predict human intestinal absorption using sphere exclusion and kNN QSAR methods[J]. QSAR Combinat Sci, 2007, 26: 653-668.
Norinder U, Bergstroem CA. Prediction of ADMET properties[J]. ChemMedChem, 2006, 1(9): 920-937.
Votano JR, Parham M, Hall LM, et al. QSAR modeling of human serum protein binding with several modeling techniques utilizing structure information representation[J]. J Med Chem, 2006, 49(24): 7169-7181.
Ingle L, Veber BC, Nichols JW, et al. Informing the human plasma protein binding of environmental chemicals by machine learning in the pharmaceutical space: applicability domain and limits of predictability[J]. J Chem Inf Model, 2016, 56(11): 2243-2252.
Watanabe R, Esaki T, Kawashima H, et al. Predicting fraction unbound in human plasma from chemical structure: improved accuracy in the low value ranges[J]. Mol Pharm, 2018, 15(11): 5302-5311.
Obach RS, Lombardo F, Waters NJ. Trend analysis of a database of intravenous pharmacokinetic parameters in humans for 670 drug compounds[J]. Drug Metab Dispos, 2008, 36(7): 1385-1405.
Zhang R, Wang YB. Research on machine learning with algorithm and development[J]. Comm Univ China (中国传媒大学学报), 2016, 23(2): 10-18.
Liu BY, Wang Q, Xu LY, et al. Application of artificial intelligence technology in medicine research and development[J]. Chin J New Drugs (中国新药杂志), 2020, 29(17): 1979-1986.
Moriwaki H, Tian YS, Kawashita N, et al. Mordred: a molecular descriptor calculator[J]. J Cheminform, 2018, 10(1): 4.
Bergstra J, Bengio Y. Random search for hyper-parameter optimization[J]. J Mach Learn Res, 2012, 13: 281-305.
Nagle K. Atomic polarizability and electronegativity[J]. J Am Chem Soc, 1990, 112(12): 4741-4747.
Zhivkova Z, Doytchinova I. Quantitative structure-plasma protein binding relationships of acidic drugs[J]. J Pharm Sci, 2012, 101(12): 4627-4641.