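Abstract
Proteomic analysis based on tandem mass spectrometry usually relies on scoring matches between experimental and theoretical spectra, and interference from the many co-eluting peptides reduces the accuracy of peptide and protein identification and quantification. Peptide retention time prediction converts a peptide's chromatographic retention behavior into a stable, independent time attribute that serves as an auxiliary and validation criterion for peptide identification and improves its accuracy. Predicting peptide retention in complex systems is also important for optimizing proteomic measurement conditions and for improving the detection rate and reproducibility of mass spectrometry data in data-independent acquisition (DIA). This article reviews the retention prediction methods commonly used for unmodified and modified peptides (based on standardized indices, peptide molecular models, amino acid residue parameters, and machine learning), summarizes the principles and characteristics of each method, and discusses their applications in proteomics and directions for future development.
At present, the vast majority of proteomic analyses use bottom-up methods based on tandem mass spectrometry, in which enzymatically digested peptides are analyzed by LC-MS and proteins are identified from the tandem mass spectra of the peptides
The chromatographic retention of a peptide depends on the chromatographic method and on the properties of the peptide itself, which are largely determined by its amino acid sequence. Under given chromatographic conditions, retention time (RT) therefore carries information about the peptide sequence
This article reviews the methods for predicting the retention times of unmodified and modified peptides, summarizes the principles, models, and characteristics of each method and its applications in protein identification and quantification, discusses the feasibility and accuracy of using these methods to predict the retention of intact proteins in proteomics, and looks ahead to future directions and applications of peptide retention time prediction.
To make full use of chromatographic retention data, many peptide retention time prediction methods have been developed; see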

Figure 1 Four methods of peptide retention prediction: each panel illustrates the principles and characteristics of one of the four methods
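Under given chromatographic conditions, the RT of a particular peptide should be constant, so RT is a parameter that depends on chemical structure. The peptide molecular model approach predicts retention time from the physicochemical properties of the peptide, that is, from structural information about the peptide or from information about its chemical interactions during separation. Molecular model methods lean toward physical modeling of the macromolecule, supplemented by the summed contributions of the amino acid residues. They are simple to apply but omit some factors that influence chromatographic retention.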
Kaliszan
Le Maux
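The standardized index approach uses the retention times of a set of standard peptides to build a database and takes these values as the basis and reference for predicting the RT of other peptides. The standard peptides cover a range of hydrophobicities and are easily detected by MS. Only one calibration experiment with the set of standard peptides is needed, after which their RT information can be used in all subsequent analyses under different conditions. This mitigates the large differences in RT values caused by differences between chromatographic systems.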
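The calibration idea behind a standardized index can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any specific tool: it assumes a linear relationship between the index scale and the observed RT on a given system, and the reference values, the helper `fit_line`, and the function `index_to_rt` are all hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical standard peptides: (index value, observed RT in minutes on this system).
reference = [(-20.0, 5.0), (0.0, 15.0), (40.0, 35.0), (100.0, 65.0)]

slope, intercept = fit_line(
    [index for index, _ in reference],
    [rt for _, rt in reference],
)


def index_to_rt(index_value):
    """Convert an index value to an expected RT on the calibrated system."""
    return slope * index_value + intercept


print(round(index_to_rt(60.0), 2))
```

Once the line is fitted for a given system, any peptide with a known index value can be placed on that system's time axis; a different column or gradient only requires refitting the two calibration parameters.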
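iRT was first introduced by Escher
Compared with the peptide molecular model approach, the iRT family of methods is more widely used and has greatly improved the detection rate and accuracy of proteomic data analysis. However, because the number of iRT peptides is very limited and they are used mainly under linear gradient conditions, their precision is limited.
Residue parameter-based methods were originally designed to predict the effect of each amino acid residue in a peptide sequence on the RT of the whole peptide. The individual contribution of an amino acid residue is usually called its retention coefficient (RC), and the retention of the whole peptide is then the sum of the individual contributions (a set of RCs). Under given chromatographic conditions, the RT of a peptide can be estimated simply by summing the RCs of its constituent residues; this is the additive model.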
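The additive model can be written as a short Python sketch. The retention coefficients below are invented for illustration and do not come from any published scale; the function name `additive_rt` and the optional `intercept` term are likewise assumptions.

```python
# Hypothetical retention coefficients (arbitrary units), for illustration only.
RETENTION_COEFFICIENTS = {"A": 0.5, "G": 0.0, "L": 2.0, "K": -1.0, "F": 2.5}


def additive_rt(sequence, coefficients, intercept=0.0):
    """Estimate RT as the sum of per-residue retention coefficients."""
    unknown = sorted(set(sequence) - set(coefficients))
    if unknown:
        raise KeyError(f"no retention coefficient for residues: {unknown}")
    return intercept + sum(coefficients[residue] for residue in sequence)


print(additive_rt("GALK", RETENTION_COEFFICIENTS))
print(additive_rt("KLAG", RETENTION_COEFFICIENTS))
```

Both calls return the same value because a purely additive model depends only on amino acid composition, not on residue order. Peptides with the same composition but different sequences are therefore indistinguishable to it.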
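The earliest example of this approach used a set of 25 short peptides (glucagon, somatostatin, and others) and their observed RTs to derive a retention coefficient for each amino acid residue present in the sequences
Subsequent studies showed
Building on this foundational work on the additive model, Krokhin
The second version of the algorithm expanded the data set to 2000 peptides. In addition to introducing separate RCs for the amino acid residues of short peptides, it corrected for isoelectric point, nearest-neighbor effects in charged peptides, and the propensity to form helical structures (proline repeats). On this basis, Elutato
The limitation of parameter-based methods is that they are usually optimized to predict retention times for a specific chromatographic system. Dwivedi
In hydrophilic interaction liquid chromatography (HILIC) systems, peptides carrying N-cap helix-stabilizing motifs and highly amphipathic helices are retained less than predicted. This is because the hydrophilic carbonyl and amide groups of the peptide backbone take part in the hydrogen bonding that stabilizes the helical structure, which determines the distinctive sequence-dependent behavior of peptides in HILIC
In another SSRCalc-based peptide retention prediction model, developed for strong cation exchange (SCX) systems, the mechanism of peptide separation and prediction is based on the electrostatic interactions of peptides in ion-exchange chromatography, which are governed by Coulomb's law
This also shows that predictions deviate when experimental conditions change, and that specific parameters must be introduced as corrections to obtain a good correlation. SSRCalc is currently the most widely used parameter-based retention time predictor; it is arguably the benchmark tool in the field and one of the most accurate retention time prediction models. SSRCalc has made considerable progress in optimizing for peptide charge, length, hydrophobicity, secondary structure, and helical structure, for the individual retention of amino acids and their position relative to the peptide termini, and for different chromatographic systems.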
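Machine learning, a branch of artificial intelligence, has also been applied to peptide retention time prediction. These methods use computer algorithms to extract information from known input data and to output values during training. A model with known parameters is built from the input-output data obtained in training and is then used to predict the RT of target peptides. Machine learning-based RT prediction methods fall into two broad categories: traditional machine learning methods and deep learning methods. Traditional machine learning methods are in turn divided into two subclasses: one is artificial neural networks (ANN
Initially, the ANN was based on the composition of the 20 amino acid residues and consisted of 20 input nodes, 2 hidden nodes, and 1 output node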
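To make the 20-2-1 layout concrete, the following Python sketch runs one forward pass through a network of that shape. It is an illustration under several assumptions that are not taken from the source: the input is the fractional amino acid composition, both layers use a sigmoid activation, the weights are random and untrained, and the output is read as a normalized elution time between 0 and 1.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one input node each


def composition(sequence):
    """Fraction of each of the 20 residues in the peptide (20 input values)."""
    return [sequence.count(residue) / len(sequence) for residue in AMINO_ACIDS]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: 20 inputs -> 2 hidden nodes -> 1 output node."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Untrained, randomly initialised weights (assumption: a real model would fit
# these to peptides with measured retention times).
rng = random.Random(0)
hidden_weights = [[rng.uniform(-1.0, 1.0) for _ in AMINO_ACIDS] for _ in range(2)]
hidden_biases = [0.0, 0.0]
output_weights = [rng.uniform(-1.0, 1.0) for _ in range(2)]
output_bias = 0.0

features = composition("LGEYGFQNALIVR")
normalized_rt = forward(features, hidden_weights, hidden_biases, output_weights, output_bias)
print(normalized_rt)
```

Because the input is only the residue composition, such a network, like the additive model, cannot separate peptides that share a composition but differ in sequence.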
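To adapt to different chromatographic conditions while using fewer training peptides, Moruz
Many prediction models that combine support vector regression (SVR) algorithms have been derived on this basis. The serial and parallel support vector machine (SP-SVM) comprises one SVR used only for model training (p-SVR) and four SVMs used for RT prediction (C-SVM, 1-SVR, s-SVR, and n-SVR
Uncertainty can be formulated as a relationship between the target sample and the training data set. Building on this prediction strategy, GPTime replaced SVR with Gaussian processes (GP) and, using the same select-train-calibrate-compute workflow, showed that GP matches the accuracy of SVR while also providing an uncertainty estimate for the predicted RT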
Lu
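Deep learning can automatically and efficiently decode complex relationships in very large data sets and learn features and patterns without manual feature engineering, which makes it particularly suitable for scientific fields with large, complex data sets. Deep learning-based algorithms fall broadly into three classes: recurrent neural networks (RNN), convolutional neural networks (CNN), and hybrid networks, of which RNN is the predominant architecture.
Prosit is a representative RNN-based algorithm
A CNN contains convolutional and pooling layers and can extract sequence features at different spatial scales. Ma
For smaller data sets, traditional machine learning methods usually outperform deep learning methods, but as the training set grows, the advantages of deep learning gradually emerge and its performance greatly exceeds that of machine learning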
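Post-translational modifications (PTMs) can change the charge state, hydrophobicity, spatial structure, and stability of a protein and ultimately affect its function and its interactions with receptors and other partners. More than 300 different PTMs have been identified; the main forms include phosphorylation, glycosylation, acetylation, carboxylation, and disulfide bond pairing
Many studies are currently developing RT prediction for PTM peptides, mostly by introducing model parameters (RC, hydrophobicity, and so on) for the modified amino acid residues into existing models. For example, Reime
An extended version of the BioLCCC model can predict phosphorylated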
Elude 2.
In deep learning methods, the one-hot encoding of amino acids adopted by most models limits their applicability to PTM peptides
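The restriction imposed by one-hot encoding can be seen in a small Python sketch. The token `S[ph]` for a phosphorylated serine and the function `one_hot` are hypothetical names used only for this illustration; real models define their own vocabularies.

```python
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # fixed 20-residue vocabulary


def one_hot(tokens, alphabet):
    """Encode each token as a 0/1 vector with a single 1 at the token's index."""
    index = {token: position for position, token in enumerate(alphabet)}
    rows = []
    for token in tokens:
        if token not in index:
            raise ValueError(f"token {token!r} is not in the encoding alphabet")
        row = [0] * len(alphabet)
        row[index[token]] = 1
        rows.append(row)
    return rows


plain = one_hot(list("PEPTIDE"), AMINO_ACIDS)
print(len(plain), len(plain[0]))

# A modified residue has no column in the 20-letter alphabet.
try:
    one_hot(["A", "S[ph]", "K"], AMINO_ACIDS)
except ValueError as error:
    print(error)

# Extending the alphabet (hypothetical token) adds a column, but a model trained
# on the 20-letter encoding has no weights for it and would need retraining.
extended_alphabet = AMINO_ACIDS + ["S[ph]"]
modified = one_hot(["A", "S[ph]", "K"], extended_alphabet)
print(len(modified[1]))
```

Every new modification type adds another column, so the input dimension grows with the number of supported PTMs, and each new column needs training examples that contain that modification.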
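In targeted proteomics, retention time prediction models can potentially help generate reference lists for data acquisition so that more proteins can be quantified simultaneously. In bottom-up proteomics, these models are used mainly during database searching as an additional validation criterion for peptide-spectrum matches (PSM). In recent years, a growing number of studies have integrated peptide RT prediction models into proteomic data analysis workflows. These methods, built on different principles, have been widely applied in data-dependent acquisition (DDA) experiments, targeted proteomics experiments, data-independent acquisition (DIA) proteomics experiments, and the development of comprehensive models for RT prediction of intact proteins.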
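Using predicted RT as an extra check on peptide-spectrum matches can be sketched as a simple tolerance filter in Python. The peptides, times, scores, the 2-minute tolerance, and the function `filter_psms_by_rt` are all hypothetical, and a hard cutoff is only one possible design: a workflow could equally feed the RT deviation into a rescoring step as one feature among several.

```python
def filter_psms_by_rt(psms, predicted_rt, tolerance_min):
    """Keep PSMs whose observed RT lies within tolerance_min of the predicted RT."""
    kept = []
    for psm in psms:
        expected = predicted_rt.get(psm["peptide"])
        if expected is None:
            continue  # no prediction available; this sketch drops the PSM
        if abs(psm["observed_rt"] - expected) <= tolerance_min:
            kept.append(psm)
    return kept


# Hypothetical predicted retention times in minutes.
predicted_rt = {"LGEYGFQNALIVR": 42.0, "AEFVEVTK": 18.5}

# Hypothetical search results.
psms = [
    {"peptide": "LGEYGFQNALIVR", "observed_rt": 43.1, "score": 0.91},
    {"peptide": "AEFVEVTK", "observed_rt": 31.0, "score": 0.88},
    {"peptide": "YLYEIAR", "observed_rt": 25.0, "score": 0.75},
]

accepted = filter_psms_by_rt(psms, predicted_rt, tolerance_min=2.0)
print([psm["peptide"] for psm in accepted])
```

In this example the second match is rejected despite its high score because it elutes 12.5 min away from its predicted time, which is the kind of disagreement that RT prediction is meant to expose.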
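For method development, the critical first step in targeted proteomics, predicted RTs have been used to reduce the number of experiments needed to analyze the targets. The smaller the acquisition window, the more peptides can be targeted without compromising data quality. A complex background can make selected reaction monitoring (SRM) measurements ambiguous, because the sample may contain interfering peptides with properties similar to those of the target peptide. In DDA, Prosi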
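The link between acquisition window width and the number of peptides that can be targeted can be illustrated with a short Python sketch. It assumes each target is monitored in a window centered on its predicted RT and counts how many windows are open at the same time; the predicted times, the window widths, and the function `max_concurrent_targets` are hypothetical.

```python
def max_concurrent_targets(predicted_rts, window_min):
    """Largest number of scheduled windows that overlap at any time point."""
    half = window_min / 2.0
    events = []
    for rt in predicted_rts:
        events.append((rt - half, 1))   # window opens
        events.append((rt + half, -1))  # window closes
    # At equal times, process closings (-1) before openings (+1), so windows
    # that merely touch are not counted as overlapping.
    events.sort(key=lambda event: (event[0], event[1]))
    current = peak = 0
    for _, change in events:
        current += change
        peak = max(peak, current)
    return peak


# Hypothetical predicted retention times in minutes.
predicted_rts = [10.0, 10.5, 11.0, 12.0, 20.0]

print(max_concurrent_targets(predicted_rts, window_min=4.0))
print(max_concurrent_targets(predicted_rts, window_min=1.0))
```

With the same target list, narrowing the window from 4 min to 1 min reduces the peak number of concurrently monitored peptides, which frees instrument time for additional targets. Narrow windows are only safe when the RT prediction is accurate enough that peptides do not elute outside them.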
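In DIA, MS/MS spectra are chimeric: the data originate from many peptides, and the fragment ions are further subject to interference from unfragmented precursor ions. This interference is amplified when short chromatographic gradients are combined with complex samples. In the absence of high-confidence data from fragment spectra, the observed peptide RT and the unfragmented mass can be used as additional information for peptide identification to filter out misidentified metabolites. The advantage of these prediction algorithms is that they keep the library up to date and can even account for differences between instrument platforms. The DIA approach is roughly as follows: a database from a similar sample source (for example, Saccharomyces cerevisiae proteins) and Prosit14 are used to generate a spectral library with predicted RTs (320,150 unique peptide sequences); after empirical correction (six gas-phase fractionated DIA injections), the new library contains 64,597 peptide sequences from 4,464 protein groups
High-precision iR
The extensive experience gained with peptide retention prediction models can be applied to RT prediction for intact proteins, although this is more challenging. BioLCCC, which is based on the statistical physics of macromolecules, takes into account all possible conformations of the polypeptide chain inside the adsorbent pores and is well suited to RT prediction for intact proteins. Studies have shown that the BioLCCC model, for 12 intact protein
In LC-MS-based proteomics, retention time plays an important role in the accuracy, completeness, and depth of peptide identification and quantification. Compared with methods based on peptide molecular models, index-based and sequence-specific models are more widely applicable, but their predictive power is still limited by chromatographic conditions. As research advances and larger data sets and more unknown peptides and proteins must be handled, training deep neural network models to build a network or ensemble model dedicated to each individual instrument can shorten acquisition time from several days to a few hours. Future development of retention models for PTM peptides will focus on optimizing predictions for modifications associated with changes in spatial structure in cases where the training set contains no parameters for the modification type. In peptide RT prediction, model accuracy still needs to be improved, unified evaluation standards established, and more generally applicable algorithms developed, so that RT prediction truly becomes one of the important tools of proteomics research.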
References
Henneman A,Palmblad M. Retention time prediction and protein identification[J]. Methods Mol Biol,2020,2051:115-132.
Dorfer V,Maltsev S,Winkler S,et al. CharmeRT:boosting peptide identifications by chimeric spectra identification and retention time prediction[J]. J Proteome Res,2018,17(8):2581-2589.
Escher C,Reiter L,MacLean B,et al. Using iRT,a normalized retention time for more targeted measurement of peptides[J]. Proteomics,2012,12(8):1111-1121.
Krokhin O,Craig R,Spicer V,et al. An improved model for prediction of retention times of tryptic peptides in ion pair reversed-phase HPLC:its application to protein peptide mapping by off-line HPLC-MALDI MS[J]. Mol Cell Proteomics,2004,3(9):908-919.
Moruz L,Tomazela D,Käll L. Training,selection,and robust calibration of retention time models for targeted proteomics[J]. J Proteome Res,2010,9(10):5209-5216.
Zohora FT,Rahman MZ,Tran NH,et al. DeepIso:a deep learning model for peptide feature detection from LC-MS map[J]. Sci Rep,2019,9(1):17168.
Baczek T,Kaliszan R,Novotná K,et al. Comparative characteristics of HPLC columns based on quantitative structure-retention relationships (QSRR) and hydrophobic-subtraction model[J]. J Chromatogr A,2005,1075:109-115.
Le Maux S,Nongonierma A,Fitzgerald R. Improved short peptide identification using HILIC-MS/MS:retention time prediction model based on the impact of amino acid position in the peptide sequence[J]. Food Chem,2015,173:847-854.
Gorshkov A,Tarasova I,Evreinov V,et al. Liquid chromatography at critical conditions:comprehensive approach to sequence-dependent retention time prediction[J]. Anal Chem,2006,78(22):7770-7777.
Tarasova I,Goloborodko A,Perlova T,et al. Application of statistical thermodynamics to predict the adsorption properties of polypeptides in reversed-phase HPLC[J]. Anal Chem,2015,87(13):6562-6569.
Gallien S,Peterman S,Kiyonami R,et al. Highly multiplexed targeted proteomics using precise control of peptide retention time[J]. Proteomics,2012,12(8):1122-1133.
Bruderer R,Bernhardt O,Gandhi T,et al. High-precision iRT prediction in the targeted analysis of data-independent acquisition and its impact on identification and quantitation[J]. Proteomics,2016,16:2246-2256.
Meek J. Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition[J]. Proc Natl Acad Sci U S A,1980,77(3):1632-1636.
Mant C,Hodges R. Context-dependent effects on the hydrophilicity/hydrophobicity of side-chains during reversed-phase high-performance liquid chromatography:implications for prediction of peptide retention behaviour[J]. J Chromatogr A,2006,1125(2):211-219.
Mant C,Hodges R. Design of peptide standards with the same composition and minimal sequence variation to monitor performance/selectivity of reversed-phase matrices[J]. J Chromatogr A,2012,1230:30-40.
Tripet B,Cepeniene D,Kovacs JM,et al. Requirements for prediction of peptide retention time in reversed-phase high-performance liquid chromatography:hydrophilicity/hydrophobicity of side-chains at the N- and C-termini of peptides are dramatically affected by the end-groups and location[J]. J Chromatogr A,2007,1141(2):212-225.
Dwivedi R,Spicer V,Harder M,et al. Practical implementation of 2D HPLC scheme with accurate peptide retention prediction in both dimensions for high-throughput bottom-up proteomics[J]. Anal Chem,2008,80(18):7036-7042.
Reimer J,Spicer V,Krokhin O. Application of modern reversed-phase peptide retention prediction algorithms to the Houghten and DeGraw dataset:peptide helicity and its effect on prediction accuracy[J]. J Chromatogr A,2012,1256:160-168.
Spicer V,Lao Y,Shamshurin D,et al. N-capping motifs promote interaction of amphipathic helical peptides with hydrophobic surfaces and drastically alter hydrophobicity values of individual amino acids[J]. Anal Chem,2014,86(23):11498-11502.
Krokhin O,Ezzati P,Spicer V. Peptide retention time prediction in hydrophilic interaction liquid chromatography:data collection methods and features of additive and sequence-specific models[J]. Anal Chem,2017,89(10):5526-5533.
Gussakovsky D,Neustaeter H,Spicer V,et al. Sequence-specific model for peptide retention time prediction in strong cation exchange chromatography[J]. Anal Chem,2017,89(21):11795-11802.
Petritis K,Kangas L,Ferguson P,et al. Use of artificial neural networks for the accurate prediction of peptide liquid chromatography elution times in proteome analyses[J]. Anal Chem,2003,75(5):1039-1048.
Shinoda K,Sugimoto M,Yachie N,et al. Prediction of liquid chromatographic retention times of peptides generated by protease digestion of the Escherichia coli proteome using artificial neural networks[J]. J Proteome Res,2006,5(12):3312-3317.
Klammer AA,Yi X,MacCoss MJ,et al. Peptide retention time prediction yields improved tandem mass spectrum identification for diverse chromatography conditions[J]. Anal Chem,2007,79(16):6111-6118.
Petritis K,Kangas L,Yan B,et al. Improved peptide elution time prediction for reversed-phase liquid chromatography-MS by incorporating peptide sequence information[J]. Anal Chem,2006,78(14):5026-5039.
Zhang J,Zhang D,Zhang W,et al. A new peptide retention time prediction method for mass spectrometry based proteomic analysis by a serial and parallel support vector machine model[J]. Se Pu,2012,30(9):857-863.
Maboudi Afkham H,Qiu X,The M,et al. Uncertainty estimation of predictions of peptides' chromatographic retention times in shotgun proteomics[J]. Bioinformatics,2017,33(4):508-513.
Lu W,Liu X,Liu S,et al. Locus-specific retention predictor (LsRP):a peptide retention time predictor developed for precision proteomics[J]. Sci Rep,2017,7:43959.
Gessulat S,Schmidt T,Zolg D,et al. Prosit:proteome-wide prediction of peptide tandem mass spectra by deep learning[J]. Nat Methods,2019,16(6):509-518.
Fergadis A,Baziotis C,Pappas D,et al. Hierarchical bi-directional attention-based RNNs for supporting document classification on protein-protein interactions affected by genetic mutations[J]. Database (Oxford),2018:bay076.
Tiwary S,Levy R,Gutenbrunner P,et al. High-quality MS/MS spectrum prediction for data-dependent and data-independent acquisition data analysis[J]. Nat Methods,2019,16(6):519-525.
Guan S,Moran M,Ma B,et al. Prediction of LC-MS/MS properties of peptides from sequence by deep learning[J]. Mol Cell Proteomics,2019,18(10):2099-2107.
Ma C,Ren Y,Yang J,et al. Improved peptide retention time prediction in liquid chromatography through deep learning[J]. Anal Chem,2018,90(18):10881-10888.
Yang Y,Liu X,Shen C,et al. In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics[J]. Nat Commun,2020,11(1):146.
Wen B,Li K,Zhang Y,et al. Cancer neoantigen prioritization through sensitive and reliable proteogenomics analysis[J]. Nat Commun,2020,11(1):1759.
Bouwmeester R,Gabriels R,Van Den Bossche T,et al. The age of data-driven proteomics:how machine learning enables novel workflows[J]. Proteomics,2020,20(21/22):e1900351.
Olsen J,Mann M. Status of large-scale analysis of post-translational modifications by mass spectrometry[J]. Mol Cell Proteomics,2013,12(12):3444-3452.
Reimer J,Shamshurin D,Harder M,et al. Effect of cyclization of N-terminal glutamine and carbamidomethyl-cysteine (residues) on the chromatographic behavior of peptides in reversed-phase chromatography[J]. J Chromatogr A,2011,1218(31):5101-5107.
Perlova T,Goloborodko A,Margolin Y,et al. Retention time prediction using the model of liquid chromatography of biomacromolecules at critical conditions in LC-MS phosphopeptide analysis[J]. Proteomics,2010,10(19):3458-3468.
Sargaeva NP,Goloborodko AA,O'Connor PB,et al. Sequence-specific predictive chromatography to assist mass spectrometric analysis of asparagine deamidation and aspartate isomerization in peptides[J]. Electrophoresis,2011,32(15):1962-1969.
Ogata K,Krokhin O,Ishihama Y. Retention order reversal of phosphorylated and unphosphorylated peptides in reversed-phase LC/MS[J]. Anal Sci,2018,34(9):1037-1041.
Moruz L,Staes A,Foster J,et al. Chromatographic retention time prediction for posttranslationally modified peptides[J]. Proteomics,2012,12(8):1151-1159.
Wen B,Zeng W,Liao Y,et al. Deep learning in proteomics[J]. Proteomics,2020,20(20/21):e1900335.
Ivanov MV,Bubis JA,Gorshkov V,et al. Boosting MS1-only proteomics with machine learning allows 2000 protein identifications in single-shot human proteome analysis using 5 min HPLC gradient[J]. J Proteome Res,2021,20(4):1864-1873.
MacLean B,Tomazela DM,Shulman N,et al. Skyline:an open source document editor for creating and analyzing targeted proteomics experiments[J]. Bioinformatics,2010,26(7):966-968.
Röst H,Malmström L,Aebersold R,et al. A computational tool to detect and avoid redundancy in selected reaction monitoring[J]. Mol Cell Proteomics,2012,11(8):540-549.
Searle BC,Swearingen KE,Barnes CA,et al. Generating high quality libraries for DIA MS with empirically corrected peptide predictions[J]. Nat Commun,2020,11(1):1548.
Moruz L,Hoopmann M,Rosenlund M,et al. Mass fingerprinting of complex mixtures:protein inference from high-resolution peptide masses and predicted retention times[J]. J Proteome Res,2013,12(12):5730-5741.
Demichev V,Messner C,Vernardis S,et al. DIA-NN:neural networks and interference correction enable deep proteome coverage in high throughput[J]. Nat Methods,2020,17(1):41-44.
Gorshkov AV,Evreinov VV,Pridatchenko ML,et al. Applicability of the critical-chromatography concept to analysis of proteins:dependence of retention times on the sequence of amino acid residues in a chain[J]. Polymer Sci,2011,53(12):1227-1241.
Pridatchenko M,Perlova T,Ben Hamidane H,et al. On the utility of predictive chromatography to complement mass spectrometry based intact protein identification[J]. Anal Bioanal Chem,2012,402(8):2521-2529.
Xu L,Glatz C. Predicting protein retention time in ion-exchange chromatography based on three-dimensional protein characterization[J]. J Chromatogr A,2009,1216(2):274-280.
Chen J,Yang T,Cramer S. Prediction of protein retention times in gradient hydrophobic interaction chromatographic systems[J]. J Chromatogr A,2008,1177(2):207-214.
Karlberg M,de Souza JV,Fan L,et al. QSAR Implementation for HIC retention time prediction of mAbs using fab structure:a comparison between structural representations[J]. Int J Mol Sci,2020,21(21):8037.