Entity extraction and graph construction based on Chinese medical text

Yang Ye (杨晔); Pei Lei (裴雷); Hou Fengzhen (侯凤贞)

doi:10.11665/j.issn.1000-5048.2023030903

Institute of Medical Big Data and Artificial Intelligence, School of Science, China Pharmaceutical University, Nanjing 211198, China

Publication history
- Received: 2023-03-08
- Revised: 2023-06-11
- Published: 2023-06-24

Abstract

Abstract: Knowledge graph technology has advanced new drug research and development, but research in China started late and domain knowledge is mostly stored as text, so existing knowledge graphs are rarely reused. Drawing on multi-source, heterogeneous medical texts, this study designed a Chinese named entity recognition model that builds on the Bert-wwm-ext pre-trained model and incorporates a cascade strategy, which reduces the complexity of traditional single-pass classification and improves recognition efficiency. On the self-built corpus, the model achieved an F1-score of 0.903, a precision of 89.2%, and a recall of 91.5%. Applied to the public dataset CCKS2019, the model also recognized medical entities in Chinese text more effectively. Finally, the model was used to construct a Chinese medical knowledge graph containing 13 530 entities, 10 939 attributes, and 39 247 relationships. The proposed method for Chinese medical entity recognition and graph construction is expected to help researchers accelerate the discovery of medical knowledge and thereby shorten the new drug development process.
- Chinese medical text
- named entity recognition model
- Bert-wwm-ext pre-training model
- cascade strategy
- knowledge graph
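
The three scores reported in the abstract can be checked against one another, since F1 is conventionally the harmonic mean of precision and recall. The following is a minimal sketch in Python, assuming that standard definition; the function name `f1_score` is illustrative and does not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        # Avoid division by zero when both scores are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the self-built corpus.
precision = 0.892
recall = 0.915

f1 = f1_score(precision, recall)
print(round(f1, 3))
```

Rounded to three decimals, the result is 0.903, which agrees with the F1-score stated in the abstract.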
