Entity extraction and graph construction based on Chinese medical text

Yang Ye; Pei Lei; Hou Fengzhen

doi:10.11665/j.issn.1000-5048.2023030903

Abstract

Knowledge graph technology has advanced new drug research and development, but research in China started late and domain knowledge is mostly stored as text, so existing knowledge graphs are rarely reused. Working from multi-source, heterogeneous medical texts, this study designed a Chinese named entity recognition model built on the Bert-wwm-ext pre-trained model and incorporating a cascade approach, which reduces the complexity of conventional single-pass classification and further improves the efficiency of text recognition. In experiments, the model achieved an F1-score of 0.903, a precision of 89.2%, and a recall of 91.5% on the self-built training corpus. The model was also applied to the public dataset CCKS2019, where the results showed that it recognizes medical entities in Chinese text better. Finally, the model was used to construct a Chinese medical knowledge graph containing 13 530 entities, 10 939 attributes, and 39 247 relationships. The Chinese medical entity recognition and graph construction method proposed in this study is expected to help researchers accelerate the discovery of new medical knowledge and thereby shorten the new drug development process.
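The three reported metrics are related by the standard definition of the F1-score as the harmonic mean of precision and recall. The short Python sketch below checks that the reported precision (89.2%) and recall (91.5%) are consistent with the reported F1-score of 0.903. The function name `f1_from_precision_recall` is a hypothetical helper introduced only for illustration; it is not taken from the paper's code.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the F1-score, i.e. the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Convention: F1 is defined as 0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the self-built corpus.
reported_precision = 0.892
reported_recall = 0.915

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(round(f1, 3))  # 0.903
```

The result agrees with the F1-score stated in the abstract to three decimal places.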