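Chinese electronic medical record named entity recognition based on sentence-level Lattice-long short-term memory neural network
PAN Cui-ran1, WANG Qing-hua1, TANG Bu-zhou2, JIANG Lei3, HUANG Xun4, WANG Li1*

(1. Department of Medical Informatics, School of Medicine, Nantong University, Nantong 226001; 2. School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055; 3. Department of Rheumatology and Immunology, Changzheng Hospital, Naval Medical University (Second Military Medical University), Shanghai 200433; 4. Department of Communication Engineering, School of Information Science and Technology, Nantong University, Nantong 226001 *Corresponding author)
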
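Abstract:
Objective To propose a conditional random field (CRF) model based on Re-entity, a new word segmentation method, and to compare it with the bi-directional long short-term memory neural network (BiLSTM)-CRF and the Lattice-long short-term memory neural network (LSTM). Methods After comparing existing entity recognition methods and models, we proposed Re-entity-based CRF, BiLSTM-CRF, and Lattice-LSTM methods for task one of the 2018 China Conference on Knowledge Graph and Semantic Computing (CCKS2018), "electronic medical record named entity recognition", and trained character vector sets at different parameter levels on different corpora. The methods were each introduced into the neural network models for comparative experiments on model performance, and a final comparison was made between sentence-level and document-level input lengths. Results With the best feature engineering in place, the performance of the CRF model improved after the Re-entity method was introduced. The sentence-level Lattice-LSTM model achieved a strict F1-measure of 89.75% on this task, higher than the best result on CCKS2018 task one (89.25%). Conclusion The CRF model based on the new Re-entity word segmentation method can use a Chinese clinical drug knowledge base to effectively improve the recognition rate of drugs in electronic medical records. The Re-entity method can reduce the error accumulation caused by word segmentation in the data preprocessing stage. The Lattice structure can better combine the latent semantic information of character and word sequences, and sentence-level input can effectively improve the recognition accuracy of neural network models.
Keywords: computerized medical records systems; Chinese electronic medical records; entity recognition; conditional random field; bi-directional long short-term memory neural network; lattice long short-term memory neural network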
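DOI: 10.16781/j.0258-879x.2019.05.0497
Received: 2019-02-23; Revised: 2019-04-12
Funding: National Key Research and Development Program of China (2018YFC0116902), National Natural Science Foundation of China (81873915), Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX17-1932).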
|
Chinese electronic medical record named entity recognition based on sentence-level Lattice-long short-term memory neural network |
PAN Cui-ran1,WANG Qing-hua1,TANG Bu-zhou2,JIANG Lei3,HUANG Xun4,WANG Li1* |
(1. Department of Medical Informatics, School of Medicine, Nantong University, Nantong 226001, Jiangsu, China; 2. College of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen 518055, Guangdong, China; 3. Department of Rheumatology and Immunology, Changzheng Hospital, Naval Medical University (Second Military Medical University), Shanghai 200433, China; 4. Department of Communication Engineering, School of Information Science and Technology, Nantong University, Nantong 226001, Jiangsu, China *Corresponding author)
Abstract: |
Objective To propose a conditional random field (CRF) model based on the new word segmentation method Re-entity, and to compare it with the bi-directional long short-term memory neural network (BiLSTM)-CRF and the Lattice-long short-term memory neural network (LSTM). Methods After analyzing the existing entity recognition methods, we proposed a Re-entity-based CRF method, BiLSTM-CRF, and Lattice-LSTM for task one of the China Conference on Knowledge Graph and Semantic Computing 2018 (CCKS2018), Chinese clinical named entity recognition, and trained character vector sets at different parameter levels on different corpora. Comparative experiments on model performance were carried out by introducing each method into the neural network models. Finally, a comparative study was carried out with different input lengths, namely the sentence level and the document level. Results The Re-entity method improved the performance of the CRF model under the best feature engineering. The sentence-level Lattice-LSTM model achieved a strict F1-measure of 89.75% on this task, which was higher than the highest F1-measure (89.25%) reported on task one of CCKS2018. Conclusion The CRF model based on Re-entity can effectively improve the recognition rate of drugs in electronic medical records by using a Chinese clinical drug knowledge base. The Re-entity method can reduce the error accumulation caused by word segmentation in data preprocessing. The Lattice structure can better combine the latent semantic information of character and word sequences. In addition, sentence-level input can effectively improve the recognition accuracy of neural network models.
Key words: computerized medical records systems; electronic medical record; entity recognition; conditional random field; bi-directional long short-term memory neural network; lattice-long short-term memory neural network
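The abstract reports a strict F1-measure but does not define it. The sketch below shows one common reading, assumed here and not confirmed by the abstract: a predicted entity counts as correct only when its start offset, end offset, and type all match a gold entity exactly. The function name `strict_prf`, the tuple layout, and the entity type labels are hypothetical and chosen for illustration only.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: (start offset, end offset, entity type).
Entity = Tuple[int, int, str]


def strict_prf(gold: Iterable[Entity], pred: Iterable[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match scoring.

    A prediction is a true positive only if boundaries and type are
    identical to a gold entity (assumed meaning of "strict").
    """
    gold_set: Set[Entity] = set(gold)
    pred_set: Set[Entity] = set(pred)
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only; offsets and labels are made up.
gold = [(0, 2, "symptom"), (5, 9, "drug"), (12, 15, "anatomy")]
pred = [(0, 2, "symptom"), (5, 8, "drug"), (12, 15, "anatomy")]

precision, recall, f1 = strict_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this toy case the drug entity is predicted with a boundary that is off by one character, so it earns no credit under exact matching even though it overlaps the gold span.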