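Abstract:
Objective To mine text data from the discharge summaries of patients with ischemic stroke, a disease with high incidence and poor prognosis, using natural language processing, and to convert the unstructured text into a structured database for subsequent statistical analysis using the Python programming language. Methods Using discharge summaries of patients with ischemic stroke, a named entity recognition model combining a knowledge-enhanced semantic representation model (ERNIE), a neural network, and a conditional random field was built to recognize 5 types of medical named entities: disease, drug, surgery, imaging examination, and symptom. The extracted entities were used to build a semi-structured database. To further extract structured data from the semi-structured database, an ERNIE-based Siamese text similarity matching model was built and evaluated by accuracy, and the best model was used to build the covariate extractor. Results The overall F1 value of the named entity recognition model was 90.27%; by entity type, F1 was 88.41% for disease, 91.03% for drug, 87.71% for imaging examination, 87.07% for surgery, and 96.59% for symptom. The overall accuracy of the text similarity matching model was 99.11%. Conclusion Natural language processing made it possible to go from fully unstructured data to semi-structured data and then to structured data. Compared with reading medical records and extracting the information by hand, this greatly improved the efficiency of database construction.
Key words: stroke; electronic health record; patient discharge summary; natural language processing; named entity recognition; similarity matching; covariate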
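DOI: 10.16781/j.0258-879x.2021.11.1273
Received: 2021-05-21; Revised: 2021-06-28
Funding: Sub-project of the Major Project of Military Logistics Scientific Research (AWS14R013-1); Outstanding Talent Training Program of the Three-Year Action Plan for Public Health System Construction in Shanghai (2020-2022) (GWV-10.1-XD05).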
|
Covariate extraction method based on discharge summary of stroke patients
LIN Zhen1,2, QIN Yu-chen1, QIN Ying-yi1, LI Dong-dong1, WU Cheng1*, HE Jia1*
(1. Department of Health Statistics, Faculty of Health Services, Naval Medical University (Second Military Medical University), Shanghai 200433, China; 2. No. 73127 Troop Hospital of PLA, Fuzhou 350000, Fujian, China; *Corresponding authors)
Abstract:
Objective To mine text data from the discharge summaries of patients with ischemic stroke, a disease with high incidence and poor prognosis, using natural language processing, and to convert the unstructured text into a structured database for subsequent statistical analysis using Python. Methods Using discharge summaries of patients with ischemic stroke, a named entity recognition model combining enhanced representation through knowledge integration (ERNIE), a neural network, and a conditional random field was constructed to identify 5 types of medical named entities: disease, drug, surgery, imaging examination, and symptom. The extracted entities were used to construct a semi-structured database. To further extract structured data from the semi-structured database, a Siamese text similarity matching model based on ERNIE was constructed and evaluated by accuracy, and the best model was used to construct the covariate extractor. Results The overall F1 value of the named entity recognition model was 90.27%; by entity type, F1 was 88.41% for disease, 91.03% for drug, 87.71% for imaging examination, 87.07% for surgery, and 96.59% for symptom. The overall accuracy of the text similarity matching model was 99.11%. Conclusion Natural language processing made it possible to go from fully unstructured data to semi-structured data and then to structured data. Compared with reading medical records and extracting the information manually, this approach greatly improved the efficiency of database construction.
Key words: stroke; electronic health record; patient discharge summary; natural language processing; named entity recognition; similarity matching; covariate
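
The flow described in the abstract (tagged text, then extracted entities, then one structured row of covariates) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the BIO tag scheme, the example sentence, the hypothetical `COVARIATE_TERMS` table, the 0.6 threshold, and above all the similarity function, which uses the standard library's `difflib.SequenceMatcher` as a character-level stand-in for the ERNIE-based Siamese model. The tags are written by hand where the named entity recognition model's output would go.

```python
from difflib import SequenceMatcher


def decode_bio(tokens, tags):
    """Collect (entity_type, text) pairs from BIO tags (assumed tag scheme)."""
    entities, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was still open.
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag: close the open entity.
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


def similarity(a, b):
    """Stand-in for the Siamese model: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical covariates: name -> (entity type, reference term).
COVARIATE_TERMS = {
    "hypertension": ("disease", "hypertension"),
    "diabetes": ("disease", "diabetes"),
    "aspirin": ("drug", "aspirin"),
}


def extract_covariates(entities, threshold=0.6):
    """Build one structured row: 1 if any same-type entity matches the term."""
    row = {name: 0 for name in COVARIATE_TERMS}
    for ent_type, text in entities:
        for name, (cov_type, term) in COVARIATE_TERMS.items():
            if ent_type == cov_type and similarity(text, term) >= threshold:
                row[name] = 1
    return row


# Hand-written tags stand in for the output of the NER model.
tokens = ["History", "of", "hypertension", "grade", "3", ";",
          "takes", "aspirin", "daily", "."]
tags = ["O", "O", "B-disease", "I-disease", "I-disease", "O",
        "O", "B-drug", "O", "O"]

entities = decode_bio(tokens, tags)   # semi-structured: typed entity list
row = extract_covariates(entities)    # structured: one value per covariate
print(entities)  # [('disease', 'hypertension grade 3'), ('drug', 'aspirin')]
print(row)       # {'hypertension': 1, 'diabetes': 0, 'aspirin': 1}
```

A character-level ratio only matches surface variants of the same wording. A learned similarity model such as the one the abstract describes is what would be needed to match differently worded mentions of the same concept.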