Abstract: Objective To mine text data from the discharge summaries of patients with stroke (a disease with high incidence and poor prognosis) using natural language processing, and to convert the unstructured text into a structured database for subsequent statistical analysis, with the pipeline implemented in Python. Methods Using the discharge summaries of patients with ischemic stroke, a named entity recognition model combining enhanced representation from knowledge integration (ERNIE), a neural network, and a conditional random field was constructed to identify 5 types of medical named entities: disease, drug, surgery, imaging examination, and symptom. The recognized entities were extracted to build a semi-structured database. To extract structured data from the semi-structured database, an ERNIE-based text-pair similarity matching model was constructed. Accuracy was used as the evaluation metric, and the best-performing model was used to build the covariate extractor. Results The overall F1 score of the named entity recognition model reached 90.27%, with F1 scores of 88.41% for disease, 91.03% for drug, 87.71% for imaging examination, 87.07% for surgery, and 96.59% for symptom. The overall accuracy of the text similarity matching model reached 99.11%. Conclusion Natural language processing was used to realize the full construction process from fully unstructured data, to semi-structured data, and then to structured data. Compared with manually reading medical records and extracting data from them, natural language processing greatly improved the efficiency of database construction.
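The unstructured, semi-structured, structured flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the BIO tag scheme, the entity labels (`DISEASE`, `DRUG`), the function names, the 0.8 threshold, and the example sentence are all assumptions made for illustration, and the standard library's `difflib.SequenceMatcher` stands in for the ERNIE-based similarity model.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def decode_bio(tokens, tags, sep=" "):
    """Collect (entity_type, text) spans from BIO tags such as B-DRUG / I-DRUG / O."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity still being built.
            if current_tokens:
                spans.append((current_type, sep.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" tag or an inconsistent I- tag: close the open entity.
            if current_tokens:
                spans.append((current_type, sep.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, sep.join(current_tokens)))
    return spans


def to_semi_structured(spans):
    """Group entity mentions by type: the semi-structured record for one summary."""
    record = defaultdict(list)
    for entity_type, text in spans:
        record[entity_type].append(text)
    return dict(record)


def similarity(a, b):
    """Placeholder for the ERNIE-based matcher (assumption): character-level ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_covariates(record, covariates, threshold=0.8):
    """Map each covariate to 1 if any mention of its entity type matches, else 0."""
    row = {}
    for name, (entity_type, reference) in covariates.items():
        mentions = record.get(entity_type, [])
        row[name] = int(any(similarity(m, reference) >= threshold for m in mentions))
    return row


# Hypothetical model output for one sentence of a discharge summary.
tokens = ["Patient", "has", "atrial", "fibrillation", "and", "takes", "aspirin", "daily"]
tags = ["O", "O", "B-DISEASE", "I-DISEASE", "O", "O", "B-DRUG", "O"]

# Hypothetical covariate definitions: name -> (entity type, reference term).
covariates = {
    "atrial_fibrillation": ("DISEASE", "atrial fibrillation"),
    "aspirin_use": ("DRUG", "Aspirin"),
    "statin_use": ("DRUG", "atorvastatin"),
}

record = to_semi_structured(decode_bio(tokens, tags))
row = extract_covariates(record, covariates)
print(record)
print(row)
```

Each `row` corresponds to one patient record in the structured database. A string-overlap placeholder would miss synonyms and abbreviations that a learned similarity model is meant to handle, so it only shows where such a model plugs in.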