Performance of domestic large language models and neurologists in question answering on exercise interventions for mild cognitive impairment: a comparative study

Fund project: Supported by the Dengfeng Talent Program of the Department of Nursing, Naval Medical University (2022KYD07).

    Abstract:

    Objective To evaluate the performance of mainstream domestic open-source large language models (LLMs) in medical question answering on exercise interventions for mild cognitive impairment (MCI), and to compare their answers with those of neurologists, in order to explore the potential value of LLMs in clinical decision support.

    Methods A question bank was constructed from multi-source data, yielding 25 exercise-related questions on MCI in 3 dimensions: exercise type (questions 1-8), exercise program (questions 9-19), and exercise safety (questions 20-25). First, 12 neurologists (3 each at the junior, intermediate, associate senior, and senior professional levels) answered each question independently. Each question was then posed 3 times to 5 LLMs. Three senior neurologists scored every answer against the best available evidence, and the consistency rates of LLM and physician answers, and the differences between them, were analyzed.

    Results Among the 5 LLMs, Kimi-K2 had the highest proportion of answers fully consistent with the best-evidence recommendations (84%, 21/25). Among clinicians, the proportion of fully consistent answers rose with clinical seniority: chief physicians ranked highest (96%, 24/25), followed by associate chief physicians (88%, 22/25) and attending physicians (84%, 21/25). The overall mean score of chief physicians was higher than that of Wenxin Yiyan X1-Turbo, Tongyi Qianwen-max-latest, and DeepSeek-V3.1, and each difference was statistically significant (all P<0.05). In the exercise program dimension, scores varied considerably across physicians of different seniority and across LLMs.

    Conclusion In question answering on exercise interventions for MCI, LLMs perform at a level close to that of junior physicians but still fall significantly short of senior physicians; in particular, their answers on designing exercise programs are not yet stable. At present, LLMs cannot replace senior physicians in clinical decision-making.
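
    The proportions quoted in the Results, and the response counts implied by the Methods, can be checked with simple arithmetic. The short Python sketch below is illustrative only and is not part of the study: the variable names are hypothetical, and the response totals assume that every one of the 5 LLMs received each question 3 times and that every physician answered each question once.

```python
from fractions import Fraction

TOTAL_QUESTIONS = 25  # size of the question bank stated in the abstract

# Numbers of fully consistent answers reported in the abstract.
fully_consistent = {
    "Kimi-K2": 21,
    "chief physicians": 24,
    "associate chief physicians": 22,
    "attending physicians": 21,
}

def percent(count, total=TOTAL_QUESTIONS):
    """Return count/total as an exact percentage."""
    return Fraction(count, total) * 100

rates = {name: percent(n) for name, n in fully_consistent.items()}

# Question ranges per dimension, as listed in the Methods.
dimensions = {
    "exercise type": range(1, 9),
    "exercise program": range(9, 20),
    "exercise safety": range(20, 26),
}
questions_per_dimension = {name: len(r) for name, r in dimensions.items()}

# Derived totals (assumptions, not figures reported by the authors).
llm_responses = 5 * 3 * TOTAL_QUESTIONS   # 5 LLMs x 3 repetitions x 25 questions
physician_responses = 12 * TOTAL_QUESTIONS  # 12 neurologists x 25 questions

for name, rate in rates.items():
    print(f"{name}: {float(rate):.0f}%")
```

    Under these assumptions the three dimensions contain 8, 11, and 6 questions, and the raters would have scored 375 LLM responses and 300 physician responses.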

History
  • Received: 2025-08-26
  • Published online: 2026-04-18
  • Publication date: 2026-04-20
Journal: 《海军军医大学学报》 (Academic Journal of Naval Medical University)