Abstract: Objective To evaluate the performance of mainstream open-source large language models (LLMs) in medical question answering on exercise interventions for mild cognitive impairment (MCI), and to compare their answers with those of neurologists, in order to explore the potential value of LLMs in clinical decision support. Methods A question bank was constructed from multi-source data, yielding 25 exercise-related questions on MCI across 3 dimensions: exercise type (questions 1-8), exercise program (questions 9-19), and exercise safety (questions 20-25). First, 12 neurologists with different professional titles (3 each at the junior, intermediate, associate senior, and senior levels) independently answered each question. Then, each question was posed 3 times to each of 5 LLMs. Three senior neurologists scored the answers against the best available evidence, and the consistency rates and the differences between LLM and physician answers were analyzed. Results Among the LLMs, Kimi-K2 achieved the highest rate of complete consistency with the best evidence (84%, 21/25). Among clinicians, the rate of complete consistency increased with professional title: chief physicians ranked highest (96%, 24/25), followed by associate chief physicians (88%, 22/25) and attending physicians (84%, 21/25). The overall mean score of chief physicians was significantly higher than those of Wenxin Yiyan X1-Turbo, Tongyi Qianwen-max-latest, and DeepSeek-V3.1 (all P < 0.05). In the exercise program dimension, performance varied considerably among the LLMs and among physicians at different professional levels. Conclusion In question answering on exercise interventions for MCI, LLMs perform comparably to junior physicians but remain significantly inferior to senior physicians, particularly in the consistency of exercise program design. LLMs cannot yet replace senior physicians in clinical decision-making.