Bayesian network optimized random forest imputation algorithm

Fund Project:

Supported by the Emerging Interdisciplinary Research Project of the Shanghai Municipal Health Commission (2022JC011) and the Shanghai Industrial Collaborative Innovation Project (2021-cyxt1-kj10).

    Abstract:

    Objective To evaluate and improve methods for handling missing data so as to enhance the performance of prediction models for binary outcomes. Methods Missing-data scenarios were simulated, and the effects of direct deletion, mean imputation, random forest (RF) imputation, and multiple imputation-random forest (MI-RF) on the performance of the prediction model were jointly evaluated using the area under the receiver operating characteristic curve (AUC) and the root mean square error (RMSE). A Bayesian network was then introduced into the random forest imputation algorithm to optimize imputation by exploiting the correlations between variables. Results Across the missing proportions tested, both AUC and RMSE indicated that the Bayesian network optimized random forest (BN-RF) imputation algorithm performed best. When the missing proportion was 10%-20%, the imputation methods improved the prediction model to roughly the same degree; when it was 30%-40%, RF imputation outperformed mean imputation and was slightly better than MI-RF, with only BN-RF performing better; when it approached 50%, model performance remained good, but the imputed data gradually deviated from the characteristics of the true data and the usability of the model declined. Conclusion BN-RF imputation performs well overall and may be given priority when the proportion of randomly missing data is 30%-40%.
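As a rough illustration of the evaluation protocol the abstract describes (simulate missingness, impute, then score the imputed values by RMSE and the downstream predictions by AUC), a minimal Python sketch is given here. It is not the authors' implementation: mean imputation stands in for the imputer, the sum of the features stands in for the prediction model, the data are synthetic, and every function name is hypothetical. An RF, MI-RF, or BN-RF imputer would slot in where `mean_impute` is called.

```python
import math
import random


def mask_completely_at_random(rows, proportion, rng):
    """Return a copy of rows with roughly `proportion` of cells set to None."""
    return [[None if rng.random() < proportion else v for v in row] for row in rows]


def mean_impute(rows):
    """Fill each None with the mean of the observed values in its column."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def imputation_rmse(true_rows, masked_rows, imputed_rows):
    """RMSE between true and imputed values, computed over masked cells only."""
    sq = [(t - i) ** 2
          for tr, mr, ir in zip(true_rows, masked_rows, imputed_rows)
          for t, m, i in zip(tr, mr, ir) if m is None]
    return math.sqrt(sum(sq) / len(sq)) if sq else 0.0


def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic two-feature data with a binary outcome (assumption, for illustration only).
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
y = [1 if a + b + rng.gauss(0, 0.5) > 0 else 0 for a, b in X]

for proportion in (0.1, 0.3, 0.5):
    masked = mask_completely_at_random(X, proportion, rng)
    imputed = mean_impute(masked)  # swap in another imputer here
    rmse = imputation_rmse(X, masked, imputed)
    auc = roc_auc(y, [a + b for a, b in imputed])  # stand-in "model": sum of features
    print(f"missing={proportion:.0%} RMSE={rmse:.3f} AUC={auc:.3f}")
```

Reporting RMSE alongside AUC matters for the reason the abstract gives: a model can keep a reasonable AUC while the imputed values drift away from the true data, and only the RMSE on the masked cells exposes that drift.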

History
  • Received: 2023-06-14
  • Last revised: 2023-08-24
  • Published online: 2025-02-22
  • Publication date: 2025-02-20
Editorial Office of Academic Journal of Naval Medical University