Abstract: Objective To evaluate and improve missing-data imputation methods in order to enhance the performance of binary classification prediction models. Methods Missing-data scenarios were simulated, and the effects of direct deletion, mean imputation, random forest (RF) imputation, and multiple imputation-random forest (MI-RF) on prediction model performance were evaluated jointly using the area under the receiver operating characteristic curve (AUC) and the root mean square error (RMSE). A Bayesian network was then introduced into the RF imputation algorithm to optimize imputation by exploiting the correlations between variables. Results Across the missing proportions tested, both AUC and RMSE indicated that the Bayesian network-optimized random forest (BN-RF) imputation algorithm performed better than the other methods. When the missing proportion was 10%-20%, the imputation methods improved the prediction model to roughly the same degree. When it was 30%-40%, RF (leaving BN-RF aside) was more effective than mean imputation and slightly better than MI-RF. When it approached 50%, however, the imputed data gradually deviated from the characteristics of the true data even though model performance remained acceptable, which reduced the usability of the model. Conclusion BN-RF performs satisfactorily overall and should be preferred when 30%-40% of the data are randomly missing.
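The evaluation loop described in the Methods (mask known values, impute them, compare against the truth by RMSE) can be illustrated with a minimal sketch. It is not the study's implementation: it covers only mean imputation on one synthetic numeric variable, assumes values are removed completely at random, and the function names (`mask_at_random`, `mean_impute`, `rmse_on_missing`), the synthetic data, and the tested proportions are hypothetical choices made for illustration.

```python
import math
import random


def mask_at_random(values, proportion, rng):
    """Return a copy of `values` with about `proportion` of entries set to None."""
    n_missing = round(len(values) * proportion)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def mean_impute(values):
    """Replace every None with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def rmse_on_missing(true_values, masked, imputed):
    """RMSE between true and imputed values, computed only where data were masked."""
    squared_errors = [
        (t - p) ** 2
        for t, m, p in zip(true_values, masked, imputed)
        if m is None
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic "complete" variable (an assumption; the study's data are not shown here).
rng = random.Random(0)
true_values = [rng.gauss(50.0, 10.0) for _ in range(200)]

results = {}
for proportion in (0.1, 0.3, 0.5):
    masked = mask_at_random(true_values, proportion, rng)
    imputed = mean_impute(masked)
    results[proportion] = rmse_on_missing(true_values, masked, imputed)
    print(f"missing {proportion:.0%}: RMSE = {results[proportion]:.2f}")
```

Swapping `mean_impute` for another imputer with the same input and output shape would let the same loop compare methods at each missing proportion. The AUC side of the evaluation would additionally require fitting the classifier on each imputed dataset.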