Use of Stratified Cascade Learning to predict hospitalization risk with only socioeconomic factors
Abstract
Background and objective: Published models that predict health-related outcomes rely on clinical, claims, and social determinants of health (SDH) data. To address the challenge of predicting with SDH alone, we developed a novel framework termed Stratified Cascade Learning (SCL) and used it to predict the risk of hospitalization (ROH). Materials and methods: The variable set comprises 27 SDH variables plus “age” and “sex” for a cohort of diabetic patients. The SCL model uses three sub-models: SM1 (built on the whole training set) stratifies the training set into “predictable” and “unpredictable” subsets; SM2 (built on the whole training set) classifies test-set patients as “predictable” or “unpredictable”; and SM3 (built on only the “predictable” subset) predicts the ROH for the patients classified as “predictable” by SM2. Results: For the “predictable” subset, the SCL model does not improve either the AUC or the NPV of the basic classifier, but it materially improves accuracy and specificity at the expense of lower sensitivity. Optimizing the risk thresholds of the sub-models does not noticeably change the AUC or NPV, but it further improves accuracy and specificity at the expense of a further reduction in sensitivity. Conclusion: Because the SCL model yields low sensitivity, it fails to identify high-risk patients. Its high specificity, however, can be useful when the objective is to eliminate low-risk patients as candidates for further testing or treatment. The use of SCL is not limited to healthcare; it can be applied to any predictive modeling problem in which reliable predictions can be made for only a fraction of the incoming data.
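The three-stage cascade described in the methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration: the basic classifier is a tiny hand-written logistic regression (`fit_logistic`), a training patient is labeled “predictable” when SM1's thresholded in-sample prediction matches the observed outcome, the thresholds `t1`, `t2`, and `t3` default to 0.5, and the data are synthetic. The names `fit_logistic` and `fit_scl` are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Clamp z to avoid overflow in math.exp for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, epochs=200, lr=0.1):
    """Stand-in basic classifier (assumption): logistic regression via SGD."""
    if not X:
        raise ValueError("empty training set")
    w = [0.0] * (len(X[0]) + 1)  # last entry is the bias term
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(w[-1] + sum(wj * xj for wj, xj in zip(w, xi)))
            g = p - yi  # gradient of the log loss with respect to z
            for j, xj in enumerate(xi):
                w[j] -= lr * g * xj
            w[-1] -= lr * g
    return lambda x: sigmoid(w[-1] + sum(wj * xj for wj, xj in zip(w, x)))

def fit_scl(X, y, t1=0.5, t2=0.5, t3=0.5):
    """Sketch of the SM1 -> SM2 -> SM3 cascade under the stated assumptions."""
    # SM1: built on the whole training set; used to stratify it.
    sm1 = fit_logistic(X, y)
    # Assumed stratification rule: "predictable" = SM1 got the outcome right.
    flag = [int((sm1(x) >= t1) == bool(yi)) for x, yi in zip(X, y)]
    # SM2: built on the whole training set; target is the predictable flag.
    sm2 = fit_logistic(X, flag)
    # SM3: built only on the "predictable" subset; target is hospitalization.
    Xp = [x for x, f in zip(X, flag) if f]
    yp = [yi for yi, f in zip(y, flag) if f]
    sm3 = fit_logistic(Xp, yp)

    def predict(x):
        if sm2(x) < t2:
            return None  # classified "unpredictable": no ROH prediction made
        return int(sm3(x) >= t3)
    return predict

# Synthetic demo data (illustrative only, not the study cohort).
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
y = [int(x[0] + random.gauss(0, 1) > 0) for x in X]

model = fit_scl(X[:150], y[:150])
preds = [model(x) for x in X[150:]]
covered = sum(p is not None for p in preds)
print(f"predictions made for {covered} of {len(preds)} test patients")
```

Returning `None` for patients routed to the “unpredictable” group makes the partial coverage explicit: metrics such as specificity would then be computed only over the patients for whom SM3 produced a prediction. A more careful variant might derive the “predictable” labels from out-of-fold rather than in-sample SM1 predictions.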