Our Machine Learning Pipeline
1. Dataset Retrieval
Datasets were retrieved from the U.S. Centers for Disease Control and Prevention (CDC) Environmental Justice Index & the Environmental Protection Agency (EPA) Smart Location Database.
2. Feature Preprocessing
150+ features preprocessed to 41 inputs spanning the Sociodemographic, Economic, Environment/Geographic, and Environmental Justice categories & 5 outputs spanning Asthma, Cancer, Diabetes, Hypertension, and Poor Mental Health.
3. Model Training
Custom Cubist M5 Model Tree(Quinlan, 1992) fine-tuned through 100-repetition randomized grid search & manual tuning.
4. Model Analysis
Permutation feature importances and model performance metrics (R2, RMSE, MAPE) were retrieved through 5-fold cross-validation.
5. Model Visualization
Geospatial census tract-wide NCD prevalence predictions, detailing our comparative accuracies to past studies, were modelled through QGIS and Matplotlib.