Our Machine Learning Pipeline

1. Dataset Retrieval

Datasets were retrieved from the U.S. Centers for Disease Control and Prevention (CDC) Environmental Justice Index & the Environmental Protection Agency (EPA) Smart Location Database.

2. Feature Preprocessing

150+ features preprocessed to 41 inputs spanning the Sociodemographic, Economic, Environment/Geographic, and Environmental Justice categories & 5 outputs spanning Asthma, Cancer, Diabetes, Hypertension, and Poor Mental Health.

3. Model Training

Custom Cubist M5 Model Tree^{(Quinlan, 1992)} fine-tuned through 100-repetition randomized grid search & manual tuning.

4. Model Analysis

Permutation feature importances and model performance metrics (R², RMSE, MAPE) were retrieved through 5-fold cross-validation.

5. Model Visualization

Geospatial census tract-wide NCD prevalence predictions, detailing our comparative accuracies to past studies, were modelled through QGIS and Matplotlib.

Research Poster

Expand

Research Paper

Expand