PM2.5 prediction model

TAP developed a two-stage machine learning model to predict daily PM2.5 concentrations with complete spatial coverage. The structure of our model is shown below. The data sources include PM2.5 measurements, satellite aerosol optical depth (AOD) retrievals, online CMAQ simulations, meteorological reanalysis data, land use information, and population distribution. The first-stage model is an extreme gradient boosting (XGB) model that predicts high pollution events. Its training data were resampled with the SMOTE algorithm to balance the proportion of high pollution events and normal events. The second-stage model, also built with XGB, predicts the residual between CMAQ PM2.5 simulations and PM2.5 measurements. Predicting the residual rather than the PM2.5 concentration itself enlarged the response of the predictions to variations in the predictors and thus improved the prediction accuracy. Gaps in the satellite retrievals were filled with a decision-tree-based modeling algorithm.
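The two ideas in the paragraph above, SMOTE-style resampling and residual prediction, can be illustrated with a minimal sketch. This is not the TAP implementation: the function names, the feature tuples, and the sample values are hypothetical, the XGB models are omitted entirely, and only the Python standard library is used.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority-class samples (SMOTE idea).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    Assumes `minority` holds at least two feature tuples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


def residual_target(observed_pm25, cmaq_pm25):
    """Second-stage training target: measurement minus CMAQ simulation."""
    return [obs - sim for obs, sim in zip(observed_pm25, cmaq_pm25)]


def reconstruct_pm25(cmaq_pm25, predicted_residual):
    """Final PM2.5 estimate: CMAQ simulation plus the predicted residual."""
    return [sim + res for sim, res in zip(cmaq_pm25, predicted_residual)]


# Hypothetical feature tuples for the rare "high pollution" class.
high_events = [(150.0, 1.2), (180.0, 0.8), (165.0, 1.0), (210.0, 0.5)]
extra_high_events = smote_oversample(high_events, n_new=6)

# Hypothetical measurements and CMAQ simulations (ug/m3).
residuals = residual_target([80.0, 35.0], [60.0, 50.0])
estimates = reconstruct_pm25([60.0, 50.0], residuals)
```

In a real workflow the synthetic samples would be appended to the first-stage training set, and the residuals would be the response variable of the second-stage model rather than being passed straight back as they are here.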


  • Geng, G., Xiao, Q., Liu, S., Liu, X., Cheng, J., Zheng, Y., Xue, T., Tong, D., Zheng, B., Peng, Y., Huang, X., He, K., & Zhang, Q. (2021). Tracking Air Pollution in China: Near Real-Time PM2.5 Retrievals from Multisource Data Fusion. Environ Sci Technol, 55, 12106-12115. [Link] [PDF]
  • Xiao, Q., Zheng, Y., Geng, G., Chen, C., Huang, X., Che, H., Zhang, X., He, K., & Zhang, Q. (2021). Separating emission and meteorological contribution to PM2.5 trends over East China during 2000–2018. Atmos Chem Phys, 21, 9475-9496. [Link] [PDF]
  • Xiao, Q., Geng, G., Cheng, J., Liang, F., Li, R., Meng, X., Xue, T., Huang, X., Kan, H., Zhang, Q., & He, K. (2021). Evaluation of gap-filling approaches in satellite-based daily PM2.5 prediction models. Atmos Environ, 244, 117921. [Link] [PDF]

O3 prediction model

TAP developed a machine learning model to predict full-coverage daily maximum 8-h average O3 concentrations by fusing data from multiple sources. The model structure is shown below. The data sources include O3 measurements, satellite O3 vertical distribution profiles, CMAQ simulations, WRF meteorological simulations, the Normalized Difference Vegetation Index (NDVI), night light, and population distribution. First, two random forest models were developed to describe the associations between O3 measurements and the predictors, one with and one without the satellite O3 vertical distribution profiles. Due to gaps in the satellite retrievals, the predictions from the model with satellite retrievals were spatiotemporally discontinuous. Then an elastic-net model was developed to fuse the predictions from the two random forest models, both to improve the prediction accuracy and to fill the gaps. In the last step, the residual of the fused predictions was modeled with spatiotemporal Kriging interpolation.
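The fusion step can be sketched as follows. This is a rough illustration under stated assumptions, not the published implementation: the elastic net is fitted with a small hand-written coordinate descent, the function names and numbers are hypothetical, and the fallback to the no-satellite prediction where the satellite-based prediction is missing is an assumption made here for illustration, since the text above does not specify how gaps are handled. The random forest and Kriging steps are not shown.

```python
def soft_threshold(value, threshold):
    """Soft-thresholding operator used for the L1 part of the penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(rows, targets, alpha=0.1, l1_ratio=0.5, sweeps=200):
    """Fit an elastic net by cyclic coordinate descent on centred data.

    Minimises (1/2n)*||y - Xw - b||^2 + alpha*l1_ratio*||w||_1
              + 0.5*alpha*(1 - l1_ratio)*||w||^2.
    Returns (weights, intercept).
    """
    n, p = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(targets) / n
    x = [[r[j] - col_means[j] for j in range(p)] for r in rows]
    y = [t - y_mean for t in targets]
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of column j with the residual that excludes column j
            rho = sum(
                x[i][j] * (y[i] - sum(x[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            denom = sum(x[i][j] ** 2 for i in range(n)) / n + alpha * (1 - l1_ratio)
            w[j] = soft_threshold(rho, alpha * l1_ratio) / denom if denom else 0.0
    intercept = y_mean - sum(w[j] * col_means[j] for j in range(p))
    return w, intercept


def fuse(pred_with_sat, pred_no_sat, weights, intercept):
    """Combine the two model predictions into one gap-free series."""
    fused = []
    for with_sat, no_sat in zip(pred_with_sat, pred_no_sat):
        if with_sat is None:
            # ASSUMPTION: fall back to the no-satellite model where the
            # satellite-based prediction is missing.
            fused.append(no_sat)
        else:
            fused.append(intercept + weights[0] * with_sat + weights[1] * no_sat)
    return fused


# Hypothetical training pairs (prediction with satellite, prediction without)
# at monitor locations where both predictions and a measurement exist.
train_rows = [(62.0, 58.0), (80.0, 85.0), (45.0, 40.0), (101.0, 95.0), (70.0, 74.0)]
train_o3 = [60.0, 83.0, 43.0, 99.0, 71.0]
weights, intercept = fit_elastic_net(train_rows, train_o3)

# Full-coverage fusion; None marks a gap in the satellite-based prediction.
fused_o3 = fuse([65.0, None, 90.0], [61.0, 55.0, 88.0], weights, intercept)
```

The elastic-net penalty is a natural fit for this step because the two random forest predictions are strongly correlated, and the L2 term keeps the fitted weights stable in that situation.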


  • Xue, T., Zheng, Y., Geng, G., Xiao, Q., Meng, X., Wang, M., Li, X., Wu, N., Zhang, Q., & Zhu, T. (2020). Estimating spatiotemporal variation in ambient ozone exposure during 2013–2017 using a data-fusion model. Environ Sci Technol, 54, 14877-14888. [Link] [PDF]