The North Carolina Journal of Mathematics and Statistics

EPA Particulate Matter Data - Analyses using Local Control Strategy

Robert Lincoln Obenchain, Sidney Stanley Young

Abstract


Statistical Learning methodology for analysis of large collections of observational, cross-sectional data can be most effective when the approach used is both Nonparametric and Unsupervised. We illustrate this using "Local Control Strategy" on 2016 US environmental epidemiology data that we have contributed to Dryad. We invite researchers to download our CSV file, apply whatever methodology they wish, and contribute to development of a broad-based "consensus view" of potential effects of Secondary Organic Aerosols (Volatile Organic Compounds that have predominantly Biogenic or Anthropogenic origin) within PM2.5 particulate matter on Circulatory and/or Respiratory mortality. Our analyses here focus on the question: "Can life in a region with relatively high airborne Biogenic particulate matter also be relatively dangerous in terms of Circulatory and/or Respiratory Mortality?"


Keywords


Nonparametric Unsupervised Learning; Local Control Strategy; Clustering as Matching; Permutation Distributions; Random Forests and Partial Dependence Plots.
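
As a rough illustration of the "Clustering as Matching" and "Permutation Distributions" ideas named above, the following is a minimal sketch and not the authors' implementation (their software is the R package LocalControlStrategy). It uses synthetic data rather than the Dryad CSV file, and every name in it (`confounder`, `exposure`, `outcome`, `blocks_by_covariate`, `run_demo`) is hypothetical. Equal-size blocks on one sorted covariate stand in for clustering in covariate space. The within-block slope of outcome on exposure plays the role of a local effect-size estimate, and reshuffling block labels gives a reference distribution for what those local estimates would look like if the blocking were uninformative.

```python
import random
import statistics


def slope(xs, ys):
    """Least-squares slope of ys on xs; None if xs has no spread."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return None
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


def blocks_by_covariate(z, n_blocks):
    """Crude stand-in for clustering: equal-size blocks of units sorted by z."""
    order = sorted(range(len(z)), key=lambda i: z[i])
    labels = [0] * len(z)
    for rank, i in enumerate(order):
        labels[i] = rank * n_blocks // len(z)
    return labels


def mean_local_slope(labels, exposure, outcome):
    """Average of the within-block slopes of outcome on exposure."""
    groups = {}
    for lab, x, y in zip(labels, exposure, outcome):
        xs, ys = groups.setdefault(lab, ([], []))
        xs.append(x)
        ys.append(y)
    slopes = [slope(xs, ys) for xs, ys in groups.values() if len(xs) >= 3]
    return statistics.fmean([s for s in slopes if s is not None])


def run_demo(seed=0, n=600, n_blocks=20, n_perm=200):
    rng = random.Random(seed)
    # Synthetic data (an assumption for illustration): the confounder drives
    # both exposure and outcome; exposure has no direct effect on outcome.
    confounder = [rng.random() for _ in range(n)]
    exposure = [2.0 * z + rng.gauss(0.0, 0.2) for z in confounder]
    outcome = [3.0 * z + rng.gauss(0.0, 0.3) for z in confounder]

    labels = blocks_by_covariate(confounder, n_blocks)
    observed = mean_local_slope(labels, exposure, outcome)

    # Reference distribution: same block sizes, but membership assigned at
    # random, so the blocks no longer match units on the confounder.
    null_means = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        null_means.append(mean_local_slope(shuffled, exposure, outcome))

    # Share of random-label means at or below the observed mean.
    p_value = sum(m <= observed for m in null_means) / n_perm
    return {
        "global_slope": slope(exposure, outcome),
        "observed_mean_local_slope": observed,
        "null_means": null_means,
        "p_value": p_value,
    }


if __name__ == "__main__":
    res = run_demo()
    print("global slope:", round(res["global_slope"], 3))
    print("mean local slope:", round(res["observed_mean_local_slope"], 3))
    print("permutation share at or below observed:", res["p_value"])
```

In this toy setting the pooled slope is driven by the confounder, while the within-block slopes shrink toward zero once units are compared only with similar units. A clear gap between the observed local estimates and the random-label reference distribution is the kind of evidence that the blocking variables matter.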

