20-methods.qmd

## Methods

### Study Design 

We conducted a cross-sectional retrospective study concerning cases of SUID in census tracts/"communities" of Cook County, IL.

We used the concept of "communities" in order to:

- Increase the geographic catchment area for case counts (incidence in census tracts was too low to make meaningful case count predictions)
- Lend semantic meaning to the geographic unit of analysis (being able to name each unit rather than using a numeric identifier)

Within Chicago city limits, communities were spatial joins of census tracts based on the "suburb" tag (or "neighborhood" if "suburb" was not listed) provided by the OpenCage geocoding service.

Outside Chicago city limits, communities were spatial joins of census tracts based on the highest designation available for the "city", then "town", then "suburb", then "village" tags.

Although aggregating to "communities" lended the benefits listed above, it came at the cost of less precision for covariates, which were aggregated by taking simple sums of their estimates at the census tract level without taking into account variations in their margins of error.

### Primary Outcome

The primary outcome of interest was SUID case count in each community.
We translated The National Institute of Child Health and Human Development's definition of SUID [@CommonSIDSSUID] into a SQL query of the Cook County Medical Examiner Office's Archive [@fig-sql].
Each case identified by the query was validated by team members from the SUID Case Registry for Cook County.
Each validated case was geocoded confidentially using ArcGIS Pro geocoding tools behind the network firewall [@ArcGISPro].
Geocoded cases were spatially joined by intersection with census tracts and aggregated to case counts.
These anonymized case counts at the census tract level, were further aggregated to the community level as described above.

![SQL Query for Identifying Prospective Cases of SUID from the Cook County Medical Examiner's Case Archive](_media/fig_sql.png){#fig-sql}

The "WHERE CASENUMBER IN" clause is censored, but contained a list of additional case numbers recommended for inclusion by the team members from the SUID Case Registry for Cook County.

<!-- SELECT * -->
<!-- INTO #PC_File -->
<!-- FROM [MedExaminer].[dbo].[ME_case] -->
<!-- WHERE (datediff(day,decedent_dob, death_Date) / 365.25) <= 1 -->
<!-- AND year(death_date) IN (2015, 2016, 2017, 2018, 2019) -->
<!-- AND -->
<!-- ( -->
<!-- PRIMARYCAUSE LIKE '%asphyxia%' -->
<!-- OR PRIMARYCAUSE LIKE '%Ashyxia%' -->
<!-- OR PRIMARYCAUSE LIKE '%Suffoc%' -->
<!-- OR -->
<!-- ( -->
<!-- (PRIMARYCAUSE LIKE '%UNDETERMINED%' OR PRIMARYCAUSE LIKE '%undertemined%') -->
<!-- AND (manner LIKE '%UNDETERMINED%' OR manner LIKE '%undertemined%') -->
<!-- ) -->
<!-- ) -->
<!-- UNION -->
<!-- SELECT * FROM MedExaminer.dbo.ME_case -->
<!-- WHERE CASENUMBER IN -->
<!-- ('<CENSORED>') -->


### Temporal Setting

The year of origin for each data variable was context-specific.
SUID Case Counts were observed from 2015-2019.
A descriptive comparison of census tracts with and without SUID present was performed on variables from contemporaneous years.
We retrospectively trained a model predicting SUID Case Counts from 2015-2019 using predictor covariates from 2014.
Using the trained model, we made prospective predictions for SUID Case Counts from 2021-2025 with covariates from 2020.

### Data Sources for Model Predictor Variables

We sourced all covariates from U.S. Census' "American Community Survey" (ACS) and from the Center for Disease Control's associated "Social Vulnerability Index" (SVI) [@AmericanCommunitySurvey; @SocialVulnerabilityIndex].

### Coding Pipeline

We performed all steps in the data pipeline downstream from the SQL query in the R computing environment [@ProgrammingLanguage] using the {tidyverse} suite of packages [@TidyversePackageSuite]. 
We utilized additional data cleaning convenience functions from the {janitor} and {RSocrata} packages [@samfirkeJanitorPackage; @hughdevlinRSocrataPackage]. 
The pipeline was orchestrated by specification using the {targets} package [@williammichaellandauTargetsPackage].

We generated data tables for publication using the {gt} package and its companion {gtsummary} [@richardiannoneGtPackage; @danield.sjobergGtsummaryPackage].

We performed geospatial manipulation and mapping with the {sf}, {leaflet}, {tmap}, {opencage}, and {tigris} packages [@edzerpebesmaSfPackage; @LeafletPackage; @tennekesTmapThematicMaps; @danielpossenriedeOpencagePackage; @kylewalkerTigrisPackage] along with their back-end infrastructure [@CARTOBasemapStyles; @frankwarmerdamGDAL; @GEOS; @volodymyragafonkinLeaflet; @OpenCageGeocodingAPI; @OpenStreetMap; @PROJ; @TIGERDataProducts]. 

### Statistical Analysis

#### Descriptive Comparison

We compared communities with and without observed cases of SUID using median values (and interquartile ranges) of demographic variables. 
Median values were reported because most of the variables did not approximate normal distributions.

#### Predictive Modeling

We modeled SUID case counts in each census tract using maximum likelihood estimation via the {MASS} R package [@brianripleyMASSPackage]. 
We used the negative binomial family of generalized linear models instead of the Poisson family because overdispersion was detected.
Selection of model form and predictors along with evaluation of performance is described in detail in Supplement 1.
Briefly, we calculated Pearson correlation coefficients between SUID count and every variable in the pool used to calculate SVI [@SocialVulnerabilityIndex], which has four thematic domains.
One to two promising variables from each SVI thematic domain were assessed in an additive step-wise fashion with the performance parameters Akaike/Bayesian Information Criteria (AIC; BIC), root mean squared error (RMSE), and Nagelkerke's R<sup>2</sup> using the {EasyStats} suite of R packages [@danielludeckeEasystatsPackageSuite].
All predictor variables were log-transformed in the final model in order to reduce the influence of extreme values.

<!-- Data was divided into spatially-clustered folds using {spatialsample} package. 80% of the data was apportioned from the each of the folds to training the model and 20% to testing performance. -->