Utility to Analytics group of data for fields frequently missing from CDM submissions #26

Open
DaveraGabriel opened this issue Jun 1, 2020 · 2 comments
Labels
Harmonization & Analytics Issues which involve both Data Ingestion & Harmonization & Analytics workstreams

Comments

@DaveraGabriel

PCORnet captures Provider and Provider IDs in the data set. ACT does not. Discussions in the mapping validation meetings have questioned the utility of a field in the N3C data set when its data are systematically missing, as with Provider.

The same applies to Immunization in PCORnet: some sites may populate it and others may not.

@hlehmann17

SiteID is one field whose missingness we control.
We all agree that the data should carry no explicit indication of their source, while recognizing that ZIP codes in the patient data will be a strong indicator in cities with only one data donor.

We have two options: (1) delete siteIDs from data uploaded to Palantir, or (2) retain siteIDs in the data, but adopt a standard operating procedure under which no analysis presents results that enable reidentification of the site.

(1) Delete SiteIDs
Pro: Prevents mischief and inadvertent reidentification
Con: Hamstrings the analyses (see option (2))

(2) Retain IDs
Pro: Enables the following types of analysis, recommended by analysts of Real World Evidence (e.g., FDA workshop Oct 2019):
Variability that swamps the effect [confounding]
Causal mediator
Semantics of missing (→Missing at random or not)
Range of data obfuscation, for comparability
Propensity score
?instrumental variable
Negative control

@hlehmann17

(2) Retain site IDs (continued)
Pro: Enables the following types of analysis, recommended by analysts of Real World Evidence (e.g., FDA workshop Oct 2019):
Variability that swamps the effect [confounding]
Variability across sites includes biomedical issues (differences in prevalence in the surrounding community, differences in medical practice) as well as informatics issues (differences in how codes are used). If this variability is greater than the COVID signal we are seeking, we won't see the signal. But we cannot represent, and are not representing, each of these differences. So a designation of "site" is the minimum we can do to account for this noise.
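A minimal sketch of how site-level noise could be quantified, assuming each record can still be grouped by a site designation. It splits the total sum of squares of one measurement into a between-site part and a within-site part. The function name `variance_components`, the site labels, and the numbers are all hypothetical and for illustration only.

```python
from statistics import mean

def variance_components(values_by_site):
    """Split the total sum of squares into between-site and within-site parts."""
    all_values = [v for vals in values_by_site.values() for v in vals]
    grand_mean = mean(all_values)
    # Between-site: how far each site's mean sits from the overall mean.
    between = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2
        for vals in values_by_site.values()
    )
    # Within-site: spread of individual values around their own site's mean.
    within = sum(
        (v - mean(vals)) ** 2
        for vals in values_by_site.values()
        for v in vals
    )
    return between, within

# Hypothetical measurements for three sites (made-up numbers).
data = {
    "site_a": [1.0, 2.0, 3.0],
    "site_b": [4.0, 5.0, 6.0],
    "site_c": [7.0, 8.0, 9.0],
}
between, within = variance_components(data)
share_between = between / (between + within)
print(f"share of variation attributable to site: {share_between:.2f}")
```

Without a site designation the between-site term cannot be computed at all, and it simply folds into unexplained noise.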
Causal mediator
More than just a confounder, differences across sites may have biomedical impact, as noted above. It would be a shame (or more) to eliminate an explicit, relevant causal factor.
Semantics of missing (→Missing at random or not)
There will be a strong temptation to impute missing data. The first decision is whether data are missing at random or not at random. Looking at missingness by site could help in that decision.
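As a rough sketch of "looking at missingness by site", assuming records are dictionaries with a site key: the snippet below tabulates the fraction of records at each site where a field is empty. The names `missing_rate_by_site`, `site_id`, and `provider_id` are hypothetical, and the records are made up.

```python
from collections import defaultdict

def missing_rate_by_site(records, field, site_key="site_id"):
    """Return {site: fraction of that site's records where `field` is None or absent}."""
    counts = defaultdict(lambda: [0, 0])  # site -> [missing, total]
    for rec in records:
        tally = counts[rec[site_key]]
        tally[1] += 1
        if rec.get(field) is None:
            tally[0] += 1
    return {site: missing / total for site, (missing, total) in counts.items()}

# Made-up records: site A always populates the field, site B never does.
records = [
    {"site_id": "A", "provider_id": "p1"},
    {"site_id": "A", "provider_id": "p2"},
    {"site_id": "B", "provider_id": None},
    {"site_id": "B", "provider_id": None},
    {"site_id": "C", "provider_id": "p3"},
    {"site_id": "C", "provider_id": None},
]
rates = missing_rate_by_site(records, "provider_id")
print(rates)
```

Rates that cluster near 0 or 1 by site would suggest the field is missing because of site practice rather than at random, which argues against naive imputation.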
Range of data obfuscation, for comparability
It may be (and needs to be checked) that different sites use different ranges for obfuscation, and knowledge of that range may help (or dissuade) time-based analyses.
Propensity score
In the creation of controls, propensity scores will be important (either for matching or simply as a covariate). Going back to the variability discussed earlier, site identity will be an important component in building such scores.
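To illustrate the simplest possible case, assume site is the only covariate. A propensity model saturated in site then reduces to the observed fraction of treated patients at each site. The sketch below computes exactly that; a real score would add patient-level covariates. `propensity_by_site`, `site_id`, and `treated` are hypothetical names, and the rows are made up.

```python
from collections import Counter

def propensity_by_site(rows):
    """Estimate P(treated | site) as the treated fraction within each site."""
    total = Counter(r["site_id"] for r in rows)
    treated = Counter(r["site_id"] for r in rows if r["treated"])
    return {site: treated[site] / n for site, n in total.items()}

# Made-up cohort: treatment is far more common at site A than at site B.
rows = (
    [{"site_id": "A", "treated": True}] * 3
    + [{"site_id": "A", "treated": False}]
    + [{"site_id": "B", "treated": True}]
    + [{"site_id": "B", "treated": False}] * 3
)
scores = propensity_by_site(rows)
print(scores)
```

If siteID were deleted, every patient in this toy cohort would receive the same pooled score of 0.5, and the difference in practice between the two sites would be invisible to the score.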
?instrumental variable
I'm not sure whether siteID itself can count as an instrumental variable (going back to site as causal mediator).
Negative control
Controls almost certainly have to be constructed within sites. Eliminating siteID eliminates that possibility.
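A minimal sketch of constructing controls within sites, assuming greedy 1:1 matching on a single covariate (age) as a stand-in for whatever matching scheme is eventually chosen. All names here (`match_within_site`, `site_id`, `age`, the IDs) are hypothetical and the data are made up.

```python
def match_within_site(cases, controls):
    """Greedy 1:1 matching on age, restricted to controls from the case's own site."""
    available = list(controls)
    pairs = []
    for case in cases:
        same_site = [c for c in available if c["site_id"] == case["site_id"]]
        if not same_site:
            continue  # no eligible control at this site; the case stays unmatched
        best = min(same_site, key=lambda c: abs(c["age"] - case["age"]))
        available.remove(best)  # each control is used at most once
        pairs.append((case["id"], best["id"]))
    return pairs

cases = [
    {"id": "case1", "site_id": "A", "age": 60},
    {"id": "case2", "site_id": "B", "age": 40},
]
controls = [
    {"id": "ctrl1", "site_id": "A", "age": 30},
    {"id": "ctrl2", "site_id": "A", "age": 58},
    {"id": "ctrl3", "site_id": "B", "age": 70},
    {"id": "ctrl4", "site_id": "C", "age": 40},
]
pairs = match_within_site(cases, controls)
print(pairs)
```

Note that `case2` is paired with `ctrl3` from its own site even though `ctrl4` at another site is an exact age match. The `same_site` filter is the step that becomes impossible once siteID is removed.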

@DaveraGabriel DaveraGabriel added the Harmonization & Analytics Issues which involve both Data Ingestion & Harmonization & Analytics workstreams label Jun 25, 2020