Deal with large number of null values in NSSP data #2130

Open

melange396 opened this issue Feb 28, 2025 · 1 comment

Labels: API change, bug, data quality
Roni and Dmytro noticed this and brought it to my attention: flat lines for NSSP signals for NY and Los Angeles in epivis. The plots show not actual "zero" values but "null" values, which epivis currently plots at 0 (you can verify this by looking at the API responses in your browser's network debugger). After looking into this further, I found the following...

NSSP raw source data is almost all at the county resolution (there is also state- and nation-resolution data, but it is vastly outnumbered by the county data, does not seem to suffer from the same problem, and is ignored from here on). Some of these data points come from the upstream source as "empty", and are then properly interpreted as NULL by the nssp indicators pipeline code and given a missingness annotation. However, there seems to be an inordinately and surprisingly large number of these null/empty/missing values.

Some missing values may be from low-population areas where the data is not well reported or is censored for privacy reasons, but data is also missing from some very large regions (in terms of area and/or population) like Manhattan (New York County, FIPS code 36061) and Los Angeles (FIPS 06037), as seen in the epivis link at the top of this comment.


  • For NSSP data over all available time, as seen on Feb 27, 2025 in the "query data" tool on the CDC website ( https://data.cdc.gov/Public-Health-Surveillance/NSSP-Emergency-Department-Visit-Trajectories-by-St/rdmq-nq56/explore/query/ ):

    • total rows: 384625
    • rows where at least one value column is non-null: 293330
    • rows where all value columns are null: 91295 (24%)
      • note that this number plus the one above it sum to the total: 91295 + 293330 = 384625
    • rows with null in "combined" col: 139460 (36%)
    • rows with null in rsv column: 98750 (26%)
    • rows with null in flu column: 98750 (26%)
    • rows with null in covid column: 98750 (26%)
      • (as you probably noticed, the "combined" column is different, but the rsv/flu/covid column missingness appears to be coincident)
    • so the number of null values that end up in our data set should be: (98750*3 + 139460) / (384625*4) ≈ 28%
  • For newer data, the missingness is even more prevalent... For a targeted date range between 29 Sep 2024 and 27 Feb 2025, we see:

    • total rows: 61540
    • rows where at least one value column is non-null: 45740
    • rows where all value columns are null: 15800 (26%)
      • note 15800 + 45740 = 61540
    • rows with null in "combined" col: 61540 (100% ?!?! perhaps they silently discontinued this)
    • rows with null in rsv column: 15800 (26%)
    • rows with null in flu column: 15800 (26%)
    • rows with null in covid column: 15800 (26%)
      • (you may have noticed that the rsv, flu, and covid column missingness percentage is ~the same as for the whole dataset above)
    • so the number of null values that end up in our data set should be: (15800*3 + 61540) / (61540*4) ≈ 44%

(The above numbers include rows for state-level and national-level data points, but I claim those are negligible compared to the number of counties. A query sketch for reproducing these counts follows below.)
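For anyone who wants to reproduce the counts above, here is a minimal sketch against the Socrata endpoint for dataset rdmq-nq56; the `percent_visits_*` column names are my assumption based on the dataset schema at the time of writing, so double-check them before relying on this:

```python
# Minimal sketch (assumed column names!) to recompute the null counts above
# via the Socrata API for dataset rdmq-nq56.
import requests

BASE = "https://data.cdc.gov/resource/rdmq-nq56.json"
VALUE_COLS = [
    "percent_visits_combined",   # assumed name
    "percent_visits_rsv",        # assumed name
    "percent_visits_influenza",  # assumed name
    "percent_visits_covid",      # assumed name
]

def count_rows(where: str | None = None) -> int:
    """Return a row count for an optional SoQL $where clause."""
    params = {"$select": "count(*) AS n"}
    if where:
        params["$where"] = where
    resp = requests.get(BASE, params=params, timeout=30)
    resp.raise_for_status()
    return int(resp.json()[0]["n"])

total = count_rows()
all_null = count_rows(" AND ".join(f"{c} IS NULL" for c in VALUE_COLS))
print(f"total rows: {total}")
print(f"all value columns null: {all_null} ({all_null / total:.0%})")
for col in VALUE_COLS:
    n = count_rows(f"{col} IS NULL")
    print(f"null in {col}: {n} ({n / total:.0%})")
```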


Whatever the reason for the frequency of null values, I think we should probably exclude them from the database. 44% missingness is nearly the norm rather than the exception, and leaving all those nulls in place is wasteful and potentially misleading.

I've not checked exhaustively yet, but I hypothesize the nulls appear consistently for many of the same locations -- it is certainly true that all of the NYC data is null across the entire time series. If we have no actual data points for a location, I think it is more appropriate to omit all records for that location instead of storing only null values.
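As a rough illustration of that omission rule (not the actual pipeline code), something like the following pandas sketch would do it, assuming a long-format frame with illustrative `geo_id` and `val` columns:

```python
import pandas as pd

def drop_all_null_locations(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every record for locations whose entire time series is null.

    Assumes one row per (geo_id, timestamp) with the signal value in 'val';
    these column names are illustrative, not the pipeline's actual schema.
    """
    # transform("count") broadcasts each location's non-null count back to
    # row shape; keep only locations with at least one real observation.
    has_data = df.groupby("geo_id")["val"].transform("count") > 0
    return df[has_data]
```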

This may be of particular interest to @RoniRos: if we keep the null values, the recently implemented geo coverage for the discovery app will report that nssp includes data for Manhattan, which is effectively erroneous; if we remove the null values, geo coverage will not report that nssp includes any data for Manhattan, which is arguably more accurate.


Another potential concern: while "null-value locations" are appropriately left out of our geo-aggregation computations, I fear that smaller geo types that we aggregate (like MSA, perhaps) might still end up with unrepresentative values if large portions of their population are not accounted for.
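To make that concern concrete, here is a hypothetical population-weighted county-to-MSA rollup that also reports what fraction of the MSA's population actually had data; the column names, crosswalk shape, and weighting scheme are assumptions for illustration, not the pipeline's real interface:

```python
import pandas as pd

def aggregate_to_msa(county_df: pd.DataFrame, crosswalk: pd.DataFrame) -> pd.DataFrame:
    """Population-weighted rollup that surfaces how much population reported.

    county_df: one row per (fips, timestamp) with the signal value in 'val'.
    crosswalk: maps 'fips' to 'msa', with county population in 'pop'.
    All names here are illustrative.
    """
    merged = county_df.merge(crosswalk, on="fips")

    def roll(g: pd.DataFrame) -> pd.Series:
        reporting = g["val"].notna()
        pop_reporting = g.loc[reporting, "pop"].sum()
        val = (
            (g.loc[reporting, "val"] * g.loc[reporting, "pop"]).sum() / pop_reporting
            if pop_reporting > 0
            else float("nan")
        )
        # A pop_coverage near 0 means the MSA value rests on a sliver of its
        # population -- e.g. an MSA missing Los Angeles County entirely.
        return pd.Series({"val": val, "pop_coverage": pop_reporting / g["pop"].sum()})

    return merged.groupby(["msa", "timestamp"]).apply(roll).reset_index()
```

A low pop_coverage value could then be used to suppress, or at least flag, the aggregated value.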

Related issue for epivis "null" plotting: cmu-delphi/www-epivis#91

melange396 added the API change, bug, and data quality labels Feb 28, 2025
aysim319 (Contributor) commented Feb 28, 2025

rows with null in "combined" col: 61540 (100% ?!?! perhaps they silently discontinued this)

Looking at #2081, it does seem that they stopped populating the combined column. Continuing on with the math: if we were to drop the combined column starting from November 4 and recreate it from the existing columns, the math works out as follows:

For newer data, the missingness is even more prevalent... For a targeted date range between 29 Sep 2024 and 27 Feb 2025, we see:

  • total rows: 61540
  • rows where at least one value column is non-null: 45740
  • rows where all value columns are null: 15800 (26%)
    note 15800 + 45740 = 61540
  • but since there are null values for rsv, flu, and covid separately, if we take that into account and filter out entries where any one of the diseases is null, i.e. (date AND rsv is null) OR (date AND covid is null) OR (date AND flu is NULL), we still get 15800 (see the pandas sketch at the end of this comment)
  • rows with null in rsv column: 15800 (26%)
  • rows with null in flu column: 15800 (26%)
  • rows with null in covid column: 15800 (26%)
    (you may have noticed that the rsv, flu, and covid column missingness percentage is ~the same as for the whole dataset above)

so the number of null values that end up in our data set should be: (15800*3 + 15800) / (61540*4) ≈ 26%

At least this should solve part of the problem.
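In pandas terms, the filter described above might look like the following, reusing the (assumed) column names from the query sketch in the issue description:

```python
import pandas as pd

# Hedged sketch: drop rows where any of the three per-disease columns is null.
# Column names are the same assumptions as in the earlier Socrata sketch.
disease_cols = ["percent_visits_rsv", "percent_visits_influenza", "percent_visits_covid"]

def drop_any_disease_null(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows where all three per-disease columns are non-null."""
    return df[df[disease_cols].notna().all(axis=1)]
```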
