Deal with large number of null values in NSSP data #2130
Labels: API change (renames, large changes to calculations, large changes to affected regions); bug (something isn't working); data quality (missing data, weird data, broken data)
Roni and Dmytro noticed this and brought it to my attention: flat lines for NSSP signals for NY and Los Angeles in epivis. What can be seen in the plots are not actually "zero" values but instead all "null" values, which epivis currently plots at 0 (you can verify this by looking in your browser's network debugger for the API responses).

After looking into this more, I found the following. NSSP raw source data is almost all at the county resolution (there is also state and nation resolution data, but those rows are vastly outnumbered by county rows and do not seem to suffer from the same problem, so I will ignore them from here on). Some of these data points come from the upstream source as "empty", and are then properly interpreted as NULL by the nssp indicators pipeline code and given a missingness annotation. However, there seems to be a surprisingly large number of these null/empty/missing values.
Some missing values may be from low-population areas where the data is not well reported or is censored for privacy reasons, but data is also missing from some very large regions (in terms of area and/or population) like Manhattan (New York County, FIPS code 36061) and Los Angeles (FIPS 06037), as seen in the epivis link at the top of this comment.
For NSSP data over all available time, as seen on Feb 27, 2025 in the "query data" tool on the CDC website (https://data.cdc.gov/Public-Health-Surveillance/NSSP-Emergency-Department-Visit-Trajectories-by-St/rdmq-nq56/explore/query/):
For newer data, the missingness is even more prevalent... For a targeted date range between 29 Sep 2024 and 27 Feb 2025, we see:
(The above numbers include rows for state-level and national-level data points, but I claim those are negligible compared to the number of county rows.)
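For anyone who wants to redo this kind of tally on a local extract, here is a minimal sketch of a per-geo-type missingness count. The row layout `(geo_type, geo_id, week_end, value)` and the function name are assumptions for illustration, not the pipeline's actual schema, and the values are toy placeholders rather than real NSSP numbers.

```python
from collections import Counter

# Toy rows (NOT real NSSP values); None stands for an upstream "empty" value.
# The tuple layout (geo_type, geo_id, week_end, value) is an assumption.
rows = [
    ("county", "36061", "2025-02-15", None),
    ("county", "36061", "2025-02-22", None),
    ("county", "06037", "2025-02-22", None),
    ("county", "42003", "2025-02-15", 1.8),
    ("county", "42003", "2025-02-22", 2.1),
    ("state", "pa", "2025-02-22", 2.0),
]


def missingness_by_geo_type(rows):
    """Return {geo_type: fraction of rows whose value is None}."""
    total = Counter()
    missing = Counter()
    for geo_type, _geo_id, _week_end, value in rows:
        total[geo_type] += 1
        if value is None:
            missing[geo_type] += 1
    return {geo_type: missing[geo_type] / total[geo_type] for geo_type in total}


rates = missingness_by_geo_type(rows)
for geo_type, rate in sorted(rates.items()):
    print(f"{geo_type}: {rate:.0%} missing")
```

Splitting the rate by geo type also makes it easy to check the claim that state and national rows barely move the overall figure.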
Whatever the reason for the frequency of null values, I think we should probably exclude them from the database. At 44% missingness, nulls are nearly the norm rather than the exception, and leaving all of them in place is wasteful and potentially misleading.
I have not checked exhaustively yet, but I hypothesize that the nulls appear consistently for many of the same locations; it is certainly true that all of the NYC data is null across the entire time series. If we have no actual data points for a location, I think it is more appropriate to omit all records for that location than to store only null values.
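A rough sketch of how the "same locations are always null" hypothesis could be checked, and how such locations could be dropped before loading. The `(geo_id, week_end, value)` layout and both function names are hypothetical, and the values are toy placeholders.

```python
from collections import defaultdict

# Toy rows (NOT real NSSP values); layout (geo_id, week_end, value) is assumed.
rows = [
    ("36061", "2025-02-15", None),
    ("36061", "2025-02-22", None),
    ("42003", "2025-02-15", 1.8),
    ("42003", "2025-02-22", None),
]


def all_null_locations(rows):
    """Return the set of geo_ids that never have a non-null value."""
    has_value = defaultdict(bool)
    for geo_id, _week_end, value in rows:
        has_value[geo_id] = has_value[geo_id] or (value is not None)
    return {geo_id for geo_id, seen in has_value.items() if not seen}


def drop_all_null_locations(rows):
    """Remove every record belonging to a location with no data at all."""
    empty = all_null_locations(rows)
    return [row for row in rows if row[0] not in empty]


empty = all_null_locations(rows)
kept = drop_all_null_locations(rows)
print(sorted(empty))
print(len(kept))
```

This version only removes locations that are null throughout; the occasional null inside an otherwise populated series is kept. Dropping those too would be a one-line change to the filter.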
This may be of particular interest to @RoniRos : if we keep the null values, the recently implemented geo coverage for the discovery app will report that nssp includes data for Manhattan, which is effectively erroneous; if we remove the null values, geo coverage will not report that nssp includes any data for Manhattan, which is arguably more accurate.
Another potential concern is that while "null-value locations" are appropriately left out of our geoaggregation computations, I fear that the smaller geo types we aggregate to (like MSA, perhaps) might still end up with unrepresentative values if large portions of their population are not accounted for.
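One way to quantify that concern would be to compute, for each aggregate, the share of its population that actually contributed a non-null value. The sketch below assumes we have a county-to-population lookup and a list of member counties per aggregate; all names, ids, and numbers here are made up for illustration and do not reflect the actual geoaggregation code.

```python
def population_coverage(member_ids, populations, values):
    """Fraction of an aggregate's population whose members have a non-null value.

    member_ids:  ids of the counties making up the aggregate (e.g. an MSA)
    populations: {county_id: population}
    values:      {county_id: value or None}; absent ids count as missing
    """
    total = sum(populations[c] for c in member_ids)
    if total == 0:
        return 0.0
    covered = sum(populations[c] for c in member_ids if values.get(c) is not None)
    return covered / total


# Hypothetical aggregate with two member counties (toy ids and populations).
populations = {"county_a": 1000, "county_b": 3000}
values = {"county_a": 1.5, "county_b": None}

coverage = population_coverage(["county_a", "county_b"], populations, values)
print(f"coverage: {coverage:.0%}")
```

An aggregate could then be flagged or suppressed when its coverage falls below some threshold; what that threshold should be is an open question.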
Related issue for epivis "null" plotting: cmu-delphi/www-epivis#91