Loss of good data due to excess Extraneous Same Day exclusions #73

Open
kradimer opened this issue May 27, 2022 · 8 comments · May be fixed by #78
Comments

@kradimer

Hello! First, I love this package. I had been working on cleaning a dataset of over 700K weights using overall z-scores but some bad weights were still getting through. The weighted moving averages are amazing at catching these. Thank you! The only thing that this package is doing that I wish it weren't is throwing out nearly all of the same-day measures. My algorithm was keeping one measure per day with the lowest absolute z-score, whereas growthcleanr threw out 13K such measures. I didn't see anything under the configuration options that would allow me to change that. Am I missing anything? Or might you be able to add an option for keeping one per day if they are not otherwise outliers? Anyway, thank you again for creating and sharing this package. It's fantastic.
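
For reference, the "keep one measure per day with the lowest absolute z-score" approach described above might look like the following minimal sketch. It is not growthcleanr code; the record layout (`subject`, `day`, `weight`, `z`) and the function name `keep_lowest_abs_z` are assumptions made for illustration.

```python
from typing import NamedTuple


class Measure(NamedTuple):
    subject: str   # hypothetical subject identifier
    day: int       # hypothetical age or visit day
    weight: float  # weight in any consistent unit
    z: float       # precomputed overall z-score for this weight


def keep_lowest_abs_z(measures: list[Measure]) -> list[Measure]:
    """Keep one measure per (subject, day): the one with the smallest |z|."""
    best: dict[tuple[str, int], Measure] = {}
    for m in measures:
        key = (m.subject, m.day)
        # The first measure seen for a day, or one closer to z = 0, wins.
        if key not in best or abs(m.z) < abs(best[key].z):
            best[key] = m
    # Return in (subject, day) order for readability.
    return [best[k] for k in sorted(best)]


example = [
    Measure("A", 100, 70.0, 0.10),
    Measure("A", 100, 71.4, 0.25),  # same day, further from z = 0
    Measure("A", 153, 70.5, 0.15),
]
kept = keep_lowest_abs_z(example)
```

Note that this approach always keeps exactly one value per day, regardless of how far apart the same-day values are.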

@dchud
Collaborator

dchud commented Jun 14, 2022

Hi @kradimer, sorry for the slow response. Glad to hear the package is helping you!

Generally, the algorithm should choose one same-day weight to keep while flagging the others. You might be running into an error load issue if your data has many subjects with lots of same-day observations: if the algorithm flags a majority of a subject's observations as implausible, it can flag the rest too, on the assumption that so many errors make the data for that subject too difficult to evaluate at all. If you're seeing exclusions with "Too-Many-Errors" (see https://carriedaymont.github.io/growthcleanr/articles/output.html for the various flavors of this), that could be part of what's happening.

Could you tell us a little more about your dataset? Are these pediatric or adult subjects? And could you perhaps share a summary percentage of the exclusion types growthcleanr is flagging so we can get a sense of whether it seems out of proportion to what we've seen with data we've tested with?

@carriedaymont
Owner

@kradimer Thanks for using growthcleanr and sorry this part is not working well for you.

If it's adult data, this may be the issue (taken from steps # 9H and 10W on the adult algorithm page): "Then, if more than 25% of days for a subject have same day extraneous values OR if there are adjacent days that have same day extraneous values, all remaining same day extraneous values for that subject are excluded." The issue is that if there are too many same day extraneous values for a subject that differ by a non-trivial amount, it is difficult/impossible to reliably choose which of the values on the same day is more likely to be accurate. Same-days should be excluded from error load calculations, but this is a sort of "same-day extraneous load".
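
The quoted rule can be sketched as follows. This is an illustration only, not the growthcleanr implementation: the function name `exceeds_sde_load` and its inputs are hypothetical, and "adjacent days" is assumed here to mean neighbouring entries in the subject's sorted list of measurement days rather than consecutive calendar dates.

```python
def exceeds_sde_load(measurement_days: list[int], sde_days: set[int],
                     max_ratio: float = 0.25) -> bool:
    """Return True if the subject's same-day extraneous (SDE) load is too high.

    measurement_days: days on which the subject has any measurement.
    sde_days: the subset of those days that have same-day extraneous values.
    """
    days = sorted(set(measurement_days))
    if not days:
        return False
    # Condition 1: more than 25% of the subject's days have SDEs.
    ratio = sum(d in sde_days for d in days) / len(days)
    if ratio > max_ratio:
        return True
    # Condition 2 (assumed reading): two neighbouring measurement days
    # both have SDEs.
    return any(a in sde_days and b in sde_days for a, b in zip(days, days[1:]))


days = [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]
isolated = exceeds_sde_load(days, {30})           # 10% of days, no neighbours
neighbours = exceeds_sde_load(days, {30, 60})     # 20% of days, but adjacent
too_many = exceeds_sde_load(days, {0, 60, 120})   # 30% of days
```

In this sketch the ratio is counted over distinct days, not over individual observations.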

If you are able to tell us more about the dataset (if you're still working on this) we can try to figure out if this is in fact the issue, and it may give us an idea for adjusting things for later iterations.

@kradimer
Author

Hi! Thanks for the responses! This is adult data. The immediate context is a prospective cohort study of women diagnosed with breast cancer, in which we are leveraging EHR data. This work may eventually be applied to other studies interested in examining trajectories in weight or other variables, or in estimating those values for a time point at which data are not available.

Out of 747,267 observations, 39,481 were flagged by GCR as same-day extraneous. There were 13,437 rows that I had marked to keep that GCR marked for same-day extraneous exclusion, but 6,109 of these were just differences of opinion about which of two same-day measures to keep (generally within a pound of each other). So that leaves about 7K fewer same-day measures kept by GCR.

Choosing one subject as an example, they have 75 observed weights. There are three days for which this subject has multiple measures, two per day. This is less than 25% of days, and the days with multiple measures are not consecutive.

  • For the first such day, one of the measures is the same as a measurement for the previous day. The measures are 3 lbs apart. Both were excluded.
  • For the second such day, one of the measures is the same as the previous measure, which was 53 days earlier. The measures are 10 lbs apart. Both were excluded.
  • For the third such day, neither of the measures matches the previous measure. The measures are less than a pound apart. Only one of the measures for that day was excluded.

@carriedaymont
Owner

@kradimer Since both of the examples you mentioned involved measures that were the same as the previous measure, I checked to see if we had treated repeated values differently for same-day extraneous, but we hadn't (at least we didn't try to). I am looking into things further. In the meantime, if you'd like, here are some additional questions that might be helpful:

Are there examples of issues with values where neither value was repeated from the prior value? Are all the issues for weights, or for heights also? Did any of the SDEs that growthcleanr kept differ by over 1 kg? Are you by any chance able to email the age, weight, and sex for this subject and possibly a couple of others with no identifiers? You could shift all the ages by a number of days if you prefer. It would help us to try to replicate the issue, although I understand if you can't send them.

@kradimer
Author

Hi! The issues I have described were with weights (I haven't tried heights in the algorithm), and all patients are female. There are cases where all the values were excluded for a day with multiple measures even when none of the values matched the previous measure. All the examples that I've looked at where one measure was kept from a day with multiple measures did include values that differed by less than 1 kg. As for sending more specific data, I will need to get approval from my PIs and probably move this conversation to email rather than this public forum (but I'm not sure how to do that on GitHub).

@carriedaymont
Owner

carriedaymont commented Jun 21, 2022 via email

@delosh653 linked a pull request Jun 29, 2022 that will close this issue
@dchud
Collaborator

dchud commented Jul 27, 2022

@kradimer just a quick update - we've been reviewing this and are testing an updated set of rules that should address this issue. We're working on it at #78 and have a few more things to check/update before it's ready.

@delosh653
Collaborator

Hi, @kradimer! I just wanted to let you know we've been working on this and fixing not only your exact problem, but also other problems we discovered in the Extraneous Same Day (SDE) step for adults. We've been adjusting the algorithm by:

  • Realizing that the SDE step was initially too strict, and changing the |delta EWMA| cutoffs to be 2.54 for height and 0.5 of the "wtallow" for weight, then keeping only the value with the smallest |delta EWMA|.
  • Updating the EWMA calculation for SDE such that when we're calculating the weighted averages, we're not including the SDEs.
  • In fixing this, we also noticed that there was a bug in the "duplicate ratio" step such that the ratio was being counted by observations instead of days. We have fixed that; subjects with a ratio of > 0.25 are excluded.
  • In looking through your problem and internal data, we also noticed that there were a fair number of SDEs with relatively trivial differences that both get excluded (71 and 71.2 kg, for example). We have thus added a rule after the duplicate ratio step and before the SDE step:
    • If there are any included, non-SDE values for a subject and parameter, and all SDEs on a single day differ by <=1kg for weight or <=2.54cm for height (+ epsilon), then the SDE value with the lowest delta-EWMA should be retained.
    • If there are no non-SDE days:
      • if there are at least 3 SDEs on the same day, choose the SDE closest to the median across those SDEs.
      • if there are only 2 SDEs on the same day, choose the SDE closest to the median across all days (which are also SDEs, but on other days).
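
The new rule for near-identical same-day values can be sketched as below. It is a rough illustration of the bullets above, not the code on the sde-bugs branch. The function name, argument layout, and epsilon value are assumptions, and the delta-EWMA values are taken as precomputed inputs. The sketch also assumes that the 1 kg / 2.54 cm spread check applies to both branches and that "lowest delta-EWMA" means smallest in absolute value, neither of which is stated explicitly.

```python
from statistics import median

EPSILON = 1e-6  # assumed floating-point tolerance ("+ epsilon" in the rule)


def pick_trivial_sde(values, delta_ewma, has_non_sde, all_day_values, limit=1.0):
    """Return the index of the same-day value to retain, or None if the
    rule does not apply.

    values: measurements taken on one day (kg; use limit=2.54 for cm heights).
    delta_ewma: precomputed delta-EWMA per value (used only if has_non_sde).
    has_non_sde: True if the subject has included values on non-SDE days.
    all_day_values: the subject's values across all days (used only when
        there are no non-SDE days and exactly 2 values on this day).
    """
    if max(values) - min(values) > limit + EPSILON:
        return None  # differences are not trivial; left to the regular SDE step
    indices = range(len(values))
    if has_non_sde:
        # Retain the value with the smallest absolute delta-EWMA.
        return min(indices, key=lambda i: abs(delta_ewma[i]))
    if len(values) >= 3:
        target = median(values)          # median of this day's SDEs
    elif all_day_values:
        target = median(all_day_values)  # median across all days
    else:
        return None  # case not covered by the bullets; a guess
    return min(indices, key=lambda i: abs(values[i] - target))


with_ewma = pick_trivial_sde([71.0, 71.2], [0.4, -0.1], True, [])
three_same_day = pick_trivial_sde([70.0, 70.5, 70.9], None, False, [])
two_same_day = pick_trivial_sde([70.0, 70.8], None, False,
                                [70.0, 70.8, 71.0, 71.2])
not_trivial = pick_trivial_sde([70.0, 75.0], [0.1, 0.2], True, [])
```

Returning `None` for larger differences mirrors the ordering described above, where this rule runs before the regular SDE step.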

So lots of changes! We're still in progress on testing/updating these changes.

All of these changes can be found on the sde-bugs branch (linked to this issue). Thanks for reporting this, and your patience!

@dchud added this to the v3.0.0 milestone Feb 27, 2023

4 participants