Loss of good data due to excess Extraneous Same Day exclusions #73

kradimer · 2022-05-27T16:44:05Z

Hello! First, I love this package. I had been working on cleaning a dataset of over 700K weights using overall z-scores but some bad weights were still getting through. The weighted moving averages are amazing at catching these. Thank you! The only thing that this package is doing that I wish it weren't is throwing out nearly all of the same day measures. My algorithm was keeping one measure per day with the lowest absolute z-score, whereas growthcleanR threw out 13K such measures. I didn't see anything under the configuration options that would allow me to change that. Am I missing anything? Or might you be able to add an option for keeping one per day if they are not otherwise outliers? Anyway, thank you again for creating and sharing this package. It's fantastic.
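
For readers following along, the one-per-day approach described above can be sketched roughly as follows. This is a minimal illustration and not growthcleanr code; the record keys (`subject`, `day`, `z`) and the function name `keep_lowest_abs_z` are assumptions made for the example.

```python
def keep_lowest_abs_z(records):
    """Return one record per (subject, day): the one with the smallest |z|.

    Each record is a dict with hypothetical keys 'subject', 'day', and 'z'.
    """
    best = {}
    for rec in records:
        key = (rec["subject"], rec["day"])
        # Keep the first record seen for a day, then replace it only if a
        # later record on the same day has a strictly smaller |z|.
        if key not in best or abs(rec["z"]) < abs(best[key]["z"]):
            best[key] = rec
    return list(best.values())


records = [
    {"subject": "a", "day": 10, "z": 0.4},
    {"subject": "a", "day": 10, "z": -1.9},
    {"subject": "a", "day": 11, "z": 0.5},
]
kept = keep_lowest_abs_z(records)
```

Note that this keeps a value for every day regardless of how far apart the same-day values are, which is the main way it differs from the behavior discussed in the rest of this thread.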

dchud · 2022-06-14T00:47:44Z

Hi @kradimer, sorry for the slow response. Glad to hear the package is helping you!

Generally, the algorithm should choose one same-day weight to keep while flagging the others. You might be running into an error load issue if your data has many subjects with lots of same-day observations: if the algorithm flags a majority of a subject's observations as implausible, it can flag the rest on the assumption that too many errors make that subject's data hard to evaluate at all. If you're seeing exclusions with "Too-Many-Errors" (see https://carriedaymont.github.io/growthcleanr/articles/output.html for the various flavors of this), that could be part of what's happening.

Could you tell us a little more about your dataset? Are these pediatric or adult subjects? And could you perhaps share a summary percentage of the exclusion types growthcleanr is flagging so we can get a sense of whether it seems out of proportion to what we've seen with data we've tested with?

carriedaymont · 2022-06-14T21:37:16Z

@kradimer Thanks for using growthcleanr and sorry this part is not working well for you.

If it's adult data, this may be the issue (taken from steps # 9H and 10W on the adult algorithm page): "Then, if more than 25% of days for a subject have same day extraneous values OR if there are adjacent days that have same day extraneous values, all remaining same day extraneous values for that subject are excluded." The issue is that if there are too many same day extraneous values for a subject that differ by a non-trivial amount, it is difficult/impossible to reliably choose which of the values on the same day is more likely to be accurate. Same-days should be excluded from error load calculations, but this is a sort of "same-day extraneous load".
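
The quoted rule can be sketched as a small check. This is only an illustration under stated assumptions, not the package's implementation: the function name and arguments are hypothetical, and "adjacent" is read here as neighbouring entries in the subject's ordered list of measurement days (the quote does not say whether calendar adjacency is meant instead).

```python
def exclude_all_same_day(days_with_sde, all_days, max_ratio=0.25):
    """Return True if all remaining same-day extraneous values should be excluded.

    days_with_sde: set of day numbers that have same-day extraneous values
    all_days: set of day numbers with any measurement for the subject
    """
    if not all_days:
        return False
    # Condition 1: more than 25% of the subject's days have same-day extraneous values.
    if len(days_with_sde) / len(all_days) > max_ratio:
        return True
    # Condition 2 (assumed reading): two neighbouring measurement days both
    # have same-day extraneous values.
    ordered = sorted(all_days)
    for earlier, later in zip(ordered, ordered[1:]):
        if earlier in days_with_sde and later in days_with_sde:
            return True
    return False
```

For example, a subject with same-day extraneous values on 1 of 5 days, and none on neighbouring days, would not trigger the rule under this reading.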

If you are able to tell us more about the dataset (if you're still working on this) we can try to figure out if this is in fact the issue, and it may give us an idea for adjusting things for later iterations.

kradimer · 2022-06-15T00:32:02Z

Hi! Thanks for the responses! This is adult data. The immediate context is a prospective cohort study of women diagnosed with breast cancer, but this work may eventually be applied to other studies that leverage EHR data to examine trajectories in weight or other variables, or to estimate those values at time points for which data are not available.

Out of 747,267 observations, 39,481 were flagged by GCR as same day extraneous. There were 13,437 rows that I had marked to keep that GCR marked for same day extraneous exclusion, but 6,109 of these were just differences of opinion of which of two same day measures to keep (generally within a pound of each other). So that leaves about 7K fewer same day measures kept by GCR.

Choosing one subject as an example, they have 75 observed weights. There are three days for which this subject has multiple measures, two per day. This is less than 25% of days, and the days with multiple measures are not consecutive.

For the first such day, one of the measures is the same as a measurement for the previous day. The measures are 3 lbs apart. Both were excluded.
For the second such day, one of the measures is the same as the previous measure, which was 53 days earlier. The measures are 10 lbs apart. Both were excluded.
For the third such day, neither of the measures matches the previous measure. The measures are less than a pound apart. Only one of the measures for that day was excluded.
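
As a rough check of this subject against the 25% rule quoted above, the ratio can be worked out both by days and by observations. The variable names are illustrative, and counting every weight on a multi-measure day toward the observation-based ratio is just one possible reading.

```python
observations = 75      # weights recorded for the example subject
multi_days = 3         # days with more than one weight
per_multi_day = 2      # weights on each of those days

# Each multi-measure day contributes one extra observation beyond its day.
distinct_days = observations - multi_days * (per_multi_day - 1)      # 72 days

ratio_by_days = multi_days / distinct_days                           # 3/72, about 0.042
ratio_by_observations = multi_days * per_multi_day / observations    # 6/75 = 0.08

under_threshold = ratio_by_days <= 0.25 and ratio_by_observations <= 0.25
```

Either way of counting leaves this subject well under 25%, so the ratio rule alone would not be expected to explain the exclusions listed above.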

carriedaymont · 2022-06-16T20:09:06Z

@kradimer Since both of the examples you mentioned involved measures that repeated the previous measurement, I checked to see if we had treated repeated values differently for same-day extraneous, but we hadn't (at least we didn't try to). I am looking into things further. In the meantime, if you're willing, here are some additional questions that might be helpful:

Are there examples of issues where neither same-day value was repeated from the prior value? Are all the issues for weights, or for heights also? On days where growthcleanr kept one of the same-day values, did any of those values differ by over 1 kg? Are you by any chance able to email the age, weight, and sex for this subject and possibly a couple of others, with no identifiers? You could shift all the ages by a number of days if you prefer. It would help us to try to replicate the issue, although I understand if you can't send them.

kradimer · 2022-06-21T16:26:17Z

Hi! The issues I have described were with weights (I haven't tried heights in the algorithm), and all patients are female. There are cases where all the values were excluded for a day with multiple measures even when none of the values matched the previous measure. All the examples that I've looked at where one measure was kept from a day with multiple measures did include values that differed by less than 1 kg. As for sending more specific data, I will need to get approval from my PIs and probably move this conversation to email rather than this public forum (but I'm not sure how to do that on GitHub).

carriedaymont · 2022-06-21T20:15:17Z

I was able to use some data that I have to recreate issues, so we are good there. We are working on figuring out where the issue is. Sorry this isn't working right!


dchud · 2022-07-27T20:40:26Z

@kradimer just a quick update - we've been reviewing this and are testing an updated set of rules that should address this issue. We're working on it at #78 and have a few more things to check/update before it's ready.

delosh653 · 2022-09-22T14:30:16Z

Hi, @kradimer! I just wanted to let you know we've been working on this, fixing not only your exact problem but also other problems we discovered in the Extraneous Same Day (SDE) step for adults. We've been adjusting the algorithm by:

- Recognizing that the initial SDE step was too strict, and changing the |delta EWMA| cutoffs to 2.54 for height and 0.5 of the "wtallow" for weight, then keeping only the value with the smallest |delta EWMA|.
- Updating the EWMA calculation for SDE so that SDEs are not included when we calculate the weighted averages.
- Fixing a bug we noticed along the way in the "duplicate ratio" step: the ratio was being counted by observations instead of days. It now counts days, and a ratio > 0.25 leads to exclusion.
- Adding a rule for trivial differences. In looking through your problem and internal data, we also noticed a fair number of SDEs with relatively trivial differences that both get excluded (e.g. 71 and 71.2 kg). We have thus added a rule after the duplicate ratio step and before the SDE step:
  - If there are any included, non-SDE values for a subject and parameter, and all SDEs on a single day differ by <=1kg for weight or <=2.54cm for height (+ epsilon), then the SDE value with the lowest delta-EWMA should be retained.
  - If there are no non-SDE days:
    - if there are at least 3 SDEs on the same day, choose the SDE closest to the median across those SDEs.
    - if there are only 2 SDEs on the same day, choose the SDE closest to the median across all days (which are also SDEs, but on other days).
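
To make the trivial-difference rule concrete, here is a minimal sketch for weight. It is not the code on the sde-bugs branch, and several things in it are assumptions: the function name and arguments are hypothetical, the delta-EWMA values are taken as already computed, the <=1kg spread check is applied in both branches, "lowest delta-EWMA" is read as smallest absolute value, and "median across all days" is read as the median of every value for the subject and parameter.

```python
from statistics import median


def pick_trivial_sde(day_values, delta_ewma, all_values, has_non_sde,
                     limit=1.0, eps=1e-6):
    """Return the index (into day_values) of the same-day value to keep, or None.

    day_values: measurements recorded on one day (kg for weight)
    delta_ewma: value minus EWMA per entry; only used when has_non_sde is True
    all_values: every measurement for the subject and parameter, across days
    has_non_sde: True if the subject has included values on non-SDE days
    """
    # The rule only covers trivial differences (<= limit, plus epsilon).
    if max(day_values) - min(day_values) > limit + eps:
        return None  # left to the regular SDE step
    idx = range(len(day_values))
    if has_non_sde:
        # Keep the value with the smallest |delta EWMA|.
        return min(idx, key=lambda i: abs(delta_ewma[i]))
    # No non-SDE days: use the day's own median for 3+ values,
    # otherwise the median across all of the subject's values.
    target = median(day_values) if len(day_values) >= 3 else median(all_values)
    return min(idx, key=lambda i: abs(day_values[i] - target))
```

With the 71 and 71.2 kg example from above and delta-EWMA values of, say, 0.3 and 0.1, this sketch would keep the 71.2 kg value instead of excluding both.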

So lots of changes! We're still in progress on testing/updating these changes.

All of these changes can be found on the sde-bugs branch (linked to this issue). Thanks for reporting this, and your patience!

delosh653 linked a pull request Jun 29, 2022 that will close this issue

Fix adult same day extraneous step #78

Draft

dchud assigned delosh653 Aug 23, 2022

dchud added this to the v3.0.0 milestone Feb 27, 2023
