Loss of good data due to excess Extraneous Same Day exclusions #73
Comments
Hi @kradimer, sorry for the slow response. Glad to hear the package is helping you! Generally, the algorithm should choose one same-day weight to keep while flagging the others. You might be running into an error load issue if your data has many subjects with lots of same-day observations: if the algorithm flags a majority of a subject's observations as implausible, it can flag the rest as well, on the assumption that too many errors make that subject's data hard to evaluate at all. If you're seeing exclusions with "Too-Many-Errors" (see https://carriedaymont.github.io/growthcleanr/articles/output.html for the various flavors of this), that could be part of what's happening. Could you tell us a little more about your dataset? Are these pediatric or adult subjects? And could you perhaps share a summary percentage of the exclusion types growthcleanr is flagging, so we can get a sense of whether it seems out of proportion to what we've seen with the data we've tested? |
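For anyone wanting to produce the summary percentages asked for above, here is a minimal Python sketch. It assumes the per-row exclusion codes have already been exported to a plain list of strings; the labels used below are illustrative placeholders, not necessarily the exact strings growthcleanr emits, and `exclusion_summary` is a hypothetical helper.

```python
from collections import Counter

def exclusion_summary(codes):
    """Return {code: percentage of all rows}, most common code first."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: round(100 * n / total, 2) for code, n in counts.most_common()}

# Placeholder labels for illustration only (assumed, not taken from the package).
codes = ["Include"] * 90 + ["Same-Day-Extraneous"] * 7 + ["Too-Many-Errors"] * 3

for code, pct in exclusion_summary(codes).items():
    print(f"{code}: {pct}%")
```

A table like this makes it easy to see whether one exclusion type dominates.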
@kradimer Thanks for using growthcleanr and sorry this part is not working well for you. If it's adult data, this may be the issue (taken from steps # 9H and 10W on the adult algorithm page): "Then, if more than 25% of days for a subject have same day extraneous values OR if there are adjacent days that have same day extraneous values, all remaining same day extraneous values for that subject are excluded." The issue is that if there are too many same day extraneous values for a subject that differ by a non-trivial amount, it is difficult/impossible to reliably choose which of the values on the same day is more likely to be accurate. Same-days should be excluded from error load calculations, but this is a sort of "same-day extraneous load". If you are able to tell us more about the dataset (if you're still working on this) we can try to figure out if this is in fact the issue, and it may give us an idea for adjusting things for later iterations. |
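The quoted rule can be sketched roughly as follows in Python. This is only an illustration of the two conditions as worded above, not the package's implementation. It assumes that a "day with same day extraneous values" is simply any day with more than one measurement, and that "adjacent" means neighbouring entries in the subject's ordered list of measurement days; both are simplifying assumptions, and `exclude_all_remaining_sdes` is a hypothetical name.

```python
from collections import Counter

def exclude_all_remaining_sdes(days, max_fraction=0.25):
    """days: one entry per measurement for a single subject (e.g. age in days).

    Returns True if the rule would exclude all remaining same-day values.
    """
    counts = Counter(days)
    ordered = sorted(counts)  # distinct measurement days, in order
    if not ordered:
        return False
    multi = [counts[d] > 1 for d in ordered]  # assumed definition of an SDE day
    too_many = sum(multi) / len(ordered) > max_fraction
    adjacent = any(a and b for a, b in zip(multi, multi[1:]))
    return too_many or adjacent

# One of five days has a repeat (20%), and no neighbouring days both do.
print(exclude_all_remaining_sdes([1, 1, 2, 3, 4, 5]))
# Two neighbouring measurement days both have repeats.
print(exclude_all_remaining_sdes([1, 1, 2, 2, 3, 4, 5, 6, 7, 8]))
```

Under these assumptions, a subject is unaffected only when repeated days are both rare and separated by at least one single-measurement day.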
Hi! Thanks for the responses! This is adult data. The immediate context is a prospective cohort study of women diagnosed with breast cancer, in which we are leveraging EHR data, but this work may eventually be applied to other studies interested in examining trajectories in weight or other variables, or in estimating those values for a time point at which those data are not available. Out of 747,267 observations, 39,481 were flagged by GCR as same day extraneous. There were 13,437 rows that I had marked to keep that GCR marked for same day extraneous exclusion, but 6,109 of these were just differences of opinion about which of two same day measures to keep (generally within a pound of each other). So that leaves about 7K fewer same day measures kept by GCR. Choosing one subject as an example, they have 75 observed weights. There are three days for which this subject has multiple measures, two per day. This is less than 25% of days, and the days with multiple measures are not consecutive.
|
@kradimer Since both of the examples you mentioned involved measures that were the same as the previous day, I checked to see if we had treated repeated values differently for same-day extraneous, but we hadn't (at least we didn't try to). I am looking into things further. In the meantime, if you're willing, here are some additional questions that might be helpful: Are there examples of issues with values where neither value was repeated from the prior value? Are all the issues for weights, or for heights also? Did any of the SDEs that growthcleanr kept differ by over 1 kg? Are you by any chance able to email the age, weight, and sex for this subject and possibly a couple of others, with no identifiers? You could shift all the ages by a number of days if you prefer. It would help us to try to replicate the issue, although I understand if you can't send them. |
Hi! The issues I have described were with weights (I haven't tried heights in the algorithm), and all patients are female. There are cases where all the values were excluded for a day with multiple measures even when none of the values matched the previous measure. All the examples that I've looked at where one measure was kept from a day with multiple measures did include values that differed by less than 1 kg. As for sending more specific data, I will need to get approval from my PIs and probably move this conversation to email rather than this public forum (but I'm not sure how to do that on GitHub). |
I was able to use some data that I have to recreate issues, so we are good there. We are working on figuring out where the issue is. Sorry this isn't working right!
|
Hi, @kradimer! I just wanted to let you know we've been working on this and fixing not only your exact problem, but problems we discovered in the Extraneous Same Day (SDE) step for adults. We've been adjusting the algorithm with:
So lots of changes! We're still in progress on testing/updating these changes. All of these changes can be found on the sde-bugs branch (linked to this issue). Thanks for reporting this, and your patience! |
Hello! First, I love this package. I had been working on cleaning a dataset of over 700K weights using overall z-scores but some bad weights were still getting through. The weighted moving averages are amazing at catching these. Thank you! The only thing that this package is doing that I wish it weren't is throwing out nearly all of the same day measures. My algorithm was keeping one measure per day with the lowest absolute z-score, whereas growthcleanR threw out 13K such measures. I didn't see anything under the configuration options that would allow me to change that. Am I missing anything? Or might you be able to add an option for keeping one per day if they are not otherwise outliers? Anyway, thank you again for creating and sharing this package. It's fantastic.
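For reference, the "keep one measure per day with the lowest absolute z-score" approach described above could look roughly like this in Python. This is a sketch of that description only, not growthcleanr behaviour; the row layout `(subject, day, weight, z)` and the name `keep_lowest_abs_z` are assumptions made for illustration, and ties keep the first row seen.

```python
def keep_lowest_abs_z(rows):
    """rows: list of (subject, day, weight, z) tuples.

    Returns one row per (subject, day): the one with the smallest |z|,
    preserving the original row order.
    """
    best = {}  # (subject, day) -> index of the best row seen so far
    for i, (subject, day, _weight, z) in enumerate(rows):
        key = (subject, day)
        if key not in best or abs(z) < abs(rows[best[key]][3]):
            best[key] = i
    return [rows[i] for i in sorted(best.values())]

rows = [
    ("a", 1, 70.0, 0.5),
    ("a", 1, 70.4, -0.2),  # same day, smaller |z|, so this one is kept
    ("a", 2, 71.0, 1.1),
]
print(keep_lowest_abs_z(rows))
```

Note that this always keeps exactly one value per day, even when the same-day values disagree substantially, which is the case the adult algorithm's rule is designed to treat more cautiously.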