-
Notifications
You must be signed in to change notification settings - Fork 857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restore original index with PandasParallelLFApplier #1589
Comments
+1 to this. We just spent several cycles tracking down this issue in a model that uses Snorkel. |
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
Not sure on the exact solution you were looking for on this, but one possibility would be to add an additional default keyword arg to PandasParallelLFApplier on whether to index sort or not. For example, at line 97 within snorkel/snorkel/labeling/apply/dask.py, dd.from_pandas() is called:
If you want the index to remain unsorted, and prevent the problem highlighted in this issue, we can simply call dd.from_pandas() like so:
We might not always want this to be false however, since dask makes this by default for performance purposes and for obtaining meaningful divisions (Ref: dask/dask#1428). Therefore, it could be worth letting users call PandasParallelLFApplier with sort=True / False, as required. Perhaps it could be made clear in the documentation that this sorting occurs by default, and if users do not want this to happen, they should provide sort=False. Just a suggestion anyway! |
Describe the solution you'd like
Using
PandasParallelLFApplier
includes an index sort on the original DataFrame, which can result in unexpected row order if the index is not sorted when passed in. This should be documented, or the original index order should be restored.Discussion: https://spectrum.chat/snorkel/help/how-to-use-the-pandasparallelapplier~cf50f563-28e6-418c-93a3-337384566c13
Additional context
Related issues: #1587 #1581 #1524
The text was updated successfully, but these errors were encountered: