labelmodel.fit on a superset of data changes predictions of subset #1581
Comments
Hi, I checked the matrices generated by PandasLFApplier and PandasParallelLFApplier, and they were different: df_full = pd.concat([df_single, df_multilabel]); lm1 = applier.apply(df=df_full); np.array_equal(lm1, lm2). Is there anything I am missing? |
Hi @srimugunthan, thanks for surfacing this! At the moment, the master branch version of Snorkel is not configured to support multi-label, though we've certainly applied Snorkel here (e.g. https://www.snorkel.org/blog/superglue / multi-task formulation...). So I'm not surprised there are some issues here. Perhaps, since Snorkel's label model is expecting a single label, it's just taking e.g. the last one per data point, but this order is getting shuffled when applied in parallel? Either way, we'll look into this to make sure it's not an issue with PandasParallelLFApplier. If, as I suspect, it's just an issue with multi-label support, we'll put it on the roadmap! |
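One way to check the row-ordering hypothesis is to compare the two label matrices as multisets of rows rather than only with np.array_equal. The snippet below is a minimal sketch under assumptions: `compare_label_matrices` is a hypothetical helper (not part of Snorkel), the matrices are plain nested lists (a NumPy array can be converted with `.tolist()`), and the sample values are made up for illustration.

```python
from collections import Counter


def compare_label_matrices(a, b):
    """Compare two label matrices given as lists of rows (one row per data point).

    Hypothetical helper for debugging, not a Snorkel API.
    """
    rows_a = [tuple(r) for r in a]
    rows_b = [tuple(r) for r in b]
    if len(rows_a) != len(rows_b):
        return {"equal": False, "same_rows_any_order": False, "differing_rows": None}
    # Indices where the two matrices disagree when compared position by position.
    differing = [i for i, (ra, rb) in enumerate(zip(rows_a, rows_b)) if ra != rb]
    return {
        "equal": not differing,
        # True when both matrices contain the same rows, possibly in another order.
        "same_rows_any_order": Counter(rows_a) == Counter(rows_b),
        "differing_rows": differing,
    }


# Made-up example: the same three rows, with the first two swapped.
lm_serial = [[0, -1], [1, 1], [-1, 0]]
lm_parallel = [[1, 1], [0, -1], [-1, 0]]
report = compare_label_matrices(lm_serial, lm_parallel)
print(report)
```

If `same_rows_any_order` is true while `equal` is false, the difference is consistent with a reordering of rows; if both are false, the two appliers produced different votes for at least one data point.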
Hi @srimugunthan, sorry for the delayed reply! In response to the |
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
Issue description
We have a dataset in which each record has either one label or multiple labels.
To verify the label model predictions, we filtered the original data down to the records with only one label. Running labelmodel.fit on this single-labelled data gave an accuracy of more than 90%.
But when we ran labelmodel.fit on the whole data, the accuracy on those same single-labelled data points dropped drastically, to 30%.
Code example/repro steps
I was able to reproduce the bug with a generated label matrix: https://github.com/srimugunthan/snorkeldebugging/blob/master/snorkeldebug.ipynb
Although the accuracy drop on the generated data is not drastic, it illustrates the scenario.
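The comparison itself can be sketched without Snorkel. The code below is an illustration under stated assumptions, not the notebook's code and not Snorkel's LabelModel: `fit_toy_model` and `predict_toy_model` are hypothetical stand-ins for a model whose only learned parameter is a tie-breaking prior, and all label values are made up. It only shows the shape of the experiment: fit once on the single-labelled subset, fit once on the superset, and score both fits on the same subset.

```python
from collections import Counter

ABSTAIN = -1


def fit_toy_model(L):
    """Toy stand-in for a label model: learn the most frequent non-abstain vote."""
    votes = Counter(v for row in L for v in row if v != ABSTAIN)
    return votes.most_common(1)[0][0]


def predict_toy_model(prior, L):
    """Majority vote per row; ties and all-abstain rows fall back to the learned prior."""
    preds = []
    for row in L:
        counts = Counter(v for v in row if v != ABSTAIN)
        if not counts:
            preds.append(prior)
            continue
        top = max(counts.values())
        winners = [c for c, n in counts.items() if n == top]
        if len(winners) == 1:
            preds.append(winners[0])
        else:
            preds.append(prior if prior in winners else min(winners))
    return preds


def accuracy(preds, gold):
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


# Made-up label matrices: rows are data points, columns are labeling functions.
L_single = [[0, 1], [0, 0], [1, 0], [1, ABSTAIN]]  # single-labelled subset
gold_single = [0, 0, 0, 1]
L_extra = [[1, 1], [1, 1], [1, 1]]  # stands in for the multi-labelled rows
L_full = L_single + L_extra

# Fit on the subset, then on the superset; evaluate both on the same subset.
acc_subset_fit = accuracy(predict_toy_model(fit_toy_model(L_single), L_single), gold_single)
acc_full_fit = accuracy(predict_toy_model(fit_toy_model(L_full), L_single), gold_single)
print(acc_subset_fit, acc_full_fit)
```

In this toy case the extra rows change the learned prior, so the tied rows in the subset flip and the subset accuracy differs between the two fits. Any model whose parameters are estimated from the data it is fit on can behave this way; whether that accounts for the size of the drop reported above is a separate question.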
Expected behavior
The subset of data with single labels should have the same accuracy in both cases.
System info
Used Snorkel 0.9.3 on Linux.