Evaluate using SQuAD 2.0's official evaluation script
Hi,

I loaded deepset/roberta-base-squad2 on SQuAD 2.0 and got really poor performance on no-answer questions. Here's what I did:
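For context, a minimal sketch of how I assume the model is loaded and run (the question/context strings here are made-up examples). One common cause of poor no-answer scores with this setup is that the Hugging Face question-answering pipeline defaults to handle_impossible_answer=False, so it always returns a text span even when the model prefers "no answer":

```python
# Sketch of loading deepset/roberta-base-squad2 with the transformers
# question-answering pipeline (assumed setup, not necessarily what the
# original poster did).
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

result = qa(
    question="Who wrote Hamlet?",
    context="This sentence is about something else entirely.",
    # Allow the empty-string "no answer" prediction; without this flag the
    # pipeline always extracts some span, so every unanswerable question in
    # SQuAD 2.0 is scored as wrong.
    handle_impossible_answer=True,
)
print(result)
```

When the model prefers "no answer", result["answer"] comes back as an empty string, which is exactly what the official SQuAD 2.0 script expects for unanswerable questions.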
I'm quite new to this and am unsure where I might have gone wrong.
I'm wondering whether I should add an extra binary classifier. However, the model card on Hugging Face indicates that deepset/roberta-base-squad2 is already a fine-tuned model, so I assume that simply loading and using it should suffice.

Alternatively, should I also save the prediction scores in step 2, pass them to the evaluation script using the --na-prob-file param, and then set a --na-prob-thresh? If so, what would be the appropriate threshold? I'm trying to determine the threshold that replicates the performance metrics reported on Hugging Face.

I've tried searching for papers, documentation, and issues related to this but haven't found anything conclusive. I feel like I might be missing some basic understanding here. Could anyone offer some guidance?
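As I understand it, the official evaluation script's --na-prob-thresh (default 1.0) works by treating any question whose no-answer probability exceeds the threshold as unanswered; when you pass --na-prob-file, the script also reports best_exact_thresh and best_f1_thresh, which may be the threshold values being asked about. A minimal sketch of that thresholding logic (qids and scores below are hypothetical examples, not real SQuAD data):

```python
# Sketch of how --na-prob-thresh is applied: predictions whose no-answer
# probability exceeds the threshold are blanked out (empty string = "no
# answer" in SQuAD 2.0 format).
import json

def apply_na_threshold(preds, na_probs, na_prob_thresh):
    """Blank out predictions whose no-answer probability exceeds the threshold."""
    return {
        qid: ("" if na_probs[qid] > na_prob_thresh else answer)
        for qid, answer in preds.items()
    }

# Hypothetical predictions and no-answer scores (e.g. saved in step 2).
preds = {"q1": "Shakespeare", "q2": "Paris"}
na_probs = {"q1": 0.1, "q2": 0.9}

# With a 0.5 threshold, q2 is treated as a no-answer question.
adjusted = apply_na_threshold(preds, na_probs, na_prob_thresh=0.5)
print(adjusted)  # {'q1': 'Shakespeare', 'q2': ''}

# The two JSON files the script consumes: the predictions file and the
# file passed via --na-prob-file.
with open("predictions.json", "w") as f:
    json.dump(preds, f)
with open("na_probs.json", "w") as f:
    json.dump(na_probs, f)
```

Note the default threshold of 1.0 means no prediction is ever blanked unless you lower it, which matches the symptom of poor no-answer performance when scores aren't thresholded.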