Inconsistent handling of whitespace tokens in Scorer.score_token_attr #13739

Open
nrodnova opened this issue Jan 31, 2025 · 0 comments

@nrodnova
Contributor

I've been training a tagger and parser on heavily augmented data and was surprised by the poor reported performance (compared with what I calculated manually on the test dataset). I narrowed it down to Scorer.score_token_attr.

How to reproduce the behaviour

Sorry, I don't have a code example, but it's pretty straightforward. In this function, every token in the gold dataset is included in the evaluation (except those with a missing attribute), whereas whitespace tokens in the predicted dataset are excluded. If the gold dataset contains whitespace tokens (which it does in my case), the two sides are not compared like for like: the gold whitespace tokens can never be matched by a prediction, which inflates the error rate.
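A toy illustration of the mismatch, in plain Python rather than spaCy. The token list, the tags, and the assumption that gold and predicted share the same tokenization are all made up for the example; the set comparison is only a stand-in for what the scorer does with gold_tags and pred_tags.

```python
# Toy stand-in for the set comparison in Scorer.score_token_attr.
# Each token is a (text, tag) pair. Gold and predicted are assumed to be
# tokenized identically, so index i refers to the same token on both sides.
tokens = [
    ("Hello", "INTJ"),
    (" ", "SPACE"),
    ("world", "NOUN"),
    ("\n", "SPACE"),
    ("!", "PUNCT"),
]

# Gold side: every token with a non-missing tag is included.
gold_tags = {(i, tag) for i, (text, tag) in enumerate(tokens)}

# Predicted side: whitespace tokens are skipped, even though this
# hypothetical model tagged every token correctly.
pred_tags = {
    (i, tag) for i, (text, tag) in enumerate(tokens) if not text.isspace()
}

true_pos = len(gold_tags & pred_tags)
false_pos = len(pred_tags - gold_tags)
false_neg = len(gold_tags - pred_tags)

recall = true_pos / (true_pos + false_neg)
f_score = 2 * true_pos / (2 * true_pos + false_pos + false_neg)
print(true_pos, false_pos, false_neg, recall, f_score)
```

In this toy case the two gold whitespace tokens have no predicted counterpart, so a model that is right on every token still ends up with two false negatives.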

I just created my own scorer for now, but this behavior is kind of unexpected.

Let me know if you want me to change the behavior, and I will do a PR. My own fix to the function was to add an exclude_spaces parameter, defaulting to the current behavior (i.e. True), that either includes or excludes whitespace tokens in both datasets.

# line 240 of scorer.py
for gold_i, token in enumerate(gold_doc):
    value = getter(token, attr)
    if value not in missing_values:
        gold_tags.add((gold_i, getter(token, attr)))
    else:
        missing_indices.add(gold_i)
pred_tags = set()
for token in pred_doc:
    if token.orth_.isspace():  # HERE: excluding whitespace tokens
        continue
    if align.x2y.lengths[token.i] == 1:
        gold_i = align.x2y[token.i][0]
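A rough sketch of the exclude_spaces idea described above, again in plain Python so it runs without spaCy. The function name collect_tag_sets and the (text, tag) token representation are hypothetical, and the sketch assumes identical tokenization on both sides, so it leaves out the alignment lookup and the missing-value handling that the real function performs.

```python
def collect_tag_sets(gold_tokens, pred_tokens, exclude_spaces=True):
    """Build gold and predicted (index, tag) sets using one whitespace rule.

    Hypothetical helper, not part of spaCy. Each token is a (text, tag)
    pair, and both sequences are assumed to share the same tokenization.
    """

    def keep(text):
        # The same filter is applied to both sides, so whitespace tokens
        # are either scored in both sets or in neither.
        return not (exclude_spaces and text.isspace())

    gold_tags = {
        (i, tag) for i, (text, tag) in enumerate(gold_tokens) if keep(text)
    }
    pred_tags = {
        (i, tag) for i, (text, tag) in enumerate(pred_tokens) if keep(text)
    }
    return gold_tags, pred_tags


gold = [("Hello", "INTJ"), (" ", "SPACE"), ("world", "NOUN")]
pred = [("Hello", "INTJ"), (" ", "SPACE"), ("world", "NOUN")]

gold_set, pred_set = collect_tag_sets(gold, pred)
print(gold_set == pred_set)

gold_all, pred_all = collect_tag_sets(gold, pred, exclude_spaces=False)
print(len(gold_all), len(pred_all))
```

With either setting, both sets are built from the same tokens, so a correct prediction on a whitespace token is never counted against the model.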