
Scorers which don't return a numeric Score cause crashes in the log viewer #1324

Open
max-kaufmann opened this issue Feb 15, 2025 · 2 comments

max-kaufmann (Contributor) commented Feb 15, 2025

I'm creating an evaluation where I want the output to just be text (which I process later in a different way). I hence return a dict[str, str] as my Score (which is an allowed type). However, when I then try to view the log, the log viewer gives me the following error:

Error: Cannot read properties of undefined (reading 'reducer')
TypeError: Cannot read properties of undefined (reading 'reducer')
    at ResultsPanel (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:60545:42)
    at renderWithHooks (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:3533:25)
    at updateFunctionComponent (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:5030:20)
    at beginWork (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:5685:18)
    at performUnitOfWork (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:8750:18)
    at workLoopSync (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:8649:41)
    at renderRootSync (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:8633:11)
    at performWorkOnRoot (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:8335:44)
    at performWorkOnRootViaSchedulerTask (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:9175:7)
    at MessagePort.performWorkUntilDeadline (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:191:50)

Chasing stack traces, it's because the reducer is converting my strings to floats and failing, and I think there's no way to avoid this? The resolve_reducer function seems to always put mean_score() in there:

def resolve_reducer(
    reducers: ScoreReducer | list[ScoreReducer] | None,
) -> tuple[list[ScoreReducer], bool]:
    if reducers is None:
        return ([mean_score()], False)
    else:
        return (reducers if isinstance(reducers, list) else [reducers], True)
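
For what it's worth, the default only applies when no reducer is passed explicitly, so in principle it can be bypassed by selecting one when configuring epochs. A minimal sketch, assuming inspect_ai's Epochs API (the task wiring and sample are made up; scenario_parser_v0 is the scorer shown below):

from inspect_ai import Epochs, Task, task
from inspect_ai.dataset import Sample

@task
def scenario_task() -> Task:  # hypothetical task
    return Task(
        dataset=[Sample(input="Write a <scenario> and a <question>.")],
        scorer=scenario_parser_v0(),
        # passing a reducer explicitly means resolve_reducer never falls
        # back to mean_score(); "mode" copes with non-numeric values
        epochs=Epochs(1, "mode"),
    )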

Maybe my scorer is written badly, or I misunderstand something (plausibly I have to configure something to be allowed to return only strings, although that seems a little unintuitive).

Here is my scorer for context:

import re

from inspect_ai.scorer import Score, Scorer, Target, scorer
from inspect_ai.solver import TaskState


@scorer(metrics=[])
def scenario_parser_v0() -> Scorer:
    """Returns a scorer that returns a dictionary containing the model's question and answer. Fills out metadata with parse_error if the model's output does not contain both a <scenario> and a <question> tag."""

    async def scenario_parser_v0(state: TaskState, target: Target) -> Score:
        model_output = state.output.completion

        scenario_match = re.search(
            r"<scenario>(.*?)</scenario>", model_output, re.DOTALL
        )
        question_match = re.search(
            r"<question>(.*?)</question>", model_output, re.DOTALL
        )

        scenario = scenario_match.group(1).strip() if scenario_match else None
        question = question_match.group(1).strip() if question_match else None

        parse_error = scenario is None or question is None
        score = {
            "question": question if question else "",
            "answer": scenario if scenario else "",
        }

        return Score(
            value=score, answer=model_output, metadata={"parse_error": parse_error}
        )

    return scenario_parser_v0
dragonstyle (Collaborator) commented
One issue is definitely that we currently always run reducers against the scores, even when there is only a single epoch (this ensures the complete scoring logic executes consistently and behaves identically across single and multi-epoch cases). The default reducer is mean, which doesn't know how to deal with strings, so I expect scoring to fail in this case.

A couple of suggestions here:

  • You can implement a custom reducer that handles string-valued scores (you would need to consider what multiple epochs mean in the context of how you're scoring); see the sketch after this list.
  • You could also use the mode reducer, which just chooses the most common value (and hence supports strings).
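
For the first suggestion, here is a minimal sketch of what such a reducer could look like, assuming the @score_reducer decorator and a made-up merge_strings name; it concatenates the per-epoch strings under each dict key instead of trying to average them:

from typing import cast

from inspect_ai.scorer import Score, ScoreReducer, score_reducer

@score_reducer(name="merge_strings")  # hypothetical name
def merge_strings() -> ScoreReducer:
    def reduce(scores: list[Score]) -> Score:
        # collect each epoch's strings per key rather than attempting
        # a numeric conversion
        merged: dict[str, str] = {}
        for score in scores:
            for key, val in cast(dict[str, str], score.value).items():
                merged[key] = f"{merged[key]}\n{val}" if key in merged else val
        return Score(value=merged)

    return reduce

You would then select it the same way as a built-in reducer, e.g. Epochs(3, merge_strings()).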

I am thinking about how we might improve our support for reducing scores like this, but I haven't come up with a great solution that is consistent no matter the number of epochs. One thing we could do is create a string-specific reducer that just merges any strings into a list (roughly as sketched below) - I'll think more about this, but if you have ideas, let me know.
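
Sketching that idea (naming hypothetical), such a reducer could skip reduction entirely and just collect every epoch's value into a list:

from inspect_ai.scorer import Score, ScoreReducer, score_reducer

@score_reducer(name="as_list")  # hypothetical name
def as_list() -> ScoreReducer:
    def reduce(scores: list[Score]) -> Score:
        # no real reduction: keep each epoch's raw value so nothing is
        # coerced to a float
        return Score(value=[str(score.value) for score in scores])

    return reduce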

I can also try to address that error in the viewer - the reducer is allowed to be null by the type system, so if it isn't dealing with that case, it is an issue for sure.

max-kaufmann (Contributor, Author) commented Feb 16, 2025

I think if it were just null in the case where mean() couldn't handle the types, that would be enough for me! (Although that's not very "fail fast" if the user has made a mistake - maybe emit a warning?)

I don't really expect a coherent reduction in the string case, and being able to access the per-sample outputs is enough. Not sure if the list would help me much!
