
Scorers which don't return a numeric Score cause crashes in the log viewer #1324

Open
max-kaufmann opened this issue Feb 15, 2025 · 2 comments

max-kaufmann (Contributor) commented Feb 15, 2025

I'm creating an evaluation where I want the output to just be text (which I process later in a different way). I hence return a dict[str, str] as my Score (which is an allowed type). However, when I then try to view the log, the log viewer gives me the following error:

Error: Cannot read properties of undefined (reading 'reducer')
TypeError: Cannot read properties of undefined (reading 'reducer')
    at ResultsPanel (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:60545:42)
    at renderWithHooks (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:3533:25)
    at updateFunctionComponent (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:5030:20)
    at beginWork (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:5685:18)
    at performUnitOfWork (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:8750:18)
    at workLoopSync (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:8649:41)
    at renderRootSync (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:8633:11)
    at performWorkOnRoot (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:8335:44)
    at performWorkOnRootViaSchedulerTask (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:9175:7)
    at MessagePort.performWorkUntilDeadline (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:191:50)

Chasing stack traces, it's because the reducer is converting my strings to floats and failing, and I think there's no way to avoid this? The resolve_reducer function seems to always put mean_score() in there:

def resolve_reducer(
    reducers: ScoreReducer | list[ScoreReducer] | None,
) -> tuple[list[ScoreReducer], bool]:
    if reducers is None:
        return ([mean_score()], False)
    else:
        return (reducers if isinstance(reducers, list) else [reducers], True)
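
For what it's worth, the default only applies when no reducer is passed explicitly, so in principle it can be bypassed by selecting one when configuring epochs. A minimal sketch, assuming inspect_ai's Epochs API (the task wiring and sample are made up; scenario_parser_v0 is the scorer shown below):

from inspect_ai import Epochs, Task, task
from inspect_ai.dataset import Sample

@task
def scenario_task() -> Task:  # hypothetical task
    return Task(
        dataset=[Sample(input="Write a <scenario> and a <question>.")],
        scorer=scenario_parser_v0(),
        # passing a reducer explicitly means resolve_reducer never falls
        # back to mean_score(); "mode" copes with non-numeric values
        epochs=Epochs(1, "mode"),
    )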

Maybe my scorer is written badly, or I misunderstand something (plausibly I have to configure something to be allowed to return only strings, although that seems a little unintuitive).

Here is my scorer for context:

import re

from inspect_ai.scorer import Score, Scorer, Target, scorer
from inspect_ai.solver import TaskState


@scorer(metrics=[])
def scenario_parser_v0() -> Scorer:
    """Returns a scorer that returns a dictionary containing the model's question and answer. Fills out metadata with parse_error if the model's output does not contain both a <scenario> and a <question> tag."""

    async def scenario_parser_v0(state: TaskState, target: Target) -> Score:
        model_output = state.output.completion

        scenario_match = re.search(
            r"<scenario>(.*?)</scenario>", model_output, re.DOTALL
        )
        question_match = re.search(
            r"<question>(.*?)</question>", model_output, re.DOTALL
        )

        scenario = scenario_match.group(1).strip() if scenario_match else None
        question = question_match.group(1).strip() if question_match else None

        parse_error = scenario is None or question is None
        score = {
            "question": question if question else "",
            "answer": scenario if scenario else "",
        }

        return Score(
            value=score, answer=model_output, metadata={"parse_error": parse_error}
        )

    return scenario_parser_v0
dragonstyle (Collaborator) commented
One issue is definitely that we currently always run reducers against the scores, even when there is only a single epoch (this ensures the complete scoring logic executes consistently and behaves identically across single and multi-epoch cases). The default reducer is mean, which doesn't know how to deal with strings, so I expect scoring to fail in this case.

A couple of suggestions here:

  • You can implement a custom reducer that handles string-valued scores (you would need to consider what multiple epochs mean in the context of how you're scoring); see the sketch after this list.
  • You could also use the mode reducer, which just chooses the most common value (and hence supports strings).
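
For the first suggestion, here is a minimal sketch of what such a reducer could look like, assuming the @score_reducer decorator and a made-up merge_strings name; it concatenates the per-epoch strings under each dict key instead of trying to average them:

from typing import cast

from inspect_ai.scorer import Score, ScoreReducer, score_reducer

@score_reducer(name="merge_strings")  # hypothetical name
def merge_strings() -> ScoreReducer:
    def reduce(scores: list[Score]) -> Score:
        # collect each epoch's strings per key rather than attempting
        # a numeric conversion
        merged: dict[str, str] = {}
        for score in scores:
            for key, val in cast(dict[str, str], score.value).items():
                merged[key] = f"{merged[key]}\n{val}" if key in merged else val
        return Score(value=merged)

    return reduce

You would then select it the same way as a built-in reducer, e.g. Epochs(3, merge_strings()).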

I am thinking about how we might improve our support for reducing scores like this, but I haven't come up with a great solution that is consistent no matter the number of epochs. One thing we could do is create a string-specific reducer that just merges any strings into a list (roughly as sketched below) - I'll think more about this, but if you have ideas, let me know.
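
Sketching that idea (naming hypothetical), such a reducer could skip reduction entirely and just collect every epoch's value into a list:

from inspect_ai.scorer import Score, ScoreReducer, score_reducer

@score_reducer(name="as_list")  # hypothetical name
def as_list() -> ScoreReducer:
    def reduce(scores: list[Score]) -> Score:
        # no real reduction: keep each epoch's raw value so nothing is
        # coerced to a float
        return Score(value=[str(score.value) for score in scores])

    return reduce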

I can also try to address that error in the viewer - the reducer is allowed to be null by the type system, so if it isn't dealing with that case, it is an issue for sure.

max-kaufmann (Contributor, Author) commented Feb 16, 2025

I think if it were just null in the case where mean() couldn't handle the types, that would be enough for me! (Although that's not very "fail fast" if the user has made a mistake - maybe emit a warning?)

I don't really expect a coherent reduction in the string case, and being able to access the per-sample outputs is enough. Not sure if the list would help me much!
