I'm creating an evaluation where I want the output to just be text (which I process later in a different way). I hence return a dict[str, str] as my Score value (which is an allowed type). However, when I then try to view the log, the log viewer gives me the following error:
```
Error: Cannot read properties of undefined (reading 'reducer')
TypeError: Cannot read properties of undefined (reading 'reducer')
    at ResultsPanel (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:60545:42)
    at renderWithHooks (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:3533:25)
    at updateFunctionComponent (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:5030:20)
    at beginWork (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:5685:18)
    at performUnitOfWork (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:8750:18)
    at workLoopSync (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:8649:41)
    at renderRootSync (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:8633:11)
    at performWorkOnRoot (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:8335:44)
    at performWorkOnRootViaSchedulerTask (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:9175:7)
    at MessagePort.performWorkUntilDeadline (https://file+.vscode-resource.vscode-cdn.net/Users/max/Documents/git_repos/stories-cip/.venv/lib/python3.11/site-packages/inspect_ai/_view/www/dist/assets/index.js:191:50)
```
Chasing stack traces, it's because the reducer is converting my strings to float and failing, and I think there's no way to avoid this? The resolve_reducers function always seems to put mean_score() in there (inspect_ai/src/inspect_ai/_eval/task/results.py, lines 172 to 178 in 011d6da).
Maybe my scorer is written badly, or I misunderstand something (plausibly I have to configure something to only return strings, although that seems a little unintuitive).
Here is my scorer for context:
```python
import re

from inspect_ai.scorer import Score, Scorer, Target, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[])
def scenario_parser_v0() -> Scorer:
    """Returns a scorer that returns a dictionary containing the model's question
    and answer. Fills out metadata with parse_error if the model's output does not
    contain both a <scenario> and a <question> tag."""

    async def scenario_parser_v0(state: TaskState, target: Target) -> Score:
        model_output = state.output.completion
        scenario_match = re.search(
            r"<scenario>(.*?)</scenario>", model_output, re.DOTALL
        )
        question_match = re.search(
            r"<question>(.*?)</question>", model_output, re.DOTALL
        )
        scenario = scenario_match.group(1).strip() if scenario_match else None
        question = question_match.group(1).strip() if question_match else None
        parse_error = scenario is None or question is None
        score = {
            "question": question if question else "",
            "answer": scenario if scenario else "",
        }
        return Score(
            value=score, answer=model_output, metadata={"parse_error": parse_error}
        )

    return scenario_parser_v0
```
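For reference, here is a minimal sketch of how a scorer like this gets wired into a task (the task name, sample, and prompt are hypothetical placeholders, not from my actual eval):

```python
# Hypothetical task wiring -- the task name, sample, and prompt are placeholders;
# only the scorer above comes from the actual eval.
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.solver import generate

@task
def scenario_generation() -> Task:
    return Task(
        dataset=MemoryDataset([Sample(input="Write a <scenario> and a <question>.")]),
        solver=generate(),
        scorer=scenario_parser_v0(),
    )
```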
One issue is definitely that we currently always run reducers against the scores, even when there is only a single epoch (this ensures the scoring logic executes and behaves identically across the single and multi-epoch cases). The default reducer is mean, which doesn't know how to deal with strings, so I expect scoring to fail in this case.
A couple of suggestions here:
1. You can implement a custom reducer that handles scores with string values (you would need to consider what multiple epochs mean in the context of how you're scoring); see the sketch after this list.
2. You could also use the mode reducer, which just chooses the most common value (and hence supports strings).
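Roughly, both options might look like this (a sketch only; it assumes the score_reducer decorator and the Epochs API, and the reducer name and task wiring below are illustrative rather than code from this thread):

```python
# Sketch of both suggestions -- assumes the score_reducer decorator and the
# Epochs API; the reducer name and commented-out task wiring are illustrative.
from inspect_ai import Epochs
from inspect_ai.scorer import Score, ScoreReducer, score_reducer

@score_reducer
def first_epoch() -> ScoreReducer:
    def reduce(scores: list[Score]) -> Score:
        # String values can't be averaged, so keep the first epoch's score as-is.
        return scores[0]
    return reduce

# 1. custom reducer: pass the reducer defined above to Epochs
# task = Task(..., epochs=Epochs(1, first_epoch()))

# 2. built-in mode reducer: picks the most common value, so strings are fine
# task = Task(..., epochs=Epochs(1, "mode"))
```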
I am thinking about how we might improve our support for reducing scores like this, but I haven't come up with a great solution that is consistent no matter the number of epochs. One thing we could do is create a string-specific reducer that just merges any strings into a list; I'll think more about this, but if you have ideas, let me know.
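For illustration, a string-specific reducer along those lines might look something like this (just a sketch of the idea, not an existing inspect_ai reducer):

```python
# Rough sketch of a string-specific reducer that merges every epoch's value into
# a list instead of averaging -- illustrative only, not part of inspect_ai.
from inspect_ai.scorer import Score, ScoreReducer, score_reducer

@score_reducer
def string_list() -> ScoreReducer:
    def reduce(scores: list[Score]) -> Score:
        return Score(value=[score.as_str() for score in scores])
    return reduce
```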
I can also try to address that error in the viewer: the reducer is allowed to be null by the type system, so if the viewer isn't handling that case, it is an issue for sure.
I think if it were just null in the case where mean() couldn't handle the types, that would be enough for me! (Although that's not very "fail fast" if the user has made a mistake; maybe emit a warning?)
I don't really expect a coherent reduction in the string case, and being able to access the per-sample outputs is enough. I'm not sure the list would help me much!