
some question #6

Open
yuanzhiyong1999 opened this issue Jul 16, 2024 · 6 comments

Comments

@yuanzhiyong1999

Hello, I have a question: after I ran model_inference.py and got the results, do I need to use my own model to run inference on all the questions before executing llm_eval.py? And what will the results look like once inference is complete? I saw parameters such as gpt4_discriminative_eval_input_path in llm_eval.py, and I don't understand how they work. Looking forward to your reply.
@YJiangcm

@yuanzhiyong1999 (Author)

Why doesn't the number of items under 'data' match the number in 'api_input'?
[screenshot: item counts under data vs. api_input]
@YJiangcm

@lzzzx666

llm_eval.py evaluates all of the results except the example constraints; example constraints are checked by rules instead of by the LLM. After running llm_eval.py, you should run eval.py to get the final result.
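
For later readers, a minimal sketch of the split described above; all names here (the `constraint_type` field, `split_for_evaluation`) are illustrative, not the repo's actual API:

```python
# Minimal sketch of the split described above. Items with example
# constraints are checked by rules; everything else is queued for the
# GPT-4 judge, and eval.py-style aggregation happens afterwards.

def split_for_evaluation(items: list) -> tuple:
    """Partition items into (rule_checked, llm_judged)."""
    rule_items = [it for it in items if it["constraint_type"] == "example"]
    llm_items = [it for it in items if it["constraint_type"] != "example"]
    return rule_items, llm_items

demo = [
    {"constraint_type": "example", "model_output": "..."},
    {"constraint_type": "content", "model_output": "..."},
]
rule_items, llm_items = split_for_evaluation(demo)
print(len(rule_items), "rule-checked;", len(llm_items), "sent to the LLM judge")
```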

@zhejunliux

model_inference.py runs the model under evaluation.
llm_eval.py post-processes that model's outputs and feeds them to GPT-4.
eval.py:
the rule-based side produces a score from the evaluated model's outputs;
the GPT-4 side also produces a score;
finally the two are merged to compute HSR, SSR, and CSL.

The code is extremely repetitive 😂, really hard to read.
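
To make the merge step concrete, here is a toy illustration of the three metrics as I read the paper's definitions; this is not the repo's code, and the shape of `verdicts` (a map from difficulty level to per-constraint pass/fail booleans) is an assumption:

```python
# Toy illustration of the metrics (assumption: verdicts[level] holds
# per-constraint pass/fail booleans for one instruction at that
# difficulty level, levels 1..5).

def hsr(verdicts: dict) -> float:
    # Hard Satisfaction Rate: a level counts only if ALL constraints pass
    return sum(all(v) for v in verdicts.values()) / len(verdicts)

def ssr(verdicts: dict) -> float:
    # Soft Satisfaction Rate: average over individual constraints
    flat = [ok for v in verdicts.values() for ok in v]
    return sum(flat) / len(flat)

def csl(verdicts: dict) -> int:
    # Consistent Satisfaction Levels: consecutive fully-satisfied
    # levels, counted up from level 1
    level = 0
    while all(verdicts.get(level + 1, [False])):
        level += 1
    return level

demo = {1: [True], 2: [True, True], 3: [True, False],
        4: [False] * 4, 5: [False] * 5}
print(hsr(demo), ssr(demo), csl(demo))  # 0.4, ~0.267, 2
```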

@bittersweet1999

bittersweet1999 commented Sep 5, 2024

> model_inference.py runs the model under evaluation. llm_eval.py post-processes its outputs and feeds them to GPT-4. eval.py: the rule-based side produces a score from the model outputs; the GPT-4 side also produces a score; finally the two are merged to compute HSR, SSR, and CSL.
>
> The code is extremely repetitive 😂, really hard to read.

I'm refactoring this code, and wanted to ask: the questions covered by the rule-based evaluation and the LLM-based evaluation should be different, right? So it should be possible to split the data into two sets and run them separately? Right now there's such a tangle of if statements that I can't follow the logic at all.

@zhejunliux

Evaluation methods: rule-based (rules) + LLM-based (judge model, GPT).
Data subsets: the six categories it provides: content, example, mixed, etc.
Metrics: HSR, SSR, CSL.

The example subset is scored on (HSR, CSL); as a special case, it is computed directly from the evaluated model's outputs: def evaluate_example_constraint and def csl_evaluation.
The other five: GPT is called with an assembled prompt to produce judgments. def discriminative_evaluation and def rule_evaluation produce the results, and HSR and SSR read different positions of the result array: discriminative_result[0] --> HSR; discriminative_result[1] --> SSR. CSL: def csl_evaluation.

The rule-based path scores the evaluated model's (Llama 3 etc.) outputs directly; the LLM-based path uses GPT to judge the Llama 3 outputs, usually called labeling. The LLM-based step itself doesn't aggregate the results.
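
A sketch of that routing, based purely on my reading above; the four evaluation functions below are stubs standing in for the real ones, and the merge-by-averaging step is my assumption:

```python
# Routing sketch (stubs stand in for the repo's evaluate_example_constraint /
# discriminative_evaluation / rule_evaluation / csl_evaluation; result
# indexing follows the comment above: [0] -> HSR, [1] -> SSR).

def evaluate_example_constraint(items):   # stub: rule check on raw outputs
    return 1.0

def discriminative_evaluation(items):     # stub: GPT-4 judgments
    return (1.0, 0.8)

def rule_evaluation(items):               # stub: rule-based judgments
    return (1.0, 0.9)

def csl_evaluation(items):                # stub: consecutive-level metric
    return 3

def score_category(category, items):
    if category == "example":
        # special case: scored directly from the model outputs, no GPT call
        return {"hsr": evaluate_example_constraint(items),
                "csl": csl_evaluation(items)}
    gpt = discriminative_evaluation(items)
    rule = rule_evaluation(items)
    # merging by averaging the two sources is an assumption on my part
    return {"hsr": (gpt[0] + rule[0]) / 2,
            "ssr": (gpt[1] + rule[1]) / 2,
            "csl": csl_evaluation(items)}

print(score_category("example", []))
print(score_category("content", []))
```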

@kkk-an

kkk-an commented Sep 23, 2024

> Evaluation methods: rule-based (rules) + LLM-based (judge model, GPT). Data subsets: the six categories it provides: content, example, mixed, etc. Metrics: HSR, SSR, CSL.
>
> The example subset is scored on (HSR, CSL); as a special case, it is computed directly from the evaluated model's outputs: def evaluate_example_constraint and def csl_evaluation. The other five: GPT is called with an assembled prompt to produce judgments. def discriminative_evaluation and def rule_evaluation produce the results, and HSR and SSR read different positions of the result array: discriminative_result[0] --> HSR; discriminative_result[1] --> SSR. CSL: def csl_evaluation.
>
> The rule-based path scores the evaluated model's (Llama 3 etc.) outputs directly; the LLM-based path uses GPT to judge the Llama 3 outputs, usually called labeling. The LLM-based step itself doesn't aggregate the results.

There's one point in the code that really confuses me, and I'm not sure if it's just me:
As far as I can tell, the content type has only 50 items that need LLM eval, yet gpt4_discriminative_eval_input actually contains 65; likewise, mixed should have only 10 results but has 45... I don't understand what's going on here.
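
If it helps anyone reproduce this, counting the file's entries per constraint type should show where the extra items come from; the JSON-lines layout and the "constraint_type" field name below are guesses, so adjust them to the file's actual schema:

```python
# Debugging sketch: tally the gpt4_discriminative_eval_input entries per
# constraint type. The JSON-lines layout and the "constraint_type" field
# name are assumptions; adjust them to the file's actual schema.
import json
from collections import Counter

counts = Counter()
with open("gpt4_discriminative_eval_input.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        counts[record.get("constraint_type", "unknown")] += 1

for constraint_type, n in counts.most_common():
    print(constraint_type, n)
```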
