
some question #6

Open
yuanzhiyong1999 opened this issue Jul 16, 2024 · 6 comments

Comments

@yuanzhiyong1999

Hello, I have a question: after I ran model_inference.py and got the results, do I need to use my own model to run inference on all the questions before executing llm_eval.py? And what will the results look like once inference is complete? I saw parameters such as gpt4_discriminative_eval_input_path in llm_eval.py, and I don't understand how they work. Looking forward to your reply.
@YJiangcm

@yuanzhiyong1999 (Author)

Why doesn't the number of items under 'data' match the number in 'api_input'?
[screenshot: item counts under data vs. api_input]
@YJiangcm

@lzzzx666

llm_eval.py evaluates all of the results except the example constraints; example constraints are checked by rules instead of by the LLM. After running llm_eval.py, you should run eval.py to get the final result.
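
For later readers, a minimal sketch of the split described above; all names here (the `constraint_type` field, `split_for_evaluation`) are illustrative, not the repo's actual API:

```python
# Minimal sketch of the split described above. Items with example
# constraints are checked by rules; everything else is queued for the
# GPT-4 judge, and eval.py-style aggregation happens afterwards.

def split_for_evaluation(items: list) -> tuple:
    """Partition items into (rule_checked, llm_judged)."""
    rule_items = [it for it in items if it["constraint_type"] == "example"]
    llm_items = [it for it in items if it["constraint_type"] != "example"]
    return rule_items, llm_items

demo = [
    {"constraint_type": "example", "model_output": "..."},
    {"constraint_type": "content", "model_output": "..."},
]
rule_items, llm_items = split_for_evaluation(demo)
print(len(rule_items), "rule-checked;", len(llm_items), "sent to the LLM judge")
```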

@zhejunliux

model_inference.py runs the model under evaluation.
llm_eval.py post-processes that model's outputs and feeds them to GPT-4.
eval.py:
the rule-based side produces a score from the evaluated model's outputs;
the GPT-4 side also produces a score;
finally the two are merged to compute HSR, SSR, and CSL.

The code is extremely repetitive 😂, really hard to read.
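
To make the merge step concrete, here is a toy illustration of the three metrics as I read the paper's definitions; this is not the repo's code, and the shape of `verdicts` (a map from difficulty level to per-constraint pass/fail booleans) is an assumption:

```python
# Toy illustration of the metrics (assumption: verdicts[level] holds
# per-constraint pass/fail booleans for one instruction at that
# difficulty level, levels 1..5).

def hsr(verdicts: dict) -> float:
    # Hard Satisfaction Rate: a level counts only if ALL constraints pass
    return sum(all(v) for v in verdicts.values()) / len(verdicts)

def ssr(verdicts: dict) -> float:
    # Soft Satisfaction Rate: average over individual constraints
    flat = [ok for v in verdicts.values() for ok in v]
    return sum(flat) / len(flat)

def csl(verdicts: dict) -> int:
    # Consistent Satisfaction Levels: consecutive fully-satisfied
    # levels, counted up from level 1
    level = 0
    while all(verdicts.get(level + 1, [False])):
        level += 1
    return level

demo = {1: [True], 2: [True, True], 3: [True, False],
        4: [False] * 4, 5: [False] * 5}
print(hsr(demo), ssr(demo), csl(demo))  # 0.4, ~0.267, 2
```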

@bittersweet1999

bittersweet1999 commented Sep 5, 2024

> model_inference.py runs the model under evaluation. llm_eval.py post-processes its outputs and feeds them to GPT-4. eval.py: the rule-based side produces a score from the model outputs; the GPT-4 side also produces a score; finally the two are merged to compute HSR, SSR, and CSL.
>
> The code is extremely repetitive 😂, really hard to read.

I'm refactoring this code, and wanted to ask: the questions covered by the rule-based evaluation and the LLM-based evaluation should be different, right? So it should be possible to split the data into two sets and run them separately? Right now there's such a tangle of if statements that I can't follow the logic at all.

@zhejunliux

Evaluation methods: rule-based (rules) + LLM-based (judge model, GPT).
Data subsets: the six categories it provides: content, example, mixed, etc.
Metrics: HSR, SSR, CSL.

The example subset is scored on (HSR, CSL); as a special case, it is computed directly from the evaluated model's outputs: def evaluate_example_constraint and def csl_evaluation.
The other five: GPT is called with an assembled prompt to produce judgments. def discriminative_evaluation and def rule_evaluation produce the results, and HSR and SSR read different positions of the result array: discriminative_result[0] --> HSR; discriminative_result[1] --> SSR. CSL: def csl_evaluation.

The rule-based path scores the evaluated model's (Llama 3 etc.) outputs directly; the LLM-based path uses GPT to judge the Llama 3 outputs, usually called labeling. The LLM-based step itself doesn't aggregate the results.
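
A sketch of that routing, based purely on my reading above; the four evaluation functions below are stubs standing in for the real ones, and the merge-by-averaging step is my assumption:

```python
# Routing sketch (stubs stand in for the repo's evaluate_example_constraint /
# discriminative_evaluation / rule_evaluation / csl_evaluation; result
# indexing follows the comment above: [0] -> HSR, [1] -> SSR).

def evaluate_example_constraint(items):   # stub: rule check on raw outputs
    return 1.0

def discriminative_evaluation(items):     # stub: GPT-4 judgments
    return (1.0, 0.8)

def rule_evaluation(items):               # stub: rule-based judgments
    return (1.0, 0.9)

def csl_evaluation(items):                # stub: consecutive-level metric
    return 3

def score_category(category, items):
    if category == "example":
        # special case: scored directly from the model outputs, no GPT call
        return {"hsr": evaluate_example_constraint(items),
                "csl": csl_evaluation(items)}
    gpt = discriminative_evaluation(items)
    rule = rule_evaluation(items)
    # merging by averaging the two sources is an assumption on my part
    return {"hsr": (gpt[0] + rule[0]) / 2,
            "ssr": (gpt[1] + rule[1]) / 2,
            "csl": csl_evaluation(items)}

print(score_category("example", []))
print(score_category("content", []))
```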

@kkk-an

kkk-an commented Sep 23, 2024

> Evaluation methods: rule-based (rules) + LLM-based (judge model, GPT). Data subsets: the six categories it provides: content, example, mixed, etc. Metrics: HSR, SSR, CSL.
>
> The example subset is scored on (HSR, CSL); as a special case, it is computed directly from the evaluated model's outputs: def evaluate_example_constraint and def csl_evaluation. The other five: GPT is called with an assembled prompt to produce judgments. def discriminative_evaluation and def rule_evaluation produce the results, and HSR and SSR read different positions of the result array: discriminative_result[0] --> HSR; discriminative_result[1] --> SSR. CSL: def csl_evaluation.
>
> The rule-based path scores the evaluated model's (Llama 3 etc.) outputs directly; the LLM-based path uses GPT to judge the Llama 3 outputs, usually called labeling. The LLM-based step itself doesn't aggregate the results.

There's one point in the code that really confuses me, and I'm not sure if it's just me:
As far as I can tell, the content type has only 50 items that need LLM eval, yet gpt4_discriminative_eval_input actually contains 65; likewise, mixed should have only 10 results but has 45... I don't understand what's going on here.
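
If it helps anyone reproduce this, counting the file's entries per constraint type should show where the extra items come from; the JSON-lines layout and the "constraint_type" field name below are guesses, so adjust them to the file's actual schema:

```python
# Debugging sketch: tally the gpt4_discriminative_eval_input entries per
# constraint type. The JSON-lines layout and the "constraint_type" field
# name are assumptions; adjust them to the file's actual schema.
import json
from collections import Counter

counts = Counter()
with open("gpt4_discriminative_eval_input.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        counts[record.get("constraint_type", "unknown")] += 1

for constraint_type, n in counts.most_common():
    print(constraint_type, n)
```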
