How can evaluation results under vllm be made closer to hf? #95
Comments
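Hi, it looks like you used greedy decoding for inference in both hf and vllm. Do you get the same output when you run each of hf and vllm multiple times?
I ran the code posted above five times. The hf outputs were identical every time, while vllm produced 2 distinct outputs, detailed below. (The results posted earlier were not produced with these two scripts, so they differ slightly.)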
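The repeated-run check described in this comment can be sketched as below. This is only an illustration, not the script used in this thread: `count_distinct_outputs` is a hypothetical helper, and the strings in `hf_runs` and `vllm_runs` are placeholders standing in for the real generations of one prompt over five runs.

```python
from collections import Counter


def count_distinct_outputs(runs):
    """Map each distinct output string to the number of runs that produced it."""
    return Counter(runs)


# Placeholder data (assumption): five runs of the same prompt per backend.
hf_runs = ["answer A"] * 5
vllm_runs = ["answer A", "answer B", "answer A", "answer A", "answer B"]

hf_counts = count_distinct_outputs(hf_runs)
vllm_counts = count_distinct_outputs(vllm_runs)

# One distinct output means the backend was deterministic over these runs.
print("hf distinct outputs:", len(hf_counts))
print("vllm distinct outputs:", len(vllm_counts))
```

A backend with more than one distinct output for the same prompt is not run-to-run deterministic, even under greedy decoding.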
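Evaluation results under vllm are one or two points lower than under hf for every model tested.
The evaluation code logic is roughly as follows: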
hf
vllm
The outputs that differ are listed below (narrativeqa); output1 is hf and output2 is vllm:
output1: He is living in a boarding house.
output2: He is living in a hotel.
output1: In a cave on the side of a mountain.
output2: In a cave on Atlas' mountain.
output1: 10 years
output2:
output1: A thief.
output2: A slave.
output1: The teens faces were frozen in a look of terror.
output2: They were pale and their eyes were wide open.
output1: Soames is upset with Beerbohm because he is not included in the list of writers who are to be included in the book.
output2: Soames is upset with Beerbohm because Beerbohm has written a book about the literature of the eighteen-nineties and has not included Soames in it.
output1: Falder was worried about Ruth because he was afraid that she might tell the truth about him to the police.
output2: Falder was worried about Ruth because he was afraid that she would tell the police about his criminal past.
output1: The painting was of the Statue of Liberty.
output2: The Ghost of Slimer
output1: The teens faces were frozen in a look of terror.
output2: They were pale and their eyes were wide open.
output1: The Witch creates a Witch.
output2: She creates a Witch.
output1: Reiko discovers that the tape is a copy of the original tape.
output2: Reiko discovers that the tape is a copy of the original.
output1: Holmes had been sitting in the window, and the housekeeper had been standing outside.
output2: Holmes had been sitting on the floor, and his feet were wet.
output1: He is living in a boarding house.
output2: He is living in a hotel.
output1: The video that was shown to them.
output2: The teenagers were discussing the rumor that if you watch a video called "Ring" and then you die within a week, you will come back to life.
output1: Alabama was a prostitute.
output2: She was a prostitute.
output1: Jezzie was a Vietnamese girl who was raped by a group of American soldiers.
output2: Jezzie was a Vietnamese girl who was raped by a soldier.
output1: Nupton thought that Soames was a figment of my brain.
output2: He thought that Soames was a figment of the author's imagination.
output1: He discovered that she had been unfaithful to him.
output2: He discovered that she had been having an affair with a man named Gorenflot.
output1: The message is that Reiko is the next victim of the curse.
output2: The message is that Reiko will die in one week.
output1: The American Revolutionary War.
output2: In the Second War of Independence.
output1: 1927
output2: 2419 A.D.
output1: He believes that artists laugh at his work.
output2:
output1: Because he wants to marry the Baron's daughter.
output2: Baron Henry attacks Castle Drachenhausen because he wants to marry the Baron's daughter.
output1: The Methodist minister, Reverend Simms.
output2: The Mayor, Joe Clark.
output1: Soames's presence in the future affected others by making them aware of the possibility of time travel.
output2: Soames's presence in the future affected others in a variety of ways.
output1: In a cave on the top of Atlas' mountain.
output2: In a cave on the side of a mountain.
output1: Holmes had been sitting in the window, and the housekeeper had been standing outside.
output2: Holmes had been sitting on the floor, and his feet were wet.
output1: When she sees the videotape of her father.
output2: When she sees the videotape.
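To see where each pair starts to differ, one option is to locate the first differing word. The sketch below is an assumption-laden illustration: `first_divergence` is a hypothetical helper, and it compares whitespace-separated words rather than model tokens, so it only approximates the decoding step at which the two backends part ways. The sample pairs are copied from the list above.

```python
def first_divergence(a, b):
    """Return the index of the first whitespace-separated word where a and b
    differ, or None if both strings have the same word sequence."""
    words_a, words_b = a.split(), b.split()
    for i, (wa, wb) in enumerate(zip(words_a, words_b)):
        if wa != wb:
            return i
    # One string is a strict prefix of the other (or one is empty).
    if len(words_a) != len(words_b):
        return min(len(words_a), len(words_b))
    return None


# (hf output, vllm output) pairs taken from the narrativeqa list above.
pairs = [
    ("He is living in a boarding house.", "He is living in a hotel."),
    ("A thief.", "A slave."),
    ("10 years", ""),
]

for hf_out, vllm_out in pairs:
    print(first_divergence(hf_out, vllm_out), repr(hf_out), repr(vllm_out))
```

Pairs that share a long prefix before diverging are consistent with small numerical differences flipping a single greedy choice, after which the continuations drift apart; this is a plausible reading, not something the thread confirms.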
Scores on the other datasets are as follows:
hf
Dataset Metrics
+-----------------+---------+
| Dataset         |   Score |
|-----------------+---------|
| narrativeqa     |    20.6 |
| qasper          |   25.36 |
| multifieldqa_en |   32.27 |
| hotpotqa        |   30.52 |
| 2wikimqa        |   27.07 |
| musique         |   10.32 |
| gov_report      |   26.41 |
| qmsum           |   20.72 |
| multi_news      |    3.31 |
| trec            |    65.5 |
| triviaqa        |   87.12 |
| samsum          |   34.32 |
| lcc             |   66.68 |
| repobench-p     |    59.4 |
|-----------------+---------|
| Avg.            |    36.4 |
+-----------------+---------+
vllm
Dataset Metrics
+-----------------+---------+
| Dataset         |   Score |
|-----------------+---------|
| narrativeqa     |    20.6 |
| qasper          |   25.72 |
| multifieldqa_en |   31.69 |
| hotpotqa        |   30.72 |
| 2wikimqa        |   26.57 |
| musique         |   11.01 |
| gov_report      |   18.77 |
| qmsum           |    8.55 |
| multi_news      |    1.97 |
| trec            |      65 |
| triviaqa        |   88.51 |
| samsum          |    25.5 |
| lcc             |   69.12 |
| repobench-p     |   62.57 |
|-----------------+---------|
| Avg.            | 34.7357 |
+-----------------+---------+
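To see which datasets account for the gap, the two tables can be compared per dataset. The sketch below copies the scores from the tables above; the variable names and the sorting are illustrative choices, not part of the original evaluation code.

```python
# Scores copied from the hf and vllm tables above.
hf = {
    "narrativeqa": 20.6, "qasper": 25.36, "multifieldqa_en": 32.27,
    "hotpotqa": 30.52, "2wikimqa": 27.07, "musique": 10.32,
    "gov_report": 26.41, "qmsum": 20.72, "multi_news": 3.31,
    "trec": 65.5, "triviaqa": 87.12, "samsum": 34.32,
    "lcc": 66.68, "repobench-p": 59.4,
}
vllm = {
    "narrativeqa": 20.6, "qasper": 25.72, "multifieldqa_en": 31.69,
    "hotpotqa": 30.72, "2wikimqa": 26.57, "musique": 11.01,
    "gov_report": 18.77, "qmsum": 8.55, "multi_news": 1.97,
    "trec": 65, "triviaqa": 88.51, "samsum": 25.5,
    "lcc": 69.12, "repobench-p": 62.57,
}

# Per-dataset difference (vllm minus hf), most negative first.
deltas = sorted(
    ((name, round(vllm[name] - hf[name], 2)) for name in hf),
    key=lambda item: item[1],
)
for name, delta in deltas:
    print(f"{name:16s} {delta:+.2f}")

# Unweighted means, which should match the "Avg." rows in the tables.
avg_hf = sum(hf.values()) / len(hf)
avg_vllm = sum(vllm.values()) / len(vllm)
print(f"avg hf={avg_hf:.4f} avg vllm={avg_vllm:.4f}")
```

On these numbers the largest drops fall on qmsum, samsum and gov_report, while several other datasets are level or slightly higher under vllm, so the lower average is driven mainly by a few tasks rather than a uniform loss.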