InternLM2-Reward Model Card

Introduction

InternLM2-Reward is a series of reward models trained on the foundation of InternLM2-Chat-SFT. These models were trained on over 2.4 million preference samples, both human-annotated and AI-synthesized, and achieve outstanding performance while maintaining a balance between helpfulness and harmlessness.

Key Features:

  • Variety of Sizes Available: Our open-sourced reward models are available in sizes of 1.8B, 7B, and 20B, each demonstrating exceptional performance across various metrics. We aim for these different-sized models to facilitate research on the scaling laws of reward models, providing valuable insights to the community.
  • Comprehensive Coverage of Preferences: Trained on 2.4 million preference pairs derived from both human annotations and AI synthesis, covering diverse areas such as dialogue, writing, poetry, summarization, coding, and mathematics, while maintaining a balance between helpfulness and harmlessness.
  • Multilingual Support: InternLM2-Reward was trained on high-quality English and Chinese preference data, delivering robust performance in both languages.

This model was applied in the RLHF training process of InternLM2-Chat. The reward model training techniques from the InternLM2 Technical Report have been open-sourced in XTuner; give them a try!

Download links

| Model | RewardBench Score | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | Release Date |
| --- | --- | --- | --- | --- | --- |
| InternLM2-1.8B-Reward | 80.6 | 🤗 internlm2-1_8b-reward | internlm2-1_8b-reward | Open in OpenXLab | 2024-07-19 |
| InternLM2-7B-Reward | 86.6 | 🤗 internlm2-7b-reward | internlm2-7b-reward | Open in OpenXLab | 2024-07-19 |
| InternLM2-20B-Reward | 89.5 | 🤗 internlm2-20b-reward | internlm2-20b-reward | Open in OpenXLab | 2024-07-19 |
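
If you prefer to download the weights ahead of time (for example, to a shared cache or an offline machine), a minimal sketch using the Hugging Face Hub client is shown below; the local_dir path is an arbitrary example.

```python
from huggingface_hub import snapshot_download

# Fetch the 7B reward model weights from the Hugging Face Hub.
# The repo_id matches the Transformers link above; local_dir is an
# arbitrary example destination.
snapshot_download(
    repo_id="internlm/internlm2-7b-reward",
    local_dir="./internlm2-7b-reward",
)
```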

Performance Evaluation on RewardBench

| Models | Score | Chat | Chat Hard | Safety | Reasoning |
| --- | --- | --- | --- | --- | --- |
| InternLM2-20B-Reward | 89.5 | 98.6 | 74.1 | 89.4 | 95.7 |
| InternLM2-7B-Reward | 86.6 | 98.6 | 66.7 | 88.3 | 92.8 |
| InternLM2-1.8B-Reward | 80.6 | 95.0 | 58.1 | 81.8 | 87.4 |

  • The evaluation is conducted on the RewardBench dataset.
  • For a fair comparison, the conditional system prompts proposed in our technical report were not included during testing.
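
For reference, RewardBench accuracy reduces to checking, pair by pair, whether the reward model scores the chosen (preferred) response above the rejected one. The sketch below illustrates that pairwise check with the get_scores API documented in the Demo Code section; the two example pairs are hypothetical stand-ins for real RewardBench data.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the reward model and tokenizer (same setup as in the Demo Code below).
model = AutoModel.from_pretrained(
    "internlm/internlm2-7b-reward",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b-reward", trust_remote_code=True)

# Hypothetical example pair; RewardBench provides many such pairs, each a chat
# in the [{"role": ..., "content": ...}] format.
chosen_chats = [
    [{"role": "user", "content": "What is 2 + 2?"},
     {"role": "assistant", "content": "2 + 2 equals 4."}],
]
rejected_chats = [
    [{"role": "user", "content": "What is 2 + 2?"},
     {"role": "assistant", "content": "2 + 2 equals 5."}],
]

chosen_scores = model.get_scores(tokenizer, chosen_chats)
rejected_scores = model.get_scores(tokenizer, rejected_chats)

# A pair counts as correct when the chosen response outscores the rejected one.
correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
print("pairwise accuracy:", correct / len(chosen_scores))
```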

Demo Code

Basic Usage

We provide some user-friendly APIs for you to use the model. Here is an example of how to use the model to get the reward score of a chat, compare two chats, or rank multiple chats.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "internlm/internlm2-20b-reward",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-20b-reward", trust_remote_code=True)

chat_1 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "My name is InternLM2! A helpful AI assistant. What can I do for you?"}
]
chat_2 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "I have no idea."}
]


# get reward score for a single chat
score1 = model.get_score(tokenizer, chat_1)
score2 = model.get_score(tokenizer, chat_2)
print("score1: ", score1)
print("score2: ", score2)
# >>> score1:  0.767578125
# >>> score2:  -2.22265625


# batch inference, get multiple scores at once
scores = model.get_scores(tokenizer, [chat_1, chat_2])
print("scores: ", scores)
# >>> scores:  [0.767578125, -2.22265625]


# compare whether chat_1 is better than chat_2
compare_res = model.compare(tokenizer, chat_1, chat_2)
print("compare_res: ", compare_res)
# >>> compare_res:  True


# rank multiple chats, it will return the ranking index of each chat
# the chat with the highest score will have ranking index as 0
rank_res = model.rank(tokenizer, [chat_1, chat_2])
print("rank_res: ", rank_res)  # lower index means higher score
# >>> rank_res:  [0, 1]
```
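
The scores returned by get_score and get_scores are unnormalized scalar outputs of the reward head, so only their relative ordering is meaningful. Under the Bradley-Terry formulation commonly used to train reward models, the score difference between two responses can be read as a preference probability. The snippet below is a small interpretation aid continuing from the example above, not part of the model's official API.

```python
import torch

# Bradley-Terry style reading of the scores: P(chat_1 preferred over chat_2)
# is modeled as the sigmoid of the score difference.
prob_1_over_2 = torch.sigmoid(torch.tensor(score1 - score2)).item()
print("P(chat_1 preferred over chat_2):", prob_1_over_2)
# With the example scores above (0.77 vs. -2.22) this comes out to roughly 0.95.
```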

Best of N Sampling

Here is an example of how to use the reward model to perform best of N sampling. The code below demonstrates how to select the best response from the candidates generated by the language model.

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# prepare the llm model and tokenizer
llm = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-chat-7b",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
llm_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)

# prepare the reward model and tokenizer
reward = AutoModel.from_pretrained(
    "internlm/internlm2-20b-reward",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-20b-reward", trust_remote_code=True)

# prepare the chat prompt
prompt = "Write an article about the artificial intelligence revolution."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = llm_tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = llm_tokenizer([text], return_tensors="pt").to("cuda")

# generate best of N candidates
num_candidates = 10  # N=10
candidates = []

outputs = llm.generate(
    **model_inputs,
    max_new_tokens=512,
    num_return_sequences=num_candidates,
    pad_token_id=llm_tokenizer.eos_token_id,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
outputs = outputs[:, model_inputs["input_ids"].shape[1]:]
for i in range(num_candidates):
    candidate = llm_tokenizer.decode(outputs[i], skip_special_tokens=True)
    candidates.append(messages + [{"role": "assistant", "content": candidate}])

rank_indices = reward.rank(reward_tokenizer, candidates)
sorted_candidates = sorted(zip(rank_indices, candidates), key=lambda x: x[0])

## print the ranked candidates
# for i, (rank_index, candidate) in enumerate(sorted_candidates):
#     print(f"------------Rank {i}------------: \n{candidate[-1]['content']}")

# print the best response
best_response = sorted_candidates[0][1][-1]['content']
print(best_response)
```
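
Because rank only returns ordering indices, an equivalent way to pick the winner is to score every candidate with get_scores and keep the argmax; a brief alternative sketch continuing from the code above:

```python
# Alternative: score every candidate directly and keep the highest-scoring one.
scores = reward.get_scores(reward_tokenizer, candidates)
best_idx = max(range(len(scores)), key=lambda i: scores[i])
print(candidates[best_idx][-1]["content"])
```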