This repository contains the code for our paper Self-Judge: Selective Instruction Following with Alignment Self-Evaluation.
To address the unreliability of large language models (LLMs) when following instructions, we propose selective instruction following: the model declines to execute an instruction when the expected quality of its response is low, which improves reliability.
We introduce Self-J, a new self-training framework for developing judge models that assess response quality without human-annotated quality scores. The judge models exploit the model's inherent self-evaluation ability and use a gold reference for self-calibration.
We validate the judge models with high-quality instruction data collected from Hugging Face and test them extensively, showing strong correlation with GPT-4 and strong performance across domains.
Beyond improving performance on specific benchmarks, our judge models rank 95 models from AlpacaEval with high correlation to GPT-4's ranking, underscoring the potential of alignment self-evaluation for improving LLMs.
We provide one of our trained judge models and the code for generating quality scores with it for evaluation.
We rely on vLLM for inference, so please refer to the vLLM installation instructions if you do not have it installed.
We provide the inference code in judge.py. To get the score, run `python judge.py`.
We release our judge model, tuned for the instruction-following model Vicuna-v1.5, on Hugging Face: Self-J-13B-Vicuna-v1.5.
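For reference, here is a minimal sketch of how a single (instruction, response) pair could be scored with the judge model through vLLM. The model path and the prompt wording are placeholders; judge.py contains the actual prompt template and decoding settings.

```python
# Minimal sketch: score one (instruction, response) pair with the judge model via vLLM.
# The model path and prompt wording are placeholders; see judge.py for the real template.
from vllm import LLM, SamplingParams

judge = LLM(model="path/to/Self-J-13B-Vicuna-v1.5")  # local path or Hugging Face repo id

instruction = "Explain what a hash table is."
response = "A hash table stores key-value pairs and uses a hash function to index them."

prompt = (
    "Below is an instruction and a model response. Rate the quality of the response "
    "on a scale from 0 to 10 and reply with the score only.\n\n"
    f"Instruction: {instruction}\n\nResponse: {response}\n\nScore:"
)

params = SamplingParams(temperature=0.0, max_tokens=8)
score_text = judge.generate([prompt], params)[0].outputs[0].text.strip()
print("Quality score:", score_text)
```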
Our judge models achieve better correlations with GPT-4 than strong baselines, though they can still suffer from distribution shift.
Pearson correlation (×100) between various measures and GPT-4's scores, computed on 850 test samples.
Method | Ours-13b | Vicuna-13b | Wizardlm-13b | Llama-2-13b-chat | Llama-2-70b-chat | Avg. |
---|---|---|---|---|---|---|
**With reference** | | | | | | |
Cosine | 39.75 | 42.81 | 40.81 | 59.04 | 58.82 | 48.25 |
Self-eval | 44.66 | 55.13 | 48.52 | 40.26 | 50.70 | 47.85 |
Self-eval + Cosine | 53.19 | 60.77 | 55.69 | 64.72 | 65.51 | 59.98 |
GPT-3.5-turbo | 66.41 | 66.58 | 69.96 | 73.35 | 75.81 | 70.42 |
GPT-3.5-turbo + Cosine | 68.33 | 69.99 | 70.90 | 78.13 | 78.12 | 73.09 |
Self-J (ours) | 66.75 | 70.95 | 69.56 | 72.76 | 71.70 | 70.34 |
**Without reference** | | | | | | |
PPL | 13.22 | 13.46 | 6.47 | 29.25 | -3.99 | 11.68 |
VRO | 45.20 | 40.03 | 38.24 | 40.66 | 41.47 | 41.12 |
Self-eval | 1.23 | 15.19 | 12.75 | 12.13 | 15.99 | 11.46 |
GPT-3.5-turbo | 15.21 | 25.98 | 19.07 | 20.05 | 22.78 | 20.62 |
Auto-J-13b | 37.02 | 39.68 | 37.88 | 53.71 | 49.43 | 43.54 |
UltraRM-13b | 43.50 | 44.18 | 50.68 | 63.83 | 62.69 | 52.98 |
Results of 13B judge models (Pearson correlation with GPT-4's scores)
Judge Model | Ours-13b | Vicuna-13b | Wizardlm-13b | Llama-2-13b-chat | Llama-2-70b-chat | Avg. |
---|---|---|---|---|---|---|
Judge (Cosine) | 39.73 | 38.78 | 39.21 | 61.20 | 58.06 | 47.40 |
Judge (Self-eval) | 45.02 | 45.14 | 43.61 | 48.13 | 44.57 | 45.29 |
Self-J (ours) | 56.94 | 56.67 | 53.10 | 64.87 | 61.65 | 58.65 |
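The numbers in both tables are Pearson correlations between each method's quality scores and GPT-4's scores on the same test samples. As a minimal illustration with made-up values, such a correlation can be computed with scipy:

```python
# Minimal sketch: Pearson correlation between a judge's scores and GPT-4's scores.
# The two lists must be aligned over the same test samples; the values here are made up.
from scipy.stats import pearsonr

judge_scores = [7.5, 4.0, 9.0, 6.5, 8.0]
gpt4_scores  = [8.0, 3.5, 9.5, 6.0, 7.0]

r, _ = pearsonr(judge_scores, gpt4_scores)
print(f"Pearson correlation: {100 * r:.2f}")  # the tables above report r scaled by 100
```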
Our training code for judge modeling is based on the Alpaca-LoRA project, so you will first have to follow its original instructions to set up the environment.
We provide example training data for tuning the judge model on Hugging Face, where the evaluated model is Vicuna-v1.5 and the quality score is a combination of the model's self-evaluation and cosine similarity (a rough sketch of such a combined score follows the data links below):
- Data with reference answer: data_w_ref.
- Data without reference answer: data_wo_ref.
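As a rough illustration of how such a combined quality score can be formed, the sketch below averages the model's self-evaluation score with the cosine similarity between its response and a gold reference. The embedding model and the equal weighting are illustrative assumptions, not the exact recipe used to build the released data.

```python
# Illustrative sketch: combine self-evaluation with reference-based cosine similarity.
# The embedding model and the simple 50/50 average are assumptions for illustration only.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def combined_quality_score(self_eval_score: float, response: str, reference: str) -> float:
    """self_eval_score is the model's own 0-10 rating of its response."""
    emb = embedder.encode([response, reference], convert_to_tensor=True)
    cosine = util.cos_sim(emb[0], emb[1]).item()        # similarity in [-1, 1]
    cosine_scaled = 10.0 * max(cosine, 0.0)             # map onto the 0-10 scale
    return 0.5 * self_eval_score + 0.5 * cosine_scaled  # simple average of the two signals

print(combined_quality_score(7.0,
                             "Paris is the capital of France.",
                             "The capital of France is Paris."))
```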
To train the judge model, run:
`cd bash`
`bash finetune.kd.sh`
We collect a large-scale set of instructions to study alignment evaluation on generation tasks, such as coding and writing. We manually filtered datasets from Hugging Face as of June 2023, particularly those in the NLP category, and post-processed them to remove low-quality instructions as much as possible while retaining all good-quality ones. We removed instructions that were either too short or too long, and we kept the original instructions without tokenization, paraphrasing, etc., to preserve the real distribution of the instructions. After this curation, we keep 37 datasets in total.

We manually categorized the datasets into three main categories: common, coding, and academic. Common instructions mainly concern everyday matters, such as seeking advice and solving technical problems. All instructions involving coding, such as code generation and debugging, are classified under the coding category. Lastly, subject-specific instructions, such as science and medicine, are categorized as academic.
We have released the collection of instructions; you can download the data from Hugging Face at instruction-5.7m.
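As a rough example of the length-based filtering described above, here is a minimal sketch using the datasets library. The dataset id, the column name, and the word-count thresholds are placeholders and assumptions, not our exact settings.

```python
# Illustrative sketch: length-based filtering of instructions with the datasets library.
# The dataset id, column name, and thresholds are placeholders, not our exact settings.
from datasets import load_dataset

ds = load_dataset("path/to/instruction-5.7m", split="train")  # replace with the actual repo id

def keep(example, min_words=5, max_words=512):
    n_words = len(example["instruction"].split())  # assumes an "instruction" column
    return min_words <= n_words <= max_words

filtered = ds.filter(keep)
print(f"Kept {len(filtered)} of {len(ds)} instructions")
```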
By fine-tuning Llama-2-13B on a random sample of 87K instructions from our collection, with responses generated by GPT-3.5 Turbo, we can match Llama-2-13B-Chat's performance on AlpacaEval.
Models | AlpacaEval 1.0 (win rate) | AlpacaEval 2.0 (win rate) |
---|---|---|
Vicuna 13B v1.5 | - | 6.7 |
Llama-2-13B-Chat | 81.09 | 7.7 |
Ours-Llama-2-13B | 79.13 | 7.33 |
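For context, drawing a random 87K-instruction subset from the collection can be done as in the sketch below; the dataset id is a placeholder, and responses for the sampled instructions would then be collected from GPT-3.5 Turbo before fine-tuning.

```python
# Illustrative sketch: sample a random 87K subset of the instruction collection.
# The dataset id is a placeholder; GPT-3.5 Turbo responses are gathered separately.
from datasets import load_dataset

ds = load_dataset("path/to/instruction-5.7m", split="train")
subset = ds.shuffle(seed=42).select(range(87_000))
subset.to_json("sft_instructions_87k.jsonl")
```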