From d8159672988a12bff03d0000631540dd7f46ae64 Mon Sep 17 00:00:00 2001
From: Guokan Shang
Date: Fri, 24 Jan 2025 19:13:37 +0100
Subject: [PATCH] Update

---
 index.html               |  8 ++++----
 program-day-1/index.html | 16 ++++++++--------
 program-day-2/index.html |  6 +++---
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/index.html b/index.html
index cd0471b..d5251ae 100644
--- a/index.html
+++ b/index.html
@@ -89,10 +89,10 @@

Program Highlights & Poster Registration

Organizers

Organizing committee:
- Eric Moulines, Ecole Polytechnique & MBZUAI
- Guokan Shang, MBZUAI France Lab
- Michalis Vazirgiannis, Ecole Polytechnique & MBZUAI
- Kun Zhang, MBZUAI
+ Eric Moulines, Ecole Polytechnique & MBZUAI
+ Guokan Shang, MBZUAI France Lab
+ Michalis Vazirgiannis, Ecole Polytechnique & MBZUAI
+ Kun Zhang, MBZUAI

Logistics support:
diff --git a/program-day-1/index.html b/program-day-1/index.html
index a5e4ae6..e105637 100644
--- a/program-day-1/index.html
+++ b/program-day-1/index.html
@@ -57,7 +57,7 @@

Program on Wednesday, February 12

- 9:00am
+ 09:00am
Registration and Coffee & Tea!
@@ -173,7 +173,7 @@

Program on Wednesday, February 12

- 12:20am
+ 12:20pm
Exploiting Knowledge for Model-based Deep Music Generation
@@ -194,7 +194,7 @@

Program on Wednesday, February 12

- 12:20pm
+ 13:00pm
Lunch
@@ -212,7 +212,7 @@

Program on Wednesday, February 12

@@ -247,7 +247,7 @@

Program on Wednesday, February 12

- 14:00am
+ 14:00pm
TBD
@@ -237,7 +237,7 @@

Program on Wednesday, February 12

14:40pm
- TBD
+ Intricacies of Game-theoretical LLM Alignment
- TBD
+ Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task.
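As a rough illustration of the two ingredients the abstract mentions, here is a minimal sketch of an IPO-style regression loss and a Nash-MD-style geometric mixture between the online and reference policies. It is not taken from the talk or from this patch; the names (ipo_loss, mixture_logits) and the values of tau and beta are illustrative assumptions, and PyTorch is assumed as the framework.

# Minimal sketch, not the speakers' implementation; names and hyperparameters are illustrative.
import torch

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    # IPO regresses the policy/reference log-ratio gap between the preferred (w)
    # and dispreferred (l) generation toward the constant 1/(2*tau).
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()

def mixture_logits(policy_logits, ref_logits, beta=0.125):
    # Nash-MD-style geometric mixture pi^(1-beta) * pi_ref^beta, expressed as
    # mixed next-token logits (the normalisation is handled by the softmax).
    return (1.0 - beta) * policy_logits + beta * ref_logits

# Toy usage with made-up sequence log-probabilities.
logp_w, logp_l = torch.tensor([-12.3]), torch.tensor([-15.8])
ref_logp_w, ref_logp_l = torch.tensor([-13.0]), torch.tensor([-14.9])
print(ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l))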
@@ -257,7 +257,7 @@

Program on Wednesday, February 12

diff --git a/program-day-2/index.html b/program-day-2/index.html
index 907ffee..e7a4626 100644
--- a/program-day-2/index.html
+++ b/program-day-2/index.html
@@ -57,7 +57,7 @@

Program on Thursday, February 13

- 14:50pm
+ 15:20pm
Coffee & Tea Break
@@ -339,7 +339,7 @@

Program on Wednesday, February 12

18:00pm
- Poster Session at MBZUAI France Lab
+ Poster Session with Buffet at MBZUAI France Lab
- 9:00am
+ 09:00am
Registration and Coffee & Tea!
@@ -173,7 +173,7 @@

Program on Thursday, February 13

- 12:20am
+ 12:20pm
GFlowNets: A Novel Framework for Diverse Generation in Combinatorial and Continuous Spaces
@@ -194,7 +194,7 @@

Program on Thursday, February 13

- 13:10pm
+ 13:00pm
Lunch