
RLHF #18

Open · 0x4007 opened this issue Oct 27, 2024 · 15 comments

@0x4007 (Member) commented Oct 27, 2024

The acronym is RLHF, which stands for Reinforcement Learning from Human Feedback. This method involves humans guiding the behavior and outputs of the model by providing feedback on its responses, helping it learn desired responses over time.

We should support three inputs for the model to learn from:

  1. Thumbs up
  2. Thumbs down
  3. Edits

Edits could be useful for comparing against what the output should have been. That way we have a chance to correct specific mistakes. In the example below it writes the .ubiquity-os.config.yml name incorrectly, and this could be a good opportunity to teach it.

ubiquity-os/ubiquity-os-kernel#111 (comment)
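
Purely as an illustrative sketch, the three feedback signals could be captured with something like the shape below (the FeedbackRecord type and its field names are hypothetical, not anything that exists in the codebase yet):

```typescript
// Hypothetical shape for a single piece of human feedback on an LLM response.
type FeedbackKind = "thumbs_up" | "thumbs_down" | "edit";

interface FeedbackRecord {
  kind: FeedbackKind;
  commentId: number;      // the LLM response comment this feedback applies to
  originalBody?: string;  // only for "edit": the model's original output
  editedBody?: string;    // only for "edit": what the human changed it to
  createdAt: string;      // ISO timestamp
}

// Example: an edit teaching the model the correct config filename.
const example: FeedbackRecord = {
  kind: "edit",
  commentId: 123456789,
  originalBody: "Update ubiquity-os-config.yml to enable the plugin.",
  editedBody: "Update .ubiquity-os.config.yml to enable the plugin.",
  createdAt: new Date().toISOString(),
};
```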

@0x4007 (Member, Author) commented Oct 27, 2024

@sshivaditya2019 Rfc

Roughly estimating a week, but I'm also curious if you can help clarify the implementation details.

I'm assuming v1 of this can be used to affect our RAG/embeddings search first, and then v2 could eventually be used to fine-tune a specific model.

@Keyrxng (Member) commented Oct 27, 2024

This feels like the wrong repo to open this task in, as it's a model-training task, but we'd definitely add structuredMetadata here as part of this idea.

  1. We'd need to store the AI response, and using the DB comes with overhead.
  2. We should use structuredMetadata like I've implemented in feat: dynamic ground truths #14 for posting ground truths; here we'd post which models were used for creating the embeddings, the ranking, the general query, etc. (a rough sketch follows this list).
  3. When the time comes, we write a script to parse all of our LLM response comments and extract the comment metadata, which contains the reactions to the comment and the edits (we may need GraphQL to fetch these).
  4. Sanitize and prep the dataset and then train our own model.
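
A minimal sketch of what that metadata could carry, loosely modeled on the ground-truths metadata from #14 (the field names and the HTML-comment marker are assumptions, not the exact schema):

```typescript
// Illustrative metadata payload attached to each LLM response comment.
interface LlmResponseMetadata {
  replyToCommentId: number; // comment the model responded to, for easy indexing
  models: {
    embedding: string;      // model used to create the embeddings
    reranker?: string;      // model used for ranking, if any
    completion: string;     // model used for the general query
  };
  groundTruths: string[];
  tokens: { prompt: number; completion: number };
}

// One possible pattern: append the metadata as a hidden HTML comment so a
// script can parse it back out of the comment body later.
function withMetadata(body: string, meta: LlmResponseMetadata): string {
  return `${body}\n\n<!-- UbiquityOS-Metadata ${JSON.stringify(meta)} -->`;
}
```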

@0x4007 (Member, Author) commented Oct 27, 2024

Simpler to just compare the earliest and latest comment revisions. I don't think we need to complicate this by storing things in a database or generating any embeddings. Just add the diff to the context and say:

For this example output 
(Before)
Make sure to adjust the response to accommodate 
(After diff) 

Something like that.

We could possibly optimize this to show only the diff instead of the entire original comment. That would clearly show it the filename of our .ubiquity-os.config.yml
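
A rough sketch of how that before/after context could be assembled once we have the earliest and latest revisions of the comment (GitHub exposes edit history through the GraphQL userContentEdits connection; the helper below is hypothetical):

```typescript
// Hypothetical helper: build a small correction block for the prompt context
// from the first and last revisions of an edited LLM response.
function buildCorrectionContext(before: string, after: string): string {
  return [
    "For this example output:",
    "(Before)",
    before,
    "",
    "Make sure to adjust the response to accommodate:",
    "(After)",
    after,
  ].join("\n");
}

// Usage with the config filename mistake from the linked comment:
const correction = buildCorrectionContext(
  "Rename your config to ubiquity-os-config.yml",  // the model's original output
  "Rename your config to .ubiquity-os.config.yml"  // the human-corrected revision
);
```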

@Keyrxng (Member) commented Oct 27, 2024

I may be out of my depth, but we need to move away from thinking in terms of prompt injection when considering model training. We need to fine-tune a model with specific datasets to embed them into its foundational knowledge, separate from our prompts.

This training will also be partner-specific, so we should build with this in mind so we can white-label it as a service for our partners' chatbots.

@0x4007 (Member, Author) commented Oct 27, 2024

First step is always prompting.

Later comes RAG/true fine tuning.

Technically this may not qualify as true RLHF.

@sshivaditya2019 (Collaborator) commented

I think this could be designed as follows: we keep track of the reactions and edits, and we keep track of edit diffs as well. That way we can monitor both positive and negative examples within a repository and incorporate them into the prompt at query time. This process should be implemented at the repository level.

For positive examples we could add this:

Input: [Original context]
Weight: +1

For negative examples:

Input: [Original context]
Weight: -1

For edits we would need to track what was changed in a particular edit:

Original Output Line: [Model output]
Corrected Output Line: [User's correction]

With this approach, we can create a word/phrase weight dictionary, where certain words receive higher rewards while others incur greater penalties. We can provide this dictionary to the LLM as context and calculate the overall score for each generation. If the score falls below the organization-wide limit, we can trigger a restart of the prompt generation process.
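
A minimal sketch of that scoring step, assuming a simple substring match and an assumed organization-wide threshold:

```typescript
// Per-repository weight dictionary: phrases the model should favour get
// positive weights, phrases it should avoid get negative weights.
type WeightDictionary = Record<string, number>;

const weights: WeightDictionary = {
  ".ubiquity-os.config.yml": 2, // learned from an edit correction
  "ubiquity-os-config.yml": -2, // the incorrect filename the model used
};

// Score a generated response by summing the weights of the phrases it contains.
function scoreGeneration(output: string, dict: WeightDictionary): number {
  return Object.entries(dict).reduce(
    (score, [phrase, weight]) => (output.includes(phrase) ? score + weight : score),
    0
  );
}

// If the score falls below the organization-wide limit, regenerate.
const ORG_SCORE_LIMIT = 0; // assumed threshold
function shouldRegenerate(output: string, dict: WeightDictionary): boolean {
  return scoreGeneration(output, dict) < ORG_SCORE_LIMIT;
}
```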

@Keyrxng (Member) commented Oct 28, 2024

> we keep track of the reactions and edits, and we keep track of edit diffs as well

GitHub does this for us already, so we should not use our own DB to handle that aspect.

> We can monitor both positive and negative examples within a repository and incorporate them into the prompt at query time.

This would imply that all corrections and edits made repository-wide are going to be pulled from a DB (I assumed you meant keeping track via DB storage) and fed into the context window?


  1. No Supabase; it's not required, and we can use GitHub alone if we wanted to.
  2. We use structuredMetadata to create a trackable header per LLM response comment and have it contain key details: the comment_id it replied to for easy indexing, models used, tokens, groundTruths, and application (e.g. chatbot or codeReview).
  3. We either A) use the Search API to find the comment header in that repo, or B) list all comments and use the parser.
  4. The comment object contains the reactions on the comment (perhaps reserve one reaction to regenerate the response with?); a rough sketch of this flow follows this list.
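
A rough sketch of this flow with Octokit, assuming option B (list all comments and parse); the metadata marker is a placeholder for however the structuredMetadata header actually gets embedded:

```typescript
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Placeholder marker; the real header format would come from structuredMetadata.
const METADATA_MARKER = "<!-- UbiquityOS-Metadata";

// List every issue comment in the repo, keep only the ones our plugin posted,
// and read the reaction totals GitHub already tracks on each comment.
async function collectFeedback(owner: string, repo: string) {
  const comments = await octokit.paginate(octokit.rest.issues.listCommentsForRepo, {
    owner,
    repo,
    per_page: 100,
  });

  return comments
    .filter((comment) => comment.body?.includes(METADATA_MARKER))
    .map((comment) => ({
      commentId: comment.id,
      thumbsUp: comment.reactions?.["+1"] ?? 0,
      thumbsDown: comment.reactions?.["-1"] ?? 0,
    }));
}
```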

I don't think injecting all of the edits etc. into the systemMessage is exactly the way to go, if we consider our issues with the context window so far. Perhaps we build another LLM call, like we do for groundTruths, and have it summarize or succinctly embody all of the various RLHF techniques we intend for this feature?

Our system message is already MASSIVE and littered with context from all sorts of tasks; I don't think injecting tens of diffs, full-body LLM responses, and the original context window they were fed is a great idea.

@sshivaditya2019 (Collaborator) commented

> GitHub does this for us already, so we should not use our own DB to handle that aspect.

As I mentioned, we would only require the weights and not the edits or reactions.

> This would imply that all corrections and edits made repository-wide are going to be pulled from a DB (I assumed you meant keeping track via DB storage) and fed into the context window?

We would only retain the word/phrase weight pairs to provide to the LLM, without any edits or diffs.

> No Supabase; it's not required, and we can use GitHub alone if we wanted to.

I don't think we can store data inside plugins, so we might be limited to env vars and Supabase for data storage.

> We use structuredMetadata to create a trackable header per LLM response comment and have it contain key details: the comment_id it replied to for easy indexing, models used, tokens, groundTruths, and application (e.g. chatbot or codeReview).
> We either A) use the Search API to find the comment header in that repo, or B) list all comments and use the parser.
> The comment object contains the reactions on the comment (perhaps reserve one reaction to regenerate the response with?)

I’m not sure how this approach differs from our current process. Our goal is to introduce feedback into the system, not to revisit the same set of issues and comments repeatedly.

> I don't think injecting all of the edits etc. into the systemMessage is exactly the way to go, if we consider our issues with the context window so far. Perhaps we build another LLM call, like we do for groundTruths, and have it summarize or succinctly embody all of the various RLHF techniques we intend for this feature?

We are not injecting edits; rather, we will be incorporating a weight dictionary that enables the model to select from those options. These weights will be maintained for each repository. This approach can also be easily extended for fine-tuning if we decide to implement it in the future.

@Keyrxng (Member) commented Oct 28, 2024

Appreciate the response, that clarified things for me. I'm looking forward to seeing it implemented.

@sshivaditya2019 (Collaborator) commented

@0x4007 I think implementing this would take about a week. Also, rfc for my approach?

@0x4007 (Member, Author) commented Oct 29, 2024

How are you dealing with the numbers? Assuming you're using TypeScript, because if it's being handled directly by the LLM it won't be great.

@sshivaditya2019 (Collaborator) commented Oct 29, 2024

> How are you dealing with the numbers? Assuming you're using TypeScript, because if it's being handled directly by the LLM it won't be great.

I am not sure what you mean by numbers. I am assuming you are referring to the weights.

For now, it would be up to the LLM to choose the high-reward phrases and avoid the low-reward phrases; the top phrases, both positive and negative, would be added to the prompt, with the end goal being a fine-tuned model later.

I think this is the closest we can get to an RLHF approach without actually modifying model weights as such.
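
On the numbers concern, the arithmetic could stay in TypeScript: select the top positive and negative phrases from the weight dictionary in code and hand the LLM only the resulting short lists. A sketch, with hypothetical names and an arbitrary cutoff:

```typescript
// Pick the strongest signals so the LLM never has to reason over raw numbers.
function topPhrases(dict: Record<string, number>, n = 5) {
  const entries = Object.entries(dict);
  const positive = entries.filter(([, w]) => w > 0).sort((a, b) => b[1] - a[1]).slice(0, n);
  const negative = entries.filter(([, w]) => w < 0).sort((a, b) => a[1] - b[1]).slice(0, n);
  return { positive, negative };
}

// Build the prompt section that gets injected at query time.
function buildWeightPromptSection(dict: Record<string, number>): string {
  const { positive, negative } = topPhrases(dict);
  return [
    "Prefer these phrases:",
    ...positive.map(([phrase]) => `- ${phrase}`),
    "Avoid these phrases:",
    ...negative.map(([phrase]) => `- ${phrase}`),
  ].join("\n");
}
```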

@sshivaditya2019 (Collaborator) commented

/start


@sshivaditya2019 the deadline is at Tue, Nov 5, 3:36 AM UTC


Passed the deadline and no activity is detected, removing assignees: @sshivaditya2019.
