
RLHF #18

Open · 0x4007 opened this issue Oct 27, 2024 · 15 comments

@0x4007 (Member) commented Oct 27, 2024

The acronym is RLHF, which stands for Reinforcement Learning from Human Feedback. This method involves humans guiding the behavior and outputs of the model by providing feedback on its responses, helping it learn desired responses over time.

We should support three inputs for the model to learn from:

  1. Thumbs up
  2. Thumbs down
  3. Edits

Edits could be useful for comparing against what the output should have been. That way we have a chance to correct specific mistakes. In the example below it writes the .ubiquity-os.config.yml name incorrectly, and this could be a good opportunity to teach it.

ubiquity-os/ubiquity-os-kernel#111 (comment)
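
Purely as an illustrative sketch, the three feedback signals could be captured with something like the shape below (the FeedbackRecord type and its field names are hypothetical, not anything that exists in the codebase yet):

```typescript
// Hypothetical shape for a single piece of human feedback on an LLM response.
type FeedbackKind = "thumbs_up" | "thumbs_down" | "edit";

interface FeedbackRecord {
  kind: FeedbackKind;
  commentId: number;      // the LLM response comment this feedback applies to
  originalBody?: string;  // only for "edit": the model's original output
  editedBody?: string;    // only for "edit": what the human changed it to
  createdAt: string;      // ISO timestamp
}

// Example: an edit teaching the model the correct config filename.
const example: FeedbackRecord = {
  kind: "edit",
  commentId: 123456789,
  originalBody: "Update ubiquity-os-config.yml to enable the plugin.",
  editedBody: "Update .ubiquity-os.config.yml to enable the plugin.",
  createdAt: new Date().toISOString(),
};
```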

@0x4007 (Member, Author) commented Oct 27, 2024

@sshivaditya2019 Rfc

Roughly estimating a week, but I'm also curious if you can help clarify the implementation details.

I'm assuming v1 of this can be used to affect our RAG/embeddings search first, and then v2 could eventually be used to fine-tune a specific model.

@Keyrxng (Member) commented Oct 27, 2024

This feels like the wrong repo to open this task in, as it's a model-training task, but we'd definitely add structuredMetadata here as part of this idea.

  1. We'd need to store the AI response, and using the DB comes with overhead.
  2. We should use structuredMetadata like I've implemented in feat: dynamic ground truths #14 for posting ground truths; here we'd post which models were used for creating the embeddings, the ranking, the general query, etc. (a rough sketch follows this list).
  3. When the time comes, we write a script to parse all of our LLM response comments and extract the comment metadata, which contains the reactions to the comment and the edits (we may need GraphQL to fetch these).
  4. Sanitize and prep the dataset and then train our own model.
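
A minimal sketch of what that metadata could carry, loosely modeled on the ground-truths metadata from #14 (the field names and the HTML-comment marker are assumptions, not the exact schema):

```typescript
// Illustrative metadata payload attached to each LLM response comment.
interface LlmResponseMetadata {
  replyToCommentId: number; // comment the model responded to, for easy indexing
  models: {
    embedding: string;      // model used to create the embeddings
    reranker?: string;      // model used for ranking, if any
    completion: string;     // model used for the general query
  };
  groundTruths: string[];
  tokens: { prompt: number; completion: number };
}

// One possible pattern: append the metadata as a hidden HTML comment so a
// script can parse it back out of the comment body later.
function withMetadata(body: string, meta: LlmResponseMetadata): string {
  return `${body}\n\n<!-- UbiquityOS-Metadata ${JSON.stringify(meta)} -->`;
}
```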

@0x4007 (Member, Author) commented Oct 27, 2024

Simpler to just compare the earliest and latest comment revisions. I don't think we need to complicate this by storing things in a database or generating any embeddings. Just add the diff to the context and say:

For this example output 
(Before)
Make sure to adjust the response to accommodate 
(After diff) 

Something like that.

We could possibly optimize this to show only the diff instead of the entire original comment. That would clearly show it the filename of our .ubiquity-os.config.yml
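
A rough sketch of how that before/after context could be assembled once we have the earliest and latest revisions of the comment (GitHub exposes edit history through the GraphQL userContentEdits connection; the helper below is hypothetical):

```typescript
// Hypothetical helper: build a small correction block for the prompt context
// from the first and last revisions of an edited LLM response.
function buildCorrectionContext(before: string, after: string): string {
  return [
    "For this example output:",
    "(Before)",
    before,
    "",
    "Make sure to adjust the response to accommodate:",
    "(After)",
    after,
  ].join("\n");
}

// Usage with the config filename mistake from the linked comment:
const correction = buildCorrectionContext(
  "Rename your config to ubiquity-os-config.yml",  // the model's original output
  "Rename your config to .ubiquity-os.config.yml"  // the human-corrected revision
);
```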

@Keyrxng (Member) commented Oct 27, 2024

I may be out of my depth, but we need to move away from thinking in terms of prompt injection when considering model training. We need to fine-tune a model with specific datasets to embed them into its foundational knowledge, separate from our prompts.

This training will also be partner-specific, so we should build with this in mind so we can white-label it as a service for our partners' chatbots.

@0x4007 (Member, Author) commented Oct 27, 2024

First step is always prompting.

Later comes RAG/true fine tuning.

Technically this may not qualify as true RLHF.

@sshivaditya2019 (Collaborator) commented

I think this could be designed as follows: we keep track of the reactions and edits, and we keep track of edit diffs as well. That way we can monitor both positive and negative examples within a repository and incorporate them into the prompt at query time. This process should be implemented at the repository level.

For positive examples we could add this:

Input: [Original context]
Weight: +1

For negative examples:

Input: [Original context]
Weight: -1

For edits we would need to track what was changed in a particular edit:

Original Output Line: [Model output]
Corrected Output Line: [User's correction]

With this approach, we can create a word/phrase weight dictionary, where certain words receive higher rewards while others incur greater penalties. We can provide this dictionary to the LLM as context and calculate the overall score for each generation. If the score falls below the organization-wide limit, we can trigger a restart of the prompt generation process.
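
A minimal sketch of that scoring step, assuming a simple substring match and an assumed organization-wide threshold:

```typescript
// Per-repository weight dictionary: phrases the model should favour get
// positive weights, phrases it should avoid get negative weights.
type WeightDictionary = Record<string, number>;

const weights: WeightDictionary = {
  ".ubiquity-os.config.yml": 2, // learned from an edit correction
  "ubiquity-os-config.yml": -2, // the incorrect filename the model used
};

// Score a generated response by summing the weights of the phrases it contains.
function scoreGeneration(output: string, dict: WeightDictionary): number {
  return Object.entries(dict).reduce(
    (score, [phrase, weight]) => (output.includes(phrase) ? score + weight : score),
    0
  );
}

// If the score falls below the organization-wide limit, regenerate.
const ORG_SCORE_LIMIT = 0; // assumed threshold
function shouldRegenerate(output: string, dict: WeightDictionary): boolean {
  return scoreGeneration(output, dict) < ORG_SCORE_LIMIT;
}
```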

@Keyrxng (Member) commented Oct 28, 2024

> we keep track of the reactions and edits, and we keep track of edit diffs as well

GitHub does this for us already, so we should not use our own DB to handle that aspect.

> We can monitor both positive and negative examples within a repository and incorporate them into the prompt at query time.

This would imply that all corrections and edits made repository-wide are going to be pulled from a DB (I assumed you meant keeping track via DB storage) and fed into the context window?


  1. No Supabase; it's not required, and we can use GitHub alone if we wanted to.
  2. We use structuredMetadata to create a trackable header per LLM response comment and have it contain key details: the comment_id it replied to for easy indexing, models used, tokens, groundTruths, and application (e.g. chatbot or codeReview).
  3. We either A) use the Search API to find the comment header in that repo, or B) list all comments and use the parser.
  4. The comment object contains the reactions on the comment (perhaps reserve one reaction to regenerate the response with?); a rough sketch of this flow follows this list.
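
A rough sketch of this flow with Octokit, assuming option B (list all comments and parse); the metadata marker is a placeholder for however the structuredMetadata header actually gets embedded:

```typescript
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Placeholder marker; the real header format would come from structuredMetadata.
const METADATA_MARKER = "<!-- UbiquityOS-Metadata";

// List every issue comment in the repo, keep only the ones our plugin posted,
// and read the reaction totals GitHub already tracks on each comment.
async function collectFeedback(owner: string, repo: string) {
  const comments = await octokit.paginate(octokit.rest.issues.listCommentsForRepo, {
    owner,
    repo,
    per_page: 100,
  });

  return comments
    .filter((comment) => comment.body?.includes(METADATA_MARKER))
    .map((comment) => ({
      commentId: comment.id,
      thumbsUp: comment.reactions?.["+1"] ?? 0,
      thumbsDown: comment.reactions?.["-1"] ?? 0,
    }));
}
```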

I don't think injecting all of the edits etc. into the systemMessage is exactly the way to go, if we consider our issues with the context window so far. Perhaps we build another LLM call, like we do for groundTruths, and have it summarize or succinctly embody all of the various RLHF techniques we intend for this feature?

Our system message is already MASSIVE and littered with context from all sorts of tasks; I don't think injecting tens of diffs, full-body LLM responses, and the original context window they were fed is a great idea.

@sshivaditya2019 (Collaborator) commented

> GitHub does this for us already, so we should not use our own DB to handle that aspect.

As I mentioned, we would only require the weights and not the edits or reactions.

> This would imply that all corrections and edits made repository-wide are going to be pulled from a DB (I assumed you meant keeping track via DB storage) and fed into the context window?

We would only retain the word/phrase weight pairs to provide to the LLM, without any edits or diffs.

> No Supabase; it's not required, and we can use GitHub alone if we wanted to.

I don't think we can store data inside plugins, so we might be limited to env vars and Supabase for data storage.

> We use structuredMetadata to create a trackable header per LLM response comment and have it contain key details: the comment_id it replied to for easy indexing, models used, tokens, groundTruths, and application (e.g. chatbot or codeReview).
> We either A) use the Search API to find the comment header in that repo, or B) list all comments and use the parser.
> The comment object contains the reactions on the comment (perhaps reserve one reaction to regenerate the response with?)

I’m not sure how this approach differs from our current process. Our goal is to introduce feedback into the system, not to revisit the same set of issues and comments repeatedly.

> I don't think injecting all of the edits etc. into the systemMessage is exactly the way to go, if we consider our issues with the context window so far. Perhaps we build another LLM call, like we do for groundTruths, and have it summarize or succinctly embody all of the various RLHF techniques we intend for this feature?

We are not injecting edits; rather, we will be incorporating a weight dictionary that enables the model to select from those options. These weights will be maintained for each repository. This approach can also be easily extended for fine-tuning if we decide to implement it in the future.

@Keyrxng (Member) commented Oct 28, 2024

Appreciate the response, that clarified things for me. I'm looking forward to seeing it implemented.

@sshivaditya2019 (Collaborator) commented

@0x4007 I think implementing this would take about a week. Also, rfc for my approach?

@0x4007 (Member, Author) commented Oct 29, 2024

How are you dealing with the numbers? Assuming you're using TypeScript, because if it's being handled directly by the LLM it won't be great.

@sshivaditya2019 (Collaborator) commented Oct 29, 2024

> How are you dealing with the numbers? Assuming you're using TypeScript, because if it's being handled directly by the LLM it won't be great.

I am not sure what you mean by numbers. I am assuming you are referring to the weights.

For now, it would be up to the LLM to choose the high-reward phrases and avoid the low-reward phrases; the top phrases, both positive and negative, would be added to the prompt, with the end goal being a fine-tuned model later.

I think this is the closest we can get to an RLHF approach without actually modifying model weights as such.
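
On the numbers concern, the arithmetic could stay in TypeScript: select the top positive and negative phrases from the weight dictionary in code and hand the LLM only the resulting short lists. A sketch, with hypothetical names and an arbitrary cutoff:

```typescript
// Pick the strongest signals so the LLM never has to reason over raw numbers.
function topPhrases(dict: Record<string, number>, n = 5) {
  const entries = Object.entries(dict);
  const positive = entries.filter(([, w]) => w > 0).sort((a, b) => b[1] - a[1]).slice(0, n);
  const negative = entries.filter(([, w]) => w < 0).sort((a, b) => a[1] - b[1]).slice(0, n);
  return { positive, negative };
}

// Build the prompt section that gets injected at query time.
function buildWeightPromptSection(dict: Record<string, number>): string {
  const { positive, negative } = topPhrases(dict);
  return [
    "Prefer these phrases:",
    ...positive.map(([phrase]) => `- ${phrase}`),
    "Avoid these phrases:",
    ...negative.map(([phrase]) => `- ${phrase}`),
  ].join("\n");
}
```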

@sshivaditya2019 (Collaborator) commented

/start


@sshivaditya2019 the deadline is at Tue, Nov 5, 3:36 AM UTC


Passed the deadline and no activity is detected, removing assignees: @sshivaditya2019.
