Chatbot: Dynamic ground truths #13

Closed
Keyrxng opened this issue Oct 24, 2024 · 13 comments · Fixed by #14

Comments

@Keyrxng
Member

Keyrxng commented Oct 24, 2024

"You Must obey the following ground truths: [" +
groundTruths.join(":") +
"]\n" +

["typescript", "github", "cloudflare worker", "actions", "jest", "supabase", "openai"],

I think these should be dynamic. I don't have super clear context on them, but from what I can surmise they'd benefit from it.

  • Are they restricted to being one-word strings? If not, what's the max length? How many truths can we provide, and what's the optimal amount?
  • Are they supposed to embody the org, the incoming query, or the task in relation to the current @ubqbot command, or the fully built formattedChat that includes all context?

I think we should have GPT-4o build our array based on the input it's given (rough sketch below), because:

  1. Truths are dynamic and based on either A) the fully built/initial incoming user query, B) the payload issue spec or similar, or C) the task spec when considering code review.
  2. We need a quick response imo; a huge spec-to-truth call followed by the spec + the entire diff, both sent to o1, may be too long a wait for little added benefit.
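
Something along these lines is what I have in mind. This is only a minimal sketch using the openai SDK; the function name, prompt wording, and JSON shape are all hypothetical:

```ts
// Hypothetical sketch: have a fast model derive the ground-truths array from
// whatever context we already have (user query, issue spec, or task spec).
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function buildGroundTruths(context: string): Promise<string[]> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o", // quick and cheap compared to sending everything to o1
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "Derive up to 10 short ground truths (tech stack, frameworks, conventions) from the context. " +
          'Respond with JSON: { "groundTruths": string[] }',
      },
      { role: "user", content: context },
    ],
  });

  const parsed = JSON.parse(res.choices[0].message.content ?? "{}");
  return Array.isArray(parsed.groundTruths) ? parsed.groundTruths : [];
}
```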
@sshivaditya2019
Collaborator

sshivaditya2019 commented Oct 24, 2024

Ground truths act as essential guardrails for LLMs, influencing the assumptions the models make while generating outputs. These truths should be derived from the actual context of the task at hand, rather than from the input query provided. The aim is to ensure that the LLM's responses remain relevant and accurate.

For example, consider an organization that specializes in working with Rust files and employs tools like Rustup for managing Rust versions and Railway for deployment. The relevant ground truths for this scenario might include terms that encapsulate the key aspects of their technology stack and workflow. In this case, the ground truths could be:
["rust", "rustup", "default-nightly-toolchain", "wasm32bindgen", "ffi", "railway"].

If someone requests information about deployment, the model should make a reasonable assumption based on the provided ground truths. In this case, it should offer Rust-based examples for deployment.
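
To make the injection concrete, here is a minimal sketch (using the hypothetical Rust org from the example above) of how those truths would land in the system prompt, following the same join pattern quoted at the top of this issue:

```ts
// Minimal sketch: the truths are injected verbatim into the system prompt, so a
// deployment question would get a Rust/Railway-flavoured answer.
const groundTruths = ["rust", "rustup", "default-nightly-toolchain", "wasm32bindgen", "ffi", "railway"];

const systemMessage = "You Must obey the following ground truths: [" + groundTruths.join(":") + "]\n";
```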

@Keyrxng
Member Author

Keyrxng commented Oct 24, 2024

These truths should be derived from the actual context of the task at hand, rather than from the input query provided.

"task" here represents what, do you mean our github issues/tasks? Or the use-case behind the query?

E.g: My use-case is performing code review. The ground truths then by your description should embody the tech stack involved in that review?

It seems more logical to me to guardrail the model to perform review that is grounded in truth in relation to the task spec as opposed to the code being reviewed.

What do you suggest? Truths are source from the code being reviewed or are sourced from the spec, which is the objective/guideline/standards that the review must be performed to?

@ubqbot grounded in truth with the tech stack as a general chatbot makes sense because we only ever work with those stacks you stated, so in general it should offer responses involving those.

But a code-review specific logic flow already has clear context on tech stack and coding languages used as it's parsing the entire raw text diff.

@Keyrxng
Member Author

Keyrxng commented Oct 24, 2024

For @ubqbot we can pull the tech stacks dynamically for the partner, per repo, by:

  • fetching the repo's language stats and passing them in; we could adjust the prompt to give more weight to the languages with a higher percentage
  • using the package.json to find out which libs/frameworks etc. the repo uses

This covers tech stacks and frameworks. We could also leverage the org readme (not the repo readme), which might include more, but package.json and language stats should do it imo. What do you think?
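
A rough sketch of pulling both (assumes an Octokit client is available in the plugin context; the helper name is made up):

```ts
// Hypothetical helper: derive tech-stack hints from the repo itself.
import { Octokit } from "@octokit/rest";

export async function fetchRepoTechStack(octokit: Octokit, owner: string, repo: string) {
  // Language stats come back as bytes per language, e.g. { TypeScript: 123456, ... },
  // so the larger values can be given more weight in the prompt.
  const { data: languages } = await octokit.rest.repos.listLanguages({ owner, repo });

  // package.json tells us which libs/frameworks the repo uses.
  const { data: pkgFile } = await octokit.rest.repos.getContent({ owner, repo, path: "package.json" });
  const pkg = "content" in pkgFile ? JSON.parse(Buffer.from(pkgFile.content, "base64").toString("utf8")) : {};

  return {
    languages,
    dependencies: Object.keys({ ...pkg.dependencies, ...pkg.devDependencies }),
  };
}
```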


Each application deserves its own discussion as to where the "Ground Truths" are sourced from and how they are injected into that flow's system message, so I've made this issue specifically for the chatbot (@ubqbot command).

@Keyrxng changed the title from "Dynamic ground truths" to "Chatbot: Dynamic ground truths" on Oct 25, 2024
@Keyrxng
Member Author

Keyrxng commented Oct 25, 2024

Just over 4 hours spent at this point - awaiting review. And I'm unsure of the priority.


ubiquity-os-beta bot commented Oct 27, 2024

! No price label has been set. Skipping permit generation.

@Keyrxng
Member Author

Keyrxng commented Oct 27, 2024

@0x4007 Can you price and regen the reward please?

@sshivaditya2019
Collaborator

@Keyrxng, could you take a look at this? If a repository is missing a package.json, it's causing a fatal error. Let me know if you are able to replicate this.

Link

@Keyrxng
Member Author

Keyrxng commented Oct 27, 2024

Good catch, it needs to be wrapped.

It should either be included in the next PR to be merged (mine or yours), or addressed in #17, whichever approach you think is best.
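
A rough sketch of the wrap (helper name hypothetical, same Octokit client assumed):

```ts
import { Octokit } from "@octokit/rest";

// A missing package.json should not be fatal; fall back to language stats only.
async function fetchPackageJson(octokit: Octokit, owner: string, repo: string): Promise<Record<string, unknown> | null> {
  try {
    const { data } = await octokit.rest.repos.getContent({ owner, repo, path: "package.json" });
    if (!("content" in data)) return null;
    return JSON.parse(Buffer.from(data.content, "base64").toString("utf8"));
  } catch {
    // e.g. a 404 because the repo has no package.json
    return null;
  }
}
```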

@sshivaditya2019
Collaborator

sshivaditya2019 commented Oct 27, 2024

Good catch, it needs to be wrapped.

It should either be included in the next PR to be merged (mine or yours), or addressed in #17, whichever approach you think is best.

I can take care of this quick fix; I just wanted to confirm that I didn’t break anything, lol!

@Keyrxng
Member Author

Keyrxng commented Oct 27, 2024

No, it was definitely me; I wrote that fetch call. Everyone makes mistakes, we're only human after all.

@0x4007
Member

0x4007 commented Oct 27, 2024

@0x4007 Can you price and regen the reward please?

These need to be approved and funded before work is done, if you're expecting to get paid for it.


ubiquity-os-beta bot commented Oct 27, 2024

[ 12.1855 WXDAI ]

@sshivaditya2019

Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Comment | 3 | 12.1855 |
| Review | Comment | 9 | 0 |

Conversation Incentives

| Comment | Formatting | Relevance | Reward |
| --- | --- | --- | --- |
| Ground truths act as essential guardrails for LLMs, influencing … | 6.83 | 0.75 | 5.1225 |
| @Keyrxng, could you take a look at this? If a repository is miss… | 6.85 | 0.7 | 6.295 |
| I can take care of this quick fix; I just wanted to confirm that… | 1.28 | 0.6 | 0.768 |
| Wouldn't it be better to check for files such as `requiremen… | 0 | 0.9 | 0 |
| This is a suggestion: use capital letters for conditions to ensu… | 0 | 0.5 | 0 |
| We currently support OpenRouter, so we should check if we have a… | 0 | 0.8 | 0 |
| Is there a reason to create another OpenAI client here? If not, … | 0 | 0.7 | 0 |
| ```jsDO NOT LIST every LANGUAGE or DEPENDENCY; foc… | 0 | 0.6 | 0 |
| As I mentioned, it works at times, but when it does, it effectiv… | 0 | 0.4 | 0 |
| Claude performs much better in real-world coding tasks, so I sug… | 0 | 0.75 | 0 |
| Yes, all prompts related to code, such as pull prechecks, should… | 0 | 0.65 | 0 |
| I was working on setting this up in the repo, sorry for the dela… | 0 | 0.3 | 0 |

[ 53.937 WXDAI ]

@Keyrxng

Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Specification | 1 | 37.95 |
| Issue | Comment | 6 | 15.987 |
| Review | Comment | 25 | 0 |

Conversation Incentives

| Comment | Formatting | Relevance | Reward |
| --- | --- | --- | --- |
| https://github.com/ubiquity-os-marketplace/command-ask/blob/e45d… | 12.65 | 1 | 37.95 |
| "task" here represents what, do you mean our github issues/tasks… | 7.63 | 0.9 | 6.867 |
| for `@ubqbot` we can pull the tech stacks dynamically fo… | 8.06 | 0.85 | 7.151 |
| Just over 4 hours at this point in terms of time - awaiting revi… | 1.22 | 0.2 | 0.244 |
| @0x4007 Can you price and regen the reward please? | 0.65 | 0.1 | 0.065 |
| Good catch, needs wrapped.It either should be included in the … | 1.75 | 0.6 | 1.05 |
| No it was definitely me I wrote that fetch call, everyone makes … | 1.22 | 0.5 | 0.61 |
| Resolves #13 | 0 | 0.1 | 0 |
| This can be ignored as it's only relevant to #11 so this is all … | 0 | 0.2 | 0 |
| I implemented this thinking it'd be too slow but I think `o1… | 2 | 0.5 | 0 |
| Broke this into a fn to make it easier to see what we are pullin… | 0 | 0.6 | 0 |
| We should eitherA. Just hardcode a list of these sorts of file… | 0 | 0.4 | 0 |
| Can you show me what you mean and convert one or two of those li… | 0 | 0.3 | 0 |
| ```1. ASSUME your output BUILDS the FOUNDATION for… | 0 | 0.7 | 0 |
| The suggestion is to use claude when parsing code as was recomme… | 0 | 0.8 | 0 |
| What should be fed to Claude is my base question: just code? or … | 0 | 0.5 | 0 |
| I've used/seen it used sparingly throughout a prompt to enforce … | 0 | 0.5 | 0 |
| So to clarify, for #11, the context window will contain:- raw … | 3 | 0.6 | 0 |
| Okay this will need another task opened to alter the `@ubqbo… | 0 | 0.3 | 0 |
| As am I but would you want me to still embed the truths that get… | 0 | 0.7 | 0 |
| `@ubqbot` because it's pulling in context from multiple … | 2 | 0.8 | 0 |
| I was referring to that also my question was referring to meanin… | 0 | 0.4 | 0 |
| Resolving this thread as this requires it's own task as there is… | 0 | 0.2 | 0 |
| Marking as ready for review to get eyes on it and opinions as-is… | 10 | 0.5 | 0 |
| @0x4007 @sshivaditya2019 @gentlementlegen @rndquu requesting rev… | 0 | 0.1 | 0 |
| I'm not manually adding anything it's GPT that's creating the ar… | 0 | 0.4 | 0 |
| the @ubqbot command right now that's in `development` th… | 0 | 0.3 | 0 |
| In my opinion "Ground Truths" should be considered in relation t… | 7 | 0.6 | 0 |
| Dynamic chatbot ground truths QA used within my fork of this rep… | 0 | 0.7 | 0 |
| @gentlementlegen @0x4007 @sshivaditya2019 @rndquu Why can't we… | 0 | 0.2 | 0 |
| QA: https://github.com/ubq-testing/ask-plugin/issues/11#issuecom… | 1.5 | 0.8 | 0 |
| lmk if there is anything holding back this PR and I'll push it f… | 0 | 0.1 | 0 |

[ 11.243 WXDAI ]

@0x4007

Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Comment | 1 | 0.61 |
| Review | Comment | 7 | 10.633 |

Conversation Incentives

| Comment | Formatting | Relevance | Reward |
| --- | --- | --- | --- |
| These need to be approved to be funded before work is done and e… | 1.22 | 0.5 | 0.61 |
| OpenAI system prompts don't look like this so I'm inclined to be… | 1.54 | 0.7 | 1.078 |
| Doesn't seem right to add this line | 0.59 | 0.5 | 0.295 |
| In my deep experience, when I am building off of and integrating… | 6.91 | 0.9 | 6.219 |
| We can try ground truths I was referring to capitalized words fo… | 0.88 | 0.6 | 0.528 |
| Why are you adding redundant information in those arrays? | 0.65 | 0.8 | 0.52 |
| I think we should not have redundant messages but they should be… | 1.28 | 0.6 | 0.768 |
| @sshivaditya2019 you should review and decide when this pull is … | 1.75 | 0.7 | 1.225 |

@Keyrxng
Member Author

Keyrxng commented Oct 30, 2024

@0x4007 Can you price and regen the reward please?

These need to be approved and funded before work is done, if you're expecting to get paid for it.

It never produced a $200 reward and I claimed the $53, so I guess it worked out.
