Configure and Train an LLM-based chatbot on Data.gov catalog DB #4541

Open
btylerburton opened this issue Nov 29, 2023 · 1 comment

Comments

@btylerburton
Contributor

Feature/what we're after

The Data.gov team would like a utility that can search for and serve datasets that correspond to a natural-language query.

Anticipated/hypothesized benefits

  • Natural-language search will allow all users, both novice and expert, to locate datasets and derive connections between them that would not otherwise be possible with a traditional Solr query.

Measurements/metrics

  • measure the adoption of the tool
  • measure the quality of the results
  • compare the use of the tool against the use of the standard Solr-query-backed search

References/background

@nickumia

nickumia commented Dec 30, 2024

Knowing a little bit more about LLMs now, I would ask the team two questions that may redefine the scope of this ticket:

  1. What is the intent for the chatbot? If I ask a question about "what state has the most potholes", will it give me the answer? Or will it just point me to the source that it thinks has the answer?
  2. With knowledge of a catalog of data that's constantly changing, do you really want to train an LLM or just employ a more informed LLM? There are different methods to do the latter.

I would suggest a Retrieval-Augmented Generation (RAG) approach, which just takes a bunch of search results and passes them to an LLM to process. As it gets more information from the user, it can pull different data from the DB and give a more specialized response. It's extremely lightweight to take an off-the-shelf open-source model and ask it a question while passing it information to interpret.
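
As a rough sketch of that flow (illustrative only, not tested against data.gov): retrieve a few matching records, paste them into a prompt, and hand the prompt to whatever model is available. Everything here is an assumption made for the example: the `retrieve`, `build_prompt`, and `answer` helpers are hypothetical, the keyword-overlap ranking is a stand-in for the real catalog search, the record fields `title` and `notes` are assumed, and `llm` is any callable that maps a prompt string to a response string.

```python
import re
from typing import Callable, Dict, List

Record = Dict[str, str]  # assumed shape: {"title": ..., "notes": ...}


def _words(text: str) -> set:
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, records: List[Record], k: int = 3) -> List[Record]:
    """Return up to k records sharing at least one word with the query.

    Stand-in for a real search backend (e.g. the catalog's Solr index).
    """
    terms = _words(query)

    def score(rec: Record) -> int:
        return len(terms & _words(rec["title"] + " " + rec["notes"]))

    ranked = sorted(records, key=score, reverse=True)
    return [rec for rec in ranked[:k] if score(rec) > 0]


def build_prompt(question: str, hits: List[Record]) -> str:
    """Put the retrieved records in front of the question as context."""
    context = "\n".join(f"- {h['title']}: {h['notes']}" for h in hits)
    return (
        "Answer using only the datasets listed below. "
        "If they do not contain the answer, say so.\n\n"
        f"Datasets:\n{context}\n\nQuestion: {question}"
    )


def answer(question: str, records: List[Record],
           llm: Callable[[str], str], k: int = 3) -> str:
    """Retrieve, build the prompt, and let the supplied model respond."""
    return llm(build_prompt(question, retrieve(question, records, k)))


if __name__ == "__main__":
    catalog = [
        {"title": "Pothole Repairs by State", "notes": "Reported potholes per state"},
        {"title": "Bird Migration", "notes": "Seasonal bird counts"},
    ]
    # A fake model that just reports the prompt size; swap in a real one.
    print(answer("Which state has the most potholes?", catalog,
                 llm=lambda prompt: f"(model saw {len(prompt)} characters)"))
```

Nothing is trained here: the model only sees what the search step returns, so a changing catalog is picked up on the next query.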

For this to be an open-source contribution (i.e. for someone like me to do it, haha), the main thing I'd like to see is an API endpoint to get search results (I know it was CKAN before, but I don't know which parts of the API are still functional in the transition), and hmm... maybe that's it?
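
For reference, a search call against a CKAN-style action API might look like the sketch below. It assumes the catalog still exposes CKAN's `package_search` action at the base URL shown, which is exactly the open question above, so treat `BASE_URL` as a placeholder. The `search_url`, `summarize`, and `search` helpers are hypothetical names for this example; the `q` and `rows` parameters and the `success` / `result.results` response layout follow CKAN's action API.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the catalog still serves CKAN's action API at this address.
BASE_URL = "https://catalog.data.gov/api/3/action/package_search"


def search_url(query: str, rows: int = 5, base_url: str = BASE_URL) -> str:
    """Build a package_search URL with a free-text query and a row limit."""
    return f"{base_url}?{urlencode({'q': query, 'rows': rows})}"


def summarize(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Keep only the fields an LLM prompt would need from each dataset."""
    if not payload.get("success"):
        return []
    return [
        {
            "name": pkg.get("name", ""),
            "title": pkg.get("title", ""),
            "notes": pkg.get("notes") or "",  # description may be null
        }
        for pkg in payload["result"]["results"]
    ]


def search(query: str, rows: int = 5) -> List[Dict[str, str]]:
    """Fetch and summarize results (requires network access)."""
    with urlopen(search_url(query, rows), timeout=10) as resp:
        return summarize(json.load(resp))
```

If the endpoint changes in the transition, only `BASE_URL` and `summarize` would need to follow the new response shape.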

As a side note, for generic questions, ChatGPT's web crawler probably has enough information to give a decent response 🤷‍♀️

image
image

Not using data.gov data:

image

Answer with data.gov (sounds accurate, though the data isn't there):

image

Asking a question that I know is better covered on data.gov:

image

All of these demos go to show that if the LLM just has access to the data, it can provide more informed answers; training an LLM from scratch is almost unnecessary.

Projects
Status: 🧊 Icebox