Configure and Train an LLM-based chatbot on Data.gov catalog DB #4541

btylerburton · 2023-11-29T19:54:46Z

Feature/what we're after

The Data.gov team would like a utility that can search for and serve datasets that correspond to a natural-language query.

Anticipated/hypothesized benefits

Natural-language search will allow all users, both novice and expert, to locate datasets and derive connections between them that would not otherwise be possible with a traditional Solr query.

Measurements/metrics

Measure the adoption of the tool.
Measure the quality of the results.
Compare the use of the tool against the use of the standard Solr-backed search.

References/background

nickumia · 2024-12-30T07:03:50Z

Knowing a little bit more about LLMs now, I would ask the team two questions that may redefine the scope of this ticket:

What is the intent for the chatbot? If I ask a question like "what state has the most potholes", will it give me the answer? Or will it just point me to the source that it thinks has the answer?
Given that the catalog is constantly changing, do you really want to train an LLM, or just employ a better-informed LLM? There are different methods for doing the latter.

I would suggest a Retrieval-Augmented Generation (RAG) approach, which just takes a set of search results and passes them to an LLM to process. As it gets more information from the user, it can pull different data from the DB and give a more specialized response. It's extremely lightweight to take an off-the-shelf open-source model and ask it a question while passing it information to interpret.
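To make the RAG idea concrete, here is a minimal sketch of the "retrieve, then ask" loop. Everything named here is an assumption for illustration: `search` and `llm` are hypothetical callables you would supply (a catalog search and whatever off-the-shelf model you pick), and the `title` / `notes` keys are assumed field names for a dataset record, not something confirmed in this thread.

```python
from typing import Callable, Dict, List

# Assumed shape of one search hit: a dict with "title" and "notes" keys.
Dataset = Dict[str, str]


def format_context(datasets: List[Dataset], max_notes_chars: int = 300) -> str:
    """Turn search hits into a numbered, truncated context block."""
    lines = []
    for i, ds in enumerate(datasets, start=1):
        title = ds.get("title") or "(untitled)"
        # Truncate long descriptions so the prompt stays small.
        notes = (ds.get("notes") or "").strip()[:max_notes_chars]
        lines.append(f"[{i}] {title}: {notes}")
    return "\n".join(lines)


def build_prompt(question: str, datasets: List[Dataset]) -> str:
    """Combine the retrieved datasets and the user's question into one prompt."""
    context = format_context(datasets) or "(no datasets found)"
    return (
        "Answer the question using only the datasets listed below. "
        "Cite datasets by their bracketed number, and say so if none apply.\n\n"
        f"Datasets:\n{context}\n\n"
        f"Question: {question}\n"
    )


def answer(
    question: str,
    search: Callable[[str], List[Dataset]],  # hypothetical retriever
    llm: Callable[[str], str],               # hypothetical model call
    top_k: int = 5,
) -> str:
    """Retrieve the top_k hits for the question, then ask the model."""
    datasets = search(question)[:top_k]
    return llm(build_prompt(question, datasets))
```

No training happens anywhere in this loop. The model only ever sees what the search returned for the current question, so a changing catalog is picked up automatically on the next query.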

For this to be an open-source contribution (i.e., for someone like me to do it, haha), the main thing I'd like to see is an API endpoint to get search results (I know it was CKAN before, but I don't know what part of the API is still functional in the transition) and hmm... maybe that's it?
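For reference, this is roughly what the search side could look like if the catalog still exposed CKAN's standard `package_search` action. That is an assumption: the base URL below and its availability after the transition are not confirmed in this thread, and the helper names are made up for illustration. Only the `q` and `rows` parameters and the `success` / `result.results` response envelope come from CKAN's Action API.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the catalog serves CKAN's Action API at this URL.
CATALOG_API = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query: str, rows: int = 5, base: str = CATALOG_API) -> str:
    """Build a package_search URL; `q` is the query, `rows` caps the hit count."""
    return f"{base}?{urlencode({'q': query, 'rows': rows})}"


def parse_search_response(payload: dict) -> List[Dict[str, str]]:
    """Extract name, title, and description from a CKAN-style response."""
    if not payload.get("success"):
        raise RuntimeError("search request was not successful")
    return [
        {
            "name": pkg.get("name", ""),
            "title": pkg.get("title", ""),
            "notes": pkg.get("notes") or "",
        }
        for pkg in payload["result"]["results"]
    ]


def search_datasets(query: str, rows: int = 5) -> List[Dict[str, str]]:
    """Hypothetical helper: fetch and parse search results over HTTP."""
    with urlopen(build_search_url(query, rows), timeout=10) as resp:
        return parse_search_response(json.load(resp))
```

If the endpoint changes, only `CATALOG_API` and `parse_search_response` would need to be adapted; anything that consumes the list of `title` / `notes` records stays the same.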

As a side note, for generic questions, ChatGPT's web crawler probably has enough information to give a decent response 🤷‍♀️

Not using data.gov data,

Answer with data.gov (sounds accurate, though the data isn't there),

Asking a question that I know is better covered on data.gov,

All of these demos go to show that if the LLM just has access to the data, it can provide more informed answers; training an LLM from scratch is almost unnecessary.

btylerburton added Epic Feature Explore labels Nov 29, 2023

btylerburton added this to data.gov team board Nov 29, 2023

gujral-rei moved this to 🧊 Icebox in data.gov team board Nov 30, 2023
