This is a small demo project illustrating how to create a chatbot that can query a scraped website. It uses LangChain to manage the chatbot's framework, Gradio for a user-friendly interface, OpenAI's gpt-3.5-turbo model as the LLM, and ChromaDB as a vector store.
This project accompanies a blog post on my website, which can be read here.
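Under the hood, the pieces fit together as a retrieval-augmented chat chain: ChromaDB supplies relevant chunks of the scraped site, and gpt-3.5-turbo answers using them. The sketch below is a minimal illustration of that idea, assuming classic (0.0.x-era) LangChain module paths; the project's actual code in main.py may be organized differently.

```python
# Sketch: retrieval-augmented chat over a persisted Chroma store (illustrative only).
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Open the persisted vector store and build a chat chain on top of it.
store = Chroma(persist_directory="chroma", embedding_function=OpenAIEmbeddings())
chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=store.as_retriever(),
)

result = chain({"question": "What does the scraped site say about X?", "chat_history": []})
print(result["answer"])
```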
This project supports both `pip` and `pipenv`. I recommend using `pipenv` for the best (and least error-prone) experience.
If using `pip`, run:

```bash
pip install -r requirements.txt
```
If using `pipenv`, run:

```bash
pipenv install
```

followed by `pipenv shell` to start a shell with the installed packages.
You need to create a new `.env` file from the `.env.example` file containing your `OPENAI_API_KEY`. You can create an API key on OpenAI's platform; this requires an OpenAI developer account.
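For context, the usual pattern for a `.env` file like this is to load it with `python-dotenv` at startup. A minimal sketch, assuming that is how the project reads the key (the actual loading code may differ):

```python
# Sketch: load OPENAI_API_KEY from the .env file into the process environment.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env in the working directory
openai_api_key = os.environ["OPENAI_API_KEY"]
```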
To scrape a site, run:

```bash
python scrape.py --site <site_url> --depth <int>
```

This will scrape the given URL and, recursively, all links found at that URL up to the specified depth. Only pages with the same origin as the given `<site_url>` are scraped, so, for example, scraping https://python.langchain.com/docs will only scrape pages under https://python.langchain.com.

The scraped data will be stored in a new `scrape/` directory.
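The same-origin, depth-limited crawl can be pictured roughly as in the sketch below. This is a simplified illustration using `requests` and `BeautifulSoup`, not the actual scrape.py; the `crawl` helper and its parameters are made up for the example.

```python
# Simplified sketch of a same-origin, depth-limited crawl (illustrative only;
# scrape.py's real behavior and output format may differ).
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(url, depth, origin, seen):
    if depth < 0 or url in seen:
        return
    seen.add(url)
    html = requests.get(url, timeout=10).text
    # ... here the page would be saved under scrape/ ...
    for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        next_url = urljoin(url, link["href"])
        if urlparse(next_url).netloc == origin:  # same-origin check
            crawl(next_url, depth - 1, origin, seen)


start = "https://python.langchain.com/docs"
crawl(start, depth=1, origin=urlparse(start).netloc, seen=set())
```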
To generate and persist the embeddings and create a vector store, run:

```bash
python embed.py
```

A new persisted vector store will be created in the `chroma/` directory.
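Conceptually, this step loads the scraped documents, splits them into chunks, embeds them with OpenAI's embedding model, and persists a Chroma index. A rough sketch of that flow, assuming classic (0.0.x-era) LangChain APIs; the loader, chunk sizes, and directory handling in the real embed.py may differ.

```python
# Sketch: embed the scraped pages and persist them as a Chroma index
# (illustrative only; embed.py may differ).
from langchain.document_loaders import DirectoryLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

docs = DirectoryLoader("scrape").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

store = Chroma.from_documents(chunks, OpenAIEmbeddings(), persist_directory="chroma")
store.persist()  # write the index to the chroma/ directory
```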
To launch the chatbot, run:

```bash
python main.py
```

This will start a Gradio server at http://127.0.0.1:7860, allowing you to chat with the scraped website data.

NOTE: you must first both scrape a site and persist a vector store for this to work.
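For reference, the Gradio side of an app like this can be as small as a `gr.ChatInterface` wrapping a response function. The sketch below is illustrative only; the `answer` placeholder stands in for the real retrieval chain used by main.py.

```python
# Sketch: a minimal Gradio chat UI (illustrative only; main.py may differ).
import gradio as gr


def answer(message, history):
    # In the real app this would query the LangChain retrieval chain over the
    # Chroma store; a placeholder keeps the sketch self-contained.
    return f"You asked: {message}"


gr.ChatInterface(answer).launch()  # serves at http://127.0.0.1:7860 by default
```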