How can I keep using the same cache directory for the same file over multiple runs? #475

I am building an application with langchain's RetrievalQAChain to query Documents created from a PDF file. The hash value of each PDF file is used as the name of the cache directory. The problem is that when I run the program (either from the terminal or from a Jupyter notebook), terminate it, and rerun it, the existing cache does not appear to be used and is rebuilt from scratch. What could be causing this? Here is my code. I referenced this page: https://gptcache.readthedocs.io/en/latest/bootcamp/langchain/question_answering.html#init-for-similar-match-cache

```python
from gptcache.manager import manager_factory
from gptcache.embedding import Huggingface
from gptcache import cache
from gptcache.adapter.langchain_models import LangChainLLMs
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
import hashlib
import os

from langchain.document_loaders import PyMuPDFLoader
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI

os.environ['OPENAI_API_KEY'] = "openaikey"
model_name = "sonoisa/sentence-bert-base-ja-mean-tokens-v2"
filename = "20230428005-2.pdf"
# Get the pdf file's hash value
with open(filename, 'rb') as file:
    data = file.read()
filehash = hashlib.sha256(data).hexdigest()
# prepare database and retriever from the pdf file
loader = PyMuPDFLoader(filename)
docs = loader.load()
text_splitter = CharacterTextSplitter()
documents = text_splitter.split_documents(docs)
db_embeddings = HuggingFaceEmbeddings(model_name=model_name)
db = Chroma.from_documents(
    documents,
    db_embeddings,
    persist_directory=None,
)
chroma_retriever = db.as_retriever()
# prepare RetrievalQA object from the retriever
prompt_template = """Use the following pieces of context and source information to answer the question at the end.
If you cannot find the answer, do not try to make up an answer, just say that you don't know.
{context}
Question: {question}
answer in Japanese:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
chain_type_kwargs = {"prompt": PROMPT}
llm = LangChainLLMs(llm=OpenAI(temperature=0.0, model_name="gpt-3.5-turbo",
                               request_timeout=60, max_retries=3))
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type_kwargs=chain_type_kwargs,
    retriever=chroma_retriever,
    return_source_documents=True,
)
# prepare cache for the pdf file
def get_content_func(data, **_):
    return data.get("prompt").split("Question")[-1]
cache_embedding = Huggingface(model_name)
data_manager = manager_factory(
    "sqlite,faiss",
    data_dir=f"./data/similarity_cache/similar_cache_{filehash}",  # separate directory for each file
    vector_params={"dimension": cache_embedding.dimension},
)
cache.init(
    pre_embedding_func=get_content_func,
    embedding_func=cache_embedding.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
```
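One thing worth checking before assuming the hash changed: with the `"sqlite,faiss"` manager, the faiss index is held in memory and only written to disk when the data manager is flushed or closed, so killing the process mid-session can leave the on-disk cache empty or stale. A minimal sketch of guarding against that, assuming a GPTCache version where `cache.flush()` delegates to the data manager:

```python
import atexit

from gptcache import cache

# After cache.init(...), register a flush so the sqlite rows and the
# faiss index are written to disk even if the script exits unexpectedly.
atexit.register(cache.flush)

# ... run qa({"query": ...}) calls here ...

# Or flush explicitly after a batch of queries:
cache.flush()
```

Note that `atexit` does not run on a hard kill or some Jupyter kernel restarts, so an explicit `cache.flush()` after each batch of queries is the safer option.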
Replies: 2 comments · 4 replies
-
@SleepingSkipper It seems that there is no problem with the code itself; it should use the same cache when the program is rerun. Have you tried printing the `filehash` value to make sure it is the same across runs? You can also check which files and folders are present under the `./data/similarity_cache` folder.
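For reference, a minimal sketch of that check using only the standard library (the path mirrors the `data_dir` from the question; adjust the filename as needed):

```python
import hashlib
import os

filename = "20230428005-2.pdf"

# Recompute the hash exactly as the question does, so the value can be
# compared between runs.
with open(filename, "rb") as file:
    filehash = hashlib.sha256(file.read()).hexdigest()
print("filehash:", filehash)

# List the cache directories that already exist on disk.
cache_root = "./data/similarity_cache"
if os.path.isdir(cache_root):
    for name in sorted(os.listdir(cache_root)):
        print("existing cache dir:", name)
else:
    print("cache root does not exist yet:", cache_root)
```

If the printed hash matches an existing `similar_cache_<hash>` directory but the cache is still rebuilt, the problem lies in how the cache contents are persisted, not in the directory name.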
-
I found out that this is caused by …