How can I keep using the same cache directory for the same file over multiple runs? #475

I am building an application with langchain's RetrievalQAChain to query Documents created from a PDF file. The hash value of each PDF file is used as the name of the cache directory. The problem is that when I run the program (either from the terminal or from a Jupyter notebook), terminate it, and rerun it, the existing cache does not appear to be used and is rebuilt from scratch. What could be causing this? Here is my code. I referenced this page: https://gptcache.readthedocs.io/en/latest/bootcamp/langchain/question_answering.html#init-for-similar-match-cache

```python
from gptcache.manager import manager_factory
from gptcache.embedding import Huggingface
from gptcache import cache
from gptcache.adapter.langchain_models import LangChainLLMs
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
import hashlib
import os

from langchain.document_loaders import PyMuPDFLoader
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI

os.environ['OPENAI_API_KEY'] = "openaikey"
model_name = "sonoisa/sentence-bert-base-ja-mean-tokens-v2"
filename = "20230428005-2.pdf"
# Get the pdf file's hash value
with open(filename, 'rb') as file:
    data = file.read()
filehash = hashlib.sha256(data).hexdigest()
# prepare database and retriever from the pdf file
loader = PyMuPDFLoader(filename)
docs = loader.load()
text_splitter = CharacterTextSplitter()
documents = text_splitter.split_documents(docs)
db_embeddings = HuggingFaceEmbeddings(model_name=model_name)
db = Chroma.from_documents(
    documents,
    db_embeddings,
    persist_directory=None,
)
chroma_retriever = db.as_retriever()
# prepare RetrievalQA object from the retriever
prompt_template = """Use the following pieces of context and source information to answer the question at the end.
If you cannot find the answer, do not try to make up an answer, just say that you don't know.
{context}
Question: {question}
answer in Japanese:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
chain_type_kwargs = {"prompt": PROMPT}
llm = LangChainLLMs(llm=OpenAI(temperature=0.0, model_name="gpt-3.5-turbo",
                               request_timeout=60, max_retries=3))
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type_kwargs=chain_type_kwargs,
    retriever=chroma_retriever,
    return_source_documents=True,
)
# prepare cache for the pdf file
def get_content_func(data, **_):
    return data.get("prompt").split("Question")[-1]
cache_embedding = Huggingface(model_name)
data_manager = manager_factory(
    "sqlite,faiss",
    data_dir=f"./data/similarity_cache/similar_cache_{filehash}",  # separate directory for each file
    vector_params={"dimension": cache_embedding.dimension},
)
cache.init(
    pre_embedding_func=get_content_func,
    embedding_func=cache_embedding.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
```
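One thing worth checking before assuming the hash changed: with the `"sqlite,faiss"` manager, the faiss index is held in memory and only written to disk when the data manager is flushed or closed, so killing the process mid-session can leave the on-disk cache empty or stale. A minimal sketch of guarding against that, assuming a GPTCache version where `cache.flush()` delegates to the data manager:

```python
import atexit

from gptcache import cache

# After cache.init(...), register a flush so the sqlite rows and the
# faiss index are written to disk even if the script exits unexpectedly.
atexit.register(cache.flush)

# ... run qa({"query": ...}) calls here ...

# Or flush explicitly after a batch of queries:
cache.flush()
```

Note that `atexit` does not run on a hard kill or some Jupyter kernel restarts, so an explicit `cache.flush()` after each batch of queries is the safer option.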
Replies: 2 comments · 4 replies
-
@SleepingSkipper It seems that there is no problem with the code itself; it should use the same cache when the program is rerun. Have you tried printing the `filehash` value to make sure it is the same across runs? You can also check which files and folders are present under the `./data/similarity_cache` folder.
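For reference, a minimal sketch of that check using only the standard library (the path mirrors the `data_dir` from the question; adjust the filename as needed):

```python
import hashlib
import os

filename = "20230428005-2.pdf"

# Recompute the hash exactly as the question does, so the value can be
# compared between runs.
with open(filename, "rb") as file:
    filehash = hashlib.sha256(file.read()).hexdigest()
print("filehash:", filehash)

# List the cache directories that already exist on disk.
cache_root = "./data/similarity_cache"
if os.path.isdir(cache_root):
    for name in sorted(os.listdir(cache_root)):
        print("existing cache dir:", name)
else:
    print("cache root does not exist yet:", cache_root)
```

If the printed hash matches an existing `similar_cache_<hash>` directory but the cache is still rebuilt, the problem lies in how the cache contents are persisted, not in the directory name.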
-
I found out that this is caused by …