Implement client-side dataset caching #802
Conversation
After testing this, the way records are cached (with all their child records) is not going to work. It works for singlepoints, but torsiondrives are way too big. So this is going to need a bit more work. I think the solution is to store individual records (without children) in a separate table, and store foreign keys in the current table.
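To make that proposed layout concrete, here is a minimal sketch of what such a split might look like; the table and column names are invented for illustration, not taken from this PR:

```python
import sqlite3

conn = sqlite3.connect("dataset_cache.sqlite")
conn.executescript("""
    -- Each record stored once, without its child records
    CREATE TABLE IF NOT EXISTS records (
        id          INTEGER PRIMARY KEY,
        record_data BLOB NOT NULL
    );

    -- Dataset entries point at records via foreign keys instead of
    -- embedding the full (potentially huge) child hierarchy inline
    CREATE TABLE IF NOT EXISTS dataset_records (
        entry_name    TEXT NOT NULL,
        specification TEXT NOT NULL,
        record_id     INTEGER NOT NULL REFERENCES records(id),
        PRIMARY KEY (entry_name, specification)
    );
""")
conn.commit()
```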
I've been playing around with this today, and it's great! Very intuitive. Two comments along with their importance out of 10 (I wouldn't consider either one blocking).
The code that I'm playing with is:

```python
import qcportal

client = qcportal.PortalClient("https://api.qcarchive.molssi.org:443", cache_dir="./cache2")
ds = client.get_dataset("torsiondrive", "XtalPi Shared Fragments TorsiondriveDataset v1.0")

# the next two lines didn't immediately do what I wanted, so I ran the loops below
#ds.fetch_entries()
#ds.fetch_records(include=["optimizations"], force_refetch=True)

for entry in ds.iterate_entries():
    entry

for record in ds.iterate_records():
    for angle, opt in record[2].minimum_optimizations.items():
        opt.final_molecule
```

The resulting cache file size is pretty reasonable:

```
(bespokefit) jw@mba$ ls -lrth cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite
-rw-r--r-- 1 jeffreywagner staff 13M Feb 16 15:32 cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite
```

Then in a separate interpreter (and with minor changes to qcsubmit):

```python
from qcportal import dataset_models

ds2 = dataset_models.dataset_from_cache("./cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite")

from openff.qcsubmit.results import TorsionDriveResultCollection
tdrc = TorsionDriveResultCollection.from_datasets([ds2])
tdrc
```

Success!!
Glad it's working so far!
At the moment, there is no limit to the cache (basically if you set the max size to …)
If you create a client with the same `cache_dir`, the existing cache files will be re-used.
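For example, a minimal sketch re-using the snippet from earlier in the thread; a second script run should pick up the already-populated cache files rather than re-fetching everything:

```python
import qcportal

# First construction populates ./cache2 with per-dataset SQLite files;
# any later client built with the same cache_dir re-uses those files
client = qcportal.PortalClient("https://api.qcarchive.molssi.org:443", cache_dir="./cache2")
ds = client.get_dataset("torsiondrive", "XtalPi Shared Fragments TorsiondriveDataset v1.0")

# Entries/records already stored in the dataset's SQLite file are read
# from the local cache instead of being fetched from the server again
for entry in ds.iterate_entries():
    pass
```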
I'm going to go ahead and merge this. There are still some tasks to be done before the next release, but it seems to be working well. The main reason is that I have another feature being built on top of this, and leaving this open makes it a bit complicated.
Description
Previously, dataset information was not cached locally at all, so rerunning a script, or just calling `client.get_dataset` again, could require re-fetching all data, even if it had been fetched before.

This PR implements this caching. All storage of records is now in an SQLite database, either in a file or in memory. Some care has been taken to keep the cache as up-to-date as possible, but I am sure there are still loopholes. This includes records writing themselves back to the cache when they have been updated with additional data (for example, fetching molecules or trajectories).
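As a rough illustration of that write-back behavior (the function and table names here are invented for the example, not the PR's actual internals):

```python
import sqlite3

def writeback_record(conn: sqlite3.Connection, record_id: int, record_blob: bytes) -> None:
    # After a record fetches extra data (molecules, trajectories, ...),
    # upsert its serialized form so the cache reflects the enriched record
    conn.execute(
        "INSERT INTO records (id, record_data) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET record_data = excluded.record_data",
        (record_id, record_blob),
    )
    conn.commit()
```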
There are a few ways to use it (a short sketch of each follows below):

- A `cache_dir` parameter when creating a client. This will then automatically create SQLite files for each dataset, and re-use them as long as the same `cache_dir` is used in subsequent client construction.
- A `dataset_from_cache` function where you can pass a file directly (ie, downloaded out-of-band).
- A `dataset_from_cache` function in `dataset_models.py` that works similarly, but will result in an offline dataset object completely disconnected from any server.

This is purely a client-side change, so this branch will work with the currently-deployed MolSSI QCArchive servers.
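A sketch of the three entry points, under the assumption that the first `dataset_from_cache` lives on the client (only the `cache_dir` parameter and the offline `dataset_models.dataset_from_cache` variant are confirmed by the snippets earlier in this thread):

```python
import qcportal
from qcportal import dataset_models

# 1. cache_dir: per-dataset SQLite files are created and re-used automatically
client = qcportal.PortalClient("https://api.qcarchive.molssi.org:443", cache_dir="./cache2")
ds = client.get_dataset("torsiondrive", "XtalPi Shared Fragments TorsiondriveDataset v1.0")

# 2. Pass a cache file directly (e.g. downloaded out-of-band) while staying
#    attached to the server -- the exact spelling of this variant is assumed here
# ds = client.dataset_from_cache("./cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite")

# 3. Fully offline dataset object, disconnected from any server
ds_offline = dataset_models.dataset_from_cache(
    "./cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite"
)
```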
There is still some polishing to be done (and docs), but I am looking for feedback and any bugs before merging.
See #740
Todos and missing features:

- `refresh_cache` needs to be finished

Changelog description
Implement client-side caching of datasets
Status