Too much opening and closing files #153
Comments
We could keep a persistent handle for the snapshot around and re-use it. I've avoided doing that so far because this way seemed cleaner to me, and it raises the question of how on earth we would ever give the handle back. I guess we would do that in the destructor of the swift dataset, but you would still need to query the metadata server once per galaxy (though that doesn't seem too bad to me). Specifically, you could add a handle as a property of the …
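A minimal sketch of that idea, assuming plain `h5py` (the `SnapshotReader` class name and its attributes are hypothetical, not swiftsimio's actual API):

```python
import h5py

class SnapshotReader:
    """Hold one persistent h5py.File handle for the object's lifetime.

    Rough sketch only; not swiftsimio's actual class structure.
    """

    def __init__(self, filename):
        self.filename = filename
        # Open once: the metadata server is queried here and only here.
        self._handle = h5py.File(filename, "r")

    def read_dataset(self, name, mask=None):
        # Re-use the already-open handle instead of re-opening the file.
        dataset = self._handle[name]
        return dataset[mask] if mask is not None else dataset[...]

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None

    # "Give back the handle" in the destructor, as suggested above.
    def __del__(self):
        self.close()
```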
From Slack: Is this not a massive concern for distributed snapshots? There is surely no other way of doing this than swapping out file handles; otherwise we'll have thousands of handles hanging around. Maybe @jchelly has some insight here? Does HDF5 do some magic under the covers?
Maybe swiftsimio needs an API that allows reading multiple arrays in one operation? That's how I usually deal with multi-file snapshots: make a list of everything I need from each file, then open the files in sequence. It should be possible to get the number of file opens down to one opening of file 0 to read all of the metadata, and then, each time a region is requested, one opening of each file with particles in that region. You could also use something like `functools.lru_cache` to keep the last few files around if it's likely the same files will be needed again; a sketch of both ideas follows below.

I'm not sure if HDF5 does anything clever with the virtual file. It seems to delay opening sub-files until you request data from them, but I don't know how it decides to close them, or what happens if there are more files than the maximum number of file handles.
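A minimal sketch of that combination in plain `h5py` (`open_subfile` and `read_many` are hypothetical helpers, not swiftsimio API; `maxsize=8` is an arbitrary guess):

```python
import functools
import h5py

# Keep the last few sub-files open; an evicted handle is closed once the
# cached File object is garbage collected.
@functools.lru_cache(maxsize=8)
def open_subfile(filename):
    return h5py.File(filename, "r")

def read_many(requests):
    """Serve a batch of (filename, dataset_name, selection) requests,
    grouped by file so each file is opened at most once per batch.

    Hypothetical helper, not part of swiftsimio's API.
    """
    results = {}
    # Group requests for the same file together.
    for filename, dataset, selection in sorted(requests, key=lambda r: r[0]):
        f = open_subfile(filename)
        results[(filename, dataset)] = f[dataset][selection]
    return results
```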
So I guess as long as we hold onto the file handle, maybe HDF5 will do something clever under the covers? I've always hated the idea of the distributed snapshots, but I guess people have to use them...
The best way to do this would be to add an …
Then, when loading multiple regions, you should be able to re-use the same …
Can we load all the metadata once and for all? That would already reduce the number of reads and is lightweight.
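A rough sketch of what "once and for all" could look like with plain `h5py`; the group names ("Header", "Units", "Cosmology") follow SWIFT snapshot conventions but are assumptions here:

```python
import h5py

def read_all_metadata(filename):
    """Read all snapshot metadata in a single file open.

    Sketch only: the group names are assumed, not confirmed against
    swiftsimio's metadata handling.
    """
    metadata = {}
    with h5py.File(filename, "r") as handle:
        for group in ("Header", "Units", "Cosmology"):
            if group in handle:
                # Copy the attributes into plain dicts so nothing keeps
                # a reference to the (soon to be closed) file.
                metadata[group] = dict(handle[group].attrs)
    return metadata
```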
I've proposed a fix in #155.
Last week I got told off by Peter Draper for overloading the metadata server of the cosma7 file system. The reason: I was running 192 parallel instances of a script that uses swiftsimio to extract galaxies from a snapshot for analysis. Apparently the metadata server is queried whenever a file is opened, even if you have only just closed that same file after a previous read. In swiftsimio, the snapshot file is opened (and almost immediately closed again):
That is a lot. I understand the reason for reading datasets separately, but I don't see a good reason to open and close the file for metadata as well; that can and should all be read in one go. That might already help a bit.
Even if the number of reads per `load()` can be reduced, the more fundamental issue is that, right now, we need to do a separate `load()` whenever the mask is changed (because you want to read a different galaxy, for example). That is great for parallelisation, since it essentially means you can spawn lots of processes for different galaxies without needing to worry about parallel HDF5 (each process uses its own `h5py` instance). But it means a huge overhead (in file accesses and also in the amount of data being read) when you are trying to read a significant fraction of the snapshot. We need a better mechanism to deal with that. Ideally, we would have a mechanism that reads large chunks of datasets in few operations (Lustre file systems like that), but hides this from the user. @JBorrow any thoughts on this? We really want to use swiftsimio for the morphology pipeline we plan to use for COLIBRE, but without some way to improve performance for loops over galaxies, we will never be able to run this on a large box.
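One possible shape for such a mechanism, sketched under assumptions (the per-galaxy particle index ranges are already known; all names are illustrative, not swiftsimio API): collect the ranges for many galaxies, merge touching ranges into a few large contiguous reads, then slice per-galaxy results out of the in-memory buffers.

```python
import h5py

def merge_ranges(ranges, gap=0):
    """Merge (start, stop) index ranges that touch or overlap, so many
    small per-galaxy reads collapse into a few large contiguous ones."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1] + gap:
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return [tuple(r) for r in merged]

def read_galaxies(filename, dataset_name, galaxy_ranges):
    """Read one dataset for many galaxies using few large HDF5 reads.

    Hypothetical sketch: `galaxy_ranges` maps a galaxy id to its
    (start, stop) particle index range within the snapshot.
    """
    merged = merge_ranges(galaxy_ranges.values())
    with h5py.File(filename, "r") as handle:
        dataset = handle[dataset_name]
        # A few big contiguous reads: Lustre file systems like that.
        buffers = {start: dataset[start:stop] for start, stop in merged}
    # Hand out per-galaxy slices from memory, hiding the batching.
    out = {}
    for galaxy, (start, stop) in galaxy_ranges.items():
        for mstart, mstop in merged:
            if mstart <= start and stop <= mstop:
                out[galaxy] = buffers[mstart][start - mstart : stop - mstart]
                break
    return out
```

Setting `gap > 0` would trade a little wasted I/O for even fewer read operations, which may be worthwhile on Lustre.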