How to check BagIt archive with a large number of files (~20'000) with BagVerifier #137
Comments
The validation speed is mostly determined by the IO throughput, as the hashing is typically done by a specialized unit on your CPU and therefore has very low overhead. Where is the bag located: spinning disk, SSD, NFS, Samba mount, etc.? All of those choices will dramatically affect the rate of the verification. Also, how many threads can your CPU actively use? If it is like mine, with 4 cores, having more than 4 threads isn't going to help very much, as they will just be waiting anyway. So, my suggestions would be these:
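The sizing advice above can be sketched in plain Java. This is a minimal, self-contained illustration (it hashes in-memory byte buffers rather than real bag payload files): `Executors.newFixedThreadPool` caps the worker count at the core count, so 20,000 submitted tasks queue up instead of each getting its own thread.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FixedPoolHashing {
    public static void main(String[] args) throws Exception {
        // Size the pool to the number of cores the CPU can actively use;
        // for CPU-bound hashing, extra threads would just wait anyway.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<String>> results = new ArrayList<>();
            // Submit many tasks: they queue behind the fixed workers,
            // unlike a thread-per-file scheme that falls over at ~20,000 files.
            for (int i = 0; i < 20_000; i++) {
                final byte[] payload = ("file-" + i).getBytes(StandardCharsets.UTF_8);
                results.add(pool.submit(() -> {
                    MessageDigest md = MessageDigest.getInstance("SHA-256");
                    StringBuilder hex = new StringBuilder();
                    for (byte b : md.digest(payload)) {
                        hex.append(String.format("%02x", b));
                    }
                    return hex.toString();
                }));
            }
            System.out.println("submitted " + results.size() + " hashing tasks");
            // A SHA-256 digest is 32 bytes, i.e. 64 hex characters.
            System.out.println("sample digest length: " + results.get(0).get().length());
        } finally {
            pool.shutdown();
        }
    }
}
```

The same pool object can be reused across many files or many bags; only `shutdown()` at the end releases its threads.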
It may be a good idea to have the validator default to a fixed-threadpool executor as described above. We ran into the same issue with a bag of 10,000+ files; in our case it resulted in a FileSystemException with the message "Too many open files". We ended up resolving it with a similar solution, but I also had to avoid calling
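The "Too many open files" failure happens because a thread-per-file design opens thousands of file handles at once. A fixed pool bounds this naturally: at most one file is open per worker thread. A self-contained sketch (it generates a small stand-in payload directory rather than reading a real bag's `data/` tree):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedOpenFiles {
    public static void main(String[] args) throws Exception {
        // Stand-in payload directory (hypothetical; a real bag keeps its
        // payload under data/).
        Path dir = Files.createTempDirectory("payload");
        for (int i = 0; i < 200; i++) {
            Files.write(dir.resolve("file-" + i + ".txt"),
                        ("content " + i).getBytes(StandardCharsets.UTF_8));
        }

        // At most `cores` tasks run concurrently, so at most `cores` files
        // are open at any moment -- the OS open-file limit is never hit,
        // no matter how many files the bag contains.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<String>> digests = new ArrayList<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
                for (Path file : files) {
                    digests.add(pool.submit(() -> {
                        MessageDigest md = MessageDigest.getInstance("SHA-256");
                        return hex(md.digest(Files.readAllBytes(file)));
                    }));
                }
            }
            for (Future<String> d : digests) {
                d.get(); // block until done; surfaces any IO errors
            }
            System.out.println("verified " + digests.size() + " files");
        } finally {
            pool.shutdown();
        }
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```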
Or you could use my fork, which I actually maintain and which has fixed these problems.
What is the best configuration of the ExecutorService parameter of BagVerifier for checking a very large BagIt archive? I am using version 5.0.3 of the library.
By default, the isValid function creates a thread for each file. With ~20,000 files, the process crashed.
I tried different options:
=> crash
=> very long
What is your advice?
Thx
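Putting the advice from the comments above together, the configuration would look roughly like this. The bagit-java calls are shown as comments because they require the library on the classpath, and the `BagVerifier(ExecutorService)` constructor and `isValid(Bag, boolean)` signature are assumed from the 5.x API — verify them against the version actually in use:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
// Assumed bagit-java (gov.loc:bagit:5.x) imports, commented so this sketch
// compiles standalone:
// import gov.loc.repository.bagit.domain.Bag;
// import gov.loc.repository.bagit.reader.BagReader;
// import gov.loc.repository.bagit.verify.BagVerifier;

public class VerifyLargeBag {
    public static void main(String[] args) throws Exception {
        // Fixed pool sized to the core count: tasks for the ~20,000 payload
        // files queue up instead of each spawning a thread.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            // Assumed API -- check against your bagit-java version:
            // Bag bag = new BagReader().read(Paths.get("/path/to/bag"));
            // new BagVerifier(pool).isValid(bag, /* ignoreHiddenFiles = */ true);
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
        System.out.println("fixed pool created and shut down: " + pool.isShutdown());
    }
}
```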