How to check BagIt archive with a large number of files (~20'000) with BagVerifier #137

Open
UkDv opened this issue Feb 25, 2020 · 3 comments

UkDv commented Feb 25, 2020

What is the best ExecutorService configuration to pass to BagVerifier when checking a very large BagIt archive? I am using version 5.0.3 of the library.

By default, the 'isValid' function creates a thread for each file. With ~20'000 files, the process crashed.

I tried different options:

  • ExecutorService exeService = new ThreadPoolExecutor(0, 10000, 60L, TimeUnit.SECONDS, new SynchronousQueue());
    => crash
  • ExecutorService exeService = Executors.newFixedThreadPool(3000);
    => very long

What is your advice?
Thx
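
For context on the first option above: a ThreadPoolExecutor backed by a SynchronousQueue never queues work, so every task submitted while all workers are busy starts a new thread, up to the maximum pool size. The sketch below illustrates that behaviour using only java.util.concurrent. The class and method names are hypothetical, the task count is kept small on purpose, and the tasks simply block on a latch as a stand-in for slow file checks.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SynchronousQueueSketch {

    // Submits taskCount blocking tasks and reports the largest number of
    // threads the pool held at the same time.
    static int largestPoolSize(int taskCount) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, taskCount, 60L, TimeUnit.SECONDS, new SynchronousQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < taskCount; i++) {
                pool.execute(() -> {
                    try {
                        // Stand-in for a slow file check: keeps the worker busy.
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getLargestPoolSize();
        } finally {
            release.countDown();   // let every task finish
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // No task can be queued, so the pool grows to one thread per task.
        System.out.println("largest pool size: " + largestPoolSize(50));
    }
}
```

With a maximum of 10000 and ~20'000 files, a pool configured this way can grow to thousands of threads and then reject further tasks once the maximum is reached, which may explain the crash.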

@jscancella
Contributor

Validation speed is mostly determined by IO throughput, since the hashing is typically done by a specialized unit on your CPU and therefore adds very little overhead.

Where is the bag located - spinning disk, SSD, NFS, Samba mount, etc.? All of those choices will dramatically affect the rate of the verification.

Also, how many threads can your CPU actively run? If it is like mine, with 4 cores, having more than 4 threads isn't going to help very much, as the extra threads will just be waiting anyway.

So my suggestions would be:

  • use the fastest IO media you can (RAM disk would be fastest, followed by a good SSD)
  • set the thread pool size to the number of cores: ExecutorService exeService = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
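
To illustrate the second suggestion, here is a minimal sketch of how a fixed pool keeps the thread count bounded no matter how many tasks are submitted. It uses only java.util.concurrent; the class and method names are hypothetical, and the trivial tasks are stand-ins for per-file checks, not the library's real work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedPoolSketch {

    // Runs taskCount trivial tasks on a fixed pool and returns how many
    // distinct worker threads actually executed them.
    static int distinctThreadsUsed(int taskCount, int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                tasks.add(() -> {
                    // Stand-in for one file check: record which thread ran it.
                    threadNames.add(Thread.currentThread().getName());
                    return null;
                });
            }
            // invokeAll blocks until every task has completed; excess tasks
            // wait in the pool's queue instead of spawning new threads.
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return threadNames.size();
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        int used = distinctThreadsUsed(20_000, cores);
        System.out.println("cores=" + cores + ", worker threads used=" + used);
    }
}
```

The number of worker threads never exceeds the pool size, so 20'000 tasks are processed by at most as many threads as there are cores.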

@bbpennel

It may be a good idea to have the validator default to a fixed thread pool executor like the one described. We ran into the same issue with a bag of 10000+ files; in our case it resulted in a FileSystemException with message "Too many open files" from gov.loc.repository.bagit.verify.CheckIfFileExistsTask.existsNormalized(CheckIfFileExistsTask.java:59).

We ended up resolving it with a similar solution, but I also had to avoid calling close on the BagVerifier instance (which is AutoCloseable), since it would shut down the executor it was given, which we were sharing across BagVerifiers.
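
The shared-executor pitfall can be sketched without the library. In the example below, FakeVerifier is a hypothetical stand-in (it is not the bagit-java BagVerifier) whose close() shuts down the executor it was handed, which is the behaviour described above. All names are assumptions made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class SharedExecutorSketch {

    // Hypothetical stand-in for a verifier; NOT the bagit-java class.
    static final class FakeVerifier implements AutoCloseable {
        private final ExecutorService executor;

        FakeVerifier(ExecutorService executor) {
            this.executor = executor;
        }

        // Pretends to verify something by running one task on the executor.
        boolean verify() throws Exception {
            return executor.submit(() -> true).get();
        }

        // Mirrors the behaviour described above: closing shuts down the
        // executor, even though the verifier did not create it.
        @Override
        public void close() {
            executor.shutdown();
        }
    }

    // Returns true if a second verifier can still use the shared pool
    // after the first verifier has finished.
    static boolean secondVerifierWorks(boolean closeFirst) throws Exception {
        ExecutorService shared = Executors.newFixedThreadPool(2);
        try {
            FakeVerifier first = new FakeVerifier(shared);
            first.verify();
            if (closeFirst) {
                first.close();
            }
            FakeVerifier second = new FakeVerifier(shared);
            try {
                return second.verify();
            } catch (RejectedExecutionException e) {
                // A shut-down executor refuses new tasks.
                return false;
            }
        } finally {
            shared.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("without close: " + secondVerifierWorks(false));
        System.out.println("with close:    " + secondVerifierWorks(true));
    }
}
```

When an executor is shared, the code that created it is the natural place to shut it down, once all verifiers are done with it.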

@jscancella
Contributor

Or you could use my fork, which I actually maintain and which has fixed these problems.
