
Process datasets as a whole even with storage unit datafile #107

Open
RKrahl opened this issue Feb 28, 2020 · 11 comments
Labels: enhancement (New feature or request), idea (An idea that might need some discussion first), performance (Issues related to poor performance of the program)
Comments

RKrahl (Member) commented Feb 28, 2020

I suggest changing the behavior of ids.server when the storage unit is set to datafile: at the moment, each datafile is archived and restored individually. The suggestion is to always archive and restore entire datasets, regardless of the storage unit. That is, the distinction between the storage units would almost completely be dropped from the core of ids.server. The distinction would still be retained in the way the archive storage is accessed: for instance, the DfRestorer would then restore one single dataset rather than an arbitrary list of datafiles, but as opposed to the DsRestorer, it would still read each file individually from archive storage rather than a single zip file containing all the files of the dataset.
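To make the intent concrete, here is a rough sketch of such a per-dataset restore loop. The `ArchiveReader` and `MainWriter` interfaces below are placeholders for illustration only, not the actual ids.plugin API:

```java
// Hypothetical sketch: restore one dataset by reading each of its files
// individually from archive storage (no dataset zip file involved).
// ArchiveReader and MainWriter are placeholder interfaces, not the real
// ids.plugin interfaces.
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

interface ArchiveReader {
    InputStream get(String datafileLocation) throws IOException;
}

interface MainWriter {
    void put(String datafileLocation, InputStream data) throws IOException;
}

class PerDatasetRestorer {
    /**
     * Restores all datafiles of one dataset. The unit of work (and of locking)
     * is the whole dataset, but archive storage is still accessed per file.
     */
    void restore(List<String> datafileLocations, ArchiveReader archive, MainWriter main)
            throws IOException {
        for (String location : datafileLocations) {
            try (InputStream in = archive.get(location)) {
                main.put(location, in);
            }
        }
    }
}
```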

The main benefit would (hopefully) be a significant performance improvement in two respects:

  • Checking the status of requested data: for most user requests, we need to check whether the requested data is online. At the moment, with storage unit datafile, every single datafile must be checked. This requires an fstat call on each file, which is an expensive operation. Following the suggestion, only the presence of the dataset directory needs to be checked (a minimal sketch follows this list).

  • Restore: at the moment, with storage unit datafile, all pending restore requests in the queue are combined into a single restore call in order to avoid starting thousands of individual threads. That has the side effect that all affected datasets need to be exclusively locked, and these locks are kept until the full restore call is done. Following the suggestion, one restore thread per dataset would be started and the exclusive lock on that dataset would be released as soon as the restore of that dataset is done.
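A minimal sketch of the status-check difference, assuming each dataset maps to a directory in main storage (the directory layout is an assumption for illustration, not the actual plugin code):

```java
// Sketch contrasting the two "is this data online?" checks.
// Uses only java.nio.file; the dataset-directory layout is an assumption.
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class OnlineCheck {

    // Current behavior with storageUnit = datafile: one stat per datafile.
    static boolean allDatafilesOnline(List<Path> datafilePaths) {
        return datafilePaths.stream().allMatch(Files::exists);
    }

    // Proposed behavior: a single check on the dataset directory.
    static boolean datasetOnline(Path datasetDir) {
        return Files.isDirectory(datasetDir);
    }
}
```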

The only drawback I can see at the moment is the coarser granularity of archive and restore operations: if a user requests a single datafile from a dataset with a large number of datafiles, the full dataset will be restored, not only the requested file.

The suggestion would keep compatibility with existing archive storage. However, the upgrade procedure for the main storage will not be trivial: before upgrading to a version that implements this suggestion, it must be ensured that only complete datasets are online.

RKrahl added the enhancement and idea labels on Feb 28, 2020
dfq16044 commented

Currently at DLS, we have individual datasets with more than 1 million datafiles.

RKrahl (Member Author) commented Feb 28, 2020

Yes, and that is exactly what causes the performance issues you reported. It means that merely checking whether such a dataset is online costs you more than 1 million fstats, as opposed to one single fstat if this suggestion is implemented.

dfq16044 commented Feb 28, 2020 via email

antolinos commented

I wonder, given the available file formats (like HDF5), whether it makes sense to store 1M files per dataset. Would it be more efficient to address the root of the problem instead of the IDS?
We are reducing the number of files by a factor of 1000 just by using HDF5.

Just my opinion,
A.

dfq16044 commented Feb 28, 2020 via email

RKrahl (Member Author) commented Feb 28, 2020

@antolinos: sure, that makes sense. But it also makes sense to improve ids.server where its behavior is inefficient. Both are orthogonal paths of improvement that should be followed independently of each other. And finally, it makes sense to use challenging cases such as the situation at DLS to detect and understand inefficient behavior.

RKrahl (Member Author) commented Mar 4, 2020

This depends on #109.

RKrahl added this to the 2.0.0 milestone on Jan 6, 2021
RKrahl added the performance label on Mar 15, 2021
kevinphippsstfc (Contributor) commented

I just came across this during my work fixing IDS issues for Diamond, and I'm afraid I totally agree with @dfq16044. Because some Diamond datasets contain so many files, the proposed behaviour would be disastrous for Diamond. I also agree with the comments that datasets should not have so many files in them, but this is data that has been ingested over the last 10+ years and cannot just be deleted, tidied up or re-processed into NeXus files, so for now we are stuck with it.

RKrahl (Member Author) commented Mar 24, 2021

@kevinphippsstfc, I rather believe that the current behavior of ids.server, processing each of the millions of datafiles in a dataset individually, is what is disastrous for Diamond.

I regularly get complaints from Diamond about the poor performance of ids.server. This proposal is the direct result of an in-depth analysis of an incident at Diamond in January 2019 that caused problems for the tape systems due to the particular pattern in which the current implementation of ids.server sends restore requests. The main cause of the performance issues at Diamond is exactly the combination of datasets with a very large number of datafiles and the setting storageUnit = datafile. The thorough solution would be switching to storageUnit = dataset, but I understand that this is impossible because you cannot convert ten years' worth of legacy storage in the backend. This proposal is tailored exactly to your situation at Diamond and would bring you some of the performance benefits of storageUnit = dataset without the need to modify your tape archive. I still believe it would significantly improve the performance at Diamond.

kevinphippsstfc (Contributor) commented

Apologies @RKrahl, I was not aware that this suggestion originated from Chris's email to the ICAT group. It's good that that conversation is now linked into this issue. Also, many thanks for looking into this - I appreciate that this is not trivial in itself, having spent quite some time trying to understand the IDS myself!

I did some further thinking about this, and I realised that it would not be easy to decide whether a Dataset is online for Diamond. Diamond Datasets do not have the location field populated (the full path locations are in the Datafiles), and I presume this would be required, or else some programmatic way to derive a path to a top-level Dataset folder unique to that Dataset. If that folder exists on main storage, you would assume that all the Datafiles within the Dataset are also online.
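One hypothetical way to obtain such a folder without a populated Dataset.location would be to derive it from the Datafile locations themselves, for example as their common parent directory. This is only a sketch of the idea, not an existing Diamond or ids.server convention, and it assumes the datafiles of a dataset live under one directory:

```java
// Hypothetical helper: derive a dataset directory from the locations of its
// datafiles when Dataset.location is not populated. Assumes the datafiles of
// a dataset share a common parent directory on main storage.
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class DatasetDirGuess {

    static Path commonParent(List<String> datafileLocations) {
        if (datafileLocations.isEmpty()) {
            throw new IllegalArgumentException("dataset has no datafiles");
        }
        // Start from the parent of the first datafile and shorten it until it
        // is a prefix of the parent of every other datafile location.
        Path candidate = Paths.get(datafileLocations.get(0)).getParent();
        for (String location : datafileLocations) {
            Path parent = Paths.get(location).getParent();
            while (candidate != null && (parent == null || !parent.startsWith(candidate))) {
                candidate = candidate.getParent();
            }
        }
        return candidate;
    }
}
```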

RKrahl (Member Author) commented Mar 26, 2021

No need for apologies.

You hit a valid point: the decision whether a dataset is online is taken in the storage plugin, which needs to implement a method mainStorage.exists(DsInfo) (if the plugin is to support storageUnit = dataset, or if this proposal were implemented). How this decision is taken is up to the plugin. When formulating this proposal, I unconsciously took it for granted that each dataset has a dedicated folder in main storage and that it would be easy to determine the path of that folder from the DsInfo attributes, because that is the case in the reference plugin ids.storage_file and also in our own plugin at HZB. (In both cases, they don't use the location attribute; the folders in main storage are prescribed by the plugin itself.)
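For illustration, here is a minimal sketch of how an exists(DsInfo) could look in a plugin that prescribes the dataset folder from ICAT metadata, in the spirit of ids.storage_file. The DsInfo accessor names and the directory layout are assumptions, not copied from the real plugin:

```java
// Sketch of a plugin-prescribed dataset folder check. DsInfo below is a
// placeholder for org.icatproject.ids.plugin.DsInfo; the accessor names and
// the invName/visitId/dsName layout are assumptions for illustration.
import java.nio.file.Files;
import java.nio.file.Path;

interface DsInfo {
    String getInvName();
    String getVisitId();
    String getDsName();
}

class MainStorageSketch {

    private final Path baseDir;

    MainStorageSketch(Path baseDir) {
        this.baseDir = baseDir;
    }

    boolean exists(DsInfo dsInfo) {
        // The dataset folder is derived from ICAT metadata alone; the
        // Dataset.location attribute is not used.
        Path dir = baseDir.resolve(dsInfo.getInvName())
                .resolve(dsInfo.getVisitId())
                .resolve(dsInfo.getDsName());
        return Files.isDirectory(dir);
    }
}
```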

If this assumption does not hold, i.e. the plugin cannot implement mainStorage.exists(DsInfo), then there is indeed a problem.

RKrahl modified the milestone from 2.0.0 to 3.0.0 on Jul 21, 2023