Process datasets as a whole even with storage unit datafile #107

dfq16044: Currently at DLS, we have individual datasets with more than 1 million datafiles.

RKrahl: Yes, and that is exactly what causes the performance issues you reported. It means that merely checking whether such a dataset is online costs you more than 1 million fstats, as opposed to one single fstat if this suggestion is implemented.

dfq16044: The problem in this case is restoring 1 million datafiles from the tape system. This may put some stress on the tape library.

antolinos: I wonder, taking into account the available file formats (like HDF5), does it make sense to store 1M files per dataset? Would it be more efficient to address the root of the problem instead of the IDS? We are reducing by a factor of 1000 just by using HDF5. Just my opinion.

dfq16044: Dear Alex, I agree with you, but I cannot do anything with data that is already in the database... This is mainly processed data.

RKrahl: @antolinos: sure, that makes sense. But still, it also makes sense to improve ids.server where its behavior is inefficient. Both are orthogonal paths of improvement that should be followed independently of each other. And finally, it makes sense to use challenging cases such as the situation at DLS to detect and understand inefficient behavior.

RKrahl: This depends on #109.

kevinphippsstfc: I just came across this during my work fixing IDS issues for Diamond, and I'm afraid I totally agree with @dfq16044. Because some Diamond datasets contain so many files, the proposed behaviour would be disastrous for Diamond. I also agree with the comments that datasets should not have so many files in them, but this is data that has been ingested over the last 10+ years and cannot just be deleted, tidied up, or re-processed into NeXus files, so for now we are stuck with it.

RKrahl: @kevinphippsstfc, I rather believe that the current behavior of ids.server, processing each of the millions of datafiles in a dataset individually, is disastrous for Diamond. I regularly get complaints from Diamond about the poor performance of ids.server. This proposal is the direct result of an in-depth analysis of an event at Diamond in January 2019 that caused problems for the tape systems, due to the particular pattern of restore requests sent by the current implementation of ids.server. The main cause of performance issues at Diamond is exactly the combination of datasets having a very large number of datafiles with the storage unit being set to datafile.

kevinphippsstfc: Apologies @RKrahl, I was not aware that this suggestion originated from Chris's email to the ICAT group. It's good that that conversation is now linked to this issue. Also, many thanks for looking into this; I appreciate that it is not trivial, having spent quite some time trying to understand the IDS myself! I did some further thinking about this and realised that it would not be easy to decide whether a Dataset is online for Diamond. Diamond Datasets do not have the location field populated (the full path locations are in the Datafiles), and I presume this would be required, or perhaps some programmatic way to create a path to a top-level Dataset folder unique to that Dataset. If that folder exists on main storage, then you assume that all the Datafiles within the Dataset are also online.

RKrahl: No need for apologies. You hit a valid point: the decision whether a dataset is online is taken in the storage plugin, which needs to implement a method that checks whether a given dataset is present on main storage. If the assumption that the plugin can implement such a check does not hold for Diamond, that would indeed be a problem for this proposal.
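
For illustration, here is a minimal sketch of what such a dataset-level check in a main storage plugin could look like, along the lines Kevin describes above. The `DatasetInfo` record, the directory layout (investigation/visit/dataset), and the method name `existsOnMainStorage` are hypothetical stand-ins, not the actual ids.plugin API; the path scheme used at Diamond may well differ.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class MainStorageSketch {

    /** Hypothetical stand-in for the dataset metadata the plugin receives. */
    public record DatasetInfo(String invName, String visitId, String dsName) {}

    private final Path baseDir;

    public MainStorageSketch(Path baseDir) {
        this.baseDir = baseDir;
    }

    /**
     * Decide whether a dataset is online with a single directory check,
     * instead of one fstat per datafile. Assumes each dataset maps to a
     * unique folder derived from its metadata, as suggested above.
     */
    public boolean existsOnMainStorage(DatasetInfo ds) {
        Path dsDir = baseDir.resolve(ds.invName())
                            .resolve(ds.visitId())
                            .resolve(ds.dsName());
        return Files.isDirectory(dsDir);
    }
}
```

The point of the sketch is the cost model: deciding whether a dataset with a million datafiles is online becomes one directory lookup instead of a million fstat calls.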

For reference, the original proposal from the issue description (RKrahl):

I suggest changing the behavior of ids.server if the storage unit is set to datafile: at the moment, each datafile is archived and restored individually. The suggestion is to always archive and restore entire datasets, regardless of the storage unit. That is, the distinction of the storage unit would almost completely be dropped from the core of ids.server. The distinction would still be retained in the way the archive storage is accessed: for instance, the `DfRestorer` would then restore one single dataset rather than an arbitrary list of datafiles, but as opposed to the `DsRestorer`, it would still read each file individually from archive storage, rather than a single zip file containing all the files of the dataset.

The main benefit would (hopefully) be a significant improvement of performance in two respects:
- Checking the status of requested data: for most user requests, we need to check whether the requested data is online. At the moment, with storage unit datafile, every single datafile must be checked. This requires an fstat call on each file, which is an expensive operation. Following the suggestion, only the presence of the dataset directory needs to be checked.
- Restore: at the moment, with storage unit datafile, all pending restore requests in the queue are combined into a single restore call in order to avoid starting thousands of individual threads. This has the side effect that all concerned datasets need to be exclusively locked, and these locks are kept until the full restore call is done. Following the suggestion, one restore thread per dataset would be started, and the exclusive lock on a dataset would be released as soon as the restore of that dataset is done (see the sketch after this list).
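
As a rough sketch of that restore scheme (not the actual ids.server implementation): one worker task per dataset, with the exclusive lock held only while that particular dataset is being restored. The lock bookkeeping and the `restoreFile` placeholder are simplified assumptions.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

public class PerDatasetRestorer {

    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    // One exclusive lock per dataset id; the real lock manager in
    // ids.server is more involved than this.
    private final Map<Long, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Submit one restore task per dataset, as the proposal suggests. */
    public void restore(Map<Long, List<Path>> filesByDataset) {
        filesByDataset.forEach((dsId, files) -> pool.submit(() -> {
            ReentrantLock lock = locks.computeIfAbsent(dsId, id -> new ReentrantLock());
            lock.lock();
            try {
                // Storage unit datafile: each file is still read
                // individually from archive storage ...
                for (Path f : files) {
                    restoreFile(f);
                }
            } finally {
                // ... but the lock is released as soon as *this* dataset
                // is done, not when the whole restore queue is done.
                lock.unlock();
            }
        }));
    }

    private void restoreFile(Path file) {
        // Placeholder for the actual copy from archive to main storage.
    }
}
```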

The only drawback I can see for the moment is that we get a coarser granularity of archive and restore operations: if a user requests one single datafile from a dataset having a large number of datafiles, the full dataset will be restored, not only the requested file.

The suggestion would keep compatibility with existing archive storage. However, the upgrade procedure for the main storage will not be trivial: before upgrading to a version that implements this suggestion, it must be ensured that only complete datasets are online.
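
A minimal sketch of what such a pre-upgrade check could look like, assuming the list of registered datafile locations per dataset has already been obtained (in practice it would come from the ICAT database): it flags datasets that are partially online, i.e. where some but not all datafiles are present on main storage.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class UpgradeCheck {

    /**
     * Return the ids of datasets that are partially online: these must be
     * fully restored or fully archived before the upgrade. The file lists
     * are assumed to have been fetched from ICAT beforehand.
     */
    public static List<Long> partiallyOnline(Map<Long, List<Path>> filesByDataset) {
        List<Long> result = new ArrayList<>();
        filesByDataset.forEach((dsId, files) -> {
            long online = files.stream().filter(Files::exists).count();
            if (online > 0 && online < files.size()) {
                result.add(dsId);
            }
        });
        return result;
    }
}
```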