Allow local queries without write access to archive #258

dcesari · 2021-01-15T11:01:54Z

It is sometimes useful, especially in a shared HPC environment where it is difficult to keep a daemon running, to be able to run a local query on the filesystem without write permission on the archive directories. It is acceptable to assume that no messages are removed from the archived files, i.e. changes to the archives only add messages to files and indices.

spanezz · 2021-01-15T13:25:50Z

My question is: if I have no way to synchronize using the file system, how can I prevent a repack of the dataset, which can potentially rewrite any part of an existing data segment, from turning a running query into garbage?

I think answering that question depends on organizational structures and processes. For example, if queries are tied to specific times, one could add a dataset configuration defining query times and maintenance times on a daily schedule, for instance stating that queries are not allowed between 00:00 and 04:00, and repacks are not allowed between 04:00 and 24:00.
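To make the schedule idea concrete, here is a minimal Python sketch of such a check. It is not arkimet code: the names `MAINTENANCE_START`, `MAINTENANCE_END`, `in_window` and `query_allowed` are hypothetical, and the window values are just the ones from the example above.

```python
from datetime import time

# Hypothetical daily maintenance window (repacks allowed, queries refused).
# These values mirror the 00:00-04:00 example and are assumptions.
MAINTENANCE_START = time(0, 0)
MAINTENANCE_END = time(4, 0)


def in_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls in [start, end), also for windows wrapping midnight."""
    if start <= end:
        return start <= now < end
    # Window such as 22:00-02:00 that crosses midnight
    return now >= start or now < end


def query_allowed(now: time) -> bool:
    """Queries are allowed only outside the maintenance window."""
    return not in_window(now, MAINTENANCE_START, MAINTENANCE_END)
```

Note that this only checks the time at which a query starts: a long query started at 23:59 would still run into the maintenance window, so a real policy would also need a margin or a maximum query duration.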

Or there could be the assumption that when a dataset is down for maintenance, it gets unmounted/unexported from the readonly part of the filesystem where the queries happen? Like, taken offline for maintenance?

It's ok to assume that messages are not removed. How about messages overwriting old ones (like datasets with rewrite=yes), where a rewrite is a deletion+import?

Note also that a repack would reorder data in a dataset without deleting anything. For example, if data is not imported in strict reftime order, a repack reorders it so that a query, which returns data sorted by reftime, can read the segment sequentially as much as possible rather than jumping back and forth. I don't know how significant the impact of that optimization is, and I guess it depends on what kind of data is in a dataset. I'd expect it to be worse for BUFR and VM2, and not so bad for big GRIBs and HDF5 files. It's ok not to do that if the performance change is understood not to be a big deal.
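As a rough illustration of what the reordering buys, the following Python sketch counts how many times a reader that returns data sorted by reftime has to jump backwards in a segment. It is a toy model, not arkimet code: `count_backward_seeks` is a hypothetical helper, and reftimes are represented as plain integers.

```python
def count_backward_seeks(reftimes_in_file_order):
    """Count backward jumps needed to read a segment in reftime order.

    `reftimes_in_file_order[i]` is the reftime of the i-th message as it
    is physically stored in the segment.
    """
    # File positions, visited in the order a reftime-sorted query reads them.
    # sorted() is stable, so equal reftimes keep their file order.
    read_order = sorted(
        range(len(reftimes_in_file_order)),
        key=lambda i: reftimes_in_file_order[i],
    )
    # A backward seek happens whenever the next position precedes the current one.
    return sum(1 for prev, cur in zip(read_order, read_order[1:]) if cur < prev)
```

For a segment imported out of order, such as `[3, 1, 2]`, the reader has to jump back once; after a repack the same data is stored as `[1, 2, 3]` and is read in a single sequential pass. How much each jump costs in practice depends on message size, which is why small BUFR or VM2 messages would likely suffer more than large GRIB or HDF5 files.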

I feel like there are many options and none universally good, and I'd like to identify some scenarios in detail in order to work out specific sets of tradeoffs.

dcesari · 2023-12-12T16:58:47Z

Actually a low-profile implementation that performs the query on a best-effort basis (possibly returning an error code if there is the chance to assess that some of the relevant metadata have changed in the middle) would be enough.

Of course this behavior should be enabled by an option acting as a disclaimer for the users that they can receive rubbish.

If there is a chance to implement such a behavior without a big effort we could go on, otherwise just close as WONTDO.
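One possible shape for the best-effort check, as a minimal Python sketch rather than a proposal for the actual implementation: snapshot the file's `stat` information before reading, read the data, and compare afterwards. The names `SegmentChanged`, `_fingerprint` and `read_best_effort` are hypothetical, and the sketch assumes one segment per file.

```python
import os


class SegmentChanged(Exception):
    """Raised when a segment appears to have changed while it was being read."""


def _fingerprint(path):
    # Size, modification time and inode: a repack that rewrites or
    # replaces the file is expected to change at least one of them.
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ino)


def read_best_effort(path, offset, size):
    """Read `size` bytes at `offset` without taking any lock."""
    before = _fingerprint(path)
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(size)
    after = _fingerprint(path)
    # A short read or a changed fingerprint means the data cannot be trusted.
    if before != after or len(data) != size:
        raise SegmentChanged(path)
    return data
```

This check is conservative: a concurrent append also changes size and mtime and would be reported as a change, even though appends are assumed to be harmless. It is also not airtight, since a rewrite that happens to preserve size, mtime and inode would go unnoticed, which fits the "users can receive rubbish" disclaimer.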

dcesari added the enhancement label Jan 15, 2021

dcesari assigned spanezz and unassigned spanezz Jan 15, 2021

dcesari mentioned this issue Jan 15, 2021

Implement eatmydata: yes in the dataset configuration #233

Closed

dcesari assigned spanezz Jan 15, 2021

spanezz assigned dcesari and unassigned spanezz Mar 22, 2021

dcesari mentioned this issue Jul 25, 2023

query fails for permission denied #314

Open
