
Unmanaged datasets destroyed at boot time after import of zpool.cache #103

Closed
almereyda opened this issue Apr 30, 2020 · 18 comments

@almereyda

Describe the bug

When custom datasets are present in the system, they are not recognised by zsys and are scheduled for destruction. During apt actions, the zsys-commit and zsys-gc journals report:

WARNING Couldn't find any association for user dataset epool/USERDATA/edata

zsys-gc further reports that it cannot destroy the newly created datasets while they are in use:

WARNING Couldn't destroy unmanaged user dataset epool/USERDATA/edata: couldn't destroy "epool/USERDATA/edata" and its children: stop destroying dataset on "epool/USERDATA/edata", cannot destroy child: cannot destroy dataset "epool/USERDATA/edata/pub": dataset is busy

After the next reboot, the dataset is not yet mounted and can then be destroyed:

# zpool history -il epool
2020-04-28.23:20:34 [txg:16285] open pool version 5000; software version unknown; uts acacia 5.4.0-26-generic #30-Ubuntu SMP Mon Apr 20 16:58:30 UTC 2020 x86_64 [on acacia]
2020-04-28.23:20:35 [txg:16287] import pool version 5000; software version unknown; uts acacia 5.4.0-26-generic #30-Ubuntu SMP Mon Apr 20 16:58:30 UTC 2020 x86_64 [on acacia]
2020-04-28.23:20:40 zpool import -c /etc/zfs/zpool.cache -aN [user 0 (root) on acacia:linux]
2020-04-28.23:30:44 [txg:16409] destroy epool/USERDATA/edata/pub (600)  [on acacia]
...
2020-04-28.23:31:35 [txg:16451] destroy epool/USERDATA/edata/home (702)  [on acacia]
2020-04-28.23:31:38 [txg:16453] destroy epool/USERDATA/edata/data (607)  [on acacia]
2020-04-28.23:31:41 [txg:16455] destroy epool/USERDATA/edata/src (902)  [on acacia]
2020-04-28.23:31:43 [txg:16457] destroy epool/USERDATA/edata (70)  [on acacia]

To Reproduce
Steps to reproduce the behavior:

  1. Create a fresh install of Ubuntu 20.04 using the experimental ZFS support.
  2. Create an encrypted pool epool and, recursively below it, other datasets such as epool/USERDATA, epool/USERDATA/pub etc.
  3. Transfer several TBs into your new pool.
  4. Reboot the system and have the pool's data destroyed automatically.

Expected behavior

The unmanaged datasets are left intact.

ubuntu user /tmp/report
https://gist.github.com/almereyda/8c2af14ba9de5c26d5309d75ae654f40

Installed versions:

  • OS: Ubuntu 20.04
  • Zsysd running version: 0.4.5

Additional context
Early information about the case was documented in https://askubuntu.com/questions/1232637

@didrocks
Member

didrocks commented Apr 30, 2020

Thanks for the bug report!

The only place in the pool hierarchy where people shouldn’t manually create datasets is <pool>/USERDATA/. This is where zsys has to garbage collect, to free up space, any datasets not associated with any system state. We need to do that in that particular place because there are a lot of interactions between multiple machines, where we end up with datasets there that are up for destruction. This is what is covered by the hundreds of tests we have in ZSys (the same goes for system states, but those have to be clones of other datasets associated to root).

The rest of the pool is yours (rpool/edata, for instance, or rpool/home will never be destroyed and is kept unmanaged by ZSys). I agree that the ZSys warning log is wrong though: it shouldn’t report those as unmanaged datasets on the to-be-deleted list.

A set of blog posts, from basic usage to advanced topics, is going to be published on how ZSys interacts with the system. I will be sure to cover this in them, and maybe even in the FAQ (if you are interested in drafting a page, I will happily merge it into the repo).

EDIT: Another way to see it is to ignore any dataset that is not a clone of any user state of any machine detected by ZSys. The issue with that approach is that people who reinstall their machine without resetting their pool, and who don’t know how ZFS works, will never get an opportunity to reclaim this additional space, but it would be doable. We need to think about the impact of such a decision.

@almereyda
Author

almereyda commented Apr 30, 2020

Thank you for the quick reaction.

Please excuse if this is a naive perspective, but given the assumptions that

  • using a USERDATA dataset is a convention and good practice on ZFS and
  • the custom pool epool is neither of the system pools bpool nor rpool,

I do not expect zsys to have any authority to silently (!) destroy any data, in this case many TBs.

We are also using the USERDATA convention in deployment operations integrated with the paths that libre.sh expects, and this way all user domains would have been destroyed without warning.

This regression might well explain why ZFS support is labelled experimental in the latest LTS, but under no circumstances may it delete data without asking when users follow good practice and sane defaults. The decision to maintain USERDATA datasets under rpool/ might still seem relevant for the zsys use case, but that should not automatically extend to manually created pools following other naming schemes.

A note on how the datasets can be recovered, if there is any chance of that, would also be helpful.

@almereyda
Author

almereyda commented Apr 30, 2020

After some more research, I am looking at the following.

This is the part where simple changes could remove unmanaged datasets from garbage collection:

allDatasets := make([]*zfs.Dataset, 0, len(ms.allSystemDatasets)+len(ms.allPersistentDatasets)+len(ms.allUsersDatasets)+len(ms.unmanagedDatasets))
byOrigin := make(map[string][]string) // list of clones for a given origin (snapshot)
snapshotsByDS := make(map[string][]string) // List of snapshots for a given dataset
log.Debug(ctx, i18n.G("Collect datasets"))
allDatasets = append(allDatasets, ms.allSystemDatasets...)
allDatasets = append(allDatasets, ms.allPersistentDatasets...)
allDatasets = append(allDatasets, ms.allUsersDatasets...)
allDatasets = append(allDatasets, ms.unmanagedDatasets...)

As the name already implies: an unmanaged dataset should not be managed.
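To illustrate, a minimal and purely hypothetical sketch (reusing the names from the snippet above; this is not a tested patch against the zsys code base) would simply leave the unmanaged list out of the candidate set:

// Hypothetical sketch: build the GC candidate list without the unmanaged
// datasets, so garbage collection never considers them in the first place.
allDatasets := make([]*zfs.Dataset, 0,
    len(ms.allSystemDatasets)+len(ms.allPersistentDatasets)+len(ms.allUsersDatasets))
allDatasets = append(allDatasets, ms.allSystemDatasets...)
allDatasets = append(allDatasets, ms.allPersistentDatasets...)
allDatasets = append(allDatasets, ms.allUsersDatasets...)
// ms.unmanagedDatasets is intentionally not appended here.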

To cite the article:

The actual facility is called with the command zsysctl service gc -a, which performs garbage collection on all zsys namespaces present on the system.
...
As Ubuntu developer Didier Roche explained on Twitter, Ubuntu 20.04 systems don't automatically run zsysctl service gc yet. Roche strongly recommends that nobody run the command on any production systems for the time being, since it's a data-destroying feature that's still in alpha.

The IRC logs linked in the thread also mention

Saviq | didrocks: and what does it consider "garbage"?
didrocks | automated snapshots (not taken by you manually) or manual clones, with a decay garbage rules
...
didrocks | I guess you start to underestand why we didn’t enable it by default yet (but we will for 20.04)

Combining the above, it becomes even harder for me to understand (1) why my personal datasets are considered snapshots in the first place, as they did not have any, and (2) why this has been enabled for the LTS release. It seems my very own data is treated as a manual clone, following the naming above.

  • Maybe garbage collection should only handle datasets that carry the _UUID particle in their name hierarchy?

Fortunately the systemd unit does not run with the -a flag.

ExecStart=/sbin/zsysctl service gc

  • Maybe it would be a good compromise to remove the all-encompassing policy of including unmanaged datasets in garbage collection when the all option is not consciously set (see the sketch below)?

Similar to what we find in

// If all is set manual snapshots are considered too
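A minimal sketch of such a guard, again purely hypothetical (the all flag and the list names are my assumptions, not existing zsys variables):

// Hypothetical sketch: only add unmanaged user datasets to the garbage
// collection candidates when the caller explicitly passed -a/--all.
if all {
    allDatasets = append(allDatasets, ms.unmanagedDatasets...)
}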

@didrocks
Member

didrocks commented May 1, 2020

In summary (sorry for the long answer, but I prefer to explain the different causes and consequences, which also helps me think about this issue): this needs a little bit of thought. Sorry for your experience of unexpected dataset deletion, but I think we’ll find a way around it.

  • using a USERDATA dataset is a convention and good practice on ZFS and

Do you have a reference for this? USERDATA has never been a convention on ZFS. ZSys came up with this naming for subspacing while drafting the specification in April 2019. The name was discussed with ZFS upstream (who maintain the ZFS-on-root Ubuntu guide), and that is where we settled on it. So no, this is not an existing convention.

the custom pool epool is none of the system pools bpool nor rpool
Indeed, BUT we had requests from advanced users (as you are as well) to have user dataset capabilities on any pool. We thus have a heuristic to find and merge all user datasets across any available pools:

  • user A can be on rpool and local
  • user B can be shared with machine C and part of any other pool.
    Same for user creation: we prefer an existing USERDATA if any (first on the local pool, then on other pools). Finally, if no USERDATA is present, we look at the local pool.

So, even if a "don’t touch outside of rpool" rule seems trivial at first, the reality is quite different for the above reasons, as per advanced user requests.

It is for all those reasons that we have, until now, seen USERDATA as a reserved namespace. It seems we were wrong (I would still welcome documentation about this convention, and an explanation of how we managed to match a years-old convention by chance).

The decision to maintain USERDATA datasets under rpool/ might still seem relevant for the zsys use case, but that should not automatically extend to manually created pools following other naming schemes.

Your suggestion about limiting this to rpool doesn’t work, for the reasons I gave about the multi-machine and multi-pool user cases. And just as you have run into this issue, I will bet that the next user bug report will be "why did you remove my rpool/USERDATA/awesomedata?". So that won’t fix the issue :)

Combining the above together, it becomes even harder for me to understand, (1) why my personal datasets are considered to be snapshots in the first place, as they did not have any, and (2) why this has been enabled for the LTS release. It seems my very own data is considered as a manual clone, following the naming above.

We have to track unmanaged datasets in this line of code to look for any dependencies which may prevent deleting a state (dataset clones and such). At the end of the GC, we look for any untagged filesystem datasets (because some datasets can’t be destroyed immediately, only untagged) and remove them one after another to "clean" the system.

Maybe garbage collection should only handle datasets that carry the _UUID particle in their name hierarchy?
I foresee the same issue: someone creating rpool/USERDATA/awesome_data and getting their filesystem datasets destroyed.

Maybe it would be a good compromise to remove the all-encompassing policy to include unmanaged datasets in garbage collection, if the all option is not consciously set?

I think we should just leave alone any filesystem dataset which is its own space: meaning it is not a clone, and none of its clones/snapshots (or snapshots on clones) is attached to any system state for any machine.

We lose the GC capabilities for other machines this way, but this is a reasonable tradeoff.
There is then the issue of user deletion: if a user is deleted, we can’t immediately delete all snapshots and clones for this user (because some system reverts need to restore the user). As long as they have one state attached to a system state, this is fine. However, when we detach the last user state from a system state (because of GC), it will fall under this unmanaged dataset definition, and we don’t want to keep it forever.
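To make the proposed rule concrete, here is a self-contained, hypothetical sketch with simplified types (not the real zsys structures): a filesystem dataset is left alone when it is neither a snapshot nor a clone, and nothing derived from it is attached to any system state.

package main

import "fmt"

// dataset is a simplified stand-in for a ZFS dataset; it is not the zsys type.
type dataset struct {
    name       string
    isSnapshot bool
    origin     string     // non-empty: this dataset is a clone of that snapshot
    bootfsOf   []string   // system states this dataset is attached to
    derived    []*dataset // snapshots and clones taken from this dataset
}

// attachedToAnyState reports whether d, or anything derived from it,
// is attached to at least one system state.
func attachedToAnyState(d *dataset) bool {
    if len(d.bootfsOf) > 0 {
        return true
    }
    for _, c := range d.derived {
        if attachedToAnyState(c) {
            return true
        }
    }
    return false
}

// leaveAlone is true when the dataset is its own space: not a snapshot,
// not a clone, and with nothing derived from it attached to a system state.
func leaveAlone(d *dataset) bool {
    return !d.isSnapshot && d.origin == "" && !attachedToAnyState(d)
}

func main() {
    edata := &dataset{
        name:    "epool/USERDATA/edata",
        derived: []*dataset{{name: "epool/USERDATA/edata@manual", isSnapshot: true}},
    }
    fmt.Println(leaveAlone(edata)) // true: nothing here belongs to any system state
}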

This needs a little bit of thought. Sorry again for your experience of unexpected dataset deletion. Fortunately, we have a lot of tests which cover various cases in this area (https://github.com/ubuntu/zsys/blob/master/internal/machines/machines_test.go#L1120, https://github.com/ubuntu/zsys/blob/master/internal/machines/machines_test.go#L939, https://github.com/ubuntu/zsys/blob/master/internal/machines/internal_test.go#L420, https://github.com/ubuntu/zsys/blob/master/internal/machines/internal_test.go#L509), and you can see that there is a TODO we need to change (https://github.com/ubuntu/zsys/blob/master/internal/machines/machines_test.go#L1192) to say "detached user datasets should not be collected".

I'll open another bug to track user deletion strategy with this new approach.

What do you think?

@didrocks
Member

didrocks commented May 1, 2020

Some notes after looking into it a little bit:

  • the "unmanaged" warning is a leftover of the previous wording: we are indeed looping over AllUserDatasets and not UnmanagedDatasets, so this part is correct:

for _, d := range ms.allUsersDatasets {
    if d.IsSnapshot {
        continue
    }
    if d.BootfsDatasets != "" {
        continue
    }
    for rootDataset := range userDatasetsToKeep {
        if d.Name == rootDataset || strings.HasPrefix(d.Name, rootDataset+"/") {
            continue nextDataset
        }
    }
    for _, n := range alreadyDestroyedRoot {
        if strings.HasPrefix(d.Name, n+"/") {
            continue nextDataset
        }
    }
    hasChanges = true
    // We destroy here all snapshots and leaf attached. Snapshots won’t be taken into account, however, we don’t want
    // to try destroying leaves again, keep a list.
    if err := nt.Destroy(d.Name); err != nil {
        log.Warningf(ctx, i18n.G("Couldn't destroy unmanaged user dataset %s: %v"), d.Name, err)
    }

  • We thus need to ensure that this kind of dataset doesn’t end up in AllUserDatasets when loading it, and migrate them to the UnmanagedDatasets list instead.

This is for filesystem datasets:

if !associateWithAtLeastOne {
    log.Warningf(ctx, i18n.G("Couldn't find any association for user dataset %s"), r.Name)
}

This is for snapshot datasets:

if !associateWithAtLeastOne {
    log.Warningf(ctx, i18n.G("Couldn't find any association for user dataset %s"), r.Name)
}

We can append them to a list and skip over them when inflating the AllUserDatasets list so that they end up in the unmanaged one:

for _, d := range flattenedUserDatas {
    if d.CanMount == "off" {
        continue
    }
    machines.allUsersDatasets = append(machines.allUsersDatasets, d)
}
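A hypothetical sketch of that skip, building on the loop above (the unassociated set and the unmanagedDatasets append are my assumptions, not existing zsys fields):

// Hypothetical sketch: remember the names of datasets that had no
// association (collected where the warnings above are logged), then route
// them to the unmanaged list instead of allUsersDatasets.
unassociated := make(map[string]bool)
// ... filled with r.Name wherever "Couldn't find any association" is logged ...

for _, d := range flattenedUserDatas {
    if d.CanMount == "off" {
        continue
    }
    if unassociated[d.Name] {
        machines.unmanagedDatasets = append(machines.unmanagedDatasets, d)
        continue
    }
    machines.allUsersDatasets = append(machines.allUsersDatasets, d)
}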

EDIT: this is slightly more complex and needs a second pass, because we can find a snapshot (same name) or a filesystem dataset clone associated with one part of the system; in that case all snapshots and filesystem datasets then become "managed" so that the user can see the whole history (this is typically the case for a deleted user).

The remaining question is about the warning "Couldn't find any association for user dataset". I feel it should still be there (it is emitted when refreshing the internal machine representation); maybe the wording can be slightly changed? Should we downgrade it to INFO (as we are starting to expect this as a valid use case)? Any opinions are welcome!

@almereyda
Author

Thank you, Didier! I will have to think about this and get back to you with the reference to the USERDATA convention (in use on our machines since Oct. 2017 at the latest), and try to understand the involved mechanics without knowing the code. Please allow me to meditate on that.

Today is international workers day ✊ so thanks for your work until here indeed!

@didrocks
Member

didrocks commented May 1, 2020

No worries! Thanks for raising this :)

Keep me posted once you have given it some thought. Just a disclaimer that in the following days I may not be available as much as I would like, due to personal reasons, but I’ll get back to you :)

@almereyda
Author

almereyda commented May 26, 2020 via email

@didrocks
Member

Yeah :) And we found a way to deal with deleted users as well (#131), so the next release will fix this issue without regressing the other use case :)

@ianmccul

It seems that many people are running into this problem and losing data. For newcomers to ZFS, it is fairly natural to follow the existing naming convention that zfs list shows and name datasets in some additional storage pool as pool/USERDATA/user. If Zsys wants to reserve a name for itself, then it should be something that makes it really obvious the namespace is private to Zsys and that users shouldn't name their datasets like that. 'USERDATA' is a terrible choice. 'ZSYS-MANAGED', or something like that, would be better.

@malachid

malachid commented Jul 7, 2020

I agree with @ianmccul. I also used rpool/USERDATA/user, as that seemed like the logical place on a fresh install.

@Lockszmith-GH

Lockszmith-GH commented Jul 10, 2020

Sorry to stick my nose in here; if this has been discussed elsewhere and decided against, please feel free to mark this as off-topic.

Wouldn't a custom ZFS option/property such as zsys:managed_userdata_dataset=true solve the issue?
Using zfs get -t filesystem,volume zsys:managed_userdata_dataset to identify whether a dataset is relevant or not?
Similar to how zfs_autobackup manages its options.
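For illustration only, a small, self-contained sketch of how such an opt-in check could look when shelling out to the zfs CLI. The property name is the one proposed above, not an existing zsys feature, and isOptedIn is a hypothetical helper:

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

// isOptedIn returns true only if the dataset explicitly carries the
// proposed zsys:managed_userdata_dataset=true user property.
func isOptedIn(dataset string) (bool, error) {
    out, err := exec.Command("zfs", "get", "-H", "-o", "value",
        "zsys:managed_userdata_dataset", dataset).Output()
    if err != nil {
        return false, err
    }
    // An unset user property is reported as "-" by zfs get.
    return strings.TrimSpace(string(out)) == "true", nil
}

func main() {
    ok, err := isOptedIn("epool/USERDATA/edata")
    if err != nil {
        fmt.Println("zfs get failed:", err)
        return
    }
    fmt.Println("managed by zsys:", ok)
}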

@didrocks
Member

Isn’t that fixed by zsys 0.4.6? In addition to the dataset being under USERDATA/, we now expect the zsys bootfs-dataset property to be set before collecting it. Have you experimented with a newer version?

@h-2

h-2 commented May 14, 2021

Oh no, the same thing just deleted my home dataset with a week of work on it 😱 Is there any way I can get it back? This is really bad default behaviour.

I followed https://talldanestale.dk/2020/04/06/zfs-and-homedir-encryption/ and manually set the zsys bootfs-dataset property on the dataset so that it wouldn't be deleted. It worked fine for a week, but today I mistyped my password three times, and after another reboot the dataset was gone completely. Why, oh why? 😭

@loxK

loxK commented Mar 4, 2022

On Ubuntu 21.10, zsys just destroyed some datasets on a zpool on a secondary hard drive: lpool/USERDATA/user/www and lpool/USERDATA/user/dev. I lost a week of hard work.

@didrocks how can this be possible?

@h-2

h-2 commented Mar 7, 2022

@didrocks Please reopen this issue. Your services are deleting the data of your users.

@didrocks
Member

didrocks commented Mar 7, 2022

This discussion has moved to #196, as this has nothing to do with this bug. However, please be aware of #213.

@a0c

a0c commented Apr 10, 2023

This might be helpful in case of issues with backups.
