`k0s reset` deleted all data on persistent volumes #4318
Comments
Would having a prompt make the UX better, and perhaps reduce the chance of such occurrences happening? I'd add a … Goes without saying, this would be a breaking change.
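For illustration, a confirmation gate might look like the sketch below. This is only my sketch: the `confirmReset` helper, the `--force` escape hatch, and reading the answer from stdin are all assumptions, not anything proposed in this thread.

```go
// Hypothetical sketch of a confirmation prompt for `k0s reset`.
// confirmReset, the force flag, and the wording are all made up here.
package main

import (
	"bufio"
	"errors"
	"fmt"
	"os"
	"strings"
)

func confirmReset(force bool, dataDir string) error {
	if force {
		return nil // assumed --force flag keeps scripted resets working
	}
	fmt.Printf("This removes everything under %s, including data on any volumes still mounted beneath it. Continue? [y/N]: ", dataDir)
	answer, err := bufio.NewReader(os.Stdin).ReadString('\n')
	if err != nil {
		return err
	}
	switch strings.ToLower(strings.TrimSpace(answer)) {
	case "y", "yes":
		return nil
	default:
		return errors.New("reset aborted by user")
	}
}

func main() {
	if err := confirmReset(false, "/var/lib/k0s"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("proceeding with reset...")
}
```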
Well, there is of course the safeguard that k0s must not be running. The main problem in this case is that reset first tries to unmount such volumes, but even if it fails to do so, it does not abort: it proceeds to recursively delete data under the datadir/rundir, which includes data on any volumes mounted under them.
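To make the failure mode concrete, here is a small self-contained demo of my own (made-up paths, run as root in a scratch area) showing that `os.RemoveAll` does not stop at mount boundaries:

```go
// Self-contained demo (made-up paths; run as root in a scratch area)
// showing that os.RemoveAll does not stop at mount boundaries.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	must := func(err error) {
		if err != nil {
			panic(err)
		}
	}
	must(os.MkdirAll("/tmp/demo/data", 0o755))
	must(os.MkdirAll("/tmp/demo/mnt", 0o755))
	must(os.WriteFile("/tmp/demo/data/precious", []byte("hi"), 0o644))

	// Bind-mount "data" under the directory we are about to remove,
	// like a PV mounted under the kubelet dir.
	must(unix.Mount("/tmp/demo/data", "/tmp/demo/mnt", "", unix.MS_BIND, ""))
	defer unix.Unmount("/tmp/demo/mnt", 0) // clean up the bind mount

	// RemoveAll descends straight through the bind mount; the final
	// rmdir of the mount point fails with EBUSY, but by then the
	// contents, i.e. the data on the backing directory, are gone.
	fmt.Println("RemoveAll:", os.RemoveAll("/tmp/demo/mnt"))
	_, statErr := os.Stat("/tmp/demo/data/precious")
	fmt.Println("precious survived?", statErr == nil)
}
```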
Ah, that's useful to know, thanks! I didn't realise mounted volumes were among the potential data up for deletion during a reset. The docs …

So I perhaps should have proceeded with a little more caution! I do wonder whether a bullet point could be added to the docs there that includes something similar to this line.

Happy to make a PR for it, if it helps! Otherwise, good to close this issue.
The issue is marked as stale since no activity has been recorded in 30 days.
@devsjc PRs are always welcome :D
The issue is marked as stale since no activity has been recorded in 30 days.
@jnummelin Please reopen this issue. This is a serious problem and a critical design flaw (yes, even "just" clearing /var/lib/k0s) IMO. This behaviour recently caused a serious data-loss incident at my organisation, where a k0s reset command was executed before unmounting critical data as part of a large and complex release. If we did not have backup policies in place, this would have been an existential problem for our organisation; as it was, it caused a week of headaches and a large cloud egress bill for restoring deleted data. There is just no way that silent deletion of data is acceptable behaviour for any software intended for enterprise use. We have DevOps capability in my team; happy to contribute with some guidance.

```go
// from pkg/cleanup/directories.go
var dataDirMounted bool

// search and unmount kubelet volume mounts
for _, v := range procMounts {
	if v.Path == filepath.Join(d.Config.dataDir, "kubelet") {
		logrus.Debugf("%v is mounted! attempting to unmount...", v.Path)
		if err = mounter.Unmount(v.Path); err != nil {
			logrus.Warningf("failed to unmount %v", v.Path)
		}
	} else if v.Path == d.Config.dataDir {
		dataDirMounted = true
	}
}

// ...etc

if err := os.RemoveAll(d.Config.dataDir); err != nil { ...
```

As discussed above, failing to unmount does not prevent the RemoveAll from executing.
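A minimal tweak at this spot, as a sketch only (the `unmountOrAbort` helper is hypothetical, and the `k8s.io/mount-utils` import is my guess at where `mounter` comes from), would be to make the unmount failure fatal so `os.RemoveAll` never runs over a live mount:

```go
import (
	"fmt"

	mount "k8s.io/mount-utils" // assumed source of the mounter above
)

// unmountOrAbort is a hypothetical helper, not the actual k0s patch:
// it turns an unmount failure into a hard error so the caller can
// abort the reset before any os.RemoveAll runs.
func unmountOrAbort(mounter mount.Interface, path string) error {
	if err := mounter.Unmount(path); err != nil {
		return fmt.Errorf("refusing to delete %s: unmount failed: %w", path, err)
	}
	return nil
}
```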
I think the safest option is to implement recursive directory removal using openat2 along with …
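As a rough illustration of that idea, here is my own sketch using golang.org/x/sys/unix; the `RESOLVE_NO_XDEV` flag and the skip-on-EXDEV handling are my assumptions about the intent, not the commenter's actual proposal (openat2 needs Linux 5.6+):

```go
package main

import (
	"errors"
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

var errSkipMount = errors.New("mount point: skipping")

// removeContentsNoXdev deletes everything under the directory `name`
// (resolved relative to dirfd) but leaves anything on another mount
// alone: RESOLVE_NO_XDEV makes the kernel fail path resolution with
// EXDEV if it would cross a mount point, including bind mounts.
func removeContentsNoXdev(dirfd int, name string) error {
	how := &unix.OpenHow{
		Flags:   unix.O_RDONLY | unix.O_DIRECTORY | unix.O_NOFOLLOW,
		Resolve: unix.RESOLVE_NO_XDEV,
	}
	fd, err := unix.Openat2(dirfd, name, how)
	if err == unix.EXDEV {
		return errSkipMount // a mount point: leave it (and its data) in place
	} else if err != nil {
		return err
	}
	dir := os.NewFile(uintptr(fd), name)
	defer dir.Close()

	entries, err := dir.Readdirnames(-1)
	if err != nil {
		return err
	}
	for _, e := range entries {
		// Plain unlink first; if that fails, recurse and then rmdir.
		if unix.Unlinkat(fd, e, 0) == nil {
			continue
		}
		switch err := removeContentsNoXdev(fd, e); err {
		case nil:
			if err := unix.Unlinkat(fd, e, unix.AT_REMOVEDIR); err != nil {
				return fmt.Errorf("rmdir %s: %w", e, err)
			}
		case errSkipMount:
			// mounted directory: skipped, nothing deleted beneath it
		default:
			return err
		}
	}
	return nil
}

func main() {
	// Demo: pass a path relative to the current directory; a real reset
	// would open the data dir's parent first and start from there.
	if err := removeContentsNoXdev(unix.AT_FDCWD, os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```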
We definitely need to refactor this. The problem is bigger than crossing mount points: it can delete completely unrelated stuff.

```sh
mkdir -p /tmp/camerun/netnsomething/foo
echo data > /tmp/camerun/netnsomething/foo/data
sudo mount --bind /tmp/camerun/netnsomething/foo /tmp/camerun/netnsomething/
```

Start … and boom! The … EDIT: this is a separate issue.
A bind mount is a mount point, isn't it? Btw, I've been investigating this for quite a few days already.
Make sure that we don't have anything mounted under those directories so we don't delete persistent data. We do this by parsing /proc/mounts in reverse order, as it is listed in mount order, and then we unmount anything that is under our directories before we delete them. Don't unmount datadir itself if it is on a separate partition/mount.

Fixes k0sproject#4318

Signed-off-by: Natanael Copa <[email protected]>
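A sketch of what that commit message describes might look like the following; this is my reconstruction under stated assumptions (the `unmountBelow` name and error handling are made up, and real code would also unescape the `\040`-style octal escapes /proc/mounts uses for spaces):

```go
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

// unmountBelow unmounts everything strictly below dataDir, walking
// /proc/mounts bottom-up so inner mounts are unmounted before the
// mounts they live on. dataDir itself is deliberately left mounted,
// since it may be its own partition.
func unmountBelow(dataDir string) error {
	raw, err := os.ReadFile("/proc/self/mounts")
	if err != nil {
		return err
	}
	lines := strings.Split(strings.TrimSpace(string(raw)), "\n")
	for i := len(lines) - 1; i >= 0; i-- { // reverse of mount order
		fields := strings.Fields(lines[i])
		if len(fields) < 2 {
			continue
		}
		target := fields[1] // second field is the mount point
		if strings.HasPrefix(target, dataDir+"/") {
			if err := unix.Unmount(target, 0); err != nil {
				return fmt.Errorf("unmount %s: %w", target, err)
			}
		}
	}
	return nil
}

func main() {
	if err := unmountBelow("/var/lib/k0s"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```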
Platform
Linux 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
Version
v1.29.3+k0s.0
Sysinfo
`k0s sysinfo`
What happened?

`k0s reset`, followed by a node reboot, deleted all files from all persistent volumes, irrespective of their `Retain` policies. Folders remained, but were completely empty.

Steps to reproduce
I have not managed to reproduce the error (thankfully!)
Expected behavior

Persistent volumes mounted with the `Retain` policy are untouched on a reset; only k0s' `/var/lib/k0s` directory gets cleaned.

Actual behavior
See "What Happened?"
Screenshots and logs

I have the full set of logs from `sudo journalctl -u k0scontroller -r -U 2024-04-13 -S 2024-04-11`, but I imagine a more focussed subset is more useful!

Additional context
Firstly, thanks for the great tool!
I have previously run `k0s reset` a fair few times without issue, with no changes to the way volumes were mounted or to the services running in the cluster. All I can think that separates this one from the others is the context of the reset regarding why it was needed.

This specific reset was prompted by an issue with the helm extensions: removal of a chart from the k0s config yaml, instead of uninstalling the chart, put the cluster in an unstartable state. The config was installed into the controller by

```sh
$ sudo k0s install controller --single -c ~/hs-infra/k0s.yaml
```

and `k0s stop` was run before making changes to the config. The only changes to `k0s.yaml` were from … to …

`k0s start` would then error with logs such as … or …

Before running a `k0s reset`, to try to resolve the error and start the cluster successfully, I modified `/var/lib/k0s/helmhome/repositories.yaml` to remove references to the charts, but this didn't work either. Therefore I ran `k0s reset` as I had a few times before, and performed the node reboot as requested. However, on restarting the server, all files in any mounted volumes were deleted: definitely deleted, and not just hidden or moved, as an inspection of available space revealed. Strangely enough, the folder structure within the volumes remained, just as empty directories.

If it helps, here are some manifest snippets!
Persistent Volume manifest example
Persistent Volume Claims manifest example
Volume mount in deployment example