What happened?
I am using an s3 bucket as a volume for my app running in k8s (Deployment, 1 replica, rolling update).
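For context, this is roughly how the bucket is wired into the Deployment; as far as I understand, Datashim exposes the Dataset as a PVC with the same name (all names below are placeholders, not my real manifests):

# minimal sketch of the Deployment using the Datashim-provided PVC (placeholder names)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: survey-service
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: survey-service
  template:
    metadata:
      labels:
        app: survey-service
    spec:
      containers:
        - name: app
          image: my-registry/survey-service:latest   # placeholder image
          volumeMounts:
            - name: s3-data
              mountPath: /data
      volumes:
        - name: s3-data
          persistentVolumeClaim:
            claimName: my-dataset   # PVC created by Datashim for the Dataset (placeholder name)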
When I triggered the deployment of a new revision of my app, the new pod came up and the s3 bucket was attached to it.
However, the old pod failed: it was terminated with exit code 137 (probably a SIGKILL after a short graceful-shutdown window rather than an OOM kill, since I don't see any memory-related issues right now).
The old pod is now stuck in the Terminating state, likely due to a volume problem.
Datashim cannot unmount the volume from the node where the old pod was running. csi-s3 pod (csi-s3 container, DaemonSet) log:
2024-03-13T21:04:45.001866708Z stderr F I0313 21:04:45.001667 1 utils.go:98] GRPC request: {}
2024-03-13T21:04:45.001885266Z stderr F I0313 21:04:45.001715 1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}}]}
2024-03-13T21:05:33.635813571Z stderr F I0313 21:05:33.635664 1 utils.go:97] GRPC call: /csi.v1.Node/NodeGetCapabilities
2024-03-13T21:05:33.635874907Z stderr F I0313 21:05:33.635690 1 utils.go:98] GRPC request: {}
2024-03-13T21:05:33.63589224Z stderr F I0313 21:05:33.635736 1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}}]}
2024-03-13T21:05:37.030960679Z stderr F I0313 21:05:37.027972 1 utils.go:97] GRPC call: /csi.v1.Node/NodeUnpublishVolume
2024-03-13T21:05:37.030981533Z stderr F I0313 21:05:37.027993 1 utils.go:98] GRPC request: {"target_path":"/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount","volume_id":"pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648"}
2024-03-13T21:05:37.035823492Z stderr F I0313 21:05:37.035232 1 util.go:75] Found matching pid 87 on path /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
2024-03-13T21:05:37.035848546Z stderr F I0313 21:05:37.035254 1 mounter.go:80] Found fuse pid 87 of mount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount, checking if it still runs
2024-03-13T21:05:37.035853313Z stderr F I0313 21:05:37.035273 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:37.135710883Z stderr F I0313 21:05:37.135582 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:37.336121403Z stderr F I0313 21:05:37.335983 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:37.636843237Z stderr F I0313 21:05:37.636538 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:38.037307936Z stderr F I0313 21:05:38.037173 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:38.538075749Z stderr F I0313 21:05:38.537942 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:39.13912303Z stderr F I0313 21:05:39.138960 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:39.839917198Z stderr F I0313 21:05:39.839762 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:40.640972203Z stderr F I0313 21:05:40.640842 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:41.542074541Z stderr F I0313 21:05:41.541932 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:42.542234381Z stderr F I0313 21:05:42.542094 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:43.64277079Z stderr F I0313 21:05:43.642634 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:44.843399965Z stderr F I0313 21:05:44.843259 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:46.14365466Z stderr F I0313 21:05:46.143473 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:47.543820527Z stderr F I0313 21:05:47.543692 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:49.044170561Z stderr F I0313 21:05:49.044067 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:50.645274452Z stderr F I0313 21:05:50.645124 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:52.345495246Z stderr F I0313 21:05:52.345336 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:54.145711584Z stderr F I0313 21:05:54.145589 1 util.go:39] Fuse process with PID 87 still active, waiting...
2024-03-13T21:05:56.046229119Z stderr F E0313 21:05:56.046082 1 utils.go:101] GRPC error: rpc error: code = Internal desc = Timeout waiting for PID 87 to end
2024-03-13T21:05:56.625507593Z stderr F I0313 21:05:56.625330 1 utils.go:97] GRPC call: /csi.v1.Node/NodeUnpublishVolume
2024-03-13T21:05:56.625532689Z stderr F I0313 21:05:56.625367 1 utils.go:98] GRPC request: {"target_path":"/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount","volume_id":"pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648"}
2024-03-13T21:05:56.626775Z stderr F E0313 21:05:56.626655 1 utils.go:101] GRPC error: rpc error: code = Internal desc = unmount failed: exit status 1
2024-03-13T21:05:56.626785867Z stderr F Unmounting arguments: /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
2024-03-13T21:05:56.626789088Z stderr F Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
2024-03-13T21:05:57.632955973Z stderr F I0313 21:05:57.632811 1 utils.go:97] GRPC call: /csi.v1.Node/NodeUnpublishVolume
2024-03-13T21:05:57.632982344Z stderr F I0313 21:05:57.632830 1 utils.go:98] GRPC request: {"target_path":"/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount","volume_id":"pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648"}
2024-03-13T21:05:57.634391298Z stderr F E0313 21:05:57.634256 1 utils.go:101] GRPC error: rpc error: code = Internal desc = unmount failed: exit status 1
2024-03-13T21:05:57.634405692Z stderr F Unmounting arguments: /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
2024-03-13T21:05:57.634410456Z stderr F Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
Onwards, the error
Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
repeats indefinitely. Kubelet produces similar logs on the node where the failed pod is.
Even after I forcefully deleted the failed pod, the errors did not disappear.
Pod's description: kubectl get pod survey-service-84cf8d9d49-xhbxq -o yaml
When I go to the node where the failed pod is, there is no active fuse filesystem mounted under /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/.
In the csi-s3 container, I can still find the unfinished goofys process for the /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/ volume.
The pod's directory /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount actually exists on the node, but it is empty:
/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount # ls -al
total 0
drwxr-x--- 2 root root 6 Mar 12 11:19 .
drwxr-x--- 3 root root 40 Mar 12 11:19 ..
It appears that the old volume's filesystem is not mounted, as it is not visible in the output of df -hT -t fuse.
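For reference, the checks above boil down to something like this on the node and inside the csi-s3 container (illustrative commands; the pod UID and the goofys process come from the logs above):

# on the node: look for any fuse mount left under the old pod's directory
mount | grep 36180d66-5fa5-4393-a84d-df95afe5a369
df -hT -t fuse
# inside the csi-s3 container on that node: look for the leftover goofys process (pid 87 in the csi-s3 log above)
ps aux | grep goofys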
I guess that my pod is stuck in the Terminating state because kubelet cannot finish some of its cleanup tasks (maybe admission controllers or the garbage collector are involved), so it leaves the pod in this state. I want to fix that.
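To confirm that it is the volume teardown (and not something else) blocking kubelet, checks along these lines should show it (illustrative; the journalctl step assumes kubelet runs under systemd on the node):

# the stuck pod's events show the repeating unmount/teardown failures
kubectl describe pod survey-service-84cf8d9d49-xhbxq
# check whether any finalizer is still set on the pod object
kubectl get pod survey-service-84cf8d9d49-xhbxq -o jsonpath='{.metadata.finalizers}'
# kubelet logs on the node, filtered by the old pod's UID
journalctl -u kubelet | grep 36180d66-5fa5-4393-a84d-df95afe5a369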
Worth mentioning: if the pod finishes without an error (i.e. it does not fail), then no s3 errors/problems occur.
Thanks in advance.
What did you expect to happen?
Kubelet completely terminates the failed pod and volume management remains healthy.
Kubernetes version
Cloud provider
AWS EKS 1.27
OS version
csi-attacher-s3 - image registry.k8s.io/sig-storage/csi-attacher:v3.3.0
csi-provisioner-s3 - image registry.k8s.io/sig-storage/csi-provisioner:v2.2.2
csi-s3 - image quay.io/datashim-io/csi-s3:0.3.0
driver-registrar - image registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.3.0
dataset-operator - image quay.io/datashim-io/dataset-operator:0.3.0
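The component images above were collected with something like the following (the dlf namespace is an assumption about where Datashim is installed in my cluster):

kubectl get pods -n dlf -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'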
@Artebomba thanks for the detailed report! It seems that you have run into the same problem as #335. What we have been able to find out is that this may be caused by a change in volume attachment introduced in K8s 1.27 and we need to update csi-s3 to reflect this.
We are working on this issue and hope to have an update soon.