Environment wiped out during concurrent builds #6167
Comments
Oh and if it happens to you, […]
The problem was also amplified by a failing […]
Thanks for this research, I was already wondering what caused all my services in a namespace to vanish randomly when it was particularly busy. Definitely not an option for production workloads ;) I don't think this will ever be fully fixable by reordering the build steps in two concurrent pipelines, checking things, or changing conditions: that would be quite a burden to maintain, and these might need to change in ways we cannot fully predict. I think we have two options here:
I'd say option 1, and then 2 later where it is safe to do so, so you do not wait forever when a pipeline gets stuck.
The proposed solution to fix this issue would be to have a lock created as a ConfigMap when the […]. Any concurrent upgrade attempted while the lock is active will fail immediately. This is a more robust solution than retrying or waiting on a timer, because a pipeline that failed due to a concurrency issue can be re-triggered at any time. The same is valid when the […]
relates to #5471
It will sure make everything much safer: only one […]. In the future, a "wait for a few minutes if I'm the latest pipeline" would however make things smoother.
Hi @aure-olli, have you experienced any further occurrences of the issue? Given the effort to address the matter is quite involved, it would be good to help us understand the urgency based on the frequency that you experience the problem.
Hi @deanesmith. We have temporarily solved our problem with a custom […]. The problem was much amplified by a […]
I must say that I like your solution of locking the whole […]
And then delete the configmap once […]. Kubernetes deals atomically with object creation and updates (as long as you provide the current resource version), so there is no concurrency problem. I wouldn't mind implementing it myself. Of course there are missing details that will make it less easy, but I think this is globally straightforward.
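To make the "atomic create and update with the current resource version" point concrete, here is a minimal sketch of those semantics. It uses a hypothetical in-memory `FakeStore` and not the Kubernetes client; the names `FakeStore`, `AlreadyExists`, `Conflict` and the `owner`/`next` data keys are assumptions made up for the illustration.

```python
class AlreadyExists(Exception):
    """Raised when create() targets a name that is already taken."""


class Conflict(Exception):
    """Raised when update() is given a stale resource version."""


class FakeStore:
    """Hypothetical stand-in for an API server's object storage."""

    def __init__(self):
        self.objects = {}  # name -> (data, resource_version)
        self.version = 0   # monotonically increasing counter

    def create(self, name, data):
        # Create only succeeds if nothing with that name exists yet.
        if name in self.objects:
            raise AlreadyExists(name)
        self.version += 1
        self.objects[name] = (dict(data), self.version)
        return self.version

    def get(self, name):
        return self.objects[name]

    def update(self, name, data, resource_version):
        # Update only succeeds if the caller saw the latest version.
        _, current = self.objects[name]
        if current != resource_version:
            raise Conflict(name)
        self.version += 1
        self.objects[name] = (dict(data), self.version)
        return self.version


store = FakeStore()
store.create("jx-lock-staging", {"owner": "build-78"})

# A second creator loses: the lock already exists.
try:
    store.create("jx-lock-staging", {"owner": "build-79"})
    second_create = "created"
except AlreadyExists:
    second_create = "already exists"

# Two waiters read the same version; only the first update goes through.
_, seen = store.get("jx-lock-staging")
store.update("jx-lock-staging", {"owner": "build-78", "next": "build-80"}, seen)
try:
    store.update("jx-lock-staging", {"owner": "build-78", "next": "build-79"}, seen)
    stale_update = "updated"
except Conflict:
    stale_update = "conflict"

final_data, _ = store.get("jx-lock-staging")
```

The point of the sketch is only that both operations either fully succeed or are rejected, so two builds can never both believe they hold or queued for the lock.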
@aure-olli, we are handling other priority matters right now, do you want to create a PR for your proposal? Seems reasonable.
This PR prevents builds from editing the same namespace at the same time. Such behavior could lead to a namespace being accidentally wiped out.

Only `jx step helm apply` is locked. A ConfigMap `jx-lock-{namespace}` is used as a lock. No other build can run while this ConfigMap exists. Waiting builds can edit the data of the ConfigMap in order to be the next one to run. If a build sees that the locking or waiting build is "higher", it will fail. When the build finishes, the ConfigMap is removed. A waiting build can also remove the ConfigMap if the locking pod has finished.

The algorithm is approximately:
```
Label: CREATE
    try to create the ConfigMap
    if it succeeds:
        return
Label: READ
    get the ConfigMap and the locking Pod
    if the locking Pod has finished
        remove the ConfigMap
        goto CREATE
    if the ConfigMap references a "higher" build
        fail
    if the ConfigMap references a "lower" build
        update the ConfigMap
    wait for the ConfigMap or the Pod to be updated
    if the ConfigMap is deleted
        goto CREATE
    if the ConfigMap references a different build
        goto READ
    if the Pod has finished
        goto CREATE
```
fix jenkins-x#6167

Signed-off-by: Aurélien Lambert <[email protected]>
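The CREATE/READ loop above can be sketched in a few lines of Python. This is a single-threaded illustration under stated assumptions, not the code of the PR: the ConfigMap is a plain dict entry with hypothetical `pod` and `build` keys, "higher" is assumed to mean a larger build number, and the blocking "wait" step is replaced by returning `"waiting"` so the caller can call again once something changed.

```python
def lock_step(configmaps, finished_pods, namespace, build, pod):
    """One pass of the CREATE/READ steps for `build` running in `pod`.

    Returns "acquired", "failed" or "waiting".
    """
    name = f"jx-lock-{namespace}"
    while True:
        # CREATE: take the lock if nobody holds it.
        if name not in configmaps:
            configmaps[name] = {"pod": pod, "build": build}
            return "acquired"
        # READ: look at the lock and at the Pod that holds it.
        lock = configmaps[name]
        if lock["pod"] in finished_pods:
            del configmaps[name]  # holder is gone: clean up, go to CREATE
            continue
        if lock["build"] > build:
            return "failed"       # a "higher" build holds or awaits the lock
        if lock["build"] < build:
            lock["build"] = build  # register as the next build to run
        return "waiting"           # stand-in for the blocking wait


configmaps, finished_pods = {}, set()
r78 = lock_step(configmaps, finished_pods, "staging", 78, "pod-78")
r80 = lock_step(configmaps, finished_pods, "staging", 80, "pod-80")
r79 = lock_step(configmaps, finished_pods, "staging", 79, "pod-79")

# The pod of build 78 ends without removing the ConfigMap itself.
finished_pods.add("pod-78")
r80_again = lock_step(configmaps, finished_pods, "staging", 80, "pod-80")
```

In this run build 78 acquires the lock, build 80 registers as next, build 79 fails because a higher build is already waiting, and build 80 takes over once the locking pod has finished. In a real cluster the create and update would have to be the atomic API operations, which the dict operations here only imitate.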
I made a PR to fix the problem, using the algorithm we talked about: #6953. This is not safe to merge yet, but seems to work decently in my first tests. Could you please let me know what you think about it?
Summary
This is a problem that kept me busy for several days. When several environment build pipelines run at roughly the same time (because several application master branches have been updated at the same time), the environment can be totally wiped out. This includes the deployments, the services, the ingresses, ... everything managed by the charts.
After adding outputs to jx (to see the output of `kubectl apply` and `kubectl delete`), the cause is clear:
1. `kubectl apply` runs and sets everything to version 61.
2. `kubectl apply` runs and sets everything to version 62.
3. `kubectl delete` runs and deletes everything that does not have version 61, meaning at this point literally everything.
4. `kubectl delete` runs again, but there's nothing left anyway.
Both staging and production can be impacted. In production, one can be more careful not to promote applications too fast.
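The interleaving can be replayed with a tiny model. This is only a sketch: the cluster is a dict from resource name to version label, and `apply` and `delete_not_equal` are hypothetical helpers standing in for the apply and cleanup steps, not jx or kubectl code.

```python
RESOURCES = ("deployment", "service", "ingress")


def apply(cluster, version):
    """Model of the apply step: (re)label every chart resource."""
    for name in RESOURCES:
        cluster[name] = version


def delete_not_equal(cluster, version):
    """Model of the cleanup step: drop everything with another version."""
    for name in [n for n, v in cluster.items() if v != version]:
        del cluster[name]


# Concurrent builds: both applies happen before either cleanup.
racy = {}
apply(racy, 61)
apply(racy, 62)
delete_not_equal(racy, 61)  # everything is labelled 62, so all of it goes
after_first_delete = dict(racy)
delete_not_equal(racy, 62)  # nothing left to delete

# Same two builds, one after the other.
serial = {}
apply(serial, 61)
delete_not_equal(serial, 61)
apply(serial, 62)
delete_not_equal(serial, 62)
```

In the racy order the model ends up empty after the first cleanup, while the serial order keeps all three resources at version 62.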
Solutions
This is a synchronization problem.
The condition `jenkins.io/chart-release=jx,jenkins.io/version!=78` seems quite violent; `jenkins.io/chart-release=jx,jenkins.io/version<78` could already be much safer.
I think that if `jx step helm apply` is cut into two steps (`jx step helm apply` and `jx step helm clean`), and before each of these steps a check verifies that the pipeline is the latest version, most of the problems could be avoided.
But an unlikely order of events could still be a problem.
Steps to reproduce the behavior
Just run many pipelines
Wait and pray to see it
Expected behavior
Pipelines should happen one by one, and everybody should be happy.
Actual behavior
The environment gets totally wiped out, and everybody is sad.
Jx version
The current head of `jx` (d2b7a71), customized to display `kubectl apply` and `kubectl output`.
The output of `jx version` is:
Jenkins type
Kubernetes cluster
Standard `jx create cluster gke` and `jx boot` process.
Operating system / Environment
More
Here is what I could see while running 5 builds (from `#77` to `#81`). I've lost the `#77` logs, but there was nothing specific to notice. `#78` clearly wiped out everything (apparently with the kind help of `#80`).
xxx/environment-xxx-staging/master #78 promotion
xxx/environment-xxx-staging/master #79 promotion
xxx/environment-xxx-staging/master #80 promotion
xxx/environment-xxx-staging/master #81 promotion