
Environment wiped out during concurrent builds #6167

Closed
aure-olli opened this issue Nov 22, 2019 · 12 comments · Fixed by #6953

@aure-olli

Summary

This is a problem that kept me busy for several days. When several environment build pipelines run roughly at the same time (because several applications' master branches were updated at the same time), the environment can be totally wiped out. That includes the deployments, the services, the ingresses, ... everything managed by the charts.

After adding output logging to jx (to see the output of kubectl apply and kubectl delete), the cause is clear:

  • the first build runs kubectl apply and sets everything to version 61
  • the second build runs kubectl apply and sets everything to version 62
  • the first build runs kubectl delete and deletes everything that does not have version 61, which at this point is literally everything.
  • the second build runs kubectl delete, but there's nothing left anyway.

Both staging and production can be impacted. In production, one can be more careful not to promote applications too fast.

Solutions

This is a synchronization problem.

The selector jenkins.io/chart-release=jx,jenkins.io/version!=78 seems quite violent; jenkins.io/chart-release=jx,jenkins.io/version<78 would already be much safer.

I think that if jx step helm apply were cut into two steps (jx step helm apply and jx step helm clean), and a check before each of them verified that the pipeline is still the latest version, most of the problems could be avoided.

But an unlikely ordering of events could still be a problem.

Steps to reproduce the behavior

Just run many pipelines

jx start pipeline my-repo/my-project-1/master
jx start pipeline my-repo/my-project-2/master
jx start pipeline my-repo/my-project-3/master
jx start pipeline my-repo/my-project-4/master
jx start pipeline my-repo/my-project-5/master

Wait and pray to see it

watch kubectl get all -n jx-staging

Expected behavior

Pipelines should happen one by one, and everybody should be happy.

Actual behavior

The environment gets totally wiped out, and everybody is sad.

Jx version

The current head of jx (d2b7a71), customized to display kubectl apply and kubectl delete output

The output of jx version is:

NAME               VERSION
jx                 2.0.998-dev+46a9e4e6e
Kubernetes cluster v1.14.8-gke.12
kubectl            v1.14.7
helm client        Client: v2.14.3+g0e7f3b6
git                2.17.1
Operating System   Ubuntu 18.04.3 LTS

Jenkins type

  • Serverless Jenkins X Pipelines (Tekton + Prow)
  • Classic Jenkins

Kubernetes cluster

Standard jx create cluster gke jx boot process.

Operating system / Environment

Ubuntu 18.04.3 LTS

More

Here is what I could see while running 5 builds (from #77 to #81). I've lost the #77 logs, but there was nothing specific to notice. #78 clearly wiped out everything (apparently with the kind help of #80).

xxx/environment-xxx-staging/master #78 promotion

kubectl apply --recursive -f /tmp/helm-template-workdir-948611119/jx/output/namespaces/jx-staging -l jenkins.io/chart-release=jx --namespace jx-staging --wait --validate=false
==========
deployment.extensions/jx-api-poi-v2 configured
release.jenkins.io/api-poi-v2-0.0.20 created
service/api-poi-v2 configured
role.rbac.authorization.k8s.io/cleanup configured
rolebinding.rbac.authorization.k8s.io/cleanup configured
serviceaccount/cleanup configured
configmap/exposecontroller configured
role.rbac.authorization.k8s.io/expose configured
rolebinding.rbac.authorization.k8s.io/expose configured
serviceaccount/expose configured
deployment.extensions/jx-olli-log-management configured
ingress.extensions/jx-olli-log-management configured
issuer.certmanager.k8s.io/letsencrypt-prod configured
release.jenkins.io/olli-log-management-0.0.75 configured
service/olli-log-management configured
configmap/skills-server-redis configured
service/skills-server-redis-headless configured
configmap/skills-server-redis-health configured
statefulset.apps/skills-server-redis-master configured
service/skills-server-redis-master configured
deployment.extensions/jx-skills-server-x configured
release.jenkins.io/skills-server-x-0.0.37 created
service/skills-server-x configured
deployment.extensions/jx-skills-vue-x configured
ingress.extensions/jx-skills-vue-x configured
release.jenkins.io/skills-vue-x-0.0.59 created
service/skills-vue-x configured
deployment.extensions/jx-testx configured
release.jenkins.io/testx-0.0.7 created
service/testx configured
==========


kubectl delete all --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
service "skills-server-redis-headless" deleted
service "skills-server-redis-master" deleted
service "skills-vue-x" deleted
deployment.apps "jx-api-poi-v2" deleted
deployment.apps "jx-olli-log-management" deleted
deployment.apps "jx-skills-server-x" deleted
deployment.apps "jx-skills-vue-x" deleted
deployment.apps "jx-testx" deleted
release.jenkins.io "olli-log-management-0.0.75" deleted
==========
kubectl delete pvc --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete configmap --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
configmap "exposecontroller" deleted
configmap "skills-server-redis" deleted
configmap "skills-server-redis-health" deleted
==========
kubectl delete release --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete sa --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
serviceaccount "cleanup" deleted
serviceaccount "expose" deleted
==========
kubectl delete role --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
role.rbac.authorization.k8s.io "cleanup" deleted
role.rbac.authorization.k8s.io "expose" deleted
==========
kubectl delete rolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
rolebinding.rbac.authorization.k8s.io "cleanup" deleted
rolebinding.rbac.authorization.k8s.io "expose" deleted
==========
kubectl delete secret --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrole --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
error: upgrading helm chart '.': failed to run 'kubectl delete -f /tmp/helm-template-workdir-948611119/jx/helmHooks/env/charts/expose/templates/job.yaml --namespace jx-staging --wait' command in directory '/tmp/jx-helm-apply-142711026/env', output: 'Error from server (NotFound): error when deleting "/tmp/helm-template-workdir-948611119/jx/helmHooks/env/charts/expose/templates/job.yaml": jobs.batch "expose" not found'

xxx/environment-xxx-staging/master #79 promotion

Showing logs for build xxx/environment-xxx-staging/master #79 promotion stage meta-pipeline and container step-create-tekton-crds                                                                                                                              
? A local Jenkins X versions repository already exists, pulling the latest: Yes
running command: jx step next-version --use-git-tag-only --tag
created new version: 0.0.73 and written to file: ./VERSION
error: Have you set up a git credential helper? See https://help.github.com/articles/caching-your-github-password-in-git/
: git output: To https://github.com/xxx/environment-x-staging.git
 ! [rejected]        v0.0.73 -> v0.0.73 (already exists)
error: failed to push some refs to 'https://github.com/xxx/environment-xxx-staging.git'
hint: Updates were rejected because the tag already exists in the remote.: failed to run 'git push origin v0.0.73' command in directory '', output: 'To https://github.com/xxx/environment-xxx-staging.git
 ! [rejected]        v0.0.73 -> v0.0.73 (already exists)
error: failed to push some refs to 'https://github.com/xxx/environment-xxx-staging.git'
hint: Updates were rejected because the tag already exists in the remote.'
error: failed to set the version on release pipelines: failed to run '/bin/sh -c jx step next-version --use-git-tag-only --tag' command in directory '/workspace/source', output: ''

Pipeline failed on stage 'meta-pipeline' : container 'step-create-tekton-crds'. The execution of the pipeline has stopped.

xxx/environment-xxx-staging/master #80 promotion

kubectl apply --recursive -f /tmp/helm-template-workdir-154755877/jx/output/namespaces/jx-staging -l jenkins.io/chart-release=jx --namespace jx-staging --wait --validate=false
==========
deployment.extensions/jx-api-poi-v2 configured
release.jenkins.io/api-poi-v2-0.0.20 configured
service/api-poi-v2 configured
role.rbac.authorization.k8s.io/cleanup configured
rolebinding.rbac.authorization.k8s.io/cleanup configured
serviceaccount/cleanup configured
configmap/exposecontroller configured
role.rbac.authorization.k8s.io/expose configured
rolebinding.rbac.authorization.k8s.io/expose configured
serviceaccount/expose configured
deployment.extensions/jx-olli-log-management configured
ingress.extensions/jx-olli-log-management configured
issuer.certmanager.k8s.io/letsencrypt-prod configured
release.jenkins.io/olli-log-management-0.0.75 configured
service/olli-log-management configured
configmap/skills-server-redis configured
service/skills-server-redis-headless configured
configmap/skills-server-redis-health configured
statefulset.apps/skills-server-redis-master configured
service/skills-server-redis-master configured
deployment.extensions/jx-skills-server-x configured
release.jenkins.io/skills-server-x-0.0.37 configured
service/skills-server-x configured
deployment.extensions/jx-skills-vue-x configured
ingress.extensions/jx-skills-vue-x configured
release.jenkins.io/skills-vue-x-0.0.59 configured
service/skills-vue-x configured
deployment.extensions/jx-testx configured
release.jenkins.io/testx-0.0.7 configured
service/testx configured
==========


kubectl delete all --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
service "api-poi-v2" deleted
service "olli-log-management" deleted
service "skills-server-x" deleted
service "testx" deleted
statefulset.apps "skills-server-redis-master" deleted
release.jenkins.io "api-poi-v2-0.0.20" deleted
release.jenkins.io "skills-server-x-0.0.37" deleted
release.jenkins.io "skills-vue-x-0.0.59" deleted
release.jenkins.io "testx-0.0.7" deleted
==========
kubectl delete pvc --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete configmap --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete release --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete sa --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete role --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete rolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete secret --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrole --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========

xxx/environment-xxx-staging/master #81 promotion

kubectl apply --recursive -f /tmp/helm-template-workdir-707178113/jx/output/namespaces/jx-staging -l jenkins.io/chart-release=jx --namespace jx-staging --wait --validate=false
==========
deployment.extensions/jx-api-poi-v2 configured
release.jenkins.io/api-poi-v2-0.0.20 configured
service/api-poi-v2 configured
role.rbac.authorization.k8s.io/cleanup configured
rolebinding.rbac.authorization.k8s.io/cleanup configured
serviceaccount/cleanup configured
configmap/exposecontroller configured
role.rbac.authorization.k8s.io/expose configured
rolebinding.rbac.authorization.k8s.io/expose configured
serviceaccount/expose configured
deployment.extensions/jx-olli-log-management configured
ingress.extensions/jx-olli-log-management configured
issuer.certmanager.k8s.io/letsencrypt-prod configured
release.jenkins.io/olli-log-management-0.0.75 configured
service/olli-log-management configured
configmap/skills-server-redis configured
service/skills-server-redis-headless configured
configmap/skills-server-redis-health configured
statefulset.apps/skills-server-redis-master configured
service/skills-server-redis-master configured
deployment.extensions/jx-skills-server-x configured
release.jenkins.io/skills-server-x-0.0.37 configured
service/skills-server-x configured
deployment.extensions/jx-skills-vue-x configured
ingress.extensions/jx-skills-vue-x configured
release.jenkins.io/skills-vue-x-0.0.59 configured
service/skills-vue-x configured
deployment.extensions/jx-testx configured
release.jenkins.io/testx-0.0.7 configured
service/testx configured
==========


kubectl delete all --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
release.jenkins.io "api-poi-v2-0.0.19" deleted
release.jenkins.io "skills-server-x-0.0.36" deleted
release.jenkins.io "skills-vue-x-0.0.57" deleted
release.jenkins.io "testx-0.0.6" deleted
==========
kubectl delete pvc --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete configmap --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete release --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete sa --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete role --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete rolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete secret --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrole --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
error: upgrading helm chart '.': failed to run 'kubectl delete -f /tmp/helm-template-workdir-707178113/jx/helmHooks/env/charts/expose/templates/job.yaml --namespace jx-staging --wait' command in directory '/tmp/jx-helm-apply-934058140/env', output: 'Error from server (NotFound): error when deleting "/tmp/helm-template-workdir-707178113/jx/helmHooks/env/charts/expose/templates/job.yaml": jobs.batch "expose" not found'
@aure-olli
Author

Oh and if it happens to you, jx start pipeline xxx/environment-xxx-staging/master (or any action that eventually starts the pipeline) will fix the mess.

@aure-olli
Author

The problem was also amplified by a failing post-upgrade job. It seems the pipeline would wait for it to finally fail after several restarts. Together, that greatly increased the likelihood of the wipe-out, to the point that it happened several times a day.

@hferentschik hferentschik changed the title CRITICAL: environment totally wiped out during concurrent builds Environment wiped out during concurrent builds Nov 22, 2019
@hferentschik hferentschik added kind/bug Issue is a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/environment labels Nov 22, 2019
@deanesmith deanesmith added this to the Sprint 20 milestone Nov 26, 2019
@sfynx

sfynx commented Nov 29, 2019

Thanks for this research; I was already wondering what caused all my services in a namespace to vanish randomly when it was particularly busy. Definitely not an option for production workloads ;)

I don't think this will ever be fully fixable by reordering the build steps of two concurrent pipelines, adding checks, or changing conditions, because that would be quite a burden to maintain; these steps might need to change in ways we cannot fully predict.

I think we have two options here:

  1. Wait for the previous one to complete before starting the next one. It could take a lot of time, though, and it might be unnecessary to wait if a newer build will do the same thing anyway. Still, it is the safest option.

  2. Kill the running one first, which should be fine if things are meant to be idempotent. I see this regularly with other build systems where a build is restarted immediately once new information comes in, rendering the previous build obsolete. A side effect could be missing or inconsistent tags or artifacts, though, when things get killed before creating them. Not sure how bad this would be for the environment pipelines; these are meant to replace all outdated services on each run anyway.

I'd say option 1, and then 2 later where it is safe to do so, so you do not wait forever when a pipeline gets stuck.

@ccojocar
Contributor

ccojocar commented Jan 13, 2020

The proposed solution to this issue would be to create a lock as a ConfigMap when the step helm apply begins to upgrade a chart release. The lock would be created per release, in the namespace where the release is deployed, and removed automatically at the end of the upgrade process. This ensures that every Helm chart release is applied atomically.

Any concurrent upgrade attempted while the lock is active will fail immediately. This is more robust than retrying or waiting on a timer, because a pipeline that failed due to a concurrency issue can be re-triggered at any time. The same holds when the step helm apply is executed manually.
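A minimal sketch of what such a ConfigMap lock could look like with client-go (using the pre-0.18 signatures jx vendored at the time); the lock name jx-lock-<release>, the build data key, and the jenkins.io/kind label are illustrative only, not the actual implementation:

```go
package helmlock

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// AcquireLock tries to create the lock ConfigMap. Creation is atomic on the
// API server, so at most one build can hold the lock for a given release.
// Names and labels below are illustrative, not the actual jx implementation.
func AcquireLock(client kubernetes.Interface, ns, release, build string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:   fmt.Sprintf("jx-lock-%s", release),
			Labels: map[string]string{"jenkins.io/kind": "lock"}, // illustrative label
		},
		Data: map[string]string{"build": build},
	}
	if _, err := client.CoreV1().ConfigMaps(ns).Create(cm); err != nil {
		if errors.IsAlreadyExists(err) {
			return fmt.Errorf("release %s is locked by another build, failing immediately", release)
		}
		return err
	}
	return nil
}

// ReleaseLock removes the lock at the end of the upgrade, successful or not.
func ReleaseLock(client kubernetes.Interface, ns, release string) error {
	err := client.CoreV1().ConfigMaps(ns).Delete(fmt.Sprintf("jx-lock-%s", release), &metav1.DeleteOptions{})
	if errors.IsNotFound(err) {
		return nil
	}
	return err
}
```

Failing fast when the ConfigMap already exists matches the proposal above: a build that lost the race can simply be re-triggered later.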

@jstrachan
Member

relates to #5471

@aure-olli
Author

It will surely make everything much safer: only one helm apply at a time in each namespace.

In the future, however, a "wait a few minutes if I'm the latest pipeline" behavior would make things smoother.

@ccojocar ccojocar removed their assignment Jan 31, 2020
@deanesmith
Contributor

Hi @aure-olli, have you experienced any further occurrences of the issue? Given that the effort to address the matter is quite involved, it would be good to help us understand the urgency based on how frequently you experience the problem.

@deanesmith deanesmith added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Feb 13, 2020
@aure-olli
Author

Hi @deanesmith

We have temporarily solved our problem with a custom jx build by changing the selector used to clean the environment, as suggested in my first post: see the diff

The problem was much amplified by a pre-install hook that failed and restarted over and over, so the whole process would take much longer than it should. Under those conditions the environment would be wiped out several times a day! In my opinion this is a critically serious matter because:

  • pre-install hooks may also fail due to bad luck (imagine the database you are trying to initialize is temporarily offline)
  • This can impact even the production environment
  • This may happen even without the hook, just because many people working together accidentally start the same pipeline roughly at the same time

I must say that I like your solution of locking the whole kubectl apply process. However, pardon my ignorance, but I'm quite surprised it takes so long to implement. The way I would personally implement it (in step helm apply) is:

while true:
  try:
    create a jx-lock-<namespace> configmap with
      - owner reference to the pipeline
      - the pipeline build number
      - an empty "next" field
    break
  except already exists:
    get the configmap jx-lock-<namespace>
    if the pipeline in the owner reference is finished:
      try:
        delete the received version of the configmap
      except: pass
      continue
    if "next" is from a pipeline with higher build number:
      fail
    try:
      update the "next" field of the configmap with our own pipeline
    except: continue
    watch the configmap and the pipeline:
      if the configmap has changed or was deleted:
        continue
      if the pipeline status has changed or was deleted:
        continue

And then delete the ConfigMap once kubectl apply and kubectl delete have finished, successfully or not.

Kubernetes handles object creation and updates atomically (as long as you provide the current resource version), so there is no concurrency problem.

I wouldn't mind implementing it myself. Of course there are missing details that will make it less easy, but I think it is overall straightforward.
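To make the atomicity argument concrete: with client-go, an Update carries the resourceVersion of the object that was read, so a concurrent change is rejected with a Conflict. A rough sketch of the "update the next field" step, under the same assumptions as the sketch above (pre-0.18 client-go signatures; the function name and the next data key follow the pseudocode, not the final PR):

```go
package helmlock

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/kubernetes"
)

// claimNext is a hypothetical helper following the pseudocode above. It tries
// to register this build as the next one to run by writing the "next" entry of
// the lock ConfigMap. cm must be the object exactly as it was last read: its
// resourceVersion acts as an optimistic-concurrency token, so if another build
// modified or deleted the lock in the meantime the API server rejects the
// update with a Conflict and the caller re-reads the ConfigMap and retries.
func claimNext(client kubernetes.Interface, ns string, cm *corev1.ConfigMap, build string) (bool, error) {
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data["next"] = build
	if _, err := client.CoreV1().ConfigMaps(ns).Update(cm); err != nil {
		if errors.IsConflict(err) {
			return false, nil // lost the race: go back to reading the ConfigMap
		}
		return false, err
	}
	return true, nil
}
```

Deleting a stale lock can follow the same pattern by passing delete preconditions on the UID of the ConfigMap that was read, so a lock re-created by another build is never removed by mistake.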

@deanesmith deanesmith added priority/critical and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Feb 18, 2020
@deanesmith
Contributor

@aure-olli, we are handling other priority matters right now, do you want to create a PR for your proposal? Seems reasonable.

@deanesmith deanesmith removed this from the Sprint 3 2020 milestone Feb 26, 2020
aure-olli added a commit to olli-ai/jx that referenced this issue Mar 23, 2020
This PR prevents builds from editing the same namespace at the same time. Such behavior could lead to a namespace being accidentally wiped out.

Only `jx step helm apply` is locked.

A ConfigMap "jx-lock-{namespace}" is used as a lock. No other build can run while this configmap exists. Waiting builds can edit the data of the ConfigMap in order to be the next one to run. If a build sees that the locking or waiting build is "higher", it will fail. When the build finished, the ConfigMap is removed. A waiting build can also remove the ConfigMap if the lokcing pod has finished.

The algorithm is approximately:
```
    Label: CREATE
    try to create the ConfigMap
    if it succeeds:
        return
    Label: READ
    get the ConfigMap and the locking Pod
    if the locking Pod has finished
        remove the ConfigMap
        goto CREATE
    if the ConfigMap references a "higher" build
        fail
    if the ConfigMap references a "lower" build
        update the ConfigMap
    wait for the ConfigMap or the Pod to be updated
        if the ConfigMap is deleted
            goto CREATE
        if the ConfigMap references a different build
            goto READ
        if the Pod has finished
            goto CREATE
```

fix jenkins-x#6167

Signed-off-by: Aurélien Lambert <[email protected]>
@aure-olli
Author

I made a PR to fix the problem, using the algorithm we talked about: #6953. It is not safe to merge yet, but it seems to work decently in my first tests.

Can you please let me know what you think about it?

jenkins-x-bot pushed a commit that referenced this issue Apr 3, 2020