This repository has been archived by the owner on Nov 7, 2019. It is now read-only.

Make "destroy-build-slave" job more resilient when the clone step fails #20

Open
prakashsurya opened this issue Oct 31, 2015 · 0 comments

If the clone-build-slave job fails, the destroy-build-slave job is likely to fail as well, potentially leaking VMs. For example, see: http://psurya-jenkins.dcenter.delphix.com:8080/job/destroy-build-slave/55/console

TASK: [destroy-dc-instance | Unregister DCenter instance] ********************* 
<dcenter.delphix.com> ESTABLISH CONNECTION FOR USER: blackbox
<dcenter.delphix.com> REMOTE_MODULE command /opt/dcenter/bin/dc unregister --expires 2 openzfs-9a98bee0-A #USE_SHELL
<dcenter.delphix.com> EXEC sshpass -d11 ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/var/lib/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r" -o StrictHostKeyChecking=no -o GSSAPIAuthentication=no -o PubkeyAuthentication=no -o User=blackbox -o ConnectTimeout=10 dcenter.delphix.com /bin/sh -c 'mkdir -p /tmp/ansible-tmp-1446188082.15-224873084695844 && chmod a+rx /tmp/ansible-tmp-1446188082.15-224873084695844 && echo /tmp/ansible-tmp-1446188082.15-224873084695844'
<dcenter.delphix.com> PUT /tmp/tmpDvp7O2 TO /tmp/ansible-tmp-1446188082.15-224873084695844/command
<dcenter.delphix.com> EXEC sshpass -d11 ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/var/lib/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r" -o StrictHostKeyChecking=no -o GSSAPIAuthentication=no -o PubkeyAuthentication=no -o User=blackbox -o ConnectTimeout=10 dcenter.delphix.com /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /tmp/ansible-tmp-1446188082.15-224873084695844/command; rm -rf /tmp/ansible-tmp-1446188082.15-224873084695844/ >/dev/null 2>&1'
failed: [dcenter.delphix.com] => {"changed": true, "cmd": "/opt/dcenter/bin/dc unregister --expires 2 openzfs-9a98bee0-A", "delta": "0:00:00.348058", "end": "2015-10-30 06:54:46.851365", "rc": 1, "start": "2015-10-30 06:54:46.503307", "warnings": []}
stderr: there is no VM named 'openzfs-9a98bee0-A'

FATAL: all hosts have already failed -- aborting

In this case, the VM openzfs-9a98bee0 still existed, but it was never destroyed because the failed unregister of openzfs-9a98bee0-A aborted the job first. The job should ignore the case where a VM does not exist, so that any remaining VMs can still be processed.
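
One possible way to tolerate a missing VM is sketched below as an Ansible task. It is only an illustration, not the actual contents of the destroy-dc-instance role: the variable names `instance_name` and `unregister_result` are hypothetical, and the check relies on the stderr text shown in the log above staying the same.

```yaml
# Sketch only: `instance_name` and `unregister_result` are assumed names,
# not taken from the real role.
- name: Unregister DCenter instance
  shell: /opt/dcenter/bin/dc unregister --expires 2 {{ instance_name }}
  register: unregister_result
  # Fail only on a non-zero exit that is NOT the "VM does not exist" case,
  # so a missing VM no longer aborts the rest of the play.
  failed_when: >
    unregister_result.rc != 0 and
    "there is no VM named" not in unregister_result.stderr
```

Matching on the specific error message, rather than setting `ignore_errors: yes`, keeps other unregister failures (for example, connection problems) visible.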
