
VPC: Win2012 unstable when booting VM / running playbook #2056

Closed
Willsparker opened this issue Mar 19, 2021 · 9 comments · Fixed by #2112

@Willsparker
Contributor

Logs:
https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/OS=Win2012,label=vagrant/1114/console
https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/OS=Win2012,label=vagrant/1112/console
https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/OS=Win2012,label=vagrant/1097/console

The Windows VPC seems to be 'weird'. It will randomly disconnect from the VM halfway through a task (e.g. from 1112):

18:22:25 TASK [NTP_TIME : NTP - Service w32time restart] ********************************
18:24:10 An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ReadTimeout: HTTPSConnectionPool(host='127.0.0.1', port=2201): Read timed out. (read timeout=30)
18:24:10 fatal: [127.0.0.1]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}

Occasionally, it won't even get to starting the playbook (e.g. from 1097):

15:46:05     adoptopenjdkW2012: WinRM transport: negotiate
15:51:05 Timed out while waiting for the machine to boot. This means that
15:51:05 Vagrant was unable to communicate with the guest machine within
15:51:05 the configured ("config.vm.boot_timeout" value) time period.
15:51:05 
15:51:05 If you look above, you should be able to see the error(s) that
15:51:05 Vagrant had when attempting to connect to the machine. These errors
15:51:05 are usually good hints as to what may be wrong.
15:51:05 
15:51:05 If you're using a custom box, make sure that networking is properly
15:51:05 working and you're able to connect to the machine. It is a common
15:51:05 problem that networking isn't setup properly in these boxes.
15:51:05 Verify that authentication configurations are also setup properly,
15:51:05 as well.
15:51:05 
15:51:05 If the box appears to be booting properly, you may want to increase
15:51:05 the timeout ("config.vm.boot_timeout") value.
15:51:06 Build step 'Execute shell' marked build as failure

I am assuming this is due to the VM shutting down after a while of execution, but I don't know for sure. I have had similar things happen on my local machine too.
For the timeout issue, I'd first suggest increasing the boot timeout with config.vm.boot_timeout = 600, which bumps it up from the default of 5 minutes to 10. If it still times out after that, we can assume something is causing it to hang.
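As a sketch, that suggestion is a one-line setting in the Vagrantfile (illustrative fragment only; the real Vagrantfile sets plenty of other options around it):

```ruby
# Raise the VM boot timeout from the 300-second default to 10 minutes.
Vagrant.configure("2") do |config|
  config.vm.boot_timeout = 600
end
```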

@Willsparker Willsparker changed the title VPC: Win2012 unstable when booting VM VPC: Win2012 unstable when booting VM / running playbook Mar 19, 2021
@Willsparker Willsparker self-assigned this Mar 24, 2021
@Willsparker
Contributor Author

The Vagrant VM consistently says Windows Server 2012R2 Standard Evaluation Licence is expired when booting. Apparently (according to this website), when this is the case, it can cause unexpected issues, such as the machine shutting down randomly. So, I'm going to add the rearm command (slmgr.vbs /rearm) to the Vagrantfile, to see if that stops the random connection drops.

@aahlenst
Contributor

@Willsparker Why don't we fix it for real, for example by regenerating the image regularly or purchasing a license? You cannot rearm it forever.

@Willsparker
Contributor Author

AFAIK, we could rearm it forever. When vagrant up is run, the Vagrant box image is copied, and that copy is what's changed whenever anything is done to the VM. So, rearming it via the VM would presumably not affect the number of rearms we have left on the original image (..? not 100% sure on how evaluation licenses work).

@Willsparker
Contributor Author

I've been able to automatically rearm the evaluation license using
vagrant winrm --shell -c "slmgr.vbs /rearm //b" (the //b suppresses the popups). After this, the VM is rebooted using vagrant reload.
I've made the changes in vagrantPlaybookCheck.sh (can be seen here); however, it could be done in the Vagrantfile's provision script. My only concern with that is that once shutdown /r /t 0 is run as part of that script, the script doesn't wait for the VM to reboot, so there's a chance that vagrantPlaybookCheck.sh tries to run the Ansible playbook while the machine is still rebooting.
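One way to mitigate that concern would be to poll the forwarded WinRM port until the guest accepts TCP connections again before continuing. A minimal sketch (the helper name, host, and port defaults are illustrative, not taken from vagrantPlaybookCheck.sh):

```python
# Illustrative sketch: wait for the guest's forwarded WinRM port to accept
# connections again after "shutdown /r /t 0". The defaults below are
# examples only; VPC forwards WinRM to host ports such as 2199 or 2201.
import socket
import time

def wait_for_winrm(host='127.0.0.1', port=2201, timeout=600, interval=10):
    """Return True once a TCP connection to host:port succeeds,
    or False if `timeout` seconds elapse without one."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

This only proves the TCP port is open again; a stricter check could attempt an actual WinRM request once the port responds.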

VPC run : https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/OS=Win2012,label=vagrant/1116/console

@karianna karianna added this to the March 2021 milestone Mar 28, 2021
@Willsparker
Contributor Author

Follow up runs:
https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/OS=Win2012,label=vagrant/1120/
https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/1121/ <-- In Progress

It should be noted that runs 1117 & 1118 were also runs to test this; however, they were queued up and run in parallel with 1116, which seems to have caused issues with vagrant up. They were both booting at the same time after 1116 had booted, so it seems that Windows VMs can't boot at the same time, though this may not mean they can't run in parallel.

Otherwise, it appears that forcing the rearm has made the machines more stable in terms of shutting down midway through the playbook. However, runs 1116 and 1120 had connection issues with the VM when running the Python script that starts the builds. I think this may be due to the post-playbook reboot.

@Willsparker
Contributor Author

https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/1122/

So, out of 4 runs, 3 got through the playbook and 1 timed out on MSVS_2017. The rearming does seem to make things more stable, so I'll make the PR, but there's still a fair bit of work to make it fully stable.

Next is the Python script. All 3 runs that got past the playbook failed when the Python script started, so chances are something is wrong with it. The error message:

Running python script
15:06:37  This is what is in the 'inputArgs' var: --fork AdoptOpenJDK --branch master --version jdk8  
15:06:37  This is what is in the 'ipAddress' var: 127.0.0.1:2199 
15:06:37 1
15:06:37 Running :      Start-Process powershell.exe -Verb runAs; cd C:/tmp; sh C:/vagrant/pbTestScripts/buildJDKWin.sh --fork AdoptOpenJDK --branch master --version jdk8 
15:06:37 Traceback (most recent call last):
15:06:37   File "pbTestScripts/startScriptWin.py", line 53, in <module>
15:06:37     main(sys.argv[1:])
15:06:37   File "pbTestScripts/startScriptWin.py", line 50, in main
15:06:37     run_winrm(str(ipAddress), str(inputArgs), mode)
15:06:37   File "pbTestScripts/startScriptWin.py", line 22, in run_winrm
15:06:37     session.run_ps(cmd_str, sys.stdout, sys.stderr)
15:06:37   File "build/bdist.linux-x86_64/egg/winrm/__init__.py", line 53, in run_ps
15:06:37   File "build/bdist.linux-x86_64/egg/winrm/__init__.py", line 40, in run_cmd
15:06:37   File "build/bdist.linux-x86_64/egg/winrm/protocol.py", line 166, in open_shell
15:06:37   File "build/bdist.linux-x86_64/egg/winrm/protocol.py", line 243, in send_message
15:06:37   File "build/bdist.linux-x86_64/egg/winrm/transport.py", line 323, in send_message
15:06:37   File "build/bdist.linux-x86_64/egg/winrm/transport.py", line 328, in _send_message_request
15:06:37   File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 643, in send
15:06:37     r = adapter.send(request, **kwargs)
15:06:37   File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 516, in send
15:06:37     raise ConnectionError(e, request=request)
15:06:37 requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=2199): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7aeea49190>: Failed to establish a new connection: [Errno 111] Connection refused',))
15:06:37 ++ date +%T

(That's from 1122)

WinRM being unstable is not news:
ansible/ansible#25532 -- which leads to ---> diyan/pywinrm#174
ansible/ansible#23320

However, I'm still going to double-check that the script works (locally). The linked PR above may help, as it implements a retry if there's a connection error. Considering we already use my forked version of pywinrm on VPC, I could pull the changes from that pull request into it and see if it helps without too much bother.
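For reference, the retry idea from the linked PR could be approximated as a small wrapper. A minimal sketch, with the helper name and parameters my own invention rather than anything from the VPC scripts; note that pywinrm's transport actually raises requests.exceptions.ConnectionError, while this dependency-free sketch catches Python's builtin ConnectionError instead:

```python
# Illustrative retry helper in the spirit of diyan/pywinrm#174: retry an
# action that may fail with a transient connection error. A real version
# would catch requests.exceptions.ConnectionError, which pywinrm raises.
import time

def run_with_retries(action, attempts=3, delay=10):
    """Call action() up to `attempts` times, waiting `delay` seconds
    between tries; only a ConnectionError triggers a retry."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except ConnectionError:
            if attempt == attempts:
                raise
            time.sleep(delay)

# Hypothetical usage against a pywinrm session:
# import winrm
# session = winrm.Session('127.0.0.1:2199', auth=('vagrant', 'vagrant'))
# result = run_with_retries(lambda: session.run_ps('Get-Date'),
#                           attempts=5, delay=30)
```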

@Willsparker
Contributor Author

I've determined that the Python script does work; the issue may stem from the port number that vagrantPlaybookCheck.sh passes to it. I've put in some debug commands and rerun VPC to see if I can spot the issue :-) https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/1123/

For ease, I'll also run this locally, as the machine that handles VPC is currently executing a QPC job.

@Willsparker
Contributor Author

The changes I've made appear to have worked! Run 1124. Rerunning just to confirm:
https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/1125/

If this works as expected, I'll PR the changes 👍

@Willsparker
Contributor Author

Excellent, it works!
