
VPC: Win2012 unstable when booting VM / running playbook #2056

Closed
Willsparker opened this issue Mar 19, 2021 · 9 comments · Fixed by #2112

@Willsparker
Contributor

Logs:
https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/OS=Win2012,label=vagrant/1114/console
https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/OS=Win2012,label=vagrant/1112/console
https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/OS=Win2012,label=vagrant/1097/console

The Windows VPC seems to be 'weird'. It will randomly disconnect from the VM halfway through a task (e.g. from 1112):

18:22:25 TASK [NTP_TIME : NTP - Service w32time restart] ********************************
18:24:10 An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ReadTimeout: HTTPSConnectionPool(host='127.0.0.1', port=2201): Read timed out. (read timeout=30)
18:24:10 fatal: [127.0.0.1]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}

Occasionally, it won't even get to starting the playbook (e.g. from 1097):

15:46:05     adoptopenjdkW2012: WinRM transport: negotiate
15:51:05 Timed out while waiting for the machine to boot. This means that
15:51:05 Vagrant was unable to communicate with the guest machine within
15:51:05 the configured ("config.vm.boot_timeout" value) time period.
15:51:05 
15:51:05 If you look above, you should be able to see the error(s) that
15:51:05 Vagrant had when attempting to connect to the machine. These errors
15:51:05 are usually good hints as to what may be wrong.
15:51:05 
15:51:05 If you're using a custom box, make sure that networking is properly
15:51:05 working and you're able to connect to the machine. It is a common
15:51:05 problem that networking isn't setup properly in these boxes.
15:51:05 Verify that authentication configurations are also setup properly,
15:51:05 as well.
15:51:05 
15:51:05 If the box appears to be booting properly, you may want to increase
15:51:05 the timeout ("config.vm.boot_timeout") value.
15:51:06 Build step 'Execute shell' marked build as failure

I am assuming this is due to the VM shutting down after a while of execution, but I don't know for sure. I have had similar things happen on my local machine too.
For the timeout issue, I'd first suggest increasing the boot timeout with config.vm.boot_timeout = 600, which bumps it up from the default of 5 minutes to 10. If it still times out after that, we can assume something is causing it to hang.
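As a sketch, that suggestion is a one-line setting in the Vagrantfile (illustrative fragment only; the real Vagrantfile sets plenty of other options around it):

```ruby
# Raise the VM boot timeout from the 300-second default to 10 minutes.
Vagrant.configure("2") do |config|
  config.vm.boot_timeout = 600
end
```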

@Willsparker Willsparker changed the title VPC: Win2012 unstable when booting VM VPC: Win2012 unstable when booting VM / running playbook Mar 19, 2021
@Willsparker Willsparker self-assigned this Mar 24, 2021
@Willsparker
Contributor Author

The Vagrant VM consistently says Windows Server 2012R2 Standard Evaluation Licence is expired when booting. Apparently (according to this website), when this is the case, it can cause unexpected issues, such as the machine shutting down randomly. So, I'm going to add the rearm command (slmgr.vbs /rearm) to the Vagrantfile, to see if that stops the random connection drops.

@aahlenst
Contributor

@Willsparker Why don't we fix it for real, for example by regenerating the image regularly or purchasing a license? You cannot rearm it forever.

@Willsparker
Contributor Author

AFAIK, we could rearm it forever. When vagrant up is run, the Vagrant box image is copied, and that copy is what's changed whenever anything is done to the VM. So, rearming it via the VM would presumably not affect the number of rearms we have left on the original image (..? not 100% sure on how evaluation licenses work).

@Willsparker
Contributor Author

I've been able to automatically rearm the evaluation license using
vagrant winrm --shell -c "slmgr.vbs /rearm //b" (the //b suppresses the popups). After this, the VM is rebooted using vagrant reload.
I've made the changes in vagrantPlaybookCheck.sh (can be seen here); however, it could be done in the Vagrantfile's provision script. My only concern with that is that once shutdown /r /t 0 is run as part of that script, the script doesn't wait for the VM to reboot, so there's a chance that vagrantPlaybookCheck.sh tries to run the Ansible playbook while the machine is still rebooting.
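One way to mitigate that concern would be to poll the forwarded WinRM port until the guest accepts TCP connections again before continuing. A minimal sketch (the helper name, host, and port defaults are illustrative, not taken from vagrantPlaybookCheck.sh):

```python
# Illustrative sketch: wait for the guest's forwarded WinRM port to accept
# connections again after "shutdown /r /t 0". The defaults below are
# examples only; VPC forwards WinRM to host ports such as 2199 or 2201.
import socket
import time

def wait_for_winrm(host='127.0.0.1', port=2201, timeout=600, interval=10):
    """Return True once a TCP connection to host:port succeeds,
    or False if `timeout` seconds elapse without one."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

This only proves the TCP port is open again; a stricter check could attempt an actual WinRM request once the port responds.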

VPC run : https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/OS=Win2012,label=vagrant/1116/console

@karianna karianna added this to the March 2021 milestone Mar 28, 2021
@Willsparker
Contributor Author

Follow up runs:
https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/OS=Win2012,label=vagrant/1120/
https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/1121/ <-- In Progress

It should be noted that runs 1117 & 1118 were also runs to test this; however, they were queued up and run in parallel with 1116, which seems to have caused issues with vagrant up. They were both booting at the same time after 1116 had booted, so it seems that Windows VMs can't boot at the same time, though this may not mean they can't run in parallel.

Otherwise, it appears that forcing the rearm has made the machines more stable in terms of shutting down midway through the playbook. However, runs 1116 and 1120 had connection issues with the VM when running the Python script that starts the builds. I think this may be due to the post-playbook reboot.

@Willsparker
Contributor Author

https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/1122/

So, out of 4 runs, 3 got through the playbook and 1 timed out on MSVS_2017. The rearming does seem to make things more stable, so I'll make the PR, but there's still a fair bit of work to make it fully stable.

Next is the Python script. All 3 runs that got past the playbook failed when the Python script started, so chances are something is wrong with it. The error message:

Running python script
15:06:37  This is what is in the 'inputArgs' var: --fork AdoptOpenJDK --branch master --version jdk8  
15:06:37  This is what is in the 'ipAddress' var: 127.0.0.1:2199 
15:06:37 1
15:06:37 Running :      Start-Process powershell.exe -Verb runAs; cd C:/tmp; sh C:/vagrant/pbTestScripts/buildJDKWin.sh --fork AdoptOpenJDK --branch master --version jdk8 
15:06:37 Traceback (most recent call last):
15:06:37   File "pbTestScripts/startScriptWin.py", line 53, in <module>
15:06:37     main(sys.argv[1:])
15:06:37   File "pbTestScripts/startScriptWin.py", line 50, in main
15:06:37     run_winrm(str(ipAddress), str(inputArgs), mode)
15:06:37   File "pbTestScripts/startScriptWin.py", line 22, in run_winrm
15:06:37     session.run_ps(cmd_str, sys.stdout, sys.stderr)
15:06:37   File "build/bdist.linux-x86_64/egg/winrm/__init__.py", line 53, in run_ps
15:06:37   File "build/bdist.linux-x86_64/egg/winrm/__init__.py", line 40, in run_cmd
15:06:37   File "build/bdist.linux-x86_64/egg/winrm/protocol.py", line 166, in open_shell
15:06:37   File "build/bdist.linux-x86_64/egg/winrm/protocol.py", line 243, in send_message
15:06:37   File "build/bdist.linux-x86_64/egg/winrm/transport.py", line 323, in send_message
15:06:37   File "build/bdist.linux-x86_64/egg/winrm/transport.py", line 328, in _send_message_request
15:06:37   File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 643, in send
15:06:37     r = adapter.send(request, **kwargs)
15:06:37   File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 516, in send
15:06:37     raise ConnectionError(e, request=request)
15:06:37 requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=2199): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7aeea49190>: Failed to establish a new connection: [Errno 111] Connection refused',))
15:06:37 ++ date +%T

(That's from 1122)

WinRM being unstable is not news:
ansible/ansible#25532 -- which leads to ---> diyan/pywinrm#174
ansible/ansible#23320

However, I'm still going to double-check that the script works (locally). The linked PR above may help, as it implements a retry if there's a connection error. Considering we already use my forked version of pywinrm on VPC, I could pull the changes from that pull request into it and see if it helps without too much bother.
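For reference, the retry idea from the linked PR could be approximated as a small wrapper. A minimal sketch, with the helper name and parameters my own invention rather than anything from the VPC scripts; note that pywinrm's transport actually raises requests.exceptions.ConnectionError, while this dependency-free sketch catches Python's builtin ConnectionError instead:

```python
# Illustrative retry helper in the spirit of diyan/pywinrm#174: retry an
# action that may fail with a transient connection error. A real version
# would catch requests.exceptions.ConnectionError, which pywinrm raises.
import time

def run_with_retries(action, attempts=3, delay=10):
    """Call action() up to `attempts` times, waiting `delay` seconds
    between tries; only a ConnectionError triggers a retry."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except ConnectionError:
            if attempt == attempts:
                raise
            time.sleep(delay)

# Hypothetical usage against a pywinrm session:
# import winrm
# session = winrm.Session('127.0.0.1:2199', auth=('vagrant', 'vagrant'))
# result = run_with_retries(lambda: session.run_ps('Get-Date'),
#                           attempts=5, delay=30)
```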

@Willsparker
Contributor Author

I've determined that the Python script does work; the issue may stem from the port number that vagrantPlaybookCheck.sh passes to it. I've put in some debug commands and rerun VPC to see if I can spot the issue :-) https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/1123/

For ease, I'll also run this locally, as the machine that handles VPC is currently executing a QPC job.

@Willsparker
Contributor Author

The changes I've made appear to have worked! Run 1124. Rerunning just to confirm:
https://ci.adoptopenjdk.net/job/VagrantPlaybookCheck/1125/

If this works as expected, I'll PR the changes 👍

@Willsparker
Contributor Author

Excellent, it works!
