Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove build-create-binary-log and dotnet-test-process-timeout configurations from OSOCE workflows (OSOE-834) #736

Open
dministro opened this issue Apr 2, 2024 · 14 comments · May be fixed by #808
Assignees

Comments

@dministro
Copy link
Member

dministro commented Apr 2, 2024

After Lombiq/UI-Testing-Toolbox#228, remove the build-create-binary-log and dotnet-test-process-timeout configurations from OSOCE workflows.

Jira issue

@dministro dministro self-assigned this Apr 2, 2024
@github-actions github-actions bot changed the title Remove build-create-binary-log and dotnet-test-process-timeout configurations from OSOCE workflows Remove build-create-binary-log and dotnet-test-process-timeout configurations from OSOCE workflows (OSOE-834) Apr 2, 2024
@Piedone
Copy link
Member

Piedone commented Apr 3, 2024

As per Lombiq/UI-Testing-Toolbox#228 (comment):

Lets keep the current configuration for a whilw (1-2 months).

Then we can also remove it from Lombiq Tenants Core.

@Piedone
Copy link
Member

Piedone commented May 7, 2024

I again see random, but frequent Windows hangs. E.g. the two runs here: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/8976025812?pr=746. These hang here, only telling me that a PS command is hanging, which is the dotnet test one:

image

Then this same code worked here and here, while getting stuck here and here (I disabled hand timeouts deliberately for these two to see what happens).

Do you perhaps have any suggestions?

@dministro
Copy link
Member Author

Twice in a line, it was really rare. I can say.

As I see in the dump, the AsyncStreamReader.ReadBufferAsync is waiting to read something from stdout or stderr of the dotnet process. The dotnet process was not dumped so possibly it was finished before the timeout expired.
image

The dotnet process was exited or the ::warning::The process $($process.Id) didn't exit in $Timeout seconds. was not logged for an unknown reason.
image

I don't know if it is possible, but maybe this can be caused by a special character or control character coming out earlier on the given stream.

I'll look into the logs and dumps if it happens again. Please report similar failures under this issue.

@Piedone
Copy link
Member

Piedone commented May 7, 2024

Thank you for looking into this! How do you get to see what you show in your first screenshot? This is all I see when opening the dump (Debug with Mixed):

image

I've done three test runs, with hang timeouts: the first one passed, the other two hang, see here and here. The hang dumb has the same.

These are for some new code, see #746, that might cause new issues too (I don't see why). And there's no output from the hanging test run.

But! If I skip TimeoutTests, then I get 4 passes in a row (two runs here, then here and here). @sarahelsaig does this ring a bell? Something in the timeout config seems to get tests to stuck.

@dministro
Copy link
Member Author

You can turn on the Tasks window by selecting Debug/Windows/Tasks item. See:
image

Once you have the Tasks window double click on the line selected below:
image

Also Threads, Watch N, and Call Stack windows come in handy in similar situations.

I'll look at the logs latest tomorrow.

@Piedone
Copy link
Member

Piedone commented May 7, 2024

Ah, thanks!

Looking at these, there's a bunch of Applications Insights references, what seems quite strange (however, no stack trace beyond a depth of 1, so I don't see where those are actually coming from). In any case, I removed the Application Insights module of ours being enabled during tests. It resulted in two passes (here and here, and two hangs (here and here). These still include AI references in the same way. So, it's not our AI, but something by GitHub.

I also did some test runs with reruns of what we have in dev (here and here), and they passed (though the second one failed with an "ENOENT: no such file or directory..." error, that's a problem with the artifact upload).

@Piedone
Copy link
Member

Piedone commented May 9, 2024

BTW it's always the Lombiq.Tests.UI.Samples that hangs. And there's no output, though I thought we'd have that now with streaming test outputs. Locally I can see the same behavior when running Invoke-SolutionTests: no output from a test project until the next one starts.

Having the test run's output show up in the console output would help immensely to pinpoint where this goes wrong.

@Piedone
Copy link
Member

Piedone commented May 9, 2024

That may be an xUnit/dotnet test issue; I've opened xunit/xunit#2925.

@Piedone
Copy link
Member

Piedone commented May 10, 2024

That's something xUnit needs to add: xunit/visualstudio.xunit#408.

@Piedone
Copy link
Member

Piedone commented May 10, 2024

I added a debug log file that gets updated immediately on every ITestOutputHelper write in the FailuredDumps, and observed a hang here: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/9033609153. That log and the workflow output both show that the tests in Lombiq.OSOCE.Tests.UI finished (49 completed, as it should be), but still we get test failed:

2024-05-10T14:42:10.5686978Z Results File: C:\a\Open-Source-Orchard-Core-Extensions\Open-Source-Orchard-Core-Extensions\test\Lombiq.OSOCE.Tests.UI\TestResults\test-results.trx
2024-05-10T14:42:10.5687128Z 
2024-05-10T14:42:10.5688680Z Test Run Successful.
2024-05-10T14:42:10.5689829Z      Passed: 49
2024-05-10T14:42:10.5691093Z Total tests: 49
2024-05-10T14:42:10.5692141Z  Total time: 3.5414 Minutes
2024-05-10T14:42:10.5713126Z Test failed: test\Lombiq.OSOCE.Tests.UI\Lombiq.OSOCE.Tests.UI.csproj
2024-05-10T14:42:10.6026845Z ##[error]Process completed with exit code 1.
2024-05-10T14:42:10.6151953Z ##[group]Run Merge-FailureDumps -Directory "."
2024-05-10T14:42:10.6152088Z �[36;1mMerge-FailureDumps -Directory "."�[0m
2024-05-10T14:42:10.6183772Z shell: C:\Program Files\PowerShell\7\pwsh.EXE -command ". '{0}'"
2024-05-10T14:42:10.6183861Z env:
2024-05-10T14:42:10.6184153Z   CHECKOUT_TOKEN: ***
2024-05-10T14:42:10.6184278Z   DOTNET_SKIP_FIRST_TIME_EXPERIENCE: true
2024-05-10T14:42:10.6184386Z   DOTNET_CLI_TELEMETRY_OPTOUT: true
2024-05-10T14:42:10.6184494Z   DOTNET_ROOT: C:\Program Files\dotnet
2024-05-10T14:42:10.6184592Z   DOTNET_SDK_TO_USE: 8.0.204
2024-05-10T14:42:10.6184869Z   Lombiq_Tests_UI__OrchardCoreUITestExecutorConfiguration__MaxParallelTests: 0
2024-05-10T14:42:10.6184946Z ##[endgroup]
2024-05-10T14:42:10.8788979Z 
2024-05-10T14:42:10.8795785Z     Directory: C:\a\Open-Source-Orchard-Core-Extensions\Open-Source-Orchard-Core-Extensions
2024-05-10T14:42:10.8796453Z 
2024-05-10T14:42:10.8797334Z �[32;1mMode   �[0m�[32;1m              LastWriteTime�[0m �[32;1;3m        Length�[0m�[32;1m Name�[0m
2024-05-10T14:42:10.8797934Z �[32;1m----   �[0m �[32;1m             -------------�[0m �[32;1m        ------�[0m �[32;1m----�[0m
2024-05-10T14:42:10.8944776Z d----           5/10/2024  2:42 PM                �[44;1mFailureDumps�[0m

This is with dotnet-test-process-timeout: 600000 and blame-hang-timeout being empty. It looks like some kind of premature task kill to me. Also, Write-Output in StartProcessAndWaitForExit won't work, since in PS output isn't surfaced from functions for some reason. I've fixed that with | Out-Host. Then something strange happend, as you can see under this build:

  • Lombiq.OSOCE.Tests.UI ran to completion with "Test Run Successful.", Lombiq.Tests.UI.Samples didn't start.
  • But still, "Test failed: test\Lombiq.OSOCE.Tests.UI\Lombiq.OSOCE.Tests.UI.csproj ##[error]Process completed with exit code 1."
  • So, it seems that this is not an actual hang but somehow the dotnest test process didn't exit, or didn't exit with the exit code 0, or somehow this line stopped working.

Worth noting that previously, I've seen different behaviors with actual hangs, where a test project was started but didn't finish. Not this time though.

@Piedone
Copy link
Member

Piedone commented May 13, 2024

It was a PS syntax issue: https://github.com/Lombiq/GitHub-Actions/pull/346/files#diff-c9181b664fb498c4bf7694492ba4c67f5c9fe31a8b86be446ca2a80b0fdde3c6L206 After that and adjusting the timeouts, it seems OK, e.g.: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/9055050994/attempts/1. And randomly the whole build can be horribly slow, but still succeed despite the test process being killed: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/9055173471/attempts/1. So, this looks good now!

@Piedone
Copy link
Member

Piedone commented May 13, 2024

I think why hangs became more frequent, also generally, is because we switched from GitRunners to GitHub Windows runners after the former shut down. And the performance of these varies wildly, with even the build taking between 5 and 10 minutes. So, if we by chance get a worse runner, then tests will be slower too.

@dministro
Copy link
Member Author

I think why hangs became more frequent, also generally, is because we switched from GitRunners to GitHub Windows runners after the former shut down. And the performance of these varies wildly, with even the build taking between 5 and 10 minutes. So, if we by chance get a worse runner, then tests will be slower too.

This is what I tried to point here and here.

@Piedone
Copy link
Member

Piedone commented May 13, 2024

OK then!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants