JPERF-811: Fix flaky shouldTolerateEarlyFinish test by waiting until … #23
base: master
Conversation
Related discussion: #13 (comment)
    processCommand: String,
    timeout: Duration
) = this.newConnection().use {
    it.execute("wait `pgrep '$processCommand'`", timeout)
Probably a bit of an overkill, but I don't have a better idea for how we can wait for the process to finish.
It's also a shotgun, because we are waiting not only for the one process we started, but also for all other processes that might have the same name. I think we don't expect there to be any other processes like that anyway.
I think the chance of other processes with the same name is minimal, and this solution should be sufficient.
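For context, a minimal sketch of what the complete helper from the diff might look like, assuming an extension on the library's `Ssh` type whose `newConnection()` returns a closeable connection with the `execute(command, timeout)` shown above (everything outside the diff fragment is inferred, not confirmed):

```kotlin
import java.time.Duration

// Sketch only; the receiver type and surrounding declaration are
// assumptions based on the diff fragments in this PR.
fun Ssh.waitForAllProcessesToFinish(
    processCommand: String,
    timeout: Duration
) = this.newConnection().use {
    // `wait` blocks until the PIDs printed by pgrep terminate.
    // Caveat: in most shells `wait` only applies to children of the
    // current shell, which is part of why this approach is brittle
    // (see the portability discussion later in this thread).
    it.execute("wait `pgrep '$processCommand'`", timeout)
}
```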
    val failResult = fail.stop(Duration.ofMillis(20))
    repeat(1000) {
        val fail = sshHost.runInBackground("nonexistant-command")
        sshHost.waitForAllProcessesToFinish("nonexistant-command", Duration.ofMillis(100))
A timeout of `Duration.ofMillis(50)` wasn't enough; I had a run where it timed out within 1000 repeats.
I believe these kinds of timeouts are inevitable when we are working with the OS. There could always be some huge lag spike where this and any other process with a timeout would just fail.
Anyway, it's better to time out on an ssh command where an exception explicitly says that it was a timeout than to fail because we didn't get a response. In the former case it's obvious that we can just increase the timeout to reduce the likelihood of flakiness.
> I believe these kinds of timeouts are inevitable when we are working with the OS.
Yeah, process scheduling can introduce random delays, but the delay distribution is not open-ended. It will never take 1 hour.
> Anyway, it's better to time out on an ssh command where an exception explicitly says that it was a timeout than to fail because we didn't get a response. In the former case it's obvious that we can just increase the timeout to reduce the likelihood of flakiness.
Exactly ❤️
@@ -56,14 +56,24 @@ class SshTest {
    SshContainer().useSsh { sshHost ->
        installPing(sshHost)

        val fail = sshHost.runInBackground("nonexistent-command")
        val failResult = fail.stop(Duration.ofMillis(20))
        repeat(1000) {
I intend to remove this `repeat` block, or at least reduce the number to e.g. 10. I pushed it only to show proof that the wait helps (maybe I should push without the fix first 🤔).
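For illustration, here is the flake-hunt pattern assembled from the diff fragments in this PR, as a hedged sketch; the `isSuccessful()` check is an assumption about the result type, not the library's confirmed API:

```kotlin
import java.time.Duration
import org.junit.Test

class SshFlakeHuntSketch {
    @Test
    fun shouldTolerateEarlyFinish() {
        SshContainer().useSsh { sshHost ->
            // 1000 iterations while hunting the race; drop to ~10 (or
            // remove the loop) once the fix is proven, as discussed above.
            repeat(1000) {
                val fail = sshHost.runInBackground("nonexistant-command")
                sshHost.waitForAllProcessesToFinish("nonexistant-command", Duration.ofMillis(100))
                val failResult = fail.stop(Duration.ofMillis(20))
                // The command does not exist, so stopping it should
                // report a failure rather than throw.
                check(!failResult.isSuccessful()) { "expected a failed result" }
            }
        }
    }
}
```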
You can have a PR with `repeat` open (and red) and then open PRs targeting that red branch. They'd show it's green, and it would allow alternatives to be explored.
Force-pushed from a1ba4f2 to e79bb92 (commit message: …OS actually starts (and finishes) the process before interrupting it)
I removed the `repeat`.
@@ -56,7 +56,8 @@ class SshTest {
     SshContainer().useSsh { sshHost ->
         installPing(sshHost)

-        val fail = sshHost.runInBackground("nonexistent-command")
+        val fail = sshHost.runInBackground("nonexistant-command")
+        sshHost.waitForAllProcessesToFinish("nonexistant-command", Duration.ofMillis(100))
Now the test shows that the `runInBackground` API cannot be used easily. We should fix the problem in the API rather than in tests.
We could wait for the process to start before returning from `runInBackground`, similarly to how my solution proposal does it in this test. However, this approach is very system specific (usage of `wait` and `pgrep`). I don't know how we could do that in a more portable way.
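One hedged alternative that at least avoids `wait` (though it still leans on `pgrep`): poll in a shell loop through the same `execute(command, timeout)` API shown in the diff. The helper name here is made up:

```kotlin
import java.time.Duration

// Sketch of a polling variant: instead of `wait`, loop until pgrep stops
// matching. Still system specific because of pgrep, but it drops the
// requirement that the processes be children of the current shell.
fun Ssh.pollUntilProcessesFinish(
    processCommand: String,
    timeout: Duration
) = this.newConnection().use {
    it.execute(
        "while pgrep '$processCommand' > /dev/null; do sleep 0.1; done",
        timeout
    )
}
```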
Another approach would be to just return some specific `SshConnection.SshResult` as part of `BackgroundProcess.stop` when we don't really get the `exitStatus`. However, I don't know what that would be, and we need to return something if we want to maintain the current API of `BackgroundProcess`.
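A hedged sketch of that idea, using a stand-in type because the real `SshConnection.SshResult` constructor isn't shown in this thread; the sentinel exit code is purely an assumption:

```kotlin
// Stand-in for SshConnection.SshResult; field names are assumptions.
data class SketchSshResult(
    val exitStatus: Int,
    val output: String
)

// What BackgroundProcess.stop could return when the server never delivers
// an exit-status (e.g. it sent exit-signal instead, after an interrupt).
fun toStopResult(exitStatus: Int?, output: String): SketchSshResult =
    if (exitStatus != null) {
        SketchSshResult(exitStatus, output)
    } else {
        // 130 = 128 + SIGINT is the conventional exit code of a process
        // killed by Ctrl-C; picking it as the sentinel is an assumption.
        SketchSshResult(130, output)
    }
```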
@dagguh I'd like to fix the flakiness of the test. This is my main goal.
My understanding of the possible implementations of your suggestion is that they either make the solution less portable or break the API (see my 2 previous comments). I don't like either of those options and I'm a bit stuck with this PR. Do you have any other ideas for how this could be fixed? If not, then maybe you have an opinion about which of those 2 is better?
@pczuj why are we afraid to break the API and properly apply semantic versioning?
> break the API and properly apply semantic versioning
Yes, we can just break the API; it's a tool in our belt.
It can require some extra effort, though. Let's say we release `ssh:3.0.0`. The `infrastructure` lib is an `ssh` consumer:

    api("com.atlassian.performance.tools:ssh:[2.3.0,3.0.0)")

It will have a choice:
1. `ssh:[2.3.0,3.0.0)` - stay on the old `ssh`; this is the default state. It takes no effort, but what was the point of `ssh:3.0.0` if no consumer needs it?
2. `ssh:[2.3.0,4.0.0)` - use the new, without dropping the old. Only possible if the removed API was not used.
3. `ssh:[3.0.0,4.0.0)` - use the new, drop the old. This means that all consumers of `infrastructure` which are still on `ssh:[2.0.0,3.0.0)` will no longer be compatible. Therefore that bump is a breakage of `infrastructure` (its POM contract), which means another major release. This bump would cascade to multiple libs: `aws-infrastructure` and `jira-performance-tests`, even if they don't use `ssh` directly.
We used scenario 3 many times in the past. It's a bit annoying, but not that scary.
We successfully used scenario 2 as well.
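Sketched in Gradle Kotlin DSL, assuming `infrastructure` declares its `ssh` dependency as shown above, the three choices would look like this (only one active at a time):

```kotlin
// 1. Stay on the old ssh (default, zero effort):
api("com.atlassian.performance.tools:ssh:[2.3.0,3.0.0)")

// 2. Accept both old and new; only possible if the removed API is unused:
api("com.atlassian.performance.tools:ssh:[2.3.0,4.0.0)")

// 3. Require the new ssh; a breaking bump of infrastructure itself:
api("com.atlassian.performance.tools:ssh:[3.0.0,4.0.0)")
```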
> We should fix the problem in the API rather than in tests.
I didn't mean we have to break the API. I meant: the tests found a real flakiness pain, so let's fix it for the consumers as well.
> this approach is very system specific (usage of `wait` and `pgrep`)
I would avoid it too. Not only does it lose generality, it also adds more moving parts: brittle and complex.
@@ -56,7 +56,8 @@ class SshTest {
     SshContainer().useSsh { sshHost ->
         installPing(sshHost)

-        val fail = sshHost.runInBackground("nonexistent-command")
+        val fail = sshHost.runInBackground("nonexistant-command")
+        sshHost.waitForAllProcessesToFinish("nonexistant-command", Duration.ofMillis(100))
         val failResult = fail.stop(Duration.ofMillis(20))
So this fails due to `command.exitStatus must not be null`.
The `exitStatus` is supposed to be filled by `if (exit-status)`, but it misses it and falls into `if (exit-signal)`:
We can see that `exitSignal` = `INT`, so that's our `tryToInterrupt`. But why is `req` coming in with different values? Where exactly is the race condition? 🤔
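To make the failure mode concrete, a minimal sketch against SSHJ's public API (`net.schmizz.sshj`, which this library wraps, per the `SshjExecutedCommand` mentioned below); the command line and timeout are arbitrary:

```kotlin
import java.util.concurrent.TimeUnit
import net.schmizz.sshj.SSHClient

// After an interrupt, the server can answer with an exit-signal request
// instead of exit-status, leaving Command.exitStatus null.
fun describeExit(client: SSHClient, commandLine: String): String =
    client.startSession().use { session ->
        val command = session.exec(commandLine)
        command.join(5, TimeUnit.SECONDS)
        when {
            command.exitStatus != null -> "exit-status=${command.exitStatus}"
            // e.g. INT after a tryToInterrupt, as observed above
            command.exitSignal != null -> "exit-signal=${command.exitSignal}"
            else -> "no exit information received"
        }
    }
```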
The `close` in `readOutput` fails if the command was interrupted.
Softening it unveils the real flakiness source:
SSH command failed to finish in extended time (PT0.025S): SshjExecutedCommand(stdout=^Cbash: nonexistent-command-290: command not found
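The "softening" could look like this hedged sketch: swallow the close failure so it stops masking the underlying timeout (the helper name is illustrative, not the library's actual internals):

```kotlin
// If the command was interrupted, close() can throw and hide the real
// failure; log-and-continue instead, so the timeout diagnosis surfaces.
fun closeSoftly(closeable: AutoCloseable) {
    try {
        closeable.close()
    } catch (e: Exception) {
        // Intentionally swallowed; a logger.warn(e) would fit here.
    }
}
```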
BTW, it seems arbitrary to have both `SshResult` and `SshjExecutedCommand`, and two ways of reading stdout.
I traced it down to ancient times behind the wall: https://bulldog.internal.atlassian.com/browse/JPT-292
Hmm, I already had the fixes, but I was checking the effect of each commit via rebase (to populate the CHANGELOG correctly). It turns out that `failed to finish in extended time (PT0.025S)` happens even without the fixes. Sometimes it's a timeout, sometimes it's a null. I must have mixed up the observations around `exitStatus=null` with the wrong actual symptom. I've got to really hammer down my understanding of the problem.