pending allocations stuck in pending state after adoption by a new deployment #23305
Comments
Hi @lattwood and thanks for raising this issue with the detail included. I'll add this to our backlog to investigate further. In the meantime, if you discover any other information or a concrete reproduction, please let us know.
More log spelunking: after this shows up in the logs on an agent/client, no further alloc updates occur, and the drain issue with the pending allocs also starts occurring.
Grabbed a goroutine stack dump and found a clue. The same node is blocked here, and has been for 1489 minutes, which ends up being just after the "error performing RPC to server" messages in the comment above: https://github.com/hashicorp/yamux/blob/6c5a7317d6e3b6e7f85db696d6c83ed353e7cb4c/stream.go#L145-L153 Maybe timeouts aren't getting set on retries? Edit: the deserialization methods for a jobspec are also evident in the stack trace.
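For anyone wanting to capture the same kind of dump: any Go program can emit one via `runtime/pprof`, and Nomad can expose equivalent pprof endpoints when debug options are enabled on the agent. The sketch below is generic Go, not Nomad-specific tooling.

```go
// Minimal, generic sketch of producing a goroutine dump like the one quoted
// above (plain Go, not Nomad's actual debug tooling).
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// debug=2 prints a full stack per goroutine, including how long blocked
	// goroutines have been waiting (e.g. "[select, 1489 minutes]").
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}
```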
Confirmed that hashicorp/yamux#127 fixes the issue. We'll be running a forked Nomad build until there's a release of Nomad with the yamux fix in it. Edit: idk if I'm out of line for saying this, but the yamux fix probably should be backported everywhere it's in use 😅
Did you have TLS configured for Nomad's RPC?
@schmichael yup. (Our Nomad agents are distributed around the world.)
Specifically to debug hashicorp/nomad#23305, but tests should probably run against multiple net.Conn implementations, as yamux is sensitive to net.Conn behaviors such as concurrent Read|Write|Close and what errors are returned.
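As a rough illustration of that suggestion, the harness below runs one test body against several net.Conn implementations. It is a generic sketch using only the standard library, not yamux's actual test suite.

```go
// Sketch: run the same test body against multiple net.Conn implementations,
// since behaviors like concurrent Read/Write/Close and the errors returned
// differ between transports.
package conncheck

import (
	"net"
	"testing"
)

type connPair func(t *testing.T) (client, server net.Conn)

func pipeConns(t *testing.T) (net.Conn, net.Conn) {
	c, s := net.Pipe() // in-memory, fully synchronous
	t.Cleanup(func() { c.Close(); s.Close() })
	return c, s
}

func tcpConns(t *testing.T) (net.Conn, net.Conn) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Fatal(err)
	}
	t.Cleanup(func() { ln.Close() })

	accepted := make(chan net.Conn, 1)
	go func() {
		s, _ := ln.Accept()
		accepted <- s
	}()
	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		t.Fatal(err)
	}
	s := <-accepted
	t.Cleanup(func() { c.Close(); s.Close() })
	return c, s
}

func TestWriteAfterPeerClose(t *testing.T) {
	for name, pair := range map[string]connPair{
		"pipe": pipeConns,
		"tcp":  tcpConns,
	} {
		t.Run(name, func(t *testing.T) {
			client, server := pair(t)
			// A session layer under test (e.g. yamux) would be exercised
			// here; this just shows that the error surfaced by Write
			// differs per transport.
			server.Close()
			_, err := client.Write([]byte("x"))
			t.Logf("%s: write after peer close returned: %v", name, err)
		})
	}
}
```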
Just saw another issue opened on yamux this week that could be responsible for what we're seeing here as well: hashicorp/yamux#133
In #20165 we fixed a bug where a partially configured `client.template` retry block would set any unset fields to nil instead of their default values. But this patch introduced a regression in the default values, so we were now defaulting to unlimited retries if the retry block was unset. Restore the correct behavior and add better test coverage in both the config parsing and the template configuration code. Ref: #20165 Ref: #23305 (comment)
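Purely illustrative of the bug class described in that commit message: this is not Nomad's actual config code, and the type, field names, and default values below are invented. When optional fields are pointers, a merge that forgets to fall back to defaults leaves unset fields nil, which downstream code can then interpret as unlimited retries.

```go
// Hypothetical sketch of the default-vs-nil merge bug class (names and
// values invented, not Nomad's real implementation).
package main

import "fmt"

// RetryConfig models a config block whose optional fields are pointers so
// that "unset by the user" (nil) is distinguishable from an explicit zero.
type RetryConfig struct {
	Attempts *int
	Backoff  *string
}

func defaultRetry() *RetryConfig {
	attempts, backoff := 12, "250ms" // example defaults only
	return &RetryConfig{Attempts: &attempts, Backoff: &backoff}
}

// merge overlays the user's partial block on top of the defaults, field by
// field. The regression described above amounts to skipping this fallback,
// so unset fields stayed nil and were later treated as unlimited.
func merge(def, user *RetryConfig) *RetryConfig {
	if user == nil {
		return def
	}
	out := *def
	if user.Attempts != nil {
		out.Attempts = user.Attempts
	}
	if user.Backoff != nil {
		out.Backoff = user.Backoff
	}
	return &out
}

func main() {
	one := 1
	user := &RetryConfig{Attempts: &one} // Backoff deliberately left unset
	merged := merge(defaultRetry(), user)
	fmt.Println(*merged.Attempts, *merged.Backoff) // 1 250ms
}
```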
Fix for the template config is up here: #25113
Hey @lattwood, just by way of follow-up: we're nudging along the folks who need to approve that go-msgpack PR. Once that's done I'll open a PR updating the dependency, and that PR will close this issue.
@tgross thanks for the update, completely understand.
So, turns out we put them in [...]. Would be really nice if Nomad loudly complained about stuff like that 😅
Accidentally closed.
It turns out when `net/rpc` calls `WriteResponse`, it [**doesn't do anything with the error**](https://github.com/golang/go/blob/cb47156e90cac6b1030d747f1ccd433c00494dfc/src/net/rpc/server.go#L359-L362)! (It does correctly propagate errors when `WriteRequest` is called, though.) This means this codec needs to handle connection disposal, because there's no way for someone using `net/rpc` to find out about the error and handle it; we're in the twilight zone. Additionally, because `net/rpc` is frozen upstream, there is a lack of desire to fix even the logging aspect, let alone the underlying defect: golang/go#71559

The `WriteResponse` and `WriteRequest` methods both use the same underlying `write` method, so this fix took three commits:

* adding the first test coverage to this package, to prove de-DRYing doesn't break anything
* de-DRYing by inlining `write` into `WriteResponse` and `WriteRequest`
* the actual fix, along with a test for it

This puts an end to two years of "spooky behaviour at a distance" on our large and geographically distributed Nomad cluster.

Fixes: hashicorp/nomad#23305
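To make the disposal pattern concrete, here is a rough sketch of a `net/rpc` server codec that closes its connection when a write fails. It is not the actual go-msgpack code: it uses `encoding/gob` as a stand-in encoder and invented names, but it shows why the codec has to do the cleanup itself when `net/rpc` throws the error away.

```go
// Sketch only: a ServerCodec that disposes of the connection on write error,
// since net/rpc discards the error returned from WriteResponse. The real
// codec in go-msgpack uses msgpack, not gob.
package rpccodec

import (
	"bufio"
	"encoding/gob"
	"io"
	"net/rpc"
)

type serverCodec struct {
	rwc io.ReadWriteCloser
	dec *gob.Decoder
	enc *gob.Encoder
	buf *bufio.Writer
}

func NewServerCodec(rwc io.ReadWriteCloser) rpc.ServerCodec {
	buf := bufio.NewWriter(rwc)
	return &serverCodec{
		rwc: rwc,
		dec: gob.NewDecoder(rwc),
		enc: gob.NewEncoder(buf),
		buf: buf,
	}
}

func (c *serverCodec) ReadRequestHeader(r *rpc.Request) error { return c.dec.Decode(r) }
func (c *serverCodec) ReadRequestBody(body interface{}) error { return c.dec.Decode(body) }
func (c *serverCodec) Close() error                           { return c.rwc.Close() }

func (c *serverCodec) WriteResponse(r *rpc.Response, body interface{}) error {
	if err := c.writeResponse(r, body); err != nil {
		// net/rpc does nothing with this error, so the codec must close the
		// connection itself; otherwise the peer's pending read blocks forever.
		c.Close()
		return err
	}
	return nil
}

func (c *serverCodec) writeResponse(r *rpc.Response, body interface{}) error {
	if err := c.enc.Encode(r); err != nil {
		return err
	}
	if err := c.enc.Encode(body); err != nil {
		return err
	}
	return c.buf.Flush()
}
```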
Fixes a bug where connections would not be closed on write errors in the msgpack encoder, which would cause the reader end of RPC connections to hang indefinitely. This resulted in clients in widely-distributed geographies being unable to poll for allocation updates. Fixes: #23305
Closed by #25201
Nomad version
Operating system and Environment details
All Linux
Issue
We scale task groups via `nomad job scale JOB TG COUNT` when it changes, and may run another `nomad job scale` command immediately afterwards.

Reproduction steps
Expected Result
Actual Result
Screenshots of Absurdity
In this screenshot, all the allocs are part of the same job, but different task groups. The colours correspond to the actual task groups. Note the varying versions, and that some have a client assigned and some do not.
Additionally, you can see the `Modified` time is the same for the ones staying up to date, but isn't changing on the other ones; you can also see the `Created` times are all over the place.

In this screenshot, you can see we have a pending allocation that has been rescheduled, and that rescheduled allocation is marked pending as well. Neither of the allocations has been assigned to a client as far as the Nomad WebUI is concerned.
Reporter's speculation
Maybe it has something to do with how the allocations are being adopted by the rapid deployments? This definitely reeks of a race condition.
Nomad logs
Useless, unfortunately, due to this bug: #22431