[BTS-1644] ResignLeadership + Wait #21401
base: devel
@@ -337,7 +344,28 @@ bool ResignLeadership::start(bool& aborts) {

  // Schedule shard relocations
  if (!scheduleMoveShards(pending)) {
    finish("", "", false, "Could not schedule MoveShard.");
    LOG_TOPIC("d4473", DEBUG, Logger::SUPERVISION)
I would make this a "WARN" log level, according to our policy.
Maybe "INFO", but it should show up in the logs whenever it happens in production, IMHO.
I figured that it is enough to display this message when the job actually fails (see further down). Furthermore, scheduleMoveShards already creates log messages that explain why it won't start yet.
I've got two type-improvement suggestions, otherwise LGTM.
        << "Not starting resign leadership job because some shards have no "
           "common in sync follower";
    // check if a timeout value is specified
    if (_waitForInSyncTimeout > 0) {
Do I understand correctly that if _waitForInSyncTimeout == 0, there is no timeout and we can wait indefinitely? This is a bit confusing, because a timeout of 0 normally means "time out instantly".
Possible fix: we could use an optional instead of a plain number, so that if the snapshot velocypack does not include waitForInSyncTimeout, the option is std::nullopt.
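To make the suggestion concrete, here is a minimal sketch of the optional-based semantics. It is not code from this PR: `TimeoutInSec`, `waitExpired` and `timeoutFromAttribute` are hypothetical names, and the "missing attribute" is modelled as a plain `std::optional<std::uint64_t>` rather than an actual velocypack lookup.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>

// Hypothetical alias, used for illustration only.
using TimeoutInSec = std::chrono::seconds;

// Hypothetical helper: has a job that waits for in-sync followers
// exceeded its deadline?
//   std::nullopt  -> no timeout configured, wait indefinitely
//   0 seconds     -> expires immediately
inline bool waitExpired(std::optional<TimeoutInSec> timeout,
                        TimeoutInSec elapsed) {
  assert(elapsed.count() >= 0);  // elapsed time is never negative
  if (!timeout.has_value()) {
    return false;  // no timeout: never expires
  }
  return elapsed >= *timeout;
}

// Hypothetical parser: a missing attribute maps to std::nullopt,
// a present one (including 0) maps to a concrete timeout.
inline std::optional<TimeoutInSec> timeoutFromAttribute(
    std::optional<std::uint64_t> raw) {
  if (!raw.has_value()) {
    return std::nullopt;
  }
  return TimeoutInSec{static_cast<TimeoutInSec::rep>(*raw)};
}
```

With this shape, "no timeout" and "zero timeout" are different states, so a reader no longer has to know that 0 is a sentinel.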
@@ -55,6 +55,8 @@ struct ResignLeadership : public Job {

  std::string _server;
  bool _undoMoves{true};
  bool _waitForInSync{false};
  uint64_t _waitForInSyncTimeout{30 * 60};
30 minutes sounds like a lot to me, but perhaps this is correct.
This value is actually set to 0 by default in the ResignLeadership constructor, so I guess it would be more readable to set it to zero here as well.
Another thing: it would be nice to make it visible that the two belong together. As far as I understand, _waitForInSyncTimeout does not make any sense if _waitForInSync is false.
E.g. via a std::variant<bool, std::tuple<bool, TimeoutInSec>>, or, combined with the comment above, std::variant<bool, std::tuple<bool, std::optional<TimeoutInSec>>>.
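One possible way to express that coupling, sketched under assumptions: instead of the bool/tuple alternatives named above, this variation uses two small tag structs inside a `std::variant`, so the timeout can only exist when waiting is enabled. All names here (`DontWait`, `WaitForInSync`, `InSyncPolicy`, `shouldWait`, `timeoutOf`) are hypothetical and do not appear in the PR.

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <variant>

// Hypothetical alias, used for illustration only.
using TimeoutInSec = std::chrono::seconds;

// The job does not wait for in-sync followers at all.
struct DontWait {};

// The job waits; the timeout lives here, so it cannot be set
// without waiting being enabled.
struct WaitForInSync {
  std::optional<TimeoutInSec> timeout;  // std::nullopt: wait indefinitely
};

using InSyncPolicy = std::variant<DontWait, WaitForInSync>;

inline bool shouldWait(InSyncPolicy const& policy) {
  return std::holds_alternative<WaitForInSync>(policy);
}

// Returns the configured timeout, or std::nullopt if there is none
// (either because the job does not wait, or because it waits forever).
inline std::optional<TimeoutInSec> timeoutOf(InSyncPolicy const& policy) {
  if (auto const* wait = std::get_if<WaitForInSync>(&policy)) {
    return wait->timeout;
  }
  return std::nullopt;
}
```

Named alternatives avoid the question of what the two different bools in `std::variant<bool, std::tuple<bool, ...>>` would mean, at the cost of two extra types.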
Scope & Purpose
Added waitForInSync and waitForInSyncTimeout parameters to the resign leadership job. This allows the user to wait for a certain amount of time to make sure that common in-sync followers exist. Previously they were just ignored, which could cause downtime.
Design Doc: https://github.com/arangodb/documents/pull/136
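As a rough illustration of the behaviour described above, the following sketch models the start decision as a pure function. It is an assumption-laden simplification, not the supervision code: `Decision`, `ResignOptions` and `decide` are hypothetical names, a timeout of 0 is read as "no timeout" (the interpretation discussed in the review), and the job without waitForInSync is assumed to start regardless of follower state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical outcome of one attempt to start the job.
enum class Decision { Start, KeepWaiting, Fail };

// Hypothetical mirror of the two new job parameters.
struct ResignOptions {
  bool waitForInSync = false;
  std::uint64_t waitForInSyncTimeout = 0;  // seconds; 0 read as "no timeout"
};

// Decide what to do given whether every shard has a common in-sync
// follower and how long the job has been waiting so far.
inline Decision decide(ResignOptions const& opts,
                       bool allShardsHaveInSyncFollower,
                       std::uint64_t elapsedSec) {
  if (allShardsHaveInSyncFollower || !opts.waitForInSync) {
    // Either it is safe to move, or the caller chose not to wait.
    return Decision::Start;
  }
  if (opts.waitForInSyncTimeout > 0 &&
      elapsedSec >= opts.waitForInSyncTimeout) {
    return Decision::Fail;  // waited long enough, give up
  }
  return Decision::KeepWaiting;
}
```

For example, under these assumptions `decide(ResignOptions{true, 60}, false, 59)` keeps waiting, while the same call with 60 elapsed seconds fails the job.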