[Fleet] Retry on ES connection error #3082
Comments
It looks like the Go client has some retry logic that is enabled by default for up to 3 attempts. We may need to tweak the config to make sure it's retrying everything it should and that it's working correctly: https://github.com/elastic/go-elasticsearch/blob/v8.11.0/elasticsearch.go#L69
After chatting with @jsoriano, it would also be great to use this issue as a starting point to think about how we could better decouple Fleet Server from Elasticsearch.
Curious to hear more from @jsoriano on what that would imply. I do think we need a well-defined startup sequence for how fleet-server initiates its Elasticsearch connection and starts the rest of the process. For example, in Kibana, the core service blocks most of Kibana from completing startup and blocks serving any HTTP endpoints (except the status endpoint) until a successful connection to ES has been established. This check is done in a loop with backoff until it succeeds. This lets the core layer own that responsibility, so individual subsystems do not each have to handle this scenario. For Fleet Server, we start the bulker, which creates an ES client, before we even do the version check and establish connectivity to ES. We also don't appear to do this check at all in standalone mode, which means we're starting the entire application before we know we can reach ES: fleet-server/internal/pkg/server/fleet.go Lines 428 to 430 in a244bbe
I'd rather not start serving any HTTP traffic (except the status endpoint) until we can be sure we're connected to ES correctly, to avoid additional noise in our signals.
That would mean replacing Elasticsearch with other storage for Fleet policies and so on, but nothing is defined for this yet. Until it is, we will have to work on the retries with Elasticsearch.
We can add a wait loop before trying to create the bulker and run the subsystems, but we still need to be able to retry on connection errors happening after this point.
Description
We currently do not retry on ES connection errors. This seems like something we can improve, and doing so would improve Fleet Server reliability.