[Fleet] Retry on ES connection error #3082

Open
nchaulet opened this issue Nov 6, 2023 · 5 comments
Labels
bug (Something isn't working), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Comments

@nchaulet
Member

nchaulet commented Nov 6, 2023

Description

We currently do not retry on ES connection errors. Adding retries looks like a straightforward improvement that would make Fleet Server more reliable.

@nchaulet added the bug and Team:Fleet labels Nov 6, 2023
@joshdover
Contributor

It looks like the Go client has retry logic that is enabled by default, for up to 3 attempts. We may need to tweak the config to make sure it retries everything it should and that it works correctly: https://github.com/elastic/go-elasticsearch/blob/v8.11.0/elasticsearch.go#L69

@jlind23
Contributor

jlind23 commented Dec 12, 2023

After chatting with @jsoriano, it would also be great to use this issue as a starting point for thinking about how to better decouple Fleet Server from Elasticsearch.
What do you think?

@joshdover
Contributor

Curious to hear more from @jsoriano on what that would imply.

I do think we need a well-defined startup sequence for how fleet-server initiates its Elasticsearch connection and starts the rest of the process.

For example, in Kibana, the core service blocks most of Kibana from completing startup and blocks serving any HTTP endpoints (except the status endpoint) until a successful connection to ES has been established. This check runs in a loop with backoff until it succeeds. This lets the core layer own this responsibility, so that individual subsystems do not each have to handle the scenario.

For Fleet Server, we start the bulker which creates an ES client before we even do the version check and establish connectivity to ES. We also don't appear to do this check at all in standalone mode, which means we're starting the entire application even before we know we can reach ES:

if !f.standAlone {
	// Check version compatibility with Elasticsearch
	remoteVersion, err := ver.CheckCompatibility(ctx, esCli, f.bi.Version)

I'd rather not start serving any HTTP traffic (except the status endpoint) until we can be sure we're connected to ES correctly, to avoid additional noise in our signals.

@jsoriano
Member

jsoriano commented Jan 8, 2024

> Curious to hear more from @jsoriano on what that would imply.

The idea would be to replace Elasticsearch with some other storage for fleet policies and so on, but nothing is defined for this yet. Until that exists, we will have to work on the retries with Elasticsearch.

@jsoriano
Member

jsoriano commented Jan 8, 2024

> I'd rather not start serving any HTTP traffic (except the status endpoint) until we can be sure we're connected to ES correctly, to avoid additional noise in our signals.

We can add a wait loop before trying to create the bulker and run the subsystems, but we still need to be able to retry on connection errors happening after this point.

@jlind23 added the Team:Elastic-Agent-Control-Plane label and removed the Team:Fleet label Jun 4, 2024