[Fleet] Retry on ES connection error #3082

nchaulet · 2023-11-06T15:59:44Z

Description

We currently do not retry on ES connection errors. Adding retries would improve Fleet Server reliability.

joshdover · 2023-11-13T15:48:58Z

It looks like the Go client has retry logic that is enabled by default for up to 3 attempts. We may need to tweak the config to make sure it retries everything it should and that it works correctly: https://github.com/elastic/go-elasticsearch/blob/v8.11.0/elasticsearch.go#L69

jlind23 · 2023-12-12T16:04:36Z

After chatting with @jsoriano, it would also be great to use this issue as a starting point for thinking about how to better decouple Fleet Server from Elasticsearch.
What do you think?

joshdover · 2023-12-28T11:11:20Z

Curious to hear more from @jsoriano on what that would imply.

I do think we need a well-defined startup sequence for how fleet-server initiates its Elasticsearch connection and starts the rest of the process.

For example, in Kibana, the core service blocks most of Kibana from completing startup and blocks serving any HTTP endpoints (except the status endpoint) until a successful connection to ES has been established. This check runs in a loop with backoff until it succeeds. This lets the core layer handle this responsibility without every subsystem having to handle the scenario itself.

For Fleet Server, we start the bulker which creates an ES client before we even do the version check and establish connectivity to ES. We also don't appear to do this check at all in standalone mode, which means we're starting the entire application even before we know we can reach ES:

fleet-server/internal/pkg/server/fleet.go

Lines 428 to 430 in a244bbe

    if !f.standAlone {
        // Check version compatibility with Elasticsearch
        remoteVersion, err := ver.CheckCompatibility(ctx, esCli, f.bi.Version)

I'd rather not start serving any HTTP traffic (except the status endpoint) until we can be sure we're connected to ES correctly, to avoid additional noise in our signals.

jsoriano · 2024-01-08T10:10:12Z

> Curious to hear more from @jsoriano on what that would imply.

This would mean replacing Elasticsearch with some other storage for Fleet policies and so on, but there is no definition for this yet. Until that exists, we will have to work on the retries with Elasticsearch.

jsoriano · 2024-01-08T10:16:56Z

> I'd rather not start serving any HTTP traffic (except the status endpoint) until we can be sure we're connected to ES correctly, to avoid additional noise in our signals.

We can add a wait loop before trying to create the bulker and run the subsystems, but we still need to be able to retry on connection errors happening after this point.

nchaulet added the bug and Team:Fleet labels on Nov 6, 2023

jlind23 added the Team:Elastic-Agent-Control-Plane label and removed the Team:Fleet label on Jun 4, 2024
