Retry on etcd too many requests error #132

Merged
1 commit merged into rh-ecosystem-edge:main on Jun 12, 2024

Conversation

omertuc
Member

@omertuc omertuc commented Apr 25, 2024

Solves MGMT-17651

tl;dr

A fix for a rare error:

Error: finalizing

Caused by:
    0: commiting etcd cache to actual etcd
    1: grpc request error: status: Unknown, message: "etcdserver: too many requests", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }

Background

When committing our in-memory etcd representation to the actual etcd, we send all delete requests concurrently, and there are many of them.

Issue

Sometimes etcd responds to this burst with the error "etcdserver: too many requests". Recert treated this as a hard error and exited.

Solution

Compare the error message to this exact phrasing and, if it matches, repeat the request. There doesn't seem to be a more robust error code to check, since the gRPC status code is just Unknown. With retries, all requests should eventually go through.
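The approach described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code from this PR: `RequestError`, `is_too_many_requests`, and `retry_on_too_many_requests` are hypothetical names, and the `max_attempts` and `delay` parameters are assumptions added here so the sketch cannot loop forever (the description only says the request is repeated).

```rust
use std::{thread, time::Duration};

/// Hypothetical stand-in for a gRPC error; only the message matters here,
/// because the status code reported by etcd is just `Unknown`.
#[derive(Debug, Clone, PartialEq)]
struct RequestError {
    message: String,
}

/// The exact phrasing etcd uses when it is overloaded.
const TOO_MANY_REQUESTS: &str = "etcdserver: too many requests";

/// Returns true when the error message contains etcd's overload phrasing.
fn is_too_many_requests(err: &RequestError) -> bool {
    err.message.contains(TOO_MANY_REQUESTS)
}

/// Repeats `request` while it fails with the overload error.
/// Any other error, or a success, is returned immediately.
/// `max_attempts` and `delay` are assumptions of this sketch.
fn retry_on_too_many_requests<T>(
    mut request: impl FnMut() -> Result<T, RequestError>,
    max_attempts: usize,
    delay: Duration,
) -> Result<T, RequestError> {
    let mut attempt = 1;
    loop {
        match request() {
            // Overload error with attempts remaining: wait, then try again.
            Err(err) if is_too_many_requests(&err) && attempt < max_attempts => {
                attempt += 1;
                thread::sleep(delay);
            }
            // Success, a different error, or attempts exhausted.
            other => return other,
        }
    }
}
```

Each delete request would be wrapped in such a helper, so a transient overload response delays that request rather than aborting the whole commit.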

@openshift-ci openshift-ci bot requested a review from mresvanis April 25, 2024 10:42

openshift-ci bot commented Apr 25, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: omertuc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@omertuc
Member Author

omertuc commented Apr 25, 2024

/retest

1 similar comment
@omertuc
Member Author

omertuc commented Apr 25, 2024

/retest

@mresvanis
Member

/test baremetalds-sno-recert-cluster-rename

@mresvanis
Member

/lgtm

@omertuc
Member Author

omertuc commented Apr 26, 2024

/retest

2 similar comments
@mresvanis
Member

/retest

@mresvanis
Member

/retest

@omertuc
Member Author

omertuc commented Apr 29, 2024

I'm starting to think it's actually broken

@omertuc
Member Author

omertuc commented Apr 29, 2024

/retest

@omertuc
Member Author

omertuc commented Apr 29, 2024

Checking clean CI in #134

@omertuc
Member Author

omertuc commented May 2, 2024

/retest

2 similar comments
@omertuc
Member Author

omertuc commented May 3, 2024

/retest

@omertuc
Member Author

omertuc commented May 15, 2024

/retest

@mresvanis
Member

/test baremetalds-sno-recert-cluster-rename

@mresvanis
Member

/lgtm

@omertuc
Member Author

omertuc commented May 21, 2024

/hold not sure if works

@omertuc
Member Author

omertuc commented May 21, 2024

/retest

@omertuc
Member Author

omertuc commented May 21, 2024

/test e2e-aws-ovn-single-node-recert-serial

1 similar comment
@eranco74
Collaborator

eranco74 commented Jun 9, 2024

/test e2e-aws-ovn-single-node-recert-serial

@omertuc
Member Author

omertuc commented Jun 11, 2024

/unhold

@mresvanis
Member

/test e2e-aws-ovn-single-node-recert-serial

@mresvanis
Member

/override ci/prow/e2e-aws-ovn-single-node-recert-parallel


openshift-ci bot commented Jun 12, 2024

@mresvanis: Overrode contexts on behalf of mresvanis: ci/prow/e2e-aws-ovn-single-node-recert-parallel

In response to this:

/override ci/prow/e2e-aws-ovn-single-node-recert-parallel

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 3b58208 into rh-ecosystem-edge:main Jun 12, 2024
12 checks passed