
One of the Testnet clients gets often marked as down in Grafana, despite performing work #3387

Open
michalinacienciala opened this issue Oct 27, 2022 · 1 comment

Comments


michalinacienciala commented Oct 27, 2022

Strange behavior was noticed on keep-client-3-0 (0x3FF855895EF4aC833c32Ab6A0d6C7fBfA137E26E) - its Grafana uptime graph has looked quite fragmented since 25 Oct:
[screenshot: Grafana uptime graph for keep-client-3-0]
Looking into the logs, I noticed that the client was restarted near the time of the first reported downtime (2022-10-25 ~02:50 CEST). The logs also show that the client was active during the periods Grafana reports as downtime. For example, Grafana shows the client as down on 2022-10-25 between 14:40 and 16:10, yet during that time the client was doing work (it was involved in the tBTC DKG started at 2022-10-25 15:15:01.182 CEST).

Uptime data for the client (taken from Grafana):
downloaded-logs-20220923-143058.csv

Further investigation of the issue is needed.

@michalinacienciala changed the title from "One of the Testnet clients gets often mark as down in Grafana, despite performing work" to "One of the Testnet clients gets often marked as down in Grafana, despite performing work" on Oct 27, 2022
@michalinacienciala

There is a problem with the discovery of this client:

{"address":"10.102.1.79", "level":"warn", "msg":"network address is not reachable", "networkPort":3307, "peer":"0x3FF855895EF4aC833c32Ab6A0d6C7fBfA137E26E", "ts":"2022-10-28T10:40:31.702658546Z"}

The log shows port 3307, while it should be 3919.
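As a quick sanity check (illustrative only, not the client's actual reachability code), the two port variants can be probed with a short Go program; 10.102.1.79 is a private address, so this only makes sense from inside the cluster network:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Port from the warning above (3307) vs. the port the client
	// actually listens on (3919).
	targets := []string{"10.102.1.79:3307", "10.102.1.79:3919"}
	for _, target := range targets {
		conn, err := net.DialTimeout("tcp", target, 5*time.Second)
		if err != nil {
			fmt.Printf("%s: not reachable (%v)\n", target, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s: reachable\n", target)
	}
}
```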

Querying the bootstrap node's diagnostics endpoint (curl bst-a01.test.keep.boar.network:9601/diagnostics) returns:

{
    "chain_address": "0x3FF855895EF4aC833c32Ab6A0d6C7fBfA137E26E",
    "multiaddrs":
    [
        "/ip4/127.0.0.1/tcp/3919",
        "/ip4/104.154.211.185/tcp/3307",
        "/ip4/10.102.1.79/tcp/3919"
    ],
    "network_id": "16Uiu2HAm8KJX32kr3eYUhDuzwTucSfAfspnjnXNf9veVhB12t6Vf"
}

This may be an issue with the diagnostics output from the bootstrap node.
To handle this correctly in the discovery, we need to implement keep-network/prometheus-sd#2.
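For illustration only (this is not necessarily what keep-network/prometheus-sd#2 will do), a discovery component could filter the advertised multiaddrs and skip loopback and private entries before building scrape targets. Below is a minimal sketch in Go; the struct fields mirror the JSON excerpt above, while the endpoint URL and the assumption that the response decodes into a single such object are simplifications:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"net/http"
	"strings"
)

// peerDiagnostics mirrors the excerpt above; the real /diagnostics response
// may wrap entries like this in a larger structure.
type peerDiagnostics struct {
	ChainAddress string   `json:"chain_address"`
	Multiaddrs   []string `json:"multiaddrs"`
	NetworkID    string   `json:"network_id"`
}

// dialableAddr returns the first /ip4/<host>/tcp/<port> entry whose IP is
// neither loopback nor private, i.e. an address an external scraper could dial.
func dialableAddr(multiaddrs []string) (string, bool) {
	for _, a := range multiaddrs {
		parts := strings.Split(a, "/") // e.g. ["", "ip4", "104.154.211.185", "tcp", "3307"]
		if len(parts) != 5 || parts[1] != "ip4" || parts[3] != "tcp" {
			continue
		}
		ip := net.ParseIP(parts[2])
		if ip == nil || ip.IsLoopback() || ip.IsPrivate() {
			continue
		}
		return net.JoinHostPort(parts[2], parts[4]), true
	}
	return "", false
}

func main() {
	resp, err := http.Get("http://bst-a01.test.keep.boar.network:9601/diagnostics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var peer peerDiagnostics
	if err := json.NewDecoder(resp.Body).Decode(&peer); err != nil {
		panic(err)
	}

	if addr, ok := dialableAddr(peer.Multiaddrs); ok {
		fmt.Printf("%s -> %s\n", peer.ChainAddress, addr)
	} else {
		fmt.Printf("%s -> no publicly dialable address advertised\n", peer.ChainAddress)
	}
}
```

Note that with the diagnostics output above, such filtering would still select 104.154.211.185:3307 - the only public entry carries the wrong port - so the diagnostics output itself also needs to be fixed.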

@pdyraga pdyraga modified the milestones: v2.0.0-m3, v2.0.0-m4 Nov 21, 2022
@pdyraga pdyraga removed this from the v2.0.0-m4 milestone Dec 21, 2022