Network Loss! - gruntjs-02.ops.stage.jquery.net #60

Closed
Krinkle opened this issue Aug 26, 2024 · 1 comment
Krinkle commented Aug 26, 2024

Continuing here from #54

I tried logging into the droplet to check its Puppet log and nginx error log, but it's not responding to SSH.

Looks like something happened on 22 Aug (two days before your first patch). Could it be a coincidence?

DigitalOcean control panel - gruntjs-02.ops.stage.jquery.net

I've rebooted the instance and the site is now back up.

Reviewing /var/log/syslog.1...

On Wed Aug 21 everything seemed fine (the only thing of note is a periodic check from certbot, which decides not to renew the cert since it was recently renewed and is valid until November).

On Thu Aug 22, around 09:20 UTC, we start to see an error from Puppet: it is unable to resolve puppet-04.ops from DNS and thus can't fetch or apply new server updates. This was presumably some kind of network failure at DigitalOcean or Cloudflare DNS. That should be harmless by itself, since the server doesn't depend on Puppet to keep running as-is.

Aug 22 07:54:16 gruntjs-02 puppet-agent[1050528]: Applied catalog in 2.46 seconds
…
Aug 22 08:24:15 gruntjs-02 puppet-agent[1050935]: Applied catalog in 2.34 seconds
…
Aug 22 08:54:16 gruntjs-02 puppet-agent[1051455]: Applied catalog in 2.58 seconds
…
Aug 22 09:24:18 gruntjs-02 puppet-agent[1051605]: Connection to https://puppet-04.ops.jquery.net:8140/puppet/v3 failed, trying next route: Request to https://puppet-04.ops.jquery.net:8140/puppet/v3 failed after 12.294 seconds: Failed to open TCP connection to puppet-04.ops.jquery.net:8140 (getaddrinfo: Temporary failure in name resolution)
…
Aug 22 09:54:18 gruntjs-02 puppet-agent[1051678]: Connection to https://puppet-04.ops.jquery.net:8140/puppet/v3 failed, trying next route: Request to https://puppet-04.ops.jquery.net:8140/puppet/v3 fail
…
Aug 22 10:24:18 gruntjs-02 puppet-agent[1051785]: Connection to https://puppet-04.ops.jquery.net:8140/puppet/v3 failed, trying next route: Request to https://puppet-04.ops.jquery.net:8140/puppet/v3 fail
…
Aug 22 10:54:18 gruntjs-02 puppet-agent[1051860]: Connection to https://puppet-04.ops.jquery.net:8140/puppet/v3 failed, trying next route: Request to https://puppet-04.ops.jquery.net:8140/puppet/v3 failed after 12.296 seconds: Failed to open TCP connection to puppet-04.ops.jquery.net:8140 (getaddrinfo: Temporary failure in name resolution)

However, somewhere between 09:54 and 10:24, the droplet somehow became disconnected from the Internet. Given that there were two similar Puppet fetches before that, I don't think it's due to Puppet. Perhaps whatever was causing the networking problem affected puppet-04 at DigitalOcean first, and only later started affecting gruntjs-02.
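For future incidents, the transition from healthy Puppet runs to failing ones can be pulled out of a syslog file mechanically. The following is a minimal sketch, not something used on the server: `parse_line` and `puppet_transition` are hypothetical helper names, the year is an assumption (classic syslog timestamps omit it), and the two sample lines are copied from the excerpt above.

```python
import re
from datetime import datetime

# Matches the classic syslog prefix: "Aug 22 09:24:18 host prog[pid]: message".
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<prog>[^\[:]+)(?:\[(?P<pid>\d+)\])?: (?P<msg>.*)$"
)


def parse_line(line, year=2024):
    """Return (timestamp, program, message), or None if the line does not match.

    The year is an assumption supplied by the caller; syslog does not record it.
    """
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, m["prog"], m["msg"]


def puppet_transition(lines, year=2024):
    """Return (last successful run, first name-resolution failure) for puppet-agent."""
    last_ok = None
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None or parsed[1] != "puppet-agent":
            continue
        ts, _, msg = parsed
        if msg.startswith("Applied catalog"):
            last_ok = ts
        elif "Temporary failure in name resolution" in msg:
            return last_ok, ts
    return last_ok, None


# Two lines copied from the syslog excerpt above.
SAMPLE = [
    "Aug 22 08:54:16 gruntjs-02 puppet-agent[1051455]: Applied catalog in 2.58 seconds",
    "Aug 22 09:24:18 gruntjs-02 puppet-agent[1051605]: Connection to "
    "https://puppet-04.ops.jquery.net:8140/puppet/v3 failed, trying next route: "
    "Request to https://puppet-04.ops.jquery.net:8140/puppet/v3 failed after "
    "12.294 seconds: Failed to open TCP connection to puppet-04.ops.jquery.net:8140 "
    "(getaddrinfo: Temporary failure in name resolution)",
]

last_ok, first_fail = puppet_transition(SAMPLE)
print("last successful run:", last_ok)
print("first DNS failure:  ", first_fail)
```

On a real host the same function could be fed the lines of `/var/log/syslog.1`. Note that it only matches lines that contain the full "Temporary failure in name resolution" text, so truncated entries would be skipped.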

Then later we see it also failing to reach the Let's Encrypt servers, further supporting that the droplet is just disconnected from the Internet:

Aug 22 17:01:21 gruntjs-02 systemd[1]: Starting Certbot...
…
Aug 22 17:01:33 gruntjs-02 certbot[1053046]: urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f6e36d04f70>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution
Aug 22 17:01:33 gruntjs-02 certbot[1053046]: urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='e6.o.lencr.org', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6e36d04f70>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))

…

Later still, a daily background check from gruntjs.com's Node server runs to update some data, which naturally fails as well. Going by the timestamps above, the droplet has been offline for roughly 16 to 17 hours at this point.

Aug 23 02:12:00 gruntjs-02 node[621757]: Running plugin updater...
Aug 23 02:12:12 gruntjs-02 node[621757]: Error: getaddrinfo EAI_AGAIN registry.npmjs.com
Aug 23 02:12:12 gruntjs-02 node[621757]:     at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:66:26) {
Aug 23 02:12:12 gruntjs-02 node[621757]:   syscall: 'getaddrinfo',
Aug 23 02:12:12 gruntjs-02 node[621757]:   hostname: 'registry.npmjs.com'

The above repeats for Fri Aug 23, Sat Aug 24, and Sun Aug 25, throughout which there does not appear to be a single successful attempt to establish any kind of DNS or TCP connection. Then today, Mon 26 Aug, these are the last entries before I shut down the instance from the DigitalOcean panel for a reboot:

Aug 26 14:57:06 gruntjs-02 puppet-agent[1071282]: Failed to open TCP connection to puppet-04.ops.jquery.net:8140 (getaddrinfo: Temporary failure in name resolution)
…
Aug 26 14:58:19 gruntjs-02 puppet-agent[1071282]: Failed to open TCP connection to puppet-04.ops.jquery.net:8140 (getaddrinfo: Temporary failure in name resolution)
…
Aug 26 15:07:56 gruntjs-02 systemd[1]: Stopping chrony, an NTP client/server...
Aug 26 15:07:56 gruntjs-02 systemd[1]: Stopping gruntjs.com website...
Aug 26 15:07:56 gruntjs-02 systemd[1]: Stopping A high performance web server and a reverse proxy server...
Aug 26 15:07:56 gruntjs-02 systemd[1]: Stopping node-notifier server...
Aug 26 15:07:56 gruntjs-02 puppet-agent[30864]: Caught TERM; exiting
Aug 26 15:07:56 gruntjs-02 systemd[1]: Stopping Puppet agent...
…
Aug 26 15:08:58 gruntjs-02 systemd-udevd[229]: Network interface NamePolicy= disabled on kernel command line, ignoring.
Aug 26 15:08:58 gruntjs-02 systemd[1]: Mounted FUSE Control File System.
Aug 26 15:08:58 gruntjs-02 systemd[1]: Found device /dev/ttyS0.
Aug 26 15:08:58 gruntjs-02 systemd[1]: Found device /dev/disk/by-uuid/6E18-4B78.
Aug 26 15:08:58 gruntjs-02 systemd[1]: Mounting /boot/efi...

Networking worked fine after reboot. I'm filing this to help with search/discovery in the future, and to make it easier to share with DigitalOcean support.

Krinkle commented Sep 10, 2024

No explanation forthcoming from DigitalOcean Support. I'll close this for now. If it happens again, we can find this in the archives for context.

Krinkle closed this as completed Sep 10, 2024